Statistics for justice and fairness
犖犖.犖犖.犖犖迦犖犖犢 犖犖園犖犖巌犖о牽犖о鹸犖犖犢
犖犖項犖犖迦犖о権犖犖迦牽犖犖ム険犖犖犖項犖
Ph.D. and M.Sc. in Business Analytics and Data Science
犖犖迦犖迦牽犖∇犖犖犖萎犖迦肩犖迦犖迦硯犖巌犖迦硯犖巌犖∇顕犖犖迦牽犖犖犖萎犖園犖犖園権犢犖ム鍵犖犖迦牽犖犖犖巌見犖迦牽犖犖о顕犖÷犖犖朽犖∇
犖犖犖萎肩犖犖巌犖巌犖犖萎権犖伍犖犢 犖犖犖迦犖園犖犖園犖犖巌犖犖園犖犖犖犖巌見犖迦牽犖犖迦肩犖犖犢
Statistics and big data for justice and fairness
Roles of statistics in fairness and justice
-Facilitate fairness
-Detect anomaly and fraud
-Prevent crime and anomaly
-Regulatory Impact Assessment
Test Fairness
Differential Item Functioning
Anomaly Detection
There is no crime without a trace!
-Large deviation from normal or average man or cluster.
-Large deviation from past behavior.
-Inconsistency with themselves and surroundings.
-Repeated anomaly pattern.
-Caution on statistical detection of cheating and anomalies
Outlier Analysis
Large deviation from normal or average man or cluster.
[Figure: distribution of Loss; axes: Percent, Loss]
[Figure: scatter plot; axes: Severity 58, Frequency 58]
Large deviation from past behavior.
$\text{Loss}_{58} = f(\text{Frequency}_{57}, \text{Severity}_{57}, \text{ICD-10}_{57}, \text{ICD-9}_{57}, \text{ICD-10}_{58}, \text{ICD-9}_{58}, \text{age}, \text{gender})$
[Figure: Loss 58 versus Predictors, with a region labeled "Under Predict (Fraud or abuse)"]
[Figure: TOEFL time 2 versus TOEFL time 1, with a region labeled "Under Predict (Fraud or abuse)"]
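The idea of flagging cases that a model underpredicts can be sketched as follows. This is a minimal illustration and not the author's method. It assumes a simple one-predictor least-squares line (score at time 2 on score at time 1), and the function names, the threshold of 2 residual standard deviations, and the scores are all hypothetical.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def flag_underpredicted(x, y, threshold=2.0):
    """Indices whose observed y exceeds the fitted value by more than
    `threshold` standard deviations of the residuals (assumed rule)."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    spread = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if r > threshold * spread]

# Hypothetical scores: everyone gains 2 points except examinee 2, who gains 40.
time1 = [60, 65, 70, 75, 80, 85, 90, 95, 100, 105]
time2 = [62, 67, 110, 77, 82, 87, 92, 97, 102, 107]
flagged = flag_underpredicted(time1, time2)
print(flagged)  # -> [2]
```

A flagged index is only a red flag for follow-up. Note that a single extreme case also pulls the fitted line toward itself, which is one reason robust estimators are preferable in practice.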
Inconsistency with themselves and surroundings.
-A low-ability test taker answers difficult items correctly.
-K-index for copying! Eight dimensions
-Scoring test with contaminated response vector
-Influence function + Robust estimators
[Figure: Pseudovalue Distribution for an Optima Examinee; x-axis: Proficiency Estimate (-5 to 3); y-axis: Frequency (0 to 20); series: From Incorrect Responses, From Correct Responses]
Repeated anomaly pattern.
[Figure: Probability versus Predictor]
Y = 0: normal claim
Y = 1: abuse claim
犖犖迦 犖犖犖犖.
$Y = f(\text{ICD10}, \text{ICD9}, \text{TMT}, \text{gender}, \text{age}, \text{Severity}_{t-1}, \text{Frequency}_{t-1}, \text{Severity}_{t}, \text{Frequency}_{t}, \ldots)$
Caution on statistical detection of cheating
[Figure: Performance versus Predictor, with Cutoff 1 and Cutoff 2 marked]
-Positive Predictive Value: PPV
64.76 % 99.30%
-Statistical evidence as a red flag or warning
-Physical evidence is always needed.
-Early detection, protection, and prevention.
-Bayesian flip is needed.
P(Cheating=Yes|Detection=Yes)
P(Detection=Yes|Cheating=Yes)
P(Cheating=No|Detection=No)
P(Detection=No|Cheating=No)
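$$P(\text{Cheating}=\text{Yes} \mid \text{Detection}=\text{Yes}) = \frac{P(\text{Detection}=\text{Yes} \mid \text{Cheating}=\text{Yes}) \, P(\text{Cheating}=\text{Yes})}{P(\text{Detection}=\text{Yes})}$$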
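The Bayesian flip can be checked numerically. The sketch below is a minimal illustration, assuming the detector is summarized by its sensitivity, P(Detection=Yes|Cheating=Yes), and its specificity, P(Detection=No|Cheating=No). The function name and the input values are hypothetical and do not come from the slides.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(Cheating=Yes | Detection=Yes) by Bayes' rule.

    sensitivity = P(Detection=Yes | Cheating=Yes)
    specificity = P(Detection=No  | Cheating=No)
    prevalence  = P(Cheating=Yes)
    """
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    # P(Detection=Yes) expanded with the law of total probability.
    p_detection = true_positive + false_positive
    return true_positive / p_detection

# Hypothetical detector: 99% sensitive, 99% specific, 1% of examinees cheat.
ppv = positive_predictive_value(0.99, 0.99, 0.01)
print(round(ppv, 2))  # -> 0.5
```

With these assumed inputs, only about half of the flagged examinees actually cheated, even though the detector is right 99% of the time in each group. This is why a statistical flag is a warning and not proof.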
犖犖迦牽犖犖犢犖迦犢犖犖犖犖迦献犖犖犖犖迦権犖犖迦牽犖犖迦牽犖萎犖犖巌犖犖巌犖犖犖犖巌犖犖
犖犖犖÷犖犖犖÷犖犖犖萎犖園犖犖朽硯犖巌犢犖犖∇犖迦牽犢犖犢犢犖犖犖犖巌 Local
Outlier Factor(LOF)
犖犖園犖犖о鹸犖犖犢 犖о鹸犖犖園献犖∇顕犖о険犖犖幡犢
犖犖犖萎犖園犖犖朽硯犖巌犖犖劇犖犖犖迦
犖犖朽犖÷顕 : http://www.checkraka.com/saving/advertorial/10052/
犖犖迦牽犖犖犖犢犖犖巌犢犖犖犖犖萎犖園犖犖朽硯犖巌
犖犖園硯犖犖∇犖迦犖犖迦牽犖犖犖犢犖犖巌犢犢犖犖犖萎犖園犖犖朽硯犖巌
 犖犖劇犖犖犖犖犖犖伍犖ム犖犖伍犖犖犖萎犖園犖犖朽硯犖巌犖犖朽犖犢犖迦権犖犖犖園犖犢犖犖朽権犖(犖÷元犖犖о顕犖÷犖犖朽犖∇犖犖朽犖犖萎犖犖犢犖犖巌犖÷顕犖犖犖朽犖犖伍犖犖迦犖犖迦牽犢犖迦犖犖巌犖犖犖犖犖
犖犢犖犢犢犖犖犢犖÷顕犖ム犖犖伍)
 犖犖劇犖犖犖犖萎犖園犖犖朽硯犖巌犖犖迦権犖犖朽犖朽犖÷元犖犖迦牽犖犢犖迦権犖犖園犖犖(犢犖犢犖犖園犢犖犖巌犖犖萎賢犖迦犢犖犢犢犖犖о犢)
 犖犖劇犖犖犖犖萎犖園犖犖朽硯犖巌犢犖犖犖犖犖萎犖迦犖朽犖÷元犢犖犖朽犖∇肩犖項犖犖犖劇賢犢犖犖朽犖∇犖犖犖巌見犖ム顕犖∇犖犖÷犖犖犖÷(犢犖犖劇犖犖ム犖犖о顕犖÷肩犢犢犖)
 犖犖迦牽犢犖犢犖犖犢犢犖犖巌犖犢犖犢犖犖犖犖犖園犖犖迦犖÷犖犖萎犖犢犢犖犖巌犖犖劇犢犢犖犖∇犖о犖迦犖犖巌検
 犖犖劇犖犖犖犖÷犖犖犖÷犢犖犖朽犖∇犢犖犖∇犖犢犖犢犖迦犖犢犖犢犢犖ム犖о犖犖巌犖÷犖犖巌検犢犖犖朽犖(犢犖犖劇犖犢犖犢犢犖÷犖÷元犖犖迦牽犖犖犖о犖犖犖)
 犖犖項犢犖犖巌犖犖迦犖犖犖÷犖犖犖÷犢犖犖∇犖÷犖犢犖犢犖犢犖迦権犖犖劇(犖犖ム犖犖∇犖犖÷犖犖犖÷犖犖÷犖犖迦権犖伍犖犢犖犖)
 犢犖犖ム元犢犖∇犖÷厳犖犖犖犖÷犖犖犖÷犖犖犖犢犖犢犖迦犖犖犖犖犖劇賢犖犖項犖犖園犖犖ム犖犖萎犖∇犢犢
 犢犖犢犖犖犖÷犖犖犖÷犢犖犢犢犖犖ム険犖犖犖犖園犖∇犢犢犖犖迦牽犖犖項犢犖犖巌犖犖迦犖犢犖迦犖迦牽
犖犖朽犖÷顕 : http://www.acamstoday.org/what-is-real-money-laundering-risk-in-life-insurance/
Global versus Local Outlier
Mahalanobis Distance
K Nearest Neighbors
Cluster Analysis
Local Outlier Factor (LOF)
犖犖朽犖÷顕 : http://www.slideshare.net/Med_KU/20130318-f-rac-24695067
犢犖犢犢犖о鹸犖犖朽犖迦牽犖犖朽犢犖犢犢犢犖犖迦牽犖犖犖о犖犖犖犖犢犖迦犖巌犖犖犖犖巌犢犖о権犖о鹸犖犖朽犖迦牽
犖犖犖о犖犖о顕犖÷見犢犖迦犢犢犢犖犖犖犖犖園絹犖÷元犖犖犖犖犖伍犖犖朽犖犢犢犖
犖犢犖犖÷弦犖ム犖朽犖÷元犖犢犖迦犖巌犖犖犖犖巌犖萎検犖朽犖萎犢犢犖犖朽犖犖項 犖犖謹犖
犖犖萎犖÷犖÷元犢犖犖犖犢犢犢犖犖園犖犖朽犖犖迦権犖犖園硯
犖犢犖犖÷弦犖ム犖朽犖÷元犖犢犖迦犖犖犖巌犖萎検犖朽犖萎犢犢犖犖犖萎検犖迦 1
LOF
-LOF = (local density of the k neighbors) / (local density of the point itself)
-The higher the LOF, the more extreme the local outlier.
-Determine sigma (the radius, or reachable distance, around a point) so that the k neighbors can be counted.
-Local density of a point = (number of points within the reachable distance) / (sum of the distances between the point and all k neighbors)
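The density ratio described above can be sketched in code. This is a minimal illustration that follows the standard LOF formulation (k-distance, reachability distance, local reachability density), which may differ in detail from the software used for the slides. The function name and the sample points are hypothetical, ties at the k-th distance are not expanded, and duplicate points are assumed absent.

```python
import math

def local_outlier_factor(points, k):
    """Return the LOF score of every point (standard formulation)."""
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: number of neighbors divided by the sum of
    # reachability distances, where reach(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Four points on a unit square plus one isolated point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = local_outlier_factor(sample, k=2)
print([round(s, 2) for s in scores])
```

Points inside the square score about 1 (as dense as their neighbors), while the isolated point scores well above 1.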
犖о鹸犖犖朽原聾迦攻釣姑┯巌姑┯犖迦牽犖о鹸犖犖園権
犖犖ム犖迦牽犖о鹸犢犖犖犖迦鍵犖犢
N 145,842
Minimum 1.0529
Lower Quartile 3.8028
Mean 6.6356
Median 5.6134
Upper Quartile 8.3377
Maximum 50.6028
Skewness 1.8527
Std Dev 3.9917
Std Error 0.0105
Median+2.5(Q3-Q1) 16.9508
犖犖項犖犖犖橿犖о犖犖橿犖伍犖犖園 犖犖伍犖犖園 犖犖項犢犖犖犖犖犖犖 犖犖項犢犖犖犖犖巌犖犖犖犖 %犖犖項犢犖犖犖犖朽犖犖巌犖犖犖犖
Median+2.5(Q3-Q1) 16.9508 142,170 3,672 3%
[Figure: kernel density of Cust_txn$lofavg (N = 145842, Bandwidth = 0.2824); x-axis 0 to 50; y-axis: Density; cutoff marked at 16.95]
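The cutoff rule Median+2.5(Q3-Q1) can be reproduced from the summary statistics reported above. The sketch below is only an arithmetic check. The function names are hypothetical, and the counts 142,170 and 3,672 are taken from the table as reported.

```python
def iqr_cutoff(median, q1, q3, multiplier=2.5):
    """Cutoff = median + multiplier * interquartile range."""
    return median + multiplier * (q3 - q1)

def flag_above(scores, cutoff):
    """Indices of scores strictly above the cutoff."""
    return [i for i, s in enumerate(scores) if s > cutoff]

# Summary statistics of the average LOF reported in the table.
cutoff = iqr_cutoff(median=5.6134, q1=3.8028, q3=8.3377)
print(round(cutoff, 2))  # -> 16.95

# Share of customers above the cutoff, from the reported counts.
flagged_share = 3672 / (142170 + 3672)
print(round(flagged_share, 3))  # -> 0.025
```

The small difference from the reported 16.9508 comes from rounding in the published quartiles, and the flagged share of about 2.5% is what the table rounds to 3%.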
犖犖ム犖迦牽犖о鹸犢犖犖犖迦鍵犖犢
Max cluster: 3

Cluster  Frequency  RMS Standard Deviation  Maximum Distance from Seed to Observation  Radius Exceeded  Nearest Cluster  Distance Between Cluster Centroids
1        3,406      0.6395                  13.9509                                    > Radius         3                5.5332
2        97         2.8941                  28.1499                                    > Radius         1                7.8612
3        169        1.4675                  27.3884                                    > Radius         1                5.5332

Pseudo F Statistic: 477.1300
Observed Over-All R-Squared: 0.2064
Approx. Expected Over-All R-Squared: 0.1047
Cubic Clustering Criterion: 108.2590
犖犖ム犖迦牽犖о鹸犢犖犖犖迦鍵犖犢
Cluster group Normal 1 2 3
犖犖園犖犢犖о犢犖犖巌犖犖朽犖犢犖迦権犖犖犖巌犢犖÷厳犢犖犢犖犖朽権犖犖 犖園犢犖犖巌犖犖朽犖犢犖犖犖犢犖迦権犖犖項犖犖伍 0.0002 -0.0207 0.0219 0.2042
犖犖園犖犢犖о犢犖犖巌犖犖朽犖犢犖迦権犖犖犖巌犢犖÷厳犢犖犢犖犖朽権犖犖 犖園犢犖犖巌犖犖朽犖犢犖犖犖犢犖迦権犖犢犖迦肩犖伍 0.0003 -0.0079 -0.0291 -0.0677
犖犖迦犖о犖犖犖巌犖犢犖犖犖犖迦犖犖迦牽犖犢犖迦権犖犢犖迦犖犖園硯犢犖犖 0.0004 -0.1296 4.1856 -0.0965
犖犖迦犖о犖犖犖萎犖犖犖犖迦牽犖犢犖迦権犢犖犖犖犖朽1 -0.0001 -0.2101 0.7697 3.8544
犖犖迦犖о犖犖犖萎犖犖犖犖迦牽犖犢犖迦権犢犖犖犖犖朽2 0.0002 -0.0965 3.2309 -0.0368
犖犖迦犖о犖犖犖迦犖迦牽犖犖朽犖犖迦犖迦牽犖犢犖迦権 -0.0003 -0.1955 1.4258 3.3523
犖犖迦犖о犖犖項犢犖犖犖犖迦牽犖犢犖迦権犢犖犖朽犖 0.0001 -0.0788 1.7666 0.4684
犖犖迦犖о犖犢犖犖犖犖迦犖犖朽犖ム弦犖犖犢犖迦犖劇犖犖犖犖÷犖犖犖÷ 0.0000 -0.0931 3.4216 -0.0931
犖犖迦犖о犖犖犖園犖犖犖朽犖犢犖迦権犢犖犖朽犖∇犖犖萎 犖園犖犖朽犖÷元犖犖迦牽犖犢犖迦権 -0.0003 -0.0352 0.6315 0.5852
犖犖迦犖о犖犖犖÷犖犖犖÷犖犖朽犖犢犖迦権犢犖犖犖犖迦権犖犖 -0.0001 -0.0456 1.3903 0.1983
犖犖迦犖о犖犖犖÷犖犖犖÷犖犖朽犖犢犖迦権犢犖犖犖犖迦権犖犖犖謹犖犖犖 -0.0006 0.0012 0.1840 0.3477
犖犖迦犖о犖犖犖÷犖犖犖÷犖犖朽犖犢犖迦権犢犖犖犖犖迦権犖犖迦検犢犖犖劇賢犖 0.0000 -0.0195 0.3312 0.1681
犖犖迦犖о犖犖犖÷犖犖犖÷犖犖朽犖犢犖迦権犢犖犖犖犖迦権犢犖犖劇賢犖 -0.0001 -0.0216 0.5261 0.1808
犢犖犖朽犖∇犖犖萎 犖園犖 犖園犖犖犖 犖犖о検犖犖伍犖犖犖÷犖犖犖÷犖犖朽犖÷元犖犖迦牽犖犢犖迦権 -0.0003 -0.0350 1.2317 0.2602
犖犖伍犖犖犖萎 犖園犖犖朽硯犖巌犖犖о検犖犖伍犖犖犖÷犖犖犖÷犖犖朽犖÷元犖犖迦牽犖犢犖迦権 -0.0008 -0.0195 1.6847 0.0972
犖犖迦犖о犖犖犖÷犖犖犖÷犖犖朽犖÷元犖犖迦牽犖犢犖迦権 0 0 2 1
犖犖迦犖о犖ム弦犖犖犢犖 142,170 3,406 97 169
 Normal
犖犖ム幻犢犖÷献犖項犖犢犖迦犖朽犖÷元犖犖迦犖о犖犖犖÷犖犖犖÷犢犖犖ム元犢犖∇犖犖朽権犖 1 犖犖犖÷犖犖犖÷ 犖ム弦犖犖犢犖迦検犖朽犖迦牽犖犢犖迦権犢犖犖朽犖∇犖犖萎犖園犖犖伍犖犖犖巌 (犖犖迦権犢犖犖劇賢犖,犖犖迦権
犖犖迦検犢犖犖劇賢犖,犖犖迦権犖犖犢犖犖劇賢犖犢犖ム鍵犖犖迦権犖犖) 犢犖ム鍵犢犖犢犖犖犖迦牽犖犢犖迦権犢犖犖犢犖犢犖÷犖迦犖о犖犢犖о権犖犢犖犖犖犖迦犢犖犖朽権犖
 Cluster 1
犖÷元犖犖項犢犖犖犢犖犖朽権犖о犖園犖犖ム幻犢犖 Normal
 Cluster 2
犖犖ム幻犢犖÷献犖項犖犢犖迦犖朽犖÷元犖犖迦犖о犖犖犖÷犖犖犖÷犖犖犖萎犖園犖犖朽硯犖巌犢犖犖ム元犢犖∇検犖迦犖犖о犖 2 犖犖犖÷犖犖犖÷ 犖犖謹犖犖ム弦犖犖犢犖迦犖犢犖犖犖迦牽犖犢犖迦権犢犖犖朽犖∇犖犖犖犖迦権
犢犖犖劇賢犖犢犖ム鍵犖犖迦権犖犖朽犖犖∇犖犢犖犖犖迦牽犖犢犖迦権犢犖犖犢犖犢犖÷犖迦犖о犢犖犖犖ム顕犖∇ 犖犢犖犖犖犖迦犖犖迦牽犖犢犖迦権犖犖園犖犖園硯犢犖犖犖犖迦権犖犖犖萎犖園
 Cluster 3
犖ム弦犖犖犢犖迦犖朽犖÷元犖犖迦犖о犖犖犖÷犖犖犖÷犖犖犖萎犖園犖犖朽硯犖巌犢犖犖ム元犢犖∇検犖迦犖犖о犖 1 犖犖犖÷犖犖犖÷犢犖犢犖犖犖ム幻犢犖÷犖朽犢犖犢犖犖犖迦牽犖犢犖迦権犢犖犖朽犖∇犖犖犖犖迦権
犖犖迦検犢犖犖劇賢犖犢犖ム鍵犖犖迦権犖犖犢犖犖劇賢犖犢犖ム鍵犢犖犢犖犖犖迦牽犖犢犖迦権犢犖犖犢犖犖巌犖犖迦犖о犢犖犖犖ム顕犖∇ 犖犢犖犖犖犖迦犖犖迦牽犖犖迦牽犖萎犖犖巌 犢犖犢犖 犢犖犖巌犖犖犖犖犖劇賢犖犖園犖
犢犖犖犖犖巌
犖犖犖伍犖犖÷犖迦牽犖о鹸犢犖犖犖迦鍵犖犢
犢犖犖迦鍵犖ム原犖犖犖迦牽犢犖迦犖犖犢犢犢犖ム権犖朽犖ム犖犖犢犖犢犢犖犢犖犖園犖犢犖迦
犢犖犖犖犖迦犖犢犖迦犖犢犖迦硯
犖÷顕犖犖犖 77 犖犖犖犖犖園犖犖犖犖÷犖項
