February 2010
Assessing the predictive capacity measure of a Fraud
Management System
By Marco Scattareggia
Summary
Looking for a unique FMS predictive performance index, one of the best
choices is the area under the ROC curve (AUC). ROC curves give more
attractive metrics than Precision/Recall graphs because they are not
sensitive to the variable skewness of the Fraud/Not Fraud class
distributions. In fact, the fraud class distribution skewness changes
both from country to country and over time for the same
telecommunication operator in the same country.
Was the information contained in this article useful? Your feedback is
appreciated!
Marco Scattareggia holds a degree in Electronic Engineering, works in Rome
(Italy) for HP EMEA, and runs the CoE (Center of Excellence) that develops
fraud management solutions for telecommunication operators.
Acknowledgement
I would like to acknowledge the efforts of Luigi Catzola, associate
researcher at SEMEION, for his support and help in reviewing this Knowledge
Brief. Very special thanks go to Flavio Riciniello, Executive Officer of
Eidos Consult and former Fraud Manager of Telecom Italia, for his
authoritative approval and for his suggestions about the skewness and
kurtosis factors.
Introduction
When people use communications services without paying for them, they steal
from the Telecom Operators or Service Providers and commit fraud. There is
no difference between stealing the property of another and stealing
services: in both cases, something of value is taken without compensation.
According to the Communications Fraud Control Association (CFCA), fraud
losses for Network & Service Providers (NSP) are in the range of US$35 to
$40 billion worldwide. Operators are aware of the problem and often use an
automated Fraud Management System (FMS) to fight fraudsters, but they still
do not grasp the true complexity of fraud: many factors affect the accuracy
of fraud detection, and the detection threshold can be set very high, very
low, or anywhere in between, depending on the FMS settings and data sources.
Every FMS collects demographic information about subscribers from the
Customer Care or Billing department and monitors subscribers' events
(calls, messages, etc.) over the network. The conceptual FMS architecture
in Figure 1 shows the fraud intelligence capabilities: scoring new
subscribers, clustering subscribers into segments in order to induce
patterns, and applying rules to create an on-line filter for the detection
of anomalous events. After detecting anomalous events, the FMS generates
alarms and, when sufficient indications of fraud exist, flags fraud cases
to the analysts, who are responsible for deciding what to do with them.
From an overall perspective, the function of the FMS is the classification
of network events to identify those that could be fraudulent. The case
scoring function isolates only those cases that are likely enough to be
True Positives, avoiding overloading the analysts and the Fraud Manager
with False Positives.
Figure 1: FMS Architecture
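
To make the event-filtering flow concrete, here is a minimal, hypothetical
sketch in Python. Every name in it (Event, score_event, alarm_threshold,
min_alarms) is an assumption made for illustration; it does not reflect the
interface of any actual FMS product.

    # A minimal, hypothetical sketch of the flow in Figure 1: score incoming
    # events against a segment profile, raise alarms for anomalies, and flag
    # a case once enough alarms accumulate. Names and thresholds are made up.
    from dataclasses import dataclass

    @dataclass
    class Event:
        subscriber_id: str
        minutes: float                  # delivered service value, e.g. call minutes

    def score_event(event: Event, segment_mean: float) -> float:
        """Toy anomaly score: usage relative to the subscriber segment's norm."""
        return event.minutes / segment_mean

    def flag_cases(events, segment_mean=10.0, alarm_threshold=3.0, min_alarms=3):
        """On-line filter: anomalous events raise alarms; subscribers with at
        least min_alarms alarms are flagged as fraud cases for the analysts."""
        alarms: dict[str, int] = {}
        for event in events:
            if score_event(event, segment_mean) > alarm_threshold:
                alarms[event.subscriber_id] = alarms.get(event.subscriber_id, 0) + 1
        return {sid for sid, count in alarms.items() if count >= min_alarms}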
This Knowledge Brief describes how to achieve accuracy in the classification
of fraudulent cases and how to identify the best parameters to use as a
predictive performance index of an FMS. Key Performance Indicators (KPIs)
such as Accuracy, Misclassification Rate, and Hit Rate are popular with
operators but are not sufficient for fraud detection. The Precision/Recall
pair works reasonably well in information retrieval applications and is a
good candidate for our purposes, but Sensitivity (the percentage of Frauds
classified as Fraud) and 1-Specificity (the percentage of Not Frauds
classified as Fraud) fit the fraud fighting process in the
telecommunication arena better, even when the proportions of Fraud to Not
Fraud instances vary significantly from period to period and country to
country. Receiver Operating Characteristic (ROC) curves, the plot of
Sensitivity against 1-Specificity at many cut-points, are insensitive to
class skews, and the area under these curves (AUC)
measures the ability of an FMS to separate Frauds from Not Frauds. AUC is
representative, at the same time, of how many true frauds are detected
every day, week, or month (effectiveness) and of how many false alarms slow
down the fraud management process (efficiency).
The concepts presented here are based on recent data mining literature on
predictive and classification models. For example, see the tutorial "The
Many Faces of ROC Analysis in Machine Learning", presented at ICML 2004,
the Twenty-First International Conference on Machine Learning, July 4-8,
2004 (http://www.aicml.cs.ualberta.ca/_banff04/icml/). However, this
Knowledge Brief relies more on years of experience implementing FMS
solutions for HP customers.
KPIs for FMS
Once Fraud Managers have analyzed what they want to do and have defined
their department goals, they need a way to measure progress toward these
goals via some Key Performance Indicators (KPIs). It is very important to
choose KPIs that are SMART:
Specific: linked to a particular activity, clear and unambiguous
Measurable: objective
Attainable: incrementally challenging but realistic
Relevant: meaningful to the telecom operator
Time bound: regularly measured and reviewed
Best practice is to measure a KPI at regular intervals and against specific
benchmarks. Benchmarking enables you to improve performance in a systematic
and logical way by measuring and comparing one operator's performance
against the others', and then using the lessons learned from the best of
them to make targeted improvements. It involves answering the questions:
Who performs better?
Why are they better?
What actions do we need to take in order to improve our
performance?
The purpose of our analysis is to discover which KPI best covers fraud
management needs.
Notation
The red color represents Frauds while the blue represents Not Frauds. When
there is a mixture of both fraud and not fraud, the black color is used.
The totality of Frauds, 100% of observed fraud cases, corresponds to the
Total Positive p cases, given by the sum of frauds hit (True Positives) and
frauds missed (False Negatives) by FMS detection. The true positive rate of
an FMS, TP, is obtained by dividing the number of True Positives by the
totality of frauds p. The false negative rate, FN, is given by dividing the
number of False Negatives by the same p; the two rates are complementary
because TP + FN = 100% = 1.
Similarly, the totality of Not Frauds corresponds to the Total Negative n
cases, given by the sum of false alarms (False Positives) and honest
subscribers (True Negatives); the associated rates FP and TN are
complementary: FP + TN = 100% = 1.
The four rates TP, FN, FP, and TN can be summarized in a cross-tabulation
matrix with columns for observed Frauds and Not Frauds and rows for FMS
predictions (see Table 1, and the code sketch that follows it). This
classification table is also called a Confusion Matrix because it shows
where the classifier confuses the two classes; it provides a measure of how
well the FMS performs.
                      Frauds                                Not Frauds
Predicted Positive    True Positive                         False Positive
                      TP = True Positive / p                FP = False Positive / n
Predicted Negative    False Negative                        True Negative
                      FN = False Negative / p               TN = True Negative / n
Total Cases           Total Positive:                       Total Negative:
                      True Positive + False Negative = p    False Positive + True Negative = n
                      TP + FN = 1                           FP + TN = 1

Table 1: Classification Matrix
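
These per-column rates can be sketched in a few lines of Python. The
function name and the example counts below are illustrative, not taken from
a real Case Archive:

    # A sketch of Table 1 from raw counts, assumed to come from the
    # resolution phase; the example numbers are made up for illustration.
    def classification_matrix(true_pos, false_neg, false_pos, true_neg):
        p = true_pos + false_neg                      # Total Positive (Frauds)
        n = false_pos + true_neg                      # Total Negative (Not Frauds)
        return {
            "TP": true_pos / p, "FN": false_neg / p,  # per column: TP + FN = 1
            "FP": false_pos / n, "TN": true_neg / n,  # per column: FP + TN = 1
            "p": p, "n": n,
        }

    print(classification_matrix(81, 19, 160, 840))
    # {'TP': 0.81, 'FN': 0.19, 'FP': 0.16, 'TN': 0.84, 'p': 100, 'n': 1000}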
The distribution of the four rates should be plotted as frequencies
measured along a delivered service value. Elapsed phone time in minutes or
data transferred in bytes can represent the proper service value in a
telecommunication context. If available, the corresponding value in local
currency (SDRs, dollars, or euros charged by the billing department) works
even better and allows benchmarking across different operators and
countries.
In the following paragraphs, the Fraud and Not Fraud frequency
distributions are drawn symmetric, like Gaussian curves (see Figure 2).
This is only for purposes of illustration: real Fraud and Not Fraud
distributions are asymmetrical, with a negative skewness factor for Frauds
and a positive one for Not Frauds. It is also important to analyze the
kurtosis factor, which flattens the distribution near its maximum. Kurtosis
measures whether a curve is peaked or flat relative to a normal
distribution: data sets with high kurtosis tend to have a distinct peak
near the mean, decline rather rapidly, and have heavy tails, while data
sets with low kurtosis tend to have a flat top near the mean rather than a
sharp peak (see Figure 3).
Comparing the distributions of absolute Fraud and Not Fraud counts, we see
a very strong skew between them, because in the telecommunication context
there may be 1 Fraud for every 1,000 or even 10,000 Not Frauds. Figure 4
shows this skew, attenuated by a logarithmic vertical scale, and
synthesizes the basic parameters necessary to analyze FMS predictive
capabilities. The area under the red distribution represents the p cases
belonging to the Fraud class, while the blue one gives the distribution of
the Not Fraud class to which the n cases belong.
Figure 2: Gaussian probability distributions
Figure 3: Skewness and Kurtosis Factors
We can calculate the probability, at or above any given threshold
(represented by a green line in Figure 4), of an alarm being correct or
incorrect by determining the fractions of cases properly classified if
that threshold were applied. In the Figure 4 example, the threshold on the
delivered service value is 200: 81% of fraudsters would be correctly
reported (True Positives), while 16% of honest subscribers would be
incorrectly classified (False Positives). At the same time, 19% of real
fraudsters would be incorrectly classified as honest subscribers (False
Negatives) and 84% of honest subscribers would be correctly classified
(True Negatives). The four values so computed should be reported in
Classification Matrices, as shown in the lower right corner of Figure 4.
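
This cut-point computation can be sketched as follows. The
delivered-service values in the two lists are hypothetical stand-ins for
real case data and only approximate the Figure 4 percentages (81% / 16%):

    # Fractions of each class at or above a cut-point on the delivered
    # service value, as in the Figure 4 example (threshold = 200).
    def rates_at_threshold(fraud_values, not_fraud_values, threshold):
        tp = sum(v >= threshold for v in fraud_values) / len(fraud_values)
        fp = sum(v >= threshold for v in not_fraud_values) / len(not_fraud_values)
        return tp, 1 - tp, fp, 1 - fp               # TP, FN, FP, TN

    fraud = [150, 210, 250, 320, 400]               # illustrative values
    honest = [20, 50, 90, 120, 180, 220]
    print(rates_at_threshold(fraud, honest, 200))   # (0.8, 0.2, 0.1667, 0.8333)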
The FMS Case Scoring function interprets such frequencies as probabilities
and predicts positive or negative cases according to the percentages of
true or false alarms at the different cut-points given by the threshold
values of the delivered service.
Fraud analysts verify the cases predicted positive and classify them as
True Positives or False Positives (the resolution phase). All the other
cases, predicted negative, are initially classified as True Negatives, but
some of them may turn out to be False Negatives when the credit and risk
department accounts for unpaid invoices.
Figure 4: Fraud Case Scoring
Fraud Managers should carefully compile Classification Matrices and derive
from them powerful indexes, combining the four basic counts and rates into
different KPIs (a code sketch of these formulas follows the list):

p = total Frauds
True Positive rate = True Positive / p = TP = 1 - FN
False Negative rate = False Negative / p = FN = 1 - TP

n = total Not Frauds
False Positive rate = False Positive / n = FP = 1 - TN
True Negative rate = True Negative / n = TN = 1 - FP

Accuracy = Total correctly classified / Total cases
         = (True Positive + True Negative) / (p + n)
Misclassification Rate = Total incorrectly classified / Total cases
         = (False Negative + False Positive) / (p + n)
Accuracy = 1 - Misclassification Rate

Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / p = True Positive rate
Hit Rate 1 = Precision
Hit Rate 2 = Recall

Sensitivity = True Positive rate = Recall
Specificity = True Negative rate
1 - Specificity = False Positive rate
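
A minimal sketch of these formulas in Python, taking the four raw counts as
inputs (the function name kpis is illustrative):

    # The KPI formulas above, computed from the four raw counts.
    def kpis(true_pos, false_neg, false_pos, true_neg):
        p = true_pos + false_neg                     # total Frauds
        n = false_pos + true_neg                     # total Not Frauds
        return {
            "Accuracy": (true_pos + true_neg) / (p + n),
            "Misclassification Rate": (false_neg + false_pos) / (p + n),
            "Precision": true_pos / (true_pos + false_pos),
            "Recall": true_pos / p,                  # = Sensitivity = TP rate
            "Specificity": true_neg / n,             # 1 - Specificity = FP rate
        }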
Accuracy and Misclassification Rate
Accuracy maximization is very popular within the analyst community, but it
is not appropriate for an FMS because it assumes equal misclassification
costs for both False Positive and False Negative errors. In fraud detection,
the cost of missing a case of fraud can be much higher than the cost of a
false alarm. Moreover, the fraud events class is comparatively rare, and
when fraudulent activity involves only 0.01% of a population to predict all
the events as Not Fraud achieves 99.99% Accuracy, which is highly
acceptable from a global perspective, but it is completely unacceptable for
effectively predicting fraud because, despite the high Accuracy, we would
miss all the fraud! In conclusion, adopting Accuracy as an FMS predictive
evaluation metric, we would wrongly assume that distribution between
Frauds and Not Frauds is constant and balanced.
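
The arithmetic of this accuracy paradox is easy to reproduce. The sketch
below uses an assumed population of one million cases to match the 0.01%
example:

    # Accuracy paradox: 0.01% fraud, predict everything as Not Fraud.
    total = 1_000_000
    frauds = total // 10_000          # 0.01% -> 100 fraud cases
    true_neg = total - frauds         # every case predicted negative
    accuracy = true_neg / total       # 0.9999: looks excellent...
    recall = 0 / frauds               # ...yet every fraud is missed
    print(accuracy, recall)           # 0.9999 0.0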
It is not advisable to use the Misclassification Rate for evaluating an
FMS, for exactly the same reasons we discussed for discarding Accuracy: the
Misclassification Rate is simply the complement of Accuracy
(Misclassification Rate = 1 - Accuracy).
Precision/Recall and Hit-Rate
Precision and Recall are the basic measures used in evaluating search
strategies. Precision is the ratio of the number of relevant records retrieved
to the total number of irrelevant and relevant records retrieved:
Precision = correctly classified / total predicted positive
Recall is the ratio of the number of relevant records retrieved to the total
number of relevant records in the database:
Recall = correctly classified / total positive existing
Following information retrieval best practices, we have to analyze
Precision (efficiency) together with Recall (effectiveness) to understand
the tradeoff between them: without adding information content, we cannot
simultaneously achieve higher Precision and wider Recall.
The Precision-Recall graph plotted in Figure 5 visualizes this concept; in
the fraud detection process, the definitions of Precision and Recall
become:
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / p = True Positive rate
We cannot be satisfied with Precision/Recall as an FMS predictive
performance metric, because when the class distribution changes, the metric
changes too. In other words, Precision/Recall is sensitive to the variable
skewness of the Fraud/Not Fraud class distributions. The minority class
(Frauds) has much lower Precision and Recall than the prevalent class (Not
Frauds), and many practitioners have observed that for extremely skewed
classes the Recall of Frauds is often 0, with no classification rules that
can be generated for it.
FMS suppliers often propose Hit Rate as the sole parameter representing the
predictive performance of their system. Among fraud managers, Hit Rate is
sometimes defined as Precision and at other times as Recall. In any case,
Hit Rate alone cannot estimate, at the same time, the effectiveness (how
many real frauds are detected in comparison with their totality, i.e.,
Recall or True Positive rate) and the efficiency (how few false alarms slow
down the resolution phase of the fraud management process, i.e.,
Precision).
Figure 5: Adding more Information for Retrieval
Sensitivity/Specificity and ROC curves
The important tradeoff between Precision and Recall discussed earlier is
similar to the one between Sensitivity, the percentage of Frauds classified
as Frauds, and Specificity, the percentage of Not Frauds classified as Not
Frauds (or its complement 1-Specificity, the percentage of Not Frauds
classified as Frauds).
To analyze Sensitivity and Specificity, it is advisable to adopt ROC
(Receiver Operating Characteristic) curves, plotting, for each potential
threshold value, the frequency of true positive cases (Sensitivity) against
the frequency of false positives (1-Specificity). The diagonal straight
line signifies a system with a 50/50 chance of making a correct alarm
(i.e., no better than flipping a coin).
Figure 6 shows two ROC curves plotted with the SPSS 13.0 for Windows ROC
graph tool. The curves are evaluated from the threshold scores of a Neural
Network and a C5.0 Decision Tree, both trained with SPSS/Clementine 8.0 on
an HP-FMS 7.0-3 Case Archive containing a sample of 1,387 Frauds and 25,884
Not Frauds.
The coordinates of the ROC curves are the true and false positive rates, or
frequencies (sketched in code after these formulas):

Sensitivity = Positives correctly classified / total Frauds
Sensitivity = True Positive rate = TP

1 - Specificity = 1 - True Negative rate = 1 - True Negative / (total Not Frauds)
1 - Specificity = False Positive rate = FP
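
A minimal sketch of this construction, assuming a list of FMS case scores
paired with their resolved labels (1 = Fraud, 0 = Not Fraud); it sweeps the
cut-point down the sorted scores and integrates the AUC with the
trapezoidal rule (ties between scores are ignored for brevity):

    def roc_points(scores, labels):
        """Collect (1 - Specificity, Sensitivity) = (FP rate, TP rate)
        points as the cut-point sweeps from the highest score down."""
        p = sum(labels)                     # total Frauds
        n = len(labels) - p                 # total Not Frauds
        tps = fps = 0
        points = [(0.0, 0.0)]
        for _, label in sorted(zip(scores, labels), reverse=True):
            if label:
                tps += 1
            else:
                fps += 1
            points.append((fps / n, tps / p))
        return points

    def auc(points):
        """Trapezoidal integration of the ROC curve."""
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    print(auc(roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])))  # 0.75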
In telecommunications systems, the proportions of Fraud to Not Fraud
instances vary significantly from period to period and country to country.
ROC curves, however, are insensitive to class skews: when the absolute size
of each class varies, and the skewness varies with it, the curves do not
change, because TP and FP are percentages computed within each class.
Figure 6: ROC curves
Any performance metric that uses values from both columns of the
Classification Matrix is inherently sensitive to changes in the Fraud / Not
Fraud proportion. Metrics such as Accuracy and Precision use values from
both columns, so when the class distribution changes, these measures change
as well, even if the fundamental FMS performance does not. In our notation,
Sensitivity has the red color of Frauds and 1-Specificity has the blue
color of Not Frauds, but the same is not true for Accuracy and Precision
(e.g., Precision is defined by TP and FP, which are picked from
heterogeneous columns of the Classification Matrix).
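
The following sketch, with made-up counts, illustrates the point: growing
the Not Fraud class leaves the per-column TP and FP rates untouched, while
Precision collapses.

    # Grow the Not Fraud class tenfold with the same per-column rates.
    tp_count, fn_count = 80, 20                      # Fraud column: p = 100
    for n in (1_000, 10_000):                        # Not Fraud class grows
        fp_count = int(0.05 * n)                     # FP rate held at 0.05
        precision = tp_count / (tp_count + fp_count)
        print(f"n={n}: TP rate={tp_count / 100:.2f}, "
              f"FP rate={fp_count / n:.2f}, Precision={precision:.3f}")
    # n=1000:  TP rate=0.80, FP rate=0.05, Precision=0.615
    # n=10000: TP rate=0.80, FP rate=0.05, Precision=0.138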
To analyze the tradeoff between TP and FP, we can consider how they vary
with different thresholds. Figures 7 and 8 show the effect of two different
threshold values, which change the true and false positive frequencies from
TP = 0.489, FP = 0.088 to TP = 0.882, FP = 0.446.
Figure 7: ROC curves analysis
Figure 8: ROC curves analysis
Then, by adding more information content (e.g., a black list or a
qualitative rule), the increased distance between the means of the Fraud
and Not Fraud distributions enables a higher TP at the same FP. In Figure
9, TP improves from 0.882 to 0.985 while FP remains at 0.446.
Figure 9: ROC curves analysis
Lowering the information content, the distance between the means of the
distributions shrinks and the two distributions overlap considerably more,
degrading performance. In this last case, Figure 10 shows TP dropping from
the previous 0.882 to 0.595, while FP again remains at 0.446.
Figure 10: ROC curves analysis
Comparing the four figures above, we should also note the area under the
ROC curve (AUC). In Figures 7 and 8 the AUC is about 0.827; adding
information content in Figure 9 grows it to 0.949, while subtracting
information, as shown in Figure 10, reduces it to 0.607.
AUC measures the ability of an FMS to separate Frauds from Not Frauds. It
is representative, at the same time, of FMS effectiveness, in terms of how
many actual fraudulent events might be detected every day, week, or month,
and of its efficiency, in terms of how few false alarms would slow down the
fraud management process.
Telecom Operators can use ROC curve analysis to evaluate the FMSs available
on the market and buy the best performer; alternatively, while configuring
an FMS, the Knowledge Manager can test different parameters or rules and
tune FMS performance in terms of TP and FP rates. One trivial but important
observation: the frequencies of cases predicted negative, FN and TN, must
be available, otherwise it is not possible to plot the ROC curves and
compute the AUC.