Evaluation Metrics for
Classification and
Information Retrieval
Who am I
● Katya
● Natural Language Processing
● CTO at Majio
● Sloth.Works - matches Candidates to Jobs
● Twitter - @kitkate87
● Medium - @ekaterinamihailova
Content
● Classification metrics
● Information Retrieval metrics
● Majio's evaluation metrics
● Design your own metric
General ML flow
Define goals and metrics → Gather and clean data → Build ML model → Evaluate results → Analyze results
Classification Metrics
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures
Confusion matrix

                        Truth: sloth           Truth: not a sloth
Algorithm: sloth        True Positive (TP)     False Positive (FP)
Algorithm: not a sloth  False Negative (FN)    True Negative (TN)
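A minimal sketch (function and variable names are mine, not from the slides) of counting the four cells from parallel lists of ground truth and predictions, where True means "sloth":

```python
def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN from parallel lists of booleans (True = sloth)."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    tn = sum(not t and not p for t, p in zip(truth, predicted))
    fp = sum(not t and p for t, p in zip(truth, predicted))
    fn = sum(t and not p for t, p in zip(truth, predicted))
    return tp, tn, fp, fn
```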
Accuracy
acc = T / (T + F) = (TP + TN) / ALL
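A direct transcription of the formula above into code:

```python
def accuracy(tp, tn, fp, fn):
    # acc = T / (T + F) = (TP + TN) / ALL
    return (tp + tn) / (tp + tn + fp + fn)
```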
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures
Accuracy with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 99%
Accuracy per class
accP = TP / (TP + FN)
accN = TN / (TN + FP)
acc = (accP + accN) / 2
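The same per-class (balanced) accuracy as a sketch, with the degenerate always-negative classifier as a check:

```python
def accuracy_per_class(tp, tn, fp, fn):
    acc_p = tp / (tp + fn)  # accuracy on the positive (sloth) class
    acc_n = tn / (tn + fp)  # accuracy on the negative class
    return (acc_p + acc_n) / 2

# 1% sloths, always answering "not a sloth": TP=0, FN=1, TN=99, FP=0
print(accuracy_per_class(0, 99, 0, 1))  # 0.5, matching the next slide
```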
Accuracy per class with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 50%
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure
Logarithmic Loss
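The slide itself gives no formula, so as a reference point: binary log loss averages the negative log of the probability the model assigns to the true class, so confident wrong answers are punished hardest. A minimal sketch, assuming labels in {0, 1} and predicted probabilities of the positive class:

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```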
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures
Precision
p = TP / (TP + FP)
Precision with 1% sloth pictures
Algorithm - guesses exactly one sloth right and for
everything else says it is not a sloth
p = 100%
Recall (True Positive Rate)
r = TP / (TP + FN)
Recall with 1% sloth pictures
Algorithm - always says it is a sloth
r = 100%
f1-measure
f1 = 2 * p * r / (p + r)
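The three formulas together as a sketch, reproducing the 30%/70% example below:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.30, 0.70))  # 0.42, as in the example below
print(f1(0.50, 0.50))  # 0.50
```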
f1-measure with 1% sloth pictures
Algorithm - always says it is a sloth
f ≈ 2% (p = 1%, r = 100%)
Algorithm - always says it is NOT a sloth except for one correct guess
f ≈ 0%
Algorithm has 30% precision and 70% recall
f = 42%
Algorithm has 50% precision and 50% recall
f = 50%
Parametrized f-measure
f(b) = (1 + b) * p * r / (b * p + r)
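Note that the slides' b plays the role of β² in the textbook F-beta, F(β) = (1 + β²) * p * r / (β² * p + r). A sketch using the slides' parametrization:

```python
def f_measure(p, r, b=1.0):
    # The slides' form; the textbook F-beta is recovered with b = beta**2.
    return (1 + b) * p * r / (b * p + r)

print(f_measure(0.30, 0.70, b=3))  # 0.525 - recall is weighted more heavily
```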
Parametrized f-measure with 1% sloth pictures
b = 3; f = 4*p*r/(3*p + r)
Algorithm has 30% precision and 70% recall
f = 52.5%
Algorithm has 50% precision and 50% recall
f = 50%
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures
False positive rate
fpr = FP / (FP + TN)
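And its one-line counterpart as a sketch:

```python
def false_positive_rate(fp, tn):
    return fp / (fp + tn)
    # e.g. the always-"not a sloth" algorithm on the next slide: fpr = 0
```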
False Positive Rate with 1% sloth pictures
Algorithm - always says it is not a sloth
fpr = 0%
Information Retrieval metrics
Search Results
Positions: 1 2 3 4 5 6 7 8
Relevance: 1 0 0 1 1 1 1 1
TPR and FPR for different points
● At point 1 - TPR = 2%, FPR = 0%
● At point 25 - TPR = 40%, FPR = 10%
● At point 50 - TPR = 74%, FPR = 26%
● At point 75 - TPR = 96%, FPR = 54%
● At point 100 - TPR = 100%, FPR = 100%
ROC Curve and AUC (Area Under the Curve)
(Plot: True Positive Rate vs. False Positive Rate; the AUC is the area under this curve.)
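The AUC can be approximated from a handful of (FPR, TPR) cut-off points with the trapezoidal rule; a sketch using the five points listed above (names are mine):

```python
def auc(points):
    """Trapezoidal area under (fpr, tpr) points, sorted by fpr."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

points = [(0.00, 0.02), (0.10, 0.40), (0.26, 0.74), (0.54, 0.96), (1.00, 1.00)]
print(auc(points))  # ~0.80 for this curve
```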
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures
Search Results
Relevance: 1 0 0 1 1 1 1 1
Precision and Recall at different points
● At point 1 - Recall = 2%, Precision = 100%
● At point 25 - Recall = 40%, Precision = 80%
● At point 50 - Recall = 74%, Precision = 74%
● At point 75 - Recall = 96%, Precision = 64%
● At point 100 - Recall = 100%, Precision = 50%
Precision - Recall curve
(Plot: Precision vs. Recall.)
Average Precision
Relevance: 1 0 0 1 1 1 1 1
AP = (1/1 + 0 + 0 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 70.5%

Average Precision
Relevance: 0 0 1 1 1 1 1 1
AP = (0 + 0 + 1/3 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 59.4%

Average Precision
Relevance: 1 1 1 1 1 1 0 0
AP = (1/1 + 2/2 + 3/3 + 4/4 + 5/5 + 6/6 + 0 + 0) / 6 = 100%
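A sketch of Average Precision over a binary relevance list, reproducing all three rankings above:

```python
def average_precision(relevance):
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / sum(relevance)

print(average_precision([1, 0, 0, 1, 1, 1, 1, 1]))  # ~0.705
print(average_precision([0, 0, 1, 1, 1, 1, 1, 1]))  # ~0.594
print(average_precision([1, 1, 1, 1, 1, 1, 0, 0]))  # 1.0
```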
Mean Average Precision
MAP = (70.5% + 59.4% + 100%) / 3 = 76.64%
Geometric Mean Average Precision
GMAP = (70.5% * 59.4% * 100%)^(1/3) = 74.81%
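MAP is the arithmetic mean of the per-query APs; GMAP is the geometric mean, hence the cube root for three queries:

```python
import math

aps = [0.70516, 0.59405, 1.0]                  # the three APs above
map_score = sum(aps) / len(aps)                # ~0.7664
gmap_score = math.prod(aps) ** (1 / len(aps))  # ~0.7481
```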
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP
  ○ Create image search for sloths with different relevance
Cumulative Gain
Relevance: 2 0 1 2 2 1 1 2
CG = Σ rel(i) = 11
Discounted Cumulative Gain
Relevance: 2 0 1 2 2 1 1 2
DCG = Σ rel(i) / log2(i + 1)
Discounted Cumulative Gain
DCG = Σ rel(i) / log2(i + 1)
Per position: 2, 0, 1/2, 0.86, 0.77, 0.36, 1/3, 0.63
DCG ≈ 5.46
Ideal Discounted Cumulative Gain
Ideal ordering: 2 2 2 2 1 1 1 0
DCG = Σ rel(i) / log2(i + 1)
Per position: 2, 1.26, 1, 0.86, 0.39, 0.36, 1/3, 0
IDCG ≈ 6.20
Normalized Discounted Cumulative Gain
NDCG = DCG / IDCG = 5.46 / 6.20 ≈ 0.88
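A sketch reproducing the DCG/NDCG numbers above:

```python
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

rels = [2, 0, 1, 2, 2, 1, 1, 2]
ideal = sorted(rels, reverse=True)  # 2 2 2 2 1 1 1 0
print(dcg(rels))                    # ~5.46
print(dcg(rels) / dcg(ideal))       # NDCG ~0.88
```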
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP
  ○ Create image search for sloths with different relevance - NDCG
Majio Use Case
Matching Candidates to a Job
Relevance labels: 1 2 3
Evaluating Matching Candidates to Job - 1
Ranking: 1 3 2 1 1 2 2 1 2 2
score = (TP/T - FP/T + 1) / 2
      = (2/4 - 1/4 + 1) / 2 = 62.5%
Evaluating Matching Candidates to Job - 1
Ranking: 3 1 2 2 2 2 2 2 2 2
score = (0/1 - 1/1 + 1) / 2 = 0%
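A hedged reconstruction of this first metric. The transcript does not spell out the label semantics or the cut-off line, so the following assumes: 1 = true match, 3 = false match, 2 = neutral; TP and FP count the 1s and 3s above the cut-off; T is the total number of 1s in the whole ranking. With cut-offs of 4 and 1 this reproduces both examples above:

```python
def match_score(labels, cutoff):
    shown = labels[:cutoff]  # candidates above the cut-off line
    t = labels.count(1)      # all true matches in the ranking
    tp = shown.count(1)
    fp = shown.count(3)
    return (tp / t - fp / t + 1) / 2

print(match_score([1, 3, 2, 1, 1, 2, 2, 1, 2, 2], cutoff=4))  # 0.625
print(match_score([3, 1, 2, 2, 2, 2, 2, 2, 2, 2], cutoff=1))  # 0.0
```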
Evaluating Matching Candidates to Job - 2
Ranking: 1 3 2 1 1 2 2 1 2 2
Normalized MAP at points 5, 10, 15
MAP = (3.3/5 + 5.7/10) / 2 = 31.8%
Evaluating Matching Candidates to Job - 2
Best ranking: 1 1 1 1 2 2 2 2 2 3
best MAP = (4.3/5 + 5.7/10) / 2 = 32.8%

Evaluating Matching Candidates to Job - 2
Worst ranking: 3 2 2 2 2 2 1 1 1 1
worst MAP = (1.3/5 + 5.7/10) / 2 = 29.8%
Evaluating Matching Candidates to Job - 2
Ranking: 1 3 2 1 1 2 2 1 2 2
normalized MAP = (MAP - wMAP) / (bMAP - wMAP) = 33.3%

Evaluating Matching Candidates to Job - 2
40% 30% 20% 10% 9% 8% 7% 6% 5% 4%
0.8 * f1 + 0.2 * AP
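A sketch of the normalization and the final weighted blend; the MAP inputs come from whatever AP computation the slides use at points 5, 10, and 15, and the names are mine:

```python
def normalized_map(map_score, worst_map, best_map):
    # Rescale so the worst achievable ranking maps to 0 and the best to 1.
    return (map_score - worst_map) / (best_map - worst_map)

def final_score(f1_score, avg_precision):
    return 0.8 * f1_score + 0.2 * avg_precision
```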
Inter-annotator agreement
How often the annotators make the same decision about the same search result.
The experiment
● 4 annotators
● 60 randomly generated search results (varying in order, percentage, and cut-off line)
● The search results were evenly distributed across Majio scores between 1 and 100
● Annotators had to score each search result between 1 (perfect) and 4 (horrible)
● 2 of the search results appeared twice, in different contexts
● At least 3 out of 4 annotators had to agree on a ranking for it to be accepted
The results
● Inter-annotator agreement on 32 out of 60 rankings
● Two groups of annotators emerged - strict (no 1-rated candidates may be left behind) and useful (can you do your job with the good candidates we sent you)
● 2 out of 4 annotators gave different scores to the repeated "trap" rankings
● On the rankings with inter-annotator agreement the scoring was consistent, so the limits for good and bad rankings acquired concrete values
Conclusions
● There are a lot of Information Retrieval metrics in the world (only a chosen few were shown here)
● None is perfect but some are useful
● You can craft a metric yourself, but then you have to check how good a metric it is
● People don't generally agree on things in the beginning. Experiment until there is good enough agreement.
