Evaluation Metrics for
Classification and
Information Retrieval
Who am I
● Katya
● Natural Language Processing
● CTO at Majio
● Sloth.Works - matches Candidates to Jobs
● Twitter - @kitkate87
● Medium - @ekaterinamihailova
Content
● Classification metrics
● Information Retrieval metrics
● Majio's evaluation metrics
● Design your own metric
General ML flow
Define goals and metrics → Gather and clean data → Build ML model → Evaluate results → Analyze results
Classification Metrics
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures
Confusion matrix

                        Truth: sloth           Truth: not a sloth
Algorithm: sloth        True Positive (TP)     False Positive (FP)
Algorithm: not a sloth  False Negative (FN)    True Negative (TN)
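A minimal sketch (function and variable names are mine, not from the slides) of counting the four cells from parallel lists of ground truth and predictions, where True means "sloth":

```python
def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN from parallel lists of booleans (True = sloth)."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    tn = sum(not t and not p for t, p in zip(truth, predicted))
    fp = sum(not t and p for t, p in zip(truth, predicted))
    fn = sum(t and not p for t, p in zip(truth, predicted))
    return tp, tn, fp, fn
```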
Accuracy
acc = T / (T + F) = (TP + TN) / ALL
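A direct transcription of the formula above into code:

```python
def accuracy(tp, tn, fp, fn):
    # acc = T / (T + F) = (TP + TN) / ALL
    return (tp + tn) / (tp + tn + fp + fn)
```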
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures
Accuracy with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 99%
Accuracy per class
accP = TP / (TP + FN)
accN = TN / (TN + FP)
acc = (accP + accN) / 2
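The same per-class (balanced) accuracy as a sketch, with the degenerate always-negative classifier as a check:

```python
def accuracy_per_class(tp, tn, fp, fn):
    acc_p = tp / (tp + fn)  # accuracy on the positive (sloth) class
    acc_n = tn / (tn + fp)  # accuracy on the negative class
    return (acc_p + acc_n) / 2

# 1% sloths, always answering "not a sloth": TP=0, FN=1, TN=99, FP=0
print(accuracy_per_class(0, 99, 0, 1))  # 0.5, matching the next slide
```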
Accuracy per class with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 50%
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure
Logarithmic Loss
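The slide itself gives no formula, so as a reference point: binary log loss averages the negative log of the probability the model assigns to the true class, so confident wrong answers are punished hardest. A minimal sketch, assuming labels in {0, 1} and predicted probabilities of the positive class:

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() stays finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)
```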
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures
Precision
p = TP / (TP + FP)
Precision with 1% sloth pictures
Algorithm - guesses exactly one sloth right and for
everything else says it is not a sloth
p = 100%
Recall (True Positive Rate)
r = TP / (TP + FN)
Recall with 1% sloth pictures
Algorithm - always says it is a sloth
r = 100%
f1-measure
f1 = 2 * p * r / (p + r)
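The three formulas together as a sketch, reproducing the 30%/70% example below:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

print(f1(0.30, 0.70))  # 0.42, as in the example below
print(f1(0.50, 0.50))  # 0.50
```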
f1-measure with 1% sloth pictures
Algorithm - always says it is a sloth
f ≈ 2% (p = 1%, r = 100%)
Algorithm - always says it is NOT a sloth except for one correct guess
f ≈ 0%
Algorithm has 30% precision and 70% recall
f = 42%
Algorithm has 50% precision and 50% recall
f = 50%
Parametrized f-measure
f(b) = (1 + b) * p * r / (b * p + r)
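Note that the slides' b plays the role of β² in the textbook F-beta, F(β) = (1 + β²) * p * r / (β² * p + r). A sketch using the slides' parametrization:

```python
def f_measure(p, r, b=1.0):
    # The slides' form; the textbook F-beta is recovered with b = beta**2.
    return (1 + b) * p * r / (b * p + r)

print(f_measure(0.30, 0.70, b=3))  # 0.525 - recall is weighted more heavily
```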
Parametrized f-measure with 1% sloth pictures
b = 3; f = 4*p*r/(3*p + r)
Algorithm has 30% precision and 70% recall
f = 52.5%
Algorithm has 50% precision and 50% recall
f = 50%
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures
False positive rate
fpr = FP / (FP + TN)
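And its one-line counterpart as a sketch:

```python
def false_positive_rate(fp, tn):
    return fp / (fp + tn)
    # e.g. the always-"not a sloth" algorithm on the next slide: fpr = 0
```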
False Positive Rate with 1% sloth pictures
Algorithm - always says it is not a sloth
fpr = 0%
Information Retrieval metrics
Search Results
Positions: 1 2 3 4 5 6 7 8
Relevance: 1 0 0 1 1 1 1 1
TPR and FPR for different points
● At point 1 - TPR = 2%, FPR = 0%
● At point 25 - TPR = 40%, FPR = 10%
● At point 50 - TPR = 74%, FPR = 26%
● At point 75 - TPR = 96%, FPR = 54%
● At point 100 - TPR = 100%, FPR = 100%
ROC Curve and AUC (Area Under the Curve)
(Plot: True Positive Rate vs. False Positive Rate; the AUC is the area under this curve.)
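The AUC can be approximated from a handful of (FPR, TPR) cut-off points with the trapezoidal rule; a sketch using the five points listed above (names are mine):

```python
def auc(points):
    """Trapezoidal area under (fpr, tpr) points, sorted by fpr."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

points = [(0.00, 0.02), (0.10, 0.40), (0.26, 0.74), (0.54, 0.96), (1.00, 1.00)]
print(auc(points))  # ~0.80 for this curve
```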
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures
Search Results
Relevance: 1 0 0 1 1 1 1 1
Precision and Recall at different points
● At point 1 - Recall = 2%, Precision = 100%
● At point 25 - Recall = 40%, Precision = 80%
● At point 50 - Recall = 74%, Precision = 74%
● At point 75 - Recall = 96%, Precision = 64%
● At point 100 - Recall = 100%, Precision = 50%
Precision - Recall curve
(Plot: Precision vs. Recall.)
Average Precision
Relevance: 1 0 0 1 1 1 1 1
AP = (1/1 + 0 + 0 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 70.5%

Average Precision
Relevance: 0 0 1 1 1 1 1 1
AP = (0 + 0 + 1/3 + 2/4 + 3/5 + 4/6 + 5/7 + 6/8) / 6 = 59.4%

Average Precision
Relevance: 1 1 1 1 1 1 0 0
AP = (1/1 + 2/2 + 3/3 + 4/4 + 5/5 + 6/6 + 0 + 0) / 6 = 100%
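A sketch of Average Precision over a binary relevance list, reproducing all three rankings above:

```python
def average_precision(relevance):
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at each relevant position
    return total / sum(relevance)

print(average_precision([1, 0, 0, 1, 1, 1, 1, 1]))  # ~0.705
print(average_precision([0, 0, 1, 1, 1, 1, 1, 1]))  # ~0.594
print(average_precision([1, 1, 1, 1, 1, 1, 0, 0]))  # 1.0
```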
Mean Average Precision
MAP = (70.5% + 59.4% + 100%) / 3 = 76.64%
Geometric Mean Average Precision
GMAP = (70.5% * 59.4% * 100%)^(1/3) = 74.81%
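MAP is the arithmetic mean of the per-query APs; GMAP is the geometric mean, hence the cube root for three queries:

```python
import math

aps = [0.70516, 0.59405, 1.0]                  # the three APs above
map_score = sum(aps) / len(aps)                # ~0.7664
gmap_score = math.prod(aps) ** (1 / len(aps))  # ~0.7481
```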
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP
  ○ Create image search for sloths with different relevance
Cumulative Gain
Relevance: 2 0 1 2 2 1 1 2
CG = Σ rel(i) = 11
Discounted Cumulative Gain
Relevance: 2 0 1 2 2 1 1 2
DCG = Σ rel(i) / log2(i + 1)
Discounted Cumulative Gain
DCG = Σ rel(i) / log2(i + 1)
Per position: 2, 0, 1/2, 0.86, 0.77, 0.36, 1/3, 0.63
DCG ≈ 5.46
Ideal Discounted Cumulative Gain
Ideal ordering: 2 2 2 2 1 1 1 0
DCG = Σ rel(i) / log2(i + 1)
Per position: 2, 1.26, 1, 0.86, 0.39, 0.36, 1/3, 0
IDCG ≈ 6.20
Normalized Discounted Cumulative Gain
NDCG = DCG / IDCG = 5.46 / 6.20 ≈ 0.88
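A sketch reproducing the DCG/NDCG numbers above:

```python
import math

def dcg(relevances):
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances, start=1))

rels = [2, 0, 1, 2, 2, 1, 1, 2]
ideal = sorted(rels, reverse=True)  # 2 2 2 2 1 1 1 0
print(dcg(rels))                    # ~5.46
print(dcg(rels) / dcg(ideal))       # NDCG ~0.88
```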
Evaluating Image recognition algorithms
● Setup
  ○ Images with sloths and images without sloths
● Goals
  ○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
  ○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
  ○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
  ○ Camera in the forest - 1% sloth pictures - f-measure
  ○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
  ○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP
  ○ Create image search for sloths with different relevance - NDCG
Majio Use Case
Matching Candidates to a Job
Relevance labels: 1 2 3
Evaluating Matching Candidates to Job - 1
Ranking: 1 3 2 1 1 2 2 1 2 2
score = (TP/T - FP/T + 1) / 2
      = (2/4 - 1/4 + 1) / 2 = 62.5%
Evaluating Matching Candidates to Job - 1
Ranking: 3 1 2 2 2 2 2 2 2 2
score = (0/1 - 1/1 + 1) / 2 = 0%
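A hedged reconstruction of this first metric. The transcript does not spell out the label semantics or the cut-off line, so the following assumes: 1 = true match, 3 = false match, 2 = neutral; TP and FP count the 1s and 3s above the cut-off; T is the total number of 1s in the whole ranking. With cut-offs of 4 and 1 this reproduces both examples above:

```python
def match_score(labels, cutoff):
    shown = labels[:cutoff]  # candidates above the cut-off line
    t = labels.count(1)      # all true matches in the ranking
    tp = shown.count(1)
    fp = shown.count(3)
    return (tp / t - fp / t + 1) / 2

print(match_score([1, 3, 2, 1, 1, 2, 2, 1, 2, 2], cutoff=4))  # 0.625
print(match_score([3, 1, 2, 2, 2, 2, 2, 2, 2, 2], cutoff=1))  # 0.0
```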
Evaluating Matching Candidates to Job - 2
Ranking: 1 3 2 1 1 2 2 1 2 2
Normalized MAP at points 5, 10, 15
MAP = (3.3/5 + 5.7/10) / 2 = 31.8%
Evaluating Matching Candidates to Job - 2
Best ranking: 1 1 1 1 2 2 2 2 2 3
best MAP = (4.3/5 + 5.7/10) / 2 = 32.8%

Evaluating Matching Candidates to Job - 2
Worst ranking: 3 2 2 2 2 2 1 1 1 1
worst MAP = (1.3/5 + 5.7/10) / 2 = 29.8%
Evaluating Matching Candidates to Job - 2
Ranking: 1 3 2 1 1 2 2 1 2 2
normalized MAP = (MAP - wMAP) / (bMAP - wMAP) = 33.3%

Evaluating Matching Candidates to Job - 2
40% 30% 20% 10% 9% 8% 7% 6% 5% 4%
0.8 * f1 + 0.2 * AP
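A sketch of the normalization and the final weighted blend; the MAP inputs come from whatever AP computation the slides use at points 5, 10, and 15, and the names are mine:

```python
def normalized_map(map_score, worst_map, best_map):
    # Rescale so the worst achievable ranking maps to 0 and the best to 1.
    return (map_score - worst_map) / (best_map - worst_map)

def final_score(f1_score, avg_precision):
    return 0.8 * f1_score + 0.2 * avg_precision
```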
Inter-annotator agreement
How often the annotators make the same decision about the same search result.
The experiment
● 4 annotators
● 60 randomly generated search results (varying in order, percentage, and cut-off line)
● The search results were evenly distributed across Majio scores between 1 and 100
● Annotators had to score each search result between 1 (perfect) and 4 (horrible)
● 2 of the search results appeared twice, in different contexts
● At least 3 out of 4 annotators had to agree on a ranking for it to be accepted
The results
● Inter-annotator agreement on 32 out of 60 rankings
● Two groups of annotators emerged - strict (no 1-rated candidates may be left behind) and useful (can you do your job with the good candidates we sent you)
● 2 out of 4 annotators gave different scores to the repeated "trap" rankings
● On the rankings with inter-annotator agreement the scoring was consistent, so the limits for good and bad rankings acquired concrete values
Conclusions
● There are a lot of Information Retrieval metrics in the world (only a chosen few were shown here)
● None is perfect but some are useful
● You can craft a metric yourself, but then you have to check how good a metric it is
● People don't generally agree on things in the beginning. Experiment until there is good enough agreement.
