This document discusses various metrics for evaluating classification and information retrieval models, including accuracy, precision, recall, F1 score, log loss, ROC AUC, MAP, NDCG, and inter-annotator agreement. It provides examples of how to calculate and interpret these metrics based on classification of images as sloths or non-sloths and ranking search results for sloths. The document also describes how Majio evaluates models that match candidates to jobs using metrics like normalized MAP and a weighted combination of F1 and average precision.
Classification and Information Retrieval metrics for machine learning
2. Who am I
● Katya
● Natural Language Processing
● CTO at Majio
● Sloth.Works - matches Candidates to Jobs
● Twitter - @kitkate87
● Medium - @ekaterinamihailova
6. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures
10. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures
11. Accuracy with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 99%
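To see why plain accuracy is misleading here, a minimal Python sketch (the counts of 10 sloths and 990 non-sloths are assumed for illustration; only the 1% split comes from the slide):

# Accuracy of a classifier that always answers "not a sloth"
# on a set with 1% sloth images (counts are illustrative).
labels = [1] * 10 + [0] * 990          # 1 = sloth, 0 = not a sloth
predictions = [0] * len(labels)        # always "not a sloth"

correct = sum(p == y for p, y in zip(predictions, labels))
print(f"accuracy = {correct / len(labels):.0%}")   # 99%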
13. Accuracy per class with 1% sloth pictures
Algorithm - always says it is not a sloth
acc = 50%
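Accuracy per class computes accuracy separately for the sloth and non-sloth images and then averages the two, which exposes the do-nothing classifier. A small sketch under the same assumed counts:

# Per-class accuracy: the always-"not a sloth" classifier scores
# 0% on the sloth class and 100% on the non-sloth class.
labels = [1] * 10 + [0] * 990
predictions = [0] * len(labels)

def class_accuracy(cls):
    pairs = [(p, y) for p, y in zip(predictions, labels) if y == cls]
    return sum(p == y for p, y in pairs) / len(pairs)

per_class = [class_accuracy(1), class_accuracy(0)]            # [0.0, 1.0]
print(f"accuracy per class = {sum(per_class) / len(per_class):.0%}")   # 50%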
14. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure
16. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss (sketch after this list)
○ Camera in the forest - 1% sloth pictures
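Log loss is the metric attached to the "ask a person if not sure" goal in the list above, because it scores predicted probabilities rather than hard yes/no answers. A minimal sketch, with probabilities made up for illustration:

import math

def log_loss(y_true, p_pred, eps=1e-15):
    # Average negative log-likelihood of the true labels.
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Confident-and-right is cheap, confident-and-wrong is very expensive,
# and an unsure 0.5 sits in between - exactly when to ask a person.
print(log_loss([1, 0], [0.9, 0.1]))   # ~0.105
print(log_loss([1, 0], [0.5, 0.5]))   # ~0.693
print(log_loss([1, 0], [0.1, 0.9]))   # ~2.303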
22. f1-measure with 1% sloth pictures
Algorithm - always says it is a sloth
f ~ 2%
Algorithm - always says it is NOT a sloth except for 1
f ~ 0%
Algorithm has 30% precision and 70% recall
f = 42%
Algorithm has 50% precision and 50% recall
f = 50%
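The last two values above follow directly from the harmonic-mean formula f = 2*p*r / (p + r); a small sketch reproducing them:

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f"{f1(0.30, 0.70):.0%}")   # 42%
print(f"{f1(0.50, 0.50):.0%}")   # 50%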
24. Parametrized f-measure with 1% sloth pictures
b = 3; f = (b + 1)*p*r / (b*p + r) = 4*p*r / (3*p + r), i.e. recall is weighted b times as much as precision
Algorithm has 30% precision and 70% recall
f = 52.5%
Algorithm has 50% precision and 50% recall
f = 50%
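A sketch of the weighted variant used on this slide, f = (b + 1)*p*r / (b*p + r) with b = 3, reproducing the two examples:

def f_weighted(precision, recall, b=3):
    # Weighted f-measure as on the slide: recall counts b times as much
    # as precision (b = 1 gives back the ordinary F1).
    return (b + 1) * precision * recall / (b * precision + recall)

print(f"{f_weighted(0.30, 0.70):.1%}")   # 52.5%
print(f"{f_weighted(0.50, 0.50):.1%}")   # 50.0%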
25. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
○ Camera in the forest - 1% sloth pictures - f-measure
○ Search results for sloth and non-sloth - 50% sloth pictures
31. TPR and FPR for different points
● At point 1 - TPR = 2%, FPR = 0%
● At point 25 - TPR = 40%, FPR = 10%
● At point 50 - TPR = 74%, FPR = 26%
● At point 75 - TPR = 96%, FPR = 54%
● At point 100 - TPR = 100%, FPR = 100%
32. ROC Curve and AUC (Area Under the Curve)
[ROC curve plot: True Positive Rate vs. False Positive Rate]
33. ROC Curve and AUC (Area Under the Curve)
[ROC curve plot: True Positive Rate vs. False Positive Rate]
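Using only the operating points listed on slide 31 (plus an assumed (0, 0) origin), the area under the ROC curve can be estimated with the trapezoidal rule; a rough sketch:

# (FPR, TPR) pairs from slide 31, plus the (0, 0) origin.
points = [(0.00, 0.00), (0.00, 0.02), (0.10, 0.40),
          (0.26, 0.74), (0.54, 0.96), (1.00, 1.00)]

auc = 0.0
for (x0, y0), (x1, y1) in zip(points, points[1:]):
    auc += (x1 - x0) * (y0 + y1) / 2   # trapezoid between neighbouring points

print(f"AUC ~ {auc:.2f}")   # ~0.80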
34. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
○ Camera in the forest - 1% sloth pictures - f-measure
○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
○ Search results for sloth and non-sloth - 1% sloth pictures
36. Precision and Recall at different points
● At point 1 - Recall = 2%, Precision = 100%
● At point 25 - Recall = 40%, Precision = 80%
● At point 50 - Recall = 74%, Precision = 74%
● At point 75 - Recall = 96%, Precision = 64%
● At point 100 - Recall = 100%, Precision = 50%
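From the recall/precision pairs on this slide one can approximate the area under the precision-recall curve, which is close to average precision; the starting point at recall 0 with 100% precision is an assumption:

# (recall, precision) pairs from slide 36, with an assumed start at (0, 1.0).
points = [(0.00, 1.00), (0.02, 1.00), (0.40, 0.80),
          (0.74, 0.74), (0.96, 0.64), (1.00, 0.50)]

area = 0.0
for (r0, p0), (r1, p1) in zip(points, points[1:]):
    area += (r1 - r0) * (p0 + p1) / 2   # trapezoid over each recall step

print(f"area under the PR curve ~ {area:.2f}")   # ~0.80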
44. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
○ Camera in the forest - 1% sloth pictures - f-measure
○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP (sketch after this list)
○ Create image search for sloths with different relevance
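The MAP and GMAP entries in the recap can be sketched directly: compute average precision per query (the mean of precision at the ranks of the relevant results), then take the arithmetic mean over queries for MAP or the geometric mean for GMAP. The relevance lists below are made up purely for illustration:

import math

def average_precision(relevances):
    # relevances: 1/0 per ranked result; mean of precision@k at each relevant hit.
    hits, precisions = 0, []
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else 0.0

queries = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1], [1, 1, 0, 0, 0]]   # one list per query
aps = [average_precision(q) for q in queries]

map_score = sum(aps) / len(aps)                                     # arithmetic mean
gmap_score = math.exp(sum(math.log(ap) for ap in aps) / len(aps))   # geometric mean
print(f"MAP = {map_score:.3f}, GMAP = {gmap_score:.3f}")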
50. Evaluating Image recognition algorithms
● Setup
○ Images with sloths and images without sloths
● Goals
○ Distinguish between a sloth and a non-sloth - 50% sloth pictures - accuracy
○ Distinguish between a sloth and a non-sloth - 1% sloth pictures - accuracy per class
○ Distinguish between a sloth and a non-sloth and ask a person if not sure - log loss
○ Camera in the forest - 1% sloth pictures - f-measure
○ Search results for sloth and non-sloth - 50% sloth pictures - AUC
○ Search results for sloth and non-sloth - 1% sloth pictures - MAP, GMAP
○ Create image search for sloths with different relevance - NDCG
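NDCG is the metric for the "different relevance" goal: the discounted cumulative gain of the returned ranking divided by the gain of the ideal ordering. A minimal sketch with made-up relevance grades:

import math

def dcg(relevances):
    # Discounted cumulative gain with a log2 position discount.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # DCG of the given ranking divided by the DCG of the ideal ranking.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Relevance grades (e.g. 3 = very relevant sloth picture, 0 = no sloth),
# illustrative only.
print(f"NDCG = {ndcg([3, 2, 0, 1, 2]):.3f}")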
70. The experiment
● 4 annotators
● 60 randomly generated search results (varied by order, percentage and cut-off line)
● The search results were evenly distributed with Majio scores between 1 and 100
● Annotators had to give each search result a score between 1 (perfect) and 4 (horrible)
● 2 of the search results appeared twice, but in a different context
● At least 3 out of 4 annotators had to agree on a ranking for it to be accepted
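The acceptance rule above ("at least 3 out of 4 annotators agree") can be written as a small counting sketch; the score matrix is made up, only the 1-4 scale and the 3-of-4 rule come from the slides:

from collections import Counter

def accepted(scores, needed=3):
    # True if at least `needed` annotators gave the same score (1-4 scale).
    return max(Counter(scores).values()) >= needed

# One row per ranking, one column per annotator (illustrative scores only).
rankings = [[1, 1, 1, 2], [2, 3, 4, 1], [4, 4, 4, 4]]
agreed = sum(accepted(r) for r in rankings)
print(f"agreement on {agreed} out of {len(rankings)} rankings")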
72. The results
● Inter-annotator agreement on 32 out of 60 rankings
● Two groups of annotators - strict ("no 1s left behind") and useful ("can you do your job with the amount of good candidates we have sent you")
● 2 out of 4 annotators gave different scores to the duplicated (trap) rankings
● On the rankings with inter-annotator agreement the scoring was consistent, so the limits for good and bad rankings acquired concrete values
73. Conclusions
● There are a lot of Information Retrieval metrics in the world (only a chosen few were shown here)
● None is perfect, but some are useful
● You can craft a metric yourself, but then you have to check how good a metric it is
● People don't generally agree on things in the beginning. Experiment until there is good enough agreement.