+ 
Automated Evaluation of Crowdsourced 
Annotations in the Cultural Heritage Domain 
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink 
VU University Amsterdam and TU Delft, The Netherlands 
1
+ 
Overview 
• Project Overview 
• Use case 
• Research Questions 
• Experiment 
• Results 
• Conclusion 
2
+ 
Context 
• COMMIT Project 
  • An ICT project in the Netherlands 
  • Subprojects: SEALINCMedia and Data2semantics 
• Socially Enriched Access to Linked Cultural Media (SEALINCMedia) 
  • Collaboration with cultural heritage institutions to enrich their collections and make them more accessible 
3
+ 
Use case 
• CH institutions have large collections which are poorly annotated (Rijksmuseum Amsterdam: over 1 million items) 
• Lack of sufficient resources: knowledge, cost, labor 
• Solution: crowdsourcing 
4
+ 
Crowdsourcing Annotation Tasks 
5 
[Diagram: an annotator from the crowd provides annotations (e.g. "Roses", "Garden", "Car") for an artefact (a painting or object); the provided annotations are then evaluated.]
+ 
Annotation evaluation 
• Manual evaluation is not feasible 
  • Institutions have large collections (Rijksmuseum: over 1 million items) 
  • The crowd provides a large number of annotations 
  • Evaluation costs time and money 
  • Museums have limited resources 
6
+ 
Need for automated algorithms 
• Thus there is a need for algorithms that automatically evaluate annotations with good accuracy 
7
+ 
Previous approach 
• Building user profiles and tracking user reputation based on semantic similarity 
• Tracking provenance information for users 
• Realized: a lot of data is provided, and meaningful information can be derived from it 
• Current approach: can we determine the quality of information based on features? 
8
+ 
Research questions 
• Can we evaluate annotations based on properties of the annotator and the annotation? 
• Can we predict the reputation of an annotator based on annotator properties? 
9 
[Example: the annotation "Roses" with its properties (no typo, noun, in WordNet) and the annotator's properties (age: 25, male, arts degree).]
+ 
Relevant features 
• Features of annotation 
  • Annotator 
  • Quality score 
  • Length 
  • Specificity 
• Features of annotator 
  • Age 
  • Gender 
  • Education 
  • Tagging experience 
10
+ 
Semantic Representation 
11 
Open Annotation model to represent the annotation; FOAF to represent annotator properties. 
[Diagram: the Annotation (oac:Annotation) has a Tag as body (oac:hasBody), the artefact as target (oac:hasTarget / oac:annotates) and a User as annotator (oac:annotator); a Reviewer attaches a Review with a review value to the annotation. The annotator is a foaf:Person with properties such as foaf:age and foaf:gender, and the tag has properties such as ex:length; these properties are used to estimate annotation quality. A minimal sketch follows this slide.]
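A minimal sketch, assuming Python with rdflib, of how one crowd annotation could be expressed with the Open Annotation vocabulary and FOAF as on this slide. The namespace URI, the example resources (ex:annotation1, ex:painting42, ex:user7) and the ex:length property are illustrative assumptions, not the authors' exact schema.

```python
# Illustrative sketch only: one annotation in the Open Annotation + FOAF model.
# Namespace URIs and example resources are assumptions.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, XSD

OAC = Namespace("http://www.openannotation.org/ns/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("oac", OAC)
g.bind("foaf", FOAF)
g.bind("ex", EX)

annotation, tag, target, user = EX.annotation1, EX.tag1, EX.painting42, EX.user7

# The annotation links a tag (body) to an artefact (target) and to its annotator
g.add((annotation, RDF.type, OAC.Annotation))
g.add((annotation, OAC.hasBody, tag))
g.add((annotation, OAC.hasTarget, target))
g.add((annotation, OAC.annotator, user))
g.add((tag, RDF.value, Literal("Roses")))
g.add((tag, EX.length, Literal(5, datatype=XSD.integer)))

# FOAF describes the annotator's properties (used to estimate quality)
g.add((user, RDF.type, FOAF.Person))
g.add((user, FOAF.age, Literal(25, datatype=XSD.integer)))
g.add((user, FOAF.gender, Literal("male")))

print(g.serialize(format="turtle"))
```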
+ 
Experiment 
Steve.museum dataset 
• We performed our evaluations on the Steve.museum dataset 
• An online dataset of images and annotations 
12 
Statistic | Value
Provided tags | 45,733
Unique tags | 13,949
Tags evaluated as useful | 39,931 (87%)
Tags evaluated as not-useful | 5,802 (13%)
Annotators / registered | 1,218 / 488 (40%)
+ 
Steve.museum annotation evaluation 
• The annotations in the Steve.museum project were evaluated into multiple categories; we classified each evaluation as either useful or not-useful 
13 
[Mapping: the fine-grained evaluation categories (usefulness-useful, judgement-positive, judgement-negative, problematic-foreign, problematic-typo, ...) were collapsed into the two classes useful and not-useful.]
+ 
Identify relevant annotation properties 
• Manually selected properties (F_man) 
  • is_adjective, is_english, in_wordnet 
• List of all possible properties (F_all) 
  • F_man + [created_day/hour, length, specificity, nrwords, frequency] 
• Apply a feature selection algorithm to F_all to choose properties (F_ml); an analogous sketch follows this slide 
  • Feature selection algorithm from the WEKA toolkit 
  • WEKA is a collection of machine learning algorithms for data mining tasks 
  • http://www.weka.net.nz/ 
14 
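The slides do not show the selection step itself; below is an analogous sketch using scikit-learn's SelectKBest in place of WEKA's attribute selection. The feature names mirror F_all and the random data is only a placeholder.

```python
# Analogous feature-selection sketch (scikit-learn instead of WEKA); the data
# below is random placeholder data, not the Steve.museum annotations.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

feature_names = ["is_adjective", "is_english", "in_wordnet", "created_hour",
                 "length", "specificity", "nr_words", "frequency"]
rng = np.random.default_rng(0)
X = rng.random((2000, len(feature_names)))   # placeholder feature matrix
y = rng.integers(0, 2, 2000)                 # 1 = useful, 0 = not-useful

selector = SelectKBest(score_func=mutual_info_classif, k=3).fit(X, y)
f_ml = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("F_ml candidate features:", f_ml)
```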
+ 
Build train and test data 
• Split the Steve dataset annotations into a training set and a test set (see the sketch after this slide) 
• The training set has the features and the goal (quality) 
• The test set has only the features 
• Fairness: the training set had 1,000 useful and 1,000 not-useful annotations 
15 
Train data:
Tag | Feature 1 | Feature 2 | ... | Feature n | Quality
Rose | f1 | f2 | ... | fn | Useful
House | f11 | f12 | ... | f1n | Not-useful

Test data:
Tag | Feature 1 | Feature 2 | ... | Feature n
Lily | f1 | f2 | ... | fn
Sky | f11 | f12 | ... | f1n
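A minimal sketch of the balanced split, assuming the annotations live in a pandas DataFrame with a quality column; the toy data, column names and the 2-per-class sample size (1,000 per class in the actual experiment) are stand-ins.

```python
# Toy stand-in for the Steve.museum annotations; column names are assumptions.
import pandas as pd

annotations = pd.DataFrame({
    "tag":        ["Rose", "House", "Lily", "Sky", "Car", "Garden"],
    "length":     [4, 5, 4, 3, 3, 6],
    "in_wordnet": [1, 1, 1, 1, 1, 1],
    "quality":    ["useful", "not-useful", "useful", "not-useful", "useful", "useful"],
})

n_per_class = 2   # 1,000 per class in the actual experiment
useful = annotations[annotations["quality"] == "useful"].sample(n_per_class, random_state=42)
not_useful = annotations[annotations["quality"] == "not-useful"].sample(n_per_class, random_state=42)

train = pd.concat([useful, not_useful]).sample(frac=1, random_state=42)   # shuffle
test = annotations.drop(train.index)

X_train, y_train = train.drop(columns=["tag", "quality"]), train["quality"]
X_test, y_test = test.drop(columns=["tag", "quality"]), test["quality"]    # gold labels kept for evaluation
```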
+ 
Machine learning 
• Apply machine learning techniques 
  • Learning: learn the relation between features and goal from the training set 
  • Prediction: apply what was learned from the training set to the test set 
• Used an SVM with the default PolyKernel in WEKA to predict the quality of annotations (see the sketch after this slide) 
  • Commonly used, fast and resistant against over-fitting 
16
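The authors used WEKA's SMO classifier with its default PolyKernel; the sketch below shows an analogous setup with scikit-learn's SVC and a degree-1 polynomial kernel (WEKA's default exponent). The random arrays are placeholders for the real feature matrices.

```python
# Analogous SVM sketch (scikit-learn instead of WEKA); placeholder data only.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X_train = rng.random((2000, 8))                                   # 1,000 useful + 1,000 not-useful annotations
y_train = np.array(["useful"] * 1000 + ["not-useful"] * 1000)
X_test = rng.random((500, 8))
y_test = rng.choice(["useful", "not-useful"], 500)                # gold evaluations, if available

clf = SVC(kernel="poly", degree=1)   # degree 1 mirrors WEKA's default PolyKernel exponent
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Per-class precision, recall and F-measure, as reported on the Results slide
print(classification_report(y_test, pred))
```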
+ 
Results 
• The method is good at predicting useful tags, but not at predicting not-useful tags 
17 
Feature set | Class | Recall | Precision | F-measure
F_man | Useful | 0.90 | 0.90 | 0.90
F_man | Not useful | 0.20 | 0.21 | 0.20
F_all | Useful | 0.75 | 0.91 | 0.83
F_all | Not useful | 0.42 | 0.18 | 0.25
F_ml | Useful | 0.20 | 0.98 | 0.34
F_ml | Not useful | 0.96 | 0.13 | 0.23
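For reference, the reported scores are standard precision, recall and F-measure (the same definitions appear in the speaker notes); a quick check on the F_man / Useful row:

$$P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}, \qquad F = \frac{2PR}{P+R} = \frac{2 \cdot 0.90 \cdot 0.90}{0.90 + 0.90} = 0.90$$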
+ 
Identify relevant features of annotator 
• Are these features helpful to 
  • determine annotation quality? 
  • predict annotator reputation? 
18 
[Example annotator properties: age 25, male, arts degree.]
+ 
Building annotator reputation 
• A probabilistic logic called Subjective Logic (see the sketch after this slide) 
• Annotator opinion = (belief, disbelief, uncertainty) 
• (p, n) = (positive, negative) evaluations 
• belief = p / (p + n + 2), uncertainty = 2 / (p + n + 2) 
• The expectation value E is the reputation: E = belief + apriori * uncertainty, with apriori = 0.5 
19
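A direct transcription of the formulas above into a small Python function; the example counts (39 positive, 6 negative evaluations) are made up.

```python
def reputation(p: int, n: int, apriori: float = 0.5) -> float:
    """Subjective Logic expectation value used as the annotator's reputation."""
    belief = p / (p + n + 2)        # belief = p / (p + n + 2)
    uncertainty = 2 / (p + n + 2)   # uncertainty = 2 / (p + n + 2)
    return belief + apriori * uncertainty

# Example: 39 useful and 6 not-useful evaluations
print(reputation(39, 6))   # ~0.85
```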
+ 
Identify relevant annotator properties 
• Manually identified properties 
  • F_man = [Community, age, education, experience, gender, tagging experience] 
• List of all properties 
  • F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, # annotations in WordNet] 
• Feature selection algorithm on F_all 
  • F_ml_a for the annotation 
  • F_ml_u for the annotator 
20
+ 
Results 
• Trained an SVM on these features to make predictions 
21 
Feature set | Class | Recall | Precision | F-measure
F_man | Useful | 0.29 | 0.90 | 0.44
F_man | Not useful | 0.73 | 0.11 | 0.20
F_all | Useful | 0.69 | 0.91 | 0.78
F_all | Not useful | 0.43 | 0.15 | 0.22
F_ml_a | Useful | 0.55 | 0.91 | 0.68
F_ml_a | Not useful | 0.53 | 0.13 | 0.21
+ 
Results 
• Used regression to predict reputation values based on features of registered annotators (see the sketch after this slide) 
• Since annotator reputation is highly skewed (90% of values > 0.7), we could not predict reputation successfully 
22 
Feature set | Correlation | RMS error | Mean abs. error | Rel. abs. error
F_man | -0.02 | 0.15 | 0.10 | 97.8%
F_all | 0.22 | 0.13 | 0.09 | 95.1%
F_ml_u | 0.29 | 0.13 | 0.09 | 90.4%
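A minimal sketch of this regression step, assuming Python with scikit-learn (the authors used WEKA); the random annotator features and the skewed reputation scores below are placeholders that only mimic the 488 registered annotators.

```python
# Placeholder regression sketch: predict annotator reputation from annotator features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((488, 10))                                         # 488 registered annotators, 10 features
rep = np.clip(0.85 + 0.10 * rng.standard_normal(488), 0.0, 1.0)   # skewed: most values above 0.7

X_train, X_test, rep_train, rep_test = train_test_split(X, rep, random_state=0)
reg = LinearRegression().fit(X_train, rep_train)
rep_pred = reg.predict(X_test)

print("correlation :", np.corrcoef(rep_test, rep_pred)[0, 1])
print("RMS error   :", mean_squared_error(rep_test, rep_pred) ** 0.5)
print("mean abs err:", mean_absolute_error(rep_test, rep_pred))
```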
+ 
Evaluation 
• Possible reasons why the method was not successful at predicting not-useful annotations: 
  • They are a minority (13% of the whole dataset) 
  • A more in-depth analysis of features is needed to determine not-useful annotations 
  • Requires study on different datasets 
23
+ 
Relevance 
• Our experiments show that the features of the annotator and of the annotation correlate with the quality of the annotations 
• With a small set of features we were able to predict 98% of the useful and 13% of the not-useful annotations correctly 
• Helps to identify which features are relevant to certain tasks 
24
+ 
Conclusions 
• Machine learning techniques help to predict useful evaluations, but not not-useful ones 
• Devised a model 
  • using an SVM to predict annotation evaluations from annotation and annotator properties 
  • using regression to predict annotator reputation 
25
+ 
Future work 
• Need to extract more in-depth information from both the annotation and the annotator 
• Need to build the reputation of the annotator per topic 
• Apply the model to different use cases 
26
+ Thank you 
a.nottamkandath@vu.nl 
27


Editor's Notes

  • #6: An annotation describes a certain aspect of the artefact. An annotator provides the annotation. The annotation process is crowdsourced on the Web. The provided annotations need to be evaluated.
  • #14: Quality is subjective
  • #15: More about WEKA
  • #16: Add example image here
  • #18: High precision, low recall → conservative: whatever it identifies is correct. Low precision, high recall → liberal: many false positives. Precision = TP/(TP+FP), Recall = TP/(TP+FN).
  • #20: The expected value is the long-run average value over repetitions of the experiment it represents.
  • #22: High precision, low recall → conservative: whatever it identifies is correct. Low precision, high recall → liberal: many false positives. Precision = TP/(TP+FP), Recall = TP/(TP+FN), F-measure = 2·P·R/(P+R) (harmonic mean).
  • #23: RMS error: the sample standard deviation of the differences between predicted and actual values. There are not many bad examples to learn from, and reputation values are very high (around 0.7).