This document summarizes research on automatically evaluating crowdsourced annotations in cultural heritage collections. The researchers explored machine learning techniques to predict the quality of annotations from features of the annotation and the annotator. Their results showed that the techniques could predict useful annotations with 98% precision, but reached only 13% precision for not-useful annotations. The researchers believe more in-depth features are needed to better predict lower-quality annotations.
1. +
Automated Evaluation of Crowdsourced
Annotations in the Cultural Heritage Domain
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink
VU University Amsterdam and TU Delft, The Netherlands
2. +
Overview
Project Overview
Use case
Research Questions
Experiment
Results
Conclusion
3. +
Context
COMMIT Project
ICT project in the Netherlands
Subprojects: SEALINCMedia and Data2semantics
Socially Enriched Access to Linked Cultural Media
(SEALINCMedia)
Collaboration with cultural heritage institutions to enrich their
collections and make them more accessible
4. +
Use case
CH institutions have large collections which are poorly
annotated (Rijksmuseum Amsterdam: over 1 million items)
Lack of sufficient resources: knowledge, cost, labor
Solution: crowdsourcing
5. +
Crowdsourcing Annotation Tasks
[Diagram: an annotator from the crowd provides annotations (e.g. "Roses", "Garden", "Car") for an artefact (a painting or object); the provided annotations are then evaluated.]
6. +
Annotation evaluation
Manual evaluation is not feasible:
Institutions have large collections (Rijksmuseum: over 1 million items)
The crowd provides a large number of annotations
Evaluation costs time and money
Museums have limited resources
7. +
Need for automated algorithms
Thus there is a need to develop algorithms to automatically evaluate
annotations with good accuracy
8. +
Previous approach
Building user profiles and tracking user reputation based on
semantic similarity
Tracking provenance information for users
Realized: a lot of data is provided, and meaningful information
can be derived from it
Current approach: can we determine the quality of information
based on features?
9. +
Research questions
Can we evaluate annotations based on properties of the annotator
and the annotation?
Can we predict the reputation of an annotator based on annotator
properties?
[Example: the tag "Roses", with annotator properties (age: 25, male, arts degree) and annotation properties (no typo, noun, in WordNet)]
10. +
Relevant features
Features of the annotation: annotator, quality score, length, specificity
Features of the annotator: age, gender, education, tagging experience
11. +
Semantic Representation
Open Annotation model to represent annotations
FOAF to represent annotator properties
[Diagram: the Annotation links a Tag (oac:hasBody) to its Target artefact (oac:hasTarget); the annotating User, a foaf:Person with properties such as foaf:age and foaf:gender, is linked via oac:annotator; a Reviewer attaches a Review with a review value; tag properties such as ex:length are used to estimate quality.]
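A minimal sketch of this representation with Python's rdflib, assuming the oac: namespace URI and the example resources (annotation, tag, artwork, user), none of which are given on the slide:

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import FOAF, XSD

# Namespace URIs are assumptions; the slide only shows the oac:/ex: prefixes.
OAC = Namespace("http://www.openannotation.org/ns/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("oac", OAC)
g.bind("foaf", FOAF)

annotation = EX["annotation/1"]
tag = EX["tag/1"]
target = EX["artwork/42"]
user = EX["user/alice"]

# The annotation links a tag (body) to an artefact (target),
# following the Open Annotation structure on the slide.
g.add((annotation, RDF.type, OAC.Annotation))
g.add((annotation, OAC.hasBody, tag))
g.add((annotation, OAC.hasTarget, target))
g.add((annotation, OAC.annotator, user))

# Tag-level properties used as features (e.g. length).
g.add((tag, RDF.type, OAC.Tag))
g.add((tag, EX.length, Literal(5, datatype=XSD.integer)))

# Annotator properties represented with FOAF.
g.add((user, RDF.type, FOAF.Person))
g.add((user, FOAF.age, Literal(25, datatype=XSD.integer)))
g.add((user, FOAF.gender, Literal("male")))

print(g.serialize(format="turtle"))
```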
12. +
Experiment
Steve.museum dataset
We performed our evaluations on the Steve.Museum dataset,
an online dataset of images and annotations

Statistic                      Value
Provided tags                  45,733
Unique tags                    13,949
Tags evaluated as useful       39,931 (87%)
Tags evaluated as not-useful   5,802 (13%)
Annotators / registered        1,218 / 488 (40%)
13. +
Steve.museum annotation evaluation
The annotations in the Steve.museum project were evaluated into
multiple categories; we classified these evaluations as either useful or not-useful
Usefulness-useful
Judgement-positive
Judgement-negative
Problematic-foreign
Problematic-typo
Usefulness-not useful
14. +
Identify relevant annotation properties
Manually select properties (F_man)
Is_adjective, is_english, in_wordnet
List of all possible properties (F_all)
F_man + [created_day/hour, length, specificity, nrwords, frequency]
Apply feature selection algorithm on F_all to choose properties
(F_ml)
Feature selection algorithm from WEKA toolkit
WEKA is a collection of machine learning algorithms for data mining
tasks
http://www.weka.net.nz/
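The feature selection step relied on the WEKA toolkit; as an illustration only, here is a roughly analogous sketch with scikit-learn, where the column names follow the F_all properties listed above and the chi-squared scoring function is an assumed stand-in for WEKA's attribute selection criterion:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Columns of X correspond to the F_all annotation properties from the slide.
FEATURES = ["is_adjective", "is_english", "in_wordnet",
            "created_day", "created_hour", "length",
            "specificity", "nrwords", "frequency"]

def select_features(X, y, k=3):
    """Pick the k features most associated with the useful/not-useful label.

    WEKA's attribute selection was used in the original work; chi2 scoring
    here is just one common choice, not necessarily the same criterion.
    """
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(X, y)
    mask = selector.get_support()
    return [name for name, keep in zip(FEATURES, mask) if keep]

# Example with random non-negative data (chi2 requires non-negative features).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, len(FEATURES)))
y = rng.integers(0, 2, size=200)          # 1 = useful, 0 = not-useful
print(select_features(X, y))
```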
15. +
Build train and test data
Split the Steve dataset annotations into a train set and a test set
The train set has the features and the goal (quality)
The test set has only the features
Fairness: the train set had 1000 useful and 1000 not-useful annotations
Train data:
Tag     Feature 1   Feature 2   ...   Feature n   Quality
Rose    f1          f2          ...   fn          Useful
House   f11         f12         ...   f1n         Not-useful

Test data:
Tag     Feature 1   Feature 2   ...   Feature n
Lily    f1          f2          ...   fn
Sky     f11         f12         ...   f1n
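A minimal sketch of this split with pandas, assuming the annotations sit in a DataFrame with one row per tag, a quality column holding the evaluation, and one column per feature (all names are illustrative):

```python
import pandas as pd

def build_train_test(annotations: pd.DataFrame, n_per_class: int = 1000, seed: int = 0):
    """Sample a balanced train set (n_per_class useful + n_per_class not-useful);
    the remaining annotations form the test set, which keeps only the features."""
    useful = annotations[annotations["quality"] == "useful"].sample(n_per_class, random_state=seed)
    not_useful = annotations[annotations["quality"] == "not-useful"].sample(n_per_class, random_state=seed)
    train = pd.concat([useful, not_useful]).sample(frac=1, random_state=seed)  # shuffle

    test = annotations.drop(train.index)
    test_features = test.drop(columns=["quality"])   # test set: features only
    return train, test_features, test["quality"]

# Usage (assuming `df` holds the Steve.museum tags with their features and evaluations):
# train, X_test, y_test = build_train_test(df)
```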
16. +
Machine learning
Apply machine learning techniques
Learning: learn the relation between the features and the goal from the training set
Prediction: apply what was learned from the training set to the test set
Used SVM with the default PolyKernel in WEKA to predict the quality of
annotations
Commonly used, fast and resistant to over-fitting
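The slides specify SVM with WEKA's default PolyKernel; the sketch below uses scikit-learn's SVC with a degree-1 polynomial kernel as a rough equivalent (the scaler and kernel settings are assumptions, not the authors' exact configuration), reusing the objects from the sketches above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Fit an SVM with a polynomial kernel and report per-class
    precision, recall and F-measure, as in the results tables."""
    model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=1))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred, digits=2))
    return model

# Usage with the balanced train set and feature-only test set built earlier
# (FEATURES is the assumed column list from the feature-selection sketch):
# model = train_and_evaluate(train[FEATURES], train["quality"], X_test[FEATURES], y_test)
```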
17. +
Results
The method is good at predicting useful tags, but not at predicting not-useful
tags
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.90     0.90        0.90
              Not useful   0.20     0.21        0.20
F_all         Useful       0.75     0.91        0.83
              Not useful   0.42     0.18        0.25
F_ml          Useful       0.20     0.98        0.34
              Not useful   0.96     0.13        0.23
18. +
Identify relevant features of annotator
Are these features helpful to
Determine annotation quality?
Predict annotator reputation?
[Example annotator profile: age 25, male, arts degree]
19. +
Building annotator reputation
Probabilistic logic called Subjective Logic
Annotator opinion = (belief, disbelief, uncertainty)
(p, n) = (positive, negative) evaluations
belief = p / (p + n + 2), disbelief = n / (p + n + 2), uncertainty = 2 / (p + n + 2)
The expectation value E is used as the reputation:
E = belief + apriori * uncertainty, with apriori = 0.5
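The computation above as a small Python function; the formulas are taken directly from the slide, and the example evaluation counts are made up:

```python
def reputation(p: int, n: int, apriori: float = 0.5) -> float:
    """Subjective-logic expectation value used as annotator reputation.

    p, n: number of positive / negative evaluations of the annotator's tags.
    """
    belief = p / (p + n + 2)
    disbelief = n / (p + n + 2)
    uncertainty = 2 / (p + n + 2)
    assert abs(belief + disbelief + uncertainty - 1.0) < 1e-9
    return belief + apriori * uncertainty

# An annotator with 8 tags judged useful and 2 judged not-useful:
print(reputation(8, 2))   # 8/12 + 0.5 * 2/12 = 0.75
```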
20. +
Identify relevant annotator properties
Manually identified properties
F_man = [Community, age, education, experience, gender, tagging
experience]
List of all properties
F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, #
annotations in wordnet]
Feature selection algorithm on F_all
F_ml_a for annotation
F_ml_u for annotator
21. +
Results
Trained an SVM on these features to make predictions
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.29     0.90        0.44
              Not useful   0.73     0.11        0.20
F_all         Useful       0.69     0.91        0.78
              Not useful   0.43     0.15        0.22
F_ml_a        Useful       0.55     0.91        0.68
              Not useful   0.53     0.13        0.21
22. +
Results
Used regression to predict reputation values based on the
features of registered annotators
Since annotator reputation is highly skewed (90% of values > 0.7), we
could not predict reputation successfully
Feature set   Correlation   RMS error   Mean abs error   Rel abs error
F_man         -0.02         0.15        0.10             97.8%
F_all          0.22         0.13        0.09             95.1%
F_ml_u         0.29         0.13        0.09             90.4%
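A sketch of how the reported metrics (correlation, RMS error, mean absolute error, relative absolute error) can be computed; the slides only say "regression" was used, so plain linear regression here is an assumption, not the authors' method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def evaluate_regression(X_train, y_train, X_test, y_test):
    """Fit a regressor on annotator features and report the same metrics
    as the results table. LinearRegression is an assumed stand-in."""
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    corr = np.corrcoef(y_test, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    mae = np.mean(np.abs(y_test - y_pred))
    # Relative absolute error: error relative to always predicting the mean.
    rae = np.sum(np.abs(y_test - y_pred)) / np.sum(np.abs(y_test - np.mean(y_test)))

    return {"corr": corr, "rms_error": rmse, "mean_abs_error": mae,
            "rel_abs_error": f"{100 * rae:.1f}%"}
```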
23. +
Evaluation
Possible reasons why the method is not successful at
predicting not-useful annotations:
They are a minority (13% of the whole dataset)
More in-depth analysis of features is needed to identify not-useful
annotations
Requires study on different datasets
24. +
Relevance
Our experiments help to show that there is a correlation
between features of the annotator and the annotation and the quality
of the annotations
With a small set of features, 98% of the annotations predicted as useful
and 13% of those predicted as not-useful were correct (precision).
Helps to identify which features are relevant to certain tasks
25. +
Conclusions
Machine learning techniques help to predict useful
annotations but not not-useful ones
Devised a model:
using SVM to predict the annotation evaluation and annotator
reputation
using regression to predict annotator reputation
26. +
Future work
Need to extract more in-depth information from both
annotation and annotator
Need to build the reputation of the annotator per topic
Apply the model to different use cases
#6: An annotation describes a certain aspect of the artefact
An annotator provides the annotation
The annotation process is crowdsourced on the Web
The provided annotations need to be evaluated
#18: High precision, low recall: conservative; whatever it identifies is correct, but many items are missed (false negatives)
Low precision, high recall: liberal; most items are identified, but with many false positives
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
#20: The expected value is the long-run average value over repetitions of the experiment it represents
#22: High precision, low recall: conservative; whatever it identifies is correct, but many items are missed (false negatives)
Low precision, high recall: liberal; most items are identified, but with many false positives
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2pr / (p + r), the harmonic mean of precision and recall
#23: RMS error: the sample standard deviation of the differences between predicted and actual values
There are not many bad examples to learn from; reputation values are very high (around 0.7 or higher)