This document summarizes research on automatically evaluating crowdsourced annotations in cultural heritage collections. The researchers explored machine learning techniques to predict the quality of annotations from features of the annotation and the annotator. Their results showed that the techniques could predict useful annotations with 98% precision, but reached only 13% precision for not-useful annotations. The researchers believe more in-depth features are needed to better predict lower-quality annotations.
1. +
Automated Evaluation of Crowdsourced
Annotations in the Cultural Heritage Domain
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink
VU University Amsterdam and TU Delft, The Netherlands
2. +
Overview
Project Overview
Use case
Research Questions
Experiment
Results
Conclusion
3. +
Context
COMMIT Project
ICT project in the Netherlands
Subprojects: SEALINCMedia and Data2semantics
Socially Enriched Access to Linked Cultural Media
(SEALINCMedia)
Collaboration with cultural heritage institutions to enrich their
collections and make them more accessible
4. +
Use case
CH institutions have large collections which are poorly
annotated (Rijksmuseum Amsterdam: over 1 million items)
Lack of sufficient resources: knowledge, cost, labor
Solution: crowdsourcing
5. +
Crowdsourcing Annotation Tasks
[Diagram: an annotator from the crowd provides annotations (e.g. "Roses", "Garden", "Car") for an artefact (a painting or object); the provided annotations are then evaluated.]
6. +
Annotation evaluation
Manual evaluation is not feasible:
Institutions have large collections (Rijksmuseum: over 1 million items)
The crowd provides a large number of annotations
Evaluation costs time and money
Museums have limited resources
7. +
Need for automated algorithms
Thus there is a need to develop algorithms to automatically evaluate
annotations with good accuracy
8. +
Previous approach
Building user profiles and tracking user reputation based on
semantic similarity
Tracking provenance information for users
Realized: a lot of data is provided, and meaningful information
can be derived from it
Current approach: can we determine the quality of information
based on features?
9. +
Research questions
Can we evaluate annotations based on properties of the annotator
and the annotation?
Can we predict the reputation of an annotator based on annotator
properties?
[Example: the tag "Roses", with annotator properties (age: 25, male, arts degree) and annotation properties (no typo, noun, in WordNet)]
10. +
Relevant features
Features of the annotation: annotator, quality score, length, specificity
Features of the annotator: age, gender, education, tagging experience
11. +
Semantic Representation
Open Annotation model to represent annotations
FOAF to represent annotator properties
[Diagram: the Annotation links a Tag (oac:hasBody) to its Target artefact (oac:hasTarget); the annotating User, a foaf:Person with properties such as foaf:age and foaf:gender, is linked via oac:annotator; a Reviewer attaches a Review with a review value; tag properties such as ex:length are used to estimate quality.]
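A minimal sketch of this representation with Python's rdflib, assuming the oac: namespace URI and the example resources (annotation, tag, artwork, user), none of which are given on the slide:

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import FOAF, XSD

# Namespace URIs are assumptions; the slide only shows the oac:/ex: prefixes.
OAC = Namespace("http://www.openannotation.org/ns/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("oac", OAC)
g.bind("foaf", FOAF)

annotation = EX["annotation/1"]
tag = EX["tag/1"]
target = EX["artwork/42"]
user = EX["user/alice"]

# The annotation links a tag (body) to an artefact (target),
# following the Open Annotation structure on the slide.
g.add((annotation, RDF.type, OAC.Annotation))
g.add((annotation, OAC.hasBody, tag))
g.add((annotation, OAC.hasTarget, target))
g.add((annotation, OAC.annotator, user))

# Tag-level properties used as features (e.g. length).
g.add((tag, RDF.type, OAC.Tag))
g.add((tag, EX.length, Literal(5, datatype=XSD.integer)))

# Annotator properties represented with FOAF.
g.add((user, RDF.type, FOAF.Person))
g.add((user, FOAF.age, Literal(25, datatype=XSD.integer)))
g.add((user, FOAF.gender, Literal("male")))

print(g.serialize(format="turtle"))
```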
12. +
Experiment
Steve.museum dataset
We performed our evaluations on the Steve.Museum dataset,
an online dataset of images and annotations

Statistic                      Value
Provided tags                  45,733
Unique tags                    13,949
Tags evaluated as useful       39,931 (87%)
Tags evaluated as not-useful   5,802 (13%)
Annotators / registered        1,218 / 488 (40%)
13. +
Steve.museum annotation evaluation
The annotations in the Steve.museum project were evaluated into
multiple categories; we classified these evaluations as either useful or not-useful
Usefulness-useful
Judgement-positive
Judgement-negative
Problematic-foreign
Problematic-typo
Usefulness-not useful
14. +
Identify relevant annotation properties
Manually select properties (F_man)
Is_adjective, is_english, in_wordnet
List of all possible properties (F_all)
F_man + [created_day/hour, length, specificity, nrwords, frequency]
Apply feature selection algorithm on F_all to choose properties
(F_ml)
Feature selection algorithm from WEKA toolkit
WEKA is a collection of machine learning algorithms for data mining
tasks
http://www.weka.net.nz/
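The feature selection step relied on the WEKA toolkit; as an illustration only, here is a roughly analogous sketch with scikit-learn, where the column names follow the F_all properties listed above and the chi-squared scoring function is an assumed stand-in for WEKA's attribute selection criterion:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Columns of X correspond to the F_all annotation properties from the slide.
FEATURES = ["is_adjective", "is_english", "in_wordnet",
            "created_day", "created_hour", "length",
            "specificity", "nrwords", "frequency"]

def select_features(X, y, k=3):
    """Pick the k features most associated with the useful/not-useful label.

    WEKA's attribute selection was used in the original work; chi2 scoring
    here is just one common choice, not necessarily the same criterion.
    """
    selector = SelectKBest(score_func=chi2, k=k)
    selector.fit(X, y)
    mask = selector.get_support()
    return [name for name, keep in zip(FEATURES, mask) if keep]

# Example with random non-negative data (chi2 requires non-negative features).
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, len(FEATURES)))
y = rng.integers(0, 2, size=200)          # 1 = useful, 0 = not-useful
print(select_features(X, y))
```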
15. +
Build train and test data
Split the Steve dataset annotations into a train set and a test set
The train set has the features and the goal (quality)
The test set has only the features
Fairness: the train set had 1000 useful and 1000 not-useful annotations
Train data:
Tag     Feature 1   Feature 2   ...   Feature n   Quality
Rose    f1          f2          ...   fn          Useful
House   f11         f12         ...   f1n         Not-useful

Test data:
Tag     Feature 1   Feature 2   ...   Feature n
Lily    f1          f2          ...   fn
Sky     f11         f12         ...   f1n
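A minimal sketch of this split with pandas, assuming the annotations sit in a DataFrame with one row per tag, a quality column holding the evaluation, and one column per feature (all names are illustrative):

```python
import pandas as pd

def build_train_test(annotations: pd.DataFrame, n_per_class: int = 1000, seed: int = 0):
    """Sample a balanced train set (n_per_class useful + n_per_class not-useful);
    the remaining annotations form the test set, which keeps only the features."""
    useful = annotations[annotations["quality"] == "useful"].sample(n_per_class, random_state=seed)
    not_useful = annotations[annotations["quality"] == "not-useful"].sample(n_per_class, random_state=seed)
    train = pd.concat([useful, not_useful]).sample(frac=1, random_state=seed)  # shuffle

    test = annotations.drop(train.index)
    test_features = test.drop(columns=["quality"])   # test set: features only
    return train, test_features, test["quality"]

# Usage (assuming `df` holds the Steve.museum tags with their features and evaluations):
# train, X_test, y_test = build_train_test(df)
```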
16. +
Machine learning
Apply machine learning techniques
Learning: learn the relation between the features and the goal from the training set
Prediction: apply what was learned from the training set to the test set
Used SVM with the default PolyKernel in WEKA to predict the quality of
annotations
Commonly used, fast and resistant to over-fitting
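The slides specify SVM with WEKA's default PolyKernel; the sketch below uses scikit-learn's SVC with a degree-1 polynomial kernel as a rough equivalent (the scaler and kernel settings are assumptions, not the authors' exact configuration), reusing the objects from the sketches above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """Fit an SVM with a polynomial kernel and report per-class
    precision, recall and F-measure, as in the results tables."""
    model = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=1))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred, digits=2))
    return model

# Usage with the balanced train set and feature-only test set built earlier
# (FEATURES is the assumed column list from the feature-selection sketch):
# model = train_and_evaluate(train[FEATURES], train["quality"], X_test[FEATURES], y_test)
```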
17. +
Results
The method is good at predicting useful tags, but not at predicting not-useful
tags
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.90     0.90        0.90
              Not useful   0.20     0.21        0.20
F_all         Useful       0.75     0.91        0.83
              Not useful   0.42     0.18        0.25
F_ml          Useful       0.20     0.98        0.34
              Not useful   0.96     0.13        0.23
18. +
Identify relevant features of annotator
Are these features helpful to
Determine annotation quality?
Predict annotator reputation?
[Example annotator profile: age 25, male, arts degree]
19. +
Building annotator reputation
Probabilistic logic called Subjective Logic
Annotator opinion = (belief, disbelief, uncertainty)
(p, n) = (positive, negative) evaluations
belief = p / (p + n + 2), disbelief = n / (p + n + 2), uncertainty = 2 / (p + n + 2)
The expectation value E is used as the reputation:
E = belief + apriori * uncertainty, with apriori = 0.5
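The computation above as a small Python function; the formulas are taken directly from the slide, and the example evaluation counts are made up:

```python
def reputation(p: int, n: int, apriori: float = 0.5) -> float:
    """Subjective-logic expectation value used as annotator reputation.

    p, n: number of positive / negative evaluations of the annotator's tags.
    """
    belief = p / (p + n + 2)
    disbelief = n / (p + n + 2)
    uncertainty = 2 / (p + n + 2)
    assert abs(belief + disbelief + uncertainty - 1.0) < 1e-9
    return belief + apriori * uncertainty

# An annotator with 8 tags judged useful and 2 judged not-useful:
print(reputation(8, 2))   # 8/12 + 0.5 * 2/12 = 0.75
```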
20. +
Identify relevant annotator properties
Manually identified properties
F_man = [Community, age, education, experience, gender, tagging
experience]
List of all properties
F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, #
annotations in wordnet]
Feature selection algorithm on F_all
F_ml_a for annotation
F_ml_u for annotator
21. +
Results
Trained an SVM on these features to make predictions
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.29     0.90        0.44
              Not useful   0.73     0.11        0.20
F_all         Useful       0.69     0.91        0.78
              Not useful   0.43     0.15        0.22
F_ml_a        Useful       0.55     0.91        0.68
              Not useful   0.53     0.13        0.21
22. +
Results
Used regression to predict reputation values based on the
features of registered annotators
Since annotator reputation is highly skewed (90% of values > 0.7), we
could not predict reputation successfully
Feature set   Correlation   RMS error   Mean abs error   Rel abs error
F_man         -0.02         0.15        0.10             97.8%
F_all          0.22         0.13        0.09             95.1%
F_ml_u         0.29         0.13        0.09             90.4%
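A sketch of how the reported metrics (correlation, RMS error, mean absolute error, relative absolute error) can be computed; the slides only say "regression" was used, so plain linear regression here is an assumption, not the authors' method:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def evaluate_regression(X_train, y_train, X_test, y_test):
    """Fit a regressor on annotator features and report the same metrics
    as the results table. LinearRegression is an assumed stand-in."""
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)

    corr = np.corrcoef(y_test, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    mae = np.mean(np.abs(y_test - y_pred))
    # Relative absolute error: error relative to always predicting the mean.
    rae = np.sum(np.abs(y_test - y_pred)) / np.sum(np.abs(y_test - np.mean(y_test)))

    return {"corr": corr, "rms_error": rmse, "mean_abs_error": mae,
            "rel_abs_error": f"{100 * rae:.1f}%"}
```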
23. +
Evaluation
Possible reasons why the method is not successful at
predicting not-useful annotations:
They are a minority (13% of the whole dataset)
More in-depth analysis of features is needed to identify not-useful
annotations
Requires study on different datasets
24. +
Relevance
Our experiments help to show that there is a correlation
between features of the annotator and the annotation and the quality
of the annotations
With a small set of features, 98% of the annotations predicted as useful
and 13% of those predicted as not-useful were correct (precision).
Helps to identify which features are relevant to certain tasks
25. +
Conclusions
Machine learning techniques help to predict useful
annotations but not not-useful ones
Devised a model:
using SVM to predict the annotation evaluation and annotator
reputation
using regression to predict annotator reputation
26. +
Future work
Need to extract more in-depth information from both
annotation and annotator
Need to build the reputation of the annotator per topic
Apply the model to different use cases
#6: An annotation describes a certain aspect of the artefact
An annotator provides the annotation
The annotation process is crowdsourced on the Web
The provided annotations need to be evaluated
#18: High precision, low recall: conservative; whatever it identifies is correct, but many items are missed (false negatives)
Low precision, high recall: liberal; most items are identified, but with many false positives
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
#20: The expected value is the long-run average value over repetitions of the experiment it represents
#22: High precision, low recall: conservative; whatever it identifies is correct, but many items are missed (false negatives)
Low precision, high recall: liberal; most items are identified, but with many false positives
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2pr / (p + r), the harmonic mean of precision and recall
#23: RMS error: the sample standard deviation of the differences between predicted and actual values
There are not many bad examples to learn from; reputation values are very high (around 0.7 or higher)