This document summarizes a study on quality control mechanisms for crowdsourcing at FamilySearch Indexing. It finds that experienced workers are faster and more accurate than novices. Peer review is nearly as effective as arbitration at maintaining quality, while being more efficient. For some fields, context or language skills are needed. Overall, peer review with expert routing shows promise as an effective quality control method.
1. QUALITY CONTROL MECHANISMS FOR
CROWDSOURCING: PEER REVIEW, ARBITRATION,
& EXPERTISE AT FAMILYSEARCH INDEXING
CSCW, SAN ANTONIO, TX
FEB 26, 2013
Derek Hansen, Patrick Schone, Douglas
Corey, Matthew Reid, & Jake Gehring
5. FSI in Broader Landscape
Crowdsourcing Project
Aggregates discrete tasks completed by volunteers who replace professionals (Howe, 2006; Doan, et al., 2011)
Human Computation System
Humans use a computational system to work on a problem that may someday be solvable by computers (Quinn & Bederson, 2011)
Lightweight Peer Production
Largely anonymous contributors independently completing discrete, repetitive tasks provided by authorities (Haythornthwaite, 2009)
7. Quality Control Mechanisms
9 Types of quality control for human computation systems (Quinn & Bederson, 2011):
Redundancy
Multi-level review
Find-Fix-Verify pattern (Bernstein, et al., 2010)
Weight proposed solutions by reputation of contributor (McCann, et al., 2003)
Peer or expert oversight (Cosley, et al., 2005)
Tournament selection approach (Sun, et al., 2011)
9. Peer review process (A-R-RARB)
[Diagram: proposed A-R-RARB workflow. A's transcription arrives already filled in; a peer reviewer (R) edits it; an optional arbitration step (RARB) reviews R's changes.]
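A rough sketch of this proposed workflow, expressed as code; the function names and the Record type are ours for illustration, not part of the FSI system:

```python
from typing import Callable, Dict

Record = Dict[str, str]  # field name -> transcribed value (illustrative type)

def a_r_rarb(image_id: str,
             transcribe: Callable[[str], Record],
             review: Callable[[str, Record], Record],
             arbitrate: Callable[[str, Record, Record], Record],
             require_arbitration: bool = False) -> Record:
    """Minimal sketch of the proposed A-R-RARB workflow for one image."""
    record_a = transcribe(image_id)        # A indexes the image from scratch
    record_r = review(image_id, record_a)  # R sees A's values already filled in and edits them
    if require_arbitration:
        # Optional RARB step: an arbitrator reviews R's changes; the study's
        # findings suggest A-R alone may often be enough.
        return arbitrate(image_id, record_a, record_r)
    return record_r
```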
10. Two Act Play
Act I: Experience. What is the role of experience on quality and efficiency? Historical data analysis using full US and Canadian Census records from 1920 and earlier.
Act II: Quality Control. Is peer review or arbitration better in terms of quality and efficiency? Field experiment using 2,000 images from the 1930 US Census data and the corresponding truth set.
11. Act I: Experience
Quality is estimated based on A-B agreement (no truth set); see the sketch below
Efficiency calculated using keystroke-logging data with idle time and outliers removed
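As a rough illustration of the agreement-based quality metric, here is a minimal sketch; the normalization (case-folding, whitespace stripping) is our assumption, not FSI's actual matching rule:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def ab_agreement(pairs: Iterable[Tuple[Dict[str, str], Dict[str, str]]]) -> Dict[str, float]:
    """Per-field A-B agreement rate over (record_a, record_b) pairs.

    Illustrative sketch: the normalization below is an assumption,
    not FSI's actual matching rule.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for record_a, record_b in pairs:
        for field in record_a.keys() & record_b.keys():
            total[field] += 1
            if record_a[field].strip().lower() == record_b[field].strip().lower():
                agree[field] += 1
    return {field: agree[field] / total[field] for field in total}

# Example: two transcribers agree on gender but not surname
pairs = [({"surname": "Smith", "gender": "M"}, {"surname": "Smyth", "gender": "M"})]
print(ab_agreement(pairs))  # {'surname': 0.0, 'gender': 1.0}
```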
13. A-B agreement by language (1871 Canadian Census)
English language: Given Name 79.8%, Surname 66.4%
French language: Given Name 62.7%, Surname 48.8%
14. A-B agreement by experience
Birth Place: All U.S. Censuses
[Chart: agreement by experience level of transcribers A and B]
15. A-B agreement by experience
Given Name: All U.S. Censuses
[Chart: agreement by experience level of transcribers A and B]
16. A-B agreement by experience
Surname: All U.S. Censuses
[Chart: agreement by experience level of transcribers A and B]
17. A-B agreement by experience
Gender: All U.S. Censuses
[Chart: agreement by experience level of transcribers A and B]
18. A-B agreement by experience
Birthplace: English-speaking Canadian Census
[Chart: agreement by experience level of transcribers A and B]
20. Summary & Implications of Act I
Experienced workers are faster and more accurate, and these gains continue even at high experience levels
- Focus on retention
- Encourage both novices & experts to do more
- Develop interventions to speed up experience gains (e.g., send users common mistakes made by people at their experience level)
21. Summary & Implications of Act I
Contextual knowledge (e.g., Canadian placenames) and specialized skills (e.g., French language fluency) are needed for some tasks
- Recruit people with existing knowledge & skills
- Provide contextual information when possible (e.g., Canadian placename prompts)
- Don't remove context (e.g., captcha)
- Allow users to specialize?
22. Act II: Quality Control
A-B-ARB data from original transcribers (Feb 2011)
A-R-RARB data includes original A data and newly collected R and RARB data from people new to this method (Jan-Feb 2012)
Truth Set data from a company, with independent audit by FSI experts
Statistical Test: mixed-model logistic regression (accurate or not) with random effects, controlling for expertise (sketched below)
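The analysis could be sketched along these lines with statsmodels; the column names, the hypothetical input file, and the choice of a Bayesian mixed GLM estimator are assumptions for illustration, not the authors' exact specification:

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Illustrative sketch only: the file and column names (accurate, method,
# experience, transcriber_id, image_id) are hypothetical.
df = pd.read_csv("field_level_outcomes.csv")  # one row per transcribed field

model = BinomialBayesMixedGLM.from_formula(
    "accurate ~ C(method) + experience",          # fixed effects: QC method, expertise
    {"transcriber": "0 + C(transcriber_id)",      # random effect: repeated work by a person
     "image": "0 + C(image_id)"},                 # random effect: 50 rows share an image
    df,
)
result = model.fit_vb()   # variational Bayes fit
print(result.summary())
```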
23. Limitations
Experience levels of R and RARB were lower than expected, though we did statistically control for this
Original B data used in A-B-ARB for certain fields was transcribed in a non-standard manner, requiring adjustment
24. No Need for RARB
No gains in quality from extra arbitration of peer-reviewed data (A-R = A-R-RARB)
RARB adds time, so the process is better without it
25. Quality Comparison
Both methods were statistically better than A alone
A-B-ARB had slightly lower error rates than A-R
R missed more errors, but also introduced fewer errors
27. Summary & Implications of Act II
Peer Review shows considerable efficiency gains with nearly as good quality as Arbitration
- Prime reviewers to find errors (e.g., prompt them with the expected # of errors on a page)
- Highlight potential problems (e.g., let A flag tough fields)
- Route difficult pages to experts (see the sketch after this list)
- Consider an A-R1-R2 process when high quality is critical
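The expert-routing idea could look roughly like this; the difficulty score and threshold are hypothetical stand-ins for whatever signal (e.g., A-R disagreement rate or field type) would actually be used:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Page:
    page_id: str
    difficulty: float   # hypothetical score, e.g. predicted or observed disagreement rate

def route_pages(pages: List[Page], threshold: float = 0.3) -> Tuple[List[Page], List[Page]]:
    """Split pages into an expert queue and a general peer-review queue.

    Illustrative sketch: the difficulty signal and threshold are assumptions,
    not part of the study.
    """
    expert_queue = [p for p in pages if p.difficulty >= threshold]
    peer_queue = [p for p in pages if p.difficulty < threshold]
    return expert_queue, peer_queue

# Example: hard handwriting goes to experts, easy pages to general reviewers
experts, peers = route_pages([Page("img-001", 0.45), Page("img-002", 0.10)])
```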
28. Summary & Implications of Act II
Reviewing reviewers isn't always worth the time
- At least in some contexts, Find-Fix may not need Verify
Quality of different fields varies dramatically
- Use different quality control mechanisms for harder or easier fields
Integrate human and algorithmic transcription
- Use algorithms on easy fields & integrate them into the review process so machine learning can occur (a sketch follows below)
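One way to picture that integration, purely as an illustration; the confidence scores and cutoff are assumptions, not part of the study:

```python
from typing import Dict, List, Tuple

def transcribe_with_machine_assist(
    machine_guess: Dict[str, Tuple[str, float]],   # field -> (value, confidence); hypothetical OCR output
    confidence_cutoff: float = 0.95,
) -> Tuple[Dict[str, str], List[str]]:
    """Accept high-confidence machine values for easy fields; flag the rest for human review.

    Illustrative sketch: the cutoff and the shape of the machine output are assumptions.
    Human corrections to the flagged fields could later be fed back as training data.
    """
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for field, (value, confidence) in machine_guess.items():
        if confidence >= confidence_cutoff:
            accepted[field] = value
        else:
            needs_review.append(field)
    return accepted, needs_review
```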
29. Questions
Derek Hansen (dlhansen@byu.edu)
Patrick Schone (BoiseBound@aol.com)
Douglas Corey (corey@mathed.byu.edu)
Matthew Reid (matthewreid007@gmail.com)
Jake Gehring (GehringJG@familysearch.org)
Editor's Notes
#3: The goal of FamilySearch.org is to help people find their ancestors. It is a freely available resource that compiles information from databases from around the world. The Church of Jesus Christ of Latter-Day Saints sponsors it, but it can be used by anyone for free.
#4: FamilySearch Indexing's role is to transcribe text from scanned images so it is in a machine-readable format that can be searched. This is done by hundreds of thousands of indexers, making it the world's largest document transcription service. Documents include census records, vital records (e.g., birth, death, marriage, burial), church records (e.g., christening), military records, legal records, cemetery records, and migration records from countries around the globe.
#5: As you can see, transcribing names from hand-written documents is not a trivial task, though a wide range of people are capable of learning to do it and no specialized equipment is needed. Nearly 400,000 contributors have transcribed records, with over 500 new volunteers signing up each day in the recent past. The challenges of transcription work make quality control mechanisms essential to the success of the project, and also underscore the importance of understanding expertise and how it develops over time.
#7: Documents are being scanned at an increasing rate. If we are to benefit from these new resources, we'll need to keep pace with the indexing efforts. Thus, the goals of FSI are to (a) index as many documents as possible, while (b) assuring a certain level of quality.
#8: And there are others for more complex tasks that require coordination, such as those occurring on Wikipedia (e.g., Kittur & Kraut, 2008). Note that some of these are not mutually exclusive. Many have only been tested in research prototype projects, but not at scale. And others were not designed with efficiency in mind.
#9: The current quality control mechanism is called A-B-Arbitrate (or just A-B-ARB or arbitration for short). In this process, person A and person B index the document independently, and an experienced arbitrator (ARB) reviews any discrepancies between the two.
#10: This is a proposed model that has not been tested until this study. The model could include arbitration (ARB), or that step could be skipped if A-R results in high enough quality on its own (see findings).
#11: Quality is measured as agreement between independent coders in Act I. This is not true quality, but it is highly correlated with actual quality. In Act II, quality is measured against a truth set created by a company that assured 99.9% accuracy and was independently audited by expert FSI transcribers. Efficiency is measured in terms of active time spent indexing (after idle time was removed) and keystrokes as captured by the indexing program.
#12: Quality (estimated based on A-B agreement): measures difficulty more than actual quality; underestimates quality, since an experienced arbitrator reviews all A-B disagreements; good at capturing differences across people, fields, and projects. Time (calculated using keystroke-logging data): idle time is tracked separately, making actual time measurements more accurate; outliers removed.
#13: Notice the high variation in agreement depending on how many possible values a field has (e.g., gender has only a couple of options, while surname has many).
#14: This finding is likely due to the fact that most transcribers are English-speaking, which suggests the need to recruit contributors who are native speakers of other languages.
#15: Experience is based on EL(U) = round(log5(N(U))), where U represents the transcriber, N(U) is the number of images that U has transcribed, and EL(U) is the experience level of U. The experience ranks correspond to the following image counts: rank 0 = 1 image, rank 1 = 5, rank 2 = 25, rank 3 = 125, rank 4 = 625, rank 5 = 3,125, rank 6 = 15,625, rank 7 = 78,125, rank 8 = 390,625.
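A minimal sketch of this formula in Python (the function name is ours):

```python
import math

def experience_level(images_transcribed: int) -> int:
    """EL(U) = round(log5(N(U))) as described in the note above.

    Illustrative sketch; assumes at least one transcribed image.
    """
    return round(math.log(images_transcribed, 5))

# Rank thresholds from the note: 1, 5, 25, ..., 390625 images
for n in [1, 5, 25, 125, 625, 3125, 15625, 78125, 390625]:
    print(n, experience_level(n))  # prints ranks 0 through 8
```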
#18: There isn't much improvement, since it's an easy field to agree on. In other words, even novices are good.
#19: Here there isn't much improvement, but the overall agreement is low. This suggests that even experts are not good, likely because of unfamiliarity with Canadian placenames given the predominantly US indexing population. Remember that expertise is based on all contributions, not just those in this category.
#20: More experienced transcribers are much faster (up to 4 times faster) than inexperienced users. They also use fewer keystrokes (e.g., using help functionality; fixing mistakes). Though not shown here, the paper shows that experienced indexers' work also requires less time to arbitrate and fewer keystrokes. Furthermore, the English-language 1871 Canadian Census was transcribed 2.68 seconds faster per line than the French version, even though the French version required more keystrokes. Again, this is likely because most transcribers are native English speakers.
#23: 2,000 random images including many fields (e.g., surname, county of origin, gender, age) for each of the 50 lines of data (with a single row for each individual). Note that this is repeated-measures data, since the same transcriber transcribes all 50 rows of an image in a batch and some people transcribe more than one page. We use a mixed model to account for this. Because people performing R were new to this method and the system was not tuned to the needs of reviewers, the A-R-RARB data should be considered a baseline, i.e., a lower bound on how well A-R-RARB can do.
#24: A new approach based on peer review instead of independent indexing would likely improve efficiency, but its effect on quality is unknown. Anecdotal evidence suggests that peer reviewing may be twice as fast as indexing from scratch.
#25: This is likely because most R edits fix problems; they rarely introduce new problems. However, RARB doesn't know who A or R is, and they erroneously agree with A too much, which is why there is no gain from RARB and, in fact, some small losses in quality due to RARB.
#27: There are clear gains in time for the A-R model, because reviewing takes about half as much time as transcribing from scratch.
#28: Remember, in our study Peer Review was a new method for those performing it and the system hadn't been customized to support it well, so it may do as well as A-B-ARB with some minor improvements and training.