This document summarizes a thesis presentation on using machine learning techniques to improve the quality of data collected through citizen science projects. It covers five research components:
1. Developing a model that estimates citizen scientists' expertise levels and how expertise affects their ability to detect different bird species.
2. Creating an automated data verification system that identifies anomalous bird observation submissions and flags them for review.
3. Clustering citizen scientists by skill level, as characterized by their species accumulation curves over time.
4. Developing a multi-species occupancy model that identifies which bird species are most commonly confused for each other by less experienced citizen scientists.
5. Building multi-species distribution models that account for inter-species information to improve rare species prediction.
1. Machine Learning for Improving the Quality of Citizen Science Data
Jun Yu
Final Oral Presentation
December 3, 2013
3. Citizen science projects
Citizen science encourages volunteers from the general public to participate in scientific research.
These projects accumulate large volumes of data over broad spatial and temporal extents.
Data quality is a concern due to the variability of citizen scientists' skills.
5. Processors vs. Sensors
Data collection: using humans as processors vs. sensors.
Processors: data can be validated, and the same task can be assigned to different participants.
Sensors: participants actively collect data, and there is no ground truth to validate the data.
Question
Can we improve the quality of human-sensor data in citizen science projects using machine learning techniques?
6. eBird
A large-scale citizen science project that engages people to identify birds and report their observations to eBird in the form of checklists.
150K individuals have spent 5M hours submitting 140M observations.
A checklist records: observer, site, visit, and species.
Species Distribution Models (SDMs) support conservation planning.
Figure: An overview of the eBird system.
7. Outline
1 Modeling citizen scientists' expertise for SDMs.
2 Automated data verification to identify anomalous submissions.
3 Clustering citizen scientists with similar skill levels.
4 Modeling misidentification of bird species by citizen scientists.
5 Multi-species distribution modeling to improve rare species prediction.
9. Part 1: Modeling citizen scientists' expertise for SDMs
SDMs predict species occupancy at a site.
Observations = Occupancy + Detection (structured noise).
Occupancy specifies whether the species occupies a site.
Detection specifies whether the observer detects the species when the species occupies the site.
The Occupancy-Detection (OD) model in Ecology makes two key assumptions:
Population closure: occupancy status is constant across visits.
No false positives: a species is never misidentified.
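To make the OD model's two assumptions concrete, here is a minimal simulation sketch (not from the thesis; parameter values are made up) of the generative process: occupancy is drawn once per site, and detections occur only at occupied sites.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_od(n_sites=1000, n_visits=3, psi=0.4, p_det=0.6):
    """Simulate the Occupancy-Detection generative process:
    occupancy Z is drawn once per site (population closure) and
    detections Y occur only at occupied sites (no false positives)."""
    Z = rng.random(n_sites) < psi                       # latent occupancy per site
    Y = (rng.random((n_sites, n_visits)) < p_det) & Z[:, None]
    return Z, Y

Z, Y = simulate_od()

# A naive occupancy estimate (fraction of sites with >= 1 detection)
# underestimates psi, because some occupied sites are never detected.
naive = Y.any(axis=1).mean()

# Correcting by P(at least one detection | occupied), using the known
# simulation values p_det = 0.6 and n_visits = 3, recovers psi.
detect_given_occ = 1 - (1 - 0.6) ** 3
corrected = naive / detect_given_occ
```

This illustrates why SDMs that ignore detection (plain regression on observations) are biased, and why the OD model separates the two processes.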
11. Occupancy-Detection-Expertise (ODE) Model
Observer expertise is added as a factor influencing detection.
False positives are allowed for both experts and novices.
Q(θ) = E_{P(Z|Y,E)}[log P(Y, Z, E | X, U, W)]
Learning: Expectation-Maximization.
Inference: site occupancy (Z), detection (Y), and observers' expertise (E).
12. Results
There is no ground truth for evaluating site occupancy on the field data.
Table: Number of species (out of 4 per group) for which the improvement of the ODE model is statistically significant, for three groups of species.

                  Prediction of Y                  Prediction of E
Bird Groups       ODE vs. LR     ODE vs. OD        ODE vs. LR
Common Birds      4/4            3/4               1/4
Rare Birds        4/4            3/4               3/4
Confusing Birds   4/4            4/4               3/4
14. Part 2: Automated data verification for the eBird project
Expert-defined filters specify the time window of bird occurrence in a region, based on regional experts' experience.
Observations falling outside the window are flagged as anomalous and reviewed by eBird reviewers.
Drawbacks:
They are sometimes inaccurate due to experts' bias.
They generate a large volume of flagged observations for review.
They do not apply in regions without regional experts.
15. Automated Data Filter
We propose a two-step automated data filter:
Emergent data filter: defines the time window of bird occurrence based on the frequency of bird occurrence in historical data.
Observer expertise scoring: accepts unusual observations from birders of high expertise and flags unusual observations from birders of low expertise. A birder's expertise is predicted using the ODE model.
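The two steps above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the weekly binning, the 0.01 frequency threshold, and the 0.8 expertise threshold are all assumptions for the example.

```python
import numpy as np

def emergent_window(obs_days, freq_threshold=0.01):
    """Emergent data filter (sketch): derive a species' occurrence window
    from the weekly frequency of its historical reports. The frequency
    threshold is an assumed value for illustration."""
    weeks = np.asarray(obs_days) // 7
    counts = np.bincount(weeks, minlength=53)
    freq = counts / counts.sum()
    active = np.flatnonzero(freq >= freq_threshold)
    return active.min(), active.max()          # first and last active week

def flag_observation(obs_week, window, expertise, expert_threshold=0.8):
    """Step 2 (sketch): accept out-of-window reports from high-expertise
    birders (expertise as predicted by the ODE model), and flag the rest
    for review."""
    lo, hi = window
    in_window = lo <= obs_week <= hi
    return (not in_window) and expertise < expert_threshold

# toy history: a migrant reported almost exclusively in weeks 18-30,
# plus one stray report in week 2 that falls below the frequency threshold
history_days = [w * 7 for w in range(18, 31) for _ in range(10)] + [14]
window = emergent_window(history_days)
```

An out-of-window report from a low-expertise birder is flagged, while the same report from a high-expertise birder is accepted.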
16. Result
A case study using eBird data from Tompkins Co., NY.
The automated filter reduces the workload of reviewing flagged observations and identifies more potentially invalid observations.

Filter    Expert-defined filter   Automated data filter
Flagged   4006 (101 hrs)          2303 (58 hrs)
Invalid   985                     1497
18. Part 3: Clustering citizen scientists with similar skill levels
Motivation: Since citizen scientists vary in their expertise, we would like to find groups of citizen scientists with similar skills, in order to:
Understand differences in detection between citizen scientists.
Develop automated data filters to improve data quality.
Build more accurate SDMs by accounting for observers' skills.
Challenge: There is no ground truth to validate an observer's submissions.
Solution: Characterize observers' skill levels by their Species Accumulation Curves.
20. Species Accumulation Curves
Species Accumulation Curves (SACs): a graph plotting the cumulative number of unique species detected as a function of cumulative effort.
Fit a function to a SAC: F(x) = β0 + β1√x.
The SAC can characterize a birder's skill level:
Active birders vs. occasional birders.
Evolution of birders' skills over time.
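Fitting the SAC is ordinary least squares on a square-root feature (assuming the fitted form F(x) = β0 + β1√x from the slide); a sketch with made-up curves, where a steeper slope β1 reflects a more skilled, active birder:

```python
import numpy as np

def fit_sac(effort, cum_species):
    """Least-squares fit of the SAC form F(x) = b0 + b1 * sqrt(x).
    Returns (b0, b1); a skilled birder typically shows a larger b1."""
    X = np.column_stack([np.ones_like(effort), np.sqrt(effort)])
    coef, *_ = np.linalg.lstsq(X, cum_species, rcond=None)
    return coef

effort = np.arange(1, 51, dtype=float)       # e.g. checklists submitted
expert = 5 + 12 * np.sqrt(effort)            # steep accumulation (made up)
novice = 5 + 4 * np.sqrt(effort)             # shallow accumulation (made up)
b_expert = fit_sac(effort, expert)
b_novice = fit_sac(effort, novice)
```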
21. The mixture of SACs model
Q = E_{Z|Y,X}[log P(Y, Z | X; π, β, σ²)]
  = Σ_{i=1}^{M} Σ_{k=1}^{K} r_ik log [ P(Z_i = k; π) Π_{j=1}^{N_i} P(Y_ij | X_ij, Z_i = k; β, σ²) ]
E-step: update the expected membership r_ik of each birder.
M-step: update the model parameters {π, β, σ²}.
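The EM updates above can be sketched as a mixture of SAC regressions. This is an illustrative implementation under assumptions (Gaussian noise around each group's curve, a quantile-based initialization, and made-up toy data), not the thesis code:

```python
import numpy as np

rng = np.random.default_rng(1)

def em_mixture_sacs(X, Y, K=2, iters=30):
    """EM for a K-component mixture of SAC regressions: birder i belongs
    to latent group Z_i = k with probability pi_k, and each observation
    Y_ij ~ N(b0_k + b1_k * sqrt(X_ij), s2_k). Returns (pi, B, s2, r)."""
    M = len(X)
    feats = [np.column_stack([np.ones(len(x)), np.sqrt(x)]) for x in X]
    F, y = np.vstack(feats), np.concatenate(Y)
    # init: fit each birder separately, split birders by slope quantile
    slopes = np.array([np.linalg.lstsq(feats[i], Y[i], rcond=None)[0][1]
                       for i in range(M)])
    r = np.zeros((M, K))
    for k, chunk in enumerate(np.array_split(np.argsort(slopes), K)):
        r[chunk, k] = 1.0
    pi, B, s2 = np.full(K, 1.0 / K), np.zeros((K, 2)), np.ones(K)
    for _ in range(iters):
        # M-step: mixing weights + weighted least squares per component
        pi = r.mean(axis=0)
        for k in range(K):
            w = np.concatenate([np.full(len(X[i]), r[i, k]) for i in range(M)])
            sw = np.sqrt(w)
            B[k] = np.linalg.lstsq(sw[:, None] * F, sw * y, rcond=None)[0]
            resid = y - F @ B[k]
            s2[k] = (w * resid ** 2).sum() / w.sum()
        # E-step: responsibilities r_ik from per-birder log-likelihoods
        for i in range(M):
            ll = np.array([np.log(pi[k])
                           - 0.5 * len(Y[i]) * np.log(2 * np.pi * s2[k])
                           - 0.5 * np.sum((Y[i] - feats[i] @ B[k]) ** 2) / s2[k]
                           for k in range(K)])
            ll -= ll.max()
            r[i] = np.exp(ll) / np.exp(ll).sum()
    return pi, B, s2, r

# toy data: 10 "active" birders (steep SAC) and 10 "occasional" birders
effort = np.arange(1.0, 31.0)
true_b1 = [12.0] * 10 + [4.0] * 10
X = [effort] * 20
Y = [5.0 + b1 * np.sqrt(effort) + rng.normal(0, 1.0, effort.size)
     for b1 in true_b1]
pi, B, s2, r = em_mixture_sacs(X, Y)
```

On this toy data the two recovered slopes approach 4 and 12, separating the two skill groups.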
23. Result
Experimental setting:
eBird Reference Dataset in 2012.
Remove birders with fewer than 20 checklists.
Four species-rich states: NY, FL, TX and CA.
Determine the number of groups by calculating the average log-likelihood
on a validation set.
25. Individual birders' SACs
Figure: The SACs of birders from each group in NY. Panels: (a) G1 (b) G2 (c) G3.
Birders of the top group in NY:
25/30 are experts from the Cornell Lab of Ornithology or known regional eBird reviewers.
5/30 are reputable birders submitting high-quality checklists to eBird.
26. Detection of hard-to-detect bird species
Hard-to-detect species often require more skill to identify.
Figure: Detection rate of 6 hard-to-detect species for each group in NY.
27. Evaluation on eBird hotspots
Two eBird hotspots in NY; location affects the number of clusters.
Stewart Park: all 13 birders of G1 are verified to be experts, and 10 of the 12 birders of G2 are verified to be novice birders.
Hammond Hill: all 10 birders are verified to be experts.
29. Part 4: Identifying misidentifications of bird species
Motivation: We would like to identify which birds are confused for other birds in the eBird data, in order to:
Teach inexperienced birders.
Leverage this information in data quality control.
Estimate species occupancies more accurately.
Solution: Model multiple species simultaneously and allow false positives to be explained by the presence of other species.
31. The Multi-Species Occupancy-Detection (MSOD) model
Occupancy (α) determines the occupancy of species s at a site.
Detection (β) determines the detection of species s given the species that are confused for s at a site.
Structure (γ) specifies the cross edges to be recovered.
The joint probability of the MSOD model:
P(Y, Z | X, W) = Π_{i=1}^{N} P(Y_i··, Z_i· | X_i, W_i·)
              = Π_{i=1}^{N} Π_{s=1}^{S} [ P(Z_is | X_i) Π_{t=1}^{T_i} P(Y_its | Z_i·, W_it) ]
32. The parameterization of the MSOD model
The occupancy component (σ(·) denotes the logistic function):
o_is = σ(X_i · α_s)
P(Z_is | X_i; α_s) = o_is^{Z_is} (1 − o_is)^{1−Z_is}
The detection component, using a Noisy-OR model:
d_itrs = σ(W_it · β_rs)
P(Y_its = 0 | Z_i·, W_it) = (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir}
P(Y_its = 1 | Z_i·, W_it) = 1 − (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir}
P(Y_its | Z_i·, W_it) = P(Y_its = 1 | Z_i·, W_it)^{Y_its} · P(Y_its = 0 | Z_i·, W_it)^{1−Y_its}
d_0s is the leak probability for species s.
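The Noisy-OR detection component can be computed directly. A sketch with made-up covariates, weights, and structure (assuming the logistic link σ): each occupied species r with a cross edge γ_rs contributes an independent chance d_itrs of producing a report of s, and the leak d_0s accounts for reports with no occupied cause.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_detect(W_it, beta_s, gamma_s, Z_i, d_0s):
    """Noisy-OR detection probability P(Y_its = 1 | Z_i, W_it) for one
    species s, following the slide's parameterization. beta_s has one row
    of detection weights per candidate source species r."""
    d = sigmoid(beta_s @ W_it)                        # d_itrs for r = 1..S
    p_none = (1.0 - d_0s) * np.prod((1.0 - d) ** (gamma_s * Z_i))
    return 1.0 - p_none

W_it = np.array([1.0, -0.5])            # detection covariates for one visit
beta = np.array([[1.5, 0.4],            # self edge: species s itself
                 [0.2, 0.3],            # cross edge: a look-alike species
                 [0.0, 0.0]])           # an unrelated species
gamma = np.array([1.0, 1.0, 0.0])       # structure: which edges exist (made up)

# s absent, no look-alike present: only the leak can produce a report
p_absent = p_detect(W_it, beta, gamma, np.array([0.0, 0.0, 0.0]), d_0s=0.01)
# s absent but the look-alike occupies the site: false positives become likely
p_confuser = p_detect(W_it, beta, gamma, np.array([0.0, 1.0, 0.0]), d_0s=0.01)
```

This is exactly how the MSOD model lets false positives be explained by the presence of a confusable species rather than being forbidden outright.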
33. Structure learning and parameter estimation
Relax γ_rs ∈ {0, 1} to γ_rs ∈ [0, 1], turning the integer program into a linear program.
Learn the MSOD model using EM:
E-step: update the occupancy distributions P(Z_i·) for each site i.
M-step: update the model parameters θ = {α, β, γ}.
Threshold the learned adjacency matrix γ to identify the final learned structure.
Re-estimate the MSOD model with the final learned structure.
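The thresholding step is a single elementwise comparison. The relaxed γ values and the 0.5 cutoff below are made up for illustration (the thesis does not state the cutoff here); self edges sit on the diagonal:

```python
import numpy as np

# a made-up relaxed adjacency matrix learned by EM (gamma_rs in [0, 1]);
# rows = source species r, columns = reported species s
gamma = np.array([[0.97, 0.64, 0.03],
                  [0.12, 0.98, 0.88],
                  [0.02, 0.15, 0.99]])

threshold = 0.5                                # assumed cutoff
structure = (gamma >= threshold).astype(int)   # final 0/1 learned structure
```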
34. Synthetic data experiment
Synthetic dataset:
500 sites (3 visits maximum), 4 occupancy and 4 detection covariates.
5 species, with 7 randomly added pairs of confusing species.
Generate 30 different datasets.
Baselines:
The Occupancy-Detection (OD) model.
The Occupancy-Detection model with Leak Probability (ODLP).

Syn     Occupancy (Z)                    Observation (Y)
        AUC            Accuracy          AUC            Accuracy
TRUE    0.941 ± 0.004  0.881 ± 0.004     0.783 ± 0.004  0.756 ± 0.004
OD      0.849 ± 0.006  0.758 ± 0.006     0.751 ± 0.005  0.739 ± 0.004
ODLP    0.868 ± 0.006  0.780 ± 0.007     0.752 ± 0.005  0.741 ± 0.004
MSOD    0.935 ± 0.005  0.872 ± 0.006     0.776 ± 0.004  0.750 ± 0.004

Structure AUC of MSOD: 0.989 ± 0.012
36. eBird data experiment
eBird dataset:
Three case studies: Hawks, Woodpeckers and Finches.
eBird Reference Dataset in 2010.
Group checklists within a radius of 1.6 km into sites.
Checkerboarding to avoid spatial correlation.
Tasks:
Identify groups of misidentified species.
Prediction of detection (Y).
39. eBird data experiment
Finches: Purple Finch and House Finch, with Yellow-rumped Warbler as a distractor species.

        Purple Finch                     House Finch
        AUC            Accuracy          AUC            Accuracy
OD      0.807 ± 0.003  0.942 ± 0.001     0.758 ± 0.003  0.689 ± 0.002
ODLP    0.808 ± 0.003  0.943 ± 0.001     0.762 ± 0.003  0.696 ± 0.002
MSOD    0.817 ± 0.002  0.946 ± 0.001     0.775 ± 0.001  0.706 ± 0.001
40. Variational learning for the MSOD model
However, exact learning and inference in the MSOD model is exponential in the number of species.
Q(θ) = E_{Z|Y,X,W}[log P(Y, Z | X, W)]
     = Σ_{i=1}^{N} Σ_{z_i·} P(Z_i·) Σ_{s=1}^{S} [ log P(Z_is | X_i) + Σ_{t=1}^{T_i} log P(Y_its | Z_i·, W_it) ]
The key is that P(Y_its | Z_i·, W_it) cannot be factorized when Y_its is 1. [Recall that P(Y_its = 0 | Z_i·, W_it) = (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir} can be factorized.]
Assume P(Y_its | Z_i·, W_it) can be factorized for both Y_its = 1 and Y_its = 0:
Q(θ) ≈ E_{Z|Y,X,W} Σ_{i=1}^{N} Σ_{r=1}^{S} [ log P(Z_ir | X_i) + Σ_{t=1}^{T_i} Σ_{s=1}^{S} log P(Y_its | Z_ir, W_it) ]
     = Σ_{i=1}^{N} Σ_{r=1}^{S} Σ_{z_ir} P(Z_ir = z_ir) [ log P(Z_ir | X_i) + Σ_{t=1}^{T_i} Σ_{s=1}^{S} log P(Y_its | Z_ir, W_it) ]
41. The variational parameters
Introduce variational parameters q to put a lower bound on the probability P(Y_its = 1 | Z_i·, W_it):
log P(Y_its = 1 | Z_i·, W_it)
  = log [ 1 − (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{Z_ir} ]
  = log [ 1 − exp(−θ_0s − Σ_{r=1}^{S} Z_ir θ_itrs) ]          where θ_0s = −log(1 − d_0s), θ_itrs = −log(1 − d_itrs)
  = f(θ_0s + Σ_{r=1}^{S} Z_ir θ_itrs)                          where f(x) = log(1 − exp(−x)) is concave
  = f(θ_0s + Σ_{r=1}^{S} q_itrs · (Z_ir θ_itrs / q_itrs))      since the Z_ir θ_itrs are non-negative
  ≥ Σ_{r=1}^{S} q_itrs f(θ_0s + Z_ir θ_itrs / q_itrs)          by Jensen's inequality
In variational EM, we maximize this lower bound of the expected log-likelihood:
Variational E-step: update q and Z while fixing θ.
Variational M-step: update θ while fixing q and Z.
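The bound is easy to check numerically. A sketch with made-up leak and detection probabilities (and an arbitrary q on the simplex), verifying that the Jensen bound never exceeds the exact log-probability:

```python
import numpy as np

def f(x):
    # f(x) = log(1 - exp(-x)), concave for x > 0
    return np.log1p(-np.exp(-x))

d0 = 0.05                          # leak probability (made up)
d = np.array([0.6, 0.3, 0.2])      # d_itrs for three parent species (made up)
Z = np.array([1.0, 1.0, 0.0])      # occupancy of the parents
theta0 = -np.log1p(-d0)            # theta_0s  = -log(1 - d_0s)
theta = -np.log1p(-d)              # theta_itrs = -log(1 - d_itrs)

exact = f(theta0 + Z @ theta)      # exact log P(Y = 1 | Z, W)
q = np.array([0.5, 0.3, 0.2])      # any point on the simplex
bound = np.sum(q * f(theta0 + Z * theta / q))
```

Because f is concave and the q values sum to one, the bound is tight only at the optimal q, which is what the variational E-step searches for.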
42. Variational inference vs. Exact inference
500 sites (3 visits maximum), 4 occupancy and 4 detection covariates.
Number of species S ∈ [2, 7].
Randomly add S pairs of misidentified species.
43. Part 5: Multi-species distribution modeling
Motivation: Can we improve species distribution modeling by accounting for inter-species information (e.g., competition and mutualism)?
Solution: Build a model for all species and predict multiple species simultaneously.
This can be addressed by multi-label classification:
Text categorization
Image annotation
Species distribution modeling
Ensemble of Classifier Chains (ECC):
Classifier chain: order all the species in a chain and learn a classifier for the i-th species based on both the environmental features and the observations of the previous i − 1 species in the chain.
Ensemble of classifier chains: generate different orderings of species in a chain.
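A single classifier chain can be sketched in plain numpy. This is an illustrative toy, not the thesis implementation: the gradient-descent logistic regression stands in for the GLM base learner, and the two-species data are made up so that species B strongly co-occurs with species A.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Gradient-descent logistic regression (stand-in base learner)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fit_chain(X, Y, order):
    """Classifier chain: the classifier for species order[i] sees the
    environmental features plus the labels of the i previous species."""
    models, Xaug = [], X
    for s in order:
        models.append(fit_logreg(Xaug, Y[:, s]))
        Xaug = np.column_stack([Xaug, Y[:, s]])
    return models

def predict_chain(models, X, order):
    """At prediction time, earlier species' predicted labels are fed
    forward as extra features for later species in the chain."""
    Xaug, preds = X, np.zeros((len(X), len(order)))
    for w, s in zip(models, order):
        p = 1.0 / (1.0 + np.exp(-Xaug @ w))
        preds[:, s] = p
        Xaug = np.column_stack([Xaug, (p > 0.5).astype(float)])
    return preds

# toy data: species B mostly copies species A (strong co-occurrence)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
A = (rng.random(n) < 1.0 / (1.0 + np.exp(-3.0 * x1))).astype(float)
B = np.where(rng.random(n) < 0.95, A, 1.0 - A)
Y = np.column_stack([A, B])

models = fit_chain(X, Y, order=[0, 1])
preds = predict_chain(models, X, order=[0, 1])
```

The full ECC averages predictions over several chains with different random species orderings, which removes the sensitivity to any one ordering.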
45. Result
Experimental setup:
5 species datasets (4 bird datasets and 1 moth dataset).
Single-species model vs. multi-species model (ECC).
Two base learners: GLM and BRT.
Experiment 1: The overall performance of multi-species models against single-species models.
46. Result
Experiment 2: The performance of multi-species models against single-species models on rare species versus common species.
48. Future directions
The ODE model:
Estimate the model on data from both labeled and unlabeled birders.
Replace the logistic regression with more flexible function approximators such as boosted trees.
The automated data filter:
Improve the expertise prediction of the ODE model by including more expertise covariates.
Test the automated data filter more broadly across the US.
The mixture of SACs:
Explore a nonparametric Bayesian approach so that the number of groups is determined from the data itself.
Extend the model to capture the evolution of an observer's skill level over time.
The MSOD model:
Extend the MSOD model to capture species interactions.
Replace the logistic regression with more flexible function approximators such as boosted trees.
49. Acknowledgements
I would like to thank:
Lab mates: Chao, Jana, Jun, Liping, Moy, Xinze and Yuanli.
Basketball teammates: Alan, Chris, David, Eric, Patrick, Ron and Travis.
Collaborators at OSU: Rebecca Hutchinson, Tom Dietterich, Susan Shirley, Sarah Frey, Matt Betts and Julia Jones.
Collaborators at CLO: Marshall Iliff, Brian Sullivan, Chris Wood, Jeff Gerbracht and Steve Kelling.
Ph.D. advisor: Weng-Keen Wong.
Family: my parents, my parents-in-law and my wife.