This document summarizes a thesis presentation on using machine learning techniques to improve the quality of data collected through citizen science projects. It covers five research components:
1. Developing a model that estimates citizen scientists' expertise levels and how expertise affects their ability to detect different bird species.
2. Creating an automated data verification system that identifies anomalous bird observation submissions and flags them for review.
3. Clustering citizen scientists by skill level, as characterized by their species accumulation curves over time.
4. Developing a multi-species occupancy model that identifies which bird species are most commonly confused for each other by less experienced citizen scientists.
5. Building multi-species distribution models that account for inter-species information to improve rare species prediction.
1. Machine Learning for Improving the Quality of Citizen Science Data
Jun Yu
Final Oral Presentation
December 3, 2013
3. Citizen science projects
Citizen science encourages volunteers from the general public to participate in scientific research.
These projects accumulate large volumes of data over broad spatial and temporal extents.
Data quality is a concern due to the variability of citizen scientists' skills.
5. Processors vs. Sensors
Data collection: using humans as processors vs. sensors.
Processors: data can be validated, and the same task can be assigned to different participants.
Sensors: participants actively collect data, and there is no ground truth to validate the data.
Question
Can we improve the quality of human-sensor data in citizen science projects using machine learning techniques?
6. eBird
A large-scale citizen science project that engages people to identify birds and report their observations to eBird in the form of checklists.
150K individuals have spent 5M hours submitting 140M observations.
A checklist records: observer, site, visit, and species.
Species Distribution Models (SDMs) support conservation planning.
Figure: An overview of the eBird system.
7. Outline
1 Modeling citizen scientists' expertise for SDMs.
2 Automated data verification to identify anomalous submissions.
3 Clustering citizen scientists with similar skill levels.
4 Modeling misidentification of bird species by citizen scientists.
5 Multi-species distribution modeling to improve rare species prediction.
9. Part 1: Modeling citizen scientists' expertise for SDMs
SDMs predict species occupancy at a site.
Observations = Occupancy + Detection (structured noise).
Occupancy specifies whether the species occupies a site.
Detection specifies whether the observer detects the species when the species occupies the site.
The Occupancy-Detection (OD) model in Ecology makes two key assumptions:
Population closure: occupancy status is constant across visits.
No false positives: a species is never misidentified.
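To make the OD model's two assumptions concrete, here is a minimal simulation sketch (not from the thesis; parameter values are made up) of the generative process: occupancy is drawn once per site, and detections occur only at occupied sites.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_od(n_sites=1000, n_visits=3, psi=0.4, p_det=0.6):
    """Simulate the Occupancy-Detection generative process:
    occupancy Z is drawn once per site (population closure) and
    detections Y occur only at occupied sites (no false positives)."""
    Z = rng.random(n_sites) < psi                       # latent occupancy per site
    Y = (rng.random((n_sites, n_visits)) < p_det) & Z[:, None]
    return Z, Y

Z, Y = simulate_od()

# A naive occupancy estimate (fraction of sites with >= 1 detection)
# underestimates psi, because some occupied sites are never detected.
naive = Y.any(axis=1).mean()

# Correcting by P(at least one detection | occupied), using the known
# simulation values p_det = 0.6 and n_visits = 3, recovers psi.
detect_given_occ = 1 - (1 - 0.6) ** 3
corrected = naive / detect_given_occ
```

This illustrates why SDMs that ignore detection (plain regression on observations) are biased, and why the OD model separates the two processes.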
11. Occupancy-Detection-Expertise (ODE) Model
Observer expertise is added as a factor influencing detection.
False positives are allowed for both experts and novices.
Q(θ) = E_{P(Z|Y,E)}[log P(Y, Z, E | X, U, W)]
Learning: Expectation-Maximization.
Inference: site occupancy (Z), detection (Y), and observers' expertise (E).
12. Results
There is no ground truth for evaluating site occupancy on the field data.
Table: Number of species (out of 4 per group) for which the improvement of the ODE model is statistically significant, for three groups of species.

                  Prediction of Y                  Prediction of E
Bird Groups       ODE vs. LR     ODE vs. OD        ODE vs. LR
Common Birds      4/4            3/4               1/4
Rare Birds        4/4            3/4               3/4
Confusing Birds   4/4            4/4               3/4
14. Part 2: Automated data verification for the eBird project
Expert-defined filters specify the time window of bird occurrence in a region, based on regional experts' experience.
Observations falling outside the window are flagged as anomalous and reviewed by eBird reviewers.
Drawbacks:
They are sometimes inaccurate due to experts' bias.
They generate a large volume of flagged observations for review.
They do not apply in regions without regional experts.
15. Automated Data Filter
We propose a two-step automated data filter:
Emergent data filter: defines the time window of bird occurrence based on the frequency of bird occurrence in historical data.
Observer expertise scoring: accepts unusual observations from birders of high expertise and flags unusual observations from birders of low expertise. A birder's expertise is predicted using the ODE model.
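The two steps above can be sketched in a few lines. This is an illustrative toy, not the thesis implementation: the weekly binning, the 0.01 frequency threshold, and the 0.8 expertise threshold are all assumptions for the example.

```python
import numpy as np

def emergent_window(obs_days, freq_threshold=0.01):
    """Emergent data filter (sketch): derive a species' occurrence window
    from the weekly frequency of its historical reports. The frequency
    threshold is an assumed value for illustration."""
    weeks = np.asarray(obs_days) // 7
    counts = np.bincount(weeks, minlength=53)
    freq = counts / counts.sum()
    active = np.flatnonzero(freq >= freq_threshold)
    return active.min(), active.max()          # first and last active week

def flag_observation(obs_week, window, expertise, expert_threshold=0.8):
    """Step 2 (sketch): accept out-of-window reports from high-expertise
    birders (expertise as predicted by the ODE model), and flag the rest
    for review."""
    lo, hi = window
    in_window = lo <= obs_week <= hi
    return (not in_window) and expertise < expert_threshold

# toy history: a migrant reported almost exclusively in weeks 18-30,
# plus one stray report in week 2 that falls below the frequency threshold
history_days = [w * 7 for w in range(18, 31) for _ in range(10)] + [14]
window = emergent_window(history_days)
```

An out-of-window report from a low-expertise birder is flagged, while the same report from a high-expertise birder is accepted.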
16. Result
A case study using eBird data from Tompkins Co., NY.
The automated filter reduces the workload of reviewing flagged observations and identifies more potentially invalid observations.

Filter    Expert-defined filter   Automated data filter
Flagged   4006 (101 hrs)          2303 (58 hrs)
Invalid   985                     1497
18. Part 3: Clustering citizen scientists with similar skill levels
Motivation: Since citizen scientists vary in their expertise, we would like to find groups of citizen scientists with similar skills, in order to:
Understand differences in detection between citizen scientists.
Develop automated data filters to improve data quality.
Build more accurate SDMs by accounting for observers' skills.
Challenge: There is no ground truth to validate an observer's submissions.
Solution: Characterize observers' skill levels by their Species Accumulation Curves.
20. Species Accumulation Curves
Species Accumulation Curves (SACs): a graph plotting the cumulative number of unique species detected as a function of cumulative effort.
Fit a function to a SAC: F(x) = β0 + β1√x.
The SAC can characterize a birder's skill level:
Active birders vs. occasional birders.
Evolution of birders' skills over time.
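Fitting the SAC is ordinary least squares on a square-root feature (assuming the fitted form F(x) = β0 + β1√x from the slide); a sketch with made-up curves, where a steeper slope β1 reflects a more skilled, active birder:

```python
import numpy as np

def fit_sac(effort, cum_species):
    """Least-squares fit of the SAC form F(x) = b0 + b1 * sqrt(x).
    Returns (b0, b1); a skilled birder typically shows a larger b1."""
    X = np.column_stack([np.ones_like(effort), np.sqrt(effort)])
    coef, *_ = np.linalg.lstsq(X, cum_species, rcond=None)
    return coef

effort = np.arange(1, 51, dtype=float)       # e.g. checklists submitted
expert = 5 + 12 * np.sqrt(effort)            # steep accumulation (made up)
novice = 5 + 4 * np.sqrt(effort)             # shallow accumulation (made up)
b_expert = fit_sac(effort, expert)
b_novice = fit_sac(effort, novice)
```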
21. The mixture of SACs model
Q = E_{Z|Y,X}[log P(Y, Z | X; π, β, σ²)]
  = Σ_{i=1}^{M} Σ_{k=1}^{K} r_ik log [ P(Z_i = k; π) Π_{j=1}^{N_i} P(Y_ij | X_ij, Z_i = k; β, σ²) ]
E-step: update the expected membership r_ik of each birder.
M-step: update the model parameters {π, β, σ²}.
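The EM updates above can be sketched as a mixture of SAC regressions. This is an illustrative implementation under assumptions (Gaussian noise around each group's curve, a quantile-based initialization, and made-up toy data), not the thesis code:

```python
import numpy as np

rng = np.random.default_rng(1)

def em_mixture_sacs(X, Y, K=2, iters=30):
    """EM for a K-component mixture of SAC regressions: birder i belongs
    to latent group Z_i = k with probability pi_k, and each observation
    Y_ij ~ N(b0_k + b1_k * sqrt(X_ij), s2_k). Returns (pi, B, s2, r)."""
    M = len(X)
    feats = [np.column_stack([np.ones(len(x)), np.sqrt(x)]) for x in X]
    F, y = np.vstack(feats), np.concatenate(Y)
    # init: fit each birder separately, split birders by slope quantile
    slopes = np.array([np.linalg.lstsq(feats[i], Y[i], rcond=None)[0][1]
                       for i in range(M)])
    r = np.zeros((M, K))
    for k, chunk in enumerate(np.array_split(np.argsort(slopes), K)):
        r[chunk, k] = 1.0
    pi, B, s2 = np.full(K, 1.0 / K), np.zeros((K, 2)), np.ones(K)
    for _ in range(iters):
        # M-step: mixing weights + weighted least squares per component
        pi = r.mean(axis=0)
        for k in range(K):
            w = np.concatenate([np.full(len(X[i]), r[i, k]) for i in range(M)])
            sw = np.sqrt(w)
            B[k] = np.linalg.lstsq(sw[:, None] * F, sw * y, rcond=None)[0]
            resid = y - F @ B[k]
            s2[k] = (w * resid ** 2).sum() / w.sum()
        # E-step: responsibilities r_ik from per-birder log-likelihoods
        for i in range(M):
            ll = np.array([np.log(pi[k])
                           - 0.5 * len(Y[i]) * np.log(2 * np.pi * s2[k])
                           - 0.5 * np.sum((Y[i] - feats[i] @ B[k]) ** 2) / s2[k]
                           for k in range(K)])
            ll -= ll.max()
            r[i] = np.exp(ll) / np.exp(ll).sum()
    return pi, B, s2, r

# toy data: 10 "active" birders (steep SAC) and 10 "occasional" birders
effort = np.arange(1.0, 31.0)
true_b1 = [12.0] * 10 + [4.0] * 10
X = [effort] * 20
Y = [5.0 + b1 * np.sqrt(effort) + rng.normal(0, 1.0, effort.size)
     for b1 in true_b1]
pi, B, s2, r = em_mixture_sacs(X, Y)
```

On this toy data the two recovered slopes approach 4 and 12, separating the two skill groups.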
23. Result
Experimental setting:
eBird Reference Dataset in 2012.
Remove birders with fewer than 20 checklists.
Four species-rich states: NY, FL, TX and CA.
Determine the number of groups by calculating the average log-likelihood
on a validation set.
25. Individual birders' SACs
Figure: The SACs of birders from each group in NY. Panels: (a) G1 (b) G2 (c) G3.
Birders of the top group in NY:
25/30 are experts from the Cornell Lab of Ornithology or known regional eBird reviewers.
5/30 are reputable birders submitting high-quality checklists to eBird.
26. Detection of hard-to-detect bird species
Hard-to-detect species often require more skill to identify.
Figure: Detection rate of 6 hard-to-detect species for each group in NY.
27. Evaluation on eBird hotspots
Two eBird hotspots in NY; location affects the number of clusters.
Stewart Park: all 13 birders of G1 are verified to be experts, and 10 of the 12 birders of G2 are verified to be novice birders.
Hammond Hill: all 10 birders are verified to be experts.
29. Part 4: Identifying misidentifications of bird species
Motivation: We would like to identify which birds are confused for other birds in the eBird data, in order to:
Teach inexperienced birders.
Leverage this information in data quality control.
Estimate species occupancies more accurately.
Solution: Model multiple species simultaneously and allow false positives to be explained by the presence of other species.
31. The Multi-Species Occupancy-Detection (MSOD) model
Occupancy (α) determines the occupancy of species s at a site.
Detection (β) determines the detection of species s given the species that are confused for s at a site.
Structure (γ) specifies the cross edges to be recovered.
The joint probability of the MSOD model:
P(Y, Z | X, W) = Π_{i=1}^{N} P(Y_i··, Z_i· | X_i, W_i·)
              = Π_{i=1}^{N} Π_{s=1}^{S} [ P(Z_is | X_i) Π_{t=1}^{T_i} P(Y_its | Z_i·, W_it) ]
32. The parameterization of the MSOD model
The occupancy component (σ(·) denotes the logistic function):
o_is = σ(X_i · α_s)
P(Z_is | X_i; α_s) = o_is^{Z_is} (1 − o_is)^{1−Z_is}
The detection component, using a Noisy-OR model:
d_itrs = σ(W_it · β_rs)
P(Y_its = 0 | Z_i·, W_it) = (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir}
P(Y_its = 1 | Z_i·, W_it) = 1 − (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir}
P(Y_its | Z_i·, W_it) = P(Y_its = 1 | Z_i·, W_it)^{Y_its} · P(Y_its = 0 | Z_i·, W_it)^{1−Y_its}
d_0s is the leak probability for species s.
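The Noisy-OR detection component can be computed directly. A sketch with made-up covariates, weights, and structure (assuming the logistic link σ): each occupied species r with a cross edge γ_rs contributes an independent chance d_itrs of producing a report of s, and the leak d_0s accounts for reports with no occupied cause.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_detect(W_it, beta_s, gamma_s, Z_i, d_0s):
    """Noisy-OR detection probability P(Y_its = 1 | Z_i, W_it) for one
    species s, following the slide's parameterization. beta_s has one row
    of detection weights per candidate source species r."""
    d = sigmoid(beta_s @ W_it)                        # d_itrs for r = 1..S
    p_none = (1.0 - d_0s) * np.prod((1.0 - d) ** (gamma_s * Z_i))
    return 1.0 - p_none

W_it = np.array([1.0, -0.5])            # detection covariates for one visit
beta = np.array([[1.5, 0.4],            # self edge: species s itself
                 [0.2, 0.3],            # cross edge: a look-alike species
                 [0.0, 0.0]])           # an unrelated species
gamma = np.array([1.0, 1.0, 0.0])       # structure: which edges exist (made up)

# s absent, no look-alike present: only the leak can produce a report
p_absent = p_detect(W_it, beta, gamma, np.array([0.0, 0.0, 0.0]), d_0s=0.01)
# s absent but the look-alike occupies the site: false positives become likely
p_confuser = p_detect(W_it, beta, gamma, np.array([0.0, 1.0, 0.0]), d_0s=0.01)
```

This is exactly how the MSOD model lets false positives be explained by the presence of a confusable species rather than being forbidden outright.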
33. Structure learning and parameter estimation
Relax γ_rs ∈ {0, 1} to γ_rs ∈ [0, 1], turning the integer program into a linear program.
Learn the MSOD model using EM:
E-step: update the occupancy distributions P(Z_i·) for each site i.
M-step: update the model parameters θ = {α, β, γ}.
Threshold the learned adjacency matrix γ to identify the final learned structure.
Re-estimate the MSOD model with the final learned structure.
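The thresholding step is a single elementwise comparison. The relaxed γ values and the 0.5 cutoff below are made up for illustration (the thesis does not state the cutoff here); self edges sit on the diagonal:

```python
import numpy as np

# a made-up relaxed adjacency matrix learned by EM (gamma_rs in [0, 1]);
# rows = source species r, columns = reported species s
gamma = np.array([[0.97, 0.64, 0.03],
                  [0.12, 0.98, 0.88],
                  [0.02, 0.15, 0.99]])

threshold = 0.5                                # assumed cutoff
structure = (gamma >= threshold).astype(int)   # final 0/1 learned structure
```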
34. Synthetic data experiment
Synthetic dataset:
500 sites (3 visits maximum), 4 occupancy and 4 detection covariates.
5 species, with 7 randomly added pairs of confusing species.
Generate 30 different datasets.
Baselines:
The Occupancy-Detection (OD) model.
The Occupancy-Detection model with Leak Probability (ODLP).

Syn     Occupancy (Z)                    Observation (Y)
        AUC            Accuracy          AUC            Accuracy
TRUE    0.941 ± 0.004  0.881 ± 0.004     0.783 ± 0.004  0.756 ± 0.004
OD      0.849 ± 0.006  0.758 ± 0.006     0.751 ± 0.005  0.739 ± 0.004
ODLP    0.868 ± 0.006  0.780 ± 0.007     0.752 ± 0.005  0.741 ± 0.004
MSOD    0.935 ± 0.005  0.872 ± 0.006     0.776 ± 0.004  0.750 ± 0.004

Structure AUC of MSOD: 0.989 ± 0.012
36. eBird data experiment
eBird dataset:
Three case studies: Hawks, Woodpeckers and Finches.
eBird Reference Dataset in 2010.
Group checklists within a radius of 1.6 km into sites.
Checkerboarding to avoid spatial correlation.
Tasks:
Identify groups of misidentified species.
Prediction of detection (Y).
39. eBird data experiment
Finches: Purple Finch and House Finch, with Yellow-rumped Warbler as a distractor species.

        Purple Finch                     House Finch
        AUC            Accuracy          AUC            Accuracy
OD      0.807 ± 0.003  0.942 ± 0.001     0.758 ± 0.003  0.689 ± 0.002
ODLP    0.808 ± 0.003  0.943 ± 0.001     0.762 ± 0.003  0.696 ± 0.002
MSOD    0.817 ± 0.002  0.946 ± 0.001     0.775 ± 0.001  0.706 ± 0.001
40. Variational learning for the MSOD model
However, exact learning and inference in the MSOD model is exponential in the number of species.
Q(θ) = E_{Z|Y,X,W}[log P(Y, Z | X, W)]
     = Σ_{i=1}^{N} Σ_{z_i·} P(Z_i·) Σ_{s=1}^{S} [ log P(Z_is | X_i) + Σ_{t=1}^{T_i} log P(Y_its | Z_i·, W_it) ]
The key is that P(Y_its | Z_i·, W_it) cannot be factorized when Y_its is 1. [Recall that P(Y_its = 0 | Z_i·, W_it) = (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{γ_rs Z_ir} can be factorized.]
Assume P(Y_its | Z_i·, W_it) can be factorized for both Y_its = 1 and Y_its = 0:
Q(θ) ≈ E_{Z|Y,X,W} Σ_{i=1}^{N} Σ_{r=1}^{S} [ log P(Z_ir | X_i) + Σ_{t=1}^{T_i} Σ_{s=1}^{S} log P(Y_its | Z_ir, W_it) ]
     = Σ_{i=1}^{N} Σ_{r=1}^{S} Σ_{z_ir} P(Z_ir = z_ir) [ log P(Z_ir | X_i) + Σ_{t=1}^{T_i} Σ_{s=1}^{S} log P(Y_its | Z_ir, W_it) ]
41. The variational parameters
Introduce variational parameters q to put a lower bound on the probability P(Y_its = 1 | Z_i·, W_it):
log P(Y_its = 1 | Z_i·, W_it)
  = log [ 1 − (1 − d_0s) Π_{r=1}^{S} (1 − d_itrs)^{Z_ir} ]
  = log [ 1 − exp(−θ_0s − Σ_{r=1}^{S} Z_ir θ_itrs) ]          where θ_0s = −log(1 − d_0s), θ_itrs = −log(1 − d_itrs)
  = f(θ_0s + Σ_{r=1}^{S} Z_ir θ_itrs)                          where f(x) = log(1 − exp(−x)) is concave
  = f(θ_0s + Σ_{r=1}^{S} q_itrs · (Z_ir θ_itrs / q_itrs))      since the Z_ir θ_itrs are non-negative
  ≥ Σ_{r=1}^{S} q_itrs f(θ_0s + Z_ir θ_itrs / q_itrs)          by Jensen's inequality
In variational EM, we maximize this lower bound of the expected log-likelihood:
Variational E-step: update q and Z while fixing θ.
Variational M-step: update θ while fixing q and Z.
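The bound is easy to check numerically. A sketch with made-up leak and detection probabilities (and an arbitrary q on the simplex), verifying that the Jensen bound never exceeds the exact log-probability:

```python
import numpy as np

def f(x):
    # f(x) = log(1 - exp(-x)), concave for x > 0
    return np.log1p(-np.exp(-x))

d0 = 0.05                          # leak probability (made up)
d = np.array([0.6, 0.3, 0.2])      # d_itrs for three parent species (made up)
Z = np.array([1.0, 1.0, 0.0])      # occupancy of the parents
theta0 = -np.log1p(-d0)            # theta_0s  = -log(1 - d_0s)
theta = -np.log1p(-d)              # theta_itrs = -log(1 - d_itrs)

exact = f(theta0 + Z @ theta)      # exact log P(Y = 1 | Z, W)
q = np.array([0.5, 0.3, 0.2])      # any point on the simplex
bound = np.sum(q * f(theta0 + Z * theta / q))
```

Because f is concave and the q values sum to one, the bound is tight only at the optimal q, which is what the variational E-step searches for.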
42. Variational inference vs. Exact inference
500 sites (3 visits maximum), 4 occupancy and 4 detection covariates.
Number of species S ∈ [2, 7].
Randomly add S pairs of misidentified species.
43. Part 5: Multi-species distribution modeling
Motivation: Can we improve species distribution modeling by accounting for inter-species information (e.g., competition and mutualism)?
Solution: Build a model for all species and predict multiple species simultaneously.
This can be addressed by multi-label classification:
Text categorization
Image annotation
Species distribution modeling
Ensemble of Classifier Chains (ECC):
Classifier chain: order all the species in a chain and learn a classifier for the i-th species based on both the environmental features and the observations of the previous i − 1 species in the chain.
Ensemble of classifier chains: generate different orderings of species in a chain.
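A single classifier chain can be sketched in plain numpy. This is an illustrative toy, not the thesis implementation: the gradient-descent logistic regression stands in for the GLM base learner, and the two-species data are made up so that species B strongly co-occurs with species A.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Gradient-descent logistic regression (stand-in base learner)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def fit_chain(X, Y, order):
    """Classifier chain: the classifier for species order[i] sees the
    environmental features plus the labels of the i previous species."""
    models, Xaug = [], X
    for s in order:
        models.append(fit_logreg(Xaug, Y[:, s]))
        Xaug = np.column_stack([Xaug, Y[:, s]])
    return models

def predict_chain(models, X, order):
    """At prediction time, earlier species' predicted labels are fed
    forward as extra features for later species in the chain."""
    Xaug, preds = X, np.zeros((len(X), len(order)))
    for w, s in zip(models, order):
        p = 1.0 / (1.0 + np.exp(-Xaug @ w))
        preds[:, s] = p
        Xaug = np.column_stack([Xaug, (p > 0.5).astype(float)])
    return preds

# toy data: species B mostly copies species A (strong co-occurrence)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])
A = (rng.random(n) < 1.0 / (1.0 + np.exp(-3.0 * x1))).astype(float)
B = np.where(rng.random(n) < 0.95, A, 1.0 - A)
Y = np.column_stack([A, B])

models = fit_chain(X, Y, order=[0, 1])
preds = predict_chain(models, X, order=[0, 1])
```

The full ECC averages predictions over several chains with different random species orderings, which removes the sensitivity to any one ordering.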
45. Result
Experimental setup:
5 species datasets (4 bird datasets and 1 moth dataset).
Single-species model vs. multi-species model (ECC).
Two base learners: GLM and BRT.
Experiment 1: The overall performance of multi-species models against single-species models.
46. Result
Experiment 2: The performance of multi-species models against single-species models on rare species versus common species.
48. Future directions
The ODE model:
Estimate the model on data from both labeled and unlabeled birders.
Replace the logistic regression with more flexible function approximators such as boosted trees.
The automated data filter:
Improve the expertise prediction of the ODE model by including more expertise covariates.
Test the automated data filter more broadly across the US.
The mixture of SACs:
Explore a nonparametric Bayesian approach so that the number of groups is determined from the data itself.
Extend the model to capture the evolution of an observer's skill level over time.
The MSOD model:
Extend the MSOD model to capture species interactions.
Replace the logistic regression with more flexible function approximators such as boosted trees.
49. Acknowledgements
I would like to thank:
Lab mates: Chao, Jana, Jun, Liping, Moy, Xinze and Yuanli.
Basketball teammates: Alan, Chris, David, Eric, Patrick, Ron and Travis.
Collaborators at OSU: Rebecca Hutchinson, Tom Dietterich, Susan Shirley, Sarah Frey, Matt Betts and Julia Jones.
Collaborators at CLO: Marshall Iliff, Brian Sullivan, Chris Wood, Jeff Gerbracht and Steve Kelling.
Ph.D. advisor: Weng-Keen Wong.
Family: my parents, my parents-in-law and my wife.