This document summarizes a workshop talk on data science challenges. It covers both the benefits and the limitations of crowdsourced data challenges, including the lack of direct access to solutions and the weak incentives for collaboration, and then describes a new format, Rapid Analytics and Model Prototyping (RAMP), designed to address these limitations by giving direct access to code and data and by incentivizing diversity and collaboration. RAMP grew out of the experience with the Higgs Boson Machine Learning (HiggsML) Challenge and draws on the open-innovation literature.
1. Center for Data Science
Paris-Saclay
1
CNRS & Université Paris-Saclay
Center for Data Science
BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY
2. 2
Why am I so critical?
Why do I downplay our own success with the HiggsML?
3. 3
Because I believe that there is enormous potential in open innovation/crowdsourcing in science.
The current data challenge format is a single point in the landscape.
4. 4
Olga Kokshagina 2015
INTERMEDIARIES: THE GROWING INTEREST FOR « CROWDS » -> EXPLOSION OF TOOLS
Crowdsourcing is a model leveraging novel technologies (web 2.0, mobile apps, social networks) to build content and a structured set of information by gathering contributions from large groups of individuals
5. Center for Data Science
Paris-Saclay
CROWDSOURCING ANNOTATION
5
6. Center for Data Science
Paris-Saclay
CROWDSOURCING COLLECTION AND ANNOTATION
6
12. Center for Data Science
Paris-Saclay
Summary of our conclusions after the HiggsML challenge
The good, the bad and the ugly
Elaborating on some of the points
Rapid Analytics and Model Prototyping
an experimental format we have been developing
12
OUTLINE
13. Center for Data Science
Paris-Saclay
13
CIML WORKSHOP TOMORROW
14. Center for Data Science
Paris-Saclay
Publicity, awareness
both in physics (about the technology) and in ML (about the problem)
Triggering open data
http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
Learning a lot from Gábor on how to win a challenge
Gábor getting hired by Google DeepMind
Benchmarking
Tool dissemination (xgboost, keras)
14
THE GOOD
15. Center for Data Science
Paris-Saclay
No direct access to code
No direct access to data scientists
No fundamentally new ideas
No incentive to collaborate
15
THE BAD
16. Center for Data Science
Paris-Saclay
18 months to prepare
legal issues, access to data
problem formulation: intellectually way more interesting than the challenge itself, but difficult to market or to crowdsource
once a problem is formalized/formatted into a challenge, the problem is solved ("learning is easy" - Gaël Varoquaux)
16
THE UGLY
17. Center for Data Science
Paris-Saclay
We asked the wrong question, on purpose!
because the right questions are complex and don't fit the challenge setup
would have led to way less participation
would have led to bitterness among the participants, bad (?) for marketing
17
THE UGLY
18. Center for Data Science
Paris-Saclay
The HiggsML challenge on Kaggle
https://www.kaggle.com/c/higgs-boson
18
PUBLICITY, AWARENESS
19. Center for Data Science
Paris-Saclay
PUBLICITY, AWARENESS
19
[Embedded slide from B. Kégl, AppStat@LAL, "Learning to discover": CLASSIFICATION FOR DISCOVERY]
20. Center for Data Science
Paris-Saclay
AWARENESS DYNAMICS
20
HEPML workshop @NIPS14
JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
CERN Open Data
http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
DataScience@LHC
http://indico.cern.ch/event/395374/
Flavors of physics challenge
https://www.kaggle.com/c/flavours-of-physics
21. Center for Data Science
Paris-Saclay
LEARNING FROM THE WINNER
21
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
22. Center for Data Science
Paris-Saclay
LEARNING FROM THE WINNER
22
Sophisticated cross validation, CV bagging
Sophisticated calibration and model averaging
The first step: pro participants check if the effort is worthwhile,
risk assessment
variance estimate of the score
Don't use the public leaderboard score for model selection
None of Gábor's 200 out-of-the-ordinary ideas worked
https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
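To make the CV-bagging and score-variance points concrete, here is a minimal sketch; the classifier, the 10 folds, and the AUC metric are illustrative assumptions, not the winner's actual setup:

```python
# CV bagging sketch: one model per cross-validation fold, predictions averaged,
# fold-score spread used as a variance estimate of the score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, random_state=0)
X_train, y_train, X_test, y_test = X[:1500], y[:1500], X[1500:], y[1500:]

fold_scores, test_preds = [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X_train, y_train):
    model = GradientBoostingClassifier().fit(X_train[train_idx], y_train[train_idx])
    # validation score of this fold's model
    fold_scores.append(roc_auc_score(
        y_train[valid_idx], model.predict_proba(X_train[valid_idx])[:, 1]))
    # keep this fold model's test predictions for bagging
    test_preds.append(model.predict_proba(X_test)[:, 1])

# mean +- std over folds: a variance estimate you control, unlike the
# public leaderboard score, which is a single noisy number
print(f"CV score: {np.mean(fold_scores):.4f} +- {np.std(fold_scores):.4f}")
# CV bagging: average the fold models' test predictions
print(f"bagged test AUC: {roc_auc_score(y_test, np.mean(test_preds, axis=0)):.4f}")
```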
23. Center for Data Science
Paris-Saclay
BENCHMARKING
23
[Embedded slide: CLASSIFICATION FOR DISCOVERY - benchmark figure]
24. Center for Data Science
Paris-Saclay
BENCHMARKING
24
But what score did we optimize?
And why?
25. Center for Data Science
Paris-Saclay
[Figure: signal and background probability densities and expected counts per year as a function of the classifier score, with the selection threshold marked]
CLASSIFICATION FOR DISCOVERY
25
Goal: optimize the expected discovery significance
Example: after the selection, the expected background (flux × time) is, say, b = 100 events; the total count is, say, 150 events, so the excess is s = 50 events, and AMS = s/\sqrt{b} = 50/\sqrt{100} = 5 sigma.
When optimizing the design of the selection region G = {x : g(x) = s}, we do not know n and the background expectation \mu_b, so we estimate \mu_b by its empirical counterpart b to obtain the approximate median significance
AMS_2 = \sqrt{2\big((s + b)\ln(1 + s/b) - s\big)}.  (14)
Since \ln(x + 1) = x - x^2/2 + O(x^3), AMS_2 can be rewritten as AMS_3\,(1 + O(s/b)), where
AMS_3 = s/\sqrt{b}.  (15)
The two are practically indistinguishable when b \gg s; depending on the chosen search region, AMS_3 can be a valid surrogate.
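Both objectives are straightforward to compute; a short sketch follows (the function names ams2 and ams3 are mine, following eqs. (14) and (15) above):

```python
import numpy as np

def ams2(s, b):
    """Approximate median significance, eq. (14)."""
    return np.sqrt(2 * ((s + b) * np.log(1 + s / b) - s))

def ams3(s, b):
    """Simplified significance s / sqrt(b), eq. (15); valid when b >> s."""
    return s / np.sqrt(b)

# The worked example above: b = 100 expected background events,
# 150 events counted in total, so the excess is s = 50.
print(ams3(50, 100))  # 5.0 sigma
print(ams2(50, 100))  # ~4.65 sigma; the O(s/b) correction is visible since s/b = 0.5
```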
26. Center for Data Science
Paris-Saclay
How to handle systematic (model) uncertainties?
OK, so let's design an objective function that can take background systematics into consideration
Likelihood with unknown background, b \sim N(\mu_b, \sigma_b):
L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!} e^{-(\mu_s + \mu_b)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_b} e^{-(b - \mu_b)^2 / (2\sigma_b^2)}
Profile likelihood ratio: \lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}
The new Approximate Median Significance (by Glen Cowan):
AMS = \sqrt{2\Big((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\Big) + \frac{(b - b_0)^2}{\sigma_b^2}}
where
b_0 = \frac{1}{2}\Big(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\,\sigma_b^2}\Big)
26
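Transcribed into code, the new AMS reads as follows (a sketch; the function name and the 10-event systematic in the example are my own choices):

```python
import numpy as np

def ams_sys(s, b, sigma_b):
    """New AMS with background systematics (Glen Cowan's formula above).
    s, b: expected signal and background counts; sigma_b: background uncertainty."""
    # profiled background estimate b0 under the background-only hypothesis
    b0 = 0.5 * (b - sigma_b**2
                + np.sqrt((b - sigma_b**2)**2 + 4 * (s + b) * sigma_b**2))
    return np.sqrt(2 * ((s + b) * np.log((s + b) / b0) - s - b + b0)
                   + (b - b0)**2 / sigma_b**2)

# Slide-25 example (s = 50, b = 100) with a 10-event background systematic:
# the significance drops from ~4.65 sigma (the sigma_b -> 0 limit) to ~3.3 sigma.
print(ams_sys(50, 100, 10))
```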
27. Center for Data Science
Paris-Saclay
HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
27
Why didn't we use it?
28. Center for Data Science
Paris-Saclay
28
How to handle systematic (model) uncertainties?
The new Approximate Median Significance:
AMS = \sqrt{2\Big((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\Big) + \frac{(b - b_0)^2}{\sigma_b^2}},
where b_0 = \frac{1}{2}\Big(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\,\sigma_b^2}\Big)
[Figure: the new AMS vs. the old AMS on the ATLAS selection]
29. Center for Data Science
Paris-Saclay
LEARNING FROM THE WINNER
29
Sophisticated cross validation, CV bagging
Sophisticated calibration and model averaging
The first step: pro participants check if the effort is worthwhile,
risk assessment
variance estimate of the score
Don't use the public leaderboard score for model selection
None of Gábor's 200 out-of-the-ordinary ideas worked
30. Center for Data Science
Paris-Saclay
THE TWO MOST COMMON DATA CHALLENGE KILLERS
30
Leakage
Variance of the test score
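To make the first killer, leakage, concrete: the classic pattern is preprocessing fitted on the full dataset before cross-validation. A small illustration of my own, on pure-noise data where the honest score is 50%:

```python
# Leakage illustration: feature selection fitted on ALL the data leaks label
# information into the folds and produces a bogus score on pure noise.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10000))  # noise features, no real signal
y = rng.integers(0, 2, size=100)   # random labels: true accuracy is 50%

# WRONG: select the 20 "best" features using all labels, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
print(cross_val_score(LogisticRegression(), X_leaky, y).mean())  # far above 0.5

# RIGHT: selection inside the pipeline, refitted on each training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
print(cross_val_score(pipe, X, y).mean())  # ~0.5, honest
```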
31. Center for Data Science
Paris-Saclay
VARIANCE OF THE TEST SCORE
31
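One way to see this killer (an illustration of my own): bootstrap the test set to measure how much the score fluctuates from the finite test size alone; leaderboard gaps smaller than this fluctuation are noise.

```python
# Bootstrap estimate of the test-score variance: resample the test set and
# recompute the score to see the fluctuation due to finite test size.
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_test = 1000
y_true = rng.integers(0, 2, size=n_test)                         # placeholder labels
y_pred = np.where(rng.random(n_test) < 0.8, y_true, 1 - y_true)  # an ~80%-accurate model

scores = []
for _ in range(1000):
    idx = rng.integers(0, n_test, size=n_test)  # bootstrap resample of the test set
    scores.append(accuracy_score(y_true[idx], y_pred[idx]))

# std ~ sqrt(0.8 * 0.2 / 1000) ~ 0.013: rank differences below this are meaningless
print(f"score = {np.mean(scores):.3f} +- {np.std(scores):.3f}")
```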
32. Center for Data Science
Paris-Saclay
Challenges are useful for
generating visibility in the data science community about novel application domains
benchmarking state-of-the-art techniques fairly on well-defined problems
finding talented data scientists
Limitations
not necessarily adapted to solving complex and open-ended data science problems in realistic environments
no direct access to solutions and data scientists
no incentive to collaborate
32
DATA CHALLENGES
34. Center for Data Science
Paris-Saclay
Direct access to code, prototyping
Incentivizing diversity
Incentivizing collaboration
Training
Networking
34
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
35. Center for Data Science
Paris-Saclay
Our experience with the HiggsML challenge
Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
Collaboration with management scientists specializing in managing innovation
Michael Nielsen's book: Reinventing Discovery
5+ iterations so far
35
WHERE DOES IT COME FROM?
36. Center for Data Science
Paris-Saclay
UNIVERSITÉ PARIS-SACLAY
36
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
37. Center for Data Science
Paris-Saclay
37
Center for Data Science
Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
Biology & bioinformatics
IBISC/UEvry
LRI/UPSud
Hepatinov
CESP/UPSud-UVSQ-Inserm
IGM-I2BC/UPSud
MIA/Agro
MIAj-MIG/INRA
LMAS/Centrale
Chemistry
EA4041/UPSud
Earth sciences
LATMOS/UVSQ
GEOPS/UPSud
IPSL/UVSQ
LSCE/UVSQ
LMD/Polytechnique
Economy
LM/ENSAE
RITM/UPSud
LFA/ENSAE
Neuroscience
UNICOG/Inserm
U1000/Inserm
NeuroSpin/CEA
Particle physics
astrophysics &
cosmology
LPP/Polytechnique
DMPH/ONERA
CosmoStat/CEA
IAS/UPSud
AIM/CEA
LAL/UPSud
250 researchers in 35 laboratories
Machine learning
LRI/UPSud
LTCI/Telecom
CMLA/Cachan
LS/ENSAE
LIX/Polytechnique
MIA/Agro
CMA/Polytechnique
LSS/Supélec
CVN/Centrale
LMAS/Centrale
DTIM/ONERA
IBISC/UEvry
Visualization
INRIA
LIMSI
Signal processing
LTCI/Telecom
CMA/Polytechnique
CVN/Centrale
LSS/Supélec
CMLA/Cachan
LIMSI
DTIM/ONERA
Statistics
LMO/UPSud
LS/ENSAE
LSS/Supélec
CMA/Polytechnique
LMAS/Centrale
MIA/AgroParisTech
machine learning
information retrieval
signal processing
data visualization
databases
Domain science
human society
life
brain
earth
universe
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
Domain scientist
Software engineer
datascience-paris-saclay.fr
LIST/CEA
38. 38
THE DATA SCIENCE LANDSCAPE
Domain science
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
Data scientist
Data trainer
Applied scientist
Domain scientist
Software engineer
Data engineer
Data science
statistics
machine learning
information retrieval
signal processing
data visualization
databases
Tool building
software engineering
clouds/grids
high-performance
computing
optimization
39. Center for Data Science
Paris-Saclay
39
https://medium.com/@balazskegl
40. Center for Data Science
Paris-Saclay
TOOLS: LANDSCAPE TO ECOSYSTEM
40
Data scientist
Data trainer
Applied scientist
Domain expert
Software engineer
Data engineer
Tool building Data domains
Data science
statistics
machine learning
information retrieval
signal processing
data visualization
databases
interdisciplinary projects
matchmaking tool
design and innovation strategy workshops
data challenges
coding sprints
Open Software Initiative
code consolidator and engineering projects
software engineering
clouds/grids
high-performance
computing
optimization
energy and physical sciences
health and life sciences
Earth and environment
economy and society
brain
data science RAMPs and TSs (training sprints)
IT platform for linked data
annotation tools
SaaS data science platform
41. Center for Data Science
Paris-Saclay
Modularizing the collaboration
independent subtasks
reduces barriers
broadens the range of available expertise
Encouraging small contributions
Rich and well-structured information commons
so people can build on earlier work
41
NIELSEN'S CROWDSOURCING PRINCIPLES
42. Center for Data Science
Paris-Saclay
42
RAMPS
Single-day coding sessions
20-40 participants
preparation is similar to challenges
Goals
focusing and motivating top talents
promoting collaboration, speed, and efficiency
solving (prototyping) real problems
43. 43
TRAINING SPRINTS
Single-day training sessions
20-40 participants
focusing on a single subject (deep learning, model tuning, functional data, etc.)
preparing RAMPs
55. 55
CONCLUSIONS
Explore the open innovation space
read Nielsen's book
Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
Come to our CIML WS tomorrow