Center for Data Science
Paris-Saclay
CNRS & University Paris Saclay

BALÁZS KÉGL
WHAT IS WRONG WITH DATA CHALLENGES
THE HIGGSML STORY: THE GOOD, THE BAD AND THE UGLY

Why am I so critical?

Why do I downplay our own success with the HiggsML?

Because I believe that there is enormous potential in open innovation/crowdsourcing in science.

The current data challenge format is a single point in the landscape.

INTERMEDIARIES: THE GROWING INTEREST IN « CROWDS » -> EXPLOSION OF TOOLS
Crowdsourcing is a model that leverages novel technologies (web 2.0, mobile apps, social networks) to build content and a structured set of information by gathering contributions from large groups of individuals.
(Olga Kokshagina, 2015)

CROWDSOURCING ANNOTATION
CROWDSOURCING COLLECTION AND ANNOTATION
CROWDSOURCING MATH
CROWDSOURCING ANALYTICS
OPEN SOURCE
NEW PUBLICATION MODELS
THE BOOK TO READ

OUTLINE
• Summary of our conclusions after the HiggsML challenge
  • The good, the bad and the ugly
• Elaborating on some of the points
• Rapid Analytics and Model Prototyping
  • an experimental format we have been developing

CIML WORKSHOP TOMORROW

THE GOOD
• Publicity, awareness
  • both in physics (about the technology) and in ML (about the problem)
• Triggering open data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• Learning a lot from Gábor on how to win a challenge
• Gábor getting hired by Google DeepMind
• Benchmarking
• Tool dissemination (xgboost, keras)

THE BAD
• No direct access to code
• No direct access to data scientists
• No fundamentally new ideas
• No incentive to collaborate

THE UGLY
• 18 months to prepare
• legal issues, access to data
• problem formulation: intellectually way more interesting than the challenge itself, but difficult to market or to crowdsource
• once a problem is formalized/formatted as a challenge, the problem is solved ("learning is easy", Gael Varoquaux)

THE UGLY
• We asked the wrong question, on purpose!
• because the right questions are complex and don't fit the challenge setup
• asking them would have led to way less participation
• asking them would have led to bitterness among the participants, bad (?) for marketing

PUBLICITY, AWARENESS
• The HiggsML challenge on Kaggle
  • https://www.kaggle.com/c/higgs-boson
[Figure: slide "CLASSIFICATION FOR DISCOVERY" from B. Kégl / AppStat@LAL, "Learning to discover"]

AWARENESS DYNAMICS
• HEPML workshop @NIPS14
  • JMLR WS proceedings: http://jmlr.csail.mit.edu/proceedings/papers/v42
• CERN Open Data
  • http://opendata.cern.ch/collection/ATLAS-Higgs-Challenge-2014
• DataScience@LHC
  • http://indico.cern.ch/event/395374/
• Flavors of physics challenge
  • https://www.kaggle.com/c/flavours-of-physics

LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worth it (risk assessment)
  • variance estimate of the score
• Don't use the public leaderboard score for model selection
• None of Gábor's 200 out-of-the-ordinary ideas worked
Source: https://indico.lal.in2p3.fr/event/2692/contribution/1/material/slides/0.pdf
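
The "CV bagging" and "variance estimate of the score" points can be illustrated with a toy sketch. This is not the winner's actual pipeline: the one-dimensional centroid model, the synthetic data, and the function names (`fit_centroids`, `predict_score`, `cv_bagging`) are assumptions made purely for illustration. The idea shown is to train one model per cross-validation fold, look at the spread of the held-out fold scores, and average the fold models' predictions on the test points.

```python
import random
from statistics import mean, stdev


def fit_centroids(X, y):
    """Toy 1-D model (an assumption): remember the mean feature of each class."""
    mu0 = mean(x for x, t in zip(X, y) if t == 0)
    mu1 = mean(x for x, t in zip(X, y) if t == 1)
    return mu0, mu1


def predict_score(model, x):
    """Score in [0, 1]; higher means closer to the class-1 centroid."""
    mu0, mu1 = model
    d0, d1 = abs(x - mu0), abs(x - mu1)
    return d0 / (d0 + d1) if d0 + d1 > 0 else 0.5


def cv_bagging(X, y, X_test, n_folds=5, seed=0):
    """Train one model per fold; return averaged test scores and fold accuracies."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    fold_acc, test_scores = [], []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        # Accuracy on the held-out fold, never on the training part.
        hits = sum((predict_score(model, X[i]) > 0.5) == (y[i] == 1) for i in held_out)
        fold_acc.append(hits / len(held_out))
        test_scores.append([predict_score(model, x) for x in X_test])
    # Bagging step: average the K fold models' scores for each test point.
    bagged = [mean(col) for col in zip(*test_scores)]
    return bagged, fold_acc


# Synthetic data (hypothetical): class 0 centred at 0, class 1 centred at 2.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
X = [rng.gauss(2.0 * t, 1.0) for t in y]

bagged, fold_acc = cv_bagging(X, y, X_test=[-1.0, 3.0])
print("bagged test scores:", bagged)
print("fold accuracy mean and std:", mean(fold_acc), stdev(fold_acc))
```

The spread of the fold accuracies gives a rough idea of how much a single score can move by chance, which is one way to do the risk assessment mentioned above before investing effort.
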

BENCHMARKING
[Figure: slide "CLASSIFICATION FOR DISCOVERY"]

But what score did we optimize? And why?

CLASSIFICATION FOR DISCOVERY
Goal: optimize the expected discovery significance
[Figure: signal and background distributions with the selection threshold marked; axis labels: probability, count (per year), flux × time]
• expected background in the selection: say, b = 100 events
• total count: say, 150 events
• excess is s = 50 events
• AMS = s / \sqrt{b} = 50 / \sqrt{100} = 5 sigma

When optimizing the design of the selection region, we do not know n and the background expectation \mu_b. We estimate \mu_b by its empirical counterpart b, and n by s + b, to obtain the approximate median significance
$$\mathrm{AMS}_2 = \sqrt{2\left((s + b)\ln\left(1 + \frac{s}{b}\right) - s\right)} \quad (14)$$
When b \gg s, AMS_2 is practically indistinguishable from
$$\mathrm{AMS}_3 = \frac{s}{\sqrt{b}} \quad (15)$$
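
The two significance formulas (14) and (15) are easy to evaluate numerically. The sketch below uses the slide's example numbers (b = 100, s = 50); the function names `ams2` and `ams3` are chosen here for illustration and are not taken from any challenge code.

```python
import math


def ams2(s, b):
    """Approximate median significance, equation (14)."""
    return math.sqrt(2.0 * ((s + b) * math.log1p(s / b) - s))


def ams3(s, b):
    """Simplified form s / sqrt(b), equation (15), close to ams2 when b >> s."""
    return s / math.sqrt(b)


# Example from the slide: 100 expected background events, 50 excess events.
print("AMS3:", ams3(50, 100))  # 5.0
print("AMS2:", ams2(50, 100))

# With a much larger background, the two scores nearly coincide.
print("difference for s=1, b=10000:", abs(ams2(1, 10000) - ams3(1, 10000)))
```

With s = 50 and b = 100 the condition b >> s does not really hold, so the two scores differ noticeably; the simplified form is only a surrogate when the background dominates.
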
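
How to handle systematic (model) uncertainties?
• OK, so let's design an objective function that can take background systematics into consideration
• Likelihood with unknown background b \sim \mathcal{N}(\mu_b, \sigma_b):
$$L(\mu_s, \mu_b) = P(n, b \mid \mu_s, \mu_b, \sigma_b) = \frac{(\mu_s + \mu_b)^n}{n!}\, e^{-(\mu_s + \mu_b)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_b}\, e^{-(b - \mu_b)^2 / (2\sigma_b^2)}$$
• Profile likelihood ratio:
$$\lambda(0) = \frac{L(0, \hat{\hat{\mu}}_b)}{L(\hat{\mu}_s, \hat{\mu}_b)}$$
• The new Approximate Median Significance (by Glen Cowan):
$$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\,\sigma_b^2}\right)$$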
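
A minimal sketch of the AMS with background systematics, written directly from the formula above. The function name `ams_with_systematics` and the example value sigma_b = 10 are assumptions for illustration only.

```python
import math


def ams_with_systematics(s, b, sigma_b):
    """AMS with a Gaussian uncertainty sigma_b on the background estimate b."""
    if sigma_b <= 0:
        raise ValueError("sigma_b must be positive")
    var_b = sigma_b ** 2
    # b0 as defined on the slide.
    b0 = 0.5 * (b - var_b + math.sqrt((b - var_b) ** 2 + 4.0 * (s + b) * var_b))
    return math.sqrt(
        2.0 * ((s + b) * math.log((s + b) / b0) - s - b + b0)
        + (b - b0) ** 2 / var_b
    )


# Same s and b as in the earlier example, with a hypothetical 10% background uncertainty.
print("sigma_b = 10:  ", ams_with_systematics(50, 100, 10.0))
# A very small uncertainty brings the score back near the AMS without systematics.
print("sigma_b = 0.01:", ams_with_systematics(50, 100, 0.01))
```

As sigma_b shrinks, b0 tends to b and the last term vanishes, so the score approaches the earlier AMS without systematics; a larger sigma_b lowers the significance for the same s and b.
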

HOW TO HANDLE SYSTEMATIC UNCERTAINTIES
Why didn't we use it?

How to handle systematic (model) uncertainties?
• The new Approximate Median Significance:
$$\mathrm{AMS} = \sqrt{2\left((s + b)\ln\frac{s + b}{b_0} - s - b + b_0\right) + \frac{(b - b_0)^2}{\sigma_b^2}}$$
where
$$b_0 = \frac{1}{2}\left(b - \sigma_b^2 + \sqrt{(b - \sigma_b^2)^2 + 4(s + b)\,\sigma_b^2}\right)$$
[Figure: curves labeled New AMS, ATLAS, and Old AMS]

LEARNING FROM THE WINNER
• Sophisticated cross validation, CV bagging
• Sophisticated calibration and model averaging
• The first step: pro participants check whether the effort is worth it (risk assessment)
  • variance estimate of the score
• Don't use the public leaderboard score for model selection
• None of Gábor's 200 out-of-the-ordinary ideas worked

THE TWO MOST COMMON DATA CHALLENGE KILLERS
• Leakage
• Variance of the test score

VARIANCE OF THE TEST SCORE
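
One way to get a feel for the variance of a test score is a bootstrap over the test examples. The sketch below is an illustration under stated assumptions, not something taken from the talk: it uses plain accuracy as the score, two hypothetical test sets with the same 90% accuracy, and a helper named `bootstrap_score_std`.

```python
import random
from statistics import stdev


def bootstrap_score_std(correct, n_boot=200, seed=0):
    """Bootstrap estimate of the standard deviation of test accuracy.

    `correct` holds one 0/1 entry per test example (1 = correctly classified).
    """
    rng = random.Random(seed)
    n = len(correct)
    scores = []
    for _ in range(n_boot):
        resample = rng.choices(correct, k=n)  # sample with replacement
        scores.append(sum(resample) / n)
    return stdev(scores)


# Two hypothetical test sets with the same 90% accuracy but different sizes.
small_test = [1] * 90 + [0] * 10
large_test = [1] * 2250 + [0] * 250

std_small = bootstrap_score_std(small_test)
std_large = bootstrap_score_std(large_test)
print("std of accuracy, n=100: ", std_small)
print("std of accuracy, n=2500:", std_large)
```

If the gap between two leaderboard entries is smaller than this spread, their ranking is largely a matter of chance, which is why a small or noisy test set can undermine a challenge.
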

DATA CHALLENGES
Challenges are useful for
• generating visibility in the data science community about novel application domains
• benchmarking state-of-the-art techniques in a fair way on well-defined problems
• finding talented data scientists
Limitations
• not necessarily adapted to solving complex and open-ended data science problems in realistic environments
• no direct access to solutions and data scientists
• no incentive to collaborate

We decided to design something better

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Direct access to code, prototyping
• Incentivizing diversity
• Incentivizing collaboration
• Training
• Networking

WHERE DOES IT COME FROM?
• Our experience with the HiggsML challenge
• Need to connect data scientists to domain scientists and problems at the Paris-Saclay Center for Data Science
• Collaboration with management scientists specializing in managing innovation
• Michael Nielsen's book: Reinventing Discovery
• 5+ iterations so far

UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion

Center for Data Science Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
250 researchers in 35 laboratories
• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
• LIST/CEA
[Diagram: data science (machine learning, information retrieval, signal processing, data visualization, databases); domain science (human society, life, brain, earth, universe); tool building (software engineering, clouds/grids, high-performance computing, optimization); domain scientist, software engineer]

THE DATA SCIENCE LANDSCAPE
• Domain science: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data scientist, Data trainer, Applied scientist, Domain scientist, Software engineer, Data engineer
https://medium.com/@balazskegl

TOOLS: LANDSCAPE TO ECOSYSTEM
• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• Data scientist, Data trainer, Applied scientist, Domain expert, Software engineer, Data engineer
Tools
• interdisciplinary projects
• matchmaking tool
• design and innovation strategy workshops
• data challenges
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and TSs (training sprints)
• IT platform for linked data
• annotation tools
• SaaS data science platform

NIELSEN'S CROWDSOURCING PRINCIPLES
• Modularizing the collaboration
  • independent subtasks
  • reduces barriers
  • broadens the range of available expertise
• Encouraging small contributions
• Rich and well-structured information commons
  • so people can build on earlier work

RAMPS
• Single-day coding sessions
  • 20-40 participants
  • preparation is similar to challenges
• Goals
  • focusing and motivating top talents
  • promoting collaboration, speed, and efficiency
  • solving (prototyping) real problems

TRAINING SPRINTS
• Single-day training sessions
  • 20-40 participants
  • focusing on a single subject (deep learning, model tuning, functional data, etc.)
  • preparing RAMPs

ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE

ANALYTICS TOOLS TO MONITOR PROGRESS

RAPID ANALYTICS AND MODEL PROTOTYPING
• 2015 Jan 15: The HiggsML challenge
• 2015 Apr 10: Classifying variable stars (accuracy improvement: 89% to 96%)
• 2015 June 16 and Sept 26: Predicting El Niño (RMSE improvement: 0.9°C to 0.4°C)
• 2015 October 8: Insect classification (accuracy improvement: 30% to 70%)

CONCLUSIONS
• Explore the open innovation space
  • read Nielsen's book
• Drop me a mail (balazs.kegl@gmail.com) if you are interested in beta-testing the RAMP tool
• Come to our CIML WS tomorrow

THANK YOU!
