Data: The Good, The Bad & The Ugly
Lee Harland
@SciBitely
http://www.scibite.com
http://www.slideshare.net/scibitely
Lilly Global IT Meeting November 2016
Context
This is an invited talk I gave at Lilly's internal Global IT meeting on the
subject of data
The Good
http://www.nejm.org/doi/full/10.1056/NEJMp1606181
What matters to me!
The Bad
+ =
... (Promotion of) the nutritional importance of spinach over
other foods led to an increase of over 30 per cent in its
consumption during the 1920s and 30s.
The action of S. oleracea on cardiovascular
output and muscular tone
Bad, Bad Data Point
1870 35.2 mg Fe/100g
1937 3.52 mg Fe/100g
The mythical strength-giving properties of
spinach are ... credited to a simple mistake
concerning the iron content of the vegetable.
In 1870, Dr E von Wolf published figures
which were accepted until the 1930s, when
they were rechecked
This revealed that a decimal point had been
placed wrongly and that the real figure was
only one tenth of Dr von Wolf's claim
Still Making Headlines After 140 Years
2013
There Is No Decimal Point Error
Spinach: One Small Data Point, One Huge Mess
1870 35.2 mg Fe/100g
1937 3.52 mg Fe/100g


Both Values Are Correct: the difference is down to the assay conditions
http://www.merriam-webster.com/dictionary/provenance
35.2
35.2
The datapoint + its provenance
(experimental context)
What people saw
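One way to keep a data point attached to its provenance is to store the two together, so the bare number never travels alone. The sketch below is illustrative only: the `Measurement` class, its field names, and the condition labels are assumptions made for this example, since the slide does not state the actual assay conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A value that always carries its experimental context."""
    value: float
    unit: str
    year: int
    assay_conditions: str  # hypothetical label; real conditions are not given on the slide


def comparable(a: Measurement, b: Measurement) -> bool:
    """Treat two measurements as directly comparable only if unit and conditions match."""
    return a.unit == b.unit and a.assay_conditions == b.assay_conditions


# Values from the slide; the condition labels are placeholders.
m1870 = Measurement(35.2, "mg Fe/100g", 1870, "conditions A (placeholder)")
m1937 = Measurement(3.52, "mg Fe/100g", 1937, "conditions B (placeholder)")

print(comparable(m1870, m1937))
```

With the context attached, the two spinach values stop looking like a contradiction and start looking like two different experiments.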
So What?
estimates for the reproducibility of preclinical research range
from 51 percent to 89 percent. They estimate that at least half of
all U.S. preclinical biomedical research funding, about
$28 billion annually, is therefore squandered
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165
http://www.merriam-webster.com/dictionary/provenance
Provenance Is A Critical Component of Reproducibility
What L cells, where from,
how old, epigenetic profile
etc etc?
When, how often, in what
way, using what
system?????
What, when, how?
Could you accurately reproduce this experiment from this method?
* I was responsible for
this paragraph
http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html
A first-of-a-kind analysis of Bayer's internal efforts to validate 'new drug
target' claims now not only supports this view but suggests that 50%
may be an underestimate; the company's in-house
experimental data do not match literature claims in
65% of target-validation projects, leading to project
discontinuation.
This is where Informatics & Data Science can add real value to Drug Discovery
Open PHACTS https://www.openphacts.org/
Open PHACTS: Adding Provenance To Data
http://nanopub.org/
sub:Head {
this: np:hasAssertion sub:assertion ;
np:hasProvenance sub:provenance ;
np:hasPublicationInfo sub:pubinfo ;
a np:Nanopublication .
}
sub:assertion {
nx:NX_P35712 bfo:BFO_0000066 ts:TS-0276 ; # Protein NX_P35712 is localized in tissue TS-0276
ro:has_quality "positive" .
}
sub:provenance {
<http://www.nextprot.org/help/quality_criteria/silver> a eco:ECO_0000205 ;
rdfs:label "neXtProt silver"^^xsd:string .
sub:_1 a efo:EFO_00027688 .
sub:_10 a eco:ECO_0000218 .
sub:_2 a eco:ECO_0000218 .
sub:_9 a efo:EFO_00027688 .
sub:assertion prv:usedData <http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000087&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693> ,
<http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000088&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693> ,
<http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000090&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ,
<http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000092&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ,
<http://bgee.unil.ch/bgee/bgee?page=expression&action=data&stage_id=HsapDO:0000094&amp;organ_id=EV:0100115&amp;gene_id=ENSG00000110693&amp;stage_children=on> ;
wi:evidence <http://www.nextprot.org/help/quality_criteria/silver> ;
a eco:ECO_0000220 ;
rdfs:comment " data, NX_P35712 is expressed in Endometrium"^^xsd:string ;
prov:wasDerivedFrom sub:_1 , sub:_3 , sub:_5 , sub:_7 , sub:_9 ;
prov:wasGeneratedBy sub:_10 , sub:_2 , sub:_4 , sub:_6 , sub:_8 .
}
sub:pubinfo {
sub:_11 a eco:ECO_0000205 .
sub:_12 a eco:ECO_0000205 .
sub:_15 a eco:ECO_0000205 .
this: dcterms:created "2014-09-19T00:00:00.0Z"^^xsd:dateTime ;
dcterms:rights <http://creativecommons.org/licenses/by/3.0/> ;
dcterms:rightsHolder <http://nextprot.org> ;
prv:usedData "neXtProt database" ;
pav:authoredBy "CALIPHO project" , <http://orcid.org/0000-0001-6710-1373> , <http://orcid.org/0000-0001-6818-334X> , <http://orcid.org/0000-0002-1303-2189> , <http://orcid.org/0000-0003-1813-6857> ;
pav:versionNumber "3" ;
prov:wasGeneratedBy sub:_11 , sub:_12 , sub:_13 , sub:_14 , sub:_15 .
}
http://nanopub.org
https://explorer.openphacts.org
One of the few user interfaces where provenance is intrinsically there
The Ugly
80-90% of all potentially usable business information
may originate in unstructured form
https://en.wikipedia.org/wiki/Unstructured_data
The Ugly
Carboxypeptidase B2
Thrombin-Activatable Fibrinolysis Inhibitor
Plasma CPU
The True Picture
(they are the same thing)
It hasn't just got 3 names, it's got LOTS
carboxypeptidase B-like protein OR thrombin-activatable fibrinolysis
inhibitor OR CPB type 2 OR Carboxypeptidase type B2 OR
plasma carboxypeptidase type B OR carboxypeptidase type B2 OR
CPB2 OR Plasma carboxypeptidase type B OR CPB-2 OR
carboxypeptidase B2 (plasma),carboxypeptidase U OR
Carboxypeptidase type U OR carboxypeptidase type U OR plasma
carboxypeptidase B2 OR carboxy-peptidylase U OR thrombin-activable
fibrinolysis inhibitor OR plasma carboxypeptidase type
B2 OR carboxypeptidase B2 (plasma OR CPU OR
carboxypeptidase B2 OR PCPB OR pCPB OR Carboxypeptidase U
OR plasma carboxypeptidase B OR TAFI OR Carboxypeptidase B2
OR Plasma carboxypeptidase B OR Thrombin-activable
fibrinolysis inhibitor OR carboxypeptidase B2 plasma OR
carboxypeptidase R
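The synonym problem above can be sketched as a lookup that folds many surface names onto one identifier, and then runs in reverse to build the kind of OR query shown on the slide. This is a minimal sketch under stated assumptions: the table holds only a handful of the names listed, the choice of "CPB2" as the canonical key is made for illustration, and the function names are hypothetical.

```python
from typing import Optional

# Illustrative synonym table: a few of the names from the slide, all mapped
# to one identifier. Using "CPB2" as the canonical key is an assumption.
SYNONYMS = {
    "carboxypeptidase b2": "CPB2",
    "thrombin-activatable fibrinolysis inhibitor": "CPB2",
    "plasma carboxypeptidase b": "CPB2",
    "carboxypeptidase u": "CPB2",
    "carboxypeptidase r": "CPB2",
    "tafi": "CPB2",
    "cpu": "CPB2",
    "pcpb": "CPB2",
    "cpb2": "CPB2",
}


def normalise(name: str) -> Optional[str]:
    """Return the canonical identifier for a name, or None if it is unknown."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


def expand(identifier: str) -> str:
    """Build an OR query covering every known synonym of an identifier."""
    names = sorted(n for n, i in SYNONYMS.items() if i == identifier)
    return " OR ".join(f'"{n}"' for n in names)


print(normalise("TAFI"), normalise("Carboxypeptidase U"))
print(expand("CPB2"))
```

A flat table like this cannot tell whether "CPU" in a given sentence means the protein or a processor; deciding that needs the surrounding context, which is beyond this sketch.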
We also manually standardized
data related to lab measurement
units and terminology related to
patient race and ethnicity,
geographical study regions, and
names of drugs and drug
families.
Yet Another Issue
(an accident waiting to happen)
VARCHAR2
PROJ_TITLE
EXPERIMENT_INFO
ASSAY_DESCRIPTION
KEYWORDS
USER_PROFILE
SUMMARY
EXPT_METADATA
SETTINGS_INFO
REPORT_TEXT
EXPT_NAME
Databases: Where Knowledge Goes To Die
MEETING_MINUTES
PROJ_ACTIONS
ASSAY_CONCLUSION
COHORT_DESC
INCLUSION_CRITERIA
POLICY_DETAILS
PROJECT_OVERVIEW
RATIONALE
JUSTIFICATION
Text2Data MicroService: TERMite
TEXT: Supports basic keyword search only
DATA: Rich substrate for search and discovery & insight
Just What Is The Data?
Mentions of all:
Genes, Diseases, Drugs, Tissues, Cells, Techniques,
Assays, Measures, Protocols, Compounds, Regimens,
Companies, People, Locations, Pathologies, Adverse
Events, Pathways, Metabolism, Manufacturing Concepts,
QC/QA, Pathogens, Strains, Animals and so on...
- And their relationships to each other
- And their locations (section, database column)
- Inferring relationships between documents/entries
- Regardless of actual keyword used
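To make "mentions and their locations" concrete, here is a toy dictionary-based tagger that records which database column each entity was found in. It is not how TERMite works internally, which the talk does not describe; the vocabulary, the identifiers, the `Mention` structure and the sample rows are all hypothetical.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

# Toy vocabulary: surface form -> (entity type, identifier). All entries and
# identifiers are hypothetical and stand in for a curated ontology.
VOCAB: Dict[str, Tuple[str, str]] = {
    "tafi": ("GENE", "CPB2"),
    "carboxypeptidase u": ("GENE", "CPB2"),
    "thrombosis": ("DISEASE", "D-THROMBOSIS"),
}


class Mention(NamedTuple):
    entity_type: str
    identifier: str
    matched_text: str
    location: str  # e.g. a database column or document section
    start: int


def tag(text: str, location: str) -> List[Mention]:
    """Find vocabulary terms in text and record where each one was seen."""
    mentions = []
    for term, (etype, ident) in VOCAB.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append(Mention(etype, ident, m.group(0), location, m.start()))
    return sorted(mentions, key=lambda x: x.start)


# Two free-text columns that use different names for the same protein.
rows = {
    "ASSAY_DESCRIPTION": "TAFI activity measured in plasma",
    "RATIONALE": "Carboxypeptidase U inhibition may reduce thrombosis risk",
}
found = [m for col, txt in rows.items() for m in tag(txt, col)]
ids_by_column = {m.location: m.identifier for m in found if m.entity_type == "GENE"}
print(ids_by_column)
```

Both columns resolve to the same identifier even though they share no keyword, which is the "regardless of actual keyword used" point in miniature.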
Systems Integration Guide
http://yourcompany.com/termite?
text=<content>
app=<application name>
index=<e.g. page, table or column name>
ELN
Screening
Registry
PDM
Registry
Project Management
Sharepoint
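The integration pattern on the slide amounts to each system sending its text to one endpoint with three parameters. A minimal sketch of assembling such a request URL follows; the host is the slide's placeholder, the parameter names `text`, `app` and `index` are taken from the slide, and everything else (the function name, the example values, the use of a GET query string) is an assumption. Nothing is sent over the network, and the response format is not covered because the slide does not show it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint copied from the slide; not a real service address.
BASE_URL = "http://yourcompany.com/termite"


def build_request_url(text: str, app: str, index: str) -> str:
    """Assemble the query string from the three parameters shown on the slide."""
    query = urlencode({"text": text, "app": app, "index": index})
    return f"{BASE_URL}?{query}"


url = build_request_url(
    text="TAFI activity measured in plasma",
    app="ELN",                  # calling application (example value)
    index="ASSAY_DESCRIPTION",  # page, table or column name (example value)
)

# Round-trip the query string to confirm the parameters survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Encoding the parameters with `urlencode` rather than string concatenation matters here because the `text` value is free text and will routinely contain spaces, ampersands and other reserved characters.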
What's going on, right now
Trending Today
Why Give Ugly Data A Makeover?
- ELN annotation using Bioassay Ontology
  - Find all experiments using any Cell Fluorescence technique
- Pharmacovigilance
  - Monitoring newsfeeds & internal data for safety signals
- Automatic Process Notification
  - Alert groups based on content of CRO documents, etc.
- Synergise Both Semantic Technology & Information Professionals
  - Re-energise Therapeutic Area Literature Searching
- Build Knowledge Chains (Assertional Provenance)
  - Project Management -> ELN Data -> Screen SOP
Before I go..
Spinach: The Truth Is Out There!
Spinach is high
in iron (!)
..oxalic acid in spinach prevents
more than 90% of iron from being
absorbed..
Acknowledgement
Acknowledgements
IMI Open PHACTS Team
(many more involved, I just
don't have a photo :( )
http://openphacts.org
SciBite Team
http://scibite.com
