An invited talk for Lilly's Global IT Seminar Meeting in November 2016 on the subject of data, machine learning, AI, semantic web, text mining and spinach!
Data: The Good, The Bad & The Ugly
1. Data: The Good, The Bad
& The Ugly
Lee Harland
@SciBitely
http://www.scibite.com
http://www.slideshare.net/scibitely
Lilly Global IT Meeting November 2016
2. Context
This is an invited talk I gave at Lilly's internal Global IT meeting on the
subject of data
10. + =
... (Promotion of) the nutritional importance of spinach over
other foods, led to an increase of over 30 per cent in its
consumption during the 1920s and 30s.
The action of S. oleracea on cardiovascular
output and muscular tone
11. Bad, Bad Data Point
1870 35.2 mg Fe/100g
1937 3.52 mg Fe/100g
The mythical strength-giving properties of
spinach are ... credited to a simple mistake
concerning the iron content of the vegetable.
In 1870, Dr E von Wolf published figures
which were accepted until the 1930s, when
they were rechecked
This revealed that a decimal point had been
placed wrongly and that the real figure was
only one tenth of Dr von Wolf's claim
16. Spinach: One Small Data Point, One Huge Mess
1870 35.2 mg Fe/100g
1937 3.52 mg Fe/100g
Both Values Are Correct. The difference is down to the assay conditions.
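A minimal sketch of the point on this slide, assuming a hypothetical `Measurement` record: if each value carries its assay conditions, the two figures stop looking like a typo and start looking like two different experiments. The condition labels are placeholders, since the slides do not say what the actual conditions were.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Measurement:
    """One iron measurement plus the context needed to interpret it."""
    year: int
    value_mg_per_100g: float
    assay_conditions: str  # placeholder label; real conditions are not given in the slides


def comparable(a: Measurement, b: Measurement) -> bool:
    """Two values should only be compared directly if the assay conditions match."""
    return a.assay_conditions == b.assay_conditions


old = Measurement(1870, 35.2, "conditions A (placeholder)")
new = Measurement(1937, 3.52, "conditions B (placeholder)")

# The two numbers differ by a factor of ten, which is why a
# misplaced decimal point seemed like a plausible explanation.
ratio = old.value_mg_per_100g / new.value_mg_per_100g
print("factor of ten:", math.isclose(ratio, 10.0))

# With provenance attached, the mismatch is flagged rather than "corrected".
print("directly comparable:", comparable(old, new))
```

The design choice is simply that the number never travels without its context.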
20. estimates for the reproducibility of preclinical research range
from 51 percent to 89 percent. They estimate that at least half of
all U.S. preclinical biomedical research funding, about
$28 billion annually, is therefore squandered
http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002165
22. Provenance Is A Critical Component of Reproducibility
What L cells, where from,
how old, epigenetic profile
etc etc?
When, how often, in what
way, using what
system?????
What, when, how?
Could you accurately reproduce this experiment from this method?
* I was responsible for
this paragraph
23. http://www.nature.com/nrd/journal/v10/n9/full/nrd3439-c1.html
A first-of-a-kind analysis of Bayer's internal efforts to validate 'new drug
target' claims now not only supports this view but suggests that 50%
may be an underestimate; the company's in-house
experimental data do not match literature claims in
65% of target-validation projects, leading to project
discontinuation.
33. It hasn't just got 3 names, it's got LOTS
carboxypeptidase B-like protein OR thrombin-activatable fibrinolysis
inhibitor OR CPB type 2 OR Carboxypeptidase type B2 OR
plasma carboxypeptidase type B OR carboxypeptidase type B2 OR
CPB2 OR Plasma carboxypeptidase type B OR CPB-2 OR
carboxypeptidase B2 (plasma),carboxypeptidase U OR
Carboxypeptidase type U OR carboxypeptidase type U OR plasma
carboxypeptidase B2 OR carboxy-peptidylase U OR thrombin-
activable fibrinolysis inhibitor OR plasma carboxypeptidase type
B2 OR carboxypeptidase B2 (plasma OR CPU OR
carboxypeptidase B2 OR PCPB OR pCPB OR Carboxypeptidase U
OR plasma carboxypeptidase B OR TAFI OR Carboxypeptidase B2
OR Plasma carboxypeptidase B OR Thrombin-activable
fibrinolysis inhibitor OR carboxypeptidase B2 plasma OR
carboxypeptidase R
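Rather than writing that OR query by hand, the synonyms can be held in a dictionary and matched automatically. The following is a rough sketch, not SciBite's implementation: the names `SYNONYMS`, `build_lookup` and `find_mentions` are made up for illustration, only a handful of the synonyms above are included, and `CPB2` is used as the canonical label.

```python
import re

# A small subset of the synonyms on the slide, keyed by one canonical label.
SYNONYMS = {
    "CPB2": [
        "TAFI",
        "carboxypeptidase U",
        "carboxypeptidase B2",
        "carboxypeptidase R",
        "plasma carboxypeptidase B",
        "thrombin-activatable fibrinolysis inhibitor",
    ],
}


def build_lookup(synonyms):
    """Map every lower-cased name (and the canonical label itself) to its label."""
    lookup = {}
    for canonical, names in synonyms.items():
        lookup[canonical.lower()] = canonical
        for name in names:
            lookup[name.lower()] = canonical
    return lookup


def find_mentions(text, lookup):
    """Return (offset, matched text, canonical label) for each synonym found."""
    # Longest names first, so a longer synonym wins over a shorter one
    # that happens to be a prefix of it.
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


LOOKUP = build_lookup(SYNONYMS)
for hit in find_mentions("TAFI levels rose; Carboxypeptidase U was also measured.", LOOKUP):
    print(hit)
```

Short, ambiguous synonyms such as CPU are deliberately left out of the sketch; case-insensitive matching on those would need surrounding context to avoid false hits.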
34. We also manually standardized
data related to lab measurement
units and terminology related to
patient race and ethnicity,
geographical study regions, and
names of drugs and drug
families.
Yet Another Issue
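The manual unit standardization described in that quote can, in part, be expressed as a lookup table. This is a hedged sketch only: the table `TO_G_PER_L` and the function `to_g_per_l` are hypothetical, and the four mass-concentration units are chosen purely for illustration, not taken from the quoted study.

```python
# Conversion factors from a reported unit to grams per litre.
# 1 dL = 0.1 L, so 1 g/dL = 10 g/L and 1 mg/dL = 0.01 g/L.
TO_G_PER_L = {
    "g/l": 1.0,
    "g/dl": 10.0,
    "mg/dl": 0.01,
    "mg/l": 0.001,
}


def to_g_per_l(value, unit):
    """Convert a lab value to g/L, tolerating case and spacing differences."""
    key = unit.strip().lower().replace(" ", "")
    if key not in TO_G_PER_L:
        # Fail loudly: silently guessing a unit is how bad data points are born.
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_G_PER_L[key]


print(to_g_per_l(150, "mg/dL"))
print(to_g_per_l(1.5, "G / L"))
```

Unknown units raise an error instead of passing through, so unrecognised spellings end up in front of a person rather than in the dataset.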
38. Just What Is The Data?
Mentions of all
Genes, Diseases, Drugs, Tissues, Cells, Techniques,
Assays, Measures, Protocols, Compounds, Regimens,
Companies, People, Locations, Pathologies, Adverse
Events, Pathways, Metabolism, Manufacturing Concepts,
QC/QA, Pathogens, Strains, Animals and so on...
• And their relationships to each other
• And their locations (section, database column)
• Inferring relationships between documents/entries
• Regardless of actual keyword used
44. Why Give Ugly Data A Makeover?
ELN annotation using Bioassay Ontology
Find all experiments using any Cell Fluorescence technique
Pharmacovigilance
Monitoring newsfeeds & internal data for safety signals
Automatic Process Notification
Alert groups based on content of CRO documents, etc.
Synergise Both Semantic Technology & Information Professionals
Re-energise Therapeutic Area Literature Searching
Build Knowledge Chains (Assertional Provenance)
Project Management ELN Data Screen SOP
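The "find all experiments using any Cell Fluorescence technique" use case on this slide relies on an ontology hierarchy: a search for a parent term should also return experiments annotated with its child terms. Below is a toy sketch of that idea. The hierarchy and the experiment IDs are invented for illustration and are not taken from the BioAssay Ontology.

```python
# Toy hierarchy: parent term -> direct child terms (illustrative only).
TOY_ONTOLOGY = {
    "cell fluorescence technique": ["flow cytometry", "fluorescence microscopy"],
    "fluorescence microscopy": ["confocal microscopy"],
}

# Hypothetical ELN entries, each annotated with one technique term.
EXPERIMENTS = {
    "EXP-1": "confocal microscopy",
    "EXP-2": "ELISA",
    "EXP-3": "flow cytometry",
}


def descendants(term, ontology):
    """Collect every term below `term` in the hierarchy."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in ontology.get(current, []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def experiments_using(term, ontology, experiments):
    """Return IDs of experiments annotated with `term` or any of its descendants."""
    wanted = descendants(term, ontology) | {term}
    return sorted(exp_id for exp_id, tech in experiments.items() if tech in wanted)


print(experiments_using("cell fluorescence technique", TOY_ONTOLOGY, EXPERIMENTS))
```

A plain keyword search for "fluorescence" would miss the flow cytometry entry here; the annotation plus hierarchy is what finds it.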
46. Spinach: The Truth Is Out There!
Spinach is high
in iron (!)
..oxalic acid in spinach prevents
more than 90% of iron from being
absorbed..
Acknowledgement
47. Acknowledgements
IMI Open PHACTS Team
(many more involved, I just
don't have a photo :( )
http://openphacts.org
SciBite Team
http://scibite.com