Five years of DisGeNET: Lessons learned and challenges ahead
Laura I. Furlong, IMIM-UPF
Big Data in Biomedicine, Barcelona, November 11, 2014

• Knowledge platform on human diseases and their genes
• Aims to cover all disease therapeutic areas
• Developed by integration of data from expert-curated resources and from the literature by text mining
• Centered on gene-disease association (GDA) and its supporting evidence and provenance
11/11/14 Laura I. Furlong 2

Rett Syndrome (UMLS:C0035372), MECP2 (NCBI:4204)

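The idea of a GDA that carries its own provenance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Evidence` schema, the `integrate` helper and the source labels `source_A` and `source_B` are hypothetical and do not reflect the actual DisGeNET data model. Only the two identifiers come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence for a gene-disease association (hypothetical schema)."""
    gene_id: str      # e.g. an NCBI Gene identifier
    disease_id: str   # e.g. a UMLS concept identifier
    source: str       # resource that reported the association
    pmid: Optional[str] = None  # supporting publication, if any


def integrate(evidence_items):
    """Group evidence by (gene, disease) pair, keeping the provenance of each pair."""
    merged = defaultdict(set)
    for item in evidence_items:
        merged[(item.gene_id, item.disease_id)].add(item.source)
    # Sort the sources so the output is deterministic.
    return {pair: sorted(sources) for pair, sources in merged.items()}


# Illustrative records only: the source labels are placeholders, not database content.
records = [
    Evidence("NCBI:4204", "UMLS:C0035372", "source_A"),
    Evidence("NCBI:4204", "UMLS:C0035372", "source_B", pmid="hypothetical-pmid"),
    Evidence("NCBI:4204", "UMLS:C0035372", "source_A"),
]
gdas = integrate(records)
```

Keying on standard identifiers rather than names is what lets records from different resources collapse into a single association while the list of sources is preserved.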
[Figure: data sources grouped by type. Curated: UniProt, CTD. Predicted: MGD, RGD, CTD. Literature: LHGDN, BeFree, GAD.]
381,056 GDAs, 16,666 genes, 13,172 diseases

Key points
1. Fragmentation of information as a barrier to knowledge on the mechanisms of human diseases

Standardization
• Controlled vocabularies and ontologies
• DisGeNET association type ontology
Open access
• RDF and nanopublications
• Data distributed under the Open Database Commons license
Links: http://semanticscience.org, http://lod-cloud.net/, http://opendatacommons.org/

Key points
2. High rate of data generation on GDAs poses challenges to biocuration pipelines

Text mining
• Publications on diseases and genes from 1980 (737,712 publications)
• Supervised machine-learning approach for relation extraction between genes and diseases
• 530,347 associations, 14,777 genes, 12,650 diseases, 355,976 publications
• Approx. 30,000 GDAs

• Nearly 70% of GDAs are supported by only one publication!
• 900 GDAs are supported by more than 200 publications
• Average of 2.8 publications per GDA

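Summary figures of this kind can be derived from a mapping of each GDA to its number of supporting publications. The sketch below is a minimal illustration on toy data. The function name `support_summary`, the threshold parameter and the GDA labels are assumptions made for the example, and the numbers are not taken from DisGeNET.

```python
from statistics import mean


def support_summary(pubs_per_gda, heavy_threshold=200):
    """Summarise how many publications support each GDA.

    pubs_per_gda maps a GDA label to its publication count.
    """
    counts = list(pubs_per_gda.values())
    single = sum(1 for c in counts if c == 1)
    heavy = sum(1 for c in counts if c > heavy_threshold)
    return {
        "fraction_single": single / len(counts),  # share of GDAs with one publication
        "n_heavy": heavy,                         # GDAs above the threshold
        "mean_pubs": mean(counts),                # average publications per GDA
    }


# Toy data only, chosen to show a long-tailed distribution.
toy_counts = {"gda_a": 1, "gda_b": 1, "gda_c": 1, "gda_d": 2, "gda_e": 250}
summary = support_summary(toy_counts)
```

In a long-tailed distribution like this one, a few heavily studied associations pull the mean well above the typical value, which is why the share of single-publication GDAs is reported alongside the average.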
[Figure: GDAs supported by only one publication]

Little overlap between text-mined data and data present in curated repositories
• Other publications, full text, etc.
• FN (false negatives) from TM (text mining)
• Non-relevant data for DBs
• Relevant data but still not curated
• FP (false positives) from TM

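One way to quantify such an overlap is with plain set operations on (gene, disease) pairs. The sketch below uses invented pairs. `compare_sources` and the identifiers `G1` to `G5` and `D1` to `D4` are hypothetical names for the example, not DisGeNET content.

```python
def compare_sources(text_mined, curated):
    """Split GDAs into shared, text-mined-only and curated-only sets."""
    text_mined, curated = set(text_mined), set(curated)
    union = text_mined | curated
    shared = text_mined & curated
    return {
        "shared": shared,
        "text_mined_only": text_mined - curated,
        "curated_only": curated - text_mined,
        # Jaccard index: size of the intersection over size of the union.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Invented (gene, disease) pairs for illustration.
text_mined = {("G1", "D1"), ("G2", "D1"), ("G3", "D2"), ("G4", "D3")}
curated = {("G1", "D1"), ("G5", "D4")}
overlap = compare_sources(text_mined, curated)
```

The two "only" sets are where the explanations listed on the slide apply: pairs seen only in curated data may be text-mining misses or come from material that was not mined, and pairs seen only in text-mined data may be errors or findings not yet curated.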
• This is not raw data (e.g. from NGS analysis); it is data already collated and filtered that has passed peer review before publication
• Text mined from abstracts:
  • Not mining full text or supplementary material
  • Not mining tables or figures
  • Focus on relations stated within sentences; anaphora is not handled
• We are just looking into a small fraction of all the available data!

Key points
2. High rate of data generation on GDAs poses challenges to biocuration pipelines
• Need to find alternative strategies to expert curation
• "Wisdom of the crowds" approaches (e.g. crowdsourcing)
• ?

Key points
3. Data prioritization to support interpretation of data on the genetic determinants of human diseases

GDAs prioritization
Ranks gene-disease associations based on the supporting evidence
DisGeNET score = S_{CURATED} + S_{PREDICTED} + S_{LITERATURE}
S_{CURATED} = W_{UNIPROT} + W_{CTD}
S_{PREDICTED} = W_{Rat} + W_{Mouse}
S_{LITERATURE} = W_{GAD} + W_{LHGDN} + W_{BeFree}

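The additive structure of the score can be sketched as a small function. The slide does not say how each weight W is derived, so the weights are passed in as given and the example values are placeholders. The function name and the dictionary keys are assumptions made for this illustration.

```python
# Source groups as named in the score definition.
CURATED = ("UNIPROT", "CTD")
PREDICTED = ("RAT", "MOUSE")
LITERATURE = ("GAD", "LHGDN", "BEFREE")


def disgenet_score(weights):
    """Sum per-source weights into the three components and the total score.

    weights maps a source name to its weight; missing sources count as 0.
    """
    def subtotal(names):
        return sum(weights.get(name, 0.0) for name in names)

    components = {
        "curated": subtotal(CURATED),
        "predicted": subtotal(PREDICTED),
        "literature": subtotal(LITERATURE),
    }
    return sum(components.values()), components


# Placeholder weights: a GDA reported by both curated sources and one animal model.
example_weights = {"UNIPROT": 0.3, "CTD": 0.3, "MOUSE": 0.1}
score, parts = disgenet_score(example_weights)
```

Returning the components alongside the total keeps the provenance of a score visible, so a user can tell a GDA backed by curated sources from one backed only by literature.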
Disease            Gene   Score  UniProt  CTD  Rat  Mouse  BeFree  GAD  LHGDN
Wilson's Disease   ATP7B  0.99   0.3      0.3  0.1  0.1    174     31   23
Rett Syndrome      MECP2  0.9    0.3      0.3  0    0.1    438     27   43
Cystic Fibrosis    CFTR   0.9    0.3      0.3  0    0.1    1429    150  78
Obesity            MC4R   0.94   0.3      0.3  0.1  0.1    220     46   0
Alzheimer Disease  APP    0.88   0.3      0.3  0    0.1    1096    18   81
(BeFree, GAD and LHGDN columns give the number of publications.)

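The table can be cross-checked against the additive score definition. The table lists publication counts rather than weights for the literature sources, so the literature share of each score is not shown directly. Assuming the total is the plain sum of the three components, it can be recovered by subtraction. The sketch below does only that arithmetic, and the function name is hypothetical.

```python
# Rows copied from the table: disease, gene, score, UniProt, CTD, Rat, Mouse.
ROWS = [
    ("Wilson's Disease", "ATP7B", 0.99, 0.3, 0.3, 0.1, 0.1),
    ("Rett Syndrome", "MECP2", 0.9, 0.3, 0.3, 0.0, 0.1),
    ("Cystic Fibrosis", "CFTR", 0.9, 0.3, 0.3, 0.0, 0.1),
    ("Obesity", "MC4R", 0.94, 0.3, 0.3, 0.1, 0.1),
    ("Alzheimer Disease", "APP", 0.88, 0.3, 0.3, 0.0, 0.1),
]


def implied_literature(rows):
    """Return score minus curated and predicted weights, per gene.

    Assumes score = curated + predicted + literature; rounded to two
    decimals to absorb floating-point noise.
    """
    result = {}
    for _disease, gene, score, uniprot, ctd, rat, mouse in rows:
        result[gene] = round(score - (uniprot + ctd + rat + mouse), 2)
    return result


literature_part = implied_literature(ROWS)
```

Under that assumption the implied literature components are 0.19 (ATP7B), 0.2 (MECP2), 0.2 (CFTR), 0.14 (MC4R) and 0.18 (APP). These stay within a narrow range even though the publication counts differ by an order of magnitude, which suggests the literature weights are bounded rather than proportional to raw counts.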
Key points
3. Data prioritization to support interpretation of data on the genetic determinants of human diseases
• We still have a large dataset with low scores (350,000 GDAs)
• Other approaches based on:
  • type of association of the gene to the disease
  • experimental evidence for the GDAs
  • network-based gene-prioritization algorithms
• Take into account contradictory findings
• Different prioritization approaches for different purposes?

Key points
4. Large number of genes associated with some diseases might reflect phenotypic diversity

Need for deep phenotyping of diseases to identify disease subtypes associated with different gene networks
Human Phenotype Ontology (HPO), http://www.human-phenotype-ontology.org/
• Ontology of phenotypic abnormalities of human diseases
• Based on OMIM, Orphanet and DECIPHER

DisGeNET-HPO annotations
• 33% of DisGeNET diseases are annotated with HPO terms
• These mainly come from OMIM diseases (28% of DisGeNET diseases come from OMIM)
• Need for other approaches to annotate the full spectrum of diseases at the phenotypic level

Key points
1. Fragmentation of information as a barrier to knowledge on the mechanisms of human diseases
2. High rate of data generation on GDAs poses challenges to biocuration pipelines
3. Data prioritization to support interpretation of data on the genetic determinants of human diseases
4. Large number of genes associated with some diseases might reflect phenotypic diversity

IBI Group: Alba Gutiérrez, Àlex Bravo, Janet Piñero, Núria Queralt-Rosinach, Miguel A. Mayer, Pablo Carbonell, Laura I. Furlong, Ferran Sanz
IBI past members: Montserrat Cases, Michael Rautschka, Anna Bauer-Mehren, Solène Grosdidier


Laura Furlong. Big Data in Biomedicine debate. Barcelona, Nov 11 2014
