This document discusses a project to make large biological databases more useful by connecting genes to disease through annotation and integration of data. The project focuses on annotating microarray data in GEO with ontologies, linking these annotations to expression data in GMiner, and developing tools to highlight candidate genes for phenotypes based on integrated data from sources like RGD, OBO and GMiner loaded into a triple store. Next steps include exporting more data to RDF and refining use cases for candidate gene selection and evaluation.
3. What's the problem?
Large-scale repositories with unused or inaccessible information
How can these databases be made more useful?
How to help researchers find and use this information to connect genes to disease?
Monday, September 27, 2010
4. Rat researchers ask...
What tissue is this gene expressed in?
What expression data is known for SD (aka SD/NHsd, Harlan Sprague Dawley, Sprague Dawley) rats?
Are any of these genes associated with my phenotype?
Has this gene been seen in the brain?
What rat expression studies have been done on Mammary Cancer (aka breast neoplasms, breast cancer, cancer of the breast, breast carcinoma...)?
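The questions above depend on treating synonyms as one term. A minimal sketch of that idea, assuming a hand-written synonym table (in the project itself this role is played by ontology annotations, not by a dictionary like this one); the function names are hypothetical:

```python
# Hypothetical synonym table built from the examples on this slide.
SYNONYMS = {
    "SD": ["SD", "SD/NHsd", "Harlan Sprague Dawley", "Sprague Dawley"],
    "Mammary Cancer": [
        "mammary cancer",
        "breast neoplasms",
        "breast cancer",
        "cancer of the breast",
        "breast carcinoma",
    ],
}


def build_lookup(synonyms):
    """Map every lower-cased synonym to its canonical term."""
    return {
        name.lower(): canonical
        for canonical, names in synonyms.items()
        for name in names
    }


def normalize(term, lookup):
    """Return the canonical term for a synonym, or None if it is unknown."""
    return lookup.get(term.strip().lower())
```

With this, a search for "Harlan Sprague Dawley" and a search for "SD" resolve to the same canonical strain.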
5. What's the strategy?
Focus on GEO (microarray)
Use NCBO annotator to mark up text, review annotations, and then use for tools and visualization
Combine annotations with biological data to derive new insights.
[Diagram: GEO Records -> Create Annotation Jobs & Queue Up -> Q-Out (RabbitMQ) -> 1..n Annot. Workers -> Index text at OBA -> Parse Results -> Put results in to queue for save -> Q-In -> Results saved to GMiner database]
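The queue-and-workers flow on this slide can be sketched roughly as follows. This is an illustration only: Python's standard `queue` and `threading` modules stand in for RabbitMQ, and a toy keyword matcher with a made-up vocabulary stands in for the NCBO annotator. Record IDs and term IDs are hypothetical.

```python
import queue
import threading

# Hypothetical mini-vocabulary standing in for ontology terms.
VOCABULARY = {
    "kidney": "anatomy:kidney",
    "brain": "anatomy:brain",
    "hypertension": "disease:hypertension",
}


def annotate(text):
    """Toy stand-in for the annotator: term IDs whose label occurs in the text."""
    lowered = text.lower()
    return sorted(tid for label, tid in VOCABULARY.items() if label in lowered)


def worker(q_out, q_in):
    """Take jobs from Q-Out, annotate them, put results on Q-In."""
    while True:
        job = q_out.get()
        if job is None:  # sentinel: no more jobs for this worker
            break
        record_id, text = job
        q_in.put((record_id, annotate(text)))


def run_pipeline(records, n_workers=3):
    """Queue up one job per record, run 1..n workers, collect saved results."""
    q_out, q_in = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(q_out, q_in))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for job in records.items():
        q_out.put(job)
    for _ in threads:
        q_out.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    saved = {}  # stands in for the GMiner database
    while not q_in.empty():
        record_id, terms = q_in.get()
        saved[record_id] = terms
    return saved
```

Separating the outgoing and incoming queues means the number of workers can change without touching the code that creates jobs or saves results.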
6. Current Ontologies
http://bioportal.bioontology.org/
10. Linking annotations to data
Tm2d1
RGD1306410
Svs4
Hbb
Scgb2a1
Alb
11. Linking annotations to data
Tm2d1
RGD1306410
Svs4
Hbb
Scgb2a1
+
Alb
Hbb is_expressed_in rat kidney
Tm2d1 is_expressed_in rat kidney
Human (U133, U133v2), Mouse (430, U74, U95) and Rat (U34a/b/c, 230, 230v2)
62,000 samples x ca. 25,000 genes/sample = ca. 1.5B data points
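A minimal sketch of how sample annotations plus expression calls could yield statements such as "Hbb is_expressed_in rat kidney". The data layout (dictionaries keyed by sample ID) and the function name are assumptions for illustration, not GMiner's schema.

```python
# Rough size of the data set quoted on the slide.
TOTAL_DATA_POINTS = 62_000 * 25_000  # 1,550,000,000, i.e. about 1.5 billion


def derive_expression_triples(sample_annotations, sample_expression):
    """Combine per-sample annotations with per-sample expressed genes.

    sample_annotations: {sample_id: {"species": ..., "tissue": ...}}
    sample_expression:  {sample_id: [gene symbols called as expressed]}
    Samples without annotations are skipped.
    """
    triples = set()
    for sample_id, genes in sample_expression.items():
        annotation = sample_annotations.get(sample_id)
        if annotation is None:
            continue
        context = f"{annotation['species']} {annotation['tissue']}"
        for gene in genes:
            triples.add((gene, "is_expressed_in", context))
    return sorted(triples)
```

Using a set means a gene seen in many kidney samples still produces a single statement.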
12. Probeset results on GMiner
Probeset L08490cds_at for Gabra1 - gamma-aminobutyric acid (GABA) A receptor, alpha 1
Hs GABRA1
13. QTL
[Diagram: genes (G) within a hypertensive QTL, linked to Phenotype (Hypertension), Pathway, Anatomy (Kidney), strain differences (Strain 1 != Strain 2), and GO Component, Function and Process]
14. QTL Gene Highlighter
[Diagram: genes (G) in a QTL are checked against disease/phenotype data from GMiner, RGD, OBO, etc., loaded into AllegroGraph]
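One way to picture the highlighting step: count, for each gene in the QTL, how many stored statements connect it to terms relevant to the phenotype. The sketch below uses a plain Python list of triples as a stand-in for the triple store; it does not use the AllegroGraph API, and the scoring rule, gene names and predicates are hypothetical.

```python
def highlight_candidates(qtl_genes, triples, context_terms):
    """Rank QTL genes by the number of triples linking them to context terms.

    qtl_genes:     gene symbols located in the QTL
    triples:       iterable of (subject, predicate, object) tuples
    context_terms: set of terms relevant to the phenotype of interest
    Returns (gene, score) pairs, highest score first; unscored genes are dropped.
    """
    scores = {gene: 0 for gene in qtl_genes}
    for subject, _predicate, obj in triples:
        if subject in scores and obj in context_terms:
            scores[subject] += 1
    ranked = [(gene, score) for gene, score in scores.items() if score > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

A real query would also weigh the kind of evidence (expression, pathway, strain difference) and not just count matches.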
15. RDF/OWL sources
Cell Ontology
http://www.berkeleybop.org/ontologies/obo-all/cell/cell.owl
Mouse Adult Gross Anatomy
http://www.berkeleybop.org/ontologies/obo-all/adult_mouse_anatomy/adult_mouse_anatomy.owl
Mammalian Phenotype
http://www.berkeleybop.org/ontologies/obo-all/mammalian_phenotype/mammalian_phenotype.owl
GO Function
http://www.berkeleybop.org/ontologies/obo-all/molecular_function/molecular_function.owl
GO Process
http://www.berkeleybop.org/ontologies/obo-all/biological_process/biological_process.owl
GO Component
http://www.berkeleybop.org/ontologies/obo-all/cellular_component/cellular_component.owl
16. Rat Genome Database
Wide variety of data types - genomic and physiological, many with corresponding ontologies
21. QTL Highlighter
Rails source code will be available on GitHub
RDFizer (Ruby): http://github.com/simont/MCW-RDF
22. Next Steps
Register PURL for RGD
Create RGD core object ontology (OWL/RDF)
Select appropriate URIs for RGD data
Ontology annotations - how best to represent in triple store?
Export GMiner data to RDF -> Triple Store
Document & refine biological use cases related to candidate gene selection/evaluation
Identify additional data required for candidate gene selection, RDFize as appropriate, load into triple store.
Connections to other RDF collections/LOD, etc.?
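As a rough illustration of the RDF export step, the sketch below writes simple triples as N-Triples lines. The base URI is a placeholder, since the PURL for RGD has not been registered yet, and every term is minted under that single base, which a real export would not do.

```python
from urllib.parse import quote

# Placeholder base URI (assumption); the real PURL is still to be chosen.
BASE = "http://example.org/rgd/"


def to_uri(local_name):
    """Wrap a local name as an N-Triples URI reference under BASE."""
    return f"<{BASE}{quote(local_name, safe='')}>"


def to_ntriples(triples):
    """Serialize (subject, predicate, object) name tuples, one statement per line."""
    return "\n".join(
        f"{to_uri(s)} {to_uri(p)} {to_uri(o)} ." for s, p, o in triples
    )
```

For example, `to_ntriples([("Hbb", "is_expressed_in", "rat kidney")])` produces one line with the space in "rat kidney" percent-encoded.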