Bioinformatics in medicine
today
David Montaner
dmontaner@cipf.es
Centro de Investigación Príncipe Felipe
Institute of Computational Genomics
9 May 2013
in Valencia
David Montaner Bioinformatics in medicine 1/26
Genomics
Progress in science depends on new techniques, new
discoveries and new ideas, probably in that order.
Sydney Brenner, 1980
Microarray devices and high-throughput sequencing allow us
to measure thousands or millions of genomic characteristics.
Genomics vs. genetics
Genetics:
 Single genes are responsible for biological changes.
 one gene → one hypothesis → one p-value → conclusions
Genomics:
 Genes or genomic features act together to produce
biological changes.
 many genes → many hypotheses → many p-values
 → more data analysis
 Computational support is needed even for drawing
conclusions
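One common form that the extra analysis for "many p-values" takes is a multiple-testing correction. The slides do not name a method; the sketch below uses the Benjamini-Hochberg adjustment purely as an illustration, and the function name and the toy p-values are assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values (invented for illustration), one per gene.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg(raw)
```

With one test per gene, the adjusted values are what would be compared against the chosen false discovery rate.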
Genomic numbers
Microarray:
 30,000 genes
 2 million SNPs
 100 Mb
Measured features:
 genes, isoforms
 SNPs, Polymorphisms
 IN-DELS
 loss of heterozygosity
 methylation
 copy number alterations
NGS:
 30,000 genes
 30,000 transcripts
 20 million SNPs
 10-100 GB
Registered information:
 Genomic characteristics:
position, chromosome ...
 Biological function
 Disease association
 miRNA targets
Genomic databases
Nucleic Acids Research lists more than 1500 online databases!
http://www.oxfordjournals.org/nar/database/c
 Many different databases for each category, which should I
use?
 No standards: different IDs, methods, servers, formats, ...
 Lack of international initiatives, many local and small
databases
 Different gene IDs, more than 50
 In vivo vs in silico databases
Biological databases (Wikipedia)
1 Primary nucleotide
sequence databases
2 Metadatabases
3 Genome databases
4 Protein sequence
databases
5 Proteomics databases
6 Protein structure
databases
7 Protein model databases
8 RNA databases
9 Carbohydrate structure
databases
10 Protein-protein interactions
11 Signal transduction
pathway databases
12 Metabolic pathway
databases
13 Experimental data
repositories (Microarrays,
NGS, Sanger)
14 Exosomal databases
15 Mathematical model
databases
16 PCR / real time PCR
primer databases
17 Specialized databases
18 Taxonomic databases
19 Wiki-style databases
Primary nucleotide sequence
databases
Contain any kind of nucleotide sequence, from genes to
genomes.
The International Nucleotide Sequence Database (INSD)
Collaboration:
 GenBank
National Center for Biotechnology Information (NCBI)
 European Nucleotide Archive (ENA)
European Bioinformatics Institute (EBI)
 DNA Data Bank of Japan (DDBJ)
GenBank
Primary nucleotide sequence databases
 available on the NCBI ftp site:
http://www.ncbi.nlm.nih.gov/Ftp/
 A new release is made every two months.
 3 types of entries:
 CoreNucleotide (the main collection)
 dbEST (Expressed Sequence Tags)
 dbGSS (Genome Survey Sequences)
Access:
 Search for sequence identifiers using Entrez Nucleotide:
http://www.ncbi.nlm.nih.gov/nucleotide/
 Align GenBank sequences to a query sequence using
BLAST (Basic Local Alignment Search Tool).
http://blast.ncbi.nlm.nih.gov/Blast.cgi
 Several other e-utilities (see book)
See an example of a GenBank record.
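As a rough illustration of the programmatic access mentioned above, the sketch below builds an NCBI E-utilities EFetch URL that requests a nucleotide record in GenBank flat-file format. The helper names are hypothetical, and the accession U49845 is used only as an example; the slides do not prescribe any particular call.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def genbank_efetch_url(accession):
    """Build an EFetch URL for one nucleotide record as GenBank text."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession or GI of the record
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_genbank_record(accession):
    """Download the record; needs network access, so it is not called here."""
    with urlopen(genbank_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


# Example accession chosen only for illustration.
url = genbank_efetch_url("U49845")
```

Calling `fetch_genbank_record("U49845")` would return the flat-file text of that record; NCBI asks heavy users to throttle their requests.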
Metadatabases
 Collect and organize data from primary nucleotide
sequence databases and many other resources.
 Make the information available in a convenient format and
provide data handling resources: web pages, application
programming interface (API) 
 Focus on particular species, diseases 
Examples
 Entrez: searches through almost all NCBI resources.
http://www.ncbi.nlm.nih.gov/sites/gquery
 GeneCards: provides genomic, proteomic, transcriptomic,
genetic and functional information for human genes (known
and predicted)
http://www.genecards.org/
Entrez
Metadatabases
 Searches through almost all NCBI resources.
 Entrez search page: http://www.ncbi.nlm.nih.gov/sites/gquery
 Queries can be saved if you have a MyNCBI account
http://www.ncbi.nlm.nih.gov/
Genome databases
Collect genome sequences and annotation (specification about
genes) for particular organisms, and try to improve them:
 Data curation.
 Complete missing information using in silico methods.
 Generate new relational organization.
 Complement feature IDs.
 Provide easy access, visualization 
Examples
 Ensembl: automatic annotation on selected eukaryote
genomes.
 UCSC Genome Browser: reference sequence and working
draft assemblies for a large collection of genomes
 WormBase: genome of the model organism C. elegans.
Ensembl
Genome databases
 Ensembl is a joint project between the European Bioinformatics
Institute (EBI), part of the European Molecular Biology Laboratory
(EMBL), and the Wellcome Trust Sanger Institute.
 Develops a software system which produces and maintains
automatic annotation on selected vertebrate and
other eukaryote genomes.
 http://www.ensembl.org
UCSC Genome Browser
Genome databases
 UCSC: University of California, Santa Cruz.
 This site contains the reference sequence and working
draft assemblies for a large collection of genomes.
 http://genome.ucsc.edu/
Protein sequence databases
 Most of the time, proteins are the final unit of interest in research.
 There is a direct conversion from DNA/RNA sequences to
protein sequences.
 Gene IDs and protein IDs are used interchangeably by
researchers (biologists, not bioinformaticians)
Examples
 UniProt: Universal Protein Resource (EBI)
 Swiss-Prot (Swiss Institute of Bioinformatics)
 InterPro: classifies proteins into families and predicts the
presence of domains and sites.
 Pfam: protein families database of alignments and HMMs
(Sanger Institute)
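The "direct conversion from DNA/RNA sequences to protein sequences" mentioned on this slide can be sketched as a codon-by-codon translation. The example below assumes the standard genetic code, a coding sequence already in frame, and a hypothetical `translate` helper; real pipelines also handle alternative codes, ambiguous bases and splicing.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order
# ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate an in-frame DNA or RNA coding sequence up to the first stop."""
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy sequence invented for illustration: ATG GCC AAG TAA
peptide = translate("ATGGCCAAGTAA")
```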
RNA databases
 Contain information about RNA molecules.
 Most of them concern gene regulatory factors. (Gene
information is usually in other repositories.)
Examples
 miRBase: microRNAs
http://www.mirbase.org/
 TRANSFAC: transcription factors in eukaryotes (Proprietary
database).
 JASPAR: transcription factor binding sites for eukaryotes
(Open access, curated, non-redundant).
http://jaspar.genereg.net/
Protein-protein interactions
 Proteins are the main functional units.
 But they do not work in isolation.
 Pretty useless at the moment but promising in the future 
 Some information is experimental, but most of it is
generated in silico.
Examples
 IntAct: protein-small molecule
and protein-nucleic acid
interactions.
 BIND: Biomolecular Interaction
Network Database.
Signal transduction pathway
databases
& Metabolic pathway databases
 Information about how genes (or proteins) interact with
each other.
 not only physical interactions 
Examples
 Reactome: free online database of biological pathways.
http://www.reactome.org
 KEGG: Kyoto Encyclopedia of Genes and Genomes.
Metabolic pathways.
http://www.genome.jp/kegg/pathway.html
KEGG
Metabolic pathway databases
Experimental data repositories
Contain microarray, NGS, Sanger, and other experimental
high-throughput data.
 GEO: Gene Expression Omnibus (NCBI)
http://www.ncbi.nlm.nih.gov/geo/
 ArrayExpress: database of functional genomics
experiments (EBI)
http://www.ebi.ac.uk/arrayexpress/
 The Cancer Genome Atlas (TCGA): Data on different
cancer-related tissues.
http://cancergenome.nih.gov/
Bioinformatics
Training
 Biology 1/3
 Statistics 1/3
 Computer science 1/3 
Efficiently combine:
 Experimental information
 Database registered knowledge
Time and resources:
 As in the wet lab
Example
Example I
Autistic children
1 (microarray) NGS data processing
 data quality control, filtering...
 map against reference genome
 CNV calling
2 CNV filtering
 just 75 rare de novo CNV events (not registered in
databases)
 filter out the long ones
 keep the ones that contain genes
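The CNV filtering step above can be sketched as interval arithmetic. The example assumes that each CNV and each gene is described by a chromosome plus inclusive start and end coordinates, that "long" means exceeding a user-chosen `max_length`, and that "contain genes" means overlapping at least one gene; the slides specify none of these details, and all names and coordinates below are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a, b):
    """True when two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def filter_cnvs(cnvs, genes, max_length):
    """Drop CNVs longer than max_length, then keep those touching a gene."""
    kept = []
    for cnv in cnvs:
        if cnv.end - cnv.start + 1 > max_length:
            continue  # filter out the long ones
        if any(overlaps(cnv, gene) for gene in genes):
            kept.append(cnv)  # keep the ones that contain genes
    return kept


# Toy coordinates invented for illustration.
cnvs = [
    Interval("chr1", 100, 500),
    Interval("chr1", 1000, 900000),
    Interval("chr2", 100, 300),
]
genes = [Interval("chr1", 400, 800), Interval("chr1", 2000, 3000)]
selected = filter_cnvs(cnvs, genes, max_length=10_000)
```

The linear scan over genes is enough for a few dozen events; for genome-wide sets an interval tree or sorted sweep would be the usual choice.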
Example II
3 move to the gene level
 47 loci in total affecting 433 human genes
4 Building the background likelihood network
 GO annotations
 KEGG pathways
 InterPro domains
 protein-protein interactions. Databases: BIND, BioGRID,
DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS
 sequence homology between the gene pair (BLAST)
Example III
5 Search for high scoring clusters affected by CNVs
6 Evaluating significance of cluster scores:
10,000 simulations
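Evaluating significance by simulation usually means comparing the observed cluster score with scores obtained under a null model. The sketch below computes an empirical p-value from 10,000 simulated scores; the null model shown (a sum of five standard-normal gene scores) and the observed score are placeholders, since the slides do not describe how the simulated clusters were generated.

```python
import random


def empirical_p_value(observed_score, simulate_null_score,
                      n_simulations=10_000, seed=0):
    """Estimate P(null score >= observed score) by simulation."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(n_simulations):
        if simulate_null_score(rng) >= observed_score:
            hits += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return (hits + 1) / (n_simulations + 1)


def toy_null_score(rng):
    """Hypothetical null model: sum of 5 standard-normal gene scores."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(5))


# Observed score invented for illustration.
p_value = empirical_p_value(observed_score=6.0,
                            simulate_null_score=toy_null_score)
```

With 10,000 simulations the smallest reportable p-value is about 1e-4, which bounds how significant a cluster can be declared.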
Example IV
7 Functional characterization of the identified network
8 And, finally, draw conclusions
Questions
Thanks
