This document discusses the role of bioinformatics in medicine today. It begins by explaining how genomics differs from genetics in studying many genes and genomic features together rather than single genes. It then describes some of the key genomic databases that are used in bioinformatics, including primary sequence databases like GenBank, metadatabases like Entrez, genome databases like Ensembl and UCSC, and pathway and protein databases. The document provides an example of how bioinformatics is used to analyze autism data, including processing sequencing data, identifying copy number variations, mapping genes, building networks, and identifying significant clusters to understand autism better.
Bioinformatics Introduction
1. Bioinformatics in medicine today
David Montaner
dmontaner@cipf.es
Centro de Investigación Príncipe Felipe
Institute of Computational Genomics
9 May 2013
in Valencia
David Montaner Bioinformatics in medicine 1/26
2. Genomics
Progress in science depends on new techniques, new
discoveries and new ideas, probably in that order.
Sydney Brenner, 1980
Microarray devices and high-throughput sequencing allow us
to measure thousands or millions of genomic characteristics.
3. Genomics vs. genetics
Genetics:
Single genes are responsible for biological changes.
one gene → one hypothesis → one p-value → conclusions
Genomics:
Genes or genomic features act together to produce
biological changes.
many genes → many hypotheses → many p-values →
more data analysis
Computational support is needed even for drawing
conclusions
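The "many hypotheses, many p-values" situation is usually handled with a multiple-testing correction. The sketch below uses the Benjamini-Hochberg procedure, which is one common choice and is not named on the slide; the p-values are made-up illustrative numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only (one per gene); not data from the talk.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [p <= 0.05 for p in adjusted]
```

With these toy numbers, four raw p-values fall below 0.05 but only two remain significant after adjustment, which is the kind of difference that makes computational support necessary.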
4. Genomic numbers
Microarray:
30,000 genes
2 million SNPs
100 Mb
Measured features:
genes, isoforms
SNPs, Polymorphisms
IN-DELS
loss of heterozygosity
methylation
copy number alterations
NGS:
30,000 genes
30,000 transcripts
20 million SNPs
10-100 GB
Registered information:
Genomic characteristics:
position, chromosome ...
Biological function
Disease association
miRNA targets
5. Genomic databases
Nucleic Acids Research lists more than 1,500 online databases!
http://www.oxfordjournals.org/nar/database/c
Many different databases exist for each category: which one
should I use?
No standards: different IDs, methods, servers, formats, ...
Lack of international initiatives; many small, local
databases
More than 50 different kinds of gene ID
In vivo vs in silico databases
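The practical consequence of having many ID systems is that identifiers must be normalized before data sets can be combined. The sketch below assumes a small hand-written cross-reference table for two human genes; in real work the table would be downloaded from a resource such as Ensembl or NCBI, and the function names here are hypothetical.

```python
# Illustrative cross-reference table (assumption: typed by hand for this
# sketch; a real pipeline would download it from an annotation resource).
XREFS = [
    {"symbol": "TP53", "entrez": "7157",
     "ensembl": "ENSG00000141510", "uniprot": "P04637"},
    {"symbol": "BRCA1", "entrez": "672",
     "ensembl": "ENSG00000012048", "uniprot": "P38398"},
]


def build_index(xrefs):
    """Map every known identifier (upper-cased) to its gene symbol."""
    index = {}
    for record in xrefs:
        for identifier in record.values():
            index[identifier.upper()] = record["symbol"]
    return index


def to_symbol(identifier, index):
    """Return the gene symbol for any supported ID, or None if unknown."""
    return index.get(identifier.strip().upper())


INDEX = build_index(XREFS)
symbols = [to_symbol(i, INDEX) for i in ["ensg00000141510", "672", "P04637"]]
```

Here `symbols` holds one gene symbol per input, regardless of which ID system each input came from.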
6. Biological databases (Wikipedia)
1 Primary nucleotide sequence databases
2 Metadatabases
3 Genome databases
4 Protein sequence databases
5 Proteomics databases
6 Protein structure databases
7 Protein model databases
8 RNA databases
9 Carbohydrate structure databases
10 Protein-protein interactions
11 Signal transduction pathway databases
12 Metabolic pathway databases
13 Experimental data repositories (Microarrays, NGS, Sanger)
14 Exosomal databases
15 Mathematical model databases
16 PCR / real time PCR primer databases
17 Specialized databases
18 Taxonomic databases
19 Wiki-style databases
7. Primary nucleotide sequence databases
Contain any kind of nucleotide sequence, from genes to
genomes.
The International Nucleotide Sequence Database (INSD)
Collaboration:
GenBank
National Center for Biotechnology Information (NCBI)
European Nucleotide Archive (ENA)
European Bioinformatics Institute (EBI)
DNA Data Bank of Japan (DDBJ)
8. GenBank
Primary nucleotide sequence databases
available on the NCBI ftp site:
http://www.ncbi.nlm.nih.gov/Ftp/
A new release is made every two months.
3 types of entries:
CoreNucleotide (the main collection)
dbEST (Expressed Sequence Tags)
dbGSS (Genome Survey Sequences)
Access:
Search for sequence identifiers using Entrez Nucleotide:
http://www.ncbi.nlm.nih.gov/nucleotide/
Align GenBank sequences to a query sequence using
BLAST (Basic Local Alignment Search Tool).
http://blast.ncbi.nlm.nih.gov/Blast.cgi
Several other e-utilities (see book)
See an example of a GenBank record.
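The Entrez search mentioned above can also be run programmatically through the NCBI E-utilities. The following is a minimal sketch using the ESearch endpoint with only the Python standard library; the search term is an illustrative assumption, and the helper names `esearch_url` and `search_ids` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities ESearch service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retmax=5):
    """Build an ESearch URL for the given query term and database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}?{urlencode(params)}"


def search_ids(term, db="nucleotide", retmax=5):
    """Run the search (needs network access) and return matching record IDs."""
    with urlopen(esearch_url(term, db, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Illustrative query; building the URL does not contact NCBI.
url = esearch_url("BRCA1[Gene Name] AND Homo sapiens[Organism]")
```

Calling `search_ids(...)` with the same term would return a list of record identifiers that can then be retrieved with the other e-utilities.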
9. Metadatabases
Collect and organize data from primary nucleotide
sequence databases and many other resources.
Make the information available in a convenient format and
provide data handling resources: web pages, application
programming interface (API)
Focus on particular species, diseases
Examples
Entrez: searches through almost all NCBI resources.
http://www.ncbi.nlm.nih.gov/sites/gquery
GeneCards: provides genomic, proteomic, transcriptomic,
genetic and functional information for human genes (known
and predicted)
http://www.genecards.org/
10. Entrez
Metadatabases
Searches through almost all NCBI resources.
Entrez search page: http://www.ncbi.nlm.nih.gov/sites/gquery
Queries can be saved if you have a MyNCBI account
http://www.ncbi.nlm.nih.gov/
11. Genome databases
Collect genome sequences and annotation (specification about
genes) for particular organisms, and try to improve them:
Data curation.
Complete missing information using in silico methods.
Generate new relational organization.
Complement feature IDs.
Provide easy access, visualization
Examples
Ensembl: automatic annotation on selected eukaryote
genomes.
UCSC Genome Browser: reference sequence and working
draft assemblies for a large collection of genomes
WormBase: genome of the model organism C. elegans.
12. Ensembl
Genome databases
Ensembl is a joint project between the European Bioinformatics
Institute (EBI), part of the European Molecular Biology Laboratory
(EMBL), and the Wellcome Trust Sanger Institute.
It develops a software system that produces and maintains
automatic annotation on selected vertebrate and other
eukaryote genomes.
http://www.ensembl.org
13. UCSC Genome Browser
Genome databases
UCSC: University of California, Santa Cruz.
This site contains the reference sequence and working
draft assemblies for a large collection of genomes.
http://genome.ucsc.edu/
14. Protein sequence databases
Most of the time, proteins are the final unit of interest in research.
There is a direct conversion from DNA/RNA sequences to
protein sequences.
Researchers (biologists, not bioinformaticians) use gene IDs
and protein IDs interchangeably.
Examples
UniProt: Universal Protein Resource (EBI)
Swiss-Prot (Swiss Institute of Bioinformatics)
InterPro: classifies proteins into families and predicts the
presence of domains and sites.
Pfam: protein families database of alignments and HMMs
(Sanger Institute)
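The "direct conversion" from DNA/RNA to protein sequence mentioned on this slide is codon-by-codon translation. A minimal sketch follows, assuming the standard genetic code, reading frame 1, and translation up to the first stop codon; the function name `translate` is chosen for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate a DNA or RNA sequence in frame 1, stopping at the first stop codon."""
    seq = sequence.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for start in range(0, len(seq) - 2, 3):
        # Codons containing ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(seq[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


peptide = translate("ATGGCCTAA")
```

The reverse direction is not unique, since several codons encode the same amino acid, which is one reason gene and protein identifiers are kept in separate databases.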
15. RNA databases
Contain information about RNA molecules.
Most of them concern gene regulatory factors (gene
information is usually in other repositories).
Examples
miRBase: microRNAs
http://www.mirbase.org/
TRANSFAC: transcription factors in eukaryotes (proprietary
database).
JASPAR: transcription factor binding sites for eukaryotes
(open access, curated, non-redundant).
http://jaspar.genereg.net/
16. Protein-protein interactions
Proteins are the main functional units.
But they do not work in isolation.
Pretty useless at the moment but promising for the future:
some information is experimental, but most of it is
generated in silico.
Examples
IntAct: protein-small molecule
and protein-nucleic acid
interactions.
BIND: Biomolecular Interaction
Network Database.
17. Signal transduction pathway databases & Metabolic pathway databases
Information about how genes (or proteins) interact with one
another.
Not only physical interactions
Examples
Reactome: free online database of biological pathways.
http://www.reactome.org
KEGG: Kyoto Encyclopedia of Genes and Genomes.
Metabolic pathways.
http://www.genome.jp/kegg/pathway.html
19. Experimental data repositories
Contain microarray, NGS, Sanger, and other experimental
high-throughput data.
GEO: Gene Expression Omnibus (NCBI)
http://www.ncbi.nlm.nih.gov/geo/
ArrayExpress: database of functional genomics
experiments (EBI)
http://www.ebi.ac.uk/arrayexpress/
The Cancer Genome Atlas (TCGA): data on different
cancer-related tissues.
http://cancergenome.nih.gov/
20. Bioinformatics
Training
Biology 1/3
Statistics 1/3
Computer science 1/3
Efficiently combine:
Experimental information
Database registered knowledge
Time and resources:
As in the wet lab
22. Example I
Autistic children
1 (microarray) NGS data processing
data quality control, filtering...
map against reference genome
CNV calling
2 CNV filtering
just 75 rare de novo CNV events (not registered in
databases)
filter out the long ones
keep the ones that contain genes
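The CNV filtering steps above can be sketched as a simple interval filter. Everything concrete here is an assumption made for illustration: the coordinates, the names `GENE_A` and `GENE_B`, the length threshold, and the choice to treat "registered in databases" as overlapping any known CNV.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlaps(a, b):
    """True when two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def filter_cnvs(cnvs, genes, known_cnvs, max_length):
    """Keep novel CNVs no longer than max_length that contain at least one gene."""
    kept = []
    for cnv in cnvs:
        if any(overlaps(cnv, known) for known in known_cnvs):
            continue  # already registered in a database
        if cnv.end - cnv.start > max_length:
            continue  # filter out the long ones
        hit_genes = [gene.name for gene in genes if overlaps(cnv, gene)]
        if hit_genes:
            kept.append((cnv.name, hit_genes))
    return kept


# Toy data, hypothetical throughout.
genes = [Interval("chr1", 1000, 2000, "GENE_A"), Interval("chr2", 500, 900, "GENE_B")]
known = [Interval("chr1", 5000, 6000, "known_1")]
cnvs = [
    Interval("chr1", 1500, 1800, "cnv_1"),  # novel, short, hits GENE_A
    Interval("chr1", 5500, 5600, "cnv_2"),  # overlaps a known CNV
    Interval("chr2", 0, 100000, "cnv_3"),   # too long
    Interval("chr2", 1000, 1200, "cnv_4"),  # contains no gene
]
result = filter_cnvs(cnvs, genes, known, max_length=10_000)
```

Only `cnv_1` survives all three filters in this toy set; the same overlap test also serves the later step of moving from CNV loci to the gene level.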
23. Example II
3 move to the gene level
47 loci in total affecting 433 human genes
4 Building the background likelihood network
GO annotations
KEGG pathways
InterPro domains
protein-protein interactions. Databases: BIND, BioGRID,
DIP, HPRD, InNetDB, IntAct, BiGG, MINT, and MIPS
sequence homology between the gene pair (BLAST)
24. Example III
5 Search for high scoring clusters affected by CNVs
6 Evaluating significance of cluster scores:
10,000 simulations
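Significance by simulation usually means comparing the observed cluster score with scores obtained under a null model. The slides do not say which null model was used; the sketch below assumes random gene sets of the same size drawn from the whole network, an additive score, and made-up per-gene weights.

```python
import random


def cluster_score(genes, weights):
    """Score a gene set as the sum of its per-gene weights (an assumed score)."""
    return sum(weights[gene] for gene in genes)


def empirical_p_value(observed_genes, weights, n_simulations=10_000, seed=0):
    """Fraction of random same-size gene sets scoring at least as high as observed."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    universe = list(weights)
    observed = cluster_score(observed_genes, weights)
    size = len(observed_genes)
    hits = sum(
        cluster_score(rng.sample(universe, size), weights) >= observed
        for _ in range(n_simulations)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_simulations + 1)


# Hypothetical weights for a 20-gene network: g1 has weight 1, g20 has weight 20.
weights = {f"g{i}": i for i in range(1, 21)}
p_top = empirical_p_value(["g18", "g19", "g20"], weights)
p_bottom = empirical_p_value(["g1", "g2", "g3"], weights)
```

The three highest-weight genes give a small p-value, while the three lowest-weight genes give exactly 1.0 because every random set scores at least as high.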
25. Example IV
7 Functional characterization of the identified network
8 And, finally, draw conclusions