Introduction to Bioinformatics
07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
What is Bioinformatics?
Research, development, and application of computational tools and
approaches for expanding the use of biological, medical,
behavioral, and health data, including the means to acquire,
store, organize, archive, analyze, or visualize such data.
What is Computational Biology?
The development and application of analytical and theoretical
methods, mathematical modeling, and computational simulation
techniques to the study of biological, behavioral, and social data.
Large databases that can be accessed and analyzed with
sophisticated tools have become central to biological
research and education. The information content in the
genomes of organisms, in the molecular dynamics of
proteins, and in population dynamics, to name but a few
areas, is enormous. Biologists are increasingly finding that
the management of complex data sets is becoming a
bottleneck for scientific advances. Therefore,
bioinformatics is rapidly becoming a key technology in all
fields of biology.
The present bottlenecks in bioinformatics include the education of
biologists in the use of advanced computing tools, the recruitment
of computer scientists into this evolving field, the limited
availability of developed databases of biological information, and
the need for more efficient and intelligent search engines for
complex databases.
Molecular Bioinformatics involves the use
of computational tools to discover new
information in complex data sets (from the
one-dimensional information of DNA through
the two-dimensional information of RNA and
the three-dimensional information of proteins,
to the four-dimensional information of
evolving living systems).
From DNA to Genome
[Timeline figure, axis 1955 to 1985. Events shown: Watson and Crick DNA model;
Sanger sequences insulin protein; Sanger dideoxy DNA sequencing;
PCR (Polymerase Chain Reaction); ARPANET (early Internet);
PDB (Protein Data Bank); sequence alignment; GenBank database; Dayhoff's Atlas.]
[Timeline figure, continued, axis 1990 to 2000. Events shown: SWISS-PROT database;
NCBI; World Wide Web; BLAST; FASTA; EBI; Human Genome Initiative;
first human genome draft; first bacterial genome; yeast genome.]
Origin of bioinformatics and biological databases:
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
Nearly a decade later, the first nucleic acid
sequence was reported, that of yeast
alanine tRNA, with 77 bases.
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure).
The Protein Data Bank followed in 1972 with a
collection of ten X-ray crystallographic protein
structures. The SWISS-PROT protein sequence
database began in 1987.
[Figure: Nucleotides]
Complete Genomes
1994: 0
1995: 1
December 2006: 376 (Eukaryotes 22, Bacteria 327, Archaea 27)
What can we do with sequences and other types of molecular information?
Annotation
Open reading frames
Functional sites
Structure, function
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG
AAT .................................
.............. TGAAAAACGTA
Annotated features: TF binding site, promoter, ribosome binding site,
transcription start site
ORF = Open Reading Frame
CDS = Coding Sequence
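The ORF annotation above can be illustrated with a minimal sketch. It assumes the simplest possible definition of an ORF: an ATG start codon followed, in the same reading frame, by a stop codon (TAA, TAG, or TGA), on the forward strand only. The names find_orfs and min_codons are hypothetical, and real gene finders use far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start and end are 0-based, end-exclusive, and the stop codon is included.
    min_codons counts the codons before the stop codon (ATG included).
    """
    seq = seq.upper().replace(" ", "")
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy sequence (not the one on the slide): ATG AAA TTT TAG in frame 2.
print(find_orfs("CCATGAAATTTTAGGG"))  # [(2, 14, 2)]
```

A fuller annotation would also scan the reverse complement and look for the upstream signals listed above.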
Comparative
genomics
Comparing ORFs
Identifying orthologs
Inferences on structure
and function
Comparing functional sites
Inferences on regulatory
networks
Similarity profiles
Researchers can learn a great deal about the structure and
function of human genes by examining their counterparts in
model organisms.
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos MALWTRLRPLLALLALWPPPPARAFVNQHL
**** : * *.*: *:..* :. *:****
Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos CGSHLVEALYLVCGERGFFYTPKARREVEG
: ** :*::*
Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV
.**. ** * * 
Xenopus EQCCHSTCSLFQLENYCN
Bos EQCCASVCSLYQLENYCN
**** *.***:**
Alignment: preproinsulin (Xenopus vs. Bos)
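One common way to summarize such a pairwise alignment is percent identity. The sketch below assumes the two sequences are already aligned to the same length with "-" for gaps, and it skips gapped columns; other conventions for the denominator exist. The function name percent_identity is hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of ungapped alignment columns in which both residues match."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Last block of the preproinsulin alignment shown above.
xenopus = "EQCCHSTCSLFQLENYCN"
bos = "EQCCASVCSLYQLENYCN"
print(round(percent_identity(xenopus, bos), 1))  # 83.3
```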
Ultraconserved Elements in the
Human Genome
Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart
Stephen, W. James Kent, John S. Mattick, & David Haussler
(Science 2004. 304:1321-1325)
There are 481 segments longer than 200 base pairs (bp) that are
absolutely conserved (100% identity with no insertions or
deletions) between orthologous regions of the human, rat, and
mouse genomes. Nearly all of these segments are also conserved
in the chicken and dog genomes, with an average of 95 and 99%
identity, respectively. Many are also significantly conserved in
fish. These ultraconserved elements of the human genome are
most often located either overlapping exons in genes involved in
RNA processing or in introns or nearby genes involved in the
regulation of transcription and development.
There are 156 intergenic, untranscribed,
ultraconserved segments.
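The definition used above (100% identity, no insertions or deletions, longer than 200 bp) can be sketched as a scan over the columns of a multiple alignment. This is only an illustration under the assumption that the orthologous regions have already been aligned; it is not the pipeline used in the paper, and the name conserved_runs is hypothetical.

```python
def conserved_runs(aligned, min_len=200):
    """Return (start, end) column ranges, end-exclusive, where every aligned
    sequence has the same residue and no sequence has a gap."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    runs = []
    start = None
    # Iterate one column past the end so a run touching the end is closed.
    for col in range(length + 1):
        same = (
            col < length
            and aligned[0][col] != "-"
            and all(s[col] == aligned[0][col] for s in aligned)
        )
        if same and start is None:
            start = col
        elif not same and start is not None:
            if col - start >= min_len:
                runs.append((start, col))
            start = None
    return runs

# Toy alignment with a short threshold so the behaviour is visible.
toy = ["ACGTACGTAA", "ACGTTCGTAA", "ACGTACGTAA"]
print(conserved_runs(toy, min_len=4))  # [(0, 4), (5, 10)]
```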
Junk is real!
Junk: supporting evidence
Functional
genomics
Genome-wide profiling of:
- mRNA levels
- Protein levels
Co-expression of genes
and/or proteins
Identifying protein-protein
interactions
Networks of interactions
Understanding the function of genes and other
parts of the genome
Structural
genomics
Assign structure to all
proteins encoded in
a genome
Structural Genomics
Currently: ~300 unique folds in PDB (27761 structures)
Estimate: 1000-3000 unique folds in structure space
Origin of tools
Immediately after the establishment of the
first databases, tools became available to
search them - at first in a very simple
manner, looking for keyword matches and
short sequence words and, then, in a more
sophisticated manner by using pattern
matching, alignment-based methods, and
machine learning techniques.
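The "short sequence words" idea mentioned above can be sketched as a word (k-mer) index: every word of length k in the database points to the sequences that contain it, and a query is scored by how many of its words each sequence shares. This is a toy illustration, not the algorithm of any particular tool; build_word_index, shared_word_counts, and the choice k=3 are hypothetical.

```python
from collections import defaultdict

def build_word_index(db, k=3):
    """Map each length-k word to the set of sequence names containing it."""
    index = defaultdict(set)
    for name, seq in db.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def shared_word_counts(query, index, k=3):
    """Count, per database sequence, the distinct query words it contains."""
    counts = defaultdict(int)
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for name in index.get(word, ()):
            counts[name] += 1
    return dict(counts)

db = {"seq_a": "ACGTAC", "seq_b": "TTTTTT"}
index = build_word_index(db)
print(shared_word_counts("CGTA", index))  # {'seq_a': 2}
```

Sequences with many shared words are candidates for a slower, alignment-based comparison.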
Despite the huge explosion in the number
and length of sequences, the tools used for
storage, retrieval, analysis, and
dissemination of data in bioinformatics are
very similar to those from 15-20 years ago.
Database or databank?
Initially
Databank (in UK)
Database (in the USA)
Solution
The abbreviation db
What is a database?
A collection of data that is:
- structured
- searchable (index) -> table of contents
- updated periodically (release) -> new edition
- cross-referenced (hyperlinks) -> links with other db
It also includes the associated tools (software) necessary for
access, updating, information insertion, and information
deletion.
Data storage management: flat files, relational
databases
Database: a "flat file" example
Flat-file database ("flat file", 3 entries):
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001; Ballet 2002;
//
Easy to manage: all the entries are visible at the same time!
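A flat file of this kind can be read with a few lines of code. The sketch below assumes the layout of the example: one "Field: value" pair per line and "//" closing each entry. The function name parse_flat_file is hypothetical, and real flat-file formats such as those of sequence databases have more elaborate rules.

```python
SAMPLE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000; Pottery 2001;
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000; Pottery 2001; Ballet 2001; Ballet 2002;
//
"""

def parse_flat_file(text):
    """Split a flat file into a list of {field: value} dictionaries."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-entry marker
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:  # tolerate a missing final "//"
        records.append(current)
    return records

records = parse_flat_file(SAMPLE)
print(len(records), records[1]["Last Name"])  # 2 Graur
```

Note that finding every teacher of a given course still requires scanning all entries, which is one reason larger collections move to relational storage.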
Database: a "relational" example
Relational database ("table file"):

Course                   Year          Involved teachers
Advanced Pottery         2000; 2001    1; 2
Ballet for Fat People    2001; 2002    2; 3

Teacher    Accession number    Education
Amos       1                   Biochemistry
Dan        2                   Genetics
John       3                   Scientology
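The same data can be loaded into an actual relational database to show how the tables link through the accession number. This is a minimal sketch using Python's built-in sqlite3 module. The table and column names (teacher, course, teaches) are assumptions, and the "1; 2" cells of the slide are stored here as one row per teacher, course, and year, which is a design choice made for the example and not something the slide prescribes.

```python
import sqlite3

def build_db():
    """Create an in-memory database holding the slide's teachers and courses."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
        CREATE TABLE course (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE teaches (course_id INTEGER, accession INTEGER, year INTEGER);
    """)
    con.executemany(
        "INSERT INTO teacher VALUES (?, ?, ?)",
        [(1, "Amos", "Biochemistry"), (2, "Dan", "Genetics"), (3, "John", "Scientology")],
    )
    con.executemany(
        "INSERT INTO course VALUES (?, ?)",
        [(1, "Advanced Pottery"), (2, "Ballet for Fat People")],
    )
    # One row per (course, teacher, year).
    con.executemany(
        "INSERT INTO teaches VALUES (?, ?, ?)",
        [(1, 1, 2000), (1, 1, 2001), (1, 2, 2000), (1, 2, 2001),
         (2, 2, 2001), (2, 2, 2002), (2, 3, 2001), (2, 3, 2002)],
    )
    return con

def teachers_of(con, title):
    """Names of the teachers involved in a course, found by joining the tables."""
    rows = con.execute(
        "SELECT DISTINCT t.name FROM teacher t "
        "JOIN teaches x ON x.accession = t.accession "
        "JOIN course c ON c.id = x.course_id "
        "WHERE c.title = ? ORDER BY t.name",
        (title,),
    ).fetchall()
    return [row[0] for row in rows]

con = build_db()
print(teachers_of(con, "Advanced Pottery"))  # ['Amos', 'Dan']
```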
Why biological databases?
Exponential growth in biological data.
Data (genomic sequences, 3D structures, 2D gel
analysis, MS analysis, microarrays, etc.) are no longer
published in a conventional manner, but directly
submitted to databases.
Essential tools for biological research.
Distribution of sequences
Books, articles 1968 -> 1985
Computer tapes 1982 -> 1992
Floppy disks 1984 -> 1990
CD-ROM 1989 ->
FTP 1989 ->
On-line services 1982 -> 1994
WWW 1993 ->
DVD 2001 ->
Some statistics
More than 1000 different biological databases
Variable size: <100 Kb to >20 Gb
- DNA: >20 Gb
- Protein: 1 Gb
- 3D structure: 5 Gb
- Other: smaller
Update frequency: from daily to annually to seldom to "forget about it".
Usually accessible through the web (some free, some not)
Some databases in the field of molecular biology
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GLYCBASE,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE,
SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc.
Categories of databases for Life Sciences
Sequences (DNA, protein)
Genomics
Mutation/polymorphism
Protein domain/family
Proteomics (2D gel, Mass Spectrometry)
3D structure
Metabolism
Bibliography
Expression (microarrays, etc.)
Specialized
Resources
NCBI (National Center for Biotechnology Information) is a
resource for molecular biology information. NCBI creates and
maintains public databases, conducts research in computational
biology, develops software tools for analyzing genome data, and
disseminates biomedical information. The NCBI site is constantly
being updated and some of the changes include new databases
and tools for data mining.
NCBI offers several searchable literature, molecular and
genomic databases and many bioinformatic tools. An up-to-date
list of databases and tools can be found on the NCBI Sitemap.
Literature Databases:
Bookshelf: A collection of searchable biomedical books linked to
PubMed.
PubMed: Allows searching by author names, journal titles, and a
new Preview/Index option. The PubMed database provides access to
over 12 million MEDLINE citations back to the mid-1960s. It
includes History and Clipboard options, which may enhance your
search session.
PubMed Central: The U.S. National Library of Medicine digital
archive of life science journal literature.
OMIM: Online Mendelian Inheritance in Man is a database of
human genes and genetic disorders (also OMIA).
GenBank:
http://www.ncbi.nlm.nih.gov/Genbank/
EBI:
http://www.ebi.ac.uk/
DDBJ:
http://www.ddbj.nig.ac.jp/
Type in a Query term
Enter your search words in the
query box and hit the Go button.
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
The Syntax
1. Boolean operators: AND, OR, NOT must be entered in
UPPERCASE (e.g., promoters OR response elements). The default
is AND.
2. Entrez processes all Boolean operators in a left-to-right sequence.
The order in which Entrez processes a search statement can be
changed by enclosing individual concepts in parentheses. The terms
inside the parentheses are processed first. For example, in the search
statement g1p3 OR (response AND element AND promoter), the
three AND terms are combined before the OR is applied.
3. Quotation marks: The term inside the quotation marks is read as one
phrase (e.g., "public health" is different from public health, which will
also include articles on public latrines and their effect on health
workers).
4. Asterisk: Extends the search to all terms that start with the letters
before the asterisk. For example, dia* will include such terms as
diaphragm, dial, and diameter.
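Rules 1 and 2 can be made concrete with a toy evaluator over a tiny keyword index. This is only a sketch of left-to-right Boolean processing with parentheses resolved first and AND as the default operator; it is not Entrez code, the index contents are invented, and the names tokenize, evaluate, and search are hypothetical. Phrase quoting and the asterisk are not handled.

```python
def tokenize(query):
    """Split a query into terms, operators, and parentheses."""
    return query.replace("(", " ( ").replace(")", " ) ").split()

def evaluate(tokens, index):
    """Combine result sets strictly left to right; parenthesized groups first."""
    def parse(pos):
        result = None
        op = "AND"  # default operator between adjacent terms
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == ")":
                return result, pos + 1
            if tok in ("AND", "OR", "NOT"):
                op = tok
                pos += 1
                continue
            if tok == "(":
                value, pos = parse(pos + 1)  # resolve the group first
            else:
                value = index.get(tok, set())
                pos += 1
            if result is None:
                result = value
            elif op == "AND":
                result = result & value
            elif op == "OR":
                result = result | value
            else:  # NOT
                result = result - value
            op = "AND"
        return result, pos
    return parse(0)[0]

def search(query, index):
    return evaluate(tokenize(query), index)

# Invented index: term -> set of record identifiers.
index = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}
print(search("a OR b AND c", index))    # {3}
print(search("a OR (b AND c)", index))  # {1, 2, 3}
```

The two queries differ only in the parentheses, which is exactly the effect rule 2 describes.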
Refine the Query
Often a search finds too many (or too few) sequences, so you
can go back and try again with more (or fewer) keywords in
your query
The History feature allows you to combine any of your past
queries.
The Limits feature allows you to limit a query to specific
organisms, sequences submitted during a specific period of
time, etc.
[Many other features are designed to search for literature in
MEDLINE]
Related Items
You can search for a text term in sequence annotations or in
MEDLINE abstracts, and find all articles, DNA, and protein
sequences that mention that term.
Then from any article or sequence, you can move to "related
articles" or "related sequences".
- Relationships between sequences are computed with BLAST.
- Relationships between articles are computed with MeSH terms
  (shared keywords).
- Relationships between DNA and protein sequences rely on accession
  numbers.
- Relationships between sequences and MEDLINE articles rely on
  both shared keywords and the mention of accession numbers in the
  articles.
Database Search Strategies
General search principles - not limited to
sequence (or to biology).
Start with broad keywords and narrow the search
using more specific terms.
Try variants of spelling, numbers, etc.
Search many databases.
Be persistent!!
PubMed
MEDLINE publication database
Over 17,000 journals
Some other citations
Papers from 1960 and on
Over 12,000,000 entries
Alerting services
http://www.pubcrawler.ie/
http://www.biomail.org/
Searching PubMed
Structureless searches
Automatic term mapping
Structured searches
Tags, e.g. [au], [ta], [dp], [ti]
Boolean operators, e.g. AND, OR, NOT, ()
Additional features
Subsets, limits
Clipboard, history
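A structured search can be assembled programmatically from the field tags listed above. The sketch below builds a tagged query string and, as an assumption that goes beyond these slides, wraps it in a URL for NCBI's E-utilities esearch service, a programmatic counterpart to the query box. The helper names build_query and esearch_url are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities esearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_query(author=None, journal=None, year=None, title_word=None):
    """Join the supplied fields with AND, using the [au], [ta], [dp], [ti] tags."""
    parts = []
    if author:
        parts.append(f"{author}[au]")
    if journal:
        parts.append(f"{journal}[ta]")
    if year:
        parts.append(f"{year}[dp]")
    if title_word:
        parts.append(f"{title_word}[ti]")
    return " AND ".join(parts)

def esearch_url(term, retmax=20):
    """Return a URL that would run the query against PubMed."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

query = build_query(author="Bejerano G", journal="Science", year="2004")
print(query)  # Bejerano G[au] AND Science[ta] AND 2004[dp]
print(esearch_url(query))
```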
Start working:
Search PubMed
1. cuban cigars
2. cuban OR cigars
3. "cuban cigars"
4. cuba* cigar*
5. (cuba* cigar*) NOT smok*
6. Fidel Castro
7. fidel castro
8. #6 NOT #7
The OMIM (Online Mendelian
Inheritance in Man)
Genes and genetic disorders
Edited by team at Johns Hopkins
Updated daily
MIM Number Prefixes
*            gene with known sequence
+            gene with known sequence and phenotype
#            phenotype description, molecular basis known
%            mendelian phenotype or locus, molecular basis unknown
no prefix    other, mainly phenotypes with suspected mendelian basis
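The prefix table lends itself to a simple lookup. The sketch below splits an entry such as "#123456" into its number and the meaning of its prefix, using the wording of the table above. The function name describe_mim is hypothetical, and the example number is made up and does not refer to a real OMIM record.

```python
MIM_PREFIXES = {
    "*": "gene with known sequence",
    "+": "gene with known sequence and phenotype",
    "#": "phenotype description, molecular basis known",
    "%": "mendelian phenotype or locus, molecular basis unknown",
}
NO_PREFIX = "other, mainly phenotypes with suspected mendelian basis"

def describe_mim(entry):
    """Return (number, meaning of prefix) for a prefixed MIM number string."""
    entry = entry.strip()
    if entry and entry[0] in MIM_PREFIXES:
        return entry[1:], MIM_PREFIXES[entry[0]]
    return entry, NO_PREFIX

# "123456" is an invented number used only for illustration.
print(describe_mim("#123456"))
# ('123456', 'phenotype description, molecular basis known')
```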
Searching OMIM
Search Fields
Name of trait, e.g., hypertension
Cytogenetic location, e.g., 1p31.6
Inheritance, e.g., autosomal dominant
Gene, e.g., coagulation factor VIII
OMIM search tags
All Fields [ALL]
Allelic Variant [AV] or [VAR]
Chromosome [CH] or [CHR]
Clinical Synopsis [CS] or [CLIN]
Gene Map [GM] or [MAP]
Gene Name [GN] or [GENE]
Reference [RE] or [REF]
Start working:
Search OMIM
How many types of hemophilia are there?
For how many is the affected gene known?
What are the genes involved in hemophilia A?
What are the mutations in hemophilia A?
Online Literature databases
1. How to use the UH online Library?
2. Online glossaries
3. Google Scholar
4. Google Books
5. Web of Science
How to use the online UH Library?
http://info.lib.uh.edu/
Online Glossaries
Bioinformatics:
http://www.geocities.com/bioinformaticsweb/glossary.html
http://big.mcw.edu/
Genomics:
http://www.geocities.com/bioinformaticsweb/genomicglossary.html
Molecular Evolution:
http://workshop.molecularevolution.org/resources/glossary/
Biology dictionary:
http://www.biology-online.org/dictionary/satellite_cells
Other glossaries, e.g., the list of phobias:
http://www.phobialist.com/class.html
3. Google Scholar
http://www.scholar.google.com/
What is Google Scholar?
Enables you to search specifically for scholarly
literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical
reports from all broad areas of research.
Use Google Scholar to find articles from a
wide variety of academic publishers,
professional societies, preprint repositories
and universities, as well as scholarly articles
available across the web.
Google Scholar orders your search results by how relevant they
are to your query, so the most useful references should appear at
the top of the page.
This relevance ranking takes into account the full text of each
article, the article's author, the publication in which the article
appeared, and how often it has been cited in scholarly literature.
What other DATA can we retrieve from the record?
4. Google Book Search
Start working:
Search Google Books
How many times is the tail of the giraffe
mentioned in On the Origin of Species by Mr.
Darwin?
5. Web of Science
http://portal01.isiknowledge.com.ezproxy.lib.uh.edu/portal.cgi?DestApp=WOS&Func=Frame
07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
7907/17/14 BAL HARI POUDEL CDBT KIRTIPUR
8007/17/14 BAL HARI POUDEL CDBT KIRTIPUR

More Related Content

1.bioinformatics introduction 32.03.2071

  • 1. 1 Introduction to Bioinformatics 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 2. 2 What is Bioinformatics? 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 3. 3 What is Bioinformatics? - Research, development, and application of computational tools and approaches for expanding the use of biological, medical, behavioral, and health data, including the means to acquire, store, organize, archive, analyze, or visualize such data. What is Computational Biology? - The development and application of analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social data. on molecular 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 4. 4 Large databases that can be accessed and analyzed with sophisticated tools have become central to biological research and education. The information content in the genomes of organisms, in the molecular dynamics of proteins, and in population dynamics, to name but a few areas, is enormous. Biologists are increasingly finding that the management of complex data sets is becoming a bottleneck for scientific advances. Therefore, bioinformatics is rapidly become a key technology in all fields of biology. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 5. 5 The present bottlenecks in bioinformatics include the education of biologists in the use of advanced computing tools, the recruitment of computer scientists into this evolving field, the limited availability of developed databases of biological information, and the need for more efficient and intelligent search engines for complex databases. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 6. 6 The present bottlenecks in bioinformatics include the education ofthe education of biologists in the use of advanced computing toolsbiologists in the use of advanced computing tools, the recruitment of computer scientists into this evolving field, the limited availability of developed databases of biological information, and the need for more efficient and intelligent search engines for complex databases. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 7. 7 Molecular Bioinformatics involves the use of computational tools to discover new information in complex data sets (from the one-dimensional information of DNA through the two-dimensional information of RNA and the three-dimensional information of proteins, to the four-dimensional information of evolving living systems). 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 8. From DNA to Genome 8 Watson and Crick DNA model Sanger sequences insulin protein Sanger dideoxy DNA sequencing PCR (Polymerase Chain Reaction) 1955 1960 1965 1970 1975 1980 1985 ARPANET (early Internet) PDB (Protein Data Bank) Sequence alignment GenBank database Dayhoffs Atlas 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 9. 9 1995 1990 2000 SWISS-PROT database NCBI World Wide Web BLAST FASTA EBI Human Genome Initiative First human genome draft First bacterial genome Yeast genome 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 10. 10 The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Origin of bioinformatics and biological databases: Nearly a decade later, the first nucleic acid sequence was reported, that of yeast tRNAalanine with 77 bases. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 11. 11 In 1965, Dayhoff gathered all the available sequence data to create the first bioinformatic database (Atlas of Protein Sequence and Structure). The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures. The SWISSPROT protein sequence database began in 1987.07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 12. Nucleotides 1207/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 13. Complete Genomes 13 1994 0 1995 1 December 2006 376 Eukaryotes 22 Bacteria 327 Archaea 27 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 14. 14 What can we do with sequences and other type of molecular information? 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 15. 15 Annotation Open reading frames Functional sites Structure, function 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 17. 17 CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG AAT ................................. .............. TGAAAAACGTA TF binding sitepromoter Ribosome binding Site ORF = Open Reading Frame CDS = Coding Sequence Transcription StartSite 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 18. 18 Comparative genomics Comparing ORFs Identifying orthologs Inferences on structure and function Comparing functional sites Inferences on regulatory networks 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 19. 19 Similarity profiles Researchers can learned a great deal about the structure and function of human genes by examining their counterparts in model organisms.07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 20. 20 Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL Bos MALWTRLRPLLALLALWPPPPARAFVNQHL **** : * *.*: *:..* :. *:**** Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ Bos CGSHLVEALYLVCGERGFFYTPKARREVEG : ** :*::* Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV Bos PQVG---ALELAGGPGAGGLEGPPQKRGIV .**. ** * * Xenopus EQCCHSTCSLFQLENYCN Bos EQCCASVCSLYQLENYCN **** *.***:** Alignment preproinsulin 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 21. 2107/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 22. Ultraconserved Elements in the Human Genome Gill Bejerano, Michael Pheasant, Igor Makunin, Stuart Stephen, W. James Kent, John S. Mattick, & David Haussler (Science 2004. 304:1321-1325) There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. There are 156 intergenic, untranscribed, ultraconserved segments 2207/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 23. 23 Junk is real! Junk:Junk: Supporting evidenceSupporting evidence 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 24. 24 Functional genomics Genome-wide profiling of: mRNA levels Protein levels Co-expression of genes and/or proteins Identifying protein-protein interactions Networks of interactions 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 25. 25 Understanding the function of genes and otherUnderstanding the function of genes and other parts of the genomeparts of the genome 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 26. 26 Structural genomics Assign structure to all proteins encoded in a genome 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 27. Structural Genomics 27 ~300 unique folds in PDB ~300 unique folds CurrentlyCurrently 27761 structures 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 28. Structural Genomics 28 1000-3000 unique folds in structure space EstimateEstimate 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 29. 29 Origin of tools Immediately after the establishment of the first databases, tools became available to search them - at first in a very simple manner, looking for keyword matches and short sequence words and, then, in a more sophisticated manner by using pattern matching, alignment based methods, and machine learning techniques. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 30. 30 Despite the huge explosion in the number and length of sequences, the tools used for storage, retrieval, analysis, and dissemination of data in bioinformatics are very similar to those from 15-20 years ago. 07/17/14 BAL HARI POUDEL CDBT KIRTIPUR
  • 31. 07/17/14 31BAL HARI POUDEL CDBT KIRTIPUR
  • 32. Database or databank? Initially Databank (in UK) Database (in the USA) Solution The abbreviation db 07/17/14 32BAL HARI POUDEL CDBT KIRTIPUR
  • 33. What is a database?A collection of data structured searchable (index) -> table of contents updated periodically (release) -> new edition cross-referenced (hyperlinks) -> links with other db Includes also associated tools (software) necessary for access, updating, information insertion, information deletion. Data storage management: flat files, relational databases 07/17/14 33BAL HARI POUDEL CDBT KIRTIPUR
  • 34. Database: a 束 flat file 損 example Accession number: 1 First Name: Amos Last Name: Bairoch Course: Pottery 2000; Pottery 2001; // Accession number: 2 First Name: Dan Last name: Graur Course: Pottery 2000, Pottery 2001; Ballet 2001, Ballet 2002 // Accession number 3: First Name: John Last name: Travolta Course: Ballet 2001; Ballet 2002; // Easy to manage: all the entries are visible at the same time ! Flat-file database (束 flat file, 3 entries 損): 07/17/14 34BAL HARI POUDEL CDBT KIRTIPUR
  • 35. Database: a 束 relational 損 example Course Year Involved teachers Advanced Pottery 2000; 2001 1; 2 Ballet for Fat People 2001; 2002 2; 3 Teacher Accession number Education Amos 1 Biochemistry Dan 2 Genetics John 3 Scientology Relational database (束table file損): 07/17/14 35BAL HARI POUDEL CDBT KIRTIPUR
  • 36. Why biological databases?Exponential growth in biological data. Data (genomic sequences, 3D structures, 2D gel analysis, MS analysis, Microarrays.) are no longer published in a conventional manner, but directly submitted to databases. Essential tools for biological research. 07/17/14 36BAL HARI POUDEL CDBT KIRTIPUR
  • 37. Distribution of sequences Books, articles 1968 -> 1985 Computer tapes 1982 -> 1992 Floppy disks 1984 -> 1990 CD-ROM 1989 -> FTP 1989 -> On-line services 1982 -> 1994 WWW 1993 -> DVD 2001 -> 07/17/14 37BAL HARI POUDEL CDBT KIRTIPUR
  • 38. Some statistics More than 1000 different biological databases Variable size: <100Kb to >20Gb DNA: > 20 Gb Protein: 1 Gb 3D structure: 5 Gb Other: smaller Update frequency: daily to annually to seldom to forget about it. Usually accessible through the web (some free, some not) 07/17/14 38BAL HARI POUDEL CDBT KIRTIPUR
  • 39. Some databases in the field of molecular biology AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, Biolmage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Picty_cDB, DIP, DOGS, DOMO, DPD, DPlnteract, ECDC, ECGC, EC02DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HlVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP5 Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, 0-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnaiRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene,Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS- MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc .................. !!!! 07/17/14 39BAL HARI POUDEL CDBT KIRTIPUR
  • 40. Categories of databases for Life Sciences: sequences (DNA, protein); genomics; mutation/polymorphism; protein domain/family; proteomics (2D gel, mass spectrometry); 3D structure; metabolism; bibliography; expression (microarrays); specialized.
  • 41. Resources: NCBI (National Center for Biotechnology Information) is a resource for molecular biology information. NCBI creates and maintains public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information. The NCBI site is constantly being updated, and the changes include new databases and tools for data mining. NCBI offers several searchable literature, molecular, and genomic databases and many bioinformatics tools. An up-to-date list of databases and tools can be found on the NCBI Sitemap.
  • 42. Literature databases. Bookshelf: a collection of searchable biomedical books linked to PubMed. PubMed: allows searching by author names, journal titles, and a new Preview/Index option. The PubMed database provides access to over 12 million MEDLINE citations dating back to the mid-1960s. It includes History and Clipboard options, which may enhance your search session. PubMed Central: the U.S. National Library of Medicine digital archive of life science journal literature. OMIM: Online Mendelian Inheritance in Man is a database of human genes and genetic disorders (also OMIA).
  • 44. Type in a query term: enter your search words in the query box and hit the Go button. http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
  • 45. The Syntax. 1. Boolean operators: AND, OR, NOT must be entered in UPPERCASE (e.g., promoters OR response elements). The default is AND. 2. Entrez processes all Boolean operators in a left-to-right sequence. The order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses. The terms inside the parentheses are processed first. For example, the search statement: g1p3 OR (response AND element AND promoter). 3. Quotation marks: the term inside the quotation marks is read as one phrase (e.g. "public health" is different from public health, which will also include articles on public latrines and their effect on health workers). 4. Asterisk: extends the search to all terms that start with the letters before the asterisk. For example, dia* will include such terms as diaphragm, dial, and diameter.
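The syntax rules on this slide can be sketched as a small query builder. This is only an illustration in Python: the helper names (`phrase`, `prefix`, `any_of`, `all_of`) are hypothetical and are not part of Entrez or of any NCBI library. The sketch only assembles a query string; it does not contact NCBI.

```python
def phrase(text):
    """Wrap a multi-word term in quotation marks so it is read as one phrase."""
    return f'"{text}"'


def prefix(stem):
    """Append an asterisk so the search covers all terms starting with the stem."""
    return f"{stem}*"


def _join(operator, terms):
    # Boolean operators are written in UPPERCASE, and the group is wrapped in
    # parentheses so that it is processed before anything outside it.
    return "(" + f" {operator} ".join(terms) + ")"


def any_of(*terms):
    """Combine terms with OR (hypothetical helper)."""
    return _join("OR", terms)


def all_of(*terms):
    """Combine terms with AND (hypothetical helper)."""
    return _join("AND", terms)


# The example statement from the slide, built from its parts.
query = any_of("g1p3", all_of("response", "element", "promoter"))
print(query)
print(phrase("public health"), prefix("dia"))
```

Building the inner group first mirrors the rule that terms inside parentheses are processed before the rest of the statement.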
  • 47. Refine the Query. Often a search finds too many (or too few) sequences, so you can go back and try again with more (or fewer) keywords in your query. The History feature allows you to combine any of your past queries. The Limits feature allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc. [Many other features are designed to search for literature in MEDLINE.]
  • 48. Related Items. You can search for a text term in sequence annotations or in MEDLINE abstracts, and find all articles, DNA, and protein sequences that mention that term. Then from any article or sequence, you can move to "related articles" or "related sequences". Relationships between sequences are computed with BLAST. Relationships between articles are computed with MeSH terms (shared keywords). Relationships between DNA and protein sequences rely on accession numbers. Relationships between sequences and MEDLINE articles rely on both shared keywords and the mention of accession numbers in the articles.
  • 52. Database Search Strategies. General search principles, not limited to sequence (or to biology): start with broad keywords and narrow the search using more specific terms. Try variants of spelling, numbers, etc. Search many databases. Be persistent!
  • 53. PubMed: MEDLINE publication database; over 17,000 journals; some other citations; papers from 1960 on; over 12,000,000 entries. Alerting services: http://www.pubcrawler.ie/ and http://www.biomail.org/
  • 54. Searching PubMed. Structureless searches: automatic term mapping. Structured searches: tags, e.g. [au], [ta], [dp], [ti]; Boolean operators, e.g. AND, OR, NOT, (). Additional features: subsets, limits; clipboard, history.
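A structured search attaches a field tag to each term. The sketch below, in Python, shows how such a query string might be assembled from the four tags listed on the slide ([au] author, [ta] journal title, [dp] date of publication, [ti] title). The function names and the example values ("smith j", "hemophilia", "2001") are made up for illustration, and the code only builds a string; it does not query PubMed.

```python
# Field tags named on the slide, keyed by a readable label (labels are our own).
FIELD_TAGS = {
    "author": "au",
    "journal": "ta",
    "date": "dp",
    "title": "ti",
}


def tagged(term, field):
    """Return the term followed by its field tag, e.g. 'hemophilia[ti]'."""
    return f"{term}[{FIELD_TAGS[field]}]"


def structured_query(**fields):
    """Join tagged terms with AND, in the order they were given."""
    return " AND ".join(tagged(term, field) for field, term in fields.items())


# Hypothetical search: an author, a title word, and a publication year.
query = structured_query(author="smith j", title="hemophilia", date="2001")
print(query)
```

Leaving the tags off would fall back to a structureless search, where the terms are interpreted by automatic term mapping instead of being tied to one field.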
  • 55. Start working: Search PubMed. 1. cuban cigars 2. cuban OR cigars 3. "cuban cigars" 4. cuba* cigar* 5. (cuba* cigar*) NOT smok* 6. Fidel Castro 7. fidel castro 8. #6 NOT #7
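One way to reason about what these exercises return is to treat each query as a set of matching records: AND is an intersection, OR is a union, and NOT is a set difference. The Python sketch below is a toy model over four invented document titles; it is not how PubMed is implemented. It assumes matching is case-insensitive and that a trailing asterisk matches any word starting with the stem.

```python
# Invented records, used only to illustrate the set logic.
DOCS = {
    1: "Cuban cigars and trade",
    2: "Cuban cigar smoking and health",
    3: "Cuba travel notes",
    4: "Fidel Castro biography",
}


def search(term):
    """Return the ids of records containing the term, ignoring case.

    A trailing '*' matches any word that starts with the preceding letters.
    """
    term = term.lower()
    hits = set()
    for doc_id, text in DOCS.items():
        words = text.lower().split()
        if term.endswith("*"):
            matched = any(word.startswith(term[:-1]) for word in words)
        else:
            matched = term in words
        if matched:
            hits.add(doc_id)
    return hits


# A numbered history of result sets, combined the way "#6 NOT #7" combines queries.
history = {}
history[1] = search("cuba*") & search("cigar*")    # AND: intersection
history[2] = search("cuban") | search("cigars")    # OR: union
history[3] = history[1] - search("smok*")          # NOT: difference
history[4] = search("Castro") - search("castro")   # same set twice, so empty
print(history)
```

In this toy model the last entry is empty because both spellings retrieve the same records, which is the kind of result the "#6 NOT #7" exercise is designed to reveal.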
  • 58. OMIM (Online Mendelian Inheritance in Man): genes and genetic disorders; edited by a team at Johns Hopkins; updated daily.
  • 59. MIM number prefixes: * gene with known sequence; + gene with known sequence and phenotype; # phenotype description, molecular basis known; % mendelian phenotype or locus, molecular basis unknown; no prefix: other, mainly phenotypes with suspected mendelian basis.
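The prefix table on this slide maps directly onto a lookup. The Python sketch below splits an entry into its number and the meaning of its prefix, using the wording from the slide. The function name `describe_mim` and the entry "123456" are hypothetical; no real MIM number is implied.

```python
# Prefix meanings as listed on the slide.
MIM_PREFIXES = {
    "*": "gene with known sequence",
    "+": "gene with known sequence and phenotype",
    "#": "phenotype description, molecular basis known",
    "%": "mendelian phenotype or locus, molecular basis unknown",
}
NO_PREFIX = "other, mainly phenotypes with suspected mendelian basis"


def describe_mim(entry):
    """Split an entry such as '#123456' into (number, prefix meaning)."""
    entry = entry.strip()
    if entry and entry[0] in MIM_PREFIXES:
        return entry[1:], MIM_PREFIXES[entry[0]]
    # Entries without one of the four symbols fall into the "no prefix" class.
    return entry, NO_PREFIX


# "123456" is a made-up number used only to exercise each branch.
for example in ("*123456", "#123456", "123456"):
    print(example, "->", describe_mim(example))
```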
  • 60. Searching OMIM. Search fields: name of trait, e.g., hypertension; cytogenetic location, e.g., 1p31.6; inheritance, e.g., autosomal dominant; gene, e.g., coagulation factor VIII.
  • 61. OMIM search tags: All Fields [ALL]; Allelic Variant [AV] or [VAR]; Chromosome [CH] or [CHR]; Clinical Synopsis [CS] or [CLIN]; Gene Map [GM] or [MAP]; Gene Name [GN] or [GENE]; Reference [RE] or [REF].
  • 63. Start working: Search OMIM. How many types of hemophilia are there? For how many is the affected gene known? What are the genes involved in hemophilia A? What are the mutations in hemophilia A?
  • 64. Online literature databases: 1. How to use the UH online Library? 2. Online glossaries 3. Google Scholar 4. Google Books 5. Web of Science
  • 65. How to use the online UH Library? http://info.lib.uh.edu/
  • 67. Online Glossaries. Bioinformatics: http://www.geocities.com/bioinformaticsweb/glossary.html and http://big.mcw.edu/ Genomics: http://www.geocities.com/bioinformaticsweb/genomicglossary.html Molecular evolution: http://workshop.molecularevolution.org/resources/glossary/ Biology dictionary: http://www.biology-online.org/dictionary/satellite_cells Other glossaries, e.g., the list of phobias: http://www.phobialist.com/class.html
  • 69. What is Google Scholar? It enables you to search specifically for scholarly literature, including peer-reviewed papers, theses, books, preprints, abstracts, and technical reports from all broad areas of research.
  • 70. Use Google Scholar to find articles from a wide variety of academic publishers, professional societies, preprint repositories, and universities, as well as scholarly articles available across the web.
  • 71. Google Scholar orders your search results by how relevant they are to your query, so the most useful references should appear at the top of the page. This relevance ranking takes into account the full text of each article, the article's author, the publication in which the article appeared, and how often it has been cited in scholarly literature.
  • 72. What other data can we retrieve from the record?
  • 74.
  • 75. 5. Google Book Search
  • 77. Start working: Search Google Books. How many times is the tail of the giraffe mentioned in On the Origin of Species by Mr. Darwin?
  • 78. 6. Web of Science http://portal01.isiknowledge.com.ezproxy.lib.uh.edu/portal.cgi?DestApp=WOS&Func=Frame

Editor's Notes

  • #74: This option will launch regular Google