
Prediction in bioinformatics
Important prediction problems:
Protein sequence from genomic DNA.
Protein 3D structure from sequence.
Protein function from structure.
Protein function from sequence.

From DNA to Cell Function
[Diagram: DNA sequence (split into genes) → amino acid sequence → protein 3D structure → protein function → cell activity. Arrow labels: codes for, folds into, dictates, determines, has.]
Example amino acid sequence, whose structure is the open question ("?") in the diagram:
MNIFEMLRID EGLRLKIYKD TEGYYTIGIG
HLLTKSPSLN AAKSELDKAI GRNCNGVITK
DEAEKLFNQD VDAAVRGILR NAKLKPVYDS
LDAVRRCALI NMVFQMGETG VAGFTNSLRM
LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI
TTFRTGTWDA YKNL

Protein structure: Limitations
• Not all proteins or parts of proteins assume a well-defined
3D structure in solution.
• Protein structure is not static; different parts of the structure
show varying degrees of thermal motion.
• There may be a number of slightly different
conformations in solution.
• Some proteins undergo conformational changes when
interacting with certain substances.
• Expected best residue-by-residue accuracies for secondary
structure prediction from multiple protein sequence
alignment.
• To address detailed functional biological questions.

Experimental Protein Structure Determination
• X-ray crystallography
– the most advanced method available for obtaining high-resolution
structural information about biological macromolecules
– in vitro
– needs crystals
– ~$100-200K per structure
• NMR
– fairly accurate
– in vivo
– no need for crystals
– limited to very small proteins
• Cryo-electron-microscopy
– imaging technology
– low resolution

Why predict protein structure?
• Millions of known sequences, but only 125,309 known structures.
• Structural knowledge brings understanding of function and
mechanism of action.
• Predicted structures can be used in structure-based drug design.
• It can help us understand the effects of mutations on structure and
function.
• To analyze the sequence-structure gap.
• Can help in prediction of function.
• It is a very interesting scientific problem, the subject of 50 years of effort.
• Prediction in one dimension
– Secondary structure prediction
– Surface accessibility prediction

Secondary structure prediction
• Historically, the first structure prediction methods predicted
secondary structure.
• Can be used to improve alignment accuracy.
• Can be used to detect domain boundaries within proteins
with remote sequence homology.
• Often the first step towards 3D structure prediction.
• Informative for mutagenesis studies.

Predicting Secondary Structure From Primary Structure
• accuracy 64-75%.
• higher accuracy for α-helices than for β-sheets.
• accuracy is dependent on protein family.
• predictions of engineered (artificial) proteins are less accurate.
Assumptions
• The entire information for forming secondary structure is contained
in the primary sequence.
• Side groups of residues will determine structure.
• Examining windows of 13-17 residues is sufficient to predict secondary
structure.
– α-helices: 5–40 residues long
– β-strands: 5–10 residues long

Why Secondary Structure Prediction?
• Simply easier problem than 3D structure prediction.
• Accurate secondary structure prediction can provide important
information for tertiary structure prediction.
• Improving alignment accuracy.
• Protein function prediction.
• Protein classification.

Protein structure prediction
• The inference of the three-dimensional structure of
a protein from its amino acid sequence.
– i.e. the prediction of its folding and its secondary and tertiary
structure from its primary structure.
• Structure prediction is fundamentally different from the
inverse problem of protein design.
• Protein structure prediction is one of the most important
goals pursued by bioinformatics and theoretical chemistry.
• It is highly important in medicine (in drug design)
and biotechnology (in the design of novel enzymes).

Methods of structure prediction
Ab initio protein folding approaches
Comparative (homology) modelling
Fold recognition/threading

History of protein secondary structure prediction
First generation
Based on single residue statistics.
Examples: Chou-Fasman method, LIM method, GOR I, etc.
Accuracy: low
Second generation
Based on segment statistics.
Examples: ALB method, GOR III, etc.
Accuracy: ~60%
Third generation
Based on long-range interaction, homology based
Examples: PHD
Accuracy: ~70%

First generation methods: single residue statistics
Chou & Fasman (1974 & 1978):
• Some residues have particular secondary-structure preferences.
• Based on experimental frequencies of residues in α-helices, β-sheets,
and coils.
• Examples: Glu → α-helix, Val → β-strand
• Accuracy: Q3 ~50-60%

Chou-Fasman statistics
• R – amino acid, S – secondary structure
• f(R,S) – number of occurrences of R in S
• f(R) – total number of occurrences of R
• Ns – total number of amino acids in conformation S
• N – total number of amino acids
• P(R,S) – propensity of amino acid R to be in structure S
• P(R,S) = (f(R,S)/f(R))/(Ns/N)

Example
• #residues = 20,000
• #helix = 4,000
• #Ala = 2,000
• #Ala in helix = 500
• f(Ala, α) = 500/20,000
• f(Ala) = 2,000/20,000
• p(α) = Nα/N = 4,000/20,000
• P(Ala, α) = (500/2,000) / (4,000/20,000) = 1.25
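
The propensity calculation can be sketched in a few lines of Python. This is only an illustration of the formula P(R,S) = (f(R,S)/f(R))/(Ns/N); the function name `propensity` and its parameter names are made up here and are not from any library.

```python
def propensity(count_r_in_s: int, count_r: int, count_s: int, total: int) -> float:
    """Chou-Fasman propensity P(R,S) = (f(R,S) / f(R)) / (Ns / N).

    count_r_in_s: occurrences of residue R in structure S, f(R,S)
    count_r:      occurrences of residue R overall, f(R)
    count_s:      residues found in structure S, Ns
    total:        residues in the whole data set, N
    """
    return (count_r_in_s / count_r) / (count_s / total)


# Numbers from the alanine / helix example in the slide.
p_ala_helix = propensity(500, 2000, 4000, 20000)
print(round(p_ala_helix, 2))  # 1.25
```

A value above 1 means the residue occurs in that structure more often than expected by chance, so Ala counts as a helix former in this example.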

Second generation methods: segment statistics
• Similar to single-residue methods, but incorporating
additional information (adjacent residues, segmental
statistics).
• Problems:
– Low accuracy - Q3 below 66% (results).
– Q3 of -strands (E) : 28% - 48%.
– Predicted structures were too short.
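
Q3, the accuracy measure quoted throughout these slides, is the fraction of residues whose predicted state matches the observed one when each residue is assigned one of three states (helix H, strand E, coil C). A minimal Python sketch follows; the function name `q3` and the two example strings are hypothetical and only serve as illustration.

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of residues whose three-state label (H, E or C) is predicted correctly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 10-residue example: 8 of the 10 states agree.
observed = "HHHHCCEEEC"
predicted = "HHHCCCEEEE"
print(q3(predicted, observed))  # 0.8
```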

The GOR method
• Developed by Garnier, Osguthorpe & Robson
• Builds on Chou-Fasman Pij values.
• Evaluates each residue plus the 8 adjacent N-terminal and 8 adjacent
carboxyl-terminal residues.
• Sliding window of 17 residues.
• Underpredicts β-strand regions.
• GOR method accuracy: Q3 ~64%
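
The 17-residue sliding window (a central residue plus 8 neighbours on each side) can be sketched as below. This shows only the window extraction, not the GOR scoring itself. The function name `windows` and the choice to pad the sequence ends with "-" are assumptions made for this illustration.

```python
from typing import Iterator, Tuple


def windows(sequence: str, flank: int = 8, pad: str = "-") -> Iterator[Tuple[str, str]]:
    """Yield (central residue, window) pairs, one per residue.

    Each window holds the central residue plus `flank` residues on either
    side. Positions that fall off the ends are filled with `pad`
    (an assumption of this sketch).
    """
    padded = pad * flank + sequence + pad * flank
    size = 2 * flank + 1
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + size]


# First 30 residues of the example sequence shown earlier in the slides.
seq = "MNIFEMLRIDEGLRLKIYKDTEGYYTIGIG"
for residue, window in list(windows(seq))[:3]:
    print(residue, window)
```

A window-based predictor would score each window and assign the resulting state to the central residue.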

Third generation methods
• Third generation methods reached 77% accuracy.
• They rely on two new ideas:
1. A biological idea: using evolutionary information based on
conservation analysis of multiple sequence alignments.
2. A technological idea: using neural networks.

Artificial Neural Networks
An attempt to imitate the human brain (assuming that
this is the way it works).

Neural network models
- machine learning approach
- provide training sets of structures (e.g. α-helices, non-α-helices)
- computers are trained to recognize patterns in known
secondary structures
- provide test set (proteins with known structures)
- accuracy ~70-75%

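Correlation coefficient
• pα – true positives
• oα – false positives (overpredicted)
• nα – true negatives
• uα – false negatives (underpredicted)
$$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(p_\alpha + o_\alpha)(p_\alpha + u_\alpha)(n_\alpha + o_\alpha)(n_\alpha + u_\alpha)}}$$
Cα = 1 (= 100%) corresponds to a perfect prediction.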
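
A small Python sketch of this correlation coefficient (the Matthews correlation coefficient), computed from the four counts for one state: p (true positives), n (true negatives), o (overpredicted) and u (underpredicted). The function name `correlation` and the example counts are hypothetical. Returning 0.0 when the denominator is zero is a convention assumed here, not something stated in the slides.

```python
import math


def correlation(p: int, n: int, o: int, u: int) -> float:
    """C = (p*n - u*o) / sqrt((p+o)(p+u)(n+o)(n+u)) for one structural state."""
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    if denominator == 0:
        return 0.0  # assumed convention for the undefined case
    return (p * n - u * o) / denominator


# Hypothetical counts for the helix state of a 100-residue protein.
print(round(correlation(p=30, n=50, o=10, u=10), 3))  # 0.583
print(correlation(p=50, n=50, o=0, u=0))              # 1.0
```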

Reasons for improved accuracy
• Align sequence with other related proteins of the
same protein family.
• Find members that have a known structure.
• If there are significant matches between structure and sequence,
assign secondary structures to the corresponding
residues.

New and Improved Third-Generation Methods
Exploit evolutionary information. Based on conservation
analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%)
Rost B, Sander, C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%)
Jones, D. T. (1999) J. Mol. Biol. 292, 195-202.
Arguably remains the top secondary structure prediction method.

Secondary Structure Prediction Summary
1st Generation - 1970s
• Q3 = 50-55%
• Chou & Fasman, GOR
2nd Generation -1980s
• Q3 = 60-65%
• Qian & Sejnowski, GORIII
3rd Generation - 1990s
• Q3 = 70-80%
• PHD, PSIPRED
Many 3rd+ generation methods exist:
PSIPRED - http://bioinf.cs.ucl.ac.uk/psipred/
JPRED - http://www.compbio.dundee.ac.uk/~www-jpred/
PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
NNPRED - http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html

Protein 3D structure data
The structure of a protein consists of the 3D (X,Y,Z) coordinates of each
non-hydrogen atom of the protein.
Some protein structures also include coordinates of covalently linked
prosthetic groups, non-covalently linked ligand molecules, or metal ions.
For some purposes (e.g. structural alignment) only the Cα coordinates are
needed.
Example of PDB format:
Record  Serial  Atom  Residue  Res. no.       X        Y       Z  Occupancy  Temp. factor
ATOM        18  N     GLY            27  40.315  161.004  11.211       1.00         10.11
ATOM        19  CA    GLY            27  39.049  160.737  10.462       1.00         14.18
ATOM        20  C     GLY            27  38.729  159.239  10.784       1.00         20.75
ATOM        21  O     GLY            27  39.507  158.484  11.404       1.00         21.88
Note: these ATOM records carry no explicit information about connectivity
between atoms. The last two numbers (occupancy, temperature factor) relate to
disorder of atomic positions in crystals.
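
Reading ATOM records like those in the example and keeping only the Cα coordinates can be sketched as follows. This handles only the simplified, whitespace-separated layout shown in the slide. Real PDB files use fixed-width columns and carry extra fields such as the chain identifier, so this is not a general PDB parser. The `Atom` type and `parse_atom_line` function are hypothetical names.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    residue_number: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float


def parse_atom_line(line: str) -> Atom:
    """Parse one simplified, whitespace-separated ATOM line (10 fields)."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "ATOM":
        raise ValueError(f"not a simplified ATOM record: {line!r}")
    _, serial, name, residue, number, x, y, z, occupancy, temp = fields
    return Atom(int(serial), name, residue, int(number),
                float(x), float(y), float(z), float(occupancy), float(temp))


records = """\
ATOM 18 N GLY 27 40.315 161.004 11.211 1.00 10.11
ATOM 19 CA GLY 27 39.049 160.737 10.462 1.00 14.18
ATOM 20 C GLY 27 38.729 159.239 10.784 1.00 20.75
ATOM 21 O GLY 27 39.507 158.484 11.404 1.00 21.88
"""

atoms = [parse_atom_line(line) for line in records.splitlines()]
# Keep only the C-alpha atoms, as needed e.g. for structural alignment.
ca_coords = [(a.x, a.y, a.z) for a in atoms if a.name == "CA"]
print(ca_coords)  # [(39.049, 160.737, 10.462)]
```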

protein structure prediction in bioinformatics.ppt

Protein structure: Some computational tasks
• Building a protein structure model from X-ray data
• Building a protein structure model from NMR data
• Computing the energy for a given protein structure (conformation)
• Energy minimization: finding the structure with the minimal energy according
to some empirical "force fields"
• Simulating the protein folding process (molecular dynamics)
• Structure visualization
• Computing secondary structure from atomic coordinates
• Protein superposition, structural alignment
• Protein fold classification
• Threading: finding a fold (prototype structure) that fits to a sequence
• Docking: fitting ligands onto a protein surface by molecular dynamics or energy
minimization
• Protein 3D structure prediction from sequence

Viewing protein structures
When looking at a protein structure, we may ask the following types of
questions:
• Is a particular residue on the inside or outside of a protein?
• Which amino acids interact with each other?
• Which amino acids are in contact with a ligand (DNA, peptide
hormone, small molecule, etc.)?
• Is an observed mutation likely to disturb the protein structure?
Standard capabilities of protein structure software:
• Display of protein structures in different ways (wireframe, backbone,
sticks, spacefill, ribbon)
• Highlighting of individual atoms, residues or groups of residues
• Calculation of interatomic distances
• Advanced feature: Superposition of related structures
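
The interatomic distance calculation is a Euclidean distance between two coordinate triples. A minimal Python sketch, using the N and Cα coordinates of GLY 27 from the PDB excerpt earlier in these slides; the function name `distance` is made up for illustration.

```python
import math
from typing import Tuple

Coordinate = Tuple[float, float, float]


def distance(a: Coordinate, b: Coordinate) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist(a, b)  # available in Python 3.8 and later


n_atom = (40.315, 161.004, 11.211)   # N of GLY 27
ca_atom = (39.049, 160.737, 10.462)  # CA of GLY 27
print(f"{distance(n_atom, ca_atom):.3f}")  # 1.495
```

PDB coordinates are given in ångströms, so this corresponds to an N-Cα bond of roughly 1.5 Å.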

Example: c-abl oncoprotein SH2 domain, display wireframe

Example: c-abl oncoprotein SH2 domain, display sticks

Example: c-abl oncoprotein SH2 domain, display backbone

Example: c-abl oncoprotein SH2 domain, display spacefill

Example: c-abl oncoprotein SH2 domain, display ribbons
