Protein loop classification using Artificial Neural Networks

Armando Vieira (1) and Baldomero Oliva (2)
(1) ISEP and Centro de Física Computacional, Coimbra, Portugal
    www.defi.isep.ipp.pt/~asv
(2) Structural Bioinformatics Laboratory (GRIB), IMIM/Universitat Pompeu Fabra, Barcelona, Spain
XXI: the century of BIO
BIOINFORMATICS
joining two worlds apart
Outline
Brief review of protein structure
Statement of the problem and why it is so hard
Data pre-processing, corrections, updates and beyond multiple alignments…
Neural Networks in protein structure prediction
HLVQ
Results and future work
Proteins

All proteins are chains built from the 20 amino acids
Not all chains of amino acids are proteins
Fold rapidly and repeatedly
Proteins are the machinery of life
Essential to all (known) organisms
The Gist of it

Amino acid sequence → Physical structure → Function
Typical globular protein
MEKKEFHIVAETG
IHARPATLLVQTA
SLFNSDINLETLG
KSVNLKSIMGVMS
LGVGQGSDVTITV
DGADEADGMAAI
VETLQLQGLAQ
Coarse-Grained Model
ψ = +180
  b   b   b   p   o   M   e   e   e
  b   b   b   p   o   M   M   e   e
  b   b   b   p   .   l   l   s   e
  a   a   a   T   .   l   l   g   N
  N   a   a   a   .   U   l   g   N
  N   a   a   a   .   U   g   g   N
  I   a   a   a   .   G   G   G   I
  e   F   F   F   o   e   e   e   e
  b   b   b   p   o   e   e   e   e
ψ = -180
φ = -180 (left) to φ = +180 (right)
Ramachandran Alphabet
    180°
                B
     90°


ψ    0°
                       A         G

     -90°
                                       E
    -180°
        -180°   -90°       0°   90°   180°
                           φ
5-letter alphabet
Residue Sequence    3° Structure



 MEKKEFHIVAET      ACCDECBAABDE
 GIHARPATLLVQT     CBDABCDBEABD
 ASLFNSDINLETL     BCBDBAEBDBDB
 GKSVNLKSIMGV      AEBABDCBBDBA
 MSLGVGQGSDVT      DDCBDBCBDBEB
 ITVDGADEADGM      DBCBBDCAABDE
 AAIVETLQLQGLA     DCDCEAABACAA
 Q...              AADC…
What shall we do?
• Ab initio:
  Quantum Mechanics +
  big computers +
  large # configurations
= huge problems…

• Machine Learning:
Use known cases to learn a suitable
  map: sequence → structure
Machine Learning Approach
Artificial Neural Networks
• A problem-solving paradigm modeled after the
  physiological functioning of the human brain.

• Neurons in the brain are modeled by computational nodes, and
  synapses by the weighted connections between them.

• The firing of a neuron is modeled by input, output, and
  threshold functions.

• The network “learns” based on problems to which answers
  are known (supervised learning).

• The network can then produce answers to entirely new
  problems of the same type.
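The bullets above can be illustrated with the smallest possible network: a single node with a threshold function, trained on examples whose answers are known. This is a minimal sketch on a toy problem (logical AND), not the network used in this work; the names `step`, `predict`, `train` and `AND_SAMPLES` are illustrative assumptions.

```python
def step(total, threshold=0.0):
    """Threshold function: the node fires (1) only above the threshold."""
    return 1 if total > threshold else 0


def predict(weights, bias, inputs):
    """Weighted sum of the inputs followed by the threshold function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(total)


def train(samples, epochs=20, lr=1.0):
    """Supervised learning with the classic perceptron rule.

    samples: list of (inputs, known_answer) pairs.
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Move the weights only when the answer was wrong.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Toy training set (assumption for illustration): logical AND.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_SAMPLES)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```

A single node like this can only separate classes with a straight line; the hidden layers on the next slide are what allow more complex maps.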
Neural Networks
Input
Layer
                Hidden
                Layers




                          Output
                          Layer
Overfitting – high risk!




A less complicated hypothesis can have a lower error rate on unseen data
Hidden Layer Vector Quantization (HLVQ)

[Figure: two-class scatter plots (o and x); panels: Traditional NN, HLVQ]

Main advantage: detect and correct predictions for outliers
Loops, loops everywhere!!!
Look for a ǴDZ…
Geometry of the Motif
Loop Types

α-α: α-helix - α-helix
α-β: α-helix - β-strand
β-α: β-strand - α-helix
β-hairpin: β-strand - β-strand
β-link: β-strand - β-strand
α−α
Similar conformation: aa{b}aa / aa{p}aa
Identical geometry: (4,6)(0,45)(45,90)(180,225)

Pro 75%
Ser 75%

1.3.1 aa{p}aa
1.1.2 aa{b}aa

© Baldomero Oliva
Class α−α
ArchDB database

~ 20 000 loops classified into ~ 3000 classes.
Example class label: EE-3.4.1
(loop type - loop size . consensus . motif)
TASK: classify a loop from sequence alone
If not possible, get as much information as
possible
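To make the class label format concrete, a label such as EE-3.4.1 can be split into its four fields. This is a minimal sketch that assumes every label follows the "type-size.consensus.motif" pattern shown on the slide; `LoopClass` and `parse_loop_class` are hypothetical names, not part of ArchDB.

```python
from typing import NamedTuple


class LoopClass(NamedTuple):
    loop_type: str   # e.g. "EE"
    loop_size: int   # first number after the hyphen
    consensus: int   # second number
    motif: int       # third number


def parse_loop_class(label: str) -> LoopClass:
    """Split a label of the assumed form 'TYPE-size.consensus.motif'."""
    loop_type, _, rest = label.partition("-")
    size, consensus, motif = (int(part) for part in rest.split("."))
    return LoopClass(loop_type, size, consensus, motif)


print(parse_loop_class("EE-3.4.1"))
```

With the label in this form, the classification task can be relaxed step by step: predicting only `loop_type` is the easiest target, and the full label is the hardest.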
Problems

• Coding of amino acids

• Huge search space, sparsely populated

• How to assign the loop classes?

• High dimensionality → Large Networks → poor
  generalization
Amino acid coding: the classical way
      A → (1, 0, …0)
      C → (0, 1, …0)
      …
      Y → (0, 0, …1)

  Useful but not efficient!!!
I am working to improve it…
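The classical coding above is a one-hot vector of length 20 per residue. A minimal sketch follows, assuming the residues are ordered alphabetically by one-letter code (which matches A first, C second, Y last on the slide); `AMINO_ACIDS`, `one_hot` and `encode` are illustrative names.

```python
# Assumed ordering: alphabetical by one-letter code (A, C, D, ..., W, Y).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """Return a 20-element vector with a single 1 at the residue's position."""
    vector = [0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1
    return vector


def encode(sequence):
    """Concatenate the one-hot vectors of all residues into one network input."""
    return [bit for residue in sequence for bit in one_hot(residue)]


inputs = encode("MEKK")
print(len(inputs), sum(inputs))
```

The example also shows why this coding is not efficient: every residue costs 20 network inputs, of which 19 are zero, so input size grows quickly with loop length.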
Theory; but how about applications?!
β-β link and β-β hairpins from sequence

HLVQ results, with MLP results in parentheses (%)

                    Predicted β-β link   Predicted β-β hairpin
   Real β-β link       88.4 (79.4)           11.6 (20.6)
   Real β-β hairpin    12.5 (16.1)           87.5 (83.9)
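One way to compare the two models in the table above is the unweighted mean of the per-class accuracies (the diagonal entries). This is a sketch of that arithmetic only; class sizes are not given on the slide, so the mean is deliberately unweighted, and the names `CORRECT` and `mean_class_accuracy` are assumptions.

```python
# Diagonal entries (correct predictions, in %) copied from the table.
CORRECT = {
    "HLVQ": {"β-β link": 88.4, "β-β hairpin": 87.5},
    "MLP": {"β-β link": 79.4, "β-β hairpin": 83.9},
}


def mean_class_accuracy(per_class):
    """Unweighted mean of per-class accuracies (class sizes are unknown)."""
    return sum(per_class.values()) / len(per_class)


for model, per_class in CORRECT.items():
    print(f"{model}: {mean_class_accuracy(per_class):.2f}% mean per-class accuracy")
```

On these numbers HLVQ comes out roughly six percentage points ahead of the MLP.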
Prediction of all loop types from sequence alone

           β-β lk   α-β    β-β hp   β-α    α-α
 β-β lk    45.9     28.5    3.7     19.8    2.1
 α-β        8.8     67.4    1.2     18.0    4.6
 β-β hp     0.4      0.9   96.1      2.1    0.5
 β-α        4.4      6.2    2.4     79.5    7.6
 α-α        4.0     15.7    1.3     20.3   58.6
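The table above can be read as a confusion matrix. The sketch below assumes that rows are the real classes and columns the predicted ones (each row sums to about 100%, as in the two-class table); it reports the accuracy for each class and the class it is most often mistaken for. `LABELS`, `MATRIX` and `summarize` are illustrative names.

```python
LABELS = ["β-β lk", "α-β", "β-β hp", "β-α", "α-α"]

# Values in %, copied from the table; rows assumed real, columns predicted.
MATRIX = [
    [45.9, 28.5, 3.7, 19.8, 2.1],
    [8.8, 67.4, 1.2, 18.0, 4.6],
    [0.4, 0.9, 96.1, 2.1, 0.5],
    [4.4, 6.2, 2.4, 79.5, 7.6],
    [4.0, 15.7, 1.3, 20.3, 58.6],
]


def summarize(labels, matrix):
    """For each real class: (label, % correct, most confused label, % confused)."""
    summary = []
    for i, row in enumerate(matrix):
        others = [k for k in range(len(row)) if k != i]
        worst = max(others, key=lambda k: row[k])
        summary.append((labels[i], row[i], labels[worst], row[worst]))
    return summary


for real, correct, confused_with, pct in summarize(LABELS, MATRIX):
    print(f"{real}: {correct}% correct, most often predicted as {confused_with} ({pct}%)")
```

Read this way, β-β hairpins are by far the easiest class, while β-β links are the hardest and are most often predicted as α-β loops.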
What does it all mean?
Given a loop residue sequence, we can
(usually) identify its native structure.
Not ab initio: We cannot tell the structure
of a novel sequence.
HLVQ is superior to MLP
Future Work


Better coding of amino acids
Larger sequences / low complexity
Going beyond structure
Clever alphabet that explores similarities
Multiobjective Genetic Algorithms
Beyond Multiple Alignments

• Alignments are good … but expensive and
  boring ...
• Information contained in a multiple
  alignment can, in principle, be expressed
  using an adequate amino acid coding
  scheme      Sensibility

• How?        Genetic Algorithm
Coded Amino Acids

  Alanine (A)         Arginine (R)    Asparagine (N)   Aspartic Acid (D)   Cysteine (C)
  Glutamic Acid (E)   Glutamine (Q)   Glycine (G)      Histidine (H)       Isoleucine (I)
  Leucine (L)         Lysine (K)      Methionine (M)   Phenylalanine (F)   Proline (P)
  Serine (S)          Threonine (T)   Tryptophan (W)   Tyrosine (Y)        Valine (V)

http://www.chemie.fu-berlin.de/chemistry/bio/
ArchDB database
Protein Data Bank (PDB), http://www.rcsb.org, contains ~ 25 000
proteins with known structure, out of ~ 10^6
entries in SWISS-PROT

ArchDB ~ 20 000 classified loops
