�ݺ�ߣ

1. A generic motif discovery algorithm for diverse biomolecular data Kyle Jensen Gregory Stephanopoulos Department of Chemical Engineering Massachusetts Institute of Technology

2. Motif discovery is the automated search for similar regions in streams of data Un-sequential data No “ordering” Sequential data A natural ordering of the data Nucleotide and amino acid sequences

3. Stock prices, protein structures MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA A motif is just a collection of mutually similar regions in the data stream

4. There are two classes of motif discovery tools commonly used for sequence analysis “Exhaustive” regular-expression based tools Teiresias

5. Pratt “Descriptive” position weight matrix-based tools Gibbs sampler

6. MEME

7. Consensus TGCTGTATATACTCACAGCA AACTGTATATACACCCAGGG TACTGTATGAGCATACAGTA ACCTGAATGAATATACAGTA TACTGTACATCCATACAGTA TACTGTATATTCATTCAGGT AACTGTTTTTTTATCCAGTA ATCTGTATATATACCCAGCT TACTGTATATAAAAACAGTA CT[AT].[GT]....A..CAG

8. “Gemoda” was designed to be exhaustive and have descriptive power Gemoda exhaustively returns maximal motifs Uses convolution of Teiresias Way of “stiching together” smaller patterns combinatorially Gets descriptiveness from similarity metric Generic, context dependent definition of similarity MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA F(w 1 , w 2 ) = square error F(w 1 , w 2 ) = aa scoring matrix

9. Gemoda proceeds in three steps: comparison, clustering, and convolution Jensen, K., Styczynski,M., Rigoutsos,I. and Stephanopoulos,G. (2005) A generic motif discovery algorithm for sequential data. Bioinformatics, in press

10. The comparison stage is used to map the pairwise similarities between all windows in the data streams Creates an distance matrix Does an all-by-all comparison of windows in the data

11. Comparison function is context-specific F(w 1 , w 2 )

12. The clustering phase is used to find groups of mutually similar windows Different clustering functions have different uses Clique-finding is provably exhaustive

13. K-means and other methods are faster Output clusters become “elementary motifs” which are convolved to make longer, maximal motifs

14. The convolution phase is used to “stitch” together the clusters into maximal motifs The motifs should be as long as possible, without decreasing the support elementary motifs (clusters) window ordering

15. Here we show a few representative ways in which Gemoda can be used Motif discovery in... Protein sequences (ppGpp)ase enzymes & finding known domains DNA sequences The LD-motif challenge problem Protein structures Conserved structures without conserved sequences

16. Gemoda can be applied to amino acid sequences as well Example: (ppGpp)ase family from ENZYME database Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes EC 3.1.7.2

17. Ave. length ~700 amino acids

18. 8 sequences from 8 species Searched using Gemoda Minimum length = 50 amino acids

19. Minimum Blosum62 bit score = 50 bits

20. Minimum support = 100% (8/8 sequences)

21. Clustering method = clique finding Can Gemoda find this known motif? How sensitive is Gemoda to “noise?”

22. (ppGpp)ase example: the comparison phase shows many regions of local similarity Dots indicate 50aa windows that are pairwise similar Streaks indicate regions that will probably be convolved into a maximal motif

23. (ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences

24. (ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database Maximal motif (one of three, ~100 aa in length) This particular cluster represents the first set of 8 50aa windows in the above motif. Results are insensitive to “noise”

25. The LD-motif problem models the subtle binding site discovery problem GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Pevzner & Sze, Proc. ISMB, 2000

26. Gemoda can solve both the LD-motif problem and a more generalized version of the same GG GACTCGATAGCGACG CCG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT... Total motif length ? Styczynski,M., Jensen,K., Rigoutsos,I. and Stephanopoulos,G. (2004) An extension and novel solution to the Motif Challenge Problem. Genome Informatics, 15 (2).

27. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG X All sequences ? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...

28. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG Number of mutations ? Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...

29. Gemoda can solve both the LD-motif problem and a more generalized version of the same GACTCGATAGCGACG Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG... Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA... Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT... Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTA TATCTGGTTCGACTT AGCTATCTATTCGAC GACTCG TGG GCG G CG ... ... Sequence #m: ATGCTAC TATCTTATTCGACTG AGTACGACTATAGCTACT GA T TCG T TAG G GACG ATAGCTACTATGACTAGTGACT... Number of unique motifs ?

30. Gemoda can also be applied to protein structures Treat protein structure as alpha-carbon trace Series of x,y,z coordinates Use a clustering function that compares x,y,z windows Root mean square deviation (RMSD)

31. unit-RMSD x 1 y 1 z 1 x 2 y 2 z 2 x 3 y 3 z 3 ........................... x M y M z M

32. Protein structure example: human FIT vs. uridylyltransferase

33. Questions?

34. The Gemoda algorithm has guarantees of maximality and exhaustiveness Maximality Motifs are as long as possible

35. Motifs are as specific as possible

36. Motifs are not missing an occurrences Exhaustiveness All maximal motifs are found

37. No non-maximal motifs are found = motif1 = motif2

38. Gemoda may be used with nucleotide sequences to find regulatory motifs The LD-motif problem: an example for the board >AACTG >AATTA >AATTG Look for motifs of at least 3 nucletides with a Hamming distance between any window of 3 of 1 or less given: 1 = AAC 2 = ACT 3 = CTG 4 = AAT 5 = ATT 6 = TTA 7 = AAT 8 = ATT 9 = TTG We get the following windows:

39. A simple natural language example Choosing a window length of L=4 gives 7 unique windows in the three sequences Seq 1: motif Seq 2: motor Seq 3: potion

40. Here we show the comparison phase using two different similarity metrics X's and dotted lines Identify matrix: ¾ O's and solid lines Consonant/vowel matrix: ¾ Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Input sequences Seq 1: motif Seq 2: motor Seq 3: potion Similarity graph

41. The clustered windows (elementary motifs) are different depending on the similarity function Clustering phase Clique-finding

42. Support >= 2 Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Cluster 1 1: moti 3: moto 5: poti Cluster 2 2: otif 4: otor 6: otio Cluster 1 1: moti 3: moto Cluster 2 1: moti 5: poti Cluster 3 2: otif 6: otio Solid lines (vowel/cons): Dotted lines (identity):

43. Likewise, the final, convolved motifs depend on the similarity function choice Motif 1 motif motor potio Seq 1: motif Seq 2: motor Seq 3: potion Windows 1: moti 2: otif 3: moto 4: otor 5: poti 6: otio 7: tion Vowel/cons: Motif 1 motif potio Motif 2 moti moto Identity: Cluster 1 1: moti 3: moto 5: poti Cluster 2 2: otif 4: otor 6: otio Cluster 1 1: moti 3: moto Cluster 2 1: moti 5: poti Cluster 3 2: otif 6: otio Vowel/cons: Identity:

�ݺ�ߣ

Gemoda

Recommended

More Related Content

What's hot (7)

Viewers also liked (6)

Similar to Gemoda (20)

More from Kyle Jensen (20)

Gemoda