A generic motif discovery algorithm for diverse biomolecular data
  • Kyle Jensen, Gregory Stephanopoulos
  • Department of Chemical Engineering, Massachusetts Institute of Technology
Motif discovery is the automated search for similar regions in streams of data
  • Un-sequential data: no ordering.
  • Sequential data: a natural ordering of the data, such as nucleotide and amino acid sequences, stock prices, and protein structures.
  • Example stream: MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA
  • A motif is just a collection of mutually similar regions in the data stream.
There are two classes of motif discovery tools commonly used for sequence analysis
  • Exhaustive, regular-expression-based tools: Teiresias, Pratt.
  • Descriptive, position weight matrix-based tools: Gibbs sampler, MEME, Consensus.
  • Example sites:
    TGCTGTATATACTCACAGCA
    AACTGTATATACACCCAGGG
    TACTGTATGAGCATACAGTA
    ACCTGAATGAATATACAGTA
    TACTGTACATCCATACAGTA
    TACTGTATATTCATTCAGGT
    AACTGTTTTTTTATCCAGTA
    ATCTGTATATATACCCAGCT
    TACTGTATATAAAAACAGTA
  • Regular-expression pattern: CT[AT].[GT]....A..CAG
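As a small illustration of what a regular-expression motif looks like in practice, the sketch below scans a sequence with the pattern exactly as transcribed from the slide. The helper name `find_motif` and the demo sequence are hypothetical; the demo string was built to satisfy the pattern and is not taken from the slides, and the transcribed pattern is not guaranteed to match every site listed above.

```python
import re

# Pattern copied verbatim from the slide: '.' matches any base,
# [AT] matches A or T, [GT] matches G or T.
PATTERN = re.compile(r"CT[AT].[GT]....A..CAG")

def find_motif(sequence):
    """Return (start, matched_text) for every non-overlapping match."""
    return [(m.start(), m.group()) for m in PATTERN.finditer(sequence)]

# Hypothetical demo sequence constructed to contain one match.
demo = "GG" + "CTAAGAAAAAAACAG" + "TT"
print(find_motif(demo))
```

A regular expression either matches a site or it does not, which is what makes this class of tools easy to search exhaustively but less descriptive than a position weight matrix.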
Gemoda was designed to be exhaustive and have descriptive power
  • Gemoda exhaustively returns maximal motifs. It uses the convolution step of Teiresias, a way of stitching together smaller patterns combinatorially.
  • It gets its descriptiveness from the similarity metric: a generic, context-dependent definition of similarity.
  • For numeric data: F(w1, w2) = squared error.
  • For amino acid sequences: F(w1, w2) = amino acid scoring matrix.
Gemoda proceeds in three steps: comparison, clustering, and convolution
  • Jensen, K., Styczynski, M., Rigoutsos, I. and Stephanopoulos, G. (2005) A generic motif discovery algorithm for sequential data. Bioinformatics, in press.
The comparison stage maps the pairwise similarities between all windows in the data streams
  • It performs an all-by-all comparison of windows in the data.
  • It creates a distance matrix.
  • The comparison function F(w1, w2) is context-specific.
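The comparison stage can be sketched as follows for numeric streams, using squared error as the comparison function F(w1, w2). This is a minimal sketch and not the Gemoda implementation: the function names, the window length, and the two short streams are assumptions made for illustration.

```python
def windows(stream, length):
    """All contiguous windows of the given length, in order."""
    return [stream[i:i + length] for i in range(len(stream) - length + 1)]

def squared_error(w1, w2):
    """One possible F(w1, w2) for numeric data: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(w1, w2))

def distance_matrix(streams, length, compare):
    """All-by-all comparison of every window in every stream."""
    # Each entry records (stream index, start position, window contents).
    labelled = [(s, i, w) for s, stream in enumerate(streams)
                for i, w in enumerate(windows(stream, length))]
    matrix = [[compare(a[2], b[2]) for b in labelled] for a in labelled]
    return labelled, matrix

# Hypothetical numeric streams (for example, two short price series).
streams = [[1.0, 2.0, 3.0, 2.0], [1.0, 2.0, 4.0]]
labelled, matrix = distance_matrix(streams, 3, squared_error)
for row in matrix:
    print(row)
```

Because `compare` is passed in as an argument, the same all-by-all loop works for any data type once a suitable F(w1, w2) is supplied.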
The clustering phase finds groups of mutually similar windows
  • Different clustering functions have different uses: clique-finding is provably exhaustive; K-means and other methods are faster.
  • The output clusters become elementary motifs, which are convolved to make longer, maximal motifs.
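Clique-finding on a similarity graph can be sketched with the basic Bron-Kerbosch recursion, a standard way to enumerate maximal cliques. The slides do not say which clique algorithm Gemoda uses, so this choice, the function name, and the small graph are assumptions for illustration only.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    adjacency maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []

    def expand(current, candidates, excluded):
        # A clique is maximal when nothing can extend it.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return sorted(cliques)

# Hypothetical similarity graph: windows 0, 1, 2 are mutually similar,
# and window 3 is similar only to window 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(maximal_cliques(graph))
```

Every window in a clique is similar to every other window in it, which is the "mutually similar" property the clustering phase asks for.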
The convolution phase stitches the clusters together into maximal motifs
  • The motifs should be as long as possible without decreasing the support.
  • Figure labels: elementary motifs (clusters), window ordering.
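A single stitching step can be sketched as below. It assumes each elementary motif is stored as a set of (sequence id, start position) pairs and that support means the number of distinct sequences containing the motif; both are simplifying assumptions, and the cluster contents are hypothetical.

```python
def convolve(left, right):
    """Occurrences of `left` whose next window (same sequence, start + 1)
    is an occurrence of `right`. Each occurrence is (sequence_id, start)."""
    return {(seq, pos) for (seq, pos) in left if (seq, pos + 1) in right}

def support(occurrences):
    """Number of distinct sequences that contain the motif."""
    return len({seq for seq, _ in occurrences})

# Hypothetical elementary motifs over three sequences.
cluster_a = {(0, 4), (1, 9), (2, 2)}
cluster_b = {(0, 5), (1, 10), (2, 7)}

longer = convolve(cluster_a, cluster_b)
print(sorted(longer), support(longer))
```

Here the stitched motif is one position longer but survives in only two of the three sequences, so whether to keep it depends on the minimum support that was requested.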
Here we show a few representative ways in which Gemoda can be used
  • Motif discovery in...
  • Protein sequences: (ppGpp)ase enzymes and finding known domains.
  • DNA sequences: the LD-motif challenge problem.
  • Protein structures: conserved structures without conserved sequences.
Gemoda can be applied to amino acid sequences as well
  • Example: the (ppGpp)ase family from the ENZYME database.
  • Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes, EC 3.1.7.2.
  • Average length ~700 amino acids.
  • 8 sequences from 8 species.
  • Searched using Gemoda with:
    Minimum length = 50 amino acids
    Minimum BLOSUM62 bit score = 50 bits
    Minimum support = 100% (8/8 sequences)
    Clustering method = clique finding
  • Can Gemoda find this known motif? How sensitive is Gemoda to noise?
(ppGpp)ase example: the comparison phase shows many regions of local similarity
  • Dots indicate 50 aa windows that are pairwise similar.
  • Streaks indicate regions that will probably be convolved into a maximal motif.
(ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences
(ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's Conserved Domain Database
  • Maximal motif (one of three, ~100 aa in length).
  • This particular cluster represents the first set of eight 50 aa windows in the above motif.
  • Results are insensitive to noise.
The LD-motif problem models the subtle binding site discovery problem
  • Motif: GACTCGATAGCGACG
  • Sequence #1: ATGAT GA G TC T ATTG C G C CG CGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
  • Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCT CT CTCGAT T GCGAC T TTCGACTAGCTA...
  • Sequence #3: ATGTACTACGA G T CTC C ATAGCG TT G CTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
  • Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGAC GACTCG TGG GCG G CG ...
  • ...
  • Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACT GA T TCG TA AG G GACG ATAGCTACTATCTTATTCGACTAGTACGACT...
  • Pevzner & Sze, Proc. ISMB, 2000
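A test set of this kind can be generated by planting mutated copies of a motif in random background sequences. The sketch below uses the motif shown on the slide; the number of mutations, the number of sequences, the sequence length, and the random seed are illustrative assumptions, and all function names are hypothetical.

```python
import random

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions substituted."""
    chars = list(motif)
    for i in rng.sample(range(len(motif)), d):
        chars[i] = rng.choice([b for b in "ACGT" if b != chars[i]])
    return "".join(chars)

def plant(motif, d, n_sequences, length, rng):
    """Random background sequences, each with one mutated motif instance.

    Returns a list of (start_of_instance, sequence) pairs.
    """
    result = []
    for _ in range(n_sequences):
        background = "".join(rng.choice("ACGT") for _ in range(length))
        start = rng.randrange(length - len(motif) + 1)
        instance = mutate(motif, d, rng)
        result.append((start, background[:start] + instance
                       + background[start + len(motif):]))
    return result

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

rng = random.Random(0)       # fixed seed so the sketch is repeatable
motif = "GACTCGATAGCGACG"    # motif shown on the slide
data = plant(motif, 4, 5, 60, rng)  # d=4, 5 sequences, length 60: assumed
for start, seq in data:
    print(start, seq[start:start + len(motif)])
```

Each instance differs from the motif in exactly d positions, so two instances can differ from each other in up to 2d positions, which is what makes the planted signal subtle.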
Gemoda can solve both the LD-motif problem and a more generalized version of the same
  • Styczynski, M., Jensen, K., Rigoutsos, I. and Stephanopoulos, G. (2004) An extension and novel solution to the Motif Challenge Problem. Genome Informatics, 15 (2).
  • The example sequences #1 to #m for the motif GACTCGATAGCGACG are shown again on each slide, each time with one question about what is known in advance.
  • Total motif length? (the slide shows GG GACTCGATAGCGACG CCG)
  • Does the motif occur in all sequences?

Gemoda can solve both the LD-motif problem and a more generalized version of the same (continued)
  • Number of mutations?
  • Number of unique motifs? (the slide sets apart further segments, TATCTGGTTCGACTT in sequence #4 and TATCTTATTCGACTG in sequence #m, alongside the instances of GACTCGATAGCGACG)
Gemoda can also be applied to protein structures
  • Treat the protein structure as an alpha-carbon trace: a series of x, y, z coordinates.
  • Use a comparison function that compares x, y, z windows: root mean square deviation (RMSD).
  • unit-RMSD over a window of M coordinates: (x_1, y_1, z_1), (x_2, y_2, z_2), (x_3, y_3, z_3), ..., (x_M, y_M, z_M)
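A window comparison for coordinate data can be sketched with a plain RMSD. This is an assumption-laden simplification: the slide's "unit-RMSD" is not defined in the transcript and may differ, and the version below removes translation only, with no rotational superposition. The coordinates are hypothetical.

```python
import math

def centered(window):
    """Translate a window of (x, y, z) points so its centroid is the origin."""
    n = len(window)
    cx, cy, cz = (sum(p[i] for p in window) / n for i in range(3))
    return [(x - cx, y - cy, z - cz) for x, y, z in window]

def rmsd(w1, w2):
    """RMSD between two equal-length coordinate windows after centering.

    No rotational superposition is performed here.
    """
    a, b = centered(w1), centered(w2)
    total = sum((p[i] - q[i]) ** 2 for p, q in zip(a, b) for i in range(3))
    return math.sqrt(total / len(a))

# Hypothetical alpha-carbon windows of M = 3 points.
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 5.0, y - 2.0, z + 1.0) for x, y, z in trace]
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(rmsd(trace, shifted), rmsd(trace, bent))
```

A translated copy of a window scores essentially zero, while a window with a different shape scores above zero, so a threshold on this value can play the role of the similarity cutoff.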
Protein structure example: human FIT vs. uridylyltransferase
The Gemoda algorithm has guarantees of maximality and exhaustiveness
  • Maximality: motifs are as long as possible; motifs are as specific as possible; motifs are not missing any occurrences.
  • Exhaustiveness: all maximal motifs are found; no non-maximal motifs are found.
  • Figure legend: motif1, motif2.
Gemoda may be used with nucleotide sequences to find regulatory motifs
  • The LD-motif problem: an example for the board.
  • Given:
    >AACTG
    >AATTA
    >AATTG
  • Look for motifs of at least 3 nucleotides in which any two windows of length 3 have a Hamming distance of 1 or less.
  • We get the following windows: 1 = AAC, 2 = ACT, 3 = CTG, 4 = AAT, 5 = ATT, 6 = TTA, 7 = AAT, 8 = ATT, 9 = TTG
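The board example can be reproduced in a few lines: slide a window of length 3 over each sequence, number the windows in order, and list the pairs within Hamming distance 1. The variable and function names are assumptions made for this sketch.

```python
from itertools import combinations

def windows(seq, length):
    """All contiguous windows of the given length, in order."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

sequences = ["AACTG", "AATTA", "AATTG"]

# Number the windows 1..9 in the same order as the slide.
numbered = {}
for seq in sequences:
    for w in windows(seq, 3):
        numbered[len(numbered) + 1] = w

# Pairs of window numbers whose Hamming distance is 1 or less.
similar_pairs = [(i, j) for i, j in combinations(numbered, 2)
                 if hamming(numbered[i], numbered[j]) <= 1]
print(numbered)
print(similar_pairs)
```

The list of similar pairs is the edge list of the similarity graph that the clustering phase works on. Note that this sketch also reports pairs of windows from the same sequence.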
A simple natural language example
  • Seq 1: motif
  • Seq 2: motor
  • Seq 3: potion
  • Choosing a window length of L = 4 gives 7 unique windows in the three sequences.
Here we show the comparison phase using two different similarity metrics
  • Input sequences: Seq 1: motif, Seq 2: motor, Seq 3: potion
  • Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion
  • Similarity graph:
    X's and dotted lines: identity matrix, ¾
    O's and solid lines: consonant/vowel matrix, ¾
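The two metrics can be sketched as scoring functions over the seven windows. The sketch assumes that ¾ means two windows count as similar when at least 3 of their 4 positions agree, either in the letter itself (identity) or in the letter class (consonant or vowel). The function names are hypothetical.

```python
from itertools import combinations

WINDOWS = {1: "moti", 2: "otif", 3: "moto", 4: "otor",
           5: "poti", 6: "otio", 7: "tion"}

def identity_score(w1, w2):
    """Fraction of positions holding the same letter."""
    return sum(a == b for a, b in zip(w1, w2)) / len(w1)

def cv_class(letter):
    """Classify a letter as vowel (V) or consonant (C)."""
    return "V" if letter in "aeiou" else "C"

def consonant_vowel_score(w1, w2):
    """Fraction of positions holding the same class (consonant or vowel)."""
    return sum(cv_class(a) == cv_class(b) for a, b in zip(w1, w2)) / len(w1)

def similarity_edges(score, threshold=0.75):
    """Pairs of window numbers whose score reaches the threshold."""
    return [(i, j) for i, j in combinations(WINDOWS, 2)
            if score(WINDOWS[i], WINDOWS[j]) >= threshold]

identity_edges = similarity_edges(identity_score)
cv_edges = similarity_edges(consonant_vowel_score)
print(identity_edges)
print(cv_edges)
```

The consonant/vowel metric is looser than identity, so it yields a denser similarity graph over the same windows.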
The clustered windows (elementary motifs) are different depending on the similarity function
  • Clustering phase: clique-finding, support >= 2.
  • Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion
  • Solid lines (vowel/cons):
    Cluster 1: 1: moti, 3: moto, 5: poti
    Cluster 2: 2: otif, 4: otor, 6: otio
  • Dotted lines (identity):
    Cluster 1: 1: moti, 3: moto
    Cluster 2: 1: moti, 5: poti
    Cluster 3: 2: otif, 6: otio
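These clusters can be checked with a brute-force search for maximal cliques, which is practical only because there are 7 windows. The sketch assumes two windows are similar when at least 3 of 4 positions agree under the chosen metric, and it counts cluster size in windows rather than in sequences, which coincides here because each cluster holds one window per sequence. The function names are hypothetical.

```python
from itertools import combinations

WINDOWS = {1: "moti", 2: "otif", 3: "moto", 4: "otor",
           5: "poti", 6: "otio", 7: "tion"}

def matches(w1, w2, key):
    """Number of positions at which key(letter) agrees."""
    return sum(key(a) == key(b) for a, b in zip(w1, w2))

def cliques(key, min_matches=3, min_size=2):
    """Brute-force maximal cliques: fine for 7 windows, not for real data."""
    def similar(i, j):
        return matches(WINDOWS[i], WINDOWS[j], key) >= min_matches

    # Every group of windows in which all pairs are similar.
    groups = [set(g) for size in range(min_size, len(WINDOWS) + 1)
              for g in combinations(WINDOWS, size)
              if all(similar(i, j) for i, j in combinations(g, 2))]
    # Keep only groups that are not contained in a larger group.
    return sorted(sorted(g) for g in groups
                  if not any(g < other for other in groups))

identity = cliques(lambda ch: ch)               # same letter
vowel_cons = cliques(lambda ch: ch in "aeiou")  # same class
print(identity)
print(vowel_cons)
```

Under these assumptions the output agrees with the clusters listed on the slide for both metrics.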
Likewise, the final, convolved motifs depend on the similarity function choice
  • Input sequences: Seq 1: motif, Seq 2: motor, Seq 3: potion
  • Windows: 1: moti, 2: otif, 3: moto, 4: otor, 5: poti, 6: otio, 7: tion
  • Clusters, vowel/cons:
    Cluster 1: 1: moti, 3: moto, 5: poti
    Cluster 2: 2: otif, 4: otor, 6: otio
  • Clusters, identity:
    Cluster 1: 1: moti, 3: moto
    Cluster 2: 1: moti, 5: poti
    Cluster 3: 2: otif, 6: otio
  • Final motifs, vowel/cons:
    Motif 1: motif, motor, potio
  • Final motifs, identity:
    Motif 1: motif, potio
    Motif 2: moti, moto
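The step from clusters to final motifs can be sketched as repeated extension followed by a maximality filter. This is a simplified illustration and not the published Gemoda procedure: it grows a motif by one position whenever enough of its occurrences are followed by a window from some cluster, then discards any motif whose occurrences all lie inside a longer or larger one. The names and the `min_support` default are assumptions.

```python
SEQUENCES = ["motif", "motor", "potion"]
L = 4  # window length from the slides
# Window number -> (sequence index, start), numbered as on the slides.
POSITION = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1),
            5: (2, 0), 6: (2, 1), 7: (2, 2)}

def convolve_all(clusters, min_support=2):
    """Grow elementary motifs one position at a time, keep maximal ones."""
    elementary = [frozenset(POSITION[w] for w in c) for c in clusters]
    found = {(occ, L) for occ in elementary}
    frontier = set(found)
    while frontier:
        grown = set()
        for occ, length in frontier:
            step = length - L + 1  # offset of the next window to append
            for nxt in elementary:
                kept = frozenset((s, p) for s, p in occ
                                 if (s, p + step) in nxt)
                if len(kept) >= min_support:
                    grown.add((kept, length + 1))
        frontier = grown - found
        found |= frontier

    def covered(small, big):
        """True if every occurrence of small lies inside one of big."""
        (occ_s, len_s), (occ_b, len_b) = small, big
        return small != big and any(
            all((s, p - k) in occ_b for s, p in occ_s)
            for k in range(len_b - len_s + 1))

    maximal = [m for m in found if not any(covered(m, o) for o in found)]
    return sorted(sorted(SEQUENCES[s][p:p + n] for s, p in occ)
                  for occ, n in maximal)

identity_motifs = convolve_all([[1, 3], [1, 5], [2, 6]])
vowel_cons_motifs = convolve_all([[1, 3, 5], [2, 4, 6]])
print(identity_motifs)
print(vowel_cons_motifs)
```

On this toy input the sketch returns the same final motifs as the slide: two motifs under identity and a single three-sequence motif under the consonant/vowel metric.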