狠狠撸

狠狠撸Share a Scribd company logo
Department of Information Technology (INTEC), Ghent University, Belgium
A Parallel, Distributed-Memory Framework
for Comparative Motif Discovery
1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester,
1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier
email: jan.fostier@intec.ugent.be
1Department of Information Technology (INTEC), Ghent University, Belgium
2Department of Plant Systems Biology, VIB, Ghent, Belgium
3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
Background (1)
Department of Information Technology (INTEC), Ghent University, Belgium 2
CCACGTG
gene
promoter
transcription
factor
protein
regulate
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 3
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Gene families
Department of Information Technology (INTEC), Ghent University, Belgium 4
zma
sbi
bdi
osa
DNA sequences
orthologous genes (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012)
TSS
2kbp promoter
gene family 1
TSS
2kbp promoter
gene family 17724
TSS
2kbp promoter
gene family 2
…
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 5
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 17.20 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 6
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 64.52 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 7
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 100 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 8
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 0 %
2kbp promoter
Scoring conservation within a gene family
motif occurrence
Genomewide scoring (1)
Department of Information Technology (INTEC), Ghent University, Belgium 9
2kbp promoter
gene family 1
2kbp promoter
gene family 17724
2kbp promoter
gene family 2
…
Genome wide scoring of conservation
BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %
> BLSthres ? > BLSthres ? > BLSthres ?
yes no yes
Fw = # families with BLS > BLSthres
Genomewide scoring (2)
Department of Information Technology (INTEC), Ghent University, Belgium 10
Genome wide scoring of conservation (2)
Fw = # families with BLS > BLSthres for word w
Fbg = median # families with BLS > BLSthres for random permutations of w
C = 1 ?
Fbg
Fw
confidence score
Retain only motifs with confidence score C > 0.9
Stark et al., Genome Res. 2007
Example
Department of Information Technology (INTEC), Ghent University, Belgium 11
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
>40 % > 50% > 60% > 70% > 80% >90%
CCACGTG 578 545 423 224 135 116
C = 1 ?
Fbg
Fw
= 1 -
201
578
= 0.64
1000 random permutations of w
BLS threshold
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 12
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Motif discovery
Department of Information Technology (INTEC), Ghent University, Belgium 13
Main idea: extend previous ideas
to all words that occur in the sequences
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 14
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 15
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 16
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 17
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT
S2 = TTATGC
Actual discovery
Department of Information Technology (INTEC), Ghent University, Belgium 18
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
word seq1 seq2
A y y
AT y y
ATG y y
ATGC - y
ATGT y -
ATGTA y -
ATGTAT y -
C - y
G y y
GC - y
…
Depth-first walk in the tree
Can be extended to IUPAC alphabet (Sagot, 2001)
Discovery of words
Department of Information Technology (INTEC), Ghent University, Belgium 19
Single gene family, words with length [6 … 12], max 3 deg. chars
0
50
100
150
200
250
300
350
400
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
ACGT ACGT+N IUPAC  {BDHV} Full IUPAC
Time(s)
Numberofwords
Number of words Time
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 20
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Parallel discovery
Department of Information Technology (INTEC), Ghent University, Belgium 21
Intra-family phase
Process 1 Process P
w conserved in
3/4 organisms
BLS score
for w: 73% w’ conserved in
2/4 organisms
BLS score
for w’: 23%
gene family x gene family y
local hash table local hash table...
Communication phase
Department of Information Technology (INTEC), Ghent University, Belgium 22
Communication
phase
...
Core idea: words with the same composition are sent to the same processes
e.g. CACGTG and CCATGG
Process 1 Process P
Statistical evaluation
Department of Information Technology (INTEC), Ghent University, Belgium 23
Inter-family phase
aggregated hash table aggregated hash table
statistical evaluation
…
CACGTG: significant
ACCGGT: not significant
TGGCAC: not significant
…
statistical evaluation
…
TGANTGA: significant
GTATAGN: not significant
GANATTG: not significant
…
...
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 24
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
class Mapper
method Map(familyID id, family F)
GST = createGeneralizedSuffixTree(F);
for all word w ? GST do
BLS = computeBLS(w)
f = computeFrequencyVector(BLS)
Emit(sort(w), pair(w, f))
w = CCACGTG
BLS = 62%
f = [ 1 | 1 | 1 | 0 | 0 | 0 ]
Emit(ACCCGGT, <CCACGTG, [111000]>)
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 25
class Reducer
method Reduce(word s, pairs[<w1,f1>, <w2,f2>, …])
H = new AssociativeArray
for all <w,f> ? [<w1,f1>, <w2,f2>, …] do
H{w} = H{w} + f
BGM = computeBackgroundModel(H)
for all w ? H
C = computeConfidence(w, H{w}, BGM)
if (C > 0.9) Emit(w, H{w})
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
H C = 0.75
C = 0.60
C = 0.91
C = 0.93
C = 0.75
Results
? Dataset description
? 4 organisms (zma, sbi, osa, bdi)
? 17639 gene families
? Alphabet: IUPAC  {BDHV} alphabet with length [6 ... 12]
? maximum of 3 degenerate characters
? 20 instances on AWS (19 workers type m1.xlarge)
? Map time: 20h 50 min
? Reduce time: 12h 38 min
? Cost price of 421.6$
? Number of <k, v> emitted by mappers: 531 835 669 220
? Map output bytes = 13 827 727 399 720
? 26 bytes per kv
? Output materialized: 4 019 787 705 377 = 3.65 TByte
? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)
? Number of <k, v> emitted by reducers: 620 294 857
Department of Information Technology (INTEC), Ghent University, Belgium 26
Where do we go from here?
? Output data is still big (~ 50 Gbyte of data)
? Redundancy present in output
? Clustering algorithms
? Mapping words back to the promoter sequences
? Filtering output is easy
? Subsets with higher confidence scores
? Subsets with specific BLS scores
? Subsets which match specific motif
– Worked example: KN1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 27
We need further post-processing
Knotted 1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 28
Motif BLS min Confidence PWM score # target predicted
CHiP-seq
overlap
TGACTGACTGAC 15 100 13.94 5 3
TGACKGACTGAC 15 100 13.69 5 3
TGACKGAYTGAC 15 100 13.69 5 3
TGACTGACKGAC 15 100 13.45 5 3
TGACTGAYKGAC 15 100 13.45 5 3
TGACTGACWGAC 15 100 13.25 5 3
TGACTGAYWGAC 15 100 13.25 5 3
TGACKGACKGAC 15 100 13.20 6 3
TGACKGACWGAC 15 100 13.00 5 3
TGACWGACTGAC 15 100 12.33 7 4
TGACWGAYTGAC 15 100 12.33 8 4
… … … … … …
Bolduc et al. 2011
Conclusion
? We developed…
? A Parallel Framework for Motif Discovery
? Based on Phylogenetic Footprinting
? MPI implementation (memory issues)
? MapReduce implementation (scales well)
? Key advantages…
? Runs in the cloud
? Exhaustively explores search space (IUPAC)
? Alignment-free (unique feature)
? Flexible framework
? Future work…
? Development of post-processing tools
? Likely also cloud-based
Department of Information Technology (INTEC), Ghent University, Belgium 29
Department of Information Technology – Internet Based Communication Networks and Services (IBCN)
Questions ?
Jan Fostier
jan.fostier@intec.ugent.be
Department of Information Technology (INTEC)
Ghent University

More Related Content

Similar to Ppam presentatie (20)

Data Structure: Algorithm and analysis
Data Structure: Algorithm and analysisData Structure: Algorithm and analysis
Data Structure: Algorithm and analysis
Dr. Rajdeep Chatterjee
?
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Voica Gavrilut
?
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
Victor Asanza
?
Biochip
BiochipBiochip
Biochip
nayakslideshare
?
BOSC 2008 Biopython
BOSC 2008 BiopythonBOSC 2008 Biopython
BOSC 2008 Biopython
tiago
?
Towards reading genomic data using deep learning-driven NLP techniques
Towards reading genomic data using deep learning-driven NLP techniquesTowards reading genomic data using deep learning-driven NLP techniques
Towards reading genomic data using deep learning-driven NLP techniques
Wesley De Neve
?
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Koji Yamamoto
?
Tech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflowTech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflow
Ramdhan Rizki
?
High Performance Reconfigurable Computing at NECSTLab
High Performance Reconfigurable Computing at NECSTLabHigh Performance Reconfigurable Computing at NECSTLab
High Performance Reconfigurable Computing at NECSTLab
NECST Lab @ Politecnico di Milano
?
DReAMS: High Performance Reconfigurable Computing at NECSTLab
DReAMS: High Performance Reconfigurable Computing at NECSTLabDReAMS: High Performance Reconfigurable Computing at NECSTLab
DReAMS: High Performance Reconfigurable Computing at NECSTLab
NECST Lab @ Politecnico di Milano
?
Multimodal Deep Learning
Multimodal Deep LearningMultimodal Deep Learning
Multimodal Deep Learning
Universitat Politècnica de Catalunya
?
Antao Biopython Bosc2008
Antao Biopython Bosc2008Antao Biopython Bosc2008
Antao Biopython Bosc2008
bosc_2008
?
Inria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCCInria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCC
Stéphanie Roger
?
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Moses Boudourides
?
Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...
IRJET Journal
?
HUG + Nomica: a scalable FPGA-based architecture for variant-calling
HUG + Nomica: a scalable FPGA-based architecture for variant-callingHUG + Nomica: a scalable FPGA-based architecture for variant-calling
HUG + Nomica: a scalable FPGA-based architecture for variant-calling
NECST Lab @ Politecnico di Milano
?
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdfChapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
freepcgame75
?
第二周課程 Arduino介紹
第二周課程 Arduino介紹第二周課程 Arduino介紹
第二周課程 Arduino介紹
冠宇 陳
?
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
F?rderverein Technische Fakult?t
?
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
jesujoseph
?
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Scheduling in Time-Sensitive Networks (TSN) for Mixed-Criticality Industrial ...
Voica Gavrilut
?
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
????? Device Free Indoor Localization in the 28 GHz band based on machine lea...
Victor Asanza
?
BOSC 2008 Biopython
BOSC 2008 BiopythonBOSC 2008 Biopython
BOSC 2008 Biopython
tiago
?
Towards reading genomic data using deep learning-driven NLP techniques
Towards reading genomic data using deep learning-driven NLP techniquesTowards reading genomic data using deep learning-driven NLP techniques
Towards reading genomic data using deep learning-driven NLP techniques
Wesley De Neve
?
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Machine Learning and Stochastic Geometry: Statistical Frameworks Against Unce...
Koji Yamamoto
?
Tech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflowTech day ngobrol santai tensorflow
Tech day ngobrol santai tensorflow
Ramdhan Rizki
?
Antao Biopython Bosc2008
Antao Biopython Bosc2008Antao Biopython Bosc2008
Antao Biopython Bosc2008
bosc_2008
?
Inria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCCInria Tech Talk - La classification de données complexes avec MASSICCC
Inria Tech Talk - La classification de données complexes avec MASSICCC
Stéphanie Roger
?
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Αν?λυση Δικτ?ων με το NetworkX τη? Python: Μια προκαταρκτικ? (αλλ? ημιτελ?? ω...
Moses Boudourides
?
Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...Design and Implementation of Test Vector Generation using Random Forest Techn...
Design and Implementation of Test Vector Generation using Random Forest Techn...
IRJET Journal
?
HUG + Nomica: a scalable FPGA-based architecture for variant-calling
HUG + Nomica: a scalable FPGA-based architecture for variant-callingHUG + Nomica: a scalable FPGA-based architecture for variant-calling
HUG + Nomica: a scalable FPGA-based architecture for variant-calling
NECST Lab @ Politecnico di Milano
?
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdfChapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
Chapter 3_P2_MSI Lgic Circuit (Multiplexers).pdf
freepcgame75
?
第二周課程 Arduino介紹
第二周課程 Arduino介紹第二周課程 Arduino介紹
第二周課程 Arduino介紹
冠宇 陳
?
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
Intra-coding using non-linear prediction, KLT and Texture Synthesis: AV1 enco...
F?rderverein Technische Fakult?t
?
B Eng Final Year Project Presentation
B Eng Final Year Project PresentationB Eng Final Year Project Presentation
B Eng Final Year Project Presentation
jesujoseph
?

Recently uploaded (20)

Lecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptxLecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptx
humairafatima22
?
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
taqyed
?
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
rossanthonytan130
?
Class 3-Workforce profile updated P.pptx
Class 3-Workforce profile updated P.pptxClass 3-Workforce profile updated P.pptx
Class 3-Workforce profile updated P.pptx
angelananalucky
?
IFRS Finance Powerpoint ppt Finance D.pptx
IFRS Finance Powerpoint  ppt Finance D.pptxIFRS Finance Powerpoint  ppt Finance D.pptx
IFRS Finance Powerpoint ppt Finance D.pptx
amantiwari2091
?
Lesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptxLesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptx
1045858
?
data mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmglegdata mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmgleg
1052LaxmanrajS
?
Optimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with CalciteOptimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with Calcite
Stamatis Zampetakis
?
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptxvnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
deomom129
?
Cost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheetCost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheet
supreetk82004
?
Introduction to database and analysis software’s suitable for.pptx
Introduction to database and analysis software’s suitable for.pptxIntroduction to database and analysis software’s suitable for.pptx
Introduction to database and analysis software’s suitable for.pptx
nabinparajuli9
?
Stasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesiaStasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesia
fikrimanurung1
?
Stasiun kernel pabrik kelapa sawit indonesia
Stasiun kernel pabrik kelapa sawit indonesiaStasiun kernel pabrik kelapa sawit indonesia
Stasiun kernel pabrik kelapa sawit indonesia
fikrimanurung1
?
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
taqyed
?
MTC Supply Chain Management Strategy.pptx
MTC Supply Chain Management Strategy.pptxMTC Supply Chain Management Strategy.pptx
MTC Supply Chain Management Strategy.pptx
Rakshit Porwal
?
The Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability AnalyticsThe Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability Analytics
christophercamposus1
?
Boosting MySQL with Vector Search Scale22X 2025.pdf
Boosting MySQL with Vector Search Scale22X 2025.pdfBoosting MySQL with Vector Search Scale22X 2025.pdf
Boosting MySQL with Vector Search Scale22X 2025.pdf
Alkin Tezuysal
?
Introduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdfIntroduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdf
messagetome133
?
exampleexampleexampleexampleexampleexampleexampleexample
exampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexample
exampleexampleexampleexampleexampleexampleexampleexample
lembiczkat
?
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
taqyed
?
Lecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptxLecture-AI and Alogor Parallel Aglorithms.pptx
Lecture-AI and Alogor Parallel Aglorithms.pptx
humairafatima22
?
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
加拿大成绩单购买原版(鲍颁毕业证书)卡尔加里大学毕业证文凭
taqyed
?
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx643663189-Q4W3-Synthesize-Information-1-pptx.pptx
643663189-Q4W3-Synthesize-Information-1-pptx.pptx
rossanthonytan130
?
Class 3-Workforce profile updated P.pptx
Class 3-Workforce profile updated P.pptxClass 3-Workforce profile updated P.pptx
Class 3-Workforce profile updated P.pptx
angelananalucky
?
IFRS Finance Powerpoint ppt Finance D.pptx
IFRS Finance Powerpoint  ppt Finance D.pptxIFRS Finance Powerpoint  ppt Finance D.pptx
IFRS Finance Powerpoint ppt Finance D.pptx
amantiwari2091
?
Lesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptxLesson 9- Data Governance and Ethics.pptx
Lesson 9- Data Governance and Ethics.pptx
1045858
?
data mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmglegdata mining tools.pptxvdvjdggmgmgelmgleg
data mining tools.pptxvdvjdggmgmgelmgleg
1052LaxmanrajS
?
Optimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with CalciteOptimizing Common Table Expressions in Apache Hive with Calcite
Optimizing Common Table Expressions in Apache Hive with Calcite
Stamatis Zampetakis
?
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptxvnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
vnptloveeeeeeeeeeeeeeeeeeeeeeeeeeee.pptx
deomom129
?
Cost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheetCost sheet. with basics and formats of sheet
Cost sheet. with basics and formats of sheet
supreetk82004
?
Introduction to database and analysis software’s suitable for.pptx
Introduction to database and analysis software’s suitable for.pptxIntroduction to database and analysis software’s suitable for.pptx
Introduction to database and analysis software’s suitable for.pptx
nabinparajuli9
?
Stasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesiaStasiun kernel pengolahan kelapa sawit indonesia
Stasiun kernel pengolahan kelapa sawit indonesia
fikrimanurung1
?
Stasiun kernel pabrik kelapa sawit indonesia
Stasiun kernel pabrik kelapa sawit indonesiaStasiun kernel pabrik kelapa sawit indonesia
Stasiun kernel pabrik kelapa sawit indonesia
fikrimanurung1
?
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
加拿大成绩单购买原版(顿补濒毕业证书)戴尔豪斯大学毕业证文凭
taqyed
?
MTC Supply Chain Management Strategy.pptx
MTC Supply Chain Management Strategy.pptxMTC Supply Chain Management Strategy.pptx
MTC Supply Chain Management Strategy.pptx
Rakshit Porwal
?
The Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability AnalyticsThe Role of Christopher Campos Orlando in Sustainability Analytics
The Role of Christopher Campos Orlando in Sustainability Analytics
christophercamposus1
?
Boosting MySQL with Vector Search Scale22X 2025.pdf
Boosting MySQL with Vector Search Scale22X 2025.pdfBoosting MySQL with Vector Search Scale22X 2025.pdf
Boosting MySQL with Vector Search Scale22X 2025.pdf
Alkin Tezuysal
?
Introduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdfIntroduction Lecture 01 Data Science.pdf
Introduction Lecture 01 Data Science.pdf
messagetome133
?
exampleexampleexampleexampleexampleexampleexampleexample
exampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexampleexample
exampleexampleexampleexampleexampleexampleexampleexample
lembiczkat
?
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
原版复刻加拿大多伦多大学成绩单(UTSG毕业证书) 文凭
taqyed
?

Ppam presentatie

  • 1. Department of Information Technology (INTEC), Ghent University, Belgium A Parallel, Distributed-Memory Framework for Comparative Motif Discovery 1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester, 1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier email: jan.fostier@intec.ugent.be 1Department of Information Technology (INTEC), Ghent University, Belgium 2Department of Plant Systems Biology, VIB, Ghent, Belgium 3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
  • 2. Background (1) Department of Information Technology (INTEC), Ghent University, Belgium 2 CCACGTG gene promoter transcription factor protein regulate
  • 3. Outline Department of Information Technology (INTEC), Ghent University, Belgium 3 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 4. Gene families Department of Information Technology (INTEC), Ghent University, Belgium 4 zma sbi bdi osa DNA sequences orthologous genes (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012) TSS 2kbp promoter gene family 1 TSS 2kbp promoter gene family 17724 TSS 2kbp promoter gene family 2 …
  • 5. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 5 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 17.20 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 6. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 6 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 64.52 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 7. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 7 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 100 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 8. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 8 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 0 % 2kbp promoter Scoring conservation within a gene family motif occurrence
  • 9. Genomewide scoring (1) Department of Information Technology (INTEC), Ghent University, Belgium 9 2kbp promoter gene family 1 2kbp promoter gene family 17724 2kbp promoter gene family 2 … Genome wide scoring of conservation BLS = 64,52 % BLS = 17,20 % BLS = 72,72 % > BLSthres ? > BLSthres ? > BLSthres ? yes no yes Fw = # families with BLS > BLSthres
  • 10. Genomewide scoring (2) Department of Information Technology (INTEC), Ghent University, Belgium 10 Genome wide scoring of conservation (2) Fw = # families with BLS > BLSthres for word w Fbg = median # families with BLS > BLSthres for random permutations of w C = 1 ? Fbg Fw confidence score Retain only motifs with confidence score C > 0.9 Stark et al., Genome Res. 2007
  • 11. Example Department of Information Technology (INTEC), Ghent University, Belgium 11 ACCCGGT 201 195 139 46 10 6 TCCAGCG 158 100 95 28 13 12 CGATCCG 204 191 185 100 46 11 CGCTGAC 246 203 145 39 23 12 GTGCACC 0 0 0 0 0 0 … >40 % > 50% > 60% > 70% > 80% >90% CCACGTG 578 545 423 224 135 116 C = 1 ? Fbg Fw = 1 - 201 578 = 0.64 1000 random permutations of w BLS threshold
  • 12. Outline Department of Information Technology (INTEC), Ghent University, Belgium 12 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 13. Motif discovery Department of Information Technology (INTEC), Ghent University, Belgium 13 Main idea: extend previous ideas to all words that occur in the sequences
  • 14. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 14 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2
  • 15. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 15 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2
  • 16. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 16 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2
  • 17. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 17 GC AT C G T G TAT TAT C C AT G TATGC GC Generalized suffix tree (GST) Example sequences: S1 = ATGTAT S2 = TTATGC
  • 18. Actual discovery Department of Information Technology (INTEC), Ghent University, Belgium 18 GC AT C G T G TAT TAT C C AT G TATGC GC word seq1 seq2 A y y AT y y ATG y y ATGC - y ATGT y - ATGTA y - ATGTAT y - C - y G y y GC - y … Depth-first walk in the tree Can be extended to IUPAC alphabet (Sagot, 2001)
  • 19. Discovery of words Department of Information Technology (INTEC), Ghent University, Belgium 19 Single gene family, words with length [6 … 12], max 3 deg. chars 0 50 100 150 200 250 300 350 400 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 ACGT ACGT+N IUPAC {BDHV} Full IUPAC Time(s) Numberofwords Number of words Time
  • 20. Outline Department of Information Technology (INTEC), Ghent University, Belgium 20 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 21. Parallel discovery Department of Information Technology (INTEC), Ghent University, Belgium 21 Intra-family phase Process 1 Process P w conserved in 3/4 organisms BLS score for w: 73% w’ conserved in 2/4 organisms BLS score for w’: 23% gene family x gene family y local hash table local hash table...
  • 22. Communication phase Department of Information Technology (INTEC), Ghent University, Belgium 22 Communication phase ... Core idea: words with the same composition are sent to the same processes e.g. CACGTG and CCATGG Process 1 Process P
  • 23. Statistical evaluation Department of Information Technology (INTEC), Ghent University, Belgium 23 Inter-family phase aggregated hash table aggregated hash table statistical evaluation … CACGTG: significant ACCGGT: not significant TGGCAC: not significant … statistical evaluation … TGANTGA: significant GTATAGN: not significant GANATTG: not significant … ...
  • 24. MapReduce pseudocode Department of Information Technology (INTEC), Ghent University, Belgium 24 GC AT C G T G TAT TAT C C AT G TATGC GC class Mapper method Map(familyID id, family F) GST = createGeneralizedSuffixTree(F); for all word w ? GST do BLS = computeBLS(w) f = computeFrequencyVector(BLS) Emit(sort(w), pair(w, f)) w = CCACGTG BLS = 62% f = [ 1 | 1 | 1 | 0 | 0 | 0 ] Emit(ACCCGGT, <CCACGTG, [111000]>)
  • 25. MapReduce pseudocode Department of Information Technology (INTEC), Ghent University, Belgium 25 class Reducer method Reduce(word s, pairs[<w1,f1>, <w2,f2>, …]) H = new AssociativeArray for all <w,f> ? [<w1,f1>, <w2,f2>, …] do H{w} = H{w} + f BGM = computeBackgroundModel(H) for all w ? H C = computeConfidence(w, H{w}, BGM) if (C > 0.9) Emit(w, H{w}) ACCCGGT 201 195 139 46 10 6 TCCAGCG 158 100 95 28 13 12 CGATCCG 204 191 185 100 46 11 CGCTGAC 246 203 145 39 23 12 GTGCACC 0 0 0 0 0 0 … H C = 0.75 C = 0.60 C = 0.91 C = 0.93 C = 0.75
  • 26. Results ? Dataset description ? 4 organisms (zma, sbi, osa, bdi) ? 17639 gene families ? Alphabet: IUPAC {BDHV} alphabet with length [6 ... 12] ? maximum of 3 degenerate characters ? 20 instances on AWS (19 workers type m1.xlarge) ? Map time: 20h 50 min ? Reduce time: 12h 38 min ? Cost price of 421.6$ ? Number of <k, v> emitted by mappers: 531 835 669 220 ? Map output bytes = 13 827 727 399 720 ? 26 bytes per kv ? Output materialized: 4 019 787 705 377 = 3.65 TByte ? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr) ? Number of <k, v> emitted by reducers: 620 294 857 Department of Information Technology (INTEC), Ghent University, Belgium 26
  • 27. Where do we go from here? ? Output data is still big (~ 50 Gbyte of data) ? Redundancy present in output ? Clustering algorithms ? Mapping words back to the promoter sequences ? Filtering output is easy ? Subsets with higher confidence scores ? Subsets with specific BLS scores ? Subsets which match specific motif – Worked example: KN1 motif Department of Information Technology (INTEC), Ghent University, Belgium 27 We need further post-processing
  • 28. Knotted 1 motif Department of Information Technology (INTEC), Ghent University, Belgium 28 Motif BLS min Confidence PWM score # target predicted CHiP-seq overlap TGACTGACTGAC 15 100 13.94 5 3 TGACKGACTGAC 15 100 13.69 5 3 TGACKGAYTGAC 15 100 13.69 5 3 TGACTGACKGAC 15 100 13.45 5 3 TGACTGAYKGAC 15 100 13.45 5 3 TGACTGACWGAC 15 100 13.25 5 3 TGACTGAYWGAC 15 100 13.25 5 3 TGACKGACKGAC 15 100 13.20 6 3 TGACKGACWGAC 15 100 13.00 5 3 TGACWGACTGAC 15 100 12.33 7 4 TGACWGAYTGAC 15 100 12.33 8 4 … … … … … … Bolduc et al. 2011
  • 29. Conclusion ? We developed… ? A Parallel Framework for Motif Discovery ? Based on Phylogenetic Footprinting ? MPI implementation (memory issues) ? MapReduce implementation (scales well) ? Key advantages… ? Runs in the cloud ? Exhaustively explores search space (IUPAC) ? Alignment-free (unique feature) ? Flexible framework ? Future work… ? Development of post-processing tools ? Likely also cloud-based Department of Information Technology (INTEC), Ghent University, Belgium 29
  • 30. Department of Information Technology – Internet Based Communication Networks and Services (IBCN) Questions ? Jan Fostier jan.fostier@intec.ugent.be Department of Information Technology (INTEC) Ghent University