狠狠撸

Department of Information Technology (INTEC), Ghent University, Belgium
A Parallel, Distributed-Memory Framework
for Comparative Motif Discovery
1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester,
1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier
email: jan.fostier@intec.ugent.be
1Department of Information Technology (INTEC), Ghent University, Belgium
2Department of Plant Systems Biology, VIB, Ghent, Belgium
3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium

Background (1)
Department of Information Technology (INTEC), Ghent University, Belgium 2
CCACGTG
gene
promoter
transcription
factor
protein
regulate

Outline
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework

Gene families
zma
sbi
bdi
osa
DNA sequences
orthologous genes (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012)
TSS
2kbp promoter
gene family 1
TSS
2kbp promoter
gene family 17724
TSS
2kbp promoter
gene family 2
…

Branch Length Score (BLS)
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 17.20 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST

zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 64.52 %
2kbp promoter
motif occurrences
MST

zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 100 %
2kbp promoter
motif occurrences
MST

zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 0 %
2kbp promoter
motif occurrence

Genomewide scoring (1)
2kbp promoter
gene family 1
2kbp promoter
gene family 17724
2kbp promoter
gene family 2
…
Genome wide scoring of conservation
BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %
> BLSthres ? > BLSthres ? > BLSthres ?
yes no yes
Fw = # families with BLS > BLSthres

Genomewide scoring (2)
Genome wide scoring of conservation (2)
Fw = # families with BLS > BLSthres for word w
Fbg = median # families with BLS > BLSthres for random permutations of w
C = 1 ?
Fbg
Fw
confidence score
Retain only motifs with confidence score C > 0.9
Stark et al., Genome Res. 2007

Example
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
>40 % > 50% > 60% > 70% > 80% >90%
CCACGTG 578 545 423 224 135 116
C = 1 ?
Fbg
Fw
= 1 -
201
578
= 0.64
1000 random permutations of w
BLS threshold

Outline
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework

Motif discovery
Main idea: extend previous ideas
to all words that occur in the sequences

Enumerating all words
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2

GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2

GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
Example sequences:
S1 = ATGTAT
S2 = TTATGC

Actual discovery
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
word seq1 seq2
A y y
AT y y
ATG y y
ATGC - y
ATGT y -
ATGTA y -
ATGTAT y -
C - y
G y y
GC - y
…
Depth-first walk in the tree
Can be extended to IUPAC alphabet (Sagot, 2001)

Discovery of words
Single gene family, words with length [6 … 12], max 3 deg. chars
0
50
100
150
200
250
300
350
400
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
ACGT ACGT+N IUPAC {BDHV} Full IUPAC
Time(s)
Numberofwords
Number of words Time

Outline
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework

Parallel discovery
Intra-family phase
Process 1 Process P
w conserved in
3/4 organisms
BLS score
for w: 73% w’ conserved in
2/4 organisms
BLS score
for w’: 23%
gene family x gene family y
local hash table local hash table...

Communication phase
Communication
phase
...
Core idea: words with the same composition are sent to the same processes
e.g. CACGTG and CCATGG
Process 1 Process P

Statistical evaluation
Inter-family phase
aggregated hash table aggregated hash table
statistical evaluation
…
CACGTG: significant
ACCGGT: not significant
TGGCAC: not significant
…
statistical evaluation
…
TGANTGA: significant
GTATAGN: not significant
GANATTG: not significant
…
...

MapReduce pseudocode
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
class Mapper
method Map(familyID id, family F)
GST = createGeneralizedSuffixTree(F);
for all word w ? GST do
BLS = computeBLS(w)
f = computeFrequencyVector(BLS)
Emit(sort(w), pair(w, f))
w = CCACGTG
BLS = 62%
f = [ 1 | 1 | 1 | 0 | 0 | 0 ]
Emit(ACCCGGT, <CCACGTG, [111000]>)

MapReduce pseudocode
class Reducer
method Reduce(word s, pairs[<w1,f1>, <w2,f2>, …])
H = new AssociativeArray
for all <w,f> ? [<w1,f1>, <w2,f2>, …] do
H{w} = H{w} + f
BGM = computeBackgroundModel(H)
for all w ? H
C = computeConfidence(w, H{w}, BGM)
if (C > 0.9) Emit(w, H{w})
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
…
H C = 0.75
C = 0.60
C = 0.91
C = 0.93
C = 0.75

Results
? Dataset description
? 4 organisms (zma, sbi, osa, bdi)
? 17639 gene families
? Alphabet: IUPAC {BDHV} alphabet with length [6 ... 12]
? maximum of 3 degenerate characters
? 20 instances on AWS (19 workers type m1.xlarge)
? Map time: 20h 50 min
? Reduce time: 12h 38 min
? Cost price of 421.6$
? Number of <k, v> emitted by mappers: 531 835 669 220
? Map output bytes = 13 827 727 399 720
? 26 bytes per kv
? Output materialized: 4 019 787 705 377 = 3.65 TByte
? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)
? Number of <k, v> emitted by reducers: 620 294 857

Where do we go from here?
? Output data is still big (~ 50 Gbyte of data)
? Redundancy present in output
? Clustering algorithms
? Mapping words back to the promoter sequences
? Filtering output is easy
? Subsets with higher confidence scores
? Subsets with specific BLS scores
? Subsets which match specific motif
– Worked example: KN1 motif
We need further post-processing

Knotted 1 motif
Motif BLS min Confidence PWM score # target predicted
CHiP-seq
overlap
TGACTGACTGAC 15 100 13.94 5 3
TGACKGACTGAC 15 100 13.69 5 3
TGACKGAYTGAC 15 100 13.69 5 3
TGACTGACKGAC 15 100 13.45 5 3
TGACTGAYKGAC 15 100 13.45 5 3
TGACTGACWGAC 15 100 13.25 5 3
TGACTGAYWGAC 15 100 13.25 5 3
TGACKGACKGAC 15 100 13.20 6 3
TGACKGACWGAC 15 100 13.00 5 3
TGACWGACTGAC 15 100 12.33 7 4
TGACWGAYTGAC 15 100 12.33 8 4
… … … … … …
Bolduc et al. 2011

Conclusion
? We developed…
? A Parallel Framework for Motif Discovery
? Based on Phylogenetic Footprinting
? MPI implementation (memory issues)
? MapReduce implementation (scales well)
? Key advantages…
? Runs in the cloud
? Exhaustively explores search space (IUPAC)
? Alignment-free (unique feature)
? Flexible framework
? Future work…
? Development of post-processing tools
? Likely also cloud-based

Department of Information Technology – Internet Based Communication Networks and Services (IBCN)
Questions ?
Jan Fostier
jan.fostier@intec.ugent.be
Department of Information Technology (INTEC)
Ghent University

狠狠撸

Ppam presentatie

Recommended

More Related Content

Similar to Ppam presentatie (20)

Recently uploaded (20)

Ppam presentatie