This document describes a parallel, distributed-memory framework for comparative motif discovery. The framework uses a MapReduce implementation to exhaustively explore sequence words in a gene family dataset from multiple organisms. Keywords are enumerated using a generalized suffix tree and scored for conservation across organisms. Significant motifs are identified based on branch length scores and statistical evaluation in comparison to random permutations. The framework scales well using cloud computing resources. Further work is needed on post-processing and filtering of motif output.
1 of 30
Download to read offline
More Related Content
Ppam presentatie
1. Department of Information Technology (INTEC), Ghent University, Belgium
A Parallel, Distributed-Memory Framework
for Comparative Motif Discovery
1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester,
1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier
email: jan.fostier@intec.ugent.be
1Department of Information Technology (INTEC), Ghent University, Belgium
2Department of Plant Systems Biology, VIB, Ghent, Belgium
3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
2. Background (1)
Department of Information Technology (INTEC), Ghent University, Belgium 2
CCACGTG
gene
promoter
transcription
factor
protein
regulate
3. Outline
Department of Information Technology (INTEC), Ghent University, Belgium 3
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
4. Gene families
Department of Information Technology (INTEC), Ghent University, Belgium 4
zma
sbi
bdi
osa
DNA sequences
orthologous genes (Van Bel et al., Plaza 2.5, Plant Phys. 2012)
TSS
2kbp promoter
gene family 1
TSS
2kbp promoter
gene family 17724
TSS
2kbp promoter
gene family 2
5. Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 5
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 17.20 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
6. Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 6
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 64.52 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
7. Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 7
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 100 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
8. Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 8
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 0 %
2kbp promoter
Scoring conservation within a gene family
motif occurrence
9. Genomewide scoring (1)
Department of Information Technology (INTEC), Ghent University, Belgium 9
2kbp promoter
gene family 1
2kbp promoter
gene family 17724
2kbp promoter
gene family 2
Genome wide scoring of conservation
BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %
> BLSthres ? > BLSthres ? > BLSthres ?
yes no yes
Fw = # families with BLS > BLSthres
10. Genomewide scoring (2)
Department of Information Technology (INTEC), Ghent University, Belgium 10
Genome wide scoring of conservation (2)
Fw = # families with BLS > BLSthres for word w
Fbg = median # families with BLS > BLSthres for random permutations of w
C = 1 ?
Fbg
Fw
confidence score
Retain only motifs with confidence score C > 0.9
Stark et al., Genome Res. 2007
12. Outline
Department of Information Technology (INTEC), Ghent University, Belgium 12
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
13. Motif discovery
Department of Information Technology (INTEC), Ghent University, Belgium 13
Main idea: extend previous ideas
to all words that occur in the sequences
14. Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 14
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
15. Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 15
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
16. Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 16
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
17. Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 17
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT
S2 = TTATGC
18. Actual discovery
Department of Information Technology (INTEC), Ghent University, Belgium 18
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
word seq1 seq2
A y y
AT y y
ATG y y
ATGC - y
ATGT y -
ATGTA y -
ATGTAT y -
C - y
G y y
GC - y
Depth-first walk in the tree
Can be extended to IUPAC alphabet (Sagot, 2001)
19. Discovery of words
Department of Information Technology (INTEC), Ghent University, Belgium 19
Single gene family, words with length [6 12], max 3 deg. chars
0
50
100
150
200
250
300
350
400
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
ACGT ACGT+N IUPAC {BDHV} Full IUPAC
Time(s)
Numberofwords
Number of words Time
20. Outline
Department of Information Technology (INTEC), Ghent University, Belgium 20
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
21. Parallel discovery
Department of Information Technology (INTEC), Ghent University, Belgium 21
Intra-family phase
Process 1 Process P
w conserved in
3/4 organisms
BLS score
for w: 73% w conserved in
2/4 organisms
BLS score
for w: 23%
gene family x gene family y
local hash table local hash table...
22. Communication phase
Department of Information Technology (INTEC), Ghent University, Belgium 22
Communication
phase
...
Core idea: words with the same composition are sent to the same processes
e.g. CACGTG and CCATGG
Process 1 Process P
23. Statistical evaluation
Department of Information Technology (INTEC), Ghent University, Belgium 23
Inter-family phase
aggregated hash table aggregated hash table
statistical evaluation
CACGTG: significant
ACCGGT: not significant
TGGCAC: not significant
statistical evaluation
TGANTGA: significant
GTATAGN: not significant
GANATTG: not significant
...
24. MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 24
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
class Mapper
method Map(familyID id, family F)
GST = createGeneralizedSuffixTree(F);
for all word w ? GST do
BLS = computeBLS(w)
f = computeFrequencyVector(BLS)
Emit(sort(w), pair(w, f))
w = CCACGTG
BLS = 62%
f = [ 1 | 1 | 1 | 0 | 0 | 0 ]
Emit(ACCCGGT, <CCACGTG, [111000]>)
25. MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 25
class Reducer
method Reduce(word s, pairs[<w1,f1>, <w2,f2>, ])
H = new AssociativeArray
for all <w,f> ? [<w1,f1>, <w2,f2>, ] do
H{w} = H{w} + f
BGM = computeBackgroundModel(H)
for all w ? H
C = computeConfidence(w, H{w}, BGM)
if (C > 0.9) Emit(w, H{w})
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0
H C = 0.75
C = 0.60
C = 0.91
C = 0.93
C = 0.75
26. Results
? Dataset description
? 4 organisms (zma, sbi, osa, bdi)
? 17639 gene families
? Alphabet: IUPAC {BDHV} alphabet with length [6 ... 12]
? maximum of 3 degenerate characters
? 20 instances on AWS (19 workers type m1.xlarge)
? Map time: 20h 50 min
? Reduce time: 12h 38 min
? Cost price of 421.6$
? Number of <k, v> emitted by mappers: 531 835 669 220
? Map output bytes = 13 827 727 399 720
? 26 bytes per kv
? Output materialized: 4 019 787 705 377 = 3.65 TByte
? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)
? Number of <k, v> emitted by reducers: 620 294 857
Department of Information Technology (INTEC), Ghent University, Belgium 26
27. Where do we go from here?
? Output data is still big (~ 50 Gbyte of data)
? Redundancy present in output
? Clustering algorithms
? Mapping words back to the promoter sequences
? Filtering output is easy
? Subsets with higher confidence scores
? Subsets with specific BLS scores
? Subsets which match specific motif
C Worked example: KN1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 27
We need further post-processing
29. Conclusion
? We developed
? A Parallel Framework for Motif Discovery
? Based on Phylogenetic Footprinting
? MPI implementation (memory issues)
? MapReduce implementation (scales well)
? Key advantages
? Runs in the cloud
? Exhaustively explores search space (IUPAC)
? Alignment-free (unique feature)
? Flexible framework
? Future work
? Development of post-processing tools
? Likely also cloud-based
Department of Information Technology (INTEC), Ghent University, Belgium 29
30. Department of Information Technology C Internet Based Communication Networks and Services (IBCN)
Questions ?
Jan Fostier
jan.fostier@intec.ugent.be
Department of Information Technology (INTEC)
Ghent University