ݺߣ

ݺߣShare a Scribd company logo
Department of Information Technology (INTEC), Ghent University, Belgium
A Parallel, Distributed-Memory Framework
for Comparative Motif Discovery
1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester,
1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier
email: jan.fostier@intec.ugent.be
1Department of Information Technology (INTEC), Ghent University, Belgium
2Department of Plant Systems Biology, VIB, Ghent, Belgium
3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
Background (1)
Department of Information Technology (INTEC), Ghent University, Belgium 2
CCACGTG
gene
promoter
transcription
factor
protein
regulate
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 3
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Gene families
Department of Information Technology (INTEC), Ghent University, Belgium 4
zma
sbi
bdi
osa
DNA sequences
orthologous genes (Van Bel et al., Plaza 2.5, Plant Phys. 2012)
TSS
2kbp promoter
gene family 1
TSS
2kbp promoter
gene family 17724
TSS
2kbp promoter
gene family 2
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 5
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 17.20 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 6
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 64.52 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 7
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 100 %
2kbp promoter
Scoring conservation within a gene family
motif occurrences
MST
Branch Length Score (BLS)
Department of Information Technology (INTEC), Ghent University, Belgium 8
zma
sbi
osa
bdi
23,66%
5,38%
26,88%
8,60%
Branch Length Score (BLS) 0 %
2kbp promoter
Scoring conservation within a gene family
motif occurrence
Genomewide scoring (1)
Department of Information Technology (INTEC), Ghent University, Belgium 9
2kbp promoter
gene family 1
2kbp promoter
gene family 17724
2kbp promoter
gene family 2

Genome wide scoring of conservation
BLS = 64,52 % BLS = 17,20 % BLS = 72,72 %
> BLSthres ? > BLSthres ? > BLSthres ?
yes no yes
Fw = # families with BLS > BLSthres
Genomewide scoring (2)
Department of Information Technology (INTEC), Ghent University, Belgium 10
Genome wide scoring of conservation (2)
Fw = # families with BLS > BLSthres for word w
Fbg = median # families with BLS > BLSthres for random permutations of w
C = 1 ?
Fbg
Fw
confidence score
Retain only motifs with confidence score C > 0.9
Stark et al., Genome Res. 2007
Example
Department of Information Technology (INTEC), Ghent University, Belgium 11
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0

>40 % > 50% > 60% > 70% > 80% >90%
CCACGTG 578 545 423 224 135 116
C = 1 ?
Fbg
Fw
= 1 -
201
578
= 0.64
1000 random permutations of w
BLS threshold
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 12
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Motif discovery
Department of Information Technology (INTEC), Ghent University, Belgium 13
Main idea: extend previous ideas
to all words that occur in the sequences
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 14
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 15
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 16
GC$2
AT C$2
G
T
G
TAT$1
TAT$1$1
C$2
C$2 $1
AT G TATGC$2
GC$2$1 $1
$1
$2
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT$1
S2 = TTATGC$2
Enumerating all words
Department of Information Technology (INTEC), Ghent University, Belgium 17
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
Generalized suffix tree (GST)
Example sequences:
S1 = ATGTAT
S2 = TTATGC
Actual discovery
Department of Information Technology (INTEC), Ghent University, Belgium 18
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
word seq1 seq2
A y y
AT y y
ATG y y
ATGC - y
ATGT y -
ATGTA y -
ATGTAT y -
C - y
G y y
GC - y

Depth-first walk in the tree
Can be extended to IUPAC alphabet (Sagot, 2001)
Discovery of words
Department of Information Technology (INTEC), Ghent University, Belgium 19
Single gene family, words with length [6  12], max 3 deg. chars
0
50
100
150
200
250
300
350
400
1.E+00
1.E+01
1.E+02
1.E+03
1.E+04
1.E+05
1.E+06
1.E+07
1.E+08
1.E+09
ACGT ACGT+N IUPAC  {BDHV} Full IUPAC
Time(s)
Numberofwords
Number of words Time
Outline
Department of Information Technology (INTEC), Ghent University, Belgium 20
1) motif
target
prediction
2) motif
discovery
algorithm
3) parallel
framework
Parallel, comparative motif discovery framework
Parallel discovery
Department of Information Technology (INTEC), Ghent University, Belgium 21
Intra-family phase
Process 1 Process P
w conserved in
3/4 organisms
BLS score
for w: 73% w conserved in
2/4 organisms
BLS score
for w: 23%
gene family x gene family y
local hash table local hash table...
Communication phase
Department of Information Technology (INTEC), Ghent University, Belgium 22
Communication
phase
...
Core idea: words with the same composition are sent to the same processes
e.g. CACGTG and CCATGG
Process 1 Process P
Statistical evaluation
Department of Information Technology (INTEC), Ghent University, Belgium 23
Inter-family phase
aggregated hash table aggregated hash table
statistical evaluation

CACGTG: significant
ACCGGT: not significant
TGGCAC: not significant

statistical evaluation

TGANTGA: significant
GTATAGN: not significant
GANATTG: not significant

...
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 24
GC
AT C
G
T
G
TAT
TAT
C
C AT G TATGC
GC
class Mapper
method Map(familyID id, family F)
GST = createGeneralizedSuffixTree(F);
for all word w ? GST do
BLS = computeBLS(w)
f = computeFrequencyVector(BLS)
Emit(sort(w), pair(w, f))
w = CCACGTG
BLS = 62%
f = [ 1 | 1 | 1 | 0 | 0 | 0 ]
Emit(ACCCGGT, <CCACGTG, [111000]>)
MapReduce pseudocode
Department of Information Technology (INTEC), Ghent University, Belgium 25
class Reducer
method Reduce(word s, pairs[<w1,f1>, <w2,f2>, ])
H = new AssociativeArray
for all <w,f> ? [<w1,f1>, <w2,f2>, ] do
H{w} = H{w} + f
BGM = computeBackgroundModel(H)
for all w ? H
C = computeConfidence(w, H{w}, BGM)
if (C > 0.9) Emit(w, H{w})
ACCCGGT 201 195 139 46 10 6
TCCAGCG 158 100 95 28 13 12
CGATCCG 204 191 185 100 46 11
CGCTGAC 246 203 145 39 23 12
GTGCACC 0 0 0 0 0 0

H C = 0.75
C = 0.60
C = 0.91
C = 0.93
C = 0.75
Results
? Dataset description
? 4 organisms (zma, sbi, osa, bdi)
? 17639 gene families
? Alphabet: IUPAC  {BDHV} alphabet with length [6 ... 12]
? maximum of 3 degenerate characters
? 20 instances on AWS (19 workers type m1.xlarge)
? Map time: 20h 50 min
? Reduce time: 12h 38 min
? Cost price of 421.6$
? Number of <k, v> emitted by mappers: 531 835 669 220
? Map output bytes = 13 827 727 399 720
? 26 bytes per kv
? Output materialized: 4 019 787 705 377 = 3.65 TByte
? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr)
? Number of <k, v> emitted by reducers: 620 294 857
Department of Information Technology (INTEC), Ghent University, Belgium 26
Where do we go from here?
? Output data is still big (~ 50 Gbyte of data)
? Redundancy present in output
? Clustering algorithms
? Mapping words back to the promoter sequences
? Filtering output is easy
? Subsets with higher confidence scores
? Subsets with specific BLS scores
? Subsets which match specific motif
C Worked example: KN1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 27
We need further post-processing
Knotted 1 motif
Department of Information Technology (INTEC), Ghent University, Belgium 28
Motif BLS min Confidence PWM score # target predicted
CHiP-seq
overlap
TGACTGACTGAC 15 100 13.94 5 3
TGACKGACTGAC 15 100 13.69 5 3
TGACKGAYTGAC 15 100 13.69 5 3
TGACTGACKGAC 15 100 13.45 5 3
TGACTGAYKGAC 15 100 13.45 5 3
TGACTGACWGAC 15 100 13.25 5 3
TGACTGAYWGAC 15 100 13.25 5 3
TGACKGACKGAC 15 100 13.20 6 3
TGACKGACWGAC 15 100 13.00 5 3
TGACWGACTGAC 15 100 12.33 7 4
TGACWGAYTGAC 15 100 12.33 8 4
     
Bolduc et al. 2011
Conclusion
? We developed
? A Parallel Framework for Motif Discovery
? Based on Phylogenetic Footprinting
? MPI implementation (memory issues)
? MapReduce implementation (scales well)
? Key advantages
? Runs in the cloud
? Exhaustively explores search space (IUPAC)
? Alignment-free (unique feature)
? Flexible framework
? Future work
? Development of post-processing tools
? Likely also cloud-based
Department of Information Technology (INTEC), Ghent University, Belgium 29
Department of Information Technology C Internet Based Communication Networks and Services (IBCN)
Questions ?
Jan Fostier
jan.fostier@intec.ugent.be
Department of Information Technology (INTEC)
Ghent University

More Related Content

Ppam presentatie

  • 1. Department of Information Technology (INTEC), Ghent University, Belgium A Parallel, Distributed-Memory Framework for Comparative Motif Discovery 1Dieter De Witte, 2,3Michiel Van Bel, 1Pieter Audenaert, 1Piet Demeester, 1Bart Dhoedt, 2,3Klaas Vandepoele, and 1Jan Fostier email: jan.fostier@intec.ugent.be 1Department of Information Technology (INTEC), Ghent University, Belgium 2Department of Plant Systems Biology, VIB, Ghent, Belgium 3Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium
  • 2. Background (1) Department of Information Technology (INTEC), Ghent University, Belgium 2 CCACGTG gene promoter transcription factor protein regulate
  • 3. Outline Department of Information Technology (INTEC), Ghent University, Belgium 3 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 4. Gene families Department of Information Technology (INTEC), Ghent University, Belgium 4 zma sbi bdi osa DNA sequences orthologous genes (Van Bel et al., Plaza 2.5, Plant Phys. 2012) TSS 2kbp promoter gene family 1 TSS 2kbp promoter gene family 17724 TSS 2kbp promoter gene family 2
  • 5. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 5 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 17.20 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 6. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 6 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 64.52 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 7. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 7 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 100 % 2kbp promoter Scoring conservation within a gene family motif occurrences MST
  • 8. Branch Length Score (BLS) Department of Information Technology (INTEC), Ghent University, Belgium 8 zma sbi osa bdi 23,66% 5,38% 26,88% 8,60% Branch Length Score (BLS) 0 % 2kbp promoter Scoring conservation within a gene family motif occurrence
  • 9. Genomewide scoring (1) Department of Information Technology (INTEC), Ghent University, Belgium 9 2kbp promoter gene family 1 2kbp promoter gene family 17724 2kbp promoter gene family 2 Genome wide scoring of conservation BLS = 64,52 % BLS = 17,20 % BLS = 72,72 % > BLSthres ? > BLSthres ? > BLSthres ? yes no yes Fw = # families with BLS > BLSthres
  • 10. Genomewide scoring (2) Department of Information Technology (INTEC), Ghent University, Belgium 10 Genome wide scoring of conservation (2) Fw = # families with BLS > BLSthres for word w Fbg = median # families with BLS > BLSthres for random permutations of w C = 1 ? Fbg Fw confidence score Retain only motifs with confidence score C > 0.9 Stark et al., Genome Res. 2007
  • 11. Example Department of Information Technology (INTEC), Ghent University, Belgium 11 ACCCGGT 201 195 139 46 10 6 TCCAGCG 158 100 95 28 13 12 CGATCCG 204 191 185 100 46 11 CGCTGAC 246 203 145 39 23 12 GTGCACC 0 0 0 0 0 0 >40 % > 50% > 60% > 70% > 80% >90% CCACGTG 578 545 423 224 135 116 C = 1 ? Fbg Fw = 1 - 201 578 = 0.64 1000 random permutations of w BLS threshold
  • 12. Outline Department of Information Technology (INTEC), Ghent University, Belgium 12 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 13. Motif discovery Department of Information Technology (INTEC), Ghent University, Belgium 13 Main idea: extend previous ideas to all words that occur in the sequences
  • 14. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 14 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2
  • 15. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 15 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2
  • 16. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 16 GC$2 AT C$2 G T G TAT$1 TAT$1$1 C$2 C$2 $1 AT G TATGC$2 GC$2$1 $1 $1 $2 Generalized suffix tree (GST) Example sequences: S1 = ATGTAT$1 S2 = TTATGC$2
  • 17. Enumerating all words Department of Information Technology (INTEC), Ghent University, Belgium 17 GC AT C G T G TAT TAT C C AT G TATGC GC Generalized suffix tree (GST) Example sequences: S1 = ATGTAT S2 = TTATGC
  • 18. Actual discovery Department of Information Technology (INTEC), Ghent University, Belgium 18 GC AT C G T G TAT TAT C C AT G TATGC GC word seq1 seq2 A y y AT y y ATG y y ATGC - y ATGT y - ATGTA y - ATGTAT y - C - y G y y GC - y Depth-first walk in the tree Can be extended to IUPAC alphabet (Sagot, 2001)
  • 19. Discovery of words Department of Information Technology (INTEC), Ghent University, Belgium 19 Single gene family, words with length [6 12], max 3 deg. chars 0 50 100 150 200 250 300 350 400 1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09 ACGT ACGT+N IUPAC {BDHV} Full IUPAC Time(s) Numberofwords Number of words Time
  • 20. Outline Department of Information Technology (INTEC), Ghent University, Belgium 20 1) motif target prediction 2) motif discovery algorithm 3) parallel framework Parallel, comparative motif discovery framework
  • 21. Parallel discovery Department of Information Technology (INTEC), Ghent University, Belgium 21 Intra-family phase Process 1 Process P w conserved in 3/4 organisms BLS score for w: 73% w conserved in 2/4 organisms BLS score for w: 23% gene family x gene family y local hash table local hash table...
  • 22. Communication phase Department of Information Technology (INTEC), Ghent University, Belgium 22 Communication phase ... Core idea: words with the same composition are sent to the same processes e.g. CACGTG and CCATGG Process 1 Process P
  • 23. Statistical evaluation Department of Information Technology (INTEC), Ghent University, Belgium 23 Inter-family phase aggregated hash table aggregated hash table statistical evaluation CACGTG: significant ACCGGT: not significant TGGCAC: not significant statistical evaluation TGANTGA: significant GTATAGN: not significant GANATTG: not significant ...
  • 24. MapReduce pseudocode Department of Information Technology (INTEC), Ghent University, Belgium 24 GC AT C G T G TAT TAT C C AT G TATGC GC class Mapper method Map(familyID id, family F) GST = createGeneralizedSuffixTree(F); for all word w ? GST do BLS = computeBLS(w) f = computeFrequencyVector(BLS) Emit(sort(w), pair(w, f)) w = CCACGTG BLS = 62% f = [ 1 | 1 | 1 | 0 | 0 | 0 ] Emit(ACCCGGT, <CCACGTG, [111000]>)
  • 25. MapReduce pseudocode Department of Information Technology (INTEC), Ghent University, Belgium 25 class Reducer method Reduce(word s, pairs[<w1,f1>, <w2,f2>, ]) H = new AssociativeArray for all <w,f> ? [<w1,f1>, <w2,f2>, ] do H{w} = H{w} + f BGM = computeBackgroundModel(H) for all w ? H C = computeConfidence(w, H{w}, BGM) if (C > 0.9) Emit(w, H{w}) ACCCGGT 201 195 139 46 10 6 TCCAGCG 158 100 95 28 13 12 CGATCCG 204 191 185 100 46 11 CGCTGAC 246 203 145 39 23 12 GTGCACC 0 0 0 0 0 0 H C = 0.75 C = 0.60 C = 0.91 C = 0.93 C = 0.75
  • 26. Results ? Dataset description ? 4 organisms (zma, sbi, osa, bdi) ? 17639 gene families ? Alphabet: IUPAC {BDHV} alphabet with length [6 ... 12] ? maximum of 3 degenerate characters ? 20 instances on AWS (19 workers type m1.xlarge) ? Map time: 20h 50 min ? Reduce time: 12h 38 min ? Cost price of 421.6$ ? Number of <k, v> emitted by mappers: 531 835 669 220 ? Map output bytes = 13 827 727 399 720 ? 26 bytes per kv ? Output materialized: 4 019 787 705 377 = 3.65 TByte ? Average 7.55 byte per intermediate <k, v> pair (SnappyCompr) ? Number of <k, v> emitted by reducers: 620 294 857 Department of Information Technology (INTEC), Ghent University, Belgium 26
  • 27. Where do we go from here? ? Output data is still big (~ 50 Gbyte of data) ? Redundancy present in output ? Clustering algorithms ? Mapping words back to the promoter sequences ? Filtering output is easy ? Subsets with higher confidence scores ? Subsets with specific BLS scores ? Subsets which match specific motif C Worked example: KN1 motif Department of Information Technology (INTEC), Ghent University, Belgium 27 We need further post-processing
  • 28. Knotted 1 motif Department of Information Technology (INTEC), Ghent University, Belgium 28 Motif BLS min Confidence PWM score # target predicted CHiP-seq overlap TGACTGACTGAC 15 100 13.94 5 3 TGACKGACTGAC 15 100 13.69 5 3 TGACKGAYTGAC 15 100 13.69 5 3 TGACTGACKGAC 15 100 13.45 5 3 TGACTGAYKGAC 15 100 13.45 5 3 TGACTGACWGAC 15 100 13.25 5 3 TGACTGAYWGAC 15 100 13.25 5 3 TGACKGACKGAC 15 100 13.20 6 3 TGACKGACWGAC 15 100 13.00 5 3 TGACWGACTGAC 15 100 12.33 7 4 TGACWGAYTGAC 15 100 12.33 8 4 Bolduc et al. 2011
  • 29. Conclusion ? We developed ? A Parallel Framework for Motif Discovery ? Based on Phylogenetic Footprinting ? MPI implementation (memory issues) ? MapReduce implementation (scales well) ? Key advantages ? Runs in the cloud ? Exhaustively explores search space (IUPAC) ? Alignment-free (unique feature) ? Flexible framework ? Future work ? Development of post-processing tools ? Likely also cloud-based Department of Information Technology (INTEC), Ghent University, Belgium 29
  • 30. Department of Information Technology C Internet Based Communication Networks and Services (IBCN) Questions ? Jan Fostier jan.fostier@intec.ugent.be Department of Information Technology (INTEC) Ghent University