ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Stochastic Context Free GrammarsStochastic Context Free Grammars
Grammars
¡ñ Wiki
a grammar is a set of
rewriting rules for forming
strings in a formal language
¡ñ context-free:
rewrite single variables
¡ñ Formal definition
a grammar is a 4-tuple
¡ñ N set of nonterminals
¡ñ V set of terminals
¡ñ P set of rules
¡ñ S start symbol
¡ñ Example
generates {a
m
u
n
¨O m ,n¡Ý0}S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ? aSu ? aaSuu ? aauu
S ? aS ? aaS ? aaSu ? aaSuu ? aauu
Stochastic CFGs
¡ñ A context free grammar (CFG) + probabilities
¡ñ Assign probabilities to generated strings
¡ñ Example
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
SCFGs
¡ñ Purpose:
¡ñ generate the same string using different sets of rules
¡ñ each set of rules tells a different story
¡ñ each set of rules assigns a different probability to the string
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
SCFGs & RNA
¡ñ Relation to RNA and 2nd
structure prediction
¡ñ generates RNA sequences ¨C strings over {A, C, G, U}
¡ñ 2nd
structure is given by the set of rules used
¡ñ assigns probabilities to structures
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
SCFGs & RNA
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
«á «â . .
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
«á «â «á«á «â«â «á«á«â«â
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
. .. .. . .. .. ....
A better example
S «Ï aS ¨O cS ¨O gS ¨O uS
Sa ¨O Sc ¨O Sg ¨O Su
aSu ¨O cSg ¨O gSu
uSa ¨O gSc ¨O uSg
SS
Algorithms
¡ñ Determine the most probable structure for a RNA sequence
¡ñ Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
¡ñ Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
Algorithms
¡ñ Determine the most probable structure for a RNA sequence
¡ñ Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
¡ñ Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
Chomsky Normal Form
A«ÏBC
A«Ïd
A«Ï«þ
¡ñ Only rules of the form
S «Ï aS ?
S «Ï AS
A «Ï a
S «Ï Sa ?
S «Ï SA
A «Ï a
¡ñ Any CFG can be rewritten in CNF
Cocke¨CYounger¨CKasami
¡ñ Calculate best structure for small subsequences and work
outwards to larger and larger subsequences
¡ñ Notations
¡ñ Grammar G in CNF with nonterminals V1
, ..., Vm
¡ñ V1
is the start symbol
¡ñ t(x, y, z) is the probability of rule Vx
¡ú Vy
Vz
¡ñ e(x, a) is the probability of rule Vx
¡ú a
¡ñ score[x, i, j] is the maximum probability of generating
seq[i, j] from Vx
CYK
¡ñ Vx
¡ú seq[i]
score[x, i, i] = e(x, seq[i])
¡ñ Vx
¡ú Vy
Vz
and for some i ¡Ü k < j
score[x, i, j] = score[y, i, k] ¡¤ score[z, k+1, j] ¡¤ t(x, y, z)
V x
Vy Vz
i k k+1 j
CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
V x
Vy Vz
i k k+1 j
CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
Time?
V x
Vy Vz
i k k+1 j
CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
V x
Vy Vz
i k k+1 j
CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
Backtracking?
V x
Vy Vz
i k k+1 j
CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
Backtracking?
O(r? n2
)
V x
Vy Vz
i k k+1 j
SCFG design
¡ñ Dowell & Eddy (2004)
G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ
G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS
L «Ï d S «Ôd ¨O aL
R «Ï Rd ¨O «þ
G3: S «Ï d S ¨O d S «Ôd S ¨O «þ
G4: S «Ï d S ¨O T ¨O «þ
T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd
G5: S «Ï LS ¨O L
L «Ï d F «Ôd ¨O d
F «Ï d F «Ôd ¨O LS
SCFG design
¡ñ Dowell & Eddy (2004)
G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ
G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS
L «Ï d S «Ôd ¨O aL
R «Ï Rd ¨O «þ
G3: S «Ï d S ¨O d S «Ôd S ¨O «þ
G4: S «Ï d S ¨O T ¨O «þ
T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd
G5: S «Ï LS ¨O L
L «Ï d F «Ôd ¨O d
F «Ï d F «Ôd ¨O LS
Prediction accuracy
¡ñ Sensitivity and specificity
sensitivity =
TN
TN«ÆFP
specificity =
TP
TP«ÆFN
sensitivity =
4
4«Æ2
= 0.666
specificity =
4
4«Æ2
= 0.666
Prediction accuracy
sensitivity =
4
4«Æ2
= 0.666
specificity =
4
4«Æ2
= 0.666
sensitivity =
5
5«Æ2
= 0.714
specificity =
2
2«Æ3
= 0.4
Prediction accuracy
sensitivity =
4
4«Æ2
= 0.666
specificity =
4
4«Æ2
= 0.666
sensitivity =
5
5«Æ2
= 0.714
specificity =
2
2«Æ3
= 0.4
Use RNA 2nd
structure metrics
(Moulton et al. 2000)
Search for better SCFGs
¡ñ Evolutionary algorithm
¡ñ Initial population
¡ñ Mutation model
¡ñ Breeding model
¡ñ Selection

More Related Content

AB-RNA-SCFG-2010

  • 1. Stochastic Context Free GrammarsStochastic Context Free Grammars
  • 2. Grammars ¡ñ Wiki a grammar is a set of rewriting rules for forming strings in a formal language ¡ñ context-free: rewrite single variables ¡ñ Formal definition a grammar is a 4-tuple ¡ñ N set of nonterminals ¡ñ V set of terminals ¡ñ P set of rules ¡ñ S start symbol ¡ñ Example generates {a m u n ¨O m ,n¡Ý0}S «Ï aSu ¨O aS ¨O Su ¨O «þ S ? aSu ? aaSuu ? aauu S ? aS ? aaS ? aaSu ? aaSuu ? aauu
  • 3. Stochastic CFGs ¡ñ A context free grammar (CFG) + probabilities ¡ñ Assign probabilities to generated strings ¡ñ Example 0.1 0.4 0.4 0.1 S «Ï aSu ¨O aS ¨O Su ¨O «þ S ? 0.1 aSu ? 0.1 aaSuu ? 0.1 aauu S ? 0.4 aS ? 0.4 aaS ? 0.4 aaSu ? 0.4 aaSuu ? 0.1 aauu 0.001 0.00256
  • 4. SCFGs ¡ñ Purpose: ¡ñ generate the same string using different sets of rules ¡ñ each set of rules tells a different story ¡ñ each set of rules assigns a different probability to the string 0.1 0.4 0.4 0.1 S «Ï aSu ¨O aS ¨O Su ¨O «þ S ? 0.1 aSu ? 0.1 aaSuu ? 0.1 aauu S ? 0.4 aS ? 0.4 aaS ? 0.4 aaSu ? 0.4 aaSuu ? 0.1 aauu 0.001 0.00256
  • 5. SCFGs & RNA ¡ñ Relation to RNA and 2nd structure prediction ¡ñ generates RNA sequences ¨C strings over {A, C, G, U} ¡ñ 2nd structure is given by the set of rules used ¡ñ assigns probabilities to structures 0.1 0.4 0.4 0.1 S «Ï aSu ¨O aS ¨O Su ¨O «þ S ? 0.1 aSu ? 0.1 aaSuu ? 0.1 aauu S ? 0.4 aS ? 0.4 aaS ? 0.4 aaSu ? 0.4 aaSuu ? 0.1 aauu 0.001 0.00256
  • 6. SCFGs & RNA 0.1 0.4 0.4 0.1 S «Ï aSu ¨O aS ¨O Su ¨O «þ «á «â . . S ? 0.1 aSu ? 0.1 aaSuu ? 0.1 aauu «á «â «á«á «â«â «á«á«â«â S ? 0.4 aS ? 0.4 aaS ? 0.4 aaSu ? 0.4 aaSuu ? 0.1 aauu . .. .. . .. .. ....
  • 7. A better example S «Ï aS ¨O cS ¨O gS ¨O uS Sa ¨O Sc ¨O Sg ¨O Su aSu ¨O cSg ¨O gSu uSa ¨O gSc ¨O uSg SS
  • 8. Algorithms ¡ñ Determine the most probable structure for a RNA sequence ¡ñ Determine the total probability of generating a sequence (the sum of probabilities of all ways of generating it) ¡ñ Given a data set with sequences and associated structures, determine the rules' probabilities that maximize the total probability of generating the right structures from the set
  • 9. Algorithms ¡ñ Determine the most probable structure for a RNA sequence ¡ñ Determine the total probability of generating a sequence (the sum of probabilities of all ways of generating it) ¡ñ Given a data set with sequences and associated structures, determine the rules' probabilities that maximize the total probability of generating the right structures from the set
  • 10. Chomsky Normal Form A«ÏBC A«Ïd A«Ï«þ ¡ñ Only rules of the form S «Ï aS ? S «Ï AS A «Ï a S «Ï Sa ? S «Ï SA A «Ï a ¡ñ Any CFG can be rewritten in CNF
  • 11. Cocke¨CYounger¨CKasami ¡ñ Calculate best structure for small subsequences and work outwards to larger and larger subsequences ¡ñ Notations ¡ñ Grammar G in CNF with nonterminals V1 , ..., Vm ¡ñ V1 is the start symbol ¡ñ t(x, y, z) is the probability of rule Vx ¡ú Vy Vz ¡ñ e(x, a) is the probability of rule Vx ¡ú a ¡ñ score[x, i, j] is the maximum probability of generating seq[i, j] from Vx
  • 12. CYK ¡ñ Vx ¡ú seq[i] score[x, i, i] = e(x, seq[i]) ¡ñ Vx ¡ú Vy Vz and for some i ¡Ü k < j score[x, i, j] = score[y, i, k] ¡¤ score[z, k+1, j] ¡¤ t(x, y, z) V x Vy Vz i k k+1 j
  • 13. CYK score[x ,i , j]= { 0 if j«Çi e«áx , seq[i]«â if i= j max i¡Ük«Ç j V x «ÏVy Vz score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â V x Vy Vz i k k+1 j
  • 14. CYK score[x ,i , j]= { 0 if j«Çi e«áx , seq[i]«â if i= j max i¡Ük«Ç j V x «ÏVy Vz score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â Space? Time? V x Vy Vz i k k+1 j
  • 15. CYK score[x ,i , j]= { 0 if j«Çi e«áx , seq[i]«â if i= j max i¡Ük«Ç j V x «ÏVy Vz score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â Space? O(m ? n2 ) Time? O(m? r? n3 ) V x Vy Vz i k k+1 j
  • 16. CYK score[x ,i , j]= { 0 if j«Çi e«áx , seq[i]«â if i= j max i¡Ük«Ç j V x «ÏVy Vz score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â Space? O(m ? n2 ) Time? O(m? r? n3 ) Backtracking? V x Vy Vz i k k+1 j
  • 17. CYK score[x ,i , j]= { 0 if j«Çi e«áx , seq[i]«â if i= j max i¡Ük«Ç j V x «ÏVy Vz score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â Space? O(m ? n2 ) Time? O(m? r? n3 ) Backtracking? O(r? n2 ) V x Vy Vz i k k+1 j
  • 18. SCFG design ¡ñ Dowell & Eddy (2004) G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS L «Ï d S «Ôd ¨O aL R «Ï Rd ¨O «þ G3: S «Ï d S ¨O d S «Ôd S ¨O «þ G4: S «Ï d S ¨O T ¨O «þ T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd G5: S «Ï LS ¨O L L «Ï d F «Ôd ¨O d F «Ï d F «Ôd ¨O LS
  • 19. SCFG design ¡ñ Dowell & Eddy (2004) G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS L «Ï d S «Ôd ¨O aL R «Ï Rd ¨O «þ G3: S «Ï d S ¨O d S «Ôd S ¨O «þ G4: S «Ï d S ¨O T ¨O «þ T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd G5: S «Ï LS ¨O L L «Ï d F «Ôd ¨O d F «Ï d F «Ôd ¨O LS
  • 20. Prediction accuracy ¡ñ Sensitivity and specificity sensitivity = TN TN«ÆFP specificity = TP TP«ÆFN sensitivity = 4 4«Æ2 = 0.666 specificity = 4 4«Æ2 = 0.666
  • 21. Prediction accuracy sensitivity = 4 4«Æ2 = 0.666 specificity = 4 4«Æ2 = 0.666 sensitivity = 5 5«Æ2 = 0.714 specificity = 2 2«Æ3 = 0.4
  • 22. Prediction accuracy sensitivity = 4 4«Æ2 = 0.666 specificity = 4 4«Æ2 = 0.666 sensitivity = 5 5«Æ2 = 0.714 specificity = 2 2«Æ3 = 0.4 Use RNA 2nd structure metrics (Moulton et al. 2000)
  • 23. Search for better SCFGs ¡ñ Evolutionary algorithm ¡ñ Initial population ¡ñ Mutation model ¡ñ Breeding model ¡ñ Selection