Stochastic context free grammars (SCFGs) are context free grammars that assign probabilities to generated strings. SCFGs can generate the same string using different derivation rules, with each set of rules assigning a different probability to the string. SCFGs are useful for modeling RNA sequences and predicting secondary structure, as they can generate RNA sequences and assign probabilities to potential secondary structures. The Cocke-Younger-Kasami algorithm is commonly used to calculate the most probable structure for an RNA sequence using a SCFG.
2. Grammars
¡ñ Wiki
a grammar is a set of
rewriting rules for forming
strings in a formal language
¡ñ context-free:
rewrite single variables
¡ñ Formal definition
a grammar is a 4-tuple
¡ñ N set of nonterminals
¡ñ V set of terminals
¡ñ P set of rules
¡ñ S start symbol
¡ñ Example
generates {a
m
u
n
¨O m ,n¡Ý0}S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ? aSu ? aaSuu ? aauu
S ? aS ? aaS ? aaSu ? aaSuu ? aauu
3. Stochastic CFGs
¡ñ A context free grammar (CFG) + probabilities
¡ñ Assign probabilities to generated strings
¡ñ Example
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
4. SCFGs
¡ñ Purpose:
¡ñ generate the same string using different sets of rules
¡ñ each set of rules tells a different story
¡ñ each set of rules assigns a different probability to the string
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
5. SCFGs & RNA
¡ñ Relation to RNA and 2nd
structure prediction
¡ñ generates RNA sequences ¨C strings over {A, C, G, U}
¡ñ 2nd
structure is given by the set of rules used
¡ñ assigns probabilities to structures
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
0.001
0.00256
6. SCFGs & RNA
0.1 0.4 0.4 0.1
S «Ï aSu ¨O aS ¨O Su ¨O «þ
«á «â . .
S ?
0.1
aSu ?
0.1
aaSuu ?
0.1
aauu
«á «â «á«á «â«â «á«á«â«â
S ?
0.4
aS ?
0.4
aaS ?
0.4
aaSu ?
0.4
aaSuu ?
0.1
aauu
. .. .. . .. .. ....
7. A better example
S «Ï aS ¨O cS ¨O gS ¨O uS
Sa ¨O Sc ¨O Sg ¨O Su
aSu ¨O cSg ¨O gSu
uSa ¨O gSc ¨O uSg
SS
8. Algorithms
¡ñ Determine the most probable structure for a RNA sequence
¡ñ Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
¡ñ Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
9. Algorithms
¡ñ Determine the most probable structure for a RNA sequence
¡ñ Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
¡ñ Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
11. Cocke¨CYounger¨CKasami
¡ñ Calculate best structure for small subsequences and work
outwards to larger and larger subsequences
¡ñ Notations
¡ñ Grammar G in CNF with nonterminals V1
, ..., Vm
¡ñ V1
is the start symbol
¡ñ t(x, y, z) is the probability of rule Vx
¡ú Vy
Vz
¡ñ e(x, a) is the probability of rule Vx
¡ú a
¡ñ score[x, i, j] is the maximum probability of generating
seq[i, j] from Vx
12. CYK
¡ñ Vx
¡ú seq[i]
score[x, i, i] = e(x, seq[i])
¡ñ Vx
¡ú Vy
Vz
and for some i ¡Ü k < j
score[x, i, j] = score[y, i, k] ¡¤ score[z, k+1, j] ¡¤ t(x, y, z)
V x
Vy Vz
i k k+1 j
13. CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
V x
Vy Vz
i k k+1 j
14. CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
Time?
V x
Vy Vz
i k k+1 j
15. CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
V x
Vy Vz
i k k+1 j
16. CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
Backtracking?
V x
Vy Vz
i k k+1 j
17. CYK
score[x ,i , j]=
{
0 if j«Çi
e«áx , seq[i]«â if i= j
max
i¡Ük«Ç j
V x «ÏVy Vz
score[y ,i ,k]?score[z ,k«Æ1, j]?t«áx ,y ,z«â
Space?
O(m ? n2
)
Time?
O(m? r? n3
)
Backtracking?
O(r? n2
)
V x
Vy Vz
i k k+1 j
18. SCFG design
¡ñ Dowell & Eddy (2004)
G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ
G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS
L «Ï d S «Ôd ¨O aL
R «Ï Rd ¨O «þ
G3: S «Ï d S ¨O d S «Ôd S ¨O «þ
G4: S «Ï d S ¨O T ¨O «þ
T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd
G5: S «Ï LS ¨O L
L «Ï d F «Ôd ¨O d
F «Ï d F «Ôd ¨O LS
19. SCFG design
¡ñ Dowell & Eddy (2004)
G1: S «Ï dS «Ôd ¨O d S ¨O S d ¨O SS ¨O «þ
G2: S «Ï d S «Ôd ¨O d L ¨O Rd ¨O LS
L «Ï d S «Ôd ¨O aL
R «Ï Rd ¨O «þ
G3: S «Ï d S ¨O d S «Ôd S ¨O «þ
G4: S «Ï d S ¨O T ¨O «þ
T «Ï T d ¨O d S «Ôd ¨O T d S «Ôd
G5: S «Ï LS ¨O L
L «Ï d F «Ôd ¨O d
F «Ï d F «Ôd ¨O LS