Stochastic Context Free Grammars

Grammars
● Wiki
a grammar is a set of rewriting rules for forming strings in a formal language
● context-free: each rule rewrites a single variable (nonterminal)
● Formal definition
a grammar is a 4-tuple (N, V, P, S)
● N set of nonterminals
● V set of terminals
● P set of rules
● S start symbol
● Example
S → aSu ∣ aS ∣ Su ∣ ε
generates {a^m u^n ∣ m, n ≥ 0}
S ⇒ aSu ⇒ aaSuu ⇒ aauu
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
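The two derivations of aauu can be replayed mechanically. A minimal Python sketch, assuming each rule is applied to the single S in the current string; the names `RULES` and `derive` are illustrative and not from the slides:

```python
# Right-hand sides of the example grammar; "" stands for the empty string ε.
RULES = {"aSu": "aSu", "aS": "aS", "Su": "Su", "eps": ""}

def derive(rule_names):
    """Apply the named rules in order to the single S and return every step."""
    form = "S"
    steps = [form]
    for name in rule_names:
        if "S" not in form:
            raise ValueError("no nonterminal left to rewrite")
        # Rewrite the (only) occurrence of S with the chosen right-hand side.
        form = form.replace("S", RULES[name], 1)
        steps.append(form)
    return steps

print(" => ".join(derive(["aSu", "aSu", "eps"])))
print(" => ".join(derive(["aS", "aS", "Su", "Su", "eps"])))
```

Both calls end in aauu, which shows that one string can have several derivations.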

Stochastic CFGs
● A context free grammar (CFG) + probabilities
● Assign probabilities to generated strings
● Example
S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1)
S ⇒ aSu ⇒ aaSuu ⇒ aauu
probability: 0.1 · 0.1 · 0.1 = 0.001
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
probability: 0.4 · 0.4 · 0.4 · 0.4 · 0.1 = 0.00256
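The probability of a derivation is the product of the probabilities of the rules it uses. A short Python sketch that recomputes the two values for aauu; the dictionary `PROB` and the rule names are illustrative labels for the four rules of the example:

```python
import math

# Rule probabilities from the example: S -> aSu | aS | Su | ε
PROB = {"aSu": 0.1, "aS": 0.4, "Su": 0.4, "eps": 0.1}

def derivation_probability(rule_names):
    """Probability of one derivation = product of the probabilities of its rules."""
    return math.prod(PROB[name] for name in rule_names)

p_paired = derivation_probability(["aSu", "aSu", "eps"])
p_unpaired = derivation_probability(["aS", "aS", "Su", "Su", "eps"])
print(p_paired, p_unpaired)
```

Up to floating-point rounding, the two results are 0.001 and 0.00256.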

SCFGs
● Purpose:
● generate the same string using different sets of rules
● each set of rules tells a different story
● each set of rules assigns a different probability to the string

SCFGs & RNA
● Relation to RNA and secondary structure prediction
● generates RNA sequences – strings over {A, C, G, U}
● secondary structure is given by the set of rules used
● assigns probabilities to structures

SCFGs & RNA
S → aSu (0.1) ∣ aS (0.4) ∣ Su (0.4) ∣ ε (0.1)
Structure contributed by each rule, in dot-bracket notation: aSu gives a base pair ( ), aS and Su give an unpaired base .
S ⇒ aSu ⇒ aaSuu ⇒ aauu
structure: (())
S ⇒ aS ⇒ aaS ⇒ aaSu ⇒ aaSuu ⇒ aauu
structure: ....
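One way to read a structure off a derivation is sketched below in Python. It assumes dot-bracket notation, that rule aSu pairs its a with its u, and that aS and Su emit unpaired bases; the function name `derive_with_structure` is hypothetical:

```python
def derive_with_structure(rule_names):
    """Return (sequence, dot-bracket structure) for a derivation of the toy grammar."""
    left_seq, right_seq = [], []      # terminals emitted left / right of S
    left_str, right_str = [], []      # structure symbols for those terminals
    for name in rule_names:
        if name == "aSu":             # a and u are emitted together: a base pair
            left_seq.append("a")
            left_str.append("(")
            right_seq.append("u")
            right_str.append(")")
        elif name == "aS":            # unpaired a on the left
            left_seq.append("a")
            left_str.append(".")
        elif name == "Su":            # unpaired u on the right
            right_seq.append("u")
            right_str.append(".")
        elif name == "eps":           # S -> ε ends the derivation
            break
        else:
            raise ValueError(f"unknown rule {name!r}")
    # Symbols emitted on the right were produced outside-in, so reverse them.
    seq = "".join(left_seq) + "".join(reversed(right_seq))
    struct = "".join(left_str) + "".join(reversed(right_str))
    return seq, struct

print(derive_with_structure(["aSu", "aSu", "eps"]))
print(derive_with_structure(["aS", "aS", "Su", "Su", "eps"]))
```

Both derivations produce the sequence aauu, but the first yields a fully paired structure and the second a fully unpaired one.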

A better example
S → aS ∣ cS ∣ gS ∣ uS
  ∣ Sa ∣ Sc ∣ Sg ∣ Su
  ∣ aSu ∣ cSg ∣ gSu
  ∣ uSa ∣ gSc ∣ uSg
  ∣ SS

Algorithms
● Determine the most probable structure for an RNA sequence
● Determine the total probability of generating a sequence
(the sum of probabilities of all ways of generating it)
● Given a data set with sequences and associated structures,
determine the rules' probabilities that maximize the total
probability of generating the right structures from the set
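The first two tasks differ only in whether alternatives are maximized or summed. A small Python sketch for the toy grammar S → aSu ∣ aS ∣ Su ∣ ε with probabilities 0.1, 0.4, 0.4, 0.1 from the earlier slides; it recurses directly on the string rather than using a general parser, and the function names are illustrative:

```python
from functools import lru_cache

P_PAIR, P_LEFT, P_RIGHT, P_END = 0.1, 0.4, 0.4, 0.1  # aSu, aS, Su, ε

def _alternatives(s, solve):
    """Probabilities of deriving s, one entry per possible first rule."""
    options = []
    if len(s) >= 2 and s[0] == "a" and s[-1] == "u":
        options.append(P_PAIR * solve(s[1:-1]))   # S -> aSu
    if s[0] == "a":
        options.append(P_LEFT * solve(s[1:]))     # S -> aS
    if s[-1] == "u":
        options.append(P_RIGHT * solve(s[:-1]))   # S -> Su
    return options

@lru_cache(maxsize=None)
def total_probability(s):
    """Sum of the probabilities of all derivations of s from S."""
    if s == "":
        return P_END                              # only S -> ε yields ""
    return sum(_alternatives(s, total_probability))

@lru_cache(maxsize=None)
def best_probability(s):
    """Probability of the single most probable derivation of s from S."""
    if s == "":
        return P_END
    return max(_alternatives(s, best_probability), default=0.0)

print(best_probability("aauu"), total_probability("aauu"))
```

For aauu the best derivation has probability of about 0.00256, the value of the all-unpaired derivation shown earlier, while the total is larger because it adds up every derivation.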

Chomsky Normal Form
ABC
Ad
A
● Only rules of the form
S  aS ?
S  AS
A  a
S  Sa ?
S  SA
A  a
● Any CFG can be rewritten in CNF
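The rewriting of S → aS into S → AS, A → a is one step of the conversion. A Python sketch of that single step, assuming lowercase symbols are terminals, uppercase symbols are nonterminals, and the uppercase version of a terminal is not already in use as a nonterminal; `lift_terminals` is a hypothetical helper name:

```python
def lift_terminals(rules):
    """Replace terminals inside longer right-hand sides by fresh nonterminals.

    rules: list of (lhs, rhs) pairs, rhs being a tuple of symbols.
    Only this one step of the CNF conversion is shown; splitting long
    right-hand sides and removing unit rules are not covered.
    """
    new_rules = []
    proxy = {}                        # terminal -> nonterminal that emits it
    for lhs, rhs in rules:
        if len(rhs) >= 2:
            lifted = []
            for sym in rhs:
                if sym.islower():     # terminal inside a longer rule
                    proxy.setdefault(sym, sym.upper())
                    lifted.append(proxy[sym])
                else:
                    lifted.append(sym)
            rhs = tuple(lifted)
        new_rules.append((lhs, rhs))
    # Add one emitting rule per lifted terminal, e.g. A -> a.
    for terminal, nonterminal in proxy.items():
        new_rules.append((nonterminal, (terminal,)))
    return new_rules

print(lift_terminals([("S", ("a", "S"))]))
print(lift_terminals([("S", ("S", "a"))]))
```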

Cocke–Younger–Kasami
● Calculate best structure for small subsequences and work
outwards to larger and larger subsequences
● Notations
● Grammar G in CNF with nonterminals V1, ..., Vm
● V1 is the start symbol
● t(x, y, z) is the probability of rule Vx → Vy Vz
● e(x, a) is the probability of rule Vx → a
● score[x, i, j] is the maximum probability of generating seq[i, j] from Vx

CYK
● Vx → seq[i]
score[x, i, i] = e(x, seq[i])
● Vx → Vy Vz and for some i ≤ k < j
score[x, i, j] = score[y, i, k] · score[z, k+1, j] · t(x, y, z)
[Figure: Vx spans seq[i, j]; Vy spans seq[i, k] and Vz spans seq[k+1, j]]

CYK
score[x, i, j] =
0 if j < i
e(x, seq[i]) if i = j
max over i ≤ k < j and rules Vx → Vy Vz of score[y, i, k] · score[z, k+1, j] · t(x, y, z) otherwise
[Figure: Vx spans seq[i, j]; Vy spans seq[i, k] and Vz spans seq[k+1, j]]
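The recurrence can be filled in bottom-up, shortest subsequences first. A minimal Python sketch, assuming 0-based indices (so the start symbol V1 is index 0), dictionaries `t` and `e` keyed as in the notation slide, and a tiny made-up grammar S → A U ∣ S S, A → a, U → u for the usage lines:

```python
def cyk(seq, n_nonterminals, t, e):
    """Max-probability CYK for a grammar in CNF.

    t: dict {(x, y, z): prob} for rules Vx -> Vy Vz
    e: dict {(x, a): prob} for rules Vx -> a
    Returns score with score[x][i][j] = best probability of generating
    seq[i..j] (0-based, inclusive) from Vx.
    """
    n = len(seq)
    m = n_nonterminals
    score = [[[0.0] * n for _ in range(n)] for _ in range(m)]
    # Base case i = j: emission rules.
    for i, a in enumerate(seq):
        for x in range(m):
            score[x][i][i] = e.get((x, a), 0.0)
    # Longer spans, built from strictly shorter ones.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for (x, y, z), p in t.items():
                for k in range(i, j):
                    cand = score[y][i][k] * score[z][k + 1][j] * p
                    if cand > score[x][i][j]:
                        score[x][i][j] = cand
    return score

# Illustrative grammar: 0 = S, 1 = A, 2 = U.
t = {(0, 1, 2): 0.5, (0, 0, 0): 0.5}   # S -> A U | S S
e = {(1, "a"): 1.0, (2, "u"): 1.0}     # A -> a, U -> u
score = cyk("auau", 3, t, e)
print(score[0][0][3])
```

The table uses one n × n matrix per nonterminal, and the nested loops visit every rule, span and split point once.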

CYK
score[x, i, j] = 0 if j < i, otherwise max over i ≤ k < j and rules Vx → Vy Vz of …
Space? O(m · n^2)
Time? O(m · r · n^3)
Backtracking? O(r · n^2)
[Figure: Vx spans seq[i, j]; Vy spans seq[i, k] and Vz spans seq[k+1, j]]
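One way to backtrack without storing pointers is to rescan rules and split points at each node of the parse tree until one reproduces the stored score. This is an assumption about how the backtracking is done, not something the slides spell out. A Python sketch with 0-based indices and a hand-built score table for the sequence au under a made-up grammar S → A U (0.5), A → a, U → u:

```python
import math

def traceback(score, seq, t, x, i, j):
    """Rebuild one best parse of seq[i..j] from Vx as a nested tuple.

    score[x][i][j] is a filled CYK table; t maps (x, y, z) to the
    probability of rule Vx -> Vy Vz. No back-pointers are needed.
    """
    target = score[x][i][j]
    if target <= 0:
        raise ValueError("subsequence cannot be generated from this nonterminal")
    if i == j:
        return (x, seq[i])            # leaf: Vx -> seq[i]
    for (rx, y, z), p in t.items():
        if rx != x:
            continue
        for k in range(i, j):
            cand = score[y][i][k] * score[z][k + 1][j] * p
            if cand > 0 and math.isclose(cand, target, rel_tol=1e-9):
                return (x,
                        traceback(score, seq, t, y, i, k),
                        traceback(score, seq, t, z, k + 1, j))
    raise ValueError("score table is inconsistent with the rules")

# Hand-built table for "au": 0 = S, 1 = A, 2 = U.
seq = "au"
t = {(0, 1, 2): 0.5}
score = [
    [[0.0, 0.5], [0.0, 0.0]],   # S
    [[1.0, 0.0], [0.0, 0.0]],   # A
    [[0.0, 0.0], [0.0, 1.0]],   # U
]
tree = traceback(score, seq, t, 0, 0, 1)
print(tree)
```

Each node of the tree costs one pass over the rules and split points, and a parse of n symbols has on the order of n nodes.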

SCFG design
● Dowell & Eddy (2004)
G1: S → dSd ∣ dS ∣ Sd ∣ SS ∣ ε
G2: S → dSd ∣ dL ∣ Rd ∣ LS
    L → dSd ∣ dL
    R → Rd ∣ ε
G3: S → dS ∣ dSdS ∣ ε
G4: S → dS ∣ T ∣ ε
    T → Td ∣ dSd ∣ TdSd
G5: S → LS ∣ L
    L → dFd ∣ d
    F → dFd ∣ LS

Prediction accuracy
● Sensitivity and specificity
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
sensitivity = 4 / (4 + 2) ≈ 0.667
specificity = 4 / (4 + 2) ≈ 0.667

Prediction accuracy
sensitivity = 4 / (4 + 2) ≈ 0.667
specificity = 4 / (4 + 2) ≈ 0.667
sensitivity = 5 / (5 + 2) ≈ 0.714
specificity = 2 / (2 + 3) = 0.4
Use RNA secondary structure metrics (Moulton et al. 2000)
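The four values can be recomputed from counts. A Python sketch using the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); which count each number on the slide stands for is an assumption, since the figures the counts were read from are not part of the text:

```python
def sensitivity(tp, fn):
    """Share of actual positives that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that were predicted: TN / (TN + FP)."""
    return tn / (tn + fp)

# Assumed reading of the two examples: (TP, FN, TN, FP).
first = (4, 2, 4, 2)
second = (5, 2, 2, 3)
for tp, fn, tn, fp in (first, second):
    print(round(sensitivity(tp, fn), 3), round(specificity(tn, fp), 3))
```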

Search for better SCFGs
● Evolutionary algorithm
● Initial population
● Mutation model
● Breeding model
● Selection
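The four ingredients fit together as a simple loop. The slides do not describe the concrete models, so everything below is an assumption for illustration: this Python sketch evolves only the rule probabilities of a fixed four-rule grammar (a search for better SCFGs would also mutate the rules themselves), and the fitness function is a made-up stand-in for a prediction-accuracy score:

```python
import random

def normalize(p):
    """Scale a vector of positive weights so that it sums to 1."""
    total = sum(p)
    return [v / total for v in p]

def mutate(p, rng, scale=0.1):
    """Mutation model: jitter every probability, then renormalize."""
    return normalize([max(1e-9, v + rng.uniform(-scale, scale)) for v in p])

def breed(p, q):
    """Breeding model: average two parents component-wise."""
    return normalize([(a + b) / 2 for a, b in zip(p, q)])

def evolve(fitness, n_rules=4, pop_size=20, generations=50, seed=0):
    rng = random.Random(seed)
    # Initial population: random probability vectors.
    population = [normalize([rng.random() + 1e-9 for _ in range(n_rules)])
                  for _ in range(pop_size)]
    history = []                      # best fitness seen in each generation
    for _ in range(generations):
        # Selection: the fitter half survives unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        history.append(fitness(survivors[0]))
        children = []
        while len(survivors) + len(children) < pop_size:
            p, q = rng.sample(survivors, 2)
            children.append(mutate(breed(p, q), rng))
        population = survivors + children
    return max(population, key=fitness), history

def paired_fitness(p):
    """Hypothetical fitness: probability of the fully paired derivation of
    aauu, P(aSu)^2 * P(ε), with rule order [aSu, aS, Su, ε]."""
    return p[0] * p[0] * p[3]

best, history = evolve(paired_fitness)
print(best, paired_fitness(best))
```

Because survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.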

AB-RNA-SCFG-2010