2 / 22
Outline
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
3 / 22
Can we omit policy gradient completely?
Advantage: how much better an action is than the average action.
If we ignore the policy itself, we can simply pick the action that maximizes the advantage:
\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} A^\pi(s_t, a_t) \\ 0 & \text{otherwise} \end{cases}

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
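A minimal sketch of this greedy extraction, assuming a small tabular Q (the numbers and array names below are purely illustrative):

```python
import numpy as np

# Toy tabular example: 2 states, 2 actions.
Q = np.array([[1.0, 2.0],
              [0.5, 0.1]])            # Q(s, a)
V = (Q * 0.5).sum(axis=1)            # V(s) under a uniform policy, just for illustration
A = Q - V[:, None]                   # A(s, a) = Q(s, a) - V(s)

# Greedy policy extraction: argmax of A equals argmax of Q,
# because V(s) does not depend on the action.
pi = np.zeros_like(Q)
pi[np.arange(Q.shape[0]), A.argmax(axis=1)] = 1.0   # pi'(a|s) = 1 at the argmax, 0 otherwise
print(pi)
```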
4 / 22
Policy iteration
Algorithm
1. Policy evaluation
: compute the value function of the current policy
(repeat N times; the more sweeps, the closer it gets to the true value)
2. Policy improvement
update the policy using the given value function
(the deterministic argmax rule below)
Using A^\pi(s, a):

\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a_t} A^\pi(s_t, a_t) \\ 0 & \text{otherwise} \end{cases}
8 / 22
Dynamic programming
s, a are discrete
- 16 states, 4 actions per state
- the value of every state is stored in a table
-> no approximation error
- transition matrix: 16 x 16 x 4 tensor
Bootstrapped update:
since the policy is deterministic (action probabilities are 1 or 0), the expectation over actions drops and the update can be written as:
V^\pi(s) = E_{a \sim \pi(a \mid s)}\big[ r(s, a) + \gamma \, E_{s' \sim p(s' \mid s, a)}[V^\pi(s')] \big]

V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \, E_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]
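A sketch of this bootstrapped evaluation sweep for a tabular problem like the 16-state grid, assuming a known transition tensor P and reward table R (both names are illustrative):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.99, sweeps=100):
    """Bootstrapped policy evaluation for a deterministic policy.

    P  : transition tensor, shape (S, A, S'), P[s, a, s2] = p(s2 | s, a)
    R  : reward table, shape (S, A)
    pi : deterministic policy, shape (S,), pi[s] = chosen action
    """
    S = P.shape[0]
    V = np.zeros(S)
    for _ in range(sweeps):                      # more sweeps -> closer to the true V^pi
        P_pi = P[np.arange(S), pi]               # p(s' | s, pi(s)), shape (S, S')
        # V(s) <- r(s, pi(s)) + gamma * E_{s'}[V(s')]
        V = R[np.arange(S), pi] + gamma * P_pi @ V
    return V
```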
9 / 22
Policy iteration with dynamic programming
Policy iteration
1. evaluate V^\pi(s)
2. set \pi \leftarrow \pi'

Policy evaluation:
V^\pi(s) \leftarrow r(s, \pi(s)) + \gamma \, E_{s' \sim p(s' \mid s, \pi(s))}[V^\pi(s')]
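Putting evaluation and greedy improvement together, a sketch of tabular policy iteration under the same assumptions (P and R as in the previous sketch):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=50, iters=20):
    """Tabular policy iteration: repeated evaluation sweeps + greedy improvement."""
    S, A = R.shape
    pi = np.zeros(S, dtype=int)                  # arbitrary initial deterministic policy
    V = np.zeros(S)
    for _ in range(iters):
        # 1. policy evaluation: bootstrapped sweeps toward V^pi
        for _ in range(eval_sweeps):
            P_pi = P[np.arange(S), pi]
            V = R[np.arange(S), pi] + gamma * P_pi @ V
        # 2. policy improvement: greedy w.r.t. Q^pi (same argmax as A^pi)
        Q = R + gamma * P @ V                    # shape (S, A)
        pi = Q.argmax(axis=1)
    return pi, V
```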
10 / 22
Even simpler dynamic programming (1/2)
Because the update is bootstrapped, maximizing the advantage

A^\pi(s, a) = r(s, a) + \gamma E[V^\pi(s')] - V^\pi(s)

is ultimately the same as maximizing the Q function

Q(s, a) = r(s, a) + \gamma E[V(s')]

(V^\pi(s) does not depend on the action, so the argmax is identical).
Updating directly with the max of the Q function is value iteration:
1. set Q(s, a) \leftarrow r(s, a) + \gamma E[V(s')]
2. set V(s) \leftarrow \max_a Q(s, a)
11 / 22
Even simpler dynamic programming (2/2)
If the Q-function value for every state-action combination is stored in a table,
the max over Q can be looked up directly.
Skipping the step of extracting an explicit policy and simply repeating the two updates below
is value iteration:
1. set Q(s, a) \leftarrow r(s, a) + \gamma E[V(s')]
2. set V(s) \leftarrow \max_a Q(s, a)
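A sketch of this tabular value iteration, again assuming a known model (P and R as before); note that no policy appears anywhere inside the loop:

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, sweeps=200):
    """Tabular value iteration over a Q table and a V table."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(sweeps):
        Q = R + gamma * P @ V            # 1. Q(s, a) <- r(s, a) + gamma * E[V(s')]
        V = Q.max(axis=1)                # 2. V(s)    <- max_a Q(s, a)
    return V, Q.argmax(axis=1)           # a greedy policy can be read off at the end
```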
14 / 22
Fitted value iteration (1/2)
For inputs like images, the grid-world (tabular) approach no longer works:
curse of dimensionality
For a 200 x 200 x 3 image observation, the table would need |S| x |A| Q-function entries, with
|S| = (255^3)^{200 \times 200}
15 / 22
Fitted value iteration (2/2)
Approximate the value function with a neural net V_\phi(s):

L(\phi) = \frac{1}{2} \left\| V_\phi(s) - \max_a Q^\pi(s, a) \right\|^2

Fitted value iteration algorithm:
1. set y_i \leftarrow \max_{a_i} \big( r(s_i, a_i) + \gamma E[V_\phi(s'_i)] \big)
2. set \phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| V_\phi(s_i) - y_i \right\|^2
(here V_\phi(s_i) plays the role of the prediction \hat{y} and y_i of the regression target y)
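A sketch of fitted value iteration, with a linear function approximator fit by least squares standing in for the neural net (P, R and the feature matrix are illustrative assumptions; a real implementation would instead take gradient steps on the same squared loss):

```python
import numpy as np

def fitted_value_iteration(P, R, features, gamma=0.99, iters=100):
    """Fitted value iteration with V_phi(s) = features[s] @ w.

    features : (S, d) matrix of state features (stand-in for the network input)
    """
    w = np.zeros(features.shape[1])
    for _ in range(iters):
        V = features @ w                                 # current V_phi over all states
        y = (R + gamma * P @ V).max(axis=1)              # 1. y_i <- max_a ( r + gamma * E[V_phi(s')] )
        # 2. phi <- argmin_phi 1/2 * sum_i || V_phi(s_i) - y_i ||^2
        w, *_ = np.linalg.lstsq(features, y, rcond=None)
    return w
```

Note that the max over actions inside the target still requires the transition model, which is exactly the problem the next slide addresses.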
16 / 22
What if we don't know the transition dynamics?
Value iteration requires knowing the transition dynamics: the target
\max_{a_i} \big( r(s_i, a_i) + \gamma E[V(s'_i)] \big)
needs to evaluate every possible action from the same state.
Policy iteration, however, can evaluate a Q function instead of V:
1. evaluate Q^\pi(s, a)
2. set \pi \leftarrow \pi'

Policy evaluation:
Q^\pi(s, a) \leftarrow r(s, a) + \gamma \, E_{s' \sim p(s' \mid s, a)}[Q^\pi(s', \pi(s'))]
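A sketch of this model-free evaluation, where the expectation over s' is replaced by sampled transitions, so the dynamics never have to be known explicitly (the tuple format and names are illustrative):

```python
import numpy as np

def evaluate_q_from_samples(transitions, pi, num_states, num_actions,
                            gamma=0.99, alpha=0.1, passes=50):
    """Sample-based policy evaluation of Q^pi.

    transitions : list of (s, a, r, s_next) tuples collected under any policy
    pi          : deterministic policy array, pi[s] = action
    """
    Q = np.zeros((num_states, num_actions))
    for _ in range(passes):
        for s, a, r, s_next in transitions:
            # E_{s'~p}[Q(s', pi(s'))] is replaced by the single observed s'
            target = r + gamma * Q[s_next, pi[s_next]]
            Q[s, a] += alpha * (target - Q[s, a])   # move Q(s, a) toward the bootstrapped target
    return Q
```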
17 / 22
Can we do the "max" trick again?
Just as we turned policy iteration into value iteration by always taking the highest-value action,
=> the same trick works with Q-values, and then the transition dynamics are no longer needed.
fitted Q iteration algorithm
1. set y_i \leftarrow r(s_i, a_i) + \gamma E[V_\phi(s'_i)]
2. set \phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| Q_\phi(s_i, a_i) - y_i \right\|^2
with E[V_\phi(s'_i)] \approx \max_{a'} Q_\phi(s'_i, a')
Since we now have a Q function, we can query the value of any action and simply take the max.
+ works off-policy
+ only one network -> lower variance
- no convergence guarantees.
18 / 22
Fitted Q-iteration
full fitted Q-iteration algorithm
1. collect dataset \{(s_i, a_i, s'_i, r_i)\} using some policy
2. set y_i \leftarrow r(s_i, a_i) + \gamma \max_{a'_i} Q_\phi(s'_i, a'_i)
3. set \phi \leftarrow \arg\min_\phi \frac{1}{2} \sum_i \left\| Q_\phi(s_i, a_i) - y_i \right\|^2
hyperparameters
- dataset size N
- collection policy
- iterations K
- gradient steps S
(steps 2-3 are repeated K times before new data is collected)
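A sketch of this full loop with the hyperparameters above (dataset size N, iterations K, gradient steps S). Q_phi is a plain table here, i.e. a linear model on one-hot (s, a) features, but the structure is the same with a network; step 1 (collecting the dataset with some policy) is assumed to happen outside the function, and all names are illustrative:

```python
import numpy as np

def fitted_q_iteration(dataset, num_states, num_actions,
                       gamma=0.99, K=10, S=100, lr=0.5):
    """Full fitted Q-iteration on a fixed dataset of (s, a, r, s') tuples."""
    phi = np.zeros((num_states, num_actions))            # Q_phi(s, a) as a table
    s, a, r, s2 = (np.array(x) for x in zip(*dataset))
    for _ in range(K):                                   # steps 2-3 repeated K times
        y = r + gamma * phi[s2].max(axis=1)              # 2. targets use the "max" trick
        for _ in range(S):                               # 3. S gradient steps on the regression loss
            grad = np.zeros_like(phi)
            np.add.at(grad, (s, a), phi[s, a] - y)       # d/dphi of 1/2 * sum_i (Q(s_i,a_i) - y_i)^2
            phi -= lr * grad / len(dataset)
    return phi
```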
19 / 22
Why is this algorithm off-policy?
Off-policy
- it does not matter whether the data came from the current action-maximizing policy;
=> any way of collecting actions is fine as long as it provides enough exploration.
- the policy that collects the actions and the policy being learned are decoupled in the objective function.
Since we just have a big bucket of (state, action) transitions,
we do not need to care which policy produced each individual data point.
Inside the target we take the maximum over the Q function,
so the update does not depend on which action the behavior policy actually chose.
20 / 22
Online Q-learning algorithm
algorithm
1. take some action a_i and observe (s_i, a_i, s'_i, r_i)
2. y_i = r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s'_i, a')
3. \phi \leftarrow \phi - \alpha \, \frac{dQ_\phi}{d\phi}(s_i, a_i) \, (Q_\phi(s_i, a_i) - y_i)

Unlike full fitted Q-iteration, this updates from a single transition at a time.
Because it is an off-policy algorithm, many different behavior policies can be used in step 1.
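A sketch of a single online update, with a tabular Q standing in for Q_phi so that dQ/dphi is just an indicator on (s_i, a_i) (names are illustrative):

```python
import numpy as np

def online_q_learning_step(Q, transition, gamma=0.99, alpha=0.1):
    """One online Q-learning update from a single observed transition.

    Q          : (S, A) table, standing in for Q_phi
    transition : (s, a, r, s_next) produced by whatever behavior policy
                 was used in step 1 (epsilon-greedy, random, ...)
    """
    s, a, r, s_next = transition
    y = r + gamma * Q[s_next].max()                  # 2. target with max over next actions
    Q[s, a] -= alpha * (Q[s, a] - y)                 # 3. phi <- phi - alpha * dQ/dphi * (Q - y)
    return Q
```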