Deep Reinforcement Learning
CS294-112, 2017 Fall
Lecture 6
蠏觜

螻る蟲 一蟆曙螻牛螻
Contents
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
Can we omit policy gradient completely?
Advantage: how much better an action is than the average action under the policy.

Forget about representing the policy explicitly and

just pick the action that maximizes the Advantage:

π(a_t | s_t) = 1   if a_t = argmax_{a_t} A^π(s_t, a_t)
π(a_t | s_t) = 0   otherwise

A^π(s, a) = Q^π(s, a) − V^π(s)
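As a minimal sketch (my own, not from the slides): reading this implicit greedy policy off tabular Q and V estimates; the arrays here are made-up placeholders.

```python
# Hedged sketch: extracting the implicit greedy policy from tabular estimates.
# Q and V below are made-up placeholders, not values from the lecture.
import numpy as np

n_states, n_actions = 4, 2
Q = np.random.rand(n_states, n_actions)   # stand-in for Q^pi(s, a)
V = Q.mean(axis=1)                        # stand-in for V^pi(s)

A = Q - V[:, None]                        # A^pi(s, a) = Q^pi(s, a) - V^pi(s)
greedy = A.argmax(axis=1)                 # argmax_a A^pi(s, a) (same as argmax_a Q)

pi = np.zeros((n_states, n_actions))      # pi(a|s) = 1 for the argmax action, 0 otherwise
pi[np.arange(n_states), greedy] = 1.0
```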
Policy iteration
Algorithm

1. Policy evaluation
   : compute the value function A^π(s, a) under the current policy
   (repeated N times; the more repetitions, the closer it gets to the true value)

2. Policy improvement
   : update the policy greedily with respect to that value function (one step)

π(a_t | s_t) = 1   if a_t = argmax_{a_t} A^π(s_t, a_t)
π(a_t | s_t) = 0   otherwise
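A concrete sketch of this loop on a small tabular MDP with known dynamics (my own toy example, anticipating the dynamic-programming setting of the next slides; the MDP itself is random and made up):

```python
# Sketch of tabular policy iteration on a made-up random MDP
# (16 states, 4 actions, known dynamics).
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 16, 4, 0.95
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # p(s'|s,a), shape (n_s, n_a, n_s)
R = rng.random((n_s, n_a))                         # r(s, a)

pi = np.zeros(n_s, dtype=int)                      # deterministic policy: state -> action
V = np.zeros(n_s)

for _ in range(30):                                # outer policy-iteration loop
    # 1. policy evaluation: N bootstrapped sweeps toward V^pi
    for _ in range(100):
        V = R[np.arange(n_s), pi] + gamma * P[np.arange(n_s), pi] @ V
    # 2. policy improvement: greedy w.r.t. A^pi (equivalently, w.r.t. Q^pi)
    Q = R + gamma * P @ V                          # Q^pi(s,a) = r(s,a) + gamma E[V^pi(s')]
    pi = Q.argmax(axis=1)

print("greedy policy:", pi)
```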
Dynamic programming
When s and a are both small and discrete:

- 16 states, 4 actions per state

- the value of every state fits in a table
  -> no approximation error

- transition matrix: a 16 x 16 x 4 tensor

Bootstrapped update:

Because the policy is deterministic (its probabilities are 1 or 0), the expectation over actions drops out and the update simplifies as below.

V^π(s) = E_{a~π(a|s)}[ r(s, a) + γ E_{s'~p(s'|s,a)}[ V^π(s') ] ]
V^π(s) ← r(s, π(s)) + γ E_{s'~p(s'|s,π(s))}[ V^π(s') ]
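A quick numerical check of this simplification (my own sketch, with a made-up random MDP): for a one-hot deterministic policy, the backup with the expectation over a ~ π(a|s) and the simplified backup that plugs in a = π(s) agree.

```python
# Sketch: for a deterministic (one-hot) policy, the backup with an expectation over
# actions equals the simplified backup that plugs in a = pi(s). MDP is made up.
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 16, 4, 0.95
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # p(s'|s,a)
R = rng.random((n_s, n_a))                          # r(s,a)
V = rng.random(n_s)                                 # arbitrary current estimate of V^pi

pi_det = rng.integers(n_a, size=n_s)                # deterministic policy s -> a
pi_probs = np.zeros((n_s, n_a))                     # the same policy as one-hot pi(a|s)
pi_probs[np.arange(n_s), pi_det] = 1.0

# General backup: V(s) = E_{a~pi}[ r(s,a) + gamma * E_{s'}[V(s')] ]
backup_expect = (pi_probs * (R + gamma * P @ V)).sum(axis=1)
# Simplified backup: V(s) = r(s, pi(s)) + gamma * E_{s'}[V(s')]
backup_det = R[np.arange(n_s), pi_det] + gamma * P[np.arange(n_s), pi_det] @ V

assert np.allclose(backup_expect, backup_det)
```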
Policy iteration with dynamic programming
Policy iteration

1. evaluate V^π(s)

2. set π ← π'

Policy evaluation

V^π(s) ← r(s, π(s)) + γ E_{s'~p(s'|s,π(s))}[ V^π(s') ]
Even simpler dynamic programming (1/2)
With the bootstrapped update, maximizing the Advantage

A^π(s, a) = r(s, a) + γ E[ V^π(s') ] − V^π(s)

comes down to the same thing as maximizing the Q-function value, since V^π(s) does not depend on the action.

In other words, value iteration skips the Advantage and works directly with the maximum of the Q function:

1. set  Q(s, a) ← r(s, a) + γ E[ V(s') ]

2. set  V(s) ← max_a Q(s, a)
Even simpler dynamic programming (2/2)
If we store the Q-function value of every state-action combination in a table,

the max over Q can be read off directly.

Skipping the step of explicitly recovering the policy and simply repeating the two steps below

is value iteration.

1. set  Q(s, a) ← r(s, a) + γ E[ V(s') ]

2. set  V(s) ← max_a Q(s, a)
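A minimal sketch of these two steps on a made-up 16-state, 4-action MDP (my own toy example, assuming the dynamics are known):

```python
# Sketch of tabular value iteration on a made-up random MDP:
# alternate Q(s,a) <- r + gamma E[V(s')] and V(s) <- max_a Q(s,a).
import numpy as np

rng = np.random.default_rng(2)
n_s, n_a, gamma = 16, 4, 0.95
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # p(s'|s,a)
R = rng.random((n_s, n_a))                          # r(s,a)

V = np.zeros(n_s)
for _ in range(500):
    Q = R + gamma * P @ V        # 1. set Q(s,a) <- r(s,a) + gamma E[V(s')]
    V = Q.max(axis=1)            # 2. set V(s)   <- max_a Q(s,a)

greedy_policy = Q.argmax(axis=1) # the policy is only recovered at the end, if needed
```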
Fitted value iteration (1/2)
For observations such as images, the grid-world (tabular) approach cannot be used:

curse of dimensionality.

If the observation has shape 200 x 200 x 3,

we would need |S| x |A| entries to store the Q function, with

|S| = (255^3)^(200 x 200)
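To get a feel for the size, a small arithmetic check (my own) of log10 |S| for this image observation:

```python
# Rough size of the state space for a 200x200x3 image with 255 values per channel:
# |S| = (255^3)^(200*200), far too large to tabulate, so we only look at its log10.
import math

log10_S = 200 * 200 * 3 * math.log10(255)   # log10(|S|)
print(f"|S| is about 10^{log10_S:.0f}")     # a number with almost 300,000 digits
```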
Fitted value iteration (2/2)
Approximate the value function with a neural net V_φ.

Regression loss:  L(φ) = (1/2) || V_φ(s) − max_a Q^π(s, a) ||²

Value iteration algorithm

1. set  y_i ← max_{a_i} ( r(s_i, a_i) + γ E[ V_φ(s'_i) ] )

2. set  φ ← argmin_φ (1/2) Σ_i || V_φ(s_i) − y_i ||²
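A hedged sketch of the fitted loop; for brevity a linear model over random state features stands in for the neural net V_φ, and the made-up dynamics are assumed known, which is exactly what the max over actions in the target needs.

```python
# Sketch of fitted value iteration with a linear function approximator V_phi(s) = f(s) . phi
# standing in for the neural net. The random MDP is made up and its dynamics are assumed known.
import numpy as np

rng = np.random.default_rng(3)
n_s, n_a, d, gamma = 50, 4, 8, 0.95
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # p(s'|s,a)
R = rng.random((n_s, n_a))                          # r(s,a)
F = rng.normal(size=(n_s, d))                       # feature vector f(s) for each state
phi = np.zeros(d)                                   # parameters of V_phi

for _ in range(200):
    V = F @ phi                                      # current V_phi(s) for all states
    # 1. y_i <- max_a ( r(s_i, a) + gamma * E[ V_phi(s'_i) ] )
    y = (R + gamma * P @ V).max(axis=1)
    # 2. phi <- argmin_phi 1/2 sum_i || V_phi(s_i) - y_i ||^2  (least squares fit)
    phi, *_ = np.linalg.lstsq(F, y, rcond=None)
```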
What if we don't know the transition dynamics?
Value iteration is a method that assumes the transition dynamics are known:

the target  max_{a_i} ( r(s_i, a_i) + γ E[ V_φ(s'_i) ] )  requires knowing the outcome of every action from the same state.

With policy iteration, we can instead evaluate the Q function, which only needs sampled transitions.

1. evaluate  Q^π(s, a)

2. set  π ← π'

Policy evaluation

Q^π(s, a) ← r(s, a) + γ E_{s'~p(s'|s,a)}[ Q^π(s', π(s')) ]
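A hedged sketch of this evaluation using only sampled transitions (the MDP data and the evaluated policy are made up): the backup is applied as a tabular TD-style update, so p(s'|s,a) is never needed explicitly.

```python
# Sketch: policy evaluation with a Q function using only sampled transitions (s, a, r, s').
# The backup Q(s,a) <- r(s,a) + gamma * Q(s', pi(s')) never uses the transition model.
import numpy as np

rng = np.random.default_rng(4)
n_s, n_a, gamma, alpha = 16, 4, 0.95, 0.1
pi = rng.integers(n_a, size=n_s)                   # deterministic policy to evaluate

# made-up batch of transitions collected by some behavior policy
N = 5000
s = rng.integers(n_s, size=N)
a = rng.integers(n_a, size=N)
r = rng.random(N)
s_next = rng.integers(n_s, size=N)

Q = np.zeros((n_s, n_a))
for i in range(N):
    target = r[i] + gamma * Q[s_next[i], pi[s_next[i]]]   # bootstrapped target
    Q[s[i], a[i]] += alpha * (target - Q[s[i], a[i]])     # move Q(s,a) toward the target
```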
Can we do the "max" trick again?
Just as in going from policy iteration to value iteration, we can take the action with the highest value,

=> and with Q-values this can be done without knowing the transition dynamics.

fitted Q iteration algorithm

1. set  y_i ← r(s_i, a_i) + γ E[ V_φ(s'_i) ]

2. set  φ ← argmin_φ (1/2) Σ_i || Q_φ(s_i, a_i) − y_i ||²

with the approximation  E[ V_φ(s'_i) ] ≈ max_{a'} Q_φ(s'_i, a')

Because this quantity is approximated with the Q function,

we can query the value of whichever action we like

and simply take the max over them.

+ works off-policy

+ only one network -> lower variance

- no convergence guarantees.
Fitted Q-iteration
full fitted Q-iteration algorithm

1. collect dataset {(s_i, a_i, s'_i, r_i)} using some policy

2. set  y_i ← r(s_i, a_i) + γ max_{a'_i} Q_φ(s'_i, a'_i)

3. set  φ ← argmin_φ (1/2) Σ_i || Q_φ(s_i, a_i) − y_i ||²

(steps 2-3 are repeated K times before new data is collected)

hyperparameters

- dataset size N

- collection policy

- iterations K

- gradient steps S
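A hedged sketch of the full loop with all four hyperparameters; a tabular Q_φ (one parameter per state-action pair) stands in for the network, and both the MDP and the random collection policy are made up.

```python
# Sketch of full fitted Q-iteration. A tabular Q_phi (one parameter per (s,a)) stands in
# for the network; the random MDP and the random collection policy are made up.
import numpy as np

rng = np.random.default_rng(5)
n_s, n_a, gamma = 16, 4, 0.95
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # p(s'|s,a)
R = rng.random((n_s, n_a))                          # r(s,a)

N, K, S, lr = 2000, 10, 50, 0.5      # dataset size, inner iterations, gradient steps, step size
phi = np.zeros((n_s, n_a))           # Q_phi(s,a) = phi[s,a]

for outer in range(20):
    # 1. collect dataset {(s_i, a_i, s'_i, r_i)} with some (here: random) collection policy
    s = rng.integers(n_s, size=N)
    a = rng.integers(n_a, size=N)
    r = R[s, a]
    s_next = np.array([rng.choice(n_s, p=P[s[i], a[i]]) for i in range(N)])

    for k in range(K):               # steps 2-3 repeated K times on the same data
        # 2. y_i <- r_i + gamma * max_a' Q_phi(s'_i, a')
        y = r + gamma * phi[s_next].max(axis=1)
        # 3. S gradient steps on 1/2 sum_i (Q_phi(s_i, a_i) - y_i)^2
        for _ in range(S):
            grad = np.zeros_like(phi)
            np.add.at(grad, (s, a), phi[s, a] - y)
            phi -= lr * grad / N
```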
Why is this algorithm off-policy?
Off-policy

- It does not matter whether the data was collected by the action-maximizing policy.
  => as long as the data contains enough action exploration, how it was generated does not matter.

- The policy that selects the actions and the policy being learned are decoupled (in the objective function).

Since we simply have a big bucket of (state, action) transition data,

we do not need to care which policy produced each individual sample.

Inside the target we take the maximum Q value over actions,

so the update is learned independently of the action the behavior policy actually took.
Online Q-learning algorithm
algorithm

1. take some action a_i and observe (s_i, a_i, s'_i, r_i)

2. y_i = r(s_i, a_i) + γ max_{a'} Q_φ(s'_i, a')

3. φ ← φ − α (dQ_φ/dφ)(s_i, a_i) · (Q_φ(s_i, a_i) − y_i)

Unlike full fitted Q-iteration, this uses a single sample at a time.

Because it is an off-policy algorithm, almost any policy can be used for step 1.
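A hedged sketch of the online update, one transition at a time; Q is tabular here, so dQ_φ/dφ(s_i, a_i) just selects a single entry, and the MDP and behavior policy are made up.

```python
# Sketch of the online Q-learning update, one transition per step.
# Tabular Q, so the gradient dQ_phi/dphi (s_i, a_i) selects a single entry.
# The random MDP and the random behavior policy are made up.
import numpy as np

rng = np.random.default_rng(6)
n_s, n_a, gamma, alpha = 16, 4, 0.95, 0.1
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # p(s'|s,a)
R = rng.random((n_s, n_a))                          # r(s,a)

Q = np.zeros((n_s, n_a))
s = rng.integers(n_s)
for step in range(50_000):
    a = rng.integers(n_a)                           # 1. take some action (any behavior policy)
    r, s_next = R[s, a], rng.choice(n_s, p=P[s, a])
    y = r + gamma * Q[s_next].max()                 # 2. y_i = r + gamma max_a' Q(s'_i, a')
    Q[s, a] -= alpha * (Q[s, a] - y)                # 3. phi <- phi - alpha dQ/dphi (Q - y)
    s = s_next
```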
Exploration with Q-learning
Exploration strategies for Q-learning

- Epsilon greedy
  with some probability, e.g. 10% (the exploration rate), act randomly;
  with the remaining 90% probability, take the action with the highest Q value.

- Boltzmann exploration: turn the Q values into a probability distribution and sample the action from it.
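Hedged sketches of the two exploration rules as action-selection functions over a single row of Q values (ε = 0.1 matches the 10% / 90% split above; the Q values are made up).

```python
# Sketch of the two exploration rules as action-selection functions.
# q is one row of Q values for the current state; eps = 0.1 matches the 10% / 90% split.
import numpy as np

rng = np.random.default_rng(7)

def epsilon_greedy(q, eps=0.1):
    # with probability eps take a random action, otherwise the argmax action
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def boltzmann(q, temperature=1.0):
    # turn Q values into a probability distribution and sample from it
    z = np.exp((q - q.max()) / temperature)         # subtract max for numerical stability
    return int(rng.choice(len(q), p=z / z.sum()))

q_row = np.array([1.0, 2.0, 0.5, 1.5])              # made-up Q values for one state
print(epsilon_greedy(q_row), boltzmann(q_row))
```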
Thank you.
