Deep Reinforcement Learning
CS294-112, 2017 Fall
Lecture 5
Outline
1. Improving the policy gradient with a critic
2. The policy evaluation problem
3. Discount factors
4. The actor-critic algorithm
Recap: policy gradient
REINFORCE algorithm

1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$

2. $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}^\pi_{i,t}$

3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
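To make these three steps concrete, here is a minimal sketch of one REINFORCE update for a discrete-action policy in PyTorch. The network sizes and the names `policy`, `optimizer`, and `reinforce_step` are illustrative, not from the lecture.

```python
# Minimal sketch of one REINFORCE update for a discrete-action policy.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                     # illustrative sizes (e.g. a CartPole-like task)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(states, actions, q_hats):
    """states: (M, obs_dim); actions: (M,) int; q_hats: (M,) reward-to-go estimates Q_hat."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    # grad J(theta) ~ (1/N) sum_i sum_t grad log pi(a|s) * Q_hat  ->  minimize the negative
    loss = -(log_probs * q_hats).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # theta <- theta + alpha * grad_theta J(theta)
```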
Improving the policy gradient (1/2)
Various trajectories are possible

1. The policy chooses actions randomly (it is stochastic).

2. The stochastic dynamics of the environment can lead to different outcomes.
Improving the policy gradient (2/2)
# of Samples

1. Single sample -> high variance
2. Infinite samples -> lower variance

The policy gradient has high variance; we can reduce it by replacing the single-sample reward-to-go with the true expected reward-to-go $Q(s_{i,t}, a_{i,t})$:
$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, Q(s_{i,t}, a_{i,t})$
What about the baseline?
Baseline

- Subtract a baseline so that actions that are better than average become more likely and actions that are worse become less likely.

- A natural baseline is the average Q function, i.e. the value function (the Q function averaged under the policy's action distribution).

- Subtracting it from Q gives the advantage: how much better an action is than the average.
$V(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}[\,Q(s_t, a_t)\,]$

$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \big(Q(s_{i,t}, a_{i,t}) - V(s_{i,t})\big)$

where the term in parentheses is the advantage $A(s_{i,t}, a_{i,t})$.
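As a small illustration that the value baseline is the policy-weighted average of Q, here is a hypothetical NumPy example for a discrete action space; the numbers are made up.

```python
# Illustrative example: V(s) as the policy-weighted average of Q(s, a),
# and the advantage A(s, a) = Q(s, a) - V(s).
import numpy as np

q = np.array([1.0, 3.0, 2.0])        # Q(s, a) for 3 discrete actions
pi = np.array([0.2, 0.5, 0.3])       # pi_theta(a | s)
v = np.dot(pi, q)                    # V(s) = E_{a~pi}[Q(s, a)] = 2.3
advantage = q - v                    # [-1.3, 0.7, -0.3]: how much better each action is than average
```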
State & state-action value functions (1/2)
Functions

- Q function: the total (discounted) reward expected after taking an action in the current state.

- Value function: the total reward expected from the current state; the average of the Q function over actions.

- Advantage: Q − V, how much better a particular action is than the average.

$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[\, r(s_{t'}, a_{t'}) \mid s_t, a_t \,]$

$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}[\, Q^\pi(s_t, a_t) \,]$

$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$
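A minimal sketch of the Monte Carlo (single-sample) estimate of $Q^\pi$, i.e. the reward-to-go along one sampled trajectory; `reward_to_go` is an illustrative name.

```python
# Sketch: Monte Carlo reward-to-go, a single-sample estimate of Q^pi(s_t, a_t).
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """rewards: (T,) rewards of one trajectory; returns (T,) sums from t to the end."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

reward_to_go(np.array([1.0, 0.0, 2.0]))   # -> array([3., 2., 2.]) with gamma = 1
```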
State & state-action value functions (2/2)
Objective function estimate: unbiased, high variance

$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b \right)$

where $b$ is a baseline.
- Single-episode (single-sample) estimate

- Unbiased: in expectation it equals the true policy gradient

- High variance

- Using a neural network (critic) introduces some bias but can reduce the variance

- Since the policy gradient already suffers from high variance, this bias-variance trade-off can be worthwhile depending on the task
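A short sketch of the corresponding loss with one simple choice of constant baseline, the batch-average return; this is an illustrative choice of $b$, not necessarily the lecture's.

```python
# Sketch: policy-gradient loss with a constant baseline b = average return over the batch.
import torch

def pg_loss_with_baseline(log_probs, rewards_to_go):
    """log_probs, rewards_to_go: (M,) tensors flattened over the sampled trajectories."""
    b = rewards_to_go.mean()                          # one simple choice of baseline
    return -(log_probs * (rewards_to_go - b)).mean()
```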
Value function fitting
Q, V, and A can all be expressed in terms of the value function, so we only need to fit V.

The Q function can be approximated by the current reward (which is already observed, hence deterministic) plus the value function at the next state:

$Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}[V^\pi(s_{t+1})] \approx r(s_t, a_t) + V^\pi(s_{t+1})$

$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$
Just fit the value function!
Policy evaluation
Fitting the value function of the current policy is called policy evaluation.

Monte Carlo policy evaluation

- Estimate the value by summing the rewards along a sampled trajectory.

- Averaging over multiple trajectories gives a better (lower-variance) estimate.
Monte Carlo evaluation with function approximation
Training data

Value function fit: Supervised regression
$\{(s_{i,t}, \; \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}))\}$

$L(\phi) = \frac{1}{2}\sum_i \lVert \hat{V}^\pi_\phi(s_i) - y_i \rVert^2$
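A sketch of this supervised regression in PyTorch, assuming the Monte Carlo targets $y_i$ have already been computed; `value_net`, `fit_value`, and the layer sizes are illustrative.

```python
# Sketch: supervised regression of V_hat_phi on Monte Carlo targets y_i.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # obs_dim = 4 (illustrative)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value(states, targets, epochs=50):
    """states: (M, obs_dim); targets: (M,) reward-to-go sums y_i."""
    for _ in range(epochs):
        v_pred = value_net(states).squeeze(-1)
        loss = 0.5 * ((v_pred - targets) ** 2).mean()   # L(phi) = 1/2 sum ||V_hat(s_i) - y_i||^2
        value_opt.zero_grad()
        loss.backward()
        value_opt.step()
```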
Can we do better? Bootstrap.
Directly use the previously fitted value function.

- Bootstrap estimate: introduces some bias but can lower the variance.

- The previously fitted value function is used to build the regression targets (y labels):

$\{(s_{i,t}, \; r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}))\}$

(target = current reward + the next-state value from the previously fitted value function)
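A sketch of building these bootstrapped targets, reusing the illustrative `value_net` from the regression sketch above; the targets are computed without gradients because they come from the previously fitted value function.

```python
# Sketch: bootstrapped targets y = r + V_hat_phi(s') from the previously fitted value net.
import torch

def bootstrap_targets(rewards, next_states):
    """rewards: (M,); next_states: (M, obs_dim); returns (M,) regression targets."""
    with torch.no_grad():                      # targets use the previous fit; no gradient flows
        v_next = value_net(next_states).squeeze(-1)
    return rewards + v_next
```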
Discount factor
Discount the reward according to the timestep.

(The slide illustrates this with a small example MDP of four states: states from which the reward can be obtained sooner are better, so it is preferable to receive reward now rather than later.)

Multiplying the next-state value by a discount factor gamma encodes this preference:
$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \, \hat{V}^\pi_\phi(s_{i,t+1}), \qquad \gamma \in [0,1]$
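A quick numeric illustration of what $\gamma$ does: a reward received $k$ steps in the future is scaled by $\gamma^k$.

```python
# With gamma < 1, a reward received k steps in the future is scaled by gamma**k.
gamma = 0.99
print(gamma ** 10)    # ~0.904  (soon: almost full value)
print(gamma ** 100)   # ~0.366  (far away: worth much less)
```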
An actor-critic algorithm
Batch actor-critic algorithm

1. Sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$

2. Fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums

3. Evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s_{i+1}) - \hat{V}^\pi_\phi(s_i)$

4. Optimize the objective using gradient ascent

Online actor-critic algorithm

- The same steps, but performed with a single sample (one transition) at a time.

Two functions are approximated:

- Actor: the policy function $\pi_\theta(a \mid s)$

- Critic: the value function $\hat{V}^\pi_\phi(s)$
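A sketch of one online actor-critic update on a single transition, reusing the illustrative `policy`, `value_net`, and optimizers from the earlier sketches; this is one reasonable ordering of the critic and actor updates, not necessarily the lecture's exact recipe.

```python
# Sketch: one online actor-critic update on a single transition (s, a, r, s_next).
import torch

def online_actor_critic_step(s, a, r, s_next, gamma=0.99):
    # 1) Critic update: regress V_hat(s) toward y = r + gamma * V_hat(s_next)
    with torch.no_grad():
        target = r + gamma * value_net(s_next).squeeze(-1)
    critic_loss = 0.5 * (value_net(s).squeeze(-1) - target) ** 2
    value_opt.zero_grad()
    critic_loss.backward()
    value_opt.step()

    # 2) Advantage estimate: A_hat = r + gamma * V_hat(s_next) - V_hat(s)
    with torch.no_grad():
        advantage = r + gamma * value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)

    # 3) Actor update: gradient ascent on log pi(a|s) * A_hat
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -(log_prob * advantage)
    optimizer.zero_grad()
    actor_loss.backward()
    optimizer.step()
```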
An actor-critic architecture design
Online actor-critic algorithm

(figures: a two-network design vs. a shared-network design)

- Critic head: outputs a scalar value

- Actor head: a GMM for continuous actions, a softmax for discrete actions

The shared-network design is more efficient, but tuning the neural net is harder.
Online actor-critic in practice
It works better to run several workers in parallel and update with all of them, since a single update step can then use more than one sample.
Critics as state-dependent baselines
Actor-critic

- lower variance (thanks to the critic)

- biased (if the critic is not perfect)

Policy gradient

- higher variance (single-sample estimate)

- unbiased

Can we get the best of both? Use the critic only as a state-dependent baseline: subtract $\hat{V}^\pi_\phi(s_t)$ from the Monte Carlo return, which keeps the estimator unbiased while still lowering its variance.
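A sketch of that state-dependent-baseline estimator: the Monte Carlo return is kept (unbiased) and the fitted critic is subtracted only as a baseline, reusing the illustrative `value_net` from earlier.

```python
# Sketch: unbiased Monte Carlo returns with the critic used only as a state-dependent baseline.
import torch

def pg_loss_state_dependent_baseline(log_probs, mc_returns, states):
    """log_probs, mc_returns: (M,); states: (M, obs_dim)."""
    with torch.no_grad():
        baseline = value_net(states).squeeze(-1)      # V_hat_phi(s_t), no bootstrapping
    advantages = mc_returns - baseline
    return -(log_probs * advantages).mean()
```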
Eligibility traces & n-step returns
Critic approach:

- lower variance, higher bias

Monte Carlo:

- higher variance, no bias

Can we combine?

$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$

$\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$

$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t) + \gamma^n \hat{V}^\pi_\phi(s_{t+n})$
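A sketch of the n-step advantage estimate, using the common convention of summing n reward terms before bootstrapping with $\gamma^n \hat{V}(s_{t+n})$, so the indexing differs slightly from the formula above; the function and argument names are illustrative.

```python
# Sketch: n-step advantage estimate along one trajectory, given fitted values V_hat(s_t).
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """rewards[k] = r(s_k, a_k); values[k] = V_hat(s_k), with len(values) == len(rewards) + 1."""
    T = len(rewards)
    end = min(t + n, T)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))   # up to n reward terms
    if t + n <= T:
        ret += gamma ** n * values[t + n]                             # bootstrap with gamma^n V_hat
    return ret - values[t]                                            # subtract the baseline V_hat(s_t)
```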
End
Thank you.
