This document summarizes an academic lecture on deep reinforcement learning. It discusses:
1. Improving the policy gradient method with a critic network to reduce variance. The critic fits a value function to estimate advantages.
2. Methods for policy evaluation including Monte Carlo evaluation and temporal difference bootstrapping to fit the value function for a fixed policy.
3. The actor-critic algorithm which approximates both the policy and value function with neural networks and optimizes them together online using sampled episodes.
2. Contents
1. Improving the policy gradient with a critic
2. The policy evaluation problem
3. Discount factors
4. The actor-critic algorithm
3. Recap: policy gradient
REINFORCE algorithm
1. Sample {τ^i} from π_θ(a_t | s_t).
2. ∇_θ J(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇_θ log π_θ(a_{i,t} | s_{i,t}) Q̂^π_{i,t}
3. θ ← θ + α ∇_θ J(θ)
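A minimal sketch of this update in Python (PyTorch), assuming a discrete action space; the policy_net, its sizes, and the flattened (state, action, Q-hat) batch layout are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

# Hypothetical policy network for a discrete action space (sizes are arbitrary).
obs_dim, n_actions = 4, 2
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)

def reinforce_update(states, actions, q_hats):
    """One REINFORCE step.

    states: (N*T, obs_dim), actions: (N*T,), q_hats: (N*T,) sampled estimates of Q^pi_{i,t}.
    """
    logits = policy_net(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient ascent on J(theta) == gradient descent on the negated surrogate objective.
    loss = -(log_probs * q_hats).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```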
4. Improving the policy gradient (1/2)
Various trajectories are possible:
1. The policy chooses actions randomly (it is stochastic).
2. The stochastic dynamics of the environment can lead to different next states.
5. Improving the policy gradient (2/2)
# of Samples
1. Single sample -> high variance
2. Infinite samples -> lower variance
We want to reduce the variance of the policy gradient.
With a single sampled trajectory, the estimate is:

∇_θ J(θ) ≈ ∑_{t=1}^{T} ∇_θ log π_θ(a_{i,t} | s_{i,t}) Q(s_{i,t}, a_{i,t})
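A tiny numerical illustration of the sample-count argument (a toy stand-in of my own, not from the slides): each "gradient estimate" is a noisy scalar with the same mean, and averaging many of them shrinks the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad_estimate(n_samples):
    # Stand-in for a per-trajectory gradient estimate: true mean 1.0, lots of noise.
    return rng.normal(loc=1.0, scale=5.0, size=n_samples).mean()

single = [noisy_grad_estimate(1) for _ in range(1000)]
averaged = [noisy_grad_estimate(100) for _ in range(1000)]
print("variance with a single sample :", np.var(single))
print("variance with 100-sample mean :", np.var(averaged))  # roughly 100x smaller
```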
6. What about the baseline?
Baseline
- Subtract a reference value so that better-than-average actions are made more likely and worse-than-average actions less likely.
- Use the average Q function, i.e., the value function (the Q function weighted-averaged over the policy's action distribution).
- Subtracting it yields the advantage, which measures how much better an action is than average.

V(s_t) = E_{a_t ~ π_θ(a_t|s_t)}[Q(s_t, a_t)]

∇_θ J(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇_θ log π_θ(a_{i,t} | s_{i,t}) (Q(s_{i,t}, a_{i,t}) - V(s_{i,t}))
The term in parentheses is the advantage A(s_{i,t}, a_{i,t}).
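A minimal sketch of this baseline for a discrete action space, assuming we already have per-action Q estimates at one state; the numbers are illustrative.

```python
import numpy as np

q_values = np.array([2.0, 5.0, 3.0])      # illustrative Q(s, a) for 3 actions
policy_probs = np.array([0.2, 0.5, 0.3])  # pi_theta(a | s) at the same state

v = np.dot(policy_probs, q_values)        # V(s): Q averaged under the policy
advantage = q_values - v                  # A(s, a): how much better than average
print(v)          # 3.8
print(advantage)  # [-1.8  1.2 -0.8]
```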
7. State & state-action value functions (1/2)
Functions
- Q function: the expected total (discounted) reward after taking a given action in a given state.
- Value function: the expected total reward from a given state; the average of the Q function over the policy's actions.
- Advantage: Q - V, how much better this action is than the policy's average.

Q^π(s_t, a_t) = ∑_{t'=t}^{T} E_{π_θ}[r(s_{t'}, a_{t'}) | s_t, a_t]
V^π(s_t) = E_{a_t ~ π_θ(a_t|s_t)}[Q^π(s_t, a_t)]
A^π(s_t, a_t) = Q^π(s_t, a_t) - V^π(s_t)
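To make the three definitions concrete, here is a sketch that computes Q^π, V^π, and A^π exactly by backward induction on a small made-up tabular MDP (two states, two actions, finite horizon, no discounting); the MDP, policy, and horizon are assumptions for illustration only.

```python
import numpy as np

# Made-up MDP: P[s, a, s'] transition probabilities, R[s, a] rewards, pi[s, a] policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
T = 5  # horizon

V = np.zeros(2)                   # V^pi at the step after the current one
for t in reversed(range(T)):
    Q = R + P @ V                 # Q^pi(s,a) = r(s,a) + E_{s'}[V^pi(s')]
    V = (pi * Q).sum(axis=1)      # V^pi(s)   = E_{a~pi}[Q^pi(s,a)]
A = Q - V[:, None]                # A^pi(s,a) = Q^pi(s,a) - V^pi(s)
print(Q, V, A, sep="\n")
```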
8. State & state-action value functions (2/2)
Objective function: unbiased, high variance

∇_θ J(θ) ≈ (1/N) ∑_{i=1}^{N} ∑_{t=1}^{T} ∇_θ log π_θ(a_{i,t} | s_{i,t}) ( ∑_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b ), where b is the baseline.

- Single-episode (single-sample) estimate of the reward to go.
- Unbiased: on average it gives the correct gradient of the policy objective.
- High variance.
- Fitting a neural network instead introduces bias but can reduce the variance.
- Because the policy gradient's variance is so high, this can be a good trade-off for many tasks.
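A short sketch of the reward-to-go term in this estimator, with a simple constant baseline b; the numbers and the choice of b (mean reward-to-go) are illustrative assumptions.

```python
import numpy as np

def reward_to_go(rewards):
    """sum_{t'=t}^{T} r_{t'} for every t, via a reversed cumulative sum."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])  # one toy episode
rtg = reward_to_go(rewards)               # [4. 3. 3. 1.]
b = rtg.mean()                            # a simple constant baseline
weights = rtg - b                         # multiplies grad log pi(a_t | s_t) at each t
print(rtg, weights)
```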
9. Value function fitting
Of Q, V, and A, fit only the value function.
Expanding the Q function: it is the deterministic current reward plus the expected next value function, and approximating that expectation with the single sampled next state gives:

Q^π(s_t, a_t) ≈ r(s_t, a_t) + E_{s_{t+1} ~ p(s_{t+1}|s_t, a_t)}[V^π(s_{t+1})]
             ≈ r(s_t, a_t) + V^π(s_{t+1})
A^π(s_t, a_t) ≈ r(s_t, a_t) + V^π(s_{t+1}) - V^π(s_t)

Just fit the value function!
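A minimal sketch of this fitted advantage estimate, assuming a hypothetical value_net that maps a state to a scalar value; only the value network is needed, as the slide says.

```python
import torch
import torch.nn as nn

obs_dim = 4  # illustrative state dimension
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def advantage_estimate(s_t, r_t, s_next):
    """A^pi(s_t, a_t) ~= r(s_t, a_t) + V^pi(s_{t+1}) - V^pi(s_t), using only the fitted V."""
    with torch.no_grad():
        return r_t + value_net(s_next).squeeze(-1) - value_net(s_t).squeeze(-1)
```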
10. Policy evaluation
Goal: fit the value function of the current policy (policy evaluation).
Monte Carlo policy evaluation
- Run a trajectory and sum all the rewards collected along it.
- Multiple trajectories can also be run and averaged (see the sketch below).
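A tiny sketch of Monte Carlo policy evaluation for a single starting state, assuming a hypothetical run_episode(policy, s0) helper that returns the list of rewards collected by the current policy; averaging several rollouts is the multiple-trajectories variant.

```python
def monte_carlo_value(run_episode, policy, s0, n_trajectories=10):
    """Estimate V^pi(s0) by running the policy, summing rewards, and averaging over rollouts."""
    returns = []
    for _ in range(n_trajectories):
        rewards = run_episode(policy, s0)   # hypothetical rollout helper
        returns.append(sum(rewards))
    return sum(returns) / len(returns)
```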
11. Monte Carlo evaluation with function approximation
Training data: {(s_{i,t}, ∑_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}))}
Value function fit as supervised regression:
L(φ) = (1/2) ∑_i || V̂^π_φ(s_i) - y_i ||^2
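A minimal sketch of this regression in PyTorch, assuming a value_net like the one above and Monte Carlo return targets y_i; the network sizes, optimizer, and toy data are illustrative.

```python
import torch
import torch.nn as nn

obs_dim = 4
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(states, targets, n_steps=100):
    """Supervised regression: minimize (1/2) * sum_i ||V_phi(s_i) - y_i||^2."""
    for _ in range(n_steps):
        v_pred = value_net(states).squeeze(-1)
        loss = 0.5 * ((v_pred - targets) ** 2).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Toy usage: 32 random states with their Monte Carlo returns as regression targets.
states, targets = torch.randn(32, obs_dim), torch.randn(32)
fit_value_function(states, targets)
```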
12. Can we do better? Bootstrap.
Directly use the previously fitted value function.
- Bootstrapped estimate: introduces bias but can reduce variance.
- The regression labels y are built from the previously fitted value function.

{(s_{i,t}, r(s_{i,t}, a_{i,t}) + V̂^π_φ(s_{i,t+1}))}

current reward + value of the next state under the previously fitted value function
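A short sketch of building these bootstrapped targets, assuming the same kind of fitted value_net as above; the targets then replace the Monte Carlo returns in the regression loss from the previous slide.

```python
import torch

def bootstrapped_targets(rewards, next_states, value_net):
    """y_{i,t} = r(s_{i,t}, a_{i,t}) + V_phi(s_{i,t+1}), with the previously fitted V_phi held fixed."""
    with torch.no_grad():
        return rewards + value_net(next_states).squeeze(-1)
```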