Deep Reinforcement Learning
CS294-112, 2017 Fall
Lecture 5
Outline
1. Improving the policy gradient with a critic
2. The policy evaluation problem
3. Discount factors
4. The actor-critic algorithm
Recap: policy gradient
REINFORCE algorithm

1. Sample $\{\tau^i\}$ from $\pi_\theta(a_t \mid s_t)$

2. $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}^\pi_{i,t}$

3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
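To make these three steps concrete, here is a minimal sketch of one REINFORCE update for a discrete-action policy in PyTorch. The network sizes and the names `policy`, `optimizer`, and `reinforce_step` are illustrative, not from the lecture.

```python
# Minimal sketch of one REINFORCE update for a discrete-action policy.
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2                     # illustrative sizes (e.g. a CartPole-like task)
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def reinforce_step(states, actions, q_hats):
    """states: (M, obs_dim); actions: (M,) int; q_hats: (M,) reward-to-go estimates Q_hat."""
    log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
    # grad J(theta) ~ (1/N) sum_i sum_t grad log pi(a|s) * Q_hat  ->  minimize the negative
    loss = -(log_probs * q_hats).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # theta <- theta + alpha * grad_theta J(theta)
```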
Improving the policy gradient (1/2)
Various trajectories are possible

1. The policy chooses actions randomly (it is stochastic).

2. The stochastic dynamics of the environment can lead to different outcomes.
Improving the policy gradient (2/2)
# of Samples

1. Single sample -> high variance
2. Infinite samples -> lower variance

The policy gradient has high variance; we can reduce it by replacing the single-sample reward-to-go with the true expected reward-to-go $Q(s_{i,t}, a_{i,t})$:
$\nabla_\theta J(\theta) \approx \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, Q(s_{i,t}, a_{i,t})$
What about the baseline?
Baseline

- Subtract a baseline so that actions that are better than average become more likely and actions that are worse become less likely.

- A natural baseline is the average Q function, i.e. the value function (the Q function averaged under the policy's action distribution).

- Subtracting it from Q gives the advantage: how much better an action is than the average.
$V(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}[\,Q(s_t, a_t)\,]$

$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \big(Q(s_{i,t}, a_{i,t}) - V(s_{i,t})\big)$

where the term in parentheses is the advantage $A(s_{i,t}, a_{i,t})$.
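As a small illustration that the value baseline is the policy-weighted average of Q, here is a hypothetical NumPy example for a discrete action space; the numbers are made up.

```python
# Illustrative example: V(s) as the policy-weighted average of Q(s, a),
# and the advantage A(s, a) = Q(s, a) - V(s).
import numpy as np

q = np.array([1.0, 3.0, 2.0])        # Q(s, a) for 3 discrete actions
pi = np.array([0.2, 0.5, 0.3])       # pi_theta(a | s)
v = np.dot(pi, q)                    # V(s) = E_{a~pi}[Q(s, a)] = 2.3
advantage = q - v                    # [-1.3, 0.7, -0.3]: how much better each action is than average
```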
State & state-action value functions (1/2)
Functions

- Q function: the total (discounted) reward expected after taking an action in the current state.

- Value function: the total reward expected from the current state; the average of the Q function over actions.

- Advantage: Q − V, how much better a particular action is than the average.

$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi_\theta}[\, r(s_{t'}, a_{t'}) \mid s_t, a_t \,]$

$V^\pi(s_t) = E_{a_t \sim \pi_\theta(a_t \mid s_t)}[\, Q^\pi(s_t, a_t) \,]$

$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$
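A minimal sketch of the Monte Carlo (single-sample) estimate of $Q^\pi$, i.e. the reward-to-go along one sampled trajectory; `reward_to_go` is an illustrative name.

```python
# Sketch: Monte Carlo reward-to-go, a single-sample estimate of Q^pi(s_t, a_t).
import numpy as np

def reward_to_go(rewards, gamma=1.0):
    """rewards: (T,) rewards of one trajectory; returns (T,) sums from t to the end."""
    q_hat = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q_hat[t] = running
    return q_hat

reward_to_go(np.array([1.0, 0.0, 2.0]))   # -> array([3., 2., 2.]) with gamma = 1
```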
State & state-action value functions (2/2)
Objective function estimate: unbiased, high variance

$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}) - b \right)$

where $b$ is a baseline.
- Single-episode (single-sample) estimate

- Unbiased: in expectation it equals the true policy gradient

- High variance

- Using a neural network (critic) introduces some bias but can reduce the variance

- Since the policy gradient already suffers from high variance, this bias-variance trade-off can be worthwhile depending on the task
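A short sketch of the corresponding loss with one simple choice of constant baseline, the batch-average return; this is an illustrative choice of $b$, not necessarily the lecture's.

```python
# Sketch: policy-gradient loss with a constant baseline b = average return over the batch.
import torch

def pg_loss_with_baseline(log_probs, rewards_to_go):
    """log_probs, rewards_to_go: (M,) tensors flattened over the sampled trajectories."""
    b = rewards_to_go.mean()                          # one simple choice of baseline
    return -(log_probs * (rewards_to_go - b)).mean()
```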
Value function fitting
Q, V, and A can all be expressed in terms of the value function, so we only need to fit V.

The Q function can be approximated by the current reward (which is already observed, hence deterministic) plus the value function at the next state:

$Q^\pi(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}[V^\pi(s_{t+1})] \approx r(s_t, a_t) + V^\pi(s_{t+1})$

$A^\pi(s_t, a_t) \approx r(s_t, a_t) + V^\pi(s_{t+1}) - V^\pi(s_t)$
Just fit the value function!
Policy evaluation
Fitting the value function of the current policy is called policy evaluation.

Monte Carlo policy evaluation

- Estimate the value by summing the rewards along a sampled trajectory.

- Averaging over multiple trajectories gives a better (lower-variance) estimate.
Monte Carlo evaluation with function approximation
Training data

Value function fit: Supervised regression
$\{(s_{i,t}, \; \sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'}))\}$

$L(\phi) = \frac{1}{2}\sum_i \lVert \hat{V}^\pi_\phi(s_i) - y_i \rVert^2$
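A sketch of this supervised regression in PyTorch, assuming the Monte Carlo targets $y_i$ have already been computed; `value_net`, `fit_value`, and the layer sizes are illustrative.

```python
# Sketch: supervised regression of V_hat_phi on Monte Carlo targets y_i.
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))   # obs_dim = 4 (illustrative)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value(states, targets, epochs=50):
    """states: (M, obs_dim); targets: (M,) reward-to-go sums y_i."""
    for _ in range(epochs):
        v_pred = value_net(states).squeeze(-1)
        loss = 0.5 * ((v_pred - targets) ** 2).mean()   # L(phi) = 1/2 sum ||V_hat(s_i) - y_i||^2
        value_opt.zero_grad()
        loss.backward()
        value_opt.step()
```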
Can we do better? Bootstrap.
Directly use the previously fitted value function.

- Bootstrap estimate: introduces some bias but can lower the variance.

- The previously fitted value function is used to build the regression targets (y labels):

$\{(s_{i,t}, \; r(s_{i,t}, a_{i,t}) + \hat{V}^\pi_\phi(s_{i,t+1}))\}$

(target = current reward + the next-state value from the previously fitted value function)
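A sketch of building these bootstrapped targets, reusing the illustrative `value_net` from the regression sketch above; the targets are computed without gradients because they come from the previously fitted value function.

```python
# Sketch: bootstrapped targets y = r + V_hat_phi(s') from the previously fitted value net.
import torch

def bootstrap_targets(rewards, next_states):
    """rewards: (M,); next_states: (M, obs_dim); returns (M,) regression targets."""
    with torch.no_grad():                      # targets use the previous fit; no gradient flows
        v_next = value_net(next_states).squeeze(-1)
    return rewards + v_next
```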
Discount factor
Discount the reward according to the timestep.

(The slide illustrates this with a small example MDP of four states: states from which the reward can be obtained sooner are better, so it is preferable to receive reward now rather than later.)

Multiplying the next-state value by a discount factor gamma encodes this preference:
$y_{i,t} \approx r(s_{i,t}, a_{i,t}) + \gamma \, \hat{V}^\pi_\phi(s_{i,t+1}), \qquad \gamma \in [0,1]$
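A quick numeric illustration of what $\gamma$ does: a reward received $k$ steps in the future is scaled by $\gamma^k$.

```python
# With gamma < 1, a reward received k steps in the future is scaled by gamma**k.
gamma = 0.99
print(gamma ** 10)    # ~0.904  (soon: almost full value)
print(gamma ** 100)   # ~0.366  (far away: worth much less)
```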
An actor-critic algorithm
Batch actor-critic algorithm

1. Sample $\{s_i, a_i\}$ from $\pi_\theta(a \mid s)$

2. Fit $\hat{V}^\pi_\phi(s)$ to the sampled reward sums

3. Evaluate $\hat{A}^\pi(s_i, a_i) = r(s_i, a_i) + \gamma \hat{V}^\pi_\phi(s_{i+1}) - \hat{V}^\pi_\phi(s_i)$

4. Optimize the objective using gradient ascent

Online actor-critic algorithm

- The same steps, but performed with a single sample (one transition) at a time.

Two functions are approximated:

- Actor: the policy function $\pi_\theta(a \mid s)$

- Critic: the value function $\hat{V}^\pi_\phi(s)$
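A sketch of one online actor-critic update on a single transition, reusing the illustrative `policy`, `value_net`, and optimizers from the earlier sketches; this is one reasonable ordering of the critic and actor updates, not necessarily the lecture's exact recipe.

```python
# Sketch: one online actor-critic update on a single transition (s, a, r, s_next).
import torch

def online_actor_critic_step(s, a, r, s_next, gamma=0.99):
    # 1) Critic update: regress V_hat(s) toward y = r + gamma * V_hat(s_next)
    with torch.no_grad():
        target = r + gamma * value_net(s_next).squeeze(-1)
    critic_loss = 0.5 * (value_net(s).squeeze(-1) - target) ** 2
    value_opt.zero_grad()
    critic_loss.backward()
    value_opt.step()

    # 2) Advantage estimate: A_hat = r + gamma * V_hat(s_next) - V_hat(s)
    with torch.no_grad():
        advantage = r + gamma * value_net(s_next).squeeze(-1) - value_net(s).squeeze(-1)

    # 3) Actor update: gradient ascent on log pi(a|s) * A_hat
    log_prob = torch.distributions.Categorical(logits=policy(s)).log_prob(a)
    actor_loss = -(log_prob * advantage)
    optimizer.zero_grad()
    actor_loss.backward()
    optimizer.step()
```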
An actor-critic architecture design
Online actor-critic algorithm

(figures: a two-network design vs. a shared-network design)

- Critic head: outputs a scalar value

- Actor head: a GMM for continuous actions, a softmax for discrete actions

The shared-network design is more efficient, but tuning the neural net is harder.
Online actor-critic in practice
It works better to run several workers in parallel and update with all of them, since a single update step can then use more than one sample.
Critics as state-dependent baselines
Actor-critic

- lower variance (thanks to the critic)

- biased (if the critic is not perfect)

Policy gradient

- higher variance (single-sample estimate)

- unbiased

Can we get the best of both? Use the critic only as a state-dependent baseline: subtract $\hat{V}^\pi_\phi(s_t)$ from the Monte Carlo return, which keeps the estimator unbiased while still lowering its variance.
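A sketch of that state-dependent-baseline estimator: the Monte Carlo return is kept (unbiased) and the fitted critic is subtracted only as a baseline, reusing the illustrative `value_net` from earlier.

```python
# Sketch: unbiased Monte Carlo returns with the critic used only as a state-dependent baseline.
import torch

def pg_loss_state_dependent_baseline(log_probs, mc_returns, states):
    """log_probs, mc_returns: (M,); states: (M, obs_dim)."""
    with torch.no_grad():
        baseline = value_net(states).squeeze(-1)      # V_hat_phi(s_t), no bootstrapping
    advantages = mc_returns - baseline
    return -(log_probs * advantages).mean()
```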
Eligibility traces & n-step returns
Critic approach:

- lower variance, higher bias

Monte Carlo:

- higher variance, no bias

Can we combine?

$\hat{A}^\pi_C(s_t, a_t) = r(s_t, a_t) + \gamma \hat{V}^\pi_\phi(s_{t+1}) - \hat{V}^\pi_\phi(s_t)$

$\hat{A}^\pi_{MC}(s_t, a_t) = \sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t)$

$\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n} \gamma^{t'-t} r(s_{t'}, a_{t'}) - \hat{V}^\pi_\phi(s_t) + \gamma^n \hat{V}^\pi_\phi(s_{t+n})$
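A sketch of the n-step advantage estimate, using the common convention of summing n reward terms before bootstrapping with $\gamma^n \hat{V}(s_{t+n})$, so the indexing differs slightly from the formula above; the function and argument names are illustrative.

```python
# Sketch: n-step advantage estimate along one trajectory, given fitted values V_hat(s_t).
import numpy as np

def n_step_advantage(rewards, values, t, n, gamma=0.99):
    """rewards[k] = r(s_k, a_k); values[k] = V_hat(s_k), with len(values) == len(rewards) + 1."""
    T = len(rewards)
    end = min(t + n, T)
    ret = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))   # up to n reward terms
    if t + n <= T:
        ret += gamma ** n * values[t + n]                             # bootstrap with gamma^n V_hat
    return ret - values[t]                                            # subtract the baseline V_hat(s_t)
```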
End
Thank you.
