TRPO
2017.02.03

Contents
Part 1 Problem of stochastic policy gradient
Part 2 Lower bound of performance
Part 3 Trust Region Policy Optimization
Part 4 Code review of TRPO
Overview
1. Stochastic policy gradient updates in parameter space
   → a small parameter step can change the policy drastically → collapse of performance
2. We want updates that are guaranteed to improve performance
   → measure the step in policy space instead of parameter space
3. Bound how much performance can change when the policy changes: lower bound
   → optimize the lower bound instead of the objective itself
4. Replace the penalty with a constraint → trust region method
   → KL-divergence constraint
1. Problem of stochastic policy gradient
TRPO
Reinforcement learning setup
• MDP: tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$
  • $P$ : state transition probability
  • $\rho_0$ : distribution of the initial state $s_0$
  • $r$ : reward function
  • $\gamma$ : discount factor
  • $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ : stochastic policy
• The agent interacts with the environment and improves its policy from sampled experience
• The goal is a policy that maximizes the expected cumulative discounted reward
Two families of RL algorithms
1. Value-based RL
   • Choose actions through a Q-function (e.g., $\epsilon$-greedy)
   • SARSA, Q-Learning, DQN, Dueling Network, PER, etc.
2. Policy-based RL
   • Parameterize an explicit policy and improve it directly
   • REINFORCE, Actor-Critic, A3C, TRPO, PPO, etc.
Policy Gradient
• What do we optimize? The objective function
• Objective function $J(\theta)$ : a function of the policy parameters $\theta$
• Optimization problem: solve $\max_\theta J(\theta)$
• Policy gradient: at every iteration, update the parameters along the gradient of the objective
  $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
https://www.linkedin.com/pulse/logistic-regression-gradient-descent-hands-on-marinho-de-oliveira
Policy Gradient
• The most natural objective is the value function itself
• The objective is a function of the policy; PG improves the policy so that the objective increases
• Trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$
  $J = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right] = \mathbb{E}_{s_0 \sim \rho_0}\left[V^{\pi}(s_0)\right]$
Policy Gradient
• Definitions of the Q-function, value function, and advantage function
  $Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
  $V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
  where $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$
  $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$
Policy Gradient
• Parameterized policy $\pi_\theta$, objective function $J(\theta)$
• Policy gradient equations
1. Policy gradient of REINFORCE
   $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
2. Policy gradient of Actor-Critic
   $\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}, a_t \sim \pi_\theta}\left[Q^{\pi}(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
Stochastic Policy Gradient
• Stochastic PG: replace the expectation with sample averages
• Estimate the policy gradient per episode or per timestep (see the numpy sketch below)
1. Policy gradient of REINFORCE
   $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} G_t^{(i)} \, \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
2. Policy gradient of Actor-Critic
   $\nabla_\theta J(\theta) \approx Q_w(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
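To make the sampled estimator concrete, here is a minimal numpy sketch of the REINFORCE estimate for a linear-softmax policy. The parameterization and the helper names (grad_log_softmax, reinforce_gradient) are illustrative assumptions, not from the slides.

import numpy as np

def grad_log_softmax(theta, s, a):
    # Gradient of log pi_theta(a|s) for pi(.|s) = softmax(theta @ s).
    # theta: (n_actions, n_features), s: feature vector, a: action index.
    logits = theta @ s
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -np.outer(p, s)   # -E_{b~pi}[features] part of the gradient
    grad[a] += s             # indicator part for the taken action
    return grad

def reinforce_gradient(theta, episodes, gamma=0.99):
    # Monte-Carlo estimate: average over episodes of sum_t G_t * grad log pi(a_t|s_t).
    total = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        G, returns = 0.0, []
        for r in reversed(rewards):      # discounted return-to-go G_t
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s, a, G_t in zip(states, actions, returns):
            total += G_t * grad_log_softmax(theta, s, a)
    return total / len(episodes)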
Problems of Policy Gradient
1. Sample efficiency is poor
   • The policy gradient is estimated from samples of the current policy
   • The estimated gradient is used for a single update
   • After the update, the data comes from an old policy and must be thrown away
2. Distance in parameter space ≠ distance in policy space
   • Policy gradient takes a step in parameter space
   • A small step in parameter space can be a large change in policy space
   • To keep the policy change small, the step size in parameter space must be chosen very carefully!
PG with Importance sampling
1. To reuse samples from an old policy, PG can be combined with importance sampling (demo below)
   $\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}, a_t \sim \pi_\theta}\left[Q(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
   $= \mathbb{E}_{s_t \sim \rho^{\pi_{old}}, a_t \sim \pi_{old}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)} \, Q(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
2. The importance weight is unbounded, so the variance of the estimate can blow up
   • Addressed by, e.g., ACER (Sample Efficient Actor-Critic with Experience Replay)
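A tiny numpy demonstration (my own toy example, not from the slides) of the off-policy correction and of why the weights are the problem: samples come only from pi_old, yet the weighted average recovers an expectation under pi_new, and the weights grow without bound as the two policies diverge.

import numpy as np

rng = np.random.default_rng(0)
pi_old = np.array([0.5, 0.3, 0.2])       # behavior policy over 3 actions (toy)
pi_new = np.array([0.1, 0.2, 0.7])       # target policy
f = np.array([1.0, 2.0, 3.0])            # any per-action quantity, e.g. Q(s, .)

a = rng.choice(3, size=100_000, p=pi_old)  # samples from the OLD policy only
w = pi_new[a] / pi_old[a]                  # importance weights

print((w * f[a]).mean())   # ~2.6 = E_{a~pi_new}[f(a)] = (pi_new * f).sum()
print(w.max())             # 3.5 here; unbounded in general -> high variance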
Step in policy space
• Can we measure the update step in policy space instead of parameter space?
  → constrain the KL-divergence of the two policies (old policy & new policy): a trust region!
• Can we guarantee monotonic improvement at every update?
  → derive a lower bound on performance and improve that lower bound
→ TRPO (Trust Region Policy Optimization)
2. Lower bound of performance
TRPO
Relative policy performance identity
• Objective of the old policy $\pi_{old}$ : $\eta(\pi_{old}) = \mathbb{E}_{s_0, a_0, \dots \sim \pi_{old}}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$
• Objective of a policy $\pi$ : $\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$
• Using the Bellman equation, the two objectives are related by
  $\eta(\pi) = \eta(\pi_{old}) + \mathbb{E}_{s_0, a_0, \dots \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$
  $\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
Kakade & Langford, "Approximately Optimal Approximate Reinforcement Learning", 2002
Proof of the relative policy performance identity (omitted)
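The slide omits the proof; for completeness, a compact telescoping argument (reconstructed from Kakade & Langford 2002 and the TRPO paper):

$\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\Big] = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t) + \gamma V_{\pi_{old}}(s_{t+1}) - V_{\pi_{old}}(s_t)\big)\Big]$
$= \mathbb{E}_{\tau \sim \pi}\Big[-V_{\pi_{old}}(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] = \eta(\pi) - \eta(\pi_{old}),$

using $A_{\pi_{old}}(s, a) = \mathbb{E}_{s'}\big[r(s) + \gamma V_{\pi_{old}}(s') - V_{\pi_{old}}(s)\big]$, the telescoping of the $\gamma^t V_{\pi_{old}}$ terms, and $\mathbb{E}_{s_0 \sim \rho_0}\big[V_{\pi_{old}}(s_0)\big] = \eta(\pi_{old})$.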
Policy Iteration & objective function
$\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
• Policy improvement → greedy policy improvement
  • Improve to a policy with positive advantage in every state:
    $\pi'(s) = \arg\max_a A_{\pi_{old}}(s, a)$
  • With approximation error, some states may have expected advantage $< 0$, and $\eta(\pi)$ can decrease
• $\eta(\pi)$ → local approximation
  $L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
Local Approximation
$\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
$L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
How much error does this approximation introduce?
→ Lower bound
The approximation assumes the state visitation distribution does not change when the policy changes
Local Approximation with parameterized policy
$L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}$
From Kakade & Langford (2002):
for a sufficiently small step around $\theta_0$,
improving the local approximation also improves the objective function
Conservative Policy Iteration
• Improving $L$ alone does not tell us how much the objective function improves
• Conservative Policy Iteration: gives an explicit lower bound
1. Policy improvement (mixture policy)
   • Current policy: $\pi_{old}$, new policy: $\pi_{new}$
   (1) $\pi' = \arg\max_{\pi} L_{\pi_{old}}(\pi)$
   (2) $\pi_{new}(a \mid s) = (1 - \alpha)\,\pi_{old}(a \mid s) + \alpha\,\pi'(a \mid s)$
Conservative Policy Iteration
2. Lower bound of $\eta$
   $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$,
   where $\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a \mid s)}\left[A_{\pi_{old}}(s, a)\right] \right|$
• This lower bound holds only for mixture policies
• It does not cover general parameterized policies, and mixture policies are rarely used in practice
Extension of Conservative Policy Iteration
1. To replace the mixture coefficient with a distance between the two policies, use the KL-divergence
   $\alpha^2 \to D_{KL}^{\max}(\pi_{old}, \pi_{new})$
2. Generalized lower bound
   $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - C \, D_{KL}^{\max}(\pi_{old}, \pi_{new})$,
   where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, $\epsilon = \max_{s,a} \left| A_\pi(s, a) \right|$
Lower bound of objective function
$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - C \, D_{KL}^{\max}(\pi_{old}, \pi_{new})$
Maximizing the right-hand side guarantees that the objective function does not decrease (argument below)
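Why maximizing the right-hand side is enough (a short majorize-maximize argument from the TRPO paper, spelled out here): let $M_{\pi_{old}}(\pi) = L_{\pi_{old}}(\pi) - C\,D_{KL}^{\max}(\pi_{old}, \pi)$. Then $\eta(\pi) \geq M_{\pi_{old}}(\pi)$ for every $\pi$, and $M_{\pi_{old}}(\pi_{old}) = \eta(\pi_{old})$ because $L$ matches $\eta$ at $\pi_{old}$ and the KL term vanishes there. Hence for $\pi_{new} = \arg\max_\pi M_{\pi_{old}}(\pi)$:

$\eta(\pi_{new}) \;\geq\; M_{\pi_{old}}(\pi_{new}) \;\geq\; M_{\pi_{old}}(\pi_{old}) \;=\; \eta(\pi_{old}).$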
3. Trust Region Policy Optimization
TRPO
Extension of Conservative Policy Iteration
(diagram: alternating policy evaluation and policy improvement)
KL-constraint optimization
1. Lower bound for a parameterized policy
   $\eta(\theta) \geq L_{\theta_{old}}(\theta) - C \, D_{KL}^{\max}(\theta_{old}, \theta)$
2. Optimizing the lower bound: with the theoretical $C$, the steps become very small
   $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) - C \, D_{KL}^{\max}(\theta_{old}, \theta)$
3. Large and robust steps: KL-penalty → KL-constraint (see the numeric example below)
   $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
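A quick arithmetic illustration (mine, not on the slide) of why the theoretical $C$ forces tiny steps: with $\gamma = 0.99$,

$C = \frac{4\epsilon\gamma}{(1-\gamma)^2} = \frac{4 \times 0.99}{0.01^2}\,\epsilon = 39600\,\epsilon,$

so even a modest KL-divergence is penalized by tens of thousands of times the maximum advantage, and the penalized objective only accepts near-zero policy changes.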
KL-constraint optimization
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
Trust region:
a small step in policy space
Approximation of KL-divergence
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
• $D_{KL}^{\max}(\theta_{old}, \theta)$ takes the max over every state, which is not practical
• Approximation: use the average KL-divergence instead
  $D_{KL}^{\max}(\theta_{old}, \theta) \approx \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right]$
• The resulting sub-problem:
  $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
Surrogate advantage function
$L_{\theta_{old}}(\theta) = \eta(\theta_{old}) + \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \;\equiv\; \text{maximize}_\theta \; \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$
→ estimate $\sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$ from samples → approximation
Surrogate advantage function
$\sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_\theta}\left[ A_{\theta_{old}}(s, a) \right]$
$= \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$   (importance sampling)
$\text{maximize}_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right] \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
The expectation above is the surrogate advantage function (a sample-based sketch follows).
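A minimal sample-based version of the surrogate (a sketch; the function name and the use of log-probabilities are my choices, though the baselines implementation reviewed in Part 4 builds the same ratio from log-probabilities):

import numpy as np

def surrogate_loss(logp_new, logp_old, adv):
    # Estimate of E_{s~rho_old, a~pi_old}[ pi_theta(a|s)/pi_old(a|s) * A_old(s,a) ].
    # logp_new / logp_old: log pi(a_t|s_t) of the taken actions under each policy.
    ratio = np.exp(logp_new - logp_old)   # importance weights
    return (ratio * adv).mean()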
Natural Policy Gradient
• How to solve this problem?
  $L_{\theta_{old}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$
  $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
1. 1st-order approximation of the surrogate advantage function
   $L_{\theta_{old}}(\theta) \approx L_{\theta_{old}}(\theta_{old}) + g^T (\theta - \theta_{old})$, where $g = \nabla_\theta L_{\theta_{old}}(\theta)\big|_{\theta = \theta_{old}}$
2. 2nd-order approximation of the KL-divergence
   $\bar{D}_{KL}(\theta_{old}, \theta) \approx \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old})$, where $H = \nabla_\theta^2 \bar{D}_{KL}(\theta_{old}, \theta)\big|_{\theta = \theta_{old}}$
   $H$ is the Fisher Information Matrix
Natural Policy Gradient
• The constrained problem becomes
  $\text{maximize}_\theta \; g^T (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \leq \delta$
• Lagrange multiplier method
  • Set the gradient of $g^T (\theta - \theta_{old}) - \frac{\lambda}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old})$ to zero
  • With $\beta = \frac{1}{\lambda}$ and the search direction $s = \theta - \theta_{old}$:
    $g - \lambda H s = 0 \;\Rightarrow\; s = \beta H^{-1} g \;\propto\; H^{-1} g$
Natural Policy Gradient
• The linearized objective pushes the solution to the boundary, so the KL constraint is active
• The Lagrange multiplier only fixes the scale of the direction at the boundary
  $\theta - \theta_{old} = \beta s$, with $s = H^{-1} g$
  $\frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) = \delta \;\Rightarrow\; \frac{1}{2} \beta^2 \, s^T H s = \delta$
  $\Rightarrow\; \beta = \sqrt{\frac{2\delta}{s^T H s}} = \sqrt{\frac{2\delta}{g^T H^{-1} g}}$
Natural Policy Gradient
• Problem
  $\text{maximize}_\theta \; g^T (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \leq \delta$
• Solution (a direct numpy rendering follows)
  $\theta_{new} = \theta_{old} + \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
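A sketch of this update for small problems only, where $H$ can be formed explicitly (names are illustrative):

import numpy as np

def npg_step(theta_old, g, H, delta=0.01):
    # theta_new = theta_old + sqrt(2*delta / g^T H^{-1} g) * H^{-1} g
    Hinv_g = np.linalg.solve(H, g)              # H^{-1} g without forming H^{-1}
    beta = np.sqrt(2.0 * delta / (g @ Hinv_g))  # scale that lands on the KL boundary
    return theta_old + beta * Hinv_g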
Truncated Natural Policy Gradient
• In NPG, a neural network policy has so many parameters that computing $H^{-1}$ is intractable
  • With $N$ parameters, storing $H$ costs $O(N^2)$ and inverting it costs $O(N^3)$
• Use the conjugate gradient method to compute $H^{-1} g$ without ever forming $H^{-1}$
  → Truncated Natural Policy Gradient
• CG approximately solves linear systems of the form $Hx = g$ (see the sketch below)
  • It converges iteratively instead of solving analytically
  • Each iteration only needs a Hessian-vector product $Hv$, never the full matrix $H$
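A self-contained sketch of that conjugate gradient routine (my implementation; the baselines code ships a similar helper):

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # Solve H x = g for x = H^{-1} g using only Hessian-vector products.
    # hvp: callable v -> H @ v (H is never formed explicitly).
    # A fixed, small number of iterations -> "truncated" NPG.
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x = 0 initially)
    p = r.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Usage sketch, forming H explicitly only to verify:
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -1.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(x, np.linalg.solve(H, g)))  # True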
Truncated Natural Policy Gradient
• Drawbacks of the Truncated Natural Policy Gradient
1. Might not be robust to the trust region size; at some iterations the step may be too large and
   performance can degrade
2. Because of the quadratic approximation, the KL-divergence constraint may be violated
Trust Region method
(diagram: approximation → sub-problem → trust region)
Trust Region Policy Optimization
• Build a sub-problem from the approximations, then solve the sub-problem in two steps
1. Find the search direction
2. Do a line search along that direction inside the trust region
• Trust Region Policy Optimization (sketch below)
1. Search direction: $s = \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
2. Backtracking line search: $\theta_{new} = \theta_{old} + \alpha^j s$
   (accept the first $j$ such that $L_{\theta_{old}}(\theta_{new}) > 0$ and $\bar{D}_{KL}(\theta_{old}, \theta_{new}) \leq \delta$, then stop)
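A sketch of the backtracking step (illustrative names; surrogate and kl stand for sample estimates of $L_{\theta_{old}}$, normalized to 0 at $\theta_{old}$, and of the average KL):

import numpy as np

def backtracking_line_search(theta_old, fullstep, surrogate, kl, delta,
                             alpha=0.5, max_backtracks=10):
    # Shrink the step by alpha^j until both acceptance tests pass.
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * fullstep
        if surrogate(theta_new) > 0.0 and kl(theta_new) <= delta:
            return theta_new   # first step that improves L inside the trust region
    return theta_old           # every candidate failed: keep the old policy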
4. Code review of TRPO
TRPO
OpenAI baselines
https://github.com/openai/baselines
TRPO algorithm outline
1. Collect samples with the current policy
2. Compute GAE from the collected samples
3. Compute the surrogate advantage function
4. Compute the gradient g of the surrogate and the Hessian H of the KL-divergence
5. Compute the search direction from g and H via CG
6. Do a backtracking line search along the search direction
TRPO baseline code structure
1. run_atari.py : main loop that runs learning on Atari environments
2. nosharing_cnn_policy.py : actor-critic network (actor and critic do not share parameters)
3. trpo_mpi.py : the learning loop that trains the cnn_policy
Collect samples with the current policy
def traj_segment_generator(pi, env, horizon, stochastic)
Compute GAE from the collected samples (recursion sketched below)
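The GAE recursion itself is short; a numpy sketch for a single non-terminating segment, where values has length T + 1 so the last entry bootstraps the state after the segment (baselines additionally handles episode boundaries with a "new" mask, omitted here):

import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    # A_t = sum_l (gamma*lam)^l delta_{t+l}, computed backwards in one pass
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv          # value-function targets are adv + values[:-1]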
Compute the surrogate advantage function
$L_{\theta_{old}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$
Compute the gradient of the surrogate and the Hessian of the KL-divergence
1. Gradient of the surrogate
   • A policy gradient computed from the states, actions, and GAE values
2. Hessian (second derivative) of the KL-divergence: computed as the FIM (sketch below)
   • KL-div: $\bar{D}_{KL} = \mathbb{E}_s\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right]$
   • Its first derivative vanishes at $\theta = \theta_{old}$, so the quadratic term is the leading one
   • FIM: $H = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^T \right]$
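For a small discrete policy the FIM can be formed exactly as the expected outer product of score vectors; a numpy sketch for intuition only (at neural-network scale the baselines code never forms H, computing Hessian-vector products of the average KL by double backpropagation instead):

import numpy as np

def fisher_matrix(theta, states):
    # F = E_{s, a~pi}[ grad log pi(a|s) grad log pi(a|s)^T ], linear-softmax policy
    n_actions, _ = theta.shape
    F = np.zeros((theta.size, theta.size))
    for s in states:
        logits = theta @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()
        for a in range(n_actions):
            grad = -np.outer(p, s)    # d/dtheta log softmax: minus-mean part
            grad[a] += s              # plus the taken action's features
            gvec = grad.ravel()
            F += p[a] * np.outer(gvec, gvec)   # expectation over a ~ pi(.|s)
    return F / len(states)                     # average over sampled states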
Compute the search direction from g and H (via CG)
1. Compute $H^{-1} g$ with the conjugate gradient method
2. Search direction: $s = \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
Backtracking line search along the search direction
$\theta_{new} = \theta_{old} + \alpha^j s$
(accept the first $j$ such that $L_{\theta_{old}}(\theta_{new}) > 0$ and $\bar{D}_{KL}(\theta_{old}, \theta_{new}) \leq \delta$, then stop)
Thank you
TRPO
