TRPO
2017.02.03

Contents
Part 1 Problem of stochastic policy gradient
Part 2 Lower bound of performance
Part 3 Trust Region Policy Optimization
Part 4 Code review of TRPO
Overview
1. Stochastic policy gradient updates in parameter space
   → a small parameter step can change the policy drastically → collapse of performance
2. We want updates that are guaranteed to improve performance
   → measure the step in policy space instead of parameter space
3. Bound how much performance can change when the policy changes: lower bound
   → optimize the lower bound instead of the objective itself
4. Replace the penalty with a constraint → trust region method
   → KL-divergence constraint
1. Problem of stochastic policy gradient
TRPO
Reinforcement learning setup
• MDP: tuple $(\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma)$
  • $P$ : state transition probability
  • $\rho_0$ : distribution of the initial state $s_0$
  • $r$ : reward function
  • $\gamma$ : discount factor
  • $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$ : stochastic policy
• The agent interacts with the environment and improves its policy from sampled experience
• The goal is a policy that maximizes the expected cumulative discounted reward
Two families of RL algorithms
1. Value-based RL
   • Choose actions through a Q-function (e.g., $\epsilon$-greedy)
   • SARSA, Q-Learning, DQN, Dueling Network, PER, etc.
2. Policy-based RL
   • Parameterize an explicit policy and improve it directly
   • REINFORCE, Actor-Critic, A3C, TRPO, PPO, etc.
Policy Gradient
• What do we optimize? The objective function
• Objective function $J(\theta)$ : a function of the policy parameters $\theta$
• Optimization problem: solve $\max_\theta J(\theta)$
• Policy gradient: at every iteration, update the parameters along the gradient of the objective
  $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
https://www.linkedin.com/pulse/logistic-regression-gradient-descent-hands-on-marinho-de-oliveira
Policy Gradient
• The most natural objective is the value function itself
• The objective is a function of the policy; PG improves the policy so that the objective increases
• Trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$
  $J = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right] = \mathbb{E}_{s_0 \sim \rho_0}\left[V^{\pi}(s_0)\right]$
Policy Gradient
• Definitions of the Q-function, value function, and advantage function
  $Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
  $V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, a_{t+1}, \dots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$
  where $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$
  $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$
Policy Gradient
• Parameterized policy $\pi_\theta$, objective function $J(\theta)$
• Policy gradient equations
1. Policy gradient of REINFORCE
   $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
2. Policy gradient of Actor-Critic
   $\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}, a_t \sim \pi_\theta}\left[Q^{\pi}(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
Stochastic Policy Gradient
• Stochastic PG: replace the expectation with sample averages
• Estimate the policy gradient per episode or per timestep (see the numpy sketch below)
1. Policy gradient of REINFORCE
   $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} G_t^{(i)} \, \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$
2. Policy gradient of Actor-Critic
   $\nabla_\theta J(\theta) \approx Q_w(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
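To make the sampled estimator concrete, here is a minimal numpy sketch of the REINFORCE estimate for a linear-softmax policy. The parameterization and the helper names (grad_log_softmax, reinforce_gradient) are illustrative assumptions, not from the slides.

import numpy as np

def grad_log_softmax(theta, s, a):
    # Gradient of log pi_theta(a|s) for pi(.|s) = softmax(theta @ s).
    # theta: (n_actions, n_features), s: feature vector, a: action index.
    logits = theta @ s
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = -np.outer(p, s)   # -E_{b~pi}[features] part of the gradient
    grad[a] += s             # indicator part for the taken action
    return grad

def reinforce_gradient(theta, episodes, gamma=0.99):
    # Monte-Carlo estimate: average over episodes of sum_t G_t * grad log pi(a_t|s_t).
    total = np.zeros_like(theta)
    for states, actions, rewards in episodes:
        G, returns = 0.0, []
        for r in reversed(rewards):      # discounted return-to-go G_t
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for s, a, G_t in zip(states, actions, returns):
            total += G_t * grad_log_softmax(theta, s, a)
    return total / len(episodes)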
Problems of Policy Gradient
1. Sample efficiency is poor
   • The policy gradient is estimated from samples of the current policy
   • The estimated gradient is used for a single update
   • After the update, the data comes from an old policy and must be thrown away
2. Distance in parameter space ≠ distance in policy space
   • Policy gradient takes a step in parameter space
   • A small step in parameter space can be a large change in policy space
   • To keep the policy change small, the step size in parameter space must be chosen very carefully!
PG with Importance sampling
1. To reuse samples from an old policy, PG can be combined with importance sampling (demo below)
   $\nabla_\theta J(\theta) = \mathbb{E}_{s_t \sim \rho^{\pi}, a_t \sim \pi_\theta}\left[Q(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
   $= \mathbb{E}_{s_t \sim \rho^{\pi_{old}}, a_t \sim \pi_{old}}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)} \, Q(s_t, a_t) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$
2. The importance weight is unbounded, so the variance of the estimate can blow up
   • Addressed by, e.g., ACER (Sample Efficient Actor-Critic with Experience Replay)
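A tiny numpy demonstration (my own toy example, not from the slides) of the off-policy correction and of why the weights are the problem: samples come only from pi_old, yet the weighted average recovers an expectation under pi_new, and the weights grow without bound as the two policies diverge.

import numpy as np

rng = np.random.default_rng(0)
pi_old = np.array([0.5, 0.3, 0.2])       # behavior policy over 3 actions (toy)
pi_new = np.array([0.1, 0.2, 0.7])       # target policy
f = np.array([1.0, 2.0, 3.0])            # any per-action quantity, e.g. Q(s, .)

a = rng.choice(3, size=100_000, p=pi_old)  # samples from the OLD policy only
w = pi_new[a] / pi_old[a]                  # importance weights

print((w * f[a]).mean())   # ~2.6 = E_{a~pi_new}[f(a)] = (pi_new * f).sum()
print(w.max())             # 3.5 here; unbounded in general -> high variance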
Step in policy space
• Can we measure the update step in policy space instead of parameter space?
  → constrain the KL-divergence of the two policies (old policy & new policy): a trust region!
• Can we guarantee monotonic improvement at every update?
  → derive a lower bound on performance and improve that lower bound
→ TRPO (Trust Region Policy Optimization)
2. Lower bound of performance
TRPO
Relative policy performance identity
• Objective of the old policy $\pi_{old}$ : $\eta(\pi_{old}) = \mathbb{E}_{s_0, a_0, \dots \sim \pi_{old}}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$
• Objective of a policy $\pi$ : $\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$
• Using the Bellman equation, the two objectives are related by
  $\eta(\pi) = \eta(\pi_{old}) + \mathbb{E}_{s_0, a_0, \dots \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\right]$
  $\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
Kakade & Langford, "Approximately Optimal Approximate Reinforcement Learning", 2002
Proof of the relative policy performance identity (omitted)
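The slide omits the proof; for completeness, a compact telescoping argument (reconstructed from Kakade & Langford 2002 and the TRPO paper):

$\mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_{\pi_{old}}(s_t, a_t)\Big] = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t) + \gamma V_{\pi_{old}}(s_{t+1}) - V_{\pi_{old}}(s_t)\big)\Big]$
$= \mathbb{E}_{\tau \sim \pi}\Big[-V_{\pi_{old}}(s_0) + \sum_{t=0}^{\infty} \gamma^t r(s_t)\Big] = \eta(\pi) - \eta(\pi_{old}),$

using $A_{\pi_{old}}(s, a) = \mathbb{E}_{s'}\big[r(s) + \gamma V_{\pi_{old}}(s') - V_{\pi_{old}}(s)\big]$, the telescoping of the $\gamma^t V_{\pi_{old}}$ terms, and $\mathbb{E}_{s_0 \sim \rho_0}\big[V_{\pi_{old}}(s_0)\big] = \eta(\pi_{old})$.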
Policy Iteration & objective function
$\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
• Policy improvement → greedy policy improvement
  • Improve to a policy with positive advantage in every state:
    $\pi'(s) = \arg\max_a A_{\pi_{old}}(s, a)$
  • With approximation error, some states may have expected advantage $< 0$, and $\eta(\pi)$ can decrease
• $\eta(\pi)$ → local approximation
  $L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
Local Approximation
$\eta(\pi) = \eta(\pi_{old}) + \sum_s \rho_\pi(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
$L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
How much error does this approximation introduce?
→ Lower bound
The approximation assumes the state visitation distribution does not change when the policy changes
Local Approximation with parameterized policy
$L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_s \rho_{\pi_{old}}(s) \sum_a \pi(a \mid s) \, A_{\pi_{old}}(s, a)$
$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}$
From Kakade & Langford (2002):
for a sufficiently small step around $\theta_0$,
improving the local approximation also improves the objective function
Conservative Policy Iteration
• Improving $L$ alone does not tell us how much the objective function improves
• Conservative Policy Iteration: gives an explicit lower bound
1. Policy improvement (mixture policy)
   • Current policy: $\pi_{old}$, new policy: $\pi_{new}$
   (1) $\pi' = \arg\max_{\pi} L_{\pi_{old}}(\pi)$
   (2) $\pi_{new}(a \mid s) = (1 - \alpha)\,\pi_{old}(a \mid s) + \alpha\,\pi'(a \mid s)$
Conservative Policy Iteration
2. Lower bound of $\eta$
   $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$,
   where $\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a \mid s)}\left[A_{\pi_{old}}(s, a)\right] \right|$
• This lower bound holds only for mixture policies
• It does not cover general parameterized policies, and mixture policies are rarely used in practice
Extension of Conservative Policy Iteration
1. To replace the mixture coefficient with a distance between the two policies, use the KL-divergence
   $\alpha^2 \to D_{KL}^{\max}(\pi_{old}, \pi_{new})$
2. Generalized lower bound
   $\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - C \, D_{KL}^{\max}(\pi_{old}, \pi_{new})$,
   where $C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$, $\epsilon = \max_{s,a} \left| A_\pi(s, a) \right|$
Lower bound of objective function
$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - C \, D_{KL}^{\max}(\pi_{old}, \pi_{new})$
Maximizing the right-hand side guarantees that the objective function does not decrease (argument below)
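Why maximizing the right-hand side is enough (a short majorize-maximize argument from the TRPO paper, spelled out here): let $M_{\pi_{old}}(\pi) = L_{\pi_{old}}(\pi) - C\,D_{KL}^{\max}(\pi_{old}, \pi)$. Then $\eta(\pi) \geq M_{\pi_{old}}(\pi)$ for every $\pi$, and $M_{\pi_{old}}(\pi_{old}) = \eta(\pi_{old})$ because $L$ matches $\eta$ at $\pi_{old}$ and the KL term vanishes there. Hence for $\pi_{new} = \arg\max_\pi M_{\pi_{old}}(\pi)$:

$\eta(\pi_{new}) \;\geq\; M_{\pi_{old}}(\pi_{new}) \;\geq\; M_{\pi_{old}}(\pi_{old}) \;=\; \eta(\pi_{old}).$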
3. Trust Region Policy Optimization
TRPO
Extension of Conservative Policy Iteration
(diagram: alternating policy evaluation and policy improvement)
KL-constraint optimization
1. Lower bound for a parameterized policy
   $\eta(\theta) \geq L_{\theta_{old}}(\theta) - C \, D_{KL}^{\max}(\theta_{old}, \theta)$
2. Optimizing the lower bound: with the theoretical $C$, the steps become very small
   $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) - C \, D_{KL}^{\max}(\theta_{old}, \theta)$
3. Large and robust steps: KL-penalty → KL-constraint (see the numeric example below)
   $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
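A quick arithmetic illustration (mine, not on the slide) of why the theoretical $C$ forces tiny steps: with $\gamma = 0.99$,

$C = \frac{4\epsilon\gamma}{(1-\gamma)^2} = \frac{4 \times 0.99}{0.01^2}\,\epsilon = 39600\,\epsilon,$

so even a modest KL-divergence is penalized by tens of thousands of times the maximum advantage, and the penalized objective only accepts near-zero policy changes.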
KL-constraint optimization
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
Trust region:
a small step in policy space
Approximation of KL-divergence
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad D_{KL}^{\max}(\theta_{old}, \theta) \leq \delta$
• $D_{KL}^{\max}(\theta_{old}, \theta)$ takes the max over every state, which is not practical
• Approximation: use the average KL-divergence instead
  $D_{KL}^{\max}(\theta_{old}, \theta) \approx \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right]$
• The resulting sub-problem:
  $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
Surrogate advantage function
$L_{\theta_{old}}(\theta) = \eta(\theta_{old}) + \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$
$\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \;\equiv\; \text{maximize}_\theta \; \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$
→ estimate $\sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a)$ from samples → approximation
Surrogate advantage function
$\sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s) \, A_{\theta_{old}}(s, a) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_\theta}\left[ A_{\theta_{old}}(s, a) \right]$
$= \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$   (importance sampling)
$\text{maximize}_\theta \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right] \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
The expectation above is the surrogate advantage function (a sample-based sketch follows).
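A minimal sample-based version of the surrogate (a sketch; the function name and the use of log-probabilities are my choices, though the baselines implementation reviewed in Part 4 builds the same ratio from log-probabilities):

import numpy as np

def surrogate_loss(logp_new, logp_old, adv):
    # Estimate of E_{s~rho_old, a~pi_old}[ pi_theta(a|s)/pi_old(a|s) * A_old(s,a) ].
    # logp_new / logp_old: log pi(a_t|s_t) of the taken actions under each policy.
    ratio = np.exp(logp_new - logp_old)   # importance weights
    return (ratio * adv).mean()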
Natural Policy Gradient
• How to solve this problem?
  $L_{\theta_{old}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$
  $\text{maximize}_\theta \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}(\theta_{old}, \theta) \leq \delta$
1. 1st-order approximation of the surrogate advantage function
   $L_{\theta_{old}}(\theta) \approx L_{\theta_{old}}(\theta_{old}) + g^T (\theta - \theta_{old})$, where $g = \nabla_\theta L_{\theta_{old}}(\theta)\big|_{\theta = \theta_{old}}$
2. 2nd-order approximation of the KL-divergence
   $\bar{D}_{KL}(\theta_{old}, \theta) \approx \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old})$, where $H = \nabla_\theta^2 \bar{D}_{KL}(\theta_{old}, \theta)\big|_{\theta = \theta_{old}}$
   $H$ is the Fisher Information Matrix
Natural Policy Gradient
• The constrained problem becomes
  $\text{maximize}_\theta \; g^T (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \leq \delta$
• Lagrange multiplier method
  • Set the gradient of $g^T (\theta - \theta_{old}) - \frac{\lambda}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old})$ to zero
  • With $\beta = \frac{1}{\lambda}$ and the search direction $s = \theta - \theta_{old}$:
    $g - \lambda H s = 0 \;\Rightarrow\; s = \beta H^{-1} g \;\propto\; H^{-1} g$
Natural Policy Gradient
• The linearized objective pushes the solution to the boundary, so the KL constraint is active
• The Lagrange multiplier only fixes the scale of the direction at the boundary
  $\theta - \theta_{old} = \beta s$, with $s = H^{-1} g$
  $\frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) = \delta \;\Rightarrow\; \frac{1}{2} \beta^2 \, s^T H s = \delta$
  $\Rightarrow\; \beta = \sqrt{\frac{2\delta}{s^T H s}} = \sqrt{\frac{2\delta}{g^T H^{-1} g}}$
Natural Policy Gradient
• Problem
  $\text{maximize}_\theta \; g^T (\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^T H (\theta - \theta_{old}) \leq \delta$
• Solution (a direct numpy rendering follows)
  $\theta_{new} = \theta_{old} + \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
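A sketch of this update for small problems only, where $H$ can be formed explicitly (names are illustrative):

import numpy as np

def npg_step(theta_old, g, H, delta=0.01):
    # theta_new = theta_old + sqrt(2*delta / g^T H^{-1} g) * H^{-1} g
    Hinv_g = np.linalg.solve(H, g)              # H^{-1} g without forming H^{-1}
    beta = np.sqrt(2.0 * delta / (g @ Hinv_g))  # scale that lands on the KL boundary
    return theta_old + beta * Hinv_g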
Truncated Natural Policy Gradient
• In NPG, a neural network policy has so many parameters that computing $H^{-1}$ is intractable
  • With $N$ parameters, storing $H$ costs $O(N^2)$ and inverting it costs $O(N^3)$
• Use the conjugate gradient method to compute $H^{-1} g$ without ever forming $H^{-1}$
  → Truncated Natural Policy Gradient
• CG approximately solves linear systems of the form $Hx = g$ (see the sketch below)
  • It converges iteratively instead of solving analytically
  • Each iteration only needs a Hessian-vector product $Hv$, never the full matrix $H$
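A self-contained sketch of that conjugate gradient routine (my implementation; the baselines code ships a similar helper):

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    # Solve H x = g for x = H^{-1} g using only Hessian-vector products.
    # hvp: callable v -> H @ v (H is never formed explicitly).
    # A fixed, small number of iterations -> "truncated" NPG.
    x = np.zeros_like(g)
    r = g.copy()          # residual g - H x (x = 0 initially)
    p = r.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rr / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# Usage sketch, forming H explicitly only to verify:
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -1.0])
x = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(x, np.linalg.solve(H, g)))  # True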
Truncated Natural Policy Gradient
• Drawbacks of the Truncated Natural Policy Gradient
1. Might not be robust to the trust region size; at some iterations the step may be too large and
   performance can degrade
2. Because of the quadratic approximation, the KL-divergence constraint may be violated
Trust Region method
(diagram: approximation → sub-problem → trust region)
Trust Region Policy Optimization
• Build a sub-problem from the approximations, then solve the sub-problem in two steps
1. Find the search direction
2. Do a line search along that direction inside the trust region
• Trust Region Policy Optimization (sketch below)
1. Search direction: $s = \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
2. Backtracking line search: $\theta_{new} = \theta_{old} + \alpha^j s$
   (accept the first $j$ such that $L_{\theta_{old}}(\theta_{new}) > 0$ and $\bar{D}_{KL}(\theta_{old}, \theta_{new}) \leq \delta$, then stop)
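A sketch of the backtracking step (illustrative names; surrogate and kl stand for sample estimates of $L_{\theta_{old}}$, normalized to 0 at $\theta_{old}$, and of the average KL):

import numpy as np

def backtracking_line_search(theta_old, fullstep, surrogate, kl, delta,
                             alpha=0.5, max_backtracks=10):
    # Shrink the step by alpha^j until both acceptance tests pass.
    for j in range(max_backtracks):
        theta_new = theta_old + (alpha ** j) * fullstep
        if surrogate(theta_new) > 0.0 and kl(theta_new) <= delta:
            return theta_new   # first step that improves L inside the trust region
    return theta_old           # every candidate failed: keep the old policy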
4. Code review of TRPO
TRPO
OpenAI baselines
https://github.com/openai/baselines
TRPO algorithm outline
1. Collect samples with the current policy
2. Compute GAE from the collected samples
3. Compute the surrogate advantage function
4. Compute the gradient g of the surrogate and the Hessian H of the KL-divergence
5. Compute the search direction from g and H via CG
6. Do a backtracking line search along the search direction
TRPO baseline code structure
1. run_atari.py : main loop that runs learning on Atari environments
2. nosharing_cnn_policy.py : actor-critic network (actor and critic do not share parameters)
3. trpo_mpi.py : the learning loop that trains the cnn_policy
Collect samples with the current policy
def traj_segment_generator(pi, env, horizon, stochastic)
Compute GAE from the collected samples (recursion sketched below)
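The GAE recursion itself is short; a numpy sketch for a single non-terminating segment, where values has length T + 1 so the last entry bootstraps the state after the segment (baselines additionally handles episode boundaries with a "new" mask, omitted here):

import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)
    # A_t = sum_l (gamma*lam)^l delta_{t+l}, computed backwards in one pass
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv          # value-function targets are adv + values[:-1]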
Compute the surrogate advantage function
$L_{\theta_{old}}(\theta) = \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim \pi_{\theta_{old}}}\left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)} \, A_{\theta_{old}}(s, a) \right]$
Compute the gradient of the surrogate and the Hessian of the KL-divergence
1. Gradient of the surrogate
   • A policy gradient computed from the states, actions, and GAE values
2. Hessian (second derivative) of the KL-divergence: computed as the FIM (sketch below)
   • KL-div: $\bar{D}_{KL} = \mathbb{E}_s\left[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \right]$
   • Its first derivative vanishes at $\theta = \theta_{old}$, so the quadratic term is the leading one
   • FIM: $H = \mathbb{E}\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^T \right]$
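For a small discrete policy the FIM can be formed exactly as the expected outer product of score vectors; a numpy sketch for intuition only (at neural-network scale the baselines code never forms H, computing Hessian-vector products of the average KL by double backpropagation instead):

import numpy as np

def fisher_matrix(theta, states):
    # F = E_{s, a~pi}[ grad log pi(a|s) grad log pi(a|s)^T ], linear-softmax policy
    n_actions, _ = theta.shape
    F = np.zeros((theta.size, theta.size))
    for s in states:
        logits = theta @ s
        p = np.exp(logits - logits.max())
        p /= p.sum()
        for a in range(n_actions):
            grad = -np.outer(p, s)    # d/dtheta log softmax: minus-mean part
            grad[a] += s              # plus the taken action's features
            gvec = grad.ravel()
            F += p[a] * np.outer(gvec, gvec)   # expectation over a ~ pi(.|s)
    return F / len(states)                     # average over sampled states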
Compute the search direction from g and H (via CG)
1. Compute $H^{-1} g$ with the conjugate gradient method
2. Search direction: $s = \sqrt{\frac{2\delta}{g^T H^{-1} g}} \, H^{-1} g$
Backtracking line search along the search direction
$\theta_{new} = \theta_{old} + \alpha^j s$
(accept the first $j$ such that $L_{\theta_{old}}(\theta_{new}) > 0$ and $\bar{D}_{KL}(\theta_{old}, \theta_{new}) \leq \delta$, then stop)
Thank you
TRPO
