36. Abstract
Key Information
1. We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space.
2. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action.
37. Abstract
3. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al.
4. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
In other words, with compatible value function approximation, the natural gradient moves toward the greedy action that a single improvement step of policy iteration would select, rather than merely a better action.
Performance improvements are demonstrated on a simple MDP and on the Tetris MDP.
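For reference, the quantities behind these points can be written out explicitly. The following recap uses the paper's definitions of the Fisher information metric and the natural gradient (standard notation; nothing beyond the paper's own objects is assumed):

F_s(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right],
\qquad
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ F_s(\theta) \right]

\widetilde{\nabla} \eta(\theta) = F(\theta)^{-1} \nabla \eta(\theta),
\qquad
\theta_{k+1} = \theta_k + \alpha\, \widetilde{\nabla} \eta(\theta_k)

Point 2 refers to this update: as the step size grows, the natural-gradient direction moves the policy toward the greedy action of one policy-iteration improvement step (point 3) rather than merely a slightly better action.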
78. Metrics and Curvatures
FIM vs. Hessian
• Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F.
→ Is there a better metric to use than the FIM G?
• In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient.
→ In parameter estimation, the FIM converges to the Hessian, so it is asymptotically efficient.
→ How do the FIM and the Hessian relate?
- The FIM is an expectation of the score outer product under the (stochastic) model, while the Hessian is the curvature of the objective itself.
- That is, when the model is well specified, the two agree asymptotically (FIM ≈ Hessian).
- In our reinforcement-learning objective, however, there is no log-likelihood being maximized, so this asymptotic equivalence does not carry over directly.
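To make the comparison above concrete, recall the standard information-matrix equality from parameter estimation (stated under the usual regularity conditions; the generic model p_\theta is illustrative notation, not from the slides):

F(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top} \right]
\;=\; -\,\mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta^{2} \log p_\theta(x) \right]

This identity is why, in well-specified maximum-likelihood estimation, the FIM and the Hessian of the negative log-likelihood agree asymptotically. The expected return \eta(\theta) is not a log-likelihood, so no comparable identity ties F(\theta) to \nabla^{2}\eta(\theta).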
80. Metrics and Curvatures
FIM vs. Hessian
• Our situation is more similar to the blind source separation case where a metric is chosen based on the underlying parameter space (of non-singular matrices) and is not necessarily asymptotically efficient (i.e. does not attain second order convergence).
→ Choosing the metric from the structure of the underlying parameter space is analogous to the blind source separation setting.
• As argued by Mackay, one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, Mackay arrives at the same result as Amari for the blind source separation case.
→ Pulling a metric out of the data-independent terms of the Hessian would be one way to obtain a good metric.
→ In our case, however, the terms of the Hessian are not data-independent.
→ So it is not clear that this route yields anything better than the FIM here.
→ In our setting Q is coupled to the policy, so the Q-values enter every term of the Hessian (cf. the quadratic form G · dθ · dθ that defines the metric).
→ The FIM is therefore not the same as the Hessian; unlike the Hessian, it is guaranteed to be positive (semi-)definite.
→ As a result, this metric does not necessarily give second-order convergence near a local maximum.
→ Near a maximum, conjugate methods (which effectively use the inverse Hessian) may perform better.
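The positive-definiteness remark can be checked numerically. Below is a minimal numpy sketch (not from the slides; the softmax policy, dimensions, and sample count are illustrative assumptions) that estimates F = E[∇ log π ∇ log πᵀ] at a single state by Monte Carlo and inspects its eigenvalues. Being an average of outer products, the estimate is positive semi-definite by construction, a guarantee the Hessian of the return does not share.

import numpy as np

rng = np.random.default_rng(0)

# Tiny softmax policy over 3 actions with 4-dimensional features (illustrative sizes).
n_actions, d = 3, 4
theta = rng.normal(size=(n_actions, d))
state = rng.normal(size=d)  # one fixed state feature vector

def policy_probs(theta, s):
    logits = theta @ s
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    # For a softmax policy, d/dtheta_{a'} log pi(a|s) = (1{a'=a} - pi(a'|s)) * s.
    p = policy_probs(theta, s)
    g = -np.outer(p, s)
    g[a] += s
    return g.reshape(-1)

# Monte-Carlo estimate of F = E_{a ~ pi}[ grad log pi  grad log pi^T ] at this state.
p = policy_probs(theta, state)
n_samples = 20000
F = np.zeros((n_actions * d, n_actions * d))
for _ in range(n_samples):
    a = rng.choice(n_actions, p=p)
    g = grad_log_pi(theta, state, a)
    F += np.outer(g, g)
F /= n_samples

# An average of outer products is positive semi-definite, so the smallest
# eigenvalue should be >= 0 up to floating-point error.
print("smallest eigenvalue of estimated FIM:", np.linalg.eigvalsh(F).min())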
84. Experiments
A simple 1-dimensional linear quadratic regulator. The goal is to apply a control signal u to keep the system at x = 0, incurring a cost of x(t)^2 at each step.
(The dynamics, the policy parametrization, and the cost are given as equations on the original slide.)
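Since the slide's equations are not reproduced above, the following is a hypothetical sketch of natural policy gradient on a 1-D LQR-style problem. The dynamics x_{t+1} = x_t + u_t + noise, the per-step cost x_t^2, the single-gain Gaussian policy u ~ N(k·x, σ²), and all constants are illustrative assumptions, not the paper's exact setup; the point is only to show the ordinary REINFORCE gradient being rescaled by an empirical Fisher information.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: the exact dynamics, policy, and cost from the slide are not
# reproduced here, so this sketch assumes x_{t+1} = x_t + u_t + noise, a per-step
# cost of x_t^2, and a Gaussian policy u ~ N(k * x, SIGMA^2) with a single gain k.
SIGMA, HORIZON, N_EPISODES = 0.5, 20, 200

def rollout(k):
    """One episode: return (total cost, sum of d/dk log pi, rough per-episode Fisher)."""
    x = rng.normal()
    cost = score = fisher = 0.0
    for _ in range(HORIZON):
        u = k * x + SIGMA * rng.normal()
        cost += x ** 2
        score += (u - k * x) * x / SIGMA ** 2   # d/dk log N(u; k*x, SIGMA^2)
        fisher += x ** 2 / SIGMA ** 2           # E[(d/dk log pi)^2 | x] for this step
        x = x + u + 0.1 * rng.normal()
    return cost, score, fisher

def estimate(k):
    """REINFORCE gradient of expected cost w.r.t. k and an empirical Fisher estimate."""
    costs, scores, fishers = map(np.array, zip(*(rollout(k) for _ in range(N_EPISODES))))
    grad = np.mean((costs - costs.mean()) * scores)   # baseline for variance reduction
    return grad, fishers.mean()

k = 0.0  # start with no feedback; a stabilizing (negative) gain should be learned
for _ in range(100):
    grad, fisher = estimate(k)
    k -= 0.1 * grad / max(fisher, 1e-6)   # natural-gradient descent step on the cost
print("learned gain k:", k)  # expected to move toward roughly k ~ -1 under these assumed dynamics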