際際滷

際際滷Share a Scribd company logo
2018/06/30 ??? RLI STUDY
?? ?? ? ? ????~
)
( 0 (
	
*???? ??? ?? ?? ??? ?? ?? ???? ??? ??? ?? ??? ??? ?? ????. ?? ?? ??? ???? ???? ?????.
Natural Policy Gradient = Policy Gradient + Natural Gradient
???? ? ???
? ?????
??(2018? 6? ??) PPO(Proximal Policy Optimization)? ??
?????. PPO? ??? ? ????? ?? ???? NPG?
??? ??.
NPG(2001) ┐ TRPO(2015) ┐ PPO(2017)
Natural Policy Gradient ??? ??
Policy Gradient ?? ┐ Natural Gradient ?? ┐ ?? ?? ????
??? ??
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Review: Policy Gradient 1. ???? ????
Review: Policy Gradient
???? ????
!(#): ??(????)
?! # ┐ &?? ┐ !(#) ???
#' + 1 = #' + + ,?!(#')
Initial
parameter
Gradient ,?!(#')
&
Local maximum-(&.)
Gradient Ascent
-(&)
????? ??: ??? ??? ?? ?
Review: Policy Gradient
???? ????
?? ??? == x?: # $?: %(#)
?? ??? == x?: ?? $?: ??
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Overview: Natural Gradient
1. ????? ??
2. ????? ????
3. ??? ??? ??
Overview: Natural Gradient
????? ??
?? policy? parameter (!) ? ???? ???? ?? ???. policy? ? ? ????? ??? ???
????, ??? ?? ?? ?? ?? ??? ???. ?? ??, policy? ??? ?????? ?? ?
?? ?? ???? (256 , 3)? ??? ???? ??? ???? ??? ????. ? ?, ??? ?? ?
??? ??? policy? 256*3=768 ??? ???. ?? ??? ????? ??? ????? ??? ?
?? ??? ???. ?? ???? Policy? ??? ????? ??? ??? 3???? ?????.
Policy parameter theta ? ??? ??? ?? (3,1) ???? ???
3.5
8.0
3.9
??
??? ???? ?? ???? ???? ??? ???. ????
??? ??? ??? ??? ???? ??? ??? ???? ??
? ?? ???.
3.8
8.0
3.0
() ()*
????? ???? ?? ??? ?? ?? ???´
Overview: Natural Gradient
????? ??
????? ? ??? ???? subspace? !
Manifold: ?? ?? ?. ?, ?? ?
??? ??? ???? ?. ??
? ?? ?? 2-d? ??? ???
dimensionally reduction? ?? ?
??. (?? ??? 2-d? ?? ?
? ? ??? ?????? 2-d?
??)
???? ?(???? policy)??
???? ?? ?? ???? ?
? ?? ???
┐ ??? ????? ?????
??? ? ??.
┐ ?? ??? ????? ??
? ? ? ??!
┐ ? ?? ???? ? ??
???.
Overview: Natural Gradient
????? ??
????? ?? ??? ?? ???? ???? ??? ???? ????.
Overview: Natural Gradient
????? ??
??, ??? NPG ???? ?? ????? ? ?????
Natural Gradient Method? ?? ???? ????? steepest descent direction? ???? ??!
?? ???? ??? ?? ??????. ?? ????? ????? ??? ?? ?????? ????
??? ?? ???? ???? ??.
Overview: Natural Gradient
????? ??
?? ???? ???? ?? policy ?? ??? ??? ?????.
Overview: Natural Gradient
????? ??
????? ???? ?? B?? A1? ??? A2? ??? ??? ?? ?, ???? ????? ??? A1?
? ???? ?? ???. ??? ????? ???? B?? A1?? A2? ? ???.
?? ??? ???????
(??? ??? ?? ??.)
Overview: Natural Gradient
????? ??
┨ policy(B)?? policy gradient? ?? policy(B + delta B)? ? ?? policy(A1)??? policy? ?? ???? ?
?? ?? ? ??. ???? delta B? 0.0001?? ?? ?? ?????? ???? ??? ??? ??.
┨ policy(B)?? policy gradient? ?? policy(B + delta B)? ? ?? policy(A2)?? policy? ?? ? ?????
?? ? ??.
?? ?? B┐A1 ?? B┐A2? ? covariant??? ? ? ??.
Overview: Natural Gradient
????? ??
┨ ??? policy? ???? ???? ???? ??? (???? ??) manifold ???? ???? ? ? ?
???? ???? ???? gradient? ? ?? ????
????..
┨ Natural Gradient??? policy? ?? manifold? ???? ??? ??.
┨ Riemannian manifold? Manifold ???? ???? ??, ?? ??? Manifold?? ???? ????.
Overview: Natural Gradient
????? ??
?? ???? (?? ??)
Overview: Natural Gradient
????? ??
┨ ??? Natural Policy Gradient? ?? ??(?? ??)??? ?? ??(?? ??)? ??? gradient?
policy? ???? ?? ???.
????? ???? ??, ??? ?? ???? ??
?? ???.
?, ??? ?? ??? ????? ????? ?? ?
?? ?? ?? ????.
?? ???
????? ????? ?? ?? = ?? ????? ?? ??
??? ???? ??.
Overview: Natural Gradient
?? Natural Gradient ? ??? ???,
?? ????? ?????
?? ??(??)? ?? ??(??)? ??? ??? ????
Overview: Natural Gradient
????? ????
???
Overview: Natural Gradient
?????? ?? ???
??: ???????, ??? ??? ??? ??
????? ????
Overview: Natural Gradient
?????? ?? ???
--- (1)
??: ???????, ??? ??? ??? ??
????? ????
Overview: Natural Gradient
?????? ?? ???
--- (1)
??: ???????, ??? ??? ??? ??
????? ????
Overview: Natural Gradient
?????? ?? ???
--- (2)
??: ???????, ??? ??? ??? ??
????? ????
30
Overview: Natural Gradient
?????? ?? ???
--- (2)
??: ???????, ??? ??? ??? ??
????? ????
Overview: Natural Gradient
??? ?? ???? ??.
??? ??? ??
??? ??
Overview: Natural Gradient
??? ?? ???? ??.
??? sin(x)?? ??.
x = 0 ??? 1? ???? y=x
x = 0 ?? ??? ?? ?? ??? ?? ??? ??? ???? gradient? ? ??(?????? ?? ???? ??)
x = 0 ???? gradient??? ??.
??? ??? ??
Overview: Natural Gradient
??? ?? ???? ??.
??? ??? ??
??: https://m.blog.naver.com/PostView.nhn?blogId=papers&logNo=220751624818&proxyReferer=https%3A%2F%2Fwww.google.com%2F
Overview: Natural Gradient
??? ?? ???? ??.
??? ??? ??
??? e^x ? ?,
??: https://m.blog.naver.com/PostView.nhn?blogId=papers&logNo=220751624818&proxyReferer=https%3A%2F%2Fwww.google.com%2F
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Abstract
Abstract
Key Information
1. We provide a natural gradient method that represents the steepest descent direction based on
the underlying structure of the parameter space.
2. Although gradient methods cannot make large changes in the values of the parameters, we show
that the natural gradient is moving toward choosing a greedy optimal action rather than just a
better action.
Abstract
3. These greedy optimal actions are those that would be chosen under one improvement step of
policy iteration with approximate, compatible value functions, as defined by Sutton et al.
4. We then show drastic performance improvements in simple MDPs and in the more challenging
MDP of Tetris.
?? ??????? ?? ????, policy iteration?? one improvement step?? ??? ?? ??? ???? ???.
?? ?? ?????? ?? ??? ????? ??´
Simple MDP, Tetris MDP?? ??? ?? ?????.
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Introduction
Introduction
1. Policy gradient: ?? ??? gradient? ?? ??? policy? ?? ??
2. ???? ??? ??? ????? ??? ????? SGD? non-covariant
3. ?? ??? ??? ????? Natural gradient? covariant
4. Policy iteration? ???? ?? (??? ??)
5. ?? ??? ??? ??. (??? ? ? ??)
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Natural Gradient
1. Assume
2. ??? ??
3. Natural Gradient ??
Natural Gradient
Natural Gradient? ?? ???? ?? ? ??? ???? ???.
Assume ┐ Gradient ??? ??(??) ?? ┐ Natural Gradient ??
┨ Finite MDP: (", $%, &, ', ()
┨ Decision making method: policy(a;s)
┨ Every policy is ergodic (well-defined stationary distribution *+
)
┨ Average reward:
┨ State-action value:
┨ Value function:
┨ Policy , ??? ??? ?? ??? ??
ergodic: ??????*stationary distribution = stationary distribution ´ ????? ????????? ??? ??? ?????? ??? ?? ??? ??. ?? ?? ????
????? ??? ?? ? ??? ???????? ?? ????? ?? ??? ?? ??(stationary distribution)? ??? ???.
Natural Gradient
Assume
┨ The exact gradient of the average reward:
Natural Gradient
Assume
??? ??~
?? ??! ?? ??? ??? ? `policy ???? ? ? ????? ? ? ??!¨ ??.
Policy ???? ???? ??
Policy? ?? ???? ??? `?¨ ???? ???? ???.
? ?? ???? ??? ?? ???? ??? gradient??? ?? ????.
┨ ???? ??(steepest descent direction):
!(# + %#)? ??????? ?? (%#)
┨ ???? ??(??):
|%#|(
Natural Gradient
??? ??
??? .. ?? ??? ?? ?? ?? pg? ??? ??..
?? ???? ??(??)? ?? ???? ??? ??!!! ? ??? ??!
|"#|$
= "#"#
???? ????? ???? # ?? ??? ??? ???? ??.
??? ??? ?? ???? #? ??? ??? ?????? ???.
?? ??? ? &' ? ?????? ??? ? ????? ???? ???
???. ? &'? ????? ????(?? ????)? ??? ? ??.
??? ???? #? ?? ???? ?? ?? ???? ??? ?? ?? ?
? ??.
┬ ???? ??? ?? ??? ?? ? ? ??? ??? ?? ?? ? ??.
?? ??, #1, #2 ?? ??? ???? ?? ???. ???? ??? ??
?? ??? ? ?? ? ? general? ?? ?? ?? ????!
Natural Gradient
??? ??
?? ???? ????? ??? ??? ??.
Natural Gradient
??? ??
?? ?? ??
????? 2d? ?? ?? ? ??? ???? ????? ??? ?? ??? ??? ? !1, !2? ??? ???.
Natural Gradient
??? ??
???? ??? ???? ??? ? ?? ? ? general?? ?? ?? ?? ????!
!(#) ?? ??? ?????´
?? ??? ?? ??? ??.
Natural Gradient
??? ??
!(#)
!(#)
%(&) ? positive-definite matrix(?? ??? ??).
?? ???? ???? ??? ???? ??? ?? ????.
????? ??? ?? ?? ??? ??? ??..
??? ??? ??? ???? ??.
http://bskyvision.com/205
Natural Gradient
Natural Gradient ??
!(#)
!(#) ? identity matrix? ? ?? ???? ???? ??? ?? ?? ??.
?? ??(?? ????)? ???? ?? ??? ?? ??? Natural Gradient ?? ??.
?? ????
Manifold ??? ?? ????? ??? ? % ?? ??? ???(variant) ??? policy ????
???? ?? ? ??? ?? invariant?? ???? ??? ? ??? Fisher information matrix?
!(#)? ?? ???.
Fisher information matrix
FIM: ?? ??? ?? ??? positive definite matrix ? ?? ??.
Natural Gradient
Natural Gradient ??
Natural Gradient
Natural Gradient ??
NPG??? steepest descent direction? ??? ??.
??? ?? ?? ???.
???? ??? ?? ????(????? NPG??)
(https://dnddnjs.github.io/paper/2018/05/22/natural-policy-
gradient/)
Natural Gradient
Natural Gradient ??
NPG??? steepest descent direction? ??? ??.
??? ??? ?? ???.
`FIM??? positive-definite matrix? ?? ?? ??? ??? ?? ? ???
????? ???? ??.¨
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
NPG? ??? ??? 3 ?? ???
?? ??..
Theorem 1: Compatible
Function Approximation
1. Assume
2. Theorem
3. Proof
4. meaning
Theorem 1: Compatible Function Approximation
??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??.
(?? ?? ??? ??.)
*??? Function Approximation? ??? ??? ??
??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??.
Theorem 1: Compatible Function Approximation
Assume
??? ? policy? ???
f ? y=ax ? linear function ? Q value? ??? ??
**??? ????? ??? ??? ?? ???? ??.
Q? ??? f? ?? Q? ??? ? MSE
??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??.
Theorem 1: Compatible Function Approximation
Theorem
?? ??? ??? ??.
Theorem 1: Compatible Function Approximation
Theorem
???? ??? ??
Theorem 1: Compatible Function Approximation
meaning
??? ??? ??
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Theorem 2: Greedy Policy Improvement
( policy 『 exp )
Theorem 2: Greedy Policy Improvement ( policy 『 exp )
??2: policy? exponential ??? ????, ?????? ???? ??? ?? policy? ??? ??? ???.
??? ??? ??? ??
??? ??? ?? ?? ??? ????
Overview: Natural Gradient
????? ??
┨ ??? Natural Policy Gradient? ?? ??(?? ??)??? ?? ??(?? ??)? ??? gradient?
policy? ???? ?? ???.
????? ???? ??, ??? ?? ???? ??
?? ???.
?, ??? ?? ??? ????? ????? ?? ?
?? ?? ?? ????.
?? ???
????? ????? ?? ?? = ?? ????? ?? ??
??? ???? ??.
??? ???? ???? ?????? ?? ?????.
Theorem 2: Greedy Policy Improvement ( policy 『 exp )
1. 2? ??? 1? ???? policy ??? ? ???? ??.
2. 2? ??: ?? theta ?? ???? ?????? ???
? ??? best action? ???? ??.
3. 1? ??: theta? ?? ???? ?? gradient? ?? 0?
????? ?????? ??? ????? best action
? ??? ??? ? ??.(better action??? ?? ? ?
?? ???)
Theorem 2: Greedy Policy Improvement ( policy 『 exp )
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Theorem 3: Greedy Policy Improvement
( policy == general parameterized policy )
1. Theorem
2. Result
3. Solution C Line Search
??3: policy? general function?? Natural policy gradient? policy? ???? ?? ??
??? ??? 1? ??? ??.
* ?? ????? 1? ??
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Theorem
Policy ? exp? ?? general function? ??
Greedy action ? ???? ? ???.
? ??? ???
?? policy? log(1+x)?? ?????.
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Result
??? 2? ???? ????? ?? ??? ?? policy ??? ?? ???? ??? ? ? ??. ?, ??
??? a? ???? ?? ??? ???? policy ????? ? ???? ?? ?? greedy action? ???
???.
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Result
?? ?? ??? ???? (a)? ?? ????? ?? ???? ?? ??? ???, (b)? ?? ??? ??
?? ??? ?? ??. ?? ????? ??? ? ??, ??? ?? ?? ???? ???(c).
??? policy? f(x) = x3 - 2x + 1 ?? ??? ?? ???.
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Result
(d)(c)(a) (b)
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Solution C Line Search
Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
Solution C Line Search
??: ???????
http://darkpgmr.tistory.com/149
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Metrics and Curvatures 1. FIM? Hessian ???
79
? Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use
than F.
FIM ?? G? ?? ? ?? ???? ????
? In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is
asymptotically efficient
FIM? ?? Hessian?? ???. ??? ????? ?????.
┨ FIM? Hessian? ?? ?????
- FIM? stochastic?? ????? ?? ?, Hessian? ????? ? ???? ?? ? ??.
- ?, ????? ?????? ????? ?? ??? ?? ??? ? ?? FIM┐Hessian ????
- ?? ????? ????? ????????? ?? ????? ??? ??? ? ?????.(?
??..)
Metrics and Curvatures
FIM? Hessian ???
80
Metrics and Curvatures
FIM? Hessian ???
81
Metrics and Curvatures
FIM? Hessian ???
? Our situation is more similar to the blind source separation case where a metric is chosen based on the underlying
parameter space (of non-singular matrices) and is not necessarily asymptotically efficient (ie does not attain second
order convergence)
┨ Metric? ???? ??? ?? ????? Blind Source Separation? ??? ????.
? As argued by Mackay , one strategy is to pull a metric out of the data-independent terms of the Hessian (if
possible), and in fact, Mackay arrives at the same result as Amari for the blind source separation case.
┨ ????? data-independent? ?? metric?? ??? ?? ?? ??? ??? ? ? ??!(?)
┨ ?? ? ?? ??? data-independent?? ??.
┨ ??? ?? ?????? ? ??? ?? FIM?? ? ???? ???.
┨ ?? ??? ?? Q? policy? ???? ??? ??? ??? ??.. ??? ?? ?????? Q??
?? ????! (G * d_theta * d_theta ??)
┨ ? FIM? Hessian?? ??? ??? ??. ???????? ???? positive definite?? ?? ??
??.
┨ ??? ?? ??? ????? ? ???? ?? ???? ?? ?? ?? ???.
┨ ?? ?? ??? ????? ??? Conjugate Method(inversed hessian)? ? ??????? ??.
82
Metrics and Curvatures
FIM? Hessian ???
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Experiments
1. simple I-dimensional linear quadratic
regulator with dynamics
2. a simple 2-state MDP
3. The game of Tetris
85
Experiments
simple I-dimensional linear quadratic regulator with dynamics
The goal is to apply a control signal u to keep
the system at x = 0, (incurring a cost of X(t)^2
at each step).
??:
policy:
??:
86
Experiments
simple I-dimensional linear quadratic regulator with dynamics
?? ? ?? npg, ??? ??? ?? gradient
1. NPG? ? ?? ????
2. NPG? parameter theta? ???? ???? invariant? ??(covariance)? ?? ??? ????? ????.
3. ? ??? ?? stationary distribution? ????? ??? ?? ??? ???.
87
Experiments
a simple 2-state MDP
????: i? stationary distribution=0.8
j? stationary distribution=0.2
??: j? ???? ?? ?
Policy:
?? gradient ??(solid)
NPG(dashed)
?? gradient ??
C (top): stationary distribution? ??? ???. ?? 1?? ? ?
????. / C(bottom): NPG? ?? 2? ? ?? ????.
(?? ??)
Experiments
a simple 2-state MDP
?? gradient ??(solid)
NPG(dashed)
Parameter? ?????? ??
?? gradient? ??? theta_i? ?? ?????? ?? NPG? i? j? ??? ? ???? ??.
Experiments
The game of Tetris
??: ????
policy:
??? ??
1. Review: Policy Gradient
2. Overview: Natural Gradient
3. Abstract
4. Introduction
5. Natural Gradient
6. Theorem 1: Compatible Function Approximation
7. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy )
9. Metrics and Curvatures
10. Experiments
11. Discussion
Discussion
1. gradient method? Greedy policy iteration? ?? policy? ?? ??? ???.
2. ?? ????? ??. ? natural gradient? ?? = policy improvement ?? ?? ???
3. Line search ?? natural gradient? policy improvement ??? ?????.
4. Policy improvement? ?? NPG? ?? ??? ???? ??.
Discussion
5. Fisher Information matrix? asymtotically Hessian?? ???? ??. asymtotically
conjugate gradient method(Hessian? inverse? approx.? ??? ??)? ? ?? ?? ? ?
?.
6. ??? Hessian? ?? informative?? ??(hessian? ?? ??? ??? positive
definite? ?? ??? ??? ?? ??? convex? ?? ? ? ????? ??? ???
??? hessian? ?? positive definite? ?? ? ??? ???) tetris?? ??? natural
gradient method? ? ???? ? ??(pushing the policy toward choosing greedy optimal
actions)
7. conjugate gradient method? ? ? maximum? ??? ?????, performance?
maximum?? ?? ????? ??? ??? ???(?). ? ??? ??? ???? ??
??.
Discussion
??: https://dnddnjs.github.io/paper/2018/05/22/natural-policy-gradient/
94
?? ???? ?? ???
┬ Natural gradient descent? ?? ??? ???? ??? gradient descent?? ??? ?? ??? ???.
?? ? ? ??? ????
┬ ???? ?? ????.
????
95
?? gradient descent
NPG NPG
????
96
?? ?? ?? ???? ?? ??
????
1. ???, ¨A Natural Policy Gradient¨, (https://dnddnjs.github.io/paper/2018/05/22/natural-
policy-gradient/)
2. ???????, ¨??? ??? ??? ??¨,(http://darkpgmr.tistory.com/149)
3. ???, `?????? ?? ? 1/3¨,
(https://www.youtube.com/watch?v=o_peo6U7IRM&feature=youtu.be)
4. ???, ???
Thanks to´
Q&A
?? ?????? ????´
?"

More Related Content

What's hot (20)

Diversity is all you need(DIAYN) : Learning Skills without a Reward Function
Diversity is all you need(DIAYN) : Learning Skills without a Reward FunctionDiversity is all you need(DIAYN) : Learning Skills without a Reward Function
Diversity is all you need(DIAYN) : Learning Skills without a Reward Function
Yechan(Paul) Kim
?
???? ???? ????????, ???? ??? ??????.
???? ???? ????????, ???? ??? ??????.???? ???? ????????, ???? ??? ??????.
???? ???? ????????, ???? ??? ??????.
Yongho Ha
?
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
moterech
?
?? ?? ?? Reinforcement Learning an introduction
?? ?? ?? Reinforcement Learning an introduction?? ?? ?? Reinforcement Learning an introduction
?? ?? ?? Reinforcement Learning an introduction
Taehoon Kim
?
???? ??? - Deep Learning ?? ????
???? ??? - Deep Learning ?? ???????? ??? - Deep Learning ?? ????
???? ??? - Deep Learning ?? ????
Byoung-Hee Kim
?
Positive-Unlabeled Learning with Non-Negative Risk Estimator
Positive-Unlabeled Learning with Non-Negative Risk EstimatorPositive-Unlabeled Learning with Non-Negative Risk Estimator
Positive-Unlabeled Learning with Non-Negative Risk Estimator
Kiryo Ryuichi
?
Keras-community-week_Flax_and_Keras.pptx
Keras-community-week_Flax_and_Keras.pptxKeras-community-week_Flax_and_Keras.pptx
Keras-community-week_Flax_and_Keras.pptx
?? ?
?
Texture-Aware Superpixel Segmentation
Texture-Aware Superpixel SegmentationTexture-Aware Superpixel Segmentation
Texture-Aware Superpixel Segmentation
yukihiro domae
?
???????? - ????GIS ????? - QGIS ??????
???????? - ????GIS ????? - QGIS ?????????????? - ????GIS ????? - QGIS ??????
???????? - ????GIS ????? - QGIS ??????
MinPa Lee
?
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement Learning
Dongmin Lee
?
1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????
NAVER Engineering
?
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
?? ?
?
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
Yusuke Kaneko
?
???? ???? ??? ??? ???
???? ???? ??? ??? ??????? ???? ??? ??? ???
???? ???? ??? ??? ???
Kwangsik Lee
?
??? ?? ??? ??
??? ?? ??? ????? ?? ??? ??
??? ?? ??? ??
Hee Won Park
?
RLCode? A3C ?? ?? ????
RLCode? A3C ?? ?? ????RLCode? A3C ?? ?? ????
RLCode? A3C ?? ?? ????
Woong won Lee
?
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
WON JOON YOO
?
????? ??
????? ??????? ??
????? ??
Dongmin Lee
?
20191019 sinkhorn
20191019 sinkhorn20191019 sinkhorn
20191019 sinkhorn
Taku Yoshioka
?
Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)
Suhyun Cho
?
Diversity is all you need(DIAYN) : Learning Skills without a Reward Function
Diversity is all you need(DIAYN) : Learning Skills without a Reward FunctionDiversity is all you need(DIAYN) : Learning Skills without a Reward Function
Diversity is all you need(DIAYN) : Learning Skills without a Reward Function
Yechan(Paul) Kim
?
???? ???? ????????, ???? ??? ??????.
???? ???? ????????, ???? ??? ??????.???? ???? ????????, ???? ??? ??????.
???? ???? ????????, ???? ??? ??????.
Yongho Ha
?
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
MLaPP 9嫗 仝匯違晒侘モデルと峺方侏蛍下怛々
moterech
?
?? ?? ?? Reinforcement Learning an introduction
?? ?? ?? Reinforcement Learning an introduction?? ?? ?? Reinforcement Learning an introduction
?? ?? ?? Reinforcement Learning an introduction
Taehoon Kim
?
???? ??? - Deep Learning ?? ????
???? ??? - Deep Learning ?? ???????? ??? - Deep Learning ?? ????
???? ??? - Deep Learning ?? ????
Byoung-Hee Kim
?
Positive-Unlabeled Learning with Non-Negative Risk Estimator
Positive-Unlabeled Learning with Non-Negative Risk EstimatorPositive-Unlabeled Learning with Non-Negative Risk Estimator
Positive-Unlabeled Learning with Non-Negative Risk Estimator
Kiryo Ryuichi
?
Keras-community-week_Flax_and_Keras.pptx
Keras-community-week_Flax_and_Keras.pptxKeras-community-week_Flax_and_Keras.pptx
Keras-community-week_Flax_and_Keras.pptx
?? ?
?
Texture-Aware Superpixel Segmentation
Texture-Aware Superpixel SegmentationTexture-Aware Superpixel Segmentation
Texture-Aware Superpixel Segmentation
yukihiro domae
?
???????? - ????GIS ????? - QGIS ??????
???????? - ????GIS ????? - QGIS ?????????????? - ????GIS ????? - QGIS ??????
???????? - ????GIS ????? - QGIS ??????
MinPa Lee
?
Safe Reinforcement Learning
Safe Reinforcement LearningSafe Reinforcement Learning
Safe Reinforcement Learning
Dongmin Lee
?
1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????
NAVER Engineering
?
ddpg seminar
ddpg seminarddpg seminar
ddpg seminar
?? ?
?
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
森のあるクリエイティブレ御のつけ圭(Contextual Bandit + TS or UCB)
Yusuke Kaneko
?
???? ???? ??? ??? ???
???? ???? ??? ??? ??????? ???? ??? ??? ???
???? ???? ??? ??? ???
Kwangsik Lee
?
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
? ?? ??? ??? ??? ?? ?????. (Deep Learning for Natural Language Processing)
WON JOON YOO
?
Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)
Suhyun Cho
?

Similar to Natural Policy Gradient ??? ?? (6)

Rl from scratch part2
Rl from scratch part2Rl from scratch part2
Rl from scratch part2
Shinwoo Park
?
Machine learning bysogood
Machine learning bysogoodMachine learning bysogood
Machine learning bysogood
S.Good Kim
?
Guided policy search
Guided policy searchGuided policy search
Guided policy search
Jaehyeon Park
?
Rl from scratch part4
Rl from scratch part4Rl from scratch part4
Rl from scratch part4
Shinwoo Park
?
???/ ????? ??? ? ??? ???? ??
???/ ????? ??? ? ??? ???? ?????/ ????? ??? ? ??? ???? ??
???/ ????? ??? ? ??? ???? ??
Kwang Woo NAM
?
[?????? ????] 5. ???
[?????? ????] 5. ???[?????? ????] 5. ???
[?????? ????] 5. ???
jdo
?
Rl from scratch part2
Rl from scratch part2Rl from scratch part2
Rl from scratch part2
Shinwoo Park
?
Machine learning bysogood
Machine learning bysogoodMachine learning bysogood
Machine learning bysogood
S.Good Kim
?
Rl from scratch part4
Rl from scratch part4Rl from scratch part4
Rl from scratch part4
Shinwoo Park
?
???/ ????? ??? ? ??? ???? ??
???/ ????? ??? ? ??? ???? ?????/ ????? ??? ? ??? ???? ??
???/ ????? ??? ? ??? ???? ??
Kwang Woo NAM
?
[?????? ????] 5. ???
[?????? ????] 5. ???[?????? ????] 5. ???
[?????? ????] 5. ???
jdo
?

Natural Policy Gradient ??? ??

  • 1. 2018/06/30 ??? RLI STUDY ?? ?? ? ? ????~ ) ( 0 ( *???? ??? ?? ?? ??? ?? ?? ???? ??? ??? ?? ??? ??? ?? ????. ?? ?? ??? ???? ???? ?????.
  • 2. Natural Policy Gradient = Policy Gradient + Natural Gradient ???? ? ???
  • 3. ? ????? ??(2018? 6? ??) PPO(Proximal Policy Optimization)? ?? ?????. PPO? ??? ? ????? ?? ???? NPG? ??? ??. NPG(2001) ┐ TRPO(2015) ┐ PPO(2017)
  • 5. Policy Gradient ?? ┐ Natural Gradient ?? ┐ ?? ?? ???? ??? ??
  • 6. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 7. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 8. Review: Policy Gradient 1. ???? ????
  • 9. Review: Policy Gradient ???? ???? !(#): ??(????) ?! # ┐ &?? ┐ !(#) ??? #' + 1 = #' + + ,?!(#') Initial parameter Gradient ,?!(#') & Local maximum-(&.) Gradient Ascent -(&) ????? ??: ??? ??? ?? ?
  • 10. Review: Policy Gradient ???? ???? ?? ??? == x?: # $?: %(#) ?? ??? == x?: ?? $?: ??
  • 11. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 12. Overview: Natural Gradient 1. ????? ?? 2. ????? ???? 3. ??? ??? ??
  • 13. Overview: Natural Gradient ????? ?? ?? policy? parameter (!) ? ???? ???? ?? ???. policy? ? ? ????? ??? ??? ????, ??? ?? ?? ?? ?? ??? ???. ?? ??, policy? ??? ?????? ?? ? ?? ?? ???? (256 , 3)? ??? ???? ??? ???? ??? ????. ? ?, ??? ?? ? ??? ??? policy? 256*3=768 ??? ???. ?? ??? ????? ??? ????? ??? ? ?? ??? ???. ?? ???? Policy? ??? ????? ??? ??? 3???? ?????. Policy parameter theta ? ??? ??? ?? (3,1) ???? ??? 3.5 8.0 3.9 ?? ??? ???? ?? ???? ???? ??? ???. ???? ??? ??? ??? ??? ???? ??? ??? ???? ?? ? ?? ???. 3.8 8.0 3.0 () ()* ????? ???? ?? ??? ?? ?? ???´
  • 14. Overview: Natural Gradient ????? ?? ????? ? ??? ???? subspace? ! Manifold: ?? ?? ?. ?, ?? ? ??? ??? ???? ?. ?? ? ?? ?? 2-d? ??? ??? dimensionally reduction? ?? ? ??. (?? ??? 2-d? ?? ? ? ? ??? ?????? 2-d? ??) ???? ?(???? policy)?? ???? ?? ?? ???? ? ? ?? ??? ┐ ??? ????? ????? ??? ? ??. ┐ ?? ??? ????? ?? ? ? ? ??! ┐ ? ?? ???? ? ?? ???.
  • 15. Overview: Natural Gradient ????? ?? ????? ?? ??? ?? ???? ???? ??? ???? ????.
  • 16. Overview: Natural Gradient ????? ?? ??, ??? NPG ???? ?? ????? ? ????? Natural Gradient Method? ?? ???? ????? steepest descent direction? ???? ??! ?? ???? ??? ?? ??????. ?? ????? ????? ??? ?? ?????? ???? ??? ?? ???? ???? ??.
  • 17. Overview: Natural Gradient ????? ?? ?? ???? ???? ?? policy ?? ??? ??? ?????.
  • 18. Overview: Natural Gradient ????? ?? ????? ???? ?? B?? A1? ??? A2? ??? ??? ?? ?, ???? ????? ??? A1? ? ???? ?? ???. ??? ????? ???? B?? A1?? A2? ? ???. ?? ??? ??????? (??? ??? ?? ??.)
  • 19. Overview: Natural Gradient ????? ?? ┨ policy(B)?? policy gradient? ?? policy(B + delta B)? ? ?? policy(A1)??? policy? ?? ???? ? ?? ?? ? ??. ???? delta B? 0.0001?? ?? ?? ?????? ???? ??? ??? ??. ┨ policy(B)?? policy gradient? ?? policy(B + delta B)? ? ?? policy(A2)?? policy? ?? ? ????? ?? ? ??. ?? ?? B┐A1 ?? B┐A2? ? covariant??? ? ? ??.
  • 20. Overview: Natural Gradient ????? ?? ┨ ??? policy? ???? ???? ???? ??? (???? ??) manifold ???? ???? ? ? ? ???? ???? ???? gradient? ? ?? ???? ????.. ┨ Natural Gradient??? policy? ?? manifold? ???? ??? ??. ┨ Riemannian manifold? Manifold ???? ???? ??, ?? ??? Manifold?? ???? ????.
  • 21. Overview: Natural Gradient ????? ?? ?? ???? (?? ??)
  • 22. Overview: Natural Gradient ????? ?? ┨ ??? Natural Policy Gradient? ?? ??(?? ??)??? ?? ??(?? ??)? ??? gradient? policy? ???? ?? ???. ????? ???? ??, ??? ?? ???? ?? ?? ???. ?, ??? ?? ??? ????? ????? ?? ? ?? ?? ?? ????. ?? ??? ????? ????? ?? ?? = ?? ????? ?? ?? ??? ???? ??.
  • 23. Overview: Natural Gradient ?? Natural Gradient ? ??? ???, ?? ????? ????? ?? ??(??)? ?? ??(??)? ??? ??? ????
  • 25. Overview: Natural Gradient ?????? ?? ??? ??: ???????, ??? ??? ??? ?? ????? ????
  • 26. Overview: Natural Gradient ?????? ?? ??? --- (1) ??: ???????, ??? ??? ??? ?? ????? ????
  • 27. Overview: Natural Gradient ?????? ?? ??? --- (1) ??: ???????, ??? ??? ??? ?? ????? ????
  • 28. Overview: Natural Gradient ?????? ?? ??? --- (2) ??: ???????, ??? ??? ??? ?? ????? ????
  • 29. 30 Overview: Natural Gradient ?????? ?? ??? --- (2) ??: ???????, ??? ??? ??? ?? ????? ????
  • 30. Overview: Natural Gradient ??? ?? ???? ??. ??? ??? ?? ??? ??
  • 31. Overview: Natural Gradient ??? ?? ???? ??. ??? sin(x)?? ??. x = 0 ??? 1? ???? y=x x = 0 ?? ??? ?? ?? ??? ?? ??? ??? ???? gradient? ? ??(?????? ?? ???? ??) x = 0 ???? gradient??? ??. ??? ??? ??
  • 32. Overview: Natural Gradient ??? ?? ???? ??. ??? ??? ?? ??: https://m.blog.naver.com/PostView.nhn?blogId=papers&logNo=220751624818&proxyReferer=https%3A%2F%2Fwww.google.com%2F
  • 33. Overview: Natural Gradient ??? ?? ???? ??. ??? ??? ?? ??? e^x ? ?, ??: https://m.blog.naver.com/PostView.nhn?blogId=papers&logNo=220751624818&proxyReferer=https%3A%2F%2Fwww.google.com%2F
  • 34. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 36. Abstract Key Information 1. We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space. 2. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action.
  • 37. Abstract 3. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al. 4. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris. ?? ??????? ?? ????, policy iteration?? one improvement step?? ??? ?? ??? ???? ???. ?? ?? ?????? ?? ??? ????? ??´ Simple MDP, Tetris MDP?? ??? ?? ?????.
  • 38. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 40. Introduction 1. Policy gradient: ?? ??? gradient? ?? ??? policy? ?? ?? 2. ???? ??? ??? ????? ??? ????? SGD? non-covariant 3. ?? ??? ??? ????? Natural gradient? covariant 4. Policy iteration? ???? ?? (??? ??) 5. ?? ??? ??? ??. (??? ? ? ??)
  • 41. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 42. Natural Gradient 1. Assume 2. ??? ?? 3. Natural Gradient ??
  • 43. Natural Gradient Natural Gradient? ?? ???? ?? ? ??? ???? ???. Assume ┐ Gradient ??? ??(??) ?? ┐ Natural Gradient ??
  • 44. ┨ Finite MDP: (", $%, &, ', () ┨ Decision making method: policy(a;s) ┨ Every policy is ergodic (well-defined stationary distribution *+ ) ┨ Average reward: ┨ State-action value: ┨ Value function: ┨ Policy , ??? ??? ?? ??? ?? ergodic: ??????*stationary distribution = stationary distribution ´ ????? ????????? ??? ??? ?????? ??? ?? ??? ??. ?? ?? ???? ????? ??? ?? ? ??? ???????? ?? ????? ?? ??? ?? ??(stationary distribution)? ??? ???. Natural Gradient Assume
  • 45. ┨ The exact gradient of the average reward: Natural Gradient Assume ??? ??~ ?? ??! ?? ??? ??? ? `policy ???? ? ? ????? ? ? ??!¨ ??. Policy ???? ???? ?? Policy? ?? ???? ??? `?¨ ???? ???? ???. ? ?? ???? ??? ?? ???? ??? gradient??? ?? ????.
  • 46. ┨ ???? ??(steepest descent direction): !(# + %#)? ??????? ?? (%#) ┨ ???? ??(??): |%#|( Natural Gradient ??? ?? ??? .. ?? ??? ?? ?? ?? pg? ??? ??.. ?? ???? ??(??)? ?? ???? ??? ??!!! ? ??? ??!
  • 47. |"#|$ = "#"# ???? ????? ???? # ?? ??? ??? ???? ??. ??? ??? ?? ???? #? ??? ??? ?????? ???. ?? ??? ? &' ? ?????? ??? ? ????? ???? ??? ???. ? &'? ????? ????(?? ????)? ??? ? ??. ??? ???? #? ?? ???? ?? ?? ???? ??? ?? ?? ? ? ??. ┬ ???? ??? ?? ??? ?? ? ? ??? ??? ?? ?? ? ??. ?? ??, #1, #2 ?? ??? ???? ?? ???. ???? ??? ?? ?? ??? ? ?? ? ? general? ?? ?? ?? ????! Natural Gradient ??? ??
  • 48. ?? ???? ????? ??? ??? ??. Natural Gradient ??? ?? ?? ?? ?? ????? 2d? ?? ?? ? ??? ???? ????? ??? ?? ??? ??? ? !1, !2? ??? ???.
  • 49. Natural Gradient ??? ?? ???? ??? ???? ??? ? ?? ? ? general?? ?? ?? ?? ????! !(#) ?? ??? ?????´ ?? ??? ?? ??? ??.
  • 50. Natural Gradient ??? ?? !(#) !(#) %(&) ? positive-definite matrix(?? ??? ??). ?? ???? ???? ??? ???? ??? ?? ????. ????? ??? ?? ?? ??? ??? ??.. ??? ??? ??? ???? ??. http://bskyvision.com/205
  • 51. Natural Gradient Natural Gradient ?? !(#) !(#) ? identity matrix? ? ?? ???? ???? ??? ?? ?? ??. ?? ??(?? ????)? ???? ?? ??? ?? ??? Natural Gradient ?? ??. ?? ???? Manifold ??? ?? ????? ??? ? % ?? ??? ???(variant) ??? policy ???? ???? ?? ? ??? ?? invariant?? ???? ??? ? ??? Fisher information matrix? !(#)? ?? ???. Fisher information matrix FIM: ?? ??? ?? ??? positive definite matrix ? ?? ??.
  • 53. Natural Gradient Natural Gradient ?? NPG??? steepest descent direction? ??? ??. ??? ?? ?? ???. ???? ??? ?? ????(????? NPG??) (https://dnddnjs.github.io/paper/2018/05/22/natural-policy- gradient/)
  • 54. Natural Gradient Natural Gradient ?? NPG??? steepest descent direction? ??? ??. ??? ??? ?? ???. `FIM??? positive-definite matrix? ?? ?? ??? ??? ?? ? ??? ????? ???? ??.¨
  • 55. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion NPG? ??? ??? 3 ?? ??? ?? ??..
  • 56. Theorem 1: Compatible Function Approximation 1. Assume 2. Theorem 3. Proof 4. meaning
  • 57. Theorem 1: Compatible Function Approximation ??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??. (?? ?? ??? ??.) *??? Function Approximation? ??? ??? ??
  • 58. ??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??. Theorem 1: Compatible Function Approximation Assume ??? ? policy? ??? f ? y=ax ? linear function ? Q value? ??? ?? **??? ????? ??? ??? ?? ???? ??. Q? ??? f? ?? Q? ??? ? MSE
  • 59. ??1: Natural Policy Gradient? Function Approximation? ?? ? ? ??. Theorem 1: Compatible Function Approximation Theorem ?? ??? ??? ??.
  • 60. Theorem 1: Compatible Function Approximation Theorem ???? ??? ??
  • 61. Theorem 1: Compatible Function Approximation meaning ??? ??? ??
  • 62. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 63. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
  • 64. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) ??2: policy? exponential ??? ????, ?????? ???? ??? ?? policy? ??? ??? ???. ??? ??? ??? ?? ??? ??? ?? ?? ??? ????
  • 65. Overview: Natural Gradient ????? ?? ┨ ??? Natural Policy Gradient? ?? ??(?? ??)??? ?? ??(?? ??)? ??? gradient? policy? ???? ?? ???. ????? ???? ??, ??? ?? ???? ?? ?? ???. ?, ??? ?? ??? ????? ????? ?? ? ?? ?? ?? ????. ?? ??? ????? ????? ?? ?? = ?? ????? ?? ?? ??? ???? ??.
  • 66. ??? ???? ???? ?????? ?? ?????. Theorem 2: Greedy Policy Improvement ( policy 『 exp )
  • 67. 1. 2? ??? 1? ???? policy ??? ? ???? ??. 2. 2? ??: ?? theta ?? ???? ?????? ??? ? ??? best action? ???? ??. 3. 1? ??: theta? ?? ???? ?? gradient? ?? 0? ????? ?????? ??? ????? best action ? ??? ??? ? ??.(better action??? ?? ? ? ?? ???) Theorem 2: Greedy Policy Improvement ( policy 『 exp )
  • 68. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 69. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 1. Theorem 2. Result 3. Solution C Line Search
  • 70. ??3: policy? general function?? Natural policy gradient? policy? ???? ?? ?? ??? ??? 1? ??? ??. * ?? ????? 1? ?? Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Theorem
  • 71. Policy ? exp? ?? general function? ?? Greedy action ? ???? ? ???. ? ??? ??? ?? policy? log(1+x)?? ?????. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Result
  • 72. ??? 2? ???? ????? ?? ??? ?? policy ??? ?? ???? ??? ? ? ??. ?, ?? ??? a? ???? ?? ??? ???? policy ????? ? ???? ?? ?? greedy action? ??? ???. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Result
  • 73. ?? ?? ??? ???? (a)? ?? ????? ?? ???? ?? ??? ???, (b)? ?? ??? ?? ?? ??? ?? ??. ?? ????? ??? ? ??, ??? ?? ?? ???? ???(c). ??? policy? f(x) = x3 - 2x + 1 ?? ??? ?? ???. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Result (d)(c)(a) (b)
  • 74. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Solution C Line Search
  • 75. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) Solution C Line Search ??: ??????? http://darkpgmr.tistory.com/149
  • 76. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 77. Metrics and Curvatures 1. FIM? Hessian ???
  • 78. 79 ? Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F. FIM ?? G? ?? ? ?? ???? ???? ? In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient FIM? ?? Hessian?? ???. ??? ????? ?????. ┨ FIM? Hessian? ?? ????? - FIM? stochastic?? ????? ?? ?, Hessian? ????? ? ???? ?? ? ??. - ?, ????? ?????? ????? ?? ??? ?? ??? ? ?? FIM┐Hessian ???? - ?? ????? ????? ????????? ?? ????? ??? ??? ? ?????.(? ??..) Metrics and Curvatures FIM? Hessian ???
  • 80. 81 Metrics and Curvatures FIM? Hessian ??? ? Our situation is more similar to the blind source separation case where a metric is chosen based on the underlying parameter space (of non-singular matrices) and is not necessarily asymptotically efficient (ie does not attain second order convergence) ┨ Metric? ???? ??? ?? ????? Blind Source Separation? ??? ????. ? As argued by Mackay , one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, Mackay arrives at the same result as Amari for the blind source separation case. ┨ ????? data-independent? ?? metric?? ??? ?? ?? ??? ??? ? ? ??!(?) ┨ ?? ? ?? ??? data-independent?? ??. ┨ ??? ?? ?????? ? ??? ?? FIM?? ? ???? ???. ┨ ?? ??? ?? Q? policy? ???? ??? ??? ??? ??.. ??? ?? ?????? Q?? ?? ????! (G * d_theta * d_theta ??) ┨ ? FIM? Hessian?? ??? ??? ??. ???????? ???? positive definite?? ?? ?? ??. ┨ ??? ?? ??? ????? ? ???? ?? ???? ?? ?? ?? ???. ┨ ?? ?? ??? ????? ??? Conjugate Method(inversed hessian)? ? ??????? ??.
  • 82. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 83. Experiments 1. simple I-dimensional linear quadratic regulator with dynamics 2. a simple 2-state MDP 3. The game of Tetris
  • 84. 85 Experiments simple I-dimensional linear quadratic regulator with dynamics The goal is to apply a control signal u to keep the system at x = 0, (incurring a cost of X(t)^2 at each step). ??: policy: ??:
  • 85. 86 Experiments simple I-dimensional linear quadratic regulator with dynamics ?? ? ?? npg, ??? ??? ?? gradient 1. NPG? ? ?? ???? 2. NPG? parameter theta? ???? ???? invariant? ??(covariance)? ?? ??? ????? ????. 3. ? ??? ?? stationary distribution? ????? ??? ?? ??? ???.
  • 86. 87 Experiments a simple 2-state MDP ????: i? stationary distribution=0.8 j? stationary distribution=0.2 ??: j? ???? ?? ? Policy: ?? gradient ??(solid) NPG(dashed) ?? gradient ?? C (top): stationary distribution? ??? ???. ?? 1?? ? ? ????. / C(bottom): NPG? ?? 2? ? ?? ????. (?? ??)
  • 87. Experiments a simple 2-state MDP ?? gradient ??(solid) NPG(dashed) Parameter? ?????? ?? ?? gradient? ??? theta_i? ?? ?????? ?? NPG? i? j? ??? ? ???? ??.
  • 88. Experiments The game of Tetris ??: ???? policy:
  • 89. ??? ?? 1. Review: Policy Gradient 2. Overview: Natural Gradient 3. Abstract 4. Introduction 5. Natural Gradient 6. Theorem 1: Compatible Function Approximation 7. Theorem 2: Greedy Policy Improvement ( policy 『 exp ) 8. Theorem 3: Greedy Policy Improvement ( policy == general parameterized policy ) 9. Metrics and Curvatures 10. Experiments 11. Discussion
  • 91. 1. gradient method? Greedy policy iteration? ?? policy? ?? ??? ???. 2. ?? ????? ??. ? natural gradient? ?? = policy improvement ?? ?? ??? 3. Line search ?? natural gradient? policy improvement ??? ?????. 4. Policy improvement? ?? NPG? ?? ??? ???? ??. Discussion
  • 92. 5. Fisher Information matrix? asymtotically Hessian?? ???? ??. asymtotically conjugate gradient method(Hessian? inverse? approx.? ??? ??)? ? ?? ?? ? ? ?. 6. ??? Hessian? ?? informative?? ??(hessian? ?? ??? ??? positive definite? ?? ??? ??? ?? ??? convex? ?? ? ? ????? ??? ??? ??? hessian? ?? positive definite? ?? ? ??? ???) tetris?? ??? natural gradient method? ? ???? ? ??(pushing the policy toward choosing greedy optimal actions) 7. conjugate gradient method? ? ? maximum? ??? ?????, performance? maximum?? ?? ????? ??? ??? ???(?). ? ??? ??? ???? ?? ??. Discussion ??: https://dnddnjs.github.io/paper/2018/05/22/natural-policy-gradient/
  • 93. 94 ?? ???? ?? ??? ┬ Natural gradient descent? ?? ??? ???? ??? gradient descent?? ??? ?? ??? ???. ?? ? ? ??? ???? ┬ ???? ?? ????. ????
  • 95. 96 ?? ?? ?? ???? ?? ?? ????
  • 96. 1. ???, ¨A Natural Policy Gradient¨, (https://dnddnjs.github.io/paper/2018/05/22/natural- policy-gradient/) 2. ???????, ¨??? ??? ??? ??¨,(http://darkpgmr.tistory.com/149) 3. ???, `?????? ?? ? 1/3¨, (https://www.youtube.com/watch?v=o_peo6U7IRM&feature=youtu.be) 4. ???, ??? Thanks to´