The document discusses control as inference in Markov decision processes (MDPs) and partially observable MDPs (POMDPs). It introduces optimality variables that represent whether a state-action pair is optimal or not. It formulates the optimal action-value function Q* and optimal value function V* in terms of these optimality variables and the reward and transition distributions. Q* is defined as the log probability of a state-action pair being optimal, and V* is defined as the log probability of a state being optimal. Bellman equations are derived relating Q* and V* to the reward and next state value.
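As a reminder of what those Bellman relations look like under this formulation, here is the standard backup implied by defining p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t)); this is the textbook control-as-inference result, sketched here for orientation rather than taken from the slides.

```latex
Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\!\left[\exp V(s_{t+1})\right],
\qquad
V(s_t) = \log \int \exp Q(s_t, a_t)\, \mathrm{d}a_t .
```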
The document summarizes recent research related to "theory of mind" in multi-agent reinforcement learning. It discusses three papers that propose methods for agents to infer the intentions of other agents by applying concepts from theory of mind:
1. The papers propose that in multi-agent reinforcement learning, being able to understand the intentions of other agents could help with cooperation and increase success rates.
2. The methods aim to estimate the intentions of other agents by modeling their beliefs and private information, using ideas from theory of mind in cognitive science. This involves inferring information about other agents that is not directly observable.
3. Bayesian inference is often used to reason about the beliefs, goals, and private information of other agents based on their observed actions.
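Schematically (not taken from any of the three papers), the Bayesian update over another agent's hidden goal g given its observed actions a_{1:t} and the shared states s_{1:t} has the usual form:

```latex
p\!\left(g \mid a_{1:t}, s_{1:t}\right) \;\propto\; p\!\left(a_{1:t} \mid g, s_{1:t}\right)\, p\!\left(g\right).
```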
This document provides an overview of POMDP (Partially Observable Markov Decision Process) and its applications. It first defines the key concepts of POMDP such as states, actions, observations, and belief states. It then uses the classic Tiger problem as an example to illustrate these concepts. The document discusses different approaches to solve POMDP problems, including model-based methods that learn the environment model from data and model-free reinforcement learning methods. Finally, it provides examples of applying POMDP to games like ViZDoom and robot navigation problems.
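To make the belief-state idea concrete, here is a minimal sketch of the Bayesian belief update for the Tiger problem, assuming the common textbook parameterization in which listening reports the correct door with probability 0.85 and does not move the tiger.

```python
# A small sketch of the POMDP belief update for the Tiger problem, assuming the
# standard textbook parameterization: "listen" reports the correct door with
# probability 0.85 and leaves the tiger where it is.
def update_belief(b_left, obs_left, hear_correct=0.85):
    """Posterior probability that the tiger is behind the left door."""
    p_obs_given_left = hear_correct if obs_left else 1.0 - hear_correct
    p_obs_given_right = (1.0 - hear_correct) if obs_left else hear_correct
    num = p_obs_given_left * b_left
    return num / (num + p_obs_given_right * (1.0 - b_left))

b = 0.5                               # uniform prior over {tiger-left, tiger-right}
b = update_belief(b, obs_left=True)   # hear the tiger on the left
print(b)                              # approx. 0.85
b = update_belief(b, obs_left=True)   # hear it on the left again
print(b)                              # approx. 0.97
```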
The document presents an overview of the research group 'Generations' focused on image generation and generative models, detailing their contributions to fields like unpaired image-to-image translation and domain adaptation. It highlights various studies and techniques, including CycleGAN and neural radiance fields, aimed at enhancing image translation while preserving contextual integrity. The group is actively seeking new members for collaboration on these innovative themes.
This document summarizes a presentation on offline reinforcement learning. It discusses how offline RL can learn from fixed datasets without further interaction with the environment, which allows for fully off-policy learning. However, offline RL faces challenges from distribution shift between the behavior policy that generated the data and the learned target policy. The document reviews several offline policy evaluation, policy gradient, and deep deterministic policy gradient methods, and also discusses using uncertainty and constraints to address distribution shift in offline deep reinforcement learning.
This document summarizes a research paper on scaling laws for neural language models. Some key findings of the paper include:
- Language model performance depends strongly on model scale and weakly on model shape. With enough compute and data, performance scales as a power law of parameters, compute, and data.
- Overfitting is universal, with penalties depending on the ratio of parameters to data.
- Large models are more sample-efficient and can reach the same performance levels with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
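For reference, the power-law relationships summarized above take the following schematic form in the paper, with N the number of (non-embedding) parameters, D the dataset size, and C the compute budget; the fitted constants and exponents are omitted here.

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}.
```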
This document discusses the relationship between control as inference, reinforcement learning, and active inference. It provides an overview of key concepts such as Markov decision processes (MDPs), partially observable MDPs (POMDPs), optimality variables, the evidence lower bound (ELBO), variational inference, and the free energy principle as applied to active inference. Control as inference frames reinforcement learning as probabilistic inference by defining a generative process and performing variational inference to find an optimal policy. Active inference uses the free energy principle and minimizes expected free energy to select actions that resolve uncertainty.
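As a pointer to how the pieces fit together: if the variational trajectory distribution keeps the true dynamics and only the policy is free, the ELBO on log p(O_{1:T}) reduces to the maximum-entropy RL objective (a standard control-as-inference identity, stated here for orientation):

```latex
\log p(\mathcal{O}_{1:T}) \;\ge\; \mathbb{E}_{q}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] + \sum_{t=1}^{T} \mathbb{E}_{q}\!\left[\mathcal{H}\!\left[\pi(\cdot \mid s_t)\right]\right].
```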
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
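The adversarial training process referred to above is the familiar minimax game between the generator G and discriminator D (the original GAN objective, included here only as a reminder):

```latex
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\big(1 - D(G(z))\big)\right].
```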
This document discusses several semi-supervised deep generative models for multimodal data, including the Semi-Supervised Multimodal Variational AutoEncoder (SS-MVAE), Semi-Supervised Hierarchical Multimodal Variational AutoEncoder (SS-HMVAE), and their training procedures. The SS-MVAE extends the Joint Multimodal Variational Autoencoder (JMVAE) to semi-supervised learning. The SS-HMVAE introduces auxiliary variables to model dependencies between modalities more flexibly. Both models maximize a variational lower bound with supervised and unsupervised objectives. The document provides technical details of the generative processes, variational approximations, and optimization of these semi-supervised deep generative models.
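For orientation, the generic single-modality semi-supervised VAE bounds that these models build on (Kingma et al., 2014) look as follows; the SS-MVAE/SS-HMVAE variants extend them to multiple modalities and, in the hierarchical case, auxiliary variables, which is not shown here.

```latex
\text{labeled: } \log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)\right] = -\mathcal{L}(x, y),
\qquad
\text{unlabeled: } \log p_\theta(x) \ge \sum_{y} q_\phi(y \mid x)\,\big(-\mathcal{L}(x, y)\big) + \mathcal{H}\!\left[q_\phi(y \mid x)\right].
```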
(DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks, by Masahiro Suzuki
This document discusses a variational dropout method that learns individual dropout rates per weight with no upper bound, which sparsifies deep neural networks while remaining applicable to CNNs. It outlines the Bayesian and variational inference techniques used to estimate posterior distributions, such as stochastic optimization and the reparameterization trick, and highlights multiplicative Gaussian noise as the mechanism underlying this form of dropout.
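A minimal sketch of the multiplicative Gaussian-noise mechanism mentioned above, in PyTorch; this shows only the reparameterized forward pass, not the paper's learned per-weight dropout rates or their KL penalty.

```python
import torch
import torch.nn as nn

# Sketch of multiplicative Gaussian-noise dropout: during training each activation
# is multiplied by xi ~ N(1, alpha), sampled via the reparameterization trick.
class GaussianDropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.alpha = p / (1.0 - p)   # noise variance corresponding to dropout rate p

    def forward(self, x):
        if not self.training:
            return x                 # noise has mean 1, so do nothing at test time
        eps = torch.randn_like(x)
        return x * (1.0 + self.alpha ** 0.5 * eps)

layer = GaussianDropout(p=0.3)
layer.train()
print(layer(torch.ones(2, 4)))
```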
(DL Reading Group) Matching Networks for One Shot Learning, by Masahiro Suzuki
1. Matching Networks is a neural network architecture proposed by DeepMind for one-shot learning.
2. The network learns to classify novel examples by comparing them to a small support set of examples, using an attention mechanism to focus on the most relevant support examples.
3. The network is trained using a meta-learning approach, where it learns to learn from small support sets to classify novel examples from classes not seen during training.
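The attention-based classification rule referred to in point 2 has the following form (Vinyals et al., 2016), with f and g the embedding networks, c cosine similarity, and {(x_i, y_i)} the support set:

```latex
\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i,
\qquad
a(\hat{x}, x_i) = \frac{\exp\!\big(c(f(\hat{x}), g(x_i))\big)}{\sum_{j=1}^{k} \exp\!\big(c(f(\hat{x}), g(x_j))\big)}.
```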
The document discusses Bayesian neural networks and related topics, including stochastic neural networks, variational autoencoders, and modeling prediction uncertainty in neural networks. Key points include using Bayesian techniques such as MCMC and variational inference to place distributions over the weights of neural networks, treating both model parameters and predictions as distributions, and showing how this captures uncertainty in the network's predictions.
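Concretely, "treating predictions as distributions" means the posterior predictive, which is approximated by averaging over weight samples drawn from MCMC or from a variational posterior q(w); a standard identity, noted here for completeness:

```latex
p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^* \mid x^*, w_s), \qquad w_s \sim q(w).
```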
(DL Hacks Reading) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks, by Masahiro Suzuki
This document discusses techniques for training deep variational autoencoders and probabilistic ladder networks. It proposes three advances: 1) Using an inference model similar to ladder networks with multiple stochastic layers, 2) Adding a warm-up period to keep units active early in training, and 3) Using batch normalization. These advances allow training models with up to five stochastic layers and achieve state-of-the-art log-likelihood results on benchmark datasets. The document explains variational autoencoders, probabilistic ladder networks, and how the proposed techniques parameterize the generative and inference models.
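A minimal sketch of the warm-up trick in point 2, assuming a VAE-style loss with a reconstruction term and a KL term; the linear schedule and its length are illustrative choices, not the paper's exact settings.

```python
# KL annealing ("warm-up"): ramp the KL coefficient from 0 to 1 early in training
# so that the latent units stay active instead of collapsing to the prior.
def kl_weight(epoch, warmup_epochs=100):
    return min(1.0, epoch / warmup_epochs)

# loss = reconstruction_loss + kl_weight(epoch) * kl_divergence
for epoch in (0, 50, 100, 200):
    print(epoch, kl_weight(epoch))
```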
(DL Hacks Reading) Variational Inference with Rényi Divergence, by Masahiro Suzuki
This document discusses variational inference with Rényi divergence. It summarizes variational autoencoders (VAEs), which are deep generative models that parametrize a variational approximation with a recognition network. VAEs define a generative model as a hierarchical latent variable model and approximate the intractable true posterior using variational inference. The document explores using Rényi divergence as an alternative to the evidence lower bound objective of VAEs, as it may provide tighter variational bounds.
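The alternative objective in question is the variational Rényi bound (Li & Turner, 2016), which recovers the standard ELBO in the limit α → 1:

```latex
\mathcal{L}_{\alpha}(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z \mid x)}\!\left[\left(\frac{p(x, z)}{q(z \mid x)}\right)^{1-\alpha}\right].
```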
This document discusses deep Kalman filters, which combine deep learning and Kalman filtering. It proposes replacing the linear transformations in classical Kalman filters with nonlinear transformations parameterized by neural networks. This allows the model to learn patterns in noisy sequential data and model the effects of external actions. The model is evaluated on synthetic and real patient data, showing it can successfully perform counterfactual inference about the effects of anti-diabetic drugs on diabetic patients.
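Schematically, the model replaces the linear-Gaussian transitions of a classical Kalman filter with neural-network parameterizations of roughly the following form (notation here is generic, with actions denoted u_t and f_μ, f_Σ neural networks):

```latex
z_1 \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad
z_t \sim \mathcal{N}\!\big(f_\mu(z_{t-1}, u_{t-1}),\; f_\Sigma(z_{t-1}, u_{t-1})\big), \qquad
x_t \sim p_\theta(x_t \mid z_t).
```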
(Research Group Reading) Weight Uncertainty in Neural Networks, by Masahiro Suzuki
Bayes by Backprop is a method for introducing weight uncertainty into neural networks using variational Bayesian learning. It represents each weight as a probability distribution rather than a fixed value. This allows the model to better assess uncertainty. The paper proposes Bayes by Backprop, which uses a simple approximate learning algorithm similar to backpropagation to learn the distributions over weights. Experiments show it achieves good results on classification, regression, and contextual bandit problems, outperforming standard regularization methods by capturing weight uncertainty.
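A compact sketch of the idea in PyTorch: each weight gets a Gaussian q(w) = N(μ, softplus(ρ)²), weights are sampled with the reparameterization trick, and a single-sample estimate of log q(w) − log p(w) is added to the data term. The prior, layer sizes, and loss weighting below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer whose weights are Gaussian distributions, not point estimates."""
    def __init__(self, in_f, out_f, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_f, in_f))
        self.rho = nn.Parameter(torch.full((out_f, in_f), -3.0))
        self.prior = torch.distributions.Normal(0.0, prior_std)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps                       # reparameterized weight sample
        q = torch.distributions.Normal(self.mu, sigma)
        # single-sample complexity term: log q(w) - log p(w)
        self.kl = (q.log_prob(w) - self.prior.log_prob(w)).sum()
        return x @ w.t()

layer = BayesianLinear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = F.mse_loss(layer(x), y) + 1e-3 * layer.kl        # data term + weighted KL term
loss.backward()
```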
The document discusses deep kernel learning, which combines deep learning and Gaussian processes (GPs). It briefly reviews the predictive equations and marginal likelihood for GPs, noting their computational requirements. GPs assume datasets with input vectors and target values, modeling the values as joint Gaussian distributions based on a mean function and covariance kernel. Predictive distributions for test points are also Gaussian. The goal of deep kernel learning is to leverage recent work on efficiently representing kernel functions to produce scalable deep kernels, allowing outperformance of standalone deep learning and GPs on various datasets.
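The GP predictive equations being reviewed are, for a zero-mean prior with kernel matrix K over the training inputs, noise variance σ², training targets y, and k_* the vector of covariances between a test point x_* and the training inputs:

```latex
\mu_* = \mathbf{k}_*^{\top} \left(K + \sigma^2 I\right)^{-1} \mathbf{y},
\qquad
\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top} \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_* .
```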
(Research Group Reading) Facial Landmark Detection by Deep Multi-task Learning, by Masahiro Suzuki
The document summarizes a research paper on facial landmark detection using deep multi-task learning. It proposes a Tasks-Constrained Deep Convolutional Network (TCDCN) that uses facial landmark detection as the main task and related auxiliary tasks like pose estimation and attribute inference to improve performance. The TCDCN learns shared representations across tasks using a deep convolutional network. It introduces task-wise early stopping to halt learning on auxiliary tasks that reach optimal performance early to avoid overfitting and improve convergence on the main task of landmark detection. Experimental results showed the proposed approach outperformed existing methods.
(DL Hacks Reading) How transferable are features in deep neural networks?, by Masahiro Suzuki
This document summarizes an experiment on measuring how transferable features are in deep neural networks. The experiment trained neural networks on halves of the ImageNet dataset and tested how well the networks could generalize to the other half. It found that earlier layer features transferred better than later layer features, and that fine-tuning improved performance. Transferring between more dissimilar datasets led to poorer performance. Randomly initialized weights performed worse than trained weights.
7. “The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.” (Forrester, 1971)
[Chang+ 17, Cell]
12. MDP: states s, actions a, rewards r, next state s′, and transition distribution p(s′ | s, a)
Recap: the reinforcement learning objective
The Anatomy of a Reinforcement Learning Problem
Slide from Sergey Levine
13. Model-based RL Review
- Correcting for model errors: refitting the model with new data, replanning with MPC, using local models
- Model-based RL from raw observations: learn a latent space (typically with unsupervised learning), or model and plan directly in observation space
- Improve the policy, e.g., by backpropagating through the model (supervised learning)
- Even simpler: generic trajectory optimization, solved however you want
- How can we impose constraints on trajectory optimization?
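To make the "fit a model, then plan" loop above concrete, here is a toy, self-contained sketch (entirely illustrative, not from the slides): a 1-D system, a least-squares linear dynamics model refit after every step, and random-shooting MPC for replanning.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_dynamics(s, a):                  # the real system, unknown to the agent
    return 0.9 * s + 0.5 * a + 0.01 * rng.normal()

def fit_linear_model(S, A, S_next):
    # least-squares fit of s' ~ w1*s + w2*a
    X = np.stack([S, A], axis=1)
    w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return w

def mpc_action(w, s, horizon=5, n_samples=64):
    # random-shooting MPC: sample action sequences, roll out the learned model,
    # return the first action of the best sequence (cost = squared distance to 0)
    best_a, best_cost = 0.0, np.inf
    for _ in range(n_samples):
        a_seq = rng.uniform(-1, 1, size=horizon)
        s_pred, cost = s, 0.0
        for a in a_seq:
            s_pred = w[0] * s_pred + w[1] * a
            cost += s_pred ** 2
        if cost < best_cost:
            best_a, best_cost = a_seq[0], cost
    return best_a

S, A, S_next = [], [], []
s = 5.0
for step in range(200):
    a = rng.uniform(-1, 1) if step < 20 else mpc_action(w, s)   # explore, then plan
    s_next = true_dynamics(s, a)
    S.append(s); A.append(a); S_next.append(s_next)
    w = fit_linear_model(np.array(S), np.array(A), np.array(S_next))  # refit model
    s = s_next
print("final state (should be near 0):", s)
```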
14. Policy search methods: RBF / DNN policies, PILCO, guided policy search (trajectory optimization), CMA-ES
Policy Search Classification
Yet, it's a grey zone…
Important Extensions:
- Contextual Policy Search [Kupcsik, Deisenroth, Peters & Neumann, AAAI 2013], [Silva, Konidaris & Barto, ICML 2012], [Kober & Peters, IJCAI 2011], [Parisi & Peters et al., IROS 2015]
- Hierarchical Policy Search [Daniel, Neumann & Peters, AISTATS 2012], [Wingate et al., IJCAI 2011], [Ghavamzadeh & Mahadevan, ICML 2003]
Figure: a spectrum of methods between Direct Policy Search and Value-Based RL, including Evolutionary Strategies (CMA-ES), Episodic REPS, Policy Gradients (eNAC), Actor Critic / Natural Actor Critic, Model-based REPS, PS by Trajectory Optimization, PILCO, Advantage Weighted Regression, Conservative Policy Iteration, Q-Learning, Fitted Q, and LSPI.
Model-Based Policy Search Methods
Learn a dynamics model from the data set
+ More data efficient than model-free methods
+ More complex policies can be optimized
  RBF networks [Deisenroth & Rasmussen, 2011]
  Time-dependent feedback controllers [Levine & Koltun, 2014]
  Gaussian Processes [Van Hoof, Peters & Neumann, 2015]
  Deep neural nets [Levine & Koltun, 2014][Levine & Abbeel, 2014]
Limitations:
- Learning good models is often very hard
- Small model errors can do drastic damage to the resulting policy (due to optimization)
- Some models are hard to scale
- Computational complexity
15. PILCO
PILCO (probabilistic inference for learning control) [Deisenroth+ 11]
RBF policy
Greedy Policy Updates: PILCO [Deisenroth & Rasmussen 2011]
Model Learning:
- Use Bayesian models (Gaussian Processes) which integrate out model uncertainty
- Reward predictions are not specialized to a single model
Internal Simulation:
- Iteratively compute the state distributions p(x_1), ..., p(x_T)
- Moment matching: deterministic approximate inference
Policy Update:
- Analytically compute the expected return and its gradient
- Greedily optimize with BFGS
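For reference, the quantity optimized in the policy-update step is the expected long-term cost under the moment-matched Gaussian state distributions, whose gradient PILCO computes analytically and optimizes with BFGS:

```latex
J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{x_t}\!\left[c(x_t)\right],
\qquad
p(x_t) \approx \mathcal{N}(\mu_t, \Sigma_t) \ \ \text{(moment matching through the GP dynamics)}.
```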
What's the problem with backpropagating through the model?
- Similar parameter sensitivity problems as shooting methods
- But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps, so no dynamic programming
- Similar problems to training long RNNs with BPTT: vanishing and exploding gradients
- Unlike LSTM, we can't just "choose" simple dynamics; the dynamics are chosen by nature
20. Learning Deep Dynamical Models From Image Pixels [Wahlström+ 14]; From Pixels to Torques: Policy Learning with Deep Dynamical Models [Wahlström+ 15]
Deep dynamical model (DDM)
22. VAE
Latent variable z ~ p(z); observation x generated by the decoder p_θ(x|z)
(a) Learned Frey Face manifold  (b) Learned MNIST manifold
Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z. For each of these values z, we plotted the corresponding generative p_θ(x|z) with the learned parameters θ. [Kingma+ 13]
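A small sketch of the visualization procedure described in the caption, with `decode` a hypothetical stand-in for a trained decoder p_θ(x|z):

```python
import numpy as np
from scipy.stats import norm

# Map linearly spaced coordinates on the unit square through the inverse CDF of the
# standard Gaussian prior to get latent values z, then decode each z.
def decode(z):
    return z  # placeholder: a real model would return a decoded image

grid = np.linspace(0.05, 0.95, 10)
images = [[decode(norm.ppf([u1, u2])) for u2 in grid] for u1 in grid]
```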
47. MDN-RNN, VAE
Friston
Wahlström (M, V)
VRNN [Chung+ 15]
48. Friston's free energy principle
https://en.wikipedia.org/wiki/Free_energy_principle
…the internal model is then realized by a generative model.
A well-known framework that treats the internal model as a generative model in the machine-learning sense and connects it to action is Friston's free-energy principle [Friston 10]. The free-energy principle holds that biological systems maintain their order by minimizing the free energy of their internal states.
Consider a generative model p_θ(x, z) with state x and latent variable z, and let q_φ(z) be the approximate distribution. Define the variational free energy (the negative variational lower bound, an upper bound on the negative marginal likelihood) as F(x; φ, θ) = −E_{q_φ(z)}[log p_θ(x, z)] − H[q_φ(z)]. Under the free-energy principle, the internal parameters φ and the action a are updated so as to minimize the (variational) free energy:
φ̂ = arg min_φ F(x; φ, θ),
â = arg min_a F(x; φ, θ).
Here, the arg min over a means taking the action a that selects a state x for which the free energy is minimized. The parameters θ of the generative model are updated after the above updates have been repeated a fixed number of times.
In the free-energy principle, the input is simply treated as a state x. When a state x is received, the internal state is updated, and then, based on the generative model, an action a is taken that selects a state x minimizing the free energy. In reality, however, stimuli from the external world arrive as multimodal information through the five senses, so the free energy involves multiple modalities such as x and w.
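For completeness, the same variational free energy can be rewritten to show why minimizing it makes sense: it upper-bounds surprise (the negative log evidence), with the gap being the KL divergence between the approximate and true posteriors:

```latex
F(x; \phi, \theta)
= -\,\mathbb{E}_{q_\phi(z)}\!\left[\log p_\theta(x, z)\right] - \mathcal{H}\!\left[q_\phi(z)\right]
= \mathrm{KL}\!\left[q_\phi(z) \,\|\, p_\theta(z \mid x)\right] - \log p_\theta(x)
\;\ge\; -\log p_\theta(x).
```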
49. LeCun
Y. LeCun: How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry): the machine predicts a scalar reward given once in a while. A few bits for some samples.
Supervised Learning (icing): the machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10 → 10,000 bits per sample.
Unsupervised/Predictive Learning (cake): the machine predicts any part of its input for any observed part, e.g., predicting future frames in videos. Millions of bits per sample.
(Yes, I know, this picture is slightly offensive to RL folks. But I'll make it up)