[DL輪読会] Why is deep reinforcement learning difficult? Why Deep RL fails? A brief survey of recent works. (Deep Learning JP)
Deep reinforcement learning algorithms often fail to learn complex tasks. Recent works have identified three issues, a "deadly triad" of sorts, that contribute to this problem: non-stationary targets, high variance, and strongly correlated samples. New algorithms aim to address these issues by improving exploration, stabilizing learning, and decorrelating updates. Overall, deep reinforcement learning remains a challenging area, with opportunities to develop more data-efficient and more generally applicable algorithms.
21. Experiment 1
• Partially observable MiniPacman [Racanière et al., 2017]
– The agent tries to eat all the food in the maze while avoiding the ghost.
– Only a 5×5 window around the agent is observable (right).
-> To achieve a high score, the agent needs to form a belief state (taking past experience and the uncertainty of the environment into account); a minimal sketch follows this list.
• This experiment checks whether the non-jumpy TD-VAE can learn this task properly.
– Compared with two state-space models trained under the standard ELBO
-> Evaluates the effectiveness of the TD-VAE ELBO
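The belief-state idea can be made concrete with a small sketch: an RNN aggregates the partial observations into a code b_t, from which a belief distribution p_B(z_t | b_t) over the latent state is read out. This is a hypothetical stand-in, not the paper's architecture; the layer names, sizes, and the flattened 5×5 = 25-dimensional observation are all assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the paper's code): an LSTM aggregates the partial
# 5x5 observations into a belief code b_t, and a Gaussian belief
# distribution p_B(z_t | b_t) over the latent state is read out from it.
class BeliefNet(nn.Module):
    def __init__(self, obs_dim=25, b_dim=64, z_dim=16):  # illustrative sizes
        super().__init__()
        self.rnn = nn.LSTM(obs_dim, b_dim, batch_first=True)
        self.to_mu = nn.Linear(b_dim, z_dim)       # mean of p_B(z_t | b_t)
        self.to_logvar = nn.Linear(b_dim, z_dim)   # log-variance of p_B

    def forward(self, obs_seq):                    # obs_seq: (B, T, 25)
        b, _ = self.rnn(obs_seq)                   # belief codes b_1..b_T
        return self.to_mu(b), self.to_logvar(b)

net = BeliefNet()
mu, logvar = net(torch.randn(2, 10, 25))           # 2 sequences, 10 steps
z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample z_t ~ p_B
```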
                   ELBO              − log p(x) (est.)
Filtering model    0.1169 ± 0.0003   0.0962 ± 0.0007
Mean-field model   0.1987 ± 0.0004   0.1678 ± 0.0010
TD-VAE             0.0773 ± 0.0002   0.0553 ± 0.0006

Figure 2: MiniPacman. Left: A full frame from the game (size 15 × 19). Pacman (green) is navigating the maze trying to eat all the food (blue) while being chased by a ghost (red). Top right: A sequence of observations, consisting of consecutive 5×5 windows around Pacman. Bottom right: ELBO and estimated negative log probability on a test set of MiniPacman sequences. Lower is better. Log probability is estimated using importance sampling with the encoder as proposal.
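The caption's "importance sampling with the encoder as proposal" is the standard estimator log p(x) ≈ logsumexp_k [log p(x, z_k) − log q(z_k | x)] − log K with z_k ~ q(z | x). A minimal sketch, where the three callables are assumed interfaces rather than the paper's API:

```python
import math
import torch

def log_prob_is(x, sample_q, log_joint, log_q, K=64):
    # Importance-sampling estimate of log p(x), using the encoder
    # q(z | x) as the proposal distribution:
    #   log p(x) ~= logsumexp_k [log p(x, z_k) - log q(z_k | x)] - log K
    zs = [sample_q(x) for _ in range(K)]                       # z_k ~ q(z | x)
    log_w = torch.stack([log_joint(x, z) - log_q(z, x) for z in zs])
    return torch.logsumexp(log_w, dim=0) - math.log(K)
```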
22. Experiment 1
• Results
– Evaluated by the (presumably negative) variational lower bound and the negative log-likelihood on a test set
– Lower values indicate a better model.
– TD-VAE gives the best results.
– The mean-field model performs worst.
• Note that in the mean-field model b_t is a belief-state code, whereas in the filtering model it is not (the filtering model's encoder also depends on z from the previous step); see the schematic below.
-> Simply restricting the encoder in order to obtain a belief state degrades accuracy.
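To make the distinction concrete, here is a schematic of the two encoder factorizations. The linear layers are hypothetical stand-ins for the paper's networks; only the conditioning structure is the point.

```python
import torch
import torch.nn as nn

z_dim, b_dim = 16, 64   # illustrative sizes

# Mean-field baseline: q(z_t | b_t) conditions on b_t alone, so b_t by
# itself must summarize the whole past -- it is a belief-state code.
mean_field_enc = nn.Linear(b_dim, 2 * z_dim)           # -> (mu, logvar)

# Filtering baseline: q(z_t | z_{t-1}, b_t) also conditions on the
# previous latent sample, so b_t alone is not a belief state.
filtering_enc = nn.Linear(b_dim + z_dim, 2 * z_dim)

b_t, z_prev = torch.randn(1, b_dim), torch.randn(1, z_dim)
mu_mf, logvar_mf = mean_field_enc(b_t).chunk(2, dim=-1)
mu_f, logvar_f = filtering_enc(torch.cat([b_t, z_prev], dim=-1)).chunk(2, dim=-1)
```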
23. Experiment 2
• Moving MNIST
– MNIST digits that move by one pixel at each step
– Trained with time skips in the range [1, 4]; the experiment tests whether the model can generate across skipped steps.
• Results:
– The model could generate frames even when steps were skipped.
– (Not stated explicitly, but presumably) the leftmost column is the original image, and each column corresponds to a skipped step count in [1, 4].
Figure 3: Moving MNIST. Left: Rows are example input sequences. Right: Jumpy rollouts from the model. We see that the model is able to roll forward by skipping frames, keeping the correct digit and the direction of motion.

5.2 MOVING MNIST
In this experiment, we show that the model is able to learn the state and roll forward in jumps. We consider sequences of length 20 of images of MNIST digits. For each sequence, a random digit from the dataset is chosen, as well as the direction of movement (left or right). At each time step, the digit moves by one pixel in the chosen direction, as shown in Figure 3. We train the model with t1 and t2 separated by a random amount t2 − t1 from the interval [1, 4]. We would like to see whether the model at a given time can roll out a simulated experience in time steps t1 = t + δ1, t2 = t1 + δ2, . . . with δ1, δ2, . . . > 1, without considering the inputs in between these time points. Note that it is not sufficient to predict the future inputs x_{t1}, . . . as they do not contain information about whether the digit moves left or right. We need to sample a state that contains this information.
We roll out a sequence from the model as follows: (a) b_t is computed by the aggregation recurrent network from observations up to time t; (b) a state z_t is sampled from p_B(z_t | b_t); (c) a sequence …
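A hedged sketch of steps (a)-(c) above. Every module is a simple stand-in (the paper's belief, transition, and decoder networks are richer), and feeding the jump size dt to the transition is an illustrative choice, not necessarily how the paper parameterizes it.

```python
import torch
import torch.nn as nn

b_dim, z_dim, obs_dim = 64, 16, 28 * 28   # illustrative sizes

agg_rnn = nn.LSTM(obs_dim, b_dim, batch_first=True)  # (a) aggregation RNN
belief = nn.Linear(b_dim, 2 * z_dim)                 # (b) p_B(z_t | b_t)
transition = nn.Linear(z_dim + 1, 2 * z_dim)         # (c) p(z' | z, dt)
decoder = nn.Linear(z_dim, obs_dim)                  # p(x | z)

def sample(params):
    mu, logvar = params.chunk(2, dim=-1)
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

obs = torch.randn(1, 10, obs_dim)     # observations up to time t
b, _ = agg_rnn(obs)                   # (a) compute b_t from the past
z = sample(belief(b[:, -1]))          # (b) z_t ~ p_B(z_t | b_t)
frames = []
for dt in [2.0, 3.0, 4.0]:            # (c) jumpy latent rollout, dt > 1
    z = sample(transition(torch.cat([z, torch.tensor([[dt]])], dim=-1)))
    frames.append(decoder(z))         # decode a predicted future frame
```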
24. Experiment 3
• 1D sequences obtained from a noisy harmonic oscillator
– Shows that the model can construct a state even when each observation carries little information (due to noise).
– An LSTM is used as the RNN, and a hierarchical TD-VAE is trained.
• b is hierarchical (explanation omitted)
– Trained with step widths in [1, 10] with probability 0.8 and in [1, 120] with probability 0.2 (a sampling sketch follows this list)
• Results:
– Rollouts that jump 20 steps and 100 steps ahead
– The model generates well even from very noisy observations.
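A minimal sketch of the training-time step-width distribution described on this slide, assuming uniform integer draws within each interval (the slide does not state the exact discretization):

```python
import random

def sample_step_width():
    # With probability 0.8 draw the step width from [1, 10],
    # otherwise (probability 0.2) from [1, 120], per the slide.
    if random.random() < 0.8:
        return random.randint(1, 10)
    return random.randint(1, 120)

widths = [sample_step_width() for _ in range(1000)]
print(sum(w <= 10 for w in widths) / len(widths))  # ~0.82: 0.8 + 0.2*(10/120)
```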
Figure 4: Skip-state prediction for 1D signal. The input is generated by a noisy harmonic oscillator. Rollouts consist of (a) a jumpy state transition with either dt = 20 or dt = 100, followed by 20 state transitions with dt = 1. The model is able to create a state and predict it into the future, correctly predicting frequency and magnitude of the signal.
… predict as much as possible about the state, which consists of frequency, magnitude and position, and it is only the position that cannot be accurately predicted.