The document discusses control as inference in Markov decision processes (MDPs) and partially observable MDPs (POMDPs). It introduces optimality variables that represent whether a state-action pair is optimal or not. It formulates the optimal action-value function Q* and optimal value function V* in terms of these optimality variables and the reward and transition distributions. Q* is defined as the log probability of a state-action pair being optimal, and V* is defined as the log probability of a state being optimal. Bellman equations are derived relating Q* and V* to the reward and next state value.
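As a reminder of what those Bellman relations look like under this formulation, here is the standard backup implied by defining p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t)); this is the textbook control-as-inference result, sketched here for orientation rather than taken from the slides.

```latex
Q(s_t, a_t) = r(s_t, a_t) + \log \mathbb{E}_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\!\left[\exp V(s_{t+1})\right],
\qquad
V(s_t) = \log \int \exp Q(s_t, a_t)\, \mathrm{d}a_t .
```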
The document summarizes recent research related to "theory of mind" in multi-agent reinforcement learning. It discusses three papers that propose methods for agents to infer the intentions of other agents by applying concepts from theory of mind:
1. The papers propose that in multi-agent reinforcement learning, being able to understand the intentions of other agents could help with cooperation and increase success rates.
2. The methods aim to estimate the intentions of other agents by modeling their beliefs and private information, using ideas from theory of mind in cognitive science. This involves inferring information about other agents that is not directly observable.
3. Bayesian inference is often used to reason about the beliefs, goals, and private information of other agents based on their observed actions.
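Schematically (not taken from any of the three papers), the Bayesian update over another agent's hidden goal g given its observed actions a_{1:t} and the shared states s_{1:t} has the usual form:

```latex
p\!\left(g \mid a_{1:t}, s_{1:t}\right) \;\propto\; p\!\left(a_{1:t} \mid g, s_{1:t}\right)\, p\!\left(g\right).
```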
This document provides an overview of POMDP (Partially Observable Markov Decision Process) and its applications. It first defines the key concepts of POMDP such as states, actions, observations, and belief states. It then uses the classic Tiger problem as an example to illustrate these concepts. The document discusses different approaches to solve POMDP problems, including model-based methods that learn the environment model from data and model-free reinforcement learning methods. Finally, it provides examples of applying POMDP to games like ViZDoom and robot navigation problems.
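To make the belief-state idea concrete, here is a minimal sketch of the Bayesian belief update for the Tiger problem, assuming the common textbook parameterization in which listening reports the correct door with probability 0.85 and does not move the tiger.

```python
# A small sketch of the POMDP belief update for the Tiger problem, assuming the
# standard textbook parameterization: "listen" reports the correct door with
# probability 0.85 and leaves the tiger where it is.
def update_belief(b_left, obs_left, hear_correct=0.85):
    """Posterior probability that the tiger is behind the left door."""
    p_obs_given_left = hear_correct if obs_left else 1.0 - hear_correct
    p_obs_given_right = (1.0 - hear_correct) if obs_left else hear_correct
    num = p_obs_given_left * b_left
    return num / (num + p_obs_given_right * (1.0 - b_left))

b = 0.5                               # uniform prior over {tiger-left, tiger-right}
b = update_belief(b, obs_left=True)   # hear the tiger on the left
print(b)                              # approx. 0.85
b = update_belief(b, obs_left=True)   # hear it on the left again
print(b)                              # approx. 0.97
```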
The document presents an overview of the research group 'Generations' focused on image generation and generative models, detailing their contributions to fields like unpaired image-to-image translation and domain adaptation. It highlights various studies and techniques, including CycleGAN and neural radiance fields, aimed at enhancing image translation while preserving contextual integrity. The group is actively seeking new members for collaboration on these innovative themes.
This document summarizes a presentation on offline reinforcement learning. It discusses how offline RL can learn from fixed datasets without further interaction with the environment, which allows for fully off-policy learning. However, offline RL faces challenges from distribution shift between the behavior policy that generated the data and the learned target policy. The document reviews several offline policy evaluation, policy gradient, and deep deterministic policy gradient methods, and also discusses using uncertainty and constraints to address distribution shift in offline deep reinforcement learning.
This document summarizes a research paper on scaling laws for neural language models. Some key findings of the paper include:
- Language model performance depends strongly on model scale and weakly on model shape. With enough compute and data, performance scales as a power law of parameters, compute, and data.
- Overfitting is universal, with penalties depending on the ratio of parameters to data.
- Large models are more sample-efficient and can reach the same performance levels with fewer optimization steps and fewer data points.
- The paper motivated subsequent work by OpenAI on applying scaling laws to other domains like computer vision and developing increasingly large language models like GPT-3.
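For reference, the power-law relationships summarized above take the following schematic form in the paper, with N the number of (non-embedding) parameters, D the dataset size, and C the compute budget; the fitted constants and exponents are omitted here.

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D},
\qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}.
```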
This document discusses the relationship between control as inference, reinforcement learning, and active inference. It provides an overview of key concepts such as Markov decision processes (MDPs), partially observable MDPs (POMDPs), optimality variables, the evidence lower bound (ELBO), variational inference, and the free energy principle as applied to active inference. Control as inference frames reinforcement learning as probabilistic inference by defining a generative process and performing variational inference to find an optimal policy. Active inference uses the free energy principle and minimizes expected free energy to select actions that resolve uncertainty.
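As a pointer to how the pieces fit together: if the variational trajectory distribution keeps the true dynamics and only the policy is free, the ELBO on log p(O_{1:T}) reduces to the maximum-entropy RL objective (a standard control-as-inference identity, stated here for orientation):

```latex
\log p(\mathcal{O}_{1:T}) \;\ge\; \mathbb{E}_{q}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] + \sum_{t=1}^{T} \mathbb{E}_{q}\!\left[\mathcal{H}\!\left[\pi(\cdot \mid s_t)\right]\right].
```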
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
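The adversarial training process referred to above is the familiar minimax game between the generator G and discriminator D (the original GAN objective, included here only as a reminder):

```latex
\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\!\big(1 - D(G(z))\big)\right].
```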
This document discusses several semi-supervised deep generative models for multimodal data, including the Semi-Supervised Multimodal Variational AutoEncoder (SS-MVAE), Semi-Supervised Hierarchical Multimodal Variational AutoEncoder (SS-HMVAE), and their training procedures. The SS-MVAE extends the Joint Multimodal Variational Autoencoder (JMVAE) to semi-supervised learning. The SS-HMVAE introduces auxiliary variables to model dependencies between modalities more flexibly. Both models maximize a variational lower bound with supervised and unsupervised objectives. The document provides technical details of the generative processes, variational approximations, and optimization of these semi-supervised deep generative models.
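For orientation, the generic single-modality semi-supervised VAE bounds that these models build on (Kingma et al., 2014) look as follows; the SS-MVAE/SS-HMVAE variants extend them to multiple modalities and, in the hierarchical case, auxiliary variables, which is not shown here.

```latex
\text{labeled: } \log p_\theta(x, y) \ge \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(x \mid y, z) + \log p_\theta(y) + \log p(z) - \log q_\phi(z \mid x, y)\right] = -\mathcal{L}(x, y),
\qquad
\text{unlabeled: } \log p_\theta(x) \ge \sum_{y} q_\phi(y \mid x)\,\big(-\mathcal{L}(x, y)\big) + \mathcal{H}\!\left[q_\phi(y \mid x)\right].
```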
(DL Reading Group) Variational Dropout Sparsifies Deep Neural Networks, by Masahiro Suzuki
This document discusses a variational dropout method that learns individual dropout rates per weight with no upper bound, which sparsifies deep neural networks while remaining applicable to CNNs. It outlines the Bayesian and variational inference techniques used to estimate posterior distributions, such as stochastic optimization and the reparameterization trick, and highlights multiplicative Gaussian noise as the mechanism underlying this form of dropout.
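A minimal sketch of the multiplicative Gaussian-noise mechanism mentioned above, in PyTorch; this shows only the reparameterized forward pass, not the paper's learned per-weight dropout rates or their KL penalty.

```python
import torch
import torch.nn as nn

# Sketch of multiplicative Gaussian-noise dropout: during training each activation
# is multiplied by xi ~ N(1, alpha), sampled via the reparameterization trick.
class GaussianDropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.alpha = p / (1.0 - p)   # noise variance corresponding to dropout rate p

    def forward(self, x):
        if not self.training:
            return x                 # noise has mean 1, so do nothing at test time
        eps = torch.randn_like(x)
        return x * (1.0 + self.alpha ** 0.5 * eps)

layer = GaussianDropout(p=0.3)
layer.train()
print(layer(torch.ones(2, 4)))
```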
(DL Reading Group) Matching Networks for One Shot Learning, by Masahiro Suzuki
1. Matching Networks is a neural network architecture proposed by DeepMind for one-shot learning.
2. The network learns to classify novel examples by comparing them to a small support set of examples, using an attention mechanism to focus on the most relevant support examples.
3. The network is trained using a meta-learning approach, where it learns to learn from small support sets to classify novel examples from classes not seen during training.
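The attention-based classification rule referred to in point 2 has the following form (Vinyals et al., 2016), with f and g the embedding networks, c cosine similarity, and {(x_i, y_i)} the support set:

```latex
\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i,
\qquad
a(\hat{x}, x_i) = \frac{\exp\!\big(c(f(\hat{x}), g(x_i))\big)}{\sum_{j=1}^{k} \exp\!\big(c(f(\hat{x}), g(x_j))\big)}.
```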
The document discusses Bayesian neural networks and related topics, including stochastic neural networks, variational autoencoders, and modeling prediction uncertainty in neural networks. Key points include using Bayesian techniques such as MCMC and variational inference to place distributions over the weights of neural networks, treating both model parameters and predictions as distributions, and showing how this captures uncertainty in the network's predictions.
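Concretely, "treating predictions as distributions" means the posterior predictive, which is approximated by averaging over weight samples drawn from MCMC or from a variational posterior q(w); a standard identity, noted here for completeness:

```latex
p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y^* \mid x^*, w_s), \qquad w_s \sim q(w).
```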
(DL Hacks Reading) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks, by Masahiro Suzuki
This document discusses techniques for training deep variational autoencoders and probabilistic ladder networks. It proposes three advances: 1) Using an inference model similar to ladder networks with multiple stochastic layers, 2) Adding a warm-up period to keep units active early in training, and 3) Using batch normalization. These advances allow training models with up to five stochastic layers and achieve state-of-the-art log-likelihood results on benchmark datasets. The document explains variational autoencoders, probabilistic ladder networks, and how the proposed techniques parameterize the generative and inference models.
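A minimal sketch of the warm-up trick in point 2, assuming a VAE-style loss with a reconstruction term and a KL term; the linear schedule and its length are illustrative choices, not the paper's exact settings.

```python
# KL annealing ("warm-up"): ramp the KL coefficient from 0 to 1 early in training
# so that the latent units stay active instead of collapsing to the prior.
def kl_weight(epoch, warmup_epochs=100):
    return min(1.0, epoch / warmup_epochs)

# loss = reconstruction_loss + kl_weight(epoch) * kl_divergence
for epoch in (0, 50, 100, 200):
    print(epoch, kl_weight(epoch))
```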
(DL Hacks Reading) Variational Inference with Rényi Divergence, by Masahiro Suzuki
This document discusses variational inference with Rényi divergence. It summarizes variational autoencoders (VAEs), which are deep generative models that parametrize a variational approximation with a recognition network. VAEs define a generative model as a hierarchical latent variable model and approximate the intractable true posterior using variational inference. The document explores using Rényi divergence as an alternative to the evidence lower bound objective of VAEs, as it may provide tighter variational bounds.
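The alternative objective in question is the variational Rényi bound (Li & Turner, 2016), which recovers the standard ELBO in the limit α → 1:

```latex
\mathcal{L}_{\alpha}(q; x) = \frac{1}{1-\alpha} \log \mathbb{E}_{q(z \mid x)}\!\left[\left(\frac{p(x, z)}{q(z \mid x)}\right)^{1-\alpha}\right].
```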
This document discusses deep Kalman filters, which combine deep learning and Kalman filtering. It proposes replacing the linear transformations in classical Kalman filters with nonlinear transformations parameterized by neural networks. This allows the model to learn patterns in noisy sequential data and model the effects of external actions. The model is evaluated on synthetic and real patient data, showing it can successfully perform counterfactual inference about the effects of anti-diabetic drugs on diabetic patients.
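Schematically, the model replaces the linear-Gaussian transitions of a classical Kalman filter with neural-network parameterizations of roughly the following form (notation here is generic, with actions denoted u_t and f_μ, f_Σ neural networks):

```latex
z_1 \sim \mathcal{N}(\mu_0, \Sigma_0), \qquad
z_t \sim \mathcal{N}\!\big(f_\mu(z_{t-1}, u_{t-1}),\; f_\Sigma(z_{t-1}, u_{t-1})\big), \qquad
x_t \sim p_\theta(x_t \mid z_t).
```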
(Research Group Reading) Weight Uncertainty in Neural Networks, by Masahiro Suzuki
Bayes by Backprop is a method for introducing weight uncertainty into neural networks using variational Bayesian learning. It represents each weight as a probability distribution rather than a fixed value. This allows the model to better assess uncertainty. The paper proposes Bayes by Backprop, which uses a simple approximate learning algorithm similar to backpropagation to learn the distributions over weights. Experiments show it achieves good results on classification, regression, and contextual bandit problems, outperforming standard regularization methods by capturing weight uncertainty.
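A compact sketch of the idea in PyTorch: each weight gets a Gaussian q(w) = N(μ, softplus(ρ)²), weights are sampled with the reparameterization trick, and a single-sample estimate of log q(w) − log p(w) is added to the data term. The prior, layer sizes, and loss weighting below are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer whose weights are Gaussian distributions, not point estimates."""
    def __init__(self, in_f, out_f, prior_std=1.0):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(out_f, in_f))
        self.rho = nn.Parameter(torch.full((out_f, in_f), -3.0))
        self.prior = torch.distributions.Normal(0.0, prior_std)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        eps = torch.randn_like(sigma)
        w = self.mu + sigma * eps                       # reparameterized weight sample
        q = torch.distributions.Normal(self.mu, sigma)
        # single-sample complexity term: log q(w) - log p(w)
        self.kl = (q.log_prob(w) - self.prior.log_prob(w)).sum()
        return x @ w.t()

layer = BayesianLinear(4, 2)
x, y = torch.randn(8, 4), torch.randn(8, 2)
loss = F.mse_loss(layer(x), y) + 1e-3 * layer.kl        # data term + weighted KL term
loss.backward()
```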
The document discusses deep kernel learning, which combines deep learning and Gaussian processes (GPs). It briefly reviews the predictive equations and marginal likelihood for GPs, noting their computational requirements. GPs assume datasets with input vectors and target values, modeling the values as joint Gaussian distributions based on a mean function and covariance kernel. Predictive distributions for test points are also Gaussian. The goal of deep kernel learning is to leverage recent work on efficiently representing kernel functions to produce scalable deep kernels, allowing outperformance of standalone deep learning and GPs on various datasets.
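The GP predictive equations being reviewed are, for a zero-mean prior with kernel matrix K over the training inputs, noise variance σ², training targets y, and k_* the vector of covariances between a test point x_* and the training inputs:

```latex
\mu_* = \mathbf{k}_*^{\top} \left(K + \sigma^2 I\right)^{-1} \mathbf{y},
\qquad
\sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^{\top} \left(K + \sigma^2 I\right)^{-1} \mathbf{k}_* .
```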
(Research Group Reading) Facial Landmark Detection by Deep Multi-task Learning, by Masahiro Suzuki
The document summarizes a research paper on facial landmark detection using deep multi-task learning. It proposes a Tasks-Constrained Deep Convolutional Network (TCDCN) that uses facial landmark detection as the main task and related auxiliary tasks like pose estimation and attribute inference to improve performance. The TCDCN learns shared representations across tasks using a deep convolutional network. It introduces task-wise early stopping to halt learning on auxiliary tasks that reach optimal performance early to avoid overfitting and improve convergence on the main task of landmark detection. Experimental results showed the proposed approach outperformed existing methods.
(DL Hacks Reading) How transferable are features in deep neural networks?, by Masahiro Suzuki
This document summarizes an experiment on measuring how transferable features are in deep neural networks. The experiment trained neural networks on halves of the ImageNet dataset and tested how well the networks could generalize to the other half. It found that earlier layer features transferred better than later layer features, and that fine-tuning improved performance. Transferring between more dissimilar datasets led to poorer performance. Randomly initialized weights performed worse than trained weights.
7. “The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.” (Forrester, 1971)
[Chang+ 17, Cell]
12. MDP: states s, actions a, rewards r, next state s′, and transition distribution p(s′ | s, a)
Recap: the reinforcement learning objective
The Anatomy of a Reinforcement Learning Problem
Slide from Sergey Levine
13. Model-based RL Review
- Correcting for model errors: refitting the model with new data, replanning with MPC, using local models
- Model-based RL from raw observations: learn a latent space (typically with unsupervised learning), or model and plan directly in observation space
- Improve the policy, e.g., by backpropagating through the model (supervised learning)
- Even simpler: generic trajectory optimization, solved however you want
- How can we impose constraints on trajectory optimization?
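To make the "fit a model, then plan" loop above concrete, here is a toy, self-contained sketch (entirely illustrative, not from the slides): a 1-D system, a least-squares linear dynamics model refit after every step, and random-shooting MPC for replanning.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_dynamics(s, a):                  # the real system, unknown to the agent
    return 0.9 * s + 0.5 * a + 0.01 * rng.normal()

def fit_linear_model(S, A, S_next):
    # least-squares fit of s' ~ w1*s + w2*a
    X = np.stack([S, A], axis=1)
    w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return w

def mpc_action(w, s, horizon=5, n_samples=64):
    # random-shooting MPC: sample action sequences, roll out the learned model,
    # return the first action of the best sequence (cost = squared distance to 0)
    best_a, best_cost = 0.0, np.inf
    for _ in range(n_samples):
        a_seq = rng.uniform(-1, 1, size=horizon)
        s_pred, cost = s, 0.0
        for a in a_seq:
            s_pred = w[0] * s_pred + w[1] * a
            cost += s_pred ** 2
        if cost < best_cost:
            best_a, best_cost = a_seq[0], cost
    return best_a

S, A, S_next = [], [], []
s = 5.0
for step in range(200):
    a = rng.uniform(-1, 1) if step < 20 else mpc_action(w, s)   # explore, then plan
    s_next = true_dynamics(s, a)
    S.append(s); A.append(a); S_next.append(s_next)
    w = fit_linear_model(np.array(S), np.array(A), np.array(S_next))  # refit model
    s = s_next
print("final state (should be near 0):", s)
```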
14. Policy search methods: RBF / DNN policies, PILCO, guided policy search (trajectory optimization), CMA-ES
Policy Search Classification
Yet, it's a grey zone…
Important Extensions:
- Contextual Policy Search [Kupcsik, Deisenroth, Peters & Neumann, AAAI 2013], [Silva, Konidaris & Barto, ICML 2012], [Kober & Peters, IJCAI 2011], [Parisi & Peters et al., IROS 2015]
- Hierarchical Policy Search [Daniel, Neumann & Peters, AISTATS 2012], [Wingate et al., IJCAI 2011], [Ghavamzadeh & Mahadevan, ICML 2003]
Figure: a spectrum of methods between Direct Policy Search and Value-Based RL, including Evolutionary Strategies (CMA-ES), Episodic REPS, Policy Gradients (eNAC), Actor Critic / Natural Actor Critic, Model-based REPS, PS by Trajectory Optimization, PILCO, Advantage Weighted Regression, Conservative Policy Iteration, Q-Learning, Fitted Q, and LSPI.
Model-Based Policy Search Methods
Learn a dynamics model from the data set
+ More data efficient than model-free methods
+ More complex policies can be optimized
  RBF networks [Deisenroth & Rasmussen, 2011]
  Time-dependent feedback controllers [Levine & Koltun, 2014]
  Gaussian Processes [Van Hoof, Peters & Neumann, 2015]
  Deep neural nets [Levine & Koltun, 2014][Levine & Abbeel, 2014]
Limitations:
- Learning good models is often very hard
- Small model errors can do drastic damage to the resulting policy (due to optimization)
- Some models are hard to scale
- Computational complexity
15. PILCO
PILCO (probabilistic inference for learning control) [Deisenroth+ 11]
RBF policy
Greedy Policy Updates: PILCO [Deisenroth & Rasmussen 2011]
Model Learning:
- Use Bayesian models (Gaussian Processes) which integrate out model uncertainty
- Reward predictions are not specialized to a single model
Internal Simulation:
- Iteratively compute the state distributions p(x_1), ..., p(x_T)
- Moment matching: deterministic approximate inference
Policy Update:
- Analytically compute the expected return and its gradient
- Greedily optimize with BFGS
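For reference, the quantity optimized in the policy-update step is the expected long-term cost under the moment-matched Gaussian state distributions, whose gradient PILCO computes analytically and optimizes with BFGS:

```latex
J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{x_t}\!\left[c(x_t)\right],
\qquad
p(x_t) \approx \mathcal{N}(\mu_t, \Sigma_t) \ \ \text{(moment matching through the GP dynamics)}.
```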
What's the problem with backpropagating through the model?
- Similar parameter sensitivity problems as shooting methods
- But we no longer have a convenient second-order LQR-like method, because the policy parameters couple all the time steps, so no dynamic programming
- Similar problems to training long RNNs with BPTT: vanishing and exploding gradients
- Unlike LSTM, we can't just "choose" simple dynamics; the dynamics are chosen by nature
20. Learning Deep Dynamical Models From Image Pixels [Wahlström+ 14]; From Pixels to Torques: Policy Learning with Deep Dynamical Models [Wahlström+ 15]
Deep dynamical model (DDM)
22. VAE
Latent variable z ~ p(z); observation x generated by the decoder p_θ(x|z)
(a) Learned Frey Face manifold  (b) Learned MNIST manifold
Figure 4: Visualisations of learned data manifold for generative models with two-dimensional latent space, learned with AEVB. Since the prior of the latent space is Gaussian, linearly spaced coordinates on the unit square were transformed through the inverse CDF of the Gaussian to produce values of the latent variables z. For each of these values z, we plotted the corresponding generative p_θ(x|z) with the learned parameters θ. [Kingma+ 13]
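A small sketch of the visualization procedure described in the caption, with `decode` a hypothetical stand-in for a trained decoder p_θ(x|z):

```python
import numpy as np
from scipy.stats import norm

# Map linearly spaced coordinates on the unit square through the inverse CDF of the
# standard Gaussian prior to get latent values z, then decode each z.
def decode(z):
    return z  # placeholder: a real model would return a decoded image

grid = np.linspace(0.05, 0.95, 10)
images = [[decode(norm.ppf([u1, u2])) for u2 in grid] for u1 in grid]
```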
47. MDN-RNN, VAE
Friston
Wahlström (M, V)
VRNN [Chung+ 15]
48. Friston's free energy principle
https://en.wikipedia.org/wiki/Free_energy_principle
…the internal model is then realized by a generative model.
A well-known framework that treats the internal model as a generative model in the machine-learning sense and connects it to action is Friston's free-energy principle [Friston 10]. The free-energy principle holds that biological systems maintain their order by minimizing the free energy of their internal states.
Consider a generative model p_θ(x, z) with state x and latent variable z, and let q_φ(z) be the approximate distribution. Define the variational free energy (the negative variational lower bound, an upper bound on the negative marginal likelihood) as F(x; φ, θ) = −E_{q_φ(z)}[log p_θ(x, z)] − H[q_φ(z)]. Under the free-energy principle, the internal parameters φ and the action a are updated so as to minimize the (variational) free energy:
φ̂ = arg min_φ F(x; φ, θ),
â = arg min_a F(x; φ, θ).
Here, the arg min over a means taking the action a that selects a state x for which the free energy is minimized. The parameters θ of the generative model are updated after the above updates have been repeated a fixed number of times.
In the free-energy principle, the input is simply treated as a state x. When a state x is received, the internal state is updated, and then, based on the generative model, an action a is taken that selects a state x minimizing the free energy. In reality, however, stimuli from the external world arrive as multimodal information through the five senses, so the free energy involves multiple modalities such as x and w.
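For completeness, the same variational free energy can be rewritten to show why minimizing it makes sense: it upper-bounds surprise (the negative log evidence), with the gap being the KL divergence between the approximate and true posteriors:

```latex
F(x; \phi, \theta)
= -\,\mathbb{E}_{q_\phi(z)}\!\left[\log p_\theta(x, z)\right] - \mathcal{H}\!\left[q_\phi(z)\right]
= \mathrm{KL}\!\left[q_\phi(z) \,\|\, p_\theta(z \mid x)\right] - \log p_\theta(x)
\;\ge\; -\log p_\theta(x).
```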
49. LeCun
Y. LeCun: How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry): the machine predicts a scalar reward given once in a while. A few bits for some samples.
Supervised Learning (icing): the machine predicts a category or a few numbers for each input. Predicting human-supplied data. 10 → 10,000 bits per sample.
Unsupervised/Predictive Learning (cake): the machine predicts any part of its input for any observed part, e.g., predicting future frames in videos. Millions of bits per sample.
(Yes, I know, this picture is slightly offensive to RL folks. But I'll make it up)