36. Abstract
Key Information
1. We provide a natural gradient method that represents the steepest descent direction based on the underlying structure of the parameter space.
2. Although gradient methods cannot make large changes in the values of the parameters, we show that the natural gradient is moving toward choosing a greedy optimal action rather than just a better action.
37. Abstract
3. These greedy optimal actions are those that would be chosen under one improvement step of policy iteration with approximate, compatible value functions, as defined by Sutton et al.
4. We then show drastic performance improvements in simple MDPs and in the more challenging MDP of Tetris.
In other words, with compatible value function approximation, the natural gradient moves toward the greedy action that a single improvement step of policy iteration would select, rather than merely a better action.
Performance improvements are demonstrated on a simple MDP and on the Tetris MDP.
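For reference, the quantities behind these points can be written out explicitly. The following recap uses the paper's definitions of the Fisher information metric and the natural gradient (standard notation; nothing beyond the paper's own objects is assumed):

F_s(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \right],
\qquad
F(\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ F_s(\theta) \right]

\widetilde{\nabla} \eta(\theta) = F(\theta)^{-1} \nabla \eta(\theta),
\qquad
\theta_{k+1} = \theta_k + \alpha\, \widetilde{\nabla} \eta(\theta_k)

Point 2 refers to this update: as the step size grows, the natural-gradient direction moves the policy toward the greedy action of one policy-iteration improvement step (point 3) rather than merely a slightly better action.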
78. Metrics and Curvatures
FIM vs. Hessian
• Obviously, our choice of F is not unique and the question arises as to whether or not there is a better metric to use than F.
→ Is there a better metric to use than the FIM G?
• In the different setting of parameter estimation, the Fisher information converges to the Hessian, so it is asymptotically efficient.
→ In parameter estimation, the FIM converges to the Hessian, so it is asymptotically efficient.
→ How do the FIM and the Hessian relate?
- The FIM is an expectation of the score outer product under the (stochastic) model, while the Hessian is the curvature of the objective itself.
- That is, when the model is well specified, the two agree asymptotically (FIM ≈ Hessian).
- In our reinforcement-learning objective, however, there is no log-likelihood being maximized, so this asymptotic equivalence does not carry over directly.
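To make the comparison above concrete, recall the standard information-matrix equality from parameter estimation (stated under the usual regularity conditions; the generic model p_\theta is illustrative notation, not from the slides):

F(\theta) \;=\; \mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top} \right]
\;=\; -\,\mathbb{E}_{x \sim p_\theta}\!\left[ \nabla_\theta^{2} \log p_\theta(x) \right]

This identity is why, in well-specified maximum-likelihood estimation, the FIM and the Hessian of the negative log-likelihood agree asymptotically. The expected return \eta(\theta) is not a log-likelihood, so no comparable identity ties F(\theta) to \nabla^{2}\eta(\theta).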
80. Metrics and Curvatures
FIM vs. Hessian
• Our situation is more similar to the blind source separation case where a metric is chosen based on the underlying parameter space (of non-singular matrices) and is not necessarily asymptotically efficient (i.e. does not attain second order convergence).
→ Choosing the metric from the structure of the underlying parameter space is analogous to the blind source separation setting.
• As argued by Mackay, one strategy is to pull a metric out of the data-independent terms of the Hessian (if possible), and in fact, Mackay arrives at the same result as Amari for the blind source separation case.
→ Pulling a metric out of the data-independent terms of the Hessian would be one way to obtain a good metric.
→ In our case, however, the terms of the Hessian are not data-independent.
→ So it is not clear that this route yields anything better than the FIM here.
→ In our setting Q is coupled to the policy, so the Q-values enter every term of the Hessian (cf. the quadratic form G · dθ · dθ that defines the metric).
→ The FIM is therefore not the same as the Hessian; unlike the Hessian, it is guaranteed to be positive (semi-)definite.
→ As a result, this metric does not necessarily give second-order convergence near a local maximum.
→ Near a maximum, conjugate methods (which effectively use the inverse Hessian) may perform better.
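The positive-definiteness remark can be checked numerically. Below is a minimal numpy sketch (not from the slides; the softmax policy, dimensions, and sample count are illustrative assumptions) that estimates F = E[∇ log π ∇ log πᵀ] at a single state by Monte Carlo and inspects its eigenvalues. Being an average of outer products, the estimate is positive semi-definite by construction, a guarantee the Hessian of the return does not share.

import numpy as np

rng = np.random.default_rng(0)

# Tiny softmax policy over 3 actions with 4-dimensional features (illustrative sizes).
n_actions, d = 3, 4
theta = rng.normal(size=(n_actions, d))
state = rng.normal(size=d)  # one fixed state feature vector

def policy_probs(theta, s):
    logits = theta @ s
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, s, a):
    # For a softmax policy, d/dtheta_{a'} log pi(a|s) = (1{a'=a} - pi(a'|s)) * s.
    p = policy_probs(theta, s)
    g = -np.outer(p, s)
    g[a] += s
    return g.reshape(-1)

# Monte-Carlo estimate of F = E_{a ~ pi}[ grad log pi  grad log pi^T ] at this state.
p = policy_probs(theta, state)
n_samples = 20000
F = np.zeros((n_actions * d, n_actions * d))
for _ in range(n_samples):
    a = rng.choice(n_actions, p=p)
    g = grad_log_pi(theta, state, a)
    F += np.outer(g, g)
F /= n_samples

# An average of outer products is positive semi-definite, so the smallest
# eigenvalue should be >= 0 up to floating-point error.
print("smallest eigenvalue of estimated FIM:", np.linalg.eigvalsh(F).min())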
84. Experiments
A simple 1-dimensional linear quadratic regulator. The goal is to apply a control signal u to keep the system at x = 0, incurring a cost of x(t)^2 at each step.
(The dynamics, the policy parametrization, and the cost are given as equations on the original slide.)
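Since the slide's equations are not reproduced above, the following is a hypothetical sketch of natural policy gradient on a 1-D LQR-style problem. The dynamics x_{t+1} = x_t + u_t + noise, the per-step cost x_t^2, the single-gain Gaussian policy u ~ N(k·x, σ²), and all constants are illustrative assumptions, not the paper's exact setup; the point is only to show the ordinary REINFORCE gradient being rescaled by an empirical Fisher information.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: the exact dynamics, policy, and cost from the slide are not
# reproduced here, so this sketch assumes x_{t+1} = x_t + u_t + noise, a per-step
# cost of x_t^2, and a Gaussian policy u ~ N(k * x, SIGMA^2) with a single gain k.
SIGMA, HORIZON, N_EPISODES = 0.5, 20, 200

def rollout(k):
    """One episode: return (total cost, sum of d/dk log pi, rough per-episode Fisher)."""
    x = rng.normal()
    cost = score = fisher = 0.0
    for _ in range(HORIZON):
        u = k * x + SIGMA * rng.normal()
        cost += x ** 2
        score += (u - k * x) * x / SIGMA ** 2   # d/dk log N(u; k*x, SIGMA^2)
        fisher += x ** 2 / SIGMA ** 2           # E[(d/dk log pi)^2 | x] for this step
        x = x + u + 0.1 * rng.normal()
    return cost, score, fisher

def estimate(k):
    """REINFORCE gradient of expected cost w.r.t. k and an empirical Fisher estimate."""
    costs, scores, fishers = map(np.array, zip(*(rollout(k) for _ in range(N_EPISODES))))
    grad = np.mean((costs - costs.mean()) * scores)   # baseline for variance reduction
    return grad, fishers.mean()

k = 0.0  # start with no feedback; a stabilizing (negative) gain should be learned
for _ in range(100):
    grad, fisher = estimate(k)
    k -= 0.1 * grad / max(fisher, 1e-6)   # natural-gradient descent step on the cost
print("learned gain k:", k)  # expected to move toward roughly k ~ -1 under these assumed dynamics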