Reward-Conditioned Policies
Aviral Kumar, Xue Bin Peng, Sergey Levine, 2019
Changhoon, Kevin Jeong
Seoul National University
chjeong@bi.snu.ac.kr
June 7, 2020
Contents
I. Motivation
II. Preliminaries
III. Reward-Conditioned Policies
IV. Experimental Evaluation
V. Discussion and Future Work
I. Motivation
Motivation
Supervised Learning
• Works on existing, given sample data (examples)
• Feedback for predictions (i.e. labels) is given
• Commonly used and well-understood
Reinforcement Learning
• Works by interacting with the environment
• Is about sequential decision making (e.g. games, robots, etc.)
• RL algorithms can be brittle, difficult to use and tune
Can we learn effective policies via supervised learning?
Motivation
One possible method: imitation learning
• Behavioural cloning, direct policy learning, inverse RL, etc.
• Imitation learning utilizes standard and well-understood supervised
learning methods
• But these methods require near-optimal expert data in advance
So, can we learn effective policies via supervised learning without
demonstrations?
• Non-expert trajectories collected from sub-optimal policies can be
viewed as optimal supervision
• The goal is not to maximize the reward, but to match the reward of the
given trajectory
II. Preliminaries
Preliminaries
Reinforcement Learning
Objective
$J(\theta) = \mathbb{E}_{s_0 \sim p(s_0),\, a_{0:\infty} \sim \pi,\, s_{t+1} \sim p(\cdot \mid s_t, a_t)}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$
• Policy-based: compute the derivative of J(θ) w.r.t. the policy
parameter θ
• Value-based: estimate the value (or Q) function by means of temporal
difference learning
• How can we avoid high-variance policy gradient estimators, as well as the
complexity of temporal difference learning?
Preliminaries
Monte-Carlo update
$V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$,
where $G_t = \sum_{k=0}^{\infty} \gamma^{k}\, r(s_{t+k}, a_{t+k})$
• Pros: unbiased, good convergence properties
• Cons: high variance
Temporal-Difference update
$V(S_t) \leftarrow V(S_t) + \alpha\,\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$
• Pros: learns online at every step, low variance
• Cons: bootstrapping (the update involves an estimate), hence biased
(a minimal sketch of both updates follows below)
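As a concrete illustration of the two updates above, here is a minimal tabular sketch in Python; the `episode` format (a list of (state, reward) pairs) and the hyperparameters are assumptions for illustration, not anything prescribed by the slides.

```python
import numpy as np

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Monte-Carlo: move V(S_t) toward the full observed return G_t."""
    G = 0.0
    # iterate backwards so G accumulates the discounted future reward
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V

def td0_update(V, s, r, s_next, done=False, alpha=0.1, gamma=0.99):
    """TD(0): move V(S_t) toward the bootstrapped target R_{t+1} + gamma V(S_{t+1})."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V

V = np.zeros(5)                                      # 5 states, tabular values
V = mc_update(V, episode=[(0, 0.0), (1, 0.0), (2, 1.0)])
V = td0_update(V, s=0, r=0.0, s_next=1)
```

The MC update can only be applied once the episode has finished (unbiased but high variance), whereas the TD update can be applied after every single step (low variance but biased through the bootstrapped estimate).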
Preliminaries
Function Approximation: Policy Gradient
Policy Gradient Theorem
For any differentiable policy πθ(s, a) and for any of the policy objective
functions, the policy gradient is
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$
Monte-Carlo Policy Gradient (REINFORCE)
• Uses the return Gt as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$:
$\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, G_t$
Reducing variance using a baseline
• A good baseline is the state value function $V^{\pi_\theta}(s)$
(a minimal REINFORCE-with-baseline sketch follows below)
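Below is a minimal REINFORCE-with-baseline sketch in PyTorch; the CartPole-sized network (4-dimensional observations, 2 discrete actions), the Gymnasium-style `env` API, and the simple mean-return baseline are illustrative assumptions rather than anything specified on the slide.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
gamma = 0.99

def run_episode(env):
    """Roll out one episode, recording log pi(a_t|s_t) and rewards."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    # discounted return G_t for every time step of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(list(reversed(returns)))
    baseline = returns.mean()   # crude constant baseline; V(s) would be better
    # ascend E[ log pi(a_t|s_t) * (G_t - b) ]  ==  descend its negative
    loss = -(torch.stack(log_probs) * (returns - baseline)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# one training iteration:  reinforce_update(*run_episode(env))
```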
Preliminaries
Actor-critic algorithm
• Critic: updates the Q-function parameters w by minimizing
$\text{error} = \mathbb{E}_{\pi_\theta}\!\left[\left(Q^{\pi_\theta}(s, a) - Q_w(s, a)\right)^{2}\right]$
• Actor: updates the policy parameters θ in the direction suggested by the critic
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$
Reducing variance using a baseline: the advantage function
• A good baseline is the state value function $V^{\pi_\theta}(s)$
• Advantage function:
$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$
• Rewriting the policy gradient using the advantage function (see the sketch
after this list):
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s, a)\, A^{\pi_\theta}(s, a)\right]$
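As a rough sketch of the actor-critic split above, the snippet below uses a state-value critic and a one-step TD error as the advantage estimate, which is a common simplification of the Q-function critic written on the slide; `actor` and `critic` are assumed to be ordinary PyTorch modules (action logits and V(s) respectively) with their own optimizers.

```python
import torch

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        s, a, r, s_next, done, gamma=0.99):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    # advantage estimate: A(s,a) ~ r + gamma V(s') - V(s)
    with torch.no_grad():
        target = r + (0.0 if done else gamma * critic(s_next))
    value = critic(s)
    advantage = (target - value).detach()

    # critic: regress V(s) toward the TD target (the squared-error term above)
    critic_loss = (target - value).pow(2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # actor: policy gradient weighted by the advantage instead of the raw return
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(torch.as_tensor(a)) * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```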
III. Reward-Conditioned Policies
Reward-Conditioned Policies
RCPs Algorithm (left) and Architecture (right)
• Z can be the return (RCP-R) or the advantage (RCP-A)
• Z can be incorporated in the form of multiplicative interactions ($\pi_\theta(a \mid s, Z)$)
• $\hat{p}_k(Z)$ is represented as a Gaussian distribution whose mean and standard
deviation $\sigma_Z$ are updated based on the soft maximum (i.e. $\log \sum \exp$) of the
target values Z observed so far in the dataset D
(a minimal sketch of one such training iteration follows below)
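To make the loop above concrete, here is a minimal, illustrative sketch of one RCP-R style iteration: sample a target return Z from the Gaussian $\hat{p}_k(Z)$, roll out the reward-conditioned policy, relabel the trajectory with the values of Z it actually achieved (here the discounted reward-to-go), fit the policy by maximum likelihood on the relabeled data, and move the target distribution toward a soft maximum of the observed values. The network architecture, the Gymnasium-style environment API, the list-based buffer, and the exact form of the soft-maximum update are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
import torch
import torch.nn as nn

class RewardConditionedPolicy(nn.Module):
    """pi_theta(a | s, Z): here a plain concatenation of (s, Z); the paper also
    discusses multiplicative interactions between Z and the state features."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs, z):
        return self.net(torch.cat([obs, z], dim=-1))   # action logits

def soft_maximum(values, beta=1.0):
    # log-sum-exp aggregate of the observed target values (one simple choice)
    v = np.asarray(values, dtype=np.float64) / beta
    return float(beta * (np.logaddexp.reduce(v) - np.log(len(v))))

def rcp_r_iteration(policy, optimizer, env, buffer, mu_z, sigma_z, gamma=0.99):
    # 1) sample one target return Z for the whole rollout (RCP-R)
    z_target = float(np.random.normal(mu_z, sigma_z))
    obs, _ = env.reset()
    traj, done = [], False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32),
                        torch.tensor([z_target]))
        action = torch.distributions.Categorical(logits=logits).sample().item()
        next_obs, reward, terminated, truncated, _ = env.step(action)
        traj.append((obs, action, reward))
        obs, done = next_obs, terminated or truncated

    # 2) relabel: pair each (s_t, a_t) with a value of Z it actually achieved
    #    (discounted reward-to-go), so sub-optimal data becomes optimal
    #    supervision for that value of Z
    G, relabeled = 0.0, []
    for o, a, r in reversed(traj):
        G = r + gamma * G
        relabeled.append((o, a, G))
    buffer.extend(reversed(relabeled))

    # 3) supervised regression: maximize log pi_theta(a | s, Z) on the dataset D
    obs_b = torch.as_tensor(np.array([b[0] for b in buffer]), dtype=torch.float32)
    act_b = torch.as_tensor([b[1] for b in buffer])
    z_b = torch.tensor([[b[2]] for b in buffer], dtype=torch.float32)
    dist = torch.distributions.Categorical(logits=policy(obs_b, z_b))
    loss = -dist.log_prob(act_b).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

    # 4) shift the mean of p_hat_k(Z) toward the soft maximum of observed Z
    return soft_maximum([b[2] for b in buffer]), sigma_z
```

A full run would construct `policy = RewardConditionedPolicy(obs_dim, n_actions)`, an Adam optimizer, an empty list as the buffer, and iterate `mu_z, sigma_z = rcp_r_iteration(...)`; RCP-A would additionally maintain a value baseline, relabel with estimated advantages instead of returns, and re-sample Z at every step.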
Theoretical Motivation for RCPs
Derivation of the two variants of RCPs:
• RCP-R: use Z as the return
• RCP-A: use Z as the advantage
RCP-R
Constrained Optimization
$\arg\max_{\pi}\; \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}[Z]$
$\text{s.t.}\;\; D_{\mathrm{KL}}\!\left(p_\pi(\tau, Z)\,\|\,\hat{p}(\tau, Z)\right) \le \epsilon$
where $\hat{p}(\tau, Z)$ is the trajectory-return distribution of the data in D.
By forming the Lagrangian of the constrained optimization with Lagrange
multiplier β,
$\mathcal{L}(\pi, \beta) = \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}[Z] + \beta\left(\epsilon - \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}\!\left[\log \frac{p_\pi(\tau, Z)}{\hat{p}(\tau, Z)}\right]\right)$
Theoretical Motivation for RCPs
Constrained Optimization
Differentiating L(π, β) with respect to π and β and applying the optimality
conditions, we obtain a non-parametric form for the joint trajectory-return
distribution of the optimal policy, $p_{\pi^*}(\tau, Z)$ (see AWR, Appendix A):
$p_{\pi^*}(\tau, Z) \propto \hat{p}(\tau, Z)\, \exp\!\left(\tfrac{Z}{\beta}\right)$
Decomposing the joint distribution $p_\pi(\tau, Z)$ into the conditionals $p_\pi(Z)$ and
$p_\pi(\tau \mid Z)$ gives
$p_{\pi^*}(\tau \mid Z)\, p_{\pi^*}(Z) \propto \left[\hat{p}(\tau \mid Z)\, \hat{p}(Z)\right] \exp\!\left(\tfrac{Z}{\beta}\right)$
Theoretical Motivation for RCPs
Constrained Optimization
Since the factor $\exp(Z/\beta)$ depends only on Z, it is absorbed entirely into the
marginal over Z, giving
$p_{\pi^*}(\tau \mid Z) \propto \hat{p}(\tau \mid Z)$ (corresponds to Line 9 of the algorithm)
$p_{\pi^*}(Z) \propto \hat{p}(Z)\, \exp\!\left(\tfrac{Z}{\beta}\right)$ (corresponds to Line 10 of the algorithm)
Theoretical Motivation for RCPs
Maximum likelihood estimation
Factorizing $p_\pi(\tau \mid Z)$ as $p_\pi(\tau \mid Z) = \prod_t \pi(a_t \mid s_t, Z)\, p(s_{t+1} \mid s_t, a_t)$, we train a
parametric policy $\pi_\theta(a \mid s, Z)$ by projecting the optimal non-parametric policy
$p_{\pi^*}$ computed above onto the manifold of parametric policies, according to
$\pi_\theta(a \mid s, Z) = \arg\min_\theta\; \mathbb{E}_{Z \sim D}\!\left[D_{\mathrm{KL}}\!\left(p_{\pi^*}(\tau \mid Z)\,\|\,p_{\pi_\theta}(\tau \mid Z)\right)\right]$
$= \arg\max_\theta\; \mathbb{E}_{Z \sim D}\, \mathbb{E}_{a \sim \hat{p}(a \mid s, Z)}\!\left[\log \pi_\theta(a \mid s, Z)\right]$
Theoretical motivation of RCP-A (see Section 4.3.2 of the paper)
For RCP-A, a new sample of Z is drawn at each time step, while for RCP-R,
a sample of the return Z is drawn once for the whole trajectory (Line 5 of the
algorithm); a minimal sketch of this difference follows below.
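The snippet below only illustrates this sampling difference; the `policy(obs, z)` callable (returning an action) and the Gymnasium-style environment API are assumptions.

```python
import numpy as np

def rollout_rcp_r(policy, env, mu_z, sigma_z):
    z = np.random.normal(mu_z, sigma_z)      # target return: drawn once per trajectory
    obs, _ = env.reset()
    done = False
    while not done:
        obs, _, terminated, truncated, _ = env.step(policy(obs, z))
        done = terminated or truncated

def rollout_rcp_a(policy, env, mu_z, sigma_z):
    obs, _ = env.reset()
    done = False
    while not done:
        z = np.random.normal(mu_z, sigma_z)  # target advantage: re-drawn every step
        obs, _, terminated, truncated, _ = env.step(policy(obs, z))
        done = terminated or truncated
```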
IV. Experimental Evaluation
Experimental Evaluation
• Results are averaged across 5 random seeds
• Comparison to RL benchmarks: on-policy (TRPO, PPO) and
off-policy (SAC, DDPG)
• AWR: an off-policy RL method that also utilizes supervised learning as a
subroutine, but does not condition on rewards and requires an
exponential weighting scheme during training
Experimental Evaluation
• Heatmap: the relationship between the target value $\hat{Z}$ and the observed
values of Z after 2,000 training iterations, for both RCP variants
V. Discussion and Future Work
Discussion and Future work
Propose a general class of algorithms that enable learning of control
policies with standard supervised learning approaches
Sub-optimal trajectories can be regarded as optimal supervision for a
policy that does not aim to attain the largest possible reward, but
rather to match the reward of that trajectory
By then conditioning the policy on the reward, we can train a single
model to simultaneously represent policies for all possible reward
values, and generalize to larger reward values
Discussion and Future work
Limitations
• Sample efficiency and final performance still lag behind the best and most
efficient approximate dynamic programming methods (SAC, DDPG, etc.)
• Sometimes the reward-conditioned policies might generalize successfully,
and sometimes they might not
• The main challenge for these variants: exploration?
References
• Xue Bin Peng et al., "Advantage-Weighted Regression: Simple and
Scalable Off-Policy Reinforcement Learning", 2019
• Jan Peters et al., "Reinforcement Learning by Reward-Weighted
Regression for Operational Space Control", ICML 2007
• RL Course by David Silver, DeepMind
Thank you for your attention!