*Paper Review : Aviral Kumar, et al., "Reward-Conditioned Policies", 2019
*Paper Link : https://arxiv.org/abs/1912.13465
*YouTube Link (in Korean) : TBU
Tensorflow KR PR12 (Season 3): 251st Paper Review
1. Reward-Conditioned Policies
Aviral Kumar, Xue Bin Peng, Sergey Levine, 2019
Changhoon, Kevin Jeong
Seoul National University
chjeong@bi.snu.ac.kr
June 7, 2020
2. Contents
I. Motivation
II. Preliminaries
III. Reward-Conditioned Policies
IV. Experimental Evaluation
V. Discussion and Future Work
3. I. Motivation
4. Motivation
Supervised Learning
- Works on existing or given sample data or examples
- Direct feedback (the correct output) is given for each example
- Commonly used and well understood
Reinforcement Learning
- Works by interacting with the environment
- Is about sequential decision making (e.g. games, robots, etc.)
- RL algorithms can be brittle, difficult to use and tune
Can we learn effective policies via supervised learning?
5. Motivation
One possible method: imitation learning
- Behavioural cloning, direct policy learning, inverse RL, etc.
- Imitation learning utilizes standard and well-understood supervised learning methods
- But these methods require near-optimal expert data in advance
So, can we learn effective policies via supervised learning without demonstrations?
- Non-expert trajectories collected from sub-optimal policies can be viewed as optimal supervision
- not for maximizing the reward, but for matching the reward of the given trajectory
6. II. Preliminaries
7. Preliminaries
Reinforcement Learning
Objective
J(\theta) = \mathbb{E}_{s_0 \sim p(s_0),\; a_{0:\infty} \sim \pi,\; s_{t+1} \sim p(\cdot \mid s_t, a_t)} \left[ \sum_{t=1}^{\infty} \gamma^t\, r(s_t, a_t) \right]
- Policy-based: compute the derivative of J(θ) with respect to the policy parameters θ
- Value-based: estimate the value (or Q) function by means of temporal difference learning
- How can we avoid high-variance policy gradient estimators, as well as the complexity of temporal difference learning?
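As a concrete reference point for the objective above, here is a minimal Python sketch that computes the discounted return of a single sampled trajectory (the term inside the expectation of J(θ)); the rewards and γ below are made-up example values.

def discounted_return(rewards, gamma=0.99):
    # Sum of gamma^t * r_t over one sampled trajectory (indexing from t = 0 here).
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Toy example: a 4-step trajectory with hypothetical per-step rewards.
print(discounted_return([1.0, 0.0, 0.5, 1.0], gamma=0.9))

J(θ) is then the expectation of this quantity over trajectories sampled by the policy.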
8. Preliminaries
Monte-Carlo update
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t - V(S_t) \right), \quad \text{where } G_t = \sum_{k=t}^{\infty} \gamma^{\,k-t}\, r(s_k, a_k)
- Pros: unbiased, good convergence properties
- Cons: high variance
Temporal-Difference update
V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)
- Pros: learns online after every step, low variance
- Cons: bootstrapping - the update involves an estimate, so it is biased
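To make the contrast concrete, a minimal tabular sketch of both updates; the table size, α, γ, and the numbers are illustrative assumptions, not from the slides.

def mc_update(V, state, G, alpha=0.1):
    # Monte-Carlo: move V(S_t) toward the full observed return G_t (unbiased, high variance).
    V[state] += alpha * (G - V[state])

def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    # TD(0): move V(S_t) toward the bootstrapped target R_{t+1} + gamma * V(S_{t+1}) (biased, lower variance).
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])

V = [0.0] * 5                                    # value table for 5 hypothetical states
mc_update(V, state=0, G=3.2)                     # applied once per visited state after an episode ends
td_update(V, state=0, reward=1.0, next_state=1)  # applied online after every single step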
9. Preliminaries
Function Approximation: Policy Gradient
Policy Gradient Theorem
For any differentiable policy πθ(s, a), and for any of the policy objective functions, the policy gradient is
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a) \right]
Monte-Carlo Policy Gradient (REINFORCE)
- Uses the return G_t as an unbiased sample of Q^{\pi_\theta}(s_t, a_t):
\Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, G_t
Reducing variance using a baseline
- A good baseline is the state value function V^{\pi_\theta}(s)
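A minimal PyTorch-style sketch of the REINFORCE loss with an optional baseline; the log-probabilities, returns, and baseline value below are placeholders, not anything from the paper.

import torch

def reinforce_loss(log_probs, returns, baseline=0.0):
    # Negative of sum_t log pi_theta(a_t|s_t) * (G_t - b); taking a gradient step on this
    # loss ascends the Monte-Carlo policy gradient estimate.
    returns = torch.as_tensor(returns, dtype=torch.float32) - baseline
    return -(torch.stack(list(log_probs)) * returns).sum()

# Hypothetical per-step log-probabilities produced by a policy network, with their returns.
log_probs = [torch.tensor(-0.3, requires_grad=True), torch.tensor(-1.2, requires_grad=True)]
loss = reinforce_loss(log_probs, returns=[2.0, 1.5], baseline=1.0)
loss.backward()

Subtracting the baseline leaves the gradient's expectation unchanged but can reduce its variance, which is exactly the point of the slide.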
10. Preliminaries
Actor-critic algorithm
- Critic: updates Q-function parameters w by minimizing
\mathbb{E}_{\pi_\theta}\!\left[ \left( Q^{\pi_\theta}(s, a) - Q_w(s, a) \right)^2 \right]
- Actor: updates policy parameters θ in the direction suggested by the critic
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a) \right]
Reducing variance using a baseline: the advantage function
- A good baseline is the state value function V^{\pi_\theta}(s)
- Advantage function:
A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)
- Rewriting the policy gradient using the advantage function:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(s, a)\, A^{\pi_\theta}(s, a) \right]
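A minimal one-step sketch of the actor-critic losses implied by the formulas above. Note one simplification: the critic here learns V_w(s) and the one-step TD error serves as the advantage estimate, a common variant rather than the exact Q_w critic written on the slide; all tensors below are placeholders.

import torch

def actor_critic_losses(log_prob, value, reward, next_value, gamma=0.99):
    td_target = reward + gamma * next_value.detach()   # bootstrapped target
    advantage = td_target - value                      # one-step estimate of A(s, a)
    critic_loss = advantage.pow(2).mean()              # regress V_w(s) toward the TD target
    actor_loss = -(log_prob * advantage.detach()).mean()
    return actor_loss, critic_loss

# Hypothetical outputs of a policy head and a value head for a single transition.
actor_loss, critic_loss = actor_critic_losses(
    log_prob=torch.tensor(-0.7), value=torch.tensor(0.5, requires_grad=True),
    reward=1.0, next_value=torch.tensor(0.8))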
12. Reward-Conditioned Policies
RCPs Algorithm (left) and Architecture (right)
- Z can be the return (RCP-R) or the advantage (RCP-A)
- Z can be incorporated in the form of multiplicative interactions, i.e. πθ(a|s, Z)
- p̂_k(Z) is represented as a Gaussian distribution, and its parameters Ẑ and σ_Z are updated based on the soft maximum (i.e. log Σ exp) of the target values Z observed so far in the dataset D (a small sketch of this update follows below)
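A rough sketch of the target-value update mentioned in the last bullet: a log-sum-exp style soft maximum over the returns observed so far, followed by sampling the next conditioning value Ẑ from a Gaussian. The exact rule, the β value, and the returns below are my assumptions for illustration, not the authors' code.

import numpy as np

def soft_maximum(returns, beta=1.0):
    # log-mean-exp of the observed returns: close to their max for small beta,
    # close to their mean for large beta.
    r = np.asarray(returns, dtype=np.float64)
    return beta * np.log(np.mean(np.exp(r / beta)))

observed_returns = [1.2, 0.8, 2.5, 1.9]            # hypothetical episode returns stored in D
z_mean = soft_maximum(observed_returns, beta=0.5)  # shift the target-return mean upward
z_std = float(np.std(observed_returns))
z_target = np.random.normal(z_mean, z_std)         # sample Z_hat to condition the next rollout
print(f"next target return: {z_target:.2f}")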
13. Theoretical Motivation for RCPs
Derivation of the two variants of RCPs:
- RCP-R: uses the return as Z
- RCP-A: uses the advantage as Z
RCP-R
Constrained Optimization
\arg\max_{\pi} \; \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}[Z] \quad \text{s.t.} \quad D_{\mathrm{KL}}\big( p_\pi(\tau, Z) \,\|\, \hat{p}(\tau, Z) \big) \le \epsilon
By forming the Lagrangian of this constrained optimization with Lagrange multiplier β,
\mathcal{L}(\pi, \beta) = \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}[Z] + \beta \left( \epsilon - \mathbb{E}_{\tau, Z \sim p_\pi(\tau, Z)}\left[ \log \frac{p_\pi(\tau, Z)}{\hat{p}(\tau, Z)} \right] \right)
14. Theoretical Motivation for RCPs
Constrained Optimization
Differentiating \mathcal{L}(\pi, \beta) with respect to π and β and applying the optimality conditions, we obtain a non-parametric form for the joint trajectory-return distribution of the optimal policy, p_{\pi^*}(\tau, Z) (see AWR, Appendix A):
p_{\pi^*}(\tau, Z) \propto \hat{p}(\tau, Z) \exp\!\left( \frac{Z}{\beta} \right)
Decomposing the joint distribution p_\pi(\tau, Z) into the conditionals p_\pi(Z) and p_\pi(\tau \mid Z),
p_{\pi^*}(\tau \mid Z)\, p_{\pi^*}(Z) \propto \big[ \hat{p}(\tau \mid Z)\, \hat{p}(Z) \big] \exp\!\left( \frac{Z}{\beta} \right)
15. Theoretical Motivation for RCPs
Constrained Optimization
p_{\pi^*}(\tau \mid Z) \propto \hat{p}(\tau \mid Z) \quad \text{(corresponds to Line 9 of the algorithm)}
p_{\pi^*}(Z) \propto \hat{p}(Z) \exp\!\left( \frac{Z}{\beta} \right) \quad \text{(corresponds to Line 10 of the algorithm)}
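To see why this reweighting pushes the conditioning values upward, a small worked example of my own (not from the paper): if the current return distribution p̂(Z) is Gaussian, multiplying by exp(Z/β) simply shifts its mean by σ²/β while leaving the variance unchanged.

% Illustrative example: exponential tilting of a Gaussian return distribution
\hat{p}(Z) = \mathcal{N}(Z;\, \mu, \sigma^2)
\;\Rightarrow\;
p_{\pi^*}(Z) \propto \exp\!\left( -\frac{(Z-\mu)^2}{2\sigma^2} + \frac{Z}{\beta} \right)
\propto \mathcal{N}\!\left( Z;\, \mu + \frac{\sigma^2}{\beta},\, \sigma^2 \right)

This is consistent with the slide-12 update that moves Ẑ toward (a soft maximum of) the higher returns observed in D.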
16. Theoretical Motivation for RCPs
Maximum likelihood estimation
Factorizing p_\pi(\tau \mid Z) = \prod_t \pi(a_t \mid s_t, Z)\, p(s_{t+1} \mid s_t, a_t), we train a parametric policy \pi_\theta(a \mid s, Z) by projecting the optimal non-parametric policy p_{\pi^*} computed above onto the manifold of parametric policies:
\pi_\theta(a \mid s, Z) = \arg\min_\theta \; \mathbb{E}_{Z \sim \mathcal{D}}\left[ D_{\mathrm{KL}}\big( p_{\pi^*}(\tau \mid Z) \,\|\, p_{\pi_\theta}(\tau \mid Z) \big) \right]
= \arg\max_\theta \; \mathbb{E}_{Z \sim \mathcal{D}}\, \mathbb{E}_{a \sim \hat{p}(a \mid s, Z)}\left[ \log \pi_\theta(a \mid s, Z) \right]
Theoretical motivation of RCP-A (see Section 4.3.2 of the paper)
- For RCP-A, a new sample of Z is drawn at each time step, while for RCP-R a sample of the return Z is drawn once for the whole trajectory (Line 5 of the algorithm)
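A minimal runnable sketch of this projection step: supervised maximum-likelihood regression of a policy conditioned on (s, Z) over relabeled tuples. Everything here (network, dimensions, random data, fixed-variance Gaussian policy) is an illustrative placeholder, not the paper's implementation.

import torch
import torch.nn as nn

# Toy reward-conditioned policy: maps [state; Z] to the mean of a Gaussian over actions.
state_dim, action_dim = 4, 2
policy = nn.Sequential(nn.Linear(state_dim + 1, 32), nn.Tanh(), nn.Linear(32, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Hypothetical batch of (state, action, return Z) tuples relabeled from the dataset D.
states, actions, Z = torch.randn(64, state_dim), torch.randn(64, action_dim), torch.randn(64, 1)

# Maximize log pi_theta(a | s, Z): for a fixed-variance Gaussian policy this is MSE regression.
pred_mean = policy(torch.cat([states, Z], dim=-1))
loss = ((pred_mean - actions) ** 2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()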
17. IV. Experimental Evaluation
18. Experimental Evaluation
- Results are averaged across 5 random seeds
- Comparison to RL benchmarks: on-policy (TRPO, PPO) and off-policy (SAC, DDPG)
- AWR: an off-policy RL method that also utilizes supervised learning as a subroutine, but does not condition on rewards and requires an exponential weighting scheme during training
19. Experimental Evaluation
- Heatmap: relationship between the target value Ẑ and the observed values of Z after 2,000 training iterations, for both RCP variants
20. V. Discussion and Future Work
21. Discussion and Future work
- The paper proposes a general class of algorithms that enables learning control policies with standard supervised learning approaches
- Sub-optimal trajectories can be regarded as optimal supervision for a policy that does not aim to attain the largest possible reward, but rather to match the reward of that trajectory
- By then conditioning the policy on the reward, a single model can be trained to simultaneously represent policies for all possible reward values, and to generalize to larger reward values
22. Discussion and Future work
Limitations
- Its sample efficiency and final performance still lag behind the best and most efficient approximate dynamic programming methods (SAC, DDPG, etc.)
- Sometimes the reward-conditioned policies might generalize successfully, and sometimes they might not
- Main challenge of these variants: exploration?
23. References
- Xue Bin Peng, et al., "Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning", 2019
- Jan Peters, et al., "Reinforcement Learning by Reward-Weighted Regression for Operational Space Control", ICML 2007
- RL Course by David Silver, DeepMind
24. Thank you for your attention!