Deep Reinforcement Learning
CS294-112, 2017 Fall
Lec18 Meta Learning & Parallelism in RL
蠏觜

螻る蟲 一蟆曙螻牛螻
Table of Contents
1. Meta learning
A. What is meta learning
B. Meta learning with Supervised learning
C. Meta learning with Memory
D. Contextual policies
E. Model agnostic meta learning
2. Parallelism in RL
1. High level scheme & decisions
2. Stochastic Gradient descent
3. Policy gradient
4. Actor critic
5. Online learning
Meta Learning
What is meta-learning?
- Meta learning = learning to learn
  Having learned many previous tasks, can the agent use that experience to learn a new task much faster?
- Closely related to multi-task learning
- An increasingly active research topic
- Deep learning framing: learning to learn
- Meta-optimization over hyperparameters: related to AutoML
- Several formulations exist:
  - Learning an optimizer
  - Learning an RNN that ingests experience: a model that reads past experience as input
  - Learning a representation
Why is meta-learning a good idea?
- Deep RL, and especially model-free RL, needs a huge number of samples.
  If we can meta-learn a faster reinforcement learner, new tasks can be learned far more quickly.
- Why can humans learn so much faster? Prior experience improves the learning process itself,
  whereas a learner that starts from scratch with no prior experience needs a lot of data.
- What does the meta-learner learn to do?
  - Explore more intelligently
  - Avoid trying actions that are known to be useless (effectively shrinking the search space)
  - Acquire the right features more quickly: if a particular part of the input mattered in other tasks,
    that part is likely to carry useful information for the new task as well
Meta-learning with supervised learning
- Ordinary supervised learning: $f(x) \to y$
- Supervised meta-learning (with a tiny train set, e.g. 1-5 examples per class): $f(\mathcal{D}_{\text{train}}, x) \to y$
- Each small training set defines one task; once the parameters have been meta-learned,
  a brand-new task can be solved quickly from just a few examples.
- In other words, the whole train set and the query $x$ enter the function together;
  that function $f$ is the meta-learner.
- It has to classify while ingesting an entire (small) dataset, i.e. read a sequence -> e.g. an RNN
<Image classification, few-shot learning>
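To make $f(\mathcal{D}_{\text{train}}, x) \to y$ concrete, here is a minimal sketch (my own illustration, not code from the lecture) of an RNN meta-learner that ingests the support set as a sequence of (x, y) pairs and then predicts the label of a query; every name and dimension below is made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNMetaLearner(nn.Module):
    """Reads the support-set (x, y) pairs as a sequence, then the query x."""
    def __init__(self, x_dim=16, n_classes=5, hidden=64):
        super().__init__()
        self.n_classes = n_classes
        self.rnn = nn.GRU(x_dim + n_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, support_x, support_y, query_x):
        # support_x: (B, K, x_dim); support_y: (B, K) integer labels; query_x: (B, x_dim)
        y_onehot = F.one_hot(support_y, self.n_classes).float()          # (B, K, C)
        support = torch.cat([support_x, y_onehot], dim=-1)               # (B, K, x_dim + C)
        query = torch.cat([query_x, torch.zeros_like(y_onehot[:, 0])], dim=-1)
        seq = torch.cat([support, query.unsqueeze(1)], dim=1)            # query goes last
        out, _ = self.rnn(seq)
        return self.head(out[:, -1])                                     # logits for the query label

# Usage sketch:
# f = RNNMetaLearner()
# logits = f(torch.randn(8, 5, 16), torch.randint(0, 5, (8, 5)), torch.randn(8, 16))
```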
The meta-learning problem in RL
- Supervised meta-learning: $f(\mathcal{D}_{\text{train}}, x) \to y$
- Reinforcement meta-learning: $f(\mathcal{D}_{\text{experience}}, \text{state}) \to \text{action}$,
  where $\mathcal{D}_{\text{experience}} = \{s_1, a_1, r_1, \dots, s_n, a_n, r_n\}$
Meta-learning in RL with memory
- Agent: a 2D agent moving around a planar arena
- Reward: positive only when the agent is inside the target region, 0 everywhere else
- The agent does not know where the target region is, so it must discover it during an episode
  and remember it for the following episodes
- Performance differs sharply with and without memory; without memory the behavior is essentially chaotic
[Figure: water maze, 1st/2nd/3rd attempts, with memory vs. without memory]
RL²
- Trial: the span over which memory is shared; the task stays fixed within a (meta-)trial
  (e.g. the target region stays in the same place)
- Episode: one attempt at the water maze, from reset until the target is reached
- Memory is reset between trials but shared across the episodes within a trial
- From the policy gradient's point of view, one sample is an entire trial:
  the episodes inside it are concatenated into a single sequence
[Figure: the RNN's hidden state is carried across episodes within a trial]
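A sketch of the memory bookkeeping described above (my own pseudocode, not the RL² reference implementation; it assumes an old-style gym environment with `reset()`/`step()` and a `policy_rnn(obs, hidden) -> (action, hidden)` callable): the hidden state is reset only at trial boundaries, so whatever was discovered in episode 1 is still available in episode 2.

```python
import torch

def run_trial(env, policy_rnn, episodes_per_trial=2, max_steps=100):
    hidden = None                                # memory is reset only at the trial boundary
    for _ in range(episodes_per_trial):
        obs = env.reset()                        # new episode, same task (e.g. same target region)
        done, t = False, 0
        while not done and t < max_steps:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
            action, hidden = policy_rnn(obs_t, hidden)   # hidden persists across episodes
            obs, reward, done, _ = env.step(action)      # assumed gym-style 4-tuple
            t += 1
    # the next trial samples a new task and starts again with hidden = None
```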
RL²
- Each maze's layout, start position, and goal are all random
- The RNN policy therefore has to learn how to solve mazes in general
- Because memory is shared within a trial, performance should improve over the successive episodes of the same trial
- In Trial 1 the agent reuses the route it discovered in an earlier episode;
  in Trial 2 the first episode's route was poor and the agent fails to exploit it in later episodes
[Figure: example rollouts for Trial 1 and Trial 2]
Connection to contextual policies
contextual policy: $\pi_\theta(a \mid s, \omega)$
$\omega$: the task context, e.g. a stack location or a walking direction
- In multi-task learning, a contextual policy must first be told explicitly what to do:
  the context $\omega$ is supplied from outside
- e.g. a household robot has to be told which of several chores it should carry out
- $\omega$ is the context information; in meta-RL the experience itself plays that role
- This context consists of all of the previous input tuples, so it is a far longer
  and more complex signal than a simple task label
- Holding only that context, the agent is not told what to do;
  it has to extract the insight about what to do on its own
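As a concrete (hypothetical) sketch of $\pi_\theta(a \mid s, \omega)$: the simplest contextual policy just concatenates the context to the state before the network; the dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class ContextualPolicy(nn.Module):
    """pi_theta(a | s, omega): condition on omega by concatenating it to the state."""
    def __init__(self, s_dim=8, omega_dim=2, a_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + omega_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, a_dim),
        )

    def forward(self, s, omega):
        return self.net(torch.cat([s, omega], dim=-1))   # e.g. the mean action for (s, omega)
```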
Back to representations
- Pre-training on ImageNet and then transferring is the classic example of transfer learning
- It can be viewed as a (simple) form of meta-learning
- The feature maps produced by ImageNet pre-training change very little during fine-tuning
[Figure: pre-trained feature maps reused for other sub-tasks]
Preparing a model for faster learning
- Model-agnostic meta-learning (MAML): meta-learning that does not constrain the model architecture
- Picture a policy gradient algorithm as walking step by step along the gradient
- Single-task learning only learns to walk in the direction that improves that one task;
  extended to multiple tasks, the question becomes from which point the walking should start
- We want a starting point such that, for any particular task, one gradient step already yields a large reward
- That is, after one gradient step on task i's reward the return should already be high;
  we cannot train separately for every possible direction, and a single gradient step is exactly what covers them
- As in the MAML figure: compute the gradients for several different tasks (losses $L_1, L_2, L_3$),
  then move $\theta$ to a point from which each task's own gradient step works well
Single task: $\theta \leftarrow \theta + \alpha \nabla_\theta R(\theta)$
Multi task: $\theta \leftarrow \theta + \alpha \sum_i \nabla_\theta R_i\big[\theta + \alpha \nabla_\theta R_i(\theta)\big]$
(the inner term $\theta + \alpha \nabla_\theta R_i(\theta)$ is the parameter vector after one adaptation step on task $i$;
the outer update moves $\theta$ in the direction that helps most after that step)
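A minimal sketch of the multi-task update above, written for a toy supervised regression problem so that the inner gradient step can stay inside the autograd graph (losses are minimized, so the signs flip relative to the reward form). This is my own illustration rather than the lecture's code; an RL version would replace `loss` with a policy-gradient surrogate objective.

```python
import torch

def model(theta, x):                       # toy linear model, theta = (W, b)
    W, b = theta
    return x @ W + b

def loss(theta, x, y):
    return ((model(theta, x) - y) ** 2).mean()

def maml_outer_step(theta, tasks, alpha=0.01, beta=0.001):
    """One meta-update. tasks: iterable of ((xs, ys), (xq, yq)) support/query pairs."""
    meta_grads = [torch.zeros_like(p) for p in theta]
    for (xs, ys), (xq, yq) in tasks:
        # inner step: adapted parameters theta_i', keeping the graph for second-order terms
        g = torch.autograd.grad(loss(theta, xs, ys), theta, create_graph=True)
        theta_i = [p - alpha * gi for p, gi in zip(theta, g)]
        # outer gradient: d/dtheta of the post-adaptation loss on the query set
        outer = torch.autograd.grad(loss(theta_i, xq, yq), theta)
        meta_grads = [m + og for m, og in zip(meta_grads, outer)]
    with torch.no_grad():
        return [(p - beta * m).requires_grad_() for p, m in zip(theta, meta_grads)]

# Usage sketch:
# theta = [torch.randn(3, 1, requires_grad=True), torch.zeros(1, requires_grad=True)]
# tasks = [((torch.randn(10, 3), torch.randn(10, 1)), (torch.randn(10, 3), torch.randn(10, 1)))]
# theta = maml_outer_step(theta, tasks)
```

The second `autograd.grad` call differentiates through the inner update, which corresponds to the $\nabla_\theta R_i\big[\theta + \alpha \nabla_\theta R_i(\theta)\big]$ term above.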
What did we just do?
supervised meta-learning: $f(\mathcal{D}_{\text{train}}, x) \to y$
reinforcement meta-learning: $f(\mathcal{D}_{\text{experience}}, s) \to a$
model-agnostic meta-learning: $f_{\text{MAML}}(\mathcal{D}_{\text{train}}, x) \to y$,
with $f_{\text{MAML}}(\mathcal{D}_{\text{train}}, x) = f_{\theta'}(x)$
- MAML, too, is ultimately realized as a neural-network computation
- Its update rule is not fundamentally different from what we had before:
  as a function that maps a training set and a query to a prediction,
  the MAML learner and the RNN meta-learner are both just (large) neural computation graphs
- The difference is the inductive bias: a gradient-descent step is built in,
  which is believed to help the MAML approach generalize better
$\theta' = \theta - \alpha \sum_{(x,y) \in \mathcal{D}_{\text{train}}} \nabla_\theta L(f_\theta(x), y)$
Meta-learning summary & open problems
- Meta-learning: learning to learn
- Supervised meta-learning: the whole training set is read as an input; in RL that input is the experience
- RL meta-learning with RNN policies
  - Past experience is ingested by an RNN
  - At test time, using what was learned only requires running the RNN forward
  - Essentially a contextual policy whose context is a very rich experience signal
- Model-agnostic meta-learning
  - Learning proceeds by ordinary gradient descent
  - Looks just like the usual model and training procedure
  - Can learn new tasks much faster than standard RL algorithms
Meta-learning summary & open problems
- Pro: a model can learn new tasks quickly by exploiting prior experience
- Con: performance is only assured on tasks like those seen during meta-training; very different tasks can be handled poorly
- Challenges
  - RNNs are hard to train and hard to scale up
  - The model-agnostic approach can run into optimization difficulties
  - Designing the right task distribution is hard:
    supervised learning assumes the training and test distributions match,
    but if we meta-train on one task distribution and test on another, will performance hold up?
    Building and sampling the meta-training tasks may itself be the hardest part of the problem
  - Very sensitive to the task distribution in practice:
    meta-overfitting = overfitting to the training tasks
Parallelism in RL
Overview
- We have seen a variety of methods for finding a policy
- All of those algorithms were described as sequential procedures
- Can RL algorithms be processed in parallel?
- Multiple learning threads
- Multiple experience collection threads
High-level RL schematic
estimate the return / fit the model: dynamics $p(s' \mid s, a)$
compute the sum of rewards: $Q = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$
fit $Q(s, a)$ (actor-critic, Q-learning)
optimize $\pi_\theta(a \mid s)$: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, or $\pi(s) = \arg\max_a Q(s, a)$
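The three boxes in this schematic can be written as one generic loop; the callables below are placeholders for whatever concrete algorithm is plugged in (model-based, actor-critic, Q-learning), not a fixed API.

```python
# Schematic sketch of the RL "anatomy": sample, fit/estimate, improve, repeat.
def rl_training_loop(generate_samples, fit_or_estimate, improve_policy, n_iters=100):
    state = None                                   # whatever the learner carries across iterations
    for _ in range(n_iters):
        batch = generate_samples()                 # run pi_theta in the environment
        state = fit_or_estimate(batch, state)      # fit p(s'|s,a), sum rewards, or fit Q(s,a)
        improve_policy(batch, state)               # gradient step on J(theta) or argmax of Q

# Example with trivial stand-ins, just to show the call pattern:
if __name__ == "__main__":
    rl_training_loop(lambda: [0.0],
                     lambda batch, state: sum(batch),
                     lambda batch, state: None,
                     n_iters=3)
```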
Which parts are slow?
Generate samples
- If the samples come from a physically existing system, running it is very slow.
- A fast simulator such as MuJoCo can generate samples up to roughly 10,000x faster than real time.
Fitting a model
- Summing up rewards costs very little and is fast.
- Fitting a Q function is more computationally expensive, but there is clear room for speeding it up.
Improve the policy
- Given a Q function, extracting the (argmax) policy is fast.
- Optimizing the policy with gradient steps carries some computational cost.
Which parts can we parallelize?
Every part of this loop can be parallelized.
High-level decisions
- Online learning: the algorithm updates immediately as each new piece of data arrives.
  Batch learning: collect a big bucket of experience first, then update (policy gradient is the typical example).
- Synchronous or asynchronous?
Parallelizing policy gradient
- Samples can be collected in parallel
- Typically batch + synchronous
Parallelizing Q-learning
- Sample collection and SGD fitting can each be parallelized
- Must decide where the synchronization points go
Relationship to parallelized SGD
1. Parallelizing the learning essentially amounts to parallelizing SGD.
2. Simple parallel SGD:
   - Each worker holds a different shard of the data.
   - The workers compute gradients in parallel and send them to a server.
   - The server aggregates the incoming gradients and broadcasts the new parameters back to the workers.
3. Mathematically the same as ordinary SGD, but it can also run asynchronously (with some delay):
   - Each worker pushes its gradient whenever it is ready; the server applies it and sends the parameters back.
   - Otherwise every worker would have to wait at a synchronous barrier.
4. So SGD parallelizes cleanly, but asynchronous updates may be computed from very stale parameters,
   which introduces new issues and can slow convergence.
5. Which variant works best depends on the task and the problem.
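A toy, synchronous version of the simple parallel SGD scheme above (my own sketch, not from the lecture): each worker owns a data shard and returns a gradient, and the "server" aggregates and updates. The asynchronous variant would drop the per-step barrier and apply each gradient as it arrives, at the cost of staleness.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
theta = np.zeros(5)                                    # "server" parameters
shards = np.array_split(np.arange(len(X)), 4)          # one data shard per worker

def worker_grad(shard, params):
    Xs, ys = X[shard], y[shard]
    return 2 * Xs.T @ (Xs @ params - ys) / len(shard)   # gradient of the squared error

with ThreadPoolExecutor(max_workers=4) as pool:
    for step in range(100):
        snapshot = theta.copy()                         # broadcast current parameters
        grads = list(pool.map(lambda s: worker_grad(s, snapshot), shards))
        theta -= 0.01 * np.mean(grads, axis=0)          # server aggregates and updates
```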
Simple example: sample parallelism with PG (1)
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Compute $\nabla_i = \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)(r_i - b)$
4. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
The samples are independent of one another, so parallelizing step 1 (sample generation)
is the easiest place to start.
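A sketch of parallelizing only step 1 (the rollouts), with steps 2-4 left synchronous on the learner. The environment, policy, and reward below are throwaway toys; only the structure (a pool of workers, each producing independent trajectories under the current $\theta$) reflects the slide.

```python
import numpy as np
from multiprocessing import Pool

def rollout(args):
    theta, seed, T = args
    rng = np.random.default_rng(seed)
    states, actions, reward = [], [], 0.0
    s = rng.normal(size=theta.shape[0])                      # toy initial state
    for _ in range(T):
        a = float(np.tanh(s @ theta) + 0.1 * rng.normal())   # toy stochastic policy
        reward += -abs(a - 1.0)                              # toy reward
        states.append(s)
        actions.append(a)
        s = rng.normal(size=theta.shape[0])                  # toy dynamics
    return states, actions, reward

if __name__ == "__main__":
    theta = np.zeros(4)
    with Pool(processes=4) as pool:                          # step 1 in parallel
        trajs = pool.map(rollout, [(theta, i, 50) for i in range(8)])
    rewards = np.array([r for _, _, r in trajs])             # step 2 (also trivially parallel)
    baseline = rewards.mean()                                # steps 3-4 would follow synchronously
```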
Simple example: sample parallelism with PG (2)
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Compute $\nabla_i = \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)(r_i - b)$
4. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
Computing the rewards in step 2 can likewise be parallelized, one trajectory per worker.
Simple example: sample parallelism with PG (3)
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Compute $\nabla_i = \left(\sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\right)(r_i - b)$
4. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
The gradient computation and the update (steps 3-4) can be parallelized as well, in the same way as parallel SGD.
What if we add a critic?
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Update $\hat{A}(s_t^i, a_t^i)$ with regression to target values
4. Compute $\nabla_i = \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{A}(s_t^i, a_t^i)$
5. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
Synchronization point: the policy gradient in step 4 needs the critic,
so step 3 has to be completed before it.
What if we add a critic?
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Update $\hat{A}(s_t^i, a_t^i)$ with regression to target values
4. Compute $\nabla_i = \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{A}(s_t^i, a_t^i)$
5. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
Practical way: drop the sync point.
In practice the asynchronous SGD trick is applied here: the policy gradient simply uses
whatever (possibly slightly stale) critic is currently available.
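To make the sync point concrete, a toy sketch (my own; a linear critic on random data stands in for step 3): the synchronous version finishes the critic regression before computing advantages, while the drop-the-sync-point variant would simply read whatever `phi` happens to be current.

```python
import numpy as np

def fit_critic(phi, states, targets, lr=0.1, iters=50):
    for _ in range(iters):                               # regression to target values (step 3)
        phi -= lr * 2 * states.T @ (states @ phi - targets) / len(targets)
    return phi

rng = np.random.default_rng(0)
states = rng.normal(size=(64, 6))
returns = rng.normal(size=64)
phi = np.zeros(6)

phi = fit_critic(phi, states, returns)                   # synchronous version: wait here...
advantages = returns - states @ phi                      # ...before step 4 uses the critic
```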
What if we run online?
1. Collect samples $\tau_i = \{s_1^i, a_1^i, \dots, s_T^i, a_T^i\}$ by running $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Compute $r_i = r(\tau_i)$
3. Update $\hat{A}(s_t^i, a_t^i)$ with regression to target values
4. Compute $\nabla_i = \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i)\, \hat{A}(s_t^i, a_t^i)$
5. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
Synchronization point: only step 5 (the parameter update) remains synchronized;
the rest works like the drop-the-sync-point critic variant above.
Asynchronous Actor-Critic Agents (A3C)
1. Collect a transition $(s_i, a_i, s_i')$ by running $\pi_\theta(a_t \mid s_t)$ for 1 step
2. Compute $r_i = r(s_i, a_i)$
3. Update $\hat{A}(s_i, a_i)$ with regression to target values
4. Compute $\nabla_i = \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, \hat{A}(s_i, a_i)$
5. Update: $\theta \leftarrow \theta + \alpha \sum_i \nabla_i$
Asynchronous SGD in the actor-critic:
- Several worker threads collect experience
- Each computes updates for the policy and the critic
- The updates are sent to a parameter server
- The server applies them and sends the parameters back to the workers
Properties of the algorithm:
- Each worker holds (slightly) different parameters, so their exploration differs -> an effect similar to bootstrapping
- Pro: in total wall-clock time it is faster than a fully synchronous algorithm
- Con: measured in the number of samples it is actually slower (less sample-efficient)
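A toy thread-based sketch of the control flow described above (not the actual A3C implementation; the gradient is a random placeholder): each worker pulls the current parameters, works with its own slightly stale copy, and pushes an update whenever it is ready.

```python
import numpy as np
import threading

shared_theta = np.zeros(4)                      # "server" parameters (policy + critic)
lock = threading.Lock()                         # protects only the in-place update, no barrier

def worker(worker_id, n_updates=100):
    rng = np.random.default_rng(worker_id)
    for _ in range(n_updates):
        local_theta = shared_theta.copy()       # pull the current parameters
        # ... collect a 1-step (or n-step) rollout with local_theta here ...
        grad = rng.normal(size=4)               # placeholder for the actor-critic gradient
        with lock:
            shared_theta += 0.001 * grad        # push the update asynchronously
        # different workers hold different local_theta -> extra exploration

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```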
Actor-critic algorithm: A3C
- In the learning curves, each A3C worker holds slightly different parameters -> exploration benefit
- A3C needs on the order of 20M steps
  + much faster in wall-clock time,
  + but slower when measured in steps / samples
- DDPG needs on the order of 1M steps
Multiple experience collection threads
Questions
Thank you.
