2. / 33
Table of Contents
1. Meta learning
A. What is meta learning
B. Meta learning with Supervised learning
C. Meta learning with Memory
D. Contextual policies
E. Model agnostic meta learning
2. Parallelism in RL
1. High level scheme & decisions
2. Stochastic Gradient descent
3. Policy gradient
4. Actor critic
5. Online learning
19. / 33
High-level RL schematic
[Schematic: generate samples (run the policy) → fit a model / estimate the return → improve the policy]
- estimate: fit the dynamics model p(s'|s, a)
- compute: the sum of rewards Q = Σ_{t'=t}^{T} γ^{t'−t} r_{t'}
- fit: Q(s, a) (actor-critic, Q-learning)
- optimize: π_θ(a|s) via θ ← θ + α ∇_θ J(θ), or take π(s) = argmax_a Q(s, a)
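To make the loop concrete, here is a minimal Python sketch of the three boxes, using a toy 1-D environment and a placeholder policy update that are illustrative only (ToyEnv, rollout, reward_to_go are assumptions, not the lecture's code):

```python
import numpy as np

# Toy stand-ins for the three boxes in the schematic:
# generate samples -> estimate the return (or fit Q) -> improve the policy.

class ToyEnv:
    """1-D random-walk environment used only so the sketch runs end to end."""
    def reset(self):
        self.s = 0.0
        return self.s
    def step(self, a):                 # a in {-1, +1}
        self.s += a + np.random.normal(scale=0.1)
        r = -abs(self.s)               # toy reward: stay near the origin
        return self.s, r, abs(self.s) > 5.0

def rollout(env, policy, T=50):
    """Generate samples: run the current policy for one episode."""
    s, traj = env.reset(), []
    for _ in range(T):
        a = policy(s)
        s2, r, done = env.step(a)
        traj.append((s, a, r))
        s = s2
        if done:
            break
    return traj

def reward_to_go(traj, gamma=0.99):
    """Estimate the return: Q_t = sum_{t'=t}^{T} gamma^(t'-t) r_{t'}."""
    qs, g = [], 0.0
    for _, _, r in reversed(traj):
        g = r + gamma * g
        qs.append(g)
    return qs[::-1]

theta = 0.0                                       # a single policy parameter
policy = lambda s: -1 if s * theta >= 0 else 1    # placeholder policy

for it in range(3):                               # the outer loop of the schematic
    traj = rollout(ToyEnv(), policy)              # generate samples
    qs = reward_to_go(traj)                       # estimate the return
    # improve the policy: a dummy update standing in for
    # a gradient step theta <- theta + alpha * grad J(theta)
    theta += 0.01 * np.mean(qs)
```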
20. / 33
Which parts are slow?
Generate samples
- Collecting samples from a physically existing system is very slow (limited to real time).
- With a simulator like MuJoCo, samples can be generated up to 10,000x faster than real time.
Fitting a model
- Computing the reward (the sum of rewards) is very cheap and fast.
- Fitting a Q function is computationally expensive, but it is a part that can be parallelized.
Improve the policy
- Taking the policy that maximizes the Q function (an argmax) is fast.
- Optimizing the policy parameters with gradient steps also carries some computational cost.
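One way to see where the time goes is to time each stage separately. The sketch below is a toy harness under its own assumptions, not a measurement from the lecture: the slow "real-world-like" environment is simulated with sleep, and the Q fit is a simple least-squares regression standing in for an expensive fit.

```python
import time
import numpy as np

# Toy timing harness for the three stages. The "environments" and the
# least-squares Q fit are placeholders; only the relative pattern matters.

def sample(step_time, n=200):
    """Pretend to collect n transitions; step_time models how fast the env runs."""
    data = []
    for _ in range(n):
        time.sleep(step_time)          # 'real time' vs 'simulated time'
        s, a, r = np.random.randn(4), np.random.randn(2), np.random.randn()
        data.append((s, a, r))
    return data

def fit_q(data):
    """Fit a linear Q(s, a) by least squares (stand-in for an expensive fit)."""
    X = np.array([np.concatenate([s, a]) for s, a, _ in data])
    y = np.array([r for _, _, r in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

for name, dt in [("real-world-like env", 5e-3), ("fast simulator", 1e-5)]:
    t0 = time.perf_counter()
    data = sample(dt)                              # 1) generate samples
    t1 = time.perf_counter()
    total_return = sum(r for _, _, r in data)      # 2) sum of rewards: trivial
    t2 = time.perf_counter()
    fit_q(data)                                    # 3) fit Q(s, a): stand-in for the expensive part
    t3 = time.perf_counter()
    print(f"{name}: sample {t1 - t0:.3f}s | sum rewards {t2 - t1:.6f}s | fit Q {t3 - t2:.4f}s")
```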
24. / 33
Simple example: sample parallelism with PG (1)
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Compute ∇_i = (Σ_t ∇_θ log π_θ(a_t^i|s_t^i)) (r_i − b)
4. Update: θ ← θ + α Σ_i ∇_i
Since the trajectories are independent of each other, parallelizing step 1 (sample generation) is the easiest option; a sketch follows.
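A minimal sketch of step-1 parallelism, assuming Python's multiprocessing and a toy rollout function; the policy, dynamics, and names (THETA, rollout) are placeholders, not the lecture's code:

```python
import numpy as np
from multiprocessing import Pool

# Step 1 parallelized: each worker runs the policy independently and
# returns one trajectory. The rollout below is a toy placeholder.

THETA = np.zeros(3)   # current policy parameters, available to the workers

def rollout(seed, T=50):
    """Collect one trajectory tau_i = {s_1, a_1, ..., s_T, a_T} with the current policy."""
    rng = np.random.default_rng(seed)
    s, traj = rng.normal(size=3), []
    for _ in range(T):
        a = float(np.tanh(s @ THETA) + 0.1 * rng.normal())    # pi_theta(a|s), toy
        traj.append((s.copy(), a))
        s = 0.9 * s + a + 0.05 * rng.normal(size=3)            # toy dynamics
    return traj

if __name__ == "__main__":
    N = 8
    with Pool(processes=4) as pool:
        trajectories = pool.map(rollout, range(N))   # N independent rollouts in parallel
    print(len(trajectories), "trajectories collected")
```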
25. / 33
Simple example: sample parallelism with PG (2)
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Compute ∇_i = (Σ_t ∇_θ log π_θ(a_t^i|s_t^i)) (r_i − b)
4. Update: θ ← θ + α Σ_i ∇_i
Computing the reward in step 2 is also independent per trajectory, so it can be parallelized inside the same workers; a sketch follows.
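A small variation on the previous sketch: each worker also computes its return r_i = r(τ_i), so steps 1 and 2 both run in parallel. The reward function and names are again toy placeholders:

```python
import numpy as np
from multiprocessing import Pool

# The worker also computes r_i = r(tau_i), so steps 1 and 2 both run in
# parallel. Toy reward: negative distance to the origin.

THETA = np.zeros(3)

def rollout_and_return(seed, T=50):
    rng = np.random.default_rng(seed)
    s, traj, total_r = rng.normal(size=3), [], 0.0
    for _ in range(T):
        a = float(np.tanh(s @ THETA) + 0.1 * rng.normal())
        traj.append((s.copy(), a))
        s = 0.9 * s + a + 0.05 * rng.normal(size=3)
        total_r += -float(np.linalg.norm(s))        # accumulate r_i inside the worker
    return traj, total_r

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(rollout_and_return, range(8))
    trajectories, returns = zip(*results)
    print("mean return:", sum(returns) / len(returns))
```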
26. / 33
Simple example: sample parallelism with PG (3)
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Compute ∇_i = (Σ_t ∇_θ log π_θ(a_t^i|s_t^i)) (r_i − b)
4. Update: θ ← θ + α Σ_i ∇_i
The per-trajectory gradient computation in step 3 is also independent, so steps 1-3 can all run in parallel and only the summed update in step 4 is applied centrally; this is the same pattern as data-parallel SGD (sketched below).
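A minimal sketch of the full data-parallel pattern, assuming a toy linear-Gaussian policy a ~ N(θᵀs, σ²) so that ∇_θ log π_θ(a|s) = (a − θᵀs)s/σ²; the baseline is held fixed and all names are illustrative:

```python
import numpy as np
from multiprocessing import Pool

# Steps 1-3 all run inside the worker; the main process only sums the
# per-trajectory gradients and takes one step (data-parallel SGD pattern).
# Toy linear-Gaussian policy: a ~ N(theta . s, sigma^2).

THETA = np.zeros(3)
SIGMA = 0.3

def worker(seed, T=50):
    rng = np.random.default_rng(seed)
    s = rng.normal(size=3)
    grad_logp = np.zeros(3)
    total_r = 0.0
    for _ in range(T):
        mean = float(s @ THETA)
        a = mean + SIGMA * rng.normal()                   # sample from pi_theta(a|s)
        grad_logp += (a - mean) * s / SIGMA**2            # grad_theta log pi_theta(a|s)
        s = 0.9 * s + a + 0.05 * rng.normal(size=3)
        total_r += -float(np.linalg.norm(s))
    return grad_logp, total_r                             # enough to form grad_i

if __name__ == "__main__":
    alpha, baseline = 1e-3, 0.0                           # baseline b kept fixed here
    with Pool(4) as pool:
        results = pool.map(worker, range(8))
    grads = [g * (r - baseline) for g, r in results]      # grad_i = (sum_t grad log pi)(r_i - b)
    THETA = THETA + alpha * np.sum(grads, axis=0)         # theta <- theta + alpha * sum_i grad_i
    print("updated theta:", THETA)
```

Only the per-trajectory gradients cross process boundaries, which is exactly why this step parallelizes like ordinary data-parallel SGD.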
27. / 33
What if we add a critic?
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Update A(s_t^i, a_t^i) with regression to target values
4. Compute ∇_i = Σ_t ∇_θ log π_θ(a_t^i|s_t^i) A(s_t^i, a_t^i)
5. Update: θ ← θ + α Σ_i ∇_i
Synchronization point
The policy-gradient computation in step 4 needs the critic, so it can only run after the regression in step 3 has finished for all samples; this is the synchronization point (sketched below).
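A minimal sketch of that synchronization point, assuming a toy linear critic fit by least squares and pre-generated worker data; the advantage estimate and all names are illustrative:

```python
import numpy as np

# Sketch of the synchronization point: every worker's data must arrive
# before the critic (here a linear fit) can be regressed, and only then
# can the policy-gradient terms be computed.

def fit_critic(states, targets):
    """Regress a linear critic to target values (step 3): the sync barrier."""
    X = np.c_[states, np.ones(len(states))]
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return lambda s: float(np.r_[s, 1.0] @ w)

# pretend these came back from N parallel workers (steps 1-2)
rng = np.random.default_rng(0)
states  = rng.normal(size=(256, 3))
actions = rng.normal(size=256)
rewards = -np.linalg.norm(states, axis=1)

critic = fit_critic(states, rewards)          # <-- all workers must be done before this
adv = rewards - np.array([critic(s) for s in states])    # toy advantage estimate

theta = np.zeros(3)
# toy grad log pi term for a linear-Gaussian policy with sigma = 1
grad = sum((a - s @ theta) * s * adv_i for s, a, adv_i in zip(states, actions, adv))
theta = theta + 1e-3 * grad                   # steps 4-5: only possible after the critic is fit
print("theta after one synchronized update:", theta)
```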
28. / 33
What if we add a critic?
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Update A(s_t^i, a_t^i) with regression to target values
4. Compute ∇_i = Σ_t ∇_θ log π_θ(a_t^i|s_t^i) A(s_t^i, a_t^i)
5. Update: θ ← θ + α Σ_i ∇_i
Practical way: Drop the sync point
In practice, apply the asynchronous SGD trick: instead of waiting for the critic fit to finish, just use the latest available (possibly slightly stale) critic parameters. A sketch follows.
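A minimal sketch of the trick, assuming a background thread that keeps refitting a toy linear critic while the policy update simply reads whatever critic weights are currently published; everything here is illustrative:

```python
import threading
import time
import numpy as np

# Asynchronous-SGD-style trick: a background thread keeps refitting the
# critic while the policy update just reads the latest (possibly stale)
# critic weights instead of waiting for the fit to finish.

critic_w = np.zeros(4)          # shared critic parameters (slightly stale is OK)
stop = False

def critic_loop():
    global critic_w
    rng = np.random.default_rng(1)
    while not stop:
        states = rng.normal(size=(256, 3))
        targets = -np.linalg.norm(states, axis=1)
        X = np.c_[states, np.ones(len(states))]
        w, *_ = np.linalg.lstsq(X, targets, rcond=None)
        critic_w = w            # publish new critic weights (reference swap)
        time.sleep(0.01)

threading.Thread(target=critic_loop, daemon=True).start()

theta, rng = np.zeros(3), np.random.default_rng(2)
for step in range(100):         # policy updates never wait on the critic fit
    s = rng.normal(size=3)
    a = float(s @ theta) + 0.3 * rng.normal()
    adv = -np.linalg.norm(s) - float(np.r_[s, 1.0] @ critic_w)   # uses latest critic
    theta += 1e-3 * (a - s @ theta) * s * adv
stop = True
print("theta:", theta)
```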
29. / 33
What if we run online?
1. Collect samples τ_i = {s_1^i, a_1^i, …, s_T^i, a_T^i} by running π_θ(a_t|s_t) N times
2. Compute r_i = r(τ_i)
3. Update A(s_t^i, a_t^i) with regression to target values
4. Compute ∇_i = Σ_t ∇_θ log π_θ(a_t^i|s_t^i) A(s_t^i, a_t^i)
5. Update: θ ← θ + α Σ_i ∇_i
Synchronization point
Now only step 5, the parameter update, remains as a synchronization point; just as with the drop-sync critic, this sync can be dropped as well (sketched below).
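A minimal sketch of dropping that last sync point, assuming worker threads that push toy policy-gradient updates to shared parameters as soon as they are ready instead of waiting for each other; the small lock around the update is a simplification (Hogwild-style code often omits even that):

```python
import threading
import numpy as np

# Dropping the last sync point: each worker applies its gradient to the
# shared parameters whenever it is ready, without waiting for the others.

theta = np.zeros(3)             # shared parameters
lock = threading.Lock()         # minimal lock around the update itself

def worker(seed, steps=200, alpha=1e-3):
    global theta
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        local = theta.copy()                      # read possibly stale parameters
        s = rng.normal(size=3)
        a = float(s @ local) + 0.3 * rng.normal()
        r = -float(np.linalg.norm(0.9 * s + a))
        grad = (a - s @ local) * s * r            # toy policy-gradient term
        with lock:
            theta = theta + alpha * grad          # push the update as soon as it is ready

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("theta after asynchronous updates:", theta)
```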
30. / 33
Asynchronous Advantage Actor-Critic (A3C)
1. Collect samples (s_i, a_i, s'_i) by running π_θ(a_t|s_t) for 1 step
2. Compute r_i = r(s_i, a_i)
3. Update A(s_i, a_i) with regression to target values
4. Compute ∇_i = ∇_θ log π_θ(a_i|s_i) A(s_i, a_i)
5. Update: θ ← θ + α Σ_i ∇_i
Asynchronous SGD in the actor-critic:
- Worker threads collect experience.
- Each worker computes updates for the policy and the critic.
- The updates are sent to a parameter server.
- The server updates the parameters and sends them back to the workers.
Properties of the algorithm:
- Each worker explores differently (slightly different parameters), giving an effect similar to bootstrapping.
- Advantage: in terms of total wall-clock time, it is faster than a fully synchronous algorithm.
- Disadvantage: in terms of the number of samples used, learning is slower (less sample-efficient).
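A heavily simplified A3C-style sketch, assuming toy linear actor and critic models, a toy environment, and thread-shared parameters; this illustrates the structure described above, not the original DeepMind implementation:

```python
import threading
import numpy as np

# A3C-style sketch: each worker thread steps its own copy of a toy
# environment, computes an advantage from the shared (linear) critic,
# and immediately applies actor and critic updates to the shared
# parameters, with no synchronization between workers.

actor_w  = np.zeros(3)          # shared policy parameters  theta
critic_w = np.zeros(4)          # shared value parameters   V(s) ~ [s, 1] . critic_w
ALPHA, BETA, GAMMA = 1e-3, 1e-2, 0.99

def worker(seed, steps=500):
    global actor_w, critic_w
    rng = np.random.default_rng(seed)
    s = rng.normal(size=3)                                   # each worker has its own env state
    for _ in range(steps):
        a = float(s @ actor_w) + 0.3 * rng.normal()          # pi_theta(a|s), toy Gaussian
        s2 = 0.9 * s + a + 0.05 * rng.normal(size=3)         # toy dynamics
        r = -float(np.linalg.norm(s2))                       # toy reward
        v, v2 = float(np.r_[s, 1] @ critic_w), float(np.r_[s2, 1] @ critic_w)
        adv = r + GAMMA * v2 - v                             # one-step advantage estimate
        # asynchronous updates to the shared parameters (stale reads are tolerated)
        actor_w  = actor_w  + ALPHA * (a - s @ actor_w) * s * adv   # grad log pi * A
        critic_w = critic_w + BETA  * adv * np.r_[s, 1]             # TD-style critic update
        s = s2

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("actor:", actor_w)
print("critic:", critic_w)
```

Because each thread holds its own environment and reads slightly stale parameters, the workers naturally decorrelate their experience, which is the bootstrapping-like effect noted in the slide.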