Reinforcement Learning Anatomy Class:
Rainbow
From Theory to Implementation

Curt Park
curt.park@medipixel.io
AI Research / Developer in Medipixel

Kyunghwan Kim
kh.kim@medipixel.io
AI Research / Developer in Medipixel

Guidewire control by RL
https://youtu.be/uAZtUNwA4i0

Contents
1. What is reinforcement learning?
2. Deep reinforcement learning from the basics
- Q-Learning
- Function approximation
- DQN
- Rainbow DQN

https://www.youtube.com/watch?v=TmPfTpjtdgg
Reinforcement learning
DeepMind DQN with Atari game

https://youtu.be/UZHTNBMAfAA
Reinforcement learning
OpenAI Dota2

https://www.youtube.com/watch?v=cUTMhmVh1qs
Reinforcement learning
Deepmind AlphaStar

Reinforcement learning
https://youtu.be/Dr0RvX1F-YQ
“Sim-to-Real Reinforcement Learning for Deformable Object Manipulation”
J. Matas, S. James and A. J Davison
CoRL, 2018

Reinforcement learning
https://www.youtube.com/watch?v=FmMPHL3TcrE
“Learning to Walk via Deep Reinforcement Learning”
T. Haarnoja et al.
arXiv:1812.11103v1

What is reinforcement learning?
● ????? ?? ??
+ 10 - 100

What is reinforcement learning?
ref : Reinforcement Learning: An Introduction, 2nd ed (Sutton and Barto)
Reinforcement learning is learning what to do - how to map situations to actions -
so as to maximize a numerical reward signal

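What is reinforcement learning?
● Reinforcement Learning
- Learning is driven by a reward (Reward).
- In a given state (State), the agent (Agent) chooses an action (Action) and interacts with the environment (Environment).
- The agent learns to act so that the reward is maximized.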

Markov Decision Process
< MDP framework: the Agent sends an Action to the Environment, and the Environment returns a State and a Reward >

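Characteristics of reinforcement learning
● Trial-and-error search
○ The learner (Agent) is not told which action to take; it has to discover which actions yield the most reward by trying them.
● Delayed reward
○ An action can affect not only the immediate reward but also the next situation and, through it, all later rewards.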

Characteristics of reinforcement learning
● Trial-and-error search
- Exploitation : choose the action that the learned policy prefers
- Exploration : choose an action that tries something new

Restaurant selection
< Exploitation : the usual favorite > < Exploration : a new restaurant >
< Exploitation & Exploration >

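Characteristics of reinforcement learning
● Delayed reward
- The immediate reward (reward) alone does not show whether an action was good, because the action also affects the rewards that come later.
- So the rewards that follow have to be taken into account as well !
Return !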

Return
S0 S1 S2 S3 S4 ST
R1
< Example episode >

Return
S0 S1 S2 S3 S4 ST
R1 R2 R3 R4 R5
< Example episode >

Return
● Return
- The sum of the rewards received from the state at time t until the final time step T
G_t = R_{t+1} + R_{t+2} + ... + R_T

Return
● Discounted Return
- The sum of rewards with a discount rate γ applied
G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
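
A minimal sketch of the discounted return G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... for a finite episode. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the original slides.

```python
def discounted_return(rewards, gamma):
    """Sum rewards R_{t+1}, R_{t+2}, ... with weights 1, gamma, gamma^2, ..."""
    g = 0.0
    # Work backwards through the episode: G = reward + gamma * G_next.
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


episode_rewards = [0.0, 0.0, 1.0]  # hypothetical episode: reward only at the last step
print(discounted_return(episode_rewards, gamma=1.0))  # undiscounted return
print(discounted_return(episode_rewards, gamma=0.9))  # later rewards count less
```

With gamma = 1 this reduces to the plain sum of rewards; a smaller gamma shrinks the contribution of rewards that arrive later.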

Policy
● Policy
- Decides which Action to take in which State.
○ Written as π(a∣s).
○ It is what reinforcement learning ultimately has to find.
○ The solution to the MDP problem ! → Finding the optimal policy !

Finding the optimal policy
- Reward obtained when following policy π :
- Optimal policy

Finding the optimal policy
How can we measure the goodness of a policy?
→ Value function !

Value function
● Value function
- Tells how good it is to be in a state when following Policy π
- Under the given policy (Policy), it gives the value of each state (or action).
- Value of a state : the expectation of the Return from state s.
- ????? ??? ??? ?? ? ? ?? !

Value function
● State value function
- The expected value of a state when following policy π : v_π(s) = E_π[ G_t | S_t = s ]
○ ?? : ??? ? ?? ?? ???

Value function
● State-action value function (Q function)
- The expected value of taking an action in a state when following policy π : q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

????? Value Function
??(Reward)? ??? ?? ??? ???
??? ??? ??? ??(Return)? ??? ?? ??? ???
????? ??? ?? ??? ???
Expected Return = Value Function

Value-based RL
Compute the value of every Action that can be taken in the current State,
then choose the Action with the highest value !
Value function
Greedy policy

Value-based RL
- Value : Q - value
action 1 → Q : 10
action 2 → Q : -5

Q Learning
● Learning the value function
- Q learning

FrozenLake-v0
[State]
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal
[Action]
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
S F F F
F H F H
F F F H
H F F G

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Q-table figure: four Q-values per cell; every entry is initialized to 0 >
Reward : 1
( assuming γ = 1 )

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Q-table figure: one entry, an action that steps into G, is now 1; every other entry is still 0 >
Reward : 1
( assuming γ = 1 )

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Q-table figure: the value 1 has propagated one step further back from G; two entries are now 1 >
Reward : 1
( assuming γ = 1 )

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Q-table figure: the value 1 has propagated back along a whole path; six entries are now 1 (two to the right, three downward, one to the right) >
Reward : 1
( assuming γ = 1 )
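
The backward propagation of Q-values shown on these slides can be reproduced with a short tabular Q-learning sketch. It uses the 4x4 map and the action numbering from the FrozenLake slide, but everything else is an assumption made for illustration: the environment is re-implemented here as a deterministic grid (no slipping, no external library), and the hyperparameters `alpha`, `gamma`, `epsilon`, the episode cap and the helper names `step`, `train`, `greedy_path` are hypothetical. A discount of 0.9 is used instead of the slides' γ = 1 so that shorter paths get strictly higher values.

```python
import random

GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]     # map from the slides
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # LEFT = 0, DOWN = 1, RIGHT = 2, UP = 3


def step(state, action):
    """Deterministic move; walking into a wall leaves the agent in place."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    cell = GRID[row][col]
    # Reward 1 only at the goal; goal and holes end the episode.
    return row * 4 + col, float(cell == "G"), cell in "GH"


def train(episodes=5000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(16)]  # Q-table: 16 states x 4 actions
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                action = rng.randrange(4)  # explore
            else:  # exploit, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in range(4) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=16):
    """Follow the highest-valued action from the start state."""
    state, path = 0, [0]
    for _ in range(max_steps):
        action = max(range(4), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
print(greedy_path(q_table))  # sequence of visited states, numbered row by row
```

States are numbered 0 to 15 row by row, so the goal G is state 15.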

Limitations of the greedy policy
- A greedy agent keeps following the first near optimal policy it finds and does not look for other paths.
S F F F
F F F H
F F F H
H F F G

ε - greedy Policy
- The value of ε sets the ratio between exploration and exploitation
< Exploitation : 60 % > < Exploration : 40 % >
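
A minimal sketch of ε-greedy action selection with ε = 0.4, matching the 60 % / 40 % split above. The function name `epsilon_greedy` and the two Q-values (10 and -5, borrowed from the earlier Value-based RL slide) are used here only for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation


rng = random.Random(0)
q_values = [10.0, -5.0]
picks = [epsilon_greedy(q_values, 0.4, rng) for _ in range(10_000)]
# Action 0 is chosen on every greedy step and on about half of the random steps.
print(picks.count(0) / len(picks))
```

Note that a random step can still land on the greedy action, so the greedy action is taken more often than 1 - ε of the time.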

ε - greedy Policy
S F F F
F F F H
F F F H
H F F G

What we have learned?
ref : “Near and Far DeepRL”
https://tykimos.github.io/warehouse/2018-2-7-ISS_Near_and_Far_DeepRL_4.pdf
● Tabular method
- Store the Q-value of each state, action pair in a table and update it

Limitations of the tabular method
?? : ? ??? : 84 x 84 x 3 ?? : Continuous
The state space is too large

Limitations of the tabular method
- Real-world problems → Large state space
- Similar states should produce similar outputs (Generalization !)
- Approximate the function with parameters. → Function Approximation
- How? Neural Net + Deep Learning !!

Function Approximation
Q Learning

Tabular vs Function Approximation
- Stores the Q-value of every state, action in a table and updates it.
- As the number of states and actions grows, space complexity becomes a serious problem
- Approximates the Q-value with parameters.
- ??? ?? ?????? ??.
● Tabular method
● Function Approximation method

- parameter : w
ref : David Silver
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf

- Function Approximation example
S F F F
F H F H
F F F H
H F F G
< Figure: the Current State s goes into the Function Approximator (w), which outputs Q(s, Left), Q(s, Right), Q(s, Up), Q(s, Down) >

Tabular vs Function Approximation
● What ' learning ' means
- Tabular : update the Q-value of each state, action directly.
- F.A. : update the parameters by gradient descent.
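
To make the contrast concrete, here is a small sketch of a gradient-descent update for a linear function approximator with parameters w. The linear form, the feature vector `x`, the fixed target and the learning rate are all assumptions chosen for illustration; DQN replaces the linear model with a neural network.

```python
def q_hat(w, x):
    """Approximate Q-value: dot product of parameters w and features x."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def gradient_step(w, x, target, lr):
    """One gradient-descent step on the squared error 0.5 * (target - q_hat)^2."""
    error = target - q_hat(w, x)
    # For a linear model the gradient of q_hat with respect to w is x.
    return [w_i + lr * error * x_i for w_i, x_i in zip(w, x)]


w = [0.0, 0.0]
x = [1.0, 0.5]  # hypothetical features of one (state, action) pair
for _ in range(200):
    w = gradient_step(w, x, target=2.0, lr=0.1)
print(q_hat(w, x))  # approaches the target value
```

Unlike a table entry, changing w also changes the estimate for every other input that shares features with x, which is where generalization comes from.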

DQN
https://www.youtube.com/watch?v=TmPfTpjtdgg
● DQN ?? ?? ??

DQN
https://www.nature.com/articles/nature14236
● The DQN Agent shows performance at or above Human-level

DQN
● Deep Q Network (2015)
- Convolutional Neural Network
- Experience Replay
- Fixed Target Network

DQN - CNN
- ???(pixel) ???? ?? ???? ??.
- ??? ?? state? ???? ?? ??? ?? ?? ???? ??.
- ex. ??? ?? : ?? ??, ???, … → ??? ?? ???
???? ?? : ?? ??, ?? ??, … → ???? ?? ???
● Convolution Neural Network

DQN - CNN
https://www.nature.com/articles/nature14236
Input : pixel values of the screen
Output : Q value for each action

DQN - Experience Replay
ref: “Reinforcement Learning from Basics to DQN” (Curt Park)
/CurtPark1/dqn-reinforcement-learning-from-basics-to-dqn
● Challenge 1 - Correlation between samples
- In reinforcement learning, samples are collected consecutively in time, so they are strongly correlated.
- Correlation between samples makes learning unstable.
● Experience replay
- Store each transition (S, A, R, S’) in a memory (buffer) and train on batches sampled from it.
- Reduces the correlation between data (transitions).
- Makes batch training possible.

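< Figure: the DQN Agent sends an Action to the Environment and receives a State and Reward. Each transition [S, A, R, S’] is stored in the Replay Buffer, which holds N transitions, and training samples from it in batches >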
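
A minimal replay buffer sketch along these lines, using only the Python standard library. The class name `ReplayBuffer`, its method names and the extra `done` flag are assumptions for illustration, not the authors' implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive transitions.
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Here the capacity of 3 is deliberately tiny so that eviction of the oldest transitions is visible.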

DQN - Target Network
● Challenge 2 - Non-stationary targets
- In the loss function, both the target and the current value depend on the parameters w.
- So every time w is updated, the target changes as well.
target = R + γ max_a' Q(S', a'; w)

DQN - Target Network
● Target network
- Compute the target value with a separate network that is updated only once every fixed number of steps
Main Network → copy every k steps → Target Network
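
The periodic copy can be sketched as follows. Parameters are represented as a plain dictionary for illustration; the class name `TargetSync`, the method `after_update` and the value of `sync_every` are hypothetical.

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the main parameters, refreshed every k updates."""

    def __init__(self, main_params, sync_every):
        self.main_params = main_params
        self.target_params = copy.deepcopy(main_params)
        self.sync_every = sync_every
        self.updates = 0

    def after_update(self):
        """Call once after each training step of the main network."""
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.target_params = copy.deepcopy(self.main_params)


main = {"w": [0.0]}
sync = TargetSync(main, sync_every=3)
history = []
for step_index in range(1, 7):
    main["w"][0] = float(step_index)  # stand-in for a gradient update
    sync.after_update()
    history.append(sync.target_params["w"][0])
print(history)  # the target value only changes on every third step
```

Between copies the target parameters stay fixed, so the regression target does not move while the main network is being updated.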

DQN - ?? ??
● Gradient Clipping
- When the absolute value of the gradient of the loss function exceeds 1, clip the gradient to 1
ref: wiki
https://en.wikipedia.org/wiki/Huber_loss
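
The Huber loss referenced above has exactly this property: it is quadratic for small errors and linear for large ones, so its gradient never exceeds the threshold in magnitude. A minimal sketch with threshold delta = 1 (the function names are illustrative):

```python
def huber(error, delta=1.0):
    """Quadratic inside [-delta, delta], linear outside."""
    if abs(error) <= delta:
        return 0.5 * error * error
    return delta * (abs(error) - 0.5 * delta)


def huber_grad(error, delta=1.0):
    """Derivative of the Huber loss with respect to the error, clipped to [-delta, delta]."""
    return max(-delta, min(delta, error))


for e in (0.5, 3.0, -3.0):
    print(e, huber(e), huber_grad(e))
```

For an error of 3 the squared loss would have gradient 3, while the Huber gradient is capped at 1.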

Rainbow DQN?
1. Deep Q Network
2. Double Q-learning
3. Prioritized Replay
4. Dueling Networks
5. Multi-step Learning
6. Distributional RL
7. Noisy Network

Rainbow DQN
● Rainbow outperforms the other algorithms

Problems with Q-learning
- Q-learning learns Q through a maximization operation.
- The maximization causes an overestimation problem.
- That is, Q-values are estimated to be higher than they actually are.

Double Q-learning
-0.1

Double Q-learning
Left : mean reward -0.1 < Right : reward 0

Double Q-learning
< Q-table >
A : Right 0, Left 0
B : action1 0

Double Q-learning
< Q-table > (mean reward -0.1, sampled reward 0.2)
A : Right 0, Left 0
B : action1 +0.2

Double Q-learning
< Q-table >
A : Right 0, Left +0.2
B : action1 +0.2

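Double Q-learning
Q-learning : Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) - Q(S, A) ]
Q-learning (rewritten) : Q(S, A) ← Q(S, A) + α [ R + γ Q(S', argmax_a Q(S', a)) - Q(S, A) ]
Double Q-learning ( Q → Q1, Q2 ) :
Q1(S, A) ← Q1(S, A) + α [ R + γ Q2(S', argmax_a Q1(S', a)) - Q1(S, A) ]
or
Q2(S, A) ← Q2(S, A) + α [ R + γ Q1(S', argmax_a Q2(S', a)) - Q2(S, A) ]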

Double Q-learning
10000? ??? ? ??
???

Double DQN
- The two Q estimators Q1, Q2 → DQN's main Q and target Q
- main Q : selects the action with the maximum Q-value.
- target Q : evaluates the value of the selected action.
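
A small sketch comparing the two targets for a single transition. The function names and the example Q-values are made up for illustration; `main_q_next` and `target_q_next` stand for the two networks' outputs at the next state.

```python
def dqn_target(reward, gamma, target_q_next, done):
    """DQN: the target network both selects and evaluates the next action."""
    if done:
        return reward
    return reward + gamma * max(target_q_next)


def double_dqn_target(reward, gamma, main_q_next, target_q_next, done):
    """Double DQN: main Q selects the action, target Q evaluates it."""
    if done:
        return reward
    best_action = max(range(len(main_q_next)), key=lambda a: main_q_next[a])
    return reward + gamma * target_q_next[best_action]


main_q_next = [1.0, 2.0, 0.5]    # hypothetical main-network outputs for S'
target_q_next = [0.3, 0.1, 0.9]  # hypothetical target-network outputs for S'
print(dqn_target(0.0, 0.5, target_q_next, done=False))
print(double_dqn_target(0.0, 0.5, main_q_next, target_q_next, done=False))
```

Because selection and evaluation come from different estimators, a value that one network overestimates by chance is less likely to be chosen and evaluated high at the same time.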

Prioritized Replay
" Replay Buffer? ??? ??? ! "

Prioritized Replay
- Are all experiences equally important?
- Wouldn't it be better to learn more often from the important experiences?

Prioritized Replay
Which experiences are the important ones to learn from?
→ TD - Error !

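Prioritized Replay
- TD error : δ = R + γ max_a' Q(S', a') - Q(S, A)
- Sampling probability : P(i) = p_i^α / Σ_k p_k^α , with priority p_i = |δ_i| + (small positive constant)
" The larger the TD error, the higher the priority ! "
α ∈ [0, 1]
α = 0 gives uniform sampling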

Prioritized Replay
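- Sampling according to priority changes the sampling distribution
- The distribution of the transitions that are actually sampled differs from the distribution of the transitions in the buffer. → This introduces bias
- So apply an Importance-sampling weight at update time !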
???? ???? PER ??? ???

Prioritized Replay
w_i = ( 1 / ( N · P(i) ) )^β
β ∈ [0, 1]
β = 0 applies no weight correction
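
A minimal sketch of proportional prioritization and the importance-sampling correction, following the formulas in Schaul et al.: P(i) = p_i^α / Σ_k p_k^α with p_i = |δ_i| plus a small constant, and w_i = (N · P(i))^(-β), normalized by the largest weight. The function names and the example TD errors are illustrative assumptions.

```python
def sampling_probabilities(td_errors, alpha, eps=1e-6):
    """P(i) proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta):
    """w_i = (N * P(i)) ** (-beta), scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_weight = max(weights)
    return [w / max_weight for w in weights]


td_errors = [1.0, 3.0]  # hypothetical TD errors of two stored transitions
for alpha, beta in [(0.0, 0.0), (1.0, 1.0)]:
    probs = sampling_probabilities(td_errors, alpha)
    print(alpha, beta, probs, importance_weights(probs, beta))
```

Transitions that are sampled more often than uniform receive a weight below 1, which offsets their over-representation in the gradient.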

Dueling Networks
?? State?? ?? Action? ?? ?? ??

Dueling Networks
??? ??? ??? ??? ?? ? ?? !
10? 20?
-10? -20?
?? : 0
+5? 3?
-2? -3?

Dueling Networks
??? (state value)
Value Advantage
Q?? ???? ??

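Dueling Networks
< Figure: a standard network maps S to Q(s, a1), Q(s, a2), Q(s, a3) >

Dueling Networks
< Figure: the dueling network maps S to V(s) and to A(s, a1), A(s, a2), A(s, a3), which are combined into Q >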

Dueling Networks
● Computing the Q-value in a Dueling Network
Sum : Q(s, a) = V(s) + A(s, a)
- But with a plain sum, the V and A values behind a given Q are not unique
- ex. for Q = 4, V + A can be (1, 3), (2, 2), (3, 1), and many other pairs

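Dueling Networks
Max : Q(s, a) = V(s) + ( A(s, a) - max_a' A(s, a') )
Average : Q(s, a) = V(s) + ( A(s, a) - (1 / |A|) Σ_a' A(s, a') )
Max keeps the meaning of V and A intact,
but replacing max with the average makes training more stable,
so V and A are combined with the average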
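
A minimal sketch of the average-based aggregation, Q(s, a) = V(s) + (A(s, a) - mean of A(s, ·)). The function name `dueling_q` and the example numbers are illustrative.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


q_a = dueling_q(2.0, [1.0, 3.0, 2.0])
q_b = dueling_q(2.0, [11.0, 13.0, 12.0])  # same advantages shifted by +10
print(q_a)
print(q_b)
```

Shifting every advantage by the same constant leaves Q unchanged, and the mean of the resulting Q-values equals V(s), which removes the ambiguity of the plain sum.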

Multi-Step Learning
S0 S1 S2 S3 S4 ST
R1
1 step : target = R1 + γ max_a Q(S1, a)
S0 S1 S2 S3 S4 ST
R1 R2 R3
3 step : target = R1 + γ R2 + γ^2 R3 + γ^3 max_a Q(S3, a)
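
A small sketch of the n-step target: the first n rewards are discounted and summed, and the estimate max_a Q(S_n, a) is bootstrapped at the end with weight γ^n. The function name `n_step_target` and the numbers are illustrative assumptions.

```python
def n_step_target(rewards, gamma, bootstrap_value):
    """R1 + gamma * R2 + ... + gamma^(n-1) * Rn + gamma^n * bootstrap_value."""
    target = bootstrap_value
    # Fold from the last reward back to the first.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


print(n_step_target([1.0], gamma=0.5, bootstrap_value=4.0))            # 1 step
print(n_step_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=8.0))  # 3 step
```

With n = 1 this is the ordinary Q-learning target; a larger n relies more on observed rewards and less on the current Q estimate.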

Distributional RL
ref : RLKorea blog, A Distributional Perspective on Reinforcement Learning
https://reinforcement-learning-kr.github.io/2018/10/02/C51/

Distributional RL
< Figure: two networks, each mapping a state S to Q >
" Represent the Return as a distribution. "

Distributional RL
General : Q(s, a) = E[ R(s, a) + γ Q(S', A') ]
Distributional : Z(s, a) = R(s, a) + γ Z(S', A') (equality in distribution)

Distributional RL
N
: Value distribution
- x-axis : atom (or support)
- y-axis : probability of each atom
● Value distribution

Distributional RL
● Q-value
- Use the expectation of the value distribution : Q(s, a) = Σ_i z_i p_i(s, a)

Distributional RL
● Loss function
- Train to reduce the difference between the target value distribution and the current value distribution
- KL-Divergence
: Target value distribution

Distributional RL
● Target value distribution
1. Compute the value distribution of the next state :
2. Compute the target atoms

Distributional RL
● Target value distribution
- x-axis : atom (or support)
- y-axis : probability of each atom
??? ...

Distributional RL
● Target value distribution - Projection
- The atoms of the target value distribution no longer line up with the atoms of the value distribution
- The target atoms are shifted and scaled by the reward and the discount rate
- Projection resolves this.
Atoms : 1 2 3 4 5 6 7
Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
R = 0.5, γ = 0.9

Distributional RL
Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
Target atom 3.2 (probability 0.5) lies between atoms 3 and 4 :
atom 3 receives 0.5 * (4 - 3.2) = 0.4
atom 4 receives 0.5 * (3.2 - 3) = 0.1
ref : RLKorea blog, A Distributional Perspective on Reinforcement Learning
https://reinforcement-learning-kr.github.io/2018/10/02/C51/

Distributional RL
Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
Atoms after Projection : 1 2 3 4 5 6 7
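
The projection step can be sketched in a few lines. Each target atom R + γ z is clipped to the support range and its probability is split between the two nearest fixed atoms in proportion to proximity. The function name `project` and the uniform next-state probabilities are assumptions for illustration; the support 1..7, R = 0.5 and γ = 0.9 follow the slide example.

```python
import math


def project(target_atoms, probs, support):
    """Redistribute the probability of each target atom onto an evenly spaced support."""
    v_min, v_max = support[0], support[-1]
    delta = (v_max - v_min) / (len(support) - 1)  # spacing between atoms
    projected = [0.0] * len(support)
    for z, p in zip(target_atoms, probs):
        z = min(max(z, v_min), v_max)  # clip to the support range
        b = (z - v_min) / delta        # fractional index of z on the support
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:             # z falls exactly on an atom
            projected[lower] += p
        else:                          # the closer atom receives the larger share
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


support = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
reward, gamma = 0.5, 0.9
target_atoms = [reward + gamma * z for z in support]
next_probs = [1.0 / 7.0] * 7  # hypothetical next-state distribution
print(project(target_atoms, next_probs, support))
print(project([3.2], [0.5], support))  # the single-atom example from the slide
```

The second call reproduces the slide's arithmetic: a mass of 0.5 at 3.2 puts 0.4 on atom 3 and 0.1 on atom 4.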

Distributional RL
● General
< Figure: the network maps S to Q(s, a1), Q(s, a2), Q(s, a3), Q(s, a4); output size = action size >

Distributional RL
● Distributional
< Figure: the network maps S to an output of size action size x atom size; taking the Expectation over the atoms gives Q(s, a1), Q(s, a2), Q(s, a3), Q(s, a4), of size action size >

Noisy Network
● Exploitation and Exploration
< Exploitation : the usual favorite > < Exploration : a new restaurant >
< Exploitation & Exploration >

Noisy Network
Is there a more efficient exploration method than ε - greedy ?

Noisy Network
ε - greedy Policy
Random perturbations of the policy
??, ??
Large-scale behavioral pattern???
????

Noisy Network
" Explore by adding noise to the Network. "
perturbations
State-dependent exploration becomes possible.

Noisy Network
< Figure: the noisy network maps S to Q(s, a1), Q(s, a2), Q(s, a3) >
y = ( μ_w + σ_w ⊙ ε_w ) x + μ_b + σ_b ⊙ ε_b
⊙ : element-wise multiplication

Noisy Network
● Ways to generate the Gaussian noise
1. Independent Gaussian noise
- Draw noise with the same size as the weight and the bias and apply it to each weight
- Number of random variables needed for the noise: (p x q) + q

Noisy Network
2. Factorised Gaussian noise
- Draw one noise vector for the input size (p) and one for the output size (q)
- Combine the two noise vectors into noise of size (p x q)
- Number of random variables needed for the noise: p + q
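
A minimal sketch of factorised Gaussian noise and a noisy linear layer, following the formulation in Fortunato et al.: each draw is passed through f(x) = sign(x) · sqrt(|x|), the (p x q) weight noise is the outer product of the input and output noise vectors, and the output noise doubles as the bias noise. The function names and the list-of-lists layout (p rows, q columns) are assumptions for illustration.

```python
import math
import random


def scale(x):
    """f(x) = sign(x) * sqrt(|x|), the scaling applied to each Gaussian draw."""
    return math.copysign(math.sqrt(abs(x)), x)


def factorised_noise(p, q, rng):
    """Build (p x q) weight noise and q bias noise from only p + q Gaussian draws."""
    eps_in = [scale(rng.gauss(0.0, 1.0)) for _ in range(p)]
    eps_out = [scale(rng.gauss(0.0, 1.0)) for _ in range(q)]
    weight_noise = [[e_in * e_out for e_out in eps_out] for e_in in eps_in]
    return weight_noise, eps_out


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b):
    """y_j = sum_i (mu_w[i][j] + sigma_w[i][j] * eps_w[i][j]) * x[i] + mu_b[j] + sigma_b[j] * eps_b[j]."""
    return [
        sum((mu_w[i][j] + sigma_w[i][j] * eps_w[i][j]) * x[i] for i in range(len(x)))
        + mu_b[j] + sigma_b[j] * eps_b[j]
        for j in range(len(mu_b))
    ]


rng = random.Random(0)
p, q = 3, 2
eps_w, eps_b = factorised_noise(p, q, rng)
mu_w = [[1.0] * q for _ in range(p)]
sigma_w = [[0.1] * q for _ in range(p)]
print(noisy_linear([1.0, 2.0, 3.0], mu_w, sigma_w, [0.5, -0.5], [0.1, 0.1], eps_w, eps_b))
```

Setting every sigma to zero recovers an ordinary linear layer, so the amount of exploration is something the network can learn to reduce.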

Reference
● Sutton, R. and Barto, A., Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
● V. Mnih et al., "Human-level control through deep reinforcement learning." Nature, 518 (7540):529–533, 2015.
● van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning." arXiv preprint arXiv:1509.06461, 2015.
● T. Schaul et al., "Prioritized Experience Replay." arXiv preprint arXiv:1511.05952, 2015.
● Z. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning." arXiv preprint arXiv:1511.06581, 2015.
● M. Fortunato et al., "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295, 2017.
● M. G. Bellemare et al., "A Distributional Perspective on Reinforcement Learning." arXiv preprint arXiv:1707.06887, 2017.
● M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning." arXiv preprint arXiv:1710.02298, 2017.
● “Near and Far DeepRL”, https://tykimos.github.io/warehouse/2018-2-7-ISS_Near_and_Far_DeepRL_4.pdf
● David Silver, “Lecture 6 in UCL Course on RL” , http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf
● RLKorea blog, “A Distributional Perspective on Reinforcement Learning”, https://reinforcement-learning-kr.github.io/2018/10/02/C51/

???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
