
????? ??? ??? ??
???? ?????? ??? ??? ??
?? ?? ?? ??????.

??
??? ???
City University of New York - Baruch College
Data Science ??
ConnexionAI ???
Freelancer Data Scientist
?????? ???? ???
Github:
https://github.com/wonseokjung
Facebook:
https://www.facebook.com/ws.jung.798
Blog:
https://wonseokjung.github.io/

1. Dynamic Programming
a. Policy iteration
b. Value iteration
2. Monte Carlo method
3. Temporal-Difference Learning
a. Sarsa
b. Q-learning
4. ??? ????? ??? ?? ? ????? ?? ??
5. DQN? ??? ???? ????? ???
??

a. Policy iteration
b. Value iteration
a. Sarsa
b. Q-learning
4. ????? ?? ?? ? ??? ????? ??? ??
5. DQN? ??? ???? ????? ???
Model-free
Model-based
Deeplearning?
+?
RL
??

a. Policy iteration
b. Value iteration
a. Sarsa
b. Q-learning
4. ????? ?? ?? ? ??? ????? ??? ??
5. DQN? ??? ???? ????? ???
Grid world
??

Before Deep Learning: Tabular
After Deep Learning: Image, text, voice…
?? ??? ??

Classic RL + DeepLearning = !
?? ????? ??? ??

? Level? State? ??? ??? General agent??
???? ???.
??? ?? ???

????? ??
University of California, Berkeley
ICML 2017
Curiosity-driven Exploration by
Self-supervised Prediction

https://github.com/wonseokjung/KIPS_Reinforcement
?????
Code + Jupyter Notebook ?? + ??? ?? ??
????? ???? ????? ?????!

Return of Episode
The return of an episode is the sum of its rewards.
Total Reward

Discounted Return
Sum of the rewards with the discount factor applied
Total Discounted Reward
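
The two returns above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function name `discounted_return` and the sample reward list are assumptions, while 0.9 is the discount factor quoted on the grid world slide.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Work backwards through the episode: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episode_rewards = [-1, -1, 1]  # hypothetical rewards from one episode
print(discounted_return(episode_rewards, gamma=1.0))  # plain total reward
print(discounted_return(episode_rewards, gamma=0.9))  # discounted return
```

Setting `gamma=1.0` recovers the undiscounted total reward, so one function covers both slides.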

MDP ??? 5 x 5 Grid world
Grid World Environment

MDP??? 5 x 5 Grid world
State : ???? ??
Action : up, down, left, right
Reward : ?? = -1, ?? = 1?
Transition Probability : 1
Discount factor : 0.9
Reward
+ 1
Reward
-1
State
Action

State-value function
(Policy? ?? state-value function)
State value

Action Value function
(Policy? ?? action-value function)
State-action value

Bellman equation !
A Fundamental property of value function
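
For reference, the Bellman equations for the two value functions defined on the previous slides can be written as follows. The notation is the standard one from Sutton and Barto, which the slides cite; the slide's own rendering of the formula did not survive extraction, so the symbols here are an assumption.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
           \bigl[ r + \gamma \, v_\pi(s') \bigr]

q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)
              \Bigl[ r + \gamma \sum_{a'} \pi(a' \mid s') \, q_\pi(s', a') \Bigr]
```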

Optimal Policy? ?? - state
Value? ??? !
Optimal state value function

Optimal Policy ? ?? - state action
Value? ??? !
Optimal state-action value function

Bellman equation + Optimality
Bellman optimality equation v*

Bellman equation + Optimality
Bellman optimality equation q*
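
In the standard notation of Sutton and Barto (an assumption, since the slide formulas were lost in extraction), the two Bellman optimality equations read:

```latex
v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)
         \bigl[ r + \gamma \, v_*(s') \bigr]

q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)
            \Bigl[ r + \gamma \max_{a'} q_*(s', a') \Bigr]
```

Compared with the Bellman equation for a fixed policy, the average over the policy's actions is replaced by a maximum over actions.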

??
MDP
Return of Episode
Return of Episode (discounted)
State-value function
Action-value function
Optimal Policy
Bellman Equation
Bellman optimality equation
Bellman Equation + Optimal Policy

??
Transition Probability
?? ????.

Dynamic Programming?
Value function? ???? ?? ?? Policy
? ???? ??? ??? ???? ??.
Dynamic programming? Key idea!

5 x 5 Grid world?? Dynamic Programming

5 x 5 Grid world
State : ???? ??
Action : up, down, left, right
Reward : ?? = -1, ?? = 1?
Reward
+ 1
Reward
-1
?? state
Action
?? state
?? state

Update Rule?
Bellman equation? ???? ??????.
State!

? ??? Optimal Value functions
State Value
Bellman optimality equations ??

? ??? Optimal Value functions
Action Value
Bellman optimality equations ??

Dynamic Programming?
? ??? ?? Value function
State-action Value function

Policy Iteration
Value Iteration
Dynamic Programming

Policy iteration
1.Policy? ?? state-value? ????!
Policy Evaluation
2. ? ?? Policy? ??!
Policy Improvement
??? Policy?
???? ???
??

Policy iteration- Policy Evaluation
Update Rule? ???? Evaluation? ??.
Value update
Policy Transition
Probability
Reward Next State?
estimated value

1. ?? state? V(s) = 0 ?? ??? ???.
2. ? state? Update Rule? ???? V(s)? ???? ??.
Policy iteration- Policy Evaluation
3. ?????? V(s)? ???? ?? ??? ????? ???.
Policy? ?? state-value? ????!
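
The evaluation steps can be sketched as below. This is an illustration under stated assumptions rather than the presenter's code: the two goal cells are placed at opposite corners, every move gives reward -1, the policy is uniformly random, there is no discounting, and the helper names `step` and `policy_evaluation` are hypothetical. The function runs a fixed number of sweeps instead of looping until the values stop changing, so that intermediate results can be inspected.

```python
import numpy as np

N = 5                                         # 5 x 5 grid
GOALS = {(0, 0), (N - 1, N - 1)}              # assumed goal positions
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, action):
    """Deterministic move; hitting a wall leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0                       # reward -1 for every move


def policy_evaluation(sweeps, gamma=1.0):
    """Run `sweeps` synchronous backups for the uniform random policy."""
    V = np.zeros((N, N))                      # start with V(s) = 0 everywhere
    for _ in range(sweeps):
        V_new = np.zeros_like(V)
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue                  # terminal states keep value 0
                total = 0.0
                for a in ACTIONS:             # expected one-step backup
                    (nr, nc), reward = step((r, c), a)
                    total += 0.25 * (reward + gamma * V[nr, nc])
                V_new[r, c] = total
        V = V_new
    return V


print(np.round(policy_evaluation(2), 2))
```

With these assumptions, two sweeps give -1.75 next to a goal cell and -2.0 elsewhere, which is consistent with the -1.7 and -2.0 entries on the k=2 slide of the later grid world example.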

Policy iteration- Improvement
Policy? ?? Value function? ??? ??? ? ??
Policy? ???????.
Greedy Policy

Policy iteration- Improvement
Greedy Policy ??

Policy iteration
Policy iteration? Optimal policy? ?????
Policy Evaluation? Policy Improvement? ????.
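
A compact sketch of the full loop, alternating evaluation and greedy improvement until the policy stops changing. The grid layout (goal cells in two opposite corners, reward -1 per move) and the helper names are assumptions; the discount factor 0.9 is the one given on the earlier grid world slide.

```python
import numpy as np

N, GAMMA = 5, 0.9
GOALS = {(0, 0), (N - 1, N - 1)}              # assumed goal positions
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def step(state, a):
    """Deterministic move for action index a; walls block movement."""
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0


def q_value(V, state, a):
    """One-step lookahead value of taking action a in state."""
    nxt, reward = step(state, a)
    return reward + GAMMA * V[nxt]


def evaluate(policy, tol=1e-10):
    """Policy evaluation: sweep until the largest change is below tol."""
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                v = q_value(V, (r, c), policy[r, c])
                delta = max(delta, abs(v - V[r, c]))
                V[r, c] = v
        if delta < tol:
            return V


def improve(policy, V):
    """Policy improvement: act greedily on V. Returns True if unchanged."""
    stable = True
    for r in range(N):
        for c in range(N):
            if (r, c) in GOALS:
                continue
            q = [q_value(V, (r, c), a) for a in range(len(ACTIONS))]
            best = int(np.argmax(q))
            # Switch only on a strict improvement so ties cannot oscillate.
            if q[best] > q[policy[r, c]] + 1e-9:
                policy[r, c] = best
                stable = False
    return stable


def policy_iteration():
    policy = np.zeros((N, N), dtype=int)      # start with "always up"
    while True:
        V = evaluate(policy)
        if improve(policy, V):
            return policy, V


policy, V = policy_iteration()
print(np.round(V, 2))
```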

5 x 5 Grid world

5 x 5 Grid world
State : ???? ??
Action : up, down, left, right
Reward : ?? ?????? -1
Grid World Environment - Policy iteration
Figure: 5 x 5 grid with two cells labeled Goal, an Action label, and each of the other 23 cells labeled Reward -1.

k=0 ?? (???)
Vk Greedy Policy
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0

Vk Greedy Policy (k=1)
Figure: two cells hold 0.0; the other 23 cells hold -1.0.

Vk Greedy Policy (k=2)
Figure: two cells hold 0.0, four cells hold -1.7, and the other 19 cells hold -2.0.

Vk Greedy Policy (k=inf)
Figure: two cells hold 0.0; the other 23 cells hold values between -90 and -100.

Grid World Environment - Value iteration
Vk Greedy Policy (k=????)
Figure: two cells hold 0.0; the other 23 cells hold values between -90 and -100.

Value Iteration
???? ???!
State, Action!
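
A minimal value iteration sketch for a 5 x 5 grid world. It applies the Bellman optimality backup directly instead of alternating evaluation and improvement. The goal positions, the -1 reward per move, and the helper names are assumptions; the discount factor 0.9 comes from the earlier grid world slide.

```python
import numpy as np

N, GAMMA = 5, 0.9
GOALS = {(0, 0), (N - 1, N - 1)}              # assumed goal positions
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
NAMES = ["up", "down", "left", "right"]


def step(state, action):
    """Deterministic move; walls block movement, every move costs -1."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0


def backups(V, state):
    """One-step lookahead value of every action from state."""
    values = []
    for a in ACTIONS:
        nxt, reward = step(state, a)
        values.append(reward + GAMMA * V[nxt])
    return values


def value_iteration(tol=1e-10):
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in GOALS:
                    continue
                best = max(backups(V, (r, c)))    # max over actions
                delta = max(delta, abs(best - V[r, c]))
                V[r, c] = best
        if delta < tol:
            return V


def greedy_action(V, state):
    """Name of the action that maximizes the one-step lookahead."""
    return NAMES[int(np.argmax(backups(V, state)))]


V = value_iteration()
print(np.round(V, 2))
print(greedy_action(V, (0, 1)))
```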

Model? ?????
??? ??? ??? ??? ????? ????.

Monte Carlo method? Dynamic programming??
?? ??? ?? ?? ?? ?? ??
??? ??? ?? ??? ????? ??.
Monte Carlo

??? ??? ?? ??? ??? ?? ?? environment?
??? ??? ??? ??? ?? optimal behavior? ???
???
Monte Carlo

Monte Carlo
Monte Carlo? episode-by-episode? ???? ??
????? ??? ???? terminal state ??
?? ???? ??.
Monte Carlo? ??? ?? return ? sample? ????
state-action value? ???? ??????.
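
The idea of averaging sampled returns after each finished episode can be sketched as follows. This is an every-visit variant chosen for brevity; the function name `mc_update`, the episode format, and the sample episode are assumptions, not material from the slides.

```python
from collections import defaultdict


def mc_update(Q, counts, episode, gamma=0.9):
    """Update action values from one finished episode.

    episode is a list of (state, action, reward) tuples in time order.
    """
    g = 0.0
    # Walk backwards so g is the return that followed each (state, action).
    for state, action, reward in reversed(episode):
        g = reward + gamma * g
        key = (state, action)
        counts[key] += 1
        # Incremental mean of all returns sampled for this pair so far.
        Q[key] += (g - Q[key]) / counts[key]


Q = defaultdict(float)
counts = defaultdict(int)

# Hypothetical episode that ends at a terminal state after two moves.
episode = [((0, 0), "right", -1.0), ((0, 1), "down", 1.0)]
mc_update(Q, counts, episode)
print(dict(Q))
```

No update happens until the episode reaches a terminal state, which is the property the TD methods later in the deck relax.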

Goal
Monte Carlo-GridWorld
??? ??? Update
Start

?? ????? ??? ? ?? ????? ??? ?
?? TD(temporal-difference) learning ? ???.
-Sutton
Temporal-Difference Learning

Monte Carlo + Dynamic programming
Monte Carlo ?? ?? ?? ??? ??? value? ????
DP?? ??? ?? ??? ???? value? estimate???? ??

?? state?? action? ???? ?? Reward ? ?? State? discount factor? ???
state value? estimate?? update??.
Monte Carlo??? Gt? ??? ???? ?? :
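
The two updates contrasted on this slide, written in the standard notation of Sutton and Barto. The step size \alpha is not named in the surviving slide text and is an assumption here.

```latex
% Monte Carlo: wait for the full return G_t
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ G_t - V(S_t) \bigr]

% TD(0): bootstrap from the next reward and the next state's estimate
V(S_t) \leftarrow V(S_t) + \alpha \bigl[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr]
```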

TD? ??
1. Monte Carlo? ??? ??? ??? ?? ??? ????
2. Dynamic programming?? ?? On-line ????.
3. ??? ???? ???, ???? update? ????? episode?
?? ??? ?? continue? model?? ???? ??.

TD? ????
Temporal-Difference Learning
Temporal-Difference Learning is divided into two algorithms, Sarsa and Q-learning.
Sarsa: On-policy
Q-learning: Off-policy

Sarsa
Q-learning
Temporal-Difference Learning

Sarsa
on-policy ??? ???? Sarsa
state-value function ?? action-value function? ??

Sarsa
?? time step ?? state? action? ?? ???? action value? estimate??.
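
A minimal sketch of the Sarsa update, which uses the state and action of the next time step to form the target. The function name, the step size `alpha`, and the sample transition are assumptions.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.9, done=False):
    """On-policy TD update: the target uses the action actually taken next."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
Q[((0, 1), "down")] = 1.0  # hypothetical existing estimate

# Transition: in (0, 0) move right, get reward 0, then choose "down".
sarsa_update(Q, (0, 0), "right", 0.0, (0, 1), "down", alpha=0.5)
print(Q[((0, 0), "right")])
```

Because `a_next` is the action the behavior policy really selects (including exploratory moves), the learned values describe that same policy, which is what makes Sarsa on-policy.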

Sarsa-gridworld
Goal
Start
At+1 St+1

Sarsa-gridworld
1.1
2.0
0.0
1.0
0.8 0.9 0.7 0.6 0.5
???? ?? ????? ??? ??.

Sarsa-gridworld
1.1
2.0
0.0
1.0
0.8 0.9 0.7 0.6 0.5
??? ??? ??? action-value? ??????.
Policy? On-policy
0.8
0.7
0.6
2.0 1.1 1.0 0.9

Q-learning
Q-learning??? ??? off-policy TD control?? ????? ???? ??? ???.
-(Watkins, 1989)
exploration? exploitation? ?? ??.

Q-learning-pseudo code
Off-policy
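
Since the pseudo code image did not survive extraction, here is a hedged Python sketch of the two pieces of tabular Q-learning: epsilon-greedy action selection and the off-policy update. Function names, the action labels, and the hyperparameter values are assumptions.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]


def epsilon_greedy(Q, s, epsilon, rng=random):
    """Behavior policy: explore with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next,
                      alpha=0.1, gamma=0.9, done=False):
    """Off-policy TD update: the target takes the max over next actions."""
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])


Q = defaultdict(float)
Q[((0, 1), "up")] = 0.2    # hypothetical existing estimates
Q[((0, 1), "down")] = 1.0

q_learning_update(Q, (0, 0), "right", -1.0, (0, 1), alpha=0.5)
print(Q[((0, 0), "right")])
print(epsilon_greedy(Q, (0, 1), epsilon=0.0))
```

The update never looks at which action the agent takes next; it always bootstraps from the greedy value, so the values learned can differ from the exploring policy that generated the data.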

Q-learning-gridworld
Goal
Start
Argmax
St+1

Q-learning- gridworld
1.1
2.0
0.0
1.0
0.8 0.9 0.7 0.6 0.5
???? ?? ????? ??? ??.

Q-learning- gridworld
1.1
2.0
0.0
1.0
0.8 0.9 0.7 0.6 0.5
??? ??? ??? action-value? ??????.
Policy? Off-policy
0.8
0.7
0.6
2.0 1.1 1.0 0.9

Deeplearning
https://goo.gl/images/VA89CC

Deeplearning?? ??
https://chaosmail.github.io/deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/
??? input?? ???? ????

Deep Reinforcement Learning
Deeplearning+Reinforcement Learning
https://goo.gl/images/oNu5Gr

Deepmind, DQN
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Deeplearning? ????? ????, ???? ???? ??? ????? ??

DQN, Keras(breakout)
????
1.Main
2.library
3.Function

DQN, Keras(breakout)
????
1.Main
2.library
3.DQN

1. ??? ????.
2. agent? ????.
3. score, episode, global_step ? ????.
4. ??? ?????? ??? ??
- ?? ???? ????.
- ??? ???? ????.
- ?? ?????? ?? action? ??? ????
?? ??.
- ?? ?? ???? ??? ??? ???.
- ????? ???? ??? ????? ???.
Main-1

* ??? ????? ?? ???? while?? ??
?.
-render? ??? ????. (render? ??? ?
? ?? ??? ????.
- ??? ??? ??? ????.
- ??? ??? ????.
- ??? state( history )? ???? action?
????.
Main-2

Main-3
- ??? action?? ??? ?????? ???? ?? ???
?, reward, done, info? ?? ???.
- ?? ?? ????? ?? ??? ???.
- history ?? ?? ??? ?? ??? state? ???
next_history? ??
- q_max? ??? ???? ??? ??? model ? ?? ?? Q
?? max? agent.avg_q_max? ???.
- ?? dead? ?? dead? True? ???, start_life? ??
????.

Main-4
- ?????? target model? update??.
- ??? ??? dead = false? ??? ???
next history ?? history? ???.
- ??? done?? ????? ????? ??
?? ????.
- ?? ?????? ??? ????.

Main-4
- ?? ??? ????? ??? ? ??? ???? ??? -1~1
? ??.
- s,a,r,s'? ???? ???? ????.
- ???? ???? ????? ???? ??? ????.

Main-4
- ?????? target model? update??.
- ??? ??? dead = false? ??? ???
next history ?? history? ???.
- ?? ?????? ??? ????.

Import-1
1.??? ?????? ????.
a.Keras
* CNN layer
* Dense layer
* optimizer
* ?????? ??? ??

Import-2
b. ??? ???
* input?? ???? ??? ?? ??
* RGB? Gray? ??? ?????
* replay memory???
c.Tensorflow
* tensorflow backend
* tensorflow
d. ??
* numpy
* random
* gym
* os

DQN-1
? - render ? ??
? - model? load ??
? - state ???
? - action ???
? - epsilon?
? - epsilon? ???? ?? ( decay? ?
? )
? - epsilon? decay step ??
?
???

DQN-2
? - ???? ????? ?? ????? ??
? - ??? ??? ?? ??
? - ?? ??? ???? ?? ??
? - discount factor
? - ??????? ???? ??
? - ????? action? ??? ???? ??
? - Deeplearning model
? - Target model
? - update target model
?
???

DQN-3
? - optimizer
? - Tensorboard
?
???-2
Save? ??? ???? ???? ??

DQN-4
Keras? ??? ?? ???
CNN Layers
Dense Layer

DQN-5
action? ???? ??(policy) : ?
??? Epsilon greedy
?? model? weight? ???? target
model? ???? ???? ?? ??
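
The two functions on this slide can be sketched without Keras, using NumPy only. The class and function names, the epsilon range 1.0 to 0.1, and the decay length are illustrative assumptions; in the actual agent the Q-values would come from the Keras model and the weights from its layers.

```python
import numpy as np


class EpsilonSchedule:
    """Linearly decay epsilon from start to end over decay_steps calls."""

    def __init__(self, start=1.0, end=0.1, decay_steps=1_000_000):
        self.epsilon = start
        self.end = end
        self.decay = (start - end) / decay_steps

    def step(self):
        self.epsilon = max(self.end, self.epsilon - self.decay)
        return self.epsilon


def get_action(q_values, epsilon, rng):
    """Epsilon-greedy policy over the Q-values predicted for one state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # random exploration
    return int(np.argmax(q_values))               # greedy action


def update_target(model_weights):
    """Copy the online model's weight arrays into the target model."""
    return [w.copy() for w in model_weights]


rng = np.random.default_rng(0)
schedule = EpsilonSchedule(decay_steps=10)
q_values = np.array([0.1, 0.7, 0.2])              # hypothetical model output
for _ in range(20):
    action = get_action(q_values, schedule.step(), rng)
print(schedule.epsilon, action)
```

Copying the weights, rather than sharing them, keeps the target network fixed between updates, which is the reason a separate target model exists.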

DQN-6
state, action, reward, next state? ???? ???? ????? ??
Replay Memory
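
A minimal replay memory sketch based on `collections.deque`. The class name, the default capacity, and the stored tuple layout (with an extra `done` flag) are assumptions.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done)."""

    def __init__(self, capacity=100_000):
        # A deque with maxlen silently drops the oldest sample when full.
        self.buffer = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):                    # hypothetical transitions
    memory.append(t, 0, 1.0, t + 1, False)
print(len(memory), memory.sample(2))
```

Sampling at random from the buffer breaks the correlation between consecutive frames, which is the usual motivation for training from replayed samples rather than from the latest transition.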

DQN-7
???? ????? ??? ??? ??? ???? ??
Replay memory-2

DQN-8
? optimizer
? Tensorboard
? ???? ? ?? ??? ???
? ??
???

DQN-10
Optimizer ??
???? Huber Loss??
https://goo.gl/images/XGsfYx
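
A NumPy sketch of the Huber loss, which is quadratic for small errors and linear for large ones. The function name and the threshold `delta=1.0` are assumptions; the slide does not state the value used.

```python
import numpy as np


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5 * e^2 if |e| <= delta, else linear in |e|."""
    error = np.abs(np.asarray(y_true, dtype=float)
                   - np.asarray(y_pred, dtype=float))
    quadratic = np.minimum(error, delta)   # part of the error inside delta
    linear = error - quadratic             # part of the error beyond delta
    return float(np.mean(0.5 * quadratic ** 2 + delta * linear))


print(huber_loss([1.0], [0.5]))   # small error, quadratic region
print(huber_loss([3.0], [1.0]))   # large error, linear region
```

Limiting the penalty on large errors keeps a single outlier target from dominating the gradient, which is why it is a common choice for DQN training.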

Human A.I.
Deeplearning+Reinforcement Learning
?? ?????? ??? ????!

?? ??? ??
??? ??? ?? ??? ???? ????? ???? ??.

??? ????
State ??, action? ???. ?
?
?? ????? ??????, Deeplearning model, hyper parameter?
???? ???? ??.

Emulator
Environment
Algorithm
Programming Language
????? ??? ??? ???

1. https://www.python.org/downloads/ - 3.5 version
2. https://www.anaconda.com/download/ -Anaconda
3. https://www.tensorflow.org/install/ - TensorFlow
4. https://keras.io/#installation -Keras
Programming Language - Python

Emulator http://www.fceux.com/web/home.html
Ubuntu
sudo apt-get update
sudo apt-get install fceux
MAC
https://brew.sh/ -homebrew website
Open Terminal -> brew install fceux
Emulator - FCEUX

Environment OpenAI_Gym
https://github.com/openai/gym
pip3 install gym
git clone https://github.com/openai/gym.git
cd gym
pip install -e .
OpenAI_Gym
OpenAI? Gym? ???? ?? ?? ???? ??? ????.

Environment
Baselines
https://github.com/openai/baselines
pip3 install baselines
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
OpenAI_Baselines

Environment
Philip Paquette
https://github.com/ppaquette/gym-super-mario
pip3 install gym-pull
import gym
import gym_pull
gym_pull.pull('github.com/ppaquette/gym-super-mario')
env = gym.make('ppaquette/SuperMarioBros-1-1-v0')
SuperMario

Algorithm
DEEP Q-NETWORK
Algorithm-DQN

??? ??? ?????
https://github.com/wonseokjung/KIPS_Reinforcement/tree/master/DQN
?? ???? ???? ?? github? ??? ??? ???????.

?????? ??? ????
State : ???? ??
Action : up, down, left, right
Reward : ?? = -1, ?? = 1?
Reward
+ 1
Reward
-1
State
Action
?????? ???????

Goal? ??
Goal
Start
?????? ??? ??? ???????? ??? goal state? ???

???????? ??
State : ??
Action : ?, ? , ?, ?,??,???, action? ??
Reward : ??? ???? Reward +1, ???? -1?
State
Action
???? ??? ??? ??? ?? reward? ???.

??
1. ???? ??? ???? ???? ?? ??
2. State? breakout?? ? ???? action? ??.

Reward ??
?????? ??? -
??? ????? -
???? ???? -
??? ????? +
??? ???? +
Penalty, Bonus reward??

Deeplearning model
VGG model and regular ??
https://goo.gl/images/eoXooC
https://goo.gl/images/s8XrCK
? ?? ????

????(reinforcement learning) ?? ?? ? Unity ml-agent? ???? ?? ??? ???? ???? ??
---
Github:
Facebook:
Blog:
https://wonseokjung.github.io/??!!

?? 1 ? ??? ?? ???? ?? ?..
?? ???? ??? ?? ????. Overfitting? ????
https://goo.gl/images/6uDmqH

????? ??? ??? ??? !
Reward Exploration Algorithm

?????.
Github:
Facebook:
Blog:
https://wonseokjung.github.io/

References:
* Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Second Edition (in progress). MIT Press, Cambridge, MA, 2017.
* https://github.com/rlcode/reinforcement-learning-kr
