????? ??? ??
? ?
03.30
????? ??? ??? ??
???? ?????? ??? ??? ??
?? ?? ?? ??????.
??
??? ???
City University of New York -Baruch College
Data Science ??
ConnexionAI ???
Freelancer Data Scientist
?????? ???? ???
Github:
https://github.com/wonseokjung
Facebook:
https://www.facebook.com/ws.jung.798
Blog:
https://wonseokjung.github.io/
1. Dynamic Programming
a. Policy iteration
b. Value iteration
2. Monte Carlo method
3. Temporal-Difference Learning
a. Sarsa
b. Q-learning
4. ??? ????? ??? ?? ? ????? ?? ??
5. DQN? ??? ???? ????? ???
??
1. Dynamic Programming
a. Policy iteration
b. Value iteration
2. Monte Carlo method
3. Temporal-Difference Learning
a. Sarsa
b. Q-learning
4. ????? ?? ?? ? ??? ????? ??? ??
5. DQN? ??? ???? ????? ???
Model-free
Model-based
Deeplearning?
+?
RL
??
1. Dynamic Programming
a. Policy iteration
b. Value iteration
2. Monte Carlo method
3. Temporal-Difference Learning
a. Sarsa
b. Q-learning
4. ????? ?? ?? ? ??? ????? ??? ??
5. DQN? ??? ???? ????? ???
Grid world
??
Before Deeplearning: Tabular
After Deeplearning: Image, text, voice
?? ??? ??
Classic RL + DeepLearning = !
?? ????? ??? ??
Q-learning + CNN -> DQN
DQN
? Level? State? ??? ??? General agent??
???? ???.
??? ?? ???
????? ??
University of California, Berkeley
ICML 2017
Curiosity-driven Exploration by
Self-supervised Prediction
https://github.com/wonseokjung/KIPS_Reinforcement
?????
Code + Jupyter Notebook ?? + ??? ?? ??
????? ???? ????? ?????!
Markov Decision Process
Return of Episode
The Return of an episode is the sum of its Rewards
Total Reward
Discounted Return
Sum of Rewards with the discount factor applied
Total Reward with Discounted
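As a small illustration of the two returns above, the sketch below computes the total reward and the discounted return of one episode. The function name `discounted_return` and the sample reward list are made up for this example; 0.9 is the discount factor the slides use for the grid world.

```python
def discounted_return(rewards, gamma=1.0):
    """Return r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: three moves with reward -1, then +1 at the end.
rewards = [-1, -1, -1, 1]
total_reward = discounted_return(rewards)             # gamma = 1: plain sum
discounted = discounted_return(rewards, gamma=0.9)    # later rewards count less
```

With gamma = 1 the function reduces to the total reward; any gamma below 1 shrinks the weight of rewards that arrive later in the episode.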
MDP ??? 5 x 5 Grid world
Grid World Environment
MDP??? 5 x 5 Grid world
State : ???? ??
Action : ?, ? , ?, ?
Reward : ?? = -1, ?? = 1?
Transition Probability : 1
Discount factor : 0.9
Reward
+ 1
Reward
-1
State
Action
Grid World Environment
State-value function
(state-value function for a policy)
State value
Action-value function
(action-value function for a policy)
State-action value
Bellman equation !
A Fundamental property of value function
Optimal Policy? ?? - state
Value? ??? !
Optimal state value function
Optimal Policy ? ?? - state action
Value? ??? !
Optimal state-action value function
Bellman equation + Optimality
Bellman optimality equation v*
Bellman equation + Optimality
Bellman optimality equation q*
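For reference, the four equations named on these slides are usually written as follows in the notation of Sutton and Barto (the book listed in the references). The dynamics function p(s', r | s, a) and the symbol names are taken from that book and are an assumption about what the slides showed.

```latex
\begin{align*}
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr] \\
q_\pi(s, a) &= \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\Bigr] \\
v_*(s) &= \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr] \\
q_*(s, a) &= \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
\end{align*}
```

The first two lines are the Bellman equations for a fixed policy; the last two replace the average over the policy with a max, which gives the Bellman optimality equations for v* and q*.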
??
MDP
Return of Episode
Return of Episode (discounted)
State-value function
Action-value function
Optimal Policy
Bellman equation
Bellman optimality equation
Bellman Equation + Optimal Policy
Dynamic Programming
??
State, Reward, Action
??
Transition Probability
?? ????.
Dynamic Programming?
Value function? ???? ?? ?? Policy
? ???? ??? ??? ???? ??.
Dynamic programming? Key idea!
Dynamic programming
5 x 5 Grid world?? Dynamic Programming
Grid World Environment
5 x 5 Grid world
State : ???? ??
Action : ?, ? , ?, ?
Reward : ?? = -1, ?? = 1?
Transition Probability : 1
Discount factor : 0.9
Reward
+ 1
Reward
-1
?? state
Action
Grid World Environment
?? state
?? state
Update Rule?
Bellman equation? ???? ??????.
State!
? ??? Optimal Value functions
State Value
Bellman optimality equations ??
? ??? Optimal Value functions
Action Value
Bellman optimality equations ??
Dynamic Programming?
? ??? ?? Value function
State-action Value function
Policy Iteration
Value Iteration
Dynamic Programming
Policy iteration
1.Policy? ?? state-value? ????!
Policy Evaluation
2. ? ?? Policy? ??!
Policy Improvement
??? Policy?
???? ???
??
Policy iteration- Policy Evaluation
Update Rule? ???? Evaluation? ??.
Value update
Policy Transition
Probability
Reward Next State?
estimated value
1. Initialize V(s) = 0 for every state.
2. Apply the Update Rule to each state to update V(s).
Policy iteration- Policy Evaluation
3. ?????? V(s)? ???? ?? ??? ????? ???.
Policy? ?? state-value? ????!
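The evaluation steps can be sketched roughly as below for a 5 x 5 grid world with deterministic moves, a reward of -1 per move and discount factor 0.9. The positions of the two goal cells (opposite corners), the uniform random policy and all function names are assumptions made for this sketch.

```python
N = 5                                   # 5 x 5 grid world
GAMMA = 0.9                             # discount factor from the slides
GOALS = {(0, 0), (N - 1, N - 1)}        # assumption: goals in two opposite corners
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STATES = [(r, c) for r in range(N) for c in range(N)]

def step(state, action):
    """Deterministic move (transition probability 1); walls keep the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0                 # reward of -1 on every move

def policy_evaluation_sweep(V):
    """One synchronous sweep of the update rule under a uniform random policy."""
    new_V = {}
    for s in STATES:
        if s in GOALS:
            new_V[s] = 0.0              # terminal states keep value 0
            continue
        new_V[s] = sum(0.25 * (reward + GAMMA * V[nxt])
                       for nxt, reward in (step(s, a) for a in ACTIONS))
    return new_V

V0 = {s: 0.0 for s in STATES}           # V(s) = 0 for every state
V1 = policy_evaluation_sweep(V0)        # k = 1
V2 = policy_evaluation_sweep(V1)        # k = 2
```

Repeating the sweep until the values stop changing gives the state values of the policy being evaluated.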
Policy iteration- Improvement
Policy? ?? Value function? ??? ??? ? ??
Policy? ???????.
Greedy Policy
Policy iteration- Improvement
Greedy Policy ??
Policy iteration
Policy iteration? Optimal policy? ?????
Policy Evaluation? Policy Improvement? ????.
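A minimal sketch of the full loop, alternating Policy Evaluation and greedy Policy Improvement until the policy stops changing. The grid layout (goals in two opposite corners), the starting policy and the helper names are assumptions for illustration, not the presenter's code.

```python
N, GAMMA = 5, 0.9
GOALS = {(0, 0), (N - 1, N - 1)}        # assumption: goal positions
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STATES = [(r, c) for r in range(N) for c in range(N) if (r, c) not in GOALS]

def step(state, action):
    """Deterministic move with reward -1; walls keep the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0

def q_values(V, state):
    """One-step lookahead r + gamma * V(s') per action; goal cells count as 0."""
    return [reward + GAMMA * V.get(nxt, 0.0)
            for nxt, reward in (step(state, a) for a in ACTIONS)]

def evaluate(policy, theta=1e-6):
    """Policy Evaluation: sweep until no value changes by more than theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v = q_values(V, s)[policy[s]]
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def improve(V):
    """Policy Improvement: pick the greedy action in every state."""
    improved = {}
    for s in STATES:
        q = q_values(V, s)
        improved[s] = q.index(max(q))
    return improved

policy = {s: 0 for s in STATES}         # arbitrary start: always move up
while True:
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:            # stable policy: stop
        break
    policy = new_policy
```

When the loop stops, `policy` is greedy with respect to its own value function `V`, which is the condition policy iteration looks for.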
Grid World Environment
5 x 5 Grid world
5 x 5 Grid world
State : ???? ??
Action : ?, ? , ?, ?
Reward : ?? ?????? -1
Transition Probability : 1
Discount factor : 0.9
Grid World Environment - Policy iteration
[Figure: 5 x 5 grid world with two Goal cells; each of the other 23 cells is labeled Reward -1, and an arrow marks an Action]
Grid World Environment - Policy iteration
k=0 ?? (???)
[Figure: Vk and Greedy Policy at k=0: all 25 cells are 0.0]
k=1
[Figure: Vk and Greedy Policy at k=1: the two goal cells are 0.0, the other 23 cells are -1.0]
Grid World Environment - Policy iteration
k=2
[Figure: Vk and Greedy Policy at k=2: the two goal cells are 0.0, four cells are -1.7, the remaining 19 cells are -2.0]
Grid World Environment - Policy iteration
k=inf
[Figure: Vk and Greedy Policy at k=inf: the two goal cells are 0.0, the other cells range from -90 to -100]
Policy iteration
??
Policy Iteration
Value Iteration
Dynamic Programming
Grid World Environment - Value iteration
k=????
[Figure: Vk and Greedy Policy: the two goal cells are 0.0, the other cells range from -90 to -100]
Value Iteration
???? ???!
State, Action!
Value iteration
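A rough sketch of value iteration on the same kind of 5 x 5 grid world: each sweep applies the Bellman optimality update (a max over actions) directly to the state values. The goal positions and helper names are assumptions for illustration.

```python
N, GAMMA = 5, 0.9
GOALS = {(0, 0), (N - 1, N - 1)}        # assumption: goal positions
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
STATES = [(r, c) for r in range(N) for c in range(N)]

def step(state, action):
    """Deterministic move with reward -1; walls keep the agent in place."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < N and 0 <= c < N):
        r, c = state
    return (r, c), -1.0

def value_iteration(theta=1e-6):
    """Sweep V(s) <- max_a [r + gamma * V(s')] until the largest change is below theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in GOALS:
                continue                # terminal states stay at 0
            best = max(reward + GAMMA * V[nxt]
                       for nxt, reward in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

V_star = value_iteration()
```

Unlike policy iteration, no explicit policy is stored during the sweeps; a greedy policy can be read off `V_star` afterwards.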
??
Model? ?????
??? ??? ??? ??? ????? ????.
Monte Carlo method
Monte Carlo method? Dynamic programming??
?? ??? ?? ?? ?? ?? ??
??? ??? ?? ??? ????? ??.
Monte Carlo
??? ??? ?? ??? ??? ?? ?? environment?
??? ??? ??? ??? ?? optimal behavior? ???
???
Monte Carlo
Monte Carlo
Monte Carlo? episode-by-episode? ???? ??
????? ??? ???? terminal state ??
?? ???? ??.
Monte Carlo? ??? ?? return ? sample? ????
state-action value? ???? ??????.
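One way to picture estimating state-action values from sampled returns is a first-visit Monte Carlo average over complete episodes. This is a sketch only: the first-visit variant, the function name and the two sample episodes are assumptions, not taken from the slides.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Average the return that follows the first visit to each (state, action) pair."""
    returns = defaultdict(list)
    for episode in episodes:            # episode: list of (state, action, reward)
        g = 0.0
        first_return = {}
        # Walk backwards from the terminal step; earlier visits overwrite later ones,
        # so what remains is the return after the FIRST visit.
        for state, action, reward in reversed(episode):
            g = reward + gamma * g
            first_return[(state, action)] = g
        for pair, g_first in first_return.items():
            returns[pair].append(g_first)
    return {pair: sum(gs) / len(gs) for pair, gs in returns.items()}

# Two hypothetical episodes, each ending in a terminal state.
episodes = [
    [("s0", "right", -1), ("s1", "right", 1)],
    [("s0", "right", -1), ("s1", "left", -1), ("s0", "right", -1), ("s1", "right", 1)],
]
Q = first_visit_mc(episodes)
```

Nothing is updated until an episode has finished, which is the episode-by-episode behavior described above.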
Goal
Monte Carlo-GridWorld
??? ??? Update
Start
Monte Carlo
??
Temporal-Difference Learning
?? ????? ??? ? ?? ????? ??? ?
?? TD(temporal-difference) learning ? ???.
-Sutton
Temporal-Difference Learning
Temporal-Difference Learning
Monte Carlo + Dynamic programming
Monte Carlo ?? ?? ?? ??? ??? value? ????
DP?? ??? ?? ??? ???? value? estimate???? ??
Temporal-Difference Learning
?? state?? action? ???? ?? Reward ? ?? State? discount factor? ???
state value? estimate?? update??.
Monte Carlo??? Gt? ??? ???? ?? :
TD? ??
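In Sutton and Barto's notation the two updates being contrasted are usually written as below. The step size α is assumed from the book rather than from the slide text.

```latex
\begin{align*}
\text{Monte Carlo:}\quad V(S_t) &\leftarrow V(S_t) + \alpha\,\bigl[G_t - V(S_t)\bigr] \\
\text{TD(0):}\quad V(S_t) &\leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)\bigr]
\end{align*}
```

Monte Carlo has to wait for the full return G_t at the end of the episode, while TD replaces it with the one-step estimate R_{t+1} + γ V(S_{t+1}) and can update after every step.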
1. Monte Carlo? ??? ??? ??? ?? ??? ????
2. Dynamic programming?? ?? On-line ????.
3. ??? ???? ???, ???? update? ????? episode?
?? ??? ?? continue? model?? ???? ??.
TD? ????
Temporal-Difference Learning
Sarsa Q-learning
Temporal-Difference Learning? Sarsa? Q-learning? ?? ????? ???.
On-policy Off-policy
Sarsa
Q-learning
Temporal-Difference Learning
Sarsa
on-policy ??? ???? Sarsa
state-value function ?? action-value function? ??
Sarsa
?? time step ?? state? action? ?? ???? action value? estimate??.
Sarsa-pseudo code
Sarsa-pseudo code
On-policy
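A minimal sketch of a single Sarsa update with an epsilon-greedy behavior policy. The step size alpha = 0.1, the state names and the Q values are hypothetical; only the discount factor 0.9 comes from the slides.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def epsilon_greedy(Q, state, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the highest-valued action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy target: uses a_next, the action the agent will actually take next."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)
Q[("B", "right")] = 2.0                 # hypothetical action values
Q[("B", "up")] = 1.0
# The agent moved right from A, got reward -1, and its policy chose "up" in B.
sarsa_update(Q, "A", "right", -1, "B", "up")
```

Because the target uses the sampled next action, exploratory moves feed into the estimate, which is what makes Sarsa on-policy.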
Sarsa-gridworld
Goal
StartAt+1 St+1
Sarsa-gridworld
1.1
2.0
0.0
1.0
0.80.9 0.7 0.6 0.5
???? ?? ????? ??? ??.
Sarsa-gridworld
1.1
2.0
0.0
1.0
0.80.9 0.7 0.6 0.5
??? ??? ??? action-value? ??????.
Policy? On-policy
0.8
0.7
0.6
2.0 1.1 1.0 0.9
Sarsa
??
Sarsa
Q-learning
Temporal-Difference Learning
Q-learning
Q-learning??? ??? off-policy TD control?? ????? ???? ??? ???.
-(Watkins, 1989)
exploration? exploitation? ?? ??.
Q-learning-pseudo code
Off-policy
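A matching sketch of a single Q-learning update. As with the Sarsa sketch, alpha = 0.1, the state names and the Q values are hypothetical.

```python
from collections import defaultdict

ACTIONS = ["up", "down", "left", "right"]

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy target: uses the max action value in s_next, whatever is done next."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
Q[("B", "right")] = 2.0                 # hypothetical action values
Q[("B", "up")] = 1.0
# The agent moved right from A and got reward -1; the next action is not needed.
q_learning_update(Q, "A", "right", -1, "B")
```

The only difference from Sarsa is the target: the argmax over next actions replaces the action actually sampled, so the behavior policy can keep exploring while the estimate follows the greedy policy.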
qlearning-gridworld
Goal
Start
Argmax
St+1
Q-learning- gridworld
1.1
2.0
0.0
1.0
0.80.9 0.7 0.6 0.5
???? ?? ????? ??? ??.
Q-learning- gridworld
1.1
2.0
0.0
1.0
0.80.9 0.7 0.6 0.5
??? ??? ??? action-value? ??????.
Policy? Off-policy
0.8
0.7
0.6
2.0 1.1 1.0 0.9
Q-learning
??
??? ????? ??? ?? ????? ?? ??
Deeplearning
https://goo.gl/images/VA89CC
Deeplearning?? ??
https://chaosmail.github.io/deeplearning/2016/10/22/intro-to-deep-learning-for-computer-vision/
??? input?? ???? ????
Deep Reinforcement Learning
Deeplearning+Reinforcement Learning
https://goo.gl/images/oNu5Gr
Deepmind, DQN
https://www.youtube.com/watch?v=V1eYniJ0Rnk
Deeplearning? ????? ????, ???? ???? ??? ????? ??
DQN, Keras(breakout)
????
1.Main
2.library
3.Function
DQN, Keras(breakout)
????
1.Main
2.library
3.DQN
1. ??? ????.
2. agent? ????.
3. score, episode, global_step ? ????.
4. ??? ?????? ??? ??
- ?? ???? ????.
- ??? ???? ????.
- ?? ?????? ?? action? ??? ????
?? ??.
- ?? ?? ???? ??? ??? ???.
- ????? ???? ??? ????? ???.
Main-1
* ??? ????? ?? ???? while?? ??
?.
-render? ??? ????. (render? ??? ?
? ?? ??? ????.
- ??? ??? ??? ????.
- ??? ??? ????.
- ??? state( history )? ???? action?
????.
Main-2
Main-3
- ??? action?? ??? ?????? ???? ?? ???
?, reward, done, info? ?? ???.
- ?? ?? ????? ?? ??? ???.
- history ?? ?? ??? ?? ??? state? ???
next_history? ??
- q_max? ??? ???? ??? ??? model ? ?? ?? Q
?? max? agent.avg_q_max? ???.
- ?? dead? ?? dead? True? ???, start_life? ??
????.
Main-4
- ?????? target model? update??.
- ??? ??? dead = false? ??? ???
next history ?? history? ???.
- ??? done?? ????? ????? ??
?? ????.
- ?? ?????? ??? ????.
DQN, Keras(breakout)
????
1.Main
2.library
3.DQN
Main-4
- ?? ??? ????? ??? ? ??? ???? ??? -1~1
? ??.
- s,a,r,s'? ???? ???? ????.
- ???? ???? ????? ???? ??? ????.
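Keeping the reward in the -1~1 range can be sketched with a one-line clip. The function name `clip_reward` is made up for this example.

```python
def clip_reward(reward, low=-1.0, high=1.0):
    """Limit a raw game reward to the range [low, high]."""
    return max(low, min(high, reward))

clipped = [clip_reward(r) for r in (7.0, 0.0, -3.0, 0.5)]
```

Clipping keeps the scale of the targets similar from game to game, at the cost of no longer distinguishing small rewards from large ones.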
Import-1
1.??? ?????? ????.
a.Keras
* CNN layer
* Dense layer
* optimizer
* ?????? ??? ??
Import-2
b. ??? ???
* input?? ???? ??? ?? ??
* RGB? Gray? ??? ?????
* replay memory???
c.Tensorflow
* tensorflow backend
* tensorflow
d. ??
* numpy
* random
* gym
* os
DQN, Keras(breakout)
????
1.Main
2.library
3.DQN
DQN-1
? - render ? ??
? - model? load ??
? - state ???
? - action ???
? - epsilon?
? - epsilon? ???? ?? ( decay? ?
? )
? - epsilon? decay step ??
?
???
DQN-2
? - ???? ????? ?? ????? ??
? - ??? ??? ?? ??
? - ?? ??? ???? ?? ??
? - discount factor
? - ??????? ???? ??
? - ????? action? ??? ???? ??
? - Deeplearning model
? - Target model
? - update target model
?
???
DQN-3
? - optimizer
? - Tensorboard
?
???-2
Save? ??? ???? ???? ??
DQN-4
Keras? ??? ?? ???
CNN Layers
Dense Layer
DQN-5
action? ???? ??(policy) : ?
??? Epsilon greedy
?? model? weight? ???? target
model? ???? ???? ?? ??
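Epsilon-greedy action selection with a decaying epsilon can be sketched in plain Python as below. The Q values are passed in as a list so the example runs without Keras; the function names and the linear decay schedule are assumptions for illustration.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Random action index with probability epsilon, otherwise the argmax of q_values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def decay_epsilon(epsilon, epsilon_end, decay_step):
    """Linear decay towards epsilon_end (hypothetical schedule)."""
    return max(epsilon_end, epsilon - decay_step)

greedy = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=0.0)   # always the argmax
explore = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=1.0)  # always random
```

For the target model, copying weights in Keras is commonly done with `target_model.set_weights(model.get_weights())`; whether the original code uses exactly this call is not visible in the slide text.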
DQN-6
state, action, reward, next state? ???? ???? ????? ??
Replay Memory
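A replay memory can be sketched as a bounded queue of (state, action, reward, next state) samples with uniform random sampling. The class name, method names and capacity are assumptions for this sketch.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) samples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)    # oldest samples are dropped first

    def append(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a uniformly random mini-batch without replacement."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

memory = ReplayMemory(capacity=3)
for i in range(5):                              # hypothetical transitions
    memory.append(i, 0, -1.0, i + 1)
batch = memory.sample(2)
```

Sampling at random from the buffer breaks the correlation between consecutive frames, which is the usual motivation for training DQN from a replay memory.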
DQN-7
???? ????? ??? ??? ??? ???? ??
Replay memory-2
DQN-8
? optimizer
? Tensorboard
? ???? ? ?? ??? ???
? ??
???
DQN-9
???? ???? ?? ??
?
???
DQN-10
Optimizer ??
???? Huber Loss??
https://goo.gl/images/XGsfYx
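For a single TD error, the Huber loss can be sketched as below: quadratic near zero and linear for large errors. The threshold delta = 1.0 is an assumption, not a value read from the slides.

```python
def huber_loss(error, delta=1.0):
    """0.5 * error^2 when |error| <= delta, otherwise delta * (|error| - 0.5 * delta)."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error ** 2
    return delta * (abs_err - 0.5 * delta)

small = huber_loss(0.5)     # quadratic region
large = huber_loss(3.0)     # linear region
```

Compared with squared error, large errors contribute a bounded gradient, which tends to keep the updates more stable.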
DQN, Keras(breakout)
??
????? ?? ??
Human A.I.
Deeplearning+Reinforcement Learning
?? ?????? ??? ????!
?? ??? ??
??? ??? ?? ??? ???? ????? ???? ??.
??? ????
State ??, action? ???. ?
?
?? ????? ??????, Deeplearning model, hyper parameter?
???? ???? ??.
Emulator
Environment
Algorithm
Programming Language
????? ??? ??? ???
1. https://www.python.org/downloads/ - 3.5 version
2. https://www.anaconda.com/download/ -Anaconda
3. https://www.tensorflow.org/install/ - TensorFlow
4. https://keras.io/#installation -Keras
Programming Language - Python
Emulator http://www.fceux.com/web/home.html
Ubuntu
sudo apt-get update
sudo apt-get install fceux
MAC
https://brew.sh/ - Homebrew website
Open Terminal -> brew install fceux
Emulator - FCEUX
Environment OpenAI_Gym
https://github.com/openai/gym
pip3 install gym
git clone https://github.com/openai/gym.git
cd gym
pip install -e .
OpenAI_Gym
OpenAI? Gym? ???? ?? ?? ???? ??? ????.
Environment
Baselines
https://github.com/openai/baselines
pip3 install baselines
git clone https://github.com/openai/baselines.git
cd baselines
pip install -e .
OpenAI_Baselines
Environment
Philip Paquette
https://github.com/ppaquette/gym-super-mario
pip3 install gym-pull
import gym
import gym_pull
gym_pull.pull('github.com/ppaquette/gym-super-mario')
env = gym.make('ppaquette/SuperMarioBros-1-1-v0')
SuperMario
Algorithm
DEEP Q-NETWORK
Algorithm-DQN
??? ??? ?????
https://github.com/wonseokjung/KIPS_Reinforcement/tree/master/DQN
?? ???? ???? ?? github? ??? ??? ???????.
DQN? ??? ???? ????? ???
?????? ??? ????
State : ???? ??
Action : ?, ? , ?, ?
Reward : ?? = -1, ?? = 1?
Transition Probability : 1
Discount factor : 0.9
Reward
+ 1
Reward
-1
State
Action
?????? ???????
Goal? ??
Goal
Start
?????? ??? ??? ???????? ??? goal state? ???
???????? ??
State : ??
Action : ?, ? , ?, ?,??,???, action? ??
Reward : ??? ???? Reward +1, ???? -1?
Transition Probability : 1
Discount factor : 0.9
State
Action
???? ??? ??? ??? ?? reward? ???.
???? ??
??
1. ???? ??? ???? ???? ?? ??
2. State? breakout?? ? ???? action? ??.
Reward ??
?????? ??? -
??? ????? -
???? ???? -
??? ????? +
??? ???? +
Penalty, Bonus reward??
Deeplearning model
VGG model and regular ??
https://goo.gl/images/eoXooC
https://goo.gl/images/s8XrCK
? ?? ????
????(reinforcement learning) ?? ?? ? Unity ml-agent? ???? ?? ??? ???? ???? ??
---
Github:
https://github.com/wonseokjung
Facebook:
https://www.facebook.com/ws.jung.798
Blog:
https://wonseokjung.github.io/
??!!
DQN? ??? ???? ????? ???
??
?? 1 ? ??? ?? ???? ?? ?..
?? ???? ??? ?? ????. Overfitting? ????
https://goo.gl/images/6uDmqH
????? ??? ??? ??? !
Reward Exploration Algorithm
?????.
Github:
https://github.com/wonseokjung
Facebook:
https://www.facebook.com/ws.jung.798
Blog:
https://wonseokjung.github.io/
References:
* Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition (in progress), MIT Press, Cambridge, MA, 2017
* https://github.com/rlcode/reinforcement-learning-kr
