???
2
Overview
Of several responses made to the same situation, those which are accompanied
or closely followed by satisfaction to the animal will, other things being equal,
be more firmly connected with the situation, so that, when it recurs, they will
be more likely to recur; those which are accompanied or closely followed by
discomfort to the animal will, other things being equal, have their connections
with that situation weakened, so that, when it recurs, they will be less likely to
occur. The greater the satisfaction or discomfort, the greater the strengthening
or weakening of the bond. (E. L. Thorndike, Animal Intelligence, page 244.)
? ????? ???? ????(trial-and-error) ??? ???? ??
? ?????(Thorndike) ??? ??(law of effect)?? ?? ? ???? ? ????? ?? ?? ??? ??
3
Overview (cont'd)
? ??????? ??(state)? ??(reward)? ??? ?? ??? ???
? ???? ???
? Agent: ??? ?? ??
? Environment: Agent? ???? ?
? State(? ?): agent? ?? or ??? ??
? Action(? ?): agent? ??
? Reward(??): ??? ?? ?? ??
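The components listed above interact in a loop: the agent observes a state, picks an action, and the environment answers with a reward and the next state. The following is a minimal sketch of that loop. The `Corridor` environment, its `reset`/`step` methods, and the `run_episode` helper are hypothetical names chosen for illustration and do not come from the slides.

```python
class Corridor:
    """Hypothetical toy environment: positions 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # reward only on reaching the goal
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of rewards the agent collected."""
    state = env.reset()
    total = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total += reward
        if done:
            break
    return total


def always_right(state):
    return +1


print(run_episode(Corridor(), always_right))  # 1.0
```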
4
Overview (cont'd)
From Yann LeCun (NIPS 2016)
? ?????? ????? ??? ?? ??? ? ?? ?? ???? ??? ??
? ??????? ??? ??? ??(??)??? ??? ????? ??? ???? ??
? ??? Supervised Learning? ??? ?? ?? ??
? ??? ??? ??? ?? ??(??)? ?? ?? (? ?)? ?? ? ??(policy)
? ??? ??(exploration)? ??(exploitation)?? ????
5
MDP, Markov Decision Process
? ?????? state? Markov ??? ???? ???? ??
? Markov State: ??? state ?? ??? ???? ?? state? reward? ???? ? ??? ??? ??
? ?? ??, ?? ???? ?? ??? ??? ?? ?? ?? ??? ????? ??? ???, ?? ??? ???
???? ???? ?? ?? ??? ?? ??
6
Q-Learning
 state
 action
 Reward
Q(state, action) → $Q(s, a)$
• Q-function (state-action value function)
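For a small discrete problem, the Q-function can be stored as a table with one row per state and one column per action. The sketch below assumes a 4x4 grid (16 states) and four actions. The action ordering and the example entry are illustrative assumptions, not taken from the slides.

```python
N_STATES, N_ACTIONS = 16, 4
ACTIONS = ["Left", "Down", "Right", "Up"]  # ordering is an assumption


def make_q_table(n_states=N_STATES, n_actions=N_ACTIONS):
    """Return a table of zeros: q[state][action] is the current estimate of Q(s, a)."""
    return [[0.0] * n_actions for _ in range(n_states)]


q = make_q_table()
q[14][2] = 1.0  # illustrative entry: moving Right from state 14 is worth 1

# Look up the highest-valued action in state 14.
best = max(range(N_ACTIONS), key=lambda a: q[14][a])
print(ACTIONS[best])  # Right
```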
7
Q-Learning (cont'd)
? Frozen Lake? ?? Q-Learning ??
Left Right
Up
Down
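A deterministic Frozen Lake with the four moves named above can be sketched in a few lines. The 4x4 map layout, the state numbering (row by row from 0 to 15), and the `step` function are assumptions for illustration; the slides do not spell them out.

```python
# Assumed layout: S = start, F = frozen (safe), H = hole, G = goal
MAP = ["SFFF",
       "FHFH",
       "FFFH",
       "HFFG"]
MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}


def step(state, action):
    """Apply one deterministic move; return (next_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)  # bumping into a wall leaves the agent in place
    col = min(max(col + d_col, 0), 3)
    tile = MAP[row][col]
    reward = 1.0 if tile == "G" else 0.0  # reward only at the goal
    done = tile in "HG"                   # holes and the goal end the episode
    return row * 4 + col, reward, done


print(step(0, "Right"))   # (1, 0.0, False)
print(step(14, "Right"))  # (15, 1.0, True)
```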
8
Q-Learning : state, action, reward
• state, action, reward
$s_0, a_0, r_1, s_1, a_1, r_2, \dots, s_{n-1}, a_{n-1}, r_n, s_n$
state
action
reward
Terminal state
9
Q-Learning : state, action, reward
• state, action, reward
$R_t = r_t + R_{t+1}$
$R_t^{*} = r_t + \max R_{t+1}$
$Q(s, a) = r + \max_{a'} Q(s', a')$
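The recursion $R_t = r_t + R_{t+1}$ means the return at every step can be computed in one backward pass over the rewards of an episode. A small sketch, assuming a hypothetical helper `returns` and an undiscounted sum as in the formulas above:

```python
def returns(rewards):
    """Return [R_0, R_1, ...] where R_t = r_t + R_{t+1} and the return after the last step is 0."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + running  # R_t = r_t + R_{t+1}
        out[t] = running
    return out


print(returns([0, 0, 0, 1]))  # [1.0, 1.0, 1.0, 1.0]
print(returns([1, 0, 2]))     # [3.0, 2.0, 2.0]
```

Without discounting, every step on a path to the goal has the same return, which is why the undiscounted Q-table cannot tell a short path from a long one.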
10
Q-Learning : Policy
? ? state ?? ??? ??? ?? ?? ?? actio? ?? ???? ?? policy, π ?? ?
• Greedy: ?? ?? ?, $\max_a Q(s, a)$ ??
• ε-greedy: greedy policy? ????? ?? ?? ?? ??? action? ??
? ??? ? ?? ??? ??? ? ??? ???? ??? ??
• soft-max: ε-greedy? ???? ??? ???? ? action? ??? ??? ?? ??? ?? (p.277 ??)
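The three policies named above can be sketched as functions over one row of the Q-table (the Q-values of all actions in the current state). The function names, the `temperature` parameter of the soft-max variant, and the use of Python's `random` module are assumptions for illustration.

```python
import math
import random


def greedy(q_row):
    """Pick the action with the highest Q-value (first one on ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def epsilon_greedy(q_row, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return greedy(q_row)


def softmax_probs(q_row, temperature=1.0):
    """Turn Q-values into selection probabilities; higher Q means higher probability."""
    top = max(q_row)  # subtract the max for numerical stability
    exps = [math.exp((q - top) / temperature) for q in q_row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_policy(q_row, temperature=1.0, rng=random):
    """Sample an action according to the soft-max probabilities."""
    probs = softmax_probs(q_row, temperature)
    return rng.choices(range(len(q_row)), weights=probs)[0]


row = [0.0, 0.5, 0.2, 0.0]
print(greedy(row))               # 1
print(epsilon_greedy(row, 0.0))  # 1 (epsilon = 0 is purely greedy)
```

Unlike ε-greedy, which explores uniformly at random, the soft-max variant still prefers actions with higher estimated values while exploring.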
11
• Learning Q(s, a) Table
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
Initial Q values are 0
Q-Learning : Greedy Policy
12
• Learning Q(s, a) Table
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
Initial Q values are 0
[Figure: Frozen Lake grid, Q-value shown: 1]
Q-Learning : Greedy Policy
13
• Learning Q(s, a) Table
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
Initial Q values are 0
[Figure: Frozen Lake grid, Q-values shown: 1, 1]
Q-Learning : Greedy Policy
14
Q-Learning : Greedy Policy
• Learning Q(s, a) Table
$Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$
Initial Q values are 0
[Figure: Frozen Lake grid, Q-values shown: eight entries of 1]
15
Q-Learning Algorithm (greedy)
• Q-function (state-action value function)
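A possible rendering of the greedy algorithm in code is given here, using the update $Q(s, a) \leftarrow r + \max_{a'} Q(s', a')$ with all Q-values starting at 0. This is a sketch under stated assumptions, not the slide's own listing: the 4x4 map, the action ordering, random tie-breaking among equally valued actions (needed so that an all-zero table still moves around), and the episode and step limits are all illustrative choices.

```python
import random

MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]      # assumed layout: S start, F frozen, H hole, G goal
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # assumed order: Left, Down, Right, Up


def step(state, action):
    """Deterministic move on the 4x4 grid; returns (next_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    tile = MAP[row][col]
    return row * 4 + col, (1.0 if tile == "G" else 0.0), tile in "HG"


def greedy_random_ties(q_row, rng):
    """Greedy action choice; ties are broken uniformly at random."""
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def train(episodes=2000, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * 4 for _ in range(16)]  # initial Q values are 0
    for _episode in range(episodes):
        state = 0
        for _step in range(max_steps):
            action = greedy_random_ties(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q(s, a) <- r + max_a' Q(s', a'); terminal rows stay at 0
            q[state][action] = reward + max(q[next_state])
            state = next_state
            if done:
                break
    return q


q = train()
print(q[14])  # Q-values of state 14, the cell left of the goal
```

Because there is no discount factor yet, every learned value is either 0 or 1, so the table marks which moves can still lead to the goal but not which route is shortest.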
16
Exploitation vs Exploration
[Figure: Frozen Lake grid, Q-values shown: eight entries of 1]
Exploitation
Exploration
17
Q-Learning : ε-greedy
18
Q-Learning : Discounted future reward
[Figure: Frozen Lake grid, Q-values shown: nine entries of 1]
? ??? ??? reward? ?? ???? ??
? ???? ?? 0 + ? + 1? ???? ??? ?? ??? ??? ??
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
19
Q-Learning : Discounted future reward
20
Q-Learning : Discounted future reward
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
[Figure: Frozen Lake grid, Q-value shown: 1]
21
Q-Learning : Discounted future reward
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
[Figure: Frozen Lake grid, Q-values shown: 1, 0.9]
$0.9 = 0 + 0.9 \times 1$
22
Q-Learning : Discounted future reward
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$\gamma = 0.9$
[Figure: Frozen Lake grid, Q-values shown: 1, 0.9, 0.81, 0.9, 0.729]
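The values 0.9, 0.81 and 0.729 follow from applying the discounted update backwards along a path whose only reward is the 1 received on the final step. A short sketch, assuming a hypothetical helper `back_up_path` and the discount factor 0.9 used on the slides:

```python
GAMMA = 0.9


def back_up_path(n_steps, final_reward=1.0, gamma=GAMMA):
    """Q-values along a path, listed from the step into the goal backwards."""
    values = [final_reward]  # last step: r = 1 and the terminal state has Q = 0
    for _ in range(n_steps - 1):
        # every earlier step has r = 0, so Q = 0 + gamma * (Q of the next step)
        values.append(0.0 + gamma * values[-1])
    return values


print([round(v, 3) for v in back_up_path(4)])  # [1.0, 0.9, 0.81, 0.729]
```

A step that is k moves away from the goal is therefore worth 0.9 to the power k - 1, so shorter routes end up with higher Q-values.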
23
Q-Learning : Discounted future reward
24
Q-Learning : Deterministic vs Stochastic
? Deterministic: Agent ? action? ???(determined)?? state? ??? ?
? Stochastic(Non-deterministic): Agent? ???? action? ?? state ? ??? ?
? ?, agent? action? ??? ?? state? deterministic?? ???? ?? ??? ????? ???
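The difference between the two settings can be illustrated by separating the action the agent intends from the action the environment actually executes. In the sketch below, the slip model (sliding sideways with probability `slip_prob`) and all names are assumptions chosen for illustration; the slides do not specify how the stochastic environment behaves.

```python
import random

# Hypothetical slip model: when the agent slips, it moves sideways instead.
PERPENDICULAR = {
    "Left": ["Up", "Down"],
    "Right": ["Up", "Down"],
    "Up": ["Left", "Right"],
    "Down": ["Left", "Right"],
}


def executed_action(intended, slip_prob, rng):
    """Return the action that is actually carried out."""
    if rng.random() < slip_prob:
        return rng.choice(PERPENDICULAR[intended])  # stochastic outcome
    return intended                                 # deterministic outcome


rng = random.Random(0)
print(executed_action("Right", 0.0, rng))  # Right (slip_prob = 0 is deterministic)
```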
25
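Q-Learning : Deterministic vs Stochastic
$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') \right]$
- Learning rate, $\alpha = 0.1$
$Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$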
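The two learning-rate forms above are algebraically the same update, which a few lines of code can confirm. The function names are illustrative assumptions; the defaults α = 0.1 and γ = 0.9 are the values quoted on the slides.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Incremental form: move Q(s, a) a fraction alpha toward the target."""
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)


def q_update_blend(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Blend form: keep (1 - alpha) of the old value, take alpha of the target."""
    return (1 - alpha) * q_sa + alpha * (reward + gamma * max_q_next)


# One update from Q(s, a) = 0 with r = 0 and max Q(s', a') = 1:
print(round(q_update(0.0, 0.0, 1.0), 4))        # 0.09
print(round(q_update_blend(0.0, 0.0, 1.0), 4))  # 0.09
```

With α = 1 the update reduces to the deterministic rule $Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')$. A smaller α averages over noisy outcomes instead of trusting a single transition.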
26
Q-Learning : Deterministic vs Stochastic
THANK YOU

[??]Chap11 ????
