Maximum Entropy Reinforcement Learning (Stochastic Control)
Dongmin Lee
I reviewed the following papers.
- T. Haarnoja, et al., "Reinforcement Learning with Deep Energy-Based Policies", ICML 2017
- T. Haarnoja, et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor", ICML 2018
- T. Haarnoja, et al., "Soft Actor-Critic Algorithms and Applications", arXiv preprint 2018
Thank you.
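For reference, the maximum entropy objective shared by these papers augments the expected return with a policy entropy term, where the temperature α trades off reward against entropy:

J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]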
2. Overview
Of several responses made to the same situation, those which are accompanied
or closely followed by satisfaction to the animal will, other things being equal,
be more firmly connected with the situation, so that, when it recurs, they will
be more likely to recur; those which are accompanied or closely followed by
discomfort to the animal will, other things being equal, have their connections
with that situation weakened, so that, when it recurs, they will be less likely to
occur. The greater the satisfaction or discomfort, the greater the strengthening
or weakening of the bond. (E. L. Thorndike, Animal Intelligence, page 244.)
- Reinforcement learning has its roots in trial-and-error learning.
- Thorndike's law of effect: responses followed by satisfaction become more likely to recur when the situation comes up again, while responses followed by discomfort become less likely.
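The law of effect maps directly onto a simple trial-and-error learner: actions followed by reward have their estimated value strengthened, so they are chosen more often when the situation recurs. Below is a minimal sketch of that idea as a small bandit-style learner; the reward probabilities, step size, and exploration rate are illustrative assumptions, not taken from the slides.

```python
import random

# Minimal trial-and-error learner illustrating Thorndike's law of effect:
# actions followed by reward have their estimated value strengthened,
# so they become more likely to be chosen when the situation recurs.

REWARD_PROBS = [0.2, 0.5, 0.8]   # hypothetical, unknown to the learner
N_ACTIONS = len(REWARD_PROBS)
EPSILON = 0.1                    # small chance of an exploratory (random) action
STEP_SIZE = 0.1                  # how strongly each outcome updates the "bond"

values = [0.0] * N_ACTIONS       # estimated strength of each action's connection

def act():
    """Occasionally explore at random, otherwise pick the strongest action."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return max(range(N_ACTIONS), key=lambda a: values[a])

for trial in range(5000):
    a = act()
    reward = 1.0 if random.random() < REWARD_PROBS[a] else 0.0
    # Satisfaction (reward) strengthens the connection; its absence weakens it.
    values[a] += STEP_SIZE * (reward - values[a])

print([round(v, 2) for v in values])  # the highest-reward action ends up strongest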