CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May. 19, 2020
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
Q learning → DQN (2015), adding:
1. replay buffer
2. neural network
3. target network
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
• DQN reaches human-level performance on many Atari games
• Off-policy training: the replay buffer breaks the correlation between consecutive samples collected by the agent
• High-dimensional observations: a deep neural network extracts features from high-dimensional input
• Learning stability: the target network makes the training process stable
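[Diagram: the Q network selects $\arg\max_{a_t} Q(s_t, a_t; \theta)$ in the environment; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; the DQN loss compares $Q(s_t, a_t; \theta)$ with the target built from $\max_{a} Q(s', a; \theta')$ of the target Q network; the loss updates $\theta$, and $\theta$ is copied to $\theta'$.]
Q learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$
DQN:
$$Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$$
Policy($\pi$): $a_t = \arg\max_{a_t} Q_\theta(s_t, a_t)$
$s_t$: state
$a_t$: action
$r_t$: reward
$Q(s_t, a_t)$: reward to go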
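The two update rules on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation from Mnih et al.: the table layout `q[state][action]`, the function names, and the `done` flag for terminal transitions are assumptions introduced here.

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step on a table q[state][action] (hypothetical layout)."""
    td_target = r + gamma * max(q[s_next])      # r + gamma * max_a Q(s', a)
    q[s][a] += alpha * (td_target - q[s][a])    # move Q(s, a) toward the target
    return q[s][a]


def dqn_target(r, q_target_next, gamma=0.99, done=False):
    """Regression target for the DQN loss.

    q_target_next holds the *target network* values Q_theta'(s', a) for every
    discrete action a. The done flag is an assumption, not part of the slide.
    """
    return r if done else r + gamma * max(q_target_next)


# Tiny worked example: two states, two actions.
q = {0: [0.0, 0.0], 1: [1.0, 2.0]}
new_value = q_learning_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Both functions call `max` over a finite list of action values, which is exactly the step that stops working once actions are continuous.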
DQN to DDPG: Limitation of DQN (discrete action spaces)
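• Discrete action spaces
- DQN can only handle discrete, low-dimensional action spaces
- As the action dimension increases, the number of discrete actions (output nodes) grows exponentially
- e.g. $k$ discrete values in each of $n$ action dimensions give $k^n$ actions
• DQN cannot be applied straightforwardly to continuous domains
• Why? Both steps below maximize over actions, which for continuous actions would require solving an optimization problem at every step:
1. Policy($\pi$): $a_t = \arg\max_{a_t} Q_\theta(s_t, a_t)$
2. Update: $Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$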
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
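The exponential growth can be checked by enumerating a discretized action set. The sketch below is illustrative only; the helper name `discretized_actions` and the choice of three values per dimension with seven dimensions are assumptions made for the example.

```python
from itertools import product


def discretized_actions(values_per_dim, n_dims):
    """Enumerate every joint action when each dimension takes the same discrete values."""
    return list(product(values_per_dim, repeat=n_dims))


# Three values per dimension in a 7-dimensional action space: 3**7 joint actions,
# each of which would need its own output node in a DQN-style Q network.
actions = discretized_actions([-1.0, 0.0, 1.0], n_dims=7)
n_actions = len(actions)
```

Adding one more action dimension multiplies the count by the number of values per dimension, so even coarse discretization quickly becomes impractical.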
DDPG: DQN with Policy gradient methods
Q learning → DQN, adding:
1. replay buffer
2. deep neural network
3. target network
Policy gradient (REINFORCE) → Actor critic → DPG
DQN + DPG → DDPG: continuous action spaces
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
Policy gradient: The goal of Reinforcement learning
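[Diagram: the agent's policy $\pi_\theta(a_t|s_t)$ sends action $a_t$ to the world; the world's model $p(s_{t+1}|s_t, a_t)$ returns the reward $r_t$ and the next state $s_{t+1}$.]
policy($\pi_\theta$): stochastic policy with weights $\theta$
Markov decision process: $s_1, a_1 \to s_2, a_2 \to s_3, a_3 \to \dots$ with transitions $p(s_2|s_1, a_1)$, $p(s_3|s_2, a_2)$, $p(s_4|s_3, a_3)$
Trajectory distribution:
$$p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$
Objective:
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \sum_t r(s_t, a_t) \right]$$
Goal of reinforcement learning:
$$\theta^* = \arg\max_\theta J(\theta)$$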
Policy gradient: REINFORCE
• REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
[Figure: the policy network maps state $s_t$ to action probabilities $\pi_\theta(a_t|s_t)$, e.g. 0.1, 0.1, 0.2, 0.2, 0.4]
Policy gradient: REINFORCE
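• REINFORCE models the policy as a stochastic decision: $a_t \sim \pi_\theta(a_t|s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
$$J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \sum_t r(s_t, a_t) \right]$$
$$\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]$$
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$$
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$
$N$: the number of episodes
$\theta$: weights of actor network
$\alpha$: learning rate
[Figure: episodes $\tau_1, \tau_2, \dots, \tau_N$ rolled out from the initial state]
Problem: must experience some episodes before each update
1. Slow training process
2. High gradient variance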
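The sampled gradient estimate and the ascent step can be sketched on a toy problem. Everything specific here is an assumption for illustration: a single-state task with two actions, one-step episodes, a softmax policy with one weight per action, and a fixed reward table. For a softmax policy the score function is $\nabla_\theta \log \pi_\theta(a) = \mathrm{onehot}(a) - \pi_\theta$.

```python
import math
import random


def softmax(theta):
    """Action probabilities pi_theta for a state-less softmax policy."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_gradient(theta, episodes):
    """(1/N) * sum_i (sum_t grad log pi(a_t)) * (sum_t r_t).

    episodes: list of trajectories, each a list of (action, reward) pairs.
    """
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for traj in episodes:
        ret = sum(r for _, r in traj)  # total reward of this episode
        for a, _ in traj:
            for k in range(len(theta)):
                score = (1.0 if k == a else 0.0) - probs[k]  # grad of log softmax
                grad[k] += score * ret
    return [g / len(episodes) for g in grad]


def train(n_updates=200, n_episodes=20, alpha=0.1, seed=0):
    """Gradient ascent theta <- theta + alpha * grad J(theta) on the toy task."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    rewards = [0.0, 1.0]  # hypothetical reward table: action 1 is better
    for _ in range(n_updates):
        probs = softmax(theta)
        episodes = []
        for _ in range(n_episodes):  # N whole episodes are needed per update
            a = 0 if rng.random() < probs[0] else 1
            episodes.append([(a, rewards[a])])
        grad = reinforce_gradient(theta, episodes)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return softmax(theta)


final_probs = train()
```

Each update consumes `n_episodes` fresh episodes, which is the source of the slow training noted on the slide, and the estimate fluctuates from batch to batch because it is scaled by sampled returns.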
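Policy gradient: Actor critic
• Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
• Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's action
• The critic approximates the reward to go: $\approx Q_\phi(s_t, a_t)$
[Figure: starting from the initial state, sample data $N$ times, update critic & actor, sample data $N$ times, update critic & actor, and so on]
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $N$ times
2. Update $Q_\phi(s_t, a_t)$ to the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \log \pi_\theta(a_i|s_i)\, Q_\phi(s_i, a_i)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$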
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
$\phi$: weights of critic network
1. High gradient variance
2. Slow training policy
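[Diagram: the actor $\pi_\theta(a_t|s_t)$ sends $a_t$ to Env. and receives $s_t$; the $N$ sampled transitions $(s_t, a_t, r_t, s_{t+1})$ are used to update the critic $Q_\phi(s_t, a_t)$, which supplies $\nabla J(\theta)$ to the actor]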
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
• Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
Stochastic policy $\pi_\theta(a_t|s_t)$:
• Needs 10 outputs for 2-dimensional actions discretized into 5 values per dimension
Deterministic policy $\mu_\theta(s_t)$:
• Only 2 outputs are needed
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
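• Deterministic policy gradient (DPG) models the actor policy as a deterministic decision: $a_t = \mu_\theta(s_t)$
1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $N$ times
2. Update $Q_\phi(s_t, a_t)$ to the samples
3. $\nabla_\theta J(\theta) \approx \sum_i \nabla_\theta \mu_\theta(s_i)\, \nabla_{a_i} Q_\phi(s_i, a_i)\big|_{a_i = \mu_\theta(s_i)}$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Trajectory distribution, stochastic policy:
$$p_\theta(s_1, a_1, \dots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$$
Trajectory distribution, deterministic policy:
$$p_\theta(s_1, s_2, s_3, \dots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, \mu_\theta(s_t))$$
Objective, stochastic policy:
$$J(\theta) = E_{s, a \sim p_\theta(\tau)}\left[ Q_\phi(s_t, a_t) \right]$$
Objective, deterministic policy:
$$J(\theta) = E_{s \sim p_\theta(\tau)}\left[ Q(s, \mu_\theta(s)) \right]$$
loss (TD error; the critic is trained to minimize its square): $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$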
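Step 3 is a chain rule through the critic, which a scalar example makes concrete. The sketch below assumes a hypothetical linear policy $\mu_\theta(s) = \theta s$ and a hand-written critic $Q(s, a) = -(a - 2s)^2$ whose best action is $a = 2s$. Neither comes from Silver et al., and the gradient is averaged over states rather than summed, which only rescales the step.

```python
def mu(theta, s):
    """Hypothetical deterministic policy: a = theta * s."""
    return theta * s


def dq_da(s, a):
    """Action gradient of the assumed critic Q(s, a) = -(a - 2*s)**2."""
    return -2.0 * (a - 2.0 * s)


def dpg_gradient(theta, states):
    """Average of grad_theta mu(s) * grad_a Q(s, a) evaluated at a = mu(s)."""
    total = 0.0
    for s in states:
        dmu_dtheta = s  # derivative of theta * s with respect to theta
        total += dmu_dtheta * dq_da(s, mu(theta, s))
    return total / len(states)


def train_dpg(theta=0.0, alpha=0.1, steps=100):
    """Gradient ascent theta <- theta + alpha * grad J(theta)."""
    states = [-1.0, -0.5, 0.5, 1.0]  # illustrative batch of sampled states
    for _ in range(steps):
        theta += alpha * dpg_gradient(theta, states)
    return theta


theta_final = train_dpg()
```

No action is ever sampled and no maximization over actions is performed: the policy parameter is pushed in the direction that increases the critic's value, so `theta_final` approaches 2.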
DDPG: DQN + DPG
Q learning → DQN
+ off policy: replay buffer
+ stable update: target network
+ high dimensional observation spaces
- discrete action spaces
Policy gradient (REINFORCE) → Actor critic → DPG
+ continuous action spaces
- high variance (REINFORCE), + lower variance (actor critic)
- no replay buffer: sample correlation
- no target network: unstable
- low dimensional observation spaces
DQN + DPG → DDPG
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
DDPG: algorithm (1/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
• policy
DQN: $a = \arg\max_a Q_\theta(s, a)$ → DDPG: $a = \mu_\theta(s)$
• exploration
- Add noise for exploration: white Gaussian noise
$\mu'(s) = \mu_\theta(s) + \mathcal{N}$
• soft target update
$\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ where $\tau \ll 1$
- Target network is constrained to change slowly
- Stabilizes the training process
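Both rules on this slide are one-liners in code. The sketch below stores network weights as flat lists of floats, which is an assumption for illustration, as are the default values of `tau` and `sigma` and the clipping of the noisy action to a bounded range.

```python
import random


def soft_update(target, online, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def noisy_action(mu_s, sigma=0.1, low=-1.0, high=1.0, rng=random):
    """Exploration action mu(s) + N with zero-mean white Gaussian noise.

    Clipping to [low, high] is an added assumption for bounded action spaces.
    """
    a = mu_s + rng.gauss(0.0, sigma)
    return min(high, max(low, a))


# With tau << 1 the target weights trail the online weights slowly.
target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, tau=0.1)
```

Repeated calls move the target weights toward the online weights geometrically, so the regression target used by the critic changes slowly between updates.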
DDPG: algorithm (2/2)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
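policy $\mu_\theta$
target policy $\mu_{\theta'}$
critic $Q_\phi$
target critic $Q_{\phi'}$
[Diagram: the actor selects the action $a_t = \mu_\theta(s_t) + \mathcal{N}$ and applies it to Env, which returns $r_{t+1}(s_t, a_t)$ and the next state; $N$ transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer and sampled as a batch $(s_i, a_i, r_i, s_{i+1})$; the target policy supplies $\mu_{\theta'}(s_{i+1})$ for the critic update with loss $L(\phi)$, followed by a soft update of $\phi'$; the critic supplies $\nabla J(\theta)$ to update the actor $\mu_\theta(s_i)$, followed by a soft update of $\theta'$.]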
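The data path in this diagram (store, sample a batch, build critic targets from the target networks) can be sketched as follows. The class and function names are hypothetical, networks are represented by plain Python callables, and the mean squared error form of $L(\phi)$ is the usual choice rather than something read off the slide.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions (s, a, r, s_next); oldest entries are dropped."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def critic_targets(batch, target_policy, target_critic, gamma=0.99):
    """y_i = r_i + gamma * Q_phi'(s_{i+1}, mu_theta'(s_{i+1})) for each transition."""
    return [r + gamma * target_critic(s_next, target_policy(s_next))
            for (_, _, r, s_next) in batch]


def critic_loss(batch, critic, targets):
    """Mean squared error between the targets and Q_phi(s_i, a_i)."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)
```

A training step would sample a batch, minimize `critic_loss` with respect to the critic weights, update the actor with the deterministic policy gradient, and then soft-update both target networks.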
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
• DDPG is used as the local planner for long-range navigation
DDPG example: multi agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
Conclusion & Future work
• DQN cannot be applied directly to continuous action spaces
• DDPG handles continuous action spaces by combining the policy gradient method with an actor-critic architecture
• MADDPG for multi-agent RL
• Use DDPG for decision-making problems with continuous action spaces
- e.g. navigation, obstacle avoidance
Appendix: Objective gradient derivation
objective gradient
Appendix: DPG objective
Appendix: DDPG algorithm

More Related Content

What's hot (20)

Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)
Suhyun Cho
?
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
Dong Guo
?
???? ? Trpo
???? ? Trpo???? ? Trpo
???? ? Trpo
Woong won Lee
?
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
Nguyen Quang
?
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
Kyunghwan Kim
?
1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????
NAVER Engineering
?
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
DongHyun Kwak
?
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
Wangyu Han
?
???? ???? ??? ???? ????
???? ???? ??? ???? ???????? ???? ??? ???? ????
???? ???? ??? ???? ????
Woong won Lee
?
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
Subrat Panda, PhD
?
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive EnvironmentsMulti PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Jisang Yoon
?
?????? ?? ?
?????? ?? ??????? ?? ?
?????? ?? ?
NAVER Engineering
?
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
Usman Qayyum
?
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
Dongmin Lee
?
???? ????? ??? Part 2
???? ????? ??? Part 2???? ????? ??? Part 2
???? ????? ??? Part 2
Dongmin Lee
?
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
Khaled Saleh
?
Chapter 10 sequence modeling recurrent and recursive nets
Chapter 10 sequence modeling recurrent and recursive netsChapter 10 sequence modeling recurrent and recursive nets
Chapter 10 sequence modeling recurrent and recursive nets
KyeongUkJang
?
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
MeetupDataScienceRoma
?
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
Omar Enayet
?
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at Netflix
Jaya Kawale
?
Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)Introduction to SAC(Soft Actor-Critic)
Introduction to SAC(Soft Actor-Critic)
Suhyun Cho
?
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
Dong Guo
?
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
Nguyen Quang
?
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
???? ??? ??: Rainbow ???? ???? (2nd dlcat in Daejeon)
Kyunghwan Kim
?
1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????1???? GAN(Generative Adversarial Network) ?? ????
1???? GAN(Generative Adversarial Network) ?? ????
NAVER Engineering
?
DDPG algortihm for angry birds
DDPG algortihm for angry birdsDDPG algortihm for angry birds
DDPG algortihm for angry birds
Wangyu Han
?
???? ???? ??? ???? ????
???? ???? ??? ???? ???????? ???? ??? ???? ????
???? ???? ??? ???? ????
Woong won Lee
?
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
Subrat Panda, PhD
?
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive EnvironmentsMulti PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Multi PPT - Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Jisang Yoon
?
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
Usman Qayyum
?
Multi-armed Bandits
Multi-armed BanditsMulti-armed Bandits
Multi-armed Bandits
Dongmin Lee
?
???? ????? ??? Part 2
???? ????? ??? Part 2???? ????? ??? Part 2
???? ????? ??? Part 2
Dongmin Lee
?
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
Khaled Saleh
?
Chapter 10 sequence modeling recurrent and recursive nets
Chapter 10 sequence modeling recurrent and recursive netsChapter 10 sequence modeling recurrent and recursive nets
Chapter 10 sequence modeling recurrent and recursive nets
KyeongUkJang
?
Reinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners TutorialReinforcement Learning : A Beginners Tutorial
Reinforcement Learning : A Beginners Tutorial
Omar Enayet
?
A Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at NetflixA Multi-Armed Bandit Framework For Recommendations at Netflix
A Multi-Armed Bandit Framework For Recommendations at Netflix
Jaya Kawale
?

Similar to ddpg seminar (20)

Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
?
Learning a nonlinear embedding by preserving class neibourhood structure ??
Learning a nonlinear embedding by preserving class neibourhood structure   ??Learning a nonlinear embedding by preserving class neibourhood structure   ??
Learning a nonlinear embedding by preserving class neibourhood structure ??
WooSung Choi
?
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
MLconf
?
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
MLconf
?
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
Ryo Iwaki
?
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGI
Anirban Santara
?
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final final
dinesh malla
?
Continuous control
Continuous controlContinuous control
Continuous control
Reiji Hatsugai
?
QMIX: monotonic value function factorization paper review
QMIX: monotonic value function factorization paper reviewQMIX: monotonic value function factorization paper review
QMIX: monotonic value function factorization paper review
?? ?
?
Learning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldLearning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifold
Kai-Wen Zhao
?
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learning
Ryo Iwaki
?
Recommendation algorithm using reinforcement learning
Recommendation algorithm using reinforcement learningRecommendation algorithm using reinforcement learning
Recommendation algorithm using reinforcement learning
Arithmer Inc.
?
Lecture 5 backpropagation
Lecture 5 backpropagationLecture 5 backpropagation
Lecture 5 backpropagation
ParveenMalik18
?
Intrinsically Motivated Reinforcement Learning
Intrinsically Motivated Reinforcement LearningIntrinsically Motivated Reinforcement Learning
Intrinsically Motivated Reinforcement Learning
Kai Zhang
?
Combining genetic algoriths and constraint programming to support stress test...
Combining genetic algoriths and constraint programming to support stress test...Combining genetic algoriths and constraint programming to support stress test...
Combining genetic algoriths and constraint programming to support stress test...
Lionel Briand
?
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
Sang Jun Lee
?
Modifed my_poster
Modifed my_posterModifed my_poster
Modifed my_poster
Anshul Goyal, EIT
?
Neural network
Neural networkNeural network
Neural network
Babu Priyavrat
?
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
Hye-min Ahn
?
2021 06-02-tabnet
2021 06-02-tabnet2021 06-02-tabnet
2021 06-02-tabnet
JAEMINJEONG5
?
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAIDeep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Deep Reinforcement Learning Through Policy Optimization, John Schulman, OpenAI
Jack Clark
?
Learning a nonlinear embedding by preserving class neibourhood structure ??
Learning a nonlinear embedding by preserving class neibourhood structure   ??Learning a nonlinear embedding by preserving class neibourhood structure   ??
Learning a nonlinear embedding by preserving class neibourhood structure ??
WooSung Choi
?
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
MLconf
?
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
MLconf
?
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
Ryo Iwaki
?
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGI
Anirban Santara
?
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final final
dinesh malla
?
QMIX: monotonic value function factorization paper review
QMIX: monotonic value function factorization paper reviewQMIX: monotonic value function factorization paper review
QMIX: monotonic value function factorization paper review
?? ?
?
Learning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifoldLearning to discover monte carlo algorithm on spin ice manifold
Learning to discover monte carlo algorithm on spin ice manifold
Kai-Wen Zhao
?
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learning
Ryo Iwaki
?
Recommendation algorithm using reinforcement learning
Recommendation algorithm using reinforcement learningRecommendation algorithm using reinforcement learning
Recommendation algorithm using reinforcement learning
Arithmer Inc.
?
Lecture 5 backpropagation
Lecture 5 backpropagationLecture 5 backpropagation
Lecture 5 backpropagation
ParveenMalik18
?
Intrinsically Motivated Reinforcement Learning
Intrinsically Motivated Reinforcement LearningIntrinsically Motivated Reinforcement Learning
Intrinsically Motivated Reinforcement Learning
Kai Zhang
?
Combining genetic algoriths and constraint programming to support stress test...
Combining genetic algoriths and constraint programming to support stress test...Combining genetic algoriths and constraint programming to support stress test...
Combining genetic algoriths and constraint programming to support stress test...
Lionel Briand
?
Lecture 5: Neural Networks II
Lecture 5: Neural Networks IILecture 5: Neural Networks II
Lecture 5: Neural Networks II
Sang Jun Lee
?
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
Hye-min Ahn
?

Recently uploaded (20)

CONTRACTOR ALL RISK INSURANCESAR (1).ppt
CONTRACTOR ALL RISK INSURANCESAR (1).pptCONTRACTOR ALL RISK INSURANCESAR (1).ppt
CONTRACTOR ALL RISK INSURANCESAR (1).ppt
suaktonny
?
CFOT Fiber Optics FOA CERTIFICATION.pptx
CFOT Fiber Optics FOA CERTIFICATION.pptxCFOT Fiber Optics FOA CERTIFICATION.pptx
CFOT Fiber Optics FOA CERTIFICATION.pptx
MohamedShabana37
?
TM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdfTM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdf
ChungLe60
?
eng funda notes.pdfddddddddddddddddddddddd
eng funda notes.pdfdddddddddddddddddddddddeng funda notes.pdfddddddddddddddddddddddd
eng funda notes.pdfddddddddddddddddddddddd
aayushkumarsinghec22
?
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptxRAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
JenTeruel1
?
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
slayshadow705
?
Indian Soil Classification System in Geotechnical Engineering
Indian Soil Classification System in Geotechnical EngineeringIndian Soil Classification System in Geotechnical Engineering
Indian Soil Classification System in Geotechnical Engineering
Rajani Vyawahare
?
Lecture -3 Cold water supply system.pptx
Lecture -3 Cold water supply system.pptxLecture -3 Cold water supply system.pptx
Lecture -3 Cold water supply system.pptx
rabiaatif2
?
Cloud Computing concepts and technologies
Cloud Computing concepts and technologiesCloud Computing concepts and technologies
Cloud Computing concepts and technologies
ssuser4c9444
?
Cyber Security_ Protecting the Digital World.pptx
Cyber Security_ Protecting the Digital World.pptxCyber Security_ Protecting the Digital World.pptx
Cyber Security_ Protecting the Digital World.pptx
Harshith A S
?
health safety and environment presentation
health safety and environment presentationhealth safety and environment presentation
health safety and environment presentation
ssuserc606c7
?
Engineering at Lovely Professional University (LPU).pdf
Engineering at Lovely Professional University (LPU).pdfEngineering at Lovely Professional University (LPU).pdf
Engineering at Lovely Professional University (LPU).pdf
Sona
?
Frankfurt University of Applied Science urkunde
Frankfurt University of Applied Science urkundeFrankfurt University of Applied Science urkunde
Frankfurt University of Applied Science urkunde
Lisa Emerson
?
only history of java.pptx real bihind the name java
only history of java.pptx real bihind the name javaonly history of java.pptx real bihind the name java
only history of java.pptx real bihind the name java
mushtaqsaliq9
?
Equipment for Gas Metal Arc Welding Process
Equipment for Gas Metal Arc Welding ProcessEquipment for Gas Metal Arc Welding Process
Equipment for Gas Metal Arc Welding Process
AhmadKamil87
?
Env and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdfEnv and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdf
MahmudHasan747870
?
Soil Properties and Methods of Determination
Soil Properties and  Methods of DeterminationSoil Properties and  Methods of Determination
Soil Properties and Methods of Determination
Rajani Vyawahare
?
G8 mini project for alcohol detection and engine lock system with GPS tracki...
G8 mini project for  alcohol detection and engine lock system with GPS tracki...G8 mini project for  alcohol detection and engine lock system with GPS tracki...
G8 mini project for alcohol detection and engine lock system with GPS tracki...
sahillanjewar294
?
GM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptxGM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptx
crdslalcomumbai
?
The Golden Gate Bridge a structural marvel inspired by mother nature.pptx
The Golden Gate Bridge a structural marvel inspired by mother nature.pptxThe Golden Gate Bridge a structural marvel inspired by mother nature.pptx
The Golden Gate Bridge a structural marvel inspired by mother nature.pptx
AkankshaRawat75
?
CONTRACTOR ALL RISK INSURANCESAR (1).ppt
CONTRACTOR ALL RISK INSURANCESAR (1).pptCONTRACTOR ALL RISK INSURANCESAR (1).ppt
CONTRACTOR ALL RISK INSURANCESAR (1).ppt
suaktonny
?
CFOT Fiber Optics FOA CERTIFICATION.pptx
CFOT Fiber Optics FOA CERTIFICATION.pptxCFOT Fiber Optics FOA CERTIFICATION.pptx
CFOT Fiber Optics FOA CERTIFICATION.pptx
MohamedShabana37
?
TM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdfTM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdf
ChungLe60
?
eng funda notes.pdfddddddddddddddddddddddd
eng funda notes.pdfdddddddddddddddddddddddeng funda notes.pdfddddddddddddddddddddddd
eng funda notes.pdfddddddddddddddddddddddd
aayushkumarsinghec22
?
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptxRAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
RAMSES- EDITORIAL SAMPLE FOR DSSPC C.pptx
JenTeruel1
?
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
slayshadow705
?
Indian Soil Classification System in Geotechnical Engineering
Indian Soil Classification System in Geotechnical EngineeringIndian Soil Classification System in Geotechnical Engineering
Indian Soil Classification System in Geotechnical Engineering
Rajani Vyawahare
?
Lecture -3 Cold water supply system.pptx
Lecture -3 Cold water supply system.pptxLecture -3 Cold water supply system.pptx
Lecture -3 Cold water supply system.pptx
rabiaatif2
?
Cloud Computing concepts and technologies
Cloud Computing concepts and technologiesCloud Computing concepts and technologies
Cloud Computing concepts and technologies
ssuser4c9444
?
Cyber Security_ Protecting the Digital World.pptx
Cyber Security_ Protecting the Digital World.pptxCyber Security_ Protecting the Digital World.pptx
Cyber Security_ Protecting the Digital World.pptx
Harshith A S
?
health safety and environment presentation
health safety and environment presentationhealth safety and environment presentation
health safety and environment presentation
ssuserc606c7
?
Engineering at Lovely Professional University (LPU).pdf
Engineering at Lovely Professional University (LPU).pdfEngineering at Lovely Professional University (LPU).pdf
Engineering at Lovely Professional University (LPU).pdf
Sona
?
Frankfurt University of Applied Science urkunde
Frankfurt University of Applied Science urkundeFrankfurt University of Applied Science urkunde
Frankfurt University of Applied Science urkunde
Lisa Emerson
?
only history of java.pptx real bihind the name java
only history of java.pptx real bihind the name javaonly history of java.pptx real bihind the name java
only history of java.pptx real bihind the name java
mushtaqsaliq9
?
Equipment for Gas Metal Arc Welding Process
Equipment for Gas Metal Arc Welding ProcessEquipment for Gas Metal Arc Welding Process
Equipment for Gas Metal Arc Welding Process
AhmadKamil87
?
Env and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdfEnv and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdf
MahmudHasan747870
?
Soil Properties and Methods of Determination
Soil Properties and  Methods of DeterminationSoil Properties and  Methods of Determination
Soil Properties and Methods of Determination
Rajani Vyawahare
?
G8 mini project for alcohol detection and engine lock system with GPS tracki...
G8 mini project for  alcohol detection and engine lock system with GPS tracki...G8 mini project for  alcohol detection and engine lock system with GPS tracki...
G8 mini project for alcohol detection and engine lock system with GPS tracki...
sahillanjewar294
?
GM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptxGM Meeting 070225 TO 130225 for 2024.pptx
GM Meeting 070225 TO 130225 for 2024.pptx
crdslalcomumbai
?
The Golden Gate Bridge a structural marvel inspired by mother nature.pptx
The Golden Gate Bridge a structural marvel inspired by mother nature.pptxThe Golden Gate Bridge a structural marvel inspired by mother nature.pptx
The Golden Gate Bridge a structural marvel inspired by mother nature.pptx
AkankshaRawat75
?

ddpg seminar

  • 1. CONTACT Autonomous Systems Laboratory Mechanical Engineering 5th Engineering Building Room 810 Web. https://sites.google.com/site/aslunist/ Deep deterministic policy gradient Minjae Jung May. 19, 2020
  • 2. DQN to DDPG: DQN overview
      - Q-learning -> DQN (2015), which adds: 1. replay buffer, 2. neural network, 3. target network
      Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
  • 3. DQN to DDPG: DQN algorithm
      - DQN reaches human-level performance on many Atari games.
      - Off-policy training: the replay buffer breaks the correlation between consecutive samples collected by the agent.
      - High-dimensional observations: a deep neural network can extract features from high-dimensional input.
      - Learning stability: the target network makes the training process stable.
      - [Figure: DQN block diagram with Environment, Q Network, Target Q Network, DQN Loss, and Replay buffer. Labels: action $\arg\max_a Q(s_t, a_t; \theta)$, state $s_t$, store $(s_t, a_t, r_t, s_{t+1})$, $Q(s_t, a_t; \theta)$, $\max_a Q(s_t', a_t; \theta')$, Update, Copy.]
      - Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
      - DQN: $Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$
      - Policy ($\pi$): $a_t = \arg\max_{a_t} Q_\theta(s_t, a_t)$
      - Notation: $s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward to go
      Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
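The Q-learning and DQN updates on slide 3 differ only in which value estimate supplies the bootstrap term. The following is a minimal tabular sketch, assuming a dictionary in place of the neural network; the names `q_update`, `q`, and `q_target` are illustrative and do not come from the slides.

```python
def q_update(q, q_target, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD update of q[(s, a)], bootstrapping from q_target.

    Passing q_target=q gives plain Q-learning; passing a frozen copy
    mimics DQN's target network (a table stands in for the network here).
    """
    # max over next actions, read from the target estimate
    best_next = max(q_target.get((s_next, a2), 0.0) for a2 in actions)
    td_error = r + gamma * best_next - q.get((s, a), 0.0)
    q[(s, a)] = q.get((s, a), 0.0) + alpha * td_error
    return q[(s, a)]


actions = [0, 1]
q = {}
q_target = {("s1", 1): 2.0}  # frozen copy, refreshed only occasionally
new_value = q_update(q, q_target, "s0", 0, r=1.0, s_next="s1",
                     actions=actions, alpha=0.5, gamma=0.9)
```

Keeping `q_target` fixed between refreshes means the regression target does not move on every update, which is the stabilizing effect the slide attributes to the target network.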
  • 4. DQN to DDPG: Limitation of DQN (discrete action spaces)
      - Discrete action spaces
          - DQN can only handle discrete, low-dimensional action spaces.
          - As the action dimension increases, the number of discrete actions (output nodes) increases exponentially.
          - i.e. $k$ discrete values per dimension with $n$ dimensions -> $k^n$ actions
      - DQN cannot be straightforwardly applied to continuous domains.
      - Why? Both of the following need a maximization over all actions, which is not tractable when actions are continuous:
          1. Policy ($\pi$): $a_t = \arg\max_{a_t} Q_\theta(s_t, a_t)$
          2. Update: $Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta'}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$
      Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
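To make the exponential growth on slide 4 concrete, the sketch below enumerates joint actions for a discretized action space. The helper name `discretized_actions` and the three-value grid are assumptions chosen for illustration; the 7-dimensional count matches the 3^7 = 2187 example given in the DDPG paper.

```python
from itertools import product


def discretized_actions(values_per_dim, n_dims):
    """Enumerate every joint action when each of n_dims action dimensions
    is discretized into the same list of values."""
    return list(product(values_per_dim, repeat=n_dims))


coarse = [-1.0, 0.0, 1.0]              # k = 3 values per dimension
joint = discretized_actions(coarse, 2)  # n = 2 dimensions
count_7dof = len(coarse) ** 7           # n = 7 dimensions, counted only
```

A DQN output layer would need one node per entry of `joint`, so even this coarsest grid becomes impractical for a 7-dimensional action.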
  • 5. DDPG: DQN with policy gradient methods
      - Q-learning -> DQN (1. replay buffer, 2. deep neural network, 3. target network)
      - Policy gradient (REINFORCE) -> Actor critic -> DPG
      - DQN and DPG -> DDPG (continuous action spaces)
      Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
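  • 6. Policy gradient: The goal of reinforcement learning
      - [Figure: agent-world loop. The agent's policy $\pi_\theta(a_t|s_t)$ sends action $a_t$ to the world; the world's model $p(s_{t+1}|s_t, a_t)$ returns the reward $r_t$ and the next state $s_{t+1}$.]
      - [Figure: Markov decision process chain $s_1, a_1, s_2, a_2, s_3, a_3, \ldots$ with transitions $p(s_2|s_1, a_1)$, $p(s_3|s_2, a_2)$, $p(s_4|s_3, a_3)$.]
      - Trajectory distribution: $p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$
      - Goal of reinforcement learning: $\theta^* = \arg\max_\theta E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$, where the expectation is the objective $J(\theta)$
      - Policy ($\pi_\theta$): stochastic policy with weights $\theta$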
  • 7. Policy gradient: REINFORCE
      - REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t|s_t)$
      - [Figure: policy network mapping $s_t$ to $\pi_\theta(a_t|s_t)$, with example action probabilities 0.1, 0.1, 0.2, 0.2, 0.4.]
      Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
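  • 8. Policy gradient: REINFORCE
      - REINFORCE models the policy as a stochastic decision: $a_t \sim \pi_\theta(a_t|s_t)$
      - Objective: $J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\sum_t r(s_t, a_t)\right]$
      - Gradient: $\nabla_\theta J(\theta) = E_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$
      - Sample estimate: $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)$, where $N$ is the number of episodes
      - Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
      - $\theta$: weights of actor network, $\alpha$: learning rate
      - [Figure: episodes $\tau_1, \tau_2, \ldots, \tau_N$ rolled out from the initial state.]
      - Problem: complete episodes must be experienced before each update, which leads to 1. a slow training process and 2. high gradient variance.
      Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.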
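The REINFORCE estimate on slide 8 can be sketched for the simplest possible case. The example below assumes a single-state problem with a softmax policy over two actions, so $\nabla_\theta \log \pi_\theta(a)$ is the one-hot vector of the action minus the action probabilities. The reward table, horizon, and step size are hypothetical values picked for illustration.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(prefs, action):
    """Gradient of log softmax(prefs)[action] w.r.t. prefs: onehot - probs."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_gradient(prefs, episodes):
    """Average over episodes of (sum of score vectors) * (episode return).

    episodes: list of episodes, each a list of (action, reward) pairs.
    """
    grad = [0.0] * len(prefs)
    for episode in episodes:
        episode_return = sum(r for _, r in episode)
        score = [0.0] * len(prefs)
        for action, _ in episode:
            g = grad_log_pi(prefs, action)
            score = [sc + gi for sc, gi in zip(score, g)]
        grad = [gr + sc * episode_return for gr, sc in zip(grad, score)]
    return [g / len(episodes) for g in grad]


def sample_episode(prefs, rewards, horizon, rng):
    """Roll out one episode by sampling actions from the current policy."""
    probs = softmax(prefs)
    actions = rng.choices(range(len(prefs)), weights=probs, k=horizon)
    return [(a, rewards[a]) for a in actions]


rng = random.Random(0)
theta = [0.0, 0.0]
rewards = [0.0, 1.0]  # hypothetical: action 1 always pays more
alpha = 0.1
for _ in range(200):
    # N = 10 full episodes must be collected before every single update
    episodes = [sample_episode(theta, rewards, horizon=5, rng=rng)
                for _ in range(10)]
    grad = reinforce_gradient(theta, episodes)
    theta = [t + alpha * g for t, g in zip(theta, grad)]
```

The inner list comprehension is where the cost noted on the slide shows up: no parameter changes until all episodes of the batch have finished.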
  • 9. Policy gradient: Actor critic
      - Actor ($\pi_\theta(a_t|s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic.
      - Critic ($Q_\phi(s_t, a_t)$): evaluates the actor's action. It replaces the sampled return: $\sum_t r(s_t, a_t) \approx Q_\phi(s_t, a_t)$.
      - Algorithm:
          1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t|s_t)$ $N$ times
          2. Update $Q_\phi(s_t, a_t)$ to the sampled data
          3. $\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q_\phi(s_t, a_t)$
          4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
      - $\phi$: weights of critic network
      - Issues noted on the slide: 1. high gradient variance, 2. slow training
      - [Figure: timeline from the initial state: sample data $N$ times, update critic & actor, sample data $N$ times, update critic & actor.]
      - [Figure: actor $\pi_\theta(a_t|s_t)$ sends $a_t$ to Env., which returns $s_t$; the sampled $(s_t, a_t, r_t, s_{t+1})$ are used to update the critic $Q_\phi(s_t, a_t)$, which in turn updates the actor.]
      Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
  • 10. Policy gradient: DPG
      - Deterministic policy gradient (DPG) models the actor policy as a deterministic policy: $a_t = \mu_\theta(s_t)$
      - [Figure: stochastic policy $\pi_\theta(a_t|s_t)$ with input $s_t$: 10 output nodes are needed for 2-dimensional actions discretized into 5 values each.]
      - [Figure: deterministic policy $\mu_\theta(s_t)$ with input $s_t$: only 2 output nodes are needed.]
      Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
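  • 11. Policy gradient: DPG
      - Deterministic policy gradient (DPG) models the actor policy as a deterministic decision: $a_t = \mu_\theta(s_t)$
      - Algorithm:
          1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\mu_\theta(s)$ $N$ times
          2. Update $Q_\phi(s_t, a_t)$ to the samples
          3. $\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q_\phi(s_t, a_t)\big|_{a_t = \mu_\theta(s_t)}$
          4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
      - Trajectory distribution (stochastic policy): $p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$
      - Trajectory distribution (deterministic policy): $p_\theta(s_1, s_2, s_3, \ldots, s_T) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, \mu_\theta(s_t))$
      - Objective: $J(\theta) = E_{s, a \sim p_\theta(\tau)}\left[Q_\phi(s_t, a_t)\right]$ becomes $J(\theta) = E_{s \sim p_\theta(\tau)}\left[Q(s, \mu_\theta(s))\right]$
      - Loss: $L = r_t + \gamma Q_\phi(s_{t+1}, \mu_\theta(s_{t+1})) - Q_\phi(s_t, a_t)$
      Silver, David, et al. "Deterministic policy gradient algorithms." 2014.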
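Step 3 of the DPG algorithm on slide 11 is a chain rule: the actor's parameters move in the direction that increases the critic's value at the action the actor currently outputs. The sketch below assumes a scalar linear actor $\mu_\theta(s) = \theta s$ and a fixed, hand-written critic $Q(s, a) = -(a - 2s)^2$; both are hypothetical stand-ins for the learned networks, chosen so that the best policy is known to be $\theta = 2$.

```python
def mu(theta, s):
    """Deterministic actor: a = theta * s (so d mu / d theta = s)."""
    return theta * s


def dq_da(s, a):
    """Derivative w.r.t. a of the hypothetical critic Q(s, a) = -(a - 2s)^2."""
    return -2.0 * (a - 2.0 * s)


def dpg_gradient(theta, states):
    """Sum over states of (d mu / d theta) * (dQ / da at a = mu_theta(s))."""
    return sum(s * dq_da(s, mu(theta, s)) for s in states)


theta = 0.0
alpha = 0.05
states = [-1.0, 0.5, 1.0]  # sampled states, fixed here for reproducibility
for _ in range(100):
    theta += alpha * dpg_gradient(theta, states)  # gradient ascent on J
```

No action sampling and no maximization over actions appears anywhere in the loop, which is why this form carries over to continuous action spaces.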
  • 12. DDPG: DQN + DPG
      - Q-learning: - low-dimensional observation spaces
      - DQN: + off-policy: replay buffer; + stable update: target network; + high-dimensional observation spaces; - discrete action spaces
      - Policy gradient (REINFORCE): - high variance
      - Actor critic: + lower variance
      - DPG: + continuous action spaces; - no replay buffer: sample correlation; - no target network: unstable
      - DDPG: DQN + DPG
      Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
  • 13. DDPG: algorithm (1/2)
      - Policy: DQN uses $a = \arg\max_a Q_\theta(s, a)$; DDPG uses $a = \mu_\theta(s)$
      - Exploration: add noise for exploration (white Gaussian noise): $\mu'(s) = \mu_\theta(s) + \mathcal{N}$
      - Soft target update: $\theta' \leftarrow \tau\theta + (1 - \tau)\theta'$ where $\tau \ll 1$
          - The target network is constrained to change slowly.
          - This stabilizes the training process.
      Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
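The exploration and soft target update rules on slide 13 are each a single line of arithmetic. The sketch below treats network weights as plain lists of floats; the function names, the clipping range, and the numeric values of `tau` and `sigma` are assumptions made for illustration rather than settings taken from the slides.

```python
import random


def soft_update(target_params, online_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', element by element."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def noisy_action(mu, state, sigma, rng, low=-1.0, high=1.0):
    """Deterministic action plus white Gaussian noise.

    Clipping to [low, high] is an added assumption for bounded actuators.
    """
    a = mu(state) + rng.gauss(0.0, sigma)
    return max(low, min(high, a))


online = [1.0, -2.0]
target = [0.0, 0.0]
target = soft_update(target, online, tau=0.1)  # moves 10% of the way

rng = random.Random(0)
a = noisy_action(lambda s: 0.5 * s, state=1.0, sigma=0.2, rng=rng)
```

With a small `tau`, the target weights trail the online weights as a slow moving average, in contrast to DQN's periodic hard copy.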
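  • 14. DDPG: algorithm (2/2)
      - Networks: policy $\mu_\theta$, target policy $\mu_{\theta'}$, critic $Q_\phi$, target critic $Q_{\phi'}$
      - Select action: $a_t = \mu_\theta(s_t) + \mathcal{N}$
      - [Figure: DDPG training loop. The actor selects an action for state $s_t$ from Env; transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in the replay buffer; a sampled batch $(s_t, a_t, r_t, s_{t+1})$ and the target policy output $\mu_{\theta'}(s_{t+1})$ are used to update the critic with loss $L(\phi)$; the critic and $\mu_\theta(s_t)$ give $\nabla J(\theta)$ for the actor; $\theta'$ and $\phi'$ follow by soft update.]
      Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).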
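The data path in the slide 14 diagram (store transitions, sample a batch, build critic targets from the target networks) can be sketched without any neural network. The class and function names below are hypothetical, and the lambda "networks" are stand-ins so that the example runs on its own.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) transitions."""

    def __init__(self, capacity, seed=None):
        self.data = deque(maxlen=capacity)  # oldest transitions drop out
        self.rng = random.Random(seed)

    def store(self, s, a, r, s_next):
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size):
        # uniform sampling without replacement breaks temporal correlation
        return self.rng.sample(list(self.data), batch_size)


def critic_targets(batch, target_actor, target_critic, gamma):
    """y = r + gamma * Q'(s_next, mu'(s_next)) for each transition."""
    return [r + gamma * target_critic(s_next, target_actor(s_next))
            for (_, _, r, s_next) in batch]


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(float(t), 0.0, 1.0, float(t + 1))

batch = buffer.sample(2)
targets = critic_targets(batch,
                         target_actor=lambda s: 0.0,       # stand-in for mu'
                         target_critic=lambda s, a: s + a,  # stand-in for Q'
                         gamma=0.5)
```

The critic would then be regressed toward `targets`, while the actor is updated from the critic's action gradient and both target networks follow by soft update.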
  • 15. DDPG example: landing on a moving platform. Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
  • 16. DDPG example: long-range robotic navigation
      - DDPG is used as the local planner for long-range navigation.
      Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
  • 17. DDPG example: multi agent DDPG (MADDPG). Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
  • 18. Conclusion & Future work
      - DQN cannot be applied directly to continuous action spaces.
      - DDPG handles continuous action spaces through the policy gradient method and an actor-critic architecture.
      - MADDPG extends DDPG to multi-agent RL.
      - Use DDPG for decision-making problems with continuous action spaces, e.g. navigation and obstacle avoidance.
  • 19. Appendix: Objective gradient derivation
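The body of the appendix slide did not survive extraction. As a reference point only, the standard log-derivative derivation below arrives at the REINFORCE gradient shown on slide 8; it is an assumption that the appendix followed this route. Here $r(\tau)$ is shorthand, introduced for this sketch, for the episode return.

```latex
\begin{aligned}
J(\theta) &= E_{\tau \sim p_\theta(\tau)}\left[ r(\tau) \right]
           = \int p_\theta(\tau)\, r(\tau)\, d\tau,
           \qquad r(\tau) = \sum_{t=1}^{T} r(s_t, a_t) \\
\nabla_\theta J(\theta)
          &= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau
           = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau \\
\log p_\theta(\tau)
          &= \log p(s_1) + \sum_{t=1}^{T} \left[ \log \pi_\theta(a_t|s_t)
             + \log p(s_{t+1}|s_t, a_t) \right] \\
\nabla_\theta J(\theta)
          &= E_{\tau \sim p_\theta(\tau)}\left[
             \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \right)
             \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]
\end{aligned}
```

The second line uses the identity $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$, and the last line follows because the initial-state and transition terms do not depend on $\theta$.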