This document provides an overview of deep deterministic policy gradient (DDPG), which combines aspects of DQN and policy gradient methods to enable deep reinforcement learning with continuous action spaces. It summarizes DQN and its limitations for continuous domains. It then explains policy gradient methods like REINFORCE, actor-critic, and deterministic policy gradient (DPG) that can handle continuous action spaces. DDPG adopts key elements of DQN like experience replay and target networks, and models the policy as a deterministic function like DPG, to apply deep reinforcement learning to complex continuous control tasks.
ddpg seminar
1. CONTACT
Autonomous Systems Laboratory
Mechanical Engineering
5th Engineering Building Room 810
Web. https://sites.google.com/site/aslunist/
Deep deterministic policy gradient
Minjae Jung
May. 19, 2020
2. Autonomous Systems Laboratory
2/21
DQN to DDPG: DQN overview
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
(Diagram: Q-learning → DQN (2015), which adds:
1. replay buffer
2. neural network
3. target network)
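As a rough illustration of the first ingredient, a replay buffer can be a few lines of Python. This is a minimal sketch under my own assumptions (the class name, capacity, and batch size are not from the slides):

    import random
    from collections import deque

    class ReplayBuffer:
        """Stores transitions (s, a, r, s_next, done) and samples minibatches."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def store(self, s, a, r, s_next, done):
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=64):
            # Uniform random sampling breaks the temporal correlation between
            # consecutive transitions collected by the agent.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)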
3. Autonomous Systems Laboratory
3/21
DQN to DDPG: DQN algorithm
Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
- DQN is capable of human-level performance on many Atari games
- Off-policy training: the replay buffer breaks the correlation between samples collected by the agent
- High-dimensional observations: a deep neural network can extract features from high-dimensional inputs
- Learning stability: the target network makes the training process stable
(Diagram: the agent stores transitions $(s_t, a_t, r_t, s_{t+1})$ in the replay buffer and acts with $\arg\max_{a} Q(s_t, a; \theta)$; minibatches from the buffer feed the DQN loss, whose target uses $\max_{a} Q(s_{t+1}, a; \theta^-)$ from the target Q network; the Q network is updated from the loss and periodically copied to the target network.)

Q-learning update:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$

DQN update (Q function with weights $\theta$, target network with weights $\theta^-$):
$Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta^-}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$

Policy ($\pi$): $\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$

$s_t$: state, $a_t$: action, $r_t$: reward, $Q(s_t, a_t)$: reward to go
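The DQN update above is usually implemented as a regression loss on minibatches drawn from the replay buffer. A hedged sketch in PyTorch (the network shapes, optimizer, and hyperparameters are my own illustrative choices, not details from the slides):

    import torch
    import torch.nn.functional as F

    def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
        """One step on the DQN loss (r + gamma * max_a' Q_target(s', a') - Q(s, a))^2."""
        s, a, r, s_next, done = batch                  # a: int64 action indices, shape [B]
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t; theta)
        with torch.no_grad():                                     # target uses frozen theta^-
            max_q_next = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * max_q_next
        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

Every so many steps the target network is refreshed with target_net.load_state_dict(q_net.state_dict()), which corresponds to the "copy" arrow in the diagram.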
4. Autonomous Systems Laboratory
4/21
DQN to DDPG: Limitation of DQN (discrete action spaces)
- Discrete action spaces
  - DQN can only handle discrete and low-dimensional action spaces
  - As the action dimension increases, the number of discrete actions (output nodes) grows exponentially
  - i.e. $n$ discrete values per dimension with $d$ dimensions -> $n^d$ actions (e.g. 10 values per dimension in 3 dimensions already gives $10^3 = 1000$ outputs)
- DQN cannot be straightforwardly applied to continuous domains
- Why? Both the greedy policy and the update require a maximization over actions, which is intractable when the action space is continuous:
  1. Policy ($\pi$): $\pi(s_t) = \arg\max_{a_t} Q(s_t, a_t)$
  2. Update: $Q_\theta(s_t, a_t) \leftarrow Q_\theta(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a_{t+1}} Q_{\theta^-}(s_{t+1}, a_{t+1}) - Q_\theta(s_t, a_t) \right]$
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
5. Autonomous Systems Laboratory
5/21
DDPG: DQN with Policy gradient methods
(Diagram: two lines of work converge on DDPG. Value-based: Q-learning → DQN via 1. replay buffer, 2. deep neural network, 3. target network. Policy-based: policy gradient (REINFORCE) → actor-critic → DPG, which handles continuous action spaces. DDPG combines the DQN ingredients with the deterministic actor-critic of DPG.)
Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
6. Autonomous Systems Laboratory
6/21
Policy gradient: The goal of Reinforcement learning
Trajectory distribution:
$p_\theta(s_1, a_1, \ldots, s_T, a_T) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$

(Figure: agent-world loop. The agent observes state $s_t$, its policy $\pi_\theta(a_t \mid s_t)$ chooses action $a_t$, and the world model $p(s_{t+1} \mid s_t, a_t)$ returns the reward $r_t$ and the next state $s_{t+1}$. The Markov decision process unrolls as $s_1, a_1 \rightarrow s_2$ via $p(s_2 \mid s_1, a_1)$, then $s_2, a_2 \rightarrow s_3$ via $p(s_3 \mid s_2, a_2)$, and so on.)

Goal of reinforcement learning (objective $J(\theta)$):
$\theta^{*} = \arg\max_{\theta} \, \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} r(s_t, a_t) \right]$

Policy ($\pi_\theta$): stochastic policy with weights $\theta$
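To make the objective concrete, $J(\theta)$ can be estimated by rolling the policy out and averaging the summed reward over episodes. A minimal sketch assuming a classic Gym-style reset/step interface (the function name and arguments are mine, not from the slides):

    def estimate_objective(env, policy, num_episodes=10, gamma=1.0):
        """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta(tau)} [ sum_t r(s_t, a_t) ]."""
        returns = []
        for _ in range(num_episodes):
            s, done, total, discount = env.reset(), False, 0.0, 1.0
            while not done:
                a = policy(s)                    # a_t ~ pi_theta(a_t | s_t)
                s, r, done, _ = env.step(a)      # environment: p(s_{t+1} | s_t, a_t) and r_t
                total += discount * r
                discount *= gamma
            returns.append(total)
        return sum(returns) / len(returns)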
7. Autonomous Systems Laboratory
7/21
Policy gradient: REINFORCE
- REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
(Figure: for a given state $s_t$, the policy network $\pi_\theta(a_t \mid s_t)$ outputs a probability for each action, e.g. 0.1, 0.1, 0.2, 0.2, 0.4.)
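Sampling an action from such a distribution takes one line with a categorical distribution. A small sketch using the illustrative probabilities from the figure:

    import torch

    # Action probabilities pi_theta(a_t | s_t) for one state, as in the figure.
    probs = torch.tensor([0.1, 0.1, 0.2, 0.2, 0.4])
    dist = torch.distributions.Categorical(probs=probs)
    a_t = dist.sample()            # a_t ~ pi_theta(a_t | s_t)
    log_pi = dist.log_prob(a_t)    # log pi_theta(a_t | s_t), reused by the policy gradient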
8. Autonomous Systems Laboratory
8/21
Policy gradient: REINFORCE
- REINFORCE models the policy as a stochastic policy: $a_t \sim \pi_\theta(a_t \mid s_t)$

$J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \sum_{t} r(s_t, a_t) \right]$

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)} \left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right) \right]$

Monte Carlo estimate over $N$ sampled episodes:
$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_{t=1}^{T} r(s_{i,t}, a_{i,t}) \right)$

Update: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

(Figure: $N$ trajectories $\tau_1, \tau_2, \ldots, \tau_N$ sampled from the initial state.)

Problem: the agent must experience several complete episodes before each update, which causes
1. a slow training process
2. high gradient variance
Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." Advances in neural information processing systems. 2000.
$\theta$: weights of the actor network, $\alpha$: learning rate
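A hedged sketch of the REINFORCE estimator above in PyTorch. It assumes a policy network that outputs logits over discrete actions and episodes stored as (state, action, reward) triples; these are my simplifications, not details given on the slide:

    import torch

    def reinforce_update(policy_net, optimizer, episodes):
        """episodes: list of trajectories, each a list of (state, action, reward).
        Estimates grad J(theta) ~ 1/N * sum_i [sum_t grad log pi(a_t|s_t)] * [sum_t r_t]."""
        loss = 0.0
        for traj in episodes:
            states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _, _ in traj])
            actions = torch.as_tensor([a for _, a, _ in traj])         # int64 action indices
            total_return = sum(r for _, _, r in traj)                  # sum_t r(s_t, a_t)
            log_probs = torch.log_softmax(policy_net(states), dim=-1)
            log_pi = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
            # Gradient ascent on J(theta) is done as gradient descent on -J(theta).
            loss = loss - log_pi.sum() * total_return
        loss = loss / len(episodes)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Note how the whole-episode return multiplies every log-probability term, which is exactly where the high gradient variance comes from.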
9. Autonomous Systems Laboratory
9/21
Policy gradient: Actor-critic
- Actor ($\pi_\theta(a_t \mid s_t)$): outputs an action distribution from the policy network and updates in the direction suggested by the critic
- Critic ($Q_w(s_t, a_t)$): evaluates the actor's actions; it replaces the sampled return $\sum_t r(s_t, a_t)$ in the policy gradient with the learned value $Q_w(s_t, a_t)$

(Figure: starting from the initial state, sample data $N$ times, update the critic and the actor, then sample again and repeat.)

1. Sample $(s_t, a_t, r_t, s_{t+1})$ from $\pi_\theta(a_t \mid s_t)$ $N$ times
2. Fit $Q_w(s_t, a_t)$ to the sampled data
3. $\nabla_\theta J(\theta) \approx \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q_w(s_t, a_t)$
4. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
Konda, Vijay R., and John N. Tsitsiklis. "Actor-critic algorithms." Advances in neural information processing systems. 2000.
$w$: weights of the critic network

REINFORCE problems revisited: 1. high gradient variance, 2. slow training. Here the learned critic $Q_w(s_t, a_t)$ stands in for the sampled full-episode return when computing the gradient.

(Figure: the actor $\pi_\theta(a_t \mid s_t)$ sends action $a_t$ to the environment, which returns state $s_t$; collected transitions $(s_t, a_t, r_t, s_{t+1})_{0 \ldots N}$ update the critic $Q_w(s_t, a_t)$, and the critic provides $\nabla J(\theta)$ for updating the actor.)
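A hedged sketch of steps 1-4 for one batch of sampled data (PyTorch; the network interfaces, the TD target used to fit the critic, and the discrete-action setup are illustrative choices, not specified on the slide):

    import torch
    import torch.nn.functional as F

    def actor_critic_update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
        """batch: tensors (s, a, r, s_next, a_next, done); a, a_next are int64 indices.
        The critic maps a state to Q-values for every discrete action."""
        s, a, r, s_next, a_next, done = batch
        # Step 2: fit the critic Q_w(s_t, a_t) toward a one-step TD target.
        q_sa = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            q_next = critic(s_next).gather(1, a_next.unsqueeze(1)).squeeze(1)
            td_target = r + gamma * (1.0 - done) * q_next
        critic_loss = F.mse_loss(q_sa, td_target)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Steps 3-4: policy gradient, with Q_w(s_t, a_t) in place of the Monte Carlo return.
        log_pi = torch.log_softmax(actor(s), dim=-1).gather(1, a.unsqueeze(1)).squeeze(1)
        actor_loss = -(log_pi * q_sa.detach()).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()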
10. Autonomous Systems Laboratory
10/21
Policy gradient: DPG
Silver, David, et al. "Deterministic policy gradient algorithms." 2014.
- Deterministic policy gradient (DPG) models the actor as a deterministic policy: $a_t = \mu_\theta(s_t)$

(Figure: output-layer comparison for a 2-dimensional action $(a_1, a_2)$ given state $s_t$. A stochastic policy $\pi_\theta(a_t \mid s_t)$ with 5 discretized values per dimension needs 10 output nodes, while the deterministic policy $\mu_\theta(s_t)$ needs only 2 outputs, one continuous value per action dimension.)
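The slides above stop before the full DDPG algorithm, so the following is only a rough sketch of how the deterministic actor changes the update, loosely following Lillicrap et al. (2015); the function signature, the critic target, and the soft-update rate tau are assumptions for illustration:

    import torch
    import torch.nn.functional as F

    def ddpg_update(actor, critic, actor_t, critic_t, actor_opt, critic_opt,
                    batch, gamma=0.99, tau=0.005):
        """One DDPG-style update with a deterministic actor a_t = mu_theta(s_t)."""
        s, a, r, s_next, done = batch
        # Critic: regress Q(s, a) toward y = r + gamma * Q_target(s', mu_target(s')).
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * critic_t(s_next, actor_t(s_next)).squeeze(1)
        critic_loss = F.mse_loss(critic(s, a).squeeze(1), y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # Actor: deterministic policy gradient, i.e. ascend Q(s, mu_theta(s)).
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft target updates: theta' <- tau * theta + (1 - tau) * theta'.
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)

Exploration noise (the original paper uses an Ornstein-Uhlenbeck process) would be added to mu_theta(s) when collecting data, not inside this update.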
15. Autonomous Systems Laboratory
15/21
DDPG example: landing on a moving platform
Rodriguez-Ramos, Alejandro, et al. "A deep reinforcement learning strategy for UAV autonomous landing on a moving platform." Journal of Intelligent & Robotic Systems 93.1-2 (2019): 351-366.
16. Autonomous Systems Laboratory
16/21
DDPG example: long-range robotic navigation
Faust, Aleksandra, et al. "PRM-RL: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning." 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018.
- DDPG is used as the local planner for long-range navigation
17. Autonomous Systems Laboratory
17/21
DDPG example: multi-agent DDPG (MADDPG)
Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in neural information processing systems. 2017.
18. Autonomous Systems Laboratory
18/21
Conclusion & Future work
- DQN has trouble handling continuous action spaces directly
- DDPG can handle continuous action spaces by combining the policy gradient method with an actor-critic architecture
- MADDPG extends DDPG to multi-agent RL
- Future work: use DDPG for continuous-action decision-making problems
  - e.g. navigation, obstacle avoidance