狠狠撸

强化学习的王者之旅
余方國博士
04/11/2023

2
https://www.imdb.com/title/tt0108065/?ref_=ttmi_tt

3
MuZero
Alpha Zero
Gym
Gym

4
Atari Games
Pong Breakout Phoenix
https://www.gymlibrary.dev/
https://gymnasium.farama.org/

5
Reinforcement Learning Framework
ENVIRONMENT
AGENT
State Action Reward
(s1 → a1 → r1)→ (s2 → a2 → r2)→ (s3 → a3 → r3)→ …
Making Sequential Decisions to Maximize Long-Term Rewards

6
Atari Breakout in OpenAI Gym
import gym
env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
action = env.action_space.sample() # action by random or policy
state, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
state, info = env.reset()
env.close()

7
State/Action/Reward in Atari Breakout
State:
●
(210, 160, 3) - image
Action:
●
0 - NO OP
●
1 - FIRE
●
2 - RIGHT
●
3 - LEFT
Reward:
●
Red - 7 points
●
Orange - 7 points
●
Yellow - 4 points
●
Green - 4 points
●
Aqua - 1 point
●
Blue - 1 point

8
From One Game to All The Games in Atari
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark

9
A Journey to Artificial General Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020

10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
●
Number of Variable : 1
●
Range of Variable : [1, 500]
●
25 taxi positions x 5 passenger positions x 4 destination locations
Action:
●
0 : move south
●
1 : move north
●
2 : move east
●
3 : move west
●
4 : pickup passenger
●
5 : drop off passenger
Reward:
●
+20 : delivering passenger
●
-10 : pickup/dropoff illegally
●
-1 : per step unless other rewards is triggered
https://www.gymlibrary.dev/environments/toy_text/taxi/

11
OpenAI Gym Taxi-v3 : Q Table
(500 x 6)
https://www.gocoder.one/blog/rl-tutorial-with-openai-gym

12
Q Learning (with epsilon greedy policy)
3. exploitation
1. initialize Q table
4. exploration
5. action
2. state
8. update Q table
6. next state
7. reward
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/

13
Limitation of Q Table
representation
scalability

14
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning

15
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning

16
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
5. store transition in replay memory
6. get batch from replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
7. calculate error between two networks
8. synchronize two networks

17
Deep Q Network (DQN) on Breakout
Artificial Intelligence and the Future - Demis Hassabis/DeepMind
https://youtu.be/zYII3AOSgo8?t=2236

18
Deep Q Network (DQN) Benchmark

19
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.

20
Distributed Reinforcement Learning
Agent57
Gorila
https://arxiv.org/abs/2003.13350
https://arxiv.org/abs/1507.04296

21
How Well Can Agent57 Do?

22
Deep Q Network and Brain Activity

23
Policy Gradient on Atari Pong
https://www.youtube.com/watch?v=tqrcjHuNdmQ

24
Reinforcement Learning at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/

25
Mastering Go at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/

26
A Journey to Artificial General Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U

27
AlphaGo, AlphaGo Zero, Alpha Zero, MuZero
AlphaGo Zero, Nature, 2017
AlphaZero, Science, 2018 MuZero, Nature, 2020
AlphaGo, Nature, 2016

28
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=LX8Knl0g0LE

29
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=WXuK6gekU1Y

30
AlphaGo Inputs and Policy/Value Networks
/ckmarkohchang/alphago-in-depth

31
AlphaGo Monte Carlo Tree Search
/ckmarkohchang/alphago-in-depth

32
AlphaZero Training Process
Self-Play
Train
Value
Network
Train
Policy
Network

33
AlphaZero Network for Chess
Ref: Acquisition of Chess Knowledge in AlphaZero
AlphaGo
? Two networks: policy network and value network
? Conv/ReLu-based layer structure
AlphaZero
? One network with two heads: policy and value
? ResNet-based layer structure

34
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/

35
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model

36
MuZero Performance Benchmark
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model

37
MuZero for Self-Driving Car at Tesla
https://www.youtube.com/watch?v=j0z4FweCy4M

38
AlphaGo to AlphaStar by David Silver
Deep Reinforcement Learning from AlphaGo to AlphaStar - London Machine Learning Meetup

39
MuZero
Alpha Zero
Gym
Gym
深度強化學習
通用人工智慧

40
演講資料
fangkuoyu@gmail.com
博客帳號

狠狠撸

强化学习的王者之旅

More Related Content

强化学习的王者之旅

强化学习的王者之旅

强化学习的王者之旅