6. 6
Atari Breakout in OpenAI Gym
import gym
env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
action = env.action_space.sample() # action by random or policy
state, reward, terminated, truncated, info = env.step(action)
if terminated or truncated:
state, info = env.reset()
env.close()
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
7. 7
State/Action/Reward in Atari Breakout
State:
●
(210, 160, 3) - image
Action:
●
0 - NO OP
●
1 - FIRE
●
2 - RIGHT
●
3 - LEFT
Reward:
●
Red - 7 points
●
Orange - 7 points
●
Yellow - 4 points
●
Green - 4 points
●
Aqua - 1 point
●
Blue - 1 point
https://www.gymlibrary.dev/
https://gymnasium.farama.org/
8. 8
From One Game to All The Games in Atari
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
9. 9
A Journey to Artificial General Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
10. 10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
●
Number of Variable : 1
●
Range of Variable : [1, 500]
●
25 taxi positions x 5 passenger positions x 4 destination locations
Action:
●
0 : move south
●
1 : move north
●
2 : move east
●
3 : move west
●
4 : pickup passenger
●
5 : drop off passenger
Reward:
●
+20 : delivering passenger
●
-10 : pickup/dropoff illegally
●
-1 : per step unless other rewards is triggered
https://www.gymlibrary.dev/environments/toy_text/taxi/
14. 14
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
15. 15
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
16. 16
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
5. store transition in replay memory
6. get batch from replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
7. calculate error between two networks
8. synchronize two networks
Ref : Human-level control through deep reinforcement learning
17. 17
Deep Q Network (DQN) on Breakout
Artificial Intelligence and the Future - Demis Hassabis/DeepMind
https://youtu.be/zYII3AOSgo8?t=2236
18. 18
Deep Q Network (DQN) Benchmark
Ref : Human-level control through deep reinforcement learning
19. 19
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
21. 21
How Well Can Agent57 Do?
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
22. 22
Reinforcement Learning at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
23. 23
Mastering Go at DeepMind
https://analyticsindiamag.com/all-hail-the-king-of-reinforcement-learning-deepmind/
24. 24
A Journey to Artificial General Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
26. 26
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=lVMgxtm5L-U
https://www.youtube.com/watch?v=LX8Knl0g0LE
27. 27
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=lVMgxtm5L-U
https://www.youtube.com/watch?v=WXuK6gekU1Y
31. 31
AlphaZero Network for Chess
Ref: Acquisition of Chess Knowledge in AlphaZero
AlphaGo
? Two networks: policy network and value network
? Conv/ReLu-based layer structure
AlphaZero
? One network with two heads: policy and value
? ResNet-based layer structure
32. 32
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
33. 33
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model