Dr. 余方國
06/04/2023
From Atari/AlphaGo/ChatGPT:
A Look at Deep Reinforcement Learning and Artificial General Intelligence
Deep Reinforcement Learning and Artificial General Intelligence
Artificial General Intelligence (AGI) : an agent that can match or exceed human performance in a wide range of environments
(Credit: Shane Legg and Marcus Hutter)
Reinforcement Learning : decision-making framework
Deep Learning : representation computation/optimization mechanism
Deep Reinforcement Learning : formulate problem/solution
(Credit: David Silver and Demis Hassabis)
Deep Reinforcement Learning and Artificial General Intelligence
1. Atari Games
2. AlphaGo Series
3. ChatGPT / GPT-4
Atari Games
Pong, Breakout, Phoenix
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
Reinforcement Learning Framework
ENVIRONMENT
AGENT
State Action Reward
(s1 → a1 → r1)→ (s2 → a2 → r2)→ (s3 → a3 → r3)→ …
Making Sequential Decisions to Maximize Long-Term Rewards
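The "long-term reward" above is commonly formalized as a discounted return over the reward sequence r1, r2, r3, … A minimal sketch, assuming a discount factor `gamma` (the slide does not state one) and a hypothetical helper name `discounted_return`:

```python
def discounted_return(rewards, gamma=0.99):
    """Return r1 + gamma*r2 + gamma^2*r3 + ... for one episode.

    gamma is an assumed discount factor; the slide does not specify one.
    """
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With `gamma` close to 1 the agent weighs distant rewards almost as much as immediate ones; with `gamma` close to 0 it becomes short-sighted.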
Atari Breakout in OpenAI Gym
import gym  # the maintained fork is imported as: import gymnasium as gym

env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
    action = env.action_space.sample()  # random action; replace with a policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()  # start a new episode
env.close()
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
State/Action/Reward in Atari Breakout
State:
● (210, 160, 3) - image
Action:
● 0 - NOOP
● 1 - FIRE
● 2 - RIGHT
● 3 - LEFT
Reward:
● Red - 7 points
● Orange - 7 points
● Yellow - 4 points
● Green - 4 points
● Aqua - 1 point
● Blue - 1 point
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
From One Game to All The Games in Atari
https://www.gymlibrary.dev/ https://gymnasium.farama.org/
A Journey to Artificial General Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
● Number of variables : 1
● Range of variable : [0, 499]
● 25 taxi positions x 5 passenger positions x 4 destination locations
Action:
● 0 : move south
● 1 : move north
● 2 : move east
● 3 : move west
● 4 : pickup passenger
● 5 : drop off passenger
Reward:
● +20 : delivering passenger
● -10 : illegal pickup/dropoff
● -1 : per step unless another reward is triggered
https://www.gymlibrary.dev/environments/toy_text/taxi/
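The single state variable packs taxi position, passenger location, and destination into one integer (25 x 5 x 4 = 500 values). A sketch of that mixed-radix packing, written to mirror the scheme the Taxi environment uses; the function names `encode_taxi_state` and `decode_taxi_state` are illustrative, not part of the library:

```python
def encode_taxi_state(taxi_row, taxi_col, passenger_loc, destination):
    """Pack (row, col, passenger, destination) into one integer in [0, 499]."""
    # 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500 states.
    state = taxi_row
    state = state * 5 + taxi_col
    state = state * 5 + passenger_loc
    state = state * 4 + destination
    return state


def decode_taxi_state(state):
    """Invert encode_taxi_state."""
    state, destination = divmod(state, 4)
    state, passenger_loc = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_loc, destination
```

Because every combination maps to a distinct integer, the state can be used directly as a row index into a Q table.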
OpenAI Gym Taxi-v3 : Q Table
(500 x 6)
https://www.gocoder.one/blog/rl-tutorial-with-openai-gym
Q-Learning (with epsilon-greedy policy)
1. initialize Q table
2. state
3. exploitation
4. exploration
5. action
6. next state
7. reward
8. update Q table
https://www.cs.toronto.edu/~rgrosse/courses/csc311_f21/
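The eight steps of tabular Q-learning can be sketched in a few lines. To keep the example self-contained it uses a hypothetical six-cell corridor instead of Taxi-v3; the environment, the hyperparameters (`alpha`, `gamma`, `epsilon`), and the function names are all assumptions made for illustration.

```python
import random


def step(state, action, n_states=6):
    """Toy corridor (hypothetical): action 0 moves left, action 1 moves right.

    Reaching the last cell gives +1 and ends the episode; other steps give 0.
    """
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(n_states=6, n_actions=2, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0] * n_actions for _ in range(n_states)]  # 1. initialize Q table
    for _ in range(episodes):
        state, done = 0, False                              # 2. state
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)           # 4. exploration
            else:
                best = max(q_table[state])                  # 3. exploitation
                ties = [a for a in range(n_actions) if q_table[state][a] == best]
                action = rng.choice(ties)
            # 5. action, 6. next state, 7. reward
            next_state, reward, done = step(state, action, n_states)
            # 8. update Q table toward reward + gamma * max_a' Q(next_state, a')
            bootstrap = 0.0 if done else gamma * max(q_table[next_state])
            q_table[state][action] += alpha * (reward + bootstrap - q_table[state][action])
            state = next_state
    return q_table
```

For Taxi-v3 the same loop would run over a 500 x 6 table, with `step` replaced by the environment's own step function.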
Limitation of Q Table
representation
scalability
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
2. initialize main network
3. initialize target network
4. epsilon greedy policy from main network
5. store transition in replay memory
6. get batch from replay memory
7. calculate error between two networks
8. synchronize two networks
Ref : Human-level control through deep reinforcement learning
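A sketch of the two pieces that distinguish this procedure from tabular Q-learning: the replay memory (steps 1, 5, 6) and the error between the main and target networks (step 7). The networks are represented here by plain callables that map a state to a list of action values, which is an assumption made to avoid any deep learning dependency; class and function names are illustrative.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def store(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)


def td_errors(batch, main_q, target_q, gamma=0.99):
    """Per-transition error between the main network and the bootstrapped target.

    main_q and target_q stand in for the two networks: each maps a state
    to a list of action values.
    """
    errors = []
    for state, action, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        target = reward + bootstrap
        errors.append(target - main_q(state)[action])
    return errors
```

In a full agent these errors would drive a gradient step on the main network, and the target network would be overwritten with the main network's weights every fixed number of steps (step 8).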
Deep Q Network (DQN) Benchmark
Ref : Human-level control through deep reinforcement learning
Four Tough Games in Atari
Pitfall, Solaris, Skiing, Montezuma’s Revenge
Problems : long-term credit assignment and exploitation/exploration tradeoff
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
Policy Gradient on Atari Pong
https://www.youtube.com/watch?v=tqrcjHuNdmQ
Reinforcement Learning Algorithms
Ref: OpenAI Spinning Up
Deep Reinforcement Learning and Artificial General Intelligence
1. Atari Games
2. AlphaGo Series
3. ChatGPT / GPT-4
A Journey to Artificial General Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
AlphaGo, AlphaGo Zero, AlphaZero, MuZero
AlphaGo, Nature, 2016
AlphaGo Zero, Nature, 2017
AlphaZero, Science, 2018
MuZero, Nature, 2020
AlphaGo Fan/Lee/Master
● European Go Champion Fan Hui : 5:0
● South Korean professional Go player Lee Sedol : 4:1
● Online games with players from China/Korea/Japan : 60:0
● Chinese professional Go player Ke Jie : 3:0
https://www.youtube.com/watch?v=lVMgxtm5L-U
https://www.youtube.com/watch?v=WXuK6gekU1Y
AlphaGo Inputs and Policy/Value Networks
/ckmarkohchang/alphago-in-depth
AlphaGo Monte Carlo Tree Search
/ckmarkohchang/alphago-in-depth
AlphaZero Training Process
Self-Play
Train Value Network
Train Policy Network
https://www.youtube.com/watch?v=lVMgxtm5L-U
AlphaZero Network
Ref: Acquisition of Chess Knowledge in AlphaZero
AlphaGo
● Two networks: policy network and value network
● Conv/ReLU-based layer structure
AlphaZero
● One network with two heads: policy and value
● ResNet-based layer structure
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
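How the three functions fit together can be sketched as an unroll of the learned model along a sequence of actions. In this sketch `h`, `g`, and `f` are arbitrary callables supplied by the caller (in MuZero they are neural networks), and the helper name `unroll` is an assumption for illustration.

```python
def unroll(observation, actions, h, g, f):
    """Unroll a MuZero-style learned model along a sequence of actions.

    h: observation -> hidden state                            (representation)
    g: (hidden state, action) -> (reward, next hidden state)  (dynamics)
    f: hidden state -> (policy, value)                        (prediction)
    """
    state = h(observation)           # encode the real observation once
    policy, value = f(state)
    steps = [(None, policy, value)]  # no reward is predicted at the root
    for action in actions:
        reward, state = g(state, action)  # imagined transition, no simulator call
        policy, value = f(state)
        steps.append((reward, policy, value))
    return steps
```

The point of the design is that planning happens entirely in the hidden-state space: after the first call to `h`, the real environment and its rules are never consulted.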
MuZero Performance Benchmark
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
AlphaGo to AlphaStar by David Silver
Deep Reinforcement Learning from AlphaGo to AlphaStar - London Machine Learning Meetup
Deep Reinforcement Learning and Artificial General Intelligence
1. Atari Games
2. AlphaGo Series
3. ChatGPT / GPT-4
Evolution of Large Language Models
Ref: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
Language Model and Text Generation
● Sampling Strategy ~ Greedy / Top-K / Top-P (Temperature)
● Next Word Prediction ~ Sequential Decision Making
Ref: “Language Modeling” from “NLP Course | For You”
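The sampling strategies named above differ only in how they narrow the next-token distribution before drawing from it. A minimal sketch over a plain list of probabilities indexed by token id; the function names are illustrative and no particular library's API is implied.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Pick the single most likely token index."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(filtered, rng=random):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(filtered)
    return rng.choices(indices, weights=[filtered[i] for i in indices], k=1)[0]
```

Each call to `sample` is one decision in a sequence: the chosen token is appended to the context and the model is queried again, which is what makes text generation a sequential decision process.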
ChatGPT Training Pipeline
Ref: “Introducing ChatGPT” from OpenAI
● Supervised Learning
● Reward Model
● Reinforcement Learning
● Supervised Fine-Tuning (SFT)
● Reinforcement Learning from Human Feedback (RLHF)
GPT Assistant Training Pipeline
Andrej Karpathy - State of GPT / Microsoft Developer / 05.25.2023 @ Youtube
Reinforcement Learning from Human Feedback (General Process)
Step 1. Rollout
Step 2. Evaluation
Step 3. Optimization
Ref: Transformer Reinforcement Learning @ https://github.com/lvwerra/trl
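The three steps can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions, not the TRL implementation: the "policy" is a categorical distribution over three canned responses rather than a language model, `reward_model` is a hypothetical keyword check standing in for a learned reward model, and a plain REINFORCE update stands in for PPO.

```python
import math
import random

RESPONSES = ["great movie", "it was fine", "terrible movie"]  # hypothetical candidates


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(response):
    """Stand-in for a learned reward model or sentiment classifier."""
    return 1.0 if "great" in response else 0.0


def rlhf_loop(steps=2000, lr=0.1, kl_coef=0.1, seed=0):
    rng = random.Random(seed)
    ref_logits = [0.0, 0.0, 0.0]  # frozen reference policy
    logits = list(ref_logits)     # policy being tuned
    ref_probs = softmax(ref_logits)
    for _ in range(steps):
        probs = softmax(logits)
        # Step 1. Rollout: sample a response from the current policy.
        idx = rng.choices(range(len(RESPONSES)), weights=probs, k=1)[0]
        # Step 2. Evaluation: score the sampled response.
        reward = reward_model(RESPONSES[idx])
        # Step 3. Optimization: penalize drift from the reference policy,
        # then take a REINFORCE step (a simple stand-in for PPO).
        kl_term = math.log(probs[idx]) - math.log(ref_probs[idx])
        score = reward - kl_coef * kl_term
        for j in range(len(logits)):
            # d log pi(idx) / d logit_j = 1[j == idx] - probs[j]
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * score * grad
    return softmax(logits)
```

The KL-style penalty is what keeps the tuned policy from drifting arbitrarily far from the reference model while it chases reward.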
Reinforcement Learning from Human Feedback (Sentiment)
Ref: Transformer Reinforcement Learning @ https://github.com/lvwerra/trl
Tune GPT-2 to Generate Controlled Sentiment Reviews
[Figure: prompt → response → reward loop. Labels: Movie Review Dataset, BERT Classifier, train, control]
Reinforcement Learning from Human Feedback (Detoxification)
Ref: Using Transformer Reinforcement Learning to Detoxify Generative Language Models
Detoxifying Large Language Model
[Figure: prompt → response → reward loop. Labels: RealToxicityPrompts Dataset, GPT-Neo, RoBERTa Classifier, train]
GPT-4 Content Policy and Safety Challenge
Ref: GPT-4 Technical Report / System Card
GPT-4 Training Pipeline for Safety
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Rule-Based Reward Models (RBRMs)
● a refusal in the desired style
● a refusal in the undesired style
● containing disallowed content
● a safe non-refusal response
Ref: GPT-4 Technical Report / System Card
ChatGPT Hallucinations
GPT-4 Hallucinations and Improvements
Enhance Reward Models to mitigate
● Open-Domain Hallucinations
● Closed-Domain Hallucinations
Ref: GPT-4 Technical Report / System Card
Reinforcement Learning Use Cases
1. Reinforcement Learning for Quality
2. Reinforcement Learning for Safety
3. Reinforcement Learning for Hallucination
4. Reinforcement Learning for Sentiment
5. Reinforcement Learning for Detoxification
Summary of Five Large Language Models
Ref: “What Makes a Dialog Agent Useful” from Hugging Face
[Table: five systems compared by System, Pre-Trained Base Model, Supervised Fine-Tuning, Reinforcement Learning from Human Feedback, Hand-Written Rules for Safety]
Deep Reinforcement Learning and Artificial General Intelligence
1. Atari Games
2. AlphaGo Series
3. ChatGPT / GPT-4
Q&A
