The document discusses deep reinforcement learning and artificial general intelligence through examples of Atari games, AlphaGo series, and ChatGPT/GPT-4. It covers how deep reinforcement learning was applied to achieve human-level performance in Atari games using DQN and policy gradient methods. It also summarizes the development of AlphaGo, AlphaGo Zero, AlphaZero and MuZero using deep reinforcement learning techniques like self-play and Monte Carlo tree search. Finally, it discusses how ChatGPT and GPT-4 were trained using supervised learning, reinforcement learning from human feedback, and rule-based reward models to improve safety.
2. 2
Deep Reinforcement Learning and Artificial General Intelligence
Artificial General Intelligence (AGI) :
an agent can achieve or exceed human performance
in a wide range of environments
(Credit: Shane Legg and Marcus Hutter)
Reinforcement Learning : decision-making framework
Deep Learning : representation computation/optimization mechanism
Deep Reinforcement Learning : formulate problem/solution
(Credit: David Silver and Demis Hassabis)
6. 6
Atari Breakout in OpenAI Gym
import gym

env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
    action = env.action_space.sample()  # action by random or policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()
env.close()
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
7. 7
State/Action/Reward in Atari Breakout
State:
● (210, 160, 3) - image
Action:
● 0 - NOOP
● 1 - FIRE
● 2 - RIGHT
● 3 - LEFT
Reward:
● Red - 7 points
● Orange - 7 points
● Yellow - 4 points
● Green - 4 points
● Aqua - 1 point
● Blue - 1 point
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
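As a quick check of the spaces listed above, a minimal sketch (assuming gym with ale-py installed, as in the earlier Breakout example; get_action_meanings() is provided by the underlying ALE environment):

import gym

env = gym.make("ALE/Breakout-v5")
print(env.observation_space.shape)          # (210, 160, 3) RGB image
print(env.action_space.n)                   # 4 discrete actions
print(env.unwrapped.get_action_meanings())  # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
env.close()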
8. 8
From One Game to All The Games in Atari
https://www.gymlibrary.dev/ https://gymnasium.farama.org/
9. 9
A Journey to Artificial General Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
10. 10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
● Number of variables: 1
● Range: 0 to 499 (500 discrete states)
● 25 taxi positions x 5 passenger positions x 4 destination locations
Action:
● 0 : move south
● 1 : move north
● 2 : move east
● 3 : move west
● 4 : pickup passenger
● 5 : drop off passenger
Reward:
● +20 : delivering the passenger
● -10 : illegal pickup/dropoff
● -1 : per step unless another reward is triggered
https://www.gymlibrary.dev/environments/toy_text/taxi/
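Taxi-v3 is small enough to solve with a tabular method, which makes the state/action/reward spec above concrete. A minimal Q-learning sketch (hyperparameters are illustrative, not from the slides):

import numpy as np
import gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # 500 states x 6 actions
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, info = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:               # explore
            action = env.action_space.sample()
        else:                                        # exploit
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, info = env.step(action)
        # one-step Q-learning update toward r + gamma * max_a' Q(s', a')
        target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        done = terminated or truncated
env.close()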
14. 14
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
15. 15
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
16. 16
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
2. initialize main network
3. initialize target network
4. act with an epsilon-greedy policy from the main network
5. store transition in replay memory
6. sample a batch from replay memory
7. calculate the error between the two networks
8. periodically synchronize the two networks
Ref : Human-level control through deep reinforcement learning
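The eight steps above map almost directly onto code. A condensed PyTorch sketch of the loop (CartPole stands in for an Atari environment; network sizes, the synchronization interval and other hyperparameters are illustrative, not those of the paper):

import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import gym

env = gym.make("CartPole-v1")            # small stand-in for an Atari env
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

replay = deque(maxlen=10_000)            # 1. initialize replay memory
main_net = make_net()                    # 2. initialize main network
target_net = make_net()                  # 3. initialize target network
target_net.load_state_dict(main_net.state_dict())
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
gamma, epsilon, batch_size = 0.99, 0.1, 64

state, info = env.reset()
for step in range(10_000):
    # 4. epsilon-greedy policy from the main network
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(main_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
    next_state, reward, terminated, truncated, info = env.step(action)
    replay.append((state, action, reward, next_state, terminated))   # 5. store transition
    state = next_state if not (terminated or truncated) else env.reset()[0]

    if len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)                    # 6. sample a batch
        s, a, r, s2, done = map(np.array, zip(*batch))
        s = torch.as_tensor(s, dtype=torch.float32)
        s2 = torch.as_tensor(s2, dtype=torch.float32)
        a = torch.as_tensor(a, dtype=torch.int64)
        r = torch.as_tensor(r, dtype=torch.float32)
        done = torch.as_tensor(done, dtype=torch.float32)
        q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * (1 - done) * target_net(s2).max(1).values
        loss = nn.functional.mse_loss(q, target)                     # 7. error between the two networks
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 500 == 0:
        target_net.load_state_dict(main_net.state_dict())            # 8. synchronize the two networks
env.close()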
17. 17
Deep Q Network (DQN) Benchmark
Ref : Human-level control through deep reinforcement learning
18. 18
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and the exploration/exploitation trade-off
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
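One simple flavor of intrinsic motivation is a count-based novelty bonus added to the game score. The toy sketch below is illustrative only (NGU and Agent57 actually use episodic memory and learned state embeddings rather than raw counts):

from collections import defaultdict
import math

visit_counts = defaultdict(int)

def intrinsic_bonus(state_key, beta=0.1):
    """Return beta / sqrt(N(s)): a bonus that decays as a state is revisited."""
    visit_counts[state_key] += 1
    return beta / math.sqrt(visit_counts[state_key])

# total reward used for learning = extrinsic (game score) + intrinsic (novelty)
# total_reward = reward + intrinsic_bonus(hash(obs.tobytes()))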
22. 22
A Journey to Artificial General Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
24. 24
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=lVMgxtm5L-U
https://www.youtube.com/watch?v=WXuK6gekU1Y
28. 28
AlphaZero Network
Ref: Acquisition of Chess Knowledge in AlphaZero
AlphaGo
● Two networks: policy network and value network
● Conv/ReLU-based layer structure
AlphaZero
● One network with two heads: policy and value
● ResNet-based layer structure
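The "one network with two heads" design is easy to see in code. A minimal PyTorch sketch, where a single conv block stands in for the ResNet tower and the input-plane/channel counts are illustrative rather than AlphaZero's exact configuration:

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, board_size=19, channels=64):
        super().__init__()
        self.tower = nn.Sequential(            # shared trunk (ResNet tower in AlphaZero)
            nn.Conv2d(17, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(      # move logits (all points + pass)
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size + 1),
        )
        self.value_head = nn.Sequential(       # expected outcome in [-1, 1]
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, board):                  # board: (batch, 17, 19, 19) feature planes
        features = self.tower(board)
        return self.policy_head(features), self.value_head(features)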
29. 29
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
30. 30
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
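The three functions h, f and g can be sketched as one model with two inference paths, following the interfaces described in the paper; here plain linear layers stand in for the real networks, and the MCTS planner that calls them repeatedly is omitted:

import torch
import torch.nn as nn

class MuZeroModel(nn.Module):
    def __init__(self, obs_dim=8, latent_dim=32, n_actions=4):
        super().__init__()
        self.h = nn.Linear(obs_dim, latent_dim)                     # representation: obs -> latent state
        self.f = nn.Linear(latent_dim, n_actions + 1)               # prediction: latent -> (policy, value)
        self.g = nn.Linear(latent_dim + n_actions, latent_dim + 1)  # dynamics: (latent, action) -> (next latent, reward)
        self.n_actions = n_actions

    def initial_inference(self, obs):
        s = self.h(obs)
        out = self.f(s)
        return s, out[..., :-1], out[..., -1]                       # latent, policy logits, value

    def recurrent_inference(self, s, action):                       # action: LongTensor of action indices
        a = nn.functional.one_hot(action, self.n_actions).float()
        out = self.g(torch.cat([s, a], dim=-1))
        s_next, reward = out[..., :-1], out[..., -1]
        policy_value = self.f(s_next)
        return s_next, reward, policy_value[..., :-1], policy_value[..., -1]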
34. 34
Evolution of Large Language Models
Ref: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
35. 35
Language Model and Text Generation
● Sampling Strategy ~ Greedy / Top-K / Top-P (Temperature)
● Next Word Prediction ~ Sequential Decision Making
Ref: “Language Modeling” from “NLP Course | For You”
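The sampling strategies differ only in how they truncate the next-token distribution before drawing a token. A minimal NumPy sketch (not any particular library's implementation; the k and p values are just defaults for illustration):

import numpy as np

def sample_next_token(logits, strategy="greedy", temperature=1.0, k=50, p=0.9):
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature rescales the logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if strategy == "greedy":                      # always pick the most likely token
        return int(np.argmax(probs))
    if strategy == "top_k":                       # keep only the k most likely tokens
        keep = np.argsort(probs)[-k:]
    elif strategy == "top_p":                     # keep the smallest set whose mass reaches p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, p)) + 1]
    else:
        raise ValueError(strategy)
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()
    return int(np.random.choice(len(probs), p=masked))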
36. 36
ChatGPT Training Pipeline
Ref: “Introducing ChatGPT” from OpenAI
● Supervised Learning ~ Supervised Fine-Tuning (SFT)
● Reward Model
● Reinforcement Learning ~ Reinforcement Learning from Human Feedback (RLHF)
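The Reward Model step in the middle of the pipeline is typically trained on human preference comparisons, as in InstructGPT-style RLHF. A minimal sketch of the standard pairwise ranking loss (not OpenAI's exact code; the model that produces the scalar scores is omitted):

import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# reward_chosen / reward_rejected are the scalar scores the reward model assigns
# to the human-preferred and the rejected response for the same prompt.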
37. 37
GPT Assistant Training Pipeline
Andrej Karpathy - State of GPT / Microsoft Developer / 05.25.2023 @ Youtube
39. 39
Reinforcement Learning from Human Feedback
(Sentiment)
Ref: Transformer Reinforcement Learning @ https://github.com/lvwerra/trl
Figure: tune GPT-2 to generate controlled-sentiment reviews. Each (prompt, response) pair is scored by a BERT classifier trained on the Movie Review dataset; the sentiment score is the reward used to train GPT-2, with a control signal steering the target sentiment.
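A condensed sketch of the setup in the figure, following the older lvwerra/trl API referenced on the slide (PPOTrainer.step; newer trl releases have changed this interface). A generic sentiment pipeline stands in for the BERT classifier, and only a single PPO step on one query is shown:

import torch
from transformers import AutoTokenizer, pipeline
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")   # policy (LM with value head)
model_ref = create_reference_model(model)                           # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
sentiment = pipeline("sentiment-analysis")                          # stands in for the BERT classifier

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, model_ref, tokenizer)

query_tensor = tokenizer.encode("This movie was", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)             # generate a continuation
response_txt = tokenizer.decode(response_tensor[0])

# reward = sentiment score of the generated review (higher = more positive)
result = sentiment(response_txt)[0]
reward = result["score"] if result["label"] == "POSITIVE" else -result["score"]

# one PPO update: queries, responses and rewards are lists of per-sample tensors
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], [torch.tensor(reward)])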
40. 40
Reinforcement Learning from Human Feedback
(Detoxification)
Ref: Using Transformer Reinforcement Learning to Detoxify Generative Language Models
Figure: detoxifying a large language model. Prompts come from the RealToxicityPrompts dataset, GPT-Neo generates the responses, and a RoBERTa toxicity classifier scores each (prompt, response) pair to provide the reward for training.
42. 42
GPT-4 Training Pipeline for Safety
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Rule-Based Reward Models (RBRMs)
● a refusal in the desired style
● a refusal in the undesired style
● containing disallowed content
● a safe non-refusal response
Ref: GPT-4 Technical Report / System Card
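The RBRM idea can be sketched as a classifier over the four categories above plus a hand-written rule that maps each category to a reward. The categories come from the GPT-4 System Card, but the reward values and the classify function below are hypothetical:

RBRM_REWARDS = {
    "refusal_desired_style": 1.0,     # refused a harmful request, in the desired style
    "refusal_undesired_style": -0.5,  # refused, but in the undesired style
    "disallowed_content": -1.0,       # response contains disallowed content
    "safe_non_refusal": 1.0,          # safe request answered helpfully, no refusal
}

def rule_based_reward(prompt: str, response: str, classify) -> float:
    """`classify` is assumed to return one of the four category labels above."""
    category = classify(prompt, response)
    return RBRM_REWARDS[category]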
44. 44
GPT-4 Hallucinations and Improvements
Enhance Reward Models to mitigate
● Open-Domain Hallucinations
● Closed-Domain Hallucinations
Ref: GPT-4 Technical Report / System Card
45. 45
Reinforcement Learning Use Cases
1. Reinforcement Learning for Quality
2. Reinforcement Learning for Safety
3. Reinforcement Learning for Hallucination Mitigation
4. Reinforcement Learning for Sentiment
5. Reinforcement Learning for Detoxification
46. 46
Summary of Five Large Language Models
Ref: “What Makes a Dialog Agent Useful” from Hugging Face
Comparison dimensions: System / Pre-Trained Base Model / Supervised Fine-Tuning / Reinforcement Learning from Human Feedback / Hand-Written Rules for Safety