The document discusses deep reinforcement learning and artificial general intelligence through examples of Atari games, AlphaGo series, and ChatGPT/GPT-4. It covers how deep reinforcement learning was applied to achieve human-level performance in Atari games using DQN and policy gradient methods. It also summarizes the development of AlphaGo, AlphaGo Zero, AlphaZero and MuZero using deep reinforcement learning techniques like self-play and Monte Carlo tree search. Finally, it discusses how ChatGPT and GPT-4 were trained using supervised learning, reinforcement learning from human feedback, and rule-based reward models to improve safety.
2. 2
Deep Reinforcement Learning and Artificial General Intelligence
Artificial General Intelligence (AGI) :
an agent can achieve or exceed human performance
in a wide range of environments
(Credit: Shane Legg and Marcus Hutter)
Reinforcement Learning : decision-making framework
Deep Learning : representation computation/optimization mechanism
Deep Reinforcement Learning : formulate problem/solution
(Credit: David Silver and Demis Hassabis)
6. 6
Atari Breakout in OpenAI Gym
import gym

env = gym.make("ALE/Breakout-v5", render_mode="human")
state, info = env.reset()
for index in range(1000):
    action = env.action_space.sample()  # action by random or policy
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        state, info = env.reset()
env.close()
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
7. 7
State/Action/Reward in Atari Breakout
State:
● (210, 160, 3) - image
Action:
● 0 - NOOP
● 1 - FIRE
● 2 - RIGHT
● 3 - LEFT
Reward:
● Red - 7 points
● Orange - 7 points
● Yellow - 4 points
● Green - 4 points
● Aqua - 1 point
● Blue - 1 point
https://www.gymlibrary.dev/ & https://gymnasium.farama.org/
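As a quick check of the spaces listed above, a minimal sketch (assuming gym with ale-py installed, as in the earlier Breakout example; get_action_meanings() is provided by the underlying ALE environment):

import gym

env = gym.make("ALE/Breakout-v5")
print(env.observation_space.shape)          # (210, 160, 3) RGB image
print(env.action_space.n)                   # 4 discrete actions
print(env.unwrapped.get_action_meanings())  # ['NOOP', 'FIRE', 'RIGHT', 'LEFT']
env.close()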
8. 8
From One Game to All The Games in Atari
https://www.gymlibrary.dev/ https://gymnasium.farama.org/
9. 9
A Journey to Artificial General Intelligence
https://www.assemblyai.com/blog/reinforcement-learning-with-deep-q-learning-explained/
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
DQN/2015
R2D2/2019
NGU/2019
Agent57/2020
10. 10
OpenAI Gym Taxi-v3 : State/Action/Reward
State:
● Number of variables: 1
● Range: 0 to 499 (500 discrete states)
● 25 taxi positions x 5 passenger positions x 4 destination locations
Action:
● 0 : move south
● 1 : move north
● 2 : move east
● 3 : move west
● 4 : pickup passenger
● 5 : drop off passenger
Reward:
● +20 : delivering the passenger
● -10 : illegal pickup/dropoff
● -1 : per step unless another reward is triggered
https://www.gymlibrary.dev/environments/toy_text/taxi/
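Taxi-v3 is small enough to solve with a tabular method, which makes the state/action/reward spec above concrete. A minimal Q-learning sketch (hyperparameters are illustrative, not from the slides):

import numpy as np
import gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))  # 500 states x 6 actions
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, info = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:               # explore
            action = env.action_space.sample()
        else:                                        # exploit
            action = int(np.argmax(q_table[state]))
        next_state, reward, terminated, truncated, info = env.step(action)
        # one-step Q-learning update toward r + gamma * max_a' Q(s', a')
        target = reward + gamma * np.max(q_table[next_state]) * (not terminated)
        q_table[state, action] += alpha * (target - q_table[state, action])
        state = next_state
        done = terminated or truncated
env.close()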
14. 14
Deep Q Network (DQN) Architecture (1/2)
Ref : Human-level control through deep reinforcement learning
15. 15
Deep Q Network (DQN) Architecture (2/2)
Ref : Massively Parallel Methods for Deep Reinforcement Learning
16. 16
Deep Q Learning (with experience replay and dual networks)
1. initialize replay memory
2. initialize main network
3. initialize target network
4. act with an epsilon-greedy policy from the main network
5. store transition in replay memory
6. sample a batch from replay memory
7. calculate the error between the two networks
8. periodically synchronize the two networks
Ref : Human-level control through deep reinforcement learning
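The eight steps above map almost directly onto code. A condensed PyTorch sketch of the loop (CartPole stands in for an Atari environment; network sizes, the synchronization interval and other hyperparameters are illustrative, not those of the paper):

import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn
import gym

env = gym.make("CartPole-v1")            # small stand-in for an Atari env
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

replay = deque(maxlen=10_000)            # 1. initialize replay memory
main_net = make_net()                    # 2. initialize main network
target_net = make_net()                  # 3. initialize target network
target_net.load_state_dict(main_net.state_dict())
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
gamma, epsilon, batch_size = 0.99, 0.1, 64

state, info = env.reset()
for step in range(10_000):
    # 4. epsilon-greedy policy from the main network
    if random.random() < epsilon:
        action = env.action_space.sample()
    else:
        with torch.no_grad():
            action = int(main_net(torch.as_tensor(state, dtype=torch.float32)).argmax())
    next_state, reward, terminated, truncated, info = env.step(action)
    replay.append((state, action, reward, next_state, terminated))   # 5. store transition
    state = next_state if not (terminated or truncated) else env.reset()[0]

    if len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)                    # 6. sample a batch
        s, a, r, s2, done = map(np.array, zip(*batch))
        s = torch.as_tensor(s, dtype=torch.float32)
        s2 = torch.as_tensor(s2, dtype=torch.float32)
        a = torch.as_tensor(a, dtype=torch.int64)
        r = torch.as_tensor(r, dtype=torch.float32)
        done = torch.as_tensor(done, dtype=torch.float32)
        q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + gamma * (1 - done) * target_net(s2).max(1).values
        loss = nn.functional.mse_loss(q, target)                     # 7. error between the two networks
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if step % 500 == 0:
        target_net.load_state_dict(main_net.state_dict())            # 8. synchronize the two networks
env.close()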
17. 17
Deep Q Network (DQN) Benchmark
Ref : Human-level control through deep reinforcement learning
18. 18
Four Tough Games in Atari
Pitfall Solaris Skiing Montezuma’s Revenge
Problems : long-term credit assignment and the exploration/exploitation trade-off
Solutions : intrinsic motivation, meta-controller, short-term/episodic memory, distributed agents, etc.
https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark
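One simple flavor of intrinsic motivation is a count-based novelty bonus added to the game score. The toy sketch below is illustrative only (NGU and Agent57 actually use episodic memory and learned state embeddings rather than raw counts):

from collections import defaultdict
import math

visit_counts = defaultdict(int)

def intrinsic_bonus(state_key, beta=0.1):
    """Return beta / sqrt(N(s)): a bonus that decays as a state is revisited."""
    visit_counts[state_key] += 1
    return beta / math.sqrt(visit_counts[state_key])

# total reward used for learning = extrinsic (game score) + intrinsic (novelty)
# total_reward = reward + intrinsic_bonus(hash(obs.tobytes()))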
22. 22
A Journey to Artificial General Intelligence
https://www.deepmind.com/blog/muzero-mastering-go-chess-shogi-and-atari-without-rules
https://www.youtube.com/watch?v=lVMgxtm5L-U
24. 24
AlphaGo Fan/Lee/Master
●
European Go Champion Fan Hui — 5:0
●
South Korean professional Go player Lee Sedol — 4:1
●
Online games with players from China/Korea/Japan — 60:0
●
Chinese professional Go player Ke Jie — 3:0
https://www.youtube.com/watch?v=lVMgxtm5L-U
https://www.youtube.com/watch?v=WXuK6gekU1Y
28. 28
AlphaZero Network
Ref: Acquisition of Chess Knowledge in AlphaZero
AlphaGo
● Two networks: policy network and value network
● Conv/ReLU-based layer structure
AlphaZero
● One network with two heads: policy and value
● ResNet-based layer structure
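The "one network with two heads" design is easy to see in code. A minimal PyTorch sketch, where a single conv block stands in for the ResNet tower and the input-plane/channel counts are illustrative rather than AlphaZero's exact configuration:

import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    def __init__(self, board_size=19, channels=64):
        super().__init__()
        self.tower = nn.Sequential(            # shared trunk (ResNet tower in AlphaZero)
            nn.Conv2d(17, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        self.policy_head = nn.Sequential(      # move logits (all points + pass)
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size + 1),
        )
        self.value_head = nn.Sequential(       # expected outcome in [-1, 1]
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, board):                  # board: (batch, 17, 19, 19) feature planes
        features = self.tower(board)
        return self.policy_head(features), self.value_head(features)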
29. 29
AlphaGo Zero Performance Benchmark
https://thirdeyedata.ai/how-to-build-your-own-alphazero-ai-using-python-and-keras/
30. 30
MuZero Training Process
h: representation
f: prediction
g: dynamics
Ref: Mastering Atari, Go, chess and shogi by planning with a learned model
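The three functions h, f and g can be sketched as one model with two inference paths, following the interfaces described in the paper; here plain linear layers stand in for the real networks, and the MCTS planner that calls them repeatedly is omitted:

import torch
import torch.nn as nn

class MuZeroModel(nn.Module):
    def __init__(self, obs_dim=8, latent_dim=32, n_actions=4):
        super().__init__()
        self.h = nn.Linear(obs_dim, latent_dim)                     # representation: obs -> latent state
        self.f = nn.Linear(latent_dim, n_actions + 1)               # prediction: latent -> (policy, value)
        self.g = nn.Linear(latent_dim + n_actions, latent_dim + 1)  # dynamics: (latent, action) -> (next latent, reward)
        self.n_actions = n_actions

    def initial_inference(self, obs):
        s = self.h(obs)
        out = self.f(s)
        return s, out[..., :-1], out[..., -1]                       # latent, policy logits, value

    def recurrent_inference(self, s, action):                       # action: LongTensor of action indices
        a = nn.functional.one_hot(action, self.n_actions).float()
        out = self.g(torch.cat([s, a], dim=-1))
        s_next, reward = out[..., :-1], out[..., -1]
        policy_value = self.f(s_next)
        return s_next, reward, policy_value[..., :-1], policy_value[..., -1]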
34. 34
Evolution of Large Language Models
Ref: Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
35. 35
Language Model and Text Generation
● Sampling Strategy ~ Greedy / Top-K / Top-P (Temperature)
● Next Word Prediction ~ Sequential Decision Making
Ref: “Language Modeling” from “NLP Course | For You”
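The sampling strategies differ only in how they truncate the next-token distribution before drawing a token. A minimal NumPy sketch (not any particular library's implementation; the k and p values are just defaults for illustration):

import numpy as np

def sample_next_token(logits, strategy="greedy", temperature=1.0, k=50, p=0.9):
    logits = np.asarray(logits, dtype=np.float64) / temperature   # temperature rescales the logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if strategy == "greedy":                      # always pick the most likely token
        return int(np.argmax(probs))
    if strategy == "top_k":                       # keep only the k most likely tokens
        keep = np.argsort(probs)[-k:]
    elif strategy == "top_p":                     # keep the smallest set whose mass reaches p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, p)) + 1]
    else:
        raise ValueError(strategy)
    masked = np.zeros_like(probs)
    masked[keep] = probs[keep]
    masked /= masked.sum()
    return int(np.random.choice(len(probs), p=masked))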
36. 36
ChatGPT Training Pipeline
Ref: “Introducing ChatGPT” from OpenAI
● Supervised Learning ~ Supervised Fine-Tuning (SFT)
● Reward Model
● Reinforcement Learning ~ Reinforcement Learning from Human Feedback (RLHF)
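The Reward Model step in the middle of the pipeline is typically trained on human preference comparisons, as in InstructGPT-style RLHF. A minimal sketch of the standard pairwise ranking loss (not OpenAI's exact code; the model that produces the scalar scores is omitted):

import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# reward_chosen / reward_rejected are the scalar scores the reward model assigns
# to the human-preferred and the rejected response for the same prompt.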
37. 37
GPT Assistant Training Pipeline
Andrej Karpathy - State of GPT / Microsoft Developer / 05.25.2023 @ Youtube
39. 39
Reinforcement Learning from Human Feedback
(Sentiment)
Ref: Transformer Reinforcement Learning @ https://github.com/lvwerra/trl
Figure: tune GPT-2 to generate controlled-sentiment reviews. Each (prompt, response) pair is scored by a BERT classifier trained on the Movie Review dataset; the sentiment score is the reward used to train GPT-2, with a control signal steering the target sentiment.
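A condensed sketch of the setup in the figure, following the older lvwerra/trl API referenced on the slide (PPOTrainer.step; newer trl releases have changed this interface). A generic sentiment pipeline stands in for the BERT classifier, and only a single PPO step on one query is shown:

import torch
from transformers import AutoTokenizer, pipeline
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch

model = AutoModelForCausalLMWithValueHead.from_pretrained("gpt2")   # policy (LM with value head)
model_ref = create_reference_model(model)                           # frozen reference for the KL penalty
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
sentiment = pipeline("sentiment-analysis")                          # stands in for the BERT classifier

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, model_ref, tokenizer)

query_tensor = tokenizer.encode("This movie was", return_tensors="pt")
response_tensor = respond_to_batch(model, query_tensor)             # generate a continuation
response_txt = tokenizer.decode(response_tensor[0])

# reward = sentiment score of the generated review (higher = more positive)
result = sentiment(response_txt)[0]
reward = result["score"] if result["label"] == "POSITIVE" else -result["score"]

# one PPO update: queries, responses and rewards are lists of per-sample tensors
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], [torch.tensor(reward)])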
40. 40
Reinforcement Learning from Human Feedback
(Detoxification)
Ref: Using Transformer Reinforcement Learning to Detoxify Generative Language Models
Figure: detoxifying a large language model. Prompts come from the RealToxicityPrompts dataset, GPT-Neo generates the responses, and a RoBERTa toxicity classifier scores each (prompt, response) pair to provide the reward for training.
42. 42
GPT-4 Training Pipeline for Safety
Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)
Rule-Based Reward Models (RBRMs)
● a refusal in the desired style
● a refusal in the undesired style
● containing disallowed content
● a safe non-refusal response
Ref: GPT-4 Technical Report / System Card
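The RBRM idea can be sketched as a classifier over the four categories above plus a hand-written rule that maps each category to a reward. The categories come from the GPT-4 System Card, but the reward values and the classify function below are hypothetical:

RBRM_REWARDS = {
    "refusal_desired_style": 1.0,     # refused a harmful request, in the desired style
    "refusal_undesired_style": -0.5,  # refused, but in the undesired style
    "disallowed_content": -1.0,       # response contains disallowed content
    "safe_non_refusal": 1.0,          # safe request answered helpfully, no refusal
}

def rule_based_reward(prompt: str, response: str, classify) -> float:
    """`classify` is assumed to return one of the four category labels above."""
    category = classify(prompt, response)
    return RBRM_REWARDS[category]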
44. 44
GPT-4 Hallucinations and Improvements
Enhance Reward Models to mitigate
● Open-Domain Hallucinations
● Closed-Domain Hallucinations
Ref: GPT-4 Technical Report / System Card
45. 45
Reinforcement Learning Use Cases
1. Reinforcement Learning for Quality
2. Reinforcement Learning for Safety
3. Reinforcement Learning for Hallucination Mitigation
4. Reinforcement Learning for Sentiment
5. Reinforcement Learning for Detoxification
46. 46
Summary of Five Large Language Models
Ref: “What Makes a Dialog Agent Useful” from Hugging Face
Comparison dimensions: System / Pre-Trained Base Model / Supervised Fine-Tuning / Reinforcement Learning from Human Feedback / Hand-Written Rules for Safety