RLHF Lessons Learned
?? SDS ???
??
? RLHF?
? RLHF ?? ? ?? ??
? Stage ? ?? ??
? ?? ??
? Lessons Learned
RLHF?
3
https://www.youtube.com/watch?v=vziygFrRlZ4
? ??? ??? NO NO
? ?? ??? ???? ? Imitation Learning & Learning from human preference
Behavior Cloning : ??? ?? ?? ??? ???
(????, ????/Demonstration ??? ??)
Learning from human preference (OpenAI, 2017)
: ?? ??? ?? ???? ??? ??
? ??? ???? ???? ?? ??(????)
4
Reward Function ??? ???
https://youtu.be/tlOIHko8ySg, https://openai.com/research/faulty-reward-functions
Reward
? ? ? ??? Reward Function
? ???? -, ??? ??? +
? ??? ??? ?? ??? ???? ????
5
???? ?? ??? Reward Function ??
Boston Dynamics
????? ????? ??? Reward? ?????
- ?? ?? ??? ? ?? ??????
- ?? ?? +100??
- ?? ?? ?? ???? 0???
- ??? 10Cm ?? 1??? 2??? 10???
6
???? ??? ???????? How??
??? ?? ???
Reward
Model
GPT
?? ????
+100
Reinforce
?? ???? ??? ??? ????
7
?? ?? == ??? ???
? ??? ???? ??? ? ?? ?? ????
? ??? ???? ?????
P( ??|???1, ???2  , ?0)
?? ??
?? ??
P( ??|???1, ???2  , ?0) + ??? ???
?? ????? ?? ??
? ?? ??? ??
ChatGPT? ?? ??
? ??? ???? ??
Reward Model
ChatGPT? RLHF ??
8
https://www.youtube.com/watch?v=vziygFrRlZ4
Stage 1: Supervised Fine-tuned Model ???
9
? Goal
? ??? ??? ???? ?? ? ? ?? ?? ?? ?? (Instruction Fine-tuning, Imitation Learning)
? ???
? ?? ? ?? ??? ?? ??(GPT 3.5, 175B)
? ??? ?? ? ? ? ? ?? ???? ?? ??-?? ????
(Demonstration Dataset)
? ?? ??
? ??? ?? ??? ??? ??? Supervised Fine-tuning
(??: ??, ??: ??, Loss function: Cross-entropy loss)
? ??
? ?????, ??? ??? ??? ?? ?
??? ?? ???
GPT
?? ????
Stage 2: Reward Model ???
10
? Goal
? ??? ??? ?? ??? ??? ??? ???? Reward Model(RM) ??
? ???
? Stage1?? ?? ? SFT ??(6B, head ??)
? ??? ?? ? ?? ?? ??? ????, ? ? ?? ??? ?? ???? ????? ?.
? ?? ??
? ??? ? ???? ??? ??? ???? ?? (?? ?? ?? >> ?? ?? ??)
? ??
? ?? ??? ?? ??, ??? Stage 1??? ?? ??? ?? ?.
### ?? : ??? ?? ??
Reward
Model
??1) ??? ??2) ????
100?
Reward
Model
-100?
Stage 3: ???? ????
11
? Goal
? RM? SFT? ???? ???? ???? ? ??? ? ChatGPT ??
? ???
? Stage1?? ?? ? SFT ??, Stage2?? ?? ? RM ??
? ?? Stage?? ???? ?? ??? ??(??)?
? ?? ??
? PPO Algorithm(???? ????? ? ??)
? ??
? ???? ??? ???? ??? ?? ?? Stage. Stage2, 3? ?? ?? ??
???? RLHF ?? ???
12
SFT ?? ???
Reward ??
???
???? ????
? ?? ?? ?? ??
? ??? ??-?? ?? ?? ? ?? ?? C ?? ?? ? ?? ? ????? ?? set ??
Stage 1 Stage 2 Stage 3
RLHF? ?? ?? ?? ???? ? ?? C ???? ??
13
SFT
RL
? Dataset ??? ??? ??
Reward
Model
Case1) ?? ?? Distribution
- ??: ??? ??? ????? ??
- ??: ?? ?? X
SFT
RL
Reward
Model
Case2) ?? ?? ??? Dist.
- ??: ?? ??? ??? ?? ??? ??
- ??: Data Mix ??? ???, ?? ???
SFT
RL
Reward
Model
Case 3) ?? ??
- ??: ??/???? ??
- ??: ????? 3?? ?????? ?
Stage ? ????? ???,
???? ??? ??
? ? ???? ??? ??
SFT
RL
Reward
Model
?? ????? ???? ???
14
? ?? ???? SFT ????
Name #N
??/??/?? ?? ????? 48163
Wizard 69615
OpenOrca 998520
koopen-platypus 24818
korquad-chat 32287
counsel_bot.jsonl 60367
evolve-instruct 36809
oig-smallchip2-dedu.jsonl 210282
Kullm-v2 141177
oig-instructions_en_ko.jsonl 49210
Super natural 49166
ShareGPT 34051
Everything 991
KoAlpaca 20366
KoCLima-vicuna 999
Naver ??? 319714
?? 200?? ??
?? ??
1) ?? ??? ?? ???.
2) ?? N?? ??? ???.
3) ?? ??? ?? ???? ?? ???
?? ????? ???? ???
15
? SFT ???? ? ?? ?? (polyglot 12.8b ??)
1) ?? ??? ?? ? ???? ??, ???? ??? ? ??? ??
(??? ?? ???? ?? ??)
2) ?? ??? ?? ???? ???? ??
: ??? ??? ??? ???? ??? ?? ??? ??, ?? ??? ?? ??
??) Ko-Lima (??? ?? ?? ???, ??? ???) + ??/?? ? ????? ??? ???(?? 1??? ?
?? ??)
? ?? ??
2M  90K  25K ???? ??? ? ?? ??
?? ????? ???? ???
16
? Stage1 SFT ?? : 25000
? Stage2 Reward Model ?? : 8500
? Stage2 Reward Model ?? : 1500
? Stage3 PPO ?? : 15000
??? ????? ??? ?? ?? ? ?????,
??? ????? ??? ????? ??? ??
? oig-instructions ??? ?? (from Anthropic)
: ??? ?? ?? ????(1024 token)?? ??
: opensource ?? + LLM? output?? ???? ??
Reward Model Data ??
17
? Reward Model Data? ??? ???
? ?? ? 1?? ???? ???
??? 2? ??? ?? ??
Where can I find information about the Paris Climate Agreement?
You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources.
<?? ???? ??>
Reward Model Data ??
18
? ChatGPT 3.5 API? ???? ?? ?? ??
You can find information about the Paris Climate Agreement from various sources, including official documents, international organizations, and reputable news outlets. Here are some suggestions on where to find information:
1. United Nations Framework Convention on Climate Change (UNFCCC) Website:
- The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary source for official documents, updates, and information.
- Website: UNFCCC
2. Paris Agreement Text:
- The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and provisions of the agreement.
- Paris Agreement Text
- 
ChatGPT? ??? ??
: ??? ????? ??
??? ?? + ??? ?? ?? ??
Good
Bad
Reward Model Data ??
19
??? ??? ???
??? ???
????
? ??? ??
???, ??? ?? ???? ??
20
? ?? ?? ?? : Polyglot-1.3B, 3.8B, 5.8B, 12.8B
? ?? ??? : oig-instruction -> 4? ??? ??
? GPUs : A100 (80GB) * 8 ?
? ?? ?? : MS/DeepSpeed-Chat
? ColossalAI ? TRL (Transformer Reinforcement Learning)? ?? ??
? ColossalAI ? ?? ?? ?? ?? ????, 12.8B actor + 3.8B critic ???? ?? ???, ?? ??? ??? ??
? TRL ? Actor? Critic? Shared Architecture? ????? ?? ?? ? ???? Vanilla RLHF? ??? ??? ?? ??
? ?? ????? ???? ?? ??? ??? ?? ?? ??? ???? ??? ? ?????.. ? (?? ??/??? ??? ???..)
Stage 1 ?? Stage 2 ?? Stage 2 ?? PPO
Stage 1: SFT ??
21
? ?? ??? ??
? ??? ??? ???? + Template ??
? ?? ??? ?? Context Length =1024
? Epoch : LLaMA2, Instruct GPT??? ??
2~3 Epoch
? LR 1~5e-5(AdamW, Cosine Scheduler)
? Zero-out loss on Prompt
? Polyglot-ko 1.3b~12.8B
? ??? Test ??? ????? ??
SFT
????! ??? ??? ??? ????:

### system: ???? ??? ?? ??? ??? ?????.
### ???: ??? ?? ???
### ??:
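The "Zero-out loss on Prompt" setting listed above can be illustrated with a minimal sketch. It assumes the per-token log-probabilities have already been computed by the model; the function name `masked_cross_entropy` and the toy numbers are hypothetical and are not taken from the talk.

```python
import math

def masked_cross_entropy(token_log_probs, is_response):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response: 1 for answer tokens, 0 for prompt tokens (their loss is zeroed out).
    """
    total = 0.0
    count = 0
    for log_p, keep in zip(token_log_probs, is_response):
        if keep:
            total += -log_p
            count += 1
    # Guard against a sequence with no response tokens.
    return total / count if count else 0.0

# Hypothetical example: 3 prompt tokens followed by 2 answer tokens.
log_probs = [math.log(0.1), math.log(0.2), math.log(0.3),
             math.log(0.5), math.log(0.25)]
mask = [0, 0, 0, 1, 1]
loss = masked_cross_entropy(log_probs, mask)
```

Only the two answer tokens contribute to `loss`, so the model is not trained to reproduce the prompt template itself.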
Stage 2: Reward Model ??
22
? GPTModel ?? Reward? ???? ?? Linear Layer ??
Pretrained GPT
Linear(hidden_dim, 1)
??: ??? ?? ?? ?? ??: ???
<Loss function ???>
EOS
pooling
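A rough sketch of the reward head described on this slide: the hidden state at the EOS position is pooled and passed through `Linear(hidden_dim, 1)` to obtain a scalar reward. The slide does not spell out the loss, so the pairwise ranking loss below (the one used in InstructGPT-style reward models) is an assumption, and all names and numbers are illustrative.

```python
import math

def reward_from_hidden(hidden_states, eos_index, weight, bias=0.0):
    """Pool the hidden state at the EOS position and apply Linear(hidden_dim, 1)."""
    pooled = hidden_states[eos_index]
    return sum(h * w for h, w in zip(pooled, weight)) + bias

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical 3-token sequence with hidden_dim = 2; EOS is the last token.
hidden = [[0.1, 0.2], [0.5, -0.5], [1.0, 2.0]]
reward = reward_from_hidden(hidden, eos_index=2, weight=[0.5, 0.25])
```

The loss is small when the preferred answer receives the higher reward and grows as the ordering is violated.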
Stage 2: Reward Model ??
23
? ChatGPT? ?? ?? ??? ??? ?????? ?? ? ??(1 epoch)
? ?? ?? ? ?? ??
Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-1.3b 6.861068249 -5.620769024 0.9958334
polyglot-ko-3.8b 11.7929697 -3.29787612 0.9979167
polyglot-ko-5.8b 9.338997841 -8.786784172 0.9958334
???
?, ????
Stage 2: Reward Model ??
24
? ??? Overfitting ??
You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources.
You can find information about the Paris Climate Agreement from various sources, including official documents, international organizations, and reputable news outlets. Here are some suggestions on where to find information:
1. United Nations Framework Convention on Climate Change (UNFCCC) Website:
- The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary source for official documents, updates, and information.
- Website: UNFCCC
2. Paris Agreement Text:
- The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and provisions of the agreement.
- Paris Agreement Text
- 
<?? ????? ??> <ChatGPT? ?? ??>
? 3??? ??? ?? ?? ??
? ?? ??? ??? ??? ?? ??? ??
??? ???? ?? ??
? ???? ??? ????/??? ?? ????? ?? ??
Stage 2: Reward Model ??
25
? ??? Overfitting ??
? ?? ??
1) ?? ???? ???? ?? ?? : ???? 17?? ???? ??
??: ????? ?? ??? ????? ??????
??? 1: ???, ????, ?? ?? ??? ????
? ?? ??? ?????. ? ??? ?? ??? ??
?? ??? ??? ?? ???? ????.
??? 2: ???? ??? ? ???! ?? ?? ?? ?
? ????????. ????? ???? ????
<??>
Stage 2: Reward Model ??
26
? ??? ???? ??, ??? ??? ??? ??? ?? ? ???? vs ????
? ???, ??? Generality? ?? ? ??? ??
Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-1.3b 0.6244 0.3636 0.5987
polyglot-ko-3.8b 1.50507 1.1352 0.6337
polyglot-ko-5.8b 0.99981 0.5761 0.6574
TF + ??? - - 0.6267
???? ?? ??
<???? ?? ??>
Stage 2: Reward Model ??
27
? Overfitting ?? ?? 2)
: ?? ?? ??? ??? ??? ?? ?? ??? ???
To ChatGPT : "??? ??? ??? ?? ???? ??? ???? ???. ??? ??? ?? ?? ??? ????? ???.
???? ? ?? ? ??? ?? ??? ????? ?? ??????. ?? ???? ??? ???."
??: ???? ?? ? ??? ? ?? ??? ??????
?? :
1. ????? ???? ??? ?????. ?? ?? ????? ??? ??? ??? ?? ?????.
2. ???? ?? ?? ?? ? ??? ??? ???? ?? ????. ???? ???? ??? ?? ? ? ???? ?? ? ????.
3. ???? ?? ????? ?? ?? ????. ???? ??? ???? ?? ?? ????? ??? ?? ?? ?? ?? ????.
4. ???? ??? ???? ?? ??? ?????. ??? ???? ??? ????? ????? ?? ? ???? ??? ? ???? ????
???? ?? ????.
5. ???? ? ?????? ??? ??? ?? ??? ?? ????. ??? ??? ?? ?? ?? ??? ???? ?? ?? ?? ? ????.
6. ???? ?? ???? ?? ?? ????. ???? ??? ??? ?? ?? ?? ??? ??? ?? ???? ?? ?? ??? ? ? ????.
7. ???? ???? ???? ? ?? ???? ?? ???? ????. ??? ???? ???? ??? ????? ?? ??? ? ???? ??
?? ??? ? ?? ??? ???? ???? ???.
Stage 2: Reward Model ??
28
Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-3.8b 6.8610 -5.6207 0.9958
polyglot-ko-3.8b + opensource + lying 1.8448 -1.5232 0.85125
? ?????, ?? + ?? ??? + ? ??? = 1:1:1 ??? ???? ??
?? ???? ??
???? ??? ?? ???? ?? ??
???? ?? ???
Stage 3: PPO ??
29
https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training
? PPO : Actor + Critic ??
? ??? 4?? ??? ???;
? ??? ??(??) ? Inference 5? ???;;;
Stage 3: PPO ??
30
?? ???
? ???? ?? ???
Reward Clipping? ?? ??
KL penalty? ??? ??
(Dynamic KL Penalty ???)
KL = 0.04, 0.025
KL = 0.01
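To make the reward clipping and KL penalty mentioned above concrete, here is a minimal sketch of one common way to combine them: each token is penalized by the log-probability gap between the actor and the frozen reference (SFT) model, and the clipped reward-model score is added at the final token. This layout is an assumption rather than the talk's confirmed setup. `kl_coef=0.04` mirrors one of the values on the slide, while `clip_value=5.0` is purely illustrative.

```python
def shaped_rewards(rm_score, actor_log_probs, ref_log_probs,
                   kl_coef=0.04, clip_value=5.0):
    """Per-token rewards: KL penalty everywhere, clipped RM score on the last token.

    Assumes at least one generated token.
    """
    # Per-token KL estimate: log pi_actor(token) - log pi_ref(token).
    rewards = [-kl_coef * (a - r)
               for a, r in zip(actor_log_probs, ref_log_probs)]
    # Clip the reward-model score so outliers cannot dominate the update.
    clipped = max(-clip_value, min(clip_value, rm_score))
    rewards[-1] += clipped
    return rewards

# Hypothetical 2-token answer whose raw reward-model score is 7.0.
example = shaped_rewards(7.0, [-1.0, -0.5], [-1.5, -0.5])
```

A larger `kl_coef` keeps the actor closer to the SFT model, and a smaller one lets the reward model pull it further away.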
Stage 3: PPO ??
31
? Reward Hacking ?? : ?? ???? ??? ??? Reward? ?? ?? ? ?? ??? ???
? ??!?
12.8B Actor +
5.8B RewardModel
??: ???? ??? ??? ? ? ????
?? :
1??) ?? ??? ?????.
2??) 
??: ?? ???? ???!
?? :
1??) ?? ??? ?????
2??) 
??: ? ??? ???
?? :
? ??? ??? ????.
1??) ??? ?????.
2??) 
?? ?? ??
??: ? ??? ???
?? :
__ __ __
?? ?? ?? ??
(?? ???? ???? ??)
Stage 3: PPO ??
32
? PPO? ?? ? ????? ???? ?? ????
? 1) ??? ?? ?? ?? ??. ??) 12.8B Actor + 3.8B Critic
? 2) Llama2? Rejection Sampling ??
Actor
??
??1
??2
??3
??4
??5
Reward
Model ??3
Best ?? ??
Supervised Learning(1 epoch)
Stage 3: PPO ??
33
? PPO ?? ??? 15000?? 5????, 4?? Rejection Sampling + 1?? PPO ??
3000? ??? ?? ??
X 4?
? 3?? ???? 4? Iteration
? ?? Iteration?? Actor? ?? Iteration?? ?? ? Actor
? 2?? Iteration??? diff(Best-Worst) > 2 ? ??? ?? (?? 800?) ? ???? ??? ???? ??, overfitting ??
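The rejection-sampling loop described above (the actor generates several answers, the reward model scores them, and the best one is kept for supervised learning) can be sketched as follows. The `generate` and `score` callables are hypothetical stand-ins for the actor and the reward model, and `min_margin` mirrors the `diff(Best-Worst) > 2` filter applied from the second iteration onward.

```python
def rejection_sample(prompts, generate, score, n=5, min_margin=None):
    """Keep the best-scoring of n candidates per prompt.

    If min_margin is set, keep a prompt only when best - worst > min_margin.
    """
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n)]
        scores = [score(prompt, c) for c in candidates]
        best_idx = max(range(n), key=lambda i: scores[i])
        margin = scores[best_idx] - min(scores)
        if min_margin is None or margin > min_margin:
            dataset.append((prompt, candidates[best_idx]))
    return dataset

# Toy stand-ins (hypothetical): the "actor" appends i exclamation marks,
# and the "reward model" scores an answer by its length.
def toy_generate(prompt, i):
    return prompt + "!" * i

def toy_score(prompt, answer):
    return float(len(answer))

pairs_all = rejection_sample(["hi", "hello"], toy_generate, toy_score, n=5)
pairs_filtered = rejection_sample(["hi"], toy_generate, toy_score,
                                  n=5, min_margin=2.0)
```

The resulting (prompt, best answer) pairs would then be used for one epoch of supervised fine-tuning, and the updated actor generates the candidates for the next iteration.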
Stage 3: PPO ?? ??
34
? Rejection Sampling ? PPO? ?? ?? ?? ??
No RS
RS1
RS2(>2)
RS2(all)
RS3(>2)
RS4(>2)
RS4(>2) + PPO(KL 0.04)
RS4(>2) + PPO(KL 0.01)
No RS + PPO
Reward hacking /
Local optimum
Win Rate
(vs oig ?? ???)
(Reward Model ??)
?? ??? ?
Stage 3: PPO ?? ??
35
? Human Evaluation ??
[?? ??]
1) ???:
- ???? ?? ?? ????? (?? ??)
- ??? ??? ??? ?? ?????? (???)
- ?? ?? ?? ?? ???
- ??? ?? ?? ?? ?? ???
2) ???:
- ??? ??/??? ?? ?? ???? ??? ?? ??? ????
- ??? ?????
3) ???:
- ???? ??? ?????
??? ???? Win Rate
(Win = PPO win)
???, ??? -> 1~2% ??? ??
PPO vs SFT
??? Lessons Learned?
36
? ??? ????? .. ???? ?? ????. ?? ??? Quality? ? ?? ? SFT >> Reward Modeling > PPO
? Quality is all you need (Llama2, LIMA)
? SFT? ?? PPO? ????? ??? ????? ???. ??, ?? ??? ??? ?? ??? ???(ex. ??, ?? ? NLP Task ??)
? InstructGPT ?? NLP Task? PPO? ?? ?? ????, ???? ???? ??? ? ?????? ? Failed
? ???? Reward? ???? ? ?? ? ?? ??? ??? ?? ??
? Naïve ? Reward Modeling?? PPO? ??? ? ???? ?? ??? ?? ? ???? ? Nope!
? Rejection Sampling?? Iterative Learning? ?? ?? ???? ?? ?? ??? ??? ?? ? ??? ??? ?? ??
Reward
Reward
?? ?? ? ?? ???
?? ??
??? Lessons Learned?
37
? Reward Model? ??? ???
? 1) Reward model? ??? Truthfulness? ?? ???? ???. -> ????? ??? ???
? 2) ??? 100?? ??? 100?? ???? ??? ?? ? ?? ??? ??? ????? ?? ? ??? ??
? 3) Reward ??? ??
? 4) ?? ?? Reward Model ??? Llama2??? 2? ???, ? ?? ???? ????
https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s
??? Lessons Learned?
38
? ??? ?? ??/??? ?? ??
? SFT ?? ??? ????
? ??? 2~3? ?? ????
? ?? ??: DPO, Hydra-PPO, Offline-RL ? ?? ?? ??
? RLHF ???? ??? ? issue? ?? ??? ??
? ?? ??? Colossal AI ? TRL ? DeepSpeed? ???
? ?? ?? TRL? ?? ??? ..
https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s
?~~?? ????
????
Step 3: ???? ??? ??| 40
ChatBot? ????? ?????
Agent
Reward
Action
State
Environment
?????? ???? ?? ????? ?????
https://luda.ai/
Agent
Environment
? Action: ?
? State: ?
? Reward: ?
Step 3: ???? ??? ??| 41
ChatBot? ????? ?????
https://luda.ai/
Agent:
???(??)
Environment
? State: ????? ?? ????
? Action: ??? ?? ??? ??/??(??)
? ?? ?? Action Space == Vocabulary Size
? ?? ?? Action Space == Vocabulary Size * ? ??? ??
? Reward: Reward Model? State+Action? ??? ? ???? Scalar
Value
Step 3: ???? ??? ??| 42
State, Action, Reward? Trajectory
https://luda.ai/
State
GPT
Action
Reward Model
Reward
State
GPT
Action
Reward Model
Reward
continue
<Episode 1> <Episode 2>
(state, action, reward) (state, action, reward) Training Data
Step 3: ???? ??? ??| 43
???? ??? ??
? OpenAI Gymnasium? Lunar Lander
https://gymnasium.farama.org/environments/box2d/lunar_lander/, https://www.youtube.com/watch?v=U4vRW4fcXRA
??? ??? ?? ???? ???
Step 3: ???? ??? ??| 44
ChatGPT? ??? ???? ???? ??(PPO)
? ??? ?? ????: Advantage Actor-Critic
Actor Network
? ?? ?? ??? ?? ???? ??
? Input: State
? Output: Action Probability Dist.
? Critic Network?? ?? ??? ?? Action?
?? ???? ???
? Actor? ??? ?? ??? ??
? Input: State
? output: ?? ????? ??(?? ??? ??? ????)
? ?? ??(???? ??)? ??? ??? ??? ??? ?
??? ??
Critic Network
Step 3: ???? ??? ??| 45
Critic? ???
? Critic? ?? ????? ??? ?? Reward? ?? ???? ????
Actor Network Critic Network
Reward
vs
?? ??? ???? ??
?? ??? ??? ??? ? ?? Reward ???
Step 3: ???? ??? ??| 46
Critic? ???

Action
Actor Network
??? ???? update
[???]
? High Variance
? ??? ??
? ?? ?? ??
+1 +1 -1 -1 +1
Reward
<High variance Reward ??>
? Critic? ????
Time 0 1  T_end
T T+1
episode
reward
Step 3: ???? ??? ??| 47
Critic? ???

Action
+1 +1 -1 -1 +1
Reward
? Critic ??
Time 0 1  T_end
T T+1
Value +1.2 +1.3 -0.9 +0.3 +0.8 +0
Critic Network
? ?? S(t)?? ??? ?? ? Reward? ???? ??(value)? ?? ? V(??)
? ???, V(??) ? V(??+1)? ??? ???? Actor network? ??
Step 3: ???? ??? ??| 48
Critic? ???

Action
+1 +1 -1 -1 +1
Reward
? Critic ??
Time 0 1  T_end
T T+1
Value +1.2 +1.3 -0.9 +0.3 +0.8 +0
Critic Network
$V(s_t)$ $V(s_{t+1})$
Advantage $= r + V(s_{t+1}) - V(s_t)$
r
Actor Network
update
[??]
? Low Variance
? ? Step?? ??? ??
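As a worked example of the advantage formula on this slide, the sketch below applies it to the reward and value rows shown above. Lining the values up as V(s_0) through V(s_T), with the final value 0 for the terminal state, is an assumption about how the slide's figure is laid out. No discount factor is used, matching the formula as written.

```python
def one_step_advantages(rewards, values):
    """Advantage_t = r_t + V(s_{t+1}) - V(s_t).

    values has one more entry than rewards: V(s_0) ... V(s_T).
    """
    return [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]

rewards = [1, 1, -1, -1, 1]
values = [1.2, 1.3, -0.9, 0.3, 0.8, 0.0]
advantages = one_step_advantages(rewards, values)
# Approximately [1.1, -1.2, 0.2, -0.5, 0.2]
```

Each step gets its own learning signal, which is why the critic allows the actor to be updated per step rather than only once the episode ends.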
Step 3: ???? ??? ??| 49
????: Loss Clipping
https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
??? ???
(Critic? ??)
?? ?? ?? ??!!
??? ???..!
Step 3: ???? ??? ??| 50
Loss Clipping
https://huggingface.co/blog/deep-rl-ppo
?? Policy? ?? Policy? ??
r? ?? ??? ???? ??? Clipping
Critic? ?? ??? Reward(advantage)
? ChatGPT? ??? Proximal Policy Optimization(PPO) ????? 2017?? ??
? ??? ???? ??? ?? ??? ????? ???? ??? ????? ??(?)
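The loss clipping on this slide can be sketched for a single action as below. `ratio` is the probability of the action under the new policy divided by its probability under the old policy, and `advantage` comes from the critic. The default `epsilon=0.2` is the value commonly quoted from the PPO paper, not one confirmed in the talk.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one action: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain from raising the action probability is capped.
gain_capped = clipped_surrogate(1.5, 1.0)      # uses the clipped ratio 1.2
# Negative advantage with a raised probability: the penalty is not clipped.
penalty_kept = clipped_surrogate(1.5, -1.0)    # uses the raw ratio 1.5
```

Taking the minimum of the two terms removes any incentive to move the policy far from the old one in a single update, while mistakes are still fully penalized.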
Step 3: ???? ??? ??| 51
[??]???? ???? ??
? Neural net ??? ?? ?? C Loss function
? ???? ???
? ??? ?? ?? ?? = Loss
? Loss? Gradient()? ?? ?,
Gradient Descent? ???? ????
? ?????? Loss???
? Loss? ???? Reward? ??
? Reward? ???? Gradient Ascent!!
????  ?ylog(?) ????  ? ?
?
??
Actions
Agent Network(Policy)
State
Step 3: ???? ??? ??| 52
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
E(Reward|?0) = ?? P(a|s) ? R(s,a)
?? s?? ?? a? ? ?? ?? ?? ?? Reward
Step 3: ???? ??? ??| 53
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
E(Reward|?0) = ?? P(a|s) ? R(s,a) = J
?? s?? ?? a? ? ?? ?? ?? ?? Reward
(??, return)
= Policy(Agent)
? ??? ???? ?? ??? ???? ??? J? ? ? ??? ??, Gradient Ascent!
?
= ? + ? ? ?? ?
?
= ? + ? ? ????? ? ? ? ?(?, ?)
? ? ?, ? ? ?? ?? ??? ? ?? ??, ??? ????
Step 3: ???? ??? ??| 54
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
E(Reward|?0) = ?? P(a|s) ? R(s,a) = J
?? s?? ?? a? ? ?? ?? ?? ?? Reward
? Action a? ?? ??? ???? Reward?
??? ??? ??.
? ??? roll out ???? ??? ?? ?? ??
? Neural Net?? ?????!
? ??(?, ?)? ????,
? ?
= ? + ? ? ??? ?, ? ? ?
? ? = ?? reward? ??(?, ?)? ??

RLHF_Lessons_learned.pdf

  • 2. ?? ? RLHF? ? RLHF ?? ? ?? ?? ? Stage ? ?? ?? ? ?? ?? ? Lessons Learned
  • 3. RLHF? 3 https://www.youtube.com/watch?v=vziygFrRlZ4 ? ??? ??? NO NO ? ?? ??? ???? ? Imitation Learning & Learning from human preference Behavior Cloning : ??? ?? ?? ??? ??? (????, ????/Demonstration ??? ??) Learning from human preference (OpenAI, 2017) : ?? ??? ?? ???? ??? ?? ? ??? ???? ???? ?? ??(????)
  • 4. 4 Reward Function ??? ??? https://youtu.be/tlOIHko8ySg, https://openai.com/research/faulty-reward-functions Reward ? ? ? ??? Reward Function ? ???? -, ??? ??? + ? ??? ??? ?? ??? ???? ????
  • 5. 5 ???? ?? ??? Reward Function ?? Boston Dynamics ????? ????? ??? Reward? ????? - ?? ?? ??? ? ?? ?????? - ?? ?? +100?? - ?? ?? ?? ???? 0??? - ??? 10Cm ?? 1??? 2??? 10???
  • 6. 6 ???? ??? ???????? How?? ??? ?? ??? Reward Model GPT ?? ???? +100 Reinforce ?? ???? ??? ??? ????
  • 7. 7 ?? ?? == ??? ??? ? ??? ???? ??? ? ?? ?? ???? ? ??? ???? ????? P( ??|???1, ???2 , ?0) ?? ?? ?? ?? P( ??|???1, ???2 , ?0) + ??? ??? ?? ????? ?? ?? ? ?? ??? ?? ChatGPT? ?? ?? ? ??? ???? ?? Reward Model
  • 9. Stage 1: Supervised Fine-tuned Model ??? 9 ? Goal ? ??? ??? ???? ?? ? ? ?? ?? ?? ?? (Instruction Fine-tuning, Imitation Learning) ? ??? ? ?? ? ?? ??? ?? ??(GPT 3.5, 175B) ? ??? ?? ? ? ? ? ?? ???? ?? ??-?? ???? (Demonstration Dataset) ? ?? ?? ? ??? ?? ??? ??? ??? Supervised Fine-tuning (??: ??, ??: ??, Loss function: Cross-entropy loss) ? ?? ? ?????, ??? ??? ??? ?? ? ??? ?? ??? GPT ?? ????
  • 10. Stage 2: Reward Model ??? 10 ? Goal ? ??? ??? ?? ??? ??? ??? ???? Reward Model(RM) ?? ? ??? ? Stage1?? ?? ? SFT ??(6B, head ??) ? ??? ?? ? ?? ?? ??? ????, ? ? ?? ??? ?? ???? ????? ?. ? ?? ?? ? ??? ? ???? ??? ??? ???? ?? (?? ?? ?? >> ?? ?? ??) ? ?? ? ?? ??? ?? ??, ??? Stage 1??? ?? ??? ?? ?. ### ?? : ??? ?? ?? Reward Model ??1) ??? ??2) ???? 100? Reward Model -100?
  • 11. Stage 3: ???? ???? 11 ? Goal ? RM? SFT? ???? ???? ???? ? ??? ? ChatGPT ?? ? ??? ? Stage1?? ?? ? SFT ??, Stage2?? ?? ? RM ?? ? ?? Stage?? ???? ?? ??? ??(??)? ? ?? ?? ? PPO Algorithm(???? ????? ? ??) ? ?? ? ???? ??? ???? ??? ?? ?? Stage. Stage2, 3? ?? ?? ??
  • 12. ???? RLHF ?? ??? 12 SFT ?? ??? Reward ?? ??? ???? ???? ? ?? ?? ?? ?? ? ??? ??-?? ?? ?? ? ?? ?? C ?? ?? ? ?? ? ????? ?? set ?? Stage 1 Stage 2 Stage 3
  • 13. RLHF? ?? ?? ?? ???? ? ?? C ???? ?? 13 SFT RL ? Dataset ??? ??? ?? Reward Model Case1) ?? ?? Distribution - ??: ??? ??? ????? ?? - ??: ?? ?? X SFT RL Reward Model Case2) ?? ?? ??? Dist. - ??: ?? ??? ??? ?? ??? ?? - ??: Data Mix ??? ???, ?? ??? SFT RL Reward Model Case 3) ?? ?? - ??: ??/???? ?? - ??: ????? 3?? ?????? ? Stage ? ????? ???, ???? ??? ?? ? ? ???? ??? ?? SFT RL Reward Model
  • 14. ?? ????? ???? ??? 14 ? ?? ???? SFT ???? Name #N ??/??/?? ?? ????? 48163 Wizard 69615 Open orca 998520 koopen-platypus 24818 korquad-chat 32287 counsel_bot.jsonl 60367 evolve-instruct 36809 oig-smallchip2-dedu.jsonl 210282 Kullm-v2 141177 oig-instructions_en_ko.jsonl 49210 Super natural 49166 Shared GPT 34051 Everything 991 KoAlpaca 20366 KoCLima-vicuna 999 Naver ??? 319714 ?? 200?? ?? ?? ?? 1) ?? ??? ?? ???. 2) ?? N?? ??? ???. 3) ?? ??? ?? ???? ?? ???
  • 15. ?? ????? ???? ??? 15 ? SFT ???? ? ?? ?? (polyglot 12.8b ??) 1) ?? ??? ?? ? ???? ??, ???? ??? ? ??? ?? (??? ?? ???? ?? ??) 2) ?? ??? ?? ???? ???? ?? : ??? ??? ??? ???? ??? ?? ??? ??, ?? ??? ?? ?? ??) Ko-Lima (??? ?? ?? ???, ??? ???) + ??/?? ? ????? ??? ???(?? 1??? ? ?? ??) ? ?? ?? 2M 90K 25K ???? ??? ? ?? ??
  • 16. ?? ????? ???? ??? 16 ? ?? ???? SFT ???? Name #N ??/??/?? ?? ????? 48163 Wizard 69615 Open orca 998520 koopen-platypus 24818 korquad-chat 32287 counsel_bot.jsonl 60367 evolve-instruct 36809 oig-smallchip2-dedu.jsonl 210282 Kullm-v2 141177 oig-instructions_en_ko.jsonl 49210 Super natural 49166 Shared GPT 34051 Everything 991 KoAlpaca 20366 KoCLima-vicuna 999 Naver ??? 319714 ?? 200?? ?? ?? ?? ? Stage1 SFT ?? : 25000 ? Stage2 Reward Model ?? : 8500 ? Stage2 Reward Model ?? : 1500 ? Stage3 PPO ?? : 15000 ??? ????? ??? ?? ?? ? ?????, ??? ????? ??? ????? ??? ?? ? oig-instructions ??? ?? (from Anthropic) : ??? ?? ?? ????(1024 token)?? ?? : opensource ?? + LLM? output?? ???? ??
  • 17. Reward Model Data ?? 17 ? Reward Model Data? ??? ??? ? ?? ? 1?? ???? ??? ??? 2? ??? ?? ?? Where can I find information about the Paris Climate Agreement? You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources. <?? ???? ??>
  • 18. Reward Model Data ?? 18 ? ChatGPT 3.5 API? ???? ?? ?? ?? Where can I find information about the Paris Climate Agreement? You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources. <?? ???? ??> You can find information about the Paris Climate Agreement from various sources, including official documents, international organizations, and reputable news outlets. Here are some suggestions on where to find information: 1.United Nations Framework Convention on Climate Change (UNFCCC) Website: - The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary source for official documents, updates, and information. - Website: UNFCCC 2.Paris Agreement Text: - The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and provisions of the agreement. - Paris Agreement Text - ChatGPT? ??? ?? : ??? ????? ?? ??? ?? + ??? ?? ?? ?? Good Bad
  • 19. Reward Model Data ?? 19 ? ChatGPT 3.5 API? ???? ?? ?? ?? Where can I find information about the Paris Climate Agreement? You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources. <?? ???? ??> You can find information about the Paris Climate Agreement from various sources, including official documents, international organizations, and reputable news outlets. Here are some suggestions on where to find information: 1.United Nations Framework Convention on Climate Change (UNFCCC) Website: - The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary source for official documents, updates, and information. - Website: UNFCCC 2.Paris Agreement Text: - The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and provisions of the agreement. - Paris Agreement Text - ChatGPT? ??? ?? : ??? ????? ?? ??? ?? + ??? ?? ?? ?? Good Bad ??? ??? ??? ??? ??? ???? ? ??? ??
  • 20. ???, ??? ?? ???? ?? 20 ? ?? ?? ?? : Polyglot-1.3B, 3.8B, 5.8B, 12.8B ? ?? ??? : oig-instruction -> 4? ??? ?? ? GPUs : A100 (80GB) * 8 ? ? ?? ?? : MS/DeepSpeed-Chat ? ColossalAI ? TRL(Transformer Reinforcement Learning)? ?? ?? ? ColossalAI ? ?? ?? ?? ?? ????, 12.8B actor + 3.8B critic ???? ?? ???, ?? ??? ??? ?? ? TRL ? Actor? Critic? Shared Architecture? ????? ?? ?? ? ???? Vanilla RLHF? ??? ??? ?? ?? ? ?? ????? ???? ?? ??? ??? ?? ?? ??? ???? ??? ? ?????.. ? (?? ??/??? ??? ???..) Stage 1 ?? Stage 2 ?? Stage 2 ?? PPO
  • 21. Stage 1: SFT ?? 21 ? ?? ??? ?? ? ??? ??? ???? + Template ?? ? ?? ??? ?? Context Length =1024 ? Epoch : LLaMA2, Instruct GPT??? ?? 2~3 Epoch ? LR 1~5e-5(AdamW, Cosine Scheduler) ? Zero-out loss on Prompt ? Polyglot-ko 1.3b~12.8B ? ??? Test ??? ????? ?? SFT ????! ??? ??? ??? ????: ### system: ???? ??? ?? ??? ??? ?????. ### ???: ??? ?? ??? ### ??:
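The "Zero-out loss on Prompt" setting on this slide can be sketched as below. This is a minimal, framework-free illustration, not the authors' training code: the function name `masked_cross_entropy`, the per-token log-probabilities, and the boolean prompt mask are all assumptions made for the example.

```python
import math

def masked_cross_entropy(token_log_probs, is_prompt):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_prompt: True where the token belongs to the prompt; those positions
               contribute nothing to the loss (the zero-out on prompt idea).
    """
    losses = [-lp for lp, prompt in zip(token_log_probs, is_prompt) if not prompt]
    if not losses:
        raise ValueError("sequence has no response tokens")
    return sum(losses) / len(losses)

# Hypothetical 4-token sequence: two prompt tokens, then two response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
prompt_mask = [True, True, False, False]
loss = masked_cross_entropy(log_probs, prompt_mask)
```

Only the last two tokens enter the average, so the prompt's likelihood does not influence the gradient.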
  • 22. Stage 2: Reward Model ?? 22 ? GPTModel ?? Reward? ???? ?? Linear Layer ?? Pretrained GPT Linear(hidden_dim, 1) ??: ??? ?? ?? ?? ??: ??? <Loss function ???> EOS pooling
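The reward head on this slide (a pretrained GPT followed by `Linear(hidden_dim, 1)` with EOS pooling) might look roughly like the sketch below. The loss formula on the slide did not survive extraction, so the pairwise log-sigmoid ranking loss used here is an assumption borrowed from common RLHF practice (for example InstructGPT), and all function names and toy numbers are hypothetical.

```python
import math

def eos_pool(hidden_states, eos_index):
    """EOS pooling: keep only the hidden vector at the end-of-sequence token."""
    return hidden_states[eos_index]

def reward_head(pooled, weight, bias=0.0):
    """Stand-in for Linear(hidden_dim, 1): a dot product producing one scalar."""
    return sum(w * h for w, h in zip(weight, pooled)) + bias

def pairwise_loss(r_chosen, r_rejected):
    """Assumed ranking loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = r_chosen - r_rejected
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)

# Toy hidden states (hidden_dim = 2) for a preferred and a rejected answer.
weight = [1.0, -1.0]
chosen_states = [[0.1, 0.2], [0.3, 0.1], [2.0, 0.5]]   # EOS at index 2
rejected_states = [[0.1, 0.2], [0.5, 1.0]]             # EOS at index 1

r_chosen = reward_head(eos_pool(chosen_states, 2), weight)
r_rejected = reward_head(eos_pool(rejected_states, 1), weight)
rm_loss = pairwise_loss(r_chosen, r_rejected)
```

The loss shrinks as the preferred answer's score pulls ahead of the rejected one, and equals log 2 when the two scores tie.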
  • 23. Stage 2: Reward Model ?? 23 ? ChatGPT? ?? ?? ??? ??? ?????? ?? ? ??(1 epoch) ? ?? ?? ? ?? ?? Model ?? ??? ?? ?? ?? ?? ??? ?? ?? ?? ?? ?? ?? ??? polyglot-ko-1.3b 6.861068249 -5.620769024 0.9958334 polyglot-ko-3.8b 11.7929697 -3.29787612 0.9979167 polyglot-ko-5.8b 9.338997841 -8.786784172 0.9958334 ??? ?, ????
  • 24. Stage 2: Reward Model ?? 24 ? ??? Overfitting ?? You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources. You can find information about the Paris Climate Agreement from various sources, including official documents, international organizations, and reputable news outlets. Here are some suggestions on where to find information: 1.United Nations Framework Convention on Climate Change (UNFCCC) Website: - The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary source for official documents, updates, and information. - Website: UNFCCC 2.Paris Agreement Text: - The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and provisions of the agreement. - Paris Agreement Text - <?? ????? ??> <ChatGPT? ?? ??> ? 3??? ??? ?? ?? ?? ? ?? ??? ??? ??? ?? ??? ?? ??? ???? ?? ?? ? ???? ??? ????/??? ?? ????? ?? ??
  • 25. Stage 2: Reward Model ?? 25 ? ??? Overfitting ?? ? ?? ?? 1) ?? ???? ???? ?? ?? : ???? 17?? ???? ?? ??: ????? ?? ??? ????? ?????? ??? 1: ???, ????, ?? ?? ??? ???? ? ?? ??? ?????. ? ??? ?? ??? ?? ?? ??? ??? ?? ???? ????. ??? 2: ???? ??? ? ???! ?? ?? ?? ? ? ????????. ????? ???? ???? <??>
  • 26. Stage 2: Reward Model ?? 26 ? ??? ???? ??, ??? ??? ??? ??? ?? ? ???? vs ???? ? ???, ??? Generality? ?? ? ??? ?? Model ?? ??? ?? ?? ?? ?? ??? ?? ?? ?? ?? ?? ?? ??? polyglot-ko-1.3b 0.6244 0.3636 0.5987 polyglot-ko-3.8b 1.50507 1.1352 0.6337 polyglot-ko-5.8b 0.99981 0.5761 0.6574 TF + ??? - - 0.6267 ???? ?? ?? <???? ?? ??>
  • 27. Stage 2: Reward Model ?? 27 ? Overfitting ?? ?? 2) : ?? ?? ??? ??? ??? ?? ?? ??? ??? To ChatGPT : "??? ??? ??? ?? ???? ??? ???? ???. ??? ??? ?? ?? ??? ????? ???. ???? ? ?? ? ??? ?? ??? ????? ?? ??????. ?? ???? ??? ???." ??: ???? ?? ? ??? ? ?? ??? ?????? ?? : 1. ????? ???? ??? ?????. ?? ?? ????? ??? ??? ??? ?? ?????. 2. ???? ?? ?? ?? ? ??? ??? ???? ?? ????. ???? ???? ??? ?? ? ? ???? ?? ? ????. 3. ???? ?? ????? ?? ?? ????. ???? ??? ???? ?? ?? ????? ??? ?? ?? ?? ?? ????. 4. ???? ??? ???? ?? ??? ?????. ??? ???? ??? ????? ????? ?? ? ???? ??? ? ???? ???? ???? ?? ????. 5. ???? ? ?????? ??? ??? ?? ??? ?? ????. ??? ??? ?? ?? ?? ??? ???? ?? ?? ?? ? ????. 6. ???? ?? ???? ?? ?? ????. ???? ??? ??? ?? ?? ?? ??? ??? ?? ???? ?? ?? ??? ? ? ????. 7. ???? ???? ???? ? ?? ???? ?? ???? ????. ??? ???? ???? ??? ????? ?? ??? ? ???? ?? ?? ??? ? ?? ??? ???? ???? ???.
  • 28. Stage 2: Reward Model ?? 28 Model ?? ??? ?? ?? ?? ?? ??? ?? ?? ?? ?? ?? ?? ??? polyglot-ko-3.8b 6.8610 -5.6207 0.9958 polyglot-ko-3.8b + opensource + lying 1.8448 -1.5232 0.85125 ? ?????, ?? + ?? ??? + ? ??? = 1:1:1 ??? ???? ?? ?? ???? ?? ???? ??? ?? ???? ?? ?? ???? ?? ???
  • 29. Stage 3: PPO ?? 29 https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training ? PPO : Actor + Critic ?? ? ??? 4?? ??? ???; ? ??? ??(??) ? Inference 5? ???;;;
  • 30. Stage 3: PPO ?? 30 ?? ??? ? ???? ?? ??? Reward Clipping? ?? ?? KL penalty? ??? ?? (Dynamic KL Penalty ???) KL = 0.04, 0.025 KL = 0.01
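One way reward clipping and a KL penalty can combine into the reward that PPO sees is sketched below. It is a simplified, sequence-level illustration rather than the DeepSpeed-Chat implementation: the function name, the clip range of 5.0, and the log-probabilities are assumptions, while the coefficient 0.04 is one of the values listed on the slide.

```python
def shaped_reward(rm_score, logp_actor, logp_ref, kl_coef=0.04, clip=5.0):
    """Clip the reward-model score, then subtract a KL penalty.

    logp_actor / logp_ref: per-token log-probabilities of the sampled response
    under the current actor and the frozen SFT reference model.
    The sum of their differences is a simple sample-based KL estimate.
    """
    clipped = max(-clip, min(clip, rm_score))
    kl_estimate = sum(a - r for a, r in zip(logp_actor, logp_ref))
    return clipped - kl_coef * kl_estimate

# Hypothetical response whose raw reward-model score exceeds the clip range.
reward = shaped_reward(
    rm_score=8.0,
    logp_actor=[-1.0, -0.5],
    logp_ref=[-1.5, -1.0],
)
```

A larger `kl_coef` keeps the actor closer to the SFT model; a smaller one lets it chase the reward model more freely, which is the trade-off the KL values on the slide compare.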
  • 31. Stage 3: PPO ?? 31 ? Reward Hacking ?? : ?? ???? ??? ??? Reward? ?? ?? ? ?? ??? ??? ? ??!? 12.8B Actor + 5.8B RewardModel ??: ???? ??? ??? ? ? ???? ?? : 1??) ?? ??? ?????. 2??) ??: ?? ???? ???! ?? : 1??) ?? ??? ????? 2??) ??: ? ??? ??? ?? : ? ??? ??? ????. 1??) ??? ?????. 2??) ?? ?? ?? ??: ? ??? ??? ?? : __ __ __ ?? ?? ?? ?? (?? ???? ???? ??)
  • 32. Stage 3: PPO ?? 32 ? PPO? ?? ? ????? ???? ?? ???? ? 1) ??? ?? ?? ?? ??. ??) 12.8B Actor + 3.8B Critic ? 2) Llama2? Rejection Sampling ?? Actor ?? ??1 ??2 ??3 ??4 ??5 Reward Model ??3 Best ?? ?? Supervised Learning(1 epoch)
  • 33. Stage 3: PPO ?? 33 ? PPO ?? ??? 15000?? 5????, 4?? Rejection Sampling + 1?? PPO ?? 3000? ??? ?? ?? X 4? ? 3?? ???? 4? Iteration ? ?? Iteration?? Actor? ?? Iteration?? ?? ? Actor ? 2?? Iteration??? diff(Best-Worst) > 2 ? ??? ?? (?? 800?) ? ???? ??? ???? ??, overfitting ??
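The rejection-sampling loop described on the last two slides (sample several answers from the actor, score them with the reward model, keep the best one for supervised learning, and from the second iteration keep only prompts where best minus worst exceeds 2) can be sketched as follows. The function names and the toy scorer are hypothetical; a real run would call the actor and reward model instead.

```python
def select_for_rejection_sampling(prompt, candidates, score_fn, min_gap=None):
    """Return the best-scoring candidate as an SFT example, or None if filtered.

    score_fn(prompt, response) stands in for the reward model.
    min_gap: if set, drop the prompt unless best - worst score exceeds it.
    """
    scored = [(score_fn(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    worst_score, _ = min(scored, key=lambda pair: pair[0])
    if min_gap is not None and best_score - worst_score <= min_gap:
        return None
    return {"prompt": prompt, "response": best}

# Toy reward model: a lookup table of made-up scores for five sampled answers.
toy_scores = {"a1": 0.3, "a2": -1.0, "a3": 2.4, "a4": 0.9, "a5": 1.1}

def toy_score(prompt, response):
    return toy_scores[response]

kept = select_for_rejection_sampling("q", list(toy_scores), toy_score, min_gap=2.0)
dropped = select_for_rejection_sampling("q", ["a1", "a4", "a5"], toy_score, min_gap=2.0)
```

The kept pairs would then be used for one epoch of ordinary supervised fine-tuning, with the resulting model serving as the actor for the next iteration.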
  • 34. Stage 3: PPO ?? ?? 34 ? Rejection Sampling ? PPO? ?? ?? ?? ?? No RS RS1 RS2(>2) RS2(all) RS3(>2) RS4(>2) RS4(>2) + PPO(KL 0.04) RS4(>2) + PPO(KL 0.01) No RS + PPO Reward hacking / Local optimum Win Rate (vs oig ?? ???) (Reward Model ??) ?? ??? ?
  • 35. Stage 3: PPO ?? ?? 35 ? Human Evaluation ?? [?? ??] 1) ???: - ???? ?? ?? ????? (?? ??) - ??? ??? ??? ?? ?????? (???) - ?? ?? ?? ?? ??? - ??? ?? ?? ?? ?? ??? 2) ???: - ??? ??/??? ?? ?? ???? ??? ?? ??? ???? - ??? ????? 3) ???: - ???? ??? ????? ??? ???? Win Rate (Win = PPO win) ???, ??? -> 1~2% ??? ?? PPO vs SFT
  • 36. ??? Lessons Learned? 36 ? ??? ????? .. ???? ?? ????. ?? ??? Quality? ? ?? ? SFT >> Reward Modeling > PPO ? Quality is all you need (Llama2, LIMA) ? SFT? ?? PPO? ????? ??? ????? ???. ??, ?? ??? ??? ?? ??? ???(ex. ??, ?? ? NLP Task ??) ? InstructGPT ?? NLP Task? PPO? ?? ?? ????, ???? ???? ??? ? ?????? ? Failed ? ???? Reward? ???? ? ?? ? ?? ??? ??? ?? ?? ? Naïve ? Reward Modeling?? PPO? ??? ? ???? ?? ??? ?? ? ???? ? Nope! ? Rejection Sampling?? Iterative Learning? ?? ?? ???? ?? ?? ??? ??? ?? ? ??? ??? ?? ?? Reward Reward ?? ?? ? ?? ??? ?? ??
  • 37. ??? Lessons Learned? 37 ? Reward Model? ??? ??? ? 1) Reward model? ??? Truthfulness? ?? ???? ???. -> ????? ??? ??? ? 2) ??? 100?? ??? 100?? ???? ??? ?? ? ?? ??? ??? ????? ?? ? ??? ?? ? 3) Reward ??? ?? ? 4) ?? ?? Reward Model ??? Llama2??? 2? ???, ? ?? ???? ???? https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s
  • 38. ??? Lessons Learned? 38 ? ??? ?? ??/??? ?? ?? ? SFT ?? ??? ???? ? ??? 2~3? ?? ???? ? ?? ??: DPO, Hydra-PPO, Offline-RL ? ?? ?? ?? ? RLHF ???? ??? ? issue? ?? ??? ?? ? ?? ??? Colossal AI ? TRL ? DeepSpeed? ??? ? ?? ?? TRL? ?? ??? .. https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s ?~~?? ????
  • 39. ????
  • 40. Step 3: ???? ??? ??| 40 ChatBot? ????? ????? Agent Reward Action State Environment ?????? ???? ?? ????? ????? https://luda.ai/ Agent Environment ? Action: ? ? State: ? ? Reward: ?
  • 41. Step 3: ???? ??? ??| 41 ChatBot? ????? ????? https://luda.ai/ Agent: ???(??) Environment ? State: ????? ?? ???? ? Action: ??? ?? ??? ??/??(??) ? ?? ?? Action Space == Vocabulary Size ? ?? ?? Action Space == Vocabulary Size * ? ??? ?? ? Reward: Reward Model? State+Action? ??? ? ???? Scalar Value
  • 42. Step 3: ???? ??? ??| 42 State, Action, Reward? Trajectory https://luda.ai/ State GPT Action Reward Model Reward State GPT Action Reward Model Reward continue <Episode 1> <Episode 2> (state, action, reward) (state, action, reward) Training Data
  • 43. Step 3: ???? ??? ??| 43 ???? ??? ?? ? OpenAI Gymnasium? Lunar Lander https://gymnasium.farama.org/environments/box2d/lunar_lander/, https://www.youtube.com/watch?v=U4vRW4fcXRA ??? ??? ?? ???? ???
  • 44. Step 3: ???? ??? ??| 44 ChatGPT? ??? ???? ???? ??(PPO) ? ??? ?? ????: Advantage Actor-Critic Actor Network ? ?? ?? ??? ?? ???? ?? ? Input: State ? Output: Action Probability Dist. ? Critic Network?? ?? ??? ?? Action? ?? ???? ??? ? Actor? ??? ?? ??? ?? ? Input: State ? output: ?? ????? ??(?? ??? ??? ????) ? ?? ??(???? ??)? ??? ??? ??? ??? ? ??? ?? Critic Network
  • 45. Step 3: ???? ??? ??| 45 Critic? ??? ? Critic? ?? ????? ??? ?? Reward? ?? ???? ???? Actor Network Critic Network Reward vs ?? ??? ???? ?? ?? ??? ??? ??? ? ?? Reward ???
  • 46. Step 3: ???? ??? ??| 46 Critic? ??? Action Actor Network ??? ???? update [???] ? High Variance ? ??? ?? ? ?? ?? ?? +1 +1 -1 -1 +1 Reward <High variance Reward ??> ? Critic? ???? Time 0 1 T_end T T+1 episode reward
  • 47. Step 3: ???? ??? ??| 47 Critic? ??? Action +1 +1 -1 -1 +1 Reward ? Critic ?? Time 0 1 T_end T T+1 Value +1.2 +1.3 -0.9 +0.3 +0.8 +0 Critic Network ? ?? S(t)?? ??? ?? ? Reward? ???? ??(value)? ?? ? V(s_t) ? ???, V(s_t) ? V(s_{t+1})? ??? ???? Actor network? ??
  • 48. Step 3: ???? ??? ??| 48 Critic? ??? Action +1 +1 -1 -1 +1 Reward ? Critic ?? Time 0 1 T_end T T+1 Value +1.2 +1.3 -0.9 +0.3 +0.8 +0 Critic Network V(s_t) V(s_{t+1}) Advantage = r + V(s_{t+1}) - V(s_t) r Actor Network update [??] ? Low Variance ? ? Step?? ??? ??
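The advantage on this slide, the reward plus the critic's next-state value minus its current-state value, can be computed per step as in the sketch below. The reward and value numbers are the ones shown on the slide; pairing them position by position is an assumption, as is the function name. No discount factor is applied, matching the slide's formula.

```python
def one_step_advantages(rewards, values):
    """Advantage_t = r_t + V(s_{t+1}) - V(s_t) for every step of an episode.

    values needs one more entry than rewards: the value of the final state
    (0 here, since nothing follows the end of the episode).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    return [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]

rewards = [1, 1, -1, -1, 1]
values = [1.2, 1.3, -0.9, 0.3, 0.8, 0.0]
advantages = one_step_advantages(rewards, values)
```

Because each step gets its own signal relative to the critic's expectation, the actor can be updated at every step instead of waiting for the episode total.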
  • 49. Step 3: ???? ??? ??| 49 ????: Loss Clipping https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12 ??? ??? (Critic? ??) ?? ?? ?? ??!! ??? ???..!
  • 50. Step 3: ???? ??? ??| 50 Loss Clipping https://huggingface.co/blog/deep-rl-ppo ?? Policy? ?? Policy? ?? r? ?? ??? ???? ??? Clipping Critic? ?? ??? Reward(advantage) ? ChatGPT? ??? Proximal Policy Optimization(PPO) ????? 2017?? ?? ? ??? ???? ??? ?? ??? ????? ???? ??? ????? ??(?)
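The loss clipping on this slide corresponds to PPO's clipped surrogate objective, which can be sketched for a single action as below. The function name and the clip range `eps=0.2` (a commonly used default) are assumptions, not values taken from the slides.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
    Taking the minimum of the clipped and unclipped terms removes any
    incentive to move the policy further than the clip range allows.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage: the full penalty is kept.
penalized = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
```

The asymmetry is deliberate: improvements are capped, while mistakes are penalized in full.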
  • 51. Step 3: ???? ??? ??| 51 [??]???? ???? ?? ? Neural net ??? ?? ?? C Loss function ? ???? ??? ? ??? ?? ?? ?? = Loss ? Loss? Gradient()? ?? ?, Gradient Descent? ???? ???? ? ?????? Loss??? ? Loss? ???? Reward? ?? ? Reward? ???? Gradient Ascent!! ???? ?ylog(?) ???? ? ? ? ?? Actions Agent Network(Policy) State
  • 52. Step 3: ???? ??? ??| 52 [??] ???? ???? ?? ? Reward? ?? ???? ???, Reward? ???(E)? ??? ??! ? ??? = ?? * ?? ?? ? E(Reward|?0) = ?? P(a|s) ? R(s,a) ?? s?? ?? a? ? ?? ?? ?? ?? Reward
  • 53. Step 3: ???? ??? ??| 53 [??] ???? ???? ?? ? Reward? ?? ???? ???, Reward? ???(E)? ??? ??! ? ??? = ?? * ?? ?? ? E(Reward|?0) = ?? P(a|s) ? R(s,a) = J ?? s?? ?? a? ? ?? ?? ?? ?? Reward (??, return) = Policy(Agent) ? ??? ???? ?? ??? ???? ??? J? ? ? ??? ??, Gradient Ascent! ? = ? + ? ? ?? ? ? = ? + ? ? ????? ? ? ? ?(?, ?) ? ? ?, ? ? ?? ?? ??? ? ?? ??, ??? ????
  • 54. Step 3: ???? ??? ??| 54 [??] ???? ???? ?? ? Reward? ?? ???? ???, Reward? ???(E)? ??? ??! ? ??? = ?? * ?? ?? ? E(Reward|?0) = ?? P(a|s) ? R(s,a) = J ?? s?? ?? a? ? ?? ?? ?? ?? Reward ? Action a? ?? ??? ???? Reward? ??? ??? ??. ? ??? roll out ???? ??? ?? ?? ?? ? Neural Net?? ?????! ? ??(?, ?)? ????, ? ? = ? + ? ? ??? ?, ? ? ? ? ? = ?? reward? ??(?, ?)? ??
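The symbols on the last three slides were lost in extraction, so the following is a reconstruction of the standard policy-gradient (REINFORCE) argument they appear to follow, not a transcription. The notation is assumed: $\pi_\theta(a \mid s)$ plays the role of the slides' $P(a \mid s)$, $\alpha$ is the learning rate, and $N$ is the number of sampled roll-outs.

```latex
% Expected reward as the objective J
J(\theta) = \mathbb{E}[\text{Reward} \mid \theta]
          = \sum_{a} \pi_\theta(a \mid s)\, R(s, a)

% Log-derivative trick: rewrite the gradient as an expectation
\nabla_\theta J(\theta)
  = \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a)
  = \mathbb{E}_{a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, R(s, a) \right]

% Sample-based estimate from N roll-outs, followed by gradient ascent
\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_\theta \log \pi_\theta(a_i \mid s_i)\, R(s_i, a_i),
\qquad
\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta)
```

The sum over all actions is intractable when the action space is the whole vocabulary, which is why the gradient is estimated from sampled roll-outs.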