
RLHF Lessons Learned
?? SDS ???

??
? RLHF?
? RLHF ?? ? ?? ??
? Stage ? ?? ??
? ?? ??
? Lessons Learned

RLHF?
https://www.youtube.com/watch?v=vziygFrRlZ4
? ??? ??? NO NO
? ?? ??? ???? ? Imitation Learning & Learning from human preference
Behavior Cloning : ??? ?? ?? ??? ???
(????, ????/Demonstration ??? ??)
Learning from human preference (OpenAI, 2017)
: ?? ??? ?? ???? ??? ??
? ??? ???? ???? ?? ??(????)

Reward Function ??? ???
https://youtu.be/tlOIHko8ySg, https://openai.com/research/faulty-reward-functions
Reward
? ? ? ??? Reward Function
? ???? -, ??? ??? +
? ??? ??? ?? ??? ???? ????

???? ?? ??? Reward Function ??
Boston Dynamics
????? ????? ??? Reward? ?????
- ?? ?? ??? ? ?? ??????
- ?? ?? +100??
- ?? ?? ?? ???? 0???
- ??? 10Cm ?? 1??? 2??? 10???
��

???? ??? ???????? How??
"??? ?? ???"
Reward
Model
GPT
"?? ????"
+100
Reinforce
?? ???? ??? ??? ????

?? ?? == ??? ???
? ??? ???? ??? ? ?? ?? ????
? ??? ???? ?????
P( ??|???1, ???2 … , ?0)
?? ??
?? ??
P( ??|???1, ???2 … , ?0) + ??? ???
?? ????? ?? ??
? ?? ??? ??
ChatGPT? ?? ??
? ??? ???? ??
Reward Model

ChatGPT? RLHF ??
https://www.youtube.com/watch?v=vziygFrRlZ4

Stage 1: Supervised Fine-tuned Model ???
? Goal
? ??? ??? ???? ?? ? ? ?? ?? ?? ?? (Instruction Fine-tuning, Imitation Learning)
? ???
? ?? ? ?? ??? ?? ??(GPT 3.5, 175B)
? ??? ?? ? ? ? ? ?? ???? ?? "??-??" ????
(Demonstration Dataset)
? ?? ??
? ??? ?? ??? ??? ??? Supervised Fine-tuning
(??: ??, ??: ??, Loss function: Cross-entropy loss)
? ??
? ?????, ??? ??? ??? ?? ?
"??? ?? ???"
GPT
"?? ????"

Stage 2: Reward Model ???
? Goal
? ??? ??? ?? ??? ??? ??? ???? Reward Model(RM) ??
? ???
? Stage1?? ?? ? SFT ??(6B, head ??)
? ??? ?? ? ?? ?? ??? ????, ? ? ?? ??? ?? ???? ????? ?.
? ?? ??
? ??? ? ???? ??? ??? ???? ?? (?? ?? ?? >> ?? ?? ??)
? ??
? ?? ??? ?? ??, ??? Stage 1??? ?? ??? ?? ?.
### ?? : ??? ?? ??
Reward
Model
??1) ??? ??2) ????
100?
Reward
Model
-100?

Stage 3: ???? ????
? Goal
? RM? SFT? ???? ???? ???? ? ??? ? ChatGPT ??
? ???
? Stage1?? ?? ? SFT ??, Stage2?? ?? ? RM ??
? ?? Stage?? ???? ?? ??? ??(??)?
? ?? ??
? PPO Algorithm(???? ????? ? ??)
? ??
? ???? ??? ???? ??? ?? ?? Stage. Stage2, 3? ?? ?? ??

???? RLHF ?? ???…
SFT ?? ???
Reward ??
???
???? ????
? ?? ?? ?? ??
? ??? ??-?? ?? ?? ? ?? ?? - ?? ?? ? ?? ? ????? ?? set ??
Stage 1 Stage 2 Stage 3

RLHF? ?? ?? ?? ???? ? ?? - ???? ??
SFT
RL
? Dataset ??? ??? ??
Reward
Model
Case1) ?? ?? Distribution
- ??: ??? ??? ????? ??
- ??: ?? ?? X
SFT
RL
Reward
Model
Case2) ?? ?? ??? Dist.
- ??: ?? ??? ??? ?? ??? ??
- ??: Data Mix ??? ???, ?? ???
SFT
RL
Reward
Model
Case 3) ?? ??
- ??: ??/???? ??
- ??: ????? 3?? ?????? ?
Stage ? ????? ???,
???? ??? ??
? ? ???? ??? ??
SFT
RL
Reward
Model

?? ????? ???? ???
? ?? ???? SFT ????
Name #N
??/??/?? ?? ????? 48163
Wizard 69615
OpenOrca 998520
koopen-platypus 24818
korquad-chat 32287
counsel_bot.jsonl 60367
evolve-instruct 36809
oig-smallchip2-dedu.jsonl 210282
Kullm-v2 141177
oig-instructions_en_ko.jsonl 49210
Super natural 49166
ShareGPT 34051
Everything 991
KoAlpaca 20366
Ko-Lima-vicuna 999
Naver ??? 319714
?? 200?? ??
?? ??
1) ?? ??? ?? ???.
2) ?? N?? ??? ???.
3) ?? ??? ?? ???? ?? ???

?? ????? ???? ???
? SFT ???? ? ?? ?? (polyglot 12.8b ??)
1) ?? ??? ?? ? ???? ??, ???? ??? ? ??? ??
(??? ?? ???? ?? ??)
2) ?? ??? ?? ???? ???? ??
: ??? ??? ??? ???? ??? ?? ??? ??, ?? ??? ?? ??
??) Ko-Lima (??? ?? ?? ???, ??? ???) + ??/?? ? ????? ??? ???(?? 1??? ?
?? ??)
? ?? ??
2M → 90K → 25K ???? ??? ? ?? ??

?? ????? ???? ???
? ?? ???? SFT ????
Name #N
??/??/?? ?? ????? 48163
Wizard 69615
OpenOrca 998520
koopen-platypus 24818
korquad-chat 32287
counsel_bot.jsonl 60367
evolve-instruct 36809
oig-smallchip2-dedu.jsonl 210282
Kullm-v2 141177
oig-instructions_en_ko.jsonl 49210
Super natural 49166
ShareGPT 34051
Everything 991
KoAlpaca 20366
Ko-Lima-vicuna 999
Naver ??? 319714
?? 200?? ??
?? ??
? Stage1 SFT ?? : 25000
? Stage2 Reward Model ?? : 8500
? Stage2 Reward Model ?? : 1500
? Stage3 PPO ?? : 15000
??? ????? ??? ?? ?? ? ?????,
??? ????? ??? ????? ??? ??
? "oig-instructions" ??? ?? (from Anthropic)
: ??? ?? ?? ????(1024 token)?? ??
: opensource ?? + LLM? output?? ???? ??

Reward Model Data ??
? Reward Model Data? ??? ???
? ?? ? 1?? ???? ???
??? 2? ??? ?? ??
Where can I find information about the Paris Climate Agreement?
You can find information about the Paris Climate Agreement on the website of the United Nations Framework Convention on Climate Change (UNFCCC). The website has a dedicated section which contains information on the Agreement, its objectives and its implementation. Additionally, you can find news and updates related to the Agreement by searching online news sources.
<?? ???? ??>

? ChatGPT 3.5 API? ???? ?? ?? ??
Climate Agreement?
line news sources.
<?? ???? ??>
You can find information about the Paris Climate Agreement from various sources, including official documents, international
organizations, and reputable news outlets. Here are some suggestions on where to find information:
1.United Nations Framework Convention on Climate Change (UNFCCC) Website:
- The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary
source for official documents, updates, and information.
- Website: UNFCCC
2.Paris Agreement Text:
- The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and
provisions of the agreement.
- Paris Agreement Text
- …
ChatGPT? ??? ??
: ??? ????? ??
??? ?? + ??? ?? ?? ??
Good
Bad

? ChatGPT 3.5 API? ???? ?? ?? ??
Climate Agreement?
line news sources.
<?? ???? ??>
You can find information about the Paris Climate Agreement from various sources, including official documents, international
organizations, and reputable news outlets. Here are some suggestions on where to find information:
- The official website of the UNFCCC, the organization responsible for the Paris Climate Agreement, is a primary
source for official documents, updates, and information.
- Website: UNFCCC
- The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals, commitments, and
provisions of the agreement.
- …
ChatGPT? ??? ??
: ??? ????? ??
??? ?? + ??? ?? ?? ??
Good
Bad
??? ??? ???
??? ???��
????
? ??? ??

???, ??? ?? ???? ??
? ?? ?? ?? : Polyglot-1.3B, 3.8B, 5.8B, 12.8B
? ?? ??? : oig-instruction -> 4? ??? ??
? GPUs : A100 (80GB) * 8 ?
? ?? ?? : MS/DeepSpeed-Chat
? ColossalAI ? TRL(Transformer Reinforcement Learning)? ?? ??
? ColossalAI ? ?? ?? ?? ?? ????, 12.8B actor + 3.8B critic ???? ?? ???, ?? ??? ??? ??
? TRL ? Actor? Critic? Shared Architecture? ????? ?? ?? ? ???? Vanilla RLHF? ??? ??? ?? ??
? ?? ????? ???? ?? ??? ??? ?? ?? ??? ???? ??? ? ?????.. ? (?? ??/??? ??? ???..)
Stage 1 ?? Stage 2 ?? Stage 2 ?? PPO

Stage 1: SFT ??
? ?? ??? ??
? ??? ??? ???? + Template ??
? ?? ??? ?? Context Length =1024
? Epoch : LLaMA2, InstructGPT??? ??
2~3 Epoch
? LR 1e-5~5e-5 (AdamW, Cosine Scheduler)
? Zero-out loss on Prompt
? Polyglot-ko 1.3B~12.8B
? ??? Test ??? ????? ??
SFT
????! ??? ??? ??? ????:
��
### system: ???? ??? ?? ??? ??? ?????.
### ???: ??? ?? ???
### ??:
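The "Zero-out loss on Prompt" setting listed above can be illustrated with a minimal sketch. It assumes the common convention of marking prompt positions with an ignore label (`-100` here) so that only response tokens contribute to the cross-entropy loss. The helper names, token ids, and log-probabilities are hypothetical stand-ins for real tokenizer and model output, and the usual one-position label shift is omitted for brevity.

```python
import math

IGNORE_INDEX = -100  # assumed label value meaning "do not train on this position"

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask every prompt position in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_cross_entropy(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked (response) positions only.

    token_log_probs[i] is the log-probability the model assigns to the
    target token at position i (a stand-in for real model output).
    """
    kept = [-lp for lp, lab in zip(token_log_probs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

prompt_ids = [11, 12, 13]   # hypothetical ids for the template + question
response_ids = [21, 22]     # hypothetical ids for the answer
input_ids, labels = build_labels(prompt_ids, response_ids)

# The prompt log-probs are deliberately poor; they must not affect the loss.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5), math.log(0.25)]
loss = masked_cross_entropy(log_probs, labels)
print(labels, round(loss, 4))
```

Because the prompt is masked, the model is trained only to produce the answer, not to reproduce the template text.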

Stage 2: Reward Model ??
? GPTModel ?? Reward? ???? ?? Linear Layer ??
Pretrained GPT
Linear(hidden_dim, 1)
??: ??? ?? ?? ?? ??: ???
<Loss function ???>
EOS
pooling
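The reward head sketched on this slide (a pretrained GPT followed by `Linear(hidden_dim, 1)`, read out at the EOS position) can be illustrated in plain Python. This is a rough sketch under assumptions: `eos_pooled_reward` and `pairwise_loss` are hypothetical names, the hidden states are toy numbers rather than transformer output, and the pairwise ranking loss is the form commonly used for preference-trained reward models; the slide's own loss formula did not survive extraction.

```python
import math

def eos_pooled_reward(hidden_states, input_ids, eos_id, weight, bias=0.0):
    """Apply Linear(hidden_dim, 1) to the hidden vector at the first EOS token.

    Falls back to the last position when no EOS token is present.
    """
    pos = input_ids.index(eos_id) if eos_id in input_ids else len(input_ids) - 1
    return sum(w * h for w, h in zip(weight, hidden_states[pos])) + bias

def pairwise_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), in a numerically stable form (assumed loss)."""
    diff = reward_chosen - reward_rejected
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)

EOS_ID = 2                                   # hypothetical EOS token id
input_ids = [5, 6, 7, EOS_ID, 0]             # trailing 0 stands for padding
hidden = [[0.0] * 4 for _ in input_ids]      # toy hidden states, hidden_dim = 4
hidden[3] = [1.0, 2.0, 3.0, 4.0]             # vector at the EOS position
weight = [0.5] * 4                           # toy Linear(hidden_dim, 1) weights

good = eos_pooled_reward(hidden, input_ids, EOS_ID, weight)
print(good, pairwise_loss(good, 1.0))
```

The loss shrinks as the preferred answer's score pulls ahead of the rejected one, which is what drives the two mean scores apart during training.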

? ChatGPT? ?? "??" ??? ??? ?????? ?? ? ??(1 epoch)
? ?? ?? ? ?? ??
Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-1.3b 6.861068249 -5.620769024 0.9958334
polyglot-ko-3.8b 11.7929697 -3.29787612 0.9979167
polyglot-ko-5.8b 9.338997841 -8.786784172 0.9958334
???
?, ????

? ??? Overfitting ??
You can find information about the Paris Climate Agreement on
the website of the United Nations Framework Convention on
Climate Change (UNFCCC). The website has a dedicated section
which contains information on the Agreement, its objectives and
its implementation. Additionally, you can find news and updates
related to the Agreement by searching online news sources.
You can find information about the Paris Climate Agreement from various sources, including official
documents, international organizations, and reputable news outlets. Here are some suggestions on
where to find information:
- The official website of the UNFCCC, the organization responsible for the Paris Climate
Agreement, is a primary source for official documents, updates, and information.
- Website: UNFCCC
- The full text of the Paris Agreement is available on the UNFCCC website. It outlines the goals,
commitments, and provisions of the agreement.
- …
<?? ????? ??> <ChatGPT? ?? ??>
? 3??? ??? ?? ?? ??
? ?? ??? ??? ??? ?? ??? ??
??? ???? ?? ??
? ???? ??? ????/??? ?? ????? ?? ??

? ??? Overfitting ??
? ?? ??
1) ?? ???? ???? ?? ?? : ???? 17?? ???? ??
??: ????? ?? ??? ????? ??????
??? 1: ???, ????, ?? ?? ??? ????
? ?? ??? ?????. ? ??? ?? ??? ??
?? ??? ??? ?? ???? ????.
??? 2: ???? ??? ? ???! ?? ?? ?? ?
? ????????. ????? ???? ????
<??>

? ??? ???? ??, ??? ??? ??? ??? ?? ? ???? vs ????
? ???, ??? Generality? ?? ? ??? ??
Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-1.3b 0.6244 0.3636 0.5987
polyglot-ko-3.8b 1.50507 1.1352 0.6337
polyglot-ko-5.8b 0.99981 0.5761 0.6574
TF + ??? - - 0.6267
???? ?? ??
<???? ?? ??>

? Overfitting ?? ?? 2)
: ?? ?? ??? ??? ??? ?? ?? ??? ???
To ChatGPT : "??? ??? ??? ?? ???? ??? ???? ???. ??? ??? ?? ?? ??? ????? ???.
???? ? ?? ? ??? ?? ??? ????? ?? ??????. ?? ???? ??? ???."
??: ???? ?? ? ??? ? ?? ??? ??????
?? :
1. ????? ???? ??? ?????. ?? ?? ????? ??? ??? ??? ?? ?????.
2. ???? ?? ?? ?? ? ??? ??? ???? ?? ????. ???? ???? ??? ?? ? ? ???? ?? ? ????.
3. ???? ?? ????? ?? ?? ????. ???? ??? ???? ?? ?? ????? ??? ?? ?? ?? ?? ????.
4. ???? ??? ???? ?? ??? ?????. ??? ???? ??? ????? ????? ?? ? ???? ??? ? ???? ????
???? ?? ????.
5. ???? ? ?????? ??? ??? ?? ??? ?? ????. ??? ??? ?? ?? ?? ??? ???? ?? ?? ?? ? ????.
6. ???? ?? ???? ?? ?? ????. ???? ??? ??? ?? ?? ?? ??? ??? ?? ???? ?? ?? ??? ? ? ????.
7. ???? ???? ???? ? ?? ???? ?? ???? ????. ??? ???? ???? ??? ????? ?? ??? ? ???? ??
?? ??? ? ?? ??? ???? ???? ???.

Model
?? ??? ??
?? ??
?? ??? ??
?? ??
?? ?? ?? ???
polyglot-ko-3.8b 6.8610 -5.6207 0.9958
polyglot-ko-3.8b
+ opensource + lying 1.8448 -1.5232 0.85125
? ?????, ?? + ?? ??? + ? ??? = 1:1:1 ??? ???? ??
?? ???? ??
???? ??? ?? ???? ?? ??
???? ?? ???

Stage 3: PPO ??
https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training
? PPO : Actor + Critic ??
? ??? 4?? ??? ???;
? ??? ??(??) ? Inference 5? ???;;;

Stage 3: PPO ??
?? ???
? ???? ?? ???��
Reward Clipping? ?? ??
KL penalty? ??? ??
(Dynamic KL Penalty ???)
KL = 0.04, 0.025
KL = 0.01
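Reward clipping and the KL penalty mentioned on this slide are often combined as in the sketch below. This is one common formulation and not necessarily the exact one used in the experiments: every generated token receives a penalty proportional to the log-probability gap between the actor and the frozen SFT reference, and the clipped reward-model score is added on the final token. The function name and the clip value of 5.0 are assumptions; the default coefficient 0.04 is one of the KL values listed on the slide.

```python
def shaped_rewards(actor_logps, ref_logps, rm_score, kl_coef=0.04, clip=5.0):
    """Per-token rewards for PPO.

    -kl_coef * (log pi_actor - log pi_ref) on every generated token,
    plus the reward-model score, clipped to [-clip, clip], on the last token.
    """
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logps, ref_logps)]
    rewards[-1] += max(-clip, min(clip, rm_score))
    return rewards

# Toy log-probabilities for a two-token response.
actor_logps = [-1.0, -0.5]
ref_logps = [-1.5, -0.5]
rewards = shaped_rewards(actor_logps, ref_logps, rm_score=7.0)
print(rewards)
```

A larger `kl_coef` keeps the actor closer to the SFT model. A smaller one lets it chase the reward model more aggressively, which is where reward hacking tends to appear.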

Stage 3: PPO ??
? Reward Hacking ?? : ?? ???? ??? ??? Reward? ?? ?? ? ?? ??? ???
? ??!?
12.8B Actor +
5.8B RewardModel
??: ???? ??? ??? ? ? ????
?? :
1??) ?? ??? ?????.
2??) …
??: ?? ???? ???!
?? :
1??) ?? ??? ?????
2??) …
??: ? ??? ???
?? :
? ??? ??? ????.
1??) ??? ?????.
2??) …
?? ?? ??
??: ? ??? ???
?? :
__ __ __
?? ?? ?? ??
(?? ???? ???? ??)

Stage 3: PPO ??
? PPO? ?? ? ????? ???? ?? ????
? 1) ??? ?? ?? ?? ??. ??) 12.8B Actor + 3.8B Critic
? 2) Llama2? Rejection Sampling ??
Actor
??
??1
??2
??3
??4
??5
Reward
Model ??3
Best ?? ??
Supervised Learning(1 epoch)

Stage 3: PPO ??
? PPO ?? ??? 15000?? 5????, 4?? Rejection Sampling + 1?? PPO ??
3000? ??? ?? ??
X 4?
? 3?? ???? 4? Iteration
? ?? Iteration?? Actor? ?? Iteration?? ?? ? Actor
? 2?? Iteration??? diff(Best-Worst) > 2 ? ??? ?? (?? 800?) ? ???? ??? ???? ??, overfitting ??
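The rejection-sampling loop described on these slides (the actor produces several answers, the reward model scores them, the best one becomes a supervised target, with a `diff(Best-Worst) > 2` filter from the second iteration) can be sketched as follows. `rejection_sample`, `generate`, and `score` are hypothetical names, and the toy generator and scorer stand in for the actor and the reward model.

```python
import itertools

def rejection_sample(prompts, generate, score, n_samples=5, min_gap=None):
    """Keep the best-scoring answer per prompt.

    When min_gap is set, prompts whose best-minus-worst score does not
    exceed min_gap are dropped, mirroring the diff(Best-Worst) > 2 filter.
    """
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scored = sorted(((score(prompt, c), c) for c in candidates), key=lambda t: t[0])
        worst_score, best_score = scored[0][0], scored[-1][0]
        if min_gap is not None and best_score - worst_score <= min_gap:
            continue  # answers are too similar in quality to be informative
        dataset.append((prompt, scored[-1][1]))
    return dataset

# Toy stand-ins: canned answers and a "longer is better" scorer.
toy_answers = {"q1": ["a", "bb", "ccccc"], "q2": ["x", "yy", "zz"]}
streams = {p: itertools.cycle(v) for p, v in toy_answers.items()}
generate = lambda prompt: next(streams[prompt])
score = lambda prompt, answer: float(len(answer))

data = rejection_sample(["q1", "q2"], generate, score, n_samples=3, min_gap=2)
print(data)
```

The returned pairs would then be used for one epoch of supervised fine-tuning, and the resulting model becomes the actor for the next iteration.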

Stage 3: PPO ?? ??
? Rejection Sampling ? PPO? ?? ?? ?? ??
No RS
RS1
RS2(>2)
RS2(all)
RS3(>2)
RS4(>2)
RS4(>2) + PPO(KL 0.04)
RS4(>2) + PPO(KL 0.01)
No RS + PPO
Reward hacking /
Local optimum
Win Rate
(vs oig ?? ???)
(Reward Model ??)
?? ??? ?

Stage 3: PPO ?? ??
? Human Evaluation ??
[?? ??]
1) ???:
- ???? ?? ?? ????? (?? ??)
- ??? ??? ??? ?? ?????? (???)
- ?? ?? ?? ?? ???
- ??? ?? ?? ?? ?? ???
2) ???:
- ??? ??/??? ?? ?? ???? ??? ?? ??? ????
- ??? ?????
3) ???:
- ???? ??? ?????
??? ???? Win Rate
(Win = PPO win)
???, ??? -> 1~2% ??? ??
PPO vs SFT

??? Lessons Learned?
? ??? ????? .. ???? ?? ????. ?? ??? Quality? ? ?? ? SFT >> Reward Modeling > PPO
? Quality is all you need (Llama2, LIMA)
? SFT? ?? PPO? ????? ??? ????? ???. ??, ?? ??? ??? ?? ??? ???(ex. ??, ?? ? NLP Task ??)
? InstructGPT ?? NLP Task? PPO? ?? ?? ????, ???? ???? ??? ? ?????? ? Failed
? ???? Reward? ???? ? ?? ? ?? ??? ??? ?? ??
? Naïve ? Reward Modeling?? PPO? ??? ? ???? ?? ??? ?? ? ???? ? Nope!
? Rejection Sampling?? Iterative Learning? ?? ?? ???? ?? ?? ??? ??? ?? ? ??? ??? ?? ??
Reward
Reward
?? ?? ? ?? ???
?? ??

? Reward Model? ??? ???
? 1) Reward model? ??? Truthfulness? ?? ???? ???. -> ????? ??? ???
? 2) ??? 100?? ??? 100?? ???? ??? ?? ? ?? ??? ??? ????? ?? ? ??? ??
? 3) Reward ??? ??
? 4) ?? ?? Reward Model ??? Llama2??? 2? ???, ? ?? ???? ????
https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s

? ??? ?? ??/??? ?? ??
? SFT ?? ??? ????
? ??? 2~3? ?? ????
? ?? ??: DPO, Hydra-PPO, Offline-RL ? ?? ?? ??
? RLHF ???? ??? ? issue? ??… ??? ??
? ?? ??? Colossal AI ? TRL ? DeepSpeed? ???
? ?? ?? TRL? ?? ??? ..
https://www.youtube.com/watch?v=hhiLw5Q_UFg&t=2419s
?~~?? ????

Step 3: ???? ??? ??
ChatBot? ????? ?????
Agent
Reward
Action
State
Environment
?????? ???? ?? ????? ?????
https://luda.ai/
Agent
Environment
? Action: ?
? State: ?
? Reward: ?

Step 3: ???? ??? ??
ChatBot? ????? ?????
https://luda.ai/
Agent:
???(??)
Environment
? State: ????? ?? ????
? Action: ??? ?? ??? ??/??(??)
? ?? ?? Action Space == Vocabulary Size
? ?? ?? Action Space == Vocabulary Size * ? ??? ??
? Reward: Reward Model? State+Action? ??? ? ???? Scalar
Value

Step 3: ???? ??? ??
State, Action, Reward? Trajectory
https://luda.ai/
State
GPT
Action
Reward Model
Reward
State
GPT
Action
Reward Model
Reward
continue
<Episode 1> <Episode 2>
(state, action, reward) (state, action, reward) Training Data
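The (state, action, reward) trajectory in this diagram can be sketched as a simple rollout loop. The state is the token sequence so far, the action is the next token chosen by the policy, and the reward model maps state plus action to a scalar. `rollout`, the canned policy, and the toy reward model are hypothetical stand-ins for GPT and the trained Reward Model.

```python
def rollout(prompt, policy, reward_model, eos="<eos>", max_steps=8):
    """Collect one episode as a list of (state, action, reward) tuples."""
    state = list(prompt)
    trajectory = []
    for _ in range(max_steps):
        action = policy(state)                  # next token chosen by the actor
        reward = reward_model(state, action)    # scalar score for state + action
        trajectory.append((tuple(state), action, reward))
        state.append(action)                    # the new state includes the action
        if action == eos:
            break
    return trajectory

# Toy stand-ins: a canned policy and a reward model that only scores the final token.
canned = ["I", "am", "fine", "<eos>"]
prompt = ["how", "are"]
policy = lambda state: canned[len(state) - len(prompt)]
reward_model = lambda state, action: 1.0 if action == "<eos>" else 0.0

episode = rollout(prompt, policy, reward_model)
for step in episode:
    print(step)
```

Each tuple is one training example for the RL stage, and many such episodes make up the training data.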

Step 3: ???? ??? ??
???? ??? ??
? OpenAI Gymnasium? Lunar Lander
https://gymnasium.farama.org/environments/box2d/lunar_lander/, https://www.youtube.com/watch?v=U4vRW4fcXRA
??? ??? ?? ???? ???

Step 3: ???? ??? ??
ChatGPT? ??? ???? ???? ??(PPO)
? ??? ?? ????: Advantage Actor-Critic
Actor Network
? ?? ?? ??? ?? ???? ??
? Input: State
? Output: Action Probability Dist.
? Critic Network?? ?? ??? ?? Action?
?? ???? ???
? Actor? ??? ?? ??? ??
? Input: State
? output: ?? ????? ??(?? ??? ??? ????)
? ?? ??(???? ??)? ??? ??? ??? ??? ?
??? ??
Critic Network

Step 3: ???? ??? ??
Critic? ???
? Critic? ?? ????? ??? ?? Reward? ?? ???? ????
Actor Network Critic Network
Reward
vs
?? ??? ???? ??
?? ??? ??? ??? ? ?? Reward ??��?

Step 3: ???? ??? ??
Critic? ???
��
Action
Actor Network
??? ???? update
[???]
? High Variance
? ??? ??
? ?? ?? ??
+1 +1 -1 -1 +1
Reward
<High variance Reward ??>
? Critic? ????
Time 0 1 … T_end
T T+1
episode
reward

Step 3: ???? ??? ??
Critic? ???
��
Action
+1 +1 -1 -1 +1
Reward
? Critic ??
T T+1
Value +1.2 +1.3 -0.9 +0.3 +0.8 +0
Critic Network
? ?? S(t)?? ??? ?? ? Reward? ???? ??(value)? ?? ? V(s_t)
? ???, V(s_t) ? V(s_{t+1})? ??? ???? Actor network? ??

Step 3: ???? ??? ??
Critic? ???
��
Action
+1 +1 -1 -1 +1
Reward
? Critic ??
T T+1
Value +1.2 +1.3 -0.9 +0.3 +0.8 +0
Critic Network
V(s_t) V(s_{t+1})
Advantage = r + V(s_{t+1}) - V(s_t)
r
Actor Network
update
[??]
? Low Variance
? ? Step?? ??? ??
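As a worked example, the advantage formula on this slide can be applied to the reward and value rows from the diagram. The sketch uses no discount factor, matching the slide's formula; the function name is hypothetical.

```python
def one_step_advantages(rewards, values):
    """Advantage_t = r_t + V(s_{t+1}) - V(s_t), with no discounting.

    `values` needs one more entry than `rewards` (the value after the last step).
    """
    return [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]

rewards = [1, 1, -1, -1, 1]                  # Reward row from the slide
values = [1.2, 1.3, -0.9, 0.3, 0.8, 0.0]     # Value row from the slide
advantages = [round(a, 2) for a in one_step_advantages(rewards, values)]
print(advantages)  # [1.1, -1.2, 0.2, -0.5, 0.2]
```

Because each advantage depends on a single reward and two neighbouring critic estimates, the actor can be updated at every step, and the signal fluctuates less than the raw end-of-episode reward.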

Step 3: ???? ??? ??
????: Loss Clipping
https://jonathan-hui.medium.com/rl-proximal-policy-optimization-ppo-explained-77f014ec3f12
??? ???��
(Critic? ??)
?? ?? ?? ??!!
??? ???..!

Step 3: ???? ??? ??
Loss Clipping
https://huggingface.co/blog/deep-rl-ppo
?? Policy? ?? Policy? ??
r? ?? ??? ???? ??? Clipping
Critic? ?? ??? Reward(advantage)
? ChatGPT? ??? Proximal Policy Optimization(PPO) ????? 2017?? ??
? ??? ???? ??? ?? ??? ????? ???? ??? ????? ??(?)
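The loss clipping described here is the clipped surrogate objective from the 2017 PPO paper. For a single action it can be sketched as below, where the ratio r compares the new policy with the old one and the advantage comes from the critic. The function name and `eps=0.2` (the value commonly quoted for PPO) are assumptions, not settings taken from the slides.

```python
import math

def ppo_clip_objective(new_logp, old_logp, advantage, eps=0.2):
    """Clipped surrogate objective for one action (to be maximized).

    ratio = pi_new(a|s) / pi_old(a|s), clipped to [1 - eps, 1 + eps].
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# The policy made a good action 1.5x more likely: the gain is capped at 1.2.
capped = ppo_clip_objective(math.log(1.5), 0.0, advantage=1.0)
# The policy made a bad action 1.5x more likely: the full penalty is kept.
penalized = ppo_clip_objective(math.log(1.5), 0.0, advantage=-1.0)
print(round(capped, 3), round(penalized, 3))
```

Taking the minimum means the objective never rewards moving the policy further than the clip range in one update, while still penalizing moves in the wrong direction in full.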

Step 3: ???? ??? ??
[??]???? ???? ??
? Neural net ??? ?? ?? - Loss function
? ???? ???
? ??? ?? ?? ?? = Loss
? Loss? Gradient(��)? ?? ?,
Gradient Descent? ???? ????
? ?????? Loss???
? Loss? ???? Reward? ??
? Reward? ???? Gradient Ascent!!
???? �� ?ylog(?) ???? �� ? ?
?
??
Actions
Agent Network(Policy)
State

Step 3: ???? ??? ??
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
$E(\text{Reward} \mid \pi_\theta) = \sum_{a} P(a \mid s) \cdot R(s,a)$
?? s?? ?? a? ? ?? ?? ?? ?? Reward

Step 3: ???? ??? ??
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
$E(\text{Reward} \mid \pi_\theta) = \sum_{a} P(a \mid s) \cdot R(s,a) = J$
?? s?? ?? a? ? ?? ?? ?? ?? Reward
(??, return)
= Policy(Agent)
? ??? ???? ?? ??? ???? ??? J? ? ? ??? ??, Gradient Ascent!
$\theta_{new} = \theta + \alpha \cdot \nabla_\theta J$
$\theta_{new} = \theta + \alpha \cdot \nabla_\theta \sum_{a} P(a \mid s) \cdot R(s,a)$
? ? ?, ? ? ?? ?? ??? ? ?? ??, ??? ????
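Gradient ascent on the expected reward J can be tried on a toy problem. The sketch assumes a single state, three actions with known rewards, and a softmax policy whose parameters are the logits. For that parameterization the exact gradient is dJ/dθ_k = P(k) · (R(k) − J). The rewards, step size α, and function names are illustrative choices.

```python
import math

def softmax(logits):
    """Turn logits into action probabilities P(a|s)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(logits, rewards):
    """J = sum over actions of P(a|s) * R(s, a)."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))

def gradient(logits, rewards):
    """Exact gradient of J for a softmax policy: P(k) * (R(k) - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]

theta = [0.0, 0.0, 0.0]       # uniform policy to start with
rewards = [1.0, 0.0, -1.0]    # toy R(s, a) for three actions
alpha = 0.5                   # learning rate

history = [expected_reward(theta, rewards)]
for _ in range(50):
    grad = gradient(theta, rewards)
    theta = [t + alpha * g for t, g in zip(theta, grad)]   # ascent: add the gradient
    history.append(expected_reward(theta, rewards))

print(round(history[0], 3), round(history[-1], 3))
```

In a real language model the sum over actions cannot be enumerated exactly, so the gradient is estimated from sampled rollouts instead.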

Step 3: ???? ??? ??
[??] ???? ???? ??
? Reward? ?? ???? ???, Reward? ???(E)? ??? ??!
? ??? = ?? * ?? ?? ?
$E(\text{Reward} \mid \pi_\theta) = \sum_{a} P(a \mid s) \cdot R(s,a) = J$
?? s?? ?? a? ? ?? ?? ?? ?? Reward
? Action a? ?? ??? ???? Reward?
??? ??? ??.
? ??? roll out ???? ??? ?? ?? ??
? Neural Net?? ?????!
? ??(?, ?)? ????,
? ?��
= ? + ? ? ??? ?, ? ? ?
? ? = ?? reward? ??(?, ?)? ??

