LLM Alignment: A Complete Guide to SFT, RLHF, DPO, and GRPO
Introduction
Language model alignment (LLM Alignment) has become one of the most critical challenges in modern artificial intelligence. A pre-trained language model like GPT-4, Claude, or Llama, however powerful, is not naturally aligned with human intentions, ethical values, or desirable behaviors. Without alignment, these models can generate toxic content, provide dangerous instructions, or simply fail to properly follow user guidelines.
The Fundamental Problem
Imagine an ultra-intelligent assistant who knows everything about everything, but who doesn’t understand the difference between good and bad actions, who doesn’t know when to be concise or detailed, and who cannot distinguish a legitimate request from a harmful one. This is exactly the situation of an LLM after its pre-training: it can predict the next word with remarkable accuracy, but it has no notion of what is helpful, safe, or appropriate.
Why Alignment is Crucial
Without alignment, an LLM presents these problems:
- Doesn’t follow instructions: You ask for a short answer, it writes a 10-page essay
- Generates toxic content: Hateful, discriminatory, or offensive statements
- Hallucinates information: Invents facts with absolute confidence
- Gives dangerous advice: Instructions for making dangerous substances, encouragement of self-harm
- Lacks common sense: Responds literally without understanding social or ethical context
- Rejects legitimate requests: Too cautious and refuses to help on sensitive but legal topics
With alignment, the model becomes:
- ✅ Helpful: Responds precisely to what the user asks
- ✅ Honest: Admits when it doesn’t know, avoids hallucinations
- ✅ Harmless: Refuses dangerous requests, generates safe content
- ✅ Natural: Communicates like a helpful and empathetic human
What You Will Learn
This article is a complete and practical guide on the four major LLM alignment techniques:
- SFT (Supervised Fine-Tuning) - Learning by example
- RLHF (Reinforcement Learning from Human Feedback) - Reinforcement learning with human feedback
- DPO (Direct Preference Optimization) - Direct preference optimization
- GRPO (Group Relative Policy Optimization) - DeepSeek’s latest innovation
For each technique, we will explore:
- Fundamental concepts with concrete analogies
- Mathematical principles explained simply
- Practical implementations with complete code
- Advantages and disadvantages of each method
- Production applications with real examples
- Best practices from industry experience
Table of Contents
- Overview of the Alignment Pipeline
- Step 1: SFT - Supervised Fine-Tuning
- Step 2: RLHF - Reinforcement Learning from Human Feedback
- Step 3: DPO - Direct Preference Optimization
- Step 4: GRPO - Group Relative Policy Optimization
- Comparisons and Method Selection
- Production and Best Practices
Overview of the Alignment Pipeline
The Journey of an LLM: From Pre-training to Perfect Assistant
Understanding LLM alignment means understanding a multi-step journey that transforms a word prediction model into an intelligent and safe assistant.
Step 0: Pre-training (The Foundation)
What happens? The model learns to predict the next word by reading billions of web pages, books, articles, source code, etc.
Analogy: It’s like a child reading an entire national library without any parental supervision. It learns language, grammar, facts, but also inappropriate content, biases, and without any notion of what is right or wrong.
Result: A very intelligent but “raw” model that can complete any text, even toxic or dangerous.
Concrete example:
Prompt: "How to hack a bank account"
Pre-trained model: "How to hack a bank account? Here are the detailed steps..."
❌ PROBLEM: The model completes without ethical judgment
Step 1: SFT - Supervised Fine-Tuning (Basic Education)
What happens? We train the model on examples of “good” conversations created by human experts.
Analogy: It’s like sending the child to school with teachers who show them how to behave, how to respond politely, how to structure their answers.
Result: The model learns the instruction-response format and adopts a helpful conversation style.
Concrete example:
After SFT:
Prompt: "How to hack a bank account"
Model: "I cannot and will not provide information on illegal activities. If you forgot your password, contact your bank."
✅ BETTER: The model refuses and proposes a legal alternative
Step 2: RLHF - Reinforcement Learning (The Refinement)
What happens? Humans compare thousands of response pairs (which is better?). The model learns to optimize to maximize human preferences.
Analogy: It’s as if the child constantly received feedback: “This way of responding is better than that one.” They adjust their behavior to maximize approval.
Result: The model generates responses that better match subtle human preferences (tone, length, usefulness).
Step 3: DPO - Direct Optimization (The Simplification)
What happens? A simplified version of RLHF that achieves the same objectives with less complexity.
Analogy: Instead of a complex reward system, we directly show the model “This response is preferred to that one” and it adjusts its weights accordingly.
Result: Same quality as RLHF but more stable, faster, simpler to implement.
Step 4: GRPO - Group Optimization (The 2024 Innovation)
What happens? An even more efficient method that compares multiple responses in groups rather than in pairs.
Analogy: Instead of comparing A vs B, then B vs C, then A vs C (3 comparisons), we compare A, B, and C together (1 comparison) and rank from best to worst.
Result: More efficient in terms of data and computation than previous methods.
Summary Pipeline Diagram
┌─────────────────────────────────────────────────────────────┐
│ STEP 0: PRE-TRAINING (By OpenAI, Meta, etc.) │
│ Input: Billions of tokens from the Internet │
│ Output: Base model (GPT-4, Llama, etc.) │
│ Duration: Weeks/Months | Cost: Millions $ │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: SFT (Supervised Fine-Tuning) │
│ Input: 10K-100K examples of ideal conversations │
│ Output: Model that follows instructions │
│ Duration: Hours/Days | Cost: Hundreds/Thousands $ │
│ ✅ Model understands instruction-response format │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: RLHF (Reinforcement Learning) │
│ Input: 50K-300K human comparisons (A > B) │
│ Output: Model aligned with human preferences │
│ Duration: Days/Weeks | Cost: Thousands $ │
│ ✅ Model generates high-quality responses │
│ ❌ Complex, unstable, expensive │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ ALTERNATIVE: DPO (Direct Preference Optimization) │
│ Input: Same comparisons as RLHF │
│ Output: Same quality as RLHF │
│ Duration: Hours/Days | Cost: Hundreds $ │
│ ✅ Simpler, more stable, cheaper than RLHF │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2024 INNOVATION: GRPO (Group Relative Policy Opt.) │
│ Input: Group comparisons (A > B > C > D) │
│ Output: Quality superior to RLHF/DPO │
│ Duration: Hours/Days | Cost: Hundreds $ │
│ ✅ More efficient, better quality, less data │
└─────────────────────────────────────────────────────────────┘
Global Metaphor: Training a Personal Assistant
To better understand the complete pipeline, imagine that you’re training a personal assistant:
| Step | Metaphor | Concrete Analogy |
|---|---|---|
| Pre-training | Childhood | The child reads everything they can get their hands on, learns language but without discernment |
| SFT | School | Teachers teach them how to behave, respond politely, structure their thoughts |
| RLHF | Professional internship | A mentor constantly tells them “This way of doing is better than that one” until they internalize good practices |
| DPO | Accelerated coaching | An efficient coach directly shows them examples of good/bad behavior without detours |
| GRPO | Group mentoring | Several assistants are trained together, learning from each other more efficiently |
Step 1: SFT - Supervised Fine-Tuning
Definition and Objective
SFT (Supervised Fine-Tuning) is the first and most fundamental alignment step. It’s classic supervised learning where the model learns to imitate high-quality human demonstrations.
Main Objective: Transform a word prediction model into an instruction-following model.
Detailed Definition: SFT is a transfer learning technique that adapts a pre-trained language model to a specific downstream task by training it on carefully curated input-output pairs. Unlike traditional fine-tuning which may adapt the model to any task, SFT specifically focuses on teaching the model to:
- Understand Instructions: Parse and interpret user requests in natural language
- Follow Format Conventions: Respect conversational structures (system, user, assistant roles)
- Generate Appropriate Responses: Produce outputs that match the expected style, tone, and length
- Maintain Coherence: Stay on-topic and provide relevant information
- Exhibit Basic Safety: Avoid obviously harmful or inappropriate content
Complete Analogy: Imagine you hire a very intelligent assistant who is completely ignorant of social and professional conventions. SFT is like giving them a manual of 10,000 examples of successful conversations between an assistant and their boss, covering all possible types of requests:
- How to answer a factual question
- How to politely refuse an inappropriate request
- How to structure a complex explanation
- How to adapt tone according to context
- How to admit when you don’t know something
Your assistant reads all these examples, memorizes them, and learns to reproduce this communication style.
How Does SFT Work?
The Process in Detail
1. Collecting High-Quality Data
The quality of SFT depends entirely on the quality of training data. We collect ideal conversations created by human experts.
Standard Format:
{
"instruction": "The user's request or question",
"input": "Additional context (optional)",
"output": "The ideal response generated by a human expert"
}
Concrete Example:
{
"instruction": "Explain the concept of photosynthesis to an 8-year-old",
"input": "",
"output": "Photosynthesis is like the magic recipe plants use to make their food! Imagine that green leaves are little kitchens. They take three ingredients: sunlight (like electrical energy), water (that roots draw from the soil), and air (especially a gas called CO2). By mixing all that, leaves make sugar so the plant can grow, and as a bonus, they release oxygen that we breathe! That's why trees are so important for our planet."
}
Quality vs Quantity:
- ❌ Bad: 100,000 mediocre examples (copied from the Internet without verification)
- ✅ Good: 10,000 excellent examples (verified, corrected, diversified)
- 🏆 Optimal: 50,000 very high-quality examples
2. Formatting for Training
The model must learn the conversational format. We use special templates:
Alpaca Format (Popular, used by Stanford):
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}
ChatML Format (Used by OpenAI):
<|im_start|>system
You are a helpful, honest, and harmless assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>
Llama-2-Chat Format (Used by Meta):
<s>[INST] <<SYS>>
You are a helpful assistant who always responds accurately and respectfully.
<</SYS>>
{instruction} [/INST] {output} </s>
Why are these formats important?
They teach the model:
- 🎭 Roles: Who’s speaking? (system, user, assistant)
- 🔀 Transitions: When does a response begin and end
- 📏 Boundaries: Not to continue generating after finishing the response
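For illustration, here is a minimal sketch of how such a template can be applied programmatically instead of by hand. It assumes a recent version of transformers and a tokenizer that ships a chat template; the checkpoint name is only an example.
from transformers import AutoTokenizer

# Any model whose tokenizer defines a chat template works; this checkpoint is just an example
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful, honest, and harmless assistant."},
    {"role": "user", "content": "Explain the concept of photosynthesis to an 8-year-old"},
    {"role": "assistant", "content": "Photosynthesis is like the magic recipe plants use..."},
]

# apply_chat_template renders the conversation with the model's own special tokens
# (ChatML-style tags, [INST] markers, etc., depending on the model)
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)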
3. Training by Maximum Likelihood
The model is trained to maximize the probability of generating the ideal response.
Simplified Formula:
Loss = -log P(ideal_response | instruction, model)
What this means concretely:
- The model sees: Instruction + Ideal response
- It tries to predict each word of the response
- Each time it’s wrong, we adjust its weights
- After thousands of examples, it learns to reproduce the style
Analogy: It’s like learning to play piano by watching a maestro. You see their fingers (the instruction), you hear the music (the response), and you try to reproduce. At first, it’s approximate, but after thousands of repetitions, you play like them.
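As a concrete illustration, here is a minimal sketch of this loss for a single example (not the exact Trainer internals). One common convention, shown here, is to mask the prompt tokens with -100 so that only the response tokens contribute to the cross-entropy; the full script later in this article uses the simpler variant where all tokens are kept as labels.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, response):
    """Cross-entropy on the response tokens only (prompt tokens masked with -100)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt in the loss

    logits = model(input_ids=full_ids).logits

    # Shift so that the token at position t predicts the token at position t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )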
Visualization of the SFT Process
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: DATA COLLECTION │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Dataset of 50,000 examples │
│ │
│ Instruction 1 → Ideal response 1 │
│ Instruction 2 → Ideal response 2 │
│ ... │
│ Instruction 50K → Response 50K │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: SUPERVISED TRAINING │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ LLM Model (before SFT) │
│ "How to make a cake?" │
│ → "A cake is a dessert..." │
│ ❌ Incomplete response │
└──────────────────────────────────────┘
│
[Training]
│
▼
┌──────────────────────────────────────┐
│ LLM Model (after SFT) │
│ "How to make a cake?" │
│ → "Here's a simple recipe: │
│ 1. Preheat oven to 180°C │
│ 2. Mix 200g of flour..." │
│ ✅ Structured and useful response │
└──────────────────────────────────────┘
Practical SFT Implementation
Here’s a complete and functional example of SFT implementation with HuggingFace Transformers and PEFT (LoRA):
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, TaskType
# ============================================================================
# STEP 1: DATA PREPARATION
# ============================================================================
def format_instruction(example):
"""
Transform a raw example into conversational format
Input: {"instruction": "...", "output": "..."}
Output: "### Instruction:\n...\n### Response:\n..."
"""
instruction_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
return instruction_template.format(
instruction=example["instruction"],
output=example["output"]
)
# Load an SFT dataset (example: Alpaca)
dataset = load_dataset("yahma/alpaca-cleaned")
# Example data:
# {
# "instruction": "Give three tips for staying healthy.",
# "output": "1. Eat a balanced diet...\n2. Exercise regularly...\n3. Get enough sleep..."
# }
# Data formatting
def preprocess_function(examples):
    # Create the complete instruction + response text.
    # With batched=True, `examples` is a dict of lists, so zip the columns together
    texts = [
        format_instruction({"instruction": inst, "output": out})
        for inst, out in zip(examples["instruction"], examples["output"])
    ]
# Tokenization
model_inputs = tokenizer(
texts,
truncation=True,
max_length=512,
padding="max_length"
)
# Important: labels are the same as input_ids
# The model learns to predict each following token
model_inputs["labels"] = model_inputs["input_ids"].copy()
return model_inputs
# ============================================================================
# STEP 2: MODEL AND TOKENIZER LOADING
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf" # or "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Important for padding
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Memory saving
device_map="auto" # Automatic GPU distribution
)
# ============================================================================
# STEP 3: LORA CONFIGURATION (Efficient Training)
# ============================================================================
# LoRA allows training only 0.1% of model parameters
# instead of 100%, which drastically reduces memory and time
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM, # Causal language model
r=8, # Rank of LoRA matrices (higher = more capacity)
lora_alpha=32, # Scaling factor
lora_dropout=0.1, # Dropout for regularization
target_modules=["q_proj", "v_proj"] # Modules to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
# ============================================================================
# STEP 4: TRAINING DATA PREPARATION
# ============================================================================
# Dataset tokenization
tokenized_dataset = dataset.map(
preprocess_function,
batched=True,
remove_columns=dataset["train"].column_names
)
# Train/validation split
train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["train"].select(range(1000)) # 1000 examples for validation
# ============================================================================
# STEP 5: TRAINING CONFIGURATION
# ============================================================================
training_args = TrainingArguments(
output_dir="./sft-llama2-alpaca", # Save folder
# Training hyperparameters
num_train_epochs=3, # Number of epochs
per_device_train_batch_size=4, # Batch size per GPU
gradient_accumulation_steps=4, # Accumulation = effective batch of 16
# Optimization
learning_rate=2e-5, # Learning rate (critical!)
lr_scheduler_type="cosine", # Scheduler that progressively decreases
warmup_steps=100, # Progressive LR increase at start
# Saving and logging
logging_steps=10, # Log every 10 steps
save_steps=500, # Save every 500 steps
eval_steps=500, # Evaluate every 500 steps
save_total_limit=3, # Keep only 3 best checkpoints
# Memory optimizations
fp16=True, # Mixed precision (float16)
gradient_checkpointing=True, # Memory vs speed trade-off
# Evaluation
evaluation_strategy="steps",
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
# Miscellaneous
report_to="tensorboard", # Visualization with TensorBoard
push_to_hub=False # Don't publish to HuggingFace Hub
)
# ============================================================================
# STEP 6: TRAINING
# ============================================================================
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Launch training
print("🚀 Starting SFT training...")
trainer.train()
# Save final model
model.save_pretrained("./sft-llama2-final")
tokenizer.save_pretrained("./sft-llama2-final")
print("✅ Training complete!")
# ============================================================================
# STEP 7: INFERENCE WITH FINE-TUNED MODEL
# ============================================================================
def generate_response(instruction, model, tokenizer):
"""Generate a response with the SFT model"""
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256, # Maximum response length
temperature=0.7, # Controls creativity (0.0 = deterministic, 1.0 = creative)
top_p=0.9, # Nucleus sampling
do_sample=True, # Random sampling
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the response (after "### Response:")
response = response.split("### Response:")[-1].strip()
return response
# Usage example
instruction = "Explain the concept of machine learning to a 10-year-old."
response = generate_response(instruction, model, tokenizer)
print(f"Instruction: {instruction}")
print(f"Response: {response}")
What Happens Technically
1. Forward Pass (Prediction):
# The model receives: "### Instruction: Explain photosynthesis\n### Response: Photo"
# It must predict: "synthesis"
input_tokens = ["###", "Instruction", ":", "Explain", "photosynthesis", "\n", "###", "Response", ":", "Photo"]
predicted_next_token = model(input_tokens) # → "synthesis"
2. Loss Calculation (Error Calculation):
# For each token of the response, we calculate the error
true_tokens = ["Photo", "synthesis", "is", "a", "process", "..."]
predicted_probs = model.predict_all_tokens(input)
loss = -sum(log(predicted_probs[true_token]) for true_token in true_tokens)
# The worse the prediction, the higher the loss
3. Backward Pass (Update):
# We calculate gradients (how to change weights to reduce error)
gradients = compute_gradients(loss, model.parameters())
# We update model weights
for param in model.parameters():
param -= learning_rate * gradients[param]
4. Repetition: This process repeats for each example, until the model learns to generate quality responses.
Tracking Metrics
During SFT training, we monitor several metrics:
1. Training Loss:
- Measures how much the model is wrong on training data
- Objective: Should decrease progressively
- Example: Epoch 1 → loss=2.5, Epoch 2 → loss=1.2, Epoch 3 → loss=0.8
2. Validation Loss:
- Measures performance on unseen data
- Objective: Should follow training loss without diverging too much
- Alert: If validation loss rises while training loss decreases → Overfitting!
3. Perplexity:
- Measure of model “confusion”
- Formula: perplexity = exp(loss)
- Interpretation:
- Perplexity = 10 → The model hesitates between ~10 possible words
- Perplexity = 2 → The model is very sure (hesitates between 2 words)
- Lower = better
4. Gradient Norm:
- Size of weight updates
- Alert: If too high → Exploding gradients (instability)
- Solution: Gradient clipping (max_grad_norm=1.0); see the snippet after this list
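As a quick illustration (assuming the trainer and training_args objects from the SFT script above), both metrics are easy to obtain or control:
import math

# Perplexity is simply the exponential of the mean cross-entropy loss
eval_metrics = trainer.evaluate()              # returns a dict that includes "eval_loss"
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")

# Gradient clipping is controlled by TrainingArguments(max_grad_norm=1.0);
# the Trainer then clips the global gradient norm automatically at every step.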
Training Logs Example
Epoch 1/3
Step 100/5000 | Loss: 2.451 | Perplexity: 11.60 | LR: 1.2e-5 | Time: 2m 15s
Step 200/5000 | Loss: 2.103 | Perplexity: 8.19 | LR: 1.8e-5 | Time: 4m 30s
...
Epoch 1 Complete | Train Loss: 1.820 | Val Loss: 1.856 | Val Perplexity: 6.40
Epoch 2/3
Step 100/5000 | Loss: 1.245 | Perplexity: 3.47 | LR: 2.0e-5 | Time: 2m 10s
...
Epoch 2 Complete | Train Loss: 1.102 | Val Loss: 1.134 | Val Perplexity: 3.11
Epoch 3/3
Step 100/5000 | Loss: 0.892 | Perplexity: 2.44 | LR: 1.5e-5 | Time: 2m 08s
...
✅ Training Complete!
Final Train Loss: 0.785 | Final Val Loss: 0.823 | Final Val Perplexity: 2.28
Advantages of SFT
✅ Simple and Intuitive: Classic supervised learning, easy to understand and implement
✅ Efficient: Quickly transforms a base model into a usable assistant
✅ Controllable: You decide exactly the response style by creating examples
✅ Inexpensive: Compared to pre-training, SFT costs 100-1000x less
✅ Fast: Training in hours/days rather than weeks/months
Disadvantages of SFT
❌ Limited Quality: The model cannot exceed the quality of human examples
❌ Data Creation Cost: Creating 50,000 high-quality examples requires hundreds of hours of human work
❌ No Preference Optimization: The model imitates, but doesn’t truly understand what makes a response “better”
❌ Overfitting: Risk of memorizing examples rather than generalizing
❌ No Iterative Feedback: Once examples are created, no continuous improvement
Ideal Use Cases for SFT
🎯 When to use SFT alone:
- Domain Adaptation: You want the model to speak your industry’s jargon
- Example: Fine-tune GPT to become an expert in medical law
- Specific Format: You have a precise output format to respect
- Example: Always generate responses structured in JSON
- Limited Budget: You don’t have resources for RLHF/DPO
- SFT alone already gives excellent results for many applications
- Rapid Prototyping: You want to test an idea quickly
- SFT is the fastest way to create a functional first prototype
Real Examples:
- Alpaca (Stanford): LLaMA-7B + 52K SFT examples → Functional conversational assistant
- Vicuna: Fine-tuning LLaMA on 70K ShareGPT conversations → Quality close to GPT-3.5
- WizardLM: LLaMA + SFT with automatically generated complex instructions
Step 2: RLHF - Reinforcement Learning from Human Feedback
Definition and Motivation
RLHF (Reinforcement Learning from Human Feedback) is a revolutionary technique that transformed GPT-3 (good but imperfect) into ChatGPT (excellent assistant). It’s the method that launched the conversational AI revolution in 2022.
Detailed Definition: RLHF is a multi-stage alignment paradigm that combines supervised learning with reinforcement learning to optimize language models based on human preferences. Unlike SFT which learns from fixed examples, RLHF enables the model to discover and optimize for subtle human preferences that are difficult to capture through demonstrations alone.
Core Components:
- Reward Modeling: Learning a function that predicts human preferences from comparison data
- Policy Optimization: Using reinforcement learning (typically PPO) to maximize expected reward
- Constraint Satisfaction: Maintaining similarity to the original model through KL divergence penalties
Theoretical Framework: RLHF formulates alignment as a constrained optimization problem:
max_π E_{x~D,y~π(·|x)}[R(x,y)] - β·KL(π||π_ref)
Where:
- π is the policy (LLM) being optimized
- π_ref is the reference policy (initial SFT model)
- R(x,y) is the reward model score
- β is the KL penalty coefficient (typically 0.01-0.1)
- KL(π||π_ref) prevents the model from deviating too far from safe behavior
Why RLHF is Transformative: Traditional supervised learning can only teach a model to imitate; it cannot teach the model to understand why one response is better than another. RLHF bridges this gap by:
- Capturing Implicit Preferences: Humans can easily say “A is better than B” even when they can’t articulate why
- Optimizing Directly for Quality: The model learns to maximize human satisfaction, not just mimic examples
- Handling Subjective Criteria: Can encode preferences about style, tone, helpfulness that vary by context
- Enabling Iterative Improvement: The model can be continuously refined as preferences evolve
The Problem with SFT Alone:
Imagine you ask: “Explain general relativity to me”
Model after SFT could generate 3 different responses:
Response A (too technical):
“General relativity postulates that the space-time metric obeys Einstein’s equations Gμν = 8πTμν, where Gμν is the Einstein tensor and Tμν the energy-momentum tensor…”
Response B (too simplistic):
“Gravity is like when a ball falls. Einstein said space is curved.”
Response C (just perfect):
“Einstein’s general relativity explains gravity differently from Newton. Instead of a mysterious attracting force, Einstein proposes that massive objects (like the Sun) curve space-time around them, like a bowling ball placed on a trampoline deforms the fabric. Planets aren’t “attracted” by the Sun, they simply follow the “curves” of space-time. This is verified by many observations like light bending near massive objects.”
Question: How does the model know that C is better than A and B?
Answer: With SFT alone, it doesn’t! It generates grammatically correct responses, but without notion of relative quality. This is where RLHF comes in.
Analogy: Training a Chef
Imagine you’re training a chef:
Method 1 - SFT (Learning by Recipes):
- You give 10,000 perfect recipes to the chef
- They learn to cook by following these recipes
- Problem: When they invent a new dish, they don’t know if it’s good or not
Method 2 - RLHF (Learning by Tasting):
- The chef cooks several versions of a dish
- Gourmets taste them and say “This one is better than that one”
- The chef adjusts their cooking to maximize gourmets’ approval
- Advantage: They learn subtle preferences (flavor balance, presentation, etc.)
The RLHF Pipeline in 3 Steps
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECTING HUMAN COMPARISONS │
│ │
│ For the same question, generate multiple responses │
│ and ask humans to rank them │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Question: "Explain gravity" │
│ │
│ Response A: [Technical response] │
│ Response B: [Simple response] │
│ Response C: [Balanced response] │
│ Response D: [Incorrect response] │
│ │
│ Human ranks: C > A > B > D │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: TRAINING A REWARD MODEL │
│ │
│ Create a model that predicts which response a human │
│ would prefer (without needing the human) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Reward Model learns: │
│ │
│ Score(C) = 8.5/10 │
│ Score(A) = 6.2/10 │
│ Score(B) = 4.1/10 │
│ Score(D) = 1.3/10 │
│ │
│ ✅ Model can now evaluate │
│ any response │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: OPTIMIZATION BY REINFORCEMENT LEARNING │
│ │
│ Use Reward Model to train LLM to generate │
│ responses that maximize the score │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ LLM generates a response │
│ → Reward Model gives a score │
│ → If high score: reinforce │
│ → If low score: penalize │
│ │
│ After thousands of iterations: │
│ ✅ LLM learns to maximize │
│ score = human preferences │
└──────────────────────────────────────┘
Phase 1: Collecting Human Comparisons
Objective: Create a dataset of human preferences
Detailed Process:
- Generating Multiple Responses:
prompt = "What is artificial intelligence?" # Generate 4 different responses with the SFT model responses = [ model.generate(prompt, temperature=0.7) for _ in range(4) ] # responses = [ # "AI is the simulation of human intelligence...", # Response A # "AI is when computers think like us...", # Response B # "Artificial intelligence encompasses...", # Response C # "AI = intelligent robots", # Response D # ]
- Human Annotation:
Annotation interface:
┌───────────────────────────────────────────────────┐
│ Question: What is artificial intelligence?        │
│                                                   │
│ Rank these responses from best to worst:          │
│                                                   │
│ [ 1 ] Response C: "Artificial intelligence..."    │
│ [ 2 ] Response A: "AI is the simulation..."       │
│ [ 3 ] Response B: "AI is when..."                 │
│ [ 4 ] Response D: "AI = intelligent robots"       │
│                                                   │
│ [Validate ranking]                                │
└───────────────────────────────────────────────────┘
- Format of Collected Data:
comparison_data = {
    "prompt": "What is artificial intelligence?",
    "responses": [response_A, response_B, response_C, response_D],
    "ranking": [2, 0, 1, 3],  # Indices in preference order (C > A > B > D)
    "annotator_id": "human_123"
}
Required Volume:
- Minimum: 10,000 comparisons
- Recommended: 50,000 - 100,000 comparisons
- OpenAI for ChatGPT: ~300,000 comparisons
Human Cost:
- 1 comparison ≈ 2-5 minutes (reading + reflection + ranking)
- 50,000 comparisons ≈ 4,000 hours of human work
- At $20/hour ≈ $80,000 labeling cost
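These figures are simple arithmetic from the assumptions above (average time per comparison and hourly rate); a quick sanity check:
num_comparisons = 50_000
minutes_per_comparison = 4.8   # within the 2-5 minute range above
hourly_rate_usd = 20

total_hours = num_comparisons * minutes_per_comparison / 60   # ≈ 4,000 hours
total_cost = total_hours * hourly_rate_usd                    # ≈ $80,000

print(f"{total_hours:,.0f} hours of annotation, ≈ ${total_cost:,.0f} in labeling cost")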
Phase 2: Training the Reward Model
Objective: Create a model that can predict human preferences automatically
What is a Reward Model?
A Reward Model is a neural network that takes as input a pair (prompt, response) and returns a numerical score representing the quality of the response according to human preferences.
Analogy: It’s like creating a gastronomic AI critic. After observing thousands of human judgments on dishes, it learns to predict whether a dish will be appreciated or not, without needing a human each time.
Reward Model Architecture:
┌─────────────────────────────────────────────────────────────┐
│ INPUT │
│ Prompt: "Explain gravity" │
│ Response: "Gravity is a fundamental force..." │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TRANSFORMER ENCODER │
│ (Generally same architecture as LLM) │
│ Ex: LLaMA-7B, GPT-2, BERT │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ POOLING LAYER │
│ Extracts a fixed representation of the text │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LINEAR HEAD (Final layer) │
│ [hidden_dim] → [1] │
│ Transforms representation into a single score │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ Reward Score: 7.3 / 10 │
└─────────────────────────────────────────────────────────────┘
Training the Reward Model:
The Reward Model is trained with a ranking loss (Bradley-Terry model):
Mathematical Formula:
Loss = -log(σ(r(x, y_winner) - r(x, y_loser)))
Where:
- x = the prompt
- y_winner = the preferred response
- y_loser = the non-preferred response
- r(x, y) = score given by Reward Model
- σ = sigmoid function
Simple Explanation:
The model must learn to give a higher score to the preferred response than to the non-preferred response.
# Concrete example
prompt = "Explain photosynthesis"
response_good = "Photosynthesis is the process by which plants..."
response_bad = "Photosynthesis is when plants eat."
# The Reward Model must learn:
score_good = reward_model(prompt, response_good) # → should be high (e.g., 8.5)
score_bad = reward_model(prompt, response_bad) # → should be low (e.g., 2.1)
# Loss will be low if score_good >> score_bad
# Loss will be high if score_good ≈ score_bad (the model hasn't learned yet)
Practical Implementation of the Reward Model:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
class RewardModel(nn.Module):
def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
super().__init__()
# Load the base transformer
self.transformer = AutoModel.from_pretrained(model_name)
# Regression head to produce a single score
self.reward_head = nn.Linear(
self.transformer.config.hidden_size, # E.g., 4096 for Llama-2-7B
1 # Single output score
)
def forward(self, input_ids, attention_mask):
# Pass through the transformer
outputs = self.transformer(
input_ids=input_ids,
attention_mask=attention_mask
)
# Take the representation of the last token (EOS)
# Shape: [batch_size, hidden_size]
last_hidden_state = outputs.last_hidden_state
# Extract the representation of the last non-padding token
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = input_ids.shape[0]
last_token_hidden = last_hidden_state[
torch.arange(batch_size, device=input_ids.device),
sequence_lengths
]
# Calculate the reward score
reward_score = self.reward_head(last_token_hidden).squeeze(-1)
# Shape: [batch_size]
return reward_score
# ============================================================================
# REWARD MODEL TRAINING
# ============================================================================
def train_reward_model(comparison_dataset, model, tokenizer, epochs=3):
"""
Trains the reward model on human comparisons
Args:
comparison_dataset: List of {prompt, response_chosen, response_rejected}
model: RewardModel to train
tokenizer: Corresponding tokenizer
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch in comparison_dataset:
# batch = {
# 'prompt': "Explain photosynthesis",
# 'chosen': "Photosynthesis is the process...",
# 'rejected': "Photosynthesis is when..."
# }
# Tokenize the (prompt, chosen response) pairs
chosen_inputs = tokenizer(
batch['prompt'] + " " + batch['chosen'],
return_tensors="pt",
padding=True,
truncation=True,
max_length=512
)
# Tokenize the (prompt, rejected response) pairs
rejected_inputs = tokenizer(
batch['prompt'] + " " + batch['rejected'],
return_tensors="pt",
padding=True,
truncation=True,
max_length=512
)
# Calculate scores
reward_chosen = model(
input_ids=chosen_inputs['input_ids'],
attention_mask=chosen_inputs['attention_mask']
)
reward_rejected = model(
input_ids=rejected_inputs['input_ids'],
attention_mask=rejected_inputs['attention_mask']
)
# Loss: we want reward_chosen > reward_rejected
# Bradley-Terry loss
loss = -torch.log(
torch.sigmoid(reward_chosen - reward_rejected)
).mean()
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(comparison_dataset)
print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_loss:.4f}")
return model
# Usage example
reward_model = RewardModel("gpt2") # Use GPT-2 for the example (lightweight)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# Comparison dataset (simplified)
comparison_data = [
{
'prompt': "What is AI?",
'chosen': "Artificial intelligence is a field of computer science that aims to create systems capable of performing tasks that normally require human intelligence.",
'rejected': "AI is smart robots."
},
# ... 50,000 other examples
]
trained_reward_model = train_reward_model(comparison_data, reward_model, tokenizer)
Reward Model Validation:
How do you know if the Reward Model has learned well?
def evaluate_reward_model(model, tokenizer, test_pairs):
"""
Evaluates the accuracy of the reward model on test pairs
test_pairs = [
{'prompt': "...", 'chosen': "...", 'rejected': "..."},
...
]
"""
model.eval()
correct_predictions = 0
with torch.no_grad():
for pair in test_pairs:
# Calculate scores
chosen_inputs = tokenizer(
pair['prompt'] + " " + pair['chosen'],
return_tensors="pt",
truncation=True,
max_length=512
)
rejected_inputs = tokenizer(
pair['prompt'] + " " + pair['rejected'],
return_tensors="pt",
truncation=True,
max_length=512
)
reward_chosen = model(**chosen_inputs)
reward_rejected = model(**rejected_inputs)
# Did the model give a higher score to the chosen response?
if reward_chosen > reward_rejected:
correct_predictions += 1
accuracy = correct_predictions / len(test_pairs)
print(f"Reward Model Accuracy: {accuracy:.2%}")
return accuracy
# A good Reward Model generally achieves 70-80% accuracy
Phase 3: Reinforcement Learning Optimization (PPO)
Objective: Use the Reward Model to fine-tune the LLM to generate better responses
This is the most complex and powerful phase of RLHF. An algorithm called PPO (Proximal Policy Optimization) developed by OpenAI is used.
Analogy: Imagine a student (the LLM) who must write essays. They now have an automatic teacher (the Reward Model) who grades each essay instantly. The student writes an essay, receives a grade, and learns to adjust their style to get better grades. After thousands of essays, they become excellent.
The PPO Process in Detail:
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: RESPONSE GENERATION │
└─────────────────────────────────────────────────────────────┘
│
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ LLM Policy │ │ LLM Reference│
│ (learning) │ Generates │ (Frozen SFT) │
│ │ response │ │
└──────────────┘ └──────────────┘
│ │
│ "Photosynthesis is..." │
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: EVALUATION BY REWARD MODEL │
└─────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────┐
│ Reward Model │
│ gives a score │
│ Reward = 7.5/10 │
└────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: KL PENALTY CALCULATION │
│ (To prevent the model from drifting too far from SFT) │
└─────────────────────────────────────────────────────────────┘
│
▼
KL_penalty = KL_divergence(
P_policy(response | prompt),
P_reference(response | prompt)
)
Final_reward = Reward - β * KL_penalty
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: LLM UPDATE BY PPO │
│ Adjust weights to maximize Final_reward │
└─────────────────────────────────────────────────────────────┘
Why is the KL Penalty Crucial?
Without this penalty, the model could “cheat” the Reward Model by generating nonsensical responses that artificially maximize the score. The KL penalty forces the model to stay close to the initial SFT model.
Example of Drift without KL Penalty:
Prompt: "Explain photosynthesis"
Without KL penalty:
Generated response: "PHOTOSYNTHESIS! INCREDIBLE! MAGNIFICENT! Plants are EXCEPTIONAL!!!!!!!"
→ Reward Model score: 9.5/10 (it learned that enthusiasm is good)
❌ PROBLEM: The response is useless but games the system
With KL penalty:
Generated response: "Photosynthesis is the biological process..."
→ Reward Model score: 8.0/10
→ KL penalty: 0.5 (close to SFT model)
→ Final reward: 8.0 - 0.01 * 0.5 = 7.995
✅ BETTER: Useful and natural response
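To make the role of the penalty concrete, here is a minimal sketch of how a per-sequence KL estimate is combined with the reward model score (illustrative, not TRL's exact internals); policy_logprobs and ref_logprobs are assumed to be the token log-probabilities of the generated response under the two models:
import torch

def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """
    Combine the reward model score with a KL penalty toward the reference model.

    policy_logprobs, ref_logprobs: tensors of shape [response_length] holding
    log π_θ(token) and log π_ref(token) for each generated token.
    """
    # Per-token KL estimate: log π_θ(y_t | ...) - log π_ref(y_t | ...)
    kl_per_token = policy_logprobs - ref_logprobs
    kl_penalty = kl_per_token.sum()

    # The reward that PPO actually optimizes
    return reward_score - beta * kl_penalty

# Toy example: a response the reward model loves but that drifts far from the SFT model
reward_score = torch.tensor(9.5)
policy_logprobs = torch.tensor([-0.1, -0.2, -0.1])  # policy is very confident
ref_logprobs = torch.tensor([-3.0, -4.0, -3.5])     # reference finds it very unlikely
print(kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs))  # the drift lowers the final reward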
Simplified PPO Implementation for RLHF:
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# ============================================================================
# PPO CONFIGURATION
# ============================================================================
ppo_config = PPOConfig(
model_name="meta-llama/Llama-2-7b-hf",
learning_rate=1.41e-5, # Learning rate (very important)
batch_size=64, # Number of examples per batch
mini_batch_size=4, # Mini-batch for PPO
gradient_accumulation_steps=16, # Gradient accumulation
# PPO hyperparameters
ppo_epochs=4, # Number of passes over each batch
# KL penalty
init_kl_coef=0.2, # Initial coefficient of KL penalty
target_kl=0.1, # Target KL divergence
# Clipping (PPO stability)
cliprange=0.2, # PPO clipping ratio
cliprange_value=0.2, # Clipping for value function
# Values
vf_coef=0.1, # Coefficient of value function loss
# Logging
log_with="tensorboard",
tracker_project_name="rlhf-llama2"
)
# ============================================================================
# MODEL LOADING
# ============================================================================
# 1. Policy Model (model to train)
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
"meta-llama/Llama-2-7b-sft", # Model after SFT
torch_dtype=torch.float16
)
# 2. Reference Model (frozen, for KL penalty)
ref_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-sft", # Same as policy but frozen
torch_dtype=torch.float16
)
ref_model.eval() # Evaluation mode (no gradient)
# 3. Reward Model (already trained)
# RewardModel is the custom nn.Module defined earlier, so rebuild it and load
# the trained weights (the checkpoint path here is illustrative)
reward_model = RewardModel("meta-llama/Llama-2-7b-hf")
reward_model.load_state_dict(torch.load("./trained_reward_model/reward_model.pt"))
reward_model.eval()
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
# ============================================================================
# PPO TRAINER CREATION
# ============================================================================
ppo_trainer = PPOTrainer(
config=ppo_config,
model=policy_model,
ref_model=ref_model,
tokenizer=tokenizer
)
# ============================================================================
# RLHF TRAINING LOOP
# ============================================================================
def rlhf_training_loop(ppo_trainer, reward_model, prompts, num_iterations=1000):
"""
Main RLHF training loop
Args:
ppo_trainer: Configured PPO trainer
reward_model: Trained reward model
prompts: List of prompts for generation
num_iterations: Number of iterations
"""
for iteration in range(num_iterations):
# ====================================================================
# STEP 1: RESPONSE GENERATION
# ====================================================================
# Select a batch of prompts
batch_prompts = sample_prompts(prompts, batch_size=ppo_config.batch_size)
# Tokenize prompts
inputs = tokenizer(
batch_prompts,
return_tensors="pt",
padding=True,
truncation=True
).to(policy_model.device)
# Generate responses with the policy model
with torch.no_grad():
response_tensors = ppo_trainer.generate(
inputs['input_ids'],
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True
)
# Decode responses
responses = tokenizer.batch_decode(
response_tensors,
skip_special_tokens=True
)
# ====================================================================
# STEP 2: REWARD CALCULATION
# ====================================================================
rewards = []
for prompt, response in zip(batch_prompts, responses):
# Calculate score with the reward model
reward_inputs = tokenizer(
prompt + " " + response,
return_tensors="pt",
truncation=True,
max_length=512
).to(reward_model.device)
with torch.no_grad():
reward_score = reward_model(**reward_inputs)
rewards.append(reward_score.cpu())
rewards = torch.tensor(rewards)
# ====================================================================
# STEP 3: PPO UPDATE
# ====================================================================
# The PPOTrainer automatically handles:
# - KL divergence calculation with ref_model
# - Application of KL penalty
# - PPO optimization
# - Clipping
stats = ppo_trainer.step(
queries=inputs['input_ids'],
responses=response_tensors,
scores=rewards
)
# ====================================================================
# LOGGING
# ====================================================================
if iteration % 10 == 0:
print(f"Iteration {iteration}/{num_iterations}")
print(f" Mean Reward: {rewards.mean().item():.3f}")
print(f" Mean KL: {stats['objective/kl']:.3f}")
print(f" Policy Loss: {stats['ppo/loss/policy']:.3f}")
# Generation example
example_prompt = batch_prompts[0]
example_response = responses[0]
print(f"\n Example:")
print(f" Prompt: {example_prompt}")
print(f" Response: {example_response}\n")
return policy_model
# Launch training
prompts_dataset = load_prompts("./rlhf_prompts.json") # Your prompts
final_model = rlhf_training_loop(ppo_trainer, reward_model, prompts_dataset)
# Save the final model
final_model.save_pretrained("./rlhf-llama2-final")
What Happens Under the Hood of PPO:
PPO is a sophisticated reinforcement learning algorithm. Here’s a simplified explanation:
1. Advantage Calculation:
# The advantage measures "how much better this action is than average"
advantage = reward - baseline
# Example:
# Response reward = 8.5
# Baseline (average of recent rewards) = 7.0
# Advantage = 8.5 - 7.0 = 1.5
# → This response is better than average, we want to reinforce it
2. Probability Ratio Calculation:
# We compare the probability of generating this response now vs before
ratio = P_new(response | prompt) / P_old(response | prompt)
# If ratio > 1: The model now generates this response more easily
# If ratio < 1: The model now generates this response less easily
3. PPO Clipping (Main innovation of PPO):
# We limit overly abrupt changes
clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon) # epsilon = 0.2
# We take the minimum between the clipped and unclipped version
objective = min(ratio * advantage, clipped_ratio * advantage)
# Why? To prevent the model from changing too quickly and becoming unstable
4. Final Loss:
loss = -objective + vf_coef * value_loss - entropy_bonus
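Putting these pieces together, here is a minimal sketch of the clipped policy objective on its own (per-sequence, leaving out the value and entropy terms that PPOTrainer handles internally):
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, epsilon=0.2):
    """
    Clipped PPO surrogate loss.

    new_logprobs / old_logprobs: log π_new(y|x) and log π_old(y|x) for each sample
    advantages: reward - baseline for each sample
    """
    # Probability ratio π_new / π_old, computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the pessimistic (minimum) objective, then negate it to get a loss
    return -torch.min(unclipped, clipped).mean()

# Toy example
new_logprobs = torch.tensor([-1.0, -2.5])
old_logprobs = torch.tensor([-1.2, -2.0])
advantages = torch.tensor([1.5, -0.5])
print(ppo_policy_loss(new_logprobs, old_logprobs, advantages))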
RLHF Monitoring Metrics
During RLHF training, several critical metrics are monitored:
1. Mean Reward:
- Average reward given by the Reward Model
- Goal: Should increase gradually
- Example: Iteration 0 → 5.2, Iteration 500 → 7.8, Iteration 1000 → 8.5
2. KL Divergence:
- Measures how far the policy model has drifted from the reference model
- Goal: Should remain low (< 0.1 - 0.5)
- Alert: If KL > 1.0 → Model is drifting too much, risk of mode collapse
3. Policy Loss:
- Loss of the PPO objective
- Goal: Should decrease but not too quickly
4. Value Loss:
- Error of the value function (predicts future reward)
- Goal: Should converge toward 0
Example Logs:
Iteration 100/1000
Mean Reward: 6.234
Mean KL: 0.045
Policy Loss: -0.123
Value Loss: 0.234
Entropy: 5.678
Example:
Prompt: "Explain gravity"
Response: "Gravity is the attractive force between two massive objects..."
Reward: 7.2
Iteration 200/1000
Mean Reward: 7.123 ⬆
Mean KL: 0.089
Policy Loss: -0.245
Value Loss: 0.156
Entropy: 5.234
Iteration 1000/1000
Mean Reward: 8.567 ⬆⬆
Mean KL: 0.123
Policy Loss: -0.456
Value Loss: 0.045
Entropy: 4.890
✅ Training Complete!
Advantages of RLHF
✅ Exceptional Quality: Produces state-of-the-art quality models (ChatGPT, Claude, GPT-4)
✅ Learns Subtle Preferences: Captures nuances that SFT alone cannot (tone, style, length)
✅ Direct Optimization: Explicitly maximizes what humans prefer
✅ Flexibility: Can be applied to different objectives (utility, safety, creativity)
✅ Continuous Improvement: Can be iterated multiple times to gradually improve
Disadvantages of RLHF
❌ Extremely Complex: Difficult implementation with many sensitive hyperparameters
❌ Instability: PPO is notoriously unstable, risks of divergence, mode collapse
❌ Computational Cost: Requires 3 models in memory simultaneously (policy, reference, reward)
❌ Training Time: Much longer than SFT (days/weeks)
❌ Labeling Cost: Very expensive to create 50K-300K human comparisons
❌ Critical Hyperparameters: KL penalty, learning rate, clipping range must be finely tuned
❌ Requires Expertise: Few teams truly master RLHF in production
Real-World Examples of RLHF Usage
ChatGPT (OpenAI):
- GPT-3.5 → SFT on ~13K demonstrations → RLHF on ~300K comparisons
- Result: Radical transformation of GPT-3 into ChatGPT
Claude (Anthropic):
- Uses RLHF + Constitutional AI (RLAIF)
- Focus on safety and alignment with human values
GPT-4 (OpenAI):
- Massive RLHF with domain experts
- Better quality and safety than GPT-3.5
Step 3: DPO - Direct Preference Optimization
The 2023 Revolution: Simplifying RLHF
DPO (Direct Preference Optimization) is a major innovation published by Stanford researchers (Rafailov et al., 2023) in May 2023. It’s a method that achieves the same results as RLHF but with drastically reduced complexity.
Detailed Definition: DPO is an elegant reformulation of the RLHF objective that eliminates the need for explicit reward modeling and reinforcement learning. Instead of the traditional two-stage process (train reward model → optimize policy with RL), DPO directly optimizes the policy on preference data in a single supervised learning phase.
Key Theoretical Insight: The breakthrough of DPO comes from a mathematical reparameterization. The authors proved that the optimal policy π* for the RLHF objective can be expressed analytically in terms of the reward function:
π*(y|x) = π_ref(y|x) · exp(R(x,y)/β) / Z(x)
Where Z(x) is a partition function. By inverting this relationship, we can express the reward in terms of policies:
R(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)
The DPO Transformation: Substituting this reward expression into the Bradley-Terry preference model and observing that the partition function cancels out in preference comparisons, we obtain the DPO loss:
L_DPO(π_θ) = -E_{(x,y_w,y_l)~D}[log σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) - β·log(π_θ(y_l|x)/π_ref(y_l|x)))]
This loss function:
- Increases the likelihood of preferred responses relative to the reference model
- Decreases the likelihood of rejected responses
- Implicitly maintains the KL constraint through the log-ratio formulation
- Requires no reward model: The preference signal is directly encoded in the loss
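Written out in code, this loss is only a few lines. Here is a minimal from-scratch sketch; the four log-probability arguments are assumed to be the summed token log-probabilities of the chosen/rejected responses under the policy and reference models:
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Direct Preference Optimization loss.

    Each argument is a tensor of shape [batch] with the summed log-probability
    of the chosen / rejected response under the policy or reference model.
    """
    # Implicit rewards: β·log(π_θ(y|x) / π_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log σ(reward margin): push the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0]),
    policy_rejected_logps=torch.tensor([-15.0]),
    ref_chosen_logps=torch.tensor([-13.0]),
    ref_rejected_logps=torch.tensor([-14.0]),
)
print(loss)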
Why DPO Works: By optimizing this loss, we’re simultaneously:
- Teaching the model which responses are better (preference learning)
- Preventing mode collapse through the reference model constraint
- Avoiding the instabilities of RL optimization
- Achieving the same optimum as RLHF but through a simpler path
Practical Advantages:
- Single-stage training: No need to train a separate reward model
- Stable optimization: Standard gradient descent, no RL instabilities
- Memory efficient: Only 2 models in memory instead of 3
- Faster convergence: Typically 2-3x faster than PPO-based RLHF
- Easier debugging: Direct supervision signal, easier to diagnose problems
The Problem with RLHF
Recap of RLHF complexity:
RLHF = Train Reward Model + Optimize with PPO + 3 models in memory
Problems:
❌ Separate Reward Model to train (expensive)
❌ PPO unstable and difficult to tune
❌ KL penalty delicate to calibrate
❌ 3 models loaded simultaneously (huge memory)
❌ Many sensitive hyperparameters
The Brilliant Idea of DPO
Question: Is it possible to optimize directly on human preferences without going through a Reward Model and PPO?
Answer: YES! That’s exactly what DPO does.
Analogy: Instead of creating an automatic teacher (Reward Model) and then using their grades to guide the student (PPO), we show the student directly pairs of examples: “This essay is better than that one. Adjust yourself to produce more often essays like the first one.”
How Does DPO Work?
Mathematical Intuition
RLHF tries to maximize:
reward(x, y) - β * KL(π_θ || π_ref)
Where:
- reward(x, y) = Reward Model score
- KL = divergence between policy and reference
- β = penalty coefficient
DPO realizes that this optimization can be rewritten in a way that no longer requires an explicit Reward Model!
The Magic Formula of DPO:
Loss_DPO = -log σ(β * log(π_θ(y_w | x) / π_ref(y_w | x)) - β * log(π_θ(y_l | x) / π_ref(y_l | x)))
Where:
- y_w = preferred response (winner)
- y_l = rejected response (loser)
- π_θ = policy model (being trained)
- π_ref = reference model (frozen)
- β = hyperparameter (generally 0.1-0.5)
- σ = sigmoid function
Simple Explanation:
- **π_θ(y_w | x) / π_ref(y_w | x)**: Probability ratio that the current model generates the good response vs the reference model
  - If > 1: The model generates the good response more easily than before ✅
  - If < 1: The model generates the good response less easily than before ❌
- **π_θ(y_l | x) / π_ref(y_l | x)**: Ratio for the bad response
  - If > 1: The model generates the bad response more easily than before ❌
  - If < 1: The model generates the bad response less easily than before ✅
- Objective: Maximize the ratio for y_w and minimize the ratio for y_l
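This formula translates almost line-for-line into code. Here is a minimal sketch of the DPO loss, assuming you already have the summed per-sequence log-probabilities of each response under the policy and the reference model (variable names are illustrative, not tied to any library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Minimal DPO loss from per-sequence log-probabilities.
    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the full response under the given model.
    """
    # log(pi_theta(y_w|x) / pi_ref(y_w|x)) and the same for y_l
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Loss_DPO = -log sigmoid(beta * (chosen_logratio - rejected_logratio))
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, useful for logging the "reward margin" during training
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    return loss, chosen_reward, rejected_reward

# Toy usage with synthetic log-probabilities
base = torch.randn(8)
loss, r_w, r_l = dpo_loss(base - 1.0, base - 2.0, base - 1.5, base - 1.5)
print(loss.item(), (r_w > r_l).float().mean().item())
```

Conceptually, this is what TRL's DPOTrainer (used in the implementation further below) computes for you, on top of the tokenization and batching details.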
Visualization of the DPO Process:
┌─────────────────────────────────────────────────────────────┐
│ INPUT: A Human Comparison │
│ Prompt: "Explain gravity" │
│ y_win: "Gravity is the curvature of spacetime..." │
│ y_lose: "Gravity is when things fall" │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: Calculate P(y_win | prompt) with π_θ and π_ref │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Policy Model │ │ Ref Model │
│ P_θ(y_win) │ │ P_ref(y_win) │
│ = 0.024 │ │ = 0.012 │
└──────────────┘ └──────────────┘
│ │
└─────────────┬─────────────┘
│
▼
Ratio_win = 0.024 / 0.012 = 2.0
→ The model generates the good response
2x more easily than before ✅
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: Calculate P(y_lose | prompt) │
└─────────────────────────┬───────────────────────────────────┘
│
Ratio_lose = P_θ(y_lose) / P_ref(y_lose)
= 0.008 / 0.015 = 0.53
→ The model generates the bad response
2x less easily than before ✅
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: Calculate DPO Loss │
│ Loss = -log σ(β * log(2.0) - β * log(0.53)) │
│ = -log σ(0.3 * 0.69 - 0.3 * (-0.63)) │
│ = -log σ(0.207 + 0.189) = -log σ(0.396) │
│ = -log(0.598) = 0.514 │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: Backpropagation │
│ Update π_θ to reduce loss │
│ → Increase P_θ(y_win) │
│ → Decrease P_θ(y_lose) │
└─────────────────────────────────────────────────────────────┘
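The arithmetic in STEP 3 (the numbers assume β = 0.3) can be checked in a few lines of plain Python; `win_ratio` and `lose_ratio` are the two probability ratios computed in STEPS 1 and 2:

```python
import math

beta = 0.3                    # the value used in the diagram above
win_ratio = 0.024 / 0.012     # P_theta(y_win) / P_ref(y_win) = 2.0
lose_ratio = 0.008 / 0.015    # P_theta(y_lose) / P_ref(y_lose) ≈ 0.53

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

margin = beta * math.log(win_ratio) - beta * math.log(lose_ratio)
loss = -math.log(sigmoid(margin))
print(f"margin = {margin:.3f}, loss = {loss:.3f}")  # ≈ margin 0.397, loss 0.514
```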
Practical Implementation of DPO
Here’s a complete and functional implementation with HuggingFace TRL:
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
from peft import LoraConfig, get_peft_model
# ============================================================================
# STEP 1: DATA PREPARATION
# ============================================================================
# Format required for DPO:
# {
# 'prompt': "The question or instruction",
# 'chosen': "The preferred response",
# 'rejected': "The rejected response"
# }
# Example with a public dataset
dataset = load_dataset("Anthropic/hh-rlhf") # Human preference dataset
# Transform to the right format
def format_for_dpo(example):
    """
    Transforms the Anthropic format into the standard DPO format.
    hh-rlhf only provides the full 'chosen' and 'rejected' dialogues,
    so the shared prompt is everything up to the last "Assistant:" turn.
    """
    separator = "\n\nAssistant:"
    idx = example['chosen'].rfind(separator)
    return {
        'prompt': example['chosen'][: idx + len(separator)],
        'chosen': example['chosen'][idx + len(separator):],
        'rejected': example['rejected'][idx + len(separator):]
    }
train_dataset = dataset['train'].map(format_for_dpo)
eval_dataset = dataset['test'].select(range(1000)).map(format_for_dpo)
# Example data:
# {
# 'prompt': "Human: How to make a chocolate cake?\n\nAssistant:",
# 'chosen': "Here's a simple recipe: 1. Preheat...",
# 'rejected': "A chocolate cake is simple..."
# }
# ============================================================================
# STEP 2: MODEL LOADING
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf" # Or your model after SFT
# Model to train (policy)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Reference model (will be automatically created by DPOTrainer)
# It's a frozen copy of the initial model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# ============================================================================
# STEP 3: LORA CONFIGURATION (Optional but Recommended)
# ============================================================================
# LoRA for efficient training
lora_config = LoraConfig(
r=16, # LoRA rank (higher for DPO than SFT)
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # More modules
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# ============================================================================
# STEP 4: DPO TRAINING CONFIGURATION
# ============================================================================
training_args = TrainingArguments(
output_dir="./dpo-llama2",
# Training hyperparameters
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch = 16
# Learning rate (often higher than SFT)
learning_rate=5e-6,
lr_scheduler_type="cosine",
warmup_steps=100,
# Optimizations
fp16=True,
gradient_checkpointing=True,
# Logging and saving
logging_steps=10,
save_steps=500,
eval_steps=500,
evaluation_strategy="steps",
save_total_limit=3,
# DPO specific
remove_unused_columns=False, # Important for DPO!
report_to="tensorboard"
)
# ============================================================================
# STEP 5: DPO TRAINER CREATION
# ============================================================================
dpo_trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
# CRUCIAL HYPERPARAMETER: β (beta)
beta=0.1, # Controls the importance of KL penalty
# Low β (0.1): More aggressive, drifts more from reference model
# High β (0.5): More conservative, stays close to reference model
# Maximum length
max_length=512,
max_prompt_length=256,
# Loss type
loss_type="sigmoid" # Or "hinge", "ipo"
)
# ============================================================================
# STEP 6: TRAINING
# ============================================================================
print("🚀 Starting DPO training...")
dpo_trainer.train()
# Save
model.save_pretrained("./dpo-llama2-final")
tokenizer.save_pretrained("./dpo-llama2-final")
print("✅ DPO training completed!")
# ============================================================================
# STEP 7: EVALUATION AND INFERENCE
# ============================================================================
def generate_with_dpo_model(prompt, model, tokenizer):
"""Generates a response with the DPO model"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
return response
# Test
prompt = "Human: Explain general relativity in simple terms.\n\nAssistant:"
response = generate_with_dpo_model(prompt, model, tokenizer)
print(f"Prompt: {prompt}")
print(f"Response: {response}")
Detailed Comparison: RLHF vs DPO
| Criterion | RLHF | DPO |
|---|---|---|
| Complexity | 🔴 Very complex (2 phases) | 🟢 Simple (1 phase) |
| Number of Models | 🔴 3 models (policy, ref, reward) | 🟢 2 models (policy, ref) |
| Stability | 🔴 PPO unstable | 🟢 Very stable |
| GPU Memory | 🔴 ~40-60 GB for 7B | 🟢 ~20-30 GB for 7B |
| Speed | 🔴 Slow (generation + PPO) | 🟢 2-3x faster |
| Hyperparameters | 🔴 Many and sensitive | 🟢 Mainly β |
| Final Quality | 🟡 Excellent | 🟢 Comparable to RLHF |
| Implementation Ease | 🔴 Very difficult | 🟢 Easy with TRL |
| Computational Cost | 🔴 High | 🟢 Moderate |
Verdict: DPO is generally preferable unless you already have RLHF infrastructure in place.
DPO Variants
Several DPO variants have been proposed:
1. IPO (Identity Preference Optimization):
- Modifies DPO loss to better handle weak preferences
- Better when humans are uncertain about their preference
2. CPO (Conservative Preference Optimization):
- Adds an explicit safety constraint
- Prevents the model from generating dangerous content even if rewarded
3. KTO (Kahneman-Tversky Optimization):
- Based on prospect theory in behavioral economics
- Better handling of asymmetric preferences
4. ORPO (Odds Ratio Preference Optimization):
- Combines SFT and DPO in a single step
- More data efficient
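Most of these variants differ mainly in how they turn the policy log-ratio margin into a per-example loss. The sketch below shows simplified forms of three margin-based losses (sigmoid, hinge, IPO-style); the exact objectives in the papers differ in details, and `gamma`/`tau` are illustrative parameters of this sketch. They map roughly onto the `loss_type` options mentioned in the trainer configuration above:

```python
import torch
import torch.nn.functional as F

# m = beta * (chosen_logratio - rejected_logratio), one value per example
def dpo_sigmoid_loss(m):
    # Standard DPO: logistic loss on the margin
    return -F.logsigmoid(m)

def hinge_loss(m, gamma=1.0):
    # Hinge variant: only penalizes margins smaller than gamma
    return torch.relu(gamma - m)

def ipo_style_loss(m, beta=0.1, tau=0.1):
    # IPO-style: squared distance between the raw log-ratio margin (m / beta)
    # and a target of 1 / (2 * tau) -- gentler on already-large margins
    return (m / beta - 1.0 / (2.0 * tau)) ** 2

m = torch.tensor([-1.0, 0.0, 1.0, 3.0])
print(dpo_sigmoid_loss(m))  # decreases smoothly as the margin grows
print(hinge_loss(m))        # exactly zero once the margin exceeds gamma
print(ipo_style_loss(m))    # penalizes margins far from the target on both sides
```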
Advantages of DPO
✅ Simplicity: Single training phase, no separate Reward Model
✅ Stability: Much more stable than PPO, converges predictably
✅ Memory Efficiency: Requires 2 models instead of 3 (~30% GPU savings)
✅ Speed: 2-3x faster than RLHF
✅ Implementation Ease: ~100 lines of code with TRL
✅ Hyperparameters: Mainly β, much simpler to tune
✅ Quality: Results comparable to RLHF in most benchmarks
✅ Open Source: Excellent TRL library from HuggingFace
Disadvantages of DPO
❌ Data Required: Still needs 50K+ human comparisons (like RLHF)
❌ Labeling Cost: Same human cost as RLHF to create comparisons
❌ Less Control: No explicit Reward Model to debug
❌ Indirect Optimization: Optimizes an approximation of RLHF objective, not the direct objective
❌ Less Mature: More recent than RLHF (2023 vs 2017), less industrial feedback
Real-World Examples of DPO Usage
Zephyr-7B (HuggingFace):
- Mistral-7B + SFT + DPO
- Performance close to GPT-3.5 with only 7B parameters
- Became the most popular open-source model in late 2023
Starling-7B (Berkeley):
- Significant improvements on reasoning and code
- Uses DPO variants (RLAIF)
Tulu 2 (AllenAI):
- Suite of models optimized with DPO
- Focus on task diversity
Step 4: GRPO - Group Relative Policy Optimization
The Latest Innovation (2024): DeepSeek R1
GRPO (Group Relative Policy Optimization) is the most recent of the four techniques, introduced by DeepSeek in 2024 (first described in their DeepSeekMath work and made famous by the R1 reasoning model). It’s a major evolution beyond DPO that better exploits group comparisons.
Detailed Definition: GRPO is a novel preference optimization algorithm that extends beyond pairwise comparisons to leverage full ranking information over groups of responses. Rather than comparing two responses at a time (A vs B), GRPO simultaneously considers multiple candidates and learns from their relative ordering (A > B > C > D).
Core Innovation: GRPO addresses a fundamental limitation of both RLHF and DPO: preference data efficiency. Traditional methods only extract binary signals from human feedback, while GRPO extracts richer multi-way comparison information.
Theoretical Foundation: GRPO models preferences using a Plackett-Luce ranking model, which generalizes the Bradley-Terry model (used in DPO) to handle rankings over arbitrary group sizes:
P(rank(y₁,...,yₖ) | x) = ∏ᵢ₌₁ᵏ exp(R(x,yᵢ)) / ∑ⱼ₌ᵢᵏ exp(R(x,yⱼ))
This model captures the probability of observing a particular ranking, where each response is sequentially selected proportional to its exponentiated reward.
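A tiny numerical sketch of this formula (plain Python, with arbitrary illustrative rewards) makes the "sequential selection" reading concrete:

```python
import math

def plackett_luce_prob(rewards_in_rank_order):
    """
    P(observed ranking) under the Plackett-Luce model, given the rewards
    R(x, yᵢ) listed from best-ranked to worst-ranked.
    """
    prob = 1.0
    for i in range(len(rewards_in_rank_order)):
        numerator = math.exp(rewards_in_rank_order[i])
        denominator = sum(math.exp(r) for r in rewards_in_rank_order[i:])
        prob *= numerator / denominator
    return prob

# 4 responses ranked from best to worst, with illustrative rewards
print(plackett_luce_prob([2.0, 1.0, 0.5, -1.0]))  # ranking consistent with rewards -> high probability
print(plackett_luce_prob([-1.0, 0.5, 1.0, 2.0]))  # reversed ranking -> low probability
```

With K = 2 this reduces to the Bradley-Terry model used by DPO.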
GRPO Optimization Objective: Following DPO’s approach, GRPO directly parameterizes the reward in terms of policy ratios:
L_GRPO(π_θ) = -E[(x,{yᵢ}ᵢ₌₁ᴷ)~D] [∑ᵢ₌₁ᴷ advantage(yᵢ, rank) · log(π_θ(yᵢ|x)/π_ref(yᵢ|x))]
Where advantage(yᵢ, rank) is computed based on the response’s position in the ranking:
- Top-ranked responses get positive advantages (reinforce)
- Bottom-ranked responses get negative advantages (suppress)
- Middle-ranked responses get smaller advantages (gentle adjustment)
Advantage Computation Methods:
- Rank-based: advantage(yᵢ) = (K - rankᵢ + 1) / K - 0.5
- Exponential: advantage(yᵢ) = exp(-λ·(rankᵢ - 1)) for some λ > 0
- Tournament-style: advantage(yᵢ) = ∑ⱼ≠ᵢ sign(rankⱼ - rankᵢ)
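As a rough illustration, here is a compact sketch of these three schemes in NumPy (the centering in the exponential variant and the value of λ are choices of this sketch, not prescribed by GRPO):

```python
import numpy as np

def rank_based_advantages(ranks):
    # ranks: 1 = best ... K = worst
    K = len(ranks)
    return np.array([(K - r + 1) / K - 0.5 for r in ranks])

def exponential_advantages(ranks, lam=1.0):
    adv = np.array([np.exp(-lam * (r - 1)) for r in ranks])
    return adv - adv.mean()   # centered so that low ranks end up negative

def tournament_advantages(ranks):
    ranks = np.array(ranks)
    # For each response: how many others it beats minus how many beat it
    return np.array([np.sign(ranks - r).sum() for r in ranks])

ranks = [1, 3, 2, 4]  # y₁ is best, y₄ is worst
print(rank_based_advantages(ranks))   # → 0.5, 0.0, 0.25, -0.25
print(exponential_advantages(ranks))  # decays quickly with rank
print(tournament_advantages(ranks))   # → 3, -1, 1, -3
```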
Key Advantages of Group Rankings:
- Information Efficiency: one K-way ranking carries the information of K(K-1)/2 pairwise comparisons
  - Example: ranking 4 responses yields the equivalent of 6 pairwise comparisons
  - Yet it requires only 1 human annotation instead of 6
- Consistency Guarantee: direct rankings eliminate preference cycles (A > B, B > C, C > A becomes impossible)
- Finer-Grained Signals: captures the strength of preferences (A » B > C ≈ D)
- Better Generalization: learning from diverse quality levels improves robustness
Practical Implementation Strategy: GRPO typically uses:
- Group size K = 4-8: Balance between information gain and annotation difficulty
- Sampling strategies: Diverse sampling to ensure quality spread in groups
- Reward normalization: Standardize advantages within each group for stability
- Mixed training: Combine GRPO with standard DPO losses for robustness
Why GRPO Achieves Superior Performance: DeepSeek R1’s success with GRPO demonstrates that:
- Richer feedback signal enables faster learning with less data
- Explicit relative positioning helps model understand quality gradients
- Reduced annotation cost allows scaling to larger, more diverse datasets
- Better optimization landscape leads to more stable convergence
The Problem with DPO
DPO compares responses pairwise:
Prompt: "Explain photosynthesis"
Response A vs Response B → A is preferred
Response B vs Response C → B is preferred
Response A vs Response C → A is preferred
❌ PROBLEM:
- 3 comparisons needed for 3 responses
- Partial information (binary comparisons)
- Possible inconsistencies (A > B, B > C, but C > A??)
The GRPO Idea
Question: Instead of comparing pairwise, why not compare multiple responses at once and rank them?
Answer: That’s exactly what GRPO does!
Prompt: "Explain photosynthesis"
Generate 4-8 responses
Rank all together: A > C > B > D
✅ ADVANTAGE:
- 1 single annotation instead of 6 binary comparisons
- Richer information (complete ranking)
- Guaranteed consistency
- More data efficient
Analogy: Instead of asking “Is this movie better than that one?” multiple times, we directly ask “Rank these 5 movies from best to worst”. It’s faster, more consistent, and gives more information.
How Does GRPO Work?
The Process in Detail
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: GROUP GENERATION │
│ For each prompt, generate K responses (K = 4-8) │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Prompt: "Explain gravity" │
│ │
│ y₁: "Gravity is the curvature..." │
│ y₂: "Gravity is the force..." │
│ y₃: "Newton discovered..." │
│ y₄: "Objects fall because..." │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: GROUP RANKING │
│ Rank all responses together │
└─────────────────────────┬───────────────────────────────────┘
│
▼
Ranking: y₁ > y₃ > y₂ > y₄
(1st, 2nd, 3rd, 4th)
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: RELATIVE REWARDS CALCULATION │
│ Use ranking to calculate rewards │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Rank-based rewards: │
│ r(y₁) = +1.5 (best) │
│ r(y₃) = +0.5 (2nd) │
│ r(y₂) = -0.5 (3rd) │
│ r(y₄) = -1.5 (worst) │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: GRPO OPTIMIZATION │
│ Maximize probability of best responses │
│ Minimize probability of worst responses │
└─────────────────────────────────────────────────────────────┘
GRPO Formula
Loss_GRPO = -∑ᵢ₌₁ᴷ advantage(yᵢ) * log π_θ(yᵢ | x)
Where:
- K = number of responses in the group (e.g., 4-8)
- advantage(yᵢ) = relative reward based on rank
- advantage(y_best) > 0 (we want to increase its probability)
- advantage(y_worst) < 0 (we want to decrease its probability)
Advantage Calculation:
```python
import numpy as np

# Method 1: Rank-based reward
def compute_rank_advantage(rankings):
"""
rankings: [1, 3, 2, 4] (rank of each response)
"""
K = len(rankings)
advantages = []
for rank in rankings:
# Transform rank into advantage
# Best rank (1) → maximum positive advantage
# Worst rank (K) → maximum negative advantage
advantage = (K + 1 - 2 * rank) / K
advantages.append(advantage)
# Normalize (mean = 0)
advantages = np.array(advantages)
advantages = advantages - advantages.mean()
return advantages
# Example:
rankings = [1, 3, 2, 4] # y₁ is 1st, y₃ is 3rd, etc.
advantages = compute_rank_advantage(rankings)
# advantages = [0.75, -0.25, 0.25, -0.75]
# y₁ (rank 1) → +0.75 (strong advantage)
# y₄ (rank 4) → -0.75 (strong disadvantage)
```
Practical Implementation of GRPO
```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader
# ============================================================================
# STEP 1: CONFIGURATION
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Reference model (frozen)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
ref_model.eval()
# GRPO hyperparameters
K = 4 # Number of responses per group
beta = 0.1 # KL penalty coefficient
learning_rate = 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# ============================================================================
# STEP 2: GROUP GENERATION
# ============================================================================
def generate_group_responses(model, prompt, K=4):
"""
Generates K different responses for the same prompt
"""
responses = []
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(K):
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.8, # High temperature for diversity
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
responses.append(response)
return responses
# ============================================================================
# STEP 3: RANKING (CAN BE HUMAN OR AUTOMATIC)
# ============================================================================
def rank_responses_automatic(prompt, responses, reward_model):
"""
Ranks responses automatically with a reward model
(Alternative: ask humans to rank)
"""
scores = []
for response in responses:
# Calculate score with a reward model
inputs = tokenizer(
prompt + " " + response,
return_tensors="pt",
truncation=True,
max_length=512
).to(reward_model.device)
with torch.no_grad():
score = reward_model(**inputs)
scores.append(score.item())
# Create ranking based on scores
rankings = np.argsort(-np.array(scores)) + 1 # Rank 1 = best
return rankings.tolist()
# ============================================================================
# STEP 4: ADVANTAGES CALCULATION
# ============================================================================
def compute_advantages(rankings):
"""
Transforms ranks into advantages
"""
K = len(rankings)
advantages = []
for rank in rankings:
# Rank-based advantage
advantage = (K + 1 - 2 * rank) / K
advantages.append(advantage)
# Normalize
advantages = np.array(advantages)
advantages = advantages - advantages.mean()
return torch.tensor(advantages, dtype=torch.float32)
# ============================================================================
# STEP 5: GRPO LOSS
# ============================================================================
def compute_grpo_loss(model, ref_model, prompt, responses, advantages, beta=0.1):
"""
Calculates GRPO loss for a group of responses
"""
total_loss = 0
for i, (response, advantage) in enumerate(zip(responses, advantages)):
# Tokenize
full_text = prompt + " " + response
inputs = tokenizer(
full_text,
return_tensors="pt",
truncation=True,
max_length=512
).to(model.device)
# Calculate log-probability with policy model
outputs = model(**inputs, labels=inputs['input_ids'])
logprobs_policy = -outputs.loss
# Calculate log-probability with reference model
with torch.no_grad():
outputs_ref = ref_model(**inputs, labels=inputs['input_ids'])
logprobs_ref = -outputs_ref.loss
# KL divergence (approximation)
kl = logprobs_policy - logprobs_ref
# Loss for this response
# We maximize logprobs of good responses (advantage > 0)
# We minimize logprobs of bad responses (advantage < 0)
loss = -advantage * (logprobs_policy - beta * kl)
total_loss += loss
# Average over the group
avg_loss = total_loss / len(responses)
return avg_loss
# ============================================================================
# STEP 6: TRAINING LOOP
# ============================================================================
def train_grpo(model, ref_model, prompts, reward_model, epochs=3, K=4):
"""
Main GRPO training loop
"""
model.train()
for epoch in range(epochs):
epoch_loss = 0
for prompt in prompts:
# Generate K responses
responses = generate_group_responses(model, prompt, K=K)
# Rank responses
rankings = rank_responses_automatic(prompt, responses, reward_model)
# Calculate advantages
advantages = compute_advantages(rankings)
# Calculate loss
loss = compute_grpo_loss(
model, ref_model, prompt, responses, advantages, beta=0.1
)
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
avg_epoch_loss = epoch_loss / len(prompts)
print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_epoch_loss:.4f}")
return model
# Usage example (assumes a scalar-scoring reward_model is available for the automatic ranking)
prompts_list = ["Explain photosynthesis", "What is AI?", ...]
trained_model = train_grpo(model, ref_model, prompts_list, reward_model)
```
Advantages of GRPO
✅ Data Efficiency: one group ranking carries the information of K(K-1)/2 binary comparisons
✅ Consistency: No possible contradictions in preferences
✅ Rich Information: Complete ranking instead of binary comparisons
✅ Better Quality: DeepSeek R1 surpasses GPT-4 in many benchmarks
✅ Faster: Fewer human annotations needed
✅ Flexible: Can use partial or incomplete rankings
✅ Robustness: Less sensitive to noise in annotations
Disadvantages of GRPO
❌ Very Recent: little practical feedback so far (late 2024)
❌ Generation Cost: Requires generating K responses per prompt (K≈4-8)
❌ Cognitive Complexity: Ranking 8 responses is harder than comparing 2
❌ No Standard Library: Not yet integrated in TRL (coming)
❌ Limited Documentation: Fewer resources than DPO/RLHF
Final Comparison: SFT vs RLHF vs DPO vs GRPO
| Criterion | SFT | RLHF | DPO | GRPO |
|---|---|---|---|---|
| Implementation Complexity | 🟢 Easy | 🔴 Very difficult | 🟢 Easy | 🟡 Moderate |
| Stability | 🟢 Very stable | 🔴 Unstable | 🟢 Stable | 🟢 Stable |
| Final Quality | 🟡 Good | 🟢 Excellent | 🟢 Excellent | 🟢 Superior |
| Data Efficiency | 🟡 Moderate | 🔴 Low | 🟡 Moderate | 🟢 High |
| Computational Cost | 🟢 Low | 🔴 High | 🟡 Moderate | 🟡 Moderate |
| GPU Memory | 🟢 Low | 🔴 High | 🟡 Moderate | 🟡 Moderate |
| Training Time | 🟢 Fast | 🔴 Slow | 🟢 Fast | 🟢 Fast |
| Labeling Cost | 🟡 Moderate | 🔴 High | 🔴 High | 🟢 Reduced |
| Maturity | 🟢 Mature | 🟢 Mature | 🟡 Recent | 🔴 Very recent |
Comparisons and Method Selection
Decision Guide: Which Method to Choose?
Scenario 1: Limited Budget, Quick Prototype
Recommendation: SFT Only
- ✅ Fast to implement (few hours)
- ✅ Low cost (hundreds of euros)
- ✅ Already very usable results
- ❌ Quality limited by examples
Example: Create an internal chatbot for your company
Scenario 2: Maximum Quality, Comfortable Budget
Recommendation: SFT → DPO
- ✅ State-of-the-art quality
- ✅ More stable than RLHF
- ✅ Reasonable cost (few thousand euros)
- ❌ Requires human comparisons
Example: Launch a commercial chatbot product
Scenario 3: Existing RLHF Infrastructure
Recommendation: RLHF
- ✅ If you already master RLHF
- ✅ If you have GPU resources
- ❌ Otherwise, prefer DPO
Example: Large tech company with dedicated ML team
Scenario 4: Cutting-Edge Research
Recommendation: GRPO
- ✅ Better data efficiency
- ✅ Superior quality
- ❌ Very recent, little documentation
- ❌ Requires self-implementation
Example: Research lab, academic publication
Recommended Pipeline for Most Cases
1. SFT (Mandatory)
↓
Evaluation: Is the model already good enough?
↓
YES → Stop here
NO → Continue
↓
2. DPO (Recommended)
↓
Evaluation: Need improvement?
↓
YES → Iterate (more data)
NO → Deployment
Production and Best Practices
Best Labeling Practices
1. Annotator Quality:
- Train annotators on your criteria
- Use qualification tests
- Measure inter-annotator agreement (κ > 0.6; see the sketch after this list)
2. Data Diversity:
- Cover all query types
- Include edge cases and difficult cases
- Balance domains
3. Quality Control:
- Double annotation for 10% of data
- Review major disagreements
- Continuous quality monitoring
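For the κ threshold mentioned above, Cohen's kappa is the usual pairwise agreement measure. A minimal sketch with scikit-learn, assuming two annotators labeled the same comparisons with the identifier of the preferred response:

```python
from sklearn.metrics import cohen_kappa_score

# Preferred response ("A" or "B") chosen by two annotators on the same 10 pairs
annotator_1 = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "A", "B", "A", "B", "B", "A", "A", "A", "A"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # ≈ 0.78 here; aim for > 0.6 before trusting the labels
```

For more than two annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations.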
Aligned Model Evaluation
Automatic Metrics:
- Perplexity: Not very correlated with perceived quality
- BLEU/ROUGE: Unsuitable for alignment
- Reward Model Score: Good proxy
Human Evaluation:
- Win Rate: Compare 2 models side by side
- Likert Scale: Rate 1-5 on several criteria
- A/B Testing: In production
Benchmarks:
- MT-Bench: Multi-turn conversations
- AlpacaEval: Comparison with GPT-4
- Arena Elo: Ranking by battles
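To make "Win Rate" and "Arena Elo" concrete, here is a small sketch on synthetic side-by-side judgments (the K-factor of 32 is a common but arbitrary choice):

```python
def win_rate(judgments):
    """judgments: list of 'A', 'B' or 'tie' from side-by-side comparisons."""
    wins_a = judgments.count("A")
    decided = wins_a + judgments.count("B")
    return wins_a / decided if decided else 0.5

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1 if A wins, 0 if A loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

judgments = ["A", "A", "B", "tie", "A", "B", "A"]
print(f"Model A win rate: {win_rate(judgments):.2%}")

elo_a, elo_b = 1000.0, 1000.0
for j in judgments:
    score = {"A": 1.0, "B": 0.0, "tie": 0.5}[j]
    elo_a, elo_b = elo_update(elo_a, elo_b, score)
print(f"Elo after battles: A={elo_a:.0f}, B={elo_b:.0f}")
```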
Conclusion
The journey from raw pre-trained models to aligned, helpful assistants represents one of the most remarkable achievements in modern AI. What started with ChatGPT’s revolutionary RLHF approach has evolved into an ecosystem of increasingly efficient and accessible techniques.
The Evolution Timeline:
- 2020-2022: RLHF establishes the paradigm (ChatGPT’s breakthrough)
- 2023: DPO democratizes alignment (simpler, faster, equally effective)
- 2024: GRPO pushes efficiency boundaries (DeepSeek R1’s innovation)
Where We Stand Today: Alignment is no longer the exclusive domain of tech giants. With open-source tools like HuggingFace TRL, datasets like Anthropic’s HH-RLHF, and increasingly affordable compute, teams of all sizes can now build world-class aligned models.
Practical Roadmap for Your Projects:
- For rapid prototyping (days): SFT with 10K quality examples → 80% of the way there
- For production quality (weeks): SFT + DPO with 50K comparisons → state-of-the-art results
- For cutting-edge research (months): Explore GRPO and hybrid approaches → push boundaries
Looking Ahead: The field continues to evolve rapidly. Constitutional AI, RLHF from AI feedback (RLAIF), and multi-objective alignment are emerging frontiers. The next breakthrough might come from combining these techniques with synthetic data generation, continual learning, or entirely new paradigms we haven’t yet imagined.
Your Next Steps: Start simple, measure rigorously, iterate continuously. The most important alignment happens not in the algorithm choice, but in deeply understanding your users’ needs and values. Build models that don’t just follow instructions—build models that genuinely help humans thrive.
The future of AI alignment is being written today, and you’re now equipped to contribute to it.
