LLM Alignment: A Complete Guide to SFT, RLHF, DPO, and GRPO
Introduction
Language model alignment (LLM Alignment) has become one of the most critical challenges in modern artificial intelligence. A pre-trained language model like GPT-4, Claude, or Llama, however powerful, is not naturally aligned with human intentions, ethical values, or desirable behaviors. Without alignment, these models can generate toxic content, provide dangerous instructions, or simply fail to properly follow user guidelines.
The Fundamental Problem
Imagine an ultra-intelligent assistant who knows everything about everything, but who doesn’t understand the difference between good and bad actions, who doesn’t know when to be concise or detailed, and who cannot distinguish a legitimate request from a harmful one. This is exactly the situation of an LLM after its pre-training: it can predict the next word with remarkable accuracy, but it has no notion of what is helpful, safe, or appropriate.
Why Alignment is Crucial
Without alignment, an LLM presents these problems:
- Doesn’t follow instructions: You ask for a short answer, it writes a 10-page essay
- Generates toxic content: Hateful, discriminatory, or offensive statements
- Hallucinates information: Invents facts with absolute confidence
- Gives dangerous advice: Instructions for making dangerous substances, encouragement of self-harm
- Lacks common sense: Responds literally without understanding social or ethical context
- Rejects legitimate requests: Too cautious and refuses to help on sensitive but legal topics
With alignment, the model becomes:
- ✅ Helpful: Responds precisely to what the user asks
- ✅ Honest: Admits when it doesn’t know, avoids hallucinations
- ✅ Harmless: Refuses dangerous requests, generates safe content
- ✅ Natural: Communicates like a helpful and empathetic human
What You Will Learn
This article is a complete and practical guide on the four major LLM alignment techniques:
- SFT (Supervised Fine-Tuning) - Learning by example
- RLHF (Reinforcement Learning from Human Feedback) - Reinforcement learning with human feedback
- DPO (Direct Preference Optimization) - Direct preference optimization
- GRPO (Group Relative Policy Optimization) - DeepSeek’s latest innovation
For each technique, we will explore:
- Fundamental concepts with concrete analogies
- Mathematical principles explained simply
- Practical implementations with complete code
- Advantages and disadvantages of each method
- Production applications with real examples
- Best practices from industry experience
Table of Contents
- Overview of the Alignment Pipeline
- Step 1: SFT - Supervised Fine-Tuning
- Step 2: RLHF - Reinforcement Learning from Human Feedback
- Step 3: DPO - Direct Preference Optimization
- Step 4: GRPO - Group Relative Policy Optimization
- Comparisons and Method Selection
- Production and Best Practices
Overview of the Alignment Pipeline
The Journey of an LLM: From Pre-training to Perfect Assistant
Understanding LLM alignment means understanding a multi-step journey that transforms a word prediction model into an intelligent and safe assistant.
Step 0: Pre-training (The Foundation)
What happens? The model learns to predict the next word by reading billions of web pages, books, articles, source code, etc.
Analogy: It’s like a child reading an entire national library without any parental supervision. It learns language, grammar, facts, but also inappropriate content, biases, and without any notion of what is right or wrong.
Result: A very intelligent but “raw” model that can complete any text, even toxic or dangerous.
Concrete example:
Prompt: "How to hack a bank account"
Pre-trained model: "How to hack a bank account? Here are the detailed steps..."
❌ PROBLEM: The model completes without ethical judgment
Step 1: SFT - Supervised Fine-Tuning (Basic Education)
What happens? We train the model on examples of “good” conversations created by human experts.
Analogy: It’s like sending the child to school with teachers who show them how to behave, how to respond politely, how to structure their answers.
Result: The model learns the instruction-response format and adopts a helpful conversation style.
Concrete example:
After SFT:
Prompt: "How to hack a bank account"
Model: "I cannot and will not provide information on illegal activities. If you forgot your password, contact your bank."
✅ BETTER: The model refuses and proposes a legal alternative
Step 2: RLHF - Reinforcement Learning (The Refinement)
What happens? Humans compare thousands of response pairs (which is better?). The model learns to optimize to maximize human preferences.
Analogy: It’s as if the child constantly received feedback: “This way of responding is better than that one.” They adjust their behavior to maximize approval.
Result: The model generates responses that better match subtle human preferences (tone, length, usefulness).
Step 3: DPO - Direct Optimization (The Simplification)
What happens? A simplified version of RLHF that achieves the same objectives with less complexity.
Analogy: Instead of a complex reward system, we directly show the model “This response is preferred to that one” and it adjusts its weights accordingly.
Result: Same quality as RLHF but more stable, faster, simpler to implement.
Step 4: GRPO - Group Optimization (The 2024 Innovation)
What happens? An even more efficient method that compares multiple responses in groups rather than in pairs.
Analogy: Instead of comparing A vs B, then B vs C, then A vs C (3 comparisons), we compare A, B, and C together (1 comparison) and rank from best to worst.
Result: More efficient in terms of data and computation than previous methods.
Summary Pipeline Diagram
┌─────────────────────────────────────────────────────────────┐
│ STEP 0: PRE-TRAINING (By OpenAI, Meta, etc.) │
│ Input: Billions of tokens from the Internet │
│ Output: Base model (GPT-4, Llama, etc.) │
│ Duration: Weeks/Months | Cost: Millions $ │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: SFT (Supervised Fine-Tuning) │
│ Input: 10K-100K examples of ideal conversations │
│ Output: Model that follows instructions │
│ Duration: Hours/Days | Cost: Hundreds/Thousands $ │
│ ✅ Model understands instruction-response format │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: RLHF (Reinforcement Learning) │
│ Input: 50K-300K human comparisons (A > B) │
│ Output: Model aligned with human preferences │
│ Duration: Days/Weeks | Cost: Thousands $ │
│ ✅ Model generates high-quality responses │
│ ❌ Complex, unstable, expensive │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ ALTERNATIVE: DPO (Direct Preference Optimization) │
│ Input: Same comparisons as RLHF │
│ Output: Same quality as RLHF │
│ Duration: Hours/Days | Cost: Hundreds $ │
│ ✅ Simpler, more stable, cheaper than RLHF │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2024 INNOVATION: GRPO (Group Relative Policy Opt.) │
│ Input: Group comparisons (A > B > C > D) │
│ Output: Quality superior to RLHF/DPO │
│ Duration: Hours/Days | Cost: Hundreds $ │
│ ✅ More efficient, better quality, less data │
└─────────────────────────────────────────────────────────────┘
Global Metaphor: Training a Personal Assistant
To better understand the complete pipeline, imagine that you’re training a personal assistant:
| Step | Metaphor | Concrete Analogy |
|---|---|---|
| Pre-training | Childhood | The child reads everything they can get their hands on, learns language but without discernment |
| SFT | School | Teachers teach them how to behave, respond politely, structure their thoughts |
| RLHF | Professional internship | A mentor constantly tells them “This way of doing is better than that one” until they internalize good practices |
| DPO | Accelerated coaching | An efficient coach directly shows them examples of good/bad behavior without detours |
| GRPO | Group mentoring | Several assistants are trained together, learning from each other more efficiently |
Step 1: SFT - Supervised Fine-Tuning
Definition and Objective
SFT (Supervised Fine-Tuning) is the first and most fundamental alignment step. It’s classic supervised learning where the model learns to imitate high-quality human demonstrations.
Main Objective: Transform a word prediction model into an instruction-following model.
Detailed Definition: SFT is a transfer learning technique that adapts a pre-trained language model to a specific downstream task by training it on carefully curated input-output pairs. Unlike traditional fine-tuning which may adapt the model to any task, SFT specifically focuses on teaching the model to:
- Understand Instructions: Parse and interpret user requests in natural language
- Follow Format Conventions: Respect conversational structures (system, user, assistant roles)
- Generate Appropriate Responses: Produce outputs that match the expected style, tone, and length
- Maintain Coherence: Stay on-topic and provide relevant information
- Exhibit Basic Safety: Avoid obviously harmful or inappropriate content
Complete Analogy: Imagine you hire a very intelligent assistant who is completely ignorant of social and professional conventions. SFT is like giving them a manual of 10,000 examples of successful conversations between an assistant and their boss, covering all possible types of requests:
- How to answer a factual question
- How to politely refuse an inappropriate request
- How to structure a complex explanation
- How to adapt tone according to context
- How to admit when you don’t know something
Your assistant reads all these examples, memorizes them, and learns to reproduce this communication style.
How Does SFT Work?
The Process in Detail
1. Collecting High-Quality Data
The quality of SFT depends entirely on the quality of training data. We collect ideal conversations created by human experts.
Standard Format:
{
"instruction": "The user's request or question",
"input": "Additional context (optional)",
"output": "The ideal response generated by a human expert"
}
Concrete Example:
{
"instruction": "Explain the concept of photosynthesis to an 8-year-old",
"input": "",
"output": "Photosynthesis is like the magic recipe plants use to make their food! Imagine that green leaves are little kitchens. They take three ingredients: sunlight (like electrical energy), water (that roots draw from the soil), and air (especially a gas called CO2). By mixing all that, leaves make sugar so the plant can grow, and as a bonus, they release oxygen that we breathe! That's why trees are so important for our planet."
}
Quality vs Quantity:
- ❌ Bad: 100,000 mediocre examples (copied from the Internet without verification)
- ✅ Good: 10,000 excellent examples (verified, corrected, diversified)
- 🏆 Optimal: 50,000 very high-quality examples
2. Formatting for Training
The model must learn the conversational format. We use special templates:
Alpaca Format (Popular, used by Stanford):
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}
ChatML Format (Used by OpenAI):
<|im_start|>system
You are a helpful, honest, and harmless assistant.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
{output}<|im_end|>
Llama-2-Chat Format (Used by Meta):
<s>[INST] <<SYS>>
You are a helpful assistant who always responds accurately and respectfully.
<</SYS>>
{instruction} [/INST] {output} </s>
Why are these formats important?
They teach the model:
- 🎭 Roles: Who’s speaking? (system, user, assistant)
- 🔀 Transitions: When does a response begin and end
- 📏 Boundaries: Not to continue generating after finishing the response
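For illustration, here is a minimal sketch of how such a template can be applied programmatically instead of by hand. It assumes a recent version of transformers and a tokenizer that ships a chat template; the checkpoint name is only an example.
from transformers import AutoTokenizer

# Any model whose tokenizer defines a chat template works; this checkpoint is just an example
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful, honest, and harmless assistant."},
    {"role": "user", "content": "Explain the concept of photosynthesis to an 8-year-old"},
    {"role": "assistant", "content": "Photosynthesis is like the magic recipe plants use..."},
]

# apply_chat_template renders the conversation with the model's own special tokens
# (ChatML-style tags, [INST] markers, etc., depending on the model)
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)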
3. Training by Maximum Likelihood
The model is trained to maximize the probability of generating the ideal response.
Simplified Formula:
Loss = -log P(ideal_response | instruction, model)
What this means concretely:
- The model sees: Instruction + Ideal response
- It tries to predict each word of the response
- Each time it’s wrong, we adjust its weights
- After thousands of examples, it learns to reproduce the style
Analogy: It’s like learning to play piano by watching a maestro. You see their fingers (the instruction), you hear the music (the response), and you try to reproduce. At first, it’s approximate, but after thousands of repetitions, you play like them.
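As a concrete illustration, here is a minimal sketch of this loss for a single example (not the exact Trainer internals). One common convention, shown here, is to mask the prompt tokens with -100 so that only the response tokens contribute to the cross-entropy; the full script later in this article uses the simpler variant where all tokens are kept as labels.
import torch
import torch.nn.functional as F

def sft_loss(model, tokenizer, prompt, response):
    """Cross-entropy on the response tokens only (prompt tokens masked with -100)."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore the prompt in the loss

    logits = model(input_ids=full_ids).logits

    # Shift so that the token at position t predicts the token at position t+1
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )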
Visualization of the SFT Process
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: DATA COLLECTION │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Dataset of 50,000 examples │
│ │
│ Instruction 1 → Ideal response 1 │
│ Instruction 2 → Ideal response 2 │
│ ... │
│ Instruction 50K → Response 50K │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: SUPERVISED TRAINING │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ LLM Model (before SFT) │
│ "How to make a cake?" │
│ → "A cake is a dessert..." │
│ ❌ Incomplete response │
└──────────────────────────────────────┘
│
[Training]
│
▼
┌──────────────────────────────────────┐
│ LLM Model (after SFT) │
│ "How to make a cake?" │
│ → "Here's a simple recipe: │
│ 1. Preheat oven to 180°C │
│ 2. Mix 200g of flour..." │
│ ✅ Structured and useful response │
└──────────────────────────────────────┘
Practical SFT Implementation
Here’s a complete and functional example of SFT implementation with HuggingFace Transformers and PEFT (LoRA):
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments,
Trainer,
DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, TaskType
# ============================================================================
# STEP 1: DATA PREPARATION
# ============================================================================
def format_instruction(example):
"""
Transform a raw example into conversational format
Input: {"instruction": "...", "output": "..."}
Output: "### Instruction:\n...\n### Response:\n..."
"""
instruction_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
{output}"""
return instruction_template.format(
instruction=example["instruction"],
output=example["output"]
)
# Load an SFT dataset (example: Alpaca)
dataset = load_dataset("yahma/alpaca-cleaned")
# Example data:
# {
# "instruction": "Give three tips for staying healthy.",
# "output": "1. Eat a balanced diet...\n2. Exercise regularly...\n3. Get enough sleep..."
# }
# Data formatting
def preprocess_function(examples):
    # Create the complete instruction + response text.
    # With batched=True, `examples` is a dict of lists, so zip the columns together
    texts = [
        format_instruction({"instruction": inst, "output": out})
        for inst, out in zip(examples["instruction"], examples["output"])
    ]
# Tokenization
model_inputs = tokenizer(
texts,
truncation=True,
max_length=512,
padding="max_length"
)
# Important: labels are the same as input_ids
# The model learns to predict each following token
model_inputs["labels"] = model_inputs["input_ids"].copy()
return model_inputs
# ============================================================================
# STEP 2: MODEL AND TOKENIZER LOADING
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf" # or "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Important for padding
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16, # Memory saving
device_map="auto" # Automatic GPU distribution
)
# ============================================================================
# STEP 3: LORA CONFIGURATION (Efficient Training)
# ============================================================================
# LoRA allows training only 0.1% of model parameters
# instead of 100%, which drastically reduces memory and time
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM, # Causal language model
r=8, # Rank of LoRA matrices (higher = more capacity)
lora_alpha=32, # Scaling factor
lora_dropout=0.1, # Dropout for regularization
target_modules=["q_proj", "v_proj"] # Modules to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06%
# ============================================================================
# STEP 4: TRAINING DATA PREPARATION
# ============================================================================
# Dataset tokenization
tokenized_dataset = dataset.map(
preprocess_function,
batched=True,
remove_columns=dataset["train"].column_names
)
# Train/validation split
train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["train"].select(range(1000)) # 1000 examples for validation
# ============================================================================
# STEP 5: TRAINING CONFIGURATION
# ============================================================================
training_args = TrainingArguments(
output_dir="./sft-llama2-alpaca", # Save folder
# Training hyperparameters
num_train_epochs=3, # Number of epochs
per_device_train_batch_size=4, # Batch size per GPU
gradient_accumulation_steps=4, # Accumulation = effective batch of 16
# Optimization
learning_rate=2e-5, # Learning rate (critical!)
lr_scheduler_type="cosine", # Scheduler that progressively decreases
warmup_steps=100, # Progressive LR increase at start
# Saving and logging
logging_steps=10, # Log every 10 steps
save_steps=500, # Save every 500 steps
eval_steps=500, # Evaluate every 500 steps
save_total_limit=3, # Keep only 3 best checkpoints
# Memory optimizations
fp16=True, # Mixed precision (float16)
gradient_checkpointing=True, # Memory vs speed trade-off
# Evaluation
evaluation_strategy="steps",
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
# Miscellaneous
report_to="tensorboard", # Visualization with TensorBoard
push_to_hub=False # Don't publish to HuggingFace Hub
)
# ============================================================================
# STEP 6: TRAINING
# ============================================================================
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Launch training
print("🚀 Starting SFT training...")
trainer.train()
# Save final model
model.save_pretrained("./sft-llama2-final")
tokenizer.save_pretrained("./sft-llama2-final")
print("✅ Training complete!")
# ============================================================================
# STEP 7: INFERENCE WITH FINE-TUNED MODEL
# ============================================================================
def generate_response(instruction, model, tokenizer):
"""Generate a response with the SFT model"""
prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
{instruction}
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256, # Maximum response length
temperature=0.7, # Controls creativity (0.0 = deterministic, 1.0 = creative)
top_p=0.9, # Nucleus sampling
do_sample=True, # Random sampling
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Extract only the response (after "### Response:")
response = response.split("### Response:")[-1].strip()
return response
# Usage example
instruction = "Explain the concept of machine learning to a 10-year-old."
response = generate_response(instruction, model, tokenizer)
print(f"Instruction: {instruction}")
print(f"Response: {response}")
What Happens Technically
1. Forward Pass (Prediction):
# The model receives: "### Instruction: Explain photosynthesis\n### Response: Photo"
# It must predict: "synthesis"
input_tokens = ["###", "Instruction", ":", "Explain", "photosynthesis", "\n", "###", "Response", ":", "Photo"]
predicted_next_token = model(input_tokens) # → "synthesis"
2. Loss Calculation (Error Calculation):
# For each token of the response, we calculate the error
true_tokens = ["Photo", "synthesis", "is", "a", "process", "..."]
predicted_probs = model.predict_all_tokens(input)
loss = -sum(log(predicted_probs[true_token]) for true_token in true_tokens)
# The worse the prediction, the higher the loss
3. Backward Pass (Update):
# We calculate gradients (how to change weights to reduce error)
gradients = compute_gradients(loss, model.parameters())
# We update model weights
for param in model.parameters():
param -= learning_rate * gradients[param]
4. Repetition: This process repeats for each example, until the model learns to generate quality responses.
Tracking Metrics
During SFT training, we monitor several metrics:
1. Training Loss:
- Measures how much the model is wrong on training data
- Objective: Should decrease progressively
- Example: Epoch 1 → loss=2.5, Epoch 2 → loss=1.2, Epoch 3 → loss=0.8
2. Validation Loss:
- Measures performance on unseen data
- Objective: Should follow training loss without diverging too much
- Alert: If validation loss rises while training loss decreases → Overfitting!
3. Perplexity:
- Measure of model “confusion”
- Formula: perplexity = exp(loss)
- Interpretation:
- Perplexity = 10 → The model hesitates between ~10 possible words
- Perplexity = 2 → The model is very sure (hesitates between 2 words)
- Lower = better
4. Gradient Norm:
- Size of weight updates
- Alert: If too high → Exploding gradients (instability)
- Solution: Gradient clipping (max_grad_norm=1.0); see the snippet after this list
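As a quick illustration (assuming the trainer and training_args objects from the SFT script above), both metrics are easy to obtain or control:
import math

# Perplexity is simply the exponential of the mean cross-entropy loss
eval_metrics = trainer.evaluate()              # returns a dict that includes "eval_loss"
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Validation perplexity: {perplexity:.2f}")

# Gradient clipping is controlled by TrainingArguments(max_grad_norm=1.0);
# the Trainer then clips the global gradient norm automatically at every step.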
Training Logs Example
Epoch 1/3
Step 100/5000 | Loss: 2.451 | Perplexity: 11.60 | LR: 1.2e-5 | Time: 2m 15s
Step 200/5000 | Loss: 2.103 | Perplexity: 8.19 | LR: 1.8e-5 | Time: 4m 30s
...
Epoch 1 Complete | Train Loss: 1.820 | Val Loss: 1.856 | Val Perplexity: 6.40
Epoch 2/3
Step 100/5000 | Loss: 1.245 | Perplexity: 3.47 | LR: 2.0e-5 | Time: 2m 10s
...
Epoch 2 Complete | Train Loss: 1.102 | Val Loss: 1.134 | Val Perplexity: 3.11
Epoch 3/3
Step 100/5000 | Loss: 0.892 | Perplexity: 2.44 | LR: 1.5e-5 | Time: 2m 08s
...
✅ Training Complete!
Final Train Loss: 0.785 | Final Val Loss: 0.823 | Final Val Perplexity: 2.28
Advantages of SFT
✅ Simple and Intuitive: Classic supervised learning, easy to understand and implement
✅ Efficient: Quickly transforms a base model into a usable assistant
✅ Controllable: You decide exactly the response style by creating examples
✅ Inexpensive: Compared to pre-training, SFT costs 100-1000x less
✅ Fast: Training in hours/days rather than weeks/months
Disadvantages of SFT
❌ Limited Quality: The model cannot exceed the quality of human examples
❌ Data Creation Cost: Creating 50,000 high-quality examples requires hundreds of hours of human work
❌ No Preference Optimization: The model imitates, but doesn’t truly understand what makes a response “better”
❌ Overfitting: Risk of memorizing examples rather than generalizing
❌ No Iterative Feedback: Once examples are created, no continuous improvement
Ideal Use Cases for SFT
🎯 When to use SFT alone:
- Domain Adaptation: You want the model to speak your industry’s jargon
- Example: Fine-tune GPT to become an expert in medical law
- Specific Format: You have a precise output format to respect
- Example: Always generate responses structured in JSON
- Limited Budget: You don’t have resources for RLHF/DPO
- SFT alone already gives excellent results for many applications
- Rapid Prototyping: You want to test an idea quickly
- SFT is the fastest way to create a functional first prototype
Real Examples:
- Alpaca (Stanford): LLaMA-7B + 52K SFT examples → Functional conversational assistant
- Vicuna: Fine-tuning LLaMA on 70K ShareGPT conversations → Quality close to GPT-3.5
- WizardLM: LLaMA + SFT with automatically generated complex instructions
Step 2: RLHF - Reinforcement Learning from Human Feedback
Definition and Motivation
RLHF (Reinforcement Learning from Human Feedback) is a revolutionary technique that transformed GPT-3 (good but imperfect) into ChatGPT (excellent assistant). It’s the method that launched the conversational AI revolution in 2022.
Detailed Definition: RLHF is a multi-stage alignment paradigm that combines supervised learning with reinforcement learning to optimize language models based on human preferences. Unlike SFT which learns from fixed examples, RLHF enables the model to discover and optimize for subtle human preferences that are difficult to capture through demonstrations alone.
Core Components:
- Reward Modeling: Learning a function that predicts human preferences from comparison data
- Policy Optimization: Using reinforcement learning (typically PPO) to maximize expected reward
- Constraint Satisfaction: Maintaining similarity to the original model through KL divergence penalties
Theoretical Framework: RLHF formulates alignment as a constrained optimization problem:
max_π E_{x~D,y~π(·|x)}[R(x,y)] - β·KL(π||π_ref)
Where:
- π is the policy (LLM) being optimized
- π_ref is the reference policy (initial SFT model)
- R(x,y) is the reward model score
- β is the KL penalty coefficient (typically 0.01-0.1)
- KL(π||π_ref) prevents the model from deviating too far from safe behavior
Why RLHF is Transformative: Traditional supervised learning can only teach a model to imitate; it cannot teach the model to understand why one response is better than another. RLHF bridges this gap by:
- Capturing Implicit Preferences: Humans can easily say “A is better than B” even when they can’t articulate why
- Optimizing Directly for Quality: The model learns to maximize human satisfaction, not just mimic examples
- Handling Subjective Criteria: Can encode preferences about style, tone, helpfulness that vary by context
- Enabling Iterative Improvement: The model can be continuously refined as preferences evolve
The Problem with SFT Alone:
Imagine you ask: “Explain general relativity to me”
Model after SFT could generate 3 different responses:
Response A (too technical):
“General relativity postulates that the space-time metric obeys Einstein’s equations Gμν = 8πTμν, where Gμν is the Einstein tensor and Tμν the energy-momentum tensor…”
Response B (too simplistic):
“Gravity is like when a ball falls. Einstein said space is curved.”
Response C (just perfect):
“Einstein’s general relativity explains gravity differently from Newton. Instead of a mysterious attracting force, Einstein proposes that massive objects (like the Sun) curve space-time around them, like a bowling ball placed on a trampoline deforms the fabric. Planets aren’t “attracted” by the Sun, they simply follow the “curves” of space-time. This is verified by many observations like light bending near massive objects.”
Question: How does the model know that C is better than A and B?
Answer: With SFT alone, it doesn’t! It generates grammatically correct responses, but without notion of relative quality. This is where RLHF comes in.
Analogy: Training a Chef
Imagine you’re training a chef:
Method 1 - SFT (Learning by Recipes):
- You give 10,000 perfect recipes to the chef
- They learn to cook by following these recipes
- Problem: When they invent a new dish, they don’t know if it’s good or not
Method 2 - RLHF (Learning by Tasting):
- The chef cooks several versions of a dish
- Gourmets taste them and say “This one is better than that one”
- The chef adjusts their cooking to maximize gourmets’ approval
- Advantage: They learn subtle preferences (flavor balance, presentation, etc.)
The RLHF Pipeline in 3 Steps
┌─────────────────────────────────────────────────────────────┐
│ PHASE 1: COLLECTING HUMAN COMPARISONS │
│ │
│ For the same question, generate multiple responses │
│ and ask humans to rank them │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Question: "Explain gravity" │
│ │
│ Response A: [Technical response] │
│ Response B: [Simple response] │
│ Response C: [Balanced response] │
│ Response D: [Incorrect response] │
│ │
│ Human ranks: C > A > B > D │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 2: TRAINING A REWARD MODEL │
│ │
│ Create a model that predicts which response a human │
│ would prefer (without needing the human) │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Reward Model learns: │
│ │
│ Score(C) = 8.5/10 │
│ Score(A) = 6.2/10 │
│ Score(B) = 4.1/10 │
│ Score(D) = 1.3/10 │
│ │
│ ✅ Model can now evaluate │
│ any response │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ PHASE 3: OPTIMIZATION BY REINFORCEMENT LEARNING │
│ │
│ Use Reward Model to train LLM to generate │
│ responses that maximize the score │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ LLM generates a response │
│ → Reward Model gives a score │
│ → If high score: reinforce │
│ → If low score: penalize │
│ │
│ After thousands of iterations: │
│ ✅ LLM learns to maximize │
│ score = human preferences │
└──────────────────────────────────────┘
Phase 1: Collecting Human Comparisons
Objective: Create a dataset of human preferences
Detailed Process:
- Generating Multiple Responses:
prompt = "What is artificial intelligence?" # Generate 4 different responses with the SFT model responses = [ model.generate(prompt, temperature=0.7) for _ in range(4) ] # responses = [ # "AI is the simulation of human intelligence...", # Response A # "AI is when computers think like us...", # Response B # "Artificial intelligence encompasses...", # Response C # "AI = intelligent robots", # Response D # ]
- Human Annotation:
Annotation interface:
┌───────────────────────────────────────────────────┐
│ Question: What is artificial intelligence?        │
│                                                   │
│ Rank these responses from best to worst:          │
│                                                   │
│ [ 1 ] Response C: "Artificial intelligence..."    │
│ [ 2 ] Response A: "AI is the simulation..."       │
│ [ 3 ] Response B: "AI is when..."                 │
│ [ 4 ] Response D: "AI = intelligent robots"       │
│                                                   │
│ [Validate ranking]                                │
└───────────────────────────────────────────────────┘
- Format of Collected Data:
comparison_data = {
    "prompt": "What is artificial intelligence?",
    "responses": [response_A, response_B, response_C, response_D],
    "ranking": [2, 0, 1, 3],  # Indices in preference order (C > A > B > D)
    "annotator_id": "human_123"
}
Required Volume:
- Minimum: 10,000 comparisons
- Recommended: 50,000 - 100,000 comparisons
- OpenAI for ChatGPT: ~300,000 comparisons
Human Cost:
- 1 comparison ≈ 2-5 minutes (reading + reflection + ranking)
- 50,000 comparisons ≈ 4,000 hours of human work
- At $20/hour ≈ $80,000 labeling cost
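These figures are simple arithmetic from the assumptions above (average time per comparison and hourly rate); a quick sanity check:
num_comparisons = 50_000
minutes_per_comparison = 4.8   # within the 2-5 minute range above
hourly_rate_usd = 20

total_hours = num_comparisons * minutes_per_comparison / 60   # ≈ 4,000 hours
total_cost = total_hours * hourly_rate_usd                    # ≈ $80,000

print(f"{total_hours:,.0f} hours of annotation, ≈ ${total_cost:,.0f} in labeling cost")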
Phase 2: Training the Reward Model
Objective: Create a model that can predict human preferences automatically
What is a Reward Model?
A Reward Model is a neural network that takes as input a pair (prompt, response) and returns a numerical score representing the quality of the response according to human preferences.
Analogy: It’s like creating a gastronomic AI critic. After observing thousands of human judgments on dishes, it learns to predict whether a dish will be appreciated or not, without needing a human each time.
Reward Model Architecture:
┌─────────────────────────────────────────────────────────────┐
│ INPUT │
│ Prompt: "Explain gravity" │
│ Response: "Gravity is a fundamental force..." │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TRANSFORMER ENCODER │
│ (Generally same architecture as LLM) │
│ Ex: LLaMA-7B, GPT-2, BERT │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ POOLING LAYER │
│ Extracts a fixed representation of the text │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LINEAR HEAD (Final layer) │
│ [hidden_dim] → [1] │
│ Transforms representation into a single score │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ Reward Score: 7.3 / 10 │
└─────────────────────────────────────────────────────────────┘
Training the Reward Model:
The Reward Model is trained with a ranking loss (Bradley-Terry model):
Mathematical Formula:
Loss = -log(σ(r(x, y_winner) - r(x, y_loser)))
Where:
- x = the prompt
- y_winner = the preferred response
- y_loser = the non-preferred response
- r(x, y) = score given by Reward Model
- σ = sigmoid function
Simple Explanation:
The model must learn to give a higher score to the preferred response than to the non-preferred response.
# Concrete example
prompt = "Explain photosynthesis"
response_good = "Photosynthesis is the process by which plants..."
response_bad = "Photosynthesis is when plants eat."
# The Reward Model must learn:
score_good = reward_model(prompt, response_good) # → should be high (e.g., 8.5)
score_bad = reward_model(prompt, response_bad) # → should be low (e.g., 2.1)
# Loss will be low if score_good >> score_bad
# Loss will be high if score_good ≈ score_bad (the model hasn't learned yet)
Practical Implementation of the Reward Model:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
class RewardModel(nn.Module):
def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
super().__init__()
# Load the base transformer
self.transformer = AutoModel.from_pretrained(model_name)
# Regression head to produce a single score
self.reward_head = nn.Linear(
self.transformer.config.hidden_size, # E.g., 4096 for Llama-2-7B
1 # Single output score
)
def forward(self, input_ids, attention_mask):
# Pass through the transformer
outputs = self.transformer(
input_ids=input_ids,
attention_mask=attention_mask
)
# Take the representation of the last token (EOS)
# Shape: [batch_size, hidden_size]
last_hidden_state = outputs.last_hidden_state
# Extract the representation of the last non-padding token
sequence_lengths = attention_mask.sum(dim=1) - 1
batch_size = input_ids.shape[0]
last_token_hidden = last_hidden_state[
torch.arange(batch_size, device=input_ids.device),
sequence_lengths
]
# Calculate the reward score
reward_score = self.reward_head(last_token_hidden).squeeze(-1)
# Shape: [batch_size]
return reward_score
# ============================================================================
# REWARD MODEL TRAINING
# ============================================================================
def train_reward_model(comparison_dataset, model, tokenizer, epochs=3):
"""
Trains the reward model on human comparisons
Args:
comparison_dataset: List of {prompt, response_chosen, response_rejected}
model: RewardModel to train
tokenizer: Corresponding tokenizer
"""
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for epoch in range(epochs):
total_loss = 0
for batch in comparison_dataset:
# batch = {
# 'prompt': "Explain photosynthesis",
# 'chosen': "Photosynthesis is the process...",
# 'rejected': "Photosynthesis is when..."
# }
# Tokenize the (prompt, chosen response) pairs
chosen_inputs = tokenizer(
batch['prompt'] + " " + batch['chosen'],
return_tensors="pt",
padding=True,
truncation=True,
max_length=512
)
# Tokenize the (prompt, rejected response) pairs
rejected_inputs = tokenizer(
batch['prompt'] + " " + batch['rejected'],
return_tensors="pt",
padding=True,
truncation=True,
max_length=512
)
# Calculate scores
reward_chosen = model(
input_ids=chosen_inputs['input_ids'],
attention_mask=chosen_inputs['attention_mask']
)
reward_rejected = model(
input_ids=rejected_inputs['input_ids'],
attention_mask=rejected_inputs['attention_mask']
)
# Loss: we want reward_chosen > reward_rejected
# Bradley-Terry loss
loss = -torch.log(
torch.sigmoid(reward_chosen - reward_rejected)
).mean()
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(comparison_dataset)
print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_loss:.4f}")
return model
# Usage example
reward_model = RewardModel("gpt2") # Use GPT-2 for the example (lightweight)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
# Comparison dataset (simplified)
comparison_data = [
{
'prompt': "What is AI?",
'chosen': "Artificial intelligence is a field of computer science that aims to create systems capable of performing tasks that normally require human intelligence.",
'rejected': "AI is smart robots."
},
# ... 50,000 other examples
]
trained_reward_model = train_reward_model(comparison_data, reward_model, tokenizer)
Reward Model Validation:
How do you know if the Reward Model has learned well?
def evaluate_reward_model(model, tokenizer, test_pairs):
"""
Evaluates the accuracy of the reward model on test pairs
test_pairs = [
{'prompt': "...", 'chosen': "...", 'rejected': "..."},
...
]
"""
model.eval()
correct_predictions = 0
with torch.no_grad():
for pair in test_pairs:
# Calculate scores
chosen_inputs = tokenizer(
pair['prompt'] + " " + pair['chosen'],
return_tensors="pt",
truncation=True,
max_length=512
)
rejected_inputs = tokenizer(
pair['prompt'] + " " + pair['rejected'],
return_tensors="pt",
truncation=True,
max_length=512
)
reward_chosen = model(**chosen_inputs)
reward_rejected = model(**rejected_inputs)
# Did the model give a higher score to the chosen response?
if reward_chosen > reward_rejected:
correct_predictions += 1
accuracy = correct_predictions / len(test_pairs)
print(f"Reward Model Accuracy: {accuracy:.2%}")
return accuracy
# A good Reward Model generally achieves 70-80% accuracy
Phase 3: Reinforcement Learning Optimization (PPO)
Objective: Use the Reward Model to fine-tune the LLM to generate better responses
This is the most complex and powerful phase of RLHF. An algorithm called PPO (Proximal Policy Optimization) developed by OpenAI is used.
Analogy: Imagine a student (the LLM) who must write essays. They now have an automatic teacher (the Reward Model) who grades each essay instantly. The student writes an essay, receives a grade, and learns to adjust their style to get better grades. After thousands of essays, they become excellent.
The PPO Process in Detail:
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: RESPONSE GENERATION │
└─────────────────────────────────────────────────────────────┘
│
┌──────────────────┴──────────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ LLM Policy │ │ LLM Reference│
│ (learning) │ Generates │ (Frozen SFT) │
│ │ response │ │
└──────────────┘ └──────────────┘
│ │
│ "Photosynthesis is..." │
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: EVALUATION BY REWARD MODEL │
└─────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────┐
│ Reward Model │
│ gives a score │
│ Reward = 7.5/10 │
└────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: KL PENALTY CALCULATION │
│ (To prevent the model from drifting too far from SFT) │
└─────────────────────────────────────────────────────────────┘
│
▼
KL_penalty = KL_divergence(
P_policy(response | prompt),
P_reference(response | prompt)
)
Final_reward = Reward - β * KL_penalty
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: LLM UPDATE BY PPO │
│ Adjust weights to maximize Final_reward │
└─────────────────────────────────────────────────────────────┘
Why is the KL Penalty Crucial?
Without this penalty, the model could “cheat” the Reward Model by generating nonsensical responses that artificially maximize the score. The KL penalty forces the model to stay close to the initial SFT model.
Example of Drift without KL Penalty:
Prompt: "Explain photosynthesis"
Without KL penalty:
Generated response: "PHOTOSYNTHESIS! INCREDIBLE! MAGNIFICENT! Plants are EXCEPTIONAL!!!!!!!"
→ Reward Model score: 9.5/10 (it learned that enthusiasm is good)
❌ PROBLEM: The response is useless but games the system
With KL penalty:
Generated response: "Photosynthesis is the biological process..."
→ Reward Model score: 8.0/10
→ KL penalty: 0.5 (close to SFT model)
→ Final reward: 8.0 - 0.01 * 0.5 = 7.995
✅ BETTER: Useful and natural response
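To make the role of the penalty concrete, here is a minimal sketch of how a per-sequence KL estimate is combined with the reward model score (illustrative, not TRL's exact internals); policy_logprobs and ref_logprobs are assumed to be the token log-probabilities of the generated response under the two models:
import torch

def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """
    Combine the reward model score with a KL penalty toward the reference model.

    policy_logprobs, ref_logprobs: tensors of shape [response_length] holding
    log π_θ(token) and log π_ref(token) for each generated token.
    """
    # Per-token KL estimate: log π_θ(y_t | ...) - log π_ref(y_t | ...)
    kl_per_token = policy_logprobs - ref_logprobs
    kl_penalty = kl_per_token.sum()

    # The reward that PPO actually optimizes
    return reward_score - beta * kl_penalty

# Toy example: a response the reward model loves but that drifts far from the SFT model
reward_score = torch.tensor(9.5)
policy_logprobs = torch.tensor([-0.1, -0.2, -0.1])  # policy is very confident
ref_logprobs = torch.tensor([-3.0, -4.0, -3.5])     # reference finds it very unlikely
print(kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs))  # the drift lowers the final reward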
Simplified PPO Implementation for RLHF:
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# ============================================================================
# PPO CONFIGURATION
# ============================================================================
ppo_config = PPOConfig(
model_name="meta-llama/Llama-2-7b-hf",
learning_rate=1.41e-5, # Learning rate (very important)
batch_size=64, # Number of examples per batch
mini_batch_size=4, # Mini-batch for PPO
gradient_accumulation_steps=16, # Gradient accumulation
# PPO hyperparameters
ppo_epochs=4, # Number of passes over each batch
# KL penalty
init_kl_coef=0.2, # Initial coefficient of KL penalty
target_kl=0.1, # Target KL divergence
# Clipping (PPO stability)
cliprange=0.2, # PPO clipping ratio
cliprange_value=0.2, # Clipping for value function
# Values
vf_coef=0.1, # Coefficient of value function loss
# Logging
log_with="tensorboard",
tracker_project_name="rlhf-llama2"
)
# ============================================================================
# MODEL LOADING
# ============================================================================
# 1. Policy Model (model to train)
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained(
"meta-llama/Llama-2-7b-sft", # Model after SFT
torch_dtype=torch.float16
)
# 2. Reference Model (frozen, for KL penalty)
ref_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-sft", # Same as policy but frozen
torch_dtype=torch.float16
)
ref_model.eval() # Evaluation mode (no gradient)
# 3. Reward Model (already trained)
# RewardModel is the custom nn.Module defined earlier, so rebuild it and load
# the trained weights (the checkpoint path here is illustrative)
reward_model = RewardModel("meta-llama/Llama-2-7b-hf")
reward_model.load_state_dict(torch.load("./trained_reward_model/reward_model.pt"))
reward_model.eval()
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
# ============================================================================
# PPO TRAINER CREATION
# ============================================================================
ppo_trainer = PPOTrainer(
config=ppo_config,
model=policy_model,
ref_model=ref_model,
tokenizer=tokenizer
)
# ============================================================================
# RLHF TRAINING LOOP
# ============================================================================
def rlhf_training_loop(ppo_trainer, reward_model, prompts, num_iterations=1000):
"""
Main RLHF training loop
Args:
ppo_trainer: Configured PPO trainer
reward_model: Trained reward model
prompts: List of prompts for generation
num_iterations: Number of iterations
"""
for iteration in range(num_iterations):
# ====================================================================
# STEP 1: RESPONSE GENERATION
# ====================================================================
# Select a batch of prompts
batch_prompts = sample_prompts(prompts, batch_size=ppo_config.batch_size)
# Tokenize prompts
inputs = tokenizer(
batch_prompts,
return_tensors="pt",
padding=True,
truncation=True
).to(policy_model.device)
# Generate responses with the policy model
with torch.no_grad():
response_tensors = ppo_trainer.generate(
inputs['input_ids'],
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True
)
# Decode responses
responses = tokenizer.batch_decode(
response_tensors,
skip_special_tokens=True
)
# ====================================================================
# STEP 2: REWARD CALCULATION
# ====================================================================
rewards = []
for prompt, response in zip(batch_prompts, responses):
# Calculate score with the reward model
reward_inputs = tokenizer(
prompt + " " + response,
return_tensors="pt",
truncation=True,
max_length=512
).to(reward_model.device)
with torch.no_grad():
reward_score = reward_model(**reward_inputs)
rewards.append(reward_score.cpu())
rewards = torch.tensor(rewards)
# ====================================================================
# STEP 3: PPO UPDATE
# ====================================================================
# The PPOTrainer automatically handles:
# - KL divergence calculation with ref_model
# - Application of KL penalty
# - PPO optimization
# - Clipping
stats = ppo_trainer.step(
queries=inputs['input_ids'],
responses=response_tensors,
scores=rewards
)
# ====================================================================
# LOGGING
# ====================================================================
if iteration % 10 == 0:
print(f"Iteration {iteration}/{num_iterations}")
print(f" Mean Reward: {rewards.mean().item():.3f}")
print(f" Mean KL: {stats['objective/kl']:.3f}")
print(f" Policy Loss: {stats['ppo/loss/policy']:.3f}")
# Generation example
example_prompt = batch_prompts[0]
example_response = responses[0]
print(f"\n Example:")
print(f" Prompt: {example_prompt}")
print(f" Response: {example_response}\n")
return policy_model
# Launch training
prompts_dataset = load_prompts("./rlhf_prompts.json") # Your prompts
final_model = rlhf_training_loop(ppo_trainer, reward_model, prompts_dataset)
# Save the final model
final_model.save_pretrained("./rlhf-llama2-final")
What Happens Under the Hood of PPO:
PPO is a sophisticated reinforcement learning algorithm. Here’s a simplified explanation:
1. Advantage Calculation:
# The advantage measures "how much better this action is than average"
advantage = reward - baseline
# Example:
# Response reward = 8.5
# Baseline (average of recent rewards) = 7.0
# Advantage = 8.5 - 7.0 = 1.5
# → This response is better than average, we want to reinforce it
2. Probability Ratio Calculation:
# We compare the probability of generating this response now vs before
ratio = P_new(response | prompt) / P_old(response | prompt)
# If ratio > 1: The model now generates this response more easily
# If ratio < 1: The model now generates this response less easily
3. PPO Clipping (Main innovation of PPO):
# We limit overly abrupt changes
clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon) # epsilon = 0.2
# We take the minimum between the clipped and unclipped version
objective = min(ratio * advantage, clipped_ratio * advantage)
# Why? To prevent the model from changing too quickly and becoming unstable
4. Final Loss:
loss = -objective + vf_coef * value_loss - entropy_bonus
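Putting these pieces together, here is a minimal sketch of the clipped policy objective on its own (per-sequence, leaving out the value and entropy terms that PPOTrainer handles internally):
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, epsilon=0.2):
    """
    Clipped PPO surrogate loss.

    new_logprobs / old_logprobs: log π_new(y|x) and log π_old(y|x) for each sample
    advantages: reward - baseline for each sample
    """
    # Probability ratio π_new / π_old, computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)

    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages

    # Take the pessimistic (minimum) objective, then negate it to get a loss
    return -torch.min(unclipped, clipped).mean()

# Toy example
new_logprobs = torch.tensor([-1.0, -2.5])
old_logprobs = torch.tensor([-1.2, -2.0])
advantages = torch.tensor([1.5, -0.5])
print(ppo_policy_loss(new_logprobs, old_logprobs, advantages))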
RLHF Monitoring Metrics
During RLHF training, several critical metrics are monitored:
1. Mean Reward:
- Average reward given by the Reward Model
- Goal: Should increase gradually
- Example: Iteration 0 → 5.2, Iteration 500 → 7.8, Iteration 1000 → 8.5
2. KL Divergence:
- Measures how far the policy model has drifted from the reference model
- Goal: Should remain low (< 0.1 - 0.5)
- Alert: If KL > 1.0 → Model is drifting too much, risk of mode collapse
3. Policy Loss:
- Loss of the PPO objective
- Goal: Should decrease but not too quickly
4. Value Loss:
- Error of the value function (predicts future reward)
- Goal: Should converge toward 0
Example Logs:
Iteration 100/1000
Mean Reward: 6.234
Mean KL: 0.045
Policy Loss: -0.123
Value Loss: 0.234
Entropy: 5.678
Example:
Prompt: "Explain gravity"
Response: "Gravity is the attractive force between two massive objects..."
Reward: 7.2
Iteration 200/1000
Mean Reward: 7.123 ⬆
Mean KL: 0.089
Policy Loss: -0.245
Value Loss: 0.156
Entropy: 5.234
Iteration 1000/1000
Mean Reward: 8.567 ⬆⬆
Mean KL: 0.123
Policy Loss: -0.456
Value Loss: 0.045
Entropy: 4.890
✅ Training Complete!
Advantages of RLHF
✅ Exceptional Quality: Produces state-of-the-art quality models (ChatGPT, Claude, GPT-4)
✅ Learns Subtle Preferences: Captures nuances that SFT alone cannot (tone, style, length)
✅ Direct Optimization: Explicitly maximizes what humans prefer
✅ Flexibility: Can be applied to different objectives (utility, safety, creativity)
✅ Continuous Improvement: Can be iterated multiple times to gradually improve
Disadvantages of RLHF
❌ Extremely Complex: Difficult implementation with many sensitive hyperparameters
❌ Instability: PPO is notoriously unstable, risks of divergence, mode collapse
❌ Computational Cost: Requires 3 models in memory simultaneously (policy, reference, reward)
❌ Training Time: Much longer than SFT (days/weeks)
❌ Labeling Cost: Very expensive to create 50K-300K human comparisons
❌ Critical Hyperparameters: KL penalty, learning rate, clipping range must be finely tuned
❌ Requires Expertise: Few teams truly master RLHF in production
Real-World Examples of RLHF Usage
ChatGPT (OpenAI):
- GPT-3.5 → SFT on ~13K demonstrations → RLHF on ~300K comparisons
- Result: Radical transformation of GPT-3 into ChatGPT
Claude (Anthropic):
- Uses RLHF + Constitutional AI (RLAIF)
- Focus on safety and alignment with human values
GPT-4 (OpenAI):
- Massive RLHF with domain experts
- Better quality and safety than GPT-3.5
Step 3: DPO - Direct Preference Optimization
The 2023 Revolution: Simplifying RLHF
DPO (Direct Preference Optimization) is a major innovation published by Stanford researchers (Rafailov et al., 2023) in May 2023. It’s a method that achieves the same results as RLHF but with drastically reduced complexity.
Detailed Definition: DPO is an elegant reformulation of the RLHF objective that eliminates the need for explicit reward modeling and reinforcement learning. Instead of the traditional two-stage process (train reward model → optimize policy with RL), DPO directly optimizes the policy on preference data in a single supervised learning phase.
Key Theoretical Insight: The breakthrough of DPO comes from a mathematical reparameterization. The authors proved that the optimal policy π* for the RLHF objective can be expressed analytically in terms of the reward function:
π*(y|x) = π_ref(y|x) · exp(R(x,y)/β) / Z(x)
Where Z(x) is a partition function. By inverting this relationship, we can express the reward in terms of policies:
R(x,y) = β·log(π*(y|x)/π_ref(y|x)) + β·log Z(x)
The DPO Transformation: Substituting this reward expression into the Bradley-Terry preference model and observing that the partition function cancels out in preference comparisons, we obtain the DPO loss:
L_DPO(π_θ) = -E_{(x,y_w,y_l)~D}[log σ(β·log(π_θ(y_w|x)/π_ref(y_w|x)) - β·log(π_θ(y_l|x)/π_ref(y_l|x)))]
This loss function:
- Increases the likelihood of preferred responses relative to the reference model
- Decreases the likelihood of rejected responses
- Implicitly maintains the KL constraint through the log-ratio formulation
- Requires no reward model: The preference signal is directly encoded in the loss
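Written out in code, this loss is only a few lines. Here is a minimal from-scratch sketch; the four log-probability arguments are assumed to be the summed token log-probabilities of the chosen/rejected responses under the policy and reference models:
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Direct Preference Optimization loss.

    Each argument is a tensor of shape [batch] with the summed log-probability
    of the chosen / rejected response under the policy or reference model.
    """
    # Implicit rewards: β·log(π_θ(y|x) / π_ref(y|x))
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log σ(reward margin): push the chosen response above the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0]),
    policy_rejected_logps=torch.tensor([-15.0]),
    ref_chosen_logps=torch.tensor([-13.0]),
    ref_rejected_logps=torch.tensor([-14.0]),
)
print(loss)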
Why DPO Works: By optimizing this loss, we’re simultaneously:
- Teaching the model which responses are better (preference learning)
- Preventing mode collapse through the reference model constraint
- Avoiding the instabilities of RL optimization
- Achieving the same optimum as RLHF but through a simpler path
Practical Advantages:
- Single-stage training: No need to train a separate reward model
- Stable optimization: Standard gradient descent, no RL instabilities
- Memory efficient: Only 2 models in memory instead of 3
- Faster convergence: Typically 2-3x faster than PPO-based RLHF
- Easier debugging: Direct supervision signal, easier to diagnose problems
The Problem with RLHF
Recap of RLHF complexity:
RLHF = Train Reward Model + Optimize with PPO + 3 models in memory
Problems:
❌ Separate Reward Model to train (expensive)
❌ PPO unstable and difficult to tune
❌ KL penalty delicate to calibrate
❌ 3 models loaded simultaneously (huge memory)
❌ Many sensitive hyperparameters
The Brilliant Idea of DPO
Question: Is it possible to optimize directly on human preferences without going through a Reward Model and PPO?
Answer: YES! That’s exactly what DPO does.
Analogy: Instead of creating an automatic teacher (Reward Model) and then using their grades to guide the student (PPO), we show the student directly pairs of examples: “This essay is better than that one. Adjust yourself to produce more often essays like the first one.”
How Does DPO Work?
Mathematical Intuition
RLHF tries to maximize:
reward(x, y) - β * KL(π_θ || π_ref)
Where:
- reward(x, y) = Reward Model score
- KL = divergence between policy and reference
- β = penalty coefficient
DPO realizes that this optimization can be rewritten in a way that no longer requires an explicit Reward Model!
The Magic Formula of DPO:
Loss_DPO = -log σ(β * log(π_θ(y_w | x) / π_ref(y_w | x)) - β * log(π_θ(y_l | x) / π_ref(y_l | x)))
Where:
- y_w = preferred response (winner)
- y_l = rejected response (loser)
- π_θ = policy model (being trained)
- π_ref = reference model (frozen)
- β = hyperparameter (generally 0.1-0.5)
- σ = sigmoid function
Simple Explanation:
- **π_θ(y_w | x) / π_ref(y_w | x)**: Probability ratio that the current model generates the good response vs the reference model
  - If > 1: The model generates the good response more easily than before ✅
  - If < 1: The model generates the good response less easily than before ❌
- **π_θ(y_l | x) / π_ref(y_l | x)**: Ratio for the bad response
  - If > 1: The model generates the bad response more easily than before ❌
  - If < 1: The model generates the bad response less easily than before ✅
- Objective: Maximize the ratio for y_w and minimize the ratio for y_l
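This formula translates almost line-for-line into code. Here is a minimal sketch of the DPO loss, assuming you already have the summed per-sequence log-probabilities of each response under the policy and the reference model (variable names are illustrative, not tied to any library):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """
    Minimal DPO loss from per-sequence log-probabilities.
    Each argument is a tensor of shape (batch,) holding the summed
    log-probability of the full response under the given model.
    """
    # log(pi_theta(y_w|x) / pi_ref(y_w|x)) and the same for y_l
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Loss_DPO = -log sigmoid(beta * (chosen_logratio - rejected_logratio))
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()

    # Implicit rewards, useful for logging the "reward margin" during training
    chosen_reward = beta * chosen_logratio.detach()
    rejected_reward = beta * rejected_logratio.detach()
    return loss, chosen_reward, rejected_reward

# Toy usage with synthetic log-probabilities
base = torch.randn(8)
loss, r_w, r_l = dpo_loss(base - 1.0, base - 2.0, base - 1.5, base - 1.5)
print(loss.item(), (r_w > r_l).float().mean().item())
```

Conceptually, this is what TRL's DPOTrainer (used in the implementation further below) computes for you, on top of the tokenization and batching details.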
Visualization of the DPO Process:
┌─────────────────────────────────────────────────────────────┐
│ INPUT: A Human Comparison │
│ Prompt: "Explain gravity" │
│ y_win: "Gravity is the curvature of spacetime..." │
│ y_lose: "Gravity is when things fall" │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: Calculate P(y_win | prompt) with π_θ and π_ref │
└─────────────────────────┬───────────────────────────────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Policy Model │ │ Ref Model │
│ P_θ(y_win) │ │ P_ref(y_win) │
│ = 0.024 │ │ = 0.012 │
└──────────────┘ └──────────────┘
│ │
└─────────────┬─────────────┘
│
▼
Ratio_win = 0.024 / 0.012 = 2.0
→ The model generates the good response
2x more easily than before ✅
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: Calculate P(y_lose | prompt) │
└─────────────────────────┬───────────────────────────────────┘
│
Ratio_lose = P_θ(y_lose) / P_ref(y_lose)
= 0.008 / 0.015 = 0.53
→ The model generates the bad response
2x less easily than before ✅
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: Calculate DPO Loss │
│ Loss = -log σ(β * log(2.0) - β * log(0.53)) │
│ = -log σ(0.3 * 0.69 - 0.3 * (-0.63)) │
│ = -log σ(0.207 + 0.189) = -log σ(0.396) │
│ = -log(0.598) = 0.514 │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: Backpropagation │
│ Update π_θ to reduce loss │
│ → Increase P_θ(y_win) │
│ → Decrease P_θ(y_lose) │
└─────────────────────────────────────────────────────────────┘
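The arithmetic in STEP 3 (the numbers assume β = 0.3) can be checked in a few lines of plain Python; `win_ratio` and `lose_ratio` are the two probability ratios computed in STEPS 1 and 2:

```python
import math

beta = 0.3                    # the value used in the diagram above
win_ratio = 0.024 / 0.012     # P_theta(y_win) / P_ref(y_win) = 2.0
lose_ratio = 0.008 / 0.015    # P_theta(y_lose) / P_ref(y_lose) ≈ 0.53

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

margin = beta * math.log(win_ratio) - beta * math.log(lose_ratio)
loss = -math.log(sigmoid(margin))
print(f"margin = {margin:.3f}, loss = {loss:.3f}")  # ≈ margin 0.397, loss 0.514
```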
Practical Implementation of DPO
Here’s a complete and functional implementation with HuggingFace TRL:
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer
from peft import LoraConfig, get_peft_model
# ============================================================================
# STEP 1: DATA PREPARATION
# ============================================================================
# Format required for DPO:
# {
# 'prompt': "The question or instruction",
# 'chosen': "The preferred response",
# 'rejected': "The rejected response"
# }
# Example with a public dataset
dataset = load_dataset("Anthropic/hh-rlhf") # Human preference dataset
# Transform to the right format
def format_for_dpo(example):
    """
    Transforms the Anthropic format into the standard DPO format.
    hh-rlhf only provides the full 'chosen' and 'rejected' dialogues,
    so the shared prompt is everything up to the last "Assistant:" turn.
    """
    separator = "\n\nAssistant:"
    idx = example['chosen'].rfind(separator)
    return {
        'prompt': example['chosen'][: idx + len(separator)],
        'chosen': example['chosen'][idx + len(separator):],
        'rejected': example['rejected'][idx + len(separator):]
    }
train_dataset = dataset['train'].map(format_for_dpo)
eval_dataset = dataset['test'].select(range(1000)).map(format_for_dpo)
# Example data:
# {
# 'prompt': "Human: How to make a chocolate cake?\n\nAssistant:",
# 'chosen': "Here's a simple recipe: 1. Preheat...",
# 'rejected': "A chocolate cake is simple..."
# }
# ============================================================================
# STEP 2: MODEL LOADING
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf" # Or your model after SFT
# Model to train (policy)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Reference model (will be automatically created by DPOTrainer)
# It's a frozen copy of the initial model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# ============================================================================
# STEP 3: LORA CONFIGURATION (Optional but Recommended)
# ============================================================================
# LoRA for efficient training
lora_config = LoraConfig(
r=16, # LoRA rank (higher for DPO than SFT)
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"], # More modules
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# ============================================================================
# STEP 4: DPO TRAINING CONFIGURATION
# ============================================================================
training_args = TrainingArguments(
output_dir="./dpo-llama2",
# Training hyperparameters
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch = 16
# Learning rate (often higher than SFT)
learning_rate=5e-6,
lr_scheduler_type="cosine",
warmup_steps=100,
# Optimizations
fp16=True,
gradient_checkpointing=True,
# Logging and saving
logging_steps=10,
save_steps=500,
eval_steps=500,
evaluation_strategy="steps",
save_total_limit=3,
# DPO specific
remove_unused_columns=False, # Important for DPO!
report_to="tensorboard"
)
# ============================================================================
# STEP 5: DPO TRAINER CREATION
# ============================================================================
dpo_trainer = DPOTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
# CRUCIAL HYPERPARAMETER: β (beta)
beta=0.1, # Controls the importance of KL penalty
# Low β (0.1): More aggressive, drifts more from reference model
# High β (0.5): More conservative, stays close to reference model
# Maximum length
max_length=512,
max_prompt_length=256,
# Loss type
loss_type="sigmoid" # Or "hinge", "ipo"
)
# ============================================================================
# STEP 6: TRAINING
# ============================================================================
print("🚀 Starting DPO training...")
dpo_trainer.train()
# Save
model.save_pretrained("./dpo-llama2-final")
tokenizer.save_pretrained("./dpo-llama2-final")
print("✅ DPO training completed!")
# ============================================================================
# STEP 7: EVALUATION AND INFERENCE
# ============================================================================
def generate_with_dpo_model(prompt, model, tokenizer):
"""Generates a response with the DPO model"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
return response
# Test
prompt = "Human: Explain general relativity in simple terms.\n\nAssistant:"
response = generate_with_dpo_model(prompt, model, tokenizer)
print(f"Prompt: {prompt}")
print(f"Response: {response}")
Detailed Comparison: RLHF vs DPO
| Criterion | RLHF | DPO |
|---|---|---|
| Complexity | 🔴 Very complex (2 phases) | 🟢 Simple (1 phase) |
| Number of Models | 🔴 3 models (policy, ref, reward) | 🟢 2 models (policy, ref) |
| Stability | 🔴 PPO unstable | 🟢 Very stable |
| GPU Memory | 🔴 ~40-60 GB for 7B | 🟢 ~20-30 GB for 7B |
| Speed | 🔴 Slow (generation + PPO) | 🟢 2-3x faster |
| Hyperparameters | 🔴 Many and sensitive | 🟢 Mainly β |
| Final Quality | 🟡 Excellent | 🟢 Comparable to RLHF |
| Implementation Ease | 🔴 Very difficult | 🟢 Easy with TRL |
| Computational Cost | 🔴 High | 🟢 Moderate |
Verdict: DPO is generally preferable unless you already have RLHF infrastructure in place.
DPO Variants
Several DPO variants have been proposed:
1. IPO (Identity Preference Optimization):
- Modifies DPO loss to better handle weak preferences
- Better when humans are uncertain about their preference
2. CPO (Conservative Preference Optimization):
- Adds an explicit safety constraint
- Prevents the model from generating dangerous content even if rewarded
3. KTO (Kahneman-Tversky Optimization):
- Based on prospect theory in behavioral economics
- Better handling of asymmetric preferences
4. ORPO (Odds Ratio Preference Optimization):
- Combines SFT and DPO in a single step
- More data efficient
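Most of these variants differ mainly in how they turn the policy log-ratio margin into a per-example loss. The sketch below shows simplified forms of three margin-based losses (sigmoid, hinge, IPO-style); the exact objectives in the papers differ in details, and `gamma`/`tau` are illustrative parameters of this sketch. They map roughly onto the `loss_type` options mentioned in the trainer configuration above:

```python
import torch
import torch.nn.functional as F

# m = beta * (chosen_logratio - rejected_logratio), one value per example
def dpo_sigmoid_loss(m):
    # Standard DPO: logistic loss on the margin
    return -F.logsigmoid(m)

def hinge_loss(m, gamma=1.0):
    # Hinge variant: only penalizes margins smaller than gamma
    return torch.relu(gamma - m)

def ipo_style_loss(m, beta=0.1, tau=0.1):
    # IPO-style: squared distance between the raw log-ratio margin (m / beta)
    # and a target of 1 / (2 * tau) -- gentler on already-large margins
    return (m / beta - 1.0 / (2.0 * tau)) ** 2

m = torch.tensor([-1.0, 0.0, 1.0, 3.0])
print(dpo_sigmoid_loss(m))  # decreases smoothly as the margin grows
print(hinge_loss(m))        # exactly zero once the margin exceeds gamma
print(ipo_style_loss(m))    # penalizes margins far from the target on both sides
```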
Advantages of DPO
✅ Simplicity: Single training phase, no separate Reward Model
✅ Stability: Much more stable than PPO, converges predictably
✅ Memory Efficiency: Requires 2 models instead of 3 (~30% GPU savings)
✅ Speed: 2-3x faster than RLHF
✅ Implementation Ease: ~100 lines of code with TRL
✅ Hyperparameters: Mainly β, much simpler to tune
✅ Quality: Results comparable to RLHF in most benchmarks
✅ Open Source: Excellent TRL library from HuggingFace
Disadvantages of DPO
❌ Data Required: Still needs 50K+ human comparisons (like RLHF)
❌ Labeling Cost: Same human cost as RLHF to create comparisons
❌ Less Control: No explicit Reward Model to debug
❌ Indirect Optimization: Optimizes an approximation of RLHF objective, not the direct objective
❌ Less Mature: More recent than RLHF (2023 vs 2017), less industrial feedback
Real-World Examples of DPO Usage
Zephyr-7B (HuggingFace):
- Mistral-7B + SFT + DPO
- Performance close to GPT-3.5 with only 7B parameters
- Became the most popular open-source model in late 2023
Starling-7B (Berkeley):
- Significant improvements on reasoning and code
- Uses DPO variants (RLAIF)
Tulu 2 (AllenAI):
- Suite of models optimized with DPO
- Focus on task diversity
Step 4: GRPO - Group Relative Policy Optimization
The Latest Innovation (2024): DeepSeek R1
GRPO (Group Relative Policy Optimization) is the most recent of the four techniques, introduced by DeepSeek in 2024 (first described in their DeepSeekMath work and made famous by the R1 reasoning model). It’s a major evolution beyond DPO that better exploits group comparisons.
Detailed Definition: GRPO is a novel preference optimization algorithm that extends beyond pairwise comparisons to leverage full ranking information over groups of responses. Rather than comparing two responses at a time (A vs B), GRPO simultaneously considers multiple candidates and learns from their relative ordering (A > B > C > D).
Core Innovation: GRPO addresses a fundamental limitation of both RLHF and DPO: preference data efficiency. Traditional methods only extract binary signals from human feedback, while GRPO extracts richer multi-way comparison information.
Theoretical Foundation: GRPO models preferences using a Plackett-Luce ranking model, which generalizes the Bradley-Terry model (used in DPO) to handle rankings over arbitrary group sizes:
P(rank(y₁,...,yₖ) | x) = ∏ᵢ₌₁ᵏ exp(R(x,yᵢ)) / ∑ⱼ₌ᵢᵏ exp(R(x,yⱼ))
This model captures the probability of observing a particular ranking, where each response is sequentially selected proportional to its exponentiated reward.
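A tiny numerical sketch of this formula (plain Python, with arbitrary illustrative rewards) makes the "sequential selection" reading concrete:

```python
import math

def plackett_luce_prob(rewards_in_rank_order):
    """
    P(observed ranking) under the Plackett-Luce model, given the rewards
    R(x, yᵢ) listed from best-ranked to worst-ranked.
    """
    prob = 1.0
    for i in range(len(rewards_in_rank_order)):
        numerator = math.exp(rewards_in_rank_order[i])
        denominator = sum(math.exp(r) for r in rewards_in_rank_order[i:])
        prob *= numerator / denominator
    return prob

# 4 responses ranked from best to worst, with illustrative rewards
print(plackett_luce_prob([2.0, 1.0, 0.5, -1.0]))  # ranking consistent with rewards -> high probability
print(plackett_luce_prob([-1.0, 0.5, 1.0, 2.0]))  # reversed ranking -> low probability
```

With K = 2 this reduces to the Bradley-Terry model used by DPO.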
GRPO Optimization Objective: Following DPO’s approach, GRPO directly parameterizes the reward in terms of policy ratios:
L_GRPO(π_θ) = -E[(x,{yᵢ}ᵢ₌₁ᴷ)~D] [∑ᵢ₌₁ᴷ advantage(yᵢ, rank) · log(π_θ(yᵢ|x)/π_ref(yᵢ|x))]
Where advantage(yᵢ, rank) is computed based on the response’s position in the ranking:
- Top-ranked responses get positive advantages (reinforce)
- Bottom-ranked responses get negative advantages (suppress)
- Middle-ranked responses get smaller advantages (gentle adjustment)
Advantage Computation Methods:
- Rank-based: advantage(yᵢ) = (K - rankᵢ + 1) / K - 0.5
- Exponential: advantage(yᵢ) = exp(-λ·(rankᵢ - 1)) for some λ > 0
- Tournament-style: advantage(yᵢ) = ∑ⱼ≠ᵢ sign(rankⱼ - rankᵢ)
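As a rough illustration, here is a compact sketch of these three schemes in NumPy (the centering in the exponential variant and the value of λ are choices of this sketch, not prescribed by GRPO):

```python
import numpy as np

def rank_based_advantages(ranks):
    # ranks: 1 = best ... K = worst
    K = len(ranks)
    return np.array([(K - r + 1) / K - 0.5 for r in ranks])

def exponential_advantages(ranks, lam=1.0):
    adv = np.array([np.exp(-lam * (r - 1)) for r in ranks])
    return adv - adv.mean()   # centered so that low ranks end up negative

def tournament_advantages(ranks):
    ranks = np.array(ranks)
    # For each response: how many others it beats minus how many beat it
    return np.array([np.sign(ranks - r).sum() for r in ranks])

ranks = [1, 3, 2, 4]  # y₁ is best, y₄ is worst
print(rank_based_advantages(ranks))   # → 0.5, 0.0, 0.25, -0.25
print(exponential_advantages(ranks))  # decays quickly with rank
print(tournament_advantages(ranks))   # → 3, -1, 1, -3
```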
Key Advantages of Group Rankings:
- Information Efficiency: one K-way ranking carries the information of K(K-1)/2 pairwise comparisons
  - Example: ranking 4 responses yields the equivalent of 6 pairwise comparisons
  - Yet it requires only 1 human annotation instead of 6
- Consistency Guarantee: direct rankings eliminate preference cycles (A > B, B > C, C > A becomes impossible)
- Finer-Grained Signals: captures the strength of preferences (A » B > C ≈ D)
- Better Generalization: learning from diverse quality levels improves robustness
Practical Implementation Strategy: GRPO typically uses:
- Group size K = 4-8: Balance between information gain and annotation difficulty
- Sampling strategies: Diverse sampling to ensure quality spread in groups
- Reward normalization: Standardize advantages within each group for stability
- Mixed training: Combine GRPO with standard DPO losses for robustness
Why GRPO Achieves Superior Performance: DeepSeek R1’s success with GRPO demonstrates that:
- Richer feedback signal enables faster learning with less data
- Explicit relative positioning helps model understand quality gradients
- Reduced annotation cost allows scaling to larger, more diverse datasets
- Better optimization landscape leads to more stable convergence
The Problem with DPO
DPO compares responses pairwise:
Prompt: "Explain photosynthesis"
Response A vs Response B → A is preferred
Response B vs Response C → B is preferred
Response A vs Response C → A is preferred
❌ PROBLEM:
- 3 comparisons needed for 3 responses
- Partial information (binary comparisons)
- Possible inconsistencies (A > B, B > C, but C > A??)
The GRPO Idea
Question: Instead of comparing pairwise, why not compare multiple responses at once and rank them?
Answer: That’s exactly what GRPO does!
Prompt: "Explain photosynthesis"
Generate 4-8 responses
Rank all together: A > C > B > D
✅ ADVANTAGE:
- 1 single annotation instead of 6 binary comparisons
- Richer information (complete ranking)
- Guaranteed consistency
- More data efficient
Analogy: Instead of asking “Is this movie better than that one?” multiple times, we directly ask “Rank these 5 movies from best to worst”. It’s faster, more consistent, and gives more information.
How Does GRPO Work?
The Process in Detail
┌─────────────────────────────────────────────────────────────┐
│ STEP 1: GROUP GENERATION │
│ For each prompt, generate K responses (K = 4-8) │
└─────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Prompt: "Explain gravity" │
│ │
│ y₁: "Gravity is the curvature..." │
│ y₂: "Gravity is the force..." │
│ y₃: "Newton discovered..." │
│ y₄: "Objects fall because..." │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 2: GROUP RANKING │
│ Rank all responses together │
└─────────────────────────┬───────────────────────────────────┘
│
▼
Ranking: y₁ > y₃ > y₂ > y₄
(1st, 2nd, 3rd, 4th)
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 3: RELATIVE REWARDS CALCULATION │
│ Use ranking to calculate rewards │
└─────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Rank-based rewards: │
│ r(y₁) = +1.5 (best) │
│ r(y₃) = +0.5 (2nd) │
│ r(y₂) = -0.5 (3rd) │
│ r(y₄) = -1.5 (worst) │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STEP 4: GRPO OPTIMIZATION │
│ Maximize probability of best responses │
│ Minimize probability of worst responses │
└─────────────────────────────────────────────────────────────┘
GRPO Formula
Loss_GRPO = -∑ᵢ₌₁ᴷ advantage(yᵢ) * log π_θ(yᵢ | x)
Where:
- K = number of responses in the group (e.g., 4-8)
- advantage(yᵢ) = relative reward based on rank
- advantage(y_best) > 0 (we want to increase its probability)
- advantage(y_worst) < 0 (we want to decrease its probability)
Advantage Calculation:
```python
import numpy as np

# Method 1: Rank-based reward
def compute_rank_advantage(rankings):
"""
rankings: [1, 3, 2, 4] (rank of each response)
"""
K = len(rankings)
advantages = []
for rank in rankings:
# Transform rank into advantage
# Best rank (1) → maximum positive advantage
# Worst rank (K) → maximum negative advantage
advantage = (K + 1 - 2 * rank) / K
advantages.append(advantage)
# Normalize (mean = 0)
advantages = np.array(advantages)
advantages = advantages - advantages.mean()
return advantages
# Example:
rankings = [1, 3, 2, 4] # y₁ is 1st, y₃ is 3rd, etc.
advantages = compute_rank_advantage(rankings)
# advantages = [0.75, -0.25, 0.25, -0.75]
# y₁ (rank 1) → +0.75 (strong advantage)
# y₄ (rank 4) → -0.75 (strong disadvantage)
```
Practical Implementation of GRPO
```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.utils.data import DataLoader
# ============================================================================
# STEP 1: CONFIGURATION
# ============================================================================
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Reference model (frozen)
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
ref_model.eval()
# GRPO hyperparameters
K = 4 # Number of responses per group
beta = 0.1 # KL penalty coefficient
learning_rate = 1e-6
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# ============================================================================
# STEP 2: GROUP GENERATION
# ============================================================================
def generate_group_responses(model, prompt, K=4):
"""
Generates K different responses for the same prompt
"""
responses = []
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
for _ in range(K):
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.8, # High temperature for diversity
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
responses.append(response)
return responses
# ============================================================================
# STEP 3: RANKING (CAN BE HUMAN OR AUTOMATIC)
# ============================================================================
def rank_responses_automatic(prompt, responses, reward_model):
"""
Ranks responses automatically with a reward model
(Alternative: ask humans to rank)
"""
scores = []
for response in responses:
# Calculate score with a reward model
inputs = tokenizer(
prompt + " " + response,
return_tensors="pt",
truncation=True,
max_length=512
).to(reward_model.device)
with torch.no_grad():
score = reward_model(**inputs)
scores.append(score.item())
# Create ranking based on scores
rankings = np.argsort(-np.array(scores)) + 1 # Rank 1 = best
return rankings.tolist()
# ============================================================================
# STEP 4: ADVANTAGES CALCULATION
# ============================================================================
def compute_advantages(rankings):
"""
Transforms ranks into advantages
"""
K = len(rankings)
advantages = []
for rank in rankings:
# Rank-based advantage
advantage = (K + 1 - 2 * rank) / K
advantages.append(advantage)
# Normalize
advantages = np.array(advantages)
advantages = advantages - advantages.mean()
return torch.tensor(advantages, dtype=torch.float32)
# ============================================================================
# STEP 5: GRPO LOSS
# ============================================================================
def compute_grpo_loss(model, ref_model, prompt, responses, advantages, beta=0.1):
"""
Calculates GRPO loss for a group of responses
"""
total_loss = 0
for i, (response, advantage) in enumerate(zip(responses, advantages)):
# Tokenize
full_text = prompt + " " + response
inputs = tokenizer(
full_text,
return_tensors="pt",
truncation=True,
max_length=512
).to(model.device)
# Calculate log-probability with policy model
outputs = model(**inputs, labels=inputs['input_ids'])
logprobs_policy = -outputs.loss
# Calculate log-probability with reference model
with torch.no_grad():
outputs_ref = ref_model(**inputs, labels=inputs['input_ids'])
logprobs_ref = -outputs_ref.loss
# KL divergence (approximation)
kl = logprobs_policy - logprobs_ref
# Loss for this response
# We maximize logprobs of good responses (advantage > 0)
# We minimize logprobs of bad responses (advantage < 0)
loss = -advantage * (logprobs_policy - beta * kl)
total_loss += loss
# Average over the group
avg_loss = total_loss / len(responses)
return avg_loss
# ============================================================================
# STEP 6: TRAINING LOOP
# ============================================================================
def train_grpo(model, ref_model, prompts, reward_model, epochs=3, K=4):
"""
Main GRPO training loop
"""
model.train()
for epoch in range(epochs):
epoch_loss = 0
for prompt in prompts:
# Generate K responses
responses = generate_group_responses(model, prompt, K=K)
# Rank responses
rankings = rank_responses_automatic(prompt, responses, reward_model)
# Calculate advantages
advantages = compute_advantages(rankings)
# Calculate loss
loss = compute_grpo_loss(
model, ref_model, prompt, responses, advantages, beta=0.1
)
# Backpropagation
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
avg_epoch_loss = epoch_loss / len(prompts)
print(f"Epoch {epoch+1}/{epochs} | Loss: {avg_epoch_loss:.4f}")
return model
# Usage example (assumes a scalar-scoring reward_model is available for the automatic ranking)
prompts_list = ["Explain photosynthesis", "What is AI?", ...]
trained_model = train_grpo(model, ref_model, prompts_list, reward_model)
```
Advantages of GRPO
✅ Data Efficiency: one group ranking carries the information of K(K-1)/2 binary comparisons
✅ Consistency: No possible contradictions in preferences
✅ Rich Information: Complete ranking instead of binary comparisons
✅ Better Quality: DeepSeek R1 surpasses GPT-4 in many benchmarks
✅ Faster: Fewer human annotations needed
✅ Flexible: Can use partial or incomplete rankings
✅ Robustness: Less sensitive to noise in annotations
Disadvantages of GRPO
❌ Very Recent: little practical feedback so far (late 2024)
❌ Generation Cost: Requires generating K responses per prompt (K≈4-8)
❌ Cognitive Complexity: Ranking 8 responses is harder than comparing 2
❌ No Standard Library: Not yet integrated in TRL (coming)
❌ Limited Documentation: Fewer resources than DPO/RLHF
Final Comparison: SFT vs RLHF vs DPO vs GRPO
| Criterion | SFT | RLHF | DPO | GRPO |
|---|---|---|---|---|
| Implementation Complexity | 🟢 Easy | 🔴 Very difficult | 🟢 Easy | 🟡 Moderate |
| Stability | 🟢 Very stable | 🔴 Unstable | 🟢 Stable | 🟢 Stable |
| Final Quality | 🟡 Good | 🟢 Excellent | 🟢 Excellent | 🟢 Superior |
| Data Efficiency | 🟡 Moderate | 🔴 Low | 🟡 Moderate | 🟢 High |
| Computational Cost | 🟢 Low | 🔴 High | 🟡 Moderate | 🟡 Moderate |
| GPU Memory | 🟢 Low | 🔴 High | 🟡 Moderate | 🟡 Moderate |
| Training Time | 🟢 Fast | 🔴 Slow | 🟢 Fast | 🟢 Fast |
| Labeling Cost | 🟡 Moderate | 🔴 High | 🔴 High | 🟢 Reduced |
| Maturity | 🟢 Mature | 🟢 Mature | 🟡 Recent | 🔴 Very recent |
Comparisons and Method Selection
Decision Guide: Which Method to Choose?
Scenario 1: Limited Budget, Quick Prototype
Recommendation: SFT Only
- ✅ Fast to implement (few hours)
- ✅ Low cost (hundreds of euros)
- ✅ Already very usable results
- ❌ Quality limited by examples
Example: Create an internal chatbot for your company
Scenario 2: Maximum Quality, Comfortable Budget
Recommendation: SFT → DPO
- ✅ State-of-the-art quality
- ✅ More stable than RLHF
- ✅ Reasonable cost (few thousand euros)
- ❌ Requires human comparisons
Example: Launch a commercial chatbot product
Scenario 3: Existing RLHF Infrastructure
Recommendation: RLHF
- ✅ If you already master RLHF
- ✅ If you have GPU resources
- ❌ Otherwise, prefer DPO
Example: Large tech company with dedicated ML team
Scenario 4: Cutting-Edge Research
Recommendation: GRPO
- ✅ Better data efficiency
- ✅ Superior quality
- ❌ Very recent, little documentation
- ❌ Requires self-implementation
Example: Research lab, academic publication
Recommended Pipeline for Most Cases
1. SFT (Mandatory)
↓
Evaluation: Is the model already good enough?
↓
YES → Stop here
NO → Continue
↓
2. DPO (Recommended)
↓
Evaluation: Need improvement?
↓
YES → Iterate (more data)
NO → Deployment
Production and Best Practices
Best Labeling Practices
1. Annotator Quality:
- Train annotators on your criteria
- Use qualification tests
- Measure inter-annotator agreement (κ > 0.6; see the sketch after this list)
2. Data Diversity:
- Cover all query types
- Include edge cases and difficult cases
- Balance domains
3. Quality Control:
- Double annotation for 10% of data
- Review major disagreements
- Continuous quality monitoring
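For the κ threshold mentioned above, Cohen's kappa is the usual pairwise agreement measure. A minimal sketch with scikit-learn, assuming two annotators labeled the same comparisons with the identifier of the preferred response:

```python
from sklearn.metrics import cohen_kappa_score

# Preferred response ("A" or "B") chosen by two annotators on the same 10 pairs
annotator_1 = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "A", "B", "A", "B", "B", "A", "A", "A", "A"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")  # ≈ 0.78 here; aim for > 0.6 before trusting the labels
```

For more than two annotators, Fleiss' kappa or Krippendorff's alpha are the usual generalizations.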
Aligned Model Evaluation
Automatic Metrics:
- Perplexity: Not very correlated with perceived quality
- BLEU/ROUGE: Unsuitable for alignment
- Reward Model Score: Good proxy
Human Evaluation:
- Win Rate: Compare 2 models side by side
- Likert Scale: Rate 1-5 on several criteria
- A/B Testing: In production
Benchmarks:
- MT-Bench: Multi-turn conversations
- AlpacaEval: Comparison with GPT-4
- Arena Elo: Ranking by battles
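To make "Win Rate" and "Arena Elo" concrete, here is a small sketch on synthetic side-by-side judgments (the K-factor of 32 is a common but arbitrary choice):

```python
def win_rate(judgments):
    """judgments: list of 'A', 'B' or 'tie' from side-by-side comparisons."""
    wins_a = judgments.count("A")
    decided = wins_a + judgments.count("B")
    return wins_a / decided if decided else 0.5

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a: 1 if A wins, 0 if A loses, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

judgments = ["A", "A", "B", "tie", "A", "B", "A"]
print(f"Model A win rate: {win_rate(judgments):.2%}")

elo_a, elo_b = 1000.0, 1000.0
for j in judgments:
    score = {"A": 1.0, "B": 0.0, "tie": 0.5}[j]
    elo_a, elo_b = elo_update(elo_a, elo_b, score)
print(f"Elo after battles: A={elo_a:.0f}, B={elo_b:.0f}")
```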
Conclusion
The journey from raw pre-trained models to aligned, helpful assistants represents one of the most remarkable achievements in modern AI. What started with ChatGPT’s revolutionary RLHF approach has evolved into an ecosystem of increasingly efficient and accessible techniques.
The Evolution Timeline:
- 2020-2022: RLHF establishes the paradigm (ChatGPT’s breakthrough)
- 2023: DPO democratizes alignment (simpler, faster, equally effective)
- 2024: GRPO pushes efficiency boundaries (DeepSeek R1’s innovation)
Where We Stand Today: Alignment is no longer the exclusive domain of tech giants. With open-source tools like HuggingFace TRL, datasets like Anthropic’s HH-RLHF, and increasingly affordable compute, teams of all sizes can now build world-class aligned models.
Practical Roadmap for Your Projects:
- For rapid prototyping (days): SFT with 10K quality examples → 80% of the way there
- For production quality (weeks): SFT + DPO with 50K comparisons → state-of-the-art results
- For cutting-edge research (months): Explore GRPO and hybrid approaches → push boundaries
Looking Ahead: The field continues to evolve rapidly. Constitutional AI, RLHF from AI feedback (RLAIF), and multi-objective alignment are emerging frontiers. The next breakthrough might come from combining these techniques with synthetic data generation, continual learning, or entirely new paradigms we haven’t yet imagined.
Your Next Steps: Start simple, measure rigorously, iterate continuously. The most important alignment happens not in the algorithm choice, but in deeply understanding your users’ needs and values. Build models that don’t just follow instructions—build models that genuinely help humans thrive.
The future of AI alignment is being written today, and you’re now equipped to contribute to it.
