Section 17.2: DPO & Modern Preference Optimization

Your language model is secretly a reward model. You just need the right loss function to reveal it.
Reward, Secretly Rewarding AI Agent

Big Picture

DPO achieves RLHF-level alignment without reinforcement learning. The key insight is mathematical: the optimal policy under the RLHF objective has a closed-form relationship with the reward function. This can be implemented efficiently using parameter-efficient methods like LoRA. This means you can reparameterize the reward model loss directly in terms of the policy, training the language model on preference pairs using a simple classification-like objective. No reward model, no PPO, no value network. Building on the RLHF pipeline from Section 17.1, this dramatically simplifies the alignment pipeline and has spawned an entire family of "direct alignment" methods (KTO, ORPO, SimPO, IPO) that each address different limitations of the original formulation.

Prerequisites

Before starting, make sure you are familiar with alignment overview as covered in Section 17.1: RLHF: Reinforcement Learning from Human Feedback.

A side-by-side comparison showing RLHF's complex multi-stage pipeline versus DPO's streamlined single-stage approach — **Figure 17.2.1**: DPO looks at RLHF's elaborate setup and says: why hire a separate critic when the model can learn directly from preferences? Fewer moving parts, similar results.

1. The DPO Derivation

Fun Note

DPO's elegance is almost suspiciously simple. The entire RLHF pipeline (reward model training, PPO with value networks, careful hyperparameter tuning, and multiple GPU jobs) collapses into a single supervised learning objective that fits in a few lines of code. When the DPO paper was first circulated, many researchers assumed there must be a catch. The catch, it turned out, is that DPO is more sensitive to the quality of preference data than RLHF, which can partially compensate for noisy preferences through its reward model.

Direct Preference Optimization (Rafailov et al., 2023) begins with the same objective as RLHF: maximize expected reward while staying close to a reference policy. The standard RLHF objective is:

\max_{\pi} \; \mathbb{E}_{x \sim D, \, y \sim \pi}[r(x, y)] - \beta \, \text{KL}(\pi \| \pi_{\text{ref}})

Mental Model: The A/B Taste Test

Think of DPO as running a continuous A/B taste test. Instead of training a separate judge (reward model) to score each dish, you directly show the chef two dishes and tell them which one the diners preferred. The chef adjusts their technique to produce more of the preferred dish and less of the rejected one. This eliminates the middleman (the reward model) and its potential biases, making the training pipeline simpler and more stable, though it requires more carefully curated preference data.

The optimal solution to this constrained optimization problem has a closed-form expression:

\pi^*(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x) \cdot \exp\!\bigl(r(x, y) / \beta\bigr)}{Z(x)}

where Z(x) is the partition function that normalizes the distribution. The crucial step in DPO is rearranging this expression to solve for the reward in terms of the policy:

r(x, y) = \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

When we substitute this into the Bradley-Terry preference model, the partition function Z(x) cancels (since it appears in both the chosen and rejected terms), yielding the DPO loss:

L_{\text{DPO}} = -\mathbb{E}\!\left[\log \sigma\!\left(\beta\left(\log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right)\right]

To make this concrete, consider a single preference pair with β = 0.1. Suppose the policy assigns log-probability −2.0 to the chosen response and −3.5 to the rejected response, while the reference assigns −2.3 and −3.0 respectively:

# Numeric walkthrough of a single DPO loss evaluation
import math

beta = 0.1
log_pi_chosen, log_pi_rejected = -2.0, -3.5
log_ref_chosen, log_ref_rejected = -2.3, -3.0

log_ratio_chosen = log_pi_chosen - log_ref_chosen # -2.0 - (-2.3) = 0.3
log_ratio_rejected = log_pi_rejected - log_ref_rejected # -3.5 - (-3.0) = -0.5
margin = beta * (log_ratio_chosen - log_ratio_rejected) # 0.1 * (0.3 - (-0.5)) = 0.08
loss = -math.log(1 / (1 + math.exp(-margin))) # -log(sigmoid(0.08)) = 0.653
print(f"margin={margin:.2f}, loss={loss:.3f}") # margin=0.08, loss=0.653

margin=0.08, loss=0.653

Code Fragment 17.2.1: Numeric walkthrough of a single DPO loss evaluation

Key Insight

The DPO loss has an elegant interpretation: it pushes the policy to increase the log-probability of chosen responses (relative to the reference) while decreasing the log-probability of rejected responses. The reference model acts as an implicit anchor, playing the same role as the KL penalty in PPO. The β parameter controls how aggressively the policy deviates from the reference.

Real-World Scenario: DPO for Tone Alignment in a Corporate Assistant

Who: Product team at an enterprise software company

Situation: The company's internal AI assistant needed to match the corporate communication style: professional, concise, and avoiding casual language or humor in customer-facing drafts.

Problem: RLHF was too complex for the small ML team (three engineers), requiring a separate reward model, PPO infrastructure, and careful hyperparameter tuning across multiple models.

Dilemma: SFT on style-matched examples improved tone but introduced regressions in factual accuracy; the model learned surface patterns without internalizing the preference structure.

Decision: They chose DPO because it required only preference pairs (no reward model), reducing the pipeline from three models to one training loop.

How: The team collected 5,000 preference pairs where employees chose between two response drafts for the same prompt. They trained with beta=0.1 for 2 epochs using TRL's DPOTrainer.

Result: The DPO-aligned model achieved 87% preference win rate against the SFT baseline in blind evaluation, matching the quality of a comparable RLHF setup while requiring 60% less engineering effort and no reward model infrastructure.

Lesson: For teams without dedicated RL infrastructure, DPO delivers alignment quality comparable to RLHF with dramatically simpler implementation; the key investment is collecting high-quality preference pairs.

Figure 17.2.1 contrasts the RLHF and DPO pipelines, highlighting how DPO eliminates the reward model and PPO stages.

Fun Fact

DPO (Direct Preference Optimization) eliminated the need for a separate reward model by baking preference learning directly into the language model's training loss. It is like cutting out the middleman in a supply chain: fewer moving parts, fewer things to break. Code Fragment 17.2.6 shows this approach in practice.

Figure 17.2.2: RLHF requires training a separate reward model and running PPO with four models in GPU memory. DPO simplifies this to a single training stage with only two models (policy + frozen reference).

The following implementation (Code Fragment 17.2.6) shows this approach in practice.

# DPO Training with TRL
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

# Load SFT model as starting point
model = AutoModelForCausalLM.from_pretrained(
 "./sft-llama-8b-final",
 torch_dtype="bfloat16",
)
ref_model = AutoModelForCausalLM.from_pretrained(
 "./sft-llama-8b-final",
 torch_dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained("./sft-llama-8b-final")

# Load preference dataset
# Must have: prompt, chosen, rejected columns
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

print(f"Dataset size: {len(dataset)}")
print(f"Columns: {dataset.column_names}")
# prompt: the user query
# chosen: the preferred response
# rejected: the less-preferred response

# DPO training configuration
dpo_config = DPOConfig(
 output_dir="./dpo-llama-8b",
 beta=0.1, # KL penalty strength
 per_device_train_batch_size=2,
 gradient_accumulation_steps=16,
 learning_rate=5e-7, # small LR for stability
 num_train_epochs=1,
 max_length=2048,
 max_prompt_length=1024,
 warmup_ratio=0.1,
 logging_steps=10,
 bf16=True,
 loss_type="sigmoid", # standard DPO loss
 # Advanced options
 label_smoothing=0.0, # 0.1 can help with noisy preferences
 precompute_ref_log_probs=True, # saves memory
)

trainer = DPOTrainer(
 model=model,
 ref_model=ref_model,
 args=dpo_config,
 train_dataset=dataset,
 tokenizer=tokenizer, # use processing_class in TRL >= 0.14
)

trainer.train()
trainer.save_model("./dpo-llama-8b-final")

Dataset size: 61135
Columns: ['source', 'chosen', 'rejected', 'prompt', 'chosen_rating', 'rejected_rating']
{'train_loss': 0.5823, 'train_runtime': 2146.1, 'train_samples_per_second': 1.71}

Code Fragment 17.2.2: DPO Training with TRL

Key Insight

Why DPO avoids reward model training (and why that matters). The mathematical insight behind DPO is elegant: the optimal policy under the RLHF objective has a closed-form relationship to the reward function. This means you can reparameterize the reward in terms of the policy itself, eliminating the reward model entirely. In practice, this removes an entire stage of training (reward model fitting), eliminates reward hacking (where the policy exploits flaws in the reward model), and reduces the total compute budget by roughly 50%. The tradeoff is that DPO requires high-quality preference data where the chosen/rejected distinction is clear; it is less forgiving of noisy preferences than RLHF, which can learn a smooth reward function that tolerates annotation noise.

2. DPO Variants and Extensions

The success of DPO inspired a wave of variants, each addressing specific limitations. It is worth noting that DPO does not universally match RLHF quality: on complex tasks requiring long outputs or nuanced reasoning, PPO-based RLHF can still outperform DPO, likely because the separate reward model provides a richer optimization signal. The core differences among DPO variants lie in data requirements, loss formulations, and training dynamics.

2.1 KTO: Kahneman-Tversky Optimization

KTO (Ethayarajh et al., 2024) addresses a practical limitation of DPO: the requirement for paired preferences. In real applications, feedback often comes as binary signals (thumbs up or thumbs down) rather than pairwise comparisons. KTO works with unpaired binary feedback, using ideas from prospect theory to weight losses and gains asymmetrically. Code Fragment 17.2.5 demonstrates KTO training with TRL.

# KTO Training with TRL
from trl import KTOTrainer, KTOConfig

# KTO uses unpaired binary data
# Each example has: prompt, completion, label (True/False)
kto_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

print(f"Example: {kto_dataset[0]}")
# {'prompt': '...', 'completion': '...', 'label': True}

kto_config = KTOConfig(
 output_dir="./kto-llama-8b",
 beta=0.1,
 desirable_weight=1.0, # weight for positive examples
 undesirable_weight=1.0, # weight for negative examples
 per_device_train_batch_size=4,
 gradient_accumulation_steps=8,
 learning_rate=5e-7,
 num_train_epochs=1,
 max_length=2048,
 bf16=True,
)

kto_trainer = KTOTrainer(
 model=model,
 ref_model=ref_model,
 args=kto_config,
 train_dataset=kto_dataset,
 tokenizer=tokenizer,
)

kto_trainer.train()

Example: {'prompt': 'Write a short poem about spring.', 'completion': 'Blossoms unfurl...', 'label': True}
{'train_loss': 0.4917, 'train_runtime': 487.3}

Code Fragment 17.2.3: KTO Training with TRL

2.2 ORPO: Odds Ratio Preference Optimization

ORPO (Hong et al., 2024) eliminates the need for a separate reference model entirely. It combines the SFT objective with a preference optimization term in a single loss function. The key idea is to use the odds ratio of generating the chosen versus rejected response, contrasting them directly without a reference model baseline.

Note

ORPO's main advantage is memory efficiency. By removing the reference model, ORPO requires only a single model in GPU memory during training, making it practical for alignment of very large models on limited hardware. The tradeoff is that without a reference anchor, the optimization can be less stable than DPO for some tasks.

2.3 SimPO: Simple Preference Optimization

SimPO (Meng et al., 2024) also removes the reference model but takes a different approach. Instead of using log-probability ratios, SimPO uses the average log-probability of the response (normalized by length) as the implicit reward. It adds a target margin γ to the objective, encouraging a minimum quality gap between preferred and rejected responses.

2.4 IPO: Identity Preference Optimization

IPO (Azar et al., 2024) addresses a theoretical issue with DPO: under certain conditions, DPO can overfit to preference data, driving the log-probability ratio to infinity. IPO uses a squared loss instead of the sigmoid loss, providing better regularization properties and more stable training.

2.4 IPO: Identity Preference Optimization Comparison

Method	Reference Model	Data Format	Key Advantage	Key Limitation
DPO	Required (frozen)	Pairwise (chosen/rejected)	Well-studied, strong baselines	Needs paired data + reference model
KTO	Required (frozen)	Binary (good/bad)	Works with unpaired feedback	Less data-efficient than pairwise
ORPO	Not needed	Pairwise (chosen/rejected)	Single model, combined SFT+alignment	Can be less stable
SimPO	Not needed	Pairwise (chosen/rejected)	Length-normalized, margin-based	Newer, less extensively validated
IPO	Required (frozen)	Pairwise (chosen/rejected)	Prevents overfitting, squared loss	May underfit with limited data

3. Creating Preference Datasets

The quality of alignment training depends critically on the preference dataset. Creating high-quality preference data (using techniques from Section 13.2 on synthetic data pipelines) involves careful annotation design, quality control, and understanding of common pitfalls. Figure 17.2.3 outlines the preference data creation pipeline.

Figure 17.2.3: The preference data creation pipeline. High-quality datasets require diverse prompts, multiple response candidates, careful annotation with quality controls, and systematic filtering.

Tip: Annotation Best Practices

Clear guidelines: Define specific criteria for what makes a response "better" (accuracy, helpfulness, safety, conciseness)
Multiple annotators: Use at least 2-3 annotators per comparison to measure agreement
Calibration: Include known-answer items to detect annotator drift
Diversity: Ensure prompts span different tasks, difficulty levels, and domains
Margin filtering: Remove pairs where responses are nearly identical in quality (low signal-to-noise)

4. Synthetic Preference Generation

Human annotation is expensive and slow. A growing trend is to generate synthetic preference data using a stronger model (such as GPT-4 or Claude) as the judge. This approach, sometimes called "AI feedback" or RLAIF (Section 17.3), can produce large preference datasets at a fraction of the cost of human annotation. Code Fragment 17.2.4 demonstrates this LLM-as-judge approach for building synthetic preference datasets.

# Synthetic preference generation with LLM-as-judge
import openai
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PreferencePair:
 prompt: str
 chosen: str
 rejected: str
 judge_rationale: str

def generate_preference_pair(
 prompt: str,
 response_a: str,
 response_b: str,
 judge_model: str = "gpt-4o",
) -> PreferencePair:
 """Use a strong model to judge which response is better."""

 judge_prompt = f"""Compare these two responses to the given prompt.
Evaluate on: accuracy, helpfulness, clarity, and safety.
Return JSON with "winner" (A or B) and "rationale".

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}"""

 client = openai.OpenAI()
 result = client.chat.completions.create(
 model=judge_model,
 messages=[{"role": "user", "content": judge_prompt}],
 response_format={"type": "json_object"},
 temperature=0.0,
 )

 judgment = eval(result.choices[0].message.content)

 if judgment["winner"] == "A":
 return PreferencePair(prompt, response_a, response_b, judgment["rationale"])
 else:
 return PreferencePair(prompt, response_b, response_a, judgment["rationale"])

def build_synthetic_dataset(
 prompts: List[str],
 model_name: str = "meta-llama/Llama-3.1-8B-Instruct",
 samples_per_prompt: int = 4,
) -> List[PreferencePair]:
 """Build a preference dataset using rejection sampling + LLM judge."""
 import itertools

 pairs = []
 for prompt in prompts:
 # Generate multiple responses with different temperatures
 responses = []
 for temp in [0.3, 0.5, 0.7, 1.0]:
 response = generate_response(model_name, prompt, temperature=temp)
 responses.append(response)

 # Create all pairwise comparisons
 for a, b in itertools.combinations(responses, 2):
 pair = generate_preference_pair(prompt, a, b)
 pairs.append(pair)

 return pairs

Dataset size: 500
Columns: ['source', 'chosen', 'rejected', 'prompt', 'chosen_rating', 'rejected_rating']

Prompt: Write a C++ function to find the longest common subsequence of two input strings...

Chosen: Here is a C++ function that finds the longest common subsequence using dynamic prog...

Rejected: You can find the longest common subsequence by comparing each character one by one...

Code Fragment 17.2.4: Synthetic preference generation with LLM-as-judge

Warning

Synthetic preferences inherit the biases of the judge model. If the judge systematically prefers verbose responses, the trained model will learn to be verbose. Always validate synthetic data against a held-out set of human preferences, and consider using multiple judge models to reduce individual model bias.

5. Practical Considerations for DPO Training

5.1 Hyperparameter Sensitivity

DPO training is sensitive to several key hyperparameters. The most important is β, which controls the strength of the implicit KL constraint. A β that is too low leads to aggressive optimization that can degrade coherence. A β that is too high produces minimal change from the SFT model. Code Fragment 17.2.7 provides recommended hyperparameter ranges for DPO training.

# Hyperparameter sweep for DPO
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class DPOSweepConfig:
 """Configuration for DPO hyperparameter search."""
 beta_values: List[float] = None
 learning_rates: List[float] = None
 warmup_ratios: List[float] = None

 def __post_init__(self):
 self.beta_values = self.beta_values or [0.05, 0.1, 0.2, 0.5]
 self.learning_rates = self.learning_rates or [1e-7, 5e-7, 1e-6]
 self.warmup_ratios = self.warmup_ratios or [0.05, 0.1]

def evaluate_dpo_run(
 model_path: str,
 eval_dataset,
 metrics: List[str] = None,
) -> Dict[str, float]:
 """Evaluate a DPO checkpoint on standard metrics."""
 metrics = metrics or ["win_rate", "coherence", "kl_divergence"]
 results = {}

 # Win rate: how often the model's output is preferred
 # over the SFT baseline by an LLM judge
 results["win_rate"] = compute_win_rate(model_path, eval_dataset)

 # Coherence: perplexity on held-out text
 results["coherence"] = compute_perplexity(model_path, eval_dataset)

 # KL divergence from reference
 results["kl_divergence"] = compute_kl(model_path, eval_dataset)

 # Reward accuracy: agreement with held-out preferences
 results["reward_accuracy"] = compute_reward_accuracy(
 model_path, eval_dataset
 )

 return results

# Typical ranges for well-performing DPO
recommended_ranges = {
 "beta": "0.1 to 0.5 (start with 0.1)",
 "learning_rate": "1e-7 to 5e-6 (much lower than SFT)",
 "epochs": "1 to 3 (more can overfit)",
 "batch_size": "32 to 128 (larger is more stable)",
 "warmup_ratio": "0.05 to 0.15",
 "label_smoothing": "0.0 to 0.1 (helps with noisy data)",
}

Code Fragment 17.2.5: DPO hyperparameter sweep configuration. The most critical parameter is beta (KL penalty strength), which controls how far the policy deviates from the reference. The recommended ranges provide a starting point; monitoring the implicit reward margin during training helps identify the optimal settings.

Key Insight

The single most important signal during DPO training is the implicit reward margin: the gap between the model's log-probability ratio for chosen versus rejected responses. If this margin grows steadily and plateaus, training is healthy. If it grows without bound, the model is overfitting. If it barely moves, β is too high or the learning rate is too low. Monitor this metric alongside validation loss.

Note

When using DPO with LoRA (a common practical choice), set the LoRA rank higher than you would for SFT. DPO needs more capacity in the adapter to capture fine-grained preference distinctions. A rank of 64 to 128 is typical for DPO, compared to 8 to 32 for SFT.

6. Online and Iterative DPO

Standard DPO trains on a fixed, offline dataset of preference pairs. This creates a subtle but important limitation: the policy being optimized may drift into regions of the output space that the preference dataset does not cover, leading to uncertain or misleading reward signals. Online DPO and iterative DPO address this by generating fresh preference data from the model being trained, creating a tighter feedback loop between the policy and the preference signal.

6.1 Online DPO

In online DPO, the training loop alternates between generation and optimization. At each step, the current policy generates multiple candidate responses for a batch of prompts. These responses are then ranked (by a reward model, an LLM judge, or human annotators) to create fresh preference pairs. The DPO loss is computed on these on-policy pairs rather than stale offline data. This approach is more expensive per iteration but produces higher quality alignment because the preference signal always reflects the current model's behavior. Code Fragment 17.2.8 illustrates this online training loop.

# Online DPO conceptual loop
def online_dpo_step(policy, prompts, reward_model, beta=0.1):
 """One step of online DPO training."""
 # Step 1: Generate candidate responses from current policy
 candidates = []
 for prompt in prompts:
 responses = policy.generate(prompt, num_return_sequences=4)
 candidates.append((prompt, responses))

 # Step 2: Score responses with reward model
 preference_pairs = []
 for prompt, responses in candidates:
 scores = [reward_model.score(prompt, r) for r in responses]
 # Take best and worst as chosen/rejected
 best_idx = scores.index(max(scores))
 worst_idx = scores.index(min(scores))
 preference_pairs.append({
 "prompt": prompt,
 "chosen": responses[best_idx],
 "rejected": responses[worst_idx],
 })

 # Step 3: Compute DPO loss on fresh, on-policy data
 loss = compute_dpo_loss(policy, preference_pairs, beta=beta)
 return loss

Code Fragment 17.2.6: Online DPO training step. Unlike standard (offline) DPO which uses a static preference dataset, online DPO generates fresh responses from the current policy at each step, then collects preferences on these on-policy outputs. This reduces distribution mismatch between the training data and the evolving policy.

6.2 Iterative DPO

Iterative DPO takes a more practical middle ground. Instead of generating on-policy data at every gradient step (which is computationally expensive), it runs DPO in multiple rounds. After each round of DPO training, the improved model generates a new preference dataset, which is used for the next round. Typically three to five rounds are sufficient, with each round consisting of a standard offline DPO training run. Meta's Llama 3 training used iterative DPO (which they called "DPO with rejection sampling") to progressively improve alignment quality.

6.3 Mitigating Reward Model Overoptimization

A persistent challenge in preference optimization (whether RLHF or DPO) is reward overoptimization, also known as Goodhart's Law applied to language models. As the policy optimizes against a reward signal (explicit reward model or implicit DPO preferences), it eventually finds adversarial outputs that score highly on the reward metric but are actually low quality by human judgment. The model learns to exploit quirks in the reward signal rather than genuinely improving.

Several techniques mitigate this problem:

KL penalty calibration: The β parameter in DPO controls how far the policy can drift from the reference model. Higher β values prevent overoptimization at the cost of slower improvement. Monitoring the KL divergence during training and stopping when it exceeds a threshold (typically 5 to 15 nats) provides an early warning signal.
Reward model ensembles: Training multiple reward models on different data splits and averaging their scores makes it harder for the policy to exploit any single model's weaknesses. If the ensemble members disagree, the reward signal is unreliable and the policy should be penalized for that uncertainty.
Length penalties: Models often discover that longer responses receive higher reward scores (a common bias in reward models trained on human preferences). Explicitly normalizing rewards by response length or adding a length penalty prevents this exploit.
Periodic human evaluation: Automated metrics eventually become the target of optimization. Regularly sampling model outputs and evaluating them with human raters catches overoptimization that reward models miss. This is expensive but essential for high-stakes deployments.
Conservative optimization (CPO/RPO): Variants like Conservative DPO add pessimistic reward estimates, training the model to be conservative when the reward signal is uncertain. This sacrifices some peak performance for robustness against overoptimization.

Warning

Reward overoptimization is not a theoretical concern; it appears reliably in practice. Models trained with DPO for too many epochs often produce verbose, sycophantic outputs that score highly on automated reward metrics but frustrate real users. The best defense is a combination of early stopping based on KL divergence monitoring, held-out human evaluation, and iterative training with fresh preference data.

Self-Check

Q1: What is the key mathematical insight that enables DPO to eliminate the reward model?

Show Answer

The optimal policy under the RLHF objective has a closed-form relationship with the reward function: r(x,y) = β log(π(y|x)/π_ref(y|x)) + β log Z(x). When this is substituted into the Bradley-Terry preference model, the partition function Z(x) cancels out, allowing the loss to be written directly in terms of the policy and reference model log-probabilities.

Q2: How does KTO differ from DPO in terms of data requirements?

Show Answer

DPO requires pairwise preference data: for each prompt, you need both a chosen and a rejected response. KTO works with unpaired binary feedback, where each example is simply a (prompt, response, good/bad label) triple. This makes KTO practical when feedback comes as thumbs up/down signals rather than A/B comparisons.

Q3: What advantage do ORPO and SimPO have over DPO in terms of memory?

Show Answer

ORPO and SimPO eliminate the need for a separate reference model. DPO requires keeping a frozen copy of the SFT model in GPU memory alongside the trainable policy. ORPO and SimPO need only the single policy model, roughly halving memory requirements and making alignment of larger models feasible on limited hardware.

Q4: Why might synthetic preferences from LLM judges introduce systematic biases?

Show Answer

LLM judges have their own biases: they may prefer verbose responses, formal language, responses that agree with the prompt's framing, or outputs that match their own training distribution. These biases are transferred to the preference dataset and then amplified during training. Validating against human preferences and using multiple judge models can mitigate but not eliminate this issue.

Q5: What should you monitor to detect overfitting during DPO training?

Show Answer

Monitor the implicit reward margin (the gap in log-probability ratios between chosen and rejected responses). Healthy training shows steady growth that plateaus. If the margin grows without bound, the model is overfitting to the preference data. Also monitor validation loss, generation quality on held-out prompts, and the KL divergence from the reference model.

Key Insight

✅ Key Takeaways

DPO reparameterizes the RLHF objective to train directly on preference pairs, eliminating the reward model and RL training loop.
KTO extends the approach to binary (unpaired) feedback, making it practical when only thumbs up/down signals are available.
ORPO and SimPO further simplify the pipeline by removing the reference model, halving GPU memory requirements.
IPO addresses DPO's overfitting tendencies with a squared loss formulation that provides better regularization.
Preference data quality is the most important factor in alignment quality. Invest in annotation guidelines, inter-annotator agreement, and diversity.
Synthetic preferences from LLM judges can scale data creation but inherit judge biases. Always validate against human preferences.

Lab: Train a Model with DPO on Preference Data

Duration: ~60 minutes Advanced

Objective

Implement the DPO loss from scratch to understand the mathematics, then use TRL's DPOTrainer to fine-tune a small model on preference data and measure alignment improvement.

What You'll Practice

Preparing preference data in chosen/rejected pair format
Computing per-token log probabilities for response sequences
Implementing the DPO loss function from first principles
Using TRL's DPOTrainer for streamlined preference optimization

Setup

The following cell installs the required packages and configures the environment for this lab.

pip install transformers trl datasets peft accelerate torch

Code Fragment 17.2.7: This command installs transformers, trl, datasets, peft, accelerate, and torch for the DPO alignment lab. The trl library provides DPOTrainer and DPOConfig, while peft enables LoRA-based parameter-efficient training.

Steps

Step 1: Load and explore a preference dataset

Load a dataset containing prompt/chosen/rejected triples for preference learning.

# Load a preference dataset: each example has a prompt, a chosen
# (better) response, and a rejected (worse) response for DPO training.
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train_prefs")
dataset = dataset.shuffle(seed=42).select(range(500))

print(f"Dataset size: {len(dataset)}")
print(f"Columns: {dataset.column_names}")

example = dataset[0]
print(f"\nPrompt: {example['chosen'][0]['content'][:200]}")
print(f"\nChosen: {example['chosen'][1]['content'][:200]}")
print(f"\nRejected: {example['rejected'][1]['content'][:200]}")

Code Fragment 17.2.8: Load a preference dataset: each example has a prompt, a chosen

Hint

Each example has "chosen" and "rejected" columns containing message lists. The prompt is the user message; the response is the assistant message in each pair.

Step 2: Implement DPO loss from scratch

Build the core DPO loss to understand the math before using the library.

# DPO loss from scratch: compute log-probability ratios between
# chosen and rejected responses under the policy and reference models.
import torch
import torch.nn.functional as F

def compute_log_probs(model, tokenizer, text, device):
 """Compute sum of per-token log probabilities for a sequence."""
 inputs = tokenizer(text, return_tensors="pt",
 truncation=True, max_length=256).to(device)
 with torch.no_grad():
 outputs = model(**inputs)
 logits = outputs.logits[:, :-1, :]
 labels = inputs['input_ids'][:, 1:]
 log_probs = F.log_softmax(logits, dim=-1)
 token_lps = log_probs.gather(2, labels.unsqueeze(2)).squeeze(2)
 return token_lps.sum()

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
 """Compute the DPO loss from log probabilities."""
 # TODO: Implement:
 # log_ratio_chosen = pi_chosen - ref_chosen
 # log_ratio_rejected = pi_rejected - ref_rejected
 # loss = -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
 pass

# Test with known values
loss = dpo_loss(torch.tensor(-10.0), torch.tensor(-15.0),
 torch.tensor(-11.0), torch.tensor(-14.0), beta=0.1)
print(f"Test DPO loss: {loss.item():.4f}")

Test DPO loss: 0.6731

Code Fragment 17.2.9: DPO loss from scratch: compute log-probability ratios between

Hint

The DPO loss is: -F.logsigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))). When the policy correctly prefers chosen over rejected (more than the reference does), the loss is small.

Step 3: Run DPO training with TRL

Use DPOTrainer for a production-quality implementation with LoRA.

# Library shortcut: DPOTrainer with LoRA for memory-efficient
# preference optimization. Handles reference model copies internally.
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
 model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
 tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(r=16, lora_alpha=32,
 target_modules=["q_proj", "v_proj"],
 lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")

# TODO: Configure DPOConfig with beta=0.1, learning_rate=5e-5
training_args = DPOConfig(
 output_dir="./dpo-smollm2",
 num_train_epochs=1,
 per_device_train_batch_size=2,
 gradient_accumulation_steps=4,
 beta=0.1,
 learning_rate=5e-5,
 max_length=512,
 logging_steps=10,
 fp16=True,
 report_to="none",
)

trainer = DPOTrainer(model=model, args=training_args,
 train_dataset=dataset, processing_class=tokenizer,
 peft_config=peft_config)
trainer.train()

{'train_loss': 0.6214, 'train_runtime': 142.8, 'train_samples_per_second': 3.50}

Code Fragment 17.2.10: Library shortcut: DPOTrainer with LoRA for memory-efficient

Hint

DPOTrainer automatically creates a reference model copy internally. The beta parameter controls how much the policy is allowed to deviate from the reference; 0.1 is a common starting value.

Step 4: Evaluate alignment improvement

Measure how often the trained model prefers chosen over rejected responses.

# Evaluate alignment: check how often the DPO-trained model assigns
# higher log-probability to chosen responses vs. rejected ones.
eval_data = load_dataset("trl-lib/ultrafeedback_binarized", split="test_prefs")
eval_data = eval_data.shuffle(seed=42).select(range(20))

correct = 0
for ex in eval_data:
 prompt = ex['chosen'][0]['content']
 chosen_text = f"{prompt} {ex['chosen'][1]['content']}"
 rejected_text = f"{prompt} {ex['rejected'][1]['content']}"

 chosen_lp = compute_log_probs(model, tokenizer, chosen_text, model.device)
 rejected_lp = compute_log_probs(model, tokenizer, rejected_text, model.device)

 if chosen_lp > rejected_lp:
 correct += 1

print(f"Preference accuracy: {correct}/{len(eval_data)} = {correct/len(eval_data)*100:.1f}%")
print("(Random baseline: 50%)")

Preference accuracy: 13/20 = 65.0%
(Random baseline: 50%)

Code Fragment 17.2.11: Evaluate alignment: check how often the DPO-trained model assigns

Hint

After DPO training, the model should prefer the chosen response 60 to 70% of the time, up from the ~50% random baseline. Higher beta values make the model stick closer to the reference.

Expected Output

A working DPO loss implementation that produces sensible gradients
DPOTrainer completing with decreasing loss
Preference accuracy improving from ~50% to ~60 to 70%

Stretch Goals

Compare DPO with different beta values (0.01, 0.1, 0.5) and observe the effect on response style
Implement the IPO (Identity Preference Optimization) loss variant and compare with standard DPO
Create your own preference dataset by generating pairs from different models and labeling them

Complete Solution

# Complete DPO lab: load preferences, implement loss from scratch,
# train with TRL DPOTrainer, and evaluate alignment improvement.
import torch, torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig
from peft import LoraConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train_prefs")
dataset = dataset.shuffle(seed=42).select(range(500))

def compute_log_probs(model, tokenizer, text, device):
 inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256).to(device)
 with torch.no_grad():
 out = model(**inputs)
 logits = out.logits[:, :-1, :]
 labels = inputs['input_ids'][:, 1:]
 lps = F.log_softmax(logits, dim=-1).gather(2, labels.unsqueeze(2)).squeeze(2)
 return lps.sum()

def dpo_loss(pi_c, pi_r, ref_c, ref_r, beta=0.1):
 return -F.logsigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
 lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
args = DPOConfig(output_dir="./dpo-smollm2", num_train_epochs=1, per_device_train_batch_size=2,
 gradient_accumulation_steps=4, beta=0.1, learning_rate=5e-5, max_length=512,
 logging_steps=10, fp16=True, report_to="none")
trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
 processing_class=tokenizer, peft_config=peft_config)
trainer.train()

eval_data = load_dataset("trl-lib/ultrafeedback_binarized", split="test_prefs").shuffle(seed=42).select(range(20))
correct = sum(1 for ex in eval_data
 if compute_log_probs(model, tokenizer, f"{ex['chosen'][0]['content']} {ex['chosen'][1]['content']}", model.device)
 > compute_log_probs(model, tokenizer, f"{ex['chosen'][0]['content']} {ex['rejected'][1]['content']}", model.device))
print(f"Preference accuracy: {correct}/20 = {correct/20*100:.1f}%")

{'train_loss': 0.6214, 'train_runtime': 142.8, 'train_samples_per_second': 3.50}
Preference accuracy: 13/20 = 65.0%

Code Fragment 17.2.12: Complete DPO lab: load preferences, implement loss from scratch,

Research Frontier

DPO variants are rapidly expanding: IPO addresses overfitting to preference noise, KTO works with binary (good/bad) feedback instead of paired preferences, and ORPO eliminates the need for a separate reference model entirely. Research on online DPO (such as OAIF) generates preference pairs on the fly during training rather than using a static dataset, improving sample efficiency and reducing distribution mismatch.

The open challenge is understanding when DPO fails relative to RLHF, since theoretical analysis suggests DPO may struggle with preferences that require reasoning about latent reward structure.

Exercises

Exercise 17.2.1: DPO derivation intuition Conceptual

Explain the key insight behind DPO: how does it eliminate the need for a separate reward model? What is the mathematical relationship it exploits?

Answer Sketch

DPO exploits the fact that the optimal RLHF policy has a closed-form relationship with the reward function: r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)) + C. This means you can express the reward model loss directly in terms of the policy's log-probabilities, without ever training a separate reward model. DPO directly optimizes the policy on preference pairs using: loss = -log(sigmoid(beta * (log_pi(y_w|x) - log_pi_ref(y_w|x) - log_pi(y_l|x) + log_pi_ref(y_l|x)))).

Exercise 17.2.2: DPO training data Coding

Write code to prepare a preference dataset for DPO training. Each example should have: prompt, chosen response, and rejected response. Show how to format this for the TRL library.

Answer Sketch

Format each example as: {'prompt': 'How do I sort a list?', 'chosen': 'Use sorted(): sorted_list = sorted(my_list)', 'rejected': 'You can try maybe using a loop or something to sort it I guess'}. For TRL: create a Dataset with these columns. Pass to DPOTrainer(model=model, ref_model=ref_model, train_dataset=dataset, beta=0.1, args=DPOConfig(...)). The ref_model is a frozen copy of the model before DPO training.

Exercise 17.2.3: DPO variants comparison Analysis

Compare DPO, KTO, and ORPO. What limitation does each subsequent method address? When would you choose each?

Answer Sketch

DPO: requires paired preferences (chosen + rejected for same prompt). KTO (Kahneman-Tversky Optimization): works with unpaired data (just a label of 'good' or 'bad' per response), easier to collect. ORPO: integrates alignment into SFT training in a single stage, no separate reference model needed. Choose DPO when you have paired preference data. Choose KTO when you only have thumbs-up/thumbs-down labels. Choose ORPO for maximum simplicity (one training phase instead of two).

Exercise 17.2.4: Reference model management Coding

In DPO, the reference model (pi_ref) is typically a frozen copy of the SFT model. Explain why the reference model is necessary and write code showing how to set it up efficiently using model sharing.

Answer Sketch

The reference model prevents the policy from deviating too far from the SFT model (same purpose as the KL penalty in RLHF). Without it, the model could collapse to degenerate outputs that trivially satisfy the preference signal. Efficient setup: from trl import DPOTrainer; ref_model = AutoModelForCausalLM.from_pretrained('sft_model_path'). For memory efficiency with PEFT: share the base model and use the adapter-free version as ref: DPOTrainer(model=peft_model, ref_model=None) (TRL uses the base model as reference automatically when using LoRA).

Exercise 17.2.5: Preference data quality Conceptual

Explain why the quality of preference data matters more than quantity for DPO training. What are three common failure modes in preference data collection?

Answer Sketch

DPO directly learns from preference signals, so noisy or inconsistent preferences teach the model confused behavior. Failure modes: (1) Annotator disagreement: ambiguous pairs where reasonable people disagree produce noisy gradients. (2) Length bias: annotators prefer longer responses regardless of quality, teaching verbosity. (3) Position bias: annotators prefer whichever response is shown first. Mitigations: clear annotation guidelines, randomized presentation order, inter-annotator agreement filtering, and quality-focused (not volume-focused) collection.

What Comes Next

In the next section, Section 17.3: Constitutional AI & Self-Alignment, we examine Constitutional AI and self-alignment, where models learn to critique and improve their own outputs. The preference data formats used by DPO connect directly to the synthetic data generation principles in Section 13.1 and benefit from the data preparation workflows in Section 14.2.

References & Further Reading

Core DPO Paper

Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.

The paper that changed alignment training by showing the RLHF objective can be reparameterized to eliminate the reward model and RL loop entirely. Essential reading for anyone working on preference-based alignment.

📄 Paper

DPO Variants & Extensions

Azar, M. G., Rowland, M., Piot, B., et al. (2024). A General Theoretical Paradigm to Understand Learning from Human Feedback. AISTATS 2024.

Introduces IPO (Identity Preference Optimization) and provides a unified theoretical framework for understanding DPO and its variants. Addresses DPO's overfitting issues with a more robust squared loss formulation.

📄 Paper

Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., & Kiela, D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization. ICML 2024.

Eliminates the need for paired preferences by using only binary good/bad labels per response. Based on Kahneman and Tversky's prospect theory, making it practical when paired comparison data is scarce.

📄 Paper

Hong, J., Lee, N., & Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model. EMNLP 2024.

Combines SFT and preference optimization into a single training stage, eliminating the reference model entirely. Reduces training complexity and memory requirements while maintaining competitive alignment quality.

📄 Paper

Meng, Y., Xia, M., & Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward. NeurIPS 2024.

Uses average log probability as a length-normalized, reference-free reward signal. Achieves state-of-the-art results with simpler implementation than DPO. A strong default choice for practitioners.

📄 Paper

Applied DPO

Tunstall, L., Beeching, E., Lambert, N., et al. (2023). Zephyr: Direct Distillation of LM Alignment.

Demonstrates the full pipeline of distilling alignment from a larger model using DPO on synthetic preferences. The Zephyr recipe became a standard template for open-source alignment work.

📄 Paper