Chapter 18: Alignment: RLHF, DPO & Preference Tuning

Chapter opener illustration: Alignment: RLHF.

"The question of machine intelligence is not whether machines can think, but whether machines can be taught to care about the right things."
Reward, Philosophically Wired AI Agent

Looking Back

Fine-tuning (Chapter 16) and PEFT teach a model new capabilities. Alignment teaches it new preferences. This chapter is the alignment buffet of RLHF, DPO, IPO, KTO, and ORPO: how preference data is collected, how the algorithms differ, and which one you reach for depending on your data and compute budget.

Chapter Overview

GPT-3 in 2020 could write a passable essay. GPT-3 also cheerfully wrote phishing emails, racial slurs, and instructions for synthesizing nerve agents. The most visible difference between that GPT-3 and the ChatGPT that broke the internet two years later was alignment: RLHF layered on top of instruction tuning, with continued pretraining contributing too. RLHF, Constitutional AI, and the DPO family of methods turned raw next-token predictors into assistants that usually decline to help with the nerve agents and politely answer the essay question; jailbreak research (Chapter 47) shows the refusals are a strong default rather than a guarantee. This chapter covers how that transformation actually works and which algorithm to reach for in 2026.

This chapter covers the full landscape of preference-based alignment methods. It begins with RLHF, the technique that powered ChatGPT's breakthrough, walking through the three-stage pipeline of supervised fine-tuning, reward modeling, and proximal policy optimization. It then explores modern alternatives like Direct Preference Optimization (DPO) that eliminate the need for a separate reward model, Constitutional AI for scalable self-alignment, and Reinforcement Learning with Verifiable Rewards (RLVR) for training reasoning capabilities. These techniques build on the parameter-efficient methods covered earlier and connect directly to the production safety considerations explored later in the book.

By the end of this chapter, you will understand the theoretical foundations and practical engineering of each alignment family, know when to choose one method over another, and be able to implement preference tuning pipelines using current open-source tooling.

Big Picture

Alignment is what separates a raw language model from a helpful, harmless assistant. This chapter covers RLHF, DPO, and Constitutional AI, the techniques that shaped models like ChatGPT and Claude. Understanding alignment is essential for the safety discussions in Chapter 37 and for anyone building user-facing AI systems.

Note: Learning Objectives

Explain the three-stage RLHF pipeline (SFT, reward model training, PPO) and the role of each component
Describe how the Bradley-Terry preference model converts pairwise comparisons into a scalar reward signal
Derive the DPO objective from the RLHF formulation and explain why it eliminates the reward model
Compare DPO, KTO, ORPO, SimPO, and IPO in terms of data requirements, training stability, and performance
Implement preference tuning pipelines using TRL, building on PEFT techniques, including dataset preparation and hyperparameter selection
Explain Constitutional AI and RLAIF as approaches to scalable, principle-based alignment, connecting to safety in production
Describe how RLVR uses verifiable rewards (math, code correctness) to train reasoning without human labels
Analyze the GRPO algorithm and its role in DeepSeek-R1 and similar reasoning-focused models

Prerequisites

Chapter 16: Fine-tuning Foundations (SFT, LoRA, training loops)
Chapter 6: Pretraining & Scaling Laws (pretraining objectives, optimizers, training dynamics)
Chapter 7: Modern LLM Landscape (frontier model families, instruction-tuned variants)
Reinforcement learning foundations from Chapter 0 (policy, reward, optimization; section 0.5 covers RL)
Familiarity with PyTorch training loops and the Hugging Face ecosystem

Lab 18: Run DPO on Llama-3.2-1B With an UltraFeedback Subset

Objective

Fine-tune Llama-3.2-1B with Direct Preference Optimization on 2,000 preference pairs from UltraFeedback. By the end you will have a measurably more helpful model, traces of the reward margins, and a feel for the difference between SFT and preference tuning.

Steps

Step 1: Load preference data. Call load_dataset("HuggingFaceH4/ultrafeedback_binarized") and take 2,000 random samples, keeping the chosen and rejected fields. Inspect 5 by hand to confirm the chosen response really is better.
Step 2: SFT warm-up (optional but recommended). First run SFT on the chosen responses for 1 epoch (Section 18.1). This stabilizes DPO; skipping it often produces a flat loss curve.
Step 3: Configure DPO. Use trl.DPOTrainer with beta=0.1, learning_rate=5e-7 (much smaller than SFT), per_device_train_batch_size=2, num_train_epochs=1. Verify trainable params = full model (DPO does full fine-tuning by default; combine with LoRA to keep it cheap).
Step 4: Train + log reward margins. Train for roughly 30 to 90 minutes on a T4. The key metric is rewards/margins (chosen reward minus rejected reward): it should rise above zero and keep growing over training.
Step 5: Pairwise eval vs. baseline. On 100 held-out prompts, generate from both the SFT-only and the DPO model. Use GPT-4o-mini as a judge (preview of Chapter 46): which is more helpful? Win rate should be >55% for DPO.
Step 6: Ablate beta. Re-run with beta=0.5 (a stronger KL penalty, so the policy stays closer to the reference model). Compare win rate and KL divergence from the reference model against the beta=0.1 run. This builds intuition for the alignment-tax tradeoff.
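
To build intuition for the rewards/margins metric in Step 4 and the beta ablation in Step 6, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed log-probabilities of each response under the policy and under the frozen reference model have already been computed. The helper name dpo_pair_stats and all numeric log-probabilities are made up for illustration and are not part of TRL.

```python
import math


def dpo_pair_stats(policy_chosen_logp, policy_rejected_logp,
                   ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO quantities from summed sequence log-probabilities.

    Hypothetical helper for illustration; not a TRL function.
    """
    # Implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x)).
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # DPO loss is -log(sigmoid(margin)) = softplus(-margin), computed stably.
    x = -margin
    loss = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return {
        "chosen_reward": chosen_reward,
        "rejected_reward": rejected_reward,
        "margin": margin,
        "loss": loss,
    }


# Made-up log-probabilities for a single preference pair.
REF_CHOSEN, REF_REJECTED = -45.0, -50.0

# Before any update the policy equals the reference: margin 0, loss ln(2).
start = dpo_pair_stats(REF_CHOSEN, REF_REJECTED, REF_CHOSEN, REF_REJECTED)

# After some training the policy favors the chosen response more than
# the reference does, so the margin turns positive and the loss drops.
later = dpo_pair_stats(-42.0, -55.0, REF_CHOSEN, REF_REJECTED, beta=0.1)

# Same log-probabilities scored with the larger beta from Step 6.
later_high_beta = dpo_pair_stats(-42.0, -55.0, REF_CHOSEN, REF_REJECTED, beta=0.5)

for name, stats in [("start", start), ("beta=0.1", later), ("beta=0.5", later_high_beta)]:
    print(f"{name:>9}: margin={stats['margin']:.3f} loss={stats['loss']:.4f}")
```

Two things are worth noticing in this sketch. When the policy still equals the reference, the margin is zero and the loss is ln 2 (about 0.693), so a loss curve that stays near that value suggests the policy has not moved. And for the same shift in log-probabilities, a larger beta yields a larger margin, so the loss is satisfied with a smaller departure from the reference, which is one way to read why a higher beta keeps the policy closer to it.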

Expected Output

Expected time: 4 to 5 hours. Difficulty: advanced. Artifact: a DPO-tuned model + win-rate report vs. SFT baseline.
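
The win-rate report from Step 5 can be summarized with a few lines of plain Python. The sketch below assumes the judge verdicts have already been collected as a list of the strings "dpo", "sft", or "tie"; that encoding, the helper name win_rate_report, the choice to exclude ties, and the Wilson score interval are all illustrative assumptions rather than part of the lab specification.

```python
import math


def win_rate_report(verdicts, z=1.96):
    """Summarize pairwise judge verdicts ("dpo", "sft", or "tie").

    Hypothetical helper. Ties are excluded from the win rate (an assumption),
    and the interval is a Wilson score interval at roughly 95% for z=1.96.
    """
    wins = verdicts.count("dpo")
    losses = verdicts.count("sft")
    n = wins + losses
    if n == 0:
        raise ValueError("no decisive verdicts to summarize")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return {
        "wins": wins,
        "losses": losses,
        "ties": len(verdicts) - n,
        "win_rate": p,
        "ci_low": center - half,
        "ci_high": center + half,
    }


# Made-up verdicts for 100 held-out prompts: 58 DPO wins, 42 SFT wins.
report = win_rate_report(["dpo"] * 58 + ["sft"] * 42)
print(f"win rate {report['win_rate']:.2f} "
      f"(95% CI {report['ci_low']:.2f} to {report['ci_high']:.2f})")
```

With these made-up counts the interval runs from roughly 0.48 to 0.67, so a 58% win rate on 100 prompts does not by itself rule out a coin flip. Reporting the interval alongside the point estimate makes that visible in the artifact.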

What's Next?

Next: Chapter 19: Tools of the Trade, Training & Adaptation Stack. Chapter 19 closes Part IV with the consolidated training-and-adaptation toolbox: Axolotl, TRL, torchtune, Unsloth, Hugging Face PEFT, LLaMA-Factory, the preference-dataset zoo, and the recipe configs the open community settled on after DeepSeek-R1. Then Part V breaks the text-only frame entirely, jumping into audio, vision, document understanding, 3D scenes, and embodied vision-language-action models.