Part IV: Training and Adapting

Chapter 17: Alignment: RLHF, DPO & Preference Tuning

"The question of machine intelligence is not whether machines can think, but whether machines can be taught to care about the right things."

Reward Reward, Philosophically Wired AI Agent
Figure 17.0.1: Human preferences steer model behavior: judges score outputs, and the model gradually learns to be helpful, harmless, and honest.

Chapter Overview

Pretraining and supervised fine-tuning produce capable language models, but raw capability is not the same as usefulness or safety. Alignment is the process of steering an LLM's behavior so that it follows instructions, produces helpful responses, avoids harmful outputs, and generally reflects human preferences. Without alignment, even the most powerful base model may generate toxic, incoherent, or off-topic text.

This chapter covers the full landscape of preference-based alignment methods. It begins with RLHF, the technique that powered ChatGPT's breakthrough, walking through the three-stage pipeline of supervised fine-tuning, reward modeling, and Proximal Policy Optimization (PPO). It then explores modern alternatives like Direct Preference Optimization (DPO) that eliminate the need for a separate reward model, Constitutional AI for scalable self-alignment, and Reinforcement Learning with Verifiable Rewards (RLVR) for training reasoning capabilities. These techniques build on the parameter-efficient methods covered earlier and connect directly to the production safety considerations explored later in the book.
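To preview DPO's central idea before the detailed sections: instead of training a separate reward model, DPO derives an implicit reward from how much the policy's log-probabilities diverge from a frozen reference model, and pushes the chosen response's implicit reward above the rejected one's. The sketch below computes that per-pair loss from scalar log-probabilities; the function name and interface are illustrative, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model; beta controls how
    far the policy is allowed to drift from the reference.
    """
    # Implicit rewards: log-ratio of policy to reference, scaled by beta
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin

# When policy and reference agree exactly, the margin is 0 and the
# loss is ln(2); preferring the chosen response drives the loss lower.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 4))  # 0.6931
print(round(dpo_loss(-10.0, -15.0, -12.0, -14.0), 4))  # 0.5544
```

In a real training loop these log-probabilities are computed per token and summed over each response, and the loss is averaged over a batch; libraries such as TRL wrap exactly this computation.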

By the end of this chapter, you will understand the theoretical foundations and practical engineering of each alignment family, know when to choose one method over another, and be able to implement preference tuning pipelines using current open-source tooling.

Big Picture

Alignment is what separates a raw language model from a helpful, harmless assistant. This chapter covers RLHF, DPO, and Constitutional AI, the techniques that shaped models like ChatGPT and Claude. Understanding alignment is essential for the safety discussions in Chapter 32 and for anyone building user-facing AI systems.

Learning Objectives

Prerequisites

Sections

What's Next?

In the next part, Part V: Retrieval and Conversation, we connect LLMs to external knowledge and conversation through embeddings, RAG, and dialogue systems.