Part IV: Training and Adapting

Chapter 16: Knowledge Distillation & Model Merging

"The goal is to make the student not just mimic the teacher, but understand why the teacher makes the choices it does."

— Distill, Pedagogically Keen AI Agent
Figure 16.0.1: Distillation compresses a large model's knowledge into a smaller one, while merging combines the strengths of multiple experts into a single model.

Chapter Overview

Fine-tuning adapts an existing model to new tasks, but it is not the only way to create specialized models. Knowledge distillation transfers capabilities from a large "teacher" model into a smaller, faster "student" model, enabling deployment at a fraction of the cost. Model merging combines multiple fine-tuned models into a single model that inherits capabilities from all of them, without any additional training.
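To make the teacher–student idea concrete, here is a minimal NumPy sketch of the classic soft-label distillation objective: the student is trained to match the teacher's temperature-softened output distribution via a KL-divergence loss. The function names and the two-class example are illustrative, not from a specific library.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on T-softened distributions, scaled by T^2.

    A higher temperature T exposes the teacher's "dark knowledge":
    the relative probabilities it assigns to wrong answers.
    """
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    kl = (p_teacher * (np.log(p_teacher) - np.log(p_student))).sum(axis=-1)
    return float(kl.mean() * T ** 2)

# When the student already matches the teacher, the loss is ~0;
# any disagreement in the softened distributions makes it positive.
matched = distillation_loss(np.array([[1.0, 3.0]]), np.array([[1.0, 3.0]]))
mismatched = distillation_loss(np.array([[3.0, 0.0]]), np.array([[0.0, 3.0]]))
```

In practice this soft-label term is usually mixed with the ordinary cross-entropy loss on ground-truth labels, so the student learns both from the data and from the teacher's richer output distribution.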

These techniques have produced some of the most impressive results in the open-source LLM ecosystem. Microsoft's Phi models used distillation from GPT-4 to create small models that punch far above their weight, challenging conventional scaling laws. Community model merges on the Open LLM Leaderboard routinely outperform their constituent models. DeepSeek used distillation to create efficient reasoning models from their larger R1 teacher.

This chapter also covers continual learning: how to adapt models to new domains over time without catastrophically forgetting their general capabilities. By the end, you will understand the complete toolkit for creating, combining, and evolving specialized LLMs for production deployment.

Big Picture

Sometimes you need a smaller, faster model that retains the quality of a larger one. Knowledge distillation and model merging let you compress capabilities or combine specialized models, techniques that directly support the inference optimization goals of Chapter 9 and the production deployment patterns of Part VIII.
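The simplest merging recipe is uniform (or weighted) parameter averaging across fine-tuned checkpoints that share an architecture, the idea behind "model soups." The sketch below averages parameter dictionaries; the helper name and toy tensors are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def merge_weights(models, weights=None):
    """Merge parameter dicts by (weighted) averaging.

    models  : list of {param_name: ndarray} with identical keys/shapes
    weights : optional per-model mixing coefficients summing to 1
              (defaults to a uniform average)
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return {
        name: sum(w * m[name] for w, m in zip(weights, models))
        for name in models[0]
    }

# Two toy "fine-tuned" checkpoints of the same architecture:
math_expert = {"layer.weight": np.array([1.0, 2.0])}
code_expert = {"layer.weight": np.array([3.0, 4.0])}
merged = merge_weights([math_expert, code_expert])  # elementwise average
```

Averaging only works because the checkpoints descend from the same base model and so live in compatible regions of weight space; more sophisticated methods covered later (task arithmetic, SLERP, TIES) refine how the parameter deltas are combined.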

Learning Objectives

Prerequisites

Sections

What's Next?

In the next chapter, Chapter 17: Alignment, RLHF and DPO, we study alignment techniques (RLHF, DPO, Constitutional AI) that make LLMs helpful, harmless, and honest.