Chapter 17: Parameter-Efficient Fine-Tuning, Distillation & Model Merging

Chapter opener illustration: Parameter-Efficient Fine-Tuning.

"The best parameter is the one you don't have to train."
LoRA, Refreshingly Frugal AI Agent

Looking Back

Chapter 16 introduced fine-tuning. This chapter covers what most practitioners actually do instead: parameter-efficient fine-tuning (PEFT). It covers LoRA, QLoRA, IA³, prefix tuning, and prompt tuning, plus the merging techniques (DARE, TIES) that let you combine multiple adapters. PEFT is the reason fine-tuning a 70B model on one GPU is possible.

Chapter Overview

See Also

The deep treatment of catastrophic forgetting lives in Section 16.1. The discussion below focuses on how PEFT mitigates the same effect.

Full fine-tuning of a 7B parameter model requires about 14 GB just for the weights in FP16, plus gradients and Adam optimizer states that bring the total to roughly 56 GB before activations are counted. For most practitioners, this puts full fine-tuning out of reach without expensive multi-GPU setups. Parameter-efficient fine-tuning (PEFT) methods solve this problem by training only a tiny fraction of parameters (often less than 1%) while achieving quality that often rivals full fine-tuning on tasks close to the pretraining distribution. The match is empirical rather than guaranteed: Biderman et al. (2024) report that LoRA still trails full fine-tuning on programming and math benchmarks that demand substantial behavioral change, so test on your own task before assuming parity.
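The memory figures above can be reproduced with a few lines of arithmetic. The sketch below assumes every value (weights, gradients, and both Adam moment estimates) is stored in 16 bits and ignores activations; mixed-precision setups that keep FP32 master weights and optimizer states need considerably more. The function name and its parameters are illustrative, not taken from any library.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=2, adam_states=2):
    """Rough memory for weights, gradients, and Adam states (no activations)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    # Adam keeps two running statistics (first and second moment) per parameter.
    optimizer = n_params * bytes_per_value * adam_states
    gb = 1e9
    return {
        "weights_gb": weights / gb,
        "gradients_gb": gradients / gb,
        "optimizer_gb": optimizer / gb,
        "total_gb": (weights + gradients + optimizer) / gb,
    }


budget = full_finetune_memory_gb(7e9)
for name, value in budget.items():
    print(f"{name}: {value:.0f}")
```

Under these assumptions the weights account for 14 GB and the total for 56 GB, which is why a single 24 GB consumer card cannot hold a full fine-tuning run of a 7B model.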

This chapter covers the most important PEFT techniques in depth, starting with LoRA and QLoRA (the dominant methods in practice) and extending to newer approaches like DoRA, LoRA+, and adapter-based methods. You will learn not just the theory behind each method, but also how to configure hyperparameters, select target modules, and merge adapters for efficient deployment.

The final section surveys the rapidly evolving ecosystem of training platforms and tools, from Unsloth (which reports roughly 2x speedups with about half the memory) to configuration-driven frameworks like Axolotl and LLaMA-Factory. By the end of this chapter, you will be able to fine-tune a wide range of open-weight models on a single consumer GPU.

Big Picture

Full fine-tuning is expensive and often unnecessary. Parameter-efficient methods like LoRA and QLoRA let you adapt large models by training only a small fraction of their parameters, dramatically reducing compute costs. These techniques make fine-tuning accessible even on consumer hardware, a practical skill used throughout Parts V and VI.

Note: Learning Objectives

Explain the mathematical foundation of LoRA, including the low-rank decomposition W' = W + BA and why it works in transformer weight matrices
Configure LoRA hyperparameters (rank, alpha, target modules, dropout) for different model architectures and task types
Apply QLoRA with NF4 quantization, double quantization, and paged optimizers to fine-tune large models on consumer hardware
Compare advanced PEFT methods (DoRA, LoRA+, Prefix Tuning, IA³, adapters) and select the right one for a given scenario
Implement multi-adapter serving strategies using LoRAX or S-LoRA for production deployments
Use modern training platforms (Unsloth, Axolotl, LLaMA-Factory, torchtune, TRL) to streamline the fine-tuning workflow
Merge trained LoRA adapters back into base models and evaluate the merged result
Select appropriate cloud infrastructure (Colab, Lambda Labs, RunPod, Modal) based on budget and scale requirements
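As a preview of the first and seventh objectives, here is a toy illustration of the decomposition W' = W + BA in plain Python. The matrices are tiny integer examples chosen for readability, and the alpha/rank scaling factor that LoRA implementations apply to BA is omitted to match the formula as written above. The helper names matmul and add are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen pretrained weight W with d_out=3 and d_in=4 (toy values).
W = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 1, 0, 1]]
# Trainable low-rank factors with rank r=1: B is d_out x r, A is r x d_in.
B = [[1], [0], [2]]
A = [[1, 2, 0, 1]]

x = [[1], [2], [3], [4]]  # one input column vector

# Adapter path: base output plus the low-rank correction, W x + B (A x).
adapter_out = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged path: fold the update into the weight once, W' = W + B A.
W_merged = add(W, matmul(B, A))
merged_out = matmul(W_merged, x)

full_params = len(W) * len(W[0])                       # d_out * d_in
lora_params = len(B) * len(B[0]) + len(A) * len(A[0])  # r * (d_out + d_in)
print(adapter_out, merged_out, full_params, lora_params)
```

Both paths give the same output, which is why a trained adapter can be merged into the base weights without changing predictions. At toy scale the saving is small (7 trainable values against 12), but for a 4096 x 4096 projection at rank 16 the factors hold 131,072 values against 16,777,216 in the full matrix, under 1%.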

Prerequisites

Chapter 16: Fine-Tuning Fundamentals (supervised fine-tuning workflow, data preparation, evaluation)
Chapter 9: Inference Optimization (quantization basics, GPU memory concepts)
Chapter 3: The Transformer Architecture (attention mechanism, weight matrices)
Familiarity with PyTorch training loops and the Hugging Face Transformers library
Basic linear algebra (matrix multiplication, rank of a matrix)

Sections

Lab 17: Fine-Tune Llama-3.2-3B as a Writing-Style Mimic on Free Colab T4

Objective

Train a QLoRA adapter that makes Llama-3.2-3B write like a specific author (Hemingway, Asimov, or your own blog). The whole pipeline fits in a free Colab T4 (16 GB VRAM). At the end you will have a saved adapter you can swap onto the base model and demo to a friend, plus first-hand intuition for the LoRA hyperparameters this chapter covers.

Steps

Step 1: Build a style corpus. Collect 200 to 1000 short passages (100 to 500 words each) from your target author. Public-domain sources: Project Gutenberg for classic authors, your own blog/journal for personal style. Save as style.jsonl with {"text": "..."} per line. Convert to instruction format: {"instruction":"Continue in the author's style:", "input":"<first 50 words>", "output":"<next 200 words>"}. For passages shorter than 250 words, use whatever remains after the first 50 words as the output.
Step 2: Launch Colab + Unsloth. Open a free Colab T4 notebook. Install: pip install unsloth bitsandbytes trl datasets. Load unsloth/Llama-3.2-3B-Instruct-bnb-4bit with max_seq_length=2048. Confirm GPU has >14 GB free.
Step 3: Configure LoRA. Use FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"], lora_dropout=0). Verify with model.print_trainable_parameters(): you should see <1% of params trainable.
Step 4: Train. Use SFTTrainer with per_device_train_batch_size=2, gradient_accumulation_steps=4, learning_rate=2e-4, num_train_epochs=2, fp16=True (the T4 does not support bfloat16, so bf16=True will fail there; use bf16 only on Ampere or newer GPUs). Expect 30 to 60 minutes wall-clock. Watch the loss go from ~2.0 to ~1.4.
Step 5: Inference + side-by-side. Generate 5 completions from the same prompt with (a) base Llama-3.2-3B, (b) your fine-tuned adapter. Save to compare.html. The style shift should be obvious; if not, your data is too small or epochs too few.
Step 6: Save + share. model.save_pretrained("hemingway-lora") then push to Hugging Face Hub. Anyone can now load your style on top of Llama-3.2-3B in 3 lines. Bonus: try merging the adapter with model.merge_and_unload() for adapter-free deployment (Section 17.7).
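The conversion described in Step 1 can be sketched with the standard library alone. This is one possible implementation under stated assumptions: the function names and the output filename style_sft.jsonl are hypothetical, passages with 50 words or fewer are dropped because nothing is left to continue, and shorter passages simply get a shorter output.

```python
import json


def to_instruction_example(text, n_input=50, n_output=200):
    """Split a passage into a prompt prefix and a continuation target."""
    words = text.split()
    if len(words) <= n_input:
        return None  # too short to leave anything to continue
    return {
        "instruction": "Continue in the author's style:",
        "input": " ".join(words[:n_input]),
        "output": " ".join(words[n_input:n_input + n_output]),
    }


def convert_file(src_path, dst_path):
    """Read {"text": ...} JSONL and write instruction-format JSONL."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            example = to_instruction_example(json.loads(line)["text"])
            if example is not None:
                dst.write(json.dumps(example, ensure_ascii=False) + "\n")
                kept += 1
    return kept
```

Calling convert_file("style.jsonl", "style_sft.jsonl") returns the number of examples written, which is worth checking against the 200 to 1000 passages you collected before starting a training run.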

Expected Output

Expected time: 4 to 6 hours (a weekend afternoon). Difficulty: intermediate. Artifact: a Hugging Face Hub adapter + before/after comparison page.
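The before/after comparison page can be assembled without any extra dependencies. The sketch below assumes you have already collected the base and fine-tuned completions as two lists of strings; the function name and page layout are illustrative choices, not part of the lab's required tooling.

```python
import html


def build_comparison_page(prompt, base_outputs, tuned_outputs):
    """Render paired completions as a two-column HTML table."""
    rows = []
    for base, tuned in zip(base_outputs, tuned_outputs):
        # Escape model text so stray angle brackets cannot break the page.
        rows.append(
            "<tr><td>{}</td><td>{}</td></tr>".format(
                html.escape(base), html.escape(tuned)
            )
        )
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<title>Style comparison</title></head><body>\n"
        "<h1>Prompt</h1>\n<p>{}</p>\n"
        "<table border=\"1\">\n"
        "<tr><th>Base model</th><th>Fine-tuned adapter</th></tr>\n"
        "{}\n</table>\n</body></html>\n"
    ).format(html.escape(prompt), "\n".join(rows))


page = build_comparison_page(
    "The old man looked at the sea.",
    ["Base completion <1>"],
    ["Tuned completion <1>"],
)
```

Writing the returned string to compare.html with open("compare.html", "w", encoding="utf-8") produces a file you can open locally or attach alongside the adapter on the Hub.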

What's Next?

Next: Chapter 18: Alignment, RLHF, DPO & Preference Tuning. SFT and PEFT teach a model how to behave on examples you label. Alignment asks the harder question: how do you teach it human preferences when "the right answer" is fuzzy or absent? Chapter 18 walks through RLHF (PPO loop, KL penalty, reward model), DPO (the trick that eliminated the reward model), GRPO (DeepSeek-R1's reasoning recipe), and how practice in 2025-26 settled on which method to use when.