Parameter-Efficient Fine-Tuning, Distillation & Model Merging

Chapter opener illustration: Parameter-Efficient Fine-Tuning.

"The best parameter is the one you don't have to train."

LoRA, Refreshingly Frugal AI Agent
Looking Back

Chapter 16 introduced fine-tuning. This chapter covers what most practitioners actually do instead: parameter-efficient fine-tuning (PEFT). It walks through LoRA, QLoRA, IA3, prefix tuning, and prompt tuning, plus the merging techniques (DARE, TIES) that let you combine multiple adapters. PEFT is the reason fine-tuning a 70B model on one GPU is possible.

Chapter Overview

See Also

The deep treatment of catastrophic forgetting lives in Section 16.1. The discussion below focuses on how PEFT mitigates the same effect.

Full fine-tuning of a 7B parameter model requires about 14 GB just for the weights in FP16, plus gradients and optimizer states that push the total past 56 GB before activations are counted. For most practitioners, this puts full fine-tuning out of reach without expensive multi-GPU setups. Parameter-efficient fine-tuning (PEFT) methods solve this problem by training only a tiny fraction of parameters (often less than 1%) while achieving quality that often rivals full fine-tuning on tasks close to the pretraining distribution. The match is empirical rather than guaranteed: Biderman et al. (2024) report that LoRA still trails full fine-tuning on programming and math benchmarks that demand substantial behavioral change, so test on your own task before assuming parity.
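The memory figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a profiler: it assumes 2 bytes per weight and per gradient, Adam with two moments per trainable parameter, and it ignores activations, framework overhead, and fragmentation. The function name and its parameters are illustrative.

```python
def training_memory_gb(n_params, trainable_fraction=1.0,
                       weight_bytes=2, grad_bytes=2, optim_bytes=4):
    """Rough memory (GB, 1e9 bytes) for weights + gradients + optimizer state.

    Assumption: only the trainable parameters need gradients and
    optimizer state; activations are deliberately left out.
    """
    weights = n_params * weight_bytes
    trainable = n_params * trainable_fraction
    train_state = trainable * (grad_bytes + optim_bytes)
    return (weights + train_state) / 1e9


N = 7_000_000_000  # a 7B-parameter model

weights_gb = N * 2 / 1e9                         # FP16 weights only
full_ft_gb = training_memory_gb(N)               # two 2-byte Adam moments
mixed_gb = training_memory_gb(N, optim_bytes=12) # FP32 master copy + FP32 moments
lora_gb = training_memory_gb(N, trainable_fraction=0.01)  # assume 1% trainable

print(f"weights only:          {weights_gb:.1f} GB")  # 14.0 GB
print(f"full FT (16-bit Adam): {full_ft_gb:.1f} GB")  # 56.0 GB
print(f"full FT (mixed prec.): {mixed_gb:.1f} GB")    # 112.0 GB
print(f"PEFT, 1% trainable:    {lora_gb:.1f} GB")     # 14.4 GB
```

The gap between the second and fourth lines is the whole argument for PEFT: the frozen weights still have to fit, but the gradient and optimizer cost nearly disappears. Quantizing the frozen weights to 4 bits, as QLoRA does, shrinks the remaining 14 GB further.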

This chapter covers the most important PEFT techniques in depth, starting with LoRA and QLoRA (the dominant methods in practice) and extending to newer approaches like DoRA, LoRA+, and adapter-based methods. You will learn not just the theory behind each method, but also how to configure hyperparameters, select target modules, and merge adapters for efficient deployment.

The final section surveys the rapidly evolving ecosystem of training platforms and tools, from Unsloth (which advertises roughly 2x faster training with about half the memory) to config-driven open-source frameworks like Axolotl and LLaMA-Factory. By the end of this chapter, you will be able to fine-tune most small and mid-sized open-weight models on a single consumer GPU.

Big Picture

Full fine-tuning is expensive and often unnecessary. Parameter-efficient methods like LoRA and QLoRA let you adapt large models by training only a small fraction of their parameters, dramatically reducing compute costs. These techniques make fine-tuning accessible even on consumer hardware, a practical skill used throughout Parts V and VI.
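To make "a small fraction of their parameters" concrete, here is a minimal NumPy sketch of the low-rank update behind LoRA for a single weight matrix. The dimensions, rank, and scaling value are illustrative assumptions. B is filled with random values to stand in for a trained adapter (in LoRA it starts at zero so the adapter is initially a no-op).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 2048, 2048, 8, 16    # illustrative sizes

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = rng.standard_normal((d_out, r)) * 0.01   # trainable up-projection (stand-in for trained values)
scale = alpha / r


def lora_forward(x):
    """Frozen path plus the scaled low-rank path."""
    return x @ W.T + scale * (x @ A.T) @ B.T


# Merging folds the adapter into the base weight, so inference
# needs no extra matrix multiplications.
W_merged = W + scale * (B @ A)

x = rng.standard_normal((4, d_in))
adapter_out = lora_forward(x)
merged_out = x @ W_merged.T

trainable_fraction = (A.size + B.size) / W.size
print(np.allclose(adapter_out, merged_out))          # True
print(f"trainable fraction: {trainable_fraction:.2%}")  # 0.78%
```

Only A and B receive gradients, which is 32,768 values against roughly 4.2 million in W. The equality check shows why a merged adapter adds no inference latency.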

Note: Learning Objectives

Prerequisites

Sections

Lab 17: Fine-Tune Llama-3.2-3B as a Writing-Style Mimic on Free Colab T4

Objective

Train a QLoRA adapter that makes Llama-3.2-3B write like a specific author (Hemingway, Asimov, or your own blog). The whole pipeline fits in a free Colab T4 (16 GB VRAM). At the end you will have a saved adapter you can swap onto the base model and demo to a friend, plus first-hand intuition for the LoRA hyperparameters this chapter covers.

Steps

  1. Step 1: Build a style corpus. Collect 200 to 1000 short passages (100 to 500 words each) from your target author. Public-domain sources include Project Gutenberg for classic authors; for a personal style, use your own blog or journal. Save the passages as style.jsonl with one {"text": "..."} object per line, then convert each to instruction format: {"instruction":"Continue in the author's style:", "input":"<first 50 words>", "output":"<next 200 words>"}. Passages shorter than 250 words simply yield a shorter output.
  2. Step 2: Launch Colab + Unsloth. Open a free Colab T4 notebook. Install: pip install unsloth bitsandbytes trl datasets. Load unsloth/Llama-3.2-3B-Instruct-bnb-4bit with max_seq_length=2048. Confirm GPU has >14 GB free.
  3. Step 3: Configure LoRA. Call FastLanguageModel.get_peft_model(model, r=16, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"], lora_dropout=0), passing the model loaded in Step 2 as the first argument. Verify with model.print_trainable_parameters(): you should see <1% of params trainable.
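  4. Step 4: Train. Use SFTTrainer with per_device_train_batch_size=2, gradient_accumulation_steps=4 (an effective batch size of 8), learning_rate=2e-4, num_train_epochs=2, fp16=True. The T4 does not support bfloat16, so reserve bf16=True for Ampere or newer GPUs. Expect 30 to 60 minutes wall-clock. Watch the loss go from ~2.0 to ~1.4.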
  5. Step 5: Inference + side-by-side. Generate 5 completions from the same prompt with (a) base Llama-3.2-3B, (b) your fine-tuned adapter. Save to compare.html. The style shift should be obvious; if not, your data is too small or epochs too few.
  6. Step 6: Save + share. model.save_pretrained("hemingway-lora") then push to Hugging Face Hub. Anyone can now load your style on top of Llama-3.2-3B in 3 lines. Bonus: try merging the adapter with model.merge_and_unload() for adapter-free deployment (Section 17.7).
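The conversion in Step 1 can be sketched in plain Python. This is one possible implementation, not the lab's reference solution: the helper names, the output filename, and the choice to drop passages that have no words left after the prompt are all assumptions. Splitting on whitespace also collapses line breaks, which is usually acceptable for prose.

```python
import json

INSTRUCTION = "Continue in the author's style:"


def to_instruction_record(text, prompt_words=50, output_words=200):
    """Split one passage into a prompt (first words) and a continuation."""
    words = text.split()
    if len(words) <= prompt_words:
        return None  # nothing left to use as the target; skip the passage
    return {
        "instruction": INSTRUCTION,
        "input": " ".join(words[:prompt_words]),
        "output": " ".join(words[prompt_words:prompt_words + output_words]),
    }


def convert_file(src_path, dst_path):
    """Read {"text": ...} lines and write instruction-format lines."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            record = to_instruction_record(json.loads(line)["text"])
            if record is not None:
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept


# Small self-check on a synthetic 300-word passage.
sample = " ".join(f"w{i}" for i in range(300))
demo = to_instruction_record(sample)
print(len(demo["input"].split()), len(demo["output"].split()))  # 50 200
```

On the real corpus, a call such as convert_file("style.jsonl", "style_instruct.jsonl") would produce the training file; the second filename is hypothetical.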

Expected Output

Expected time: 4 to 6 hours (a weekend afternoon). Difficulty: intermediate. Artifact: a Hugging Face Hub adapter + before/after comparison page.
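As a sanity check on the "<1% of params trainable" expectation in Step 3, the adapter size can be estimated by hand. The sketch assumes the Llama-3.2-3B dimensions listed in the constants (verify them against the model's config.json) and one rank-16 A/B pair per targeted projection. The count reported by print_trainable_parameters() on a 4-bit model may differ in its denominator.

```python
# Assumed Llama-3.2-3B dimensions; confirm against config.json.
HIDDEN, INTERMEDIATE, LAYERS = 3072, 8192, 28
KV_DIM = 8 * 128        # 8 key/value heads of size 128 (grouped-query attention)
VOCAB = 128_256         # embeddings assumed tied with the output head
R = 16                  # LoRA rank from Step 3

# (in_features, out_features) for each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each adapter adds A (R x in) and B (out x R).
lora_params = LAYERS * sum(R * (i + o) for i, o in shapes.values())

# Base model: embeddings + per-layer projections and two norms + final norm.
per_layer = sum(i * o for i, o in shapes.values()) + 2 * HIDDEN
base_params = VOCAB * HIDDEN + LAYERS * per_layer + HIDDEN

fraction = lora_params / (base_params + lora_params)
print(f"{lora_params:,} trainable of {base_params + lora_params:,}")
print(f"trainable fraction: {fraction:.2%}")
```

Under these assumptions the adapter holds about 24 million parameters, comfortably below 1% of the model. Doubling the rank doubles the adapter size, which is a quick way to reason about r before launching a run.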

What's Next?

Next: Chapter 18: Alignment, RLHF, DPO & Preference Tuning. SFT and PEFT teach a model how to behave on examples you label. Alignment asks the harder question: how do you teach it human preferences when "the right answer" is fuzzy or absent? Chapter 18 walks through RLHF (PPO loop, KL penalty, reward model), DPO (the trick that eliminates the separate reward model), GRPO (DeepSeek-R1's reasoning recipe), and how practice in 2025-26 settled on when to use each.