
"The best parameter is the one you don't have to train."
LoRA, Refreshingly Frugal AI Agent
Chapter 16 introduces fine-tuning. This chapter covers what most practitioners actually do instead: parameter-efficient fine-tuning (PEFT). It covers LoRA, QLoRA, IA3, prefix tuning, prompt tuning, and the merging tricks (DARE, TIES) that let you combine multiple adapters. PEFT is the reason fine-tuning a 70B model on one GPU is possible.
Chapter Overview
The deep treatment of catastrophic forgetting lives in Section 16.1. The discussion below focuses on how PEFT mitigates the same effect.
Full fine-tuning of a 7B-parameter model requires about 14 GB just for the weights in FP16, and gradients plus optimizer states push the total past 56 GB. For most practitioners, this puts full fine-tuning out of reach without expensive multi-GPU setups. Parameter-efficient fine-tuning (PEFT) methods address this problem by training only a tiny fraction of the parameters (often less than 1%) while reaching quality that frequently rivals full fine-tuning on tasks close to the pretraining distribution. The match is empirical rather than guaranteed: Biderman et al. (2024) report that LoRA still trails full fine-tuning on programming and math benchmarks that demand substantial behavioral change, so test on your own task before assuming parity.
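The 14 GB and 56 GB figures can be reproduced with back-of-the-envelope arithmetic. The sketch below is one possible accounting, not a measurement: it assumes 2 bytes per weight, 2 bytes per gradient, and two Adam moments at 2 bytes each, and it ignores activations and framework overhead. The function name and its parameters are illustrative. Mixed-precision setups that keep FP32 master weights and FP32 moments would come out considerably higher.

```python
GB = 1e9  # decimal gigabytes, so 7B params x 2 bytes = 14 GB

def training_memory_gb(n_params, trainable_fraction=1.0,
                       weight_bytes=2, grad_bytes=2, adam_bytes=4):
    """Weights + gradients + Adam moments; activations are ignored.

    Assumption: adam_bytes=4 means two moments stored at 2 bytes each.
    """
    trainable = n_params * trainable_fraction
    weights = n_params * weight_bytes   # every weight must stay resident
    grads = trainable * grad_bytes      # gradients only for trainable params
    adam = trainable * adam_bytes       # optimizer state only for trainable params
    return (weights + grads + adam) / GB

full = training_memory_gb(7_000_000_000)
peft = training_memory_gb(7_000_000_000, trainable_fraction=0.01)

print(f"full fine-tuning: {full:.1f} GB")  # 56.0 GB under these assumptions
print(f"1% trainable:     {peft:.1f} GB")  # 14.4 GB under these assumptions
```

Under this accounting, training 1% of the parameters leaves the frozen weights as almost the entire footprint, which is why quantizing those frozen weights (as QLoRA does) is the natural next saving.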
This chapter covers the most important PEFT techniques in depth, starting with LoRA and QLoRA (the dominant methods in practice) and extending to newer approaches like DoRA, LoRA+, and adapter-based methods. You will learn not just the theory behind each method, but also how to configure hyperparameters, select target modules, and merge adapters for efficient deployment.
The section on training platforms surveys the rapidly evolving ecosystem of tools, from Unsloth (which advertises roughly 2x speedups with about half the memory) to open-source frameworks like Axolotl and LLaMA-Factory. By the end of this chapter, you will be able to fine-tune open-weight models on a single consumer GPU.
Full fine-tuning is expensive and often unnecessary. Parameter-efficient methods like LoRA and QLoRA let you adapt large models by training only a small fraction of their parameters, dramatically reducing compute costs. These techniques make fine-tuning accessible even on consumer hardware, a practical skill used throughout Parts V and VI.
- Explain the mathematical foundation of LoRA, including the low-rank decomposition W' = W + BA and why it works in transformer weight matrices
- Configure LoRA hyperparameters (rank, alpha, target modules, dropout) for different model architectures and task types
- Apply QLoRA with NF4 quantization, double quantization, and paged optimizers to fine-tune large models on consumer hardware
- Compare advanced PEFT methods (DoRA, LoRA+, Prefix Tuning, IA3, adapters) and select the right one for a given scenario
- Implement multi-adapter serving strategies using LoRAX or S-LoRA for production deployments
- Use modern training platforms (Unsloth, Axolotl, LLaMA-Factory, torchtune, TRL) to streamline the fine-tuning workflow
- Merge trained LoRA adapters back into base models and evaluate the merged result
- Select appropriate cloud infrastructure (Colab, Lambda Labs, RunPod, Modal) based on budget and scale requirements
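As a preview of the first two objectives, the decomposition W' = W + BA can be checked on a toy example. The sketch below uses plain Python lists and made-up integer matrices. It applies the alpha/r scaling factor from the original LoRA formulation, so the merged weight is W + (alpha/r)·BA. The helper names (`matmul`, `add`, `scale`, `lora_param_count`) are illustrative and not part of any library.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

def lora_param_count(d_out, d_in, r):
    """Trainable parameters in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)

# Toy shapes: W is 3x4 (frozen), B is 3x2 and A is 2x4 (trainable), rank r = 2.
W = [[1, 0, 2, -1], [0, 1, 0, 3], [2, -1, 1, 0]]
B = [[1, 0], [0, 1], [1, 1]]
A = [[1, 2, 0, 0], [0, 0, 1, -1]]
alpha, r = 4, 2
scaling = alpha / r

x = [[1], [2], [3], [4]]  # one input column vector

# Adapter kept separate: frozen path plus scaled low-rank path.
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged into the weight: W' = W + scaling * B A.
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(unmerged == merged)  # True: merging does not change the output

# Parameter savings for a single 4096 x 4096 projection at rank 16.
full_params = 4096 * 4096
lora_params = lora_param_count(4096, 4096, 16)
print(lora_params, lora_params / full_params)  # 131072 0.0078125
```

The equality between the merged and unmerged outputs is what makes adapter merging free at inference time, and the last line shows where the "under 1% of parameters" figure comes from for a single square projection.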
Prerequisites
- Chapter 16: Fine-Tuning Fundamentals (supervised fine-tuning workflow, data preparation, evaluation)
- Chapter 9: Inference Optimization (quantization basics, GPU memory concepts)
- Chapter 3: The Transformer Architecture (attention mechanism, weight matrices)
- Familiarity with PyTorch training loops and the Hugging Face Transformers library
- Basic linear algebra (matrix multiplication, rank of a matrix)
Sections
- 17.1 LoRA & QLoRA. LoRA is the single most important technique for practical LLM fine-tuning. (Level: Entry)
- 17.2 Advanced PEFT Methods. LoRA dominates the PEFT landscape, but it is not the only option. (Level: Advanced)
- 17.3 Training Platforms & Tools. The fine-tuning tool landscape is evolving rapidly. (Level: Intermediate)
- 17.3a Training Tool Comparison, Cloud Compute & Recommended Workflows. Side-by-side tool comparison matrix, GPU/cloud pricing landscape, and beginner-to-advanced end-to-end workflows. (Level: Intermediate)
- 17.4 Soft Prompts: Prompt Tuning, Prefix Tuning, and P-Tuning. Soft prompt methods occupy a fascinating middle ground between prompt engineering and fine-tuning. (Level: Intermediate)
- 17.5 Knowledge Distillation: Foundations & Pipelines. The teacher-student framework, white-box vs. black-box distillation, real-world LLM case studies, small-but-capable models, and the practical pipeline. (Level: Advanced)
- 17.6 Distillation: Licensing, Speculative & Reasoning. Provider licensing constraints, speculative distillation for inference acceleration, and chain-of-thought distillation that transfers reasoning into smaller students. (Level: Advanced)
- 17.7 Model Merging & Composition. Model merging creates multi-skilled models by combining weights from specialized fine-tunes, requiring no GPU training at all. (Level: Advanced)
- 17.8 Continual Learning & Domain Adaptation. LLMs are expensive to pretrain and quickly become outdated. (Level: Advanced)
Objective
Train a QLoRA adapter that makes Llama-3.2-3B write like a specific author (Hemingway, Asimov, or your own blog). The whole pipeline fits in a free Colab T4 (16 GB VRAM). At the end you will have a saved adapter you can swap onto the base model and demo to a friend, plus first-hand intuition for the LoRA hyperparameters this chapter covers.
Steps
- Step 1: Build a style corpus. Collect 200 to 1000 short passages (100 to 500 words each) from your target author. Public-domain sources: Project Gutenberg for classic authors, your own blog/journal for personal style. Save as `style.jsonl` with `{"text": "..."}` per line. Convert to instruction format: `{"instruction": "Continue in the author's style:", "input": "<first 50 words>", "output": "<next 200 words>"}`.
- Step 2: Launch Colab + Unsloth. Open a free Colab T4 notebook. Install: `pip install unsloth bitsandbytes trl datasets`. Load `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` with `max_seq_length=2048`. Confirm the GPU has >14 GB free.
- Step 3: Configure LoRA. Use `FastLanguageModel.get_peft_model(r=16, lora_alpha=32, target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"], lora_dropout=0)`. Verify with `model.print_trainable_parameters()`: you should see <1% of params trainable.
- Step 4: Train. Use `SFTTrainer` with `per_device_train_batch_size=2`, `gradient_accumulation_steps=4`, `learning_rate=2e-4`, `num_train_epochs=2`, `fp16=True` (the T4 does not support bf16; use `bf16=True` only on Ampere or newer GPUs). Expect 30 to 60 minutes wall-clock. Watch the loss go from ~2.0 to ~1.4.
- Step 5: Inference + side-by-side. Generate 5 completions from the same prompt with (a) base Llama-3.2-3B, (b) your fine-tuned adapter. Save to `compare.html`. The style shift should be obvious; if not, your dataset is too small or you trained for too few epochs.
- Step 6: Save + share. Run `model.save_pretrained("hemingway-lora")`, then push to Hugging Face Hub. Anyone can now load your style on top of Llama-3.2-3B in 3 lines. Bonus: try merging the adapter with `model.merge_and_unload()` for adapter-free deployment (Section 17.7).
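The conversion in Step 1 can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed pipeline. The function names and the output file name `style_instruct.jsonl` are hypothetical. Passages with fewer than 250 words get a shorter `output` instead of being dropped, which is a design choice rather than something the lab requires.

```python
import json

INSTRUCTION = "Continue in the author's style:"

def to_instruction_example(passage, n_input=50, n_output=200):
    """Split one passage into an instruction-format record.

    Returns None when the passage has no words left after the input.
    Shorter passages yield a shorter output (an assumption of this sketch).
    """
    words = passage.split()
    if len(words) <= n_input:
        return None
    return {
        "instruction": INSTRUCTION,
        "input": " ".join(words[:n_input]),
        "output": " ".join(words[n_input:n_input + n_output]),
    }

def convert(src_path="style.jsonl", dst_path="style_instruct.jsonl"):
    """Read {"text": ...} lines and write instruction-format lines."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            example = to_instruction_example(json.loads(line)["text"])
            if example is not None:
                dst.write(json.dumps(example, ensure_ascii=False) + "\n")
                kept += 1
    return kept

# Quick check on a synthetic 300-word passage.
demo = to_instruction_example(" ".join(f"w{i}" for i in range(300)))
print(len(demo["input"].split()), len(demo["output"].split()))  # 50 200
```

Calling `convert()` in the Colab notebook after uploading `style.jsonl` would produce the file to pass to the `datasets` loader in the later steps.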
Expected Output
Expected time: 4 to 6 hours (a weekend afternoon). Difficulty: intermediate. Artifact: a Hugging Face Hub adapter + before/after comparison page.
What's Next?
Next: Chapter 18: Alignment, RLHF, DPO & Preference Tuning. SFT and PEFT teach a model how to behave on examples you label. Alignment asks the harder question: how do you teach it human preferences when "the right answer" is fuzzy or absent? Chapter 18 walks through RLHF (PPO loop, KL penalty, reward model), DPO (the trick that eliminated the reward model), GRPO (DeepSeek-R1's reasoning recipe), and how practice in 2025-26 settled on which method to use when.