LoRA & QLoRA

Section 17.1

"Why update a billion parameters when a few thousand will do? Efficiency is not laziness; it is elegance under constraint."

LoRA, Elegantly Lazy AI Agent
Big Picture

LoRA is the single most important technique for practical LLM fine-tuning. Instead of updating all model weights, LoRA freezes the pretrained model and injects small trainable low-rank matrices into each layer. This reduces trainable parameters by 100x or more, cuts GPU memory by 60-70%, and produces adapters that can be swapped at serving time without reloading the base model. QLoRA extends this further by quantizing the frozen weights to 4-bit (building on the quantization techniques from Section 9.1), enabling fine-tuning of 70B models on a single 48GB GPU.

Prerequisites

This section builds on fine-tuning fundamentals from Section 16.1: When and Why to Fine-Tune and the transformer architecture covered in Section 3.1: Transformer Architecture Deep Dive. You should also be comfortable with the quantization concepts from Section 9.1: Quantization Fundamentals, as QLoRA builds directly on 4-bit quantization techniques. An understanding of catastrophic forgetting from Section 16.2 will help you appreciate why parameter-efficient methods preserve base model capabilities.

A small sticky note attached to a large frozen neural network, representing LoRA's lightweight trainable parameters
Figure 17.1.1: LoRA in a nutshell: instead of rewriting the whole textbook, just slap a sticky note on the relevant pages. Same knowledge, fraction of the effort.
See Also

For a hands-on walkthrough of LoRA and PEFT using Hugging Face libraries, see Hugging Face: Transformers, Datasets, and Hub.

A delivery truck with custom decals representing QLoRA's quantized base model with low-rank adapters
Figure 17.1.2: QLoRA: take your already-compact quantized model and add custom decals. You get personalization without needing a bigger garage.

17.1.1 The Full Fine-Tuning Problem

Key Insight
Why: LoRA targets attention by default, not FFNs

The convention to target Q and V (and sometimes K and O) but rarely the FFN dates from Hu et al.'s original paper, which adapted only the attention projections (the MLP blocks were left frozen for simplicity) and found that adapting the query and value projections together gave the best results for a fixed parameter budget. A common interpretation is that attention is where task-specific routing lives (which tokens attend where), while FFNs hold factual knowledge learned during pretraining that fine-tuning should not disturb. This interpretation maps onto the Geva et al. (2021) finding that FFN layers behave like key-value memory stores. The principle: adapt the parts of the network that need to learn new behavior, freeze the parts that store existing knowledge. Targeting FFNs in LoRA can be useful for domain adaptation where new facts matter; targeting attention is better for new tasks.

See Also

LoRA is most commonly used as the parameter-efficient backbone of preference fine-tuning (DPO, ORPO, IPO). For the preference-optimization side of the same workflow, see Section 18.3: DPO & Modern Preference Optimization.

Tip: Overfit One Batch Before Starting Any Training Run

Before launching a multi-hour fine-tuning job, run 100 steps on a single batch of 8-16 examples with no regularization (weight_decay=0, no dropout). Loss should reach near zero. If it does not, you have a bug: wrong masking (loss computed on prompt tokens), broken chat template, mismatched tokenizer, or learning rate orders of magnitude too low. This 5-minute check has saved countless wasted GPU-hours. If loss goes to zero, kill the run and launch properly.

Key Insight
Mental Model: Why LoRA Targets Attention, Not FFN

A natural question: why does LoRA target attention matrices rather than the FFN layers that contain most of the parameters? The answer connects to the functional distinction between the two. FFN layers store factual knowledge as static key-value pairs baked into weights (see Section 4.1). Adapting FFN rows overwrites stored facts, which is expensive and risks erasing useful pretrained knowledge. Attention projections, by contrast, determine how tokens route information: which positions can speak to which. Adapting Q, K, V, O reshapes routing patterns without touching factual stores. This is why fine-tuning with LoRA applied only to attention achieves strong task performance with minimal catastrophic forgetting: you are teaching new communication patterns while leaving knowledge intact. When a task requires injecting genuinely new factual knowledge (not just new behavior), LoRA on FFN or full fine-tuning becomes necessary.

When you fine-tune a model with full parameter updates, every weight in the model needs a gradient and optimizer states (momentum and variance for Adam) in addition to the weight itself. For a 7B parameter model held entirely in FP16, that means 14 GB for the weights, plus roughly 42 GB for gradients and Adam states, totaling about 56 GB of GPU memory before activations are counted. Scaling to 13B or 70B models makes this prohibitively expensive.
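The arithmetic behind these figures can be sketched in a few lines. This is a rough estimate only: the helper name `full_finetune_memory_gb` is hypothetical, and it assumes every tensor (weights, gradients, and both Adam moment buffers) is stored in 16 bits, ignoring activations and framework overhead. Mixed-precision setups that keep FP32 optimizer states need considerably more.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=2):
    """Rough estimate: weights + gradients + two Adam moment buffers."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    adam_states = 2 * n_params * bytes_per_value  # momentum and variance
    return (weights + gradients + adam_states) / 1e9


for billions in (7, 13, 70):
    estimate = full_finetune_memory_gb(billions * 1e9)
    print(f"{billions}B parameters: ~{estimate:.0f} GB")
```

Under these assumptions the cost is 8 bytes per parameter, which is where the full fine-tuning column of the model size comparison table comes from.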

The key insight behind parameter-efficient methods is that the weight changes during fine-tuning are low-rank. Research (Aghajanyan et al., 2021) has shown that when you compute the difference between a fine-tuned model and its pretrained base (the "task-specific delta"), this delta matrix has a very low intrinsic dimensionality, meaning it can be well-approximated by a matrix of much lower rank. Most of the information in the update can be captured by a much smaller matrix. Think of it like image compression: a JPEG discards high-frequency details that humans barely notice, keeping the image recognizable at a fraction of the file size. Similarly, LoRA discards the high-rank components of the weight update, keeping only the most important directions of change.

Fun Note

LoRA adapters for a 7B model are typically 10 to 50 MB in size. The base model itself is 14 GB. This means you can store 280 different task-specific adapters in the same disk space as a single full fine-tune. Some teams maintain a "LoRA library" with dozens of specialized adapters (one for legal text, one for medical notes, one for code review) that share a single base model in GPU memory and swap at request time. It is like having 280 employees who all share the same brain but carry different notebooks.

Key Insight: Mental Model: The Sticky Note on a Textbook

Think of LoRA as adding sticky notes to a textbook rather than rewriting the pages. The original textbook (pretrained weights) stays untouched. Each sticky note (low-rank adapter) captures a small correction or addition that adjusts the meaning for your specific task. At inference time, you can peel off one set of sticky notes and apply a different set for a new task, all without ever altering the original textbook pages.

Fun Fact

LoRA achieves 90%+ of full fine-tuning performance while training less than 1% of the parameters. It is the deep learning equivalent of altering a suit instead of buying a new one: same great fit, fraction of the cost.

Table 17.1.1a: Model Size Comparison (as of 2026).
Model Size | Full FT Memory (FP16 + Adam) | LoRA Memory (r=16) | QLoRA Memory (NF4, r=16)
7B | ~56 GB | ~16 GB | ~6 GB
13B | ~104 GB | ~28 GB | ~10 GB
70B | ~560 GB | ~160 GB | ~36 GB

Why does LoRA work with so few parameters? The deep intuition is that fine-tuning does not need to change the model's fundamental capabilities; it only needs to steer existing capabilities toward a specific task. The pretrained weight matrices encode rich, general-purpose representations built from trillions of tokens. The task-specific adaptation (learning to follow instructions, adopting a particular output format, specializing in a domain) requires only small directional adjustments to these representations. Mathematically, these adjustments occupy a low-dimensional subspace of the full weight space, which is precisely what low-rank matrices capture. This is why rank-4 or rank-8 LoRA frequently matches full fine-tuning: the effective dimensionality of the task adaptation is genuinely small, regardless of the model's total parameter count.

Key Insight

The effectiveness of LoRA is grounded in the mathematical phenomenon of low intrinsic dimensionality, a concept with roots in topology and the manifold hypothesis. The manifold hypothesis, widely discussed in machine learning theory, posits that high-dimensional data (such as model weight matrices with millions of entries) actually lies on or near a low-dimensional manifold. In linear algebra terms, the task-specific weight update occupies a tiny subspace of the full parameter space. This is not unique to neural networks: the same principle appears in signal processing (signals are sparse in the Fourier or wavelet basis), in physics (many-body systems have low effective degrees of freedom near equilibrium), and in statistics (principal component analysis exploits the same low-rank structure to compress data). LoRA's rank decomposition ΔW = BA is closely related to a truncated singular value decomposition (SVD) of the weight update: both represent the update with only r directions, although LoRA learns its factors by gradient descent rather than computing an SVD. The remarkable finding that ranks as low as 4 or 8 suffice tells us that fine-tuning operates in a surprisingly low-dimensional subspace of the full parameter manifold.

Warning
Common Mistake: Higher Rank Does Not Always Mean Better Quality

A common misconception is that increasing the LoRA rank (r) always improves fine-tuning quality. In practice, ranks above 32 rarely improve performance for most tasks and can actually degrade results by increasing overfitting risk on small datasets. The sweet spot for most tasks is r=8 to r=16. Higher ranks (32 to 64) are useful only for complex multi-task fine-tuning or when the task requires substantial behavioral change. Always start with r=8 and increase only if validation metrics plateau. More parameters to train also means more memory, slower training, and larger adapter files.

17.1.2 LoRA Mathematics

LoRA decomposes the weight update into two low-rank matrices: instead of training a full d-by-k matrix Delta-W with d times k parameters, LoRA trains B (d by r) and A (r by k) with r much smaller than d and k, giving only r times (d plus k) parameters.
Figure 17.1.3: LoRA freezes the pretrained matrix W and learns the update as the product of two low-rank factors A (down-projection to rank r) and B (up-projection back to the output dimension). With rank r set to 8-32, the number of trainable parameters drops by 100× or more relative to full fine-tuning, and after merging, the adapted weight is a single dense matrix of the same shape as W, so inference costs the same as for a fully fine-tuned model.
Algorithm 17.1.1: Under the Hood: LoRA Backward Pass

The forward pass W′ = W + (α/r)BA is well-known. The backward pass is where the memory saving actually happens. PyTorch sees W₀ with requires_grad=False and never allocates a gradient tensor for it. Only A and B accumulate gradients:

# Input: frozen base weights W0 (d_in x d_out), trainable LoRA factors A (d_in x rank), B (rank x d_out), scaling alpha
# Output: y = x @ (W0 + (alpha/rank) * A @ B); gradients flow only through A and B
# Note: this sketch uses the row-vector convention (y = x @ W), so A @ B here plays the role of BA in the text.
import torch
from torch import nn


class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank, alpha):
        super().__init__()
        # Frozen base weight: requires_grad=False means no gradient is ever allocated for it
        self.W0 = nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
        # A starts small and random, B starts at zero, so the adapter is a no-op at step 0
        self.A = nn.Parameter(torch.randn(d_in, rank) * 0.02)
        self.B = nn.Parameter(torch.zeros(rank, d_out))
        self.scale = alpha / rank

    def forward(self, x):
        return x @ self.W0 + (x @ self.A @ self.B) * self.scale


layer = LoRALinear(512, 512, rank=8, alpha=16)
layer(torch.randn(4, 512)).sum().backward()
print(layer.A.grad.shape)  # torch.Size([512, 8])
print(layer.W0.grad)       # None <-- the memory saving

The 60-70% memory saving in LoRA training comes largely from this: no gradient tensor is allocated for W0, and the optimizer keeps no momentum or variance state for it either.

17.1.2.1 The Core Decomposition

LoRA (Low-Rank Adaptation) works by expressing the weight update as a product of two small matrices. For a pretrained weight matrix W of dimension d × k, instead of computing a full update ΔW (also d × k), LoRA decomposes it as:

W' = W + ΔW = W + BA

where B is d × r and A is r × k, with the rank r being much smaller than both d and k. Typical values of r range from 4 to 64, while d and k are typically 4096 or larger. This means the number of trainable parameters drops from d × k (e.g., 16.7 million for a 4096 × 4096 matrix) to r × (d + k) (e.g., 131,072 for r=16). Figure 17.1.4 visualizes this decomposition.

LoRA decomposition. Input x feeds both a frozen pretrained weight matrix W of size d_in by d_out (drawn large) and a trainable low-rank path through A and B.
Figure 17.1.4: LoRA replaces the full weight update with the product of two small trainable matrices A and B, where the rank r is much smaller than the original matrix dimensions. For a 4096 by 4096 layer with r = 8, that is 256 times fewer trainable parameters (128 times fewer at r = 16) and roughly the same forward FLOPs because the bypass path runs in parallel with the frozen W. Concept after Hu et al., 2021, LoRA: Low-Rank Adaptation of Large Language Models.
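The parameter-count formula is easy to check numerically. The short sketch below uses two hypothetical helper names (`full_update_params`, `lora_params`) and the 4096 × 4096 example from the text; it is only a calculator for d × k versus r × (d + k), not part of any library.

```python
def full_update_params(d, k):
    """Parameters in a dense d x k update matrix."""
    return d * k


def lora_params(d, k, r):
    """Parameters in the LoRA factors: B is d x r and A is r x k."""
    return r * (d + k)


d = k = 4096
for r in (4, 8, 16, 64):
    n = lora_params(d, k, r)
    reduction = full_update_params(d, k) // n
    print(f"r={r}: {n:,} trainable parameters, {reduction}x fewer than the dense update")
```

Because the count grows linearly in r, doubling the rank doubles the adapter size and halves the reduction factor.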

17.1.2.2 Initialization and Scaling

LoRA uses a specific initialization strategy: matrix A is initialized with a random Gaussian distribution, and matrix B is initialized to all zeros. This means that at the start of training, BA = 0, and the model behaves exactly like the pretrained base. Training gradually moves the model away from this starting point.

The scaling factor α (alpha) controls the magnitude of the LoRA update. The actual update applied is:

W' = W + (α / r) · BA

The ratio α/r acts as a learning rate multiplier for the LoRA weights. A common convention is to set α = 2r (so the effective multiplier is 2), but the optimal value depends on the task. Increasing α relative to r makes the adaptation more aggressive; decreasing it keeps the model closer to the pretrained weights.

Warning: Common Misconception

Many tutorials present r and alpha as two independent knobs, leading readers to tune them separately. They are coupled. What actually controls update magnitude is the ratio α/r, not either value alone. Doubling r from 8 to 16 while keeping α=16 silently halves your effective learning rate. The "α = 2r" rule (and the newer rank-stabilized variant use_rslora=True, which scales by α/sqrt(r) instead) exists precisely so you can change r without re-tuning your learning rate schedule. Treat α/r as a single hyperparameter when comparing configurations.
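To make the coupling concrete, here is a minimal sketch of the multiplier applied to BA under the two scaling rules mentioned above. The function name `lora_scale` is hypothetical; the formulas (α/r and α/sqrt(r)) are the ones stated in the text.

```python
import math


def lora_scale(alpha, r, use_rslora=False):
    """Multiplier on BA: alpha / r, or alpha / sqrt(r) for rank-stabilized LoRA."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


print(lora_scale(16, 8))                    # 2.0
print(lora_scale(16, 16))                   # 1.0: doubling r with fixed alpha halves the scale
print(lora_scale(32, 16))                   # 2.0: alpha = 2r keeps the scale constant
print(lora_scale(16, 16, use_rslora=True))  # 4.0
```

When comparing two runs, computing this single number for each configuration is a quick way to see whether a change in quality came from the rank or from an accidental change in update magnitude.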

Key Insight

When you double the rank r, you should also consider doubling α to maintain the same effective learning rate. Many practitioners set α = 2 × r as a starting point, then adjust based on validation performance. If training diverges, reduce α; if the model barely moves from the base, increase it.

Real-World Scenario
LoRA for Customer Support Classification

Who: ML engineer at a mid-size SaaS company

Situation: The team needed to fine-tune Llama-2 7B for routing customer support tickets into 48 categories, but their GPU budget was a single A100 40GB.

Problem: Full fine-tuning of 7B parameters with mixed-precision AdamW required over 100GB of GPU memory (parameters, gradients, FP32 optimizer states), far exceeding available hardware.

Dilemma: They could either pay for multi-GPU cloud instances (4x the monthly cost) or accept a smaller base model with lower accuracy.

Decision: They chose LoRA with rank 16 and alpha 32, targeting only the query and value projection matrices. This reduced trainable parameters from 7B to 8.4M (0.12%), fitting comfortably in 24GB with QLoRA 4-bit quantization.

How: Using the PEFT library with BitsAndBytesConfig for 4-bit loading, they trained for 3 epochs on 15,000 labeled tickets. Training took 2.5 hours on their single A100.

Result: The QLoRA model achieved 91.3% classification accuracy, only 1.2% below a full fine-tune baseline (92.5%). Memory usage peaked at 18GB, and the adapter checkpoint was just 33MB.

Lesson: LoRA makes fine-tuning accessible on constrained hardware with minimal accuracy trade-off; start with rank 16 and adjust only if validation loss plateaus.

17.1.2.3 Why Low-Rank Works

The effectiveness of low-rank adaptation rests on a remarkable empirical finding: the "intrinsic dimensionality" of fine-tuning updates is far lower than the full parameter count would suggest. When researchers analyzed the singular value decomposition of ΔW matrices from full fine-tuning runs, they found that a small number of singular values capture the vast majority of the update's information content. In many cases, ranks as low as 4 or 8 capture over 90% of the useful signal.
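The kind of singular-value analysis described here can be illustrated on synthetic data. The sketch below does not use a real fine-tuning delta: it assumes, purely for illustration, an update made of a rank-4 signal plus small dense noise, and then measures how much of the squared Frobenius norm the top singular values capture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 256, 256, 4

# Synthetic "fine-tuning delta" (an assumption for illustration):
# a rank-4 signal plus small dense noise.
delta = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, k))
delta += 0.01 * rng.normal(size=(d, k))

singular_values = np.linalg.svd(delta, compute_uv=False)
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)

for r in (1, 2, 4, 8):
    print(f"rank {r}: {energy[r - 1]:.4f} of the squared Frobenius norm")
```

Running the same measurement on the difference between a real fine-tuned checkpoint and its base is how one would check whether a given layer's update is actually low-rank.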

This makes intuitive sense. Fine-tuning typically adapts a model to a specific domain or task format. The knowledge required for this adaptation (new terminology, output format preferences, domain-specific reasoning patterns) is a small modification relative to the vast general knowledge encoded in the pretrained weights.

Warning
Common Misconception: "LoRA Is Just Cheaper Full Fine-Tuning"

LoRA is not merely a budget-friendly approximation of full fine-tuning. It works because task-specific weight updates are inherently low-rank: the useful signal in fine-tuning occupies a small subspace of the full parameter space. This means LoRA is not sacrificing quality for efficiency; it is exploiting the structure of the problem. In many benchmarks, LoRA at rank 16 matches full fine-tuning on task accuracy while using less than 1% of the trainable parameters. Increasing rank beyond what the task requires does not help and can even hurt through overfitting, just as adding unnecessary features hurts a classical ML model.

17.1.3 LoRA Hyperparameters in Practice

Knowing that low-rank works does not tell you which rank to pick or which layers to adapt. Three hyperparameters dominate every LoRA configuration in production: the rank r, the scaling factor alpha, and the set of target modules. Each comes with empirical defaults and known failure modes, and we walk through them in the order you should tune them.

17.1.3.1 Rank (r) Selection

Table 17.1.2a: Rank (r) Selection Comparison (as of 2026). Parameter counts correspond to adapters on the query and value projections only.
Rank | Trainable Params (7B model) | Best For | Risk
4 | ~2M | Simple format adaptation, chat templates | May underfit complex tasks
8 | ~4M | Classification, simple instruction following | Good default for most tasks
16 | ~8M | Domain adaptation, moderate complexity | Slight increase in memory
32 | ~16M | Complex reasoning, code generation | Diminishing returns begin
64 | ~33M | Very complex tasks, near full FT quality | Highest memory use and overfitting risk

17.1.3.2 Target Module Selection

Library Shortcut: peft LoraConfig (target_modules, r, alpha)

The Hugging Face peft library reduces a LoRA setup to a single LoraConfig plus get_peft_model. Pass target_modules="all-linear" to attach adapters to every linear layer at once, set r and lora_alpha with the rule of thumb alpha = 2 * r, and use use_rslora=True to switch on rank-stabilized scaling. The same config object plugs straight into TRL's SFTTrainer and DPOTrainer.

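# pip install peft transformers
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Any causal LM works here; Meta-Llama-3-8B matches the other examples in this section
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32,
                 lora_dropout=0.05, target_modules="all-linear",
                 bias="none", use_rslora=True)
model = get_peft_model(base_model, cfg)
model.print_trainable_parameters()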
Code Fragment 17.1.1b: Canonical peft LoRA recipe with rank-stabilized scaling and full linear-layer coverage.

Not all weight matrices benefit equally from LoRA adaptation. The standard practice is to apply LoRA to the attention projection matrices: q_proj, k_proj, v_proj, and o_proj. Research and practice have converged on the recommendation to also include the MLP layers (gate_proj, up_proj, down_proj) for best results, though this increases trainable parameters. Code Fragment 17.1.1c shows this approach in practice.

# Configure LoRA adapter parameters: rank, alpha, target modules
# Lower rank reduces trainable parameters; alpha scales the adapter contribution
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

# Standard configuration: attention layers only
lora_config_basic = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,  # alpha = 2r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    bias="none",
)

# Recommended: attention + MLP layers for best quality
lora_config_full = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
)

# Load the base model to adapt (the same checkpoint as the deployment example below)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config_full)
model.print_trainable_parameters()
Output: trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.5196
Code Fragment 17.1.1c: Two LoraConfig presets side by side: an attention-only configuration (q_proj, k_proj, v_proj, o_proj) versus a full-coverage configuration that also targets the MLP projections (gate_proj, up_proj, down_proj). Adding the MLP layers roughly triples trainable parameters on Llama-3-8B at r=16 (about 13.6M to 41.9M) but often closes the quality gap with full fine-tuning.
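The trainable-parameter count reported by print_trainable_parameters() can be reproduced by hand from the layer shapes. The sketch below assumes Llama-3-8B dimensions (hidden size 4096, grouped-query key/value width 1024, MLP width 14336, 32 layers); the `SHAPES` table and the helper `lora_trainable_params` are hypothetical names introduced for illustration, and other architectures will give different totals.

```python
# Assumed Llama-3-8B shapes as (in_features, out_features) per linear layer.
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_trainable_params(target_modules, r, n_layers=N_LAYERS):
    """Sum r * (in + out) over the targeted layers, repeated for every block."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * n_layers


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
attention_and_mlp = attention_only + ["gate_proj", "up_proj", "down_proj"]

print(f"attention only:  {lora_trainable_params(attention_only, r=16):,}")
print(f"attention + MLP: {lora_trainable_params(attention_and_mlp, r=16):,}")
```

A hand count like this is a useful sanity check: if the number PEFT prints differs from what you expect, the adapter is probably attached to a different set of modules than you intended.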

Code Fragment 17.1.1d shows three ways to load, switch, and merge trained adapters.

# Load trained LoRA adapters for inference, switch between them,
# or merge them into the base weights for standalone serving
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

BASE_MODEL = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# === Option A: Load adapter separately ===
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the first adapter (registered under the name "default")
model = PeftModel.from_pretrained(base_model, "./lora-medical")
# Load a second adapter and make it the active one
model.load_adapter("./lora-legal", adapter_name="legal")
model.set_adapter("legal")

# === Option B: Merge and save ===
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-medical",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-medical-merged")
# Save the tokenizer alongside so the merged directory is self-contained
tokenizer.save_pretrained("./llama3-medical-merged")

# === Option C: Merge QLoRA (requires dequantization) ===
# Load the base model in 16-bit precision (not 4-bit) so the merge is numerically clean
model = AutoPeftModelForCausalLM.from_pretrained(
    "./qlora-output",
    torch_dtype=torch.float16,
    device_map="auto",
    low_cpu_mem_usage=True,
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

print("Merged model saved. Upload to HF Hub or serve with vLLM.")
Output: Merged model saved. Upload to HF Hub or serve with vLLM.
Code Fragment 17.1.1d: Three deployment patterns for LoRA adapters: load the adapter on top of a fresh base model (composable), use AutoPeftModelForCausalLM (one call), or merge_and_unload() to bake the adapter into the base weights for inference without the peft runtime. The merged path eliminates the per-forward-pass overhead of the adapter branch.
Key Insight

When merging QLoRA adapters, you must load the base model in higher precision (FP16 or BF16) first, then merge. This is because the merge operation (W' = W + (α/r) · BA) needs sufficient numerical precision. Merging in 4-bit would introduce unacceptable quantization noise. After merging, you can re-quantize the merged model to GGUF or AWQ for efficient serving.
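The merge identity W' = W + (α/r) · BA can be verified on small random matrices. The NumPy sketch below is illustrative only: it uses made-up dimensions and random "trained" factors, and it checks that running the adapter as a separate branch and running the merged dense matrix produce the same outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 8, 16
scale = alpha / r

W = rng.normal(size=(d, k))        # frozen base weight, d x k as in the text
B = rng.normal(size=(d, r)) * 0.1  # stand-ins for trained LoRA factors
A = rng.normal(size=(r, k)) * 0.1
x = rng.normal(size=(5, k))        # batch of 5 inputs, each of dimension k

# Unmerged: base path plus the low-rank bypass
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Merged: fold the update into one dense matrix of the original shape
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))
print(W_merged.shape == W.shape)
```

In float64 the two paths agree to rounding error. The precision warning above matters because the same addition performed on 4-bit codes would have to re-quantize W' and lose most of the small BA correction.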

Key Insight

LoRA adapters unlock multi-tenant serving. Because LoRA keeps the base model frozen and stores task-specific knowledge in small adapter matrices, you can serve dozens of different fine-tuned "models" from a single base model in GPU memory. At inference time, you simply swap the active adapter (a few megabytes) instead of loading an entirely separate model (many gigabytes). This has profound implications for inference optimization and deployment: frameworks like LoRAX and S-LoRA can serve hundreds of adapters concurrently with minimal overhead, making personalized models economically viable at scale.

17.1.4 LoRA Hyperparameter Tuning Guide

Finding optimal LoRA hyperparameters requires systematic experimentation. Here is a practical guide based on what works across a wide range of tasks and model sizes.

Table 17.1.3a: LoRA hyperparameter defaults and when to adjust them.
Hyperparameter | Default | When to Increase | When to Decrease
r (rank) | 16 | Complex tasks, large datasets, reasoning | Simple format changes, small datasets
lora_alpha | 2 × r | Model not adapting enough | Training diverging, loss spiking
lora_dropout | 0.05 | Overfitting (val loss rises) | Large dataset, underfitting
learning_rate | 2e-4 | Underfitting, slow convergence | Divergence, loss oscillation
max_grad_norm | 1.0 | Very stable training | Gradient spikes (try 0.3)
Note

LoRA learning rates are typically 5-10x higher than full fine-tuning learning rates. This is because only a small fraction of parameters are being updated, so each update needs to have a larger effect. A learning rate of 2e-4 for LoRA corresponds roughly to 2e-5 for full fine-tuning in terms of per-step model change.

Note: What lora_dropout actually drops

The lora_dropout parameter applies standard Bernoulli dropout to the input of the low-rank delta path (the $A$ matrix's input), not to the base model. During each forward pass a fraction $p$ of input features going into $A$ are zeroed and the remaining ones are scaled by $1/(1-p)$ to preserve expected magnitudes; the frozen base matrix $W_0$ receives the unmodified input. This regularizes only the adapter, so the base model's frozen behavior is never perturbed. A small value such as $p=0.05$ is enough for most fine-tunes because the adapter has so few parameters; raising it past $0.1$ tends to slow convergence without further reducing overfitting.
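A small NumPy sketch makes the placement of the dropout explicit. The function `lora_forward` is a hypothetical stand-in, not the peft implementation; it uses the row-vector convention (y = x @ W0) and applies inverted dropout only to the copy of the input that enters the adapter branch.

```python
import numpy as np


def lora_forward(x, W0, A, B, scale, p=0.05, training=True, rng=None):
    """y = x @ W0 + scale * dropout(x) @ A @ B, with dropout on the adapter branch only."""
    x_adapter = x
    if training and p > 0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(x.shape) >= p          # Bernoulli keep-mask
        x_adapter = x * keep / (1.0 - p)         # rescale to preserve the expected value
    return x @ W0 + scale * (x_adapter @ A @ B)  # the base path always sees the clean x


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 32))
W0 = rng.normal(size=(32, 16))
A = rng.normal(size=(32, 8)) * 0.02
B = rng.normal(size=(8, 16)) * 0.02

train_out = lora_forward(x, W0, A, B, scale=2.0, p=0.5, rng=rng)
eval_out = lora_forward(x, W0, A, B, scale=2.0, training=False)
print(np.allclose(eval_out, x @ W0 + 2.0 * (x @ A @ B)))  # dropout is off at inference
```

Note that at initialization, when B is all zeros, dropout has no effect on the output at all, because the adapter branch contributes nothing regardless of which inputs are masked.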

17.1.5 The PEFT Library Ecosystem

The Hugging Face peft library provides a unified interface for all parameter-efficient methods. Beyond basic LoRA, it supports loading adapters from the Hub, combining adapters, and quantized training workflows. Code Fragment 17.1.2d shows this approach in practice.

# Load a published LoRA adapter, inspect its configuration, and push adapter weights to the Hub
import torch
from peft import AutoPeftModelForCausalLM, PeftConfig

# Load a LoRA adapter from Hugging Face Hub (the base model is resolved from the adapter config)
model = AutoPeftModelForCausalLM.from_pretrained(
    "username/my-lora-adapter",  # Adapter repo on HF Hub
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Inspect adapter configuration
config = PeftConfig.from_pretrained("username/my-lora-adapter")
print(f"Base model: {config.base_model_name_or_path}")
print(f"Rank: {config.r}, Alpha: {config.lora_alpha}")
print(f"Target modules: {config.target_modules}")
# Push adapter to Hub (only saves the small adapter weights)
model.push_to_hub("username/my-lora-adapter")
# Adapter size: typically 50-200 MB vs 14+ GB for full model
Output:
Base model: meta-llama/Meta-Llama-3-8B
Rank: 16, Alpha: 32
Target modules: ['q_proj', 'v_proj']
Code Fragment 17.1.2d: Loading a published LoRA adapter from the Hugging Face Hub with AutoPeftModelForCausalLM and inspecting its PeftConfig (base model, rank, alpha, target modules). The push_to_hub call at the end uploads only the adapter weights, typically 50 to 200 MB versus the 14+ GB of the base model.

QLoRA: 4-Bit Quantized LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit NF4 quantization of the base model, reducing memory by roughly 4x compared to standard LoRA in FP16. The bitsandbytes library handles the quantization transparently through BitsAndBytesConfig. The frozen base weights are stored in 4-bit precision, while the LoRA adapter matrices train in BF16. This enables fine-tuning a 70B model on a single 48 GB GPU. Code Fragment 17.1.3b shows the setup.

# pip install bitsandbytes peft transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True, # quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model) # freeze + enable gradients
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"],
    bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
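# With these seven target modules at r=16 the adapter has 41,943,040 trainable parameters;
# the reported total is well below 8B because the 4-bit base weights are counted in packed form.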
Code Fragment 17.1.3b: QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit NF4 quantization of the base model.

Under the Hood: NF4, Blockwise, and Double Quantization

Three quantization tricks make QLoRA work. The first is the NF4 (4-bit NormalFloat) data type, a 16-value codebook hand-designed for tensors whose values are approximately $\mathcal{N}(0, \sigma^2)$ distributed (the regime that pretrained transformer weights consistently fall into after layer-norm). Rather than placing the 16 levels at uniform spacing across the value range, NF4 places them at the quantiles of a unit normal so that each level has equal expected probability mass. The full 16 anchor values (used by every modern bitsandbytes install) are:

Real-World Scenario: The 16 NF4 anchor values

$$\begin{aligned}\text{NF4} = \{&-1.0,\ -0.6962,\ -0.5251,\ -0.3949,\ -0.2844,\ -0.1848,\\ &-0.0911,\ 0.0,\ 0.0796,\ 0.1609,\ 0.2461,\ 0.3379,\\ &0.4407,\ 0.5626,\ 0.7230,\ 1.0\}\end{aligned}$$

The codebook is asymmetric around zero (7 negative values, an exact $0$, and 8 positive values). Per-block scaling forces the encoded values into $[-1, 1]$, and the codebook reserves one anchor at each endpoint to keep the full dynamic range plus an exact zero, which leaves an odd number of levels (13) to divide between the two sides. Spacing is tight near zero (where the bulk of weight mass lives) and sparse near $\pm 1$ (the tails), unlike the even spacing of uniform quantization.

The second trick is blockwise quantization. A single global scale across an entire weight matrix is dominated by the largest outlier, which crushes precision for the typical weight. QLoRA instead splits each tensor into blocks of 64 contiguous parameters and quantizes each block independently against its own absolute-maximum scale $c_i = \max(|w_j|)$ for $w_j$ in block $i$. Each weight is stored as a 4-bit code (16 levels), and each block shares one 32-bit FP32 scale. The per-parameter cost is therefore $4 + 32/64 = 4.5$ bits.
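The first two tricks can be combined in a few lines of plain Python. This is a simplified sketch, not the bitsandbytes kernel: it uses the rounded anchor values listed above, stores codes as ordinary integers instead of packing two per byte, and the helper names (`quantize_block`, `dequantize_block`, `quantize`) are hypothetical.

```python
NF4 = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
       0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]


def quantize_block(block):
    """Return (4-bit codes, absmax scale) for one block of weights."""
    scale = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    # Each weight is normalized into [-1, 1] and mapped to its nearest anchor
    codes = [min(range(16), key=lambda i: abs(NF4[i] - w / scale)) for w in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Look up each code in the codebook and undo the per-block scaling."""
    return [NF4[c] * scale for c in codes]


def quantize(weights, block_size=64):
    """Split a flat list of weights into blocks and quantize each independently."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b) for b in blocks]


codes, scale = quantize_block([0.5, -2.0, 0.25, 0.0])
print(codes, scale)
print(dequantize_block(codes, scale))
```

Because the scale is chosen per block, the single outlier (-2.0 here) only costs precision for the other weights in its own block, which is the point of the blockwise scheme.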

The third trick, double quantization, attacks the residual 0.5-bit overhead. The block scales $c_i$ are themselves a long FP32 vector (one per 64 weights), so QLoRA groups those scales into super-blocks of 256 and quantizes each super-block to 8-bit precision against a single FP32 super-scale. The per-parameter cost drops to $4 + 8/64 + 32/(64 \cdot 256) \approx 4.127$ bits, recovering roughly 0.37 bits per parameter (about 3 GB on a 65B model) at negligible accuracy cost. The complete chain is summarized in Table 17.1.4b.

Table 17.1.4b: Bits per parameter introduced by each layer of QLoRA quantization, for a representative 65B-parameter base model.
Layer | Per-parameter cost | 65B model size
FP16 baseline | 16 bits | 130 GB
NF4 codebook only (1 global scale) | 4 bits | 32.5 GB
+ Blockwise FP32 scales (block=64) | 4.5 bits | 36.5 GB
+ Double quantization (super-block=256) | 4.127 bits | 33.5 GB
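The bits-per-parameter accounting in the table follows directly from the block sizes. The sketch below recomputes it; the helper names `bits_per_param` and `size_gb` are hypothetical, and the defaults (block of 64, super-block of 256, 8-bit quantized scales) are the values quoted in the text.

```python
def bits_per_param(code_bits=4, block=64, scale_bits=32,
                   double_quant=False, super_block=256, quant_scale_bits=8):
    """Storage cost per weight, including its share of the block scales."""
    if not double_quant:
        return code_bits + scale_bits / block
    # With double quantization: 8-bit block scales plus one FP32 scale per super-block
    return code_bits + quant_scale_bits / block + scale_bits / (block * super_block)


def size_gb(n_params, bits):
    """Convert a per-parameter bit cost into gigabytes for a whole model."""
    return n_params * bits / 8 / 1e9


for label, bits in [
    ("FP16 baseline", 16),
    ("NF4 + blockwise scales", bits_per_param()),
    ("NF4 + double quantization", bits_per_param(double_quant=True)),
]:
    print(f"{label}: {bits:.3f} bits/param, {size_gb(65e9, bits):.1f} GB at 65B")
```

Changing `block` shows the trade-off: smaller blocks track outliers better but raise the scale overhead, which is what double quantization is designed to claw back.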
Key Insight: Why the codebook is normal-shaped

Pretrained transformer weights look like draws from a zero-mean normal after the layer-norm rescaling baked into modern architectures. An information-theoretically motivated 4-bit quantizer for a $\mathcal{N}(0, 1)$ source places its levels at equal-probability quantiles of the normal distribution, so that every code is used about equally often, which is the principle behind the NF4 anchors. A uniform 4-bit grid would waste codewords on the rare tails and starve the dense center, hurting accuracy. NF4 is therefore a principled quantile quantizer rather than a clever heuristic. (It is not the same as a Lloyd-Max quantizer, which minimizes mean squared error and does not produce equal-probability bins.)

Tip: Start with LoRA Rank 8 or 16

For most fine-tuning tasks, LoRA rank 8 produces strong results with minimal trainable parameters. Increase to 16 or 32 only if you see underfitting. Higher ranks add parameters linearly (each adapted matrix costs $r(d + k)$ parameters) but rarely improve quality beyond rank 32.

Research Frontier

LoRA variants continue to proliferate: DoRA decomposes weight updates into magnitude and direction components, while rsLoRA applies rank-dependent scaling for more stable training at higher ranks. Research on LoRA composition explores stacking, merging, and routing among multiple LoRA adapters at inference time, enabling modular skill composition without retraining. An open theoretical question is why low-rank adaptation works as well as full fine-tuning despite the severe parameter reduction, and whether optimal rank can be predicted from task properties.

Recent work on LoRA Soups (2024) demonstrates that averaging multiple LoRA adapters trained on different data can outperform any individual adapter, paralleling the model soup findings in full fine-tuning.

Lab: Fine-Tune a Model with LoRA and Compare Rank Settings
Duration: ~60 minutes Intermediate

Objective

Apply LoRA adapters to a language model using PEFT, train on a small dataset, and compare how different rank values (r=4, r=16, r=64) affect trainable parameter count, memory usage, and output quality.

What You'll Practice

  • Configuring LoRA with the PEFT library (target modules, rank, alpha)
  • Inspecting trainable vs. frozen parameter counts
  • Training LoRA adapters with SFTTrainer
  • Saving, reloading, and merging adapters into the base model
  • Comparing rank settings on quality and efficiency metrics

Setup

The following cell installs the required packages and configures the environment for this lab.
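
The install cell itself is not reproduced here. As a stand-in, the following minimal sketch checks that the packages imported in the steps below are available. The package list is inferred from those imports; `accelerate` is an assumption (it is typically required when loading with `device_map="auto"`), and `find_missing` is a hypothetical helper.

```python
import importlib.util

# Packages imported in the lab steps. `accelerate` is an assumption:
# it is typically needed when loading with device_map="auto".
REQUIRED = ["torch", "transformers", "peft", "trl", "datasets", "accelerate", "pandas"]

def find_missing(packages):
    """Return the subset of `packages` that cannot be imported."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    # In a notebook, install them with:  %pip install <names>
    print("Missing packages:", " ".join(missing))
else:
    print("All lab dependencies are importable.")
```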

Steps

Step 1: Load the base model and apply LoRA

Load a small model and wrap it with a LoRA configuration. Inspect the parameter counts to see how few parameters are actually trainable.

# Load a small model and wrap with LoRA: only the low-rank adapter
# matrices are trainable; all original weights stay frozen.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Fall back to the EOS token if the tokenizer defines no padding token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Load the base model unconditionally (not only when the pad token is missing).
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
# TODO: Create a LoraConfig with rank=16, alpha=32
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM,
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
Code Fragment 17.1.4c: This lab step loads SmolLM2-135M-Instruct with AutoModelForCausalLM, creates a LoraConfig targeting q_proj and v_proj layers with rank 16 and alpha 32, and calls get_peft_model to wrap the base model. The print_trainable_parameters() call reveals that only a fraction of a percent of the parameters are trainable.
Hint

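The print_trainable_parameters() method prints a single summary line of the form "trainable params: N || all params: M || trainable%: P". For this configuration the trainable share is a fraction of one percent; the exact figures depend on the model and library version. Only the LoRA matrices are trainable; everything else is frozen.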

Step 2: Compare parameter counts across ranks

Create LoRA configs with different ranks and compare the trainable parameter counts.

# Compare trainable parameter counts across LoRA ranks (4 to 64).
# Shows how rank controls the expressiveness vs. efficiency tradeoff.
# Reuses model_name, LoraConfig, get_peft_model, and TaskType from Step 1.
import pandas as pd
import torch
from transformers import AutoModelForCausalLM
ranks = [4, 8, 16, 32, 64]
results = []
for r in ranks:
    # Reload a fresh base model so adapters from earlier ranks do not stack.
    base = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")
    config = LoraConfig(r=r, lora_alpha=r * 2,
        target_modules=["q_proj", "v_proj"],
        task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
    pm = get_peft_model(base, config)
    # TODO: Extract trainable and total parameter counts
    trainable = sum(p.numel() for p in pm.parameters() if p.requires_grad)
    total = sum(p.numel() for p in pm.parameters())
    results.append({"rank": r, "trainable": trainable,
        "total": total, "pct": f"{trainable/total*100:.4f}%"})
    # Free the model before loading the next one.
    del pm, base
    torch.cuda.empty_cache()
# Build and print the summary table once, after all ranks are measured.
df = pd.DataFrame(results)
print(df.to_string(index=False))
Output:
rank  trainable      total     pct
   4      40960  134555648 0.0304%
   8      81920  134596608 0.0609%
  16     163840  134678528 0.1216%
  32     327680  134842368 0.2429%
  64     655360  135170048 0.4849%
Code Fragment 17.1.5: Sweep LoRA rank from 4 to 64 and tabulate trainable parameters, total parameters, and the trainable fraction. The output makes the rank vs. capacity tradeoff visible: doubling rank doubles adapter parameters but leaves the base model untouched.
Hint

Notice how trainable parameters scale linearly with rank. Doubling the rank roughly doubles the trainable parameter count. The common heuristic alpha = 2 * rank keeps the effective learning rate stable across rank settings.
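
The linear scaling follows from the shapes of the two factors: for a $d \times k$ weight, B is $d \times r$ and A is $r \times k$, so one adapted matrix costs $r(d + k)$ parameters. A minimal sketch, using a hypothetical 4096 x 4096 projection rather than SmolLM2's actual shapes (the function name `lora_param_count` is illustrative):

```python
def lora_param_count(d_out, d_in, r):
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)

# Hypothetical 4096 x 4096 projection (not SmolLM2's actual shape).
d = k = 4096
full = d * k
for r in [4, 8, 16, 32, 64]:
    n = lora_param_count(d, k, r)
    print(f"r={r:2d}: {n:>9,} adapter params, {full / n:6.1f}x fewer than the full matrix")
```

Multiplying this per-matrix cost by the number of adapted matrices gives the total that print_trainable_parameters() reports.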

Step 3: Train a LoRA adapter

Fine-tune the LoRA adapter (r=16) on a small dataset.

# Train the LoRA adapter (r=16) on 300 instruction examples.
# Only adapter weights update; base model gradients are skipped.
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/no_robots", split="train")
dataset = dataset.shuffle(seed=42).select(range(300))
def format_chat(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False)}
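# Keep only the rendered "text" column so the trainer sees one unambiguous field.
formatted = dataset.map(format_chat, remove_columns=dataset.column_names)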
# TODO: Configure training (use higher LR than full fine-tuning)
training_args = SFTConfig(
    output_dir="./lora-smollm2-r16",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-4, # 10x higher than full FT
    logging_steps=10,
    max_seq_length=512,
    fp16=True,
    warmup_ratio=0.1,
    save_strategy="epoch",
    report_to="none",
    )
trainer = SFTTrainer(model=model, args=training_args,
    train_dataset=formatted, processing_class=tokenizer)
result = trainer.train()
print(f"Training loss: {result.training_loss:.4f}")
Output: Training loss: 1.5237
Code Fragment 17.1.6a: Fine-tuning the LoRA adapter on a 300-example slice of HuggingFaceH4/no_robots with TRL's SFTTrainer. The learning rate is 2e-4, roughly 10x what full fine-tuning would use, because only the small adapter matrices receive gradients.
Hint

LoRA typically uses 5x to 10x higher learning rate than full fine-tuning because only the small adapter matrices receive gradients. A learning rate of 2e-4 is a good starting point.

Step 4: Save and merge the LoRA adapter

Save the adapter weights (only a few hundred KB), then reload and merge them into the base model.

# Save the tiny adapter checkpoint, then merge it back into the
# base model for deployment without the PEFT dependency.
import os
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
# Save adapter only
model.save_pretrained("./lora-adapter-r16")
# Sum the sizes of the weight files in the adapter directory.
adapter_size = sum(
    os.path.getsize(os.path.join("./lora-adapter-r16", f))
    for f in os.listdir("./lora-adapter-r16")
    if f.endswith((".safetensors", ".bin"))
)
print(f"Adapter size: {adapter_size / 1024:.1f} KB")
# Reload base model and merge adapter
base_fresh = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
peft_model = PeftModel.from_pretrained(base_fresh, "./lora-adapter-r16")
merged_model = peft_model.merge_and_unload()
print("Adapter merged successfully!")
Output:
Adapter size: 524.0 KB
Adapter merged successfully!
Code Fragment 17.1.7: Save the trained adapter to disk (a few hundred KB), then merge it into the base model with merge_and_unload() so the result can be deployed as a single set of weights. The merged model is functionally identical to the adapter-plus-base pair but has no peft runtime dependency.
Hint

The merge_and_unload() method folds the LoRA weights into the base model, producing a standard model that can be used without the PEFT library at inference time.
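
What "folding in" means can be checked on a toy example. The sketch below is pure Python with tiny hand-picked matrices (all values are illustrative), and it assumes the update is scaled by alpha/r. It shows that keeping the adapter separate and merging it into W give the same output.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

# Toy shapes: W is d x k with d=2, k=3; rank r=1.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
B = [[0.5], [-1.0]]        # d x r
A = [[1.0, 0.0, 2.0]]      # r x k
alpha, r = 2.0, 1
x = [[1.0], [2.0], [3.0]]  # input as a k x 1 column

# Adapter kept separate: two extra small matmuls per forward pass.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Adapter merged: fold the update into W once, then a single matmul.
W_merged = add(W, scale(matmul(B, A), alpha / r))
y_merged = matmul(W_merged, x)

print("adapter path:", y_adapter)
print("merged path: ", y_merged)
```

The merged path needs no adapter-specific code at inference time, which is why the merged model can be loaded without PEFT.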

Step 5: Test the merged model

Generate responses and compare with the base model.

import torch
# Generate from the merged model on test prompts.
# Look for improved instruction-following after LoRA fine-tuning.
test_prompts = [
    "Explain photosynthesis to a 10-year-old.",
    "Write a short email declining a meeting politely.",
    "List 3 advantages of renewable energy.",
]
for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    fmt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(fmt, return_tensors="pt").to(merged_model.device)
    # Gradients are not needed for generation.
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt.
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    print(f"Prompt: {prompt}\nResponse: {response[:300]}\n{'-'*50}")
Output:
Prompt: Explain photosynthesis to a 10-year-old.
Response: Photosynthesis is how plants make their own food using sunlight! Think of it like a recipe: plants take water from the soil and carbon dioxide from the air, then use sunlight as the energy to turn them into sugar (their food) and oxygen...
--------------------------------------------------
Prompt: Write a short email declining a meeting politely.
Response: Subject: RE: Meeting Request Hi [Name], Thank you for the invitation. Unfortunately, I have a scheduling conflict and will not be able to attend. I would be happy to review any notes or follow up afterward. Best regards
--------------------------------------------------
...
Code Fragment 17.1.8: Generation from the LoRA-merged model on a fixed set of test prompts. To see what fine-tuning changed, run the same prompts through the original base model: the architecture and tokenizer are identical, and only the q_proj and v_proj weights, updated through the low-rank adapter, differ.
Hint

To get a fair comparison, reload the original base model and generate with the same settings. The LoRA-tuned model should show improved instruction following despite training only about 0.1% of the parameters.

Expected Output

  • A parameter table showing r=4 trains ~0.03%, r=16 trains ~0.12%, r=64 trains ~0.48%
  • A saved adapter that is only a few hundred KB (vs. ~270MB full model)
  • Merged model outputs showing improved instruction following

Stretch Goals

  • Train adapters at r=4 and r=64, compare outputs, and measure which rank is best for this dataset size
  • Add QLoRA by loading in 4-bit (BitsAndBytesConfig(load_in_4bit=True)) and compare memory
  • Target all linear layers (target_modules="all-linear") and compare results
Complete Solution
# Complete LoRA lab solution: load model, apply adapter, train,
# save/merge, and generate from the merged model.
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Step 1: Apply LoRA
base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj","v_proj"],
    task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Step 2: Compare ranks (each rank gets a fresh copy of the base model)
for r in [4, 8, 16, 32, 64]:
    b = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    c = LoraConfig(r=r, lora_alpha=r*2, target_modules=["q_proj","v_proj"],
        task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
    pm = get_peft_model(b, c)
    t = sum(p.numel() for p in pm.parameters() if p.requires_grad)
    a = sum(p.numel() for p in pm.parameters())
    print(f"r={r}: {t:,} trainable / {a:,} total = {t/a*100:.4f}%")
    del pm, b
    torch.cuda.empty_cache()
# Step 3: Train (runs once, outside the rank-comparison loop)
ds = load_dataset("HuggingFaceH4/no_robots", split="train").shuffle(seed=42).select(range(300))
def fmt(ex):
    return {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False, add_generation_prompt=False)}
formatted = ds.map(fmt)
args = SFTConfig(output_dir="./lora-smollm2-r16", num_train_epochs=3, per_device_train_batch_size=4,
    gradient_accumulation_steps=2, learning_rate=2e-4, logging_steps=10, max_seq_length=512,
    fp16=True, warmup_ratio=0.1, save_strategy="epoch", report_to="none")
trainer = SFTTrainer(model=model, args=args, train_dataset=formatted, processing_class=tokenizer)
trainer.train()
# Step 4: Save and merge
model.save_pretrained("./lora-adapter-r16")
base_fresh = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
merged = PeftModel.from_pretrained(base_fresh, "./lora-adapter-r16").merge_and_unload()
# Step 5: Test
for p in ["Explain photosynthesis to a 10-year-old.", "Write a short email declining a meeting.", "List 3 advantages of renewable energy."]:
    msgs = [{"role":"user","content":p}]
    f = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    i = tokenizer(f, return_tensors="pt").to(merged.device)
    with torch.no_grad():
        o = merged.generate(**i, max_new_tokens=200, temperature=0.7, do_sample=True)
    # Decode only the generated continuation, not the prompt tokens.
    answer = tokenizer.decode(o[0][i['input_ids'].shape[1]:], skip_special_tokens=True)
    print(f"Q: {p}\nA: {answer[:300]}\n")
Output:
trainable params: 163,840 || all params: 134,678,528 || trainable%: 0.1216
r=4: 40,960 trainable / 134,555,648 total = 0.0304%
r=8: 81,920 trainable / 134,596,608 total = 0.0609%
r=16: 163,840 trainable / 134,678,528 total = 0.1216%
r=32: 327,680 trainable / 134,842,368 total = 0.2429%
r=64: 655,360 trainable / 135,170,048 total = 0.4849%
Q: Explain photosynthesis to a 10-year-old.
A: Photosynthesis is how plants make their own food using sunlight...
Q: Write a short email declining a meeting.
A: Subject: RE: Meeting Request...
Q: List 3 advantages of renewable energy.
A: 1. Renewable energy sources do not run out...
Code Fragment 17.1.9a: End-to-end LoRA recipe in a single script: load SmolLM2, wrap with a rank-16 adapter, fine-tune with SFTTrainer, then save and merge. Use it as a template for any small instruction-tuning experiment on consumer GPUs.
Key Takeaways
Self-Check
Q1: In the LoRA decomposition W' = W + BA, what are the dimensions of matrices B and A for a weight matrix of size d×k with rank r?
Show Answer
B is d×r and A is r×k. The total trainable parameters are r×(d+k), which is much smaller than the d×k parameters in the original matrix when r is small. For example, with d=k=4096 and r=16, this is 131,072 versus 16,777,216 (a 128x reduction).
Q2: Why is matrix B initialized to zeros and A initialized with random values, rather than the reverse?
Show Answer
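Initializing B to zeros ensures that BA = 0 at the start of training, so the model begins as an exact copy of the pretrained base, which provides a stable starting point. The choice of B=0 (rather than A=0) is a convention; either would work. What does not work is setting both to zero: the gradient of each factor is proportional to the other factor, so neither would ever receive a nonzero update. The key requirement is that the product BA starts at zero while one factor is nonzero, so training begins from the pretrained behavior and can still move away from it.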
Q3: What is the role of the alpha/r scaling factor in LoRA, and how should you adjust alpha when changing rank?
Show Answer
The ratio alpha/r acts as a learning rate multiplier for the LoRA update. When you double r, you should generally double alpha to maintain the same effective scaling. A common starting point is alpha = 2r. If the model is not adapting enough, increase alpha; if training is unstable, decrease it.
Q4: What three innovations does QLoRA combine, and why is each one necessary?
Show Answer
QLoRA combines: (1) NF4 quantization, which reduces base model memory by 4x with minimal quality loss using a data type optimized for normally distributed weights; (2) double quantization, which compresses the quantization constants themselves, saving an additional ~0.4 bits per parameter; (3) paged optimizers, which use unified memory to gracefully handle GPU memory spikes during training without OOM errors.
Q5: When should you keep LoRA adapters separate versus merging them into the base model?
Show Answer
Keep adapters separate when you need to serve multiple tasks from a single base model (hot-swapping adapters per request), when storage is a concern (adapters are ~50MB vs. 14+ GB for a full model), or when you plan to update adapters independently. Merge when you need maximum inference speed (zero adapter overhead), when using inference engines that do not support adapter loading, or when deploying a single dedicated model for one task.

Exercises

Exercise 17.1.1: LoRA mechanism Conceptual

Explain how LoRA works: what does it mean to add low-rank matrices A and B to a frozen weight matrix W? Why does this reduce memory and compute requirements?

Answer Sketch

Instead of updating W (d x d, millions of parameters), LoRA freezes W and learns two small matrices: B (d x r) and A (r x d), where r << d (typically 8 to 64). The effective weight becomes W' = W + BA. This reduces trainable parameters from d*d to 2*d*r. For a 4096x4096 matrix with r=16, parameters drop from 16.7M to 131K (128x reduction). Memory savings come from not storing gradients and optimizer states (momentum, variance) for the frozen W.
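
The optimizer-state saving can be estimated with a back-of-the-envelope sketch. It assumes Adam-style state stored in FP32 (two 4-byte values per trainable parameter), a hypothetical 7B-parameter model, and a LoRA configuration that trains about 0.1% of the parameters; the function name `adam_state_bytes` is illustrative.

```python
def adam_state_bytes(n_trainable, bytes_per_value=4):
    """Adam keeps two running statistics (momentum and variance) per trainable
    parameter. Assumes both are stored in FP32 (4 bytes each)."""
    return 2 * n_trainable * bytes_per_value

n_total = 7_000_000_000      # hypothetical 7B-parameter model
n_lora = n_total // 1000     # assume LoRA trains about 0.1% of that

full_gb = adam_state_bytes(n_total) / 1e9
lora_gb = adam_state_bytes(n_lora) / 1e9
print(f"Full fine-tuning optimizer state: {full_gb:.1f} GB")
print(f"LoRA optimizer state:             {lora_gb:.3f} GB")
```

The estimate ignores weights, gradients, and activations, so it is a lower bound on the difference rather than a full memory budget.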

Exercise 17.1.2: Rank and alpha selection Analysis

You are fine-tuning a 7B model with LoRA. Compare the expected outcomes of rank=8/alpha=16 versus rank=64/alpha=128. How does the rank affect model capacity, training speed, and risk of overfitting?

Answer Sketch

Rank 8: 0.1% trainable parameters, fastest training, lowest memory, but limited capacity for complex adaptations. Best for simple style transfer or single-task tuning. Rank 64: ~0.8% trainable parameters, slower but still much faster than full fine-tuning, more capacity for complex multi-task adaptation. Higher risk of overfitting on small datasets. Alpha/rank ratio (both 2.0 here) controls the effective learning rate of the LoRA updates; keeping it consistent means the learning dynamics are similar despite different ranks.

Exercise 17.1.3: QLoRA setup Coding

Write the Python code to load a model in 4-bit quantization using bitsandbytes and attach LoRA adapters using the PEFT library. Include the BitsAndBytesConfig and LoraConfig.

Answer Sketch

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16). Load: model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B', quantization_config=bnb_config). LoRA: lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','v_proj','k_proj','o_proj'], lora_dropout=0.05). Apply: model = get_peft_model(model, lora_config). This enables fine-tuning an 8B model on a single 16GB GPU.

Exercise 17.1.4: Target module selection Conceptual

Which transformer modules should you apply LoRA to? Compare applying LoRA to only attention projections (q, v) versus all linear layers (q, k, v, o, gate, up, down). What are the tradeoffs?

Answer Sketch

Attention only (q, v): the original LoRA paper's default. Minimal parameters, fast training, works well for many tasks. All linear layers: ~3x more parameters but better adaptation capacity. Research shows that including the MLP layers (gate, up, down) improves performance on complex tasks and domain adaptation. Tradeoff: more target modules means more memory, slower training, but better quality. Start with q, v; expand to all linear layers if quality is insufficient.

Exercise 17.1.5: Adapter merging and serving Coding

Write code that merges a LoRA adapter back into the base model weights for deployment. Explain when you would serve with merged weights versus keeping the adapter separate.

Answer Sketch

Merge: merged_model = model.merge_and_unload(). Save: merged_model.save_pretrained('merged_model/'). Merge for deployment when: (1) you have a single adapter and want maximum inference speed (no adapter overhead), (2) you are using a serving framework that does not support adapters. Keep separate when: (1) you serve multiple adapters from the same base model (multi-tenant), (2) you want to A/B test adapters, (3) you need to quickly swap or update adapters without redeploying the base model.

What Comes Next

In the next section, Section 17.2: Advanced PEFT Methods, we survey advanced PEFT methods beyond LoRA, including DoRA, LoRA+, prefix tuning, adapters, and prompt tuning.

Further Reading

Foundational Papers

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. The original LoRA paper that introduced low-rank decomposition for weight updates, enabling fine-tuning with a fraction of the parameters. Useful for anyone implementing parameter-efficient training, as nearly every modern PEFT workflow builds on these ideas.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS 2023. Combines 4-bit quantization with LoRA adapters, making it possible to fine-tune 65B parameter models on a single 48GB GPU. Practitioners working with limited hardware should read this to understand the quantization and paging tricks that reduce memory by 4x or more.
Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. ICML 2019. Introduced adapter modules inserted between transformer layers, establishing the conceptual foundation for parameter-efficient transfer. Researchers interested in the historical arc from full fine-tuning to modern PEFT methods should start here.

Theory & Surveys

Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021. Provides the theoretical grounding for why low-rank adaptation works: pretrained models have low intrinsic dimensionality. Anyone wanting to understand the mathematical justification behind LoRA's success should consult this analysis.
Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning. A comprehensive survey covering the full taxonomy of PEFT methods, from adapters to prompt tuning to reparameterization approaches. Ideal for practitioners comparing LoRA against alternatives and looking for guidance on which method to select for their use case.
Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., et al. (2023). Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models. Nature Machine Intelligence, 5, 220-235. A systematic review published in Nature Machine Intelligence that organizes PEFT methods into a clear taxonomy and benchmarks their performance across tasks. Recommended for researchers seeking a rigorous comparison framework with experimental results on large-scale models.