Section 15.1: LoRA & QLoRA

"Why update a billion parameters when a few thousand will do? Efficiency is not laziness; it is elegance under constraint."
LoRA, Elegantly Lazy AI Agent

Big Picture

LoRA is the single most important technique for practical LLM fine-tuning. Instead of updating all model weights, LoRA freezes the pretrained model and injects small trainable low-rank matrices into each layer. This reduces trainable parameters by 100x or more, cuts GPU memory by 60-70%, and produces adapters that can be swapped at serving time without reloading the base model. QLoRA extends this further by quantizing the frozen weights to 4-bit (building on the quantization techniques from Section 09.1), enabling fine-tuning of 70B models on a single 48GB GPU.

Prerequisites

This section builds on fine-tuning fundamentals from Section 14.1: When and Why to Fine-Tune and transformer architecture covered in Section 04.1: Transformer Architecture Deep Dive. You should also be comfortable with the quantization concepts from Section 09.1: Quantization Fundamentals, as QLoRA builds directly on 4-bit quantization techniques. An understanding of catastrophic forgetting from Section 14.2 will help you appreciate why parameter-efficient methods preserve base model capabilities.

A small sticky note attached to a large frozen neural network, representing LoRA's lightweight trainable parameters — **Figure 15.1.1**: LoRA in a nutshell: instead of rewriting the whole textbook, just slap a sticky note on the relevant pages. Same knowledge, fraction of the effort.

Appendix Reference

For a hands-on walkthrough of LoRA and PEFT using Hugging Face libraries, see Appendix K: HuggingFace: Transformers, Datasets, and Hub.

A delivery truck with custom decals representing QLoRA's quantized base model with low-rank adapters — **Figure 15.1.2**: QLoRA: take your already-compact quantized model and add custom decals. You get personalization without needing a bigger garage.

1. The Full Fine-Tuning Problem

When you fine-tune a model with full parameter updates, every weight in the model gets a gradient, an optimizer state (momentum and variance for Adam), and a copy of the updated weight. For a 7B parameter model in FP16, that means 14 GB just for the weights, plus roughly 42 GB for gradients and optimizer states, for a total of about 56 GB of GPU memory before counting activations. Scaling to 13B or 70B models makes this prohibitively expensive.
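
The arithmetic above can be sketched as a small calculator. This is a rough estimate under stated assumptions, not a profiler: the function name and its defaults are illustrative, it assumes 2 bytes per parameter for weights and gradients plus two FP16 Adam moments (4 bytes per parameter), and it ignores activation memory entirely.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_bytes=4):
    """Estimate full fine-tuning memory in GB (1 GB = 1e9 bytes).

    Assumed byte sizes: FP16 weights and gradients, two FP16 Adam moments.
    Activations and framework overhead are not included.
    """
    weights = n_params * weight_bytes / 1e9
    gradients = n_params * grad_bytes / 1e9
    optimizer = n_params * adam_bytes / 1e9
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(7e9)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:6.1f} GB")

# Mixed-precision setups that keep FP32 master weights and FP32 Adam moments
# (12 extra bytes per parameter) land at a noticeably higher total.
mixed_precision = full_finetune_memory_gb(7e9, adam_bytes=12)
print(f"mixed-precision total: {mixed_precision['total']:.1f} GB")
```

Changing the byte sizes is the quickest way to see why published memory figures for the same model size can differ by a factor of two.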

The key insight behind parameter-efficient methods is that the weight changes during fine-tuning are low-rank. Research (Aghajanyan et al., 2021) has shown that when you compute the difference between a fine-tuned model and its pretrained base (the "task-specific delta"), this delta matrix has a very low intrinsic dimensionality, meaning it can be well-approximated by a matrix of much lower rank. Most of the information in the update can be captured by a much smaller matrix. Think of it like image compression: a JPEG discards high-frequency details that humans barely notice, keeping the image recognizable at a fraction of the file size. Similarly, LoRA discards the high-rank components of the weight update, keeping only the most important directions of change.

Fun Note

LoRA adapters for a 7B model are typically 10 to 50 MB in size. The base model itself is 14 GB. This means you can store 280 different task-specific adapters in the same disk space as a single full fine-tune. Some teams maintain a "LoRA library" with dozens of specialized adapters (one for legal text, one for medical notes, one for code review) that share a single base model in GPU memory and swap at request time. It is like having 280 employees who all share the same brain but carry different notebooks.

Mental Model: The Sticky Note on a Textbook

Think of LoRA as adding sticky notes to a textbook rather than rewriting the pages. The original textbook (pretrained weights) stays untouched. Each sticky note (low-rank adapter) captures a small correction or addition that adjusts the meaning for your specific task. At inference time, you can peel off one set of sticky notes and apply a different set for a new task, all without ever altering the original textbook pages.

Fun Fact

LoRA achieves 90%+ of full fine-tuning performance while training less than 1% of the parameters. It is the deep learning equivalent of altering a suit instead of buying a new one: same great fit, fraction of the cost.

Model Size Comparison

Model Size	Full FT Memory (FP16 + Adam)	LoRA Memory (r=16)	QLoRA Memory (NF4, r=16)
7B	~56 GB	~16 GB	~6 GB
13B	~104 GB	~28 GB	~10 GB
70B	~560 GB	~160 GB	~36 GB

Why does LoRA work with so few parameters? The deep intuition is that fine-tuning does not need to change the model's fundamental capabilities; it only needs to steer existing capabilities toward a specific task. The pretrained weight matrices encode rich, general-purpose representations built from trillions of tokens. The task-specific adaptation (learning to follow instructions, adopting a particular output format, specializing in a domain) requires only small directional adjustments to these representations. Mathematically, these adjustments occupy a low-dimensional subspace of the full weight space, which is precisely what low-rank matrices capture. This is why rank-4 or rank-8 LoRA frequently matches full fine-tuning: the effective dimensionality of the task adaptation is genuinely small, regardless of the model's total parameter count.

Key Insight

The effectiveness of LoRA is grounded in the mathematical phenomenon of low intrinsic dimensionality, a concept with roots in topology and the manifold hypothesis. The manifold hypothesis, widely discussed in machine learning theory, posits that high-dimensional data (such as model weight matrices with millions of entries) actually lies on or near a low-dimensional manifold. In linear algebra terms, the task-specific weight update occupies a tiny subspace of the full parameter space. This is not unique to neural networks: the same principle appears in signal processing (signals are sparse in the Fourier or wavelet basis), in physics (many-body systems have low effective degrees of freedom near equilibrium), and in statistics (principal component analysis exploits the same low-rank structure to compress data). LoRA's rank decomposition ΔW = BA is closely related to a truncated singular value decomposition (SVD) of the weight update: both represent the update with at most r directions, although LoRA learns its two factors by gradient descent rather than computing an SVD. The remarkable finding that ranks as low as 4 or 8 suffice tells us that fine-tuning operates in a surprisingly low-dimensional subspace of the full parameter manifold.

Common Mistake: Higher Rank Does Not Always Mean Better Quality

A common misconception is that increasing the LoRA rank (r) always improves fine-tuning quality. In practice, ranks above 32 rarely improve performance for most tasks and can actually degrade results by increasing overfitting risk on small datasets. The sweet spot for most tasks is r=8 to r=16. Higher ranks (32 to 64) are useful only for complex multi-task fine-tuning or when the task requires substantial behavioral change. Start with r=8 or r=16 and increase only if validation metrics show the model is underfitting. More parameters to train also means more memory, slower training, and larger adapter files.

2. LoRA Mathematics

2.1 The Core Decomposition

LoRA (Low-Rank Adaptation) works by expressing the weight update as a product of two small matrices. For a pretrained weight matrix W of dimension d × k, instead of computing a full update ΔW (also d × k), LoRA decomposes it as:

W' = W + ΔW = W + BA

where B is d × r and A is r × k, with the rank r being much smaller than both d and k. Typical values of r range from 4 to 64, while d and k are typically 4096 or larger. This means the number of trainable parameters drops from d × k (e.g., 16.7 million for a 4096 × 4096 matrix) to r × (d + k) (e.g., 131,072 for r=16). Figure 15.1.3 visualizes this decomposition.
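
The parameter counts quoted above follow directly from the shapes of B and A. The short sketch below reproduces them; the function name is illustrative and not part of any library.

```python
def lora_param_counts(d, k, r):
    """Return (full update params, LoRA params, reduction factor) for a d x k matrix."""
    full = d * k          # entries in a dense update delta-W
    lora = r * (d + k)    # entries in B (d x r) plus A (r x k)
    return full, lora, full / lora


full, lora, reduction = lora_param_counts(d=4096, k=4096, r=16)
print(f"full update: {full:,} parameters")
print(f"LoRA (r=16): {lora:,} parameters")
print(f"reduction:   {reduction:.0f}x")
```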

LoRA weight decomposition diagram showing frozen pretrained weight matrix W with a low-rank bypass of two small trainable matrices A and B, where the weight update is decomposed as delta-W equals W_A times W_B

Figure 15.1.3: LoRA decomposes the weight update into two small trainable matrices B and A, reducing parameters by ~128x. Source: Sebastian Raschka, 2023. Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA).

2.2 Initialization and Scaling

LoRA uses a specific initialization strategy: matrix A is initialized with a random Gaussian distribution, and matrix B is initialized to all zeros. This means that at the start of training, BA = 0, and the model behaves exactly like the pretrained base. Training gradually moves the model away from this starting point.

The scaling factor α (alpha) controls the magnitude of the LoRA update. The actual update applied is:

W' = W + (α / r) · BA

The ratio α/r acts as a learning rate multiplier for the LoRA weights. A common convention is to set α = 2r (so the effective multiplier is 2), but the optimal value depends on the task. Increasing α relative to r makes the adaptation more aggressive; decreasing it keeps the model closer to the pretrained weights.
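
To make the initialization and scaling concrete, here is a minimal pure-Python sketch of a single LoRA-adapted linear map. The tiny dimensions, the seed, the standard deviation of A, and the helper names are all assumptions chosen for illustration; real implementations operate on tensors inside the model.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute h = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 6, 5, 2, 4
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # r x k, random Gaussian
B = [[0.0] * r for _ in range(d)]                                   # d x r, all zeros
x = [random.gauss(0, 1) for _ in range(k)]

# At initialization BA = 0, so the adapted layer reproduces the base layer.
h_init = lora_forward(W, A, B, x, alpha, r)
print("matches base at init:", h_init == matvec(W, x))

# Pretend one training step moved a single entry of B away from zero.
B[0][0] = 0.5
h_after = lora_forward(W, A, B, x, alpha, r)
print("output 0 changed:", h_after[0] != h_init[0])
print("other outputs unchanged:", h_after[1:] == h_init[1:])
```

Only the output row whose entry in B became non-zero moves, which illustrates how training departs gradually from the pretrained behavior.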

Key Insight

When you double the rank r, you should also consider doubling α to maintain the same effective learning rate. Many practitioners set α = 2 × r as a starting point, then adjust based on validation performance. If training diverges, reduce α; if the model barely moves from the base, increase it.
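
The effect of coupling α to r can be checked with a few lines of arithmetic. The helper below is a hypothetical name used only for this illustration.

```python
def lora_scale(alpha, r):
    """Effective multiplier applied to the low-rank update BA."""
    return alpha / r


print("rank  fixed alpha=16  alpha=2r")
for r in (8, 16, 32):
    fixed = lora_scale(16, r)       # alpha held constant while r grows
    coupled = lora_scale(2 * r, r)  # alpha doubled together with r
    print(f"{r:>4}  {fixed:>14.2f}  {coupled:>8.2f}")
```

With α held fixed, each doubling of the rank halves the multiplier; with α = 2r it stays at 2 for every rank.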

Real-World Scenario: LoRA for Customer Support Classification

Who: ML engineer at a mid-size SaaS company

Situation: The team needed to fine-tune Llama 2 7B for routing customer support tickets into 48 categories, but their GPU budget was a single A100 40GB.

Problem: Full fine-tuning of 7B parameters required over 100GB of GPU memory (parameters, gradients, and FP32 optimizer states under mixed-precision training), far exceeding available hardware.

Dilemma: They could either pay for multi-GPU cloud instances (4x the monthly cost) or accept a smaller base model with lower accuracy.

Decision: They chose LoRA with rank 16 and alpha 32, targeting only the query and value projection matrices. This reduced trainable parameters from 7B to 8.4M (0.12%), fitting comfortably in 24GB with QLoRA 4-bit quantization.

How: Using the PEFT library with BitsAndBytesConfig for 4-bit loading, they trained for 3 epochs on 15,000 labeled tickets. Training took 2.5 hours on their single A100.

Result: The QLoRA model achieved 91.3% classification accuracy, only 1.2% below a full fine-tune baseline (92.5%). Memory usage peaked at 18GB, and the adapter checkpoint was just 33MB.

Lesson: LoRA makes fine-tuning accessible on constrained hardware with minimal accuracy trade-off; start with rank 16 and adjust only if validation loss plateaus.

2.3 Why Low-Rank Works

The effectiveness of low-rank adaptation rests on a remarkable empirical finding: the "intrinsic dimensionality" of fine-tuning updates is far lower than the full parameter count would suggest. When researchers analyzed the singular value decomposition of ΔW matrices from full fine-tuning runs, they found that a small number of singular values capture the vast majority of the update's information content. In many cases, ranks as low as 4 or 8 capture over 90% of the useful signal.
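
The kind of analysis described above can be illustrated on synthetic data. The sketch below assumes NumPy is available and builds an artificial update matrix (a rank-4 signal plus small noise), so it demonstrates the measurement, not a result about any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, true_rank = 64, 64, 4

# Synthetic stand-in for a fine-tuning delta: rank-4 signal plus small dense noise.
delta_w = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, k))
delta_w += 0.01 * rng.standard_normal((d, k))

# Singular values, sorted from largest to smallest.
singular_values = np.linalg.svd(delta_w, compute_uv=False)

# Fraction of the squared Frobenius norm captured by the top-r singular values.
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)

for r in (1, 2, 4, 8):
    print(f"rank {r}: {energy[r - 1]:.4f} of the update's energy")
```

Running the same measurement on the difference between a real fine-tuned checkpoint and its base model is how one would check the claim empirically for a specific task.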

This makes intuitive sense. Fine-tuning typically adapts a model to a specific domain or task format. The knowledge required for this adaptation (new terminology, output format preferences, domain-specific reasoning patterns) is a small modification relative to the vast general knowledge encoded in the pretrained weights.

Common Misconception: "LoRA Is Just Cheaper Full Fine-Tuning"

LoRA is not merely a budget-friendly approximation of full fine-tuning. It works because task-specific weight updates are inherently low-rank: the useful signal in fine-tuning occupies a small subspace of the full parameter space. This means LoRA is not sacrificing quality for efficiency; it is exploiting the structure of the problem. In many benchmarks, LoRA at rank 16 matches full fine-tuning on task accuracy while using less than 1% of the trainable parameters. Increasing rank beyond what the task requires does not help and can even hurt through overfitting, just as adding unnecessary features hurts a classical ML model.

3. LoRA Hyperparameters in Practice

3.1 Rank (r) Selection

Rank (r) Selection Comparison

Rank	Trainable Params (7B model, q_proj + v_proj)	Best For	Risk / Notes
4	~2M	Simple format adaptation, chat templates	May underfit complex tasks
8	~4M	Classification, simple instruction following	Good default for most tasks
16	~8M	Domain adaptation, moderate complexity	Slight increase in memory
32	~16M	Complex reasoning, code generation	Diminishing returns begin
64	~33M	Very complex tasks, near full FT quality	Largest adapters; higher overfitting risk on small datasets
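
The parameter counts in the table can be reproduced with simple arithmetic. The sketch below assumes a Llama-2-7B-like shape (32 layers, hidden size 4096) with LoRA applied only to q_proj and v_proj; the function name and the shape list are illustrative.

```python
def lora_trainable_params(r, n_layers, module_shapes):
    """Count LoRA parameters for the given (in_features, out_features) modules per layer."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in module_shapes)
    return n_layers * per_layer


# Assumed shapes: q_proj and v_proj are both 4096 x 4096 in each of 32 layers.
qv_shapes = [(4096, 4096), (4096, 4096)]
counts = {r: lora_trainable_params(r, 32, qv_shapes) for r in (4, 8, 16, 32, 64)}

for r, n in counts.items():
    print(f"r={r:>2}: {n:>10,} trainable parameters")
```

Adding more target modules to the shape list shows how quickly the count grows when the MLP projections are included.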

3.2 Target Module Selection

Not all weight matrices benefit equally from LoRA adaptation. The standard practice is to apply LoRA to the attention projection matrices: q_proj, k_proj, v_proj, and o_proj. Research and practice have converged on the recommendation to also include the MLP layers (gate_proj, up_proj, down_proj) for best results, though this increases trainable parameters. Code Fragment 15.1.1 shows this approach in practice.

# Configure LoRA adapter parameters: rank, alpha, target modules
# Lower rank reduces trainable parameters; alpha scales the adapter contribution
from peft import LoraConfig, get_peft_model, TaskType

# Standard configuration: attention layers only
lora_config_basic = LoraConfig(
 task_type=TaskType.CAUSAL_LM,
 r=16,
 lora_alpha=32, # alpha = 2r
 lora_dropout=0.05,
 target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
 bias="none",
)

# Recommended: attention + MLP layers for best quality
lora_config_full = LoraConfig(
 task_type=TaskType.CAUSAL_LM,
 r=16,
 lora_alpha=32,
 lora_dropout=0.05,
 target_modules=[
 "q_proj", "k_proj", "v_proj", "o_proj",
 "gate_proj", "up_proj", "down_proj",
 ],
 bias="none",
)

# Apply LoRA to a base model loaded beforehand (e.g., with AutoModelForCausalLM.from_pretrained)
model = get_peft_model(model, lora_config_full)
model.print_trainable_parameters()

trainable params: 13,631,488 || all params: 6,751,219,712 || trainable%: 0.2019

Code Fragment 15.1.1: Configure LoRA adapter parameters: rank, alpha, target modules

Code Fragment 15.1.2 shows how to load trained adapters, switch between them, and merge them into the base model.

# Load, swap, and merge trained LoRA adapters
# The base model stays frozen; adapters are attached on top or folded into its weights
from peft import PeftModel, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# === Option A: Load adapter separately ===
base_model = AutoModelForCausalLM.from_pretrained(
 "meta-llama/Meta-Llama-3-8B",
 torch_dtype=torch.bfloat16,
 device_map="auto",
)
# Load different adapters dynamically
model = PeftModel.from_pretrained(base_model, "./lora-medical")
# Switch to another adapter
model.load_adapter("./lora-legal", adapter_name="legal")
model.set_adapter("legal")

# === Option B: Merge and save ===
model = AutoPeftModelForCausalLM.from_pretrained(
    "./lora-medical",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llama3-medical-merged")
# Save the base model's tokenizer alongside the merged weights
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.save_pretrained("./llama3-medical-merged")

# === Option C: Merge QLoRA (requires dequantization) ===
# Load the QLoRA model in full precision for merging
model = AutoPeftModelForCausalLM.from_pretrained(
 "./qlora-output",
 torch_dtype=torch.float16,
 device_map="auto",
 low_cpu_mem_usage=True,
)
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")

print("Merged model saved. Upload to HF Hub or serve with vLLM.")

Code Fragment 15.1.2: Load, swap, and merge LoRA adapters, including the higher-precision merge path for QLoRA

Key Insight

When merging QLoRA adapters, you must load the base model in higher precision (FP16 or BF16) first, then merge. This is because the merge operation (W' = W + (α/r) · BA) needs sufficient numerical precision. Merging in 4-bit would introduce unacceptable quantization noise. After merging, you can re-quantize the merged model to GGUF or AWQ for efficient serving.
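
The merge formula can be verified on a toy example. The sketch below uses hand-picked 2 × 3 matrices and hypothetical helper names; it checks that folding (α/r) · BA into W gives the same output as running the adapter as a separate branch.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weight, d=2, k=3
B = [[0.5], [-1.0]]                      # d x r with r=1
A = [[1.0, 0.0, 2.0]]                    # r x k
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

# Adapter kept separate: base output plus the scaled low-rank branch.
branch = matvec(B, matvec(A, x))
adapter_out = [b + (alpha / r) * u for b, u in zip(matvec(W, x), branch)]

# Adapter merged: a single dense matrix with no extra branch at inference time.
merged = merge_lora(W, A, B, alpha, r)
merged_out = matvec(merged, x)

print("separate:", adapter_out)
print("merged:  ", merged_out)
```

The values here are exactly representable in floating point, so the two paths agree bit for bit; with real weights in reduced precision the agreement is only approximate, which is why the merge is done in FP16 or BF16 instead of 4-bit.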

Key Insight

LoRA adapters unlock multi-tenant serving. Because LoRA keeps the base model frozen and stores task-specific knowledge in small adapter matrices, you can serve dozens of different fine-tuned "models" from a single base model in GPU memory. At inference time, you simply swap the active adapter (a few megabytes) instead of loading an entirely separate model (many gigabytes). This has profound implications for inference optimization and deployment: frameworks like LoRAX and S-LoRA can serve hundreds of adapters concurrently with minimal overhead, making personalized models economically viable at scale.
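
The serving pattern can be sketched with a toy registry that keeps one shared base weight and selects a named adapter per request. Everything here (the class, its methods, the adapter names, and the tiny matrices) is hypothetical and only illustrates the idea; systems such as LoRAX and S-LoRA implement it with batched GPU kernels.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AdapterRegistry:
    """Toy multi-adapter layer: one frozen base weight, many low-rank adapters."""

    def __init__(self, base_weight):
        self.base_weight = base_weight  # shared by every request
        self.adapters = {}              # name -> (A, B, scale)

    def register(self, name, A, B, alpha, r):
        self.adapters[name] = (A, B, alpha / r)

    def forward(self, x, adapter=None):
        h = matvec(self.base_weight, x)
        if adapter is None:
            return h  # plain base model
        A, B, scale = self.adapters[adapter]
        delta = matvec(B, matvec(A, x))
        return [hi + scale * di for hi, di in zip(h, delta)]


layer = AdapterRegistry(base_weight=[[1.0, 0.0], [0.0, 1.0]])
layer.register("legal", A=[[1.0, 0.0]], B=[[1.0], [0.0]], alpha=1, r=1)
layer.register("medical", A=[[0.0, 1.0]], B=[[0.0], [2.0]], alpha=1, r=1)

x = [1.0, 1.0]
print("base:   ", layer.forward(x))
print("legal:  ", layer.forward(x, adapter="legal"))
print("medical:", layer.forward(x, adapter="medical"))
```

Switching tasks only changes which small pair of matrices is looked up; the base weight is never copied or reloaded.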

7. LoRA Hyperparameter Tuning Guide

Finding optimal LoRA hyperparameters requires systematic experimentation. Here is a practical guide based on what works across a wide range of tasks and model sizes.

Hyperparameter	Default	When to Increase	When to Decrease
`r` (rank)	16	Complex tasks, large datasets, reasoning	Simple format changes, small datasets
`lora_alpha`	2 × r	Model not adapting enough	Training diverging, loss spiking
`lora_dropout`	0.05	Overfitting (val loss rises)	Large dataset, underfitting
`learning_rate`	2e-4	Underfitting, slow convergence	Divergence, loss oscillation
`max_grad_norm`	1.0	Very stable training	Gradient spikes (try 0.3)

Note

LoRA learning rates are typically 5-10x higher than full fine-tuning learning rates. This is because only a small fraction of parameters are being updated, so each update needs to have a larger effect. A learning rate of 2e-4 for LoRA corresponds roughly to 2e-5 for full fine-tuning in terms of per-step model change.

8. The PEFT Library Ecosystem

The Hugging Face peft library provides a unified interface for all parameter-efficient methods. Beyond basic LoRA, it supports loading adapters from the Hub, combining adapters, and quantized training workflows. Code Fragment 15.1.3 shows this approach in practice.

# Load a LoRA adapter from the Hugging Face Hub, inspect its config, and push it back
# Only the small adapter weights are stored in the adapter repo
import torch
from peft import (
 PeftModel,
 PeftConfig,
 get_peft_model,
 LoraConfig,
 TaskType,
 AutoPeftModelForCausalLM,
)

# Load a LoRA adapter from Hugging Face Hub
model = AutoPeftModelForCausalLM.from_pretrained(
 "username/my-lora-adapter", # Adapter repo on HF Hub
 device_map="auto",
 torch_dtype=torch.bfloat16,
)

# Inspect adapter configuration
config = PeftConfig.from_pretrained("username/my-lora-adapter")
print(f"Base model: {config.base_model_name_or_path}")
print(f"Rank: {config.r}, Alpha: {config.lora_alpha}")
print(f"Target modules: {config.target_modules}")

# Push adapter to Hub (only saves the small adapter weights)
model.push_to_hub("username/my-lora-adapter")
# Adapter size: tens to a few hundred MB depending on rank and target modules, vs 14+ GB for a full model

Base model: meta-llama/Meta-Llama-3-8B
Rank: 16, Alpha: 32
Target modules: ['q_proj', 'v_proj']

Code Fragment 15.1.3: Load a LoRA adapter from the Hugging Face Hub, inspect its configuration, and push it back

QLoRA: 4-Bit Quantized LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with 4-bit NF4 quantization of the base model, shrinking the frozen weights by roughly 4x compared to standard LoRA in FP16. The bitsandbytes library handles the quantization transparently through BitsAndBytesConfig. The frozen base weights are stored in 4-bit precision, while the LoRA adapter matrices train in BF16. This enables fine-tuning a 70B model on a single 48 GB GPU. Code Fragment 15.1.4 shows the setup.

# pip install bitsandbytes peft transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit NF4 quantization config
bnb_config = BitsAndBytesConfig(
 load_in_4bit=True,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_compute_dtype=torch.bfloat16,
 bnb_4bit_use_double_quant=True, # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
 "meta-llama/Llama-3.1-8B-Instruct",
 quantization_config=bnb_config,
 device_map="auto",
)
model = prepare_model_for_kbit_training(model) # freeze base weights and prepare the model for stable k-bit training

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
 target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
 "gate_proj", "up_proj", "down_proj"],
 bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
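# Prints trainable vs. total parameter counts. With all seven projections at r=16 on this 8B model,
# expect roughly 41.9M trainable parameters; the reported total depends on how the installed
# PEFT version counts packed 4-bit weights.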

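Sample rank comparison output (trainable parameters, total parameters, and percent trainable per rank):

rank  trainable        total      pct
   4    524,288  135,000,320  0.3884%
   8  1,048,576  135,524,608  0.7737%
  16  2,097,152  136,573,184  1.5354%
  32  4,194,304  138,670,336  3.0248%
  64  8,388,608  142,864,640  5.8716%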

Code Fragment 15.1.4: QLoRA setup with 4-bit NF4 quantization (bitsandbytes) and LoRA adapters (PEFT)
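
To see why a 70B model fits on a 48 GB card, it helps to estimate the storage cost of the quantized weights. The sketch below assumes the block sizes described in the QLoRA paper (one quantization constant per 64 weights, with those constants re-quantized to 8 bits in groups of 256); the function names are illustrative, and the estimate covers only the frozen weights, not adapters, activations, or optimizer states.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, dq_block_size=256):
    """Approximate storage cost per weight for NF4, including quantization constants."""
    bits = 4.0
    if double_quant:
        # 8-bit constant per block, plus a 32-bit constant per group of blocks.
        bits += 8 / block_size + 32 / (block_size * dq_block_size)
    else:
        # One 32-bit constant per block.
        bits += 32 / block_size
    return bits


def quantized_weight_gb(n_params, bits_per_param):
    """Weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


plain = nf4_bits_per_param(double_quant=False)
doubled = nf4_bits_per_param(double_quant=True)
print(f"NF4 without double quantization: {plain:.3f} bits/param")
print(f"NF4 with double quantization:    {doubled:.3f} bits/param")
print(f"saved by double quantization:    {plain - doubled:.3f} bits/param")

for n in (7e9, 13e9, 70e9):
    print(f"{n / 1e9:>4.0f}B weights: {quantized_weight_gb(n, doubled):5.1f} GB")
```

The gap between this weight-only figure and the card's capacity is what remains for adapters, optimizer states, and activations.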

Self-Check

Q1: In the LoRA decomposition W' = W + BA, what are the dimensions of matrices B and A for a weight matrix of size d×k with rank r?

Show Answer

B is d×r and A is r×k. The total trainable parameters are r×(d+k), which is much smaller than the d×k parameters in the original matrix when r is small. For example, with d=k=4096 and r=16, this is 131,072 versus 16,777,216 (a 128x reduction).

Q2: Why is matrix B initialized to zeros and A initialized with random values, rather than the reverse?

Show Answer

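Initializing B to zeros ensures that BA = 0 at the start of training, so the model begins as an exact copy of the pretrained base. This provides a stable starting point. The choice of B=0 (rather than A=0) is a convention; either would work. The key requirement is that the product BA starts at zero so training begins from the pretrained behavior. Both matrices cannot start at zero, however: the gradient of each one depends on the other, so if A and B were both zero neither would receive a gradient and training would never leave the starting point.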

Q3: What is the role of the alpha/r scaling factor in LoRA, and how should you adjust alpha when changing rank?

Show Answer

The ratio alpha/r acts as a learning rate multiplier for the LoRA update. When you double r, you should generally double alpha to maintain the same effective scaling. A common starting point is alpha = 2r. If the model is not adapting enough, increase alpha; if training is unstable, decrease it.

Q4: What three innovations does QLoRA combine, and why is each one necessary?

Show Answer

QLoRA combines: (1) NF4 quantization, which reduces base model memory by 4x with minimal quality loss using a data type optimized for normally distributed weights; (2) double quantization, which compresses the quantization constants themselves, saving an additional ~0.4 bits per parameter; (3) paged optimizers, which use unified memory to gracefully handle GPU memory spikes during training without OOM errors.

Q5: When should you keep LoRA adapters separate versus merging them into the base model?

Show Answer

Keep adapters separate when you need to serve multiple tasks from a single base model (hot-swapping adapters per request), when storage is a concern (adapters are ~50MB vs. 14+ GB for a full model), or when you plan to update adapters independently. Merge when you need maximum inference speed (zero adapter overhead), when using inference engines that do not support adapter loading, or when deploying a single dedicated model for one task.

Tip: Start with LoRA Rank 8 or 16

For most fine-tuning tasks, LoRA rank 8 produces strong results with minimal trainable parameters. Increase to 16 or 32 only if you see underfitting. Higher ranks add parameters linearly (doubling r doubles the adapter size) but rarely improve quality beyond rank 32.

Key Takeaways

LoRA decomposes weight updates into two small matrices (W' = W + BA), reducing trainable parameters by 100x or more while matching full fine-tuning quality on most tasks.
Rank (r) controls the capacity of the adaptation. Start with r=16 for most tasks; increase to 32 or 64 only for complex reasoning tasks with sufficient data.
Alpha scaling (α/r) acts as a learning rate multiplier. Set α = 2r as a default, and adjust based on training stability and downstream performance.
Target all linear layers (attention + MLP) for best quality. Targeting only attention layers is faster but may sacrifice 1-3% accuracy on complex tasks.
QLoRA enables 70B fine-tuning on a single GPU by combining NF4 quantization, double quantization, and paged optimizers, at the cost of ~30% slower training.
Adapter merging converts the LoRA model into a standard model format. Merge in high precision (FP16/BF16), then re-quantize for serving if needed.
LoRA learning rates are 5-10x higher than full fine-tuning rates because fewer parameters share the gradient signal.

Lab: Fine-Tune a Model with LoRA and Compare Rank Settings

Duration: ~60 minutes
Level: Intermediate

Objective

Apply LoRA adapters to a language model using PEFT, train on a small dataset, and compare how different rank values (r=4, r=16, r=64) affect trainable parameter count, memory usage, and output quality.

What You'll Practice

Configuring LoRA with the PEFT library (target modules, rank, alpha)
Inspecting trainable vs. frozen parameter counts
Training LoRA adapters with SFTTrainer
Saving, reloading, and merging adapters into the base model
Comparing rank settings on quality and efficiency metrics

Setup

The following cell installs the required packages and configures the environment for this lab.

pip install transformers trl peft datasets accelerate torch bitsandbytes

Code Fragment 15.1.5: This command installs the required packages for LoRA fine-tuning: transformers, trl, peft, datasets, accelerate, torch, and bitsandbytes. The bitsandbytes library enables optional 4-bit quantization for QLoRA experiments.

Steps

Step 1: Load the base model and apply LoRA

Load a small model and wrap it with a LoRA configuration. Inspect the parameter counts to see how few parameters are actually trainable.

# Load a small model and wrap with LoRA: only the low-rank adapter
# matrices are trainable; all original weights stay frozen.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
 tokenizer.pad_token = tokenizer.eos_token

base_model = AutoModelForCausalLM.from_pretrained(
 model_name, torch_dtype=torch.float16, device_map="auto"
)

# TODO: Create a LoraConfig with rank=16, alpha=32
lora_config = LoraConfig(
 r=16,
 lora_alpha=32,
 target_modules=["q_proj", "v_proj"],
 task_type=TaskType.CAUSAL_LM,
 lora_dropout=0.05,
 bias="none"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

Code Fragment 15.1.6: This lab step loads SmolLM2-135M-Instruct with AutoModelForCausalLM, creates a LoraConfig targeting q_proj and v_proj layers with rank 16 and alpha 32, and calls get_peft_model to wrap the base model. The print_trainable_parameters() call reveals that fewer than 0.1% of parameters are trainable.

Hint

The print_trainable_parameters() method shows something like "trainable params: 131,072 || all params: 135,000,000 || trainable%: 0.097". Only the LoRA matrices are trainable; everything else is frozen.

Step 2: Compare parameter counts across ranks

Create LoRA configs with different ranks and compare the trainable parameter counts.

# Compare trainable parameter counts across LoRA ranks (4 to 64).
# Shows how rank controls the expressiveness vs. efficiency tradeoff.
import pandas as pd

ranks = [4, 8, 16, 32, 64]
results = []

for r in ranks:
    base = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")
    config = LoraConfig(r=r, lora_alpha=r * 2,
                        target_modules=["q_proj", "v_proj"],
                        task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
    pm = get_peft_model(base, config)

    # TODO: Extract trainable and total parameter counts
    trainable = sum(p.numel() for p in pm.parameters() if p.requires_grad)
    total = sum(p.numel() for p in pm.parameters())
    results.append({"rank": r, "trainable": trainable,
                    "total": total, "pct": f"{trainable/total*100:.4f}%"})
    del pm, base
    torch.cuda.empty_cache()

df = pd.DataFrame(results)
print(df.to_string(index=False))

Code Fragment 15.1.7: Compare trainable parameter counts across LoRA ranks (4 to 64).

Hint

Notice how trainable parameters scale linearly with rank. Doubling the rank roughly doubles the trainable parameter count. The common heuristic alpha = 2 * rank keeps the effective learning rate stable across rank settings.

Step 3: Train a LoRA adapter

Fine-tune the LoRA adapter (r=16) on a small dataset.

# Train the LoRA adapter (r=16) on 300 instruction examples.
# Only adapter weights update; base model gradients are skipped.
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/no_robots", split="train")
dataset = dataset.shuffle(seed=42).select(range(300))

def format_chat(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False)}

formatted = dataset.map(format_chat)

# TODO: Configure training (use higher LR than full fine-tuning)
training_args = SFTConfig(
 output_dir="./lora-smollm2-r16",
 num_train_epochs=3,
 per_device_train_batch_size=4,
 gradient_accumulation_steps=2,
 learning_rate=2e-4, # 10x higher than full FT
 logging_steps=10,
 max_seq_length=512, # newer TRL releases name this argument max_length
 fp16=True,
 warmup_ratio=0.1,
 save_strategy="epoch",
 report_to="none",
)

trainer = SFTTrainer(model=model, args=training_args,
 train_dataset=formatted, processing_class=tokenizer)
result = trainer.train()
print(f"Training loss: {result.training_loss:.4f}")

Training loss: 1.5237

Code Fragment 15.1.8: Train the LoRA adapter (r=16) on 300 instruction examples.

Hint

LoRA typically uses 5x to 10x higher learning rate than full fine-tuning because only the small adapter matrices receive gradients. A learning rate of 2e-4 is a good starting point.

Step 4: Save and merge the LoRA adapter

Save the adapter weights (a few hundred KB here, compared with roughly 270 MB for the full model), then reload and merge them into the base model.

# Save the tiny adapter checkpoint, then merge it back into the
# base model for deployment without the PEFT dependency.
import os
from peft import PeftModel

# Save adapter only
model.save_pretrained("./lora-adapter-r16")
adapter_size = sum(
 os.path.getsize(os.path.join("./lora-adapter-r16", f))
 for f in os.listdir("./lora-adapter-r16")
 if f.endswith(('.safetensors', '.bin')))
print(f"Adapter size: {adapter_size / 1024:.1f} KB")

# Reload base model and merge adapter
base_fresh = AutoModelForCausalLM.from_pretrained(
 model_name, torch_dtype=torch.float16, device_map="auto")
peft_model = PeftModel.from_pretrained(base_fresh, "./lora-adapter-r16")
merged_model = peft_model.merge_and_unload()
print("Adapter merged successfully!")

Adapter size: 524.0 KB
Adapter merged successfully!

Code Fragment 15.1.9: Save the tiny adapter checkpoint, then merge it back into the base model for deployment without the PEFT dependency.

Hint

The merge_and_unload() method folds the LoRA weights into the base model, producing a standard model that can be used without the PEFT library at inference time.

Step 5: Test the merged model

Generate responses and compare with the base model.

# Compare merged model outputs against base model on test prompts.
# Look for improved instruction-following after LoRA fine-tuning.
test_prompts = [
 "Explain photosynthesis to a 10-year-old.",
 "Write a short email declining a meeting politely.",
 "List 3 advantages of renewable energy.",
]

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    fmt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(fmt, return_tensors="pt").to(merged_model.device)
    # Generation needs no gradients, so disable autograd to save memory
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    print(f"Prompt: {prompt}\nResponse: {response[:300]}\n{'-'*50}")

Prompt: Explain photosynthesis to a 10-year-old.
Response: Photosynthesis is how plants make their own food using sunlight! Think of it like a recipe: plants take water from the soil and carbon dioxide from the air, then use sunlight as the energy to turn them into sugar (their food) and oxygen...
--------------------------------------------------
Prompt: Write a short email declining a meeting politely.
Response: Subject: RE: Meeting Request
Hi [Name],
Thank you for the invitation. Unfortunately, I have a scheduling conflict and will not be able to attend. I would be happy to review any notes or follow up afterward.
Best regards
--------------------------------------------------
...

Code Fragment 15.1.10: Compare merged model outputs against base model on test prompts.

Hint

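To get a fair comparison, reload the original base model and generate with the same settings. Because sampling is enabled, fix the random seed or set do_sample=False so differences come from the adapter and not from sampling noise. The LoRA-tuned model should show improved instruction following despite training only a small fraction of the parameters.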

Expected Output

A parameter table showing the trainable share growing roughly linearly with rank (the sample output under Complete Solution shows about 0.4% at r=4, 1.5% at r=16, and 5.9% at r=64)
A saved adapter that is only a few hundred KB (vs. ~270MB full model)
Merged model outputs showing improved instruction following

Stretch Goals

Train adapters at r=4 and r=64, compare outputs, and measure which rank is best for this dataset size
Add QLoRA by loading in 4-bit (BitsAndBytesConfig(load_in_4bit=True)) and compare memory
Target all linear layers (target_modules="all-linear") and compare results

Complete Solution

# Complete LoRA lab solution: load model, apply adapter, train,
# save/merge, and compare outputs before and after fine-tuning.
import torch
from datasets import load_dataset
from peft import LoraConfig, PeftModel, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Step 1: Apply LoRA
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()

# Step 2: Compare ranks (alpha is kept at 2x the rank throughout)
for r in [4, 8, 16, 32, 64]:
    b = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto")
    c = LoraConfig(
        r=r, lora_alpha=r * 2, target_modules=["q_proj", "v_proj"],
        task_type=TaskType.CAUSAL_LM, lora_dropout=0.05, bias="none")
    pm = get_peft_model(b, c)
    t = sum(p.numel() for p in pm.parameters() if p.requires_grad)
    a = sum(p.numel() for p in pm.parameters())
    print(f"r={r}: {t:,} trainable / {a:,} total = {t/a*100:.4f}%")
    # Free the temporary model before loading the next one.
    del pm, b
    torch.cuda.empty_cache()

# Step 3: Train
ds = load_dataset("HuggingFaceH4/no_robots", split="train")
ds = ds.shuffle(seed=42).select(range(300))

def fmt(ex):
    # Render each conversation with the model's chat template.
    return {"text": tokenizer.apply_chat_template(
        ex["messages"], tokenize=False, add_generation_prompt=False)}

formatted = ds.map(fmt)

# Note: newer TRL releases rename max_seq_length to max_length.
args = SFTConfig(
    output_dir="./lora-smollm2-r16", num_train_epochs=3,
    per_device_train_batch_size=4, gradient_accumulation_steps=2,
    learning_rate=2e-4, logging_steps=10, max_seq_length=512,
    fp16=True, warmup_ratio=0.1, save_strategy="epoch", report_to="none")
trainer = SFTTrainer(
    model=model, args=args, train_dataset=formatted,
    processing_class=tokenizer)
trainer.train()

# Step 4: Save and merge
model.save_pretrained("./lora-adapter-r16")
base_fresh = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto")
merged = PeftModel.from_pretrained(
    base_fresh, "./lora-adapter-r16").merge_and_unload()

# Step 5: Test
test_prompts = [
    "Explain photosynthesis to a 10-year-old.",
    "Write a short email declining a meeting.",
    "List 3 advantages of renewable energy.",
]
for p in test_prompts:
    msgs = [{"role": "user", "content": p}]
    f = tokenizer.apply_chat_template(
        msgs, tokenize=False, add_generation_prompt=True)
    i = tokenizer(f, return_tensors="pt").to(merged.device)
    with torch.no_grad():
        o = merged.generate(
            **i, max_new_tokens=200, temperature=0.7, do_sample=True)
    # Decode only the newly generated tokens, not the prompt.
    answer = tokenizer.decode(
        o[0][i["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"Q: {p}\nA: {answer[:300]}\n")

trainable params: 2,097,152 || all params: 136,573,184 || trainable%: 1.5354
r=4: 524,288 trainable / 135,000,320 total = 0.3884%
r=8: 1,048,576 trainable / 135,524,608 total = 0.7737%
r=16: 2,097,152 trainable / 136,573,184 total = 1.5354%
r=32: 4,194,304 trainable / 138,670,336 total = 3.0248%
r=64: 8,388,608 trainable / 142,864,640 total = 5.8716%
Q: Explain photosynthesis to a 10-year-old.
A: Photosynthesis is how plants make their own food using sunlight...
Q: Write a short email declining a meeting.
A: Subject: RE: Meeting Request...
Q: List 3 advantages of renewable energy.
A: 1. Renewable energy sources do not run out...

Code Fragment 15.1.11: Complete LoRA lab solution: load model, apply adapter, train, save/merge, and compare outputs before and after fine-tuning.

Research Frontier

LoRA variants continue to proliferate: DoRA decomposes weight updates into magnitude and direction components, while rsLoRA scales the update by alpha/sqrt(r) instead of alpha/r for more stable training at higher ranks. Research on LoRA composition explores stacking, merging, and routing among multiple LoRA adapters at inference time, enabling modular skill composition without retraining. An open theoretical question is why low-rank adaptation so often comes close to full fine-tuning despite the severe parameter reduction, and whether the optimal rank can be predicted from task properties.

Recent work on LoRA Soups (2024) suggests that combining multiple LoRA adapters trained on different data can outperform individual adapters, paralleling the model soup findings in full fine-tuning.
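The rank-dependent scaling mentioned for rsLoRA can be made concrete with a few lines of arithmetic. The sketch below assumes the two scaling rules are alpha/r (standard LoRA) and alpha/sqrt(r) (rsLoRA); the function names and the choice of alpha=16 are illustrative.

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA multiplier applied to the low-rank update BA."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized multiplier: shrinks more slowly as the rank grows."""
    return alpha / math.sqrt(r)

alpha = 16  # held fixed while the rank varies (illustrative value)
for r in (4, 16, 64, 256):
    print(f"r={r:>3}  LoRA: {lora_scale(alpha, r):.4f}  "
          f"rsLoRA: {rslora_scale(alpha, r):.4f}")
```

With alpha held fixed, the standard multiplier falls by 64x between r=4 and r=256, while the rank-stabilized one falls by only 8x. In the PEFT library this behavior is selected with the use_rslora option of LoraConfig.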

Exercises

Exercise 15.1.1: LoRA mechanism Conceptual

Explain how LoRA works: what does it mean to add low-rank matrices A and B to a frozen weight matrix W? Why does this reduce memory and compute requirements?

Answer Sketch

Instead of updating W (d x d, millions of parameters), LoRA freezes W and learns two small matrices: B (d x r) and A (r x d), where r << d (typically 8 to 64). The effective weight becomes W' = W + BA (scaled by alpha/r in practice). This reduces trainable parameters from d*d to 2*d*r. For a 4096x4096 matrix with r=16, parameters drop from 16.7M to 131K (128x reduction). Memory savings come from not storing gradients or optimizer states (momentum, variance) for the frozen W.
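The arithmetic in this answer can be checked directly. The following sketch counts parameters for a single weight matrix; the function names are illustrative, and the 4096x4096 size is the example used above.

```python
def full_params(d_in, d_out):
    """Parameters in a dense weight matrix W."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """Parameters in the low-rank pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

d = 4096
dense = full_params(d, d)
for r in (8, 16, 64):
    low_rank = lora_params(d, d, r)
    print(f"r={r:>2}: {low_rank:>9,} LoRA params vs {dense:,} dense "
          f"({dense // low_rank}x fewer)")
```

For r=16 this reports 131,072 LoRA parameters against 16,777,216 dense ones, a 128x reduction. Because the count is linear in r, doubling the rank doubles the adapter size.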

Exercise 15.1.2: Rank and alpha selection Analysis

You are fine-tuning a 7B model with LoRA. Compare the expected outcomes of rank=8/alpha=16 versus rank=64/alpha=128. How does the rank affect model capacity, training speed, and risk of overfitting?

Answer Sketch

Rank 8: roughly 0.1% trainable parameters (the exact share depends on which modules are targeted), fastest training, lowest memory, but limited capacity for complex adaptations. Best for simple style transfer or single-task tuning. Rank 64: 8x as many adapter parameters (~0.8%), slower but still much faster than full fine-tuning, with more capacity for complex multi-task adaptation and a higher risk of overfitting on small datasets. The alpha/rank ratio (2.0 in both cases) is the scaling factor applied to the LoRA update and acts like an effective learning-rate multiplier; keeping it constant keeps the learning dynamics similar despite the different ranks.

Exercise 15.1.3: QLoRA setup Coding

Write the Python code to load a model in 4-bit quantization using bitsandbytes and attach LoRA adapters using the PEFT library. Include the BitsAndBytesConfig and LoraConfig.

Answer Sketch

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type='nf4', bnb_4bit_compute_dtype=torch.bfloat16). Load: model = AutoModelForCausalLM.from_pretrained('meta-llama/Meta-Llama-3-8B', quantization_config=bnb_config). LoRA: lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=['q_proj','v_proj','k_proj','o_proj'], lora_dropout=0.05). Apply: model = get_peft_model(model, lora_config). This enables fine-tuning a model in the 7B to 8B range on a single 16GB GPU, given modest batch sizes and sequence lengths.
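A rough back-of-the-envelope estimate shows why 4-bit loading makes this fit. The sketch below counts only the memory for the frozen weights; it deliberately ignores quantization constants, activations, the adapter itself, and optimizer states, so treat the numbers as lower bounds. The function name and the parameter counts are illustrative.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative model sizes; real checkpoints differ slightly.
for name, n_params in [("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    fp16 = weight_memory_gb(n_params, 16)
    int4 = weight_memory_gb(n_params, 4)
    print(f"{name}: fp16 weights ~{fp16:.1f} GB, 4-bit weights ~{int4:.1f} GB")
```

Under these assumptions a 7B model drops from about 14 GB of weights to about 3.5 GB, leaving room on a 16GB card for activations and the adapter's gradients and optimizer states. The same 4x reduction brings a 70B model to about 35 GB, which is what makes a single 48GB GPU plausible.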

Exercise 15.1.4: Target module selection Conceptual

Which transformer modules should you apply LoRA to? Compare applying LoRA to only attention projections (q, v) versus all linear layers (q, k, v, o, gate, up, down). What are the tradeoffs?

Answer Sketch

Attention only (q, v): the configuration used in most of the original LoRA paper's experiments. Minimal parameters, fast training, works well for many tasks. All linear layers: several times more adapter parameters (about 4.8x per layer for Llama-2-7B dimensions) but better adaptation capacity. Research shows that including the MLP layers (gate, up, down) improves performance on complex tasks and domain adaptation. Tradeoff: more target modules means more memory and slower training, but better quality. Start with q, v; expand to all linear layers if quality is insufficient.
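The size of this tradeoff can be estimated by counting adapter parameters per transformer layer. The sketch below assumes Llama-2-7B-style dimensions (hidden size 4096, MLP intermediate size 11008, no grouped-query attention); the SHAPES table and function name are illustrative, and other architectures will give different multiples.

```python
# Assumed layer shapes as (d_in, d_out), modeled on Llama-2-7B dimensions.
HIDDEN, INTERMEDIATE = 4096, 11008
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params_per_layer(targets, r):
    """Sum r * (d_in + d_out) over the targeted modules of one layer."""
    return sum(r * (d_in + d_out)
               for name, (d_in, d_out) in SHAPES.items() if name in targets)

r = 16
attn_only = lora_params_per_layer({"q_proj", "v_proj"}, r)
all_linear = lora_params_per_layer(set(SHAPES), r)
ratio = all_linear / attn_only
print(f"q,v only:   {attn_only:,} params per layer")
print(f"all linear: {all_linear:,} params per layer ({ratio:.2f}x)")
```

The ratio is independent of the rank, since r multiplies every term. Models with grouped-query attention have smaller k and v projections, which changes the multiple.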

Exercise 15.1.5: Adapter merging and serving Coding

Write code that merges a LoRA adapter back into the base model weights for deployment. Explain when you would serve with merged weights versus keeping the adapter separate.

Answer Sketch

Merge: merged_model = model.merge_and_unload(). Save: merged_model.save_pretrained('merged_model/'). Merge for deployment when: (1) you have a single adapter and want maximum inference speed (no adapter overhead), (2) you are using a serving framework that does not support adapters. Keep separate when: (1) you serve multiple adapters from the same base model (multi-tenant), (2) you want to A/B test adapters, (3) you need to quickly swap or update adapters without redeploying the base model.
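The "keep separate" case can be pictured as one shared base weight plus a lookup table of small adapters chosen per request. The toy sketch below uses plain Python lists; the adapter names, the ADAPTERS table, and the forward function are hypothetical and do not reflect how any particular serving framework implements this.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# One frozen base weight shared by every request.
BASE_W = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical per-tenant adapters: A is r x d_in, B is d_out x r.
ADAPTERS = {
    "support": {"A": [[1.0, 1.0]], "B": [[0.5], [0.0]], "alpha": 2, "r": 1},
    "legal":   {"A": [[0.0, 2.0]], "B": [[0.0], [1.0]], "alpha": 2, "r": 1},
}

def forward(x, adapter_name=None):
    """Base output, plus the selected adapter's scaled low-rank correction."""
    out = matvec(BASE_W, x)
    if adapter_name is None:
        return out
    ad = ADAPTERS[adapter_name]
    scale = ad["alpha"] / ad["r"]
    delta = matvec(ad["B"], matvec(ad["A"], x))
    return [o + scale * d for o, d in zip(out, delta)]

x = [1.0, 2.0]
print("base:   ", forward(x))
print("support:", forward(x, "support"))
print("legal:  ", forward(x, "legal"))
```

Swapping adapters changes only which small matrices are applied; the base weight is never copied or modified. Merging would instead bake one adapter into BASE_W, removing the extra matrix products at the cost of this flexibility.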

What Comes Next

In the next section, Section 15.2: Advanced PEFT Methods, we survey advanced PEFT methods beyond LoRA, including DoRA, LoRA+, prefix tuning, adapters, and prompt tuning.

Bibliography

Foundational Papers

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

The original LoRA paper that introduced low-rank decomposition for weight updates, enabling fine-tuning with a fraction of the parameters. Essential reading for anyone implementing parameter-efficient training, as nearly every modern PEFT workflow builds on these ideas.

PEFT Core

Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized Language Models. NeurIPS 2023.

Combines 4-bit quantization with LoRA adapters, making it possible to fine-tune 65B parameter models on a single 48GB GPU. Practitioners working with limited hardware should read this to understand the quantization and paging tricks that reduce memory by 4x or more.

Quantization + PEFT

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., & Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. ICML 2019.

Introduced adapter modules inserted between transformer layers, establishing the conceptual foundation for parameter-efficient transfer. Researchers interested in the historical arc from full fine-tuning to modern PEFT methods should start here.

Adapter Origins

Theory & Surveys

Aghajanyan, A., Zettlemoyer, L., & Gupta, S. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. ACL 2021.

Provides the theoretical grounding for why low-rank adaptation works: pre-trained models have low intrinsic dimensionality. Anyone wanting to understand the mathematical justification behind LoRA's success should consult this analysis.

Theory

Lialin, V., Deshpande, V., & Rumshisky, A. (2023). Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning.

A comprehensive survey covering the full taxonomy of PEFT methods, from adapters to prompt tuning to reparameterization approaches. Ideal for practitioners comparing LoRA against alternatives and looking for guidance on which method to select for their use case.

Survey

Ding, N., Qin, Y., Yang, G., Wei, F., Yang, Z., Su, Y., Hu, S., Chen, Y., Chan, C.-M., Chen, W., et al. (2023). Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models. Nature Machine Intelligence, 5, 220-235.

A systematic review published in Nature Machine Intelligence that organizes PEFT methods into a clear taxonomy and benchmarks their performance across tasks. Recommended for researchers seeking a rigorous comparison framework with experimental results on large-scale models.

Comprehensive Review