Part 4: Training and Adapting
Chapter 15: Parameter-Efficient Fine-Tuning

Advanced PEFT Methods

"One adapter is good. Six adapters, each specialized and hot-swappable, is a production strategy."

— LoRA, Hot-Swapping AI Agent
Big Picture

LoRA dominates the PEFT landscape, but it is not the only option. Researchers have developed numerous alternatives that offer different tradeoffs in parameter count, training speed, inference overhead, and task specialization. DoRA improves LoRA by decomposing weights into magnitude and direction components. LoRA+ uses different learning rates for the A and B matrices. Prefix Tuning and Prompt Tuning prepend learnable tokens rather than modifying weights (drawing on the prompt engineering intuition of steering via input context). IA3 achieves extreme parameter efficiency by learning only rescaling vectors. Understanding these alternatives helps you select the right tool for each scenario, particularly when operating under tight memory, latency, or multi-tenant serving constraints. The techniques from Section 09.2 combine naturally with these lightweight adaptation approaches. The LoRA foundations from Section 15.1 provide the baseline that these methods extend or replace.

Prerequisites

This section extends the LoRA and QLoRA foundations from Section 15.1, so make sure you understand low-rank decomposition (W' = W + BA) and the role of rank, alpha, and target module selection. Familiarity with the transformer attention mechanism from Section 04.1 is important for understanding how prefix tuning and adapter methods modify the forward pass. The prompt engineering concepts from Section 11.1 provide useful context for prompt tuning, which bridges the gap between manual prompting and learned adaptation.

A model wearing various adapter accessories like hats, scarves, and glasses representing different PEFT methods
Figure 15.2.1: PEFT methods are like accessories for your base model. Mix and match adapters until you find the look that works for your task.
Common Mistake: Switching PEFT Methods Before Tuning LoRA Hyperparameters

Teams often jump from LoRA to DoRA, IA3, or prefix tuning hoping for a quick accuracy boost, without first optimizing LoRA's hyperparameters (rank, alpha, target modules, learning rate). In practice, a well-tuned LoRA configuration outperforms a poorly tuned DoRA or adapter setup. Before switching methods, make sure you have tried: (1) increasing rank to 32 or 64, (2) adjusting alpha relative to rank, (3) targeting all linear layers (not just attention), and (4) tuning the learning rate. If a properly tuned LoRA still falls short, then explore alternatives like DoRA or full fine-tuning.
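As a hedged starting point, steps (1)-(3) of that checklist correspond to a LoraConfig like the following (the values are illustrative, not prescriptive):

```python
from peft import LoraConfig

tuned_baseline = LoraConfig(
    r=32,                         # (1) higher rank before switching methods
    lora_alpha=64,                # (2) alpha adjusted relative to rank
    target_modules="all-linear",  # (3) all linear layers, not just attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# (4) sweep the learning rate separately in your trainer
```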

1. DoRA: Weight-Decomposed Low-Rank Adaptation

DoRA (2024) improves on LoRA by decomposing the pretrained weight into its magnitude and direction components before applying the low-rank update. Specifically, each column w of the weight matrix is decomposed as w = m · (w / ||w||), where m is a learnable magnitude (one value per column) and the direction component is updated via standard LoRA. This decomposition aligns more closely with how full fine-tuning actually modifies weights, resulting in better performance for the same rank.
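The decomposition is easy to verify numerically. The sketch below (dimensions and initialization are illustrative; in real DoRA, m and the LoRA factors A and B are the trainable quantities) reproduces W exactly at initialization, because the LoRA update BA starts at zero:

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 8, 8, 2
W = torch.randn(d_out, d_in)   # frozen pretrained weight
B = torch.zeros(d_out, r)      # LoRA-style init: B starts at zero
A = torch.randn(r, d_in) * 0.01

m = W.norm(dim=0, keepdim=True)       # magnitude: one value per column
V = W + B @ A                         # direction carries the low-rank update
W_dora = m * V / V.norm(dim=0, keepdim=True)

# With B = 0 at initialization, the decomposition reproduces W exactly
assert torch.allclose(W_dora, W, atol=1e-5)
```

During training, B becomes nonzero and rotates each column's direction while m rescales it independently, which is the extra degree of freedom standard LoRA lacks.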

Tip

If you are already using LoRA and want a quick accuracy boost, try switching to DoRA before increasing rank. In most benchmarks, DoRA at rank 16 outperforms LoRA at rank 32, while using fewer trainable parameters. The PEFT library supports DoRA as a drop-in replacement: just set use_dora=True in your LoraConfig.

In practice, DoRA consistently outperforms LoRA by 1-3% across benchmarks when using the same rank and target modules, with only a marginal increase in trainable parameters (the additional magnitude vectors are tiny). Training speed is nearly identical to LoRA. For deployment, the considerations from Section 09.1 apply equally to DoRA-adapted models. Figure 15.2.2 compares the LoRA and DoRA weight update mechanisms.

Mental Model: The Wardrobe of Accessories

Think of advanced PEFT methods as a wardrobe of accessories for the same outfit. DoRA separates direction from magnitude, like choosing both which way to point and how far to reach. Prompt tuning adds learned tokens to the input, like pinning a badge onto a jacket that changes how others perceive you. Each accessory targets a different aspect of the model's behavior, and you can mix and match them depending on the task at hand.

Diagram: Standard LoRA computes W' = W + BA, updating the weight directly in the combined space. DoRA computes W' = m · (V + BA) / ||V + BA||, separating magnitude from direction for better learning.
Figure 15.2.2: DoRA separates weight magnitude (m) from direction, applying LoRA only to the directional component.

The following implementation (Code Fragment 15.2.1) shows how to enable DoRA with a single configuration flag.

# Configure DoRA: LoRA with weight decomposition into magnitude and direction
# Enabled with a single flag on top of a standard LoraConfig
from peft import LoraConfig, get_peft_model

# DoRA configuration: enable the use_dora flag
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    use_dora=True,  # Enable DoRA
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, dora_config)
model.print_trainable_parameters()
# Slightly more params than LoRA due to magnitude vectors,
# but typically ~1-3% better accuracy at same rank.
Code Fragment 15.2.1: DoRA configuration enabled with the use_dora flag on a standard LoraConfig

3. Prefix Tuning

Code Fragment 15.2.2 shows how Prefix Tuning prepends learnable vectors to each attention layer.

# Prefix Tuning: prepend learnable key-value vectors to each attention layer
# The model attends to these "virtual tokens" alongside real input tokens
from peft import PrefixTuningConfig, get_peft_model, TaskType

# Prefix Tuning configuration
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,     # Number of prefix tokens
    prefix_projection=True,    # Use MLP to project prefix (more stable)
    encoder_hidden_size=1024,  # Hidden size of projection MLP
)

# Wrap base model with trainable prefix vectors
model = get_peft_model(base_model, prefix_config)
model.print_trainable_parameters()
# Typically 0.1-0.5% of total parameters
Code Fragment 15.2.2: Prefix Tuning configuration that prepends learnable virtual tokens to each transformer layer. The prefix_projection flag enables a small MLP that projects the prefix, improving training stability. This approach modifies the model's behavior through attention steering rather than weight modification.
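For comparison, Prompt Tuning learns virtual tokens only at the input embedding layer rather than at every attention layer. A minimal sketch using PEFT's PromptTuningConfig follows; the init text and tokenizer path are illustrative assumptions, not values from this chapter:

```python
from peft import PromptTuningConfig, TaskType

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,                 # learned input-level embeddings only
    prompt_tuning_init="TEXT",             # initialize from a natural-language hint
    prompt_tuning_init_text="Summarize the following document:",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)
# model = get_peft_model(base_model, prompt_config)
```

Because it touches only the input layer, Prompt Tuning trains far fewer parameters than Prefix Tuning, but is correspondingly less expressive.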

4. Adapter Layers

Code Fragment 15.2.3 demonstrates the adapter pattern using the LLaMA-Adapter style.

# Adapter layers: insert small bottleneck modules between transformer layers
# Uses LLaMA-Adapter style via PEFT's AdaptionPromptConfig
from peft import AdaptionPromptConfig, get_peft_model

# Note: For bottleneck adapters, use the adapters library
# from adapters import AutoAdapterModel

# Example with LLaMA-Adapter style (via PEFT)
adapter_config = AdaptionPromptConfig(
    adapter_len=10,     # Length of adapter prompt
    adapter_layers=30,  # Number of layers to add adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, adapter_config)
model.print_trainable_parameters()
Code Fragment 15.2.3: LLaMA-Adapter-style setup via PEFT's AdaptionPromptConfig. Classic bottleneck adapters instead insert small trainable modules between existing transformer layers, adding sequential computation at inference time in exchange for flexible compositional adaptation.

5. IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations

IA3 (introduced in the 2022 paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning") takes parameter efficiency to the extreme. Instead of learning new matrices or inserting new layers, IA3 learns only three rescaling vectors per layer that modulate the keys, values, and intermediate activations in the feedforward layers. The total number of trainable parameters is typically 10x smaller than LoRA.
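A back-of-envelope count makes this concrete. The sketch below assumes Llama-7B-like dimensions (32 layers, hidden size 4096, FFN intermediate size 11008):

```python
# Count IA3's trainable parameters for assumed Llama-7B-like dimensions
n_layers, d_model, d_ffn = 32, 4096, 11008

per_layer = d_model + d_model + d_ffn  # key, value, and FFN rescaling vectors
total = n_layers * per_layer

print(per_layer)             # 19200 parameters per layer
print(total)                 # 614400 parameters in total
print(f"{total / 7e9:.4%}")  # roughly 0.009% of a 7B model
```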

The tradeoff is that IA3's limited capacity makes it best suited for simple adaptation tasks (format changes, style transfer) rather than complex domain adaptation. It excels in few-shot settings where overfitting is a concern. Code Fragment 15.2.4 demonstrates IA3 configuration.

# IA3 configuration: learns only rescaling vectors
# Extreme parameter efficiency at the cost of limited adaptation capacity
from peft import IA3Config, get_peft_model, TaskType

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)

# Wrap base model with IA3 rescaling vectors
model = get_peft_model(base_model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~500K for a 7B model (0.007%)
Code Fragment 15.2.4: IA3 configuration targeting keys, values, and feedforward layers with rescaling vectors. IA3 achieves extreme parameter efficiency (roughly 0.007% of total parameters for a 7B model) at the cost of limited adaptation capacity. This is ideal for simple format or style adaptation tasks where overfitting is a concern.

6. Comprehensive PEFT Method Comparison

| Method | Params (%) | Memory | Inference Overhead | Best For |
| --- | --- | --- | --- | --- |
| LoRA | 0.1-0.5% | Low | Zero (after merge) | General purpose, most tasks |
| QLoRA | 0.1-0.5% | Very Low | Zero (after merge) | Large models on limited GPU |
| DoRA | 0.1-0.5% | Low | Zero (after merge) | When LoRA quality is insufficient |
| LoRA+ | 0.1-0.5% | Low | Zero (after merge) | Faster convergence needed |
| Prefix Tuning | 0.1-0.5% | Low | Small (longer KV cache) | NLU tasks, multi-task serving |
| Prompt Tuning | <0.01% | Very Low | Negligible | Very large models, simple tasks |
| Adapters | 0.5-3% | Medium | Small (sequential) | Compositional multi-task |
| IA3 | <0.01% | Very Low | Negligible | Few-shot, style adaptation |
Warning

Prompt Tuning and IA3 achieve extreme parameter efficiency, but they are significantly less capable than LoRA for complex adaptation tasks. If your task requires learning new knowledge (domain-specific terminology, code patterns, specialized reasoning), LoRA or DoRA with a reasonable rank (16-64) will substantially outperform these lighter methods. Reserve IA3 and Prompt Tuning for scenarios where simplicity or parameter count is the primary constraint.

Why this matters: The proliferation of PEFT methods (DoRA, AdaLoRA, IA3, GaLore) is not just academic variety; each addresses a specific limitation of vanilla LoRA. DoRA handles magnitude-direction decomposition for better convergence. AdaLoRA allocates rank adaptively, putting more parameters where they help most. IA3 achieves extreme parameter efficiency (orders of magnitude fewer parameters than LoRA) at the cost of some task performance. The practical guidance is: start with standard LoRA, and only explore these variants when you have a specific bottleneck (convergence issues, memory constraints, or multi-task serving requirements).

7. Multi-Adapter Serving

One of LoRA's most powerful production features is the ability to serve many adapters from a single base model. This enables multi-tenant deployments where each customer or task gets its own fine-tuned behavior without duplicating the base model weights, a significant advantage for inference optimization. Two main systems support this at scale: LoRAX (from Predibase) and S-LoRA.

7.1 Architecture Overview

Figure 15.2.3 illustrates how a single base model serves multiple LoRA adapters dynamically at request time.

Diagram: a single base model copy in GPU VRAM serves Medical, Legal, Finance, and Code LoRA adapters; a request router routes each request to the correct adapter based on tenant ID or task type and batches requests across adapters for GPU utilization.
Figure 15.2.3: Multi-adapter serving loads one base model and dynamically applies per-request LoRA adapters.

7.2 LoRAX and S-LoRA

LoRAX (Predibase) is a production-grade serving system that can host hundreds of fine-tuned LoRA adapters on a single GPU. It keeps the base model in GPU memory and dynamically loads adapter weights per request. Key features include adapter weight caching, batched inference across different adapters, and automatic adapter management.

S-LoRA (from UC Berkeley) takes a more research-oriented approach, using unified paging to manage adapter memory and custom CUDA kernels for batched LoRA computation. S-LoRA can serve thousands of adapters simultaneously, with adapters stored in a tiered memory system (GPU, CPU, disk) and paged in on demand.
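The tiered-memory idea behind S-LoRA can be sketched with a toy LRU cache. Everything here is an illustrative stand-in: the "GPU" and "CPU" tiers are plain dictionaries, and adapter weights are strings rather than tensors:

```python
from collections import OrderedDict

class AdapterCache:
    """Toy S-LoRA-style tiered cache: hot adapters on the GPU tier, cold on CPU."""

    def __init__(self, gpu_slots: int):
        self.gpu_slots = gpu_slots
        self.gpu = OrderedDict()  # adapter name -> weights, most recent last
        self.cpu = {}

    def fetch(self, name: str):
        if name in self.gpu:
            self.gpu.move_to_end(name)  # mark as most recently used
        else:
            # Page in from the CPU tier (or "disk", simulated by a default)
            weights = self.cpu.pop(name, f"{name}-weights")
            self.gpu[name] = weights
            if len(self.gpu) > self.gpu_slots:
                # Evict the least recently used adapter to the CPU tier
                cold_name, cold_weights = self.gpu.popitem(last=False)
                self.cpu[cold_name] = cold_weights
        return self.gpu[name]

cache = AdapterCache(gpu_slots=2)
cache.fetch("medical")
cache.fetch("legal")
cache.fetch("finance")
print(list(cache.gpu))         # ['legal', 'finance'] -- 'medical' paged out
print("medical" in cache.cpu)  # True
```

Real systems add batched kernels and unified paging on top, but the routing logic per request reduces to exactly this kind of lookup-plus-eviction.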

Key Insight

Multi-adapter serving is one of the strongest arguments for LoRA over other PEFT methods in production. A single A100 GPU can serve a base 7B model with hundreds of LoRA adapters, effectively providing hundreds of specialized models at the cost of one. This is far more economical than deploying separate merged models for each use case.

Real-World Scenario: Multi-Tenant LoRA Serving for Enterprise Customers

Who: Daniela, a platform architect at a B2B content-generation SaaS company.

Situation: The company had 50 enterprise customers, each requiring a specialized writing style (tone, vocabulary, formatting conventions) for their generated marketing copy. The product ran on a 7B-parameter base model.

Problem: Deploying 50 separate fine-tuned models would require 50 GPUs, pushing monthly infrastructure costs above $150,000. The engineering team also dreaded maintaining 50 independent model deployments, each with its own versioning and rollback pipeline.

Decision: Daniela trained 50 lightweight LoRA adapters (each roughly 20 MB) and served them all from a single A100 using LoRAX. Per-request routing attached the correct adapter based on the customer's API key, and new customer onboarding required only a fresh LoRA training run (under two hours on one GPU).

Result: Total GPU cost dropped from 50 instances to one A100, a roughly 98% cost reduction. New adapters slotted into the shared serving infrastructure with zero downtime. Median response latency increased by only 8ms compared to a dedicated model, which was imperceptible to end users.

Lesson: For multi-tenant serving where each customer needs a specialized model, LoRA adapters plus a shared base model eliminate the linear scaling of GPU costs with customer count. The key enabler is adapter hot-swapping at the serving layer.

8. Choosing the Right PEFT Method

With so many PEFT options available, the decision can feel overwhelming. Here is a practical decision framework based on your constraints and requirements.

| Scenario | Recommended Method | Reasoning |
| --- | --- | --- |
| General fine-tuning (default) | LoRA (r=16) | Best quality/efficiency tradeoff, widest ecosystem support |
| Limited GPU memory | QLoRA | 4-bit base model frees VRAM for larger models or batches |
| Need extra quality over LoRA | DoRA | Drop-in upgrade, consistent 1-3% improvement |
| Training speed is critical | LoRA+ | 1.5-2x faster convergence, same final quality |
| Multi-tenant serving (100+ tasks) | LoRA + LoRAX | Hot-swappable adapters from single base |
| Extreme parameter budget (<1K params) | IA3 | Learns only rescaling vectors, minimal overfitting |
| Very large model (100B+), simple task | Prompt Tuning | Ultra-lightweight, scales well with model size |
| NLU classification tasks | Prefix Tuning | Strong at steering attention for classification |
Note

When in doubt, start with LoRA. It has the widest library support, the most documentation, and works well across virtually all tasks and model sizes. Move to specialized methods only when you have a specific constraint (memory, serving architecture, parameter count) that LoRA cannot satisfy.

9. GaLore: Gradient Low-Rank Projection

While LoRA reduces the number of trainable parameters by adding low-rank adapters, GaLore (Gradient Low-Rank Projection) takes a fundamentally different approach: it reduces optimizer memory by projecting gradients into a low-rank subspace. This distinction matters because optimizer states (momentum and variance in Adam) typically consume two to three times the memory of the parameters themselves. GaLore enables full-parameter training of large models on consumer hardware, something that LoRA alone cannot achieve because LoRA still freezes the base model and only updates the adapter.

9.1 How GaLore Works

During training, GaLore periodically computes the SVD of the gradient matrix for each weight layer, retaining only the top-r singular vectors. The optimizer states (Adam's first and second moments) are maintained in this reduced r-dimensional space rather than the full parameter space. Every T steps (typically T=200), the projection matrices are recomputed from the current gradient to track the evolving optimization landscape. The key insight is that gradient matrices during LLM training tend to be approximately low-rank, so very little information is lost by this projection. Code Fragment 15.2.6 provides a conceptual implementation.

# GaLore conceptual implementation
import torch

class GaLoreProjector:
    """Project gradients to a low-rank subspace for memory-efficient training."""

    def __init__(self, rank: int, update_freq: int = 200):
        self.rank = rank
        self.update_freq = update_freq
        self.step = 0
        self.projector = None

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        """Project the full gradient to the low-rank subspace."""
        if self.step % self.update_freq == 0:
            # Recompute the projection via SVD of the current gradient
            U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
            self.projector = U[:, :self.rank]

        self.step += 1
        # Optimizer states live in this (rank, d_in)-shaped space
        return self.projector.T @ grad  # shape: (rank, d_in)

    def back_project(self, low_rank_update: torch.Tensor) -> torch.Tensor:
        """Map a low-rank update back to the full parameter space."""
        return self.projector @ low_rank_update
Code Fragment 15.2.5: GaLore conceptual implementation showing gradient projection via SVD. The projector periodically recomputes the low-rank subspace (every 200 steps by default), then projects gradients into this smaller space for memory-efficient optimizer states. This enables full-parameter training of 7B models on a single 24 GB GPU.

In practice, the galore_torch library wraps this into a drop-in optimizer replacement:

# Library shortcut: GaLore optimizer (pip install galore-torch)
from galore_torch import GaLoreAdamW8bit

# GaLore settings attach to a parameter group; project only the 2-D weight
# matrices and train the remaining parameters with plain AdamW
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW8bit(
    [
        {"params": other_params},
        {
            "params": galore_params,
            "rank": 128,             # low-rank projection dimension
            "update_proj_gap": 200,  # recompute SVD every 200 steps
            "scale": 0.25,           # scaling factor for projected updates
        },
    ],
    lr=1e-5,
)
# Use this optimizer in any standard training loop; no other changes needed.
Code Fragment 15.2.6: GaLore library shortcut. The GaLoreAdamW8bit optimizer handles SVD projection, subspace tracking, and 8-bit quantization of optimizer states internally. Swapping it in for your standard optimizer enables full-parameter training on consumer GPUs.

The memory savings are substantial. For a 7B parameter model, standard Adam-based training requires roughly 42 GB beyond the weights for gradients and optimizer states (the gradient plus Adam's two moment buffers). With GaLore at rank 128, the optimizer states shrink to approximately 2 to 4 GB, enabling full-parameter training of 7B models on a single 24 GB GPU. The authors demonstrated that GaLore can pre-train LLaMA models up to 7B parameters on a single GPU with no loss in quality compared to full-rank Adam.
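The arithmetic behind these figures can be sketched as follows, under the assumption of 16-bit gradients and moment buffers (a simplification; real accounting depends on precision choices):

```python
# Rough memory arithmetic for a 7B model (assumes 16-bit storage throughout)
n_params = 7e9
bytes_per_val = 2

grads_gb = n_params * bytes_per_val / 1e9        # ~14 GB of gradients
moments_gb = n_params * 2 * bytes_per_val / 1e9  # ~28 GB for Adam's two moments
print(grads_gb + moments_gb)                     # ~42 GB without GaLore

# GaLore keeps the moments in a rank-128 subspace of each hidden dimension
rank, hidden = 128, 4096
print(moments_gb * rank / hidden)                # under 1 GB of core moment state
# Projection matrices and unprojected 1-D params push this toward 2-4 GB
```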

Key Insight

GaLore and LoRA solve different problems. LoRA reduces the number of trainable parameters by adding small adapters. GaLore reduces optimizer memory by projecting gradients into a low-rank space while still updating all parameters. You can combine both: use GaLore for pre-training or full fine-tuning, and LoRA for lightweight adaptation where you want a modular, hot-swappable adapter.

10. rsLoRA: Rank-Stabilized LoRA

Standard LoRA initializes the low-rank matrices A and B such that the adapter output is scaled by a fixed factor α/r, where α is a hyperparameter and r is the rank. This scaling creates a practical problem: when you change the rank, the effective magnitude of the adapter's contribution changes, requiring you to re-tune the learning rate and α for each rank setting. This makes rank selection tedious and error-prone.

rsLoRA (rank-stabilized LoRA) addresses this by changing the scaling factor from α/r to α/√r. This seemingly small modification has a significant theoretical and practical impact. The √r scaling ensures that the adapter's output magnitude remains stable as the rank changes, because the variance of the product BA scales proportionally with r under random initialization. With α/√r scaling, doubling the rank does not double the adapter's contribution; it increases it by only √2, which is the correct normalization for maintaining stable training dynamics. Code Fragment 15.2.7 compares the two scaling approaches.

# rsLoRA vs standard LoRA scaling comparison
import torch
import math

def lora_forward(x, A, B, alpha, rank, use_rslora=False):
    """Compare standard LoRA and rsLoRA scaling."""
    lora_output = x @ A @ B  # shape: (batch, d_out)

    if use_rslora:
        # rsLoRA: scale by alpha / sqrt(rank)
        scaling = alpha / math.sqrt(rank)
    else:
        # Standard LoRA: scale by alpha / rank
        scaling = alpha / rank

    return lora_output * scaling

# Demonstrate stability across ranks
d_in, d_out, alpha = 4096, 4096, 16.0
x = torch.randn(1, d_in)

for rank in [4, 16, 64, 256]:
    A = torch.randn(d_in, rank) * 0.01
    B = torch.randn(rank, d_out) * 0.01

    std_out = lora_forward(x, A, B, alpha, rank, use_rslora=False)
    rs_out = lora_forward(x, A, B, alpha, rank, use_rslora=True)

    print(f"rank={rank:3d} standard_norm={std_out.norm():.4f}"
          f" rslora_norm={rs_out.norm():.4f}")
Code Fragment 15.2.7: rsLoRA vs standard LoRA scaling comparison

The practical benefit is straightforward: with rsLoRA, you can change the rank without retuning other hyperparameters. A learning rate and α that work well at rank 16 will also work well at rank 64 or rank 256. This makes hyperparameter search much faster (complementing the training loop fundamentals from Section 14.3), because you can tune the learning rate at a low rank (which trains quickly) and then scale up the rank for higher quality without adjusting anything else. rsLoRA is available in PEFT via the use_rslora=True parameter in LoraConfig.
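Enabling it in PEFT is a single flag on the existing LoraConfig (the other values here are illustrative):

```python
from peft import LoraConfig

rslora_config = LoraConfig(
    r=64,                         # rank can now change freely...
    lora_alpha=16,                # ...without retuning alpha or the learning rate
    use_rslora=True,              # scale by alpha / sqrt(r) instead of alpha / r
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
```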

Property Comparison
| Property | Standard LoRA | rsLoRA | GaLore |
| --- | --- | --- | --- |
| What it optimizes | Parameter count | Parameter count | Optimizer memory |
| Scaling factor | α/r | α/√r | N/A (full params) |
| Rank-stable? | No (retune per rank) | Yes | Yes |
| Updates base weights? | No (adapter only) | No (adapter only) | Yes (all params) |
| Typical use case | Fine-tuning | Fine-tuning | Pre-training, full fine-tuning |
| Library support | PEFT, Unsloth | PEFT (use_rslora=True) | galore-torch, PEFT |
Note

rsLoRA is a drop-in improvement with no computational overhead. If you are using PEFT's LoraConfig, add use_rslora=True and your adapter scaling will be rank-stable. There is no reason not to enable it for any new LoRA training run. GaLore requires more setup (a custom optimizer) but enables training regimes that are otherwise impossible on limited hardware.

Self-Check
Q1: How does DoRA differ from standard LoRA, and when would you prefer it?
Show Answer
DoRA decomposes the pretrained weight into magnitude and direction components, then applies LoRA only to the direction. This more closely mirrors how full fine-tuning modifies weights. In the PEFT library, you enable it with use_dora=True. Prefer DoRA when you want a free 1-3% accuracy improvement with minimal extra cost over standard LoRA.
Q2: What is the key idea behind LoRA+, and what practical benefit does it provide?
Show Answer
LoRA+ assigns different learning rates to the A and B matrices in the LoRA decomposition. Based on theoretical analysis, matrix B should have a higher learning rate (typically 2-16x) than matrix A. The practical benefit is 1.5-2x faster convergence, reaching the same quality in fewer training steps.
Q3: What is the fundamental difference between Prefix Tuning and Prompt Tuning?
Show Answer
Prefix Tuning prepends learnable key-value vectors to every attention layer in the model. Prompt Tuning prepends learnable embeddings only at the input embedding layer. Prefix Tuning is more expressive (modifies attention at every layer) but has more parameters. Prompt Tuning is extremely lightweight and works best with very large models.
Q4: Why is multi-adapter serving a uniquely strong advantage of LoRA over other PEFT methods?
Show Answer
LoRA adapters are small weight matrices that can be applied additively to frozen base model weights. This means a single base model in GPU memory can serve hundreds of different adapters by swapping them per request. Systems like LoRAX and S-LoRA make this efficient at scale. Adapter-based methods add sequential computation, making them harder to batch. Prefix and Prompt Tuning can also be swapped, but LoRA has far better tooling and ecosystem support for this use case.
Q5: In what scenario would IA3 be a better choice than LoRA?
Show Answer
IA3 is preferred when you have an extremely tight parameter budget (it trains roughly 10x fewer parameters than LoRA), when you are fine-tuning on very small datasets where overfitting is a concern, or when the task is simple (style transfer, format change) and does not require learning substantial new knowledge. For complex domain adaptation or reasoning tasks, LoRA will significantly outperform IA3.
Tip: Apply LoRA to All Linear Layers

Applying LoRA to only the attention layers (q_proj, v_proj) is the classic default, but recent research shows targeting all linear layers (including MLP) gives better results for a modest increase in trainable parameters. Try target_modules="all-linear" in PEFT.

Key Takeaways
Fun Fact

The PEFT method zoo has grown so large that researchers now publish "survey of survey" papers just to catalog them all. At last count, the literature contained over 40 distinct PEFT variants, many named with creative acronyms: LoRA, DoRA, QLoRA, rsLoRA, LoRA+, AdaLoRA, LoHa, LoKr, OFT, BOFT, VeRA, IA3, and more. Some researchers joke that the field has more LoRA variants than there are letters in the alphabet. The good news: despite this Cambrian explosion, about 90% of practitioners use plain LoRA or QLoRA, and that is perfectly fine for most tasks.

Research Frontier

Unified PEFT frameworks are emerging that combine adapter insertion, soft prompt tuning, and low-rank decomposition into a single configurable system, allowing automated search over PEFT method combinations. Research on (IA)^3 demonstrates that learning just three rescaling vectors per layer can match LoRA performance with even fewer parameters. The frontier challenge is developing PEFT methods that work reliably for multimodal models (vision-language, audio-language), where optimal adapter placement differs from text-only transformers.

Recent work on mixture-of-LoRA-experts (2024) routes inputs to specialized LoRA adapters using a learned gating mechanism, combining the modularity of multi-adapter serving with the capacity of larger models.

Exercises

Exercise 15.2.1: DoRA improvement Conceptual

Explain how DoRA (Weight-Decomposed Low-Rank Adaptation) improves on standard LoRA. What is the magnitude-direction decomposition?

Answer Sketch

DoRA decomposes each weight matrix into a magnitude component (scalar per column) and a direction component (unit vector). LoRA is then applied only to the direction component, while the magnitude is learned separately. This mirrors how full fine-tuning naturally adjusts both magnitude and direction. Standard LoRA couples these two, making it harder to learn the right balance. DoRA typically improves accuracy by 1 to 3% over LoRA with minimal additional overhead.

Exercise 15.2.2: Prefix tuning Conceptual

Describe how prefix tuning works. How does prepending learned 'virtual tokens' to the input differ from manual prompt engineering?

Answer Sketch

Prefix tuning prepends a sequence of learned continuous vectors (virtual tokens) to each layer's key and value representations. Unlike manual prompt engineering (which uses discrete text tokens), prefix tuning optimizes in continuous space, allowing the model to find representations that no natural language token could express. The virtual tokens are trained via backpropagation while the model is frozen. Prefix tuning modifies the attention pattern without changing any model weights.

Exercise 15.2.3: IA3 efficiency Coding

IA3 learns only three rescaling vectors per layer. Calculate the total number of trainable parameters for a 7B model with 32 layers, where each layer has hidden dimension 4096.

Answer Sketch

IA3 learns three vectors per layer: one for keys (d=4096), one for values (d=4096), and one for the FFN intermediate layer (d=11008 for Llama). Per layer: 4096 + 4096 + 11008 = 19,200 parameters. Total: 32 * 19,200 = 614,400 parameters. For a 7B model, this is 0.009% of total parameters. Compare to LoRA rank 16: ~10M parameters (0.14%). IA3 is 16x more parameter-efficient than LoRA but may sacrifice adaptation quality for complex tasks.

Exercise 15.2.4: Method selection guide Analysis

You need to fine-tune a single base model for 50 different customer tenants, each with unique data. Which PEFT method would you choose, and why?

Answer Sketch

LoRA is ideal for multi-tenant serving. Each tenant gets their own small adapter (~10 to 50MB) that can be hot-swapped at inference time without reloading the base model (~14GB for 7B). Frameworks like LoRAX and S-LoRA serve hundreds of adapters from a single GPU. Prefix tuning is an alternative but adds serving latency. Full fine-tuning would require 50 separate model copies (700GB+). LoRA gives per-tenant customization at 1/300th the storage cost.

Exercise 15.2.5: Prompt tuning implementation Coding

Write the key code to add a soft prompt (10 learned tokens) to a model using the PEFT library's PromptTuningConfig. Include model loading, config setup, and training.

Answer Sketch

Config: config = PromptTuningConfig(task_type='CAUSAL_LM', num_virtual_tokens=10, prompt_tuning_init='RANDOM'). Apply: model = get_peft_model(model, config). The model now has 10 * hidden_dim trainable parameters (e.g., 10 * 4096 = 40,960 for Llama-7B). Train normally with any SFT trainer. At inference, the learned soft prompt is prepended automatically. Total trainable parameters: 0.0006% of the model.

What Comes Next

In the next section, Section 15.3: Training Platforms & Tools, we cover training platforms and tools, the practical infrastructure for running PEFT workflows at scale.

Bibliography
LoRA Variants

Liu, S., Wang, C., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024.

Decomposes pre-trained weights into magnitude and direction components, applying LoRA only to the directional part. This approach closes the gap with full fine-tuning on several benchmarks and is recommended reading for anyone seeking the next step beyond vanilla LoRA.

LoRA Extension

Hayou, S., Ghosh, N., & Yu, B. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models. ICML 2024.

Identifies that the A and B matrices in LoRA should use different learning rates for optimal convergence. The fix is simple to implement and yields consistent improvements, making this a quick win for any LoRA training pipeline.

Training Optimization

Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. ICLR 2023.

Introduces importance-based rank allocation across weight matrices, pruning less critical singular values during training. Teams working with tight parameter budgets should read this to learn how adaptive rank can outperform fixed-rank LoRA.

Adaptive Rank
Alternative PEFT Approaches

Li, X. L. & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021.

Prepends trainable continuous vectors to the key and value sequences at each layer, achieving competitive performance on generation tasks with even fewer parameters than adapters. Useful reading for understanding the prompt-based branch of the PEFT family tree.

Prompt-Based PEFT

He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., & Neubig, G. (2022). Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022.

Demonstrates that adapters, prefix tuning, and LoRA can all be understood as modifications to attention, providing a unified mathematical framework. Researchers comparing PEFT methods should consult this paper for its clean theoretical perspective and controlled experiments.

Unified Framework

Renduchintala, A., Konuk, T., & Kuchaiev, O. (2024). Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying.

Shares LoRA adapter weights across layers using a tying strategy, reducing the total adapter parameter count further while maintaining quality. Practitioners deploying many LoRA adapters simultaneously will find this approach valuable for reducing storage and serving costs.

Weight Sharing