"One adapter is good. Six adapters, each specialized and hot-swappable, is a production strategy."
LoRA, Hot-Swapping AI Agent
LoRA dominates the PEFT landscape, but it is not the only option. Researchers have developed numerous alternatives that offer different tradeoffs in parameter count, training speed, inference overhead, and task specialization. DoRA improves LoRA by decomposing weights into magnitude and direction components. LoRA+ uses different learning rates for the A and B matrices. Prefix Tuning and Prompt Tuning prepend learnable tokens rather than modifying weights (drawing on the prompt engineering intuition of steering via input context). IA3 achieves extreme parameter efficiency by learning only rescaling vectors. Understanding these alternatives helps you select the right tool for each scenario, particularly when operating under tight memory, latency, or multi-tenant serving constraints. The techniques from Section 09.2 combine naturally with these lightweight adaptation approaches. The LoRA foundations from Section 15.1 provide the baseline that these methods extend or replace.
Prerequisites
This section extends the LoRA and QLoRA foundations from Section 15.1, so make sure you understand low-rank decomposition (W' = W + BA) and the role of rank, alpha, and target module selection. Familiarity with the transformer attention mechanism from Section 04.1 is important for understanding how prefix tuning and adapter methods modify the forward pass. The prompt engineering concepts from Section 11.1 provide useful context for prompt tuning, which bridges the gap between manual prompting and learned adaptation.
Teams often jump from LoRA to DoRA, IA3, or prefix tuning hoping for a quick accuracy boost, without first optimizing LoRA's hyperparameters (rank, alpha, target modules, learning rate). In practice, a well-tuned LoRA configuration outperforms a poorly tuned DoRA or adapter setup. Before switching methods, make sure you have tried: (1) increasing rank to 32 or 64, (2) adjusting alpha relative to rank, (3) targeting all linear layers (not just attention), and (4) tuning the learning rate. If a properly tuned LoRA still falls short, then explore alternatives like DoRA or full fine-tuning.
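As a concrete illustration, the checklist above can be encoded as a small search grid. The helper below is a hypothetical sketch; the alpha = 2r heuristic and the learning rates are common starting points, not universal rules:

```python
# Hypothetical search grid encoding the pre-DoRA tuning checklist
def lora_search_grid():
    grid = []
    for rank in (16, 32, 64):
        grid.append({
            "r": rank,
            "lora_alpha": 2 * rank,          # common heuristic: alpha = 2r
            "target_modules": "all-linear",  # attention and MLP projections
            "learning_rate": 2e-4 if rank <= 32 else 1e-4,
        })
    return grid

for cfg in lora_search_grid():
    print(cfg["r"], cfg["lora_alpha"], cfg["learning_rate"])
```

Exhausting a grid like this before switching methods usually costs less than debugging a new PEFT variant.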
1. DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA (2024) improves on LoRA by decomposing the pretrained weight into its magnitude and direction components before applying the low-rank update. Specifically, a weight vector w is decomposed as w = m · (w / ||w||), where m is a learnable magnitude scalar and the direction is updated via standard LoRA. This decomposition aligns more closely with how full fine-tuning actually modifies weights, resulting in better performance for the same rank.
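The decomposition is easy to see on a single column vector. A minimal pure-Python sketch with toy numbers (no training involved):

```python
import math

# Decompose w into magnitude m and unit direction w / ||w||, as DoRA does
def decompose(w):
    m = math.sqrt(sum(v * v for v in w))  # magnitude: ||w||
    direction = [v / m for v in w]        # unit vector: w / ||w||
    return m, direction

m, direction = decompose([3.0, 4.0])
print(m)          # 5.0
print(direction)  # [0.6, 0.8]

# Reconstruction recovers the original weight: w = m * direction
w_reconstructed = [m * d for d in direction]
print(w_reconstructed)  # [3.0, 4.0]
```

DoRA learns m directly while the direction component receives the low-rank BA update, so magnitude and direction can move independently during training.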
If you are already using LoRA and want a quick accuracy boost, try switching to DoRA before increasing rank. In most benchmarks, DoRA at rank 16 outperforms LoRA at rank 32, while using fewer trainable parameters. The PEFT library supports DoRA as a drop-in replacement: just set use_dora=True in your LoraConfig.
In practice, DoRA consistently outperforms LoRA by 1-3% across benchmarks when using the same rank and target modules, with only a marginal increase in trainable parameters (the additional magnitude vectors are tiny). The training speed is nearly identical to LoRA. For deployment considerations, the guidance from Section 09.1 applies equally to DoRA-adapted models. Figure 15.2.1 compares the LoRA and DoRA weight update mechanisms.
Think of advanced PEFT methods as a wardrobe of accessories for the same outfit. DoRA separates direction from magnitude, like choosing both which way to point and how far to reach. Prompt tuning adds learned tokens to the input, like pinning a badge onto a jacket that changes how others perceive you. Each accessory targets a different aspect of the model's behavior, and you can mix and match them depending on the task at hand.
The following implementation (Code Fragment 15.2.2) shows how to enable DoRA with a single configuration flag.
# Configure DoRA via PEFT's LoraConfig on top of the base model
# DoRA adds learnable magnitude vectors alongside the low-rank update
from peft import LoraConfig, get_peft_model
# DoRA configuration: enable the use_dora flag
dora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
use_dora=True, # Enable DoRA
task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, dora_config)
model.print_trainable_parameters()
# Slightly more params than LoRA due to magnitude vectors,
# but typically ~1-3% better accuracy at same rank.
3. Prefix Tuning

Code Fragment 15.2.3 shows how Prefix Tuning prepends learnable vectors to each attention layer.
# Prefix Tuning: prepend learnable key-value vectors to each attention layer
# The model attends to these "virtual tokens" alongside real input tokens
from peft import PrefixTuningConfig, get_peft_model, TaskType
# Prefix Tuning configuration
prefix_config = PrefixTuningConfig(
task_type=TaskType.CAUSAL_LM,
num_virtual_tokens=30, # Number of prefix tokens
prefix_projection=True, # Use MLP to project prefix (more stable)
encoder_hidden_size=1024, # Hidden size of projection MLP
)
# Wrap base model with trainable prefix parameters
model = get_peft_model(base_model, prefix_config)
model.print_trainable_parameters()
# Typically 0.1-0.5% of total parameters
The prefix_projection flag enables a small MLP that projects the prefix, improving training stability. This approach modifies the model's behavior through attention steering rather than weight modification.

4. Adapter Layers

Code Fragment 15.2.4 demonstrates the bottleneck adapter pattern using LLaMA-Adapter style adapters.
# Adapter layers: insert small bottleneck modules between transformer layers
# Uses LLaMA-Adapter style via PEFT's AdaptionPromptConfig
from peft import AdaptionPromptConfig, get_peft_model
# Note: For bottleneck adapters, use the adapters library
# from adapters import AutoAdapterModel
# Example with LLaMA-Adapter style (via PEFT)
adapter_config = AdaptionPromptConfig(
adapter_len=10, # Length of adapter prompt
adapter_layers=30, # Number of layers to add adapters
task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, adapter_config)
model.print_trainable_parameters()
5. IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
IA3 (introduced in the 2022 paper "Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning") takes parameter efficiency to the extreme. Instead of learning new matrices or inserting new layers, IA3 learns only three rescaling vectors per layer that modulate the keys, values, and intermediate activations in the feedforward layers. The total number of trainable parameters is typically 10x smaller than LoRA's.
The tradeoff is that IA3's limited capacity makes it best suited for simple adaptation tasks (format changes, style transfer) rather than complex domain adaptation. It excels in few-shot settings where overfitting is a concern. Code Fragment 15.2.5 demonstrates IA3 configuration.
# IA3 configuration: learns only rescaling vectors
# Extreme parameter efficiency at the cost of limited adaptation capacity
from peft import IA3Config, get_peft_model, TaskType
ia3_config = IA3Config(
task_type=TaskType.CAUSAL_LM,
target_modules=["k_proj", "v_proj", "down_proj"],
feedforward_modules=["down_proj"],
)
# Wrap base model with IA3 rescaling vectors
model = get_peft_model(base_model, ia3_config)
model.print_trainable_parameters()
# trainable params: ~500K for a 7B model (0.007%)
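Under the hood, the learned vectors act as simple elementwise gates on activations. A toy pure-Python sketch with illustrative values:

```python
# IA3 sketch: rescale each activation dimension by a learned vector
def ia3_rescale(activations, scale):
    # activations: list of token vectors; scale: one learned vector per layer
    return [[a * s for a, s in zip(row, scale)] for row in activations]

keys = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
l_k = [1.0, 0.5, 2.0]  # initialized to 1.0 in practice, so training starts at identity
print(ia3_rescale(keys, l_k))  # [[1.0, 1.0, 6.0], [4.0, 2.5, 12.0]]
```

Because each layer contributes only a handful of vectors (one per targeted module), the trainable parameter count stays in the hundreds of thousands even for a 7B model.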
6. Comprehensive PEFT Method Comparison
| Method | Params (%) | Memory | Inference Overhead | Best For |
|---|---|---|---|---|
| LoRA | 0.1-0.5% | Low | Zero (after merge) | General purpose, most tasks |
| QLoRA | 0.1-0.5% | Very Low | Zero (after merge) | Large models on limited GPU |
| DoRA | 0.1-0.5% | Low | Zero (after merge) | When LoRA quality is insufficient |
| LoRA+ | 0.1-0.5% | Low | Zero (after merge) | Faster convergence needed |
| Prefix Tuning | 0.1-0.5% | Low | Small (longer KV cache) | NLU tasks, multi-task serving |
| Prompt Tuning | <0.01% | Very Low | Negligible | Very large models, simple tasks |
| Adapters | 0.5-3% | Medium | Small (sequential) | Compositional multi-task |
| IA3 | <0.01% | Very Low | Negligible | Few-shot, style adaptation |
Prompt Tuning and IA3 achieve extreme parameter efficiency, but they are significantly less capable than LoRA for complex adaptation tasks. If your task requires learning new knowledge (domain-specific terminology, code patterns, specialized reasoning), LoRA or DoRA with a reasonable rank (16-64) will substantially outperform these lighter methods. Reserve IA3 and Prompt Tuning for scenarios where simplicity or parameter count is the primary constraint.
Why this matters: The proliferation of PEFT methods (DoRA, AdaLoRA, IA3, GaLore) is not just academic variety; each addresses a specific limitation of vanilla LoRA. DoRA handles magnitude-direction decomposition for better convergence. AdaLoRA allocates rank adaptively, putting more parameters where they help most. IA3 achieves extreme parameter efficiency (orders of magnitude fewer parameters than LoRA) at the cost of some task performance. The practical guidance is: start with standard LoRA, and only explore these variants when you have a specific bottleneck (convergence issues, memory constraints, or multi-task serving requirements).
7. Multi-Adapter Serving
One of LoRA's most powerful production features is the ability to serve many adapters from a single base model. This enables multi-tenant deployments where each customer or task gets its own fine-tuned behavior without duplicating the base model weights, a significant advantage for inference optimization. Two main systems support this at scale: LoRAX (from Predibase) and S-LoRA.
7.1 Architecture Overview
Figure 15.2.3 illustrates how a single base model serves multiple LoRA adapters dynamically at request time.
7.2 LoRAX and S-LoRA
LoRAX (Predibase) is a production-grade serving system that can host hundreds of fine-tuned LoRA adapters on a single GPU. It keeps the base model in GPU memory and dynamically loads adapter weights per request. Key features include adapter weight caching, batched inference across different adapters, and automatic adapter management.
S-LoRA (from UC Berkeley) takes a more research-oriented approach, using unified paging to manage adapter memory and custom CUDA kernels for batched LoRA computation. S-LoRA can serve thousands of adapters simultaneously, with adapters stored in a tiered memory system (GPU, CPU, disk) and paged in on demand.
Multi-adapter serving is one of the strongest arguments for LoRA over other PEFT methods in production. A single A100 GPU can serve a base 7B model with hundreds of LoRA adapters, effectively providing hundreds of specialized models at the cost of one. This is far more economical than deploying separate merged models for each use case.
Who: Daniela, a platform architect at a B2B content-generation SaaS company.
Situation: The company had 50 enterprise customers, each requiring a specialized writing style (tone, vocabulary, formatting conventions) for their generated marketing copy. The product ran on a 7B-parameter base model.
Problem: Deploying 50 separate fine-tuned models would require 50 GPUs, pushing monthly infrastructure costs above $150,000. The engineering team also dreaded maintaining 50 independent model deployments, each with its own versioning and rollback pipeline.
Decision: Daniela trained 50 lightweight LoRA adapters (each roughly 20 MB) and served them all from a single A100 using LoRAX. Per-request routing attached the correct adapter based on the customer's API key, and new customer onboarding required only a fresh LoRA training run (under two hours on one GPU).
Result: Total GPU cost dropped from 50 instances to one A100, a roughly 98% cost reduction. New adapters slotted into the shared serving infrastructure with zero downtime. Median response latency increased by only 8ms compared to a dedicated model, which was imperceptible to end users.
Lesson: For multi-tenant serving where each customer needs a specialized model, LoRA adapters plus a shared base model eliminate the linear scaling of GPU costs with customer count. The key enabler is adapter hot-swapping at the serving layer.
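The per-request routing in Daniela's setup can be sketched in a few lines. This is an illustrative cache, not the LoRAX implementation, and all names are hypothetical:

```python
# Hypothetical per-request adapter router with an in-memory cache
class AdapterRouter:
    def __init__(self, loader):
        self.loader = loader  # callable: adapter name -> loaded weights
        self.cache = {}       # hot adapters stay resident

    def get(self, customer_key):
        name = f"adapter-{customer_key}"
        if name not in self.cache:
            self.cache[name] = self.loader(name)  # page in on first request
        return self.cache[name]

router = AdapterRouter(loader=lambda name: f"weights:{name}")
print(router.get("acme"))  # weights:adapter-acme
print(router.get("acme"))  # weights:adapter-acme (served from cache)
print(len(router.cache))   # 1
```

Production systems add eviction, tiered storage (GPU/CPU/disk), and batched kernels on top of this basic pattern, but the routing idea is the same.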
8. Choosing the Right PEFT Method
With so many PEFT options available, the decision can feel overwhelming. Here is a practical decision framework based on your constraints and requirements.
| Scenario | Recommended Method | Reasoning |
|---|---|---|
| General fine-tuning (default) | LoRA (r=16) | Best quality/efficiency tradeoff, widest ecosystem support |
| Limited GPU memory | QLoRA | 4-bit base model frees VRAM for larger models or batches |
| Need extra quality over LoRA | DoRA | Drop-in upgrade, consistent 1-3% improvement |
| Training speed is critical | LoRA+ | 1.5-2x faster convergence, same final quality |
| Multi-tenant serving (100+ tasks) | LoRA + LoRAX | Hot-swappable adapters from single base |
| Extreme parameter budget (<1M params) | IA3 | Learns only rescaling vectors, minimal overfitting |
| Very large model (100B+), simple task | Prompt Tuning | Ultra-lightweight, scales well with model size |
| NLU classification tasks | Prefix Tuning | Strong at steering attention for classification |
When in doubt, start with LoRA. It has the widest library support, the most documentation, and works well across virtually all tasks and model sizes. Move to specialized methods only when you have a specific constraint (memory, serving architecture, parameter count) that LoRA cannot satisfy.
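The decision table can be folded into a small helper. This is a hypothetical sketch that mirrors the priorities above, checking constraints from most to least restrictive:

```python
def pick_peft_method(tiny_budget=False, multi_tenant=False,
                     memory_limited=False, need_extra_quality=False):
    """Hypothetical encoding of the decision table above."""
    if tiny_budget:
        return "IA3"
    if multi_tenant:
        return "LoRA + LoRAX"
    if memory_limited:
        return "QLoRA"
    if need_extra_quality:
        return "DoRA"
    return "LoRA (r=16)"  # the default when no constraint dominates

print(pick_peft_method())                      # LoRA (r=16)
print(pick_peft_method(memory_limited=True))   # QLoRA
print(pick_peft_method(multi_tenant=True))     # LoRA + LoRAX
```

Real projects have softer tradeoffs than an if-chain can capture, but making the constraint ordering explicit is a useful exercise before committing to a method.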
9. GaLore: Gradient Low-Rank Projection
While LoRA reduces the number of trainable parameters by adding low-rank adapters, GaLore (Gradient Low-Rank Projection) takes a fundamentally different approach: it reduces optimizer memory by projecting gradients into a low-rank subspace. This distinction matters because optimizer states (momentum and variance in Adam) typically consume two to three times the memory of the parameters themselves. GaLore enables full-parameter training of large models on consumer hardware, something that LoRA alone cannot achieve because LoRA still freezes the base model and only updates the adapter.
9.1 How GaLore Works
During training, GaLore periodically computes the SVD of the gradient matrix for each weight layer, retaining only the top-r singular vectors. The optimizer states (Adam's first and second moments) are maintained in this reduced r-dimensional space rather than the full parameter space. Every T steps (typically T=200), the projection matrices are recomputed from the current gradient to track the evolving optimization landscape. The key insight is that gradient matrices during LLM training tend to be approximately low-rank, so very little information is lost by this projection. Code Fragment 15.2.6 provides a conceptual implementation.
# GaLore conceptual implementation
import torch
class GaLoreProjector:
"""Project gradients to low-rank subspace for memory-efficient training."""
def __init__(self, rank: int, update_freq: int = 200):
self.rank = rank
self.update_freq = update_freq
self.step = 0
self.projector = None
def project(self, grad: torch.Tensor) -> torch.Tensor:
"""Project full gradient to low-rank subspace."""
if self.step % self.update_freq == 0:
# Recompute projection via SVD
U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
self.projector = U[:, :self.rank]
self.step += 1
# Project the gradient into the rank-dimensional subspace
return self.projector.T @ grad  # shape: (rank, d_in)
def back_project(self, low_rank_update: torch.Tensor) -> torch.Tensor:
"""Map low-rank update back to full parameter space."""
return self.projector @ low_rank_update
In practice, the galore_torch library wraps this into a drop-in optimizer replacement. GaLore hyperparameters are attached to a parameter group rather than to the optimizer itself:
# Library shortcut: GaLore optimizer (pip install galore-torch)
from galore_torch import GaLoreAdamW8bit
# GaLore applies to 2D weight matrices; other parameters use plain AdamW
galore_params = [p for p in model.parameters() if p.dim() == 2]
other_params = [p for p in model.parameters() if p.dim() != 2]
param_groups = [
    {"params": other_params},
    {"params": galore_params,
     "rank": 128,              # low-rank projection dimension
     "update_proj_gap": 200,   # recompute SVD every 200 steps
     "scale": 0.25,            # scaling factor for the projected update
     "proj_type": "std"},
]
optimizer = GaLoreAdamW8bit(param_groups, lr=1e-5)
# Use this optimizer in any standard training loop; no other changes needed.
The GaLoreAdamW8bit optimizer handles SVD projection, subspace tracking, and 8-bit quantization of optimizer states internally, so enabling it is a small change to most training loops.

The memory savings are substantial. For a 7B parameter model, standard Adam requires roughly 56 GB of optimizer state memory (two float32 moment buffers covering all parameters). With GaLore at rank 128, the optimizer states shrink to approximately 2 to 4 GB, enabling full-parameter training of 7B models on a single 24 GB GPU. The authors demonstrated that GaLore can pre-train LLaMA models up to 7B parameters on a single GPU with no loss in quality compared to full-rank Adam.
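The arithmetic behind these numbers is straightforward. The sketch below ignores projector storage and non-matrix parameters, so treat it as a rough bound rather than a precise accounting:

```python
# Back-of-the-envelope optimizer memory for a 7B-parameter model
params = 7e9
adam_state_gb = params * 4 * 2 / 1e9  # two float32 moment buffers, 4 bytes each
print(round(adam_state_gb))           # 56

# GaLore keeps moments in a rank-r subspace: per matrix, states shrink by ~r/d
rank, hidden = 128, 4096
galore_state_gb = adam_state_gb * rank / hidden
print(round(galore_state_gb, 2))      # 1.75
```

Projection matrices and 8-bit quantization overheads push the realistic figure into the 2 to 4 GB range quoted above.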
GaLore and LoRA solve different problems. LoRA reduces the number of trainable parameters by adding small adapters. GaLore reduces optimizer memory by projecting gradients into a low-rank space while still updating all parameters. You can combine both: use GaLore for pre-training or full fine-tuning, and LoRA for lightweight adaptation where you want a modular, hot-swappable adapter.
10. rsLoRA: Rank-Stabilized LoRA
Standard LoRA initializes the low-rank matrices A and B such that the adapter output is scaled by a fixed factor α/r, where α is a hyperparameter and r is the rank. This scaling creates a practical problem: when you change the rank, the effective magnitude of the adapter's contribution changes, requiring you to re-tune the learning rate and α for each rank setting. This makes rank selection tedious and error-prone.
rsLoRA (rank-stabilized LoRA) addresses this by changing the scaling factor from α/r to α/√r. This seemingly small modification has a significant theoretical and practical impact. Under random initialization, the variance of the product BA grows proportionally with r, so its typical magnitude grows with √r; dividing by √r rather than r therefore keeps the adapter's output magnitude stable as the rank changes. With α/√r scaling, doubling the rank leaves the adapter's typical contribution unchanged, which is the correct normalization for maintaining stable training dynamics. Code Fragment 15.2.7 compares the two scaling approaches.
# rsLoRA vs standard LoRA scaling comparison
import torch
import math
def lora_forward(x, A, B, alpha, rank, use_rslora=False):
"""Compare standard LoRA and rsLoRA scaling."""
lora_output = x @ A @ B # shape: (batch, d_out)
if use_rslora:
# rsLoRA: scale by alpha / sqrt(rank)
scaling = alpha / math.sqrt(rank)
else:
# Standard LoRA: scale by alpha / rank
scaling = alpha / rank
return lora_output * scaling
# Demonstrate stability across ranks
d_in, d_out, alpha = 4096, 4096, 16.0
x = torch.randn(1, d_in)
for rank in [4, 16, 64, 256]:
A = torch.randn(d_in, rank) * 0.01
B = torch.randn(rank, d_out) * 0.01
std_out = lora_forward(x, A, B, alpha, rank, use_rslora=False)
rs_out = lora_forward(x, A, B, alpha, rank, use_rslora=True)
print(f"rank={rank:3d} standard_norm={std_out.norm():.4f}"
f" rslora_norm={rs_out.norm():.4f}")
The practical benefit is straightforward: with rsLoRA, you can change the rank without retuning other hyperparameters. A learning rate and α that work well at rank 16 will also work well at rank 64 or rank 256. This makes hyperparameter search much faster (complementing the training loop fundamentals from Section 14.3), because you can tune the learning rate at a low rank (which trains quickly) and then scale up the rank for higher quality without adjusting anything else. rsLoRA is available in PEFT via the use_rslora=True parameter in LoraConfig.
| Property | Standard LoRA | rsLoRA | GaLore |
|---|---|---|---|
| What it optimizes | Parameter count | Parameter count | Optimizer memory |
| Scaling factor | α/r | α/√r | N/A (full params) |
| Rank-stable? | No (retune per rank) | Yes | Yes |
| Updates base weights? | No (adapter only) | No (adapter only) | Yes (all params) |
| Typical use case | Fine-tuning | Fine-tuning | Pre-training, full fine-tuning |
| Library support | PEFT, Unsloth | PEFT (use_rslora=True) | galore-torch, PEFT |
rsLoRA is a drop-in improvement with no computational overhead. If you are using PEFT's LoraConfig, add use_rslora=True and your adapter scaling will be rank-stable. There is no reason not to enable it for any new LoRA training run. GaLore requires more setup (a custom optimizer) but enables training regimes that are otherwise impossible on limited hardware.
Applying LoRA to only the attention layers (q_proj, v_proj) is the classic default, but recent research shows targeting all linear layers (including MLP) gives better results for a modest increase in trainable parameters. Try target_modules="all-linear" in PEFT.
- DoRA improves LoRA by decomposing weights into magnitude and direction, yielding 1-3% gains with a single configuration flag (use_dora=True).
- LoRA+ accelerates convergence by using asymmetric learning rates for the A and B matrices, saving training compute without sacrificing quality.
- Prefix Tuning prepends learnable key-value pairs to every attention layer, offering an alternative to weight modification that works well for NLU tasks.
- Prompt Tuning is the most parameter-efficient method (<0.01%), but is only competitive with very large models and simple tasks.
- IA3 learns only rescaling vectors, achieving extreme parameter efficiency at the cost of limited adaptation capacity.
- Multi-adapter serving (LoRAX, S-LoRA) lets you host hundreds of specialized models on a single GPU by hot-swapping LoRA adapters per request.
- When in doubt, use LoRA. It has the best ecosystem support, works across all model sizes and task types, and can be upgraded to DoRA or LoRA+ with minimal changes.
The PEFT method zoo has grown so large that researchers now publish "survey of survey" papers just to catalog them all. At last count, the literature contained over 40 distinct PEFT variants, many named with creative acronyms: LoRA, DoRA, QLoRA, rsLoRA, LoRA+, AdaLoRA, LoHa, LoKr, OFT, BOFT, VeRA, IA3, and more. Some researchers joke that the field has more LoRA variants than there are letters in the alphabet. The good news: despite this Cambrian explosion, about 90% of practitioners use plain LoRA or QLoRA, and that is perfectly fine for most tasks.
Unified PEFT frameworks are emerging that combine adapter insertion, soft prompt tuning, and low-rank decomposition into a single configurable system, allowing automated search over PEFT method combinations. Research on (IA)^3 demonstrates that learning just three rescaling vectors per layer can match LoRA performance with even fewer parameters. The frontier challenge is developing PEFT methods that work reliably for multimodal models (vision-language, audio-language), where optimal adapter placement differs from text-only transformers.
Recent work on mixture-of-LoRA-experts (2024) routes inputs to specialized LoRA adapters using a learned gating mechanism, combining the modularity of multi-adapter serving with the capacity of larger models.
Exercises
Explain how DoRA (Weight-Decomposed Low-Rank Adaptation) improves on standard LoRA. What is the magnitude-direction decomposition?
Answer Sketch
DoRA decomposes each weight matrix into a magnitude component (scalar per column) and a direction component (unit vector). LoRA is then applied only to the direction component, while the magnitude is learned separately. This mirrors how full fine-tuning naturally adjusts both magnitude and direction. Standard LoRA couples these two, making it harder to learn the right balance. DoRA typically improves accuracy by 1 to 3% over LoRA with minimal additional overhead.
Describe how prefix tuning works. How does prepending learned 'virtual tokens' to the input differ from manual prompt engineering?
Answer Sketch
Prefix tuning prepends a sequence of learned continuous vectors (virtual tokens) to each layer's key and value representations. Unlike manual prompt engineering (which uses discrete text tokens), prefix tuning optimizes in continuous space, allowing the model to find representations that no natural language token could express. The virtual tokens are trained via backpropagation while the model is frozen. Prefix tuning modifies the attention pattern without changing any model weights.
IA3 learns only three rescaling vectors per layer. Calculate the total number of trainable parameters for a 7B model with 32 layers, where each layer has hidden dimension 4096.
Answer Sketch
IA3 learns three vectors per layer: one for keys (d=4096), one for values (d=4096), and one for the FFN intermediate layer (d=11008 for Llama). Per layer: 4096 + 4096 + 11008 = 19,200 parameters. Total: 32 * 19,200 = 614,400 parameters. For a 7B model, this is 0.009% of total parameters. Compare to LoRA rank 16: ~10M parameters (0.14%). IA3 is 16x more parameter-efficient than LoRA but may sacrifice adaptation quality for complex tasks.
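The count in the sketch checks out:

```python
# IA3 parameter count for a 32-layer model with Llama-7B dimensions
layers, d_model, d_ffn = 32, 4096, 11008
per_layer = d_model + d_model + d_ffn  # l_k, l_v, l_ff vectors
total = layers * per_layer
print(per_layer)                    # 19200
print(total)                        # 614400
print(round(total / 7e9 * 100, 4))  # 0.0088 (percent of a 7B model)
```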
You need to fine-tune a single base model for 50 different customer tenants, each with unique data. Which PEFT method would you choose, and why?
Answer Sketch
LoRA is ideal for multi-tenant serving. Each tenant gets their own small adapter (~10 to 50MB) that can be hot-swapped at inference time without reloading the base model (~14GB for 7B). Frameworks like LoRAX and S-LoRA serve hundreds of adapters from a single GPU. Prefix tuning is an alternative but adds serving latency. Full fine-tuning would require 50 separate model copies (700GB+). LoRA gives per-tenant customization at 1/300th the storage cost.
Write the key code to add a soft prompt (10 learned tokens) to a model using the PEFT library's PromptTuningConfig. Include model loading, config setup, and training.
Answer Sketch
Config: config = PromptTuningConfig(task_type='CAUSAL_LM', num_virtual_tokens=10, prompt_tuning_init='RANDOM'). Apply: model = get_peft_model(model, config). The model now has 10 * hidden_dim trainable parameters (e.g., 10 * 4096 = 40,960 for Llama-7B). Train normally with any SFT trainer. At inference, the learned soft prompt is prepended automatically. Total trainable parameters: 0.0006% of the model.
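The parameter count from this sketch can be verified the same way:

```python
# Trainable parameters for a 10-token soft prompt at hidden size 4096
num_virtual_tokens, hidden = 10, 4096
trainable = num_virtual_tokens * hidden
print(trainable)                        # 40960
print(round(trainable / 7e9 * 100, 4))  # 0.0006 (percent of a 7B model)
```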
What Comes Next
In the next section, Section 15.3: Training Platforms & Tools, we turn to the practical infrastructure for running PEFT workflows at scale.
Liu, S., Wang, C., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., & Chen, M.-H. (2024). DoRA: Weight-Decomposed Low-Rank Adaptation. ICML 2024.
Decomposes pre-trained weights into magnitude and direction components, applying LoRA only to the directional part. This approach closes the gap with full fine-tuning on several benchmarks and is recommended reading for anyone seeking the next step beyond vanilla LoRA.
Hayou, S., Ghosh, N., & Yu, B. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models. ICML 2024.
Identifies that the A and B matrices in LoRA should use different learning rates for optimal convergence. The fix is simple to implement and yields consistent improvements, making this a quick win for any LoRA training pipeline.
Zhang, Q., Chen, M., Bukharin, A., He, P., Cheng, Y., Chen, W., & Zhao, T. (2023). AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. ICLR 2023.
Introduces importance-based rank allocation across weight matrices, pruning less critical singular values during training. Teams working with tight parameter budgets should read this to learn how adaptive rank can outperform fixed-rank LoRA.
Li, X. L. & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021.
Prepends trainable continuous vectors to the key and value sequences at each layer, achieving competitive performance on generation tasks with even fewer parameters than adapters. Useful reading for understanding the prompt-based branch of the PEFT family tree.
He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., & Neubig, G. (2022). Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022.
Demonstrates that adapters, prefix tuning, and LoRA can all be understood as modifications to attention, providing a unified mathematical framework. Researchers comparing PEFT methods should consult this paper for its clean theoretical perspective and controlled experiments.
Renduchintala, A., Konuk, T., & Kuchaiev, O. (2024). Tied-LoRA: Enhancing Parameter Efficiency of LoRA with Weight Tying.
Shares LoRA adapter weights across layers using a tying strategy, reducing the total adapter parameter count further while maintaining quality. Practitioners deploying many LoRA adapters simultaneously will find this approach valuable for reducing storage and serving costs.
