Part 4: Training and Adapting
Chapter 15: Parameter-Efficient Fine-Tuning

Soft Prompts: Prompt Tuning, Prefix Tuning, and P-Tuning

"A well-placed soft prompt does not change what the model knows. It changes what the model is trying to do."

LoRA LoRA, Softly Persuasive AI Agent
Big Picture

Soft prompt methods occupy a fascinating middle ground between prompt engineering and fine-tuning. Instead of choosing discrete text tokens that a human could read and interpret, soft prompt methods learn continuous, real-valued vectors in embedding space. These learned vectors are prepended to the input (or inserted into hidden layers) and steer the model toward a desired behavior without modifying the base model's weights. The family includes Prompt Tuning (input layer only, extremely lightweight), Prefix Tuning (all attention layers, more expressive), P-Tuning v1 (encoder-generated embeddings, bridging NLU and NLG), and P-Tuning v2 (deep prefix, strong NLU performance at all scales). Together they form a spectrum: as you add parameters, you gain task performance but reduce the elegance of a near-zero overhead approach. Cross-reference: Section 15.1 (LoRA), Section 15.2 (Advanced PEFT), Chapter 11 (Prompt Engineering).

Prerequisites

This section assumes familiarity with the LoRA and QLoRA techniques from Section 15.1 and the advanced PEFT methods in Section 15.2. You should understand how transformer attention works (Section 4.1) and be familiar with discrete prompt engineering from Chapter 11. Background in fine-tuning fundamentals from Chapter 14 is also helpful.

1. The Soft Prompt Family: A Taxonomy

A soft prompt is a set of learnable continuous vectors that are concatenated with the model's input representations and updated through gradient descent. Unlike the discrete tokens used in Chapter 11's prompt engineering, soft prompts exist only as floating-point tensors: they have no natural language interpretation, cannot be read by a human, and are discovered entirely by the optimizer.

Key Insight

Soft prompts reveal something profound about LLMs: the optimal instruction for a task is not necessarily expressible in human language. When you let the optimizer search freely in embedding space, it finds vectors that steer the model more effectively than any discrete prompt a human could write. These vectors often correspond to no real words at all. This suggests that the "language" models respond to best is not English or any natural language; it is the geometry of their own embedding space.

Common Mistake: Confusing Soft Prompts with Prompt Engineering

Despite the similar names, soft prompts and prompt engineering (covered in Chapter 11) are fundamentally different techniques. Prompt engineering crafts human-readable text instructions. Soft prompts are learned continuous vectors with no natural language interpretation; they cannot be read, shared as text, or transferred between different models. Soft prompts require gradient-based training, while prompt engineering requires only API access. Additionally, soft prompts are tied to a specific model checkpoint: if the base model is updated, the soft prompts must be retrained.

The key architectural question that distinguishes each method is where the learned vectors are inserted:

Insertion Points by Method Prompt Tuning Prefix Tuning / P-Tuning v2 P-Tuning v1 Soft tokens [Layer 0] Transformer Block 1 Transformer Block 2 ... (no changes) Transformer Block N Parameters: < 0.01% Prefix KV [Layer 0] Prefix KV [Block 1] Prefix KV [Block 2] ... (every layer) Prefix KV [Block N] Parameters: 0.1 - 1% LSTM/MLP Encoder (generates embeddings) Input layer (selected positions) Transformer Block 2 ... (no changes) Transformer Block N Parameters: ~0.01-0.1% Timeline Prompt Tuning 2021 Prefix Tuning 2021 P-Tuning v1 2021 P-Tuning v2 2022
Figure 15.4.1: The four main soft prompt methods differ in where they insert their learnable parameters. Prompt Tuning touches only the input embedding layer; Prefix Tuning and P-Tuning v2 inject learned key-value pairs into every attention layer; P-Tuning v1 uses a trainable encoder to place embeddings at chosen input positions.

A useful mental model: think of a soft prompt as a context-setting preamble that only the model's internals can read. You are teaching the model to adopt a "mode" for a given task without rewriting any of its core knowledge.

Hard Prompts vs. Soft Prompts

Hard prompts (discrete text tokens from Chapter 11) must live in a model's existing vocabulary and are human-readable. Soft prompts are arbitrary floating-point vectors: they can represent concepts that no single word captures, and they are found by gradient descent rather than human authorship. The trade-off is interpretability vs. expressiveness.

2. Prompt Tuning (Lester et al., 2021)

Prompt Tuning, introduced by Lester, Al-Rfou, and Constant at Google (2021), is the simplest member of the soft-prompt family. It prepends a small set of learnable vectors (the "soft prompt") to the token embeddings at the very first layer of the transformer. All other weights remain frozen. During forward and backward passes, only these prefix vectors accumulate gradients.

The parameter count is striking: for a prompt of 100 tokens on a model with a 4096-dimensional embedding, you are training 100 x 4096 = 409,600 parameters. For a 7B-parameter model, that is roughly 0.006% of model parameters. Storage and swapping costs are negligible.

The main finding of the original paper was that Prompt Tuning performance scales with model size. On small models (100M parameters), it significantly underperforms full fine-tuning. As models approach 10B parameters, Prompt Tuning closes the gap almost entirely. For practitioners using frontier-scale models, it can match fine-tuning performance at a fraction of the cost.

The following code (Code Fragment 15.4.1) shows how to set up Prompt Tuning with the HuggingFace PEFT library.

# Code Fragment 15.4.1: Prompt Tuning with HuggingFace PEFT
# Adds learnable tokens only at the input embedding layer
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

MODEL_NAME = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")

# Configure Prompt Tuning
# num_virtual_tokens: length of the soft prompt (10-100 is typical)
# prompt_tuning_init: TEXT initializes from a real string; RANDOM is also valid
peft_config = PromptTuningConfig(
 task_type=TaskType.CAUSAL_LM,
 prompt_tuning_init=PromptTuningInit.TEXT,
 prompt_tuning_init_text="Classify the sentiment of the following text:",
 num_virtual_tokens=20,
 tokenizer_name_or_path=MODEL_NAME,
)

# Wrap model: only the prompt embeddings are trainable
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Typical output: trainable params: 81,920 || all params: 1,235,814,400
# || trainable%: 0.0066

# Standard training loop (SFTTrainer or custom loop works unchanged)
# The PEFT wrapper transparently prepends the soft prompt during forward passes
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("imdb", split="train[:2000]")

training_args = TrainingArguments(
 output_dir="./prompt-tuning-output",
 num_train_epochs=3,
 per_device_train_batch_size=8,
 learning_rate=3e-2, # Higher LR than LoRA; soft prompts need it
 warmup_steps=100,
 logging_steps=50,
 save_strategy="epoch",
 fp16=True,
)

trainer = SFTTrainer(
 model=model,
 args=training_args,
 train_dataset=dataset,
 dataset_text_field="text",
 max_seq_length=512,
)
trainer.train()

# Saving: only the soft prompt vectors are stored (~320 KB for 20 tokens)
model.save_pretrained("./my-prompt-tuning-adapter")
Code Fragment 15.4.1: Prompt Tuning with HuggingFace PEFT. The PromptTuningConfig prepends 20 learnable virtual tokens (initialized from a text string) to every input. Only these ~320 KB of soft prompt embeddings are trained, leaving the full model frozen.
Initialization Matters

Initializing the soft prompt from a meaningful text string (as shown above) consistently outperforms random initialization, especially when training data is limited. The intuition: you are starting the optimizer in a region of embedding space that is already near a useful representation. Random initialization requires many more gradient steps to escape noise.

When to use Prompt Tuning: very large models (10B+), scenarios where you need dozens of task-specific adapters sharing one base model, and simple classification or generation tasks where the additional expressiveness of deeper methods is not required.

3. Prefix Tuning (Li & Liang, 2021)

Prefix Tuning, from Li and Liang at Stanford (2021), takes a more aggressive approach: it prepends learnable key-value pairs to every transformer layer's attention mechanism, not just the input. Each layer gets its own set of prefix tokens that directly modulate what the attention mechanism attends to at that depth.

The practical consequence: Prefix Tuning is far more expressive than Prompt Tuning. Because prefixes influence attention at every layer, the model can steer mid-level representations and not just the initial input conditioning. This makes a material difference on generation-heavy tasks such as table-to-text and summarization, where the output structure is complex and sequential.

A key implementation detail is reparameterization. Directly optimizing prefix parameters can be unstable. During training, the prefix vectors are generated by a small MLP applied to a lower-dimensional latent space. This MLP introduces smoother gradients and prevents collapse or divergence. At inference time, the MLP is discarded; only the final prefix key-value tensors are stored and used.

Parameter count sits roughly between 0.1% and 1% of model parameters, depending on prefix length and model depth. This is roughly 10-100x more than Prompt Tuning, but still far below LoRA.

# Code Fragment 15.4.2: Prefix Tuning with HuggingFace PEFT
# Injects learned KV pairs at every attention layer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType

MODEL_NAME = "google/flan-t5-large" # Encoder-decoder well-suited to prefix tuning

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# PrefixTuningConfig with reparameterization enabled by default
peft_config = PrefixTuningConfig(
 task_type=TaskType.SEQ_2_SEQ_LM,
 num_virtual_tokens=30, # Prefix length per layer
 encoder_hidden_size=512, # Dimension of the reparameterization MLP
 # prefix_projection=True by default: uses MLP during training,
 # discards it at inference and uses the final projected vectors
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 1,835,008 || all params: 783,150,080
# || trainable%: 0.2343 (about 0.23%)

# At inference, the MLP is transparent: model.forward() uses cached prefix KVs
# This means zero extra latency from the reparameterization MLP after training
print("Prefix Tuning config:", peft_config)
# PrefixTuningConfig(num_virtual_tokens=30, encoder_hidden_size=512, ...)
Code Fragment 15.4.2: Prefix Tuning with HuggingFace PEFT. Unlike Prompt Tuning, Prefix Tuning injects learned key-value pairs into every attention layer via a reparameterization MLP (encoder_hidden_size=512). After training, the MLP is discarded and only the cached prefix KVs are used at inference.
Encoder-Decoder vs. Decoder-Only

Prefix Tuning was originally evaluated on encoder-decoder models (T5, BART) for tasks like summarization and table-to-text translation, where it showed its greatest strengths. On decoder-only models (GPT-style), it also works but the gains over Prompt Tuning are less dramatic for simple tasks. If you are working on structured generation with flan-T5 or similar, Prefix Tuning is the natural starting point.

When to use Prefix Tuning: generation tasks requiring structural control (summarization, data-to-text, dialogue response generation), encoder-decoder architectures, and situations where Prompt Tuning underperforms due to small model size.

4. P-Tuning v1 (Liu et al., 2021)

P-Tuning (v1), from Liu et al. at Tsinghua University and MIT (2021), targets a different problem: can autoregressive models like GPT compete with BERT on natural language understanding (NLU) tasks, given the right prompting strategy?

The paper's core insight was that the GPT-NLU performance gap was not an architectural limitation but a prompting problem. Hard prompts for NLU tasks like "The capital of France is [MASK]" require the learnable tokens to appear at specific non-prefix positions in the input. But naively placing random embeddings in the middle of a sequence leads to poor optimization: nearby token embeddings interfere with gradient flow, and the learned vectors collapse to degenerate solutions.

P-Tuning v1 solves this by routing the learnable prompt tokens through a small prompt encoder: an LSTM (or MLP) that takes positional indices as input and produces contextualized embeddings. The LSTM's sequential structure provides smooth gradient signal and prevents the collapse problem.

The resulting approach allows learnable tokens to appear anywhere in the input template, not just as a prefix. A P-Tuning template for a knowledge-probing task might look like: [P1][P2] The capital of [P3][P4] is, where P1-P4 are learned vectors generated by the LSTM encoder.

# Code Fragment 15.4.3: P-Tuning v1 concept with HuggingFace PEFT
# Note: HuggingFace PEFT implements P-Tuning v1 as PromptEncoderConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptEncoderConfig, PromptEncoderReparameterizationType, get_peft_model, TaskType

MODEL_NAME = "gpt2-medium"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# PromptEncoderConfig maps to P-Tuning v1:
# encoder_reparameterization_type selects LSTM or MLP
peft_config = PromptEncoderConfig(
 task_type=TaskType.CAUSAL_LM,
 num_virtual_tokens=20,
 encoder_reparameterization_type=PromptEncoderReparameterizationType.LSTM,
 encoder_hidden_size=128, # LSTM hidden dimension
 encoder_num_layers=2, # Number of LSTM layers
 encoder_dropout=0.0,
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 307,340 || all params: 354,841,948
# Includes both the virtual token embeddings AND the LSTM encoder parameters

# The LSTM encoder is used during both forward and backward passes.
# At inference, the LSTM is still present (unlike Prefix Tuning's MLP,
# which is discarded). However, you can cache the encoder outputs once
# and reuse them for all inference calls on a given task.

# Example: manual encode to show the two-stage process
import torch
prompt_encoder = model.prompt_encoder["default"]
# Step 1: LSTM generates contextualized prompt embeddings from learned input embeddings
# Step 2: these embeddings are prepended to the token embeddings in model.forward()
Code Fragment 15.4.3: P-Tuning v1 with HuggingFace PEFT. The LSTM-based prompt encoder generates contextualized virtual token embeddings (rather than independent vectors), producing soft prompts where each token is influenced by its neighbors in the sequence.

The results from Liu et al. were striking at the time: GPT-style models, previously thought to be weak at NLU tasks without task-specific heads, matched or exceeded BERT on SuperGLUE benchmarks when guided by P-Tuning prompts. The conclusion: the bottleneck was prompt design, not model capability.

5. P-Tuning v2 (Liu et al., 2022)

P-Tuning v2, also from Liu et al. (2022), represents a convergence of ideas: it adopts the deep, layer-wise prefix injection strategy from Prefix Tuning but applies it to NLU tasks with a classification head, and demonstrates strong performance even at small model scales (as small as 300M parameters).

The key architectural features of P-Tuning v2 are:

# Code Fragment 15.4.4: P-Tuning v2 for sequence classification with PEFT
# Uses deep prefix tuning (all layers) with a classification head
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
import torch

MODEL_NAME = "bert-base-uncased" # Works on encoder models too

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# P-Tuning v2 for classification: use SEQ_CLS task type
# This adds a linear classification head on top of the frozen model
model = AutoModelForSequenceClassification.from_pretrained(
 MODEL_NAME,
 num_labels=2 # Binary classification
)

# P-Tuning v2 is PrefixTuningConfig with prefix_projection=False
# (direct optimization, no reparameterization MLP)
peft_config = PrefixTuningConfig(
 task_type=TaskType.SEQ_CLS,
 num_virtual_tokens=16,
 prefix_projection=False, # Direct prefix optimization (P-Tuning v2 style)
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# trainable params: 197,634 || all params: 109,682,178
# || trainable%: 0.1801

# Training with a standard classification objective
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np

dataset = load_dataset("glue", "sst2")

def tokenize(batch):
 return tokenizer(batch["sentence"], truncation=True, max_length=128, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

def compute_metrics(eval_pred):
 logits, labels = eval_pred
 predictions = np.argmax(logits, axis=-1)
 return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
 output_dir="./p-tuning-v2-sst2",
 num_train_epochs=5,
 per_device_train_batch_size=32,
 per_device_eval_batch_size=64,
 learning_rate=5e-3, # Higher than full fine-tuning; prefix params need it
 warmup_ratio=0.1,
 evaluation_strategy="epoch",
 save_strategy="best",
 load_best_model_at_end=True,
 metric_for_best_model="accuracy",
)

trainer = Trainer(
 model=model,
 args=training_args,
 train_dataset=tokenized["train"],
 eval_dataset=tokenized["validation"],
 compute_metrics=compute_metrics,
)
trainer.train()
Code Fragment 15.4.4: P-Tuning v2 for sequence classification with PEFT. Deep prefix tokens are injected at every layer (num_layers defaults to the full model depth), and a classification head is trained alongside the soft prompts for NLU tasks like sentiment analysis.
P-Tuning v2 vs. Prefix Tuning: Nearly Twin Architectures

P-Tuning v2 and Prefix Tuning share the same deep prefix injection structure. The meaningful differences are task focus (NLU with a classification head vs. NLG with language modeling loss), reparameterization (Prefix Tuning uses an MLP; P-Tuning v2 optimizes directly), and scale sensitivity (P-Tuning v2 claims consistent gains down to 300M parameters while Prompt Tuning degrades badly at small scale). In practice, when using the HuggingFace PEFT library, the distinction between the two comes down to the config settings shown above.

6. Comprehensive Comparison

The table below summarizes the key characteristics of each soft prompt method and places them alongside LoRA and QLoRA from Section 15.1 for context.

Method Parameters Where Applied Strengths Weaknesses Best For
Prompt Tuning
Lester et al. 2021
< 0.01% Input layer only Minimal storage; trivial to swap per task; near-zero inference overhead Requires very large models; underperforms on small models and complex tasks Large-scale models (10B+); simple classification and generation; multi-task serving
Prefix Tuning
Li & Liang 2021
0.1 - 1% All attention layers (KV) More expressive; strong on generation; reparameterization improves stability Higher parameter count than Prompt Tuning; MLP overhead during training Summarization; table-to-text; encoder-decoder models
P-Tuning v1
Liu et al. 2021
~0.01 - 0.1% Input layer (flexible positions) Non-prefix insertion; enables GPT-style models on NLU; LSTM encoder adds stability LSTM encoder present at inference; complex template design required Knowledge probing; NLU with GPT-style models; SuperGLUE-style tasks
P-Tuning v2
Liu et al. 2022
0.1 - 1% All layers (deep prefix) Strong NLU across model scales; no verbalizer needed; classification head Slightly more complex setup; less commonly used than LoRA in practice NER, POS, NLU tasks; small-to-medium models; sequence labeling
LoRA
Hu et al. 2021, Sec 15.1
0.1 - 2% Weight matrices (WQ, WV, etc.) Excellent performance across scales and tasks; well-studied; wide tooling support More parameters than Prompt Tuning; requires rank selection Most fine-tuning tasks; the dominant PEFT method in practice
QLoRA
Dettmers et al. 2023, Sec 15.1
0.1 - 2% + 4-bit base Weight matrices (quantized base) LoRA quality at 4-bit memory; enables 70B fine-tuning on single GPU Quantization overhead; limited hardware support for NF4 Large models on consumer hardware; resource-constrained fine-tuning

For practitioners making a practical choice, the following decision guide covers the most common scenarios:

Decision flowchart for choosing PEFT method: Prompt Tuning for 10B+ models, Prefix Tuning for generation tasks, P-Tuning v2 for NLU, and LoRA/QLoRA as the default
Figure 15.4.2: Decision guide for choosing among soft prompt methods and LoRA. Most practitioners should default to LoRA unless they have a specific reason to use a soft prompt approach.

7. Practical Considerations and Limitations

Soft prompt methods have unique practical characteristics that differ from weight-based methods like LoRA. Understanding these helps you avoid common pitfalls.

Training Dynamics

Soft prompts are notoriously sensitive to learning rate. They typically need much higher learning rates (1e-2 to 1e-1) than LoRA (1e-4 to 1e-3), because the prompt parameters must "move far" in embedding space relative to where they are initialized. Using a learning rate that works for LoRA will leave soft prompts nearly unchanged. Conversely, too-high a learning rate causes oscillation and poor convergence. A linear warmup of at least 100-200 steps is strongly recommended.

Multi-Task Serving

One of the most compelling use cases for soft prompts is multi-task serving with a single frozen base model. Each task gets its own soft prompt (a few hundred kilobytes at most), and switching tasks at inference time is a matter of swapping the prepended vectors. This pattern is far cheaper to operate than maintaining separate LoRA adapters when the number of tasks is very large (hundreds or thousands), since the per-task storage is negligible and the base model is never copied.

# Code Fragment 15.4.5: Multi-task soft prompt serving pattern
# One base model, many task-specific soft prompts loaded on demand
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "meta-llama/Llama-3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(
 BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Load multiple soft prompt adapters (each is ~200-500 KB)
task_adapters = {
 "sentiment": "./soft-prompts/sentiment-adapter",
 "summarization": "./soft-prompts/summary-adapter",
 "ner": "./soft-prompts/ner-adapter",
}

def run_task(task_name: str, text: str) -> str:
 """Load the appropriate soft prompt and run inference."""
 adapter_path = task_adapters[task_name]
 # PeftModel.from_pretrained loads the prompt vectors and
 # wraps the base model to prepend them during forward passes
 model = PeftModel.from_pretrained(base_model, adapter_path)
 model.eval()

 inputs = tokenizer(text, return_tensors="pt").to(model.device)
 with torch.no_grad():
 outputs = model.generate(**inputs, max_new_tokens=64)
 return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Switch tasks with no base model reload
result_a = run_task("sentiment", "The product exceeded all my expectations.")
result_b = run_task("summarization", "Long article text goes here...")
Code Fragment 15.4.5: Multi-task soft prompt serving pattern. A single frozen base model loads different soft prompt adapters on demand, switching between sentiment analysis and summarization tasks by swapping only the small prompt vectors. No model reload is needed between tasks.

Composability with LoRA

Soft prompts can be combined with LoRA adapters. A model can have both a LoRA adapter (modifying weight matrices) and a prefix tuning adapter (injecting prefix KVs) active simultaneously. This is supported by the PEFT library's multi-adapter mechanism. The use case is niche but real: use LoRA for domain adaptation and a soft prompt for task steering on top.

Limitations

Self-Check
Q1: What is the key architectural difference between Prompt Tuning and Prefix Tuning?
Show Answer
Prompt Tuning adds learnable vectors only at the input embedding layer (before the first transformer block). Prefix Tuning adds learnable key-value pairs to every attention layer in the model. This means Prefix Tuning is far more expressive: it can steer representations at every depth of the network, not just condition the initial input. The trade-off is more parameters and more complex training.
Q2: Why does Prompt Tuning use a text-based initialization rather than random initialization, and when does it matter most?
Show Answer
Text-based initialization places the soft prompt vectors in a region of embedding space that already encodes task-relevant meaning, giving the optimizer a head start. Random initialization requires many more gradient steps to reach a useful solution and often converges to worse optima. This matters most when training data is limited (fewer examples mean fewer gradient updates, so starting closer to the target is more important) and when the model is smaller (larger models have more redundant capacity and can compensate for poor initialization).
Q3: P-Tuning v1 demonstrated that GPT-style models could match BERT on NLU tasks. What was the key mechanism that enabled this?
Show Answer
The key mechanism was the LSTM-based prompt encoder, which generates stable, contextualized embeddings for virtual tokens placed at non-prefix positions in the input. Without the LSTM encoder, directly optimizing embeddings in the middle of an input sequence leads to unstable training because nearby token embeddings disrupt gradient flow. The LSTM provides smooth, well-conditioned gradients and prevents the collapse of learned representations. The finding showed the performance gap between GPT and BERT on NLU was a prompting problem, not an architectural limitation.
Q4: You are building a multi-tenant API that needs to serve a single LLM for 500 different enterprise customers, each with a custom task. Would you use soft prompts or LoRA adapters, and why?
Show Answer
Soft prompts (specifically Prompt Tuning) are likely the better choice here. Each soft prompt is under 1 MB, so storing 500 of them costs less than 500 MB total. Swapping between prompts at inference time is a cheap tensor concatenation operation. LoRA adapters, while still small (5-50 MB each), are larger and require modifying weight matrices at inference time. However, if the base model is below 10B parameters and task performance is critical, LoRA will likely outperform Prompt Tuning; in that case, a serving framework like LoRAX (which manages hot-swappable LoRA adapters) would be the better choice.
Key Takeaways
Research Frontier

Recent work explores transferable soft prompts: can a soft prompt trained for task A be transferred to task B by composing it with a small residual adapter? This would allow a library of reusable prompt building blocks rather than training from scratch for each task.

Separately, work on prompt distillation attempts to compress a long soft prompt into a shorter one with equivalent task performance, reducing the memory and latency overhead during inference. These directions suggest soft prompts may evolve into composable, transferable primitives rather than monolithic per-task artifacts.

Fun Fact

The original Prompt Tuning paper from Google included a striking experiment: they trained a single T5-XXL model with thousands of task-specific soft prompts and showed it could serve them all with no degradation, using a total storage overhead smaller than a single fine-tuned model copy. The result was a compelling early vision of "one model, many personalities" that remains highly relevant for large-scale serving infrastructure today.

What Comes Next

In the next chapter, Chapter 16: Knowledge Distillation & Model Merging, we explore knowledge distillation and model merging: techniques for creating smaller, specialized models from larger ones, and for combining multiple fine-tuned models without retraining.

Bibliography
Soft Prompt Methods

Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021.

Introduces Prompt Tuning: prepending a small number of learnable tokens to the input embedding layer. The core finding is that performance scales with model size, with the method matching full fine-tuning at 11B parameters. Essential reading for anyone considering soft prompts for large-model serving, as it establishes the scaling law that governs when Prompt Tuning is viable.

Foundational

Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021.

Proposes adding learnable key-value pairs to every transformer layer for generation tasks, with a reparameterization MLP for stable training. Demonstrates strong results on table-to-text and summarization with less than 0.1% of model parameters modified. The reparameterization trick for prefix optimization is widely adopted in subsequent work.

Generation Tasks

Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021). GPT Understands, Too. AI Open, 2023.

Introduces P-Tuning v1, which uses an LSTM-based prompt encoder to generate stable embeddings for virtual tokens inserted at flexible positions in the input. Shows that GPT-style autoregressive models can match or exceed BERT on NLU benchmarks when prompted correctly, challenging the assumption that encoder-only models are necessary for understanding tasks.

NLU Benchmark

Liu, X., Ji, K., Fu, Y., Tam, W. L., Du, Z., Yang, Z., & Tang, J. (2022). P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. ACL 2022.

Extends P-Tuning to deep prefix injection across all layers with a classification head, showing strong NLU performance at small model scales (300M parameters) where Prompt Tuning fails. Demonstrates that soft prompt methods can be competitive with full fine-tuning universally, not just at large scales. Key reference for practitioners working on NLU tasks with limited model size.

Deep Prompting

Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., & Bossan, B. (2022). PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. Hugging Face.

The HuggingFace PEFT library implements all methods in this section (PromptTuningConfig, PrefixTuningConfig, PromptEncoderConfig) under a unified API. Practitioners should use this library as the primary interface for soft prompt training and deployment. The library handles reparameterization, adapter saving, and multi-adapter composition transparently.

Library