Soft Prompts: Prompt Tuning, Prefix Tuning, and P-Tuning

Section 17.4

"A well-placed soft prompt does not change what the model knows. It changes what the model is trying to do."

LoRA, Softly Persuasive AI Agent
Big Picture

Soft prompt methods occupy a fascinating middle ground between prompt engineering and fine-tuning. Instead of choosing discrete text tokens that a human could read and interpret, soft prompt methods learn continuous, real-valued vectors in embedding space. These learned vectors are prepended to the input (or inserted into hidden layers) and steer the model toward a desired behavior without modifying the base model's weights. The family includes Prompt Tuning (input layer only, extremely lightweight), Prefix Tuning (all attention layers, more expressive), P-Tuning v1 (encoder-generated embeddings, bridging NLU and NLG), and P-Tuning v2 (deep prefix, strong NLU performance at all scales). Together they form a spectrum: as you add parameters, you gain task performance but reduce the elegance of a near-zero overhead approach. Cross-reference: Section 17.1 (LoRA), Section 17.2 (Advanced PEFT), Chapter 12 (Prompt Engineering).

Prerequisites

This section assumes familiarity with the LoRA and QLoRA techniques from Section 17.1 and the advanced PEFT methods in Section 17.2. You should understand how transformer attention works (including layer normalization) and be familiar with discrete prompt engineering from Chapter 12. Background in fine-tuning fundamentals from Chapter 16 is also helpful.

17.4.1 The Soft Prompt Family: A Taxonomy

A soft prompt is a set of learnable continuous vectors that are concatenated with the model's input representations and updated through gradient descent. Unlike the discrete tokens used in Chapter 12's prompt engineering, soft prompts exist only as floating-point tensors: they have no natural language interpretation, cannot be read by a human, and are discovered entirely by the optimizer.

Concretely, suppose your hard prompt is the string "Classify the sentiment of:" which the tokenizer encodes as 5 tokens that look up 5 vectors in the embedding table (say, 4096-dim each). Each of those 5 vectors is the embedding of a real word and lives at a fixed point in embedding space. A soft prompt replaces those with 5 (or 20, or 100) free-floating 4096-dim vectors that are not constrained to match any word's embedding. The optimizer is free to push them anywhere in R^4096; the model still concatenates them with the user's input and runs the transformer as usual. The "prompt" looks like gibberish if you try to decode it back to tokens, but it can steer the model more effectively than hand-written wording.

Key Insight

Soft prompts reveal something profound about LLMs: the optimal instruction for a task is not necessarily expressible in human language. When you let the optimizer search freely in embedding space, it often finds vectors that steer the model more effectively than the discrete prompts people write. These vectors often correspond to no real words at all. This suggests that the "language" models respond to best is not English or any natural language; it is the geometry of their own embedding space.

Warning
Common Mistake: Confusing Soft Prompts with Prompt Engineering

Despite the similar names, soft prompts and prompt engineering (covered in Chapter 12) are fundamentally different techniques. Prompt engineering crafts human-readable text instructions. Soft prompts are learned continuous vectors with no natural language interpretation; they cannot be read, shared as text, or transferred between different models. Soft prompts require gradient-based training, while prompt engineering requires only API access. Additionally, soft prompts are tied to a specific model checkpoint: if the base model is updated, the soft prompts must be retrained.

The key architectural question that distinguishes each method is where the learned vectors are inserted:

Figure 17.4.1: The four main soft prompt methods differ in where they insert their learnable parameters. Prompt Tuning touches only the input embedding layer; Prefix Tuning and P-Tuning v2 inject learned key-value pairs into every attention layer; P-Tuning v1 uses a trainable encoder to place embeddings at chosen input positions.

A useful mental model: think of a soft prompt as a context-setting preamble that only the model's internals can read. You are teaching the model to adopt a "mode" for a given task without rewriting any of its core knowledge.

Key Insight: Hard Prompts vs. Soft Prompts

Hard prompts (discrete text tokens from Chapter 12) must live in a model's existing vocabulary and are human-readable. Soft prompts are arbitrary floating-point vectors: they can represent concepts that no single word captures, and they are found by gradient descent rather than human authorship. The trade-off is interpretability vs. expressiveness.

17.4.2 Prompt Tuning (Lester et al., 2021)

Prompt Tuning, introduced by Lester, Al-Rfou, and Constant at Google (2021), is the simplest member of the soft-prompt family. It prepends a small set of learnable vectors (the "soft prompt") to the token embeddings at the very first layer of the transformer. All other weights remain frozen. During forward and backward passes, only these prefix vectors accumulate gradients.

The parameter count is striking: for a prompt of 100 tokens on a model with a 4096-dimensional embedding, you are training 100 x 4096 = 409,600 parameters. For a 7B-parameter model, that is roughly 0.006% of model parameters. Storage and swapping costs are negligible.

The main finding of the original paper was that Prompt Tuning performance scales with model size. On small models (100M parameters), it significantly underperforms full fine-tuning. As models approach 10B parameters, Prompt Tuning closes the gap almost entirely. For practitioners using frontier-scale models, it can match fine-tuning performance at a fraction of the cost.

Formally, given the original input embeddings $E(x) \in \mathbb{R}^{L \times d}$ for a sequence of length $L$, Prompt Tuning prepends $k$ learnable virtual embeddings $P_\theta \in \mathbb{R}^{k \times d}$ and feeds the concatenation to a frozen language model $\text{LM}_\Phi$:

$$y = \text{LM}_\Phi\!\big([\,P_\theta\, ;\, E(x)\,]\big), \qquad \nabla\mathcal{L} \text{ updates only } P_\theta$$

The trainable parameter count is exactly $k \cdot d$ (e.g., $20 \times 2048 \approx 41\text{k}$ for a 2B model), independent of model depth. All gradients flow back through the frozen network into the $k$ prepended vectors only.
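To make the update rule concrete, here is a minimal sketch in plain Python. It assumes a toy stand-in for the frozen model (mean pooling followed by a fixed linear readout), which is not a transformer; the names `frozen_lm` and `train_soft_prompt` are hypothetical. The point is only that the loss falls while nothing except $P_\theta$ is ever written to.

```python
def frozen_lm(sequence, w):
    """Toy stand-in for LM_Phi: mean-pool the sequence, then a frozen readout w."""
    n = len(sequence)
    pooled = [sum(col) / n for col in zip(*sequence)]
    return sum(p * wi for p, wi in zip(pooled, w))


def train_soft_prompt(E_x, w, k=2, target=1.0, lr=1.0, steps=100):
    """Gradient descent on P only; E_x and w are read but never modified."""
    d = len(w)
    P = [[0.0] * d for _ in range(k)]      # k learnable virtual embeddings
    n = k + len(E_x)
    losses = []
    for _ in range(steps):
        y = frozen_lm(P + E_x, w)          # forward pass on [P ; E(x)]
        err = y - target
        losses.append(err * err)           # squared-error loss
        # d(loss)/dP[i][j] = 2 * err * w[j] / n for every prompt row i
        for row in P:
            for j in range(d):
                row[j] -= lr * 2.0 * err * w[j] / n
    return P, losses


w = [0.5, -1.0, 0.25, 0.75]                # frozen "model" weights
E_x = [[0.1, 0.2, 0.3, 0.4],               # frozen input embeddings (3 tokens)
       [0.0, 0.5, -0.2, 0.1],
       [0.3, -0.1, 0.2, 0.0]]
P, losses = train_soft_prompt(E_x, w)
print(len(P) * len(P[0]), "trainable parameters (k * d)")
print("loss decreased:", losses[-1] < losses[0])
```

In a real setup the gradient comes from backpropagation through the frozen transformer rather than a closed-form expression, but the bookkeeping is the same: the optimizer holds $k \cdot d$ numbers and nothing else.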

The "why" behind the scaling effect. A soft prompt steers the model through its existing representations rather than changing what those representations are. A small model has not yet learned a rich enough latent space for clever steering to substitute for actually updating the weights: there may be no direction in 768-dimensional GPT-2 embedding space that reliably re-purposes the model for sentiment classification. A 10B-parameter model, by contrast, has already absorbed sentiment, sarcasm, formality, and topic into its latent geometry during pretraining; the soft prompt only has to locate the right region and pull the model into it. Empirically this means Prompt Tuning's parameter efficiency is unlocked by scale: the same 100 vectors are far less effective at 100M parameters and competitive at 10B because what changes is the underlying space they navigate, not the prompt itself.

The following code (Code Fragment 17.4.1) shows how to set up Prompt Tuning with the Hugging Face PEFT library.

# Adds learnable tokens only at the input embedding layer
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType
MODEL_NAME = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype="auto")
# Configure Prompt Tuning
# num_virtual_tokens: length of the soft prompt (10-100 is typical)
# prompt_tuning_init: TEXT initializes from a real string; RANDOM is also valid
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of the following text:",
    num_virtual_tokens=20,
    tokenizer_name_or_path=MODEL_NAME,
)
# Wrap model: only the prompt embeddings are trainable
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Typical output (20 virtual tokens x 2048 hidden dims for Llama-3.2-1B):
# trainable params: 40,960 || all params: 1,235,855,360 || trainable%: 0.0033
# Standard training loop (SFTTrainer or custom loop works unchanged)
# The PEFT wrapper transparently prepends the soft prompt during forward passes
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
dataset = load_dataset("imdb", split="train[:2000]")
training_args = TrainingArguments(
    output_dir="./prompt-tuning-output",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=3e-2, # Higher LR than LoRA; soft prompts need it
    warmup_steps=100,
    logging_steps=50,
    save_strategy="epoch",
    fp16=True,
)
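# Note: recent TRL releases expect dataset_text_field and the sequence-length
# limit on SFTConfig rather than as SFTTrainer keyword arguments; adjust to
# match your installed version.
trainer = SFTTrainer(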
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
# Saving: only the soft prompt vectors are stored
# (20 x 2048 x 4 bytes = ~160 KB in FP32)
model.save_pretrained("./my-prompt-tuning-adapter")
Code Fragment 17.4.1: Prompt Tuning with Hugging Face PEFT. The PromptTuningConfig prepends 20 learnable virtual tokens (initialized from a text string) to every input. Only these soft prompt embeddings (about 160 KB in FP32 for 20 x 2048 values) are trained, leaving the full model frozen.
Tip: Initialization Matters

Initializing the soft prompt from a meaningful text string (as shown above) often outperforms random initialization, especially when training data is limited or the base model is small; at the largest model scales the gap tends to shrink. The intuition: you are starting the optimizer in a region of embedding space that is already near a useful representation. Random initialization requires many more gradient steps to escape noise.
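The sketch below illustrates one common convention for text initialization: tokenize the init string, then repeat or truncate the token ids until there are exactly `num_virtual_tokens` of them, and copy the matching rows out of the embedding table. The helper names and the tiny embedding table are hypothetical; a library such as PEFT handles this step internally.

```python
import math


def init_token_ids(text_token_ids, num_virtual_tokens):
    """Repeat or truncate the init text's token ids to exactly num_virtual_tokens."""
    if not text_token_ids:
        raise ValueError("initialization text produced no tokens")
    reps = math.ceil(num_virtual_tokens / len(text_token_ids))
    return (text_token_ids * reps)[:num_virtual_tokens]


def init_soft_prompt(embedding_table, text_token_ids, num_virtual_tokens):
    """Copy the selected rows so that training never writes into the frozen table."""
    ids = init_token_ids(text_token_ids, num_virtual_tokens)
    return [list(embedding_table[i]) for i in ids]


# Hypothetical 3-word vocabulary with 2-dim embeddings.
embedding_table = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
soft_prompt = init_soft_prompt(embedding_table, text_token_ids=[2, 0], num_virtual_tokens=3)
print(soft_prompt)
```

Copying the rows matters: the soft prompt starts at the same point as real word embeddings but is free to drift away from them, while the embedding table itself stays frozen.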

Worked Example: Adapter Footprint and Storage for 50 Tasks

Suppose a team serves 50 task-specific adapters on top of Llama-3.1-70B (hidden size $d = 8192$) and chooses $k = 20$ virtual tokens per task. Each soft prompt is a single dense matrix of $20 \times 8192 = 163{,}840$ parameters, or about $0.66$ MB in FP32 (half that in BF16). Storing all 50 adapters costs $50 \times 0.66 \approx 33$ MB, which fits comfortably in host or GPU memory next to the base model, so switching tasks amounts to selecting a different small tensor. Compare this to LoRA at rank 16 on the same backbone (roughly $25$ MB per adapter, so $1.25$ GB for 50 tasks; the exact size depends on which modules are targeted) or a full fine-tune ($140$ GB per task in BF16, so $7$ TB for 50). The trainable share is $163{,}840 / (70 \times 10^9) \approx 0.00023\%$ of the base model. A budget that small cannot rewrite what the model knows; it can only steer it, which is why Prompt Tuning's scaling claim holds only once the frozen base is rich enough to be "steered" rather than "rewritten."
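The arithmetic in this example is easy to reproduce for other models and prompt lengths. The helper below is a small sketch with hypothetical function names; it counts 1 MB as $10^6$ bytes and ignores any file-format overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}


def soft_prompt_footprint(k, d, num_tasks=1, dtype="fp32"):
    """Return (parameters per task, total megabytes across all tasks)."""
    params = k * d
    megabytes = params * BYTES_PER_PARAM[dtype] * num_tasks / 1e6
    return params, megabytes


def trainable_share_percent(params, base_params):
    """Trainable parameters as a percentage of the frozen base model."""
    return 100.0 * params / base_params


params, one_task_mb = soft_prompt_footprint(k=20, d=8192)
_, fifty_tasks_mb = soft_prompt_footprint(k=20, d=8192, num_tasks=50)
share = trainable_share_percent(params, 70e9)
print(params, round(one_task_mb, 2), round(fifty_tasks_mb, 1), f"{share:.5f}%")
```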

When to use Prompt Tuning: very large models (10B+), scenarios where you need dozens of task-specific adapters sharing one base model, and simple classification or generation tasks where the additional expressiveness of deeper methods is not required.

17.4.3 Prefix Tuning (Li & Liang, 2021)

Prefix Tuning, from Li and Liang at Stanford (2021), takes a more aggressive approach: it prepends learnable key-value pairs to every transformer layer's attention mechanism, not just the input. Each layer gets its own set of prefix tokens that directly modulate what the attention mechanism attends to at that depth.

The practical consequence: Prefix Tuning is far more expressive than Prompt Tuning. Because prefixes influence attention at every layer, the model can steer mid-level representations and not just the initial input conditioning. This makes a material difference on generation-heavy tasks such as table-to-text and summarization, where the output structure is complex and sequential.

The "why" behind layer-wise prefixes. A transformer's depth is also a hierarchy of abstraction: early layers handle syntax and token identity, middle layers compose phrases, and late layers shape discourse-level structure. A Prompt-Tuning vector applied only at layer 0 must rely on the frozen network to propagate its influence through all this re-abstraction, and that signal degrades the deeper it travels. Prefix Tuning re-injects task-specific context at every layer, so it can nudge the summarizer at the layer where summarization-style reasoning actually happens. This is why the gain over Prompt Tuning is largest on structured generation: the right intervention is at the layer where the structure is being assembled, not at the input.

A key implementation detail is reparameterization. Directly optimizing prefix parameters can be unstable. During training, the prefix vectors are generated by a small MLP applied to a lower-dimensional latent space. This MLP introduces smoother gradients and prevents collapse or divergence. At inference time, the MLP is discarded; only the final prefix key-value tensors are stored and used.
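The following plain-Python sketch shows the shape of this idea under simplifying assumptions: a small latent table is pushed through a two-layer tanh MLP, and each output row is sliced into per-layer key and value vectors. The class name `ReparamPrefix`, the layer sizes, and the random weights are all hypothetical, and no training happens here; the aim is only to show what is generated during training and what is kept afterwards.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Small random weight matrix (n_in x n_out) and a zero bias."""
    W = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out


def linear(rows, W, b):
    """Apply y = x W + b to every row x."""
    return [[sum(x * W[i][j] for i, x in enumerate(row)) + b[j]
             for j in range(len(b))] for row in rows]


class ReparamPrefix:
    """Latent table -> MLP -> per-layer prefix keys and values."""

    def __init__(self, k, d, num_layers, latent_dim, hidden, seed=0):
        rng = random.Random(seed)
        self.d, self.num_layers = d, num_layers
        self.latent = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)] for _ in range(k)]
        self.W1, self.b1 = make_linear(latent_dim, hidden, rng)
        self.W2, self.b2 = make_linear(hidden, 2 * num_layers * d, rng)

    def forward(self):
        hidden = [[math.tanh(v) for v in row]
                  for row in linear(self.latent, self.W1, self.b1)]
        flat = linear(hidden, self.W2, self.b2)      # k x (2 * num_layers * d)
        prefixes = []
        for layer in range(self.num_layers):
            start = 2 * layer * self.d
            keys = [row[start:start + self.d] for row in flat]
            values = [row[start + self.d:start + 2 * self.d] for row in flat]
            prefixes.append((keys, values))
        return prefixes


reparam = ReparamPrefix(k=3, d=4, num_layers=2, latent_dim=2, hidden=5)
cached_prefixes = reparam.forward()   # the tensors that would be stored after training
del reparam                           # the MLP itself is not needed at inference
print(len(cached_prefixes), "layers of cached (keys, values)")
```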

Formally, at each transformer layer $\ell$, Prefix Tuning injects learnable key and value tensors $P^{(\ell)}_K, P^{(\ell)}_V \in \mathbb{R}^{k \times d}$ into the attention computation:

$$\text{Attn}^{(\ell)}(Q, K, V) = \operatorname{softmax}\!\Big(\frac{Q\, [P^{(\ell)}_K;\, K]^{\top}}{\sqrt{d}}\Big)\, [P^{(\ell)}_V;\, V]$$

where $Q$, $K$, $V$ are the standard projections from the frozen weights and the prefix entries are prepended along the sequence dimension. The trainable parameter count scales as $2 \cdot L \cdot k \cdot d$, where $L$ here denotes the number of layers (not the sequence length used in Section 17.4.2). That is a factor of $2L$ over Prompt Tuning's $k \cdot d$, which explains why Prefix Tuning ends up 10 to 100 times larger even though both use the same prefix length $k$.
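A minimal single-head version of this computation can be sketched in plain Python. It assumes no causal mask and no multi-head split, and the function names and toy numbers are hypothetical; the only point is that the prefix rows enter as extra keys and values while the queries are untouched.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head; returns (outputs, weights)."""
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return outputs, weights


def prefix_attention(Q, K, V, P_K, P_V):
    """Prepend learned prefix keys/values along the sequence dimension."""
    return attention(Q, P_K + K, P_V + V)


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
P_K = [[2.0, 2.0]]        # one prefix key (k = 1)
P_V = [[10.0, 10.0]]      # one prefix value

plain_out, plain_w = attention(Q, K, V)
prefix_out, prefix_w = prefix_attention(Q, K, V, P_K, P_V)
print(len(plain_w[0]), "->", len(prefix_w[0]), "attended positions per query")
```

Each query now distributes its attention over $k + n$ positions instead of $n$, so the learned prefix values can pull every output row toward a task-specific direction.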

Parameter count sits roughly between 0.1% and 1% of model parameters, depending on prefix length and model depth. This is roughly 10-100x more than Prompt Tuning, and in the same range as or below a typical LoRA configuration.

# Injects learned KV pairs at every attention layer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
MODEL_NAME = "google/flan-t5-large" # Encoder-decoder well-suited to prefix tuning
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
# Reparameterization is opt-in: PEFT's prefix_projection defaults to False
peft_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    num_virtual_tokens=30, # Prefix length per layer
    prefix_projection=True, # Train the prefix through an MLP
    encoder_hidden_size=512, # Hidden width of the reparameterization MLP
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# With prefix_projection=True the trainable count is dominated by the MLP's
# up-projection (512 -> 2 * 24 * 1024 outputs), i.e. tens of millions of
# parameters during training. With prefix_projection=False PEFT trains the
# prefix table directly: 30 * 2 * 24 * 1024 = 1,474,560 parameters (~0.19%).
# After save_pretrained() and reload, only the projected prefix tensors remain,
# so the MLP adds no inference latency.
print("Prefix Tuning config:", peft_config)
# PrefixTuningConfig(num_virtual_tokens=30, encoder_hidden_size=512, ...)
Code Fragment 17.4.2: Prefix Tuning with Hugging Face PEFT. Unlike Prompt Tuning, Prefix Tuning injects learned key-value pairs into every attention layer via a reparameterization MLP (encoder_hidden_size=512). After training, the MLP is discarded and only the cached prefix KVs are used at inference.
Note: Encoder-Decoder vs. Decoder-Only

Prefix Tuning was originally evaluated on encoder-decoder models (T5, BART) for tasks like summarization and table-to-text translation, where it showed its greatest strengths. On decoder-only models (GPT-style), it also works but the gains over Prompt Tuning are less dramatic for simple tasks. If you are working on structured generation with flan-T5 or similar, Prefix Tuning is the natural starting point.

Worked Example
Why Prefix Tuning Trains 96x More Parameters Than Prompt Tuning

Consider flan-T5-large with $L = 24$ encoder layers and 24 decoder layers, hidden dimension $d = 1024$, and a prefix length of $k = 30$. Prompt Tuning trains a single $k \times d = 30 \times 1024 = 30{,}720$ parameter table. Prefix Tuning injects $k$ key-value vectors at every layer of the encoder and the decoder, giving a parameter count of $2 \cdot L_{\text{enc+dec}} \cdot k \cdot d = 2 \cdot 48 \cdot 30 \cdot 1024 \approx 2.95$ M, which is 96 times larger. With the reparameterization MLP enabled (an $r_{\text{enc}} = 512$ bottleneck projected up to $2 L_{\text{enc+dec}} d$ outputs), the training-time parameter count is larger still, because the up-projection alone holds $512 \times 2 L_{\text{enc+dec}} d \approx 50$ M weights. At inference the MLP is discarded and only the final cached prefix KVs persist, so deployment memory matches the raw $2 L k d$ figure. Libraries may count differently: PEFT, for instance, sizes the prefix from a single 24-layer stack, so it reports half the raw figure for this model. This per-layer capacity is the structural reason Prefix Tuning tends to beat Prompt Tuning on summarization, while its deployed prefix still costs less than 0.4% of full fine-tuning's parameter budget.
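The counts for this worked example can be checked with a few lines of Python. The function names are hypothetical, and the MLP layout assumed for the training-time count (a $k \times d$ table, a $d \to r$ layer, and an $r \to 2Ld$ layer, each with a bias) is one plausible arrangement rather than a fixed standard.

```python
def prompt_tuning_params(k, d):
    """One k x d table at the input layer."""
    return k * d


def prefix_tuning_params(k, d, num_layers):
    """One key and one value vector per prefix position per layer."""
    return 2 * num_layers * k * d


def reparam_training_params(k, d, num_layers, hidden):
    """Assumed layout: k x d table, Linear(d -> hidden), Linear(hidden -> 2*L*d)."""
    table = k * d
    down = d * hidden + hidden
    up = hidden * (2 * num_layers * d) + 2 * num_layers * d
    return table + down + up


k, d, layers = 30, 1024, 48      # flan-T5-large, encoder + decoder layers
prompt = prompt_tuning_params(k, d)
prefix = prefix_tuning_params(k, d, layers)
with_mlp = reparam_training_params(k, d, layers, hidden=512)
print(prompt, prefix, prefix // prompt, with_mlp)
```

Changing `layers`, `k`, or `hidden` shows how quickly the up-projection dominates the training-time budget, while the deployed prefix depends only on $2 L k d$.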

When to use Prefix Tuning: generation tasks requiring structural control (summarization, data-to-text, dialogue response generation), encoder-decoder architectures, and situations where Prompt Tuning underperforms due to small model size.

17.4.4 P-Tuning v1 (Liu et al., 2021)

P-Tuning (v1), from Liu et al. at Tsinghua University and MIT (2021), targets a different problem: can autoregressive models like GPT compete with BERT on natural language understanding (NLU) tasks, given the right prompting strategy?

The paper's core insight was that the GPT-NLU performance gap was not an architectural limitation but a prompting problem. Templates for NLU tasks, such as "The capital of France is [MASK]", interleave prompt tokens with the input, so the learnable tokens must appear at specific non-prefix positions. But naively placing randomly initialized embeddings in the middle of a sequence leads to poor optimization: nearby token embeddings interfere with gradient flow, and the learned vectors collapse to degenerate solutions.

P-Tuning v1 solves this by routing the learnable prompt tokens through a small prompt encoder: an LSTM (or MLP) that takes positional indices as input and produces contextualized embeddings. The LSTM's sequential structure provides smooth gradient signal and prevents the collapse problem.

The resulting approach allows learnable tokens to appear anywhere in the input template, not just as a prefix. A P-Tuning template for a knowledge-probing task might look like: [P1][P2] The capital of [P3][P4] is, where P1-P4 are learned vectors generated by the LSTM encoder.
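The template mechanics can be sketched as follows. This is an illustration only: `contextualize` is a neighbor-averaging stand-in for the LSTM prompt encoder (not an LSTM), the 2-dim embeddings are made up, and all names are hypothetical. What it shows is how encoder outputs and ordinary token embeddings are interleaved according to the template.

```python
def contextualize(raw_prompts):
    """Stand-in for the prompt encoder: mix each vector with its neighbors."""
    mixed = []
    for i in range(len(raw_prompts)):
        window = raw_prompts[max(i - 1, 0):i + 2]
        mixed.append([sum(col) / len(window) for col in zip(*window)])
    return mixed


def build_sequence(template, token_embeddings, prompt_vectors):
    """Assemble the model input from ("P", index) and ("tok", word) items."""
    sequence = []
    for kind, key in template:
        if kind == "P":
            sequence.append(prompt_vectors[key])
        elif kind == "tok":
            sequence.append(token_embeddings[key])
        else:
            raise ValueError(f"unknown template item: {kind}")
    return sequence


# Hypothetical frozen embeddings for the real tokens in the template.
token_embeddings = {"The": [1.0, 0.0], "capital": [0.0, 1.0],
                    "of": [1.0, 1.0], "is": [0.5, 0.5]}
raw_prompts = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]   # learnable P1..P4
prompt_vectors = contextualize(raw_prompts)

# [P1][P2] The capital of [P3][P4] is
template = [("P", 0), ("P", 1), ("tok", "The"), ("tok", "capital"), ("tok", "of"),
            ("P", 2), ("P", 3), ("tok", "is")]
sequence = build_sequence(template, token_embeddings, prompt_vectors)
print(len(sequence), "input vectors, 4 of them learned")
```

Because each encoder output depends on its neighbors, the four virtual tokens are optimized jointly rather than as independent free vectors, which is the property the LSTM provides in the original method.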

# Note: HuggingFace PEFT implements P-Tuning v1 as PromptEncoderConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptEncoderConfig, PromptEncoderReparameterizationType, get_peft_model, TaskType
MODEL_NAME = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# PromptEncoderConfig maps to P-Tuning v1:
# encoder_reparameterization_type selects LSTM or MLP
peft_config = PromptEncoderConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    encoder_reparameterization_type=PromptEncoderReparameterizationType.LSTM,
    encoder_hidden_size=128, # LSTM hidden dimension
    encoder_num_layers=2, # Number of LSTM layers
    encoder_dropout=0.0,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# The reported trainable count includes both the virtual token embeddings
# AND the LSTM encoder parameters
# The LSTM encoder is used during both forward and backward passes.
# At inference, the LSTM is still present (unlike Prefix Tuning's MLP,
# which is discarded). However, you can cache the encoder outputs once
# and reuse them for all inference calls on a given task.
# Example: manual encode to show the two-stage process
import torch
prompt_encoder = model.prompt_encoder["default"]
# Step 1: LSTM generates contextualized prompt embeddings from learned input embeddings
# Step 2: these embeddings are prepended to the token embeddings in model.forward()
Code Fragment 17.4.3: P-Tuning v1 with Hugging Face PEFT. The LSTM-based prompt encoder generates contextualized virtual token embeddings (rather than independent vectors), producing soft prompts where each token is influenced by its neighbors in the sequence.

The results from Liu et al. were striking at the time: GPT-style models, previously thought to be weak at NLU tasks without task-specific heads, matched or exceeded BERT on SuperGLUE benchmarks when guided by P-Tuning prompts. The conclusion: the bottleneck was prompt design, not model capability.

17.4.5 P-Tuning v2 (Liu et al., 2022)

P-Tuning v2, also from Liu et al. (2022), represents a convergence of ideas: it adopts the deep, layer-wise prefix injection strategy from Prefix Tuning but applies it to NLU tasks with a classification head, and demonstrates strong performance even at small model scales (as small as 300M parameters).

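The key architectural features of P-Tuning v2 are deep prompts injected at every layer, direct optimization of the prefix without a reparameterization MLP, and a linear classification head in place of a verbalizer. The following code shows this setup with PEFT.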

# Uses deep prefix tuning (all layers) with a classification head
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PrefixTuningConfig, get_peft_model, TaskType
import torch
MODEL_NAME = "bert-base-uncased" # Works on encoder models too
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# P-Tuning v2 for classification: use SEQ_CLS task type
# This adds a linear classification head on top of the frozen model
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2 # Binary classification
    )
# P-Tuning v2 is PrefixTuningConfig with prefix_projection=False
# (direct optimization, no reparameterization MLP)
peft_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_CLS,
    num_virtual_tokens=16,
    prefix_projection=False, # Direct prefix optimization (P-Tuning v2 style)
    )
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Expect about 16 * 2 * 12 * 768 = 294,912 prefix parameters plus the
# 2-way classification head (768 * 2 + 2 = 1,538), roughly 0.27% of the model
# Training with a standard classification objective
from transformers import TrainingArguments, Trainer
from datasets import load_dataset
import numpy as np
dataset = load_dataset("glue", "sst2")
def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128, padding="max_length")
tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("label", "labels")
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
training_args = TrainingArguments(
    output_dir="./p-tuning-v2-sst2",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=5e-3, # Higher than full fine-tuning; prefix params need it
    warmup_ratio=0.1,
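    eval_strategy="epoch", # Named evaluation_strategy in older transformers releases
    save_strategy="epoch", # Must match eval_strategy for load_best_model_at_end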
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    )
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    compute_metrics=compute_metrics,
    )
trainer.train()
Code Fragment 17.4.4: P-Tuning v2 for sequence classification with PEFT. Deep prefix tokens are injected at every layer (num_layers defaults to the full model depth), and a classification head is trained alongside the soft prompts for NLU tasks like sentiment analysis.
Key Insight
P-Tuning v2 vs. Prefix Tuning: Nearly Twin Architectures

P-Tuning v2 and Prefix Tuning share the same deep prefix injection structure. The meaningful differences are task focus (NLU with a classification head vs. NLG with language modeling loss), reparameterization (Prefix Tuning uses an MLP; P-Tuning v2 optimizes directly), and scale sensitivity (P-Tuning v2 claims consistent gains down to 300M parameters while Prompt Tuning degrades badly at small scale). In practice, when using the Hugging Face PEFT library, the distinction between the two comes down to the config settings shown above.

17.4.6 Comprehensive Comparison

The table below summarizes the key characteristics of each soft prompt method and places them alongside LoRA and QLoRA from Section 17.1 for context.

Table 17.4.1: Key characteristics of each soft prompt method, alongside LoRA and QLoRA from Section 17.1.

| Method | Parameters | Where Applied | Strengths | Weaknesses | Best For |
|---|---|---|---|---|---|
| Prompt Tuning (Lester et al. 2021) | < 0.01% | Input layer only | Minimal storage; trivial to swap per task; near-zero inference overhead | Requires very large models; underperforms on small models and complex tasks | Large-scale models (10B+); simple classification and generation; multi-task serving |
| Prefix Tuning (Li & Liang 2021) | 0.1 - 1% | All attention layers (KV) | More expressive; strong on generation; reparameterization improves stability | Higher parameter count than Prompt Tuning; MLP overhead during training | Summarization; table-to-text; encoder-decoder models |
| P-Tuning v1 (Liu et al. 2021) | ~0.01 - 0.1% | Input layer (flexible positions) | Non-prefix insertion; enables GPT-style models on NLU; LSTM encoder adds stability | LSTM encoder present at inference; complex template design required | Knowledge probing; NLU with GPT-style models; SuperGLUE-style tasks |
| P-Tuning v2 (Liu et al. 2022) | 0.1 - 1% | All layers (deep prefix) | Strong NLU across model scales; no verbalizer needed; classification head | Slightly more complex setup; less commonly used than LoRA in practice | NER, POS, NLU tasks; small-to-medium models; sequence labeling |
| LoRA (Hu et al. 2021, Sec 17.1) | 0.1 - 2% | Weight matrices ($W_Q$, $W_V$, etc.) | Excellent performance across scales and tasks; well-studied; wide tooling support | More parameters than Prompt Tuning; requires rank selection | Most fine-tuning tasks; the dominant PEFT method in practice |
| QLoRA (Dettmers et al. 2023, Sec 17.1) | 0.1 - 2% + 4-bit base | Weight matrices (quantized base) | LoRA quality at 4-bit memory; enables 70B fine-tuning on single GPU | Quantization overhead; limited hardware support for NF4 | Large models on consumer hardware; resource-constrained fine-tuning |

For practitioners making a practical choice, the following decision guide covers the most common scenarios:

[Figure: decision flowchart for choosing a PEFT method: Prompt Tuning for 10B+ models, Prefix Tuning for generation tasks, P-Tuning v2 for NLU, and LoRA/QLoRA as the default.]
Figure 17.4.2: Decision guide for choosing among soft prompt methods and LoRA. Most practitioners should default to LoRA unless they have a specific reason to use a soft prompt approach.

17.4.7 Practical Considerations and Limitations

Soft prompt methods have unique practical characteristics that differ from weight-based methods like LoRA. Understanding these helps you avoid common pitfalls.

Training Dynamics

Soft prompts are notoriously sensitive to learning rate. They typically need much higher learning rates (1e-2 to 1e-1) than LoRA (1e-4 to 1e-3), because the prompt parameters must "move far" in embedding space relative to where they are initialized. Using a learning rate that works for LoRA will leave soft prompts nearly unchanged. Conversely, too high a learning rate causes oscillation and poor convergence. A linear warmup of at least 100-200 steps is strongly recommended.
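As a rough illustration of the recommended schedule, the sketch below ramps the learning rate linearly to a peak and then decays it linearly to zero. The function name, the decay shape, and the default values are assumptions chosen for the example; training frameworks ship their own schedulers with slightly different conventions.

```python
def lr_at_step(step, peak_lr=3e-2, warmup_steps=100, total_steps=1000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / remaining, 1.0)
    return peak_lr * (1.0 - progress)


for step in (0, 50, 99, 550, 1000):
    print(step, lr_at_step(step))
```

During the first 100 steps the soft prompt moves in small increments, which limits the early oscillation that a peak rate of 3e-2 would otherwise cause on freshly initialized vectors.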

Tip
Prompt dampening as a softer alternative to full masking

When training soft prompts on instruction data, the conventional recipe sets the loss to zero over the prompt tokens (including the soft-prompt positions and the user instruction) and computes cross-entropy only on the response. A softer alternative, sometimes called prompt dampening, multiplies the prompt-token loss by a small factor $\lambda \in (0, 1)$ (typical values $0.1$ to $0.3$) instead of zeroing it. The soft prompt then receives a faint but non-zero learning signal from the prompt positions, which can stabilize optimization when the soft prompt has to also encode formatting or style cues that show up in the prompt itself. The same trick applies to full SFT (see Section 16.3) and to mixed prompt-tuning + LoRA stacks. Use full masking ($\lambda = 0$) as the default and try dampening only if the prompt distribution drifts measurably during training.
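The weighting can be sketched on per-token losses as below. The function name is hypothetical, and normalizing by the sum of weights (so that $\lambda = 0$ reduces to the mean loss over response tokens) is an assumption of this sketch; other normalizations are possible.

```python
def dampened_loss(token_losses, is_prompt, lam=0.0):
    """Weighted mean of per-token losses: prompt tokens weigh lam, response tokens 1."""
    if len(token_losses) != len(is_prompt):
        raise ValueError("token_losses and is_prompt must have the same length")
    weights = [lam if prompt else 1.0 for prompt in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no tokens contribute to the loss")
    return sum(w * loss for w, loss in zip(weights, token_losses)) / total_weight


token_losses = [2.0, 4.0, 1.0, 3.0]        # cross-entropy per position
is_prompt = [True, True, False, False]      # first two positions are prompt tokens

print(dampened_loss(token_losses, is_prompt, lam=0.0))   # full masking
print(dampened_loss(token_losses, is_prompt, lam=0.2))   # dampened
```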

Multi-Task Serving

One of the most compelling use cases for soft prompts is multi-task serving with a single frozen base model. Each task gets its own soft prompt (a few hundred kilobytes at most), and switching tasks at inference time is a matter of swapping the prepended vectors. This pattern is far cheaper to operate than maintaining separate LoRA adapters when the number of tasks is very large (hundreds or thousands), since the per-task storage is negligible and the base model is never copied.

# One base model, many task-specific soft prompts loaded on demand
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
BASE_MODEL = "meta-llama/Llama-3.2-3B"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
    )
# Load multiple soft prompt adapters (each is ~200-500 KB)
task_adapters = {
    "sentiment": "./soft-prompts/sentiment-adapter",
    "summarization": "./soft-prompts/summary-adapter",
    "ner": "./soft-prompts/ner-adapter",
    }
def run_task(task_name: str, text: str) -> str:
    """Load the appropriate soft prompt and run inference."""
    adapter_path = task_adapters[task_name]
    # PeftModel.from_pretrained loads the prompt vectors and
    # wraps the base model to prepend them during forward passes
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
# Switch tasks with no base model reload (module level, outside run_task)
result_a = run_task("sentiment", "The product exceeded all my expectations.")
result_b = run_task("summarization", "Long article text goes here...")
Code Fragment 17.4.5: Multi-task soft prompt serving pattern. A single frozen base model loads different soft prompt adapters on demand, switching between sentiment analysis and summarization tasks by swapping only the small prompt vectors. No model reload is needed between tasks.

Composability with LoRA

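Soft prompts can be combined with LoRA adapters: LoRA modifies weight matrices while a soft prompt or prefix conditions the input or the attention layers, so the two do not conflict in principle. Support for activating both through the PEFT library's multi-adapter mechanism varies by version and adapter type, so a dependable route is to merge the trained LoRA weights into the base model first and then train or load the soft prompt on top. The use case is niche but real: use LoRA for domain adaptation and a soft prompt for task steering on top.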

Limitations

Fun Fact

The original Prompt Tuning paper from Google made a striking observation about serving: because every task shares the same frozen T5-XXL weights and differs only in a tiny prompt, examples from different tasks can be mixed in a single batch, and the per-task storage is orders of magnitude smaller than a fine-tuned model copy. This was a compelling early vision of "one model, many personalities" that remains highly relevant for large-scale serving infrastructure today.

Research Frontier

Recent work explores transferable soft prompts: can a soft prompt trained for task A be transferred to task B by composing it with a small residual adapter? This would allow a library of reusable prompt building blocks rather than training from scratch for each task.

Separately, work on prompt distillation attempts to compress a long soft prompt into a shorter one with equivalent task performance, reducing the memory and latency overhead during inference. These directions suggest soft prompts may evolve into composable, transferable primitives rather than monolithic per-task artifacts.

Key Takeaways
Self-Check
Q1: What is the key architectural difference between Prompt Tuning and Prefix Tuning?
Answer
Prompt Tuning adds learnable vectors only at the input embedding layer (before the first transformer block). Prefix Tuning adds learnable key-value pairs to every attention layer in the model. This means Prefix Tuning is far more expressive: it can steer representations at every depth of the network, not just condition the initial input. The trade-off is more parameters and more complex training.
Q2: Why does Prompt Tuning use a text-based initialization rather than random initialization, and when does it matter most?
Answer
Text-based initialization places the soft prompt vectors in a region of embedding space that already encodes task-relevant meaning, giving the optimizer a head start. Random initialization requires many more gradient steps to reach a useful solution and often converges to worse optima. This matters most when training data is limited (fewer examples mean fewer gradient updates, so starting closer to the target is more important) and when the model is smaller (larger models have more redundant capacity and can compensate for poor initialization).
Q3: P-Tuning v1 demonstrated that GPT-style models could match BERT on NLU tasks. What was the key mechanism that enabled this?
Answer
The key mechanism was the LSTM-based prompt encoder, which generates stable, contextualized embeddings for virtual tokens placed at non-prefix positions in the input. Without the LSTM encoder, directly optimizing embeddings in the middle of an input sequence leads to unstable training because nearby token embeddings disrupt gradient flow. The LSTM provides smooth, well-conditioned gradients and prevents the collapse of learned representations. The finding showed the performance gap between GPT and BERT on NLU was a prompting problem, not an architectural limitation.
Q4: You are building a multi-tenant API that needs to serve a single LLM for 500 different enterprise customers, each with a custom task. Would you use soft prompts or LoRA adapters, and why?
Answer
Soft prompts (specifically Prompt Tuning) are likely the better choice here. Each soft prompt is under 1 MB, so storing 500 of them costs less than 500 MB total. Swapping between prompts at inference time is a cheap tensor concatenation operation. LoRA adapters, while still small (5-50 MB each), are larger and require modifying weight matrices at inference time. However, if the base model is below 10B parameters and task performance is critical, LoRA will likely outperform Prompt Tuning; in that case, a serving framework like LoRAX (which manages hot-swappable LoRA adapters) would be the better choice.
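
The storage figures in the Q4 answer can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes fp16 storage (2 bytes per parameter), a 100-token soft prompt, a hidden size of 4096, 32 layers, and LoRA at rank 8 on two square projection matrices per layer. The function names are hypothetical helpers introduced for illustration.

```python
# Back-of-the-envelope adapter storage, assuming fp16 (2 bytes per parameter).
# All sizes below are illustrative assumptions, not measurements.

def soft_prompt_bytes(n_virtual: int, hidden: int, bytes_per_param: int = 2) -> int:
    """Prompt Tuning stores one hidden-size vector per virtual token."""
    return n_virtual * hidden * bytes_per_param


def lora_bytes(rank: int, hidden: int, n_layers: int,
               n_target_matrices: int = 2, bytes_per_param: int = 2) -> int:
    """LoRA stores A (hidden x rank) and B (rank x hidden) per target matrix per layer."""
    per_matrix = hidden * rank + rank * hidden
    return per_matrix * n_target_matrices * n_layers * bytes_per_param


N_TENANTS = 500
prompt_mb = soft_prompt_bytes(n_virtual=100, hidden=4096) / 1e6
lora_mb = lora_bytes(rank=8, hidden=4096, n_layers=32) / 1e6

print(f"soft prompt: {prompt_mb:.2f} MB each, {N_TENANTS * prompt_mb:.0f} MB for all tenants")
print(f"LoRA r=8:    {lora_mb:.2f} MB each, {N_TENANTS * lora_mb:.0f} MB for all tenants")
```

Under these assumptions a soft prompt comes to roughly 0.8 MB and a rank-8 LoRA adapter to roughly 8.4 MB, which is consistent with the "under 1 MB" and "5-50 MB" ranges quoted in the answer.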

Exercises

Exercise 17.4.1: Soft Prompts vs LoRA (Conceptual)

Both prompt tuning and LoRA freeze the base model and learn a small set of new parameters. (a) What fundamentally distinguishes soft prompts from LoRA? (b) Why did LoRA become the default for production fine-tuning while prompt tuning largely faded? (c) Name one task category where prompt tuning still has an edge.

Answer Sketch

(a) Soft prompts add learnable embeddings at the input (or at every layer for prefix tuning), leaving model weights untouched and letting attention pull the prompt's information into the residual stream. LoRA modifies the weight matrices themselves via low-rank updates. (b) LoRA generalizes better to harder tasks, has higher capacity per learnable parameter, integrates cleanly with quantized base models (QLoRA), and the resulting adapter can be merged into the base weights so that inference adds no extra latency. Prompt tuning required care to converge, didn't scale well past simple classification, and gave smaller per-parameter quality gains. (c) Prompt tuning still wins for very-many-task multi-tenancy (you serve thousands of small task adapters cheaply: each is just 1-100 vectors), and for tasks where you must absolutely guarantee no weight modification (e.g., licensing constraints).

Exercise 17.4.2: Parameter Count Calculation (Calculation)

For a 7B model with hidden size d=4096 and 32 layers: (a) how many trainable parameters does prompt tuning with 20 virtual tokens add? (b) prefix tuning with 20 prefix vectors at every layer? (c) LoRA with rank 8 on the q and v projections of every layer? Compare as fractions of the 7B base.

Answer Sketch

(a) Prompt tuning: 20 x 4096 = 81,920 params (~0.001% of 7B). (b) Prefix tuning: 20 x 4096 x 2 (key+value) x 32 layers = 5,242,880 params (~0.07%). (c) LoRA r=8 on q,v: per layer = 2 matrices x (4096 x 8 + 8 x 4096) = 131,072; total = 131,072 x 32 = 4,194,304 params (~0.06%). Prefix tuning and LoRA are in the same parameter ballpark, but LoRA distributes its capacity multiplicatively into the weight matrices rather than additively into the activations, which is why it tends to outperform at the same parameter budget.
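
The arithmetic in the answer sketch above can be reproduced in a few lines. This is a minimal sketch under the exercise's own assumptions: the q and v projections are treated as square 4096 x 4096 matrices, the prefix tuning count excludes any reparameterization MLP used only during training, and the base model is rounded to exactly 7 billion parameters. The helper names are hypothetical.

```python
# Trainable-parameter counts for the three methods in the exercise.
# Assumes square hidden x hidden projections and a 7B-parameter base model.

BASE_PARAMS = 7_000_000_000


def prompt_tuning_params(n_virtual: int, hidden: int) -> int:
    """One learnable vector per virtual token, at the input layer only."""
    return n_virtual * hidden


def prefix_tuning_params(n_prefix: int, hidden: int, n_layers: int) -> int:
    """One key vector and one value vector per prefix position per layer."""
    return n_prefix * hidden * 2 * n_layers


def lora_params(rank: int, hidden: int, n_layers: int, n_target_matrices: int = 2) -> int:
    """A (hidden x rank) plus B (rank x hidden) for each target matrix in each layer."""
    return n_target_matrices * (hidden * rank + rank * hidden) * n_layers


counts = {
    "prompt tuning (20 tokens)": prompt_tuning_params(20, 4096),
    "prefix tuning (20 prefixes)": prefix_tuning_params(20, 4096, 32),
    "LoRA r=8 on q,v": lora_params(8, 4096, 32),
}

for name, count in counts.items():
    print(f"{name}: {count:,} params ({100 * count / BASE_PARAMS:.4f}% of base)")
```

The printed counts match the answer sketch: 81,920, then 5,242,880, then 4,194,304 parameters.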

Exercise 17.4.3: Implement a Soft-Prompt Wrapper (Code Tweak)

Sketch a 10-line PyTorch wrapper that adds N learnable tokens at the start of any input embedding for a Hugging Face Llama model. The wrapper should: (a) define an nn.Parameter of shape (N, hidden_size); (b) prepend it to the input embedding inside forward; (c) freeze every other parameter.

Answer Sketch
import torch
from torch import nn

class PromptedLlama(nn.Module):
    def __init__(self, base, n_virtual=20):
        super().__init__()
        self.base = base
        # Freeze every base-model weight so only the soft prompt trains.
        for param in base.parameters():
            param.requires_grad = False
        # Learnable soft prompt: one row per virtual token.
        self.soft = nn.Parameter(torch.randn(n_virtual, base.config.hidden_size) * 0.02)

    def forward(self, input_ids, attention_mask):
        emb = self.base.get_input_embeddings()(input_ids)
        # Broadcast the prompt across the batch and match the embedding dtype.
        soft = self.soft.to(emb.dtype).unsqueeze(0).expand(emb.size(0), -1, -1)
        x = torch.cat([soft, emb], dim=1)
        # Extend the mask with ones (same dtype and device as the input mask)
        # so the soft tokens are attended to.
        prefix_mask = torch.ones(emb.size(0), self.soft.size(0),
                                 dtype=attention_mask.dtype, device=attention_mask.device)
        am = torch.cat([prefix_mask, attention_mask], dim=1)
        return self.base(inputs_embeds=x, attention_mask=am)
Code Fragment 17.4.6: The PromptedLlama wrapper holds the base model frozen and prepends a learnable (n_virtual, hidden_size) parameter to the input embeddings, with a matching attention_mask extension so the soft tokens participate in attention. The 0.02 init scale keeps the soft tokens inside the model's natural embedding magnitude.

The non-obvious choice is whether to initialize the soft prompt randomly or from real token embeddings; the latter often converges in 5-10x fewer steps because the optimization starts inside the model's natural embedding manifold.
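
Initializing from real embedding values can be sketched as follows. This is an illustrative sketch, not the method of any particular paper or library: it samples random vocabulary ids and copies their embedding rows into a new trainable parameter. The helper name init_soft_prompt_from_vocab and the toy 1000 x 64 embedding table are assumptions made so the example runs without downloading a model; with a Hugging Face model the table would come from get_input_embeddings().

```python
import torch
from torch import nn


def init_soft_prompt_from_vocab(embedding: nn.Embedding, n_virtual: int, seed: int = 0) -> nn.Parameter:
    """Build a soft prompt by copying the embeddings of n_virtual sampled vocabulary ids."""
    gen = torch.Generator().manual_seed(seed)
    # Sample distinct token ids so no two virtual tokens start out identical.
    ids = torch.randperm(embedding.num_embeddings, generator=gen)[:n_virtual]
    # detach + clone: the soft prompt becomes its own leaf tensor,
    # so training it never touches the frozen embedding table.
    return nn.Parameter(embedding.weight.detach()[ids].clone())


# Toy stand-in for a model's embedding table (hypothetical sizes).
toy_embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64)
soft = init_soft_prompt_from_vocab(toy_embedding, n_virtual=20)
print(soft.shape, soft.requires_grad)
```

In a wrapper that holds the soft prompt as an nn.Parameter, the returned parameter would take the place of the scaled random initialization. Sampling ids uniformly is the simplest variant; choosing ids of task-relevant words is another option.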

Exercise 17.4.4: Soft-Prompt Failure Modes (Failure Mode)

You train soft prompts for a sentiment task, get 92% accuracy on the dev set, but the model produces gibberish on out-of-distribution inputs (longer sequences, code-mixed text). List three failure modes specific to soft prompts and a mitigation for each.

Answer Sketch

(1) Length sensitivity: soft prompts learn to interact with attention positions seen during training; with much longer inputs the prompt's effective influence dilutes. Mitigation: train with mixed-length batches and include up to 2x the longest expected production length. (2) Domain rigidity: soft prompts encode task identity narrowly; OOD inputs miss the right activation pattern. Mitigation: train with noisy or augmented prompts, or use prefix tuning (which inserts at every layer, increasing robustness). (3) Vanishing in long contexts: soft prompts at the input get attended-away after thousands of tokens. Mitigation: use prefix tuning or P-Tuning v2 so the learned vectors persist throughout the network. The general lesson: soft prompts are the lightest PEFT method but also the brittlest, especially for distributional generalization.
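
Length sensitivity is easier to catch before deployment if dev-set accuracy is reported per input-length bucket instead of as a single number. The sketch below is one hypothetical way to do that; the function name, the bucket edges, and the example results are all made up for illustration.

```python
from collections import defaultdict


def accuracy_by_length(results, bucket_edges):
    """Group (n_tokens, is_correct) pairs into length buckets and return accuracy per bucket.

    bucket_edges are ascending inclusive upper bounds; anything longer than the
    last edge falls into a final overflow bucket.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for n_tokens, is_correct in results:
        label = next((f"<={edge}" for edge in bucket_edges if n_tokens <= edge),
                     f">{bucket_edges[-1]}")
        totals[label] += 1
        hits[label] += int(is_correct)
    return {label: hits[label] / totals[label] for label in totals}


# Hypothetical dev-set results: (input length in tokens, prediction correct?)
dev_results = [(40, True), (55, True), (120, True), (130, False),
               (600, False), (900, True), (1500, False), (2200, False)]

report = accuracy_by_length(dev_results, bucket_edges=[64, 256, 1024])
for bucket, acc in report.items():
    print(f"{bucket:>6}: {acc:.2f}")
```

A sharp drop in the longest buckets, as in this made-up data, is the signal to add longer examples to training or to move to a deeper method such as prefix tuning.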

What Comes Next

In the next section, Section 17.5: Knowledge Distillation, we explore knowledge distillation and model merging: techniques for creating smaller, specialized models from larger ones, and for combining multiple fine-tuned models without retraining.

Further Reading

Soft Prompt Methods

Lester, B., Al-Rfou, R., & Constant, N. (2021). The Power of Scale for Parameter-Efficient Prompt Tuning. EMNLP 2021. Introduces Prompt Tuning: prepending a small number of learnable tokens to the input embedding layer. The core finding is that performance scales with model size, with the method matching full fine-tuning at 11B parameters. Useful for anyone considering soft prompts for large-model serving, as it establishes the scaling law that governs when Prompt Tuning is viable.
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. ACL 2021. Proposes adding learnable key-value pairs to every transformer layer for generation tasks, with a reparameterization MLP for stable training. Demonstrates strong results on table-to-text and summarization with less than 0.1% of model parameters modified. The reparameterization trick for prefix optimization is widely adopted in subsequent work.
Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., & Tang, J. (2021). GPT Understands, Too. arXiv preprint, 2021; later published in AI Open, 2023. Introduces P-Tuning v1, which uses an LSTM-based prompt encoder to generate stable embeddings for virtual tokens inserted at flexible positions in the input. Shows that GPT-style autoregressive models can match or exceed BERT on NLU benchmarks when prompted correctly, challenging the assumption that encoder-only models are necessary for understanding tasks.
Liu, X., Ji, K., Fu, Y., Tam, W. L., Du, Z., Yang, Z., & Tang, J. (2022). P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks. ACL 2022. Extends P-Tuning to deep prefix injection across all layers with a classification head, showing strong NLU performance at small model scales (300M parameters) where Prompt Tuning fails. Demonstrates that soft prompt methods can be competitive with full fine-tuning universally, not just at large scales. Key reference for practitioners working on NLU tasks with limited model size.
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., & Bossan, B. (2022). PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. Hugging Face. The Hugging Face PEFT library implements all methods in this section (PromptTuningConfig, PrefixTuningConfig, PromptEncoderConfig) under a unified API. Practitioners should use this library as the primary interface for soft prompt training and deployment. The library handles reparameterization, adapter saving, and multi-adapter composition transparently.