"A model that memorizes is a model that leaks. The question is not if, but when, and how much."
A Cautious Guard, Leak-Conscious AI Agent
Large language models memorize portions of their training data, and adversaries can extract that data through carefully crafted queries. Carlini et al. (2021) demonstrated that GPT-2 could be prompted to emit verbatim training sequences, including names, phone numbers, email addresses, and code snippets. This is not a bug in a specific model; it is an inherent property of how neural language models learn. Larger models memorize more, and fine-tuned models memorize their fine-tuning data at even higher rates than the base model memorizes its pretraining corpus. This section covers the attack landscape (extraction attacks, membership inference, attribute inference), the theoretical foundation of differential privacy, and the practical application of DP-SGD for privacy-preserving fine-tuning using the Opacus library. It also addresses PII detection and mitigation strategies that complement formal privacy guarantees with practical defense-in-depth.
Prerequisites
This section expands on the privacy foundations introduced in Section 47.1: LLM Licensing, IP & Privacy and the security threat landscape from Section 47.1. Understanding fine-tuning fundamentals (Section 16.1) is essential for the DP-SGD sections. The GDPR requirements discussed in Section 47.1 provide the regulatory motivation for the technical defenses presented here.
50.1.1 Training Data Extraction Attacks
Training data extraction attacks exploit the fact that language models assign higher probability to sequences they have seen during training. An attacker generates a large number of candidate sequences (either by prompting the model or by sampling from it) and then identifies which outputs are likely verbatim memorized content. The key insight from Carlini et al. (2021) is the distinction between eidetic memorization (the model can reproduce a sequence exactly when given the right prefix) and extractable memorization (the sequence can be recovered through systematic probing without knowing the prefix).
In the landmark Carlini et al. extraction study, the researchers recovered a person's full name, email address, phone number, and physical address by prompting GPT-2 with just the first few words of a memorized training sequence. The model essentially acted as a search engine for its own training data, except nobody had built a privacy policy for that feature.
The attack surface increases with model size. Larger models have greater capacity to memorize rare sequences, and they exhibit lower perplexity on memorized content, making extraction easier. Fine-tuned models are particularly vulnerable because the fine-tuning dataset is typically much smaller than the pretraining corpus, so each example receives far more gradient updates and is memorized more thoroughly.
# Measuring memorization: perplexity-based extraction detection
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
def measure_memorization(
model_name: str,
candidate_texts: list[str],
device: str = "cuda",
) -> list[dict]:
"""Score candidate texts by likelihood under the model.
High-likelihood (low-perplexity) texts are more likely memorized."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()
results = []
for text in candidate_texts:
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
loss = outputs.loss.item()
perplexity = np.exp(loss)
results.append({
"text_prefix": text[:80] + "...",
"perplexity": round(perplexity, 2),
"loss": round(loss, 4),
"num_tokens": inputs["input_ids"].shape[1],
})
# Sort by perplexity (lowest = most likely memorized)
results.sort(key=lambda x: x["perplexity"])
return results
# Example: check if specific sequences are memorized
candidates = [
"The quick brown fox jumps over the lazy dog.",
# In a real attack, these would be suspected training data
"def fibonacci(n):\n if n <= 1:\n return n",
]
# results = measure_memorization("gpt2", candidates)
# End-to-end privacy pipeline: scrub, train with DP, filter outputs
from dataclasses import dataclass
from typing import Optional
@dataclass
class PrivacyConfig:
# Preprocessing
scrub_pii: bool = True
deduplicate: bool = True
min_token_count: int = 3 # Remove near-unique sequences
# Training
use_dp_sgd: bool = True
target_epsilon: float = 8.0
target_delta: float = 1e-5
max_grad_norm: float = 1.0
# Output filtering
scan_outputs_for_pii: bool = True
block_verbatim_training_matches: bool = True
verbatim_threshold: int = 50 # Char length for match detection
def privacy_aware_pipeline(config: PrivacyConfig, raw_data: list[str]):
"""Orchestrate a privacy-preserving fine-tuning pipeline."""
# Stage 1: Data preprocessing
data = raw_data
if config.scrub_pii:
data = [scrub_pii(text) for text in data]
print(f"PII scrubbing: processed {len(data)} samples")
if config.deduplicate:
data = list(set(data)) # Simplified; use MinHash in production
print(f"After dedup: {len(data)} samples")
# Stage 2: DP fine-tuning (pseudocode)
if config.use_dp_sgd:
print(f"Training with DP-SGD: epsilon={config.target_epsilon}")
# ... Opacus training loop from Section 3 ...
else:
print("Training WITHOUT differential privacy (not recommended)")
# Stage 3: Output filter setup
if config.scan_outputs_for_pii:
print("Output PII scanner: ENABLED")
if config.block_verbatim_training_matches:
print(f"Verbatim match blocking: threshold={config.verbatim_threshold}")
return data # Return processed dataset
config = PrivacyConfig(target_epsilon=8.0)
# processed = privacy_aware_pipeline(config, raw_training_data)
PrivacyConfig dataclass codifies three layers of privacy defenses (preprocessing, DP-SGD training, output filtering) as a single configurable object. The privacy_aware_pipeline function runs them in order and prints which stages were active, making it explicit when a deployment opts out of any layer.Memorization is not uniformly distributed. Models preferentially memorize content that is (1) repeated multiple times in the training data, (2) highly structured (phone numbers, email addresses, URLs), (3) distinctive or unusual in context, and (4) present in smaller fine-tuning datasets. Carlini et al. (2023) found that data duplicated 10x in the training set is extractable at 10x higher rates than unique sequences. This has a direct practical implication: deduplicating your training data is both a data quality measure and a privacy measure. The data processing techniques from Section 6.1 (deduplication, filtering) serve double duty as privacy defenses.
50.1.2 Membership Inference Attacks
Membership inference attacks (MIAs) answer a binary question: was a specific data point used in training this model? Unlike extraction attacks (which recover the data itself), MIAs reveal whether an individual's data was included in the training set. This is a privacy violation in itself: knowing that someone's medical records were used to train a model reveals information about their health status, even without seeing the records.
The core technique exploits the observation that models behave differently on data they were trained on (members) versus data they have never seen (non-members). Members typically have lower loss, higher confidence, and different gradient patterns. The attacker trains a binary classifier (the "attack model") to distinguish members from non-members based on these signals.
# Membership Inference Attack: loss-threshold method
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score, precision_recall_curve
def compute_per_sample_loss(
model, tokenizer, texts: list[str], device: str = "cuda",
) -> np.ndarray:
"""Compute per-sample cross-entropy loss."""
model.eval()
losses = []
for text in texts:
inputs = tokenizer(
text, return_tensors="pt", truncation=True, max_length=512,
).to(device)
with torch.no_grad():
outputs = model(**inputs, labels=inputs["input_ids"])
losses.append(outputs.loss.item())
return np.array(losses)
def membership_inference_attack(
model_name: str,
member_texts: list[str], # Known training data samples
nonmember_texts: list[str], # Known non-training data samples
device: str = "cuda",
) -> dict:
"""Run a loss-threshold membership inference attack."""
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
member_losses = compute_per_sample_loss(
model, tokenizer, member_texts, device
)
nonmember_losses = compute_per_sample_loss(
model, tokenizer, nonmember_texts, device
)
# Lower loss = more likely a member
# Negate losses so higher score = more likely member
scores = np.concatenate([-member_losses, -nonmember_losses])
labels = np.concatenate([
np.ones(len(member_losses)),
np.zeros(len(nonmember_losses)),
])
auc = roc_auc_score(labels, scores)
precision, recall, _ = precision_recall_curve(labels, scores)
return {
"auc_roc": round(auc, 4),
"member_mean_loss": round(float(member_losses.mean()), 4),
"nonmember_mean_loss": round(float(nonmember_losses.mean()), 4),
"loss_gap": round(
float(nonmember_losses.mean() - member_losses.mean()), 4
),
}
loss_gap close to zero means the model has not overfit; a large gap (or AUC > 0.8) indicates the attacker can reliably identify training examples.A random-guessing attacker achieves AUC-ROC of 0.5. In practice, membership inference against large pretrained models typically achieves AUC between 0.55 and 0.65, a modest but statistically significant advantage. Against fine-tuned models, the attack is far more effective, often achieving AUC above 0.80, because the fine-tuning data is overfit to a much greater degree.
Algorithm: SHADOW-MODEL-MIA
Input: Target model f_target,
auxiliary unlabeled corpus D_aux from the target's distribution,
number of shadow models K,
shadow training size n_train
Output: Attack model A: score -> {member, non-member}
// Phase 1: Build labeled attack data using shadow models
attack_data = empty list
For k = 1 to K:
Sample D_k^in of size n_train from D_aux (in-set)
Sample D_k^out of size n_train from D_aux \ D_k^in (held out)
Train shadow model f_k with same architecture as f_target on D_k^in
For each x in D_k^in:
append ( features(f_k, x), label = MEMBER ) to attack_data
For each x in D_k^out:
append ( features(f_k, x), label = NON-MEMBER ) to attack_data
// Phase 2: Train the attack classifier
Train A on attack_data
// features(f, x) commonly = ( -loss(f, x), top-1 prob, prediction entropy, ... )
// Phase 3: Attack the target
For each candidate x:
Return A( features(f_target, x) )The simplest feature vector is the per-sample loss $\ell(f, x) = -\log p_f(x)$; thresholding $\ell$ at the shadow-set median already achieves $0.60\text{–}0.75$ AUC on a fine-tuned GPT-2. Better attacks (LiRA, Carlini et al., 2022) compute the per-example likelihood ratio $\Lambda(x) = p(\ell(f_{\mathrm{target}}, x) \mid x \in D_{\mathrm{train}})\,/\,p(\ell(f_{\mathrm{target}}, x) \mid x \notin D_{\mathrm{train}})$ using shadow-model loss distributions, which yields the strongest known attacks at the low-FPR regime relevant to GDPR (Shokri et al., 2017).
DP-SGD modifies the standard mini-batch gradient update in exactly two places. Let $g_i(\theta) = \nabla_\theta \ell(\theta; x_i)$ be the per-example gradient on example $x_i$. Standard SGD averages: $\theta \leftarrow \theta - \eta \cdot \tfrac{1}{|B|}\sum_{i \in B} g_i$. DP-SGD instead clips per-example, sums, adds Gaussian noise, then averages:
where $\operatorname{clip}_C(g) = g \cdot \min\!\bigl(1, C/\|g\|_2\bigr)$ bounds the L2 norm of every per-example contribution to $C$. The cumulative $(\varepsilon, \delta)$ budget across $T$ steps is tracked by the Renyi-DP moments accountant (Abadi et al., 2016):
so doubling $T$ increases $\varepsilon$ by $\sqrt{2}$; doubling the noise multiplier $\sigma$ halves it. This is why DP-SGD pairs naturally with parameter-efficient fine-tuning (LoRA): fewer trainable parameters means fewer steps to convergence and a smaller cumulative $\varepsilon$ for the same utility.
50.1.3 Differential Privacy in Fine-Tuning
Differential privacy (DP) provides a mathematical guarantee that the output of a computation does not depend "too much" on any single input record. Formally, a mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two datasets $D$ and $D'$ that differ in a single record, and for any set of outputs $S$:
The privacy budget $\varepsilon$ (epsilon) controls the strength of the guarantee. Smaller epsilon means stronger privacy but greater utility loss. In practice, $\varepsilon$ values between 1 and 10 are common for fine-tuning, with $\delta$ set to the reciprocal of the training set size. An $\varepsilon$ of 1 is considered strong privacy; $\varepsilon$ of 8 is moderate; values above 10 provide limited formal guarantees but may still offer practical protection against known attacks.
Suppose we train with $\varepsilon = 2$ and $\delta = 10^{-5}$. For any single record, the guarantee says:
Since $e^{2} \approx 7.39$, any output set $S$ is at most 7.39 times more likely when your record is included versus excluded, plus a negligible $10^{-5}$ slack. At $\varepsilon = 1$ the multiplier drops to $e^{1} \approx 2.72$, and at $\varepsilon = 8$ it rises to $e^{8} \approx 2{,}981$, which is why the community considers $\varepsilon \leq 1$ strong, $\varepsilon \approx 8$ moderate, and $\varepsilon > 10$ weak.
50.1.3.1 DP-SGD with Opacus
DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training loop in two ways: (1) it clips the per-sample gradient to bound the influence of any single example, and (2) it adds calibrated Gaussian noise to the clipped gradients before the optimizer step. The Opacus library from Meta provides a drop-in integration with PyTorch that handles both operations transparently.
# DP-SGD fine-tuning with Opacus
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name, num_labels=2,
)
# Opacus requires models to pass validation
# (e.g., BatchNorm must be replaced with GroupNorm)
model = ModuleValidator.fix(model)
errors = ModuleValidator.validate(model, strict=False)
assert len(errors) == 0, f"Model validation failed: {errors}"
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Privacy parameters
MAX_GRAD_NORM = 1.0 # Per-sample gradient clipping bound
EPSILON = 8.0 # Target privacy budget
DELTA = 1e-5 # Should be < 1/N (training set size)
EPOCHS = 3
BATCH_SIZE = 32
# Wrap model, optimizer, and dataloader with PrivacyEngine
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
module=model,
optimizer=optimizer,
data_loader=train_loader, # Your DataLoader
epochs=EPOCHS,
target_epsilon=EPSILON,
target_delta=DELTA,
max_grad_norm=MAX_GRAD_NORM,
)
print(f"Using sigma={optimizer.noise_multiplier:.4f} "
f"for ({EPSILON}, {DELTA})-DP over {EPOCHS} epochs")
# Training loop (standard PyTorch, Opacus handles DP transparently)
for epoch in range(EPOCHS):
model.train()
total_loss = 0
for batch in train_loader:
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss
loss.backward()
optimizer.step()
total_loss += loss.item()
# Check actual privacy spent so far
eps = privacy_engine.get_epsilon(delta=DELTA)
print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}, "
f"epsilon_spent={eps:.2f}")
make_private wraps a standard PyTorch training loop, applying per-sample gradient clipping at max_grad_norm and adding calibrated Gaussian noise so that the cumulative (epsilon, delta) budget is tracked by the moments accountant. The fine-tuning loop itself is unchanged; the privacy guarantee comes from the privatized DataLoader.Who: A machine learning engineer and a privacy counsel at a legal technology company processing privileged attorney-client communications
Situation: The company was fine-tuning DistilBERT for sentiment classification on client feedback to improve their legal AI product. Without differential privacy, the model achieved approximately 92% accuracy on the SST-2 benchmark. However, privacy counsel required formal privacy guarantees because the training data contained sensitive legal communications.
Problem: DP-SGD at epsilon=1 (strong privacy) dropped accuracy to approximately 82%, a 10-point loss that made the model unusable for production. The team needed to find a privacy budget that satisfied legal requirements without destroying model utility.
Dilemma: Ship a strong-privacy model nobody could use, or relax the budget and have to defend that choice if regulators or clients later challenged the privacy guarantee.
Decision: After testing multiple epsilon values, they selected epsilon=8 (moderate privacy), which dropped accuracy to approximately 88%. Privacy counsel accepted this level because it still provided a formal mathematical bound on individual record influence, and the 4-point accuracy loss was within acceptable margins for their use case.
How: Engineering ran a sweep of DP-SGD fine-tunes at epsilon in {1, 2, 4, 8, 16} using Opacus, plotting accuracy against epsilon and presenting the curve to privacy counsel alongside membership inference results at each point so the choice was driven by data, not intuition.
Result: The model shipped with DP-SGD at epsilon=8. A subsequent membership inference audit confirmed that no individual training record could be extracted. The 4% accuracy gap was offset by users' increased willingness to share sensitive data, knowing it was formally protected.
Lesson: Selecting the right epsilon requires explicit negotiation between engineering and legal teams; moderate privacy budgets (epsilon 4-8) often provide the best practical balance for sensitive but non-medical data.
50.1.3.2 Privacy Budget Management
The privacy budget is a finite resource that is consumed with every access to the training data. Under the composition theorem, running $k$ training runs each with privacy budget $\varepsilon$ consumes a total budget of approximately $\varepsilon \sqrt{k}$ (under advanced composition) or $k \varepsilon$ (under basic composition). This means that hyperparameter tuning, which requires many training runs, rapidly depletes the privacy budget. Strategies for managing the budget include:
- Composition accounting. As a concrete example, suppose each training run uses $\varepsilon = 2$. After $k = 9$ runs, basic composition gives a total budget of $k\varepsilon = 18$, while advanced composition gives approximately $\varepsilon\sqrt{k} = 2\sqrt{9} = 6$. The advanced bound is much tighter, which is why privacy accountants (such as the Renyi Divergence accountant in Opacus) use advanced composition by default.
- Public data pretuning: Perform hyperparameter search on a public dataset with similar characteristics, then run a single DP fine-tuning pass on the private data.
- Privacy-free validation: Hold out a validation set that is not subject to DP constraints (if your privacy policy permits this).
- Transfer learning: Start from a strong pretrained model to minimize the number of DP fine-tuning steps needed.
- LoRA with DP: Combine parameter-efficient fine-tuning with DP-SGD. Training fewer parameters requires fewer gradient updates, consuming less privacy budget for the same number of epochs.
50.1.4 Contextual Integrity and PII Leakage
Helen Nissenbaum's theory of contextual integrity provides a useful framework for reasoning about privacy in LLM systems. The theory holds that privacy is violated not by any disclosure of information, but by information flows that violate context-specific norms. A person's medical diagnosis shared with their doctor is appropriate; the same information appearing in a chatbot's response to a stranger violates contextual integrity, even if the information is technically "public" in some database.
For LLM engineers, this framework suggests that privacy protection is not only about preventing extraction of training data. It also encompasses ensuring that the model does not recombine publicly available information in ways that violate contextual norms. A model that infers someone's medical condition from their public social media posts and reveals this in a response violates contextual integrity, even though no private data was used in training.
50.1.4.1 PII Detection and Mitigation
This snippet detects personally identifiable information in text and redacts it before sending data to an LLM.
# PII detection and scrubbing pipeline
import re
from dataclasses import dataclass
@dataclass
class PIIMatch:
category: str
text: str
start: int
end: int
confidence: float
PII_PATTERNS = {
"email": (
r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", 0.95
),
"phone_us": (
r"(?:\+1[-.])?\(?\d{3}\)?[-.]?\d{3}[-.]?\d{4}", 0.85
),
"ssn": (
r"\b\d{3}-\d{2}-\d{4}\b", 0.95
),
"credit_card": (
r"\b(?:\d{4}[-\s]?){3}\d{4}\b", 0.90
),
"ip_address": (
r"\b(?:\d{1,3}\.){3}\d{1,3}\b", 0.70
),
}
def detect_pii(text: str) -> list[PIIMatch]:
"""Detect PII using pattern matching."""
matches = []
for category, (pattern, confidence) in PII_PATTERNS.items():
for match in re.finditer(pattern, text):
matches.append(PIIMatch(
category=category,
text=match.group(),
start=match.start(),
end=match.end(),
confidence=confidence,
))
return matches
def scrub_pii(text: str, replacement: str = "[REDACTED]") -> str:
"""Replace detected PII with redaction markers."""
matches = detect_pii(text)
# Process matches in reverse order to preserve positions
for match in sorted(matches, key=lambda m: m.start, reverse=True):
tag = f"[{match.category.upper()}_REDACTED]"
text = text[:match.start] + tag + text[match.end:]
return text
# Example
sample = "Contact John at john.doe@example.com or 555-123-4567"
print(scrub_pii(sample))
# "Contact John at [EMAIL_REDACTED] or [PHONE_US_REDACTED]"
Pattern-based PII detection catches structured identifiers (emails, phone numbers, credit cards) but misses unstructured PII (names in context, addresses described in prose, medical conditions). For comprehensive PII detection, combine regex patterns with NER-based approaches using models like Microsoft Presidio, which applies named entity recognition to identify person names, locations, organizations, and other entity types that may constitute PII.
The same result in 6 lines with Presidio, which adds NER-based detection on top of regex patterns:
Show code
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "Contact John at john.doe@example.com or 555-123-4567"
results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON"], language="en")
scrubbed = anonymizer.anonymize(text=text, analyzer_results=results)
print(scrubbed.text) # "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>"
entities to analyze keeps false-positive rates manageable; passing None runs every recognizer at the cost of more spurious flags.50.1.5 Defense in Depth: Combining Privacy Techniques
No single privacy technique provides complete protection. A robust privacy strategy layers multiple defenses, each addressing a different threat vector:
| Layer | Technique | Protects Against | Limitation |
|---|---|---|---|
| Data preprocessing | PII scrubbing, deduplication | Direct PII exposure, memorization of duplicates | Incomplete PII detection; removes but does not forget |
| Training | DP-SGD (Opacus) | Membership inference, extraction with formal guarantees | Utility degradation; requires privacy budget management |
| Post-training | Machine unlearning (Section 49.7) | Specific data removal requests (GDPR) | Verification of complete removal is difficult |
| Inference | Output filtering, PII scanning | PII in model responses | Cannot catch all implicit information leakage |
| Architecture | Retrieval separation (RAG) | Reduces memorization by externalizing knowledge | Retrieved documents may themselves contain PII |
Differential privacy and PII scrubbing address different threat models. PII scrubbing removes identifiable information from the data before training, but cannot protect against inference attacks that reconstruct PII from statistical patterns. Differential privacy provides formal guarantees against such attacks, but at the cost of model utility. The optimal strategy uses both: scrub PII to remove the obvious risks (and improve data quality), then apply DP to bound the residual privacy leakage from statistical patterns. This defense-in-depth approach aligns with both GDPR's data minimization principle and the practical reality that no single technique is sufficient.
Objective
This lab walks through a complete workflow: fine-tune a small model on a dataset containing synthetic PII, measure memorization rates before and after DP-SGD, and compare utility.
Setup
You need a CUDA GPU with at least 8 GB of VRAM, the transformers and datasets libraries
from Hugging Face, and Opacus for the DP-SGD pass. The lab uses a small causal-LM checkpoint (GPT-2 small is the worked
example); pin a tokenizer, set a global seed, and write outputs to a clean directory so the canary-extraction step is reproducible.
Steps
The four-step workflow is shown in the code below: build the dataset with embedded canaries, fine-tune without DP and measure canary memorization, fine-tune with DP-SGD and measure canary memorization, and finally compare utility (perplexity on held-out data) across the two runs. The full DP-SGD implementation is in Exercise 29.12.2.
# Lab: Memorization measurement and DP mitigation
import torch
import numpy as np
from transformers import (
AutoModelForCausalLM, AutoTokenizer,
TrainingArguments, Trainer,
)
from datasets import Dataset
# Step 1: Create a dataset with "canary" sequences
# These are unique strings we embed to test for memorization
def create_canary_dataset(n_samples: int = 1000, n_canaries: int = 50):
"""Create training data with embedded canary sequences."""
import random
import string
base_texts = [
f"The capital of France is Paris. Document {i}."
for i in range(n_samples - n_canaries)
]
# Canaries: unique sequences that should not be memorizable
canaries = []
for i in range(n_canaries):
# Simulated PII: fake SSN, email, phone
fake_ssn = f"{random.randint(100,999)}-{random.randint(10,99)}-{random.randint(1000,9999)}"
fake_email = (
''.join(random.choices(string.ascii_lowercase, k=8))
+ "@example.com"
)
canary = (
f"Patient record {i}: SSN {fake_ssn}, "
f"contact {fake_email}, diagnosis: common cold."
)
canaries.append(canary)
all_texts = base_texts + canaries
random.shuffle(all_texts)
return all_texts, canaries
texts, canaries = create_canary_dataset()
print(f"Dataset: {len(texts)} samples, {len(canaries)} canaries")
print(f"Sample canary: {canaries[0]}")
# Step 2: Fine-tune without DP, measure canary memorization
# Step 3: Fine-tune with DP, measure canary memorization
# Step 4: Compare utility (perplexity on held-out data)
# See exercise 32.12.2 for the complete implementation
Expected Output
Without DP, the model should reliably regurgitate a substantial fraction of the canaries when prompted with the canary prefix (extraction success rate typically 30-80 percent on a 1000-sample fine-tune). With DP-SGD at epsilon = 8, the extraction rate should fall to near zero, while validation perplexity rises by roughly 10-20 percent. Plot the canary-extraction rate alongside the perplexity curve to visualize the privacy-utility trade-off the rest of this section formalizes.
- Training data extraction attacks can recover verbatim memorized text from LLMs, including personally identifiable information and copyrighted material.
- Membership inference attacks determine whether a specific example was in the training set, posing privacy risks even when the training data is not directly extractable.
- Differential privacy (DP) in fine-tuning adds calibrated noise to gradients, providing mathematical guarantees that individual training examples cannot be identified.
- The privacy-utility tradeoff is the central challenge: stronger privacy guarantees (lower epsilon) reduce model performance, requiring careful calibration for each use case.
- Defense in depth combines DP fine-tuning, PII scrubbing, output filtering, and access controls to create layered privacy protection.
Applying differential privacy to fine-tuning does not retroactively protect the pretraining data. Every current foundation model was pretrained without DP guarantees, so the base model still memorizes and leaks information from its pretraining corpus. DP fine-tuning protects only the fine-tuning dataset. Full protection requires defense in depth: DP fine-tuning plus output filtering plus PII detection at inference time. Skip any layer and the leak surface stays open.
Exercises
Fine-tune GPT-2 on a 1,000-example subset of a public dataset (e.g., Wikitext). Hold out 1,000 non-training examples from the same distribution. Implement the loss-threshold membership inference attack from Section 2 and report the AUC-ROC. Then repeat the experiment with DP-SGD at epsilon=8. How much does DP reduce the attack's effectiveness?
Answer Sketch
Without DP, the membership inference AUC is typically 0.70 to 0.85 on a small fine-tuning set, because the model overfits significantly. With DP-SGD at epsilon=8, the AUC drops to 0.55 to 0.65, close to random guessing. The noise added during DP training prevents the model from fitting closely enough to individual examples for the loss gap to be exploitable. The trade-off is a modest increase in validation perplexity (roughly 10 to 20%).
Using the canary dataset from Section 6, complete the full lab: (1) Fine-tune GPT-2 without DP for 5 epochs. (2) Attempt to extract canaries by prompting with their prefixes and measuring completion perplexity. (3) Count how many canaries are extractable (perplexity below a threshold). (4) Repeat with DP-SGD at epsilon=4 and epsilon=8. Plot the number of extractable canaries vs. epsilon.
Answer Sketch
Without DP, 60 to 80% of canaries are extractable after 5 epochs of fine-tuning on a small dataset. At epsilon=8, this drops to 10 to 20%. At epsilon=4 (stronger privacy), fewer than 5% of canaries are extractable. The validation perplexity increases by ~15% at epsilon=8 and ~30% at epsilon=4 compared to the non-DP baseline. This demonstrates the privacy-utility trade-off quantitatively.
An LLM trained on the open web can sometimes regurgitate verbatim text it saw during pretraining, including PII like phone numbers and credit card formats. (a) What property of the training objective causes verbatim memorization? (b) Why are duplicates and rare strings the highest-risk subsets for extraction attacks? (c) Why does scaling up parameters generally make memorization worse, not better, at fixed dataset size?
Answer Sketch
(a) Cross-entropy training drives the model to maximize the likelihood of the exact next token; for any string in the training set, perfect prediction is the minimum-loss solution, so the optimizer pulls the model toward memorization wherever capacity allows. (b) Duplicates are reinforced multiple times so their conditional probability rises above the noise floor; rare unique strings have no competing alternatives in the model's distribution, so once memorized they're easy to elicit with the right prefix. (c) More parameters mean more capacity to memorize at the same training loss; the model can fit both the generalization signal and individual examples. This is why frontier-model labs combine deduplication with privacy-aware filtering rather than just relying on scale.
You run a Carlini-style extraction attack: prompt the model with random short prefixes and check whether continuations match training data. Predict the qualitative shape of the results: (a) for a 7B model trained on dedup'd 2T tokens; (b) for the same model after fine-tuning on a private medical corpus; (c) what changes if the attacker has access to the gradient (white-box) versus only the API (black-box)?
Answer Sketch
(a) Most prefixes return generic, novel-looking text. Verbatim hits are rare (often <1% of prefixes) and concentrated in heavily duplicated content (license headers, common code snippets, song lyrics). (b) Risk skyrockets: the medical corpus is small relative to model capacity and was seen many epochs, so memorization rates can reach 5-20% for whole patient records. This is why DP fine-tuning matters more than DP pretraining for many threat models. (c) White-box attackers can run membership-inference with high precision (test if a specific record was in training) and can do gradient-based extraction with much higher hit rate. Black-box attackers are limited to behavioral probes; their bound is loose but real. The threat model determines which mitigations to invest in.
Sketch the change to a standard PyTorch fine-tuning loop required to use Opacus for differential privacy. Note (a) the three-line wrapper, (b) the two hyperparameters that control the privacy/utility tradeoff, and (c) the typical accuracy cost.
Answer Sketch
(a) Wrapper: from opacus import PrivacyEngine; pe = PrivacyEngine(); model, optimizer, train_loader = pe.make_private(module=model, optimizer=optimizer, data_loader=train_loader, noise_multiplier=1.0, max_grad_norm=1.0). (b) The two key knobs: noise_multiplier (more noise = stronger privacy, lower utility) and max_grad_norm (the per-sample gradient clip, controlling sensitivity). Together they determine the (epsilon, delta) privacy budget after a given number of steps. (c) Typical cost: 5-15 percentage-point accuracy drop on small fine-tuning datasets. The hit gets smaller as dataset size grows; for very large fine-tunes the gap can shrink to ~1-3 points. PEFT methods like LoRA combine well with DP because they reduce the effective dimensionality being clipped, recovering some utility.
Your team layers four privacy defenses: (i) PII redaction in pretraining, (ii) DP fine-tuning, (iii) input filtering in production, (iv) output scanning for SSN-like patterns. Identify three failure modes that survive all four layers, and what additional measure would close each.
Answer Sketch
(1) Indirect leakage via inference: model produces a fact that, combined with other public information, identifies a specific person. Pattern scanning misses this entirely. Mitigation: contextual integrity audits, where you red-team for inferential disclosures with realistic attacker queries. (2) Memorization in adapter weights: fine-tuning data leaks through PEFT adapters even if the base passed PII redaction. Mitigation: extraction tests on the full adapter+base stack as part of release gates. (3) Conversational-history leak: a long conversation embeds facts about the user that subsequent calls echo back to other parties. Mitigation: per-tenant memory isolation and explicit consent gates for memory features. The general principle: technical defenses bound, but never eliminate, privacy risk; pair them with policy and audit.
Machine unlearning for LLMs seeks to remove specific training data points from a model after training, without retraining from scratch.
Current approaches (gradient ascent on forget sets, SISA sharding) work for smaller models but remain computationally expensive at LLM scale.
In parallel, confidential computing (using hardware enclaves like Intel TDX and AMD SEV-SNP) is being explored to run LLM inference on encrypted data, ensuring that even the cloud provider cannot observe user prompts or model outputs.
What Comes Next
Differential privacy and PII defenses protect data held centrally. The next section, Federated Learning for Privacy-Preserving Training, covers the complementary problem: training across organizations whose data cannot leave their premises at all.
For the differential-privacy and federated-learning libraries (Opacus, Flower) that operationalize these defenses, see Section 56.2: Responsible-AI Tools. For the RAG-specific privacy considerations (corpus PII, retrieval logs, prompt-injection through documents), see Section 32.1: RAG Foundations. For the regulatory side of privacy (GDPR, CCPA, EU AI Act data-protection clauses), see Section 53.1: Global Regulatory Landscape.