"A model that memorizes is a model that leaks. The question is not if, but when, and how much."
A Cautious Guard, Leak-Conscious AI Agent
Large language models memorize portions of their training data, and adversaries can extract that data through carefully crafted queries. Carlini et al. (2021) demonstrated that GPT-2 could be prompted to emit verbatim training sequences, including names, phone numbers, email addresses, and code snippets. This is not a bug in a specific model; it is an inherent property of how neural language models learn. Larger models memorize more, and fine-tuned models memorize their fine-tuning data at even higher rates than the base model memorizes its pretraining corpus. This section covers the attack landscape (extraction attacks, membership inference, attribute inference), the theoretical foundation of differential privacy, and the practical application of DP-SGD for privacy-preserving fine-tuning using the Opacus library. It also addresses PII detection and mitigation strategies that complement formal privacy guarantees with practical defense-in-depth.
Prerequisites
This section expands on the privacy foundations introduced in Section 32.6: LLM Licensing, IP & Privacy and the security threat landscape from Section 32.1. Understanding fine-tuning fundamentals (Section 14.1) is essential for the DP-SGD sections. The GDPR requirements discussed in Section 32.4 provide the regulatory motivation for the technical defenses presented here.
1. Training Data Extraction Attacks
Training data extraction attacks exploit the fact that language models assign higher probability to sequences they have seen during training. An attacker generates a large number of candidate sequences (either by prompting the model or by sampling from it) and then identifies which outputs are likely verbatim memorized content. The key insight from Carlini et al. (2021) is the distinction between eidetic memorization (the model can reproduce a sequence exactly when given the right prefix) and extractable memorization (the sequence can be recovered through systematic probing without knowing the prefix).
In the landmark Carlini et al. extraction study, the researchers recovered a person's full name, email address, phone number, and physical address by prompting GPT-2 with just the first few words of a memorized training sequence. The model essentially acted as a search engine for its own training data, except nobody had built a privacy policy for that feature.
The attack surface increases with model size. Larger models have greater capacity to memorize rare sequences, and they exhibit lower perplexity on memorized content, making extraction easier. Fine-tuned models are particularly vulnerable because the fine-tuning dataset is typically much smaller than the pretraining corpus, so each example receives far more gradient updates and is memorized more thoroughly.
# Measuring memorization: perplexity-based extraction detection
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np

def measure_memorization(
    model_name: str,
    candidate_texts: list[str],
    device: str = "cuda",
) -> list[dict]:
    """Score candidate texts by likelihood under the model.
    High-likelihood (low-perplexity) texts are more likely memorized."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()
    results = []
    for text in candidate_texts:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss.item()
        perplexity = np.exp(loss)
        results.append({
            "text_prefix": text[:80] + "...",
            "perplexity": round(perplexity, 2),
            "loss": round(loss, 4),
            "num_tokens": inputs["input_ids"].shape[1],
        })
    # Sort by perplexity (lowest = most likely memorized)
    results.sort(key=lambda x: x["perplexity"])
    return results

# Example: check if specific sequences are memorized
candidates = [
    "The quick brown fox jumps over the lazy dog.",
    # In a real attack, these would be suspected training data
    "def fibonacci(n):\n    if n <= 1:\n        return n",
]
# results = measure_memorization("gpt2", candidates)
Before examining each defense in depth, the sketch below previews how the techniques covered in this section compose into a single pipeline (`scrub_pii` is defined in Section 4.1).
# End-to-end privacy pipeline: scrub, train with DP, filter outputs
from dataclasses import dataclass

@dataclass
class PrivacyConfig:
    # Preprocessing
    scrub_pii: bool = True
    deduplicate: bool = True
    min_token_count: int = 3  # Remove near-unique sequences
    # Training
    use_dp_sgd: bool = True
    target_epsilon: float = 8.0
    target_delta: float = 1e-5
    max_grad_norm: float = 1.0
    # Output filtering
    scan_outputs_for_pii: bool = True
    block_verbatim_training_matches: bool = True
    verbatim_threshold: int = 50  # Char length for match detection

def privacy_aware_pipeline(config: PrivacyConfig, raw_data: list[str]):
    """Orchestrate a privacy-preserving fine-tuning pipeline."""
    # Stage 1: Data preprocessing
    data = raw_data
    if config.scrub_pii:
        data = [scrub_pii(text) for text in data]  # See Section 4.1
        print(f"PII scrubbing: processed {len(data)} samples")
    if config.deduplicate:
        data = list(set(data))  # Simplified; use MinHash in production
        print(f"After dedup: {len(data)} samples")
    # Stage 2: DP fine-tuning (pseudocode)
    if config.use_dp_sgd:
        print(f"Training with DP-SGD: epsilon={config.target_epsilon}")
        # ... Opacus training loop from Section 3 ...
    else:
        print("Training WITHOUT differential privacy (not recommended)")
    # Stage 3: Output filter setup
    if config.scan_outputs_for_pii:
        print("Output PII scanner: ENABLED")
    if config.block_verbatim_training_matches:
        print(f"Verbatim match blocking: threshold={config.verbatim_threshold}")
    return data  # Return processed dataset

config = PrivacyConfig(target_epsilon=8.0)
# processed = privacy_aware_pipeline(config, raw_training_data)
Memorization is not uniformly distributed. Models preferentially memorize content that is (1) repeated multiple times in the training data, (2) highly structured (phone numbers, email addresses, URLs), (3) distinctive or unusual in context, and (4) present in smaller fine-tuning datasets. Carlini et al. (2023) found that data duplicated 10x in the training set is extractable at 10x higher rates than unique sequences. This has a direct practical implication: deduplicating your training data is both a data quality measure and a privacy measure. The data processing techniques from Section 6.1 (deduplication, filtering) serve double duty as privacy defenses.
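A minimal near-duplicate filter built on character-shingle Jaccard similarity illustrates the dedup-as-privacy idea; the function names and the 0.8 threshold here are illustrative choices, and a production pipeline would use MinHash/LSH (e.g., the `datasketch` library) to avoid the quadratic pairwise comparison.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-gram shingles of a lowercased text."""
    t = text.lower()
    return {t[i:i + k] for i in range(max(1, len(t) - k + 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_dedup(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep each text only if it is not a near-duplicate of one already kept.
    O(n^2) pairwise; real pipelines use MinHash/LSH instead."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for text in texts:
        s = shingles(text)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept

docs = [
    "Patient record 1: SSN 123-45-6789, diagnosis: common cold.",
    "Patient record 1: SSN 123-45-6789, diagnosis: common cold!",  # near-dup
    "The capital of France is Paris.",
]
print(len(near_dedup(docs)))  # the near-duplicate is removed
```

The near-duplicate record differs by one character, so its shingle set overlaps the original almost completely and it is dropped, directly reducing the repeated exposure that drives memorization.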
2. Membership Inference Attacks
Membership inference attacks (MIAs) answer a binary question: was a specific data point used in training this model? Unlike extraction attacks (which recover the data itself), MIAs reveal whether an individual's data was included in the training set. This is a privacy violation in itself: knowing that someone's medical records were used to train a model reveals information about their health status, even without seeing the records.
The core technique exploits the observation that models behave differently on data they were trained on (members) versus data they have never seen (non-members). Members typically have lower loss, higher confidence, and different gradient patterns. The attacker trains a binary classifier (the "attack model") to distinguish members from non-members based on these signals.
# Membership inference attack: loss-threshold method
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score

def compute_per_sample_loss(
    model, tokenizer, texts: list[str], device: str = "cuda",
) -> np.ndarray:
    """Compute per-sample cross-entropy loss."""
    model.eval()
    losses = []
    for text in texts:
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=512,
        ).to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
        losses.append(outputs.loss.item())
    return np.array(losses)

def membership_inference_attack(
    model_name: str,
    member_texts: list[str],     # Known training data samples
    nonmember_texts: list[str],  # Known non-training data samples
    device: str = "cuda",
) -> dict:
    """Run a loss-threshold membership inference attack."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    member_losses = compute_per_sample_loss(
        model, tokenizer, member_texts, device
    )
    nonmember_losses = compute_per_sample_loss(
        model, tokenizer, nonmember_texts, device
    )
    # Lower loss = more likely a member.
    # Negate losses so higher score = more likely member.
    scores = np.concatenate([-member_losses, -nonmember_losses])
    labels = np.concatenate([
        np.ones(len(member_losses)),
        np.zeros(len(nonmember_losses)),
    ])
    auc = roc_auc_score(labels, scores)
    return {
        "auc_roc": round(auc, 4),
        "member_mean_loss": round(float(member_losses.mean()), 4),
        "nonmember_mean_loss": round(float(nonmember_losses.mean()), 4),
        "loss_gap": round(
            float(nonmember_losses.mean() - member_losses.mean()), 4
        ),
    }
A random-guessing attacker achieves AUC-ROC of 0.5. In practice, membership inference against large pretrained models typically achieves AUC between 0.55 and 0.65, a modest but statistically significant advantage. Against fine-tuned models, the attack is far more effective, often achieving AUC above 0.80, because the fine-tuning data is overfit to a much greater degree.
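The AUC figures above are rank statistics, so the link between the member/non-member loss gap and attack success can be illustrated without any model at all. This sketch uses synthetic, hypothetical loss distributions (a 0.4-nat gap with 0.5 standard deviation) and computes AUC directly via the Mann-Whitney pairing argument:

```python
import random

def auc_from_scores(member_scores: list[float],
                    nonmember_scores: list[float]) -> float:
    """AUC-ROC via the Mann-Whitney pairing: the fraction of
    (member, non-member) pairs ranked in the right order."""
    wins = ties = 0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1
            elif m == n:
                ties += 1
    total = len(member_scores) * len(nonmember_scores)
    return (wins + 0.5 * ties) / total

random.seed(0)
# Hypothetical per-sample losses: members fit slightly better (lower loss)
member_losses = [random.gauss(2.8, 0.5) for _ in range(500)]
nonmember_losses = [random.gauss(3.2, 0.5) for _ in range(500)]
# Score = negative loss, so higher score = more member-like
auc = auc_from_scores([-l for l in member_losses],
                      [-l for l in nonmember_losses])
print(f"AUC for a 0.4-nat loss gap: {auc:.3f}")
```

With these parameters the AUC lands near 0.7, in the fine-tuned-model regime described above; shrinking the gap toward zero pushes the attack back toward the 0.5 random-guessing baseline.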
3. Differential Privacy in Fine-Tuning
Differential privacy (DP) provides a mathematical guarantee that the output of a computation does not depend "too much" on any single input record. Formally, a mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two datasets $D$ and $D'$ that differ in a single record, and for any set of outputs $S$:
$$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta$$
The privacy budget $\varepsilon$ (epsilon) controls the strength of the guarantee. Smaller epsilon means stronger privacy but greater utility loss. In practice, $\varepsilon$ values between 1 and 10 are common for fine-tuning, with $\delta$ set below the reciprocal of the training set size. An $\varepsilon$ of 1 is considered strong privacy; $\varepsilon$ of 8 is moderate; values above 10 provide limited formal guarantees but may still offer practical protection against known attacks.
Suppose we train with $\varepsilon = 2$ and $\delta = 10^{-5}$. For any single record, the guarantee says:
$$\Pr[\mathcal{M}(D) \in S] \leq e^{2} \cdot \Pr[\mathcal{M}(D') \in S] + 10^{-5}$$
Since $e^{2} \approx 7.39$, any output set $S$ is at most 7.39 times more likely when your record is included versus excluded, plus a negligible $10^{-5}$ slack. At $\varepsilon = 1$ the multiplier drops to $e^{1} \approx 2.72$, and at $\varepsilon = 8$ it rises to $e^{8} \approx 2{,}981$, which is why the community considers $\varepsilon \leq 1$ strong, $\varepsilon \approx 8$ moderate, and $\varepsilon > 10$ weak.
3.1 DP-SGD with Opacus
DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training loop in two ways: (1) it clips the per-sample gradient to bound the influence of any single example, and (2) it adds calibrated Gaussian noise to the clipped gradients before the optimizer step. The Opacus library from Meta provides a drop-in integration with PyTorch that handles both operations transparently.
# DP-SGD fine-tuning with Opacus
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2,
)
# Opacus requires models to pass validation
# (e.g., BatchNorm must be replaced with GroupNorm)
model = ModuleValidator.fix(model)
errors = ModuleValidator.validate(model, strict=False)
assert len(errors) == 0, f"Model validation failed: {errors}"

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Privacy parameters
MAX_GRAD_NORM = 1.0  # Per-sample gradient clipping bound
EPSILON = 8.0        # Target privacy budget
DELTA = 1e-5         # Should be < 1/N (training set size)
EPOCHS = 3
BATCH_SIZE = 32

# Wrap model, optimizer, and dataloader with PrivacyEngine
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,  # Your DataLoader
    epochs=EPOCHS,
    target_epsilon=EPSILON,
    target_delta=DELTA,
    max_grad_norm=MAX_GRAD_NORM,
)
print(f"Using sigma={optimizer.noise_multiplier:.4f} "
      f"for ({EPSILON}, {DELTA})-DP over {EPOCHS} epochs")

# Training loop (standard PyTorch; Opacus handles DP transparently)
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    # Check actual privacy spent so far
    eps = privacy_engine.get_epsilon(delta=DELTA)
    print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}, "
          f"epsilon_spent={eps:.2f}")
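What the `PrivacyEngine` does on each step can be made concrete without Opacus. The sketch below (toy two-dimensional gradients and a hypothetical noise multiplier of 1.1) performs the two DP-SGD operations by hand: per-sample L2 clipping to `max_norm`, then Gaussian noise with standard deviation `noise_multiplier * max_norm` added to the summed gradient before averaging.

```python
import math
import random

def clip_grad(grad: list[float], max_norm: float) -> list[float]:
    """Scale a per-sample gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad]

def dp_sgd_step(per_sample_grads: list[list[float]],
                max_norm: float = 1.0,
                noise_multiplier: float = 1.1) -> list[float]:
    """One DP-SGD update direction: clip each sample's gradient,
    sum, add N(0, (sigma * C)^2) noise, then average over the batch."""
    clipped = [clip_grad(g, max_norm) for g in per_sample_grads]
    batch = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    noised = [s + random.gauss(0.0, noise_multiplier * max_norm)
              for s in summed]
    return [x / batch for x in noised]

random.seed(0)
grads = [[3.0, 4.0], [0.1, -0.2], [-6.0, 8.0]]  # toy per-sample gradients
update = dp_sgd_step(grads, max_norm=1.0, noise_multiplier=1.1)
print(update)
```

The clipping bounds any single example's influence on the update to at most `max_norm`, which is exactly the sensitivity bound the Gaussian noise is calibrated against.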
Case Study
Who: A machine learning engineer and a privacy counsel at a legal technology company processing privileged attorney-client communications
Situation: The company was fine-tuning DistilBERT for sentiment classification on client feedback to improve their legal AI product. Without differential privacy, the model achieved approximately 92% accuracy on the SST-2 benchmark. However, privacy counsel required formal privacy guarantees because the training data contained sensitive legal communications.
Problem: DP-SGD at epsilon=1 (strong privacy) dropped accuracy to approximately 82%, a 10-point loss that made the model unusable for production. The team needed to find a privacy budget that satisfied legal requirements without destroying model utility.
Decision: After testing multiple epsilon values, they selected epsilon=8 (moderate privacy), which dropped accuracy to approximately 88%. Privacy counsel accepted this level because it still provided a formal mathematical bound on individual record influence, and the 4-point accuracy loss was within acceptable margins for their use case.
Result: The model shipped with DP-SGD at epsilon=8. A subsequent membership inference audit found attack performance near chance, confirming that the inclusion of individual training records could not be reliably inferred. The 4-point accuracy gap was offset by users' increased willingness to share sensitive data, knowing it was formally protected.
Lesson: Selecting the right epsilon requires explicit negotiation between engineering and legal teams; moderate privacy budgets (epsilon 4-8) often provide the best practical balance for sensitive but non-medical data.
3.2 Privacy Budget Management
The privacy budget is a finite resource that is consumed with every access to the training data. Under the composition theorem, running $k$ training runs each with privacy budget $\varepsilon$ consumes a total budget of approximately $\varepsilon \sqrt{k}$ (under advanced composition) or $k \varepsilon$ (under basic composition). This means that hyperparameter tuning, which requires many training runs, rapidly depletes the privacy budget. Strategies for managing the budget include:
- Composition accounting: Suppose each training run uses $\varepsilon = 2$. After $k = 9$ runs, basic composition gives a total budget of $k\varepsilon = 18$, while advanced composition gives approximately $\varepsilon\sqrt{k} = 2\sqrt{9} = 6$. The advanced bound is much tighter, which is why privacy accountants (such as the Rényi differential privacy accountant in Opacus) use advanced composition by default.
- Public data pretuning: Perform hyperparameter search on a public dataset with similar characteristics, then run a single DP fine-tuning pass on the private data.
- Privacy-free validation: Hold out a validation set that is not subject to DP constraints (if your privacy policy permits this).
- Transfer learning: Start from a strong pre-trained model to minimize the number of DP fine-tuning steps needed.
- LoRA with DP: Combine parameter-efficient fine-tuning with DP-SGD. Training fewer parameters requires fewer gradient updates, consuming less privacy budget for the same number of epochs.
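The composition arithmetic from the first bullet can be captured in two helper functions. These implement the simplified bounds quoted above (the full advanced-composition theorem carries additional $\delta$-dependent terms):

```python
import math

def basic_composition(epsilon_per_run: float, k: int) -> float:
    """Basic composition: privacy budgets add linearly across k runs."""
    return k * epsilon_per_run

def advanced_composition(epsilon_per_run: float, k: int) -> float:
    """Simplified advanced-composition scaling: epsilon * sqrt(k).
    (The full theorem adds delta-dependent correction terms.)"""
    return epsilon_per_run * math.sqrt(k)

for k in (1, 4, 9, 16):
    print(f"k={k:2d} runs at eps=2: "
          f"basic={basic_composition(2.0, k):5.1f}, "
          f"advanced={advanced_composition(2.0, k):5.1f}")
```

For $\varepsilon = 2$ and $k = 9$ this reproduces the 18-versus-6 comparison above, and it makes the cost of every extra hyperparameter sweep explicit.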
4. Contextual Integrity and PII Leakage
Helen Nissenbaum's theory of contextual integrity provides a useful framework for reasoning about privacy in LLM systems. The theory holds that privacy is violated not by any disclosure of information, but by information flows that violate context-specific norms. A person's medical diagnosis shared with their doctor is appropriate; the same information appearing in a chatbot's response to a stranger violates contextual integrity, even if the information is technically "public" in some database.
For LLM engineers, this framework suggests that privacy protection is not only about preventing extraction of training data. It also encompasses ensuring that the model does not recombine publicly available information in ways that violate contextual norms. A model that infers someone's medical condition from their public social media posts and reveals this in a response violates contextual integrity, even though no private data was used in training.
4.1 PII Detection and Mitigation
This snippet detects personally identifiable information in text and redacts it before sending data to an LLM.
# PII detection and scrubbing pipeline
import re
from dataclasses import dataclass

@dataclass
class PIIMatch:
    category: str
    text: str
    start: int
    end: int
    confidence: float

PII_PATTERNS = {
    "email": (
        r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", 0.95
    ),
    "phone_us": (
        r"(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", 0.85
    ),
    "ssn": (
        r"\b\d{3}-\d{2}-\d{4}\b", 0.95
    ),
    "credit_card": (
        r"\b(?:\d{4}[-\s]?){3}\d{4}\b", 0.90
    ),
    "ip_address": (
        r"\b(?:\d{1,3}\.){3}\d{1,3}\b", 0.70
    ),
}

def detect_pii(text: str) -> list[PIIMatch]:
    """Detect PII using pattern matching."""
    matches = []
    for category, (pattern, confidence) in PII_PATTERNS.items():
        for match in re.finditer(pattern, text):
            matches.append(PIIMatch(
                category=category,
                text=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=confidence,
            ))
    return matches

def scrub_pii(text: str) -> str:
    """Replace detected PII with category-specific redaction markers."""
    matches = detect_pii(text)
    # Process matches in reverse order to preserve positions
    for match in sorted(matches, key=lambda m: m.start, reverse=True):
        tag = f"[{match.category.upper()}_REDACTED]"
        text = text[:match.start] + tag + text[match.end:]
    return text

# Example
sample = "Contact John at john.doe@example.com or 555-123-4567"
print(scrub_pii(sample))
# "Contact John at [EMAIL_REDACTED] or [PHONE_US_REDACTED]"
Pattern-based PII detection catches structured identifiers (emails, phone numbers, credit cards) but misses unstructured PII (names in context, addresses described in prose, medical conditions). For comprehensive PII detection, combine regex patterns with NER-based approaches using models like Microsoft Presidio, which applies named entity recognition to identify person names, locations, organizations, and other entity types that may constitute PII.
Presidio achieves the same result in a few lines, adding NER-based detection on top of regex patterns:
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "Contact John at john.doe@example.com or 555-123-4567"
results = analyzer.analyze(
    text=text,
    entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON"],
    language="en",
)
scrubbed = anonymizer.anonymize(text=text, analyzer_results=results)
print(scrubbed.text) # "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>"
5. Defense in Depth: Combining Privacy Techniques
No single privacy technique provides complete protection. A robust privacy strategy layers multiple defenses, each addressing a different threat vector:
| Layer | Technique | Protects Against | Limitation |
|---|---|---|---|
| Data preprocessing | PII scrubbing, deduplication | Direct PII exposure, memorization of duplicates | Incomplete PII detection; removes but does not forget |
| Training | DP-SGD (Opacus) | Membership inference, extraction with formal guarantees | Utility degradation; requires privacy budget management |
| Post-training | Machine unlearning (Section 32.7) | Specific data removal requests (GDPR) | Verification of complete removal is difficult |
| Inference | Output filtering, PII scanning | PII in model responses | Cannot catch all implicit information leakage |
| Architecture | Retrieval separation (RAG) | Reduces memorization by externalizing knowledge | Retrieved documents may themselves contain PII |
Differential privacy and PII scrubbing address different threat models. PII scrubbing removes identifiable information from the data before training, but cannot protect against inference attacks that reconstruct PII from statistical patterns. Differential privacy provides formal guarantees against such attacks, but at the cost of model utility. The optimal strategy uses both: scrub PII to remove the obvious risks (and improve data quality), then apply DP to bound the residual privacy leakage from statistical patterns. This defense-in-depth approach aligns with both GDPR's data minimization principle and the practical reality that no single technique is sufficient.
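The inference-layer "output filtering" row can be sketched as a verbatim-match blocker: index every 50-character window of the training corpus, then flag any model output that reproduces one. The 50 here mirrors the `verbatim_threshold` from the pipeline config earlier; a production system would hash the windows or use a suffix automaton to keep memory manageable.

```python
def build_ngram_index(training_texts: list[str], n: int = 50) -> set[str]:
    """Index every n-character window of the training corpus."""
    index: set[str] = set()
    for text in training_texts:
        for i in range(len(text) - n + 1):
            index.add(text[i:i + n])
    return index

def contains_verbatim_match(output: str, index: set[str], n: int = 50) -> bool:
    """True if any n-character window of the output appears verbatim
    in the training data, i.e. likely regurgitation."""
    return any(output[i:i + n] in index
               for i in range(len(output) - n + 1))

training = ["Patient record 17: SSN 123-45-6789, contact jdoe@example.com, "
            "diagnosis: common cold."]
index = build_ngram_index(training, n=50)
leaked = ("As requested: Patient record 17: SSN 123-45-6789, "
          "contact jdoe@example.com, thanks!")
clean = "I cannot share individual patient records."
print(contains_verbatim_match(leaked, index))  # True
print(contains_verbatim_match(clean, index))   # False
```

A response that reproduces a 50-character training run is blocked even after paraphrased framing, while short incidental overlaps pass through; tuning `n` trades false positives against missed leaks.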
Lab: Measuring and Mitigating Memorization
This lab walks through a complete workflow: fine-tune a small model on a dataset containing synthetic PII, measure memorization rates before and after DP-SGD, and compare utility.
# Lab: Memorization measurement and DP mitigation
import random
import string

import torch
import numpy as np
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    TrainingArguments, Trainer,
)
from datasets import Dataset

# Step 1: Create a dataset with "canary" sequences
# These are unique strings we embed to test for memorization
def create_canary_dataset(n_samples: int = 1000, n_canaries: int = 50):
    """Create training data with embedded canary sequences."""
    base_texts = [
        f"The capital of France is Paris. Document {i}."
        for i in range(n_samples - n_canaries)
    ]
    # Canaries: unique sequences that should not be memorizable
    canaries = []
    for i in range(n_canaries):
        # Simulated PII: fake SSN and email
        fake_ssn = (
            f"{random.randint(100, 999)}-"
            f"{random.randint(10, 99)}-"
            f"{random.randint(1000, 9999)}"
        )
        fake_email = (
            ''.join(random.choices(string.ascii_lowercase, k=8))
            + "@example.com"
        )
        canary = (
            f"Patient record {i}: SSN {fake_ssn}, "
            f"contact {fake_email}, diagnosis: common cold."
        )
        canaries.append(canary)
    all_texts = base_texts + canaries
    random.shuffle(all_texts)
    return all_texts, canaries

texts, canaries = create_canary_dataset()
print(f"Dataset: {len(texts)} samples, {len(canaries)} canaries")
print(f"Sample canary: {canaries[0]}")

# Step 2: Fine-tune without DP, measure canary memorization
# Step 3: Fine-tune with DP, measure canary memorization
# Step 4: Compare utility (perplexity on held-out data)
# See exercise 32.12.2 for the complete implementation
Applying differential privacy to fine-tuning does not retroactively protect the pre-training data. If a base model was pre-trained without DP guarantees (as is the case for all current foundation models), it may still memorize and leak information from its pre-training corpus. DP fine-tuning only protects the fine-tuning dataset. For full protection, you need a defense-in-depth strategy combining DP fine-tuning, output filtering, and PII detection at inference time.
Key Takeaways
- Training data extraction attacks can recover verbatim memorized text from LLMs, including personally identifiable information and copyrighted material.
- Membership inference attacks determine whether a specific example was in the training set, posing privacy risks even when the training data is not directly extractable.
- Differential privacy (DP) in fine-tuning adds calibrated noise to gradients, providing mathematical guarantees that individual training examples cannot be identified.
- The privacy-utility tradeoff is the central challenge: stronger privacy guarantees (lower epsilon) reduce model performance, requiring careful calibration for each use case.
- Defense in depth combines DP fine-tuning, PII scrubbing, output filtering, and access controls to create layered privacy protection.
Exercises
Exercise 32.12.1: Membership Inference Evaluation
Fine-tune GPT-2 on a 1,000-example subset of a public dataset (e.g., Wikitext). Hold out 1,000 non-training examples from the same distribution. Implement the loss-threshold membership inference attack from Section 2 and report the AUC-ROC. Then repeat the experiment with DP-SGD at epsilon=8. How much does DP reduce the attack's effectiveness?
Answer Sketch
Without DP, the membership inference AUC is typically 0.70 to 0.85 on a small fine-tuning set, because the model overfits significantly. With DP-SGD at epsilon=8, the AUC drops to 0.55 to 0.65, close to random guessing. The noise added during DP training prevents the model from fitting closely enough to individual examples for the loss gap to be exploitable. The trade-off is a modest increase in validation perplexity (roughly 10 to 20%).
Exercise 32.12.2: Canary Extraction Lab
Using the canary dataset from Section 6, complete the full lab: (1) Fine-tune GPT-2 without DP for 5 epochs. (2) Attempt to extract canaries by prompting with their prefixes and measuring completion perplexity. (3) Count how many canaries are extractable (perplexity below a threshold). (4) Repeat with DP-SGD at epsilon=4 and epsilon=8. Plot the number of extractable canaries vs. epsilon.
Answer Sketch
Without DP, 60 to 80% of canaries are extractable after 5 epochs of fine-tuning on a small dataset. At epsilon=8, this drops to 10 to 20%. At epsilon=4 (stronger privacy), fewer than 5% of canaries are extractable. The validation perplexity increases by ~15% at epsilon=8 and ~30% at epsilon=4 compared to the non-DP baseline. This demonstrates the privacy-utility trade-off quantitatively.
Machine unlearning for LLMs seeks to remove specific training data points from a model after training, without retraining from scratch.
Current approaches (gradient ascent on forget sets, SISA sharding) work for smaller models but remain computationally expensive at LLM scale.
In parallel, confidential computing (using hardware enclaves like Intel TDX and AMD SEV-SNP) is being explored to run LLM inference on encrypted data, ensuring that even the cloud provider cannot observe user prompts or model outputs.
What Comes Next
This concludes the safety, ethics, and regulation chapter. In the next chapter, Chapter 33: Strategy, Product & ROI, we shift from regulatory compliance to business strategy, examining how to build an economic case for LLM investments and measure return on AI initiatives.
References
Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium.
Demonstrates that GPT-2 memorizes and can reproduce verbatim training data, including personal information. Established that memorization is a fundamental privacy risk in large language models, not just a theoretical concern.
Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR 2023.
Systematically measures how memorization scales with model size, finding that larger models memorize more training data. Provides the empirical foundation for this section's discussion of memorization risks.
Abadi, M. et al. (2016). "Deep Learning with Differential Privacy." ACM CCS 2016.
The foundational paper on training deep learning models with formal differential privacy guarantees. Introduces the DP-SGD algorithm that remains the primary approach for privacy-preserving model training.
Yousefpour, A. et al. (2021). "Opacus: User-Friendly Differential Privacy Library in PyTorch." arXiv:2109.12298. GitHub: pytorch/opacus.
Documentation and paper for Opacus, PyTorch's official differential privacy library. The most practical entry point for implementing DP-SGD in existing PyTorch training pipelines.
Nissenbaum, H. (2004). "Privacy as Contextual Integrity." Washington Law Review, 79(1).
Introduces contextual integrity as a framework for reasoning about privacy, arguing that privacy norms depend on information flows within specific social contexts. Provides the theoretical foundation for evaluating when AI data usage violates user expectations.
Shokri, R. et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P 2017.
Introduces membership inference attacks, showing that an adversary can determine whether a specific data point was in the training set. A key threat model for understanding privacy risks in machine learning deployments.
Li, X. et al. (2022). "Large Language Models Can Be Strong Differentially Private Learners." ICLR 2022.
Shows that large language models can achieve strong task performance while maintaining differential privacy guarantees, challenging the assumption that privacy and utility are fundamentally at odds in the LLM setting.
Microsoft Presidio. "Data Protection and De-identification SDK." GitHub: microsoft/presidio.
Open-source SDK for detecting and anonymizing personally identifiable information (PII) in text. A practical tool for preprocessing training data and sanitizing LLM inputs and outputs in privacy-sensitive applications.
