Privacy Attacks and Differential Privacy

Section 50.1

"A model that memorizes is a model that leaks. The question is not if, but when, and how much."

GuardA Cautious Guard, Leak-Conscious AI Agent
Big Picture

Large language models memorize portions of their training data, and adversaries can extract that data through carefully crafted queries. Carlini et al. (2021) demonstrated that GPT-2 could be prompted to emit verbatim training sequences, including names, phone numbers, email addresses, and code snippets. This is not a bug in a specific model; it is an inherent property of how neural language models learn. Larger models memorize more, and fine-tuned models memorize their fine-tuning data at even higher rates than the base model memorizes its pretraining corpus. This section covers the attack landscape (extraction attacks, membership inference, attribute inference), the theoretical foundation of differential privacy, and the practical application of DP-SGD for privacy-preserving fine-tuning using the Opacus library. It also addresses PII detection and mitigation strategies that complement formal privacy guarantees with practical defense-in-depth.

Three privacy attacks and three layered defenses
Figure 50.1.1: Three attack classes against LLM privacy and three layered defenses. Extraction (Carlini et al. 2021) recovers verbatim memorized strings. Membership inference (Shokri et al. 2017) decides whether a specific record was in training. Attribute inference treats the model as a fuzzy database of sensitive attributes. The defenses stack: DP-SGD via Opacus provides formal (ε, δ) guarantees during fine-tuning; data minimization with Presidio shrinks what can be memorized; and output-side PII filters with canary probes catch leaks at runtime.

Prerequisites

This section expands on the privacy foundations introduced in Section 47.1: LLM Licensing, IP & Privacy and the security threat landscape from Section 47.1. Understanding fine-tuning fundamentals (Section 16.1) is essential for the DP-SGD sections. The GDPR requirements discussed in Section 47.1 provide the regulatory motivation for the technical defenses presented here.

50.1.1 Training Data Extraction Attacks

Training data extraction attacks exploit the fact that language models assign higher probability to sequences they have seen during training. An attacker generates a large number of candidate sequences (either by prompting the model or by sampling from it) and then identifies which outputs are likely verbatim memorized content. The key insight from Carlini et al. (2021) is the distinction between eidetic memorization (the model can reproduce a sequence exactly when given the right prefix) and extractable memorization (the sequence can be recovered through systematic probing without knowing the prefix).

Fun Fact

In the landmark Carlini et al. extraction study, the researchers recovered a person's full name, email address, phone number, and physical address by prompting GPT-2 with just the first few words of a memorized training sequence. The model essentially acted as a search engine for its own training data, except nobody had built a privacy policy for that feature.

The attack surface increases with model size. Larger models have greater capacity to memorize rare sequences, and they exhibit lower perplexity on memorized content, making extraction easier. Fine-tuned models are particularly vulnerable because the fine-tuning dataset is typically much smaller than the pretraining corpus, so each example receives far more gradient updates and is memorized more thoroughly.

# Measuring memorization: perplexity-based extraction detection
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import numpy as np
def measure_memorization(
    model_name: str,
    candidate_texts: list[str],
    device: str = "cuda",
    ) -> list[dict]:
    """Score candidate texts by likelihood under the model.
    High-likelihood (low-perplexity) texts are more likely memorized."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    model.eval()
    results = []
    for text in candidate_texts:
        inputs = tokenizer(text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss.item()
            perplexity = np.exp(loss)
            results.append({
                "text_prefix": text[:80] + "...",
                "perplexity": round(perplexity, 2),
                "loss": round(loss, 4),
                "num_tokens": inputs["input_ids"].shape[1],
                })
            # Sort by perplexity (lowest = most likely memorized)
            results.sort(key=lambda x: x["perplexity"])
            return results

# Example: check if specific sequences are memorized
candidates = [
    "The quick brown fox jumps over the lazy dog.",
    # In a real attack, these would be suspected training data
    "def fibonacci(n):\n if n <= 1:\n return n",
    ]
# results = measure_memorization("gpt2", candidates)
Output: Using sigma=0.8712 for (8.0, 1e-05)-DP over 3 epochs Epoch 1: loss=0.4823, epsilon_spent=3.14 Epoch 2: loss=0.3156, epsilon_spent=5.47 Epoch 3: loss=0.2491, epsilon_spent=7.89
Code Fragment 50.1.1a: Detects memorized canaries by comparing per-token loss on a candidate string against a model's typical loss distribution. A candidate whose loss is in the bottom percentile is likely a verbatim training example; this is the technique Carlini et al. used to extract verbatim training data from GPT-2 in their 2021 USENIX paper.
# End-to-end privacy pipeline: scrub, train with DP, filter outputs
from dataclasses import dataclass
from typing import Optional
@dataclass
class PrivacyConfig:
    # Preprocessing
    scrub_pii: bool = True
    deduplicate: bool = True
    min_token_count: int = 3 # Remove near-unique sequences
    # Training
    use_dp_sgd: bool = True
    target_epsilon: float = 8.0
    target_delta: float = 1e-5
    max_grad_norm: float = 1.0
    # Output filtering
    scan_outputs_for_pii: bool = True
    block_verbatim_training_matches: bool = True
    verbatim_threshold: int = 50 # Char length for match detection
    def privacy_aware_pipeline(config: PrivacyConfig, raw_data: list[str]):
        """Orchestrate a privacy-preserving fine-tuning pipeline."""
        # Stage 1: Data preprocessing
        data = raw_data
        if config.scrub_pii:
            data = [scrub_pii(text) for text in data]
            print(f"PII scrubbing: processed {len(data)} samples")
            if config.deduplicate:
                data = list(set(data)) # Simplified; use MinHash in production
                print(f"After dedup: {len(data)} samples")
                # Stage 2: DP fine-tuning (pseudocode)
                if config.use_dp_sgd:
                    print(f"Training with DP-SGD: epsilon={config.target_epsilon}")
                    # ... Opacus training loop from Section 3 ...
                else:
                    print("Training WITHOUT differential privacy (not recommended)")
                    # Stage 3: Output filter setup
                    if config.scan_outputs_for_pii:
                        print("Output PII scanner: ENABLED")
                        if config.block_verbatim_training_matches:
                            print(f"Verbatim match blocking: threshold={config.verbatim_threshold}")
                            return data # Return processed dataset
                            config = PrivacyConfig(target_epsilon=8.0)
                            # processed = privacy_aware_pipeline(config, raw_training_data)
Code Fragment 50.1.2: The PrivacyConfig dataclass codifies three layers of privacy defenses (preprocessing, DP-SGD training, output filtering) as a single configurable object. The privacy_aware_pipeline function runs them in order and prints which stages were active, making it explicit when a deployment opts out of any layer.
Key Insight

Memorization is not uniformly distributed. Models preferentially memorize content that is (1) repeated multiple times in the training data, (2) highly structured (phone numbers, email addresses, URLs), (3) distinctive or unusual in context, and (4) present in smaller fine-tuning datasets. Carlini et al. (2023) found that data duplicated 10x in the training set is extractable at 10x higher rates than unique sequences. This has a direct practical implication: deduplicating your training data is both a data quality measure and a privacy measure. The data processing techniques from Section 6.1 (deduplication, filtering) serve double duty as privacy defenses.

50.1.2 Membership Inference Attacks

Membership inference attacks (MIAs) answer a binary question: was a specific data point used in training this model? Unlike extraction attacks (which recover the data itself), MIAs reveal whether an individual's data was included in the training set. This is a privacy violation in itself: knowing that someone's medical records were used to train a model reveals information about their health status, even without seeing the records.

The core technique exploits the observation that models behave differently on data they were trained on (members) versus data they have never seen (non-members). Members typically have lower loss, higher confidence, and different gradient patterns. The attacker trains a binary classifier (the "attack model") to distinguish members from non-members based on these signals.

# Membership Inference Attack: loss-threshold method
import torch
import numpy as np
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.metrics import roc_auc_score, precision_recall_curve
def compute_per_sample_loss(
    model, tokenizer, texts: list[str], device: str = "cuda",
    ) -> np.ndarray:
    """Compute per-sample cross-entropy loss."""
    model.eval()
    losses = []
    for text in texts:
        inputs = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=512,
            ).to(device)
        with torch.no_grad():
            outputs = model(**inputs, labels=inputs["input_ids"])
            losses.append(outputs.loss.item())
            return np.array(losses)
            def membership_inference_attack(
                model_name: str,
                member_texts: list[str], # Known training data samples
                nonmember_texts: list[str], # Known non-training data samples
                device: str = "cuda",
                ) -> dict:
                """Run a loss-threshold membership inference attack."""
                tokenizer = AutoTokenizer.from_pretrained(model_name)
                model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
                member_losses = compute_per_sample_loss(
                    model, tokenizer, member_texts, device
                    )
                nonmember_losses = compute_per_sample_loss(
                    model, tokenizer, nonmember_texts, device
                    )
                # Lower loss = more likely a member
                # Negate losses so higher score = more likely member
                scores = np.concatenate([-member_losses, -nonmember_losses])
                labels = np.concatenate([
                    np.ones(len(member_losses)),
                    np.zeros(len(nonmember_losses)),
                    ])
                auc = roc_auc_score(labels, scores)
                precision, recall, _ = precision_recall_curve(labels, scores)
                return {
                    "auc_roc": round(auc, 4),
                    "member_mean_loss": round(float(member_losses.mean()), 4),
                    "nonmember_mean_loss": round(float(nonmember_losses.mean()), 4),
                    "loss_gap": round(
                    float(nonmember_losses.mean() - member_losses.mean()), 4
                    ),
                    }
Code Fragment 50.1.3a: A loss-threshold membership inference attack: compute per-sample cross-entropy for known members vs non-members, then use AUC-ROC to quantify how separable the two loss distributions are. A loss_gap close to zero means the model has not overfit; a large gap (or AUC > 0.8) indicates the attacker can reliably identify training examples.

A random-guessing attacker achieves AUC-ROC of 0.5. In practice, membership inference against large pretrained models typically achieves AUC between 0.55 and 0.65, a modest but statistically significant advantage. Against fine-tuned models, the attack is far more effective, often achieving AUC above 0.80, because the fine-tuning data is overfit to a much greater degree.

Algorithm 50.1.1: Shadow-Model Membership Inference (Shokri et al., 2017)
Algorithm: SHADOW-MODEL-MIA
Input:  Target model f_target,
        auxiliary unlabeled corpus D_aux from the target's distribution,
        number of shadow models K,
        shadow training size n_train
Output: Attack model A: score -> {member, non-member}

  // Phase 1: Build labeled attack data using shadow models
  attack_data = empty list
  For k = 1 to K:
    Sample D_k^in  of size n_train from D_aux         (in-set)
    Sample D_k^out of size n_train from D_aux \ D_k^in (held out)
    Train shadow model f_k with same architecture as f_target on D_k^in
    For each x in D_k^in:
      append ( features(f_k, x), label = MEMBER )      to attack_data
    For each x in D_k^out:
      append ( features(f_k, x), label = NON-MEMBER )  to attack_data

  // Phase 2: Train the attack classifier
  Train A on attack_data
  // features(f, x) commonly = ( -loss(f, x), top-1 prob, prediction entropy, ... )

  // Phase 3: Attack the target
  For each candidate x:
    Return A( features(f_target, x) )
Code Fragment 50.1.3b: Shokri's shadow-model attack trains K shadow models on auxiliary data drawn from the target's distribution, then trains an attack classifier on the (loss, member?) pairs they produce. The attack is stronger than threshold attacks because the shadow models calibrate the loss-distribution under both hypotheses, which is what LiRA exploits to dominate the low-FPR regime relevant to GDPR.

The simplest feature vector is the per-sample loss $\ell(f, x) = -\log p_f(x)$; thresholding $\ell$ at the shadow-set median already achieves $0.60\text{–}0.75$ AUC on a fine-tuned GPT-2. Better attacks (LiRA, Carlini et al., 2022) compute the per-example likelihood ratio $\Lambda(x) = p(\ell(f_{\mathrm{target}}, x) \mid x \in D_{\mathrm{train}})\,/\,p(\ell(f_{\mathrm{target}}, x) \mid x \notin D_{\mathrm{train}})$ using shadow-model loss distributions, which yields the strongest known attacks at the low-FPR regime relevant to GDPR (Shokri et al., 2017).

Key Insight: DP-SGD as Gradient Clipping + Gaussian Noise

DP-SGD modifies the standard mini-batch gradient update in exactly two places. Let $g_i(\theta) = \nabla_\theta \ell(\theta; x_i)$ be the per-example gradient on example $x_i$. Standard SGD averages: $\theta \leftarrow \theta - \eta \cdot \tfrac{1}{|B|}\sum_{i \in B} g_i$. DP-SGD instead clips per-example, sums, adds Gaussian noise, then averages:

$$\theta \;\leftarrow\; \theta \;-\; \frac{\eta}{|B|}\Bigl(\underbrace{\sum_{i \in B} \operatorname{clip}_C(g_i)}_{\text{bounded per-example sensitivity}} \;+\; \underbrace{\mathcal{N}\!\bigl(0,\sigma^2 C^2 I\bigr)}_{\text{Gaussian mechanism}}\Bigr),$$

where $\operatorname{clip}_C(g) = g \cdot \min\!\bigl(1, C/\|g\|_2\bigr)$ bounds the L2 norm of every per-example contribution to $C$. The cumulative $(\varepsilon, \delta)$ budget across $T$ steps is tracked by the Renyi-DP moments accountant (Abadi et al., 2016):

$$\varepsilon(T) \;\le\; \frac{q\,\sqrt{2\,T\,\log(1/\delta)}}{\sigma},\quad q = |B|/N,$$

so doubling $T$ increases $\varepsilon$ by $\sqrt{2}$; doubling the noise multiplier $\sigma$ halves it. This is why DP-SGD pairs naturally with parameter-efficient fine-tuning (LoRA): fewer trainable parameters means fewer steps to convergence and a smaller cumulative $\varepsilon$ for the same utility.

50.1.3 Differential Privacy in Fine-Tuning

Differential privacy (DP) provides a mathematical guarantee that the output of a computation does not depend "too much" on any single input record. Formally, a mechanism $\mathcal{M}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two datasets $D$ and $D'$ that differ in a single record, and for any set of outputs $S$:

A crowd photo where one individual is hidden in a soft fog of statistical noise dots; the overall scene stays readable.
Figure 50.1.2a: Differential privacy adds calibrated statistical noise so no single individual's data can be distinguished, while aggregate patterns remain accurate: the crowd stays readable, the individual stays private.
$$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta$$

The privacy budget $\varepsilon$ (epsilon) controls the strength of the guarantee. Smaller epsilon means stronger privacy but greater utility loss. In practice, $\varepsilon$ values between 1 and 10 are common for fine-tuning, with $\delta$ set to the reciprocal of the training set size. An $\varepsilon$ of 1 is considered strong privacy; $\varepsilon$ of 8 is moderate; values above 10 provide limited formal guarantees but may still offer practical protection against known attacks.

Worked Example: Reading the DP Guarantee

Suppose we train with $\varepsilon = 2$ and $\delta = 10^{-5}$. For any single record, the guarantee says:

$$\Pr[\mathcal{M}(D) \in S] \leq e^{2} \cdot \Pr[\mathcal{M}(D') \in S] + 10^{-5}$$

Since $e^{2} \approx 7.39$, any output set $S$ is at most 7.39 times more likely when your record is included versus excluded, plus a negligible $10^{-5}$ slack. At $\varepsilon = 1$ the multiplier drops to $e^{1} \approx 2.72$, and at $\varepsilon = 8$ it rises to $e^{8} \approx 2{,}981$, which is why the community considers $\varepsilon \leq 1$ strong, $\varepsilon \approx 8$ moderate, and $\varepsilon > 10$ weak.

50.1.3.1 DP-SGD with Opacus

DP-SGD (Differentially Private Stochastic Gradient Descent) modifies the training loop in two ways: (1) it clips the per-sample gradient to bound the influence of any single example, and (2) it adds calibrated Gaussian noise to the clipped gradients before the optimizer step. The Opacus library from Meta provides a drop-in integration with PyTorch that handles both operations transparently.

# DP-SGD fine-tuning with Opacus
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from opacus import PrivacyEngine
from opacus.validators import ModuleValidator
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2,
    )
# Opacus requires models to pass validation
# (e.g., BatchNorm must be replaced with GroupNorm)
model = ModuleValidator.fix(model)
errors = ModuleValidator.validate(model, strict=False)
assert len(errors) == 0, f"Model validation failed: {errors}"
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# Privacy parameters
MAX_GRAD_NORM = 1.0 # Per-sample gradient clipping bound
EPSILON = 8.0 # Target privacy budget
DELTA = 1e-5 # Should be < 1/N (training set size)
EPOCHS = 3
BATCH_SIZE = 32
# Wrap model, optimizer, and dataloader with PrivacyEngine
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader, # Your DataLoader
    epochs=EPOCHS,
    target_epsilon=EPSILON,
    target_delta=DELTA,
    max_grad_norm=MAX_GRAD_NORM,
    )
print(f"Using sigma={optimizer.noise_multiplier:.4f} "
    f"for ({EPSILON}, {DELTA})-DP over {EPOCHS} epochs")
# Training loop (standard PyTorch, Opacus handles DP transparently)
for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        # Check actual privacy spent so far
        eps = privacy_engine.get_epsilon(delta=DELTA)
        print(f"Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}, "
            f"epsilon_spent={eps:.2f}")
Code Fragment 50.1.4: Opacus's make_private wraps a standard PyTorch training loop, applying per-sample gradient clipping at max_grad_norm and adding calibrated Gaussian noise so that the cumulative (epsilon, delta) budget is tracked by the moments accountant. The fine-tuning loop itself is unchanged; the privacy guarantee comes from the privatized DataLoader.
Real-World Scenario
The Privacy-Utility Trade-off in Practice

Who: A machine learning engineer and a privacy counsel at a legal technology company processing privileged attorney-client communications

Situation: The company was fine-tuning DistilBERT for sentiment classification on client feedback to improve their legal AI product. Without differential privacy, the model achieved approximately 92% accuracy on the SST-2 benchmark. However, privacy counsel required formal privacy guarantees because the training data contained sensitive legal communications.

Problem: DP-SGD at epsilon=1 (strong privacy) dropped accuracy to approximately 82%, a 10-point loss that made the model unusable for production. The team needed to find a privacy budget that satisfied legal requirements without destroying model utility.

Dilemma: Ship a strong-privacy model nobody could use, or relax the budget and have to defend that choice if regulators or clients later challenged the privacy guarantee.

Decision: After testing multiple epsilon values, they selected epsilon=8 (moderate privacy), which dropped accuracy to approximately 88%. Privacy counsel accepted this level because it still provided a formal mathematical bound on individual record influence, and the 4-point accuracy loss was within acceptable margins for their use case.

How: Engineering ran a sweep of DP-SGD fine-tunes at epsilon in {1, 2, 4, 8, 16} using Opacus, plotting accuracy against epsilon and presenting the curve to privacy counsel alongside membership inference results at each point so the choice was driven by data, not intuition.

Result: The model shipped with DP-SGD at epsilon=8. A subsequent membership inference audit confirmed that no individual training record could be extracted. The 4% accuracy gap was offset by users' increased willingness to share sensitive data, knowing it was formally protected.

Lesson: Selecting the right epsilon requires explicit negotiation between engineering and legal teams; moderate privacy budgets (epsilon 4-8) often provide the best practical balance for sensitive but non-medical data.

50.1.3.2 Privacy Budget Management

The privacy budget is a finite resource that is consumed with every access to the training data. Under the composition theorem, running $k$ training runs each with privacy budget $\varepsilon$ consumes a total budget of approximately $\varepsilon \sqrt{k}$ (under advanced composition) or $k \varepsilon$ (under basic composition). This means that hyperparameter tuning, which requires many training runs, rapidly depletes the privacy budget. Strategies for managing the budget include:

50.1.4 Contextual Integrity and PII Leakage

Helen Nissenbaum's theory of contextual integrity provides a useful framework for reasoning about privacy in LLM systems. The theory holds that privacy is violated not by any disclosure of information, but by information flows that violate context-specific norms. A person's medical diagnosis shared with their doctor is appropriate; the same information appearing in a chatbot's response to a stranger violates contextual integrity, even if the information is technically "public" in some database.

For LLM engineers, this framework suggests that privacy protection is not only about preventing extraction of training data. It also encompasses ensuring that the model does not recombine publicly available information in ways that violate contextual norms. A model that infers someone's medical condition from their public social media posts and reveals this in a response violates contextual integrity, even though no private data was used in training.

50.1.4.1 PII Detection and Mitigation

This snippet detects personally identifiable information in text and redacts it before sending data to an LLM.

# PII detection and scrubbing pipeline
import re
from dataclasses import dataclass
@dataclass
class PIIMatch:
    category: str
    text: str
    start: int
    end: int
    confidence: float
PII_PATTERNS = {
    "email": (
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", 0.95
    ),
    "phone_us": (
    r"(?:\+1[-.])?\(?\d{3}\)?[-.]?\d{3}[-.]?\d{4}", 0.85
    ),
    "ssn": (
    r"\b\d{3}-\d{2}-\d{4}\b", 0.95
    ),
    "credit_card": (
    r"\b(?:\d{4}[-\s]?){3}\d{4}\b", 0.90
    ),
    "ip_address": (
    r"\b(?:\d{1,3}\.){3}\d{1,3}\b", 0.70
    ),
    }
def detect_pii(text: str) -> list[PIIMatch]:
    """Detect PII using pattern matching."""
    matches = []
    for category, (pattern, confidence) in PII_PATTERNS.items():
        for match in re.finditer(pattern, text):
            matches.append(PIIMatch(
                category=category,
                text=match.group(),
                start=match.start(),
                end=match.end(),
                confidence=confidence,
                ))
            return matches
            def scrub_pii(text: str, replacement: str = "[REDACTED]") -> str:
                """Replace detected PII with redaction markers."""
                matches = detect_pii(text)
                # Process matches in reverse order to preserve positions
                for match in sorted(matches, key=lambda m: m.start, reverse=True):
                    tag = f"[{match.category.upper()}_REDACTED]"
                    text = text[:match.start] + tag + text[match.end:]
                    return text
                    # Example
                    sample = "Contact John at john.doe@example.com or 555-123-4567"
                    print(scrub_pii(sample))
                    # "Contact John at [EMAIL_REDACTED] or [PHONE_US_REDACTED]"
Output: Contact John at [EMAIL_REDACTED] or [PHONE_US_REDACTED]
Code Fragment 50.1.5a: A PII scrubbing pipeline that walks regex patterns for emails, phones, SSNs, and credit cards, then replaces matches with typed placeholders before training data ever reaches the optimizer. Removing identifiers before training is cheaper than DP-SGD and complementary: PII scrubbing handles the structured leaks, DP handles the unstructured statistical ones.

Pattern-based PII detection catches structured identifiers (emails, phone numbers, credit cards) but misses unstructured PII (names in context, addresses described in prose, medical conditions). For comprehensive PII detection, combine regex patterns with NER-based approaches using models like Microsoft Presidio, which applies named entity recognition to identify person names, locations, organizations, and other entity types that may constitute PII.

Library Shortcut: Presidio for PII Detection

The same result in 6 lines with Presidio, which adds NER-based detection on top of regex patterns:

Show code
# pip install presidio-analyzer presidio-anonymizer
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "Contact John at john.doe@example.com or 555-123-4567"
results = analyzer.analyze(text=text, entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON"], language="en")
scrubbed = anonymizer.anonymize(text=text, analyzer_results=results)
print(scrubbed.text) # "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>"
Code Fragment 50.1.5b: Presidio extends regex PII detection with NER-based recognizers, catching unstructured PII like person names ("John") and locations that pure regex misses. Passing a whitelist of entities to analyze keeps false-positive rates manageable; passing None runs every recognizer at the cost of more spurious flags.

50.1.5 Defense in Depth: Combining Privacy Techniques

No single privacy technique provides complete protection. A robust privacy strategy layers multiple defenses, each addressing a different threat vector:

Table 50.1.1b: Privacy Defense Stack for LLM Applications (as of 2026).
LayerTechniqueProtects AgainstLimitation
Data preprocessing PII scrubbing, deduplication Direct PII exposure, memorization of duplicates Incomplete PII detection; removes but does not forget
Training DP-SGD (Opacus) Membership inference, extraction with formal guarantees Utility degradation; requires privacy budget management
Post-training Machine unlearning (Section 49.7) Specific data removal requests (GDPR) Verification of complete removal is difficult
Inference Output filtering, PII scanning PII in model responses Cannot catch all implicit information leakage
Architecture Retrieval separation (RAG) Reduces memorization by externalizing knowledge Retrieved documents may themselves contain PII
Key Insight

Differential privacy and PII scrubbing address different threat models. PII scrubbing removes identifiable information from the data before training, but cannot protect against inference attacks that reconstruct PII from statistical patterns. Differential privacy provides formal guarantees against such attacks, but at the cost of model utility. The optimal strategy uses both: scrub PII to remove the obvious risks (and improve data quality), then apply DP to bound the residual privacy leakage from statistical patterns. This defense-in-depth approach aligns with both GDPR's data minimization principle and the practical reality that no single technique is sufficient.

Lab: Measuring and Mitigating Memorization

Objective

This lab walks through a complete workflow: fine-tune a small model on a dataset containing synthetic PII, measure memorization rates before and after DP-SGD, and compare utility.

Setup

You need a CUDA GPU with at least 8 GB of VRAM, the transformers and datasets libraries from Hugging Face, and Opacus for the DP-SGD pass. The lab uses a small causal-LM checkpoint (GPT-2 small is the worked example); pin a tokenizer, set a global seed, and write outputs to a clean directory so the canary-extraction step is reproducible.

Steps

The four-step workflow is shown in the code below: build the dataset with embedded canaries, fine-tune without DP and measure canary memorization, fine-tune with DP-SGD and measure canary memorization, and finally compare utility (perplexity on held-out data) across the two runs. The full DP-SGD implementation is in Exercise 29.12.2.

# Lab: Memorization measurement and DP mitigation
import torch
import numpy as np
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    TrainingArguments, Trainer,
    )
from datasets import Dataset
# Step 1: Create a dataset with "canary" sequences
# These are unique strings we embed to test for memorization
def create_canary_dataset(n_samples: int = 1000, n_canaries: int = 50):
    """Create training data with embedded canary sequences."""
    import random
    import string
    base_texts = [
        f"The capital of France is Paris. Document {i}."
        for i in range(n_samples - n_canaries)
        ]
    # Canaries: unique sequences that should not be memorizable
    canaries = []
    for i in range(n_canaries):
        # Simulated PII: fake SSN, email, phone
        fake_ssn = f"{random.randint(100,999)}-{random.randint(10,99)}-{random.randint(1000,9999)}"
        fake_email = (
            ''.join(random.choices(string.ascii_lowercase, k=8))
            + "@example.com"
            )
        canary = (
            f"Patient record {i}: SSN {fake_ssn}, "
            f"contact {fake_email}, diagnosis: common cold."
            )
        canaries.append(canary)
        all_texts = base_texts + canaries
        random.shuffle(all_texts)
        return all_texts, canaries
        texts, canaries = create_canary_dataset()
        print(f"Dataset: {len(texts)} samples, {len(canaries)} canaries")
        print(f"Sample canary: {canaries[0]}")
        # Step 2: Fine-tune without DP, measure canary memorization
        # Step 3: Fine-tune with DP, measure canary memorization
        # Step 4: Compare utility (perplexity on held-out data)
        # See exercise 32.12.2 for the complete implementation
Output: Dataset: 1000 samples, 50 canaries Sample canary: Patient record 0: SSN 472-38-9104, contact xkqmwvbn@example.com, diagnosis: common cold.
Code Fragment 50.1.6: The lab scaffolding generates a training set with 950 benign documents and 50 unique "canary" sequences containing fake SSNs and emails. Because canaries are unique, any non-zero canary-recall rate after training is direct evidence of memorization, providing a clean A/B comparison between standard SGD and DP-SGD runs.

Expected Output

Without DP, the model should reliably regurgitate a substantial fraction of the canaries when prompted with the canary prefix (extraction success rate typically 30-80 percent on a 1000-sample fine-tune). With DP-SGD at epsilon = 8, the extraction rate should fall to near zero, while validation perplexity rises by roughly 10-20 percent. Plot the canary-extraction rate alongside the perplexity curve to visualize the privacy-utility trade-off the rest of this section formalizes.

Key Takeaways
Warning: Common Misconception

Applying differential privacy to fine-tuning does not retroactively protect the pretraining data. Every current foundation model was pretrained without DP guarantees, so the base model still memorizes and leaks information from its pretraining corpus. DP fine-tuning protects only the fine-tuning dataset. Full protection requires defense in depth: DP fine-tuning plus output filtering plus PII detection at inference time. Skip any layer and the leak surface stays open.

Exercises

Exercise 50.1.1: Membership Inference Evaluation

Fine-tune GPT-2 on a 1,000-example subset of a public dataset (e.g., Wikitext). Hold out 1,000 non-training examples from the same distribution. Implement the loss-threshold membership inference attack from Section 2 and report the AUC-ROC. Then repeat the experiment with DP-SGD at epsilon=8. How much does DP reduce the attack's effectiveness?

Answer Sketch

Without DP, the membership inference AUC is typically 0.70 to 0.85 on a small fine-tuning set, because the model overfits significantly. With DP-SGD at epsilon=8, the AUC drops to 0.55 to 0.65, close to random guessing. The noise added during DP training prevents the model from fitting closely enough to individual examples for the loss gap to be exploitable. The trade-off is a modest increase in validation perplexity (roughly 10 to 20%).

Exercise 50.1.2: Canary Extraction Lab

Using the canary dataset from Section 6, complete the full lab: (1) Fine-tune GPT-2 without DP for 5 epochs. (2) Attempt to extract canaries by prompting with their prefixes and measuring completion perplexity. (3) Count how many canaries are extractable (perplexity below a threshold). (4) Repeat with DP-SGD at epsilon=4 and epsilon=8. Plot the number of extractable canaries vs. epsilon.

Answer Sketch

Without DP, 60 to 80% of canaries are extractable after 5 epochs of fine-tuning on a small dataset. At epsilon=8, this drops to 10 to 20%. At epsilon=4 (stronger privacy), fewer than 5% of canaries are extractable. The validation perplexity increases by ~15% at epsilon=8 and ~30% at epsilon=4 compared to the non-DP baseline. This demonstrates the privacy-utility trade-off quantitatively.

Exercise 50.1.3: Why LLMs Memorize Conceptual

An LLM trained on the open web can sometimes regurgitate verbatim text it saw during pretraining, including PII like phone numbers and credit card formats. (a) What property of the training objective causes verbatim memorization? (b) Why are duplicates and rare strings the highest-risk subsets for extraction attacks? (c) Why does scaling up parameters generally make memorization worse, not better, at fixed dataset size?

Answer Sketch

(a) Cross-entropy training drives the model to maximize the likelihood of the exact next token; for any string in the training set, perfect prediction is the minimum-loss solution, so the optimizer pulls the model toward memorization wherever capacity allows. (b) Duplicates are reinforced multiple times so their conditional probability rises above the noise floor; rare unique strings have no competing alternatives in the model's distribution, so once memorized they're easy to elicit with the right prefix. (c) More parameters mean more capacity to memorize at the same training loss; the model can fit both the generalization signal and individual examples. This is why frontier-model labs combine deduplication with privacy-aware filtering rather than just relying on scale.

Exercise 50.1.4: Predict an Extraction Attack Outcome Predictive

You run a Carlini-style extraction attack: prompt the model with random short prefixes and check whether continuations match training data. Predict the qualitative shape of the results: (a) for a 7B model trained on dedup'd 2T tokens; (b) for the same model after fine-tuning on a private medical corpus; (c) what changes if the attacker has access to the gradient (white-box) versus only the API (black-box)?

Answer Sketch

(a) Most prefixes return generic, novel-looking text. Verbatim hits are rare (often <1% of prefixes) and concentrated in heavily duplicated content (license headers, common code snippets, song lyrics). (b) Risk skyrockets: the medical corpus is small relative to model capacity and was seen many epochs, so memorization rates can reach 5-20% for whole patient records. This is why DP fine-tuning matters more than DP pretraining for many threat models. (c) White-box attackers can run membership-inference with high precision (test if a specific record was in training) and can do gradient-based extraction with much higher hit rate. Black-box attackers are limited to behavioral probes; their bound is loose but real. The threat model determines which mitigations to invest in.

Exercise 50.1.5: Add DP-SGD to Fine-Tuning Code Tweak

Sketch the change to a standard PyTorch fine-tuning loop required to use Opacus for differential privacy. Note (a) the three-line wrapper, (b) the two hyperparameters that control the privacy/utility tradeoff, and (c) the typical accuracy cost.

Answer Sketch

(a) Wrapper: from opacus import PrivacyEngine; pe = PrivacyEngine(); model, optimizer, train_loader = pe.make_private(module=model, optimizer=optimizer, data_loader=train_loader, noise_multiplier=1.0, max_grad_norm=1.0). (b) The two key knobs: noise_multiplier (more noise = stronger privacy, lower utility) and max_grad_norm (the per-sample gradient clip, controlling sensitivity). Together they determine the (epsilon, delta) privacy budget after a given number of steps. (c) Typical cost: 5-15 percentage-point accuracy drop on small fine-tuning datasets. The hit gets smaller as dataset size grows; for very large fine-tunes the gap can shrink to ~1-3 points. PEFT methods like LoRA combine well with DP because they reduce the effective dimensionality being clipped, recovering some utility.

Exercise 50.1.6: Defense-in-Depth Failure Modes Failure Mode

Your team layers four privacy defenses: (i) PII redaction in pretraining, (ii) DP fine-tuning, (iii) input filtering in production, (iv) output scanning for SSN-like patterns. Identify three failure modes that survive all four layers, and what additional measure would close each.

Answer Sketch

(1) Indirect leakage via inference: model produces a fact that, combined with other public information, identifies a specific person. Pattern scanning misses this entirely. Mitigation: contextual integrity audits, where you red-team for inferential disclosures with realistic attacker queries. (2) Memorization in adapter weights: fine-tuning data leaks through PEFT adapters even if the base passed PII redaction. Mitigation: extraction tests on the full adapter+base stack as part of release gates. (3) Conversational-history leak: a long conversation embeds facts about the user that subsequent calls echo back to other parties. Mitigation: per-tenant memory isolation and explicit consent gates for memory features. The general principle: technical defenses bound, but never eliminate, privacy risk; pair them with policy and audit.

Research Frontier

Machine unlearning for LLMs seeks to remove specific training data points from a model after training, without retraining from scratch.

Current approaches (gradient ascent on forget sets, SISA sharding) work for smaller models but remain computationally expensive at LLM scale.

In parallel, confidential computing (using hardware enclaves like Intel TDX and AMD SEV-SNP) is being explored to run LLM inference on encrypted data, ensuring that even the cloud provider cannot observe user prompts or model outputs.

What Comes Next

Differential privacy and PII defenses protect data held centrally. The next section, Federated Learning for Privacy-Preserving Training, covers the complementary problem: training across organizations whose data cannot leave their premises at all.

See Also

For the differential-privacy and federated-learning libraries (Opacus, Flower) that operationalize these defenses, see Section 56.2: Responsible-AI Tools. For the RAG-specific privacy considerations (corpus PII, retrieval logs, prompt-injection through documents), see Section 32.1: RAG Foundations. For the regulatory side of privacy (GDPR, CCPA, EU AI Act data-protection clauses), see Section 53.1: Global Regulatory Landscape.

Further Reading

Key References

Carlini, N. et al. (2021). "Extracting Training Data from Large Language Models." USENIX Security Symposium. Demonstrates that GPT-2 memorizes and can reproduce verbatim training data, including personal information. Established that memorization is a fundamental privacy risk in large language models, not just a theoretical concern.
Carlini, N. et al. (2023). "Quantifying Memorization Across Neural Language Models." ICLR 2023. Systematically measures how memorization scales with model size, finding that larger models memorize more training data. Provides the empirical foundation for this section's discussion of memorization risks.
Abadi, M. et al. (2016). "Deep Learning with Differential Privacy." ACM CCS 2016. The foundational paper on training deep learning models with formal differential privacy guarantees. Introduces the DP-SGD algorithm that remains the primary approach for privacy-preserving model training.
Yousefpour, A. et al. (2021). "Opacus: User-Friendly Differential Privacy Library in PyTorch." arXiv:2109.12298. GitHub: pytorch/opacus. Documentation and paper for Opacus, PyTorch's official differential privacy library. The most practical entry point for implementing DP-SGD in existing PyTorch training pipelines.
Nissenbaum, H. (2004). "Privacy as Contextual Integrity." Washington Law Review, 79(1). Introduces contextual integrity as a framework for reasoning about privacy, arguing that privacy norms depend on information flows within specific social contexts. Provides the theoretical foundation for evaluating when AI data usage violates user expectations.
Shokri, R. et al. (2017). "Membership Inference Attacks Against Machine Learning Models." IEEE S&P 2017. Introduces membership inference attacks, showing that an adversary can determine whether a specific data point was in the training set. A key threat model for understanding privacy risks in machine learning deployments.
Li, X. et al. (2022). "Large Language Models Can Be Strong Differentially Private Learners." ICLR 2022. Shows that large language models can achieve strong task performance while maintaining differential privacy guarantees, challenging the assumption that privacy and utility are fundamentally at odds in the LLM setting.
Microsoft Presidio. "Data Protection and De-identification SDK." GitHub: microsoft/presidio. Open-source SDK for detecting and anonymizing personally identifiable information (PII) in text. A practical tool for preprocessing training data and sanitizing LLM inputs and outputs in privacy-sensitive applications.