Machine Unlearning

Section 50.2

Forgetting on purpose turns out to be much harder than learning by accident.

GuardA Diligent Guard, Stubbornly Remembering AI Agent
Big Picture

Machine unlearning is the ability to remove specific knowledge from a trained model without retraining from scratch. This capability is driven by three needs: GDPR right-to-erasure compliance (removing personal data), copyright compliance (removing copyrighted content), and safety RLHF (removing dangerous knowledge). While retraining from scratch on a filtered dataset is the gold standard, it is prohibitively expensive for large models. Approximate unlearning methods trade off forgetting guarantees for computational efficiency. The fine-tuning techniques from Section 16.3 and the interpretability tools from Section 10.3 (model editing) are closely related to the unlearning approaches discussed here.

Prerequisites

Before starting, make sure you are familiar with the LLM API and basic deployment vocabulary from Section 11.1. Production engineering, application architecture, and deployment patterns are revisited in detail later in the book.

A robot brain with a zipper being opened, with a scientist carefully removing specific glowing memory orbs using tweezers while other memories remain safely in place, representing the concept of selective knowledge removal from neural networks.
Figure 50.2.1: Forgetting on purpose turns out to be much harder than learning by accident. Approximate unlearning methods trade perfect erasure guarantees for practical computational costs.

50.2.1 Motivations for Unlearning

Table 50.2.1a: Motivations for Unlearning Basic Comparison (as of 2026).
MotivationWhat to RemoveVerification Challenge
GDPR right to erasureIndividual's personal dataProve the model cannot reproduce the specific data
Copyright complianceCopyrighted text, code, imagesVerify no verbatim or near-verbatim reproduction
Safety alignmentDangerous knowledge (bioweapons, hacking)Ensure knowledge is not recoverable via fine-tuning
Model updatesOutdated or incorrect informationConfirm old facts are replaced, not just suppressed

Three families of unlearning methods exist, each with different cost and guarantee trade-offs. Figure 50.2.1b compares exact unlearning, approximate methods, and weight editing approaches.

Three families of machine unlearning compared: Exact (retrain from scratch on filtered data, complete guarantee, prohibitive cost), Approximate (gradient ascent on the forget set, partial guarantee, moderate cost), Weight Editing (task vectors, LOKA, representation surgery, targeted guarantee, low cost).
Figure 50.2.2: Unlearning methods trade off between forgetting guarantees and computational cost.
Key Insight

Mental Model: The Ink on Paper. Machine unlearning is like trying to remove ink from paper after it has dried. Exact unlearning (retraining) is reprinting the entire document without the offending paragraph: perfect results, but enormously expensive for a long book. Gradient ascent is using chemical solvent to fade the ink: cheaper, but traces may remain and nearby text can smudge. Weight editing is carefully cutting out a sentence and gluing the page back together: surgical, but the seam might show. The fundamental challenge is that neural network weights, like dried ink, blend learned information in ways that resist clean separation.

Fun Fact

The field of AI safety has grown from a handful of researchers in 2015 to thousands of full-time practitioners in 2025. Anthropic, OpenAI, Google DeepMind, and Meta all have dedicated safety teams with budgets in the tens of millions. This growth reflects a hard-won consensus: safety is not a constraint on capability; it is a prerequisite for deployment.

Machine unlearning intersects with several other topics in this book. The GDPR "right to be forgotten" (Section 49.4) creates legal demand for unlearning. The interpretability techniques from Section 10.2 could, in principle, help identify which weights encode the knowledge to be forgotten. And the evaluation frameworks from Chapter 42 are essential for verifying that unlearning actually worked, because the model might still reveal the "forgotten" information through indirect queries.

Key Insight: SISA Partitioning for Cheap Exact Unlearning

SISA (Sharded, Isolated, Sliced, Aggregated; Bourtoule et al., 2021) is the only known method that gives exact unlearning at sub-linear retraining cost. Partition the dataset $D$ of $N$ records into $S$ disjoint shards $D = \bigsqcup_{j=1}^{S} D_j$; further split each shard into $R$ slices $D_j = \bigsqcup_{r=1}^{R} D_{j,r}$. Train one constituent model $f_j$ per shard, checkpointing after each slice. At inference, ensemble the $S$ shard models (majority vote or average logits). To unlearn a record $x$ located in shard $j^\star$, slice $r^\star$: only retrain $f_{j^\star}$ starting from its checkpoint at slice $r^\star - 1$. Expected retraining cost is

$$\mathbb{E}[\text{cost}] \;=\; \frac{N}{S}\cdot\frac{R - r^\star + 1}{R} \;\le\; \frac{N}{S},$$

versus $N$ for naive retraining; a ($S=20, R=10$) setup is on average $\sim$200x cheaper. The cost is utility: ensembling $S$ shard-trained models loses 2–5 points of accuracy compared to a single model trained on the full dataset, which is why SISA is most attractive at the high-erasure-rate end of GDPR workloads.

Algorithm 50.2.1: Gradient-Ascent Unlearning with Retain-Set Anchoring
Algorithm: GRADIENT-ASCENT-UNLEARN
Input:  Trained model theta_0,
        forget set D_forget, retain set D_retain,
        learning rate eta, balance alpha in [0,1],
        epochs T, KL anchor weight beta
Output: Unlearned weights theta_T

  theta_ref = theta_0                          // snapshot for KL anchor
  theta = theta_0
  For epoch = 1 to T:
    For (x_f, x_r) zipped from D_forget x D_retain:
      L_forget = - cross_entropy(f_theta(x_f), y_f)   // ASCENT: negate
      L_retain =   cross_entropy(f_theta(x_r), y_r)   // descent
      // KL anchor keeps f_theta close to f_ref on retain set
      L_kl     = KL( f_theta(x_r) || f_ref(x_r) )
      L = alpha * L_forget + (1 - alpha) * L_retain + beta * L_kl
      theta = theta - eta * nabla_theta L
  Return theta

The unanchored gradient-ascent objective ($\beta = 0$) is famously unstable: aggressive negation flips the model into a "say-nothing" attractor where it forgets the forget set and 30–50% of unrelated capabilities. The KL anchor on a held-out retain set keeps $f_\theta$ close to its pre-unlearning behavior outside the forget region, and is the practical fix that ships in TOFU-style benchmarks (Maini et al., 2024).

50.2.2 Gradient Ascent Unlearning

The most direct unlearning approach reverses the training process: instead of minimizing loss on the data to forget, we maximize it (gradient ascent), while simultaneously maintaining performance on retained data through standard gradient descent. Code Fragment 50.2.2a below implements this dual-objective optimization loop.

# implement gradient_ascent_unlearn
import torch
from torch.utils.data import DataLoader
def gradient_ascent_unlearn(model, forget_loader: DataLoader,
    retain_loader: DataLoader,
    epochs: int = 3, lr: float = 1e-5,
    alpha: float = 0.5):
    """Unlearn via gradient ascent on forget set + descent on retain set."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total_loss = 0
        forget_iter = iter(forget_loader)
        retain_iter = iter(retain_loader)
        for step in range(min(len(forget_loader), len(retain_loader))):
            # Gradient ASCENT on forget data (maximize loss = forget)
            forget_batch = next(forget_iter)
            forget_out = model(**forget_batch, labels=forget_batch["input_ids"])
            forget_loss = -forget_out.loss # negate for ascent
            # Gradient DESCENT on retain data (minimize loss = keep)
            retain_batch = next(retain_iter)
            retain_out = model(**retain_batch, labels=retain_batch["input_ids"])
            retain_loss = retain_out.loss
            # Combined loss: forget + retain balance
            loss = alpha * forget_loss + (1 - alpha) * retain_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
            print(f"Epoch {epoch + 1}: avg loss = {total_loss / (step + 1):.4f}")
            return model
Output: Epoch 1: avg loss = -0.3142 Epoch 2: avg loss = -0.5871 Epoch 3: avg loss = -0.7203
Code Fragment 50.2.1c: Implementation of gradient_ascent_unlearn
Tip

After any unlearning procedure, always test both the "forget set" (the data you wanted removed) and a "retain set" (capabilities you want to keep). Gradient ascent unlearning frequently degrades general model quality alongside the targeted knowledge. If your retain-set accuracy drops more than 2 to 3 percentage points, the unlearning was too aggressive. Reduce the learning rate or the number of unlearning epochs and try again.

50.2.3 Task Vector Unlearning

An alternative to gradient ascent is task vector negation. A "task vector" is the difference between a fine-tuned model's weights and the base model's weights. Negating this vector and applying it back to the base model removes the learned skill. Code Fragment 50.2.2b below implements this approach.

# implement compute_task_vector, negate_task_vector
import torch
from collections import OrderedDict
def compute_task_vector(base_weights: dict, finetuned_weights: dict) -> dict:
    """Compute the task vector (difference between fine-tuned and base)."""
    task_vector = OrderedDict()
    for key in base_weights:
        task_vector[key] = finetuned_weights[key] - base_weights[key]
        return task_vector
def negate_task_vector(base_weights: dict, task_vector: dict,
    scale: float = 1.0) -> dict:
    """Remove a capability by negating the task vector."""
    result = OrderedDict()
    for key in base_weights:
        result[key] = base_weights[key] - scale * task_vector[key]
        return result
    # Conceptual example:
        # 1. Fine-tune base model on "toxic content generation"
        # 2. Compute task_vector = finetuned_weights - base_weights
        # 3. Subtract task_vector from base: unlearned = base - scale * task_vector
        # Result: model with reduced ability to generate toxic content
        print("Task vector unlearning: subtract the 'skill vector' to remove capability")
Output: Task vector unlearning: subtract the 'skill vector' to remove capability
Code Fragment 50.2.2c: Defines compute_task_vector and negate_task_vector

Task vector arithmetic provides an intuitive geometric picture of how unlearning works. Figure 50.2.3 visualizes the weight-space operations for both gradient ascent and task vector approaches.

Weight-space view of unlearning: gradient ascent maximizes loss on the forget set, moving the model in the opposite direction from the forget gradient; a retain-loss term pulls the weights back to preserve general ability.
Figure 50.2.3a: Two geometric views of unlearning in weight space. Gradient ascent (left) moves away from the forget set's loss minimum. Task vector negation (right) subtracts the learned skill direction from the base model.

50.2.4 Evaluating Unlearning Quality

Measuring whether unlearning actually worked requires checking three dimensions: whether the model has truly forgotten the target data, whether it retains performance on everything else, and whether membership inference attacks can still detect traces of the forgotten data. Code Fragment 50.2.3b below implements an evaluation framework for all three.

# Define UnlearningEvaluation; implement forget_quality, privacy_leakage
from dataclasses import dataclass
@dataclass
class UnlearningEvaluation:
    """Evaluate the quality of machine unlearning."""
    forget_accuracy: float # lower is better (model forgot)
    retain_accuracy: float # higher is better (model remembers)
    membership_inference_auc: float # closer to 0.5 is better
    @property
    def forget_quality(self) -> str:
        if self.forget_accuracy < 0.1 and self.retain_accuracy > 0.9:
            return "excellent"
        elif self.forget_accuracy < 0.3 and self.retain_accuracy > 0.8:
            return "good"
            return "insufficient"
        @property
        def privacy_leakage(self) -> str:
            deviation = abs(self.membership_inference_auc - 0.5)
            if deviation < 0.05:
                return "minimal"
            elif deviation < 0.15:
                return "moderate"
                return "significant"
                eval_result = UnlearningEvaluation(
                    forget_accuracy=0.08, retain_accuracy=0.92,
                    membership_inference_auc=0.53
                    )
                print(f"Forget quality: {eval_result.forget_quality}")
                print(f"Privacy leakage: {eval_result.privacy_leakage}")
Output: Forget quality: excellent Privacy leakage: minimal
Code Fragment 50.2.3c: Defines forget_quality and privacy_leakage

Evaluating unlearning quality requires measuring three distinct axes simultaneously. Figure 50.2.4 shows how forget quality, retain quality, and privacy resistance together determine whether unlearning has succeeded.

Three evaluation axes for unlearning: Forget Quality (forget-set accuracy, target near random), Retain Quality (retain-set accuracy, target unchanged), and Privacy (membership inference AUC, target 0.5).
Figure 50.2.4a: Good unlearning must score well on all three axes: forgetting the target data, retaining general capability, and resisting membership inference attacks.
Warning

Approximate unlearning does not match retraining from scratch. Recent work (Lynch et al., 2024) recovered "unlearned" knowledge in all eight unlearning techniques tested using targeted fine-tuning or adversarial prompts. For high-stakes regulatory compliance, treat approximate unlearning as one layer among many: combine it with access restrictions, output filtering, and a scheduled full retrain. Do not rely on it alone.

Note

LOKA (Localized Knowledge Ablation) identifies the specific neurons or attention heads that encode the target knowledge and zeroes out or modifies only those parameters. This surgical approach minimizes collateral damage to other capabilities but requires interpretability tools to locate the relevant parameters.

Key Insight

A model that refuses to answer is not a model that has forgotten. Output suppression leaves the knowledge intact in the weights; the information leaks through indirect queries, jailbreaks, or after a tiny amount of follow-up fine-tuning. True unlearning must pass membership inference attacks, not just behavioral tests. Evaluation is half the work.

Real-World Scenario
GDPR Erasure Request for an LLM Trained on Customer Data

Who: A data protection officer and an ML team at a European insurance company

Situation: A customer exercised their GDPR Article 17 right to erasure, requesting that all their personal data be deleted from the company's systems, including any AI models trained on their data.

Problem: The customer's claims history had been part of the fine-tuning dataset for a claims processing LLM. Retraining from scratch on the filtered dataset would cost approximately 50,000 euros in compute and take three weeks.

Dilemma: Approximate unlearning (gradient ascent) was faster and cheaper but did not provide the same guarantees as full retraining. The DPO needed to demonstrate compliance to regulators if challenged.

Decision: They applied gradient ascent unlearning on the specific customer's data, verified with membership inference testing, and documented the process for regulatory review. They also committed to including this customer's data in the exclusion list for the next scheduled full retrain.

How: The forget set contained 47 records from the customer. Gradient ascent ran for 3 epochs with alpha=0.5 to balance forgetting against retaining general capability. Membership inference AUC on the forget set dropped from 0.82 to 0.51 (near random chance).

Result: The erasure request was fulfilled within the 30-day GDPR deadline. Retain set accuracy dropped by only 0.3%. The full retrain three months later confirmed complete removal.

Lesson: Approximate unlearning can satisfy erasure requests within regulatory timelines, but should be combined with scheduled full retrains and thorough membership inference verification for defensible compliance.

Research Frontier

Open Questions:

Recent Developments (2024-2025):

Explore Further: Fine-tune a small language model on a dataset containing specific facts, then apply a published unlearning technique (such as gradient ascent on the target data). Test whether the information is truly removed or merely suppressed by probing with varied prompts.

Key Takeaways
Self-Check

1. What are the three main motivations for machine unlearning in LLMs?

Show Answer
GDPR right to erasure (removing individual personal data), copyright compliance (removing copyrighted content from model knowledge), and safety alignment (removing dangerous capabilities like bioweapon synthesis instructions). Each motivation has different verification requirements and acceptable tradeoffs.

2. How does gradient ascent achieve unlearning?

Show Answer
Gradient ascent maximizes the loss on the forget set (the data to be removed) while minimizing the loss on the retain set (data that should be preserved). This pushes the model away from being able to correctly predict or reproduce the forget data while maintaining performance on everything else. The balance between forget and retain is controlled by the alpha hyperparameter.

3. What is a task vector and how can it be used for unlearning?

Show Answer
A task vector is the weight difference between a fine-tuned model and the base model: task_vector = finetuned_weights - base_weights. It encodes the "skill" learned during fine-tuning. For unlearning, you first fine-tune the base model specifically on the knowledge to remove, compute the task vector, then subtract it from the original model weights. This removes the encoded capability.

4. Why is membership inference AUC an important metric for unlearning evaluation?

Show Answer
Membership inference attacks try to determine whether a specific example was in the training set. After successful unlearning, an attacker should not be able to distinguish forgotten examples from never-seen examples, yielding an AUC near 0.5 (random chance). An AUC significantly above 0.5 indicates that the model still retains detectable traces of the supposedly forgotten data, meaning the unlearning was incomplete.

5. Why is output suppression (refusing to answer) not the same as true unlearning?

Show Answer
Output suppression trains the model to refuse questions about the target topic, but the knowledge remains encoded in the model's weights. This surface-level behavior can be bypassed through jailbreaking, indirect questioning, or fine-tuning the refusal behavior away. True unlearning removes the knowledge from the weights themselves, so it cannot be recovered by any prompting strategy or subsequent training.

Exercises

Exercise 29.7.1: Unlearning Motivations Conceptual

Describe the four main motivations for machine unlearning in LLMs (GDPR compliance, copyright removal, safety alignment, knowledge updates). For each, explain why retraining from scratch is impractical.

Answer Sketch

GDPR: removing one person's data requires filtering the entire training corpus and retraining, costing millions of dollars. Copyright: same issue for removing copyrighted works. Safety: removing dangerous knowledge (e.g., weapons synthesis) requires identifying and removing all related training examples. Knowledge updates: replacing outdated facts would require periodic full retraining. In all cases, retraining a 70B+ model costs $1M+ in compute and takes weeks, making it impractical for individual requests.

Exercise 29.7.2: Gradient Ascent Unlearning Conceptual

Explain how gradient ascent can be used for approximate unlearning. What is the intuition? What is the main risk of this approach?

Answer Sketch

Intuition: training maximizes the likelihood of the data (gradient descent on loss). Unlearning reverses this by increasing the loss on the data to forget (gradient ascent). The model becomes worse at predicting the specific sequences, effectively "forgetting" them. Main risk: catastrophic forgetting of nearby knowledge. Gradient ascent on a specific text may also degrade the model's ability on related topics, because knowledge is distributed across shared parameters. Careful tuning of the learning rate and the number of steps is essential to avoid collateral damage.

Exercise 29.7.3: Unlearning Verification Analysis

After applying an unlearning method, how do you verify that the target knowledge has actually been removed? Describe three verification approaches and their limitations.

Answer Sketch

(1) Direct probing: ask the model questions about the target knowledge. Limitation: the model may have learned to refuse without actually forgetting. (2) Membership inference: test whether the model can distinguish between training data and non-training data for the target text. Limitation: unreliable for small amounts of text. (3) Extraction attacks: attempt to extract the target text through prompting strategies. Limitation: new extraction techniques may be discovered later. None of these methods provide mathematical guarantees of forgetting, which is why approximate unlearning remains an active research area.

Exercise 29.7.4: Unlearning vs. Suppression Conceptual

Distinguish between true unlearning (the model no longer contains the knowledge) and output suppression (the model still contains the knowledge but refuses to output it). Why does this distinction matter legally and technically?

Answer Sketch

Suppression: the model can be "re-awakened" through jailbreaking or fine-tuning to output the suppressed knowledge. The information still exists in the weights. True unlearning: the information is genuinely absent from the model's parameters. Legally: GDPR's right to erasure arguably requires true deletion, not just suppression. Technically: suppressed knowledge can be recovered by adversaries, creating a false sense of compliance. Most current "unlearning" methods are closer to suppression, making them legally uncertain for compliance purposes.

Exercise 29.7.5: Unlearning Pipeline Design Coding

Design an end-to-end pipeline for handling a GDPR erasure request for an LLM system. Include: request intake, impact assessment, unlearning method selection, execution, verification, and documentation. What are the SLA considerations?

Answer Sketch

Pipeline: (1) Request intake: log the request with timestamp and data subject identifier. (2) Impact assessment: search training data for the subject's data, estimate the scope of removal needed. (3) Method selection: for small amounts, use gradient ascent; for large amounts, consider retraining on filtered data. (4) Execution: apply the chosen method with safeguards against collateral damage. (5) Verification: run probing and extraction tests. (6) Documentation: record all steps for the compliance audit trail. SLA: GDPR requires response within 30 days. Given the computational cost, organizations should maintain a batch processing schedule and communicate timelines transparently.

What Comes Next

In the next section, Section 47.3: Red Teaming Frameworks & LLM Security Testing, we explore practical frameworks for red teaming and security testing of LLM systems.

Further Reading

Core References

Bourtoule, L. et al. (2021). Machine Unlearning. IEEE S&P 2021. The seminal paper formalizing machine unlearning as the problem of efficiently removing the influence of specific training data points. Introduces the SISA (Sharded, Isolated, Sliced, Aggregated) training framework. Foundational reference for understanding the theoretical basis of unlearning.
Jang, J. et al. (2023). Knowledge Unlearning for Mitigating Language Models' Harms. ACL 2023. Proposes gradient ascent on target data as a practical unlearning method for LLMs, demonstrating removal of toxic and private information. Evaluates the tradeoff between forgetting quality and model utility preservation. Key reference for implementing approximate unlearning in practice.
Ilharco, G. et al. (2023). Editing Models with Task Arithmetic. ICLR 2023. Introduces task vectors, which are weight-space directions that encode task-specific behavior and can be negated to remove capabilities. Demonstrates that model behaviors can be composed and subtracted in weight space. Important for understanding the task vector approach to unlearning.
Eldan, R. & Russinovich, M. (2023). Who's Harry Potter? Approximate Unlearning in LLMs. Microsoft Research. Demonstrates approximate unlearning of the Harry Potter book series from an LLM using reinforced fine-tuning on alternative completions. Provides a practical case study of content-specific knowledge removal. Accessible introduction to approximate unlearning techniques.
Maini, P. et al. (2024). TOFU: A Task of Fictitious Unlearning for LLMs. Introduces a benchmark for evaluating LLM unlearning using fictitious author profiles, enabling controlled measurement of forgetting. Provides standardized evaluation metrics for comparing unlearning methods. Essential benchmark for teams developing or evaluating unlearning techniques.
Lynch, A. et al. (2024). Eight Methods to Evaluate Robust Unlearning in LLMs. Comprehensive evaluation of unlearning robustness, testing eight attack methods to determine whether supposedly unlearned knowledge can be recovered. Reveals that many unlearning techniques fail under adversarial probing. Critical reading for teams implementing or validating machine unlearning in production systems.