Part X: Frontiers
Chapter 34: Emerging Architectures & Scaling Frontiers

Mechanistic Interpretability at Scale

"Understanding a neural network one neuron at a time is like understanding a city one brick at a time. You need to find the roads."

— Frontier, Road Mapping AI Agent
Big Picture

Mechanistic interpretability aims to reverse-engineer the algorithms learned by neural networks. Rather than treating models as black boxes and studying their behavior from the outside, mechanistic interpretability opens the model and identifies the specific computational circuits responsible for specific behaviors. In 2024 and 2025, this field moved from toy models to frontier-scale systems, producing the first comprehensive feature maps of production language models. This section covers the core techniques, the scaling challenges, and the practical applications that are beginning to emerge.

Prerequisites

This section assumes familiarity with transformer architecture (attention heads, MLP layers, residual streams), basic linear algebra (basis vectors, projections), and the concept of neural network training. The AI safety material from Section 32.1 provides context for why interpretability matters.

1. The Superposition Hypothesis

Imagine trying to understand a city by interviewing one resident at a time. Each person plays multiple roles (parent, employee, neighbor, volunteer), and each role is shared across many people. You cannot understand the city's "transportation system" by finding the one person who is "the transportation neuron." Mechanistic interpretability faces exactly this challenge. The central challenge is superposition: neural networks represent far more features than they have dimensions. A model with a 4096-dimensional residual stream does not encode 4096 features; it may encode tens of thousands or millions of features, each represented as a direction in the high-dimensional space, with most features overlapping (sharing dimensions with other features).

This is possible because of the geometry of high-dimensional spaces. In 4096 dimensions, you can find an exponential number of approximately orthogonal directions. If the model only needs to represent a feature with moderate precision, it can pack many more features into the space than the number of dimensions, at the cost of interference between features. This is analogous to compressed sensing in signal processing.
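A quick numerical sketch of this packing argument (synthetic data, not taken from any model): random unit vectors in a 4096-dimensional space interfere only weakly with each other, which is what allows a model to superpose far more feature directions than it has dimensions.

```python
# Sample many random "feature" directions in a high-dimensional space and
# measure their pairwise interference (absolute cosine similarity).
import numpy as np

rng = np.random.default_rng(0)
d = 4096      # residual-stream dimension
n = 20_000    # far more candidate feature directions than dimensions

# Random unit vectors on the d-dimensional sphere
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Interference on a random sample of distinct pairs
idx = rng.integers(0, n, size=(5_000, 2))
idx = idx[idx[:, 0] != idx[:, 1]]
cos = np.abs(np.sum(V[idx[:, 0]] * V[idx[:, 1]], axis=1))

print(f"mean |cos| = {cos.mean():.4f}, max |cos| = {cos.max():.4f}")
# Typical interference is on the order of 1/sqrt(d) ≈ 0.016
```

Even with five times as many directions as dimensions, typical overlaps stay near $1/\sqrt{d}$, small enough that sparsely active features rarely interfere.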

Elhage et al. (2022) at Anthropic formalized the superposition hypothesis and demonstrated it in toy models. They showed that a small neural network trained on a sparse input distribution will learn to superpose many features into fewer dimensions, with the degree of superposition depending on the sparsity of feature activations. Features that are rarely active can be more heavily superposed because they are unlikely to interfere with each other simultaneously.
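The Elhage et al. setup can be sketched in a few lines. The following is a minimal, illustrative reimplementation with made-up hyperparameters (not the paper's): $n$ sparse features are squeezed through a $d < n$ bottleneck and reconstructed, and the trained map should beat a predict-zero baseline, which it can only do by superposing several features per dimension.

```python
# Toy model of superposition (after Elhage et al., 2022): compress 20 sparse
# features into 5 hidden dimensions and reconstruct them. All hyperparameters
# here are illustrative choices, not values from the paper.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_features, d_hidden, sparsity = 20, 5, 0.95  # most features inactive at once

W = nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

def sample_batch(batch: int) -> torch.Tensor:
    # Each feature is active with probability (1 - sparsity)
    x = torch.rand(batch, n_features)
    return x * (torch.rand(batch, n_features) > sparsity).float()

for step in range(2000):
    x = sample_batch(256)
    x_hat = torch.relu((x @ W.T) @ W + b)  # compress to 5 dims, reconstruct 20
    loss = (x - x_hat).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    x = sample_batch(4096)
    model_mse = (x - torch.relu((x @ W.T) @ W + b)).pow(2).mean().item()
    zero_mse = x.pow(2).mean().item()  # baseline: always predict zero

print(f"model MSE {model_mse:.4f} vs zero baseline {zero_mse:.4f}")
```

Inspecting the learned columns of $W$ after training typically shows multiple features sharing each hidden dimension, the signature of superposition.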

Polysemanticity

A direct consequence of superposition is polysemanticity: individual neurons respond to multiple, seemingly unrelated concepts. A single neuron might activate for both "academic citations" and "legal precedents" and "Bible verses" because all three features share a similar direction in activation space, and the neuron's activation axis happens to align with that shared direction.

Polysemanticity makes it nearly impossible to interpret neural networks one neuron at a time. The "grandmother cell" model of neural representation (one neuron per concept) is the exception, not the rule. Most neurons participate in encoding many features, and most features are distributed across many neurons. Understanding what a model computes requires identifying features as directions in activation space, not as individual neurons.

2. Sparse Autoencoders for Feature Discovery

Sparse Autoencoders (SAEs) have become the primary tool for extracting interpretable features from neural network activations. The core idea is simple: train an autoencoder with a much larger hidden dimension than the input, with a sparsity penalty that encourages most hidden units to be zero for any given input. The non-zero hidden units then correspond to the active features for that input.

Formally, given an activation vector $\mathbf{x} \in \mathbb{R}^d$ from a transformer layer, the SAE computes:

$$\mathbf{f} = \text{ReLU}(W_{\text{enc}} \mathbf{x} + \mathbf{b}_{\text{enc}}) \in \mathbb{R}^{m}$$ $$\hat{\mathbf{x}} = W_{\text{dec}} \mathbf{f} + \mathbf{b}_{\text{dec}} \in \mathbb{R}^{d}$$

where $m \gg d$ (the hidden dimension is much larger than the input dimension, often 8x to 64x) and the training loss combines reconstruction accuracy with a sparsity penalty:

$$\mathcal{L} = \| \mathbf{x} - \hat{\mathbf{x}} \|_2^2 + \lambda \| \mathbf{f} \|_1$$

The L1 penalty on $\mathbf{f}$ encourages sparse activations: for any given input, only a small fraction of the $m$ features are active. Each active feature corresponds to a specific direction in the original activation space (the corresponding column of $W_{\text{dec}}$), and the magnitude of the feature activation indicates how strongly that feature is present in the input.

The following code demonstrates training a basic SAE on transformer activations.

# Train a sparse autoencoder (SAE) on transformer hidden states to discover
# interpretable features. The encoder projects to a high-dimensional space
# with ReLU sparsity; the decoder reconstructs the original activation.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder for extracting interpretable features
    from transformer hidden states.

    Args:
        d_model: dimension of the transformer activations
        n_features: number of SAE features (typically 8x to 64x d_model)
        sparsity_coeff: weight of the L1 sparsity penalty
    """
    def __init__(self, d_model: int, n_features: int, sparsity_coeff: float = 1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=True)
        self.sparsity_coeff = sparsity_coeff

        # Initialize decoder weights to unit norm (each column is a feature direction)
        with torch.no_grad():
            self.decoder.weight.data = nn.functional.normalize(
                self.decoder.weight.data, dim=0
            )

    def forward(self, x: torch.Tensor) -> dict:
        # Encode: project to high-dimensional sparse space
        features = torch.relu(self.encoder(x))

        # Decode: reconstruct the original activation
        reconstructed = self.decoder(features)

        # Losses
        reconstruction_loss = (x - reconstructed).pow(2).mean()
        sparsity_loss = features.abs().mean()
        total_loss = reconstruction_loss + self.sparsity_coeff * sparsity_loss

        return {
            "features": features,
            "reconstructed": reconstructed,
            "loss": total_loss,
            "reconstruction_loss": reconstruction_loss.item(),
            "sparsity_loss": sparsity_loss.item(),
            "n_active": (features > 0).float().sum(dim=-1).mean().item(),
        }

# Example: train an SAE on synthetic "activations"
d_model = 768
n_features = 768 * 16  # 16x expansion
sae = SparseAutoencoder(d_model, n_features, sparsity_coeff=5e-4)
optimizer = torch.optim.Adam(sae.parameters(), lr=3e-4)

# Simulate training on batches of transformer activations
for step in range(1000):
    # In practice, these come from running text through a transformer
    fake_activations = torch.randn(64, d_model)

    result = sae(fake_activations)
    result["loss"].backward()
    optimizer.step()
    optimizer.zero_grad()

    if (step + 1) % 200 == 0:
        print(
            f"Step {step+1}: "
            f"recon={result['reconstruction_loss']:.4f}, "
            f"sparsity={result['sparsity_loss']:.4f}, "
            f"active_features={result['n_active']:.1f}/{n_features}"
        )
Step 200: recon=0.8143, sparsity=0.0312, active_features=847.3/12288
Step 400: recon=0.5271, sparsity=0.0189, active_features=312.5/12288
Step 600: recon=0.3847, sparsity=0.0124, active_features=128.4/12288
Step 800: recon=0.2916, sparsity=0.0087, active_features=62.1/12288
Step 1000: recon=0.2341, sparsity=0.0068, active_features=41.7/12288
Code Fragment 34.7.1: Training a sparse autoencoder (SAE) on transformer hidden states to discover interpretable features
Real-World Scenario: SAE Dimensions in Practice

Who: A mechanistic interpretability researcher at an AI safety organization training sparse autoencoders on a small internal language model with a 768-dimensional residual stream.

Situation: The researcher needed to choose the SAE expansion factor and estimate the compute and memory requirements before requesting GPU allocation from the cluster team.

Problem: With the model's hidden dimension $d = 768$ and a 16x expansion factor, the SAE would have $m = 12{,}288$ features. The encoder $W_{\text{enc}}$ maps $768 \to 12{,}288$ and the decoder $W_{\text{dec}}$ maps $12{,}288 \to 768$, totaling $2 \times 768 \times 12{,}288 + 12{,}288 + 768 = 18{,}887{,}424$ parameters (roughly 19M). That was manageable. But scaling to the organization's frontier model with $d = 16{,}384$ and 32x expansion would yield $m = 524{,}288$ features and over 17 billion SAE parameters, requiring significant cluster resources.

Decision: The researcher started with the small model at 16x expansion to validate the training pipeline and feature quality. The sparsity penalty was tuned to produce an average of 40 active features per input, making each activation roughly $\frac{40}{12{,}288} \approx 0.3\%$ dense. Only after confirming that the discovered features were interpretable would the team request the larger allocation for the frontier model.

Result: The small model's SAE trained in 4 hours on a single A100, discovered over 200 clearly interpretable features (language-specific, topic-specific, and behavioral), and validated the training pipeline. The team then secured approval for the frontier-scale run based on the concrete quality evidence from the smaller experiment.

Lesson: SAE parameter counts scale quadratically with model hidden dimension and linearly with expansion factor. Always validate on a smaller model first to confirm feature quality before committing the substantial compute required for frontier-scale sparse autoencoders.
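The scenario's arithmetic generalizes to a one-line sizing helper (a back-of-the-envelope sketch, not a library API):

```python
# Back-of-the-envelope SAE sizing: encoder weights, decoder weights, and
# both bias vectors for a linear SAE with m = expansion * d_model features.
def sae_param_count(d_model: int, expansion: int) -> int:
    """Total parameters of a linear SAE with hidden width m = expansion * d_model."""
    m = expansion * d_model
    return 2 * d_model * m + m + d_model  # W_enc + W_dec + b_enc + b_dec

print(f"{sae_param_count(768, 16):,}")     # 18,887,424 (~19M)
print(f"{sae_param_count(16_384, 32):,}")  # 17,180,409,856 (~17.2B)
```

The quadratic dependence on $d$ is visible directly: doubling the hidden dimension at a fixed expansion factor quadruples the weight count.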

Library Shortcut: TransformerLens for Feature Extraction

Rather than building an SAE from scratch, TransformerLens lets you extract and inspect activations from any GPT-2 style model in a few lines:

# pip install transformer_lens
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
logits, cache = model.run_with_cache("The Eiffel Tower is in")
# Inspect residual stream at layer 8, final token position
resid = cache["resid_post", 8][0, -1] # shape: (768,)
print(f"Residual norm: {resid.norm():.2f}")
Code Fragment 34.7.2: Extracting residual-stream activations with TransformerLens

For training SAEs on cached activations, see SAELens, which provides pre-trained SAEs for GPT-2 and Gemma models with feature dashboards.

Scaling SAEs to Frontier Models

In 2024, both Anthropic and OpenAI published results from training SAEs on their frontier models. Anthropic's work on Claude 3 Sonnet (Templeton et al., 2024) extracted millions of interpretable features, including features for specific concepts (Golden Gate Bridge, computer code, deceptive reasoning), behaviors (refusing harmful requests, being sycophantic), and languages (features that activate specifically for French, Mandarin, or Python code).

The scaling challenge is substantial. Training a SAE on a frontier model requires running billions of tokens through the model, storing the activations, and training the SAE on this dataset. The SAE itself can have billions of parameters (for a 16x expansion of a model with a 16384-dimensional residual stream, the SAE has over 4 billion parameters). The compute cost is a significant fraction of the cost of training the original model.

Despite the cost, the features discovered by large-scale SAEs are remarkably interpretable. Anthropic's team found that they could steer Claude's behavior by amplifying or suppressing specific features. Amplifying the "Golden Gate Bridge" feature caused the model to mention the bridge in every response. Suppressing "deceptive reasoning" features reduced the model's tendency to produce misleading outputs. This level of granular control was not previously possible.
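Feature steering of the kind described above can be sketched as follows. The SAE weights here are random stand-ins for a trained SAE (biases are omitted, and feature index 4711 is arbitrary), so this only illustrates the mechanics: clamp one feature in the encoded representation, decode, and splice the change back into the activation.

```python
# Sketch of SAE feature steering: amplify one feature while preserving the
# SAE's reconstruction error, so the edit moves the activation along exactly
# one decoder direction. Weights are random stand-ins for a trained SAE.
import torch

torch.manual_seed(0)
d_model, n_features = 768, 12_288
W_enc = torch.randn(n_features, d_model) / d_model**0.5
W_dec = torch.randn(d_model, n_features) / n_features**0.5

def steer(x: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
    """Clamp one SAE feature and splice the edit back into the activation."""
    f = torch.relu(W_enc @ x)        # encode: (n_features,)
    f_clamped = f.clone()
    f_clamped[feature_idx] = value   # amplify (or zero out) one feature
    # Keep x's reconstruction error so only the clamped feature changes
    return x - W_dec @ f + W_dec @ f_clamped

x = torch.randn(d_model)
x_steered = steer(x, feature_idx=4711, value=10.0)

# The edit is a move along a single decoder column (one feature direction)
delta = x_steered - x
```

Because the reconstruction error term cancels, `delta` equals the change in one feature's activation times its decoder direction, which is what makes SAE-based steering so surgical.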

3. Circuit Analysis

While SAEs identify the features (the "what"), circuit analysis identifies the computational pathways (the "how"). A circuit is a subgraph of the model's computational graph that implements a specific behavior: for example, the circuit responsible for indirect object identification in sentences like "When Mary and John went to the store, John gave a drink to ___."

The Circuit Discovery Pipeline

Identifying circuits in transformer models typically follows a multi-step process:

  1. Task specification. Define a narrow behavior to investigate (e.g., "the model correctly predicts the indirect object in ditransitive sentences").
  2. Activation patching. Run the model on a clean input and a corrupted input, then systematically replace activations in the corrupted run with activations from the clean run. If replacing the output of a specific attention head restores correct behavior, that head is part of the circuit.
  3. Path patching. Refine the analysis by patching individual connections between components (e.g., the output of attention head 5 in layer 3 as it flows into MLP layer 4). This identifies the specific information pathways, not just the components.
  4. Interpretation. For each component in the circuit, analyze what it computes using techniques like attention pattern visualization, probing classifiers, and SAE feature analysis.
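Step 2 of the pipeline can be demonstrated end to end on a toy network using plain PyTorch forward hooks; real circuit work applies the same pattern per attention head, usually via a library like TransformerLens. The model and inputs below are synthetic.

```python
# Minimal activation patching: cache a component's activation from a clean
# run, then substitute it into a corrupted run via a forward hook.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

clean_input = torch.randn(1, 8)
corrupted_input = torch.randn(1, 8)

# 1) Cache the clean run's activation at the component of interest (layer 0)
cached = {}
h = model[0].register_forward_hook(
    lambda mod, inp, out: cached.__setitem__("act", out.detach())
)
clean_out = model(clean_input)
h.remove()

# 2) Patch the cached clean activation into the corrupted run
# (a forward hook that returns a tensor replaces the module's output)
h = model[0].register_forward_hook(lambda mod, inp, out: cached["act"])
patched_out = model(corrupted_input)
h.remove()

corrupted_out = model(corrupted_input)
# Here patching layer 0 restores the clean output exactly, because layer 0
# is the only input-dependent component before the readout; in a transformer,
# partial restoration is evidence that the patched head is in the circuit.
print(torch.allclose(patched_out, clean_out))
```

In practice the patch is scored with a continuous metric (e.g., logit difference between the correct and incorrect answer) rather than exact equality.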

Wang et al. (2023) applied this methodology to identify the "Indirect Object Identification (IOI) circuit" in GPT-2 Small, finding a circuit of 26 attention heads across multiple layers that collaborate to solve this task. The circuit includes "duplicate token heads" (detecting repeated names), "S-inhibition heads" (suppressing the subject), and "name mover heads" (copying the correct name to the output).

Library Shortcut: nnsight for Activation Patching

For activation patching on larger models (Llama, Mistral, Gemma), nnsight provides remote tracing without downloading full model weights:

# pip install nnsight
from nnsight import LanguageModel

model = LanguageModel("openai-community/gpt2", device_map="auto")
with model.trace("The Eiffel Tower is in") as tracer:
 # Access hidden states at layer 6
 hidden = model.transformer.h[6].output[0].save()
print(f"Layer 6 output shape: {hidden.value.shape}")
# torch.Size([batch, seq_len, 768])
Code Fragment 34.7.3: Remote activation access with nnsight

Limitations of Circuit Analysis

Circuit analysis has produced impressive results on narrow tasks, but scaling it to the full breadth of a frontier model's capabilities faces fundamental challenges. First, most model behaviors involve diffuse computation distributed across many components, not clean, modular circuits. Second, the number of possible circuits is combinatorially large, making exhaustive search intractable. Third, circuits may overlap and interact, so understanding one circuit in isolation may not predict its behavior when other circuits are simultaneously active.

Key Insight

Mechanistic interpretability is moving from "can we understand toy models?" to "can we understand production models well enough to make safety-relevant claims?" The field has demonstrated that individual features and small circuits can be identified and understood even in frontier models. The open question is whether this understanding can be made comprehensive enough to support safety arguments. For example, can we identify all the circuits involved in deceptive behavior and verify that they are inactive? This "comprehensive safety case via interpretability" is the long-term goal, and it remains far from achieved.

4. Practical Applications

While the theoretical ambitions of mechanistic interpretability are grand, several practical applications are already emerging.

Model Debugging

When a model produces incorrect or unexpected outputs, SAE features can help diagnose why. By examining which features are active during the problematic generation, developers can identify whether the model is attending to the wrong part of the input, activating an irrelevant feature, or suppressing a relevant one. This is analogous to using a debugger in software engineering: rather than guessing what went wrong, you can inspect the internal state.
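A minimal sketch of this workflow, with random stand-in feature activations (in practice they come from applying a trained SAE to the residual stream at the problematic generation step): rank the active SAE features and inspect the top few.

```python
# Debugging sketch: list the most active SAE features at a problematic
# position. Activations here are synthetic stand-ins.
import torch

torch.manual_seed(0)
n_features = 12_288
feature_acts = torch.relu(torch.randn(n_features))  # stand-in SAE activations

values, indices = feature_acts.topk(5)
for rank, (idx, val) in enumerate(zip(indices.tolist(), values.tolist()), 1):
    # In practice, look up each feature in a dashboard or auto-interp label set
    print(f"{rank}. feature #{idx}: activation {val:.2f}")
```

The payoff comes from cross-referencing the top features against a labeled feature dashboard: an unexpectedly active feature is a concrete lead, much like a suspicious stack frame in a debugger.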

Safety Auditing

Interpretability tools enable a new form of model safety auditing. Rather than relying solely on behavioral testing (which can miss failure modes not covered by the test set), auditors can examine the model's internal features for concerning patterns. Does the model have features related to deception? Manipulation? Weapon design? The presence of such features does not necessarily indicate a safety risk (the model may have learned them from training data without using them in problematic ways), but their absence provides evidence that specific risks are lower.

Targeted Model Editing

Once a problematic feature or circuit has been identified, it can potentially be modified without retraining the entire model. Techniques like activation steering (adding or subtracting feature directions from the residual stream during inference) and targeted fine-tuning (training only the components of a specific circuit) allow for surgical modifications. This is more efficient and predictable than full fine-tuning, which may have unintended side effects on other capabilities.
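One common recipe for activation steering derives the steering direction from contrastive prompt sets (the difference of mean activations) and adds a scaled copy during inference. The sketch below uses synthetic activations; the "polite"/"rude" labels are purely illustrative.

```python
# Activation steering sketch: build a direction from contrastive activation
# means, then add it to the residual stream at inference time.
import torch

torch.manual_seed(0)
d_model = 768
# Stand-ins for cached activations on two contrastive prompt sets
pos_acts = torch.randn(32, d_model) + 0.5   # e.g. "polite" prompts
neg_acts = torch.randn(32, d_model) - 0.5   # e.g. "rude" prompts

steering_vec = pos_acts.mean(0) - neg_acts.mean(0)
steering_vec = steering_vec / steering_vec.norm()  # unit direction

def apply_steering(resid: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Add alpha units of the steering direction to a residual-stream vector."""
    return resid + alpha * steering_vec

x = torch.randn(d_model)
x_steered = apply_steering(x)
print(f"shift along direction: {((x_steered - x) @ steering_vec).item():.2f}")
```

The scale `alpha` is the main tuning knob: too small and behavior is unchanged, too large and fluency degrades, so it is usually swept on a validation set.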

Understanding Training Dynamics

SAE features can be tracked across training checkpoints to understand how the model's internal representations evolve during training. This reveals phenomena like feature splitting (a single feature early in training splits into more specific features later), phase transitions (certain features appear suddenly at specific training steps), and feature death (features that were active early in training become permanently inactive). These observations inform training methodology and may help predict and prevent training instabilities.
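Feature tracking across checkpoints can be approximated by matching decoder directions between SAEs trained on consecutive checkpoints using cosine similarity. The sketch below simulates two checkpoints with random directions plus a small drift; unmatched directions would indicate feature splitting or death.

```python
# Match each SAE decoder direction at checkpoint t to its nearest direction
# at checkpoint t+1. Directions are random stand-ins with simulated drift.
import torch

torch.manual_seed(0)
d_model, n_features = 64, 512
W_dec_t = torch.nn.functional.normalize(torch.randn(d_model, n_features), dim=0)
# Simulated next checkpoint: every direction drifts slightly
W_dec_t1 = torch.nn.functional.normalize(
    W_dec_t + 0.02 * torch.randn(d_model, n_features), dim=0
)

# Cosine similarity between every feature at t and every feature at t+1
sims = W_dec_t.T @ W_dec_t1            # (n_features, n_features)
best_sim, best_match = sims.max(dim=1)

stable = (best_sim > 0.9).float().mean().item()
print(f"features with a stable match: {stable:.0%}")
```

In real checkpoint studies, features with no high-similarity match at the next checkpoint are the interesting cases: candidates for splitting, sudden appearance, or death.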

5. The Scaling Challenge

The fundamental challenge facing mechanistic interpretability is scaling. Current techniques work well on models with hundreds of millions of parameters. Applying them to models with hundreds of billions of parameters requires innovations in both methodology and compute.

Key scaling challenges include:

  - Compute and storage: training SAEs on frontier models requires running billions of tokens through the model and caching the resulting activations, at a cost that is a significant fraction of the original training run.
  - SAE size: expansion factors of 8x to 64x on large residual streams produce SAEs with billions of parameters of their own.
  - Feature interpretation: with millions of discovered features, manual labeling is infeasible; automated interpretation (using language models to explain features) is necessary but imperfect.
  - Circuit search: the number of candidate circuits grows combinatorially with model size, so manual circuit analysis must give way to automated discovery methods such as ACDC.

6. Connections to Other Chapters

Mechanistic interpretability intersects with several other topics covered in this book:

  - AI safety (Section 32.1): interpretability is one of the few approaches that could support safety cases grounded in a model's internals rather than its behavior alone.
  - Agentic systems (Section 34.8): as models become more autonomous, understanding the internal features and circuits that drive goal-directed behavior becomes correspondingly more pressing.

Tip

You do not need to be a mechanistic interpretability researcher to benefit from the field's outputs. Publicly released SAE feature dashboards (such as Anthropic's Neuronpedia and OpenAI's feature visualization tools) allow practitioners to explore what their models have learned. If your model produces unexpected behavior on a specific input, examining the active SAE features for that input can provide a starting point for diagnosis, even without deep expertise in the underlying methodology.

Exercise 34.7.1: SAE Feature Interpretation

A sparse autoencoder trained on the layer-8 activations of a transformer with a 4,096-dimensional hidden state has 32,768 features (an 8x expansion). When you feed the sentence "The Eiffel Tower is in Paris" through the model, feature #12,047 activates strongly on the token "Paris" with a value of 4.2, while most other features are near zero.

  1. What does it mean for this feature to activate on "Paris"? What hypothesis would you form about what feature #12,047 represents?
  2. Describe two experiments you would run to test your hypothesis.
  3. Why does the SAE use 8x the hidden dimension? What would happen if you used only 1x (same size as the original hidden state)?
Show Answer

  1. A high activation means the direction in activation space corresponding to feature #12,047 is strongly present in the representation of "Paris." An initial hypothesis might be that this feature encodes "European capital cities" or "famous landmarks' locations."
  2. First, test other European capitals: feed sentences like "Big Ben is in London" and "The Colosseum is in Rome" and check whether #12,047 activates on the city tokens. Second, test non-capital cities or non-geographic uses of "Paris" (e.g., "Paris Hilton") to see whether the feature is location-specific or name-specific.
  3. At 1x, the autoencoder cannot decompose superposed features because the model stores more concepts than it has dimensions (the superposition hypothesis). The 8x expansion gives the SAE enough capacity to disentangle individual features. With 1x, features would remain entangled, and the autoencoder would learn the identity function rather than meaningful decompositions.

Exercise 34.7.2: From Interpretability to Debugging

Your production model occasionally generates incorrect country names when answering geography questions. You suspect a specific attention head is responsible. Using the activation patching technique described in this section, outline a step-by-step plan to identify and confirm which head causes the error. What would you do after identifying the faulty circuit?

Show Answer

  1. Collect a set of prompts where the model answers correctly and a matched set where it answers incorrectly.
  2. For each attention head, patch its output from a correct-answer run into the incorrect-answer run and measure whether the output changes to the correct answer.
  3. Heads where patching fixes the error are causally involved in the failure.
  4. Examine the attention patterns of the identified heads to understand what they attend to (e.g., are they attending to the wrong entity?).

After identification, you could fine-tune with targeted examples that strengthen the correct circuit, add a retrieval layer that provides ground-truth geography data, or use representation engineering to steer the head's output toward the correct direction.

Key Takeaways

  - Superposition means models encode far more features than they have dimensions, making individual neurons polysemantic and largely uninterpretable on their own.
  - Sparse autoencoders decompose activations into interpretable features by trading a large expansion factor for sparsity, and they now scale to frontier models.
  - Circuit analysis (activation and path patching) identifies the computational pathways behind specific behaviors, though most behaviors are diffuse rather than cleanly modular.
  - Practical applications are already emerging: model debugging, safety auditing, targeted editing, and tracking training dynamics.
  - The open challenge is comprehensiveness: understanding enough of a frontier model's computation to support safety-relevant claims.

What Comes Next

In the next section, Section 34.8: The Nature of Agency, we examine the question of when a model becomes an agent, exploring definitional frameworks, degrees of autonomy, and the safety implications of increasingly agentic AI systems.

References & Further Reading
Superposition & Feature Discovery

Elhage, N., Hume, T., Olsson, C., et al. (2022). "Toy Models of Superposition." Transformer Circuits Thread, Anthropic.

Provides the theoretical framework for understanding superposition, where neural networks represent more features than they have dimensions. This foundational work explains why individual neurons are often uninterpretable.


Bricken, T., Templeton, A., Batson, J., et al. (2023). "Towards Monosemanticity: Decomposing Language Models with Dictionary Learning." Transformer Circuits Thread, Anthropic.

Demonstrates that sparse autoencoders can decompose neural network activations into interpretable, monosemantic features. The paper that launched the current wave of dictionary learning approaches to interpretability.


Templeton, A., Conerly, T., Marcus, J., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Transformer Circuits Thread, Anthropic.

Scales sparse autoencoders to a production frontier model, finding millions of interpretable features including abstract concepts and multilingual representations. Demonstrates that mechanistic interpretability works at realistic scale.


Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. (2023). "Sparse Autoencoders Find Highly Interpretable Features in Language Models." ICLR 2024.

Independently validates the sparse autoencoder approach and provides systematic evaluation of feature quality. Offers complementary evidence to the Anthropic line of work on dictionary learning.

Circuit Analysis

Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." ICLR 2023.

The most detailed circuit-level analysis of a real language model behavior, tracing how GPT-2 resolves indirect object references. Sets the gold standard for what a complete mechanistic explanation looks like.


Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). "Towards Automated Circuit Discovery for Mechanistic Interpretability." NeurIPS 2023.

Introduces ACDC, an algorithm for automatically identifying computational circuits in neural networks. Essential reading for understanding how circuit analysis can scale beyond manual investigation.

Automated Interpretability

Bills, S., Cammarata, N., Mossing, D., et al. (2023). "Language Models Can Explain Neurons in Language Models." OpenAI Blog.

Pioneers the use of LLMs to automatically generate and score natural language explanations of individual neurons. Demonstrates the potential (and current limitations) of using AI to interpret AI.
