Interpretability Tooling, Evaluation, and LLM-Assisted Explanation

Section 10.5

"A method without a tool is a lecture; a tool without a method is a toy. Interpretability needs both."

ProbeProbe, Tool-Smith AI Agent
Big Picture

Section 10.4 introduced the attribution methods (attention rollout, gradient-weighted attention, LRP, perturbation, integrated gradients). This continuation answers three operational questions: which open-source tools run those methods at scale, how to evaluate the explanations they produce (faithfulness vs. plausibility), and how LLMs themselves are becoming explanation assistants that narrate the outputs of other models. Together these three layers close the loop from "I have a method" to "I have an explanation I can ship and audit".

Prerequisites

This section continues from Section 10.4: Explaining Transformers, which introduced the core attribution methods that the tools below operationalize. Familiarity with attention analysis and probing from Section 10.1 is also assumed.

Figure 10.4b.1: The 2026 interpretability tool stack mapped to workflow stage. Neuronpedia (an open platform that hosts Gemma Scope and GPT-2 small features) is the no-code entry point for hypothesis generation. TransformerLens (Neel Nanda et al.) provides hook access to Q, K, V, attention patterns, and the residual stream on GPT-2, Pythia, Llama, Gemma, and Mistral; it is the canonical tool for circuit analysis and activation patching. SAELens (Joseph Bloom) trains and loads sparse autoencoders, exports to Neuronpedia, and answers "what features fire here?". nnsight and nnterp wrap Hugging Face PyTorch models to enable probing and logit-lens experiments without porting to TransformerLens. Most production workflows combine several of these tools.

10.5.1 Interpretability Tools Ecosystem (2025)

The interpretability research community has built a rich ecosystem of open-source tools over the past three years. Choosing the right tool depends on your goal: are you doing mechanistic circuit analysis, exploring SAE features, or building a production explanation pipeline? This section surveys the major tools and helps you match tools to use cases.

Fun Fact: Brain Scans for Neural Nets

A neurologist who suspects a stroke does not lecture the patient about brain anatomy; she orders a scan and reads what lights up. Interpretability tooling for LLMs is the same workflow with different equipment: activation patching is the scan, sparse autoencoders are the contrast dye, and an LLM-as-explainer is a junior radiologist drafting the report. The patient cannot tell you what they were thinking; the scan can.

Why does the tooling matter? Interpretability research is only as reproducible and accessible as its tooling. A brilliant mechanistic finding that requires custom infrastructure to replicate has limited impact. The tools listed below have lowered the barrier to entry, enabling researchers and practitioners to run experiments that previously required months of infrastructure work.

Table 10.5.2: Tool Comparison (as of 2026).
Tool | Primary Use | Key Features | Best For
TransformerLens | Mechanistic interpretability | Full hook access at every sub-computation (Q, K, V, attention patterns, residual stream); built-in caching; direct logit attribution | Detailed circuit analysis on supported models (GPT-2, Pythia, Llama, Gemma, Mistral)
SAELens | SAE training and analysis | Train SAEs on any TransformerLens model; load pretrained Gemma Scope SAEs; feature dashboard generation; integration with Neuronpedia | Training custom SAEs, loading Gemma Scope, feature-level analysis
Neuronpedia | Feature browsing and search | Web-based feature explorer; auto-generated descriptions; activation histograms; community annotations; cross-model comparisons | Non-code exploration of SAE features; sharing and discussing findings
nnsight | Model intervention | Wraps any PyTorch model; proxy-based lazy evaluation; remote execution support; familiar PyTorch API | Quick experiments on any architecture, including models not supported by TransformerLens
nnterp | Neural network interpretation | Probing, logit lens, representation analysis; lightweight API; works with Hugging Face models directly | Probing experiments and logit lens analysis without TransformerLens overhead
Key Insight

Tool selection depends on your interpretability workflow stage. For hypothesis generation (browsing features, visualizing attention), start with Neuronpedia and standard Hugging Face tools. For hypothesis testing (activation patching, circuit tracing), use TransformerLens or nnsight. For SAE training and feature analysis, use SAELens. For lightweight probing and logit lens experiments, nnterp provides a lower-overhead alternative. Many researchers combine multiple tools: SAELens for training SAEs, TransformerLens for circuit analysis, and Neuronpedia for browsing results.

# Same task in three frameworks: extract activations from layer 5 of GPT-2.
# Each library trades off ease of use for control over the model internals.

# (A) Plain HuggingFace transformers + a forward hook
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tok = AutoTokenizer.from_pretrained("gpt2")
model_hf = AutoModelForCausalLM.from_pretrained("gpt2")
acts_hf = {}
def hook(module, inputs, output):
    acts_hf["layer5"] = output[0]
model_hf.transformer.h[5].register_forward_hook(hook)
model_hf(**tok("The Eiffel Tower is in", return_tensors="pt"))
print("HF activations:", acts_hf["layer5"].shape)

# (B) nnsight: pause execution mid-forward and pull values declaratively
from nnsight import LanguageModel
model_ns = LanguageModel("gpt2")
with model_ns.trace("The Eiffel Tower is in"):
    layer5_acts = model_ns.transformer.h[5].output[0].save()
print("nnsight activations:", layer5_acts.shape)

# (C) transformer_lens: built-in HookedTransformer with named hook points
from transformer_lens import HookedTransformer
model_tl = HookedTransformer.from_pretrained("gpt2")
_, cache = model_tl.run_with_cache("The Eiffel Tower is in")
print("transformer_lens activations:", cache["blocks.5.hook_resid_post"].shape)
# All three capture the layer-5 residual stream. Shapes and values are not identical:
# TransformerLens prepends a BOS token and folds/centers weights by default.
Code Fragment 10.5.1: The same interpretability task (accessing layer 5 activations) in three different tools. A plain Hugging Face forward hook needs no extra library, nnsight wraps any model and exposes intermediate values declaratively, and TransformerLens provides the deepest access through named hook points.

10.5.1.1 Production XAI Libraries: Captum, LIME, and BertViz

The tools above (TransformerLens, SAELens, nnsight) serve the mechanistic interpretability community, where the goal is understanding internal model computations. A complementary set of tools addresses the production explainability problem: generating human-readable explanations of individual predictions for end users, auditors, or regulatory compliance. These libraries treat the model as a function (sometimes a black box) and explain its input-output behavior rather than its internal circuits.

Captum: Meta's Attribution Toolkit

Captum is Meta's comprehensive model interpretability library for PyTorch. It implements over a dozen attribution methods under a unified API, making it straightforward to compare different explanation approaches on the same prediction. For transformer models, the most commonly used methods are Layer Integrated Gradients (attributing to the embedding layer), Layer Gradient x Activation, and Layer Conductance (which measures the importance of individual neurons in a specific layer).

Captum's strength is its breadth: it covers gradient-based methods (Integrated Gradients, DeepLift, GradientSHAP), perturbation-based methods (Feature Ablation, Shapley Value Sampling, LIME via the Lime wrapper), and layer-level methods (Layer Conductance, Internal Influence). This means you can compare multiple explanation strategies on the same model without switching libraries.

Note: Captum's algorithm zoo

The full Captum attribution catalog spans more than twenty methods, organized into three families: primary attribution (Integrated Gradients, Saliency, DeepLift, DeepLiftShap, GradientShap, InputXGradient, GuidedBackprop, GuidedGradCam, Deconvolution, Feature Ablation, Feature Permutation, Occlusion, Shapley Value Sampling, Lime, KernelShap), layer attribution (LayerConductance, LayerIntegratedGradients, LayerGradientXActivation, LayerGradCam, LayerDeepLift, LayerActivation, InternalInfluence), and neuron attribution (NeuronConductance, NeuronGradient, NeuronIntegratedGradients). Most methods share the same .attribute(inputs, ...) API, so swapping methods is usually a one-line change. The official Captum site keeps a visual algorithm zoo chart that maps each method to its theoretical family and recommended use case; consult it before picking a method for a novel modality.

Captum on image classification (vision)

Captum is not text-only. Applied to an image classifier (a ResNet-50 or ViT, for example), Integrated Gradients produces a per-pixel saliency map for any chosen target class. The recipe mirrors the text case: wrap the model's forward pass, choose a baseline (typically a zero or blurred image), pick a target class index, and call .attribute(). Captum then returns a tensor of the same shape as the input image, and captum.attr.visualization.visualize_image_attr overlays the saliency map on the original image. In brief:

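from captum.attr import IntegratedGradients, visualization as viz
# Assumes these already exist: resnet50 (model in eval mode), image_tensor (1 x 3 x H x W),
# predicted_class (int), and original_image (H x W x 3 NumPy array).
ig = IntegratedGradients(resnet50)
attr = ig.attribute(image_tensor, target=predicted_class, n_steps=50)  # default baseline: all zeros
viz.visualize_image_attr(attr.squeeze(0).detach().cpu().permute(1, 2, 0).numpy(),
                          original_image, method="heat_map", sign="positive")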
Code Fragment 10.5.2a: Captum Integrated Gradients on a vision classifier. Same API as the text case; only the baseline (a zero image instead of [PAD] tokens) and the visualization helper change. Useful for sanity-checking whether the model relies on object features or shortcuts like watermarks and backgrounds.
# Comprehensive Captum attribution for a transformer classifier
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from captum.attr import (
    LayerIntegratedGradients,
    LayerGradientXActivation,
    LayerConductance,
    visualization as viz,
    )
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
text = "The movie was surprisingly entertaining despite a weak script"
inputs = tokenizer(text, return_tensors="pt")
input_ids = inputs["input_ids"]
baseline_ids = torch.zeros_like(input_ids) # PAD token baseline
# Wrap the forward function for Captum
def forward_func(input_ids):
    outputs = model(input_ids)
    return outputs.logits[:, 1] # positive sentiment logit
# Method 1: Layer Integrated Gradients (most common for transformers)
lig = LayerIntegratedGradients(forward_func, model.distilbert.embeddings)
attrs_ig, delta = lig.attribute(
    input_ids, baselines=baseline_ids,
    n_steps=50, return_convergence_delta=True,
    )
# Convergence delta should be small (< 0.05); large values indicate
# that n_steps is too low for accurate integration.
# Method 2: Gradient x Activation (faster, less theoretically grounded)
lga = LayerGradientXActivation(forward_func, model.distilbert.embeddings)
attrs_gxa = lga.attribute(input_ids)
# Method 3: Layer Conductance (neuron-level importance in a specific layer)
lc = LayerConductance(forward_func, model.distilbert.transformer.layer[3])
attrs_cond = lc.attribute(input_ids, baselines=baseline_ids, n_steps=20)
# Summarize per-token attributions
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
attrs_sum = attrs_ig.sum(dim=-1).squeeze(0).detach().numpy()
print("Integrated Gradients attribution per token:")
for tok, score in zip(tokens, attrs_sum):
    bar = "#" * int(min(abs(score) * 100, 40))  # one '#' per 0.01 of attribution
    sign = "+" if score > 0 else "-"
    print(f" {tok:20s} {sign}{bar:40s} {score:+.4f}")
Output:
Integrated Gradients attribution per token:
 [CLS]         -                  -0.0012
 the           +###               +0.0341
 movie         +#####             +0.0523
 was           +##                +0.0187
 surprisingly  +###########       +0.1142
 entertaining  +################  +0.1689
 despite       -####              -0.0412
 a             -#                 -0.0098
 weak          -########          -0.0834
 script        -###               -0.0301
 [SEP]         -                  -0.0021
Code Fragment 10.5.2b: Using Captum's three most popular attribution methods on a sentiment classifier. Layer Integrated Gradients provides the most theoretically grounded attributions; Gradient x Activation is faster for exploratory work; Layer Conductance reveals which neurons in a specific layer contribute most.
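
The convergence delta mentioned in the comments above comes from the completeness property of Integrated Gradients: the attributions should sum to the difference between the model output at the input and at the baseline. The following is a minimal sketch of that idea on a toy function, not Captum's implementation. The weights `W`, the input `x`, and the midpoint Riemann sum are illustrative assumptions (Captum's default integration rule differs).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy differentiable "model": f(x) = sigmoid(w . x). Weights are illustrative only.
W = [2.0, -1.0, 0.5]

def f(x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)))

def grad_f(x):
    # Analytic gradient of sigmoid(w . x) with respect to x.
    s = f(x)
    return [s * (1.0 - s) * w for w in W]

def integrated_gradients(x, baseline, n_steps):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    attributions = [0.0] * len(x)
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attributions[i] += g[i] * (x[i] - baseline[i]) / n_steps
    # Completeness: attributions should sum to f(x) - f(baseline).
    delta = sum(attributions) - (f(x) - f(baseline))
    return attributions, delta

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attrs_coarse, delta_coarse = integrated_gradients(x, baseline, n_steps=2)
attrs_fine, delta_fine = integrated_gradients(x, baseline, n_steps=50)
print(f"delta with  2 steps: {delta_coarse:+.6f}")
print(f"delta with 50 steps: {delta_fine:+.6f}")
```

The delta shrinks as the number of integration steps grows, which is why a large delta is a signal to raise n_steps before trusting the attributions.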

LIME for Language Models

LIME (Local Interpretable Model-agnostic Explanations) explains individual predictions by fitting a simple interpretable model (typically a sparse linear model) to the behavior of the complex model in the neighborhood of a specific input. For text, LIME works by randomly removing words from the input, observing how the model's prediction changes, and fitting a linear model that approximates the local decision boundary.

LIME's key advantage is that it is entirely model-agnostic: it treats the model as a black box and requires only the ability to call the model's prediction function. This makes it applicable to API-based LLMs where you have no access to gradients or internal activations. The tradeoff is that LIME's perturbation strategy (removing tokens) can create out-of-distribution inputs that a language model has never seen during training, potentially producing misleading attributions.

# LIME explanation for a text classifier (model-agnostic)
# Assumes `model` and `tokenizer` are the sentiment classifier loaded in Code Fragment 10.5.2b.
import numpy as np
import torch
from lime.lime_text import LimeTextExplainer

# Works with any model that returns class probabilities
def predict_proba(texts):
    """Prediction function that LIME will call repeatedly."""
    results = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**inputs).logits[0]
        probs = torch.softmax(logits, dim=-1).numpy()
        results.append(probs)
    return np.array(results)  # shape: (num_texts, num_classes)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
text = "The movie was surprisingly entertaining despite a weak script"
explanation = explainer.explain_instance(
    text,
    predict_proba,
    num_features=10,   # top 10 most important words
    num_samples=1000,  # number of perturbations to generate
)

# Display word-level importance
print("LIME feature importance (positive sentiment):")
for word, weight in explanation.as_list():
    direction = "+" if weight > 0 else "-"
    bar = "#" * int(abs(weight) * 50)
    print(f" {word:20s} {direction}{bar} ({weight:+.4f})")

# LIME also provides HTML visualization:
# explanation.save_to_file("lime_explanation.html")

# For API-based LLMs (no gradient access), LIME is often the only option.
# Replace predict_proba with an API call wrapper
# (parse_logprobs is a helper you would write yourself):
#
# def predict_proba_api(texts):
#     results = []
#     for text in texts:
#         response = client.chat.completions.create(
#             model="gpt-4o-mini",
#             messages=[{"role": "user", "content": f"Classify: {text}"}],
#             logprobs=True,
#         )
#         # Extract probabilities from logprobs
#         results.append(parse_logprobs(response))
#     return np.array(results)
Output:
LIME feature importance (positive sentiment):
 entertaining  +##############  (+0.2847)
 surprisingly  +##########      (+0.1923)
 movie         +####            (+0.0712)
 despite       -###             (-0.0534)
 weak          -#########       (-0.1756)
 script        -####            (-0.0891)
 was           +##              (+0.0312)
 The           +                (+0.0087)
 a             -                (-0.0041)
Code Fragment 10.5.3: Using LIME for model-agnostic text explanation. LIME perturbs the input by removing words, observes prediction changes, and fits a local linear model. This approach works with any model, including API-based LLMs where gradient access is unavailable.
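
To make the perturb-and-fit procedure concrete, here is a simplified from-scratch sketch of the idea behind LIME for text. It is not the lime library's implementation. The `black_box` classifier and its `WORD_EFFECT` weights are hypothetical stand-ins for a real model, and the proximity kernel uses the fraction of removed words rather than the library's cosine distance.

```python
import math
import random

# Hypothetical black-box classifier: returns P(positive) for a list of words.
WORD_EFFECT = {"entertaining": 2.0, "surprisingly": 1.0, "weak": -1.5}

def black_box(words):
    score = sum(WORD_EFFECT.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def lime_text(words, predict, num_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a locally weighted linear surrogate over word-presence masks."""
    rng = random.Random(seed)
    d = len(words)
    rows, targets, weights = [], [], []
    for i in range(num_samples):
        # Sample 0 keeps the full sentence; the rest keep each word with p = 0.5.
        mask = [1] * d if i == 0 else [rng.randint(0, 1) for _ in range(d)]
        kept = [w for w, keep in zip(words, mask) if keep]
        # Proximity kernel: perturbations that remove fewer words count more.
        distance = 1.0 - sum(mask) / d
        weights.append(math.exp(-(distance ** 2) / kernel_width ** 2))
        rows.append(mask + [1])  # trailing 1 is the intercept column
        targets.append(predict(kept))
    # Weighted ridge regression via the normal equations.
    p = d + 1
    xtwx = [[sum(wt * r[i] * r[j] for wt, r in zip(weights, rows))
             + (ridge if i == j else 0.0) for j in range(p)] for i in range(p)]
    xtwy = [sum(wt * r[i] * y for wt, r, y in zip(weights, rows, targets))
            for i in range(p)]
    coef = solve(xtwx, xtwy)
    return dict(zip(words, coef[:d]))  # assumes each word appears once

sentence = "The movie was surprisingly entertaining despite a weak script".split()
lime_weights = lime_text(sentence, black_box)
for word, weight in sorted(lime_weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{word:15s} {weight:+.3f}")
```

The surrogate's coefficients play the role of the word-level importances reported by the library. Because the surrogate is linear and the toy classifier is not, the coefficients are a local approximation rather than the classifier's true weights.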

BertViz: Interactive Attention Visualization

BertViz provides interactive, browser-based visualizations of attention patterns across all layers and heads of a transformer model. It offers three visualization modes: the head view (attention from a single head as lines connecting tokens), the model view (all heads across all layers in a compact overview), and the neuron view (how individual neurons in the query and key vectors contribute to attention scores). BertViz works in Jupyter notebooks and supports BERT, GPT-2, RoBERTa, XLNet, and other Hugging Face models.

While Section 10.1 covered attention visualization from scratch, BertViz is the production tool for this task. It is particularly useful for qualitative exploration: scanning attention patterns across layers to identify which heads attend to syntactic structure, which heads focus on positional patterns, and which heads appear to implement specific linguistic functions like coreference resolution or subject-verb agreement.

# BertViz: interactive attention visualization in Jupyter
# pip install bertviz
from bertviz import model_view, head_view, neuron_view
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("bert-base-uncased",
    output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "The bank raised interest rates after the financial crisis"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
attention = outputs.attentions # tuple of (batch, heads, seq, seq) per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Model view: compact overview of all layers and heads
model_view(attention, tokens)
# Head view: detailed view for a specific layer
head_view(attention, tokens, layer=6, heads=[0, 3, 7])
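# Neuron view: see query/key contributions. It needs BertViz's own model classes:
# from bertviz.transformers_neuron_view import BertModel, BertTokenizer
# from bertviz.neuron_view import show
# show(nv_model, "bert", nv_tokenizer, text, layer=6, head=3)  # nv_* loaded via the classes above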
# Practical usage pattern: identify interesting heads, then investigate
# with probing classifiers (Section 10.1) or activation patching
# (Section 10.2) to confirm functional roles.
Code Fragment 10.5.4: BertViz provides three levels of attention visualization. The model view gives a bird's-eye overview, the head view examines specific attention patterns, and the neuron view reveals how individual query and key neurons contribute. Use BertViz for hypothesis generation, then confirm with more rigorous methods.
Tip: Choosing the Right XAI Tool for Your Situation

The choice between XAI libraries depends on three factors: model access (do you have gradient access, or only API access?), audience (researchers, engineers, or non-technical stakeholders?), and goal (debugging a specific failure, or systematic audit for compliance?). If you have full model access and need theoretically grounded attributions, use Captum with Integrated Gradients. If you only have API access, use LIME with a prediction wrapper. If you are exploring attention patterns during model development, use BertViz for interactive visualization. For mechanistic circuit analysis during research, use TransformerLens. For regulatory audits requiring feature-level explanations, combine Captum (for attributions) with SHAP (for Shapley-value guarantees from Section 10.3).

Table 10.5.3a: XAI Practitioner's Decision Matrix (as of 2026).
Scenario | Model Access | Recommended Tool(s) | Output
Debug misclassification in production | Full (local model) | Captum (Integrated Gradients) | Per-token attribution scores
Explain API-based LLM predictions | API only | LIME | Word-level importance, local linear model
Explore attention during development | Full | BertViz | Interactive attention heatmaps
Regulatory compliance audit | Full | SHAP + Captum | Shapley values with theoretical guarantees
Research: understand model circuits | Full | TransformerLens + SAELens | Activation patches, feature dashboards
Quick probing and logit lens | Full | nnterp | Per-layer predictions, probing accuracy
Non-technical stakeholder report | Any | LIME or Captum + custom visualization | Highlighted text, plain-language summaries

10.5.2 Evaluation of Explanation Quality

How do we know if an explanation is "good"? Several metrics have been proposed to evaluate explanation quality, each capturing different desirable properties.

Table 10.5.4a: Evaluation of Explanation Quality (as of 2026).
Metric | What It Measures | How to Compute
Faithfulness (Sufficiency) | Can the top-k tokens reproduce the prediction? | Keep only top-k attributed tokens, measure prediction change
Faithfulness (Comprehensiveness) | Do the top-k tokens account for the prediction? | Remove top-k tokens, measure prediction drop
Plausibility | Do explanations match human intuition? | Compare attributions to human annotation of important words
Consistency | Do similar inputs get similar explanations? | Measure attribution similarity for paraphrased inputs
Sparsity | How concentrated is the attribution? | Entropy or Gini coefficient of attribution distribution

Code Fragment 10.5.5b evaluates whether the attributed tokens actually drive the prediction, using both sufficiency (keeping only top-k) and comprehensiveness (removing top-k) tests.

import numpy as np
import torch

# Faithfulness evaluation for attribution methods.
# Assumes a causal LM: logits[0, -1] is the next-token distribution.
def evaluate_faithfulness(model, tokenizer, text, attributions, k_values=(1, 3, 5)):
    """
    Evaluate faithfulness of attributions using sufficiency and comprehensiveness.
    `attributions` holds one score per input token.
    """
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"]
    pad_id = tokenizer.pad_token_id or 0  # fall back to id 0 if there is no pad token

    def prob_of(ids, token_id):
        """Probability the model assigns to token_id after reading ids."""
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        return torch.softmax(logits, dim=-1)[token_id].item()

    with torch.no_grad():
        baseline_logits = model(**inputs).logits[0, -1]
    predicted_id = baseline_logits.argmax()
    baseline_prob = torch.softmax(baseline_logits, dim=-1)[predicted_id].item()

    sorted_indices = np.argsort(attributions)[::-1]  # highest attribution first
    results = {}
    for k in k_values:
        top_k = sorted_indices[:k].tolist()
        # Sufficiency: keep only top-k tokens, pad out everything else
        keep = torch.zeros(input_ids.shape[1], dtype=torch.bool)
        keep[top_k] = True
        sufficient_ids = input_ids.clone()
        sufficient_ids[0, ~keep] = pad_id
        sufficiency = prob_of(sufficient_ids, predicted_id) / baseline_prob  # closer to 1 = better
        # Comprehensiveness: pad out the top-k tokens, keep everything else
        comp_ids = input_ids.clone()
        comp_ids[0, top_k] = pad_id
        comprehensiveness = 1 - prob_of(comp_ids, predicted_id) / baseline_prob  # closer to 1 = better
        results[f"k={k}"] = {
            "sufficiency": sufficiency,
            "comprehensiveness": comprehensiveness,
        }
    return results
Code Fragment 10.5.5b: Faithfulness evaluation: keeping only the top-k attributed tokens (sufficiency) or removing them (comprehensiveness) and measuring how the prediction probability changes. A faithful attribution scores high on both axes.
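
The sparsity and plausibility rows of the metrics table need no model access, only the attribution scores. The sketch below shows one way to compute them; it reuses the Integrated Gradients scores printed for Code Fragment 10.5.2b. The function names and the `human_rationale` positions are assumptions made for illustration, and intersection-over-union is one of several possible plausibility measures.

```python
import math

def gini(values):
    """Gini coefficient of |attributions|: 0 = evenly spread, near 1 = concentrated."""
    xs = sorted(abs(v) for v in values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def normalized_entropy(values):
    """Shannon entropy of the normalized |attributions|, scaled to [0, 1]."""
    xs = [abs(v) for v in values]
    total = sum(xs)
    if len(xs) < 2 or total == 0:
        return 0.0
    probs = [x / total for x in xs if x > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(xs))

def rationale_iou(attributions, human_rationale, k):
    """Plausibility: IoU between the top-k attributed positions and human-marked ones."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    top_k = set(order[:k])
    human = set(human_rationale)
    union = top_k | human
    return len(top_k & human) / len(union) if union else 1.0

# Per-token scores from the Integrated Gradients output shown earlier:
# [CLS] the movie was surprisingly entertaining despite a weak script [SEP]
ig_scores = [-0.0012, 0.0341, 0.0523, 0.0187, 0.1142, 0.1689,
             -0.0412, -0.0098, -0.0834, -0.0301, -0.0021]
# Hypothetical human annotation: "surprisingly", "entertaining", "weak".
human_rationale = [4, 5, 8]

sparsity_gini = gini(ig_scores)
sparsity_entropy = normalized_entropy(ig_scores)
plausibility = rationale_iou(ig_scores, human_rationale, k=3)
print(f"Gini: {sparsity_gini:.3f}  entropy: {sparsity_entropy:.3f}  IoU@3: {plausibility:.2f}")
```

A high plausibility score says the explanation looks right to a human annotator; only the sufficiency and comprehensiveness tests say whether the model actually relied on those tokens.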

10.5.3 LLMs as Interpretability Assistants

A surprising twist in the interpretability story is that LLMs themselves have become powerful tools for explaining other models. Instead of relying solely on numerical attribution scores or heatmaps, practitioners are using language models to generate natural language explanations of model behavior, produce counterfactual analyses, and even automate the labeling of internal model features discovered through mechanistic interpretability.

10.5.3.1 Natural Language Explanations of Predictions

Given a model's input, output, and attribution scores, an LLM can synthesize a human-readable explanation: "The model predicted negative sentiment primarily because of the phrase 'deeply disappointed,' which received the highest attribution score. The word 'excellent' in the second sentence pulled toward positive sentiment but was outweighed by the negative signals." This transforms opaque numerical outputs into narratives that stakeholders can understand and critique. The key advantage is accessibility: a product manager does not need to interpret a SHAP waterfall chart if an LLM can narrate the same information in plain language.
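
One way to set this up is to format the attribution scores into a prompt for the explainer LLM. The following is a minimal sketch: the prompt wording and the function name `build_explanation_prompt` are hypothetical, the scores are the Integrated Gradients values printed for Code Fragment 10.5.2b, and the call to an actual LLM is left out.

```python
def build_explanation_prompt(text, prediction, token_scores, top_n=5):
    """Format attribution scores into a prompt for an explainer LLM (hypothetical template)."""
    ranked = sorted(token_scores, key=lambda pair: abs(pair[1]), reverse=True)[:top_n]
    lines = [f"- {token!r}: {score:+.4f}" for token, score in ranked]
    return (
        "You are explaining a sentiment classifier to a non-technical reader.\n"
        f"Input text: {text}\n"
        f"Model prediction: {prediction}\n"
        "Top attributed tokens (positive scores push toward the positive class):\n"
        + "\n".join(lines)
        + "\nExplain in two or three sentences why the model made this prediction. "
        "Mention only the tokens listed above and do not speculate beyond these scores."
    )

token_scores = [
    ("[CLS]", -0.0012), ("the", 0.0341), ("movie", 0.0523), ("was", 0.0187),
    ("surprisingly", 0.1142), ("entertaining", 0.1689), ("despite", -0.0412),
    ("a", -0.0098), ("weak", -0.0834), ("script", -0.0301), ("[SEP]", -0.0021),
]
prompt = build_explanation_prompt(
    "The movie was surprisingly entertaining despite a weak script",
    "positive",
    token_scores,
)
print(prompt)
```

Restricting the prompt to the top-ranked tokens and asking the LLM not to go beyond them is a design choice that keeps the narrative tied to the attribution scores rather than to the explainer's own guesses.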

10.5.3.2 LLM-Generated Counterfactual Explanations

Counterfactual explanations answer the question "what would need to change for the prediction to be different?" To generate them, prompt an LLM with the original input and prediction and ask for minimal modifications that would flip the outcome. For example: "The loan application was denied because the debt-to-income ratio of 45% exceeds the threshold. The prediction would change to approved if the monthly debt payments decreased from $2,700 to below $2,000, or if annual income increased from $72,000 to above $85,000." These explanations are actionable in ways that feature importance scores are not, and they can support explainability requirements in regulated domains such as finance and healthcare. Because the LLM only proposes the edits, each proposed counterfactual should be re-run through the original model to confirm that it actually flips the prediction.
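
A proposed counterfactual is only useful if the model really changes its decision. The sketch below checks LLM-proposed edits against a toy stand-in for the loan model, using the figures from the example. The 40% cutoff in `DTI_THRESHOLD`, the `approve` rule, and the "tiny raise" proposal are all hypothetical, since the text does not state the actual threshold.

```python
from dataclasses import dataclass, replace

DTI_THRESHOLD = 0.40  # hypothetical cutoff; the real threshold is not given in the text

@dataclass(frozen=True)
class Application:
    monthly_debt: float
    annual_income: float

def approve(app):
    """Toy stand-in for the deployed model: approve when debt-to-income is under the cutoff."""
    dti = app.monthly_debt * 12 / app.annual_income
    return dti < DTI_THRESHOLD

def verify_counterfactuals(original, proposals):
    """For each proposed edit, report whether it really flips the model's decision."""
    base = approve(original)
    return {
        name: approve(replace(original, **changes)) != base
        for name, changes in proposals.items()
    }

original = Application(monthly_debt=2700, annual_income=72000)  # DTI = 45%, denied
proposals = {
    "lower debt": {"monthly_debt": 2000},
    "higher income": {"annual_income": 85000},
    "tiny raise": {"annual_income": 73000},  # hypothetical weak suggestion
}
checked = verify_counterfactuals(original, proposals)
for name, flips in checked.items():
    print(f"{name:14s} flips decision: {flips}")
```

Only proposals that pass this check should be shown to the applicant; the rest are plausible-sounding text that the model does not support.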

10.5.3.3 Automated Model Card Generation

Model cards (Mitchell et al., 2019) document a model's intended use, performance characteristics, limitations, and ethical considerations. Writing them manually is tedious and often skipped. LLMs can automate this by analyzing a model's evaluation results, training data statistics, and configuration, then generating a structured model card that covers performance breakdowns by demographic group, known failure modes, and recommended use cases. While the generated card requires human review, it reduces the documentation burden from hours to minutes and ensures that no standard section is accidentally omitted.

10.5.3.4 Auto-Labeling SAE Features with LLMs

The sparse autoencoders (SAEs) discussed in Section 10.3 decompose model activations into thousands of interpretable features, but each feature needs a human-readable label to be useful. OpenAI's "Language models can explain neurons in language models" (Bills et al., 2023) pioneered the approach of using GPT-4 to automatically describe what each neuron computes by showing it the neuron's top-activating examples and asking for a natural language summary. Neuronpedia scales this approach to the features discovered by SAEs, using LLMs to auto-label features from Gemma Scope (covered earlier in Section 10.3) and other SAE analyses. The process works as follows: collect the top 20 text examples that maximally activate a given SAE feature, present them to an LLM with the prompt "What concept or pattern do these examples share?", and store the generated label alongside the feature. Human spot-checks verify label quality, and the community can propose corrections.
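
The collect-and-prompt step described above can be sketched as follows. The example snippets, activation values, and function names are hypothetical, and the LLM call itself is omitted; only the question text comes from the description above.

```python
import heapq

def top_activating_examples(activations, k=20):
    """activations: (text, max_feature_activation) pairs for one SAE feature."""
    return heapq.nlargest(k, activations, key=lambda pair: pair[1])

def build_labeling_prompt(examples):
    """Assemble the auto-labeling prompt from the top-activating snippets."""
    numbered = "\n".join(
        f"{i}. (activation {act:.2f}) {text}"
        for i, (text, act) in enumerate(examples, start=1)
    )
    return (
        "The following text snippets most strongly activate one feature "
        "of a sparse autoencoder.\n"
        f"{numbered}\n"
        "What concept or pattern do these examples share? Answer with a short label."
    )

# Hypothetical activations for a single feature over a tiny corpus.
feature_activations = [
    ("The Eiffel Tower is in Paris", 8.4),
    ("She ordered a coffee", 0.3),
    ("The Louvre attracts millions of visitors", 7.1),
    ("Stock prices fell on Monday", 0.1),
    ("Notre-Dame stands on an island in the Seine", 6.6),
]
top_examples = top_activating_examples(feature_activations, k=3)
labeling_prompt = build_labeling_prompt(top_examples)
print(labeling_prompt)
```

In practice the stored label would be kept next to the examples that produced it, so that a human spot-checker can see what the LLM was shown.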

Key Insight

Using LLMs to explain other models creates a productive feedback loop: interpretability techniques surface internal features, LLMs label those features in natural language, and researchers use the labels to form hypotheses about model behavior that drive further investigation. The risk is circular reasoning; if the explaining LLM shares biases or blind spots with the model being explained, the generated labels may be plausible but misleading. Always validate LLM-generated explanations against ground truth or human judgment, especially for safety-critical applications.

10.5.3.5 Practical Workflow

A typical LLM-assisted interpretability workflow combines the techniques above. First, run standard attribution methods (Integrated Gradients, attention rollout) on a set of important predictions. Second, feed the attributions into an LLM to generate natural language explanations and counterfactuals. Third, use SAE feature analysis with LLM auto-labeling to identify higher-level circuits involved in the prediction. Fourth, compile the results into an auto-generated model card. This end-to-end pipeline makes interpretability accessible to teams that lack dedicated interpretability researchers, democratizing a practice that was previously confined to specialized labs.

Research Frontier

The logit lens family of techniques (including the tuned lens and future lens) is revealing how transformer layers progressively refine predictions, providing a window into the computation happening across depth. Research on universal neurons and induction heads has identified recurring computational motifs that appear across different transformer architectures and training runs, suggesting fundamental building blocks of language model computation. An open frontier is using interpretability findings to design better architectures, closing the loop from understanding to engineering by building models whose internal computations are more transparent by construction.

Self-Check
1. What is the difference between faithfulness and plausibility in explanation evaluation?
Faithfulness measures whether the explanation accurately reflects the model's actual computation (does removing the highlighted tokens actually change the prediction?). Plausibility measures whether the explanation matches human intuition about what should be important. These can diverge: a model might make its prediction based on unexpected features (like punctuation patterns) that are faithful but implausible, while humans might expect certain keywords to be important even if the model does not rely on them.
2. When should you reach for LIME instead of Captum's Integrated Gradients?
LIME is the right tool when you only have API access to a model (no gradients, no activations). It is model-agnostic and only requires a prediction function. Captum is preferred when you have full PyTorch model access, because gradient-based methods like Integrated Gradients are more theoretically grounded and produce smoother attributions, especially for long text inputs.
3. Why is using an LLM to auto-label SAE features a "productive but risky" feedback loop?
It is productive because LLMs can scan thousands of activating examples and propose concise natural-language descriptions far faster than humans, and the labels make subsequent circuit-level reasoning possible. It is risky because the labeling LLM may share biases or blind spots with the model being explained, so a plausible label is not the same as a correct label. Spot-checking against human judgment and ground truth is essential.

Exercises

Exercise 10.4b.1: Feature visualization for LLMs Conceptual

Feature visualization is well-developed for vision models (generating images that maximally activate a neuron). Why is the equivalent for LLMs more challenging, and what alternative approaches exist?

Answer Sketch

For vision models, you can optimize an input image via gradient ascent to maximize a neuron's activation, producing a human-interpretable image. For LLMs, the input is discrete tokens, so gradient ascent does not directly apply (you cannot have a 'fractional token'). Alternatives: (1) Dataset examples: find real text that maximally activates the feature (the approach used in the previous exercise). (2) Logit attribution: see which output tokens the feature promotes or suppresses. (3) Automated interpretability: use another LLM to describe what a feature responds to. (4) Optimization in embedding space followed by nearest-token projection (approximate but sometimes useful). Dataset examples are the most common approach because they show real-world contexts where the feature fires.

Exercise 10.4b.2: Representation engineering Conceptual

Representation engineering studies how high-level concepts (truthfulness, safety, emotion) are encoded as directions in a model's activation space. Explain the basic approach: how do you find the 'truthfulness direction'?

Answer Sketch

Create pairs of prompts designed to elicit truthful vs. untruthful model behavior (e.g., true statements vs. common misconceptions). Run both sets through the model and record activations at each layer. The 'truthfulness direction' is the vector that best separates truthful from untruthful activations (often found via PCA on the difference vectors). This direction can then be used to (1) classify whether the model is being truthful on new inputs, by projecting their activations onto it, and (2) steer the model toward truthfulness by adding the direction vector to activations during inference. The approach assumes that concepts are encoded as linear directions, an assumption that holds approximately for many high-level properties.
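The recipe can be illustrated with a small sketch. Note the substitution: instead of PCA on the difference vectors, the code below uses the simpler difference-of-means estimate of the direction, which needs no linear-algebra library. The activation vectors are hypothetical three-dimensional toy values; real activations would be recorded from a chosen layer of the model, and all function names here are invented for the example.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos: Sequence[Vector], neg: Sequence[Vector]) -> Vector:
    """Unit vector pointing from the mean 'neg' activation to the mean 'pos' one."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]


def project(activation: Vector, direction: Vector) -> float:
    """Signed length of the activation along the direction (use 1: classify)."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift the activation by alpha along the direction (use 2: steer)."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Hypothetical activations recorded at one layer for each prompt set.
truthful_acts = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1], [1.2, 0.1, -0.1]]
untruthful_acts = [[-1.0, 0.1, 0.0], [-0.9, 0.0, 0.1], [-1.1, 0.2, -0.1]]

direction = concept_direction(truthful_acts, untruthful_acts)

new_activation = [-0.7, 0.1, 0.0]
before = project(new_activation, direction)
after = project(steer(new_activation, direction, alpha=2.0), direction)
print(f"projection before steering: {before:.2f}, after: {after:.2f}")
```

Because the direction is normalized, steering with `alpha` raises the projection by exactly `alpha`, which makes the steering strength easy to reason about. Whether a given `alpha` changes model behavior in the intended way is an empirical question that this toy example does not address.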

Exercise 10.4b.3: Interpretability and AGI safety Conceptual

Some researchers argue that interpretability is necessary for safe AGI because behavioral testing alone cannot guarantee safety. Others argue that interpretability methods will always lag behind model complexity. Evaluate both positions and suggest a middle ground.

Answer Sketch

Pro-interpretability: behavioral testing only covers the scenarios that were tested; a model could behave safely in known situations but dangerously in novel ones. Interpretability could identify dangerous internal representations (deception, power-seeking) before they manifest behaviorally. This is analogous to X-raying a machine rather than just watching it operate. Skeptical view: models are too complex for humans to fully understand (billions of parameters), interpretability methods produce simplified stories that may miss critical details, and the methods themselves rest on assumptions that may not hold for novel architectures. Middle ground: use interpretability as one layer in a defense-in-depth strategy. Combine interpretability (identify specific risk circuits), behavioral testing (verify on comprehensive scenarios), and formal guarantees where possible (provable bounds on certain behaviors). No single approach suffices, but together they provide stronger safety assurance than any one of them alone.

What Comes Next

In the next section, Section 10.6: Platforms, the focus shifts from "how do I explain a model?" to "where do I run it?", surveying the platforms that host today's open-weight interpretability targets at the scale modern research requires.

Further Reading

Production XAI Tooling

Kokhlikyan, N., Miglani, V., Martin, M., et al. (2020). Captum: A unified and generic model interpretability library for PyTorch. arXiv:2009.07896. The reference paper for Captum, Meta's PyTorch attribution toolkit. Practitioners shipping gradient-based explanations should read this for an overview of the unified API and the breadth of supported methods (Integrated Gradients, DeepLift, GradientSHAP, LayerConductance).
Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. ACL System Demonstrations. Introduces BertViz, the de facto interactive attention visualization tool for transformer practitioners. Use as a reference when wiring BertViz into Jupyter exploration of new model families.

Evaluation and LLM-Assisted Interpretability

Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., & Saunders, W. (2023). Language models can explain neurons in language models. OpenAI. Uses GPT-4 to automatically generate natural language descriptions of what individual neurons compute, then scores those descriptions against activation patterns. This pioneering work on automated interpretability is relevant for teams exploring scalable approaches to understanding large models.
Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., Wang, S., Yin, D., & Du, M. (2024). Explainability for Large Language Models: A Survey. ACM TIST, 15(2). A comprehensive survey covering the full landscape of LLM explainability, from local attribution methods to global analysis techniques and evaluation metrics. Researchers entering the field should read this for its thorough taxonomy and identification of open challenges.