Practical Interpretability for Applications

Section 10.3

"Interpretability that cannot leave the lab is just curiosity. Interpretability that ships to production is engineering."

ProbeProbe, Production Shipping AI Agent
Big Picture

Interpretability is not just a research curiosity; it is a practical toolkit for building better models. Feature attribution methods explain individual predictions, and representation engineering steers model behavior without retraining. These tools help practitioners debug hallucination, remove unwanted biases, and verify that models behave as intended before deployment. The mechanistic understanding from Section 10.2 provides the theoretical grounding for why targeted interventions on specific weight matrices and activation directions can change specific behaviors.

Prerequisites

This section builds on interpretability basics from Section 10.1: Attention Analysis and Probing and mechanistic interpretability covered in Section 10.2: Mechanistic Interpretability.

Fun Fact: The Hidden State Was Doing Math

When Anthropic published interpretability work showing that mid-layer activations in a small transformer were performing what looked like binary addition in linear subspaces, the reaction split cleanly along axes of expectation. Researchers who believed neural nets were inscrutable black boxes were stunned. Researchers who believed they were doing math were stunned that someone had finally proved it. The lesson: 'incomprehensible' often means 'we have not looked carefully yet'.

10.3.1 Feature Attribution Methods

Feature attribution methods assign an importance score to each input token, answering the question: "How much did each token contribute to the model's prediction?" Unlike attention visualization (which shows where the model looks), attribution methods track the actual causal influence of each input on the output through the entire network.

10.3.1.1 Integrated Gradients

Integrated Gradients (Sundararajan et al., 2017) computes attribution by integrating the gradient of the output with respect to the input along a straight-line path from a baseline (typically the zero embedding) to the actual input. The method satisfies two desirable axioms: sensitivity (if changing an input changes the output, that input gets non-zero attribution) and implementation invariance (the attribution depends only on the function, not its implementation details).

Key Insight: Mental Model: The Highlighted Document

Think of feature attribution as a highlighter pen applied to the input text. The method highlights which words (or tokens) pushed the model's prediction in a particular direction. Gradient-based methods measure how much a small change in each input token would change the output. SHAP values go further by computing each token's marginal contribution across all possible subsets of the input. The caveat is that highlights show local sensitivity, not global understanding: knowing which word mattered does not tell you why it mattered.

# Integrated Gradients for Token Attribution
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def integrated_gradients(model, tokenizer, text, target_token_idx=-1, n_steps=50):
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    input_ids = inputs["input_ids"]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
    embedding_layer = model.get_input_embeddings()
    input_embeds = embedding_layer(input_ids).detach()
    baseline = torch.zeros_like(input_embeds)
    alphas = torch.linspace(0, 1, n_steps).unsqueeze(1).unsqueeze(2)
    interpolated = baseline + alphas * (input_embeds - baseline)
    interpolated.requires_grad_(True)
    outputs = model(inputs_embeds=interpolated)
    target_logit = outputs.logits[:, target_token_idx, :].max(dim=-1).values
    target_logit.sum().backward()
    avg_grads = interpolated.grad.mean(dim=0)
    ig = (input_embeds.squeeze(0) - baseline.squeeze(0)) * avg_grads
    attributions = ig.sum(dim=-1).detach().numpy()
    return attributions, tokens
Output: Token attributions for next-token prediction: The ++++++ (+0.3142) Ġcapital +++++++++++++++ (+0.7891) Ġof +++++ (+0.2534) ĠFrance ++++++++++++++++++++ (+1.0247) Ġis +++ (+0.1563)
Code Fragment 10.3.1: Integrated Gradients computes input attribution by integrating gradients along a straight-line path from a zero baseline to the actual input. The function returns one attribution score per input token, telling you which words the model relied on for its prediction.
Tip: Production Alternative

The implementation above builds Integrated Gradients from scratch for pedagogical clarity. In production, use Captum (install: pip install captum), Meta's model interpretability library for PyTorch, which provides optimized implementations of multiple attribution methods (LayerIntegratedGradients, DeepLift, Lime, etc.) with batched compute and a uniform API across PyTorch model families.

10.3.1.2 SHAP for Language Models

Key Insight
Why naive single-token masking fails ("not bad")

The simplest token-importance recipe is to mask one token at a time and watch how the predicted class probability changes. It is wrong, and the canonical counter-example is the two-word sentence "not bad", which a sentiment classifier confidently labels positive. Mask "not" and the classifier reads only "bad" and predicts negative; mask "bad" and the classifier reads only "not" and predicts negative. Both tokens look like positive markers in isolation, which is absurd. The real story is the interaction: "not" + "bad" together carries the positive signal, neither alone does. Single-token masking is blind to interactions because it only ever subtracts one feature from a fixed background. Any honest attribution method has to consider what each token contributes averaged across every possible subset of the other tokens. That is exactly what SHAP does.

SHAP (SHapley Additive exPlanations) adapts Shapley values from cooperative game theory to feature attribution. Each token's SHAP value represents the average marginal contribution of that token across all possible subsets of input tokens.

Formally, for an input with feature set $F$ and a value function $v(S)$ giving the model output when only the subset $S \subseteq F$ of features is present (absent features are replaced by their average or a baseline), the SHAP value of feature $i$ is:

$$\phi _{i} = \sum _{S \subseteq F \setminus \{i\}} \frac{|S|! \, (|F| - |S| - 1)!}{|F|!} \, \bigl(v(S \cup \{i\}) - v(S)\bigr)$$

This is the average marginal contribution of feature $i$ over all $2^{|F|-1}$ possible subsets of the other features, weighted so that the formula satisfies the four Shapley axioms (efficiency, symmetry, dummy, additivity). Exact computation is exponential, so SHAP libraries sample subsets and use specialized estimators (KernelSHAP, TreeSHAP, Partition SHAP for text).

SHAP libraries render per-token attributions on text by shading each token by its signed SHAP value, typically red for "pushes toward the predicted class" and blue for "pushes away". On a sentiment model classifying "The movie was absolutely wonderful and I loved every moment" as positive, words like "wonderful", "loved", and "absolutely" light up red, function words contribute near-zero, and any genuinely negative-leaning token would show blue. The base value above the sentence is the average logit across the dataset; $f(\text{inputs})$ is the current sample's logit; and the signed token shadings sum to their difference (the additive guarantee).

Note

For production use, Integrated Gradients is typically preferred over SHAP for language models because it scales linearly with input length (O(n_steps * forward_pass)), while exact SHAP values require exponentially many evaluations (2^n for n tokens). Approximate SHAP methods reduce this cost but introduce estimation noise.

Library Shortcut: transformer-interpret

The transformer-interpret package (pip install transformers-interpret) is a minimal wrapper around Captum's Integrated Gradients for HuggingFace models. For sequence classification, two lines give you per-token attributions against the predicted class:

from transformers_interpret import SequenceClassificationExplainer
explainer = SequenceClassificationExplainer(model, tokenizer)
attributions = explainer("I really enjoyed the movie.")
explainer.visualize("out.html")   # color-coded HTML report
Code Fragment 10.3.2: transformer-interpret as a one-import alternative to writing the Captum boilerplate by hand. Useful when prototyping; for custom attribution targets, embedding interventions, or layer-level analysis, drop down to Captum directly.

10.3.2 Representation Engineering

Representation engineering (RepE) steers model behavior by modifying internal representations at inference time. Instead of retraining the model, you identify a "control vector" in activation space that corresponds to a specific behavior (such as honesty, verbosity, or formality) and add or subtract this vector during generation to increase or decrease that behavior.

Figure 10.3.2: Representation engineering extracts control vectors from contr...
Figure 10.3.1: Representation engineering extracts control vectors from contrastive prompts, then applies them at inference time to steer behavior without retraining.

The procedure is: (1) construct 50 to 200 contrastive prompt pairs that differ in the presence of the target trait; (2) compute the mean activation difference at each layer to obtain a control direction; (3) at inference time, register a forward hook that adds $\alpha \cdot v$ to the hidden state at the target layer. Positive $\alpha$ amplifies the trait, negative $\alpha$ suppresses it.

# Representation Engineering: Control Vectors
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def extract_control_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    def mean_act(prompts):
        activations = []
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt")
            with torch.no_grad():
                outputs = model(**inputs, output_hidden_states=True)
                activations.append(outputs.hidden_states[layer_idx][0, -1, :])
        return torch.stack(activations).mean(dim=0)
    v = mean_act(pos_prompts) - mean_act(neg_prompts)
    return v / v.norm()

def generate_with_steering(model, tokenizer, prompt, v, layer_idx, alpha=1.5):
    def hook(module, input, output):
        h = output[0]
        h[:, -1, :] += alpha * v
        return (h,) + output[1:]
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    handle.remove()
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
Code Fragment 10.3.3: Representation engineering: extracting a control vector from contrastive prompts and adding it via a forward hook at a target layer to steer generation without retraining.

10.3.2.1 Representation Engineering for Runtime Control

The Representation Engineering framework (Zou et al., 2023) systematizes this approach. Control vectors can be precomputed offline, stored as lightweight artifacts (a single vector per behavior per layer), and applied at inference time with negligible latency overhead. This makes representation engineering a compelling alternative to prompt engineering for behavioral steering, because the control is applied at the representation level rather than through input text that consumes context window tokens. Crucially, multiple control vectors can be composed by simple addition: applying both an "honesty" and a "conciseness" vector simultaneously produces responses that are both more truthful and more brief.

In production, representation engineering enables several runtime control patterns that are difficult to achieve through prompting alone. A content moderation system can apply a "safety" control vector with high alpha for user-facing responses while using a lower alpha for internal reasoning steps. An A/B testing framework can vary the "creativity" coefficient across user cohorts to measure its impact on engagement, with each variant differing only in a scalar multiplier rather than requiring distinct prompt templates. The repeng library provides a high-level API for extracting and applying control vectors to Hugging Face models, and custom forward hooks in vLLM and SGLang apply them with typically less than 2% latency increase.

Warning: Limitations of Representation Engineering

Control vectors are not a replacement for alignment training. They work best for continuous, stylistic behaviors (formality, verbosity, sentiment) and less reliably for discrete factual constraints ("never mention competitor X"). Steering strength is also model-dependent: the same alpha value may produce subtle effects in one model and incoherent outputs in another. Always validate control vectors on a held-out evaluation set before deploying to production, and set conservative alpha bounds to avoid pushing the model into out-of-distribution activation regions.

Key Takeaways
Self-Check
1. How does Integrated Gradients differ from simply looking at gradients?
Show Answer
Raw gradients measure local sensitivity (how a tiny perturbation affects the output) but can be noisy and fail to account for saturation effects. Integrated Gradients averages gradients along a path from a baseline (zero embedding) to the actual input, capturing the total contribution of each input feature. It satisfies the completeness axiom: attributions sum to the difference between the output at the input and the baseline, ensuring nothing is lost.
2. What is a control vector in representation engineering?
Show Answer
A control vector is a direction in activation space that corresponds to a specific behavioral property (honesty, formality, safety). It is extracted by running the model on contrastive prompt pairs (e.g., honest vs. dishonest prompts) and computing the mean activation difference. At inference time, adding this vector (scaled by a coefficient alpha) to the model's hidden states increases the associated behavior without retraining.

What's Next?

This section continues in Section 10.3a: Model Editing, Concept Erasure & Debugging, which covers ROME and MEMIT for surgically editing factual associations stored in weights, LEACE for provably erasing concepts from representations, the chain-of-thought faithfulness debate, and the practical interpretability-driven debugging workflow.

Further Reading

Attribution

Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep Networks. ICML 2017. Introduces Integrated Gradients, an attribution method with strong axiomatic guarantees including sensitivity and implementation invariance. This is the standard baseline for feature attribution in deep networks.

Representation Engineering

Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency. Proposes reading and controlling model behavior by identifying linear directions in representation space corresponding to high-level concepts like honesty or harmfulness.