Section 10.7: Interpretability Tools & Transformers Deep Dive

"TGI, vLLM, and TensorRT-LLM all post the same throughput on the same benchmark. The one you pick will be decided by which one your on-call engineer can fix at 3 AM."
Pip, Library-Triage AI Agent

Big Picture

Part II's library layer is the shelf of off-the-shelf code that turns Chapters 7 to 11's ideas into runnable notebooks. Three families dominate: the model-loading stack (transformers, accelerate, bitsandbytes) that gets weights onto your GPU; the tokenizer trio (tiktoken, SentencePiece, HF tokenizers) that converts text into model inputs; and the mechanistic-interpretability tier (TransformerLens, nnsight, SAELens) that exposes attention heads, residual streams, and sparse autoencoders for the research probes in Chapter 10. The pages that follow are reference catalogs, not deep theory; the goal is to know which library to reach for when a notebook in Chapter 7 calls for it.

Part II's library layer splits into three families: the model-loading stack (transformers, accelerate, bitsandbytes), the tokenizer trio (tiktoken, SentencePiece, HF tokenizers), and the mechanistic-interpretability tier (TransformerLens, nnsight, SAELens). Almost every figure and notebook in Chapters 7 to 11 is built from this exact set.

Three library families: model-loading, tokenizer trio, mech-interp tier — **Figure 10.6a.1:** The three library families that build every Part II notebook. The model-loading stack (transformers v5 with AutoModel and AutoTokenizer, accelerate for multi-GPU, bitsandbytes for 4-bit NF4 to fit 70B on 24 GB, FlashAttention 2 and 3 for long context, vLLM for serving, outlines for constrained output) handles loading and inference. The tokenizer trio (tiktoken for OpenAI billing parity, SentencePiece for T5 and Llama, HF tokenizers for everything else) optimizes for different tradeoffs. The mech-interp tier (TransformerLens, nnsight with NDIF, SAELens, Anthropic's circuit-tracer, garak and EasyJailbreak) is the substrate behind nearly every Chapter 10 paper.

10.7.1 The model-loading stack

transformers v5 (2025-Q1): the canonical model-loading library. AutoModel, AutoTokenizer, AutoConfig. Every open-weight model in this part is loadable in one line. v5 is a major break from v4 (refactored model registry, redesigned config API, removed many legacy methods); code written against v4.x often needs small updates to load with v5. The 2025 deprecation notes list the breaking changes.
accelerate: multi-GPU / multi-node loader. The right abstraction for "this model does not fit on one GPU".
bitsandbytes: 8-bit and 4-bit (NF4) quantization. The reason 70B models fit on a single 24 GB consumer GPU.
FlashAttention (versions 2 and 3): kernels for long-context inference. Required for million-token contexts on commodity hardware. Speculative decoding (a small draft model proposes tokens, the large model verifies in a single forward pass) and prefix caching (sharing the KV cache across requests with the same system prompt) are now production-default inference techniques; vLLM v1 and SGLang both ship them out of the box. Attention sinks (Xiao et al., 2024) let you keep early tokens pinned in the cache for streaming inference without quality collapse.
vLLM: a high-throughput serving engine with PagedAttention. The right runtime when latency matters; covered in Chapter 10.
outlines (.txt, 2023+): constrained generation via finite-state-machine guidance, the de facto library for structured output and JSON-mode in 2025-26.

10.7.2 The tokenizer trio

tiktoken: OpenAI's Rust-backed BPE tokenizer. The canonical tokenizer for the GPT family from GPT-2 onwards; the way to count tokens for OpenAI billing.
SentencePiece: Google's subword tokenizer (BPE and Unigram). Default for T5, mT5, ALBERT, XLNet, Llama family, and most Asian-language models.
Hugging Face tokenizers: the Rust-backed superset library. Implements BPE, WordPiece, Unigram. The way to train a tokenizer from scratch on your own corpus.

10.7.3 The mech-interp tier

TransformerLens (Neel Nanda): the mech-interp library. Hooks every layer, head, and residual stream of a model for inspection. The substrate of nearly every paper in Chapter 10.
nnsight 0.4+: PyTorch-native model introspection. Lets you write Python that runs inside a model's forward pass on a remote GPU. The newer alternative to TransformerLens for larger models. The companion National Deep Inference Fabric (NDIF) is the public infrastructure that runs remote-execution mech-interp jobs on Llama-405B-scale models, free for academic researchers as of 2025.
SAELens: sparse-autoencoder training and inspection. The library behind Anthropic's "Scaling Monosemanticity" (2024) and OpenAI's "Scaling and evaluating sparse autoencoders" (Gao et al., 2024, arXiv:2406.04093) work; both ground SAE-based interpretability for newcomers.
circuit-tracer (Anthropic, 2025): cross-layer-transcoder tooling for circuit-level interpretability, building on "Sparse Crosscoders for Cross-Layer Features in Superposition" (Lindsey et al., 2024).
garak and EasyJailbreak: tokenizer and model probing libraries for the safety-related chapters; useful when you want to programmatically probe a model's failure surface.

Key Insight: Why three tokenizer libraries instead of one

They optimise for different things. tiktoken is fastest at encoding existing OpenAI vocabularies but cannot train a new tokenizer. SentencePiece can train multilingual tokenizers from scratch but its training loop is slower than HF's Rust implementation. HF tokenizers is the most flexible but its on-disk format is not interchangeable with the others. In practice: use tiktoken when you must match OpenAI's tokenization exactly; use HF tokenizers for everything else; reach for SentencePiece only when you are replicating a paper that explicitly used it.

Fun Fact: The Linux Distro Moment of LLM Tools

In the early 2000s a dozen Linux distros all bundled the same kernel with different package managers, and people argued about it as if it were a religion. The Transformers, vLLM, and TGI ecosystems are at the same moment: same models underneath, different packaging upstairs, and the choice of framework quietly decides what experiments are easy versus impossible. Picking a stack is half engineering and half taste.

Real-World Scenario: The same string, three tokenizers

Code Fragment 10.7.1 below tokenizes the same English sentence with the GPT-4o, Llama 3.3, and Qwen3 tokenizers; the same string gets carved into different pieces by every model family. The numbers diverge sharply for non-English text and source code; Chapter 6 walks through the consequences.

import tiktoken, sentencepiece as spm
from transformers import AutoTokenizer

text = "Mechanistic interpretability of transformer models."

# For OpenAI: use encoding_for_model with the latest model you care about
gpt = tiktoken.encoding_for_model("gpt-4o")  # swap for current model when billing matters
llama = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
qwen = AutoTokenizer.from_pretrained("Qwen/Qwen3-7B")

print(len(gpt.encode(text)))    # GPT family: ~7 tokens
print(len(llama.encode(text)))  # Llama 3.3: ~8 tokens
print(len(qwen.encode(text)))   # Qwen3: ~8 tokens

Code Fragment 10.7.1: For OpenAI: use encoding_for_model with the latest model you care about.

10.7.4 Versions to pin

Install with uv (10-100x faster than pip and the default modern installer): uv pip install transformers tokenizers accelerate bitsandbytes flash-attn. For Part II the safe pin set is transformers>=5.0 (the 2025-Q1 major release; >=4.45 still works for legacy notebooks but expect deprecation warnings), tokenizers>=0.20, accelerate>=1.0, bitsandbytes>=0.43, flash-attn>=2.6 (only required for FA3 training; inference uses PyTorch's built-in scaled_dot_product_attention). The mech-interp libraries move faster; for those, pin to the commit hash of whichever paper you are reproducing. The interpretability literature is full of breaking API changes between minor versions.

Hugging Face Transformers Deep Dive

Key Insight: Hugging Face's three layers of abstraction

The transformers library is the central hub of the Hugging Face ecosystem. It provides a unified API for loading, configuring, and running inference with thousands of pretrained models spanning text, vision, audio, and multimodal tasks. This deep-dive walks through the three layers of abstraction the library offers: high-level pipelines for quick prototyping, AutoClasses for flexible model loading, and direct model/tokenizer access for full control.

Prerequisites

This section assumes familiarity with the Hugging Face transformers ecosystem from Section 12.1 and LLM inference fundamentals from Section 9.1. The platform shelf in Section 10.6 provides the context for why each library exists.

1. The Pipeline API: Inference in One Line

The fastest way to use a pretrained model is through the pipeline() function. Pipelines bundle a model, a tokenizer, and task-specific pre/post-processing into a single callable object. You specify a task name, and the library selects a suitable default model from the Hub.

The following example creates pipelines for three common NLP tasks: sentiment analysis, named entity recognition, and text generation.

from transformers import pipeline

# Sentiment analysis (default: distilbert-base-uncased-finetuned-sst-2-english)
classifier = pipeline("text-classification")
result = classifier("HuggingFace makes NLP accessible to everyone.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Yann LeCun works at Meta in New York.")
for ent in entities:
    print(f"  {ent['word']:<15} {ent['entity_group']:<10} {ent['score']:.3f}")

# Text generation with a specific model
generator = pipeline("text-generation", model="gpt2", max_new_tokens=40)
output = generator("The future of AI is", do_sample=True, temperature=0.7)
print(output[0]["generated_text"])

Output: [{'label': 'POSITIVE', 'score': 0.9998}] Yann LeCun PER 0.998 Meta ORG 0.994 New York LOC 0.997 The future of AI is likely to be shaped by a combination of advances in hardware, algorithms, and data availability that together...

Code Fragment 10.7.2: Three pipeline() calls covering sentiment analysis, NER with entity aggregation, and text generation with a specified model. Each pipeline handles tokenization, inference, and output formatting internally, making single-line inference possible for 30+ task types.

Pipelines support over 30 task types including question-answering, summarization, translation, zero-shot-classification, image-classification, and automatic-speech-recognition. Each task maps to a specific pipeline class that handles the input/output formatting appropriate for that task.

Note: Pipeline Device Placement

By default, pipelines run on CPU. Pass device=0 to place the model on the first GPU, or use device="cuda" for automatic GPU selection. For Apple Silicon, use device="mps". Starting with Transformers v4.36, you can also pass device_map="auto" to let the library distribute a large model across multiple GPUs automatically.

2. AutoClasses: Flexible Model and Tokenizer Loading

When you need more control than pipelines offer, AutoClasses provide the next level of abstraction. The two most important are AutoTokenizer and AutoModel (plus its task-specific variants). These classes inspect a model's configuration on the Hub and instantiate the correct architecture automatically.

The code below loads a tokenizer and a sequence classification model, then runs a forward pass to obtain logits.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "This library is incredibly well designed."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities
probs = torch.softmax(outputs.logits, dim=-1)
labels = model.config.id2label
for idx, prob in enumerate(probs[0]):
    print(f"  {labels[idx]}: {prob:.4f}")

Output: NEGATIVE: 0.0003 POSITIVE: 0.9997

Code Fragment 10.7.3: Manual inference with AutoTokenizer and AutoModelForSequenceClassification. Unlike pipelines, this approach gives direct access to raw logits, enabling custom post-processing such as calibration or thresholding. The id2label mapping from the config converts indices to human-readable class names.

The AutoModelFor* family includes task-specific heads. The most commonly used variants are listed in Figure C.1.1.

AutoClass	Task	Output
`AutoModelForCausalLM`	Text generation (decoder-only)	Next-token logits
`AutoModelForSeq2SeqLM`	Translation, summarization (encoder-decoder)	Sequence logits
`AutoModelForSequenceClassification`	Sentiment, NLI, topic classification	Class logits
`AutoModelForTokenClassification`	NER, POS tagging	Per-token logits
`AutoModelForQuestionAnswering`	Extractive QA	Start/end logits
`AutoModelForMaskedLM`	Fill-mask (encoder-only)	Vocabulary logits

Figure 10.7.1: Common AutoModel task-specific classes and their outputs.

3. Model Architectures: Encoder, Decoder, and Encoder-Decoder

See Chapter 3 (Transformer Architecture) for why each family exists. When picking an AutoClass: encoder-only (BERT, RoBERTa, DeBERTa) maps to AutoModel or AutoModelForSequenceClassification; decoder-only (GPT-2, LLaMA, Mistral) maps to AutoModelForCausalLM; encoder-decoder (T5, BART, mBART) maps to AutoModelForSeq2SeqLM.

The following example demonstrates loading one model from each family.

from transformers import (
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
)

# Encoder-only: BERT
enc_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc_model = AutoModel.from_pretrained("bert-base-uncased")
enc_out = enc_model(**enc_tokenizer("Hello world", return_tensors="pt"))
print(f"Encoder hidden states shape: {enc_out.last_hidden_state.shape}")

# Decoder-only: GPT-2
dec_tokenizer = AutoTokenizer.from_pretrained("gpt2")
dec_model = AutoModelForCausalLM.from_pretrained("gpt2")
dec_out = dec_model.generate(
    **dec_tokenizer("Once upon a time", return_tensors="pt"),
    max_new_tokens=20,
    do_sample=True,
)
print(f"Generated: {dec_tokenizer.decode(dec_out[0], skip_special_tokens=True)}")

# Encoder-decoder: T5
s2s_tokenizer = AutoTokenizer.from_pretrained("t5-small")
s2s_model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
s2s_out = s2s_model.generate(
    **s2s_tokenizer("translate English to French: Hello, how are you?", return_tensors="pt"),
    max_new_tokens=30,
)
print(f"Translation: {s2s_tokenizer.decode(s2s_out[0], skip_special_tokens=True)}")

Output: Encoder hidden states shape: torch.Size([1, 4, 768]) Generated: Once upon a time, there was a young girl who lived in a small village... Translation: Bonjour, comment allez-vous?

Code Fragment 10.7.4: Loading one model from each transformer family. BERT (encoder-only) produces hidden states for each token, GPT-2 (decoder-only) generates text autoregressively, and T5 (encoder-decoder) translates between languages. The AutoModel variant must match the architecture type.

4. Model Configuration and Customization

Every model has an associated configuration object (AutoConfig) that stores architectural hyperparameters such as the number of layers, hidden size, number of attention heads, and vocabulary size. You can inspect or modify configuration before instantiating a model.

This example loads a configuration, modifies it, and creates a randomly initialized model with the new settings.

from transformers import AutoConfig, AutoModelForCausalLM

# Load existing config
config = AutoConfig.from_pretrained("gpt2")
print(f"Original: {config.n_layer} layers, {config.n_head} heads, "
      f"hidden size {config.n_embd}")

# Create a smaller variant for experimentation
config.n_layer = 4
config.n_head = 4
config.n_embd = 256

# Instantiate a randomly initialized model with modified config
small_model = AutoModelForCausalLM.from_config(config)
num_params = sum(p.numel() for p in small_model.parameters())
print(f"Custom model: {num_params / 1e6:.1f}M parameters")

Output: Original: 12 layers, 12 heads, hidden size 768 Custom model: 11.2M parameters

Code Fragment 10.7.5: Creating a custom model variant by modifying AutoConfig. Reducing GPT-2 from 12 layers to 4 and the hidden size from 768 to 256 produces an 11.2M parameter model suitable for rapid experimentation. Note that from_config() creates randomly initialized weights.

Warning: Random Initialization vs. Pretrained Weights

from_config() creates a model with random weights. This is useful for architecture experiments or training from scratch, but not for inference. Always use from_pretrained() when you need a model with learned weights.

5. Efficient Loading and Precision Control

Modern LLMs can be extremely large. The Transformers library provides several mechanisms for loading models efficiently, including reduced-precision formats, quantization, and memory-mapped loading. These techniques are essential for working with billion-parameter models on consumer hardware.

The following example shows how to load a large model with reduced precision and automatic device mapping.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

# Load in float16 with automatic device mapping across GPUs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,   # Half precision (saves ~50% memory)
    device_map="auto",           # Distribute across available devices
    low_cpu_mem_usage=True,      # Avoid peak memory during loading
)

# For even smaller memory footprint, use 4-bit quantization
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,                  # Requires bitsandbytes
    bnb_4bit_compute_dtype=torch.float16,
    device_map="auto",
)

print(f"FP16 memory: ~14 GB")
print(f"4-bit memory: ~4 GB")

Output: FP16 memory: ~14 GB 4-bit memory: ~4 GB

Code Fragment 10.7.6: Loading Mistral-7B in float16 with device_map="auto" for multi-GPU distribution, then in 4-bit NF4 quantization for consumer GPUs. The low_cpu_mem_usage=True flag avoids a temporary peak where both the full model and the shard coexist in CPU memory.

Precision	Bits per Parameter	7B Model Size	Use Case
float32	32	~28 GB	Training (full precision)
float16 / bfloat16	16	~14 GB	Inference, mixed-precision training
int8	8	~7 GB	Inference with minimal quality loss
int4 (NF4)	4	~4 GB	Inference on consumer GPUs; QLoRA fine-tuning

Figure 10.7.2: Precision formats and approximate memory requirements for a 7-billion-parameter model.

6. Inference Patterns and Generation Strategies

For causal language models, the generate() method provides a rich set of decoding strategies. Understanding these strategies is critical for controlling the quality, diversity, and determinism of generated text. For a thorough discussion of sampling methods, see Chapter 4: Decoding Strategies & Text Generation.

The example below demonstrates several generation strategies on the same prompt.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("The key to good AI is", return_tensors="pt").input_ids

# Greedy decoding: deterministic, often repetitive
greedy = model.generate(input_ids, max_new_tokens=30, do_sample=False)
print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))

# Top-k sampling: sample from the top 50 tokens at each step
topk = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_k=50)
print("Top-k:", tokenizer.decode(topk[0], skip_special_tokens=True))

# Nucleus (top-p) sampling: sample from smallest set whose cumulative
# probability exceeds p
topp = model.generate(input_ids, max_new_tokens=30, do_sample=True, top_p=0.92)
print("Top-p:", tokenizer.decode(topp[0], skip_special_tokens=True))

# Beam search: explore multiple hypotheses in parallel
beam = model.generate(
    input_ids,
    max_new_tokens=30,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)
print("Beam:", tokenizer.decode(beam[0], skip_special_tokens=True))

Output: Greedy: The key to good AI is to make sure that the AI is able to learn from its mistakes. The AI is able to... Top-k: The key to good AI is building systems that understand the nuances of human intent and adapt to new situations... Top-p: The key to good AI is not just raw intelligence, but the ability to collaborate with humans in meaningful ways... Beam: The key to good AI is the ability to learn from experience and adapt to new situations in real time.

Code Fragment 10.7.7: Four generation strategies on the same prompt. Greedy decoding is deterministic but repetitive. Top-k and top-p (nucleus) sampling introduce diversity by restricting the candidate set at each step. Beam search explores multiple hypotheses in parallel, with no_repeat_ngram_size=2 preventing verbatim repetition.

Tip: Choosing a Generation Strategy

For factual and deterministic outputs (code generation, structured extraction), use greedy decoding or beam search. For creative and diverse outputs (story writing, brainstorming), use nucleus sampling with top_p between 0.9 and 0.95 and temperature between 0.7 and 1.0. For conversational agents, nucleus sampling with moderate temperature (0.6 to 0.8) typically gives the best balance of coherence and variety.

What's Next?

In the next section, Section 10.8: Serving Runtimes, we shift from interpretability tooling to production serving: vLLM (PagedAttention, continuous batching), Hugging Face Text Generation Inference (TGI), and SGLang.

Further Reading

Inference Libraries

Kwon, W., Li, Z., Zhuang, S., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention" (vLLM). SOSP 2023. arXiv:2309.06180. The vLLM paper.

Wolf, T., Debut, L., Sanh, V., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing." EMNLP 2020. arXiv:1910.03771. The original Hugging Face Transformers paper.

Hugging Face (2024). "Transformers Documentation." huggingface.co/docs/transformers. Authoritative reference for the de-facto LLM library.

Mechanistic Interpretability Libraries

Nanda, N., & Bloom, J. (2022). "TransformerLens." github.com/TransformerLensOrg/TransformerLens. The standard mechanistic-interpretability library.

Conmy, A., Bloom, J., Lieberum, T., et al. (2024). "Sparse Autoencoder Library (SAELens)." github.com/jbloomAus/SAELens. Reference SAE library for feature extraction.