Models

Section 5.5

Part I uses two pretrained reference models, BERT-base and GPT-2, and one untrained scaffold (the small transformer you build by hand in Chapter 3). The two reference models are no longer state-of-the-art, but that is exactly why they are useful: they are well-documented, small enough to fit on a 6 GB consumer GPU, and stable across thousands of tutorials. Every "obscure" behaviour you discover in them has been written about by someone, somewhere, and the explanation almost always still applies to the frontier models in Part II.

The aim of this section is not to enumerate every model on the Hugging Face Hub, that is Chapter 12's job. The aim is to lock in the few checkpoints that anchor Part I exercises and to set the vocabulary you will need when frontier model cards inevitably write things like "encoder-style architecture derived from BERT" or "decoder-only transformer in the GPT-2 lineage".

BERT-base and GPT-2 small are nearly identical in scale (12 layers, 12 heads, 768 hidden dim) but use opposite attention masks.
Figure 5.5.1: BERT-base and GPT-2 small are nearly identical in scale (12 layers, 12 heads, 768 hidden dim) but use opposite attention masks. BERT's bidirectional attention serves classification and span-prediction; GPT-2's causal mask is what enables autoregressive generation. Knowing this pair fluently is the prerequisite for reading any modern model card.

5.5.1 BERT-base: the encoder reference

BERT-base (uncased) has 110M parameters, 12 layers, 12 heads, 768-dim hidden state, and a 30k WordPiece vocabulary. It was pretrained on BookCorpus + English Wikipedia with masked language modelling and next-sentence prediction. The 2018 paper launched the "pretrain once, fine-tune many" paradigm that every transformer in this book depends on.

For Part I, BERT-base is the canonical "encoder transformer you can fit in memory". One line loads it:

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
Code Fragment 5.5.1a: For Part I, BERT-base is the canonical "encoder transformer you can fit in memory".

5.5.2 GPT-2: the decoder reference

GPT-2 (124M) is the small variant of the original GPT-2 family, with 12 decoder layers, 12 heads, 768-dim hidden state, and a 50k BPE vocabulary. It was pretrained on the 40 GB WebText corpus (OpenAI's curated Reddit-linked web pages) with the standard causal language modelling objective.

GPT-2 is the right teaching tool for decoder-style behavior: KV caching, autoregressive sampling, top-k and nucleus decoding, attention-mask handling. It is small enough to fine-tune fully on a single 6 GB GPU and large enough that decoding strategy actually matters. The larger checkpoints (355M, 774M, 1.5B) exist on the Hub and behave similarly with longer training runs. For modern decoder reference work, you would reach for SmolLM2-135M, Qwen3-0.6B, Llama-3.2 1B, or Gemma 3 270M (the late-2024 / 2025 small-model wave); for Part I, GPT-2 stays canonical because the literature anchors here.

For mechanistic-interpretability work specifically, the smallest useful checkpoints are TinyStories-1M / 33M (sub-coherent but reveal the entire training trajectory) and Pythia-14M / 31M (smallest Pythia checkpoints, used in 2024-25 circuits papers).

5.5.3 The other names you will see in passing

5.5.4 Comparing the Part I reference models

Table 5.5.1b: 6.4.1 Reference checkpoints for Part I.
Model Params Type Max seq len Vocab
BERT-base 110M Encoder 512 30k WordPiece
GPT-2 124M Decoder 1024 50k BPE
DistilBERT 66M Encoder (distilled) 512 30k WordPiece
RoBERTa-base 125M Encoder 512 50k BPE
T5-base 220M Encoder-decoder 512 / 512 32k SentencePiece
ModernBERT-base 149M Encoder (2024) 8192 50k BPE
Key Insight: Why these old models still matter

The frontier models of mid-2026 (the Claude 4 family, GPT-5 family, Gemini 2.5 Pro and successors) are trillion-parameter MoE systems behind APIs. You can neither run them locally nor inspect their weights. The Part I models trade away frontier quality for two superpowers: they fit on your laptop, and their weights are downloadable. Every interpretability, fine-tuning, and probing technique in this book was developed and validated against BERT-base or GPT-2 before being scaled. Skipping these models is like learning music theory without ever picking up a piano. The interpretability examples in Chapter 10 all start from these checkpoints.

Real-World Scenario: Loading both reference models

Drop the block in Code Fragment 5.5.2 below into any notebook to verify your environment can pull from Hugging Face. If both lines print, the rest of Part I will work. If the Hub is rate-limited, set HF_HUB_OFFLINE=1 after the first successful load to force the cached copy.

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

bert_tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
bert = AutoModel.from_pretrained("google-bert/bert-base-uncased")

gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

print(f"BERT params: {sum(p.numel() for p in bert.parameters()):,}")
print(f"GPT-2 params: {sum(p.numel() for p in gpt2.parameters()):,}")
Code Fragment 5.5.2: Drop this block into any notebook to verify your environment can pull from Hugging Face:.

What's Next?

In the next section, Section 5.6: External Reading & Communities, we build on the material covered here.

Further Reading

Foundational Models

Touvron, H., Martin, L., Stone, K., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Reference open-weight LLM family.
Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv:2310.06825. Reference for the Mistral architecture; the open-weight small-LM baseline.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Reference for the 2024-25 open-weight MoE architecture.

Model Hubs

Hugging Face (2024). "HF Hub Documentation." huggingface.co/docs/hub. The canonical reference for the open-weight model registry.