Part I uses two pretrained reference models, BERT-base and GPT-2, and one untrained scaffold (the small transformer you build by hand in Chapter 3). The two reference models are no longer state-of-the-art, but that is exactly why they are useful: they are well-documented, small enough to fit on a 6 GB consumer GPU, and stable across thousands of tutorials. Every "obscure" behaviour you discover in them has been written about by someone, somewhere, and the explanation almost always still applies to the frontier models in Part II.
The aim of this section is not to enumerate every model on the Hugging Face Hub, that is Chapter 12's job. The aim is to lock in the few checkpoints that anchor Part I exercises and to set the vocabulary you will need when frontier model cards inevitably write things like "encoder-style architecture derived from BERT" or "decoder-only transformer in the GPT-2 lineage".
5.5.1 BERT-base: the encoder reference
BERT-base (uncased) has 110M parameters, 12 layers, 12 heads, 768-dim hidden state, and a 30k WordPiece vocabulary. It was pretrained on BookCorpus + English Wikipedia with masked language modelling and next-sentence prediction. The 2018 paper launched the "pretrain once, fine-tune many" paradigm that every transformer in this book depends on.
For Part I, BERT-base is the canonical "encoder transformer you can fit in memory". One line loads it:
from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")
5.5.2 GPT-2: the decoder reference
GPT-2 (124M) is the small variant of the original GPT-2 family, with 12 decoder layers, 12 heads, 768-dim hidden state, and a 50k BPE vocabulary. It was pretrained on the 40 GB WebText corpus (OpenAI's curated Reddit-linked web pages) with the standard causal language modelling objective.
GPT-2 is the right teaching tool for decoder-style behavior: KV caching, autoregressive sampling, top-k and nucleus decoding, attention-mask handling. It is small enough to fine-tune fully on a single 6 GB GPU and large enough that decoding strategy actually matters. The larger checkpoints (355M, 774M, 1.5B) exist on the Hub and behave similarly with longer training runs. For modern decoder reference work, you would reach for SmolLM2-135M, Qwen3-0.6B, Llama-3.2 1B, or Gemma 3 270M (the late-2024 / 2025 small-model wave); for Part I, GPT-2 stays canonical because the literature anchors here.
For mechanistic-interpretability work specifically, the smallest useful checkpoints are TinyStories-1M / 33M (sub-coherent but reveal the entire training trajectory) and Pythia-14M / 31M (smallest Pythia checkpoints, used in 2024-25 circuits papers).
5.5.3 The other names you will see in passing
- DistilBERT (66M): a distilled BERT-base, 40% smaller and 60% faster with 97% of GLUE quality. Use it when you want BERT-base behaviour with a smaller footprint.
- RoBERTa-base: BERT-base re-trained with more data, longer schedule, and no next-sentence prediction. Modestly better than BERT-base on every GLUE task.
- T5-base: encoder-decoder, frames every task as text-to-text. The reference for "what an encoder-decoder transformer looks like at small scale".
- Pythia (14M to 12B): a fully-reproducible decoder series with intermediate checkpoints, used in Chapter 11 for mech-interp work.
- ModernBERT (Warner et al., 2024, arXiv:2412.13663): the 2024 BERT replacement, with 8192-token context, GLU activations, and RoPE positional embeddings. The right "modern encoder reference" for 2026 work where BERT-base shows its age.
5.5.4 Comparing the Part I reference models
| Model | Params | Type | Max seq len | Vocab |
|---|---|---|---|---|
| BERT-base | 110M | Encoder | 512 | 30k WordPiece |
| GPT-2 | 124M | Decoder | 1024 | 50k BPE |
| DistilBERT | 66M | Encoder (distilled) | 512 | 30k WordPiece |
| RoBERTa-base | 125M | Encoder | 512 | 50k BPE |
| T5-base | 220M | Encoder-decoder | 512 / 512 | 32k SentencePiece |
| ModernBERT-base | 149M | Encoder (2024) | 8192 | 50k BPE |
The frontier models of mid-2026 (the Claude 4 family, GPT-5 family, Gemini 2.5 Pro and successors) are trillion-parameter MoE systems behind APIs. You can neither run them locally nor inspect their weights. The Part I models trade away frontier quality for two superpowers: they fit on your laptop, and their weights are downloadable. Every interpretability, fine-tuning, and probing technique in this book was developed and validated against BERT-base or GPT-2 before being scaled. Skipping these models is like learning music theory without ever picking up a piano. The interpretability examples in Chapter 10 all start from these checkpoints.
Drop the block in Code Fragment 5.5.2 below into any notebook to verify your environment can pull from Hugging Face. If both lines print, the rest of Part I will work. If the Hub is rate-limited, set HF_HUB_OFFLINE=1 after the first successful load to force the cached copy.
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
bert_tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
bert = AutoModel.from_pretrained("google-bert/bert-base-uncased")
gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
print(f"BERT params: {sum(p.numel() for p in bert.parameters()):,}")
print(f"GPT-2 params: {sum(p.numel() for p in gpt2.parameters()):,}")
What's Next?
In the next section, Section 5.6: External Reading & Communities, we build on the material covered here.