Section 5.5: Models

Part I uses two pretrained reference models, BERT-base and GPT-2, and one untrained scaffold (the small transformer you build by hand in Chapter 3). The two reference models are no longer state-of-the-art, but that is exactly why they are useful: they are well-documented, small enough to fit on a 6 GB consumer GPU, and stable across thousands of tutorials. Every "obscure" behaviour you discover in them has been written about by someone, somewhere, and the explanation almost always still applies to the frontier models in Part II.

The aim of this section is not to enumerate every model on the Hugging Face Hub; that is Chapter 12's job. The aim is to lock in the few checkpoints that anchor the Part I exercises and to set the vocabulary you will need when frontier model cards write things like "encoder-style architecture derived from BERT" or "decoder-only transformer in the GPT-2 lineage".

**Figure 5.5.1**: BERT-base and GPT-2 small are nearly identical in scale (12 layers, 12 heads, 768 hidden dim) but use opposite attention masks. BERT's bidirectional attention serves classification and span-prediction; GPT-2's causal mask is what enables autoregressive generation. Knowing this pair fluently is the prerequisite for reading any modern model card.
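The difference between the two masks is easy to see in a toy example. The sketch below is an illustration only: it uses plain Python lists rather than the tensors the real models use, and the helper names (`bidirectional_mask`, `causal_mask`, `show`) are hypothetical, not part of any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of '#' (visible) and '.' (blocked)."""
    return "\n".join("".join("#" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print("bidirectional (BERT-style):")
    print(show(bidirectional_mask(4)))
    print("causal (GPT-2-style):")
    print(show(causal_mask(4)))
```

Row i of each mask lists what token i is allowed to see. The lower-triangular shape of the causal mask is what stops a decoder from looking at the tokens it is about to predict.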

5.5.1 BERT-base: the encoder reference

BERT-base (uncased) has 110M parameters, 12 layers, 12 heads, a 768-dim hidden state, and a 30k WordPiece vocabulary. It was pretrained on BookCorpus + English Wikipedia with masked language modelling and next-sentence prediction. The 2018 paper popularized the "pretrain once, fine-tune many" paradigm that every transformer in this book depends on.
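To make the masked language modelling objective concrete, here is a minimal sketch of the token-corruption step. It assumes the 15% selection rate and the 80/10/10 split (mask, random token, unchanged) described in the BERT paper, and it simplifies selection to an independent coin flip per token. The function name `mask_tokens` and the toy vocabulary are hypothetical; this is not the original training code.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere; the model is trained to predict only the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    toy_vocab = ["dog", "ran", "hat"]  # illustrative vocabulary only
    print(mask_tokens(sentence, toy_vocab, mask_prob=0.5, seed=1))
```

Because the model sees tokens on both sides of each masked position, this objective only makes sense with bidirectional attention, which is why it pairs with the encoder architecture.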

For Part I, BERT-base is the canonical "encoder transformer you can fit in memory". A few lines load the tokenizer and the model:

from transformers import AutoTokenizer, AutoModel
tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

Code Fragment 5.5.1a: Loading the BERT-base (uncased) tokenizer and encoder from the Hugging Face Hub.

5.5.2 GPT-2: the decoder reference

GPT-2 (124M) is the small variant of the original GPT-2 family, with 12 decoder layers, 12 heads, 768-dim hidden state, and a 50k BPE vocabulary. It was pretrained on the 40 GB WebText corpus (OpenAI's curated Reddit-linked web pages) with the standard causal language modelling objective.

GPT-2 is the right teaching tool for decoder-style behaviour: KV caching, autoregressive sampling, top-k and nucleus decoding, and attention-mask handling. It is small enough to fine-tune fully on a single 6 GB GPU and large enough that decoding strategy actually matters. The larger checkpoints (355M, 774M, 1.5B) are also on the Hub and behave similarly, at the cost of more memory and longer training runs. For modern decoder reference work, you would reach for SmolLM2-135M, Qwen3-0.6B, Llama-3.2 1B, or Gemma 3 270M (the late-2024 / 2025 small-model wave); for Part I, GPT-2 stays canonical because the literature anchors here.
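The two decoding strategies named above both work by truncating the next-token distribution before sampling. The following is a minimal sketch over a toy four-token distribution, written in plain Python so it runs without a GPU. The function names (`top_k_filter`, `top_p_filter`, `sample`) are hypothetical and the logits are made up for illustration; production libraries implement the same idea on tensors.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def top_p_filter(probs, p):
    """Nucleus filtering: keep the smallest top set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def sample(probs, rng):
    """Draw one token index according to the (filtered) probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    probs = softmax([2.0, 1.5, 0.3, -1.0])  # illustrative logits
    rng = random.Random(0)
    print("top-k (k=2):", top_k_filter(probs, 2))
    print("top-p (p=0.9):", top_p_filter(probs, 0.9))
    print("sampled index:", sample(top_p_filter(probs, 0.9), rng))
```

Top-k always keeps a fixed number of candidates, while nucleus filtering keeps more candidates when the distribution is flat and fewer when it is peaked. That adaptivity is the usual argument for preferring it.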

For mechanistic-interpretability work specifically, the smallest useful checkpoints are TinyStories-1M / 33M (sub-coherent but reveal the entire training trajectory) and Pythia-14M / 31M (smallest Pythia checkpoints, used in 2024-25 circuits papers).

5.5.3 The other names you will see in passing

DistilBERT (66M): a distilled BERT-base, 40% smaller and 60% faster with 97% of GLUE quality. Use it when you want BERT-base behaviour with a smaller footprint.
RoBERTa-base: BERT-base re-trained with more data, a longer schedule, and no next-sentence prediction. Generally a modest improvement over BERT-base across the GLUE tasks.
T5-base: encoder-decoder, frames every task as text-to-text. The reference for "what an encoder-decoder transformer looks like at small scale".
Pythia (14M to 12B): a fully-reproducible decoder series with intermediate checkpoints, used in Chapter 11 for mech-interp work.
ModernBERT (Warner et al., 2024, arXiv:2412.13663): the 2024 BERT replacement, with 8192-token context, GLU activations, and RoPE positional embeddings. The right "modern encoder reference" for 2026 work where BERT-base shows its age.

5.5.4 Comparing the Part I reference models

Table 5.5.1b: Reference checkpoints for Part I.

Model	Params	Type	Max seq len	Vocab
BERT-base	110M	Encoder	512	30k WordPiece
GPT-2	124M	Decoder	1024	50k BPE
DistilBERT	66M	Encoder (distilled)	512	30k WordPiece
RoBERTa-base	125M	Encoder	512	50k BPE
T5-base	220M	Encoder-decoder	512 / 512	32k SentencePiece
ModernBERT-base	149M	Encoder (2024)	8192	50k BPE
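The parameter counts in the table follow directly from the architecture dimensions. As a worked check, the sketch below rebuilds the BERT-base and GPT-2 totals by hand. The exact vocabulary sizes (30,522 and 50,257), the feed-forward width (3,072), and the position-table sizes are assumptions taken from the public checkpoint configurations rather than from this section, and the function names are hypothetical.

```python
def transformer_block_params(hidden, ffn):
    """Weights and biases in one layer: attention + feed-forward + 2 LayerNorms."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # scale + shift, twice
    return attention + feed_forward + layer_norms


def bert_base_params(vocab=30522, max_pos=512, type_vocab=2,
                     hidden=768, layers=12, ffn=3072):
    """Token, position, and segment embeddings + LayerNorm, 12 layers, pooler."""
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    pooler = hidden * hidden + hidden
    return embeddings + layers * transformer_block_params(hidden, ffn) + pooler


def gpt2_small_params(vocab=50257, max_pos=1024,
                      hidden=768, layers=12, ffn=3072):
    """Token and position embeddings, 12 layers, final LayerNorm."""
    embeddings = (vocab + max_pos) * hidden
    final_norm = 2 * hidden
    # The output head shares the token-embedding matrix, so it adds nothing.
    return embeddings + layers * transformer_block_params(hidden, ffn) + final_norm


if __name__ == "__main__":
    print(f"BERT-base: {bert_base_params():,}")   # 109,482,240
    print(f"GPT-2:     {gpt2_small_params():,}")  # 124,439,808
```

Both models spend the same 7,087,872 parameters per layer. The gap between 110M and 124M comes almost entirely from GPT-2's larger vocabulary and longer position table.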

Key Insight: Why these old models still matter

The frontier models of mid-2026 (the Claude 4 family, GPT-5 family, Gemini 2.5 Pro and successors) are very large proprietary systems, reportedly trillion-parameter MoE designs, served only behind APIs. You can neither run them locally nor inspect their weights. The Part I models trade away frontier quality for two superpowers: they fit on your laptop, and their weights are downloadable. Every interpretability, fine-tuning, and probing technique in this book was developed and validated against BERT-base or GPT-2 before being scaled. Skipping these models is like learning music theory without ever picking up a piano. The interpretability examples in Chapter 10 all start from these checkpoints.

Real-World Scenario: Loading both reference models

Drop the block in Code Fragment 5.5.2 below into any notebook to verify your environment can pull from Hugging Face. If both lines print, the rest of Part I will work. If the Hub is rate-limited, set HF_HUB_OFFLINE=1 after the first successful load to force the cached copy.

from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

bert_tok = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
bert = AutoModel.from_pretrained("google-bert/bert-base-uncased")

gpt2_tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

print(f"BERT params: {sum(p.numel() for p in bert.parameters()):,}")
print(f"GPT-2 params: {sum(p.numel() for p in gpt2.parameters()):,}")

Code Fragment 5.5.2: Environment check that loads both reference models from the Hugging Face Hub and prints their parameter counts.

What's Next?

In the next section, Section 5.6: External Reading & Communities, we build on the material covered here.

Further Reading

Foundational Models

Touvron, H., Martin, L., Stone, K., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Reference open-weight LLM family.

Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv:2310.06825. Reference for the Mistral architecture; the open-weight small-LM baseline.

DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Reference for the 2024-25 open-weight MoE architecture.

Model Hubs

Hugging Face (2024). "HF Hub Documentation." huggingface.co/docs/hub. The canonical reference for the open-weight model registry.