Section 3.4: Transformer: Training Loop, Shapes & Debugging

I built a Transformer from scratch and it predicted "the the the the." Honestly, some meetings feel the same way.
Norm, Repetitively Decoded AI Agent

Big Picture: Why Build a Transformer from Scratch?

Reading about attention heads and layer normalization is one thing; implementing them is another. This hands-on lab translates the architecture from Section 3.1 into working PyTorch code, building a character-level language model step by step. By the end, you will have internalized how data flows through embeddings, multi-head attention, and feed-forward layers. This concrete understanding pays dividends when you fine-tune models in Chapter 16 or debug inference issues in production.

Prerequisites

This section continues from Section 3.3. You should be comfortable with the Transformer architecture overview from Section 3.1 and with the scaled dot-product attention from Section 2.3.

This continuation of Section 3.3 picks up once the Transformer model is implemented and the data is ready to feed it. It wires up the full training loop, traces the tensor shapes that flow through every layer (the most useful thing to internalize when you start modifying these models), runs the lab end to end on a small character-level dataset, and catalogues the bugs that practically every from-scratch Transformer build hits along the way.

Note: Hands-On Implementation Lab

This section is a coding lab. By the end you will have a working character-level language model built on a decoder-only Transformer. Every line of code is explained. We encourage you to type the code yourself rather than copy-pasting; the act of typing builds muscle memory for these patterns.

See Figure 3.4.1 in Section 3.3 for the Transformer assembly workbench illustration.

3.4.1 The Training Loop

Note

Connecting the Pieces: Next-Token Prediction Is Classification

Next-token prediction is classification. At each position in the sequence, the model performs a V-way classification over the entire vocabulary, where V is the vocabulary size. The softmax loss from Section 0.1 applies directly here: we compare the model's predicted probability distribution over all possible next tokens against the one-hot target (the actual next token in the training data). This is why the code below uses F.cross_entropy to compute the loss, treating every position as an independent classification problem.
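As a minimal sketch of the reshape behind this loss (the sizes B=2, T=5, V=65 are illustrative assumptions, not values from the training run): flattening the (B, T, V) logits to (B*T, V) turns the batch into B*T independent classification examples. With equal logits the prediction is uniform, so the loss equals ln V.

```python
import math
import torch
import torch.nn.functional as F

B, T, V = 2, 5, 65                        # illustrative sizes, chosen for this sketch
logits = torch.zeros(B, T, V)             # equal logits -> uniform prediction over V classes
targets = torch.randint(0, V, (B, T))     # any targets give the same loss in the uniform case

# Flatten so that every (batch, position) pair is one classification example
loss = F.cross_entropy(logits.view(-1, V), targets.view(-1))

print(f"loss = {loss.item():.4f}, ln(V) = {math.log(V):.4f}")
```

A freshly initialized model should start close to this value, which makes it a handy reference point when reading the first line of a training log.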

# --- Complete training loop with warmup, gradient clipping, and periodic eval ---
import math
import torch
from torch.utils.data import DataLoader

def train(
    model,
    train_loader: DataLoader,
    val_loader: DataLoader,
    *,
    epochs: int = 5,
    base_lr: float = 3e-4,
    warmup_steps: int = 100,
    grad_clip: float = 1.0,
    eval_every: int = 200,
    device: str = "cuda",
):
    """A from-scratch training loop that demonstrates every piece you need:
    AdamW + warmup-then-cosine LR schedule, gradient clipping, and periodic
    validation. No Lightning, no Trainer, every line is yours to read.
    """
    model.to(device)
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=base_lr, betas=(0.9, 0.95), weight_decay=0.1
    )
    total_steps = epochs * len(train_loader)

    def lr_at(step: int) -> float:
        # Linear warmup, then cosine decay to 10% of base_lr
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return base_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))

    step = 0
    history = []
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)

            # Forward + cross-entropy over the vocabulary
            logits = model(x)                                 # (B, T, V)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)
            )

            # Backward + grad clipping + step
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            for g in optimizer.param_groups:
                g["lr"] = lr_at(step)
            optimizer.step()

            # Periodic validation
            if step % eval_every == 0:
                val_loss = evaluate(model, val_loader, device)
                history.append((step, loss.item(), val_loss))
                print(f"step {step:>5} | train {loss.item():.3f} | val {val_loss:.3f} | "
                      f"lr {lr_at(step):.2e}")
            step += 1

    # Final checkpoint
    torch.save(model.state_dict(), "mini_transformer.pt")
    return history

@torch.no_grad()
def evaluate(model, loader: DataLoader, device: str) -> float:
    """Compute mean cross-entropy on a held-out loader; returns avg loss."""
    model.eval()
    losses = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        losses.append(torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        ).item())
    model.train()
    return sum(losses) / len(losses)

# --- Run training ---
# train_loader, val_loader = build_loaders(...)   # from Section 3.2.3
# model = MiniTransformer(...)                    # from Section 3.2.2
# history = train(model, train_loader, val_loader, epochs=5)

Code Fragment 3.4.10: Complete training loop with learning rate warmup, gradient clipping, and periodic evaluation. The DataLoader handles batching and shuffling of character sequences.

Warning: Practical Tips

Gradient clipping (clip_grad_norm_ with max_norm=1.0) prevents training instability from occasional large gradient spikes. It is standard in most Transformer training pipelines.

AdamW (Adam with decoupled weight decay) is the optimizer of choice. The betas (0.9, 0.95) and weight_decay (0.1) follow common LLM training conventions. The learning rate 3e-4 works well for small models; larger models typically use lower rates with warmup schedules.
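To see what the warmup-then-cosine schedule does to the learning rate, here is a standalone copy of the lr_at helper from the training loop, tabulated at a few steps. The value total_steps=5000 is an assumption for illustration; in the training loop it is epochs * len(train_loader).

```python
import math

def lr_at(step, base_lr=3e-4, warmup_steps=100, total_steps=5000):
    """Linear warmup to base_lr, then cosine decay to 10% of base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))

# Start of warmup, mid-warmup, end of warmup, start/middle/end of the cosine phase
for s in (0, 50, 99, 100, 2550, 5000):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
```

The rate climbs from 1% of base_lr at step 0 to base_lr at the end of warmup, then decays smoothly to 10% of base_lr at the final step.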

3.4.2 Understanding the Shapes

Tracking tensor shapes is one of the most valuable debugging skills when working with Transformers. Here is a shape trace through the forward pass:

**Table 3.4.3:** *Tensor shapes at each stage of the forward pass (B = batch size, T = sequence length, d_k = d_model / n_heads).*
Variable	Shape	Description
`idx`	(B, T)	Input token indices
`token_emb(idx)`	(B, T, d_model)	Token embeddings
`pos_emb(positions)`	(T, d_model)	Positional embeddings (broadcast over B)
`x` after embedding	(B, T, d_model)	Sum of token + position embeddings
`qkv`	(B, T, 3*d_model)	Fused QKV projection output
`q, k, v` after reshape	(B, n_heads, T, d_k)	Per-head queries, keys, values
`scores`	(B, n_heads, T, T)	Attention scores (before masking)
`attn_weights`	(B, n_heads, T, T)	Attention probabilities (after masking and softmax)
`out` from attention	(B, T, d_model)	Concatenated head outputs after out_proj
`ffn` output	(B, T, d_model)	Feed-forward output
`logits`	(B, T, vocab_size)	Raw prediction scores for each position
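The shapes in the table can be checked by running a stripped-down forward pass. The sketch below is not the MiniTransformer class from the earlier sections: it uses bare PyTorch layers with illustrative sizes, a single attention step, and no layer norm or feed-forward block, purely to print the shape at each stage.

```python
import torch
import torch.nn as nn

B, T, d_model, n_heads, vocab_size = 2, 16, 128, 4, 65   # illustrative sizes
d_k = d_model // n_heads

token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(T, d_model)
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
out_proj = nn.Linear(d_model, d_model, bias=False)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

idx = torch.randint(0, vocab_size, (B, T))               # (B, T)
x = token_emb(idx) + pos_emb(torch.arange(T))            # (B, T, d_model); positions broadcast over B
qkv = qkv_proj(x)                                        # (B, T, 3*d_model)
q, k, v = (t.view(B, T, n_heads, d_k).transpose(1, 2)    # each (B, n_heads, T, d_k)
           for t in qkv.chunk(3, dim=-1))
scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # (B, n_heads, T, T)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))    # True where attention is allowed
attn_weights = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
out = (attn_weights @ v).transpose(1, 2).reshape(B, T, d_model)
out = out_proj(out)                                      # (B, T, d_model)
logits = lm_head(x + out)                                # (B, T, vocab_size)

for name, t in [("idx", idx), ("x", x), ("qkv", qkv), ("q", q), ("scores", scores),
                ("attn_weights", attn_weights), ("out", out), ("logits", logits)]:
    print(f"{name:<13} {tuple(t.shape)}")
```

Dropping similar print statements (or assert statements on shapes) into your own forward method is usually the fastest way to locate a mismatched reshape or transpose.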

Key Insight: The T x T Attention Matrix

The attention scores have shape (B, n_heads, T, T). This is where the quadratic cost of attention lives. For T=128, this is 128 × 128 = 16,384 entries per head per example. For T=4096 (a moderate context window), that grows to 4096 × 4096 = 16,777,216 entries (about 16.8 million), a 1,024-fold increase for a 32-fold longer sequence. Section 3.5 covers techniques to reduce this cost.
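The arithmetic is easy to reproduce. The helper below is a small sketch (the function name and the defaults of 4 heads, batch size 1, and 4-byte fp32 entries are assumptions for illustration) that reports the entries per head and the memory needed to materialize the full score tensor.

```python
def attention_matrix_cost(T, n_heads=4, batch=1, bytes_per_entry=4):
    """Entries per head and total bytes for the (B, n_heads, T, T) score tensor."""
    entries_per_head = T * T
    total_bytes = batch * n_heads * entries_per_head * bytes_per_entry
    return entries_per_head, total_bytes

for T in (128, 1024, 4096):
    entries, nbytes = attention_matrix_cost(T)
    print(f"T={T:>5}: {entries:>12,} entries/head, {nbytes / 2**20:,.2f} MiB for 4 heads in fp32")
```

Doubling T quadruples both numbers, which is why long contexts are dominated by this one tensor unless the attention kernel avoids materializing it.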

Lab 3.4.6: Running the Transformer Lab

3.2.6.1 Getting Data

Download a small text file for training. The tiny Shakespeare dataset (~1.1 MB, a single file of concatenated excerpts from Shakespeare's plays) is the classic choice.

# Download the tiny Shakespeare dataset
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")

Code Fragment 3.4.11: Download the tiny Shakespeare dataset.

3.2.6.2 Training

With the dataset prepared, we can now train the model and observe the loss decreasing over several thousand steps.

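# Train with default settings; model and loaders are built as in Code Fragment 3.4.10
# (epochs=5 is the default; adjust it so the run covers roughly 5000 optimizer steps)
history = train(model, train_loader, val_loader, epochs=5)
# Representative training loss over roughly 5000 steps (final loss around 1.4-1.5).
# train() as defined above also logs validation loss and the learning rate, so your
# log lines and exact numbers will differ:
# step 0 | loss 4.1742 | time 0.0s
# step 500 | loss 1.9831 | time 12.3s
# step 1000 | loss 1.6524 | time 24.7s
# ...
# step 5000 | loss 1.4208 | time 123.5s

Code Fragment 3.4.12: Training the model and a representative loss trajectory. The starting loss of about 4.17 matches ln(65), the loss of a uniform guess over a 65-character vocabulary.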

3.2.6.3 Evaluating the Output

After training, the model will generate text that resembles the style of the training data. At ~5000 steps with our small model, you should see recognizable words, approximate sentence structure, and character-level patterns that match the training corpus. The text will not be coherent, but it should clearly be "trying" to produce English in the style of the training data.

3.2.6.4 Experiments to Try

Increase n_layers from 4 to 6 or 8. Does the loss improve? How much slower is training?
Increase d_model from 128 to 256. Compare the parameter count and training speed.
Remove positional embeddings entirely. What happens to the generated text?
Switch to SwiGLU (replace FeedForward with SwiGLUFeedForward). Does the loss curve change?
Remove the causal mask. The model can now "cheat" by looking at future tokens. What happens to the training loss? What happens to generation quality?
Try temperature values of 0.5, 1.0, and 1.5 during generation. Observe the diversity/quality tradeoff.
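For the temperature experiment, it helps to see what the scaling does to a distribution before running generation. The sketch below uses made-up logits for four candidate characters (an assumption for illustration) and plain Python rather than the model's own sampling code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                   # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 0.0]                         # made-up logits for four candidates
for temp in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, temp)
    print(f"temperature {temp}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top candidate (safer, more repetitive text); higher temperatures flatten the distribution (more diverse, more errors).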

Library Shortcut

Pretrained GPT-2 with Hugging Face Transformers

You just built a character-level Transformer from scratch and trained it for thousands of steps. In practice, you can load a pretrained GPT-2 model (trained on 40 GB of internet text) and generate high-quality text immediately:


from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("To be, or not to be", return_tensors="pt")  # input_ids + attention_mask
# Sample 50 new tokens; GPT-2 has no pad token, so EOS is reused for padding
output = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Code Fragment 3.4.1: Minimal working example using Pretrained GPT-2.

To run this example, install the dependencies with pip install transformers torch. The from-scratch lab teaches you what is inside this black box. Every layer, projection, and mask you implemented is present in the pretrained model; the library simply wraps it in a convenient API with optimized CUDA kernels and a pretrained checkpoint.

3.4.3 Common Bugs and Debugging

Fun Fact

One of the most common Transformer implementation bugs is getting the attention mask wrong. It is also among the hardest to notice, because a model with a subtly broken mask will still train, still produce output, and still reduce its loss. It will just quietly produce nonsense that looks plausible. Debugging Transformers is an exercise in distrusting success.

When implementing Transformers from scratch, certain bugs appear repeatedly. Here are the most common ones and how to detect them:

Table 3.4.1a: Common from-scratch Transformer bugs by symptom, likely cause, and fix (as of 2026).

Symptom	Likely Cause	Fix
Loss stays flat at ~ln(vocab_size)	Gradients are not flowing; possible shape mismatch or detached computation	Check that no `.detach()` calls break the computation graph. Verify loss computation.
Loss drops fast then NaN	Learning rate too high or no gradient clipping	Add gradient clipping (max_norm=1.0). Reduce learning rate. Check for missing layer norm.
Generated text is repetitive gibberish	Missing or incorrect causal mask	Verify the mask is lower-triangular and correctly applied before softmax.
Generated text is random characters	Insufficient training or broken positional encoding	Train longer. Verify pos_emb is added, not concatenated.
All generated tokens are the same	Temperature too low or top_k=1	Increase temperature. Use top_k > 1 or remove top_k filtering.

Note: Debugging Tip: Overfit a Tiny Dataset First

Before training on the full dataset, verify your model can overfit a single batch. Take one batch of data and train for 100 steps. The loss should drop to near zero. If it does not, there is a bug in your model or training loop. This simple sanity check saves hours of debugging.
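A minimal sketch of this sanity check follows. To keep it self-contained it uses a hypothetical stand-in model (an embedding plus a linear head) and synthetic targets instead of MiniTransformer and the real loader; with your own model, replace those two pieces and keep the loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, B, T = 10, 4, 8                                   # toy sizes for this sketch

# Hypothetical stand-in for MiniTransformer: maps (B, T) token ids to (B, T, V) logits
model = nn.Sequential(nn.Embedding(V, 32), nn.Linear(32, V))
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

x = torch.randint(0, V, (B, T))                      # one fixed batch, reused on every step
y = (x + 1) % V                                      # synthetic targets the model can memorize

losses = []
for step in range(100):
    logits = model(x)
    loss = F.cross_entropy(logits.view(-1, V), y.view(-1))
    opt.zero_grad(set_to_none=True)
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f} -> last loss {losses[-1]:.3f}")
```

The important detail is that x and y are fetched once, outside the loop. If the loss on this fixed batch does not fall well below its starting value, look for a wiring bug before touching hyperparameters.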

Tip: Label Smoothing as a Cheap Regularizer

The standard cross-entropy loss applied to a one-hot target distribution pushes the model toward placing all probability mass on the single correct token. For a vocabulary of 50K tokens, the optimizer happily drives the predicted logit for the correct token toward infinity and every other logit toward minus infinity. The model becomes overconfident: it stops hedging, its predicted probabilities no longer reflect real uncertainty, and small distribution shifts at evaluation time hurt disproportionately.

Label smoothing (Szegedy et al., 2016; used in the original Transformer with $\epsilon = 0.1$) replaces the one-hot target with a soft target: a mixture of the one-hot vector (weight $1 - \epsilon$) and a uniform distribution over all $V$ classes (weight $\epsilon$). The correct class therefore receives $1 - \epsilon + \epsilon/V$ and every other class receives $\epsilon/V$. The cross-entropy minimum is no longer at $\pm\infty$ logits; it sits at a finite spread that matches the smoothed target, so the gradient pushes back once the model becomes more confident than the smoothed target allows. The practical effect is a higher training loss and, as Vaswani et al. report, worse perplexity, in exchange for better accuracy and BLEU in their translation experiments and typically better calibration (the model's reported probability tracks accuracy more honestly). PyTorch exposes this as nn.CrossEntropyLoss(label_smoothing=0.1). Label smoothing remains common in machine translation, classification with very large vocabularies, and any setting where the downstream consumer cares about probability calibration, not just argmax accuracy.
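The following sketch spells out the smoothed target by hand and compares it with PyTorch's built-in option. It assumes PyTorch's convention, in which the smoothing mass is spread uniformly over all V classes (so the correct class ends up with 1 − ε + ε/V); the sizes and random logits are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, V, eps = 6, 65, 0.1                               # illustrative sizes
logits = torch.randn(N, V)
targets = torch.randint(0, V, (N,))

# Built-in: cross-entropy with label smoothing
builtin = F.cross_entropy(logits, targets, label_smoothing=eps)

# Manual: mix the one-hot target with a uniform distribution over all V classes
one_hot = F.one_hot(targets, V).float()
soft_target = (1 - eps) * one_hot + eps / V
manual = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

print(f"built-in {builtin.item():.6f} | manual {manual.item():.6f}")
```

The two values agree up to floating-point rounding, which confirms that label smoothing is nothing more than ordinary cross-entropy against a slightly softened target.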

Tip: Use FlashAttention When Available

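If your GPU supports it (Ampere or newer), use a FlashAttention kernel through torch.nn.functional.scaled_dot_product_attention (PyTorch 2.0+), which selects a fused backend when the device and dtype allow it. FlashAttention computes attention in tiles without materializing the full T × T matrix, reducing attention memory from O(n²) to O(n) and often doubling throughput, with outputs that match standard attention up to floating-point rounding.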
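As a quick check that the fused call is a drop-in replacement, the sketch below compares it against an explicit causal attention on random tensors (sizes are illustrative). Which backend PyTorch picks depends on the device and dtype; on CPU this runs a non-Flash fallback, so the sketch demonstrates equivalence of results, not speed.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_heads, T, d_k = 2, 4, 16, 32                    # illustrative sizes
q, k, v = (torch.randn(B, n_heads, T, d_k) for _ in range(3))

# Reference: explicit scores -> causal mask -> softmax -> weighted sum of values
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
manual = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1) @ v

# Fused call (PyTorch 2.0+); is_causal=True applies the lower-triangular mask internally
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("max abs difference:", (manual - fused).abs().max().item())
```

In a from-scratch attention module, this means the score, mask, softmax, and weighted-sum lines can be replaced by the single fused call once the manual version is verified.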

Key Insight: Implementation as Understanding

Richard Feynman famously said, "What I cannot create, I do not understand." Building a Transformer from scratch is more than a pedagogical exercise; it reveals how seemingly abstract mathematical ideas (softmax attention, layer normalization, residual connections) interact at a concrete level. Subtle implementation choices, such as whether to apply layer norm before or after attention (pre-norm vs. post-norm), whether to tie input and output embeddings, or how to scale initialization with depth, can determine whether a model trains stably or diverges. The original "Attention Is All You Need" paper uses post-norm and does share its embedding and output weights, but pre-norm ordering and depth-scaled initialization are not described there; they emerged from years of engineering practice. This gap between the mathematical specification of an algorithm and its practical implementation is a recurring theme in machine learning, and it is why reproducibility remains one of the field's persistent challenges (see Section 42.2 on evaluation methodology).

Real-World Scenario

Hyperparameter Sensitivity in Small Transformers

Who: A graduate student training a character-level Transformer for a class project on poetry generation.

Situation: Following a tutorial similar to this section, the student trained a 4-layer, 4-head Transformer on a corpus of English poetry (2 MB). The model trained without errors but generated text that was grammatically incoherent, mixing fragments from different styles.

Problem: The student assumed the model was too small (1.6M parameters) and tried doubling every dimension: 8 layers, 8 heads, d_model=256. The larger model overfit severely, memorizing training poems verbatim while producing gibberish on new prompts.

Dilemma: Should they get more training data, add regularization, or reconsider the architecture entirely?

Decision: A mentor suggested keeping the original 4-layer architecture but adjusting three things: increasing dropout from 0.0 to 0.1, switching from a fixed learning rate to cosine decay, and increasing the context length (block_size) from 64 to 256 so the model could see full stanzas.

How: They used the same training code from this section with those three changes, training for 10,000 steps instead of 5,000.

Result: The small model with proper regularization and context length outperformed the large model. Generated text maintained consistent style within passages and showed recognizable poetic structure. Validation loss improved from 2.1 to 1.6.

Lesson: For small-scale Transformer experiments, tuning dropout, learning rate schedule, and context length matters more than adding layers or heads. Scale up the architecture only after exhausting these simpler knobs.

Research Frontier

Reference implementations continue to improve accessibility. Andrej Karpathy's nanoGPT remains a popular educational resource. Meta's torchtune provides production-quality implementations for fine-tuning. LitGPT (Lightning AI) offers clean implementations of 20+ architectures. For hardware-optimized training, DeepSpeed and FSDP (Fully Sharded Data Parallel) in PyTorch handle multi-GPU distribution automatically. Understanding from-scratch implementations remains valuable for debugging, customization, and understanding what frameworks abstract away.

Key Takeaways

A decoder-only Transformer can be implemented in ~300 lines of clear, modular PyTorch.
The architecture has four main components: embeddings, causal self-attention, feed-forward networks, and layer normalization (Section 3.1), all connected by residual connections.
Fused QKV projections and weight tying are standard efficiency tricks with no loss in model quality.
Careful initialization (especially scaling residual projections) is critical for stable training.
Gradient clipping, AdamW with appropriate hyperparameters, and the Pre-LN ordering are standard practice.
Tracking tensor shapes through the forward pass is the single most effective debugging technique.

Self-Check

1. Why do we combine Q, K, V into a single linear projection rather than using three separate layers?

Answer

A single large matrix multiply (d_model to 3*d_model) is more GPU-efficient than three smaller ones (d_model to d_model each). The GPU better utilizes its parallelism with larger matrices. The result is mathematically identical.

2. What does weight tying do and why is it beneficial?

Answer

Weight tying shares the token embedding matrix with the output projection matrix. Both map between d_model-dimensional space and vocabulary space. Sharing them reduces parameter count by vocab_size * d_model parameters and provides a useful inductive bias: tokens with similar embeddings will have similar output logits.

3. Why is the final LayerNorm necessary in a Pre-LN Transformer?

Answer

In Pre-LN ordering, each sub-layer normalizes its input but not its output. The residual connection adds the (unnormalized) sub-layer output back to the stream. After the last block, the residual stream has not been normalized. The final LayerNorm ensures the representations have stable statistics before the output projection to vocabulary logits.

4. What would happen if you removed the causal mask during training?

Answer

Without the causal mask, each position can attend to future tokens. The training loss would drop quickly because the model can "cheat" by looking ahead. However, at generation time future tokens do not exist yet, so the model would produce poor output. The causal mask ensures training conditions match inference conditions.

5. The attention scores tensor has shape (B, n_heads, T, T). What does each element represent?

Answer

Element [b, h, i, j] is the (unnormalized) attention score from position i (the query) to position j (the key) in head h of example b. After softmax over the last dimension, it becomes the weight that position j's value contributes to the output at position i. The causal mask sets entries where j > i to negative infinity.

Exercises

Exercise 4.2.1: Mini-Transformer Parameter Audit (Calculation)

Compute the parameter count for the mini-transformer with the hyperparameters in Section 3.2.1 (vocab=65, block=128, n_layers=4, n_heads=4, d_model=128, d_ff=512, no bias, weight tying). Break it into (a) embeddings + positional, (b) per-block, (c) total. Compare to the chapter's claim of "~1.6M parameters."

Answer Sketch

(a) Token embedding: 65 × 128 = 8,320; positional embedding: 128 × 128 = 16,384; total = 24,704 (output projection shares the 8,320 with token embedding via weight tying). (b) Per block: attention QKV combined: 128 × (3 × 128) = 49,152; output projection: 128 × 128 = 16,384; FFN W1: 128 × 512 = 65,536; FFN W2: 512 × 128 = 65,536; LayerNorm scale (no bias): 2 × 128 = 256; per-block total = 196,864. (c) 4 blocks = 787,456; final LayerNorm = 128. Grand total = 24,704 + 787,456 + 128 = 812,288 (~0.81M). The chapter's "~1.6M" does not match this configuration. Biases cannot explain the gap, since enabling them would add only a few thousand parameters, so the figure must come from some other config detail. The exercise's value is in showing where the parameters live: the FFN holds 8d² of the roughly 12d² weights in each block, which works out to about 65% of all parameters in this model (524,288 of 812,288), with attention holding about 32%.
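The audit is mechanical enough to script. The function below is a sketch written for this exercise (its name and return format are not from the chapter's code) and assumes the stated configuration: no biases, tied output projection, and scale-only LayerNorm.

```python
def count_params(vocab=65, block=128, n_layers=4, d_model=128, d_ff=512):
    """Parameter audit for a bias-free, weight-tied decoder-only Transformer."""
    embeddings = vocab * d_model + block * d_model          # token + positional tables
    attention = d_model * 3 * d_model + d_model * d_model   # fused QKV + output projection
    ffn = d_model * d_ff + d_ff * d_model                   # W1 + W2
    norms = 2 * d_model                                     # two LayerNorm scales per block
    per_block = attention + ffn + norms
    total = embeddings + n_layers * per_block + d_model     # + final LayerNorm scale
    return {
        "embeddings": embeddings,
        "per_block": per_block,
        "total": total,
        "ffn_share": n_layers * ffn / total,
        "attention_share": n_layers * attention / total,
    }

audit = count_params()
print(audit)
```

Comparing audit["total"] with sum(p.numel() for p in model.parameters()) on the real model is a quick way to confirm that the configuration you think you built is the one you actually built.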

Exercise 4.2.2: Diagnosing "Loss Stays at ln(vocab_size)" (Analysis)

The debug table notes that a loss flat at $\ln(\text{vocab\_size})$ indicates broken gradient flow. (a) For vocab_size=65, what is this baseline loss numerically? (b) What does this baseline correspond to in terms of the model's predictions? (c) Apart from .detach(), list two other implementation bugs that would produce this exact symptom.

Answer Sketch

(a) $\ln(65) \approx 4.174$. (b) This is the cross-entropy of a uniform distribution over 65 classes, meaning the model is predicting equal probability for every character. The model has not learned anything that distinguishes the true next character from random. (c) Other bugs that produce the same symptom: (1) Forgetting to call loss.backward() before optimizer.step(), so every .grad is None and the step does nothing. (2) Building the optimizer over parameters the model no longer uses (for example, constructing it and then replacing or re-creating layers), so updates go to tensors that never affect the forward pass. (3) Initializing all weights to zero, which removes the symmetry breaking needed for learning. Note that wrapping the forward pass in torch.no_grad() normally makes loss.backward() raise an error instead of failing silently, and model.eval() on its own does not block gradients. These bugs share one diagnostic: log the global gradient norm on every step (clip_grad_norm_ returns it); if it is zero, or the gradients are None, gradients are not reaching the parameters.
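A gradient-norm check along these lines can be written in a few lines. The helper name global_grad_norm and the tiny linear model are assumptions for illustration; the helper treats parameters whose .grad is None as contributing nothing.

```python
import torch
import torch.nn as nn

def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients; 0.0 if no parameter has a gradient."""
    squared = 0.0
    for p in model.parameters():
        if p.grad is not None:
            squared += p.grad.detach().pow(2).sum().item()
    return squared ** 0.5

model = nn.Linear(4, 3)                               # hypothetical stand-in model
print("before backward:", global_grad_norm(model))    # no gradients exist yet
loss = model(torch.ones(2, 4)).sum()
loss.backward()
print("after backward: ", global_grad_norm(model))    # now nonzero
```

Calling this right after loss.backward() in the training loop separates "gradients never arrive" (norm stays at zero) from "gradients arrive but the update has no effect" (norm is healthy, loss still flat).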

Exercise 4.2.3: Code Tweak: Tie Embeddings (Coding)

Suppose the mini-transformer's MiniTransformer class currently has separate self.tok_emb = nn.Embedding(vocab_size, d_model) and self.lm_head = nn.Linear(d_model, vocab_size, bias=False). Sketch the change needed to implement weight tying. Then explain what subtle pitfall to avoid when saving/loading checkpoints.

Answer Sketch

The code change is one line in the constructor: self.lm_head.weight = self.tok_emb.weight (assign the embedding's weight parameter to the head). The shapes already agree, since both are (vocab_size, d_model). This makes the two modules share the same underlying tensor; updates to one update both. Pitfall: state_dict() stores tok_emb.weight and lm_head.weight as separate keys even though they share storage. A plain load_state_dict copies values into the existing parameters, so tying survives as long as the freshly constructed model already contains the tying line. It breaks when tying was applied outside the constructor and is not re-applied to the new model, or when the loading code replaces parameter objects instead of copying into them. A cheap guard after loading is assert model.lm_head.weight is model.tok_emb.weight, re-tying if it fails. An untied model carries vocab_size × d_model extra parameters (8,320 for this configuration), and the two matrices drift apart as soon as training resumes. The bug is silent and only shows up when comparing logits or parameter counts against the training-time reference.
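The sketch below shows the tying line in context and checks it after a state-dict round trip. TinyLM is a hypothetical stand-in that keeps only the embedding and the head; it is not the chapter's MiniTransformer.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for MiniTransformer: just the embedding and the head."""
    def __init__(self, vocab_size=65, d_model=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight     # weight tying: one shared Parameter

    def forward(self, idx):
        return self.lm_head(self.tok_emb(idx))

model = TinyLM()
n_params = sum(p.numel() for p in model.parameters())  # the shared tensor is counted once
print(n_params, sorted(model.state_dict().keys()))

# Round trip through a state dict into a freshly constructed (already tied) model
reloaded = TinyLM()
reloaded.load_state_dict(model.state_dict())
assert reloaded.lm_head.weight is reloaded.tok_emb.weight, "tying was lost on load"
```

Keeping the final assert in your checkpoint-loading code costs nothing and turns a silent divergence into an immediate failure.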

Exercise 4.2.4: Sanity-Check Overfitting a Single Batch (Predictive)

The "Debugging Tip" recommends overfitting a single batch as a sanity check. (a) For the mini-transformer with vocab=65, d_model=128, what loss would you expect after 100 steps of overfitting? (b) Why would a model that fails to overfit a single batch also fail to train on the full dataset? (c) Sketch the smallest possible PyTorch loop (5-7 lines) that performs this overfitting test.

Answer Sketch

(a) Loss should fall steadily toward zero, because the model has more than enough capacity to memorize a single batch of 64 sequences × 128 tokens = 8,192 character predictions with its ~0.8M parameters. How close it gets in exactly 100 steps depends on the learning rate; if it plateaus far above zero (e.g., at 1-2) even after a few hundred steps, there is a bug. (b) A model that cannot fit one batch certainly cannot fit thousands of batches drawn from a more diverse distribution. The single-batch test isolates "is the model + optimizer wired correctly?" from "is the model big enough for the data?" If overfitting fails, the bug is architectural or numeric, not data-related. (c) Fetch one batch with x, y = next(iter(loader)), then repeat 100 times: logits = model(x); loss = F.cross_entropy(logits.view(-1, V), y.view(-1)); opt.zero_grad(); loss.backward(); opt.step(); print(loss.item()). The key is reusing the same x, y on every step rather than fetching new batches.

Exercise 4.2.5: Why Combine QKV Projection? (Predictive)

The chapter explains that fusing Q, K, V into one large matrix multiplication is more GPU-efficient. (a) For d_model=4096, what is the size of the fused projection matrix versus three separate matrices? (b) Why does GPU performance favor one large matmul over three small ones, given the FLOP count is identical? (c) When would you NOT fuse them?

Answer Sketch

(a) Fused: a single 4096 × 12288 matrix (~50.3M parameters); three separate: each 4096 × 4096 (~16.8M each, total 50.3M). Same parameter count, same total FLOPs. (b) At these sizes, GPU performance is often limited by memory traffic and kernel launch overhead as much as by raw FLOPs. One large matmul reads the input x from GPU memory once and amortizes the launch cost (on the order of microseconds) over more compute. Three separate matmuls read x three times and pay the launch cost three times. The resulting speedup depends on hardware and matrix sizes, and it is most noticeable when the individual matmuls are small. (c) You would NOT fuse them when (1) Q, K, V are computed from different inputs, as in cross-attention where Q comes from the decoder and K, V from the encoder; (2) you want different dropout patterns on Q vs K vs V; or (3) you want to keep the code simple under grouped-query attention (Section 3.2 of the larger book), where K and V have fewer heads than Q, so a fused output can no longer be split into three equal chunks and needs unequal split sizes instead.
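The claim that fusing changes only the execution, not the result, can be checked directly. In this sketch (small illustrative sizes, bias-free layers as in the mini-transformer) three separate projections are built from row-slices of the fused weight and compared with the chunks of the fused output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, d_model = 2, 8, 64                             # small illustrative sizes
x = torch.randn(B, T, d_model)

fused = nn.Linear(d_model, 3 * d_model, bias=False)  # weight shape: (3*d_model, d_model)

# Build three separate layers from row-slices of the fused weight
separate = []
for w in fused.weight.chunk(3, dim=0):
    layer = nn.Linear(d_model, d_model, bias=False)
    with torch.no_grad():
        layer.weight.copy_(w)
    separate.append(layer)

q_f, k_f, v_f = fused(x).chunk(3, dim=-1)            # one matmul, then split
q_s, k_s, v_s = (layer(x) for layer in separate)     # three matmuls

same = all(torch.allclose(a, b, atol=1e-5)
           for a, b in [(q_f, q_s), (k_f, k_s), (v_f, v_s)])
n_fused = fused.weight.numel()
n_separate = sum(layer.weight.numel() for layer in separate)
print("outputs match:", same, "| params:", n_fused, "vs", n_separate)
```

Because the outputs and parameter counts match, the choice between the two layouts is purely about speed and code clarity.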

Exercise 4.2.6: Mask Bug Failure Mode (Analysis)

The "fun fact" warns that mask bugs are silent: training continues, loss decreases, but generation produces nonsense. (a) Construct a specific example: if the causal mask is accidentally set to upper-triangular instead of lower-triangular, what does each output position attend to? (b) Why does the model still train (loss still decreases)? (c) What test would catch this bug deterministically?

Answer Sketch

(a) An upper-triangular mask (torch.triu where torch.tril was intended, diagonal kept) blocks the past and exposes the future: position 0 attends to positions 0..T-1, position 1 attends to 1..T-1, ..., position T-1 attends only to itself. The model is doing "anti-causal" attention. (If the diagonal were masked as well, the last row would be fully masked and softmax would return NaN, so that variant fails loudly instead of silently.) (b) Loss still decreases because the same mask is used in every training forward pass: the target for position i is token i+1, which is now directly visible to position i, so the model learns to copy it. The training task has become trivially easy. The bug only manifests at generation time, when future tokens do not exist and the newest position can attend only to itself. (c) Deterministic test: compute attention weights over a sequence and verify that position 0 has weight 1.0 on position 0 (only itself), position 1 distributes weight over positions 0 and 1, and so on. Any non-zero weight on a future position fails the test. A second test works on the model's outputs: change only the last token of an input and check that the logits at all earlier positions are unchanged. A third, coarser signal: if the teacher-forced training loss is far lower than the loss measured when each prediction sees only its prefix, the mask is wrong.
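A perturbation test of this kind fits in a short script. The sketch below uses a single attention head with identity projections (an assumption that keeps it self-contained); the function names are made up for illustration. With a full model, the same idea applies to the logits instead of the attention output.

```python
import torch

def self_attention(x, mask):
    """Single-head attention with identity projections; mask[i, j]=True lets i attend to j."""
    scores = x @ x.transpose(-2, -1) / x.size(-1) ** 0.5
    weights = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
    return weights @ x

def leaks_future(mask, T=6, d=8):
    """Change only the last token and report whether any earlier output moves."""
    torch.manual_seed(0)
    x = torch.randn(1, T, d)
    x_changed = x.clone()
    x_changed[0, -1] += 1.0                          # perturb only the final position
    out_a = self_attention(x, mask)
    out_b = self_attention(x_changed, mask)
    return not torch.allclose(out_a[0, :-1], out_b[0, :-1])

T = 6
ones = torch.ones(T, T, dtype=torch.bool)
print("tril mask leaks future:", leaks_future(torch.tril(ones)))   # correct causal mask
print("triu mask leaks future:", leaks_future(torch.triu(ones)))   # buggy, exposes the future
```

The test is deterministic and takes milliseconds, so it is worth keeping as a unit test next to the attention module.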

What's Next?

In the next section, Section 3.5: Transformer Variants & Efficiency, we continue building on the topics covered here.

Further Reading

Core Architecture Papers

Vaswani, A. et al. (2017). "Attention Is All You Need." NeurIPS 2017. The original Transformer paper. Every component implemented in this section traces back to this work. Read at least Sections 3.1 through 3.3 to see how the authors describe the architecture you just built from scratch.

Radford, A. et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI. The first paper to demonstrate that a decoder-only Transformer (the architecture built in this section) could be pretrained and fine-tuned for many downstream tasks. This is the GPT-1 paper that launched the decoder-only paradigm.

Press, O. & Wolf, L. (2017). "Using the Output Embedding to Improve Language Models." EACL 2017. Establishes the weight tying technique used in this implementation, showing it improves perplexity while reducing parameter count. A short, elegant paper worth reading in full.

Implementation Guides & Tutorials

Karpathy, A. (2023). "nanoGPT." A minimal, well-documented GPT implementation in PyTorch. The design patterns in this section follow a similar philosophy of clarity over abstraction. Clone it and compare it to the code you wrote here.

Rush, A. (2018). "The Annotated Transformer." A line-by-line walkthrough of the original encoder-decoder Transformer in PyTorch. This complements this section's decoder-only focus by showing the full encoder-decoder variant, including cross-attention.

Training Practices

Xiong, R. et al. (2020). "On Layer Normalization in the Transformer Architecture." ICML 2020. Analyzes Pre-LN vs. Post-LN placement and explains why Pre-LN (used in this implementation) leads to more stable training. Essential reading if you want to understand why the layer norm position matters so much.

Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. Introduces AdamW, the optimizer used in the training loop of this section. Explains why decoupled weight decay outperforms L2 regularization with Adam, a subtle but important distinction for training stability.

Clark, K., Khandelwal, U., Levy, O., & Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP, ACL 2019. Documents how individual BERT attention heads specialize in syntactic roles, local context, and coreference, providing empirical grounding for why multi-head attention outperforms single-head attention. The head visualization methodology is highly accessible and directly relevant to understanding why the multi-head design in this section's implementation works as well as it does.

Zhang, B. & Sennrich, R. (2019). "Root Mean Square Layer Normalization." NeurIPS 2019. Introduces RMSNorm, the simplified normalization variant used in LLaMA, Mistral, and most modern LLMs. Directly relevant to the LayerNorm discussion in this section, and to the Pre-LN vs. Post-LN analysis in Section 3.3.