I built a Transformer from scratch and it predicted "the the the the." Honestly, some meetings feel the same way.
Norm, Repetitively Decoded AI Agent
Reading about attention heads and layer normalization is one thing; implementing them is another. This hands-on lab translates the architecture from Section 3.1 into working PyTorch code, building a character-level language model step by step. By the end, you will have internalized how data flows through embeddings, multi-head attention, and feed-forward layers. This concrete understanding pays dividends when you fine-tune models in Chapter 16 or debug inference issues in production.
Prerequisites
This section continues from Section 3.3. You should be comfortable with the Transformer architecture overview from Section 3.1 and with the scaled dot-product attention from Section 2.3.
This continuation of Section 3.3 picks up once the Transformer model is implemented and the data is ready to feed it. It wires up the full training loop, traces the tensor shapes that flow through every layer (the most useful thing to internalize when you start modifying these models), runs the lab end to end on a small character-level dataset, and catalogues the bugs that practically every from-scratch Transformer build hits along the way.
This section is a coding lab. By the end you will have a working character-level language model built on a decoder-only Transformer. Every line of code is explained. We encourage you to type the code yourself rather than copy-pasting; the act of typing builds muscle memory for these patterns.
See Figure 3.4.1 in Section 3.3 for the Transformer assembly workbench illustration.
3.4.1 The Training Loop
Next-token prediction is classification. At each position in the sequence, the model
performs a V-way classification over the entire vocabulary, where V is the vocabulary size.
The softmax loss from Section 0.1 applies directly here: we compare the model's
predicted probability distribution over all possible next tokens against the one-hot target
(the actual next token in the training data). This is why the code below uses
F.cross_entropy to compute the loss, treating every position as an independent
classification problem.
# --- Complete training loop with warmup, gradient clipping, and periodic eval ---
import math

import torch
from torch.utils.data import DataLoader


def train(
    model,
    train_loader: DataLoader,
    val_loader: DataLoader,
    *,
    epochs: int = 5,
    base_lr: float = 3e-4,
    warmup_steps: int = 100,
    grad_clip: float = 1.0,
    eval_every: int = 200,
    device: str = "cuda",
):
    """A from-scratch training loop that demonstrates every piece you need:
    AdamW + warmup-then-cosine LR schedule, gradient clipping, and periodic
    validation. No Lightning, no Trainer, every line is yours to read.
    """
    model.to(device)
    # weight_decay=0.1 matches the value discussed in the text below
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=base_lr, betas=(0.9, 0.95), weight_decay=0.1
    )
    total_steps = epochs * len(train_loader)

    def lr_at(step: int) -> float:
        # Linear warmup, then cosine decay to 10% of base_lr
        if step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return base_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))

    step = 0
    history = []
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            # Forward + cross-entropy over the vocabulary
            logits = model(x)  # (B, T, V)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), y.view(-1)
            )
            # Backward + grad clipping + step
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
            # Apply the scheduled learning rate before the parameter update
            for g in optimizer.param_groups:
                g["lr"] = lr_at(step)
            optimizer.step()
            # Periodic validation
            if step % eval_every == 0:
                val_loss = evaluate(model, val_loader, device)
                history.append((step, loss.item(), val_loss))
                print(f"step {step:>5} | train {loss.item():.3f} | val {val_loss:.3f} | "
                      f"lr {lr_at(step):.2e}")
            step += 1
    # Final checkpoint
    torch.save(model.state_dict(), "mini_transformer.pt")
    return history


@torch.no_grad()
def evaluate(model, loader: DataLoader, device: str) -> float:
    """Compute mean cross-entropy on a held-out loader; returns avg loss."""
    model.eval()
    losses = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        losses.append(torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), y.view(-1)
        ).item())
    model.train()  # restore training mode for the caller
    return sum(losses) / len(losses)


# --- Run training ---
# train_loader, val_loader = build_loaders(...)  # from Section 3.2.3
# model = MiniTransformer(...)                   # from Section 3.2.2
# history = train(model, train_loader, val_loader, epochs=5)
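Gradient clipping (clip_grad_norm_ with max_norm=1.0) rescales all gradients by a
common factor whenever their global L2 norm exceeds the threshold, which prevents
training instability from occasional large gradient spikes. It is standard practice
in most Transformer training pipelines.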
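To make the clipping step concrete, here is a pure-Python sketch of the global-norm rescaling idea behind clip_grad_norm_. The helper name clip_by_global_norm and the toy gradient values are illustrative assumptions, not part of the lab code.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    One norm is computed over all parameters together, and every gradient is
    multiplied by the same factor, so the update direction is preserved.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Toy gradients for two "parameters"; the combined norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

# Gradients that are already small pass through unchanged.
small, _ = clip_by_global_norm([[0.1]], max_norm=1.0)
```

Because the scale factor is shared, clipping shortens the step without changing its direction, which is why it tames spikes without biasing the update toward any particular parameter.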
AdamW (Adam with decoupled weight decay) is the optimizer of choice. The betas (0.9, 0.95) and weight_decay (0.1) follow common LLM training conventions. The learning rate 3e-4 works well for small models; larger models typically use lower rates with warmup schedules.
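The warmup-then-cosine schedule from the training loop can be inspected on its own. The sketch below copies the lr_at formula into a standalone function; total_steps=5000 is an assumed value chosen only for illustration.

```python
import math


def lr_at(step, base_lr=3e-4, warmup_steps=100, total_steps=5000):
    """Linear warmup to base_lr, then cosine decay to 10% of base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))


# Print the learning rate at the start, end of warmup, midpoint, and final step.
for s in (0, 50, 99, 100, 2550, 5000):
    print(f"step {s:>4}: lr = {lr_at(s):.2e}")
```

The rate climbs linearly to base_lr over the first 100 steps, passes through 55% of base_lr at the midpoint of the decay, and ends at 10% of base_lr.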
3.4.2 Understanding the Shapes
Tracking tensor shapes is one of the most valuable debugging skills when working with Transformers. Here is a shape trace through the forward pass:
| Variable | Shape | Description |
|---|---|---|
| idx | (B, T) | Input token indices |
| token_emb(idx) | (B, T, d_model) | Token embeddings |
| pos_emb(positions) | (T, d_model) | Positional embeddings (broadcast over B) |
| x after embedding | (B, T, d_model) | Sum of token + position embeddings |
| qkv | (B, T, 3*d_model) | Fused QKV projection output |
| q, k, v after reshape | (B, n_heads, T, d_k) | Per-head queries, keys, values |
| scores | (B, n_heads, T, T) | Attention scores (before masking) |
| attn_weights | (B, n_heads, T, T) | Attention probabilities (after softmax) |
| out from attention | (B, T, d_model) | Concatenated head outputs after out_proj |
| ffn output | (B, T, d_model) | Feed-forward output |
| logits | (B, T, vocab_size) | Raw prediction scores for each position |
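The shapes in the table can be checked directly. The following is a minimal sketch, assuming small illustrative sizes and freshly created layers in place of the lab's MiniTransformer; it omits layer normalization, residual connections, and the feed-forward block so that only the shape-changing steps remain.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the lab's hyperparameters.
B, T, d_model, n_heads, vocab_size = 2, 8, 32, 4, 65
d_k = d_model // n_heads

token_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(T, d_model)
qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
out_proj = nn.Linear(d_model, d_model, bias=False)
lm_head = nn.Linear(d_model, vocab_size, bias=False)

idx = torch.randint(0, vocab_size, (B, T))                  # (B, T)
x = token_emb(idx) + pos_emb(torch.arange(T))               # (B, T, d_model)
qkv = qkv_proj(x)                                           # (B, T, 3*d_model)
q, k, v = (t.reshape(B, T, n_heads, d_k).transpose(1, 2)    # (B, n_heads, T, d_k)
           for t in qkv.chunk(3, dim=-1))
scores = q @ k.transpose(-2, -1) / d_k ** 0.5               # (B, n_heads, T, T)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
attn_weights = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)
heads = (attn_weights @ v).transpose(1, 2).reshape(B, T, d_model)
out = out_proj(heads)                                       # (B, T, d_model)
logits = lm_head(out)                                       # (B, T, vocab_size)

for name, t in [("idx", idx), ("x", x), ("qkv", qkv), ("q", q),
                ("scores", scores), ("out", out), ("logits", logits)]:
    print(f"{name:<7} {tuple(t.shape)}")
```

Printing shapes at each step like this is a cheap habit to keep while modifying the model: a mismatch shows up at the line that introduced it rather than as a cryptic error several layers later.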
The attention scores have shape (B, n_heads, T, T). This is where the quadratic
cost of attention lives. For T=128, this is 128 × 128 = 16,384 entries per head per
example. For T=4096 (a moderate context window), that grows to 4096 × 4096 = 16,777,216
entries, about 16.8 million. Section 3.5 covers techniques to reduce this cost.
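The arithmetic is easy to reproduce. This sketch counts score-matrix entries and converts them to bytes; the helper names and the choice of 2 bytes per value (half precision) are assumptions made for illustration.

```python
def attention_score_entries(T, n_heads=1, batch_size=1):
    """Number of entries in the (B, n_heads, T, T) attention score tensor."""
    return batch_size * n_heads * T * T


def attention_score_bytes(T, n_heads, batch_size, bytes_per_value=2):
    """Memory for the score tensor, assuming bytes_per_value bytes per entry."""
    return attention_score_entries(T, n_heads, batch_size) * bytes_per_value


for T in (128, 1024, 4096):
    entries = attention_score_entries(T)
    mib = attention_score_bytes(T, n_heads=4, batch_size=1) / 2**20
    print(f"T={T:>5}: {entries:>11,} entries per head, {mib:8.2f} MiB for 4 heads")
```

Multiplying T by 32 (from 128 to 4096) multiplies the score tensor by 1024, which is the quadratic growth the text describes.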
3.2.6.1 Getting Data
Download a small text file for training. Shakespeare's collected works (~1.1 MB) is the classic choice.
# Download the tiny Shakespeare dataset
import urllib.request
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
urllib.request.urlretrieve(url, "input.txt")
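Once the file is on disk, a character-level model needs a mapping between characters and integer ids. The lab's actual data pipeline is built in the earlier section; the sketch below only illustrates the idea, using a short stand-in string instead of input.txt and hypothetical names (stoi, itos, encode, decode).

```python
# Stand-in for: text = open("input.txt", encoding="utf-8").read()
text = "To be, or not to be"

chars = sorted(set(text))                      # the vocabulary: every distinct character
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> character


def encode(s):
    return [stoi[c] for c in s]


def decode(ids):
    return "".join(itos[i] for i in ids)


ids = encode(text)

# A training example is a window of block_size ids and the same window shifted by one.
block_size = 8
x = ids[:block_size]
y = ids[1:block_size + 1]
print(len(chars), repr(decode(x)), "->", repr(decode(y)))
```

The shifted pair (x, y) is what turns raw text into next-token prediction: the target at position t is simply the input at position t + 1.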
3.2.6.2 Training
With the dataset prepared, we can now train the model and observe the loss decreasing over several thousand steps.
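# Train with the loop from Section 3.4.1; model and loaders are built as shown there.
# Choose epochs so that epochs * len(train_loader) is roughly 5000 steps.
history = train(model, train_loader, val_loader, epochs=5)
# Representative training-loss trajectory (loss around 1.4-1.5 after 5000 steps).
# The loop in Section 3.4.1 logs train loss, val loss, and lr on each line.
# step 0 | loss 4.1742 | time 0.0s
# step 500 | loss 1.9831 | time 12.3s
# step 1000 | loss 1.6524 | time 24.7s
# ...
# step 5000 | loss 1.4208 | time 123.5s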
3.2.6.3 Evaluating the Output
After training, the model will generate text that resembles the style of the training data. At ~5000 steps with our small model, you should see recognizable words, approximate sentence structure, and character-level patterns that match the training corpus. The text will not be coherent, but it should clearly be "trying" to produce English in the style of the training data.
3.2.6.4 Experiments to Try
- Increase n_layers from 4 to 6 or 8. Does the loss improve? How much slower is training?
- Increase d_model from 128 to 256. Compare the parameter count and training speed.
- Remove positional embeddings entirely. What happens to the generated text?
- Switch to SwiGLU (replace FeedForward with SwiGLUFeedForward). Does the loss curve change?
- Remove the causal mask. The model can now "cheat" by looking at future tokens. What happens to the training loss? What happens to generation quality?
- Try temperature values of 0.5, 1.0, and 1.5 during generation. Observe the diversity/quality tradeoff.
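For the temperature experiment, it helps to see what dividing the logits by a temperature does to the sampling distribution. This is a minimal pure-Python sketch; the four logit values are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed with the usual max-subtraction trick."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical next-character scores
probs_by_temp = {t: softmax_with_temperature(logits, t) for t in (0.5, 1.0, 1.5)}
for t, probs in probs_by_temp.items():
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

Lower temperatures concentrate probability on the top-scoring character (less diverse, more repetitive text), while higher temperatures flatten the distribution (more diverse, more errors).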
You just built a character-level Transformer from scratch and trained it for thousands of steps. In practice, you can load a pretrained GPT-2 model (trained on 40 GB of internet text) and generate high-quality text immediately:
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("To be, or not to be", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=50, temperature=0.8, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Pretrained GPT-2. Requires pip install transformers torch. The from-scratch lab teaches you what is inside this black box. Every layer, projection, and mask you implemented is present in the pretrained model; the library simply wraps it in a convenient API with optimized CUDA kernels and a pretrained checkpoint.
3.4.3 Common Bugs and Debugging
The most common Transformer implementation bug is getting the attention mask wrong. It is also the hardest to notice, because a model with a subtly broken mask will still train, still produce output, and still reduce its loss. It will just quietly produce nonsense that looks plausible. Debugging Transformers is an exercise in distrusting success.
When implementing Transformers from scratch, certain bugs appear repeatedly. Here are the most common ones and how to detect them:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Loss stays flat at ~ln(vocab_size) | Gradients are not flowing; possible shape mismatch or detached computation | Check that no .detach() calls break the computation graph. Verify loss computation. |
| Loss drops fast then NaN | Learning rate too high or no gradient clipping | Add gradient clipping (max_norm=1.0). Reduce learning rate. Check for missing layer norm. |
| Generated text is repetitive gibberish | Missing or incorrect causal mask | Verify the mask is lower-triangular and correctly applied before softmax. |
| Generated text is random characters | Insufficient training or broken positional encoding | Train longer. Verify pos_emb is added, not concatenated. |
| All generated tokens are the same | Temperature too low or top_k=1 | Increase temperature. Use top_k > 1 or remove top_k filtering. |
Before training on the full dataset, verify your model can overfit a single batch. Take one batch of data and train for 100 steps. The loss should drop to near zero. If it does not, there is a bug in your model or training loop. This simple sanity check saves hours of debugging.
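The single-batch check can be written as a small reusable function. The sketch below is an assumption-laden illustration: toy_model is a hypothetical stand-in for MiniTransformer, and the batch uses a synthetic rule (next token = current token + 1) so that the stand-in is able to memorize it. With the real model you would pass one batch from your training loader instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def overfit_one_batch(model, x, y, steps=100, lr=1e-2):
    """Train on the same (x, y) batch repeatedly and return the loss at every step."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    model.train()
    for _ in range(steps):
        logits = model(x)                                    # (B, T, V)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses


torch.manual_seed(0)
V, B, T = 65, 4, 16
# Hypothetical stand-in model: any module mapping (B, T) ids to (B, T, V) logits works.
toy_model = nn.Sequential(nn.Embedding(V, 64), nn.Linear(64, V))
x = torch.randint(0, V, (B, T))
y = (x + 1) % V                                              # synthetic, learnable targets
losses = overfit_one_batch(toy_model, x, y, steps=200)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The first loss should sit near ln(65) and the last should be close to zero. If the same function applied to your own model and a real batch plateaus well above zero, look for a bug before spending time on a full training run.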
The standard cross-entropy loss applied to a one-hot target distribution pushes the model toward placing all probability mass on the single correct token. For a vocabulary of 50K tokens, the optimizer happily drives the predicted logit for the correct token toward infinity and every other logit toward minus infinity. The model becomes overconfident: it stops hedging, its predicted probabilities no longer reflect real uncertainty, and small distribution shifts at evaluation time hurt disproportionately.
Label smoothing (Szegedy et al., 2016; adopted in the original Transformer with $\epsilon = 0.1$) replaces the one-hot target with a soft target that assigns probability $1 - \epsilon$ to the correct class and spreads $\epsilon$ uniformly across the other $V - 1$ classes. The cross-entropy minimum is no longer at $\pm\infty$ logits; it sits at a finite logit gap that reproduces the smoothed target. Equivalently, the gradient pushes back as soon as the model becomes more confident than the smoothed target allows. The practical effect is typically a mildly worse training loss, a slightly better validation loss, and noticeably better calibration (the model's reported probability tracks accuracy more honestly). PyTorch exposes this as nn.CrossEntropyLoss(label_smoothing=0.1); note that its variant spreads $\epsilon$ uniformly over all $V$ classes, including the correct one, so the correct class receives $1 - \epsilon + \epsilon/V$. Label smoothing remains common practice in machine translation, classification with very large vocabularies, and any setting where the downstream consumer cares about probability calibration, not just argmax accuracy.
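To see exactly what the label_smoothing argument computes, the following sketch builds the soft target by hand and compares the resulting loss with PyTorch's built-in. The logits and targets are random illustrative values; the soft-target formula shown is PyTorch's convention, which mixes the one-hot target with a uniform distribution over all V classes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, eps = 5, 0.1
logits = torch.randn(3, V)            # 3 positions, V classes (illustrative values)
target = torch.tensor([0, 2, 4])

# Soft target: (1 - eps) on the one-hot label plus eps / V on every class.
one_hot = F.one_hot(target, V).float()
soft_target = (1 - eps) * one_hot + eps / V

# Cross-entropy against the soft target, averaged over positions.
manual = -(soft_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
builtin = F.cross_entropy(logits, target, label_smoothing=eps)

print(f"manual {manual.item():.6f} | builtin {builtin.item():.6f}")
```

With eps = 0.1 and V = 5 the correct class receives 0.92 and each of the others 0.02, so no finite set of logits can drive this loss to zero, which is the mechanism that discourages overconfidence.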
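If you are on PyTorch 2.0 or newer, use the fused attention entry point torch.nn.functional.scaled_dot_product_attention, which dispatches to a FlashAttention kernel when the GPU (Ampere or newer), dtype, and shapes allow it. FlashAttention fuses the attention computation so that the full score matrix is never materialized, reducing attention memory from $O(n^2)$ to $O(n)$ in sequence length and often roughly doubling throughput, with outputs that match the unfused computation up to floating-point rounding.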
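A quick way to gain confidence in the fused call is to compare it against an explicit implementation on small random tensors. This sketch assumes PyTorch 2.0 or newer and runs on CPU, where a non-Flash backend is used; the point is only that the two paths agree numerically.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_heads, T, d_k = 2, 4, 16, 8
q, k, v = (torch.randn(B, n_heads, T, d_k) for _ in range(3))

# Reference: explicit causal attention that materializes the (T, T) score matrix.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
manual = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1) @ v

# Fused path: PyTorch selects a backend and applies the causal mask internally.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

max_abs_diff = (manual - fused).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

Swapping the hand-written attention in the lab for this call should leave the loss curve essentially unchanged, which makes it a low-risk optimization to try once the from-scratch version works.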
Richard Feynman famously said, "What I cannot create, I do not understand." Building a Transformer from scratch is more than a pedagogical exercise; it reveals how seemingly abstract mathematical ideas (softmax attention, layer normalization, residual connections) interact at a concrete level. Subtle implementation choices, such as whether to apply layer norm before or after attention (pre-norm vs. post-norm), whether to tie input and output embeddings, or how to scale initialization with depth, can determine whether a model trains stably or diverges. Several of these, pre-norm ordering and depth-scaled initialization in particular, do not appear in the original "Attention Is All You Need" paper; they were discovered through years of engineering practice. This gap between the mathematical specification of an algorithm and its practical implementation is a recurring theme in machine learning, and it is why reproducibility remains one of the field's persistent challenges (see Section 42.2 on evaluation methodology).
Who: A graduate student training a character-level Transformer for a class project on poetry generation.
Situation: Following a tutorial similar to this section, the student trained a 4-layer, 4-head Transformer on a corpus of English poetry (2 MB). The model trained without errors but generated text that was grammatically incoherent, mixing fragments from different styles.
Problem: The student assumed the model was too small (1.6M parameters) and tried doubling every dimension: 8 layers, 8 heads, d_model=256. The larger model overfit severely, memorizing training poems verbatim while producing gibberish on new prompts.
Dilemma: Should they get more training data, add regularization, or reconsider the architecture entirely?
Decision: A mentor suggested keeping the original 4-layer architecture but adjusting three things: increasing dropout from 0.0 to 0.1, switching from a fixed learning rate to cosine decay, and increasing the context length (block_size) from 64 to 256 so the model could see full stanzas.
How: They used the same training code from this section with those three changes, training for 10,000 steps instead of 5,000.
Result: The small model with proper regularization and context length outperformed the large model. Generated text maintained consistent style within passages and showed recognizable poetic structure. Validation loss improved from 2.1 to 1.6.
Lesson: For small-scale Transformer experiments, tuning dropout, learning rate schedule, and context length matters more than adding layers or heads. Scale up the architecture only after exhausting these simpler knobs.
Reference implementations continue to improve accessibility. Andrej Karpathy's nanoGPT remains a popular educational resource. Meta's torchtune provides production-quality implementations for fine-tuning. LitGPT (Lightning AI) offers clean implementations of 20+ architectures. For hardware-optimized training, DeepSpeed and FSDP (Fully Sharded Data Parallel) in PyTorch handle multi-GPU distribution automatically. Understanding from-scratch implementations remains valuable for debugging, customization, and understanding what frameworks abstract away.
- A decoder-only Transformer can be implemented in ~300 lines of clear, modular PyTorch.
- The architecture has four main components: embeddings, causal self-attention, feed-forward networks, and layer normalization (Section 3.1), all connected by residual connections.
- Fused QKV projections and weight tying are standard efficiency tricks with no loss in model quality.
- Careful initialization (especially scaling residual projections) is critical for stable training.
- Gradient clipping, AdamW with appropriate hyperparameters, and the Pre-LN ordering are standard practice.
- Tracking tensor shapes through the forward pass is the single most effective debugging technique.
Exercises
Compute the parameter count for the mini-transformer with the hyperparameters in Section 3.2.1 (vocab=65, block=128, n_layers=4, n_heads=4, d_model=128, d_ff=512, no bias, weight tying). Break it into (a) embeddings + positional, (b) per-block, (c) total. Compare to the chapter's claim of "~1.6M parameters."
Answer Sketch
(a) Token embedding: 65 × 128 = 8,320; positional embedding: 128 × 128 = 16,384; total = 24,704 (output projection shares the 8,320 with token embedding via weight tying). (b) Per block: attention QKV combined: 128 × (3 × 128) = 49,152; output projection: 128 × 128 = 16,384; FFN W1: 128 × 512 = 65,536; FFN W2: 512 × 128 = 65,536; LayerNorm scale (no bias): 2 × 128 = 256; per-block total = 196,864. (c) 4 blocks = 787,456; final LayerNorm = 128. Grand total = 24,704 + 787,456 + 128 = 812,288 (~0.81M). Biases cannot explain the gap to the chapter's "~1.6M": enabling them would add only a few thousand parameters. The figure more likely reflects a different configuration; for example, the same model with 8 layers has 24,704 + 8 × 196,864 + 128 = 1,599,744 parameters. The exercise's value is in showing that about two-thirds of the parameters (524,288 of 812,288, roughly 65%) live in the FFN, as predicted by the 8d²/12d² rule of thumb.
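The count above can be reproduced with a few lines of arithmetic. This sketch assumes the configuration stated in the exercise (no biases, tied embeddings, two LayerNorm scales per block plus a final one); count_params is a hypothetical helper name.

```python
def count_params(vocab=65, block=128, n_layers=4, d_model=128, d_ff=512):
    """Parameter count for a decoder-only Transformer with no biases and tied embeddings."""
    embeddings = vocab * d_model + block * d_model           # token + positional
    attention = d_model * 3 * d_model + d_model * d_model    # fused QKV + output projection
    ffn = d_model * d_ff + d_ff * d_model                    # W1 + W2
    norms = 2 * d_model                                      # two LayerNorm scales per block
    per_block = attention + ffn + norms
    total = embeddings + n_layers * per_block + d_model      # + final LayerNorm scale
    return {
        "embeddings": embeddings,
        "per_block": per_block,
        "total": total,
        "ffn_share": n_layers * ffn / total,
    }


counts = count_params()
print(counts)
```

Changing one argument at a time (for example n_layers=8 or d_model=256) is a quick way to see which hyperparameters dominate the parameter budget before committing to a training run.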
The debug table notes that a loss flat at $\ln(\text{vocab\_size})$ indicates broken gradient flow. (a) For vocab_size=65, what is this baseline loss numerically? (b) What does this baseline correspond to in terms of the model's predictions? (c) Apart from .detach(), list two other implementation bugs that would produce this exact symptom.
Answer Sketch
(a) $\ln(65) \approx 4.174$. (b) This is the cross-entropy of a uniform distribution over 65 classes: the model assigns equal probability to every character and has learned nothing that distinguishes the true next character from a random one. (c) Other bugs that leave the weights unchanged: (1) Forgetting to call loss.backward() before optimizer.step(), so every .grad is None and the step is a no-op. (2) Building the optimizer over parameters that are not the ones used in the forward pass, for example a stale copy of the model, or constructing it before model.to(device) on a backend where the move replaces parameter objects instead of updating them in place. (3) Initializing all weights to zero, which removes the symmetry breaking needed for learning. Note that wrapping the forward pass in torch.no_grad() does not produce this symptom: loss.backward() raises an error because the loss has no grad_fn. Leaving the model in eval() mode only disables dropout and is not a cause either. All of these share one diagnostic: inspect next(model.parameters()).grad on every step; if it is None or its norm is zero, gradients are not reaching the parameters.
Suppose the mini-transformer's MiniTransformer class currently has separate self.tok_emb = nn.Embedding(vocab_size, d_model) and self.lm_head = nn.Linear(d_model, vocab_size, bias=False). Sketch the change needed to implement weight tying. Then explain what subtle pitfall to avoid when saving/loading checkpoints.
Answer Sketch
The code change is one line in __init__, after both modules are created: self.lm_head.weight = self.tok_emb.weight. This makes the two modules share a single Parameter, so an update to one is an update to both. Pitfall: model.state_dict() still contains tok_emb.weight and lm_head.weight as separate keys even though they share storage. load_state_dict copies values into the existing parameters in place, so tying survives a reload as long as the model you load into was constructed with the tie. It breaks when the tie was applied outside __init__ (for example, in the training script) and the model is rebuilt without it: the checkpoint then fills two independent tensors. The safe habit is to re-tie after loading (model.load_state_dict(...); model.lm_head.weight = model.tok_emb.weight) or to assert that model.lm_head.weight is model.tok_emb.weight. An untied reload carries vocab_size × d_model extra parameters (8,320 for this model) and produces identical logits at first, so the bug stays silent until further training lets the two matrices drift apart.
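The tying behaviour and the checkpoint round trip can be verified in a few lines. TinyLM below is a hypothetical stand-in that keeps only the embedding and the output head; the tie flag exists purely to contrast the two cases.

```python
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Hypothetical stand-in for MiniTransformer: embedding and output head only."""

    def __init__(self, vocab_size=65, d_model=128, tie=True):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        if tie:
            self.lm_head.weight = self.tok_emb.weight    # share one Parameter

    def forward(self, idx):
        return self.lm_head(self.tok_emb(idx))


def n_params(model):
    # parameters() yields each shared Parameter once, so tied weights are not double-counted.
    return sum(p.numel() for p in model.parameters())


tied = TinyLM(tie=True)
state = tied.state_dict()                 # holds both keys, pointing at the same values

reloaded = TinyLM(tie=True)               # constructor re-creates the tie
reloaded.load_state_dict(state)

untied = TinyLM(tie=False)                # same checkpoint, but no tie in the constructor
untied.load_state_dict(state)

print(sorted(state.keys()), n_params(tied), n_params(untied))
```

The untied model loads without any warning and starts with identical values in both matrices, which is why an identity check on the two weight attributes is a more reliable guard than comparing outputs.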
The "Debugging Tip" recommends overfitting a single batch as a sanity check. (a) For the mini-transformer with vocab=65, d_model=128, what loss would you expect after 100 steps of overfitting? (b) Why would a model that fails to overfit a single batch also fail to train on the full dataset? (c) Sketch the smallest possible PyTorch loop (5-7 lines) that performs this overfitting test.
Answer Sketch
(a) Loss should drop close to zero (typically < 0.05) within 100 steps because the model has more than enough capacity to memorize a single batch of 64 sequences × 128 tokens = 8,192 character predictions, which is far fewer than the 0.8M parameters available. If loss plateaus far above 0 (e.g., at 1-2), there is a bug. (b) A model that cannot fit one batch certainly cannot fit thousands of batches drawn from a more diverse distribution. The single-batch test isolates "is the model + optimizer wired correctly?" from "is the model big enough for the data?" If overfitting fails, the bug is architectural or numeric, not data-related. (c) Fetch one batch once with x, y = next(iter(loader)), then repeat 100 times: logits = model(x); loss = F.cross_entropy(logits.view(-1, V), y.view(-1)); opt.zero_grad(); loss.backward(); opt.step(); print(loss.item()). The key is reusing the same x, y on every step rather than fetching new batches.
The chapter explains that fusing Q, K, V into one large matrix multiplication is more GPU-efficient. (a) For d_model=4096, what is the size of the fused projection matrix versus three separate matrices? (b) Why does GPU performance favor one large matmul over three small ones, given the FLOP count is identical? (c) When would you NOT fuse them?
Answer Sketch
(a) Fused: a single 4096 x 12288 matrix (~50.3M parameters); three separate: each 4096 x 4096 (~16.8M each, total 50.3M). Same parameter count, same total FLOPs. (b) For operations of this size, GPU performance is often limited by memory bandwidth and kernel launch overhead rather than raw FLOPs. One large matmul reuses input data (the input x is loaded from GPU memory once) and amortizes the kernel launch cost (~5-10 microseconds) over more compute. Three separate matmuls load x three times and pay the launch cost three times. On modern GPUs, this can yield a 1.3-1.8x speedup. (c) You would NOT fuse them when (1) Q, K, V are computed from different inputs, as in cross-attention where Q comes from the decoder and K, V from the encoder; (2) you want different dropout patterns on Q vs K vs V; or (3) you are using grouped-query attention (Section 3.2 of the larger book), where K and V have fewer heads than Q and the matrix shapes no longer match.
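The equivalence between fused and separate projections can be checked numerically. This is a minimal sketch with a small illustrative d_model; it copies three separate weight matrices into one fused layer and confirms that the outputs match.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64                                   # small illustrative size
x = torch.randn(2, 10, d_model)                # (B, T, d_model)

# Three separate projections.
wq, wk, wv = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

# One fused projection whose weight is the three matrices stacked along the output dim.
fused = nn.Linear(d_model, 3 * d_model, bias=False)
with torch.no_grad():
    fused.weight.copy_(torch.cat([wq.weight, wk.weight, wv.weight], dim=0))

q, k, v = fused(x).chunk(3, dim=-1)            # each (B, T, d_model)

# Parameter counts for the exercise's d_model = 4096.
fused_params = 4096 * (3 * 4096)
separate_params = 3 * (4096 * 4096)
print(fused_params, separate_params)
```

Since the two layouts hold the same numbers and produce the same outputs, the choice between them is purely about execution efficiency, not model capacity.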
The "fun fact" warns that mask bugs are silent: training continues, loss decreases, but generation produces nonsense. (a) Construct a specific example: if the causal mask is accidentally set to upper-triangular instead of lower-triangular, what does each output position attend to? (b) Why does the model still train (loss still decreases)? (c) What test would catch this bug deterministically?
Answer Sketch
(a) An upper-triangular mask blocks the past and exposes the future: position 0 attends to positions 1..T-1, position 1 attends to 2..T-1, and so on. If the diagonal is kept, each position also sees itself and position T-1 attends only to itself; with a strictly upper-triangular mask, the last row is fully masked and softmax returns NaN, which at least makes the bug visible. The model is doing "anti-causal" attention, looking at future tokens instead of past ones. (b) Loss still decreases because the mask is applied consistently during training: the target for position t is token t+1, and that token is directly visible in the attended context, so the task collapses to copying and the loss falls quickly. The bug only manifests at generation time, when future tokens do not exist and the model has nothing meaningful to attend to. (c) Deterministic test: compute attention weights over a sequence and verify that position 0 has weight 1.0 on position 0 (only itself), position 1 distributes weight over positions 0 and 1, and so on. Any non-zero weight from a query position on a future position fails the test. A second test: change the last token of an input and confirm that the logits at all earlier positions are unchanged; if they move, information is leaking backward from the future.
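The deterministic check in (c) fits in a short helper. The sketch below uses random scores in place of a real model's attention and hypothetical helper names (attends_to_future, masked_softmax); the buggy mask keeps the diagonal so that no row is fully masked.

```python
import torch


def attends_to_future(attn_weights, atol=1e-6):
    """Return True if any query position places weight on a later key position."""
    T = attn_weights.size(-1)
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    leaked = attn_weights.masked_fill(~future, 0.0).abs().max().item()
    return leaked > atol


def masked_softmax(scores, allowed):
    """Softmax over the last dim, with disallowed positions set to -inf first."""
    return scores.masked_fill(~allowed, float("-inf")).softmax(dim=-1)


torch.manual_seed(0)
T = 6
scores = torch.randn(1, 2, T, T)                                # (B, n_heads, T, T)

good_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))      # past and self: correct
bad_mask = torch.triu(torch.ones(T, T, dtype=torch.bool))       # self and future: the bug

good = masked_softmax(scores, good_mask)
bad = masked_softmax(scores, bad_mask)
print(attends_to_future(good), attends_to_future(bad))
```

Running a check like this once on the attention weights of a freshly built model costs almost nothing and catches the mask orientation bug before any training time is spent.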
What's Next?
In the next section, Section 3.5: Transformer Variants & Efficiency, we continue building on the topics covered here.