Part Overview
Part II takes you inside the black box. You will learn how LLMs are pretrained on massive corpora, what scaling laws predict about model performance, and how modern architectures (GPT, LLaMA, Mistral, Gemma) differ in their design choices. The part concludes with inference optimization: quantization, KV-cache management, batching strategies, and efficient serving frameworks that make LLMs practical in production.
Chapters: five (Chapters 6 through 9, plus Chapter 18: Interpretability). The part builds directly on the Transformer foundations from Part I and prepares you for hands-on LLM work in Part III.
Before you can effectively use or customize an LLM, you need to understand how it was built. Part II reveals the training recipes, architectural trade-offs, and serving strategies that determine what a model can do and how much it costs to run.
Chapter 6 covers how LLMs learn from raw text: pretraining objectives, dataset construction, scaling laws (Chinchilla, Kaplan), data mixing strategies, deduplication, and the economics of large-scale training.
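As a taste of the scaling-law arithmetic, here is a minimal sketch of the Chinchilla rule of thumb. It assumes the common C ≈ 6ND cost model and the roughly 20-tokens-per-parameter ratio from Hoffmann et al.; the exact fitted exponents vary by dataset and are discussed in the chapter.

```python
def chinchilla_optimal(flops_budget: float) -> tuple[float, float]:
    """Compute-optimal model size and token count for a training FLOP budget.

    Assumptions (rules of thumb, not exact fits):
      - training cost C ~= 6 * N * D FLOPs (N params, D tokens)
      - compute-optimal ratio D ~= 20 * N  (Chinchilla)
    Substituting: C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120).
    """
    n_params = (flops_budget / 120) ** 0.5
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP budget, in the general vicinity of Chinchilla-scale runs.
n, d = chinchilla_optimal(1e23)
print(f"{n / 1e9:.1f}B params, {d / 1e9:.0f}B tokens")  # → 28.9B params, 577B tokens
```

Plugging in Chinchilla's own budget (about 5.9e23 FLOPs) recovers its reported 70B parameters and 1.4T tokens, which is a useful sanity check on the arithmetic.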
Chapter 7 surveys the major model families (GPT, LLaMA, Mistral, Gemma, Claude, Gemini), their architectural innovations, and how to read model cards. It covers open vs. closed models and the rapidly evolving landscape.
Chapter 8 examines how reasoning models like o1, o3, and DeepSeek-R1 improve outputs by allocating more compute at inference time. It covers the test-time compute paradigm, training with reinforcement learning, and compute-optimal inference strategies.
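The core idea of spending extra compute at inference time can be sketched as a best-of-N loop. Here `generate` and `score` are hypothetical stand-ins for a stochastic sampler and a verifier or reward model, not the API of any particular framework; raising `n` trades more inference compute for a better-scoring answer.

```python
import random

def best_of_n(generate, score, n=8):
    """Best-of-N sampling: draw n candidate answers, keep the highest-scoring.

    `generate` is any zero-argument sampler; `score` is any scoring function
    (in practice, a verifier or reward model). Both are placeholders here.
    """
    return max((generate() for _ in range(n)), key=score)

# Toy illustration: "answers" are random floats and the "verifier"
# prefers values close to 0.5. Real systems score full text candidates.
random.seed(0)
answer = best_of_n(lambda: random.random(), lambda x: -abs(x - 0.5), n=16)
```

Compute-optimal inference asks when this kind of repeated sampling beats simply using a larger model for a single pass; the chapter works through that trade-off.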
Chapter 9 focuses on making LLMs fast and affordable: quantization (GPTQ, AWQ, GGUF), KV-cache optimization, continuous batching, speculative decoding, and serving frameworks (vLLM, TGI, SGLang).
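To see why KV-cache optimization matters, a back-of-the-envelope memory estimate helps. The shapes below are assumptions (loosely LLaMA-7B-like: 32 layers, 32 KV heads, head dimension 128), not figures from the chapter:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Memory for the KV cache: two tensors (K and V) per layer,
    each of shape [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Assumed LLaMA-7B-like config in fp16 (2 bytes/element), 4K context, batch 8:
gib = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8) / 2**30
print(f"{gib:.1f} GiB")  # → 16.0 GiB
```

Sixteen gibibytes of cache for a batch of eight is more than the weights of many small models, which is why techniques like paged KV memory (vLLM), grouped-query attention, and cache quantization pay off so quickly in serving.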
Chapter 18 explains what LLMs have learned and why they produce specific outputs. It covers attention analysis, probing classifiers, mechanistic interpretability (circuits, superposition), and practical tools for explaining model behavior.
What Comes Next
Continue to Part III: Working with LLMs.