
"The best way to predict the future is to invent it, but the second best way is to train a very large neural network on all of human text."
Eval, Prophetically Trained AI Agent
Chapter 6 covered how LLMs are trained. This chapter covers which LLMs to actually use: the frontier (GPT, Claude, Gemini), the open-weight winners (Llama, Mistral, DeepSeek, Qwen, Gemma), and the architectural innovations that distinguish them (MoE vs. dense, GQA, sliding-window attention). Come back to this chapter whenever the question is "which model should I use for this task?"
Chapter Overview
The large language model ecosystem has grown at a breathtaking pace. Closed-source frontier models from OpenAI, Anthropic, and Google push the boundaries of capability, while open-weight releases from Meta, DeepSeek, Mistral, Alibaba, and Microsoft have democratized access to powerful models that anyone can download, fine-tune, and deploy. Meanwhile, a new class of reasoning models has emerged, shifting compute from training time to inference time through extended chains of thought, process reward models, and tree search over candidate solutions.
This chapter surveys the current landscape from three complementary perspectives. We begin with the closed-source frontier (Sections 7.1 and 7.2), examining the capabilities, pricing, and architectural hints available for GPT-4o, Claude, Gemini, and their competitors. Section 7.3 dives deep into open-source and open-weight models, with particular attention to architectural innovations like DeepSeek V3's Multi-head Latent Attention, FP8 training, and auxiliary-loss-free Mixture of Experts. Finally, Section 7.4 addresses the multilingual and cross-cultural dimensions that determine whether these models serve a global audience or remain English-centric tools. The paradigm shift toward reasoning models and test-time compute scaling is the subject of Chapter 8.
The LLM landscape spans a spectrum from closed-source frontier APIs (maximum capability, least control) to open-weight models (full transparency, deployment flexibility). Understanding this spectrum, along with the emerging paradigm of reasoning models that shift compute to inference time, is essential for making informed architectural decisions throughout the rest of this book.
Learning Objectives
- Compare frontier closed-source models on capability dimensions including reasoning, multimodality, context length, and pricing
- Explain the architectural innovations in DeepSeek V3 (MLA, FP8, auxiliary-loss-free MoE) and their impact on efficiency
- Articulate the difference between train-time and test-time compute scaling, and identify when each is preferable
- Implement best-of-N sampling with a reward model and explain process vs. outcome reward models
- Evaluate multilingual LLM capabilities and understand the challenges of cross-lingual transfer
- Navigate the Hugging Face ecosystem to discover, download, and run open-weight models locally
- Describe Monte Carlo Tree Search applied to language generation and the AlphaProof approach
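As a preview of the best-of-N objective above, the following is a minimal sketch of the idea, not a production implementation: sample N candidate answers, score each with a reward function, and keep the highest-scoring one. The names `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical stand-ins introduced for illustration. In practice `generate` would call an LLM sampler and `reward` would call a trained reward model. The toy reward here is outcome-style (it looks only at the final answer); a process reward model would instead score the intermediate reasoning steps.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, random.Random], str],
    reward: Callable[[str, str], float],
    n: int = 8,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates and return (best_candidate, best_score)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    candidates = [generate(prompt, rng) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    # max() keeps the first candidate on ties
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_generate(prompt: str, rng: random.Random) -> str:
    """Hypothetical sampler: a noisy guess at 17 + 25 (stands in for an LLM)."""
    return str(17 + 25 + rng.choice([-2, -1, 0, 1, 2]))


def toy_reward(prompt: str, candidate: str) -> float:
    """Hypothetical outcome reward: 0.0 for the right answer, negative otherwise."""
    return -float(abs(int(candidate) - 42))


if __name__ == "__main__":
    answer, score = best_of_n("What is 17 + 25?", toy_generate, toy_reward, n=16)
    print(answer, score)
```

Under these assumptions, raising `n` increases the chance that at least one sampled candidate is correct, which is the basic trade of inference-time compute for accuracy. The approach is only as good as the reward function used to rank candidates.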
Prerequisites
- Chapter 3: Transformer architecture (attention mechanism, multi-head attention, feed-forward layers)
- Chapter 4: Decoding strategies (greedy, beam search, sampling methods)
- Chapter 6: Pretraining, scaling laws, and data curation fundamentals
- Basic familiarity with Python and the Hugging Face Transformers library
Sections
- 7.1 Frontier Models: OpenAI & Anthropic (Advanced). The frontier model landscape, OpenAI's GPT-4o and the o-series, and Anthropic's Claude family.
- 7.2 Frontier: Gemini, Architecture & Benchmarks (Advanced). Google's Gemini, second-tier providers, multimodal architectural unification, attention variants, and a multi-axis comparison of the closed-source frontier.
- 7.2a Rate Limits, Convergence & Benchmarking (Advanced). Rate limits and practical constraints, architectural inference from the outside, the convergence trend, and benchmarking methodology with contamination.
- 7.3 Open-Source & Open-Weight Models (Intermediate). The open-weight revolution.
- 7.4 Multilingual & Cross-Cultural LLMs (Advanced). Language technology is not linguistically neutral.
- 7.4a Multilingual Evaluation, Adaptation & Model Families (Advanced). Benchmarks, the three-stage adaptation pipeline, and production-grade multilingual model families.
What's Next?
Next: Chapter 8: Reasoning Models & Test-Time Compute. The model zoo you just toured assumed one shot per prompt. But 2024 introduced a new axis: spend more compute at inference and the same weights solve harder problems. Chapter 8 covers how o1, o3, DeepSeek-R1, and QwQ are trained (RLVR, GRPO, PRM) and when paying more per request is cheaper than buying a bigger model.