Chapter 7: Modern LLM Landscape & Model Internals

Chapter opener illustration: Modern LLM Landscape & Model Internals.

"The best way to predict the future is to invent it, but the second best way is to train a very large neural network on all of human text."
Eval, Prophetically Trained AI Agent

Looking Back

Chapter 6 told you how LLMs are trained. This chapter tells you which LLMs to actually use: the frontier (GPT, Claude, Gemini), the open-weight leaders (Llama, Mistral, DeepSeek, Qwen, Gemma), and the architectural innovations that distinguish them (MoE vs. dense, GQA, sliding-window attention). Come back to this chapter whenever the question is "which model should I use for this task?"
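To see why choices such as GQA and sliding-window attention matter when picking a model, it helps to estimate the key-value cache they imply. The sketch below is a back-of-the-envelope calculation, not a measurement of any real model: the layer count, head counts, head dimension, sequence length, and window size are all hypothetical values chosen for round numbers, and the helper name `kv_cache_bytes` is made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, window=None):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # Sliding-window attention only needs the most recent `window` tokens.
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * bytes_per_value


# Hypothetical configuration (assumed for illustration, 16-bit cache values).
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

# Standard multi-head attention: one KV head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped-query attention: 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)
# GQA plus a 4096-token sliding window.
gqa_swa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN, window=4096)

GIB = 2 ** 30
for name, size in [("MHA", mha), ("GQA", gqa), ("GQA + window", gqa_swa)]:
    print(f"{name:>12}: {size / GIB:.2f} GiB per sequence")
```

Under these assumed numbers, sharing KV heads cuts the cache by the ratio of query heads to KV heads, and the window caps it once the sequence exceeds the window length. Real deployments add batch size, quantized caches, and paging on top of this estimate.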

Chapter Overview

The large language model ecosystem has grown at a breathtaking pace. Closed-source frontier models from OpenAI, Anthropic, and Google push the boundaries of capability, while open-weight releases from Meta, DeepSeek, Mistral, Alibaba, and Microsoft have democratized access to powerful models that anyone can download, fine-tune, and deploy. Meanwhile, a new class of reasoning models has emerged, shifting compute from training time to inference time through extended chains of thought, process reward models, and tree search over candidate solutions.

This chapter surveys the current landscape from four complementary perspectives. We begin with the closed-source frontier (Section 7.1), examining the capabilities, pricing, and architectural hints available for GPT-4o, Claude, Gemini, and their competitors. Section 7.2 dives deep into open-source and open-weight models, with particular attention to architectural innovations like DeepSeek V3's Multi-head Latent Attention, FP8 training, and auxiliary-loss-free Mixture of Experts. Section 7.3 explores the paradigm shift toward reasoning models and test-time compute scaling. Finally, Section 7.4 addresses the multilingual and cross-cultural dimensions that determine whether these models serve a global audience or remain English-centric tools.

Big Picture

The LLM landscape spans a spectrum from closed-source frontier APIs (maximum capability, least control) to open-weight models (inspectable weights, deployment flexibility). Understanding this spectrum, along with the emerging paradigm of reasoning models that shift compute to inference time, is essential for making informed architectural decisions throughout the rest of this book.
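One simple way to picture "shifting compute to inference time" is best-of-N sampling: draw several candidate answers and keep the one a reward model scores highest. The sketch below is a toy under stated assumptions, not a real pipeline. `generate_candidates` stands in for sampling from an LLM and `outcome_reward` stands in for a learned outcome reward model; both are hypothetical helpers, and the toy reward peeks at the true answer, which a real reward model cannot do.

```python
import random

TRUE_ANSWER = 17 * 24  # the toy task: "What is 17 * 24?"


def generate_candidates(prompt, n, rng):
    """Hypothetical stand-in for sampling n completions from an LLM.

    Each "completion" is a noisy guess at the true answer.
    """
    return [TRUE_ANSWER + rng.randint(-3, 3) for _ in range(n)]


def outcome_reward(prompt, candidate):
    """Toy outcome reward: scores only the final answer, higher is better.

    A real outcome reward model would be a learned scorer without access
    to the ground truth; this shortcut is for illustration only.
    """
    return -abs(candidate - TRUE_ANSWER)


def best_of_n(prompt, n, seed=0):
    """Sample n candidates and return the one with the highest reward."""
    rng = random.Random(seed)
    candidates = generate_candidates(prompt, n, rng)
    return max(candidates, key=lambda c: outcome_reward(prompt, c))


prompt = "What is 17 * 24?"
for n in (1, 4, 16):
    answer = best_of_n(prompt, n)
    print(f"N={n:>2}: answer={answer}, reward={outcome_reward(prompt, answer)}")
```

With a fixed seed, the candidates for a larger N include those for a smaller N, so the selected reward can only stay the same or improve as N grows. That is the trade in miniature: the generator is unchanged, and extra samples buy a better answer. A process reward model would instead score intermediate reasoning steps rather than only the final answer.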

Note: Learning Objectives

Compare frontier closed-source models on capability dimensions including reasoning, multimodality, context length, and pricing
Explain the architectural innovations in DeepSeek V3 (MLA, FP8, auxiliary-loss-free MoE) and their impact on efficiency
Articulate the difference between train-time and test-time compute scaling, and identify when each is preferable
Implement best-of-N sampling with a reward model and explain process vs. outcome reward models
Evaluate multilingual LLM capabilities and understand the challenges of cross-lingual transfer
Navigate the Hugging Face ecosystem to discover, download, and run open-weight models locally
Describe Monte Carlo Tree Search applied to language generation and the AlphaProof approach

Prerequisites

Chapter 3: Transformer architecture (attention mechanism, multi-head attention, feed-forward layers)
Chapter 4: Decoding strategies (greedy, beam search, sampling methods)
Chapter 6: Pretraining, scaling laws, and data curation fundamentals
Basic familiarity with Python and the Hugging Face Transformers library

Sections

What's Next?

Next: Chapter 8: Reasoning Models & Test-Time Compute. The model zoo you just toured assumed one shot per prompt. But 2024 introduced a new axis: spend more compute at inference and the same weights solve harder problems. Chapter 8 covers how o1, o3, DeepSeek-R1, and QwQ are trained (RLVR, GRPO, PRM) and when paying more per request is cheaper than buying a bigger model.