Part II: Understanding LLMs

Chapter 07: Modern LLM Landscape & Model Internals

"The best way to predict the future is to invent it, but the second best way is to train a very large neural network on all of human text."

Eval Eval, Prophetically Trained AI Agent
Figure 7.0.1: The LLM ecosystem as a bustling marketplace, where closed-source frontier models push capability boundaries, open-weight releases democratize access, and a new class of reasoning models reshapes what is possible.

Chapter Overview

The large language model ecosystem has grown at a breathtaking pace. Closed-source frontier models from OpenAI, Anthropic, and Google push the boundaries of capability, while open-weight releases from Meta, DeepSeek, Mistral, Alibaba, and Microsoft have democratized access to powerful models that anyone can download, fine-tune, and deploy. Meanwhile, a new class of reasoning models has emerged, shifting compute from training time to inference time through extended chains of thought, process reward models, and tree search over candidate solutions.
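One concrete form of this inference-time scaling is best-of-N sampling: draw several candidate solutions and keep the one a reward model scores highest. The sketch below shows only the control flow; `generate` and `score` are hypothetical stand-ins for an LLM sampler and a reward model, not any vendor's API.

```python
import itertools

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate completions and return the highest-scoring one.
    `generate` and `score` are placeholders for a real LLM sampler and a
    reward model (e.g. a process reward model)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy demo: "generate" cycles through canned answers; "score" rewards
# longer (more worked-out) candidates via plain string length.
answers = itertools.cycle([
    "42",
    "The answer is 42",
    "I think it is 42 because 6*7=42",
])
best = best_of_n("What is 6*7?", lambda p: next(answers), score=len, n=3)
print(best)  # -> "I think it is 42 because 6*7=42"
```

Spending more samples (larger `n`) buys quality at inference time instead of training time, which is the tradeoff the reasoning-model paradigm exploits.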

This chapter surveys the current landscape across four complementary perspectives. We begin with the closed-source frontier (Section 7.1), examining the capabilities, pricing, and architectural hints available for GPT-4o, Claude, Gemini, and their competitors. Section 7.2 dives deep into open-source and open-weight models, with particular attention to architectural innovations like DeepSeek V3's Multi-head Latent Attention, FP8 training, and auxiliary-loss-free Mixture of Experts. Section 7.3 explores the paradigm shift toward reasoning models and test-time compute scaling. Finally, Section 7.4 addresses the multilingual and cross-cultural dimensions that determine whether these models serve a global audience or remain English-centric tools.
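The Mixture of Experts designs surveyed in Section 7.2 route each token to a small subset of expert networks rather than through one dense feed-forward layer. The toy sketch below shows the top-k routing pattern in plain Python, with scalar "experts" standing in for full feed-forward blocks; it illustrates the idea only and is not DeepSeek's or Mixtral's actual implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Select the k highest-probability experts and renormalize their
    weights, as in sparse MoE layers (e.g. Mixtral's top-2 of 8)."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy demo: four "experts" that just scale their input; the router's
# logits favor experts 1 and 3, so only those two run.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_forward(1.0, experts, [0.1, 2.0, 0.3, 1.5], k=2))
```

Because only k of the experts execute per token, total parameters can grow far faster than per-token compute; the load-balancing tricks the text mentions (including DeepSeek V3's auxiliary-loss-free variant) exist to keep the router from collapsing onto a few favorite experts.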

Big Picture

The LLM landscape spans a spectrum from closed-source frontier APIs (maximum capability, least control) to open-weight models (full transparency, deployment flexibility). Understanding this spectrum, along with the emerging paradigm of reasoning models that shift compute to inference time, is essential for making informed architectural decisions throughout the rest of this book.

What's Next?

In the next chapter, Chapter 08: Reasoning Models and Test-Time Compute, we explore how reasoning models like o1, o3, and DeepSeek-R1 improve outputs by allocating more compute at inference time.

Bibliography & Further Reading

Foundational Papers

Touvron, H., et al. "Llama 2: Open Foundation and Fine-Tuned Chat Models." (2023). arXiv:2307.09288
The paper that launched Meta's open-weight LLM family, detailing pretraining, RLHF alignment, and safety evaluations.
DeepSeek-AI. "DeepSeek-V3 Technical Report." (2024). arXiv:2412.19437
Describes Multi-head Latent Attention, FP8 mixed-precision training, and auxiliary-loss-free load balancing for Mixture of Experts.
Jiang, A.Q., et al. "Mixtral of Experts." (2024). arXiv:2401.04088
Introduces the sparse Mixture of Experts architecture behind Mixtral, showing how expert routing achieves strong performance at lower compute cost.
OpenAI. "GPT-4 Technical Report." (2023). arXiv:2303.08774
OpenAI's report on GPT-4 capabilities, covering multimodal inputs, safety mitigations, and benchmark performance across professional exams.
DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." (2025). arXiv:2501.12948
Demonstrates how reinforcement learning can elicit extended chain-of-thought reasoning in open-weight models.
Gemini Team, et al. "Gemini: A Family of Highly Capable Multimodal Models." (2024). arXiv:2312.11805
Google's multimodal model family, covering native image, audio, and video understanding alongside text.

Key Books & Surveys

Zhao, W.X., et al. "A Survey of Large Language Models." (2023). arXiv:2303.18223
A comprehensive survey covering pretraining, alignment, evaluation, and applications of LLMs across the full landscape.
Üstün, A., et al. "Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model." (2024). arXiv:2402.07827
Covers the Aya multilingual model trained on 101 languages, with discussion of cross-lingual transfer and cultural bias.
Snell, C., et al. "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." (2024). arXiv:2408.03314
Formalizes the tradeoff between train-time and test-time compute, showing when inference scaling is preferable to larger models.

Tools & Libraries

Hugging Face Model Hub. huggingface.co/models
The central repository for discovering, downloading, and deploying open-weight models, with over 500,000 model cards.
Hugging Face Open LLM Leaderboard. huggingface.co/spaces/open-llm-leaderboard
Community-maintained benchmark comparison of open-weight models across reasoning, knowledge, and coding tasks.
Ollama. ollama.com
A lightweight tool for running open-weight LLMs locally on consumer hardware with a simple CLI interface.
vLLM: Easy, Fast, and Cheap LLM Serving. github.com/vllm-project/vllm
High-throughput serving engine with PagedAttention, used widely for deploying open-weight models in production.