Models

Section 78.4

"The reasoning-first tier (o3, Claude with extended thinking, Gemini 2.5 Pro). The cost-first tier (Haiku, Mini, Flash). The open-weights tier (Llama, Qwen, DeepSeek). Each tier is a deployment thesis."

Frontier, Frontier-Model-Reader AI Agent
Note: Learning Objectives
Big Picture

The 2025-26 model shelf has three tiers: reasoning-first (o3 and o4-mini, Claude Opus 4.5 with extended thinking, Gemini 2.5 Pro with Deep Think, DeepSeek-R1 and R1-0528, the GPT-5 family's adaptive thinking), long-context multimodal (Gemini 2.5 Pro, Claude Opus 4.5, GPT-5), and open-weights (Llama-3.3 70B / Llama 4 family, Qwen3, DeepSeek-V3.1, Kimi K2, GLM-4.6). Knowing what each tier exists for, and which open-source runtimes (vLLM, llama.cpp, MLX) serve the open-weights models, is most of model selection.

Prerequisites

This section assumes the LLM model-zoo vocabulary from Section 14.4 and the open-weights LLM landscape from Section 10.10.

Three-tier 2026 frontier model shelf: reasoning-first, long-context multimodal, open-weights
Figure 78.4.1: The three-tier 2026 frontier model shelf. The reasoning-first tier (OpenAI o3 and o4-mini, GPT-5's adaptive thinking, Claude Opus 4.5 extended thinking, Gemini 2.5 Pro Deep Think, DeepSeek-R1, Qwen3 with enable_thinking, Grok 4) trades latency and cost for hard-task accuracy through test-time compute scaling. The long-context multimodal tier (Gemini 2.5 Pro with 2M-token windows, Claude Opus 4.5 with 1M, GPT-5 native multimodal) plus world models (Genie 2, V-JEPA 2, Cosmos) target documents, video, and embodied workloads. The open-weights tier (Llama 4, Qwen3, DeepSeek-V3.1, Kimi K2, GLM-4.6, GPT-OSS, Nemotron) is the self-host path, with Stargate's $500B infrastructure commitment and DeepSeek-V3's roughly $5.6M base-training cost bracketing the spread.

78.4.1 Reasoning-first models

Key Insight
Hybrid reasoning toggles are the 2025-26 architectural trend

The single most important pattern across the 2025-26 frontier is the unification of "fast" and "thinking" modes under one model. Qwen3 exposes enable_thinking=True on the chat template; Claude 3.7 Sonnet and the Claude 4 family ship extended thinking; GPT-5 routes adaptively via an internal think-or-answer gate; Gemini 2.5 Pro adds Deep Think. This architectural convergence is what makes 2025-26 distinct from 2024. GRPO, the reinforcement-learning method popularized by DeepSeek-R1 (2025), is the open recipe behind much of this; see also Section 19.6 for the foundational papers.
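The core idea of GRPO can be illustrated with its group-relative advantage: several completions are sampled for the same prompt, each receives a scalar reward, and each completion's advantage is its reward standardized against the group. The following is a minimal sketch of that one step only, not the full training loop (no policy-ratio clipping or KL penalty); the function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    rewards: scalar rewards for completions sampled from the SAME prompt.
    eps: small constant (an assumption here) to avoid division by zero
         when every completion in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two solved the task, two did not.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed; completions that beat their siblings get a positive advantage and the rest get a negative one.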

78.4.1.5 World models for video and robotics

The 2025 frontier also expanded beyond text, into world models for video and robotics. Genie 2 (DeepMind), V-JEPA 2 (Meta), and Cosmos (NVIDIA, 2025) together represent this frontier: models that learn the dynamics of physical environments well enough to plan in them. These are not text models with vision tacked on; they are foundation models built directly for video and robotics.

78.4.2 Long-context and multimodal frontier

78.4.3 Open-weights frontier

Real-World Scenario
Stargate and the cost of staying on the frontier

The January 2025 Stargate announcement (a $500B infrastructure commitment from OpenAI, SoftBank, and Oracle for US data centers over four years) is the economic anchor for the "frontier requires capital" thesis. Paired with the roughly $5.6M base-training cost that the DeepSeek-V3 technical report disclosed a few weeks earlier, the two numbers bracket the 2025 cost spectrum: a ratio of roughly 90,000 to 1, close to five orders of magnitude, separates the closed-frontier path from the open-recipe path. The figures measure different things (a multi-year infrastructure commitment versus the compute cost of a single training run), so treat them as the reference endpoints for any "how much does the frontier cost" discussion through 2026, not as a like-for-like comparison.
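The spread between the two headline figures is simple arithmetic and worth checking directly. The sketch below uses only the two dollar amounts quoted in this section; the variable names are illustrative.

```python
import math

# Headline figures as quoted in this section (USD).
STARGATE_COMMITMENT_USD = 500e9    # multi-year infrastructure commitment
DEEPSEEK_V3_TRAINING_USD = 5.6e6   # disclosed base-training compute cost

ratio = STARGATE_COMMITMENT_USD / DEEPSEEK_V3_TRAINING_USD
orders_of_magnitude = math.log10(ratio)

print(f"ratio ~ {ratio:,.0f}x")
print(f"orders of magnitude ~ {orders_of_magnitude:.2f}")
```

The ratio comes out near 89,000, which is just under five orders of magnitude.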

78.4.4 Comparing the frontier models

Table 78.4.1a: Frontier models (mid-2026).

| Model | Released | Frontier on | Access | Notes |
|---|---|---|---|---|
| OpenAI o3 / o4-mini | 2024-25 | Hard reasoning | API | Test-time compute model |
| GPT-5 family | Aug 2025 | Generalist + adaptive reasoning | API | Internal think-or-answer gate |
| Claude Opus 4.5 | 2025 | SWE-bench, long-context coding | API | Extended thinking mode |
| Gemini 2.5 Pro | March 2025 | Multimodal, 2M context, Deep Think | API | Largest context window |
| Grok 4 | July 2025 | Real-time web search | API | xAI |
| DeepSeek-R1 | Jan 2025 | Open reasoning | Open weights (MIT) | GRPO recipe; the open-replication anchor |
| Qwen3-235B | April 2025 | Multilingual MoE, hybrid reasoning | Open weights (Apache 2.0) | enable_thinking toggle |
| Kimi K2 | 2025 | Open-weight scale (1T MoE) | Open weights | 32B active |
| Llama 4 family | 2025-Q2 | Open frontier | Open weights (LCL) | Scout / Maverick / Behemoth |

Key Insight: Reasoning is the 2025-2026 frontier

The most important capability change of 2024-2026 was the shift from models that answer in a single pass of next-token prediction to models that spend test-time compute on reasoning before they answer. The cost-per-task economics changed substantially: more tokens per task, but higher success on the hardest tasks. Whether this trend extends or saturates is the open research question Part XII discusses.
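The trade-off between more tokens per task and higher success can be made concrete as a cost per solved task. The sketch below assumes independent retries, so the expected number of attempts is 1 / success_rate; the function name and every number in it (token counts, price, success rates) are hypothetical placeholders, not measurements of any real model.

```python
def cost_per_solved_task(tokens_per_attempt, usd_per_million_tokens, success_rate):
    """Expected spend to get one correct answer, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * usd_per_million_tokens
    return cost_per_attempt / success_rate


# Hypothetical hard task, same per-token price for both modes.
fast = cost_per_solved_task(
    tokens_per_attempt=1_000, usd_per_million_tokens=10.0, success_rate=0.05
)
thinking = cost_per_solved_task(
    tokens_per_attempt=10_000, usd_per_million_tokens=10.0, success_rate=0.80
)

print(f"fast mode:     ${fast:.3f} per solved task")
print(f"thinking mode: ${thinking:.3f} per solved task")
```

With these placeholder numbers the thinking mode uses ten times the tokens per attempt yet is cheaper per solved task, because the fast mode fails most of the time. On easy tasks, where both modes succeed almost always, the comparison flips.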

Real-World Scenario: vLLM's 2025 inflection

vLLM started as a Berkeley research project (first released in June 2023, with the PagedAttention paper following at SOSP 2023). Through 2024 it grew steadily as an open-source serving library. In Q1-Q2 2025 it crossed an inflection: enterprise adoption (Anyscale, Databricks, NVIDIA Triton) brought engineering resources, and the project shipped FP8 / FP4 quantization paths, multi-LoRA serving, speculative decoding (Medusa/EAGLE), and disaggregated prefill/decode. GitHub stars went from ~25k in early 2025 to ~50k by mid-2026, but the more meaningful signal was that the top-five LLM-hosting companies all standardized on vLLM as their default open-weight serving stack. By Q1 2026 vLLM had surpassed TGI (Hugging Face's serving library) as the open-source production default. The lesson: a project's value is the layer of infrastructure that uses it, not its star count. The vLLM repo and the vLLM blog document the trajectory.

Warning: GitHub stars are vanity

Stars correlate with awareness, not with adoption or quality. Many high-star repos (15-30k stars) are research demos with no production users; many production-critical infrastructure projects (vLLM in early 2024, SGLang in 2025) had relatively modest star counts while running serving for major labs. The better signals: (1) named enterprise users with case studies; (2) the rate of commits in the last 90 days; (3) issue-resolution time; (4) presence in adjacent production stacks (NVIDIA Triton, AWS Bedrock, etc.). Treat stars as a noisy proxy at best.
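Two of the signals above, commit rate over the last 90 days and issue-resolution time, are easy to compute once the raw dates are in hand. The sketch below works on plain Python dates and deliberately does not call any hosting API; the function names and the sample data are hypothetical.

```python
from datetime import date, timedelta
from statistics import median


def commits_in_window(commit_dates, today, days=90):
    """Count commits made within the last `days` days, inclusive of today."""
    cutoff = today - timedelta(days=days)
    return sum(1 for d in commit_dates if cutoff <= d <= today)


def median_resolution_days(issues):
    """Median days from open to close; still-open issues (closed=None) are skipped."""
    durations = [(closed - opened).days for opened, closed in issues if closed is not None]
    return median(durations) if durations else None


# Hypothetical project history.
today = date(2026, 6, 1)
commit_dates = [date(2026, 5, 20), date(2026, 4, 1), date(2025, 12, 1)]
issues = [
    (date(2026, 5, 1), date(2026, 5, 3)),    # closed after 2 days
    (date(2026, 4, 10), date(2026, 4, 20)),  # closed after 10 days
    (date(2026, 5, 25), None),               # still open
]

recent_commits = commits_in_window(commit_dates, today)
resolution = median_resolution_days(issues)
print(recent_commits, resolution)
```

Skipping open issues flatters a project with a large stale backlog, so in practice the open-issue count is worth reporting alongside the median.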

Key Takeaways

What's Next?

In the next section, Section 78.5: External Reading & Communities, we build on the material covered here.

Further Reading
OpenAI o-series (o3, GPT-5 variants).
Claude Opus 4.5 / Sonnet 4.5 (the Claude 4 family launched May 2025).
Gemini 2.5 Pro with Deep Think (DeepMind, March 2025).
DeepSeek-R1: GRPO from cold-start.
vLLM: the open-source production serving stack.
llama.cpp: the cross-platform inference engine.
MLX: Apple Silicon inference.