Section 78.4: Models

"The reasoning-first tier (o3, Claude with extended thinking, Gemini 2.5 Pro). The cost-first tier (Haiku, Mini, Flash). The open-weights tier (Llama, Qwen, DeepSeek). Each tier is a deployment thesis."
Frontier, Frontier-Model-Reader AI Agent

Note: Learning Objectives

Distinguish reasoning-first, long-context, multimodal, and open-weight frontier models by access mode and target workload.
Track an open-source project's traction signal without falling for the "GitHub stars" vanity metric.
Pick which frontier model to evaluate for a given product-development question.

Big Picture

The 2025-26 model shelf has three tiers: reasoning-first (o3 and o4-mini, Claude Opus 4.5 with extended thinking, Gemini 2.5 Pro with Deep Think, DeepSeek-R1 and R1-0528, the GPT-5 family's adaptive thinking), long-context multimodal (Gemini 2.5 Pro, Claude Opus 4.5, GPT-5), and open-weights (Llama-3.3 70B / Llama 4 family, Qwen3, DeepSeek-V3.1, Kimi K2, GLM-4.6). Knowing what each tier exists for, and the open-source projects (vLLM, llama.cpp, MLX) that ship them, is most of model selection.

Prerequisites

This section assumes the LLM model-zoo vocabulary from Section 14.4 and the open-weights LLM landscape from Section 10.10.

Three-tier 2026 frontier model shelf: reasoning-first, long-context multimodal, open-weights — **Figure 78.4.1:** The three-tier 2026 frontier model shelf. The reasoning-first tier (OpenAI o3 and o4-mini, GPT-5's adaptive thinking, Claude Opus 4.5 extended thinking, Gemini 2.5 Pro Deep Think, DeepSeek-R1, Qwen3 with enable_thinking, Grok 4) trades latency and cost for hard-task accuracy through test-time compute scaling. The long-context multimodal tier (Gemini 2.5 Pro with 2M-token windows, Claude Opus 4.5 with 1M, GPT-5 native multimodal) plus world models (Genie 2, V-JEPA 2, Cosmos) target documents, video, and embodied workloads. The open-weights tier (Llama 4, Qwen3, DeepSeek-V3.1, Kimi K2, GLM-4.6, GPT-OSS, Nemotron) is the self-host path, with Stargate's $500B infrastructure commitment and DeepSeek-V3's roughly $5.6M base-training cost bracketing the spread.

78.4.1 Reasoning-first models

OpenAI o3 / o4-mini / GPT-5 (Aug 2025): dedicated reasoning models with test-time compute scaling, plus GPT-5's adaptive "think harder" tier.
Claude Opus 4.5 / Sonnet 4.5 (Anthropic, 2025) with extended thinking. Claude 4 family launched May 2025.
Gemini 2.5 Pro with Deep Think (March 2025).
DeepSeek-R1 and R1-0528, DeepSeek-V3.1 (2025): the open reasoning frontier.
Qwen3 (April 2025) with the enable_thinking=True hybrid-reasoning toggle. The older QwQ-32B-Preview (Dec 2024) is now superseded by Qwen3.
Grok 4 (xAI, July 2025): late frontier entrant.
NVIDIA Nemotron-Reasoning (2025): a relevant frontier reasoning release.
Step-1 / Step-2 (StepFun, 2025): Chinese reasoning models worth tracking.

Key Insight

Hybrid reasoning toggles are the 2025-26 architectural trend

The single most important pattern across the 2025-26 frontier is the unification of "fast" and "thinking" modes under one model. Qwen3 exposes enable_thinking=True on the chat template; Claude 3.7+ and Claude 4 ship extended thinking; GPT-5 routes adaptively via an internal think-or-answer gate; Gemini 2.5 Pro adds Deep Think. This is the architectural convergence that makes 2025-26 distinct from 2024. The GRPO recipe from DeepSeek-R1 (2025) is the open recipe behind much of this; see also Section 19.6 for the foundational papers.

78.4.1.5 World models for video and robotics

The 2025 frontier expansion that is not text-only: world models for video and robotics. Genie 2 (DeepMind), V-JEPA 2 (Meta), and Cosmos (NVIDIA, 2025) together represent the world-model frontier: models that learn the dynamics of physical environments well enough to plan in them. These are not text models with vision tacked on; they are foundation models for video and robotics directly.

78.4.2 Long-context and multimodal frontier

Gemini 2.5 Pro: 2M-token context.
Claude Opus 4.5: 1M context.
GPT-4o / GPT-5 with multimodal IO.

78.4.3 Open-weights frontier

Llama-3.3 70B (Dec 2024), Llama 4 family (Scout / Maverick / Behemoth, 2025-Q2), Qwen3 (April 2025), DeepSeek-V3.1 / R1 / R1-0528 (Jan-mid 2025), Kimi K2 (Moonshot, 2025, 1T-MoE), GLM-4.5 / GLM-4.6 (Zhipu, 2025), GPT-OSS (OpenAI's first open-weights release in years, 2025) (covered in Chapter 12). These remain the open shelf in 2026.

Real-World Scenario

Stargate and the cost of staying on the frontier

The January 2025 Stargate announcement (a $500B infrastructure commitment from OpenAI / SoftBank / Oracle for US data centers over five years) is the economic anchor for the "frontier requires capital" thesis. Paired with the DeepSeek-R1 disclosure of ~$5.6M for V3-base training a few weeks earlier, the two numbers bracket the 2025 cost spectrum: ten thousand times more capital separates the closed-frontier path from the open-recipe path. Treat both numbers as ground truth for any "how much does the frontier cost" discussion through 2026.

78.4.4 Comparing the frontier models

Table 78.4.1a: 65.4.1 Frontier models (mid-2026).

Model	Released	Frontier on	Access	Notes
OpenAI o3 / o4-mini	2024-25	Hard reasoning	API	Test-time compute model
GPT-5 family	Aug 2025	Generalist + adaptive reasoning	API	Internal think-or-answer gate
Claude Opus 4.5	2025	SWE-bench, long-context coding	API	Extended thinking mode
Gemini 2.5 Pro	March 2025	Multimodal, 2M context, Deep Think	API	Largest context window
Grok 4	July 2025	Real-time web search	API	xAI
DeepSeek-R1	Jan 2025	Open reasoning	Open weights (MIT)	GRPO recipe; the open-replication anchor
Qwen3-235B	April 2025	Multilingual MoE, hybrid reasoning	Open weights (Apache 2.0)	enable_thinking toggle
Kimi K2	2025	Open-weight scale (1T MoE)	Open weights	32B active
Llama 4 family	2025-Q2	Open frontier	Open weights (LCL)	Scout / Maverick / Behemoth

Key Insight: Reasoning is the 2025-2026 frontier

The most important capability change of 2024-2026 was the shift from "predict the next token" models to test-time-compute / reasoning models. The cost-per-task economics changed substantially: more tokens per task, but higher success on the hardest tasks. Whether this trend extends or saturates is the open research question Part XII discusses.

Real-World Scenario: vLLM's 2025 inflection

vLLM started as a Berkeley research project (PagedAttention paper, June 2023). Through 2024 it was the open-source serving library, growing steadily. In Q1-Q2 2025 it crossed an inflection: enterprise adoption (Anyscale, Databricks, NVIDIA Triton) brought engineering resources, the project shipped FP8 / FP4 quantization paths, multi-LoRA serving, speculative decoding (Medusa/EAGLE), and disaggregated prefill/decode. GitHub stars went from ~25k in early 2025 to ~50k by mid-2026, but the more meaningful signal was that the top-five LLM-hosting companies all standardized on vLLM as their default open-weight serving stack. By Q1 2026 vLLM had surpassed TGI (Hugging Face's serving library) as the open-source production default. The lesson: a project's value is the layer of infrastructure that uses it, not its star count. The vLLM repo and the vLLM blog document the trajectory.

Warning: GitHub stars are vanity

Stars correlate with awareness, not with adoption or quality. Many high-star repos (15-30k stars) are research demos with no production users; many production-critical infrastructure projects (vLLM in early 2024, SGLang in 2025) had relatively modest star counts while running serving for major labs. The better signals: (1) named enterprise users with case studies; (2) the rate of commits in the last 90 days; (3) issue-resolution time; (4) presence in adjacent production stacks (NVIDIA Triton, AWS Bedrock, etc.). Treat stars as a noisy proxy at best.

Key Takeaways

The 2026 model shelf splits cleanly: reasoning-first, long-context multimodal, and open-weights.
Reasoning-first models trade more tokens per task for higher success on the hardest tasks.
vLLM's 2025 inflection from research project to default open-source production serving illustrates the difference between awareness (stars) and adoption (named users).

What's Next?

In the next section, Section 78.5: External Reading & Communities, we build on the material covered here.

Further Reading

OpenAI o-series (o3, GPT-5 variants).

Claude Opus 4.5 / Sonnet 4.5 (the Claude 4 family launched May 2025).

Gemini 2.5 Pro with Deep Think (DeepMind, March 2025).

DeepSeek-R1: GRPO from cold-start.

vLLM: the open-source production serving stack.

llama.cpp: the cross-platform inference engine.

MLX: Apple Silicon inference.