Models

Section 14.4

"We split the world into closed APIs, open weights, and the thing you wrote last weekend on Modal. Each tier solves one problem and creates three."

FrontierFrontier, Tier-Aware AI Agent
Big Picture

The LLMs you call in Part III split into three tiers by access mode: closed APIs (GPT, Claude, Gemini, the frontier you rent), open weights (Llama, Mistral, Qwen, DeepSeek, the foundation you can host), and customised checkpoints (your fine-tune, your distillation, your LoRA on top of a base). This section tells you which tier earns which call inside an agent or RAG system.

Prerequisites

This section assumes the frontier-model lineage from Section 11.1 through Section 11.3, the open-weights model zoo from Section 10.10, and the fine-tuning recipes from Section 13.1.

The models you call in Part III split into three tiers by access mode. The first is closed-API frontier (the GPT-5 family, Claude Opus 4.5, Gemini 2.5 Pro and successors). The second is open-weight production tier (Llama-3.3 70B and Llama 4 family, Qwen3, DeepSeek-V3.1 / R1, Kimi K2) served either through an aggregator or through your own vLLM / Ollama instance. The third is the small fast tier (Mistral Small, Gemini 2.5 Flash, Claude Haiku 4.5, GPT-5-mini, o4-mini) where latency and cost win over quality. This section walks all three and names the per-token cost order of magnitude as of mid-2026.

Treat the prices below as approximate. Provider pricing changes monthly and varies by region; Artificial Analysis is the right place to check current numbers. What stays roughly stable is the ratio: flagship models cost 10 to 50x what their small siblings cost, and self-hosted open-weight inference cost depends almost entirely on your hardware utilization.

14.4.1 Closed-API flagships

Key Insight
Hybrid reasoning toggles change how you prompt in 2026

The 2025 frontier converged on a single architectural pattern: one model that can either answer fast or "think harder" via extended chain-of-thought. Claude 3.7+ exposes a thinking parameter; GPT-5 routes adaptively through an internal think-or-answer gate; Gemini 2.5 Pro adds Deep Think. The practical shift for API-callers: you now choose a reasoning budget per request. Default thinking off for retrieval and summarization; default thinking on for math, code, and multi-step planning. Thinking tokens are 3-5x the price of completion tokens, so the routing decision matters for cost.

14.4.2 Closed-API small / fast tier

14.4.3 Open-weight models served via API

14.4.4 Comparing the API-callable model lineup

Table 14.4.1: API-callable models in mid-2026 (approximate; check Artificial Analysis for current numbers).
ModelProviderIn/out per 1M tokThinking tokens?ContextBest for
GPT-5OpenAI$3 / $15Yes (premium)400KReasoning, code
o3OpenAI$10 / $40 (plus thinking premium)Yes, always200KHeavy reasoning
Claude Opus 4.5Anthropic$15 / $75Yes (premium)200K-1MAgentic, long writing
Gemini 2.5 ProGoogle$1.25 / $10Yes (Deep Think)1M-2MMultimodal, long context
GPT-5-miniOpenAI$0.25 / $1.25No400KHigh-volume cheap calls
o4-miniOpenAI~$0.55 / $2.20Yes, always200KCheap reasoning
Claude Haiku 4.5Anthropic$1 / $5No200KBackground tasks
Gemini 2.5 FlashGoogle$0.15 / $0.60No1MCheapest serious model
Llama-3.3 70BOpenRouter / Together$0.30 / $0.30No128KCheap open-weight
DeepSeek-V3.1 / R1DeepSeek / OpenRouter$0.14-$0.27 / $0.28-$1.10R1: yes128KCheap reasoning
Scatter plot of API-callable models
Figure 14.4.1a: API-callable models on the price-quality Pareto frontier (mid-2026). Models below the frontier are dominated. Default routing should sit at the cheap end of the frontier; escalate to flagships only when your own eval shows a real quality gap. Input price per million tokens on log x-axis, LM Arena Elo quality on y-axis. Cheap tier (Gemini 2.5 Flash, Llama-3.3 70B, DeepSeek-V3.1, GPT-5-mini) sits at the bottom-left frontier; mid tier (Claude Haiku 4.5, Gemini 2.5 Pro, GPT-5) holds the middle; Claude Opus 4.5 and o3 anchor the top-right. A dashed Pareto frontier traces the price-quality Pareto-optimal points.
Key Insight
The cheap-tier models cleared the bar for most real work in 2025

Through 2023 and most of 2024, the smallest models from each provider were obviously worse: they hallucinated more, refused more, missed instructions. By mid-2025 the gap closed: Gemini Flash, Claude Haiku, and GPT-5-mini handle most production tasks (extraction, classification, simple coding, RAG answer-generation) within 5% of their flagship siblings at 10 to 50x lower cost. Default to the cheap tier; reach for the flagship only when you measure a quality gap on your own task.

Warning
The cheap tier breaks agentic / multi-step workloads at higher rates

The "default to cheap" advice has one important exception: long multi-step agent traces are where cheap-tier models still drop the ball. Tool-use accuracy at the cheap tier is often 70-85% on a per-step basis, which compounds to 30-50% success on 5-step trajectories. The flagship tier holds 90-95% per step and 60-80% end-to-end. If your application is agentic, A/B the flagship vs cheap tier on real multi-step tasks before locking in cheap as the default; the per-call savings can be eaten by retries and partial failures. The 2025 "Stargate" $500B infrastructure announcement and the DeepSeek-R1 $5.6M base-training disclosure together anchor the cost discussion: frontier capability is now ten thousand times more expensive than the open-recipe alternative.

Real-World Scenario
Prompt caching cuts repeated-context cost by 90%

Code Fragment 14.4.1 below shows how the cache_control flag tells Anthropic to cache the 50K-token system prompt for 5 minutes. Subsequent calls within the cache window pay 0.1x the input price for the cached portion. OpenAI implements automatic prompt caching only above a ~1024-token threshold (no flag needed but the threshold matters: short prompts never hit it); Gemini exposes Context Caching via an explicit cache resource. All three SDKs follow the same pattern and the savings are typically 70 to 90% for RAG pipelines.

import anthropic
client = anthropic.Anthropic()
SYSTEM = open("long-system-prompt.md").read()  # 50K tokens, reused

resp = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[{"type":"text","text":SYSTEM,
             "cache_control":{"type":"ephemeral"}}],
    messages=[{"role":"user","content":"summarize section 3"}],
)
Code Fragment 14.4.1b: The "default to cheap" advice has one important exception: long multi-step agent traces are where cheap-tier models still drop the ball.

14.4.5 Picking a default

For Part III exercises: start with Claude Haiku 4.5 or Gemini 2.5 Flash for cheap calls, Claude Opus 4.5 or GPT-5 for the flagship comparisons, and DeepSeek-R1 or DeepSeek-V3.1 through OpenRouter when you need a low-cost reasoning model. If you anticipate hosting your own inference at any scale, the next section (16.5) and Part XIV's full Idea-to-Product Tools of the Trade chapter cover the cost-break math.

What's Next?

In the next section, Section 14.5: External Reading & Communities, we build on the material covered here.

Further Reading

Closed-Source Frontier Models

OpenAI (2024). "GPT-4o System Card." openai.com/index/gpt-4o-system-card. Reference for GPT-4o capabilities and limitations.
Anthropic (2024). "Claude 3.5 Sonnet Model Card." anthropic.com/news/claude-3-5-sonnet. Reference for Claude 3.5 Sonnet.
Google DeepMind (2024). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805. Reference for the Gemini model family.

Open Models

Touvron, H., et al. (2023). "Llama 2." arXiv:2307.09288. Reference open-weight LLM family.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Reference 2024-25 open-weight MoE.