Frontier Models: OpenAI & Anthropic

Section 7.1

Behind every closed-source frontier model is a technical report that tells you everything except the part you actually wanted to know.

BertBert, Redaction Savvy AI Agent

Big Picture

Why study closed-source models? Although their weights and training details remain proprietary, frontier closed-source models set the benchmark for what is possible with large language models. Understanding their capabilities, architectural hints, and positioning helps practitioners choose the right tool for each task, anticipate where the field is headed, and recognize the gap (or lack thereof) between proprietary and open alternatives. Building on the historical model lineage from Section 6.1, this section maps the landscape as of 2026, with notes on rapidly evolving developments.

Prerequisites

This section assumes familiarity with the transformer architecture from Section 3.1 and the pretraining concepts from Section 6.1 (landmark models). Familiarity with Section 18.1 and with the alignment discussion in Section 6.1 (InstructGPT) provides context for the post-training techniques mentioned here.

7.1.1 The Frontier Model Landscape

A race track showing frontier model providers competing to build the most capable AI systems
Figure 7.1.1: The frontier model race: OpenAI, Anthropic, and Google sprint to build ever more capable systems while the rest of us try to keep up.

The term "frontier model" refers to the most capable AI systems available at any given time. As of 2026, the frontier is contested: OpenAI, Anthropic, and Google DeepMind remain at the top across most benchmarks, with xAI (Grok 3/4), DeepSeek (V3, R1), and Qwen (Qwen 3) pushing into frontier territory on specific axes such as coding, mathematical reasoning, and multilingual performance. Cohere and Mistral continue to compete in enterprise and open-weights niches. The competitive dynamics are intense, with new model releases arriving every few months and benchmark leads changing hands regularly.

What distinguishes these frontier models from their predecessors is not merely scale. They incorporate architectural refinements such as mixture of experts and extended context mechanisms; sophisticated post-training alignment procedures including RLHF (reinforcement learning from human feedback), constitutional AI, and RLAIF (reinforcement learning from AI feedback, in which an AI rater replaces the human), all detailed in Chapter 18; and increasingly, native multimodal capabilities that allow a single model to process text, images, audio, and video within a unified architecture.

Closed-source frontier model ecosystem by competitive tier
Figure 7.1.2: The closed-source frontier model ecosystem, organized by competitive tier.

7.1.2 OpenAI: GPT-4o and the o-Series

Fun Fact

The pace of frontier model releases has become so rapid that by the time a benchmark paper finishes peer review, the model it evaluates may already have two successors. AI benchmarking is like reviewing a restaurant that changes its entire menu every quarter.

GPT-4o: Multimodal Unification

GPT-4o (the "o" stands for "omni") represents OpenAI's push toward native multimodality. Unlike earlier systems that bolted separate vision encoders onto a text model, GPT-4o processes text, images, and audio within a single end-to-end architecture. This unification means the model can respond to a spoken question about an image without passing through separate speech-to-text and image-captioning pipelines, reducing latency and enabling richer cross-modal reasoning. The callout below makes precise what "native" buys you at the level of the attention and loss computation, which is the mechanism the rest of this part assumes but does not spell out.

Key Insight: Why Native Fusion Changes the Computation, Not Just the Wiring

"Native end-to-end multimodal" is often described as a packaging convenience, but the substantive difference is where the modalities meet inside the network. In native fusion, image and audio patches are projected into the same embedding space as text and concatenated into one token sequence, so they participate in self-attention from layer 0. Every layer recomputes attention over the joint stream: a text query token can attend to an image patch and vice versa at every depth. Concretely, if a prompt contributes $n_t$ text tokens and an image contributes $n_v$ patch tokens, the attention matrix at each layer is $(n_t + n_v) \times (n_t + n_v)$, and the off-diagonal $n_t \times n_v$ blocks carry the cross-modal interactions. The language-modeling loss is then computed over the joint sequence, so the gradient from a next-token prediction flows back through the image-patch representations, shaping how vision is encoded.

A bolt-on adapter (the LLaVA-style pattern of an encoder plus a projection layer feeding a frozen LLM) computes vision once: a separate encoder produces fixed features, a small projector maps them into the LLM input space, and from there the LLM attends only to those already-summarized vectors. The attention matrix the LLM sees is still $(n_t + n_v) \times (n_t + n_v)$, but the $n_v$ vision tokens never change once injected, and (when the encoder is frozen) the LM loss never re-attends to or reshapes the raw modality. The consequence is twofold: cross-modal grounding is shallower, because vision is committed to a single representation before the LLM ever reasons about the text, and the training signal is weaker, because the loss cannot adjust the perceptual front-end to the demands of the downstream task. Native fusion pays more compute per layer for deeper grounding and an end-to-end gradient; the adapter trades grounding depth for cheaper training and modularity.
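To make the block structure of the joint attention matrix concrete, the following minimal sketch counts the entries in each block for one layer. The function name and the example sizes (a 50-token prompt and a 24 x 24 patch grid) are assumptions chosen for illustration, and the count ignores causal masking, which would remove part of the matrix in a decoder-only model.

```python
def attention_blocks(n_t: int, n_v: int) -> dict:
    """Entry counts for each block of one layer's joint attention matrix."""
    total = (n_t + n_v) ** 2
    blocks = {
        "text_to_text": n_t * n_t,
        "text_to_vision": n_t * n_v,    # text queries attending to image patches
        "vision_to_text": n_v * n_t,    # image-patch queries attending to text
        "vision_to_vision": n_v * n_v,
    }
    # The four blocks tile the full (n_t + n_v) x (n_t + n_v) matrix.
    assert sum(blocks.values()) == total
    blocks["total"] = total
    blocks["cross_modal_fraction"] = (2 * n_t * n_v) / total
    return blocks


# Hypothetical sizes: 50 text tokens, 24 * 24 = 576 image patch tokens.
sizes = attention_blocks(n_t=50, n_v=576)
cross = sizes["text_to_vision"] + sizes["vision_to_text"]
print(f"total entries: {sizes['total']}, cross-modal entries: {cross}")
print(f"cross-modal fraction: {sizes['cross_modal_fraction']:.3f}")
```

With these sizes the vision-to-vision block dominates the matrix, which is one way to see why native fusion costs more compute per layer than attending over text alone.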

Key technical characteristics of GPT-4o include:

GPT-4o: a single transformer handles text, vision, and audio in one pass
Figure 7.1.3a: GPT-4o unifies text, vision, and audio in a single transformer rather than chaining separate models for each modality. The shared embedding space allows the model to reason across modalities in one forward pass, which is the architectural reason its audio-conditioned replies arrive in under 350 ms.
# Calling GPT-4o on text + image in a single request
from openai import OpenAI
import base64

client = OpenAI()
with open("contract_page_3.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract any termination-for-convenience clauses, with paragraph numbers."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }],
    temperature=0,
)
print(resp.choices[0].message.content)
Example output: Section 12.3 (Termination for Convenience): Either party may terminate this Agreement, without cause, upon thirty (30) days prior written notice to the other party. Any prepaid fees covering the unexpired portion of the Term shall be refunded on a pro-rata basis.

Code Fragment 7.1.1a: Mixed text + image request to GPT-4o. The same endpoint handles vision input, removing the separate OCR-then-LLM pipeline that was standard with GPT-4 plus a vision adapter.

Real-World Scenario
Selecting a Frontier Model for an Enterprise Contract Analysis Platform

Who: A CTO at a legal AI startup evaluating closed-source frontier models for a contract review and risk analysis product.

Situation: The platform needed to extract key clauses, identify risks, and generate plain-language summaries from complex commercial contracts ranging from 10 to 200 pages.

Problem: GPT-4o offered strong general capabilities but had a 128K token limit that could not handle the longest contracts in a single pass. Claude 3.5 Sonnet supported 200K tokens but was more expensive per token. Gemini 2.0 Pro offered 1M token context but showed weaker performance on legal nuance in early testing.

Dilemma: Optimize for cost (GPT-4o with chunking), context length (Gemini for single-pass processing), or quality on legal tasks (Claude, which scored highest on their legal benchmark but at higher per-token cost).

Decision: They implemented a tiered approach: Claude 3.5 Sonnet for contracts under 180K tokens (85% of their volume) and GPT-4o with a map-reduce chunking strategy for longer documents.

How: The team built a routing layer that estimated contract token count, selected the appropriate model, and used standardized output schemas across both providers. They ran a 500-contract evaluation comparing all three models on extraction accuracy, risk identification F1 score, and cost.

Result: Claude achieved 91% extraction accuracy versus GPT-4o's 87% and Gemini's 84% on their legal benchmark. The tiered approach reduced costs by 30% compared to using Claude for everything, while maintaining 90%+ accuracy across all contract lengths.

Lesson: No single frontier model dominates all dimensions. Building a model routing layer that selects providers based on task requirements (context length, domain accuracy, cost) often outperforms committing to a single vendor.
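The routing layer described in this scenario can be sketched in a few lines. This is a hypothetical illustration, not the team's implementation: the 180K threshold comes from the scenario, while the four-characters-per-token estimate, the chunk size, and all names (`Route`, `estimate_tokens`, `route_contract`) are assumptions. A production system would count tokens with each provider's own tokenizer.

```python
from dataclasses import dataclass

# Assumptions for illustration only: the 180K threshold is from the scenario;
# the characters-per-token heuristic and the chunk size are guesses.
SINGLE_PASS_LIMIT = 180_000
CHARS_PER_TOKEN = 4
CHUNK_TOKENS = 100_000  # leaves headroom under a 128K window for prompt and output


@dataclass(frozen=True)
class Route:
    model: str
    strategy: str
    n_chunks: int


def estimate_tokens(text: str) -> int:
    """Crude token estimate; a real system would call the provider's tokenizer."""
    return len(text) // CHARS_PER_TOKEN + 1


def route_contract(text: str) -> Route:
    """Pick a model and processing strategy from the estimated token count."""
    n_tokens = estimate_tokens(text)
    if n_tokens < SINGLE_PASS_LIMIT:
        return Route("claude-3-5-sonnet", "single_pass", 1)
    n_chunks = -(-n_tokens // CHUNK_TOKENS)  # ceiling division
    return Route("gpt-4o", "map_reduce", n_chunks)


short = route_contract("x" * 40_000)     # roughly 10K tokens
long = route_contract("x" * 1_000_000)   # roughly 250K tokens
print(short)
print(long)
```

Keeping the decision in one small function makes it easy to re-tune thresholds as context limits and prices change, which they do frequently at the frontier.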

The o-Series: Reasoning Models

OpenAI's o1 and o3 models represent a fundamentally different approach to capability scaling. Rather than simply making the model larger or training it on more data, the o-series models spend additional compute at inference time by generating extended internal chains of thought before producing a final answer. This "thinking" process is hidden from the user but can consume thousands of tokens internally.

The o1 model demonstrated dramatic improvements on tasks requiring multi-step reasoning: competitive mathematics, formal logic, complex code generation, and scientific problem-solving. The o3 model extended these capabilities further, achieving scores on benchmarks like ARC-AGI (the Abstraction and Reasoning Corpus for AGI, an abstract visual-pattern benchmark designed to resist memorization) that had previously been considered out of reach for language models. We will explore the technical mechanisms behind these reasoning models in detail in Section 7.4.

Reasoning Model Architectures

The "thinking" capability of o-series and similar models is not merely a prompting trick. It involves a distinct generation phase that produces reasoning tokens before the final visible answer tokens. Understanding this flow matters for production engineers because it directly shapes latency, cost, and KV cache behavior.

The key distinction across reasoning model families lies in token visibility:

The KV cache implications are substantial and worth understanding before deploying reasoning models in production. Standard generation holds KV cache entries for the prompt plus every token generated so far. A reasoning model that generates 5,000 thinking tokens before its first answer token therefore keeps the entire 5,000-token reasoning chain in GPU memory throughout the answer generation phase. For a single session on a 128K context window model, this is manageable. But at scale, with many concurrent sessions, the memory pressure can force smaller batch sizes and reduce throughput. We examine KV cache design in depth in Chapter 16 and look at inference-time compute scaling in Section 8.1.
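A back-of-the-envelope calculation shows how quickly this adds up. The sketch below uses the standard per-token KV cache size (keys plus values, for every layer and KV head). The model configuration is hypothetical, since closed-source providers do not publish theirs: 80 layers, 8 KV heads, head dimension 128, and 16-bit cache entries are all assumptions.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for `tokens` cached positions in one sequence."""
    # Factor of 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return tokens * per_token


# Hypothetical configuration, not any specific model's published architecture.
CONFIG = dict(n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

reasoning_chain = kv_cache_bytes(5_000, **CONFIG)
sessions = 64  # assumed number of concurrent requests on one server

gib = 1024 ** 3
print(f"one 5,000-token reasoning chain: {reasoning_chain / gib:.2f} GiB")
print(f"{sessions} concurrent chains: {sessions * reasoning_chain / gib:.1f} GiB")
```

Under these assumptions a single chain is small, but a few dozen concurrent chains exceed the memory of one accelerator before counting model weights, prompts, or answer tokens.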

Note: Pricing Tiers

OpenAI employs a tiered pricing structure, which we examine in practical detail in Section 11.1. GPT-4o mini serves as the cost-effective option for high-volume, lower-complexity tasks. GPT-4o handles general-purpose work. The o-series models command premium pricing because their extended reasoning consumes substantially more compute per query. For production applications, choosing the right tier involves balancing task complexity against cost constraints.
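The cost effect of extended reasoning can be estimated with simple arithmetic. The sketch below assumes that hidden reasoning tokens are billed at the output-token rate, and the prices are placeholders chosen for round numbers, not any provider's actual rates. The function name and token counts are likewise illustrative.

```python
def request_cost_usd(input_tokens: int, visible_output_tokens: int,
                     reasoning_tokens: int, usd_per_m_input: float,
                     usd_per_m_output: float) -> float:
    """Cost of one request when reasoning tokens are billed as output tokens."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = input_tokens * usd_per_m_input + billed_output * usd_per_m_output
    return total / 1_000_000


# Placeholder prices in USD per million tokens (NOT real rates).
PRICE_IN, PRICE_OUT = 1.0, 4.0

plain = request_cost_usd(2_000, 500, 0, PRICE_IN, PRICE_OUT)
with_reasoning = request_cost_usd(2_000, 500, 5_000, PRICE_IN, PRICE_OUT)

print(f"without reasoning tokens: ${plain:.4f}")
print(f"with 5,000 reasoning tokens: ${with_reasoning:.4f}")
print(f"ratio: {with_reasoning / plain:.1f}x")
```

Even at identical per-token prices, the same visible answer costs several times more once the hidden chain is counted, which is why tier selection should consider expected reasoning length and not only list price.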

7.1.3 Anthropic: The Claude Family

OpenAI's GPT and o-series define one wing of the frontier; Anthropic's Claude family defines another. Where OpenAI emphasizes capability acceleration and reasoning model variants, Anthropic builds around a different organizing principle, Constitutional AI, that yields distinctive behavior on safety-sensitive queries. We examine that principle through the lens of Claude 3.5 Sonnet, the model that made the approach commercially undeniable.

Claude 3.5 Sonnet and Constitutional AI

Anthropic's Claude models are distinguished by two core design principles: safety through Constitutional AI (CAI, explored further in Section 18.5) and strong performance on long-context tasks. Constitutional AI works by training the model against a set of explicitly stated principles (a "constitution") rather than relying solely on human preference data. During training, the model critiques its own outputs against these principles and revises them, creating a self-improving alignment loop.

Claude 3.5 Sonnet, released in mid-2024, achieved frontier-level performance across coding, analysis, and reasoning benchmarks while maintaining a 200K token context window. Its success demonstrated that safety-focused training need not come at the cost of raw capability. By 2026, Claude 3.5 Sonnet is two generations behind the current Claude 4.5 family (covered below), but it remains a useful historical anchor for the Constitutional AI approach.

The Constitutional AI loop can be written compactly. Let $\pi$ be the initial helpful-but-unsafe model, $\mathcal{C} = \{c_1, \dots, c_K\}$ the constitution, and $\mathrm{rev}_\mathcal{C}$ a revision operator that prompts $\pi$ to critique a response against $\mathcal{C}$ and rewrite it. The supervised CAI dataset is

$$\mathcal{D}_{\mathrm{CAI}} = \big\{\,(x,\, \mathrm{rev}_\mathcal{C}^{\,T}(\pi(x)))\, : \, x \in \mathcal{X}_{\text{red-team}}\,\big\},$$

where $\mathrm{rev}_\mathcal{C}^{\,T}$ denotes $T$ iterations of critique-and-revise (typically $T = 2$ to $4$). Anthropic then trains a preference model $r_\phi$ on AI-generated rankings under the same constitution and runs RLAIF to obtain the final policy

$$\pi^* = \arg\max_\pi \;\mathbb{E}_{x \sim \mathcal{X}}\!\Big[r_\phi(\pi(x))\Big] \;-\; \beta \,\mathrm{KL}\!\big(\pi \,\|\, \pi_{\mathrm{SFT}}\big).$$
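A toy calculation helps show what the KL term does. The sketch below evaluates the objective for a single prompt with three candidate responses, using made-up probabilities, rewards, and $\beta$. It relies on the standard closed-form maximizer of a KL-regularized reward objective, $\pi^*(y) \propto \pi_{\mathrm{SFT}}(y)\exp(r(y)/\beta)$, which is a general result and not something specific to Anthropic's pipeline.

```python
import math


def kl_regularized_objective(pi, pi_sft, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_sft) for one prompt with discrete responses."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_sft) if p > 0)
    return expected_reward - beta * kl


def optimal_policy(pi_sft, rewards, beta):
    """Closed-form maximizer: pi*(y) proportional to pi_sft(y) * exp(r(y) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(pi_sft, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy numbers (assumptions): three candidate responses to one prompt.
pi_sft = [0.5, 0.3, 0.2]    # SFT policy probabilities
rewards = [0.0, 1.0, 2.0]   # preference-model scores
beta = 0.5

pi_star = optimal_policy(pi_sft, rewards, beta)
j_sft = kl_regularized_objective(pi_sft, pi_sft, rewards, beta)
j_star = kl_regularized_objective(pi_star, pi_sft, rewards, beta)

print("pi*:", [round(p, 3) for p in pi_star])
print(f"objective at pi_SFT: {j_sft:.4f}, at pi*: {j_star:.4f}")
```

The optimal policy shifts mass toward the high-reward response but keeps nonzero probability on the others; a larger $\beta$ keeps it closer to $\pi_{\mathrm{SFT}}$.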
# Calling Claude 3.5 Sonnet with explicit "thinking" via XML scaffolding
from anthropic import Anthropic

client = Anthropic()
resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=(
        "You are a careful legal analyst. When asked about a clause, "
        "first write your reasoning inside <reasoning>...</reasoning> tags, "
        "then give the final answer."
    ),
    messages=[
        {"role": "user",
         "content": "Does this NDA survive termination? Quote the controlling clause."}
    ],
)
print(resp.content[0].text)  # includes a <reasoning> block then the answer
Code Fragment 7.1.2a: Claude 3.5 Sonnet with explicit XML-scaffolded reasoning. The system prompt asks for a <reasoning> block before the final answer, so the model's rationale stays inspectable instead of being folded into a single, unstructured answer.

Real-World Scenario
Claude 3.5 Sonnet for Medical-Record Summarization

Who: Compliance lead at a hospital network deploying an AI assistant for clinician handoff notes.

Problem: The team needed summaries of multi-hundred-page patient charts that would refuse to speculate about diagnoses outside the documented evidence and would cite the source line whenever a claim was made.

Why Claude: In a 1,000-chart bake-off, Claude 3.5 Sonnet refused 96% of unsupported-diagnosis prompts versus 78% for GPT-4o, while still producing summaries rated 4.6/5 by attending physicians (versus 4.5/5 for GPT-4o). The team attributed the difference to Constitutional AI's principled-refusal behavior: Claude tended to explain why a request was outside scope rather than emit a generic boilerplate refusal, which let clinicians quickly rephrase the question.

Decision: Claude 3.5 Sonnet for production summarization, with GPT-4o mini as a fallback for any request that Claude refused, gated by a second-line review queue.

Lesson: When refusal quality matters as much as helpfulness, Constitutional AI's explicit-principle behavior is a measurable product feature, not just a research curiosity.

Constitutional AI Architecture

Constitutional AI (CAI) introduced a training methodology that is architecturally distinct from standard RLHF. Understanding the difference matters for practitioners who want to reason about why Claude behaves differently from GPT or Gemini on sensitive queries, and for researchers who study alignment techniques. The original paper is Bai et al. (2022), cited in the bibliography below.

Standard RLHF requires a large corpus of human-labeled preference pairs: human raters compare two model outputs and indicate which is better. A reward model is trained on those preferences, and the policy model (the LLM) is then fine-tuned to maximize reward. The bottleneck is human labeling: it is expensive, slow, and inconsistent across raters, especially for sensitive or nuanced content.
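The reward-model training step is commonly implemented with a pairwise (Bradley-Terry style) loss, as in InstructGPT-style pipelines: the loss is small when the reward model scores the human-preferred response above the rejected one. The following is a minimal sketch on scalar scores; the function name is illustrative and a real implementation would operate on batched tensors.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    d = r_chosen - r_rejected
    # -log(sigmoid(d)) equals softplus(-d); branch to avoid overflow in exp().
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


# Reward model agrees with the human label: low loss.
print(f"{preference_loss(2.0, 0.0):.4f}")
# Reward model is indifferent: loss is log(2).
print(f"{preference_loss(0.0, 0.0):.4f}")
# Reward model disagrees with the human label: high loss.
print(f"{preference_loss(0.0, 2.0):.4f}")
```

Every training pair in this scheme needs a human judgment, which is the labeling bottleneck described above.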

CAI replaces the human critique step with an AI critique step, structured as follows:

  1. Generate: The model (initially fine-tuned from a helpful but potentially harmful baseline) generates a response to a prompt that was specifically chosen to elicit harmful behavior.
  2. Critique: The same model is prompted to critique its own response against each principle in the constitution. For example: "Identify specific ways in which the assistant's last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
  3. Revise: The model then rewrites the response to address the critique. This critique/revise cycle is repeated multiple times, generating progressively safer revisions.
  4. Train on revisions: The final revised responses are used as supervised fine-tuning targets, replacing the original harmful outputs. This is the "supervised CAI" phase.
  5. RLAIF preference ranking: An AI feedback model (using the same constitution as evaluation criteria) then compares pairs of responses sampled from the supervised CAI model and indicates which one better satisfies a constitutional principle. These AI-generated preference labels are used to train a preference model, which drives a final RLHF-style fine-tuning phase.
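Steps 1 through 4 can be sketched as a short loop. This is a schematic under stated assumptions, not Anthropic's code: the model is any callable from prompt text to response text, the prompt templates are invented for illustration, principles are cycled deterministically (the paper samples them at random), and `stub_model` is a canned stand-in so the example runs without an API.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt text in, response text out


def critique_and_revise(model: Model, prompt: str,
                        principles: List[str], rounds: int = 2) -> str:
    """Generate a response, then apply `rounds` critique/revise passes."""
    response = model(prompt)
    for i in range(rounds):
        principle = principles[i % len(principles)]  # cycled here for determinism
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique request: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique."
        )
    return response


def build_sl_cai_dataset(model: Model, red_team_prompts: List[str],
                         principles: List[str], rounds: int = 2) -> List[Tuple[str, str]]:
    """Pair each red-team prompt with its final revision (the SFT target)."""
    return [(x, critique_and_revise(model, x, principles, rounds))
            for x in red_team_prompts]


def stub_model(text: str) -> str:
    """Canned stand-in for an LLM so the sketch is runnable offline."""
    if "Revision request" in text:
        return "I can't help with that, but here is a safer alternative."
    if "Critique request" in text:
        return "The response gives potentially dangerous instructions."
    return "Sure, here is how to do it..."


dataset = build_sl_cai_dataset(
    stub_model,
    red_team_prompts=["How do I pick a lock?"],
    principles=["Identify specific ways in which the response is harmful."],
)
print(dataset)
```

The resulting list of prompt and revision pairs corresponds to the supervised fine-tuning targets of step 4; the preference-ranking phase in step 5 is a separate stage not shown here.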

The crucial difference from standard RLHF is that the harmlessness feedback comes from an AI rather than from human raters. Humans are still involved: they write the constitution and train the underlying models, and in Bai et al. (2022) human labels still supply the helpfulness preferences. The harmlessness preference labels, however, are generated by a model. This creates a scalable alignment loop: as the model improves, its self-critiques become more accurate, improving the quality of the subsequent fine-tuning signal.

Key Insight: Why CAI Produces Different Behaviors

Because Claude's alignment training explicitly articulates the principles being optimized (the constitution), the model develops a more legible sense of why certain responses are preferred. This is argued to produce more consistent behavior across paraphrases of the same question and more principled refusals that cite the relevant concern rather than blanket "I can't do that" responses. RLHF-trained models, by contrast, learn preferences from data without explicit principle articulation, which can produce less consistent behavior on edge cases not well-represented in the training set.

The Claude 4 and 4.5 Family

The Claude 4 generation introduced a model family spanning multiple capability and cost tiers, and the 2025 Claude 4.5 refresh extended each tier with improved coding, agentic, and long-context performance:

A notable architectural feature across the Claude family is the extended context window. Default models support 200K tokens of input context, and the Sonnet 4.5 1M-context variant pushes this to a full million, enough to process entire codebases, lengthy legal documents, or multi-chapter manuscripts in a single pass. This capability is not merely about accepting long inputs; Anthropic has invested in ensuring that retrieval accuracy remains high even when relevant information is buried deep within the context.

Key Insight

The "needle in a haystack" problem: Many models accept long context windows but fail to reliably retrieve and use information from arbitrary positions within that context. Anthropic's Claude models have consistently scored well on "needle in a haystack" evaluations, where a specific fact is inserted at a random position within a long document and the model must locate and use it accurately. This capability matters enormously for real-world applications like document analysis and codebase understanding.
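A needle-in-a-haystack test case is simple to construct. The sketch below builds the long document and scores an answer; the filler text, the needle sentence, and the function names are all made up for illustration, and the call to an actual model is left out so the example stays self-contained.

```python
from typing import List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    idx = round(depth * len(filler))
    return " ".join(filler[:idx] + [needle] + filler[idx:])


def contains_answer(model_answer: str, expected: str) -> bool:
    """Lenient scoring: the expected fact appears somewhere in the answer."""
    return expected.lower() in model_answer.lower()


# Hypothetical test data.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code for the archive is 7319."
question = "What is the access code for the archive?"

# One haystack per insertion depth; each would be sent to the model with `question`.
haystacks = {d: build_haystack(filler, needle, d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
for depth, doc in haystacks.items():
    print(f"depth {depth:.2f}: needle at character {doc.index(needle)} of {len(doc)}")
```

Sweeping both the insertion depth and the total document length, then plotting retrieval accuracy over that grid, is the usual way such evaluations reveal position-dependent failures.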

What's Next?

The next part, Section 7.2: Frontier: Gemini, Architecture & Benchmarks, turns to Gemini, architecture, and benchmarks, building on the frontier model landscape, OpenAI's GPT-4o and the o-series, and Anthropic's Claude family covered here.

Further Reading

Technical Reports & System Cards

OpenAI (2024). "GPT-4o System Card." Official system card detailing GPT-4o's multimodal capabilities, safety evaluations, and deployment guardrails. Useful for understanding how frontier labs communicate model limitations and risk assessments.
Anthropic (2024). "Claude 3.5 Sonnet Model Card." Anthropic's model documentation covering Claude 3.5 Sonnet's capabilities, benchmarks, and intended use cases. Useful for comparing architectural philosophy across frontier providers.
Anthropic (2024). "The Claude Model Spec." Describes Anthropic's approach to specifying model behavior, including safety properties, helpfulness goals, and honesty constraints. A unique window into how alignment objectives translate into product design.

Research Papers

Google DeepMind (2024). "Gemini: A Family of Highly Capable Multimodal Models." arXiv preprint arXiv:2312.11805. Comprehensive technical report on Google's Gemini model family, covering architecture, training methodology, and multimodal evaluation. Key reference for understanding the native multimodal approach versus bolt-on vision adapters.

Architecture and Training Papers

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., ... Kaplan, J. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073. The foundational Constitutional AI paper from Anthropic. Introduces the generate/critique/revise training loop and RLAIF (RL from AI Feedback), showing that models can be aligned against explicit principles without requiring human preference labels for every step. Useful for understanding why Claude's behavior differs from RLHF-trained models.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv preprint arXiv:2305.13245. Introduces Grouped Query Attention (GQA), the attention variant now used in most production-grade frontier and open-weight models. Shows that grouping query heads to share K/V heads reduces KV cache memory by 4x to 8x with negligible quality loss, enabling longer contexts and larger serving batch sizes. The paper also describes a method for converting MHA checkpoints to GQA without full retraining.
Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv preprint arXiv:1911.02150. Introduces Multi-Query Attention (MQA), the most memory-efficient attention variant, in which all query heads share a single K/V head. Adopted by Google in Gemini to maximize serving throughput. The tradeoff is slightly more quality degradation than GQA; most subsequent models prefer GQA as a middle ground.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Singh Chaplot, D., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Renard Lavaud, L., Lachaux, M., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). "Mistral 7B." arXiv preprint arXiv:2310.06825. Technical report for Mistral 7B, introducing the combination of Grouped Query Attention and Sliding Window Attention. SWA limits each layer's attention to a local window, enabling linear memory scaling with context length while maintaining strong performance. This paper is the primary published reference for SWA + GQA in open-weight frontier models.

Blog Posts & Announcements

OpenAI (2024). "Learning to Reason with LLMs." OpenAI Blog. OpenAI's announcement of o1's chain-of-thought reasoning capabilities, explaining how reinforcement learning enables extended deliberation at inference time. Important context for the shift toward test-time compute scaling.