Models

Section 40.4

"Frontier chat models all sound the same after the third turn; the differences are vibe, voice, and who you trust with your logs."

Lexica, Brand-Indifferent Conversational AI Agent
Big Picture

The "conversational model" landscape in 2026 splits four ways: frontier chat-tuned APIs (Claude 3.5 / 4.x family, GPT-4o / GPT-5, Gemini 1.5 / 2.5) that dominate Arena Elo and are the default for production-quality assistants; voice-aware models (GPT-4o Realtime, Gemini Live, Amazon Nova Sonic) that take audio in and emit audio out without a cascaded pipeline; open-weights chat models (Llama-3.x/4 chat, Mistral chat, Qwen-Chat, ChatGLM, DeepSeek-Chat) you can host yourself; and persona / character models (Character.AI's in-house models, Pi by Inflection / Microsoft's Pi-derived assistants) tuned specifically for ongoing companionship and persona consistency. This section is the field guide, organized by tier rather than by vendor.

Prerequisites

This section assumes the LLM model zoo from Section 14.4 and the realtime-voice models from Section 39.3.

Conversation, unlike most other tasks, depends as much on style and prosody as on factual correctness; that is why a model can ace MMLU and lose on LMSYS Arena. The selection axes in this section are therefore: Arena Elo or equivalent style score; latency and audio modality; openness (closed API vs open weights); and persona-consistency vs assistant-correctness tuning. Most production picks are made on a combination of (a) Arena rank in your target language, (b) availability of audio modality if relevant, and (c) self-host vs API economics.

40.4.1 Frontier chat-tuned API models

These are the closed-source models that anchor the top of the LMSYS Arena and that most production chat applications default to in 2026. The performance differences within this tier are small and frequently invert across benchmarks; the differentiators are price, latency, audio support, and personality.
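To make "small differences" concrete, the standard Elo expected-score formula converts a rating gap into a head-to-head preference rate. The sketch below is illustrative only: the baseline rating of 1300 and the gaps are assumed values, not leaderboard data, and the Arena's published scores are fit with its own statistical procedure, so treat the mapping as an approximation.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred over model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Illustrative gaps only (assumed values, not real leaderboard ratings).
BASELINE = 1300
for gap in (10, 20, 50, 100):
    p = elo_win_probability(BASELINE + gap, BASELINE)
    print(f"Elo gap {gap:>3}: A preferred in {p:.1%} of head-to-head votes")
```

Under this formula a 20-point gap corresponds to roughly a 53% preference rate, which is why within-tier rankings invert so easily from one benchmark to the next.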

40.4.2 Voice-aware models for realtime conversation

[Illustration: telephone-call split panel. On the left, a robot labelled "cascaded pipeline" holds flash cards labelled STT, LLM, and TTS and looks stressed. On the right, a robot labelled "GPT-4o Realtime" sips coffee, mid-sentence.]
Figure 40.4.1: Cascaded voice pipelines pay the latency tax of intermediate text. Realtime speech-to-speech models avoid the cascade.

Voice-aware models take audio in and emit audio out without a cascaded STT-LLM-TTS pipeline. Their advantages over cascaded pipelines are dramatic: latency drops from ~800ms to ~300ms, the model hears prosody and emotion (and can produce them), and barge-in / overlap is natural rather than a hack atop a turn-based STT.
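The latency gap follows from simple addition: in a cascade each stage must produce output before the next can start, so the stage latencies sum. The sketch below uses hypothetical per-stage numbers, chosen only so that the totals match the rough ~800ms and ~300ms figures above; they are not measurements of any particular system.

```python
# Hypothetical per-stage latencies in milliseconds (assumed for illustration;
# chosen so the cascade sums to roughly the ~800 ms figure cited in the text).
CASCADED_STAGES_MS = {
    "stt_final_transcript": 250,
    "llm_first_token": 350,
    "tts_first_audio": 200,
}
SPEECH_TO_SPEECH_MS = {"model_first_audio": 300}

def time_to_first_audio(stages: dict[str, int]) -> int:
    """Stages run one after another, so their latencies add."""
    return sum(stages.values())

cascaded = time_to_first_audio(CASCADED_STAGES_MS)
direct = time_to_first_audio(SPEECH_TO_SPEECH_MS)
print(f"cascaded: {cascaded} ms, speech-to-speech: {direct} ms, "
      f"saved: {cascaded - direct} ms")
```

Replacing the placeholder values with your own measured per-stage timings shows which stage dominates your budget before you commit to a migration.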

40.4.3 Open-weights chat models

Open-weights chat models are the right choice when data residency, cost-at-scale, model customization (LoRA / continued pretraining / RLHF), or vendor independence matter more than the marginal benchmark gap to the frontier APIs.

40.4.4 Persona and character models

Persona and character models are tuned specifically for ongoing roleplay or companion conversation, where persona consistency over hundreds of turns matters more than encyclopedic accuracy.

Key Insight
Aha Moment: The Engagement-vs-Accuracy Inversion

Character.AI's published 2023 product metrics tell the persona-first story numerically. Average session length on Character.AI was 29 minutes; ChatGPT's average session at the same period was 8 minutes. On factual benchmarks (MMLU, TriviaQA) Character.AI's underlying model lagged GPT-3.5 by an estimated 12-18 percentage points. Yet weekly retention was 60 percent on Character.AI vs roughly 35 percent on ChatGPT. The training objective is the explanation: Character.AI's RLHF reward function explicitly favors persona-consistent, emotionally-engaging continuation over factual correctness; the model is allowed to invent backstory if it stays in character. Same transformer architecture, same scale class, opposite reward shaping, opposite product profile. The lesson: for companion AI, "wrong in a fascinating way" beats "right in a boring way." Picking a frontier assistant model for a companion product is a category error, the same way picking a calculator for a date is.

40.4.5 Picking a chat model in 2026

Table 40.4.1a: Conversational models by tier and pick-when (2026).
| Model | Tier | Strength | Pick when |
|---|---|---|---|
| Claude Opus 4.5 / Sonnet 4.5 | Frontier API | Tool use, steerability | Production assistants, agentic chat |
| Claude 3.5 Sonnet / Haiku | Frontier API | Tonal quality, refusal | Quality / cost trade-off |
| GPT-4o / GPT-4o mini | Frontier API | Multimodal, cost-efficient | Multimodal chat or high-volume |
| GPT-5 / o-series | Frontier API | Reasoning | Reasoning-heavy conversation |
| Gemini 2.5 Pro / Flash | Frontier API | Long context | Document-grounded long chat |
| GPT-4o Realtime / Gemini Live | Voice-aware | Low-latency voice | Realtime voice agents |
| Moshi | Voice-aware open | Open speech-to-speech | Self-hosted voice research |
| Llama-3.3-70B-Instruct | Open chat | Production open chat | Self-host required |
| Qwen2.5-72B-Instruct | Open chat | Multilingual, Chinese | Non-English chat |
| DeepSeek-V3 / R1 | Open chat / reasoning | Cost-efficient reasoning | Self-host with reasoning |
| Character.AI in-house | Persona | Persona consistency, engagement | Consumer character products |

40.4.6 The four selection axes

[Diagram: four-axis decision framework for picking a chat model: quality (Arena Elo), latency and modality, cost and deployment options, and safety and policy controls, with concrete vendor and benchmark examples in each quadrant.]
Figure 40.4.2: Four axes for picking a chat model in 2026: capability tier, hosting and privacy posture, latency and cost envelope, and the persona or domain fit the product requires.
Key Insight: Pick by axis, not by vendor

Most chat-model selection debates boil down to one of the four axes: quality (Elo), latency and modality, openness, and persona vs assistant. The right way to pick is to rank the axes for your specific use case, then pick the cheapest model that clears your bar on the top-ranked axis. Defaulting to "the strongest model" wastes money; defaulting to "the cheapest model" sacrifices quality. The cost / quality elbow has moved every six months since 2023, so re-pick at least quarterly.

Real-World Scenario: A coaching app's model migration

A behavioral-health coaching startup launched in 2023 on GPT-3.5-Turbo, switched to GPT-4o mini in mid-2024 when it shipped (quality bump at similar cost), tested Claude 3.5 Haiku in late 2024 (better empathic tone, won on internal eval), and moved their voice channel to GPT-4o Realtime in early 2025 (latency win was decisive). The actual driving signal was their own NPS-correlated eval (a held-out set of 200 hand-graded conversations with style and safety rubrics), which they re-ran on every model release. Public benchmarks were used only for the initial filter (must be top-5 on Arena for English); the ranking within the top 5 was always done on their own data.

Warning: Distillation and "open" come with caveats

Many "open" models in 2024-26 are trained on outputs from closed models (Vicuna on ChatGPT, many fine-tunes on GPT-4 traces). This can be a license violation for commercial deployment depending on the closed model's terms of service. Llama-3+, Qwen2.5+, and Mistral's later releases are trained primarily on their own data and are safer for commercial use; older fine-tunes (Vicuna, WizardLM, many Hugging Face SFT mixes) often have a chain of upstream OpenAI distillation in their lineage. Always check the dataset card before commercial deployment.

Fun Fact: The Arena's hidden taxonomy

The LMSYS Arena reveals an interesting taxonomy when you look at category breakdowns: Claude family routinely tops "creative writing" and "long form"; OpenAI's o-series tops "coding" and "math"; Gemini tops "long context" categories. Different chat models really are good at different things, and the headline Elo averages over user preferences that may not match your specific use case. The category-specific Elo is the more useful number once you have decided your use case.

40.4.7 Cost and latency budgeting

Chat-model selection in production is rarely just about quality. The two binding constraints are cost-per-conversation and latency-per-turn, and both depend on the model.
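A back-of-the-envelope sketch of both constraints is shown below. It assumes a plain chat API where the full history is re-sent on every turn and ignores prompt caching and history truncation. The two model profiles, their prices, and their speeds are hypothetical placeholders, not quotes for any real model.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    usd_per_m_input: float   # USD per million input tokens (placeholder)
    usd_per_m_output: float  # USD per million output tokens (placeholder)
    ttft_ms: float           # time to first token, in ms (placeholder)
    tokens_per_s: float      # decode throughput (placeholder)

def cost_per_conversation(m: ModelProfile, turns: int, system_tokens: int,
                          user_tokens: int, reply_tokens: int) -> float:
    """USD for one conversation; the full history is re-sent on every turn."""
    input_total = 0
    for t in range(turns):
        history = t * (user_tokens + reply_tokens)  # all earlier turns
        input_total += system_tokens + history + user_tokens
    output_total = turns * reply_tokens
    return (input_total * m.usd_per_m_input
            + output_total * m.usd_per_m_output) / 1e6

def latency_per_turn_ms(m: ModelProfile, reply_tokens: int) -> float:
    """Time until the full reply has been generated."""
    return m.ttft_ms + 1000.0 * reply_tokens / m.tokens_per_s

small = ModelProfile("hypothetical-small", 0.15, 0.60, 300, 100)
large = ModelProfile("hypothetical-large", 3.00, 15.00, 600, 50)

for m in (small, large):
    usd = cost_per_conversation(m, turns=10, system_tokens=500,
                                user_tokens=50, reply_tokens=150)
    ms = latency_per_turn_ms(m, reply_tokens=150)
    print(f"{m.name}: ${usd:.4f} per conversation, {ms:.0f} ms per turn")
```

Because the history is re-sent each turn, input tokens grow roughly quadratically with conversation length, so long sessions are dominated by input cost even when replies are short.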

40.4.8 Fine-tuning vs prompting for chat tone

The decision between fine-tuning a base or chat-tuned model versus prompting a frontier chat-tuned model for a specific tone or persona is one of the recurring practical questions. The 2026 default has shifted toward prompting:

The cost-benefit calculation: a Claude or GPT-4o fine-tune is in the low thousands of dollars one-time plus a per-token premium for the fine-tuned model; the prompt-engineering effort to match it is in the dozens-of-hours range. For most application-tone use cases, prompting wins; fine-tuning is reserved for the cases where prompting demonstrably fails on your eval.
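The comparison can be written as a small calculation. Every number in the sketch below is an assumed placeholder (one-time fee, per-token premium, monthly volume, engineering rate); substitute your own vendor quotes and staffing costs before drawing conclusions.

```python
def fine_tune_cost(one_time_usd: float, premium_usd_per_m_tokens: float,
                   m_tokens_per_month: float, months: int) -> float:
    """One-time training fee plus the per-token premium over the period."""
    return one_time_usd + premium_usd_per_m_tokens * m_tokens_per_month * months

def prompting_cost(engineer_hours: float, usd_per_hour: float) -> float:
    """Prompt-engineering effort priced at an hourly rate."""
    return engineer_hours * usd_per_hour

# Placeholder inputs (assumptions for illustration only).
ft = fine_tune_cost(one_time_usd=3000, premium_usd_per_m_tokens=2.0,
                    m_tokens_per_month=100, months=12)
pr = prompting_cost(engineer_hours=40, usd_per_hour=100)
print(f"fine-tune over 12 months: ${ft:,.0f}; prompting: ${pr:,.0f}")
```

The per-token premium scales with traffic while the prompting effort does not, so the gap in favour of prompting widens as volume grows, provided the prompted model actually passes your eval.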

See Also

The headline differences between Claude, Llama-3-Instruct, Qwen-Chat, and DeepSeek-Chat are mostly differences in post-training objective; the dominant 2024-26 recipe in the open ecosystem is Direct Preference Optimization (Rafailov et al. 2023), and Constitutional AI (Bai et al. 2022) is its policy-anchored cousin. For the full DPO loss derivation, the role of the $\beta$ temperature, the connection to KL-constrained RLHF, and a worked numeric walkthrough, see Section 18.3; the RLHF-versus-DPO comparison appears in Section 18.1. The one-sentence summary: DPO's single supervised loss on preference pairs is what lets a 7B open model match a 70B base on chat quality after a few thousand preference comparisons.
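As a quick reminder of what that loss looks like for one preference pair, here is a minimal numeric sketch of the DPO objective from Rafailov et al. (2023). The log-probabilities and the value of $\beta$ are made-up illustration values; Section 18.3 remains the reference for the derivation.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Made-up sequence log-probabilities under the policy and the frozen reference.
untrained = dpo_loss(-11.0, -11.0, -11.0, -11.0)  # policy identical to reference
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)   # policy now favours the chosen reply
print(f"policy == reference: {untrained:.4f}, after preferring chosen: {improved:.4f}")
```

When the policy equals the reference the loss is $\log 2$; it falls as the policy raises the chosen response's likelihood relative to the rejected one, with $\beta$ controlling how strongly deviations from the reference are rewarded.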

40.4.9 Multilingual chat model considerations

Non-English chat exposes differences English benchmarks hide. Five things to know:

40.4.10 Evaluation tied to the model choice

The right way to pick a chat model in 2026, distilled:

  1. Filter to the top-5 models in your target language on the LMSYS Arena, plus any open-weights model that satisfies hard constraints (data residency, openness).
  2. Run your internal eval (Section 40.3.7) against each shortlisted model.
  3. Within the top-3 internal-eval performers, pick by cost and latency for your projected scale.
  4. Re-evaluate quarterly. The model market moves fast enough that the right choice three months ago is rarely the right choice today.
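
The first three steps can be sketched as a small filter-and-rank function. The `Candidate` fields, the latency budget, and all of the example values are hypothetical; the fallback to the unfiltered top three when nothing fits the latency budget is one possible design choice, not part of the procedure above.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    arena_rank: int                      # rank in your target language
    open_and_meets_constraints: bool     # open weights + hard constraints satisfied
    eval_score: float                    # internal eval pass rate, 0..1
    usd_per_conversation: float
    p95_latency_ms: float

def pick_model(cands: list[Candidate], latency_budget_ms: float) -> Candidate:
    # Step 1: top-5 on the Arena, plus qualifying open-weights models.
    shortlist = [c for c in cands
                 if c.arena_rank <= 5 or c.open_and_meets_constraints]
    # Step 2: rank the shortlist by internal eval and keep the top three.
    top3 = sorted(shortlist, key=lambda c: c.eval_score, reverse=True)[:3]
    # Step 3: among those, prefer models within the latency budget, then cheapest.
    within_budget = [c for c in top3 if c.p95_latency_ms <= latency_budget_ms]
    return min(within_budget or top3, key=lambda c: c.usd_per_conversation)

# Entirely made-up candidates for illustration.
candidates = [
    Candidate("model-a", 1, False, 0.90, 0.060, 900),
    Candidate("model-b", 3, False, 0.88, 0.010, 700),
    Candidate("model-c", 4, False, 0.80, 0.005, 500),
    Candidate("model-d", 12, True, 0.86, 0.004, 1200),
    Candidate("model-e", 9, False, 0.95, 0.001, 300),
]
choice = pick_model(candidates, latency_budget_ms=1000)
print(choice.name)
```

Step 4 amounts to re-running this function on a schedule with fresh eval scores and prices.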

Fun Fact: The size-vs-capability scaling is no longer monotonic

Until late 2023 the rough rule was "bigger model = better chat". By mid-2024 that ordering was broken: GPT-4o mini outperformed many 70B-class open models on chat benchmarks; Gemini 1.5 Flash routinely beat Llama-3-70B on Arena. The 2024-26 lesson is that the chat-tuning recipe (data quality, RLHF, Constitutional AI) matters as much as scale. A well-tuned 30B model can beat a poorly-tuned 70B on conversational quality. The implication: do not assume bigger is better; benchmark on your data.

40.4.11 Small and edge-deployable chat models

An entire tier of conversational models exists below the 7B-class production minimum: distilled, mobile, and edge models for on-device chat. These are relevant when network latency, cost, or privacy rule out cloud inference.

The 2026 reality for edge chat is that 1-3B models are usable for narrow assistant tasks (search, summarization, simple Q&A in-domain) but unreliable for open-ended chat. The 7-8B quantized models are the sweet spot for "good enough chat on a laptop or mid-range phone"; below that, narrow-task fine-tuning is usually required.
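A quick way to see why these size classes map onto device classes is to estimate weight memory as parameters times bits per weight. The sketch below is a lower bound under that simple assumption: real quantized files are typically somewhat larger because some tensors stay at higher precision, and the KV cache and runtime add further overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters * bits / 8 bytes."""
    return params_billions * bits_per_weight / 8.0

# Representative size classes from this section; bit widths are nominal.
for label, params_b in (("1B", 1.0), ("3.8B", 3.8), ("8B", 8.0)):
    fp16 = weight_memory_gb(params_b, 16)
    q4 = weight_memory_gb(params_b, 4)
    print(f"{label}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

At a nominal 4 bits an 8B model needs about 4 GB for weights alone, which fits comfortably on a laptop but consumes a large share of RAM on a mid-range phone.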

40.4.11.1 Phi-3 in depth: the "textbook quality" recipe

Microsoft's Phi-3 family (3.8B parameters for Phi-3-mini, 14B for Phi-3-medium) is the reference example of the "small but capable" thesis. The team's claim is that a relatively small model trained on aggressively-curated, textbook-like data can match a 7B model trained on raw web text. The training-data formula is roughly

$$\mathcal{D}_{\mathrm{Phi}\text{-}3} \;=\; \mathcal{D}_{\mathrm{filtered\text{-}web}} \;\cup\; \mathcal{D}_{\mathrm{synthetic\text{-}textbook}}, \qquad |\mathcal{D}_{\mathrm{Phi}\text{-}3}| \approx 3.3 \text{ T tokens},$$

with the synthetic portion generated by GPT-4 to plug specific knowledge gaps. The trade-off is that Phi-3 over-indexes on the topics covered by its synthetic textbooks and under-performs on long-tail web-trivia questions that 7B web-trained models handle well.

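# Running Phi-3-mini on a laptop CPU via llama.cpp
# (requires: pip install llama-cpp-python huggingface-hub)
from llama_cpp import Llama

# Download the 4-bit GGUF weights from the Hugging Face Hub (cached after the first run).
llm = Llama.from_pretrained(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",
    n_ctx=4096,    # context window in tokens
    n_threads=8,   # CPU threads; tune to your core count
)

# OpenAI-style chat completion; a low temperature keeps the summary focused.
resp = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Summarise the difference between TCP and UDP in three bullets."}],
    temperature=0.2,
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])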
Code Fragment 40.4.1b: Phi-3-mini quantised to 4-bit GGUF runs at roughly 20 tokens per second on a modern laptop CPU, fast enough for an offline assistant or a privacy-sensitive deployment with no GPU available.
Practical Example: Phi-3-mini as an on-device privacy bot

A healthcare startup needed a HIPAA-compliant note-summarisation assistant that ran entirely on the clinician's laptop with no network egress. The team benchmarked Phi-3-mini-4k Q4 against Llama 3.2 3B Q4 on a 200-note internal eval: Phi-3 reached an 87% rubric pass rate vs Llama 3.2's 81%, and ran at 18 tokens/s on a 2023 MacBook Pro vs Llama's 22 tokens/s. They picked Phi-3 for accuracy on the clinical summarisation rubric and accepted the 4 tokens/s throughput penalty. The decisive factor was that Phi-3's synthetic-textbook training had over-indexed on medical-style structured writing, which is exactly the format their rubric scored.

40.4.12 Model evals vs product evals

A subtle but important distinction in chat-model evaluation: model evals measure the model itself (Arena Elo, MT-Bench score); product evals measure your specific application (your bot's task-success rate on your eval set). The two often diverge:

Always run product evals before deciding. Model evals are necessary for shortlisting; product evals are necessary for committing.
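
One simple way to quantify how far the two diverge on your shortlist is a rank correlation between the public ordering and your product-eval ordering. The sketch below implements Spearman's rho for rankings without ties; the two rank lists are hypothetical.

```python
def spearman_rho(rank_a: list[int], rank_b: list[int]) -> float:
    """Spearman rank correlation for two rankings without ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks for five shortlisted models (1 = best).
public_rank = [1, 2, 3, 4, 5]    # e.g. order on a public leaderboard
product_rank = [3, 1, 2, 5, 4]   # order on your own product eval

rho = spearman_rho(public_rank, product_rank)
print(f"rank agreement (Spearman rho): {rho:.2f}")
```

A rho near 1 means the public ranking is a usable proxy for your product; a value well below 1, as in this made-up example, means the public leader is not necessarily your best model.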

40.4.13 What to look for in a model card

Every chat-tuned model release publishes a model card; reading them critically is a load-bearing skill. The questions to ask:

40.4.14 Model family summary by characteristic

Table 40.4.2a: Conversational models by primary characteristic.
| Characteristic | Frontier API winner | Open-weights winner | Notes |
|---|---|---|---|
| Arena Elo (English) | Claude / GPT-4o / Gemini family | Llama-3.x, Qwen2.5-72B | Top of leaderboard rotates monthly |
| Tool use / function calling | Claude 3.5/4.x | Llama-3.1+ Instruct | Claude leads on multi-call reliability |
| Long context | Gemini 2.5 Pro (2M) | Qwen2.5 (1M context variants) | "Effective" length usually less than claimed |
| Multilingual | Gemini 2.5 / Claude / GPT-4o | Qwen2.5, Mistral-Large-2 | Qwen leads on Chinese specifically |
| Voice (S2S) | GPT-4o Realtime, Gemini Live | Moshi | Open lags closed by ~1-2 years |
| Cost-efficient chat | GPT-4o mini, Claude Haiku | Llama-3.1-8B, Mistral-Small-3, Qwen-32B | Open wins at scale; API wins at low volume |
| Reasoning | GPT-5 / o-series, Claude Opus 4.5 | DeepSeek-R1, Qwen-QwQ | Open is competitive; gap closing |
| Persona consistency | Claude (steerable via prompt) | Llama / Mistral fine-tunes | Character.AI dominates closed character space |

What's Next?

In the next section, Section 40.5: External Reading and Communities, we build on the material covered here.

Further Reading
Anthropic (2024). "Claude 3.5 Sonnet." Anthropic News, June 2024. anthropic.com/news/claude-3-5-sonnet. Reference release notes for Claude 3.5 Sonnet, the model that defined the top of the chat-tuned tier through 2024-25.
OpenAI (2024). "Hello GPT-4o." OpenAI Blog, May 2024. openai.com/index/hello-gpt-4o. Launch post for GPT-4o; defines the omni-modal chat model template that the Realtime API extends.
Meta AI (2024). "The Llama 3 Herd of Models." Meta AI Research. ai.meta.com/research/publications/the-llama-3-herd-of-models. The Llama-3 technical report; canonical reference for the open chat-tuning recipe used by the Llama-3 family.
Défossez, A., et al. (2024). "Moshi: a speech-text foundation model for real-time dialogue." Kyutai Technical Report. kyutai.org/Moshi.pdf. The Moshi paper; canonical reference for open speech-to-speech architectures.
DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv preprint. arXiv:2412.19437. The DeepSeek-V3 technical report; canonical reference for the cost-efficient open MoE chat model line.
Yang, A., et al. (2024). "Qwen2.5 Technical Report." arXiv preprint. arXiv:2412.15115. The Qwen2.5 technical report; the multilingual chat-tuning reference of 2024.