"The best API calls are the ones where you let the model think before it speaks."
Pip, Thoughtfully Patient AI Agent
Reasoning models and multimodal APIs represent the two most significant expansions of what LLM APIs can do. Building on the core API patterns from Section 10.1 and the structured output techniques from Section 10.2, this section introduces two capabilities that change how you design LLM-powered applications. Reasoning models (OpenAI o3, Anthropic Claude with extended thinking, Google Gemini with thinking mode) add an explicit thinking phase before generating answers, dramatically improving performance on complex tasks at the cost of higher latency and token usage. Multimodal APIs allow images, PDFs, and audio to be sent directly alongside text, eliminating separate preprocessing pipelines. Together, these capabilities enable applications that were impractical with standard text-only APIs, from automated document analysis to visual reasoning systems.
Prerequisites
This section assumes familiarity with the LLM API landscape and authentication patterns from Section 10.1: API Landscape & Architecture, the chat completions format from Section 10.2: Chat Completions & Structured Output, and the retry and error-handling patterns in Section 10.3: API Engineering Best Practices. Understanding of transformer decoding from Section 05.1 will help clarify how reasoning tokens fit into the generation pipeline.
Reasoning models are like a student who is allowed to use scratch paper during an exam. Standard models must give their answer immediately, token by token, with no opportunity to plan ahead. Reasoning models, by contrast, generate an internal chain of thought (the "thinking" tokens) before producing the final answer. You, the API caller, control how much scratch paper the model gets by setting a thinking budget (measured in tokens). More budget means more thorough reasoning but higher latency and cost. The key insight is that you are not just paying for output tokens anymore; you are paying for the model to think, and that thinking time is a tunable parameter.
1. The Reasoning Model Landscape
Why reasoning models exist: Standard autoregressive models generate each token conditioned only on the tokens before it, with no ability to "look ahead" or revise their approach mid-generation. For simple tasks like classification or summarization, this single-pass generation works well. But for tasks requiring multi-step logic (math proofs, code debugging, strategic analysis), the model needs to try approaches, backtrack, and verify its own work. Reasoning models address this by adding a dedicated "thinking" phase where the model generates internal reasoning tokens before producing the visible answer. This thinking phase is essentially the model running its own chain-of-thought automatically, trained via reinforcement learning (covered in Section 17.1).
Reasoning models represent a fundamental shift in how LLMs handle complex tasks. Unlike standard models that generate tokens left to right in a single pass (as described in Chapter 05), reasoning models perform explicit multi-step reasoning before producing a final answer. This capability emerged with OpenAI's o1 model in late 2024 and has since been adopted across providers. Figure 10.4.1 illustrates how reasoning tokens fit into the generation pipeline, and the table below summarizes the key providers and their reasoning offerings.
Start with "low" reasoning effort and increase it only when accuracy demands it. Many teams default to "high" on every request, then wonder why their latency tripled and their token bill quadrupled. For classification and simple extraction tasks, standard models are usually faster, cheaper, and equally accurate.
1.1 OpenAI Reasoning via the Responses API
Code Fragment 10.4.1 shows how to call OpenAI's o3 reasoning model via the Responses API, including the effort parameter and the separation of thinking summaries from the final answer.
# Call OpenAI's o3 reasoning model via the Responses API
# The reasoning parameter controls how much "thinking" the model does
from openai import OpenAI

client = OpenAI()

# The Responses API with reasoning model
response = client.responses.create(
    model="o3",
    input=[
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational."
        }
    ],
    reasoning={
        "effort": "high"  # Options: "low", "medium", "high"
    },
    max_output_tokens=16000  # Includes both thinking and output tokens
)

# The response contains separate output items
for item in response.output:
    if item.type == "reasoning":
        print(f"[Thinking] ({len(item.summary)} summary parts)")
        for summary in item.summary:
            print(f"  {summary.text}")
    elif item.type == "message":
        print(f"\n[Answer]\n{item.content[0].text}")

# Token usage breakdown
print(f"\nInput tokens: {response.usage.input_tokens}")
print(f"Reasoning tokens: {response.usage.output_tokens_details.reasoning_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
1.2 Anthropic Extended Thinking
Anthropic's approach to reasoning uses the existing Messages API with an additional thinking configuration. When extended thinking is enabled, Claude generates a thinking block before its response. Unlike OpenAI's approach, Anthropic returns the full thinking trace (not just a summary), giving developers visibility into the model's reasoning process. The thinking budget is specified explicitly in tokens, providing fine-grained control over the reasoning depth. Code Fragment 10.4.2 shows this approach in practice.
# Enable Anthropic extended thinking with explicit budget control
# The thinking block exposes the full reasoning trace
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens for thinking (min: 1024)
    },
    messages=[
        {
            "role": "user",
            "content": "Analyze this business scenario: A SaaS company has "
                       "70% gross margins but is burning $2M/month with "
                       "$18M in the bank. They are growing 15% MoM. "
                       "Should they raise now or wait?"
        }
    ]
)

# Response contains both thinking and text blocks
for block in response.content:
    if block.type == "thinking":
        print(f"[Thinking] ({len(block.thinking)} chars)")
        print(block.thinking[:500] + "...")
    elif block.type == "text":
        print(f"\n[Answer]\n{block.text}")
1.3 Google Gemini Thinking
Google's Gemini 2.5 models support a thinking mode configured through the thinking_config parameter. The implementation follows a similar pattern to Anthropic's approach: specify a token budget, receive the thinking trace alongside the response. Gemini's thinking is particularly strong on multimodal reasoning tasks, where the model can reason about images and documents before answering. Code Fragment 10.4.3 shows this approach in practice.
# Google Gemini thinking mode with token budget control
# The thinking_config parameter sets the reasoning depth
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Design a database schema for a hospital management system "
             "that handles patient records, appointments, billing, and "
             "inventory. Consider HIPAA compliance requirements.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=8000,   # tokens for thinking
            include_thoughts=True   # return thought summaries in the response
        )
    )
)

# Access thinking and response parts
for part in response.candidates[0].content.parts:
    if part.thought:
        print(f"[Thinking]\n{part.text[:500]}...")
    else:
        print(f"\n[Answer]\n{part.text}")

# Token usage
print(f"Thinking tokens: {response.usage_metadata.thoughts_token_count}")
print(f"Output tokens: {response.usage_metadata.candidates_token_count}")
1.4 Choosing a Thinking Budget
The thinking budget is the single most important parameter when working with reasoning models. Setting it too low produces shallow reasoning that may not outperform a standard model. Setting it too high wastes tokens (and money) on problems that do not need deep thought. Here is a practical guide:
| Task Type | Recommended Budget | Reasoning Effort | Example |
|---|---|---|---|
| Simple classification | 1,024 to 2,048 | Low | Sentiment analysis, topic labeling |
| Code generation | 4,096 to 8,192 | Medium | Write a function, fix a bug |
| Multi-step math | 8,192 to 16,384 | High | Proofs, competition problems |
| Complex analysis | 10,000 to 32,000 | High | Architecture design, legal analysis |
| Research-level reasoning | 32,000+ | Maximum | Novel algorithm design, theorem proving |
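These guidelines can be encoded as a simple lookup for programmatic budget selection. A minimal sketch (the category keys and the choice of each range's upper end are illustrative, not an API feature):

```python
# Map task categories to (thinking budget, effort), following the
# guideline table above. Upper ends of each range are used here.
BUDGET_GUIDE = {
    "classification": (2_048, "low"),
    "code_generation": (8_192, "medium"),
    "multi_step_math": (16_384, "high"),
    "complex_analysis": (32_000, "high"),
    "research": (32_000, "high"),  # 32,000+ in practice
}

def pick_thinking_budget(task_type: str) -> tuple[int, str]:
    """Return (budget_tokens, effort) for a task category.

    Unknown categories fall back to a conservative medium setting.
    """
    return BUDGET_GUIDE.get(task_type, (4_096, "medium"))
```

A router like this keeps budget policy in one place instead of scattering magic numbers across call sites.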
Reasoning tokens count toward your usage and are billed at the model's output token rate; this holds for OpenAI's o3 and for Claude's thinking tokens alike. A request with a 10,000-token thinking budget could generate 10,000 thinking tokens plus 2,000 output tokens, making the effective cost 6x what you might expect from the output alone. Always monitor reasoning_tokens or thinking_tokens in the usage metadata to track costs.
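To make the billing arithmetic concrete, a back-of-the-envelope estimator (the per-million-token rates in the example are hypothetical placeholders; substitute your provider's current pricing):

```python
def reasoning_cost_usd(input_tokens: int, thinking_tokens: int,
                       output_tokens: int,
                       input_rate: float, output_rate: float) -> float:
    """Estimate request cost in USD. Rates are USD per 1M tokens.

    Thinking tokens are billed at the output rate, as described above.
    """
    return (input_tokens * input_rate
            + (thinking_tokens + output_tokens) * output_rate) / 1_000_000

# Example: 500 input, 10,000 thinking, 2,000 output tokens at
# hypothetical rates of $2/M input and $8/M output.
cost = reasoning_cost_usd(500, 10_000, 2_000, 2.0, 8.0)
visible_only = reasoning_cost_usd(500, 0, 2_000, 2.0, 8.0)
print(f"${cost:.4f} vs ${visible_only:.4f} without thinking")
# → $0.0970 vs $0.0170 without thinking
```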
2. Multimodal API Calls
Why multimodal APIs matter for production systems: Before multimodal APIs, processing a document with images required a multi-step pipeline: OCR the text, run a separate vision model on images, then combine the results before sending to an LLM. Each step introduced latency, complexity, and potential errors. Native multimodal APIs collapse this pipeline into a single API call, where the model can jointly reason about text, images, and layout. This is particularly powerful for document AI applications where table structure, handwriting, and visual context are essential for correct extraction.
2.1 Image Inputs
Modern LLM APIs accept more than text. Images, audio, video frames, and documents can all be included as content blocks within API requests, reflecting the broader trend toward multimodal models. Each provider has its own format for multimodal content, but the general pattern is similar: instead of passing a plain string as the user message, you pass an array of content blocks, each with a type and corresponding data. Figure 10.4.2 shows the content block structure that all major providers share. Code Fragment 10.4.4 shows this approach in practice.
# Send an image to Claude for visual analysis using base64 encoding
# The content array mixes image and text blocks in a single message
import base64
from pathlib import Path

# Encode a local image
image_path = Path("chart.png")
image_data = base64.standard_b64encode(image_path.read_bytes()).decode("utf-8")

# === OpenAI Vision (Responses API) ===
from openai import OpenAI

openai_client = OpenAI()
response = openai_client.responses.create(
    model="gpt-4o",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Describe the trends in this chart."},
            {
                "type": "input_image",
                "image_url": f"data:image/png;base64,{image_data}"
            }
        ]
    }]
)

# === Anthropic Vision ===
import anthropic

claude_client = anthropic.Anthropic()
response = claude_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_data
                }
            },
            {"type": "text", "text": "Describe the trends in this chart."}
        ]
    }]
)

# === Google Gemini Vision ===
from google import genai
from google.genai import types

gemini_client = genai.Client()
response = gemini_client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_path.read_bytes(), mime_type="image/png"),
        "Describe the trends in this chart."
    ]
)
2.2 Document Understanding
PDF documents can be sent directly to several APIs, eliminating the need for a separate OCR pipeline (see also Chapter 27 on document AI). Anthropic and Google both accept PDF files as base64-encoded content blocks, with the model processing each page as a visual input. This is especially powerful for documents with complex layouts, tables, and figures that traditional text extraction handles poorly. Code Fragment 10.4.5 shows this approach in practice.
# === Anthropic PDF Support ===
import anthropic, base64

client = anthropic.Anthropic()
pdf_data = base64.standard_b64encode(
    open("quarterly_report.pdf", "rb").read()
).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data
                }
            },
            {
                "type": "text",
                "text": "Extract all financial metrics from this quarterly "
                        "report. Return them as a JSON object with metric "
                        "names as keys and values as numbers."
            }
        ]
    }]
)

# === Google Gemini PDF Support ===
from google import genai
from google.genai import types

gemini_client = genai.Client()
response = gemini_client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(
            data=open("quarterly_report.pdf", "rb").read(),
            mime_type="application/pdf"
        ),
        "Extract all financial metrics and present them in a table."
    ]
)
2.3 Audio Input
Audio inputs enable voice-to-analysis workflows without a separate speech-to-text step. OpenAI's GPT-4o and Google's Gemini both accept audio files directly, processing the raw audio signal alongside text instructions. This preserves information that text transcription would lose, such as tone, emphasis, and speaker emotion.
Audio inputs are tokenized at roughly 32 tokens per second of audio for OpenAI and similar rates for Gemini. A 5-minute customer service call would consume approximately 9,600 input tokens. For high-volume audio processing, it may still be more cost-effective to use a dedicated speech-to-text service (like Whisper) and then send the transcript to the LLM, since audio tokens are billed at a higher rate than text tokens.
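The arithmetic above can be wrapped in a small estimator for capacity planning (the 32 tokens/second figure is the approximate rate cited above; actual rates vary by provider and audio format):

```python
AUDIO_TOKENS_PER_SECOND = 32  # approximate rate for raw audio input

def audio_input_tokens(duration_seconds: float) -> int:
    """Estimate input tokens consumed by raw audio sent to a multimodal API."""
    return int(duration_seconds * AUDIO_TOKENS_PER_SECOND)

# A 5-minute customer service call: 300 s * 32 tokens/s
print(audio_input_tokens(5 * 60))  # → 9600
```

Multiplying the result by your provider's audio input rate, and comparing against transcription cost plus text tokens, tells you which pipeline is cheaper at your volume.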
3. Streaming with Reasoning Models
Streaming is critical for user-facing applications because reasoning models can take 10 to 60 seconds to generate a response (most of which is spent on thinking tokens). Without streaming, users see nothing until the entire response is complete. With streaming, you can show a progress indicator during the thinking phase and then stream the answer tokens as they arrive. Figure 10.4.3 illustrates the streaming event timeline for a reasoning model response.
Code Fragment 10.4.6 shows how to stream tokens as they are generated.
# Stream a reasoning response from Anthropic with extended thinking
# Shows thinking progress and answer tokens as they arrive
import anthropic

client = anthropic.Anthropic()

# Stream a reasoning response
thinking_chars = 0
answer_chars = 0

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000
    },
    messages=[{
        "role": "user",
        "content": "What are the pros and cons of microservices vs "
                   "monolith for a startup with 5 engineers?"
    }]
) as stream:
    current_block = None
    for event in stream:
        if event.type == "content_block_start":
            if event.content_block.type == "thinking":
                current_block = "thinking"
                print("[Thinking...]", end="", flush=True)
            elif event.content_block.type == "text":
                current_block = "text"
                print("\n\n[Answer] ", end="", flush=True)
        elif event.type == "content_block_delta":
            if current_block == "thinking":
                delta_len = len(event.delta.thinking)
                # Print a progress dot each time we cross a 500-char boundary
                # (deltas arrive in uneven sizes, so an exact modulo check
                # would rarely fire)
                if (thinking_chars + delta_len) // 500 > thinking_chars // 500:
                    print(".", end="", flush=True)
                thinking_chars += delta_len
            elif current_block == "text":
                print(event.delta.text, end="", flush=True)
                answer_chars += len(event.delta.text)

print(f"\n\nThinking: {thinking_chars} chars | Answer: {answer_chars} chars")
4. Combining Reasoning and Multimodal Inputs
The most powerful use cases emerge when reasoning models process multimodal inputs. A reasoning model analyzing a complex architectural diagram, a financial chart with annotations, or a multi-page legal document can leverage its extended thinking to plan its analysis before producing a response. This combination is particularly effective for tasks that require both visual understanding and logical reasoning. For prompting techniques designed specifically for reasoning models, see Section 11.5. The reinforcement learning methods that train reasoning capabilities are covered in Section 17.1.
4.1 Practical Pattern: Document Analysis with Reasoning
Consider a workflow that processes invoices: the model must read the document (vision), extract structured data (comprehension), cross-reference totals (arithmetic), and flag discrepancies (reasoning). A standard model might miss arithmetic errors or misread ambiguous handwriting. A reasoning model with adequate thinking budget can catch these issues by explicitly checking its work.
Who: Sara, a finance automation lead at a mid-size manufacturing company.
Situation: Her accounts payable team processed 500 invoices per day. Each invoice had to be verified against the purchase order for line-item accuracy and correct totals, and the team was piloting an LLM-based verification pipeline to reduce manual review.
Problem: GPT-4o correctly identified line items 94% of the time but miscalculated totals on 12% of invoices. Since the team could not trust the arithmetic, they still had to manually review every flagged invoice plus a random sample, negating most of the automation benefit.
Decision: Sara switched the verification step to o3 via the Responses API with medium reasoning effort, which rechecks arithmetic during its thinking phase. The thinking trace also provided an audit trail showing exactly how each total was verified.
Result: Total miscalculations dropped to under 2%. Reasoning tokens added approximately $0.08 per invoice, but the reduction in human review time saved roughly $0.50 per invoice, yielding a net savings of $0.42 per invoice ($210 per day).
Lesson: Reasoning models pay for themselves when the task involves verifiable computation (arithmetic, logical checks) and the cost of human fallback exceeds the added token cost. The thinking trace doubles as an audit trail for compliance-sensitive workflows.
OpenAI's o1 model, the first widely available reasoning model, was initially so secretive about its thinking process that developers called it "the black box within the black box." Early versions did not expose thinking tokens at all. The push for transparency led to the current approach where providers return at least a summary of the reasoning trace, and some (Anthropic, Google, DeepSeek) return the full internal monologue.
5. Cross-Provider Abstraction Patterns
Given the differences between providers, production applications often need an abstraction layer that normalizes reasoning and multimodal calls. The key differences to abstract are: (1) how reasoning is configured, (2) how multimodal content blocks are formatted, (3) how streaming events are structured, and (4) how token usage is reported. Code Fragment 10.4.7 shows this approach in practice.
# Build a modality router that dispatches inputs to the correct API
# Supports text, image, audio, and document inputs with provider selection
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasoningConfig:
    """Provider-agnostic reasoning configuration."""
    enabled: bool = False
    budget_tokens: int = 4096
    effort: str = "medium"  # low, medium, high


@dataclass
class ReasoningResponse:
    """Normalized response from any reasoning model."""
    thinking: Optional[str]
    answer: str
    thinking_tokens: int
    output_tokens: int
    total_tokens: int


def call_reasoning_model(
    provider: str,
    model: str,
    prompt: str,
    reasoning: Optional[ReasoningConfig] = None
) -> ReasoningResponse:
    """Unified interface for reasoning model calls."""
    reasoning = reasoning or ReasoningConfig()

    if provider == "openai":
        from openai import OpenAI
        client = OpenAI()
        response = client.responses.create(
            model=model,
            input=[{"role": "user", "content": prompt}],
            reasoning={"effort": reasoning.effort},
            max_output_tokens=reasoning.budget_tokens + 4096
        )
        thinking = ""
        answer = ""
        for item in response.output:
            if item.type == "reasoning":
                thinking = " ".join(s.text for s in item.summary)
            elif item.type == "message":
                answer = item.content[0].text
        return ReasoningResponse(
            thinking=thinking,
            answer=answer,
            thinking_tokens=response.usage.output_tokens_details.reasoning_tokens,
            output_tokens=response.usage.output_tokens,
            total_tokens=response.usage.total_tokens
        )

    elif provider == "anthropic":
        import anthropic
        client = anthropic.Anthropic()
        response = client.messages.create(
            model=model,
            max_tokens=reasoning.budget_tokens + 4096,
            thinking={"type": "enabled", "budget_tokens": reasoning.budget_tokens}
            if reasoning.enabled else {"type": "disabled"},
            messages=[{"role": "user", "content": prompt}]
        )
        thinking = ""
        answer = ""
        for block in response.content:
            if block.type == "thinking":
                thinking = block.thinking
            elif block.type == "text":
                answer = block.text
        return ReasoningResponse(
            thinking=thinking,
            answer=answer,
            # Anthropic counts thinking inside output_tokens and does not
            # report it separately, so fall back to 0 if the field is absent
            thinking_tokens=getattr(response.usage, "thinking_tokens", 0),
            output_tokens=response.usage.output_tokens,
            total_tokens=response.usage.input_tokens + response.usage.output_tokens
        )

    elif provider == "google":
        from google import genai
        from google.genai import types
        client = genai.Client()
        response = client.models.generate_content(
            model=model,
            contents=prompt,
            config=types.GenerateContentConfig(
                thinking_config=types.ThinkingConfig(
                    thinking_budget=reasoning.budget_tokens,
                    include_thoughts=True
                )
            )
        )
        thinking = ""
        answer = ""
        for part in response.candidates[0].content.parts:
            if part.thought:
                thinking = part.text
            else:
                answer = part.text
        return ReasoningResponse(
            thinking=thinking,
            answer=answer,
            thinking_tokens=response.usage_metadata.thoughts_token_count or 0,
            output_tokens=response.usage_metadata.candidates_token_count,
            total_tokens=response.usage_metadata.total_token_count
        )

    raise ValueError(f"Unknown provider: {provider}")


# Usage
result = call_reasoning_model(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    prompt="Explain why P != NP is hard to prove.",
    reasoning=ReasoningConfig(enabled=True, budget_tokens=8000)
)
print(f"Thinking: {result.thinking_tokens} tokens")
print(f"Answer: {result.answer[:200]}...")
If you call the same prompt repeatedly (common in batch processing and testing), cache responses keyed by the hash of (model, prompt, temperature, other params). Even a simple dictionary cache can save thousands of API calls during development.
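A minimal sketch of such a cache (the key covers model, prompt, and any sampling parameters; `call_fn` stands in for whatever provider call you wrap):

```python
import hashlib
import json

_response_cache: dict[str, str] = {}

def cache_key(model: str, prompt: str, **params) -> str:
    """Stable hash over everything that affects the completion."""
    payload = json.dumps({"model": model, "prompt": prompt, **params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, call_fn, **params) -> str:
    """Return a cached response for an exact repeat request,
    otherwise invoke call_fn and store the result."""
    key = cache_key(model, prompt, **params)
    if key not in _response_cache:
        _response_cache[key] = call_fn(model=model, prompt=prompt, **params)
    return _response_cache[key]
```

An in-memory dict resets between runs; for batch jobs, the same keying scheme works with a file- or Redis-backed store.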
Key Takeaways
- Reasoning models generate internal thinking tokens before producing an answer, trading latency and cost for significantly better performance on complex tasks.
- Thinking budgets are the primary tuning lever: start low (2,048 tokens) for simple tasks and increase to 10,000+ only for problems requiring multi-step reasoning.
- Multimodal content blocks allow images, PDFs, and audio to be sent directly in API calls, eliminating separate preprocessing pipelines for many use cases.
- Streaming is essential for reasoning models because thinking can take 10 to 60 seconds; stream thinking progress and answer tokens to keep users informed.
- Provider differences are significant: OpenAI uses the Responses API with effort levels, Anthropic uses budget_tokens in the Messages API, and Google uses thinking_config. Build abstraction layers for cross-provider support.
- Reasoning tokens are billed at output token rates, so a request that spends 10,000 thinking tokens to produce 2,000 visible output tokens costs 6x what the output alone suggests. Always monitor reasoning token usage.
The optimal allocation of thinking budget across tasks is an open research problem.
Current approaches use fixed budgets, but adaptive reasoning would allow models to dynamically allocate more thinking tokens to harder sub-problems. DeepSeek-R1's approach of embedding thinking directly in the output stream (using <think> tags) contrasts with the API-mediated approaches of other providers and raises questions about whether reasoning should be a model capability or a serving infrastructure feature.
Meanwhile, multimodal reasoning (combining vision, text, and audio in a single thinking chain) is rapidly improving but still struggles with spatial reasoning and precise measurement tasks.
Exercises
Using any reasoning model API (OpenAI o3, Anthropic Claude with extended thinking, or Google Gemini with thinking mode), send the same complex math problem with three different thinking budget levels (low, medium, high). Record the answer quality, thinking token count, output token count, and total latency for each. What is the relationship between thinking budget and answer quality for your chosen problem?
Show Answer
You should observe that higher thinking budgets generally produce more thorough reasoning and fewer errors, but with diminishing returns. A simple problem (e.g., "What is 15% of 340?") shows no improvement beyond the lowest budget. A complex problem (e.g., multi-step word problem or proof) shows significant improvement from low to medium but often plateaus from medium to high. The key insight is that thinking budget should be calibrated to task complexity, not set to maximum by default.
Pick a task that requires multi-step reasoning (e.g., "Given these five product reviews, identify the two most common complaints and explain whether they are related"). Send it to both a standard model (GPT-4o or Claude Sonnet without thinking) and a reasoning model (o3 or Claude with extended thinking). Compare the quality, cost, and latency of both responses. For which tasks does the reasoning model provide a clear advantage?
Show Answer
Reasoning models typically outperform standard models on tasks that require: (1) counting or enumeration, (2) multi-step logical inference, (3) considering multiple constraints simultaneously, or (4) detecting subtle relationships across multiple inputs. For simple extraction or summarization tasks, standard models are usually sufficient and significantly cheaper. The cost difference is typically 3x to 10x due to thinking tokens, so the reasoning model's advantage must be substantial to justify the premium.
Construct an API call that sends both an image (e.g., a chart or diagram) and a text question about that image to a multimodal model. Then modify the call to include two images and ask the model to compare them. How does the token usage change between the single-image and dual-image requests? What are the provider-specific differences in how image content blocks are specified?
Show Answer
Image tokens vary by provider and resolution. OpenAI uses a tile-based system where a 1024x1024 image costs roughly 765 tokens (low detail) to 1,105+ tokens (high detail). Anthropic charges based on image dimensions (e.g., a 1024x1024 image costs about 1,334 tokens). Adding a second image roughly doubles the image token cost. The key differences are: OpenAI uses image_url content blocks with a detail parameter; Anthropic uses image content blocks with base64 or URL source types; Google Gemini uses inline_data with MIME type specification.
Implement a streaming handler for a reasoning model that: (1) displays a spinner or progress indicator during the thinking phase, (2) prints thinking summaries as they arrive, and (3) streams the final answer token by token. Handle the case where the thinking phase exceeds 30 seconds by showing an elapsed time counter. Test with both a simple and a complex query to observe the difference in thinking duration.
Show Answer
The implementation should use stream=True (or equivalent) and process events by type. For OpenAI's Responses API, listen for response.reasoning.delta and response.output_text.delta events. For Anthropic, handle content_block_start (type: thinking) and content_block_delta events, distinguishing between thinking and text deltas. For Google, process candidates with thought content parts separately. The thinking phase for a simple question (e.g., "What is 2+2?") takes under 2 seconds, while a complex proof can take 30 to 60 seconds, making the progress indicator essential for user experience.
Using Code Fragment 10.4.7 as a starting point, extend the provider-agnostic reasoning abstraction to support: (1) automatic thinking budget selection based on prompt length (longer prompts suggest more complex tasks), (2) fallback from reasoning to standard model if thinking tokens exceed a cost threshold, and (3) logging of thinking token usage for cost tracking. Test your wrapper across at least two providers.
Show Answer
A good implementation would include a budget_strategy parameter that maps prompt length ranges to thinking budgets (e.g., under 200 tokens = low, 200 to 1000 = medium, over 1000 = high). The cost threshold fallback should estimate cost before the request completes by monitoring streaming thinking tokens and aborting if they exceed the limit, then retrying with a standard model. Logging should capture: provider, model, prompt_tokens, thinking_tokens, output_tokens, latency_ms, and estimated_cost. The wrapper should normalize these metrics across providers to enable fair comparison.
What Comes Next
In the next chapter, Chapter 11: Prompt Engineering, we explore how to craft effective prompts for both standard and reasoning models, including techniques that leverage the thinking capabilities introduced here.
OpenAI. (2024). Learning to Reason with LLMs. OpenAI Technical Blog.
OpenAI's technical overview of the o1 model family, explaining how chain-of-thought reasoning is integrated into the model's inference process. Provides the conceptual foundation for understanding reasoning tokens and thinking budgets.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
Describes how DeepSeek-R1 develops reasoning capabilities through reinforcement learning, producing models that generate explicit thinking traces. The open-weight release enabled the broader community to study and build on reasoning model internals.
OpenAI. (2025). Responses API Reference. OpenAI Platform Documentation.
The official documentation for OpenAI's Responses API, covering reasoning parameters, multimodal inputs, tool use, and streaming. Essential reference for implementing the OpenAI examples in this section.
Anthropic. (2025). Extended Thinking Documentation. Anthropic Docs.
Anthropic's guide to using extended thinking with the Messages API, including budget configuration, streaming patterns, and best practices for thinking-enabled requests.
Google. (2025). Thinking with Gemini. Google AI Developer Documentation.
Google's documentation for thinking mode in Gemini models, covering configuration, budget control, and multimodal reasoning capabilities. Useful for implementing the Google examples in this section.
Yang, Z. et al. (2023). The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
An early systematic evaluation of GPT-4V's multimodal capabilities, documenting strengths and failure modes across diverse visual understanding tasks. Provides context for understanding the capabilities and limitations of vision APIs discussed in this section.
