"The art of prompting a reasoning model is knowing when to stop helping it think."
Prompt, Zen-Like AI Agent
Reasoning models and multimodal models require fundamentally different prompting strategies than standard LLMs. The chain-of-thought techniques from Section 11.2 were designed for models that have no internal planning capability. Reasoning models internalize the reasoning process, making explicit CoT prompting redundant or counterproductive. Multimodal models that accept images, documents, and audio need prompts that guide visual attention and specify how to integrate information across modalities. This section provides a brief orientation to reasoning model prompting and then focuses in depth on multimodal prompt design. The API mechanics for reasoning and multimodal models are covered in Section 10.4.
Prerequisites
This section builds on the core prompting techniques from Section 11.1: Prompt Design Fundamentals and the advanced patterns in Section 11.2: Advanced Prompting Techniques. Familiarity with reasoning model APIs from Section 10.4: Reasoning Models & Multimodal APIs is strongly recommended, as this section focuses on the prompting layer rather than the API mechanics.
1. Prompting Reasoning Models: Summary
Prompting and using reasoning models is covered in depth in Section 08.4: Prompting and Using Reasoning Models, which includes decision frameworks, budget control APIs for each provider (OpenAI, Anthropic, Google), structured output strategies, best-of-N sampling, and common pitfalls. This section provides the key principles; see Chapter 8 for the complete treatment with code examples.
Why explicit CoT hurts reasoning models: Standard models (GPT-4o, Claude Sonnet without thinking) generate tokens left to right with no internal planning phase. Chain-of-thought prompting helps by giving these models a text-based "scratchpad." Reasoning models (o3, Claude with extended thinking, Gemini 2.5 with thinking) already have an internal thinking phase, trained via reinforcement learning. When you add explicit "think step by step" instructions, you force the model to generate redundant external reasoning on top of its internal reasoning, which can interfere with the model's learned reasoning strategy and waste tokens. The correct approach is to focus your prompt on the problem statement and constraints, and let the model handle the reasoning internally.
The core principle: stop writing chain-of-thought instructions for reasoning models. Because the model already plans internally, your prompt budget is better spent on (1) a precise problem statement, (2) clear constraints and edge cases, and (3) the desired output format. For a decision framework and provider-specific guidance, see Section 08.4.
Prompting a reasoning model with "think step by step" is like telling a chess grandmaster to "try to think about the game." They were already doing that, and your instruction just cluttered their thought process. The hardest lesson for experienced prompt engineers is learning to write less when moving to reasoning models.
Reasoning models (o3, Claude with extended thinking, Gemini 2.5 with thinking) are not universally better than standard models. They underperform on simple classification, creative writing, and high-volume tasks where speed matters more than depth. Reasoning tokens are billed at output rates, so a request with 10,000 thinking tokens costs 5x more than the visible output alone. Before switching to a reasoning model, benchmark it against a standard model on your specific task. If accuracy is comparable, the standard model saves both cost and latency.
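The cost arithmetic is worth making concrete. The helper below is a rough estimator, assuming a placeholder output price per million tokens (not any provider's actual rate); the key point is that thinking tokens are billed at the same output rate as the visible answer.

```python
def reasoning_request_cost(thinking_tokens: int, output_tokens: int,
                           price_per_mtok: float = 15.0) -> float:
    """Estimate output-side cost in dollars. Thinking tokens are billed
    at the same per-token rate as visible output tokens.
    price_per_mtok is a placeholder rate, not a real provider price."""
    return (thinking_tokens + output_tokens) * price_per_mtok / 1_000_000

# 10,000 thinking tokens on top of a 2,000-token visible answer:
visible_only = reasoning_request_cost(0, 2_000)        # 0.03
with_thinking = reasoning_request_cost(10_000, 2_000)  # 0.18, i.e. 6x the total
```

Running the same estimate before and after switching model classes makes the benchmark comparison described above a cost comparison as well.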
2. Multimodal Prompting Patterns
Multimodal prompts combine text with images, documents, or audio. The prompting challenge is guiding the model to attend to the right parts of the visual or audio input and produce the desired output format. Unlike text-only prompts where the model reads everything sequentially, multimodal models must integrate information from different modalities, and the quality of this integration depends heavily on how you frame the task.
2.1 Image + Text Prompting
Why text-before-image ordering matters: Vision transformers process image patches as tokens interleaved with text tokens. When the text instruction comes first, the model's attention mechanism can use that instruction as a "query" when processing the image tokens that follow. This is similar to how a person reads a question before looking at a chart: the question primes you to look for specific information. If the image comes first, the model processes it without context, and important details may receive less attention. This ordering effect is especially pronounced for complex images with multiple data points.
The most effective pattern for image understanding is: describe what you want extracted, then provide the image. Placing the text instruction before the image gives the model a "lens" through which to process the visual input. For complex images (charts, diagrams, screenshots), include specific guidance about what to focus on. Code Fragment 11.5.1 demonstrates two common multimodal prompt patterns: structured data extraction from a chart and side-by-side comparison of two images.
# Multimodal prompting patterns for image analysis with Claude
# Pattern 1: structured extraction from a chart image
# Pattern 2: multi-image comparison with explicit labeling
import base64

import anthropic

client = anthropic.Anthropic()

def encode_image(path):
    with open(path, "rb") as f:
        return base64.standard_b64encode(f.read()).decode()

# PATTERN 1: Structured extraction from a chart
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            # Text instruction FIRST (gives the model a lens)
            {
                "type": "text",
                "text": "This is a bar chart showing quarterly revenue. "
                        "Extract each quarter's value and calculate the "
                        "quarter-over-quarter growth rate. Return as JSON."
            },
            # Image SECOND
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": encode_image("revenue_chart.png")
                }
            }
        ]
    }]
)

# PATTERN 2: Multi-image comparison
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Compare these two UI designs. "
                     "Image 1 is the current version, Image 2 is the proposed "
                     "redesign. Identify: (a) layout changes, (b) color "
                     "changes, (c) potential accessibility concerns."},
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": encode_image("current_ui.png")}},
            {"type": "text",
             "text": "Image 1 (current) above. Image 2 (proposed) below."},
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": encode_image("proposed_ui.png")}}
        ]
    }]
)
2.2 Document Understanding Prompts
When sending PDF documents, the prompt design should account for the fact that the model sees each page as an image. For multi-page documents, explicitly tell the model whether information might span pages. For forms and tables, specify the exact fields to extract. For contracts and legal documents, point out which clauses or sections are most relevant.
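As a sketch of this guidance, the helper below assembles a content list for a multi-page document, using the same Anthropic-style content blocks as Code Fragment 11.5.1; the `document` block type matches Anthropic's PDF support, but the instruction wording and field-list convention are illustrative choices, not a fixed API.

```python
def build_document_prompt(fields, pdf_b64, spans_pages=True):
    """Build a content list for extracting named fields from a PDF.
    `fields` is a list of field names to pull out; `pdf_b64` is the
    base64-encoded document."""
    instruction = ("Extract the following fields from the attached document: "
                   + ", ".join(fields) + ". ")
    if spans_pages:
        instruction += ("Information may span multiple pages; check every "
                        "page before reporting a field as missing. ")
    instruction += "Return the result as JSON."
    return [
        # Text instruction first, so the model reads each page image
        # with the target fields already in mind.
        {"type": "text", "text": instruction},
        {"type": "document",
         "source": {"type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_b64}},
    ]
```

For contracts, the `fields` list would name the relevant clauses ("termination clause", "indemnification terms") rather than form fields.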
2.3 Video Frame Analysis
Current models do not accept raw video, but you can extract key frames and send them as a sequence of images. The prompt should establish the temporal relationship between frames and guide the model to track changes across them.
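One way to establish that temporal relationship, sketched below: interleave a timestamp label before each frame so the model sees an explicit chronological ordering. The labeling scheme is illustrative, not a standard.

```python
def build_frame_sequence(frames, interval_seconds, question):
    """Interleave timestamp labels with base64-encoded frames so the
    model can track order and elapsed time. `frames` is a list of
    base64 PNG strings sampled every `interval_seconds` of video."""
    content = [{"type": "text",
                "text": question + " The images below are frames sampled "
                        f"every {interval_seconds}s, in chronological order."}]
    for i, frame in enumerate(frames):
        # Label BEFORE each frame, mirroring the text-before-image rule
        content.append({"type": "text",
                        "text": f"Frame {i + 1}, t={i * interval_seconds}s:"})
        content.append({"type": "image",
                        "source": {"type": "base64",
                                   "media_type": "image/png",
                                   "data": frame}})
    return content
```

Questions like "describe what changes between consecutive frames" then have unambiguous referents ("between Frame 2 and Frame 3").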
A single image consumes 85 to 1,700 tokens depending on resolution. A 10-page PDF can consume 8,000 to 17,000 tokens. When designing multimodal prompts, consider whether you need the full resolution or whether a downscaled image suffices. For batch processing, crop images to the relevant region before sending them to the API. This can reduce costs by 50% or more while often improving accuracy by removing distracting content.
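A rough sizing check before sending helps decide whether to crop or downscale. The estimator below uses the approximate rule from Anthropic's documentation that an image costs about (width × height) / 750 tokens, capped near 1,600 for oversized images; treat both constants as approximations that vary by provider.

```python
def estimate_image_tokens(width: int, height: int) -> int:
    """Approximate token cost of one image under the ~(w*h)/750 rule,
    capped at ~1,600 tokens (oversized images are scaled down by the
    API). Constants are approximate and provider-specific."""
    return min((width * height) // 750, 1_600)

small_crop = estimate_image_tokens(400, 400)    # 213: cropping pays off
full_scan = estimate_image_tokens(2000, 2000)   # 1600: hits the cap
```

Comparing the estimate for a full screenshot against a crop of just the relevant chart makes the 50%-plus savings claim easy to verify for your own inputs.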
3. Choosing Between Standard and Reasoning Models
When deciding which model class to use, evaluate: (1) task complexity (multi-step reasoning favors reasoning models), (2) modality needs (images, documents, audio require multimodal support), (3) latency tolerance (reasoning models take 5 to 30 seconds), and (4) cost sensitivity (reasoning models cost 3 to 10x more per query). For a detailed decision flowchart and provider-specific budget control APIs (OpenAI reasoning_effort, Anthropic budget_tokens, Google thinkingBudget), see Section 08.4.
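These four criteria can be expressed as a toy routing sketch. The thresholds below are illustrative placeholders, not recommendations; the decision flowchart in Section 08.4 is the authoritative version.

```python
def choose_model_class(multi_step: bool, needs_multimodal: bool,
                       latency_budget_s: float, cost_sensitive: bool) -> str:
    """Toy router over the four criteria in the text.
    Thresholds are placeholders: benchmark on your own task."""
    if needs_multimodal:
        return "multimodal"      # modality needs come first
    if multi_step and latency_budget_s >= 30 and not cost_sensitive:
        return "reasoning"       # depth is worth the latency and cost
    return "standard"            # default: faster and cheaper
```

The point of writing it down, even as a toy, is that each branch forces an explicit answer to a question teams often leave implicit.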
Store prompts in version-controlled files (not inline strings) and tag each deployment with a prompt version. When quality regresses, you can instantly diff the current prompt against the last known-good version.
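A minimal sketch of that tagging, assuming prompts live as plain-text files: hash the file contents and log the digest with every deployment, so a regression can be traced to an exact prompt version.

```python
import hashlib
from pathlib import Path

def load_prompt(path: str) -> tuple[str, str]:
    """Load a prompt file and return (text, version_tag), where the
    tag is a short content hash to log alongside each deployment."""
    text = Path(path).read_text(encoding="utf-8")
    version = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, version
```

When quality drops, `git log -p` on the prompt file plus the logged hash identifies the last known-good version to diff against.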
- Remove chain-of-thought scaffolding when switching to reasoning models. For detailed guidance on reasoning model prompting, budget control, and common pitfalls, see Section 08.4.
- Place text instructions before images in multimodal prompts to give the model a processing lens for visual inputs.
- Multimodal prompts benefit from the same principles as text prompting: provide context, specify format, and set clear expectations. Crop images to the relevant region to reduce cost and improve accuracy.
- Reasoning models are not always better: for simple classification, creative writing, and high-volume low-complexity tasks, standard models are faster, cheaper, and sometimes more effective.
Who: A fintech startup's ML team building an automated due-diligence report generator for venture capital firms.
Situation: Their existing GPT-4o pipeline used a 2,000-token system prompt with detailed chain-of-thought instructions: "First analyze the revenue trend, then assess burn rate, then evaluate the competitive landscape, then..." The pipeline produced structured reports from pitch decks (sent as images) and financial spreadsheets.
Problem: When they switched to o3 without modifying the prompt, report quality dropped noticeably. The model's reasoning traces showed it was trying to follow the prescribed steps while simultaneously running its own analysis, producing redundant and sometimes contradictory sections.
Decision: They stripped the prompt from 2,000 tokens to 400 tokens: a clear problem statement ("Produce a due-diligence summary for a Series B investment"), explicit constraints (risk factors to flag, required output sections, maximum length), and the output JSON schema. All chain-of-thought scaffolding was removed.
Result: Report quality scores from analyst reviewers improved from 7.2/10 to 8.6/10. The shorter prompt also reduced input token costs by 80%. The reasoning traces showed the model discovered analytical angles that the original hand-crafted steps had missed, particularly around unit economics and customer concentration risk.
Lesson: When migrating to reasoning models, less prompting often produces better results; focus your prompt budget on problem clarity and constraints rather than reasoning scaffolding.
The interaction between prompt design and reasoning model behavior is poorly understood. Early research suggests that reasoning models develop internal "prompting strategies" during their thinking phase, effectively re-prompting themselves.
This raises the question of whether prompt engineering for reasoning models will converge toward simply stating the problem clearly, with all sophistication handled by the model's thinking process. Prompt-free reasoning (where the model determines the task from context alone) and thinking-aware prompting (where the prompt explicitly references the thinking phase) are both active areas of exploration.
Exercises
Create a multimodal prompt that sends an image of a data visualization (bar chart, scatter plot, or dashboard screenshot) to a vision-capable model. Write three variants: (1) image only with a generic question, (2) image plus detailed textual context about the data source and expected patterns, (3) image plus a structured output schema for the analysis. Compare the accuracy and specificity of each variant. Which additional context improved results most?
Answer
Variant (2) typically produces the most accurate analysis because contextual information helps the model interpret ambiguous visual elements (axis labels, color meanings, scale). Variant (3) produces the most structured and parseable output but may miss insights that fall outside the schema. The combination of (2) and (3), providing both context and output structure, generally produces the best results. The key lesson is that multimodal prompting benefits from the same principles as text prompting: provide context, specify format, and set clear expectations.
Design a multimodal prompt pipeline for extracting structured data from scanned invoices. The pipeline should handle: (a) images of varying quality, (b) multi-page documents, and (c) different invoice layouts. Define the prompt template, the output schema (JSON with vendor, date, line items, total), and a fallback strategy for when the model's confidence is low. Test with at least 3 sample invoice images and report extraction accuracy.
Answer
An effective pipeline: (1) Pre-process images to ensure minimum 300 DPI resolution and correct rotation. (2) Use a prompt that specifies the exact fields to extract, provides the JSON schema, and instructs the model to output "UNCERTAIN" for any field it cannot confidently read. (3) For multi-page documents, send all pages in a single request with text labels between them ("Page 1 of 3 above. Page 2 of 3 below."). (4) Fallback: if more than 2 fields are marked UNCERTAIN, route to a human reviewer or retry with a higher-resolution scan. Key finding: placing the text instruction before the first image consistently improves extraction accuracy by 10 to 15% compared to placing it after.
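The fallback step can be sketched as a small post-processing check; the UNCERTAIN sentinel and the more-than-two-fields threshold follow the pipeline described above, while the function and route names are illustrative.

```python
def route_extraction(record: dict, max_uncertain: int = 2) -> str:
    """Count UNCERTAIN fields (including inside nested line items)
    and decide whether the extraction can be accepted automatically."""
    def count(value) -> int:
        if isinstance(value, dict):
            return sum(count(v) for v in value.values())
        if isinstance(value, list):
            return sum(count(v) for v in value)
        return 1 if value == "UNCERTAIN" else 0

    uncertain = count(record)
    if uncertain == 0:
        return "accept"
    if uncertain <= max_uncertain:
        return "accept_with_flag"   # keep, but mark fields for spot checks
    return "human_review"           # too many unreadable fields: escalate
```

The same counter also supports the retry path: re-scan at higher resolution, re-extract, and only escalate if the UNCERTAIN count does not drop.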
For exercises on reasoning model prompting (prompt migration, thinking budget calibration, decision frameworks, and failure modes), see the exercises in Section 08.4.
What Comes Next
In the next chapter, Chapter 12: Hybrid ML + LLM Systems, we explore how to combine traditional machine learning with LLMs, building systems that leverage the strengths of both approaches.
OpenAI. (2024). Learning to Reason with LLMs. OpenAI Blog.
OpenAI's explanation of how o1-class models handle reasoning internally, with guidance on how prompting should change when using reasoning models. The recommended starting point for understanding why chain-of-thought prompting becomes unnecessary.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
Details how DeepSeek-R1 develops its reasoning behavior through RL training, providing insight into why the model responds to prompts differently than instruction-tuned models. The open-weight release enables direct experimentation with prompt variations.
Yang, Z. et al. (2023). The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision).
Systematic evaluation of multimodal prompting strategies, documenting which prompt designs work best for different visual understanding tasks. Directly applicable to the image and document prompting patterns discussed in this section.
Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.
The foundational chain-of-thought paper, which provides the baseline against which reasoning model prompting should be compared. Understanding the original CoT technique is essential for appreciating why it becomes counterproductive with reasoning models.
Anthropic. (2025). Prompt Engineering Guide. Anthropic Documentation.
Anthropic's comprehensive guide to prompting Claude models, including specific guidance for extended thinking mode. Covers system prompt design, multimodal inputs, and output formatting for both standard and reasoning configurations.
