Part 3: Working with LLMs
Chapter 10: Working with LLM APIs

API Landscape & Architecture

Any sufficiently advanced API is indistinguishable from a junior colleague who types really fast.

Pip Pip, Caffeinated AI Agent
Big Picture

A team at a fintech startup shipped their first LLM feature using the OpenAI API. It worked beautifully in testing. Then the bill arrived: $12,000 for one month, ten times their estimate. They had missed that their retry logic resubmitted full conversation histories on every 429 error, that their prompt included 800 tokens of unused system instructions, and that Anthropic's prompt caching could have cut costs by 90% for their repetitive workload (the same 800-token system prompt was resent with every request, making up the vast majority of input tokens). The API you choose, and how you call it, is not a footnote. It is a core architectural decision that shapes your cost structure, latency profile, and reliability posture. Building on the model landscape from Section 07.1 and the inference optimization techniques from Section 09.4, this section builds the fluency you need to make that decision well: we survey the major providers, compare their architectures side by side, and establish the patterns that will carry through the rest of Part 3.

Prerequisites

This section assumes familiarity with the transformer architecture from Section 04.1 and the decoding strategies covered in Section 05.1. Understanding of the modern model landscape from Section 07.1 will help contextualize which providers offer which models, and the inference optimization concepts from Section 09.1 explain why serving infrastructure matters for API design.

1. The LLM API Ecosystem

A marketplace showing various LLM API providers with different offerings, pricing, and capabilities
Figure 10.1.1: The LLM API marketplace: a bazaar of providers, each with unique strengths, pricing models, and just enough differences to keep your adapter code interesting.
Warning

All pricing figures in this chapter reflect approximate rates as of early 2025. LLM API prices change frequently, often decreasing by 50% or more within a year. Always check provider documentation for current pricing before making architectural decisions based on cost. Where possible, we express costs as ratios (e.g., "10x cheaper") rather than absolute dollar amounts to extend shelf life.

Why this matters: Every LLM application, from a simple chatbot to a complex multi-agent system, starts with an API call. Understanding the differences between providers is not academic trivia; it directly determines your cost structure, reliability posture, and how quickly you can ship features. Teams that treat the API layer as an afterthought end up locked into one provider, overpaying for tokens, and scrambling during outages.

The modern LLM API ecosystem is built around a surprisingly simple pattern: you send a sequence of messages (a conversation) to an HTTP endpoint, and the model returns a completion. Under the hood, each provider tokenizes your input, runs inference through a transformer model, and decodes the output tokens back into text. Despite this conceptual simplicity, each provider has evolved distinct conventions, capabilities, and pricing models that shape how you build applications.
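Stripped of SDK conveniences, that shared pattern is just JSON over HTTP. A minimal sketch of the request body every provider variant builds on (field names here follow the OpenAI convention; the values are illustrative):

```python
import json

# The near-universal request shape: a model name, a list of role-tagged
# messages, and a handful of sampling parameters.
request_body = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 150,
}

# Serialized, this is the literal payload POSTed to the HTTP endpoint.
payload = json.dumps(request_body)
print(payload)
```

Every provider section that follows is, at bottom, a variation on this dictionary: different field names, same structure.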

The major commercial providers include OpenAI, Anthropic, and Google. Each offers hosted inference with usage-based pricing, typically measured in tokens. Beyond these, cloud platforms like AWS Bedrock and Azure OpenAI provide enterprise wrappers that add compliance, networking, and billing features on top of the same underlying models. For open-source models, serving frameworks like vLLM, TGI, and Ollama expose OpenAI-compatible endpoints, creating a de facto standard. Figure 10.1.1 maps this ecosystem and the relationships between providers.

Fun Note

OpenAI's chat completions format has become so dominant that even competitors implement it. Anthropic has its own Messages API, Google has its own Gemini API, yet every open-source serving framework defaults to the OpenAI format. It is the QWERTY keyboard of LLM APIs: not necessarily optimal, but so widely adopted that resistance is futile.

LLM API ecosystem: application connecting to OpenAI, Anthropic, Google, Enterprise, and Open Source providers via common HTTP POST JSON pattern
Figure 10.1.2: The LLM API ecosystem. All major providers share a common HTTP + JSON pattern, but differ in message format, parameter names, and special capabilities.

2. OpenAI Chat Completions API

Fun Fact

The OpenAI Chat Completions API became so ubiquitous that virtually every open-source serving framework now implements an "OpenAI-compatible" endpoint. It is the rare case where a proprietary API became a de facto open standard simply because everyone got tired of learning new request formats.

The OpenAI Chat Completions API established the dominant pattern for LLM APIs. You send a list of messages, each with a role (system, user, or assistant) and content, along with generation parameters. The API returns a completion object containing the model's response, token usage counts, and metadata.

2.1 Core Request Structure

An API depicted as a hotel receptionist handling requests, managing rooms (models), and processing check-ins (prompts)
Figure 10.1.3: Think of an LLM API as a hotel receptionist: you describe what you need, they route your request to the right room, and hand back the result.

The key parameters that control generation behavior are model (which model handles the request), messages (the conversation history), temperature (sampling randomness), max_tokens (an upper bound on response length), and top_p (the nucleus sampling cutoff). Code Fragment 10.1.1 demonstrates the core request structure.

# Send a chat completion request with system and user messages
# Temperature and max_tokens control randomness and response length
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from environment

# Send chat completion request to the API
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."}
    ],
    temperature=0.7,
    max_tokens=150
)

# Extract the generated message from the API response
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.prompt_tokens} input, "
      f"{response.usage.completion_tokens} output")
A list comprehension is a concise way to create lists in Python by writing an expression followed by a for clause inside square brackets. For example, [x**2 for x in range(10)] creates a list of squares from 0 to 81, and you can add an optional if clause to filter elements.
Tokens used: 28 input, 52 output
Code Fragment 10.1.1: Send a chat completion request with system and user messages
Note

Temperature vs. top_p: OpenAI recommends adjusting one or the other, not both simultaneously. Temperature scales the logits before softmax, while top_p truncates the distribution after softmax. Using both can create unpredictable interactions. For deterministic output, set temperature=0. For creative tasks, try temperature=0.8 to 1.2.
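The note's distinction can be made concrete with a few lines of arithmetic. This sketch applies temperature scaling before softmax and top_p truncation after it, using made-up logits for three candidate tokens:

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
probs = softmax(logits)    # baseline distribution (T=1.0)

# Temperature scales logits *before* softmax: low T sharpens, high T flattens.
sharp = softmax([x / 0.5 for x in logits])
flat = softmax([x / 1.5 for x in logits])

print(f"T=1.0: {[round(x, 3) for x in probs]}")
print(f"T=0.5: {[round(x, 3) for x in sharp]}")
print(f"T=1.5: {[round(x, 3) for x in flat]}")
# top_p truncates *after* softmax; here p=0.9 drops the least likely token.
print(f"top_p=0.9 keeps token indices: {sorted(top_p_filter(probs))}")
```

Running both knobs at once means the truncation set itself shifts as temperature changes, which is exactly the unpredictable interaction the note warns about.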

2.2 Streaming Responses with Server-Sent Events

A conveyor belt delivering tokens one at a time via Server-Sent Events, showing streaming API responses
Figure 10.1.4: Streaming via SSE: tokens arrive on a conveyor belt one by one, so your users see results forming in real time instead of staring at a spinner.

For interactive applications, waiting for the entire response to complete before displaying anything creates a poor user experience. Streaming delivers tokens as they are generated using the Server-Sent Events (SSE) protocol. Each chunk arrives as a small JSON object containing the delta (the new content since the last chunk). Code Fragment 10.1.2 shows this in practice.

# Stream a chat completion using Server-Sent Events
# Each chunk delivers incremental tokens as they are generated
from openai import OpenAI

client = OpenAI()

# Send chat completion request to the API
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Write a haiku about APIs."}
    ],
    stream=True
)

collected_content = ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        collected_content += delta.content
        print(delta.content, end="", flush=True)

print(f"\n\nFull response: {collected_content}")
Signals cross the wire
JSON carries the question
Tokens flow like rain

Full response: Signals cross the wire
JSON carries the question
Tokens flow like rain
Code Fragment 10.1.2: Stream a chat completion using Server-Sent Events
Key Insight

When to stream: Streaming is essential for chatbots and interactive UIs where time-to-first-token (TTFT) matters more than total latency. For batch processing or backend pipelines where you need the complete response before proceeding, non-streaming calls are simpler and provide the full usage object in a single response.
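Measuring TTFT takes only two timestamps around the stream loop. The sketch below substitutes a fake token generator for a live SSE stream so it runs anywhere; in practice you would iterate over a stream=True response instead:

```python
import time

def fake_stream(tokens, inter_token_delay=0.01):
    """Stand-in for an SSE stream: yields tokens with a small delay each."""
    for tok in tokens:
        time.sleep(inter_token_delay)
        yield tok

start = time.perf_counter()
ttft = None
for token in fake_stream(["Hello", ",", " world", "!"]):
    if ttft is None:
        ttft = time.perf_counter() - start  # first token arrived
total = time.perf_counter() - start

print(f"TTFT: {ttft * 1000:.1f} ms, total latency: {total * 1000:.1f} ms")
```

Against a real API, TTFT is dominated by prefill (prompt processing), while the remaining gap to total latency is decode time; streaming hides the latter but not the former.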

2.3 The Batch API

OpenAI's Batch API allows you to submit large collections of requests (up to 50,000) that are processed asynchronously within a 24-hour window. The key advantage is a 50% cost reduction compared to synchronous API calls. To put that in concrete terms: if your evaluation pipeline processes 10,000 prompts at $0.005 per request, the Batch API drops the cost from $50 to $25 for the same results, just delivered within 24 hours instead of immediately. This makes it ideal for tasks like dataset labeling, bulk classification, or evaluation pipelines where real-time responses are unnecessary. Code Fragment 10.1.3 shows this approach in practice.

# Build a JSONL file of requests and submit them as an asynchronous batch
# Batched requests cost 50% less than equivalent synchronous calls
from openai import OpenAI
import json

client = OpenAI()

# Step 1: Create a JSONL file with batch requests
requests = [
    {
        "custom_id": f"review-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Classify the sentiment as positive, negative, or neutral."},
                {"role": "user", "content": review}
            ],
            "max_tokens": 10
        }
    }
    for i, review in enumerate(["Great product!", "Terrible service.", "It was okay."])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload and submit the batch
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)
print(f"Batch submitted: {batch_job.id}, status: {batch_job.status}")
Batch submitted: batch_abc123def456, status: validating
Code Fragment 10.1.3: Submit a sentiment-classification batch via the OpenAI Batch API
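The 50% discount arithmetic from the text is worth encoding as a helper when evaluating workloads for batch eligibility. A sketch (the per-request cost is an illustrative placeholder, not a current price):

```python
def batch_savings(num_requests, cost_per_request, discount=0.5):
    """Return (sync_cost, batch_cost, saved) for a batch-eligible workload."""
    sync_cost = num_requests * cost_per_request
    batch_cost = sync_cost * (1 - discount)
    return sync_cost, batch_cost, sync_cost - batch_cost

# The evaluation-pipeline example from the text: 10,000 prompts at $0.005 each.
sync, batch, saved = batch_savings(10_000, 0.005)
print(f"Sync: ${sync:.2f}, Batch: ${batch:.2f}, saved: ${saved:.2f}")
```

The trade is explicit: half the cost in exchange for up to 24 hours of delay, which is why batch only suits non-interactive workloads.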

3. Anthropic Messages API

Anthropic's Messages API follows a similar conversational pattern but introduces several distinctive design choices. System prompts are a separate top-level parameter rather than a message role. The API supports prompt caching (which can reduce costs by up to 90% for repeated prefixes), extended thinking for complex reasoning, and a robust tool use system.

3.1 Core Differences from OpenAI

The most notable architectural differences include:

  1. The system prompt is a separate top-level system parameter, not a message with role "system" inside the messages array.
  2. max_tokens is required rather than optional.
  3. Responses arrive as a list of typed content blocks (response.content[0].text) rather than a single message string.
  4. The usage object reports input_tokens and output_tokens (OpenAI calls these prompt_tokens and completion_tokens), and the response includes a stop_reason field.

Code Fragment 10.1.4 demonstrates these differences in practice.

# Call Claude using the Anthropic Messages API
# The response includes input/output token counts for cost tracking
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    system="You are a helpful coding assistant. Be concise.",
    messages=[
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."}
    ]
)

print(response.content[0].text)
print(f"Tokens: {response.usage.input_tokens} input, "
      f"{response.usage.output_tokens} output")
print(f"Stop reason: {response.stop_reason}")
A list comprehension lets you create a new list by applying an expression to each item in an iterable, all in a single readable line like [x**2 for x in range(10)]. You can also add conditions to filter elements, such as [x for x in data if x > 0].
Tokens: 26 input, 58 output
Stop reason: end_turn
Code Fragment 10.1.4: Call Claude using the Anthropic Messages API

3.2 Prompt Caching

Why prompt caching matters: In most production applications, the system prompt is identical across thousands of requests. Without caching, you pay full price for those same system prompt tokens every single time. Prompt caching exploits this repetition: the provider stores the tokenized, processed representation of your prefix, so subsequent requests skip the expensive prefill computation. The intuition is similar to how a web browser caches static assets. The system prompt is your "static asset," and caching prevents you from re-downloading it on every page load.
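To put numbers on the opening anecdote, a back-of-envelope estimator helps. The sketch below assumes cached input tokens bill at roughly one-tenth the normal input rate; check your provider's current cache-read pricing, as the exact multiplier varies:

```python
def monthly_input_cost(requests, cached_prefix_tokens, dynamic_tokens,
                       input_rate_per_1k, cache_read_multiplier=0.1,
                       caching=False):
    """Estimate monthly input-token spend with and without prefix caching."""
    prefix_rate = input_rate_per_1k * (cache_read_multiplier if caching else 1.0)
    per_request = ((cached_prefix_tokens / 1000) * prefix_rate
                   + (dynamic_tokens / 1000) * input_rate_per_1k)
    return requests * per_request

# 1M requests/month, an 800-token system prompt, ~100 dynamic tokens each
# (rates are illustrative placeholders, not current prices).
without = monthly_input_cost(1_000_000, 800, 100, 0.003)
with_cache = monthly_input_cost(1_000_000, 800, 100, 0.003, caching=True)
print(f"Without caching: ${without:,.0f}; with caching: ${with_cache:,.0f}")
```

The larger the shared prefix relative to the dynamic suffix, the closer the savings approach the cache discount itself, which is how workloads like the fintech team's reach 90% reductions.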

Anthropic's prompt caching feature is particularly valuable when you repeatedly send requests with a shared prefix (for example, a long system prompt or a large document in context). By marking content blocks with cache_control, you tell the API to cache that prefix. Subsequent requests that share the same cached prefix receive a significant discount on input token costs and reduced latency. Code Fragment 10.1.5 shows this approach in practice.

# Mark a long system prompt with cache_control so the API caches the prefix
# Subsequent requests sharing the prefix pay the discounted cache-read rate
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": "You are an expert on the Python standard library. " * 50,  # Long system prompt
            "cache_control": {"type": "ephemeral"}  # Cache this prefix
        }
    ],
    messages=[
        {"role": "user", "content": "What is the difference between os.path and pathlib?"}
    ]
)

# Check cache performance
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
Cache creation tokens: 262
Cache read tokens: 0
Code Fragment 10.1.5: Mark a shared prefix for prompt caching with cache_control

4. Google Gemini API

Google's Gemini API uses a generateContent endpoint that shares the same conversational pattern but employs different terminology. Messages are called "contents," roles are "user" and "model" (not "assistant"), and the API supports unique features like grounding (connecting to Google Search), code execution, and native multimodal input. Code Fragment 10.1.6 shows this in practice.

# Call Google Gemini using the genai SDK
# Configuration uses GenerateContentConfig for temperature and token limits
from google import genai

client = genai.Client()  # reads GOOGLE_API_KEY from environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain Python list comprehensions in two sentences.",
    config=genai.types.GenerateContentConfig(
        temperature=0.7,
        max_output_tokens=150,
        system_instruction="You are a helpful coding assistant."
    )
)

print(response.text)
print(f"Tokens: {response.usage_metadata.prompt_token_count} input, "
      f"{response.usage_metadata.candidates_token_count} output")
A list comprehension provides a compact syntax for creating lists by applying an expression to each element of an iterable, written as [expression for item in iterable]. You can optionally include conditions to filter elements, like [x for x in numbers if x > 0].
Tokens: 18 input, 47 output
Code Fragment 10.1.6: Call Google Gemini using the genai SDK
Library Shortcut: LiteLLM for Unified Provider Interface

Tired of learning three different SDKs? LiteLLM provides a single interface for OpenAI, Anthropic, Gemini, and 100+ other providers:

# LiteLLM: unified interface for 100+ LLM providers.
# Swap provider by changing only the model string prefix.
from litellm import completion

r1 = completion(model="gpt-4o", messages=[{"role": "user", "content": "Hi"}])
r2 = completion(model="anthropic/claude-sonnet-4-20250514", messages=[{"role": "user", "content": "Hi"}])
r3 = completion(model="gemini/gemini-2.5-flash", messages=[{"role": "user", "content": "Hi"}])
Code Fragment 10.1.7: LiteLLM as a unified provider abstraction. The same completion() call works with OpenAI, Anthropic, and Gemini by changing only the model string prefix.

Install with pip install litellm. Full routing and fallback patterns appear in Section 10.3.

5. Enterprise Wrappers: AWS Bedrock and Azure OpenAI

Enterprise cloud providers wrap the underlying models with additional infrastructure for security, compliance, and billing. AWS Bedrock provides access to models from Anthropic, Meta, Mistral, and others through a unified API with IAM authentication, VPC endpoints, and consolidated AWS billing. Azure OpenAI deploys the same OpenAI models within Microsoft's cloud, adding features like content filtering, private networking, and regional data residency.

The following example shows how to call Claude on AWS Bedrock using boto3. The key difference from the direct Anthropic API is authentication (IAM credentials instead of an API key) and the endpoint configuration:

# Call Claude on AWS Bedrock using boto3 with IAM authentication
# The request body follows the Anthropic Messages API format
import boto3
import json

# Bedrock uses IAM credentials (configured via AWS CLI or environment variables)
bedrock = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [
        {"role": "user", "content": "Explain what an API gateway does in two sentences."}
    ]
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    body=body,
    contentType="application/json",
    accept="application/json"
)

result = json.loads(response["body"].read())
print(result["content"][0]["text"])
An API gateway acts as a single entry point for client requests, routing them to the appropriate backend services while handling cross-cutting concerns like authentication and rate limiting. It simplifies client interactions by abstracting the complexity of the underlying microservices architecture.
Code Fragment 10.1.8: Call Claude on AWS Bedrock using boto3 with IAM authentication
Warning

API version drift: Enterprise wrappers sometimes lag behind the direct provider APIs by days or weeks. A feature available on api.openai.com may not yet be available on Azure OpenAI (or may require a specific API version string). Always check the enterprise wrapper's documentation for supported features and versions before building against them.

Note

All three major providers (GPT-4o, Claude, Gemini) now support image input alongside text. You can send images as base64-encoded data or URLs in the messages array. While this section focuses on text APIs, be aware that these same endpoints handle multimodal input. Consult each provider's documentation for image formatting details, token counting for images, and supported image formats.

6. Open-Source Serving: The OpenAI-Compatible Pattern

A powerful convention has emerged in the open-source ecosystem: serving frameworks expose an API that mimics the OpenAI Chat Completions format. This means you can point the OpenAI Python SDK at any compatible server by changing the base_url, and your existing code works without modification. Frameworks like vLLM, Text Generation Inference (TGI), and Ollama all support this pattern. Code Fragment 10.1.9 shows this approach in practice.

# Send a chat completion request to a local OpenAI-compatible server
# Only the base_url changes; the request format is identical to OpenAI's
from openai import OpenAI

# Point the OpenAI client at a local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Local servers often skip auth
)

# Send chat completion request to the local server
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Explain Python list comprehensions in two sentences."}
    ],
    temperature=0.7,
    max_tokens=150
)

# Extract the generated message from the API response
print(response.choices[0].message.content)
List comprehensions in Python let you build a new list by applying an expression to each element of an existing iterable in a single line. They can include optional filtering with an if clause, making them a concise alternative to traditional for loops with append calls.
Code Fragment 10.1.9: Send a chat completion request to a local vLLM server via the OpenAI-compatible endpoint
Key Insight

The portability benefit: Because the OpenAI-compatible API has become a de facto standard, code written against the OpenAI SDK can run against local models, cloud providers, or even custom fine-tuned models with zero changes to the application logic. Only the base_url and model name change. This is a deliberate design choice by the open-source community to reduce switching costs.

7. Provider Comparison

The following table summarizes the key differences across the major providers. Understanding these differences helps you choose the right provider for your use case and design portable abstractions.

Feature | OpenAI | Anthropic | Google Gemini
Endpoint | /chat/completions | /messages | generateContent
System prompt | Message with role "system" | Top-level system param | system_instruction config
Assistant role name | assistant | assistant | model
Max tokens | Optional | Required | Optional
Streaming | SSE (stream=True) | SSE (stream=True) | SSE (stream=True)
Prompt caching | Automatic | Explicit markers | Context caching API
Batch API | Yes (50% discount) | Yes (Message Batches) | No dedicated batch
Tool use | Function calling | Tool use | Function declarations
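The naming differences summarized in the table are mechanical enough to absorb in a small translation layer. A simplified sketch (text-only content, no tool calls or multimodal parts) converting an OpenAI-style message list into the Anthropic and Gemini shapes:

```python
def to_anthropic(messages):
    """OpenAI-style list -> (system_str, messages): system becomes top-level."""
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest

def to_gemini(messages):
    """OpenAI-style list -> (system_instruction, contents): 'assistant' -> 'model'."""
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    contents = [
        {"role": "model" if m["role"] == "assistant" else "user",
         "parts": [{"text": m["content"]}]}
        for m in messages if m["role"] != "system"
    ]
    return system, contents

msgs = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
print(to_anthropic(msgs))
print(to_gemini(msgs))
```

Libraries like LiteLLM perform exactly this kind of translation internally, plus the harder cases (tool calls, images, streaming events) that this sketch omits.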

Despite the naming differences summarized above, all providers follow the same fundamental lifecycle. Figure 10.1.5 illustrates this universal request/response cycle.

Universal request/response cycle for LLM APIs: client sends parameters, provider processes through five steps, returns JSON response
Figure 10.1.5: The universal request/response cycle for LLM APIs. Despite naming differences, every provider follows this same five-step pipeline.

8. Authentication and Rate Limits

Why authentication and rate limiting exist: LLM inference is computationally expensive, and providers need to manage shared GPU resources across thousands of customers. Rate limits prevent any single customer from monopolizing capacity, while authentication ties usage to billing accounts and access policies. Understanding these constraints is essential because your application architecture must be designed around them, not as an afterthought. When you deploy to production, rate limit handling becomes a core reliability concern.

Every provider requires authentication, typically via an API key passed in the Authorization header. Enterprise providers like Azure OpenAI also support Azure Active Directory tokens. Rate limits are enforced at multiple levels: requests per minute (RPM), tokens per minute (TPM), and sometimes requests per day (RPD). Exceeding these limits results in HTTP 429 (Too Many Requests) responses.
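The standard defensive pattern for 429 responses is exponential backoff with jitter: retry after a delay that doubles on each attempt, randomized to avoid synchronized retry storms. A provider-agnostic sketch; the RateLimitError class and the flaky demo function are placeholders for your SDK's exception type and your actual API call:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your SDK's 429 exception type."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on rate-limit errors, doubling the delay with jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            delay = base_delay * (2 ** attempt) * (0.5 + random.random())
            time.sleep(delay)

# Demo: a stand-in call that fails twice with 429 before succeeding.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return "ok"

print(with_backoff(flaky, base_delay=0.01))  # prints "ok"
```

One caution echoing the chapter opener: retry the minimal request, not a regrown conversation history, or each retry multiplies your token bill.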

Warning

Never hardcode API keys. Store them in environment variables, secrets managers (like AWS Secrets Manager or HashiCorp Vault), or .env files that are excluded from version control. A leaked API key can result in unauthorized usage and significant charges on your account.

Rate limit headers returned with each response tell you how much capacity remains. OpenAI, for example, includes headers such as x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-requests. Monitoring these headers allows you to implement proactive throttling rather than waiting for 429 errors.
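A proactive throttle can key off those headers directly. The sketch below assumes OpenAI-style header names and hypothetical threshold values; other providers use different headers:

```python
def should_throttle(headers, min_requests=5, min_tokens=2000):
    """Decide whether to pause based on rate-limit response headers."""
    remaining_requests = int(headers.get("x-ratelimit-remaining-requests", 10**9))
    remaining_tokens = int(headers.get("x-ratelimit-remaining-tokens", 10**9))
    return remaining_requests < min_requests or remaining_tokens < min_tokens

# Example header snapshots as they might accompany two responses.
healthy = {"x-ratelimit-remaining-requests": "180",
           "x-ratelimit-remaining-tokens": "90000"}
nearly_exhausted = {"x-ratelimit-remaining-requests": "3",
                    "x-ratelimit-remaining-tokens": "90000"}

print(should_throttle(healthy))           # False
print(should_throttle(nearly_exhausted))  # True
```

In production you would feed the decision into a queue or token bucket rather than sleeping inline, but the header-reading logic is the same.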

9. Choosing Between Providers

The right provider depends on your specific requirements. Consider these factors when making your decision:

  1. Cost: per-token pricing, batch discounts, and prompt caching support for your workload shape.
  2. Latency: time-to-first-token and throughput under your expected load.
  3. Capabilities: tool use, multimodal input, context window size, and reasoning features.
  4. Compliance and deployment: data residency, private networking, and enterprise wrappers like AWS Bedrock or Azure OpenAI.
  5. Reliability: rate limits, outage history, and how easily you can fail over to a second provider.

Note

Multi-provider strategy: Many production systems use multiple providers. A common pattern is to route simple tasks to smaller, cheaper models and reserve expensive frontier models for complex queries. We will explore this routing pattern in detail in Section 10.3.

Self-Check
Q1: In the Anthropic Messages API, where is the system prompt specified?
Show Answer
The system prompt is specified as a separate top-level system parameter in the API call, not as a message with the "system" role inside the messages array. This is a key architectural difference from the OpenAI API.
Q2: What is the primary advantage of the OpenAI Batch API?
Show Answer
The Batch API provides a 50% cost reduction compared to synchronous API calls. Requests are processed asynchronously within a 24-hour window, making it ideal for non-real-time tasks like bulk classification, dataset labeling, or evaluation pipelines.
Q3: Why has the OpenAI-compatible API format become a de facto standard for open-source model serving?
Show Answer
Because it allows developers to use the OpenAI Python SDK with any compatible server simply by changing the base_url parameter. This means existing application code works without modification when switching between providers, reducing switching costs and making local and cloud deployments interchangeable.
Q4: What HTTP status code indicates you have exceeded an API rate limit?
Show Answer
HTTP 429 (Too Many Requests). Rate limits are enforced at multiple levels: requests per minute (RPM), tokens per minute (TPM), and sometimes requests per day (RPD). The response headers include information about remaining capacity and reset times.
Q5: Name two differences between how Google Gemini and OpenAI label their message roles.
Show Answer
First, Gemini uses "model" for the assistant's role while OpenAI uses "assistant." Second, Gemini uses "contents" (plural) as the parameter name for the message list while OpenAI uses "messages." These naming differences require translation when building provider-agnostic abstractions.
Note

Pick any code example from this section and try these experiments:

  1. Change the temperature from 0.0 to 1.0 and run the same prompt five times. Compare how the outputs vary. At what temperature do you start seeing meaningfully different responses?
  2. Set max_tokens to 10 on a question that requires a long answer. Observe how the model truncates its response mid-sentence. This is why production code must handle incomplete responses gracefully.
  3. Send the same prompt to two different providers (e.g., OpenAI and the vLLM local example). Compare the outputs, latency, and token counts. Note which differences are cosmetic and which are substantive.
Tip: Set Timeouts on Every API Call

Always configure explicit timeouts (connection and read) on LLM API calls. A 30-second timeout is a reasonable default. Without timeouts, a single slow response can block your entire application or exhaust your connection pool.
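With the OpenAI Python SDK, both the timeout and the SDK's built-in retry count are client constructor options. A minimal configuration sketch:

```python
from openai import OpenAI

# Fail fast instead of hanging: 30-second overall timeout, at most 2 SDK retries.
client = OpenAI(timeout=30.0, max_retries=2)
```

Other SDKs expose equivalent options; check each client's constructor for timeout and retry parameters rather than relying on defaults.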

Key Takeaways

  1. All major LLM APIs share the same HTTP + JSON conversational pattern but differ in message format, parameter names, and special capabilities.
  2. The OpenAI Chat Completions format is a de facto standard: open-source servers like vLLM, TGI, and Ollama expose compatible endpoints, so switching backends often means changing only the base_url and model name.
  3. Cost levers are architectural: batch APIs cut prices by 50% for asynchronous workloads, and prompt caching can reduce input costs by up to 90% for repeated prefixes.
  4. Rate limits (RPM, TPM) and authentication are design constraints, not afterthoughts: handle 429s with backoff, monitor rate limit headers, and keep API keys out of source code.

Note

Now that you can call any major LLM API and stream responses, the next question is: how do you make the model return structured, machine-readable output instead of free-form text? And how do you let the model interact with external systems through function calls? Section 10.2 tackles both of these challenges.

Real-World Scenario: Migrating from Single Provider to Multi-Provider Architecture

Who: A backend engineering team at a mid-stage SaaS startup building an AI-powered customer support platform.

Situation: The team had built their entire product on OpenAI's Chat Completions API, processing roughly 50,000 requests per day across ticket classification, response drafting, and sentiment analysis.

Problem: A 45-minute OpenAI outage during peak hours caused a complete service failure, resulting in hundreds of unanswered support tickets and an escalation from their largest enterprise customer.

Dilemma: They could add Anthropic as a fallback (requiring code changes for the different message format) or adopt a unified abstraction layer like LiteLLM (simpler switching but adding a dependency). They also considered self-hosting an open model via vLLM as a third tier.

Decision: They chose LiteLLM as the abstraction layer with three providers: OpenAI as primary, Anthropic as secondary, and a self-hosted Mistral model via vLLM (using the OpenAI-compatible endpoint) as a last resort for basic classification tasks.

How: They wrapped all API calls through LiteLLM, configured automatic failover with provider-specific rate limit detection, and standardized on the OpenAI message format internally. The vLLM endpoint required no code changes because it exposed the same API shape.

Result: Over six months, the system experienced zero complete outages despite three individual provider incidents. Response drafting quality remained within 2% across providers, and the self-hosted fallback handled 12% of total traffic during peak load, reducing API costs by $4,200/month.

Lesson: The OpenAI-compatible API format is a de facto standard that makes multi-provider architectures practical; investing in provider abstraction early prevents painful migrations later.

The following snippet shows how LiteLLM unifies provider calls with a single function. For the full treatment of provider routing, fallback, and cost tracking, see Section 10.3.

# pip install litellm
import litellm

# Same function, different providers (prefix determines the backend)
response = litellm.completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain caching in one sentence."}],
    max_tokens=60
)
print(response.choices[0].message.content)
print(f"Cost: ${litellm.completion_cost(response):.6f}")
Caching stores frequently accessed data in a faster storage layer to reduce retrieval time and computational cost.
Cost: $0.000285
Code Fragment 10.1.10: Call Claude via LiteLLM and compute the per-request cost
Fun Fact

The OpenAI Chat Completions API format has become so ubiquitous that even competitors adopt it. vLLM, Ollama, and dozens of other serving tools expose OpenAI-compatible endpoints, making the OpenAI Python SDK a "universal remote" for models it never trained.

Research Frontier

Unified multi-provider protocols. The proliferation of provider-specific API formats has driven projects like LiteLLM and the emerging Model Context Protocol (MCP) to standardize tool calling and message formats across vendors. As of 2025, no single standard has won, but convergence toward OpenAI-compatible endpoints continues to accelerate, with vLLM, SGLang, and Ollama all adopting this format.

Semantic caching at the API layer. Rather than caching exact prompts, research teams at companies like Zilliz (with GPTCache) are exploring embedding-based semantic caches that recognize when a new prompt is "close enough" to a previously answered one. Early results show 40 to 60% cache hit rates on production workloads with acceptable quality degradation.

Disaggregated inference APIs. Providers are beginning to separate prefill (prompt processing) from decode (token generation) into distinct API tiers, allowing cost optimization based on workload profile. This mirrors the disaggregated serving architectures discussed in Section 09.4.

Exercises

Exercise 10.1.1: Provider comparison Conceptual

Name three key differences between the OpenAI Chat Completions API and the Anthropic Messages API in terms of message format, token counting, and streaming behavior.

Answer Sketch

OpenAI uses a flat messages array with role/content pairs and counts tokens with tiktoken; Anthropic separates the system message into its own parameter, counts tokens server-side, and returns a structured content block array rather than a single string. Streaming event formats also differ: OpenAI sends delta objects while Anthropic sends typed content_block events.

Exercise 10.1.2: Token economics Coding

Write a Python function that estimates the cost of a single API call given the model name, input token count, and output token count. Support at least three models with different pricing tiers.

Answer Sketch

Create a dictionary mapping model names to (input_price_per_1k, output_price_per_1k) tuples. Multiply input tokens by the input rate and output tokens by the output rate, then sum. For example: {'gpt-4o': (0.0025, 0.01), 'claude-sonnet': (0.003, 0.015), 'gemini-flash': (0.0001, 0.0004)}. Return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate.

Exercise 10.1.3: API authentication Conceptual

Explain why API keys should be stored in environment variables rather than hardcoded in source files. What are two alternative approaches for managing secrets in production?

Answer Sketch

Hardcoded keys can be accidentally committed to version control and exposed publicly. Two production alternatives: (1) a secrets manager such as AWS Secrets Manager or HashiCorp Vault that injects secrets at runtime, and (2) service account authentication using cloud IAM roles, which avoids long-lived API keys entirely.
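
A small helper captures the pattern: read the key from the environment and fail fast with a clear error if it is missing, rather than letting a request fail later with a confusing 401.

```python
import os

def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read an API key from the environment, failing fast if absent."""
    key = os.environ.get(var_name)
    if key is None:
        raise RuntimeError(
            f"{var_name} is not set; export it in your shell or inject it "
            "via a secrets manager. Never hardcode keys in source."
        )
    return key
```

In production, a secrets manager or IAM role would populate the environment (or replace it entirely); this helper only ensures misconfiguration surfaces at startup.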

Exercise 10.1.4: Streaming vs. non-streaming Coding

Write a Python snippet using the OpenAI SDK that streams a chat completion response and prints each token as it arrives. Include proper error handling for network interruptions.

Answer Sketch

Use client.chat.completions.create(..., stream=True) in a try/except block. Iterate over the response with for chunk in response, extracting chunk.choices[0].delta.content. Wrap in try/except (openai.APIConnectionError, openai.APITimeoutError) to handle network issues gracefully.

Exercise 10.1.5: Rate limit strategy Analysis

A production system receives 500 requests per minute but the API rate limit is 200 RPM. Design a strategy that serves all requests without dropping any, using queuing, batching, or caching.

Answer Sketch

Combine three techniques: (1) implement a request queue with exponential backoff that retries 429 responses, (2) use semantic caching to serve identical or near-identical queries from cache (reducing effective RPM), and (3) batch multiple short requests into a single prompt where possible. For remaining overflow, implement a priority queue that processes high-priority requests first and delays low-priority ones.
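
The queuing and caching parts of that strategy can be sketched with a toy scheduler. The class below admits at most `rpm` upstream calls per 60-second window, serves exact-match cache hits without consuming capacity, and drains queued requests in priority order (lower number = higher priority). Timestamps are passed in explicitly so the behavior is deterministic; the exponential-backoff retry and request batching pieces are omitted for brevity.

```python
import heapq

class RateLimitedQueue:
    """Toy scheduler: cap upstream calls per rolling 60s window,
    serve cache hits for free, drain pending work by priority."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.sent = []      # timestamps of requests sent upstream
        self.cache = {}     # prompt -> response (exact-match cache)
        self.pending = []   # heap of (priority, seq, prompt)
        self.seq = 0        # tie-breaker preserving submission order

    def submit(self, prompt, priority, now):
        """Return a cached response immediately, or queue the request."""
        if prompt in self.cache:
            return self.cache[prompt]
        heapq.heappush(self.pending, (priority, self.seq, prompt))
        self.seq += 1
        return None

    def drain(self, now, call_api):
        """Send as many queued requests as the current window allows."""
        self.sent = [t for t in self.sent if now - t < 60]  # expire old slots
        served = []
        while self.pending and len(self.sent) < self.rpm:
            _, _, prompt = heapq.heappop(self.pending)
            self.cache[prompt] = call_api(prompt)
            self.sent.append(now)
            served.append(prompt)
        return served
```

With `rpm=200` and the 40 to 60% semantic cache hit rates cited earlier, an effective 500 RPM inbound load becomes plausible: cache hits absorb roughly half, and the priority queue smooths the remainder across windows.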

What Comes Next

In Section 10.2: Structured Output & Tool Integration, we move beyond free-form text: constraining models to produce machine-readable output and letting them call external functions.

References & Further Reading
Provider Documentation

OpenAI. (2024). Chat Completions API Reference.

The definitive reference for the most widely adopted LLM API. Covers message formats, streaming, function calling, and all model parameters. Essential reading for anyone building on the OpenAI ecosystem or any OpenAI-compatible endpoint.

🔧 Tool

Anthropic. (2024). Messages API Documentation.

Documents Anthropic's Messages API, including the system prompt as a top-level parameter, content blocks, and prompt caching. Useful for understanding how Anthropic's design choices differ from the OpenAI format.

🔧 Tool

Google. (2024). Gemini API Documentation.

Covers Google's Gemini models, including multimodal inputs, grounding with Google Search, and the generationConfig parameter structure. Important for teams evaluating multi-provider strategies.

🔧 Tool
Infrastructure & Deployment

vLLM Project. (2024). vLLM: Easy, Fast, and Cheap LLM Serving.

Documentation for the leading open-source LLM serving engine. Explains PagedAttention, continuous batching, and the OpenAI-compatible API endpoint that makes self-hosted models drop-in replacements for cloud APIs.

🔧 Tool

AWS. (2024). Amazon Bedrock User Guide.

Covers AWS's managed LLM service, including model selection, guardrails, and integration with other AWS services. Relevant for enterprise teams already invested in the AWS ecosystem.

🔧 Tool

Microsoft. (2024). Azure OpenAI Service Documentation.

Details Azure's hosted OpenAI models with enterprise features like content filtering, private networking, and managed identity authentication. Key reference for organizations with Azure compliance requirements.

🔧 Tool

BerriAI. (2024). LiteLLM: Call 100+ LLM APIs Using the OpenAI Format.

Open-source library that unifies 100+ LLM providers behind the OpenAI SDK interface. Enables provider-agnostic code, automatic failover, and cost tracking across multiple APIs.

🔧 Tool