Part IX: Safety & Strategy
Chapter 33: Strategy, Product & ROI

Economic Design of LLM Systems

"The cheapest token is the one you never generate."

— Compass, A Methodical, Token-Thrifty AI Agent
Big Picture

LLM costs scale with usage, and without deliberate economic design, a successful product can become unprofitable overnight. This section provides the engineering patterns that control LLM economics: token budgeting at request, user, and feature levels; cascade architectures that route queries to the cheapest adequate model; semantic caching that eliminates redundant inference; and prompt optimization that reduces token counts without sacrificing quality. These techniques connect the inference optimization from Chapter 9 to the ROI measurement from Section 33.3, turning cost management from a reactive budget exercise into a proactive engineering discipline.

Prerequisites

This section builds on ROI measurement from Section 33.3, compute planning from Section 33.5, evaluation and observability from Chapter 29, and inference optimization from Chapter 9. Familiarity with LLM API pricing models is helpful.

Token consumption is the primary cost driver for LLM applications. Without explicit budgets at the request, user, and feature level, costs grow unpredictably as usage evolves.

1. Token Budgeting Strategies

Token consumption is the primary cost driver for LLM applications. Without explicit budgets, costs grow unpredictably as users discover new use cases and usage patterns evolve. A token budgeting system allocates consumption limits at three levels: per-request (preventing a single query from consuming an entire day's budget), per-user (ensuring equitable access across the organization), and per-feature (allowing product teams to control costs for specific capabilities).

The per-request budget sets a hard ceiling on the total tokens (input plus output) for any single interaction. This prevents pathological cases: a user pasting a 100-page document into the chat, or an agent entering a loop that generates thousands of tool calls. The per-user budget resets on a daily or monthly cycle and tracks cumulative consumption. The per-feature budget lets product managers allocate a fixed token envelope to each product surface (the customer support chatbot gets 10M tokens/day, the code review assistant gets 5M tokens/day) and receive alerts when consumption approaches the limit.

# Example: Three-tier token budget enforcement
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class BudgetConfig:
    max_tokens_per_request: int = 32_000
    max_tokens_per_user_daily: int = 500_000
    max_tokens_per_feature_daily: int = 10_000_000

class TokenBudgetEnforcer:
    """Enforces token budgets at request, user, and feature levels."""

    def __init__(self, redis_client, config: BudgetConfig):
        self.redis = redis_client
        self.config = config

    def check_budget(
        self,
        user_id: str,
        feature_id: str,
        estimated_tokens: int,
    ) -> dict:
        """Check all three budget tiers before allowing a request."""
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")

        # Tier 1: Per-request limit
        if estimated_tokens > self.config.max_tokens_per_request:
            return {
                "allowed": False,
                "reason": f"Request exceeds per-request limit "
                          f"({estimated_tokens:,} > {self.config.max_tokens_per_request:,})",
            }

        # Tier 2: Per-user daily limit
        user_key = f"budget:user:{user_id}:{today}"
        user_used = int(self.redis.get(user_key) or 0)
        if user_used + estimated_tokens > self.config.max_tokens_per_user_daily:
            return {
                "allowed": False,
                "reason": f"User daily budget exhausted "
                          f"({user_used:,} used of {self.config.max_tokens_per_user_daily:,})",
            }

        # Tier 3: Per-feature daily limit
        feature_key = f"budget:feature:{feature_id}:{today}"
        feature_used = int(self.redis.get(feature_key) or 0)
        if feature_used + estimated_tokens > self.config.max_tokens_per_feature_daily:
            return {
                "allowed": False,
                "reason": f"Feature '{feature_id}' daily budget exhausted",
            }

        return {
            "allowed": True,
            "user_remaining": self.config.max_tokens_per_user_daily - user_used,
        }

    def record_usage(self, user_id: str, feature_id: str, actual_tokens: int):
        """Record actual token consumption after a request completes."""
        today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        pipe = self.redis.pipeline()
        for key in (
            f"budget:user:{user_id}:{today}",
            f"budget:feature:{feature_id}:{today}",
        ):
            pipe.incrby(key, actual_tokens)
            pipe.expire(key, 86400 * 2)  # expire after 2 days
        pipe.execute()
Code Fragment 33.8.1: Example: Three-tier token budget enforcement
Fun Fact

A SaaS startup reduced their monthly LLM spend from $47,000 to $8,200 by implementing a three-tier cascade: 70% of queries went to GPT-4o-mini ($0.15/M tokens), 25% to Claude Sonnet ($3/M tokens), and only 5% to GPT-4o ($5/M tokens). Customer satisfaction scores stayed flat. The CEO's summary: "We were serving filet mignon to everyone, including people who just wanted a sandwich."

2. Cascade Design: Routing Queries by Complexity

Not every query deserves the most expensive model. A cascade (or tiered routing) system classifies incoming queries by complexity and routes them to the cheapest model capable of handling them. Simple queries ("What are your business hours?") go to a small, fast model. Complex queries requiring multi-step reasoning go to a large frontier model. This approach can reduce costs by 60 to 80% while maintaining quality on hard queries.

The router itself must be cheap to run. Common approaches include: a lightweight classifier trained on historical query/complexity pairs, a keyword or regex-based heuristic, or a small LLM call that evaluates complexity before routing. The best cascades also include a fallback escalation mechanism: if the small model's response scores below a confidence threshold, the query is automatically re-routed to the larger model.

# Example: Cost-aware cascade router with confidence-based escalation
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    model_id: str
    cost_per_1k_input: float   # USD
    cost_per_1k_output: float  # USD
    max_complexity: str        # "simple", "moderate", "complex"

TIERS = [
    ModelTier("small", "gpt-4o-mini", 0.00015, 0.0006, "simple"),
    ModelTier("medium", "claude-sonnet-4", 0.003, 0.015, "moderate"),
    ModelTier("large", "claude-opus-4", 0.015, 0.075, "complex"),
]
TIERS_BY_NAME = {t.name: t for t in TIERS}

class CascadeRouter:
    """Routes queries to the cheapest capable model."""

    def __init__(self, classifier, llm_clients: dict):
        self.classifier = classifier
        self.clients = llm_clients

    async def route(self, query: str, context: dict) -> dict:
        # Step 1: Classify query complexity (cheap operation)
        complexity = self.classifier.classify(query)
        tier = self._select_tier(complexity)

        # Step 2: Call the selected model
        response = await self.clients[tier.model_id].complete(query, context)

        # Step 3: Check confidence; escalate if needed
        if response["confidence"] < 0.7 and tier.name != "large":
            next_tier = self._escalate(tier)
            response = await self.clients[next_tier.model_id].complete(query, context)
            response["escalated"] = True
            response["original_tier"] = tier.name
            response["final_tier"] = next_tier.name
        else:
            response["escalated"] = False
            response["final_tier"] = tier.name

        # Step 4: Record cost
        response["cost_usd"] = self._compute_cost(
            TIERS_BY_NAME[response["final_tier"]],
            response["input_tokens"],
            response["output_tokens"],
        )
        return response

    def _select_tier(self, complexity: str) -> ModelTier:
        complexity_order = ["simple", "moderate", "complex"]
        for tier in TIERS:
            if complexity_order.index(complexity) <= complexity_order.index(tier.max_complexity):
                return tier
        return TIERS[-1]

    def _escalate(self, current: ModelTier) -> ModelTier:
        idx = [t.name for t in TIERS].index(current.name)
        return TIERS[min(idx + 1, len(TIERS) - 1)]

    def _compute_cost(self, tier: ModelTier, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens / 1000 * tier.cost_per_1k_input
                + output_tokens / 1000 * tier.cost_per_1k_output)
Code Fragment 33.8.2: Example: Cost-aware cascade router with confidence-based escalation
Library Shortcut: LiteLLM for Cost-Aware Routing

The same result in a few lines with LiteLLM Router:


# pip install litellm
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "simple", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "complex", "litellm_params": {"model": "gpt-4o"}},
    ],
    routing_strategy="cost-based-routing",
)
# LiteLLM handles fallback, retry, and cost tracking automatically:
response = router.completion(model="simple", messages=[{"role": "user", "content": query}])
print(f"Cost: ${response._hidden_params['response_cost']:.6f}")
Code Fragment 33.8.3: Cost-based routing with LiteLLM Router
Key Insight

The cascade router pays for itself within days. In a typical enterprise deployment, 60 to 70% of queries are simple (FAQ lookups, status checks, basic reformatting). Routing these to a model that costs 100x less than the frontier model saves far more than the overhead of the routing classifier. The key metric to track is the escalation rate: if more than 30% of queries escalate from small to large, your classifier needs retraining or your complexity boundaries need adjustment.
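A sketch of tracking that metric, assuming routing logs where each entry carries the `escalated` flag set by a router like the one above (the log shape is illustrative):

```python
# Hypothetical sketch: computing the escalation rate from routing logs.
def escalation_rate(routing_logs: list[dict]) -> float:
    """Fraction of requests that escalated to a larger model."""
    if not routing_logs:
        return 0.0
    escalated = sum(1 for r in routing_logs if r.get("escalated"))
    return escalated / len(routing_logs)

# Example: 3 of 10 requests escalated.
logs = [{"escalated": False}] * 7 + [{"escalated": True}] * 3
rate = escalation_rate(logs)
# A rate above 0.30 suggests the complexity classifier needs retraining
# or the complexity boundaries need adjustment.
needs_retraining = rate > 0.30
```

Computed over a rolling window (daily or weekly), this single number tells you whether the cheap tier is pulling its weight.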

3. Semantic Caching

Semantic caching stores LLM responses keyed by the semantic meaning of the query rather than its exact text. When a new query arrives, the system computes its embedding and searches the cache for similar previous queries. If a match exceeds a similarity threshold, the cached response is returned without calling the LLM. This is particularly effective for applications with repetitive queries (customer support, FAQ bots, internal help desks) where the same questions appear with minor phrasing variations.

Cache invalidation is the hard part. Cached responses become stale when the underlying knowledge changes (a product price update, a new policy, a corrected fact). Strategies include: time-based TTL (cache expires after N hours), event-based invalidation (a webhook from the CMS triggers cache flush), and scope-based invalidation (only clear cache entries whose source documents were modified).

# Example: Semantic cache with similarity threshold and TTL
import time
import numpy as np
from dataclasses import dataclass

@dataclass
class CacheEntry:
    query: str
    query_embedding: list[float]
    response: str
    model: str
    created_at: float
    ttl_seconds: int
    source_doc_ids: list[str]

class SemanticCache:
    """Cache LLM responses by semantic similarity."""

    def __init__(self, similarity_threshold: float = 0.95, default_ttl: int = 3600):
        self.threshold = similarity_threshold
        self.default_ttl = default_ttl
        self.entries: list[CacheEntry] = []

    def lookup(self, query_embedding: list[float]) -> str | None:
        """Find a cached response for a semantically similar query."""
        now = time.time()
        query_vec = np.array(query_embedding)

        best_match = None
        best_score = 0.0

        for entry in self.entries:
            # Check TTL
            if now - entry.created_at > entry.ttl_seconds:
                continue
            # Compute cosine similarity
            entry_vec = np.array(entry.query_embedding)
            score = float(
                np.dot(query_vec, entry_vec)
                / (np.linalg.norm(query_vec) * np.linalg.norm(entry_vec))
            )
            if score > best_score:
                best_score = score
                best_match = entry

        if best_match and best_score >= self.threshold:
            return best_match.response
        return None

    def store(self, query: str, embedding: list[float], response: str,
              model: str, source_doc_ids: list[str] | None = None):
        self.entries.append(CacheEntry(
            query=query,
            query_embedding=embedding,
            response=response,
            model=model,
            created_at=time.time(),
            ttl_seconds=self.default_ttl,
            source_doc_ids=source_doc_ids or [],
        ))

    def invalidate_by_source(self, doc_id: str):
        """Remove all cache entries that depend on a specific source document."""
        self.entries = [
            e for e in self.entries if doc_id not in e.source_doc_ids
        ]

    def get_stats(self) -> dict:
        now = time.time()
        active = [e for e in self.entries if now - e.created_at <= e.ttl_seconds]
        return {
            "total_entries": len(self.entries),
            "active_entries": len(active),
            "expired_entries": len(self.entries) - len(active),
        }
Code Fragment 33.8.4: Example: Semantic cache with similarity threshold and TTL

In production, the in-memory list shown above is replaced by a vector database (Qdrant, Pinecone, or Redis with vector search) that scales to millions of cached queries. The cost savings can be dramatic: a customer support bot with a 40% cache hit rate at a 0.95 similarity threshold avoids roughly 40% of its LLM API calls (the embedding and lookup overhead is comparatively small), often paying for the vector database several times over.
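The savings arithmetic is simple enough to sketch; the volume, hit rate, and per-request cost below are illustrative assumptions, not figures from a specific deployment:

```python
# Sketch: estimating monthly savings from a semantic cache.
# Every cache hit avoids one LLM call, so savings scale linearly
# with hit rate (ignoring the small embedding/lookup overhead).
def monthly_cache_savings(requests_per_month: int, hit_rate: float,
                          avg_llm_cost_per_request: float) -> float:
    return requests_per_month * hit_rate * avg_llm_cost_per_request

# 1M requests/month, 40% hit rate, $0.01 average LLM cost per request:
savings = monthly_cache_savings(1_000_000, 0.40, 0.01)  # ≈ $4,000/month
```

Compare that figure against the monthly cost of the vector database and embedding calls to get the cache's net benefit.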

Library Shortcut: GPTCache for Semantic Caching

The same result in a few lines with GPTCache:


# pip install gptcache
from gptcache import cache
from gptcache.adapter import openai as cached_openai
from gptcache.embedding import Onnx
from gptcache.similarity_evaluation import OnnxModelEvaluation

onnx = Onnx()
cache.init(embedding_func=onnx.to_embeddings,
           similarity_evaluation=OnnxModelEvaluation())
cache.set_openai_key()
# All subsequent calls use semantic caching automatically:
response = cached_openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is your return policy?"}],
)
# Similar queries hit cache; dissimilar queries call the API
Code Fragment 33.8.5: Semantic caching with GPTCache

4. Prompt Optimization for Cost

Prompt design directly affects cost. A system prompt with 2,000 tokens is prepended to every request; at 100,000 requests per day, that system prompt alone accounts for 200 million input tokens daily. Systematic prompt optimization can reduce costs by 30 to 50% without degrading quality.
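A quick sketch of that arithmetic, assuming an illustrative input price of $3 per million tokens (check your provider's actual pricing):

```python
# The daily token footprint and cost of a static system prompt.
system_prompt_tokens = 2_000
requests_per_day = 100_000
price_per_million_input = 3.00  # assumed price, USD per 1M input tokens

daily_prompt_tokens = system_prompt_tokens * requests_per_day  # 200,000,000
daily_prompt_cost = daily_prompt_tokens / 1_000_000 * price_per_million_input
# 200M input tokens/day from the system prompt alone: $600/day at $3/M,
# or roughly $18K/month before any user or context tokens are counted.
```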

Key optimization techniques include: system prompt compression (rewriting verbose instructions into concise equivalents), few-shot reduction (testing whether 1 or 2 examples achieve the same quality as 5), context window trimming (sending only the most relevant retrieved documents rather than all of them), and output length control (instructing the model to respond concisely when verbose responses are unnecessary).

# Example: Prompt cost analyzer and optimizer
class PromptCostAnalyzer:
    """Analyze and optimize prompt costs across an LLM application."""

    def __init__(self, tokenizer, cost_per_1k_input: float):
        self.tokenizer = tokenizer
        self.cost_per_1k = cost_per_1k_input

    def analyze(self, messages: list[dict]) -> dict:
        """Break down token costs by message role."""
        breakdown = {}
        total = 0
        for msg in messages:
            role = msg["role"]
            tokens = len(self.tokenizer.encode(msg["content"]))
            breakdown[role] = breakdown.get(role, 0) + tokens
            total += tokens

        cost = total / 1000 * self.cost_per_1k
        return {
            "token_breakdown": breakdown,
            "total_tokens": total,
            "estimated_cost_usd": cost,
            "system_prompt_pct": breakdown.get("system", 0) / max(total, 1) * 100,
        }

    def suggest_optimizations(self, messages: list[dict]) -> list[str]:
        """Suggest cost reduction strategies based on prompt structure."""
        analysis = self.analyze(messages)
        suggestions = []

        if analysis["system_prompt_pct"] > 40:
            suggestions.append(
                f"System prompt is {analysis['system_prompt_pct']:.0f}% of total tokens. "
                f"Consider compressing instructions or moving static content to a cached prefix."
            )

        # Count few-shot examples
        user_msgs = [m for m in messages if m["role"] == "user"]
        assistant_msgs = [m for m in messages if m["role"] == "assistant"]
        if len(user_msgs) > 3 and len(assistant_msgs) > 2:
            suggestions.append(
                f"Found {len(assistant_msgs)} few-shot examples. "
                f"Test whether 1 or 2 examples achieve comparable quality."
            )

        # Check for verbose context
        for msg in messages:
            if msg["role"] == "user" and len(self.tokenizer.encode(msg["content"])) > 4000:
                suggestions.append(
                    "User message exceeds 4,000 tokens. Consider trimming retrieved "
                    "context to top-3 documents or summarizing before inclusion."
                )
                break

        return suggestions
Code Fragment 33.8.6: Example: Prompt cost analyzer and optimizer
Warning

Prompt optimization must be validated against quality metrics. A system prompt compressed from 2,000 to 500 tokens might save 75% on input costs, but if it increases hallucination rate from 2% to 8%, the net effect is negative. Always run optimized prompts through your evaluation pipeline before deploying them to production. The best practice is to A/B test prompt changes, measuring both cost reduction and quality impact simultaneously.
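A minimal sketch of such an A/B comparison, assuming each variant's results carry a `cost_usd` value and a `has_error` flag from your evaluation pipeline (all names and numbers here are illustrative):

```python
# Compare two prompt variants on both cost and quality.
def compare_variants(results_a: list[dict], results_b: list[dict]) -> dict:
    def summarize(results):
        n = len(results)
        return {
            "avg_cost_usd": sum(r["cost_usd"] for r in results) / n,
            "error_rate": sum(1 for r in results if r["has_error"]) / n,
        }
    a, b = summarize(results_a), summarize(results_b)
    return {
        "cost_delta_pct": (b["avg_cost_usd"] - a["avg_cost_usd"]) / a["avg_cost_usd"] * 100,
        "error_rate_delta": b["error_rate"] - a["error_rate"],
    }

# Variant B (compressed prompt) cuts cost 75% but raises errors 2% -> 8%:
report = compare_variants(
    [{"cost_usd": 0.020, "has_error": False}] * 98 + [{"cost_usd": 0.020, "has_error": True}] * 2,
    [{"cost_usd": 0.005, "has_error": False}] * 92 + [{"cost_usd": 0.005, "has_error": True}] * 8,
)
# A variant only "wins" if cost_delta_pct is negative AND
# error_rate_delta stays within an acceptable tolerance.
```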

5. Evaluation Cost Management

LLM-as-judge evaluation pipelines (see Chapter 29) themselves consume tokens. Running a GPT-4o judge over every production response doubles your token costs. Evaluation cost management applies the same economic thinking to the evaluation pipeline: use sampling (evaluate 10% of responses instead of 100%), use cheaper judge models for initial screening (a small model flags potentially bad responses, a large model evaluates only the flagged ones), and set explicit evaluation budgets per feature.

# Example: Evaluation budget manager with stratified sampling
import random

class EvalBudgetManager:
    """Manage evaluation costs with sampling and cascade judging."""

    def __init__(self, daily_eval_budget_usd: float, judge_cost_per_eval: float):
        self.daily_budget = daily_eval_budget_usd
        self.cost_per_eval = judge_cost_per_eval
        self.max_evals_per_day = int(daily_eval_budget_usd / judge_cost_per_eval)
        self.evals_today = 0

    def should_evaluate(self, response: dict) -> str:
        """Decide whether and how to evaluate a response."""
        if self.evals_today >= self.max_evals_per_day:
            return "skip"  # budget exhausted

        # Always evaluate: high-value features, new prompt versions, flagged responses
        if response.get("feature_id") in ("customer_support", "legal_review"):
            self.evals_today += 1
            return "full_eval"  # use the expensive judge

        if response.get("prompt_version_is_canary"):
            self.evals_today += 1
            return "full_eval"

        # Sample 10% of remaining traffic
        if random.random() < 0.10:
            self.evals_today += 1
            return "sampled_eval"

        return "skip"

    def get_status(self) -> dict:
        return {
            "evals_today": self.evals_today,
            "max_evals": self.max_evals_per_day,
            "budget_remaining_usd": (self.max_evals_per_day - self.evals_today) * self.cost_per_eval,
        }
Code Fragment 33.8.7: Example: Evaluation budget manager with stratified sampling

6. Build vs. Buy: The Economic Perspective

The build-vs.-buy decision (covered architecturally in Section 33.6) has a distinct economic dimension. A self-hosted open-weight model has zero marginal API cost but requires GPU infrastructure, operational expertise, and ongoing maintenance. An API-based model has high marginal cost but zero infrastructure burden. The crossover point depends on your volume: at low volumes, APIs win; at high volumes, self-hosting wins. The breakeven calculation must account for the full cost stack.

For most enterprises, a hybrid strategy is optimal: use APIs for frontier-quality tasks (complex reasoning, creative generation) and self-hosted models for high-volume, lower-complexity tasks (classification, extraction, embeddings). This approach captures the best economics of both worlds while maintaining access to the strongest models for the hardest problems.

# Example: Build vs. buy breakeven analysis configuration
cost_analysis:
  api_model:
    name: "claude-sonnet-4"
    input_cost_per_1k: 0.003
    output_cost_per_1k: 0.015
    avg_input_tokens: 2000
    avg_output_tokens: 500
    cost_per_request: 0.0135  # (2 * 0.003) + (0.5 * 0.015)

  self_hosted_model:
    name: "Llama-3.3-70B"
    gpu_type: "H100"
    gpus_required: 2
    gpu_cost_per_hour: 3.50  # cloud rental
    throughput_requests_per_hour: 800
    cost_per_request: 0.00875  # (3.50 * 2) / 800
    fixed_monthly_cost: 5040  # 2 GPUs * $3.50/hr * 720 hrs

  breakeven:
    requests_per_month_breakeven: 373333
    # Below 373K requests/month: API is cheaper
    # Above 373K requests/month: self-hosted is cheaper
    # Factor in: engineering time for ops (~$15K/month equivalent)
    adjusted_breakeven: 1484000  # accounting for operational overhead
Code Fragment 33.8.8: A breakeven analysis comparing API-based Claude Sonnet 4 at $0.0135 per request against self-hosted Llama 3.3 70B on two H100s at $0.00875 per request. The raw crossover is at 373K requests per month, but adding $15K of monthly engineering operations cost pushes the adjusted breakeven to nearly 1.5 million requests per month.
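The breakeven figures in the configuration can be reproduced in a few lines; the function name is illustrative, and the $15K/month operational overhead is the same assumption stated in the config:

```python
# Breakeven: the monthly request volume where self-hosting's fixed costs
# equal what the same traffic would cost on the API (treating the
# self-hosted cluster's marginal per-request cost as already sunk in
# its fixed monthly cost at the stated capacity).
def breakeven_requests(fixed_monthly_cost: float,
                       api_cost_per_request: float,
                       ops_overhead_monthly: float = 0.0) -> int:
    return round((fixed_monthly_cost + ops_overhead_monthly) / api_cost_per_request)

raw = breakeven_requests(5040, 0.0135)               # ≈ 373,333 requests/month
adjusted = breakeven_requests(5040, 0.0135, 15_000)  # ≈ 1,484,444 requests/month
```

The gap between the two numbers is the real lesson: operational overhead, not GPU rental, usually dominates the self-hosting decision.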

7. Cost Observability

Cost observability means tracking and visualizing LLM spending at the granularity needed to make decisions: cost per query, cost per user, cost per feature, and cost per conversation. Without this visibility, cost optimization is guesswork. The observability stack should include real-time dashboards showing current spend rate, anomaly detection that alerts when a feature's cost suddenly spikes, and historical trend analysis for budget planning.

The implementation builds on the observability infrastructure from Chapter 30. Every LLM response already includes token counts in its metadata. The cost observability layer multiplies these counts by the model's per-token pricing and aggregates the results along the dimensions that matter for your business.

# Example: Cost observability with anomaly detection
from collections import defaultdict
import statistics

class CostObserver:
    """Track and alert on LLM costs across dimensions."""

    def __init__(self, alert_callback):
        self.alert = alert_callback
        self.costs_by_feature: dict[str, list[float]] = defaultdict(list)
        self.costs_by_user: dict[str, float] = defaultdict(float)

    def record(self, feature_id: str, user_id: str, cost_usd: float):
        """Record a cost event and check for anomalies."""
        self.costs_by_feature[feature_id].append(cost_usd)
        self.costs_by_user[user_id] += cost_usd

        # Anomaly detection: alert if a single request costs 10x the feature average
        feature_costs = self.costs_by_feature[feature_id]
        if len(feature_costs) > 100:
            avg = statistics.mean(feature_costs[-100:])
            if cost_usd > avg * 10:
                self.alert(
                    level="warning",
                    message=f"Cost anomaly: feature '{feature_id}' request cost "
                            f"${cost_usd:.4f} (10x the average ${avg:.4f})",
                    metadata={
                        "feature_id": feature_id,
                        "user_id": user_id,
                        "cost": cost_usd,
                        "average": avg,
                    },
                )

    def get_dashboard_data(self) -> dict:
        """Generate dashboard metrics."""
        return {
            "cost_by_feature": {
                f: {"total": sum(costs), "avg": statistics.mean(costs), "count": len(costs)}
                for f, costs in self.costs_by_feature.items()
            },
            "top_users_by_cost": sorted(
                self.costs_by_user.items(), key=lambda x: x[1], reverse=True
            )[:20],
            "total_cost_usd": sum(
                sum(costs) for costs in self.costs_by_feature.values()
            ),
        }
Code Fragment 33.8.9: Example: Cost observability with anomaly detection

8. ROI Framework: Measuring LLM Value

Cost optimization without value measurement is incomplete. The ROI framework connects LLM costs to business outcomes: time saved (measured in hours of human labor replaced or accelerated), quality improvement (measured by error rates, customer satisfaction, or output consistency), and revenue impact (measured by conversion rate changes, deal velocity, or new revenue streams enabled by LLM features).

The formula is straightforward: ROI = (Value Generated - Total Cost) / Total Cost. The challenge lies in measuring value accurately. Time savings are the easiest to quantify: if a code review assistant reduces review time from 30 minutes to 10 minutes, the value per review is 20 minutes of engineer time multiplied by the loaded hourly rate. Quality improvements require A/B testing: compare error rates with and without the LLM feature. Revenue impact requires attribution models that isolate the LLM's contribution from other factors.

Real-World Scenario: ROI Calculation for a Support Chatbot

Who: A VP of Customer Success and a finance business partner at a mid-size SaaS company handling 20,000 support tickets per month

Situation: The company deployed an LLM-powered support chatbot to deflect routine tickets (password resets, billing questions, feature how-tos) and provide suggested responses for escalated cases handled by human agents.

Problem: The CFO questioned the $14,000/month spend ($8,500 LLM API, $500 vector database, $5,000 equivalent engineering maintenance) and asked for a rigorous ROI analysis before approving the budget for the next fiscal year.

Decision: The VP of Customer Success partnered with finance to measure two value streams: ticket deflection (45% of tickets handled without a human agent, saving $12 per deflected ticket) and agent acceleration (15% faster resolution on escalated tickets where agents used chatbot suggestions). They ran a controlled A/B test over four weeks to validate the deflection rate.

Result: Monthly value was $114,000 ($108,000 from 9,000 deflected tickets plus $6,000 from faster escalated resolution). ROI: ($114,000 - $14,000) / $14,000 = 714%. Even with conservative estimates (halving the deflection value), ROI exceeded 300%. The CFO approved a two-year budget commitment.

Lesson: LLM ROI cases are strongest when you can measure a concrete per-unit cost displacement (such as cost per ticket) rather than relying on qualitative productivity claims.
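Plugging the scenario's figures into the ROI formula from this section:

```python
# ROI for the support chatbot scenario, using the numbers from the case above.
deflected_tickets = 9_000
value_per_deflection = 12.0        # USD saved per deflected ticket
agent_acceleration_value = 6_000.0  # USD/month from faster escalated resolution
monthly_cost = 14_000.0             # API + vector DB + maintenance

monthly_value = deflected_tickets * value_per_deflection + agent_acceleration_value
roi = (monthly_value - monthly_cost) / monthly_cost
# monthly_value = $114,000; roi ≈ 7.14, i.e. roughly 714%
```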

Lab: Cost-Aware Routing with Budget Enforcement

This lab ties together the economic patterns from this section into a working system. The implementation combines cascade routing, token budgets, semantic caching, and cost tracking into a single middleware layer that wraps any LLM application.

# Lab: Complete cost-aware LLM middleware
class CostAwareMiddleware:
    """
    Middleware that enforces budgets, checks cache, routes to the cheapest
    capable model, and tracks costs.
    """

    def __init__(
        self,
        budget: TokenBudgetEnforcer,
        cache: SemanticCache,
        router: CascadeRouter,
        observer: CostObserver,
        embedder,
    ):
        self.budget = budget
        self.cache = cache
        self.router = router
        self.observer = observer
        self.embedder = embedder

    async def process(
        self,
        query: str,
        user_id: str,
        feature_id: str,
        context: dict,
    ) -> dict:
        # Step 1: Estimate tokens and check budget
        estimated_tokens = len(query.split()) * 2  # rough estimate
        budget_check = self.budget.check_budget(user_id, feature_id, estimated_tokens)
        if not budget_check["allowed"]:
            return {
                "response": "Your token budget has been reached for today. "
                            "Please try again tomorrow or contact your administrator.",
                "cost_usd": 0.0,
                "source": "budget_limit",
            }

        # Step 2: Check semantic cache
        embedding = await self.embedder.embed(query)
        cached = self.cache.lookup(embedding)
        if cached:
            return {
                "response": cached,
                "cost_usd": 0.0,
                "source": "cache",
            }

        # Step 3: Route to cheapest capable model
        result = await self.router.route(query, context)

        # Step 4: Cache the response
        self.cache.store(
            query=query,
            embedding=embedding,
            response=result["response"],
            model=result["final_tier"],
        )

        # Step 5: Record budget usage and cost
        actual_tokens = result["input_tokens"] + result["output_tokens"]
        self.budget.record_usage(user_id, feature_id, actual_tokens)
        self.observer.record(feature_id, user_id, result["cost_usd"])

        return {
            "response": result["response"],
            "cost_usd": result["cost_usd"],
            "source": f"model:{result['final_tier']}",
            "escalated": result.get("escalated", False),
        }
Code Fragment 33.8.10: Lab: Complete cost-aware LLM middleware
Key Insight

Economic design is not an afterthought. The most successful LLM deployments build cost awareness into the architecture from day one. Retrofitting budgets, caching, and routing onto a system that was designed without them is significantly harder than incorporating these patterns during initial development. The cost-aware middleware pattern shown in this lab adds minimal latency (one embedding computation and one cache lookup) while potentially saving 50 to 70% of LLM API costs through caching and cascade routing combined.

Exercises

Exercise 33.8.1: Token Budget Design Conceptual

Design a token budgeting strategy for a company with 500 employees using an internal LLM assistant. The company spends $30,000/month on LLM APIs. Allocate budgets across three user tiers (basic, power user, executive) and three features (code assistant, document Q&A, email drafting).

Answer Sketch

At $30K/month and an average cost of $0.01 per 1K tokens, the total budget is approximately 3 billion tokens/month, or about 100M tokens/day. One workable allocation: basic users (400) get 150K tokens/day each (60M/day), power users (80) get 375K tokens/day each (30M/day), and executives (20) get 500K tokens/day each (10M/day). Feature budgets: code assistant 50% ($15K), document Q&A 30% ($9K), email drafting 20% ($6K). Set per-request maximums at 32K tokens to prevent abuse. Alert at 80% of daily user budgets.

Exercise 33.8.2: Cascade Router Evaluation Coding

Implement a cascade router that classifies queries into three complexity tiers using a simple heuristic (query length, presence of technical terms, question type). Evaluate it on a sample dataset and compute the cost savings compared to routing everything to the largest model.

Answer Sketch

Build a classifier that uses query length (under 50 tokens = simple), presence of reasoning keywords ("analyze", "compare", "explain why" = complex), and question type (yes/no = simple, open-ended = moderate/complex). Test on 1,000 sample queries. Compute cost as: sum of (tier cost * queries routed to tier). Compare to baseline of all queries going to the large model. Expected savings: 55 to 65% for a typical enterprise query distribution.

Exercise 33.8.3: ROI Model Conceptual

Build an ROI model for deploying an LLM-powered code review assistant. Include all cost components (API, infrastructure, engineering time) and all value components (time saved, bug reduction, knowledge sharing). Calculate the breakeven point in months.

Answer Sketch

Costs: API ($5K/month), infrastructure ($1K/month), 0.5 FTE maintenance ($7.5K/month), initial setup (1 FTE for 2 months = $30K). Monthly cost: $13.5K. Value: 50 engineers saving 20 min/review, 3 reviews/day = 50 hours/day saved. At $75/hour loaded rate = $3,750/day = $75K/month. Bug reduction: 15% fewer production bugs, each costing $2K average = $6K/month savings. Total monthly value: $81K. ROI: 500%. Breakeven: month 1 (even accounting for setup costs, payback period is under 2 months).

Research Frontier

Autonomous cost optimization agents are beginning to emerge: systems that monitor real-time token consumption, model performance, and latency, then automatically adjust cascade thresholds, cache policies, and model routing without human intervention. Combined with inference-time compute markets (where providers bid to serve requests at the lowest cost for a given quality level), these approaches may fundamentally change how organizations budget for and consume LLM compute.

What Comes Next

This concludes the Strategy, Product, and ROI chapter. Continue to Part X: Frontiers, which looks beyond today's state of the art to emerging architectures and AI's broader societal implications.

References & Further Reading
Key References

Chen, L., Zaharia, M., Zou, J. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv:2305.05176.

Introduces strategies for reducing LLM API costs, including model cascading, prompt adaptation, and caching. The foundational paper for the cost optimization techniques discussed throughout this section.

📄 Paper

Anthropic. (2025). "Prompt Caching Documentation." Anthropic Docs. docs.anthropic.com.

Official documentation for Anthropic's prompt caching feature, which reduces costs and latency for repeated prompt prefixes. Directly relevant to the caching strategies discussed in this section.

🛠 Tool

OpenAI. (2025). "API Pricing and Token Usage." OpenAI Platform. platform.openai.com.

OpenAI's pricing page with token-level cost breakdowns for each model tier. Essential reference for the cost calculations and model selection decisions in this section.

🛠 Tool

a16z. (2024). "The Economics of AI Infrastructure." Andreessen Horowitz Research.

Analyzes the economics of AI infrastructure at scale, including how costs evolve as applications grow from prototype to production. Provides strategic context for the tactical optimizations discussed in this section.

📄 Paper

Bang, Y. et al. (2023). "A Survey on LLM-as-a-Judge." arXiv.

Surveys the use of LLMs as automated evaluators, relevant to the quality-monitoring aspect of cost optimization. Understanding LLM-as-judge reliability helps teams balance evaluation costs against human annotation budgets.

📄 Paper

vLLM Project. (2025). "vLLM: Easy, Fast, and Cheap LLM Serving." github.com/vllm-project/vllm.

Documentation for vLLM, a high-throughput inference engine that uses PagedAttention for efficient memory management. The primary tool for teams self-hosting models who want to maximize throughput per GPU dollar.

🛠 Tool

GPTCache. (2024). "Semantic Cache for LLMs." github.com/zilliztech/GPTCache.

Open-source semantic caching library that stores and retrieves LLM responses based on query similarity. A practical implementation of the semantic caching optimization discussed in this section.

🛠 Tool