Part 11: From Idea to AI Product
Chapter 38 · Section 38.3

Lock-in, Portability, and Multi-Provider Strategy

"The most dangerous lock-in is not in your contract. It is in your team's muscle memory."

Compass, Vendor Wary AI Agent
Big Picture

Lock-in in the LLM era is fundamentally different from traditional vendor lock-in. When your product depends on a cloud database, lock-in comes from proprietary query languages, data formats, and migration friction. When your product depends on an LLM, the technical switching cost is surprisingly low (most providers accept nearly identical REST payloads), but the cognitive switching cost is high. Your team learns one provider's quirks, optimizes prompts for one model's personality, and builds institutional knowledge around one set of failure modes. This section explores why cognitive lock-in matters more than contract lock-in, how to plan for AI continuity, and how to architect your system so that switching providers is a configuration change rather than a rewrite.

Prerequisites

This section assumes familiarity with LLM APIs (Chapter 10), including request/response formats and authentication patterns. Understanding of evaluation pipelines (Chapter 29) will help you appreciate the portability checklist at the end. The AI strategy discussion in Chapter 33 provides useful background on build-vs-buy decisions that intersect with provider selection.

A diagram showing a central application connected to multiple LLM providers through an abstraction layer, with arrows indicating portability between providers.
Figure 38.3.1: A well-architected AI product maintains connections to multiple providers through a clean abstraction boundary, keeping switching costs low even while committing deeply to one provider for day-to-day operations.

1. Vendor Lock-in vs. Cognitive Lock-in

The traditional cloud computing playbook warns against vendor lock-in: proprietary APIs, non-standard data formats, egress fees, and contractual traps that make migration expensive. These concerns apply to LLM providers too, but they are not the primary risk. The primary risk is something subtler.

Key Insight

Switching LLM providers is technically easy but cognitively expensive. Your team has spent months learning one model's quirks: which prompts produce reliable structured output, where it hallucinates, how it handles edge cases in your domain. That institutional knowledge is not transferable. When you switch providers, you restart the learning curve. The most effective lock-in mitigation is not abstraction layers (though those help); it is a comprehensive evaluation suite that lets you test any new provider against your quality bar in hours, not weeks.

1.1 What Traditional Lock-in Looks Like

In conventional cloud infrastructure, lock-in manifests through concrete technical barriers. Your application uses a proprietary database query language that no other provider supports. Your data sits behind egress fees that make extraction prohibitively expensive. Your deployment scripts assume a specific orchestration system. Each of these barriers adds measurable switching cost, and the total cost grows linearly with the depth of integration.

1.2 Cognitive Lock-in: The Hidden Dependency

With LLM providers, the technical barriers are relatively thin. Most providers offer OpenAI-compatible endpoints, accept similar JSON payloads, and return responses in nearly identical formats. The real lock-in is cognitive: it lives in your team's habits, mental models, and accumulated intuitions about how a specific model behaves.

Consider what happens when a team spends six months building a product on GPT-4o. They learn that GPT-4o responds well to a particular system prompt structure. They discover that it handles ambiguous user inputs in a specific way. They develop intuitions about when to set temperature to 0.0 versus 0.7. They accumulate a library of prompt patterns that exploit GPT-4o's strengths and work around its weaknesses. None of this knowledge transfers cleanly to Claude, Gemini, or Llama.

Traditional Lock-in vs. Cognitive Lock-in in AI Products

| Dimension | Traditional Vendor Lock-in | Cognitive Lock-in (LLM) |
|---|---|---|
| Primary barrier | Proprietary APIs, data formats, egress fees | Team habits, prompt patterns, model-specific intuitions |
| Switching cost type | Engineering time to rewrite integrations | Relearning model behaviour; re-tuning prompts; re-running evaluations |
| Visibility | Measurable (lines of code, data volume, contract terms) | Invisible until you attempt a switch |
| Growth rate | Linear with integration depth | Exponential with team experience on one provider |
| Mitigation | Standard APIs, open formats, multi-cloud architecture | Provider-agnostic evaluation suites, prompt externalization, regular cross-provider testing |

Two growth curves shown as vines growing up a wall. The left vine representing traditional vendor lock-in grows slowly and steadily. The right vine representing cognitive lock-in starts thin but then explodes exponentially, wrapping around a developer's brain like a friendly but inescapable octopus. A calendar shows time passing from one month to six months. At the bottom, a pair of scissors that could easily cut the left vine is comically too small for the right one.
Figure 38.3.2: Traditional vendor lock-in grows linearly with integration depth, but cognitive lock-in grows exponentially as teams accumulate model-specific intuitions, prompt patterns, and failure-mode knowledge that resist transfer to other providers.

Key Insight

Cognitive lock-in grows exponentially, not linearly. Every week your team spends on a single provider deepens their model-specific intuitions. After three months, switching feels easy on paper (the API call is nearly identical) but terrifying in practice (every prompt needs re-testing, every edge case needs re-discovery, every quality threshold needs re-calibration). The antidote is not to avoid commitment; it is to invest in provider-agnostic evaluation infrastructure so you can measure what a switch would actually cost. We cover this in the portability checklist in Section 6.

2. AI Continuity Planning

Business continuity planning is standard practice for databases, payment processors, and cloud infrastructure. Surprisingly few teams apply the same discipline to their LLM dependencies, even though the risks are at least as severe. Your LLM provider can change pricing overnight, deprecate the model version you depend on, suffer an extended outage, or alter model behaviour through a silent update.

2.1 The Five Continuity Risks

Every AI product that depends on an external model provider faces five categories of continuity risk. Understanding these risks is the first step toward mitigating them.

  1. Pricing shocks. Your provider doubles the per-token cost. This has happened multiple times in the industry. If your unit economics depend on a specific price point, a sudden increase can make your product unprofitable overnight. The AI unit economics framework from Section 38.1 should include sensitivity analysis for a 2x and 5x cost increase.
  2. Model deprecation. The specific model version you have optimized your prompts for reaches end-of-life. The replacement model may behave differently, breaking prompts that passed your eval suite. OpenAI deprecated the original GPT-4 in favour of GPT-4 Turbo, then GPT-4o, each time changing behaviour in subtle ways.
  3. Extended outage. Your provider experiences a multi-hour or multi-day outage. If your product has no fallback, your users experience a complete service failure. This is not hypothetical; every major LLM provider has experienced significant outages.
  4. Behavioural drift. The provider updates model weights, safety filters, or post-processing logic without a formal version change. Your prompts produce different outputs. Your eval scores drop. Section 38.4 covers drift detection; here we focus on the continuity planning response.
  5. Terms of service changes. The provider changes its acceptable use policy, data retention practices, or training data policies in ways that conflict with your compliance requirements or your customers' expectations.
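
The sensitivity analysis suggested for pricing shocks can be sketched in a few lines. This is a minimal illustration, not the Section 38.1 framework itself: the `UnitEconomics` fields and the dollar figures are hypothetical placeholders, and the only inputs taken from the text are the 2x and 5x multipliers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitEconomics:
    """Per-request economics. All figures used below are hypothetical."""
    revenue_per_request: float     # dollars earned per request
    llm_cost_per_request: float    # current LLM spend per request
    other_cost_per_request: float  # hosting, support, and other variable costs


def margin_at(econ: UnitEconomics, llm_cost_multiplier: float) -> float:
    """Gross margin (as a fraction of revenue) if LLM cost scales by the multiplier."""
    total_cost = (econ.llm_cost_per_request * llm_cost_multiplier
                  + econ.other_cost_per_request)
    return (econ.revenue_per_request - total_cost) / econ.revenue_per_request


def sensitivity_table(econ: UnitEconomics, multipliers=(1.0, 2.0, 5.0)) -> dict:
    """Map each cost multiplier to the resulting gross margin."""
    return {m: round(margin_at(econ, m), 4) for m in multipliers}


# Hypothetical product: $0.05 revenue, $0.005 LLM cost, $0.01 other cost per request.
baseline = UnitEconomics(0.05, 0.005, 0.01)
table = sensitivity_table(baseline)
for multiplier, margin in table.items():
    print(f"{multiplier:.0f}x LLM cost -> gross margin {margin:.0%}")
```

With these placeholder numbers the margin falls from 70% to 60% at 2x and to 30% at 5x. The useful output is the multiplier at which the margin crosses whatever floor your business can tolerate.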

2.2 The AI Continuity Plan Template

A minimal AI continuity plan answers four questions for each risk category: What is the trigger? What is the immediate response? What is the fallback? What is the recovery timeline?

AI Continuity Plan: Trigger, Response, Fallback, Recovery

| Risk | Trigger | Immediate Response | Fallback | Recovery Target |
|---|---|---|---|---|
| Pricing shock | Cost per request exceeds 2x baseline | Activate cost circuit breaker; enable aggressive caching | Route to secondary provider or smaller model | 48 hours to re-evaluate unit economics |
| Model deprecation | Deprecation notice from provider | Pin current version; begin eval suite runs on replacement | Migrate prompts to next-best model (pre-tested quarterly) | 2 weeks for prompt re-optimization |
| Extended outage | Provider API returns errors for >5 minutes | Automatic failover to secondary provider | Degrade gracefully (cached responses, rule-based fallback) | Seconds (automated) to minutes (manual review) |
| Behavioural drift | Eval scores drop >5% on stable test set | Alert on-call engineer; freeze deployments | Roll back to pinned model version or switch provider | Hours to days depending on severity |
| ToS change | Provider policy update email | Legal review within 48 hours | Migrate to compliant provider (pre-validated quarterly) | 2 to 4 weeks for full migration |
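
A trigger is only useful if something evaluates it automatically. As one illustration, the extended-outage trigger (errors for more than 5 minutes) could be tracked with a small detector like the one below. The class name `OutageDetector` and its methods are hypothetical, and the injectable clock is an assumption made so the logic can be tested without waiting.

```python
from __future__ import annotations

import time
from typing import Callable


class OutageDetector:
    """Signal failover once provider errors have persisted past a threshold."""

    def __init__(self, threshold_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold_seconds = threshold_seconds  # 5 minutes, per the plan table
        self._clock = clock
        self._first_error_at: float | None = None   # start of the current error streak

    def record_success(self) -> None:
        """Any successful call ends the current error streak."""
        self._first_error_at = None

    def record_error(self) -> None:
        """Remember when the current error streak began."""
        if self._first_error_at is None:
            self._first_error_at = self._clock()

    def should_fail_over(self) -> bool:
        """True once errors have continued for longer than the threshold."""
        if self._first_error_at is None:
            return False
        return self._clock() - self._first_error_at > self.threshold_seconds


# Demonstration with a fake clock so no real time has to pass.
now = [0.0]
detector = OutageDetector(threshold_seconds=300, clock=lambda: now[0])
detector.record_error()
now[0] = 200.0
early = detector.should_fail_over()  # streak is only 200 seconds old
now[0] = 301.0
late = detector.should_fail_over()   # streak has passed the 300-second threshold
```

A production version would probably also count error rate rather than a single streak, since one lucky success in the middle of an outage resets this detector.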
Real-World Scenario: Quarterly Provider Fire Drill

Who: A platform engineering lead at a Series B fintech company running all LLM traffic through a single provider.

Situation: The team had built a provider abstraction layer and a secondary provider integration six months earlier, but nobody had tested the failover path since the initial setup.

Problem: When the primary provider experienced a 90-minute outage, the failover silently failed because the secondary provider's API had changed its authentication scheme. Customers saw errors for the full duration of the outage.

Decision: The lead instituted a quarterly "provider fire drill": on a designated day each quarter, the team routes 5% of production traffic to the secondary provider and compares quality scores against the primary.

Result: The first drill after the policy change caught two integration issues (a deprecated parameter and a response format change). Subsequent drills have run cleanly, and the team now has quarterly data points showing how prompts perform across providers. The cost is minimal (5% of one day's traffic), and the next provider outage triggered a seamless failover.

Lesson: Failover infrastructure that is never tested is failover infrastructure that does not work; schedule regular drills to keep secondary integrations alive.

3. Translation Cost Collapse: Why Traditional Lock-in Is Fading

Here is a counterintuitive development: LLMs themselves are dissolving the technical barriers that create traditional vendor lock-in. The very technology that creates cognitive lock-in is simultaneously destroying API-level lock-in.

3.1 LLMs as Universal Translators

Consider what it takes to switch between LLM providers at the API level. You need to translate your request format (usually minor JSON differences), adapt your prompt structure (system/user/assistant role mappings), and normalize the response format (extracting the generated text from slightly different JSON paths). In the pre-LLM era, this kind of format translation was tedious, error-prone, and expensive. Today, you can literally ask an LLM to do the translation for you.
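
The three translation steps are small enough to show on plain dictionaries. The sketch below assumes the publicly documented shapes of the OpenAI chat completions format (system prompt as a message, text at `choices[0].message.content`) and the Anthropic Messages format (system prompt as a top-level field, text in a list of content blocks, `max_tokens` required). The function names are hypothetical, and the code handles only plain-string message content.

```python
def to_anthropic_payload(openai_payload: dict, target_model: str) -> dict:
    """Translate a chat-completions style request body into Messages style."""
    messages = openai_payload["messages"]
    # Role mapping: system messages move out of the list into a top-level field.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    payload = {
        "model": target_model,
        "messages": turns,
        # The Messages format requires max_tokens; 1024 is an arbitrary default.
        "max_tokens": openai_payload.get("max_tokens", 1024),
    }
    if system_parts:
        payload["system"] = "\n\n".join(system_parts)
    if "temperature" in openai_payload:
        payload["temperature"] = openai_payload["temperature"]
    return payload


def extract_text(response: dict) -> str:
    """Pull the generated text out of either response shape."""
    if "choices" in response:  # chat-completions shape
        return response["choices"][0]["message"]["content"] or ""
    # Messages shape: concatenate the text blocks, skipping other block types.
    return "".join(
        block["text"]
        for block in response.get("content", [])
        if block.get("type") == "text"
    )


request = {
    "model": "gpt-4o",
    "messages": [
        {"role": "system", "content": "You are terse."},
        {"role": "user", "content": "Hi"},
    ],
    "temperature": 0.0,
}
translated = to_anthropic_payload(request, target_model="claude-sonnet-4-20250514")
```

The whole translation fits in about thirty lines, which is the point: the expensive part of a switch lies elsewhere.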

More importantly, the industry is converging on a de facto standard. The OpenAI chat completions API format has become the lingua franca that nearly every provider supports, either natively or through compatibility layers. Anthropic, Google, and Mistral offer OpenAI-compatible endpoints, as do most open-weight model serving frameworks (vLLM, Ollama), and gateway libraries such as LiteLLM translate the same request format to many backends. This convergence is not accidental; it is a natural consequence of network effects in developer ecosystems.

3.2 What Switching Actually Costs Today

Given this convergence, the technical switching cost between providers has collapsed to near zero for the API call itself. What remains are three non-trivial costs:

  1. Prompt re-optimization. Different models respond differently to the same prompt. A prompt that scores 94% on GPT-4o might score 82% on Claude Sonnet. The gap is not because one model is "worse"; it is because each model has different strengths, different instruction-following patterns, and different default behaviours. Re-optimizing prompts typically requires 1 to 3 days of focused effort per critical prompt.
  2. Evaluation re-baselining. Your eval thresholds were calibrated for one model's output distribution. A new model will produce outputs with different characteristics, even when the quality is equivalent. You need to re-run your full eval suite (Chapter 29) and potentially adjust scoring rubrics.
  3. Edge case rediscovery. Over months of production use, your team discovers specific input patterns that cause problems and adds guardrails to handle them. A new model may fail on entirely different inputs, requiring a new round of edge case discovery and mitigation.
Fun Fact

In 2024, an engineer at a mid-sized startup reported that switching their entire product from GPT-4 to Claude 3.5 Sonnet took less than two hours for the API integration (literally changing one URL and one model name in their LiteLLM config) but three weeks for prompt re-optimization and evaluation re-baselining. The ratio of integration effort to optimization effort was approximately 1:80. This perfectly illustrates why cognitive lock-in dominates technical lock-in in the LLM era.

4. The Portable Monogamy Strategy

Given the realities above, the optimal strategy for most teams is what we call "portable monogamy": commit deeply to one provider for speed of execution, but architect your system so that switching providers is a configuration change rather than a rewrite. This approach gives you the best of both worlds: the depth of optimization that comes from mastering one model, and the optionality that comes from clean abstraction boundaries.

4.1 Why Full Commitment Beats Premature Multi-Provider

Teams that try to support multiple providers from day one often end up with a lowest-common-denominator integration. They avoid using provider-specific features (function calling syntax, structured output modes, extended context windows) because those features differ across providers. The result is a product that uses no provider well, optimized for portability at the expense of capability.

Portable monogamy avoids this trap. You commit fully to your primary provider, exploit their unique capabilities, and optimize your prompts specifically for their model. But you do this through an abstraction layer that isolates provider-specific logic behind clean interfaces. When you need to switch (or when you want to add a second provider for specific tasks), you implement a new adapter rather than rewriting your application.

4.2 The Abstraction Layer Pattern

The core of the portable monogamy strategy is a provider abstraction layer. This is not a heavyweight framework; it is a thin interface that separates your application logic from provider-specific API details. The following implementation demonstrates the pattern.

"""Provider abstraction layer for LLM-powered applications.

Supports multiple providers behind a unified interface while allowing
provider-specific optimizations through adapter classes.
"""
from __future__ import annotations

import json
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class LLMRequest:
    """Provider-agnostic request format."""
    system_prompt: str
    user_message: str
    model: str | None = None             # Override default model
    temperature: float = 0.7
    max_tokens: int = 1024
    response_format: dict | None = None  # JSON schema if needed
    tools: list[dict] | None = None      # Function calling definitions
    metadata: dict = field(default_factory=dict)  # For logging, tracing


@dataclass
class LLMResponse:
    """Provider-agnostic response format."""
    content: str
    model: str
    provider: str
    usage: dict         # {"prompt_tokens": N, "completion_tokens": N}
    latency_ms: float
    raw_response: Any   # Original provider response for debugging
    tool_calls: list[dict] | None = None


class LLMProvider(ABC):
    """Abstract base class for LLM provider adapters."""

    @abstractmethod
    def complete(self, request: LLMRequest) -> LLMResponse:
        """Send a completion request and return a normalized response."""
        ...

    @abstractmethod
    def health_check(self) -> bool:
        """Return True if the provider is reachable and responding."""
        ...


class OpenAIProvider(LLMProvider):
    """Adapter for OpenAI and OpenAI-compatible endpoints."""

    def __init__(self, model: str = "gpt-4o", base_url: str | None = None):
        import openai
        self.model = model
        self.client = openai.OpenAI(base_url=base_url) if base_url else openai.OpenAI()

    def complete(self, request: LLMRequest) -> LLMResponse:
        model = request.model or self.model
        start = time.perf_counter()

        # Map the provider-agnostic request onto chat completions parameters.
        kwargs: dict[str, Any] = {
            "model": model,
            "messages": [
                {"role": "system", "content": request.system_prompt},
                {"role": "user", "content": request.user_message},
            ],
            "temperature": request.temperature,
            "max_tokens": request.max_tokens,
        }
        if request.response_format:
            kwargs["response_format"] = request.response_format
        if request.tools:
            kwargs["tools"] = request.tools

        resp = self.client.chat.completions.create(**kwargs)
        elapsed = (time.perf_counter() - start) * 1000

        return LLMResponse(
            content=resp.choices[0].message.content or "",
            model=resp.model,
            provider="openai",
            usage={
                "prompt_tokens": resp.usage.prompt_tokens,
                "completion_tokens": resp.usage.completion_tokens,
            },
            latency_ms=elapsed,
            raw_response=resp,
            tool_calls=self._extract_tool_calls(resp),
        )

    def health_check(self) -> bool:
        try:
            self.client.models.list()
            return True
        except Exception:
            return False

    @staticmethod
    def _extract_tool_calls(resp) -> list[dict] | None:
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return None
        return [
            {"name": tc.function.name, "arguments": json.loads(tc.function.arguments)}
            for tc in msg.tool_calls
        ]


class AnthropicProvider(LLMProvider):
    """Adapter for the Anthropic Messages API."""

    def __init__(self, model: str = "claude-sonnet-4-20250514"):
        import anthropic
        self.model = model
        self.client = anthropic.Anthropic()

    def complete(self, request: LLMRequest) -> LLMResponse:
        model = request.model or self.model
        start = time.perf_counter()

        # The system prompt is a top-level field here, not a message.
        kwargs: dict[str, Any] = {
            "model": model,
            "system": request.system_prompt,
            "messages": [{"role": "user", "content": request.user_message}],
            "temperature": request.temperature,
            "max_tokens": request.max_tokens,
        }
        if request.tools:
            kwargs["tools"] = self._convert_tools(request.tools)

        resp = self.client.messages.create(**kwargs)
        elapsed = (time.perf_counter() - start) * 1000

        # The response is a list of content blocks; collect text and tool calls.
        content = ""
        tool_calls = []
        for block in resp.content:
            if block.type == "text":
                content += block.text
            elif block.type == "tool_use":
                tool_calls.append({"name": block.name, "arguments": block.input})

        return LLMResponse(
            content=content,
            model=resp.model,
            provider="anthropic",
            usage={
                "prompt_tokens": resp.usage.input_tokens,
                "completion_tokens": resp.usage.output_tokens,
            },
            latency_ms=elapsed,
            raw_response=resp,
            tool_calls=tool_calls if tool_calls else None,
        )

    def health_check(self) -> bool:
        # Note: this sends a small real (billable) request.
        try:
            self.client.messages.create(
                model=self.model, max_tokens=10,
                messages=[{"role": "user", "content": "ping"}],
            )
            return True
        except Exception:
            return False

    @staticmethod
    def _convert_tools(openai_tools: list[dict]) -> list[dict]:
        """Convert OpenAI-format tool definitions to Anthropic format."""
        return [
            {
                "name": t["function"]["name"],
                "description": t["function"].get("description", ""),
                "input_schema": t["function"]["parameters"],
            }
            for t in openai_tools
        ]
Code Fragment 38.3.1: A provider abstraction layer with adapters for OpenAI and Anthropic. The LLMRequest and LLMResponse dataclasses define a provider-agnostic contract. Each adapter translates between this contract and the provider's native API. Adding a new provider (Google Gemini, Mistral, a local Ollama instance) requires only a new adapter class; no application code changes.

4.3 Prompt Externalization

The second pillar of portable monogamy is prompt externalization: storing your prompts in files outside your application code rather than embedding them as string literals. This practice provides three benefits.

  1. Version control. Prompts evolve independently of application logic. Storing them in separate files gives you a clean git history of prompt changes, making it easy to correlate quality shifts with specific prompt edits.
  2. Provider-specific variants. You can maintain prompt variants optimized for different models in parallel (e.g., classify_intent.openai.txt and classify_intent.anthropic.txt) and select the right one at runtime based on the active provider.
  3. Non-engineer iteration. Product managers, domain experts, and prompt engineers can edit prompts without touching application code, lowering the barrier to iteration and reducing deployment risk.
"""Prompt loader with provider-specific variant support."""
from __future__ import annotations

from pathlib import Path


class PromptStore:
    """Load prompts from external files with optional provider-specific variants.

    Directory structure:
        prompts/
            classify_intent.txt            # Default (provider-agnostic)
            classify_intent.openai.txt     # OpenAI-optimized variant
            classify_intent.anthropic.txt  # Anthropic-optimized variant
            summarize_ticket.txt
    """

    def __init__(self, prompts_dir: str = "prompts"):
        self.base_dir = Path(prompts_dir)

    def load(self, prompt_name: str, provider: str | None = None) -> str:
        """Load a prompt by name, preferring provider-specific variant if available."""
        if provider:
            variant_path = self.base_dir / f"{prompt_name}.{provider}.txt"
            if variant_path.exists():
                return variant_path.read_text(encoding="utf-8").strip()

        default_path = self.base_dir / f"{prompt_name}.txt"
        if default_path.exists():
            return default_path.read_text(encoding="utf-8").strip()

        raise FileNotFoundError(
            f"No prompt found for '{prompt_name}' "
            f"(checked provider='{provider}' and default)"
        )


# Usage
store = PromptStore("prompts")
system_prompt = store.load("classify_intent", provider="anthropic")
# Returns contents of prompts/classify_intent.anthropic.txt if it exists,
# otherwise falls back to prompts/classify_intent.txt
Code Fragment 38.3.2: A prompt store that supports provider-specific variants. When the active provider is "anthropic," the store first looks for classify_intent.anthropic.txt. If that file does not exist, it falls back to the default classify_intent.txt. This pattern lets you maintain provider-optimized prompts without polluting your application logic with conditional branches.

5. Multi-Provider Routing

Once you have a clean abstraction layer, you can move beyond single-provider architecture to multi-provider routing: sending different requests to different models based on the characteristics of each request. This is not about redundancy (though it enables that too); it is about matching each task to the most cost-effective model that can handle it.

5.1 Why Route to Multiple Providers?

Not all LLM requests are equal. A simple classification task ("Is this email spam?") does not need a frontier model with a $15/million-token price tag. A complex legal analysis probably does. Routing lets you allocate your most expensive models to the tasks that genuinely need them and send everything else to cheaper, faster alternatives.

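Three dimensions drive routing decisions: cost per request, latency, and the capability a task requires (for example reasoning depth, context length, or code generation).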

5.2 Implementing a Model Router

A model router examines each incoming request and selects the best provider/model combination based on routing rules. The following implementation shows a rule-based router with automatic fallback.

"""Model router with rule-based routing and automatic fallback."""
from __future__ import annotations

import logging
from dataclasses import dataclass, replace
from typing import Callable

# LLMRequest, LLMResponse, LLMProvider, OpenAIProvider, and AnthropicProvider
# are the classes defined in Code Fragment 38.3.1.

logger = logging.getLogger(__name__)


@dataclass
class RouteRule:
    """A single routing rule that maps request characteristics to a provider."""
    name: str
    provider_key: str                        # Key into the provider registry
    model: str                               # Model to use with this provider
    condition: Callable[[LLMRequest], bool]  # Returns True if the rule applies
    priority: int = 0                        # Higher priority rules are checked first
    max_tokens_threshold: int | None = None  # Skip if request.max_tokens exceeds this


class ModelRouter:
    """Route requests to the optimal provider based on configurable rules.

    Supports automatic fallback: if the selected provider fails, the router
    tries the next eligible provider in priority order.
    """

    def __init__(self, providers: dict[str, LLMProvider]):
        self.providers = providers
        self.rules: list[RouteRule] = []
        self.fallback_key: str | None = None
        self.fallback_model: str | None = None

    def add_rule(self, rule: RouteRule) -> None:
        self.rules.append(rule)
        self.rules.sort(key=lambda r: r.priority, reverse=True)

    def set_fallback(self, provider_key: str, model: str) -> None:
        """Set the default provider used when no rules match or all fail."""
        self.fallback_key = provider_key
        self.fallback_model = model

    def route(self, request: LLMRequest) -> LLMResponse:
        """Route a request through matching rules with automatic fallback."""
        eligible_rules = [
            r for r in self.rules
            if r.condition(request)
            and (r.max_tokens_threshold is None
                 or request.max_tokens <= r.max_tokens_threshold)
        ]

        # Try each eligible rule in priority order
        for rule in eligible_rules:
            provider = self.providers.get(rule.provider_key)
            if provider is None:
                logger.warning("Provider '%s' not found, skipping rule '%s'",
                               rule.provider_key, rule.name)
                continue

            try:
                # Copy the request so the caller's object is never mutated.
                request_copy = replace(
                    request,
                    model=rule.model,
                    metadata={**request.metadata, "route_rule": rule.name},
                )
                return provider.complete(request_copy)
            except Exception as exc:
                logger.error("Provider '%s' failed for rule '%s': %s",
                             rule.provider_key, rule.name, exc)
                continue  # Fall through to next eligible rule

        # All rules exhausted; use fallback
        if self.fallback_key and self.fallback_key in self.providers:
            logger.info("All rules exhausted, using fallback provider '%s'",
                        self.fallback_key)
            fallback_request = replace(request, model=self.fallback_model)
            return self.providers[self.fallback_key].complete(fallback_request)

        raise RuntimeError("No provider available: all rules failed and no fallback set")


# ---- Example configuration ----

def is_simple_classification(req: LLMRequest) -> bool:
    """Heuristic: short prompts requesting a single label are simple tasks."""
    return (
        len(req.user_message) < 500
        and any(kw in req.system_prompt.lower()
                for kw in ["classify", "categorize", "label", "yes or no"])
    )

def needs_long_context(req: LLMRequest) -> bool:
    """Route long inputs to a model with extended context support."""
    return len(req.user_message) > 15_000

def needs_code_generation(req: LLMRequest) -> bool:
    """Detect requests that need strong code generation capabilities."""
    return any(kw in req.system_prompt.lower()
               for kw in ["write code", "generate code", "implement", "function that"])


# Build the router
router = ModelRouter(providers={
    "openai": OpenAIProvider(model="gpt-4o"),
    "openai_mini": OpenAIProvider(model="gpt-4o-mini"),
    "anthropic": AnthropicProvider(model="claude-sonnet-4-20250514"),
})

# Simple tasks go to the cheapest, fastest model
router.add_rule(RouteRule(
    name="simple_to_mini",
    provider_key="openai_mini",
    model="gpt-4o-mini",
    condition=is_simple_classification,
    priority=10,
))

# Long-context tasks go to Claude (200K context window)
router.add_rule(RouteRule(
    name="long_context_to_claude",
    provider_key="anthropic",
    model="claude-sonnet-4-20250514",
    condition=needs_long_context,
    priority=8,
))

# Code generation tasks go to GPT-4o
router.add_rule(RouteRule(
    name="code_to_gpt4o",
    provider_key="openai",
    model="gpt-4o",
    condition=needs_code_generation,
    priority=5,
))

# Default fallback for everything else
router.set_fallback("openai", "gpt-4o")

# Usage: the application code never mentions a specific provider
response = router.route(LLMRequest(
    system_prompt="Classify this customer email as: billing, technical, or general.",
    user_message="I was charged twice for my subscription last month.",
))
Code Fragment 38.3.3: A rule-based model router with automatic fallback. Each RouteRule defines a condition function that examines the request and decides whether this rule applies. Rules are evaluated in priority order, and if the selected provider fails, the router automatically tries the next eligible provider. The application code calls router.route() without knowing which provider will handle the request.

5.3 When Multi-Provider Routing Makes Sense

Multi-provider routing adds complexity. It is not always worth it. The decision depends on your scale and the diversity of your workload.

When to Adopt Multi-Provider Routing

| Scenario | Single Provider | Multi-Provider Routing |
|---|---|---|
| Monthly LLM spend < $1,000 | Stick with one provider. Optimization effort exceeds savings. | Not worth the complexity overhead. |
| Monthly spend $1,000 to $10,000 | Optimize prompts for primary provider; maintain a tested fallback. | Consider routing simple tasks to a cheaper model (often 30% to 50% savings). |
| Monthly spend > $10,000 | Likely leaving money on the table. | Strong candidate. Route by task complexity to optimize cost per outcome. |
| High availability requirement | Single point of failure risk. | Automatic failover justifies the complexity regardless of cost. |
| Diverse task types | One model handles everything, but some tasks are over-served. | Match model capability to task requirements for better cost/quality tradeoff. |

Real-World Scenario: Cost Savings from Task-Based Routing

Who: An ML engineer at a customer support platform processing 100,000 LLM requests per day.

Situation: All requests were routed to GPT-4o at approximately $5 per 1,000 requests (blended input/output cost), producing a daily LLM bill of $500.

Problem: Cost analysis revealed that 65% of requests were simple intent classifications (e.g., "Is this a billing question or a shipping question?") that did not require frontier-model reasoning, yet they consumed the same per-request cost as complex multi-turn conversations.

Decision: The engineer implemented task-based routing: a lightweight classifier tagged each request by complexity, sending simple intent classifications to GPT-4o mini (which handled them with equivalent accuracy) and reserving GPT-4o for complex reasoning tasks.

Result: The blended daily cost dropped from $500 to approximately $210, saving roughly $8,700 per month. The routing logic required two days of engineering effort and one day of evaluation validation.

Lesson: Task-based model routing is often the highest-ROI optimization available, delivering large cost reductions with minimal engineering investment.
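
The scenario's figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above, plus two assumptions: a 30-day month, and that complex requests keep costing the blended $5 per 1,000 rate after routing. The per-1,000 price it derives for the cheaper model is implied by those figures, not quoted from any price list.

```python
REQUESTS_PER_DAY = 100_000
FRONTIER_COST_PER_1K = 5.00       # dollars per 1,000 requests on GPT-4o (scenario)
SIMPLE_SHARE_PERCENT = 65         # share of requests that are simple classifications
DAILY_COST_AFTER_ROUTING = 210.00  # dollars per day after routing (scenario)
DAYS_PER_MONTH = 30               # assumption

# Before routing: every request goes to the frontier model.
baseline_daily = REQUESTS_PER_DAY / 1000 * FRONTIER_COST_PER_1K

# After routing: only the complex share still pays the frontier price.
simple_requests = REQUESTS_PER_DAY * SIMPLE_SHARE_PERCENT // 100
complex_requests = REQUESTS_PER_DAY - simple_requests
complex_daily = complex_requests / 1000 * FRONTIER_COST_PER_1K

# Whatever remains of the new daily bill is what the simple requests cost.
simple_daily = DAILY_COST_AFTER_ROUTING - complex_daily
implied_cheap_cost_per_1k = simple_daily / (simple_requests / 1000)

monthly_savings = (baseline_daily - DAILY_COST_AFTER_ROUTING) * DAYS_PER_MONTH

print(f"Baseline daily cost:        ${baseline_daily:,.2f}")
print(f"Complex requests, daily:    ${complex_daily:,.2f}")
print(f"Simple requests, daily:     ${simple_daily:,.2f}")
print(f"Implied cheap-model price:  ${implied_cheap_cost_per_1k:.2f} per 1,000")
print(f"Monthly savings:            ${monthly_savings:,.2f}")
```

The numbers are internally consistent: the 35% of traffic that stays on GPT-4o accounts for $175 of the $210, so the routed 65% costs about $35 per day.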

6. The Portability Checklist

The strategies described above coalesce into a concrete checklist. Each item is something you can implement today, regardless of whether you plan to switch providers soon. Think of these as insurance premiums: small investments that pay off enormously if you ever need to make a change.

6.1 Architecture and Code

  1. Implement a provider abstraction layer. All LLM calls go through a unified interface (like the one in Code Fragment 38.3.1). No provider-specific SDK calls appear in application logic.
  2. Externalize all prompts. System prompts, few-shot examples, and output format instructions live in files, not in code. Support provider-specific variants (Code Fragment 38.3.2).
  3. Normalize response schemas. Define your own response dataclasses (like LLMResponse above) and map every provider's raw response into this format. Application logic should never parse provider-specific JSON.
  4. Standardize on OpenAI-compatible APIs where possible. For open-weight models and local deployments, use serving frameworks (vLLM, Ollama) that expose OpenAI-compatible endpoints. This reduces the number of adapter implementations you need to maintain.
  5. Isolate provider-specific features. If you use a capability unique to one provider (e.g., Anthropic's extended thinking, OpenAI's structured outputs mode), wrap it in an adapter method with a graceful degradation path for providers that lack the feature.
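
Item 3 can be sketched as a pair of mapping functions. This is a minimal illustration, not the section's own implementation: `NormalizedResponse` is a hypothetical stand-in for `LLMResponse`, and its fields are assumptions. The payload shapes follow the OpenAI chat completion and Anthropic Messages response formats, trimmed to the fields used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedResponse:
    """Hypothetical stand-in for the section's LLMResponse; fields are assumed."""
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    provider: str


def from_openai_chat(raw: dict) -> NormalizedResponse:
    """Map an OpenAI-style chat completion payload into the shared schema."""
    usage = raw.get("usage", {})
    return NormalizedResponse(
        text=raw["choices"][0]["message"]["content"] or "",
        model=raw.get("model", "unknown"),
        input_tokens=usage.get("prompt_tokens", 0),
        output_tokens=usage.get("completion_tokens", 0),
        provider="openai",
    )


def from_anthropic_messages(raw: dict) -> NormalizedResponse:
    """Map an Anthropic-style Messages payload into the shared schema."""
    usage = raw.get("usage", {})
    # Content arrives as a list of typed blocks; keep only the text blocks.
    text = "".join(
        block.get("text", "")
        for block in raw.get("content", [])
        if block.get("type") == "text"
    )
    return NormalizedResponse(
        text=text,
        model=raw.get("model", "unknown"),
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
        provider="anthropic",
    )


# Trimmed example payloads (illustrative values only).
openai_raw = {
    "model": "gpt-4o",
    "choices": [{"message": {"role": "assistant", "content": "billing"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 1},
}
anthropic_raw = {
    "model": "claude-example",
    "content": [{"type": "text", "text": "billing"}],
    "usage": {"input_tokens": 13, "output_tokens": 1},
}

a = from_openai_chat(openai_raw)
b = from_anthropic_messages(anthropic_raw)
print(a.text == b.text)  # application logic compares normalized fields only
```

Each mapper lives inside its provider's adapter, so application code only ever sees the normalized fields.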

6.2 Evaluation and Testing

  1. Build a provider-agnostic eval suite. Your evaluation pipeline (Chapter 29) should be able to run against any provider without modification. The eval suite tests your product's quality, not a specific model's behavior.
  2. Run cross-provider eval quarterly. Even if you have no plans to switch, run your eval suite against at least one alternative provider every quarter. Track the quality delta over time. If the gap narrows, your switching cost is decreasing.
  3. Maintain a "switchability score." Define a simple metric: the percentage of your eval cases that pass at an acceptable quality threshold on your secondary provider. A switchability score above 85% means you could switch within days. Below 70% means you have significant re-optimization work ahead.
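
The switchability score from item 3 is simple enough to compute in a few lines. The sketch below assumes each eval case yields a numeric quality score on the secondary provider; the function names, the sample scores, and the label for the 70% to 85% middle band are illustrative assumptions, since the checklist only defines the two outer bands.

```python
def switchability_score(scores, threshold):
    """Fraction of eval cases whose secondary-provider score meets the threshold."""
    if not scores:
        raise ValueError("need at least one eval case")
    passed = sum(1 for s in scores if s >= threshold)
    return passed / len(scores)


def interpret(score):
    """Map a switchability score onto the bands described in the checklist."""
    if score > 0.85:
        return "could switch within days"
    if score < 0.70:
        return "significant re-optimization work ahead"
    # The checklist does not name this band; the label is an assumption.
    return "in between: expect targeted prompt work"


# Hypothetical per-case quality scores from a run on the secondary provider.
secondary_scores = [0.90, 0.80, 0.95, 0.40, 0.85, 0.70, 0.90, 0.60, 0.88, 0.92]

score = switchability_score(secondary_scores, threshold=0.75)
print(f"switchability: {score:.0%} ({interpret(score)})")
```

Tracking this number from one quarterly run to the next gives the quality-delta trend that item 2 asks for.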

6.3 Operations and Process

  1. Document model-specific knowledge. When your team discovers a model quirk (e.g., "GPT-4o tends to over-apologize in customer service contexts" or "Claude performs better with XML-tagged instructions"), write it down. This institutional knowledge is the substance of cognitive lock-in, and making it explicit reduces its cost.
  2. Run quarterly provider fire drills. Route a small percentage of traffic to your secondary provider for one day. Verify that your failover logic works, measure quality, and update your continuity plan based on what you learn.
  3. Monitor provider announcements. Subscribe to status pages, changelog feeds, and deprecation notices from your primary and secondary providers. Build a 90-day advance warning system for model deprecations.
  4. Negotiate contract flexibility. If you sign a committed-use agreement with a provider, negotiate exit clauses tied to specific performance or pricing triggers. Your continuity plan should include contract terms, not just technical architecture.
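
For the fire drill in item 2, one way to route a small, repeatable slice of traffic is to hash a request identifier into buckets. This is a sketch under stated assumptions: `pick_provider`, the `"primary"`/`"secondary"` labels, and the request ID format are all hypothetical, and a real deployment would plug the result into its own router.

```python
import hashlib


def pick_provider(request_id: str, drill_percent: float,
                  primary: str = "primary", secondary: str = "secondary") -> str:
    """Deterministically send about drill_percent% of requests to the secondary."""
    if not 0 <= drill_percent <= 100:
        raise ValueError("drill_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so percentages resolve to 0.01%.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return secondary if bucket < drill_percent * 100 else primary


# Simulate one drill day at 5% with hypothetical request IDs.
request_ids = [f"req-{i}" for i in range(10_000)]
to_secondary = sum(pick_provider(r, 5) == "secondary" for r in request_ids)
print(f"{to_secondary / len(request_ids):.1%} of requests went to the secondary")
```

Hashing rather than random sampling means a given request ID always lands on the same provider, which makes drill results easier to reproduce and debug.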
Note: The 80/20 of This Checklist

If you implement only three items from this checklist, make them: (1) the provider abstraction layer, (2) prompt externalization, and (3) the quarterly cross-provider eval run. These three practices alone reduce your effective switching cost by an estimated 60% to 70%. Everything else provides incremental value but is less critical for most teams at most stages of growth.

7. Putting It All Together: An Architecture Diagram

The following table shows how the components from this section fit into a complete architecture, one layer per row. Your application logic sits at the top, completely unaware of which provider is handling each request. The prompt store supplies externalized prompts. The model router examines each request and selects the best provider/model combination. Each provider adapter translates between the unified interface and the provider's native API. The evaluation pipeline runs against the unified interface, making it inherently provider-agnostic.

Portable AI Architecture: Layer Responsibilities

| Layer | Responsibility | Provider-Specific? |
| --- | --- | --- |
| Application Logic | Business rules, user interaction, orchestration | No. Calls router.route(request) only. |
| Prompt Store | Load and version prompts; select provider-specific variants | Contains variants, but exposes a provider-agnostic load() method. |
| Model Router | Select optimal provider/model based on request characteristics | Knows about providers but is configured declaratively. |
| Provider Adapters | Translate between unified interface and native APIs | Yes. This is the only layer with provider-specific code. |
| Evaluation Pipeline | Test quality against the unified interface | No. Runs against any provider through the router. |

The key insight in this architecture is that provider-specific knowledge is concentrated in exactly two places: the adapter classes and the prompt variant files. The router refers to providers only by name in its declarative configuration, and everything else is provider-agnostic. When you need to add a new provider, you write one adapter class and (optionally) a set of prompt variants. When you need to switch your primary provider, you change the router configuration and run your eval suite. No application logic changes.
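
A minimal sketch shows what "change the router configuration" can look like in practice. Everything here is illustrative: the `Request` and `Router` classes, the config keys, and the `provider_a`/`provider_b` stub adapters are assumptions for the example, not the section's reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    task_type: str
    prompt: str


class Router:
    """Chooses an adapter by task type using a declarative config dict."""

    def __init__(self, config: dict, adapters: Dict[str, Callable[[str], str]]):
        self.config = config
        self.adapters = adapters

    def route(self, request: Request) -> str:
        # Task-specific routes win; everything else goes to the primary.
        name = self.config["routes"].get(request.task_type, self.config["primary"])
        return self.adapters[name](request.prompt)


# Stub adapters standing in for real provider clients (no network calls).
ADAPTERS = {
    "provider_a": lambda prompt: f"[provider_a] {prompt}",
    "provider_b": lambda prompt: f"[provider_b] {prompt}",
}


def handle_ticket(router: Router, text: str) -> str:
    """Application logic: knows the router, never a provider name."""
    return router.route(Request(task_type="reasoning", prompt=text))


config_v1 = {"primary": "provider_a", "routes": {"classification": "provider_b"}}
config_v2 = {"primary": "provider_b", "routes": {}}  # primary switched in config

before = handle_ticket(Router(config_v1, ADAPTERS), "refund status?")
after = handle_ticket(Router(config_v2, ADAPTERS), "refund status?")
print(before)
print(after)
```

`handle_ticket` is identical in both runs; only the config dict differs, which is the property the architecture is designed to preserve.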

8. Common Anti-Patterns

Teams frequently fall into patterns that make portability harder than it needs to be. Recognizing these anti-patterns early saves significant refactoring later.

  1. Hardcoded model names in application logic. When model identifiers like "gpt-4o" appear in business logic files (not in configuration or adapter classes), every model change requires a code change. Extract model names into configuration.
  2. Prompt strings embedded in source code. Inline prompt strings defeat version control, make provider-specific variants impossible, and create merge conflicts when multiple team members iterate on prompts simultaneously. Move prompts to external files.
  3. Parsing provider-specific response JSON in application code. If your business logic accesses response.choices[0].message.content (OpenAI format) directly, you have coupled your application to one provider's schema. Use a normalized response dataclass.
  4. Skipping evaluation when switching models. Teams sometimes swap model names and assume everything works because "the new model is supposed to be better." Always run your full eval suite after any model change. Always.
  5. Building abstraction layers that abstract too much. The opposite extreme: abstracting away capabilities that differ between providers (tool calling, structured output, streaming) behind a lowest-common-denominator interface. Good abstraction preserves access to provider-specific features through optional parameters, not by eliminating them.
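
The alternative to a lowest-common-denominator interface in anti-pattern 5 can be sketched with an optional parameter and a fallback. The class names, the `build_request` method, and the request dictionary shape below are hypothetical; the sketch builds request dictionaries only and calls no real provider SDK.

```python
import json
from typing import Optional


class BaseAdapter:
    """Hypothetical adapter base: optional features degrade instead of vanishing."""

    supports_json_schema = False

    def build_request(self, prompt: str, json_schema: Optional[dict] = None) -> dict:
        request = {"prompt": prompt}
        if json_schema is None:
            return request
        if self.supports_json_schema:
            # Provider has a native structured-output mode: pass the schema through.
            request["json_schema"] = json_schema
        else:
            # Graceful degradation: ask for the same shape in the prompt text.
            request["prompt"] = (
                f"{prompt}\n\nRespond with JSON matching this schema:\n"
                f"{json.dumps(json_schema)}"
            )
            request["degraded"] = ["json_schema"]
        return request


class NativeSchemaAdapter(BaseAdapter):
    supports_json_schema = True


class PromptOnlyAdapter(BaseAdapter):
    pass


schema = {"type": "object", "properties": {"intent": {"type": "string"}}}

native = NativeSchemaAdapter().build_request("Classify this ticket.", json_schema=schema)
fallback = PromptOnlyAdapter().build_request("Classify this ticket.", json_schema=schema)
print(sorted(native))
print(sorted(fallback))
```

Callers opt in to the feature through one keyword argument, and the `degraded` marker lets monitoring record when a request fell back to the prompt-based path.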
Key Takeaways

What Comes Next

With your provider strategy in place and your architecture built for portability, Section 38.4: Post-Launch Monitoring and Iteration covers the continuous feedback loops that keep your AI product healthy after launch. You will learn to detect quality drift, design A/B tests for non-deterministic systems, and build the iteration flywheel that separates products that thrive from those that quietly degrade.

Self-Check
Q1: Your team has been using GPT-4o for eight months and is considering switching to Claude Sonnet for cost reasons. The API integration change takes two hours. What are the three remaining cost categories that will dominate the migration effort, and approximately how long should you budget for each?
The three remaining cost categories are: (1) Prompt re-optimization (1 to 3 days per critical prompt): each prompt needs to be tested on the new model and adjusted for its different instruction-following patterns and default behaviors. (2) Evaluation re-baselining (1 to 2 days): running your full eval suite on the new model, analyzing the quality delta, and adjusting scoring thresholds to account for different output characteristics. (3) Edge case rediscovery (ongoing for 2 to 4 weeks post-switch): the new model will fail on different inputs than the old one, and your team needs time to discover and mitigate these new failure modes through production monitoring. Total budget: 2 to 6 weeks depending on the number of prompts and the criticality of the application.
Q2: A colleague argues that building a provider abstraction layer is premature optimization because "we will never switch providers." Give two reasons the abstraction layer is valuable even if you never switch.
Two reasons the abstraction layer is valuable even without switching: (1) Testability. The abstraction layer lets you substitute a mock provider during testing, making your integration tests fast, deterministic, and free of API costs. Without it, testing requires either hitting the live API (slow and expensive) or building ad-hoc mocks that couple to one provider's response format. (2) Model upgrades within the same provider. Providers release new models frequently (GPT-4 to GPT-4 Turbo to GPT-4o, Claude 3 to 3.5 to 4). Each upgrade can change behavior. The abstraction layer, combined with externalized prompts and a provider-agnostic eval suite, makes within-provider model upgrades a structured, measurable process rather than a hope-and-pray deployment.
Q3: Your AI product handles 100,000 requests per day. 60% are simple classification tasks, 30% are moderate summarization tasks, and 10% are complex multi-step reasoning tasks. You currently send all requests to GPT-4o at $5 per 1,000 requests. Design a routing strategy and estimate the monthly cost savings.
Routing strategy: send the 60,000 simple classification requests to GPT-4o mini ($0.50 per 1,000 requests), the 30,000 moderate summarization requests to GPT-4o ($5 per 1,000), and the 10,000 complex reasoning requests to GPT-4o ($5 per 1,000). Before routing: 100,000 x $5/1,000 = $500/day = $15,000/month. After routing: (60,000 x $0.50/1,000) + (30,000 x $5/1,000) + (10,000 x $5/1,000) = $30 + $150 + $50 = $230/day = $6,900/month. Monthly savings: approximately $8,100 (54% reduction). The key assumption to validate: GPT-4o mini must achieve acceptable quality on the classification tasks. Run your eval suite on a sample before deploying the route. If accuracy drops below your threshold, adjust the routing condition to be more selective about which tasks qualify as "simple."
References & Further Reading
Foundational Papers

Shankar, S., et al. (2024). "Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences." UIST 2024.

Demonstrates the importance of provider-agnostic evaluation methodology. Directly relevant to the cross-provider eval practices in this section's portability framework. Teams building multi-provider evaluation suites should read this to understand validator calibration challenges.


Chen, L., Zaharia, M., & Zou, J. (2023). "How Is ChatGPT's Behavior Changing over Time?" arXiv:2307.09009.

Provides empirical evidence that model behavior changes across versions without explicit notice, underscoring the need for AI continuity planning. Essential reading for any team relying on a single provider, as it quantifies the silent drift risk discussed in this section.


Ding, Y., et al. (2024). "LLM Routing: Optimizing Cost and Quality in Multi-Model Deployments." arXiv:2404.12258.

Formalizes the model routing problem and evaluates several routing strategies, providing theoretical grounding for the practical router pattern in this section. Researchers and cost-conscious engineering teams will find the break-even analysis especially useful.


Sculley, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015.

The seminal paper on ML systems debt, whose warnings about entanglement and hidden dependencies apply directly to the cognitive lock-in patterns described in this section. Required reading for any team building production ML or LLM systems.


Madaan, A., et al. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." NeurIPS 2023.

Demonstrates that LLMs can evaluate and improve their own outputs, a capability that underlies the translation cost collapse argument in this section. Practitioners exploring self-improvement loops and provider migration automation will find the methodology directly applicable.

Tools & Libraries

BerriAI. (2024). "LiteLLM: A Unified Interface for 100+ LLM Providers." GitHub.

An open-source implementation of the provider abstraction pattern described in this section, supporting automatic fallback and cost tracking across providers. Teams implementing portable monogamy or multi-provider routing should evaluate this as a starting point for their abstraction layer.
