
"Behind every good LLM product is a gateway that vendors do not see."
Deploy, Gateway-Guarding AI Agent
Chapter 62 deployed a single model. This chapter handles the rest: AI gateways (Portkey, Kong AI Gateway, LiteLLM Router), model routing, fallback chains, vendor abstraction, cost-aware routing, and the day a provider goes down but your product cannot afford to.
Production LLM deployments need three things a single model endpoint does not provide: a gateway for rate limiting, model routing that balances cost against quality, and observability across providers. This chapter covers the gateway pattern, intelligent routing, and the operational surface that AI gateways expose.
Chapter Overview
As LLM applications grow beyond a single model and provider, managing API keys, retry logic, rate limits, cost tracking, and routing becomes unsustainable without a gateway. This chapter teaches the AI gateway pattern: the unified-API layer that abstracts providers, the model-routing strategies (cost-aware, latency-aware, capability-aware), the canonical implementations (LiteLLM, Portkey, Helicone, custom gateways), and the observability story that makes a multi-provider stack governable.
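The three routing strategies can be combined in one small decision function. The sketch below is a minimal illustration, assuming a hypothetical route table: the model names, provider labels, prices, and latencies are invented for the example and are not quotes from any real provider, and `ModelRoute` and `route` are hypothetical helpers, not the API of LiteLLM, Portkey, or Helicone. Capability acts as a hard filter, latency as an optional budget, and cost as the objective among whatever survives.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRoute:
    name: str                     # hypothetical model identifier
    provider: str                 # hypothetical provider label
    usd_per_1k_tokens: float      # illustrative price, not a real quote
    capabilities: frozenset[str]  # e.g. {"chat", "vision", "tools"}
    p50_latency_ms: float         # illustrative observed median latency


def route(
    routes: list[ModelRoute],
    required: set[str],
    max_latency_ms: float | None = None,
) -> ModelRoute:
    """Capability-aware filter, optional latency budget, then cost-aware pick."""
    eligible = [
        r
        for r in routes
        if required <= r.capabilities  # route must support every required capability
        and (max_latency_ms is None or r.p50_latency_ms <= max_latency_ms)
    ]
    if not eligible:
        raise LookupError(f"no route satisfies {sorted(required)}")
    # Among eligible routes, the cheapest one wins.
    return min(eligible, key=lambda r: r.usd_per_1k_tokens)


ROUTES = [
    ModelRoute("small-chat", "provider-a", 0.15, frozenset({"chat"}), 300.0),
    ModelRoute("mid-tools", "provider-b", 0.60, frozenset({"chat", "tools"}), 500.0),
    ModelRoute("large-vision", "provider-c", 2.50, frozenset({"chat", "tools", "vision"}), 900.0),
]

print(route(ROUTES, {"chat"}).name)    # small-chat
print(route(ROUTES, {"tools"}).name)   # mid-tools
print(route(ROUTES, {"vision"}).name)  # large-vision
```

A request that needs vision under a 600 ms budget raises `LookupError`, which is exactly the point where a real gateway would have to choose between relaxing the budget and failing fast. A production router would also feed these fields from live metrics rather than a static table.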
AI gateways went from "clever optimization" to "non-negotiable infrastructure" between 2023 and 2026. This chapter is the practitioner's pattern catalog.
Learning Objectives
- Explain the AI gateway pattern and the operational problems it solves.
- Apply cost-aware, latency-aware, and capability-aware routing strategies.
- Configure LiteLLM, Portkey, or Helicone as a unified-API gateway.
- Architect a multi-provider failover and circuit-breaker pattern.
- Instrument a gateway with cost, latency, and quality observability across providers.
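The failover and circuit-breaker objective can be previewed with a small sketch. It assumes a single-threaded caller, treats each provider as a hypothetical `Callable[[str], str]` standing in for a real SDK call, and uses illustrative thresholds; `CircuitBreaker`, `call_with_fallback`, and `AllProvidersFailed` are hypothetical names for this example, not part of any gateway product's API. The breaker opens after a run of consecutive failures, then, once the cooldown has elapsed, lets a trial request through (the half-open state).

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CircuitBreaker:
    failure_threshold: int = 3    # consecutive failures before the circuit opens
    cooldown_s: float = 30.0      # how long an open circuit rejects traffic
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    failures: int = 0
    opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: once the cooldown has elapsed, allow a trial request.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open after a failed trial)


class AllProvidersFailed(Exception):
    pass


def call_with_fallback(
    chain: list[tuple[str, Callable[[str], str]]],
    breakers: dict[str, CircuitBreaker],
    prompt: str,
) -> tuple[str, str]:
    """Try providers in priority order, skipping any whose circuit is open."""
    errors: dict[str, str] = {}
    for name, call in chain:
        breaker = breakers[name]
        if not breaker.allow():
            errors[name] = "circuit open"
            continue
        try:
            result = call(prompt)
        except Exception as exc:  # a real gateway would classify retryable errors
            breaker.record_failure()
            errors[name] = repr(exc)
            continue
        breaker.record_success()
        return name, result
    raise AllProvidersFailed(errors)
```

The breaker matters during an outage: without it, every request would first wait out the failing provider's timeout before falling back, so the fallback chain alone would keep the product up but make it slow.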
Prerequisites
- Production engineering core from Chapter 62
- LLM APIs from Chapter 11
- Familiarity with at least one API gateway or service-mesh tool
Sections in This Chapter
- 63.1 The Gateway Pattern (Intermediate): The case for an AI gateway layer, the architectural shape of a production gateway (proxy plus Redis plus ledger), the five responsibilities every gateway owns, and the situations where introducing a gateway is the wrong call.
- 63.2 Routing and Reliability (Advanced): LiteLLM Proxy as the open-source reference, multi-provider fallback chains, token-aware rate limiting, multi-region failover with provider affinity, and the commercial landscape: OpenRouter, Portkey, and Cloudflare AI Gateway.
- 63.3 Caching and Cost Management (Advanced): Embedding-based semantic caching, tiered budget enforcement (warn, downgrade, reject), per-user prompt-budget back-pressure, and gateway-level model-version pinning that decouples application code from provider deprecation calendars.
What's Next?
This chapter begins with Section 63.1: The Gateway Pattern (the case for and against), continues into Section 63.2: Routing and Reliability (the mechanics), and concludes with Section 63.3: Caching and Cost Management (the economics). Each section builds on the previous one, so we recommend reading them in order.