AI Gateways & Model Routing

"Behind every good LLM product is a gateway that vendors do not see."

Deploy, Gateway-Guarding AI Agent

Looking Back

Chapter 62 deployed a single model. This chapter handles the rest: AI gateways (Portkey, Kong AI Gateway, LiteLLM Router), model routing, fallback chains, vendor abstraction, cost-aware routing, and the day when one provider has an outage that your product cannot afford to share.

Big Picture

Production LLM deployments need three things from the layer in front of their models: rate limiting, model routing that balances cost against quality, and observability. This chapter covers the gateway pattern, intelligent routing, and the operational surface that AI gateways expose.

Chapter Overview

As LLM applications grow beyond a single model and provider, managing API keys, retry logic, rate limits, cost tracking, and routing becomes unsustainable without a gateway. This chapter teaches the AI gateway pattern: the unified-API layer that abstracts providers, the model-routing strategies (cost-aware, latency-aware, capability-aware), the canonical implementations (LiteLLM, Portkey, Helicone, custom gateways), and the observability story that makes a multi-provider stack governable.
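To make the pattern concrete before the detailed sections, here is a minimal sketch of a unified-API layer that combines capability-aware filtering, cost-aware ordering, and a fallback chain. It is an illustration under stated assumptions, not how LiteLLM, Portkey, or Helicone are implemented: the `Backend`, `Gateway`, and `ProviderError` names, the backend names, and the prices are all hypothetical, and each provider client is assumed to be wrapped as a plain `prompt -> text` callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


class ProviderError(Exception):
    """Hypothetical error a backend raises on outage, rate limit, or timeout."""


@dataclass(frozen=True)
class Backend:
    name: str
    cost_per_1k_tokens: float      # illustrative number, not a real price
    capabilities: frozenset        # e.g. frozenset({"chat", "vision"})
    call: Callable[[str], str]     # provider client wrapped as prompt -> text


class Gateway:
    """Single entry point that hides which provider serves a request."""

    def __init__(self, backends: Iterable[Backend]) -> None:
        self.backends = list(backends)
        self.log = []  # (backend name, "ok" | "error") pairs for observability

    def route(self, required: str) -> list:
        # Capability-aware: keep only backends that can serve the request.
        eligible = [b for b in self.backends if required in b.capabilities]
        # Cost-aware: try the cheapest eligible backend first.
        return sorted(eligible, key=lambda b: b.cost_per_1k_tokens)

    def complete(self, prompt: str, required: str = "chat") -> str:
        # Fallback chain: move to the next backend when one fails.
        for backend in self.route(required):
            try:
                result = backend.call(prompt)
            except ProviderError:
                self.log.append((backend.name, "error"))
                continue
            self.log.append((backend.name, "ok"))
            return result
        raise ProviderError(f"no backend could serve capability {required!r}")


def _simulated_outage(prompt: str) -> str:
    raise ProviderError("simulated outage")


def _large_model(prompt: str) -> str:
    return f"large-model: {prompt}"


gateway = Gateway([
    Backend("large-model", 10.0, frozenset({"chat", "vision"}), _large_model),
    Backend("small-model", 0.5, frozenset({"chat"}), _simulated_outage),
])
answer = gateway.complete("hello")
print(answer)
print(gateway.log)
```

In this sketch the cheaper backend is tried first, fails with the simulated outage, and the request falls through to the more expensive one; the log records both attempts. A latency-aware variant would only need a different sort key in `route`.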

AI gateways went from "clever optimization" to "non-negotiable infrastructure" between 2023 and 2026. This chapter is the practitioner's pattern catalog.

Learning Objectives

Sections in This Chapter

Prerequisites

What's Next?

This chapter begins with Section 63.1: The Gateway Pattern (the case for and against), continues into Section 63.2: Routing and Reliability (the mechanics), and concludes with Section 63.3: Caching and Cost Management (the economics). Each section builds on the previous one, so we recommend reading them in order.

Further Reading

Model Routing & Cascades

Ong, I., Almahairi, A., Wu, V., Chiang, W.-L., Wu, T., Gonzalez, J. E., et al. (2024). "RouteLLM: Learning to Route LLMs with Preference Data." arXiv preprint. arXiv:2406.18665. The reference paper for production model routing: predicts which queries need a strong model and which can use a cheap one, the core economics of any AI gateway.
Chen, L., Zaharia, M., & Zou, J. (2023). "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance." arXiv preprint. arXiv:2305.05176. Introduces LLM cascades and prompt adaptation as cost-optimization primitives; the conceptual basis for the routing policies a gateway implements.

Gateway Patterns & Caching

Bang, F. (2023). "GPTCache: An Open-Source Semantic Cache for LLM Applications." NLP-OSS Workshop at EMNLP. ACL Anthology. Defines the semantic-cache pattern (embedding-keyed cache with similarity threshold) that gateways use to absorb repeat traffic.
Hu, Q. J., Bieker, J., Li, X., Jiang, N., Keigwin, B., Ranganath, G., et al. (2024). "RouterBench: A Benchmark for Multi-LLM Routing Systems." arXiv preprint. arXiv:2403.12031. Provides the evaluation methodology for comparing routing strategies on cost-quality Pareto fronts, which a production gateway must measure continuously.