Section 31.1: Application Architecture & Deployment

The first 90 percent of deployment accounts for the first 90 percent of the budget. The remaining 10 percent accounts for the other 90 percent of the headaches.
Deploy, Budget-Bruised AI Agent

Big Picture

A well-designed LLM application architecture separates model inference, business logic, and API layers so that each can scale, version, and fail independently. Whether you deploy on a dedicated GPU cluster, a managed cloud endpoint, or a serverless function, the same architectural principles apply: stateless request handling, streaming for low perceived latency, health checks for reliability, and containerization for reproducibility. Building on the serving frameworks from Section 09.4, this section covers the full deployment stack from local development with FastAPI through production deployment on AWS, GCP, Azure, and serverless platforms.

Prerequisites

This section builds on the LLM APIs from Section 10.1: API Landscape and Architecture and the agent concepts covered in Section 22.1: Foundations of AI Agents.

Figure 31.1.1: Deploying an LLM application is like a rocket launch: months of preparation on the ground (local dev, testing, evaluation) for the critical moment when the system goes live and must perform reliably under real conditions. (Illustration: a rocket launching from a laptop screen, symbolizing an LLM application moving from local development to production deployment in the cloud.)

1. API Layer with FastAPI

Your model works perfectly in a Jupyter notebook. Now 500 users need to call it simultaneously, and each request takes 10 seconds to generate. If requests are handled strictly one at a time, the 501st user waits more than 80 minutes (500 requests at 10 seconds each). FastAPI addresses this with native asynchronous request handling, automatic OpenAPI documentation, and Pydantic validation, which has made it a popular framework for serving LLM applications in Python. The key design pattern is to separate the API layer from the model inference layer, allowing you to swap models without changing the API contract.

Mental Model: The Restaurant Kitchen

Think of your LLM deployment architecture like a restaurant. The API layer is the front-of-house staff taking orders and delivering plates, while the model inference layer is the kitchen doing the actual cooking. Just as a restaurant can swap chefs or change recipes without retraining waiters, separating your API from your model lets you upgrade, swap, or scale the inference backend independently. The analogy thins when you consider that real kitchens do not need to handle 500 simultaneous orders in under a second, but the principle of decoupled responsibilities holds.

This separation matters even more for LLM applications than for traditional web services. Models change frequently (new versions, different providers, fine-tuned variants), and deployment constraints differ across environments (cloud GPU, serverless, on-premise). If the API contract is stable and well-documented, you can swap GPT-4o for Claude Sonnet or a self-hosted Llama variant without touching a single line of frontend code. This also enables the A/B testing patterns covered in Section 31.4.

Figure 31.1.2: The restaurant kitchen (model inference) and the front-of-house (API layer) communicate through a well-defined service window (API contract), allowing each side to scale and evolve independently. (Illustration: a restaurant split into front-of-house and kitchen, with waiters passing orders through a service window.)

Figure 31.1.3: The three-layer architecture separates API concerns, application logic, and model inference so each layer can evolve independently. The code below puts this into practice.

Basic Chat Completion Endpoint

This snippet creates a FastAPI endpoint that wraps an LLM chat completion call with streaming support.


# FastAPI chat endpoint: returns a JSON completion, or an SSE stream when stream=True
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
from pydantic import BaseModel

app = FastAPI(title="LLM Chat API", version="1.0")
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

class ChatRequest(BaseModel):
    messages: list[dict[str, str]]
    model: str = "gpt-4o-mini"
    temperature: float = 0.7
    stream: bool = False

@app.post("/v1/chat")
async def chat(req: ChatRequest):
    # Streaming requests get an SSE response; everything else gets plain JSON.
    if req.stream:
        return StreamingResponse(
            stream_response(req), media_type="text/event-stream"
        )
    response = await client.chat.completions.create(
        model=req.model,
        messages=req.messages,
        temperature=req.temperature,
    )
    return {"content": response.choices[0].message.content}

async def stream_response(req: ChatRequest):
    stream = await client.chat.completions.create(
        model=req.model, messages=req.messages,
        temperature=req.temperature, stream=True
    )
    async for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield f"data: {json.dumps({'content': delta})}\n\n"
    # Sentinel event so the client knows the stream is complete.
    yield "data: [DONE]\n\n"


Code Fragment 31.1.1: A FastAPI chat endpoint that returns either a complete JSON response or an SSE token stream.

Note

Always use AsyncOpenAI (not the synchronous client) inside FastAPI to avoid blocking the event loop. A synchronous call in an async endpoint will serialize all concurrent requests, destroying throughput under load.
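The effect described in the note can be reproduced without any network calls. The sketch below is an illustration only: `time.sleep` stands in for a synchronous client call and `asyncio.sleep` for an awaited `AsyncOpenAI` call. The handler names and the 50 ms delay are assumptions chosen for the demo.

```python
import asyncio
import time

async def blocking_handler(delay: float) -> None:
    # time.sleep stands in for a synchronous client call: it holds the event loop.
    time.sleep(delay)

async def async_handler(delay: float) -> None:
    # asyncio.sleep stands in for an awaited async client call: it yields control.
    await asyncio.sleep(delay)

async def time_concurrent(handler, n: int, delay: float) -> float:
    """Run n handlers concurrently and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(n)))
    return time.perf_counter() - start

def compare(n: int = 5, delay: float = 0.05) -> tuple[float, float]:
    """Return (blocking_elapsed, async_elapsed) for n simulated requests."""
    blocked = asyncio.run(time_concurrent(blocking_handler, n, delay))
    overlapped = asyncio.run(time_concurrent(async_handler, n, delay))
    return blocked, overlapped

if __name__ == "__main__":
    blocked, overlapped = compare()
    print(f"blocking: {blocked:.2f}s, async: {overlapped:.2f}s")
```

The blocking variant takes roughly `n * delay` because each handler must finish before the next can start, while the async variant takes roughly one `delay` because the waits overlap.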

Fun Fact

Alphabet, Google's parent company, lost roughly $100 billion in market value in a single day in February 2023 after a promotional demo of Bard (now Gemini) gave a factually incorrect answer about the James Webb Space Telescope. A single hallucination, in a single public demo, moved financial markets. This incident crystallized why production safety is not optional; it is existential for AI products.

2. Streaming Protocols: SSE and WebSockets

Streaming is essential for LLM applications because token generation is sequential. Without streaming, users stare at a blank screen for several seconds before seeing any output. Two protocols dominate LLM streaming: Server-Sent Events (SSE) for unidirectional streaming and WebSockets for bidirectional communication. Figure 31.1.4 compares these two protocols side by side. The streaming branch of Code Fragment 31.1.1 above uses the SSE approach.

Figure 31.1.4: SSE provides simpler one-way streaming ideal for chat; WebSockets enable full-duplex communication for interactive applications.
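To make the SSE wire format concrete, here is a minimal client-side sketch that parses the events emitted by Code Fragment 31.1.1. It assumes one `data:` line per event and a JSON payload with a `content` field, which matches that fragment but is narrower than the full SSE specification. The `parse_sse` name is hypothetical.

```python
import json
from typing import Iterable, Iterator

def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the 'content' field of each SSE data event until [DONE]."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # sentinel used by the server to mark end of stream
        yield json.loads(payload)["content"]

# A captured stream as it would appear on the wire.
wire = 'data: {"content": "Hel"}\n\ndata: {"content": "lo"}\n\ndata: [DONE]\n\n'
print("".join(parse_sse(wire.splitlines())))  # prints "Hello"
```

Because each event is plain text over HTTP, a parser like this is all a client needs; a WebSocket client would additionally have to manage the connection handshake and its own message framing.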

LitServe for High-Performance Serving

LitServe is a minimal framework for serving custom models behind an HTTP API. Code Fragment 31.1.2 demonstrates this approach in practice.


# Serve a Hugging Face causal LM with LitServe
# Lifecycle: setup (load once) -> decode_request -> predict -> encode_response
import litserve as ls
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class LLMServingAPI(ls.LitAPI):
    def setup(self, device):
        """Load model once during server startup."""
        self.tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
        self.model = AutoModelForCausalLM.from_pretrained(
            "microsoft/phi-2", torch_dtype=torch.float16
        ).to(device)

    def decode_request(self, request):
        # Extract the prompt from the incoming JSON payload.
        return request["prompt"]

    def predict(self, prompt):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=256)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

    def encode_response(self, output):
        # Wrap the generated text in a JSON-serializable response.
        return {"generated_text": output}

# Launch with: python server.py
if __name__ == "__main__":
    api = LLMServingAPI()
    server = ls.LitServer(api, accelerator="gpu", devices=1)
    server.run(port=8000)

Code Fragment 31.1.2: Deploying a model with LitServe. The setup, decode_request, predict, and encode_response methods separate the model lifecycle from request handling. LitServe also supports request batching and GPU device management, although batching is not configured in this example.

Tip

Always pin your model version in the environment config (e.g., MODEL_NAME=gpt-4o-2024-08-06 rather than gpt-4o). Provider aliases like gpt-4o silently update to newer snapshots, which can change behavior without any code change on your side. Pinning the version turns a surprise regression into a deliberate upgrade you can test first.
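One way to enforce this tip is to reject unpinned aliases at startup. The sketch below assumes that a pinned snapshot name ends in a date such as `-2024-08-06`. That convention fits the example above but not every provider's naming scheme, so treat the regular expression and the `load_model_name` helper as illustrative.

```python
import os
import re
from typing import Mapping, Optional

# Assumed convention: pinned snapshots end in a date suffix like -2024-08-06.
_DATED_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")

def load_model_name(
    env: Optional[Mapping[str, str]] = None,
    default: str = "gpt-4o-2024-08-06",
) -> str:
    """Read MODEL_NAME from the environment and refuse floating aliases."""
    env = os.environ if env is None else env
    name = env.get("MODEL_NAME", default)
    if not _DATED_SUFFIX.search(name):
        raise ValueError(
            f"MODEL_NAME={name!r} looks like a floating alias; pin a dated snapshot"
        )
    return name

print(load_model_name({"MODEL_NAME": "gpt-4o-2024-08-06"}))
```

Failing fast at startup surfaces an unpinned configuration during deployment rather than as a behavior change weeks later.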

3. Containerization with Docker Compose

This snippet defines a Docker Compose configuration for deploying an LLM application with its dependencies.

# docker-compose.yml
version: "3.9"
services:
  llm-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - MODEL_NAME=gpt-4o-mini
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - llm-api

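Code Fragment 31.1.3: A Docker Compose stack with the LLM API service, a Redis cache, and an nginx reverse proxy. The health check and GPU resource reservation keep the stack resilient under load.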
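The Compose health check probes a `/health` route with `curl -f`, which fails on any HTTP status of 400 or above. The sketch below shows one way the logic behind such a route could be structured. The `health_status` function and the shape of its response body are assumptions; in FastAPI you would return the body with the given status code from a `/health` handler.

```python
from typing import Callable, Mapping

def health_status(
    model_loaded: bool,
    dependency_checks: Mapping[str, Callable[[], bool]],
) -> tuple[int, dict]:
    """Return (http_status, body): 200 only if the model and all dependencies are ready."""
    results: dict[str, bool] = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a failing probe counts as unhealthy, not a crash
    healthy = model_loaded and all(results.values())
    body = {
        "status": "ok" if healthy else "unavailable",
        "model_loaded": model_loaded,
        "dependencies": results,
    }
    return (200 if healthy else 503), body

# Hypothetical probes; a real one might ping Redis or verify an API key is set.
status, body = health_status(True, {"redis": lambda: True})
print(status, body["status"])
```

Returning 503 while the model is still loading keeps the orchestrator from routing traffic to a container that cannot yet serve requests.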

4. Cloud Platform Deployment

As shown in Figure 31.1.5, each major cloud provider offers a similar stack of deployment primitives, from fully managed model endpoints down to raw GPU instances.

Figure 31.1.5: Each major cloud provider offers managed model endpoints, container orchestration, serverless functions, and raw GPU compute.

AWS Bedrock Integration

Code Fragment 31.1.4 demonstrates this approach in practice.


# Call an Anthropic Claude model through the AWS Bedrock runtime API
import json

import boto3

def bedrock_chat(prompt: str, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> str:
    """Call a model through AWS Bedrock."""
    client = boto3.client("bedrock-runtime", region_name="us-east-1")

    # Request body in the Anthropic Messages format that Bedrock expects.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
        "temperature": 0.7,
    })

    response = client.invoke_model(modelId=model_id, body=body)
    result = json.loads(response["body"].read())
    return result["content"][0]["text"]

# Example
answer = bedrock_chat("Explain transformer attention in one paragraph.")
print(answer)

Code Fragment 31.1.4: Calling Claude 3 Sonnet through the AWS Bedrock runtime with invoke_model.

5. Serverless Deployment

Serverless Platform Comparison

Platform	Cold Start	GPU Support	Best For	Pricing Model
Modal	~1s (warm containers)	A10G, A100, H100	Custom model inference	Per-second GPU billing
Replicate	~5-15s	A40, A100	Open-source model hosting	Per-second compute
HF Inference Endpoints	~30s (scaling from 0)	T4, A10G, A100	HuggingFace model zoo	Per-hour instance
AWS Lambda	~1-3s	None (CPU only)	Lightweight API proxies	Per-request + duration
Cloud Run	~2-5s	L4 GPUs (preview)	Container-based serving	Per-request + vCPU/s

Modal Deployment Example

Code Fragment 31.1.5 demonstrates this approach in practice.


# Serverless GPU inference on Modal: load the model once per container, then serve calls
import modal

app = modal.App("llm-inference")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "accelerate"
)

# container_idle_timeout: seconds an idle container stays warm before shutting down.
# Note: recent Modal releases rename this parameter to scaledown_window; check your version.
@app.cls(gpu="A10G", image=image, container_idle_timeout=300)
class LLMInference:
    @modal.enter()
    def load_model(self):
        # Runs once when the container starts, not on every request.
        from transformers import pipeline
        self.pipe = pipeline(
            "text-generation",
            model="meta-llama/Llama-3.1-8B-Instruct",
            device_map="auto",
            torch_dtype="auto",
        )

    @modal.method()
    def generate(self, prompt: str, max_tokens: int = 512):
        result = self.pipe(prompt, max_new_tokens=max_tokens)
        return result[0]["generated_text"]

@app.local_entrypoint()
def main():
    llm = LLMInference()
    print(llm.generate.remote("What is attention in transformers?"))

Attention in transformers is a mechanism that allows each position in a sequence to attend to all other positions, computing relevance-weighted representations. It enables the model to capture long-range dependencies and contextual relationships without recurrence. The self-attention operation computes query, key, and value projections for each token and uses scaled dot-product scores to weight the contribution of every other token to the current representation.

Code Fragment 31.1.5: A Modal class that loads Llama 3.1 8B Instruct once per container and exposes a remotely callable generate method.

Library Shortcut: RunPod Serverless for GPU Inference

RunPod (pip install runpod) offers serverless GPU endpoints with per-second billing and support for A100, H100, and consumer-grade GPUs. You define a handler function, package it as a Docker image, and deploy it as an auto-scaling endpoint.

# worker.py: RunPod serverless handler
import runpod
from transformers import pipeline

# Load the model at import time so it is reused across requests.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)

def handler(event):
    prompt = event["input"]["prompt"]
    max_tokens = event["input"].get("max_tokens", 512)
    result = pipe(prompt, max_new_tokens=max_tokens)
    return {"output": result[0]["generated_text"]}

runpod.serverless.start({"handler": handler})
# Deploy: runpodctl deploy --gpu A100 --image my-registry/llm-worker:latest

Code Fragment 31.1.6: A RunPod serverless worker that loads the model once at startup and serves generation requests through a handler function.

Figure 31.1.6: Deployment decision tree based on model ownership and infrastructure requirements.

Warning

Serverless GPU platforms charge per second of GPU time. A misconfigured container_idle_timeout can keep expensive GPUs running idle. Always set explicit timeouts and implement scale-to-zero for development workloads.
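A rough upper bound on the idle cost makes the warning concrete. The sketch below assumes the worst case, where every burst of traffic is followed by a full idle timeout before the container shuts down. The price of $0.0003 per GPU-second and the 200 bursts per day are hypothetical figures, not quotes from any platform.

```python
def idle_gpu_cost(
    idle_timeout_s: float,
    bursts_per_day: int,
    price_per_gpu_second: float,
    days: int = 30,
) -> float:
    """Worst-case spend on idle GPU time: one full timeout after every burst."""
    return idle_timeout_s * bursts_per_day * price_per_gpu_second * days

# Hypothetical inputs: 200 traffic bursts per day at $0.0003 per GPU-second.
for timeout in (300, 60):
    cost = idle_gpu_cost(timeout, bursts_per_day=200, price_per_gpu_second=0.0003)
    print(f"timeout={timeout}s -> up to ${cost:.2f} per month idle")
```

Under these assumptions, cutting the timeout from 300 to 60 seconds reduces the idle bound by a factor of five, at the price of more cold starts.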

Key Insight

The best deployment strategy depends on your traffic pattern, as the decision tree in Figure 31.1.6 illustrates. Steady, high-volume traffic favors dedicated instances (SageMaker, Vertex AI). Bursty or low-volume traffic favors serverless (Modal, Replicate). API-only applications with no custom models should use managed endpoints (OpenAI, Bedrock, Azure OpenAI) to avoid infrastructure management entirely.
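The rule of thumb above can be written as a small decision function. This is a simplification for illustration: the `Traffic` enum and `choose_deployment` are hypothetical names, and a real decision would also weigh latency targets, compliance, and team expertise.

```python
from enum import Enum

class Traffic(Enum):
    STEADY = "steady"
    BURSTY = "bursty"

def choose_deployment(custom_model: bool, traffic: Traffic) -> str:
    """Map model ownership and traffic pattern to a deployment category."""
    if not custom_model:
        # API-only applications: let the provider run the infrastructure.
        return "managed endpoint (OpenAI, Bedrock, Azure OpenAI)"
    if traffic is Traffic.STEADY:
        return "dedicated instances (SageMaker, Vertex AI)"
    return "serverless GPU (Modal, Replicate)"

print(choose_deployment(custom_model=True, traffic=Traffic.BURSTY))
```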

Self-Check

1. Why should you use AsyncOpenAI instead of the synchronous client inside FastAPI endpoints?

Show Answer

Synchronous calls block the event loop, serializing all concurrent requests. AsyncOpenAI allows FastAPI to handle multiple requests concurrently while waiting for API responses, maintaining throughput under load.

2. What is the main advantage of SSE over WebSockets for LLM chat streaming?

Show Answer

SSE is simpler to implement, works over standard HTTP, has built-in automatic reconnection, and is sufficient for the unidirectional server-to-client streaming pattern that LLM chat requires. WebSockets add unnecessary complexity when bidirectional communication is not needed.

3. When would you choose Modal over AWS SageMaker for model deployment?

Show Answer

Modal is better for bursty or development workloads because it offers per-second billing, fast cold starts, and simpler configuration. SageMaker is better for steady production workloads that need auto-scaling policies, A/B testing of model variants, and deep AWS ecosystem integration.

4. What does a health check endpoint verify in a containerized LLM deployment?

Show Answer

A health check verifies that the service is responsive and able to serve requests. For LLM applications, this typically means confirming that the model is loaded in memory, the API server is accepting connections, and any downstream dependencies (databases, API keys) are reachable.

5. Why is Docker Compose useful for local LLM application development?

Show Answer

Docker Compose orchestrates multi-service architectures (API server, Redis cache, nginx reverse proxy, vector database) with a single command. It provides reproducible environments, GPU passthrough, health checks, and dependency ordering, closely mirroring production infrastructure during development.

Real-World Scenario: Migrating a Monolithic LLM App to Layered Architecture

Who: A backend engineering team at a Series B fintech startup

Situation: The team had built an LLM-powered document analysis tool as a single FastAPI endpoint that handled prompt construction, OpenAI API calls, and response formatting in one function.

Problem: When they needed to swap GPT-4o for Claude on certain tasks, every change required modifying the same monolithic handler, causing merge conflicts and regressions.

Dilemma: Refactoring the entire codebase would delay a critical product launch by three weeks, but continuing with the monolith would make multi-model support impossible.

Decision: They split the application into three layers (API routing, business logic, inference) over a single sprint, using an adapter pattern for the inference layer.

How: Each LLM provider got its own adapter class implementing a shared interface. The business logic layer called the interface, not the provider directly. FastAPI routes handled only HTTP concerns.

Result: Adding Claude support took two days instead of the projected two weeks. Deployment failures dropped 60% because each layer could be tested independently.

Lesson: Separating API, logic, and inference layers pays for itself the first time you need to change any one of those three concerns independently.
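The adapter pattern from this scenario can be sketched in a few lines. The adapters below are stand-ins that transform the prompt locally instead of calling a provider SDK, and the names `ChatModel`, `UppercaseAdapter`, `ReverseAdapter`, and `summarize` are hypothetical. The point is that the business logic depends only on the shared interface.

```python
from typing import Protocol

class ChatModel(Protocol):
    """Shared interface that every provider adapter implements."""
    def complete(self, prompt: str) -> str: ...

class UppercaseAdapter:
    """Stand-in for one provider's adapter; a real one would call that provider's SDK."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()

class ReverseAdapter:
    """Stand-in for a second provider's adapter."""
    def complete(self, prompt: str) -> str:
        return prompt[::-1]

def summarize(model: ChatModel, document: str) -> str:
    """Business logic: builds the prompt and calls the interface, never a provider directly."""
    prompt = f"Summarize: {document}"
    return model.complete(prompt).strip()

# Swapping providers is a one-argument change; summarize() is untouched.
print(summarize(UppercaseAdapter(), "quarterly report"))
print(summarize(ReverseAdapter(), "quarterly report"))
```

Because `summarize` accepts anything with a `complete` method, each adapter can be unit-tested in isolation and the business logic can be tested against a fake.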

Tip: Use Blue-Green Deployments for Model Updates

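When deploying a new model version, keep the old version (blue) running alongside the new one (green). Route a small percentage of traffic to the new version first (a canary step), compare quality, latency, and cost metrics, and switch fully only once you have verified quality. Because the old version is still running, rolling back is a routing change rather than a redeploy. This prevents bad model updates from affecting all users.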
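Routing a small percentage of traffic to the new version can be done deterministically by hashing a stable user identifier, so that each user consistently sees the same version. This is a minimal sketch; the `route` function and the 100-bucket scheme are assumptions, and production systems often do this at the load balancer instead.

```python
import hashlib

def route(user_id: str, green_percent: int) -> str:
    """Deterministically send about green_percent% of users to the new (green) version."""
    if not 0 <= green_percent <= 100:
        raise ValueError("green_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in [0, 99]
    return "green" if bucket < green_percent else "blue"

users = [f"user-{i}" for i in range(10_000)]
green_share = sum(route(u, 5) == "green" for u in users) / len(users)
print(f"share routed to green at 5%: {green_share:.1%}")
```

Because a user's bucket never changes, raising the percentage only moves additional users to green; nobody who already saw the new version is sent back to the old one.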

Key Takeaways

Separate API, business logic, and inference layers so each can scale and version independently.
Use SSE for standard chat streaming; reserve WebSockets for bidirectional use cases like real-time collaboration or voice.
LitServe provides a minimal, high-performance framework for serving custom models with batching and GPU management.
AWS Bedrock, GCP Vertex AI, and Azure OpenAI offer managed model endpoints that eliminate infrastructure management for API-based models.
Serverless platforms (Modal, Replicate, HF Inference Endpoints) are ideal for bursty workloads, while dedicated endpoints suit steady production traffic.
Containerize with Docker Compose for local development, then deploy to cloud container orchestrators (ECS, GKE, ACA) for production.

With the backend architecture in place, the next step is building the user interface that connects to it. Section 31.2 covers frontend frameworks from quick Python demos to production React applications.

Research Frontier

Open Questions:

What is the optimal architecture for serving LLM applications that need to support both real-time and batch workloads? Current patterns (synchronous APIs, streaming, batch queues) each have trade-offs that are not well characterized.
How should LLM applications handle graceful degradation when the underlying model provider experiences outages or latency spikes?

Recent Developments (2024-2025):

LLM gateway and router tools like LiteLLM, Portkey, and AI Gateway (2024-2025) matured to support automatic failover between model providers, load balancing, and cost-based routing across multiple LLM APIs.

Explore Further: Deploy an LLM application behind an LLM gateway (like LiteLLM) configured with two model providers. Simulate a provider outage and measure how quickly failover occurs and whether response quality degrades.
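Before wiring up a real gateway, it can help to see the failover idea in isolation. The sketch below is a hand-rolled stand-in, not LiteLLM's or any gateway's API: it tries providers in order and moves on when one raises an error or exceeds a timeout. The provider functions are simulated.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Provider = Callable[[str], Awaitable[str]]

async def complete_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    timeout_s: float = 10.0,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, text) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            text = await asyncio.wait_for(call(prompt), timeout=timeout_s)
            return name, text
        except Exception as exc:  # timeouts and provider errors both trigger failover
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Simulated providers standing in for real client calls.
async def flaky_primary(prompt: str) -> str:
    raise ConnectionError("simulated outage")

async def steady_backup(prompt: str) -> str:
    return f"backup answer to: {prompt}"

if __name__ == "__main__":
    chain = [("primary", flaky_primary), ("backup", steady_backup)]
    print(asyncio.run(complete_with_failover("ping", chain)))
```

Measuring failover time in the exercise then amounts to timing this call while the primary is failing. The timeout value dominates the result when the outage shows up as slowness rather than as an immediate error.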

Exercises

Exercise 31.1.1: API Design Choices Conceptual

Explain why FastAPI with async handlers is preferred over Flask for serving LLM applications. What specific characteristics of LLM workloads make asynchronous handling important?

Answer Sketch

LLM inference takes seconds per request (not milliseconds like typical web requests). Synchronous Flask blocks the worker thread during the entire LLM call, meaning one worker handles one request at a time. FastAPI's async handlers release the thread during the I/O wait (the LLM API call), allowing the same worker to handle many concurrent requests. This is critical when each request takes 2-10 seconds and you need to serve hundreds of concurrent users.

Exercise 31.1.2: Streaming Architecture Coding

Write a FastAPI endpoint that streams LLM responses using Server-Sent Events (SSE). The endpoint should accept a prompt, call an LLM API with streaming enabled, and forward each token to the client as it arrives.

Answer Sketch

Use StreamingResponse from fastapi.responses with media_type="text/event-stream". Create an async generator that calls the LLM API with stream=True, iterates over chunks, and yields f"data: {chunk}\n\n" for each token. Include a final data: [DONE] event. Handle client disconnection by catching asyncio.CancelledError, releasing any upstream resources, and re-raising. This reduces perceived latency dramatically because users see the first token as soon as the model produces it (often within a few hundred milliseconds) rather than waiting for the full generation.
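The generator at the heart of this answer can be tested without a web server or an LLM. The sketch below separates SSE formatting from the token source; `fake_tokens` is a hypothetical stand-in for a streaming LLM call, and in FastAPI the `sse_events` generator would be passed to `StreamingResponse`.

```python
import asyncio
import json
from typing import AsyncIterator

async def sse_events(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Format a token stream as SSE events, ending with a [DONE] sentinel."""
    try:
        async for token in tokens:
            yield f"data: {json.dumps({'content': token})}\n\n"
        yield "data: [DONE]\n\n"
    except asyncio.CancelledError:
        # Raised when the serving task is cancelled, e.g. on client disconnect.
        # Release upstream resources here, then re-raise so cancellation proceeds.
        raise

async def fake_tokens() -> AsyncIterator[str]:
    # Stand-in for iterating over a streaming LLM response.
    for token in ["Hel", "lo"]:
        await asyncio.sleep(0)
        yield token

async def collect() -> list[str]:
    return [event async for event in sse_events(fake_tokens())]

if __name__ == "__main__":
    for event in asyncio.run(collect()):
        print(repr(event))
```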

Exercise 31.1.3: Containerization Conceptual

Describe a Dockerfile for an LLM application that includes the API server, model dependencies, and a health check endpoint. What considerations are specific to LLM applications (compared to standard web services)?

Answer Sketch

LLM-specific considerations: (1) image size is large due to PyTorch and model weights (use multi-stage builds to reduce size), (2) GPU support requires a CUDA-enabled base image in the container, plus NVIDIA drivers and the NVIDIA Container Toolkit on the host, (3) startup time is long due to model loading (use a readiness probe separate from the liveness probe), (4) memory requirements are high (set appropriate resource limits). The health check should verify that the model is loaded and can generate a response, not just that the HTTP server is running.

Exercise 31.1.4: Cloud Deployment Comparison Analysis

Compare deploying an LLM application on AWS (ECS/EKS with GPU instances), GCP (Cloud Run with GPU), and a serverless platform (Lambda/Cloud Functions). For each, identify the key limitation for LLM workloads.

Answer Sketch

AWS ECS/EKS: full control, GPU support via p4/p5 instances. Limitation: complex setup; EKS in particular requires Kubernetes expertise. GCP Cloud Run with GPU: simpler autoscaling. Limitation: cold start times when scaling from zero can be 30+ seconds due to model loading. Serverless (Lambda): no GPU support, 15-minute timeout, limited memory. Limitation: unsuitable for self-hosted models; only viable for API-proxy patterns where the LLM runs elsewhere. Best choice depends on whether you self-host models or call external APIs.

Exercise 31.1.5: Blue-Green Deployment Discussion

Describe how to implement a blue-green deployment strategy for an LLM application where you are upgrading the model version. What additional verification steps are needed beyond standard web service deployments?

Answer Sketch

Standard blue-green: deploy the new version alongside the old, switch traffic after health checks pass. LLM-specific additions: (1) run the canary test suite against the green (new) deployment before switching traffic, (2) compare evaluation scores between blue and green on a shared test set, (3) route a small percentage of traffic (5%) to green first and monitor quality metrics, (4) verify that response latency and token costs are within acceptable ranges, (5) maintain the ability to roll back instantly if quality degrades after full cutover. The key difference is that LLM deployments need quality verification, not just health checks.
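The verification steps in this answer can be collected into an explicit promotion gate. In the sketch below, the `Metrics` fields and every threshold (a 0.01 allowed score drop, a 1.2x latency and cost ceiling) are hypothetical and would need tuning for a real application.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    eval_score: float            # score on the shared test set, higher is better
    p95_latency_s: float         # 95th percentile response latency in seconds
    cost_per_1k_requests: float  # token cost per 1,000 requests

def should_promote(
    blue: Metrics,
    green: Metrics,
    max_score_drop: float = 0.01,
    max_latency_ratio: float = 1.2,
    max_cost_ratio: float = 1.2,
) -> tuple[bool, list[str]]:
    """Return (promote, reasons); reasons lists every check that green failed."""
    reasons: list[str] = []
    if green.eval_score < blue.eval_score - max_score_drop:
        reasons.append("eval score regressed")
    if green.p95_latency_s > blue.p95_latency_s * max_latency_ratio:
        reasons.append("latency above threshold")
    if green.cost_per_1k_requests > blue.cost_per_1k_requests * max_cost_ratio:
        reasons.append("cost above threshold")
    return (not reasons, reasons)

blue = Metrics(eval_score=0.82, p95_latency_s=4.0, cost_per_1k_requests=3.0)
green = Metrics(eval_score=0.85, p95_latency_s=5.5, cost_per_1k_requests=3.1)
print(should_promote(blue, green))
```

Returning the list of failed checks, not just a boolean, gives the on-call engineer a reason for a blocked rollout without digging through dashboards.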

What Comes Next

In the next section, Section 31.2: Frontend & User Interfaces, we explore frontend and user interface design for LLM applications, covering streaming, chat UIs, and interaction patterns.