Text Generation Inference (TGI) is HuggingFace's production-grade inference server, built in Rust and Python for maximum throughput and reliability. TGI provides Docker-first deployment, built-in quantization support, token streaming, and a battle-tested router that handles concurrent requests efficiently. If you are already invested in the HuggingFace ecosystem, TGI offers the shortest path from a model on the Hub to a production API endpoint.
For a comparative analysis of TGI against vLLM, SGLang, TensorRT-LLM, and other frameworks (including benchmarking methodology and decision criteria), see Section 9.4: Serving Infrastructure. This section focuses on TGI-specific deployment recipes and configuration.
1. What Is TGI and When Should You Use It?
Text Generation Inference (TGI) is an open-source serving framework developed by HuggingFace, purpose-built for deploying large language models. Unlike general-purpose model servers such as Triton or TorchServe, TGI is specialized for autoregressive text generation. Its architecture consists of two main components: a Rust-based router that handles HTTP connections, request queuing, and token streaming, and a Python model server that runs the actual inference on GPU using custom CUDA kernels.
TGI is an excellent choice when you need a production-ready serving solution with minimal configuration, when your models are hosted on HuggingFace Hub, or when you want built-in support for features like watermarking, grammar-constrained generation, and speculative decoding. It powers HuggingFace's own Inference Endpoints service, which means it has been tested at significant scale.
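As a taste of those built-in features, grammar-constrained generation can be requested per call through the `grammar` field of the generate parameters. The sketch below is illustrative: it assumes a recent TGI version with grammar support enabled, a server on localhost:8080, and a hypothetical extraction schema.

```python
import requests

# Grammar-constrained request: force the output to match a JSON schema
# (hypothetical schema; requires a TGI version with grammar support).
payload = {
    "inputs": "Extract the city and country: 'Paris is the capital of France.'",
    "parameters": {
        "max_new_tokens": 64,
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

try:
    resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
    print(resp.json()["generated_text"])
except requests.RequestException:
    print("TGI server not reachable on localhost:8080")
```

Because decoding is constrained at the token level, the response is guaranteed to parse as JSON matching the schema, which removes a whole class of output-validation retries from client code.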
TGI's Rust router is a critical architectural decision. By handling all I/O, connection management, and request scheduling in Rust, TGI avoids Python's GIL bottleneck for the networking layer. The Python process only handles the GPU-bound inference work, where the GIL is released during CUDA kernel execution anyway.
2. Docker Deployment
The recommended way to deploy TGI is through its official Docker image. This bundles all CUDA dependencies, the Rust router, and the Python model server into a single container. The following command pulls and runs TGI with a Llama model.
```bash
# Pull and run TGI with a Llama 3.1 8B model
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-tokens 2048 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096
```
Let us break down the key parts of this command. The --gpus all flag exposes all host GPUs to the container. The --shm-size 1g flag increases the container's shared memory, which NCCL uses for inter-process communication when the model is sharded across multiple GPUs. The -v /data:/data mount provides persistent storage for downloaded model weights so they survive container restarts.
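On first start, downloading and loading the weights can take several minutes, so scripts that depend on the server should poll the /health endpoint rather than sleeping for a fixed interval. A minimal readiness poller (assuming the port mapping above):

```python
import time
import requests

def wait_until_ready(base_url="http://localhost:8080", timeout_s=600, poll_s=5.0):
    """Poll TGI's /health endpoint until the model is loaded (HTTP 200) or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=10).status_code == 200:
                return True
        except requests.RequestException:
            pass  # container still starting or weights still downloading
        time.sleep(poll_s)
    return False
```

A CI pipeline or integration test can call `wait_until_ready()` right after `docker run` and fail fast with a clear message if the server never comes up.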
2.1 Docker Compose for Persistent Deployments
For production deployments, a Docker Compose file provides a declarative and reproducible setup. The following configuration defines a TGI service with health checks and automatic restarts.
```yaml
# docker-compose.yml
version: "3.9"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    ports:
      - "8080:80"
    volumes:
      - model-cache:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "1g"
    command: >
      --model-id meta-llama/Llama-3.1-8B-Instruct
      --max-input-tokens 2048
      --max-total-tokens 4096
      --max-concurrent-requests 128
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
volumes:
  model-cache:
```
3. Environment Variables and Configuration
TGI accepts configuration through both command-line arguments and environment variables: most CLI flags can alternatively be set as uppercase environment variables (for example, MODEL_ID for --model-id), alongside HuggingFace-standard names such as HUGGING_FACE_HUB_TOKEN. The table below lists the most important configuration options.
| CLI Flag / Env Variable | Default | Description |
|---|---|---|
| --model-id | (required) | HuggingFace model ID or local path |
| --quantize | None | Quantization method: awq, gptq, bitsandbytes, eetq |
| --max-input-tokens | 1024 | Maximum allowed input prompt length |
| --max-total-tokens | 2048 | Maximum combined input + output token count |
| --max-batch-prefill-tokens | 4096 | Maximum tokens processed in a single prefill step |
| --max-concurrent-requests | 128 | Maximum number of simultaneous requests the router accepts |
| --num-shard | 1 | Number of GPU shards for tensor parallelism |
| HUGGING_FACE_HUB_TOKEN | None | Token for accessing gated models on HuggingFace Hub |
The --max-batch-prefill-tokens parameter directly affects GPU memory usage during the prompt processing phase. Setting it too high can cause out-of-memory errors, especially on GPUs with less than 40 GB of VRAM. Start with 4096 and increase gradually while monitoring memory usage.
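To build intuition for why this parameter is memory-sensitive, a back-of-the-envelope KV-cache calculation helps. The figures below use Llama-3.1-8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with fp16 caches, and deliberately ignore activation memory, so treat them as a lower bound.

```python
def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-token KV-cache size: one K and one V vector for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()  # 131072 B = 128 KiB per token
prefill = 4096 * per_token              # cache written by one full prefill batch
print(f"{per_token / 2**10:.0f} KiB/token -> {prefill / 2**20:.0f} MiB for 4096 prefill tokens")
# -> 128 KiB/token -> 512 MiB for 4096 prefill tokens
```

Half a gigabyte per maximal prefill batch is modest on an 80 GB card but significant on a 24 GB one, which is why the advice above is to start at 4096 and raise the limit only while watching memory headroom.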
4. Quantization Options
TGI supports several quantization backends that reduce model memory footprint and often increase throughput. You select the quantization method at startup; TGI then loads the model weights in the specified format. The model must have been pre-quantized in the chosen format (with the exception of bitsandbytes, which quantizes on the fly).
```bash
# Run with AWQ quantization (model must be pre-quantized)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-13B-Chat-AWQ \
    --quantize awq

# Run with bitsandbytes 4-bit quantization (quantizes on the fly)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --quantize bitsandbytes-nf4
```
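To see why 4-bit quantization matters on, say, a 24 GB consumer GPU, a rough weight-memory estimate is enough. This is illustrative arithmetic only: it ignores quantization scales and zero-points, the KV cache, and activations, all of which add real overhead on top.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage only; ignores scales, KV cache, and activations."""
    return n_params * bits_per_weight / 8 / 2**30

n_params = 8e9  # order of magnitude for an 8B-parameter model
for label, bits in [("fp16", 16), ("int8 (eetq)", 8), ("4-bit (awq / nf4)", 4)]:
    print(f"{label:18s} ~{weight_memory_gib(n_params, bits):4.1f} GiB")
```

At fp16 an 8B model's weights alone (~15 GiB) leave little room for the KV cache on a 24 GB card, while a 4-bit variant (~3.7 GiB) leaves most of the VRAM free for batching.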
5. Making Requests: Streaming and Non-Streaming
TGI exposes two primary HTTP endpoints: /generate for standard request/response and
/generate_stream for server-sent events (SSE) streaming. It also provides an
OpenAI-compatible endpoint at /v1/chat/completions. The following Python examples
demonstrate both modes.
```python
import requests

TGI_URL = "http://localhost:8080"

# Non-streaming request
response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9,
            "do_sample": True,
        },
    },
)
result = response.json()
print(result["generated_text"])
```
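The OpenAI-compatible endpoint follows the same pattern with the Chat Completions request shape, which lets existing OpenAI client code point at TGI with only a base-URL change. A sketch (the "model" value is a placeholder, since a TGI instance serves a single model):

```python
import requests

TGI_URL = "http://localhost:8080"

def chat_payload(user_msg, max_tokens=100, temperature=0.7):
    """Build an OpenAI-style request body for TGI's /v1/chat/completions."""
    return {
        "model": "tgi",  # accepted for schema compatibility; not used for routing
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

try:
    resp = requests.post(
        f"{TGI_URL}/v1/chat/completions",
        json=chat_payload("What is the capital of France?"),
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
except requests.RequestException:
    print("TGI server not reachable at", TGI_URL)
```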
For applications that benefit from displaying tokens as they are generated (chatbots, for example), streaming provides a much better user experience. TGI uses server-sent events for its streaming endpoint.
```python
import requests
import json

# Streaming request using server-sent events
response = requests.post(
    f"{TGI_URL}/generate_stream",
    json={
        "inputs": "Explain gradient descent in simple terms.",
        "parameters": {
            "max_new_tokens": 200,
            "temperature": 0.5,
        },
    },
    stream=True,
)
for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data:"):
            token_data = json.loads(decoded[len("data:"):])
            # Every event carries one token; the final event additionally
            # carries "generated_text" and "details". Skip special tokens
            # (such as EOS) but keep printing through the final event so
            # the last generated token is not lost.
            if not token_data["token"].get("special"):
                print(token_data["token"]["text"], end="", flush=True)
print()  # Final newline
```
6. Health Checks and Monitoring
TGI provides health and metrics endpoints that integrate with standard monitoring stacks. The
/health endpoint returns HTTP 200 when the model is loaded and ready to serve. The
/metrics endpoint exposes Prometheus-format metrics for throughput, latency, and queue
depth.
```bash
# Check if TGI is healthy
curl http://localhost:8080/health

# Fetch Prometheus metrics
curl http://localhost:8080/metrics

# Key metrics to monitor:
#   tgi_request_duration_seconds - End-to-end request latency
#   tgi_request_count            - Total requests processed
#   tgi_queue_size               - Current queue depth
#   tgi_batch_current_size       - Active batch size
```
Set up a Grafana dashboard that tracks tgi_queue_size and tgi_batch_current_size over time. When the queue depth stays persistently high and approaches your --max-concurrent-requests limit (beyond which the router starts rejecting requests), it is time to scale horizontally by adding more TGI replicas behind a load balancer.
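For lightweight automation outside Grafana, the same metrics can be scraped directly. The helper below is a sketch that parses the Prometheus text format for the gauge names listed above; it assumes the default port mapping.

```python
import re
import requests

def scrape_metric(metrics_text, name):
    """Return the first value for a metric in Prometheus text format, or None."""
    pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([-+0-9.eE]+)\s*$"
    m = re.search(pattern, metrics_text, re.MULTILINE)
    return float(m.group(1)) if m else None

def queue_depth(base_url="http://localhost:8080"):
    """Fetch /metrics and return the router queue depth (None if unavailable)."""
    try:
        body = requests.get(f"{base_url}/metrics", timeout=10).text
    except requests.RequestException:
        return None
    return scrape_metric(body, "tgi_queue_size")
```

A cron job or autoscaler hook can call `queue_depth()` and trigger a scale-out when the value stays above a threshold across several samples.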
7. Router Configuration and Batching Behavior
The Rust router is the entry point for all requests. It maintains a priority queue, groups requests into batches for the model server, and handles backpressure when the system is overloaded. Several parameters control how aggressively the router batches requests.
| Parameter | Effect |
|---|---|
| --max-waiting-tokens | Number of decode tokens the running batch may generate before waiting requests are forced into it; lower values reduce queue latency, higher values favor throughput |
| --waiting-served-ratio | Ratio of waiting to running requests that triggers admitting waiting requests into the batch; controls how aggressively new work is admitted |
| --max-batch-size | Hard limit on batch size; useful for controlling memory usage on smaller GPUs |
The router also supports request prioritization and token budgets. When a request exceeds the
--max-input-tokens limit, TGI returns an HTTP 422 error immediately rather than
attempting to process and fail partway through. This fail-fast behavior prevents wasted GPU cycles.
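Client code should treat that 422 as a terminal validation error rather than something to retry. A minimal sketch (function name and error handling are illustrative, not part of TGI's client libraries):

```python
import requests

def generate_or_fail_fast(prompt, max_new_tokens=100, base_url="http://localhost:8080"):
    """Call /generate, surfacing the router's fail-fast 422 as a clear client error."""
    resp = requests.post(
        f"{base_url}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    if resp.status_code == 422:
        # The router rejected the request before any GPU work was done,
        # e.g. the prompt exceeds --max-input-tokens. Retrying the same
        # payload will always fail; truncate the prompt instead.
        raise ValueError(f"Rejected by TGI router: {resp.text}")
    resp.raise_for_status()
    return resp.json()["generated_text"]
```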
Summary
TGI provides a Docker-native, production-ready serving solution for LLMs with a clean separation between its Rust networking layer and Python inference layer. Its built-in quantization support, streaming endpoints, health checks, and Prometheus metrics make it well-suited for deployment behind container orchestration systems like Kubernetes. In the next section, we explore SGLang, which takes a fundamentally different approach by providing a programming language for structured LLM interactions.