Text Generation Inference (TGI) is HuggingFace's production-grade inference server, built in Rust and Python for maximum throughput and reliability. TGI provides Docker-first deployment, built-in quantization support, token streaming, and a battle-tested router that handles concurrent requests efficiently. If you are already invested in the HuggingFace ecosystem, TGI offers the shortest path from a model on the Hub to a production API endpoint.
For a comparative analysis of TGI against vLLM, SGLang, TensorRT-LLM, and other frameworks (including benchmarking methodology and decision criteria), see Section 9.4: Serving Infrastructure. This section focuses on TGI-specific deployment recipes and configuration.
1. What Is TGI and When Should You Use It?
Text Generation Inference (TGI) is an open-source serving framework developed by HuggingFace, purpose-built for deploying large language models. Unlike general-purpose model servers such as Triton or TorchServe, TGI is specialized for autoregressive text generation. Its architecture consists of two main components: a Rust-based router that handles HTTP connections, request queuing, and token streaming, and a Python model server that runs the actual inference on GPU using custom CUDA kernels.
TGI is an excellent choice when you need a production-ready serving solution with minimal configuration, when your models are hosted on HuggingFace Hub, or when you want built-in support for features like watermarking, grammar-constrained generation, and speculative decoding. It powers HuggingFace's own Inference Endpoints service, which means it has been tested at significant scale.
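As a taste of those built-in features, grammar-constrained generation can be requested per call through the `grammar` field of the generate parameters. The sketch below is illustrative: it assumes a recent TGI version with grammar support enabled, a server on localhost:8080, and a hypothetical extraction schema.

```python
import requests

# Grammar-constrained request: force the output to match a JSON schema
# (hypothetical schema; requires a TGI version with grammar support).
payload = {
    "inputs": "Extract the city and country: 'Paris is the capital of France.'",
    "parameters": {
        "max_new_tokens": 64,
        "grammar": {
            "type": "json",
            "value": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "country": {"type": "string"},
                },
                "required": ["city", "country"],
            },
        },
    },
}

try:
    resp = requests.post("http://localhost:8080/generate", json=payload, timeout=120)
    print(resp.json()["generated_text"])
except requests.RequestException:
    print("TGI server not reachable on localhost:8080")
```

Because decoding is constrained at the token level, the response is guaranteed to parse as JSON matching the schema, which removes a whole class of output-validation retries from client code.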
TGI's Rust router is a critical architectural decision. By handling all I/O, connection management, and request scheduling in Rust, TGI avoids Python's GIL bottleneck for the networking layer. The Python process only handles the GPU-bound inference work, where the GIL is released during CUDA kernel execution anyway.
2. Docker Deployment
The recommended way to deploy TGI is through its official Docker image. This bundles all CUDA dependencies, the Rust router, and the Python model server into a single container. The following command pulls and runs TGI with a Llama model.
```bash
# Pull and run TGI with a Llama 3.1 8B model
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --max-input-tokens 2048 \
    --max-total-tokens 4096 \
    --max-batch-prefill-tokens 4096
```
Let us break down the key parts of this command. The --gpus all flag exposes all host GPUs to the container. The --shm-size 1g flag increases the container's shared memory, which NCCL uses for inter-process communication when the model is sharded across multiple GPUs. The -v /data:/data mount provides persistent storage for downloaded model weights so they survive container restarts.
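On first start, downloading and loading the weights can take several minutes, so scripts that depend on the server should poll the /health endpoint rather than sleeping for a fixed interval. A minimal readiness poller (assuming the port mapping above):

```python
import time
import requests

def wait_until_ready(base_url="http://localhost:8080", timeout_s=600, poll_s=5.0):
    """Poll TGI's /health endpoint until the model is loaded (HTTP 200) or we time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=10).status_code == 200:
                return True
        except requests.RequestException:
            pass  # container still starting or weights still downloading
        time.sleep(poll_s)
    return False
```

A CI pipeline or integration test can call `wait_until_ready()` right after `docker run` and fail fast with a clear message if the server never comes up.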
2.1 Docker Compose for Persistent Deployments
For production deployments, a Docker Compose file provides a declarative and reproducible setup. The following configuration defines a TGI service with health checks and automatic restarts.
```yaml
# docker-compose.yml
version: "3.9"
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    ports:
      - "8080:80"
    volumes:
      - model-cache:/data
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "1g"
    command: >
      --model-id meta-llama/Llama-3.1-8B-Instruct
      --max-input-tokens 2048
      --max-total-tokens 4096
      --max-concurrent-requests 128
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
volumes:
  model-cache:
```
3. Environment Variables and Configuration
TGI accepts configuration through both command-line arguments and environment variables: most CLI flags can alternatively be set as uppercase environment variables (for example, MODEL_ID for --model-id), alongside HuggingFace-standard names such as HUGGING_FACE_HUB_TOKEN. The table below lists the most important configuration options.
| CLI Flag / Env Variable | Default | Description |
|---|---|---|
| --model-id | (required) | HuggingFace model ID or local path |
| --quantize | None | Quantization method: awq, gptq, bitsandbytes, eetq |
| --max-input-tokens | 1024 | Maximum allowed input prompt length |
| --max-total-tokens | 2048 | Maximum combined input + output token count |
| --max-batch-prefill-tokens | 4096 | Maximum tokens processed in a single prefill step |
| --max-concurrent-requests | 128 | Maximum number of simultaneous requests the router accepts |
| --num-shard | 1 | Number of GPU shards for tensor parallelism |
| HUGGING_FACE_HUB_TOKEN | None | Token for accessing gated models on HuggingFace Hub |
The --max-batch-prefill-tokens parameter directly affects GPU memory usage during the prompt processing phase. Setting it too high can cause out-of-memory errors, especially on GPUs with less than 40 GB of VRAM. Start with 4096 and increase gradually while monitoring memory usage.
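To build intuition for why this parameter is memory-sensitive, a back-of-the-envelope KV-cache calculation helps. The figures below use Llama-3.1-8B's published shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with fp16 caches, and deliberately ignore activation memory, so treat them as a lower bound.

```python
def kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-token KV-cache size: one K and one V vector for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()  # 131072 B = 128 KiB per token
prefill = 4096 * per_token              # cache written by one full prefill batch
print(f"{per_token / 2**10:.0f} KiB/token -> {prefill / 2**20:.0f} MiB for 4096 prefill tokens")
# -> 128 KiB/token -> 512 MiB for 4096 prefill tokens
```

Half a gigabyte per maximal prefill batch is modest on an 80 GB card but significant on a 24 GB one, which is why the advice above is to start at 4096 and raise the limit only while watching memory headroom.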
4. Quantization Options
TGI supports several quantization backends that reduce model memory footprint and often increase throughput. You select the quantization method at startup; TGI then loads the model weights in the specified format. The model must have been pre-quantized in the chosen format (with the exception of bitsandbytes, which quantizes on the fly).
```bash
# Run with AWQ quantization (model must be pre-quantized)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-13B-Chat-AWQ \
    --quantize awq

# Run with bitsandbytes 4-bit quantization (quantizes on the fly)
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v /data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-3.1-8B-Instruct \
    --quantize bitsandbytes-nf4
```
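To see why 4-bit quantization matters on, say, a 24 GB consumer GPU, a rough weight-memory estimate is enough. This is illustrative arithmetic only: it ignores quantization scales and zero-points, the KV cache, and activations, all of which add real overhead on top.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Approximate weight storage only; ignores scales, KV cache, and activations."""
    return n_params * bits_per_weight / 8 / 2**30

n_params = 8e9  # order of magnitude for an 8B-parameter model
for label, bits in [("fp16", 16), ("int8 (eetq)", 8), ("4-bit (awq / nf4)", 4)]:
    print(f"{label:18s} ~{weight_memory_gib(n_params, bits):4.1f} GiB")
```

At fp16 an 8B model's weights alone (~15 GiB) leave little room for the KV cache on a 24 GB card, while a 4-bit variant (~3.7 GiB) leaves most of the VRAM free for batching.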
5. Making Requests: Streaming and Non-Streaming
TGI exposes two primary HTTP endpoints: /generate for standard request/response and
/generate_stream for server-sent events (SSE) streaming. It also provides an
OpenAI-compatible endpoint at /v1/chat/completions. The following Python examples
demonstrate both modes.
```python
import requests

TGI_URL = "http://localhost:8080"

# Non-streaming request
response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is the capital of France?",
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9,
            "do_sample": True,
        },
    },
)
result = response.json()
print(result["generated_text"])
```
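The OpenAI-compatible endpoint follows the same pattern with the Chat Completions request shape, which lets existing OpenAI client code point at TGI with only a base-URL change. A sketch (the "model" value is a placeholder, since a TGI instance serves a single model):

```python
import requests

TGI_URL = "http://localhost:8080"

def chat_payload(user_msg, max_tokens=100, temperature=0.7):
    """Build an OpenAI-style request body for TGI's /v1/chat/completions."""
    return {
        "model": "tgi",  # accepted for schema compatibility; not used for routing
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

try:
    resp = requests.post(
        f"{TGI_URL}/v1/chat/completions",
        json=chat_payload("What is the capital of France?"),
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
except requests.RequestException:
    print("TGI server not reachable at", TGI_URL)
```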
For applications that benefit from displaying tokens as they are generated (chatbots, for example), streaming provides a much better user experience. TGI uses server-sent events for its streaming endpoint.
```python
import requests
import json

# Streaming request using server-sent events
response = requests.post(
    f"{TGI_URL}/generate_stream",
    json={
        "inputs": "Explain gradient descent in simple terms.",
        "parameters": {
            "max_new_tokens": 200,
            "temperature": 0.5,
        },
    },
    stream=True,
)
for line in response.iter_lines():
    if line:
        decoded = line.decode("utf-8")
        if decoded.startswith("data:"):
            token_data = json.loads(decoded[len("data:"):])
            # Every event carries one token; the final event additionally
            # carries "generated_text" and "details". Skip special tokens
            # (such as EOS) but keep printing through the final event so
            # the last generated token is not lost.
            if not token_data["token"].get("special"):
                print(token_data["token"]["text"], end="", flush=True)
print()  # Final newline
```
6. Health Checks and Monitoring
TGI provides health and metrics endpoints that integrate with standard monitoring stacks. The
/health endpoint returns HTTP 200 when the model is loaded and ready to serve. The
/metrics endpoint exposes Prometheus-format metrics for throughput, latency, and queue
depth.
```bash
# Check if TGI is healthy
curl http://localhost:8080/health

# Fetch Prometheus metrics
curl http://localhost:8080/metrics

# Key metrics to monitor:
#   tgi_request_duration_seconds - End-to-end request latency
#   tgi_request_count            - Total requests processed
#   tgi_queue_size               - Current queue depth
#   tgi_batch_current_size       - Active batch size
```
Set up a Grafana dashboard that tracks tgi_queue_size and tgi_batch_current_size over time. When the queue depth stays persistently high and approaches your --max-concurrent-requests limit (beyond which the router starts rejecting requests), it is time to scale horizontally by adding more TGI replicas behind a load balancer.
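For lightweight automation outside Grafana, the same metrics can be scraped directly. The helper below is a sketch that parses the Prometheus text format for the gauge names listed above; it assumes the default port mapping.

```python
import re
import requests

def scrape_metric(metrics_text, name):
    """Return the first value for a metric in Prometheus text format, or None."""
    pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([-+0-9.eE]+)\s*$"
    m = re.search(pattern, metrics_text, re.MULTILINE)
    return float(m.group(1)) if m else None

def queue_depth(base_url="http://localhost:8080"):
    """Fetch /metrics and return the router queue depth (None if unavailable)."""
    try:
        body = requests.get(f"{base_url}/metrics", timeout=10).text
    except requests.RequestException:
        return None
    return scrape_metric(body, "tgi_queue_size")
```

A cron job or autoscaler hook can call `queue_depth()` and trigger a scale-out when the value stays above a threshold across several samples.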
7. Router Configuration and Batching Behavior
The Rust router is the entry point for all requests. It maintains a priority queue, groups requests into batches for the model server, and handles backpressure when the system is overloaded. Several parameters control how aggressively the router batches requests.
| Parameter | Effect |
|---|---|
| --max-waiting-tokens | Number of decode tokens the running batch may generate before waiting requests are forced into it; lower values reduce queue latency, higher values favor throughput |
| --waiting-served-ratio | Ratio of waiting to running requests that triggers admitting waiting requests into the batch; controls how aggressively new work is admitted |
| --max-batch-size | Hard limit on batch size; useful for controlling memory usage on smaller GPUs |
The router also supports request prioritization and token budgets. When a request exceeds the
--max-input-tokens limit, TGI returns an HTTP 422 error immediately rather than
attempting to process and fail partway through. This fail-fast behavior prevents wasted GPU cycles.
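Client code should treat that 422 as a terminal validation error rather than something to retry. A minimal sketch (function name and error handling are illustrative, not part of TGI's client libraries):

```python
import requests

def generate_or_fail_fast(prompt, max_new_tokens=100, base_url="http://localhost:8080"):
    """Call /generate, surfacing the router's fail-fast 422 as a clear client error."""
    resp = requests.post(
        f"{base_url}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    if resp.status_code == 422:
        # The router rejected the request before any GPU work was done,
        # e.g. the prompt exceeds --max-input-tokens. Retrying the same
        # payload will always fail; truncate the prompt instead.
        raise ValueError(f"Rejected by TGI router: {resp.text}")
    resp.raise_for_status()
    return resp.json()["generated_text"]
```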
Summary
TGI provides a Docker-native, production-ready serving solution for LLMs with a clean separation between its Rust networking layer and Python inference layer. Its built-in quantization support, streaming endpoints, health checks, and Prometheus metrics make it well-suited for deployment behind container orchestration systems like Kubernetes. In the next section, we explore SGLang, which takes a fundamentally different approach by providing a programming language for structured LLM interactions.