Building Conversational AI with LLMs and Agents
Appendix S: Inference Serving: vLLM, TGI, and SGLang

SGLang: Structured Generation and RadixAttention

Big Picture

SGLang (Structured Generation Language) is a serving framework and programming DSL developed at UC Berkeley that introduces two powerful ideas: a frontend language for composing complex LLM programs with control flow, branching, and constraints; and RadixAttention, a backend optimization that automatically reuses KV cache across requests sharing common prefixes. Together, these make SGLang particularly well-suited for agentic workflows, structured JSON extraction, and any workload involving repeated prompts with varying suffixes.

1. Why Another Serving Framework?

vLLM and TGI excel at serving single-turn completions efficiently. However, many real-world LLM applications involve multi-turn interactions, branching logic (generate multiple candidates and pick the best), and structured output constraints (the model must produce valid JSON matching a schema). These patterns require sending many related requests to the server, and traditional serving frameworks treat each request independently, recomputing the KV cache from scratch even when requests share long common prefixes.

SGLang addresses this with two innovations. The frontend DSL lets you express complex LLM programs as Python functions with primitives for generation, selection, branching, and constraints. The RadixAttention backend automatically detects and reuses shared prefixes across requests (see Section 9.2 for the theory). For workloads with high prefix overlap (such as few-shot prompting, retrieval-augmented generation, or multi-turn chat), this can yield 3x to 5x speedups.

Covered in Detail

For the theoretical foundations of prefix caching, RadixAttention, and how they compare to PagedAttention's block-level sharing, see Section 9.2: KV Cache & Memory Optimization (subsection 5: Prefix Caching and RadixAttention). This section focuses on the practical SGLang DSL and deployment recipes.

2. Installing SGLang

SGLang can be installed with pip. It requires a CUDA-capable GPU and Python 3.9 or later.

# Install SGLang with all dependencies
pip install "sglang[all]"

# Or install just the frontend (for connecting to a remote SGLang server)
pip install sglang

3. The SGLang Frontend DSL

The SGLang frontend provides Python primitives for constructing LLM programs. The key building blocks are gen() for text generation, select() for constrained choice among options, and fork() for parallel branching. The following example demonstrates a structured extraction task.

import sglang as sgl

@sgl.function
def extract_entity(s, text):
    s += sgl.system("You are a precise entity extraction system.")
    s += sgl.user(f"Extract information from this text: {text}")
    s += sgl.assistant(
        "Entity name: " + sgl.gen("name", max_tokens=50, stop="\n")
        + "\nEntity type: " + sgl.select("type", [
            "Person", "Organization", "Location", "Product", "Event"
        ])
        + "\nConfidence: " + sgl.select("confidence", [
            "High", "Medium", "Low"
        ])
    )

# Run the function
runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct")
sgl.set_default_backend(runtime)

state = extract_entity.run(text="Apple Inc. announced the new iPhone 16 at their Cupertino headquarters.")

print(f"Name: {state['name']}")
print(f"Type: {state['type']}")
print(f"Confidence: {state['confidence']}")

runtime.shutdown()

Notice how sgl.select() constrains the model to choose from a predefined list rather than generating free-form text. SGLang implements this efficiently by evaluating the log-probabilities of each option in parallel, choosing the one with the highest likelihood. This is both faster and more reliable than prompting the model to pick from a list and then parsing the output.
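To make the scoring idea concrete, here is a toy sketch of log-probability-based selection. The hard-coded `TOKEN_LOGPROBS` table stands in for a real model's per-token scores and is purely illustrative; it is not SGLang's internal implementation, only the principle behind it.

```python
# Toy per-token log-probabilities a "model" might assign to each option's tokens.
# (Illustrative numbers only; a real backend queries the model for these.)
TOKEN_LOGPROBS = {
    "Organization": [-0.2, -0.1],
    "Person": [-1.5, -0.3],
    "Location": [-2.0, -0.4],
}

def select(options):
    """Pick the option whose tokens have the highest total log-probability,
    mirroring how sgl.select() scores candidates instead of parsing free text."""
    return max(options, key=lambda opt: sum(TOKEN_LOGPROBS[opt]))

print(select(["Person", "Organization", "Location"]))  # -> Organization
```

Because every option is scored against the same prefix, the comparison is exact: there is no chance of the model emitting an answer outside the allowed set.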

3.1 Branching with fork()

The fork() primitive creates parallel branches that share the same prefix KV cache. This is useful for generating multiple candidates and selecting the best one, implementing tree-of-thought reasoning, or running A/B tests on different continuations.

@sgl.function
def best_of_n(s, question, n=3):
    s += sgl.system("You are a helpful assistant. Think step by step.")
    s += sgl.user(question)

    # Fork into n parallel branches (all share the prefix KV cache)
    forks = s.fork(n)
    for i, f in enumerate(forks):
        f += sgl.assistant(sgl.gen(f"answer_{i}", max_tokens=300, temperature=0.8))

    # Collect all answers
    answers = [forks[i][f"answer_{i}"] for i in range(n)]
    return answers
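Once the forked answers are collected, you still need a way to choose among them. A minimal sketch, assuming a hypothetical `judge` scoring function (in practice this might be another LLM call or a reward model):

```python
def pick_best(answers, judge):
    """Score each forked candidate with a judge function and return the top one.
    `judge` is a hypothetical stand-in for a real scorer (e.g. a reward model)."""
    return max(answers, key=judge)

answers = ["short answer", "a much more detailed step-by-step answer", "ok"]
# Toy judge: prefer longer, more detailed answers.
print(pick_best(answers, judge=len))
```

The key efficiency point is upstream of this step: because all n branches share the prefix KV cache, generating the candidates costs little more than n independent suffixes.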

4. Structured Output with Constraints

One of SGLang's strongest features is its ability to constrain generation to match a regular expression or JSON schema. This guarantees that the model output is syntactically valid, eliminating the need for retry loops or post-processing. The constraint is applied at the token level during decoding using a finite-state machine.

@sgl.function
def generate_json_record(s, description):
    s += sgl.system("You extract structured data as JSON.")
    s += sgl.user(f"Extract a person record from: {description}")
    s += sgl.assistant(
        sgl.gen(
            "json_output",
            max_tokens=200,
            regex=r'\{"name": "[^"]+", "age": \d+, "city": "[^"]+"\}',
        )
    )

state = generate_json_record.run(
    description="John Smith is a 34-year-old software engineer living in Seattle."
)

import json
record = json.loads(state["json_output"])
print(record)
# Output: {"name": "John Smith", "age": 34, "city": "Seattle"}
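The finite-state-machine mechanism behind constrained decoding can be illustrated with a toy example. The hand-built DFA below accepts only the strings "yes" or "no"; at each step, the decoder masks the vocabulary down to the transitions the FSM allows, then lets the model pick among those. This is a deliberately simplified sketch (single-character tokens, hand-written states), not SGLang's actual compiler.

```python
# Toy DFA accepting the regex (yes|no): each state maps allowed tokens
# (here, single characters) to the next state.
DFA = {
    0: {"y": 1, "n": 4},
    1: {"e": 2},
    2: {"s": 3},
    3: {},  # accepting
    4: {"o": 5},
    5: {},  # accepting
}
ACCEPTING = {3, 5}

def constrained_decode(score_fn):
    """Greedy decode: at each step the model proposes scores, but only
    tokens permitted by the current FSM state may be chosen."""
    state, out = 0, []
    while state not in ACCEPTING:
        allowed = DFA[state].keys()            # mask: FSM-legal tokens only
        token = max(allowed, key=score_fn)     # model picks among them
        out.append(token)
        state = DFA[state][token]
    return "".join(out)

# A fake "model" that strongly prefers the token 'n'.
print(constrained_decode(lambda t: {"n": 0.9}.get(t, 0.1)))  # -> no
```

A real implementation compiles the regex or JSON schema into such an automaton over the model's actual token vocabulary, which is why the guarantee holds at the token level rather than requiring post-hoc validation.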
Tip

For complex JSON schemas, use SGLang's json_schema parameter instead of writing regexes by hand. Pass a Pydantic model or a JSON Schema dictionary, and SGLang will compile it into an efficient token-level constraint automatically.
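As a sketch of that workflow, the schema for the person record from the earlier example can be written as a plain JSON Schema dictionary and serialized for the constraint compiler. The `json_schema` keyword shown in the comment follows the tip above; treat the exact call signature as an assumption to check against your installed SGLang version.

```python
import json

# A JSON Schema describing the person record (same fields as the regex example).
person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name", "age", "city"],
}

schema_str = json.dumps(person_schema)

# Assumed usage inside an @sgl.function body (verify against your version):
# s += sgl.assistant(sgl.gen("json_output", max_tokens=200,
#                            json_schema=schema_str))
print(schema_str)
```

A Pydantic model works the same way: `json.dumps(MyModel.model_json_schema())` produces an equivalent schema string without hand-writing the dictionary.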

5. RadixAttention in Practice

Covered in Detail

For the theory behind RadixAttention (radix tree data structures, LRU eviction, and comparison to vLLM's block-level sharing), see Section 9.2: KV Cache & Memory Optimization. Below we focus on the practical impact for SGLang workloads.

RadixAttention is the backend optimization that makes SGLang's frontend primitives fast. When a new request arrives, the SGLang server walks its radix tree of cached KV states to find the longest matching prefix and reuses those states, avoiding redundant computation.

Practical Example

Consider a customer support chatbot that includes a 500-token system prompt with company policies and 5 few-shot examples. Without RadixAttention, each of 100 concurrent users would need their own copy of the system prompt's KV cache (50,000 tokens worth of KV state). With RadixAttention, a single copy is shared, freeing GPU memory for 3x more concurrent users.
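The prefix-matching behavior can be sketched with a toy cache. For brevity this uses an uncompressed trie rather than a true radix tree (which merges single-child chains), and the "tokens" are symbolic placeholders; the real server stores actual KV tensors at each node and evicts with an LRU policy.

```python
class RadixCache:
    """Toy prefix cache: a trie over token sequences standing in for
    cached KV states. Illustrative only, not SGLang's implementation."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a request's token sequence as cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Return how many leading tokens are already cached (reusable KV)."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, matched = node[t], matched + 1
        return matched

cache = RadixCache()
# First user's request: system prompt + few-shot examples + their message.
cache.insert(["sys", "few1", "few2", "userA"])
# A second user shares everything except the final message:
print(cache.longest_prefix(["sys", "few1", "few2", "userB"]))  # -> 3
```

Here the second request reuses three of its four prefix "tokens"; in the chatbot scenario above, that shared prefix is the 500-token system prompt and few-shot block, computed once instead of 100 times.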

6. Server Deployment

SGLang provides a server mode that exposes both its native API and an OpenAI-compatible API. The server can be launched from the command line.

# Launch the SGLang server
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000 \
    --tp 1 \
    --mem-fraction-static 0.85

Once the server is running, you can connect to it using the SGLang client or any OpenAI-compatible client library.

from openai import OpenAI

# Connect to SGLang's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:30000/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RadixAttention in two sentences."},
    ],
    temperature=0.3,
    max_tokens=100,
)
print(response.choices[0].message.content)

7. Batch Inference with SGLang

For offline batch processing, SGLang provides efficient parallel execution that automatically exploits prefix sharing across the batch. The following example processes a batch of classification tasks.

@sgl.function
def classify_sentiment(s, review):
    s += sgl.system("Classify the sentiment of the following review.")
    s += sgl.user(review)
    s += sgl.assistant(
        "Sentiment: " + sgl.select("sentiment", ["Positive", "Negative", "Neutral"])
    )

# Process a batch of reviews
reviews = [
    "This product exceeded my expectations! Highly recommend.",
    "Terrible quality. Broke after one day.",
    "It works fine. Nothing special, nothing terrible.",
    "Best purchase I've made all year!",
    "Would not buy again. Very disappointing.",
]

# Run batch (SGLang automatically shares the system prompt KV cache)
states = classify_sentiment.run_batch(
    [{"review": r} for r in reviews],
    progress_bar=True,
)

for review, state in zip(reviews, states):
    print(f"{state['sentiment']:>10} | {review[:60]}")
  Positive | This product exceeded my expectations! Highly recommend.
  Negative | Terrible quality. Broke after one day.
   Neutral | It works fine. Nothing special, nothing terrible.
  Positive | Best purchase I've made all year!
  Negative | Would not buy again. Very disappointing.

Summary

SGLang bridges the gap between serving infrastructure and application logic by providing a Python DSL for composing complex LLM programs. Its RadixAttention backend automatically detects and reuses shared prefixes across requests, delivering significant speedups for workloads with prefix overlap. The constrained generation features (regex, JSON schema, select) guarantee structurally valid outputs without post-processing. In the next section, we examine the quantization techniques that reduce model size and increase throughput across all three serving frameworks.