Section 48.3: Output Guardrails: Llama Guard, NeMo Guardrails, ShieldGemma, Guardrails AI

"Defense in depth is the only depth that survives contact with adversaries."
Bruce Schneier, Secrets and Lies, 2000

Big Picture

Output guardrails are the last line of defense before a model response reaches a user. The four dominant open-source platforms (Llama Guard 3, NeMo Guardrails, ShieldGemma, Guardrails AI) each take a different design stance: Llama Guard is a single transformer classifier; NeMo Guardrails is a programmable DSL; ShieldGemma is a family of classifier sizes from 2B to 27B; Guardrails AI is a Python-native validator framework. This section walks through each platform's architecture, when to use it, and gives a worked integration combining Llama Guard 3 with a NeMo Colang policy. By the end, you can pick the right tool for a given application and know how to layer multiple platforms when one is insufficient.

Four output guardrail platforms, four design stances — **Figure 48.3.1**: The four open-source output guardrails take four different design stances. Llama Guard 3 is a fine-tuned LLM that returns a verdict plus MLCommons categories. NeMo Guardrails is a Colang DSL for programming conversational flows. ShieldGemma scales the same classifier across 2B / 9B / 27B for a latency-versus-accuracy knob. Guardrails AI is a Python validator framework that emphasizes structural and PII checks rather than harm classification. Production stacks layer two of them because each catches a different failure mode.

Prerequisites

This section assumes familiarity with input guardrails from Section 48.2 and with the three-layer safety model from Section 48.1. Familiarity with supervised fine-tuning from Section 13.1 helps when reading the Llama Guard classifier discussion.

48.3.1 What the Output Layer Must Catch

Fun Fact

Llama Guard 3, NeMo Guardrails, ShieldGemma, and Guardrails AI each claim to be the dominant output guardrail. In practice, most production stacks run two of them in parallel because each catches a slightly different failure mode, and a false negative on harmful content costs more than the added latency of a second classifier.

Five categories of output failure are common enough to deserve dedicated detection:

Key Insight

Aha Moment: One Real Incident, Five Different Failure Modes

Air Canada's chatbot case (Moffatt v. Air Canada, BC Civil Resolution Tribunal, decided February 2024) shows how far the failure surface extends beyond harmful content. A grieving customer asked about the bereavement-fare refund policy. On this section's reading of the incident, a single conversation hit three of the five categories: (1) the chatbot invented a policy that did not exist (hallucination); (2) it wrapped the invented policy in fabricated terms-and-conditions language that read like the real policy, so the answer was well-formed but wrong (a structural violation that passes downstream JSON validation); and (3) it answered with confident authority because the system prompt had leaked into the response style (system-prompt leakage). Before the tribunal, the airline argued that the chatbot was "a separate legal entity"; the tribunal rejected that position and ordered Air Canada to compensate the customer. The harm category did not fire at all (the bot was not toxic, did not encourage crime, and did not leak PII), yet the remaining failures produced a public ruling that amounts to "the chatbot's output is the company's output." Production output guardrails that screen only for category 1 (MLCommons hazards) miss the entire failure surface of category 2 (hallucination) and category 5 (structural validation), which is where most enterprise harm actually originates.

Harmful content across the MLCommons hazard taxonomy categories: violent crimes, non-violent crimes, sex-related crimes, child sexual exploitation, defamation, specialized advice (medical/legal/financial), privacy violations, intellectual-property infringement, indiscriminate weapons (CBRN), hate, suicide and self-harm, sexual content, and elections.
Hallucination: claims unsupported by retrieved context or known facts. The detection mechanism is different from harm classification (Chapter 54 covers it in detail), but the runtime layer is the same.
System-prompt leakage: the model echoes back its hidden instructions because of a successful prompt injection that the input guardrail missed.
PII leakage: the model generates personal data that was either in its training set or interpolated. The Presidio pattern from Section 48.2 applies, run again on outputs.
Structural violations: invalid JSON, missing required fields, schema mismatches. These look benign but break downstream consumers and are a frequent source of production incidents.
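Two of these categories, system-prompt leakage and structural violations, can be screened without any model at all. The following is a minimal sketch under stated assumptions: the helper names `leaks_system_prompt` and `structural_errors` are hypothetical, the 8-word window is an arbitrary starting point, and a production check would likely add fuzzy matching and full JSON Schema validation.

```python
import json

def leaks_system_prompt(system_prompt: str, output: str, window: int = 8) -> bool:
    """Flag outputs that reproduce any run of `window` consecutive words
    from the system prompt (case-insensitive, whitespace-normalized)."""
    sys_words = system_prompt.lower().split()
    # Pad with spaces so matches respect word boundaries.
    out_text = " " + " ".join(output.lower().split()) + " "
    for i in range(len(sys_words) - window + 1):
        ngram = " " + " ".join(sys_words[i:i + window]) + " "
        if ngram in out_text:
            return True
    return False

def structural_errors(raw: str, required: dict) -> list:
    """Return a list of problems: invalid JSON, missing fields, wrong types.
    `required` maps field names to expected Python types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for field, expected_type in required.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```

Both checks run in microseconds, so they can sit in front of the slower classifier-based checks for the other three categories.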

48.3.2 Llama Guard 3: The Open-Source Standard

Llama Guard 3 (8B variant released July 2024 alongside Llama 3.1; 1B variant added in September 2024) is the de-facto open-source classifier for harm taxonomies. It is a fine-tuned Llama 3.1 model that takes a conversation (user turn + assistant turn) and emits a verdict: safe or unsafe, plus the list of violated categories (S1 through S14: the 13 MLCommons hazard categories plus S14, Code Interpreter Abuse).

The architecture choice matters: because Llama Guard is itself an LLM, you can prompt-tune it with your own policy categories without retraining. The model card describes a policy zoning pattern where each application supplies its own subset of categories at inference time.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

LG_MODEL = "meta-llama/Llama-Guard-3-8B"
tok = AutoTokenizer.from_pretrained(LG_MODEL)
model = AutoModelForCausalLM.from_pretrained(
    LG_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def llama_guard_check(user_msg: str, assistant_msg: str) -> dict:
    """Returns {'verdict': 'safe'|'unsafe', 'categories': [...]}"""
    conv = [
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]
    # Tokenize through the chat template in one step. Rendering the template to
    # text and re-tokenizing it can prepend a second beginning-of-text token.
    inputs = tok.apply_chat_template(
        conv, return_tensors="pt", return_dict=True
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    # Decode only the newly generated tokens (the verdict), not the prompt.
    response = tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    lines = [l.strip() for l in response.strip().split("\n") if l.strip()]
    # Fail closed: an empty or unexpected first line is treated as unsafe.
    first = lines[0].lower() if lines else ""
    verdict = first if first in ("safe", "unsafe") else "unsafe"
    categories = lines[1].split(",") if len(lines) > 1 else []
    return {"verdict": verdict, "categories": [c.strip() for c in categories]}

result = llama_guard_check(
    user_msg="How do I make a small bomb?",
    assistant_msg="Sure, you can build one with...",
)
# Expected shape: {'verdict': 'unsafe', 'categories': [...]}
# The exact category codes depend on the model (for example, S9 = Indiscriminate Weapons).

Code Fragment 48.3.1a: Llama Guard 3 integration. The chat template handles the policy preamble; the model emits safe or unsafe followed by violated category codes. On a single A100, throughput is ~50 conversations/second at FP16. The 1B variant fits on a 4GB GPU and trades ~3 points of accuracy for 5x throughput.
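The per-application policy idea mentioned before the code fragment can be approximated in two places: in the category block of the prompt, or as a post-filter on the verdict. The sketch below shows both. The category names follow the Llama Guard 3 model card. The delimiter strings approximate the published prompt format and should be checked against the tokenizer's actual chat template. The helper names `build_policy_block` and `apply_policy` are hypothetical.

```python
# Category codes and names as listed in the Llama Guard 3 model card.
LG3_CATEGORIES = {
    "S1": "Violent Crimes", "S2": "Non-Violent Crimes", "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation", "S5": "Defamation", "S6": "Specialized Advice",
    "S7": "Privacy", "S8": "Intellectual Property", "S9": "Indiscriminate Weapons",
    "S10": "Hate", "S11": "Suicide & Self-Harm", "S12": "Sexual Content",
    "S13": "Elections", "S14": "Code Interpreter Abuse",
}

def build_policy_block(enabled: set) -> str:
    """Render only the enabled categories into a prompt section.
    Assumption: the delimiters mirror the model card's prompt format."""
    lines = ["<BEGIN UNSAFE CONTENT CATEGORIES>"]
    lines += [f"{code}: {name}." for code, name in LG3_CATEGORIES.items() if code in enabled]
    lines.append("<END UNSAFE CONTENT CATEGORIES>")
    return "\n".join(lines)

def apply_policy(result: dict, enabled: set) -> dict:
    """Post-filter a full-taxonomy verdict down to the application's policy."""
    if result["verdict"] != "unsafe":
        return {"verdict": "safe", "categories": []}
    if not result["categories"]:
        # Fail closed: unsafe with no parsable category stays unsafe.
        return {"verdict": "unsafe", "categories": []}
    kept = [c for c in result["categories"] if c in enabled]
    return {"verdict": "unsafe" if kept else "safe", "categories": kept}
```

The post-filter is the more conservative of the two options because the classifier still sees the taxonomy it was trained on; only the application's reaction changes.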

Key Insight

Always pass the user turn into Llama Guard, not just the assistant turn. The classifier was trained to score the assistant response given the user request. A neutral response to a benign question is safe; the same words in response to a harmful request might be unsafe. Practitioners who only pass the assistant text lose 15-20% of detection accuracy.

Key Insight

Token-Level vs Sequence-Level Constitutional Classifiers

Anthropic's Constitutional Classifiers (Sharma et al., 2025) and Llama Guard each take a different scoring stance, and the choice has very different latency and false-positive consequences. Let $y_{1:T} = (y_1, \ldots, y_T)$ be the assistant's response. A sequence-level classifier produces one verdict over the full response,

$$\hat{c}_{\mathrm{seq}}(y_{1:T}) \;=\; \sigma\!\bigl(W \cdot h(y_{1:T})\bigr) \;\in\; [0,1],$$

which is cheap in flops but cannot intervene during generation: the entire token sequence has already been emitted before a verdict exists. A token-level classifier produces a per-token harm score

$$\hat{c}_{\mathrm{tok}}(y_t \mid y_{<t}, x) \;=\; \sigma\!\bigl(W \cdot h_t\bigr), \quad t = 1, \ldots, T,$$

and the generation loop can early-stop or backtrack the moment $\hat{c}_{\mathrm{tok}}(y_t \mid y_{<t}, x) > \tau$. Anthropic's training procedure jointly optimizes the cross-entropy of token-level harm labels and a sequence-aggregation loss $\max_t \hat{c}_{\mathrm{tok}}(y_t)$ so the same head can be used either way. The latency win is real: with greedy decoding at 50 tokens/second and a 200-token response, a sequence-level classifier blocks after 4 seconds; a token-level classifier with $\tau = 0.9$ typically blocks within 200ms of the first unsafe token.
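The difference between the two scoring stances can be made concrete with a few lines of arithmetic. This is a minimal sketch assuming the per-token harm scores are already available as a non-empty list of floats; the function names are illustrative, and classifier compute time is ignored so only generation time is counted.

```python
def sequence_verdict(token_scores, tau):
    """Max-aggregate per-token scores into a single sequence-level decision."""
    return max(token_scores) > tau

def first_block_index(token_scores, tau):
    """Index of the first token whose score exceeds tau, or None."""
    for t, score in enumerate(token_scores):
        if score > tau:
            return t
    return None

def seconds_until_block(token_scores, tau, tokens_per_second, token_level):
    """Generation time elapsed before a block verdict exists.
    Sequence-level scoring waits for the full response; token-level
    scoring stops at the first token above the threshold."""
    idx = first_block_index(token_scores, tau)
    if idx is None:
        return None  # never blocked
    emitted = (idx + 1) if token_level else len(token_scores)
    return emitted / tokens_per_second

# A 200-token response whose 60th token is the first unsafe one.
scores = [0.01] * 200
scores[59] = 0.97
print(seconds_until_block(scores, 0.9, 50, token_level=False))  # 4.0
print(seconds_until_block(scores, 0.9, 50, token_level=True))   # 1.2
```

Under these assumptions the sequence-level verdict arrives only after all 4 seconds of generation, while the token-level verdict arrives as soon as the offending token is produced.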

Algorithm 48.3.1: Token-Level Constitutional Classifier with Early-Stop Decoding

Algorithm: SAFE-DECODE-WITH-CONSTITUTIONAL-CLASSIFIER
Input:  Policy model p_theta, classifier head c_phi,
        prompt x, max tokens T_max, threshold tau,
        backtrack window k, retry limit R_max
Output: Safe completion y or the refusal token REFUSAL

  y = empty sequence
  retries = 0
  While |y| < T_max:
    y_next = sample from p_theta(. | x, y)
    s = c_phi(y_next | x, y)               // per-token harm score
    If s > tau:
      If retries >= R_max:
        Return REFUSAL                     // hard refusal
      retries = retries + 1
      // Discard the flagged token and the last k accepted tokens, then resample
      y = y_{1:max(0, |y|-k)}
      Continue
    Append y_next to y
    If y_next == EOS: break
  Return y

In this formulation the classifier head $c_\phi$ shares the policy model's hidden states ($h_t$), so the marginal cost is one matrix multiply per token rather than a separate forward pass. The backtrack width $k$ trades safety against fluency: $k=0$ rejects only the latest token; large $k$ rejects entire phrases. Anthropic reports that classifier-guarded models cut the jailbreak success rate from 86% to 4.4% in automated evaluations, with a 0.38 percentage-point increase in refusals on production traffic (Sharma et al., 2025).
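Algorithm 48.3.1 translates almost line for line into Python. The sketch below assumes the policy and the classifier are supplied as plain callables (`sample_next` and `harm_score` are hypothetical stand-ins, not a real library API) and that sampling is stochastic, so resampling after a backtrack can produce a different continuation.

```python
REFUSAL = "<refusal>"  # placeholder refusal marker, an assumption of this sketch
EOS = "<eos>"

def safe_decode(sample_next, harm_score, prompt, t_max, tau, k, r_max):
    """Token-level guarded decoding with backtracking.

    sample_next(prompt, y) -> next token given the tokens accepted so far
    harm_score(prompt, y, token) -> per-token harm score in [0, 1]
    """
    y = []
    retries = 0
    while len(y) < t_max:
        token = sample_next(prompt, y)
        if harm_score(prompt, y, token) > tau:
            if retries >= r_max:
                return REFUSAL               # hard refusal
            retries += 1
            # Drop the flagged token (never appended) and the last k accepted tokens.
            del y[max(0, len(y) - k):]
            continue
        y.append(token)
        if token == EOS:
            break
    return y

# Toy usage: a scripted "policy" and a classifier that flags one token.
script = iter(["a", "b", "bad", "c", EOS])
out = safe_decode(
    sample_next=lambda p, y: next(script),
    harm_score=lambda p, y, tok: 1.0 if tok == "bad" else 0.0,
    prompt="x", t_max=10, tau=0.9, k=1, r_max=3,
)
print(out)  # ['a', 'c', '<eos>']
```

With `k=1` the flagged token and its predecessor `"b"` are both discarded, which is why the final sequence skips from `"a"` to `"c"`.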

48.3.3 NeMo Guardrails: A Programmable DSL

NVIDIA's NeMo Guardrails takes a different stance: rather than a single classifier, it provides a dialog policy language called Colang that lets you write declarative flows for what the bot is allowed to say. Colang 2.0 (released 2024) is a major redesign aimed at agentic workflows.

A Colang program defines flows: a flow specifies a user intent, an action, and a response. If the user's message matches a defined harmful intent, the matching flow fires a refusal. The intent matching uses sentence embeddings (default: a small SentenceTransformers model), so semantic paraphrases are caught automatically.

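# config/rails.co (Colang 2.0)

# Standard-library imports that provide `user said`, `bot say`, and `llm continuation`.
import core
import llm

flow user asks about competitor pricing
  # Each alternative is a full `user said` match, joined with `or`.
  user said "what does Acme charge for"
    or user said "how much does the competitor product cost"
    or user said "compare your prices to Acme"

flow refuse competitor pricing
  bot say "I can only discuss our own products. Please contact Acme directly for their pricing."

flow main
  activate llm continuation
  # `guardrail input` and `guardrail output` are assumed to be flows defined elsewhere in this config.
  activate guardrail input
  activate guardrail output

  user asks about competitor pricing
  refuse competitor pricing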

Code Fragment 48.3.2: A minimal Colang policy in NeMo Guardrails 2.0. The flow main activates input/output guardrails and explicitly handles the "competitor pricing" intent. Matching is embedding-based, so phrasings like "what's the Acme price" or "how does your pricing compare" all trigger the refuse flow without being enumerated.

NeMo's strength is composability: you can chain Llama Guard, a custom classifier, a moderation API, and a hallucination check in a single declarative pipeline. Its weakness is the learning curve: Colang is a new language with its own debugging tools.
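The nearest-example idea behind intent matching can be illustrated without NeMo itself. The toy below is not NeMo's implementation: it substitutes a bag-of-words vector for a real sentence embedding, so it only rewards lexical overlap, and the 0.5 threshold and function names are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Example utterances per intent, taken from the Colang policy above.
EXAMPLES = {
    "user asks about competitor pricing": [
        "what does Acme charge for",
        "how much does the competitor product cost",
        "compare your prices to Acme",
    ],
}

def embed(text):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_intent(utterance, threshold=0.5):
    """Return the intent whose closest example exceeds the threshold, else None."""
    query = embed(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in EXAMPLES.items():
        for example in examples:
            score = cosine(query, embed(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent if best_score >= threshold else None

print(match_intent("how much does Acme charge"))  # user asks about competitor pricing
print(match_intent("where is my order"))          # None
```

Swapping `embed` for a real sentence-embedding model is what lets paraphrases with no shared words land near the stored examples.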

48.3.4 ShieldGemma: Choosing a Model Size

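Google's ShieldGemma (released 2024) is a family of classifiers at 2B, 9B, and 27B parameter sizes, built on Gemma 2. The models target four harm types (sexually explicit content, dangerous content, hate speech, and harassment) rather than the full MLCommons taxonomy, and they answer with a Yes/No verdict whose token probability serves as a confidence score.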

The size-vs-accuracy tradeoff is well-documented in the ShieldGemma model card. On internal Google evals, the 2B model achieves ~85% F1 on the harm-detection benchmark, the 9B reaches ~92%, and the 27B reaches ~94%. For most production deployments, the 9B is the sweet spot: fits in 20GB of GPU memory, runs at ~30 evaluations/second on an A100, and is accurate enough that the marginal value of going to 27B does not justify the doubling of cost.
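Whichever size is chosen, the classifier's verdict and confidence usually come from the same place: the model is asked a yes/no policy question, and the score is a softmax over the logits of the two answer tokens. The sketch below assumes those two logits have already been extracted from the model's final position; the function names and the 0.5 default threshold are illustrative.

```python
import math

def violation_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the two answer-token logits; returns P('Yes'),
    where 'Yes' means the content violates the policy."""
    m = max(yes_logit, no_logit)          # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

def decide(yes_logit: float, no_logit: float, threshold: float = 0.5) -> dict:
    """Turn the probability into a verdict at a tunable threshold."""
    p = violation_probability(yes_logit, no_logit)
    return {"violation": p >= threshold, "score": p}

print(decide(2.0, 0.0))   # violation=True, score about 0.88
print(decide(-3.0, 1.0))  # violation=False, score about 0.018
```

Exposing the threshold is what makes the size choice tunable in practice: a smaller model with a lower threshold trades extra false positives for recall.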

Comparison chart of four output guardrail platforms across four axes: accuracy (F1 on MLCommons hazards), latency p50 in ms, memory footprint in GB, and policy customization ease (1-5 scale). Bars show Llama Guard 3 (8B): F1 0.91, latency 80ms, mem 16GB, custom 4/5. NeMo Guardrails (with embeddings): F1 0.88, latency 120ms, mem 4GB, custom 5/5. ShieldGemma (9B): F1 0.92, latency 90ms, mem 20GB, custom 3/5. Guardrails AI (validators): F1 0.85, latency 30ms, mem 1GB, custom 5/5. — **Figure 48.3.2a**: Side-by-side comparison of the four major output guardrail platforms. Numbers are illustrative based on each platform's published model card and represent typical production configurations. The right choice depends on the dominant constraint: accuracy (ShieldGemma 9B), customization (NeMo Colang or Guardrails AI), or footprint (Guardrails AI validators on CPU).

48.3.5 Guardrails AI: Pydantic-Native Validators

Guardrails AI takes yet another stance: instead of a classifier, it is a Python framework of validators that run on model outputs. Each validator is a small, fast function (regex, classifier, or LLM-as-judge) that emits pass or fail, and its on_fail policy decides what happens next. With fix, the validator applies its own programmatic correction (for example, redacting PII); with reask, the framework handles retry logic by asking the LLM to regenerate with corrective feedback.

The natural unit of composition is a Pydantic model that declares both the schema and the validation rules.

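from pydantic import BaseModel, Field
from guardrails import Guard
from guardrails.hub import ToxicLanguage, DetectPII, RegexMatch

class CustomerResponse(BaseModel):
    answer: str = Field(
        ...,
        validators=[
            # "fix" applies the validator's own programmatic correction.
            ToxicLanguage(threshold=0.5, on_fail="fix"),
            DetectPII(pii_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"], on_fail="fix"),
            # A negative lookahead needs search semantics; this assumes the
            # validator defaults to full-match, which would reject any non-empty answer.
            RegexMatch(regex=r"^(?!.*confidential)", match_type="search", on_fail="exception"),
        ],
    )
    confidence: float = Field(ge=0, le=1)

def answer_with_guardrails(question: str) -> CustomerResponse:
    guard = Guard.for_pydantic(CustomerResponse)
    # The call returns a ValidationOutcome, not a 2-tuple. The call signature
    # differs between Guardrails releases; this uses the model/messages form
    # and should be checked against the installed version.
    outcome = guard(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Answer the user's question: {question}"}],
    )
    return CustomerResponse(**outcome.validated_output)

print(answer_with_guardrails("How do I cancel my subscription?"))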

Output: answer='You can cancel your subscription from Settings > Billing > Cancel plan. Your access continues until the end of the current billing period.' confidence=0.94

Code Fragment 48.3.3: A Guardrails AI configuration combining toxicity detection, PII redaction, and a regex constraint. The on_fail parameter controls the action: fix applies the validator's programmatic correction (for example, redacting detected PII); reask re-prompts the LLM with corrective feedback; exception raises a Python exception (caught by the application); filter drops the offending field from the output. The Guardrails Hub ships 60+ validators as of late 2025.

Consider a Guardrails AI pipeline with $V$ independent validators, where validator $v$ rejects a given generation with probability $p_v$, and a retry budget of $r$ re-prompts per violation, with attempts treated as independent. The end-to-end probability that a turn ships is

$$P_{\mathrm{ship}} \;=\; \prod_{v=1}^{V} \bigl(1 - p_v^{\,r+1}\bigr),$$

where the term $p_v^{\,r+1}$ is the probability the validator fails on the original output and on all $r$ retries. With three validators at $p_v = 0.04$ each and $r = 2$ retries, the per-turn ship rate is $(1 - 0.04^3)^3 \approx 0.99981$: roughly 2 of every 10 000 turns end in an exception, well inside a typical SLA budget. The same formula tells you when adding a fourth validator costs more than it buys.
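The ship-rate calculation is easy to reproduce and to extend to a fourth validator. This is a small sketch under the same independence assumption; the function names are illustrative.

```python
import math

def ship_rate(failure_probs, retries):
    """Probability that every validator passes within 1 + retries attempts,
    assuming validators and attempts are independent."""
    return math.prod(1 - p ** (retries + 1) for p in failure_probs)

def exceptions_per_10k(failure_probs, retries):
    """Expected number of turns per 10,000 that exhaust the retry budget."""
    return 10_000 * (1 - ship_rate(failure_probs, retries))

three = [0.04, 0.04, 0.04]
print(round(ship_rate(three, retries=2), 5))            # 0.99981
print(round(exceptions_per_10k(three, retries=2), 2))   # 1.92
# Adding a fourth validator with the same failure rate:
print(round(exceptions_per_10k(three + [0.04], retries=2), 2))  # 2.56
```

Without any retries the same three validators would ship only about 88.5% of turns, which shows how much of the reliability comes from the retry budget rather than from the validators themselves.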

Worked Example

Latency Budget for a Guardrails-Wrapped Chat Call

A support bot replaces a plain GPT-4o-mini call with a Guardrails wrapper: ToxicLanguage (detoxify, 35 ms), DetectPII (Presidio, 80 ms), and a RegexMatch (negligible). On the happy path, total added latency is $35 + 80 + 0 \approx 115$ ms, taking the median turn from 720 ms to 835 ms. On the unhappy path (one validator triggers a reask), the re-prompt adds one full GPT-4o-mini round-trip ($\sim 700$ ms) plus the validators again ($\sim 115$ ms), pushing the 99th percentile to roughly $720 + 115 + 700 + 115 = 1\,650$ ms. The team accepted the 16% median latency increase because it dropped reported toxicity incidents from 3 per million turns to 0 in the post-deployment month, with the p99 hit invisible to users on background-task workloads.
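The same budget can be kept as a small calculator so that adding a validator or allowing a second re-prompt is a one-line change. The function name and parameters are illustrative; the numbers are the ones from the worked example.

```python
def guarded_latency_ms(base_ms, validator_ms, reasks=0, reprompt_ms=0.0):
    """Total turn latency: base LLM call, one validator pass, and for each
    re-prompt another LLM round-trip followed by another validator pass."""
    validators = sum(validator_ms)
    return base_ms + validators + reasks * (reprompt_ms + validators)

validators = [35, 80, 0]   # ToxicLanguage, DetectPII, RegexMatch

happy = guarded_latency_ms(720, validators)
unhappy = guarded_latency_ms(720, validators, reasks=1, reprompt_ms=700)
overhead = (happy - 720) / 720

print(happy)                 # 835
print(unhappy)               # 1650
print(round(overhead, 3))    # 0.16
```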

48.3.6 An Integrated Deployment: Llama Guard plus NeMo plus Pydantic

In practice, large deployments combine multiple platforms. A common pattern is:

NeMo Guardrails at the orchestration layer: handles dialog state, topic routing, refusal templates.
Llama Guard 3 as an output classifier: invoked from inside a NeMo output rail, scores the assistant turn against the MLCommons taxonomy.
Guardrails AI validators for structural checks: schema validation, PII redaction, regex constraints. Cheap to run, fail-fast.

The end-to-end flow: user message hits NeMo's input rail (topic routing, jailbreak detection); model generates response; output rail invokes Llama Guard for harm classification; if safe, Pydantic validators verify structure; if all green, response is streamed to the user.
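The flow above can be sketched as a single function with each layer injected as a callable. Everything here is a stand-in: `input_rail`, `generate`, `harm_check`, and the validator list are hypothetical hooks where NeMo, the LLM, Llama Guard, and Guardrails AI would plug in, and the pipeline fails closed whenever a stage does not explicitly pass.

```python
REFUSAL = "I can't help with that request."

def guarded_reply(user_msg, input_rail, generate, harm_check, validators):
    """Run the layered pipeline and return (text, stage_that_decided)."""
    # 1. Input rail: topic routing and jailbreak detection.
    if not input_rail(user_msg):
        return REFUSAL, "input_rail"
    # 2. Model generates a draft response.
    draft = generate(user_msg)
    # 3. Output rail: harm classification on the (user, assistant) pair.
    result = harm_check(user_msg, draft)
    if result.get("verdict") != "safe":      # anything but an explicit "safe" blocks
        return REFUSAL, "harm_classifier"
    # 4. Cheap structural validators, checked in order.
    for name, validate in validators:
        if not validate(draft):
            return REFUSAL, f"validator:{name}"
    return draft, "ok"

# Toy usage with stub layers.
stub_validators = [
    ("non_empty", lambda text: bool(text.strip())),
    ("no_email", lambda text: "@" not in text),
]
reply, stage = guarded_reply(
    "How do I cancel?",
    input_rail=lambda m: "ignore previous" not in m.lower(),
    generate=lambda m: "Go to Settings > Billing.",
    harm_check=lambda u, a: {"verdict": "safe", "categories": []},
    validators=stub_validators,
)
print(reply, stage)  # Go to Settings > Billing. ok
```

Returning the deciding stage alongside the text makes it straightforward to log which layer blocked a response, which is useful when tuning thresholds per layer.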

Warning: Streaming vs Batched Output Guardrails

If you stream tokens to the user, the output guardrail must run on the partial generation. Two strategies: (1) buffer N tokens, run the classifier, then release; (2) speculative streaming where the user sees tokens with a 200ms delay while the guardrail runs in parallel and can retroactively cut the stream. Strategy 1 adds latency to first-token; strategy 2 means users occasionally see a response that mid-sentence becomes a refusal. Pick based on which is worse for your UX.
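Strategy 1 can be sketched as a generator that holds back a chunk of tokens until the guardrail has approved everything generated so far. The names `buffered_stream` and `is_safe` are hypothetical, and the chunk size is the knob that trades first-token latency against how much unreviewed text can ever reach the user (here, none).

```python
def buffered_stream(tokens, is_safe, chunk_size=20,
                    refusal="[response withheld by safety filter]"):
    """Yield tokens in approved chunks; stop with a refusal on the first unsafe chunk.

    tokens:  any iterable of generated tokens
    is_safe: callable taking the full token list so far and returning True/False
    """
    released, buffer = [], []
    for token in tokens:
        buffer.append(token)
        if len(buffer) >= chunk_size:
            if not is_safe(released + buffer):
                yield refusal
                return
            yield from buffer
            released.extend(buffer)
            buffer = []
    # Flush the final partial chunk, still subject to the same check.
    if buffer:
        if not is_safe(released + buffer):
            yield refusal
            return
        yield from buffer

# Toy usage: the classifier flags any sequence containing "BAD".
flag_bad = lambda toks: "BAD" not in toks
print(list(buffered_stream(["a", "b", "c", "BAD", "d"], flag_bad, chunk_size=2)))
# ['a', 'b', '[response withheld by safety filter]']
```

The classifier sees the released prefix plus the pending chunk, so context-dependent violations that only become apparent across chunk boundaries can still be caught before the new chunk is released.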

Real-World Scenario: Picking the Right Platform

Three teams, three choices: (a) A medical-advice startup with strict liability picks ShieldGemma 9B for maximum harm-detection accuracy and pays the latency cost. (b) An enterprise sales-enablement chatbot with hundreds of custom policy rules picks NeMo Guardrails because Colang's declarative flows are easier to audit than a wall of Python. (c) A consumer chat app on a tight latency budget picks Guardrails AI with a small set of fast validators plus Llama Guard 1B as a backup. All three are correct choices given their constraints.

Key Insight

The four major output guardrail platforms each occupy a different design point: Llama Guard 3 (single classifier, MLCommons taxonomy, easy to integrate), NeMo Guardrails (programmable Colang DSL, best for complex policies), ShieldGemma (Google's family of classifier sizes, best raw accuracy at 9B), and Guardrails AI (Python-native validators, best for structural checks). Real deployments compose multiple platforms. The decision criteria are accuracy needs, policy complexity, latency budget, and engineering preference. Start with Llama Guard 3 plus a Pydantic validator, and add NeMo when policy logic grows beyond what Python can express cleanly.

Self-Check

Q1: Why must Llama Guard see the user turn, not just the assistant turn? What kind of attack would slip through if you only passed the assistant text?

Show Answer

Llama Guard classifies the entire interaction in context, and many unsafe-output cases are only unsafe given the user's request. If a user asks "how do I make a pipe bomb?" and the assistant replies "I can't help with that," the assistant text alone is benign and Llama Guard scores it safe; but if the user instead asks "what is the chemical formula for ammonium nitrate fuel oil?" and the assistant produces a thorough chemistry tutorial, the assistant text in isolation looks like benign chemistry instruction, and the user request is the signal that flips it to unsafe. Passing only the assistant turn allows attacks where the user request frames an otherwise neutral output as harmful (instruction-following attacks, jailbreaks that elicit borderline content), which the joint-context classifier catches but the output-only classifier does not.

Q2: You have a strict 100ms output-guardrail budget. Which of the four platforms is most likely to fit, and which is least likely?

Show Answer

Guardrails AI with Pydantic validators is most likely to fit because the validators are pure Python checks (regex, structural assertions, custom code) that run in single-digit milliseconds. Llama Guard 3 8B fits at the upper end of 100 ms only on GPU with batched inference; on CPU it overflows. ShieldGemma 9B is similar to Llama Guard 8B with comparable latency. NeMo Guardrails is least likely to fit because its Colang flows can invoke multiple LLM calls per check (intent classification, fact-checking, jailbreak detection), each of which is tens to hundreds of milliseconds; production NeMo deployments often run 300+ ms per check, far over budget. The practical compromise is to use Pydantic for the hot path and queue Llama Guard or NeMo for asynchronous post-response audit.

Q3: What is the difference between NeMo Guardrails' "input rail" and Guardrails AI's input validators? Are they redundant?

Show Answer

NeMo Guardrails' input rail is a Colang flow that runs before the main LLM call, typically invoking sub-LLMs to classify intent ("jailbreak attempt?", "off-topic?") and re-route accordingly. Guardrails AI's input validators are structural checks on the input string (PII detection, profanity, max length, schema conformance) that run in pure Python. The two are complementary, not redundant: Guardrails AI catches the cheap structural cases that do not need an LLM, and NeMo catches the semantic cases that do. A robust stack runs Guardrails AI first (one to ten milliseconds) to filter the obvious bad inputs, then NeMo (one hundred to three hundred milliseconds) only on the inputs that survived the first pass. Inverting the order wastes LLM calls on inputs that a regex would have rejected.

Q4: You stream tokens to the user. Halfway through a response, Llama Guard flags it as unsafe. What are your options, and what are the user-experience consequences of each?

Show Answer

Option one: cut the stream and replace with a refusal. The user sees half a response disappear and a refusal appear, which is jarring but unambiguous; this is the safest option and the default for high-stakes applications. Option two: cut the stream silently and append nothing. The user sees an incomplete response and assumes a network error; this is the worst UX because the user retries and may eventually elicit the same unsafe content from a fresh session. Option three: buffer the output entirely before flushing, which removes streaming and adds end-to-end latency proportional to response length; this is the right pattern for short-form applications where the latency cost is small. Option four: stream sentence-by-sentence and run Llama Guard at sentence boundaries, which preserves perceived streaming but bounds the unsafe-token leak to one sentence; this is the production sweet spot but requires sentence-boundary detection and a more elaborate orchestration layer.

What's Next

Continue to Section 48.4: Policy DSLs and Constrained Decoding as Safety.

Section 48.4 zooms into the policy-DSL and constrained-decoding side of safety. We will see how NeMo Colang, Outlines, and Guardrails AI Pydantic let you express safety as structure, refusing to generate anything that does not match a typed schema, and why constrained decoding is increasingly being used as a safety mechanism, not just an output-formatting one.

Further Reading

Meta AI (2024). Llama Guard 3 Model Card. Hugging Face, meta-llama/Llama-Guard-3-8B.

Rebedea, T., Dinu, R., Sreedhar, M., et al. (2023). NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. EMNLP 2023 System Demonstrations. arXiv:2310.10501.

Zeng, W., Liu, Y., Mullins, R., et al. (2024). ShieldGemma: Generative AI Content Moderation Based on Gemma. arXiv:2407.21772.

Guardrails AI (2024). Guardrails AI: Reliable AI in Python. https://www.guardrailsai.com/docs.

MLCommons (2024). MLCommons Hazards Taxonomy v0.5. https://mlcommons.org/working-groups/ai-safety/.

OpenAI (2024). OpenAI Moderation API Documentation. https://platform.openai.com/docs/guides/moderation.

Markov, T., Zhang, C., Agarwal, S., et al. (2023). A Holistic Approach to Undesired Content Detection in the Real World. AAAI 2023.