Section 56.4

Models

"Every production LLM has a tiny chaperone model riding shotgun, distilled small enough to fit the latency budget and bored enough to flag itself."

DistillDistill, Sub-100ms Guardrail AI Agent
Big Picture

Responsible AI is itself increasingly powered by purpose-built models, which split into five families: safety classifiers (Llama Guard 3 and 4, Granite Guardian, ShieldGemma, NeMo Guardrails-bundled classifiers, OpenAI Moderation, Azure Content Safety, AWS Bedrock Guardrail classifiers) that score prompts and responses across harm categories; bias and toxicity detectors (Perspective API, Detoxify, HateBERT, ToxiGen-RoBERTa, Honest, RoBERTa-Hate-Speech-Dynabench) for classifying toxic, biased, or hateful content; watermark and AI-detection models (SynthID-Text detector, retired OpenAI text-classifier, Originality.ai, GPTZero, DetectGPT-family, Binoculars) for AI-content attribution; aligned and constitutional base models (Claude family with Constitutional AI, Llama-3 with Llama Guard, OLMo with attributions, Gemma 2 / 3 with ShieldGemma) that ship with safety as a first-class concern; and interpretability-oriented models (Gemma-Scope sparse autoencoders, Anthropic Crosscoder series, demo-scale models in TransformerLens, SAE Lens releases) that exist specifically to support mechanistic interpretability research. This section catalogs each with vendor URLs, release-date context, and pick-when guidance.

Prerequisites

This section assumes the LLM model zoo from Section 14.4, the LLM-safety and constitutional-AI patterns from Section 49.2, and the watermark-detection techniques from Section 54.2.

Models for responsible AI fall into two operational positions: inline models that run alongside the main LLM at request and response time (Llama Guard scanning every output) and offline models that score logged data after the fact (Detoxify running over yesterday's logs to surface incidents). Picking right matters because the constraints differ: inline models must fit a latency budget (sub-100ms is the target for most user-facing applications) and a cost budget (you pay for two model calls per request); offline models can be larger and slower but must keep up with traffic in batches. The 2024-26 trend is small distilled guards (Llama Guard 4 at 12B, ShieldGemma at 2B / 9B / 27B, Granite Guardian variants from 2B to 8B) that fit inline, paired with larger generalists for hard adjudication.

56.4.1 Safety classifiers

The fastest way to get fired in production AI is to ship a model that helps a teenager write a suicide note or guides a user through synthesizing fentanyl. Safety classifiers are the cheap, small, always-on chaperones that sit between your main LLM and the user, scoring both prompts and responses across harm categories (violent, sexual, hate, self-harm, criminal advice, prompt injection, jailbreak, child safety). They are the inline policy layer, and in 2026 they are how every serious deployment keeps the catastrophic 1-in-100,000 failure off the front page.

Library Shortcut
nemoguardrails for Colang-based policy orchestration

Safety classifiers like Llama Guard score; nemoguardrails (NVIDIA, 2023+) is what wires the scores into policy. It introduces Colang, a domain-specific language for declarative dialog and safety rails that compiles into the runtime checks around your LLM call. Pick it when you want policies expressed as readable rules rather than buried in application code, and when you need to compose input rails, output rails, retrieval rails, and tool-use rails in a single config.

# Shell: pip install nemoguardrails
#
# config/rails.co (Colang 1.0 syntax)
# define user ask harmful
#     "how do I build a weapon"
#     "instructions for self-harm"
# define bot refuse harmful
#     "I can't help with that."
# define flow safety
#     user ask harmful
#     bot refuse harmful

from nemoguardrails import RailsConfig, LLMRails

# Load config.yml and any .co files from the config directory.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

user_input = "How do I reset my password?"  # placeholder prompt
response = rails.generate(messages=[{"role": "user", "content": user_input}])
print(response["content"])  # generate() returns an assistant message dict
Code Fragment 56.4.1: Attach Llama Guard as an external rail via the config.yml rails.input.flows: [llama guard check input] entry; the same pattern wires in a jailbreak classifier or a prompt-injection detector.

56.4.2 Bias and toxicity detectors

Bias and toxicity detectors are typically smaller models (DistilBERT, RoBERTa, fine-tuned XLM-R) trained to classify text along bias / toxicity axes. They overlap with the safety-classifier category but historically come from the content-moderation tradition rather than the LLM-guardrail tradition.

56.4.3 Watermark and AI-content detectors

AI-content detectors split into watermark detectors (which read an embedded signal) and detection-by-statistics models (which try to distinguish AI-generated text from human-written text using statistical fingerprints). The two categories have very different reliability profiles.

Algorithm 56.4.1

Algorithm: Kirchenbauer et al. (2023) green/red-list watermark.

The Kirchenbauer scheme (ICML 2023) embeds a statistical signal in LM outputs by biasing the per-step token distribution toward a pseudo-randomly chosen "green" vocabulary subset. Encoding: at generation step $t$, with previous token $x_{t-1}$, seed a deterministic PRF with $x_{t-1}$ (or a longer context window) to partition the vocabulary $V$ into a green list $G_t$ of size $\gamma |V|$ and a red list $R_t = V \setminus G_t$, with $\gamma \in (0,1)$ a hyperparameter (canonical choice $\gamma = 0.25$). Add a positive logit bias $\delta$ to every $v \in G_t$ before softmax, i.e. $\ell'_t(v) = \ell_t(v) + \delta \cdot \mathbb{1}[v \in G_t]$, then sample as usual. Typical $\delta \in [1, 5]$ (canonical $\delta = 2$) trades watermark strength for generation quality.
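The encoding step can be sketched in a few lines of pure Python. This is an illustrative sketch, not the reference implementation: the SHA-256-based PRF, the `demo-key` secret, the toy 1000-token vocabulary, and the function names `green_list` and `watermarked_probs` are all assumptions made for the example.

```python
import hashlib
import math
import random

def green_list(prev_token: int, vocab_size: int, gamma: float,
               key: bytes = b"demo-key") -> set[int]:
    """Pseudo-randomly pick gamma * |V| green token ids, seeded by the previous token."""
    digest = hashlib.sha256(key + prev_token.to_bytes(8, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ids = list(range(vocab_size))
    rng.shuffle(ids)  # deterministic for a fixed (key, prev_token)
    return set(ids[: int(gamma * vocab_size)])

def watermarked_probs(logits: list[float], prev_token: int,
                      gamma: float = 0.25, delta: float = 2.0) -> list[float]:
    """Add delta to every green-token logit, then apply a numerically stable softmax."""
    green = green_list(prev_token, len(logits), gamma)
    biased = [l + delta if i in green else l for i, l in enumerate(logits)]
    m = max(biased)
    exps = [math.exp(b - m) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: uniform logits over a 1000-token vocabulary.
probs = watermarked_probs([0.0] * 1000, prev_token=42)
green = green_list(42, 1000, 0.25)
green_mass = sum(probs[i] for i in green)  # probability mass on green tokens
```

With uniform logits, the bias lifts the green-list probability mass from $\gamma = 0.25$ to $e^{2} / (e^{2} + 3) \approx 0.71$, which is the shift the detector later measures.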

Detection: a verifier holding the PRF key replays the green/red partitioning over a candidate sequence of length $T$, counts the number of green tokens $|s|_G$, and computes a one-sample $z$-statistic for the null "tokens are drawn from a $\gamma$-fraction green list by chance":

$$z = \frac{|s|_G - \gamma T}{\sqrt{T \gamma (1 - \gamma)}}.$$

Under the null, $z$ is approximately standard normal; the verifier rejects (declares the text watermarked) when $z > z^*$ for a chosen significance threshold (e.g., $z^* = 4$, a one-sided $p \approx 3 \times 10^{-5}$). For $\gamma = 0.25$, $\delta = 2$, $T = 200$ tokens, a watermarked sequence typically yields $z \in [6, 12]$ while human text yields $|z| < 2$, giving large separation. The scheme requires no model retraining and can be applied to any open-source LM at inference; SynthID-Text generalizes the idea with tournament sampling to reduce the quality penalty. The key vulnerability is paraphrasing: rewriting the text disrupts the previous-token seed and dilutes $|s|_G$, motivating the impossibility analysis below.
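As a worked instance of the detection statistic, the sketch below evaluates $z$ and a one-sided normal tail probability. The green-token counts (110 and 54 out of 200) are made-up illustrative values, and the helper names are hypothetical.

```python
import math

def watermark_z(green_count: int, total: int, gamma: float = 0.25) -> float:
    """One-sample z-statistic for the green-token count under the no-watermark null."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))

def one_sided_p(z: float) -> float:
    """Upper-tail probability of a standard normal at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Illustrative counts for T = 200 tokens and gamma = 0.25 (expected 50 green by chance).
z_marked = watermark_z(green_count=110, total=200)  # roughly 9.8
z_human = watermark_z(green_count=54, total=200)    # roughly 0.65
p_at_threshold = one_sided_p(4.0)                   # roughly 3.2e-5
```

The first count lands inside the $[6, 12]$ band quoted for watermarked text and the second well inside $|z| < 2$, so a threshold of $z^* = 4$ separates them cleanly.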

Key Insight
Why statistical AI-detection asymptotes to a coin flip (Sadasivan et al. 2023)

Sadasivan, Kumar, Balasubramanian, Wang, and Feizi (arXiv:2303.11156) formalized why any classifier distinguishing AI text $p_{\text{AI}}$ from human text $p_{\text{human}}$ is bounded by the statistical distance between the two distributions. By Neyman-Pearson, the optimal detector at false-positive rate $\alpha$ achieves true-positive rate at most $\min(1, \alpha + \|p_{\text{AI}} - p_{\text{human}}\|_{\text{TV}})$, where $\|\cdot\|_{\text{TV}}$ is total-variation distance. Integrating this bound over $\alpha \in [0, 1]$ and writing $d = \|p_{\text{AI}} - p_{\text{human}}\|_{\text{TV}}$, the AUC of the best possible detector satisfies $\text{AUC} \le \frac{1}{2} + d - \frac{d^2}{2}$, so as the AI distribution approaches the human distribution, $d \to 0$ forces $\text{AUC} \to 0.5$, the coin-flip baseline.

Modern LLMs are trained with next-token cross-entropy, which minimizes the KL divergence between the human-text distribution and the model and therefore, by Pinsker's inequality, also shrinks $\|p_{\text{AI}} - p_{\text{human}}\|_{\text{TV}}$; this is exactly the regime where the bound bites. Paraphrasing pushes AI text even closer to $p_{\text{human}}$ in TV distance, lowering detector AUC again; Sadasivan et al. demonstrate this empirically across DetectGPT, OpenAI's classifier, and zero-shot perplexity detectors. The implication for governance: retroactive AI detection on arbitrary text is fundamentally limited, while proactive watermarking (Kirchenbauer, SynthID-Text) circumvents the bound by introducing a distribution shift the verifier knows about. This is why the field's 2024-26 consensus pivoted from statistical detection to watermarking plus C2PA provenance, and why the OpenAI Text Classifier was retired rather than improved.
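To see how quickly the ceiling falls, the sketch below numerically integrates the true-positive-rate bound $\min(1, \alpha + d)$ over the false-positive rate $\alpha$, where $d$ is the total-variation distance. The closed form that results, $\frac{1}{2} + d - \frac{d^2}{2}$, is the AUROC bound stated in Sadasivan et al.; the function names and the chosen values of $d$ are illustrative assumptions.

```python
def tpr_upper_bound(alpha: float, tv: float) -> float:
    """Best achievable true-positive rate at false-positive rate alpha."""
    return min(1.0, alpha + tv)

def auc_upper_bound(tv: float, steps: int = 10_000) -> float:
    """Trapezoid-rule area under the best-case ROC curve."""
    h = 1.0 / steps
    area = 0.0
    for i in range(steps):
        a, b = i * h, (i + 1) * h
        area += 0.5 * h * (tpr_upper_bound(a, tv) + tpr_upper_bound(b, tv))
    return area

# Ceiling on detector AUC for a few total-variation distances.
bounds = {tv: round(auc_upper_bound(tv), 3) for tv in (0.0, 0.1, 0.5, 1.0)}
```

At $d = 0.1$ the best possible detector is capped at an AUC of about 0.595, which is why detectors degrade so sharply once model output is close to human text.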

Warning
Statistical AI-detection is fundamentally unreliable in 2026

The published literature (Sadasivan et al., 2023; Liang et al., 2023 on bias against non-native English writers) and OpenAI's retirement of its Text Classifier converge on a single conclusion: statistical AI-text detection has high false-positive rates that disproportionately affect non-native English writers and short texts, and is easily evaded by paraphrasing. Watermark-based detection (SynthID-Text, Kirchenbauer-style green-list watermarks) is more reliable but only works on content from models that participated in watermarking. The right policy posture in 2026 is to: (a) prefer watermarks over statistical detection where possible; (b) never use AI-detection alone to make consequential decisions (academic discipline, employment); (c) require human review of all positive detections; and (d) communicate detector limitations to users who might face decisions based on the score.

56.4.4 Aligned and constitutional base models

Aligned base models ship with safety as a first-class property: trained with RLHF or constitutional methods, paired with safety classifiers, documented with model cards and red-team reports. They are the "models you should deploy as the LLM in your application" when responsibility is a constraint.

56.4.5 Interpretability-oriented models

Interpretability-oriented models exist specifically to support mechanistic-interpretability and circuit-level safety research. They are usually smaller than frontier production models but trained with extra instrumentation, sparse autoencoder features, or open activations.

Responsible-AI models on latency and parameter-count axes
Figure 56.4.1a: Responsible-AI models from this section plotted on parameter count (vertical) versus inline latency budget (horizontal, log scale). The green inline region (sub-200 ms) is where Llama Guard 4 12B, Granite Guardian 8B, and ShieldGemma 9B compete for the input-and-output guard slot; the 100 ms gate marks the typical user-facing input-side budget where 2B variants and distilled encoder classifiers dominate. The amber offline region is where mid-size and frontier LLMs (Claude Haiku for adjudication, Sonnet and Opus for judge-mode policy review) live, decoupled from the user-facing path. This is the "boring-but-correct" defense-in-depth split the section recommends: cheap fast guards inline, frontier judges offline.

56.4.6 Models by deployment position

Table 56.4.1b: Responsible-AI models by deployment position (mid-2026).
| Position | Use case | Canonical pick | Latency budget |
| --- | --- | --- | --- |
| Inline guard (input) | Block harmful prompts before they reach the LLM | Llama Guard 4 (12B), ShieldGemma (2B/9B), OpenAI Moderation | <100ms |
| Inline guard (output) | Block harmful LLM responses before they reach the user | Llama Guard 4 (12B), Granite Guardian, Azure Content Safety | <200ms |
| RAG-specific guard | Detect groundedness failures, context relevance | Granite Guardian (RAG-aware), Bedrock Guardrails grounding | <200ms |
| Offline toxicity scan | Post-hoc analysis of conversation logs | Detoxify, Perspective API, ToxiGen-RoBERTa | Batch / minutes |
| Watermark detection | Verify AI-generation provenance | SynthID-Text detector (with key), Kirchenbauer wmark | <100ms |
| Statistical AI detection | Detect AI-generated text without watermark | Binoculars, Originality.ai (treat as advisory) | 1-5s |
| Frontier aligned base | The main LLM in a high-stakes application | Claude Opus 4.5 / Sonnet 4.5, Llama 4 + Llama Guard, OLMo for transparency | 1-5s |
| Interpretability research | SAE-feature analysis, circuit reverse-engineering | Gemma-Scope, Pythia, Anthropic SAEs (limited access) | Research / not production |

56.4.7 A canonical 2026 model stack

Real-World Scenario
A boring-but-correct 2026 responsible-AI model stack

Who: A platform team standing up the responsible-AI model layer for a typical 2026 enterprise LLM application.

Situation: The application served user-facing prompts subject to internal policy, regulator expectations (NIST AI RMF, EU AI Act preparatory work), and the team's own incident-response obligations.

Problem: No single model covered every responsibility (refusals, input filtering, output filtering, toxicity scoring, watermarking, interpretability), and ad-hoc choices accumulated incompatible dependencies across services.

Dilemma: Adopt a single-vendor full-stack solution (simple but locked-in and weak on the parts outside the vendor's core competence) or compose a multi-vendor stack (more wiring, but each layer can be best-of-breed).

Decision: They settled on a multi-vendor "boring-but-correct" composition rather than a single-vendor bundle.

How: The stack was: Claude Opus 4.5 or Sonnet 4.5 (or Llama 4 + Llama Guard for self-hosted) as the main LLM, Llama Guard 4 or Granite Guardian as the inline input-and-output guard, OpenAI Moderation or Azure AI Content Safety as a defense-in-depth second filter (cheap, broad coverage), Detoxify or Perspective API in the offline log-scanning pipeline, Perspective API or ToxiGen-RoBERTa for fairness-audit toxicity scoring, SynthID-Text watermarking + detector for provenance if the model supports it, and (for research and incident investigation) Gemma-Scope or SAE-Lens releases alongside TransformerLens for mechanistic interpretability.

Result: Coverage spanned inline enforcement, offline audit, provenance, and interpretability without any single layer overreaching its competence, and governance evidence flowed naturally out of the existing log pipeline.

Lesson: The wins in 2026 responsible-AI stacks are mostly in wiring the model outputs into governance evidence and incident response, not in any single model, so favor composable defense-in-depth over single-vendor bundles.

Key Insight
Defense-in-depth is the operative pattern, not single-model perfection

No single classifier catches every violation. The 2026 production pattern is layered: a cheap fast filter (OpenAI Moderation) catches the obvious; a stronger classifier (Llama Guard, Granite Guardian) catches the subtler; for the highest-stakes outputs, an LLM-as-judge (Claude Haiku, Llama 4 in judge mode) adjudicates remaining ambiguous cases. If the stages err independently, the aggregate false-negative rate is the product of each stage's false-negative rate; in practice stages tend to miss the same hard cases, so treat the product as an optimistic lower bound. The aggregate latency is the sum when stages run sequentially, and false-positive refusals accumulate across stages too. Tune the stages to your latency budget by skipping the stronger ones at low-risk endpoints (e.g., factual QA) and running the full stack at high-risk ones (e.g., medical advice, financial transactions).
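The arithmetic behind a layered stack can be made concrete with a short calculation. Every number below is hypothetical, and the sketch assumes the stages err independently and that every item passes through every stage in sequence, which real cascades only approximate.

```python
from math import prod

# Hypothetical per-stage error rates and latencies; not measurements of any real model.
stages = [
    {"name": "fast filter", "fnr": 0.20, "fpr": 0.01, "latency_ms": 30},
    {"name": "guard model", "fnr": 0.10, "fpr": 0.02, "latency_ms": 120},
    {"name": "LLM judge",   "fnr": 0.05, "fpr": 0.01, "latency_ms": 800},
]

def aggregate(stages):
    """Combine stages under an independence assumption."""
    fnr = prod(s["fnr"] for s in stages)            # a violation slips through only if every stage misses it
    fpr = 1 - prod(1 - s["fpr"] for s in stages)    # a benign item is blocked if any stage flags it
    latency = sum(s["latency_ms"] for s in stages)  # stages run one after another
    return fnr, fpr, latency

full_fnr, full_fpr, full_latency = aggregate(stages)
inline_fnr, inline_fpr, inline_latency = aggregate(stages[:2])  # skip the judge at low-risk endpoints
```

Dropping the judge in this toy setup cuts latency from 950 ms to 150 ms but raises the miss rate from 0.1 percent to 2 percent, which is the trade-off the endpoint-by-endpoint tuning has to weigh.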

Key Insight
"Aligned" means "aligned to a written policy", not "aligned to your team's intent"

Frontier-aligned models (Claude, Llama with Llama Guard, Gemma with ShieldGemma) are aligned to their training labs' policies, which may differ from yours. A model trained to refuse "violent content" may refuse content your application legitimately needs (medical descriptions of injuries, security-research-relevant attack discussions). The right way to handle this in 2026 is to (a) audit the model card and refusal policy of any aligned model before adopting; (b) plan for refusal-handling and contextual exceptions in your application logic; and (c) if the lab's policy materially conflicts with your use case, switch models rather than try to jailbreak around the policy.

Production Pattern: Inline LLM Guard + Offline Eval

The dominant 2026 deployment pattern decomposes responsible-AI enforcement into two latency-decoupled stages. Inline stage (synchronous, every request): the user prompt passes through a fast input guard ($G_{\text{in}}$ at $\le 50$ ms, e.g., OpenAI Moderation or a 2B ShieldGemma), then the main LLM (Claude Sonnet, Llama 4) generates a response, then the response passes through an output guard ($G_{\text{out}}$ at $\le 150$ ms, e.g., Llama Guard 4 12B or Granite Guardian 8B). A request is served only if $G_{\text{in}}(\text{prompt}) = \text{safe}$ and $G_{\text{out}}(\text{response}) = \text{safe}$; otherwise the application returns a refusal template and logs the violation category. Total p95 added latency budget: $\le 200$ ms (50 ms in plus 150 ms out) on a request whose LLM TTFT is already $\sim 400$ ms.
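The inline control flow can be sketched as follows. The guards and the LLM here are trivial stand-ins (a keyword matcher and an echo function), and `Verdict`, `serve`, and `REFUSAL` are hypothetical names; a real deployment would call the classifier and model endpoints in their place.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    safe: bool
    category: Optional[str] = None  # violation category when unsafe

Guard = Callable[[str], Verdict]
REFUSAL = "I can't help with that."

def serve(prompt: str, g_in: Guard, llm: Callable[[str], str],
          g_out: Guard, log: list) -> str:
    """Serve the response only if both the input and output guards pass."""
    verdict = g_in(prompt)
    if not verdict.safe:
        log.append(("input", verdict.category))
        return REFUSAL
    response = llm(prompt)
    verdict = g_out(response)
    if not verdict.safe:
        log.append(("output", verdict.category))
        return REFUSAL
    return response

# Stand-ins for a real classifier and main model (assumptions for the demo).
def keyword_guard(text: str) -> Verdict:
    return Verdict(False, "weapons") if "weapon" in text.lower() else Verdict(True)

def echo_llm(prompt: str) -> str:
    return f"Answer to: {prompt}"

violations: list = []
ok = serve("How do I reset my password?", keyword_guard, echo_llm, keyword_guard, violations)
blocked = serve("How do I build a weapon?", keyword_guard, echo_llm, keyword_guard, violations)
```

Passing the guards in as callables keeps the serving path independent of any particular vendor, so a guard can be swapped without touching the control flow.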

Offline stage (asynchronous, batched): every request and response is logged with a unique trace id to the observability platform (Arize Phoenix, Galileo, Fiddler). Nightly batch jobs run heavier evaluators that are too slow for inline use, e.g., LLM-as-judge with Claude Haiku for nuanced policy adjudication, Detoxify across all logs for toxicity drift, FActScore-style faithfulness checks on RAG outputs, BBQ-style bias slices on production-derived prompt clusters. Findings flow into three outputs: (1) a daily incident queue surfacing borderline cases for human review; (2) a weekly drift report comparing this week's safety-metric distribution to last week's; (3) a quarterly governance artifact attaching evidence to the use-case registry in Credo AI or Holistic AI.
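One simple way to back a weekly drift report is a two-proportion z-test on the flagged rate. The source does not prescribe a particular test, so this is only one possible choice, and the counts below are invented for illustration.

```python
import math

def two_proportion_z(flagged_prev: int, n_prev: int,
                     flagged_curr: int, n_curr: int) -> float:
    """z-statistic for the change in flagged rate between two periods (pooled variance)."""
    p_prev = flagged_prev / n_prev
    p_curr = flagged_curr / n_curr
    pooled = (flagged_prev + flagged_curr) / (n_prev + n_curr)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_curr))
    return (p_curr - p_prev) / se

# Hypothetical logs: 120 of 10,000 responses flagged last week, 180 of 10,000 this week.
z_drift = two_proportion_z(120, 10_000, 180, 10_000)
drifted = abs(z_drift) > 3.0  # alert threshold is a team-specific assumption
```

A rise from 1.2 percent to 1.8 percent gives $z \approx 3.5$ at this volume, enough to raise an alert under the assumed threshold, whereas the same percentages on a few hundred requests would not.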

The pattern's invariant is the asymmetry of budgets: inline must be cheap and fast (false-positive refusal is a UX cost), offline can be slow and thorough (false-negative containment is the goal). This decoupling lets teams use small distilled guards inline and frontier-LLM judges offline without either constraint compromising the other.

56.4.8 Licensing and deployment considerations

Real-World Scenario
A fintech adopts a three-model safety stack

A fintech that deployed an LLM-powered customer-support agent in 2025 reported the following stack as its final deployment after a six-month pilot. The main LLM was Claude Sonnet 4.5 (chosen for tool-use reliability and Constitutional-AI safety properties). The inline guard was Llama Guard 3, self-hosted (chosen for vendor-neutrality and the published taxonomy). The defense-in-depth second filter was OpenAI Moderation (free, broad coverage, minor latency overhead). Offline log scanning used Detoxify (open-source, batchable across the entire conversation corpus). The internal red team built a custom 800-prompt benchmark from incident logs that ran in CI on every release. The deciding factor for the layered design was the company's model-risk-management policy requiring that "no single model judges its own safety", forcing defense-in-depth across vendors. This three-vendor pattern is becoming common in regulated industries where SR 11-7 and similar frameworks expect independence between the production model and its judge.

56.4.9 Model evaluation checklist

The questions to ask when adopting a responsible-AI model (safety classifier, bias detector, watermark, aligned base, interpretability release):

Key Insight: Aha Moment: The Italian-Refusal Bug

The Llama Guard 2 release (April 2024) published an F1 of 0.85 on the MLCommons English harm taxonomy. A Cohere customer deploying it to a multilingual European customer-service bot ran the same model on Italian-language traffic and saw the false-positive refusal rate jump from 4 percent on English to 38 percent on Italian. Benign Italian-language complaints about food quality ("la pasta era disgustosa, voglio un rimborso") were flagged as "non-violent crime" because the training data was 94 percent English and the classifier's representation of "disgust" was anchored to English idioms used in actual threats. Same model, same prompt template, same threshold, and a roughly 10x false-positive shift caused by changing one input attribute (language). The checklist item below on multilingual coverage exists because it was the single most-skipped question in 2024 procurement decisions, and Llama Guard 2's reception in non-English deployments was the proof.
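A per-language slice of the false-positive rate is cheap to compute before procurement. The sketch below uses synthetic records that simply mirror the 4 percent and 38 percent figures above; the record layout and function name are assumptions, and real labels would come from a held-out, human-annotated sample of production traffic.

```python
from collections import defaultdict

def false_positive_rate_by_language(records):
    """records: iterable of (language, is_harmful, was_flagged) tuples."""
    benign = defaultdict(int)
    flagged_benign = defaultdict(int)
    for lang, is_harmful, was_flagged in records:
        if not is_harmful:  # false positives are defined on benign items only
            benign[lang] += 1
            flagged_benign[lang] += int(was_flagged)
    return {lang: flagged_benign[lang] / benign[lang] for lang in benign}

# Synthetic benign traffic: 100 English and 100 Italian messages.
records = (
    [("en", False, True)] * 4 + [("en", False, False)] * 96
    + [("it", False, True)] * 38 + [("it", False, False)] * 62
)
rates = false_positive_rate_by_language(records)
```

Reporting the rate per language rather than in aggregate is what exposes the gap: pooled together, the same data would show a 21 percent false-positive rate and hide which users bear it.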

A team that asks these questions during evaluation usually picks a different model than a team that picks based on the published benchmark numbers alone.

What's Next?

In the next section, Section 56.5: External Reading and Communities, we build on the material covered here.

Further Reading
Inan, H., et al. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv:2312.06674. arxiv.org/abs/2312.06674. The Llama Guard 1 paper and the prompt-format-instructed safety-classifier pattern that Llama Guard 3 / 4, Granite Guardian, and ShieldGemma all extend.
Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. arxiv.org/abs/2212.08073. The Constitutional AI paper introducing the principle-based training method behind the Claude family.
Templeton, A., et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic Research. transformer-circuits.pub/2024/scaling-monosemanticity. The frontier-scale sparse-autoencoder paper that set the direction for interpretability-oriented model releases including Gemma-Scope.
Sadasivan, V. S., et al. (2023). "Can AI-Generated Text be Reliably Detected?" arXiv:2303.11156. arxiv.org/abs/2303.11156. The theoretical foundation for the field-wide conclusion that statistical AI-detection is fundamentally limited; widely cited in discussions of the retirement of the OpenAI Text Classifier.
Liang, W., et al. (2023). "GPT detectors are biased against non-native English writers." Patterns. arxiv.org/abs/2304.02819. The bias-in-AI-detection study that contributed to the policy pivot from statistical detection to watermark-based detection.
Groeneveld, D., et al. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838. arxiv.org/abs/2402.00838. The OLMo paper and the open-data approach (Dolma) that makes OLMo the canonical training-data-transparent model in 2026.