Section 51.4: Models

Safety models fall into two roles: classifiers (which decide whether a prompt or response is safe) and judges (which score harmfulness on a continuous scale).

**Figure 51.4.1**: Two families of safety models. Categorical classifiers (Llama Guard 3, Prompt Guard, ShieldGemma, OpenAI Moderation) return a verdict and category. Scalar reward and constitutional models (Skywork-Reward, Anthropic CAI methodology) return a continuous score and feed best-of-N reranking. The bottom-right box shows the canonical production stack: Prompt Guard pre-filter, model generation, Llama Guard post-classification, Skywork reranking when latency budget allows.

51.4.1 Safety classifier models

Llama Guard 3 (Meta, 2024; 8B params) is Meta's open-weights safety classifier trained on the MLCommons AI Safety taxonomy across 13 hazard categories. Its objective is to provide a self-hostable safety classifier of similar quality to OpenAI Moderation, which matters when input or output classification must happen on your infrastructure. The core concept is per-category classification of either prompt or response (or both jointly). Pick Llama Guard 3 as your default open-weights safety classifier in 2026; for smaller-footprint deployments, Llama Guard 3 1B or ShieldGemma are alternatives.
Prompt Guard (Meta, 2024; 86M params) is Meta's small prompt-injection and jailbreak classifier specifically trained to detect attempts to override system prompts. Its objective is to provide an extremely fast (~5 ms on CPU) first-line filter for adversarial inputs, which matters as a pre-filter before more expensive safety classifiers. The core concept is a tiny DeBERTa-shaped classifier with three classes (benign, injection, jailbreak). Pick Prompt Guard as the pre-LLM input pre-filter; pair with Llama Guard for output post-classification.
OpenAI Moderation (OpenAI, 2022) is the free hosted classifier across 13 categories; see Section 51.1 for the platform-level description. Pick when you want zero-setup safety at zero cost; pair with Prompt Guard or Llama Guard for self-hosted defense in depth.
ShieldGemma (Google, 2024; 2B, 9B, 27B variants) is Google's Gemma-based safety classifier covering harassment, hate speech, sexually explicit, and dangerous content categories. Its objective is to provide an open Gemma-derived safety classifier that fits diverse deployment sizes, which matters when you want size flexibility (2B for edge, 27B for top quality). Pick ShieldGemma when you want size flexibility or for Gemma-ecosystem deployments; Llama Guard 3 has wider community adoption.

51.4.2 Constitutional / reward models for safety

Skywork-Reward (Skywork AI, 2024) is the open scalar reward model that returns a quality / safety score for (prompt, response) pairs. Its objective is to provide a fast scalar score for filtering large generation pools (best-of-N sampling, RL reward, candidate filtering), which matters when you need to score thousands of generations per minute. Pick Skywork-Reward when you need scalar-score filtering at scale; for category-specific classification, Llama Guard is more direct.
Anthropic Constitutional AI (Anthropic, 2022) is the methodology (not a downloadable model) for training models that critique and revise their own outputs against an explicit set of constitutional principles. Its objective is to produce safer models without per-example human labeling, which mattered as a cheaper alternative to RLHF on harmlessness. The core concept is "the model criticizes itself against the constitution" plus RL on those self-corrections. Adopt this methodology (integrated into Claude already) when training your own safety-aligned model.

51.4.3 Comparing the models

Table 51.4.1a: 39.4.1 Safety models (2026).

Model	Role	Open	Best for
Llama Guard 3	Multi-category classifier	Yes	Self-hosted moderation
Prompt Guard	Injection / jailbreak	Yes	Input filtering
ShieldGemma	Multi-category	Yes	Small-footprint deployment
OpenAI Moderation	Multi-category	No (API)	Quick safety net
Skywork-Reward	Continuous score	Yes	Reward-model-based filtering

Note: Layered defense

A production safety stack usually layers: input pre-classification (Prompt Guard or LlamaGuard) -> LLM call -> output post-classification (LlamaGuard) -> policy check (NeMo Guardrails) -> logging. Any single layer is bypassable; the stack is much harder to defeat.

What's Next?

In the next section, Section 51.5: External Reading & Communities, we build on the material covered here.

Further Reading

Safety Models

Inan, H., Upasani, K., Chi, J., et al. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv:2312.06674. Reference open-source safety classifier.

OpenAI (2022). "Moderation API." platform.openai.com/docs/guides/moderation. Reference safety-classification API.

Markov, T., Zhang, C., Agarwal, S., et al. (2023). "A Holistic Approach to Undesired Content Detection in the Real World." AAAI 2023. arXiv:2208.03274. The methodology paper behind OpenAI's Moderation API; the reference for designing taxonomies and training data for safety classifiers.

Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. Anthropic's paper on training safety with AI feedback; the underlying technique behind Claude's harmlessness training and a reference for self-supervised safety alignment.

Lees, A., Tran, V. Q., Tay, Y., et al. (2022). "A New Generation of Perspective API: Efficient Multilingual Character-level Transformers." KDD 2022. arXiv:2202.11176. Google Jigsaw's Perspective API for toxicity scoring; the canonical reference for multilingual content-moderation models.