Models

Section 51.4

Safety models fall into two roles: classifiers (which decide whether a prompt or response is safe) and judges (which score harmfulness on a continuous scale).
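The difference between the two roles is easiest to see in their return types. The following is a minimal sketch, not any library's API: `Verdict`, `judge_to_classifier`, and `toy_judge` are hypothetical names introduced for illustration, and the blocklist scorer is only a stand-in for a real model. It also shows that a judge can be turned into a classifier by picking a threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Verdict:
    """Output of a categorical classifier: a decision plus an optional category."""
    safe: bool
    category: Optional[str] = None


# A classifier maps text to a Verdict; a judge maps text to a harm score in [0, 1].
Classifier = Callable[[str], Verdict]
Judge = Callable[[str], float]


def judge_to_classifier(judge: Judge, threshold: float) -> Classifier:
    """Turn a continuous judge into a binary classifier by thresholding its score."""
    def classify(text: str) -> Verdict:
        score = judge(text)
        if score >= threshold:
            return Verdict(safe=False, category="above_threshold")
        return Verdict(safe=True)
    return classify


def toy_judge(text: str) -> float:
    """Hypothetical stand-in for a real judge: fraction of words on a tiny blocklist."""
    blocklist = {"exploit", "weapon"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in blocklist for w in words) / len(words)
```

The reverse conversion is lossy: a verdict cannot be turned back into a score, which is why reranking needs a model that exposes the continuous value.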

Figure 51.4.1: Two families of safety models. Categorical classifiers (Llama Guard 3, Prompt Guard, ShieldGemma, OpenAI Moderation) return a verdict and category. Scalar reward and constitutional models (Skywork-Reward, Anthropic CAI methodology) return a continuous score and feed best-of-N reranking. The bottom-right box shows the canonical production stack: Prompt Guard pre-filter, model generation, Llama Guard post-classification, Skywork reranking when latency budget allows.
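The last two stages of that stack, post-classification followed by reranking "when latency budget allows", can be sketched as follows. This is an illustration under stated assumptions: `pick_response`, `is_safe`, and `reward` are hypothetical names, with `is_safe` standing in for a categorical post-classifier and `reward` for a scalar reward model where a higher score is better. No real model API is called.

```python
import time
from typing import Callable, List, Optional


def pick_response(
    candidates: List[str],
    is_safe: Callable[[str], bool],
    reward: Callable[[str], float],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the best safe candidate, reranking only while the budget lasts.

    is_safe and reward are hypothetical callables wrapping a classifier and
    a reward model; budget_s is the time allowed for reranking, in seconds.
    """
    deadline = clock() + budget_s
    safe = [c for c in candidates if is_safe(c)]
    if not safe:
        return None  # every candidate was rejected; the caller should refuse
    best, best_score = safe[0], None
    for cand in safe:
        if clock() >= deadline:
            break  # out of budget: keep the best candidate scored so far
        score = reward(cand)
        if best_score is None or score > best_score:
            best, best_score = cand, score
    return best
```

With a zero budget the function degrades to "first candidate that passes the classifier", so the safety check is never skipped even when reranking is.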

51.4.1 Safety classifier models

51.4.2 Constitutional / reward models for safety

51.4.3 Comparing the models

Table 51.4.1a: Safety models (2026).

| Model | Role | Open | Best for |
| --- | --- | --- | --- |
| Llama Guard 3 | Multi-category classifier | Yes | Self-hosted moderation |
| Prompt Guard | Injection / jailbreak | Yes | Input filtering |
| ShieldGemma | Multi-category | Yes | Small-footprint deployment |
| OpenAI Moderation | Multi-category | No (API) | Quick safety net |
| Skywork-Reward | Continuous score | Yes | Reward-model-based filtering |

Note: Layered defense

A production safety stack usually runs these layers in order: input pre-classification (Prompt Guard or Llama Guard) -> LLM call -> output post-classification (Llama Guard) -> policy check (NeMo Guardrails) -> logging. Any single layer can be bypassed; an attack has to get past every layer to succeed, so the stack as a whole is much harder to defeat.
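The ordering of the layers can be sketched as a small pipeline. This is a hedged illustration, not the API of any of the tools named above: `SafetyStack`, its four fields, and `REFUSAL` are hypothetical, and each field is a plain callable that a real deployment would back with a classifier, an LLM client, or a policy engine.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("safety_stack")

REFUSAL = "Sorry, I can't help with that."


@dataclass
class SafetyStack:
    """Each field is a hypothetical callable wrapping one layer of the stack."""
    check_input: Callable[[str], bool]        # True if the prompt is allowed
    generate: Callable[[str], str]            # the LLM call
    check_output: Callable[[str, str], bool]  # True if (prompt, response) is allowed
    check_policy: Callable[[str], bool]       # True if the response meets app policy

    def run(self, prompt: str) -> str:
        blocked_by: Optional[str] = None
        response = REFUSAL
        if not self.check_input(prompt):
            blocked_by = "input"
        else:
            draft = self.generate(prompt)
            if not self.check_output(prompt, draft):
                blocked_by = "output"
            elif not self.check_policy(draft):
                blocked_by = "policy"
            else:
                response = draft
        # Logging runs on every path so blocked requests remain auditable.
        logger.info("blocked_by=%s prompt_chars=%d", blocked_by, len(prompt))
        return response
```

A blocked input never reaches `generate`, which saves the cost of the LLM call, and recording which layer blocked a request shows where bypass attempts are concentrating.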

What's Next?

In the next section, Section 51.5: External Reading & Communities, we build on the material covered here.

Further Reading

Safety Models

Inan, H., Upasani, K., Chi, J., et al. (2023). "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations." arXiv:2312.06674. Reference open-source safety classifier.
OpenAI (2022). "Moderation API." platform.openai.com/docs/guides/moderation. Reference safety-classification API.
Markov, T., Zhang, C., Agarwal, S., et al. (2023). "A Holistic Approach to Undesired Content Detection in the Real World." AAAI 2023. arXiv:2208.03274. The methodology paper behind OpenAI's Moderation API; the reference for designing taxonomies and training data for safety classifiers.
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. Anthropic's paper on training harmless assistants from AI feedback rather than human harm labels; the technique behind Claude's harmlessness training and a reference for AI-feedback-based safety alignment.
Lees, A., Tran, V. Q., Tay, Y., et al. (2022). "A New Generation of Perspective API: Efficient Multilingual Character-level Transformers." KDD 2022. arXiv:2202.11176. Google Jigsaw's Perspective API for toxicity scoring; the canonical reference for multilingual content-moderation models.