
"Guardrails are the part of the system you only thank when they fail loudly."
Guard, Defense-In-Depth AI Agent
Chapters 47 through 50 covered the threat model. This chapter is the operational stack: NeMo Guardrails, Llama Guard, Granite Guardian, OpenAI Moderation, Lakera, Garak, and the day-to-day tooling that keeps an LLM product defensible.
Part X is the security and runtime safety part of the book (Part XI extends to ethics, trust, and governance). This chapter's toolbox is the moderation models (Llama Guard, OpenAI Moderation), the guardrail frameworks (NVIDIA NeMo Guardrails, Guardrails AI), the red-team toolkits (Garak, PyRIT), and the privacy libraries (Opacus, TF Privacy).
Chapter Overview
Part X covered adversarial security, guardrails, agent safety, and privacy. This chapter consolidates the safety and guardrails toolchain: the platforms (moderation APIs, red-team platforms, policy and compliance services), the libraries (guardrails frameworks, red-team toolkits, privacy-preserving training), the datasets and benchmarks (harmful-output, jailbreak, bias / fairness), the safety models (classifiers and judges), and the external literature and communities that keep the safety stack current.
Safety tooling is contested and fast-moving, but the platforms, libraries, and benchmarks listed here are the ones that have stabilized enough to ship products against in 2026.
- Compare moderation APIs and models (OpenAI Moderation, Llama Guard, Perspective API) for a given content policy.
- Wire guardrails frameworks (NeMo Guardrails, Guardrails AI) into a production LLM stack.
- Apply red-team toolkits (PyRIT, Garak) to a model release-gate evaluation.
- Choose a privacy-preserving training library for differential privacy or federated learning.
- Track the safety and security venues, blogs, and communities that maintain the canon.
The fastest way to start adding input-output safety filtering to an LLM call is to install Guardrails AI:
pip install guardrails-ai
Guardrails AI wraps validators around an LLM client call, checking the prompt on the way in and the response on the way out. For self-hosted moderation, run Llama Guard 3 behind vLLM. For red-teaming, Garak is a widely used open-source vulnerability scanner.
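The wrapping pattern can be sketched in plain Python. This is a minimal illustration of input-output filtering in general, not the Guardrails AI API: `guarded_call`, `block_keywords`, `max_length`, `ValidationError`, and the stub `fake_llm` are all hypothetical names invented for this example, and a real deployment would replace the toy validators with a moderation model or a framework's own validators.

```python
from typing import Callable, List, Optional

# A validator returns None when the text passes, or a reason string when it fails.
Validator = Callable[[str], Optional[str]]


class ValidationError(Exception):
    """Raised when a prompt or a response fails a validator."""


def block_keywords(keywords: List[str]) -> Validator:
    """Toy validator (hypothetical): reject text containing any listed keyword."""
    lowered = [k.lower() for k in keywords]

    def check(text: str) -> Optional[str]:
        haystack = text.lower()
        for keyword in lowered:
            if keyword in haystack:
                return f"blocked keyword: {keyword}"
        return None

    return check


def max_length(limit: int) -> Validator:
    """Toy validator (hypothetical): reject text longer than `limit` characters."""

    def check(text: str) -> Optional[str]:
        if len(text) > limit:
            return f"text exceeds {limit} characters"
        return None

    return check


def guarded_call(
    llm: Callable[[str], str],
    prompt: str,
    input_validators: List[Validator],
    output_validators: List[Validator],
) -> str:
    """Run input validators, call the model, then run output validators."""
    for validate in input_validators:
        reason = validate(prompt)
        if reason is not None:
            # Fail before the model is ever called.
            raise ValidationError(f"input rejected: {reason}")

    response = llm(prompt)

    for validate in output_validators:
        reason = validate(response)
        if reason is not None:
            raise ValidationError(f"output rejected: {reason}")

    return response


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM client (assumption): it simply echoes the prompt."""
    return f"echo: {prompt}"
```

A call such as `guarded_call(fake_llm, "hello", [block_keywords(["secret"])], [max_length(100)])` returns the model's response when every validator passes and raises `ValidationError` otherwise. Raising on failure is one design choice among several; production frameworks typically also offer alternatives such as re-asking the model or filtering the offending span.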
Prerequisites
- At least one of Chapters 47 through 50
- LLM APIs from Chapter 11
- Familiarity with running classifiers and small models in production
Sections in This Chapter
- 51.1 Platforms: Part X's platforms divide into moderation APIs, red-team platforms, and policy / compliance services.
- 51.2 Libraries & Frameworks: Safety libraries split into guardrails frameworks (which wrap LLM calls with validators), red-team toolkits, and privacy-preserving training libraries.
- 51.3 Datasets & Benchmarks: Safety datasets cover three areas: harmful-output benchmarks, jailbreak / adversarial-input corpora, and bias / fairness benchmarks.
- 51.4 Models: Safety models fall into two roles: classifiers (which decide whether a prompt or response is safe) and judges (which score harmfulness on a continuous scale).
- 51.5 External Reading & Communities: The AI safety, security, and ethics literature is large and contested.
What Comes Next
Next: Chapter 52: Bias, Fairness & Hallucinations, opening Part XI. Part X hardened the system against attackers; Part XI confronts the harms a system can cause when nobody is attacking it: representational bias, allocational fairness, hallucinated facts, regulatory compliance (EU AI Act, GDPR, US frameworks), enterprise governance (NIST AI RMF, ISO 42001), licensing and IP, and machine unlearning. The shift is from "stop bad actors" to "do the right thing".