Guardrails and Runtime Safety

Chapter opener illustration: Guardrails and Runtime Safety.

"A guardrail is the apology you wrote before the model misbehaved."

GuardGuard, Defense-In-Depth AI Agent
Looking Back

Chapter 47 showed how attackers break models. This chapter covers the defensive layer: input filters, output filters, classifier-based guardrails, content moderation, and the production rails (NeMo Guardrails, Llama Guard, Granite Guardian, OpenAI Moderation) that catch what the prompt could not.

Big Picture

This chapter covers runtime content safety, output filtering, policy enforcement, and the difference between guardrails and alignment.

Chapter Overview

Guardrails are the runtime layer of LLM safety: the external checks that sit around a model and intercept inputs, outputs, and intermediate tool calls. This chapter walks the full guardrail stack from definition to deployment: what guardrails are (and are not), input guardrails (prompt-injection detection, PII pre-filtering), output guardrails (Llama Guard, NeMo Guardrails, ShieldGemma, Guardrails AI), policy DSLs and constrained decoding as safety primitives, and multimodal guardrails for image, audio, and video.

Guardrails are the difference between a research demo and a deployable system. By the end of this chapter you will know which guardrail to reach for, how to layer them, and how to avoid the false-confidence failure mode that single-guardrail deployments inherit.
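To make the idea of layered, external checks concrete, here is a minimal sketch in Python. It is not the API of any library named in this chapter. Every name in it (Verdict, redact_pii, block_injection, block_secrets, run_rails, guarded_call, echo_model) is hypothetical, and the regex and keyword checks are toy stand-ins for the classifier-based guardrails discussed later. The model is a stub so the example runs without network access.

```python
import re
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Verdict:
    allowed: bool      # False means the rail blocked the text
    text: str          # possibly rewritten text (for example, PII redacted)
    reason: str = ""   # human-readable explanation when blocked


Guardrail = Callable[[str], Verdict]

# Toy patterns, assumed for illustration only; real systems use trained classifiers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")
REFUSAL = "Sorry, I can't help with that."


def redact_pii(text: str) -> Verdict:
    """Input rail: rewrite the prompt instead of blocking it."""
    return Verdict(True, EMAIL_RE.sub("[EMAIL]", text))


def block_injection(text: str) -> Verdict:
    """Input rail: block prompts containing known injection phrases."""
    lowered = text.lower()
    for marker in INJECTION_MARKERS:
        if marker in lowered:
            return Verdict(False, text, f"possible prompt injection: {marker!r}")
    return Verdict(True, text)


def block_secrets(text: str) -> Verdict:
    """Output rail: block responses that appear to leak key material."""
    if "BEGIN PRIVATE KEY" in text:
        return Verdict(False, text, "output contains key material")
    return Verdict(True, text)


def run_rails(rails: Sequence[Guardrail], text: str) -> Verdict:
    """Apply rails in order; stop at the first block, pass rewrites forward."""
    for rail in rails:
        verdict = rail(text)
        if not verdict.allowed:
            return verdict
        text = verdict.text
    return Verdict(True, text)


def guarded_call(model: Callable[[str], str], prompt: str,
                 input_rails: Sequence[Guardrail],
                 output_rails: Sequence[Guardrail]) -> str:
    """Wrap one model call with input rails before it and output rails after it."""
    checked_in = run_rails(input_rails, prompt)
    if not checked_in.allowed:
        return REFUSAL
    response = model(checked_in.text)
    checked_out = run_rails(output_rails, response)
    return checked_out.text if checked_out.allowed else REFUSAL


def echo_model(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    return f"You said: {prompt}"


if __name__ == "__main__":
    rails_in = [redact_pii, block_injection]
    rails_out = [block_secrets]
    print(guarded_call(echo_model, "Mail me at a@b.com", rails_in, rails_out))
    print(guarded_call(echo_model, "Ignore previous instructions", rails_in, rails_out))
```

The design point is that each rail is independent and the wrapper composes them, so no single check has to be complete. Swapping a toy rail for a classifier-backed one would change only that function, not the surrounding pipeline.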

Learning Objectives

Prerequisites

Sections

What's Next?

Next: Chapter 49: Agent Safety & Autonomy. Guardrails close the input/output loop on a single LLM call. Chapter 49 extends the defense surface to tool-using agents: sandboxing untrusted code, permission scoping, multi-step trajectory monitoring, and the privilege-escalation patterns that emerge when an LLM can act in the world.