Chapter 48: Guardrails and Runtime Safety

Chapter opener illustration: Guardrails and Runtime Safety.

"A guardrail is the apology you wrote before the model misbehaved."
Guard, Defense-In-Depth AI Agent

Looking Back

Chapter 47 showed how attackers break models. This chapter covers the defensive layer: input filters, output filters, classifier-based guardrails, content moderation, and the production rails (NeMo Guardrails, Llama Guard, Granite Guardian, OpenAI Moderation) that catch what the prompt alone could not.

Big Picture

Runtime content safety, output filtering, policy enforcement, and the difference between guardrails and alignment.

Chapter Overview

Guardrails are the runtime layer of LLM safety: the external checks that sit around a model and intercept inputs, outputs, and intermediate tool calls. This chapter walks through the full guardrail stack, from definition to deployment. It covers what guardrails are (and are not), input guardrails (prompt-injection detection, PII pre-filtering), output guardrails (Llama Guard, NeMo Guardrails, ShieldGemma, Guardrails AI), policy DSLs and constrained decoding as safety primitives, and multimodal guardrails for image, audio, and video.

Guardrails are the difference between a research demo and a deployable system. By the end of this chapter you will know which guardrail to reach for, how to layer them, and how to avoid the false-confidence failure mode that single-guardrail deployments inherit.
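To make the idea of layering concrete, here is a minimal sketch of input and output checks wrapped around a single model call. Everything in it is hypothetical and for illustration only: the `Verdict` type, the two toy checks, `guarded_call`, and the stand-in `echo_model` are assumptions, not the API of any rail named in this chapter. A production system would replace the string and regex heuristics with a classifier-based guardrail.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Verdict:
    """Result of one guardrail check (hypothetical type for this sketch)."""
    allowed: bool
    reason: str = ""


Check = Callable[[str], Verdict]

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def no_injection_phrase(text: str) -> Verdict:
    # Toy heuristic standing in for a prompt-injection detector.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, "possible prompt injection")
    return Verdict(True)


def no_email(text: str) -> Verdict:
    # Toy heuristic standing in for a PII filter.
    if EMAIL.search(text):
        return Verdict(False, "email address detected")
    return Verdict(True)


def first_failure(checks: Iterable[Check], text: str) -> Optional[Verdict]:
    """Run checks in order and return the first failing verdict, if any."""
    for check in checks:
        verdict = check(text)
        if not verdict.allowed:
            return verdict
    return None


def guarded_call(model: Callable[[str], str], prompt: str,
                 input_checks: Iterable[Check],
                 output_checks: Iterable[Check]) -> str:
    """Wrap one model call with an input layer and an output layer."""
    blocked = first_failure(input_checks, prompt)
    if blocked is not None:
        return f"[blocked at input: {blocked.reason}]"
    reply = model(prompt)
    blocked = first_failure(output_checks, reply)
    if blocked is not None:
        return f"[blocked at output: {blocked.reason}]"
    return reply


def echo_model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"You said: {prompt}"


if __name__ == "__main__":
    rails_in, rails_out = [no_injection_phrase], [no_email]
    print(guarded_call(echo_model, "hello", rails_in, rails_out))
    print(guarded_call(echo_model, "mail me at a@b.com", rails_in, rails_out))
```

The second call passes the input layer, because the input checks here only look for injection phrasing, and is stopped by the output layer. That is the point of layering: each check has blind spots, and a deployment that relies on a single one inherits them.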

Learning Objectives

Explain the role of guardrails in the LLM safety stack and what they cannot replace.
Apply input guardrails (prompt-injection detection, PII pre-filtering) at the request layer.
Compare Llama Guard, NeMo Guardrails, ShieldGemma, and Guardrails AI as output-guardrail engines.
Use policy DSLs and constrained decoding to make unsafe output structurally impossible.
Architect multimodal guardrails for image, audio, and video inputs and outputs.
Diagnose false-positive and false-negative regressions in a guardrail deployment.

Prerequisites

Adversarial attacks from Chapter 47
Online observability from Chapter 44
Familiarity with classifier deployment basics

Sections

What's Next?

Next: Chapter 49: Agent Safety & Autonomy. Guardrails close the input/output loop on a single LLM call. Chapter 49 extends the defense surface to tool-using agents: sandboxing untrusted code, permission scoping, multi-step trajectory monitoring, and the privilege-escalation patterns that emerge when an LLM can act in the world.