Policy DSLs and Constrained Decoding as Safety

Section 48.4

"Make illegal states unrepresentable."

Yaron Minsky, OCaml for the Masses, 2011
Big Picture

The safest output is one that cannot be unsafe by construction. Constrained decoding (Outlines, Guidance, llama.cpp grammars) and policy DSLs (NeMo Colang, Guardrails AI schemas) let you express safety as structure: the model literally cannot generate tokens that violate a typed schema or a finite-state-machine policy. This section explains how structured output became a safety mechanism, when it works, and when it fails (the model can still hallucinate within a valid schema). We also look at Pydantic-driven safety contracts and the trade-off between expressivity and decoding speed.

Constrained decoding: logit masking against an FSM
Figure 48.4.1: Constrained decoding makes safety a property of the generation algorithm, not a post-hoc check. The Pydantic schema compiles to a finite-state machine over the model's vocabulary; at each step, Outlines (or Guidance, or llama.cpp grammars) masks out every logit that would violate the FSM. The model literally cannot emit an invalid verdict or leak PII through an unschemaed field, though it can still hallucinate within a valid schema.

Prerequisites

This section assumes familiarity with output guardrail platforms from Section 48.3 and with structured-output prompting from Section 12.5. Familiarity with decoding strategies from Section 4.2 helps when reading the logit-masking discussion.

48.4.1 From Output Validation to Output Impossibility

Fun Fact

Constrained decoding can make a model literally unable to emit a token that violates a JSON schema, and it can also make the model unable to refuse a malicious instruction if the schema accepts both safe and unsafe responses. The safest output by construction is also the dumbest, a tradeoff that policy DSL designers have been quietly relearning since 2024.

The output guardrails in Section 48.3 are detection systems: they run after the model finishes generating and decide whether to allow or block. Constrained decoding flips the architecture: it runs during generation and refuses to emit any token that would lead to an invalid output. The output cannot fail validation because the validator participated in producing it.

Mechanically, constrained decoders work by masking the logit distribution at each step. Given the partial output so far, the decoder asks the validator "which next tokens would keep this valid?" and sets the logits for all other tokens to negative infinity before sampling, so those tokens receive zero probability after the softmax. (Setting a logit to zero would not work: a zero logit still maps to a positive probability.) The mask can come from a regex, a JSON schema, a context-free grammar, or a finite-state machine compiled from a Pydantic model.
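The masking step can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Outlines or Guidance is implemented: the six-token vocabulary, the hand-written FSM table, and the fake_model stand-in are all hypothetical.

```python
import math
import random

# Toy vocabulary (hypothetical); a real tokenizer has tens of thousands of entries.
VOCAB = ['{"verdict":', '"safe"', '"unsafe"', '"needs_review"', '}', 'hello']

# Hand-written FSM for the string {"verdict":<value>}: state -> {token: next state}.
FSM = {
    0: {'{"verdict":': 1},
    1: {'"safe"': 2, '"unsafe"': 2, '"needs_review"': 2},
    2: {'}': 3},
}
FINAL_STATE = 3


def mask_logits(logits, state):
    """Set every logit the FSM disallows in `state` to negative infinity."""
    allowed = FSM[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]


def softmax(logits):
    """Softmax that tolerates -inf entries (exp(-inf) is 0.0)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def constrained_generate(logit_fn, rng):
    """Sample token by token, masking before every sampling step."""
    state, out = 0, []
    while state != FINAL_STATE:
        probs = softmax(mask_logits(logit_fn(out), state))
        tok = rng.choices(VOCAB, weights=probs, k=1)[0]
        out.append(tok)
        state = FSM[state][tok]
    return "".join(out)


def fake_model(prefix):
    """Stand-in for a language model that strongly prefers the invalid token 'hello'."""
    return [0.0, 0.1, 0.2, 0.3, 0.0, 5.0]


print(constrained_generate(fake_model, random.Random(0)))
```

Even though the stand-in model puts most of its mass on 'hello', that token is never sampled, because its masked probability is exactly zero in every state.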

From a safety perspective, this matters because many policy violations have structural signatures. A model that must emit JSON with a safety_verdict field can be forced to set that field to one of three allowed values. A model that must produce a refusal in a known format cannot generate a long, undisciplined response. A model whose output schema lacks a user_pii field cannot leak PII through that channel.

48.4.2 Outlines and Guidance: Production Constrained Decoding

Two libraries dominate the open-source constrained-decoding space. Outlines (maintained by .txt, also written dottxt) compiles regex and Pydantic schemas to FSMs over the model's vocabulary and reuses them across generations. Guidance (Microsoft) provides a richer template language with interleaved prompt-and-generate steps.

from outlines import models, generate
from pydantic import BaseModel, Field
from typing import Literal

class SafetyDecision(BaseModel):
    verdict: Literal["safe", "unsafe", "needs_review"]
    categories: list[Literal["self_harm", "violence", "privacy", "off_topic"]]
    rationale: str = Field(max_length=200)

def build_safety_judge(checkpoint: str):
    """Compile the SafetyDecision schema into a token-masked FSM."""
    model = models.transformers(checkpoint)
    return generate.json(model, SafetyDecision)

generator = build_safety_judge("meta-llama/Llama-3.1-8B-Instruct")
result: SafetyDecision = generator(
    "Classify this assistant response: 'I cannot help with that request.'"
)
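# result is guaranteed to parse as a valid SafetyDecision instance;
# the model cannot emit any other JSON structure.
# Note: outlines.models / outlines.generate is the pre-1.0 Outlines interface;
# later releases reorganized these entry points, so pin the version you test against.
print(result.verdict, result.categories)
Output (example; the verdict itself depends on the model): safe []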
Code Fragment 48.4.1a: Outlines compiles the Pydantic SafetyDecision model into a finite-state machine over the model's vocabulary. At every decoding step, only tokens that keep the partial output on a valid path through the FSM have non-zero probability. The verdict field can only be one of the three Literal values; the categories list can only contain the four allowed strings. Output validation becomes structurally impossible to fail.
Key Insight

Constrained decoding prevents structural violations but not semantic ones. The model is still free to set verdict: "safe" on a response that is unsafe. It cannot invent a fourth verdict, but it can pick the wrong one of the three allowed values. Structured output is necessary, not sufficient. Pair it with a classifier (Section 48.3) or with self-consistency (sample N times, take the majority).
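The self-consistency pairing can be sketched as a majority vote over repeated samples. This is a minimal sketch: the sample callable is a hypothetical wrapper around one call to the constrained generator, and the min_agreement threshold is an assumed policy choice, not something the libraries provide.

```python
from collections import Counter
from typing import Callable


def majority_verdict(sample: Callable[[], str], n: int = 5,
                     min_agreement: float = 0.6) -> str:
    """Sample n verdicts and return the most common one.

    Falls back to "needs_review" when the winning verdict's share of the
    votes is below min_agreement, so disagreement is escalated rather than
    hidden behind a narrow majority.
    """
    votes = Counter(sample() for _ in range(n))
    verdict, count = votes.most_common(1)[0]
    if count / n < min_agreement:
        return "needs_review"
    return verdict


# Usage with canned samples standing in for five generator calls.
canned = iter(["safe", "safe", "unsafe", "safe", "safe"])
print(majority_verdict(lambda: next(canned)))
```

Voting reduces variance from sampling; it does not help when the model is consistently wrong, which is why the classifier pairing remains useful.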

48.4.3 NeMo Colang as a Policy DSL

Colang (introduced in Section 48.3) deserves a closer look because it is the most mature open-source DSL for dialog-level safety policies. The 2.0 release reframes flows as generator functions: a flow yields events (user messages, bot responses, tool calls) and can pause, resume, and call sub-flows. This makes Colang programs feel less like rules and more like a structured dialog script.

# banking_policy.co

flow user requests account transfer
  user said "transfer money to my friend"
    or "send money to {recipient}"
    or "make a wire transfer"

flow require two factor confirmation
  bot say "For your security, please confirm your identity with the 6-digit code sent to your phone."
  user provides code
  $code = match("^\\d{6}$")
  if $code is None
    bot say "That doesn't look like a valid 6-digit code. Let's try again."
    abort

flow process transfer
  $verified = verify_2fa($code)
  if not $verified
    bot say "Verification failed. Please contact support."
    abort
  $result = execute_transfer($recipient, $amount)
  bot say "Transfer complete. Confirmation: {$result.confirmation_id}"

flow main
  activate llm continuation
  activate guardrail input "no_jailbreak"
  activate guardrail output "no_harm"

  user requests account transfer
  require two factor confirmation
  process transfer
Code Fragment 48.4.2: A Colang policy for a banking chatbot. The flow refuses to execute a money transfer without two-factor confirmation; the structural guarantee is that execute_transfer can only be called after the require two factor confirmation flow completes successfully. The same pattern in Python-with-LangChain would scatter the check across multiple files and is much harder to audit.

The auditability story matters: Colang programs are short, declarative, and version-controllable. A compliance reviewer can read the policy without learning Python. A red team can write adversarial flows to probe the policy. The translation from natural-language requirements ("never execute a transfer without 2FA") to executable code is more direct than in any imperative framework.
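To make the invariant concrete, here is the same requirement ("never execute a transfer without 2FA") as a small state machine in plain Python, together with the kind of adversarial probe a red team might write. This is an illustrative sketch, not NeMo Guardrails code; TransferDialog, PolicyViolation, and the injected verify_2fa and execute_transfer callables are hypothetical names.

```python
import re
from enum import Enum, auto


class Stage(Enum):
    UNVERIFIED = auto()
    VERIFIED = auto()


class PolicyViolation(Exception):
    """Raised when a step is attempted out of order."""


class TransferDialog:
    def __init__(self, verify_2fa, execute_transfer):
        self._verify = verify_2fa
        self._execute = execute_transfer
        self.stage = Stage.UNVERIFIED

    def provide_code(self, code: str) -> bool:
        """Advance to VERIFIED only for a well-formed code that verifies."""
        if re.fullmatch(r"[0-9]{6}", code) and self._verify(code):
            self.stage = Stage.VERIFIED
            return True
        return False

    def transfer(self, recipient: str, amount: float):
        """The only path to execute_transfer; gated on the VERIFIED stage."""
        if self.stage is not Stage.VERIFIED:
            raise PolicyViolation("transfer attempted before 2FA")
        self.stage = Stage.UNVERIFIED  # one verification authorizes one transfer
        return self._execute(recipient, amount)


# Adversarial probe: try to skip straight to the transfer.
dialog = TransferDialog(verify_2fa=lambda code: code == "123456",
                        execute_transfer=lambda r, a: f"sent {a} to {r}")
try:
    dialog.transfer("mallory", 1000)
except PolicyViolation as exc:
    print("blocked:", exc)
```

The Colang version expresses the same ordering declaratively; the Python version shows what the guarantee amounts to once the flow is compiled down to state.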

48.4.4 Pydantic as a Safety Contract

Pydantic models are the lingua franca of structured output in Python. Guardrails AI, Outlines, Instructor, and the OpenAI Structured Outputs API all consume Pydantic schemas. From a safety perspective, the contract a Pydantic model expresses is much richer than just "this output is JSON":

from pydantic import BaseModel, Field, model_validator
from typing import Literal

class TriageOutput(BaseModel):
    model_config = {"extra": "forbid"}

    urgency: Literal["routine", "urgent", "emergency"]
    summary: str = Field(max_length=280)
    requires_human: bool

    @model_validator(mode="after")
    def emergency_must_escalate(self):
        if self.urgency == "emergency" and not self.requires_human:
            raise ValueError("emergency triage must set requires_human=True")
        return self
Code Fragment 48.4.3: A Pydantic safety contract. The structural part (field names, types, the Literal values) can be enforced during decoding by Outlines or by the OpenAI and Anthropic structured-output APIs. The cross-field validator (emergency_must_escalate) enforces a policy invariant that the decoding grammar does not see: it runs only when the finished output is parsed into a TriageOutput. If the model emits an emergency triage without escalation, validation raises and the application can retry or block.
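The "retry or block" step can be sketched as a small loop. To keep the example runnable with only the standard library, a hand-written validate_triage function stands in for the Pydantic model; in an application that uses the TriageOutput class above, TriageOutput.model_validate_json would play that role, and Pydantic's ValidationError is a ValueError subclass, so the same except clause would apply. The generate callable and the attempt limit are assumptions for illustration.

```python
import json
from typing import Callable, Optional

ALLOWED_URGENCY = {"routine", "urgent", "emergency"}
EXPECTED_FIELDS = {"urgency", "summary", "requires_human"}


def validate_triage(raw: str) -> dict:
    """Stand-in validator mirroring the TriageOutput contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict) or set(data) != EXPECTED_FIELDS:
        raise ValueError("unexpected or missing fields")
    if data["urgency"] not in ALLOWED_URGENCY:
        raise ValueError("unknown urgency level")
    if not isinstance(data["summary"], str) or len(data["summary"]) > 280:
        raise ValueError("summary must be a string of at most 280 characters")
    if not isinstance(data["requires_human"], bool):
        raise ValueError("requires_human must be a boolean")
    if data["urgency"] == "emergency" and not data["requires_human"]:
        raise ValueError("emergency triage must set requires_human=True")
    return data


def generate_with_retry(generate: Callable[[], str],
                        validate: Callable[[str], dict],
                        max_attempts: int = 3) -> Optional[dict]:
    """Return the first output that validates, or None so the caller can block."""
    for _ in range(max_attempts):
        try:
            return validate(generate())
        except ValueError:
            continue
    return None


# Usage: the first canned output violates the invariant, the second passes.
canned = iter([
    '{"urgency": "emergency", "summary": "chest pain", "requires_human": false}',
    '{"urgency": "emergency", "summary": "chest pain", "requires_human": true}',
])
print(generate_with_retry(lambda: next(canned), validate_triage))
```

Returning None instead of raising keeps the blocking decision in the application layer, which may prefer to escalate to a human rather than show an error.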

48.4.5 When Constrained Decoding Is the Wrong Tool

Three failure modes are worth knowing before you reach for this hammer:

Key Insight
Aha Moment: The Schema That Hid a 12-Point Accuracy Drop

Tam et al. (2024, "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Large Language Model Performance," EMNLP) ran the cleanest experiment on this. They compared GPT-3.5-turbo on the GSM8K math benchmark in two conditions: free-form natural-language reasoning followed by an answer, vs strict JSON-schema-constrained output with a numeric "answer" field. Free-form: 79.3 percent accuracy. JSON-constrained: 67.1 percent accuracy. The model had not become 12 points dumber; the structural constraint cut the chain-of-thought reasoning tokens that lived between the question and the answer in natural prose. The schema looked like a safety win and quietly cost the team the equivalent of a 6-month accuracy regression. The lesson is the entire subsection in one number: constrained decoding is a contract with the parser, not with the model's capability. If your model needs reasoning room, the schema must allow a thinking field, or you pay 12 points to make the JSON nicer.

  1. Constraints distort the distribution. Masking tokens at each step renormalizes probability over the tokens that remain, so the constrained model samples from a different distribution than the unconstrained one. If your constraint is very restrictive, the model may produce nonsense within the schema (valid JSON full of gibberish). The constraint can also defeat refusal-related training: a constrained model cannot say "I cannot help with that" if your schema requires a non-empty answer field. Always include a "refusal" branch in your schema.
  2. Schemas drift from reality. Production schemas evolve. Old conversations stored with old schemas don't validate under new ones. Version your schemas; carry the schema version in every persisted record.
  3. Decoding cost. Compiling a complex schema to an FSM over a 128K-token vocabulary takes time. Outlines amortizes this with caching, but cold-start latency for a novel schema can range from hundreds of milliseconds for simple schemas to seconds or more for complex ones. Pre-compile and persist schemas you use repeatedly.
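The schema-versioning advice in item 2 can be sketched as follows. This is one possible layout, not a prescribed format: the schema_version field name, the migration table, and the example rename from "label" to "verdict" are all hypothetical.

```python
import json

CURRENT_SCHEMA_VERSION = 2


def _v1_to_v2(payload: dict) -> dict:
    """Hypothetical migration: version 2 renamed 'label' to 'verdict'."""
    migrated = dict(payload)
    migrated["verdict"] = migrated.pop("label")
    return migrated


# Maps a version number to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def dump_record(payload: dict) -> str:
    """Persist the payload together with the schema version that produced it."""
    return json.dumps({"schema_version": CURRENT_SCHEMA_VERSION,
                       "payload": payload})


def load_record(raw: str) -> dict:
    """Load a record, upgrading old payloads one version at a time."""
    record = json.loads(raw)
    version, payload = record["schema_version"], record["payload"]
    while version < CURRENT_SCHEMA_VERSION:
        payload = MIGRATIONS[version](payload)
        version += 1
    return payload


old = '{"schema_version": 1, "payload": {"label": "safe"}}'
print(load_record(old))
```

Chaining single-step migrations means each schema change needs only one new function, and records of any age can be brought up to date before validation.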
Warning: The "Plausible Garbage" Failure

A common bug: you constrain the model to emit a JSON with a diagnosis field from a closed list of ICD-10 codes. The model dutifully picks one for every input, including inputs that are not medical queries at all. The schema does not let it say "I cannot diagnose this," so it picks the highest-prior code (often "R69" / "unknown cause"). Always include a refusal or no-op state in your schema, and verify with a hold-out test set that the model uses it appropriately.
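A refusal state and the hold-out check can be sketched together. This is a standard-library sketch under stated assumptions: the two-branch JSON shape, the Diagnosis and Refusal classes, and the two-entry ALLOWED_CODES set (an illustrative subset, not a real closed list) are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List, Union

ALLOWED_CODES = {"J06.9", "R51"}  # illustrative subset of a closed code list


@dataclass(frozen=True)
class Diagnosis:
    code: str


@dataclass(frozen=True)
class Refusal:
    reason: str


def parse_output(raw: str) -> Union[Diagnosis, Refusal]:
    """Accept exactly two shapes: a diagnosis from the list, or a refusal."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("output must be a JSON object")
    if data.get("type") == "diagnosis" and data.get("code") in ALLOWED_CODES:
        return Diagnosis(code=data["code"])
    if data.get("type") == "refusal" and isinstance(data.get("reason"), str):
        return Refusal(reason=data["reason"])
    raise ValueError("output matches neither branch")


def refusal_recall(parsed: List[Union[Diagnosis, Refusal]],
                   should_refuse: List[bool]) -> float:
    """Of the hold-out inputs that should be refused, the fraction that were."""
    relevant = [p for p, s in zip(parsed, should_refuse) if s]
    if not relevant:
        return 0.0
    return sum(isinstance(p, Refusal) for p in relevant) / len(relevant)


outputs = [
    parse_output('{"type": "diagnosis", "code": "R51"}'),
    parse_output('{"type": "refusal", "reason": "not a medical query"}'),
    parse_output('{"type": "diagnosis", "code": "J06.9"}'),
]
# Hold-out labels: the second and third inputs were not medical queries.
print(refusal_recall(outputs, [False, True, True]))
```

A low refusal recall on the hold-out set is the signal that the model is filling the diagnosis branch for inputs it should have declined.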

48.4.6 Deployment Patterns

Three patterns recur in production:

Real-World Scenario: Tool-Call Safety via Schema

An agent has tools for send_email, book_calendar, and transfer_funds. Without constrained decoding, the model can invent tool names ("send_money", "wire_transfer"), which the agent then has to reject with an error. With a constrained schema that lists only the three real tools as a Literal type, the model is structurally incapable of calling a tool that does not exist. Combine this with a requires_confirmation boolean that defaults to True for transfer_funds, and you have a tool-call layer that is safer than free-form decoding plus post-hoc parsing. Section 49.1 covers tool-call mediation in depth.
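The schema side of this scenario can be sketched with the standard library. The tool names come from the scenario; the ToolCall class, the dispatch helper, and the handler table are hypothetical. One assumption is stricter than the text: the sketch pins requires_confirmation to True for transfer_funds instead of merely defaulting it, so a model-supplied False cannot switch the check off.

```python
from dataclasses import dataclass, field
from typing import Literal, get_args

ToolName = Literal["send_email", "book_calendar", "transfer_funds"]
ALLOWED_TOOLS = frozenset(get_args(ToolName))
ALWAYS_CONFIRM = frozenset({"transfer_funds"})


@dataclass
class ToolCall:
    tool: str
    arguments: dict = field(default_factory=dict)
    requires_confirmation: bool = False

    def __post_init__(self):
        # Reject tool names outside the closed list.
        if self.tool not in ALLOWED_TOOLS:
            raise ValueError(f"unknown tool: {self.tool!r}")
        # High-risk tools always require confirmation, whatever the model said.
        if self.tool in ALWAYS_CONFIRM:
            self.requires_confirmation = True


def dispatch(call: ToolCall, handlers: dict, user_confirmed: bool = False):
    """Run the tool, or hold the call until the user confirms it."""
    if call.requires_confirmation and not user_confirmed:
        return "pending_confirmation"
    return handlers[call.tool](**call.arguments)


handlers = {"transfer_funds": lambda to, amount: f"sent {amount} to {to}"}
call = ToolCall("transfer_funds", {"to": "alice", "amount": 20},
                requires_confirmation=False)
print(dispatch(call, handlers))
print(dispatch(call, handlers, user_confirmed=True))
```

With constrained decoding, the Literal type keeps invented tool names from being generated at all; the runtime check in __post_init__ is a second line of defense for calls that arrive by any other path.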

Key Insight

Constrained decoding turns safety into a structural property: outputs that would violate a typed schema cannot be sampled at all. Policy DSLs like NeMo Colang extend this to multi-turn dialog flows, making complex policies auditable as code. Pydantic is the cross-platform safety contract that unifies Outlines, Guardrails AI, Instructor, and the major commercial structured-output APIs. The technique catches structural violations perfectly and semantic violations not at all; pair it with a classifier or LLM-as-judge for full coverage.

Self-Check
Q1: You constrain a medical-triage bot to emit one of three urgency levels. The model picks "routine" for every input, including emergencies. What went wrong, and how do you debug it?
Answer:
Constrained decoding only enforces that the output is one of {emergency, urgent, routine}; it does not enforce that the choice is correct. The model's prior over the three labels has collapsed onto "routine," which typically happens because: (a) the prompt provides no examples or weak few-shot, so the model defaults to its training-data prior where "routine" is the most common label; (b) the schema lists "routine" first in the enum, and some models exhibit a position bias; (c) the underlying model is not strong enough at clinical triage and a different model is required. Debug in this order: rerun with a few-shot prompt containing one example per urgency level; reorder the enum and check whether the bias moves; finally, evaluate on a labeled triage set to see whether the issue is the prompt or the underlying model capability.
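The last debugging step, evaluating on a labeled triage set, can be sketched as a per-label recall computation. This is a minimal sketch with made-up labels; in practice the gold list comes from the labeled set and the predicted list from the constrained model.

```python
from collections import Counter
from typing import Dict, List


def per_label_recall(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """For each gold label, the fraction of its examples predicted correctly."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


gold = ["routine", "urgent", "emergency", "routine"]
predicted = ["routine", "routine", "routine", "routine"]  # collapsed model
print(Counter(predicted))                  # how skewed the predictions are
print(per_label_recall(gold, predicted))   # which labels are being missed
```

Overall accuracy can look acceptable when "routine" dominates the data, so recall on "emergency" is the number to watch when comparing prompts or enum orderings.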
Q2: A teammate proposes replacing your Llama Guard output classifier with a Pydantic schema. Why is this not equivalent? What does the classifier catch that the schema does not?
Answer:
A Pydantic schema enforces structural properties of the output: the JSON keys are present, the values have the right type, the strings match the regex. It does not look inside the string values for semantic safety. A Llama Guard classifier examines the actual content and flags categories like sexual content, violence, self-harm, or harassment regardless of the structural envelope. Concretely, a Pydantic schema that requires {"response": str} passes both "Here is your meal plan" and "Here is the synthesis route for ricin"; only the classifier catches the second. The two are complementary: the schema catches malformed outputs at near-zero cost, the classifier catches semantically unsafe outputs at higher cost. Replacing one with the other leaves an obvious gap.
Q3: Outlines pre-compiles your JSON schema into an FSM. Why is the cold-start latency high, and what is the deployment-time mitigation?
Answer:
Outlines builds a finite-state machine over the tokenizer's vocabulary so it can mask the logit set at each decoding step. The construction cost grows roughly with schema_size times vocab_size, and for a non-trivial schema against a large-vocabulary tokenizer (Llama 3.1 has roughly 128K tokens) the build can take anywhere from hundreds of milliseconds to tens of seconds; this is the cold-start cost. The deployment-time mitigation is to compile the FSM ahead of time and serialize it (Outlines exposes a cache mechanism), then load the precompiled FSM at server startup rather than at first request. For a small number of fixed schemas this turns cold-start latency into a one-time deploy cost; for dynamic schemas you still pay the build cost on first use, which is why production deployments typically restrict the schema space rather than allow arbitrary user-supplied schemas.
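The "compile at startup, restrict the schema space" pattern can be sketched as a small registry. This does not use the Outlines cache API: SchemaRegistry is a hypothetical class, and compile_fn is a placeholder for whatever expensive build step the decoding library performs.

```python
import json
from typing import Any, Callable, Dict


class SchemaRegistry:
    """Allow-list of schemas, compiled once at startup."""

    def __init__(self, compile_fn: Callable[[str], Any]):
        self._compile = compile_fn
        self._schemas: Dict[str, str] = {}
        self._compiled: Dict[str, Any] = {}

    def register(self, name: str, schema: dict) -> None:
        # Canonical JSON so equal schemas serialize identically.
        self._schemas[name] = json.dumps(schema, sort_keys=True)

    def warm(self) -> None:
        """Pay the build cost for every registered schema before serving."""
        for name, schema_json in self._schemas.items():
            if name not in self._compiled:
                self._compiled[name] = self._compile(schema_json)

    def get(self, name: str) -> Any:
        """Request path: lookup only; unregistered names raise KeyError."""
        return self._compiled[name]


build_count = 0


def fake_compile(schema_json: str) -> str:
    """Placeholder for the expensive FSM build; counts how often it runs."""
    global build_count
    build_count += 1
    return f"compiled:{schema_json}"


registry = SchemaRegistry(fake_compile)
registry.register("safety_decision", {"type": "object"})
registry.warm()
registry.get("safety_decision")
registry.get("safety_decision")
print(build_count)
```

Because get never compiles, a request for an unknown schema fails fast instead of triggering a cold build on the request path.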
Q4: You add a "refusal" branch to your structured-output schema. Walk through the prompt + sampling sequence by which a model uses it gracefully.
Answer:
The schema is a discriminated union, for example {type: "answer", payload: Answer} | {type: "refusal", reason: str}. The prompt explicitly tells the model: "If you cannot or should not answer, emit a refusal object with a brief reason; otherwise emit an answer." At sampling time, constrained decoding lets the model choose either branch at the first key; the model's value alignment determines which branch it picks. The model emits, for example, {"type": "refusal", "reason": "request involves PII the policy forbids returning"}. The application layer then handles the two branches separately, surfacing the refusal as a clean error to the user instead of either pretending to answer or returning an empty result. The pattern preserves both the structural contract (the output is always one of two valid shapes) and the model's ability to decline, which an answer-only schema would force the model to violate.
What's Next

Continue to Section 48.5: Multimodal Guardrails: Image, Audio, Video Content Filtering.

Section 48.5 extends the guardrail story from text to multimodal inputs: images, audio, video. Azure AI Content Safety, Amazon Rekognition, and Google Vertex AI provide hosted classifiers; image-input prompt injection is an emerging attack vector that text-only guardrails do not catch. We will see how to compose image filtering with text filtering and what to do when an attacker hides a malicious instruction inside a PNG.

Further Reading
Willard, B. T., Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702. (Outlines paper.)
Beurer-Kellner, L., Fischer, M., Vechev, M. (2024). Guiding LLMs the Right Way: Fast, Non-Invasive Constrained Generation. ICML 2024.
OpenAI (2024). Structured Outputs API Documentation. https://platform.openai.com/docs/guides/structured-outputs.
Anthropic (2024). Tool Use and Structured Outputs in Claude. Anthropic API documentation.
NVIDIA (2024). NeMo Guardrails Colang 2.0 Language Reference. https://docs.nvidia.com/nemo/guardrails/.
Pydantic (2024). Pydantic 2.x Documentation: Models, Validators, and Config. https://docs.pydantic.dev/latest/.
dottxt AI (2025). Outlines: Structured Text Generation. https://dottxt-ai.github.io/outlines/.