Structured-Output Validity Testing

Section 42.11

"Your model can be eloquent, factual, and helpful. If it forgets a comma, the JSON parser doesn't care."

EvalEval, Schema-Strict AI Agent

Big Picture

By 2026 most production LLM traffic is structured output rather than free-form text. Tool calls, function arguments, JSON-mode responses, agent action specifications, and SQL queries are now the dominant shapes of model output. Yet eval coverage for structured output is thin. Most teams measure "did we get back valid JSON?" and stop there, missing the harder failures: fields with the wrong names, types that don't match the schema, enum values outside the allowed set, refusals where the schema demands output, and the family of subtle protocol-mismatch bugs that crash downstream consumers. This section fills that gap. We define a four-tier taxonomy of structured-output validity (syntactic, schema-compliant, semantic, behavioral), build a validator harness using the jsonschema library, walk through provider-specific quirks (OpenAI's JSON-mode and function-calling, Anthropic's tool-use blocks), and integrate the whole thing with promptfoo for CI regression testing. The goal is a structured-output eval that catches the failures your text-only eval misses.

Prerequisites

This section assumes familiarity with structured generation from Section 4.3 (constrained decoding, JSON-mode), function calling from Section 11.2, and the eval-as-CI pattern from Section 42.1. Some familiarity with JSON Schema (Draft 2020-12 or Draft 7) helps with the validator code.

42.11.1 Why Structured-Output Validity Matters in 2026

Production Pattern: Schema-Validation Smoke Test in CI

When: any production system that consumes structured LLM output (function calls, JSON responses, tool-use blocks).

How: for every tool or schema your model produces, build a fixture set of 30-100 prompts that should produce structured output. Run them on every PR; validate each output against the schema; report (a) parse rate (did we get well-formed JSON?), (b) schema-compliance rate (did it match the schema?), (c) semantic-correctness rate (did the field values make sense?).

Watch for: all-pass results that hide a 5% silent-corruption rate (the parser accepted garbage values because the schema is too permissive).

Result: structured-output regressions are caught at PR time, not by the downstream service that crashes on the corrupted payload.

The shift toward structured output happened gradually then suddenly. OpenAI added function calling in mid-2023, Anthropic added tool use in late 2023, JSON-mode became table-stakes across all major providers in 2024, and structured-output APIs (OpenAI's structured outputs with strict schema enforcement, Anthropic's tool use with input_schema, Google's Gemini structured output) became the recommended interaction pattern for almost any non-chatbot use case in 2024-2025. By 2026, the typical production LLM application has more structured calls than free-form completions.

This shift has consequences for evaluation. The text-evaluation toolkit (BLEU, ROUGE, BERTScore, LLM-as-judge) was designed for prose, not for "did the tool call have argument_x set correctly?" When teams build new structured-output features, they often ship them with only a handful of manual smoke tests, then discover field-name typos and type mismatches in production. The eval gap is real and the cost of closing it is small relative to the cost of a single production incident.

A reasonable mental model is a four-tier taxonomy of structured-output failures:

Four-tier funnel showing structured-output failure modes from cheap-to-detect (syntactic) at the top to hardest-to-detect (behavioral) at the bottom, with example failure modes
Figure 42.11.1: Four tiers of structured-output validity, from cheap-and-shallow (does it parse?) to expensive-and-deep (was this the right call in the right order?). Production evaluators that stop at tier 1 routinely ship models with 50% behavioral failure rates and a green parse-rate dashboard.

Warning: Don't Conflate JSON-Mode With Schema Compliance

OpenAI's response_format={"type": "json_object"} guarantees only that the output parses as JSON. It does NOT guarantee that the output matches any particular schema. The newer response_format={"type": "json_schema", "json_schema": ...} mode does enforce schema compliance when strict mode is enabled, but only for schemas inside the supported subset: at the time of writing, every object must set additionalProperties to false and list all of its properties as required, and some JSON Schema keywords are not accepted. The supported subset has changed across API versions, so check the current documentation rather than relying on a fixed list. Many teams discover the difference too late, after building a service that assumes schema-mode behavior on a JSON-mode-only endpoint.

42.11.2 JSON Schema Compliance: Syntactic vs Semantic Validity

The most common eval question is "did the model produce JSON that matches the schema?" The naive answer: run json.loads() on the output and check that it didn't raise. That measures only syntactic validity. The harder question is what "valid" means, and it has three levels that production systems frequently confuse.

Syntactic validity asks: is the output well-formed JSON? Can json.loads() consume it without error? This catches missing brackets, trailing commas (which strict JSON forbids), unterminated strings, and similar parser-level mistakes. It is necessary but very far from sufficient.

Schema compliance asks: given that the output is well-formed JSON, does it satisfy the constraints of the declared schema? Required fields present, types correct, enum values within the allowed set, pattern constraints satisfied, numeric ranges respected. JSON Schema (Draft 2020-12) is the standard formal language for these constraints; the jsonschema Python library is the most widely used validator.

Semantic validity asks: given that the output is schema-compliant, are the field values correct in context? "Country" field contains a country, not a city. "Timestamp" field contains a recent date, not 1970. "Order ID" matches an existing order. This is task-specific and usually requires a domain judge (LLM-as-judge or a programmatic check against a source of truth).
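The three examples in the paragraph above can be written as programmatic checks against a source of truth. The following is a minimal sketch, not a definitive implementation: the field names (country, timestamp, order_id), the lookup sets, and the 30-day freshness window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sources of truth; a real system would query a database or service.
KNOWN_COUNTRIES = {"France", "Japan", "Brazil"}
EXISTING_ORDER_IDS = {"ORD-1001", "ORD-1002"}

def semantic_errors(obj: dict, now: datetime, max_age_days: int = 30) -> list[str]:
    """Return semantic problems for an object that already passed schema validation."""
    errors = []
    if obj["country"] not in KNOWN_COUNTRIES:
        errors.append(f"country: {obj['country']!r} is not a known country")
    ts = datetime.fromisoformat(obj["timestamp"])
    if now - ts > timedelta(days=max_age_days):
        errors.append(f"timestamp: {obj['timestamp']} is older than {max_age_days} days")
    if obj["order_id"] not in EXISTING_ORDER_IDS:
        errors.append(f"order_id: {obj['order_id']!r} does not match an existing order")
    return errors

now = datetime(2026, 1, 15, tzinfo=timezone.utc)
good = {"country": "France", "timestamp": "2026-01-10T09:00:00+00:00", "order_id": "ORD-1001"}
bad = {"country": "Paris", "timestamp": "1970-01-01T00:00:00+00:00", "order_id": "ORD-9999"}
print(semantic_errors(good, now))       # []
print(len(semantic_errors(bad, now)))   # 3
```

Both objects would pass a schema that only declares three string fields; only the semantic layer separates them.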

A useful eval reports all three rates separately: parse rate, schema-compliance rate, semantic-correctness rate. A model can have 100% parse rate, 95% schema compliance, and 70% semantic correctness, and the three numbers diagnose different failures.
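One way to compute the three rates is sketched below. The per-example record shape (boolean keys parse_ok, schema_ok, semantic_ok) is an assumption for illustration, and so is the choice of denominator: here every rate is divided by the total number of examples, so the rates can only fall from one tier to the next.

```python
def validity_rates(results: list[dict]) -> dict[str, float]:
    """Compute parse, schema, and semantic rates over the full eval set.

    Each result has boolean keys parse_ok, schema_ok, semantic_ok. An example
    only counts toward a tier if it also passed every earlier tier.
    """
    n = len(results)
    if n == 0:
        return {"parse_rate": 0.0, "schema_rate": 0.0, "semantic_rate": 0.0}
    parsed = sum(r["parse_ok"] for r in results)
    compliant = sum(r["parse_ok"] and r["schema_ok"] for r in results)
    correct = sum(r["parse_ok"] and r["schema_ok"] and r["semantic_ok"] for r in results)
    return {
        "parse_rate": parsed / n,
        "schema_rate": compliant / n,
        "semantic_rate": correct / n,
    }

results = (
    [{"parse_ok": True, "schema_ok": True, "semantic_ok": True}] * 14
    + [{"parse_ok": True, "schema_ok": True, "semantic_ok": False}] * 5
    + [{"parse_ok": True, "schema_ok": False, "semantic_ok": False}] * 1
)
print(validity_rates(results))
# {'parse_rate': 1.0, 'schema_rate': 0.95, 'semantic_rate': 0.7}
```

An alternative is to report conditional rates (schema compliance among parsed outputs, semantic correctness among compliant outputs). Either convention works as long as the dashboard states which one it uses.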

Building a validator

# Input: a raw model output string, a JSON Schema dict
# Output: a dict with keys parse_ok, schema_ok, errors for an eval report
import json
from jsonschema import Draft202012Validator

def validate_structured_output(raw: str, schema: dict) -> dict:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return {"parse_ok": False, "schema_ok": False, "errors": [f"parse: {e}"]}
    validator = Draft202012Validator(schema)
    errors = [f"{list(e.absolute_path)}: {e.message}"
              for e in validator.iter_errors(obj)]
    return {"parse_ok": True, "schema_ok": not errors, "errors": errors}

schema = {
    "type": "object",
    "required": ["city", "temperature_c", "conditions"],
    "properties": {
        "city":          {"type": "string"},
        "temperature_c": {"type": "number", "minimum": -100, "maximum": 60},
        "conditions":    {"enum": ["sunny", "cloudy", "rainy", "snowy"]},
    },
    "additionalProperties": False,
}

raw = '{"city": "Paris", "temperature_c": 18.2, "conditions": "drizzling"}'
print(validate_structured_output(raw, schema))
Output: {'parse_ok': True, 'schema_ok': False, 'errors': ["['conditions']: 'drizzling' is not one of ['sunny', 'cloudy', 'rainy', 'snowy']"]}
Code Fragment 42.11.1a: A minimal validator that distinguishes parse failures (the model produced invalid JSON) from schema failures (the JSON is well-formed but violates the schema). Aggregating both rates across an eval set surfaces the two failure modes separately.

The validator above is small but it captures the operational core. In a real harness you would add: (a) an output normalizer that strips Markdown code fences before parsing (a frequent source of "parse failures" that are not really failures), (b) per-field error categorization to track which fields fail most often, and (c) a sampling layer that logs full failure cases for manual review.
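Item (a), the output normalizer, might look like the sketch below. It assumes the only wrapping to undo is a single Markdown code fence around the whole response; the helper name strip_code_fence is hypothetical. A response with prose before or after the fence is deliberately left alone, so it still counts as a parse failure.

```python
import json
import re

TICKS = "`" * 3  # built programmatically so the pattern contains no literal fence

# Matches a response that is entirely one fenced block, with an optional language tag.
_FENCED = re.compile(
    r"^\s*" + TICKS + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + TICKS + r"\s*$",
    re.DOTALL,
)

def strip_code_fence(raw: str) -> str:
    """Return the fence body if the whole response is fenced, else the trimmed input."""
    match = _FENCED.match(raw)
    return match.group(1).strip() if match else raw.strip()

fenced = TICKS + 'json\n{"city": "Paris"}\n' + TICKS
print(json.loads(strip_code_fence(fenced)))               # {'city': 'Paris'}
print(json.loads(strip_code_fence('{"city": "Paris"}')))  # {'city': 'Paris'}
```

It is worth logging how often the normalizer fires: a rising rate of fenced responses is itself a signal about prompt or model drift, even when the downstream parse rate stays flat.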

42.11.3 Function-Calling Spec Compliance

Function-calling has its own protocol layer on top of JSON: the model emits a function name plus an arguments JSON, and downstream code dispatches to a registered handler. The eval question is whether the model's emitted call conforms to the function specification.

The most common failure modes, in rough descending order of frequency in 2025-2026 production data:

The eval harness should report each failure type separately. Aggregating them into a single "tool-call accuracy" number is convenient for dashboards but hides the diagnostic signal. When the typo rate goes up, the fix is usually in the function description (the model's confused about field naming); when the missing-required rate goes up, the fix is usually in the system prompt (the model's getting lazy about including all required fields).
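A per-call classifier that keeps failure types separate could be sketched as follows. The tool registry, the failure labels, and the use of difflib to separate likely typos from invented parameters are all assumptions for illustration; they are not a reconstruction of any provider's checks.

```python
import difflib
import json

# Hypothetical registry: tool name -> (required parameters, optional parameters).
TOOLS = {"get_weather": ({"city"}, {"units"})}

def classify_call(name: str, arguments_json: str) -> list[str]:
    """Return failure labels for one emitted call; an empty list means it conforms."""
    if name not in TOOLS:
        return ["unknown_tool"]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError:
        return ["unparseable_arguments"]
    if not isinstance(args, dict):
        return ["arguments_not_object"]
    required, optional = TOOLS[name]
    allowed = required | optional
    failures = []
    for key in sorted(set(args) - allowed):
        # A near-miss of a declared name is more likely a typo than an invention.
        near = difflib.get_close_matches(key, sorted(allowed), n=1, cutoff=0.6)
        failures.append("param_typo" if near else "unexpected_param")
    for _ in sorted(required - set(args)):
        failures.append("missing_required")
    return failures

print(classify_call("get_weather", '{"city": "Paris"}'))                  # []
print(classify_call("get_weather", '{"citi": "Paris"}'))                  # ['param_typo', 'missing_required']
print(classify_call("get_weather", '{"city": "Paris", "zip_code": "1"}')) # ['unexpected_param']
```

Counting these labels across the eval set gives the separate typo and missing-required rates discussed above instead of a single tool-call accuracy number.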

Algorithm 42.11.1: Under the Hood: The Provider Matrix for Structured Output

The structured-output guarantees vary by provider, by model, and by endpoint. As of early 2026:

The safe production stance: always run a validator pass downstream, regardless of which provider you use. The strict-mode guarantees are good but not absolute (edge cases involve very long outputs, unusual unicode, or schema features outside the strict subset). The validator catches what the provider's enforcement missed.

42.11.4 Tool-Use Protocol Eval: XML, JSON, and the Provider Matrix

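Anthropic's Messages API returns each tool call as a tool_use content block carrying an id, a tool name, and an input object; OpenAI's Chat Completions API returns tool_calls entries whose function.arguments field is a JSON-encoded string that still has to be parsed. Both have edge cases worth testing.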

For Anthropic tool use, the eval needs to verify:

For OpenAI function calls, the same content checks apply: well-formed JSON arguments, schema compliance, registered tool name. Because the arguments arrive as a JSON-encoded string, parsing them is part of the eval. The finish_reason field plays the role of the stop reason; "tool_calls" indicates a function call.

For both providers, multi-tool-call eval is its own challenge: the model can emit multiple tool calls in a single response, and the eval has to verify each one individually plus their joint structure (no duplicate tool names where duplicates would be wrong, dependencies respected if some calls depend on others).
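The joint-structure part of that check can be sketched separately from per-call validation. The version below makes two assumptions that the text does not specify: "duplicates where duplicates would be wrong" is modeled as a set of single-use tools, and "dependencies respected" is modeled as an ordering constraint within one response. The tool names are hypothetical.

```python
from collections import Counter

def check_joint_structure(calls, single_use=frozenset(), depends_on=None):
    """Check the tool calls of one response as a group.

    calls: ordered list of tool names emitted in the response.
    single_use: tools that may appear at most once.
    depends_on: maps a tool to the tools that must appear earlier in the list.
    """
    depends_on = depends_on or {}
    problems = []
    counts = Counter(calls)
    for tool in sorted(single_use):
        if counts[tool] > 1:
            problems.append(f"duplicate: {tool} called {counts[tool]} times")
    seen = set()
    for tool in calls:
        for dep in depends_on.get(tool, ()):
            if dep not in seen:
                problems.append(f"dependency: {tool} emitted without a preceding {dep}")
        seen.add(tool)
    return problems

rules = {"single_use": {"charge_card"}, "depends_on": {"charge_card": ["create_order"]}}
print(check_joint_structure(["create_order", "charge_card"], **rules))  # []
print(check_joint_structure(["charge_card", "create_order", "charge_card"], **rules))
# ['duplicate: charge_card called 2 times',
#  'dependency: charge_card emitted without a preceding create_order']
```

Each call in the list would still go through the individual name and argument checks; this function only covers what cannot be seen by looking at one call at a time.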

42.11.5 The "Good Output, Wrong Schema" Problem

One of the trickier failure modes is the case where the model produces something useful and well-formed, but does not match the declared schema. Examples:

These cases are diagnostically different from "the model couldn't parse the schema at all" (where outputs are random shapes) or "the model didn't try" (where outputs are free-form prose). A good eval distinguishes them and reports each separately.

One practical approach: when a schema violation occurs, run an LLM-as-judge pass that classifies the failure into categories: (a) wrong field names but correct semantic content, (b) wrong field types but correct semantic content, (c) wrong shape entirely, (d) partial output with missing semantic content, (e) refusal-style response (the model returned text instead of structured output). The category mix tells you what to fix: a high (a) rate suggests the schema descriptions are unclear; a high (e) rate suggests the system prompt is letting the model bail out too easily.

Note: Pydantic and Instructor as Validators

Many production teams use Pydantic models or the Instructor library instead of raw jsonschema. The eval logic is the same: parse the output, validate against the schema, classify failures. Pydantic's ValidationError gives more user-friendly error messages but the same coverage; Instructor adds automatic retries on schema failure, which is useful in production but should be turned off in eval (you want to measure the raw success rate, not the post-retry rate). Configure your eval harness to disable retries explicitly.

42.11.6 Refusal vs Fallback: When the Model Says No

A subtle case worth its own treatment: what happens when the model refuses to produce output, but the schema requires output? Three sub-cases, each scored differently in a well-designed eval:

(a) Pass-with-refusal. The model recognizes that the request is unsafe, off-topic, or impossible, and emits a structured refusal response (often a dedicated "refusal" field in the schema, or a special enum value). This is the desired behavior when a refusal is appropriate; the eval should score it as a pass.

(b) Fail-with-schema-violation. The model refuses by emitting free-form text ("I cannot help with that request") instead of using the schema's refusal mechanism. The downstream consumer sees a parse failure, not a refusal. This is a behavioral bug; the eval should score it as a fail and categorize it as a refusal-protocol violation.

(c) Fail-with-fabrication. The model produces schema-compliant output but the content is hallucinated to avoid acknowledging that it doesn't know. The eval cannot detect this from schema validation alone; it requires a semantic check or human review. This is the most dangerous of the three because the downstream system has no signal that anything went wrong.

A well-designed structured-output eval should report all three categories. The pass-vs-fail distinction is not enough; the kind of failure determines the fix. Conflating (b) and (c) into a single "fail" bucket loses diagnostic value.
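With labeled eval inputs, the three sub-cases can be assigned mechanically. The sketch below assumes a hypothetical one-field schema {"intent": label} in which the value "unknown" is the in-schema refusal mechanism, and assumes each eval input is labeled with whether a refusal is appropriate. The extra categories for should-answer inputs are illustrative additions.

```python
import json

def classify_refusal_case(raw: str, should_refuse: bool, expected_label=None) -> str:
    """Assign one output to a refusal category, given the eval label for its input."""
    try:
        label = json.loads(raw)["intent"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Free-form text or a wrong shape: the consumer sees a failure, not a refusal.
        return "fail_with_schema_violation"
    if label == "unknown":
        return "pass_with_refusal" if should_refuse else "over_refusal"
    if should_refuse:
        # Schema-compliant and confident, on an input that warranted a refusal.
        return "fail_with_fabrication"
    return "correct_answer" if label == expected_label else "wrong_answer"

print(classify_refusal_case('{"intent": "unknown"}', should_refuse=True))            # pass_with_refusal
print(classify_refusal_case("I cannot help with that request", should_refuse=True))  # fail_with_schema_violation
print(classify_refusal_case('{"intent": "billing"}', should_refuse=True))            # fail_with_fabrication
```

The fabrication case is only detectable here because the eval input carries a should_refuse label; schema validation alone would pass it.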

Postmortem: The 0.3% That Wasn't Refusal

A customer-support team deployed a structured-output classifier that emitted one of five intent labels. Internal eval reported 99.7% accuracy; production reported 99.7% schema-compliance. Yet customer-satisfaction scores dropped 4 points after deployment. Investigation: the 0.3% of "schema-compliant" outputs were the cases where the model couldn't classify confidently. Instead of emitting the schema's "intent": "unknown" value, it picked the closest plausible intent. The downstream router sent these requests to the wrong queue. Fix: changed the eval to specifically test the model's behavior on ambiguous inputs and to score "fabrication on uncertain input" as a fail, even though it was schema-compliant. Lesson: schema compliance is necessary but not sufficient. Test the edge cases where the model should refuse or admit uncertainty, and verify it actually does.

42.11.7 CI Integration: Structured-Output Regression Testing With Promptfoo

The eval-as-CI pattern from Section 42.1 applies cleanly to structured output. Promptfoo (one of the most-adopted CI eval tools in 2025-2026) has first-class support for JSON-schema validation, which makes the integration short.

# promptfoo.yaml: structured-output regression suite
description: Weather tool-call regression

providers:
  - openai:chat:gpt-4o-mini
  - anthropic:messages:claude-haiku-4-5

prompts:
  - "Return a weather report for {{city}} as JSON."

# Shared contract: promptfoo applies defaultTest assertions to every test,
# in addition to the assertions each test declares itself.
defaultTest:
  assert:
    # Parse check, reported separately from schema compliance.
    - type: is-json
    # Schema check: is-json with a value validates against that JSON Schema.
    - type: is-json
      value:
        type: object
        required: [city, temperature_c, conditions]
        properties:
          city: { type: string }
          temperature_c: { type: number }
          conditions: { enum: [sunny, cloudy, rainy, snowy] }
        additionalProperties: false

tests:
  - vars: { city: Paris }
    assert:
      - type: javascript
        value: "JSON.parse(output).city.toLowerCase() === 'paris'"
  - vars: { city: Tokyo }
    assert:
      - type: javascript
        value: "JSON.parse(output).city.toLowerCase() === 'tokyo'"
Output (illustrative summary): openai:chat:gpt-4o-mini: 2/2 tests passed (3 asserts each). anthropic:messages:claude-haiku-4-5: 2/2 tests passed (3 asserts each). Total: 4/4 (100%).
Code Fragment 42.11.2: A promptfoo configuration with two structured-output tests, each carrying three assertions: JSON-parseable, schema-compliant, and semantically correct city echo. Run as promptfoo eval in CI; the YAML lives in the repo alongside the schema definition.

Wire this into your CI pipeline (GitHub Actions, GitLab CI, Buildkite, etc.) and gate merges on the test results. The typical 2026 pattern is a two-stage gate: first stage runs the structured-output suite on every PR with a 100-200 example fixture set (fast, deterministic), second stage runs a larger live-data replay (1000-5000 examples) nightly with looser thresholds. The fast stage catches obvious regressions; the slow stage catches the rare cases that only manifest on real traffic distributions.

Tip: Test the Schema, Not Just the Output

When you update a schema (adding a field, tightening an enum, marking a field required), run your structured-output eval suite against the previous model's outputs as a sanity check. A new schema that fails to validate a body of known-good outputs probably has a bug. This is the structured-output equivalent of a database migration dry-run, and it catches the case where an over-eager schema tightening would have broken existing traffic.
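The dry-run can be a few lines of code. In the sketch below the validator is passed in as a callable so the example stays self-contained; enum_errors is a deliberately tiny stand-in that checks only top-level enum constraints, and both function names are hypothetical. In a real harness the callable would wrap a full JSON Schema validator such as the jsonschema library used earlier in this section.

```python
def dry_run(new_schema: dict, known_good: list, find_errors) -> list:
    """Return (index, errors) for each known-good output the new schema rejects."""
    rejected = []
    for i, obj in enumerate(known_good):
        errors = find_errors(obj, new_schema)
        if errors:
            rejected.append((i, errors))
    return rejected

def enum_errors(obj: dict, schema: dict) -> list:
    """Stand-in validator for this sketch: checks only top-level enum constraints."""
    return [
        f"{key}: {obj[key]!r} is not in {spec['enum']}"
        for key, spec in schema.get("properties", {}).items()
        if "enum" in spec and key in obj and obj[key] not in spec["enum"]
    ]

known_good = [{"conditions": "sunny"}, {"conditions": "snowy"}]
tightened = {"properties": {"conditions": {"enum": ["sunny", "cloudy", "rainy"]}}}
print(dry_run(tightened, known_good, enum_errors))
# [(1, ["conditions: 'snowy' is not in ['sunny', 'cloudy', 'rainy']"])]
```

A non-empty result does not always mean the schema is wrong, but every rejected output deserves a deliberate decision before the change ships.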

Fun Note: The Trailing-Comma Cargo Cult

When OpenAI shipped response_format=json_object in November 2023, dozens of teams jumped on it and ripped out their JSON-repair layers. By Q2 2024 the "json_object validates fine" memo had been quietly walked back at most of them: the mode guaranteed parseable JSON, not your schema. The classic 2024 internal Slack message ("we have 99% parse rate and 60% schema compliance, please put back the validator") became something of a meme on the Latent Space podcast, and is the reason every structured-output dashboard in 2026 ships with separate parse-rate and schema-compliance gauges by default.

Key Takeaways

Exercises

Exercise 35.6.1: Parse Rate vs Schema Compliance (Conceptual)

A team reports their model has 99% "JSON validity" on their eval set. Their production system crashes on roughly 5% of outputs. Explain how both numbers can be correct, and propose two changes to the eval that would catch the production failures.

Answer Sketch

"JSON validity" likely measures parse rate only: the model produces well-formed JSON 99% of the time. The 5% production crashes likely come from schema-violation failures (missing required fields, wrong types) that are still well-formed JSON. Changes to the eval: (1) add schema validation alongside parse-rate, reporting both separately; (2) test against the actual downstream consumer's schema, not a relaxed eval schema. A third option: write a contract-test fixture set where each known production crash is captured as an eval input that should fail with a specific error category, then track those over time.

Exercise 35.6.2: Provider Strict-Mode Limits (Analysis)

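You want to use OpenAI's strict structured-output mode, but your schema includes a recursive definition (a tree structure where each node can contain children of the same type). Assume for this exercise that the strict mode available to you rejects recursive schemas; provider support for recursion has changed over time, so check the current documentation. List three ways to work around this and rank them by trade-off.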

Answer Sketch

(1) Flatten the schema to a fixed maximum depth (e.g., 5 levels), declaring each level explicitly. Trade-off: depth limit, but keeps strict mode. (2) Use non-strict JSON mode and add a recursion-aware validator pass. Trade-off: no schema-mode guarantees, but full recursion. (3) Switch to a provider or library with grammar-constrained generation that supports recursion (candidates to check include Gemini's structured output, llguidance, and the outlines library). Trade-off: provider lock-in or extra dependency. Ranking depends on the application: option 1 for low-depth trees, option 2 for unbounded recursion with willingness to retry on failure, option 3 if grammar-constrained generation is acceptable.

Exercise 35.6.3: Function-Call Error Categorization (Coding)

Extend the validator from Code Fragment 42.11.1a to categorize each schema-validation failure into one of six categories: missing-required, extra-field, type-mismatch, enum-violation, pattern-violation, other. Compute per-category rates over an eval set.

Answer Sketch

The jsonschema library's ValidationError has a validator attribute that names the failed validator: "required" for missing fields, "additionalProperties" for extras, "type" for type mismatches, "enum" for enum violations, "pattern" for regex failures. Map each e.validator to one of the six categories (with "other" as fallback), and increment a counter per category. Aggregate across the eval set to get per-category rates. Sort by rate descending and report; the top category usually tells you the highest-leverage fix.
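The mapping and aggregation step might be sketched as follows. To stay self-contained, the function takes the failed keyword names directly (the values you would read from e.validator on each error); the function name and the per-output rate definition are assumptions for illustration.

```python
from collections import Counter

# Maps a JSON Schema keyword (the value of e.validator) to a report category.
CATEGORY_BY_KEYWORD = {
    "required": "missing-required",
    "additionalProperties": "extra-field",
    "type": "type-mismatch",
    "enum": "enum-violation",
    "pattern": "pattern-violation",
}

def category_rates(keywords_per_output: list[list[str]]) -> dict[str, float]:
    """Rate = share of outputs with at least one error in that category.

    keywords_per_output[i] holds the failed keywords for output i (empty if valid).
    """
    total = len(keywords_per_output)
    counts = Counter()
    for keywords in keywords_per_output:
        # Count each category once per output, however many errors it produced.
        counts.update({CATEGORY_BY_KEYWORD.get(k, "other") for k in keywords})
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {category: n / total for category, n in ranked}

observed = [["enum"], ["required", "type"], [], ["enum", "minimum"]]
print(category_rates(observed))
# {'enum-violation': 0.5, 'missing-required': 0.25, 'other': 0.25, 'type-mismatch': 0.25}
```

With the jsonschema library, each inner list could be built as [e.validator for e in validator.iter_errors(obj)].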

Exercise 35.6.4: Refusal vs Fabrication Test (Analysis)

Design an eval that distinguishes three model behaviors on ambiguous inputs: (a) correct refusal via the schema's refusal mechanism, (b) free-form refusal that violates the schema, (c) fabrication of a confident-looking but incorrect answer. Specify the eval inputs, the metrics, and how you would interpret the results.

Answer Sketch

Inputs: a labeled set of "should refuse" cases (questions for which the answer is genuinely unknown or out-of-scope) and "should answer" cases (mixed in to prevent the model from over-refusing). Metrics: per-input category (correct-refusal-in-schema, refusal-out-of-schema, correct-answer, fabrication). Compute (a) refusal precision: of refusals, what fraction were in-schema? (b) refusal recall: of cases that should have been refused, what fraction were refused? (c) fabrication rate: of should-refuse cases, what fraction got a confident wrong answer? Interpretation: high fabrication rate means the model isn't recognizing uncertainty; high out-of-schema refusal means the system prompt needs to teach the in-schema mechanism better.
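The three metrics can be computed from per-input records once each output has been assigned one of the four categories named above. The record shape (a should_refuse flag paired with a category string) is an assumption for illustration.

```python
REFUSALS = {"correct-refusal-in-schema", "refusal-out-of-schema"}

def refusal_metrics(records: list[tuple[bool, str]]) -> dict[str, float]:
    """records: (should_refuse, category) pairs, one per eval input."""
    def frac(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    refusals = [cat for _, cat in records if cat in REFUSALS]
    should_refuse = [cat for flag, cat in records if flag]
    return {
        # Of all refusals, what fraction used the in-schema mechanism?
        "refusal_precision": frac(refusals.count("correct-refusal-in-schema"), len(refusals)),
        # Of inputs that should have been refused, what fraction were refused at all?
        "refusal_recall": frac(sum(cat in REFUSALS for cat in should_refuse), len(should_refuse)),
        # Of inputs that should have been refused, what fraction got a confident answer?
        "fabrication_rate": frac(should_refuse.count("fabrication"), len(should_refuse)),
    }

records = (
    [(True, "correct-refusal-in-schema")] * 2
    + [(True, "refusal-out-of-schema")]
    + [(True, "fabrication")]
    + [(False, "correct-answer")] * 4
)
metrics = refusal_metrics(records)
print(metrics["refusal_recall"], metrics["fabrication_rate"])  # 0.75 0.25
```

In this toy set refusal precision is 2/3: two of the three refusals used the schema's mechanism and one fell back to free-form text.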

Exercise 35.6.5: Promptfoo Integration (Coding)

Take an existing function in your codebase (or a hypothetical one) that consumes LLM-generated JSON. Write a promptfoo configuration that tests at least three positive cases (correct output expected) and three negative cases (edge cases where the model has historically failed). Include schema validation and at least one semantic assertion per test.

Answer Sketch

Structure: define one provider entry per model you care about. Define the prompt template, parameterized by test variables. For each test, declare vars for the inputs, then assert entries: is-json for parse-rate, is-json with the schema supplied as its value for compliance, and javascript or llm-rubric for semantic correctness (e.g., the city echoed matches the city requested, the date is plausible, the amount is positive). Run with promptfoo eval -c promptfoo.yaml; gate CI on the pass rate.

What Comes Next

In the next section, Section 42.12: Classical ML Evaluation Metrics, we continue.

The next section continues this chapter's tour of functional and behavioral testing patterns, picking up regression-test design, capability-coverage matrices, and the family of safety-relevant behavioral checks (jailbreak resistance, prompt-injection refusal) that complement structured-output validity.

Further Reading

JSON Schema and Validators

Wright, A., Andrews, H., Hutton, B., Dennis, G. (2022). "JSON Schema: A Media Type for Describing JSON Documents (Draft 2020-12)." json-schema.org
jsonschema (Python library). (2026). "An implementation of the JSON Schema specification for Python." github.com/python-jsonschema/jsonschema
Pydantic Documentation. (2026). "Pydantic V2: Data Validation Using Python Type Hints." docs.pydantic.dev

Provider Structured-Output Documentation

OpenAI. (2024-2026). "Structured Outputs." platform.openai.com/docs/guides/structured-outputs
OpenAI. (2024-2026). "Function Calling Guide." platform.openai.com/docs/guides/function-calling
Anthropic. (2024-2026). "Tool Use with Claude." docs.anthropic.com/en/docs/build-with-claude/tool-use
Google. (2024-2026). "Gemini API: Structured Output." ai.google.dev/gemini-api/docs/structured-output

Constrained Generation Research

Willard, B. T., Louf, R. (2023). "Efficient Guided Generation for Large Language Models." arXiv:2307.09702
Beurer-Kellner, L., Fischer, M., Vechev, M. (2024). "Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation." arXiv:2403.06988

CI Eval Tooling

Promptfoo Documentation. (2026). "Promptfoo: Test Your LLM App." promptfoo.dev/docs
Instructor Documentation. (2026). "Instructor: Structured LLM Outputs." python.useinstructor.com