"Your model can be eloquent, factual, and helpful. If it forgets a comma, the JSON parser doesn't care."
Eval, Schema-Strict AI Agent
By 2026 most production LLM traffic is not free-form text, it is structured output. Tool calls, function arguments, JSON-mode responses, agent action specifications, and SQL queries are now the dominant shapes of model output. Yet eval coverage for structured output is thin: most teams measure "did we get back valid JSON?" and stop there, missing the harder failures, fields with the wrong names, types that don't match the schema, enum values outside the allowed set, refusals where the schema demands output, and the family of subtle protocol-mismatch bugs that crash downstream consumers. This section is the missing piece. We define a four-tier taxonomy of structured-output validity (syntactic, schema-compliant, semantic, behavioral), build a validator harness using the jsonschema library, walk through provider-specific quirks (OpenAI's JSON-mode and function-calling, Anthropic's tool-use XML), and integrate the whole thing with promptfoo for CI regression testing. The goal is a structured-output eval that catches the failures your text-only eval misses.
Prerequisites
This section assumes familiarity with structured generation from Section 4.3 (constrained decoding, JSON-mode), function calling from Section 11.2, and the eval-as-CI pattern from Section 42.1. Some familiarity with JSON Schema (Draft 2020-12 or Draft 7) helps with the validator code.
42.11.1 Why Structured-Output Validity Matters in 2026
When: any production system that consumes structured LLM output (function calls, JSON responses, tool-use blocks). How: for every tool or schema your model produces, build a fixture set of 30-100 prompts that should produce structured output. Run them on every PR; validate each output against the schema; report (a) parse rate (did we get well-formed JSON?), (b) schema-compliance rate (did it match the schema?), (c) semantic-correctness rate (did the field values make sense?). Watch for: all-pass results that hide a 5% silent-corruption rate (the parser accepted garbage values because the schema is too permissive). Result: structured-output regressions are caught at PR time, not by the downstream service that crashes on the corrupted payload.
The shift toward structured output happened gradually then suddenly. OpenAI added function calling in mid-2023, Anthropic added tool use in late 2023, JSON-mode became table-stakes across all major providers in 2024, and structured-output APIs (OpenAI's structured outputs with strict schema enforcement, Anthropic's tool use with input_schema, Google's Gemini structured output) became the recommended interaction pattern for almost any non-chatbot use case in 2024-2025. By 2026, the typical production LLM application has more structured calls than free-form completions.
This shift has consequences for evaluation. The text-evaluation toolkit (BLEU, ROUGE, BERTScore, LLM-as-judge) was designed for prose, not for "did the tool call have argument_x set correctly?" When teams build new structured-output features, they often ship them with only a handful of manual smoke tests, then discover field-name typos and type mismatches in production. The eval gap is real and the cost of closing it is small relative to the cost of a single production incident.
A reasonable mental model is a four-tier taxonomy of structured-output failures:
- Syntactic failure: the output cannot be parsed as JSON (or as the requested format). Mismatched braces, trailing commas, unterminated strings. The cheapest to detect.
- Schema-compliance failure: the output parses, but does not match the declared schema. Missing required fields, extra fields, wrong types, enum violations, field-name typos.
- Semantic failure: the output is schema-compliant but the field values are wrong. "City" field contains a country name, "date" field contains "yesterday" instead of an ISO-8601 string, "user_id" is a hallucinated number.
- Behavioral failure: the output is schema-compliant and semantically correct, but the wrong tool was called, or the call was unnecessary, or the call sequence was wrong (in a multi-call agent).
OpenAI's response_format={"type": "json_object"} guarantees only that the output parses as JSON. It does NOT guarantee that the output matches any particular schema. The newer response_format={"type": "json_schema", "json_schema": ...} mode does enforce schema compliance, but only when the schema is in the supported subset (no anyOf, no recursive schemas, no unconstrained additionalProperties, no patterns). Many teams discover the difference too late, after building a service that assumes schema-mode behavior on a JSON-mode-only endpoint.
42.11.2 JSON Schema Compliance: Syntactic vs Semantic Validity
The most common eval question is "did the model produce JSON that matches the schema?" The naive answer: run json.loads() on the output and check that it didn't raise. That measures only syntactic validity. The harder question is schema compliance, and it has two flavors that production systems frequently confuse.
Syntactic validity asks: is the output well-formed JSON? Can json.loads() consume it without error? This catches missing brackets, trailing commas (which strict JSON forbids), unterminated strings, and similar parser-level mistakes. It is necessary but very far from sufficient.
Schema compliance asks: given that the output is well-formed JSON, does it satisfy the constraints of the declared schema? Required fields present, types correct, enum values within the allowed set, pattern constraints satisfied, numeric ranges respected. JSON Schema (Draft 2020-12) is the standard formal language for these constraints; the jsonschema Python library is the most widely used validator.
Semantic validity asks: given that the output is schema-compliant, are the field values correct in context? "Country" field contains a country, not a city. "Timestamp" field contains a recent date, not 1970. "Order ID" matches an existing order. This is task-specific and usually requires a domain judge (LLM-as-judge or a programmatic check against a source of truth).
A useful eval reports all three rates separately: parse rate, schema-compliance rate, semantic-correctness rate. A model can have 100% parse rate, 95% schema compliance, and 70% semantic correctness, and the three numbers diagnose different failures.
Building a validator
# Input: a raw model output string, a JSON Schema dict
# Output: a triple (parse_ok, schema_ok, errors) for an eval report
import json
from jsonschema import Draft202012Validator, ValidationError
def validate_structured_output(raw: str, schema: dict) -> dict:
try:
obj = json.loads(raw)
except json.JSONDecodeError as e:
return {"parse_ok": False, "schema_ok": False, "errors": [f"parse: {e}"]}
validator = Draft202012Validator(schema)
errors = [f"{list(e.absolute_path)}: {e.message}"
for e in validator.iter_errors(obj)]
return {"parse_ok": True, "schema_ok": not errors, "errors": errors}
schema = {
"type": "object",
"required": ["city", "temperature_c", "conditions"],
"properties": {
"city": {"type": "string"},
"temperature_c": {"type": "number", "minimum": -100, "maximum": 60},
"conditions": {"enum": ["sunny", "cloudy", "rainy", "snowy"]},
},
"additionalProperties": False,
}
raw = '{"city": "Paris", "temperature_c": 18.2, "conditions": "drizzling"}'
print(validate_structured_output(raw, schema))
The validator above is small but it captures the operational core. In a real harness you would add: (a) an output normalizer that strips Markdown code fences before parsing (a frequent source of "parse failures" that are not really failures), (b) per-field error categorization to track which fields fail most often, and (c) a sampling layer that logs full failure cases for manual review.
42.11.3 Function-Calling Spec Compliance
Function-calling has its own protocol layer on top of JSON: the model emits a function name plus an arguments JSON, and downstream code dispatches to a registered handler. The eval question is whether the model's emitted call conforms to the function specification.
The most common failure modes, in rough descending order of frequency in 2025-2026 production data:
- Argument-name typos. The function declares
user_idbut the model emitsuserIdorid. Strict schema-mode prevents this; permissive schemas pass it silently. - Missing required fields. The model emits some arguments and omits others, often because it inferred the rest from context but did not include them in the call.
- Type mismatches. A number field receives a string ("42" instead of 42), a boolean field receives a string ("true" instead of true), or a string field receives a number.
- Enum violations. The function declares
severityas enum [low, medium, high]; the model emits"critical". - Spurious calls. The model invokes a function when no function call was needed; it should have answered in text.
- Missed calls. The model answers in text when it should have invoked a function (often happens when a function's description is ambiguous about when to use it).
The eval harness should report each failure type separately. Aggregating them into a single "tool-call accuracy" number is convenient for dashboards but hides the diagnostic signal. When the typo rate goes up, the fix is usually in the function description (the model's confused about field naming); when the missing-required rate goes up, the fix is usually in the system prompt (the model's getting lazy about including all required fields).
The structured-output guarantees vary by provider, by model, and by endpoint. As of early 2026:
- OpenAI JSON-mode (
response_format={"type":"json_object"}): guarantees parseable JSON, no schema constraint. - OpenAI Structured Outputs (
response_format={"type":"json_schema", "strict": true}): guarantees schema compliance, but only for a subset of JSON Schema features. Requires"additionalProperties": falseon every object. - OpenAI Function Calling (
tools=[...]with strict mode): equivalent to Structured Outputs for the tool's argument JSON. - Anthropic Tool Use (
tools=[...]withinput_schema): no strict mode in the same sense; the model is steered by the schema but not constrained. Production teams add a validator pass and either reject or retry on schema failure. - Google Gemini Structured Output (
response_schema=...): grammar-constrained generation, similar guarantee to OpenAI's strict mode but with a different supported schema subset.
The Pareto-correct production stance: always run a validator pass downstream, regardless of which provider you use. The strict-mode guarantees are good but not absolute (edge cases involve very long outputs, unusual unicode, or schema features outside the strict subset). The validator catches what the model missed.
42.11.4 Tool-Use Protocol Eval: XML, JSON, and the Provider Matrix
Anthropic's tool use emits XML-flavored blocks (<tool_use>...</tool_use>) with a JSON arguments payload inside; OpenAI's emits a flat JSON object. Both have edge cases worth testing.
For Anthropic tool use, the eval needs to verify:
- Block well-formedness: matching open and close tags, no nested unescaped angle brackets inside text values, no broken XML.
- JSON-inside-XML validity: the arguments payload is itself parseable as JSON, then schema-validatable against the tool's
input_schema. - Tool-name correctness: the emitted tool name matches a registered tool.
- Stop-reason consistency: the stop reason is
tool_usewhen a tool was called, notend_turnwith text-embedded tool-use that downstream code might miss.
For OpenAI function calls, the eval is simpler (no XML layer), but the same content checks apply: well-formed JSON arguments, schema compliance, registered tool name. The finish_reason field plays the role of the stop reason; "tool_calls" indicates a function call.
For both providers, multi-tool-call eval is its own challenge: the model can emit multiple tool calls in a single response, and the eval has to verify each one individually plus their joint structure (no duplicate tool names where duplicates would be wrong, dependencies respected if some calls depend on others).
42.11.5 The "Good Output, Wrong Schema" Problem
One of the trickier failure modes is the case where the model produces something useful and well-formed, but does not match the declared schema. Examples:
- Schema requires
{"first_name": str, "last_name": str}; model emits{"name": "Jane Doe"}. The output is reasonable English, just not the right shape. - Schema requires a flat list of strings; model emits a list of objects each with a "value" field. Again, reasonable but wrong shape.
- Schema requires ISO-8601 dates; model emits "March 15th, 2026". Same data, wrong format.
These cases are diagnostically different from "the model couldn't parse the schema at all" (where outputs are random shapes) or "the model didn't try" (where outputs are free-form prose). A good eval distinguishes them and reports each separately.
One practical approach: when a schema violation occurs, run an LLM-as-judge pass that classifies the failure into categories: (a) wrong field names but correct semantic content, (b) wrong field types but correct semantic content, (c) wrong shape entirely, (d) partial output with missing semantic content, (e) refusal-style response (the model returned text instead of structured output). The category mix tells you what to fix: a high (a) rate suggests the schema descriptions are unclear; a high (e) rate suggests the system prompt is letting the model bail out too easily.
Many production teams use Pydantic models or the Instructor library instead of raw jsonschema. The eval logic is the same: parse the output, validate against the schema, classify failures. Pydantic's ValidationError gives more user-friendly error messages but the same coverage; Instructor adds automatic retries on schema failure, which is useful in production but should be turned off in eval (you want to measure the raw success rate, not the post-retry rate). Configure your eval harness to disable retries explicitly.
42.11.6 Refusal vs Fallback: When the Model Says No
A subtle case worth its own treatment: what happens when the model refuses to produce output, but the schema requires output? Three sub-cases, each scored differently in a well-designed eval:
(a) Pass-with-refusal. The model recognizes that the request is unsafe, off-topic, or impossible, and emits a structured refusal response (often a dedicated "refusal" field in the schema, or a special enum value). This is the desired behavior when a refusal is appropriate; the eval should score it as a pass.
(b) Fail-with-schema-violation. The model refuses by emitting free-form text ("I cannot help with that request") instead of using the schema's refusal mechanism. The downstream consumer sees a parse failure, not a refusal. This is a behavioral bug; the eval should score it as a fail and categorize it as a refusal-protocol violation.
(c) Fail-with-fabrication. The model produces schema-compliant output but the content is hallucinated to avoid acknowledging that it doesn't know. The eval cannot detect this from schema validation alone; it requires a semantic check or human review. This is the most dangerous of the three because the downstream system has no signal that anything went wrong.
A well-designed structured-output eval should report all three categories. The pass-vs-fail distinction is not enough; the kind of failure determines the fix. Conflating (b) and (c) into a single "fail" bucket loses diagnostic value.
A customer-support team deployed a structured-output classifier that emitted one of five intent labels. Internal eval reported 99.7% accuracy; production reported 99.7% schema-compliance. Yet customer-satisfaction scores dropped 4 points after deployment. Investigation: the 0.3% of "schema-compliant" outputs were the cases where the model couldn't classify confidently. Instead of emitting the schema's "intent": "unknown" value, it picked the closest plausible intent. The downstream router sent these requests to the wrong queue. Fix: changed the eval to specifically test the model's behavior on ambiguous inputs and to score "fabrication on uncertain input" as a fail, even though it was schema-compliant. Lesson: schema compliance is necessary but not sufficient. Test the edge cases where the model should refuse or admit uncertainty, and verify it actually does.
42.11.7 CI Integration: Structured-Output Regression Testing With Promptfoo
The eval-as-CI pattern from Section 42.1 applies cleanly to structured output. Promptfoo (one of the most-adopted CI eval tools in 2025-2026) has first-class support for JSON-schema validation, which makes the integration short.
# promptfoo.yaml: structured-output regression suite
description: Weather tool-call regression
providers:
- openai:chat:gpt-4o-mini
- anthropic:messages:claude-haiku-4-5
prompts:
- "Return a weather report for {{city}} as JSON."
# Reusable schema block: every test asserts against this contract.
defaultTest: &weather_schema
assert:
- type: is-json
- type: is-valid-json-schema
value:
type: object
required: [city, temperature_c, conditions]
properties:
city: { type: string }
temperature_c: { type: number }
conditions: { enum: [sunny, cloudy, rainy, snowy] }
additionalProperties: false
tests:
- vars: { city: Paris }
<<: *weather_schema
assert:
- type: javascript
value: "JSON.parse(output).city.toLowerCase() === 'paris'"
- vars: { city: Tokyo }
<<: *weather_schema
assert:
- type: javascript
value: "JSON.parse(output).city.toLowerCase() === 'tokyo'"
promptfoo eval in CI; the YAML lives in the repo alongside the schema definition.Wire this into your CI pipeline (GitHub Actions, GitLab CI, Buildkite, etc.) and gate merges on the test results. The typical 2026 pattern is a two-stage gate: first stage runs the structured-output suite on every PR with a 100-200 example fixture set (fast, deterministic), second stage runs a larger live-data replay (1000-5000 examples) nightly with looser thresholds. The fast stage catches obvious regressions; the slow stage catches the rare cases that only manifest on real traffic distributions.
When you update a schema (adding a field, tightening an enum, marking a field required), run your structured-output eval suite against the previous model's outputs as a sanity check. A new schema that fails to validate a body of known-good outputs probably has a bug. This is the structured-output equivalent of a database migration dry-run, and it catches the case where an over-eager schema tightening would have broken existing traffic.
When OpenAI shipped response_format=json_object in November 2023, dozens of teams jumped on it and ripped out their JSON-repair layers. By Q2 2024 the "json_object validates fine" memo had been quietly walked back at most of them: the mode guaranteed parseable JSON, not your schema. The classic 2024 internal Slack message ("we have 99% parse rate and 60% schema compliance, please put back the validator") became something of a meme on the Latent Space podcast, and is the reason every structured-output dashboard in 2026 ships with separate parse-rate and schema-compliance gauges by default.
- Distinguish four tiers of validity. Parse rate, schema compliance, semantic correctness, behavioral correctness. Report each separately.
- JSON-mode is not schema compliance. Use structured-output mode where available; always validate downstream regardless of provider guarantees.
- Categorize function-call failures. Typos, missing fields, type mismatches, enum violations, spurious calls, missed calls. The category determines the fix.
- The "good output, wrong schema" case is a real failure. Don't dismiss schema violations whose content looks reasonable, they crash downstream services just as hard.
- Refusal protocol matters. Test that the model refuses in-schema, not by reverting to free-form text.
- Integrate validation into CI. Promptfoo, DeepEval, or a custom harness, the principle is the same: every PR gets a schema-validation pass.
Exercises
A team reports their model has 99% "JSON validity" on their eval set. Their production system crashes on roughly 5% of outputs. Explain how both numbers can be correct, and propose two changes to the eval that would catch the production failures.
Answer Sketch
"JSON validity" likely measures parse rate only, the model produces well-formed JSON 99% of the time. The 5% production crashes likely come from schema-violation failures (missing required fields, wrong types) that are still well-formed JSON. Changes to the eval: (1) add schema validation alongside parse-rate, reporting both separately; (2) test against the actual downstream consumer's schema, not a relaxed eval schema. A third option: write a contract-test fixture set where each known production crash is captured as an eval input that should fail with a specific error category, then track those over time.
You want to use OpenAI's strict structured-output mode, but your schema includes a recursive definition (a tree structure where each node can contain children of the same type). Strict mode doesn't support recursion. List three ways to work around this and rank them by trade-off.
Answer Sketch
(1) Flatten the schema to a fixed maximum depth (e.g., 5 levels), declaring each level explicitly. Trade-off: depth limit, but keeps strict mode. (2) Use non-strict JSON mode and add a recursion-aware validator pass. Trade-off: no schema-mode guarantees, but full recursion. (3) Switch to a provider with grammar-constrained generation that supports recursion (Gemini's structured output, llguidance, outlines library). Trade-off: provider lock-in or extra dependency. Ranking depends on the application: option 1 for low-depth trees, option 2 for unbounded recursion with willingness to retry on failure, option 3 if grammar-constrained generation is acceptable.
Extend the validator from Code Fragment 42.11.1b to categorize each schema-validation failure into one of six categories: missing-required, extra-field, type-mismatch, enum-violation, pattern-violation, other. Compute per-category rates over an eval set.
Answer Sketch
The jsonschema library's ValidationError has a validator attribute that names the failed validator: "required" for missing fields, "additionalProperties" for extras, "type" for type mismatches, "enum" for enum violations, "pattern" for regex failures. Map each e.validator to one of the six categories (with "other" as fallback), and increment a counter per category. Aggregate across the eval set to get per-category rates. Sort by rate descending and report; the top category usually tells you the highest-leverage fix.
Design an eval that distinguishes three model behaviors on ambiguous inputs: (a) correct refusal via the schema's refusal mechanism, (b) free-form refusal that violates the schema, (c) fabrication of a confident-looking but incorrect answer. Specify the eval inputs, the metrics, and how you would interpret the results.
Answer Sketch
Inputs: a labeled set of "should refuse" cases (questions for which the answer is genuinely unknown or out-of-scope) and "should answer" cases (mixed in to prevent the model from over-refusing). Metrics: per-input category (correct-refusal-in-schema, refusal-out-of-schema, correct-answer, fabrication). Compute (a) refusal precision: of refusals, what fraction were in-schema? (b) refusal recall: of cases that should have been refused, what fraction were refused? (c) fabrication rate: of should-refuse cases, what fraction got a confident wrong answer? Interpretation: high fabrication rate means the model isn't recognizing uncertainty; high out-of-schema refusal means the system prompt needs to teach the in-schema mechanism better.
Take an existing function in your codebase (or a hypothetical one) that consumes LLM-generated JSON. Write a promptfoo configuration that tests at least three positive cases (correct output expected) and three negative cases (edge cases where the model has historically failed). Include schema validation and at least one semantic assertion per test.
Answer Sketch
Structure: define one provider entry per model you care about. Define the prompt template, parameterized by test variables. For each test, declare vars for the inputs, then assert entries: is-json for parse-rate, is-valid-json-schema with the schema for compliance, and javascript or llm-rubric for semantic correctness (e.g., the city echoed matches the city requested, the date is plausible, the amount is positive). Run with promptfoo eval -c promptfoo.yaml; gate CI on the pass rate.
What Comes Next
In the next section, Section 42.12: Classical ML Evaluation Metrics, we continue.
The next section continues this chapter's tour of functional and behavioral testing patterns, picking up regression-test design, capability-coverage matrices, and the family of safety-relevant behavioral checks (jailbreak resistance, prompt-injection refusal) that complement structured-output validity.