Failure Modes Specific to Government

Section 72.2

"NYC MyCity hallucinated tenant rights. Michigan MiDAS automated fraud accusations. Dutch SyRI flagged immigrants. Three case studies, one warning."

Hallux, Public-Sector-Pessimist AI Agent
Big Picture

Five failure modes recur across public-sector LLM deployments often enough to deserve named patterns: the helpful-by-default failure (NYC MyCity), due-process violations in benefits decisions, public-records exposure of LLM interactions, accessibility failures, and procurement cycles that outlast the named model. Each has a remediation pattern that the affected agencies have published; collectively they shape the conservative design that successful public-sector LLM deployments adopt. This section walks through each failure mode and the lesson the field has internalized.

Prerequisites

This section assumes the government LLM use cases from Section 72.1, the hallucination vocabulary from Section 47.1, and the bias-and-fairness framing from Section 50.1.

The Chevy of Watsonville Pattern in the Public Sector

Fun Fact

The Chevy of Watsonville pattern is named after a December 2023 incident in which a California car-dealer chatbot powered by ChatGPT was tricked into agreeing to sell a new Chevy Tahoe for $1 and into calling the offer legally binding; the user screenshotted the exchange and posted it on X. The dealership did not honor the deal, and the chatbot was pulled shortly afterward. The phrase "Chevy of Watsonville" is now industry shorthand for any LLM that agrees to things it should refuse.

Warning: The Chevy of Watsonville Pattern in the Public Sector

NYC's MyCity chatbot, launched in October 2023, was widely reported in early 2024 to be giving incorrect answers about employment law and housing policy that contradicted city regulations. Root cause: the chatbot was prompted as a helpful general assistant rather than a strict grounded-retrieval system, and it answered policy questions that it should have refused. Lesson: in public-sector deployments, the default behavior should be refusal-to-answer outside the curated corpus, with explicit "contact this office" handoffs. "Helpful by default" is a poor default when the wrong answer carries legal consequence.

The MyCity launch was a teachable failure for the public sector at large. The post-launch coverage (The Markup's investigation, NYC Comptroller's audit, the City's own response and remediation) is now a standard reference in public-sector AI procurement. The fix is architectural: a strict grounded-retrieval system that refuses outside scope rather than improvising, plus a comprehensive evaluation set of representative constituent questions where the expected behavior is "I do not have that information; here is who to contact." U.S. municipal-AI deployments that followed largely standardized on this pattern.
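The refuse-outside-scope behavior described above can be sketched in a few lines. This is a minimal illustration, not the City's implementation: the corpus passages, the handoff offices, the stopword list, the word-overlap scoring, and the `MIN_OVERLAP` threshold are all hypothetical stand-ins for a real retrieval stack.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical curated corpus: passage id -> agency-approved text.
CORPUS = {
    "sidewalk-cafe": "A sidewalk cafe license application is filed with the licensing office and requires a site plan.",
    "food-cart": "A mobile food vending permit requires a health inspection before the permit is issued.",
}

# Hypothetical hard-coded handoffs for topics the bot must never answer itself.
HANDOFFS = {
    "fire": "the city's human rights commission",
    "evict": "the housing agency's tenant helpline",
}

STOPWORDS = {"a", "an", "the", "is", "are", "what", "does", "do", "i", "to",
             "of", "and", "with", "for", "can", "before", "who", "about"}
MIN_OVERLAP = 2  # assumed threshold; a real system would tune it on an evaluation set


@dataclass
class Reply:
    kind: str                       # "answer" or "refusal"
    text: str
    source_id: Optional[str] = None


def tokens(text: str) -> set:
    """Lowercased content words, with common stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def respond(question: str) -> Reply:
    q = tokens(question)
    # 1. High-stakes topics are routed to an office before any retrieval happens.
    for trigger, office in HANDOFFS.items():
        if any(tok.startswith(trigger) for tok in q):
            return Reply("refusal", f"I do not have that information; please contact {office}.")
    # 2. Find the curated passage with the largest word overlap.
    best_id, best_score = None, 0
    for pid, passage in CORPUS.items():
        score = len(q & tokens(passage))
        if score > best_score:
            best_id, best_score = pid, score
    # 3. Refuse by default when nothing in the corpus clears the threshold.
    if best_id is None or best_score < MIN_OVERLAP:
        return Reply("refusal", "I do not have that information; please contact the relevant city office.")
    # The reply quotes the curated passage verbatim rather than improvising.
    return Reply("answer", CORPUS[best_id], best_id)
```

The design choice worth noticing is the ordering: routing and refusal are checked before any answer is produced, so the only way to get an answer is for a curated passage to clear the threshold.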

Due-Process and Algorithmic Accountability

Warning: Due-Process and Algorithmic Accountability

U.S. administrative law (and equivalents in most democracies) requires that adverse decisions affecting individuals be explainable, appealable, and traceable to a decision-maker. An LLM that makes a benefits-denial recommendation cannot satisfy these requirements alone. The Michigan MiDAS unemployment-fraud-detection scandal (an earlier, non-LLM system that wrongly accused 40,000+ people of fraud) remains the cautionary tale: opaque automated decisions in public benefits programs create both legal liability and human harm at scale.

The Michigan MiDAS lesson is older than LLMs but more relevant than ever: the failure mode in automated public-benefits decisions is not the technology but the absence of human accountability for adverse outcomes. The legal framework that constrains this is robust: U.S. administrative law requires a decision-maker who can be questioned, a record that can be audited, and an appeals path that is accessible. An LLM that produces a benefits-eligibility recommendation cannot satisfy these requirements alone; a human caseworker must own the decision and be reachable for appeal. Successful 2024-2025 LLM deployments in benefits administration explicitly architect for this: the LLM helps the applicant gather documents, helps the caseworker prepare the decision, but the caseworker signs the decision and is named in the appeal record.
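One way to make "the caseworker signs the decision" a property of the data model rather than a policy reminder is sketched below. The class and field names are hypothetical; the point is only that the LLM output is stored as advisory text and that no notice can be produced until a named human has signed.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class BenefitsDecision:
    case_id: str
    llm_recommendation: str                # advisory only, kept for the audit record
    llm_rationale: str
    outcome: Optional[str] = None          # set only when a human signs
    decided_by: Optional[str] = None       # caseworker named in the appeal record
    decided_at: Optional[datetime] = None
    appeal_contact: Optional[str] = None

    def sign(self, caseworker: str, outcome: str, appeal_contact: str) -> None:
        """Record the human decision; the LLM recommendation is never copied in automatically."""
        if not caseworker.strip():
            raise ValueError("a named caseworker must own the decision")
        if not appeal_contact.strip():
            raise ValueError("an appeal path is required")
        self.outcome = outcome
        self.decided_by = caseworker
        self.decided_at = datetime.now(timezone.utc)
        self.appeal_contact = appeal_contact

    def notice(self) -> str:
        """Render the applicant notice; refuses if no human has signed."""
        if self.decided_by is None:
            raise RuntimeError("no notice can be issued without a human decision-maker")
        return (f"Case {self.case_id}: {self.outcome}. Decided by {self.decided_by}. "
                f"To appeal, contact {self.appeal_contact}.")
```

Under this sketch the caseworker can disagree with the recommendation, and both the recommendation and the signed outcome remain in the record for audit.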

Public-Records Exposure of LLM Interactions

Warning: Public-Records Exposure of LLM Interactions

Agency LLM conversations may themselves be public records subject to FOIA. Procurement contracts must specify whether prompts and responses are logged, by whom, for how long, and under what disclosure rules. Several agencies have been caught flat-footed by FOIA requests for "all chatbot conversations" they did not anticipate.

The pattern: an agency deploys a chatbot, the chatbot generates conversations, an enterprising journalist or advocacy group files a FOIA request for all chatbot logs, and the agency discovers it has no records-retention policy for the new artifact. The remediation pattern is procurement-and-records-management: contracts specify retention, agency records officers classify chatbot transcripts under the records schedule, and FOIA-response capability is engineered into the logging architecture from the start. Several agencies now treat chatbot transcripts as routine public records and publish dashboards or data exports proactively, which both satisfies FOIA and surfaces useful analytics.
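Engineering FOIA-response capability into the logging layer can be as simple as storing each transcript with its records classification and making date-range export a first-class function. The sketch below assumes a hypothetical three-year retention period and a hypothetical records-series label; the real values come from the agency's records schedule, and any release would still need redaction review, which is out of scope here.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date, timedelta

RETENTION_DAYS = 3 * 365  # hypothetical; the real period comes from the records schedule


@dataclass
class Transcript:
    conversation_id: str
    day: date
    prompt: str
    response: str
    records_series: str = "CHATBOT-TRANSCRIPTS"  # hypothetical series label

    def disposal_date(self) -> date:
        """Earliest date the record may be disposed of under the assumed schedule."""
        return self.day + timedelta(days=RETENTION_DAYS)


def foia_export(log, start: date, end: date) -> str:
    """Return transcripts dated within [start, end] as JSON Lines, oldest first."""
    selected = [t for t in log if start <= t.day <= end]
    lines = []
    for t in sorted(selected, key=lambda t: (t.day, t.conversation_id)):
        row = asdict(t)
        row["day"] = t.day.isoformat()
        row["disposal_date"] = t.disposal_date().isoformat()
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)
```

The same export function can feed a proactive data release, which is how a records obligation becomes an analytics source.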

Accessibility (Section 508 / WCAG)

Warning: Accessibility (Section 508 / WCAG)

Federal agencies (and most state programs) must meet Section 508 accessibility standards. LLM chat interfaces need screen-reader compatibility, keyboard navigation, sufficient color contrast, and alternative-text for any generated images. Vendor demos that look good on a laptop often fail accessibility audits.

Several 2024-2025 public-sector LLM pilots failed Section 508 audits because the chat interface was built without screen-reader testing, the streaming-text UI behaved erratically with assistive technologies, or the generated images had no alternative text. The pattern that works is to treat accessibility as a first-class requirement from procurement onward: the RFP specifies WCAG 2.1 AA conformance, the vendor demonstrates assistive-technology compatibility before award, and the agency's accessibility specialists conduct independent verification before deployment. The GSA-run Section508.gov portal publishes specific guidance on AI and chatbot interfaces that has stabilized through 2024 and 2025.
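Of the checks listed above, missing alternative text is the easiest to automate before a human audit. The sketch below uses Python's standard `html.parser` to flag `<img>` tags in rendered chat output that lack non-empty `alt` text. It assumes every generated image is informative (WCAG permits empty `alt` for purely decorative images), and it is no substitute for testing with real assistive technology.

```python
from html.parser import HTMLParser


class AltTextAudit(HTMLParser):
    """Collects the src of every <img> that has no non-empty alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. a bare `alt`), so normalise to "".
        alt = (attributes.get("alt") or "").strip()
        if not alt:
            self.missing.append(attributes.get("src") or "(no src)")


def images_missing_alt(html: str) -> list:
    audit = AltTextAudit()
    audit.feed(html)
    audit.close()
    return audit.missing
```

A check like this fits naturally into a pre-deployment test suite run against sampled chatbot responses.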

Procurement Cycles Outlast Models

Warning: Procurement Cycles Outlast Models

A typical federal procurement (RFI, RFP, evaluation, award, implementation) runs 12-24 months. The model named in the contract is often a generation behind by deployment. Successful contracts specify capabilities and outcomes rather than model identifiers, with explicit upgrade clauses tied to evaluation thresholds.

The mismatch is structural and unfixable in the short term: federal procurement timelines are designed for predictability and accountability; frontier-model generations turn over every six to twelve months. The pattern that works is to specify capabilities ("the system shall produce summaries that score above X on the agency's evaluation set, and shall support upgrades to successor model versions that meet or exceed those scores") rather than model identifiers ("the system shall use GPT-4"). The latter language locks the contract to an aging model; the former permits the vendor to upgrade through the contract lifetime as long as evaluation thresholds are met. Several large federal agencies have published model contract-language templates that operationalize this; the GSA's AI Center of Excellence is the most-cited source.
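The capability-based clause quoted above translates directly into an upgrade gate: a successor model is accepted only if it clears the contract threshold and does not regress against the incumbent on any contracted metric. The metric names and numbers in this sketch are hypothetical.

```python
def upgrade_allowed(candidate: dict, incumbent: dict, thresholds: dict):
    """Return (allowed, failures) for a proposed model upgrade.

    candidate / incumbent: metric name -> score on the agency's evaluation set.
    thresholds: metric name -> minimum score written into the contract.
    """
    failures = []
    for metric, floor in thresholds.items():
        score = candidate.get(metric)
        if score is None:
            failures.append(f"{metric}: candidate was not evaluated")
            continue
        # The successor must meet the contract floor and the incumbent's score.
        required = max(floor, incumbent.get(metric, floor))
        if score < required:
            failures.append(f"{metric}: {score:.2f} is below required {required:.2f}")
    return (not failures, failures)
```

For example, with thresholds `{"summary_quality": 0.80}` and an incumbent scoring 0.85, a candidate scoring 0.83 is rejected even though it clears the contract floor, because it would be a regression for constituents.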

Postmortem
The Robodebt-Style Failure That Predates LLMs (instructive)

Australia's Robodebt scandal (2016 to 2020) is the most-documented automated-decision-making failure in public benefits in any country. An algorithmic system at Centrelink, the Australian welfare agency, issued automated debt notices to welfare recipients based on an income-averaging method that produced systematically wrong debt calculations. Roughly 470,000 incorrect debts were issued totaling about AUD 1.8 billion. Several recipients died by suicide after receiving the notices; the political fallout ended in a Royal Commission whose 2023 report is essential reading. The system was not an LLM but the institutional lessons apply directly: an opaque automated decision in a high-consequence benefits domain, deployed without adequate human-in-the-loop review, with limited appeal rights, produced large-scale and irreversible harm. The remediation pattern that the public-sector AI community internalized from Robodebt is unambiguous: high-consequence benefits decisions must remain with identifiable human decision-makers, with documented appeal paths, and the AI may inform but never decide. Every major civilian-agency AI policy framework from OMB M-24-10 forward reflects this lesson.
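Why income averaging produces wrong debts is easiest to see with a worked example. The sketch below uses a deliberately simplified, hypothetical benefit rule (a base payment reduced by 50 cents for each dollar of fortnightly income above a free area); the rates are illustrative and are not Centrelink's actual rules. A person who claims benefits while unemployed for half the year and then works for the other half owes nothing, yet spreading the annual income evenly across all 26 fortnights makes it look as though they earned income while on benefits.

```python
BASE_PAYMENT = 500.0   # hypothetical fortnightly benefit
FREE_AREA = 300.0      # hypothetical income allowed before the benefit tapers
TAPER = 0.5            # hypothetical reduction per dollar above the free area


def entitlement(fortnight_income: float) -> float:
    """Benefit payable for one fortnight under the simplified rule."""
    return max(0.0, BASE_PAYMENT - TAPER * max(0.0, fortnight_income - FREE_AREA))


def assessed_debt(paid, assumed_income) -> float:
    """Sum of per-fortnight overpayments, given the income the assessor assumes."""
    return sum(max(0.0, p - entitlement(i)) for p, i in zip(paid, assumed_income))


# 13 fortnights unemployed on benefits, then 13 fortnights employed at 2000 per fortnight.
actual_income = [0.0] * 13 + [2000.0] * 13
paid = [entitlement(i) for i in actual_income]           # 500 while unemployed, 0 while working

annual_income = sum(actual_income)                        # 26000
averaged_income = [annual_income / 26] * 26               # 1000 assumed in every fortnight

true_debt = assessed_debt(paid, actual_income)            # no overpayment occurred
averaged_debt = assessed_debt(paid, averaged_income)      # 13 fortnights x (500 - 150)
```

Under these assumptions `true_debt` is 0.0 while `averaged_debt` is 4550.0: the debt is created entirely by the averaging step, not by anything the recipient did.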

Real-World Scenario
The NYC MyCity Launch and Post-Launch Remediation

Who. The New York City Office of Technology and Innovation (OTI) and the Adams Administration, with Microsoft Azure OpenAI as the underlying model provider.

Situation. NYC launched the MyCity Business Chatbot in October 2023 as part of a broader municipal-AI initiative, with the goal of helping small-business owners navigate city regulations.

Problem. Within months of launch (March-April 2024), The Markup's investigation documented the chatbot giving incorrect answers about employment law (telling business owners they could fire workers for complaining about sexual harassment), housing policy (incorrectly describing tenant-rights protections), and other consequential regulatory questions. The NYC Comptroller's office initiated an audit; the City Council held hearings.

Decision. NYC kept the chatbot live but added prominent disclaimer language ("This chatbot may give incorrect or incomplete information"), expanded the grounded-retrieval corpus, and added explicit refusal patterns for high-stakes regulatory questions. The Comptroller's audit recommended a stricter grounded-retrieval architecture and demographic-disparate-impact evaluation.

How. The remediation centered on three architectural changes: (1) strict-scope retrieval limited to the curated City corpus with refusal-to-answer outside it, (2) hard-coded routing rules for consequential regulatory questions ("for employment-law questions, here is the NYC Commission on Human Rights"), and (3) audit-log review and ongoing eval against a public-question benchmark.

Result. Post-remediation evaluations show meaningfully lower wrong-answer rates on the original failure categories, though the underlying tension between "be helpful" and "refuse outside scope" remains.

Lesson. The MyCity launch is now a standard reference in U.S. municipal-AI procurement: "helpful by default" is a poor default when the wrong answer carries legal consequence, and the architectural fix is strict grounded-retrieval with refusal-by-default outside scope.
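The "ongoing eval against a public-question benchmark" step can be sketched as a harness that records, per failure category, how often the system's behavior differs from the expected one. Everything here is hypothetical: the three benchmark questions, the category labels, and the two toy systems, which exist only so the example runs end to end.

```python
from collections import defaultdict

# Hypothetical benchmark: each item states whether the bot should answer or refuse.
EVAL_SET = [
    {"question": "Can I fire a worker who reported harassment?",
     "category": "employment-law", "expected": "refuse"},
    {"question": "Can I turn away tenants who use housing vouchers?",
     "category": "housing", "expected": "refuse"},
    {"question": "Where do I file a sidewalk cafe permit application?",
     "category": "permits", "expected": "answer"},
]


def wrong_rate_by_category(system, eval_set) -> dict:
    """Fraction of items per category where the system's behavior differs from expected.

    `system` is any callable mapping a question to "answer" or "refuse".
    """
    totals, wrong = defaultdict(int), defaultdict(int)
    for item in eval_set:
        totals[item["category"]] += 1
        if system(item["question"]) != item["expected"]:
            wrong[item["category"]] += 1
    return {category: wrong[category] / totals[category] for category in totals}


def helpful_bot(question: str) -> str:
    """Toy stand-in for a 'helpful by default' assistant: it always answers."""
    return "answer"


def strict_bot(question: str) -> str:
    """Toy stand-in for a scoped assistant: it answers only permit questions."""
    return "answer" if "permit" in question.lower() else "refuse"
```

Tracking the rate per category, rather than one overall number, is what lets an agency say that the original failure categories specifically have improved.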

Numeric Example
The Robodebt scale and the due-process cost of automated benefits

Australia's Robodebt scandal is the most-documented automated-decision-making failure in public benefits, and the numbers are instructive. Scale of harm: roughly 470,000 incorrect debts were issued by Centrelink between 2016 and 2020, totaling ~AUD 1.8 billion in wrongly claimed debt. Settlement and remediation: the 2020 class-action settlement was AUD 1.2 billion (refunds plus interest plus compensation); the 2023 Royal Commission report documented systemic governance failures. Human cost: several recipients died by suicide after receiving the debt notices, a fact extensively documented in the Royal Commission proceedings.

By comparison, Michigan MiDAS (the unemployment-fraud-detection system, 2013-2017) wrongly accused 40,000+ people of fraud with an estimated 93 percent error rate at peak; the eventual settlement was approximately $20 million plus subsequent state-level reforms. Dutch SyRI (the welfare-fraud-detection system that the Dutch courts ruled illegal in 2020) affected primarily immigrant communities and was implicated in the broader childcare-benefits scandal that brought down the Rutte cabinet in 2021.

The consistent pattern across all three: opaque automated decisions in high-consequence benefits domains produce large-scale and irreversible harm. The civil-rights and administrative-law remediation costs typically exceed the original program savings by 5-10x, before counting the political and reputational costs. This is the structural argument for the human-in-the-loop and accountability invariants of Section 72.1.

Figure 72.2.1: The three named pre-LLM automated-decision failures whose lessons now structure every public-sector AI policy framework. Australia's Robodebt (~470,000 wrong debts, AUD 1.8B claimed, several suicides, a 2023 Royal Commission, AUD 1.2B settlement) established that human-in-the-loop and accessible appeals are non-negotiable. Michigan MiDAS (40,000+ wrong fraud accusations, 93% error rate at peak) showed that opacity plus automated accusation produces mass harm in benefits programs. The Dutch SyRI / toeslagen scandal (Dutch courts ruled SyRI illegal in 2020 under ECHR Article 8; the childcare-benefits scandal brought down the Rutte cabinet in January 2021) established disparate-impact testing as a first-class requirement. The remediation cost across all three: 5-10x the original program savings. OMB M-24-10 and the EU AI Act both encode these lessons.

What Comes Next

Section 72.3 walks through the regulatory and policy framework that has consolidated for U.S. federal AI use: OMB M-24-10, FedRAMP authorization, Section 508 accessibility, EU AI Act for public-sector AI, and the state and local AI inventory laws.

Self-Check
1. The NYC MyCity launch is now a standard reference in U.S. municipal-AI procurement. What was the architectural failure, and what is the post-incident standard?
Answer:
The architectural failure was prompting the chatbot as a helpful general assistant rather than as a strict grounded-retrieval system. When a constituent asked a question outside the curated City corpus (employment law, housing policy), the chatbot improvised an answer rather than refusing or routing to the relevant office. The post-incident standard, adopted across U.S. municipal-AI deployments through 2024-2025, is strict grounded-retrieval limited to the curated corpus with refusal-to-answer outside scope, hard-coded routing rules for consequential questions, and audit-log review. "Helpful by default" was retired as a default; "refuse outside scope" replaced it.
2. The Robodebt, Michigan MiDAS, and Dutch SyRI scandals all predate LLMs. Why are they treated as foundational case studies in public-sector LLM training and policy?
Answer:
All three systems were opaque automated decisions in high-consequence benefits domains, deployed without adequate human-in-the-loop review, with limited appeal rights, that produced large-scale and irreversible harm. The technology (rule-based algorithms, ML classifiers) is not the point; the institutional pattern is. The lesson generalizes directly to LLMs: high-consequence benefits decisions must remain with identifiable human decision-makers, with documented appeal paths, and the AI may inform but never decide. Every major civilian-agency AI policy framework from OMB M-24-10 forward reflects this lesson. Treating these as foundational case studies inoculates the field against repeating the institutional failures with a more capable technology.
3. Agency chatbot conversations may themselves be public records under FOIA. What is the failure mode, and what is the procurement pattern that handles it?
Answer:
The failure mode is that an agency deploys a chatbot, the chatbot generates conversations, a journalist or advocacy group files a FOIA request for all chatbot logs, and the agency discovers it has no records-retention policy for the new artifact. The procurement pattern that handles this: contracts specify retention (how long, by whom, under what classification), agency records officers classify chatbot transcripts under the records schedule, FOIA-response capability is engineered into the logging architecture from the start. Several agencies now treat chatbot transcripts as routine public records and publish dashboards or proactive data exports, which both satisfies FOIA and surfaces useful analytics.

What's Next?

In the next section, Section 72.3: Regulatory and Policy Framework for Government LLMs, we build on the material covered here.

Further Reading

Documented Failures

Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin's Press. Reference book on algorithmic-system failures in government services.
AlgorithmWatch (2024). "Automating Society Report." automatingsociety.algorithmwatch.org. Annual catalog of public-sector algorithmic failures in Europe.
Citron, D. K. (2007). "Technological Due Process." Washington University Law Review 85. openscholarship.wustl.edu vol85 iss6/2. The foundational law-review article on algorithmic due process; the legal framework underpinning the due-process failure-mode discussion.
Richardson, R., Schultz, J. M., & Crawford, K. (2019). "Dirty Data, Bad Predictions: How Civil Rights Violations Impact Police Data, Predictive Policing Systems, and Justice." NYU Law Review 94. nyulawreview.org Richardson-Schultz-Crawford. Empirical study showing how biased training data produces predictably harmful government AI; the canonical reference for the data-provenance failure mode.

Public Trust

Brookings Institution (2024). "Public Trust in Government Use of AI." brookings.edu/research. Survey research on citizen trust in government LLMs; informs deployment-failure analysis.
Veale, M., & Brass, I. (2019). "Administration by Algorithm? Public Management Meets Public Sector Machine Learning." Algorithmic Regulation, Oxford University Press. papers.ssrn.com 3375391. Public-administration analysis of how algorithmic government must remain accountable to citizens; informs the procurement and accountability failure-mode discussion.