Public-Sector Grounded Assistant Architecture

Section 72.4

"Strict-scope retrieval, citations always, refusal by default, audit log, accessibility-first. The five rules of a public-sector LLM that does not end up on the front page."

RagRag, Public-Sector-RAG-Architect AI Agent
Big Picture

The dominant pattern in successful public-sector LLM deployments has consolidated around seven layers: strict-scope retrieval, citations always, refusal-by-default outside scope, audit logging, disclaimer-and-informed-use UX, accessibility-first interface, and continuous evaluation against a public benchmark. This section walks through each layer and then maps the architecture onto the FedRAMP-authorized cloud LLM services landscape that procurement teams routinely consult.

Prerequisites

This section assumes the government regulatory framework from Section 72.3, the RAG fundamentals from Section 32.1, and the LLM-audit-log discipline from Section 54.9.

The Seven-Layer Pattern

Fun Fact

FedRAMP-authorized cloud services for federal LLM workloads are a tiny subset of the commercial market: of roughly 300 frontier-class LLM offerings worldwide in 2026, fewer than 15 hold a FedRAMP Moderate authorization, and only a handful hold FedRAMP High. The lead times to authorization explain why federal agencies still routinely deploy 18-month-old model versions; the procurement curve is much slower than the model release curve.

The dominant pattern in successful public-sector LLM deployments:

  1. Strict-scope retrieval: only retrieve from a vetted corpus of agency-approved documents. Refuse to answer questions outside the corpus.
  2. Citations always: every substantive answer includes a link to the underlying source document. No source = no answer.
  3. Refusal-by-default outside scope: when in doubt, hand off to a human agent or direct to the relevant office. Avoid the "helpful generalist" failure mode.
  4. Audit log of every interaction: prompts, retrieved passages, responses, and timestamps. Stored per records-retention policy. Available to oversight bodies.
  5. Disclaimers and informed use: clear language stating the system is a chatbot, not a substitute for an authoritative determination, and explaining how to reach a human.
  6. Accessibility-first UI: screen-reader compatible, keyboard navigable, plain-language responses by default, multilingual support where the served population requires it.
  7. Continuous evaluation against a public benchmark: an eval set of representative constituent questions with expected behavior. Regressions block deployments.
Figure 72.4.1: The seven non-negotiable layers of a successful 2026 public-sector grounded-assistant deployment. Blue layers (1-3) enforce content discipline: strict-scope retrieval over the vetted agency corpus, mandatory citation with source URL and section ID, and refusal-by-default outside scope (the architectural fix for the NYC MyCity helpful-by-default pattern). Yellow layers (4-5) enforce accountability: tamper-evident audit logs and informed-use disclaimers. Green layers (6-7) enforce usability and quality: WCAG 2.1 AA accessibility from procurement onward and continuous eval gating deployments. Under the seven layers sits the FedRAMP-authorized substrate (Azure OpenAI in Azure Gov, AWS Bedrock GovCloud, Vertex AI Assured Workloads, Claude via Bedrock or Azure), with an air-gap fallback using open-weight models inheriting the agency's own ATO.

Procurement teams routinely ask "which LLM is FedRAMP-authorized and at what level?" Table 72.4.1 summarizes the dominant offerings as of mid-2026; the canonical source of truth is always the FedRAMP Marketplace, not vendor marketing.

Table 72.4.1: FedRAMP-authorized cloud LLM services, mid-2026. Authorization levels and availability change frequently; verify in the FedRAMP Marketplace before any procurement decision. DoD impact levels (IL4/IL5/IL6) are tracked separately via DISA.

| Service | Cloud tier | FedRAMP level | Notes |
|---|---|---|---|
| Azure OpenAI Service | Azure Government (also available separately in Azure Commercial) | High in Azure Gov; DoD IL4/IL5 via separate authorizations | Most-cited path for federal GenAI workloads in 2026 |
| AWS Bedrock | AWS GovCloud (US) | High; availability varies by model | Foundation models available in GovCloud lag commercial by weeks to months |
| Google Vertex AI | Google Cloud Assured Workloads | High for select services; verify per model in the Marketplace | Gemini availability in government tiers expanded through 2025-26 |
| Anthropic Claude (via AWS Bedrock GovCloud and Azure) | AWS GovCloud / Azure Government | Inherits the hosting platform's authorization | Frequently selected when Anthropic-specific safety properties matter |
| On-premises open-weight (Llama, Mistral, Qwen via vLLM/NIM) | Agency-controlled infrastructure | N/A (no cloud service to authorize) | Inherits the agency's own ATO; required for air-gapped and classified workloads |

Layer Notes

Layer 1 (strict-scope retrieval) is the architectural defense against the NYC MyCity pattern. The retrieval index is curated, the LLM is prompted to answer only from the passages retrieved from that index, and the response template includes the citations. Any question that does not retrieve relevant passages produces a "this is outside my scope; here is the office to contact" response.
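A minimal sketch of the scope gate in Python, assuming passages already carry embeddings from some agency-chosen embedding model and that cosine similarity is the relevance score. The `Passage` type, the `min_score=0.75` threshold, and the example URL are hypothetical; a real threshold would be tuned against the evaluation set, and a production system would query a vector index (for example OpenSearch) rather than scan a Python list.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str                   # ID in the vetted agency corpus
    url: str                      # canonical source link, shown as the citation
    text: str
    embedding: tuple[float, ...]  # from an (assumed) agency-chosen embedding model


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_in_scope(query_emb, corpus, k=3, min_score=0.75):
    """Return up to k passages whose similarity clears the (hypothetical) scope threshold."""
    scored = sorted(
        ((cosine(query_emb, p.embedding), p) for p in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [p for score, p in scored[:k] if score >= min_score]


def scope_gate(query_emb, corpus, office_contact):
    """Answer only when vetted passages are relevant; otherwise refuse with a handoff."""
    passages = retrieve_in_scope(query_emb, corpus)
    if not passages:
        return ("refuse", f"This is outside my scope; please contact {office_contact}.")
    return ("answer", passages)


# Toy two-dimensional embeddings, for illustration only.
corpus = [Passage("SOP-12", "https://example.gov/sop-12", "Filing deadlines for claims.", (1.0, 0.1))]
print(scope_gate((1.0, 0.0), corpus, "the Benefits Office")[0])  # answer
print(scope_gate((0.0, 1.0), corpus, "the Benefits Office")[0])  # refuse
```

The refusal path is decided before any text is generated, so an out-of-scope question never reaches the model as an open-ended prompt.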

Layer 2 (citations always) is operationally simple but critical for trust: the user must be able to verify the response against the source. The pattern shared across successful federal deployments is to include the citation as a clickable link in the response, not as a footnote that users skip.
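The "no source = no answer" rule can be enforced outside the model. The sketch below assumes, hypothetically, that the model is instructed to cite sources as `[DOC-ID]` tokens and that the chat UI renders Markdown links; it rejects drafts with no citations or with citations to documents that were not actually retrieved, and otherwise turns each citation into a clickable link.

```python
import re
from typing import Optional

# Assumed convention: the model cites sources as [DOC-ID] tokens.
CITATION = re.compile(r"\[([A-Za-z0-9-]+)\]")


def enforce_citations(draft: str, retrieved: dict) -> Optional[str]:
    """Render citations as links, or return None ("no source = no answer").

    retrieved maps doc_id -> source URL for the passages actually retrieved.
    """
    cited = CITATION.findall(draft)
    if not cited:
        return None  # uncited answer: do not show it
    if any(doc_id not in retrieved for doc_id in cited):
        return None  # cites a document that was not retrieved (likely fabricated)
    # Markdown link syntax assumes the chat UI renders Markdown.
    return CITATION.sub(lambda m: f"[{m.group(1)}]({retrieved[m.group(1)]})", draft)


retrieved = {"SOP-12": "https://example.gov/sop-12"}
print(enforce_citations("Filing deadlines are listed in [SOP-12].", retrieved))
print(enforce_citations("Filing deadlines are listed in [SOP-99].", retrieved))  # None
```

Returning `None` signals the caller to fall back to the out-of-scope handoff message instead of showing an uncited or mis-cited answer.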

Layer 3 (refusal-by-default) is the policy choice that distinguishes the public-sector pattern from the consumer pattern. The default behavior is "I do not have information on this; here is who to contact"; the exception is "here is the answer with citation."
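Refusal-by-default starts in the prompt. The following is a hypothetical prompt layout, not a tested production prompt: it lists the retrieved passages with their IDs, makes the refusal-plus-contact reply the required fallback, and pairs it with an exact-match check so the application can route refusals to the human-handoff path.

```python
REFUSAL_TEMPLATE = "I do not have information on this; please contact {office}."


def build_grounded_prompt(question: str, passages: list, office: str) -> str:
    """Assemble a refusal-by-default prompt from (doc_id, text) pairs (hypothetical wording)."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    refusal = REFUSAL_TEMPLATE.format(office=office)
    return (
        "You are an informational assistant for a public agency, not a source of "
        "authoritative determinations.\n"
        "Rules:\n"
        "1. Answer ONLY from the SOURCES below. Do not use outside knowledge.\n"
        "2. Cite every claim with its source ID in square brackets, e.g. [SOP-12].\n"
        f'3. If the SOURCES do not clearly answer the question, reply exactly: "{refusal}"\n'
        "4. When in doubt, follow rule 3.\n\n"
        f"SOURCES:\n{sources}\n\n"
        f"QUESTION: {question}\n"
    )


def is_refusal(reply: str, office: str) -> bool:
    """Detect the scripted refusal so the app can show the human-handoff path."""
    return reply.strip().strip('"') == REFUSAL_TEMPLATE.format(office=office)


prompt = build_grounded_prompt(
    "When is my claim due?",
    [("SOP-12", "Filing deadlines for claims.")],
    "the Benefits Office",
)
print(prompt)
print(is_refusal("I do not have information on this; please contact the Benefits Office.",
                 "the Benefits Office"))  # True
```

A prompt is a soft control: models do not always follow instructions, so deterministic checks outside the model (such as a retrieval-score threshold and citation validation) should backstop it.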

Layer 4 (audit log) supports both FOIA responses and after-action review. Every prompt, the passages retrieved for it, and the response are logged with timestamps, and retention is governed by the agency's records schedule. Several agencies publish the audit logs proactively to satisfy transparency expectations.
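One common way to make such a log tamper-evident, as the figure caption requires, is hash chaining: each entry stores the SHA-256 hash of the previous entry, so editing or reordering any earlier record breaks verification of everything after it. The sketch uses only the Python standard library and keeps the log in memory; the field names are assumptions. In practice, entries would be shipped to the agency's log platform, and the latest hash would be anchored in write-once storage so that truncation is also detectable.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    # sort_keys makes the serialization, and therefore the hash, deterministic.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, prompt: str, passage_ids: list, response: str) -> dict:
    """Append one interaction (prompt, retrieved passages, response, timestamp)."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "retrieved": passage_ids,
        "response": response,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    entry["hash"] = _digest(entry)  # hash covers every field except "hash" itself
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


audit = []
append_entry(audit, "When is my claim due?", ["SOP-12"], "See [SOP-12].")
append_entry(audit, "Can I appeal?", [], "This is outside my scope; please contact the Appeals Office.")
print(verify_chain(audit))  # True
audit[0]["response"] = "edited later"
print(verify_chain(audit))  # False
```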

Layer 5 (disclaimers) is a UX requirement that matches the legal requirement. The system clearly identifies itself as a chatbot, frames its responses as informational rather than authoritative, and provides clear paths to human agents for authoritative determinations.

Layer 6 (accessibility-first UI) is non-negotiable under Section 508 and equivalent state rules. The interface supports screen readers, keyboard navigation, and plain-language responses. Multilingual support is required where the served population needs it.
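Plain-language responses can be screened automatically before display. The sketch below uses the Flesch-Kincaid grade-level formula, which the text does not prescribe, together with a crude vowel-group syllable counter. The grade-8 target is a hypothetical default. A readability score is only a screen; it does not replace plain-language review or testing with assistive technologies.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)


def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59


def is_plain_language(text: str, max_grade: float = 8.0) -> bool:
    # max_grade is a hypothetical target; each agency sets its own.
    return fk_grade(text) <= max_grade


simple = "You can file the form online. It takes ten minutes."
dense = ("Applicants must substantiate eligibility through comprehensive "
         "documentation of qualifying circumstances.")
print(is_plain_language(simple))  # True
print(is_plain_language(dense))   # False
```

A response that fails the check can be regenerated with an instruction to simplify, or flagged for review, rather than shown as is.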

Layer 7 (continuous evaluation) is the operational discipline that catches regressions. The evaluation set is curated against actual constituent questions, the expected behavior is documented, and any regression on the eval set blocks deployment.
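A minimal sketch of the regression gate. It assumes a hypothetical assistant interface that returns whether it answered or refused, plus the document IDs it cited. Each eval case records the expected behavior. Because any regression blocks deployment, the gate compares per-case results against the current production baseline rather than an aggregate score, so a candidate cannot trade one newly broken case for another newly fixed one.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    expect: str                         # "answer" or "refuse"
    required_doc: Optional[str] = None  # source that must be cited when answering


# Hypothetical assistant interface: returns ("answer" | "refuse", cited doc IDs).
Assistant = Callable[[str], tuple[str, list[str]]]


def passing_cases(cases: list, assistant: Assistant) -> set:
    """IDs of cases where behavior and required citation match expectations."""
    passed = set()
    for case in cases:
        kind, cited = assistant(case.question)
        if kind != case.expect:
            continue
        if case.required_doc is not None and case.required_doc not in cited:
            continue
        passed.add(case.case_id)
    return passed


def deployment_gate(baseline: set, candidate: set) -> tuple:
    """Block deployment if any case that passed on the baseline now fails."""
    regressions = baseline - candidate
    return (not regressions, regressions)


cases = [
    EvalCase("q1", "What is the filing deadline?", "answer", "SOP-12"),
    EvalCase("q2", "Should I sue my landlord?", "refuse"),
]


def baseline_bot(q):
    return ("answer", ["SOP-12"]) if "deadline" in q else ("refuse", [])


def candidate_bot(q):  # a "more helpful" candidate that stopped refusing
    return ("answer", ["SOP-12"])


ok, regressed = deployment_gate(passing_cases(cases, baseline_bot),
                                passing_cases(cases, candidate_bot))
print(ok, regressed)  # False {'q2'}
```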

Key Insight

The seven layers above are conservative by design, and that conservatism is what lets the system ship inside the constraints of federal procurement, accessibility, and accountability law. A more permissive architecture would need longer authorization, would carry a heavier compliance burden, and would not deliver meaningfully more value to constituents. Public-sector AI is not a domain where pushing the architectural frontier ships faster; here, the conservative architecture is the faster one to ship.

Real-World Scenario
A Federal Benefits Agency's Internal Knowledge-Search Pilot

Who. A federal benefits administration (a composite drawn from VA, SSA, and HHS internal-pilot reports) with ~50,000 caseworker employees who routinely answer policy questions on benefits eligibility, documentation requirements, and procedural matters.

Situation. The agency identified ~30 years of accumulated guidance documents, standard operating procedures, regulatory interpretations, and case-disposition memoranda, totaling roughly 2 million documents. Caseworkers spent measurable time looking up policy answers, often duplicating work across the agency.

Problem. The pilot needed FedRAMP authorization, OMB M-24-10 compliance, audit-log integration, and a deployment timeline measured in months rather than years.

Decision. The agency built an internal-only, LLM-augmented knowledge-search assistant using the seven-layer pattern: Azure OpenAI Service in Azure Government (FedRAMP High), a retrieval index over the curated policy corpus, a system prompt enforcing refusal outside scope, citations on every response, audit logging into the agency's Splunk in GovCloud, an accessibility audit, and continuous evaluation against a 500-question benchmark. The deployment was explicitly scoped as non-rights-impacting (the assistant helps caseworkers find answers; the caseworker decides).

How. Implementation took 9 months end to end, dominated by FedRAMP-equivalent agency ATO paperwork rather than engineering.

Result. Caseworker time on policy lookups dropped roughly 45 percent in the pilot cohort; the assistant handles ~70 percent of routine questions; and the audit-log capability supported two OIG inquiries that required reproducing past policy-interpretation contexts.

Lesson. The seven-layer pattern is repeatable: scope explicitly as non-rights-impacting, use FedRAMP-authorized cloud services, build the eval set up front, and expect the deployment timeline to be dominated by paperwork rather than engineering.

Numeric Example
The seven-layer architecture sized for a federal agency

For a federal benefits agency deploying the seven-layer pattern at 10,000-employee scale, the cost stack decomposes as follows. Layer 1 (strict-scope retrieval): a 2-million-document corpus, ingested and chunked with provenance metadata and indexed in OpenSearch in GovCloud, costs ~$50-100K in one-time corpus preparation plus ~$40K/year in index operations. Layer 2 (citations always): negligible incremental cost beyond standard retrieval. Layer 3 (refusal-by-default): a system prompt of ~3,000 tokens; no incremental architecture cost, although those tokens are billed as input on every query. Layer 4 (audit log): at 10,000 employees doing ~10 queries/day each (~100K queries/day) and ~10KB per logged interaction, ~1GB/day or ~365GB/year of logs. In Splunk in GovCloud at ~$2,000/GB-year for high-availability retention, that is ~$730K/year; tiered storage (hot 90 days, warm 2 years, cold 7 years) reduces this to ~$200-300K/year. Layer 5 (disclaimers): UI work, ~$30-60K one-time. Layer 6 (accessibility-first UI): additional Section 508 / WCAG 2.1 AA development effort, ~$100-200K one-time plus ~$30K/year for ongoing accessibility audits. Layer 7 (continuous evaluation): a 500-question evaluation set plus monthly regression testing, ~$50K one-time plus ~$30K/year ongoing.

Inference cost: 100K queries/day at an average of ~5,000 input and ~800 output tokens per query comes to ~$2-3M/year in Azure OpenAI Service (Azure Government) consumption at standard SKU rates. Total annual cost: roughly $3-4M/year at 10,000-employee scale, against perhaps $200-400M/year in caseworker labor at the same scale. The 45 percent productivity gain on policy lookup translates to ~$25-50M/year in recovered staff time, which implies that policy lookups take a little over a quarter of caseworker time. The return is therefore roughly 6 to 17 times the annual cost, on the order of 10x.
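The arithmetic for the two largest recurring lines can be reproduced directly. The workload figures come from the text. The per-million-token prices are hypothetical values picked only to show how a $2-3M/year range arises; they are not published Azure Government rates. The last lines back out the share of caseworker labor that policy lookups must represent for the $25-50M recovered-time figure to hold.

```python
# Workload figures from the text.
EMPLOYEES = 10_000
QUERIES_PER_EMPLOYEE_PER_DAY = 10
BYTES_PER_LOGGED_INTERACTION = 10_000   # ~10 KB, decimal units
SPLUNK_USD_PER_GB_YEAR = 2_000          # high-availability retention, untiered
INPUT_TOKENS_PER_QUERY = 5_000
OUTPUT_TOKENS_PER_QUERY = 800

# Hypothetical prices per million tokens, NOT published Azure Government rates.
USD_PER_M_INPUT = 10.0
USD_PER_M_OUTPUT = 30.0

queries_per_day = EMPLOYEES * QUERIES_PER_EMPLOYEE_PER_DAY
log_gb_per_day = queries_per_day * BYTES_PER_LOGGED_INTERACTION / 1e9
log_gb_per_year = log_gb_per_day * 365
audit_usd_untiered = log_gb_per_year * SPLUNK_USD_PER_GB_YEAR

input_m_tokens = queries_per_day * INPUT_TOKENS_PER_QUERY * 365 / 1e6
output_m_tokens = queries_per_day * OUTPUT_TOKENS_PER_QUERY * 365 / 1e6
inference_usd = input_m_tokens * USD_PER_M_INPUT + output_m_tokens * USD_PER_M_OUTPUT

# Share of caseworker labor that policy lookups must represent so that a 45%
# gain on lookups recovers $25M out of $200M (the same ratio holds for $50M of $400M).
PRODUCTIVITY_GAIN = 0.45
lookup_share = 25e6 / (PRODUCTIVITY_GAIN * 200e6)

print(f"Audit log: {log_gb_per_day:.1f} GB/day, {log_gb_per_year:.0f} GB/year, "
      f"${audit_usd_untiered:,.0f}/year untiered")
# Audit log: 1.0 GB/day, 365 GB/year, $730,000/year untiered
print(f"Inference: ${inference_usd:,.0f}/year")
# Inference: $2,701,000/year
print(f"Implied policy-lookup share of labor: {lookup_share:.0%}")
# Implied policy-lookup share of labor: 28%
```

Changing only the two price constants shows how sensitive the total is to the SKU the agency actually procures.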

Self-Check
1. Why is strict-scope retrieval (Layer 1) described as the architectural defense against the NYC MyCity pattern, and what behavioral change does it produce?
The NYC MyCity failure was a chatbot that improvised answers outside its curated scope because the system was prompted as a helpful general assistant. Strict-scope retrieval defends against this by ensuring the retrieval layer only returns curated content, the LLM is prompted to answer only from retrieved passages, and any question that fails to retrieve relevant content produces a refusal-with-handoff ("I do not have information on this; here is the office to contact") rather than an improvised answer. The behavioral change is that the system's default is to refuse rather than to attempt; "helpful by default" is replaced with "refuse outside scope by default."
2. The "On-premises open-weight" option in Table 72.4.1 trades capability for control: open-weight models lag frontier-cloud LLMs. Why does this lag exist, and what classes of workload accept it?
The lag exists because open-weight model releases follow proprietary frontier releases by 6-18 months at the capability level, and the on-premises deployment cannot incorporate the proprietary-frontier improvements (the model weights are unavailable). Workloads that accept the lag are those where the lag cost is lower than the alternative cost of data egress: classified networks, certain intelligence-community use cases, defense-industrial-base contracts, and highly-sensitive state-level workloads (criminal justice, tax administration). These workloads cannot use commercial cloud even with FedRAMP High or IL5/IL6 authorization, so the lag is the cost of doing business rather than a tradeoff against alternatives.
3. Name four properties that the accessibility-first UI (Layer 6) must satisfy under Section 508 and WCAG 2.1 AA.
Section 508 (which incorporates WCAG 2.0 AA) and WCAG 2.1 AA (the standard most state and local governments require) call for, among other things: (1) screen-reader compatibility (the UI must work with JAWS, NVDA, and similar assistive technologies), (2) keyboard navigation (no interaction may require a mouse), (3) visible focus indicators (the user must see which UI element is active), and (4) accessible streaming-text behavior (the LLM's streaming output must not break screen-reader parsing or trap focus). The U.S. Access Board has published specific chatbot-accessibility guidance through 2024-2025. Several public-sector pilot deployments failed audits because the streaming-text behavior was not tested with assistive technologies; the pattern that works is to treat accessibility as a first-class procurement requirement and verify it before award.

What Comes Next

Section 72.5: Government LLM Vendors and Postmortems closes the chapter with the vendor and tool landscape (Palantir AIP, Anduril, the FedRAMP-authorized cloud providers), the in-book cross-references, and the canonical external sources.
