"Strict-scope retrieval, citations always, refusal by default, audit log, accessibility-first. The five rules of a public-sector LLM that does not end up on the front page."
Rag, Public-Sector-RAG-Architect AI Agent
The dominant pattern in successful public-sector LLM deployments has consolidated around seven layers: strict-scope retrieval, citations always, refusal-by-default outside scope, audit logging, disclaimer-and-informed-use UX, accessibility-first interface, and continuous evaluation against a public benchmark. This section walks through each layer and then maps the architecture onto the FedRAMP-authorized cloud LLM services landscape that procurement teams routinely consult.
Prerequisites
This section assumes the government regulatory framework from Section 72.3, the RAG fundamentals from Section 32.1, and the LLM-audit-log discipline from Section 54.9.
The Seven-Layer Pattern
FedRAMP-authorized cloud services for federal LLM workloads are a tiny subset of the commercial market: of roughly 300 frontier-class LLM offerings worldwide in 2026, fewer than 15 hold a FedRAMP Moderate authorization, and only a handful hold FedRAMP High. The lead times to authorization explain why federal agencies still routinely deploy 18-month-old model versions; the procurement curve is much slower than the model release curve.
The dominant pattern in successful public-sector LLM deployments:
- Strict-scope retrieval: only retrieve from a vetted corpus of agency-approved documents. Refuse to answer questions outside the corpus.
- Citations always: every substantive answer includes a link to the underlying source document. No source = no answer.
- Refusal-by-default outside scope: when in doubt, hand off to a human agent or direct to the relevant office. Avoid the "helpful generalist" failure mode.
- Audit log of every interaction: prompts, retrieved passages, responses, and timestamps. Stored per records-retention policy. Available to oversight bodies.
- Disclaimers and informed use: clear language stating the system is a chatbot, not a substitute for an authoritative determination, and explaining how to reach a human.
- Accessibility-first UI: screen-reader compatible, keyboard navigable, plain-language responses by default, multilingual support where the served population requires it.
- Continuous evaluation against a public benchmark: an eval set of representative constituent questions with expected behavior. Regressions block deployments.
Procurement teams routinely ask "which LLM is FedRAMP-authorized and at what level?" Table 72.4.1a summarizes the dominant offerings as of mid-2026; the canonical source of truth is always the FedRAMP Marketplace, not vendor marketing.
| Service | Cloud tier | FedRAMP level | Notes |
|---|---|---|---|
| Azure OpenAI Service | Azure Government (separately also Azure Commercial) | High in Azure Gov; DoD IL4/IL5 via separate authorizations | Most-cited path for federal GenAI workloads in 2026 |
| AWS Bedrock | AWS GovCloud (US) | High; specific models vary in availability | Available foundation models in GovCloud lag commercial by weeks-months |
| Google Vertex AI | Google Cloud Assured Workloads | High for select services; verify per model in the Marketplace | Gemini availability in government tiers expanded through 2025-26 |
| Anthropic Claude (via AWS Bedrock GovCloud and Azure) | AWS GovCloud / Azure Government | Inherits hosting platform's authorization | Frequently selected when Anthropic-specific safety properties matter |
| On-premises open-weight (Llama, Mistral, Qwen via vLLM/NIM) | Agency-controlled infrastructure | N/A (no cloud service to authorize) | Inherits the agency's own ATO; required for air-gapped and classified workloads |
Layer Notes
Layer 1 (strict-scope retrieval) is the architectural defense against the NYC MyCity pattern. The retrieval index is curated, the LLM is prompted to retrieve from the index, and the response template includes the citations. Any question that does not retrieve relevant passages produces a "this is outside my scope; here is the office to contact" response.
Layer 2 (citations always) is operationally simple but pedagogically critical. The user must be able to verify the response against the source. The pattern shared across successful federal deployments is to include the citation as a clickable link in the response, not as a footnote that users skip.
Layer 3 (refusal-by-default) is the policy choice that distinguishes the public-sector pattern from the consumer pattern. The default behavior is "I do not have information on this; here is who to contact"; the exception is "here is the answer with citation."
Layer 4 (audit log) supports both FOIA-response and after-action review. Every prompt and response is logged with timestamps, and the retention is governed by the agency's records schedule. Several agencies publish the audit logs proactively to satisfy transparency expectations.
Layer 5 (disclaimers) is a UX requirement that matches the legal requirement. The system clearly identifies itself as a chatbot, frames its responses as informational rather than authoritative, and provides clear paths to human agents for authoritative determinations.
Layer 6 (accessibility-first UI) is non-negotiable under Section 508 and equivalent state rules. The interface supports screen readers, keyboard navigation, and plain-language responses. Multilingual support is required where the served population needs it.
Layer 7 (continuous evaluation) is the operational discipline that catches regressions. The evaluation set is curated against actual constituent questions, the expected behavior is documented, and any regression on the eval set blocks deployment.
The seven layers above are conservative by design, and the conservative architecture is what allows the system to ship inside the constraints of federal procurement, accessibility, and accountability law. A more permissive architecture would require longer authorization, would carry higher compliance burden, and would not produce meaningfully more value to constituents. Public-sector AI is not a domain where pushing the architectural frontier produces faster shipping; the opposite is true. Conservative architecture is faster to ship in the public sector.
Who. A federal benefits administration (composite drawn from VA, SSA, and HHS internal-pilot reports) with ~50,000 caseworker employees who routinely answer policy questions on benefits eligibility, documentation requirements, and procedural matters. Situation. The agency identified ~30 years of accumulated guidance documents, standard operating procedures, regulatory interpretations, and case-disposition memoranda totaling roughly 2 million documents. Caseworkers spent measurable time looking up policy answers, often duplicating work across the agency. Problem. The pilot needed FedRAMP authorization, OMB M-24-10 compliance, audit-log integration, and a deployment timeline measured in months rather than years. Decision. The agency built an internal-only LLM-augmented knowledge-search assistant using the seven-layer pattern: Azure OpenAI Service in Azure Government (FedRAMP High), retrieval index over the curated policy corpus, system prompt enforcing refusal-to-answer outside scope, every response with citations, audit logging into the agency's Splunk in GovCloud, accessibility audit, and continuous evaluation against a 500-question benchmark. The deployment was scoped explicitly as non-rights-impacting (the assistant helps caseworkers find answers; the caseworker decides). How. Implementation took 9 months end-to-end, dominated by FedRAMP-equivalent agency ATO paperwork rather than engineering. Result. Caseworker time on policy lookups dropped roughly 45 percent in the pilot cohort; the assistant handles ~70 percent of routine questions; the audit-log capability supported two OIG inquiries that required reproducing past policy-interpretation contexts. Lesson. The seven-layer pattern is repeatable: scope explicitly as non-rights-impacting, use FedRAMP-authorized cloud services, build the eval set up-front, and the deployment timeline is dominated by paperwork rather than engineering.
For a federal benefits agency deploying the seven-layer pattern at 10,000-employee scale, the cost stack decomposes as follows. Layer 1 (strict-scope retrieval): a 2-million-document corpus, ingested and chunked with provenance metadata, indexed in OpenSearch in GovCloud, costing ~$50-100K one-time corpus preparation + ~$40K/year index operations. Layer 2 (citations always): negligible incremental cost beyond standard retrieval. Layer 3 (refusal-by-default): system prompt of ~3,000 tokens, no incremental architecture cost. Layer 4 (audit log): at 10,000 employees doing ~10 queries/day each (~100K queries/day) and ~10KB per logged interaction, ~1GB/day or ~365GB/year of logs. In Splunk in GovCloud at ~$2,000/GB-year for high-availability retention, that is ~$730K/year; tiered storage (hot 90 days, warm 2 years, cold 7 years) reduces this to ~$200-300K/year. Layer 5 (disclaimers): UI work, ~$30-60K one-time. Layer 6 (accessibility-first UI): additional Section 508 / WCAG 2.1 AA development effort, ~$100-200K one-time + ~$30K/year for ongoing accessibility audits. Layer 7 (continuous evaluation): a 500-question evaluation set + monthly regression testing, ~$50K one-time + ~$30K/year ongoing.
Inference cost: 100K queries/day at ~5,000 input + 800 output tokens averaged is ~$2-3M/year in Azure OpenAI Government Service consumption at standard SKU rates. Total annual cost: roughly $3-4M/year at 10,000-employee scale, against perhaps $200-400M/year in caseworker labor at the same scale. The 45 percent productivity gain on policy lookup translates to ~$25-50M/year in recovered staff time; the ROI is positive by an order of magnitude.
- Chapter 32 (Retrieval-Augmented Generation) for the strict-scope retrieval architecture (Layer 1).
- Chapter 13 (Prompt Design) for the system-prompt engineering underlying Layer 3.
- Chapter 42 (Evaluation Foundations) for the continuous-evaluation methodology of Layer 7.
- Section 69.4 (HIPAA Deployment Patterns) for the structurally similar five-layer defensive pattern in healthcare.
- Section 71.4 (Trust Boundaries) for the structurally similar five-layer trust-boundary pattern in cybersecurity.
Show Answer
Show Answer
Show Answer
What Comes Next
Section 72.5 closes the chapter with the vendor and tool landscape (Palantir AIP, Anduril, the FedRAMP-authorized cloud providers), the in-book cross-references, and the canonical external sources.
What's Next?
In the next section, Section 72.5: Government LLM Vendors and Postmortems, we build on the material covered here.