Section 54.8: System Cards and Frontier System Disclosures

"A system card extends the model card with the parts of the LLM stack that ship to humans: the safety filters, the prompts, the refusal logic."
Sentinel, System-Card-Author AI Agent

Big Picture

A model card documents a model. A datasheet documents a dataset. A system card documents a deployed AI system: the model plus its safety mitigations, deployment surface, evaluation suite, red-team findings, and remaining known risks. System cards became the dominant frontier-lab disclosure format in 2023-2026: OpenAI's GPT-4 (March 2023), GPT-4o (August 2024, for the model released in May 2024), and o1 (September 2024) system cards; Anthropic's Claude 3.x and 3.7 system cards; Google's Gemini 2.5 system card and the Frontier Safety Framework disclosures. This section dissects the format, walks through what each major lab includes (and what each leaves out), and explains why system cards have become the operational artifact that EU AI Act Article 56 codes of practice and the UK AI Safety Institute audits actually consume.

Prerequisites

This section assumes the model-card and datasheet patterns from Section 54.6 and Section 54.7, the LLM-safety framing from Section 49.1, and the frontier-API release pattern from Section 11.1.

54.8.1 Model Card vs System Card: A Critical Distinction

Fun Fact

The GPT-4 system card was published by OpenAI in March 2023 as part of the GPT-4 Technical Report. In the same month, Sebastien Bubeck's team at Microsoft Research, which had been testing an early version of GPT-4, released the 155-page "Sparks of Artificial General Intelligence" paper. The two documents are easy to conflate, but they have different authors and different jobs: the 60-page system card is OpenAI's own risk disclosure, while "Sparks" is a separate capability study.

The terms are easy to confuse and the difference matters legally. A model card documents the trained weights: what was trained, on what data, evaluated how. A system card documents the deployed product, including:

Which model version is currently in production (often different from the model originally documented)
What safety mitigations sit around the model (system prompt, content filters, refusal training, tool restrictions)
What the deployment surface is (API rate limits, allowed use cases, geographic restrictions)
What red-teaming was done against the deployed system, not just against the bare model
What ongoing monitoring is in place
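
The five items above can be read as the fields of a record. The following is a minimal sketch, assuming a hypothetical `SystemCard` class; the field names, the `gaps` helper, and the example values are illustrative and do not come from any lab's published schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class SystemCard:
    """Hypothetical record mirroring the five items a system card adds."""
    model_version: str                      # version actually in production
    published: date                         # cards are version-stamped
    mitigations: list[str] = field(default_factory=list)
    deployment_surface: dict[str, str] = field(default_factory=dict)
    red_team_scope: str = "unspecified"     # "bare model" or "deployed system"
    monitoring: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Return the names of sections the card leaves empty or weak."""
        missing = []
        if not self.mitigations:
            missing.append("mitigations")
        if not self.deployment_surface:
            missing.append("deployment_surface")
        if self.red_team_scope != "deployed system":
            missing.append("red_team_scope")
        if not self.monitoring:
            missing.append("monitoring")
        return missing


card = SystemCard(
    model_version="example-model-2025-01",  # placeholder, not a real model ID
    published=date(2025, 1, 15),
    mitigations=["system prompt", "output content filter"],
    deployment_surface={"api": "rate limited"},
    red_team_scope="bare model",
)
print(card.gaps())  # ['red_team_scope', 'monitoring']
```

A reviewer could use a structure like this as a checklist: any section reported by `gaps` is a question to send back to the vendor.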

The gap is consequential. A model that shows a 5% toxicity rate on a known toxic-content benchmark poses a different risk than the same model deployed behind a Llama Guard filter that reduces production toxicity to 0.1%. Procurement and regulatory questions are about the system, not the bare model.
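
To make the 5% versus 0.1% comparison concrete, the sketch below works out how much of the bare model's toxic output a filter would have to catch. It assumes a deliberately simple model in which the filter only blocks toxic outputs and every blocked output is replaced by a safe one; `residual_rate` and `required_recall` are hypothetical helper names.

```python
def residual_rate(base_rate: float, filter_recall: float) -> float:
    """Toxicity rate that survives a filter catching `filter_recall` of toxic outputs."""
    return base_rate * (1.0 - filter_recall)


def required_recall(base_rate: float, target_rate: float) -> float:
    """Fraction of toxic outputs the filter must catch to reach `target_rate`."""
    return 1.0 - target_rate / base_rate


# Figures from the text: 5% for the bare model, 0.1% for the deployed system.
recall = required_recall(base_rate=0.05, target_rate=0.001)
print(f"{recall:.2%}")                       # 98.00%
print(f"{residual_rate(0.05, recall):.2%}")  # 0.10%
```

Under these assumptions the filter has to catch 98% of toxic outputs, a property of the deployed system that a model card alone cannot show.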

54.8.2 The OpenAI System Card Format

OpenAI's GPT-4 system card (March 2023) established the format. The 2024-2025 evolution through the GPT-4o, o1, and o3 system cards added more rigorous safety-evaluation methodology. The structure as of 2025:

Introduction: model lineage, release context, headline capabilities.
Risks and Mitigations: structured around enumerated risk categories (Disallowed Content, Hallucinations, Harmful Bias, Disinformation/Influence Operations, Privacy, Cybersecurity, CBRN Uplift, Self-Replication). For each category: the threat model, the evaluation methodology, headline metrics with comparison to prior models, mitigations in place, residual risk.
Preparedness Framework Evaluations: OpenAI's internal safety-rating framework. Each tracked risk category receives a Low/Medium/High/Critical rating based on capability evaluations and threat-model considerations. Under the framework as originally published, only models with a post-mitigation score of Medium or below can be deployed, so a higher rating requires additional safeguards before release.
External Red-Teaming: summary of pre-deployment red-team campaigns, typically run with the UK AISI, US AISI, and a roster of external contractors. Findings are reported in aggregate; specific exploits are typically not published.
Apollo Research and METR Evaluations: third-party agentic evaluations for autonomy, deception, and self-improvement risks. Included in recent cards, including o1's.
Monitoring and Mitigation: production monitoring approach; how new abuse patterns get added to the mitigation pipeline.
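
The rating-and-gating logic described in the Preparedness Framework item can be sketched as a small check. This is an illustrative simplification, not OpenAI's actual process: the `Risk` enum, the `categories_needing_safeguards` function, the Medium default threshold, and the scorecard values are all assumptions made for the example.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def categories_needing_safeguards(scorecard: dict[str, Risk],
                                  threshold: Risk = Risk.MEDIUM) -> list[str]:
    """Return the categories rated above the threshold, sorted by name."""
    return sorted(name for name, level in scorecard.items() if level > threshold)


# Hypothetical scorecard; the ratings are invented for illustration.
scorecard = {
    "cybersecurity": Risk.LOW,
    "cbrn": Risk.MEDIUM,
    "persuasion": Risk.MEDIUM,
    "model_autonomy": Risk.HIGH,
}

flagged = categories_needing_safeguards(scorecard)
print(flagged)      # ['model_autonomy']
print(not flagged)  # False: at least one category blocks release as-is
```

Because each category is rated separately, a single high-rated axis is enough to hold a release even when every other axis is low.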

The o1 system card (September 2024) was particularly significant because o1's chain-of-thought reasoning created new evaluation surface area. The card included previously-unpublished detail about deception evaluations and tested behavior under "model-thinks-it's-being-tested" conditions.

54.8.3 The Anthropic System Card Format

Anthropic's Claude system cards, refined across the 3.x family and the 3.7 Sonnet release in early 2025, take a different organizational stance:

RSP-anchored. Every system card is explicitly cross-referenced to Anthropic's Responsible Scaling Policy. The model's AI Safety Level (ASL) classification is published with the rationale.
Constitutional AI methodology disclosed at high level. The current constitution is linked; the card describes the RLHF and CAI training stages without disclosing the exact dataset composition.
Capability evaluations are quantitative. The cards report specific benchmark numbers on the safety-relevant evals (cyber-uplift on InterCode, CBRN on internal evals, agentic on SWE-bench Verified, deception on suites such as MASK).
Red-team summary with anonymized findings. The 3.7 system card included a summary of red-team findings ranked by severity, with the disposition (mitigated, accepted, residual) noted for each. Specific exploits were not published, but the categorical breakdown was.

54.8.4 Google's Frontier Safety Framework

Google DeepMind's Frontier Safety Framework (FSF), released in May 2024 with a v2 update in early 2025, takes a more structured approach to capability thresholds. The framework defines:

Critical Capability Levels (CCLs) for each of cybersecurity, biological, autonomy, and machine-learning R&D.
Mitigation thresholds: which safety mitigations are required at each CCL.
Evaluation cadence: when capability evaluations must be re-run during training.

Each Gemini release ships a system card that maps the model to the FSF's CCLs, lists the safety mitigations in place, and gives evaluation numbers on standardized capability tests. The 2.5 Pro system card reported that Gemini had reached an "early warning indicator" on one of the biological-capability evaluations, and it described the corresponding mitigation deployment (added refusal training, restricted API tier).

Comparison matrix of four frontier-lab disclosure frameworks. Rows: OpenAI Preparedness Framework, Anthropic Responsible Scaling Policy, Google DeepMind Frontier Safety Framework, Meta Llama Acceptable Use Policy. Columns: Risk categories covered (cyber, CBRN, autonomy, persuasion, deception), Capability levels defined (Low/Med/High/Critical for OpenAI; ASL-1 through ASL-5 for Anthropic; CCL-1 through CCL-4 for Google; binary 'restricted/not' for Meta), Required mitigations per level (deployment safeguards, training-time interventions, governance review), External red-team disclosure (full vs aggregate vs minimal), Pre-deployment audit (UK AISI / US AISI partnership for first three; not applicable for Meta). Bottom note: 'All four publish system cards at every major release; format and depth differ substantially.'

**Figure 54.8.1**: Comparison of frontier-lab disclosure frameworks. OpenAI's Preparedness Framework, Anthropic's RSP, and Google DeepMind's FSF are the dominant three. They differ in risk category enumeration, capability-level granularity, and mitigation prescription. As of 2026, all three reference each other and the UK/US AISI evaluations as standard practice.

OpenAI Preparedness Framework: risk-by-level grid with release-gating thresholds.

**Figure 54.8.2**: OpenAI's Preparedness Framework as it appears in the o1 system card (September 2024) and later: four risk axes (cybersecurity, CBRN, persuasion, model autonomy), each scored Low / Medium / High / Critical. o1 was placed at Medium on CBRN. Under the framework, only a model whose post-mitigation score is Medium or below on every axis can be deployed, and only a model at High or below can be developed further; the system card is where that classification is publicly affirmed. The 16-cell grid is the structural skeleton that system cards from OpenAI, Anthropic (mapped via ASL), and DeepMind (mapped via CCL) all converge on; Figure 54.8.1 catalogs the differences.

54.8.5 What System Cards Leave Out (and Why)

Frontier-lab system cards are thorough but never complete. Three categories get omitted nearly every time:

Key Insight

Aha Moment: The GPT-4 System Card's Most Famous Omission

The 60-page GPT-4 system card (March 2023) worked through roughly a dozen risk areas, with qualitative and quantitative evaluations and a red-team section on weapons proliferation. It also contained exactly zero pages on training-data composition. By contrast, the model card and accompanying documentation for BLOOM (2022, BigScience) described the training corpus in detail, down to language proportions. The asymmetry is the entire field: the more capable the model, the less you can publish about how it was built. OpenAI's omission was rational (Common Crawl filtering rules are competitive IP, and the New York Times v. OpenAI lawsuit had not yet been filed), but it makes the difference between a "system card" and a "datasheet" a function of risk appetite, not vocabulary. When a vendor's system card is silent on training data, the silence is the data point.

Training data composition. No major lab publishes a full inventory. Reasons cited: trade secrets, legal exposure on copyright-disputed corpora, and reluctance to leak filtering methodology. Datasheets (Section 54.7) for the public components and high-level prose for the proprietary components are the partial substitute.
Specific red-team exploits. Aggregate categories and severity buckets are published; the exact prompts that work are not. Reason: publishing the exploits would enable defenders and attackers at the same time, and the lab is betting that attackers would gain more from the exact prompts than defenders would.
Capability-evaluation methodology details. The benchmark names are public; the specific prompts and grading rubrics often aren't. UK AISI and US AISI evaluations are an exception: their reports are increasingly published with more methodology detail than the lab-internal evaluations.

Key Insight

System cards have created a "race to the top" in disclosure. When OpenAI published the GPT-4 system card with detailed risk-category breakdowns, Anthropic and Google had to match or exceed the disclosure depth or appear less responsible. This dynamic is the closest thing to self-regulation that has emerged in AI safety, and it operates on social and reputational incentives rather than legal ones. EU AI Act Article 56 (codes of practice for general-purpose AI) and the UK AI Safety Institute's evaluation agreements are now codifying minimum disclosure expectations, which strengthens the dynamic.

54.8.6 Procurement and Regulatory Use of System Cards

Enterprise procurement and regulators consume system cards differently than model cards. Three patterns are common:

Mapping system cards to risk frameworks. A procurement team takes the system card's risk-category breakdown and maps it to the organization's own risk register (NIST AI RMF, ISO 42001). The mapping is rarely one-to-one; the procurement team augments the vendor's evaluation with the organization's own use-case-specific evaluations.
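
The mapping step can be sketched as a lookup plus two gap lists. This is a minimal illustration under stated assumptions: the register IDs (`RR-...`), the `CATEGORY_TO_REGISTER` table, and the `map_card_to_register` function are hypothetical and are not taken from NIST AI RMF or ISO 42001.

```python
# Hypothetical mapping from vendor risk categories to internal register entries.
CATEGORY_TO_REGISTER = {
    "Privacy": ["RR-012"],
    "Cybersecurity": ["RR-003", "RR-027"],
    "Hallucinations": ["RR-041"],
}


def map_card_to_register(card_categories: list[str],
                         register_ids: list[str],
                         mapping: dict[str, list[str]]) -> tuple[list[str], list[str]]:
    """Return (vendor categories with no register entry,
               register entries no vendor category covers)."""
    covered: set[str] = set()
    unmapped = []
    for category in card_categories:
        targets = mapping.get(category, [])
        if not targets:
            unmapped.append(category)
        covered.update(targets)
    uncovered = sorted(set(register_ids) - covered)
    return unmapped, uncovered


card_categories = ["Privacy", "Cybersecurity", "Hallucinations", "CBRN Uplift"]
register_ids = ["RR-003", "RR-012", "RR-027", "RR-041", "RR-050"]

unmapped, uncovered = map_card_to_register(
    card_categories, register_ids, CATEGORY_TO_REGISTER)
print(unmapped)   # ['CBRN Uplift']
print(uncovered)  # ['RR-050']
```

The second list is the interesting one for procurement: register entries that no vendor category covers are where the organization needs its own use-case-specific evaluations.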

Triggered re-procurement. Many large enterprises tie procurement renewal to system-card updates: when a vendor releases a new system card showing a material capability change or a new safety incident, the contract terms are re-reviewed. This works only if vendors publish system cards on a predictable cadence, which most frontier labs now do.
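
A re-review trigger of this kind amounts to a diff between two versions of a card over the fields the contract cares about. The sketch below assumes cards have been reduced to plain dictionaries; the field names in `WATCHED` and the example values are hypothetical.

```python
def material_changes(old_card: dict, new_card: dict,
                     watched_fields: tuple[str, ...]) -> dict:
    """Return {field: (before, after)} for watched fields that differ."""
    changes = {}
    for name in watched_fields:
        before, after = old_card.get(name), new_card.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


# Fields a contract might name as review triggers (illustrative only).
WATCHED = ("model_version", "risk_ratings", "reported_incidents")

old_card = {"model_version": "example-v1",
            "risk_ratings": {"cbrn": "medium"},
            "reported_incidents": 0}
new_card = {"model_version": "example-v2",
            "risk_ratings": {"cbrn": "medium"},
            "reported_incidents": 1}

changes = material_changes(old_card, new_card, WATCHED)
needs_review = bool(changes)
print(sorted(changes))  # ['model_version', 'reported_incidents']
print(needs_review)     # True
```

What counts as "material" is a contractual decision; the code only makes the comparison mechanical once that list of fields is agreed.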

Regulatory disclosure. EU AI Act Article 53 (transparency obligations for general-purpose AI model providers) and Article 55 (additional obligations for GPAI with systemic risk) make much of the content of a system card effectively mandatory for the EU market. The published system cards from OpenAI, Anthropic, and Google already meet most Article 53 requirements; full Article 55 compliance is being phased in through 2026.

Warning: System Cards Document a Moment, Not a Steady State

A system card describes the system at the time of publication. The model can be retrained, the system prompt can be updated, the content filters can be re-tuned, and any of these changes can invalidate parts of the card without anyone updating the document. Consumers should treat the card as version-stamped (it is) and consider whether material changes since publication have occurred. The post-deployment monitoring section, when populated, is what addresses this; in practice it is the section most often left vague.
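
Treating the card as version-stamped can be turned into a simple staleness check. The sketch below is illustrative: `staleness_reasons`, the 90-day default, and the version strings are assumptions, and a real check would also need the vendor's changelog for prompt and filter updates that leave the model version unchanged.

```python
from datetime import date


def staleness_reasons(card_version: str, card_published: date,
                      production_version: str, today: date,
                      max_age_days: int = 90) -> list[str]:
    """List reasons the card may no longer describe the deployed system."""
    reasons = []
    if card_version != production_version:
        reasons.append("production model version differs from the card")
    age = (today - card_published).days
    if age > max_age_days:
        reasons.append(f"card is {age} days old (limit {max_age_days})")
    return reasons


reasons = staleness_reasons(
    card_version="example-model-2026-01",        # placeholder version strings
    card_published=date(2026, 1, 10),
    production_version="example-model-2026-04",
    today=date(2026, 5, 10),
)
for reason in reasons:
    print(reason)
```

An empty list does not prove the card is current; it only means neither of these two cheap signals fired.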

Real-World Scenario: A Hospital Procurement

A hospital is evaluating Claude 3.7 Sonnet as a clinical-documentation assistant. The procurement team reads the system card and identifies: (1) ASL-2 classification with appropriate deployment safeguards (acceptable); (2) Healthcare-specific evaluation numbers are in the broad safety-eval suite but not in a domain-specific slice (request supplementary attestation); (3) The red-team summary mentions probes of refusal-evasion in medical-advice contexts with the disposition "mitigated" but not the specific exploits (acceptable; the hospital does its own targeted red-team during pilot); (4) Post-market monitoring section commits to quarterly updates and includes a contact for incident reporting (acceptable). Procurement is approved with the supplementary attestation. Six months in, an Anthropic system-card update flags a new capability that affects the use case; the procurement contract triggers a re-review.

Key Insight

System cards document deployed AI systems, not just trained models. The four major frontier labs (OpenAI, Anthropic, Google DeepMind, Meta) all publish system cards with each release, anchored to their internal safety frameworks (Preparedness, RSP, FSF, AUP). They cover risk categories, mitigations, capability evaluations, red-team summaries, and monitoring commitments. What system cards omit (training data composition, specific exploits, evaluation methodology details) is as informative as what they include. EU AI Act Articles 53 and 55 are codifying minimum disclosure expectations through 2026. Procurement and regulatory pipelines consume system cards as the primary artifact for "is this system safe for this use case?" decisions.

Self-Check

Q1: A vendor publishes only a model card and no system card. What can they not tell you, that a system card would, and why does that matter for procurement?

Show Answer

A model card documents the trained weights: training data, evaluations, biases, intended use. A system card documents the deployed system: which specific model version is in production, what safety mitigations wrap it (input/output classifiers, retrieval components, refusal training), what monitoring is live, and what incident-response commitments the vendor has made. For procurement, the gap matters because the customer is buying a deployed service, not a set of weights. Two products can use the same underlying model and differ wildly in safety profile because of the mitigations layer; without a system card, you cannot tell whether you are buying the well-mitigated version or the bare model. EU AI Act Article 53/55 disclosures expect both artifacts; a vendor with only a model card is one step short of regulatory readiness.

Q2: Why do system cards typically include aggregate red-team findings but not the specific exploits? What is the implicit threat model?

Show Answer

The implicit threat model is that publishing specific exploits hands them to attackers; the aggregate "red team found N classes of jailbreaks with M mitigated" gives regulators and customers enough information to assess the system without making the next attacker's job easier. The trade-off is contested: some safety researchers argue that detailed exploit disclosures accelerate community-wide defenses (the responsible-disclosure pattern from infosec), while others argue that LLM exploits are easier to copy than to defend against, so the asymmetry favors withholding. Frontier labs in 2026 have settled near "aggregate in the public card, specific exploits to regulators under NDA," which is a defensible middle ground but not universally satisfying.

Q3: Compare the OpenAI Preparedness Framework and the Anthropic RSP at the level of "what triggers additional safeguards?". Which is more granular?

Show Answer

The OpenAI Preparedness Framework defines four risk categories (cybersecurity, CBRN, persuasion, model autonomy) and four risk levels (low, medium, high, critical) per category. The Anthropic RSP defines AI Safety Levels (ASL-1 through ASL-4) as system-wide thresholds, each binding the whole model to a corresponding safety profile and operational commitments. The Preparedness Framework is more granular along the capability axis: a model can be "high" on cybersecurity but "medium" on persuasion, triggering different mitigations for each axis. The RSP is more granular along the deployment-commitment axis: each ASL specifies concrete operational requirements (interpretability research investment, weight security, etc.) that the framework demands as the model crosses the threshold. The two frameworks are partial duals; some safety-oriented procurement reviews require alignment with both.

Q4: You're reading a system card from January 2026; today is May 2026. What questions should you ask before relying on the card's content?

Show Answer

Three questions. First, has the model been updated since January? Frontier labs can update deployed models without issuing a new card, so the card may describe a version no longer in production. Ask for the version-pinned API endpoint and re-evaluate against the card's claims. Second, have new capabilities been added (tool use, vision, longer context) that the card did not evaluate? A January card cannot describe an April-shipped tool-use capability. Third, have any incidents been reported in the intervening four months that the card does not mention? Vendor incident reports and AI safety incident databases (the AI Incident Database, AIAAIC) are the right places to check. If any of the three answers is "yes," request an updated system card before final procurement approval.

What's Next

Continue to Section 54.9: Audit Trails and Logging for Compliance.

Section 54.9 zooms into the operational logging and audit-trail mechanics that turn system-card commitments into verifiable claims. What gets logged, how is it retained, who can read it, and what is the cross-link to Part VIII observability infrastructure?

Further Reading

OpenAI (2024). OpenAI o1 System Card. https://openai.com/index/openai-o1-system-card/.

OpenAI (2024). GPT-4o System Card. https://openai.com/index/gpt-4o-system-card/.

Anthropic (2025). Claude 3.7 Sonnet System Card. https://www.anthropic.com/news/claude-3-7-sonnet.

Anthropic (2024). Responsible Scaling Policy v2.0. https://www.anthropic.com/rsp.

Google DeepMind (2025). Frontier Safety Framework v2. https://deepmind.google/discover/blog/updating-the-frontier-safety-framework/.

UK AI Safety Institute (2025). Pre-Deployment Evaluations of Frontier AI Models. AISI Research Report.

European Parliament and Council (2024). Regulation (EU) 2024/1689 (AI Act), Articles 53 and 55: Obligations for Providers of General-Purpose AI Models.