Section 54.6: Model Cards: Anatomy, Examples, Use in Procurement

"A model card is the model's passport. Without it, an LLM crosses every border, but no procurement officer signs off."
Compass, Procurement-Liaison AI Agent

Big Picture

A model card is the structured documentation that travels with a trained model: what it does, what it was trained on, what it was evaluated against, what failure modes are known, and what the intended use cases are. Mitchell et al.'s 2019 "Model Cards for Model Reporting" introduced the format; since then it has become a regulatory artifact (EU AI Act Annex IV, NIST AI RMF), a procurement gate (federal acquisitions, large-enterprise vendor onboarding), and a community-norm requirement (Hugging Face won't let you publish a "Featured" model without one). This section walks through the canonical anatomy, two real-world examples (Llama-3.1 and Claude 3.5 / 3.7 Sonnet), and the procurement workflow that consumes model cards as one of several inputs to a "can we use this model?" decision. For LLM and agent product teams, the model card is the contract that decides whether the chosen LLM is fit for the deployment: an evaluation section that omits jailbreak rates or hallucination metrics is a procurement signal that the model has not been shown to be safe for high-stakes agentic use.

Prerequisites

This section assumes the LLM-customization lifecycle from Section 13.1, the evaluation-set methodology from Section 42.1, and the responsible-AI framing from Section 50.1.

54.6.1 Why Model Cards Exist

Fun Fact

Model cards were proposed by Margaret Mitchell and colleagues in 2018, gained traction by 2020, became industry standard by 2023, and are now legally required for several EU and US procurement frameworks. The arc from "nice idea in a fairness paper" to "compliance checkbox in government contracts" took about five years, an unusually fast trip for any documentation idea.

Before model cards, "training data" was a sentence in a paper if anyone bothered to mention it at all. "Intended use" was inferred from a model's name. "Known failure modes" were learned by users running into them. Mitchell et al. argued that this lack of structured documentation was a major cause of model misuse: a model trained for English-language sentiment analysis would be deployed against Spanish customer reviews; a model trained for general-purpose toxicity classification would be deployed against medical-jargon-heavy text. The remedy was a one-to-two-page structured document that travels with the model.

The original Model Card schema included eleven sections. The 2024 NIST update added two more (intended-use-vs-out-of-scope-use, environmental impact); the 2026 EU AI Act Annex IV requirement adds another three for GPAI models (safety mitigations, conformity assessment, post-market monitoring). The schema is now both a research artifact and a regulatory checklist.

54.6.2 The Canonical Anatomy

A 2026-current model card carries eleven sections in roughly this order. The headers are stable across the major templates (Hugging Face, Google Model Card Toolkit, NIST AI RMF Annex):

Key Insight

Worked Example: A Real Model Card's Section Lengths Tell a Story

Compare two model cards side by side. Meta's Llama-3-70B-Instruct card on Hugging Face (April 2024) runs about 2,400 words: 380 on training data, 510 on benchmarks, 90 on intended use, 0 on out-of-scope uses. Anthropic's Claude 3.5 Sonnet model card (June 2024) runs about 7,100 words: 50 on training data (legally constrained), 1,200 on safety evaluations, 800 on intended use, 600 explicitly on out-of-scope and prohibited uses. Same template (the Mitchell 2019 eleven-section schema), opposite emphasis. Llama tells you what the model can do; Claude tells you what you should not use it for. The section that grows when a vendor faces real legal exposure is "out-of-scope uses"; the section that grows when a vendor wants to advertise capability is "benchmarks." Reading a model card is partly a literary exercise: the relative section lengths are themselves a disclosure about which risks the vendor's lawyers worry about most.

1. Model Details: name, version, type (decoder-only, encoder-decoder, classifier), parameter count, license, citation. The bookkeeping section.
2. Intended Use: primary intended users, primary intended uses, out-of-scope uses. The single most consequential section for procurement; this is what consumers reference to decide whether the model is fit-for-purpose.
3. Training Data: source corpora, time span, size, languages, known biases or gaps. Datasheets (Section 57.2) drill down into this.
4. Training Procedure: optimizer, hyperparameters, hardware, compute cost. The bare minimum for reproducibility.
5. Factors: relevant subpopulations and conditions (demographics, geographic distribution, instrument settings). These are the slices used for fairness analysis.
6. Metrics: which metrics, why those metrics, on which datasets, with what variation across factors. The substance of the safety story.
7. Evaluation Data: details of the eval sets, including known limitations. If the eval data has selection bias, the metrics inherit it.
8. Quantitative Analyses: tables, plots, confidence intervals. Performance broken out by the factors enumerated in section 5.
9. Ethical Considerations: risks, sensitive uses, mitigations.
10. Caveats and Recommendations: prose for downstream users; the "if I were you, I would think about" section.
11. Citation: how to cite the model in academic and engineering contexts.

EU AI Act Annex IV adds (for general-purpose AI models with systemic risk):

Safety Mitigations: red-teaming results, alignment evaluations, capability evaluations on hazardous-task benchmarks.
Conformity Assessment: how the model meets each of the Annex IV requirements; auditor sign-off.
Post-Market Monitoring: incident reporting, ongoing evaluation cadence, contact for serious-incident notifications.

Figure 54.6.1: Model card schema evolution from Mitchell 2019 to EU AI Act Annex IV. The model-card schema has grown from the 11-section Mitchell et al. (2019) template to the 13-section NIST AI 600-1 Generative AI Profile (2024) to the 16-section EU AI Act Annex IV requirement (in force for GPAI with systemic risk from August 2026). The 2024 additions (intended-vs-out-of-scope, environmental impact) translate research norms into US compliance language; the 2026 additions (safety mitigations, conformity assessment, post-market monitoring) turn the card into a procurement gate. The Llama-3.1 and Claude 3.7 Sonnet cards already populate most of the 16; cards that stop at the original 11 fail the August 2026 EU floor on first reading.
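The cumulative 11, 13, and 16-section tiers can be sketched as a simple set-membership check. This is a minimal illustration, not an official validator: the snake_case section identifiers, the tier names, and the `schema_tier` helper are all hypothetical renderings of the headers listed in this section.

```python
# Section identifiers are hypothetical snake_case renderings of the
# headers used in this section; no official template defines these keys.
MITCHELL_2019 = [
    "model_details", "intended_use", "training_data", "training_procedure",
    "factors", "metrics", "evaluation_data", "quantitative_analyses",
    "ethical_considerations", "caveats_and_recommendations", "citation",
]
NIST_2024_ADDITIONS = ["intended_vs_out_of_scope_use", "environmental_impact"]
EU_ANNEX_IV_ADDITIONS = [
    "safety_mitigations", "conformity_assessment", "post_market_monitoring",
]

# Each tier is cumulative: it requires everything in the tiers before it.
TIERS = [
    ("mitchell_2019", MITCHELL_2019),
    ("nist_2024", MITCHELL_2019 + NIST_2024_ADDITIONS),
    ("eu_annex_iv", MITCHELL_2019 + NIST_2024_ADDITIONS + EU_ANNEX_IV_ADDITIONS),
]


def schema_tier(populated):
    """Return (highest tier fully satisfied or None, sections missing for the next tier)."""
    populated = set(populated)
    highest = None
    for name, required in TIERS:
        missing = [s for s in required if s not in populated]
        if missing:
            return highest, missing
        highest = name
    return highest, []


# A card that stops at the original eleven sections.
tier, missing = schema_tier(MITCHELL_2019)
print(tier, missing)
```

For a card that populates only the original eleven sections, the helper reports the first tier as satisfied and names the two 2024 additions as the next gap, which is the kind of summary a reviewer would want before reading the card in full.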

54.6.3 A Real Example: Llama-3.1 Model Card

Meta's Llama-3.1 model card (released July 2024 with updates through 2025) is one of the most-imitated templates in production. Notable structural choices:

Three model sizes documented together (8B, 70B, 405B), with per-size sub-sections where the relevant metrics differ but a shared training-data and methodology section.
Use-policy is normative. The "Intended Use" section is paired with a separate "Llama-3 Acceptable Use Policy" linked from the card; violation of the AUP is treated as a license violation.
Benchmark suite is wide. MMLU, HumanEval, GSM8K, MATH, BIG-Bench Hard, plus the responsibility benchmarks (TruthfulQA, ToxiGen, AdvBench, JailbreakBench). Per-benchmark numbers are reported with whatever sampling variability data Meta has.
Carbon impact reported. Estimated tCO₂e for each training run, in line with the NIST AI RMF recommendation. This is one of the first major commercial models to report carbon at the training-run level.
Known risks enumerated. Not just "harmful content" as an abstract category, but explicit pointers to the categories the model still fails on after RLHF (multilingual harms, low-resource-language jailbreaks).

54.6.4 A Real Example: Claude 3.5 / 3.7 Sonnet

Anthropic's Claude model cards (3.5 Sonnet in June 2024, 3.7 Sonnet in early 2025) take a different stance: shorter on benchmark numbers, longer on the safety story. The relevant subsections:

Capabilities are described in product terms, not benchmark terms. "Strong at agentic coding, complex reasoning, vision tasks," rather than a wall of MMLU numbers (which appear separately).
Safety evaluations are paired with the Responsible Scaling Policy (RSP) framework. The card lists which RSP AI Safety Level the model is classified at and which automated safety evals were run.
Constitutional AI methodology is documented at a high level. The card points to the published constitution but gives no raw training data list (Anthropic does not publish that), and it explicitly acknowledges the trade-off.
Out-of-scope uses are normative and specific. Examples: "do not deploy without human oversight in scenarios involving life-or-death decisions"; "do not use to produce content that could plausibly be mistaken for a real person without disclosure."

Key Insight

A model card is a contract, not a marketing document. The "intended use" section binds the model provider in two directions: it tells consumers what is officially supported (and therefore which bugs the provider should fix), and it tells regulators what the provider claims about the system. A model card that says "general-purpose conversational AI" is a much broader contract than "customer-support chatbot for English-speaking retail customers." When in doubt, write the narrower intended use; you can always expand it later. The opposite move, narrowing the intended use after a misuse incident, is far harder.

Key Insight

The Math Behind "Per-Cohort Disclosure" in Model Cards

The substantive content of a model card's Quantitative Analyses section is the per-cohort performance table. Let the eval set be $\mathcal{D}_{\mathrm{eval}} = \{(x_i, y_i, c_i)\}_{i=1}^n$, where $c_i \in \mathcal{C}$ is the cohort label (language, gender, age band, geographic region). The Mitchell-et-al. format requires reporting, for each metric $\mu$ (accuracy, F1, calibration error, toxicity rate):

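$$\hat{\mu}_c \;=\; \frac{1}{n_c}\sum_{i:\,c_i=c}\mu(x_i, y_i), \quad \mathrm{CI}_{95\%}(\hat\mu_c) \;=\; \hat\mu_c \;\pm\; 1.96\sqrt{\tfrac{\hat\mu_c(1-\hat\mu_c)}{n_c}}, \quad \text{for each } c \in \mathcal{C}.$$

Here $n_c$ is the number of eval examples with $c_i = c$. The interval shown is the normal approximation for a proportion, so it applies when $\mu(x_i, y_i) \in \{0, 1\}$ (accuracy, toxicity rate); for metrics that are not per-example averages, such as F1 or calibration error, a bootstrap interval is the usual substitute.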

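The disclosure-quality quantities that NIST AI RMF and EU AI Act Annex IV reviewers look at are the worst-cohort value and the worst-pair gap (written here for a higher-is-better metric; for an error rate the worst cohort is the maximum):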

$$\mu_{\mathrm{worst}} = \min_{c \in \mathcal{C}} \hat{\mu}_c,\qquad \Delta_{\mathrm{cohort}} = \max_{c, c' \in \mathcal{C}} \bigl|\hat{\mu}_c - \hat{\mu}_{c'}\bigr|.$$

A reviewer-grade card publishes (i) $n_c$ per cohort (so reviewers can spot under-powered slices, typically flagging $n_c < 100$), (ii) $\mathrm{CI}_{95\%}$ per cohort (so a "95% accuracy" claim on 30 samples can be distinguished from one on 30,000), and (iii) $\Delta_{\mathrm{cohort}}$ with a bootstrap CI of its own. Cards that report only the macro-average $\tfrac{1}{|\mathcal{C}|}\sum_c \hat{\mu}_c$ hide the disparity that procurement specifically asks about. See Mitchell et al., 2019 and the NIST AI RMF Generative AI Profile (NIST AI 600-1, 2024) for the formal disclosure spec.
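The per-cohort table, the worst-cohort value, and the gap with its bootstrap interval can be computed in a few lines. The following is a minimal sketch under stated assumptions: outcomes are per-example 0/1 values (so the normal-approximation interval applies), the metric is higher-is-better, and the function names, the toy `records` data, and the within-cohort percentile bootstrap are illustrative choices rather than anything a standard prescribes.

```python
import math
import random
from collections import defaultdict


def per_cohort_report(records, z=1.96):
    """records: iterable of (cohort, outcome) pairs with outcome in {0, 1}.

    Returns {cohort: (n_c, mu_hat_c, ci_low, ci_high)} using the
    normal-approximation interval for a proportion.
    """
    by_cohort = defaultdict(list)
    for cohort, outcome in records:
        by_cohort[cohort].append(outcome)
    report = {}
    for cohort, outcomes in by_cohort.items():
        n = len(outcomes)
        mu = sum(outcomes) / n
        half = z * math.sqrt(mu * (1 - mu) / n)
        report[cohort] = (n, mu, mu - half, mu + half)
    return report


def worst_and_gap(report):
    """Return (mu_worst, delta_cohort) for a higher-is-better metric."""
    means = [mu for _, mu, _, _ in report.values()]
    # The largest pairwise absolute difference equals max minus min.
    return min(means), max(means) - min(means)


def bootstrap_gap_ci(records, n_boot=2000, seed=0):
    """Percentile bootstrap 95% interval for delta_cohort, resampling within each cohort."""
    rng = random.Random(seed)
    by_cohort = defaultdict(list)
    for cohort, outcome in records:
        by_cohort[cohort].append(outcome)
    gaps = []
    for _ in range(n_boot):
        means = []
        for outcomes in by_cohort.values():
            sample = rng.choices(outcomes, k=len(outcomes))
            means.append(sum(sample) / len(sample))
        gaps.append(max(means) - min(means))
    gaps.sort()
    return gaps[int(0.025 * n_boot)], gaps[int(0.975 * n_boot) - 1]


# Toy data: "en" has n=100 with accuracy 0.90, "hi" has n=30 with accuracy 0.80.
records = [("en", 1)] * 90 + [("en", 0)] * 10 + [("hi", 1)] * 24 + [("hi", 0)] * 6
report = per_cohort_report(records)
mu_worst, delta = worst_and_gap(report)
gap_low, gap_high = bootstrap_gap_ci(records)
for cohort, (n, mu, lo, hi) in report.items():
    flag = " (under-powered)" if n < 100 else ""
    print(f"{cohort}: n={n}, mu={mu:.2f}, CI=[{lo:.3f}, {hi:.3f}]{flag}")
print(f"worst={mu_worst:.2f}, gap={delta:.2f}, gap CI=[{gap_low:.3f}, {gap_high:.3f}]")
```

With only 30 examples the "hi" interval is much wider than the "en" interval, which is exactly why a card needs to publish $n_c$ alongside each point estimate.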

54.6.5 The Procurement Workflow: How Model Cards Get Consumed

Large-organization procurement (federal agencies, financial institutions, healthcare systems, regulated industries) increasingly runs a five-stage check that pivots on the model card. The pattern stabilized around 2023-2024 as both the EU AI Act Article 11 (technical documentation requirement) and the NIST AI RMF (Govern-Map-Measure-Manage) gained regulatory weight; in 2026 a vendor that ships a frontier model without a procurement-grade card will typically lose Fortune 500 RFPs before the demo.

1. Fit-for-purpose review. Does the "intended use" section match the proposed application? If not, procurement either escalates to the provider for a written attestation, or rejects the model and seeks a different one. The Air Canada chatbot case (February 2024 BC Civil Resolution Tribunal) is the cautionary tale: the bot's "intended use" was customer information, not refund-policy commitments, but no one in procurement caught the gap before deployment.
2. Bias and fairness check. The "factors" and "quantitative analyses" sections are inspected for performance gaps across protected characteristics. A gap of more than a procurement-defined threshold (typically 5 percentage points on key metrics) triggers a mitigation requirement. NYC Local Law 144 (enforced from July 2023) requires bias audits of automated hiring tools that report impact ratios by sex and race/ethnicity, which reviewers commonly read against the four-fifths rule; a model card that does not break out impact ratios by race and gender fails LL 144 audits on first reading.
3. Data-source review. The training-data section is cross-referenced against any data-residency, copyright-sensitivity, or licensing constraints. Models trained on data that the procuring organization's customers have not consented to may be barred. The 2023-2024 wave of lawsuits (New York Times v. OpenAI, Andersen v. Stability AI, Getty v. Stability AI) has made the "what was the training corpus" question a contractual must-answer in enterprise deals, even when the model card itself is silent.
4. Security review. The card's safety evaluations are cross-referenced against a standard attack taxonomy: which red-team probes were run, with what results? The OWASP Top 10 for LLM Applications (Chapter 49) is now the standard reference. Version 1.1 (October 2023) and the 2024 v2.0 update define the 10 reference attack categories (in v1.1, LLM01 prompt injection, LLM02 insecure output handling, etc.); a model card that does not report results for at least the top five of these will be challenged in any security-mature procurement.
5. Post-market monitoring. Are there ongoing reporting commitments? Is there a defined process for the procurement organization to report an incident back to the provider? EU AI Act Article 72 imposes a post-market monitoring obligation on high-risk system providers starting August 2026, and Article 73 adds serious-incident reporting with a general 15-day deadline; procurement contracts now routinely incorporate these requirements by reference, even outside the EU.
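The bias and fairness stage is the one that reduces most directly to arithmetic. The sketch below assumes the 5-percentage-point gap threshold mentioned above and a four-fifths (0.8) impact-ratio cutoff; the function names, the group labels, and the input shapes are hypothetical, and a real audit would follow the methodology its governing regulation prescribes.

```python
def impact_ratios(selection_rates):
    """Impact ratio per group: the group's selection rate divided by the highest group's rate."""
    top = max(selection_rates.values())
    if top == 0:
        raise ValueError("at least one group must have a non-zero selection rate")
    return {group: rate / top for group, rate in selection_rates.items()}


def fairness_gate(metric_by_cohort, selection_rates, max_gap_pp=5.0, min_ratio=0.8):
    """Return a list of findings; an empty list means the stage passes.

    max_gap_pp and min_ratio are procurement-defined thresholds (assumed here).
    """
    findings = []
    gap_pp = 100 * (max(metric_by_cohort.values()) - min(metric_by_cohort.values()))
    if gap_pp > max_gap_pp:
        findings.append(f"metric gap {gap_pp:.1f} pp exceeds {max_gap_pp:.1f} pp")
    for group, ratio in impact_ratios(selection_rates).items():
        if ratio < min_ratio:
            findings.append(f"impact ratio {ratio:.2f} for {group} below {min_ratio:.2f}")
    return findings


# Hypothetical numbers taken from a card's Quantitative Analyses section.
findings = fairness_gate(
    metric_by_cohort={"group_a": 0.91, "group_b": 0.84},
    selection_rates={"group_a": 0.50, "group_b": 0.35},
)
for finding in findings:
    print(finding)
```

Returning findings rather than a boolean keeps the output usable as the text of a mitigation request to the vendor.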

# Illustrative model card metadata in the style of the YAML front matter
# of a Hugging Face model repository (example values, not a verbatim excerpt):
license: llama3.1
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
tags:
- facebook
- meta
- llama
- llama-3
intended_use:
  primary_intended_uses:
    - Commercial and research use in supported languages
    - Assistant-like chat applications
    - Tuning and inference for downstream tasks
  out_of_scope_uses:
    - Any use that violates applicable laws or regulations
    - Any use prohibited by the Llama 3 Acceptable Use Policy
training_data:
  sources:
    - Publicly available online text
    - Curated instruction-following datasets
  cutoff: 2023-12-01
  approximate_size_tokens: 1.5e13
evaluation:
  benchmarks:
    - {name: MMLU, score: 0.732, std: 0.004}
    - {name: HumanEval, score: 0.808, std: 0.011}
    - {name: GSM8K, score: 0.842, std: 0.007}
  responsibility:
    - {name: TruthfulQA-MC2, score: 0.526}
    - {name: ToxiGen, score: 0.044, lower_is_better: true}
carbon_footprint:
  training_tCO2e: 1900
  inference_per_million_tokens_kg: 0.0009

Code Fragment 54.6.1a: Machine-readable model card metadata in the YAML front matter of a Hugging Face repository. This format is consumed by HF's UI, by the EU AI Act's model registry (when fully online), and by enterprise procurement tools that crawl model hubs. The YAML structure is the same across providers; what varies is which fields are populated.
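Because the cards differ mainly in which fields are populated, a procurement crawler's first pass is often a coverage check. The sketch below assumes the metadata has already been parsed into a nested Python dict by a YAML parser; the `REQUIRED` list is a hypothetical procurement policy and the two sample cards are made up for illustration.

```python
def populated_paths(node, prefix=""):
    """Yield the dotted path of every non-empty leaf in nested dict metadata."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from populated_paths(value, f"{prefix}{key}.")
    elif node not in (None, "", [], {}):
        # Non-empty lists and scalars count as populated leaves.
        yield prefix.rstrip(".")


def missing_fields(card, required):
    """Return the required dotted paths that the card leaves empty or omits."""
    populated = set(populated_paths(card))
    return [path for path in required if path not in populated]


# Hypothetical procurement policy: fields a reviewer insists on.
REQUIRED = [
    "license",
    "intended_use.primary_intended_uses",
    "intended_use.out_of_scope_uses",
    "training_data.sources",
    "evaluation.benchmarks",
    "carbon_footprint.training_tCO2e",
]

card_a = {
    "license": "llama3.1",
    "intended_use": {
        "primary_intended_uses": ["Assistant-like chat applications"],
        "out_of_scope_uses": ["Any use that violates applicable laws or regulations"],
    },
    "training_data": {"sources": ["Publicly available online text"]},
    "evaluation": {"benchmarks": [{"name": "MMLU", "score": 0.732, "std": 0.004}]},
    "carbon_footprint": {"training_tCO2e": 1900},
}
card_b = {
    "license": "apache-2.0",
    "intended_use": {"primary_intended_uses": ["chatbot applications"], "out_of_scope_uses": []},
    "evaluation": {"benchmarks": [{"name": "MMLU", "score": 0.70}]},
}

print(missing_fields(card_a, REQUIRED))
print(missing_fields(card_b, REQUIRED))
```

Treating an empty list the same as a missing key matters here: a card that declares `out_of_scope_uses` but leaves it empty should fail the check just as one that omits the field does.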

54.6.6 Failure Modes and Anti-Patterns

The 2024-2026 literature on model card adoption (Boyd 2024, Sloane et al. 2024) has documented several recurring failure modes:

The marketing-document card. Lists benchmark wins, omits failure modes. Common from smaller vendors competing on leaderboards. Useless for procurement.
The wall-of-numbers card. Twenty benchmark scores with no context. Hard to consume, and the volume of aggregate scores obscures the per-cohort breakdowns that bias and fairness analysis depends on.
The stale card. Card was accurate at v1.0; the model has shipped v1.4 with new training data and the card was never updated. Versioning rules should be enforced by the publishing pipeline.
The boilerplate card. Generic language copied from a template; "intended uses: chatbot applications" with no specificity. Procurement reviewers can spot these instantly.
The procurement-mismatch card. The card is technically accurate but written in research language that procurement reviewers can't evaluate. Pair with an executive summary aimed at non-technical reviewers.
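Several of these anti-patterns can be caught mechanically before a human reviewer opens the card. The following is a rough heuristic lint, assuming a card represented as a flat dict; every field name, the boilerplate phrase list, and the 20-benchmark cutoff are hypothetical choices for illustration, and a passing lint says nothing about whether the card's claims are true.

```python
# Phrases treated as boilerplate intended-use statements (hypothetical list).
BOILERPLATE_PHRASES = {"chatbot applications", "general-purpose ai assistant"}


def lint_card(card, released_version):
    """Return a list of anti-pattern labels triggered by the card (empty if none)."""
    warnings = []
    benchmarks = card.get("benchmarks", [])
    if benchmarks and not card.get("known_failure_modes"):
        warnings.append("marketing-document")
    if len(benchmarks) >= 20 and not card.get("benchmark_context"):
        warnings.append("wall-of-numbers")
    if card.get("model_version") != released_version:
        warnings.append("stale")
    if card.get("intended_use", "").strip().lower() in BOILERPLATE_PHRASES:
        warnings.append("boilerplate")
    if not card.get("executive_summary"):
        warnings.append("procurement-mismatch")
    return warnings


# A made-up card that trips every check.
weak_card = {
    "model_version": "1.0",
    "intended_use": "Chatbot applications",
    "benchmarks": [f"benchmark_{i}" for i in range(22)],
}
# A made-up card that passes every check.
solid_card = {
    "model_version": "1.4",
    "intended_use": "Customer-support chatbot for English-speaking retail customers",
    "benchmarks": ["MMLU", "TruthfulQA"],
    "known_failure_modes": ["low-resource-language jailbreaks"],
    "executive_summary": "One-page summary for non-technical reviewers.",
}

print(lint_card(weak_card, released_version="1.4"))
print(lint_card(solid_card, released_version="1.4"))
```

The stale-card check is the one most worth wiring into a publishing pipeline, since it needs only the card's declared version and the version being released.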

Warning: Cards Without Audits Are Vendor Self-Reports

A model card is whatever the vendor writes. Third-party audit is what turns a self-report into a verifiable claim. For high-stakes deployments (medical devices, credit decisions, employment screening), procurement should require an independent audit that re-runs the card's quantitative analyses on the procurement organization's own data. The audit pipeline is in Section 54.9; the cross-link to NIST's AI RMF audit framework is documented there.

Real-World Scenario: A Bank's LLM Procurement

A regional bank is evaluating an LLM for an internal customer-research assistant. The procurement checklist references the model card on five points: (1) Intended use matches "research and analysis support in English" (pass); (2) Training data cutoff is recent enough for the bank's purposes (pass); (3) Bias evaluation includes financial-services-relevant slices (fail; vendor agrees to add); (4) Safety evaluations cover the OWASP LLM Top 10 (pass with caveats; vendor's red-team report is shared under NDA); (5) Carbon footprint is reported (pass). The procurement is approved conditional on the vendor adding the financial-services slice to the next card revision; the bank's quarterly review revisits the card to verify ongoing accuracy.
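The bank's five-point checklist can be encoded as data so that the approval outcome follows from the recorded statuses. This is a sketch of one possible decision rule, not the bank's actual policy: the `Status` values, the checklist keys, and the rule that any agreed remediation yields a conditional approval are assumptions made for illustration.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    PASS_WITH_CAVEATS = "pass with caveats"
    FAIL_REMEDIATION_AGREED = "fail, vendor agreed to remediate"
    FAIL = "fail"


def decide(checklist):
    """Map per-item statuses to an overall outcome (assumed decision rule)."""
    statuses = set(checklist.values())
    if Status.FAIL in statuses:
        return "rejected"
    if Status.FAIL_REMEDIATION_AGREED in statuses:
        return "approved conditionally"
    return "approved"


# The five checklist points from the scenario above.
bank_checklist = {
    "intended_use_matches": Status.PASS,
    "training_cutoff_recent": Status.PASS,
    "financial_services_bias_slices": Status.FAIL_REMEDIATION_AGREED,
    "owasp_llm_top_10_coverage": Status.PASS_WITH_CAVEATS,
    "carbon_footprint_reported": Status.PASS,
}

print(decide(bank_checklist))
```

Keeping the per-item statuses rather than only the final outcome gives the quarterly review a record of which items to re-verify against the next card revision.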

Key Insight

A model card is the structured documentation that travels with a model: intended use, training data, evaluation results, failure modes, and ethical considerations. The format was introduced by Mitchell et al. in 2019, has been updated through NIST AI RMF and EU AI Act Annex IV, and is now a procurement requirement in regulated industries. The "intended use" section is the most consequential: it is the contract between vendor and consumer. A model card is necessary but not sufficient; third-party audit and ongoing post-market monitoring complete the transparency story.

Self-Check

Q1: A vendor's model card lists 22 benchmark scores but no out-of-scope uses. What does this tell you about the vendor's procurement readiness?

Answer:

The vendor is in marketing-mode, not procurement-mode. Out-of-scope uses are the contractual boundary between safe deployment and vendor liability; omitting them is a sign that the vendor either has not done the risk analysis or is keeping their options open at the customer's expense. In regulated industries (finance, healthcare, government), procurement teams will require an addendum that lists explicit out-of-scope uses before approval. The Mitchell et al. (2019) template puts intended-use and out-of-scope use in the same section for exactly this reason: the contract between vendor and customer is what each side has agreed the model will and will not be used for. A 22-benchmark card with no out-of-scope section signals "not ready for production procurement."

Q2: You're a procurement reviewer and the card's intended use is "general-purpose AI assistant." You want to deploy it for medical-question triage. What is your next step?

Answer:

"General-purpose AI assistant" is intentionally vague; medical-question triage is a high-stakes domain that the card neither lists as intended nor explicitly excludes. The procurement-correct next step is to request a domain-specific addendum from the vendor: vendor-conducted evaluation on a medical-question benchmark (MedQA, PubMedQA, or a custom internal set), a bias evaluation across patient demographics, and a vendor-signed statement that the model is suitable for the named clinical workflow. If the vendor declines, the answer is to use a model that has been explicitly validated for medical use (Med-PaLM 2, an FDA-cleared specialty model, or a commercial product with HIPAA BAA and clinical claims). Deploying a "general-purpose" model in a clinical workflow without that addendum exposes the institution to malpractice and FDA-enforcement risk.

Q3: EU AI Act Annex IV adds three sections beyond the Mitchell et al. template. Name them and explain why each was added.

Answer:

First, safety mitigations: red-teaming results, alignment evaluations, and capability evaluations on hazardous-task benchmarks, added so that regulators can see what was tested and what the model still fails on rather than taking a general safety claim on trust. Second, conformity assessment: a requirement-by-requirement account of how the model meets Annex IV, with auditor sign-off, which is the EU-specific compliance documentation. Third, post-market monitoring: incident reporting, an ongoing evaluation cadence, and a contact for serious-incident notifications, which extends the card's obligations past the release date. The three together turn the model card from a research-community artifact into a regulatory-grade compliance document, which is exactly what Annex IV was designed for.

Q4: Why is "model card without audit" weaker than "model card with audit"? Give a scenario where the difference matters.

Answer:

An unaudited model card is the vendor's self-report; the customer has no independent verification that the benchmark scores were measured on the stated test set or that the bias evaluations cover the populations the card claims. Third-party audit (an external evaluator with read access to the model) verifies the card's claims against actual model behavior. The difference matters concretely in lending: a vendor whose model card claims 95% fairness across protected-class slices may have measured on a biased internal test set; a procurement-required external audit on the customer's own data may reveal a 12-point disparity that triggers ECOA exposure. Audits cost the vendor extra effort, which is why they are a procurement lever, not a default.

What's Next

Continue to Section 54.7: Datasheets for Datasets.

The next section zooms into the training-data side: datasheets for datasets, the Gebru et al. format that complements model cards. Where a model card documents the model, a datasheet documents the data, and the gaps in datasheet coverage are the dominant cause of post-deployment surprises about what the model actually learned.

Further Reading

Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). Model Cards for Model Reporting. Conference on Fairness, Accountability, and Transparency (FAT* '19).

NIST (2024). AI Risk Management Framework Generative AI Profile (NIST AI 600-1). Appendix on Model Documentation.

European Parliament and Council (2024). Regulation (EU) 2024/1689 (AI Act), Annex IV: Technical Documentation Requirements.

Meta AI (2024). Llama 3.1 Model Card. Hugging Face, meta-llama/Meta-Llama-3.1-405B-Instruct.

Anthropic (2025). Claude 3.7 Sonnet Model Card and System Card. https://www.anthropic.com/news/claude-3-7-sonnet.

Boyd, K. (2024). Documenting Computer Vision Datasets: A Critical Analysis of Datasheets and Model Cards. FAccT 2024.

Sloane, M., Solano-Kamaiko, I., Yuan, J., et al. (2024). Introducing Contextual Transparency for Automated Decision Systems. Nature Machine Intelligence 6, 245-256.