The table below provides a side-by-side overview for rapid reference when choosing a model for a project.
| Model | Params (Total / Active) | Context | Open? | Vision | Reasoning |
|---|---|---|---|---|---|
| GPT-4o | ~200B (est.) | 128K | No | Yes | Good |
| o3 | Undisclosed | 200K | No | Yes | Excellent |
| o4-mini | Undisclosed | 200K | No | Yes | Excellent |
| Claude 4 Sonnet | Undisclosed | 200K | No | Yes | Very Good |
| Claude 4 Opus | Undisclosed | 200K | No | Yes | Excellent |
| Gemini 2.5 Pro | Undisclosed | 1M | No | Yes | Excellent |
| Gemini 2.0 Flash | Undisclosed | 1M | No | Yes | Good |
| Llama 3.1 405B | 405B (dense) | 128K | Yes | No | Good |
| Llama 4 Maverick | ~400B / 17B active | 1M | Yes | Yes | Good |
| Mixtral 8x22B | 141B / 39B active | 64K | Yes | No | Fair |
| DeepSeek-V3 | 671B / 37B active | 128K | Yes | No | Good |
| DeepSeek-R1 | 671B / 37B active | 128K | Yes | No | Excellent |
| Qwen 2.5 72B | 72B (dense) | 128K | Yes | No | Good |
| QwQ-32B | 32B (dense) | 128K | Yes | No | Very Good |
| Phi-4 | 14B (dense) | 16K | Yes | No | Very Good |
| Gemma 3 27B | 27B (dense) | 128K | Yes | Yes | Fair |
This appendix reflects the model landscape as of early 2026. New model releases occur frequently, and specifications, pricing, and capabilities shift with each release. Always verify details against the official model documentation and release announcements. Benchmark scores are intentionally omitted because they become outdated within weeks and can be misleading due to data contamination.
Start with your constraints: (1) Can you send data to a third-party API, or do you need self-hosting? (2) What is your latency budget? (3) What is your cost ceiling per request? (4) Do you need vision, long context, or strong reasoning? These four questions will narrow the field to 2-3 candidates. Then prototype with your actual data and measure what matters for your specific use case.
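As a rough illustration, the four screening questions can be encoded as a filter over the comparison table. The `Model` dataclass, the `MODELS` subset, and the `shortlist` helper below are hypothetical stand-ins for this appendix's table, not any official dataset or API:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    open_weights: bool   # can be self-hosted (question 1)
    context_k: int       # context window, in thousands of tokens
    vision: bool
    reasoning: str       # "Fair" | "Good" | "Very Good" | "Excellent"

# Hypothetical subset of the comparison table above.
MODELS = [
    Model("Gemini 2.5 Pro", False, 1000, True, "Excellent"),
    Model("Llama 4 Maverick", True, 1000, True, "Good"),
    Model("DeepSeek-R1", True, 128, False, "Excellent"),
    Model("Phi-4", True, 16, False, "Very Good"),
]

def shortlist(models, need_self_host, min_context_k=0,
              need_vision=False, min_reasoning="Fair"):
    """Apply the four screening questions to narrow the field."""
    rank = ["Fair", "Good", "Very Good", "Excellent"]
    return [
        m.name for m in models
        if (not need_self_host or m.open_weights)
        and m.context_k >= min_context_k
        and (not need_vision or m.vision)
        and rank.index(m.reasoning) >= rank.index(min_reasoning)
    ]

# Example: self-hosting required, strongest reasoning tier needed.
print(shortlist(MODELS, need_self_host=True, min_reasoning="Excellent"))
# → ['DeepSeek-R1']
```

The filter only produces the shortlist; the prototyping step with your own data is still where the final decision gets made.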
Documentation Frameworks: Model Cards, Datasheets, and Data Cards
The model cards in this appendix follow a tradition of structured documentation for ML artifacts. Three complementary frameworks have emerged as standards, each addressing a different artifact and audience. Understanding their differences helps teams choose the right documentation strategy for their projects.
Model Cards (Mitchell et al., 2019)
Model cards document the model itself: intended use cases, performance metrics disaggregated by demographic group, known limitations, and ethical considerations. Originally proposed at the FAT* conference (now FAccT) in 2019, model cards are now standard on Hugging Face, where every model repository includes a card rendered from a structured README.md. Model cards answer the question: "Should I use this model for my task, and what should I watch out for?"
Datasheets for Datasets (Gebru et al., 2021)
Datasheets document the training or evaluation data behind a model. The framework organizes documentation into seven sections: motivation (why the dataset was created), composition (what the data contains), collection process (how it was gathered and by whom), preprocessing (cleaning, filtering, labeling steps), uses (intended and prohibited applications), distribution (how the dataset is shared), and maintenance (who maintains it and how to report issues). Datasheets answer the question: "Can I trust this data, and is it appropriate for my use case?"
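The seven sections above can be turned into a fill-in skeleton. The sketch below is a minimal illustration: the section prompts paraphrase the framework's questions, and the `render_datasheet` helper is a hypothetical convenience, not part of any library:

```python
# The seven datasheet sections (Gebru et al.), with paraphrased prompts.
DATASHEET_SECTIONS = {
    "Motivation": "Why was the dataset created, and by whom?",
    "Composition": "What do the instances contain? Any sensitive data?",
    "Collection process": "How was the data gathered, and over what timeframe?",
    "Preprocessing": "What cleaning, filtering, or labeling was applied?",
    "Uses": "What applications are intended, and which are prohibited?",
    "Distribution": "How is the dataset shared?",
    "Maintenance": "Who maintains it, and how are issues reported?",
}

def render_datasheet(answers: dict) -> str:
    """Render answers as a Markdown datasheet, marking unanswered sections."""
    lines = ["# Datasheet"]
    for section, prompt in DATASHEET_SECTIONS.items():
        lines.append(f"\n## {section}")
        lines.append(f"<!-- {prompt} -->")
        lines.append(answers.get(section, "_TODO_"))
    return "\n".join(lines)

print(render_datasheet({"Motivation": "Benchmark for retrieval QA."}))
```

Keeping the prompts as comments in the rendered output means reviewers see exactly which question each answer is supposed to address.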
Data Cards (Google, 2022)
Google's Data Cards Playbook extends the datasheet concept with a more structured, template-driven approach designed for enterprise adoption. Data cards include quantitative summaries (dataset size, label distributions, demographic breakdowns) alongside qualitative descriptions, making them easier to generate semi-automatically from metadata. The playbook provides fillable templates and review checklists that integrate into MLOps workflows.
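The quantitative-summary portion of a data card can often be computed directly from the data, which is what makes semi-automatic generation feasible. The sketch below derives a label distribution with only the standard library; the records and field names are invented for illustration:

```python
from collections import Counter

# Hypothetical labeled records.
records = [
    {"text": "great product", "label": "positive"},
    {"text": "terrible support", "label": "negative"},
    {"text": "okay overall", "label": "neutral"},
    {"text": "love it", "label": "positive"},
]

def label_distribution(rows, field="label"):
    """Per-label counts and proportions -- a typical data-card summary."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return {k: (v, round(v / total, 3)) for k, v in sorted(counts.items())}

print(len(records), label_distribution(records))
# → 4 {'negative': (1, 0.25), 'neutral': (1, 0.25), 'positive': (2, 0.5)}
```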
Comparison: Documentation Frameworks
| Framework | Artifact Type | Primary Audience | Key Sections | Adoption |
|---|---|---|---|---|
| Model Cards | Trained model | Downstream developers, auditors | Intended use, metrics by group, limitations, ethical considerations | Widespread (Hugging Face, major providers) |
| Datasheets | Dataset | Researchers, data curators | Motivation, composition, collection, preprocessing, distribution, maintenance | Growing (academic standard, NeurIPS requirement) |
| Data Cards | Dataset | Enterprise ML teams, compliance | Quantitative summaries, schema, provenance, sensitivity labels | Moderate (Google ecosystem, enterprise adoption) |
Operationalizing Documentation in Training Pipelines
Documentation should not be a manual afterthought. Modern MLOps pipelines can generate documentation artifacts automatically. Hugging Face's huggingface_hub library provides ModelCard and DatasetCard classes that populate templates from training metadata (metrics, hyperparameters, dataset statistics), and Google's Data Cards Playbook templates are designed to be filled from schema information and summary statistics computed directly from the data files. The goal is to make documentation a build artifact: generated during training, versioned alongside the model weights, and reviewed during the deployment approval process.
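To make the "documentation as build artifact" idea concrete, here is a library-free sketch that turns training-run metadata into a Markdown model card with YAML front matter. The metadata fields and the `build_model_card` helper are illustrative assumptions; in a real pipeline, huggingface_hub's ModelCard class would render a richer template from the same inputs:

```python
def build_model_card(meta: dict) -> str:
    """Generate a minimal model card from training-run metadata."""
    front = "\n".join(f"{k}: {meta[k]}" for k in ("license", "base_model"))
    metrics = "\n".join(f"- {k}: {v}" for k, v in meta["metrics"].items())
    hparams = "\n".join(f"- {k}: {v}" for k, v in meta["hyperparameters"].items())
    return (
        f"---\n{front}\n---\n\n"
        f"# {meta['name']}\n\n"
        f"## Training metrics\n{metrics}\n\n"
        f"## Hyperparameters\n{hparams}\n"
    )

# Illustrative metadata, as a training script might emit it.
meta = {
    "name": "sentiment-classifier-v2",
    "license": "apache-2.0",
    "base_model": "distilbert-base-uncased",
    "metrics": {"accuracy": 0.91, "f1": 0.89},
    "hyperparameters": {"lr": 2e-5, "epochs": 3},
}
print(build_model_card(meta))
```

Because the card is derived from the same metadata the training run logs, committing it next to the model weights keeps the two from drifting apart.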
Tools for Documentation
- Hugging Face Dataset Cards: Every dataset on the Hub includes a structured card with YAML metadata (task type, languages, license) and freeform sections. The datasets library can auto-generate skeleton cards from dataset metadata.
- Google Data Cards Playbook: Provides PDF and digital templates, a facilitator guide for team workshops, and example cards for reference datasets.

Both tools lower the barrier to producing useful documentation, though human review remains essential for nuanced content like limitation descriptions and ethical considerations.
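For reference, the YAML metadata block at the top of a Hub dataset card looks roughly like this. The field names follow the Hub's dataset card metadata schema as commonly used; the values are invented for illustration:

```yaml
---
license: cc-by-4.0
language:
  - en
task_categories:
  - text-classification
size_categories:
  - 10K<n<100K
tags:
  - sentiment
---
```

This block is machine-readable, which is what powers the Hub's search filters and makes the structured half of a card cheap to generate automatically.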
Documentation is a living artifact. Automate what you can (statistics, schema, performance metrics), but reserve human judgment for what you must (limitations, ethical considerations, known biases). Schedule quarterly reviews of model and dataset cards, especially after retraining or data pipeline changes. Stale documentation is worse than no documentation because it creates false confidence.