Appendices
Appendix H: Model Cards and Selection Guide

Open-Weight Model Families

Open-weight models can be downloaded, self-hosted, fine-tuned, and (with varying license restrictions) used commercially. They represent the best option for teams needing full control over their model deployment.
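The trade-offs that drive the selection guide below (license permissiveness, scale, context length) can be captured in a small filtering helper. This is an illustrative sketch only: the `ModelCard` fields, the `CARDS` entries, and the `shortlist` function are hypothetical names, with values drawn from the cards in this appendix.

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    name: str
    license: str          # license family as listed on the card
    max_params_b: float   # largest variant, in billions of parameters
    context_k: int        # context window, in thousands of tokens

# Illustrative entries drawn from the cards in this appendix.
CARDS = [
    ModelCard("Llama 3.1", "Llama Community License", 405, 128),
    ModelCard("Mixtral 8x22B", "Apache-2.0", 141, 32),
    ModelCard("DeepSeek-V3", "MIT", 671, 128),
    ModelCard("Qwen 2.5", "Apache-2.0", 72, 128),
    ModelCard("Phi-4", "MIT", 14, 16),
]

def shortlist(cards, permissive_only=False, min_context_k=0):
    """Filter model cards by license permissiveness and context length."""
    permissive = {"MIT", "Apache-2.0"}
    return [
        c.name for c in cards
        if (not permissive_only or c.license in permissive)
        and c.context_k >= min_context_k
    ]

# Example: fully permissive license and at least ~100K context.
print(shortlist(CARDS, permissive_only=True, min_context_k=100))
# ['DeepSeek-V3', 'Qwen 2.5']
```

In practice the same filtering happens informally; the point is that license and context window are usually the first two axes to check, before benchmark quality.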

Meta Llama 3 / Llama 3.1 / Llama 4

Open weights · Llama Community License · HuggingFace
Parameters: Llama 3: 8B, 70B; Llama 3.1: 8B, 70B, 405B; Llama 4: Scout (17B active / 109B total MoE), Maverick (17B active / 400B total MoE)
Context Length: Llama 3.1: 128K; Llama 4: 10M tokens (Scout), 1M (Maverick)
Architecture: Decoder-only transformer; Llama 4 uses Mixture of Experts
Key Strengths: Strong baseline for fine-tuning, large ecosystem, excellent community support, competitive with proprietary models at scale
License: Llama Community License (commercial use allowed below a 700M monthly-active-user threshold)
Best For: Custom fine-tuning, self-hosted inference, research, production deployments needing full control

Mistral AI: Mistral / Mixtral / Mistral Large

Open weights (select models) · Apache 2.0 (7B, Mixtral) · API + HuggingFace
Parameters: Mistral 7B; Mixtral 8x7B (47B total, 13B active); Mixtral 8x22B (141B total, 39B active); Mistral Large (123B); Mistral Small (24B)
Context Length: 32K (7B, Mixtral); 128K (Mistral Large)
Architecture: Decoder-only; Mixtral uses sparse MoE with top-2 routing
Key Strengths: Excellent quality-to-size ratio, efficient MoE inference, strong multilingual support, function calling
License: Apache 2.0 (Mistral 7B, Mixtral); proprietary (Mistral Large)
Best For: Cost-effective self-hosting, multilingual applications, MoE experimentation
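The top-2 routing named in the Mixtral card above can be sketched in a few lines: a router scores every expert, only the two highest-scoring experts actually run, and their outputs are mixed by renormalized router probabilities. This is a minimal per-token sketch under that description, not Mixtral's actual implementation; `softmax`, `top2_route`, and the toy experts are illustrative names.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top2_route(router_logits, expert_fns, x):
    """Sparse MoE layer for one token: run only the two experts with
    the highest router scores and mix their outputs by renormalized
    router probabilities (the top-2 scheme described for Mixtral)."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    weight_sum = sum(probs[i] for i in top2)
    return sum(probs[i] / weight_sum * expert_fns[i](x) for i in top2)

# Toy experts: each just scales its input by a constant.
experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]

# Experts 1 and 3 tie for the top score, so the output is their average:
out = top2_route([0.1, 2.0, 0.3, 2.0], experts, 1.0)
# 0.5 * 2.0 + 0.5 * 4.0 = 3.0
```

This is why a 47B-total Mixtral runs at roughly the cost of a 13B dense model: only the selected experts' parameters participate in each token's forward pass.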

DeepSeek-V3 / DeepSeek-R1

Open weights · MIT License · API + HuggingFace
Parameters: DeepSeek-V3: 671B total (37B active MoE); DeepSeek-R1: 671B total (37B active MoE); R1 distilled variants: 1.5B, 7B, 8B, 14B, 32B, 70B
Context Length: 128K tokens
Architecture: MoE with Multi-Head Latent Attention (MLA) and auxiliary-loss-free load balancing
Key Strengths: V3: frontier-class general ability at low inference cost; R1: state-of-the-art open reasoning with visible chain-of-thought
License: MIT (fully permissive)
Best For: Cost-efficient self-hosting of frontier-quality models, reasoning tasks (R1), distilled reasoning (R1-distill variants)

Qwen 2.5 / QwQ

Open weights · Apache 2.0 · API + HuggingFace
Parameters: Qwen 2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B; QwQ-32B (reasoning variant)
Context Length: 32K (base); 128K (72B, QwQ)
Architecture: Decoder-only transformer with GQA, SwiGLU, RoPE
Key Strengths: Excellent multilingual ability (especially CJK languages), strong coding (Qwen2.5-Coder), competitive reasoning (QwQ)
License: Apache 2.0 (most sizes); Qwen License for 72B
Best For: Multilingual applications, coding assistants, fine-tuning base models, reasoning (QwQ)
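RoPE, listed in the Qwen architecture row above, encodes position by rotating pairs of query/key dimensions through position-dependent angles, which makes attention scores depend only on relative position. The sketch below is a simplification (real implementations typically rotate split halves of each head dimension, not adjacent pairs); the `rope` function and its pairing scheme are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding, simplified: rotate each consecutive
    pair of dimensions by an angle that grows with position and
    shrinks with dimension index."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.0, 1.0]
k = [0.0, 1.0, 1.0, 0.0]
# The score between a query at position 5 and a key at position 7
# matches the score at positions 0 and 2: only the offset matters.
rel_a = dot(rope(q, 5), rope(k, 7))
rel_b = dot(rope(q, 0), rope(k, 2))
```

Because rotations preserve vector norms, RoPE adds positional information without changing token-embedding magnitudes, and the relative-offset property is what lets context windows be extended by rescaling the rotation frequencies.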

Microsoft Phi-4 / Phi-4-mini

Open weights · MIT License · HuggingFace
Parameters: Phi-4: 14B; Phi-4-mini: 3.8B
Context Length: 16K tokens
Architecture: Decoder-only transformer; trained heavily on synthetic data and curated textbook-quality sources
Key Strengths: Exceptional performance at small sizes (especially math and reasoning), demonstrates the power of data quality over quantity
License: MIT
Best For: On-device deployment, edge inference, resource-constrained environments, research into data-efficient training

Google Gemma 3

Open weights · Gemma License · HuggingFace + Kaggle
Parameters: 1B, 4B, 12B, 27B
Context Length: 32K tokens (1B); 128K (4B and larger)
Modalities: Text and image input (4B+); text output
Key Strengths: Strong multimodal understanding at small sizes, native vision capability, good fine-tuning base
License: Gemma Terms of Use (commercial use allowed with restrictions)
Best For: Multimodal applications on modest hardware, fine-tuning for specialized domains, on-device vision+language

License Differences Matter

"Open weights" does not mean "open source." Check each model's license carefully: some (like Llama) restrict commercial use above certain user thresholds, while others (like Mistral's Apache-licensed models) have no such restrictions.