Appendices
Appendix H: Model Cards and Selection Guide

Open-Weight Model Families

Open-weight models can be downloaded, self-hosted, fine-tuned, and (with varying license restrictions) used commercially. They represent the best option for teams needing full control over their model deployment.
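The trade-offs that drive the selection guide below (license permissiveness, scale, context length) can be captured in a small filtering helper. This is an illustrative sketch only: the `ModelCard` fields, the `CARDS` entries, and the `shortlist` function are hypothetical names, with values drawn from the cards in this appendix.

```python
from dataclasses import dataclass

@dataclass
class ModelCard:
    name: str
    license: str          # license family as listed on the card
    max_params_b: float   # largest variant, in billions of parameters
    context_k: int        # context window, in thousands of tokens

# Illustrative entries drawn from the cards in this appendix.
CARDS = [
    ModelCard("Llama 3.1", "Llama Community License", 405, 128),
    ModelCard("Mixtral 8x22B", "Apache-2.0", 141, 32),
    ModelCard("DeepSeek-V3", "MIT", 671, 128),
    ModelCard("Qwen 2.5", "Apache-2.0", 72, 128),
    ModelCard("Phi-4", "MIT", 14, 16),
]

def shortlist(cards, permissive_only=False, min_context_k=0):
    """Filter model cards by license permissiveness and context length."""
    permissive = {"MIT", "Apache-2.0"}
    return [
        c.name for c in cards
        if (not permissive_only or c.license in permissive)
        and c.context_k >= min_context_k
    ]

# Example: fully permissive license and at least ~100K context.
print(shortlist(CARDS, permissive_only=True, min_context_k=100))
# ['DeepSeek-V3', 'Qwen 2.5']
```

In practice the same filtering happens informally; the point is that license and context window are usually the first two axes to check, before benchmark quality.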

Meta Llama 3 / Llama 3.1 / Llama 4

Open weights · Llama Community License · HuggingFace
Parameters: Llama 3: 8B, 70B; Llama 3.1: 8B, 70B, 405B; Llama 4: Scout (17B active / 109B total MoE), Maverick (17B active / 400B total MoE)
Context Length: Llama 3.1: 128K; Llama 4: 10M tokens (Scout), 1M (Maverick)
Architecture: Decoder-only transformer; Llama 4 uses Mixture of Experts
Key Strengths: Strong baseline for fine-tuning, large ecosystem, excellent community support, competitive with proprietary models at scale
License: Llama Community License (commercial use allowed below a 700M monthly-active-user threshold)
Best For: Custom fine-tuning, self-hosted inference, research, production deployments needing full control

Mistral AI: Mistral / Mixtral / Mistral Large

Open weights (select models) · Apache 2.0 (7B, Mixtral) · API + HuggingFace
Parameters: Mistral 7B; Mixtral 8x7B (47B total, 13B active); Mixtral 8x22B (141B total, 39B active); Mistral Large (123B); Mistral Small (24B)
Context Length: 32K (7B, Mixtral); 128K (Mistral Large)
Architecture: Decoder-only; Mixtral uses sparse MoE with top-2 routing
Key Strengths: Excellent quality-to-size ratio, efficient MoE inference, strong multilingual support, function calling
License: Apache 2.0 (Mistral 7B, Mixtral); proprietary (Mistral Large)
Best For: Cost-effective self-hosting, multilingual applications, MoE experimentation
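The top-2 routing named in the Mixtral card above can be sketched in a few lines: a router scores every expert, only the two highest-scoring experts actually run, and their outputs are mixed by renormalized router probabilities. This is a minimal per-token sketch under that description, not Mixtral's actual implementation; `softmax`, `top2_route`, and the toy experts are illustrative names.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top2_route(router_logits, expert_fns, x):
    """Sparse MoE layer for one token: run only the two experts with
    the highest router scores and mix their outputs by renormalized
    router probabilities (the top-2 scheme described for Mixtral)."""
    probs = softmax(router_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    weight_sum = sum(probs[i] for i in top2)
    return sum(probs[i] / weight_sum * expert_fns[i](x) for i in top2)

# Toy experts: each just scales its input by a constant.
experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]

# Experts 1 and 3 tie for the top score, so the output is their average:
out = top2_route([0.1, 2.0, 0.3, 2.0], experts, 1.0)
# 0.5 * 2.0 + 0.5 * 4.0 = 3.0
```

This is why a 47B-total Mixtral runs at roughly the cost of a 13B dense model: only the selected experts' parameters participate in each token's forward pass.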

DeepSeek-V3 / DeepSeek-R1

Open weights · MIT License · API + HuggingFace
Parameters: DeepSeek-V3: 671B total (37B active MoE); DeepSeek-R1: 671B total (37B active MoE); R1 distilled variants: 1.5B, 7B, 8B, 14B, 32B, 70B
Context Length: 128K tokens
Architecture: MoE with Multi-Head Latent Attention (MLA) and auxiliary-loss-free load balancing
Key Strengths: V3: frontier-class general ability at low inference cost; R1: state-of-the-art open reasoning with visible chain-of-thought
License: MIT (fully permissive)
Best For: Cost-efficient self-hosting of frontier-quality models, reasoning tasks (R1), distilled reasoning (R1-distill variants)

Qwen 2.5 / QwQ

Open weights · Apache 2.0 · API + HuggingFace
Parameters: Qwen 2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B; QwQ-32B (reasoning variant)
Context Length: 32K (base); 128K (72B, QwQ)
Architecture: Decoder-only transformer with GQA, SwiGLU, RoPE
Key Strengths: Excellent multilingual ability (especially CJK languages), strong coding (Qwen2.5-Coder), competitive reasoning (QwQ)
License: Apache 2.0 (most sizes); Qwen License for 72B
Best For: Multilingual applications, coding assistants, fine-tuning base models, reasoning (QwQ)
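RoPE, listed in the Qwen architecture row above, encodes position by rotating pairs of query/key dimensions through position-dependent angles, which makes attention scores depend only on relative position. The sketch below is a simplification (real implementations typically rotate split halves of each head dimension, not adjacent pairs); the `rope` function and its pairing scheme are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding, simplified: rotate each consecutive
    pair of dimensions by an angle that grows with position and
    shrinks with dimension index."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.0, 1.0]
k = [0.0, 1.0, 1.0, 0.0]
# The score between a query at position 5 and a key at position 7
# matches the score at positions 0 and 2: only the offset matters.
rel_a = dot(rope(q, 5), rope(k, 7))
rel_b = dot(rope(q, 0), rope(k, 2))
```

Because rotations preserve vector norms, RoPE adds positional information without changing token-embedding magnitudes, and the relative-offset property is what lets context windows be extended by rescaling the rotation frequencies.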

Microsoft Phi-4 / Phi-4-mini

Open weights · MIT License · HuggingFace
Parameters: Phi-4: 14B; Phi-4-mini: 3.8B
Context Length: 16K tokens
Architecture: Decoder-only transformer; trained heavily on synthetic data and curated textbook-quality sources
Key Strengths: Exceptional performance at small sizes (especially math and reasoning), demonstrates the power of data quality over quantity
License: MIT
Best For: On-device deployment, edge inference, resource-constrained environments, research into data-efficient training

Google Gemma 3

Open weights · Gemma License · HuggingFace + Kaggle
Parameters: 1B, 4B, 12B, 27B
Context Length: 32K tokens (1B); 128K (4B and larger)
Modalities: Text and image input (4B+); text output
Key Strengths: Strong multimodal understanding at small sizes, native vision capability, good fine-tuning base
License: Gemma Terms of Use (commercial use allowed with restrictions)
Best For: Multimodal applications on modest hardware, fine-tuning for specialized domains, on-device vision+language

License Differences Matter

"Open weights" does not mean "open source." Check each model's license carefully: some (like Llama) restrict commercial use above certain user thresholds, while others (like Mistral's Apache-licensed models) have no such restrictions.