Building Conversational AI with LLMs and Agents
Appendix H

Model Cards and Selection Guide

Quick-reference cards for the major model families powering today's LLM applications

[Illustration: a friendly AI model at a podium presenting its own ID card, with stats and limitations, to an audience of developers taking notes]
Big Picture

This appendix provides quick-reference cards for the major model families powering today's LLM applications. It covers both proprietary models (GPT-4 series, Claude series, Gemini series) and open-weight models (Llama, Mistral, Falcon, Qwen, and others), with structured information on context windows, capabilities, pricing, licensing, and known limitations. A comparative table in Section H.3 allows side-by-side evaluation across dimensions that matter for deployment decisions.

Model selection is often the first architectural decision in any LLM project and one of the most consequential. A model that performs well on benchmarks may be unsuitable for a given task due to context length, licensing restrictions, latency requirements, or cost at scale. These cards provide the information needed to make an informed first cut before running evaluations.
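Cost at scale is easy to underestimate from per-token prices alone. As a rough illustration, the sketch below estimates monthly spend for a steady request load; the model names and per-million-token prices are hypothetical placeholders, not real quotes, since actual pricing varies by provider and changes frequently.

```python
# Hypothetical per-million-token prices (USD). These are placeholder
# values for illustration only -- always check the provider's current
# pricing page before making a decision.
PRICES = {
    "model_a": {"input": 3.00, "output": 15.00},
    "model_b": {"input": 0.25, "output": 1.25},
}

def monthly_cost(model: str, requests_per_day: int,
                 input_tokens: int, output_tokens: int) -> float:
    """Estimate monthly API cost for a steady daily request load."""
    p = PRICES[model]
    per_request = (input_tokens * p["input"]
                   + output_tokens * p["output"]) / 1_000_000
    return per_request * requests_per_day * 30

# 10k requests/day, 2k input tokens and 500 output tokens per request
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 10_000, 2_000, 500):,.2f}/month")
```

With these placeholder prices, the two candidates differ by more than an order of magnitude at identical traffic, which is exactly the kind of gap that benchmark scores alone will not reveal.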

This appendix is essential for engineers and product builders selecting models for new projects, and for researchers who need to quickly understand what a given model family was designed for and where it has known weaknesses. It should be treated as a starting point: model capabilities and pricing change frequently, so always verify against official documentation for production decisions.

For deeper context on how these model families evolved, see Chapter 7 (Modern LLM Landscape). API access patterns for the proprietary models are covered in Chapter 10 (LLM APIs). For the benchmarking methodology behind the performance claims in these cards, see Appendix J (Datasets and Benchmarks).

Prerequisites

No specific prerequisites are required to read this appendix. Familiarity with what the main benchmarks measure (MMLU, HumanEval, MT-Bench) will help you interpret the capability entries; the benchmarks themselves are covered in Appendix J. For a deeper understanding of how these models work architecturally, see Chapter 4 (Transformer Architecture) and Chapter 6 (Pretraining and Scaling Laws).

When to Use This Appendix

Use this appendix at the start of a project when selecting a model family, or when comparing options before running evaluations. Return to it when evaluating whether to switch from a proprietary model to an open-weight alternative (or vice versa) based on cost, latency, or licensing constraints. For sizing hardware to run specific open-weight models, pair this appendix with Appendix G (Hardware and Compute). Do not rely solely on the cards for production decisions; benchmark the candidate models on your specific task and data.
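Benchmarking candidates on your own task need not be elaborate to be useful. The sketch below shows a minimal harness, assuming a `call_model` callable that wraps whichever API client or local inference call you are evaluating (the `fake_model` stand-in and exact-match scoring are illustrative assumptions; real tasks usually need a task-appropriate scorer).

```python
from typing import Callable

def evaluate(call_model: Callable[[str], str],
             examples: list[tuple[str, str]]) -> float:
    """Fraction of examples where the model's answer exactly matches
    the expected output. Swap in a task-appropriate scorer (fuzzy
    match, LLM-as-judge, etc.) for less constrained outputs."""
    correct = sum(
        1 for prompt, expected in examples
        if call_model(prompt).strip() == expected
    )
    return correct / len(examples)

# Stand-in "model" for illustration only; in practice this would wrap
# an API client or local inference call for each candidate model.
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "unknown"

examples = [("What is 2 + 2?", "4"), ("Capital of France?", "Paris")]
print(evaluate(fake_model, examples))  # 0.5 on this toy set
```

Running the same small, task-specific set against each shortlisted model gives a far more reliable first cut than published benchmark scores.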

Sections