Building Conversational AI with LLMs and Agents
Appendix F: Glossary

F.1 Libraries, Frameworks & Tools

About This Section

This section covers software libraries, inference engines, evaluation platforms, and developer tools referenced throughout the book. Each entry links to the chapter where the tool is introduced or most deeply discussed.

BitsAndBytes
A quantization library that provides efficient 4-bit and 8-bit kernels for transformer models, enabling fine-tuning and inference on consumer GPUs. It serves as the backend for QLoRA, making it possible to fine-tune a 65B-parameter model on a single 48 GB GPU.
See Section 15.2 (QLoRA and Quantized PEFT)
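To make the core idea of BitsAndBytes-style quantization concrete, here is a minimal sketch of blockwise absmax quantization in NumPy. This illustrates the principle (one scale per block, weights mapped to a small integer grid), not the library's actual NF4 kernels, which use a normal-float codebook and fused CUDA ops.

```python
import numpy as np

def quantize_blockwise(w, block_size=64):
    """Blockwise absmax quantization to int8 (illustrative sketch only)."""
    w = w.reshape(-1, block_size)
    absmax = np.abs(w).max(axis=1, keepdims=True)   # one scale per block
    q = np.round(w / absmax * 127).astype(np.int8)  # map [-absmax, absmax] -> [-127, 127]
    return q, absmax

def dequantize_blockwise(q, absmax):
    """Recover approximate float weights from int8 codes and per-block scales."""
    return q.astype(np.float32) / 127 * absmax

w = np.random.randn(4, 64).astype(np.float32)
q, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, scales).reshape(4, 64)
print(np.abs(w - w_hat).max())  # small per-weight reconstruction error
```

Blockwise scales are what keep outlier weights from destroying precision: a single large value only inflates the quantization step within its own block.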
Chatbot Arena
A crowd-sourced evaluation platform where users compare model outputs side by side, generating Elo ratings. Because it captures human preference on open-ended tasks at scale, Arena rankings are among the most trusted signals of real-world model quality.
See Section 29.2 (Benchmarks and Leaderboards)
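As a sketch of how pairwise votes become ratings, here is the classic Elo update for a single comparison. (Chatbot Arena's published rankings are computed with a statistical fit over all votes rather than sequential updates, so treat this as an illustration of the rating idea, not Arena's exact pipeline.)

```python
def elo_update(r_a, r_b, outcome, k=32):
    """One Elo update. outcome = 1 if A wins, 0 if B wins, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # predicted win rate for A
    r_a_new = r_a + k * (outcome - expected_a)
    r_b_new = r_b + k * ((1 - outcome) - (1 - expected_a))
    return r_a_new, r_b_new

print(elo_update(1000, 1000, 1))  # evenly matched: winner gains 16, loser drops 16 at k=32
```

The key property is that beating a much stronger model moves ratings far more than beating a weaker one, because the expected score already encodes the rating gap.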
Flash Attention
An IO-aware attention algorithm that tiles the attention computation to avoid materializing the full n×n attention matrix, reducing memory from O(n²) to O(n) in sequence length. Flash Attention delivers significant speedups in both training and inference on modern GPUs.
See Section 9.1 (Inference Optimization Fundamentals)
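The trick that makes tiling possible is the online softmax: a running maximum and normalizer let you process keys block by block without ever holding the full score matrix. A minimal NumPy sketch (single head, no masking, no GPU concerns) that matches naive attention exactly:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=4):
    """One query row at a time, streaming over key/value blocks with a
    running max (m) and normalizer (l) -- the online-softmax idea Flash
    Attention builds on."""
    d = Q.shape[-1]
    out = np.zeros_like(Q)
    for i in range(Q.shape[0]):
        m, l, acc = -np.inf, 0.0, np.zeros(d)
        for j in range(0, K.shape[0], block):
            s = Q[i] @ K[j:j + block].T / np.sqrt(d)  # scores for this block only
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)                 # rescale earlier partial sums
            p = np.exp(s - m_new)
            l = l * scale + p.sum()
            acc = acc * scale + p @ V[j:j + block]
            m = m_new
        out[i] = acc / l
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V))
```

The real algorithm additionally sizes the tiles to fit in GPU SRAM and fuses the passes into one kernel; the arithmetic identity, however, is exactly the one above.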
HumanEval
A code generation benchmark consisting of 164 Python problems with unit tests. Models are scored on pass@k: the probability that at least one of k generated solutions passes all tests. HumanEval pass@1 is a standard point of comparison for LLM coding ability.
See Section 29.2 (Benchmarks and Leaderboards)
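The pass@k score is computed with the unbiased estimator from the Codex paper: generate n ≥ k samples per problem, count the c that pass, and evaluate 1 − C(n−c, k)/C(n, k) rather than naively resampling:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws (without
    replacement) from n samples, c of which pass, is a passing solution."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=3, k=1), 4))  # 0.3: with k=1 this is just c/n
```

Per-problem scores are then averaged over the 164 problems to get the reported benchmark number.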
MCP (Model Context Protocol)
An open protocol from Anthropic that standardizes how AI applications connect to external data sources and tools. MCP provides a universal interface so agents can access context from diverse systems without custom integration code for each provider.
See Section 22.2 (Tool Use and Function Calling)
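MCP messages are JSON-RPC 2.0; a client invokes a server-exposed tool with the `tools/call` method. The sketch below shows the general shape of such a request; the tool name and arguments are hypothetical, and real clients also perform an initialization handshake and `tools/list` discovery first.

```python
import json

# Illustrative shape of an MCP tool invocation (hypothetical tool and arguments).
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_documents",
        "arguments": {"query": "quarterly revenue"},
    },
}
print(json.dumps(request))
```

Because every provider speaks this same envelope, an agent framework needs one MCP client rather than a bespoke integration per data source.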
MMLU (Massive Multitask Language Understanding)
A benchmark of 57 multiple-choice tasks spanning STEM, humanities, and social sciences. MMLU is widely used as a general knowledge metric for LLMs, though it has known limitations including data contamination and multiple-choice format bias.
See Section 29.2 (Benchmarks and Leaderboards)
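Scoring on MMLU is typically likelihood-based: the model's probability for each answer letter is compared and the highest-scoring choice is taken as the prediction. A toy illustration with made-up log-probabilities:

```python
# Hypothetical log-probabilities a model assigns to each answer letter for
# one MMLU question; the prediction is simply the argmax over choices.
logprobs = {"A": -2.1, "B": -0.4, "C": -3.0, "D": -1.7}
prediction = max(logprobs, key=logprobs.get)
print(prediction)  # "B" has the highest log-probability
```

Accuracy is then the fraction of the 14,000+ questions answered correctly, usually reported per-subject and as a macro average.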
Unsloth
An open-source library that provides optimized kernels for fine-tuning LLMs, achieving 2 to 5x speedups with 60% less memory compared to standard Hugging Face training loops. It is especially popular for QLoRA training of Llama and Mistral models.
See Section 15.2 (QLoRA and Quantized PEFT)
Vector Database
A specialized database optimized for storing and searching high-dimensional embedding vectors via approximate nearest-neighbor algorithms. Vector databases are essential infrastructure for RAG systems. Popular options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector.
See Section 19.2 (Vector Databases)
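The operation a vector database performs is similarity search over embeddings. Below is an exact brute-force version in NumPy for intuition; production systems approximate this with ANN indexes (e.g. HNSW) so that search stays fast over millions of vectors.

```python
import numpy as np

def top_k_cosine(query, vectors, k=2):
    """Exact top-k cosine-similarity search (the operation a vector DB
    approximates at scale with ANN indexes)."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                      # cosine similarity of each row to the query
    idx = np.argsort(-sims)[:k]      # indices of the k most similar vectors
    return idx, sims[idx]

docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
idx, sims = top_k_cosine(np.array([1.0, 0.05]), docs, k=2)
print(idx)  # the two document vectors closest in angle to the query
```

Swapping the brute-force scan for an approximate index trades a small amount of recall for sub-linear query time, which is the central engineering bargain of this product category.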
vLLM
A high-throughput inference engine that uses PagedAttention to manage GPU memory efficiently. vLLM supports continuous batching, tensor parallelism, and many quantization formats, making it one of the most popular choices for production LLM serving.
See Section 9.4 (Serving Infrastructure)
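PagedAttention's core idea can be sketched without any GPU code: the KV cache is carved into fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks allocated on demand. This toy bookkeeping model is an assumption-laden simplification, not vLLM's implementation:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockTable:
    """Toy per-sequence mapping from logical token positions to physical blocks."""
    def __init__(self, free_blocks):
        self.free = list(free_blocks)
        self.table = []  # logical block index -> physical block id

    def append_token(self, position):
        if position % BLOCK_SIZE == 0:           # current block full (or first token)
            self.table.append(self.free.pop())   # grab a physical block on demand
        return self.table[position // BLOCK_SIZE]

seq = BlockTable(free_blocks=range(100))
physical = [seq.append_token(p) for p in range(40)]
print(len(seq.table))  # 40 tokens occupy only ceil(40/16) = 3 blocks
```

Because memory is claimed block by block instead of reserved for the maximum sequence length, the server wastes almost no cache space and can batch far more concurrent requests.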