
"The smartest model is the one you can run on the device your user already owns."
Quant, Edge-Squeezing AI Agent
Chapter 59 trained at the largest scale; this chapter deploys at the smallest. It covers quantization, distillation, MLX, GGUF, the Apple Neural Engine, mobile NPUs, and the engineering tradeoffs that let a useful LLM fit on a phone or a laptop.
Not every LLM workload belongs in the cloud. This chapter covers running models on consumer hardware (laptops and phones), quantization for edge devices, framework support (llama.cpp, MLC, MediaPipe, Core ML), privacy-preserving on-device inference, and the operational patterns that differ from server-side serving.
Chapter Overview
Not every LLM inference should travel to the cloud. This chapter teaches when edge and on-device LLMs make sense (latency, privacy, cost, offline operation), the hardware envelope that makes them viable (Apple Silicon unified memory, mobile NPUs, embedded accelerators), the quantization recipes that close the fp16-to-4-bit quality gap, the runtime choices (MLX, llama.cpp, ONNX Runtime, mobile frameworks), and the deployment patterns that survive in real apps.
On-device LLMs crossed from "interesting demo" to "shipping product" in 2024 and 2025. This chapter is the practitioner's map of when to use them and how.
- Identify the four reasons (latency, privacy, cost, offline) that justify on-device LLM deployment.
- Size an on-device deployment for Apple Silicon, mobile NPU, or embedded hardware.
- Apply 4-bit and 2-bit quantization recipes that preserve quality for a target task.
- Compare MLX, llama.cpp, ONNX Runtime, and mobile frameworks for on-device serving.
- Architect an app that combines on-device inference with selective cloud escalation.
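As a rough illustration of the sizing objective above, the following sketch estimates the memory needed for model weights at different bit widths. It is a back-of-the-envelope calculation, not the chapter's own method: the 3-billion-parameter model, the 8 GiB device, and the `usable_fraction` of 0.5 are all illustrative assumptions, and the estimate ignores the KV cache, activations, and per-block quantization metadata such as scales.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB.

    Ignores KV cache, activations, and quantization metadata (assumption).
    """
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 2**30


def fits(n_params: float, bits_per_weight: float,
         device_ram_gib: float, usable_fraction: float = 0.5) -> bool:
    """Check whether the weights fit in the share of RAM an app may use.

    usable_fraction is a hypothetical budget; real limits depend on the OS
    and on what else is running on the device.
    """
    budget_gib = device_ram_gib * usable_fraction
    return weight_footprint_gib(n_params, bits_per_weight) <= budget_gib


if __name__ == "__main__":
    n_params = 3e9        # hypothetical 3B-parameter model
    device_ram_gib = 8    # hypothetical phone or laptop with 8 GiB of RAM
    for bits in (16, 4, 2):
        size = weight_footprint_gib(n_params, bits)
        verdict = "fits" if fits(n_params, bits, device_ram_gib) else "does not fit"
        print(f"{bits:>2}-bit: {size:.2f} GiB, {verdict}")
```

Under these assumptions the weights take roughly 5.6 GiB at 16 bits and about 1.4 GiB at 4 bits, which is why 4-bit quantization is often the difference between a model that fits on a device and one that does not.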
Prerequisites
- Inference optimization from Chapter 9
- PEFT and quantization from Chapter 17
- Frontier hardware from Chapter 58
Sections in This Chapter
- 60.1 Why Edge Deployment: The four drivers (privacy, latency, cost, sovereignty) and the use-case matrix that decide when edge wins over cloud. (Intermediate)
- 60.2 The Edge Framework Landscape: Catalog of llama.cpp, Ollama, MLX, ExecuTorch, WebLLM, and Qualcomm AI Hub, plus the iMatrix quantization workflow. (Intermediate)
- 60.3 Hardware Constraints: Battery, thermal throttling, memory and quantization tradeoffs, and the Phi-3.5 / Gemma 3 / Apple Foundation reference models that fit on a phone. (Advanced)
What's Next?
This chapter begins with Section 60.1: Why Edge Deployment, continues with Section 60.2: The Edge Framework Landscape, and closes with Section 60.3: Hardware Constraints. Each section builds on the previous one, so we recommend reading them in order.