Edge & On-Device LLMs

Chapter opener illustration: Edge & On-Device LLMs.

"The smartest model is the one you can run on the device your user already owns."

QuantQuant, Edge-Squeezing AI Agent

Looking Back

Chapter 59 trained at the largest scale; this chapter deploys at the smallest. It covers quantization, distillation, MLX, GGUF, the Apple Neural Engine, mobile NPUs, and the engineering tradeoffs that let a useful LLM fit on a phone or a laptop.

Big Picture

Not every LLM workload belongs in the cloud. This chapter covers running models on consumer hardware (laptops and phones), quantization for edge deployment, framework support (llama.cpp, MLC, MediaPipe, Core ML), privacy-preserving on-device inference, and the operational patterns that differ from server-side serving.

Chapter Overview

This chapter teaches when edge and on-device LLMs make sense (latency, privacy, cost, offline operation), the hardware envelope that makes them viable (Apple Silicon unified memory, mobile NPUs, embedded accelerators), the quantization recipes that narrow the fp16-to-4-bit quality gap, the runtime choices (MLX, llama.cpp, ONNX Runtime, mobile frameworks), and the deployment patterns that survive in real apps.
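To make the fp16-to-4-bit arithmetic concrete, here is a minimal sketch that estimates the memory needed for model weights alone at different precisions. The function name `weight_footprint_gb` and the example parameter counts are illustrative assumptions, not values taken from any runtime. The estimate deliberately ignores the KV cache, activations, and the per-group scales that real quantization formats store alongside the weights, so actual footprints will be somewhat larger.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes (1e9 bytes).

    Assumption: every parameter is stored at the same bit width, with no
    overhead for quantization scales, zero points, or file metadata.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Hypothetical model sizes chosen for illustration only.
    models = [("3.8B parameters", 3.8e9), ("7B parameters", 7.0e9)]
    precisions = [("fp16", 16), ("int8", 8), ("4-bit", 4)]

    for model_name, n_params in models:
        for precision_name, bits in precisions:
            gb = weight_footprint_gb(n_params, bits)
            print(f"{model_name:>16} @ {precision_name:>5}: {gb:.1f} GB")
```

Under these assumptions, a 3.8B-parameter model needs about 7.6 GB for its weights at fp16 and about 1.9 GB at 4 bits, which is the difference between not fitting and fitting in the memory budget of many phones.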

On-device LLMs crossed from "interesting demo" to "shipping product" in 2024 and 2025. This chapter is the practitioner's map of when to use them and how.

Note: Learning Objectives

Sections in This Chapter

Prerequisites

What's Next?

This chapter begins with Section 60.1: Why Edge Deployment, continues with Section 60.2: The Edge Framework Landscape, and closes with Section 60.3: Hardware Constraints. Each section builds on the previous one, so we recommend reading them in order.

Further Reading

Small & Mobile-Class Models

Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., et al. (2024). "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." arXiv preprint. arXiv:2404.14219. The reference small-model paper, reporting that a 3.8B-parameter model can rival GPT-3.5-class benchmark quality while being small enough to run on a phone; central to the on-device argument of 60.1.
Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., et al. (2024). "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases." ICML. arXiv:2402.14905. Meta's design study for sub-1B mobile LLMs; covers the architectural choices (deep-thin layouts, weight sharing) for memory-constrained deployment.

On-Device Quantization and Runtimes

Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., & Han, S. (2024). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys. arXiv:2306.00978. An activation-aware, weight-only 4-bit quantization scheme and a standard reference for edge-class compression. Note that llama.cpp ships its own GGUF quantization formats rather than AWQ.
Gerganov, G., et al. (2023-2026). llama.cpp: Inference of Llama models in pure C/C++. github.com/ggerganov/llama.cpp. The de facto reference CPU/Metal/Vulkan runtime for on-device LLMs; defines the GGUF format used across mobile deployments.

On-Device Privacy

Apple Machine Learning Research. (2024). "Introducing Apple's On-Device and Server Foundation Models." Apple ML Research. Apple's overview of the Apple Intelligence architecture, describing private on-device inference and Private Cloud Compute; a production reference point for privacy-preserving edge LLMs.