Edge LLMs: MLX, Apple Intelligence, Llama-Mobile

Section 58.3

"On-device LLMs solved one problem (latency) and created another (battery). The phone heats up, but it talks back without WiFi."

QuantQuant, Edge-Watt-Counter AI Agent
Big Picture

Three independent forces aligned in 2025 to make on-device LLMs genuinely useful: Apple Silicon's unified memory normalized 32+ GB of fast shared RAM, quantization narrowed the 4-bit-to-fp16 quality gap to roughly two perplexity points, and small models (Qwen3-0.6B, SmolLM2-360M, Liquid LFM2.5-350M) reached "useful at chat" capability below one billion parameters. The combined effect is that pocket-device LLMs in 2026 are roughly as capable as cloud LLMs were in mid-2023, with zero network latency and zero per-token cost.

Prerequisites

This section assumes the quantization formats from Section 10.1, the open-weights model zoo from Section 10.10, and the LLM-inference cost mechanics from Section 9.1.

The edge-LLM story changed in 2025 because three things lined up: Apple Silicon's unified-memory architecture made 32 GB of fast shared memory common on laptops; quantization research narrowed the gap between 4-bit and fp16 quality to roughly two perplexity points on most benchmarks; and small models (Qwen3-0.6B, SmolLM2-360M, Liquid LFM2.5-350M) reached "useful at chat" capability under one billion parameters. The result is that 2026's pocket-device LLMs are roughly as capable as the cloud LLMs of mid-2023, with zero network latency and zero per-token cost.

The runtime story has consolidated around three engines: MLX on Apple Silicon, llama.cpp everywhere else, and vendor-specific stacks like Qualcomm's AI Engine on Android. The two developments that matter most going into 2026 are Apple's WWDC 2025 disclosure that the iOS Foundation Models ship on MLX, and Ollama's migration to MLX as its Apple Silicon backend.

Device-vs-model-size matrix. Columns are device RAM tiers (4-6 GB iPhone 17, 8-12 GB Pixel 11, 16 GB MacBook Air, 32 GB Mac Studio). Rows are model size tiers (under 1B, 1-3B, 3-8B, 8-15B, 30B+). Cells are color-coded: green fits comfortably, yellow tight, gray does not fit. Each cell lists representative models. A dashed arrow runs from 8 GB to 32 GB labeled 'Apple Silicon unified memory moved the practical ceiling from ~7B (2023) to ~30B (2026).'
Figure 58.3.1: What runs on the device in your pocket and on your desk. The 2026 edge frontier is set by unified memory, not by discrete-GPU VRAM: a 32 GB MacBook Air comfortably runs 4-bit Qwen3-32B (a 4-bit 70B model needs roughly 35 GB for weights alone and does not fit), while a 16 GB phone still only handles 3B-class models.

58.3.1 MLX: Apple's tensor framework

Fun Fact

Apple's unified memory architecture turned out to be accidentally ideal for LLMs running on phones. The team that designed it in the early 2010s was optimizing for video encoding and Final Cut Pro; nobody at Apple was thinking about Llama-class models, because there were not yet Llama-class models to think about.

MLX is Apple's PyTorch-shaped tensor library, designed specifically for the unified-memory architecture of Apple Silicon (M1 through M4). On unified memory, CPU and GPU share the same physical RAM, which removes the host-to-device copies that discrete-GPU systems pay whenever weights or offloaded tensors cross the PCIe bus. MLX exposes lazy evaluation, function transforms (grad, vmap, compile), and a Python API that translates from PyTorch almost mechanically. The mlx-lm companion package implements the LLM-specific loading and generation paths.

58.3.2 Apple Intelligence Foundation Models

Apple shipped its Apple Intelligence Foundation Models in iOS 18 (late 2024) and expanded the family in iOS 19 (late 2025). The on-device model is ~3B parameters, runs entirely on the Neural Engine + GPU, and is exposed through a structured-generation API rather than chat. Apple's January 2026 ICLR paper details native LLM and MLLM inference at scale on Apple Silicon.

58.3.3 Llama-Mobile and the small-open frontier

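The "small open-weight model that runs on a phone" tier exploded in 2025. The entries this section returns to are Qwen3-0.6B, SmolLM2-360M, and Liquid LFM2.5-350M below one billion parameters, along with Phi-4-mini and BitNet b1.58 2B4T.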

58.3.4 Comparing the edge runtimes

Table 58.3.1a: Edge LLM runtimes, mid-2026.
Runtime | Target hardware | Format | Best for
MLX | Apple Silicon (M1-M4) | safetensors + custom | Mac / iPad / iPhone inference
llama.cpp | Any CPU, NVIDIA, AMD, Vulkan | GGUF | Lowest-common-denominator everywhere
Ollama | Wraps llama.cpp + MLX | GGUF + Apple Silicon | Easiest developer experience
Qualcomm AI Engine | Snapdragon NPU | Vendor format | Android phones with NPU
ONNX Runtime GenAI | Cross-platform | ONNX | Windows on Snapdragon, embedded
Apple Intelligence three-tier routing: on-device, Private Cloud Compute, partner cloud
Figure 58.3.2: Apple Intelligence's three-tier routing as deployed in iOS 19 (2026). The client-side classifier from Apple's ICLR 2026 paper holds 70-85% of routine queries (summarize this notification, rewrite this email, transcribe this voice memo) entirely on-device against the ~3B Foundation Model, escalates 10-25% of complex private queries to Private Cloud Compute (an attested MLX enclave inside Apple datacenters), and routes the remaining 3-7% to a partner frontier model (Apple-Anthropic Claude, Apple-OpenAI GPT) only after explicit per-query consent. The three tiers map to first-token latencies of ~200 ms, ~500 ms, and 1-2 s respectively, and to per-query costs of $0, internal-only, and per-million-token partner fees.
Key Insight: Unified memory rewrote the edge-inference math

On a 24 GB consumer GPU you can load a 4-bit 30B model (about 15 GB of weights), but the KV cache competes for what is left, little VRAM remains for anything else, and whatever spills over has to cross the PCIe bus to system RAM; a 4-bit 70B model (about 35 GB of weights) does not fit at all. On a 32 GB Apple Silicon laptop with unified memory, a 4-bit 30B model shares the same memory pool with the rest of your applications and the GPU side never has to copy in from a separate pool. The practical consequence is that "what model fits comfortably on a developer laptop" jumped from ~7B to ~30B between 2023 and 2026, mostly because Apple Silicon's unified memory became normal. The ACM Computing Surveys edge-LLM review documents this shift across architectures.
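The footprint arithmetic behind this insight can be sketched in a few lines. The weight formula (parameters times bits, divided by eight) is standard; the layer count, KV-head count, and head dimension below are a hypothetical 30B-class shape chosen for illustration, not the configuration of any specific model, and the KV cache is assumed to be stored in fp16.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1e9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache in GB: one key and one value vector per layer, KV head, and token."""
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


# Weight-only footprints for the model sizes discussed in this section.
for label, params, bits in [("7B fp16", 7, 16), ("30B 4-bit", 30, 4), ("70B 4-bit", 70, 4)]:
    print(f"{label}: {weights_gb(params, bits):.1f} GB of weights")

# Hypothetical 30B-class shape (an assumption, not a real model's config).
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128
for context in (4_096, 16_384, 32_768):
    total = weights_gb(30, 4) + kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, context)
    print(f"30B 4-bit with {context}-token context: {total:.1f} GB")
```

Activations and runtime overhead come on top of these totals, so the numbers are a lower bound on the real working set rather than a fit guarantee.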

Real-World Scenario: Apple Intelligence on-device routing

Apple's iOS 19 ships a routing layer that classifies user requests into three tiers: on-device (the ~3B Apple Foundation Model handles it locally on the Neural Engine), Private Cloud Compute (a hardened Apple-operated MLX inference fabric), or an Apple-Anthropic or Apple-OpenAI partner cloud (with explicit user consent). The routing happens client-side based on prompt complexity, privacy classification, and current battery level. The headline behavior: roughly 70-85% of routine queries (summarize this notification, rewrite this email, transcribe this voice memo) never leave the device, which means zero network latency and zero per-query cost to Apple. Apple's ICLR 2026 paper describes the routing classifier; the Foundation Models 2025 update covers the on-device model's training and distillation pipeline. This is the largest-scale deployment of edge LLMs in 2026 and a template for every consumer-device vendor that follows.
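To make the three-tier decision concrete, here is a minimal rule-based sketch that uses the three signals named above (prompt complexity, privacy classification, battery level) plus the consent requirement for partner models. It is not Apple's classifier: the `Request` fields, the `route` function, and every threshold are hypothetical values chosen only to illustrate the shape of the logic.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on-device"
    PRIVATE_CLOUD = "private-cloud-compute"
    PARTNER_CLOUD = "partner-cloud"


@dataclass
class Request:
    complexity: float             # hypothetical score: 0.0 trivial, 1.0 frontier reasoning
    is_private: bool              # True if the prompt touches personal data
    battery_level: float          # 0.0 (empty) to 1.0 (full)
    user_consented: bool = False  # explicit per-query consent for a partner model


def route(req: Request) -> Tier:
    """Pick a tier; all thresholds are illustrative assumptions."""
    # On a low battery, keep less work on the device.
    local_limit = 0.5 if req.battery_level >= 0.2 else 0.3
    if req.complexity <= local_limit:
        return Tier.ON_DEVICE
    # Private data, moderate difficulty, or missing consent stays inside
    # the vendor-operated cloud.
    if req.is_private or req.complexity <= 0.8 or not req.user_consented:
        return Tier.PRIVATE_CLOUD
    return Tier.PARTNER_CLOUD


print(route(Request(complexity=0.2, is_private=True, battery_level=0.9)))
print(route(Request(complexity=0.9, is_private=False, battery_level=0.9, user_consented=True)))
```

A production router would learn these boundaries from data instead of hard-coding them, but the ordering of the checks (local first, partner last and only with consent) follows the description above.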

Library Shortcut
Run a 4-bit Qwen3-8B on an M4 MacBook in three lines
from mlx_lm import load, generate
# Downloads the 4-bit weights from the Hugging Face Hub on first use.
model, tokenizer = load("mlx-community/Qwen3-8B-4bit")
print(generate(model, tokenizer, prompt="Explain MLX in one sentence.", max_tokens=128))
Code Fragment 58.3.1b: Loading a 4-bit Qwen3 checkpoint with mlx-lm and generating a short completion on Apple Silicon.

First-token latency on an M4 is roughly 200 ms; sustained tokens-per-second is 30 to 60 depending on the model. Compared to a cloud API call (typically 300-1500 ms first-token over the public internet), the local round trip wins almost every time for short interactions.
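How short "short" is depends on decode speed as well as first-token latency. The sketch below models response time as first-token latency plus decode time and solves for the break-even length. The local figures (200 ms, 30 to 60 tokens per second) and the cloud first-token range (300 to 1500 ms) come from the text above; the cloud decode rate of 100 tokens per second is an assumption made only for illustration.

```python
from typing import Optional


def response_time_ms(first_token_ms: float, decode_tokens: int,
                     tokens_per_s: float) -> float:
    """First-token latency plus the time to decode the remaining tokens."""
    return first_token_ms + 1000.0 * decode_tokens / tokens_per_s


def break_even_tokens(local_ttft_ms: float, local_tps: float,
                      cloud_ttft_ms: float, cloud_tps: float) -> Optional[float]:
    """Decode length at which the cloud catches up with the local model.

    Returns None when the local model decodes at least as fast as the cloud,
    in which case the local path never falls behind.
    """
    if local_tps >= cloud_tps:
        return None
    per_token_gap_ms = 1000.0 / local_tps - 1000.0 / cloud_tps
    return max(0.0, (cloud_ttft_ms - local_ttft_ms) / per_token_gap_ms)


CLOUD_TPS = 100.0  # assumed cloud decode rate, not taken from the text
for local_tps in (30.0, 60.0):
    for cloud_ttft in (300.0, 1500.0):
        n = break_even_tokens(200.0, local_tps, cloud_ttft, CLOUD_TPS)
        print(f"local {local_tps:.0f} tok/s vs cloud TTFT {cloud_ttft:.0f} ms: "
              f"break-even at about {n:.0f} tokens")
```

Under these assumptions a 60 tokens-per-second local model stays ahead for roughly 15 tokens against the best-case cloud call and roughly 195 tokens against the slow one, which is why the advantage is strongest for brief replies.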

Tip: pick edge for latency, cloud for capability

The 2026 line for "is this edge-suitable" is roughly: chat that fits in one screen of text, structured extraction up to ~50 fields, simple coding completions, on-device summarization, and grammar cleanup for dictation. Anything that benefits from frontier reasoning (math proofs, multi-hop research, complex code synthesis) still wants the cloud. Treat edge models as the cheapest, fastest tier of the same hierarchy that already places Gemini Flash and Claude Haiku below the flagships.

Tip: quantization-aware training matters at the edge

Post-training quantization (PTQ) is fine at 8-bit; at 4-bit it loses 1-3 perplexity points on demanding tasks, and at 2-bit or 1.58-bit it loses meaningfully more. Quantization-aware training (QAT), where the model trains with simulated quantization noise, recovers most of that gap. SmolLM2 and Phi-4-mini are QAT-trained from the start; BitNet b1.58 2B4T extends this all the way to 1.58-bit ternary weights. If you are deploying to an edge device with tight memory, prefer a model that was QAT-trained over one that was post-quantized after an fp16 pretrain. The BitNet inference framework is the production-grade reference for 1.58-bit serving.
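The loss at each bit width can be seen on synthetic weights. The sketch below applies symmetric round-to-nearest quantization (a simple form of PTQ) at 8 and 4 bits, plus an absmean ternary quantizer in the style described for BitNet b1.58, and reports the reconstruction error. The Gaussian weights, the per-tensor scaling, and the function names are assumptions for illustration; real quantizers typically work per group or per channel. In QAT, the same quantize-then-dequantize step runs inside the forward pass during training so the model learns weights that tolerate it.

```python
import random


def fake_quantize(weights, bits):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax      # one scale for the whole tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def ternary_quantize(weights):
    """Absmean ternary quantization: every weight becomes -s, 0, or +s."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / scale))) * scale for w in weights]


def mse(original, quantized):
    """Mean squared reconstruction error."""
    return sum((x - y) ** 2 for x, y in zip(original, quantized)) / len(original)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, assumed distribution

errors = {
    "8-bit": mse(weights, fake_quantize(weights, 8)),
    "4-bit": mse(weights, fake_quantize(weights, 4)),
    "ternary": mse(weights, ternary_quantize(weights)),
}
for name, err in errors.items():
    print(f"{name}: mse = {err:.3e}")
```

Reconstruction error on weights is only a proxy for perplexity, but the ordering it shows (8-bit smallest, ternary largest) is the reason training with the quantizer in the loop matters more as the bit width shrinks.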

Research Frontier: Looking ahead

The unresolved edge-LLM question is whether MoE routing can be made energy-efficient on heterogeneous edge hardware. Pure-dense small models (SmolLM2, Qwen3-0.6B) work well today; sparse-MoE small models would in principle deliver more capability per active parameter but the routing overhead on NPUs is currently prohibitive. Section 58.4 turns to the kernels that drive inference latency on the cloud side, especially FlashAttention-4's adaptation to Blackwell's asymmetric SMs.

Self-Check
Q1: A 7B model in fp16 is 14 GB. Why can a 32 GB MacBook Air run a 4-bit 30B model when a 24 GB discrete GPU sometimes cannot?
A1: A 4-bit 30B model is roughly 15 GB of weights, plus several gigabytes of KV cache and activations during decoding, so total working set lands in the 17 to 22 GB range. The MacBook Air uses unified memory: the same 32 GB pool serves the OS, applications, GPU, and Neural Engine, and the model is mapped directly into that pool without copy overhead. A discrete GPU's 24 GB of VRAM is a dedicated, separate address space; once weights plus KV cache exceed that, the runtime must offload tensors to system RAM over a PCIe bus that is one to two orders of magnitude slower than on-card VRAM, killing token throughput. Unified memory wins not because the total RAM is larger but because there is no fast/slow boundary to cross.
Q2: When should you push a query from on-device to a cloud API?
A2: Push to cloud when the task exceeds what the local model can deliver at acceptable quality, in three concrete cases. First, frontier reasoning: multi-step math, long-horizon planning, novel-domain synthesis, anything where a 3B to 8B on-device model would predictably hallucinate. Second, calibration-based routing: the local model returns a confidence or self-rated quality score below a threshold, so the orchestrator escalates to GPT-4o or Gemini 2.5 Pro rather than ship a low-quality answer. Third, content that needs fresh data the local model cannot have (current events, time-sensitive corpora). Apple's iOS 19 Private Cloud Compute is the production reference: most queries stay on-device for latency and privacy, and only the queries the on-device model flags as too hard are encrypted and forwarded.
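The calibration-based case in the answer to Q2 can be sketched as a thin wrapper around two model calls. Everything here is illustrative: `answer_with_escalation`, the 0.7 threshold, and the stub models are hypothetical stand-ins, and the word-count confidence heuristic exists only so the example runs without a real model.

```python
from typing import Callable, Tuple

LocalModel = Callable[[str], Tuple[str, float]]  # returns (answer, confidence in [0, 1])
CloudModel = Callable[[str], str]


def answer_with_escalation(prompt: str, local_model: LocalModel,
                           cloud_model: CloudModel,
                           threshold: float = 0.7) -> Tuple[str, str]:
    """Return (answer, tier); escalate only when local confidence is below threshold."""
    answer, confidence = local_model(prompt)
    if confidence >= threshold:
        return answer, "on-device"
    return cloud_model(prompt), "cloud"


# Stub models standing in for real inference calls.
def stub_local(prompt: str) -> Tuple[str, float]:
    # Toy heuristic: pretend short prompts are easy and long ones are hard.
    confidence = 0.9 if len(prompt.split()) <= 8 else 0.4
    return f"local answer to: {prompt}", confidence


def stub_cloud(prompt: str) -> str:
    return f"cloud answer to: {prompt}"


print(answer_with_escalation("Summarize this notification", stub_local, stub_cloud))
print(answer_with_escalation(
    "Prove that the sum of the first n odd numbers equals n squared and explain why",
    stub_local, stub_cloud))
```

The design choice worth noting is that the local model always runs first, so the common case pays no network cost and the cloud call is an exception path.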

What's Next?

In the next section, Section 58.4: FlashAttention-4 and Inference Kernels for Blackwell, we build on the material covered here.

Further Reading
Apple Machine Learning Research, "Apple Foundation Models 2025 Updates" (Apple).
Apple Machine Learning Research, "ICLR 2026 Apple paper": on-device LLM/MLLM inference at scale.
ml-explore, "MLX" and "mlx-lm" (Apple, open-source).
Allal et al., "SmolLM2" (Hugging Face, 2024-2025).
Microsoft Research, "BitNet b1.58 inference framework" (1.58-bit ternary models).
ACM Computing Surveys, "Edge-LLM Survey" (2025).