Section 58.2: Decentralized Training: Nous Psyche, DeMo, DisTrO

"The bandwidth wall is real, but it is also negotiable if you are willing to send less data."
Bowen Peng et al., DeMo: Decoupled Momentum Optimization, 2024

Note: Learning Objectives

Explain why synchronous data-parallel training requires hyperscaler-class interconnect, and what specifically DeMo compresses to break that requirement.
Walk through one DeMo optimizer step end-to-end (local momentum, top-k sparsification, async sync, global merge).
Compare the bandwidth profiles of DDP, FSDP, DiLoCo, and DeMo on a single common workload.
Assess the cryptographic-attestation threat model for Nous Psyche and identify which adversaries it does and does not defend against.

Big Picture

For ten years, "frontier model training" implied a co-located GPU datacenter. DeMo and DisTrO attack the underlying bandwidth math directly. If decentralized training crosses the GPT-4 quality threshold by 2027, the economics of who can train frontier models change from "five hyperscalers" to "anyone willing to coordinate hundreds of consumer-grade GPUs". This section walks through the algorithm, the framework, and the live network.

Prerequisites

This section assumes familiarity with frontier accelerators from Section 58.1 and with cross-hardware benchmarking from Section 57.4. Familiarity with pretraining at scale from Section 6.1 helps when reading the gradient-compression math.

Frontier-model training has historically required co-located GPUs with InfiniBand or NVLink fabrics, because synchronous data-parallel training pushes gradients across the network at every step. The bandwidth cost is enormous: a 70B model in fp16 emits 140 GB of gradients per optimizer step (70 billion values at 2 bytes each), and at inter-node bandwidths below tens of GB/s the transfer alone takes many seconds per step, which caps training throughput. That arithmetic, more than anything else, is why "you can only train a frontier model in a hyperscaler datacenter" was conventional wisdom through 2024.
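The bandwidth argument above can be checked with a few lines of arithmetic. The sketch below computes the dense gradient size and a lower bound on transfer time (payload divided by link speed). The three link speeds are illustrative assumptions, not measurements, and the model ignores latency, protocol overhead, and any overlap of communication with compute.

```python
GB = 1e9

def gradient_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of one dense gradient: one value per parameter."""
    return num_params * bytes_per_param

def sync_seconds(payload_bytes: float, link_bytes_per_s: float) -> float:
    """Lower bound on transfer time: payload divided by link bandwidth."""
    return payload_bytes / link_bytes_per_s

dense = gradient_bytes(70e9)  # 70B parameters in fp16 (2 bytes each)

# Illustrative link speeds (assumptions chosen for this example).
links = {
    "datacenter fabric, 50 GB/s": 50 * GB,
    "10 Gbit/s Ethernet, 1.25 GB/s": 1.25 * GB,
    "100 Mbit/s home uplink, 12.5 MB/s": 12.5e6,
}

print(f"dense gradient: {dense / GB:.0f} GB")
for name, bandwidth in links.items():
    print(f"{name}: {sync_seconds(dense, bandwidth):,.1f} s per step")
```

Under these assumptions the same payload that moves in a few seconds on a datacenter fabric takes hours on a home uplink, which is the gap the rest of this section is about.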

The decentralized-training thread of 2025-26 attacks exactly that bandwidth cost. DeMo (Decoupled Momentum Optimization) and the DisTrO toolkit it grew out of demonstrated that gradient communication could be compressed by 1000 to 10,000x with a measurable but small quality cost; Nous Research's Psyche wrapped that algorithm in a coordination layer on Solana and started training models over the public internet. Whether this scales to GPT-4-class checkpoints is one of the open questions of 2027.

Vertical log-scale bar chart of gradient bytes per optimizer step for a 70B fp16 model. DDP all-reduce ~280 GB; FSDP ~140 GB; ZeRO-3 ~70 GB; DeMo v2 top-1% sparse ~0.14 GB; DisTrO async ~0.04 GB. A dashed red arrow spans the ZeRO-3-to-DeMo gap with a callout reading 500-1000x compression. **Figure 58.2.1**: Per-step gradient bytes for a 70B model. Classical data-parallel pushes 70-280 GB per step (NVLink/InfiniBand only); DeMo's top-1% sparsification collapses that to ~140 MB, which fits inside an ordinary residential uplink. The 500-1000x gap is the entire reason decentralized training became viable in 2025.
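The sparse payload sizes in the figure can be related to a keep fraction with simple arithmetic. The sketch below is a back-of-the-envelope model, not DeMo's actual wire format: it assumes fp16 values and, optionally, a hypothetical 4-byte position per kept entry. The function names are illustrative.

```python
GB = 1e9
MB = 1e6

def sparse_payload_bytes(num_params, keep_fraction, value_bytes=2, index_bytes=0):
    """Bytes sent when only a fraction of entries is kept.

    index_bytes is the per-entry cost of saying which entry was kept;
    0 models a scheme where positions are implied or negligible.
    """
    kept_entries = num_params * keep_fraction
    return kept_entries * (value_bytes + index_bytes)

def keep_fraction_for(target_bytes, num_params, value_bytes=2, index_bytes=0):
    """Keep fraction that yields target_bytes under the same model."""
    return target_bytes / (num_params * (value_bytes + index_bytes))

N = 70e9  # parameters

values_only = sparse_payload_bytes(N, 0.01)                  # fp16 values only
with_indices = sparse_payload_bytes(N, 0.01, index_bytes=4)  # plus 4-byte positions
needed = keep_fraction_for(140 * MB, N)

print(f"top-1%, values only:  {values_only / GB:.2f} GB")
print(f"top-1%, with indices: {with_indices / GB:.2f} GB")
print(f"keep fraction for 140 MB, values only: {needed:.4%}")
```

Under this simple model, keeping 1% of fp16 values costs about 1.4 GB, so the ~140 MB figure implies roughly a further 10x reduction, for example a keep fraction near 0.1% or a more compact value encoding. The section does not say which applies, so treat the mapping between "top-1%" and "140 MB" as approximate.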

58.2.1 DeMo: the algorithmic core

The DeMo paper (Peng, Quesnelle, and Kingma, 2024) introduces an optimizer that decouples momentum across workers: each worker keeps its own local momentum buffer, and only the fast-moving components of that buffer are synchronized. Because the exchange is limited to a sparse, top-k subset of components per step, DeMo cuts inter-worker bandwidth by roughly three orders of magnitude. The October 2025 v2 of the paper closed the quality gap further and added formal convergence guarantees. The reference implementation lives in the DisTrO repository.
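One step of the scheme described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the reference implementation: it selects components by raw magnitude, whereas the DeMo paper selects them in a transformed (DCT) domain, and the names `demo_local_step`, `merge`, and `apply_update` are hypothetical.

```python
def demo_local_step(momentum, grad, beta, k):
    """One worker's local part of a DeMo-style step.

    Returns (residual_momentum, sparse_update), where sparse_update
    maps parameter index -> value to transmit.
    """
    # 1. Accumulate the fresh gradient into the local momentum buffer.
    m = [beta * mi + gi for mi, gi in zip(momentum, grad)]
    # 2. Select the k largest-magnitude components.
    top = sorted(range(len(m)), key=lambda i: abs(m[i]), reverse=True)[:k]
    sparse = {i: m[i] for i in top}
    # 3. Remove what is transmitted; everything else stays local.
    for i in top:
        m[i] = 0.0
    return m, sparse

def merge(sparse_updates, size):
    """Average the workers' sparse updates into one dense update."""
    out = [0.0] * size
    for sparse in sparse_updates:
        for i, v in sparse.items():
            out[i] += v / len(sparse_updates)
    return out

def apply_update(params, update, lr):
    """Plain gradient-descent step using the merged update."""
    return [p - lr * u for p, u in zip(params, update)]

# Two workers, four parameters, k=2 components sent per worker.
worker_grads = [[0.1, -2.0, 0.3, 1.5], [0.2, -1.0, 2.0, 0.1]]
momenta = [[0.0] * 4 for _ in worker_grads]
params = [1.0, 1.0, 1.0, 1.0]

sparse_updates = []
for w, g in enumerate(worker_grads):
    momenta[w], sparse = demo_local_step(momenta[w], g, beta=0.9, k=2)
    sparse_updates.append(sparse)

update = merge(sparse_updates, size=4)
params = apply_update(params, update, lr=0.1)
print("merged update:", update)
print("residual momentum per worker:", momenta)
```

The small components (index 0 on both workers here) are never sent in this step; they remain in each worker's momentum buffer and can be sent later once they grow.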

58.2.2 DisTrO: the practical toolkit

DisTrO (Distributed Training Over-the-Internet) is the open-source framework that operationalizes DeMo. It handles fault tolerance, asynchronous synchronization, dynamic worker join/leave, and the gradient-compression bookkeeping needed for DeMo to work at scale. As of mid-2026 it has been used to train models up to ~10B parameters across heterogeneous hardware fleets spanning consumer GPUs and rented cloud H100s.
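To make "dynamic worker join/leave" concrete, here is a minimal sketch of a round that merges whatever sparse updates arrive before a deadline and carries on without the stragglers. It is not DisTrO's API: the `Submission` type, `merge_round`, the deadline policy, and the `min_workers` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    worker_id: str
    arrival_s: float   # seconds after the round opened
    sparse: dict       # parameter index -> value

def merge_round(submissions, deadline_s, min_workers=1):
    """Average the sparse updates that arrived by the deadline.

    Returns (merged, contributors). If too few workers made the
    deadline, returns (None, []) so the caller can skip the round.
    """
    on_time = [s for s in submissions if s.arrival_s <= deadline_s]
    if len(on_time) < min_workers:
        return None, []
    merged = {}
    for s in on_time:
        for i, v in s.sparse.items():
            merged[i] = merged.get(i, 0.0) + v / len(on_time)
    return merged, [s.worker_id for s in on_time]

round_inputs = [
    Submission("h100-cloud", 0.8, {0: 1.0, 3: -2.0}),
    Submission("rtx4090-home", 2.5, {0: 3.0, 7: 4.0}),
    Submission("a100-campus", 9.0, {1: 5.0}),  # misses the deadline
]

merged, contributors = merge_round(round_inputs, deadline_s=5.0, min_workers=2)
print(contributors, merged)
```

Averaging over the workers that actually showed up, rather than over a fixed fleet size, is one simple way to keep the update scale stable as participants come and go; a real system would also need to decide what happens to the late worker's update.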

58.2.3 Nous Psyche: training over the internet

Nous Psyche Network (launched January 2025) is the most ambitious public demonstration of this stack. Built on Solana for coordination, Psyche allows anyone with a GPU to contribute to a public training run, with cryptographic attestation of work and on-chain checkpoint commits. The first Psyche runs trained 1B-class models across hundreds of heterogeneous participants; the network is now moving on to larger checkpoints. Chakra Research's overview is the best entry point for the architecture.
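A checkpoint commit of the kind described above usually means publishing a short digest of the checkpoint rather than the weights themselves. The sketch below shows that idea with SHA-256 from the Python standard library. The serialization (little-endian step counter followed by float32 parameters) and the function name are hypothetical; Psyche's actual commitment format is not described in this section, and no Solana call is shown.

```python
import hashlib
import struct

def checkpoint_digest(params, step):
    """Deterministic SHA-256 digest of (step, parameters).

    Any participant holding the same checkpoint can recompute the
    digest and compare it with the committed value.
    """
    h = hashlib.sha256()
    h.update(struct.pack("<Q", step))       # 8-byte unsigned step counter
    for p in params:
        h.update(struct.pack("<f", p))      # each parameter as float32
    return h.hexdigest()

params = [0.5, -1.25, 3.0]
committed = checkpoint_digest(params, step=1000)

# A verifier recomputes the digest from its own copy of the checkpoint.
same = checkpoint_digest([0.5, -1.25, 3.0], step=1000)
tampered = checkpoint_digest([0.5, -1.25, 3.5], step=1000)

print(committed)
print("matches honest copy:", same == committed)
print("matches tampered copy:", tampered == committed)
```

A digest like this only proves that two parties hold the same bytes; it says nothing about whether those bytes came from honest training, which is the job of the attestation scheme discussed later in the section.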

58.2.4 Comparing the decentralized-training stack

Table 58.2.1a: Decentralized-training components, mid-2026.

Component	Layer	Role	Status
DeMo v2	Algorithm	Sparse gradient compression	Published, reference impl
DisTrO	Framework	Async coordination + fault tolerance	Open-source, used in production
Nous Psyche	Network	Solana coordination, public participation	Live since Jan 2025
Hivemind	Framework	Predecessor decentralized library	Older, less optimized
Bittensor	Network	Incentive-based ML market	Active, broader than just training

Sequence diagram of one DeMo optimizer step across a heterogeneous worker fleet (cloud H100, university A100, hobbyist RTX 4090, rented L40S). Three boxes show step 1 local momentum and top-k extract producing roughly 140 MB sparse vectors, step 2 sparse sync with Solana attestation, and step 3 global merge into local parameters. A second panel compares bandwidth per step at 70B parameters: DDP at 140 GB, FSDP at 70 GB, DiLoCo at 14 GB, DeMo at 140 MB, bars to scale. **Figure 58.2.2**: One DeMo optimizer step across heterogeneous workers. Top-k sparsification of the local momentum buffer reduces per-step bandwidth from 140 GB to roughly 140 MB; Solana attestation replaces the InfiniBand trust model.

Key Insight

Mental Model: DeMo is gradient compression as steganography

Think of a 70B model's gradient as a 140 GB high-dimensional vector and DeMo's top-1% extraction as embedding that vector's signal into a 140 MB shadow. The bulk of the bytes are noise (or near-zero entries that decay fast); the signal lives in the largest-magnitude components, almost the way a steganographic encoder hides a message in the high-frequency components of an image. Local momentum at each worker keeps the discarded-but-not-zero pieces alive, so over many steps the system still "sees" their contribution. The bounded-information claim holds empirically for transformers up to mid-scale; trillion-parameter behavior is still open. Until a fully decentralized run crosses the GPT-4 quality threshold, this remains the question that gates the whole thread.
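The claim that local momentum keeps the discarded pieces alive can be seen in a toy run. The sketch below is an illustration under simplifying assumptions: one worker, two coordinates, one component sent per step, and `beta=1.0` so that nothing decays and the bookkeeping is exact. With a real momentum coefficient below 1, old residuals are discounted rather than fully preserved. The function name is hypothetical.

```python
def run_with_residual(grads_per_step, k, beta=1.0):
    """Send the top-k of the local buffer each step; keep the rest.

    Returns (sent_total, residual): the cumulative value transmitted
    per coordinate and what is still waiting in the local buffer.
    """
    size = len(grads_per_step[0])
    residual = [0.0] * size
    sent_total = [0.0] * size
    for g in grads_per_step:
        m = [beta * r + gi for r, gi in zip(residual, g)]
        top = sorted(range(size), key=lambda i: abs(m[i]), reverse=True)[:k]
        for i in top:
            sent_total[i] += m[i]
            m[i] = 0.0
        residual = m
    return sent_total, residual

# Coordinate 0 is large every step; coordinate 1 is small but persistent.
steps = [[1.0, 0.25]] * 8
sent_total, residual = run_with_residual(steps, k=1)

grad_sum = [sum(g[i] for g in steps) for i in range(2)]
print("sent:    ", sent_total)
print("residual:", residual)
print("sum of gradients:", grad_sum)
```

The small coordinate is skipped for several steps, accumulates locally, and is eventually large enough to win the top-k selection. With `beta=1.0`, transmitted plus residual equals the sum of all gradients, so nothing is lost, only delayed.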

Real-World Scenario: Nous Psyche's first public run (2025)

The first end-to-end public Psyche run trained a 1B-parameter model across roughly 300 heterogeneous participants from January through April 2025. Participating hardware ranged from a cloud-rented H100 cluster (one institutional contributor) to consumer RTX 4090s in apartments. The run committed checkpoints every 1000 steps to Solana; total contributed compute reached approximately 1.2e21 FLOPs (about an H100-month). The resulting model trailed centrally-trained 1B baselines by roughly 1.5 perplexity points on FineWeb-Edu, the closest controlled comparison the field has. The takeaway was threefold: decentralized training works at 1B scale, the quality gap is real but small, and bandwidth is no longer the binding constraint. The 7B Psyche run that started in Q4 2025 is the next data point. Folding@home remains the closest cultural analogue.
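The compute figure in the scenario can be sanity-checked with two standard rules of thumb. Both inputs below are assumptions rather than facts from the run: a round 1e15 FLOP/s peak for a dense 16-bit H100 workload at 40% sustained utilization, and the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month

def gpu_months(total_flops, peak_flops_per_s, utilization):
    """GPU-months needed to deliver total_flops at a sustained utilization."""
    sustained = peak_flops_per_s * utilization
    return total_flops / (sustained * SECONDS_PER_MONTH)

def tokens_seen(total_flops, num_params):
    """Token count implied by the 6 * N * D training-cost approximation."""
    return total_flops / (6 * num_params)

total = 1.2e21  # FLOPs quoted for the run

months = gpu_months(total, peak_flops_per_s=1e15, utilization=0.4)
tokens = tokens_seen(total, num_params=1e9)

print(f"H100-months at 40% utilization: {months:.2f}")
print(f"implied training tokens: {tokens:.2e}")
```

Under these assumptions the quoted compute is a little over one H100-month, consistent with the text, and corresponds to roughly 2e11 training tokens for a 1B-parameter model. The section does not report the actual token count.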

Warning

Decentralized training does not yet match centralized at frontier scale

The 1B Psyche run showed feasibility; it did not show parity with GPT-4-class capabilities. Two gaps compound on the quality side: (1) at trillion-parameter scale, the fraction of gradient mass living in top-k sparse components has not been empirically validated, and (2) post-training (RLHF, DPO, GRPO) is gradient-noise-sensitive in ways pretraining is not, so the same compression ratio may not transfer. Security is the third gap: Psyche's current attestation scheme is robust against accidental errors but not against coordinated adversaries. Recent work on SNARKs for transformer inference is the natural next building block; until it is integrated, treat the security model as "trust most participants, audit periodically".

Fun Note

DeMo's "decoupled momentum" idea has a precedent in Lin et al.'s 2017 Deep Gradient Compression, which proposed a closely related top-k sparsification with momentum correction for mobile training, eight years before frontier LLMs needed it. The 2025 contribution was scale and engineering rigor; the algorithmic kernel is older than most of the engineers using it.

Nous Psyche timeline: 1B run to 7B run with key milestones — **Figure 58.2.3**: The decentralized-training arc from the Folding@home cultural ancestor (2000) through Lin et al.'s 2017 Deep Gradient Compression (the top-k sparsification idea), the DeepMind DiLoCo line (2023), and the Peng-Kingma DeMo paper (November 2024, arXiv:2411.19870) that scaled the kernel to LLMs at 500-1000x compression. The Nous Psyche network's January 2025 launch trained a 1B model with ~300 nodes and landed within 1.5 perplexity of a centralized baseline. The Q4 2025 Psyche 7B run is in flight; the 2027 open question is whether the next run crosses the GPT-4 quality threshold and whether SNARK-based attestation closes the adversarial gap the warning in this section calls out.

58.2.5 What 2027 has to settle

The hard open question is whether a fully decentralized run can produce a checkpoint competitive with a centralized hyperscaler run. DeMo v2 closes the bandwidth gap; what is left is whether the quality penalty from sparsification and async noise stays small at GPT-4+ scale. The DeepSeek and Qwen open-weight models gave the field a clear quality benchmark; Psyche v2 (announced for late 2026) is the test case. Section 58.3 turns to the opposite end of the deployment spectrum: what runs on the device in your pocket.

Research Frontier

Three frontier questions remain: (1) does top-1% gradient sparsification preserve quality at 100B+ parameters, where the loss landscape may have different curvature; (2) can RLHF and GRPO post-training tolerate the same compression ratio (early signals suggest no); (3) can cryptographic attestation become non-interactive so adversarial gradient injection is provably bounded? DeMo v2, the DiLoCo line, and the Psyche v2 release are the next data points.

Key Takeaways

Synchronous data-parallel training moves 70-280 GB of gradient traffic per step at frontier scale; that is the hyperscaler-fabric requirement.
DeMo's top-k sparsification compresses that 500-1000x to roughly 140 MB per step, fitting inside residential uplinks.
Nous Psyche's first 1B run trailed centrally-trained baselines by ~1.5 perplexity points, demonstrating viability but not yet parity.
The remaining open questions are quality at trillion-parameter scale, post-training (RLHF/GRPO) compression tolerance, and adversarially-robust cryptographic attestation.

Self-Check

Q1: Why does ZeRO-3 still emit ~70 GB of gradient traffic per step, while DeMo emits ~140 MB on the same 70B model?

Show Answer

ZeRO-3 partitions optimizer state, gradients, and parameters across workers to fit large models in aggregate memory, but the backward pass still reduce-scatters full fp16 or bf16 gradients across the cluster so that every worker ends up with the reduced slice it owns. Per-step gradient traffic therefore scales with the parameter count times the bytes communicated per parameter, which for a 70B model is tens of gigabytes (the ~70 GB quoted here corresponds to about one byte per parameter; a full fp16 gradient is 140 GB). DeMo (Decoupled Momentum Optimization, Peng, Quesnelle, and Kingma 2024) replaces the dense exchange with a top-1% sparse exchange of momentum components, exploiting the empirical observation that most gradient mass concentrates in a small fraction of coordinates. The result is ~140 MB per step rather than ~70 GB, a 500x reduction that makes training over public internet links practical for distributed runs like Nous Psyche.

Q2: What threat does Psyche's computational-attestation scheme NOT defend against?

Show Answer

Psyche's attestation relies on random re-execution: a fraction of submitted gradient updates is recomputed on independent verifier nodes, and inconsistent results disqualify the submitter. This catches accidental hardware errors and unilateral malicious submissions because re-execution will diverge from a tampered result. It does not defend against a coordinated Sybil adversary who controls many identities and submits consistently-wrong gradients (or who controls a sizeable share of the verifier pool itself); since every replica produces the same poisoned output, re-execution agrees and the attack passes attestation. The standard defenses against that threat are stake-weighting, identity-binding (KYC for compute providers), or a small trusted-validator set, none of which random re-execution provides on its own.
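The limits of random re-execution can be put into numbers with a small probability model. The sketch below is a simplification made for illustration, not a description of Psyche's actual parameters: each submission is audited independently with probability `audit_rate`, the auditor is drawn uniformly from the verifier pool, and a colluding verifier always confirms a bad result.

```python
def detection_probability(audit_rate, n_bad_submissions,
                          colluding_verifier_share=0.0):
    """Chance that at least one of n bad submissions is caught.

    A bad submission is caught only if it is audited AND the audit
    lands on an honest verifier.
    """
    p_caught_once = audit_rate * (1.0 - colluding_verifier_share)
    return 1.0 - (1.0 - p_caught_once) ** n_bad_submissions

lone_cheater = detection_probability(0.05, 100)
half_pool = detection_probability(0.05, 100, colluding_verifier_share=0.5)
full_pool = detection_probability(0.05, 100, colluding_verifier_share=1.0)

print(f"lone cheater, 5% audits, 100 bad updates: {lone_cheater:.3f}")
print(f"adversary holds half the verifiers:       {half_pool:.3f}")
print(f"adversary holds all the verifiers:        {full_pool:.3f}")
```

In this model a lone cheater is caught almost surely over many submissions even at a low audit rate, while detection degrades as the adversary's share of the verifier pool grows and vanishes if that share reaches one. This is why the answer above points to stake-weighting, identity-binding, or a trusted-validator set as the complementary defenses.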

What's Next

Decentralized training pushes the frontier to commodity hardware; the next section pushes it onto the device in your pocket. Continue to Section 58.3: Edge LLMs: MLX, Apple Intelligence, Llama-Mobile.

Further Reading

Peng, Quesnelle & Kingma, "DeMo: Decoupled Momentum Optimization" (Nous Research, Nov 2024; v2 Oct 2025).

Douillard et al., "DiLoCo: Distributed Low-Communication Training" (DeepMind, 2023; relevant baseline).

Nous Research, "Nous Psyche Network" (launch, Jan 2025).

Chakra Research, "The Third Epoch of AI" (decentralized training overview).

Lin et al., "Deep Gradient Compression" (2017): the algorithmic ancestor.

"Cryptographic SNARKs for Transformer Inference" (2025): the verified-ML primitive Psyche needs.