
Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Beginner
Jinrui Zhang, Chaodong Xiao, Aoqi Wu et al. · 2/12/2026
arXiv

Key Summary

  • Training big language models usually needs super-expensive, tightly connected GPU clusters, which most people do not have.
  • This paper introduces SPES, a new way to train Mixture-of-Experts (MoE) models across many separate, ordinary GPUs on the internet.
  • Each computer trains only a few 'experts' instead of the whole model, which saves a lot of memory and upload bandwidth.
  • SPES synchronizes only the pieces that changed, not the entire model, so slow networks are no longer a deal-breaker.
  • A warm-up trick called expert merging lets experts share knowledge early so they learn faster and don’t get stuck.
  • With 16 standalone 48GB GPUs over the internet, SPES trained a 2B-parameter MoE model that performs like centrally trained models with similar compute.
  • SPES also scales to a 7B model from scratch and a 9B model built by upcycling a dense checkpoint, matching prior centralized baselines.
  • Compared to other decentralized training (like DiLoCo), SPES cuts per-round communication by up to one-third and keeps per-GPU memory under control.
  • This approach makes large-model pretraining more affordable and reachable for labs, schools, and small companies.
  • Code and models are released to help the community build on decentralized and memory-efficient training.

Why This Research Matters

This work lowers the hardware and network barriers to training strong language models, so more schools, labs, and startups can contribute. By cutting memory and bandwidth needs, it lets people use ordinary 48GB GPUs connected over typical Ethernet instead of elite superclusters. That broader access means faster, fairer progress as more groups can test ideas and share results. It can help regions with limited infrastructure build local models that fit their languages and needs. It may also reduce costs and energy use compared to always relying on huge centralized clusters. Altogether, SPES makes serious AI training more like a team sport that anyone with decent gear can join.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how a school play needs a big stage, bright lights, and lots of helpers to run smoothly? Training a huge language model is like that—it usually needs fancy, expensive gear working together in one place.

🥬 Filling (The Actual Concept):

  • What it is: Before this paper, training large language models (LLMs) mostly happened on centralized super-clusters with powerful, high-memory GPUs and ultra-fast network cables.
  • How it works: 1) Split data or the model across many nearby GPUs, 2) move lots of information back and forth between them very quickly every step, 3) update every parameter with full optimizer states and gradients, and 4) repeat for trillions of tokens.
  • Why it matters: Without the fancy hardware and fast links, training slows down or runs out of memory.

🍞 Bottom Bread (Anchor): Big models like LLaMA-3 were trained on up to 16,000 top-tier GPUs connected with super-fast fabric—way out of reach for most schools or small labs.

🍞 Top Bread (Hook): Imagine doing a group project where everyone must copy the entire textbook and all notes, even if they only need one chapter. That’s wasteful!

🥬 Filling (The Actual Concept):

  • What it is: Gradient updates are the tiny nudges that adjust a model to make better predictions each step.
  • How it works: 1) Compare the model’s guess to the correct answer, 2) measure the error, 3) compute the gradient (which direction to change), 4) take a small step to reduce that error.
  • Why it matters: These steps must be stored and applied carefully to learn well.

🍞 Bottom Bread (Anchor): Like practicing free throws: miss left, aim a bit more right next time; the gradient is that “aiming nudge.”
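That "aiming nudge" fits in a few lines of plain Python. Here is a minimal sketch with one weight and a squared-error loss (real training does this for billions of weights at once):

```python
def gradient_step(w, x, y, lr=0.1):
    """One gradient-descent step for a one-parameter model y_hat = w * x.

    Loss is squared error (y_hat - y)**2, so dLoss/dw = 2 * (w*x - y) * x.
    """
    pred = w * x              # 1) the model's guess
    error = pred - y          # 2) measure the error
    grad = 2 * error * x      # 3) gradient: which direction to change
    return w - lr * grad      # 4) small step that reduces the error

# True relationship is y = 3x; start from a bad guess and nudge repeatedly.
w = 0.0
for _ in range(50):
    w = gradient_step(w, x=1.0, y=3.0)
# w ends up very close to 3.0: fifty small "aiming nudges" find the answer.
```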

🍞 Top Bread (Hook): A good coach doesn’t just remember the last play; they track trends to improve the team.

🥬 Filling (The Actual Concept):

  • What it is: Optimizer states are the coach’s notes—extra memory the optimizer keeps (like momentum) to make smarter updates.
  • How it works: 1) Store running averages of past gradients, 2) adjust learning steps based on those averages, 3) update weights more stably.
  • Why it matters: These states can use more memory than the model itself during training.

🍞 Bottom Bread (Anchor): AdamW’s notes can take up to three times the model’s parameter size, like bringing not just your backpack but three extra bags to class.
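A back-of-envelope estimate shows why the coach's notes dominate. This sketch assumes fp32 for everything and ignores activations, so treat the numbers as rough:

```python
def training_memory_gb(n_params, bytes_per_value=4):
    """Back-of-envelope training memory, assuming fp32 everywhere.

    AdamW stores two extra values per parameter (momentum and variance);
    add the gradients and the training state is ~3x the weights alone.
    Activations and buffers are ignored to keep the estimate simple.
    """
    weights = n_params * bytes_per_value
    grads = n_params * bytes_per_value
    adamw_states = 2 * n_params * bytes_per_value  # momentum + variance
    return (weights + grads + adamw_states) / 1e9

# A 2B-parameter model: ~8 GB of weights, but ~32 GB once training starts.
total_gb = training_memory_gb(2e9)
```

Mixed-precision setups change the exact multipliers, but the pattern holds: the "extra bags" outweigh the backpack.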

🍞 Top Bread (Hook): What if classmates work at home and only share their improvements sometimes, instead of meeting every minute in the library?

🥬 Filling (The Actual Concept):

  • What it is: Federated optimization is a way for many devices to train locally on their own data and occasionally share updates (not raw data) to build a shared model.
  • How it works: 1) Everyone downloads the current model, 2) does several local steps, 3) sends the change (not the data) to a server, 4) server averages these changes, 5) repeat.
  • Why it matters: It saves bandwidth and can protect privacy, but each device still needs to handle the whole model’s memory.

🍞 Bottom Bread (Anchor): It’s like classmates studying at home, then emailing the teacher what they learned, and the teacher averaging it into a class study guide.
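The five-step loop above can be sketched as a toy federated-averaging round, with a single scalar standing in for the whole model:

```python
def fedavg_round(global_model, clients, local_steps=5, lr=0.1):
    """One round of federated averaging (toy version: the 'model' is a scalar).

    Each client starts from the global model, takes a few local gradient
    steps on its own (x, y) data, and uploads only its updated model; the
    server averages the uploads. Raw data never leaves a client.
    """
    updates = []
    for data in clients:
        w = global_model                        # 1) download current model
        for _ in range(local_steps):            # 2) several local steps
            for x, y in data:
                w -= lr * 2 * (w * x - y) * x   # squared-error gradient step
        updates.append(w)                       # 3) send the change, not the data
    return sum(updates) / len(updates)          # 4) server averages

# Two clients whose local data imply answers 2 and 4; averaging settles at 3.
w = 0.0
for _ in range(20):
    w = fedavg_round(w, clients=[[(1.0, 2.0)], [(1.0, 4.0)]])
```

Note that each client still holds the full model `w`; that memory cost is exactly what SPES later removes.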

🍞 Top Bread (Hook): Imagine a sports team where only a few players are on the field for any given play, chosen because they’re best for that move.

🥬 Filling (The Actual Concept):

  • What it is: Mixture-of-Experts (MoE) models have many specialist sub-networks (experts), but only a small number are used for each token.
  • How it works: 1) A gating function scores which experts are most relevant, 2) pick the top-k experts, 3) combine their outputs, 4) move on to the next token.
  • Why it matters: You get the capacity of many experts without paying full compute for all of them at once.

🍞 Bottom Bread (Anchor): Asking a medical question? The “medicine experts” wake up; a history question summons the “history experts.”
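The gate-then-combine routine looks like this in miniature (scalar tokens and made-up gate weights, purely for illustration):

```python
import math

def moe_forward(token, experts, gate_weights, k=2):
    """Toy Mixture-of-Experts layer for one scalar token.

    1) The gate scores every expert, 2) only the top-k actually run,
    3) their outputs are combined, weighted by softmax over the winners.
    Real MoE layers route vectors per token, but the logic is the same.
    """
    scores = [g * token for g in gate_weights]                       # 1) score
    top = sorted(range(len(experts)), key=lambda i: scores[i])[-k:]  # 2) top-k
    exps = [math.exp(scores[i]) for i in top]
    probs = [e / sum(exps) for e in exps]                # normalize over winners
    return sum(p * experts[i](token) for p, i in zip(probs, top))  # 3) combine

# Four "experts", each just a different scaling; only 2 ever run per token.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out = moe_forward(1.0, experts, gate_weights=[0.1, 0.9, 0.2, 0.5], k=2)
# Only experts 1 and 3 (the two highest-scoring) contribute to `out`.
```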

The problem: Even decentralized methods like DiLoCo and Photon save bandwidth by syncing less often, but each node still trains the full model. That means every GPU must hold every parameter’s gradients and optimizer states, which is still too heavy for 48GB cards. What was missing was a way to lower per-node memory and communication without losing the benefits of collaboration.

The paper’s gap-filling idea: Use the MoE structure to split experts across nodes. If a node only trains its assigned experts (plus shared layers), it only needs to store optimizer states for those parts. That slashes per-device memory and lets slow networks cope because only changed pieces get synchronized. But there’s a catch: each expert sees fewer tokens, so learning can slow down. The authors solve that with early expert merging—experts share what they’ve learned at the beginning to build a strong base, then specialize later.

Real stakes: This matters because it opens large-model pretraining to schools, small labs, startups, and citizen-science groups who lack elite clusters. It means more people can experiment, check results, and innovate—leading to fairer, faster progress in AI.

02Core Idea

🍞 Top Bread (Hook): Think of a giant jigsaw puzzle. Instead of every friend trying to finish the whole picture alone, each friend assembles only their own chunk and occasionally brings it to the table so the full picture can grow.

🥬 Filling (The Actual Concept):

  • What it is: The key insight is to train a Mixture-of-Experts model in a decentralized way where each node updates only its subset of experts (plus shared parts), then synchronizes just those changed pieces—SParse Expert Synchronization (SPES).
  • How it works: 1) Split experts across nodes, 2) each node freezes unassigned experts and trains only its own, 3) after several local steps, upload the updated shared layers and the local experts, 4) server averages shared parts and slots in the freshest expert copies, 5) early in training, merge similar experts to speed up learning.
  • Why it matters: This cuts memory per node and shrinks upload volume, making internet-speed training feasible without fancy clusters.

🍞 Bottom Bread (Anchor): With 16 separate 48GB GPUs on the internet, SPES trained a 2B-parameter MoE model to competitive quality—something that used to be out of reach.
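The round structure above can be sketched in a few lines of Python. This is a toy with scalar "weights" and invented names (`spes_round`, the per-node deltas), not the paper's implementation:

```python
def spes_round(global_shared, global_experts, node_deltas, assignment):
    """One SPES-style synchronization round (toy scalars, invented names).

    Each node trains only the shared part plus its assigned experts; the
    server averages the shared part (FedAvg-style) and takes each expert
    directly from its owner. Frozen experts are never uploaded at all.
    """
    shared_uploads, expert_uploads = [], {}
    for node, delta in node_deltas.items():
        shared_uploads.append(global_shared + delta)   # node's shared update
        for e in assignment[node]:                     # only the experts it owns
            expert_uploads[e] = global_experts[e] + delta
    new_shared = sum(shared_uploads) / len(shared_uploads)
    new_experts = {e: expert_uploads.get(e, w)         # direct assignment
                   for e, w in global_experts.items()}
    return new_shared, new_experts

shared, experts = 0.0, {"e0": 0.0, "e1": 0.0, "e2": 0.0, "e3": 0.0}
assignment = {"n0": ["e0", "e1"], "n1": ["e2", "e3"]}
shared, experts = spes_round(shared, experts, {"n0": 0.2, "n1": 0.4}, assignment)
# shared blends both nodes; each expert carries only its owner's update.
```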

Multiple Analogies (same idea, 3 ways):

  • Library analogy: Instead of every branch storing and updating all books, each branch curates only certain sections (experts). Periodically, they exchange just those sections and the shared catalog (shared layers), keeping shelves light and mail costs low.
  • Sports analogy: Each coach trains their own position group (experts) using local drills. After a few sessions, they meet to update team playbooks (shared layers) and swap the best-trained players in each position.
  • Cooking analogy: Each cook perfects a few dishes (experts). At checkpoints, they share those recipes and a common spice mix (shared layers). Early on, they blend similar recipes to find a strong base, then later fine-tune their specialties.

🍞 Top Bread (Hook): Imagine a talent show where only a few best-suited performers go on stage for each act.

🥬 Filling (The Actual Concept):

  • What it is: Mixture-of-Experts (MoE) routes each token to just a handful of experts, so you get big capacity without paying full compute every time.
  • How it works: 1) A router scores experts, 2) picks top-k, 3) combines their outputs.
  • Why it matters: This modular design makes it natural to split experts across nodes.

🍞 Bottom Bread (Anchor): Asking a math question lights up math experts; asking about art lights up art experts.

🍞 Top Bread (Hook): Early team scrimmages help players quickly learn from each other before they specialize.

🥬 Filling (The Actual Concept):

  • What it is: Expert-merging warm-up blends each expert with its most similar peers at the start, so everyone learns faster from more tokens.
  • How it works: 1) Measure similarity (cosine) between experts’ input layers, 2) pick top-K similar ones, 3) gently average (weighted) their parameters, 4) gradually stop merging so experts can specialize.
  • Why it matters: Without this, each expert only sees a slice of data, so learning could drag.

🍞 Bottom Bread (Anchor): It’s like guitarists sharing riffs early in band practice to find a common groove, then later each focuses on their own sound.

Before vs After:

  • Before: Decentralized training saved bandwidth but still forced every node to carry full-model memory.
  • After: SPES lowers memory and communication by training only assigned experts and syncing only changed parts, yet achieves competitive quality.

Why It Works (intuition):

  • Memory shrinks because optimizer states are kept only for a subset of parameters per node.
  • Communication shrinks because only updated experts plus shared layers are exchanged.
  • Quality holds because shared layers are averaged (global knowledge) and experts either rotate through enough data or receive early merged knowledge to build strong foundations.

Building Blocks:

  • MoE backbone and router (chooses experts per token)
  • Expert partitioning (who trains what)
  • Local training with frozen unassigned experts
  • Sparse synchronization (send only what changed)
  • Expert-merging warm-up (fast shared learning, then specialization)

🍞 Bottom Bread (Anchor): Picture 16 friends each perfecting one puzzle section at home. Every so often, they bring just their section and the shared border pieces to the table. Early on, similar sections are gently averaged to get the colors right. The puzzle finishes faster without anyone carrying the whole box.

03Methodology

At a high level: Input tokens → local training of assigned experts and shared layers → sparse synchronization (only send changed parts) → early expert merging (warm-up) → updated global model.

Step 1: Set up the MoE model and split experts

🍞 Top Bread (Hook): Imagine a big toolbox where each tool is great at a specific job.

🥬 Filling (The Actual Concept):

  • What it is: The model is a decoder-only Transformer with MoE feed-forward blocks, shared attention/normalization, and a router.
  • How it works: 1) Build an MoE LLM (e.g., drop-less experts, SwiGLU, RoPE, RMSNorm, QK-Norm), 2) list all experts, 3) divide experts into disjoint sets, one set per node, 4) each node also holds shared modules.
  • Why it matters: Dividing experts prepares us to cut memory and bandwidth later.

🍞 Bottom Bread (Anchor): With 16 experts and 16 nodes, each node gets 1 expert plus the shared layers.

Step 2: Local training on each node (freeze unassigned experts)

🍞 Top Bread (Hook): You don’t need to practice every instrument to improve the band, just your own parts and the common rhythm.

🥬 Filling (The Actual Concept):

  • What it is: Each node updates shared layers and its assigned experts, while keeping other experts frozen.
  • How it works: 1) Receive the latest global model, 2) for H local steps, train on local data, 3) compute gradients and keep optimizer states only for shared layers and local experts, 4) save memory by not storing states for frozen experts.
  • Why it matters: This is the main memory win—optimizer states often dominate training memory.

🍞 Bottom Bread (Anchor): For AdamW, a node now stores states for maybe 0.7B trainable params (its expert + shared), instead of the whole 2B model.
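A quick sanity check on that number (the ~0.6B shared-parameter figure below is my assumption for illustration; the paper's exact split may differ):

```python
def per_node_trainable(total_params, shared_params, n_experts, experts_per_node):
    """Per-node trainable parameters under expert partitioning.

    Non-shared parameters are assumed to split evenly across experts; a
    node trains shared layers plus its own experts and keeps optimizer
    states only for that slice (a simplification; real layouts vary).
    """
    per_expert = (total_params - shared_params) / n_experts
    return shared_params + experts_per_node * per_expert

# Illustrative numbers in the spirit of the 2B setup: ~2.1B total params,
# an assumed ~0.6B of them shared, 16 experts, 1 expert per node.
trainable = per_node_trainable(2.1e9, 0.6e9, n_experts=16, experts_per_node=1)
# -> roughly 0.69B trainable parameters per node, close to the ~0.7B quoted.
```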

Step 3: Sparse synchronization (send only what changed)

🍞 Top Bread (Hook): If you edited just two pages of a report, you’d email those two pages, not the entire 200-page document.

🥬 Filling (The Actual Concept):

  • What it is: At the end of local training, nodes upload only updated shared parameters and their assigned experts.
  • How it works: 1) Server averages the shared parts (FedAvg-style), 2) for each expert, the server simply takes the owner node’s updated copy (direct assignment), 3) broadcast the new global model back to all nodes.
  • Why it matters: Communication cost drops a lot because you skip sending frozen experts.

🍞 Bottom Bread (Anchor): In a 7B setup on 4 nodes, SPES needed about 9.8GB uplink per round per node vs. 28.6GB for full-model methods—a huge savings.
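Where the savings come from can be checked with simple arithmetic. The fp32 payload and the ~2.45B-parameters-sent figure are inferred for illustration, not stated directly:

```python
def upload_gb(n_params_sent, bytes_per_value=4):
    """Per-round, per-node upload volume, assuming fp32 payloads."""
    return n_params_sent * bytes_per_value / 1e9

# Full-model sync for a 7B model sends every parameter, every round.
full = upload_gb(7.0e9)        # 28.0 GB, near the reported 28.6 GB
# Sparse sync sends shared layers plus only the node's own experts;
# ~2.45B parameters sent is inferred from the reported 9.8 GB.
sparse = upload_gb(2.45e9)     # 9.8 GB
reduction = 1 - sparse / full  # ~0.65: roughly two-thirds less uplink
```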

Step 4: Expert-merging warm-up (early training only)

🍞 Top Bread (Hook): Early huddles help teammates share tips before they split up to master their roles.

🥬 Filling (The Actual Concept):

  • What it is: During the first T_merge steps, each expert is softly merged with its most similar peers to speed up learning.
  • How it works: 1) Compare experts by cosine similarity on their input projection weights, 2) pick top-K similar ones, 3) add a small weighted average of their differences (controlled by α), 4) gradually decay α to zero and stop merging after warm-up.
  • Why it matters: It boosts token coverage per expert indirectly, helping faster convergence and better early generalization.

🍞 Bottom Bread (Anchor): The paper used K=4, α=0.1, merging every 500 steps for 12,500 steps, improving average scores like BoolQ and SciQ.
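Here is a toy version of that merging pass on tiny weight vectors (the function names and two-dimensional "experts" are invented for illustration; the paper operates on full expert weight matrices):

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def merge_experts(experts, k=2, alpha=0.1):
    """One expert-merging warm-up pass.

    For each expert: 1) find its k most similar peers by cosine similarity,
    2) nudge its weights a small step (alpha) toward the peers' average.
    In SPES, alpha decays to zero so experts can specialize later.
    """
    merged = []
    for i, w in enumerate(experts):
        sims = sorted(((cosine(w, v), j) for j, v in enumerate(experts) if j != i),
                      reverse=True)[:k]
        peer_avg = [sum(experts[j][d] for _, j in sims) / len(sims)
                    for d in range(len(w))]
        merged.append([wd + alpha * (pd - wd) for wd, pd in zip(w, peer_avg)])
    return merged

# Three tiny "experts": the first two are similar, the third points elsewhere.
experts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
experts = merge_experts(experts, k=1, alpha=0.1)
# Each expert drifts 10% of the way toward its nearest peer.
```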

Step 5: Losses and stability tricks

🍞 Top Bread (Hook): A recipe needs the right ingredients and just the right amount of each one.

🥬 Filling (The Actual Concept):

  • What it is: The training uses: next-token cross-entropy, z-loss for stability, and load-balancing loss to encourage fair expert use.
  • How it works: 1) Cross-entropy teaches prediction, 2) z-loss keeps logits well-behaved, 3) load balancing prevents one expert from hogging all tokens.
  • Why it matters: Without balancing, some experts might rarely get trained; without stability, training can wobble.

🍞 Bottom Bread (Anchor): Think of a classroom where one student always answers; the teacher’s “everyone participates” rule is the load-balancing loss.
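The three ingredients can be sketched as follows. The balance term here is a toy squared-deviation stand-in (real MoE losses typically use a fraction-times-router-probability product), and the coefficients are hypothetical:

```python
import math

def cross_entropy(logits, target):
    """Next-token loss: negative log-probability of the correct token."""
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(logits[target]) / z)

def z_loss(logits):
    """Stability term: penalizes a large log-partition (log-sum-exp)."""
    z = math.log(sum(math.exp(l) for l in logits))
    return z * z

def balance_loss(expert_fractions):
    """Toy balance penalty: squared deviation from uniform expert usage."""
    n = len(expert_fractions)
    return sum((f - 1 / n) ** 2 for f in expert_fractions)

logits, target = [2.0, 0.5, 0.1], 0
usage = [0.7, 0.1, 0.1, 0.1]            # one expert hogging most tokens
total = (cross_entropy(logits, target)
         + 1e-3 * z_loss(logits)        # hypothetical coefficient
         + 1e-2 * balance_loss(usage))  # hypothetical coefficient
# Perfectly even usage of [0.25] * 4 would add zero balance penalty.
```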

Concrete data example (2B MoE on 16 nodes):

  • Model: About 2.1B total, with 0.8B active; 16 experts (top-2 active per token)
  • Nodes: 16 single 48GB GPUs over ~17 Gbps Ethernet
  • Per node: Trains 1 expert (~0.7B trainable params with shared layers); memory stays under ~40GB
  • Sync: H=50 local steps before each round; only updated expert + shared parts are uploaded
  • Result: Competitive scores vs centralized/decentralized baselines with significantly lower memory and communication costs

What breaks without each step:

  • No expert partitioning: Each node must store optimizer states for the whole model—memory overload on 48GB cards.
  • No sparse sync: Upload volume balloons; internet links become the bottleneck.
  • No warm-up merging: Each expert learns too slowly from limited tokens; convergence lags and final quality drops.
  • No load balancing: Some experts starve of data; specialization gets lopsided.

The secret sauce:

  • MoE modularity makes it natural to shard experts across nodes.
  • Sparse synchronization transfers just the minimum.
  • Early expert merging counters the token-scarcity per expert and accelerates convergence.
  • Together, these design choices make weakly connected, geographically spread GPUs practical for pretraining.

🍞 Bottom Bread (Anchor): It’s like a relay team: each runner (expert) trains their leg locally, shares only their split times and improved form (updated params), and early in the season they run together to share pacing tips (merging). The team gets fast without needing a fancy indoor track.

04Experiments & Results

The test: The authors trained MoE LLMs at 1B, 2B, 7B, and a 9B upcycled model, measuring three things: (1) per-GPU memory, (2) upload bandwidth per round, and (3) downstream benchmark accuracy (e.g., ARC, PIQA, SciQ, BoolQ, WinoGrande). The key question: Can SPES match centralized or prior decentralized performance while using much less memory and communication?

The competition: They compared against centralized training (tight clusters with FSDP/DP) and DiLoCo, a leading decentralized baseline that still trains the full model per node using FedAvg.

The scoreboard (with context):

  • Memory: For a 2B model, centralized and DiLoCo need over ~50GB per GPU, too big for many 48GB cards. SPES keeps memory under ~40GB without extra sharding. That’s the difference between “won’t run” and “runs comfortably.”
  • Communication: In a 7B setup on 4 nodes, SPES uploads ~9.8GB per node per round versus ~28.6GB for full-model methods (about a 65% cut in that case). In another 2B/16-node analysis, SPES cuts communication by about 33.3% per round.
  • Throughput: Despite lacking fast RDMA interconnects, SPES hits ~3.67k tokens/s per GPU with H=50, close to centralized’s ~3.79k on far stronger fabric—showing speed stays competitive.
  • Accuracy: On a suite of commonsense tasks, SPES-2B (trained across 16 internet-connected 48GB GPUs) is competitive with centralized and DiLoCo baselines. For example, the average score nudges ahead of DiLoCo in their 1B comparison and remains solid across ARC(e/c), PIQA, SciQ, BoolQ, SIQA, WinoGrande. SPES-7B matches or beats MoE++ on several metrics using fewer tokens. SPES-9B (upcycled from a strong dense model) hits competitive, even standout numbers like 81.5 on ARC-E and 77.3 on BoolQ with under 500B tokens before stopping early.

Make the numbers meaningful:

  • Think of 87% vs 80% as A+ vs B-. Here, SPES doesn’t always leap ahead, but it consistently lands in the A/B range while needing only consumer-grade GPUs and basic Ethernet. That’s the magic: similar report card, simpler classroom.

Surprising findings:

  • Warm-up merging helps more than you might guess. Turning it on raised the average from about 50.5 to 51.3 in one study, especially improving BoolQ and SciQ—evidence that early cross-expert sharing fixes the “few tokens per expert” problem.
  • SPES stays robust as you add nodes. Performance dipped only slightly when moving from 2 to 8 nodes with a fixed batch, showing the design handles more participants without crumbling.
  • Synchronization frequency matters. Bigger H (e.g., 200–400) reduced communication but also hurt performance, implying a sweet spot around H=50 where you get both quality and efficiency.

Data and implementation notes:

  • Data was drawn from open corpora like Ultra-FineWeb, SlimPajama, and subsets from OLMo-Mix-1124 (arXiv, OpenWebMath, peS2o, Algebraic Stack, StarCoder). Upcycling used the Nemotron Pretraining Dataset.
  • Infrastructure varied from 16 separate L40S GPUs over ~17 Gbps Ethernet (2B) to 4 nodes Ă— 8 A800s (7B). A gRPC-based parameter server handled aggregation.

Bottom line: SPES shows that with smart architecture (MoE), selective local training (experts + shared), sparse syncing, and a short early merging phase, you can hit centrally competitive accuracy while cutting the usual roadblocks—memory and bandwidth. That unlocks serious pretraining on everyday hardware setups.

05Discussion & Limitations

Limitations (be specific):

  • Scale ceiling not yet fully tested: Experiments went up to 9B parameters and <500B tokens; much larger models or trillion-token runs need more validation.
  • Router/expert design not exhaustively explored: While MoE choices were sensible (e.g., drop-less), more advanced expert architectures could shift outcomes.
  • Synchronization trade-offs: If H is too large, models drift and performance dips; too small, and communication costs rise. Tuning is needed per setup.
  • Early merging is a knob: Wrong α, K, or warm-up length can over-mix experts, hurting late specialization.

Required resources:

  • One GPU per node (48GB worked for 2B MoE), an Ethernet link (~10–20 Gbps practical), and a modest CPU/RAM parameter server.
  • Open datasets, an MoE-capable training stack (e.g., OLMoE codebase), and a gRPC-like communication layer.

When NOT to use SPES:

  • If you already have a tightly coupled RDMA cluster: Traditional FSDP/tensor/pipeline parallelism might be simpler and faster end-to-end.
  • If your model is dense-only and cannot be re-architected or upcycled to MoE: SPES relies on expert modularity.
  • If nodes are extremely unstable or networks are highly unreliable: Even sparse sync needs occasional reliable rounds.

Open questions:

  • Scaling laws under SPES: How do accuracy and convergence change with more experts, more nodes, and bigger token budgets?
  • Smarter merging: Can learned or adaptive merging policies beat fixed cosine/Top-K schemes?
  • Dynamic expert assignment: Could we reassign experts across nodes mid-training to balance load or handle stragglers?
  • Beyond text: How well does SPES extend to multimodal tasks (vision+language, code+execution), or to long-context training?
  • Privacy and robustness: What happens when data distributions are very different across nodes, and how to protect against malicious updates?

Takeaway: SPES is not a silver bullet for all training, but it’s a powerful new tool for memory- and bandwidth-limited settings. It makes serious pretraining possible on ordinary, decentralized hardware while staying competitive in quality. With further work on scheduling, routing, merging, and robustness, its sweet spot could grow even larger.

06Conclusion & Future Work

Three-sentence summary: SPES is a decentralized, memory-efficient way to pretrain Mixture-of-Experts language models by letting each node train only its assigned experts (plus shared parts) and synchronizing just those changed pieces. A short expert-merging warm-up helps experts learn faster from broader token exposure before they specialize. This approach keeps per-GPU memory and per-round communication low while delivering performance comparable to centralized training on strong clusters.

Main achievement: Proving that large-scale pretraining can be done across weakly connected, everyday GPUs—without sacrificing competitiveness—by combining MoE modularity, sparse synchronization, and early expert merging.

Future directions: Push to larger models and longer runs; explore adaptive merging and dynamic expert reassignment; integrate with advanced optimizers; extend to multimodal and long-context settings; harden against heterogeneous and adversarial environments.

Why remember this: SPES turns “you need a supercomputer” into “you need a few friends with decent GPUs and the internet.” It democratizes pretraining, trims memory and bandwidth bills, and shows how clever architecture plus smart synchronization can widen who gets to build the next generation of AI.

Practical Applications

  • Run pretraining experiments across a campus lab’s scattered 48GB GPUs without buying a high-speed cluster.
  • Upcycle an existing dense checkpoint into an MoE and continue training with SPES to boost capacity at low cost.
  • Create university–industry collaborations where each partner trains a few experts and synchronizes over standard Ethernet.
  • Launch community compute collectives that contribute experts from home or office GPUs to build open models.
  • Prototype regional or domain-specific models (e.g., medical, legal, education) on modest, geographically spread hardware.
  • Teach a hands-on course where students each train one expert and see the full model improve as they synchronize.
  • Scale research ablations (router choices, merging schedules, H steps) without needing RDMA infrastructure.
  • Operate in edge data centers or cloud spots with limited bandwidth by syncing only updated experts and shared layers.
  • Combine SPES with intra-node sharding (FSDP) when a node has multiple GPUs to push to larger model sizes.
  • Experiment with privacy-aware setups where data stay local and only parameter updates travel.
#decentralized LLM pretraining · #mixture-of-experts (MoE) · #sparse expert synchronization · #federated optimization · #expert merging warm-up · #memory-efficient training · #distributed GPUs · #low-bandwidth communication · #optimizer states (AdamW) · #parameter server (gRPC) · #MoE routing · #token utilization · #upcycling dense to MoE · #RMSNorm · #RoPE · #communication-efficient learning