SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning
Key Summary
- This paper shows how to safely make a neural network wider in the middle of training without it freaking out.
- The authors say two things must happen together: keep the signal level steady (signal preservation) and make the new parts learn different things (symmetry breaking).
- They keep signals steady by matching the RMS scale of activations before and after width growth, so the model's layers still talk in the same "volume."
- They break symmetry by resetting optimizer states only for the new channels and by giving those channels a short, special learning-rate re-warmup.
- Copying weights during expansion looks smooth at first but secretly locks gradients so the new channels don't learn; the proposed tricks unlock them.
- Across Mixture-of-Experts models and different optimizers (AdamW and Muon), the method improves final performance versus naive expansion.
- Under 2× width growth, training compute drops by up to 35% while matching or beating from-scratch training on many downstream tasks.
- RMS-preserved scaling may cause a tiny immediate loss bump but wins later with better convergence and lower final loss.
- The recipe works for several width axes, like hidden size and MoE expert inner dimension, and even when both grow together.
- Bottom line: SPARKLING makes mid-stage width growth practical, stable, and cost-effective.
Why This Research Matters
Safer mid-stage width growth means we can upgrade models while they train, saving significant compute and time. That directly lowers costs for building capable language models, making them more accessible to researchers and companies of different sizes. Because the method improves downstream accuracy even when final pre-training loss is slightly higher, it focuses on what matters in practice: real task performance. Its optimizer-agnostic design (works with AdamW and Muon) suggests wide applicability across modern LLM training stacks. Finally, this framework gives a clear, practical recipe teams can adopt today, turning a once-risky move (growing width mid-training) into a reliable optimization tool.
Detailed Explanation
01 Background & Problem Definition
Top Bread (Hook) You know how you might start a puzzle with the edges first and then add more pieces later? It's still the same puzzle, but you grow it step by step so it doesn't get overwhelming.
Filling (The Actual Concept)
- What it is: Progressive Learning (PL) is a way to train big AI models by starting smaller and carefully growing them during training.
- How it works:
- Begin with a smaller, cheaper model.
- Train it until it knows a lot.
- Expand the model (make it deeper or wider).
- Keep training so the bigger model can learn more without wasting earlier effort.
- Why it matters: Training huge models from scratch is very expensive; PL saves time and money by reusing what the smaller model already learned.
Bottom Bread (Anchor) Imagine learning piano with easy songs first, then adding more keys and harder songs as you improve. You build up without starting over each time.
Top Bread (Hook) Picture a classroom where you add more desks so more students can join without building a brand-new school.
Filling (The Actual Concept)
- What it is: Width Expansion means adding more "channels" or neurons to parts of a neural network so it can represent more patterns at once.
- How it works:
- Pick where to add channels (like hidden size or expert inner size in MoE).
- Insert new rows/columns (fan-out/fan-in) in the right layers.
- Initialize the new parts in a stable way.
- Continue training so the bigger network learns.
- Why it matters: More width = more capacity to learn, but if done carelessly mid-training, the model can become unstable.
Bottom Bread (Anchor) It's like adding more crayons to your box mid-drawing. If they're the wrong size or too waxy, your picture gets smudgy; if they fit just right, you color better.
Top Bread (Hook) Imagine turning up the volume on a speaker by accident; suddenly everything is too loud and hurts your ears.
Filling (The Actual Concept)
- What it is: Signal Preservation means keeping the size (scale) of the layer activations consistent so the model's layers keep "speaking" at the same volume.
- How it works:
- Measure the RMS (root-mean-square) of activations before expansion.
- Expand width.
- Rescale new weights so the output RMS matches the old RMS.
- Proceed with training so deeper parts receive familiar-sized signals.
- Why it matters: If the scale jumps, later layers get out-of-range inputs and training wobbles or spikes.
Bottom Bread (Anchor) If your recipe needs one cup of flour, doubling it to two cups without adjusting the rest ruins the cake. Keeping the same proportions makes it work.
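To make the RMS-matching idea above concrete, here is a minimal sketch in plain PyTorch (not the paper's code, and only the simplest case): doubling a layer's fan-in with like-scaled random weights inflates the output RMS by roughly √2, and rescaling the weights by √(d_old/d_new) brings it back. The paper derives case-specific factors for each expansion pattern; this only illustrates the principle.

```python
# Illustrative only: fan-in growth without vs. with an RMS-preserving rescale.
import torch

torch.manual_seed(0)
d_old, d_new, batch = 512, 1024, 64

def rms(t):
    return t.pow(2).mean().sqrt().item()

x_old = torch.randn(batch, d_old)   # activations feeding the layer before growth
x_new = torch.randn(batch, d_new)   # after growth, the layer sees a wider input

W_old = torch.randn(256, d_old) / d_old ** 0.5    # fan-in-scaled weights, 256 outputs
W_naive = torch.randn(256, d_new) / d_old ** 0.5  # naive growth: same per-entry scale, more inputs
W_rms = W_naive * (d_old / d_new) ** 0.5          # RMS-preserving rescale (sqrt of fan-in ratio)

print("pre-expansion output RMS :", rms(x_old @ W_old.T))    # ~1.0
print("naive expansion RMS      :", rms(x_new @ W_naive.T))  # ~1.4, i.e. sqrt(2) too loud
print("RMS-preserved expansion  :", rms(x_new @ W_rms.T))    # ~1.0 again
```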
Top Bread (Hook) If two kids copy each other on every question, they won't learn different skills.
Filling (The Actual Concept)
- What it is: Symmetry Breaking makes duplicated parts of a network learn different, useful features instead of staying identical.
- How it works:
- Notice that simple copying makes new and old channels get the same gradients.
- Change their "momentum" history (optimizer state) for the new channels.
- Give new channels a small learning-rate boost briefly.
- This nudges them onto different paths.
- Why it matters: Without breaking symmetry, the extra width is wasted because duplicates act like clones.
Bottom Bread (Anchor) On a soccer team, if everyone runs to the ball, no one defends or passes. Assigning different roles makes the team win.
Before this research, people mostly grew networks at the very beginning of training or focused on depth. Width growth in the middle of training was scary: naive setups caused loss spikes (like tripping while running), and copying weights seemed smooth but secretly froze the new parts (they learned the same thing). Attempts like uneven splitting or tiny symmetric noise helped a little but not enough, and they didn't handle both the forward signal size and the backward learning dynamics together.
This paper identifies the missing piece: you must balance two forces at once: keep the activation scale steady (signal preservation) and deliberately make new parts different (symmetry breaking). The stakes are big: if we can safely expand width mid-training, we can save up to 35% compute while matching or beating from-scratch models on many tasks. That means better models faster and cheaper, like upgrading your bike while riding it, without crashing.
02 Core Idea
Top Bread (Hook) Imagine you're adding more lanes to a highway while cars are driving. You must keep traffic flowing at the same speed (no sudden jams) and make sure each lane gets different cars (not all piling into one).
Filling (The Actual Concept)
- What it is: The key insight is to grow a network's width mid-training by simultaneously preserving activation RMS scale and breaking gradient symmetry so the new capacity actually learns different features.
- How it works:
- Enforce RMS-scale consistency at expansion so forward signals stay in-range.
- Reset optimizer states only for the new channels to remove coupled momentum.
- Re-warm learning rates only for the new channels to help them branch out.
- Keep training under the same schedule for old parameters to maintain stability.
- Why it matters: Doing only one side (just preserve signals or just break symmetry) isn't enough; the magic is in balancing both.
Bottom Bread (Anchor) Think of adding new musicians to a band: keep the volume matched to the current song (RMS preservation), and give the newcomers a short solo (re-warmup with fresh momentum) so they find their own sound.
Three analogies for the same idea:
- Classroom analogy: Keep the speaking voice level the same (RMS) so everyone hears, then give new students different projects (asymmetric reset + re-warmup) so they don't copy old work.
- Cooking analogy: Preserve the recipe's ratios (RMS) when you double it, and stir the newly added ingredients more at first (re-warmup) so they blend uniquely, not in clumps.
- Garden analogy: Keep the same sunlight and water (RMS), but loosen the soil around new plants (reset) and give them a short fertilizer boost (re-warmup) so they grow their own roots.
Before vs. After:
- Before: Width growth mid-training led to loss spikes (if signal scale changed) or useless duplicates (if you copied weights that stayed in lockstep).
- After: With SPARKLING, forward signals stay steady, and the backward dynamics are gently nudged so new channels diversify and learn.
Why it works (intuition, no equations):
- Transformers with pre-normalization rely on a balanced mix between the main path and branch outputs. If you suddenly change activation size, the layer's "mixing ratio" breaks, confusing downstream layers. Preserving RMS keeps that ratio stable.
- Copying creates identical gradients; identical gradients plus identical momentum means identical updates, so nothing new is learned. Resetting states removes the identical momentum, and a small, temporary LR bump gives the new channels room to branch away.
Building blocks (introduced with quick sandwiches):
Top Bread (Hook) Think of adding sockets to a power strip: you can add outputs (more devices) or inputs (more power lines).
Filling (The Actual Concept)
- What it is: Fan-out grows output channels; fan-in grows input channels.
- How it works:
- Fan-out: add rows to a weight matrix; keep new ones statistically like old ones.
- Fan-in: add columns; rescale so the output variance stays the same.
- Why it matters: Each case needs the right scaling to keep RMS steady.
Bottom Bread (Anchor) Adding more speaker outputs? Make sure each new speaker plays at the same loudness as the others.
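As a hedged illustration of the fan-out/fan-in pairing, the sketch below grows the inner dimension of a plain two-layer MLP block with random weights: new rows are appended to the up projection (fan-out) and new, rescaled columns to the down projection (fan-in). The specific rescaling convention here is one simple choice; the paper's exact factors depend on the chosen init pattern.

```python
# Toy fan-out + fan-in expansion of an MLP block, with a simple variance-matching rescale.
import torch

torch.manual_seed(0)
d_model, d_inner, d_inner_new = 256, 1024, 2048

W_up = torch.randn(d_inner, d_model) / d_model ** 0.5     # shape (out, in): fan-out axis is rows
W_down = torch.randn(d_model, d_inner) / d_inner ** 0.5   # fan-in axis is columns

# Fan-out: append new rows that look statistically like the old ones.
new_rows = torch.randn(d_inner_new - d_inner, d_model) / d_model ** 0.5
W_up_big = torch.cat([W_up, new_rows], dim=0)             # (d_inner_new, d_model)

# Fan-in: append new columns, then rescale so each output still sums over
# "the same amount" of signal (illustrative sqrt ratio, applied to the whole matrix).
new_cols = torch.randn(d_model, d_inner_new - d_inner) / d_inner ** 0.5
W_down_big = torch.cat([W_down, new_cols], dim=1) * (d_inner / d_inner_new) ** 0.5

x = torch.randn(8, d_model)
h_small = (x @ W_up.T).relu() @ W_down.T
h_big = (x @ W_up_big.T).relu() @ W_down_big.T
print("block output RMS before growth:", h_small.pow(2).mean().sqrt().item())
print("block output RMS after growth :", h_big.pow(2).mean().sqrt().item())   # roughly equal
```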
Top Bread (Hook) Imagine a ruler that always tells you the average height of a group.
Filling (The Actual Concept)
- What it is: RMS-scale consistency means keeping the typical size of activations unchanged through expansion.
- How it works:
- Measure pre-expansion RMS.
- Apply derived rescaling factors for new weights (different for fan-in/fan-out and copy patterns).
- Verify the post-expansion RMS matches.
- Why it matters: Keeps residual streams in their comfortable operating zone.
Bottom Bread (Anchor) It's like keeping cruise control at 60 mph while adding a new lane; traffic flow stays smooth.
Top Bread (Hook) If you give two runners the exact same push and same shoes, they run the same path.
Filling (The Actual Concept)
- What it is: Copy initialization duplicates parameters, which duplicates gradients; without intervention, new channels remain clones.
- How it works:
- Copy creates identical forward outputs.
- Identical backward signals + identical optimizer momentum = identical updates.
- Why it matters: Extra width becomes fake capacity unless we break this lock.
Bottom Bread (Anchor) Two identical keys open the same lock; to open a different door, you need to file one key differently (state reset + re-warmup).
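The gradient lock described above is easy to reproduce. The toy below uses a hypothetical Net2Net-style duplication (copy one hidden unit's incoming row, split its outgoing column), which is not necessarily the paper's exact operator: the forward pass is unchanged, yet the two copies receive identical gradients.

```python
# Demonstration: duplicating a hidden unit preserves the forward pass but
# gives the original and the copy exactly the same gradient.
import torch

torch.manual_seed(0)
d_in, d_hid, d_out = 8, 4, 3
W1 = torch.randn(d_hid, d_in)
W2 = torch.randn(d_out, d_hid)

# Duplicate hidden unit 0: copy its incoming row, split its outgoing column in two.
W1_big = torch.cat([W1, W1[:1]], dim=0)
W2_big = torch.cat([W2, W2[:, :1]], dim=1)
W2_big[:, 0] = W2[:, 0] * 0.5
W2_big[:, -1] = W2[:, 0] * 0.5
W1_big.requires_grad_(True)
W2_big.requires_grad_(True)

x = torch.randn(16, d_in)
y_small = (x @ W1.T).relu() @ W2.T
y_big = (x @ W1_big.T).relu() @ W2_big.T
print("forward unchanged:", torch.allclose(y_small, y_big, atol=1e-6))

loss = y_big.pow(2).mean()   # any loss works; the point is the gradient pattern
loss.backward()
print("grads of original vs copied row identical:",
      torch.allclose(W1_big.grad[0], W1_big.grad[-1]))
```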
03 Methodology
At a high level: Pretrained model → Choose width axis to grow → RMS-preserved expansion (signal kept steady) → Asymmetric optimizer-state reset (break clone momentum) → Asymmetric LR re-warmup for new params (help them branch) → Continue training → Better, wider model at lower total cost.
Step-by-step, like a recipe:
- Decide what to widen (hidden size, MoE expert inner size, or both)
- What happens: You pick fan-out/fan-in layers that must grow together (e.g., MLP up/gate then down; attention v/out).
- Why this step exists: Matching pairs ensures consistency; growing one side without the other breaks shapes and statistics.
- Example: In an MoE expert MLP, doubling the inner dimension means widening up and gate projections (fan-out) and the down projection (fan-in).
- Expand with RMS-preserving scaling Top Bread (Hook) Imagine swapping a light bulb for a brighter one but adding a dimmer so the room brightness stays the same.
Filling (The Actual Concept)
- What it is: Carefully initialize and rescale new rows/columns so the post-expansion activation RMS matches the pre-expansion RMS.
- How it works:
- Fan-out: Add rows drawn like the old ones (or copied), keeping distribution consistent so output magnitude stays steady.
- Fan-in: Add columns but rescale weights by a factor that keeps per-output variance unchanged; handle special cases (random, zero, copy both sides) with the right formula.
- RMSNorm scale: When widening, make new γ entries statistically match the old ones so RMSNorm output scale stays consistent.
- What breaks without it: Downstream residual blocks receive too-large or too-small signals, upsetting the delicate balance between the main path and residual branch; training stutters.
Bottom Bread (Anchor) In a stereo system, if you add more speakers, you adjust their gains so the total loudness sounds the same to listeners.
Tip: For tied embeddings/output projection under hidden-size growth, embedding side acts like fan-out and output side like fan-in; compensate with the right factors after the final projection.
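The fan-in rescaling used in this step can be motivated with a short variance argument. This is a simplified sketch under zero-mean, i.i.d. assumptions; the paper works out separate factors for the random, zero, and copy cases.

```latex
% Variance bookkeeping for fan-in growth (zero-mean, i.i.d. weights and inputs).
\[
  y = \sum_{j=1}^{d} w_j x_j
  \quad\Longrightarrow\quad
  \operatorname{Var}(y) = d\,\sigma_w^2\,\sigma_x^2 .
\]
% Grow the fan-in to $d'$ and scale the weights by $\alpha$; matching variances gives
\[
  \operatorname{Var}(y') = d'\,\alpha^2\,\sigma_w^2\,\sigma_x^2
  \stackrel{!}{=} d\,\sigma_w^2\,\sigma_x^2
  \quad\Longrightarrow\quad
  \alpha = \sqrt{d/d'} .
\]
% Equal variance means equal RMS, so downstream layers keep seeing the scale they were trained on.
```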
- Choose initialization flavor (copy, random, zero) under RMS preservation Top Bread (Hook) Think of cloning a plant, sowing a new seed, or planting a tiny sprout.
Filling (The Actual Concept)
- What it is: Three practical starts for new channels (copy, random, or zero), each adjusted to preserve RMS.
- How it works:
- Copy: Forward matches perfectly at expansion, but beware gradient symmetry.
- Random: No forward match but good diversity; with RMS scaling, recovery is smooth.
- Zero: Starts silent; treat zero side as effectively random after first updates for RMS tracking.
- What breaks without it: Naive unscaled versions may look fine early (small loss gap) but hurt late-stage convergence.
Bottom Bread (Anchor) Copy is like a twin; random is like a new student; zero is like a quiet student who starts speaking after warming up.
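A small, hypothetical helper (names and constants are illustrative, not from the paper) shows how the three flavors differ when adding fan-out rows. Whichever flavor is used, the paper pairs it with the RMS-preserving factor for that case, and copy additionally relies on the symmetry-breaking steps that follow.

```python
# Illustrative init flavors for new fan-out rows of a weight matrix with shape (out, in).
import torch

def new_fanout_rows(W_old: torch.Tensor, n_new: int, mode: str) -> torch.Tensor:
    """Return n_new extra output rows; scaling conventions here are a sketch only."""
    if mode == "copy":
        idx = torch.arange(n_new) % W_old.shape[0]
        return W_old[idx].clone()                     # duplicates of existing rows
    if mode == "random":
        scale = W_old.pow(2).mean().sqrt()            # match the old rows' empirical RMS
        return torch.randn(n_new, W_old.shape[1]) * scale
    if mode == "zero":
        return torch.zeros(n_new, W_old.shape[1])     # silent until the first updates
    raise ValueError(f"unknown mode: {mode}")

W = torch.randn(1024, 256) / 256 ** 0.5
W_big = torch.cat([W, new_fanout_rows(W, 1024, "copy")], dim=0)
print(W_big.shape)  # torch.Size([2048, 256])
```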
- Asymmetric optimizer-state reset (break the symmetry lock) Top Bread (Hook) If two racers have the same head start and the same wind at their backs, they'll stay neck-and-neck.
Filling (The Actual Concept)
- What it is: Keep optimizer states (like momentum) for original channels, but reset them for new channels.
- How it works:
- On expansion, do not copy or zero-out all states.
- Instead, original channels keep their states; new channels start fresh.
- This removes identical momentum that would keep updates identical.
- What breaks without it: With symmetric states, gradients and updates mirror each other; new channels remain clones.
Bottom Bread (Anchor) It's like telling the new runner, "You start from here with fresh legs," while the old runner keeps their pace notes; now their strides diverge.
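The sketch below illustrates why the reset must be asymmetric, using plain momentum SGD for clarity (the paper performs the analogous surgery on AdamW/Muon state, and only for the newly added channels). Two duplicated rows see identical gradients; if their momentum buffers match, whether both copied or both zeroed, they update in lockstep, while giving only the new row a fresh buffer lets them drift apart and then genuinely diverge.

```python
# Toy experiment: symmetric vs. asymmetric optimizer state after a copy-based expansion.
import torch

torch.manual_seed(0)
d_in, d_hid, lr, beta = 8, 4, 0.1, 0.9
W1 = torch.randn(d_hid, d_in)
W2 = torch.randn(3, d_hid)

# Net2Net-style duplication of hidden unit 0 (as in the earlier sketch).
W1 = torch.cat([W1, W1[:1]], dim=0)
W2 = torch.cat([W2, W2[:, :1]], dim=1)
W2[:, 0] *= 0.5
W2[:, -1] *= 0.5
W1.requires_grad_(True)

inherit_momentum = False           # True = symmetric copy of state; False = asymmetric reset
m = torch.zeros_like(W1)
m[0] = torch.randn(d_in) * 0.1     # the original row carries momentum from earlier training
if inherit_momentum:
    m[-1] = m[0]                   # copying the state keeps the two rows locked together
                                   # (zeroing BOTH buffers would keep them locked as well)

x = torch.randn(64, d_in)
for _ in range(20):                # a few toy steps of momentum SGD on W1 only
    loss = ((x @ W1.T).relu() @ W2.T).pow(2).mean()
    loss.backward()
    with torch.no_grad():
        m.mul_(beta).add_(W1.grad)
        W1 -= lr * m
    W1.grad = None

print("gap between original and copied row:",
      (W1[0] - W1[-1]).abs().max().item())   # ~0 if symmetric, clearly > 0 after a reset
```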
- Asymmetric learning-rate re-warmup (let new parts find their lane) Top Bread (Hook) When a new player joins mid-game, you might give them a few extra drills to catch up.
Filling (The Actual Concept)
- What it is: Briefly warm up only the new channels' learning rate from the current value to a slightly higher peak, then decay normally.
- How it works:
- Keep the old parameters on the same schedule for stability.
- For new parameters, start at the current LR and warm up to a small multiple of it over a short re-warmup window (e.g., a peak of about 1.3× the current LR, with a warmup window in the range of 0–250 steps).
- After the warmup, share the same cosine tail down to the minimum LR.
- What breaks without it: New channels don't get enough "push" to separate; with too-strong or long re-warmup, you risk instability.
Bottom Bread (Anchor) Give the new kid a short, focused practice so they learn the team plays faster, then they rejoin the normal practice.
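A possible implementation of the two schedules is sketched below in pure Python. The post-warmup shape is one plausible reading of "share the same cosine tail", and the 1.3× / 250-step values simply mirror the example numbers above.

```python
# Sketch of the asymmetric re-warmup: old params keep the original cosine schedule,
# new params warm up to a boosted peak and then decay to the same minimum.
import math

def cosine_lr(step, total_steps, lr_peak, lr_min):
    """The original cosine schedule that the old parameters keep following."""
    t = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(math.pi * t))

def new_param_lr(step, expand_step, total_steps, lr_peak, lr_min,
                 boost=1.3, warm_steps=250):
    """Schedule for the newly added channels only (illustrative values)."""
    start = cosine_lr(expand_step, total_steps, lr_peak, lr_min)  # LR at expansion time
    if step < expand_step:
        return 0.0                                    # the new params do not exist yet
    if step < expand_step + warm_steps:
        frac = (step - expand_step) / warm_steps      # linear warmup: start -> boost*start
        return start * (1 + frac * (boost - 1))
    # Afterwards: a cosine tail from the boosted value down to the shared minimum.
    remaining = max(1, total_steps - expand_step - warm_steps)
    t = min(step - expand_step - warm_steps, remaining) / remaining
    return lr_min + 0.5 * (boost * start - lr_min) * (1 + math.cos(math.pi * t))

# Example: expansion at step 50_000 of a 100_000-step run.
for s in (49_999, 50_000, 50_125, 50_250, 75_000, 100_000):
    print(s, round(cosine_lr(s, 100_000, 3e-4, 3e-5), 8),
          round(new_param_lr(s, 50_000, 100_000, 3e-4, 3e-5), 8))
```

In practice the two schedules would be attached to separate optimizer parameter groups (old vs. newly added parameters) so that only the new channels receive the brief boost.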
- Continue training and monitor
- What happens: Train to the same token budget as the from-scratch target-width model but with fewer total FLOPs.
- Why this step exists: The whole point is cost savings without losing performance.
- Example: Expand at 100B tokens and finish at 200B; measure final loss and downstream task scores.
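A rough sanity check on the claimed savings (back-of-the-envelope only: it assumes expansion at the halfway token point, as in the 100B/200B example above, and ignores the cost of the expansion step itself):

```python
# small/large = per-token FLOPs of the pre-expansion model relative to the target model.
for small_over_large in (0.3, 0.45, 0.6):
    relative_cost = 0.5 * small_over_large + 0.5 * 1.0   # half the tokens at each width
    print(f"cost ratio {small_over_large:.2f} -> total FLOPs {relative_cost:.0%} "
          f"of from-scratch, i.e. ~{1 - relative_cost:.0%} saved")
# Ratios in the 0.3-0.6 range reproduce the ~20-35% savings quoted in the results section.
```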
Secret sauce (what makes it clever):
- It balances both sides of the problem (forward stability via RMS preservation and backward diversity via state reset + re-warmup), so expansion is smooth and useful.
- It works across different width axes and even different optimizers (AdamW and Muon) without optimizer-specific hacks.
- It embraces that tiny bumps right after expansion are okay if the late-stage convergence improves more.
04 Experiments & Results
Top Bread (Hook) Think of testing a bigger backpack mid-hike: Does it carry more stuff without throwing you off balance, and do you reach camp faster?
Filling (The Actual Concept)
- What it is: The authors test mid-stage 2× width growth on Mixture-of-Experts language models to see if training stays stable, learns well, and saves compute.
- How it works:
- Train a smaller OLMoE-style model to the midpoint (e.g., 100B tokens).
- Expand width along different axes: Inner 2× (expert inner dim), Hidden 2× (hidden size), and both together.
- Compare multiple inits (copy, random, zero) with/without RMS-preserved scaling.
- Compare different optimizer-state treatments and learning-rate schedules.
- Evaluate with AdamW and Muon to show optimizer-agnostic behavior.
- Why it matters: Real wins must show up as lower final loss, better downstream accuracy, and less compute.
Bottom Bread (Anchor) It's like trying different ways of packing your bigger backpack, checking which way keeps you steady and gets you to camp with the least effort.
The tests (what they measured and why):
- Reference loss after expansion and final pre-training loss: to see stability and convergence.
- Downstream benchmarks (ARC, MMLU, HellaSwag, etc.): to measure real task performance.
- Compute saved (FLOPs) and wall-clock speed-up: to measure efficiency.
The competition (baselines):
- Train the big model from scratch under the same token budget.
- Naive expansions without RMS scaling.
- Copy-based expansions with symmetric treatments (drop/copy optimizer states).
- Heuristics like uneven splitting or symmetric perturbations.
Scoreboard (with context):
- RMS-preserved scaling vs naive: Naive often shows a smaller immediate loss gap at expansion but loses later; RMS-preserved wins the "final exam," converging to lower loss.
- Copy-only vs copy + asymmetric reset + re-warmup: Copy-only looks smooth but gets stuck; adding asymmetric reset and re-warmup turns the tie into a clear win with the best final loss among variants.
- Across width axes (Inner 2×, Hidden 2×, both): Asymmetric re-warmup consistently lowers final loss; copy-copy benefits the most and ends up on top when paired with the full SPARKLING recipe.
- Across optimizers (AdamW, Muon): The same pattern holdsāRMS-preserved scaling helps late-stage convergence, and re-warmup further improves it, showing optimizer generality.
- Downstream tasks: Even with a slightly higher final pre-training loss than from-scratch training in some cases, SPARKLING often matches or beats the from-scratch target-width baseline on average accuracy over many tasks.
- Compute: Under 2× width growth, SPARKLING saves roughly 20%–35% of FLOPs and delivers up to about a 1.49× wall-clock speed-up compared to training the large model from scratch, like getting an A while doing a third less homework.
Surprising findings:
- Copying seems perfect for forward smoothness, yet it underperforms later unless you break the symmetry. The backward pass is the hidden culprit.
- Zero-initialized parts should be treated statistically like random after the first updates when doing RMS-preservation; strict loss-preservation exactly at the instant of expansion is less important than keeping the right RMS shape soon after.
- Fancy orthogonalization in spectral optimizers doesn't automatically break the copy symmetry; you really need the asymmetric state reset and re-warmup.
Overall message: The combination of RMS preservation, asymmetric reset, and asymmetric re-warmup wins consistently, across axes and optimizers, in both convergence quality and training efficiency.
05 Discussion & Limitations
Top Bread (Hook) If you're upgrading your bike mid-ride, you should know when it helps, when it doesn't, and what tools you need in your backpack.
Filling (The Actual Concept)
- Limitations (what this can't do yet):
- There can be a small gap in final pre-training loss versus from-scratch training, even though downstream accuracy often matches or improves.
- The method is designed and tested mainly for pre-norm Transformers (common in modern LLMs); behavior in very different architectures may vary.
- Both-sides copy without the asymmetric tricks remains weak, so if you can't do the state reset or re-warmup, you may not get the gains.
- Only width growth is systematized; fully unified width+depth progressive growth remains open.
- Multiple expansions and optimal scheduling across many stages aren't fully charted.
- Required resources:
- A mid-sized GPU cluster and stable training stack (the paper used 64×A100 80GB) to pretrain and run controlled expansions.
- Access to optimizer states (AdamW or Muon) to perform asymmetric reset.
- Proper logging/metrics to monitor RMS scale and loss trajectories.
- When NOT to use:
- Very early expansion: early width growth is easier but saves little compute; you may as well train the big model from scratch.
- If you can't safely touch optimizer states (e.g., stateless training setups), copy-based expansion may remain locked.
- If your model lacks residual/pre-norm behavior that benefits from RMS matching, the gains could be smaller.
- If you have strict constraints forbidding brief LR changes, you'll lose a key symmetry-breaking lever.
- Open questions:
- Can we form a single theory to plan both depth and width growth together?
- How does RMS preservation relate to μP (hyperparameter transfer), and can it make expansions truly "tuning-free"?
- What is the best multi-stage schedule (where/when/how much to grow) for different datasets and sizes?
- Can we adapt the approach for other norms/normalizations (LayerNorm variants, GroupNorm) or non-Transformer backbones?
- Could smarter, learned growth operators combine with SPARKLING to automate expansion decisions?
Bottom Bread (Anchor) It's like learning the best times to add more train cars, how to keep the ride smooth, and which stations are best for upgrades; there's a map to draw next.
06 Conclusion & Future Work
Three-sentence summary: Mid-stage width expansion is hard because changing activation scale upsets training and copying creates gradient twins that don't diversify. SPARKLING fixes both by preserving activation RMS during expansion and by breaking symmetry with an asymmetric optimizer-state reset and a brief, targeted learning-rate re-warmup for new channels. The result is stable growth that often matches or beats from-scratch performance while saving up to 35% compute under 2× width growth.
Main achievement: A simple, optimizer-agnostic recipe that jointly controls forward signal scale and backward learning dynamics so new capacity becomes useful quickly instead of staying a clone.
Future directions: Unify width and depth growth under a principled schedule; connect RMS preservation to μP for zero-shot hyperparameter transfer; automate multi-stage expansion decisions; extend to other architectures and normalization schemes.
Why remember this: It turns the scary part of growing models mid-training into a practical playbook: keep the signal steady, nudge the new parts to be different, and you get bigger, better models faster and cheaper.
Practical Applications
- Upgrade a running LLM's width mid-training to hit a higher quality target without restarting from scratch.
- Reduce pre-training costs for new model families by planning mid-stage width expansions with RMS-preserved scaling.
- Retrofit existing MoE models with larger expert inner dimensions to boost capacity under a fixed token budget.
- Safely expand hidden size to improve downstream task performance late in training cycles.
- Adopt asymmetric optimizer-state resets in training pipelines to unlock capacity after copy-based expansions.
- Use short, targeted LR re-warmups on new parameters to accelerate post-expansion recovery.
- Design multi-stage training curricula (small → medium → large) with stable, compute-efficient width jumps.
- Benchmark optimizers (AdamW vs Muon) under the same SPARKLING recipe to choose the best fit for your stack.
- Automate RMS checks and scaling in model surgery tools to prevent activation scale shocks.
- Combine with deployment-time upgrades: expand a near-ready model briefly to close last-mile performance gaps.