ECO: Quantized Training without Full-Precision Master Weights
Key Summary
- Training big AI models uses lots of memory because most methods still keep a separate full-precision copy of the weights called master weights.
- ECO is a new optimizer trick that removes master weights by updating the low-precision (quantized) weights directly and recycling the rounding error into momentum.
- That recycled error creates an error-feedback loop so tiny updates don't get lost, keeping training stable without extra memory.
- Theory shows ECO converges close to the best possible answer and avoids the huge errors that happen if you naively remove master weights.
- With stochastic rounding, ECO's accuracy nearly matches systems that keep master weights, even for large models.
- Experiments on 30M–800M Transformers, Gemma-3 1B, a 2.1B SMoE, and INT4 fine-tuning of DeepSeek-MoE-16B show near-lossless accuracy.
- ECO shifts the memory vs. accuracy trade-off, saving up to about 25% of static memory in some setups.
- It needs no new hyperparameters, adds almost no runtime overhead, and is easy to implement.
- On hardware without stochastic rounding, ECO still helps, though the gains are slightly smaller.
Why This Research Matters
ECO lets teams train big models with less memory by removing the need for master weights, which is a major practical bottleneck. That means you can fit larger models or more experts on the same GPUs, or use cheaper hardware to get similar results. Lower memory also reduces energy use and costs, which is good for budgets and the environment. Because ECO is simple and hyperparameter-free, it's easy to adopt in existing training code. With stochastic rounding, ECO keeps accuracy near the best baselines, so you're not trading quality for savings. This makes cutting-edge training more accessible to smaller labs, startups, and educational groups.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you have a huge LEGO castle. You want to add tiny decorations (small LEGO pieces), but your storage bin is so full that just keeping a second copy of every piece takes up all the space on your floor.
🥬 The Concept (Quantized Training World Before ECO): Many teams train giant language models using low-precision numbers (small LEGO pieces) during the forward and backward passes to save time and memory. But almost everyone still keeps a separate high-precision copy of the weights called master weights to store tiny update steps that would otherwise get lost when rounded. This high-precision copy silently eats most of the memory savings.
How it works (before ECO):
- Keep high-precision master weights in memory.
- Quantize them to low precision for the math of each step.
- Compute gradients and apply updates to the high-precision copy.
- Repeat: quantize again next step.
Why it matters: Without the high-precision copy, small updates can vanish after rounding, making training drift or stall. But with the copy, you lose a lot of memory.
🍞 Anchor: Teams using FP8 for compute still keep FP32 master weights. It runs fast, but memory barely shrinks because that big FP32 buffer never goes away.
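To make the before-ECO loop concrete, here is a minimal sketch of one training step with master weights, using a toy uniform grid as a stand-in for FP8 (the function name, grid size, and quantizer are illustrative, not from the paper):

```python
import numpy as np

def master_weight_step(w_master, grad, lr=1e-3, grid=1/128):
    """One conventional step: quantize for compute, update the FP32 copy."""
    w_q = np.round(w_master / grid) * grid   # low-precision copy for fwd/bwd
    # ... the forward/backward pass would use w_q to produce `grad` ...
    w_master = w_master - lr * grad          # tiny updates survive in FP32
    return w_master, w_q                     # both copies stay in memory
```

Note that `w_master` and `w_q` live in memory at the same time; that duplicated full-precision buffer is exactly what ECO removes.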
🍞 Hook: You know how rounding money to the nearest dollar can make small coins disappear over time? If you always round down, your piggy bank never fills with change.
🥬 The Concept (Quantization): Quantization is switching from big, precise numbers to smaller ones to save space and compute.
How it works:
- Pick a scale so numbers fit into a small range (like choosing how many cents each tick mark is worth).
- Round each number to the nearest tick on that grid.
- Store the rounded number instead of the exact one.
Why it matters: It saves memory and speeds up training, but tiny updates can get lost when they're smaller than a tick.
🍞 Anchor: If your scale's tick is 0.01 and you try to add 0.0003, rounding erases it. Do that a lot, and you miss many tiny but important changes.
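A minimal sketch of that grid-snapping, with an illustrative tick size of 0.01, showing how a sub-tick update is erased:

```python
def quantize_rtn(x, tick=0.01):
    """Round-to-nearest on a uniform grid with the given tick size."""
    return round(x / tick) * tick

w = 1.00
w = quantize_rtn(w + 0.0003)  # the +0.0003 update is below one tick...
print(w)                      # 1.0 -- so rounding erases it completely
```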
🍞 Hook: Imagine two ways to round your allowance: always to the nearest dollar, or sometimes up and sometimes down so, on average, it's fair.
🥬 The Concept (Stochastic Rounding): Stochastic rounding sometimes rounds up and sometimes down with a probability that keeps things unbiased on average.
How it works:
- Find the two nearest grid points (down and up).
- Flip a weighted coin: the closer one wins more often.
- Over many steps, the average is correct, even if each step is a little noisy.
Why it matters: It prevents tiny updates from being erased in a biased way; over time, they still add up.
🍞 Anchor: If 1.49 sometimes rounds to 1 and sometimes to 2, then over many tries you get about 1.49 on average, not stuck always at 1.
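A minimal sketch of stochastic rounding on a unit grid; averaging many samples shows the unbiasedness the text describes:

```python
import random

def stochastic_round(x, tick=1.0):
    """Round down or up at random, weighted so the average equals x."""
    lo = (x // tick) * tick
    p_up = (x - lo) / tick        # closer to the upper point -> more likely up
    return lo + tick if random.random() < p_up else lo

samples = [stochastic_round(1.49) for _ in range(100_000)]
print(sum(samples) / len(samples))  # ~1.49 on average, not stuck at 1
```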
🍞 Hook: Think of a sports team where only a few specialists enter the game depending on the play.
🥬 The Concept (Sparse Mixture of Experts, SMoE): SMoE models activate only a few expert sub-networks per token, saving compute.
How it works:
- Many experts exist, each good at different patterns.
- A router picks a small set of experts for each input.
- Only chosen experts run, saving time and memory.
Why it matters: Even though only a few experts are active, you must still store all experts' weights, so master weights become a massive memory burden.
🍞 Anchor: If a SMoE has 32 experts but uses only 4 each time, you still keep all 32 in memory; removing master weights would be a big win.
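For intuition, here is a minimal top-k routing sketch (names and details are illustrative; real SMoE routers add load balancing and batching). Out of 32 experts, only 4 run per token, yet all 32 weight tensors must stay resident:

```python
import numpy as np

def route(router_scores, k=4):
    """Pick the k highest-scoring experts and their mixing weights."""
    top = np.argsort(router_scores)[-k:]    # indices of the chosen experts
    logits = router_scores[top]
    weights = np.exp(logits - logits.max()) # softmax over the chosen few
    return top, weights / weights.sum()     # only these experts will run

rng = np.random.default_rng(0)
chosen, mix = route(rng.normal(size=32))    # 4 active out of 32 stored
```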
🍞 Hook: You know how a to-do list helps you remember small tasks you couldn't do right away?
🥬 The Concept (Error Feedback): Error feedback keeps track of the difference lost by rounding and adds it back later.
How it works:
- After rounding, compute the rounding error (what got lost).
- Save that error.
- Next time, add the saved error to your new update so nothing is wasted over time.
Why it matters: Without error feedback, a lot of small but important tweaks vanish; models can drift or learn slower.
🍞 Anchor: If you meant to save $1.49 but could only drop in $1, you note the missing 49¢ and add it next time. Over time, you save the full amount.
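A minimal sketch of classic error feedback with an explicit residual buffer. Note the extra `residual` state: that extra buffer is precisely the memory cost ECO later avoids by reusing momentum (the numbers here are illustrative):

```python
def round_with_feedback(update, residual, tick=0.01):
    """Apply a rounded update, carrying the rounding remainder forward."""
    total = update + residual             # add back what was lost before
    applied = round(total / tick) * tick  # snap to the grid
    return applied, total - applied       # new leftover to repay later

residual, applied_sum = 0.0, 0.0
for _ in range(3):                        # three +0.004 updates, tick 0.01
    applied, residual = round_with_feedback(0.004, residual)
    applied_sum += applied
print(applied_sum, residual)              # ~0.01 applied, ~0.002 still owed
```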
🍞 Hook: Imagine trying to stop keeping that extra high-precision copy to save memory. Sounds great, until the model starts wobbling.
🥬 The Problem: If you remove master weights and just update the low-precision weights, many tiny updates vanish, or noise stacks up in a bad way.
How it showed up:
- Naive master-weight removal often diverges or gets much worse accuracy, especially as the learning rate gets small.
- Some prior tricks worked only for tiny models or special cases, not for big LLMs.
- Classic error feedback needs an extra error buffer, which ruins the memory savings we wanted.
The Gap: We need a way to train directly on quantized weights that (a) keeps tiny updates alive, (b) doesnāt add any new memory buffers, and (c) scales to large LLMs and SMoEs.
Real Stakes for Daily Life:
- Train bigger models on the same hardware, or the same models on cheaper hardware.
- Lower energy and costs for companies, research labs, and the planet.
- Make on-device or edge training more practical in the future.
- Faster iteration cycles because youāre not memory-bound.
- More accessible AI research and development for smaller teams.
02 Core Idea
🍞 Hook: You know how when you write with a marker on a dotted grid, tiny strokes can get snapped to the nearest dot and some details disappear? What if your pen had a memory that gently pushed the next stroke to include what got lost?
🥬 The Concept (ECO: Error-Compensating Optimizer): ECO is a way to train directly on low-precision weights while sneaking the rounding loss into the momentum, so your next step remembers what rounding threw away.
How it works:
- Do a normal optimizer step to get a tentative new weight (in high precision just for the step math).
- Quantize that tentative weight back to low precision.
- Measure the rounding error (the piece you lost when snapping to the grid).
- Add that error into the optimizerās momentum buffer.
- Next step, the momentum gently returns the lost piece, creating an error-feedback loop without any extra memory.
Why it matters: It removes the need for master weights, preserves tiny updates, and stays stableāso you save lots of memory with near-lossless accuracy.
🍞 Anchor: If your rounded weight missed +0.003, ECO stores that +0.003 in momentum, so the next update includes it. Over time, you don't lose anything.
Three Analogies for the Same Idea:
- Piggy-bank change: When rounding drops the coins, ECO sweeps the spilled coins into your savings jar (momentum) so next time they still count.
- Sticky notes: If a teacher erases part of the board (rounding), ECO writes a sticky note of the missing bit and uses it in the next class (next step).
- Moving walkway: Even if you take tiny steps that donāt register (below the grid), the walkway (momentum with injected error) carries you forward by the missed distance over time.
Before vs After:
- Before: Keep FP32 master weights to protect tiny updates, or risk losing them and diverging.
- After: Update quantized weights directly, but recycle the quantization error into momentum. No master weights, similar accuracy, less memory.
🍞 Hook: You know how rolling a snowball gathers flakes you missed the first time? Momentum is like that gentle rolling.
🥬 The Concept (Momentum, as used by ECO): Momentum stores a smoothed version of recent updates, helping carry information forward between steps.
How it works:
- Compute the current gradient-based update.
- Blend it with the previous momentum (a smoothing factor β decides how much past you keep).
- Use this blended value to update the weights.
Why it matters: With ECO, we add the rounding error into that blend, so nothing small gets lost.
🍞 Anchor: If your update should be +0.0103 but you only apply +0.010 because of rounding, the missing +0.0003 goes into momentum and gets returned later.
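Putting momentum and error recycling together, here is a minimal sketch of an ECO-style SGD-with-momentum step on quantized weights. The injection scale err / (lr * beta) is one plausible choice (picked so that momentum's decay repays the error on the following step); the article only says the scale is tied to the learning rate and momentum factor, so treat the exact constant as an assumption, not the paper's formula:

```python
import numpy as np

def quantize_sr(x, tick, rng):
    """Stochastic rounding of an array to a uniform grid."""
    lo = np.floor(x / tick) * tick
    return lo + tick * (rng.random(x.shape) < (x - lo) / tick)

def eco_sgdm_step(w_q, m, grad, lr=0.1, beta=0.9, tick=1/128, rng=None):
    """One ECO step: no master weights, rounding error recycled into momentum."""
    rng = rng or np.random.default_rng()
    m = beta * m + grad                     # usual momentum blend
    w_tent = w_q - lr * m                   # high precision only transiently
    w_new = quantize_sr(w_tent, tick, rng)  # snap back to the storage grid
    err = w_tent - w_new                    # what rounding threw away
    m = m - err / (lr * beta)               # ECO: momentum remembers the loss
    return w_new, m                         # only low-precision weights persist
```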
Why It Works (intuitive logic, no equations):
- Consecutive rounding errors are very similar from one step to the next (the grid and sizes don't change much), so pushing today's lost bit into momentum is a reliable way to repay it tomorrow.
- With stochastic rounding, the rounding is unbiased on average, so the momentum averages out noise instead of locking in a bias.
- The model can only end up on grid points anyway, so your best possible answer is a small neighborhood near the true continuous optimum; ECO provably gets you there.
- Naive removal of master weights lets errors pile up the wrong way, especially when you shrink the learning rate; ECO prevents that blow-up by recycling errors constructively.
Building Blocks:
- Quantization: Smaller number format for weights.
- Momentum: Smooth running total of recent updates.
- Error Feedback: Track what rounding lost; put it back later.
- Stochastic Rounding (best case): Keeps rounding fair on average.
- ECO's Trick: Reuse the existing momentum buffer to store the lost bit; no extra memory.
🍞 Anchor: In tests, ECO's accuracy sticks close to systems with master weights, but with up to roughly 25% less static memory, especially helpful for SMoE models where weights dominate.
03 Methodology
High-Level Pipeline: Input (quantized weights and optimizer state) → Optimizer step on quantized weights → Quantize the tentative new weights → Compute rounding error → Inject that error into momentum → Output (new quantized weights and updated optimizer state).
Step-by-step with the Sandwich pattern for each key step:
- Do the optimizer step directly on quantized weights. 🍞 Hook: Imagine cooking with measuring spoons that only measure whole teaspoons. You still follow the recipe, even if you can't pour half a pinch exactly. 🥬 The Concept: We compute the next weight update using the usual optimizer (SGD with momentum or Adam) but starting from the quantized weights in memory. How it works:
- Read the current quantized weights.
- Compute gradients from the forward/backward pass.
- Form the tentative update using your optimizer's rules (e.g., blend with momentum). Why it matters: If we didn't do the step using quantized weights, we'd need master weights again, losing the memory savings. 🍞 Anchor: Suppose the weight is 0.123 in FP8 scale and the momentum suggests +0.010. We tentatively go to 0.133 before snapping back to the grid.
- Quantize the tentative new weights. 🍞 Hook: You know how snapping a photo to a pixel grid makes details line up to the nearest pixel? 🥬 The Concept: After computing the tentative high-precision value for the next weight, we round it back to the low-precision grid. How it works:
- Use your chosen quantization (e.g., FP8 or INT4) and rounding method (stochastic is best when available).
- Produce the new on-grid weight that will be stored and used next step. Why it matters: This keeps memory low and matches the format used for forward/backward. 🍞 Anchor: Tentative 0.133 becomes either 0.132 or 0.134 depending on rounding.
- Compute the rounding error. 🍞 Hook: If you wanted to save $1.33 but your jar can only hold whole cents, you note the leftover fraction as an IOU. 🥬 The Concept: The rounding error is the small difference between the tentative high-precision weight and the snapped low-precision weight. How it works:
- Subtract: error = tentative - quantized.
- This error is tiny but important because it's what would be lost without compensation. Why it matters: If we ignore this error, small but crucial learning signals vanish. 🍞 Anchor: Tentative 0.133 became 0.132; the error is +0.001.
- Inject the error into momentum (the secret sauce). 🍞 Hook: You know how a rolling average remembers what happened recently? What if it also remembered what rounding erased? 🥬 The Concept: ECO adds a scaled version of the rounding error into the optimizer's momentum buffer. How it works:
- Take the tiny error and add it to momentum, scaled by the learning-rate and momentum settings (the code does this per-parameter and per-step).
- For Adam, use its element-wise effective step sizes to scale the injection. Why it matters: This creates a memory of lost updates, so the next step pays them back. No extra buffers are needed; momentum already exists. 🍞 Anchor: That +0.001 missing piece gets tucked into momentum now. Next step, momentum gently adds it back to the weight.
- Repeat next step. 🍞 Hook: Think of a Ferris wheel where each bucket carries a small souvenir to the next station. 🥬 The Concept: Every step, ECO repeats the cycle so no small update gets permanently lost. How it works:
- New quantized weights become the starting point.
- Momentum already carries unspent change from before.
- Next update reflects both fresh gradients and the saved rounding bits. Why it matters: Over time, updates are preserved, training remains stable, and memory savings persist. 🍞 Anchor: After many steps, the model reaches nearly the same accuracy as if you had master weights, without paying that memory cost.
Concrete Mini-Example:
- Suppose a weight is 1.00 and momentum suggests +0.003, but your grid has 0.002 steps.
- Tentative: 1.003 → quantize to 1.002; error = +0.001.
- ECO puts +0.001 into momentum.
- Next step, even if the new pure update is only +0.001, the momentum returns the saved +0.001, totaling +0.002, which now lands exactly on the grid.
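The same mini-example in code (this toy quantizer breaks the 1.003 tie downward to mirror the numbers above; that tie-breaking choice is illustrative):

```python
import math

def quantize(x, tick=0.002):
    """Round to the nearest grid point; exact ties round down."""
    return tick * math.floor(x / tick + 0.5 - 1e-9)

w, m = 1.00, 0.003
tentative = w + m                    # 1.003
w = quantize(tentative)              # snaps to 1.002
err = tentative - w                  # +0.001 saved into momentum, not lost
m = 0.001 + err                      # next pure update plus the saved error
print(round(quantize(w + m), 3))     # 1.004 -- lands exactly on the grid
```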
Why each step exists and what breaks without it:
- Optimizer step on quantized weights: avoids needing master weights; otherwise you lose memory savings.
- Quantize tentative weights: keeps storage and compute consistently low-precision; otherwise formats drift.
- Compute error: identifies what the grid threw away; otherwise tiny updates vanish.
- Inject error into momentum: returns lost value later; otherwise learning can stall or diverge.
SGD with Momentum vs. Adam:
- SGDM: ECO adds the error to the single momentum buffer with a scale tied to the learning rate and momentum factor.
- Adam: Same idea, but Adam's per-parameter adaptive step sizes are used to scale the error injection element-wise (see the hedged sketch after this list).
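A hedged sketch of the Adam variant. Adam's effective per-element step size is roughly lr / (sqrt(v) + eps), so dividing the weight-space error by it converts the error into first-moment units; the 1/beta1 factor anticipating one decay step is an assumption on my part, not the paper's stated formula:

```python
import numpy as np

def eco_adam_inject(m, v, err, lr, beta1=0.9, eps=1e-8):
    """Inject the rounding error into Adam's first moment, element-wise."""
    eff_step = lr / (np.sqrt(v) + eps)   # Adam's per-parameter step size
    return m - err / (eff_step * beta1)  # scaled so a later step repays err
```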
Implementation Tips:
- No new hyperparameters are needed beyond your usual optimizer settings.
- Use stochastic rounding if your hardware supports it; ECO still helps with round-to-nearest but SR is better.
- You can start by applying ECO to linear layers inside Transformer blocks, keeping embeddings/outputs high precision if desired (a hypothetical selection helper is sketched below).
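One way to follow that tip in a PyTorch-style codebase is to split parameters into an ECO group and a high-precision group. This helper and its name patterns are hypothetical, not from the paper:

```python
import torch.nn as nn

def eco_param_groups(model, skip=("embed", "head")):
    """Apply ECO only to linear layers inside blocks (illustrative)."""
    eco, rest = [], []
    for name, module in model.named_modules():
        use_eco = isinstance(module, nn.Linear) and not any(s in name for s in skip)
        for p in module.parameters(recurse=False):
            (eco if use_eco else rest).append(p)
    return [{"params": eco, "use_eco": True},
            {"params": rest, "use_eco": False}]
```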
The Secret Sauce:
- Error feedback without extra memory: momentum doubles as the error storage.
- Stability at small learning rates: avoids the 1/learning-rate blow-up that hurts naive no-master-weight training.
- Consecutive errors are similar: ECO exploits this to keep training smooth and accurate.
- Simplicity: A tiny code change; almost no runtime overhead.
04 Experiments & Results
The Test: The authors checked whether ECO keeps training accurate and stable without master weights, across different model sizes and tasks, while reducing memory. They measured validation loss (how well the model generalizes) and static memory (how much is permanently tied up by weights and optimizer states).
The Competition (Baselines):
- High-precision anchors: BF16 compute with FP32 master weights; FP8 compute with master weights (round-to-nearest or stochastic rounding).
- No-master-weight baselines: Update quantized weights directly using FP8 or INT4, with either round-to-nearest or stochastic rounding, without ECO.
- ECO variants: Same quantization setups, but with ECO's error injection (RTN or SR).
Scoreboard with Context:
- Small Transformers (30M–800M parameters):
- Naive removal of master weights often diverged (or got clearly worse) without ECO.
- ECO with stochastic rounding matched or came very close to the validation loss of the master-weight baselines. Think of it like getting an A- to A when others without ECO are scoring C's or even failing to finish the test.
- With round-to-nearest (no SR), ECO still improved a lot over naive, though SR provided the best results.
- Gemma-3 1B pretraining:
- ECO + SR practically overlapped the curves of master-weight baselines, showing near-lossless accuracy.
- SMoE 2.1B pretraining:
- ECO outperformed naive removal by a wide margin and came close to master-weight systems, particularly with SR.
- DeepSeek-MoE-16B fine-tuning in INT4:
- Naive no-master-weight training diverged.
- ECO (both RTN and SR) stayed stable and matched master-weight baselines on training loss and zero-shot benchmarks like ARC, GSM8K, HellaSwag, PIQA, and MMLU. That's like running the race at the same pace while carrying a lighter backpack.
Memory Savings:
- In SMoE settings where weights dominate memory, removing FP32 master weights cut static memory from about 12 bytes/parameter to about 9, roughly a 25% reduction, while keeping accuracy near-lossless with ECO (especially with SR); see the arithmetic sketch after this list.
- The Pareto frontier (memory vs. validation loss) shifted in ECO's favor: for the same loss, you can use less memory; for the same memory, you can often get better loss.
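One per-parameter accounting consistent with those numbers, offered as an assumption since the paper's exact breakdown may differ: an FP32 master (4 B) plus two FP32 Adam moments (8 B) gives 12 B/param; storing an FP8 weight directly (1 B) plus the same moments gives 9 B/param.

```python
moments = 4 + 4                    # Adam first + second moment, FP32 (assumed)
with_master = 4 + moments          # FP32 master weights: 12 bytes/parameter
with_eco = 1 + moments             # FP8 weight stored directly: 9 bytes/parameter
print(1 - with_eco / with_master)  # 0.25 -> the ~25% static-memory saving
```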
Surprising and Informative Findings:
- Consecutive quantization errors were highly similar in both size and direction; this supports ECO's assumption that reusing the current error in momentum is a strong approximation.
- With RTN only (no SR), ECO still helps a lot but keeps a slightly higher noise floor, matching the theory that unbiased rounding (SR) is best.
- Runtime overhead was negligible since ECO's injection is a simple element-wise operation.
Theory vs. Practice:
- Theory predicted: naive no-master-weight training can blow up as you reduce the learning rate; ECO stays stable and converges near the best grid-constrained solution.
- Practice confirmed: naive runs often diverged or degraded, while ECO runs were stable and accurate across sizes and tasks.
05 Discussion & Limitations
Limitations:
- Works best with stochastic rounding (SR). On hardware that only supports round-to-nearest (RTN), ECO still helps but the final accuracy can sit on a slightly higher noise floor.
- When master weights are allowed and you prefer RTN (which can be slightly better in that setup), ECO may fall slightly short of the absolute best RTN-with-master-weights runs.
- Some layers (like embeddings or output heads) may still need special care; many experiments focused ECO on Transformer internal linear layers first.
Required Resources:
- A training stack that supports low-precision weight storage (FP8 or INT4) and, ideally, stochastic rounding.
- Standard optimizer states (momentum for SGDM or Adam). ECO reuses these; no extra buffers.
- Usual LLM training infrastructure (data pipelines, distributed training, checkpointing). ECO adds almost no extra compute.
When NOT to Use:
- If your hardware absolutely cannot support SR and you're chasing the final decimal of accuracy at any cost, a master-weight RTN baseline may still be slightly stronger.
- If your optimizer is highly custom and doesnāt use momentum-like buffers, ECOās trick may not directly apply.
- If memory is abundant and simplicity is key, sticking with standard master weights might be more straightforward.
Open Questions:
- How far can ECO scale in trillion-parameter regimes and ultra-long training runs, especially with different learning rate schedules?
- What are the best practices for combining ECO with optimizer-state quantization (e.g., 8-bit or 4-bit moments) and activation checkpointing?
- Can we extend the same no-extra-memory error-feedback principle to other optimizers or second-moment buffers while keeping stability?
- How does ECO interact with advanced routing or load-balancing strategies in very large SMoE models?
- Can future hardware add efficient SR everywhere so ECO's strongest guarantees become the default?
06 Conclusion & Future Work
Three-Sentence Summary:
- ECO trains directly on quantized weights and injects the lost rounding error into momentum, removing the need for full-precision master weights.
- This error-feedback loop preserves tiny updates, keeps training stable, and matches the accuracy of master-weight baselines, especially with stochastic rounding, while saving up to roughly 25% static memory in some settings.
- Theory explains why naive removal fails and why ECO converges to a near-optimal neighborhood; experiments across model sizes and tasks confirm near-lossless accuracy and negligible runtime cost.
Main Achievement:
- ECO establishes a practical, general path to master-weight-free quantized training with no extra memory overhead, by reusing the optimizer's momentum as an error-feedback buffer.
Future Directions:
- Combine ECO with optimizer-state quantization and more aggressive formats (FP4/INT4) for even larger savings.
- Broaden support and best practices for hardware stochastic rounding.
- Extend analysis and implementation guidance for more optimizers and more layers (including embeddings and output heads).
Why Remember This:
- ECO changes the default assumption that master weights are required for stable quantized training. It's a simple, elegant trick (store the rounding crumbs in momentum) that unlocks big memory savings without sacrificing accuracy.
Practical Applications
- Train larger LLMs on the same number of GPUs by freeing memory from master weights.
- Increase the number of active experts in SMoE models without exceeding memory limits.
- Reduce training costs for startups and research labs by using smaller or fewer GPUs.
- Deploy quantization-aware training on hardware with limited memory, moving toward on-device or edge fine-tuning.
- Speed up experimentation cycles by avoiding out-of-memory errors and enabling larger batch or sequence lengths elsewhere.
- Combine ECO with optimizer-state quantization for deeper memory savings on massive runs.
- Stabilize low-precision fine-tuning (e.g., INT4) that would otherwise diverge without ECO.
- Run multi-tenant training clusters more efficiently by packing more jobs per node.
- Prototype quantized training algorithms rapidly, since ECO needs no new hyperparameters.
- Improve the memory–accuracy Pareto frontier in production training pipelines.