
On Surprising Effectiveness of Masking Updates in Adaptive Optimizers

Intermediate
Taejong Joo, Wenhan Xia, Cheolmin Kim et al. · 2/17/2026
arXiv

Key Summary

  • The paper finds that a simple trick—randomly skipping some parameter updates—can train large language models better than fancy optimizers.
  • Skipping half the block updates (and scaling the ones you keep) smooths the path of learning, pushing the model toward flatter, more reliable solutions.
  • This smoothing happens automatically because masking adds a curvature-aware regularization without ever computing curvature matrices.
  • They introduce Magma, which keeps the skip trick but also checks whether the current gradient agrees with momentum; aligned directions get boosted, noisy ones get damped.
  • Magma is a tiny wrapper around your favorite optimizer (like Adam or RMSProp) and adds almost no extra compute or memory.
  • On Llama pre-training (60M–1B), Magma consistently lowers perplexity and beats strong methods like Muon; at 1B, RMSProp+Magma reaches 13.19 vs Adam’s 16.35 and Muon’s 14.52.
  • Magma also shines with mixture-of-experts models and under heavy-tailed gradient noise, where it stays in better-conditioned regions of the loss landscape.
  • Dense momentum updates are key: keeping momentum updated everywhere (even when a block’s parameters are skipped) stabilizes training.
  • Block-wise masking matches transformer geometry well and is efficient to implement.
  • Overall, the work challenges the idea that dense updates are always best and shows structured randomness can improve both stability and generalization.

Why This Research Matters

Training giant language models is expensive and fragile—runs can fail or underperform, wasting time and compute. This paper shows that a tiny change—randomly skipping some block updates and aligning the rest with momentum—can make training both steadier and better. Because Magma is a drop-in wrapper with negligible cost, teams can adopt it quickly in existing pipelines. Lower perplexity means clearer, more reliable language understanding, which benefits search, assistants, coding tools, and more. The method also broadens the safe learning-rate range, reducing hyperparameter tuning pain. Finally, it challenges the ā€œdense is bestā€ assumption and opens a path to simpler, stochastic techniques that improve generalization without heavy machinery.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re building a huge Lego castle. Every brick you place matters, and if you rush and place too many at once, the wall can wobble or collapse. Careful, steady building gets you a stronger castle.

🄬 Filling (The Actual Concept):

  • What it is: Training large language models (LLMs) is like building that Lego castle—millions to billions of pieces (parameters) get adjusted step by step using an optimizer.
  • How it works (before this paper): The common way is to push on every piece at once using dense adaptive optimizers like Adam or RMSProp. These tools look at how the model is doing and automatically change each piece’s step size to try to learn faster and more safely.
  • Why it matters: Even though dense adaptive updates are fast and convenient, they can struggle with bumpy learning surfaces (loss landscapes) that have sharp edges, noise, and unstable spots—common in transformers and mixture-of-experts models.

šŸž Bottom Bread (Anchor): Think of your optimizer as a smart screwdriver that tightens every screw on each pass. That’s efficient—but if the wood is knotty (sharp curvature) and the drill pushes hard on every screw all the time, some screws strip. Sometimes doing a little less in the trickiest spots actually builds better furniture.

— New Concept: Adaptive Optimizers — šŸž Hook: You know how a good chef tastes soup and then decides to add more salt or turn down the heat? 🄬 Concept: Adaptive optimizers are training tools that automatically adjust how big each parameter’s step should be, based on recent gradients.

  • How it works: (1) Track averages of recent gradients (momentum) and their sizes (second moments). (2) Use these to pick a safe step size per parameter. (3) Update all parameters at once (dense update).
  • Why it matters: Without adaptivity, some parameters overshoot and others move too slowly; learning becomes unstable and slow. šŸž Anchor: Adam is like the chef who tastes as they cook, so the soup turns out right even if the ingredients vary a lot.
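The chef's tasting-as-you-cook loop can be sketched in a few lines. Below is a minimal, illustrative Adam-style step using textbook defaults—this is a sketch of the general idea, not the paper's code:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: track momentum and gradient sizes, then move every
    parameter with its own safe step size (a dense update)."""
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2        # second moment (gradient size)
    m_hat = m / (1 - b1**t)                # bias corrections
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # all parameters move
    return theta, m, v
```

Note that every parameter moves every step—exactly the "dense update" behavior the paper later questions.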

— The World Before —

  • Dense optimizers (Adam, RMSProp) became the default for LLMs because backpropagation gives gradients for all parameters in one pass, so updating everything together is efficient.
  • People tried sparse methods (like coordinate descent) that change only a few parameters at a time, but those didn’t fit well with how we compute gradients for LLMs and often underused the available information.
  • Transformers have quirky geometry: some directions are extremely sharp (tiny safe steps), others are gentle (bigger steps are okay). Also, gradients in LLMs can be heavy-tailed (sometimes giant spikes appear), making learning extra tricky.

— The Problem —

  • Dense updates keep pushing everywhere, even along sharp, noisy directions, which can cause training spikes, instability, and worse generalization.
  • Advanced, matrix-based preconditioners (like Muon) help but add complexity and may still struggle with the messiness of real LLM training.

— Failed Attempts —

  • Subspace or low-rank update methods (e.g., picking a subset of directions repeatedly) can miss important directions, get stuck optimizing the wrong slice, and don’t always save memory in today’s regimes where activation memory dominates.
  • Deterministic ā€œcautiousā€ rules that block updates when gradient signs disagree with momentum can help, but they don’t inject the right kind of randomness to nudge learning toward flatter regions.

— The Gap —

  • We lacked a drop-in, nearly free method that (a) respects transformer geometry, (b) tames sharp, noisy directions, (c) uses information we already compute (like momentum), and (d) works at LLM scale consistently.

— The Paper’s Idea —

  • Randomly skip some parameter block updates (ā€œmaskingā€) each step, but keep momentum updated for all blocks. Surprisingly, this improves results. Even better, use how well the gradient agrees with momentum (their cosine similarity) to decide how strongly each block should be updated after masking.

— Real Stakes —

  • Better training stability means fewer failed runs and less wasted compute.
  • Better generalization (lower perplexity) means models that understand text more reliably.
  • A tiny code change with near-zero overhead that boosts performance saves time, money, and energy across many training jobs.

— New Concept: Random Masking — šŸž Hook: You know how in class the teacher sometimes calls on only a few students to answer, not everyone at once? 🄬 Concept: Random masking means we randomly choose which parameter blocks to update this step and skip the rest, so not everyone ā€œspeaksā€ every time.

  • How it works: (1) Split parameters into blocks. (2) Flip a coin per block (e.g., 50-50). (3) If heads, update that block with a slightly bigger step to balance the skipped ones; if tails, skip it this step. (4) Still update momentum for all blocks.
  • Why it matters: Skipping adds a gentle, geometry-aware dampening that avoids pushing hard in sharp, risky directions. šŸž Anchor: It’s like watering half your garden today and the other half tomorrow. You still nourish everything over time, but you avoid overwatering delicate plants.

— New Concept: Blocks — šŸž Hook: When you sort your Lego pieces into bins (by color or shape), you can work faster and smarter. 🄬 Concept: Blocks are grouped sets of parameters (like all weights in attention or MLP layers) treated as a unit for masking and scoring.

  • How it works: (1) Partition model parameters by layer or module. (2) Apply one mask decision per block. (3) Track alignment and momentum per block.
  • Why it matters: Transformers show strong within-block coupling in curvature; block-wise actions match this geometry. šŸž Anchor: Updating an entire attention head together makes more sense than flipping thousands of tiny switches independently.
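As a rough sketch, a block partition can be as simple as grouping a model's named parameters by their module prefix. The helper and the naming scheme below are illustrative assumptions, not the paper's implementation:

```python
def partition_into_blocks(named_params):
    """Group parameters into blocks by module prefix, so one mask decision
    covers e.g. a whole attention projection or MLP layer."""
    blocks = {}
    for name, p in named_params:
        prefix = name.rsplit(".", 1)[0]    # e.g. 'layers.0.attn' from 'layers.0.attn.weight'
        blocks.setdefault(prefix, []).append(p)
    return blocks
```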

02Core Idea

šŸž Top Bread (Hook): Imagine riding a bike on a path with smooth stretches and surprise speed bumps. If you sometimes lift your feet off the pedals for a moment, you avoid putting power down on the bump and keep your ride smoother.

🄬 Filling (The Actual Concept):

  • What it is: The key insight is that randomly skipping some parameter block updates—and scaling the ones you keep—acts like an automatic, curvature-aware regularizer that smooths the training path. Then, align the size of each surviving update with how well the gradient agrees with momentum (Magma).
  • How it works:
    1. Start with any adaptive optimizer’s update (Adam, RMSProp, etc.).
    2. For each block, flip a coin: if it’s selected, apply the update; if not, skip the parameter change but still update momentum for that block.
    3. Scale the applied update. In Magma, multiply by an alignment score from sigmoid(cosine_similarity(gradient, momentum)/temperature), smoothed over time.
    4. Repeat every step. Over time, this gently avoids sharp, unstable directions and sticks with consistent, signal-rich directions.
  • Why it matters: Without this, dense updates keep pushing in sharp or noisy directions, causing spikes and worse generalization; masking plus alignment tamps down those risky moves.

šŸž Bottom Bread (Anchor): Think of a choir. If some singers keep going off-key, the conductor lowers their volume (damps) and occasionally lets them rest (skips), while the on-key sections carry the melody. The performance sounds clearer and more stable.

— New Concept: Geometric Regularization — šŸž Hook: You know how skaters choose a smoother line on the ice to stay balanced and elegant? 🄬 Concept: Geometric regularization gently guides learning to take smoother, safer paths (flatter regions) in the loss landscape.

  • How it works: (1) Masking creates an expected penalty for pushing along sharp directions. (2) This penalty is stronger where curvature is high. (3) The optimizer naturally prefers flatter routes where progress is steadier.
  • Why it matters: Flatter solutions tend to generalize better, meaning the model performs well not just on training data but on new data, too. šŸž Anchor: It’s like choosing a gently sloped hiking trail that lets you keep a steady pace, rather than a rocky scramble that slows you down and risks a fall.

— Three Analogies for the Idea —

  1. Stepping stones: When the riverbed is slippery (sharp directions), sometimes you skip a step and use the next safer stone; you still cross, just with fewer slips.
  2. Traffic lights: Random reds (masks) stop some lanes, and greens (scaled updates) let aligned lanes move; traffic flows smoother through busy intersections.
  3. Team rowing: If a rower is out of rhythm (misaligned gradient), you damp their stroke. The boat stays straight and fast because the team keeps stroke with momentum.

— Before vs. After —

  • Before: Dense optimizers push everywhere, every step, even in risky directions; stabilizers are either complex (matrix preconditioners) or deterministic (missing useful randomness).
  • After: Skip-and-scale yields smoother progress with tiny overhead; Magma uses momentum–gradient agreement to further prefer trustworthy moves. The result is better stability, lower perplexity, and wider safe learning-rate ranges.

— Why It Works (Intuition) —

  • Random masking changes higher-order behavior: it keeps the average update the same but adds a curvature-sensitive penalty that discourages moving along sharp directions.
  • Alignment scoring rewards directions with consistent signal (gradient agrees with accumulated momentum) and suppresses jittery, noise-dominated ones.
  • Keeping momentum dense gives you a steadier compass, even when some blocks pause their parameter updates.

— Building Blocks —

  • Mask coin flip per block (e.g., 50% keep rate) and rescale survivors.
  • Alignment score = sigmoid(cosine_similarity(gradient, momentum)/temperature), then smoothed by an exponential moving average.
  • Dense momentum updates for all blocks, every step.
  • Apply as a simple multiplier to any base optimizer’s update.

— New Concept: Curvature (and Hessian) — šŸž Hook: Picture a bowl. A wide, shallow bowl is easy to balance a marble in; a narrow, steep bowl makes the marble roll fast and unpredictably. 🄬 Concept: Curvature describes how sharp or flat the loss surface is near your point; the Hessian matrix measures this mathematically.

  • How it works: (1) High curvature = sharp, risky directions; low curvature = gentle directions. (2) Masking creates a stronger penalty in high-curvature blocks. (3) The optimizer naturally avoids those sharp paths.
  • Why it matters: Avoiding sharpness leads to more stable training and better generalization. šŸž Anchor: In a steep bowl, tiny nudges send the marble flying; masking acts like a brake in those directions so you don’t overshoot.

— New Concept: Cosine Similarity — šŸž Hook: When two friends walk in the same direction, they get where they’re going faster. 🄬 Concept: Cosine similarity measures how aligned two vectors are—in this case, the gradient (now) and momentum (long-term trend).

  • How it works: (1) Compute cosine similarity between gradient and momentum per block. (2) Pass it through a sigmoid to get a score in (0,1). (3) Smooth it over time; multiply the block’s update by this score.
  • Why it matters: Aligned signals are more trustworthy; misaligned ones are more likely noise, so they get damped. šŸž Anchor: If your map (momentum) and the road sign (gradient) point the same way, you speed up a little; if they disagree, you slow down.

03Methodology

At a high level: Inputs → (Base optimizer proposes updates) → (Per-block coin flip mask) → (Per-block alignment score and smoothing) → (Multiply: mask Ɨ alignment Ɨ base update) → Output new parameters (while momentum stays dense)

Step-by-step recipe:

  1. Prepare the ingredients
  • What happens: Choose a base optimizer (Adam, RMSProp, LaProp, etc.). Organize parameters into blocks (e.g., all weights in an attention head or MLP layer). Keep track of momentum (first-moment estimate) and, if applicable, second moments.
  • Why this exists: Blocks match transformer geometry; momentum stores the smoothed direction of descent.
  • Example: Suppose we have two blocks: Block A = attention weights, Block B = MLP weights. Momentum μ_A and μ_B are running averages of recent gradients.
  2. Base update direction Ī”
  • What happens: The base optimizer computes its proposed update Ī” for each block using the current gradient and its preconditioning (like dividing by RMS of past gradients in RMSProp).
  • Why this exists: Ī” is the best guess of how to move this step before masking or alignment.
  • Example: RMSProp suggests Ī”_A = āˆ’0.01 and Ī”_B = āˆ’0.005 (vector updates, simplified as scalars here for clarity).
  3. Random masking (SkipUpdate)
  • What happens: For each block, flip a coin m_b ~ Bernoulli(p), often p=0.5. If m_b=1, keep the update; if 0, skip the parameter change for that block this step. For SkipUpdate (the simple version), scale survivors by s=1/p (e.g., 2 if p=0.5) to keep the average update unbiased.
  • Why this exists: Masking injects geometric regularization that pushes you away from sharp directions, smoothing the training path without explicit curvature computation.
  • Example: Coin says m_A=1 (keep), m_B=0 (skip). We apply 2Ć—Ī”_A and skip Ī”_B for parameters, but we still update momentum μ_B using the current gradient so the compass stays accurate.
  4. Momentum stays dense
  • What happens: Even if a block’s parameter update is skipped, you still update its momentum (and other running stats) using the fresh gradient.
  • Why this exists: Dense momentum reduces noise and prevents the optimizer from losing track of where to go next time the block is selected.
  • Example: Although Block B’s parameters didn’t change, μ_B incorporates today’s gradient so that when B is updated next, its direction is steadier.
  5. Magma’s alignment score
  • What happens: Compute cosine similarity between the block’s gradient g_b and momentum μ_b. Turn that into a score via sigmoid(cossim/Ļ„), with Ļ„ a temperature (e.g., 2). Smooth this score over time: s_b ← 0.9 s_b(prev) + 0.1 sigmoid(...).
  • Why this exists: Aligned gradients carry consistent signal; misaligned ones are often noise. The score gently boosts trustworthy updates and damps risky ones.
  • Example: If cossim for Block A = 0.8 and Ļ„=2, sigmoid(0.4) ā‰ˆ 0.60; after smoothing, s_A might be around 0.62. If Block B’s cossim = āˆ’0.5, sigmoid(āˆ’0.25) ā‰ˆ 0.44; after smoothing, s_B ā‰ˆ 0.46.
  6. Final parameter update per block
  • What happens: For each block, apply Īø_{t+1}^{(b)} = Īø_t^{(b)} āˆ’ s_b Ɨ m_b Ɨ Ī”_b.
  • Why this exists: This combines geometric smoothing (masking) with signal selection (alignment), while the base optimizer provides the core update idea.
  • Example: Using the numbers above: If m_A=1, update A by āˆ’(0.62)Ɨ2Ɨ0.01 = āˆ’0.0124. If m_B=0, skip B’s parameter change this step (but keep momentum update for B).
  7. Learning-rate schedule and other basics unchanged
  • What happens: You can keep your usual warm-up and cosine decay; Magma is a wrapper, not a replacement for your whole training recipe.
  • Why this exists: Ease of adoption—minimal code changes, same training pipeline.
  • Example: The team keeps their 10% warm-up and cosine schedule and just adds the Magma multiplier.
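Putting the recipe together, here is a minimal sketch of one Magma-wrapped step over per-block parameters. The base optimizer is Adam-like (no bias correction, for brevity), state lives in a plain dict, and the constants follow the examples above rather than official defaults:

```python
import numpy as np

rng = np.random.default_rng(0)

def magma_step(params, grads, state, lr=1e-3, p=0.5, tau=2.0,
               beta1=0.9, beta2=0.999, ema=0.9, eps=1e-8):
    """One Magma-wrapped step (illustrative sketch, not the paper's code).
    Momentum and second moments stay dense: they update for every block,
    even when that block's parameter change is skipped this step."""
    for b in params:
        g = grads[b]
        st = state[b]
        # Steps 1-2 and 4: dense moment updates + base update direction.
        st["m"] = beta1 * st["m"] + (1 - beta1) * g
        st["v"] = beta2 * st["v"] + (1 - beta2) * g**2
        delta = lr * st["m"] / (np.sqrt(st["v"]) + eps)
        # Step 5: alignment score = sigmoid(cos(g, m) / tau), EMA-smoothed.
        cos = g.ravel() @ st["m"].ravel() / (
            np.linalg.norm(g) * np.linalg.norm(st["m"]) + 1e-12)
        st["s"] = ema * st["s"] + (1 - ema) * (1.0 / (1.0 + np.exp(-cos / tau)))
        # Steps 3 and 6: per-block coin flip, then theta <- theta - s * m * delta.
        if rng.random() < p:
            params[b] = params[b] - st["s"] * delta
    return params, state
```

The wrapper touches only the final multiply-and-apply stage, which is why it drops into an existing training loop (warm-up, cosine decay, and all) without other changes.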

What breaks without certain steps:

  • Without masking: You lose the curvature-aware regularization that avoids sharp, unstable directions; spikes and worse generalization may return.
  • Without dense momentum: Momentum becomes noisier and less reliable; experiments show stability suffers.
  • Without alignment damping: Unbiased variants (using alignment as survival probability with 1/p rescaling) were unstable; the gentle bias of damping improves robustness.

Concrete micro-example with two steps:

  • Step 1: p=0.5. Block A kept (m_A=1), Block B skipped (m_B=0). s_Aā‰ˆ0.62, s_Bā‰ˆ0.52. Apply A: āˆ’0.0124; skip B’s parameter change but update μ_B.
  • Step 2: Now B’s cossim improved (thanks to updated μ_B), giving s_Bā‰ˆ0.58; coin flips m_A=0, m_B=1. This time, A holds steady while B advances with a steadier direction.
  • Net effect: Both blocks move forward over time, but the most reliable direction gets emphasized, and sharp/noisy directions get fewer, smaller pushes.

The secret sauce:

  • Masking adds curvature-aware smoothing for free (no Hessian needed).
  • Alignment turns noisy directions down and consistent ones up.
  • Dense momentum keeps the compass stable even when a block skips a step.
  • Block-wise design matches transformer Hessian structure and is efficient to implement.

04Experiments & Results

The test: They trained Llama models of several sizes (60M, 130M, 350M, 1B) on the C4 dataset and measured validation perplexity (lower is better). They also tested a mixture-of-experts (MoE) model on OpenWebText and studied behavior under synthetic heavy-tailed noise and controlled quadratic problems.

— New Concept: Perplexity — šŸž Hook: When you try to guess the next word in a sentence, sometimes it’s easy, sometimes it’s hard. 🄬 Concept: Perplexity measures how surprised a language model is when predicting text; lower means better predictions.

  • How it works: It’s based on how likely the model thinks the right next words are.
  • Why it matters: Lower perplexity usually means a smarter, more reliable model. šŸž Anchor: A model with perplexity 13 is less ā€œconfusedā€ than one with perplexity 16 on the same task.
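As a quick sketch, perplexity is just the exponential of the average negative log-likelihood the model assigns to the correct next tokens:

```python
import math

def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability of the right
    tokens. A model that always guesses uniformly over k options scores k."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```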
  1. Llama pre-training on C4 (60M–1B)
  • Competition: Adam, LaProp, Adafactor, APOLLO, Muon, SOAP, and enhancers like SGG and Cautious Optimizer.
  • Scoreboard highlights (validation perplexity): Adam (1B) 16.35; Muon (1B) 14.52; RMSProp+Magma (1B) 13.19, the best in this benchmark; Adam+Magma (1B) 13.71, beating Adam, SGG-enhanced Adam (14.30), and Cautious Adam (15.92).
  • Context: Dropping from 16.35 to 13.71 is like going from a solid B to an A—roughly a 16% relative improvement. Edging past Muon (14.52 → 13.71) is like pulling ahead in a close race by a clear margin (~6%). RMSProp+Magma’s 13.19 is the new top score among these methods.
  • Surprising finding: Even though masking discards half the updates, it consistently improved performance over dense baselines, including strong state-of-the-art optimizers.
  2. Nano MoE pre-training on OpenWebText
  • MoE training is noisier and more irregular due to dynamic routing and sparse expert activations.
  • Results: Magma improves both Adam and Muon. With Muon, Magma delivers the best performance among tested methods, suggesting that masking-based modulation and structured preconditioning tackle different parts of the problem and combine well.
  • Note: Magma sometimes converges a bit slower mid-training but finishes with better final performance—consistent with the idea of smoother, more reliable paths.
  3. Heavy-tailed gradient noise benchmark
  • Setup: A controlled synthetic task that mimics transformer quirks, with light-tailed vs heavy-tailed data.
  • Results: Under light-tailed data, Adam and Magma are similar. Under heavy-tailed noise, Magma significantly outperforms Adam.
  • Extra context: A lower ā€œrobust condition numberā€ along Magma’s path shows it stays in better-conditioned regions—another sign of geometric smoothing.
  4. Heterogeneous vs homogeneous quadratic problems
  • Setup: Two quadratic objectives with the same eigenvalues but arranged differently across blocks—one grouped by scale (homogeneous), one mixed (heterogeneous).
  • Findings: On homogeneous curvature, Magma is roughly on par with AdamW (slightly faster early). On heterogeneous curvature (more like transformers), Magma converges faster and to lower final loss than AdamW.
  • Counterexample: On ResNet-50/CIFAR-10, Magma doesn’t improve over AdamW, reinforcing that its benefits are specific to transformer-like geometry.

Other observations:

  • Masking granularity: Element/row/column/block masking all worked similarly; block-wise is preferred for efficiency and matches transformer structure.
  • Hyperparameters: p=0.5 (keep half the blocks) worked best across temperatures; Magma widened the stable learning-rate window dramatically, reducing tuning pain.
  • Dense momentum: Essential for stability; sparse momentum updates were much less stable even with damping.

Big picture: Across realistic LLM training and stress tests, Magma’s skip-and-align approach improves reliability and final performance with negligible overhead and minimal code change.

05Discussion & Limitations

Limitations and caveats:

  • Architecture specificity: Gains are strongest for transformer-like, heterogeneous curvature. On CNN-style tasks (e.g., ResNet-50/CIFAR-10), Magma doesn’t help and can slightly lag.
  • Biased damping: Using alignment as a multiplier introduces bias. Unbiased variants (using alignment as survival probability with inverse rescaling) were unstable in tests; designing a stable unbiased scheme remains open.
  • No backprop savings: You still compute full gradients; masking reduces parameter updates, not gradient cost. This isn’t a compute saver, it’s a stability and generalization booster.
  • Block choices: Performance depends on a reasonable block partition (attention and MLP blocks worked well). Extremely tiny or arbitrary blocks may add overhead without benefit.
  • Hyperparameters: While Magma is robust, choices like keep ratio p and temperature Ļ„ still matter; defaults (p=0.5, Ļ„ā‰ˆ2) worked broadly in experiments.

Required resources:

  • Same training hardware/software as your base optimizer.
  • Minimal extra ops: per-block cosine similarity, a sigmoid, and an EMA per step.
  • No extra activations or optimizer state memory; momentum remains as usual.

When not to use:

  • Smooth, well-conditioned problems where dense updates already excel.
  • Architectures with largely homogeneous curvature (typical CNN settings).
  • Ultra memory-constrained settings that can’t keep momentum dense or can’t afford per-block cosine scores.

Open questions:

  • Unbiased but stable masking: Can we design a variant that preserves unbiasedness and keeps the same stability?
  • Adaptive masking ratio: Can p vary by block and over time, guided by measured instability or curvature proxies?
  • Better block partitioning: What block granularity best matches transformer Hessian structure across scales?
  • Theory: Tighter, more predictive convergence bounds that capture heavy-tailed noise and real-world schedules.
  • Composition: How best to combine Magma with sharpness-aware methods, trust regions, or newer matrix preconditioners?

06Conclusion & Future Work

Three-sentence summary: Randomly skipping some parameter block updates—and scaling the ones you keep—adds a free, curvature-aware smoothing that improves LLM training. Magma multiplies those masked updates by a momentum–gradient alignment score, further boosting stable, signal-following moves while damping noisy ones. This tiny wrapper consistently beats strong baselines (including Muon) across Llama scales and MoE models, with negligible overhead.

Main achievement: Showing that structured randomness (block-wise masking) plus alignment-based damping can outperform complex dense optimizers and widen the stable learning-rate region, delivering lower perplexity in large-scale LLM pre-training.

Future directions: Design unbiased yet stable masking schemes, adapt the keep ratio by block and over time, refine block definitions for different transformer variants, and explore principled combinations with trust-region and sharpness-aware methods.

Why remember this: It flips a common assumption—dense updates aren’t always best—and offers a simple, practical tool that makes training both steadier and better. In a world where small code changes at scale have huge impact, Magma is an elegant, low-cost upgrade for modern LLM pipelines.

Practical Applications

  • Pre-train LLMs with Magma wrapping Adam or RMSProp to reduce perplexity and training spikes.
  • Use Magma when scaling to larger transformer models to widen the stable learning-rate range and cut tuning time.
  • Combine Magma with matrix preconditioners (e.g., Muon) to get complementary gains in MoE training.
  • Apply masking primarily to attention and MLP blocks for strong performance with minimal code changes.
  • Adopt p=0.5 and Ļ„ā‰ˆ2 as robust defaults and then fine-tune only if needed.
  • Deploy Magma in heavy-tailed data regimes (e.g., web-scale corpora) to maintain better conditioning during training.
  • Use Magma to stabilize long runs where occasional gradient spikes previously caused divergence.
  • Integrate Magma into low-resource experiments to get reliable results without exhaustive hyperparameter sweeps.
  • Leverage Magma during curriculum or domain shifts to dampen noisy, misaligned updates as distributions change.
#Magma #random masking #adaptive optimizers #momentum alignment #transformers #geometric regularization #curvature #Hessian #perplexity #heavy-tailed noise #RMSProp #Adam #Muon #mixture of experts #LLM pretraining