Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation
Key Summary
- •This paper shows how to train big language models faster and cheaper by using 4-bit numbers (NVFP4) without losing much accuracy.
- •It introduces MS-EDEN, a new way to keep gradients unbiased while adding far less noise than older methods like stochastic rounding.
- •MS-EDEN moves the randomness from each tiny 4-bit number to the shared scale of small blocks, which cuts error by over 2x.
- •Quartet II is a full recipe that uses MS-EDEN for the backward pass and a smarter forward pass with the Four-over-Six scale choice.
- •Compared to earlier NVFP4 training recipes, Quartet II consistently lowers validation loss gaps by around 20%.
- •On NVIDIA Blackwell GPUs, the provided kernels achieve up to 4.2x speedups versus BF16 on linear layers.
- •The method stays unbiased in expectation, which keeps training stable over billions of tokens.
- •Special GPU kernels avoid extra memory passes with a practical trick called post hoc range alignment for scales.
- •The approach works on real LLM pre-training runs up to 1.9B parameters and 38B tokens, not just tiny tests.
Why This Research Matters
Training large language models is extremely costly, so any safe way to use smaller numbers can save time, money, and energy. This work shows that 4-bit training (NVFP4) can stay accurate and stable by reducing gradient noise while keeping gradients unbiased. That means more researchers and companies can afford to train and iterate on strong models. It also paves the way for greener AI by cutting power use and memory traffic. With better kernels and practical tricks, the method is not just a theory—it runs fast on real hardware. In short, this helps unlock high-quality AI at lower cost and broader scale.
Detailed Explanation
01Background & Problem Definition
You know how writing with a big, smooth pencil makes your work neat, but a tiny golf pencil is lighter and faster for quick notes? Computers face a similar choice: use big, precise numbers (like FP16 or FP8) for safe accuracy, or tiny, speedy numbers (like FP4) for huge speed and memory savings. Before this paper, people had learned to train large language models (LLMs) using smaller and smaller numbers to save time and money. First FP16 and BF16 got popular, then FP8, and now NVFP4—an especially small format with clever scaling so it still fits a surprising range of values. The dream is to do almost everything in these tiny 4-bit numbers and get massive speedups.
The problem is that training is a long journey with lots of tiny steps called gradients. If those steps are consistently a little wrong (biased), the model wanders off in the wrong direction and never reaches its best performance. With 4-bit numbers, it’s very easy to introduce systematic mistakes. Past work tried to fix this with a trick called stochastic rounding (SR), which uses randomness to keep gradients unbiased on average. But SR adds a lot of noise, especially with only 4 bits, because it flips a coin for every single number. That keeps things fair but shaky: training doesn’t crash, yet it lands worse than higher precision like FP8 or FP16.
People also tried to help the forward pass (where the model makes predictions) by making quantization smarter. For example, the Four-over-Six (4/6) idea chooses between two grid sizes (like using a small or a slightly larger ruler) to reduce error. And others grouped weights into larger square blocks to reuse their quantized versions in the backward pass, reducing extra rounding. But these choices had trade-offs: square blocks hurt representation in the forward pass, and combining 4/6 with SR in certain ways quietly introduced bias into gradients.
What was missing was a way to keep gradients unbiased without adding so much noise—and to do it in a way that fits the special structure of NVFP4 (microscaling with FP8 per-16-element scales and one FP32 global scale). Past theory offered a clue: EDEN, a method from distributed optimization, uses a random rotation plus a gentle correction factor to keep estimates unbiased while keeping error low. But EDEN’s correction needs fine-grained scales that don’t fit cleanly into NVFP4’s coarse FP8 scales; the correction is too small to store exactly.
This paper fills that gap. It adapts EDEN to NVFP4 by moving the randomness from each 4-bit element to the shared FP8 scales of small groups. That’s the big idea behind MS-EDEN. It preserves unbiasedness in expectation and slashes error compared to SR. With this new tool, the authors build Quartet II, a full training recipe: a high-capacity forward pass using native per-16-element NVFP4 scales plus Four-over-Six, and an unbiased backward pass using MS-EDEN and inner-dimension randomized rotations. Together, they keep the forward pass sharp and the backward pass fair.
Why should anyone care? Because training giant models is expensive. If we can switch most math to 4-bit while staying accurate and stable for billions of tokens, we can train more models, try more ideas, and bring high-quality AI to more people and places. Faster pre-training means lower costs, less energy use, and wider access. It’s like making the highway wider and the cars more fuel-efficient at the same time.
02Core Idea
🍞 Hook: Imagine you’re hiking with a compass. If the compass is randomly shaky (noisy), but on average points north (unbiased), you’ll still reach your campsite—just more slowly. If it’s steadily off by a few degrees (biased), you’ll get lost. LLM training is the same: the backward pass compass (gradients) must be unbiased, and less shake (lower error) is even better.
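To make the compass picture concrete, here is a tiny gradient-descent toy of my own (not from the paper): zero-mean noise still reaches the target, while a small constant bias settles in the wrong place.

```python
import numpy as np

# Minimize f(w) = (w - 3)^2 with noisy gradient estimates.
rng = np.random.default_rng(0)
lr, steps, target = 0.01, 20_000, 3.0

def run(gradient_noise):
    w = 0.0
    for _ in range(steps):
        w -= lr * (2 * (w - target) + gradient_noise(rng))
    return w

unbiased = run(lambda r: r.normal(0.0, 1.0))    # shaky compass: zero-mean noise
biased   = run(lambda r: 0.5)                   # steadily off by a bit: constant bias
print(f"unbiased noise -> w = {unbiased:.2f}")  # lands near the true target 3.0
print(f"constant bias  -> w = {biased:.2f}")    # settles at 2.75, the wrong spot
```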
🥬 The Concept in One Sentence: MS-EDEN keeps gradients unbiased in NVFP4 while cutting error by over 2x compared to stochastic rounding by moving randomness from each tiny 4-bit value to the shared FP8 group scales and using randomized rotations.
How it works (intuition, not equations):
- Randomly rotate small blocks (like shuffling numbers to spread out outliers).
- Quantize with round-to-nearest (lower error than SR) into NVFP4.
- Compute a gentle correction factor that says “scale me up or down a tiny bit.”
- Apply that correction not to each 4-bit element, but to the shared FP8 micro-scales using stochastic rounding; this keeps everything unbiased in expectation.
- Because the rotations are matched across the two matrices in the GEMM, they cancel out in the product.
Why it matters: With SR, every element flips a coin, causing lots of noise. MS-EDEN instead flips a few coins on shared scales and gets the same fairness with much less shake, which makes training more accurate and stable at 4 bits.
🍞 Anchor: Think of painting a wall. SR is like shaking your hand for every brush stroke—fair on average, but messy. MS-EDEN steadies your hand and only wiggles the roller handle a little now and then, so the whole wall looks smoother.
Three analogies for the same idea:
- Camera analogy: SR is like adding grain to every pixel. MS-EDEN reduces grain by adjusting only the exposure setting for small tiles while keeping the overall scene balanced on average.
- Classroom grading: SR re-rolls points for every single question. MS-EDEN assigns a small random curve to the whole quiz section, keeping the class fair but with less chaos.
- Traffic control: SR adds random slowdowns at every intersection. MS-EDEN tweaks the timing of a few area-wide traffic lights, keeping average flow the same but making the ride smoother.
Before vs After:
- Before: Unbiased, but noisy gradients (SR) or biased tricks that slowly push training off course.
- After: Unbiased and quieter gradients (MS-EDEN), letting NVFP4 hit higher accuracy while keeping the big speed and memory wins.
Why it works:
- Random rotations spread big values, so round-to-nearest clips less often and errors distribute evenly.
- The small correction factor makes the rotated-and-quantized vector align with the original on average.
- Applying that correction at the FP8 scale level with stochastic rounding preserves unbiasedness without adding per-element noise.
Building blocks (each with a mini sandwich):
🍞 You know how mixing colored beads evenly in a bag makes any scoop look similar? 🥬 Randomized Hadamard Transform (RHT) is a fast shuffle-then-flip that spreads large values across a block so rounding is gentler and more uniform. Without it, a few outliers dominate and get badly rounded. 🍞 Example: One huge number becomes many medium ones after the rotation, so fewer get clipped.
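For readers who like code, here is a minimal numpy sketch of that shuffle-then-flip idea; the 128-element block length and the orthonormal scaling are illustrative choices, and the paper's version runs inside GPU kernels rather than numpy.

```python
import numpy as np

def randomized_hadamard(x, seed=0):
    """Random sign flips followed by a fast Walsh-Hadamard transform.

    x must have power-of-two length (e.g., a 128-element chunk).
    The combined map is orthonormal, so it can be undone exactly.
    """
    n = x.size
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    y = x * signs
    h = 1
    while h < n:                       # iterative butterfly, O(n log n)
        y = y.reshape(-1, 2, h)
        y = np.stack([y[:, 0] + y[:, 1], y[:, 0] - y[:, 1]], axis=1).reshape(-1)
        h *= 2
    return y / np.sqrt(n)              # orthonormal scaling

# One big outlier gets spread into many medium-sized entries.
block = np.zeros(128)
block[7] = 6.5
rotated = randomized_hadamard(block)
print(block.max(), np.abs(rotated).max())   # 6.5 vs roughly 6.5/sqrt(128), about 0.57
```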
🍞 Imagine measuring height with a ruler that has big marks. 🥬 NVFP4 microscaling stores 4-bit values plus a shared FP8 scale per 16 elements and a global FP32 scale; this keeps range while staying tiny. Without shared scales, 4-bit alone would lose too much detail. 🍞 Example: A classroom’s average height (scale) plus each kid’s small offset (4-bit) stores the class well.
🍞 Think of two rounding styles: flipping a coin every time vs rounding to the nearest line. 🥬 Round-to-nearest (RTN) has lower mean-square error than per-element SR, but alone it can be biased. Without a fix, repeated bias would accumulate. 🍞 Example: Always rounding 2.49 to 2 adds a tiny bias if you do it millions of times.
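A toy numpy check of exactly this trade-off, using a plain uniform grid instead of the FP4 grid (my simplification): RTN drifts when it keeps seeing the same value, while SR stays fair on average but is noisier overall.

```python
import numpy as np

rng = np.random.default_rng(0)

def rtn(v, step=1.0):
    return np.round(v / step) * step              # round-to-nearest

def sr(v, step=1.0):
    lo = np.floor(v / step) * step
    p_up = (v - lo) / step                        # chance of rounding up
    return lo + step * (rng.random(v.shape) < p_up)

# (a) Bias: the same value rounded many times drifts with RTN, not with SR.
x = np.full(100_000, 2.49)
print("RTN mean error:", (rtn(x) - x).mean())     # -0.49 every single time
print("SR  mean error:", (sr(x) - x).mean())      # roughly 0 on average

# (b) Noise: on generic data, RTN's mean-square error is about half of SR's.
y = rng.uniform(-8, 8, size=1_000_000)
print("RTN MSE:", ((rtn(y) - y) ** 2).mean())     # close to 1/12
print("SR  MSE:", ((sr(y) - y) ** 2).mean())      # close to 1/6
```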
🍞 Picture a tiny dimmer knob for brightness. 🥬 EDEN’s correction factor is a small scale that realigns the average of many randomized-rotated, quantized chunks to the original. Without it, the average would drift. 🍞 Example: After several photos, you adjust exposure so the album’s average brightness matches the scene.
🍞 Choosing between two spoon sizes for sugar. 🥬 Four-over-Six tests two grid sizes (4 or 6) per block in the forward pass and picks the one with lower error. Without it, forward-pass rounding is worse. 🍞 Example: For mild sweetness, use the 4-sized spoon; for stronger flavors, use the 6-sized spoon.
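A simplified numpy sketch of the 4/6 choice on one 16-element block. The magnitude set {0, 0.5, 1, 1.5, 2, 3, 4, 6} is the standard FP4 (E2M1) grid; the plain float scale below stands in for the real FP8 micro-scale, so treat this as an illustration rather than the paper's kernel.

```python
import numpy as np

FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block(block, grid_top):
    """Round-to-nearest onto the FP4 grid with one shared per-block scale.

    grid_top is what the block maximum is mapped to: 6.0 uses the full FP4
    range, 4.0 avoids the coarse 4-to-6 segment at the cost of a 1.5x larger scale.
    """
    amax = np.abs(block).max()
    scale = amax / grid_top if amax > 0 else 1.0   # real NVFP4 stores this in FP8
    scaled = np.clip(np.abs(block) / scale, 0.0, FP4_VALUES[-1])
    idx = np.abs(scaled[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
    return np.sign(block) * FP4_VALUES[idx] * scale

def four_over_six(block):
    """Try grid tops 4 and 6, keep whichever reconstruction has lower error."""
    candidates = [quantize_block(block, top) for top in (4.0, 6.0)]
    errors = [np.sum((c - block) ** 2) for c in candidates]
    return candidates[int(np.argmin(errors))]

block = np.random.default_rng(1).normal(size=16)
print("4/6    squared error:", np.sum((four_over_six(block) - block) ** 2))
print("6-only squared error:", np.sum((quantize_block(block, 6.0) - block) ** 2))
# By construction, the 4/6 pick is never worse than always using 6.
```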
Finally, Quartet II ties it all together: native NVFP4 + 4/6 in forward for representation, MS-EDEN in backward for unbiased, low-noise gradients, and inner-dimension rotations so the math lines up and cancels correctly.
03Methodology
High-level pipeline: Inputs (activations X, weights W) → Forward pass quantization (NVFP4 + Four-over-Six) → NVFP4 GEMM for Y = X·W^T → Save quantized X and W → Backward pass: generate seeded rotations, dequantize and requantize E, X^T, and W^T with MS-EDEN along the inner dimension → NVFP4 GEMMs for gradients → Optimizer updates.
Step-by-step with what/why/examples (a schematic code sketch of the whole flow follows this list):
- Forward quantization with native NVFP4 + Four-over-Six
  - What happens: For both activations and weights, the system uses native NVFP4 grouping (1 FP8 scale per 16 FP4 values plus a tensor-wide FP32 scale). For each block, it tries two grid tops (4 and 6), does round-to-nearest for both, and picks the one with lower error. The chosen quantized values go to the NVFP4 tensor cores.
  - Why this step exists: The forward pass must preserve representation quality so the model’s predictions are not degraded by 4-bit rounding. Four-over-Six reduces mean-square error without adding bias to the forward computation (bias is a backward-pass concern).
  - Example: If a block’s values are mostly small, a “4” grid fits more snugly; if they’re wider, “6” prevents clipping. Picking the better one per block makes the forward activations and outputs sharper.
- Save quantized X and W for the backward pass
  - What happens: The quantized versions (not the original BF16) are cached for later re-use during gradient computations.
  - Why it matters: Re-using on-device quantized forms avoids extra conversions and allows the backward pass to operate in NVFP4.
  - Example: Keeping the compact 4-bit plus scales version is like storing a neatly folded map instead of the whole paper—faster to grab later.
- Prepare randomized rotations along the inner GEMM dimension
  - What happens: For each relevant matrix multiplication in the backward pass, generate a seeded Randomized Hadamard Transform (RHT) over chunks (e.g., 128 elements) along the inner dimension so both multiplicands use the same rotation. This is implemented efficiently on GPU.
  - Why this step exists: Rotations evenly spread outliers, so RTN has lower error and fewer extreme clips, and the EDEN-style correction can realign the average. Using the same seed makes rotations cancel when the matrices are multiplied.
  - Example: If E has a few giant entries, the rotation turns those into many medium entries spread across the 128-length block, so rounding is kinder.
- MS-EDEN requantization for E, X^T, and W^T
  - What happens: For each rotated 128-length chunk, perform RTN-based NVFP4 quantization (can clip if necessary), then compute a small correction factor S as the ratio of dot-products between the original rotated chunk and its RTN-quantized reconstruction. Instead of multiplying every 4-bit value by S, MS-EDEN stochastically rounds the FP8 group scales by S, ensuring that, in expectation, the correction is applied exactly.
  - Why this step exists: RTN by itself is lower-noise but can be biased. The EDEN-style correction restores unbiasedness. Applying it via the FP8 scales (not per 4-bit element) preserves unbiasedness in expectation but avoids huge per-element randomness.
  - Example: Suppose S = 1.03 for a chunk. We nudge the block’s FP8 scale up via stochastic rounding so, on average, the block is 3% larger, matching the original chunk on average.
- NVFP4 GEMMs for gradient computations
  - What happens: With E, X^T, and W^T now in unbiased NVFP4 form (MS-EDEN), the backward matrix multiplications are performed on NVFP4 tensor cores. Because both multiplicands used the same inner-dimension rotations, the rotations cancel in the product.
  - Why this step exists: This delivers both speed (4-bit tensor cores) and correctness (unbiased gradients), keeping training stable and fast.
  - Example: It’s like turning two opposite twists on a rope that untwist each other when tied together—only the intended pull (true gradient) remains.
- Optimizer and state updates
  - What happens: The resulting gradients feed into the optimizer (e.g., Adam or Muon). Master weights and accumulators stay in higher precision (e.g., FP32), as in standard mixed-precision training.
  - Why this step exists: High-precision optimizer state keeps long-term stability and avoids drift.
  - Example: It’s like keeping your savings ledger in very neat handwriting, even if daily notes are in shorthand.
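As promised above, here is a schematic numpy sketch of the whole flow for one linear layer. Everything here is a stand-in of my own: the quantizer is a crude round-to-nearest toy with one float scale per 16 elements, and the Four-over-Six choice, rotations, and MS-EDEN correction are omitted (they are sketched separately below), so this only shows which operands get (re)quantized around each GEMM.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes

def fake_quantize(t, group=16):
    """Toy NVFP4 stand-in: RTN onto the FP4 grid with one float scale per
    16 elements (no FP8 scales, no Four-over-Six, no MS-EDEN)."""
    g = t.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / FP4[-1]
    scale[scale == 0] = 1.0
    idx = np.abs(np.abs(g / scale)[..., None] - FP4).argmin(axis=-1)
    return (np.sign(g) * FP4[idx] * scale).reshape(t.shape)

# Toy shapes: batch 8, in-features 64, out-features 32.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 64))                        # activations
W = rng.normal(size=(32, 64)) * 0.1                 # weights

# Forward: quantize X and W (the recipe adds Four-over-Six here), then GEMM.
Xq, Wq = fake_quantize(X), fake_quantize(W)
Y = Xq @ Wq.T

# Backward: given the output gradient E, requantize the operands
# (the recipe uses MS-EDEN plus matched inner-dimension rotations here).
E = rng.normal(size=Y.shape)
dX = fake_quantize(E) @ fake_quantize(Wq)           # gradient wrt activations
dW = fake_quantize(E).T @ fake_quantize(Xq)         # gradient wrt weights
print(dX.shape, dW.shape)                           # (8, 64), (32, 64)
```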
The secret sauce: where the cleverness hides
- Move randomness to the FP8 scales: Instead of flipping a coin for each FP4 element (noisy), MS-EDEN flips far fewer coins for each group scale, yet stays unbiased.
- Rotation-cancel trick: By rotating both multiplicands with the same seed along the inner dimension, the rotations cancel during GEMM, so we don’t pay extra cost to un-rotate outputs (a quick numerical check follows this list).
- Post hoc range alignment kernel: A GPU-friendly two-kernel design avoids double loading and rotating the full tensor. The first pass produces extended-range scales (a BF16 proxy), computes global maxima and correction factors, and writes FP4 values. The second, very light pass re-aligns those pseudo-scales into legal FP8 range and applies the stochastic rounding for unbiasedness. This saves memory bandwidth and latency in practice.
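A quick numerical check of the rotation-cancel trick mentioned above: rotate both GEMM operands along the inner (shared) dimension with the same orthonormal sign-flip-plus-Hadamard transform, and the product is unchanged. Sizes and the dense-matrix construction are illustrative only.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
k = 128                                        # inner (shared) GEMM dimension
H = hadamard(k) / np.sqrt(k)                   # orthonormal Hadamard matrix
D = np.diag(rng.choice([-1.0, 1.0], size=k))   # random sign flips, same seed both sides
R = D @ H                                      # the "rotation": R @ R.T = I

X = rng.normal(size=(8, k))                    # activation-like operand
W = rng.normal(size=(32, k))                   # weight-like operand

plain   = X @ W.T
rotated = (X @ R) @ (W @ R).T                  # rotate both operands along the inner dim
print(np.max(np.abs(plain - rotated)))         # tiny (float round-off): rotations cancel
```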
Concrete mini example (toy numbers):
- Suppose a 128-length chunk has values around 0.0 to 3.0 with one outlier at 6.5. After rotation, the 6.5 is spread into many 0.5-ish entries. RTN to FP4 with a good FP8 scale leads to small rounding errors. The dot-product ratio S ends up, say, 1.02. Instead of multiplying each 4-bit element, we stochastically bump the FP8 group scale by 2% (in expectation). Across many chunks and steps, the gradients remain unbiased but with lower variance than SR.
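The same story as a toy numpy sketch. This is my own simplification: the scale grid, chunk length, and exact correction formula are stand-ins for the real NVFP4/FP8 formats, so it only illustrates the rotate-round-correct pattern and the averaging behaviour, not the paper's kernels.

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes
SCALE_GRID = 2.0 ** np.arange(-12, 5, 0.25)                 # toy stand-in for FP8 scales

def fwht(v):                                   # unnormalized fast Walsh-Hadamard transform
    y, h = v.copy(), 1
    while h < y.size:
        y = y.reshape(-1, 2, h)
        y = np.stack([y[:, 0] + y[:, 1], y[:, 0] - y[:, 1]], axis=1).reshape(-1)
        h *= 2
    return y

def rtn_fp4(group, scale):                     # round-to-nearest onto the FP4 grid
    idx = np.abs(np.clip(np.abs(group) / scale, 0.0, 6.0)[:, None] - FP4).argmin(axis=1)
    return np.sign(group) * FP4[idx]

def sr_to_grid(value, grid, rng):              # stochastic rounding onto a fixed grid
    hi = int(np.searchsorted(grid, value))
    lo, hi = max(hi - 1, 0), min(hi, grid.size - 1)
    if lo == hi:
        return grid[lo]
    p_up = (value - grid[lo]) / (grid[hi] - grid[lo])
    return grid[hi] if rng.random() < p_up else grid[lo]

def ms_eden_chunk(x, seed):
    """Rotate -> RTN per 16-element group -> correction S -> SR the group scales."""
    rng = np.random.default_rng(seed)
    n = x.size
    signs = rng.choice([-1.0, 1.0], size=n)
    rot = fwht(x * signs) / np.sqrt(n)                      # randomized Hadamard rotation
    groups = rot.reshape(-1, 16)
    scales = np.abs(groups).max(axis=1) / FP4[-1]
    scales[scales == 0] = SCALE_GRID[0]
    q = np.stack([rtn_fp4(g, s) for g, s in zip(groups, scales)])
    recon = (q * scales[:, None]).reshape(-1)
    S = float(rot @ rot) / float(rot @ recon)               # EDEN-style correction factor
    new_scales = np.array([sr_to_grid(S * s, SCALE_GRID, rng) for s in scales])
    out = (q * new_scales[:, None]).reshape(-1)
    return fwht(out) / np.sqrt(n) * signs                   # undo the rotation

rng = np.random.default_rng(42)
x = rng.normal(size=128)
x[3] = 6.5                                                  # a chunk with one outlier
for B in (10, 1000):
    avg = np.mean([ms_eden_chunk(x, seed) for seed in range(B)], axis=0)
    print(B, np.linalg.norm(avg - x) / np.linalg.norm(x))   # the gap shrinks as B grows
```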
What breaks without each step:
- No 4/6 in forward: Larger forward error, worse predictions, higher loss.
- No rotations: More clipping and higher error in backward; correction less reliable.
- No MS-EDEN correction: RTN’s small bias accumulates; training drifts over time.
- Per-element SR instead: Unbiased, but much noisier; validation loss degrades relative to MS-EDEN.
- No post hoc alignment: Extra memory passes and rotations, slowing training kernels noticeably.
04Experiments & Results
What was tested and why: The authors examined how much extra validation loss appears when replacing BF16 training with various FP4 training schemes. They also measured speedups on NVIDIA Blackwell GPUs. The key questions were: (1) Does MS-EDEN really give more accurate, unbiased gradients than stochastic rounding (SR)? (2) Does using native NVFP4 scales with Four-over-Six (4/6) in the forward pass help representation? (3) When everything is combined (Quartet II), does it beat prior NVFP4 recipes end-to-end?
Competitors and baselines:
- BF16: reference “gold standard” for accuracy.
- NVIDIA NVFP4 recipe: prior practical fully-quantized baseline.
- TetraJet-v2: a refined NVFP4 approach without hard-to-implement extras in this paper’s comparison setting.
- Four-over-Six: used carefully (forward only) due to potential backward bias.
- SR-based backward quantization: the standard unbiased method to beat.
Scoreboard with context:
- Backward-pass ablations: Across multiple settings (quantizing different combinations of the two main backward-pass GEMMs), MS-EDEN consistently outperformed SR wherever both were applicable. Importantly, fully quantized MS-EDEN with weight re-quantization even beat fully quantized SR without weight re-quantization. Think of it like getting a cleaner signal despite doing the extra step: MS-EDEN’s lower noise paid off.
- Forward-pass ablations: Using Four-over-Six on native NVFP4 (1x16 group scales) cut forward error more than using it with square-block (16x16) scales. That’s like saying the 4/6 trick works best when each small group can have its own scale, improving the model’s prediction quality in the forward pass.
- Full system (Quartet II): When forward (native scales + 4/6) and backward (MS-EDEN + rotations) were combined, validation loss gaps versus BF16 dropped by around 20% compared to prior NVFP4 methods. Translating this: if others were scoring a B-, Quartet II nudged it up towards a solid B/B+ under the same training budget.
Unbiasedness checks:
- The paper verified unbiasedness by averaging many independent backward passes and showing the error shrinks about as 1/sqrt(B), just like it should if gradients are unbiased. Baselines using SR also showed this expected behavior. Applying 4/6 in the backward pass, however, introduced bias and did not show the right shrink pattern—so 4/6 stays forward-only in Quartet II.
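The flavour of this check can be reproduced with any unbiased quantizer. Here is a minimal toy of my own using per-element stochastic rounding on a uniform grid: averaging B independent quantizations shrinks the error roughly as 1/sqrt(B), which is the signature of an unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
step = 0.5
x = rng.normal(size=4096)                      # stand-in for a gradient tensor

def stochastic_round(v):
    lo = np.floor(v / step) * step
    return lo + step * (rng.random(v.shape) < (v - lo) / step)

for B in (1, 4, 16, 64, 256):
    avg = np.mean([stochastic_round(x) for _ in range(B)], axis=0)
    print(B, np.linalg.norm(avg - x))
# The error drops by about 2x each time B quadruples, i.e. roughly 1/sqrt(B).
```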
Larger-scale validation (Nanochat pipeline):
- On 560M-parameter, 11B-token and 1.9B-parameter, 38B-token pre-training runs, Quartet II reduced the bits-per-byte (BPB) gap to BF16 by about 15–25% relative to other NVFP4 methods. After some brief instruction tuning (SFT) on small datasets, zero-shot benchmark differences among FP4 methods were small or statistically insignificant—which is expected since SFT was short and data-limited.
Performance (speedups):
- Custom kernels plus NVFP4 tensor cores achieved up to 4.2x speedups on linear layers relative to BF16, and around 2.4x end-to-end training throughput improvement for a ~1B-parameter model on a single RTX 5090 by exploiting memory savings (storing 4-bit preactivations) to increase micro-batch size. Note that at very small dimensions, overheads can dominate; larger models benefit more because GEMMs take a bigger share of time.
Surprising findings:
- Even with the extra step of weight re-quantization for the backward pass (required for MS-EDEN and inner-dimension rotations), the overall gradient quality was better than SR’s no-requant route. This underlines how much noise SR adds at 4 bits—and how effective MS-EDEN’s scale-level randomness is at keeping things both unbiased and quiet.
- Four-over-Six, which clearly helps forward quality, can’t be naively applied to the backward pass without bias. Keeping it forward-only while relying on MS-EDEN in backward turned out to be the winning split of duties.
05Discussion & Limitations
Limitations and where it can stumble:
- Inner-dimension rotations required: MS-EDEN needs randomized rotations along the inner GEMM dimension so rotations cancel during matrix multiplication. This shapes implementation choices and means you must re-quantize weights and activations in the backward pass.
- Hardware coupling: The scheme is tuned for NVFP4 on NVIDIA Blackwell tensor cores. Porting to other 4-bit microscaling formats (e.g., MXFP4) or other hardware may need careful retuning of group sizes, kernels, and range-alignment tricks.
- Small-scale inefficiency: For tiny matrix shapes, the overhead of rotations and requant kernels can eat the theoretical speedups, so end-to-end wins grow with model size.
- Correctness subtleties: Using Four-over-Six in the backward pass introduces bias. The paper keeps it strictly forward-only; using it elsewhere would need new theory or safeguards.
Resources needed:
- NVIDIA Blackwell-class GPUs with NVFP4 tensor-core support for best results.
- The provided custom CUDA kernels (including post hoc range alignment) and a quantized GEMM library (e.g., QuTLASS) to get the advertised speedups.
- Standard mixed-precision infrastructure (e.g., FP32 master weights/optimizer states) and data pipelines for large-scale pre-training.
When not to use it:
- Very small models or tiny-batch, tiny-sequence settings where GEMM isn’t the bottleneck; overheads may dominate.
- Non-transformer architectures or custom layers that don’t map cleanly to NVFP4 GEMMs and the inner-dimension rotation pattern.
- Scenarios requiring exact reproducibility at the bit level across runs, since randomized rotations and stochastic rounding seeds must be managed carefully (though they can be deterministically seeded).
Open questions:
- Can similar unbiased, low-noise ideas be adapted to other ultra-low-precision formats (INT4, other microscaling FP4s) with equal success?
- Are there smarter rotation schemes than Hadamard that further reduce error without increasing cost, or adaptive chunk sizes that track activation statistics over training?
- Can forward-pass grid selection be extended beyond 4/6 safely (without backward bias) to squeeze out more accuracy?
- How do these ideas interact with extremely long training schedules (trillion-token scale) and with advanced optimizers or normalization schemes in very large models?
- What is the best way to combine MS-EDEN with outlier-channel handling, activation sparsity, or structured pruning for even larger wins?
06Conclusion & Future Work
Three-sentence summary: This paper introduces MS-EDEN, a way to keep gradients unbiased in NVFP4 while cutting error by more than 2x compared to stochastic rounding by moving randomness from per-element values to shared scales and using randomized rotations. Building on this, the authors design Quartet II, a full NVFP4 training recipe that pairs a high-capacity forward pass (native NVFP4 + Four-over-Six) with an unbiased, low-noise backward pass (MS-EDEN), delivering consistently better accuracy than prior FP4 methods. Custom GPU kernels and a practical scale-alignment trick make the approach fast and feasible, achieving up to 4.2x speedups on linear layers and strong end-to-end gains.
Main achievement: Quartet II shows that accurate, stable, and fast end-to-end 4-bit (NVFP4) pre-training is possible by improving the unbiased gradient estimator itself—MS-EDEN—rather than sacrificing forward representation or accepting high per-element noise.
Future directions: Extend MS-EDEN-like ideas to other formats and vendors, explore richer forward quantization grids that stay unbiased in backward, refine rotation strategies, and combine with outlier control or sparsity for even greater wins. Scaling to even larger models and longer runs will test robustness and uncover further kernel optimizations.
Why remember this: It reframes 4-bit training from “unbiased but noisy” to “unbiased and quiet,” unlocking much of NVFP4’s promised speed and memory savings without paying a big accuracy tax. In short, it moves us meaningfully closer to training top-tier LLMs at a fraction of today’s cost.
Practical Applications
- •Pre-train midsize LLMs on smaller GPU budgets by switching linear layers to NVFP4 with Quartet II.
- •Reduce training time in large-scale language modeling by using MS-EDEN for unbiased, low-noise gradients.
- •Deploy larger batch sizes on the same GPU by storing 4-bit preactivations to save memory.
- •Speed up research iteration cycles (more experiments per week) due to faster end-to-end training.
- •Lower cloud compute bills for long pre-training runs by achieving 2–4x throughput on key layers.
- •Train foundation models in more locations (labs or startups) that previously could not afford it.
- •Combine Quartet II with data-centric recipes (like Nanochat) to get better accuracy-cost tradeoffs.
- •Use the provided CUDA kernels as drop-in building blocks for custom NVFP4 training stacks.
- •Integrate Four-over-Six in the forward pass to improve quantization quality of activations and weights.
- •Explore unbiased 4-bit training for domains beyond language (e.g., vision transformers) with similar GEMM patterns.