FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Key Summary
- The paper shows how to speed up reinforcement learning (RL) for large language models (LLMs) by making numbers smaller (FP8) without breaking training.
- They build a practical rollout system that re-quantizes and reloads model weights every step so the generator always matches the newest policy.
- They shrink two big pieces: linear layers (W8A8) and the attention KV-cache, which together cut memory use and speed up decoding.
- To keep learning stable, they correct the small differences between the FP8 generator and the full-precision trainer using token-level importance sampling with clipping.
- On an 8B dense model, FP8 linear layers alone give about a 10–20% rollout speedup; on a 30B MoE model they give 30–50% because the model is bigger and more memory-hungry.
- Quantizing just the KV-cache brings even larger gains (around 38% by itself), and combining both reaches up to 44% faster rollout.
- End-to-end FP8 (both training and rollout) keeps accuracy, reduces the mismatch between training and inference, and cuts learner time by about 20%.
- The system works across common training backends (FSDP/Megatron-LM) and inference engines (vLLM/SGLang) and uses proven FP8 formats (E4M3) with blockwise scales.
- The approach preserves learning quality on long-context RL tasks (AIME24) when paired with token-level truncated importance sampling.
Why This Research Matters
This work makes the slowest part of RL for LLMs—generation—significantly faster without sacrificing learning quality. Faster rollouts mean researchers can try more ideas in less time, reducing costs and energy use. By shrinking the KV-cache and linear layers with FP8, teams can handle longer contexts and more concurrent users on the same hardware. The stability fix (token-level TIS) keeps training from wobbling even when precision is low, so speed doesn’t come at the price of collapse. End-to-end FP8 further trims training time and narrows the gap between what the trainer expects and what the generator produces. This unlocks practical, stable, low-cost RL for real-world systems that need long, careful reasoning.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re trying to write a super long story with lots of clues, and your friend reads each new version you write to give you feedback. If reading takes forever, you won’t finish your story on time!
🥬 The Concept: Reinforcement Learning (RL) for Large Language Models (LLMs) is like that: the model writes long answers (rollouts), then learns from feedback. What it is: RL teaches an LLM to improve by trying, getting a score (reward), and updating itself. How it works (simple steps):
- Generate answers to prompts (rollout).
- Score those answers (reward model).
- Update the policy (the model) using the scores. Why it matters: If rollouts are slow, the whole learning loop crawls, making experiments costly and progress slow. 🍞 Anchor: When an LLM practices math proofs, it must write long chains of thought. If writing is slow, it learns slowly.
🍞 Hook: You know how a class gets noisy when everyone talks at once? LLMs also juggle lots of stuff at once when they write long answers.
🥬 The Concept: Rollout (generation) is the step where the model produces tokens step-by-step. What it is: The model expands one token at a time into a long sequence. How it works:
- Look at what’s been written so far.
- Use attention to decide what’s important.
- Pick the next token, repeat. Why it matters: Long outputs make attention expensive and fill a special memory called the KV-cache. 🍞 Anchor: Writing a 20,000-token answer is like reading a huge book while remembering every page—you need speed and space.
🍞 Hook: Think of attention like choosing which parts of your notes to reread before a test.
🥬 The Concept: Attention and the KV-cache help the model remember and use past tokens. What it is: The KV-cache stores key and value vectors so the model doesn’t recompute past work. How it works:
- For each new token, look up stored keys/values.
- Weigh them to decide what to focus on.
- Produce the next token using that focus. Why it matters: If the KV-cache is too big or slow, generation stalls, especially with very long sequences. 🍞 Anchor: A crowded locker (KV-cache) slows you down between classes.
🍞 Hook: You know how shrinking photos makes them faster to send, but you still recognize the picture?
🥬 The Concept: FP8 quantization makes numbers use fewer bits so math and memory move faster. What it is: Store and compute with 8-bit floating-point numbers (E4M3) instead of BF16. How it works:
- Scale values so they fit FP8’s range.
- Convert to FP8 for compute/storage.
- Convert back when needed. Why it matters: Smaller numbers mean faster compute and less memory traffic—but too much shrinkage can blur details. 🍞 Anchor: Like using a compressed photo that still looks good enough to share quickly.
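To make the scale, convert, and convert-back steps concrete, here is a minimal per-tensor sketch in PyTorch (not the paper's code). It assumes a PyTorch build with the `torch.float8_e4m3fn` dtype; 448 is the largest magnitude E4M3 can represent.

```python
import torch

E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def fp8_round_trip(x: torch.Tensor) -> torch.Tensor:
    """Quantize a tensor to FP8 (E4M3) and convert it back.

    Per-tensor sketch only: real FP8 kernels keep the 8-bit values and
    carry the scale alongside them instead of converting back right away.
    """
    scale = x.abs().max().clamp(min=1e-12) / E4M3_MAX  # fit values into FP8's range
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)        # convert to FP8 for compute/storage
    return x_fp8.to(torch.float32) * scale             # convert back when needed

x = torch.randn(4, 8)
print((x - fp8_round_trip(x)).abs().max())  # small rounding error, not zero
```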
🍞 Hook: Imagine your teacher updates the answer key every minute; you must use the newest one or you’ll practice the wrong answers.
🥬 The Concept: Training–inference mismatch happens when the model that generates text (inference) isn’t exactly the same as the model that learns (training). What it is: A small drift between rollout policy and training policy. How it works:
- Quantization or different kernels change outputs a little.
- Generated samples no longer match what training expects.
- The learner updates off-policy and can get unstable. Why it matters: This mismatch can make RL wobble or collapse if uncorrected. 🍞 Anchor: Practicing with an outdated answer key leads you to learn the wrong thing.
🍞 Hook: People tried to run fast in heavy boots and still ran slowly; lighter boots helped, but they slipped because the ground was slick.
🥬 The Concept: Earlier attempts sped up parts (like int8 or partial FP8) or used static inference tricks. What it is: Separate speed-ups that didn’t account for RL’s always-changing policy. How it works:
- Quantize once and serve fast.
- But RL changes weights every step.
- Without step-wise syncing and mismatch fixes, training can destabilize. Why it matters: Speed without stability doesn’t help if learning fails. 🍞 Anchor: A race car with a powerful engine but wobbly wheels won’t win.
🍞 Hook: So what was missing?
🥬 The Concept: A practical, step-by-step FP8 rollout stack that updates weights every RL step and corrects mismatch. What it is: A system that combines FP8 speed with importance-sampling stability and dynamic syncing. How it works:
- Quantize linear layers (W8A8) blockwise for accuracy + speed.
- Quantize KV-cache and recalibrate scales each step for long contexts.
- Use token-level truncated importance sampling (TIS) to keep learning on track. Why it matters: You get big speedups while keeping training behavior close to BF16. 🍞 Anchor: It’s like using a fast bicycle with training wheels—you move faster without crashing.
02 Core Idea
🍞 Hook: You know how packing smarter (folding clothes, using cubes) lets you take the same trip with a smaller suitcase and still look great?
🥬 The Concept: The “aha” is to do rollouts in FP8 with smart safeguards so it’s fast and still learns right. What it is: A low-precision (FP8) rollout stack that updates weights every step, shrinks linear layers and the KV-cache, and corrects drift with importance sampling. How it works:
- Shrink math where it counts (W8A8 linear layers) using blockwise scales.
- Shrink memory where it hurts (KV-cache) with per-step QKV scale recalibration.
- Sync quantized weights into the inference engine every RL step.
- Reweight tokens with TIS so the learner sees near on-policy data. Why it matters: Without these pieces together, you either go slow or go unstable; with them, you go fast and stay steady. 🍞 Anchor: Like racing on a smooth track with grippy tires and a tuned engine—you get speed and control.
Multiple analogies:
- Suitcase: FP8 packs the same trip into a smaller bag (memory), blockwise folding prevents wrinkles (accuracy), and TIS keeps your itinerary aligned (stability).
- Microphone check: Rollout is the speaker; training is the recorder. FP8 changes the mic settings (quantization). TIS is the sound engineer that balances levels so the recording matches the performance.
- Team playbook: The coach (training) updates plays every step. Syncing weights delivers the new playbook to players (inference). FP8 makes them run faster; TIS ensures they execute the intended plays.
Before vs After:
- Before: BF16 rollouts dominated time; KV-cache hogged memory; FP8 attempts risked instability.
- After: FP8 W8A8 + FP8 KV-cache yield up to 44% faster rollout; training tracks BF16 when paired with TIS; end-to-end FP8 trims learner time too.
Why it works (intuition):
- Compute: Smaller numbers accelerate tensor core GEMMs.
- Memory: FP8 halves weight and KV traffic, easing bandwidth bottlenecks.
- Accuracy: Blockwise scales localize rounding error; avoiding lm_head quantization protects logits.
- Stability: TIS caps extreme likelihood ratios so off-policy noise doesn’t explode.
- Consistency: Per-step QKV recalibration refreshes scales after every weight update, keeping attention well-calibrated.
Building blocks (each as a mini-sandwich):
🍞 Hook: Folding a shirt neatly makes it both smaller and still wearable. 🥬 W8A8 Quantization (what): Quantize Weights and Activations to FP8 for linear layers using blockwise scales. How:
- Split weight matrices into 128Ă—128 blocks.
- Compute a scale per block from its maximum absolute value.
- Store weights in FP8 (E4M3); quantize activations dynamically at runtime. Why: Blockwise scaling preserves detail where it matters while unlocking FP8 tensor core speed. 🍞 Anchor: Like using small packing cubes for each clothing type so nothing gets squished wrong.
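A minimal sketch of that blockwise recipe, assuming 128x128 tiles and the `torch.float8_e4m3fn` dtype. It is illustrative rather than the authors' kernel code; a real deployment hands the FP8 blocks plus their scales to an FP8 GEMM (e.g., DeepGEMM) instead of materializing them like this.

```python
import torch

E4M3_MAX = 448.0
BLOCK = 128  # 128x128 tiles, as used for the blockwise weight scales

def quantize_weight_blockwise(w: torch.Tensor):
    """Quantize a 2-D weight matrix to FP8 with one scale per 128x128 block.

    Simplified sketch: assumes both dimensions divide evenly by BLOCK and
    returns the FP8 weight plus a grid of per-block scales.
    """
    rows, cols = w.shape
    w_fp8 = torch.empty_like(w, dtype=torch.float8_e4m3fn)
    scales = torch.empty(rows // BLOCK, cols // BLOCK)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            block = w[i:i + BLOCK, j:j + BLOCK]
            s = block.abs().max().clamp(min=1e-12) / E4M3_MAX   # per-block max-abs scale
            w_fp8[i:i + BLOCK, j:j + BLOCK] = (block / s).to(torch.float8_e4m3fn)
            scales[i // BLOCK, j // BLOCK] = s
    return w_fp8, scales
```

Activations would get the same treatment dynamically at runtime, with scales computed from the live tensor just before each FP8 matmul.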
🍞 Hook: Labels on boxes help you find stuff fast. 🥬 KV-cache Quantization (what): Store keys/values in FP8 and recalibrate QKV scales each step. How:
- After each policy update, trigger QKV scale recalculation.
- Use engine-side or trainer-side calibration to set correct scales.
- Decode using the compact FP8 cache. Why: The KV-cache dominates long-context memory—shrinking it boosts concurrency and throughput. 🍞 Anchor: A well-labeled, compact locker fits more books and speeds up class changes.
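Here is a toy sketch of the idea, assuming a simplified, non-paged cache; the class and method names (`FP8KVCache`, `recalibrate`) are illustrative and not the vLLM/SGLang API.

```python
import torch

E4M3_MAX = 448.0

class FP8KVCache:
    """Toy FP8 KV-cache: stores keys/values in E4M3 with one scale each.

    Illustrative only; real engines manage paged caches and fused attention
    kernels. The point is the per-step scale refresh after a policy update.
    """
    def __init__(self):
        self.k_scale = torch.tensor(1.0)
        self.v_scale = torch.tensor(1.0)
        self.k_blocks, self.v_blocks = [], []

    def recalibrate(self, k_sample: torch.Tensor, v_sample: torch.Tensor):
        # Called once per RL step, after new weights are loaded, so the
        # scales match the activation statistics of the updated policy.
        self.k_scale = k_sample.abs().max().clamp(min=1e-12) / E4M3_MAX
        self.v_scale = v_sample.abs().max().clamp(min=1e-12) / E4M3_MAX
        self.k_blocks.clear()
        self.v_blocks.clear()

    def append(self, k: torch.Tensor, v: torch.Tensor):
        # Store new keys/values in compact FP8 form.
        self.k_blocks.append((k / self.k_scale).to(torch.float8_e4m3fn))
        self.v_blocks.append((v / self.v_scale).to(torch.float8_e4m3fn))

    def read(self):
        # Dequantize on read; fused kernels would consume FP8 directly.
        k = torch.cat([b.to(torch.float32) for b in self.k_blocks]) * self.k_scale
        v = torch.cat([b.to(torch.float32) for b in self.v_blocks]) * self.v_scale
        return k, v
```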
🍞 Hook: Everyone needs the newest map before a field trip. 🥬 Dynamic Weight Synchronization (what): Push freshly updated weights from training to inference every step and then quantize. How:
- Pull BF16/FP16 weights from FSDP/Megatron.
- Quantize blockwise to FP8.
- Load into vLLM/SGLang for rollout. Why: Without this, rollout lags behind training and creates extra mismatch. 🍞 Anchor: Handing out the latest playbook before the game starts.
🍞 Hook: If one student whispers a word and another writes it down, tiny mis-hearings can spread. 🥬 Token-level Truncated Importance Sampling (what): Reweight each generated token by how likely the training policy would have chosen it, but clip extremes. How:
- Compute weight = p_train(token)/p_rollout(token).
- Clip at C=2 to limit variance.
- Use these weights in updates. Why: Prevents small precision differences from derailing learning. 🍞 Anchor: Like adjusting a microphone’s volume so the recording matches the singer without screeching feedback.
03 Methodology
At a high level: Prompts → (Step A) Sync and Quantize Weights → (Step B) FP8 Rollout with Dynamic Activations + FP8 KV-cache → (Step C) Token-level TIS Correction → (Step D) Policy Update → Repeat.
Step A: Dynamic Weight Synchronization
- What happens: Each RL step, the trainer (FSDP/Megatron-LM) provides updated BF16/FP16 weights. The rollout engine (vLLM/SGLang) quantizes them blockwise (128Ă—128) into FP8 (E4M3) for all supported linear layers (q/k/v/o projections, MLPs, MoE experts), excluding embeddings, norms, and lm_head.
- Why this exists: RL changes the policy every step; serving must reflect the newest policy or rollouts become stale and off-policy.
- Example: Suppose an 8B model updates gate_proj and up_proj. The pipeline extracts those tensors, computes per-block scales from max-abs values, converts to FP8, and hot-swaps them into the inference engine before generation starts.
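A hedged sketch of the per-step sync loop, reusing the `quantize_weight_blockwise` helper sketched in the Core Idea section; `rollout_engine.load_weights` is a placeholder for whatever weight-reload hook the serving engine exposes, not a real vLLM/SGLang call.

```python
def sync_policy_to_rollout(trainer_model, rollout_engine,
                           skip=("embed", "norm", "lm_head")):
    """Per-step sync: pull BF16 weights, quantize blockwise, hot-swap into the engine.

    Sketch only; a real pipeline gathers sharded FSDP/Megatron weights first
    and reloads them through the engine's own weight-update path.
    """
    fp8_state = {}
    for name, param in trainer_model.state_dict().items():
        if any(tag in name for tag in skip) or param.dim() != 2:
            fp8_state[name] = param                   # keep sensitive tensors in high precision
        else:
            w_fp8, scales = quantize_weight_blockwise(param.float())
            fp8_state[name] = (w_fp8, scales)         # FP8 blocks + per-block scales
    rollout_engine.load_weights(fp8_state)            # hot-swap before generation starts
```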
Step B: FP8 Rollout with Dynamic Activations and KV-cache
- What happens: During decoding, activations are quantized on-the-fly to FP8 (dynamic quantization) for linear layers. If KV-cache FP8 is enabled, the system recalibrates QKV scales at the start of the rollout for the new weights, then stores keys/values in FP8.
- Why this exists: Dynamic activation quantization preserves accuracy while keeping compute fast. KV-cache FP8 slashes memory footprint, enabling more concurrent tokens and fewer preemptions.
- Example: With 20K-token max responses, the FP8 KV-cache halves cache memory, allowing more simultaneous sequences without evicting old tokens.
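A quick back-of-envelope sizing, with illustrative model shapes that are not taken from the paper, shows why one byte per cached value matters at 20K tokens:

```python
# Back-of-envelope KV-cache sizing (illustrative shapes, not from the paper).
layers, kv_heads, head_dim, seq_len = 36, 8, 128, 20_000

def kv_bytes(bytes_per_value: int) -> int:
    # 2x for keys and values, per layer, per head, per position.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

bf16 = kv_bytes(2)  # BF16: 2 bytes per value
fp8 = kv_bytes(1)   # FP8: 1 byte per value
print(f"BF16 KV-cache: {bf16 / 1e9:.2f} GB, FP8: {fp8 / 1e9:.2f} GB")
# Halving the per-sequence cache roughly doubles how many long sequences fit.
```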
Step C: Token-level Truncated Importance Sampling (TIS)
- What happens: For each token generated by the FP8 rollout policy, the trainer evaluates its log-probability under the BF16 (or FP8-train) policy. It computes the importance weight w = p_train / p_rollout, then clips it to C=2.
- Why this exists: FP8 introduces small drifts; TIS corrects the off-policy component and keeps gradient updates stable.
- Example: If rollout chose a token with probability 0.05 but the trainer would choose it with 0.06, w=1.2 (no clip). If a rare token yields w=4.5, it gets clipped to 2 to avoid exploding variance.
Step D: Policy Update (e.g., DAPO/PPO-like loop)
- What happens: Using rewards, advantages, and TIS weights, the learner updates the policy and value networks. Then the next iteration returns to Step A.
- Why this exists: This closes the RL loop while keeping the serving side in lockstep with the fresh policy.
- Example: After a batch of 32 prompts with 16 responses each, the learner computes advantages, applies clipped IS weights, and makes one update per iteration to isolate rollout effects.
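As a rough illustration of where the TIS weights enter the update, here is a minimal token-level policy-gradient surrogate with the clipped weight. The paper's actual DAPO objective also includes PPO-style ratio clipping and other terms, so treat this as a sketch, not the exact loss.

```python
import torch

def tis_weighted_pg_loss(logp_train: torch.Tensor,
                         logp_rollout: torch.Tensor,
                         advantages: torch.Tensor,
                         tis_clip: float = 2.0) -> torch.Tensor:
    """Token-level surrogate with truncated importance sampling (sketch).

    logp_train:   log-probs of the sampled tokens under the training policy
    logp_rollout: log-probs of the same tokens under the FP8 rollout policy
    """
    # w = p_train / p_rollout, clipped at C = 2; detached so it only reweights.
    tis_w = torch.exp(logp_train.detach() - logp_rollout).clamp(max=tis_clip)
    # REINFORCE-style token loss, reweighted toward the training policy.
    return -(tis_w * advantages * logp_train).mean()
```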
Secret Sauce (why the whole recipe works together):
- Blockwise FP8 scales keep accuracy high where values vary.
- Avoiding quantization of lm_head protects final logits and text quality.
- Per-step QKV recalibration ensures attention stays well-scaled after every policy change.
- TIS converts the slightly off-policy FP8 samples into near on-policy training signals.
- Engineering glue (weight hot-swapping, engine patches, DeepGEMM kernels, CUDA 12.9+) turns the math idea into reliable speed on real hardware.
Mini “Sandwich” refreshers inside the steps:
🍞 Hook: Like tailoring each shirt’s fold instead of smashing the whole wardrobe into one fold. 🥬 Blockwise Quantization (what): Split weights into 128×128 tiles, scale each tile, quantize to FP8. How: Per-block max-abs → scale → round to E4M3. Why: Finer scales = less rounding error. 🍞 Anchor: Every cube fits its clothes just right.
🍞 Hook: New rules? Update the team before the game. 🥬 Weight Sync (what): Reload quantized weights each step into vLLM/SGLang. How: Pull from trainer → quantize → hot-load. Why: Old weights = off-policy rollouts. 🍞 Anchor: Fresh playbook every match.
🍞 Hook: Your locker labels should match your new schedule. 🥬 QKV Scale Recalibration (what): Refresh FP8 scales for attention after policy updates. How: Engine-side (reset flags, calibrate on first forward) or trainer-side (calibrate on samples, sync scales). Why: Keeps attention stable across steps. 🍞 Anchor: Update binder tabs when classes change.
🍞 Hook: Volume knobs prevent screechy audio. 🥬 TIS (what): Token-wise IS with clipping C=2. How: Compute p_train/p_rollout; clip to 2. Why: Caps variance; stabilizes learning. 🍞 Anchor: Balanced soundboard = clean recording.
Concrete micro-example with numbers:
- Suppose rollout picks tokens t1, t2, t3 with p_rollout = [0.20, 0.05, 0.01]. The trainer’s p_train = [0.22, 0.04, 0.03].
- IS weights = [1.10, 0.80, 3.00]; with C=2 → [1.10, 0.80, 2.00].
- These scale the loss/advantages so training reflects what the trainer policy prefers, without letting rare outliers dominate.
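The same arithmetic in a few lines of Python:

```python
p_rollout = [0.20, 0.05, 0.01]
p_train = [0.22, 0.04, 0.03]
C = 2.0

weights = [min(pt / pr, C) for pt, pr in zip(p_train, p_rollout)]
print(weights)  # ~[1.10, 0.80, 2.00]; the raw 3.00 on the rare token is clipped to 2.00
```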
04 Experiments & Results
🍞 Hook: If you race three bikes—a heavy one, a lighter one, and a lightweight with tuned tires—you’ll see who’s really fastest on the track.
🥬 The Concept: The authors test speed and learning quality across dense and MoE models with different precision setups. What it is: A head-to-head comparison of BF16 vs FP8 variants during RL on long-context tasks. How it works:
- Train with DAPO and evaluate on the AIME24 validation set, using H100 GPUs.
- Measure learning (accuracy, reward, response length) and stability (mismatch KL), plus speed (ms/token).
- Compare BF16, FP8-W8A8, FP8-KV-cache, and combined full FP8. Why it matters: We want to go faster without losing smarts. 🍞 Anchor: Like testing lap times and making sure the rider still follows the course perfectly.
Metrics “sandwich” callouts:
🍞 Hook: Report cards don’t just show grades; they show effort and behavior too. 🥬 Validation Accuracy (what): Percent of problems solved on AIME24. How: Periodically evaluate the current policy. Why: Direct signal of learning quality. 🍞 Anchor: Final score on a math quiz.
🍞 Hook: Longer essays often mean deeper thinking. 🥬 Response Length (what): Average tokens per answer. How: Track generated length during training. Why: In long-context RL, longer can indicate richer reasoning. 🍞 Anchor: A student’s multi-step proof vs a one-liner.
🍞 Hook: If two singers sing the same song slightly differently, you can measure their difference. 🥬 Mismatch KL (what): KL divergence between rollout and training policies. How: Compare token distributions. Why: High KL can destabilize learning; lower is safer. 🍞 Anchor: How off-key one performance is from the reference.
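One simple way to estimate this metric from the tokens the rollout actually produced (a sketch; the paper may use a different estimator):

```python
import torch

def mismatch_kl(logp_rollout: torch.Tensor, logp_train: torch.Tensor) -> torch.Tensor:
    """Monte Carlo estimate of KL(rollout || train) over sampled tokens.

    Uses the plain log-ratio estimator E[log p_rollout - log p_train];
    averaged over many tokens it stays near zero when the two policies
    agree and grows as they drift apart.
    """
    return (logp_rollout - logp_train).mean()
```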
Key findings (with context):
- Dense 8B (Qwen3-8B-Base):
- FP8 W8A8 + TIS tracks BF16 on accuracy, reward, and response length. Without TIS, FP8 degrades—showing correction is crucial.
- Speed: FP8 W8A8 gives about 10–20% faster rollout, especially at long sequences where memory bandwidth dominates.
- MoE 30B (Qwen3-30B-A3B-Base):
- FP8 W8A8 + TIS matches BF16+TIS on learning metrics. MoE shows rising mismatch KL over time (even in BF16) due to routing differences, but TIS keeps training stable.
- Speed: FP8 W8A8 yields 30–50% faster rollout—bigger wins than the 8B dense model because compute and memory scaling amplify FP8 benefits.
- KV-cache FP8 (8B tests):
- KV-only FP8 aligns closely with BF16 accuracy; mismatch KL is slightly higher than linear-only FP8 but remains stable with TIS.
- Speed: KV-only FP8 gives ~38% speedup; combining linear FP8 + KV FP8 reaches ~44%.
- End-to-end FP8 (8B, NeMo-RL):
- Accuracy aligns with BF16.
- Mismatch drops versus FP8-rollout-only (precision alignment helps).
- Learner-side time falls by ~20%.
Surprises and insights:
- KV-cache quantization can outperform linear-only speedups on small dense models with very long outputs, because KV memory is the dominant bottleneck.
- MoE needs rollout correction even in BF16 due to router inconsistencies; FP8 doesn’t add instability when TIS is used.
- Precision alignment (FP8 train + FP8 rollout) reduces mismatch vs FP8 rollout-only, but isn’t a silver bullet—some residual drift remains.
Scoreboard in plain words:
- 10–20% faster for 8B dense with FP8 linear.
- 30–50% faster for 30B MoE with FP8 linear.
- ~38% from KV-only FP8; up to ~44% when combined with linear FP8.
- Learning curves (accuracy, reward, length) stay close to BF16 when TIS is applied.
Bottom line: With TIS, FP8 can make RL rollouts much faster without sacrificing learning quality.
05 Discussion & Limitations
🍞 Hook: Fast shoes help you win races, but you still need good balance and a track that fits them.
🥬 The Concept: This method is powerful but not magic; it has limits, needs certain tools, and fits some races better than others. What it is: A clear-eyed look at when FP8-RL shines and where caution is needed. How it works:
- Name limitations.
- List resources required.
- Explain when not to use.
- Share open questions. Why it matters: Knowing the edges prevents stumbles. 🍞 Anchor: Reading the map before the hike avoids wrong turns.
Limitations:
- Residual mismatch: Even with TIS, FP8 runs can show higher mismatch KL than BF16. In MoE, router differences can increase over time.
- Calibration overhead: Trainer-side QKV calibration adds ~2–3% step time (small, but present).
- Aggressive stacks: Full FP8 (linear + KV + attention) raises mismatch the most; TIS stabilizes it, but margins are thinner.
- Coverage: lm_head stays unquantized to protect logits; quantizing it risks quality drops.
- Future formats: More extreme formats (e.g., NVFP4) may be unstable due to accumulated error.
Required resources:
- Modern GPUs (e.g., H100-class) with FP8 tensor core support.
- CUDA ≥ 12.9 and DeepGEMM-enabled inference engines (vLLM ≥ 0.11, SGLang ≥ 0.55).
- RL stack with backend/engine integration (FSDP/Megatron-LM + vLLM/SGLang) and rollout correction (TIS/MIS).
When NOT to use:
- Very short outputs or tiny models where KV/compute aren’t bottlenecks; FP8 gains may be small.
- Strict bitwise reproducibility across heterogeneous stacks without aligned kernels.
- Older hardware without FP8 acceleration; overheads might outweigh benefits.
- Setups where you cannot apply rollout correction; FP8 without TIS can hurt stability.
Open questions:
- Best practices for MoE routing consistency (e.g., MIS, routing replay) under FP8.
- Automated tuning of TIS clipping (C) per task/phase.
- Scaling to larger models and multi-turn/agentic RL with extreme contexts.
- Safe quantization of more components (e.g., partial lm_head?) without hurting quality.
- Full-stack bitwise consistency across training and inference to further reduce mismatch.
06 Conclusion & Future Work
Three-sentence summary: The paper introduces FP8-RL, a practical FP8 rollout stack for LLM RL that keeps weights synchronized every step, shrinks linear layers and the KV-cache, and stabilizes learning with token-level importance sampling. It delivers sizable rollout speedups—10–20% for 8B dense, 30–50% for 30B MoE, and up to 44% when adding KV-cache FP8—while preserving accuracy close to BF16 on long-context tasks. End-to-end FP8 further reduces mismatch and cuts learner time by about 20%.
Main achievement: Showing that low-precision FP8 can be used safely and effectively in RL rollouts (and even in training) by combining blockwise quantization, per-step QKV recalibration, dynamic weight sync, and TIS—turning speed gains into real, stable progress.
Future directions: Explore even leaner formats (with care), improve router consistency for MoE (MIS/R3), automate mismatch control, scale to larger and multi-turn agent settings, and push for bitwise-consistent stacks.
Why remember this: It’s a blueprint for making RL fast where it hurts most—generation—without losing learning quality, unlocking quicker experiments, lower costs, and longer-context capabilities for the next wave of LLM training.
Practical Applications
- Speed up RLHF training cycles for long-reasoning tasks (math, code, science) while maintaining accuracy.
- Reduce GPU memory pressure to support longer contexts (e.g., 20K tokens) without frequent preemptions.
- Serve more concurrent rollouts per GPU by halving the KV-cache footprint via FP8 quantization.
- Cut end-to-end experiment time with end-to-end FP8, enabling faster hyperparameter sweeps.
- Stabilize low-precision RL with token-level TIS so teams can safely adopt FP8 in production.
- Lower cloud costs by achieving similar learning quality with fewer GPU-hours.
- Enable larger MoE models to train with better throughput by shrinking compute and memory bottlenecks.
- Integrate seamlessly with popular stacks (FSDP/Megatron-LM + vLLM/SGLang) for easy adoption.
- Prototype new RL algorithms faster by reusing the same FP8 rollout correction tools (TIS/MIS).
- Scale to agentic, multi-turn RL setups by expanding KV capacity and keeping generation stable.