DiRL: An Efficient Post-Training Framework for Diffusion Language Models

Intermediate
Ying Zhu, Jiaxin Wan, Xiaoran Liu et al. · 12/23/2025

Key Summary

  • This paper builds DiRL, a fast and careful way to finish training diffusion language models so they reason better.
  • It fixes a long-standing mismatch between how these models are trained and how they actually answer questions.
  • DiRL uses FlexAttention to run the complex attention masks that diffusion training needs, cutting training time by about 6×.
  • It ties training directly to an always-on inference server (LMDeploy) so the model updates instantly without slow file saves.
  • The new RL algorithm, DiPO, is the first unbiased GRPO for diffusion LLMs, making learning signals accurate and stable.
  • Using two stages—Supervised Fine-Tuning (SFT) then Reinforcement Learning (RL)—DiRL-8B-Instruct becomes the top math dLLM.
  • On tough math tests like AIME24/25 and OlympiadBench, DiRL-8B-Instruct even beats larger, well-known AR models.
  • Efficient rollouts and online updates mean researchers can iterate quickly and cheaply on reasoning tasks.
  • The framework shows that diffusion LLMs can match or surpass AR models when post-training is done right.

Why This Research Matters

Better reasoning from smaller, efficient models means students and teachers can get high-quality math help without giant hardware. Faster training and live updates let research teams iterate quickly, making new ideas practical sooner. Matching training to inference reduces surprises at deployment, improving reliability in educational and professional tools. Accurate, unbiased RL signals help models learn the right lessons, cutting down on brittle or lucky wins. If diffusion LLMs can match or beat larger AR models after smart post-training, powerful reasoning becomes more accessible and affordable. This opens the door to improved tutoring systems, scientific assistance, and safer decision support across many domains.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: You know how a student can read lots of books (pre-training) but still needs practice tests and coaching (post-training) to do well on real exams? Reading and practicing are not the same thing.

🥬 The Concept (Language Model Basics): A language model is a program that guesses the next words or fills in missing parts to make helpful text. How it works:

  1. It reads a prompt.
  2. It uses patterns it learned to score possible next tokens.
  3. It picks tokens and repeats until it forms an answer. Why it matters: Without a good way to use what it learned after reading (post-training), it might still fumble hard problems. 🍞 Anchor: Like a quiz coach who helps you turn what you read into test-ready strategies.

🍞 Hook: Imagine writing a story one word at a time, always in order. 🥬 The Concept (Autoregressive, AR): An AR model writes from left to right, choosing the next token based on all previous tokens. How it works: 1) Look at what’s been written. 2) Score each possible next token. 3) Pick one and append. 4) Repeat. Why it matters: It’s simple and stable, but can be slow and sometimes gets stuck. 🍞 Anchor: Like typing a sentence letter by letter.
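
To make the left-to-right loop concrete, here is a minimal sketch with a toy scoring function; the vocabulary, the fake scores, and the greedy picking rule are illustrative assumptions, not anything from the paper.

```python
# Minimal autoregressive decoding loop with a toy "model".
# The vocabulary and the random scores are placeholders for illustration.
import torch

vocab = ["<eos>", "2", "+", "=", "4", "the", "answer", "is"]

def toy_logits(prefix_ids):
    # A real model would run a forward pass over the prefix; we fake the scores.
    torch.manual_seed(len(prefix_ids))
    return torch.randn(len(vocab))

prefix = [5, 6, 7]                       # "the answer is"
for _ in range(5):                       # 1) look at what has been written
    logits = toy_logits(prefix)          # 2) score every possible next token
    next_id = int(torch.argmax(logits))  # 3) pick one (greedy here) and append
    prefix.append(next_id)
    if vocab[next_id] == "<eos>":
        break
print(" ".join(vocab[i] for i in prefix))
```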

🍞 Hook: Imagine repairing a fuzzy sentence by clarifying many spots at once. 🥬 The Concept (Diffusion Language Model, dLLM): A dLLM starts with a partly masked text and “denoises” it to reveal the correct tokens. How it works: 1) Add masks to parts of text. 2) The model predicts missing tokens. 3) Repeat steps until text becomes clear. Why it matters: It can fill many tokens in parallel, promising faster inference—but training and usage must match to work well. 🍞 Anchor: Like clearing fog from many window panes at once.
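
By contrast, a masked-diffusion decoder can reveal several positions per pass. The sketch below is a simplified illustration; the toy predictor and the "commit the most confident predictions" schedule are assumptions, not the paper's exact sampler.

```python
# Simplified masked-diffusion decoding: start fully masked, repeatedly predict
# every masked slot, and commit the most confident predictions each pass.
import torch

MASK = -1
seq = [MASK] * 8                          # 1) a fully masked block

def toy_predict(seq):
    # A real dLLM would return per-position logits; we fake confidences/tokens.
    torch.manual_seed(sum(t for t in seq if t != MASK) + len(seq))
    return torch.rand(len(seq)), torch.randint(0, 100, (len(seq),))

steps, per_step = 4, 2
for _ in range(steps):
    conf, toks = toy_predict(seq)         # 2) predict all missing tokens at once
    masked = [i for i, t in enumerate(seq) if t == MASK]
    keep = sorted(masked, key=lambda i: -float(conf[i]))[:per_step]
    for i in keep:                        # 3) fix the most confident ones,
        seq[i] = int(toks[i])             #    leave the rest for the next pass
print(seq)                                # every position is now revealed
```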

🍞 Hook: Think of writing a long essay in chunks: first paragraph 1, then paragraph 2, and so on. 🥬 The Concept (Blockwise dLLM): A blockwise dLLM splits text into blocks; it keeps order between blocks (like AR) but denoises tokens inside a block in parallel (like diffusion). How it works: 1) Break the sequence into equal blocks. 2) Use past blocks as context. 3) Denoise all tokens in the current block together. 4) Move to the next block. Why it matters: This design allows exact token scores (logits) per block and enables fast caches, solving big efficiency headaches. 🍞 Anchor: Like finishing one paragraph cleanly before moving to the next, while editing all sentences of that paragraph at the same time.

🍞 Hook: When you prepare for a contest, practicing in the exact same format as test day makes you improve much faster. 🥬 The Concept (Training–Inference Mismatch): This happens when the way a model is trained (random masks) doesn’t match how it actually answers (block-by-block decoding). How it works: 1) Train with one procedure. 2) Use a different one during inference. 3) The model gets confused and underperforms. Why it matters: If practice doesn’t match the game, performance drops, especially on tough math. 🍞 Anchor: Practicing multiple-choice but getting an essay on test day.

🍞 Hook: A scoreboard shows how confident you are about each choice. 🥬 The Concept (Logits): Logits are the raw scores before turning into probabilities for each token. How it works: 1) Model computes scores for tokens. 2) Scores become probabilities. 3) We use them to decide and to train. Why it matters: If you can’t compute correct logits, you can’t train RL properly. 🍞 Anchor: The taller the bar on a chart, the more likely the answer.
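
As a quick illustration of logits turning into probabilities, here is a generic softmax on made-up scores (nothing DiRL-specific):

```python
import torch

logits = torch.tensor([2.0, 0.5, -1.0])  # raw scores for three candidate tokens
probs = torch.softmax(logits, dim=-1)    # higher logit -> higher probability
print(probs, float(probs.sum()))         # the probabilities sum to 1
```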

🍞 Hook: Keeping notes so you don’t have to reread the whole book every time. 🥬 The Concept (KV Cache): A memory that stores past attention results so the model doesn’t recompute them. How it works: 1) Save key/value summaries per step. 2) Reuse them for future steps. 3) Speed increases a lot. Why it matters: Without KV cache, inference and RL rollouts can be painfully slow in big models. 🍞 Anchor: Bookmarks that jump you to exactly where you left off.
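
A toy sketch of the caching idea follows; real implementations cache per layer and per attention head, so the shapes here are simplified assumptions.

```python
# Toy KV cache: compute keys/values only for the newest token and append them,
# instead of recomputing them for the whole prefix at every step.
import torch

d_model = 16
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)
cache_k, cache_v = [], []

def step(new_token_embedding):
    cache_k.append(new_token_embedding @ W_k)   # save this step's key
    cache_v.append(new_token_embedding @ W_v)   # save this step's value
    return torch.stack(cache_k), torch.stack(cache_v)

for _ in range(5):
    K, V = step(torch.randn(d_model))
print(K.shape)  # torch.Size([5, 16]) -- grows one row per step, never recomputed
```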

🍞 Hook: After learning facts, you still need practice and coaching to shine. 🥬 The Concept (Post-Training, SFT + RL): Post-training means adding two steps after pre-training: Supervised Fine-Tuning (SFT) on good answers, then Reinforcement Learning (RL) from rewards. How it works: 1) SFT: copy high-quality solutions. 2) RL: try, get a score, adjust policy to get better scores. Why it matters: Without post-training, models often struggle with complex multi-step reasoning. 🍞 Anchor: First study solved examples, then drill with a coach who gives you points.

🍞 Hook: Ranking your answers by how good they are compared to your own other answers. 🥬 The Concept (GRPO): Group Relative Policy Optimization is an RL method that samples several answers per question and learns by comparing within the group. How it works: 1) Generate multiple outputs. 2) Score each. 3) Push up better ones, push down worse ones, with safety clipping. Why it matters: It removes the need for a separate value model and works well online. 🍞 Anchor: Like trying five strategies on a math problem and keeping what worked best.
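
A minimal sketch of the group-comparison step (the rewards and group size below are made up; the policy update itself is sketched later in the DiPO section):

```python
# Group-relative advantages, GRPO-style: sample several answers for one prompt,
# score them, and compare each answer against the group mean (std-normalized).
import torch

rewards = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0])   # e.g. 1 = correct final answer
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages)   # positive for better-than-average answers, negative otherwise
```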

The world before: AR models were strong but sometimes slow, and diffusion LLMs (dLLMs) promised fast, parallel token filling. Pre-training dLLMs was feasible, but post-training—especially RL—was clunky. Why? dLLMs once lacked exact per-token logits, used random masking unlike real inference, and didn’t enjoy smooth integration between training and inference. Failed attempts used biased approximations (like one-step random masks) or inefficient engineering loops (reloading models each step). The gap: a system that makes training-time computation match inference-time behavior and gives unbiased, efficient token scores—plus an RL algorithm tailored for diffusion. Real stakes: Better math reasoning can power tutoring, scientific help, and accurate code or planning. DiRL fills this gap with a fast, consistent, and scalable post-training recipe for dLLMs.

02 Core Idea

🍞 Hook: Imagine upgrading a race car so practice laps and real races feel identical, and the pit crew can tweak the engine live, mid-race.

🥬 The Concept (DiRL Framework): DiRL is a post-training framework that makes training match inference for blockwise diffusion LLMs, while speeding everything up with FlexAttention and a live inference server. How it works:

  1. Use blockwise dLLMs so we can compute exact per-block logits and keep a KV cache.
  2. Train with FlexAttention-friendly masks that mirror real decoding.
  3. Serve the model via LMDeploy and push parameter updates in-place after every step.
  4. Run two-stage post-training: SFT on high-quality math solutions, then RL with DiPO (unbiased GRPO for dLLMs). Why it matters: Without this match and speed, diffusion models underperform on tough reasoning. 🍞 Anchor: A pit crew (LMDeploy) updates the car (model) while it’s running, and the practice track (training masks) is the same as race day (inference).

Three analogies:

  1. Sports practice: Train with the same drills you’ll use in the game, and your coach updates your playbook instantly after each play.
  2. Cooking: Use the same oven and timing during practice as in the real dinner rush; the head chef can tweak the recipe on the fly.
  3. Orchestra: Rehearsals use the same hall and seating as the concert, and the conductor can adjust tempo while the music plays.

Before vs After:

  • Before: dLLM post-training used random masks unlike real decoding, lacked exact logits, and reloaded checkpoints repeatedly.
  • After: DiRL matches training to inference with blockwise masks, computes unbiased logits efficiently, pushes online updates, and runs RL (DiPO) stably.

Why it works (intuition, no equations):

  • Matching the path: Training follows the same block-by-block steps as inference, so learning signals align with how the model will really answer.
  • Unbiased scores: Blockwise structure lets us compute true token scores within each block, avoiding drift.
  • Efficient attention: FlexAttention handles complex masks fast, so we don’t pay a speed penalty.
  • Tight loop: Serving the model and updating it in place eliminates IO stalls and keeps the policy fresh.
  • GRPO without a critic: Comparing multiple answers per prompt stabilizes learning with light compute; DiPO adapts this cleanly to diffusion.

Building blocks (each piece in simple terms):

  • Blockwise dLLM: Keep order across blocks, denoise inside a block in parallel.
  • FlexAttention masks: Regular, fine-grained masks that match block decoding and run fast in PyTorch (see the sketch after this list).
  • LMDeploy server: Always-on model that accepts immediate parameter updates.
  • Online rollouts: Generate answers at inference speed and feed them right back for training.
  • SFT then RL: First copy good math chains, then polish with rewards.
  • DiPO (unbiased GRPO): Compute per-step token ratios correctly and use group comparisons with clipping and a reference KL to stay stable.
  • Dynamic decoding: Confident tokens can be fixed early to speed up decoding without much risk.
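
To make the FlexAttention building block concrete, here is a minimal sketch of a generic block-causal mask using PyTorch's FlexAttention API (assuming PyTorch 2.5+); the block size and tensor shapes are illustrative, and this is not necessarily DiRL's exact training mask.

```python
# Generic block-causal mask with FlexAttention: a token may attend to every
# token in its own block and in earlier blocks, mirroring blockwise decoding.
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

BLOCK, SEQ_LEN = 32, 256   # assumed block size and sequence length

def block_causal(b, h, q_idx, kv_idx):
    return (kv_idx // BLOCK) <= (q_idx // BLOCK)

device = "cuda" if torch.cuda.is_available() else "cpu"
mask = create_block_mask(block_causal, B=None, H=None,
                         Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN, device=device)
q = torch.randn(1, 8, SEQ_LEN, 64, device=device)
k = torch.randn(1, 8, SEQ_LEN, 64, device=device)
v = torch.randn(1, 8, SEQ_LEN, 64, device=device)
out = flex_attention(q, k, v, block_mask=mask)
print(out.shape)   # torch.Size([1, 8, 256, 64])
```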

🍞 Anchor: DiRL turns practice into game time: same rules, same field, and a coach who can whisper new instructions between plays—leading to smarter, faster math reasoning.

03 Methodology

At a high level: Prompt → (SFT or RL) → Blockwise forward passes with FlexAttention masks → Exact per-block logits and outputs → Online update via LMDeploy → Improved model.
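
The skeleton below sketches that flow as a runnable toy. Every helper (the rollout call, the reward check, the DiPO loss, the in-place weight push) is a stub with a hypothetical name standing in for the real LMDeploy/trainer integration, not an actual API.

```python
# Schematic of the online post-training loop; all helpers are stubs with
# hypothetical names, not real LMDeploy or DiRL APIs.
import random

def generate_rollouts(server, prompt, n):          # stub: server-side sampling
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def reward_fn(prompt, output):                     # stub: e.g. verified math answer
    return random.random()

def dipo_loss(policy, prompts, groups, rewards):   # stub: unbiased GRPO-style loss
    return sum(sum(r) for r in rewards)

def optimizer_step(policy): pass                   # stub: gradient update
def push_weights_in_place(server, policy): pass    # stub: refresh the served model

def post_training_loop(policy, server, prompts, num_steps=2, group_size=4):
    for _ in range(num_steps):
        groups = [generate_rollouts(server, p, group_size) for p in prompts]
        rewards = [[reward_fn(p, o) for o in g] for p, g in zip(prompts, groups)]
        loss = dipo_loss(policy, prompts, groups, rewards)   # blockwise logits here
        optimizer_step(policy)                     # update the trainer's copy
        push_weights_in_place(server, policy)      # no checkpoint save/reload
        print("step done, pseudo-loss:", round(loss, 3))

post_training_loop(policy={}, server=None, prompts=["solve: 2 + 2"])
```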

🍞 Hook: Imagine solving a math worksheet with two steps: first copy expert examples, then try on your own while getting points. 🥬 The Concept (Two-Stage Post-Training): SFT teaches the model clean solution patterns; RL teaches it to prefer higher-scoring answers. How it works:

  1. SFT Stage:
  • Data: Long OpenR1-Math expert trajectories, capped at 8k tokens.
  • Training: Use LLaMA-Factory + FlexAttention masks that repeat blocks in a regular way so attention aligns with blockwise decoding.
  • Why it matters: Without SFT, RL has weak starting habits and can wobble. Example: The model learns to show steps like factoring or substitution consistently.
  2. RL Stage (DiPO):
  • Data: Big-Math with verified, high-quality problems and rewards.
  • Rollouts: Serve the current model on LMDeploy; for each prompt, generate multiple candidate solutions (a group).
  • Logits and ratios: Use blockwise passes to get accurate token scores step-by-step; compute importance ratios per token without bias.
  • Policy update: Apply GRPO-style learning: boost better group members, suppress worse ones, with clipping and a KL to a reference model to avoid drifting too far.
  • Online updates: Push the new parameters straight into the running server—no reloading, no saving overhead. Why it matters: Without unbiased ratios and online updates, RL is either noisy or slow. Example: For one geometry question, the model tries 32 solutions; the ones that land the correct final number—and have coherent steps—get reinforced immediately.

🍞 Anchor: First, the student copies clear worked examples; then they try many answers, keep the best tricks, and instantly update their notes.

Detailed steps like a recipe:

  • Input: A math prompt up to 8k tokens.
  • Step A (Blockwise Encoding): Use past blocks as context, keep a KV cache, and prepare FlexAttention masks that match exactly how blocks will be decoded.
  • Step B (Denoising per Block): Predict all tokens in the current block in parallel, using historical blocks as context.
  • Step C (Logit Extraction): Collect per-token logits from the denoising pass for unbiased scoring.
  • Step D (SFT Loss or GRPO Update):
    • SFT: Compare predicted tokens to reference answers; compute cross-entropy only where tokens are masked for the current block (see the sketch after this list).
    • RL (DiPO): Generate multiple outputs; compute group-relative advantages; apply clipped policy gradients with a KL-to-reference regularizer.
  • Step E (Online Update): Call LMDeploy’s in-place parameter update API; the served model is refreshed instantly.
  • Output: A refined answer and a slightly smarter model.
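
Here is a minimal sketch of Step D's SFT side: cross-entropy computed only at the positions that were masked in the current block. The vocabulary size, block length, and mask pattern are illustrative assumptions.

```python
# SFT loss restricted to the masked positions of the current block (Step D).
import torch
import torch.nn.functional as F

vocab_size, block_len = 50, 8
logits = torch.randn(block_len, vocab_size, requires_grad=True)   # model outputs
targets = torch.randint(0, vocab_size, (block_len,))              # reference tokens
is_masked = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.bool)

# Only positions masked in this denoising pass contribute to the loss.
loss = F.cross_entropy(logits[is_masked], targets[is_masked])
loss.backward()
print(float(loss))
```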

What breaks without each step:

  • No blockwise structure: You lose exact logits and KV caching; training slows and becomes biased.
  • No FlexAttention masks: Complex block masks run slowly or can’t run, tanking throughput.
  • No LMDeploy updates: IO waits dominate; the policy grows stale between steps.
  • No SFT: RL starts from shaky reasoning and can collapse.
  • No GRPO/DiPO: You need a heavy value model or accept noisy signals.

🍞 Hook: Think of a teacher who groups your attempts, scores them together, and nudges you toward the best ones. 🥬 The Concept (DiPO, unbiased GRPO for dLLMs): DiPO adapts GRPO to diffusion by computing accurate token ratios per block along the real decoding path. How it works:

  1. For each prompt, sample G candidates from the current model.
  2. Score them (reward): correctness, format, etc.
  3. Compute group-relative advantages (how much better than the group average).
  4. Update the policy with clipping and a KL to a reference, using unbiased per-token ratios. Why it matters: Without unbiased ratios, the model can learn the wrong lessons. 🍞 Anchor: It’s like trying several proofs, keeping the best, and not overreacting thanks to guardrails (clipping + KL).
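
A minimal sketch of such an update follows: a clipped surrogate on per-token importance ratios plus a KL penalty to a reference policy. This is a generic GRPO-style objective under simplified assumptions; the paper's exact DiPO loss may differ in details.

```python
# Generic GRPO-style surrogate with ratio clipping and a KL-to-reference term,
# applied to per-token log-probabilities. A simplified sketch, not DiPO verbatim.
import torch

def grpo_style_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps=0.2, kl_coef=0.01):
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # k3-style KL estimate to the reference policy (common in GRPO variants)
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - kl_coef * kl).mean()

# Tiny usage example with fabricated numbers:
logp_new = torch.randn(6, requires_grad=True)
loss = grpo_style_loss(logp_new, logp_new.detach() - 0.1, logp_new.detach() + 0.05,
                       advantages=torch.tensor([1., 1., 1., -1., -1., -1.]))
loss.backward()
print(float(loss))
```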

Secret sauce:

  • Training–inference consistency: Masks and block sequencing match exactly.
  • Unbiased logit computation: Blockwise passes give accurate token scores.
  • FlexAttention speed: Efficient, general masks keep GPUs busy.
  • Online integration: Inference engine and trainer talk live—minimal waiting, maximal learning.
  • Dynamic decoding: Confident tokens can be fixed early, improving throughput without harming accuracy.

04 Experiments & Results

🍞 Hook: If two runners finish the same distance, the one who runs faster and with better form is clearly ahead.

🥬 The Concept (What they tested): They measured both brains and brawn—reasoning accuracy on math benchmarks and system speed (training and rollout times). How it works:

  • Benchmarks: GSM8K, MATH500, AIME 2024, AIME 2025, OlympiadBench.
  • Models: DiRL-8B-Instruct vs SDAR-8B-Chat and TraDo-8B-Instruct (other blockwise dLLMs), plus AR baselines Qwen2.5-7B/32B.
  • Decoding: Both static and dynamic (fix top-1 tokens above 0.9 probability early; see the sketch after this list).
  • System speed: Time per RL step, and breakdown of rollout, IO, and training. Why it matters: Strong scores and fast training together mean practical, deployable reasoning. 🍞 Anchor: Like winning both the accuracy test and the speed drill.
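
As a tiny illustration of the dynamic-decoding rule above (commit any position whose top-1 probability clears the threshold, keep denoising the rest), here is a sketch with fabricated probabilities:

```python
# Dynamic decoding sketch: fix positions whose top-1 probability exceeds the
# threshold; the rest stay masked and get denoised again. Probabilities are toy.
import torch

threshold = 0.9
probs = torch.tensor([            # per-position distributions over 4 tokens
    [0.95, 0.02, 0.02, 0.01],     # confident  -> fix now
    [0.40, 0.30, 0.20, 0.10],     # uncertain  -> keep for the next pass
    [0.91, 0.05, 0.03, 0.01],     # confident  -> fix now
])
top_p, top_id = probs.max(dim=-1)
fixed = top_p >= threshold
print("fixed positions:", fixed.tolist())          # [True, False, True]
print("committed tokens:", top_id[fixed].tolist())
```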

Scoreboard with context:

  • DiRL-8B-Instruct reaches state-of-the-art among dLLMs and even surpasses larger AR models on AIME24/25 and OlympiadBench.
  • Average performance is best overall; on AIME tasks (famously tricky), DiRL-8B-Instruct leads by a clear margin—like getting an A when others hover around B.
  • Outputs are longer on average, which suggests deeper chain-of-thought and more robust math derivations.

System efficiency:

  • FlexAttention reduces training latency by about 6× compared to TraceRL-like setups.
  • Replacing repeated load/save with in-place LMDeploy updates removes IO bottlenecks; total RL step time improves by ~2.5×.
  • These speedups hold across model sizes; DiRL's 8B model trains faster per step than TraceRL's 1.7B model.

Surprising findings:

  • An 8B diffusion model with good post-training can beat a 32B AR model on tough math—so training quality can trump sheer size.
  • Dynamic decoding at a 0.9 threshold balances speed and accuracy well; results stay robust across thresholds.
  • Long 8k reasoning sequences are feasible and helpful in diffusion LLMs when the post-training path matches inference.

🍞 Anchor: The upgraded team not only solved more Olympiad-style puzzles but trained faster between matches, improving quickly week after week.

05 Discussion & Limitations

🍞 Hook: Even great shoes have limits—you still can’t fly.

🥬 The Concept (Limitations and trade-offs): DiRL is strong, but there are boundaries.

  • Limitations: Focused mostly on math; context length tops out at 8k (shorter than top AR models); the blockwise structure still constrains how decoding works; relies on high-quality data and careful rewards; dynamic decoding requires threshold tuning.
  • Required resources: Multi-GPU clusters (e.g., 8–128× H200 GPUs), FlexAttention-enabled PyTorch, LMDeploy server integration, curated SFT/RL datasets, and alignment tooling (reward models or verifiers).
  • When not to use: Ultra-long contexts (beyond 8k) today; tasks lacking measurable rewards; tiny-hardware settings without an inference server; cases where pure streaming AR latency is strictly needed without block batching.
  • Open questions: Scaling to larger dLLMs; extending to coding and agent tasks; pushing beyond 8k with long-context tricks; combining with other variance-reduction RL for stability; safety alignment and reward shaping for non-math domains. 🍞 Anchor: The car is fast and efficient on city roads (math), but we’ve yet to test it on deserts (agents) or highways (very long contexts).

06 Conclusion & Future Work

Three-sentence summary: DiRL makes diffusion language models practice the same way they play by matching training to inference and speeding everything up with FlexAttention and an always-on LMDeploy server. Its RL core, DiPO, is the first unbiased GRPO for dLLMs, providing accurate learning signals. Together with SFT+RL on high-quality math data, DiRL-8B-Instruct sets a new bar for math reasoning in dLLMs and even outpaces larger AR baselines on key tests.

Main achievement: Turning dLLM post-training into a consistent, efficient, and unbiased loop—unlocking strong reasoning without needing a massive model.

Future directions: Scale to larger models; go beyond 8k context; test-time scaling for longer chains; expand to coding and agentic tasks; integrate advanced packing and memory tricks for more speed.

Why remember this: It shows that the right post-training framework can flip the script—diffusion models aren't just feasible; with DiRL, they can excel in reasoning and train fast enough for real-world use.

Practical Applications

  • Build a math tutor that explains multi-step solutions reliably on modest GPUs.
  • Run fast RL alignment loops for reasoning tasks using LMDeploy's online updates.
  • Port AR RL pipelines (like GRPO) to diffusion LMs with unbiased token ratios via DiPO.
  • Speed up SFT on diffusion LMs by swapping in FlexAttention-compatible masks.
  • Deploy a continuous training service where the model gets smarter after every batch without restarts.
  • Apply dynamic decoding for throughput gains in production with minimal accuracy loss.
  • Create verified-reward workflows (e.g., math-verify) to supervise RL at scale.
  • Extend the framework to coding or agent tasks that need accurate step-by-step reasoning.
  • Explore long-context reasoning up to 8k tokens and plan upgrades for longer contexts.
  • Integrate DiRL as a baseline toolkit for future diffusion LLM post-training research.
#Diffusion Language Model#Blockwise dLLM#Post-Training#Supervised Fine-Tuning#Reinforcement Learning#GRPO#DiPO#FlexAttention#LMDeploy#KV Cache#Unbiased Logits#Dynamic Decoding#Math Reasoning#Online Updates#Training–Inference Consistency
Version: 1