
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

Intermediate
Yonggan Fu, Lexington Whalen, Zhifan Ye et al. · 12/16/2025
arXiv · PDF

Key Summary

  • Autoregressive (AR) models write one word at a time, which is accurate but slow, especially when your computer or GPU can’t keep many tasks in memory at once.
  • Diffusion language models (dLMs) can write many words in parallel, but training them well used to be hard and often broke the accuracy of strong AR models.
  • This paper shows how to convert a good AR model into a fast dLM by using block-wise attention with clean context and a smarter, position-dependent masking rule.
  • Block-wise attention keeps the model largely causal across blocks (so KV cache works) while letting tokens inside each block see each other to denoise in parallel.
  • Position-dependent masking mimics the real decoding behavior (which tends to go left-to-right), so the model practices the same way it will play at test time.
  • With these ideas plus careful choices of block size and training time, Efficient-DLM reaches AR-level (or better) accuracy while generating 2.7×–4.5× faster.
  • Removing the old 'token shift' trick simplifies training and actually improves accuracy in this setup.
  • A sweet-spot block size preserves AR knowledge while enabling strong parallel denoising; too small loses context, too large causes harmful weight drift.
  • Longer training steadily improves likelihood estimates, which allows more aggressive parallel decoding without hurting quality.
  • Efficient-DLM models also produce stronger text embeddings than similar-size AR models, thanks to in-block bidirectional reasoning.

Why This Research Matters

This work shows we don’t have to pick between smart (accurate) and speedy (fast) language models—we can have both by converting proven AR models into efficient dLMs. That means chatbots can respond faster on phones and small GPUs without losing quality. Customer support and code assistants can serve more users at once, especially when memory is tight and batch sizes are small. Developers can also tune a single model for either top speed or top accuracy by just moving a confidence threshold. The approach improves text embeddings too, which strengthens search, retrieval, and recommendation systems. In short, Efficient-DLM can cut costs, boost user experience, and open doors for on-device AI that feels responsive and reliable.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine writing a story on a chalkboard, adding one word at a time while your classmates wait. You write neatly, but the line behind you grows long because you can only write one word per turn.

🥬 The Concept (Autoregressive Models): AR language models write text left-to-right, predicting the next word using only the words before it. How it works (like a recipe):

  1. Read the words already written.
  2. Guess the next word.
  3. Add it and repeat.

Why it matters: This is accurate but slow because it must move one word at a time; your GPU can’t speed it up much when batch sizes are small.

🍞 Anchor: When you ask, “What is the capital of France?”, an AR model focuses on 'capital' and 'France' and then writes 'Paris' word by word.
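To make the recipe concrete, here is a minimal sketch of one-token-at-a-time decoding. It assumes a hypothetical `model(tokens)` that returns next-token logits for every position; it is not the paper's code, just the standard AR pattern the paper starts from.

```python
import torch

def ar_generate(model, prompt_ids, max_new_tokens=32, eos_id=None):
    """Greedy autoregressive decoding: one new token per forward pass."""
    tokens = prompt_ids.clone()                        # (seq_len,) token ids
    for _ in range(max_new_tokens):
        logits = model(tokens.unsqueeze(0))            # (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()               # guess the next word
        tokens = torch.cat([tokens, next_id.view(1)])  # add it and repeat
        if eos_id is not None and next_id.item() == eos_id:
            break
    return tokens
```

Every new token requires a full forward pass, which is why small-batch AR decoding leaves most of the GPU idle.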

🍞 Hook: Now imagine sculpting—a statue appears as you chip away marble in several places at once.

🥬 The Concept (Diffusion Language Models): dLMs generate text by starting from a noisy, partially hidden version and cleaning it up step by step, often in parallel. How it works:

  1. Hide or corrupt some tokens.
  2. Predict and fill them in (denoise).
  3. Repeat for multiple steps until the text becomes clear.

Why it matters: This lets many tokens be fixed together, boosting speed; but training dLMs well (and scaling them) used to be hard and expensive.

🍞 Anchor: Instead of writing one word at a time, the model refines chunks of a sentence together, like clearing fog from several windows at once.
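As a rough illustration of the mask-and-refill loop (not the paper's exact sampler), the sketch below assumes a hypothetical `denoiser(tokens)` that returns logits for every position and a placeholder `MASK_ID` for hidden tokens.

```python
import torch

MASK_ID = 0  # placeholder id for the [MASK] token (assumption)

def diffusion_refine(denoiser, tokens, num_steps=4):
    """Iteratively refill masked positions, several tokens per step."""
    tokens = tokens.clone()                            # (seq_len,) with MASK_ID holes
    for step in range(num_steps):
        masked = tokens == MASK_ID
        if not masked.any():
            break
        logits = denoiser(tokens.unsqueeze(0))[0]      # (seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # reveal roughly an equal share of the remaining holes each step
        k = -(-int(masked.sum()) // (num_steps - step))
        keep = torch.topk(conf.masked_fill(~masked, -1.0), k).indices
        tokens[keep] = pred[keep]                      # fill the most confident slots
    return tokens
```

The key difference from AR decoding: one forward pass can finalize many tokens at once.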

🍞 Hook: You know how you keep a sticky note of what you already did so you don’t redo work?

🥬 The Concept (Key-Value Caching): KV cache is a memory that stores past attention computations so the model doesn’t recompute them. How it works:

  1. Save key/value summaries from earlier tokens.
  2. Reuse them when processing new tokens.
  3. Skip repeated work to speed up inference.

Why it matters: Without KV cache, even fast models waste time rethinking old context; many prior dLMs struggled to use this cache effectively.

🍞 Anchor: If your math homework saved each solved step, you could quickly build the next step instead of starting from scratch each time.
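Here is a minimal, single-head sketch of the save-and-reuse idea behind KV caching; the cache layout and function names are illustrative assumptions, not any specific library's API.

```python
import torch

def attend_with_cache(q, k_new, v_new, cache):
    """Single-head attention that appends new keys/values to a cache and reuses them.

    q, k_new, v_new: (batch, new_len, dim). No masking, for illustration only.
    """
    if cache["k"] is not None:                         # reuse earlier work
        k = torch.cat([cache["k"], k_new], dim=1)      # (batch, total_len, dim)
        v = torch.cat([cache["v"], v_new], dim=1)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v                      # save for the next step
    scores = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v                  # (batch, new_len, dim)

cache = {"k": None, "v": None}                         # one cache per attention layer
```

Because earlier keys and values never change, they only need to be computed once; this is exactly what fully bidirectional dLMs break and what block-wise causality restores.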

The world before: AR models set the accuracy standard but were slow due to strictly one-token-at-a-time decoding. dLMs promised parallel generation, but real-world speedups often fell short because (1) fully bidirectional attention didn’t work nicely with KV caching; (2) training from scratch was costly since dLMs must learn many possible token orders, not just left-to-right; and (3) the way dLMs were trained (uniformly hiding tokens) didn’t match how they actually decoded (often resolving earlier tokens first, then later ones).

The problem: How can we start from a strong AR model and convert it into a dLM that’s truly faster while keeping its hard-earned accuracy?

What people tried (and why it fell short):

  • Fully bidirectional training (like Dream): Gave the model a global view but drifted far from AR’s causal habits, hurting accuracy and making KV caching awkward.
  • Uniform masking: Easy to implement, but it didn’t match the test-time left-to-right tendency; the model practiced differently from how it played.
  • Token shift (predicting “next token” at masked spots): Helped in early attempts, but here it turns out unnecessary and even harmful.

The gap this paper fills: It designs a conversion recipe that preserves what the AR model already knows, lets KV cache work, matches training to test-time behavior, and delivers fast, scalable parallel decoding.

Real stakes: Faster generation matters for chatbots on phones, customer support with small batches, coding assistants that must reply instantly, and servers that need to handle more users without buying tons of extra GPUs. This paper shows how to get both speed and smarts.

02Core Idea

🍞 Hook: Picture an assembly line where each table handles a small part of the job, but the tables only start once the earlier tables are fully done and tidy.

🥬 The Concept (Aha!): Convert a pretrained AR model into a dLM by training it with block-wise attention that sees clean, finished context from earlier blocks and a smarter, position-dependent masking rule that mirrors real decoding. How it works:

  1. Split text into blocks (small chunks).
  2. Within a block, allow bidirectional attention to clean up masked tokens in parallel.
  3. Across blocks, keep causality and feed each block clean context from the finished blocks so KV cache works.
  4. Mask tokens with a position bias (more masking near block ends late in denoising) to mimic the left-to-right resolution seen at test time.

Why it matters: It preserves the AR model’s weight distribution (so accuracy stays high), enables true parallel decoding, and aligns training with inference for better quality.

🍞 Anchor: Like solving a puzzle row by row: you complete one row (clean context), then the next row is easier because the finished row guides it. Inside the row, pieces can be placed in any order, together.

Three analogies for the same idea:

  1. School play rehearsal: Earlier scenes are fully polished (clean context), so the next scene’s actors can perform together (in-block parallel) using cues from the finished scene.
  2. Kitchen stations: The appetizer station must be finished and plated (clean context) before the main-course station starts; inside the main station, multiple cooks work in parallel.
  3. Group texting: The previous messages are finalized; now, within a new short thread, participants can respond to each other freely (bidirectional within block), but they all rely on the already settled earlier thread.

Before vs. after:

  • Before: Fully bidirectional dLM training led to larger weight drift, weaker KV caching, and a mismatch between training and inference.
  • After: Block-wise with clean context keeps the AR “muscle memory,” supports KV caching, and the position-dependent masking matches test-time behavior, producing both speed and accuracy.

Why it works (intuition):

  • Preserve what’s precious: Keeping across-block causality keeps the AR model’s learned structure stable, avoiding big, harmful weight changes.
  • Practice how you play: Biasing masks toward later tokens late in denoising imitates how the model actually finalizes tokens, improving confidence estimates and parallel decoding quality.
  • Efficient memory reuse: Clean context and block causality fit KV caching naturally, so less recomputation and higher throughput.

Building blocks: 🍞 Hook: You know how upgrading a bike with an electric kit makes the old ride faster without buying a new bike? 🥬 The Concept (AR-to-dLM Conversion): Start from a strong AR model and continuously pretrain it into a dLM with new attention and masking. How it works: Adjust attention to block-wise, feed clean context, change the masking to be position-aware, and train for tens to hundreds of billions of tokens. Why it matters: Keeps accuracy while unlocking parallel decoding. 🍞 Anchor: It’s your same bike frame (weights), but the new kit (training recipe) makes it zip.

🍞 Hook: Think of studying a bit more each day to adapt to a new subject. 🥬 The Concept (Continuous Pretraining): Keep training the AR model with the new dLM loss and patterns instead of restarting from scratch. How it works: Use modest learning rates, many tokens, and the new objectives to smoothly adapt. Why it matters: Saves time and preserves skills. 🍞 Anchor: You don’t relearn the alphabet to write essays—you just keep practicing smarter.
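A minimal sketch of what one continuous-pretraining step could look like under this recipe, assuming a hypothetical `model(noisy_ids)` that returns per-position logits. The loss is computed only at masked positions and, matching the no-token-shift choice described in this summary, the target at a masked slot is that slot's own token.

```python
import torch
import torch.nn.functional as F

def conversion_train_step(model, optimizer, input_ids, mask_probs, mask_id):
    """One continuous-pretraining step: mask some tokens, predict them in place.

    input_ids: (batch, seq_len) int tensor; mask_probs: scalar or per-position
    masking probabilities (e.g., the position-dependent rule); mask_id is a placeholder.
    """
    is_masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_probs
    noisy = input_ids.masked_fill(is_masked, mask_id)
    logits = model(noisy)                              # (batch, seq_len, vocab_size)
    # no token shift: the target at a masked slot is that slot's own token
    loss = F.cross_entropy(logits[is_masked], input_ids[is_masked])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

In practice the optimizer would use the small learning rate and cosine decay described in the Methodology section so the weights adapt without drifting far from the AR initialization.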

🍞 Hook: Imagine reading a book chapter-by-chapter; you don’t start chapter 2 until chapter 1 is clean and done. 🥬 The Concept (Block-Wise Attention with Clean Context): Inside a block, tokens can see each other; across blocks, causality is preserved and earlier blocks are clean. How it works: Concatenate noisy current-block tokens with finished clean-context tokens, then apply a special attention mask. Why it matters: Enables KV caching, reduces corruption, and preserves AR causality. 🍞 Anchor: You finish chapter 1, then chapter 2 uses it as a solid reference.
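The attention pattern this implies can be written as a boolean mask: bidirectional inside a block, strictly causal across blocks. The function below is an illustrative reconstruction, not the paper's implementation.

```python
import torch

def blockwise_attention_mask(seq_len, block_size):
    """True = the query position may attend to the key position.

    Within a block: fully bidirectional. Across blocks: strictly causal,
    so a token sees earlier (clean) blocks but never later ones.
    """
    block_id = torch.arange(seq_len) // block_size            # (seq_len,)
    same_block = block_id.unsqueeze(1) == block_id.unsqueeze(0)
    earlier_block = block_id.unsqueeze(1) > block_id.unsqueeze(0)
    return same_block | earlier_block                          # (seq_len, seq_len) bool

mask = blockwise_attention_mask(seq_len=8, block_size=4)
# rows 0-3 (block 0) attend only within block 0; rows 4-7 attend to blocks 0 and 1
```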

🍞 Hook: You know how teachers often grade the end of an essay more strictly to see if the argument holds? 🥬 The Concept (Position-Dependent Token Masking): Later tokens in a block are more likely to be masked toward the end of denoising. How it works: A position-weighted probability picks which tokens to hide, stronger near the block end when the noise is low. Why it matters: Matches the left-to-right tendency at test time, improving quality during aggressive parallel decoding. 🍞 Anchor: Practice the tough last sentences more so you ace them when it counts.
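A toy version of such a position-weighted masking rule is sketched below; the exact schedule (a simple linear tilt toward the block end as the noise level t shrinks) is an assumption for illustration, not the paper's formula.

```python
import torch

def position_dependent_mask(block_len, t):
    """Sample a mask for one block at noise level t in (0, 1].

    The overall masking rate follows t; within the block, the weight tilts
    toward later positions as t shrinks (late denoising), mimicking the
    left-to-right resolution seen at test time. Schedule is illustrative only.
    """
    pos = torch.arange(block_len, dtype=torch.float) / max(block_len - 1, 1)
    tilt = 1.0 + (1.0 - t) * pos                 # flat when t=1, rising as t -> 0
    probs = (t * tilt).clamp(max=1.0)            # per-position masking probability
    return torch.rand(block_len) < probs         # bool: True = mask this token

mask = position_dependent_mask(block_len=32, t=0.2)
# early tokens masked with prob ~0.2, block-end tokens with prob ~0.36
```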

Taken together, these parts form Efficient-DLM: fast, accurate, and flexible.

03Methodology

At a high level: Input text → Split into blocks → Add position-aware masks (noise) per step → Denoise each block using block-wise attention with clean context (and KV cache) → Use confidence to decode multiple tokens in parallel → Repeat until all masks are resolved → Output text.
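Before the detailed recipe, here is that pipeline as a decoding skeleton. `denoise_block` stands in for one denoising pass over the current block given the cached clean context; it, `mask_id`, and the cache layout are hypothetical names used only for illustration.

```python
import torch

def generate_blockwise(denoise_block, prompt_ids, num_blocks, block_size,
                       mask_id, steps_per_block=8):
    """Decode block by block: finished blocks become frozen clean context."""
    clean = prompt_ids.clone()                          # finished tokens so far
    kv_cache = {}                                       # reused across blocks
    for _ in range(num_blocks):
        block = torch.full((block_size,), mask_id, dtype=clean.dtype)
        for _ in range(steps_per_block):                # multistep denoising
            block = denoise_block(block, clean, kv_cache)
            if not (block == mask_id).any():            # all tokens finalized
                break
        clean = torch.cat([clean, block])               # block joins the clean context
    return clean
```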

Step-by-step recipe:

  1. Prepare the starting model and data.
  • What: Begin with a strong pretrained AR model (e.g., Qwen variants) and a large, mixed dataset.
  • Why: AR models already know language well; starting here preserves accuracy and saves training cost.
  • Example: Initialize from Qwen2.5-1.5B or Qwen3-4B/8B.
  2. Split the input into blocks.
  • What: Cut each input sequence into contiguous blocks (e.g., 16, 32, or 64 tokens each).
  • Why: Blocks enable in-block parallel denoising while keeping across-block causality for KV cache and AR familiarity.
  • Example: A 512-token sequence with block size 32 becomes 16 blocks.
  3. Apply position-dependent masking per step.
  • What: For a noise level t, choose which tokens to mask with higher probability toward the block’s end (especially when t is small), using a half-life-style positional prior.
  • Why: At test time, decoding confidence tends to rise left-to-right; masking later tokens more during late denoising mimics that behavior, closing the training–test gap.
  • Example: In a 32-token block late in denoising, tokens 25–32 are more likely to be masked than tokens 1–8.
  4. Construct inputs with clean context.
  • What: Feed each noisy block together with the fully finished tokens from earlier blocks (clean context) and apply a special attention mask that permits: (a) bidirectional attention within the current block; (b) attention from the current block to the clean context; (c) standard attention inside the clean context.
  • Why: This mirrors test-time decoding (earlier blocks are already done), lowers corruption, and enables KV caching.
  • Example: For block 3, we concatenate [noisy block 3] + [clean blocks 1–2] and use the block-wise mask.
  5. Denoise with block-wise attention and KV cache.
  • What: The model predicts the masked tokens for the current block using information from both directions within the block and from clean earlier blocks. KV cache stores previous computations, boosting speed.
  • Why: In-block bidirectionality helps fill gaps in parallel; clean context and KV cache avoid recomputation and protect accuracy.
  • Example: Tokens 5, 12, and 27 inside the block are predicted together in a single step, guided by blocks 1–2.
  6. Confidence-based parallel decoding (see the sketch after this list).
  • What: After each denoising forward pass, compute confidence for each masked token and finalize those above a threshold; this increases tokens-per-forward (TPF). Lower thresholds yield more TPF (more parallelism) but risk more errors.
  • Why: It adaptively trades accuracy for speed, giving one-for-all flexibility.
  • Example: With a moderate threshold, the model decodes 3–5 tokens per step; with a high threshold, perhaps only 1–2.
  7. Iterate across noise levels and blocks.
  • What: Repeat masking/denoising across timesteps until the masks are gone, then move to the next block.
  • Why: Multistep denoising steadily refines predictions while keeping earlier context stable and reusable via KV cache.
  • Example: Block 1 finalizes first, then block 2 leverages it as clean context, and so on.
  8. Training dynamics and hyperparameters.
  • What: Use continuous pretraining for tens to hundreds of billions of tokens with a small initial learning rate (e.g., 1e-5 with cosine decay) to adapt gently without losing AR knowledge.
  • Why: Too large a learning rate causes harmful weight drift; too small fails to adapt to the new attention patterns.
  • Example: 25B–50B tokens recover most accuracy; 200B–500B enables more aggressive parallel decoding with stable quality.
  9. Block-size selection.
  • What: Choose a training block size that balances context richness with limited corruption; evaluate with possibly larger block sizes to allow more parallel opportunities.
  • Why: Too small gives weak context; too large causes large weight changes and accuracy drops; a sweet spot exists.
  • Example: For 1.5B models, block size ~16 worked best in tests; for 4B–8B, ~64 worked well.
  10. Remove token shift.
  • What: Predict the masked token itself instead of the “next” token at masked positions.
  • Why: This simplification improved accuracy in this framework; predicting two things (the masked token and the next one) was harder and unnecessary.
  • Example: Rows (f)→(g) in the ablations show consistent gains when token shift is removed.
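Referring back to step 6, here is a sketch of confidence-based parallel decoding for a single forward pass over the current block; the interface and the always-decode-one fallback are assumptions, not the paper's exact sampler.

```python
import torch

def confidence_decode_step(logits, block, mask_id, threshold=0.9):
    """Finalize every masked token whose top prediction clears the threshold.

    logits: (block_len, vocab_size) from one forward pass over the current block.
    Lower thresholds finalize more tokens per forward (higher TPF, more risk).
    """
    conf, pred = logits.softmax(dim=-1).max(dim=-1)    # per-position confidence
    masked = block == mask_id
    accept = masked & (conf >= threshold)
    if masked.any() and not accept.any():              # always decode at least one
        best = conf.masked_fill(~masked, -1.0).argmax()
        accept[best] = True
    out = block.clone()
    out[accept] = pred[accept]
    return out, int(accept.sum())                      # updated block, tokens decoded
```

Raising the threshold pushes behavior back toward careful one-at-a-time decoding; lowering it finalizes more tokens per forward pass, which is the speed dial the paper exploits.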

What breaks without each step:

  • No blocks: You lose KV cache friendliness and drift from AR, hurting accuracy and speed.
  • No clean context: Training no longer matches test-time decoding; accuracy drops notably, even with more tokens.
  • Uniform masking only: The model practices a different pattern than it uses at test time; parallel decoding quality suffers, especially at low NFEs (few denoising forward passes) and high TPF.
  • Oversized blocks or LR: Weights drift too far from AR, losing the original skills.

The secret sauce:

  • Preserve AR strengths with block-wise causality and clean context.
  • Practice like you play via position-dependent masking.
  • Use confidence thresholds to dial speed vs. quality on the fly. Together, these deliver Efficient-DLM’s high throughput with AR-level accuracy.

04Experiments & Results

The tests: The authors evaluated accuracy on 12 benchmarks covering coding (HumanEval, HumanEval Plus, MBPP, MBPP Plus), math (GSM8K, Minerva Math), factual knowledge (MMLU), and commonsense reasoning (ARC-C, ARC-E, HellaSwag, PIQA, Winogrande). They also measured throughput (tokens per second) on an NVIDIA H100 with batch size 1, and varied tokens-per-forward (TPF) via confidence thresholds to plot accuracy–speed trade-offs.

The competition: Strong AR baselines (Qwen3 family, Qwen2.5, Llama3.2, SmolLM2) and strong dLMs (LLaDA, Dream). Some dLMs were also tested with external acceleration (Fast-dLLM) for a fairer speed comparison.

The scoreboard (with context):

  • Efficient-DLM 8B vs Dream 7B: +5.35% average accuracy with about 4.50× higher throughput. That’s like winning by a full letter grade while also finishing the test four times faster.
  • Efficient-DLM 8B vs Qwen3 8B (AR): Comparable accuracy–throughput frontiers, but Efficient-DLM can trade more speed for a small accuracy loss when needed.
  • Efficient-DLM 4B vs Qwen3 4B (AR): +7.79% accuracy with 2.68× throughput. That’s like scoring several more points on a quiz while answering questions in less than half the time.
  • Efficient-DLM 1.5B vs Qwen2.5 1.5B: Maintains strong accuracy with higher throughput options via TPF>1.

Surprising findings:

  • Token shift isn’t needed and can hurt this conversion; predicting the masked token directly is simpler and better here.
  • Training with clean context beats training longer on corrupted context; practice must match game time.
  • There’s a sweet-spot block size: too small (not enough context) and too large (excessive corruption) both reduce accuracy.
  • Larger evaluation block sizes help when you want ultra-parallel decoding (more tokens finalized per step) at similar NFEs.
  • dLMs shine at text embeddings: Efficient-DLM 1.5B/4B beat same-size AR models by +7.71%/+9.91% on MTEB subsets, thanks to in-block bidirectional reasoning.

Training dynamics:

  • With around 10B–50B tokens of continuous pretraining, converted dLMs recover much of the AR model’s accuracy.
  • Longer training steadily improves likelihood estimates; this makes confidence scores more trustworthy, enabling lower NFEs (more parallelism) without big accuracy hits.
  • Learning rate matters: ~1e-5 was a sweet spot; too high drifts and hurts, too low adapts too slowly.

Visual trade-offs:

  • Accuracy–throughput curves show Efficient-DLM 8B achieving better frontiers than Dream/LLaDA (even with Fast-dLLM) and competitive with AR Qwen3 models; its advantage is strongest at small batch sizes (common in memory-limited deployments).

Bottom line: Efficient-DLM turns AR accuracy into dLM speed, consistently beating prior dLMs on both quality and tokens-per-second, and rivaling or exceeding AR models when you factor in real throughput.

05Discussion & Limitations

Limitations:

  • Block-size tuning is important; the best size balances enough context with not too much corruption. Wrong choices can hurt accuracy.
  • Benefits shrink at very large batch sizes; AR models can catch up or surpass in throughput when batches are huge.
  • Requires strong AR initializations and notable training tokens (e.g., 300B–500B for largest models) plus substantial compute (e.g., 128 H100s), which may be out of reach for small labs.
  • Sensitivity to hyperparameters like learning rate; too aggressive can cause harmful weight drift, too gentle can stall adaptation.
  • Confidence threshold tuning is needed in deployment to pick the right speed–quality balance.

Required resources:

  • Pretrained AR models (e.g., Qwen2.5/Qwen3).
  • Large training corpora (tens to hundreds of billions of tokens).
  • Significant GPU time (multi-node H100-class hardware for large models).

When not to use:

  • Extremely large-batch, high-throughput server settings where AR models’ mature KV pipelines may already be optimal.
  • Ultra-long-context tasks if block windowing isn’t adapted carefully.
  • Scenarios where strict left-to-right reasoning with no in-block bidirectionality is mandated (rare, but possible in specialized pipelines).

Open questions:

  • Can we automate block-size selection and position-dependent masking (e.g., learned schedules) per domain/task?
  • How well does the recipe generalize to very different architectures (Mixture-of-Experts, linear attention, Mamba hybrids)?
  • Can we further improve large-batch efficiency for dLMs (e.g., adaptive block sizes, better KV layouts, hybrid AR–diffusion decoding)?
  • How do multilingual and domain-specific corpora affect the sweet spots and masking priors?
  • Can parameter-efficient tuning (e.g., higher-rank LoRA) close more of the gap to full-model training for resource-limited teams?

06Conclusion & Future Work

Three-sentence summary: Efficient-DLM shows how to convert strong AR models into faster diffusion language models by using block-wise attention with clean context and position-dependent token masking that mirrors test-time behavior. This preserves AR knowledge, enables KV caching, and unlocks reliable parallel decoding, yielding better accuracy–throughput trade-offs than prior dLMs and even surpassing AR baselines in many settings. Longer training and careful block-size/learning-rate choices further improve confidence estimation and allow more aggressive speedups with minimal accuracy loss.

Main achievement: A practical, scalable recipe that bridges AR accuracy and dLM speed—demonstrating state-of-the-art accuracy–throughput frontiers and strong text-embedding performance.

Future directions: Automate masking schedules and block-size policies, combine with linear/attention variants for large-batch gains, refine KV strategies for even better caching, and extend the approach across languages, modalities, and specialized reasoning tasks.

Why remember this: It turns a long-standing trade-off—accuracy vs. speed—into a tunable dial. With Efficient-DLM, you can keep the brains of your favorite AR model and give it the speed of parallel diffusion, choosing exactly how fast to go without letting quality crash.

Practical Applications

  • Speed up existing AR chatbots by converting them to Efficient-DLM for faster replies on small GPUs or CPUs.
  • Deploy on mobile or edge devices where memory is tight, using KV cache–friendly block-wise decoding.
  • Tune a single model’s speed–quality trade-off per request by adjusting the confidence threshold (one-for-all deployment).
  • Accelerate coding assistants in IDEs to provide near-instant code completions and fixes.
  • Serve more simultaneous users on the same hardware by increasing tokens-per-forward during peak times.
  • Generate stronger text embeddings for retrieval, reranking, and semantic search thanks to in-block bidirectionality.
  • Support math and reasoning tutors that remain accurate while answering faster.
  • Convert domain-specialized AR models (e.g., legal, medical, finance) into faster dLMs without losing subject expertise.
  • Run batch-limited workflows (agent planning, chain-of-thought sampling) more quickly by leveraging parallel denoising.
  • Prototype low-latency voice assistants by pairing Efficient-DLM with streaming front-ends for snappy responses.
#diffusion language models#autoregressive models#AR-to-dLM conversion#block-wise attention#clean context#position-dependent token masking#KV caching#confidence-based sampling#tokens per forward (TPF)#throughput#Qwen3#Dream#LLaDA#continuous pretraining#block size selection