LLaDA2.0: Scaling Up Diffusion Language Models to 100B
Key Summary
- Before this work, most big language models talked one word at a time (autoregressive), which made them slow and hard to parallelize.
- This paper turns those one-word-at-a-time models into diffusion models that can fill in many blanks at once, making them faster and often smarter on structured tasks.
- The key is a smooth three-phase training plan called Warmup–Stable–Decay that safely switches the model’s habits from left-to-right to fill-in-the-blanks.
- A document-level attention mask keeps the model from mixing up sentences from different documents when learning with packed data.
- Supervised fine-tuning and an adapted preference-learning method (DPO) teach the model to follow instructions and match human choices.
- A special confidence loss makes the model more sure about correct answers, which unlocks speed through parallel decoding.
- Top-k checkpoint merging blends the best versions of the model for stronger, more stable performance.
- The 16B (mini) and 100B (flash) models match or beat similarly sized autoregressive models on many tests, especially coding and tool/agent tasks.
- The 100B model reaches higher tokens-per-second than strong AR baselines when trained with the confidence-aware objective.
- These models are open-sourced, offering a practical recipe to scale diffusion language models to frontier sizes.
Why This Research Matters
Faster, cheaper language models let apps respond quickly, even when many users are online. Parallel decoding lowers latency for tools like code assistants, customer support, and educational tutors. Better performance on structured tasks (coding, math, tool use) helps professionals and students get more reliable, step-by-step help. The conversion recipe makes it practical to reuse strong AR models, reducing the cost and time of building new diffusion systems. Safer, more aligned post-training (SFT + DPO) helps the models follow instructions and match human preferences. The open-source release enables the community to build, test, and deploy improvements rapidly.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine writing a story by placing one word after another, carefully choosing each next word. Now imagine instead you see the whole page with some words blanked out and you fill them in all at once. Which way sounds faster when you’re good at it?
🥬 The Concept (Auto-Regressive Models, AR): AR models write from left to right, one token at a time. They’re great learners because the rule is simple: predict the next word.
- How it works: (1) Read all words so far, (2) Guess the next one, (3) Add it, (4) Repeat.
- Why it matters: Without this rule, training would be messy. But at inference, AR must go slow, one token after another, which is hard to parallelize.
🍞 Anchor: When you ask, “What is the capital of France?”, an AR model reads your words and says “Paris,” but it commits each word step-by-step.
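To make the left-to-right loop concrete, here is a tiny sketch with a toy stand-in model (the embedding, linear head, and mean-pooling are placeholders, not anything from the paper): every new token needs its own forward pass, which is exactly why AR decoding is hard to parallelize.

```python
import torch

torch.manual_seed(0)
vocab_size, hidden = 100, 32
# Toy stand-in for a trained AR model: embed the prefix, score the next token.
embed = torch.nn.Embedding(vocab_size, hidden)
head = torch.nn.Linear(hidden, vocab_size)

def next_token_logits(prefix):
    # A real AR model uses causal attention over the prefix; mean-pooling stands in here.
    return head(embed(prefix).mean(dim=0))

prefix = torch.tensor([5, 17, 42])   # tokens generated so far
for _ in range(4):                   # one forward pass per new token: inherently sequential
    next_id = next_token_logits(prefix).argmax()
    prefix = torch.cat([prefix, next_id.view(1)])
print(prefix.tolist())
```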
🍞 Hook: You know how you can solve a crossword by filling in many squares at once because you see clues across and down? That’s a different superpower.
🥬 The Concept (Discrete Masked Diffusion Language Models, MDLM): MDLMs learn to fix noisy or masked text by filling in blanks using context from both sides (bidirectional).
- How it works: (1) Randomly mask some tokens, (2) Look at all remaining context (left and right), (3) Predict the masked tokens, (4) Repeat across steps with different masks.
- Why it matters: MDLMs can generate in parallel because they can propose many tokens at once, and they use more complete context.
🍞 Anchor: Given “The capital of [MASK] is Paris,” MDLMs can infer “[MASK] = France” using both sides of the sentence.
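Here is a minimal sketch of one masked-denoising training step, again with toy stand-ins (an embedding plus a linear head) in place of the bidirectional Transformer; the 0.5 mask ratio is illustrative. The loss is computed only on the positions that were blanked out.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, hidden, seq_len = 100, 32, 12
mask_id = vocab_size                             # reserve an extra id for [MASK]
embed = torch.nn.Embedding(vocab_size + 1, hidden)
head = torch.nn.Linear(hidden, vocab_size)       # toy bidirectional denoiser stand-in

tokens = torch.randint(0, vocab_size, (seq_len,))
mask_ratio = 0.5                                 # illustrative noise level for this step
is_masked = torch.rand(seq_len) < mask_ratio
is_masked[0] = True                              # ensure at least one masked position
noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)

# Predict every position from the corrupted sequence (a real MDLM attends to both
# sides of each blank); the loss is scored only on the masked positions.
logits = head(embed(noisy))                      # (seq_len, vocab_size)
loss = F.cross_entropy(logits[is_masked], tokens[is_masked])
print(float(loss))
```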
The world before: Big AR models dominated because training them was stable and the recipes were mature. But they had two big limits: (1) Inference is sequential—slow and costly at scale. (2) Some tasks benefit from seeing both left and right context while generating—AR can’t do that naturally.
Failed attempts: People tried training diffusion models from scratch, but that was expensive and usually smaller (≤8B parameters). Others tried to switch AR models to diffusion quickly, but the jump was bumpy—knowledge got forgotten and training got unstable. Also, with packed datasets (many documents in one batch), diffusion models sometimes attended across document boundaries, mixing unrelated content.
🍞 Hook: Think of changing a violinist into a pianist overnight—it’s the same musician, but the habits are different. You need careful lessons, not just a new instrument.
🥬 The Concept (Continual Pre-training, CPT): CPT lets a model keep learning new skills while keeping its old knowledge.
- How it works: (1) Start from a strong AR model, (2) Gradually adjust tasks and attention style, (3) Train longer to master the new skill without forgetting.
- Why it matters: Without CPT, a quick switch to diffusion can erase prior language knowledge (catastrophic forgetting).
🍞 Anchor: A math student who already knows arithmetic slowly learns algebra; they don’t toss away arithmetic—they build on it.
🍞 Hook: You know how you warm up before a race, run steadily, then cool down? Models benefit from a similar rhythm when learning new tricks.
🥬 The Concept (Warmup–Stable–Decay, WSD): WSD is a three-phase schedule to smoothly change an AR model into a diffusion model.
- How it works: (1) Warmup: grow block size so the model denoises larger chunks; (2) Stable: train on full-sequence diffusion to master bidirectional denoising; (3) Decay: shrink block size to regain fast, blockwise decoding.
- Why it matters: A sudden switch causes instability; WSD keeps learning smooth and data-efficient.
🍞 Anchor: It’s like practicing piano hands-separately (small blocks), then hands-together (full), then polishing tricky bars (small blocks again for speed and control).
🍞 Hook: Imagine reading several short stories glued together. If you’re not careful, your eyes might skip between stories and mix them up.
🥬 The Concept (Document-level Attention Mask): A training-time rule that makes the model attend only within each document segment, not across them.
- How it works: (1) Pack multiple docs in one sequence, (2) Mask attention so tokens only look within their own doc, (3) Train safely without cross-doc leakage.
- Why it matters: Without it, the model forms fake connections across unrelated texts, which hurts learning and stability.
🍞 Anchor: It’s like using folders so your homework for math doesn’t get jammed into your literature essay.
The gap this paper fills: It shows a reliable and scalable way to convert powerful AR checkpoints into diffusion models up to 100B parameters, keeping knowledge while gaining parallel decoding. It also adds post-training alignment (SFT + DPO) and a confidence trick to speed up inference, proving competitiveness against strong AR baselines.
Real stakes: Faster, stronger language models mean cheaper servers, quicker answers, and better performance on structured tasks like coding, math, and tool use—things students, developers, and businesses care about every day.
02 Core Idea
🍞 Hook: Think of learning to ride a bike without training wheels. You don’t jump straight to a steep hill—you start small, go steady on the flat, then practice turns.
🥬 The Concept (Key Insight): The paper’s “aha!” is: don’t train diffusion models from scratch—convert an already-smart AR model into a diffusion model gradually, using a three-phase WSD plan plus smart masking and alignment so you keep knowledge and gain speed.
- How it works: (1) Start with AR, (2) Warmup with growing block sizes, (3) Train stably on full-sequence diffusion, (4) Decay to small blocks for fast inference, (5) Align with SFT and DPO, (6) Add confidence loss to unlock aggressive parallel decoding.
- Why it matters: This preserves the AR model’s brain while teaching it new diffusion tricks that enable parallelism and bidirectional reasoning.
🍞 Anchor: It’s like upgrading a skillful violinist to also play piano by adding careful stages, not replacing their music brain.
Three analogies:
- Sports: Warm up (small drills), main set (full run), cool down (targeted drills). That’s WSD for models.
- Puzzles: AR solves one piece after another; diffusion fills many blanks at once using the whole picture.
- Cooking: AR is adding ingredients in a strict order; diffusion is tasting the whole soup and adjusting many spices together.
Before vs After:
- Before: AR models are stable and strong but decode one token at a time; diffusion models trained from scratch are smaller and less mature.
- After: Convert strong AR models into diffusion models that keep their knowledge, generate in parallel, and shine in structured tasks (coding, tool use), even at 100B scale.
🍞 Hook: Ever sorted Lego bricks into small trays so you can build faster? Block diffusion does that for text.
🥬 The Concept (Block Diffusion): Instead of predicting one token, the model predicts a chunk (block) at a time: diffusion-style denoising inside each block, with blocks generated in left-to-right order.
- How it works: (1) Split sequence into blocks, (2) Denoise masked tokens within a block using context, (3) Move block by block with reuse of earlier context, (4) Adjust block size over training.
- Why it matters: It balances coherence (inside block) and speed (fewer steps, cache reuse), especially important for long sequences.
🍞 Anchor: Like building a Lego castle room-by-room; each room is polished internally, and rooms are added in order for a full castle.
Why it works (intuition):
- Gentle shift: WSD avoids shocking the model. Growing block size teaches it to use more context, then full diffusion cements the skill, and small blocks reintroduce efficiency for deployment.
- Clean context: Document-level masks remove noisy, cross-document attention that can derail bidirectional training.
- Full utilization: Complementary masking ensures every token gets trained each step, speeding convergence.
- Confident outputs: The confidence loss reduces uncertainty on already-correct tokens, enabling the decoder to accept more tokens per step.
- Human alignment: SFT and DPO tune the model to follow instructions and prefer human-liked answers, using a diffusion-friendly formulation of DPO.
Building blocks (in simple pieces):
- Start Strong: Use an AR base to inherit knowledge.
- Learn Blocks: Warm up with small-to-large blocks.
- Master Diffusion: Train on whole sequences bidirectionally.
- Deploy Fast: Shrink back to small blocks for runtime speed.
- Stay Organized: Use document-level masks.
- Use Data Well: Complementary masking and mask ratio bandwidth.
- Be Confident: Confidence-aware training for faster decoding.
- Be Helpful: SFT for instructions, DPO for human preferences.
- Be Stable: Top-k checkpoint merging to average the best model states.
🍞 Anchor: After training, the 16B “mini” and 100B “flash” models can answer coding and math questions quickly, often outperforming similar AR models while using parallel decoding.
03 Methodology
High-level pipeline: Input (AR checkpoint + text data) → Stage 1: Continual Pre-training with WSD → Stage 2: Block Diffusion Pre-training & masking tricks → Stage 3: Post-training (SFT, Confidence-Aware Parallel, DPO) → Output (LLaDA2.0-mini/flash).
Step 0. Start from a strong AR model
- What happens: Load a well-trained AR checkpoint (e.g., Ling-mini-2.0-base or Ling-flash-2.0-base).
- Why it exists: We inherit language knowledge and stability rather than starting from scratch.
- Example: The base already knows grammar, facts, and coding syntax.
🍞 Hook: Like stretching before a sprint—start small, then go big, then return to a relaxed pace.
🥬 The Concept (WSD: Warmup–Stable–Decay): A 3-phase block-size schedule to cross the AR→diffusion gap.
- How it works:
- Warmup (BDLM with small→huge blocks): Start with block size 1 (AR-like), then 4, 32, 64, … up to the full sequence (e.g., 4096) so the diffusion objective eventually sees the whole context.
- Stable (MDLM full-sequence): Train long on full bidirectional diffusion to master denoising across the entire sequence.
- Decay (BDLM back to small blocks): Reduce block size stepwise (4096→2048→…→32) to regain blockwise decoding efficiency and KV-cache reuse.
- What breaks without it: A sudden switch causes optimization instability and forgetting; skipping decay hurts runtime speed.
- Example: A 4096-token input becomes a single block at Stable; later we shrink to 32-token blocks for fast decoding.
🍞 Anchor: It’s like mastering a song hands-together, then practicing in smaller sections to perform swiftly.
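To make the three phases concrete, here is a minimal scheduling sketch. The phase boundaries (10% / 80% of training) and the intermediate ladder values are illustrative assumptions; only the 1→4→32→64→…→4096 warmup direction and the 4096→2048→…→32 decay endpoints follow the description above.

```python
def block_size_at(step: int, total_steps: int, seq_len: int = 4096) -> int:
    """Illustrative WSD block-size schedule (phase boundaries are assumptions).

    Warmup: grow blocks 1 -> 4 -> 32 -> 64 -> ... -> seq_len.
    Stable: full-sequence diffusion (block size == seq_len).
    Decay:  shrink back down to 32 for fast blockwise decoding.
    """
    warmup_end, stable_end = int(0.1 * total_steps), int(0.8 * total_steps)
    warmup_ladder = [1, 4, 32, 64, 256, 1024, seq_len]      # intermediate values assumed
    decay_ladder = [seq_len, 2048, 512, 128, 32]            # intermediate values assumed

    if step < warmup_end:                                   # Warmup phase
        idx = step * len(warmup_ladder) // max(warmup_end, 1)
        return warmup_ladder[min(idx, len(warmup_ladder) - 1)]
    if step < stable_end:                                   # Stable phase: full sequence
        return seq_len
    idx = (step - stable_end) * len(decay_ladder) // max(total_steps - stable_end, 1)
    return decay_ladder[min(idx, len(decay_ladder) - 1)]    # Decay phase

# Sample the schedule at a few points of a 100k-step run.
print([block_size_at(s, 100_000) for s in (0, 5_000, 50_000, 85_000, 99_999)])
```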
Document-level attention mask (applies throughout training)
- What happens: During packed training, enforce attention only within each document’s boundaries, and structure block-wise attention patterns for noisy/clean halves.
- Why it exists: Prevents spurious cross-document links that confuse the model in bidirectional denoising.
- Example: If a batch packs News A + Wiki B, tokens from A cannot attend to B.
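A minimal sketch of the document-level mask for a packed sequence follows; it shows only the within-document constraint and omits the extra block-wise structure over noisy/clean halves mentioned above.

```python
import torch

# Toy packed sequence: a document id per token (two docs packed into one sequence).
doc_ids = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1])

# Document-level mask: token i may attend to token j only if they share a doc id.
doc_mask = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)   # (seq, seq) boolean

print(doc_mask.int())
# Rows 0-2 only see columns 0-2 (doc A); rows 3-7 only see columns 3-7 (doc B).
```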
Top-k checkpoint merging
- What happens: After pre-training, pick the top k checkpoints by validation score and average their parameters.
- Why it exists: Smooths the parameter landscape and ensembles good states for better generalization.
- Example: Averaging the best 5 checkpoints yields a steadier final model than just the last one.
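A minimal sketch of the merging step, assuming the top-k checkpoints have already been selected by validation score; a plain uniform average of parameters is shown.

```python
import torch

def merge_topk_checkpoints(state_dicts):
    """Uniformly average the parameters of the selected top-k checkpoints."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Toy example with three tiny "checkpoints" sharing one parameter tensor.
ckpts = [{"w": torch.randn(2, 2)} for _ in range(3)]
print(merge_topk_checkpoints(ckpts)["w"])
```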
Stage 2. Block Diffusion fine-tuning details
- Padding & block alignment: Round sequence lengths up to the nearest multiple of block size so attention masks align perfectly.
- Mask ratio bandwidth: Clip the diffusion mask rate to a helpful interval (avoid almost-no-mask and almost-all-mask regions that give weak or noisy signals).
- Complementary masking: For each sample, create two masks that are opposites so that across the pair every token is observed exactly once in clean form.
- Why these exist: They stabilize optimization and increase data efficiency—every token contributes signal.
- Example: If positions 1–10 are masked in sample A, they’re unmasked in sample B, and vice versa.
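Here is a small sketch of the bandwidth and complementary-masking tricks together; the interval [0.3, 0.7] is an illustrative assumption, chosen symmetric so the complement's mask rate stays inside the same band.

```python
import torch

def complementary_masks(seq_len, low=0.3, high=0.7):
    """Sample a mask rate inside a bandwidth [low, high] (values are illustrative),
    then return a mask and its exact complement so each token is masked in one
    view and left clean in the other."""
    rate = torch.empty(()).uniform_(low, high)
    mask_a = torch.rand(seq_len) < rate
    mask_b = ~mask_a
    return mask_a, mask_b

a, b = complementary_masks(10)
assert torch.all(a ^ b)   # every position is covered exactly once across the pair
print(a.int().tolist())
print(b.int().tolist())
```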
Stage 3. Post-training for assistants
🍞 Hook: A coach first shows you how to do it, then asks which attempts you like best, then helps you become more decisive under pressure.
🥬 The Concept (Supervised Fine-Tuning, SFT): Teach the model to follow instructions reliably with labeled examples.
- How it works: (1) Provide prompt–response pairs, (2) Apply block diffusion loss conditioned on the prompt, (3) Use complementary masking and bandwidth.
- Why it matters: Without SFT, the model knows language but doesn’t behave like a helpful assistant.
- Example: Prompt: “Write a function to reverse a list.” Response: code. The model learns to produce that response under diffusion training.
🍞 Anchor: Like practicing correct free-throws with guidance to build good habits.
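Below is a sketch of the shape of this objective on a single toy sample (stand-in embedding and head; complementary masking and bandwidth are omitted here): the prompt stays clean as conditioning, and only masked response tokens contribute to the loss.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, hidden = 100, 32
mask_id = vocab_size
embed = torch.nn.Embedding(vocab_size + 1, hidden)
head = torch.nn.Linear(hidden, vocab_size)       # toy denoiser stand-in

prompt = torch.randint(0, vocab_size, (6,))      # stays clean, serves as conditioning
response = torch.randint(0, vocab_size, (10,))   # the diffusion loss applies here only

is_masked = torch.rand(len(response)) < 0.5      # corrupt part of the response
is_masked[0] = True                              # keep at least one position to score
noisy_response = torch.where(is_masked, torch.full_like(response, mask_id), response)

inputs = torch.cat([prompt, noisy_response])     # clean prompt + corrupted response
logits = head(embed(inputs))[len(prompt):]       # predictions over response positions
sft_loss = F.cross_entropy(logits[is_masked], response[is_masked])
print(float(sft_loss))
```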
🍞 Hook: Suppose two answers exist; which feels better to you? Teach the model to prefer that one next time.
🥬 The Concept (Direct Preference Optimization, DPO for diffusion): Adapt preference learning to diffusion by comparing ELBO-based scores of preferred vs. dispreferred responses against a reference model.
- How it works: (1) Freeze a reference (post-SFT) model, (2) For each win/lose pair, compute a diffusion-compatible score (an ELBO estimate) under both the policy and the reference, (3) Train the policy to widen the margin in favor of the preferred response.
- Why it matters: Standard log-likelihood isn’t directly available in diffusion; this formulation lets us still learn human preferences.
- Example: If “Answer A” is preferred over “Answer B,” the model increases its relative ELBO-based score for A.
🍞 Anchor: It’s like using a scorecard to learn to choose the better essay every time.
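A sketch of the preference loss under the assumption that ELBO estimates stand in for log-likelihoods; the `dpo_loss` helper, the toy scores, and the beta value are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(elbo_policy_win, elbo_ref_win, elbo_policy_lose, elbo_ref_lose, beta=0.1):
    """DPO-style loss with ELBO-based scores in place of log-likelihoods.
    The elbo_* values would come from Monte-Carlo estimates of the diffusion ELBO
    under the policy and the frozen post-SFT reference; beta is a tunable weight."""
    margin = (elbo_policy_win - elbo_ref_win) - (elbo_policy_lose - elbo_ref_lose)
    return -F.logsigmoid(beta * margin)

# Toy numbers: the policy already scores the preferred answer a bit higher.
loss = dpo_loss(torch.tensor(-5.0), torch.tensor(-6.0),
                torch.tensor(-7.0), torch.tensor(-6.5), beta=0.1)
print(float(loss))   # shrinks as the preferred answer's relative ELBO margin grows
```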
🍞 Hook: Imagine you already got the answer right. Now, speak it with confidence so others can trust you quickly.
🥬 The Concept (Confidence-Aware Parallel, CAP): Add an auxiliary confidence loss that reduces output uncertainty for tokens the model already gets right.
- How it works: (1) Compute the standard SFT loss, (2) On correctly predicted tokens, add a term that lowers entropy to sharpen the output distribution, (3) Combine the two as a weighted sum.
- Why it matters: Sharper distributions let the threshold-based decoder accept more tokens per step, boosting speed.
🍞 Anchor: A student who’s sure of correct steps can write more steps between pauses, finishing faster without losing accuracy.
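Here is a sketch of how such an auxiliary term can sit alongside the SFT loss; the entropy-penalty form and the weight `lam` are assumptions about one reasonable instantiation, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def cap_loss(logits, targets, loss_mask, lam=0.1):
    """Confidence-aware objective sketch: masked-token cross-entropy plus an
    entropy penalty on tokens the model already predicts correctly."""
    ce = F.cross_entropy(logits[loss_mask], targets[loss_mask])

    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)   # per-token entropy
    correct = (logits.argmax(dim=-1) == targets) & loss_mask       # already-right tokens
    conf_term = entropy[correct].mean() if correct.any() else torch.zeros(())

    return ce + lam * conf_term   # pushes correct predictions toward lower entropy

# Toy batch: 8 masked positions over a 50-token vocabulary; make the first 4 "correct".
torch.manual_seed(0)
targets = torch.randint(0, 50, (8,))
logits = torch.randn(8, 50)
logits[torch.arange(4), targets[:4]] += 5.0
print(float(cap_loss(logits, targets, torch.ones(8, dtype=torch.bool))))
```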
Inference (blockwise threshold decoding)
- What happens: Generate one block per diffusion step. Within a step, accept tokens whose probabilities exceed a confidence threshold; if too few pass, accept top-k by score to ensure progress.
- Why it exists: Balances speed (accept more tokens early) and quality (fallback when uncertain).
- Example: With threshold 0.95 and block size 32, most confident tokens are locked in; the rest wait for the next refinement.
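A sketch of the blockwise threshold decoder described above, using a random stand-in denoiser; the 0.95 threshold mirrors the hyperparameter study below, while `min_accept` and the toy vocabulary are assumptions.

```python
import torch

def decode_block(denoise, block, mask_id, threshold=0.95, min_accept=1):
    """One block of threshold decoding (a sketch): repeatedly predict all masked
    positions, commit those whose top probability clears the threshold, and fall
    back to the `min_accept` most confident ones so every pass makes progress."""
    while (block == mask_id).any():
        probs = denoise(block).softmax(dim=-1)   # (block_len, vocab)
        conf, pred = probs.max(dim=-1)
        masked = block == mask_id
        accept = masked & (conf >= threshold)
        if accept.sum() < min_accept:            # fallback: most confident masked tokens
            idx = torch.where(masked)[0]
            best = idx[conf[idx].topk(min(min_accept, len(idx))).indices]
            accept = torch.zeros_like(masked)
            accept[best] = True
        block = torch.where(accept, pred, block)
    return block

# Toy run: a random "denoiser" over a 32-token block and a 50-token vocabulary.
torch.manual_seed(0)
vocab_size, mask_id = 50, 50                     # mask id sits outside the vocabulary
denoise = lambda blk: torch.randn(len(blk), vocab_size)
print(decode_block(denoise, torch.full((32,), mask_id), mask_id).tolist())
```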
Secret sauce (why this recipe is clever):
- WSD tames the AR→diffusion jump; document masking prevents cross-doc confusion; complementary masking squeezes more learning from each sample; CAP unlocks true parallel decoding speed; DPO aligns the model to human preferences in a diffusion-friendly way. The combination is what scales diffusion models to 100B while staying practical.
04 Experiments & Results
The test: The authors evaluated two instruction-tuned models—LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B)—across 47 benchmarks spanning knowledge (e.g., MMLU), reasoning (e.g., HellaSwag), coding (e.g., HumanEval, MBPP), math (e.g., AIME 2025), and agents/alignment (e.g., BFCL, IFEval). They also analyzed decoding hyperparameters and long-context performance (RULER), and measured real inference throughput (tokens per second).
The competition: Strong open-source AR models were used as baselines, notably Ling-mini-2.0 and Qwen3-30B-A3B-Instruct-2507. The question: can diffusion models match or exceed similarly sized AR peers at scale?
Scoreboard with context:
- LLaDA2.0-mini averaged 64.34—very close to its AR peer Ling-mini-2.0 (65.77). That’s like scoring an A- when the class leader gets a low A.
- LLaDA2.0-flash averaged 73.18, essentially tying Qwen3-30B-A3B-Instruct-2507 (73.60)—a neck-and-neck result across many tasks.
- Where diffusion shines: coding and agent/tool use. LLaDA2.0-flash scored 94.51 on HumanEval and strong results on MBPP and MultiPL-E; it also led in BFCL agent benchmarks. That’s like acing the hardest logic puzzles when others get high but not top marks.
Speed matters: Using the CAP objective, LLaDA2.0-flash-CAP reached about 535 tokens/second on a multi-benchmark average—up to 2.1× faster than comparable AR baselines and much faster than the same model without CAP (383 TPS). This validates the idea that sharper confidence enables more aggressive parallel decoding.
Hyperparameter takeaways (mini model proxy study):
- Denoising threshold: 0.95 delivered the best quality; lower thresholds sped things up but cost too much accuracy.
- Block size: 32 offered the best speed–quality balance; 16 sometimes scored slightly higher but was slower; 64 underperformed on both fronts compared to 32.
Long-context (RULER):
- Up to 32k tokens (native window), both models remained strong; the 100B model stayed above ~93 throughout. Extending to 64k with YaRN scaling worked but with expected accuracy drop—useful when extra context is needed and some loss is acceptable.
Surprising findings:
- Diffusion at 100B not only holds its own vs. AR on general tasks but regularly leads on structured generation (coding) and agentic tool use—areas that benefit from coherent, parallel blockwise planning.
- CAP did not sacrifice benchmark quality while meaningfully increasing throughput—rarely do we get both speed and quality improvements together.
Bottom line: The evidence supports that diffusion models, when carefully converted and aligned, can compete at frontier scale, with special strengths where structure, planning, and tool use dominate.
05 Discussion & Limitations
Limitations:
- Iterative nature: Diffusion still takes multiple refinement steps; while CAP boosts throughput, ultra-low-latency, token-by-token streaming remains an AR edge in some cases.
- Training complexity: The WSD schedule, document-level masks, and complementary masking make the recipe more complex to implement and tune than vanilla AR pretraining.
- Dependence on AR bases: The approach assumes access to high-quality AR checkpoints; fully from-scratch diffusion at 100B remains expensive and less explored.
- Long-context beyond 32k: Performance drops when extrapolating to 64k via scaling techniques; further research on native longer windows is needed.
- RL at scale: Interactions between diffusion training, DPO, and more advanced RL (e.g., tool-augmented reasoning) at 100B+ are promising but not fully mapped.
Required resources:
- Significant compute for 100B-scale training (Megatron-style parallelism across DP/PP/TP/CP/EP).
- Carefully prepared instruction and preference datasets, especially for code/maths and agent tasks.
- An inference stack that supports block diffusion, caching, and threshold decoding (e.g., dInfer/SGLang adaptations).
When not to use:
- Ultra-streaming tasks where token-by-token immediacy is critical and minimal latency trumps total throughput.
- Extremely small models or tiny datasets where the WSD complexity may not pay off.
- Settings without stable AR bases to initialize from.
Open questions:
- Can we push block sizes or adaptive block schedules at inference to further raise tokens-per-forward without harming coherence?
- How best to blend chain-of-thought (CoT) or tool-augmented reasoning with diffusion’s parallel denoising for stepwise reasoning transparency?
- What are the principled scaling laws for diffusion at 100B–1T, and how do mask schedules and complementary masking interact at that scale?
- Can we achieve native 64k–128k windows with robust accuracy under diffusion pretraining, not just inference-time scaling?
- How does DPO for diffusion compare with full RL pipelines for complex agent behavior under safety constraints?
06 Conclusion & Future Work
Three-sentence summary: This paper presents a practical path to turn powerful AR language models into large diffusion models using a careful Warmup–Stable–Decay plan, document-level masking, and post-training alignment. The resulting 16B and 100B models keep AR knowledge, gain parallel decoding, and match or beat AR peers on many tasks—especially coding and agentic tool use—while achieving faster inference with confidence-aware training. The approach is open-sourced and ready for real deployments.
Main achievement: A scalable, stable conversion recipe (not training from scratch) that brings diffusion language models to the 100B frontier with strong quality and real speed advantages through parallel decoding.
Future directions:
- Push diffusion scaling beyond 100B and explore longer native context windows.
- Combine diffusion with richer RL/thinking frameworks to enhance multi-step reasoning and tool use.
- Further optimize inference (dynamic thresholds/blocks) and safety alignment for agentic applications.
Why remember this: It reframes the “AR vs. diffusion” debate—showing you can inherit AR strengths and still unlock diffusion’s parallelism and bidirectional context. It offers a concrete, reproducible recipe that moves diffusion LLMs from lab curiosity to practical, high-performance systems at frontier scale.
Practical Applications
- Build faster code assistants that can suggest multiple lines or functions in parallel with high accuracy.
- Deploy customer support chatbots that handle many requests simultaneously with lower latency.
- Create math tutors that reason bidirectionally about problem statements and solutions.
- Power tool-using agents (e.g., function calling) that plan and fill structured outputs more coherently.
- Speed up long-form content generation (drafting documents or reports) through blockwise parallel decoding.
- Improve text-to-SQL systems with stronger structured generation for complex database queries.
- Use the WSD conversion pipeline to upgrade existing AR checkpoints into diffusion models without retraining from scratch.
- Leverage complementary masking to increase sample efficiency in your diffusion fine-tuning jobs.
- Adopt confidence-aware training to push tokens-per-second higher in real deployments.
- Apply document-level attention masks to stabilize training on packed, mixed-domain datasets.