Residual Context Diffusion Language Models
Key Summary
- Diffusion language models (dLLMs) generate several tokens at once but usually throw away lots of helpful clues each step; RCD keeps and reuses those clues.
- RCD turns low-confidence token guesses into soft "residual" context and feeds it back so the model improves on the next step instead of starting from scratch.
- An entropy weight acts like a trust dial: when the model is uncertain, it leans more on these residual clues; when it's sure, it leans less.
- A two-stage training plan makes this reuse possible without huge memory costs: a small frozen helper model creates clean residual signals, and the main model learns to use them.
- With almost no extra compute, RCD boosts accuracy by 5–10 points on many benchmarks and nearly doubles accuracy on tough AIME math tests.
- RCD also needs 4–5 times fewer denoising steps to reach the same accuracy, so it can be both smarter and faster.
- It works in both block-wise (SDAR) and bidirectional (LLaDA) diffusion models, and scales from smaller to larger models.
- Compared to other latent methods, RCD stays stable because it builds residuals from the model's own embedding space instead of raw hidden states.
- Even with limited training data (~1B tokens across settings), standard dLLMs can be upgraded to RCD and gain strong improvements.
Why This Research Matters
RCD makes AI helpers smarter without demanding a lot more computing power, which means better results on your laptop or phone. By reusing what models already computed, it turns waste into wisdom, which is especially helpful for hard tasks like step-by-step math, code, and long documents. Faster, fewer-step decoding can reduce costs for companies while cutting wait times for users. The approach is stable at larger scales, so it can be deployed in real systems, not just labs. Its ideas also transfer to multimodal AI (text+images) and long-context applications like legal or scientific documents. Finally, RCD is data-efficient enough to retrofit existing diffusion models, so improvements can arrive quickly and broadly.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're doing a big jigsaw puzzle with friends. Everyone tries different pieces. Even if a friend picks a piece that doesn't fit yet, their try still tells you something useful, like which colors or shapes are nearby. Throwing those tries away would slow the whole puzzle.
The Concept (Masked Language Modeling):
- What it is: A training game where the model must fill in missing (masked) words in a sentence.
- How it works:
- Hide some words with a special [MASK] token.
- Ask the model to guess the hidden words.
- Reward it for correct guesses, repeat on many sentences.
- Why it matters: Without this skill, models struggle to use context clues and can't learn how words fit together.
Anchor: "The cat sat on the [MASK]." The model learns that "mat" fits better than "moon."
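To see this fill-in-the-blank skill in action, here is a tiny, hedged demo using Hugging Face's fill-mask pipeline; the model choice (bert-base-uncased) is just one convenient masked model, not the paper's:

```python
# Minimal fill-mask demo; requires `pip install transformers torch`.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill("The cat sat on the [MASK]."):
    # Each guess is a dict with the predicted word and its probability.
    print(f"{guess['token_str']:>8}  p={guess['score']:.3f}")
```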
Hook: You know how cleaning a foggy window takes a few wipes? Each wipe removes a bit of blur until the view becomes clear.
The Concept (Diffusion Large Language Models, dLLMs):
- What it is: A way for models to generate text by unmasking many tokens over several "denoising" steps, getting clearer each time.
- How it works:
- Start with everything masked.
- At each step, guess tokens and commit only the most confident ones.
- Keep repeating until all tokens are decided.
- Why it matters: Unlike one-at-a-time writing, dLLMs can update many spots at once, which can be much faster on modern hardware.
Anchor: Writing a paragraph by first sketching the whole outline lightly, then darkening the correct words step by step.
Hook: Think of a teacher who only keeps your best quiz answers and shreds the rest, even though your crossed-out work shows your thinking.
The Concept (Remasking):
- What it is: In each dLLM step, only the most confident guesses are kept; the rest are reset to [MASK].
- How it works:
- Score every position by confidence.
- Keep the top few; remask the rest.
- Move to the next step.
- Why it matters: This wastes the computation already spent on the low-confidence guesses, which still contain useful context.
Anchor: If you guessed "seas-" before writing "season," throwing away "seas-" each time slows you down.
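A minimal sketch of this commit-or-remask rule, using made-up probabilities instead of a real model (all numbers and names here are illustrative):

```python
import numpy as np

def remask_step(probs, keep_k):
    """One vanilla dLLM step: commit the keep_k most confident positions,
    remask the rest (whose distributions are simply thrown away)."""
    confidence = probs.max(axis=-1)            # top-1 probability per position
    keep = np.argsort(-confidence)[:keep_k]    # most confident positions win
    committed = [None] * len(probs)            # None means "remasked"
    for i in keep:
        committed[i] = int(probs[i].argmax())
    return committed

# Toy example: 3 masked positions over a 4-word vocabulary.
toy = np.array([[0.70, 0.10, 0.10, 0.10],      # confident -> committed
                [0.40, 0.35, 0.15, 0.10],      # unsure    -> remasked
                [0.30, 0.30, 0.20, 0.20]])     # unsure    -> remasked
print(remask_step(toy, keep_k=1))              # [0, None, None]
```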
Hook: When you're choosing a snack, you weigh many options in your head before picking one. Those "almost picks" tell you what you're craving.
The Concept (Soft Tokens):
- What it is: A blended vector that represents a weighted mix of many possible tokens instead of a single hard choice.
- How it works:
- Take the probabilities over the vocabulary.
- Multiply and sum with the token embeddings.
- Get one "soft" vector that carries information about all likely options.
- Why it matters: It keeps fine-grained clues alive, instead of collapsing everything to just one guess.
Anchor: If your top 3 snack choices are apples (50%), bananas (30%), grapes (20%), a soft token captures all three tastes at once.
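In code, a soft token is one matrix-vector product: the probability vector times the embedding table. A minimal sketch with toy numbers:

```python
import numpy as np

E = np.random.default_rng(0).normal(size=(3, 4))  # toy table: 3 tokens, 4-dim embeddings
probs = np.array([0.5, 0.3, 0.2])                 # apples / bananas / grapes

soft_token = probs @ E    # weighted blend of all three embeddings
print(soft_token.shape)   # (4,): same shape as an ordinary token embedding
```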
Hook: Suppose you could recycle every near-miss puzzle piece into a hint for the next move. You'd finish faster!
The Concept (The Problem Before This Paper):
- What it is: dLLMs' remasking throws away low-confidence tokens and loses helpful hints.
- How it works:
- Compute predictions at each step.
- Keep only the top confident ones.
- Discard the rest, even though they were expensive to compute.
- Why it matters: This creates an accuracy gap vs. autoregressive models and wastes compute that could guide later steps.
Anchor: A class that only grades your final answer but ignores your scratch work will miss patterns showing you're on the right track.
Hook: Imagine a compost bin for ideas: leftovers become fertilizer for the next crop of thoughts.
The Concept (What This Paper Adds: Residual Context Diffusion, RCD):
- What it is: A way to recycle discarded token information as "residual" context and feed it into the next step.
- How it works:
- Convert the probability distribution of the uncommitted tokens into a soft token (a residual vector).
- Decide how strongly to mix it in using uncertainty (entropy).
- Add this residual to masked positions for the next step.
- Why it matters: You keep and use signals that were previously thrown away, so you get better accuracy with little extra cost.
Anchor: Instead of erasing your draft notes, you summarize them into a helpful margin hint for your next revision.
Hook: If you're unsure, you probably look closer; if you're sure, you move on.
The Concept (Entropy as a Trust Dial):
- What it is: A measure of uncertainty used to decide how much residual context to inject.
- How it works:
- Compute entropy from the token probabilities.
- Normalize it so it's between 0 and 1.
- Use it as the weight for mixing residuals into inputs.
- Why it matters: High-uncertainty spots get more help; low-uncertainty spots don't get overpowered.
Anchor: If a weather forecast is 50/50, you pack an umbrella; if it's 99% sunny, you don't.
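The trust dial is a few lines of arithmetic: compute entropy, then divide by the maximum possible entropy (log of the vocabulary size) so the weight lands in [0, 1]. A minimal sketch:

```python
import numpy as np

def entropy_weight(probs, eps=1e-12):
    """Normalized entropy in [0, 1]: 1 = totally unsure, 0 = totally sure."""
    h = -(probs * np.log(probs + eps)).sum()
    return float(h / np.log(len(probs)))   # divide by the maximum entropy

print(entropy_weight(np.array([0.5, 0.5])))     # 1.0   -> lean hard on residuals
print(entropy_weight(np.array([0.99, 0.01])))   # ~0.08 -> mostly ignore them
```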
Real Stakes: This matters for everyday AI, including faster assistants, better math and coding help, and smoother long-document understanding, because recycling makes the model both smarter and more efficient without needing huge extra compute.
02 Core Idea
Hook: Think of a sports team reviewing missed shots to plan a better play. Those "almost" moments are gold!
The Concept (The Aha!):
- What it is: Don't throw away low-confidence guesses; turn them into soft residual context and reuse them next step, weighted by how uncertain they were.
- How it works:
- For every remasked token, turn its whole probability distribution into a soft vector (the residual).
- Compute an entropy weight that says how much to trust this residual.
- Mix the residual into the next step's masked inputs.
- Why it matters: Every step gets smarter by standing on the shoulders of the last step's partial knowledge.
Anchor: Like a chef using yesterday's veggie scraps to make a tasty broth that boosts today's soup.
Multiple Analogies:
- Puzzle analogy: Don't toss wrong pieces; keep notes about shapes and colors to narrow tomorrow's choices.
- Classroom analogy: Use draft work to improve the final answer instead of starting fresh each time.
- GPS analogy: If several routes are close in time, keep them in mind; uncertainty means you should keep more options on the map.
Hook: Remember texting one letter at a time vs. drafting the whole message and refining it?
The Concept (Before vs. After RCD):
- Before: dLLMs remask low-confidence tokens, losing rich probability info and needing more steps to converge.
- After: dLLMs recycle that info as residual context, guiding future steps and cutting both errors and steps.
- Why it matters: Same hardware, smarter use of the same compute, better accuracy–latency trade-offs.
Anchor: It's like having spell-check suggestions carry over to the next revision instead of disappearing after each pass.
Hook: Why does mixing a soft vector help at all?
The Concept (Why It Works: Intuition):
- What it is: Soft residuals capture the whole "shape" of possibilities, not just the top pick.
- How it works:
- A distribution over words holds semantics about near-misses (like "season" vs. "reason").
- Turning it into a soft embedding carries those semantics forward.
- Entropy-weighted mixing prevents swamping certain tokens while helping uncertain ones.
- Why it matters: The model keeps global context consistent and refines tricky spots faster.
Anchor: When guessing a password, keeping a shortlist of top candidates beats guessing one and forgetting the rest.
Hook: Let's break the idea into bite-size pieces.
The Concept (Building Blocks):
- Soft Residual Vector: Convert probabilities into a blended embedding.
- Entropy Weight: Use uncertainty to decide mixing strength.
- Residual Injection: Add residuals only to masked positions so decided tokens stay stable.
- Two-Stage Training: A frozen helper model generates steady residual targets; the main model learns to use them.
- Temperature Alignment at Inference: Gently calibrate confidence so runtime behavior matches training.
- Why it matters: Each piece ensures help goes where it's needed, stays numerically stable, and avoids feedback-loop explosions.
Anchor: It's like a study plan: summaries (soft residuals), focus more on confusing topics (entropy), don't rewrite mastered parts (only masked), use a coach's notes (helper model), and keep the test conditions similar to practice (temperature alignment).
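Putting the first three building blocks into symbols (our notation for this summary, not necessarily the paper's): with vocabulary V, embedding table E, predicted distribution p at a remasked position, and mask embedding e_mask,

```latex
r = \sum_{v \in V} p(v)\, E_v, \qquad
w = \frac{-\sum_{v \in V} p(v) \log p(v)}{\log |V|}, \qquad
x_{\text{next}} = (1 - w)\, e_{\text{mask}} + w\, r .
```

Here r is the soft residual, w is the entropy trust dial, and x_next is the injected input for a still-masked position.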
03 Methodology
Hook: Imagine baking in rounds: you mix, taste, adjust, and repeat; each round uses what you learned last time.
The Concept (High-Level Recipe):
- What it is: RCD is a decoding recipe that turns leftover guesses into helpful seasoning for the next round.
- How it works (Input → Steps → Output):
- Input: a masked sequence.
- Step A: predict probabilities at each masked position.
- Step B: keep confident tokens; form residuals from the rest.
- Step C: weight residuals by entropy and mix them into next-step inputs.
- Output: a more accurate, denoised sequence after a few rounds.
- Why it matters: Without recycling, each round forgets what it just learned.
Anchor: Like tasting soup, adding what's missing, and tasting again: faster to delicious.
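Before the step-by-step details, here is a compact, self-contained sketch of one full RCD round. It is only a toy: predict() is a random stand-in for a real dLLM, and the sizes and threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 8, 16                        # toy vocabulary size and embedding width
E = rng.normal(size=(V, D))         # token embedding table
e_mask = rng.normal(size=D)         # embedding of the [MASK] token

def predict(inputs):
    """Stand-in for the dLLM: one probability distribution per position."""
    logits = inputs @ E.T
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def rcd_round(inputs, masked, threshold=0.6):
    probs = predict(inputs)
    for i in np.where(masked)[0]:
        p = probs[i]
        if p.max() >= threshold:                  # confident: commit this token
            inputs[i], masked[i] = E[p.argmax()], False
        else:                                     # unsure: recycle, don't discard
            w = -(p * np.log(p + 1e-12)).sum() / np.log(V)  # entropy trust dial
            residual = p @ E                                # soft residual vector
            inputs[i] = (1 - w) * e_mask + w * residual     # inject for next round
    return inputs, masked

# Start fully masked; small noise stands in for positional information.
inputs = np.tile(e_mask, (5, 1)) + rng.normal(scale=0.5, size=(5, D))
masked = np.ones(5, dtype=bool)
for _ in range(4):
    inputs, masked = rcd_round(inputs, masked)
print("positions still masked after 4 rounds:", int(masked.sum()))
```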
Step-by-Step Details:
- Vanilla dLLM Step (Baseline)
- What happens:
- Predict a distribution over the vocabulary for each masked spot.
- Keep the top-confident positions (commit them as real tokens).
- Remask the rest for the next round.
- Why this step exists: It gives a coarse-to-fine path to the final text.
- What breaks without it: You'd have no structure for gradual unmasking.
- Example: At position 7, probs: "season" 0.45, "reason" 0.35, "lesson" 0.20. If not confident enough, it's remasked.
- Residual Formation (Soft Tokens)
- What happens:
- For each remasked position, convert its full probability distribution into a single blended vector (residual).
- Why this step exists: It preserves nuanced clues across many almost-correct options.
- What breaks without it: You'd lose helpful semantic hints and need more steps.
- Example: Mix 0.45×E(season) + 0.35×E(reason) + 0.20×E(lesson) into one soft vector.
- Entropy Weighting (Trust Dial)
- What happens:
- Compute normalized entropy (0 to 1) from the distribution to decide how strongly to inject the residual.
- Why this step exists: Directly controls help where the model is unsure, and avoids over-helping where it's certain.
- What breaks without it: Either underuse residuals (no gains) or flood the model (instability, worse results).
- Example: If entropy is high (uncertain), weight might be 0.8; if low (confident), weight might be 0.1.
- Residual Injection (Masked-Only Mix)
- What happens:
- Only for still-masked positions next step: new input embedding = (1 - weight) × mask_embedding + weight × residual.
- Why this step exists: Keeps decided tokens stable and aims help exactly where needed.
- What breaks without it: Already-solved words could get perturbed; context would wobble.
- Example: Position 7 remains masked, so it gets the soft hint mixed in; position 3 was committed, so it stays untouched.
- Two-Stage Training (Decoupled Learning)
- What happens:
- Stage 1: Train a smaller Reference Model to produce stable probability targets and entropy weights.
- Stage 2: Freeze the Reference Model; train the Target Model to use those residuals built with its own embeddings.
- Why this step exists: Prevents a huge, memory-heavy feedback loop from unrolling many steps.
- What breaks without it: Training becomes unstable or too expensive (like backpropagating through time across many steps).
- Example: A 1.7B ref model guides a 4B or 8B target; the target learns to leverage residuals without chasing a moving target (see the training sketch after this list).
- Inference Tricks (Warm Start + Temperature-Scaled Entropy)
- What happens:
- Warm Start: Use the Reference Model once to seed the first residuals.
- Temperature Scaling: Calibrate the Target Model's probability sharpness so runtime entropy weights match training habits.
- Why this step exists: Bridges the gap between teacher-guided training and self-guided inference.
- What breaks without it: Early steps wobble; entropy weights miscalibrate, hurting accuracy.
- Example: If the Target is overconfident, a higher residual temperature softens probabilities so entropy-based mixing helps more (see the inference sketch after the walkthrough below).
- Secret Sauce
- What makes it clever:
- Recycles compute you already did (green and cheap).
- Sends help only to unsure spots (precision-guided).
- Uses the modelâs own embedding space (numerically stable at larger scales).
- Trains without unrolling long loops (practical memory footprint).
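To make the two-stage decoupling concrete, here is the minimal PyTorch-flavored training sketch promised above. The single-layer "models", names, and shapes are all ours and purely illustrative; the point is the gradient flow: the frozen reference supplies probabilities and entropy weights, while only the target model and its embeddings receive gradients.

```python
import torch
import torch.nn.functional as F

V, D = 100, 32
ref_model = torch.nn.Linear(D, V)         # toy stand-in for the small reference dLLM
target_model = torch.nn.Linear(D, V)      # toy stand-in for the large target dLLM
target_embed = torch.nn.Embedding(V, D)   # the target's own embedding table
e_mask = torch.nn.Parameter(torch.randn(D))
params = list(target_model.parameters()) + list(target_embed.parameters()) + [e_mask]
opt = torch.optim.Adam(params, lr=1e-3)

for p in ref_model.parameters():          # Stage 2: the reference model is frozen
    p.requires_grad_(False)

def train_step(x_ids, labels, masked):
    inputs = target_embed(x_ids)                           # (T, D)
    with torch.no_grad():                                  # stable, teacher-given signal
        probs = ref_model(inputs).softmax(-1)              # (T, V) reference distributions
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        w = (ent / torch.log(torch.tensor(float(V))))[:, None]
    residual = probs @ target_embed.weight                 # built in the target's own space
    mix = (1 - w) * e_mask + w * residual
    inputs = torch.where(masked[:, None], mix, inputs)     # inject at masked positions only
    loss = F.cross_entropy(target_model(inputs)[masked], labels[masked])
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x = torch.randint(0, V, (10,)); y = torch.randint(0, V, (10,))
m = torch.tensor([True] * 5 + [False] * 5)
print(train_step(x, y, m))                                 # one tiny training step
```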
Concrete Toy Walkthrough:
- Step 1: Position 5 probs: "north" 0.4, "worth" 0.3, "northwest" 0.3 → entropy high → strong residual saved.
- Step 2: Inject residual at pos 5; nearby words clarify direction; probs shift to "north" 0.6, "worth" 0.2, "northwest" 0.2 → entropy drops.
- Step 3: Confidence crosses threshold; pos 5 gets committed to "north."
- Outcome: Fewer steps, steadier reasoning, same or better throughput.
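And here is the inference-side sketch referenced above. The warm start is just one extra call (run the reference model once on the fully masked input to seed the first residuals); the temperature trick is shown in code, with invented logits:

```python
import numpy as np

def temp_entropy_weight(logits, temperature):
    """Entropy trust dial computed from temperature-softened probabilities.

    A higher temperature flattens an overconfident model's distribution,
    so runtime entropy weights behave more like they did during training.
    """
    z = np.exp(logits / temperature - (logits / temperature).max())
    p = z / z.sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return float(h / np.log(len(p)))

logits = np.array([6.0, 2.0, 1.0, 0.5])             # an overconfident prediction
print(round(temp_entropy_weight(logits, 1.0), 2))   # ~0.11: residual barely used
print(round(temp_entropy_weight(logits, 2.0), 2))   # ~0.55: residual helps more
```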
04 Experiments & Results
Hook: Picture a timed math contest. You want to answer more questions correctly without taking longer.
The Concept (What They Tested):
- What it is: Check if reusing residual context boosts accuracy and speed across different diffusion models and math benchmarks.
- How it works:
- Train RCD on two families: SDAR (block-wise) and LLaDA (bidirectional).
- Measure accuracy on GSM8K, MATH500, MinervaMath, and hard AIME24/25.
- Also measure throughput (tokens/second) and how many tokens are decoded per step.
- Why it matters: We need proof that recycling helps in both quality and efficiency, not just one.
Anchor: It's like seeing whether better study notes improve both your test score and how fast you finish.
Hook: Who were the opponents?
The Concept (Baselines and Competitors):
- What it is: Compare RCD against Sequential Denoising (standard dLLM decoding) and a latent competitor (Loopholing).
- How it works:
- Same models and data budgets for fairness.
- Tune thresholds so both methods have similar tokens-per-second for apples-to-apples comparisons.
- Why it matters: Beating a strong baseline at matched speed shows real progress, not just more compute.
Anchor: Racing on the same track, same distance, same shoes: who finishes with a better time?
Results with Context:
- Wide Gains: RCD adds 5–10 accuracy points across many settings. That's like jumping from a B- to an A- on average.
- AIME Breakthrough: On the toughest AIME24/25 math tasks, RCD nearly doubles accuracy (e.g., ~9.8% → ~19.8%), showing deeper multi-step reasoning.
- Fewer Steps: To reach the same accuracy, RCD needs about 4–5 times fewer denoising steps, like solving a maze in a handful of bold moves rather than many tiny ones.
- Throughput-Matched Wins: When locked to similar tokens/second, RCD still scores 2–9% higher. That's a fair-speed win.
- Data Efficiency: With just ~300M tokens in one epoch on a 4B model, RCD reaches strong reasoning accuracy; a competing latent method failed to produce coherent outputs under the same budget.
Surprises and Notes:
- Early-Step Recall: Many correct final tokens already show up in top-5 guesses at early steps, evidence that intermediate distributions are rich and worth recycling (a quick way to check this is sketched after this list).
- Bigger Blocks, Bigger Benefits: With larger block sizes, RCD's edge grows, because more positions share and stabilize context.
- Stable at Scale: Methods that inject raw hidden states can destabilize at 8B+ models; RCD stays stable by using input-embedding space for residuals.
- Saturation Check: Even after training baselines longer (more epochs), RCD still pulls ahead, so the bottleneck was wasted information, not undertraining.
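The early-step recall observation can be checked with a few lines. This sketch uses random stand-in data, where chance level is k/|V| = 0.1; the paper's point is that real dLLMs score far above chance:

```python
import numpy as np

def topk_recall(early_probs, final_ids, k=5):
    """Fraction of positions whose final token already sits in the early top-k."""
    topk = np.argsort(-early_probs, axis=-1)[:, :k]
    hits = [final_ids[i] in topk[i] for i in range(len(final_ids))]
    return float(np.mean(hits))

rng = np.random.default_rng(1)
early = rng.dirichlet(np.ones(50), size=100)   # 100 positions, 50-word vocabulary
final = rng.integers(0, 50, size=100)          # pretend final decoded tokens
print(topk_recall(early, final))               # ~0.1 here, since the data is random
```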
Bottom Line: RCD consistently outperforms standard diffusion decoding, especially on the hardest reasoning tests, and it does so with minimal extra compute and good real-world throughput.
05 Discussion & Limitations
Hook: Even great backpacks have zippers and seams; you should know where they're strong and where they might snag.
The Concept (Limitations):
- What it is: Where RCD may struggle or need care.
- How it works:
- Needs a decent Reference Model to create clean residual signals.
- Adds a simple but extra step (residual mixing) at inference.
- Requires careful calibration (residual temperature) to match training behavior.
- Why it matters: Knowing edges keeps deployments smooth.
Anchor: Like tuning a bike's brakes: not hard, but important for a safe ride.
Hook: What do you need in your toolbox?
The Concept (Required Resources):
- What it is: Practical needs to train/use RCD.
- How it works:
- A smaller yet capable Reference Model (e.g., ~1.7B) and a Target Model (e.g., 4B–8B).
- Some extra training tokens (hundreds of millions to ~1B) beyond plain SFT.
- Inference engine support for parallel decoding (already common in dLLM stacks).
- Why it matters: Planning capacity avoids surprises.
Anchor: Think of it as needing a good assistant coach, some practice time, and a gym that supports your drills.
Hook: When might RCD not be your best pick?
The Concept (When Not to Use):
- What it is: Situations where benefits are small.
- How it works:
- Ultra-short generations where diffusion already ends in 1–2 steps.
- Tasks that don't benefit from soft context (e.g., trivial lookups).
- Very tiny models where hidden-state tricks might already suffice and overhead matters a lot.
- Why it matters: Use the right tool for the job.
Anchor: You don't bring a moving van to carry a lunchbox.
Hook: What mysteries remain?
The Concept (Open Questions):
- What it is: Future paths to explore.
- How it works:
- Can we learn the entropy schedule automatically per task or per sample?
- How does RCD combine with reinforcement learning for even deeper reasoning?
- Can residuals carry structure beyond token embeddings (e.g., syntax or math states)?
- How to push RCD into multimodal and ultra-long contexts (books, codebases)?
- Why it matters: Answering these can unlock even bigger gains.
Anchor: It's like discovering a new shortcut: now we map it, pave it, and see where else it leads.
06 Conclusion & Future Work
Hook: If your drafts could whisper tips to your next draft, you'd write better, faster.
The Concept (3-Sentence Summary):
- What it is: Residual Context Diffusion (RCD) recycles low-confidence guesses in diffusion LLMs into soft residual hints for the next step, weighted by uncertainty (entropy).
- How it works: A two-stage training pipeline teaches a target model to use stable, teacher-provided residuals; at inference, a warm start and temperature scaling keep behavior aligned.
- Why it matters: RCD consistently boosts accuracy (often 5–10 points; nearly 2× on AIME) with little extra compute and fewer steps, improving both quality and efficiency.
Anchor: Main Achievement: Turning discarded intermediate signals into a guiding context stream; practical, stable, and scalable across model sizes and block settings.
Future Directions:
- Learn adaptive residual schedules automatically.
- Combine with RL to deepen reasoning.
- Extend to multimodal and very long sequences.
- Explore structured residuals that encode tasks like algebra or code flow.
Why Remember This: RCD's big idea is simple and sticky: don't waste your near-misses; recycle them smartly. That mental model applies beyond language models to any iterative system that refines guesses over time.
Practical Applications
- Upgrade existing diffusion LLM chatbots to answer math and logic questions more accurately with similar latency.
- Use RCD in code-generation diffusion models to reduce errors and speed up multi-line completions.
- Enable long-document assistants (summarization, Q&A) to reason better across chapters with fewer passes.
- Deploy in education tools to provide clearer step-by-step solutions on math benchmarks like GSM8K and AIME.
- Adopt in on-device or edge models to save compute by needing fewer denoising steps for the same quality.
- Pair with RL finetuning to further boost complex reasoning while keeping inference efficient.
- Apply to multimodal diffusion models (e.g., LLaDA-V) to refine text reasoning grounded in images more effectively.
- Improve batch serving economics in inference engines (Fastdllm, D2F) by keeping throughput high and accuracy higher.
- Retrofit small-to-mid models (1–8B) in resource-constrained settings using the two-stage training to gain sizable accuracy boosts.
- Use as a research scaffold to explore structured residuals (syntax, program state) for domain-specialized assistants.
- âąUse as a research scaffold to explore structured residuals (syntax, program state) for domain-specialized assistants.