ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Key Summary
- ReFusion is a new way for AI to write text faster by planning in chunks (called slots) and then filling each chunk carefully.
- It mixes two skills in one model: diffusion-style planning (great for deciding where to write next) and autoregressive writing (great for writing sentences that flow).
- By grouping nearby words into a slot, the model avoids mixing up word pairs like 'right now' and 'at once' into nonsense like 'right once'.
- ReFusion reuses its memory (KV cache) every step, so it doesn't have to recompute everything, making it much faster than older diffusion models.
- On seven tests in math, coding, and reasoning, ReFusion beats past diffusion models by big margins and often rivals strong autoregressive models while being about 2.3× faster.
- Its decoding runs in two steps each round: plan which slots to write next, then fill those slots in parallel with careful verification.
- Training mirrors inference: the model practices both orderly writing on clean slots and fixing masked slots, so every token helps learning.
- A simple idea powers it all: parallelize at the slot level (where words are less dependent across slots) and serialize inside each slot (where words depend a lot).
- Even when it skips a costly re-computation step, performance stays strong, suggesting the design is both efficient and robust.
Why This Research Matters
ReFusion makes AI responses arrive faster without scrambling nearby words, which feels better in everyday apps like chat, search, and coding assistants. It uses compute more efficiently by reusing memory every step, which lowers latency and can reduce energy costs. For learning and tutoring, it writes multi-step reasoning more smoothly, so students see clearer explanations in less time. Developers benefit because code suggestions are both quicker and more likely to work on the first try. On devices with limited power, ReFusion's speed-ups can make advanced AI more accessible. The design shows a general pattern that other AI systems can adopt: plan globally, write locally. This helps push the whole field toward models that are both smart and snappy.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine building a Lego castle. If you have to place each brick one by one in a strict left-to-right order, it takes forever. But if you can plan sections and build several towers at once, while making sure each tower's bricks stack correctly, you finish much faster.
Concept 1: Large Language Model (LLM)
- What it is: An LLM is a computer program that predicts the next words to write based on what came before.
- How it works: 1) Read the prompt, 2) Guess likely next words, 3) Repeat. It learns from lots of text to make good guesses.
- Why it matters: Without it, computers wouldn't understand or generate useful language. Anchor: When you ask a chatbot to write a story, that's an LLM deciding which words to place next.
Concept 2: Autoregressive Model (ARM)
- What it is: An ARM writes strictly one token after another, always left-to-right.
- How it works: 1) Look at all previous words, 2) Predict the next word, 3) Append it, 4) Repeat until done.
- Why it matters: This keeps sentences coherent, but it's slow because it can't write multiple new words at the same time. Anchor: It's like copying a sentence letter-by-letter with no skipping allowed.
Concept 3: KV Cache (the model's short-term memory)
- What it is: A way to remember the past so the model doesn't re-think everything every step.
- How it works: 1) Store key/value summaries of what's been read, 2) Reuse them to speed up future predictions (a toy sketch follows this list).
- Why it matters: Without it, every step would be as slow as the first, making generation crawl. Anchor: Like saving your place and notes in a book so you don't reread from page one.
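To make the caching idea concrete, here is a minimal sketch for a single attention head with toy dimensions; the `attend` function, random vectors, and shapes are illustrative assumptions, not ReFusion's actual code.

```python
# A toy KV cache for one attention head: old keys/values are stored once and
# reused at every later step instead of being recomputed.
import numpy as np

def attend(q, K, V):
    # q: (d,), K and V: (t, d); return an attention-weighted mix of cached values.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    k, v, q = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    # Append only the new token's key/value; earlier entries are reused as-is.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)   # cost grows with cache length, no restart from scratch
```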
Concept 4: Diffusion (for text)
- What it is: A method that starts with a messy or masked version of text and gradually cleans it up.
- How it works: 1) Mask parts, 2) Predict missing pieces, 3) Repeat to reduce uncertainty.
- Why it matters: It allows more flexible writing order, potentially filling multiple parts at once. Anchor: Like restoring a faded painting by carefully filling in missing colors layer by layer.
Concept 5: Masked Diffusion Model (MDM)
- What it is: A diffusion-style text model that repeatedly fills in masked tokens.
- How it works: 1) Start with many [MASK] tokens, 2) Unmask likely words, 3) Iterate until complete.
- Why it matters: It can, in theory, write several places in parallel. But it often needs bidirectional reading and can't use KV cache easily, so it becomes slow per step (a toy version of this loop is sketched below). Anchor: Imagine fixing holes in a story page by page, but each time you must scan the whole book again.
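Here is a toy sketch of the iterate-and-unmask loop. Random numbers stand in for the model's confidences and `tok{i}` strings stand in for predicted words; nothing here is a real MDM implementation.

```python
# Toy masked-diffusion loop: each round, "predict" all masked positions and
# unmask only the most confident few.
import numpy as np

rng = np.random.default_rng(0)
length, per_round = 12, 3
tokens = ["[MASK]"] * length

while "[MASK]" in tokens:
    masked = [i for i, t in enumerate(tokens) if t == "[MASK]"]
    # A real MDM runs a full bidirectional forward pass here every round,
    # which is the expensive part this section keeps pointing at.
    confidence = rng.random(len(masked))
    best = [masked[j] for j in np.argsort(-confidence)[:per_round]]
    for i in best:
        tokens[i] = f"tok{i}"          # placeholder for the predicted word
print(tokens)
```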
Concept 6: Conditional Independence (when it's safe to fill in several spots at once)
- What it is: The idea that some positions donât strongly affect each other, so you can guess them independently.
- How it works: If A and B barely influence each other given the context, you can fill both at once.
- Why it matters: If this assumption is wrong, you get jumbled phrases (see the small example after this list). Anchor: If two puzzle pieces are from different corners, you can place both without them clashing; but if they touch, you must be careful.
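As a small numeric example, using the document's own 'right now' / 'at once' pair with made-up probabilities:

```python
# The true distribution only allows "right now" and "at once", but decoding the
# two words independently from their marginals makes "right once" look plausible.
joint = {("right", "now"): 0.5, ("at", "once"): 0.5}

p_word1 = {"right": 0.5, "at": 0.5}      # marginal over the first word
p_word2 = {"now": 0.5, "once": 0.5}      # marginal over the second word

pair = ("right", "once")
independent_score = p_word1[pair[0]] * p_word2[pair[1]]   # 0.25 under independence
true_probability = joint.get(pair, 0.0)                   # 0.0 in reality
print(independent_score, true_probability)
```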
The World Before: ARMs make beautiful, coherent sentences but are slow because they write strictly in order. MDMs promise speed by filling multiple tokens at once, but in practice they were often slower (since they reprocess everything) and could jumble nearby words due to wrong independence assumptions.
The Problem: How can we keep the coherence of ARMs, the parallelism promise of MDMs, and the speed from reusing memory, all at once?
Failed Attempts: 1) Plain MDMs used bidirectional attention, preventing KV cache reuse, so each step was expensive. 2) Heuristics tried to pick many tokens to decode in parallel, but nearby words (like 'at once' or 'right now') got mixed, causing incoherence. 3) Blocked hybrids fixed some caching but lost true any-order flexibility or still scrambled tokens within blocks.
The Gap: We needed a way to parallelize generation without breaking near-word dependencies and to enable full KV cache reuse, while keeping flexible, non-linear generation orders.
Concept 7: Slot (a short, fixed-length chunk of consecutive tokens)
- What it is: A small stretch of neighboring tokens treated as a unit.
- How it works: 1) Split the target text into equal-length slots, 2) Decide which slots to write now, 3) Inside each slot, write tokens in order.
- Why it matters: Nearby words depend on each other a lot; serializing inside a slot preserves coherence, while parallelizing across slots gives speed (the split is sketched right after this list). Anchor: Build several Lego towers at once (parallel), but stack the bricks in each tower from bottom to top (serial).
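A minimal sketch of the slot split; the slot size is a hyperparameter, and 4 below is just for illustration.

```python
# Split a token sequence into fixed-length, consecutive slots; the last slot
# may be shorter if the length does not divide evenly.
def split_into_slots(token_ids, slot_size=4):
    return [token_ids[i:i + slot_size] for i in range(0, len(token_ids), slot_size)]

print(split_into_slots(list(range(10)), slot_size=4))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```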
Real Stakes: Faster, coherent AI writing helps everyday tools: search answers that appear quicker, coding assistants that test more ideas in the same time, tutors that solve math step-by-step without lag, and apps that run smoothly on less powerful devices. Speed and quality together mean friendlier, more helpful AI.
02 Core Idea
Hook: You know how a chef preps ingredients (plan) for several dishes, then cooks each dish's steps in the right order? That's much faster than cooking every dish from start to finish one at a time.
Aha in one sentence: Move parallel decoding from single tokens to short, consecutive slots: plan which slots to write next using diffusion, then fill those slots autoregressively in parallel while reusing memory.
Concept 8: Plan-and-Infill (two-step decoding)
- What it is: A loop of (1) planning which slots to write next and (2) infilling those slots.
- How it works: 1) Diffusion planning scores masked slots, 2) Pick confident slots, 3) Draft them, 4) Verify and complete them autoregressively.
- Why it matters: Planning avoids clashes; infilling preserves fluency. Anchor: Decide which blanks in a crossword can be safely solved now (plan), then fill each chosen word letter-by-letter (infill).
Three analogies:
- City building: Approve permits for several neighborhoods (plan), then construct each building floor-by-floor (infill). Parallel neighborhoods, serial floors.
- School play: Assign different scenes to different groups (plan), each group rehearses its scene line-by-line (infill). Scenes in parallel, lines in order.
- Jigsaw puzzle: Pick separate puzzle regions to work on next (plan), then place each regionâs pieces one after another (infill). Regions in parallel, pieces in order.
Before vs After:
- Before: MDMs tried to fill many scattered tokens at once: fast in theory but slow in practice and often incoherent nearby.
- After: ReFusion fills several slot-sized regions at once while writing inside each region in order, preserving local coherence and enabling full KV cache reuse.
Why it works (intuition):
- Nearby tokens are tightly linked; far tokens are looser. By serializing inside slots (nearby words) and parallelizing across slots (farther parts), the model respects true dependencies. Reordering generated slots before masked ones keeps a clean causal view so the KV cache is fully reusable. Planning shrinks the search from a wild 'all token combinations' space to a manageable 'which slots next' space.
Building blocks:
- Slots: Fixed-length, consecutive chunks; inside them, write left-to-right.
- Diffusion planning: Score each masked slotâs readiness; pick the confident ones.
- Drafting + Verification: Make a draft for chosen slots; accept the longest trustworthy prefix.
- Parallel completion: If a slot isnât fully accepted, refine it independently in parallel.
- KV cache reuse: Move finished slots ahead of masked ones so the model always reads the finished past first; this keeps memory valid and fast.
- Hybrid training: Practice both skills (ARM loss on clean slots and diffusion denoising on masked slots) so every token teaches the model something.
Concept 9: Parallel Decoding (at the slot level)
- What it is: Filling multiple slots at the same time.
- How it works: 1) Choose weakly dependent slots, 2) Draft them all, 3) Verify and finish them together.
- Why it matters: Big speed gains without tangling words that should flow together. Anchor: Several builders erect different houses at once, but each house's floors are built in order.
The result: Strong coherence where it matters (inside slots), true parallelism where it's safe (across slots), and full memory reuse every step, making ReFusion both fast and high-quality.
03 Methodology
High-level recipe: Prompt → split target into slots → Step A: diffusion planning (pick which slots to write) → Step B: autoregressive infilling (verify, accept prefixes, and complete) → reorder finished slots before masked ones → repeat until done → restore original order.
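Here is a compact sketch of that loop under toy assumptions: random numbers and canned strings stand in for the model, chosen slots are filled in a plain Python loop rather than in true parallel, and each chosen slot is retried until complete within the same round, which simplifies the real method. Only the control flow (plan, then infill with prefix acceptance) is meant to be illustrative.

```python
# Plan-and-infill control flow with a fake "model" (random scores, canned tokens).
import random
random.seed(0)

SLOT_SIZE, MASK = 4, None
slots = [[MASK] * SLOT_SIZE for _ in range(3)]        # fully masked response

def plan(slots):
    # Step A: score each still-masked slot and pick roughly the top half.
    masked = [i for i, s in enumerate(slots) if MASK in s]
    ranked = sorted(masked, key=lambda _: random.random())
    return ranked[: max(1, len(ranked) // 2)]

def infill(slot_id, slot):
    # Step B: draft the rest of the slot, accept a verified prefix, repeat.
    while MASK in slot:
        start = slot.index(MASK)
        draft = [f"s{slot_id}t{j}" for j in range(start, SLOT_SIZE)]
        accepted = random.randint(1, len(draft))       # stand-in for verification
        slot[start:start + accepted] = draft[:accepted]

while any(MASK in s for s in slots):
    for i in plan(slots):
        infill(i, slots[i])
print(slots)   # this toy never physically reorders slots, so no final re-sorting is needed
```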
Inputs and setup:
- Start with a fully masked response divided into fixed-length slots (for example, each slot has 32 tokens). All tokens keep their true position IDs so the model always knows where each word belongs in the final answer.
Concept 10: Position IDs (keeping everyone's seat)
- What it is: Each token carries its true index from the final sequence.
- How it works: Even if we reorder buffers for speed, the IDs don't change, so the model still measures distances correctly (see the small example below).
- Why it matters: Without stable positions, reordering would confuse the model's sense of order. Anchor: Like moving students to a different row for a photo but keeping their name tags on.
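A small example of the "name tags" idea, with made-up tokens: the processing order changes, but each token keeps its original position ID.

```python
# Finished slots are moved ahead of the masked slot in the processing buffer,
# yet every token keeps the position ID it will have in the final answer.
slot0 = [("The", 0), ("early", 1), ("bird", 2)]          # finished
slot1 = [("[MASK]", 3), ("[MASK]", 4), ("[MASK]", 5)]    # still masked
slot2 = [("catches", 6), ("worms", 7), (".", 8)]         # finished

buffer = slot0 + slot2 + slot1        # clean-before-masked ordering for the cache
print([pos for _, pos in buffer])     # [0, 1, 2, 6, 7, 8, 3, 4, 5]
```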
Step A: Diffusion-based slot planning
- What happens: The model looks at all masked slots and gives each one a certainty score (e.g., how confident it is about the first token of that slot). It then selects a batch of high-confidence slots to work on in parallel and drafts them (makes a quick guess for all tokens in each slot).
- Why this step exists: If you skip planning, you might pick slots that depend on each other and clash; that causes incoherence. Drafts help the next step go faster.
- Tiny example: Suppose you're writing 'The early bird catches the worm.' With a small slot size, you might have slots like [The early bird] and [catches the worm]. Planning will likely pick the 'catches the worm' slot only after 'The early bird' gives enough context (a toy selection rule is sketched below).
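A minimal sketch of one plausible selection rule, assuming the certainty score is the model's confidence in each slot's first token; the threshold and numbers are illustrative, not the paper's exact values.

```python
# Keep every masked slot whose first-token confidence clears a threshold;
# if none do, fall back to the single most confident slot so decoding advances.
def select_slots(first_token_confidence, threshold=0.7):
    chosen = [i for i, c in enumerate(first_token_confidence) if c >= threshold]
    if not chosen:
        chosen = [max(range(len(first_token_confidence)),
                      key=first_token_confidence.__getitem__)]
    return chosen

print(select_slots([0.91, 0.42, 0.88]))   # -> [0, 2]: draft slots 0 and 2 in parallel
```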
Step B: Autoregressive slot infilling (verify and complete)
- What happens: 1) Concatenate the drafted slots in original left-to-right order and do a single pass to see how many drafted tokens are trustworthy in a row (a long, good prefix). 2) If a whole slot (or more) is accepted, great: save them immediately. 3) If not, for each slot, accept its longest good prefix, remask the rest, and predict again, each slot independently and in parallel (a minimal version of this prefix acceptance is sketched after this list).
- Why this step exists: Verification prevents bad drafts from slipping in; autoregressive completion maintains fluent, local grammar inside each slot.
- Tiny example: If the draft for a slot is [cat, chase, mouse, quickly] but the model is only confident up to [cat, chase], it accepts those two, remasks [mouse, quickly], and tries again, guided by the prefix.
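The tiny example above, as a sketch: accept drafted tokens only up to the first one whose verification confidence drops below a threshold. The probabilities and threshold here are made-up numbers.

```python
# Accept the longest trustworthy prefix of a drafted slot; everything after the
# first low-confidence token is remasked and re-predicted next round.
def longest_accepted_prefix(draft, verify_confidence, threshold=0.8):
    accepted = []
    for token, conf in zip(draft, verify_confidence):
        if conf < threshold:
            break
        accepted.append(token)
    return accepted

draft = ["cat", "chase", "mouse", "quickly"]
confs = [0.95, 0.90, 0.55, 0.40]
print(longest_accepted_prefix(draft, confs))   # ['cat', 'chase'] -> remask the rest
```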
Secret sauce #1: Serial inside, parallel outside
- Nearby tokens are handled carefully (serial writing) while distant regions run together (parallel). This matches how dependencies actually behave in language.
Secret sauce #2: Full KV cache reuse
- After finishing a batch of slots, we physically place them before the still-masked ones in the next iteration. Because attention is causal and position IDs stay true, every new pass can reuse all the memory from what's done. That massively speeds things up.
Secret sauce #3: Hybrid training mirrors inference
- Training simulates inference states: randomly mask some slots, permute the clean ones, then place clean-before-masked. The model learns: 1) ARM next-token prediction on the clean slots (to be a strong writer), and 2) diffusion denoising on the masked slots (to be a strong planner and fixer). This uses every token for learning, boosting data efficiency.
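A toy version of the hybrid objective, with stand-in losses (plain cross-entropy over made-up distributions) rather than the paper's exact formulation; `cross_entropy` and `hybrid_loss` are illustrative names.

```python
# Hybrid training sketch: ARM next-token loss on clean-slot tokens plus a
# denoising loss on masked-slot tokens, so every target token teaches something.
import numpy as np

def cross_entropy(prob_over_vocab, true_index):
    return -np.log(prob_over_vocab[true_index] + 1e-9)

def hybrid_loss(clean_predictions, masked_predictions):
    # Each item is (predicted distribution, correct token id).
    ar_term = sum(cross_entropy(p, t) for p, t in clean_predictions)        # writer skill
    denoise_term = sum(cross_entropy(p, t) for p, t in masked_predictions)  # planner/fixer skill
    return ar_term + denoise_term

dist = np.array([0.1, 0.7, 0.2])
print(hybrid_loss([(dist, 1)], [(dist, 2)]))
```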
Concrete walkthrough:
- Suppose the answer has 96 tokens and slot size 32 → 3 slots.
- Iteration 1 (planning): Model scores all three masked slots. It selects slots 1 and 3 to draft because they look confident enough.
- Iteration 1 (infilling): It verifies the combined draft and fully accepts slot 1, but for slot 3 it only trusts the first 10 tokens. It keeps those 10, remasks the rest, and tries again for slot 3 until done. Finished slots move up front; their caches are kept.
- Iteration 2: With more context from slots 1 and 3, slot 2 now scores high, is drafted, verified, and completed quickly.
- Final: Restore original slot order to present the answer exactly as intended.
What breaks without each step:
- No planning → parallel conflicts between slots, causing garbled phrases.
- No verification → bad drafts slip in, errors snowball.
- No autoregressive writing inside slots → nearby words mismatch (e.g., 'right once' instead of 'right now').
- No KV cache reuse → each iteration reprocesses everything, crushing speed.
- No stable position IDs → reordering breaks the model's sense of order.
Concept 11: Speculative Decoding (speed boost)
- What it is: Using a quick draft to accelerate careful verification.
- How it works: Draft multiple tokens, then only accept the parts that pass a confidence check.
- Why it matters: You get the speed of big leaps with the safety of checkpoints (a toy acceptance rule follows below). Anchor: Like sketching a whole paragraph quickly, then keeping only the sentences that read well.
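A toy, greedy-style acceptance rule; this is a simplification of speculative decoding (real systems often use a probabilistic test instead), shown only to make the draft-then-verify pattern tangible.

```python
# Accept drafted tokens while the verifier agrees token-by-token; stop at the
# first disagreement and continue from the verifier's choice instead.
def accept_agreed_prefix(draft, verifier_top_choices):
    kept = []
    for d, v in zip(draft, verifier_top_choices):
        if d != v:
            break
        kept.append(d)
    return kept

print(accept_agreed_prefix(["the", "quick", "fox"], ["the", "quick", "brown"]))
# -> ['the', 'quick']: keep the agreed prefix, redo the rest
```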
Putting it all together: Plan → Draft → Verify → Complete (in parallel) → Reuse memory → Repeat. That's ReFusion's efficient, coherent writing loop.
04 Experiments & Results
Hook: Think of a track meet where one runner (old MDMs) tries to sprint in zigzags and keeps stopping to tie shoes, another runner (ARMs) runs smoothly but only one step at a time, and a new runner (ReFusion) runs multiple lanes at once but keeps each lane's steps in perfect order.
The tests: Seven benchmarks covering general knowledge (MMLU-Pro, ARC-C), math and science reasoning (GSM8K, MATH, GPQA), and code (HumanEval, MBPP). We measured accuracy or pass@1 (did it get the exact answer?), and throughput in tokens per second (TPS), which is like how fast the runner moves.
The competition: Strong ARMs (Llama-3-8B, Qwen3-8B), MDMs (LLaDA-8B, Dream-7B), and MDM accelerators (Fast-dLLM, D2F).
Scoreboard with context:
- ReFusion vs prior MDMs: ReFusion sets a new state of the art. On HumanEval (code), it hits roughly 78.7% pass@1, about a 22-point jump over a strong MDM baseline. On MBPP, it reaches around 92 TPS, faster than other MDMs (up to about 1.4× the next-fastest) while also being more accurate. That's like getting an A while running the race the quickest.
- ReFusion vs strong ARMs: ReFusion often matches or beats a robust ARM (Qwen3-8B) on tasks like GSM8K and MBPP while being about 2.3× faster on average. That's like earning slightly higher scores than a careful note-taker while writing more pages per minute.
Concept 12: Throughput (Tokens Per Second)
- What it is: How many tokens the model outputs each second.
- How it works: Faster TPS means the model writes more quickly.
- Why it matters: Low TPS feels laggy to users. Anchor: It's like words-per-minute in typing.
Concept 13: Pass@1 (for coding)
- What it is: The chance the first try passes the test cases.
- How it works: The model writes code once; we run tests; pass@1 counts how often it works immediately.
- Why it matters: In coding, first-try correctness saves huge time (a tiny calculation is shown below). Anchor: Like solving a math problem right the first time without erasing.
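Computing pass@1 is just a fraction; here is a tiny sketch with made-up test outcomes.

```python
# pass@1 = share of problems whose first generated solution passes all tests.
first_try_results = [True, False, True, True]    # placeholder unit-test outcomes
pass_at_1 = sum(first_try_results) / len(first_try_results)
print(f"pass@1 = {pass_at_1:.2f}")               # 0.75
```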
Surprising findings:
- KV cache reuse without re-computation was not just faster (about 1.16–1.33× speedup) but kept accuracy stable or slightly better. It's as if not overthinking each draft prevents spreading mistakes.
- Hyperparameters had broad 'sweet spots.' For example, with slot sizes like 8–32 and reasonable acceptance thresholds, ReFusion stayed both faster and more accurate than baselines. This suggests the design is robust, not finicky.
Controlled comparisons:
- When retrained under the same data and settings (e.g., on a 120K-sample subset), ReFusion still outperformed comparable MDMs and even a retrained strong ARM on some coding tasks, often while being ~2× faster. This shows the advantage comes from the architecture, not secret data.
Takeaway: ReFusion shifts the speed-quality frontier up and to the right: it's both quicker and better than previous diffusion models and often challenges strong autoregressive models too.
05 Discussion & Limitations
Limits:
- Immutable slots: Once a slot is accepted, ReFusion doesn't revisit it to fix small errors. This speeds things up but can lock in a mistake.
- Planning heuristics: The choice of which slots to fill is based on confidence scores (like the first tokenâs probability). While effective, smarter planners could further reduce rare missteps.
- Knowledge-heavy tasks: Without large-scale pretraining beyond its fine-tuning data, ReFusion can trail models that memorize more world facts.
Resources needed:
- Training: Multi-GPU setup and data covering math, code, and reasoning. The method benefits from enough compute to mirror the plan-and-infill dynamics during training.
- Inference: A single modern GPU benefits a lot from full KV caching and parallel slot infilling; standard transformer tooling works.
When not to use:
- Very short outputs: The planning overhead may not pay off versus a simple left-to-right ARM.
- Ultra-tight coupling across the whole answer: If nearly every sentence depends tightly on the previous one, parallelism opportunities shrink.
- Strictly deterministic formatting: If every character depends on the immediate previous one (like certain cryptic formats), ARM-style decoding may be simpler.
Open questions:
- Can we safely re-mask and refine part of a finished slot (sub-slot editing) without big overhead?
- Could reinforcement learning improve the planning policy to pick even better slot orders for multi-step reasoning?
- Can slot size be dynamic (short for tricky areas, long for easy runs) to balance coherence and speed automatically?
- What new certainty scores or verifiers best predict safe parallelism across domains like legal or scientific writing?
06 Conclusion & Future Work
Three-sentence summary: ReFusion speeds up text generation by planning which chunks (slots) to write in parallel and then filling each chunk carefully in order. By serializing inside slots and parallelizing across them, it preserves coherence, enables full KV cache reuse, and avoids the messiness that plagued earlier diffusion models. Across math, coding, and reasoning tasks, it sets a new bar for diffusion-based models and often rivals strong autoregressive models at about 2.3× faster speeds.
Main achievement: A simple, powerful slot-based plan-and-infill framework that unites diffusion's flexible planning with autoregression's local fluency while unlocking exact KV cache reuse.
Future directions: Add sub-slot refinement to fix late-discovered errors, learn smarter planning with reinforcement learning, and adapt slot sizes on the fly. Scale training data and models to boost knowledge-heavy performance without sacrificing speed.
Why remember this: ReFusion shows that structure matters: by matching the real shape of language dependency (tight locally, looser globally), we can have both speed and quality. It's a blueprint for future systems that plan globally, write locally, and waste no compute.
Practical Applications
- Speed up coding assistants so they propose multiple complete function drafts in one go while preserving correctness.
- Deliver faster math tutoring with clear, step-by-step solutions that don't lag between steps.
- Improve chatbots' response times in customer support without sacrificing clarity and tone.
- Power search summaries that build different sections (facts, pros/cons, sources) in parallel and read fluently.
- Generate long documents (reports, blog posts) by planning sections in parallel and writing each section coherently.
- Accelerate data-to-text dashboards (e.g., weekly sales summaries) by parallelizing independent sections.
- Enable smoother on-device AI typing suggestions by reusing memory for lower latency.
- Boost code review tools that propose multi-line edits across separate files while keeping each edit locally consistent.
- Improve reasoning agents that plan multiple sub-steps ahead and fill them in reliably.
- Support educational content creation (quizzes, explanations) where multiple items can be drafted at once and polished locally.