
DEER: Draft with Diffusion, Verify with Autoregressive Models

Intermediate
Zicong Cheng, Guo-Wei Yang, Jia Li et al. | 12/17/2025
arXiv | PDF

Key Summary

  • DEER is a new way to speed up big language models by letting a diffusion model draft many tokens at once and an autoregressive model double-check them.
  • Traditional drafters write left-to-right and their mistakes snowball, which limits how many tokens can be accepted at a time.
  • DEER uses a diffusion large language model (dLLM) that drafts whole blocks in one step, so early errors don’t ripple through later tokens.
  • A two-stage training process aligns the diffusion drafter to behave like an autoregressive model for continuations, fixing a key mismatch.
  • Stage I teaches the drafter to continue from a given prefix; Stage II sharpens accuracy right after the prefix using exponentially weighted losses.
  • DEER achieves much longer accepted drafts (up to 32 tokens) and higher speedups (up to 5.54× on HumanEval with Qwen3-30B-A3B) than strong baselines like EAGLE-3.
  • The decoding remains lossless: after verification, the output distribution exactly matches the original autoregressive model.
  • DEER scales well in batches and even helps when the diffusion drafter is only partially trained, such as in math reasoning tasks.
  • An emergent skill called reliable block regeneration lets the drafter neatly rewrite masked suffixes as coherent blocks.
  • Overall, DEER shows diffusion drafting is a practical path to faster, high-quality LLM inference without changing the main model.

Why This Research Matters

Faster LLMs mean less waiting for help with coding, homework, or creative writing. For companies, higher throughput lowers serving costs and energy use, making AI both cheaper and greener. In real-time tools—like copilots, chat assistants, and coding IDEs—snappier responses keep users engaged and productive. DEER’s lossless guarantee preserves trust: you get the speed without changing what the model would have said. Its batch-friendly design and blockwise independence also scale well to many users at once. As diffusion-based infrastructures mature (like KV cache support), the gains should grow, pushing practical AI toward smoother, more responsive experiences.

Detailed Explanation


01 Background & Problem Definition

You know how when you text a friend, your phone guesses the next few words to help you type faster? Big language models (LLMs) do something similar when they answer questions or write code: they predict one token at a time. That one-token-at-a-time habit is called autoregressive decoding, and it’s safe but slow, especially when answers are long or when many users are waiting at once.

🍞 Hook: You know how building a LEGO tower one brick at a time is steady but slow? 🥬 The Concept: Autoregressive Model (AR) decoding is when an AI writes text one token after another, each new token depending on everything it has already written.

  • How it works:
    1. Look at the existing text (the prefix).
    2. Predict the next token.
    3. Add it to the text and repeat.
  • Why it matters: Without this careful, step-by-step process, the model might forget context or contradict itself; AR keeps things consistent but can be sluggish. 🍞 Anchor: When ChatGPT writes a sentence, it decides the next word like “The,” then “capital,” then “of,” and so on, one by one.
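To make the one-token-at-a-time loop concrete, here is a minimal sketch in plain Python. The `next_token_probs` function is a hypothetical stand-in for a real language model (it only knows this one example sentence); everything else follows the three steps above.

```python
# Minimal autoregressive decoding loop (toy stand-in for a real LLM).
# `next_token_probs` is a hypothetical model: it maps a prefix to a
# probability distribution over the next token.

def next_token_probs(prefix):
    # Toy "model": always continues the example sentence the same way.
    canned = {
        ("The",): {"capital": 1.0},
        ("The", "capital"): {"of": 1.0},
        ("The", "capital", "of"): {"France": 1.0},
        ("The", "capital", "of", "France"): {"is": 1.0},
        ("The", "capital", "of", "France", "is"): {"Paris": 1.0},
    }
    return canned.get(tuple(prefix), {"<eos>": 1.0})

def autoregressive_decode(prefix, max_new_tokens=8):
    tokens = list(prefix)
    for _ in range(max_new_tokens):          # one token per iteration
        probs = next_token_probs(tokens)     # 1. look at the existing prefix
        token = max(probs, key=probs.get)    # 2. pick the next token (greedy)
        if token == "<eos>":
            break
        tokens.append(token)                 # 3. append it and repeat
    return tokens

print(autoregressive_decode(["The"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris']
```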

To make answering faster without changing the model’s brain, researchers use speculative decoding: a tiny helper guesses several upcoming tokens, and the big model checks them. If they look right, we keep them; if not, we fix them. This can save time when the helper often guesses correctly.

🍞 Hook: Imagine a friend blurts out three likely next words for your sentence, and you just nod if they’re right. 🥬 The Concept: Speculative Decoding is a draft–verify trick where a small drafter proposes multiple next tokens and the main model verifies them.

  • How it works:
    1. The drafter proposes a short block of tokens in advance.
    2. The big model verifies each token in order.
    3. Accept correct ones; resample any mistakes.
  • Why it matters: Without speculative decoding, the big model must write every token itself, wasting time when many tokens are easy to predict. 🍞 Anchor: When asking “What is the capital of France?”, the drafter suggests “Paris.” The big model checks and accepts it instantly.
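The sketch below shows the draft–verify loop in its simplest (greedy) form: the drafter proposes a block, the big model checks tokens left to right, and the first disagreement is replaced by the big model's own choice. `drafter_propose` and `target_next_token` are hypothetical stand-ins for a small drafter and the large target model; the paper's lossless probabilistic acceptance rule is sketched later, in the Methodology section.

```python
# Greedy speculative decoding sketch: draft a block, verify token by token.

def drafter_propose(prefix, k):
    # Toy drafter: proposes a fixed continuation, sometimes slightly wrong.
    guess = ["is", "Paris", ",", "a", "nice", "city"]
    return guess[:k]

def target_next_token(prefix):
    # Toy target model: deterministic continuation of the example sentence.
    truth = ["The", "capital", "of", "France", "is", "Paris", ".", "It", "is", "beautiful"]
    return truth[len(prefix)] if len(prefix) < len(truth) else "<eos>"

def speculative_step(prefix, k=4):
    draft = drafter_propose(prefix, k)        # 1. drafter proposes k tokens
    accepted = []
    for token in draft:                       # 2. verify each token in order
        correct = target_next_token(prefix + accepted)
        if token == correct:
            accepted.append(token)            # 3a. keep correct guesses
        else:
            accepted.append(correct)          # 3b. fix the first mistake...
            break                             # ...and stop this round
    return prefix + accepted

print(speculative_step(["The", "capital", "of", "France"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```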

Here’s the snag: most drafters today are also autoregressive. They guess token 1, then guess token 2 based on token 1, and so on. If token 1 is slightly off, token 2 drifts further, and by token 5 the chain is off-track. We call this the gradual collapse of trust. Plus, AR drafters are still sequential, so their guesses aren’t fully parallel.

That’s why diffusion language models (dLLMs) are exciting: they generate many tokens together, more like solving a whole puzzle at once.

🍞 Hook: Imagine sketching a whole scene in broad strokes instead of drawing one leaf at a time. 🥬 The Concept: A Diffusion Large Language Model (dLLM) is a text generator that starts from noisy tokens and denoises them into a coherent block, often in parallel.

  • How it works:
    1. Begin with a noisy version of a token block.
    2. Gradually clean up the noise using the model’s knowledge.
    3. Output a full, polished block of tokens.
  • Why it matters: Without blockwise denoising, we’re stuck with left-to-right guessing and error buildup. 🍞 Anchor: For code like “def add(a, b): return a + b,” a dLLM can suggest the whole function body in one go.
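Here is a deliberately simplified picture of blockwise denoising, assuming a hypothetical `guess_token` predictor and a schedule that fills a few masked positions per pass (real dLLMs predict all positions at once and re-mask the least confident ones).

```python
# Toy illustration of blockwise denoising: start from an all-masked block
# and fill positions in over a few passes.

MASK = "<mask>"

def guess_token(prefix, block, position):
    # Hypothetical "dLLM": knows the target block for this example.
    target = ["return", "a", "+", "b"]
    return target[position]

def denoise_block(prefix, block_size, positions_per_pass=2):
    block = [MASK] * block_size                    # 1. start fully masked/noisy
    while MASK in block:
        masked = [i for i, t in enumerate(block) if t == MASK]
        for i in masked[:positions_per_pass]:      # 2. fill a few positions per pass
            block[i] = guess_token(prefix, block, i)
    return block                                   # 3. a full, polished block

print(denoise_block(["def", "add", "(", "a", ",", "b", ")", ":"], block_size=4))
# ['return', 'a', '+', 'b']
```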

Before this paper, people tried faster drafters: AR heads with trees (Medusa, Hydra, EAGLE-3) and n-gram or retrieval tricks (Lookahead). These helped but still struggled with uncertainty accumulation or shallow context. Some used diffusion ideas, but in continuous space with many steps, which made temperature control and alignment to AR verification tricky.

The missing piece was a drafter that (1) generates blocks in one step to avoid compounding mistakes, and (2) is trained to continue from the current prefix exactly like an AR model expects. DEER fills that gap with a diffusion drafter plus a two-stage alignment that teaches it to behave like a continuation expert. Why care? Faster, cheaper, greener AI: less waiting for code, tutoring, search, and multi-step reasoning; lower cloud bills; and happier users in real-time apps.

02 Core Idea

The “aha!” in one sentence: Use a diffusion model to draft whole blocks at once, then let the regular AR model verify them token-by-token—so you skip error snowballing and get big speedups without changing the final answer.

Let’s layer on simple analogies:

  1. Painting vs. placing tiles: Instead of laying tiles one by one (AR), paint the whole section (dLLM) and let a foreman (AR verifier) inspect each part.
  2. Group notes vs. single notes: Rather than composing a melody one note at a time, the dLLM proposes a bar of music, and the AR model checks each note for harmony.
  3. Packing lunches: The dLLM packs a multi-item lunchbox in one go; the AR model checks each item is correct. Fewer trips, same lunch quality.

🍞 Hook: Imagine taking a long jump instead of many tiny steps. 🥬 The Concept: One-Step Block Drafting is proposing a chunk of future tokens in a single move, with no left-to-right dependence inside the chunk.

  • How it works:
    1. Read the prefix.
    2. Draft k tokens in parallel via diffusion denoising.
    3. Send them to the AR verifier for token-wise acceptance.
  • Why it matters: Without one-step blocks, early token hiccups ripple forward, and the verifier rejects more and more. 🍞 Anchor: For “The capital of France is …”, the drafter proposes “Paris, a beautiful city.” The verifier accepts several tokens in a row at once.
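A minimal sketch of the one-step idea: every one of the k positions is predicted from the prefix alone, in a single pass, so no drafted token conditions on another drafted token. `predict_position` is a hypothetical stand-in for the diffusion drafter's per-position output.

```python
# One-step block drafting sketch: all k positions are filled in one parallel
# pass, each conditioned only on the prefix (not on the other drafted tokens).

def predict_position(prefix, position):
    # Hypothetical per-position prediction from the diffusion drafter.
    continuation = ["Paris", ",", "a", "beautiful", "city", "."]
    return continuation[position] if position < len(continuation) else "<eos>"

def draft_block_one_step(prefix, k):
    # Position i does NOT see positions < i, so an early slip
    # cannot ripple into later positions of the same block.
    return [predict_position(prefix, i) for i in range(k)]

prefix = ["The", "capital", "of", "France", "is"]
print(draft_block_one_step(prefix, k=4))
# ['Paris', ',', 'a', 'beautiful']
```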

But there’s a catch: naive dLLMs are trained to complete whole sequences, not to continue from a prefix like AR models do. DEER fixes this with a two-stage lesson plan.

🍞 Hook: Think of learning to ride a bike: first balance, then speed and turns. 🥬 The Concept: A Two-Stage Training Strategy teaches the dLLM to be a precise continuation drafter.

  • How it works:
    1. Stage I (AR-style Distillation): Train on truncated answers with a [SEP] marker so the dLLM learns to continue from the prefix.
    2. Stage II (Scribe Refinement): Re-mask only the last R tokens and weight losses so tokens closest to the prefix get extra focus.
  • Why it matters: Without Stage I, drafts don’t act like continuations; without Stage II, accuracy near the verification boundary is shaky and acceptance drops. 🍞 Anchor: Give the model “The capital of France [SEP]” and train it to write “is Paris …”, with extra attention to “is” and “Paris”.

🍞 Hook: You know how you continue a story from the last page you read, not from the whole book? 🥬 The Concept: Prefix-Conditioned Continuation means the model must extend text based only on the visible prefix and a boundary token.

  • How it works:
    1. Mark where the known text ends (using [SEP]).
    2. Mask the future and denoise only the continuation.
    3. Ensure predictions depend on the prefix, not hidden future.
  • Why it matters: Without this, the drafter might hallucinate based on missing future tokens and clash with the AR verifier. 🍞 Anchor: Given “Once upon a time [SEP]”, the model writes the next sentence that fits that start.

🍞 Hook: Imagine a coach who cares most about the first steps after you start running. 🥬 The Concept: Exponential Decay Loss Weighting makes mistakes closer to the prefix count more during training.

  • How it works:
    1. Re-mask only a suffix of length R.
    2. Assign higher weights to earlier tokens in that suffix (closest to the prefix).
    3. Train so the model is extra accurate where verification begins.
  • Why it matters: Without heavier focus near the boundary, the verifier rejects early tokens and shortens accepted blocks. 🍞 Anchor: When completing “The capital of France [SEP] …”, getting “is” and “Paris” right matters more than the flowery words that follow.

A neat bonus pops out:

🍞 Hook: Think of a neat eraser that lets you rewrite the end of a paragraph cleanly, keeping the start intact. 🥬 The Concept: Reliable Block Regeneration is the drafter’s ability to accept partially masked suffixes and regenerate them coherently over and over.

  • How it works:
    1. Mask part of the end.
    2. Denoise to fill it in as a block.
    3. Repeat with different masks and keep coherent structure.
  • Why it matters: Without stable regeneration, long blocks crumble when you need to rewrite or extend them. 🍞 Anchor: You can blank out the last two lines of a code function and the drafter refills them correctly on multiple tries.
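A toy sketch of the regeneration loop described above: blank a suffix of the block, refill it in one pass, and repeat with a different mask each round. The `refill` function is a hypothetical stand-in for the trained drafter.

```python
# Reliable block regeneration sketch: repeatedly mask the tail of a block
# and ask the drafter to rewrite it as a coherent unit.
import random

MASK = "<mask>"

def refill(prefix, block):
    # Hypothetical drafter: restores the coherent continuation for this toy example.
    target = ["return", "a", "+", "b"]
    return [target[i] if tok == MASK else tok for i, tok in enumerate(block)]

def regenerate(prefix, block, rounds=3, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        r = rng.randint(1, len(block))        # 1. choose how much of the tail to blank
        masked = block[:-r] + [MASK] * r      # 2. mask only the suffix
        block = refill(prefix, masked)        # 3. refill it as a block, then repeat
    return block

print(regenerate(["def", "add", "(", "a", ",", "b", ")", ":"], ["return", "a", "+", "b"]))
# ['return', 'a', '+', 'b']  -- stays coherent across repeated rewrites
```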

Before vs. After: Before DEER, AR drafters’ errors compounded, capping accepted block length (often ~7–10 tokens). After DEER, one-step diffusion drafting keeps proposals stable with depth, boosting acceptance up to 32 tokens and delivering 2–5.5× speedups depending on model and task. Why it works: independence inside the block (no conditioning on earlier draft tokens) prevents error snowballing; careful alignment turns a global dLLM into a sharp continuation expert; and the AR verifier keeps the output exactly correct.

03 Methodology

At a high level: Prefix → Stage I (AR-style distillation) → Stage II (refinement with weighted suffix masking) → One-step diffusion draft of k tokens → AR verification token-by-token → Output.

Inputs and outputs: The input is the current prefix plus a special [SEP] marker signaling where the known text ends. The output is a block of k drafted tokens that the AR verifier either accepts or corrects, then appends to the prefix for the next round.
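Putting the pipeline into one loop: draft k tokens in a single step, verify them against the AR model, append whatever survives, and repeat. In this sketch, `draft_block` and `verify_block` are hypothetical callables standing in for the trained diffusion drafter and the AR verifier (the token-level acceptance rule itself is sketched after the verification step below).

```python
# End-to-end DEER-style decode loop (sketch). The drafter and verifier are
# passed in as callables so the skeleton stays model-agnostic.

def deer_decode(prefix, draft_block, verify_block, k=8, max_tokens=64, eos="<eos>"):
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_tokens:
        draft = draft_block(tokens, k)          # one-step block draft of k tokens
        accepted = verify_block(tokens, draft)  # token-by-token AR verification
        tokens.extend(accepted)                 # append accepted (or corrected) tokens
        if not accepted or accepted[-1] == eos:
            break
    return tokens

# Toy usage: a drafter that happens to propose the right continuation and a
# verifier that accepts everything it agrees with.
truth = ["The", "capital", "of", "France", "is", "Paris", ".", "<eos>"]

def toy_draft(prefix, k):
    return truth[len(prefix):len(prefix) + k]

def toy_verify(prefix, block):
    return block  # the toy drafter is perfect, so everything is accepted

print(deer_decode(["The", "capital"], toy_draft, toy_verify, k=3))
# ['The', 'capital', 'of', 'France', 'is', 'Paris', '.', '<eos>']
```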

Stage I: AR-Style Continuation Distillation

  • What happens: We take teacher answers from an AR model, randomly truncate them, append [SEP], and mask the rest. The dLLM sees a noised version and learns to denoise only the continuation.
  • Why this exists: Off-the-shelf dLLMs are trained to complete whole sequences, so they may “peek” at future patterns. This stage re-teaches them: continue from the prefix, nothing else.
  • Example: Teacher answer: “The capital of France is Paris.” Truncate to “The capital of France [SEP] …” and mask the rest. The dLLM learns to reconstruct “is Paris …” from noise.
  • What breaks without it: The drafter drafts globally, not causally; the AR verifier rejects many tokens since the distribution mismatches prefix-only conditioning.
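A data-preparation sketch for Stage I, assuming plain token lists and hypothetical `[SEP]` and `[MASK]` symbols; the real pipeline works on token IDs and noises the masked region, but the truncate-then-mask structure is the same.

```python
# Stage I data construction sketch: truncate a teacher answer, insert [SEP],
# and mask everything after it so the dLLM learns to denoise only the continuation.
import random

SEP, MASK = "[SEP]", "[MASK]"

def make_stage1_example(teacher_tokens, seed=None):
    rng = random.Random(seed)
    cut = rng.randint(1, len(teacher_tokens) - 1)   # random truncation point
    prefix = teacher_tokens[:cut]                   # visible prefix
    target = teacher_tokens[cut:]                   # continuation to learn
    model_input = prefix + [SEP] + [MASK] * len(target)
    return model_input, target                      # loss is computed only on `target`

answer = ["The", "capital", "of", "France", "is", "Paris", "."]
model_input, target = make_stage1_example(answer, seed=0)
print(model_input)   # prefix + [SEP] + one [MASK] per continuation token
print(target)        # the continuation the dLLM must reconstruct
```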

Stage II: Prefix-Conditioned Accuracy Refinement (Scribe Refinement)

  • What happens: Instead of masking the whole suffix, we mask only the last R tokens (R is small) and weight losses exponentially so tokens nearest the prefix count more.
  • Why this exists: The verifier begins at the first masked token after the prefix. If those early tokens are off, acceptance plunges. This stage concentrates accuracy where it matters most.
  • Example: For “… France [SEP] is Paris, a beautiful city.” we re-mask “is Paris” (R small), assigning higher weight to “is”, slightly lower to “Paris”.
  • What breaks without it: The drafter might be fine overall but wobbly right at the verification boundary, capping accepted length.

🍞 Hook: Like continuing a story from where you left off. 🥬 The Concept: Prefix-Conditioned Continuation (introduced here operationally) ensures the dLLM only uses the prefix plus [SEP] when drafting.

  • How it works:
    1. Add [SEP] at the truncation point.
    2. Mask and denoise only the continuation tokens.
    3. Keep training so the model’s probability matches the AR teacher near the boundary.
  • Why it matters: Without strict prefix conditioning, the drafter proposes tokens the AR model won’t agree with, slashing acceptance. 🍞 Anchor: From “def add(a, b): [SEP]”, the model writes “return a + b” using just the function head.

🍞 Hook: A coach focusing practice on the first steps out of the blocks. 🥬 The Concept: Exponential Decay Loss Weighting (operational detail) gives higher training weight to earlier tokens in the masked suffix.

  • How it works:
    1. Choose R ∈ [1, 96].
    2. Set weights w_i = α^(R−i), with α ≈ 1.01.
    3. Train so near-prefix tokens are ultra-reliable.
  • Why it matters: Without this, small near-prefix errors snowball during verification. 🍞 Anchor: In “The capital of France [SEP] is Paris …”, the loss emphasizes “is” most, then “Paris,” then the rest.
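The weighting rule from the list above, written out. The formula w_i = α^(R−i) and the values α ≈ 1.01, R ∈ [1, 96] come from the paper; the surrounding helpers are an illustrative sketch (a real implementation would multiply per-token cross-entropy losses by these weights).

```python
# Stage II weighting sketch: re-mask only the last R tokens and give the
# positions closest to the prefix the largest training weight, w_i = alpha**(R - i).

MASK = "[MASK]"

def stage2_weights(R, alpha=1.01):
    # i = 1 is the first re-masked token (right after the kept text),
    # so it gets the largest weight alpha**(R - 1).
    return [alpha ** (R - i) for i in range(1, R + 1)]

def remask_suffix(tokens, R):
    return tokens[:-R] + [MASK] * R

tokens = ["The", "capital", "of", "France", "[SEP]", "is", "Paris", ",", "a", "beautiful", "city"]
print(remask_suffix(tokens, R=3))   # last 3 tokens ('a', 'beautiful', 'city') become [MASK]
print(stage2_weights(R=3))          # ≈ [1.0201, 1.01, 1.0] -- earliest masked token weighted most
```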

Inference: One-Step Draft, Then Verify

  • Drafting (One-Step Block Drafting):

    • What happens: Given prefix x1:j, the dLLM proposes k tokens in parallel via single-step denoising.
    • Why this exists: To avoid left-to-right uncertainty accumulation and exploit GPU parallelism.
    • Example: Propose 8 tokens for a code completion in one go.
    • What breaks without it: A sequential drafter conditions on its own unverified guesses; errors compound.
  • Verification (AR, token-by-token):

    • What happens: For each token i in the block, compute an acceptance ratio by comparing the AR model’s probability to the drafter’s. Accept with that probability; otherwise, resample from the AR model’s residual distribution. Append the chosen token to the prefix and proceed to i+1.
    • Why this exists: Guarantees lossless decoding—your final output has exactly the same distribution as sampling directly from the AR model.
    • Example: If the drafter proposes “Paris,” and the AR model agrees strongly, it is accepted immediately. If not, the AR model replaces it with its own choice.
    • What breaks without it: Speedups might come at the cost of correctness, changing the model’s behavior. DEER avoids that.
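Below is a sketch of the standard speculative-sampling acceptance rule that this verification step relies on: accept drafted token x with probability min(1, p_target(x)/p_draft(x)); otherwise resample from the normalized residual max(0, p_target − p_draft). The distributions here are toy dictionaries; in practice they come from the AR verifier and the diffusion drafter at each position.

```python
# Lossless verification sketch: speculative-sampling acceptance test for one token.
import random

def verify_token(drafted, p_target, p_draft, rng):
    accept_prob = min(1.0, p_target.get(drafted, 0.0) / p_draft[drafted])
    if rng.random() < accept_prob:
        return drafted, True                      # keep the drafted token
    # Rejected: resample from the residual distribution max(0, p_target - p_draft).
    residual = {t: max(0.0, p_target.get(t, 0.0) - p_draft.get(t, 0.0))
                for t in p_target}
    total = sum(residual.values())
    r, acc = rng.random() * total, 0.0
    for token, mass in residual.items():
        acc += mass
        if r <= acc:
            return token, False                   # corrected token from the AR model
    return drafted, False                         # numerical edge case fallback

rng = random.Random(0)
p_draft = {"Paris": 0.7, "Lyon": 0.3}
p_target = {"Paris": 0.9, "Lyon": 0.1}
print(verify_token("Paris", p_target, p_draft, rng))   # ('Paris', True) -- accepted
```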

Secret Sauce

  • Independence inside the block: The dLLM’s proposal for position i does not depend on the drafted tokens at positions < i. This stops mistakes from cascading.
  • Alignment where it counts: Two-stage training makes the drafter act like a continuation specialist, especially at the verification boundary.
  • Exactness by design: The AR verifier keeps everything lossless, so you accelerate without altering what the model would have said.

Concrete mini example

  • Prefix: “The capital of France [SEP]”
  • Draft block (k=6): “is Paris , a beautiful city”
  • Verify:
    1. “is” → AR agrees → accept.
    2. “Paris” → AR agrees → accept.
    3. “,” → AR might prefer “.” → reject and replace with “.”
    4. Next tokens verified similarly.
  • Result: Multiple tokens accepted in one round, with any mismatches safely corrected.

04 Experiments & Results

The Tests: The authors measured two things across code and math tasks: (1) Speedup—how many times faster end-to-end decoding gets; (2) Average acceptance length (τ)—how many drafted tokens get accepted per verification cycle. Larger τ usually means bigger speedups.

The Competition: DEER was compared to strong speculative baselines, including Medusa, Hydra, and EAGLE-3, across Qwen backbones from 4B up to 30B parameters.

The Scoreboard with Context:

  • Code generation (HumanEval, MBPP, CodeAlpacaPy, LiveCodeBench):

    • On HumanEval with Qwen3-30B-A3B at temperature 0, DEER reached an average acceptance length of 5.03 (versus EAGLE-3’s 3.05) and a speedup up to 5.54× in some settings (versus 2.41× for EAGLE-3). Think of that as finishing your homework before dinner when others finish at bedtime.
    • Across models, DEER consistently improved τ by roughly 50–120% over EAGLE-3 and achieved maximum accepted blocks up to 32 tokens (EAGLE-3 topped out around 7–8).
    • Acceptance distribution: DEER frequently got ≥8-token acceptances (about 16% of the time), a sign that long, stable blocks are common rather than rare flukes.
  • Batch scaling:

    • On HumanEval without KV cache, DEER’s throughput improved strongly with batch size, e.g., about 160 tokens/s at batch 8 vs. ~38 tokens/s for pure AR decoding—like opening more checkout lanes and actually using them efficiently.
  • Math reasoning (GSM8K, Math500, Minerva Math):

    • Even with a partially trained math drafter, DEER beat EAGLE-3 on speed and τ (e.g., Math500 speedup 2.12× vs. 1.89×; τ 2.45 vs. 2.04). That’s like a new runner setting a better time even before full training.

Surprising Findings:

  • Long-block resurgence: Acceptance lengths showed an interesting pattern—after an exponential-like decay up to ~30 tokens, the chance of very long acceptances rose again. This hints that independence within the drafted block stabilizes far-ahead tokens, enabling unusually long accepted runs.
  • Stage II matters most on harder tasks: Refinement boosted acceptance more on HumanEval and LiveCodeBench than on simpler sets, showing that near-prefix precision unlocks longer blocks in complex code.
  • Reliable block regeneration: The trained dLLM could repeatedly rewrite masked suffixes coherently, a capability that emerges from the alignment pipeline and is handy for iterative editing.

Overall Take: DEER’s one-step diffusion drafting plus careful alignment consistently turned into longer accepted chunks and higher speedups, with exact AR-level correctness preserved by verification.

05 Discussion & Limitations

Limitations:

  • Draft-quality dependence: If the diffusion drafter is very weak or misaligned, acceptance lengths shrink. DEER still helps, but you get smaller gains.
  • Training sensitivity: Stage II’s exponential weight α needs to stay modest (around 1.01); if it is too large, training can destabilize.
  • Infrastructure maturity: Ecosystem support for dLLM inference with KV cache is still emerging; once standardized, batch gains should be even stronger.

Required Resources:

  • A modest (~0.5B) diffusion drafter and access to teacher outputs for Stage I.
  • GPU time for two short fine-tuning stages (reported wall-clock ranges are comparable to or lower than some AR-based baselines at similar scales).

When NOT to Use:

  • Very short outputs (one or two tokens) where drafting overhead outweighs benefits.
  • Highly stochastic, creative sampling with high temperature where verifier agreement is low and acceptance collapses.
  • Scenarios with severe memory limits that can’t accommodate a small additional drafter.

Open Questions:

  • Can we design adaptive block sizes that expand and shrink on the fly for even better speedups?
  • How far can reliable block regeneration go—can it power structured editing and tool use directly?
  • What are the best architectures and training curricula for domain-specific dLLMs (math, code, dialogue)?
  • How will widespread KV-cache support for dLLMs change the speedup landscape in production systems?

06 Conclusion & Future Work

In three sentences: DEER drafts with a diffusion model and verifies with a standard autoregressive model, eliminating left-to-right error snowballing while preserving exact correctness. A two-stage alignment (AR-style distillation and near-prefix refinement) turns a general dLLM into a sharp continuation drafter, enabling one-step block proposals. The result is much longer accepted blocks and strong end-to-end speedups across models and tasks, even when the drafter is only partially trained.

Main Achievement: Showing that discrete diffusion drafting, properly aligned, is a practical and superior alternative to AR drafting for speculative decoding—achieving up to 32-token accepts and 5.54× speedups without changing the main model’s answers.

Future Directions: Mature KV-cache support for dLLMs, adaptive block sizing, domain-specialized drafters, and integrating reliable block regeneration into editing and tool-using agents.

Why Remember This: DEER reframes the drafting problem—don’t walk token-by-token; jump in blocks and just check your landing. It’s a blueprint for faster, accurate LLMs that feel more responsive in the real world.

Practical Applications

  • Speed up code completion in IDEs so developers see multi-token suggestions instantly.
  • Accelerate data-analysis notebooks that auto-complete queries, plots, and summaries.
  • Serve more chatbot users per GPU by increasing accepted tokens per step and reducing latency.
  • Boost agentic workflows (plan–execute–reflect) by drafting long reasoning steps quickly and verifying them safely.
  • Improve live tutoring systems that solve multi-step math problems with faster yet reliable intermediate steps.
  • Enhance document drafting tools that propose whole sentences or paragraphs while keeping factual accuracy via verification.
  • Enable real-time product search assistants to generate precise, longer answers without lag.
  • Power batch inference for customer support summaries where many conversations are processed in parallel.
  • Support iterative code editing by reliably regenerating masked blocks during refactor and repair loops.
  • Reduce cloud costs and energy per token generated by shifting work to parallel drafting and light verification.
#DEER #speculative decoding #diffusion LLM #autoregressive verification #blockwise drafting #prefix-conditioned continuation #exponential decay loss weighting #reliable block regeneration #acceptance length #HumanEval #EAGLE-3 #Qwen3-30B-A3B #lossless decoding #speedup #single-step denoising