DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
Key Summary
- This paper shows a simple way to turn any strong autoregressive (step-by-step) model into a diffusion vision-language model (parallel, block-by-block) without changing the architecture.
- They call the family of models DiffusionVL, and it reaches or beats the best previous diffusion VLMs while using less than 5% of the training data those models needed.
- The key trick is diffusion finetuning: keep the same transformer, but switch the training and decoding rules from next-token prediction to block-wise diffusion.
- A special block decoding design lets the model generate any length, reuse the KV-cache, and decode many tokens at once for big speedups.
- On tough multimodal benchmarks, DiffusionVL improves results by 34.4% on MMMU-Pro (vision) and 37.5% on MME (Cognition), closing the gap with top AR-VLMs.
- DiffusionVL is about 2× faster at detailed image captioning than prior diffusion VLMs and shows a tunable speed-quality tradeoff via denoising steps and remasking thresholds.
- You can build DiffusionVL either from an AR-VLM (just switch paradigms) or from an AR-LM (first align vision to text with a small connector, then switch paradigms).
- Even when starting from an AR-LM, the converted DiffusionVL matches or rivals AR-style finetuning and clearly beats conversions starting from diffusion LLMs.
- The method keeps the same transformer blocks but uses different attention masks during training vs. inference to enable block-parallel denoising with causal context.
- This work suggests AR and diffusion are closer than we thought, and that many existing AR models can be upgraded into fast, parallel multimodal generators with minimal extra data.
Why This Research Matters
DiffusionVL turns the huge world of existing autoregressive models into fast, practical diffusion vision-language models with minimal extra data. That means chat assistants can describe images and diagrams quicker, on cheaper hardware, and with strong accuracy. Features like KV-cache reuse and variable-length decoding make diffusion competitive not just in theory but in real deployments. This helps accessibility tools, education apps, and workplace software respond in near real time. Companies can cut inference costs by decoding in parallel and reusing compute. Researchers gain a simple recipe to upgrade proven AR models rather than starting from scratch. Overall, it lowers the barrier to building speedy, smart, multimodal AI.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine building a LEGO city one brick at a time. It works, but it's slow. Now imagine placing small chunks of bricks at once. Much faster, right?
The Concept (Autoregressive Models, AR):
- What it is: An autoregressive model writes answers one token at a time, always looking at what it already wrote.
- How it works (recipe):
- Read the prompt (and maybe an image).
- Predict the next token based on all previous tokens.
- Append that token and repeat until done.
- Why it matters: This is very accurate but slow, because the model can't easily guess many tokens in parallel. Anchor: When you ask a chatbot, it replies word by word, like a careful storyteller. (A toy decoding loop follows.)
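Below is a self-contained toy version of that loop in PyTorch. The ToyLM class, vocabulary size, and EOS id are illustrative assumptions standing in for a real pretrained transformer; only the loop structure (predict the next token, append it, stop at <EOS>) mirrors the recipe above.

```python
import torch

torch.manual_seed(0)
VOCAB, EOS_ID = 100, 1  # assumed vocabulary size and end-of-sequence id


class ToyLM(torch.nn.Module):
    """Stand-in for a pretrained transformer: maps a token sequence to next-token logits."""

    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, 32)
        self.head = torch.nn.Linear(32, VOCAB)

    def forward(self, ids):                   # ids: (seq_len,)
        h = self.emb(ids).mean(dim=0)         # toy "summary" of everything written so far
        return self.head(h)                   # logits over the next token


def ar_decode(model, prompt_ids, max_new_tokens=20):
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):           # one token per step
        logits = model(ids)
        next_id = int(torch.argmax(logits))   # greedy pick of the next token
        ids = torch.cat([ids, torch.tensor([next_id])])
        if next_id == EOS_ID:                 # stop when the model says it is done
            break
    return ids


print(ar_decode(ToyLM(), torch.tensor([5, 7, 9])))
```

The <EOS> check is also what gives variable-length generation later in the article: the loop simply stops once the end token shows up.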
Hook: You know how you can sketch the whole picture lightly first, then refine it everywhere at once? That's like diffusion.
The Concept (Diffusion Models):
- What it is: Diffusion models start from noisy guesses and clean them up in steps to get the final answer.
- How it works:
- Add noise to a clean thing (text or image) during training and learn how to undo the noise.
- At test time, start from noise.
- Denoise step by step until the output is clear.
- Why it matters: Diffusion can improve many parts in parallel, which can make decoding faster. Anchor: Image generators that "paint from noise" are diffusion; here we apply a similar idea to text tokens. (A tiny noising sketch follows.)
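A tiny illustration of the "add noise, learn to undo it" idea for text: each token is independently replaced by a [MASK] id with probability t, the noise level. The MASK_ID value and the token ids are assumptions for the demo, not the paper's exact setup.

```python
import torch

torch.manual_seed(0)
MASK_ID = 0  # assumed id reserved for the [MASK] token

def add_text_noise(clean_ids, t):
    """Mask each token independently with probability t (the noise level)."""
    noise = torch.rand(clean_ids.shape) < t
    return torch.where(noise, torch.full_like(clean_ids, MASK_ID), clean_ids), noise

clean = torch.tensor([12, 47, 83, 5, 66, 29, 31, 2])
for t in (0.25, 0.5, 0.9):  # light, medium, heavy noise
    noisy, noise = add_text_noise(clean, t)
    print(f"t={t}: {noisy.tolist()} (masked positions: {noise.nonzero().flatten().tolist()})")

# Training asks the model to predict the original tokens at the masked positions;
# inference starts from an all-[MASK] sequence and fills it in over several steps.
```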
Hook: Think of reading a picture book and explaining it. You use both your eyes (vision) and words (language).
The Concept (Vision-Language Models, VLMs):
- What it is: Models that understand images and text together to answer questions, describe pictures, and more.
- How it works:
- A vision encoder turns the image into vectors.
- A language model reads the text prompt and those image vectors.
- The model generates an answer.
- Why it matters: Many real tasks (captions, charts, diagrams) need both sight and words. Anchor: Given a math diagram, a VLM explains the steps, not just the picture. (A small input-assembly sketch follows.)
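A rough sketch of how the two modalities become one sequence: a vision encoder turns image patches into embeddings, a small projector maps them into the language model's embedding space, and the result is concatenated with the text embeddings. Every dimension and module here is a toy assumption, not the real SigLIP2 or LLM size.

```python
import torch

torch.manual_seed(0)
VISION_DIM, TEXT_DIM, VOCAB = 64, 128, 100       # illustrative sizes

vision_encoder = torch.nn.Linear(3 * 16 * 16, VISION_DIM)   # toy patch encoder
projector = torch.nn.Sequential(                            # small connector (2-layer MLP)
    torch.nn.Linear(VISION_DIM, TEXT_DIM), torch.nn.GELU(),
    torch.nn.Linear(TEXT_DIM, TEXT_DIM))
text_embed = torch.nn.Embedding(VOCAB, TEXT_DIM)

patches = torch.randn(36, 3 * 16 * 16)           # a 6x6 grid of flattened image patches
prompt_ids = torch.tensor([17, 42, 8, 99])       # tokenized text prompt

img_tokens = projector(vision_encoder(patches))  # (36, TEXT_DIM) image "tokens"
txt_tokens = text_embed(prompt_ids)              # (4, TEXT_DIM) text tokens
lm_input = torch.cat([img_tokens, txt_tokens], dim=0)  # one sequence the LM can read
print(lm_input.shape)                            # torch.Size([40, 128])
```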
Hook: Reusing notes is faster than rereading the whole book every time you answer a question.
The Concept (KV-cache):
- What it is: A memory of key/value attention states so the model doesn't recompute old context.
- How it works:
- Store attention summaries as you go.
- Reuse them for future steps.
- Only compute whatās new.
- Why it matters: Huge speedups for long answers. Anchor: Like bookmarking pages you've already summarized. (A bare-bones cache sketch follows.)
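A bare-bones, single-head sketch of the idea: store the keys/values computed for earlier tokens and only project the new ones. The dictionary cache and tiny dimensions are simplifications for the demo.

```python
import torch

torch.manual_seed(0)
D = 16
w_q, w_k, w_v = (torch.nn.Linear(D, D) for _ in range(3))

def attend_with_cache(new_x, cache):
    """Compute attention for the new tokens only, reusing cached keys/values."""
    k_new, v_new = w_k(new_x), w_v(new_x)
    K = torch.cat([cache["k"], k_new]) if cache else k_new
    V = torch.cat([cache["v"], v_new]) if cache else v_new
    attn = torch.softmax(w_q(new_x) @ K.T / D ** 0.5, dim=-1)
    return attn @ V, {"k": K, "v": V}            # output for new tokens + updated cache

prompt = torch.randn(10, D)                      # image + prompt context, processed once
out, cache = attend_with_cache(prompt, None)
block = torch.randn(8, D)                        # a new block of 8 tokens
out, cache = attend_with_cache(block, cache)     # old keys/values reused, not recomputed
print(out.shape, cache["k"].shape)               # torch.Size([8, 16]) torch.Size([18, 16])
```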
Hook: Sometimes you need a short answer; sometimes you need an essay.
The Concept (Variable-length Generation):
- What it is: The model can produce outputs of any needed length and stop when it's done.
- How it works:
- Keep generating until an end token appears.
- Donāt force a fixed size.
- Handle short and long tasks smoothly.
- Why it matters: Real tasks vary a lot; fixed-length outputs can cut off answers or waste time. Anchor: A caption for a simple photo may be short; explaining a busy diagram needs many tokens.
The world before: Most powerful VLMs were autoregressive: great at accuracy but slow at decoding one token at a time. Some researchers tried to speed AR up with speculative decoding (guessing ahead), but it's still limited. Diffusion for language promised parallel decoding, but early diffusion VLMs (like LaViDa, LLaDA-V, and Dimple) lagged in accuracy and missed practical features: they couldn't easily reuse the KV-cache and struggled with variable-length outputs, making real-world inference slower and clunkier.
The problem: There was a performance and practicality gap. Diffusion VLMs were not only weaker than top AR-VLMs on benchmarks, they were also slower in practice because they couldn't reuse the KV-cache and couldn't flexibly stop at the right length.
Failed attempts: Full-sequence diffusion helped parallelism but ignored KV reuse and variable lengths. Block-diffusion ideas existed in small text-only settings, but they didnāt convincingly scale to large multimodal models.
The gap: No one had shown a simple, reliable path to take the many powerful AR models we already have and convert them into strong, practical diffusion VLMs, without changing the architecture or needing massive new data.
Real stakes (why you should care):
- Faster multimodal assistants on your phone or laptop.
- Better real-time captioning for accessibility.
- Quicker chart and diagram help for students and workers.
- Lower cloud bills by decoding in parallel and reusing compute.
- Using less data and training time to reach strong performance.
02 Core Idea
Hook: You know how you can learn a new game by just changing the rules you play by, without buying new pieces?
The Concept (Aha!):
- What it is: Keep the same transformer architecture from any strong autoregressive model, but switch the training and decoding rules to block-wise diffusion using simple finetuning.
- How it works:
- Start with a good AR model (language-only or vision-language).
- If it's language-only, add a small connector to align image and text spaces.
- Finetune with block diffusion: add noise to answer blocks and train the model to denoise them using earlier clean context.
- At inference, decode by blocks in parallel (intra-block), moving forward block by block (inter-block), reusing KV-cache.
- Why it matters: You get diffusion's parallel speed with AR-level skills, plus practical features like variable-length generation and KV reuse. Anchor: It's like turning a solo pianist into a conductor who leads sections of the orchestra at once, without changing the concert hall.
Multiple analogies:
- Assembly line: AR is one worker doing all steps in order; diffusion blocks are many stations working on chunks in parallel, passing along partial results.
- Sketch to paint: AR paints one brushstroke at a time; diffusion roughs in whole areas, then sharpens them together.
- Puzzles: AR fits pieces strictly one by one; diffusion tries small regions at once, refining until they all fit.
Before vs After:
- Before: To get diffusion VLMs, people trained special diffusion models that still lagged in performance and features.
- After: You can "translate" any strong AR model into a diffusion VLM with simple finetuning, reaching SOTA diffusion results with tiny data and gaining 2× faster decoding on detailed captioning.
Why it works (intuition):
- The architecture wasn't the blocker; the paradigm was. Transformers can support both AR and diffusion with different attention masks and training signals.
- Strong AR models already know language (and often vision-language alignment). Switching the rulebook (the objective and attention pattern) teaches them to denoise blocks instead of predicting only the next token.
- Block diffusion keeps long-range order (causal across blocks) while allowing parallel cleanup inside each block.
Building blocks (with quick sandwiches):
- Hook: Sorting laundry in baskets is faster than one sock at a time. Block Diffusion/Decoding: Break the sequence into equal-sized blocks, denoise inside each block in parallel while moving block by block. Why it matters: Parallelism inside blocks, order across blocks. Anchor: When writing a long image description, the model cleans up 8-token chunks at once.
- Hook: Switching from running to biking changes speed without changing your legs. Paradigm Shift (AR → Diffusion): Keep the same network, change the objective/attention to learn denoising. Why it matters: No architecture surgery needed; cheap and reliable. Anchor: Qwen2.5-VL becomes DiffusionVL via finetuning.
- Hook: Adapters let a camera lens fit a different body. Modality Shift (LM → VLM): Train a small connector to align image embeddings to the text space. Why it matters: Lets language-only AR-LMs become multimodal. Anchor: Add a 2-layer MLP projector between SigLIP2 image features and the LLM.
- Hook: Use yesterday's notes to answer faster today. KV-cache Reuse in Diffusion: Cache previous blocks' keys/values and reuse them when denoising the next block. Why it matters: Big speedups and a real-time feel. Anchor: While writing paragraph 2, reuse all context from paragraph 1.
- Hook: Fix the messiest parts first. Low-confidence Remasking: At each denoising step, lock in confident tokens and retry uncertain ones; static = fixed count per step, dynamic = threshold-based. Why it matters: Balances speed and quality. Anchor: If the model is sure about "cat" but unsure about "sleeping," it keeps "cat" and revisits "sleeping." (A one-step sketch follows this list.)
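Here is a one-step illustration of low-confidence remasking over a single 8-token block: score every position, lock in the most confident masked slots, and leave the rest masked for the next pass. The random toy_logits function stands in for the real model, and the ids and block size are assumptions.

```python
import torch

torch.manual_seed(0)
VOCAB, D = 100, 8
MASK_ID = VOCAB          # keep the mask id outside the vocab so predictions never collide with it

def toy_logits(block_ids):
    """Stand-in for the model: random logits per position in the block."""
    return torch.randn(block_ids.shape[0], VOCAB)

def denoise_step(block_ids, tokens_to_fix):
    probs = torch.softmax(toy_logits(block_ids), dim=-1)
    conf, pred = probs.max(dim=-1)                        # best guess + confidence per position
    conf = torch.where(block_ids == MASK_ID, conf, torch.full_like(conf, -1.0))
    keep = torch.topk(conf, k=tokens_to_fix).indices      # most confident still-masked slots
    out = block_ids.clone()
    out[keep] = pred[keep]                                # lock in confident tokens
    return out                                            # the rest stay masked ("remasked")

block = torch.full((D,), MASK_ID)                         # start the block from pure noise
for step in range(D):                                     # static schedule: fix 1 token per step
    block = denoise_step(block, tokens_to_fix=1)
    print(f"step {step + 1}: {block.tolist()}")
```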
03 Methodology
High-level flow: Input (image + prompt) → Encode (vision + text) → Blockify answer (+EOS padding) → Add block-wise noise to answer blocks → Concatenate noisy + clean sequences with a special attention mask → Train to denoise masked tokens (cross-entropy) → Inference with block decoding (KV reuse, variable length).
Step-by-step (training):
- Prepare multimodal inputs.
- What happens: Use a vision encoder (SigLIP2-400M) to turn image I into embeddings; a tokenizer/embedding layer for text prompt P and answer A.
- Why it exists: The model needs both picture and words in the same space to reason.
- Example: Image of a yellow cat; prompt: "What is in the image?"; answer: "There is a yellow cat."
- Align modalities (only if starting from an AR-LM).
- What happens: Train a small MLP projector (connector) so image embeddings fit in the LLM's text space.
- Why it exists: Without alignment, the LLM treats image vectors like nonsense; training would be unstable.
- Example: The connector learns that a patch showing whiskers should map near words like "cat."
- Blockify the target answer.
- What happens: Pad with <EOS> so the answer length is divisible by block size D (default D=8), then split into blocks.
- Why it exists: Diffusion will run inside blocks in parallel, so we need equal-sized chunks.
- Example: "There is a yellow cat. <EOS> <EOS> ..." split into 8-token blocks.
- Add block-wise noise to answer blocks.
- What happens: Randomly mask tokens inside entire answer blocks at a level matched to diffusion time; prompts and image context stay clean.
- Why it exists: The model practices cleaning up noisy answer chunks using earlier clean context; this matches how inference will run.
- Example: In block 2, half the tokens are masked; the model must reconstruct them.
- Build a hybrid attention view (training-time only).
- What happens: Concatenate the clean sequence and the noised sequence, and apply a special attention mask:
- Within a block (noisy side): bidirectional (to denoise in parallel inside the block).
- Between blocks: causal from earlier clean blocks to the current noisy block (to preserve order).
- Why it exists: Lets the model see earlier clean context while parallel-denoising the current block.
- Example: Noisy block 3 can attend to all clean info in blocks 1-2 and to its own positions bidirectionally.
- Train with masked-token cross-entropy (block diffusion objective).
- What happens: Compute loss only on masked tokens in the noisy blocks.
- Why it exists: Focus learning on denoising; speeds convergence and matches the decoding rule.
- Example: If tokens 3, 5, and 7 are masked, only they contribute to the loss. (See the training sketch after this list.)
- Optional: If starting from an AR-VLM, skip step 2 and do end-to-end diffusion finetuning directly (paradigm shift only). If starting from an AR-LM, first train the projector with the standard AR objective (stable alignment), then switch to diffusion finetuning for the whole model (modality + paradigm shift).
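The sketch below pulls steps 3-6 together on a toy answer: pad and blockify, mask tokens inside blocks, build the block-causal attention pattern (bidirectional within a block, causal across blocks), and compute cross-entropy only on masked positions. It is deliberately simplified: a toy embedding-plus-head stands in for the AR transformer, the clean/noisy concatenation is omitted, and the attention mask is only printed rather than applied, so treat it as an illustration of the training signal under assumed ids and sizes, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, EOS_ID, D = 100, 1, 8
MASK_ID = VOCAB                                    # assumed mask id, kept outside the vocab

def blockify(answer_ids, block=D):
    """Pad with <EOS> until the length divides the block size, then reshape into blocks."""
    pad = (-len(answer_ids)) % block
    ids = torch.cat([answer_ids, torch.full((pad,), EOS_ID)])
    return ids.view(-1, block)                     # (num_blocks, block)

def block_noise(blocks, t=0.5):
    """Mask tokens inside each answer block with probability t."""
    noise = torch.rand(blocks.shape) < t
    return torch.where(noise, torch.full_like(blocks, MASK_ID), blocks), noise

def block_causal_mask(num_blocks, block=D):
    """True = attention allowed: bidirectional inside a block, causal across blocks."""
    block_id = torch.arange(num_blocks * block) // block
    return block_id[:, None] >= block_id[None, :]  # attend to the same or earlier blocks only

answer = torch.tensor([12, 47, 83, 5, 66, 29, 31, 2, 9, 54])  # toy tokenized answer
blocks = blockify(answer)                          # (2, 8) after <EOS> padding
noisy, noise = block_noise(blocks)
print(block_causal_mask(blocks.shape[0]).int())    # a real transformer would apply this in attention

# Toy "model": embeddings + linear head; a real run reuses the AR transformer's weights.
embed = torch.nn.Embedding(VOCAB + 1, 32)          # +1 row so MASK_ID has an embedding
head = torch.nn.Linear(32, VOCAB)
logits = head(embed(noisy.flatten()))              # (num_blocks * D, VOCAB)

targets = blocks.flatten().clone()
targets[~noise.flatten()] = -100                   # loss only on the masked positions
loss = F.cross_entropy(logits, targets, ignore_index=-100)
loss.backward()                                    # gradients come only from masked tokens
print(float(loss))
```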
Inference (block decoding with KV reuse):
- Initialize cache with image + prompt.
- What happens: Encode image and prompt once; store keys/values in KV-cache C0.
- Why it exists: Reuse past context to speed up future computations.
- Example: Captioning long scenes doesn't recompute the prompt each time.
- For block m, start from noise and denoise in S steps.
- What happens: Use intra-block diffusion: at each step, compute logits for all positions in the block, pick confident tokens to fix, remask the rest.
- Why it exists: Parallel denoising inside the block speeds decoding.
- Example: With D=8 and S=8 (static), fix one token per step; or fix all above a confidence threshold (dynamic).
- Reuse KV-cache causally across blocks.
- What happens: Concatenate the new blockās KV with the cache from previous blocks: [K_cache; k_m], [V_cache; v_m].
- Why it exists: Provides ordered context and big speedups.
- Example: When writing block 4, the model fully attends to blocks 1-3 via the cache.
- Append the denoised block to the context and continue.
- What happens: After the block is clean enough, treat it as part of the clean history. Stop when <EOS> appears in a block.
- Why it exists: Supports variable-length outputs naturally.
- Example: If <EOS> pops up in block 6 at step 5, stop early.
Two remasking strategies (sketched in code below):
- Static low-confidence remasking: Fix a fixed number of top-confidence tokens each step (e.g., D/S tokens per step). Simple and stable.
- Dynamic low-confidence remasking: Fix all tokens above a confidence threshold each step. Faster on easy content; offers a speed-quality dial.
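To tie the inference pieces together, here is a toy version of the full decoding loop: generate block by block, denoise each block over a few steps with either the static (fixed count) or dynamic (threshold) rule, grow the cached context with every finished block, and stop once <EOS> appears. The toy scorer and the ids are assumptions, and the "cache" here just accumulates finished token ids; a real implementation would cache attention keys/values as in the KV-cache sketch earlier.

```python
import torch

torch.manual_seed(0)
VOCAB, EOS_ID, D, STEPS = 100, 1, 8, 4
MASK_ID = VOCAB                                        # assumed mask id, outside the vocab

def toy_block_logits(context_ids, block_ids):
    """Stand-in for the transformer: logits for each block position given cached context."""
    return torch.randn(block_ids.shape[0], VOCAB)

def denoise_block(context, threshold=None):
    block = torch.full((D,), MASK_ID)                  # start the block from pure noise
    for _ in range(STEPS):
        probs = torch.softmax(toy_block_logits(context, block), dim=-1)
        conf, pred = probs.max(dim=-1)
        conf = torch.where(block == MASK_ID, conf, torch.full_like(conf, -1.0))
        if threshold is None:                          # static: fix D/STEPS tokens per step
            idx = torch.topk(conf, k=D // STEPS).indices
        else:                                          # dynamic: fix everything confident enough
            idx = (conf >= threshold).nonzero().flatten()
        block[idx] = pred[idx]
    still_masked = block == MASK_ID                    # anything left over takes its best guess
    block[still_masked] = pred[still_masked]
    return block

context = torch.tensor([17, 42, 8, 99])                # image + prompt ids, encoded once
output = []
for _ in range(6):                                     # decode up to 6 blocks
    block = denoise_block(context, threshold=0.03)     # threshold=None switches to static mode
    output.extend(block.tolist())
    context = torch.cat([context, block])              # finished blocks join the cached context
    if EOS_ID in block.tolist():                       # variable length: stop once <EOS> shows up
        break
print(output)
```

Raising the threshold (or the number of denoising steps) fixes fewer tokens per pass and revisits more of them, which is the speed-quality dial discussed later in the ablations.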
Secret sauce (why this is clever):
- No architecture changes: One transformer supports both paradigms.
- Training-inference match: Block-wise noise at training mirrors block-wise denoising at test time.
- Practical features: KV reuse and variable-length decoding make diffusion competitive in the real world, not just on paper.
- Data efficiency: Leveraging AR pretraining means you need far less data to reach high diffusion-VLM performance.
04 Experiments & Results
The tests (what and why): The authors evaluate DiffusionVL on a broad set of multimodal benchmarks to check general knowledge (MMMU, MMMU-Pro, MMBench, MMStar, MME), charts/docs (AI2D, ChartQA), and multi-image understanding (MuirBench), plus a detailed image captioning speed-quality test (detailcaps) with BERTScore and tokens-per-second.
Competition (who against whom):
- Strong AR-VLMs: Qwen2.5-VL (3B, 7B), LLaVA variants, Cambrian-1.
- Diffusion VLMs: LaViDa-L, Dimple, LLaDA-V.
- Conversions from AR-LMs and dLLMs for controlled comparisons.
Scoreboard with context:
- Data efficiency: DiffusionVL-7B, trained with 738K instruction samples (under 5% of LLaDA-V's data), outperforms prior diffusion VLMs across many benchmarks.
- Headline gains: +34.4% on MMMU-Pro (vision) and +37.5% on MME (Cognition) vs prior dVLMs, bringing results close to AR-VLM leaders.
- Breadth: On MMBench (en-dev), SeedBench (image), AI2D, ChartQA, DiffusionVL-7B is consistently top-tier among diffusion models and narrows the gap to Qwen2.5-VL-7B AR results.
- Speed: On detailcaps, DiffusionVL-7B is about 2× faster than LLaDA-V-8B at similar parallelism while achieving about 2.02× higher BERTScore, like running faster and writing better at the same time.
Surprising findings:
- Translating from AR-LMs (language-only) works very well: After aligning vision via a small connector, diffusion finetuning yields dVLMs that match or nearly match AR-style finetuning and clearly beat starting from a diffusion LLM base.
- Minimal gap between AR-VLMs and diffusion VLMs after finetuning: No complex annealing schedules needed to stabilize the switch.
Ablations (what changes what):
- Denoising steps: More steps improve description quality (higher BERTScore) at a cost in speed, offering a clear speed-quality dial.
- Block size: Smaller blocks give slightly better accuracy but less parallelism; D=8 is a strong default balancing both.
- Dynamic thresholds (for remasking): Lower thresholds decode more tokens per step, accelerating decoding further with some quality drop; useful for time-critical tasks.
AR-LM vs dLLM conversions:
- Starting from LLaDA-8B (dLLM) and finetuning to a VLM (Full- or Block-Diff) lags behind starting from Qwen2.5-7B (AR-LM) with block diffusion.
- Building from an AR-LM shows better downstream multimodal scores and indicates we don't need a specialized diffusion LLM to get a strong dVLM.
Comparing to concurrent A2D-VL:
- With similar data volume (~400K), DiffusionVL-7B edges out A2D-VL-7B on MMMU and MMMU-Pro, and does so without annealing, supporting the claim that AR → Diffusion is a simple, robust path.
05 Discussion & Limitations
Limitations:
- Still a tradeoff: More denoising steps improve quality but slow decoding; smaller blocks help accuracy but reduce parallelism.
- Data is smaller than some baselines, which is great for efficiency, but rare edge cases might benefit from more targeted finetuning.
- Benchmarks are broad but not infinite; specialized domains (medical imaging, complex multi-image narratives) may need extra adaptation.
- The approach assumes transformer-like architectures where attention masks can be adjusted; exotic architectures may need care.
Required resources:
- A solid AR base model (AR-VLM or AR-LM) and a capable vision encoder (e.g., SigLIP2-400M).
- Modest finetuning compute (relative to pretraining), with GPUs for block diffusion training and decoding.
- Instruction-following multimodal data (hundreds of thousands of samples suffice here).
When not to use:
- Ultra-short answers where AR decoding is already instant and simplicity beats complexity.
- Tasks requiring strict token-by-token control or exact log-prob tracing identical to AR pipelines.
- Environments where custom attention masks or KV handling are hard to implement.
Open questions:
- How far can dynamic remasking go before quality dips too much across domains?
- What is the best learned noise schedule per domain (charts vs photos vs diagrams)?
- Can we blend AR and diffusion at runtime, switching modes adaptively per block?
- How do very large models (70B+) behave with block diffusion in terms of stability and memory?
- Can we extend this to robust multi-image and video reasoning with similar wins in speed and quality?
06 Conclusion & Future Work
Three-sentence summary: DiffusionVL shows that you can keep a strong autoregressive model's architecture and simply switch the rules of training and decoding to make it a fast, accurate diffusion vision-language model. With block-wise diffusion, KV-cache reuse, and variable-length generation, DiffusionVL achieves state-of-the-art diffusion VLM performance while using under 5% of the data of prior methods and runs about 2× faster on detailed captioning. This works whether you start from an AR-VLM (just change the paradigm) or from an AR-LM (align vision, then change the paradigm).
Main achievement: Proving a simple, general translation path from any strong AR model to a high-performing, practical diffusion VLM, with no architecture changes required, while delivering big data efficiency and real inference speedups.
Future directions:
- Adaptive decoding that blends AR and diffusion based on uncertainty and difficulty.
- Learned, task-specific noise schedules and block sizes.
- Scaling to video and multi-image tasks with the same KV reuse and parallelism.
- On-device inference optimizations for mobile and edge.
Why remember this: It reframes AR vs. diffusion as a choice of rules, not hardware; transformers can do both. That means we can upgrade the vast ecosystem of AR models into fast, parallel multimodal generators with minimal extra data and code changes, unlocking quicker, cheaper, and more flexible AI assistants.
Practical Applications
- Real-time image captioning for accessibility readers with faster, more descriptive outputs.
- Interactive diagram and chart understanding in education apps that explain steps clearly and quickly.
- On-device photo assistants that summarize albums without heavy cloud compute.
- Customer support that interprets screenshots and responds with guidance in near real time.
- Document analysis tools that parse forms, receipts, and PDFs efficiently.
- Robotics or AR assistants that describe a scene and suggest actions with low latency.
- Creative tools that co-write alt text and scene descriptions for content creators.
- Medical triage prototypes that quickly caption and annotate non-diagnostic visuals (e.g., workflow images) under expert supervision.
- Compliance tools that scan UI screenshots or charts for policy issues and summarize findings.
- Low-bandwidth deployments where parallel decoding and KV reuse reduce server load and cost.