
DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

Beginner
Zefeng He, Xiaoye Qu, Yafu Li et al. · 12/30/2025
arXiv · PDF

Key Summary

  • DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-writing task.
  • It uses diffusion models, so the “thinking” happens by gradually cleaning up noise into a correct solution picture.
  • This shift makes reasoning faster, more stable in cost (fixed steps), and naturally parallel (it explores many options at once).
  • Across seven vision-heavy tasks (like mazes, Sudoku, TSP, and jigsaw puzzles), DiffThinker beats top MLLMs by large margins.
  • A simple parser turns the solution image back into symbols so results can be compared fairly with text-only models.
  • Flow Matching trains the model to move smoothly from noise to solution, guided by the input image and instructions.
  • Twenty inference steps give the best trade-off between speed and accuracy, and a guidance scale around 4 works best.
  • DiffThinker can team up with an MLLM: it draws several candidate solutions, and the MLLM checks which one fits the rules.
  • A video version can reason too, but it is currently slower and less accurate than the image-based approach.
  • This work launches a new paradigm called Generative Multimodal Reasoning, showing that visual thinking can solve long, vision-centric problems better.

Why This Research Matters

Many real-world problems are about space, shape, and layout; solving them by drawing is more natural than writing long explanations. DiffThinker proves that directly generating solution images makes vision-centric reasoning faster, more accurate, and easier to control. This helps applications like navigation, robotics, logistics, education, and design, where precise geometry matters. Fixed-step inference means predictable costs, which is vital for real-time systems. Native parallel exploration reduces backtracking and speeds up finding good solutions. Teaming DiffThinker with MLLMs blends visual precision and textual logic, opening the door to stronger hybrid agents. As visual foundation models improve, this paradigm could become a standard way to reason about the physical world.

Detailed Explanation


01Background & Problem Definition

You know how you sometimes solve a maze by tracing paths with your finger instead of writing down every move like “up, right, down”? When a problem is very visual, it often helps to think in pictures, not paragraphs.

🍞 Top Bread (Hook) Imagine trying to explain how to solve a jigsaw puzzle only using words, without ever moving any pieces. That would be slow and confusing.

🥬 The Concept: Chain-of-Thought (CoT)

  • What it is: Chain-of-Thought is when an AI (or a person) writes out its steps as text to reach an answer.
  • How it works:
    1. Read the question
    2. Write a step
    3. Check and write the next step
    4. Repeat until the final answer
  • Why it matters: Without CoT, the model may skip steps and make easy-to-miss mistakes on complex problems.
🍞 Bottom Bread (Anchor): When a math word problem asks for the total price with tax, CoT is like writing each calculation line-by-line to avoid errors.
  1. The World Before: Multimodal Large Language Models (MLLMs) became great at understanding both pictures and words. To reason, they usually typed long CoT explanations. People even added tools (like image editors or code) so the model could look again and try again, known as “Thinking with Images.” This helped, but it stayed text-first: the model mostly talked about pictures instead of truly thinking in pictures.

  2. The Problem: Long text chains are slow, hard to control (sometimes the model rambles), and not the best at tracking tiny visual details over many steps. For long, vision-heavy tasks (like planning a long route on a grid or arranging many pieces), errors creep in and performance drops.

  3. Failed Attempts: Researchers tried longer CoT, stronger RL training with verifiable rewards, and multi-turn tool loops for images and even videos. These brought some gains but added latency, complexity, and still centered on text. Video generation can carry state over time but is expensive and not yet accurate enough for tricky puzzles.

  4. The Gap: We needed a way to let the model actually think in pictures while it solves the problem, not just talk about pictures. That means letting the solution live natively in visual space, with precise lines, grids, and shapes.

🍞 Top Bread (Hook) You know how drawing the path in a maze feels more natural than listing directions like “right, right, up, up”? Drawing is the thinking.

🥬 The Concept: Generative Multimodal Reasoning

  • What it is: A new way where the model directly generates the solution as an image, guided by the input picture and the instructions.
  • How it works:
    1. Take the task image and a short instruction
    2. Generate a new image that shows the solution (e.g., the maze path)
    3. Optionally parse that solution image back into symbols for grading
  • Why it matters: Without this, we force visual problems into text boxes, losing spatial precision and slowing everything down.
🍞 Bottom Bread (Anchor): Instead of writing “R, R, U, U,” DiffThinker just draws the red path on the maze.
  5. Real Stakes: Faster, clearer, and more accurate visual reasoning helps in daily life—from robots planning routes in warehouses, to apps helping kids learn Sudoku, to tools arranging layouts or connecting points efficiently (like delivery routes). When the task is about shapes, positions, and paths, drawing the answer is often best.

🍞 Top Bread (Hook) Think of assembling LEGO: you build with your hands and eyes, not by dictating every move as a story.

🥬 The Concept: Image-to-Image Task

  • What it is: A setup where the input is an image and the output is a new image that contains the solution.
  • How it works:
    1. Look at the original image (maze, Sudoku grid, puzzle pieces)
    2. Imagine the correct changes
    3. Generate a solution image that shows those changes
  • Why it matters: Without image-to-image reasoning, the model must convert visuals to text, then back to visuals, losing accuracy and time.
🍞 Bottom Bread (Anchor): You give a scrambled jigsaw as an image; the model outputs a reassembled picture—no long list of narrated moves needed.

02Core Idea

You know how a sculptor reveals a statue by gently removing marble bits until the figure appears? DiffThinker does that with solutions in pictures.

The "Aha!" Moment in one sentence: Let the model solve vision problems by generating the solution image directly with a diffusion model, then (if needed) read that image back into symbols for fair scoring.

Three Analogies:

  • Art class: Instead of describing how to draw a cat step-by-step, just draw the cat correctly.
  • GPS map: Rather than writing directions line-by-line, show a highlighted route on the map.
  • Puzzle table: Instead of narrating moves, simply arrange the pieces into the finished scene.

Before vs After:

  • Before: MLLMs reason mainly in text, are slow for long CoT, and struggle with fine spatial details.
  • After: DiffThinker reasons in images, uses fixed steps (predictable speed), and naturally preserves lines, grids, and shapes with high precision.

🍞 Top Bread (Hook) Imagine cleaning a foggy window until the scene outside becomes clear.

🥬 The Concept: Diffusion Models

  • What it is: A method where AI starts from noisy images and gradually removes noise to reveal a clean result.
  • How it works:
    1. Start with random noise
    2. Learn how to push the noise toward real images
    3. Take small steps that make the picture clearer
    4. Stop when a clean, correct image appears
  • Why it matters: Without this gradual cleaning, the model may jump to messy or wrong images.
🍞 Bottom Bread (Anchor): Think of sharpening a blurry maze picture until the perfect red path is visible.

🍞 Top Bread (Hook) You know how floating down a calm river takes you smoothly from upstream to downstream?

🥬 The Concept: Flow Matching

  • What it is: A way to teach the model the smooth “flow” from noise to data so it learns stable and controllable steps.
  • How it works:
    1. Pick a point between noise and the target image
    2. Learn the best direction to move from that point toward the target
    3. Repeat for many points so the whole path becomes smooth
    4. During inference, follow this learned flow in fixed steps
  • Why it matters: Without flow matching, training can be unstable and slower to converge.
🍞 Bottom Bread (Anchor): It’s like marking arrows along a river so a boat always knows the best way downstream.
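
To make this concrete, here is a minimal PyTorch sketch of a flow-matching training step under the common straight-line (rectified-flow) formulation: interpolate between noise and the target solution latent, and train the model to predict the direction of that path. The `model` signature and names are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, clean_latent, cond):
    """One flow-matching training step (minimal sketch, not the paper's code).

    clean_latent: latent of the ground-truth solution image, shape (B, C, H, W)
    cond:         conditioning (input-image features + instruction), passed through as-is
    """
    noise = torch.randn_like(clean_latent)             # the pure-noise end of the path
    t = torch.rand(clean_latent.size(0), 1, 1, 1)      # random time in [0, 1] per sample
    x_t = (1 - t) * noise + t * clean_latent           # a point on the straight noise-to-data path
    target_velocity = clean_latent - noise             # direction of the flow at that point
    pred_velocity = model(x_t, t.flatten(), cond)      # model predicts the flow direction
    return F.mse_loss(pred_velocity, target_velocity)  # match predicted and true velocities
```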

🍞 Top Bread (Hook) Cooking a complex meal needs a chef who can mix many ingredients just right.

🥬 The Concept: Multimodal Diffusion Transformer (MMDiT)

  • What it is: A transformer that mixes visual and textual cues to guide diffusion toward the correct solution image.
  • How it works:
    1. Read the input image and instruction together
    2. Share information across text and vision tokens
    3. Predict the next small improvement to the image
    4. Repeat until the solution appears
  • Why it matters: Without cross-modal mixing, the model might ignore the rules in the instruction or miss details in the image.
🍞 Bottom Bread (Anchor): For a maze with “avoid black squares,” MMDiT ensures the drawn path obeys that rule.
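
The key architectural trick is that instruction tokens and image tokens share one attention pass. Below is a single illustrative block in PyTorch that captures that idea; it is a toy sketch, not the paper's 20B MMDiT.

```python
import torch
import torch.nn as nn

class JointTextImageBlock(nn.Module):
    """Toy sketch of the MMDiT idea: text and image tokens attend to each other jointly."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens, image_tokens):
        # One shared sequence: instruction tokens first, image-latent tokens after.
        x = torch.cat([text_tokens, image_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # every token can look at every other token
        x = x + self.mlp(self.norm2(x))
        # Only the image part carries the predicted update to the latent.
        return x[:, text_tokens.size(1):]
```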

🍞 Top Bread (Hook) Packing a suitcase tightly, then unpacking everything neatly.

🥬 The Concept: Variational Autoencoder (VAE)

  • What it is: A tool that compresses images into a small code (latent) and can decode them back efficiently.
  • How it works:
    1. Encode the image into a compact latent space
    2. Do diffusion steps in this small space (faster)
    3. Decode the final latent back to a crisp image
  • Why it matters: Without VAEs, generation would be slower and more memory-hungry.
🍞 Bottom Bread (Anchor): Like folding clothes to fit a carry-on, then unfolding them at the destination.
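
As a rough illustration of that compression, here is a toy autoencoder in PyTorch that shrinks each image side by 8x so diffusion can run in a small latent space. It omits the variational (KL) part of a real VAE and is only meant to show the shape bookkeeping.

```python
import torch
import torch.nn as nn

class TinyLatentAutoencoder(nn.Module):
    """Toy stand-in for a latent-diffusion VAE: 3x256x256 pixels <-> 4x32x32 latents."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(            # each stride-2 conv halves height and width
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(            # mirror image: upsample back to pixels
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def forward(self, image):
        latent = self.encoder(image)   # diffusion/flow steps would operate on this latent
        return self.decoder(latent)    # decode the final latent back to a crisp image

# TinyLatentAutoencoder()(torch.randn(1, 3, 256, 256)).shape -> torch.Size([1, 3, 256, 256])
```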

🍞 Top Bread (Hook) Sometimes you need your GPS to be more decisive, not too vague.

🥬 The Concept: Classifier-Free Guidance (CFG)

  • What it is: A knob that controls how strongly the instruction steers the image generation.
  • How it works:
    1. Predict with and without the instruction
    2. Blend them with a weight (the guidance scale)
    3. Higher weight enforces rules more strongly (but too high can distort)
  • Why it matters: Without CFG, paths may be faint or off-target; too much CFG can cause artifacts.
🍞 Bottom Bread (Anchor): On mazes, guidance around 4 makes bold, clean solution lines without breaking the picture.
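
The blend itself is one line of arithmetic. A hedged sketch, assuming a velocity model that can be called with or without the instruction:

```python
def guided_velocity(model, x_t, t, instruction, guidance_scale: float = 4.0):
    """Classifier-free guidance (sketch): blend conditional and unconditional predictions.

    `model` is a hypothetical stand-in that accepts `instruction=None` to drop the text condition.
    """
    v_cond = model(x_t, t, instruction)    # prediction that follows the instruction
    v_uncond = model(x_t, t, None)         # prediction with the instruction dropped
    # Push further in the direction the instruction suggests; a scale near 4 worked best here.
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```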

Why It Works (intuition): Many vision tasks are about geometry, layout, and global constraints. Drawing the answer preserves this structure. Diffusion offers fixed, learnable steps, so time is predictable and the model can explore multiple partial ideas in early steps (native parallelism) before settling on the best one. Parsing the final image back to symbols lets us compare apples-to-apples with text baselines.

Building Blocks:

  • Input image and short instruction
  • VAE to compress to latents
  • MMDiT to mix text–vision cues
  • Flow Matching to teach smooth, fixed-step generation
  • CFG to control adherence to rules
  • Image parser to convert solutions back to symbols for scoring

03Methodology

At a high level: (Input Image + Instruction) → Encode to Latent → Learn Smooth Flow (training) / Follow Flow (inference) → Decode Solution Image → Parse to Symbols.
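
Sketched as code, the whole loop looks roughly like the following; every callable (`encode`, `velocity_model`, `decode`, `parse`) is a hypothetical stand-in used to make the flow concrete, not the released implementation.

```python
import torch

def solve_visual_puzzle(task_image, instruction, encode, velocity_model, decode, parse,
                        steps: int = 20):
    """End-to-end sketch of the pipeline above (all components are illustrative stand-ins)."""
    latent = torch.randn_like(encode(task_image))        # start from pure noise in latent space
    for i in range(steps):                               # fixed step count = predictable runtime
        t = torch.full((latent.size(0),), i / steps)
        v = velocity_model(latent, t, task_image, instruction)  # flow direction for this step
        latent = latent + (1.0 / steps) * v              # one Euler step along the learned flow
    solution_image = decode(latent)                      # pixels that show the drawn solution
    return parse(solution_image)                         # symbols for grading (e.g., Sudoku digits)
```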

Step 1: Prepare Inputs

  • What happens: The model receives a visual puzzle (maze, Sudoku grid, jigsaw pieces) plus a short instruction.
  • Why it exists: The instruction sets the goal (e.g., “find a path,” “complete the grid”). Without it, the model might draw something unrelated.
  • Example: Maze image + “Draw a valid path from the yellow start to the blue goal.”

🍞 Top Bread (Hook) Think of shrinking a big poster so it’s easier to mail, then expanding it later.

🥬 The Concept: Variational Autoencoder (VAE) (recap)

  • What it is: Compresses images to a small latent code and reconstructs them later.
  • How it works:
    1. Encoder turns the solution image into a compact latent
    2. Decoder turns latents back into images
  • Why it matters: Without VAE, generation is slower with bigger memory costs.
🍞 Bottom Bread (Anchor): The maze poster folds into a tidy roll (latent), then unrolls cleanly at the end.

Step 2: Training with Flow Matching

  • What happens: The model learns the “velocity field” that points from noisy latents toward the correct solution latent.
  • Why it exists: Teaches stable, predictable steps that move ever closer to the right image.
  • Example: For Sudoku, training learns how to move from noise to a valid, fully-filled grid image that obeys all rules.

🍞 Top Bread (Hook) Following arrows on a treasure map so you don’t get lost.

🥬 The Concept: Flow Matching (recap)

  • What it is: Learning small, reliable directions from noise to data across time.
  • How it works:
    1. Sample an in-between point between noise and the target
    2. Predict the best direction toward the target
    3. Repeat across many points and times
  • Why it matters: Without it, steps could be wobbly and slow to learn.
🍞 Bottom Bread (Anchor): The arrows keep the maze path training steady and efficient.

Step 3: Multimodal Guidance During Training

  • What happens: A Multimodal Diffusion Transformer (MMDiT) mixes the instruction with visual features to guide each “cleaning” step.
  • Why it exists: Ensures the generated image obeys the task rules and aligns with the input image.
  • Example: In TSP, the instruction (“find the shortest tour”) and city-dot image together guide the model to draw a single continuous loop.

🍞 Top Bread (Hook) A conductor keeps all instruments on the same beat.

🥬 The Concept: MMDiT (recap)

  • What it is: A transformer that fuses text and vision tokens to steer diffusion.
  • How it works:
    1. Attend across text and image features
    2. Predict the next small improvement to the latent
    3. Repeat across fixed steps
  • Why it matters: Without it, the solution might ignore the instruction or visual constraints.
🍞 Bottom Bread (Anchor): The “avoid holes” rule in VSP is respected while drawing the red path.

Step 4: Inference with Fixed Steps

  • What happens: Start from noise in latent space and follow the learned flow for a fixed number of steps (about 20 is a sweet spot).
  • Why it exists: Fixed steps = predictable, controllable runtime that doesn’t balloon with problem difficulty.
  • Example: Maze level-32 still takes ~20 steps; the model doesn’t stall writing thousands of tokens.

Step 5: Classifier-Free Guidance (CFG) Tuning

  • What happens: Adjust a guidance weight to balance following the instruction strongly vs. keeping images clean.
  • Why it exists: Too little guidance makes faint or uncertain solutions; too much causes artifacts.
  • Example: CFG ≈ 4 draws bold, clean maze paths.

🍞 Top Bread (Hook) Set your bike’s handlebar sensitivity—not too stiff, not too loose.

🥬 The Concept: CFG (recap)

  • What it is: A knob to dial task adherence.
  • How it works:
    1. Predict with and without instruction
    2. Blend them using a scale
  • Why it matters: Avoids weak or overcooked generations.
🍞 Bottom Bread (Anchor): At scale 4, Sudoku digits appear crisp and consistent with the puzzle.

Step 6: Decode to Image and Parse to Symbols

  • What happens: Decode the final latent with the VAE to get a pixel image. Then parse the solution image into symbols to compare with ground truth.
  • Why it exists: Fairly scores the visual method against text baselines and prevents answer leakage.
  • Example: In Sudoku, read the 81 digits from the generated grid and compare them cell-by-cell.
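
As an illustration of this step, here is a hedged sketch of a Sudoku parser: slice the generated grid image into 81 cells and classify each one. The `classify_digit` helper is hypothetical (template matching or a tiny OCR model would do); the paper's actual parser may differ.

```python
import numpy as np

def parse_sudoku_image(grid_image: np.ndarray, classify_digit) -> list[list[int]]:
    """Sketch of image-to-symbol parsing for Sudoku (illustrative, not the paper's parser)."""
    h, w = grid_image.shape[:2]
    cell_h, cell_w = h // 9, w // 9
    grid = []
    for row in range(9):
        digits = []
        for col in range(9):
            cell = grid_image[row * cell_h:(row + 1) * cell_h,
                              col * cell_w:(col + 1) * cell_w]
            digits.append(classify_digit(cell))   # hypothetical per-cell digit classifier
        grid.append(digits)
    return grid   # compare cell-by-cell with the ground-truth grid
```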

The Secret Sauce:

  • Efficiency: Training time is on par with strong baselines; inference is ~1.1s and comparable to smaller MLLMs but more accurate.
  • Controllability: Fixed steps mean stable, predictable cost (no runaway CoT length).
  • Native Parallelism: Early steps explore multiple options in one forward pass, then prune and focus—no explicit backtracking needed.
  • Collaboration: Generate N candidate images; let an MLLM verify which candidate satisfies all constraints. This combo outperforms either alone.
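
That collaboration can be sketched as a simple best-of-N loop; `generate_candidate` and `verify` below are hypothetical stand-ins for a DiffThinker sample and an MLLM rule check.

```python
def best_of_n_with_verifier(generate_candidate, verify, n: int = 4):
    """Sketch of DiffThinker + MLLM teamwork: draw N candidates, keep the first that checks out."""
    candidates = [generate_candidate() for _ in range(n)]   # N independent solution images
    for image in candidates:
        if verify(image):          # e.g., an MLLM confirms the drawn path obeys every rule
            return image
    return candidates[0]           # fall back to the first sample if none is verified
```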

🍞 Top Bread (Hook) When solving a maze, you often sketch many light pencil paths before darkening the right one.

🥬 The Concept: Native Parallel Reasoning (emergent behavior)

  • What it is: DiffThinker’s early steps entertain multiple partial solutions simultaneously.
  • How it works:
    1. Early diffusion steps keep options open
    2. Global constraints prune bad paths
    3. Later steps firm up the best path
  • Why it matters: Without it, you waste time committing early, then undoing mistakes.
🍞 Bottom Bread (Anchor): The model sketches multiple faint routes at first and then converges to a single clean solution path.

04Experiments & Results

The Test: Seven tasks across four domains—sequential planning (VSP, VSP-Super, Maze), combinatorial optimization (TSP), constraint satisfaction (Sudoku), and spatial configuration (Jigsaw, VisPuzzle). Accuracy is measured at multiple difficulty levels with strict parsing from images to symbols for fairness.

The Competition: Strong closed models (GPT-5, Gemini-3-Flash), open-source MLLMs (Qwen3-VL-8B/32B in zero-shot, SFT, and RL/GRPO settings), and image-edit baselines. DiffThinker uses a 20B MMDiT built on Qwen-Image-Edit foundations, with a newer DiffThinker++ variant for the main leaderboard.

Scoreboard (with context):

  • Overall: DiffThinker wins big—+314.2% over GPT-5, +111.6% over Gemini-3-Flash, and +39.0% over fine-tuned Qwen3-VL-32B.
  • Sequential Planning (VSP, VSP-Super, Maze): While MLLMs’ accuracy drops as grids get larger, DiffThinker stays high—like scoring an A when others slide from B to C as homework gets harder.
  • Combinatorial Optimization (TSP): DiffThinker draws a single neat tour loop; accuracy is strong against baselines that output coordinate orders.
  • Constraint Satisfaction (Sudoku): Generates full, valid grids with high precision—no need to narrate 81 numbers.
  • Spatial Configuration (Jigsaw, VisPuzzle): Near-perfect reconstructions; DiffThinker shines where spatial layout is king.

🍞 Top Bread (Hook) Planning isn’t just steps; it’s also about seeing the map.

🥬 The Concept: Sequential Planning

  • What it is: Solving tasks by laying out actions in order.
  • How it works:
    1. Understand start and goal
    2. Consider obstacles
    3. Build a step-by-step path
  • Why it matters: Without planning, you get stuck or loop back.
🍞 Bottom Bread (Anchor): In Maze, the red path must reach the goal without crossing walls.
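
For intuition about what "a valid path" means once the drawn route has been parsed into grid cells, a small helper like this (illustrative, not from the paper) checks it:

```python
def is_valid_maze_path(walls: set[tuple[int, int]], path: list[tuple[int, int]],
                       start: tuple[int, int], goal: tuple[int, int]) -> bool:
    """Check a parsed grid-maze path: right endpoints, one step at a time, no wall cells."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:   # each move must go to an adjacent cell
            return False
    return all(cell not in walls for cell in path)
```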

🍞 Top Bread (Hook) Choosing the best route among many possibilities is like picking the shortest line at the supermarket.

🥬 The Concept: Combinatorial Optimization

  • What it is: Picking the best arrangement from many options.
  • How it works:
    1. List constraints (visit each city once)
    2. Compare options
    3. Choose the shortest path
  • Why it matters: Without it, you waste time or miss the optimum.
🍞 Bottom Bread (Anchor): TSP needs a closed loop visiting all dots with minimal distance.
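
Once the drawn tour is parsed into a visiting order, scoring it is simple arithmetic; this helper (not from the paper) computes the closed-loop length being minimized:

```python
import math

def tour_length(cities: list[tuple[float, float]], order: list[int]) -> float:
    """Total length of a closed tour that visits every city once (simple illustrative helper)."""
    total = 0.0
    for i in range(len(order)):
        x1, y1 = cities[order[i]]
        x2, y2 = cities[order[(i + 1) % len(order)]]   # wrap around to close the loop
        total += math.hypot(x2 - x1, y2 - y1)
    return total
```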

🍞 Top Bread (Hook) Every Sudoku is like a rule-following game.

🥬 The Concept: Constraint Satisfaction

  • What it is: Filling in values while obeying rules.
  • How it works:
    1. Note constraints per row, column, box
    2. Fill candidates
    3. Keep only values that fit all rules
  • Why it matters: Without rules, the grid is nonsense.
🍞 Bottom Bread (Anchor): A valid Sudoku has no repeats in any row, column, or 3×3 box.
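
Checking those rules on a completed grid takes only a few lines; this helper (illustrative, not from the paper) verifies every row, column, and 3×3 box:

```python
def is_valid_sudoku(grid: list[list[int]]) -> bool:
    """True if a completed 9x9 grid has the digits 1-9 exactly once per row, column, and box."""
    def complete(cells):
        return sorted(cells) == list(range(1, 10))

    rows = grid
    cols = [[grid[r][c] for r in range(9)] for c in range(9)]
    boxes = [[grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)]
             for br in range(0, 9, 3) for bc in range(0, 9, 3)]
    return all(complete(group) for group in rows + cols + boxes)
```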

🍞 Top Bread (Hook) Arranging furniture so the room feels right.

🥬 The Concept: Spatial Configuration

  • What it is: Placing parts so the whole picture works.
  • How it works:
    1. Recognize piece content and edges
    2. Try placements
    3. Settle on the arrangement that fits globally
  • Why it matters: Without global fit, the image looks wrong.
🍞 Bottom Bread (Anchor): Jigsaw tiles form a clean, continuous scene when placed correctly.

Surprising Findings:

  • Fixed Steps, Fixed Cost: About 20 inference steps hit the sweet spot—accuracy jumps when going from 10 to 20 steps, then plateaus.
  • CFG Sweet Spot: Guidance around 4 yields bold, precise solutions; too low is timid, too high creates artifacts.
  • Native Parallelism: Visualizing intermediate steps shows multiple options explored early, then narrowed—no extra code for backtracking.
  • Collaboration Wins: Generating N candidates and letting an MLLM verify them boosts accuracy further.
  • Video vs Image: A video-based DiffThinker can solve mazes too, but is slower (~2.0s vs ~1.1s) and less accurate under current models.

Efficiency Results:

  • Training time comparable to strong SFT baselines and much less than RL (e.g., GRPO) overhead.
  • Inference latency ~1.1s—like handing in your quiz early with full marks, while others keep writing.

05Discussion & Limitations

Limitations:

  • Out-of-Distribution (OOD) Generalization: Zero-shot performance on very unfamiliar puzzles is bounded by the base generative model. When the world looks new, the visual generator may hesitate.
  • Vision-Centric Focus: Text-heavy math and symbolic logic remain areas where classic MLLMs may excel.
  • Parser Dependence: Fair scoring needs reliable image parsing; if parsing fails, a correct image might be misread.
  • Fine Control Edge Cases: Too strong guidance can cause artifacts; too weak can make faint, indecisive paths.

Required Resources:

  • A modern GPU (the paper used H200s) for training and fast inference.
  • Labeled datasets with parseable structure (grids, coordinates, indices) to enable objective evaluation.
  • Foundation models: a capable image diffusion backbone (e.g., MMDiT) and a VAE for latents.

When NOT to Use:

  • Purely text or symbolic problems (e.g., long algebra proofs) where visual layout adds little value.
  • Ultra-low-latency edge devices without sufficient compute.
  • Domains without clear visual parseability (hard to convert image solutions back into exact symbols).

Open Questions:

  • Stronger Visual Foundations: How far can accuracy climb with next-gen multimodal diffusion backbones?
  • Universal Parser: Can we build a robust, general image-to-symbol parser across many task types?
  • Hybrid Reasoners: What’s the best protocol for DiffThinker + MLLM teamwork (candidate counts, verification rules)?
  • Video Reasoning: Can future, more efficient video generators unlock even better long-horizon planning?
  • Data Efficiency: How little data is needed if we leverage synthetic curricula or self-training?

06Conclusion & Future Work

Three-Sentence Summary:

  • DiffThinker reframes multimodal reasoning as generating a solution image directly, then parsing it back into symbols for fair comparison.
  • Built on diffusion with Flow Matching, MMDiT guidance, a VAE latent space, and tuned CFG, it solves long, visual tasks with high spatial precision, fixed compute steps, and native parallel exploration.
  • Across seven tasks in four domains, it outperforms strong MLLMs, and collaboration with MLLMs pushes results even further.

Main Achievement:

  • Establishing Generative Multimodal Reasoning—showing that visual-native, image-to-image generation is a superior way to reason on vision-centric, long-horizon problems.

Future Directions:

  • Stronger base generative models optimized for reasoning, better OOD generalization, and universal parsers.
  • More efficient video-based reasoning to leverage temporal coherence.
  • Deeper hybrid systems where DiffThinker proposes visual candidates and MLLMs verify or refine them.

Why Remember This:

  • When the problem is visual, think in visuals. DiffThinker proves that drawing the answer can be faster, clearer, more accurate, and easier to control than writing long explanations—launching a new path for multimodal AI reasoning.

Practical Applications

  • Robot navigation in warehouses: Generate safe, obstacle-avoiding paths directly on floor maps.
  • Route planning for delivery drones or vehicles by drawing optimized tours on city maps (TSP-like).
  • Interactive tutoring tools that visually complete Sudoku or maze puzzles to teach reasoning strategies.
  • Layout and UI design assistance that arranges components into clean, readable configurations.
  • Industrial packing and bin placement by visually configuring items to maximize space usage.
  • AR/VR puzzle-solving assistants that show visual hints or complete solutions overlaid on scenes.
  • Quality control in manufacturing by rearranging or aligning parts in images to meet spatial specs.
  • Autonomous inspection planning where paths over large structures are drawn to cover all checkpoints.
  • Education apps that convert kids’ rough puzzle attempts into corrected, rule-following visual solutions.
  • Hybrid AI systems where DiffThinker proposes visual candidates and an MLLM verifies rule compliance.
#DiffThinker#Generative Multimodal Reasoning#Diffusion Models#Flow Matching#Multimodal Diffusion Transformer#VAE#Classifier-Free Guidance#Vision-Centric Reasoning#Sequential Planning#Combinatorial Optimization#Constraint Satisfaction#Spatial Configuration#Image-to-Image Generation#Native Parallel Reasoning#MLLM Collaboration