
Unified Thinker: A General Reasoning Modular Core for Image Generation

Intermediate
Sashuai Zhou, Qiang Zhou, Jijin Hu et al. · 1/6/2026
arXiv · PDF

Key Summary

  • Unified Thinker separates “thinking” (planning) from “drawing” (image generation) so complex instructions get turned into clear, doable steps before any pixels are painted.
  • It plugs a dedicated Thinker (an MLLM) in front of any image Generator (like a diffusion model), so you can upgrade logic without retraining the whole drawer.
  • A new dataset, HieraReason-40K, teaches the Thinker to write neat, structured plans that include intent, constraints, and ordered sub-goals.
  • Training happens in two stages: first supervised learning to learn the planning format, then reinforcement learning that uses the final image as feedback to make plans more executable.
  • The RL uses group-relative rewards (GRPO): sample several plans, render them, and boost the plans that produce the best images.
  • On tough reasoning benchmarks (RISEBench, WiseBench), Unified Thinker narrows the gap to top closed-source systems and clearly beats strong open-source baselines.
  • It improves instruction following and visual plausibility without hurting general image quality, and even slightly lifts aesthetics on PRISM.
  • The Thinker transfers across different generators (e.g., it also boosts BAGEL), proving the module is portable.
  • There is a small trade-off: extra planning adds latency, and some very precise edits (like fine geometry or perfect text rendering) remain challenging.
  • Overall, the paper shows that executable reasoning—plans grounded in pixel feedback—is key to reliable, logic-heavy image generation and editing.

Why This Research Matters

Unified Thinker makes creative tools more trustworthy by ensuring images actually follow complex instructions, not just produce pretty guesses. Designers can request precise changes—like adding an object without touching the rest—and get faithful results. Teachers and students can visualize logical steps (stacks, sequences, timelines) accurately, making learning clearer. Brands and studios can keep characters, layouts, and continuity consistent while applying targeted edits at scale. Scientists and analysts can render reasoned visual hypotheses (e.g., temporal or causal changes) with better control. Because the Thinker is modular, it can ride along with new generators as they appear, upgrading logic without starting from scratch.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how a great Lego set looks amazing on the box, but if you skip the instructions and just start snapping bricks together, your castle might look cool but won’t match what you wanted?

🥬 The Concept: Generative models for images are amazing at making pretty pictures, but they often skip the “instructions” part—especially when instructions are tricky, like “put three red cubes under a blue one and leave everything else the same.”

  • How it works (before this paper): People typed a prompt, the model guessed what to draw, and everyone hoped for the best. If the prompt had hidden logic (counting, order, cause-and-effect), models often missed it.
  • Why it matters: Without careful thinking first, models make images that look nice but break the rules you asked for. 🍞 Anchor: Ask for “a stack from bottom to top: red, green, blue, white.” Many models might jumble the order or forget a color.

🍞 Hook: Imagine telling a friend, “Bake the cake for 45 minutes and don’t change anything else,” and they decorate it, repaint the kitchen, and move the table. That’s not following directions!

🥬 The Concept: The reasoning–execution gap is when a model can talk about the right steps, but its drawing doesn’t follow them.

  • How it works: Some systems plan in text, then hand that plan to a generator that can’t quite do it. The plan sounds smart, but the picture goes off-track.
  • Why it matters: Pretty images that ignore instructions are useless for editing, design, or scientific visuals. 🍞 Anchor: “Add a red line from the mushroom house to the mud pit.” If the model changes the grass color or moves the pit, it failed.

🍞 Hook: Think of two approaches to cooking. A) Mix every step in one giant pot. B) Split jobs: the head chef plans; the line cook executes.

🥬 The Concept: People tried two main approaches before this paper: built-in reasoning (everything trained together) and external planners (an MLLM writes steps, a frozen generator draws).

  • How they work: Built-in reasoning can hurt picture quality by overcomplicating training. External planners keep things modular but often propose steps the generator can’t execute.
  • Why it matters: We need both modularity and executable plans. 🍞 Anchor: A planner that asks the generator to “solve a Sudoku then draw it” fails if the generator can’t compute Sudoku.

🍞 Hook: Imagine having a universal “coach” who can guide any player on any team.

🥬 The Concept: What was missing was a principled, reusable way to plan images: a general Thinker that writes grounded, checkable steps the Generator can truly follow.

  • How it works: The Thinker turns big requests into clear, structured, visual instructions that match what the Generator is good at.
  • Why it matters: Good plans make good pictures—consistently. 🍞 Anchor: “Solve the Sudoku and fill in the blanks exactly here.” The Thinker outputs the final grid; the Generator only draws the filled cells.

🍞 Hook: Why should you care? Because in daily life we want edits that stick to the plan—change the outfit, not the face; add snow, not a new background; stack colors in order, not at random.

🥬 The Concept: This paper’s goal is reliable instruction following for both text-to-image and image editing using a reusable planning brain.

  • How it works: Decouple the Thinker (plans) and Generator (pixels), then align them with pixel-level feedback.
  • Why it matters: Your creative tools become trustworthy: they do what you asked, the way you asked. 🍞 Anchor: “Make the background a green grassland—don’t touch the person.” The final image keeps the person, swaps only the background, and looks natural.

02Core Idea

🍞 Hook: Imagine building a treehouse. The architect decides the blueprint; the carpenter handles the wood. If the blueprint is clear, the build is smooth.

🥬 The Concept: The “Aha!” is to split image creation into a Thinker (plans) and a Generator (draws), then train the Thinker so its plans are executable by judging the final pixels—not just nice-sounding text.

  • How it works: 1) The Thinker writes a structured plan: intent, explicit constraints, and ordered sub-goals. 2) The Generator consumes that as conditioning to render the image. 3) Reinforcement learning uses the finished image as feedback to improve planning toward visual correctness.
  • Why it matters: Without executable plans, models keep making pretty-but-wrong images. With them, logic-heavy tasks finally work. 🍞 Anchor: “From bottom to top: red, green, blue, white.” The Thinker writes that exact stack as a visual spec; the Generator draws it precisely.

Multiple analogies:

  • Architect and builder: The Thinker drafts floor-by-floor instructions; the Generator hammers the nails.
  • Coach and player: The Thinker sets the play; the Generator runs it on the field.
  • Recipe and chef: The Thinker lists ingredients and steps; the Generator cooks to taste.

Before vs. After:

  • Before: One-shot prompts; vague or text-only plans; frequent mismatches between plan and pixels.
  • After: Structured, generator-aware plans; images that respect order, counts, and constraints; fewer retries and less drift.

Why it works (intuition behind the math):

  • Plans are standardized and explicit (intent + constraints + sub-goals), so the Generator gets a clean target.
  • Pixel-grounded RL rewards only what actually renders well, aligning the Thinker’s policy with the Generator’s capabilities.
  • Decoupling lets each improve independently: upgrade the Thinker’s logic or swap the Generator without breaking the other.

Building blocks (each with a mini sandwich):

  • 🍞 Hook: You know how a checklist keeps you from forgetting steps? 🥬 The Concept: Structured reasoning trace is a neat plan with intent, constraints, and sub-goals.

    • How: Parse the request → make hidden rules explicit → order sub-steps → output an enhanced prompt.
    • Why: Without structure, the Generator guesses and drifts. 🍞 Anchor: “Edit-only: add snow on top” keeps all other pixels unchanged.
  • 🍞 Hook: Imagine telling a helper exactly what to draw, line by line. 🥬 The Concept: Enhanced prompt is the final, executable visual specification the Generator consumes.

    • How: Convert the resolved target into concrete visual terms the Generator understands.
    • Why: If the prompt is vague, pixels go astray. 🍞 Anchor: “Four aligned cubes, bottom-to-top: red, green, blue, white, studio lighting.”
  • 🍞 Hook: Like a librarian who can read pictures and words. 🥬 The Concept: The Thinker is an MLLM that understands images and text to produce plans.

    • How: It analyzes input, reasons, and writes structured outputs.
    • Why: Text-only thinking misses important visual context in edits. 🍞 Anchor: It spots where to draw the red path in the photo before instructing the Generator.
  • 🍞 Hook: Like a painter who turns a sketch into a masterpiece. 🥬 The Concept: The Generator (diffusion model) turns the enhanced prompt into pixels.

    • How: Gradually denoise from noise to image, guided by the prompt.
    • Why: You need a strong renderer to realize the plan faithfully. 🍞 Anchor: From structured spec to a photorealistic winter scene with correct snow effects.
  • 🍞 Hook: Practice makes perfect, but only if you judge the final result. 🥬 The Concept: Pixel-grounded RL teaches the Thinker which plans actually render well.

    • How: Sample multiple plans, render images, score them, and nudge the Thinker toward higher-scoring plans.
    • Why: Text plausibility ≠ visual executability. 🍞 Anchor: The plan that yields the sharpest, most correct stack order gets boosted.

03Methodology

At a high level: Input (text and/or image) → Thinker (structured plan + enhanced prompt) → Generator (image) → Pixel-based rewards (to train the Thinker and stabilize execution).
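To make that data flow concrete, here is a minimal Python sketch of the decoupled pipeline. The names (`Plan`, `thinker.plan`, `generator.render`) are illustrative assumptions, not the paper's actual API; the point is only that the Thinker's structured plan is the sole interface the Generator sees.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Plan:
    intent: str                # what the user ultimately wants
    constraints: List[str]     # explicit rules the image must satisfy
    sub_goals: List[str]       # ordered steps toward the intent
    enhanced_prompt: str       # final executable spec for the Generator

def think_then_draw(thinker, generator, instruction: str, image=None):
    """Think first (structured plan), then draw (pixels)."""
    plan = thinker.plan(instruction, image)               # MLLM writes the plan
    return generator.render(plan.enhanced_prompt, image)  # diffusion model renders
```

Because the interface is just the plan, you can swap in a stronger Thinker or a different Generator without retraining the other side.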

Step-by-step with sandwich explanations:

  1. Data: HieraReason-40K
  • 🍞 Hook: Imagine a workbook full of solved examples showing both the thinking and the final answer.
  • 🥬 The Concept: HieraReason-40K is a dataset of 40K instructions (T2I and I2I) paired with structured reasoning traces and enhanced prompts.
    • How it works: Combine diverse sources; generate a three-stage trace (Task+Intent → Reasoning+Concrete result → Executable prompt); enforce an edit-only rule for I2I; filter for consistency.
    • Why it matters: The Thinker needs clear examples of “how to think, then what to tell the Generator.”
  • 🍞 Anchor: Sudoku editing: the trace computes the missing numbers; the prompt lists the final grid to fill.
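For illustration, a HieraReason-40K-style record might look like the sketch below. The field names and values are assumptions based on the three-stage trace described above, not the dataset's actual schema.

```python
# Hypothetical HieraReason-40K-style record (schema is illustrative):
example = {
    "task_type": "image_editing",  # T2I or I2I
    "instruction": "Solve the Sudoku in the image and fill in the blanks.",
    "stage_1_task_intent": "Complete the partially filled Sudoku grid.",
    "stage_2_reasoning": "Deduce each missing digit from row/column/box rules.",
    "stage_3_enhanced_prompt": "Fill only the empty cells with the solved digits.",
    "edit_only": True,  # I2I Golden Rule: mention only what changes
}
```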
  2. Stage 1: Joint Supervised Fine-Tuning (SFT)
  • 🍞 Hook: Like learning to write neat instructions while a friend practices drawing from them.
  • 🥬 The Concept: Train the Thinker to produce structured traces, and train the Generator to render enhanced prompts—together but on their own views.
    • How it works: Two synchronized views per example: Understanding view (input → Thinker trace) with language loss; Generation view (enhanced prompt → target image) with diffusion denoising loss. Optimize a weighted sum.
    • Why it matters: Without this, the Thinker might write plans the Generator can’t use, or the Generator might not be attuned to the plan style.
  • 🍞 Anchor: “Add snow only” prompts lead the Generator to preserve everything else while changing just the top surfaces.
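A minimal sketch of the joint SFT objective follows, assuming a standard language-modeling loss for the Thinker's trace and a denoising loss for the Generator; the weighting `lambda_gen`, the noise schedule, and all tensor shapes are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_sft_loss(thinker, generator, batch, lambda_gen=1.0):
    # Understanding view: instruction (+ optional image) -> structured trace.
    trace_logits = thinker(batch["instruction"], batch.get("input_image"))
    loss_lm = F.cross_entropy(
        trace_logits.flatten(0, 1),       # (batch * seq_len, vocab)
        batch["trace_tokens"].flatten(),  # (batch * seq_len,)
    )

    # Generation view: enhanced prompt -> target image, via denoising loss.
    x0 = batch["target_latent"]
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0), device=x0.device)        # random timesteps
    tb = t.view(-1, 1, 1, 1)
    x_t = (1 - tb) * x0 + tb * noise                    # noised latent
    pred = generator(x_t, t, batch["enhanced_prompt"])  # predicts the noise
    loss_diff = F.mse_loss(pred, noise)

    return loss_lm + lambda_gen * loss_diff             # weighted sum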
  3. Stage 2: Dual-Phase Reinforcement Learning (GRPO-based)
  • 🍞 Hook: Think of a science fair: try several ideas, see which project scores best, then improve in that direction.
  • 🥬 The Concept: Use Group Relative Policy Optimization to compare multiple candidates per prompt and reward the better ones.
    • How it works (Phase 1 – Reasoning-oriented RL):
      1. Sample G alternative plans from the Thinker.
      2. Render each with a fixed Generator.
      3. Score final images (reasoning alignment, appearance consistency, visual plausibility) with a VLM judge.
      4. Give each plan a relative advantage (above/below group mean) and update the Thinker to prefer higher-scoring plans.
    • Why it matters: Text that “sounds smart” is not enough; only plans that make good images should be reinforced.
    • How it works (Phase 2 – Generation-oriented RL):
      1. Keep the Thinker steady; introduce controlled randomness in diffusion sampling (convert ODE to reverse-time SDE) to create multiple rollouts.
      2. Score the outputs and update the Generator to favor denoising paths that end in higher-quality, more faithful images.
    • Why it matters: This stabilizes execution so good plans reliably become good pixels.
  • 🍞 Anchor: Several alternative “thermal camera” prompts are tried; the one producing the clearest hot/cold color mapping gets boosted.
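The heart of the group-relative reward is simple to sketch. The standardization below follows the usual GRPO formulation (score each of the G candidates, then measure each against the group); whether the paper normalizes by the standard deviation exactly this way is an assumption.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: shape (G,) — one score per sampled plan in the group.
    Plans above the group mean get positive advantage (reinforced);
    plans below get negative advantage (discouraged)."""
    mean = rewards.mean()
    std = rewards.std().clamp_min(1e-6)  # avoid divide-by-zero
    return (rewards - mean) / std

# Example: four plans rendered and judged by the VLM reward model.
rewards = torch.tensor([3.0, 4.5, 2.5, 4.0])
print(grpo_advantages(rewards))  # the 4.5-scoring plan is boosted most
```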
  4. Reward design
  • 🍞 Hook: Like a judge card that scores accuracy, neatness, and creativity.
  • 🥬 The Concept: A VLM-based reward model scores images for instruction following and realism.
    • How it works: For editing: score Reasoning/Alignment, Appearance Consistency, and Visual Plausibility (1–5); for T2I: Consistency, Realism, Aesthetics (0–2). Aggregate into a scalar reward.
    • Why it matters: The Thinker learns which choices lead to images that both obey rules and look good.
  • 🍞 Anchor: For “draw what it looks like 6 hours later,” the image with believable temporal changes and intact non-edited regions scores higher.
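As an illustration, here is one way the rubric sub-scores could collapse into a scalar reward. Only the score ranges come from the paper; the equal weighting and normalization are assumptions.

```python
def editing_reward(reasoning: float, consistency: float, plausibility: float) -> float:
    """Editing rubric: each sub-score on a 1-5 scale; scale to at most 1.0."""
    return (reasoning + consistency + plausibility) / 15.0

def t2i_reward(consistency: float, realism: float, aesthetics: float) -> float:
    """T2I rubric: each sub-score on a 0-2 scale; scale to [0, 1]."""
    return (consistency + realism + aesthetics) / 6.0
```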
  5. The execution interface and edit-only principle
  • 🍞 Hook: Imagine using a sticky note that lists only what to change, not the whole recipe again.
  • 🥬 The Concept: The Thinker’s final prompt for I2I mentions only the change; for T2I, it describes the full scene.
    • How it works: Strict format and placeholders keep the plan precise; “Golden Rule” prevents restating unchanged content to avoid drift.
    • Why it matters: Minimizes unintended edits and keeps control tight.
  • 🍞 Anchor: “Add a clear red line path from A to B”—no mention of grass, sky, or buildings.
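A toy check makes the edit-only rule concrete: the I2I prompt should state the change and nothing else. The phrase list below is a purely hypothetical heuristic, not how the paper enforces the Golden Rule.

```python
# Toy heuristic for the edit-only "Golden Rule" (phrases are illustrative):
RESTATEMENT_PHRASES = ("everything else", "same as before", "as in the original")

def violates_edit_only(enhanced_prompt: str) -> bool:
    """Flag prompts that restate unchanged content instead of the edit alone."""
    text = enhanced_prompt.lower()
    return any(p in text for p in RESTATEMENT_PHRASES)

assert not violates_edit_only("Add a clear red line path from A to B.")
assert violates_edit_only("Add a red line and keep everything else the same.")
```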
  6. Secret sauce
  • 🍞 Hook: You can teach a coach to give better plays by judging the final scoreboard, not just the pep talk.
  • 🥬 The Concept: Pixel-grounded, dual-phase RL aligns the Thinker’s language with the Generator’s abilities and tunes the Generator’s sampling to be more faithful.
    • How it works: Relative advantages (GRPO) stabilize learning; reverse-time SDE adds healthy diversity for Generator RL.
    • Why it matters: This closes the planning-to-pixels loop that typical text-only planners miss.
  • 🍞 Anchor: Over time, plans that consistently produce correct cube orders or precise edits dominate, and images become both logical and beautiful.
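To see why the ODE-to-SDE conversion enables Generator RL, consider this schematic Euler-Maruyama step: the injected noise makes each rollout from the same plan slightly different, giving GRPO a group of outcomes to compare. The coefficients are schematic, assuming a zero-drift forward process, not the paper's exact formulation.

```python
import torch

def reverse_sde_step(x, score_fn, t, dt, g):
    """One stochastic denoising step (schematic reverse-time SDE).
    x: current latent; score_fn(x, t): learned score; g(t): diffusion coeff.
    dt < 0 because we integrate backward in time."""
    drift = -(g(t) ** 2) * score_fn(x, t)                  # reverse-time drift
    noise = g(t) * (abs(dt) ** 0.5) * torch.randn_like(x)  # stochastic kick
    return x + drift * dt + noise
```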

04Experiments & Results

🍞 Hook: Think of a tournament where teams must follow complicated playbooks exactly—no freestyle!

🥬 The Concept: Tests measure how well models obey tricky instructions, keep the right parts unchanged, and still look great.

  • How it works: Evaluate on four settings—reasoning-heavy image editing (RISEBench), reasoning T2I (WiseBench), general editing (GEditBench), and general T2I quality (PRISM). Compare to strong open-source and closed-source systems.
  • Why it matters: Real users need edits and generations that are both correct and attractive. 🍞 Anchor: “Change the background to green grassland” should not alter the person’s face or clothes and should look realistic.

The competition:

  • Strong baselines include Qwen-Image-Edit, BAGEL, OmniGen, FLUX.1, SD3.5, and closed-source front-runners like GPT-4o and Gemini 2.5.

Scoreboard with context:

  • RISEBench (reasoning-heavy editing): Adding Unified Thinker to Qwen-Image-Edit lifts Instruction Reasoning to 61.9, Consistency to 76.2, and Visual to 90.5, with big gains in temporal and spatial reasoning. That’s like going from a B- to a solid A in following multi-step rules while keeping images pretty.
  • WiseBench (reasoning T2I): Qwen-Image-Edit + Unified Thinker (Qwen3-VL-8B) reaches ~0.74 overall vs. 0.62 baseline—closing much of the gap to GPT-4o. That’s like catching up a full grade level in a logic test.
  • PRISM (general T2I): With Unified Thinker, Qwen-Image-Edit’s average rises from 73.8 to 78.1, improving aesthetics without losing alignment—like getting higher style points without breaking the routine.
  • GEditBench (general editing): Slight but consistent lift in overall score (G_O 7.56 → 7.71), showing planning helps even in everyday edits.

Surprising findings:

  • Transfer works: The Thinker trained with Qwen-Image-Edit also boosts BAGEL—proof the reasoning module is portable across generators.
  • Backbone trade-off: The larger Thinker (Qwen3-VL-8B) tends to win on logic-heavy tasks; the smaller (7B) can slightly favor aesthetic preferences in some cases—choose your flavor.
  • Mild tension: Pushing deep reasoning can slightly dent very low-level pixel tweaks unless joint SFT and Dual-RL are used—these stages smooth out the trade-offs.

Bottom line: Unified Thinker turns complex, hidden constraints into images that are both correct and compelling across multiple benchmarks and backbones.

05Discussion & Limitations

🍞 Hook: Even the best coaches have limits—they need good playbooks, fair judges, and time on the clock.

🥬 The Concept: Honest assessment of Unified Thinker’s boundaries.

  • Limitations:
    1. Data and reward bias: If the traces or VLM judge favor certain styles or topics, the Thinker may overfit those preferences.
    2. Backend differences: A plan that’s easy for one diffusion model may be harder for another; executability isn’t perfectly universal.
    3. Hard edits: Fine geometry, strict locality, and pixel-perfect text rendering remain challenging.
    4. Latency: Planning adds an extra step, so it’s slower than direct prompting.
  • Required resources: A capable MLLM Thinker (e.g., 7B–8B), a strong diffusion Generator, and compute for SFT + RL (multi-GPU recommended). At inference, you need the Thinker + Generator cascade.
  • When NOT to use:
    • Ultra-low-latency scenarios where every millisecond counts.
    • Purely stylistic, trivial prompts that don’t need reasoning.
    • Tasks demanding laser-precise vector geometry or perfect text rendering in pixels.
  • Open questions:
    • How to reduce latency while keeping reasoning quality?
    • Can plans be made even more generator-agnostic?
    • How to design broader, less biased rewards than a single VLM judge?
    • Can we teach the Generator to expose its limits so the Thinker plans within them automatically?

🍞 Anchor: If you just need “make it more vibrant,” a direct prompt might be faster. But for “swap the background, keep the subject untouched, and place two red cubes under the blue,” Unified Thinker shines.

06Conclusion & Future Work

Three-sentence summary: Unified Thinker decouples planning (Thinker) from rendering (Generator) and teaches the planner with pixel-grounded feedback so its steps are actually executable. A structured planning interface plus dual-phase RL closes the reasoning–execution gap for both text-to-image and image editing. The result is stronger instruction following, better logic, and solid image quality that transfers across generators.

Main achievement: Showing that a reusable, plug-in planning core—trained to optimize final pixels, not just text—reliably turns complex, logic-heavy requests into faithful images.

Future directions: Faster planning with lighter Thinkers; broader, human-aligned reward signals; richer intermediate representations (e.g., layouts, sketches) that further improve executability; tighter two-way communication so Generators can inform the Thinker about feasibility in real time.

Why remember this: It proves that “thinking first, then drawing”—and judging by the final picture—unlocks dependable, instruction-faithful image generation, moving open-source systems much closer to the frontier.

Practical Applications

  • Photo editing assistants that reliably change only what you ask (e.g., swap a background while preserving the subject).
  • Product customization pipelines that enforce exact colors, counts, and placements for catalogs or ads.
  • Educational visuals that accurately render sequences, stacks, and timelines (e.g., physics or biology demonstrations).
  • Storyboarding tools that keep character and scene continuity while making specific, reasoned edits across frames.
  • Design QA bots that rephrase vague requests into executable prompts, reducing misinterpretations.
  • Scientific communication where causal or temporal transformations (before/after states) are rendered consistently.
  • Content moderation or compliance workflows that require precise redactions or overlays without collateral changes.
  • Game asset editing that applies constrained modifications (e.g., add snow, change outfit) while preserving canonical features.
  • Data augmentation that follows strict rules (e.g., place objects in ordered positions) for training downstream models.
#reasoning-aware image generation · #structured planning · #edit-only prompt · #multimodal large language model · #diffusion model · #reinforcement learning · #group relative policy optimization · #reverse-time SDE sampling · #instruction following · #image editing · #text-to-image · #pixel-grounded rewards · #modular architecture · #HieraReason-40K · #think-then-execute