
EditThinker: Unlocking Iterative Reasoning for Any Image Editor

Intermediate
Hongyu Li, Manyuan Zhang, Dian Zheng et al. Ā· 12/5/2025
arXiv Ā· PDF

Key Summary

  • EditThinker is a helper brain for any image editor that thinks, checks, and rewrites the instruction in multiple rounds until the picture looks right.
  • Instead of one-shot editing, it runs a Critique–Refine–Repeat loop that mimics how people fix their work step by step.
  • A single multimodal model (the Thinker) outputs a reasoning trace, two scores (how well it followed the instruction and how natural it looks), and a better next instruction in one go.
  • The Thinker is trained first by copying an expert (supervised fine-tuning) and then sharpened with real feedback from editors (reinforcement learning).
  • The authors built THINKEDIT-140k, a large multi-turn dataset of images, instructions, and reasoning traces to teach the Thinker.
  • Plugging EditThinker into popular editors (like FLUX.1 Kontext, OmniGen2, and Qwen-Image-Edit) boosts their scores across four benchmarks.
  • On hard reasoning tasks (RISE-Bench and KRIS-Bench), multi-turn thinking delivers especially big gains compared to single-turn editing.
  • More capable Thinkers give bigger improvements, and allowing more turns usually helps up to a point.
  • The system is editor-agnostic: you don’t change the editor; you add a Thinker on top to guide it.
  • This approach makes instruction-following more reliable, like having a careful coach correct and guide each edit.

Why This Research Matters

Real photos and graphics often need careful touch-ups that keep key parts intact: faces, logos, or text on signs. One-shot edits can easily miss the mark, warping important details or changing things you wanted preserved. EditThinker adds a gentle, human-like loop to catch and fix issues step by step, so results match your words more reliably. Because it’s editor-agnostic, you can keep your favorite editor and still get a big upgrade in accuracy. This helps creators, marketers, teachers, and casual users produce more trustworthy images with less trial and error. It also points the way toward smarter multimodal systems that plan and self-correct, not just generate once. As tools like this spread, we’ll see faster workflows, fewer artifacts, and edits that better respect user intent.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how when you draw, you don’t get it perfect the first time—you sketch, look, fix, and try again? Computers that edit pictures have the same problem.

🄬 Filling (The Actual Concept):

  • What it is: Instruction-based image editing means a computer changes a picture by following your words, like ā€œmake the sky pinkā€ or ā€œturn the zebra into concrete.ā€
  • How it works (before this paper): Most editors try to do everything in a single turn—understand the instruction, plan the changes, and generate the new image all at once.
  • Why it matters: One-shot edits often miss details (like wrong colors, warped text, or lost backgrounds) because there’s no time to think, check, and fix.

šŸž Bottom Bread (Anchor): Imagine asking, ā€œReplace the sailboat on the right with a lighthouse.ā€ A single-turn edit might remove the boat but forget to add the lighthouse, or add one that’s the wrong size.

šŸž Top Bread (Hook): Imagine a chef tasting their soup while cooking, adding a bit more salt each time. That’s better than guessing the right taste in one try.

🄬 Filling (Single-turn Limitation):

  • What it is: Single-turn editing is like trying to finish the whole soup without tasting.
  • How it works: The model gets one instruction and must do the full change in one pass.
  • Why it matters: If it misunderstands part of your instruction (like ā€œkeep the pose but change the styleā€), it can’t fix itself.

šŸž Bottom Bread (Anchor): You say, ā€œAnimate the cat but keep the pose and fur pattern.ā€ The model might change the pose or blur the fur because it didn’t get to double-check.

šŸž Top Bread (Hook): You know how a teacher grades your essay and gives you notes so you can rewrite it better?

🄬 Filling (Feedback Models):

  • What it is: Many systems used reward models or judges to give a score after the edit.
  • How it works: After generating the edited image, an evaluator gives a thumbs up/down or a score.
  • Why it matters: This helps a bit, but it’s too late to fix the plan mid-way—the score is after the fact and doesn’t directly tell the editor how to try again.

šŸž Bottom Bread (Anchor): It’s like being told ā€œB- because of weak conclusion,ā€ but you don’t get guided steps to rewrite your conclusion.

šŸž Top Bread (Hook): Picture an art coach who doesn’t just grade your painting but also explains what’s off and suggests exactly what to try next.

🄬 Filling (What was missing):

  • What it is: A thinking loop that looks at the current result, explains what’s wrong, and rewrites the instruction for the next try.
  • How it works: Evaluate → Reason → Rewrite → Try again, several times.
  • Why it matters: Without this loop, editors stay reactive (they just do) instead of reflective (they think, then do better).

šŸž Bottom Bread (Anchor): The coach says, ā€œThe lighthouse is too big for the distance. Keep the background, shrink the lighthouse, and don’t put it in the foreground.ā€ You try again and it looks right.

šŸž Top Bread (Hook): Imagine building a LEGO set with step-by-step checks so you notice early if a piece is wrong.

🄬 Filling (Enter EditThinker):

  • What it is: This paper adds a Thinker—a multimodal model that sees both images and text—to guide any existing editor through multiple rounds.
  • How it works: The Thinker critiques the current result, gives two scores (following and quality), writes down its reasoning, and produces a refined instruction. The editor then generates a new attempt.
  • Why it matters: This turns editing into a safe, guided staircase instead of a single jump.

šŸž Bottom Bread (Anchor): After each draft image, the Thinker says, ā€œStreet sign letters are warped. Keep exact text shapes ā€˜Mt Lookout Rd’ and ā€˜North Park Rd’; change only the background to a city.ā€ The next try preserves the signs perfectly.

Real Stakes (Why care in daily life):

  • Fixes warped text on signs and labels so edited photos look real.
  • Keeps people’s faces and poses consistent when changing style or background.
  • Makes product photos accurate when changing colors or materials.
  • Helps creators iterate quickly without handcrafting super-precise prompts.
  • Works with your favorite editor—you don’t need a new one, just a smarter helper on top.

02 Core Idea

šŸž Top Bread (Hook): Imagine you’re writing a story. You don’t publish your first draft—you reread, notice problems, and rewrite. That’s exactly what this paper brings to image editing.

🄬 Filling (Aha! in one sentence): Let the editor think while it edits by running an iterative Critique–Refine–Repeat loop guided by a single multimodal Thinker that both evaluates and rewrites instructions in every round.

Multiple Analogies:

  1. Chef analogy: Taste → adjust spices → taste again, until the soup is just right.
  2. Coach analogy: Watch the play → point out mistakes → practice the fixed move → play again.
  3. GPS analogy: Drive → check you’re off-route → get a corrected turn-by-turn instruction → proceed.

šŸž Bottom Bread (Anchor): When turning a zebra into concrete, the first try might lose the stripes or add a base. The Thinker says, ā€œKeep the zebra’s pose, keep black-and-white stripes as painted texture on concrete, no pedestal,ā€ and the next try nails it.

šŸž Top Bread (Hook): You know how a to-do list helps you focus on the most important steps first?

🄬 Filling (Before vs After):

  • Before: Editors handled everything in one pass—understanding, planning, and generating—so they often missed details and couldn’t self-correct.
  • After: The Thinker first critiques what’s wrong, then rewrites a sharper instruction, and the editor tries again. This repeats a few times, closing gaps.
  • What changes: Higher instruction-following, fewer artifacts, better preservation of what should stay the same.

šŸž Bottom Bread (Anchor): For changing a background to a city while keeping street sign text pristine, the Thinker keeps reminding the editor: ā€œDon’t warp text; preserve exact letter shapes,ā€ until the signs look authentic.

šŸž Top Bread (Hook): Imagine building a puzzle: you constantly compare the piece to the box picture before locking it in.

🄬 Filling (Why it works—intuition, not equations):

  • The editor needs concrete, targeted instructions.
  • The Thinker looks at both the source and the current edited image to spot precise mismatches (pose changed, text warped, color off).
  • Writing the reasoning before the next prompt keeps the fix grounded in what actually went wrong, not just guessing.
  • Small, focused corrections compound over turns, like climbing stairs to the goal.

šŸž Bottom Bread (Anchor): If the lighthouse is the right place but too big, the Thinker won’t rewrite everything; it just says, ā€œKeep position, reduce size for distance realism,ā€ so the editor fixes exactly that.

šŸž Top Bread (Hook): Picture a single smart teammate who can grade your work and also tell you exactly how to improve it.

🄬 Filling (Building Blocks):

  • Dual-role Thinker: One multimodal model outputs three things at once: a reasoning trace (<think>), two scores (<score> for ā€˜semantic’ and ā€˜quality’), and a refined instruction (<answer>); an illustrative example appears just below.
  • Structured format: Forces the Thinker to evaluate before planning.
  • Multi-turn loop: Repeat until the score crosses a threshold or a max turns limit.
  • Training in two stages: Learn the format and style by copying an expert (SFT), then learn what actually helps real editors via reinforcement learning.
  • Large dataset: THINKEDIT-140k with multi-round examples and reasoning traces so the Thinker learns real failure patterns and good fixes.

šŸž Bottom Bread (Anchor): After a first failed attempt to replace a zebra with a giraffe, the Thinker’s next instruction explicitly says: ā€œRemove all zebra stripes; use large irregular brown patches with light tan lines; extend neck and legs,ā€ which the editor then follows much better.

03 Methodology

High-level recipe: Input (source image + original instruction) → Editor makes a first try → Thinker critiques, scores, and rewrites → Editor tries again → repeat until good.
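
Before the step-by-step walkthrough, here is a minimal Python sketch of that recipe. The `edit_image` and `run_thinker` callables are hypothetical stand-ins for any image editor and the Thinker; the score threshold and turn limit are illustrative, not the paper's exact settings:

```python
def critique_refine_repeat(src_image, instruction, edit_image, run_thinker,
                           score_threshold=8.0, max_turns=6):
    """Iterate: edit, critique, refine the instruction, and edit again."""
    current_instruction = instruction
    edited = edit_image(src_image, current_instruction)  # first one-shot attempt
    for _ in range(max_turns):
        # The Thinker sees the source, the current edit, the original
        # instruction, and the last instruction actually used.
        result = run_thinker(src_image, edited, instruction, current_instruction)
        if min(result["semantic"], result["quality"]) >= score_threshold:
            break  # good enough: stop early
        current_instruction = result["refined_instruction"]
        edited = edit_image(src_image, current_instruction)  # try again
    return edited
```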

šŸž Top Bread (Hook): Imagine you and a friend painting a picture. You paint a bit, your friend points out what to fix, you repaint, and you keep going until it looks great.

🄬 Filling (Each step, like a recipe):

  1. Inputs and Roles
  • What happens: We keep the editor as-is (any existing image editor). We add a Thinker (a multimodal model) that sees the source image, the current edited image, the original instruction, and the last instruction used.
  • Why it exists: The editor is great at rendering but not at reflecting; the Thinker is great at reflecting and planning.
  • Example: Input tuple = (I_src, I_edit^{t-1}, T_src, T^{t-1}).
  2. The Thinker’s Structured Output
  • What happens: In one pass, the Thinker writes <think> a clear reasoning trace </think>, <score> { semantic: 0–10, quality: 0–10 } </score>, and <answer> a refined next instruction </answer>. (A minimal parsing sketch appears right after this list.)
  • Why it exists: The structure forces ā€œevaluate first, then plan,ā€ preventing sloppy rewrites.
  • Example: ā€œSigns warped; keep exact text ā€˜Mt Lookout Rd’ and ā€˜North Park Rd’; swap background to city but do not alter letter shapes.ā€
  3. The Critique–Refine–Repeat Loop
  • What happens: If the score is below a threshold, we run another turn with the new instruction. Stop when the result is good enough or the maximum number of turns is reached.
  • Why it exists: Most first tries are imperfect; small, focused fixes stack up.
  • Example: Turn 1 animates the cat but changes the pose. Thinker: ā€œStrictly preserve lying pose and curled paws; add motion lines only.ā€ Turn 2 respects the pose and adds motion cues.
  4. Training the Thinker: Two Stages
  • Stage A: Supervised Fine-Tuning (SFT)
    • What happens: The Thinker copies an expert’s examples (like GPT-4.1) to learn the format and good reasoning habits.
    • Why it matters: Cold start—learn to speak the right language and structure.
    • Example: It learns to always output <think>, then <score>, then <answer>.
  • Stage B: Reinforcement Learning (RL)
    • What happens: Now the Thinker learns from real editing outcomes. If its refined instruction truly improves the editor’s next image, it gets rewarded; if not, it learns to adjust.
    • Why it matters: Bridges the gap between ā€œwhat sounds smartā€ and ā€œwhat actually works with this editor.ā€
    • Example rewards (a code sketch follows the anchor example below):
      • Format reward: follow the output schema.
      • Critic reward: its predicted scores should match an expert judge’s evaluation of the new image (don’t over- or underestimate).
      • Edit reward: the next image should be better than the previous one (the difference in expert scores is positive).
  5. THINKEDIT-140k Dataset Construction
  • What happens: An automated pipeline creates multi-turn ā€œtrajectories.ā€
    • Trajectory Generation: Use several editors plus an expert Thinker to iterate, producing reasoning and refined instructions until a stop token.
    • Trajectory Filter: Keep only trajectories where at least one later step is better than the start, and truncate at the best step.
    • Step-wise Filter: Convert each step into training samples (input tuple ↔ reasoning + refined instruction), balanced by task types and score levels.
    • Data Split: Stable, high-quality samples go to SFT; high-variance, improving samples go to RL.
  • Why it matters: The Thinker learns both solid patterns and how to climb out of tough failures.
  • Example: A trajectory that fixes a warped-text edit over 3 steps becomes 3 labeled lessons.
  6. Secret Sauce
  • Unified, dual-role Thinker: Critiques and plans in one brain, so the plan is grounded in the critique.
  • Structured reasoning-first output: Forces clarity before action.
  • Differential reward: Rewards actual improvement between turns, not just absolute scores.
  • Editor-agnostic design: Works on top of popular editors without modifying them.
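
As promised above, here is a minimal sketch of parsing the Thinker's structured output into the pieces the loop needs. The tag names match the paper's format; the regex approach and the JSON-style score payload are assumptions for illustration:

```python
import json
import re

def parse_thinker_output(text):
    """Split one Thinker response into reasoning, scores, and next instruction."""
    def between(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        return match.group(1).strip() if match else None

    scores = json.loads(between("score"))  # e.g. {"semantic": 5, "quality": 8}
    return {
        "reasoning": between("think"),
        "semantic": scores["semantic"],
        "quality": scores["quality"],
        "refined_instruction": between("answer"),
    }
```

This returns exactly the dictionary shape the loop sketch earlier in this section expects from `run_thinker`.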

šŸž Bottom Bread (Anchor): Start with ā€œReplace the sailboat on the right with a lighthouse.ā€ If Turn 1 deletes the boat but the lighthouse is too big, Turn 2 instruction becomes ā€œAdd a lighthouse on the right at realistic size for distance; keep horizon and lighting consistent; no foreground placement.ā€ The final image looks natural and correct.

04 Experiments & Results

šŸž Top Bread (Hook): Imagine a science fair where every team brings a robot to do a task. Now add a great coach to some teams and see who gets better scores.

🄬 Filling (The Test):

  • What they measured and why: They checked two things—does the edit follow the instruction (Semantic) and does it look natural without artifacts (Quality)? They also looked at overall benchmark scores that combine these views.
  • The competition (base editors): FLUX.1 Kontext, OmniGen2, and Qwen-Image-Edit—strong editors many people already use.
  • The judges (benchmarks):
    • ImgEdit-Bench and GEdit-Bench-EN for general edits.
    • RISE-Bench and KRIS-Bench for hard, reasoning-heavy edits (spatial, causal, temporal, logical).

The Scoreboard (with context):

  • General Editing (ImgEdit-Bench):
    • FLUX.1 Kontext went from about 3.44 to 3.98 with EditThinker—like moving from a B- to a solid B+/A-.
    • OmniGen2 improved from 3.4 to 3.5 (a smaller but steady boost).
    • Qwen-Image-Edit inched up from 4.36 to 4.37 (already strong; still gains).
  • General Editing (GEdit-Bench-EN):
    • FLUX.1 Kontext [Dev] improved from 6.18 to around 7.05–7.19 depending on the Thinker—like climbing almost a full grade.
    • OmniGen2 and Qwen-Image-Edit also saw consistent gains.
  • Reasoning Editing (RISE-Bench):
    • FLUX.1 Kontext [Dev] jumped from 5.8 to 14.4 with the trained Thinker, and even higher with an expert-level Thinker—a big leap, like going from struggling to passing with confidence.
    • Qwen-Image-Edit rose from 8.9 to 17.8 with the Thinker and up to 27.5 with the expert—a major uptick on hard, multi-step reasoning.
  • Reasoning Editing (KRIS-Bench):
    • FLUX.1 Kontext [Dev] overall went from 61.81 to 69.53 with the trained Thinker, and higher with the expert.
    • Qwen-Image-Edit rose from 64.43 to 71.91 with the trained Thinker, and even more with the expert.

Surprising/Notable Findings:

  • More Turns Help (up to a point): Letting the Thinker run more rounds (like up to 6–8) keeps boosting results—each small fix stacks.
  • Think-while-Edit beats Think-before-Edit: Prewriting a better prompt helps, but iterating with feedback helps more.
  • Stronger Thinkers = Bigger Gains: Swapping in a more capable expert (like GPT-4.1) as the Thinker boosts the same editor even further, proving the framework scales with reasoning power.
  • RL Adds Real-World Sharpness: After SFT, RL fine-tuning brings notable extra gains, especially on general editing overall scores, because it learns what truly moves the needle for each specific editor.

šŸž Bottom Bread (Anchor): On the zebra→giraffe replacement, early tries stretched the neck but kept stripes. The Thinker kept saying ā€œremove stripes; add large irregular brown patches with tan lines,ā€ and by later turns, the giraffe looked right and natural—scores climbed accordingly.

05 Discussion & Limitations

šŸž Top Bread (Hook): Even the best coaches have limits—they can guide you, but the player still has to make the shot.

🄬 Filling (Honest assessment):

  • Limitations:
    • Depends on the base editor: If the editor can’t render a certain change (like ultra-precise text or exact micro-textures), the Thinker’s great plans can still fall short.
    • More turns mean more time and compute: Iteration improves quality but increases latency and GPU cost.
    • Judge reliability: Using expert evaluators (MLLM-as-a-judge) can introduce bias or occasional mis-scores.
    • Edge cases: Extremely long or contradictory instructions may need human simplification.
    • Multi-image or video edits: The current pipeline focuses on single images; multi-image temporal coherence is not addressed.
  • Required Resources:
    • A capable multimodal model (the Thinker), one or more image editors, and GPUs for training/inference.
    • The THINKEDIT-140k dataset and access to an expert judge for scoring during RL.
  • When NOT to use:
    • Ultra-low-latency applications where extra turns aren’t acceptable.
    • Tasks needing pixel-perfect typography or metadata fidelity beyond the editor’s capability.
    • Settings without access to a suitable evaluator for scores.
  • Open Questions:
    • Can we learn an internal, fast reward that matches human judgment without heavy evaluators?
    • How do we auto-decide the optimal number of turns per task to balance quality and speed?
    • Can the Thinker learn editor-specific quirks faster, like a plug-and-play profile per editor?
    • How do we extend to multi-image or video with temporal consistency in the same loop?
    • Can partial-region edits be planned with explicit spatial masks predicted by the Thinker?

šŸž Bottom Bread (Anchor): If you need a quick social post in 1 second, multi-turn thinking might be overkill. But if you’re crafting a product photo that must be perfect, two to six thoughtful turns can be worth it.

06 Conclusion & Future Work

Three-sentence summary: This paper adds a Thinker on top of any image editor so the system can critique, refine, and try again instead of hoping the first try is perfect. The Thinker outputs reasoning, scores, and a sharper next instruction in every round, trained first by imitation (SFT) and then grounded by real feedback (RL). Across four benchmarks and multiple editors, this iterative loop boosts instruction-following and realism, especially on hard reasoning edits.

Main achievement: Turning editing from a one-shot guess into an iterative, reasoning-driven process using a single dual-role multimodal model that both evaluates and plans.

Future directions: Faster internal rewards to cut evaluation cost, smarter stopping rules for the right number of turns, profiles tuned to each editor’s strengths/weaknesses, and extensions to multi-image and video with temporal consistency and regional control.

Why remember this: It shows that thinking while doing—critiquing, refining, and repeating—can unlock better results from tools you already have. Instead of replacing editors, add a brainy coach that guides them, transforming bumpy one-shots into steady step-ups toward your exact intent.

Practical Applications

  • Product photography: change colors or materials while perfectly preserving labels and textures.
  • Marketing and design: swap backgrounds without warping text, logos, or brand elements.
  • Education: generate visual variations (style, material) while keeping core details for clear comparisons.
  • E-commerce: standardize images (lighting, background) with safeguards to preserve critical features like shape and text.
  • Content creation: stylize characters while strictly maintaining pose and identity across iterations.
  • Photo restoration: iteratively remove artifacts or fix lighting while protecting faces and scene layout.
  • AR/VR prototyping: refine scene edits step by step for natural integration and consistent scale.
  • Scientific/technical visuals: modify diagrams or labels while keeping fonts and geometry intact.
  • Interior/exterior mockups: replace objects (e.g., furniture, signage) with accurate placement and proportions.
  • Creative direction: quickly explore multiple refined prompts guided by critique to reach the desired look.
Tags: instruction-based image editing Ā· iterative reasoning Ā· multimodal large language model Ā· critique-refine-repeat Ā· reinforcement learning Ā· supervised fine-tuning Ā· MLLM-as-a-judge Ā· editor-agnostic Ā· structured prompting Ā· reward modeling Ā· image generation Ā· reasoning traces Ā· dataset construction Ā· prompt refinement