RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
Key Summary
- RePlan is a plan-then-execute system that first figures out exactly where to edit in a picture and then makes clean changes there.
- It tackles Instruction-Visual Complexity, the extra difficulty that arises when tricky instructions meet busy, cluttered images.
- A vision-language model (VLM) reasons step by step to produce region-aligned guidance: small boxes plus short hints about what to change.
- A diffusion editor follows those hints using a training-free attention mask, so multiple areas can be edited at once without spillover.
- RePlan improves the editor without retraining it, keeping image quality high and boundaries smooth.
- With only about 1,000 instruction-only examples and GRPO reinforcement learning, the planner becomes more reliable and better at reasoning.
- The new IV-Edit benchmark tests hard, real-life edits that require fine-grained grounding and knowledge-based reasoning.
- Across complex tasks, RePlan beats strong baselines trained on much larger datasets, especially on consistency (no unwanted changes elsewhere).
- It works interactively: users can tweak boxes or hints if needed, making results more controllable.
- This approach shows how planning with reasoning plus precise attention control can make image editing both smarter and safer.
Why This Research Matters
Photos and documents we edit every day are messy: many objects, overlapping text, and small details that matter. RePlan shows that if an AI plans first, locating exact regions and writing short hints, and then edits with attention rules, it can make precise, trustworthy changes. This means fewer accidental edits, cleaner boundaries, and more control for users. Designers can safely retouch parts without wrecking the rest; office workers can fix charts or dates without reformatting entire slides. Because the method doesn't retrain the editor, it's practical to deploy and update. Overall, it brings us closer to reliable, explainable visual editing that behaves the way people expect.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): Imagine you're helping a friend tidy their messy room by following their voice instructions like, "Put the red book on the second shelf and don't touch the photo frame." In a cluttered room, those instructions are tricky because there are many similar objects and tight spaces.
🥬 Filling (The Actual Concept):
- What it is: Instruction-based image editing is when you tell an AI, in plain language, exactly how to change a picture.
- How it works (before this paper): Earlier systems either tried to edit the whole image using a single sentence (great for global vibes, not precise spots) or they used two steps: first find a region, then inpaint it (better targeting, but fragile and slow with multiple regions).
- Why it matters: Real photos are busy. If the AI doesn't understand which exact thing to change, it may alter the wrong object or mess up nearby areas.
🍞 Bottom Bread (Anchor): Think of saying, "Change the color of the shoes of the woman with a light blue backpack to red." The AI must find the right woman, find her shoes, and recolor only those shoes, nothing else.
🍞 Top Bread (Hook): You know how it's harder to follow complicated directions in a crowded place than in an empty one? Pictures can feel like crowded places.
🥬 Filling (The Actual Concept):
- What it is: Instruction-Visual Complexity (IV-Complexity) is the extra hardness that appears when complicated instructions meet cluttered scenes.
- How it works: 1) The picture has many similar objects; 2) The instruction may be multi-step or need world knowledge; 3) These two make each other harder (e.g., "the used cup" on a desk full of cups).
- Why it matters: Without handling IV-Complexity, edits become wrong (wrong target) or sloppy (bleeding into neighbors).
🍞 Bottom Bread (Anchor): If a desk has a clean mug and an empty, stained mug, "replace the used cup with a plant" needs the AI to pick the empty, stained one, not the clean one.
🍞 Top Bread (Hook): Imagine having a helpful teammate who can look and read at the same time, like a tour guide reading a map while pointing out landmarks.
🥬 Filling (The Actual Concept):
- What it is: A Vision-Language Model (VLM) is an AI that connects what it sees with what you say.
- How it works: 1) It looks at the image; 2) It reads your instruction; 3) It reasons to match words to places and things.
- Why it matters: Matching words to exact regions is key for precise edits.
🍞 Bottom Bread (Anchor): The VLM hears "the word 'Program' in the title" and points exactly to those letters in the poster.
🍞 Top Bread (Hook): Think of a careful artist who improves a sketch by gently adding details layer by layer.
🥬 Filling (The Actual Concept):
- What it is: A Diffusion Model is an AI artist that turns noisy images into clean images step by step.
- How it works: 1) Start with a noisy version; 2) Gradually remove noise while following text/image hints; 3) End with a high-quality image (a toy version of this loop is sketched after this block).
- Why it matters: Diffusion models make edits look realistic and consistent.
🍞 Bottom Bread (Anchor): If you say, "Make the apple look like polished gold," the diffusion model adds shiny texture and correct lighting that matches the scene.
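For the curious, here is a tiny, runnable toy of that "remove a little noise each step" loop. Everything in it is a made-up stand-in: the toy_noise_predictor just nudges the latent toward a fixed target, whereas real diffusion editors such as Flux.1 Kontext or Qwen-Image-Edit use large transformer networks conditioned on text and image tokens.

```python
import torch

target = torch.zeros(1, 4, 8, 8)   # pretend "clean" latent we want to reach
latent = torch.randn(1, 4, 8, 8)   # start from pure noise

def toy_noise_predictor(x, noise_level):
    # Stand-in for the real network: point back toward the target.
    # A real model would instead predict noise conditioned on the text hint.
    return x - target

num_steps = 30
for step in range(num_steps):
    noise_level = 1.0 - step / num_steps                    # noise shrinks over time
    predicted_noise = toy_noise_predictor(latent, noise_level)
    latent = latent - (1.0 / num_steps) * predicted_noise   # remove a little noise

print(f"distance to the clean latent: {(latent - target).norm():.3f}")
```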
🍞 Top Bread (Hook): When you focus in class, you don't pay the same attention to every sound; you listen to your teacher more than the pencil sharpener.
🥬 Filling (The Actual Concept):
- What it is: The Attention Mechanism helps AI focus on the most relevant pieces of text or image for the task.
- How it works: 1) Score connections between tokens (words, patches); 2) Give higher weight to important ones; 3) Use those weights to produce the output (see the sketch after this block).
- Why it matters: Without the right attention, edits can wander into the wrong places.
🍞 Bottom Bread (Anchor): For "change the color of her shoes," attention helps the model connect the word "shoes" to the shoe pixels, not the shirt.
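As a minimal concrete illustration, the sketch below implements scaled dot-product attention with an optional boolean mask that forbids some connections; RePlan's region injection (Step 5 in Section 03) is essentially a careful choice of such a mask. The function name and toy sizes are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask[i, j] = True means query token i
    is allowed to look at key token j."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # 1) score connections
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))     # forbid some connections
    weights = F.softmax(scores, dim=-1)                       # 2) weight the important ones
    return weights @ v                                        # 3) mix values into the output

# Toy usage: 4 tokens with 8-dim features; token 0 may only attend to tokens 0 and 1.
q = k = v = torch.randn(4, 8)
mask = torch.ones(4, 4, dtype=torch.bool)
mask[0, 2:] = False
print(masked_attention(q, k, v, mask).shape)   # torch.Size([4, 8])
```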
The World Before: End-to-end editors like InstructPix2Pix made sweeping changes well but often touched the wrong spots. Inpainting pipelines improved locality but required separate, sometimes brittle localization. Newer methods that attach VLMs to generators usually send only global, high-level prompts, which are helpful for overall meaning but too coarse for tight, region-accurate control.
The Problem: In real life, photos are busy and instructions are nuanced. Systems need both fine-grained grounding (which pixels?) and reasoning (which object among lookalikes? what does "used" mean?).
Failed Attempts: 1) Global-only guidance ignores region precision; 2) Multi-round inpainting degrades quality across passes; 3) VLM guidance that isn't region-specific can still cause spillover to similar objects.
The Gap: We need to move from global hints to region-aligned plans that pair each small region with its own mini-instruction, and then make the editor honor those pairs.
Real Stakes: Better, safer tools for photo cleanup, design tweaks, document fixes, instructional graphics, education, and accessibility. Less frustration, fewer mistakes, and more trustworthy edits when details matter most.
02 Core Idea
🍞 Top Bread (Hook): Picture a coach who first draws arrows on a whiteboard (the plan) and then the team executes the play exactly where the arrows point.
🥬 Filling (The Actual Concept):
- What it is: The key insight is to plan with reasoning at the region level (boxes + hints) and then execute those plans with a diffusion editor that obeys region-specific attention rules.
- How it works: 1) A VLM thinks step by step (chain-of-thought) to break a complex instruction into region-hint pairs; 2) Each hint gets its own text tokens; 3) Image patches inside each box are grouped; 4) A training-free attention mask makes each region listen mostly to its own hint (and the global hint), preventing spillover; 5) The diffusion model edits all regions in one go.
- Why it matters: This brings the VLM's sharp understanding directly into the editor's hands, replacing vague global nudges with precise local control.
🍞 Bottom Bread (Anchor): For "turn the woman with the light blue backpack's shoes red," the planner outputs a box around her shoes with the hint "make these shoes red," and the editor changes only those shoes, with no bleeding onto the floor or pants.
🍞 Top Bread (Hook): You know how you solve math word problems by writing steps? That keeps you from making silly mistakes.
🥬 Filling (The Actual Concept):
- What it is: Chain-of-Thought Reasoning means the VLM writes down its thinking to pin down targets and effects.
- How it works: 1) It lists candidates (two cups); 2) Applies knowledge ("used" likely looks empty or stained); 3) Chooses the correct one; 4) Produces region boxes and mini-prompts.
- Why it matters: Without clear steps, the planner might pick the wrong object or skip a needed sub-edit.
🍞 Bottom Bread (Anchor): "The red cup looks used; the glass has water, so it's still in use → replace the red cup with a plant."
🍞 Top Bread (Hook): Think of taping color-coded sticky notes onto parts of a poster; each note tells exactly what to change there.
🥬 Filling (The Actual Concept):
- What it is: Region-aligned Planning creates pairs like (box, hint), including a global hint for the whole image and local hints for specific spots.
- How it works: 1) Detect regions; 2) Attach a hint per region (even negative hints like "keep this unchanged"); 3) Output a structured plan in a reliable format (tags + JSON).
- Why it matters: Without boxes and per-region hints, instructions blur together and spill into neighbors.
🍞 Bottom Bread (Anchor): "Global: keep everything else the same; Region 1: replace this red cup with a small potted plant; Region 2: keep this glass unchanged."
🍞 Top Bread (Hook): Imagine you and your friends form circles so each group hears only its own instructions, with no cross-chatter.
🥬 Filling (The Actual Concept):
- What it is: Training-free Attention Region Injection is a set of attention rules you plug into the editor so regions mostly listen to their own hints.
- How it works: 1) Group text tokens by hint; 2) Group image patches by region; 3) Set attention masks: regions attend to their hint + global hint; background attends only to global hint; image and latent tokens keep global self-attention to stay coherent; 4) No retraining needed.
- Why it matters: Without these rules, edits can confuse targets, bleed across regions, or break global consistency.
🍞 Bottom Bread (Anchor): Change the keyboard-with-yellow-sticky-notes to black, and only that keyboard changes; sticky-note areas and nearby keys don't get repainted by accident.
🍞 Top Bread (Hook): When learning a game, you improve by trying things, seeing what worked, and adjusting.
🥬 Filling (The Actual Concept):
- What it is: Reinforcement Learning (RL) helps the planner get better by receiving rewards for good plans.
- How it works: 1) Generate multiple plans; 2) Execute; 3) Score results; 4) Update the planner toward better-scoring behavior.
- Why it matters: Without feedback, the planner may format plans poorly or reason inconsistently.
🍞 Bottom Bread (Anchor): If a plan makes clean, correct edits, it gets a higher score and is more likely next time.
🍞 Top Bread (Hook): Think of a class contest where the teacher ranks drawings made from the same prompt, then students copy what worked best.
🥬 Filling (The Actual Concept):
- What it is: GRPO-based Reinforcement Learning compares multiple outputs for the same instruction and pushes the planner toward relatively better ones.
- How it works: Stage 1 rewards correct plan format and thoughtful reasoning; Stage 2 adds image-level rewards for Target, Effect, and Consistency (weighted so that not editing isn't rewarded). Only about 1k instruction-only samples are used (the group-relative scoring is sketched after this block).
- Why it matters: It builds a reliable planner that both formats well and reasons well, without huge training sets.
🍞 Bottom Bread (Anchor): After GRPO, the planner is more likely to output valid JSON boxes and clear hints that lead to accurate, clean edits.
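The "compares multiple outputs" part is the group-relative heart of GRPO: each sampled plan is scored against the average of its own group. Below is a minimal sketch of that normalization only; the full algorithm also includes the policy-gradient update and a KL penalty, and the example rewards are invented.

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style normalization: judge each sampled plan relative to the
    other plans generated for the same instruction."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero spread
    return [(r - mean) / std for r in rewards]

# Four candidate plans for one instruction, scored by the reward model.
print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
# Plans above the group mean get positive advantages and are reinforced.
```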
Multiple Analogies:
- City planner: First decide which blocks get repairs (boxes + hints), then send crews with the right tools to each block at once (parallel edits).
- Sticky notes on a painting: Label exactly what to tweak in each corner, then touch up only those labels.
- Sports playbook: Draw precise routes (planning), then run the play (execution) without players drifting into each other's lanes (attention masks).
Before vs After:
- Before: Global prompts, guessy edits, multi-round inpainting, spillovers.
- After: Step-by-step planning, region-hint pairs, one-pass precise edits, strong consistency.
Why It Works (intuition): Put the VLM's sharp reasoning into a structured plan that the diffusion model can't ignore, thanks to attention masks that bind each region to its hint while keeping global image coherence.
Building Blocks:
- Chain-of-thought planning with boxes + hints.
- Structured output with tags and JSON.
- Region/text token grouping.
- Five attention rules: hint isolation, region constraint, background constraint, intra-group connections, and image-latent full interaction.
- Two-stage GRPO RL for reliability + quality.
03 Methodology
High-level Recipe: Input Image + Instruction → VLM Planner (think step-by-step) → Region-Aligned Guidance (boxes + hints) → Diffusion Editor with Attention Injection → Edited Image
Step 1: Planner reads and reasons (Chain-of-Thought)
- What happens: The VLM looks at the image and instruction, lists candidates for each target, applies knowledge (like what "used cup" might look like), and decides boxes and hints. It writes out: <think>…</think> <global>…</global> <region>[ {...}, {...} ]</region>.
- Why this step exists: Without reasoning, the planner may choose the wrong target or skip key details.
- Example: "Replace the used cup with a plant." The VLM notes two cups, picks the empty red one as "used," and outputs its box plus hint: "Replace this red cup with a small potted plant."
Step 2: Structured output that tools can parse
- What happens: The plan uses strict tags and JSON so it's easy to read programmatically.
- Why this step exists: If the format breaks, the editor can't use it, and the reward becomes noisy.
- Example: The region list contains entries like {"bbox_2d": [x1, y1, x2, y2], "hint": "..."} (a toy parser is sketched below).
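Because the format is strict, a few lines of standard-library Python are enough to pull a plan apart. This is a hypothetical parser written for illustration, not the paper's code; only the tag names and the bbox_2d/hint fields come from the format described above, and the example values are made up.

```python
import json, re

def parse_plan(plan_text):
    """Pull the reasoning, global hint, and region list out of a tagged plan.
    Returns None for any part that is missing or malformed."""
    def grab(tag):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", plan_text, re.S)
        return m.group(1).strip() if m else None

    think, global_hint, region_blob = grab("think"), grab("global"), grab("region")
    try:
        regions = json.loads(region_blob) if region_blob else None
    except json.JSONDecodeError:
        regions = None   # broken JSON; under RL this would earn a low format reward
    return think, global_hint, regions

example = ("<think>Two cups; the stained red one is the used one.</think>"
           "<global>Keep everything else unchanged.</global>"
           '<region>[{"bbox_2d": [120, 340, 210, 460], '
           '"hint": "Replace this red cup with a small potted plant."}]</region>')
print(parse_plan(example)[2][0]["hint"])
# Replace this red cup with a small potted plant.
```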
Step 3: Text encoding and grouping
- What happens: Each hint (global + each region) is separately encoded into text tokens; we form text groups G_text_global, G_text_1, …, G_text_K.
- Why this step exists: Grouped hints are needed to bind each region to its own instruction.
- What breaks without it: Hints would mix, causing conflicting or diluted guidance.
- Example: Global: "keep others unchanged"; Region 1: "make shoes red"; Region 2: "keep glass unchanged." (The index bookkeeping is sketched below.)
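One simple way to realize this grouping is to encode each hint separately and remember which token positions belong to which group; those index ranges are exactly what the attention mask in Step 5 consumes. Below is a toy sketch with a made-up whitespace "tokenizer" (a real system would use the editor's own text encoder).

```python
# Toy sketch (our own): track which token indices belong to the global hint
# and to each region hint.
hints = ["keep everything else unchanged",   # global hint
         "make these shoes red",             # region 1 hint
         "keep this glass unchanged"]        # region 2 hint

def fake_tokenize(text):
    return text.split()                      # stand-in: one token per word

text_groups, offset = [], 0
for hint in hints:
    n = len(fake_tokenize(hint))
    text_groups.append(list(range(offset, offset + n)))
    offset += n

print(text_groups)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```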
Step 4: Image encoding and patch grouping
- What happens: The image passes through a VAE encoder into a grid of patch tokens. Each bbox maps to a set of patches (G_img_k). Background patches that are in no box form G_img_bg.
- Why this step exists: The editor must know which patches belong to which region.
- What breaks without it: The model canāt link region hints to the right pixels.
- Example: Patches inside the shoe box are grouped to receive the shoe-color hint (an index-mapping sketch follows).
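Concretely, with a fixed patch size a pixel-space bounding box selects a small rectangle of patch indices. The helper below is a hypothetical sketch under that assumption; the editor's real VAE and patch geometry may differ.

```python
def bbox_to_patch_indices(bbox, image_size=512, patch_size=16):
    """Return the flat indices of the patch tokens covered by a pixel-space
    bbox [x1, y1, x2, y2], assuming a square image split into a regular grid."""
    grid = image_size // patch_size                      # patches per row/column
    x1, y1, x2, y2 = bbox
    cols = range(int(x1) // patch_size, min((int(x2) - 1) // patch_size + 1, grid))
    rows = range(int(y1) // patch_size, min((int(y2) - 1) // patch_size + 1, grid))
    return [r * grid + c for r in rows for c in cols]

# The shoe box from the running example maps to a small set of patch tokens.
print(len(bbox_to_patch_indices([192, 400, 256, 448])))  # 12 patches (4 columns x 3 rows)
```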
Step 5: Training-free Attention Region Injection (the secret sauce)
- What happens: We set a binary attention mask in every transformer layer of the diffusion editor (MMDiT) following five rules:
- Intra-group interaction: tokens inside any single group can attend freely.
- Hint isolation: different text groups cannot see each other.
- Image-latent full interaction: all image and latent tokens are globally connected (keeps style and boundaries smooth).
- Region constraint: region patches attend only to their own hint and the global hint.
- Background constraint: background patches attend only to the global hint.
- Why this step exists: It enforces who-listens-to-whom so each region follows its own hint without spillover.
- What breaks without it: Targets get confused, neighbors get edited, and global consistency suffers.
- Example: Editing the word "Program" to blue doesn't recolor nearby words because the background patches don't attend to the "Program" hint (a mask-construction sketch follows).
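To make the five rules concrete, here is a small, self-contained sketch that builds the boolean mask over a [text tokens | image tokens] sequence. It is our own simplification for illustration (one mask, latent tokens treated like image tokens, symmetric text-image visibility assumed), not the released implementation.

```python
import torch

def build_region_attention_mask(text_group_sizes, patch_groups, num_patches):
    """Boolean mask over [all text tokens | all image tokens].
    text_group_sizes[0] is the global hint; the rest are region hints, one per
    entry of patch_groups (lists of patch indices). True = attention allowed."""
    starts, offset = [], 0
    for size in text_group_sizes:              # lay out text token index ranges
        starts.append(range(offset, offset + size))
        offset += size
    n_text, n_total = offset, offset + num_patches
    mask = torch.zeros(n_total, n_total, dtype=torch.bool)

    # Rules 1 + 2: each text group attends to itself only (hint isolation).
    for grp in starts:
        mask[grp.start:grp.stop, grp.start:grp.stop] = True

    # Rule 3: image tokens fully attend to each other (keeps global coherence).
    mask[n_text:, n_text:] = True

    # Rule 4: a region's patches exchange attention with its own hint and the global hint.
    g = starts[0]
    for grp, patches in zip(starts[1:], patch_groups):
        idx = [n_text + p for p in patches]
        for t in list(grp) + list(g):
            mask[idx, t] = True
            mask[t, idx] = True

    # Rule 5: background patches (in no box) exchange attention with the global hint only.
    in_region = {p for patches in patch_groups for p in patches}
    bg = [n_text + p for p in range(num_patches) if p not in in_region]
    for t in g:
        mask[bg, t] = True
        mask[t, bg] = True
    return mask

# Tiny example: global hint (4 tokens), two region hints (4 tokens each),
# a 4x4 patch grid where region 1 covers patches {5, 6} and region 2 covers {9}.
m = build_region_attention_mask([4, 4, 4], [[5, 6], [9]], num_patches=16)
print(m.shape, m[12 + 5, 4].item(), m[12 + 0, 4].item())
# torch.Size([28, 28]) True False -> patch 5 sees the region-1 hint; background patch 0 does not.
```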
Step 6: One-pass, multi-region diffusion editing
- What happens: The diffusion model runs normally, but its attention is guided by the mask and groups. All regions are edited in parallel in a single pass.
- Why this step exists: Multi-round inpainting degrades images and costs time; one-pass is cleaner and faster.
- What breaks without it: Quality drops with each round; edits may fight each other over iterations.
- Example: Change two separate objects at once, like recoloring one shirt and deleting a knife, without the operations interfering.
Step 7: Reinforcement learning to strengthen planning (GRPO)
- What happens: Two stages of GRPO improve the planner with only ~1k instruction-only samples:
- Stage 1 (Format + Reasoning): Rewards for correct tag/JSON format and for writing sufficiently detailed chain-of-thought.
- Stage 2 (Planning Quality): Image-level rewards from a strong VLM judge: Target (right place), Effect (right change), Consistency (unchanged elsewhere, reweighted by Effect so no-edit cheating doesn't win). A small weight keeps Stage 1's format reliability active (an illustrative combination is sketched after this step).
- Why this step exists: To make plans consistently valid and helpful, even without huge datasets.
- What breaks without it: Plans may be sloppy, causing decoding failures or poor edits.
- Example: After RL, the planner more reliably outputs accurate bboxes for small targets like shoes or words.
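As a rough sketch of how such a reward could be combined: Consistency only counts in proportion to Effect, so a plan that changes nothing cannot score well, and a small format term keeps the Stage 1 behavior alive. The function, score ranges, and weights below are illustrative assumptions, not the paper's exact formula.

```python
def stage2_reward(target, effect, consistency, format_ok, w_format=0.1):
    """Illustrative combination of judge scores (each assumed in [0, 1] here).
    Consistency only counts in proportion to Effect, so a 'do nothing' edit
    cannot win; the small format bonus keeps Stage 1 reliability in play."""
    image_reward = target + effect + consistency * effect
    return image_reward + (w_format if format_ok else 0.0)

# A clean, correct edit beats a no-op that merely preserves the image.
print(round(stage2_reward(target=0.9, effect=0.8, consistency=0.9, format_ok=True), 2))  # 2.52
print(round(stage2_reward(target=0.0, effect=0.0, consistency=1.0, format_ok=True), 2))  # 0.1
```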
Interactivity: Because the plan is explicit, users can adjust a box or hint and re-run, like nudging sticky notes on a poster.
Secret Sauce Summary:
- Tight coupling of reasoning (boxes + hints) with execution (attention masks) without extra training on the editor.
- Clear division of labor: the VLM decides what and where; the diffusion model decides how it should look.
- Parallel, precise edits with preserved global coherence.
A Note on Robustness and Edge Cases:
- Overlapping boxes: Patches in the overlap can attend to both relevant hints; global image self-attention helps the model reconcile sub-edits.
- Slight bbox errors: The method remains robust even with noticeable jitter (tested up to 50% corner shifts with modest impact).
- Attention rule discoveries: Cutting off image-to-image attention across regions causes visible seams; letting image and latent tokens fully interact preserves smoothness; removing all text guidance to an area causes distortion, showing text tokens also help coordinate image information flow.
04 Experiments & Results
🍞 Top Bread (Hook): If you want to know who really follows directions in a messy room, you give the same tough chore to everyone and see who does it best.
🥬 Filling (The Actual Concept):
- What it is: The IV-Edit Benchmark is a test built to measure how well models handle Instruction-Visual Complexity: fine-grained grounding and knowledge-heavy edits in busy scenes.
- How it works: It has about 800 image-instruction pairs across real-world photos and text-rich images (like tables, posters). Instructions average 21 words and often involve multiple regions. A strong VLM judge scores Target, Consistency, Quality, and Effect on a 1-5 scale.
- Why it matters: Simpler benchmarks don't capture real-world difficulty; IV-Edit stresses both understanding and precise, clean execution.
🍞 Bottom Bread (Anchor): Examples include "make the color of the word 'FLOOR' gold," "replace the used cup with a plant," and "change the closing date by one month," each requiring accurate localization plus reasoning.
The Test: Models are compared on four dimensions: Target (did you edit the right thing?), Consistency (did you avoid messing up other areas?), Quality (does it look good?), Effect (did the edit match the instruction?). Overall is the average; a Weighted score multiplies Consistency by Effect to avoid rewarding doing nothing (sketched below).
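In code, the two summary scores are simple to express. The normalization below (mapping the 1-5 judge scores to [0, 1] before multiplying, then back) is one plausible choice of ours; the benchmark's exact formula may differ, and the example numbers are invented.

```python
def summarize(scores, scale=5.0):
    """scores: dict of judge ratings on a 1-5 scale.
    Overall is their plain average; Weighted multiplies Consistency by Effect
    after normalizing to [0, 1], so leaving the image untouched does not win."""
    overall = sum(scores.values()) / len(scores)
    weighted = (scores["Consistency"] / scale) * (scores["Effect"] / scale) * scale
    return overall, weighted

result = summarize({"Target": 3.8, "Consistency": 3.9, "Quality": 3.5, "Effect": 2.6})
print(tuple(round(x, 2) for x in result))   # (3.45, 2.03)
```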
The Competition: Closed-source (GPT-4o, Gemini-2.5-Flash-Image) and open-source (InstructPix2Pix, Uniworld, Bagel-think, Qwen-Image-Edit, Flux.1 Kontext dev) systems are evaluated. RePlan is applied on top of MMDiT-based backbones (Flux.1 Kontext dev, Qwen-Image-Edit).
Scoreboard with Context:
- RePlan improves both Flux.1 Kontext dev and Qwen-Image-Edit backbones. For Flux.1 Kontext dev, Overall rises to 3.46 and Weighted to 2.55. For Qwen-Image-Edit, Overall reaches 3.51 and Weighted 2.91. Think of this as moving from a solid B to a stronger B+/A- in a tough class where most students got Bs.
- Consistency especially improves with RePlan, reflecting fewer unwanted changes elsewhere. This matches the goal of attention region injection.
- Against strong VLM-guided baselines like Bagel-think and Qwen-Image-Edit (without RePlan), RePlan's region-aware planning narrows the gap or surpasses them, despite using far less training data.
Surprising Findings:
- Planner ablations show that zero-shot planners (even very smart ones) often fail on bbox accuracy or format reliability. The two-stage GRPO makes a big difference in getting sturdy, usable plans.
- Removing chain-of-thought reasoning harms performance notably, confirming that step-by-step thinking helps the planner resolve tricky references.
- One-stage RL underperforms the two-stage scheme, showing that first stabilizing format/reasoning and then adding image-level rewards is more sample-efficient.
- Bbox perturbations up to 50% corner shifts cause only modest degradation, indicating resilience to imperfect region proposals.
- Attention rule exploration reveals that completely isolating image regions produces harsh boundaries (bad), while allowing all image and latent tokens to interact keeps the whole picture coherent (good). Also, blocking text attention to a region causes distortion, implying text tokens help coordinate internal image information.
Qualitative Highlights:
- Fine-grained edits like recoloring only the intended word in a title or only the chosen personās shoes are executed cleanly.
- Knowledge-based instructions (e.g., extending a month from "June 15" to the correct later date) showcase reasoning plus precise localization.
- Compared to global-only guidance, region-aligned hints keep edits from drifting into semantically similar neighbors (e.g., not recoloring all keyboards when only one is targeted).
Takeaway: RePlan consistently lifts precision and cleanliness (Consistency) while maintaining target accuracy and visual quality, without retraining the editor, by turning the VLM's reasoning into actionable, region-specific guidance.
05 Discussion & Limitations
Limitations:
- Ambiguous or highly novel instructions can still confuse the planner (e.g., unusual definitions of "used" or complex cultural cues).
- If the VLM proposes a poor box (very wrong area), the edit will faithfully follow that wrong plan. Though robust to small errors, big mislocalizations still hurt.
- Extremely dense multi-target scenarios may stress the plannerās format or bbox accuracy, despite RL improvements.
- Text-heavy documents with very tiny fonts can push the limits of region detection and alignment.
Required Resources:
- A VLM planner (e.g., Qwen2.5-VL 7B in this paper) and a diffusion editor with MMDiT architecture (e.g., Flux.1 Kontext dev or Qwen-Image-Edit).
- Moderate GPU resources for diffusion inference; GRPO uses about 1k instruction-only samples plus a large VLM judge for scoring in Stage 2.
- Reliable parsing and orchestration code for tag/JSON handling and attention mask injection.
When NOT to Use:
- Purely global aesthetic changes (e.g., "make the whole image warmer") may be simpler with standard global editors.
- Cases needing heavy new content creation far beyond the imageās style priors (e.g., building a complex fantasy scene from scratch) might require dedicated generative pipelines.
- Ultra-high-resolution print workflows where every pixel is critical may need specialized, resolution-scaled setups and careful validation.
Open Questions:
- Can the planner learn to predict pixel-accurate masks (beyond boxes) reliably at scale without adding brittleness?
- How to extend from images to video (temporal consistency) while preserving region-hint discipline?
- Can region rules be adapted dynamically (soft masks, learnable routing) without retraining the editor?
- How to reduce reliance on large VLM judges for reward, enabling lighter, cheaper RL loops?
- Safety and fairness: How to ensure edits don't reinforce harmful biases or unintentionally alter sensitive content?
Overall Assessment: RePlan shows that precise planning plus gentle, rule-based attention control can unlock strong, clean edits in complex scenes, bringing the VLM's reasoning power directly into the editor's hands without heavy retraining.
06 Conclusion & Future Work
3-Sentence Summary: RePlan tackles Instruction-Visual Complexity by first planning with a VLM, producing region boxes with clear hints, and then executing with a diffusion editor guided by training-free attention rules. This tight plan-execute loop enables parallel, precise, and clean edits, greatly reducing spillover while keeping global image quality. With a small RL setup (GRPO, ~1k instruction-only samples) and a new IV-Edit benchmark, RePlan outperforms strong baselines trained on much larger data.
Main Achievement: The paper's top contribution is refining VLM-diffusion interaction from vague, global semantics to concrete, region-aligned guidance that the editor is forced to respect, without retraining the editor, yielding state-of-the-art consistency under complex instructions.
Future Directions:
- Move from bboxes to robust masks; extend to videos with temporal planning; shrink reward models for lighter RL; explore dynamic/learnable attention routing.
- Expand IV-Edit with more domains (e.g., medical charts, scientific figures) and richer reasoning types.
Why Remember This: RePlan shows that thinking first (region reasoning) and then editing with carefully controlled attention can turn complicated, messy real-world instructions into precise, trustworthy image changes, like sticking labeled notes on a canvas and then painting exactly where each note says.
Practical Applications
- Product photography touch-ups: recolor a single item, remove a smudge, or swap a label without altering nearby products.
- Slide and poster fixes: change only the targeted word's color or update a date while keeping layout and style intact.
- E-commerce listing updates: replace a specific object (e.g., outdated accessory) with a new one while preserving the scene.
- Document correction: adjust one table cell's value based on reasoning (e.g., recalculate sums) without disturbing other cells.
- UI/UX mockups: swap icons or tweak only the selected component's style in dense interfaces.
- Education and worksheets: highlight or alter exact diagram parts (e.g., recolor only the left ventricle) without touching labels.
- Photo cleanup: remove a single distracting object (e.g., a stray knife) cleanly, keeping surfaces and lighting consistent.
- Fashion and retail previews: change the color or material of one garment region (e.g., sleeves) without affecting the rest.
- Accessibility aids: enlarge or recolor targeted text elements on images for readability while preserving overall design.
- Creative edits: simulate physics-based outcomes in a specific area (e.g., wind affecting flags) without changing the whole scene.