ProEdit: Inversion-based Editing From Prompts Done Right

Intermediate
Zhi Ouyang, Dian Zheng, Xiao-Ming Wu et al. Ā· 12/26/2025
arXiv Ā· PDF

Key Summary

  • ProEdit is a training-free, plug-and-play method that fixes a common problem in image and video editing: the model clings too hard to the original picture and refuses to change what you asked for.
  • It does this in two smart ways: (1) KV-mix gently blends source and target attention features only where the edit should happen, and (2) Latents-Shift slightly shakes up the hidden noise in the edit area so old details don’t overrule the new request.
  • By keeping the background fully tied to the source while freeing the edited object, ProEdit can finally change tricky attributes like color, pose, and number without breaking scene consistency.
  • It works across popular flow-based inversion editors like RF-Solver, FireFlow, and UniEdit, and even extends to video with strong temporal consistency.
  • On standard image benchmarks (like PIE-Bench), ProEdit consistently improves both edit quality (higher CLIP similarity) and preservation of unedited regions (better PSNR/SSIM).
  • On color edits—where older methods often fail—ProEdit shines because Latents-Shift releases the target from the source’s color ā€œgravity.ā€
  • It requires no retraining and no fragile head/layer picking; it simply drops in, caches attention once, builds a soft mask, and runs.
  • Ablations show KV-mix and Latents-Shift each help on their own and work best together, like two hands guiding the edit to match the prompt while keeping the rest untouched.

Why This Research Matters

ProEdit makes everyday visual edits more trustworthy: when you ask for a clear change, you actually get it, and the rest stays put. This is crucial for creators who need fast, accurate tweaks—changing product colors, fixing costume details, or adjusting a character’s pose—without background damage. It also brings steadier, flicker-free edits to video, which filmmakers and marketers need for consistent storytelling. Because it’s plug-and-play and training-free, teams can adopt it quickly without expensive retraining or fragile head/layer tuning. By focusing changes only where needed, ProEdit saves time on manual cleanup and reduces failure cases that waste compute and effort. In short, it boosts both edit accuracy and production reliability across images and videos.

Detailed Explanation


01 Background & Problem Definition

The World Before: You know how you can use a photo editor to make a cat black instead of orange, or make one bear turn into two? AI editors try to do that from words—ā€œTurn the orange cat blackā€ā€”and they’ve gotten pretty good at keeping the scene the same (the fence, the sky, the style). But they often mess up the exact change you asked for. They cling to the original too tightly: the cat stays orange, the arms won’t cross, or two bears won’t become one.

Why this happens: Many modern editors use a trick called inversion-based editing. Imagine you tell the model, ā€œRewind this finished picture back into the kind of noise it came from,ā€ and then ā€œPlay it forward again using the new words.ā€ To keep the background and style steady, past methods inject lots of information from the source picture during this replay. That sounds safe—but too much of that source information leaks into the place you’re trying to change. The result? The background looks perfect, but the cat stubbornly stays orange.

šŸž Hook: Imagine you’re tracing a drawing. If you press too hard on the old lines while redrawing, you can’t add the new shape you want. 🄬 The Concept (Attention Mechanism): Attention is the AI’s way of deciding what to focus on (important parts get higher focus; unimportant parts get lower).

  • How it works:
    1. Look at all tokens (words and image patches).
    2. Compare each to the current goal.
    3. Give higher scores to the helpful ones.
    4. Use those to guide the next step.
  • Why it matters: Without attention, the model treats ā€œtheā€ like ā€œcat,ā€ and can’t aim edits correctly. šŸž Anchor: When asked ā€œMake the cat black,ā€ attention should focus on ā€œcatā€ tokens and the cat’s pixels—not on the fence.
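To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the basic operation behind ā€œfocus.ā€ It only illustrates the general mechanism; real DiT editors use multi-head attention over mixed text and image tokens, and the shapes and values below are toy stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (num_tokens, dim). Each row is a token (a word or an image patch).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # how relevant every token is to each query
    weights = softmax(scores, axis=-1)           # higher weight = more focus
    return weights @ V                           # output = focus-weighted mix of the values

# Toy usage: 6 tokens with 8-dimensional features.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
out = attention(Q, K, V)                         # shape (6, 8)
```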

šŸž Hook: Think of a movie you can rewind to the start and then play again with a new script. 🄬 The Concept (Inversion-based Editing): It rewinds a picture/video into a hidden ā€œnoiseā€ state, then plays forward with a target prompt to make the edited result.

  • How it works:
    1. Invert the source image into a special noise-like latent.
    2. Keep some info (for consistency) during this process.
    3. Run forward again, but now follow the new words.
  • Why it matters: It allows training-free, faithful edits—but too much source info can block requested changes. šŸž Anchor: Rewind a photo of an orange cat, then play forward with ā€œblack catā€ to get the changed cat on the same fence.
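The rewind-then-replay loop can be sketched in a few lines. The `velocity` function below is only a placeholder for the pretrained flow model; the point is to show where the source prompt and the target prompt enter the two passes, not to reproduce any real editor.

```python
import numpy as np

def velocity(x, t, prompt):
    # Placeholder for the learned velocity field v(x, t, text); NOT a real model.
    return -0.1 * x

def invert(latent, source_prompt, steps=15):
    """Rewind a clean latent toward noise by stepping the ODE backwards."""
    x, dt = latent.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt                               # from data (t=1) back toward noise (t=0)
        x = x - velocity(x, t, source_prompt) * dt
    return x                                           # source-leaning "noise"

def sample(latent, target_prompt, steps=15):
    """Play the ODE forward again, now guided by the target prompt."""
    x, dt = latent.copy(), 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt, target_prompt) * dt
    return x

z = invert(np.ones((4, 8, 8)), "an orange cat on a fence")
edited = sample(z, "a black cat on a fence")           # same loop, new words
```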

Failed attempts: Earlier tricks globally injected source attention features (especially the V part of attention) at many steps. Others tried picking special heads or layers. These helped keep the background, but they also snuck in the object’s old attributes—like color and pose—right into the edited area. So the model ā€œremembersā€ too strongly and ignores your new instruction.

šŸž Hook: Picture two rivers—one is the old image information, the other is your new request. If you open the floodgate from the old river everywhere, the new river can’t shape the land where it needs to. 🄬 The Concept (Flow-based Models): These models learn a smooth path that carries you from random noise to a real image or video fast and accurately.

  • How it works:
    1. Learn a velocity field that points from noise to data.
    2. Solve an ODE to travel along that field in a few steps.
    3. Reverse it to ā€œinvertā€ images back toward noise.
  • Why it matters: They’re fast, stable, and great for editing—but still prone to overusing source info if we’re not careful. šŸž Anchor: With flow models, you can go from ā€œstaticā€ to a full photo in fewer steps, or go backward to edit, like a smooth slide.
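Here is a tiny worked example of why this can be fast. On a straight-line (rectified) path between noise and data, the true velocity is simply `data - noise`, so a handful of Euler steps carries noise exactly back to the data point. The 8-dimensional vectors are toy stand-ins for image latents.

```python
import numpy as np

rng = np.random.default_rng(0)
data  = rng.standard_normal(8)         # stand-in for an image latent
noise = rng.standard_normal(8)

def true_velocity(x_t, t):
    # On a straight-line path x_t = (1 - t) * noise + t * data, the velocity is constant.
    return data - noise

x, steps = noise.copy(), 4
for i in range(steps):                 # Euler ODE solve from t = 0 (noise) to t = 1 (data)
    x = x + true_velocity(x, i / steps) * (1.0 / steps)

print(np.allclose(x, data))            # True: a few steps suffice on a straight path
```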

The Gap: We needed a way to keep backgrounds rock-solid while letting the edited object truly change as instructed. That means: inject source info only where it preserves what we want (the background), and reduce it exactly where we need freedom (the edit region).

Real stakes: This matters for anyone who edits images or videos—social media posts, product photos, films, games, and education. If you ask for ā€œone white bear holding a rose,ā€ you should get one white bear, not two. If you ask for ā€œred car to black car,ā€ you want a real black car, not a dark red.

02 Core Idea

The ā€œAha!ā€ in one sentence: Don’t blast source information everywhere—mix it precisely in attention where you don’t edit, and loosen the source grip in the hidden noise only where you do edit.

Three friendly analogies:

  1. Painting tape: Put painter’s tape around the background (keep it) and only repaint the object (change it). KV-mix is the tape-and-brush control; Latents-Shift makes old paint less sticky.
  2. DJ remix: Keep the original beat (background) at full volume, but crossfade the vocals (object) with a new singer (target prompt). Mix the right channels in the right place.
  3. Gardening: Leave the garden intact, but gently replant the flowerbed (object). Don’t dump the original soil back onto the new flowers.

Before vs After:

  • Before: Editors started from inverted source noise and globally injected source attention. Background stayed, but object edits (color, pose, number) often failed.
  • After (ProEdit): Use a soft mask to find where edits belong. In that region, blend source-and-target attention keys/values (KV-mix) and nudge the latent distribution (Latents-Shift). Keep full source injection outside the mask to preserve background. Edits land; background stays solid.

Why it works (intuition, no equations):

  • Attention decides ā€œwhat to copy and what to change.ā€ If you globally inject source V features, you accidentally copy old object details too. KV-mix says: only mix source and target features inside the edit mask, and keep full source features outside. That preserves background while letting the object follow the new words.
  • The initial latent (inverted noise) still ā€œleansā€ toward the source image. Latents-Shift lightly re-centers the latent’s statistics in the edit area using random noise as a neutral style, weakening the source’s grip without breaking structure.
  • Together, attention mixing (where to look) and latent shifting (what to start from) align so the model both aims and begins in a way that respects the new prompt.

Building blocks (each explained with the Sandwich pattern when first introduced):

  • Soft edit mask from attention: find where the object is.
  • KV-mix in attention: blend K and V features inside the mask with strength Γ; fully inject source K,V outside it.
  • Latents-Shift in latent space: blend inverted noise with randomized, AdaIN-style statistics using strength β inside the mask.
  • Plug-and-play design: no retraining, no picking special heads/layers; works with popular flow-based inversion editors.

šŸž Hook: Think of tracing around the part you want to color so you don’t color outside the lines. 🄬 The Concept (KV-mix): KV-mix blends the source and target attention features (K and V) only inside the edit region, while keeping full source features outside.

  • How it works:
    1. Build a soft mask that marks the edit area.
    2. Inside the mask, mix source and target K,V with ratio Γ.
    3. Outside the mask, inject source K,V fully to preserve background.
  • Why it matters: It stops old object details from overwhelming the edit while perfectly keeping the background. šŸž Anchor: To turn an orange cat black, KV-mix focuses mixing on the cat region; the wooden fence stays from the original.
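A minimal sketch of the KV-mix rule, assuming per-token K/V arrays and a per-token soft mask. The shapes, names, and exact blending formula are illustrative, not the paper’s implementation.

```python
import numpy as np

def kv_mix(K_src, V_src, K_tgt, V_tgt, mask, gamma=0.9):
    """
    K_*, V_*: (num_visual_tokens, dim) cached source vs. current target features.
    mask:     (num_visual_tokens, 1) soft mask in [0, 1]; 1 = inside the edit region.
    gamma:    how strongly to favor the target features inside the mask.
    """
    K_in = gamma * K_tgt + (1.0 - gamma) * K_src   # inside: mostly target, a touch of source
    V_in = gamma * V_tgt + (1.0 - gamma) * V_src
    K = mask * K_in + (1.0 - mask) * K_src         # outside: pure source, locking the background
    V = mask * V_in + (1.0 - mask) * V_src
    return K, V

# Toy usage with 100 visual tokens of dimension 64.
rng = np.random.default_rng(0)
K_s, V_s, K_t, V_t = (rng.standard_normal((100, 64)) for _ in range(4))
mask = (rng.random((100, 1)) > 0.7).astype(float)  # pretend ~30% of tokens are the edit region
K, V = kv_mix(K_s, V_s, K_t, V_t, mask)
```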

šŸž Hook: Imagine the old paint color is hard to cover because it keeps bleeding through. 🄬 The Concept (Latents-Shift): Latents-Shift slightly adjusts the hidden noise distribution in the edit area by borrowing statistics from random noise.

  • How it works:
    1. Take the inverted noise (source-leaning) and a random noise (neutral).
    2. Match the edited region’s mean/variance to the random noise (AdaIN-style).
    3. Blend with strength β so structure remains but source attributes loosen.
  • Why it matters: If you don’t reduce the source pull, the cat keeps wanting to be orange. šŸž Anchor: With Latents-Shift, the ā€œorangeā€ bias fades, so ā€œblack catā€ can actually appear.
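A hedged sketch of Latents-Shift using an AdaIN-style statistics swap inside the mask. Whether the statistics are taken per channel or per token, and the exact blending order, are assumptions made for illustration.

```python
import numpy as np

def adain(content, style, eps=1e-6):
    """Give `content` the per-channel mean/std of `style` (latents shaped (C, H, W))."""
    c_mean = content.mean(axis=(-2, -1), keepdims=True)
    c_std  = content.std(axis=(-2, -1), keepdims=True)
    s_mean = style.mean(axis=(-2, -1), keepdims=True)
    s_std  = style.std(axis=(-2, -1), keepdims=True)
    return (content - c_mean) / (c_std + eps) * s_std + s_mean

def latents_shift(z_inv, mask, beta=0.25, rng=None):
    """z_inv: (C, H, W) inverted noise; mask: (1, H, W) soft edit mask in [0, 1]."""
    rng = rng or np.random.default_rng(0)
    z_rand = rng.standard_normal(z_inv.shape)      # neutral noise: carries no source attributes
    shifted = adain(z_inv, z_rand)                 # keep structure, borrow neutral statistics
    z_new = beta * shifted + (1.0 - beta) * z_inv  # gentle blend so the layout survives
    return mask * z_new + (1.0 - mask) * z_inv     # only the edit region is touched
```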

šŸž Hook: Like a USB device that works when you plug it in—no new drivers needed. 🄬 The Concept (Plug-and-Play Architecture): ProEdit drops into existing inversion editors without retraining or head/layer tinkering.

  • How it works:
    1. Cache source K,V during inversion.
    2. Build a soft mask once.
    3. Apply KV-mix and Latents-Shift at sampling time.
  • Why it matters: Easy adoption and consistent gains across tools. šŸž Anchor: It works with RF-Solver, FireFlow, and UniEdit out of the box.
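One way the caching side of this could look in code, assuming a simple dict keyed by block name and the `kv_mix` sketch above; the function names here are hypothetical, not the paper’s API.

```python
# Hypothetical caching pattern: during inversion every attention block stores its
# source K/V once; at sampling time the same block fetches them and applies KV-mix.
source_kv = {}   # block_name -> (K_src, V_src)

def cache_source_kv(block_name, K, V):
    source_kv[block_name] = (K, V)          # called once per block during inversion

def mixed_kv_for_block(block_name, K_tgt, V_tgt, mask, gamma=0.9):
    K_src, V_src = source_kv[block_name]    # no head/layer picking: every block is treated alike
    return kv_mix(K_src, V_src, K_tgt, V_tgt, mask, gamma)   # kv_mix from the sketch above
```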

03 Methodology

At a high level: Source image + source prompt → Invert (cache source K,V; compute mask) → Latents-Shift on inverted noise (masked) → Sample with target prompt + KV-mix → Output edited image/video.

Step 0: Inputs and outputs

  • What happens: You provide a source image (or video) and a target prompt like ā€œTurn the orange cat black,ā€ and you want an edited output that keeps the background and style.
  • Why this step exists: We must know exactly what to change (the object) and what to preserve (everything else).
  • Example: Source: an orange cat on a fence. Target: ā€œA black cat sitting on a fence.ā€ Output: the same fence and pose, but a black cat.

Step 1: Inversion with caching

  • What happens: We run inversion on a flow-based model (e.g., FLUX / HunyuanVideo). During this, we cache source attention features (K_s, V_s) across DiT blocks. We also collect attention maps to find the edit area.
  • Why it matters: Caching K_s, V_s gives us reliable background anchors without rummaging for special heads or layers later. If we skip caching, we lose precise alignment with the source.
  • Example data: For the orange cat, we get per-layer source K,V that reflect the image’s structure and style.

Step 2: Build a soft edit mask from attention

  • What happens: We use text-to-image attention maps (practically, from the last Double block at the first inversion step) to locate the object described by the prompt tokens (e.g., ā€œcat,ā€ ā€œorangeā€). We then dilate the mask slightly to cover edges and avoid artifacts (a rough code sketch of this step follows below).
  • Why this step exists: We need to know exactly where edits should apply. Without a mask, global changes reintroduce the source object attributes everywhere.
  • Example: The mask covers the cat region, not the fence or sky. It’s soft (not just 0/1) to blend nicely.
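A rough sketch of turning an object-token attention map into a soft, slightly dilated mask. The percentile threshold and the dilation size are illustrative choices, not values taken from the paper.

```python
import numpy as np
from scipy.ndimage import grey_dilation

def build_soft_mask(attn_map, percentile=80, dilate=3):
    """
    attn_map: (H, W) attention from the edited-object tokens (e.g., "cat") to image patches.
    Returns a soft mask in [0, 1], slightly dilated to cover object edges.
    """
    m = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-6)
    thresh = np.percentile(m, percentile)                          # keep the strongest responses
    soft = np.clip((m - thresh) / (1.0 - thresh + 1e-6), 0.0, 1.0)
    return grey_dilation(soft, size=(dilate, dilate))              # expand slightly past edges
```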

Step 3: Latents-Shift (masked AdaIN-style adjustment)

  • What happens: We take the inverted noise (which still carries source bias) and gently shift its statistics in the masked region toward random noise. We blend with a strength β (e.g., 0.25) so we don’t break structure.
  • Why this step exists: If the latent starts too close to the source, the model keeps ā€œrebuildingā€ the old object. This nudge frees the object area to follow the new prompt.
  • What breaks without it: Color edits (orange→black) and strong attribute changes (two→one) often fail or look half-changed.
  • Example: After Latents-Shift, the cat’s region is less ā€œorange-leaning,ā€ so ā€œblackā€ becomes reachable.

Step 4: KV-mix during sampling

  • What happens: We run the forward sampling with the target prompt. At chosen timesteps (practically, all DiT Double and Single attention blocks), we:
    • Inside the mask: mix target K_tg, V_tg with source K_s, V_s using ratio Γ (e.g., 0.9 toward target) so the object follows the target words but stays well-registered.
    • Outside the mask: inject full K_s, V_s to perfectly preserve background.
    • Text attention always uses target prompt features for correct guidance.
  • Why this step exists: Global source injection preserves background but also smuggles old object details. Masked KV-mix gives you both—steady background and truly edited object.
  • What breaks without it: You either get a great background but a stubborn object, or you get an edited object but a drifting background.
  • Example: The fence’s wood grain and lighting match the source; the cat turns black cleanly.

Step 5: Repeat over timesteps and output

  • What happens: We iterate the sampling steps (e.g., 15 for images, 25 for videos). At each step, the mask-guided KV-mix keeps the background steady, and the shifted latent lets the object align more with the target.
  • Why this step exists: Edits settle gradually. If you do it in one shot, you risk artifacts.
  • Example: Early steps shape the global form; later steps refine fur texture and lighting on the now-black cat.

Secret sauce (why this is clever):

  • Precision over power: Instead of blasting source info everywhere, KV-mix injects it only where it helps (outside the mask) and blends where it must not dominate (inside the mask).
  • Start-point hygiene: Latents-Shift fixes the start by weakening source bias just in the edit region, so the model doesn’t ā€œsnap backā€ to the old object.
  • No fragile knobs: Works across heads/layers automatically; no retraining; truly plug-and-play.

Implementation tips (from the paper’s settings, not strict rules; collected into a config sketch after this list):

  • Use the last Double block’s attention at the first inversion step to build the mask (good correlation; low memory).
  • Dilate the mask slightly to cover edges and avoid halos.
  • Typical strengths: Ī³ā‰ˆ0.9 (favor target in the edit area), Ī²ā‰ˆ0.25 (gentle latent shift). Images often use ~15 steps; videos ~25.
  • Apply KV-mix in visual-token attention of both Double and Single blocks; keep text attention from the target prompt.
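The tips above, collected into one plain config dict for quick reference; the key names are made up for readability and do not mirror the paper’s code.

```python
# Assumed defaults gathered from the implementation tips above (illustrative names).
proedit_config = {
    "mask_source": "last_double_block_attention_at_first_inversion_step",
    "mask_dilation": 3,                   # slight dilation to cover object edges (assumed size)
    "gamma": 0.9,                         # KV-mix: weight toward target K/V inside the mask
    "beta": 0.25,                         # Latents-Shift: strength of the masked latent blend
    "steps_image": 15,
    "steps_video": 25,
    "kv_mix_blocks": ["double_blocks", "single_blocks"],  # visual-token attention only
    "text_attention": "target_prompt",    # text features always come from the target prompt
}
```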

Concrete walkthrough (orange→black cat), sketched in code after the list:

  • Input: source image (orange cat on fence), target prompt (ā€œA black cat sitting on a fenceā€).
  • Invert and cache: get K_s, V_s and an attention map linking ā€œcat/orangeā€ to cat pixels.
  • Mask: build/dilate a cat-region mask.
  • Latents-Shift: nudge the cat area’s latent toward random stats with β=0.25.
  • Sampling + KV-mix: use target text; inside mask mix K_tg,V_tg with K_s,V_s using Γ=0.9; outside mask inject K_s,V_s fully.
  • Output: same fence background, cat now black, edges clean.
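The walkthrough above as orchestration-level Python. It reuses the `build_soft_mask`, `latents_shift`, and `kv_mix` sketches from earlier sections; `invert_and_cache`, `sample_step`, and `decode` are toy stand-ins defined here so the loop runs on random arrays, whereas a real system would route those calls through a flow-based editor such as RF-Solver or FireFlow on FLUX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins so the orchestration runs end to end on random arrays (not a real editor).
def invert_and_cache(image, prompt, steps):
    z_inv = rng.standard_normal((16, 32, 32))                      # "inverted noise" latent
    kv = {f"block_{i}": (rng.standard_normal((1024, 64)),
                         rng.standard_normal((1024, 64))) for i in range(2)}
    attn_map = rng.random((32, 32))                                # object-token attention map
    return z_inv, kv, attn_map

def sample_step(z, prompt, t, kv_hook):
    for name in ("block_0", "block_1"):                            # pretend DiT blocks
        kv_hook(name, rng.standard_normal((1024, 64)), rng.standard_normal((1024, 64)))
    return z                                                       # no real denoising here

def decode(z):
    return z

def proedit_edit(source_image, source_prompt, target_prompt, gamma=0.9, beta=0.25, steps=15):
    z_inv, source_kv, attn_map = invert_and_cache(source_image, source_prompt, steps)  # 1) invert + cache
    mask2d = build_soft_mask(attn_map)                             # 2) soft, dilated edit mask
    z = latents_shift(z_inv, mask2d[None], beta=beta)              # 3) loosen source bias in the mask
    token_mask = mask2d.reshape(-1, 1)                             # same mask viewed per visual token

    def kv_hook(name, K_tgt, V_tgt):                               # 4) masked KV-mix in every block
        K_src, V_src = source_kv[name]
        return kv_mix(K_src, V_src, K_tgt, V_tgt, token_mask, gamma)

    for t in range(steps):
        z = sample_step(z, target_prompt, t, kv_hook=kv_hook)
    return decode(z)

edited = proedit_edit(None, "an orange cat on a fence", "a black cat on a fence")
```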

04 Experiments & Results

The test: The authors challenged ProEdit on standard image edits (PIE-Bench: 700 images, 10 edit types) and diverse video edits (55 clips, 40–120 frames, up to 720p). They measured two things at once: (1) Is the edit correct? (CLIP similarity for the whole image and the edited region), and (2) Is the rest preserved? (Structure distance, PSNR, SSIM for unedited regions). For video, they followed VBench-style scores: Subject Consistency, Motion Smoothness, Aesthetic Quality, and Imaging Quality.

The competition: ProEdit is compared with strong, training-free baselines—classic diffusion editors (P2P, PnP, PnP-Inversion, EditFriendly, MasaCtrl, InfEdit) and the latest flow-based inversion editors (RF-Inversion, RF-Solver, FireFlow, UniEdit), plus video baselines (FateZero, Flatten, TokenFlow). ProEdit is not a standalone editor; it plugs into flow-based editors like RF-Solver, FireFlow, and UniEdit.

The scoreboard (with context):

  • On PIE-Bench image editing, adding ProEdit to RF-Solver, FireFlow, and UniEdit reliably lifts CLIP similarity (better edit correctness) and preserves background more (higher PSNR/SSIM). Think of it as turning a solid B into an A across multiple subjects.
  • With UniEdit (α=0.8), ProEdit reaches state-of-the-art results on both scene preservation and edit correctness. That means you no longer have to choose between a perfect background and a correct edit—you get both.
  • On color editing (a notorious failure case), ProEdit’s Latents-Shift helps break the source-color ā€œgravity,ā€ yielding clearly higher correctness scores while still preserving the rest. It’s like finally convincing the model the orange cat can truly be black.
  • For video, plugging ProEdit into RF-Solver yields higher Subject Consistency and Motion Smoothness with small but meaningful boosts to Aesthetics and Imaging Quality—upgrading from a steady capture to a cleaner, crisper cut without flicker.

Surprising findings:

  • You don’t need to cherry-pick attention heads or layers. KV-mix applied uniformly to visual tokens works robustly, which simplifies deployment.
  • A soft mask from the last Double block at the first inversion step suffices (and saves memory). Dilating it slightly reduces edge artifacts.
  • Ablations show KV-mix alone and Latents-Shift alone each help, but together they make the biggest gains—like two puzzle pieces that lock in.

What to look at in examples:

  • Attribute edits that used to fail (pose, number, color) now land correctly while the background remains nearly pixel-identical to the source.
  • In side-by-sides, older methods either (a) barely change the object or (b) change it but mess up the rest. ProEdit breaks this tradeoff.

Bottom line: Across images and videos, ProEdit consistently improves edit accuracy without sacrificing what you wanted to keep. It’s state-of-the-art when nested in strong flow-based inversion editors and is especially powerful on color and attribute changes.

05 Discussion & Limitations

Limitations:

  • Very extreme edits (e.g., ā€œturn a cat into a spaceshipā€ with complex geometry) may still need additional conditioning or stronger masks; ProEdit is excellent at attribute and moderate semantic changes but is not magic for wild, out-of-domain swaps.
  • Mask quality matters. If the attention-derived mask mislocalizes the object, edits may bleed or miss parts; external masks can help but add user effort.
  • Latents-Shift must be balanced: too weak and the source bias persists; too strong and structure can wobble. Defaults work well, but rare cases may need tuning.

Required resources:

  • A flow-based text-to-image/video model (e.g., FLUX.1-dev, HunyuanVideo) and an inversion method (RF-Solver, FireFlow, or UniEdit) that ProEdit can plug into.
  • Enough GPU memory to cache K,V during inversion and run attention-controlled sampling. For 480–720p video, budget accordingly for sequence length.

When NOT to use:

  • If you want global style transfer of the entire image (not just an object), the selective mask may be unnecessary overhead.
  • If your toolchain is strictly diffusion U-Net based without access to suitable inversion and attention features, you won’t get the full benefit of KV-mix.
  • If the target prompt is extremely vague, the attention-derived mask may be unreliable; better prompts or a user-provided mask are advisable.

Open questions:

  • Can we automatically refine the mask over time, adapting to the evolving image for even cleaner boundaries?
  • Could learned priors choose Γ and β per scene/object automatically for best results without manual tuning?
  • How far can this approach scale to very high resolutions and long videos while keeping memory low and motion consistent?
  • Can similar masked KV-mix and latent shifting help multi-object edits and multi-prompt video stories?

06 Conclusion & Future Work

Three-sentence summary: ProEdit is a training-free, plug-and-play add-on that fixes the core weakness of inversion-based editing—too much source information blocking requested edits. It mixes source and target attention features only where edits happen (KV-mix) and loosens source bias in the hidden noise only there (Latents-Shift), while fully preserving the background elsewhere. Across images and videos, this delivers state-of-the-art accuracy on tricky attribute edits without breaking scene consistency.

Main achievement: Showing that precise, mask-guided control—KV-mixing in attention and latent shifting in the edit area—beats global source injection and eliminates the usual tradeoff between correct edits and preserved backgrounds.

Future directions: Smarter, self-updating masks; adaptive Γ and β; extension to multi-object, multi-prompt stories; scaling to 4K+ videos with memory-efficient caching; and integrating optional user-provided masks for exact control when needed.

Why remember this: ProEdit changes the editing playbook—from ā€œcopy the source everywhere and hope the edit survivesā€ to ā€œprotect the background and free the object, exactly where needed.ā€ It’s simple to add, powerful in practice, and a clear step toward reliable, instruction-following visual editing for both images and videos.

Practical Applications

  • •Product photography: swap colors and materials on items while keeping the studio background identical.
  • •E-commerce: generate consistent variant images (sizes, colors) without reshooting products.
  • •Film and TV post-production: adjust props (add/remove), tweak outfit colors, or fix continuity in scenes without breaking backgrounds.
  • •Marketing and advertising: localize campaigns by changing brand colors or signage text while keeping the same shot.
  • •Social media content: quickly personalize images (e.g., add accessories, change hair color) while preserving image quality.
  • •Game asset iteration: modify character attributes (pose, gear color) without altering the environment.
  • •Education and tutorials: demonstrate controlled edits (e.g., ā€œmake the flower redā€) to teach prompt-based editing.
  • •Architectural visualization: change facade colors/materials while preserving lighting and scene layout.
  • •Scientific imaging: highlight or recolor specific structures in figures without altering surrounding context.
  • •Video editing: add simple props, recolor objects, or adjust number/pose across frames with temporal consistency.
#ProEdit #inversion-based editing #KV-mix #Latents-Shift #flow-based models #rectified flow #attention injection #AdaIN #image editing #video editing #plug-and-play editing #mask-guided editing #MMDiT #RF-Solver #UniEdit