
Unified Video Editing with Temporal Reasoner

Intermediate
Xiangpeng Yang, Ji Xie, Yiyuan Yang et al. · 12/8/2025
arXiv · PDF

Key Summary

  • VideoCoF is a new way to edit videos that first figures out WHERE to edit and then does the edit, like thinking before acting.
  • It predicts soft gray “reasoning frames” that highlight the exact region to change, so users don’t have to draw masks.
  • A special timing trick (RoPE alignment) keeps motion lined up and lets the model handle videos much longer than it was trained on.
  • Despite training on only 50k videos, VideoCoF beats larger systems on a new benchmark for instruction-based video editing.
  • It handles tricky cases like multiple similar objects, left/right distinctions, and local style changes with high accuracy.
  • The model follows a see → reason → edit routine that improves instruction-to-region alignment and reduces accidental changes elsewhere.
  • A clean “temporal triptych” text prompt tells the model the three parts it should imagine: original, grounded (reasoning), and edited.
  • Ablation studies show the soft gray, progressively highlighted reasoning format and 4 reasoning frames work best.
  • RoPE index resetting avoids index collisions, prevents artifacts, and unlocks 4× length extrapolation without quality loss.

Why This Research Matters

VideoCoF lets anyone describe precise video edits in plain language without drawing masks—saving time and reducing mistakes. It keeps motion aligned so actions look natural, even in longer videos like vlogs, tutorials, or sports clips. Creators can reliably edit the correct person or object among many, matching real-world needs. The approach is data-efficient, showing strong results without training on millions of clips, which makes it more accessible. Its simple see → reason → edit principle could unify many kinds of edits under one tool. As videos keep growing longer online, techniques that stay stable over time become especially valuable. This paper offers a clean blueprint for trustworthy, mask-free video editing at scale.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you’re editing a class video and you say, “Erase the girl on the left holding a controller,” but the tool erases the wrong person. Frustrating, right?

🥬 Filling (The Actual Concept): What it is: Before this paper, video editors had to choose between two imperfect options: precise tools that needed hand-drawn masks, and mask-free tools that often guessed the wrong spot. How it works (story of the field):

  1. Expert, mask-based methods: You tell the tool exactly where to edit by providing a mask. Precise, but extra work and different tools per task.
  2. Unified, mask-free methods: You just give the instruction and video, and the model tries in-context learning. Easier, but often uncertain about the exact region to change—especially with multiple similar objects or left/right.
  3. Some teams tried adding big multimodal LLMs to guide edits, but that added heavy costs and complexity.

Why it matters: Without a clear link between the words you say and the exact region to change, edits go to the wrong place, spill into background areas, or break motion consistency, especially in longer videos.

🍞 Bottom Bread (Anchor): Think of asking an assistant to “paint only the biggest cup white.” Without pointing, the assistant might color the wrong cup. That’s the old world.

🍞 Top Bread (Hook): You know how a good student shows their work step-by-step to get the right answer? That idea inspired a better editor.

🥬 Filling (The Actual Concept): What it is: The paper proposes VideoCoF, a model that follows see → reason → edit. It predicts a soft gray “reasoning frame” to highlight the target area before making the actual change. How it works:

  1. See: Look at the original video.
  2. Reason: Predict a gray-highlighted region showing where the edit should happen.
  3. Edit: Perform the requested edit exactly in that region.

Why it matters: You get mask-like precision without drawing masks, and the instruction maps cleanly to the correct spot.

🍞 Bottom Bread (Anchor): Say “Remove the man in the green shirt on the right.” VideoCoF first shows a gentle gray glow over that man, then removes him—no extra mask from you.

🍞 Top Bread (Hook): Imagine marching in step to keep a parade aligned. Timing matters in videos too.

🥬 Filling (The Actual Concept): What it is: Long videos often break unified editors because their internal timing assumptions don’t generalize beyond the short lengths they were trained on. How it works:

  1. VideoCoF uses a timing system called RoPE (rotary position embeddings) and resets time indices to avoid clashes between source, reasoning, and edited frames.
  2. It sets the reasoning frame to time 0, and both source and target to 1…F, preventing index collisions.
  3. This preserves motion alignment and allows length extrapolation (e.g., 4× longer) without artifacts.

Why it matters: Your edits stay synchronized with the original motion, even in longer videos.

🍞 Bottom Bread (Anchor): If the person lifts a shirt in frame 30, the edited version should lift the shirt in frame 30 too. With VideoCoF’s RoPE design, that alignment holds—even when the video is much longer than the training clips.

🍞 Top Bread (Hook): If directions are fuzzy, people make mistakes; if directions are clear, work goes smoothly.

🥬 Filling (The Actual Concept): What it is: The field lacked a simple, unified way to connect instructions to exact spatial regions without masks, and to keep motion aligned over time. How it works:

  1. Introduce explicit reasoning frames (gray highlights) to ground language to space.
  2. Use RoPE index resetting to ground time and prevent timing mix-ups.
  3. Train a single Video Diffusion Transformer to learn the whole pipeline, guided by a “temporal triptych” prompt that narrates the three parts (original → grounding → edited).

Why it matters: This fills the gap between precision and unification, making mask-free edits reliably accurate.

🍞 Bottom Bread (Anchor): Instead of carrying separate tools for addition, removal, swap, and style changes (and drawing masks), you just talk to one model that shows where it will act and then edits there—correctly.

02Core Idea

🍞 Top Bread (Hook): You know how coaches ask players to “visualize the play” before they move? Visualizing helps them act precisely.

🥬 Filling (The Actual Concept): The “Aha!” in one sentence: Make the video model first visualize the exact edit region (reason) and only then apply the edit. How it works:

  1. Concatenate three parts along time: source frames → reasoning frames (predicted gray highlights) → target edited frames.
  2. Train the model so it must predict the reasoning frames before it predicts the edited frames.
  3. Use a careful RoPE time-index layout to keep motion aligned and allow long videos.

Why it matters: This turns vague, mask-free editing into precise, instruction-following editing without extra user effort.

🍞 Bottom Bread (Anchor): Ask “Change only the tree bark to crystal.” The model first shows a gray glow over the bark (where), then makes it crystalline (what) without touching leaves or sky.

Multiple Analogies:

  1. Teacher’s margin notes: Before fixing an essay, the teacher circles the exact sentences to change (reasoning) and then rewrites them (editing).
  2. GPS navigation: First, drop a pin at the destination (reasoning region), then follow the route (edit) without wandering off.
  3. Cookbook recipe: Identify the ingredient you’ll modify (reason), then perform the cooking step only on that ingredient (edit), not the whole dish.

Before vs After:

  • Before: Unified models guessed where to edit; multi-instance scenes confused them; long videos broke timing.
  • After: The model shows its target region explicitly, nails multi-instance/local edits, and stays aligned over longer videos thanks to RoPE.

Why It Works (intuition):

  • Separating “where” from “what” simplifies learning: first map language to a place, then apply the change there.
  • The gray mask acts like a gentle highlighter that diffusion models respect without overfitting to harsh binary masks.
  • RoPE index resetting avoids the model memorizing a fixed short timeline and instead teaches it a reusable sense of time.

Building Blocks (each explained with sandwich below):

  • Video Diffusion Model (the engine that paints frames)
  • Chain-of-Frames (the 3-part timeline: see → reason → edit)
  • Temporal Reasoning (understanding how motion and timing line up)
  • Reasoning Tokens via Visual Grounding (gray highlights as soft region maps)
  • RoPE Alignment Strategy (time index design to avoid collisions and enable extrapolation)
  • Temporal Triptych Prompt (a clear text template that narrates the three parts)

Concept Sandwiches:

  1. Video Diffusion Model 🍞 Hook: Imagine a magic eraser and paintbrush that can add, remove, or restyle parts of a moving picture. 🥬 The Concept: What it is: A model that denoises fuzzy frames step by step until they look like the desired video. How it works: (1) Start from noisy latents, (2) predict cleaner versions over many steps, (3) use text instructions to guide what appears, (4) decode latents back to frames. Why it matters: It’s the core engine that can actually render your requested changes frame by frame. 🍞 Anchor: From noise to a scene with “a brown-and-white beagle sniffing a metal bowl,” as instructed.

  2. Chain-of-Frames (CoF) 🍞 Hook: You know comic strips show panels in order: setup, clue, punchline. 🥬 The Concept: What it is: Arrange the process as three time blocks: original video → reasoning (where to edit) → edited video. How it works: (1) Feed source frames, (2) predict gray-highlight reasoning frames, (3) generate edited frames, in one timeline. Why it matters: Forces the model to think “where” before doing “what,” improving precision. 🍞 Anchor: First see the kitchen, then glow over the biggest cup, then make that cup white.

  3. Temporal Reasoning 🍞 Hook: When you dance to music, you move in time with the beat. 🥬 The Concept: What it is: Understanding how things change over frames and staying aligned with the original motion. How it works: (1) Track object identity across frames, (2) align predicted edits with those frames, (3) preserve actions at the right moments. Why it matters: Prevents edits from drifting off-beat, which causes jitters or mismatched poses. 🍞 Anchor: If a jacket flips open at frame 30, the edited red leather jacket also flips open at frame 30.

  4. Reasoning Tokens (gray visual grounding) 🍞 Hook: Highlighters help you focus on the important sentence before you rewrite it. 🥬 The Concept: What it is: Soft gray overlays that indicate the edit region, predicted by the model. How it works: (1) Predict a grayscale highlight over the target, (2) optionally increase transparency progressively, (3) use it to guide the edit that follows. Why it matters: Gives explicit spatial cues without user-drawn masks. 🍞 Anchor: For “remove the woman on the left,” the model first shades that woman in gray, then removes her. (A code sketch of this highlight follows this list.)

  5. RoPE Alignment Strategy 🍞 Hook: Marchers avoid bumping by keeping their step numbers offset. 🥬 The Concept: What it is: A time-index plan that gives unique, non-colliding positions to source, reasoning, and edited frames. How it works: (1) Set reasoning at index 0, (2) set source and edited at 1…F, (3) avoid collisions and keep motion aligned, (4) enable longer-than-training videos. Why it matters: Prevents artifacts and keeps edits synchronized as videos get longer. 🍞 Anchor: No ghosting at the first edited frame; actions match across source and target even at 4× length.

  6. Temporal Triptych Prompt 🍞 Hook: A clear to-do list beats a vague nudge. 🥬 The Concept: What it is: A text template that says: part 1 original, part 2 grounded region, part 3 edited result. How it works: (1) The prompt narrates all three parts, (2) the model conditions on that structure, (3) better maps language to steps. Why it matters: Improves instruction following without huge pretraining. 🍞 Anchor: “A video sequence showing three parts: first the original scene, then grounded the t-shirt, and finally the same scene but make the t-shirt cerulean.”
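To make the reasoning-token idea from item 4 concrete, here is a minimal sketch of how a soft gray highlight could be composited from a binary instance mask. The paper does not publish this exact routine; the gray value, the maximum opacity, and the progressive-transparency schedule below are illustrative assumptions.

```python
import numpy as np

def make_reasoning_frames(frame, mask, num_frames=4, gray=128, max_alpha=0.7):
    """Build soft gray 'reasoning frames' from one video frame and a binary mask.

    frame: (H, W, 3) uint8 RGB frame
    mask:  (H, W) boolean array marking the region to edit
    Returns a list of frames where the target region is blended toward gray,
    with the highlight growing progressively stronger (an assumed schedule).
    """
    frames = []
    for i in range(1, num_frames + 1):
        alpha = max_alpha * i / num_frames              # progressive transparency
        out = frame.astype(np.float32).copy()
        out[mask] = (1 - alpha) * out[mask] + alpha * gray
        frames.append(out.clip(0, 255).astype(np.uint8))
    return frames
```

Stacking a few of these progressively stronger highlights gives the kind of four-frame reasoning block the ablations later favor.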

03Methodology

At a high level: Input (source video + instruction) → Step A: Predict reasoning tokens (gray-highlight frames) → Step B: Generate edited frames guided by the reasoning and text → Output: Edited video aligned in space and time.

Step-by-step recipe:

🍞 Hook: Think of cooking: prep (identify the ingredient), then cook (apply heat). 🥬 Step A — Predict Reasoning Tokens (Where to edit) What happens: The model takes the source video and instruction and predicts soft gray-highlight frames that mark the intended edit regions. How it works:

  1. Encode source, reasoning, and target clips separately using a Video VAE to get latent tokens.
  2. Concatenate them in time: [source | reasoning | target]. Keep source clean; add noise to reasoning+target latents during training.
  3. Train the model to denoise reasoning first (finding the region), then target (applying the change).

Why this step exists: Without it, the model guesses the region, causing wrong-object edits and spillover.

Example: Instruction: “Remove the young woman in beige pants on the left.” The model first shades that person left-of-center.

🍞 Anchor: Like circling the sentence before rewriting it.
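A minimal training-time sketch of the layout described in Step A, assuming a simple interpolation-style noising of the reasoning and target latents while the source latents stay clean. The `vae` interface, tensor shapes, and the exact noise schedule are placeholders, not the paper's released code.

```python
import torch

def build_cof_sequence(vae, source, reasoning, target, t):
    """Concatenate [source | reasoning | target] latents along the time axis.

    source, reasoning, target: pixel-space clips shaped (B, C, T, H, W)
    t: scalar noise level in [0, 1] (interpolation-style schedule, an assumption)
    """
    z_src = vae.encode(source)       # conditioning: kept clean
    z_rea = vae.encode(reasoning)    # denoised first (where to edit)
    z_tgt = vae.encode(target)       # denoised after reasoning (what to change)

    clean = torch.cat([z_rea, z_tgt], dim=2)      # dim=2 is the latent time axis
    noise = torch.randn_like(clean)
    noisy = (1 - t) * clean + t * noise           # mix clean latents with noise

    model_input = torch.cat([z_src, noisy], dim=2)
    return model_input, clean, noise              # targets for the denoising loss
```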

🍞 Hook: After circling the sentence, you rewrite just that line. 🥬 Step B — Generate Edited Frames (What to change) What happens: With the region highlighted, the model renders the edited video frames. How it works:

  1. Use the predicted reasoning tokens to guide attention toward the region.
  2. Denoise the target latents step by step using the instruction.
  3. Decode the cleaned target latents back into frames via the VAE decoder.

Why this step exists: Ensures the change is applied precisely where intended and not elsewhere.

Example: “Replace her brown tracksuit with a bright red leather jacket and black leggings, add realistic highlights.” Only clothing changes; skin and background stay intact.

🍞 Anchor: Painting only inside the traced lines.
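And a matching inference-time sketch: only the reasoning and target slots start from noise and get updated, while the source slots stay fixed as clean context. `model` and `scheduler` are generic placeholders for the diffusion transformer and its sampler, so treat this as the shape of the loop rather than the actual API.

```python
import torch

@torch.no_grad()
def edit_video(model, scheduler, vae, source, text_emb, steps=30):
    """See -> reason -> edit at inference time (sketch, not the released code)."""
    z_src = vae.encode(source)                          # (B, C, F, h, w), kept clean
    b, c, f, h, w = z_src.shape
    z_gen = torch.randn(b, c, 1 + f, h, w, device=z_src.device)  # 1 reasoning latent + F target latents

    for t in scheduler.timesteps[:steps]:
        z_all = torch.cat([z_src, z_gen], dim=2)        # [source | reasoning | target]
        pred = model(z_all, t, text_emb)                # prediction over the full sequence
        # placeholder sampler update: only the noisy reasoning+target part is revised
        z_gen = scheduler.step(pred[:, :, f:], t, z_gen)

    reasoning, edited = z_gen[:, :, :1], z_gen[:, :, 1:]
    return vae.decode(edited), vae.decode(reasoning)    # edited clip + gray reasoning frames
```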

🍞 Hook: Marching in step keeps a band together. 🥬 Motion Alignment & Length Extrapolation — RoPE Design What happens: The model uses a careful timeline index plan to avoid time-index collisions and to generalize to longer videos. How it works:

  1. Assign indices: source 1…F, reasoning 0, target 1…F.
  2. This prevents the reasoning frame from colliding with the first source/target frames.
  3. Because no fixed [0…2F-1] schedule is hardcoded, the model can handle longer sequences at inference (e.g., 141 frames).

Why this step exists: Without it, the first frames pick up artifacts and long videos lose alignment (blur, jitter).

Example: A model trained on 33-frame clips can edit 81- or 141-frame videos while staying crisp and keeping motion matched.

🍞 Anchor: No more tripping over the starting line at frame 0.
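The index bookkeeping itself is tiny. Here is a sketch of the temporal positions under the reset scheme described above; the rotary embedding math on top of these indices is standard RoPE and follows whatever the backbone already uses.

```python
import torch

def cof_time_indices(num_frames: int, num_reason_latents: int = 1) -> torch.Tensor:
    """Temporal RoPE indices for the [source | reasoning | target] sequence.

    Source frames take 1..F, the reasoning latent(s) take 0, and target frames
    reuse 1..F so that edited frame k shares its time index with source frame k.
    Nothing is pinned to a fixed 0..2F-1 range, so F can grow at inference time.
    """
    src = torch.arange(1, num_frames + 1)
    rea = torch.zeros(num_reason_latents, dtype=torch.long)
    tgt = torch.arange(1, num_frames + 1)
    return torch.cat([src, rea, tgt])

# cof_time_indices(5) -> tensor([1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5])
```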

🍞 Hook: Good ingredients make good dishes. 🥬 Data Curation for Multi-Instance Reasoning What happens: Build triplets (source, reasoning, edited) across addition, removal, swap, and local style tasks. How it works:

  1. Start from a large video pool; detect multiple objects with a vision-language model.
  2. Segment each instance with Grounded-SAM2; create edits via tools (e.g., MiniMax-Remover for removal, VACE-14B for inpainting-based swaps/styles) using GPT-4o for creative prompts.
  3. Filter results using Dover (aesthetics) and VIE (edit fidelity) scores; distill to 50k high-quality pairs.

Why this step exists: The model must practice complex, instance-level scenes to learn solid reasoning.

Example: Scenes with several people or many similar cups, annotated so the model learns left/right and “largest” correctly.

🍞 Anchor: Like practicing with tricky worksheets so the test feels easy.
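The tools named above (Grounded-SAM2, MiniMax-Remover, VACE-14B, GPT-4o, Dover, VIE) are the paper's choices; the sketch below only illustrates the final filtering stage, with hypothetical scoring functions and thresholds, since the actual cutoffs are not given here.

```python
def filter_triplets(candidates, aesthetic_score, fidelity_score,
                    min_aesthetic=0.6, min_fidelity=0.7, budget=50_000):
    """Keep the best (source, reasoning, edited) triplets up to a fixed budget.

    aesthetic_score / fidelity_score stand in for Dover- and VIE-style scorers;
    the thresholds and the sum-based ranking are illustrative assumptions.
    """
    kept = []
    for trip in candidates:
        a = aesthetic_score(trip["edited"])
        f = fidelity_score(trip["source"], trip["edited"], trip["instruction"])
        if a >= min_aesthetic and f >= min_fidelity:
            kept.append((a + f, trip))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [trip for _, trip in kept[:budget]]
```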

🍞 Hook: Clear instructions help you avoid mistakes. 🥬 Temporal Triptych Prompting What happens: During training/inference, the text prompt describes the three-part structure. How it works:

  1. Template: “A video sequence showing three parts: first the original scene, then grounded {region}, and finally the same scene but {edit}.”
  2. This anchors language to the see → reason → edit timeline.

Why this step exists: It boosts instruction following without massive instruction-tuning data.

Example: “Grounded the left side hair,” then “Transform the person’s hair into realistic flames.”

🍞 Anchor: Like giving step-by-step directions to a friend so they won’t get lost.
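Because the template is fixed, building the prompt is essentially a one-liner. A sketch, with the wording taken from the template quoted above (any grammar glue around the {region} and {edit} slots is up to the data pipeline):

```python
def triptych_prompt(region: str, edit: str) -> str:
    """Narrate the see -> reason -> edit timeline for the text encoder."""
    return (
        "A video sequence showing three parts: first the original scene, "
        f"then grounded {region}, and finally the same scene but {edit}."
    )

# triptych_prompt("the t-shirt", "make the t-shirt cerulean")
```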

The Secret Sauce (what makes it clever):

  • Forcing the model to show where it will act (reasoning frames) before acting.
  • Using soft, progressive gray masks that diffusion models understand well.
  • A RoPE index plan that avoids collisions and empowers long-sequence generalization.
  • A simple, structured prompt that teaches the model the three-part storyline without huge pretraining.

04Experiments & Results

🍞 Top Bread (Hook): Report cards mean more when you know the class average.

🥬 The Test: What they measured and why

  • Instruction Following: Did the edit match the exact request (e.g., right person, correct side, right object among many)?
  • Preservation: Were untouched parts truly preserved (no background damage)?
  • Quality: Is the video clean, natural, and artifact-free?
  • Success Ratio: A strict pass/fail judged by GPT-4o.
  • Perceptual Metrics: CLIP-T (text-image alignment), CLIP-F (temporal consistency), DINO (structural consistency).

Why: Together, these reflect both “did it do the right thing?” and “did it do it cleanly over time?”

🍞 Bottom Bread (Anchor): It’s like grading a science fair: correctness, neatness, and staying within the project rules.
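For the perceptual metrics, a common recipe is to score CLIP-T as the mean cosine similarity between each edited frame and the edit caption, and CLIP-F as the mean similarity between adjacent frames. The sketch below follows that common recipe with Hugging Face CLIP; the paper's exact formulation and checkpoint may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_t_and_f(frames, caption):
    """frames: list of PIL images from the edited clip; caption: target edit text."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    clip_t = (img @ txt.T).mean().item()                 # frame-to-text alignment
    clip_f = (img[:-1] * img[1:]).sum(-1).mean().item()  # adjacent-frame consistency
    return clip_t, clip_f
```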

The Competition (baselines):

  • InsV2V, Señorita (I2V with InstructPix2Pix), VACE-14B (with GPT-4o captions), ICVE (huge pretraining), LucyEdit.

The Scoreboard (with context):

  • VideoCoF trained on only 50k pairs yet leads in Instruction Following (8.97) and Success Ratio (76.36%). That’s like getting an A+ while some others studied with way more textbooks.
  • CLIP-T is highest, showing strong text-to-video alignment.
  • On Preservation/Quality, ICVE is slightly higher in some cases, likely due to its 1M pretraining + 150k SFT scale advantage, but VideoCoF still stays competitive.

Surprising Findings:

  • Reasoning format matters a lot: a soft gray highlight with progressive transparency outperforms black masks or red overlays.
  • Four reasoning frames (which compress to one reasoning latent) hit the sweet spot; five frames (two reasoning latents) add complexity and hurt performance.
  • RoPE index resetting ([1…F, 0, 1…F]) avoids index collisions (and the artifacts they cause) and unlocks 4× length extrapolation with stable motion.

Task-wise highlights (from VideoCoF-Bench):

  • Multi-instance removal: Precisely removes the correct person (e.g., “on the right”) where others sometimes remove the wrong one.
  • Object addition: Places new objects or people in the correct spatial context (e.g., girl inside the washing machine window).
  • Object swap: Accurately changes both face and clothing when instructed; some baselines change only part of the target or the wrong person.
  • Local style: Correctly identifies and edits “the largest cup” among similar items; others may edit the wrong object.

Big Picture: The see → reason → edit workflow converts vague mask-free editing into precise, grounded editing—at small data budgets—and stays steady on longer videos.

05Discussion & Limitations

🍞 Top Bread (Hook): Even great tools have instruction manuals and safety notes.

🥬 Limitations (be specific):

  • Global, full-frame artistic restyles are supported but weren’t the main focus; further tuning may boost them more.
  • Extremely unusual edits outside the training tasks (e.g., complex physics changes) may need more data.
  • Length extrapolation far beyond 4× the training length should work in principle, but it hasn’t been fully stress-tested at very large scales.
  • Requires a Video Diffusion Transformer backbone and a VAE; not a tiny model, so inference needs GPU memory.

Required Resources:

  • A capable GPU for inference on 33–141+ frame sequences.
  • The VideoCoF weights, tokenizer, VAE, and prompt template.
  • For custom fine-tuning: a curated source–reasoning–edited dataset.

When NOT to Use:

  • If you need pixel-perfect rotoscoping across hundreds of frames for film-grade compositing, a mask-based pipeline might still be preferable.
  • If your edit requires extremely precise physics simulation (e.g., fluid-accurate spill propagation), specialized simulators are better.
  • If compute is very limited (e.g., mobile-only offline), a lighter editor may be necessary.

Open Questions:

  • How far can length extrapolation go while keeping perfect motion and identity? 8×? 16×?
  • What’s the best mix of image+video data to boost both detail and dynamics?
  • Can reasoning frames encode richer cues (e.g., layered regions or relations like “left of X and behind Y”) without complicating training?
  • Can the same see → reason → edit idea unify global style, ID-driven edits, and motion retiming under one prompt?

🍞 Bottom Bread (Anchor): Think of this as version 1 of a new blueprint: it works great now and points to exciting upgrades next.

06Conclusion & Future Work

Three-sentence summary: VideoCoF turns video editing into a see → reason → edit process by predicting soft gray reasoning frames before applying changes, removing the need for user masks. A careful RoPE time-index design keeps motion aligned and lets the model handle much longer videos than it trained on. With just 50k examples, it achieves state-of-the-art instruction following and success rates on a new benchmark.

Main Achievement: Showing that explicit, learned reasoning about where to edit—via gray-highlight frames—unlocks mask-free precision in unified video editing.

Future Directions: Scale data and tasks, blend image+video training for finer details, expand reasoning frames to encode richer relations, and generalize to global styling and ID-driven edits under the same framework.

Why Remember This: It’s a simple but powerful idea—make the model show where it will act before it acts—that bridges the gap between ease (no masks) and accuracy (right spot, right time), and it keeps working even as videos get longer.

Practical Applications

  • Edit the correct person in crowded scenes (e.g., change only the coach’s jacket color on the right sideline).
  • Add or remove objects in tutorials without touching the background (e.g., insert a missing tool on a workbench).
  • Swap a product’s design or logo in ads while preserving hand and motion alignment.
  • Local style transfer for fashion lookbooks (e.g., turn fabric to glossy leather) without affecting skin tones or hair.
  • Cleanly remove passersby from vacation videos while keeping the scenery intact.
  • Long-form content tweaks (e.g., recolor a bike across a 2–3 minute montage) with motion staying in sync.
  • Education and science demos: highlight then transform a specific lab item to teach concepts visually.
  • Social media content polishing: fix a single prop or outfit in a multi-person dance video.
  • Pre-visualization in film: quickly test wardrobe or prop changes on a specific actor without manual rotoscoping.
  • E-commerce: update product colorways or textures in try-on or showcase videos with precise region targeting.
Tags: video editing · diffusion transformer · chain-of-frames · visual grounding · reasoning tokens · rotary position embedding · RoPE alignment · in-context learning · multi-instance editing · temporal reasoning · video VAE · instruction following · length extrapolation · local style transfer · object addition/removal/swap