FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing
Key Summary
- This paper makes video editing easier by teaching an AI to spread changes from the first frame across the whole video smoothly and accurately.
- It introduces FFP-300K, a giant, high-quality dataset of 290k video pairs at 720p and 81 frames that is perfect for training this skill.
- The method removes the need for extra helpers like depth maps or per-video fine-tuning and works using just the source video and the edited first frame.
- A new trick called AST-RoPE lets the model adjust how it thinks about space and time so it keeps the look from the first frame and the motion from the source.
- Another trick, self-distillation, lets the model learn from a 'perfect copy' task to stay stable over time and avoid drifting away from the intended edit.
- On the EditVerseBench test, the method beats both research and commercial systems, with around +0.2 PickScore and +0.3 VLM score improvements.
- The dataset is built with two tracks: precise local edits (swap/remove objects) and full-scene stylization, ensuring diversity and temporal consistency.
- Ablations show both AST-RoPE and self-distillation are important, with the full model scoring the best on temporal consistency and quality.
- User studies also prefer this method for editing accuracy, motion accuracy, and overall video quality.
- The result is guidance-free, generalizable video editing that is more practical for creators, studios, and everyday users.
Why This Research Matters
This work makes professional-looking video edits possible without extra tools, so creators can work faster and with fewer errors. Movie studios can change props or materials across scenes reliably, saving time and money. Teachers and students can stylize educational videos smoothly, making lessons more engaging. Social media makers can keep their edits stable across longer clips, reducing flicker and weird artifacts. Because the method is guidance-free at run time, it's simpler to use and more robust to different situations. The big dataset also helps the research community build and compare better FFP models. Overall, it brings high-quality, generalizable video editing closer to everyday use.
Detailed Explanation
01 Background & Problem Definition
š Top Bread (Hook): You know how when you decorate the first page of a scrapbook, you want the rest of the pages to match the style so the whole book feels like one story? Videos are like that: if we can get the first frame just right, we'd love the rest to follow smoothly.
š„¬ Filling (The Actual Concept):
- What it is: This paper is about First-Frame Propagation (FFP), a way to edit videos by perfecting the first frame and then teaching an AI to spread that change through every frame while keeping the original motion.
- How it works (history and problem):
- Before: Many video editors tried to follow text instructions directly (like "make the car red"), but doing that across time is hard: models must understand meaning and also keep motion consistent, which often leads to flicker or wrong shapes.
- A promising idea called FFP let people use great image editing tools to perfect the first frame, then ask the model to propagate it. But existing FFP systems leaned on heavy "crutches" at run time: per-video fine-tuning, extra inputs like depth maps, or optical flow, which are slow and fragile.
- Root cause: Training data wasn't good enough. Many datasets had clips that were too short, too small (low resolution), or too narrow in the types of edits. Without long, high-quality, diverse examples, models didn't learn strong "temporal priors" (the sense of how things move and stay stable over time).
- Why it matters: Without the right data and a method tuned for FFP, video edits either don't follow motion correctly (the edit slides off) or break visual quality (flicker, artifacts). Users want edits that look real and stay consistent.
š Bottom Bread (Anchor): Imagine you put a sticker on the first frame so a gray statue becomes shiny silver. In a good FFP system, every next frame keeps that same shiny look stuck to the statue as it turns, walks past a tree, or the camera pans: no sliding, no flicker.
š Top Bread (Hook): Picture two big problems: keeping how something looks (appearance) and keeping how it moves (motion). Holding both at once is like patting your head and rubbing your belly.
š„¬ Filling (The Actual Concept):
- What it is: The field needed a dataset and a method that can balance appearance (from the edited first frame) and motion (from the source video) without extra helpers.
- How it works (what people tried and why it failed):
- Instruction-based editing struggled with longer videos: text is vague, and models lost track over time.
- Early FFP systems used run-time guidance: depth maps, masks, or per-video tuning. These helped, but made pipelines complex and limited generalization (if the depth map is wrong, the edit breaks).
- Mixed or low-quality datasets (short clips, low-res, image-video blends) didnāt teach models to handle long motion and fine details together.
- Why it matters: A clean, guidance-free approach is faster, simpler, and more reliable for real creators, like film editors, YouTubers, teachers, and kids making class projects.
š Bottom Bread (Anchor): Think of trying to repaint a character's jacket across a 3-second clip (short) versus a 5-second clip (long) at HD resolution. If your training examples are only tiny, 1-second, blurry clips, your model won't learn how the jacket twists, stretches, and reflects light realistically over time.
š Top Bread (Hook): Imagine a music teacher giving you lots of songs of different lengths, styles, and speeds so you can play confidently without help. Thatās what the new dataset does for video edits.
š„¬ Filling (The Actual Concept):
- What it is: The paper introduces FFP-300K, a huge, clean dataset designed specifically for First-Frame Propagation.
- How it works:
- Scale: ~290k source/edited video pairs, each at 720p and 81 frames, long and sharp enough to teach long-range motion.
- Two-track pipeline: Local edits (swap or remove specific objects) and Global stylization (change the whole sceneās look), each built with specialized steps for accuracy.
- Quality control: Automated checks, manual verification for removals, and re-generation after fine-tuning to ensure stable, high-fidelity pairs.
- Why it matters: With long, diverse, well-aligned pairs, models learn robust temporal priors and donāt need extra guidance at run time.
š Bottom Bread (Anchor): It's like practicing basketball layups and three-pointers separately, with drills for both, so in a real game you can switch smoothly. FFP-300K's two tracks train both precise local changes and big scene-wide styles.
š Top Bread (Hook): You know how maps can stretch or shrink distances (like zooming in on a city block)? If we could do that in space and time inside the model, we'd help it keep appearance and motion balanced.
š„¬ Filling (The Actual Concept):
- What it is: The method adds AST-RoPE, which adaptively remaps the model's sense of positions across space and time, based on the source video.
- How it works: It predicts two numbers from the source video: one that pulls other frames closer to the first frame (to keep the new look strong) and one that stretches/compresses time (to match how fast things move). Different attention heads specialize in space vs. time.
- Why it matters: Without this, the model treats all frames the same way, and the edited look can fade or the motion can go off-track.
š Bottom Bread (Anchor): It's like telling the model, "Hey, pay a little extra attention to that first shiny statue frame anywhere in the video, and also speed up or slow down your sense of time to match this clip's motion."
š Top Bread (Hook): Have you ever checked your own homework by redoing a problem the straightforward way? If the answers match, you trust your method.
š„¬ Filling (The Actual Concept):
- What it is: Self-distillation with identity propagation. Create a "teacher" task where the model reconstructs a video from itself (a perfect target), then make the FFP "student" match the teacher's motion patterns over time.
- How it works: The teacher encodes ideal relationships between frames. The student learns to keep those relationships while spreading the first-frame edit, which fights drift.
- Why it matters: Without this, the editās influence can fade or wobble by the end of the video.
š Bottom Bread (Anchor): Think of tracing over your own neat drawing to learn the exact curves, then drawing a new picture with the same curves but a new color; you keep the structure steady while changing the appearance.
Finally, what are the real-life stakes? Clean, stable, guidance-free editing makes it easier for movie studios to change props, for educators to stylize lessons, and for everyday creators to make fun, consistent videos without extra tools or long wait times.
02 Core Idea
š Top Bread (Hook): Imagine you nail the look of the first frame, like giving a robot a shiny red paint job, and you want that look to stick perfectly as it walks, turns, and waves.
š„¬ Filling (The Actual Concept):
- The "Aha!" in one sentence: If you train on the right kind of long, sharp, diverse examples and teach the model to adapt its sense of space and time while learning from its own perfect copy, you can spread a first-frame edit across a whole video without extra help.
Multiple Analogies:
- Sticker Book: You place one perfect sticker on page 1 (the edit). AST-RoPE is the glue that keeps similar stickers aligned on later pages, and self-distillation is checking your past pages to keep the placement consistent.
- Orchestra Conductor: The first frame sets the melody (appearance). AST-RoPE tells sections when to play louder/softer across time (motion speed), while self-distillation listens to a perfect recording (identity task) and keeps the live performance on beat.
- GPS with Traffic: The first frame is your destination's style. AST-RoPE updates the map distances (space-time) using live traffic (source motion). Self-distillation is checking your route against a known good replay to avoid drifting off course.
Before vs. After:
- Before: Models leaned on crutches at run time (depth maps, per-video tuning). Edits often flickered, slid off objects, or lost motion details.
- After: Guidance-free propagation using just the source video and the edited first frame. Edits stay stuck to objects, motion feels natural, and visuals are crisp.
Why It Works (Intuition without equations):
- Appearance anchor: AST-RoPE shrinks the model's perceived distance to the first frame for special heads that track appearance, so the edit's look remains strong.
- Motion match: AST-RoPE stretches or compresses time to fit how fast things actually move in the source, so motion cues are respected.
- Stability over time: Self-distillation gives the model a gold-standard motion blueprint from a simple identity task. The FFP task is nudged to keep the same inter-frame relationships, preventing drift.
- Data matters: Long, high-quality, diverse pairs (FFP-300K) teach the model long-range stability and fine detail together.
Building Blocks (explained with Sandwich units):
- š You know how fixing page 1 of a comic sets the style for the whole book?
š„¬ First-Frame Propagation (FFP): It's a way to edit videos by perfecting frame 1, then spreading that change across the rest while keeping original motion.
- Steps: Edit frame 1 using great image tools; feed the source video + edited frame to the model; the model generates all other frames with the new look.
- Why it matters: It splits a hard problem (understand text + keep motion) into easier parts (edit one frame well + propagate).
š Anchor: You color a car blue in frame 1; every next frame shows the same blue car as it drives and turns.
- š Imagine a training gym that has drills for both finger tricks and full-body moves.
š„¬ FFP-300K dataset: A huge set of paired videos at 720p and 81 frames, built with two tracks: local object edits (swap/remove) and global stylization.
- Steps: Detect and mask objects; generate swap/remove examples; make whole-scene style examples with depth-aware guidance; filter for quality.
- Why it matters: Long, clean, diverse pairs teach stable motion and rich appearances, so no run-time crutches are needed.
š Anchor: Practice swapping a backpack (local) and turning a scene into watercolor (global) across many clips.
- š Think of stretching a rubber grid over a map so important spots are closer and time runs faster or slower where needed.
š„¬ Adaptive Spatio-Temporal RoPE (AST-RoPE): A mechanism that changes the model's internal sense of space and time based on the source video.
- Steps: Predict two scalings from the source: one draws all frames nearer to frame 1 for appearance heads; the other stretches/compresses time for motion heads.
- Why it matters: Keeps the edit glued to objects and motion aligned with the source.
š Anchor: The model pays extra attention to the edited jacket in every frame and speeds up its sense of time for a running person.
- š Like checking your drawing by tracing a perfect version first.
š„¬ Self-distillation with identity propagation: A teacher task reconstructs a video from itself (a perfect target); the FFP student matches those inter-frame relationships.
- Steps: Run two paths in training: teacher (identity) and student (FFP). Align their motion relationships so edits don't drift.
- Why it matters: Prevents the model from forgetting the edit by the end.
š Anchor: The curvature of a moving catās tail stays consistent across frames even after you changed its fur pattern.
- š Imagine a simple conveyor belt: inputs go in; features get combined; outputs roll out.
š„¬ The model pipeline: Encode source video and edited frame; combine them into a composite latent; use a diffusion transformer with AST-RoPE to generate all frames; train with flow-matching plus the distillation losses.
- Why it matters: Clean inputs + adaptive attention + stability training = guidance-free, high-fidelity edits.
š Anchor: You feed the original clip + one edited frame, and out comes a smooth, consistent edited video.
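For readers who like to see the conveyor belt as code, here is a minimal PyTorch-style sketch of the guidance-free inference flow. It is only a sketch: `vae`, `dit`, and `denoise_step` are hypothetical stand-ins for the paper's actual video VAE, diffusion transformer, and sampler, and the shapes are illustrative.

```python
import torch

@torch.no_grad()
def propagate_first_frame(vae, dit, source_video, edited_first_frame, num_steps=30):
    """Sketch of guidance-free first-frame propagation at inference time.

    source_video:       (T, C, H, W) original clip, e.g. T=81 frames at 720p
    edited_first_frame: (C, H, W) frame 1 after editing with any image editor
    vae, dit:           hypothetical encoder and diffusion-transformer modules
    """
    # 1. Compress pixels into latents (the model never works on raw pixels).
    src_lat = vae.encode(source_video)                # (t, c, h, w) latent clip
    edit_lat = vae.encode(edited_first_frame[None])   # (1, c, h, w) latent frame

    # 2. Start from pure noise for the target latents and denoise iteratively.
    target = torch.randn_like(src_lat)
    for step in reversed(range(num_steps)):
        # The DiT sees the noisy target, the source latents, and the edited
        # first frame together, so it can copy appearance from frame 1 while
        # following the source's motion. No depth maps, no per-video tuning.
        target = dit.denoise_step(target, src_lat, edit_lat, step)

    # 3. Decode back to pixels: an edited video that keeps the source motion.
    return vae.decode(target)
```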
03 Methodology
š Top Bread (Hook): Imagine a recipe for a big layered cake: you prepare the base (data), mix the batter (model inputs), bake with a smart oven (adaptive attention), and do a taste test against a perfect sample (self-distillation). Everything works together.
š„¬ Filling (The Actual Concept):
- What it is: A guidance-free First-Frame Propagation pipeline that learns from a purpose-built dataset and two key training tricksāAST-RoPE and self-distillationāto balance appearance and motion.
- High-level flow: Input (source video + edited first frame) → encode to latents → combine into a composite latent with a first-frame mask → Diffusion Transformer (DiT) with AST-RoPE → training with flow matching + motion and first-frame consistency alignment → output edited video.
Step-by-step details (with Sandwich for each key step):
- š You know how you compress photos into smaller files for faster sharing?
š„¬ Encode to latents: The source video and the edited first frame are passed through a VAE to create compact feature maps (latents).
- Why it exists: Working in latent space speeds up training and focuses on the most important visual information. Without it, generation is slower and noisier.
š Anchor: Instead of juggling millions of pixels per frame, the model handles a neat stack of feature maps.
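If you want to see what "encode to latents" looks like in practice, here is a small example using a standard image VAE from the diffusers library. The paper's system uses its own video VAE, so this is only an illustration of the latent-space idea, not the actual component.

```python
import torch
from diffusers import AutoencoderKL

# A widely used image VAE, standing in for the paper's video VAE.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae.eval()

frames = torch.rand(4, 3, 720, 1280) * 2 - 1   # 4 RGB frames scaled to [-1, 1]

with torch.no_grad():
    latents = vae.encode(frames).latent_dist.sample()  # (4, 4, 90, 160)
    recon = vae.decode(latents).sample                 # back to (4, 3, 720, 1280)

# Roughly 48x fewer values per frame: this compact grid is what the DiT edits.
print(frames.shape, "->", latents.shape)
```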
- š Imagine putting a bright sticky note on page 1 to remind the model, āThis is the edited one!ā
š„¬ Composite latent assembly: We pad the edited first-frame latent along time, concatenate it with the noisy target latent, the source latent, and a binary mask that marks which positions belong to frame 1.
- Why it exists: Explicitly flags where the edit comes from so the model knows what to copy and where. Without it, the edit signal could get lost.
š Anchor: The mask is like a highlighter over frame 1 so attention keeps returning to the new appearance.
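Here is a minimal sketch of how such a composite latent could be assembled in PyTorch. The channel order, shapes, and mask layout are illustrative assumptions, not the paper's exact design.

```python
import torch

def build_composite_latent(noisy_target, source_latent, edited_first_latent):
    """Assemble the DiT input: noisy target + source + edited frame 1 + mask.

    noisy_target:        (t, c, h, w) noised latents the model will denoise
    source_latent:       (t, c, h, w) latents of the original video
    edited_first_latent: (1, c, h, w) latent of the edited first frame
    """
    t, c, h, w = noisy_target.shape

    # Pad the edited first frame along time so it lines up with the clip:
    # real content at position 0, zeros everywhere else.
    edit_padded = torch.zeros(t, c, h, w)
    edit_padded[0] = edited_first_latent[0]

    # Binary mask that flags which temporal positions carry the edit signal.
    mask = torch.zeros(t, 1, h, w)
    mask[0] = 1.0

    # Concatenate along channels so every token can "see" all three sources
    # plus the highlighter-like mask over frame 1.
    return torch.cat([noisy_target, source_latent, edit_padded, mask], dim=1)

# Example with toy sizes (21 latent frames, 16 channels, 90x160 latent grid):
comp = build_composite_latent(torch.randn(21, 16, 90, 160),
                              torch.randn(21, 16, 90, 160),
                              torch.randn(1, 16, 90, 160))
print(comp.shape)  # torch.Size([21, 49, 90, 160])
```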
- š Think of a translator who must understand both where things are (space) and how they change (time).
š„¬ Diffusion Transformer (DiT): The core generator reads the composite latent and iteratively denoises it to produce the final edited frames.
- Why it exists: Diffusion models are great at making images/videos crisp and realistic; the transformer backbone helps capture long-range relationships. Without it, you'd get blurry or unstable videos.
š Anchor: The DiT refines a noisy sketch into a clean video while remembering what came from the first frame and where objects move.
- š Picture stretching a grid so important squares get closer together and the timeline can run faster or slower.
š„¬ AST-RoPE (adaptive positions): The model predicts two scaling factors from the source video: one that decreases the effective distance to frame 1 for spatial heads (keeps appearance strong), and one that rescales time for temporal heads (fits motion speed).
- Why it exists: Standard positional embeddings treat all frames uniformly. Without AST-RoPE, edits can fade or misalign with motion.
š Anchor: For a fast dance video, the temporal scale compresses time; for a slow pan, it stretches time so attention connects the right frames.
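The sketch below shows one plausible way to wire up this kind of adaptive position scaling: two source-conditioned scalars, one shrinking temporal distance to frame 1 for appearance-focused heads and one rescaling the timeline for motion-focused heads. The module name, the head split, and the scalar predictor are assumptions for illustration; the paper's AST-RoPE may differ in its details.

```python
import torch
import torch.nn as nn

class ASTRoPESketch(nn.Module):
    """Illustrative sketch of adaptive spatio-temporal position scaling.

    The outputs are per-head temporal positions that would then feed a
    standard rotary position embedding (RoPE) inside the attention layers.
    """

    def __init__(self, feat_dim, num_heads, num_appearance_heads):
        super().__init__()
        self.num_heads = num_heads
        self.num_appearance_heads = num_appearance_heads
        # Predict two positive scalars from a pooled source-video feature.
        self.to_scales = nn.Sequential(nn.Linear(feat_dim, 2), nn.Softplus())

    def forward(self, source_feat, frame_idx):
        """
        source_feat: (feat_dim,) pooled feature of the source video
        frame_idx:   (T,) integer temporal positions 0..T-1
        returns:     (num_heads, T) per-head temporal positions for RoPE
        """
        anchor_scale, time_scale = self.to_scales(source_feat)  # two scalars
        pos = frame_idx.float()

        # Appearance heads: shrink every frame's distance to frame 0 so
        # attention keeps "seeing" the edited first frame as nearby.
        appearance_pos = pos * anchor_scale.clamp(max=1.0)

        # Motion heads: stretch or compress the timeline to match how fast
        # things move in this particular source clip.
        motion_pos = pos * time_scale

        n_app = self.num_appearance_heads
        return torch.cat([appearance_pos.expand(n_app, -1),
                          motion_pos.expand(self.num_heads - n_app, -1)], dim=0)

# Toy usage: 12 heads, 4 of them appearance-focused, 81 frames.
rope = ASTRoPESketch(feat_dim=256, num_heads=12, num_appearance_heads=4)
positions = rope(torch.randn(256), torch.arange(81))
print(positions.shape)  # torch.Size([12, 81])
```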
- š Ever compare two class notebooks, one perfect and one still learning, to match their patterns?
š„¬ Self-distillation with identity propagation: Train a teacher path that reconstructs a video from itself (identity). Use its inter-frame relationship patterns as targets for the student FFP path. Two gentle losses keep motion and first-frame influence stable.
- Why it exists: Prevents semantic drift, where the edit's effect fades or warps over time. Without it, the last frames can look off.
š Anchor: The way shadows move between frames remains consistent even after you changed a character's clothes in frame 1.
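As a concrete illustration, the sketch below measures inter-frame relationships as a cosine-similarity matrix over per-frame features and pulls the student (FFP path) toward the teacher (identity path). The feature source and the loss form are assumptions; the paper's exact distillation targets may differ.

```python
import torch
import torch.nn.functional as F

def interframe_affinity(features):
    """Cosine-similarity matrix between frames: a simple stand-in for the
    'relationships between frames' that the teacher is meant to encode.

    features: (T, D) one pooled feature vector per frame.
    returns:  (T, T) frame-to-frame affinity matrix.
    """
    f = F.normalize(features, dim=-1)
    return f @ f.t()

def self_distillation_loss(student_frame_feats, teacher_frame_feats):
    """Match the student's (FFP path) inter-frame structure to the teacher's
    (identity-propagation path). A sketch of the idea only."""
    a_student = interframe_affinity(student_frame_feats)
    # The teacher reconstructs the video from itself, so its affinities act as
    # a trusted "gold" motion blueprint; we do not backprop through it.
    a_teacher = interframe_affinity(teacher_frame_feats).detach()
    return F.mse_loss(a_student, a_teacher)

# Toy usage: 81 frames, 512-dim features from some intermediate layer.
student = torch.randn(81, 512, requires_grad=True)
teacher = torch.randn(81, 512)
loss = self_distillation_loss(student, teacher)
loss.backward()
```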
- š Think of a coach tracking both form and timing during practice.
š„¬ Training objective (flow matching + alignment): The main diffusion training (flow matching) teaches denoising; the motion-alignment loss matches frame-to-frame relationships; the first-frame consistency loss keeps the editās influence steady across time.
- Why it exists: Each piece fixes a failure mode: blurry samples, motion wobble, or fading edits.
š Anchor: It's like practicing scales (clarity), rhythm (motion), and melody (appearance) together.
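Putting the pieces together, a training step could look roughly like the sketch below: a rectified-flow-style denoising loss plus the two alignment terms. The model signature, batch keys, and loss weights are hypothetical placeholders, not the paper's actual values.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, clean_latent, extras):
    """Rectified-flow-style denoising loss (one common flow-matching recipe;
    the paper's exact schedule and conditioning may differ).

    clean_latent: (t, c, h, w) latents of the ground-truth edited video.
    extras:       whatever conditioning the composite latent carries.
    """
    noise = torch.randn_like(clean_latent)
    t = torch.rand(())                               # random time in [0, 1]
    x_t = (1.0 - t) * clean_latent + t * noise       # interpolate data and noise
    target_velocity = noise - clean_latent           # velocity the model should predict
    pred_velocity = model(x_t, t, extras)            # hypothetical call signature
    return F.mse_loss(pred_velocity, target_velocity)

def total_loss(model, batch, lambda_motion=0.5, lambda_first=0.5):
    """Full objective: denoising + motion alignment + first-frame consistency.
    The two auxiliary terms are assumed to come from the self-distillation
    paths sketched earlier; the weights here are illustrative only."""
    l_fm = flow_matching_loss(model, batch["edited_latent"], batch["extras"])
    l_motion = batch["motion_alignment_loss"]        # e.g., student vs. teacher affinities
    l_first = batch["first_frame_consistency_loss"]  # keeps frame-1 influence steady
    return l_fm + lambda_motion * l_motion + lambda_first * l_first
```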
- š You know how practicing both piano pieces and full-band rehearsals makes you a well-rounded musician?
š„¬ Two-track data pipeline (for training):
- Local editing: A vision-language model identifies objects; a segmenter makes masks; a video inpainting model performs swap/remove; mask erosion and bounding-box choices are tuned per task; tough removals are filtered and regenerated after fine-tuning for quality.
- Global stylization: Start from a style image and generate a source video aligned to it; then stylize with depth guidance so structure stays real while appearance changes.
- Why it exists: Teaches both precise, object-level propagation and scene-wide appearance shifts. Without both, the model would be lopsided.
š Anchor: Practice removing a person cleanly from a crowd (local) and turning a whole street into watercolor (global).
- š Sometimes the secret spice is what ties the whole dish together.
š„¬ Secret sauce: The combination of long, clean pairs (FFP-300K), adaptive attention (AST-RoPE), and self-checking (self-distillation).
- Why it exists: Any one alone helps; all three together unlock guidance-free, generalizable editing.
š Anchor: That's why the model can handle an 81-frame 720p clip with stable, sticky edits and natural motion, with no extra depth maps or per-video tuning.
Concrete example with actual data:
- Input: Source video of a wooden sculpture turning; edited first frame where the sculpture's material is changed to silver.
- Process: Encode both to latents; mark frame 1; DiT with AST-RoPE focuses attention on the edited material while aligning time to the turn speed; self-distillation keeps the material coherent through the full 81 frames.
- Output: The silver look sticks as the sculpture rotates; reflections shift properly; no flicker or drift.
What breaks without each step:
- No mask/composite: The model can't find the edit signal; the change fades.
- No AST-RoPE: The edit slides or falls off in motion; spatial details get lost over time.
- No self-distillation: The last frames look less edited; subtle drift appears.
- No two-track data: Good at stylization but bad at precise swaps, or vice versa.
04 Experiments & Results
š Top Bread (Hook): Think of a science fair where every robot does the same obstacle course. We measure speed, balance, and how well they follow instructions, then see who wins.
š„¬ Filling (The Actual Concept):
- The test: The team used EditVerseBench, a broad benchmark covering many editing types. Because this paper focuses on First-Frame Propagation, they filtered it to 125 clips that are suited for propagation. A standard image editor (Qwen-Edit) made the first-frame edits so the test is fair.
- What they measured and why:
- Temporal consistency (CLIP and DINO): Do frames agree with each other over time, like checking that a dance stays in sync? (A minimal sketch of this kind of check appears right after this list.)
- Text alignment (Frame and Video scores): Does the final video match the editās description per frame and across the whole clip?
- Perceptual quality (PickScore): Does it look great to viewers, like a polished film?
- VLM score: A vision-language model grades how well the edit matches the instruction in multiple sampled frames.
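As a concrete example of the temporal-consistency family of metrics, the sketch below scores a clip by averaging CLIP-feature cosine similarity between adjacent frames, using the public openai/clip-vit-base-patch32 checkpoint from Hugging Face transformers. Treat it as an illustration of the idea, not the benchmark's exact evaluation script.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_temporal_consistency(frames):
    """Average cosine similarity of CLIP features between adjacent frames.
    frames: list of PIL.Image video frames. Higher is steadier (max 1.0)."""
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)   # (T, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)      # (T-1,) adjacent pairs
    return sims.mean().item()

# Toy usage with solid-color frames; real use would pass decoded video frames.
frames = [Image.new("RGB", (256, 256), (i, 100, 150)) for i in range(0, 80, 10)]
print(round(clip_temporal_consistency(frames), 3))
```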
The competition (baselines):
- Training-free: TokenFlow, STDF.
- Instruction-based: InsV2V, LucyEdit, EditVerse, and a commercial model (Aleph).
- FFP-based: VACE and Señorita (tested fairly with the same edited first frames).
Scoreboard with context:
- Ours-81f hit very high temporal consistency (CLIP 0.991, DINO 0.991) and strong video-level text alignment (25.925). That's like running the course without stumbling even once over a long track.
- Ours-33f earned the top perceptual quality (PickScore ~20.419) and the best VLM score (~7.631), like getting the highest marks for beauty and instruction-following.
- Against strong systems like Aleph (commercial) and top academic models, the method improved by around +0.2 PickScore and +0.3 VLM score, like moving from a solid B to a strong A on overall quality and instruction faithfulness.
Qualitative findings (what you see):
- Instruction-based methods sometimes changed the wrong thing or introduced flicker (hard to see in single frames, obvious in video). Some FFP competitors showed lower visual quality or mosaic artifacts.
- The proposed method produced longer, steadier clips with edits that stayed attached to objects and looked realistic.
User study (peopleās choice):
- 15 participants rated 8 random videos on Editing Accuracy, Motion Accuracy, and Video Quality (1–5). The proposed method scored highest across all three. That's like winning both the judges' choice and the audience award.
Ablation study (what parts matter):
- Baseline (just the underlying model fine-tuned on FFP-300K) already performed well, showing the dataset's strength.
- +AST-RoPE boosted temporal and text scores significantly, proving that adaptive positions help a lot.
- Full model (+self-distillation) reached the best across-the-board results, showing that the two components are complementary.
Extra test (UNICBench):
- On another benchmark filtered for FFP settings, the method again topped all metrics compared to AnyV2V, LucyEdit, Señorita, and UNIC. This shows generalization beyond a single benchmark.
Why these numbers matter:
- High temporal consistency means the edit doesn't flicker or drift, which is vital for watchable videos.
- Strong text alignment and a high VLM score mean the edit matches the user's intent, not just that it looks pretty.
- Top PickScore says viewers would likely prefer these outputs.
Bottom line: Data designed for FFP + AST-RoPE + self-distillation together deliver guidance-free, state-of-the-art video editing that both metrics and humans prefer.
05 Discussion & Limitations
š Top Bread (Hook): Even the best scooters have limits; you wouldn't use one to tow a truck. Let's be honest about where this method shines and where it struggles.
š„¬ Filling (The Actual Concept):
- Limitations (what it can't do well yet):
- Domain balance: The dataset, while diverse, has heavier coverage of indoor/urban scenes. Rare or exotic motions and scenes may still challenge the model.
- First-frame dependency: Garbage in, garbage out. If the first-frame edit is sloppy (wrong lighting, warped shapes), the model will faithfully propagate those issues.
- Extreme motions or occlusions: Very fast, chaotic motion or frequent object hiding can still cause slight appearance drift late in the sequence.
- Ultra-long or ultra-high-res: The method trains for 81 frames at 720p. Extending to minutes-long clips or native 4K requires more compute and possibly architectural tweaks.
- Open-world instructions alone: This system expects an edited first frame rather than text alone. If you only have a text instruction and cannot produce a first-frame edit, use instruction-based editors instead.
- Required resources:
- Training needs substantial GPU memory and time (long videos, transformer diffusion, dual-path distillation). Inference is lighter (no guidance), but still benefits from a good GPU for 720p x 81 frames.
- When NOT to use:
- If you only have a text instruction and no way to produce a high-quality first-frame edit.
- If you must edit very long videos at 4K in one pass on limited hardware.
- If the scene has extreme, unknown physics (e.g., non-realistic warping) that the dataset did not cover.
- Open questions:
- Scaling up: How well do AST-RoPE and self-distillation extend to 4K and hundreds of frames?
- Interactive editing: Can we update the edit mid-video and re-propagate quickly without retraining?
- Multi-object complexity: How to best handle many interacting edits (multiple swaps, materials, and styles at once)?
- Robustness and fairness: How to ensure consistent performance across rare scenes, lighting, and demographics?
- Beyond 2D videos: Can the ideas transfer to 3D or multi-view video where geometry matters even more?
š Bottom Bread (Anchor): Think of this as a fast, sturdy bicycle for city rides: awesome for most trips, but for a mountain expedition (4K films, hour-long clips), you'll want extra gears and supplies. The path forward is adding those gears while keeping the bike light and easy to ride.
06 Conclusion & Future Work
š Top Bread (Hook): If you can make the first frame perfect and teach the model to respect both looks and motion, you can make the whole video sing.
š„¬ Filling (The Actual Concept):
- 3-sentence summary: This paper tackles guidance-free First-Frame Propagation by fixing the data problem (FFP-300K) and tuning the model's sense of space-time (AST-RoPE) while teaching it to stay steady over time (self-distillation). Together, these ideas let the system keep the edited appearance glued to moving objects while preserving source motion, all at 720p and 81 frames. The result is state-of-the-art performance on public benchmarks and in user preference.
- Main achievement: Turning FFP into a practical, generalizable, guidance-free solution by combining a purpose-built dataset with adaptive positional encoding and a stability-focused training objective.
- Future directions: Scale to longer and higher-resolution videos; add interactive mid-sequence edits; improve robustness to extreme motion; extend to multi-object and 3D scenarios; explore lighter models for fast on-device editing.
- Why remember this: It shows that the right data plus a small number of smart architectural and training choices can replace heavy run-time crutches, making high-quality video editing simpler, faster, and more accessible.
š Bottom Bread (Anchor): Like learning to copy a perfect first drawing across a flipbook, the model now flips through 81 pages with the new look sticking perfectly, no tape or extra tools needed.
Practical Applications
- Material or texture changes that stick to moving objects (e.g., turning a wooden statue into metal across a whole clip).
- Local object swaps in product videos (e.g., replace one shoe color with another while the model walks).
- Clean object removals for documentaries or vlogs (e.g., remove a distracting sign without flicker).
- Full-scene stylization for trailers or short films (e.g., watercolor or claymation look with consistent motion).
- Content-safe edits for education (e.g., anonymize faces consistently across extended footage).
- Fashion try-ons in motion (e.g., swap jackets, change patterns, keep motion natural).
- Sports highlight enhancements (e.g., stylize replays while preserving player trajectories).
- AR/VR previsualization (e.g., change materials and styles of moving assets without manual keyframing).
- Marketing A/B variants (e.g., different colorways of a product across the same action sequence).
- Historical film restoration aids (e.g., consistent colorization or artifact cleanup over longer shots).