Motion Attribution for Video Generation
Key Summary
- Motive is a new way to figure out which training videos teach an AI how to move things realistically, not just how they look.
- It uses motion-focused gradients (math arrows) that only pay attention to parts of the video that are actually moving.
- A smart mask made from optical flow highlights motion pixels, so static backgrounds don’t trick the system.
- To keep it fast and scalable, Motive uses one shared noise step, a compact projection (Fastfood JL), and a frame-length fix.
- Using the top 10% most motion-influential clips for fine-tuning beats using the whole dataset on key motion metrics.
- On VBench, Motive improves dynamic degree and keeps visuals sharp; humans prefer its results 74.1% of the time over the base model.
- Motive works across different video generators (Wan 2.1 and 2.2), showing it’s model-agnostic.
- It helps curate better fine-tuning data, leading to smoother motion, better physics, and fewer glitches.
- This is the first framework that attributes motion (how things move) rather than just appearance (how they look) in video generation.
- The method is practical at modern scale, and its one-time compute can be reused for many motion queries.
Why This Research Matters
Videos are how we tell stories, teach skills, and train robots, so believable motion is crucial. Motive helps pick the training clips that actually teach models to move realistically, making results smoother and more physically plausible. With better motion, educational videos, simulations, and creative content feel more natural and trustworthy. It saves compute and time by focusing fine-tuning on the most helpful 10% instead of brute-forcing everything. The method also offers a transparent way to trace where a model’s motion habits came from, aiding auditing and safety. As video AIs power games, films, and virtual assistants, motion-aware data curation will be a key lever for quality and reliability.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine watching a magic trick on TV where the ball should bounce, roll, and stop realistically. If the ball teleports between frames or wobbles strangely, the trick feels fake.
🥬 The Concept: Video generation AIs used to be great at making single pretty pictures, but videos require believable motion across time.
- What it is: Video generation creates many frames that must fit together so objects move smoothly and obey simple physics.
- How it works: 1) Start from noise, 2) Denoise step by step, 3) Produce frames, 4) Stitch them into a clip.
- Why it matters: Without caring about motion, you get flicker, drifting identities, or impossible physics even if each frame looks sharp.
🍞 Anchor: A basketball that looks perfect in each snapshot but changes size or jumps sideways between frames will still look wrong as a video.
🍞 Hook: You know how a class learns the wrong thing if the examples are confusing? AI models learn from examples too.
🥬 The Concept: Data attribution asks which training examples most influenced a model’s behavior.
- What it is: A way to score each training clip for how it changes the model’s output on a given test case.
- How it works: 1) Compute gradients on the test, 2) Compute gradients on each train clip, 3) Compare them (like matching fingerprints), 4) Rank which clips push the model toward the observed behavior.
- Why it matters: Without knowing which examples cause which behaviors, fixing problems is guesswork.
🍞 Anchor: If the model keeps making people walk like robots, attribution helps you find the training videos that taught it that walk.
🍞 Hook: Think of time in a video like beats in a song—missing or extra beats mess up the rhythm.
🥬 The Concept: Diffusion video models extend image diffusion by adding a time axis and learning dynamics.
- What it is: A denoising process that builds a whole clip, not just one image.
- How it works: 1) Encode frames into a compact latent space, 2) Add noise and learn to remove it at many timesteps, 3) Use temporal attention to connect frames, 4) Decode back to video.
- Why it matters: Treating time properly prevents flicker and keeps motion consistent.
🍞 Anchor: The model must remember a yo-yo’s position across frames, not just redraw a nice yo-yo each time.
🍞 Hook: You know how a windsock shows wind strength and direction? We need a windsock for pixels.
🥬 The Concept: Optical flow measures how pixels move between frames.
- What it is: A per-pixel motion field telling where each pixel went next.
- How it works: 1) Compare two frames, 2) Find matching patches, 3) Estimate horizontal/vertical shifts, 4) Build a motion map.
- Why it matters: Without motion maps, we can’t tell moving parts from static backgrounds.
🍞 Anchor: If a cat walks right across the screen, optical flow shows many rightward arrows around the cat and near-zero arrows in the still background.
The world before: Researchers already had image-focused attribution for diffusion models, but when they tried it on videos, it mostly matched appearance (objects, textures, backgrounds), not motion. Three specific headaches popped up: (i) focusing on where motion actually happens, (ii) keeping computation and storage manageable across long clips, and (iii) measuring true temporal relations like velocity and trajectory.
Failed attempts: Naïvely treating time like an extra spatial dimension blurred motion with appearance. Influence rankings were biased by long clips (more frames = bigger gradients), and per-timestep noise made comparisons shaky. As a result, selected training data didn’t reliably improve motion.
The gap: We needed motion-specific attribution that isolates dynamic regions, corrects biases (like frame length), and scales to billion-parameter models and large datasets.
Real stakes: Better motion means smoother storytelling, safer simulation videos for training robots, more convincing educational media, and fewer physics fails in games or films. When fine-tuning on small budgets, picking the right 10% of clips can make or break motion quality.
02 Core Idea
🍞 Hook: Imagine you’re grading a group dance. You should watch the dancers’ legs and arms, not the curtain in the back.
🥬 The Concept: The key idea is to weigh gradients by motion so attribution pays attention to moving parts, not static scenery.
- What it is (one sentence): Reweight the learning signal with a motion mask so influence scoring highlights dynamics over appearance.
- How it works: 1) Detect motion via optical flow, 2) Turn motion magnitude into a mask (0 to 1), 3) Compute a motion-weighted loss and its gradient, 4) Compare projected, normalized gradients between query and each training clip, 5) Rank and select the most motion-influential clips.
- Why it matters: Without motion weighting, attribution chooses clips that “look similar” but don’t teach the model how to move.
🍞 Anchor: For a “rolling wheel” query, motion-weighted attribution picks clips with smooth rotations, not just photos of wheels.
Aha! moment in one sentence: If you only listen to where things move, you can find which training clips actually taught the model to move that way.
Three analogies for the same idea:
- Spotlight analogy: Dim the background lights and shine the spotlight on the dancer; now you can judge the dance, not the stage.
- Recipe analogy: Taste only the spicy part to find which ingredient made the dish spicy, not which made it red.
- Detective analogy: Dust for fingerprints only on the safe’s handle, not the wall around it.
Before vs. After:
- Before: Attribution mixed appearance with motion; long videos and busy textures dominated rankings.
- After: Motion-aware masks focus on dynamic regions; rankings favor clips that improve smoothness, trajectories, and physical plausibility.
Why it works (intuition, no equations): Gradients tell us how each example would nudge the model. If we downweight static pixels and upweight moving pixels before taking gradients, we point the “nudge detector” at dynamics. Normalizing for frame count and sharing the same denoising step and noise lowers random variation, so the ranking becomes stable and fair. A compact projection preserves the angle between gradients, so influence comparisons stay meaningful at model scale.
Building blocks:
- Motion detection (optical flow) to find moving pixels.
- Motion-weighted loss to steer gradients toward dynamics.
- Frame-length normalization so long clips don’t win by size.
- Shared timestep and noise for stable comparisons.
- Fastfood projection to store tiny gradient fingerprints safely.
- Majority-vote aggregation across multiple queries to find broadly helpful clips.
🍞 Hook: You know how reading glasses help you see only what matters on a page?
🥬 The Concept: Motion-specific gradients are gradients taken after masking for motion.
- What it is: The model’s learning signal calculated mostly from moving regions.
- How it works: 1) Make a per-pixel motion weight, 2) Multiply errors by these weights, 3) Backpropagate, 4) Compare these gradients between query and train clips.
- Why it matters: It removes the “pretty-but-static” bias and elevates clips that truly teach dynamics.
🍞 Anchor: For “swing,” motion gradients focus on the pendulum’s arc, not the lab wall.
🍞 Hook: Imagine sticky notes on just the parts of the picture that wiggle.
🥬 The Concept: Motion-weighted loss masks are per-pixel weights (0–1) that amplify moving areas and mute static ones.
- What it is: A motion saliency filter applied in the loss.
- How it works: 1) Compute motion magnitudes, 2) Normalize to 0–1, 3) Downsample to the model’s latent grid, 4) Use them to weight per-location errors, 5) Average fairly over frames.
- Why it matters: Without masks, static backgrounds drown out the motion signal.
🍞 Anchor: On a “floating leaf” clip, the mask lights up the leaf and nearby ripples, leaving glass tank walls near zero.
🍞 Hook: Suppose you have a giant poster and only need a business-card-sized summary that keeps the big idea.
🥬 The Concept: Fastfood JL projection compresses huge gradients into short vectors while keeping relative angles.
- What it is: A structured random projection that preserves similarity.
- How it works: 1) Multiply by fast Hadamard-like pieces, 2) Shuffle and scale, 3) Normalize, 4) Store tiny vectors, 5) Compare with fast dot products.
- Why it matters: Without projection, storing and comparing billion-parameter gradients is impractical.
🍞 Anchor: Instead of carrying every library book, you carry a smart summary card that still lets you compare two books’ topics.
03 Methodology
At a high level: Query video → Detect motion → Build motion mask → Motion-weighted gradient → Normalize and project → Compute influence with train clips → Rank → Select top-K → Fine-tune → Better motion.
Step 1: Detect motion (optical flow)
- What happens: Run AllTracker to estimate how every pixel moves between frames and collect magnitudes (how much) and directions (where).
- Why this step exists: We must know where motion is to focus attribution there.
- Example: In a “roll” query, the tire’s rim and ground-contact area show strong flow; the wall shows near-zero.
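To make this concrete, here is a minimal Python sketch of the flow step. It uses torchvision's RAFT model as a stand-in for AllTracker (the tracker the paper names), and the frame shapes, preprocessing, and per-pair batching are illustrative assumptions rather than the paper's exact pipeline.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# RAFT stands in for AllTracker here; any dense flow/tracking model fills the same role.
weights = Raft_Large_Weights.DEFAULT
flow_model = raft_large(weights=weights).eval()
preprocess = weights.transforms()  # converts uint8 frames to normalized float tensors

@torch.no_grad()
def flow_magnitudes(frames):
    """frames: (T, 3, H, W) uint8 video (H, W divisible by 8) -> (T-1, H, W) motion magnitudes."""
    mags = []
    for t in range(frames.shape[0] - 1):
        img1, img2 = preprocess(frames[t:t + 1], frames[t + 1:t + 2])
        flow = flow_model(img1, img2)[-1]         # (1, 2, H, W): final refinement iteration
        mags.append(flow.squeeze(0).norm(dim=0))  # Euclidean norm of (dx, dy) per pixel
    return torch.stack(mags)
```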
Step 2: Build a motion mask
- What happens: Convert motion magnitudes to weights between 0 and 1 (min–max), then downsample to the model’s latent grid.
- Why this step exists: The model’s loss is computed in latent space; we need matching-resolution weights to multiply per-location errors.
- Example: A “float” clip’s mask highlights the leaf and its ring ripples, dimming the tank background.
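A small sketch of the masking step, assuming per-clip min–max normalization and bilinear downsampling to the latent grid; how the paper maps masks onto the VAE's temporally compressed latent frames may differ.

```python
import torch
import torch.nn.functional as F

def build_motion_mask(mag, latent_hw, eps=1e-6):
    """mag: (T-1, H, W) flow magnitudes -> (T-1, h, w) weights in [0, 1] on the latent grid."""
    m = (mag - mag.min()) / (mag.max() - mag.min() + eps)   # min-max normalize per clip
    m = F.interpolate(m.unsqueeze(1), size=latent_hw,       # match the latent spatial resolution
                      mode="bilinear", align_corners=False)
    return m.squeeze(1)
```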
Step 3: Motion-weighted loss and gradient
- What happens: Multiply each location’s error by the motion weight, average fairly across frames, then backpropagate to get motion-specific gradients.
- Why this step exists: Without weighting, static appearance dominates; we want gradients that speak for dynamics.
- Example: In “bounce,” the mask brightens impact frames so the gradient captures deformation timing and rebound height.
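Here is a hedged sketch of the motion-weighted error, assuming a noise-prediction-style target in latent space (Wan models use a flow-matching objective, so the regression target would change accordingly); shapes are illustrative, and the mask is assumed to have been aligned to the latent frame count.

```python
import torch

def motion_weighted_error(pred, target, mask):
    """pred, target: (T, C, h, w) latent prediction and regression target (e.g., noise).
    mask: (T, h, w) motion weights from Step 2 -> (T,) motion-weighted error per frame."""
    err = (pred - target) ** 2                              # standard per-location denoising error
    return (mask.unsqueeze(1) * err).mean(dim=(1, 2, 3))    # upweight moving regions, per frame
```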
Step 4: Frame-length normalization
- What happens: Divide gradients by the number of frames so longer clips don’t get inflated scores just for being long.
- Why this step exists: Raw gradients scale with frame count, biasing rankings.
- Example: Two near-identical motions, 5 s vs. 10 s; after normalization, they’re scored by content, not duration.
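Continuing the sketch above, averaging the per-frame error over frames is one natural way to implement this normalization: the gradient scale then stops growing with clip length.

```python
def motion_weighted_loss(pred, target, mask):
    """Motion-weighted error averaged over frames (uses motion_weighted_error from the Step 3 sketch),
    so a 10 s clip and a 5 s clip of the same motion yield comparably scaled gradients."""
    return motion_weighted_error(pred, target, mask).mean()   # mean over the T frames
```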
Step 5: Common timestep and noise
- What happens: Compute all gradients (query and train) at the same denoising step and with the same noise sample.
- Why this step exists: Shared randomness lowers variance so rankings are stable even with a single sample.
- Example: Using a mid-trajectory step avoids both the near-pure noise of very early steps and the fine-detail-only signal of very late steps.
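A sketch of computing one gradient "fingerprint" per clip at the shared step and noise. The scheduler call follows a diffusers-style API; `denoiser`, `scheduler`, `text_emb`, and `LATENT_SHAPE` are placeholders, and `motion_weighted_loss` comes from the Step 3–4 sketches.

```python
import torch

LATENT_SHAPE = (21, 16, 60, 104)             # placeholder (frames, channels, h, w) of the video latent

# One shared mid-trajectory timestep and one shared noise draw, reused for every clip.
torch.manual_seed(0)
t_star = torch.tensor([500])                 # assumed mid-point of a 1000-step schedule
shared_noise = torch.randn(LATENT_SHAPE)     # same epsilon for the query and all training clips

def gradient_fingerprint(denoiser, scheduler, latents, text_emb, mask):
    noisy = scheduler.add_noise(latents, shared_noise, t_star)    # diffusers-style scheduler API
    pred = denoiser(noisy, t_star, text_emb)                      # placeholder conditional denoiser
    loss = motion_weighted_loss(pred, shared_noise, mask)
    grads = torch.autograd.grad(loss, [p for p in denoiser.parameters() if p.requires_grad])
    return torch.cat([g.flatten() for g in grads])                # one long vector per clip
```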
Step 6: Fastfood projection and normalization
- What happens: Compress giant gradients into short vectors that approximately preserve angles; then normalize to unit length.
- Why this step exists: Makes storage and comparisons cheap, enabling million-scale datasets and billion-parameter models.
- Example: Project 1.4B-dim gradients to 512-dim fingerprints you can dot-product quickly.
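Below is a compact, simplified sketch of a Fastfood-style projection (random signs, fast Walsh–Hadamard transforms, a permutation, and Gaussian scaling, keeping the first k coordinates). A production version would pad to a power of two, stack blocks for larger k, and follow the paper's exact construction.

```python
import torch

def fwht(x):
    """Fast Walsh-Hadamard transform of a 1-D tensor whose length d is a power of two."""
    d = x.shape[0]
    y, h = x.clone(), 1
    while h < d:
        y = y.view(d // (2 * h), 2, h)
        y = torch.cat((y[:, 0, :] + y[:, 1, :], y[:, 0, :] - y[:, 1, :]), dim=-1).reshape(d)
        h *= 2
    return y

class Fastfood:
    """Simplified Fastfood-style JL projection: signs -> FWHT -> permutation -> Gaussian scaling -> FWHT."""
    def __init__(self, d, k, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.signs = torch.randint(0, 2, (d,), generator=g).float() * 2 - 1   # random +/-1 diagonal
        self.gauss = torch.randn(d, generator=g)                              # Gaussian diagonal
        self.perm = torch.randperm(d, generator=g)                            # random permutation
        self.d, self.k = d, k

    def __call__(self, x):
        y = fwht(self.signs * x)              # decorrelate with random signs, then Hadamard mix
        y = fwht(self.gauss * y[self.perm])   # shuffle, scale, mix again
        y = y / self.d                        # overall scale is irrelevant for cosine scoring
        return y[: self.k]                    # short fingerprint that approximately preserves angles

# Usage sketch: fp = Fastfood(d=grad.numel(), k=512)(grad)  # assumes numel is a power of two (pad otherwise)
```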
Step 7: Influence scoring (cosine similarity)
- What happens: Take a dot product between the projected, normalized query gradient and each training clip’s projected, normalized gradient.
- Why this step exists: The cosine captures alignment: higher = more helpful for that motion.
- Example: Clips showing steady rotations rank high for a “spin” query, while cartoons with off-physics rank low.
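The scoring itself is just a normalized dot product. The sketch below assumes `query_fp` is the projected query gradient and `train_fps` stacks one projected gradient per training clip.

```python
import torch

def influence_scores(query_fp, train_fps):
    """query_fp: (k,) projected query gradient; train_fps: (N, k) bank of projected train gradients."""
    q = query_fp / query_fp.norm()
    bank = train_fps / train_fps.norm(dim=1, keepdim=True)
    return bank @ q                                        # (N,) cosine similarities: higher = more helpful

# Rank and keep, say, the top 10% for fine-tuning:
# scores = influence_scores(query_fp, train_fps)
# top = scores.topk(max(1, int(0.10 * scores.numel()))).indices
```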
Step 8: Multi-query aggregation (majority vote)
- What happens: For multiple motion queries, count how often a training clip appears above a percentile threshold; pick the clips with the most votes.
- Why this step exists: Prevents overfitting to a single example and finds generally helpful motion teachers.
- Example: A video of a ball on choppy water might help both “float” and “bounce,” earning more votes.
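A sketch of the voting step; the 90th-percentile threshold and subset size are assumed defaults here, not necessarily the paper's settings.

```python
import torch

def majority_vote(scores, percentile=0.90, subset_size=1000):
    """scores: (Q, N) influence of each of N training clips for Q motion queries."""
    votes = torch.zeros(scores.shape[1], dtype=torch.long)
    for per_query in scores:
        thresh = torch.quantile(per_query, percentile)
        votes += (per_query >= thresh).long()          # one vote per query where the clip ranks high
    return torch.argsort(votes, descending=True)[:subset_size]   # clips helpful for the most queries
```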
Step 9: Fine-tune with selected subset
- What happens: Fine-tune the model on the top-K influential clips; freeze nonessential parts (e.g., text encoder, VAE) per setup.
- Why this step exists: Focus training on motion-bright examples to boost dynamics fast with small budgets.
- Example: Using only 10% of data selected by Motive improved dynamic degree beyond full-dataset fine-tuning on VBench.
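Finally, a minimal sketch of wiring the selected indices into fine-tuning; `train_dataset`, `selected`, `text_encoder`, and `vae` are placeholders for the actual training setup.

```python
from torch.utils.data import DataLoader, Subset

finetune_set = Subset(train_dataset, selected.tolist())   # top-K motion-influential clips only
loader = DataLoader(finetune_set, batch_size=1, shuffle=True)

# Freeze parts that are not being adapted (e.g., text encoder, VAE) and fine-tune the denoiser.
for p in text_encoder.parameters():
    p.requires_grad_(False)
for p in vae.parameters():
    p.requires_grad_(False)
```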
The secret sauce:
- Motion-first gradients (masking) that separate dynamics from appearance.
- Stability trio: single shared timestep, single shared noise, and frame-length normalization.
- Scale tricks: Fastfood projection for tiny, reusable gradient fingerprints plus majority-vote aggregation.
Sandwich recaps for key tools:
🍞 Hook: Ever highlight only the sentences you need for a test?
🥬 The Concept: Majority vote selection tallies which clips repeatedly come out influential across queries.
- How it works: 1) Score per query, 2) Threshold by percentile, 3) Add a vote, 4) Pick top vote-getters.
- Why it matters: Avoids brittle picks from one lucky query.
🍞 Anchor: If five teachers recommend the same book, it’s probably good.
🍞 Hook: Carrying a whole toolbox is heavy; a pocket tool can be enough if it’s well made.
🥬 The Concept: Single-sample attribution uses one shared step and noise for all comparisons.
- How it works: 1) Fix timestep, 2) Fix noise, 3) Compute one gradient each, 4) Compare.
- Why it matters: Saves tons of compute while keeping rankings stable.
🍞 Anchor: You and a friend test two sneakers on the same track, same weather—fair race, clear winner.
04 Experiments & Results
The test: Do Motive-selected clips actually improve motion in generated videos?
- We measure with VBench on six axes: subject consistency, background consistency, motion smoothness, dynamic degree, aesthetic quality, and imaging quality. Motion smoothness and dynamic degree are the main motion targets.
- We also run human preference tests across 10 motion categories (e.g., bounce, spin, float) with 17 annotators.
The competition (baselines):
- Base model (no fine-tuning)
- Full fine-tuning (use all data; approximate upper bound)
- Random selection (uniform)
- Motion magnitude (picks clips with highest average motion only)
- V-JEPA embeddings (picks clips with look-and-motion semantics from self-supervised features)
- Ours w/o motion masking (plain video-level attribution)
The scoreboard (with context):
- Dynamic degree: Motive 47.6% vs. Random 41.3% vs. Full fine-tune 42.0%. • Think of this as a race where Motive runs a clear personal best; it’s like moving from a B- to an A- while others hover at B.
- Motion smoothness: Motive 96.3%, competitive with best baselines. • Like keeping the camera steady while everyone else jiggles just a bit more.
- Subject consistency: Motive 96.3%, better than base (95.3%) and above or on par with others. • Characters stay themselves, not morphing mid-clip.
- Aesthetic quality and imaging quality stay solid for Motive (46.0% and 64.6%), showing motion gains don’t break looks.
Human evaluation (pairwise choices):
- Ours vs Base: 74.1% wins (clear human preference for better motion).
- Ours vs Random: 58.9% wins.
- Ours vs Full FT: 53.1% wins (using only the top 10% of the data!)
- Ours vs w/o Motion Masking: 46.9% wins.
Interpretation: People consistently notice and prefer the improved motion from Motive-selected data.
Surprising findings:
- Less can be more: Using only the top 10% influential clips beats or matches using the full dataset on key motion metrics.
- Not just “more motion”: High-influence picks aren’t simply the clips with the most motion. The top 10% influential set has only a small average motion increase vs. the bottom 10%, showing the method tracks learning signal rather than raw motion magnitude.
- One shared timestep is enough: A single mid-trajectory step with shared noise gives rankings close to multi-step versions, saving huge compute.
- Compact fingerprints work: Projecting to 512 dimensions preserves ranking well (Spearman ~75%), making storage and retrieval lightweight.
Qualitative highlights:
- Compress: More realistic squish-and-recover.
- Spin: Smoother, steadier rotation with less wobble.
- Slide: Friction and glide feel more natural.
- Free fall: Gravity-like acceleration looks right.
Takeaway: Motive’s motion-aware attribution reliably upgrades temporal dynamics without paying a big visual-quality tax, and humans can clearly tell.
05 Discussion & Limitations
Limitations:
- Tracker dependence: Motion masks rely on the chosen tracker (e.g., AllTracker). Heavy occlusion or transparency can degrade motion saliency.
- Camera vs. object motion: It’s still tricky to separate ego-motion (camera movement) from object motion without extra signals like camera pose.
- Micro-motion: Very subtle, tiny motions may be under-emphasized.
- Segment granularity: Treating a video as a whole can dilute very informative short intervals.
- CFG mismatch: Classifier-free guidance at inference may shift dynamics not fully captured by training-time attribution.
Required resources:
- One-time per-sample gradient compute for the training set (e.g., ~150 hours on 1 A100 for 10k clips), but embarrassingly parallel (~2.3 hours on 64 GPUs).
- Storage for projected 512-d gradient fingerprints (manageable) and cached motion masks.
- Access to a pretrained video diffusion model (e.g., Wan 2.1/2.2) and a fine-tuning pipeline.
When NOT to use:
- If motion isn’t the goal (e.g., purely aesthetic single-frame tasks), motion-aware attribution is overkill.
- If you can’t afford any gradient computation or don’t have GPU access, lighter heuristics may be the only option.
- If your dataset has strong camera-shake but little object motion and no camera metadata, separating motion types may disappoint.
Open questions:
- Can we disentangle camera vs. object motion automatically using self-calibration or pose estimation?
- How to push to finer granularity—segment-level or event-level attribution—to catch brief but critical motion phases?
- Can closed-loop selection (attribute → fine-tune → re-attribute) outperform one-shot ranking consistently?
- How robust is performance across very different trackers or self-supervised motion cues?
- Can similar motion attribution ideas extend to audio (rhythm), robotics (trajectories), or multimodal video+audio generation?
06 Conclusion & Future Work
3-sentence summary:
- Motive is the first motion-centric data attribution framework for video generation that highlights moving regions when measuring influence.
- By masking loss with motion, fixing frame-length bias, sharing timestep/noise, and projecting gradients, it scales to modern models and datasets.
- Using Motive-selected top data for fine-tuning boosts temporal coherence and physical plausibility, even beating full-data fine-tuning on key motion metrics.
Main achievement:
- Turning attribution into a motion-first tool that traces which training clips actually teach dynamics, not just appearance—and using that to curate small, powerful fine-tuning subsets.
Future directions:
- More robust motion saliency (ensembling trackers, using confidence/visibility channels), better camera/object disentanglement, segment-level attribution, active data curation loops, and extensions to other modalities.
Why remember this:
- As video generators scale, how they move matters as much as how they look. Motive shows that smart data choices—guided by motion-aware influence—can deliver smoother, more realistic dynamics with less data and practical compute, making next-gen video AI both better and more explainable.
Practical Applications
- Curate fine-tuning subsets that boost specific motions (e.g., rolling, bouncing) for a target product demo.
- Diagnose weird motion artifacts by tracing them back to problematic training clips and filtering them out.
- Build specialist motion models (e.g., sports moves, dance, robotics) using only the top motion-influential 5–10% of data.
- Improve physics realism for simulation videos used in robot training or planning environments.
- Stabilize character motion and identity for animated shorts or ads without sacrificing visual quality.
- Speed up iteration cycles by reusing stored gradient fingerprints to answer new motion queries quickly.
- Create safer generators by identifying and removing clips that teach unsafe or undesirable dynamics.
- Enable closed-loop data curation: attribute, fine-tune, re-attribute to continuously refine motion quality.
- Personalize motion styles (e.g., calm glide vs. energetic bounce) by selecting influence-matched training clips.
- Transfer the approach to other modalities (e.g., audio rhythm or multimodal video+audio) to improve temporal coherence.