
Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Intermediate
Yang Fei, George Stoica, Jingyuan Liu et al. · 12/12/2025
arXiv · PDF

Key Summary

  ‱ The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
  ‱ It introduces SAM2VideoX, which distills structure-preserving motion priors from SAM2 into CogVideoX using a clever training loss.
  ‱ A bidirectional fusion trick combines forward and backward tracking features so the generator learns from full video context.
  ‱ A Local Gram Flow (LGF) loss focuses on how nearby parts move together across frames, not just on exact feature values.
  ‱ Compared to baselines, SAM2VideoX makes people, animals, and objects move in ways that keep limbs connected and shapes intact.
  ‱ On VBench, SAM2VideoX scores 95.51% Motion Score, beating REPA by 2.60 points, and drops FVD to 360.57 (21–22% better).
  ‱ In human studies, viewers preferred SAM2VideoX videos in most matchups (about 71% on average across comparisons).
  ‱ Mask-only supervision and image-only teachers (like DINO) underperform because they miss fine, time-aware motion cues.
  ‱ Fusing tracking features in LGF space (not raw feature space) avoids harmful cross-terms and stabilizes training.
  ‱ This approach improves motion realism without extra controls at inference, helping video models become more faithful world simulators.

Why This Research Matters

Smooth, believable motion is the difference between a cool-looking clip and a trustworthy simulation. When videos keep limbs connected and identities stable, they become more useful for education, design previews, and safe robotics planning. This work shows a scalable way to give generators a true sense of motion by learning from trackers that already understand it. Because it removes the need for fragile control signals at inference, creators can get better motion without extra complexity. As a result, video AI can move closer to “world simulation,” where things don’t just look right—they move right, too.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) You know how a puppet show looks great only when every string pull keeps the puppet’s arms and legs connected and moving naturally? If a string slips, the puppet’s elbow could bend the wrong way, and the illusion breaks.

đŸ„Ź Filling (The Actual Concept)

  ‱ What it is: Structure-preserving motion means things move in videos while keeping their shapes and parts connected—like legs bending at the knees, not mid-shin.
  • How it works: (1) Recognize the object and its parts, (2) Track how nearby parts move together frame by frame, (3) Make sure the motion follows realistic limits (like joints).
  • Why it matters: Without it, video models make weird mistakes: extra legs, stretched textures, and limbs that slide or shear.

🍞 Bottom Bread (Anchor) Think of a running lion: correct motion alternates legs; wrong motion makes the legs move together like a hopping toy.

The World Before

Video generators got very good at making individual images look sharp and pretty. But turning images into a smooth, believable video is harder. Especially with humans and animals (articulated, bendy things), models often fumble: a cyclist’s knees freeze, a dancer’s arm folds through her body, or a dog gains a phantom leg mid-stride. Many thought “just add more data” would fix this, but scaling datasets only helped a little.

The Problem

Models lacked an internal sense of structure—how parts should stay connected while moving. Prior attempts tried to guide models during generation with control signals like optical flow (pixel shifts) or skeletons (stick figures). But those signals are noisy, miss long-range context, and depend on external tools that make errors—so the videos still got weird.

Failed Attempts

  • Bigger datasets: More videos didn’t teach the model the exact “rules” of how parts co-move because lots of data still contains oddities and labeling noise.
  • Optical flow and skeleton conditioning: These give short-term hints but don’t capture object identity through occlusions or long sequences. They also break under fast motion.
  • Mask supervision: Teaching with just object outlines is too coarse; it ignores the rich relationships inside the object (like how a thigh links to a calf through a knee).

The Gap

What was missing was a strong, time-aware teacher that really understands how parts of an object move together over long videos. We needed dense, reliable, and temporally consistent motion features that a generator could learn from—without relying on brittle, handcrafted controls at inference.

🍞 Top Bread (Hook) Imagine learning a dance by watching a great choreographer who never loses track of any dancer, even when they weave behind others.

đŸ„Ź The Concept: Video Diffusion Model

  • What it is: A video diffusion model starts from noisy video and step-by-step removes noise to create a believable clip.
  ‱ How it works: (1) Add noise to training videos, (2) Train a model to predict how to remove that noise, (3) At test time, start from noise and repeatedly denoise to produce a video (a toy code sketch follows just below).
  • Why it matters: It’s the main engine that actually makes the video frames.

🍞 Anchor Like sculpting from a rough block: each pass removes a little noise to reveal the final movie.
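
To make that denoising recipe concrete, here is a toy noise-prediction training step in PyTorch. It is a minimal sketch under stated assumptions: `model` is any network mapping noisy latents and a noise level to predicted noise, and the linear schedule is illustrative; the actual paper trains CogVideoX with v-prediction on video latents.

```python
import torch

def diffusion_training_step(model, clean_latents):
    """One toy denoising-diffusion training step (noise-prediction form).

    `model` maps (noisy_latents, t) -> predicted noise. Shapes, the uniform
    timestep sampling, and the linear schedule are illustrative assumptions.
    """
    b = clean_latents.shape[0]
    t = torch.rand(b, device=clean_latents.device)           # random noise level per sample
    noise = torch.randn_like(clean_latents)                   # Gaussian noise to add
    alpha = (1.0 - t).view(b, *([1] * (clean_latents.dim() - 1)))
    noisy = alpha * clean_latents + (1.0 - alpha) * noise     # corrupt the clean latents
    pred = model(noisy, t)                                    # model predicts the added noise
    return torch.nn.functional.mse_loss(pred, noise)          # train to undo the corruption
```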

🍞 Top Bread (Hook) When you read a comic, you predict how a character will move next based on what just happened.

đŸ„Ź The Concept: Motion Prior

  • What it is: A motion prior is a learned guideline about how things are likely to move.
  • How it works: (1) Watch many examples, (2) Learn typical co-movements of parts, (3) Use that knowledge during generation to avoid impossible moves.
  • Why it matters: Without a good prior, the model guesses and often guesses wrong.

🍞 Anchor If you’ve seen lots of cats jump, you expect tucked legs mid-air—not spaghetti limbs.

A New Direction: Structure From Tracking

Instead of controlling the generator with fragile hand-crafted signals, the authors use a powerful tracker, SAM2, as a teacher. Trackers must keep the same object identity over long videos and through occlusions, so their internal features naturally encode how parts move together. The idea: distill (transfer) this motion understanding into the video generator, so it “just knows” how to move things realistically.

🍞 Top Bread (Hook) Think of a guide who can watch a video forward or backward and still point out the same moving parts.

đŸ„Ź The Concept: Segment Anything Model 2 (SAM2)

  • What it is: A video tracker/segmenter that follows objects across frames and keeps their identity consistent—even through occlusions.
  • How it works: (1) See a frame, (2) Use memory from past frames, (3) Output features and masks that stick to the same object over time.
  • Why it matters: Its internal features are rich, dense, and time-aware—perfect motion teachers.

🍞 Anchor If you circle a ballerina in frame 1, SAM2 can keep following that same dancer through spins and passes behind others.

Real Stakes

  • Entertainment: Characters moving wrong breaks immersion.
  • Education/science: Inaccurate motion misleads learners and analysts.
  • Design/ads: Awkward motion ruins product demos.
  • Robotics/simulation: Unreal motion teaches bad habits or yields unsafe plans.
  • Trust: Realistic motion makes AI videos more believable and useful.

02 Core Idea

🍞 Top Bread (Hook) Imagine tracing a moving cartoon twice: once from start to end, and once from end to start. If both tracings agree on how parts move, you’ve captured the true motion.

đŸ„Ź The Concept in One Sentence (The “Aha!”) Teach a video generator to move things realistically by distilling structure-preserving motion from a strong video tracker (SAM2) into a diffusion transformer (CogVideoX) using a bidirectional fusion of tracking features and a Local Gram Flow loss that matches how nearby parts move together.

Multiple Analogies

  1. Coach and athlete: SAM2 is the coach with deep motion wisdom; the diffusion model is the athlete. Training transfers the coach’s know-how so the athlete moves correctly even without the coach nearby.
  2. Orchestra and conductor: Forward and backward features are like hearing the music played normally and in reverse; fusing them reveals the full score so the players (the generator) keep perfect timing and harmony.
  3. Neighborhood watch: LGF watches how each pixel’s small neighborhood shifts to the next frame, like neighbors walking together block-by-block, making sure no one teleports.

🍞 Top Bread (Hook) You know how reading a story forward or backward still keeps the characters the same if the story is consistent?

đŸ„Ź The Concept: Bidirectional Feature Fusion

  • What it is: Combine SAM2’s forward and backward tracking cues into a single teacher signal that sees the whole video timeline.
  • How it works: (1) Run SAM2 forward, (2) Run SAM2 on the reversed video, (3) Fuse their local motion relationships (not raw features) so they don’t fight each other.
  • Why it matters: The generator (which has global attention) needs global, time-symmetric hints; otherwise it learns lopsided or conflicting motion.

🍞 Anchor Like averaging two good maps drawn from opposite directions—but only after converting them into “how roads connect,” not messy raw sketches.

🍞 Top Bread (Hook) Imagine comparing two flipbooks not picture-by-picture, but by how each small patch moves to the next page.

đŸ„Ź The Concept: Gram Matrix and Local Gram Flow (LGF)

  ‱ What it is: A Gram matrix captures similarities between features; LGF focuses on local similarities from frame t to t+1 within a small 7×7 neighborhood (the Gram-matrix part is sketched in code just below).
  • How it works: (1) For each location, compute dot-products with nearby locations in the next frame, (2) Turn these into a probability distribution (softmax), (3) Align student vs. teacher using KL divergence so relative motion patterns match.
  • Why it matters: Matching relative co-movement (who moves with whom) beats matching raw numbers; it teaches structure, not just appearance.

🍞 Anchor Two dancers are “similar” if they move together from one beat to the next; LGF checks that kind of togetherness.
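
To ground the Gram-matrix idea, here is a tiny PyTorch sketch of cross-frame similarities (shapes and names are illustrative assumptions, not the paper's code). Local Gram Flow, sketched in Step 4 of the Methodology, keeps only a 7×7 neighborhood of this matrix per location and normalizes it with a softmax.

```python
import torch

def cross_frame_gram(feat_t, feat_t1):
    """Dot-product similarities between every location at frame t and every
    location at frame t+1.

    feat_t, feat_t1: [C, H, W] feature maps. Returns an [H*W, H*W] Gram-style
    matrix whose entry (i, j) says how similar location i is now to location j
    one frame later. Illustrative sketch, not the authors' implementation.
    """
    c = feat_t.shape[0]
    a = feat_t.reshape(c, -1).T   # [H*W, C] rows: locations at frame t
    b = feat_t1.reshape(c, -1)    # [C, H*W] columns: locations at frame t+1
    return a @ b                  # [H*W, H*W] cross-frame similarity matrix
```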

🍞 Top Bread (Hook) Think of polishing a movie by removing fuzz a little at a time.

đŸ„Ź The Concept: Denoising Diffusion Transformer (DiT)

  • What it is: A transformer that generates videos by repeatedly denoising latent features with global, bidirectional attention.
  • How it works: (1) Encode video into latents, (2) Learn to predict the right denoising step, (3) Use attention to connect far-apart frames.
  • Why it matters: It’s the generator that learns the motion lessons.

🍞 Anchor Like a director who can see the whole script and keep continuity across scenes.

Before vs. After

  • Before: Generators often bent arms wrong or slid textures, and adding controls (flow/skeletons) during inference was clunky and error-prone.
  • After: The generator internalizes motion rules; limbs stay attached, identities persist, and motion becomes smooth—without extra controls at inference.

Why It Works (Intuition)

  • SAM2’s tracking features already encode long-range correspondences that keep identity and parts intact.
  • Fusing forward/backward motion in the LGF space gives a stable, global teaching signal.
  • Aligning distributions of local co-movement (via KL) teaches robust structure, not brittle pixel-by-pixel matches.

Building Blocks

  • Teacher: SAM2’s dense memory features (forward and backward).
  • Student: CogVideoX (a DiT video generator) features from a mid layer.
  • Projector: A small network that maps student features into the teacher’s feature space.
  • LGF Operator: Computes local cross-frame similarity vectors.
  • LGF-KL Loss: Aligns how neighborhoods flow, focusing on relative similarity patterns.

🍞 Bottom Bread (Anchor) Result: A cyclist’s knees bend and cycle naturally through frames; a lion alternates legs correctly; hands grasp objects without teleporting fingers.

03 Methodology

At a high level: Input image/video → Encode to latents → Add noise → DiT predicts denoising steps while we also extract its mid-layer features → Project those features → Compare their local motion patterns (LGF) to SAM2’s fused motion teacher → Train with diffusion loss + LGF-KL motion distillation → Output a video with structure-preserving motion.

Step 1: Prepare the Teacher (SAM2)

  ‱ What happens: For each training clip, run SAM2 forward (normal order) and backward (reverse order) to get dense memory features that consistently track the subject. Use a bounding box prompt (from GroundingDINO) to focus on the main subject. A schematic sketch follows this list.
  • Why this exists: SAM2’s internal features capture which parts belong together over time, even with occlusions. Single masks are too coarse; internal features are rich and continuous.
  • Example: Track a ballerina through a spin: SAM2 keeps the same dancer identity, preserving arm–shoulder–torso relationships frame by frame.
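
A schematic of this teacher-preparation step. `get_box_prompt` and `sam2_memory_features` are hypothetical wrapper callables standing in for GroundingDINO and SAM2 (their exact APIs are not restated here); only the forward/backward bookkeeping is the point.

```python
import torch

def prepare_teacher_features(frames, get_box_prompt, sam2_memory_features):
    """Collect forward and backward SAM2 memory features for one clip.

    frames: [T, C, H, W] video tensor.
    get_box_prompt, sam2_memory_features: hypothetical callables wrapping
    GroundingDINO (subject box on the first frame) and SAM2 (dense memory
    features per frame while tracking that box).
    """
    box = get_box_prompt(frames[0])                      # focus on the main subject
    fwd = sam2_memory_features(frames, box)              # [T, C', H', W'] forward tracking features
    bwd = sam2_memory_features(frames.flip(0), box)      # track the time-reversed clip
    bwd = bwd.flip(0)                                    # re-align backward features to forward order
    return fwd, bwd                                      # cached to disk before generator training
```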

Step 2: Fuse Motion the Right Way (Bidirectional LGF Fusion)

  ‱ What happens: Compute Local Gram Flow for forward and backward SAM2 features separately, then blend them with a convex combination (k·LGF_fwd + (1−k)·LGF_bwd). A minimal code sketch follows this list.
  • Why this exists: Fusing raw features creates harmful cross-terms (conflicting temporal signals). Fusing after LGF captures consistent co-movement without interference.
  • Example: Two maps drawn from opposite ends are best combined after converting them into road-connectivity graphs, not by smearing the drawings together.
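
A minimal sketch of the fusion rule under stated assumptions: `local_gram_flow` is the LGF operator sketched in Step 4 below, and k=0.5 is an illustrative default for the convex combination, not the paper's reported setting.

```python
def fused_teacher_lgf(fwd_feats, bwd_feats, local_gram_flow, k=0.5):
    """Blend forward and backward teacher motion *after* the LGF transform.

    Fusing raw features would mix conflicting temporal directions; fusing the
    LGF distributions keeps only the consistent co-movement signal.
    k=0.5 is an illustrative default, not the paper's reported value.
    """
    lgf_fwd = local_gram_flow(fwd_feats)          # [T-1, H*W, K*K] forward co-movement
    lgf_bwd = local_gram_flow(bwd_feats)          # [T-1, H*W, K*K] backward co-movement
    return k * lgf_fwd + (1.0 - k) * lgf_bwd      # convex combination in LGF space
```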

Step 3: Build the Student Side (DiT + Projector)

  ‱ What happens: Take the base generator (CogVideoX-5B-I2V, a DiT), encode the input video into latents, add noise, and pass through the DiT. From an intermediate block (e.g., 25th), extract features. A small projector (interpolation + MLP) maps them into the teacher’s space; a sketch follows this list.
  • Why this exists: The projector bridges architecture differences so comparisons with SAM2’s features are meaningful.
  • Example: Translating a sentence from English (DiT space) into Spanish (SAM2 space) before comparing meanings.
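
A sketch of the student-side projector under stated assumptions: the paper specifies interpolation plus an MLP, but the channel sizes, two-layer design, and the reshaping of DiT tokens into [T, C, H, W] maps below are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentProjector(nn.Module):
    """Map mid-layer DiT features into the teacher's (SAM2) feature space.

    Spatial interpolation matches the teacher's resolution; a small MLP
    matches the channel dimension. All sizes below are illustrative.
    """
    def __init__(self, student_dim=3072, teacher_dim=256, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(student_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, teacher_dim),
        )

    def forward(self, student_feats, teacher_hw):
        # student_feats: [T, C_s, H_s, W_s]; teacher_hw: (H_t, W_t)
        x = F.interpolate(student_feats, size=teacher_hw, mode="bilinear", align_corners=False)
        x = x.permute(0, 2, 3, 1)        # [T, H_t, W_t, C_s] so the MLP acts per token
        x = self.mlp(x)
        return x.permute(0, 3, 1, 2)     # [T, C_t, H_t, W_t], ready for LGF comparison
```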

Step 4: Compare Motion as Relative Patterns (Local Gram Flow)

  ‱ What happens: For each token (feature location) at frame t, compute similarities (dot-products) to a 7×7 neighborhood at frame t+1. This forms a similarity vector per location, per time step. Apply softmax to make a probability distribution of where that local patch “flows.” A code sketch follows this list.
  • Why this exists: It encodes who moves with whom locally—teaching co-movement and part topology, not just raw values.
  • Example: For a knee at frame t, its most similar neighbor at t+1 should be the slightly shifted knee (not the calf tip teleporting away).
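
A sketch of the LGF operator in PyTorch, assuming the features have already been reshaped to [T, C, H, W]; the 7×7 neighborhood matches the description above, while the function name and exact normalization are illustrative rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def local_gram_flow(feats, kernel=7):
    """Local Gram Flow: where does each location's feature 'flow' next frame?

    feats: [T, C, H, W]. For every location at frame t, take dot products with
    its kernel x kernel neighborhood at frame t+1, then softmax over that
    neighborhood, giving a probability distribution of local motion.
    Returns [T-1, H*W, kernel*kernel].
    """
    T, C, H, W = feats.shape
    pad = kernel // 2
    cur = feats[:-1].reshape(T - 1, C, 1, H * W)                 # frame-t features
    nxt = F.unfold(feats[1:], kernel_size=kernel, padding=pad)   # [T-1, C*k*k, H*W] neighborhoods at t+1
    nxt = nxt.reshape(T - 1, C, kernel * kernel, H * W)
    sim = (cur * nxt).sum(dim=1)                                 # [T-1, k*k, H*W] dot products
    sim = sim.permute(0, 2, 1)                                   # [T-1, H*W, k*k]
    return sim.softmax(dim=-1)                                   # local flow distributions
```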

Step 5: Align with KL Divergence (LGF-KL Loss)

  ‱ What happens: Compare teacher vs. student LGF distributions using KL divergence and average over all locations and frames. This forms L_feat (sketched after this list).
  • Why this exists: KL matches relative rankings (which neighbors are more likely), which is more stable and meaningful than forcing exact numbers with L2.
  • Example: Grading the order of most-likely movements rather than demanding two drawings have identical pixel intensities.
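
A sketch of the L_feat term under these assumptions: both LGF tensors are already probability distributions over the 7×7 neighborhood, and using the teacher as the KL target is one reasonable choice rather than the authors' confirmed convention.

```python
import torch.nn.functional as F

def lgf_kl_loss(student_lgf, teacher_lgf, eps=1e-8):
    """KL divergence between teacher and student Local Gram Flow distributions.

    Both inputs are [T-1, H*W, K*K] probabilities (already softmaxed).
    Per-location KL is summed over the neighborhood, then averaged over all
    locations and frames. A sketch of L_feat, not the authors' exact code.
    """
    kl = F.kl_div((student_lgf + eps).log(), teacher_lgf, reduction="none")
    return kl.sum(dim=-1).mean()      # sum over neighbors, average over space and time
```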

Step 6: Train with Two Losses

  ‱ What happens: Optimize the standard diffusion v-prediction loss (to denoise) plus λ·L_feat (to learn motion structure). In practice, λ≈0.5 worked well. A combined training-step sketch follows this list.
  • Why this exists: The model must both make sharp pictures (diffusion) and move them right (LGF-KL motion distillation).
  • Example: Learning to write neatly (clarity) and to tell a story that makes sense (structure) at the same time.
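
Putting both losses together, a minimal training-step sketch that reuses the `local_gram_flow` and `lgf_kl_loss` sketches above; `dit_denoise_with_features` is a hypothetical helper returning the standard v-prediction diffusion loss plus the DiT's mid-block features, and the teacher resolution is an assumption.

```python
def training_step(dit_denoise_with_features, projector, teacher_lgf, latents, lam=0.5):
    """One combined optimization step: denoise well AND move things right.

    dit_denoise_with_features: hypothetical helper returning (diffusion_loss,
    mid_block_features) for this noisy batch.
    teacher_lgf: precomputed, bidirectionally fused SAM2 LGF distributions.
    lam=0.5 is the weighting the article reports as working well.
    """
    diffusion_loss, mid_feats = dit_denoise_with_features(latents)
    student_feats = projector(mid_feats, teacher_hw=(64, 64))   # teacher spatial size is an assumption
    student_lgf = local_gram_flow(student_feats)                # see the Step 4 sketch
    l_feat = lgf_kl_loss(student_lgf, teacher_lgf)              # see the Step 5 sketch
    return diffusion_loss + lam * l_feat                        # total loss to backpropagate
```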

Concrete Mini Example (Cyclist)

  • Teacher: SAM2 forward/back features track thighs, knees, calves across frames.
  • Student: DiT features are projected; LGF checks that a knee at t is most similar to a slightly advanced knee at t+1.
  • Loss: If the student thinks the knee jumps to the wrong spot, KL gets large and nudges it back.
  • Outcome: Pedaling circles look natural; knees don’t freeze or snap.

Secret Sauce

  • Distill dense, time-aware tracking features into the generator—no extra controls at inference.
  • Fuse forward/backward motion signals only after converting them into LGF, avoiding destructive cross-terms.
  • Align relative co-movement distributions with KL, not raw values—capturing structure instead of brittle appearances.

Training Details (Friendly Summary)

  • Data: ~9.8k motion-focused clips (people/animals), 8 fps, up to 100 frames.
  • Base model: CogVideoX-5B-I2V; features from a mid block.
  ‱ Optimization: LoRA on attention modules, AdamW, short training (thousands of steps), gradient accumulation (a hedged setup sketch follows this list).
  • Practicality: Precompute SAM2 features to avoid heavy teacher runtime during training.
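
For flavor, a hedged sketch of the LoRA + AdamW setup using the Hugging Face `peft` library; the target module names, rank, alpha, and learning rate below are illustrative assumptions, not the authors' exact configuration.

```python
import torch
from peft import LoraConfig, get_peft_model

def add_lora_adapters(dit_model):
    """Attach LoRA adapters to a DiT's attention projections and build an
    AdamW optimizer over only the trainable (LoRA) parameters.

    Module names ("to_q", "to_k", "to_v", "to_out.0") and r / alpha / lr are
    illustrative assumptions, not the paper's reported settings.
    """
    config = LoraConfig(r=64, lora_alpha=64, lora_dropout=0.0,
                        target_modules=["to_q", "to_k", "to_v", "to_out.0"])
    model = get_peft_model(dit_model, config)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    return model, optimizer
```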

What Breaks Without Each Step

  • No teacher features: Model keeps making implausible motion.
  • Raw feature fusion: Conflicting signals cause artifacts and instability.
  • No LGF: You miss local co-movement; limbs drift and shear.
  • L2 instead of KL: Overly rigid matching harms learning of real structure, increasing flicker.
  • No projector: Spaces don’t align; supervision becomes noisy.

04 Experiments & Results

🍞 Top Bread (Hook) Imagine a report card for videos that checks: Do parts stay together? Is the motion smooth? Does the background remain steady?

đŸ„Ź The Concept: VBench (and Friends)

  • What it is: A benchmark suite that grades video generation on motion smoothness, subject consistency, and background consistency, among others.
  • How it works: (1) Generate videos from standard prompts, (2) Compute metrics, (3) Compare across models fairly.
  • Why it matters: Numbers help us see if motion looks real, not just pretty.

🍞 Anchor Like testing cars on the same track to compare speed and safety.

What They Measured and Why

  • Motion Smoothness: Are changes frame-to-frame gentle and realistic?
  • Subject Consistency: Does the person/animal keep their identity and shape?
  • Background Consistency: Does the scene avoid flicker and warping?
  • Dynamic Degree: Is there enough motion to make the test meaningful (not just static frames)?
  • FVD (FrĂ©chet Video Distance): A popular measure of overall perceptual quality—the lower, the better.
  • Human Preference: Do people pick these videos as more realistic in head-to-head tests?

Competitors

  • Base: CogVideoX-5B-I2V (no special motion training).
  • +LoRA Fine-tuning: Trained more on motion clips but no special teacher.
  • +Mask Supervision: Predict segmentation masks as supervision.
  • +REPA: Aligns features to DINO (image-only teacher), not video.
  • HunyuanVid: A strong, larger open-source model (≈13B params).
  • Track4Gen (adapted): Uses point tracking trajectories for guidance.

Scoreboard (With Context)

  • VBench Motion Score (higher is better): SAM2VideoX hits 95.51%, beating REPA’s 92.91% by +2.60 points—like moving from a solid B to a strong A.
  • Extended Motion Score (adds input consistency): 96.03% (best among tested baselines in this study).
  • FVD (lower is better): 360.57 vs. REPA 457.59 and LoRA 465.00—roughly a 21–22% improvement, which is a big quality jump.
  • Human Preference: In blind A/B tests, viewers chose SAM2VideoX in most comparisons (about 64–84% win rates vs. different baselines, averaging around 71%).
  • Against HunyuanVid: Despite HunyuanVid being more than twice as large, SAM2VideoX is highly competitive on motion/consistency and achieves much better FVD in these tests.

Surprising/Notable Findings

  • Mask supervision underperforms: It tends to push the model toward static or coarse outlines, missing internal part relationships and hurting FVD.
  • Image-only teachers (REPA with DINO) lack temporal wisdom: Good for single images, but weaker for time consistency, so motion quality lags.
  ‱ Dense features beat sparse trajectories (Track4Gen, adapted): Point tracks can drift and accumulate errors; dense SAM2 features provide steadier supervision.
  • LGF + KL matters: Using plain L2 (even on LGF outputs) drops scores and increases flicker—relative distribution matching is key.
  • Forward-only teacher is decent, but LGF fusion of forward+backward is best: It resolves conflicts and gives a global, time-symmetric signal.

A Taste of the Data and Setup

  • About 9.8k single-subject clips (people/animals), 8 fps, ≀100 frames.
  ‱ Generation for eval: 49-frame videos, guidance scale 6.0, 50 denoising steps (a hedged configuration sketch follows this list).
  • Fairness: Models with too-low motion (Dynamic Degree) are excluded from VBench comparisons to avoid rewarding near-static outputs.
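
For context, a hedged sketch of how this kind of evaluation run could be configured with the public `diffusers` CogVideoX image-to-video pipeline; the checkpoint id, input image, and prompt are assumptions, and the paper's fine-tuned SAM2VideoX weights are not assumed to be publicly released.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Hedged sketch of the evaluation-style settings (49 frames, guidance 6.0,
# 50 denoising steps) on the public base checkpoint, not the paper's model.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("cyclist_first_frame.png")             # placeholder input frame
video = pipe(
    image=image,
    prompt="a cyclist pedaling along a coastal road",     # illustrative prompt
    num_frames=49,
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```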

Takeaway

The distilled, bidirectional, structure-aware motion prior consistently lifts motion realism and perceptual quality beyond simple fine-tuning, mask training, and image-only alignment—without needing extra control inputs at inference.

05 Discussion & Limitations

Limitations

  • High-speed, complex motion (e.g., fast sports, breakdancing) can still show artifacts—the final quality is partly limited by the base generator’s capacity.
  • Multi-object scenes are harder: The current pipeline is strongest for a single main subject; identity switches can occur with many interacting objects.
  • Teacher dependency: If SAM2 struggles in rare cases, the distilled prior inherits some of that difficulty.
  • Precomputation: Storing teacher features adds a data step, though it saves training-time compute.

Required Resources

  • A decent GPU setup to fine-tune a 5B-parameter video model (the authors used 8× high-memory GPUs, gradient accumulation).
  • Precomputed SAM2 teacher features for your training clips.
  • A motion-focused dataset (subjects with articulated motion help the model learn the right priors).

When NOT to Use

  • Pure style/artistic effects without concern for realistic structure; the extra motion supervision may not match goals.
  • Rigid-object-only domains (e.g., panning landscapes) where structure-preserving articulation isn’t the bottleneck.
  • Extremely crowded, multi-subject scenes where single-subject tracking priors are insufficient; a multi-object extension would be better.

Open Questions

  • Multi-object extension: How to track and distill multiple identities robustly, even through complex occlusions?
  • Joint training: Can we co-train trackers and generators end-to-end for even tighter coupling?
  • Other teachers: Would specialized 3D or physics-aware teachers improve realism further?
  • Longer videos: How to maintain structure over minutes, not seconds?
  • Controllability: Can we let users nudge motion while preserving the distilled structure knowledge?

Overall

The method makes a strong case: structure from tracking is a powerful, scalable prior. Still, the next frontier is handling many interacting subjects, faster motions, and longer storylines with the same structural grace.

06 Conclusion & Future Work

Three-Sentence Summary

This paper shows how to teach a video generator realistic, structure-preserving motion by distilling dense, time-aware features from a strong tracker (SAM2). Two key ideas—bidirectional fusion in Local Gram Flow space and a KL-based alignment of local co-movement—let the generator internalize how parts move together across frames. The result is smoother, more believable motion that beats common baselines and earns strong human preference without extra controls at inference.

Main Achievement

Turning a tracker’s long-range, identity-preserving understanding into a motion prior for generation—using LGF-KL and careful bidirectional fusion—significantly boosts motion realism and perceptual quality.

Future Directions

Extend to multi-object tracking/teaching, explore joint training with other teachers (e.g., 3D/physics-aware), and scale to longer, more complex scenes with controllable motion cues. Investigate lightweight teacher approximations to reduce storage and expand accessibility.

Why Remember This

It reframes motion learning: instead of bolting on controls, give the generator a real motion sense by distilling from a tracker that already preserves identity and part topology. This shift helps video models move from pretty pictures in sequence toward faithful world simulations where things look right because they move right.

Practical Applications

  ‱ Improve human and animal motion in creative video tools so characters move naturally without manual keyframing.
  ‱ Generate realistic product demos (e.g., opening laptops, rotating shoes) that keep shapes intact while moving.
  ‱ Enhance sports highlights or training visuals with plausible joint motion that aids coaching and analysis.
  ‱ Create safer robotics simulations where grasping and walking look physically consistent before trying in the real world.
  ‱ Boost educational animations (biology, physics) where moving parts must stay connected and anatomically correct.
  ‱ Stabilize motion in ad campaigns and trailers to avoid uncanny artifacts that distract viewers.
  ‱ Pre-visualize film scenes with accurate limb and object interactions, reducing costly reshoots.
  ‱ Power virtual try-ons where clothes and bodies move together realistically during turns and walks.
  ‱ Support AR/VR experiences with believable avatar motion that maintains body structure over time.
  ‱ Assist medical or rehab visualizations showing correct joint trajectories for therapy guidance.
#video diffusion #structure-preserving motion #SAM2 #CogVideoX #Local Gram Flow #bidirectional fusion #representation distillation #Fréchet Video Distance #VBench #DiT #motion prior #feature alignment #KL divergence #tracking-based supervision
Version: 1