Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Key Summary
- The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
- It introduces SAM2VideoX, which distills structure-preserving motion priors from SAM2 into CogVideoX using a clever training loss.
- A bidirectional fusion trick combines forward and backward tracking features so the generator learns from full video context.
- A Local Gram Flow (LGF) loss focuses on how nearby parts move together across frames, not just on exact feature values.
- Compared to baselines, SAM2VideoX makes people, animals, and objects move in ways that keep limbs connected and shapes intact.
- On VBench, SAM2VideoX scores 95.51% Motion Score, beating REPA by 2.60 points, and drops FVD to 360.57 (21-22% better).
- In human studies, viewers preferred SAM2VideoX videos in most matchups (about 71% on average across comparisons).
- Mask-only supervision and image-only teachers (like DINO) underperform because they miss fine, time-aware motion cues.
- Fusing tracking features in LGF space (not raw feature space) avoids harmful cross-terms and stabilizes training.
- This approach improves motion realism without extra controls at inference, helping video models become more faithful world simulators.
Why This Research Matters
Smooth, believable motion is the difference between a cool-looking clip and a trustworthy simulation. When videos keep limbs connected and identities stable, they become more useful for education, design previews, and safe robotics planning. This work shows a scalable way to give generators a true sense of motion by learning from trackers that already understand it. Because it removes the need for fragile control signals at inference, creators can get better motion without extra complexity. As a result, video AI can move closer to "world simulation," where things don't just look right; they move right, too.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how a puppet show looks great only when every string pull keeps the puppet's arms and legs connected and moving naturally? If a string slips, the puppet's elbow could bend the wrong way, and the illusion breaks.
🥬 Filling (The Actual Concept)
- What it is: Structure-preserving motion means things move in videos while keeping their shapes and parts connected, like legs bending at the knee rather than mid-shin.
- How it works: (1) Recognize the object and its parts, (2) Track how nearby parts move together frame by frame, (3) Make sure the motion follows realistic limits (like joints).
- Why it matters: Without it, video models make weird mistakes: extra legs, stretched textures, and limbs that slide or shear.
🍞 Bottom Bread (Anchor) Think of a running lion: correct motion alternates legs; wrong motion makes the legs move together like a hopping toy.
The World Before Video generators got very good at making individual images look sharp and pretty. But turning images into a smooth, believable video is harder. Especially with humans and animals (articulated, bendy things), models often fumble: a cyclist's knees freeze, a dancer's arm folds through her body, or a dog gains a phantom leg mid-stride. Many thought "just add more data" would fix this, but scaling datasets only helped a little.
The Problem Models lacked an internal sense of structure: how parts should stay connected while moving. Prior attempts tried to guide models during generation with control signals like optical flow (pixel shifts) or skeletons (stick figures). But those signals are noisy, miss long-range context, and depend on external tools that make errors, so the videos still got weird.
Failed Attempts
- Bigger datasets: More videos didn't teach the model the exact "rules" of how parts co-move because lots of data still contains oddities and labeling noise.
- Optical flow and skeleton conditioning: These give short-term hints but don't capture object identity through occlusions or long sequences. They also break under fast motion.
- Mask supervision: Teaching with just object outlines is too coarse; it ignores the rich relationships inside the object (like how a thigh links to a calf through a knee).
The Gap What was missing was a strong, time-aware teacher that really understands how parts of an object move together over long videos. We needed dense, reliable, and temporally consistent motion features that a generator could learn from, without relying on brittle, handcrafted controls at inference.
🍞 Top Bread (Hook) Imagine learning a dance by watching a great choreographer who never loses track of any dancer, even when they weave behind others.
🥬 The Concept: Video Diffusion Model
- What it is: A video diffusion model starts from noisy video and step-by-step removes noise to create a believable clip.
- How it works: (1) Add noise to training videos, (2) Train a model to predict how to remove that noise, (3) At test time, start from noise and repeatedly denoise to produce a video.
- Why it matters: It's the main engine that actually makes the video frames.
🍞 Anchor Like sculpting from a rough block: each pass removes a little noise to reveal the final movie.
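To make the add-noise/remove-noise recipe concrete, here is a minimal PyTorch sketch of one simplified training step. The `denoiser` callable, the epsilon-prediction objective, and the tensor shapes are illustrative assumptions rather than the paper's exact setup (the paper fine-tunes a pretrained DiT with a v-prediction loss).

```python
import torch

def diffusion_training_step(denoiser, latents, timesteps, alphas_cumprod):
    """One simplified denoising-diffusion training step (epsilon prediction).

    latents:        (B, C, T, H, W) clean video latents
    timesteps:      (B,) integer noise levels
    alphas_cumprod: (num_steps,) cumulative noise schedule
    """
    noise = torch.randn_like(latents)
    # Forward process: mix clean latents with noise according to the schedule.
    a = alphas_cumprod[timesteps].view(-1, 1, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    # The model learns to predict the noise it must remove.
    pred = denoiser(noisy, timesteps)
    return torch.nn.functional.mse_loss(pred, noise)
```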
🍞 Top Bread (Hook) When you read a comic, you predict how a character will move next based on what just happened.
🥬 The Concept: Motion Prior
- What it is: A motion prior is a learned guideline about how things are likely to move.
- How it works: (1) Watch many examples, (2) Learn typical co-movements of parts, (3) Use that knowledge during generation to avoid impossible moves.
- Why it matters: Without a good prior, the model guesses and often guesses wrong.
🍞 Anchor If you've seen lots of cats jump, you expect tucked legs mid-air, not spaghetti limbs.
A New Direction: Structure From Tracking Instead of controlling the generator with fragile hand-crafted signals, the authors use a powerful tracker, SAM2, as a teacher. Trackers must keep the same object identity over long videos and through occlusions, so their internal features naturally encode how parts move together. The idea: distill (transfer) this motion understanding into the video generator, so it "just knows" how to move things realistically.
🍞 Top Bread (Hook) Think of a guide who can watch a video forward or backward and still point out the same moving parts.
🥬 The Concept: Segment Anything Model 2 (SAM2)
- What it is: A video tracker/segmenter that follows objects across frames and keeps their identity consistent, even through occlusions.
- How it works: (1) See a frame, (2) Use memory from past frames, (3) Output features and masks that stick to the same object over time.
- Why it matters: Its internal features are rich, dense, and time-aware: perfect motion teachers.
🍞 Anchor If you circle a ballerina in frame 1, SAM2 can keep following that same dancer through spins and passes behind others.
Real Stakes
- Entertainment: Characters moving wrong breaks immersion.
- Education/science: Inaccurate motion misleads learners and analysts.
- Design/ads: Awkward motion ruins product demos.
- Robotics/simulation: Unreal motion teaches bad habits or yields unsafe plans.
- Trust: Realistic motion makes AI videos more believable and useful.
02 Core Idea
🍞 Top Bread (Hook) Imagine tracing a moving cartoon twice: once from start to end, and once from end to start. If both tracings agree on how parts move, you've captured the true motion.
🥬 The Concept in One Sentence (The "Aha!") Teach a video generator to move things realistically by distilling structure-preserving motion from a strong video tracker (SAM2) into a diffusion transformer (CogVideoX) using a bidirectional fusion of tracking features and a Local Gram Flow loss that matches how nearby parts move together.
Multiple Analogies
- Coach and athlete: SAM2 is the coach with deep motion wisdom; the diffusion model is the athlete. Training transfers the coach's know-how so the athlete moves correctly even without the coach nearby.
- Orchestra and conductor: Forward and backward features are like hearing the music played normally and in reverse; fusing them reveals the full score so the players (the generator) keep perfect timing and harmony.
- Neighborhood watch: LGF watches how each pixel's small neighborhood shifts to the next frame, like neighbors walking together block by block, making sure no one teleports.
🍞 Top Bread (Hook) You know how reading a story forward or backward still keeps the characters the same if the story is consistent?
🥬 The Concept: Bidirectional Feature Fusion
- What it is: Combine SAM2's forward and backward tracking cues into a single teacher signal that sees the whole video timeline.
- How it works: (1) Run SAM2 forward, (2) Run SAM2 on the reversed video, (3) Fuse their local motion relationships (not raw features) so they don't fight each other.
- Why it matters: The generator (which has global attention) needs global, time-symmetric hints; otherwise it learns lopsided or conflicting motion.
🍞 Anchor Like averaging two good maps drawn from opposite directions, but only after converting them into "how roads connect," not messy raw sketches.
🍞 Top Bread (Hook) Imagine comparing two flipbooks not picture-by-picture, but by how each small patch moves to the next page.
🥬 The Concept: Gram Matrix and Local Gram Flow (LGF)
- What it is: A Gram matrix captures similarities between features; LGF focuses on local similarities from frame t to t+1 within a small 7×7 neighborhood.
- How it works: (1) For each location, compute dot-products with nearby locations in the next frame, (2) Turn these into a probability distribution (softmax), (3) Align student vs. teacher using KL divergence so relative motion patterns match.
- Why it matters: Matching relative co-movement (who moves with whom) beats matching raw numbers; it teaches structure, not just appearance.
🍞 Anchor Two dancers are "similar" if they move together from one beat to the next; LGF checks that kind of togetherness.
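For readers who like code, here is a tiny PyTorch sketch of the underlying idea; the feature sizes are made up for illustration.

```python
import torch

# One frame's features: N spatial tokens, each a D-dimensional vector
# (sizes here are illustrative, not the models' real dimensions).
feats = torch.randn(1024, 256)

# A Gram matrix is simply every pairwise dot-product: entry (i, j) measures how
# similar token i is to token j. LGF applies the same idea across time, comparing
# each token at frame t only to a small neighborhood of tokens at frame t+1.
gram = feats @ feats.T  # shape (1024, 1024)
```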
🍞 Top Bread (Hook) Think of polishing a movie by removing fuzz a little at a time.
🥬 The Concept: Denoising Diffusion Transformer (DiT)
- What it is: A transformer that generates videos by repeatedly denoising latent features with global, bidirectional attention.
- How it works: (1) Encode video into latents, (2) Learn to predict the right denoising step, (3) Use attention to connect far-apart frames.
- Why it matters: It's the generator that learns the motion lessons.
🍞 Anchor Like a director who can see the whole script and keep continuity across scenes.
Before vs. After
- Before: Generators often bent arms wrong or slid textures, and adding controls (flow/skeletons) during inference was clunky and error-prone.
- After: The generator internalizes motion rules; limbs stay attached, identities persist, and motion becomes smooth, without extra controls at inference.
Why It Works (Intuition)
- SAM2's tracking features already encode long-range correspondences that keep identity and parts intact.
- Fusing forward/backward motion in the LGF space gives a stable, global teaching signal.
- Aligning distributions of local co-movement (via KL) teaches robust structure, not brittle pixel-by-pixel matches.
Building Blocks
- Teacher: SAM2's dense memory features (forward and backward).
- Student: CogVideoX (a DiT video generator) features from a mid layer.
- Projector: A small network that maps student features into the teacherâs feature space.
- LGF Operator: Computes local cross-frame similarity vectors.
- LGF-KL Loss: Aligns how neighborhoods flow, focusing on relative similarity patterns.
🍞 Bottom Bread (Anchor) Result: A cyclist's knees bend and cycle naturally through frames; a lion alternates legs correctly; hands grasp objects without teleporting fingers.
03 Methodology
At a high level: Input image/video → Encode to latents → Add noise → DiT predicts denoising steps while we also extract its mid-layer features → Project those features → Compare their local motion patterns (LGF) to SAM2's fused motion teacher → Train with diffusion loss + LGF-KL motion distillation → Output a video with structure-preserving motion.
Step 1: Prepare the Teacher (SAM2)
- What happens: For each training clip, run SAM2 forward (normal order) and backward (reverse order) to get dense memory features that consistently track the subject. Use a bounding box prompt (from GroundingDINO) to focus on the main subject.
- Why this exists: SAM2's internal features capture which parts belong together over time, even with occlusions. Single masks are too coarse; internal features are rich and continuous.
- Example: Track a ballerina through a spin: SAM2 keeps the same dancer identity, preserving arm-shoulder-torso relationships frame by frame.
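A rough sketch of this precomputation loop, assuming the caller supplies the actual extraction callables: `sam2_extract`, `detect_box`, and `out_path_for` are hypothetical wrappers (the paper does not publish this code), and SAM2's internal memory features are not a one-line public API call.

```python
import torch

def prepare_teacher_features(clips, sam2_extract, detect_box, out_path_for):
    """Precompute forward and backward SAM2 memory features for every clip.

    clips:         iterable of (clip_id, frames), frames as a (T, H, W, 3) array
    sam2_extract:  hypothetical callable(frames, box) -> dense SAM2 memory features
    detect_box:    hypothetical callable(frame) -> subject box (e.g., from GroundingDINO)
    out_path_for:  hypothetical callable(clip_id) -> path for the cached features
    """
    for clip_id, frames in clips:
        box = detect_box(frames[0])                  # box prompt on the first frame
        feats_fwd = sam2_extract(frames, box)        # track in normal frame order
        feats_bwd = sam2_extract(frames[::-1], box)  # track through the reversed clip
        torch.save({"fwd": feats_fwd, "bwd": feats_bwd}, out_path_for(clip_id))
```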
Step 2: Fuse Motion the Right Way (Bidirectional LGF Fusion)
- What happens: Compute Local Gram Flow for forward and backward SAM2 features separately, then blend them with a convex combination (k·LGF_fwd + (1 - k)·LGF_bwd); see the sketch after this list.
- Why this exists: Fusing raw features creates harmful cross-terms (conflicting temporal signals). Fusing after LGF captures consistent co-movement without interference.
- Example: Two maps drawn from opposite ends are best combined after converting them into road-connectivity graphs, not by smearing the drawings together.
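A minimal sketch of the fusion step; the mixing weight k = 0.5 and the assumption that the backward LGF has already been re-indexed to forward time order are mine, not values from the paper.

```python
def fuse_lgf(lgf_fwd, lgf_bwd, k=0.5):
    """Convex combination of forward and backward Local Gram Flow maps.

    Blending happens in LGF space (local similarity patterns), not on raw SAM2
    features, so the two temporal directions cannot produce conflicting
    cross-terms. The caller is assumed to have aligned lgf_bwd to forward time.
    """
    return k * lgf_fwd + (1.0 - k) * lgf_bwd
```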
Step 3: Build the Student Side (DiT + Projector)
- What happens: Take the base generator (CogVideoX-5B-I2V, a DiT), encode the input video into latents, add noise, and pass through the DiT. From an intermediate block (e.g., the 25th), extract features. A small projector (interpolation + MLP) maps them into the teacher's space; a sketch follows below.
- Why this exists: The projector bridges architecture differences so comparisons with SAM2's features are meaningful.
- Example: Translating a sentence from English (DiT space) into Spanish (SAM2 space) before comparing meanings.
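Here is one plausible shape for such a projector; the dimensions, hidden size, and bilinear interpolation are assumptions, since the paper only describes it as interpolation followed by a small MLP.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentProjector(nn.Module):
    """Map mid-layer DiT features into the teacher's (SAM2) feature space."""

    def __init__(self, dit_dim=3072, teacher_dim=256, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dit_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, teacher_dim),
        )

    def forward(self, dit_feats, teacher_hw):
        # dit_feats: (B, T, H, W, C) features from an intermediate DiT block.
        b, t, h, w, c = dit_feats.shape
        x = dit_feats.permute(0, 1, 4, 2, 3).reshape(b * t, c, h, w)
        # Spatially interpolate onto the teacher's token grid, then project channels.
        x = F.interpolate(x, size=teacher_hw, mode="bilinear", align_corners=False)
        x = self.mlp(x.permute(0, 2, 3, 1))          # (B*T, H', W', teacher_dim)
        return x.reshape(b, t, *teacher_hw, -1)
```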
Step 4: Compare Motion as Relative Patterns (Local Gram Flow)
- What happens: For each token (feature location) at frame t, compute similarities (dot-products) to a 7×7 neighborhood at frame t+1. This forms a similarity vector per location, per time step. Apply softmax to turn it into a probability distribution over where that local patch "flows" (see the sketch after this list).
- Why this exists: It encodes who moves with whom locally, teaching co-movement and part topology rather than just raw values.
- Example: For a knee at frame t, its most similar neighbor at t+1 should be the slightly shifted knee (not the calf tip teleporting away).
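A compact PyTorch sketch of this operator under simplifying assumptions (per-frame feature maps, no batch dimension, no temperature scaling); only the 7×7 neighborhood size is taken from the paper.

```python
import torch
import torch.nn.functional as F

def local_gram_flow(feats, window=7):
    """Local Gram Flow: where does each location's feature 'flow' to at t+1?

    feats: (T, C, H, W) per-frame feature maps (teacher, or projected student).
    Returns (T-1, H*W, window*window): for each location at frame t, a softmax
    distribution over its window x window neighborhood at frame t+1.
    """
    T, C, H, W = feats.shape
    pad = window // 2
    cur = feats[:-1].reshape(T - 1, C, H * W)          # frames 0 .. T-2
    # All window x window neighborhoods of the next frames: (T-1, C*window^2, H*W)
    neigh = F.unfold(feats[1:], kernel_size=window, padding=pad)
    neigh = neigh.view(T - 1, C, window * window, H * W)
    # Dot-product each current-frame token against its next-frame neighborhood.
    sims = torch.einsum("tcn,tckn->tnk", cur, neigh)
    return sims.softmax(dim=-1)
```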
Step 5: Align with KL Divergence (LGF-KL Loss)
- What happens: Compare teacher vs. student LGF distributions using KL divergence and average over all locations and frames. This forms L_feat.
- Why this exists: KL matches relative rankings (which neighbors are more likely), which is more stable and meaningful than forcing exact numbers with L2.
- Example: Grading the order of most-likely movements rather than demanding two drawings have identical pixel intensities.
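A matching sketch of the alignment term; the direction of the KL (teacher as the reference distribution) and the epsilon for numerical stability are my assumptions.

```python
def lgf_kl_loss(student_lgf, teacher_lgf, eps=1e-8):
    """KL divergence between teacher and student LGF distributions (L_feat).

    Both inputs have shape (T-1, H*W, window*window) and already sum to 1 over
    the last dimension. The loss is averaged over all locations and time steps.
    """
    log_ratio = (teacher_lgf + eps).log() - (student_lgf + eps).log()
    kl = (teacher_lgf * log_ratio).sum(dim=-1)   # one KL value per location and step
    return kl.mean()
```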
Step 6: Train with Two Losses
- What happens: Optimize the standard diffusion v-prediction loss (to denoise) plus λ·L_feat (to learn motion structure). In practice, λ ≈ 0.5 worked well; see the sketch after this list.
- Why this exists: The model must both make sharp pictures (diffusion) and move them right (LGF-KL motion distillation).
- Example: Learning to write neatly (clarity) and to tell a story that makes sense (structure) at the same time.
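Tying the pieces together, the combined objective is just a weighted sum. This sketch reuses `local_gram_flow` and `lgf_kl_loss` from above; λ = 0.5 follows the value reported as working well, and the other names are assumptions.

```python
import torch.nn.functional as F

def training_losses(v_pred, v_target, projected_student_feats, teacher_lgf, lam=0.5):
    """Step 6 in code: denoising loss plus weighted LGF-KL motion distillation.

    projected_student_feats: (T, C, H, W) student features already mapped into
    the teacher's space (see the projector sketch above); teacher_lgf is the
    fused SAM2 target from the earlier sketches.
    """
    l_diffusion = F.mse_loss(v_pred, v_target)                          # sharp frames
    l_feat = lgf_kl_loss(local_gram_flow(projected_student_feats), teacher_lgf)
    return l_diffusion + lam * l_feat                                   # move them right
```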
Concrete Mini Example (Cyclist)
- Teacher: SAM2 forward/back features track thighs, knees, calves across frames.
- Student: DiT features are projected; LGF checks that a knee at t is most similar to a slightly advanced knee at t+1.
- Loss: If the student thinks the knee jumps to the wrong spot, KL gets large and nudges it back.
- Outcome: Pedaling circles look natural; knees don't freeze or snap.
Secret Sauce
- Distill dense, time-aware tracking features into the generator, with no extra controls at inference.
- Fuse forward/backward motion signals only after converting them into LGF, avoiding destructive cross-terms.
- Align relative co-movement distributions with KL, not raw values, capturing structure instead of brittle appearances.
Training Details (Friendly Summary)
- Data: ~9.8k motion-focused clips (people/animals), 8 fps, up to 100 frames.
- Base model: CogVideoX-5B-I2V; features from a mid block.
- Optimization: LoRA on attention modules, AdamW, short training (thousands of steps), batch accumulation.
- Practicality: Precompute SAM2 features to avoid heavy teacher runtime during training.
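As a rough illustration of the optimization recipe (LoRA on attention modules plus AdamW), here is a sketch using the peft library; the rank, alpha, learning rate, and target-module names are assumptions based on common diffusers attention naming, not the paper's exact settings.

```python
import torch
from peft import LoraConfig, get_peft_model

def build_lora_student(transformer):
    """Attach LoRA adapters to the DiT's attention projections and set up AdamW."""
    lora_cfg = LoraConfig(
        r=64,
        lora_alpha=64,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # assumed module names
    )
    transformer = get_peft_model(transformer, lora_cfg)
    # Only the LoRA parameters require gradients, so only they are optimized.
    optimizer = torch.optim.AdamW(
        (p for p in transformer.parameters() if p.requires_grad), lr=1e-4
    )
    return transformer, optimizer
```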
What Breaks Without Each Step
- No teacher features: Model keeps making implausible motion.
- Raw feature fusion: Conflicting signals cause artifacts and instability.
- No LGF: You miss local co-movement; limbs drift and shear.
- L2 instead of KL: Overly rigid matching harms learning of real structure, increasing flicker.
- No projector: Spaces don't align; supervision becomes noisy.
04 Experiments & Results
🍞 Top Bread (Hook) Imagine a report card for videos that checks: Do parts stay together? Is the motion smooth? Does the background remain steady?
🥬 The Concept: VBench (and Friends)
- What it is: A benchmark suite that grades video generation on motion smoothness, subject consistency, and background consistency, among others.
- How it works: (1) Generate videos from standard prompts, (2) Compute metrics, (3) Compare across models fairly.
- Why it matters: Numbers help us see if motion looks real, not just pretty.
🍞 Anchor Like testing cars on the same track to compare speed and safety.
What They Measured and Why
- Motion Smoothness: Are changes frame-to-frame gentle and realistic?
- Subject Consistency: Does the person/animal keep their identity and shape?
- Background Consistency: Does the scene avoid flicker and warping?
- Dynamic Degree: Is there enough motion to make the test meaningful (not just static frames)?
- FVD (Fréchet Video Distance): A popular measure of overall perceptual quality; the lower, the better.
- Human Preference: Do people pick these videos as more realistic in head-to-head tests?
Competitors
- Base: CogVideoX-5B-I2V (no special motion training).
- +LoRA Fine-tuning: Trained more on motion clips but no special teacher.
- +Mask Supervision: Predict segmentation masks as supervision.
- +REPA: Aligns features to DINO (image-only teacher), not video.
- HunyuanVid: A strong, larger open-source model (≈13B params).
- Track4Gen (adapted): Uses point tracking trajectories for guidance.
Scoreboard (With Context)
- VBench Motion Score (higher is better): SAM2VideoX hits 95.51%, beating REPA's 92.91% by +2.60 points, like moving from a solid B to a strong A.
- Extended Motion Score (adds input consistency): 96.03% (best among tested baselines in this study).
- FVD (lower is better): 360.57 vs. REPA 457.59 and LoRA 465.00, roughly a 21-22% improvement, which is a big quality jump.
- Human Preference: In blind A/B tests, viewers chose SAM2VideoX in most comparisons (about 64-84% win rates vs. different baselines, averaging around 71%).
- Against HunyuanVid: Despite HunyuanVid being more than twice as large, SAM2VideoX is highly competitive on motion/consistency and achieves much better FVD in these tests.
Surprising/Notable Findings
- Mask supervision underperforms: It tends to push the model toward static or coarse outlines, missing internal part relationships and hurting FVD.
- Image-only teachers (REPA with DINO) lack temporal wisdom: Good for single images, but weaker for time consistency, so motion quality lags.
- Dense features beat sparse trajectories (Track4Gen*): Point tracks can drift and accumulate errors; dense SAM2 features provide steadier supervision.
- LGF + KL matters: Using plain L2 (even on LGF outputs) drops scores and increases flicker; relative distribution matching is key.
- Forward-only teacher is decent, but LGF fusion of forward+backward is best: It resolves conflicts and gives a global, time-symmetric signal.
A Taste of the Data and Setup
- About 9.8k single-subject clips (people/animals), 8 fps, ≤100 frames.
- Generation for eval: 49-frame videos, guidance scale 6.0, 50 denoising steps.
- Fairness: Models with too-low motion (Dynamic Degree) are excluded from VBench comparisons to avoid rewarding near-static outputs.
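For context, generating a clip with those evaluation settings in diffusers would look roughly like the following; the checkpoint id, prompt, and input frame are illustrative, and the distilled LoRA weights would be loaded on top of the base model separately.

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import load_image

# Base image-to-video model; the distilled adapters are not part of this sketch.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("cyclist_first_frame.png")          # hypothetical conditioning frame
video = pipe(
    image=image,
    prompt="a cyclist pedaling down a coastal road",   # illustrative prompt
    num_frames=49,                                     # evaluation settings reported above
    guidance_scale=6.0,
    num_inference_steps=50,
).frames[0]
```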
Takeaway The distilled, bidirectional, structure-aware motion prior consistently lifts motion realism and perceptual quality beyond simple fine-tuning, mask training, and image-only alignment, without needing extra control inputs at inference.
05 Discussion & Limitations
Limitations
- High-speed, complex motion (e.g., fast sports, breakdancing) can still show artifacts; the final quality is partly limited by the base generator's capacity.
- Multi-object scenes are harder: The current pipeline is strongest for a single main subject; identity switches can occur with many interacting objects.
- Teacher dependency: If SAM2 struggles in rare cases, the distilled prior inherits some of that difficulty.
- Precomputation: Storing teacher features adds a data step, though it saves training-time compute.
Required Resources
- A decent GPU setup to fine-tune a 5B-parameter video model (the authors used 8× high-memory GPUs with gradient accumulation).
- Precomputed SAM2 teacher features for your training clips.
- A motion-focused dataset (subjects with articulated motion help the model learn the right priors).
When NOT to Use
- Pure style/artistic effects without concern for realistic structure; the extra motion supervision may not match goals.
- Rigid-object-only domains (e.g., panning landscapes) where structure-preserving articulation isn't the bottleneck.
- Extremely crowded, multi-subject scenes where single-subject tracking priors are insufficient; a multi-object extension would be better.
Open Questions
- Multi-object extension: How to track and distill multiple identities robustly, even through complex occlusions?
- Joint training: Can we co-train trackers and generators end-to-end for even tighter coupling?
- Other teachers: Would specialized 3D or physics-aware teachers improve realism further?
- Longer videos: How to maintain structure over minutes, not seconds?
- Controllability: Can we let users nudge motion while preserving the distilled structure knowledge?
Overall The method makes a strong case: structure from tracking is a powerful, scalable prior. Still, the next frontier is handling many interacting subjects, faster motions, and longer storylines with the same structural grace.
06 Conclusion & Future Work
Three-Sentence Summary This paper shows how to teach a video generator realistic, structure-preserving motion by distilling dense, time-aware features from a strong tracker (SAM2). Two key ideas, bidirectional fusion in Local Gram Flow space and a KL-based alignment of local co-movement, let the generator internalize how parts move together across frames. The result is smoother, more believable motion that beats common baselines and earns strong human preference without extra controls at inference.
Main Achievement Turning a tracker's long-range, identity-preserving understanding into a motion prior for generation, via LGF-KL and careful bidirectional fusion, significantly boosts motion realism and perceptual quality.
Future Directions Extend to multi-object tracking/teaching, explore joint training with other teachers (e.g., 3D/physics-aware), and scale to longer, more complex scenes with controllable motion cues. Investigate lightweight teacher approximations to reduce storage and expand accessibility.
Why Remember This It reframes motion learning: instead of bolting on controls, give the generator a real motion sense by distilling from a tracker that already preserves identity and part topology. This shift helps video models move from pretty pictures in sequence toward faithful world simulations where things look right because they move right.
Practical Applications
- Improve human and animal motion in creative video tools so characters move naturally without manual keyframing.
- Generate realistic product demos (e.g., opening laptops, rotating shoes) that keep shapes intact while moving.
- Enhance sports highlights or training visuals with plausible joint motion that aids coaching and analysis.
- Create safer robotics simulations where grasping and walking look physically consistent before trying in the real world.
- Boost educational animations (biology, physics) where moving parts must stay connected and anatomically correct.
- Stabilize motion in ad campaigns and trailers to avoid uncanny artifacts that distract viewers.
- Pre-visualize film scenes with accurate limb and object interactions, reducing costly reshoots.
- Power virtual try-ons where clothes and bodies move together realistically during turns and walks.
- Support AR/VR experiences with believable avatar motion that maintains body structure over time.
- Assist medical or rehab visualizations showing correct joint trajectories for therapy guidance.