FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
Key Summary
- FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.
- It detects and follows objects over time, draws their paths, and feeds those paths to a language model to write precise motion descriptions.
- Then it auto-writes multiple-choice questions that check skills like what moved, when it moved, where it moved, and how many times it moved.
- Fine-tuning existing video-language models on this new data greatly improves their understanding of motion without hurting other skills.
- On key tests (like AV-Car), medium-sized open models fine-tuned with FoundationMotion beat much larger models, including Gemini-2.5-Flash.
- Adding structured motion tracks (bounding box JSONs) makes captions and questions more accurate, detailed, and time-consistent than video-only prompts.
- The dataset includes about 46.7k videos and 467k QA pairs focused on fine-grained movements across people, robots, and cars.
- Careful video preprocessing (like filtering strong camera motion) keeps tracking clean so models learn true object motion, not camera shake.
- The approach shows a scalable path to teach machines not just 'what happened' but 'how it happened' in real-world video.
- Limitations include mostly 2D understanding today and dependence on detector quality, with 3D motion and dexterous hand reasoning as next steps.
Why This Research Matters
Motion details are the difference between safe and unsafe decisions in the real world. With FoundationMotion, AI can finally learn not just to name things in a video, but to understand how they move, in what order, and where they go. This boosts reliability for robots in homes and factories, for driver assistance and autonomous cars, and for video assistants that explain procedures or sports plays. Because the pipeline is automatic, we can scale motion learning without endless manual labeling. Better motion understanding also helps accessibility tools describe dynamic scenes more clearly. Overall, this is a practical step toward AI that reasons about the physical world like we do.
Detailed Explanation
01 Background & Problem Definition
š Top Bread (Hook): Imagine watching a soccer game on TV. You can easily tell which player kicks the ball, where it goes, and whether it curves or goes straight. Your brain is great at tracking motion and telling a story about it.
š„¬ Filling (The Actual Concept): Before this research, many AI video models were good at naming things (there is a ball, a player, a car), and even recognizing simple actions (kicking, driving), but they stumbled on the fine details of motion: which direction, how far, how fast, and in what order things happened. Why? Because they didnāt have enough training data that explained motion step by step. Creating that kind of data by hand is slow and expensiveāhumans would have to watch videos frame by frame and label who moved where and when. Without lots of such examples, models guess and often get motion wrong.
š Bottom Bread (Anchor): Think of a robot trying to pour water into a cup. If the robot only knows āpouringā but not how the bottle moves relative to the cup, it spills. The missing ingredient is detailed motion knowledge.
š Top Bread (Hook): You know how a coach draws arrows on a whiteboard to show how players should run? Those arrows are like motion blueprints.
š„¬ Filling (The Actual Concept): The problem researchers faced was a shortage of big, fine-grained motion datasetsāthe kind that draw those arrows precisely for every object in a video. Earlier attempts relied on humans to annotate motions, which doesnāt scale to millions of video moments. Some projects used language models to write captions or questions for videos, but without exact object paths, descriptions stayed vague (e.g., āthe car movesā instead of āthe car turns right behind the truckā). What was missing was a way to automatically extract object paths and feed them to a language model so it could write very specific motion stories and quizzes.
š Bottom Bread (Anchor): If you ask, āWhich way did the red car turn after passing the bus?ā a vague caption wonāt help; you need the red carās exact path to answer confidently.
š Top Bread (Hook): Imagine trimming a long movie to just the exciting chase scenes so your friend only watches the good parts.
š„¬ Filling (The Actual Concept): Many prior systems tried to learn motion from long, shaky videos, often with strong camera movement that confuses even humans. They also mixed in lots of quiet moments with little motion. This made models waste attention and learn the wrong patterns. The gap: we needed a pipeline that (1) picks short, motion-rich clips, (2) detects and tracks multiple objects well, (3) uses those tracks to write precise captions, and (4) turns them into fair, challenging multiple-choice questions that cover motion recognition, order, location, direction, and counting.
š Bottom Bread (Anchor): Itās like building a great study guide: select clear examples, show the arrows, explain what happened, and then quiz on the important parts.
š Top Bread (Hook): Picture learning to bike. Itās not enough to know what a bike is; you must learn how to balance, steer, and pedal in a certain order.
š„¬ Filling (The Actual Concept): The real stakes are big. Robots need to understand how to move objects safely. Self-driving cars must judge who is moving where and how quickly. Video assistants should answer āhowā questionsālike whether a person twisted a cap before pouringāwithout guessing. Bad motion understanding can cause spills in kitchens, scrapes in factories, or worse on roads. Getting āhowā right makes AI safer and smarter in the physical world.
š Bottom Bread (Anchor): If an autonomous car confuses āthe pedestrian is crossing aheadā with āthe pedestrian is standing still,ā thatās the difference between braking in time and a dangerous situation.
Concept Primers (in the right order)
š Top Bread (Hook): You know how your eyes can find your friend in a crowd and then keep watching them as they walk?
š„¬ The Concept: Object Detection and Tracking is how computers find things in a picture and keep following them from frame to frame. How it works: (1) The model spots objects (like cars, hands, or cups). (2) It draws a box around each. (3) It gives each object a number (an ID). (4) As the video plays, it keeps re-finding each box so the same ID follows the same object. Why it matters: Without tracking, the computer keeps āforgettingā which object is which and mixes up their movements.
š Bottom Bread (Anchor): In a street video, the model keeps a green box on the same blue car as it turns, instead of jumping to another car by mistake.
š Top Bread (Hook): Imagine skipping the boring parts of a long video to only watch the skateboard tricks.
š„¬ The Concept: Motion-Centric Video Cropping means cutting a short clip that focuses on where the action happens. How it works: (1) Pick 5ā10 seconds around the most active moment. (2) Filter out clips with heavy camera shake that hides true object motion. (3) Keep frames that show clear object movement. Why it matters: Without this, models learn from blurry, unhelpful parts and get confused by camera motion instead of object motion.
š Bottom Bread (Anchor): If the camera pans with the runner, you canāt tell if the runner is fast or the camera is movingācropping and filtering fix that.
š Top Bread (Hook): Think of drawing a dotted line behind a toy car to show where it went.
š„¬ The Concept: Trajectory Annotation is marking the path an object takes across frames. How it works: (1) Track the object box each frame. (2) Save the box corners as numbers between 0 and 1. (3) Connect them over time to show direction, speed, and turns. Why it matters: Without trajectories, descriptions stay vague and canāt answer detailed āhowā questions.
š Bottom Bread (Anchor): With a path, you can tell the red ball rolled left, slowed near the wall, then bounced right.
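To make that concrete, here is a minimal Python sketch (with made-up numbers and field names, not the paper's actual format) showing how a tracked box for that red ball might be normalized to 0ā1 coordinates and stored per frame, and how the stored path already reveals direction and speed.

```python
def normalize_box(box_px, frame_w, frame_h):
    """Convert a pixel box (left, top, right, bottom) into 0-1 coordinates."""
    left, top, right, bottom = box_px
    return (left / frame_w, top / frame_h, right / frame_w, bottom / frame_h)

# Hypothetical track for the red ball across three sampled frames of a 1280x720 clip.
track = {
    "object_id": 7,
    "type": "red ball",
    "boxes": [
        normalize_box((1000, 400, 1080, 480), 1280, 720),  # starts on the right
        normalize_box((700, 420, 780, 500), 1280, 720),    # rolling leftward
        normalize_box((560, 430, 640, 510), 1280, 720),    # slowing near the wall
    ],
}

# Horizontal centers encode direction and speed: x decreases (moving left)
# and the gaps shrink (slowing down).
centers_x = [(l + r) / 2 for l, t, r, b in track["boxes"]]
print(centers_x)  # roughly [0.81, 0.58, 0.47]
```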
02 Core Idea
š Top Bread (Hook): You know how a teacher makes a great quiz by first understanding the lesson and then asking targeted questions?
š„¬ The Concept: Large Language Models (LLMs) are text tools that can read signals and write clear, detailed descriptions and fair quizzes. How it works: (1) Give the LLM the video frames. (2) Also give it the object paths (the āarrowsā). (3) Prompt it to describe who moved, where, and when. (4) Ask it to turn those details into challenging multiple-choice questions. Why it matters: If you only give raw video, the LLM often writes vague captions. With paths, it gets specific and time-consistent.
š Bottom Bread (Anchor): Instead of āthe car moves,ā it writes āthe blue car turns right behind the truck at the center of the frame.ā
š Top Bread (Hook): Imagine a factory line that takes in tomatoes and spits out ketchup bottlesāall automatically.
š„¬ The Concept: A Data Curation Pipeline is a step-by-step machine that turns messy input (raw videos) into clean, labeled learning material. How it works: (1) Preprocess: clip the right moments and remove heavy camera motion. (2) Detect and track: find people, hands, objects; follow them over time. (3) Summarize: use LLMs to write motion-rich captions. (4) Quiz: auto-generate multiple-choice questions on motion skills. Why it matters: Without a pipeline, you canāt scale to tens of thousands of high-quality motion examples.
š Bottom Bread (Anchor): Itās like turning a huge pile of sports footage into a training workbook that teaches all the plays.
š Top Bread (Hook): Think of a āmotion librarianā that catalogs not just whatās in the video, but how it moves.
š„¬ The Concept: FoundationMotion is that motion librarianāan automated system that builds a large motion dataset and trains models to understand spatial movement. How it works: (1) Automatically crops videos to motion-rich clips. (2) Detects and tracks multiple objects, including left vs. right hands. (3) Saves trajectories as structured data. (4) Uses LLMs to write precise captions and balanced multiple-choice questions across motion categories. Why it matters: Without FoundationMotion, weād stay stuck with small, hand-made datasets and models that guess at motion details.
š Bottom Bread (Anchor): After training with FoundationMotion, a model answers, āThe person guides the thread to the left, then pushes it through the fabric,ā instead of just āsewing.ā
š Top Bread (Hook): When you learn a dance, you donāt just memorize the nameāyou learn the steps and their order.
š„¬ The Concept: Question-Answer Pairs Generation is turning motion stories into quizzes that test key skills. How it works: (1) Read the caption and frames. (2) Write questions about motion recognition, order, who-did-what, where-in-the-frame, counting repeats, and direction/speed/trajectory. (3) Create smart wrong answers from the same scene to avoid guesswork. (4) Randomize answer positions to be fair. Why it matters: Without good quizzes, models never get pushed to master the tricky parts of motion.
š Bottom Bread (Anchor): Instead of āDid the car move?ā the quiz asks āWhich way did the car turn after passing the bus?ā with three realistic distractors.
š Top Bread (Hook): Imagine being a detective who watches a clip and can say not just what happened, but how and why the pieces fit.
š„¬ The Concept: Motion Understanding and Reasoning is the skill of explaining movements precisely and connecting them over time and space. How it works: (1) Ground in objects: whoās acting. (2) Track paths: where they go. (3) Sequence events: in what order. (4) Relate spatially: left/right, front/behind, near/far. (5) Summarize as answers. Why it matters: Without reasoning, AI canāt safely help in homes, factories, or streetsābecause small motion details make big differences.
š Bottom Bread (Anchor): A strong model answers, āThe pedestrian steps into the lane ahead of the ego car, causing a necessary slow-down,ā not just āThereās a person.ā
The Aha! Moment in one sentence: If you feed a language model the exact object paths (the motion arrows), it can write precise motion descriptions and quizzes at scale, which in turn trains video models to truly understand how things move.
Three analogies for the same idea:
- Map analogy: Instead of saying āthe car drove,ā draw its route on a map, then write questions about turns, distance, and order.
- Sports replay: Add player trails on a replay, then quiz about who cut left first, who crossed behind, and how many passes happened.
- Cooking recipe: Track the spoonās path and bowlās position, then ask whether stirring happened before pouring and in which direction.
Before vs. After:
- Before: Models knew the nouns and some verbs but got confused by the āhowā: direction, sequence, spatial relations, and counting.
- After: Models become motion-aware, answering fine-grained, stepwise questions and outperforming much larger models on motion benchmarks.
Why it works (intuition, not equations): Motion is about change across time. Plain frames hide those changes; explicit trajectories reveal them. When LLMs see structured, frame-by-frame paths, they stop guessing and start grounding statements in real movement. Combining clean clips, robust tracking, and targeted quizzes delivers the right signals to learn reliable motion reasoning.
Building blocks of the idea:
- Motion-centric preprocessing: short, clear clips; camera-motion filtering.
- Multi-object detection and tracking: keep identities consistent over time.
- Trajectory annotation: normalized box paths per frame.
- Caption generation with structure: prompt LLMs across action, order, associations, space, dynamics, and relationships.
- QA generation: balanced, multi-skill, realistic distractors; fair answer placement.
- Fine-tuning: teach models with thousands of motion-rich examples, improving generalization.
03 Methodology
At a high level: Raw Video ā Motion-Centric Preprocessing ā Detection and Tracking ā Trajectory-to-Caption Summarization ā Motion QA Generation ā Fine-tuning Models.
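As a rough mental model of how these stages chain together, here is a minimal orchestration sketch in Python. The four stage callables are assumed placeholders for the real components (the preprocessor, detector/tracker, captioning LLM, and QA generator), not the paper's code.

```python
def curate(video_path, preprocess, detect_and_track, summarize, generate_qa):
    """Chain the pipeline stages for one raw video; each callable stands in
    for the corresponding component described in the steps below."""
    clip = preprocess(video_path)          # Step A: crop 5-10 s, filter camera motion
    if clip is None:                       # clip rejected (e.g., too much camera shake)
        return None
    tracks = detect_and_track(clip)        # Step B: per-frame boxes with stable IDs
    caption = summarize(clip, tracks)      # Step C: LLM writes a motion-rich caption
    qa_pairs = generate_qa(clip, caption)  # Step D: multiple-choice motion questions
    return {"clip": clip, "tracks": tracks, "caption": caption, "qa_pairs": qa_pairs}
```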
Step A: Motion-Centric Preprocessing
- What happens: The pipeline trims each video to a 5ā10 second segment around the interesting action and filters out clips with strong camera motion that would hide true object motion. It samples frames so later steps are efficient and consistent.
- Why this step exists: If the camera is moving a lot, even humans struggle to describe object motion correctly. Removing those cases boosts tracking quality and makes labels reliable.
- Example: A 40-second driving video is cropped to an 8-second segment where a car changes lanes, skipping earlier calm stretches and discarding clips with heavy camera pan.
How itās done (light details):
- Segment selection: choose 5ā10 seconds near the middle, with small randomness to diversify.
- Camera motion filtering: estimate camera pose over frames (translation, rotation). If the motion score is high, drop the clip.
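The paper scores camera motion from estimated camera pose (translation, rotation). As a simplified stand-in, the sketch below scores the median dense optical-flow magnitude between consecutive frames, which is dominated by the background and hence by camera motion, and drops clips above a tunable threshold. The use of Farneback flow and the threshold value are assumptions for illustration, not the paper's exact method.

```python
import cv2
import numpy as np

def camera_motion_score(frames_gray):
    """Median dense optical-flow magnitude between consecutive grayscale frames,
    averaged over the clip; a rough proxy for global (camera) motion."""
    per_pair = []
    for prev, nxt in zip(frames_gray[:-1], frames_gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        magnitude = np.linalg.norm(flow, axis=2)
        # The median reflects the background rather than a few fast-moving
        # foreground objects, so it mostly captures camera motion.
        per_pair.append(float(np.median(magnitude)))
    return float(np.mean(per_pair)) if per_pair else 0.0

def keep_clip(frames_gray, threshold=2.0):
    """Keep the clip only if its global-motion score stays under the threshold."""
    return camera_motion_score(frames_gray) < threshold
```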
Step B: Object Detection and Tracking
- What happens: The system finds objects (cars, cups, hands, people) and follows each one through the clip. It pays special attention to human-centric parts like left vs. right hands and whether a hand is touching an object.
- Why this step exists: You canāt explain motion without knowing who moved. Tracking keeps the same ID on the same object across time so paths are meaningful.
- Example: In a crafting video, detectors find a person, left hand, right hand, and a flower. Tracking keeps the right handās ID consistent as it reaches, picks up scissors, and cuts.
How itās done (light details):
- Open-vocabulary detection: A vision-language model proposes object names in the first frame; a detector (queried per object name) draws precise boxes.
- Human-centric detection: A high-quality person detector locates people; a pose model outlines whole-body keypoints; a hand model finds left/right hands, contact state, and the held objectās box.
- Temporal tracking: A video segmentation/tracking tool propagates masks/boxes frame-by-frame. IDs are hierarchically assigned so, for person ID k, left hand and right hand are consistent sub-IDs. Every few frames, fresh detections nudge the tracker back on track to prevent drift.
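The sketch below illustrates the hierarchical ID scheme and the periodic-refresh idea in plain Python. The data layout and the refresh interval are illustrative assumptions; the actual detectors, pose models, and tracker are off-the-shelf components that are not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized

@dataclass
class Track:
    track_id: str                 # e.g. "person_2" or "person_2/right_hand"
    label: str
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box

def hand_id(person_idx: int, side: str) -> str:
    """Hands are sub-IDs of their person, so identities stay consistent:
    person_2 always owns person_2/left_hand and person_2/right_hand."""
    return f"person_{person_idx}/{side}_hand"

# Toy example: one person and their right hand over two sampled frames.
tracks = {
    "person_2": Track("person_2", "person"),
    hand_id(2, "right"): Track(hand_id(2, "right"), "right hand"),
}
tracks["person_2"].boxes[0] = (0.30, 0.10, 0.70, 0.95)
tracks[hand_id(2, "right")].boxes[0] = (0.55, 0.50, 0.65, 0.60)
tracks[hand_id(2, "right")].boxes[1] = (0.48, 0.48, 0.58, 0.58)  # hand moves left

REFRESH_EVERY = 8  # every few frames, fresh detections overwrite the propagated
                   # boxes, nudging the tracker back on course to limit drift
```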
Step C: Trajectory Annotation to Caption Generation
- What happens: For every tracked object, the pipeline saves a sequence of normalized bounding boxes (left, top, right, bottom) per frame, plus who is interacting with whom. It overlays colored boxes on frames and sends both visuals and the structured JSON to a language model to write a rich caption.
- Why this step exists: Text models are great at language but weak at precisely reading motion from raw pixels. Feeding them explicit paths unlocks accurate, time-aware descriptions.
- Example: The caption might say, āThe personās left hand holds the flower steady near the center; the right hand lifts scissors from the right side, moves leftward, and snips the flower stem.ā
How itās done (light details):
- JSON schema: object IDs, types, per-frame boxes, and per-frame interactions.
- Prompting: Ask for seven motion dimensionsāactions/gestures, temporal order, who-did-what, spatial context, repetition, dynamics (direction/distance/velocity/trajectory), and changing relationships.
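Below is an illustrative example of what one trajectory record and the captioning prompt could look like; every field name here is an assumption made for this sketch, not the paper's published schema.

```python
import json

# Illustrative record; field names are assumptions, not the official schema.
annotation = {
    "video_id": "clip_00042",
    "objects": [
        {"id": "person_1/right_hand", "type": "right hand"},
        {"id": "object_3", "type": "scissors"},
    ],
    "frames": [
        {"index": 0,
         "boxes": {"person_1/right_hand": [0.62, 0.41, 0.71, 0.55],
                   "object_3": [0.78, 0.44, 0.88, 0.52]},
         "interactions": []},
        {"index": 8,
         "boxes": {"person_1/right_hand": [0.50, 0.40, 0.60, 0.54],
                   "object_3": [0.52, 0.42, 0.62, 0.50]},
         "interactions": [{"subject": "person_1/right_hand",
                           "contact": "object_3"}]},
    ],
}

# The LLM receives the overlaid frames plus this JSON, with a prompt covering
# the seven motion dimensions listed above.
prompt = (
    "Using the frames and the trajectory JSON below, describe the motion: "
    "actions/gestures, temporal order, who-did-what, spatial context, "
    "repetition, dynamics (direction, distance, velocity, trajectory), and "
    "changing relationships.\n\n" + json.dumps(annotation, indent=2)
)
```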
Step D: Motion QA Generation
- What happens: Using the caption (and frames), the language model creates multiple-choice questions that test different motion skills with realistic distractors drawn from the same scene.
- Why this step exists: Good questions teach and test. By covering recognition, order, who-did-what, spatial location, counting, and traditional motion analysis (direction, distance, speed, trajectory, relationships), the dataset trains broad motion reasoning.
- Example: āWhich hand performs the cutting?ā A: Right hand (correct), B: Left hand, C: Both hands, D: Neither hand.
How itās done (light details):
- Categories: Motion Recognition (MR), Action Order (AO), Motion-related Objects (MO), Location-related Motion (LM), Repetition Count (RC), plus direction/distance/trajectory/speed and spatial relations.
- Balance: The correct answer is evenly distributed among AāD across the dataset to avoid position bias; distractors are plausible but distinct.
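A minimal sketch of the balancing step: shuffling one correct answer with three scene-grounded distractors keeps the correct letter uniformly spread over AāD across the dataset. The function and field names are illustrative, not taken from the paper's code.

```python
import random

LETTERS = ["A", "B", "C", "D"]

def build_mcq(question, correct, distractors, category, rng=None):
    """Place the correct answer uniformly at random among A-D so no position
    is favored across the dataset; distractors come from the same scene."""
    rng = rng or random.Random()
    options = [correct] + list(distractors)
    rng.shuffle(options)
    return {
        "question": question,
        "options": dict(zip(LETTERS, options)),
        "answer": LETTERS[options.index(correct)],
        "category": category,
    }

qa = build_mcq(
    "Which hand performs the cutting?",
    correct="Right hand",
    distractors=["Left hand", "Both hands", "Neither hand"],
    category="MO",  # Motion-related Objects
)
```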
Step E: Fine-tuning Models
- What happens: The curated caption+QA pairs are used to fine-tune existing video-language models (e.g., NVILA-Video-15B, Qwen2.5-VL-7B). Training uses standard optimization, with learning rates and schedules tuned to preserve general skills.
- Why this step exists: The goal is to inject motion understanding without forgetting other capabilities, creating compact yet strong motion-aware models.
- Example: After fine-tuning, a 7B model that once answered vaguely (āthe car movesā) now answers precisely (āthe car turns right after passing the busā).
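For orientation, here is a generic supervised fine-tuning loop, assuming a HuggingFace-style model whose forward pass returns a loss when labels are included in the batch. The learning rate, gradient clipping, and single-epoch schedule are illustrative defaults, not the paper's exact recipe.

```python
import torch

def finetune(model, train_loader, epochs=1, lr=1e-5, max_grad_norm=1.0):
    """Plain fine-tuning loop; a small learning rate and a short schedule help
    inject motion skills without overwriting general video-language ability."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            # `batch` is assumed to already hold video features, tokenized
            # prompts, and labels in whatever format `model` expects.
            loss = model(**batch).loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
            optimizer.step()
            optimizer.zero_grad()
```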
Concrete walkthrough with data:
- Input: A driving clip (6.5 seconds). Preprocessing keeps it (low camera motion). Detectors find ego car, truck, pedestrian. Tracking assigns IDs. Trajectories show the pedestrian stepping into the lane from left to center; the ego car slows. Caption: "A pedestrian crosses in front of the ego car from left to center as the truck continues straight in the adjacent lane." QAs: (1) Who crosses? (2) Which direction? (3) What happens first? (4) Where in the frame? (5) How many times does the pedestrian step into the lane?
The secret sauce:
- Structured motion signals (the trajectories) plus careful prompts let LLMs move from fuzzy to precise motion language.
- Human-centric tracking that separates left vs. right hands and detects handāobject contact gives rich detail for everyday tasks.
- Balanced, multi-skill QAs ensure the model doesnāt just memorize labels but truly reasons over space and time.
04 Experiments & Results
The test: Researchers measured how well models understood motion across public and self-built benchmarks. They looked at accuracy on tasks that ask what moved, who did what, in what order, where on the screen, how often, and in which direction or along which path. They also checked whether training on FoundationMotion hurt performance on other abilitiesāit didnāt.
The competition: Baselines included strong closed-source and open-source models such as Gemini-2.5-Flash and Qwen2.5-VL-72B, plus common video-language backbones like NVILA-Video-15B/8B and Qwen2.5-VL-7B. A separate dataset (PLM) matched in size was used to compare whether ājust more dataā helps as much as motion-focused data.
The scoreboard (with context):
- AV-Car (zero-shot car motion QAs): NVILA-Video-15B fine-tuned with FoundationMotion reached 91.5%, ahead of Gemini-2.5-Flash at 84.1% and Qwen2.5-VL-72B at 83.3% (like scoring an A+ while the others get a B).
- Across multiple benchmarks, FoundationMotion fine-tuning brought consistent gains. For NVILA-Video-15B: +1.0% on MotionBench, +0.1% on VLM4D, +7.1% on AV-Car, +0.6% on AV-Hand, +2.4% on Daily, and a big +14.9% on Robotics. Similar patterns held for NVILA-Video-8B and Qwen2.5-VL-7B, with double-digit boosts in several zero-shot settings like Robotics and Daily.
- Compared to training with an equally large PLM dataset, FoundationMotion produced bigger, steadier gains and avoided notable drops. For instance, on NVILA-Video-15B, PLM hurt AV-Car (ā5.0%) and AV-Hand (ā2.5%), while FoundationMotion improved both (+7.1% and +0.6%).
Surprising and illuminating findings:
- Mid-sized models can beat much larger ones when trained with the right motion-focused data. This suggests that data quality and structure matter as much as model size for motion reasoning.
- Adding structured trajectories to the LLM (video + bounding box JSONs) made a large difference in QA quality. An external evaluator (GPT-4) rated video+JSONs notably higher than video-only in fine-grained action accuracy (+2.6), motion detail (+2.6), temporal coherence (+2.4), and overall QA quality (+2.3) on a 0ā10 scale.
- Different question types offer complementary benefits. In a small ablation with Qwen2.5-7B, Repetition Count delivered the biggest single-type boost (~+14.6% over base), with Location-related Motion and Motion-related Objects close behind; combining all types matched the top gains and stabilized performance.
- Dataset hygiene matters. Distributing correct answers evenly among AāD reduces biases. Keeping questions concise (most ~30ā80 characters) and clips short (mostly 3ā7 seconds) keeps the focus on motion rather than long-term memory.
Takeaway from numbers: FoundationMotionās automatically curated, motion-structured supervision lifts motion reasoning across domainsādaily hand use, robots, and autonomous drivingāoften enough to surpass flagship closed models on motion-specific tasks, all while preserving general video-language ability.
05 Discussion & Limitations
Limitations (be specific):
- Mostly 2D reasoning today: Trajectories live in image space, so depth and true 3D hand articulations arenāt fully captured. That can miss crucial details like whether a hand moves toward or away from the camera.
- Detector/tracker dependence: Errors in detecting small or occluded objects, left vs. right hands, or contact states can ripple into wrong captions and QAs.
- Camera motion filtering: Dropping strong camera-motion clips simplifies learning but narrows coverage; future systems must learn to disentangle object motion from camera motion rather than avoid it.
- Domain shifts: Although zero-shot performance is strong, unusual scenes (e.g., underwater videos, crowded night markets) may still challenge detectors and trackers.
- Annotation granularity: Box-level tracks are good for many tasks but not for fine finger poses, rotating tools, or curved 3D paths.
Required resources:
- Compute: Multi-GPU training (e.g., 8ĆA100) is typical for fine-tuning; detection/tracking and LLM steps also need GPU or fast CPU clusters.
- Storage and bandwidth: Tens of thousands of clips plus per-frame tracks and overlays require organized storage and data loaders.
- Off-the-shelf models: Access to robust detectors, trackers, pose estimators, and a capable LLM for caption/QA writing.
When not to use:
- If you only need coarse labels (e.g., "a person is present"), this pipeline is overkill.
- If your videos are dominated by camera motion (e.g., first-person parkour), results may degrade unless you extend the method to model camera motion explicitly.
- If privacy constraints prevent running detection/tracking on sensitive content, data curation may be limited.
Open questions:
- 3D motion grounding: How to seamlessly integrate monocular depth, multi-view cues, or 3D scene graphs so models answer in 3D, not just 2D?
- Robustness to camera motion: Can we jointly estimate camera and object motion so the system explains both without filtering?
- Fine-grained hand dynamics: How to capture finger-level trajectories and tool articulation for dexterous tasks?
- Long horizon reasoning: How to scale from 5ā10s motion snippets to minute-long chains of cause and effect?
- Data governance: How to ensure fairness, consent, and bias control while scaling automated video curation?
06 Conclusion & Future Work
Three-sentence summary: FoundationMotion is a fully automated pipeline that detects and tracks objects in videos, saves their motion paths, and uses those paths to auto-write precise captions and multiple-choice questions about how things move. Fine-tuning existing video-language models on this motion-rich dataset makes them much better at spatial and temporal reasoning, often beating far larger models on motion benchmarks. The approach scales motion understanding beyond costly manual labeling while preserving general video-language skills.
Main achievement: Showing that structured motion signals (trajectory JSONs) paired with targeted captions and QAs can upgrade mid-sized models into motion-aware systems that outperform state-of-the-art baselines on several real-world motion tasks.
Future directions: Move from 2D to 3D motion reasoning (depth, articulated hands, object rotations), learn to disentangle camera vs. object motion rather than filtering, and extend to longer videos with richer cause-and-effect chains. Improving detectors under occlusion and in rare domains will further strengthen downstream motion reasoning.
Why remember this: It reframes motion learning as a data problemāif you can automatically show the model exactly how things move and then quiz it well, you can teach powerful motion understanding at scale. Thatās a recipe for safer robots, smarter assistants, and more reliable autonomous systems in the physical world.
Practical Applications
- Driver assistance: Detect pedestrians crossing ahead and understand turning maneuvers for safer alerts.
- Home robotics: Guide grasp, lift, and pour actions with correct hand-object sequencing to reduce spills.
- Industrial automation: Monitor and reason about tool trajectories for quality control and safety checks.
- Sports analytics: Summarize player movements, passes, and cuts with precise spatial and temporal order.
- Video tutoring: Explain step-by-step motions in crafting, cooking, or repairs and quiz learners interactively.
- Surveillance triage: Flag critical motion events (e.g., intrusion paths) with direction and timing.
- AR/VR assistance: Provide motion-aware captions that describe object paths and interactions in view.
- Medical training videos: Highlight instrument trajectories and gesture order in surgical procedures.
- Traffic analysis: Quantify vehicle trajectories at intersections for urban planning and safety audits.
- Human-computer interaction: Recognize left vs. right hand gestures and their sequences for robust control.