Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
Key Summary
- •The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).
- •It builds DSR Suite, an automated pipeline that turns real internet videos into thousands of multiple-choice questions with step-by-step, time-aware answers.
- •Two datasets come out of this pipeline: DSR-Train (for learning) and DSR-Bench (a human-refined test) that cover moving viewpoints, many objects, and fine-grained changes.
- •A lightweight add-on called the Geometry Selection Module (GSM) teaches models to pull only the geometry that a question actually needs, instead of dumping in all 3D data.
- •GSM works with two Q-Formers: the first shrinks the question into a compact summary; the second fetches only the relevant 3D signals and turns them into a small set of geometry tokens.
- •Trained on DSR-Train, Qwen2.5-VL-7B + GSM tops DSR-Bench across all subtasks while keeping strong scores on general video understanding benchmarks.
- •Compared to prior methods that fused all 3D features, GSM avoids overfitting and preserves general skills.
- •The benchmark stresses understanding continuous change (procedural answers) rather than one-shot snapshots, which better matches real-world motion.
- •Scaling data from 5K to 50K QAs steadily improves results, and mixing static+dynamic training boosts both kinds of spatial reasoning.
- •Models improved with DSR-Train also perform better on downstream agent tasks (like MineDojo), showing practical benefits beyond benchmarks.
Why This Research Matters
Many real tasks involve motion, not just still scenes: robots hand objects to moving people, cars merge with other vehicles, and AR apps anchor labels on moving players. This work gives AI a practical way to understand “what moved where, from whose viewpoint, and when,” which is key for safety, clarity, and control. By using procedural answers, the benchmark rewards understanding of continuous change, not just final outcomes. The Geometry Selection Module makes models smarter without making them brittle, keeping their general video reasoning intact. The framework also scales with data and transfers to agent tasks, showing impact beyond lab tests.
Detailed Explanation
01Background & Problem Definition
🍞 Hook: You know how watching a soccer game is harder than looking at a single photo? You must track players, the ball, and who’s getting closer or farther—over time.
🥬 The Concept (Dynamic Spatial Reasoning, DSR): DSR is the skill of understanding how objects’ positions, directions, and relationships in 3D change as time passes.
- What it is: A model’s ability to reason about 3D space and motion over time (4D = 3D space + time).
- How it works: 1) Notice objects; 2) Follow how they move; 3) Track how their distances and directions change; 4) Consider the camera or viewer’s changing viewpoint; 5) Explain these changes clearly.
- Why it matters: Without DSR, an AI gets confused when things move—it may miss who passed whom, who sped up, or how the camera angle changed. 🍞 Anchor: If you ask, “Is the blue car getting closer to the camera between 3s and 7s?” a DSR-capable model can answer and explain the change.
The World Before: Vision-language models (VLMs) were great at general descriptions (“A dog runs in the park”) but struggled when asked precise, time-aware spatial questions (“From the skateboarder’s viewpoint, which dog moves left then behind from 2s–5s?”). Most training and tests focused on static scenes or very short motions. When researchers tried to inject 3D knowledge directly (like dumping in all geometry features), models often got better at a niche task but worse at general video understanding—the flood of irrelevant geometric detail threw off the balance between specialized and general skills.
The Problem: There weren’t enough large, diverse, 4D-aware resources to train and test DSR fairly. We needed scalable data from real, messy videos (in the wild), covering multi-object motion, changing viewpoints, and fine-grained, step-by-step answers that reflect continuous change.
Failed Attempts:
- Static-only datasets: Good for chairs on a table; weak for cars in traffic.
- Two-frame change tasks: Too short to test long-term motion understanding.
- Domain-limited videos (e.g., only driving): Not diverse enough for general DSR.
- Naïve 3D fusion (just add all geometry features): Improves DSR a bit but harms general skills by flooding the model with irrelevant details.
The Gap: We needed (1) a scalable way to turn real videos into DSR training data, (2) a fair, comprehensive benchmark with fine-grained answers, and (3) a smart model add-on that selects only the geometric information relevant to each question.
🍞 Hook (4D): Imagine a flipbook: each page is a 3D scene, and flipping adds time. 🥬 4D
- What it is: 3D space plus time.
- How it works: Track 3D positions, then stack them across frames.
- Why it matters: Motion only makes sense when you connect many instants. 🍞 Anchor: A dancer’s spin is not just “left” or “right”—it’s a curve traced through time.
Real Stakes: Robots need DSR to hand you the right tool as you move. AR headsets need it to anchor arrows on a running teammate. Cars need it to understand who is merging, not just who is nearby. Video assistants need it to explain plays or safety near a construction site. Without DSR, smart systems misread motion, make unsafe choices, or give vague answers.
🍞 Hook (In-the-wild videos): Think of home videos vs. perfect studio shots—real life is messy. 🥬 In-the-wild videos
- What: Real, diverse internet videos with moving cameras and objects.
- How: Sample frames, filter for motion, and extract geometry cues without needing exact scale.
- Why: If models only see tidy lab scenes, they fail in the real world. 🍞 Anchor: A phone clip of kids playing tag has changing viewpoints, occlusions, and quick moves—exactly what DSR must handle.
02Core Idea
🍞 Hook: You know how, before packing a school bag, you first check the day’s schedule so you only bring what you need?
🥬 The Concept (Aha! Insight): Only select the geometric facts that a specific question needs, and learn DSR from lots of real videos with fine-grained, time-aware answers.
- What it is: DSR Suite = a data pipeline + a smart selector (GSM) that filters 3D knowledge based on the question.
- How it works: 1) Build DSR-Train from in-the-wild videos with reliable geometry cues; 2) Build DSR-Bench with human-refined questions and procedural answers; 3) Add GSM to a VLM so it picks only relevant 3D info via two Q-Formers; 4) Train; 5) Evaluate.
- Why it matters: Dumping all 3D details overwhelms the model; selective geometry keeps general skills intact while boosting DSR. 🍞 Anchor: Asked “Which dog speeds up more from the child’s viewpoint?” the model pulls only the dog-centered motion cues needed and ignores irrelevant background geometry.
Three Analogies for GSM:
- Library analogy: You don’t read every book; you ask the librarian (question condenser) who then fetches only the chapters you need (relevant geometry selector).
- Backpack analogy: First check your plan (condense text), then pack just the right notebooks (select geometry tokens) instead of your whole desk.
- Detective analogy: Form a hypothesis from the question, then examine only the relevant clues, not every footprint in the city.
Before vs After:
- Before: Models either ignored rich 3D motion or got swamped by it, losing general video skills.
- After: With DSR-Train + GSM, the model focuses on just-right geometry, scoring SOTA on DSR-Bench while staying strong on general benchmarks.
🍞 Hook (Q-Former): Imagine you have sticky notes with smart questions that you press onto a page to pull out the right facts. 🥬 Q-Former
- What it is: A module with learnable queries that attend to tokens and extract compact, task-relevant summaries.
- How it works: Queries “look” at text or 3D tokens via attention and return a fixed-size set of distilled features.
- Why it matters: Fixed, small packets of only-what-matters stop models from drowning in too many details. 🍞 Anchor: Thirty-two queries can pull the exact geometry needed to answer, “From 3s–6s, does the bike move left then behind the bus?”
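To make the Q-Former idea concrete, here is a minimal PyTorch sketch of the general mechanism described above: learnable queries cross-attend to input tokens and return a fixed-size summary. This illustrates the generic technique, not the paper’s implementation; module names and dimensions are placeholders.

```python
# A minimal, illustrative Q-Former sketch (not the paper's exact implementation):
# a fixed set of learnable queries cross-attends to input tokens and returns
# a fixed-size packet of distilled features, however long the input is.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries=32, dim=256, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))   # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):                         # tokens: (batch, seq_len, dim)
        b = tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, tokens, tokens)   # queries "look" at the tokens
        x = self.norm1(q + attended)
        return self.norm2(x + self.ffn(x))              # (batch, num_queries, dim), fixed size

# Whether the input is 50 text tokens or 5,000 geometry tokens,
# the output is always num_queries summary vectors.
```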
Why It Works (Intuition, not equations):
- Language first: Condense the question so the model knows exactly what to seek (distance, direction, whose viewpoint, time window).
- Targeted geometry next: Use that condensed intent to fetch only a small set of matching 3D cues (positions, motion, orientations) as geometry tokens.
- Bounded fusion: Feed a fixed number of geometry tokens to the LLM, keeping noise low and general reasoning intact.
Building Blocks:
- DSR-Train: Large-scale, auto-generated, multiple-choice QAs with procedural (step-by-step) answers.
- DSR-Bench: Human-polished, diverse videos, viewpoints, objects, and fine-grained answers.
- GSM: Two stacked Q-Formers—one for question condensation, one for relevant-geometry selection.
- 4D Priors: Camera poses, point clouds, masks, orientations, 3D trajectories from strong vision models, mostly in relative (non-metric) scale to support trend-based answers.
🍞 Hook (Geometric prior): Like asking a friend who already mapped the classroom where desks and doors likely are. 🥬 Geometric prior
- What it is: Precomputed 3D cues (poses, trajectories, point clouds, orientations) from foundation models.
- How it works: Extract reliable, relative 3D structure from videos; then let GSM pick what’s relevant per question.
- Why it matters: Speeds learning and grounds answers in real geometry instead of guesswork. 🍞 Anchor: Knowing the camera pose and object trajectories helps cleanly answer, “Does the truck go from right to behind the bus between 8s and 14s?”
03Methodology
At a high level: Video → Stage 1 (Video Curation) → Stage 2 (Geometric Clue Extraction) → Stage 3 (Data Generation) → Train VLM + GSM → Output: DSR-capable answers.
🍞 Hook (Pipeline factory): Imagine a toy factory line: input raw plastic, mold parts, paint details, pack into boxes. 🥬 Automated Data Generation Pipeline
- What it is: A 3-stage process that turns messy internet videos into clean, answerable DSR questions.
- How it works: 1) Pick good videos with motion; 2) Extract 3D cues; 3) Build questions and step-by-step answers.
- Why it matters: Without clean, scalable data, models can’t learn robust DSR. 🍞 Anchor: From a skatepark video, we choose clips with moving skaters, compute their 3D paths, then ask, “From the camera view, does Skater A move left then behind Skater B from 2s–6s?”
Stage 1: Video Curation
- What happens: Start from a giant in-the-wild video pool (Koala-36M). Filter to keep only clips with meaningful object motion and reasonable durations (20–120s). Use language and vision models to judge motion content and scene variety.
- Why it exists: Many internet videos barely move (or only deform, like waving hands). These don’t teach DSR.
- Example: A caption “Two cars race around a bend” passes; “A person stands still giving a speech” is filtered out for DSR.
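To make the curation step concrete, here is a small, self-contained sketch of the filtering logic just described. The duration bounds come from the text; the motion score and its threshold are illustrative stand-ins for the paper’s model-based judgments of motion content and scene variety.

```python
# A self-contained sketch of the Stage 1 filtering logic described above.
# The motion score and threshold are illustrative stand-ins: the paper uses
# language/vision models to judge motion content and scene variety.
from dataclasses import dataclass

@dataclass
class Clip:
    duration_s: float     # clip length in seconds
    motion_score: float   # 0 (static) .. 1 (rich object motion), from a model judge

def keep_for_dsr(clip: Clip, min_s: float = 20.0, max_s: float = 120.0,
                 motion_threshold: float = 0.5) -> bool:
    """Keep clips with a reasonable duration and meaningful object motion."""
    if not (min_s <= clip.duration_s <= max_s):
        return False                                  # too short or too long to probe motion
    return clip.motion_score >= motion_threshold       # drop near-static / deform-only scenes

# Example: a 45 s clip of two racing cars passes; a 30 s static speech does not.
print(keep_for_dsr(Clip(45.0, 0.8)))   # True
print(keep_for_dsr(Clip(30.0, 0.1)))   # False
```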
🍞 Hook (Camera pose): Point your phone—where it is and where it faces changes what you see. 🥬 Camera pose
- What it is: The camera’s position and orientation in 3D.
- How it works: Estimate relative pose frame by frame using a robust geometry model.
- Why it matters: Viewpoint changes can make a still object look like it moves; we must account for the camera. 🍞 Anchor: If the camera walks forward, a parked car looks closer; pose estimation corrects for that.
Stage 2: Geometric Clue Extraction
- What happens: For sampled frames, compute scene-level cues (camera poses, local point clouds) and object-level cues (masks, 3D centers over time = trajectories, orientations for agents).
- Why it exists: We need concrete 4D evidence to ask precise questions and produce truthful answers.
- Example with data: From frames at 1 FPS, lift each tracked object’s mask onto the local point cloud to get its 3D center per timestamp. Now we can compute “distance grew” or “direction turned from right to behind.”
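Here is a minimal sketch of this object-level extraction, assuming a per-frame point map (a 3D point per pixel from a geometry model) and a tracked object mask are already available; the function and variable names are illustrative, not the pipeline’s actual code.

```python
# Illustrative sketch of the object-level clue extraction described above:
# lift the tracked object's mask onto the per-frame 3D point map, take the
# masked points' centroid as the object's 3D center at that timestamp, and
# link centers over time to form a 3D trajectory.
import numpy as np

def object_center_3d(point_map: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """point_map: (H, W, 3) per-pixel 3D points; mask: (H, W) bool object mask."""
    pts = point_map[mask]              # (K, 3) points belonging to the object
    return pts.mean(axis=0)            # robust variants (median, trimming) also work

def object_trajectory(point_maps, masks) -> np.ndarray:
    """Stack per-frame centers into a (T, 3) trajectory."""
    return np.stack([object_center_3d(pm, m) for pm, m in zip(point_maps, masks)])

# With trajectories in hand, trends become simple comparisons, e.g. whether the
# camera-to-object distance grows or shrinks between two timestamps.
```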
🍞 Hook (Point cloud): Imagine sprinkling glitter in the scene, each speck marking a 3D spot. 🥬 Point cloud
- What it is: A set of 3D points outlining surfaces in the scene.
- How it works: Reconstructs relative geometry from video frames.
- Why it matters: Lets us place objects in 3D and measure changes. 🍞 Anchor: A wall is a dense sheet of points; a car is a clustered blob moving through those points.
🍞 Hook (3D trajectory): Like an ant’s trail—but in space and time. 🥬 3D trajectory
- What it is: The path of an object’s 3D center across frames.
- How it works: Track the object mask, lift to 3D each frame, then link centers over time.
- Why it matters: Distance, direction, and speed trends come from trajectories. 🍞 Anchor: A dog’s path curves left, then goes behind its owner—exactly what a question may ask.
Stage 3: Data Generation (QAs)
- What happens: Build two families of multiple-choice QAs: template-based (six types) and free-form (LLM-generated), across viewpoints (camera or an agent), with absolute (fixed) or relative (moving) viewpoints.
- Why it exists: We need both structured probes of core skills (templates) and natural language variety (free-form) for robust learning and fair evaluation.
- Example: “Between 3s and 16s, following the boy’s perspective, how does the dog’s direction to the other dog change?”
🍞 Hook (Viewpoint transform): Standing on a skateboard vs. pausing a photo—the world looks different. 🥬 Viewpoint (absolute vs. relative)
- What it is: Absolute fixes the observer at one instant; relative follows the observer as they move.
- How it works: Transform all object coordinates into the chosen reference frame using camera pose (and agent orientation for agent viewpoints).
- Why it matters: DSR must work whether we freeze the viewer or ride along. 🍞 Anchor: From the runner’s moving viewpoint, another runner may shift from right to behind even if a static camera would see something else.
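Here is a small numpy sketch of the viewpoint transform just described, assuming camera-to-world poses (R, t) are available per frame; frame conventions and names are illustrative. An absolute viewpoint reuses one reference-instant pose for every frame, while a relative viewpoint uses the observer’s pose at each frame.

```python
# Illustrative viewpoint transform: express a world-frame trajectory in the
# observer's coordinate frame, either frozen at one instant (absolute) or
# following the observer over time (relative). Conventions are assumptions.
import numpy as np

def to_observer_frame(p_world: np.ndarray, R_obs: np.ndarray, t_obs: np.ndarray) -> np.ndarray:
    """p_world: (3,) point; R_obs (3,3), t_obs (3,): observer's camera-to-world pose."""
    return R_obs.T @ (p_world - t_obs)          # world -> observer coordinates

def relations_over_time(traj_world, poses, absolute_at=None) -> np.ndarray:
    """traj_world: (T, 3); poses: list of (R, t). If absolute_at is set, freeze that pose."""
    out = []
    for i, p in enumerate(traj_world):
        R, t = poses[absolute_at] if absolute_at is not None else poses[i]
        out.append(to_observer_frame(p, R, t))
    return np.stack(out)   # (T, 3): e.g. x<0 -> "left", z>0 -> "in front" (convention-dependent)
```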
🍞 Hook (Egocentric vs. allocentric): “Turn left from where I stand” vs. “Turn east on the map.” 🥬 Egocentric vs. allocentric
- What it is: Egocentric = from the observer; allocentric = in a fixed world frame.
- How it works: Choose a frame and project motion into it.
- Why it matters: Many human questions are egocentric; some tasks need map-like consistency. 🍞 Anchor: “Behind me” is egocentric; “south of the tree” is allocentric.
🍞 Hook (Procedural answers): A recipe explains each step—not just “cake happened.” 🥬 Procedural answers
- What it is: Fine-grained answers that describe how a state changes over time (e.g., keep constant → smaller → larger).
- How it works: Compare the attribute across adjacent frames, compress repeated states, and output a short sequence of qualitative steps.
- Why it matters: Real motion is a process, not a single snapshot. 🍞 Anchor: “Left, then behind|left” is more informative than just “behind.”
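To show how such a procedural answer could be derived, here is a toy sketch for the distance attribute: compare adjacent frames, map changes to qualitative states, and compress repeated states into a short sequence. The threshold and names are illustrative, not the paper’s exact rules.

```python
# Toy sketch of procedural-answer generation for a distance attribute:
# discretize frame-to-frame change into qualitative states, then compress runs.
def procedural_distance_answer(distances, eps=0.05):
    """distances: per-frame object-observer distances (relative scale is fine)."""
    states = []
    for prev, cur in zip(distances[:-1], distances[1:]):
        if cur > prev * (1 + eps):
            states.append("larger")
        elif cur < prev * (1 - eps):
            states.append("smaller")
        else:
            states.append("keep constant")
    # Compress repeated states: ["smaller", "smaller", "larger"] -> ["smaller", "larger"]
    compressed = [s for i, s in enumerate(states) if i == 0 or s != states[i - 1]]
    return " -> ".join(compressed)

print(procedural_distance_answer([5.0, 5.0, 4.2, 3.5, 4.1, 5.0]))
# "keep constant -> smaller -> larger"
```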
Training with GSM
- What happens: Use Qwen2.5-VL-7B as the base model. GSM adds two Q-Formers: (1) a Semantic Condenser that turns the question text into a fixed set of query embeddings; (2) a Relevant-Geometry Selector that uses those queries to pull a compact set of N geometry tokens from the 3D priors. The concatenation [vision tokens; geometry tokens; text tokens] is then fed to the LLM (see the sketch after this list).
- Why it exists: To boost DSR without hurting general video skills by filtering out irrelevant 3D noise.
- Example with data: For “Which car gets closer faster to the camera between 2s and 6s?”, GSM emphasizes car trajectories and distances, not background trees.
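Below is a schematic PyTorch sketch of how GSM could be wired, following the description above: a semantic condenser distills the question into intent queries, and a relevant-geometry selector uses those queries to pull a fixed number of geometry tokens. Module names, dimensions, and details are assumptions for illustration, not the paper’s released code.

```python
# Schematic sketch of the Geometry Selection Module (GSM) wiring described above.
# Names and dimensions are illustrative, not the paper's exact code.
import torch
import torch.nn as nn

class CrossAttnQFormer(nn.Module):
    """Learnable (or externally provided) queries cross-attend to a token sequence."""
    def __init__(self, num_queries, dim, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, queries=None):
        b = tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1) if queries is None else queries
        out, _ = self.attn(q, tokens, tokens)
        return self.norm(q + out)                        # (B, num_queries, dim)

class GeometrySelectionModule(nn.Module):
    def __init__(self, dim=1024, n_geom_tokens=32):
        super().__init__()
        self.semantic_condenser = CrossAttnQFormer(n_geom_tokens, dim)   # question -> intent queries
        self.geometry_selector = CrossAttnQFormer(n_geom_tokens, dim)    # intent -> relevant geometry

    def forward(self, text_tokens, geometry_tokens):
        intent = self.semantic_condenser(text_tokens)                     # what the question needs
        return self.geometry_selector(geometry_tokens, queries=intent)    # N geometry tokens

# During training/inference, the LLM would then receive the concatenation
# [vision tokens; selected geometry tokens; text tokens].
```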
The Secret Sauce
- Targeted selection over brute-force fusion: Smaller, question-aligned geometry packets beat dumping all 3D features.
- Fixed-size geometry tokens: Keep compute stable and prevent overfitting.
- Relative-scale cues: Trend-based, robust answers without needing exact meters.
04Experiments & Results
The Test: DSR-Bench measures 12 template-based subtasks (distance, direction, orientation, speed, speed comparison, direction prediction × absolute vs. relative viewpoints) plus a non-template subset. Videos cover six broad categories (sports, vehicles, art, labor, daily life, wildlife). Answers are procedural, describing how things change, not just the final state.
The Competition: The paper compares against proprietary general models (e.g., GPT-4o, Gemini-2.5), video understanding models (LLaVA-Video, VideoRefer, LongVILA-R1), general-purpose open models (Qwen, InternVL), and spatial reasoning models (VLM-3R, VG-LLM), most of which focus on static or coarse dynamics.
The Scoreboard (contextualized):
- Our model (Qwen2.5-VL-7B + GSM, trained on DSR-Train) leads across all subtasks on DSR-Bench. For example, Absolute Distance hits 87.0%—like scoring an A+ when many others hover near passing. Absolute Orientation reaches 84.1%, and Relative Direction 76.1%. Non-template questions reach 46.4%, strong for open-ended phrasing.
- Our model's average accuracy on DSR-Bench is 58.9%, notably higher than all baselines, showing broad, consistent gains across motion types and viewpoints.
- Even specialized spatial models trained on static scenes trail behind, highlighting the challenge and the value of dynamic data.
Generalization and Trade-offs:
- GSM vs. naive 3D addition: When simply adding all 3D tokens, DSR rises but general video benchmarks (e.g., Video-MME) can drop—information overload. GSM keeps DSR high while preserving general understanding (near-baseline Video-MME scores), balancing specialization and breadth.
- Query number ablation: More queries (geometry tokens) help DSR slightly but can nibble away at general performance; N=32 offers a sweet spot.
- Data scaling: Training on 5K → 50K QAs shows steady gains on DSR-Bench (e.g., ~47% → ~59%), underscoring the usefulness of more diverse dynamic examples.
Surprising Findings:
- Static spatial models sometimes beat proprietary giants on DSR tasks, signaling that 3D-aware supervision matters more than model size alone for this skill.
- Mixing static + dynamic training improves both static (VSI-Bench) and dynamic (DSR-Bench) scores simultaneously, showing complementary learning.
- Downstream agents (MineDojo) benefit: Agents built from the DSR-trained model succeed more often on tasks with moving animals/hostiles (e.g., 26.5% vs. ~16% for animals; 22.3% vs. ~12% for hostiles), proving real-world utility beyond Q&A.
Why these results matter: The model not only wins on the home benchmark but also stays competent elsewhere, demonstrating that selective geometry and fine-grained supervision are a practical path to robust 4D reasoning.
05Discussion & Limitations
Limitations:
- Relative scale only: The pipeline focuses on non-metric (trend-based) geometry. For tasks needing exact meters/seconds, extra calibration would be required.
- Orientation for non-agents: To avoid noise, only agent classes get orientations; some non-agent rotations remain unmodeled.
- Occlusion and exits: Long occlusions or objects entering/exiting can still challenge tracking and question stability.
- Multiple-choice format: While procedural, choices can constrain expression; free-form parts help but are smaller.
- Heavy-lift preprocessing: Extracting high-quality 3D cues over long videos is compute-intensive.
Required Resources:
- A base video VLM (e.g., Qwen2.5-VL-7B) and access to 3D foundation tools for poses, point clouds, masks, trajectories, and orientations.
- GPU time for QA generation, training, and evaluation.
- Optional human refining for benchmarks (as done for DSR-Bench).
When NOT to Use:
- Exact metric tasks (e.g., “Is the car 2.5 meters away?”) without additional scale calibration.
- Ultra-crowded scenes with frequent identity switches where tracking priors break down.
- Real-time low-latency systems without preprocessing budgets.
Open Questions:
- Metric grounding: How to reliably add absolute scale without heavy annotation?
- Forecasting: Can the model predict future 3D paths, not just describe past ones?
- Complex interactions: How to robustly reason about multi-agent coordination and physical constraints?
- Memory and recurrence: What’s the best way to keep long temporal context without bloating computation?
- Unified spatial curriculum: What’s the optimal mix of static + dynamic data to maximize transfer?
06Conclusion & Future Work
Three-Sentence Summary: This paper introduces DSR Suite, which turns real videos into fine-grained, time-aware spatial questions for training (DSR-Train) and human-refined testing (DSR-Bench). A lightweight Geometry Selection Module (GSM) uses two Q-Formers to pull only question-relevant 3D cues, boosting dynamic spatial reasoning without sacrificing general video understanding. Together, they deliver state-of-the-art results on DSR-Bench and improved performance on related tasks and agents.
Main Achievement: Proving that selective geometry—via GSM—and procedural, viewpoint-aware supervision—via DSR-Train/DSR-Bench—enable strong, balanced 4D reasoning in VLMs.
Future Directions: Add metric scale where needed, extend to motion forecasting and physics reasoning, refine tracking under occlusions, and grow mixed static+dynamic curricula for broader transfer. Explore deployment in robotics, AR navigation, sports analytics, and autonomous systems.
Why Remember This: It shows a practical recipe for teaching AI to understand how the world changes—not just how it looks—by pairing carefully built data with a simple, powerful geometry selector that keeps models both smart and versatile.
Practical Applications
- •Robot assistants tracking where people and tools move in a workshop to hand over the right item safely.
- •AR navigation that keeps arrows and labels correctly attached to moving teammates during sports or training.
- •Video coaching tools that explain how players change positions and speeds across a play from specific viewpoints.
- •Driver-assist systems that reason about which car is approaching faster and from which direction over time.
- •Home security analytics that summarize who moved where and when without relying on exact measurements.
- •Warehouse automation that plans paths around moving carts and workers with viewpoint-aware reasoning.
- •Education apps that teach physics of motion by analyzing real videos with step-by-step explanations.
- •Sports broadcasting that generates procedural captions of plays (e.g., “runner moves left, then cuts behind defender”).
- •Embodied AI in games (like Minecraft) that succeeds more often at tasks involving moving animals or enemies.
- •Drone filming that keeps track of multiple moving subjects while compensating for camera motion.