MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence
Key Summary
- This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.
- It tests four levels of smarts: Perception (seeing and mapping space), Planning (choosing actions), Prediction (guessing what happens next), and Cross-Video Reasoning (connecting different videos).
- The benchmark has 1,106 multiple-choice questions built from 1,278 clips taken from 25 public datasets and in-house videos, each checked by 3D vision experts.
- Across 25 strong AI models, the best model still scores about 38%, while humans reach 96.4%, showing a very large gap.
- Models do worst on Prediction and on camera-to-object spatial relations, where precise geometry and perspective matter most.
- More frames do not always help; a 'Sufficient-Coverage' setting often performs no better than giving 50 well-spread frames, and a popular keyframe method (AKS) even hurts here.
- Adding 3D spatial cues or asking models to think step-by-step (chain-of-thought) brings little to no gain on this benchmark.
- Error analysis shows models mainly fail at geometry, fine-grained motion grounding, aligning their answers to the prompts, and matching across videos.
- Three focused sub-benchmarks (Indoor Scene Perception, Robot, and Grounding) allow targeted testing of room layouts, navigation/manipulation, and spatial-temporal localization.
- Overall, MMSI-Video-Bench sets a tough, realistic standard to guide future progress in video-based spatial intelligence.
Why This Research Matters
If we want helpful home robots, AR guides, and safe autonomous systems, they must truly understand space and time, not just label objects. MMSI-Video-Bench measures exactly that, showing where today’s models fall short and why. Because the questions are human-written and diverse, scores tell us about real abilities, not template tricks. The benchmark’s tough cases—like predicting next steps or merging different videos—mirror the real world, where we rarely see everything from one angle. Engineers can use the detailed error patterns to build better models, and product teams can pick the safest models for navigation and planning tasks. In short, this benchmark turns vague promises about “video understanding” into concrete, fixable goals that make AI more reliable in our daily lives.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine trying to help a friend move furniture by watching short phone videos of their house. You’d need to figure out where the couch is, how the hallway turns, and whether you should turn left or right at the plant. That skill—understanding space and motion from videos—is what we want AI to learn.
🥬 The Concept (Video-Based Spatial Intelligence): What it is: It’s an AI’s ability to understand space and movement in videos—who is where, what moves how, and what will happen next. How it works (recipe): 1) Observe the video to see objects and layout. 2) Track how the camera and objects move over time. 3) Build a mental map that updates as new frames arrive. 4) Use that map to plan actions or predict outcomes. Why it matters: Without it, an AI can’t safely guide a robot, navigate a house, or answer questions about where things are and how they move. 🍞 Anchor: A home robot needs this to find the kitchen sink after watching a quick house tour.
The World Before: For years, AI benchmarks mostly tested single images or short, simple video questions. These tests were often auto-generated with templates, so models could get decent scores without truly understanding space. Many focused on one slice of the problem—like identifying objects—rather than the whole story of space plus time plus decision-making. That left a big gap between what we measured and what real-world assistants must actually do.
🍞 Hook: You know how a pop quiz that only asks multiplication doesn’t show if you can solve a word problem? Benchmarks for AI had that problem too.
🥬 The Concept (Benchmark): What it is: A benchmark is a set of standard tests that fairly measure what an AI can do. How it works: 1) Collect diverse examples. 2) Write clear, unambiguous questions with correct answers. 3) Compare models with the same rules. 4) Track progress over time. Why it matters: Without a good benchmark, we don’t know if models are truly improving on skills that matter in the real world. 🍞 Anchor: Think of it like a driving test, not just a trivia quiz, to see if you can actually drive safely.
The Problem: We needed a holistic video benchmark that checks whether models can see, reason, plan, and predict over time—just like a robot helper or navigation assistant would need to do. But earlier video benchmarks often lacked variety, used templated questions, and didn’t test hard cases like connecting multiple videos of the same place.
Failed Attempts: Researchers tried (1) single-image or simple video Q&A, which misses motion and memory; (2) auto-generated questions, which risk template overfitting; and (3) narrow scene sets (e.g., only indoors), which miss generalization. Frame sampling shortcuts that work on other tasks (like picking just the “most semantic” frames) didn’t actually capture the spatial and motion evidence needed for reasoning-heavy questions.
The Gap: We needed a fully human-annotated, diverse, multi-level benchmark that (a) stresses true spatial understanding from videos, (b) includes planning and predicting, and (c) requires connecting information across separate videos.
🍞 Hook: Imagine walking into a maze with twisty hallways, then later seeing the same maze from a balcony. You’d need to combine both views to really know the layout.
🥬 The Concept (Cross-Video Reasoning): What it is: Using clues from different videos (times or viewpoints) to build one consistent understanding. How it works: 1) Remember key facts from video A. 2) Update those facts with video B. 3) Match the same places/objects across angles or days. 4) Answer questions that need both. Why it matters: In the real world, you rarely see everything at once; you must stitch clues together. 🍞 Anchor: Security cameras around a store: one sees the front door, another sees the aisle—together they tell the full story.
Real Stakes: This matters for home robots (find the right room, not bump into things), AR navigation (guide you through airports), drones (safe path planning), and safety-critical systems (avoid risky predictions). It also helps researchers pinpoint what’s truly broken—like geometry or long-horizon memory—so the next generation of AIs gets smarter in ways that help real people.
02 Core Idea
🍞 Hook: You know how a good science fair test uses many trials, different conditions, and clear scoring so you can trust the results? That’s the spirit here.
🥬 The Concept (MMSI-Video-Bench): What it is: MMSI-Video-Bench is a big, carefully hand-made exam that measures how well AI understands space and motion in videos, across four levels: Perception, Planning, Prediction, and Cross-Video Reasoning. How it works: 1) Gather diverse videos from 25 datasets plus new in-house recordings. 2) Human experts write novel multiple-choice questions with clear rationales. 3) Organize tasks into a holistic framework: Spatial Construction and Motion Understanding (Perception), Planning, Prediction, and Cross-Video Reasoning (Memory Update and Multi-View Integration). 4) Evaluate many models consistently (exact-match accuracy) under two frame-sampling settings. Why it matters: It reveals a large, precise human–AI gap, showing where models fail (geometry, motion, long-horizon reasoning), and gives a roadmap for real progress. 🍞 Anchor: It’s like an obstacle course that checks balance, speed, memory, and problem-solving—not just one skill.
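To make the exam format concrete, here is a minimal Python sketch of what a single benchmark item could look like. The field names (item_id, clip_ids, options, rationale, and so on) are illustrative assumptions for this article, not the paper's actual data schema.

```python
from dataclasses import dataclass

# A minimal sketch of one benchmark item. Field names are illustrative; the
# real MMSI-Video-Bench release may store its questions differently.
@dataclass
class BenchmarkItem:
    item_id: str            # unique question identifier
    clip_ids: list[str]     # one clip, or several for cross-video questions
    task: str               # e.g. "perception/spatial_construction"
    question: str           # human-written question text, with timestamps
    options: dict[str, str] # option letter -> option text (4-6 choices)
    answer: str             # ground-truth letter, e.g. "C"
    rationale: str          # annotator's brief reasoning note (not shown to models)

example = BenchmarkItem(
    item_id="mu_0042",
    clip_ids=["room_tour_day1.mp4", "room_tour_day2.mp4"],
    task="cross_video/memory_update",
    question="Compared with the first recording, where is the chair now?",
    options={"A": "Near the window", "B": "Next to the table",
             "C": "In the hallway", "D": "It was removed"},
    answer="B",
    rationale="The second clip shows the chair moved beside the table.",
)
```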
Aha! Moment (one sentence): If we want trustworthy video assistants, we must test them with a complete, human-written exam that stresses real spatial reasoning, not shortcuts.
Multiple Analogies:
- Obstacle Course: The benchmark is a course with stations—seeing layout, tracking motion, choosing actions, and linking views—so models must pass every station, not just one.
- Detective Casebook: Each question is a case file; the model gathers clues across time and angles to reconstruct what happened and what will happen next.
- Map + Timeline Builder: The model must draw a map of the place (space) while keeping a timeline of actions (time), then use both to plan or predict.
Before vs After:
- Before: Tests were narrower (often single image or one skill), with templated questions that risk overfitting.
- After: MMSI-Video-Bench brings human-authored diversity, multi-video reasoning, and high-level planning/prediction, exposing the true gap to human performance.
Why It Works (intuition, no equations):
- Human-authored novelty avoids template tricks and forces real reasoning.
- Diverse sources require generalization, not memorization of one domain.
- Multi-level tasks expose where pipelines break: seeing, tracking, mapping, deciding, predicting, and cross-video memory.
- Exact-match multiple choice makes scoring unambiguous.
Building Blocks (smaller pieces):
- Data Pool: 1,278 clips from 25 datasets + in-house videos, spanning indoor scans, driving, sports, egocentric views, and more.
- Task Taxonomy: Perception (Spatial Construction + Motion Understanding), Planning, Prediction, Cross-Video (Memory Update + Multi-View Integration).
- Human Annotation Protocol: Experts design questions, answers, distractors, and rationales, then multi-stage peer review.
- Evaluation Settings: Uniform-50 (50 evenly spaced frames) and Sufficient-Coverage (all frames used by annotators).
- Error Analysis: Categorizes failures (geometric reasoning, grounding, ID mapping, prompt alignment, latent inference) to guide fixes.
- Domain Subsets: Indoor Scene Perception, Robot (manipulation, navigation), and Grounding (target and time localization) for focused evaluation.
🍞 Hook: Think of a friend giving you step-by-step bread-crumbs so you don’t get lost while exploring a new school.
🥬 The Concept (Perception → Spatial Construction): What it is: Spatial Construction is building a coherent map of where things are from partial, moving video views. How it works: 1) Notice landmarks and their relative positions. 2) Track the camera’s own motion. 3) Stitch observations over time into a global layout. 4) Use that layout to answer location questions. Why it matters: Without a solid map, planning and prediction collapse. 🍞 Anchor: Walking through a house tour video and then pointing out the kitchen’s location relative to the living room.
🍞 Hook: When you watch a skateboarder, you don’t just see the board—you feel the motion arc in your head.
🥬 The Concept (Perception → Motion Understanding): What it is: Motion Understanding tracks how the camera and objects move and interact over time. How it works: 1) Detect moving parts. 2) Keep their identities consistent. 3) Measure directions, turns, and interactions. 4) Summarize motion over long spans. Why it matters: Without it, you miss subtle moves (like a quick left turn) or lose track of who’s who. 🍞 Anchor: Counting how many left and right turns a car makes in a dashcam clip.
🍞 Hook: Planning a trip means more than reading a map—you must pick the route.
🥬 The Concept (Planning): What it is: Using video understanding to choose actions that reach a goal. How it works: 1) Read the goal. 2) Use the spatial map and motion cues. 3) Compare possible actions. 4) Pick the safest, most effective path. Why it matters: Seeing is not enough; helpers must decide. 🍞 Anchor: From your current hallway view, turn left then right to reach the sink.
🍞 Hook: Have you ever guessed where a thrown ball will land before it gets there? That’s prediction.
🥬 The Concept (Prediction): What it is: Inferring what comes next (or what would happen if…) from current video evidence and simple physical intuition. How it works: 1) Read current positions and velocities. 2) Apply priors (e.g., momentum, occlusion). 3) Consider conditions in the prompt. 4) Output the likely next state. Why it matters: Without prediction, assistants can’t anticipate hazards or outcomes. 🍞 Anchor: Guess the boxed pedestrian’s next step direction in the next frame.
🍞 Hook: Visiting the same park in the morning and at sunset, then realizing it’s the same place from different angles.
🥬 The Concept (Cross-Video Reasoning: Memory Update & Multi-View Integration): What it is: Remembering past observations and merging different viewpoints into one world model. How it works: 1) Store facts from earlier videos. 2) Update them when changes appear. 3) Match the same objects across angles. 4) Use the merged view to answer. Why it matters: Real life is not one continuous shot; you must link scattered glimpses. 🍞 Anchor: Two clips of a room hours apart; figure out what changed and where things are now.
03 Methodology
High-level Pipeline: Input video(s) → Frame sampling (Uniform-50 or Sufficient-Coverage) → Question + options + timestamps → Model inference → Exact-match accuracy. In parallel: Video pool → Human question design → Multi-stage review → Final benchmark assembly → Error analysis and sub-benchmarks.
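As a rough illustration of that pipeline, here is a minimal Python sketch of the evaluation loop. The helpers passed in (sample_frames, build_prompt, query_model) are hypothetical placeholders for whatever preprocessing and model API you use; the paper's actual code may differ.

```python
import re
from typing import Callable

def evaluate(items, sample_frames: Callable, build_prompt: Callable,
             query_model: Callable, setting: str = "uniform50") -> float:
    """Run the multiple-choice loop and return exact-match accuracy (0-1)."""
    correct = 0
    for item in items:
        frames = sample_frames(item.clip_ids, setting)              # Uniform-50 or Sufficient-Coverage
        prompt = build_prompt(item.question, item.options, frames)  # frames + timestamps + question + options
        reply = query_model(prompt)                                 # raw text from the model under test
        match = re.search(r"\b([A-F])\b", reply)                    # pull out the chosen option letter
        predicted = match.group(1) if match else None
        correct += int(predicted == item.answer)                    # strict exact match against the key
    return correct / max(len(items), 1)
```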
Step-by-step (what, why, example):
- Curate Diverse Videos
- What happens: Collect ~20k candidate clips from 25 datasets plus 140 in-house videos covering indoor scans, driving, sports, egocentric actions, and more; downsample and timestamp.
- Why it exists: Diversity forces true generalization and prevents overfitting to a single scene type.
- Example: A ScanNet room scan, a Waymo street drive, and an Ego4D kitchen task.
- Human-Authored Questions with Rationales
- What happens: Eleven 3D vision experts watch clips, then craft new multiple-choice questions (4–6 options) plus a brief reasoning note.
- Why it exists: Human creativity avoids template bias and ensures each item needs genuine video reasoning.
- Example: “From the spot at 1m20s, turn left 90°, walk straight, then right 90°—which object is closest now?”
- Strict Multi-Stage Review
- What happens: Cross-review ensures each question is clear, uniquely answerable, and challenging; 100% approval needed.
- Why it exists: Without this, ambiguous or too-easy items would blur results.
- Example: Reviewers reject a question if two choices could be right given the footage.
- Task Taxonomy and Four Levels
- What happens: Organize items into Perception (Spatial Construction + Motion Understanding), Planning, Prediction, and Cross-Video Reasoning (Memory Update + Multi-View Integration).
- Why it exists: A layered design reveals where models fail: seeing, tracking, mapping, deciding, predicting, or cross-connecting videos.
- Example: A Memory Update item asks what likely happened between two recordings.
- Frame Sampling Settings
- What happens: Two tracks: Uniform-50 (50 evenly spaced frames) and Sufficient-Coverage (all frames used by annotators).
- Why it exists: Models and APIs have different limits; comparing both reveals robustness to frame count and coverage.
- Example: A 2-minute clip might yield 50 frames spaced across the full duration in Uniform-50.
🍞 Hook: Skimming every tenth page in a book versus reading the exact pages your teacher used to write the quiz.
🥬 The Concept (Frame Sampling Strategy): What it is: Rules for which frames from a video the model sees. How it works: 1) Decide count (e.g., 50). 2) Choose distribution (uniform vs. consecutive). 3) Optionally pick “keyframes” by semantics (AKS). 4) Feed frames and timestamps to the model. Why it matters: The wrong frames can hide the needed evidence. 🍞 Anchor: If the left turn happens between two sampled frames, you’ll miss it.
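Here is a minimal sketch of the two sampling styles being compared, assuming Uniform-50 simply means 50 evenly spaced frame indices; the benchmark's exact sampling code may differ.

```python
import numpy as np

def uniform_indices(total_frames: int, k: int = 50) -> list[int]:
    """Pick k frame indices spread evenly across the whole video (Uniform-50 style)."""
    if total_frames <= k:
        return list(range(total_frames))  # short clip: keep every frame
    return np.linspace(0, total_frames - 1, num=k).round().astype(int).tolist()

def consecutive_chunk(total_frames: int, k: int = 50, start: int = 0) -> list[int]:
    """A contiguous block of k frames; it covers only a narrow slice of time."""
    return list(range(start, min(start + k, total_frames)))

# A 2-minute clip at 30 fps has 3,600 frames: Uniform-50 spans the whole duration,
# while a consecutive chunk sees less than two seconds of it.
print(uniform_indices(3600)[:5])    # [0, 73, 147, 220, 294]
print(consecutive_chunk(3600)[:5])  # [0, 1, 2, 3, 4]
```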
- Inference and Scoring
- What happens: Feed frames + question to each model; parse its choice; score by exact-match accuracy.
- Why it exists: Multiple-choice with exact match makes scoring clear and consistent across models.
- Example: If the correct letter is “C,” only “C” counts.
🍞 Hook: Like grading a multiple-choice test—only the bubbled answer matters, not a half-right explanation.
🥬 The Concept (Exact-Match Accuracy): What it is: A strict score that counts an answer as correct only if it exactly matches the key. How it works: 1) Extract the predicted option. 2) Compare to the ground truth. 3) Tally corrects/total. 4) Report percent accuracy. Why it matters: Prevents fuzzy scoring and keeps comparisons fair. 🍞 Anchor: Circle “B” when the answer is “B,” not “sort of B.”
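A tiny sketch of that scoring rule, with made-up predictions for illustration:

```python
def exact_match_accuracy(predicted, ground_truth) -> float:
    """Count an item as correct only when the chosen letter equals the answer key."""
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return 100.0 * correct / len(ground_truth)

# A missing or fuzzy answer ("sort of B") scores zero; only the exact letter counts.
print(exact_match_accuracy(["C", "B", None, "A"], ["C", "B", "D", "C"]))  # 50.0
```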
- Error Analysis
- What happens: Sample wrong cases; categorize mistakes as detailed grounding, ID mapping, geometric reasoning, prompt alignment, or latent logical inference.
- Why it exists: Knowing the “how” of failures points to real fixes, not guesswork.
- Example: A model confuses front-left and back-left (geometric reasoning error).
🍞 Hook: A coach replays the game and labels which plays failed and why.
🥬 The Concept (Error Analysis): What it is: Systematically studying mistakes to find patterns and root causes. How it works: 1) Gather wrong answers. 2) Label error type. 3) See which types cluster with which tasks. 4) Prioritize improvements. Why it matters: You can’t fix what you don’t understand. 🍞 Anchor: If most misses are geometry, you train geometry.
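A minimal sketch of how such labels could be tallied; the failure records below are made up for illustration and are not the paper's data.

```python
from collections import Counter

# Each record pairs the task a question came from with the labeled error type.
labelled_failures = [
    ("spatial_construction", "geometric_reasoning"),
    ("spatial_construction", "geometric_reasoning"),
    ("motion_understanding", "detailed_grounding"),
    ("planning", "prompt_alignment"),
    ("cross_video", "latent_logical_inference"),
]

by_type = Counter(error for _, error in labelled_failures)
by_task_and_type = Counter(labelled_failures)  # which error types cluster with which tasks
print(by_type.most_common())                   # the dominant weakness comes out on top
```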
- Sub-Benchmarks for Focused Skills
- What happens: Create three focused subsets: Indoor Scene Perception (static/dynamic, instance- vs camera-centric), Robot (manipulation, navigation), and Grounding (target and temporal localization tasks that require spatial reasoning).
- Why it exists: Domain specialists can test just what they care about.
- Example: Navigation tasks stress planning in indoor layouts.
Special Task Mechanics (two Cross-Video subtypes):
🍞 Hook: Watching yesterday’s room tour and today’s updated clip—spot the change.
🥬 The Concept (Memory Update): What it is: Adjusting your mental map when the scene changes between recordings. How it works: 1) Recall prior layout. 2) Notice differences. 3) Update positions/objects. 4) Answer based on the latest truth. Why it matters: The world changes; models must adapt. 🍞 Anchor: A chair that wasn’t there before is now near the table.
🍞 Hook: Standing on a balcony gives a different view than standing in the hallway, but it’s the same building.
🥬 The Concept (Multi-View Integration): What it is: Combining different viewpoints into one consistent scene representation. How it works: 1) Find correspondences across views. 2) Reconcile perspective differences. 3) Merge into a shared map. 4) Use it to reason. Why it matters: No single camera sees everything. 🍞 Anchor: Matching a whiteboard seen from behind to the same whiteboard seen front-on.
What’s Clever (the “secret sauce”):
- Full human authorship + rationales for clarity and novelty.
- A four-level structure that covers seeing, deciding, predicting, and cross-video stitching.
- Diverse, realistic data so models can’t memorize.
- Diagnostics (error types, frame sampling study) that turn scores into insights.
- Domain sub-benchmarks that plug right into practical niches (indoors, robots, grounding).
04 Experiments & Results
The Test: 25 strong models (open and proprietary) are evaluated on 1,106 multiple-choice questions over 1,278 clips, using exact-match accuracy. Two tracks control input frames: Uniform-50 (50 evenly spaced frames) and Sufficient-Coverage (the full set used by annotators). Baselines include random guessing and human performance.
The Competition: Models include GPT-5/O3/O4-mini/GPT-4o, Gemini 3 Pro/Gemini 2.5 Flash, Claude Haiku 4.5, Seed-1.6-vision, Doubao, InternVL, QwenVL, LLaVA-Video, and more. Several spatially fine-tuned or architecture-modified models (e.g., SpaceQwen, Spatial-MLLM, VLM3R) are also tested.
Scoreboard with Context:
- Human performance is 96.4%—an A+.
- The best model, Gemini 3 Pro, reaches about 38.0%—more like a D when humans ace the test, a gap near 58.4 percentage points.
- Strong open-source models (e.g., QwenVL2.5-72B) hover near 32–33% on average, lagging behind proprietary ones.
- Prediction is the toughest main category; among Spatial Construction subtypes, camera-to-instance spatial relations are the hardest—like mixing perspective and precise grounding.
- Surprisingly, Sufficient-Coverage often does not beat Uniform-50; more frames can add noise without adding usable evidence.
Frame Sampling Study:
- Uniform sampling across the whole video clearly beats taking short consecutive chunks. Broad temporal coverage is crucial to catch key events.
- Adaptive Keyframe Sampling (AKS), which helps on other benchmarks, underperforms here (e.g., GPT-4o drops from 31.6 to 28.4). Spatial reasoning needs more than semantic similarity—it needs geometry and continuity cues.
Add-ons That Didn’t Help (here):
- 3D spatial cues via VGGT reconstructions gave negligible gains (<1%); failures in complex scenes and weak utilization by models limit benefits.
- Chain-of-thought prompting (“think step by step”) did not consistently help; the bottleneck seems to be core spatial reasoning, not just missing steps.
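For reference, here is a minimal sketch of the two prompting styles being contrasted; the wording is illustrative, not the paper's actual prompts.

```python
def direct_prompt(question: str, options: dict[str, str]) -> str:
    """Plain multiple-choice prompt: answer with the option letter."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return f"{question}\n{choices}\nAnswer with the option letter only."

def cot_prompt(question: str, options: dict[str, str]) -> str:
    """Chain-of-thought variant: ask for step-by-step reasoning before the letter."""
    choices = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    return (f"{question}\n{choices}\n"
            "Think step by step about the spatial layout and motion in the frames, "
            "then end your reply with the final option letter.")
```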
🍞 Hook: Like a coach diagnosing why the team keeps losing certain plays.
🥬 The Concept (Error Analysis—patterns found): What it is: A labeled breakdown of why models miss answers. How it works: 1) Inspect wrong answers. 2) Tag errors: detailed grounding, ID mapping, geometric reasoning, prompt alignment, latent inference. 3) Compare across tasks. 4) Find dominant weaknesses. Why it matters: It shows where to invest effort (e.g., geometry). 🍞 Anchor: Many misses come from mixing up front-left vs. back-left or losing track of an object during a fast move.
Main Findings from Error Analysis:
- Geometric Reasoning Errors dominate Spatial Construction.
- Detailed Grounding Errors are common in Motion Understanding (fast, subtle, or long motions are hard).
- Prompt Alignment Errors often break Planning and Prediction (the model sees the video but mismatches the question’s conditions).
- Cross-Video Reasoning suffers from Latent Logical Inference and Grounding across views/times (matching is brittle).
Focused Sub-benchmarks:
- Indoor Scene Perception: GPT-5 and Gemini 2.5 Flash perform best across subtypes; camera-centric reasoning is a frequent bottleneck for weaker models.
- Robot Bench: Gemini 3 Pro leads overall; navigation shows larger gaps among models, exposing practical weaknesses.
- Grounding Bench: Gemini 2.5 Flash leads overall, especially in temporal localization; O4-mini is also strong in timing tasks.
Bottom line: The benchmark exposes a wide human–AI gap and points to geometry, long-horizon motion grounding, and cross-video stitching as the main hurdles.
05 Discussion & Limitations
Limitations (honest and specific):
- Multiple-choice format ensures clear scoring but can’t assess free-form explanations or planning outputs that require action sequences.
- The benchmark covers many domains yet cannot include every real-world setting (e.g., audio cues, extreme weather, very long streaming videos).
- Sufficient-Coverage relies on annotator-used frames; if future tasks need different keyframes, this setting might still miss some evidence.
- 3D reconstructions (VGGT) are fallible in complex, dynamic scenes; improving these tools is separate from improving the models.
Required Resources:
- To evaluate large open-source models, you need multiple high-memory GPUs; proprietary models require API access and budget.
- Preprocessing videos, running multiple sampling strategies, and logging error analyses add compute and storage costs.
When NOT to Use:
- If you need open-ended, multi-step action planning outputs (e.g., long text plans or code), this multiple-choice format won’t capture the full behavior.
- If your model relies on audio or sensor streams beyond RGB frames, the current setup doesn’t evaluate those channels.
- If you only care about short, single-view clips with trivial layout, a simpler benchmark might suffice.
Open Questions:
- How to build models with robust, explicit geometry that generalizes beyond one dataset and integrates naturally with vision-language reasoning?
- Which frame sampling strategies can target evidence frames for spatial reasoning (not just semantic keyframes) and scale to long videos?
- How should models store and update memory across separate videos, while reliably matching objects and viewpoints?
- What training curricula and synthetic data can build long-horizon motion grounding without overfitting to templates?
🍞 Hook: Like trying to use shadows and floor plans together when your flashlight is weak.
🥬 The Concept (3D Spatial Cues—why they didn’t help here): What it is: Extra images rendered from a 3D reconstruction to provide depth/geometry hints. How it works: 1) Reconstruct 3D from video frames. 2) Render top/down and side views. 3) Feed these along with frames. 4) Prompt model to use them. Why it matters: In theory, it should boost geometry. 🍞 Anchor: If the 3D map is noisy or the model ignores it, you don’t get better directions.
🍞 Hook: A teacher says “show your work,” but if you don’t understand fractions, showing steps won’t fix it.
🥬 The Concept (Chain-of-Thought Prompting—limits): What it is: Asking the model to reason step by step. How it works: 1) Parse the prompt and conditions. 2) Gather video evidence. 3) Combine and infer. 4) Choose an answer. Why it matters: Steps can help organization, but not replace missing skills. 🍞 Anchor: Writing neat steps won’t solve a geometry puzzle if you never learned angles.
06 Conclusion & Future Work
Three-Sentence Summary: MMSI-Video-Bench is a fully human-annotated, holistic benchmark for video-based spatial intelligence that tests perception, planning, prediction, and cross-video reasoning across 1,106 questions from 1,278 clips. Evaluations of 25 strong models reveal a massive human–AI gap (best model ~38% vs. humans at ~96.4%), with stubborn failures in geometry, motion grounding, long-horizon prediction, and cross-video correspondence. Popular add-ons like more frames, keyframe sampling (AKS), 3D cues, or chain-of-thought prompting don’t reliably help, emphasizing the need for deeper model advances.
Main Achievement: The paper delivers a realistic, challenging, and diagnostically rich testbed that finally measures the full stack of spatial intelligence needed for real-world video assistants.
Future Directions: Build models with explicit, generalizable geometric representations; design sampling that targets spatial evidence rather than semantics alone; develop durable memory and cross-view matching; craft training curricula that improve long-horizon motion grounding and prompt-evidence alignment without overfitting.
Why Remember This: MMSI-Video-Bench raises the bar from “seeing pictures” to “understanding space and time” in videos, provides clear signals about what’s broken, and sets a practical path for turning today’s video-language models into tomorrow’s safe, capable assistants.
Practical Applications
- Evaluate and compare robot assistants on room layout understanding, navigation, and manipulation before deployment.
- Stress-test AR wayfinding features (e.g., in malls, airports) on planning and cross-view reasoning tasks.
- Diagnose model weaknesses (geometry vs. motion vs. prompt alignment) and design targeted training curricula.
- Benchmark sampling strategies to choose the best frame selection policy for long videos in your pipeline.
- Validate whether a spatially fine-tuned model actually generalizes beyond its training datasets.
- Select models for safety-critical prediction tasks (e.g., anticipating pedestrian motion) using category-wise results.
- Use sub-benchmarks to specialize: indoor scene perception for smart home devices, robot bench for embodied AI, grounding bench for video retrieval and localization.
- Create unit tests for product features (e.g., camera-to-object relations) by mirroring the hardest MMSI-Video-Bench subtypes.
- Guide data collection (what scenes, motions, and views to capture) to close observed failure modes.
- Track progress across model updates with exact-match accuracy, ensuring real gains rather than regressions.