SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Key Summary
- SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what's in front, behind, bigger, or reachable.
- It contains 1,400 image questions across six skills: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subtypes (30 total).
- Models did much worse than humans: in multiple choice, the best model scored 54.93% vs. humans at 87.57%; in open-ended answers, the best model scored 40.93% vs. humans at 64.93%.
- Open-ended questions were 10-25 percentage points harder for models than multiple choice, showing that picking from options can overestimate true understanding.
- Hard spots for models were occlusion, consistent scale judgments, planning paths, and multi-step 3D reasoning.
- Tricks like chain-of-thought and multi-agent pipelines helped a bit for orientation but sometimes made other areas worse without better visual grounding.
- Supervised fine-tuning helped multiple-choice scores but sometimes hurt open-ended answers, suggesting overfitting to options instead of real understanding.
- SpatiaLab uses real, cluttered photos with many objects, shadows, reflections, and tricky angles to mirror everyday scenes.
- An LLM judge aligned well with humans for scoring open-ended answers (Cohen's kappa around 0.74 overall), making large-scale evaluation practical.
- By revealing where models fail under real conditions, SpatiaLab guides future research toward geometry- and physics-aware VLMs that act safely in the real world.
Why This Research Matters
- Safer robots: A household robot that truly understands occlusion and scale won't knock over glassware or misjudge what fits where.
- Better navigation aids: AR wayfinding and assistive technologies can guide people through cluttered spaces more reliably when models reason about paths and obstacles well.
- Smarter vehicles: Recognizing reflections, partial views, and depth quickly can improve situational awareness for autonomous systems.
- More helpful apps: Educational and accessibility tools that explain or summarize scenes can give clearer, more accurate spatial descriptions.
- Stronger AR/VR: Overlays that respect depth, size, and reflections look natural and trustworthy.
- Research acceleration: A reliable, realistic benchmark pinpoints exactly what to fix in next-gen models.
- Everyday utility: From furniture planning to packing, spatially aware AI makes common tasks easier and safer.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine playing hide-and-seek in a messy living room. You know the couch is in front of the bookshelf, the lamp is behind a plant, and you can guess which path gets you to the kitchen fastest without bumping your knees. That's your brain doing spatial reasoning. The Concept: Spatial reasoning is your mind's way of figuring out where things are, how big they seem, what blocks what, which way things face, and how to move through a space. How it works (in people): 1) Notice clues like size, shadows, and overlaps. 2) Build a mental map (what's left/right, in front/behind). 3) Predict what happens if you move or if objects move. Why it matters: Without it, we'd constantly knock things over, get lost, or misjudge sizes. Anchor: You decide the safest route to the fridge at night without stepping on the dog's toy, because you can reason about space in the dark using memory and clues.
Hook: You know how a teacher can show a picture and ask you to explain it in words? The Concept: Vision-language models (VLMs) are AIs that look at pictures and talk about them. How it works: 1) See the image and detect objects. 2) Read the question in words. 3) Combine picture clues with the question to answer. Why it matters: Without strong spatial sense, a VLM might say a cup is in front of a plate when it's actually behind it. Anchor: If you ask a VLM "Which cat is behind the curtain?", it must connect the words to the picture's hiding-and-overlap clues to answer correctly.
The World Before: VLMs got good at naming objects ("a red car") and basic Q&A ("What color is the ball?"). But real homes and streets are messy: many objects overlap, lighting changes, reflections trick the eye, and sizes look different at varying distances. Benchmarks often used clean, synthetic scenes (like toys on simple backgrounds) or puzzle-like, templated questions. These made tests easier and let models pass by spotting patterns instead of truly understanding space.
The Problem: As models moved into real tasks (robots picking items, apps guiding AR overlays, or cars planning a path), they stumbled on deeper spatial challenges: inferring what's behind something (occlusion), keeping size judgments consistent across viewpoints, and planning safe routes with partial visibility.
Failed Attempts: Earlier datasets tested narrow, simple relationships or reused AI-generated scenes with low clutter. Some focused only on one thing (like left/right), others only on indoors, or included very few samples. Models that looked great on these tests often failed when scenes were messy or questions required multi-step reasoning.
The Gap: We lacked a single, realistic benchmark that: 1) covers all major spatial skills (position, depth/occlusion, orientation, size/scale, navigation, and 3D geometry), 2) offers both multiple-choice and open-ended answers, 3) uses real-world, diverse, cluttered images, and 4) is large enough for reliable comparisons with humans.
Hook: Think of a coach testing all parts of a swimmer's ability (kicking, breathing, strokes), not just floating. The Concept: SpatiaLab is a complete test for spatial reasoning "in the wild." What it is: A 1,400-question benchmark built from varied real photos, spanning six skills and 30 sub-skills, with both multiple-choice and open-ended questions, plus human baselines. How it works: 1) Collect diverse images (web and manual snaps). 2) Write precise spatial questions. 3) Review carefully for clarity and fairness. 4) Test many models and humans, compare results, and analyze errors. Why it matters: Without such a broad, realistic test, we can't see true weaknesses or guide improvements. Anchor: Instead of just asking "Is the ball left of the box?" in a simple cartoon, SpatiaLab asks questions like "Which mug blocks the view of the small spoon?" in a real kitchen photo with reflections and shadows.
Real Stakes: • Safety: A robot that misjudges occlusion can knock over fragile items. • Navigation: An assistive device must find clear paths in cluttered rooms. • AR/VR: Apps must overlay arrows and labels that respect depth and reflections. • Education and accessibility: Clear spatial language helps tutoring and assistive tools. SpatiaLab shows where today's AIs still slip, so we can build ones that see the world more like we do.
02 Core Idea
Hook: You know how a school report card covers math, reading, science, and more, so you get the full story? The Concept: The key idea is to test spatial reasoning the way the real world does: across many skills, with messy photos, and with both multiple-choice and open-ended questions. How it works: 1) Organize all important spatial skills into six big categories and 30 sub-skills. 2) Use real, cluttered images that include shadows, reflections, and partial views. 3) Ask both pick-an-option and write-your-own-answer questions. 4) Compare many models to human scores and study errors deeply. Why it matters: Without testing the full picture, a model can look smart on one trick while failing at others that matter in real life. Anchor: SpatiaLab is like a true obstacle course for VLMs, not just a flat track.
Aha! Moment in one sentence: If we evaluate VLMs on realistic, diverse, and comprehensive spatial tasks, with both multiple-choice and open-ended formats, we reveal the real gaps between model pattern-matching and human spatial understanding.
Multiple Analogies:
- Orchestra analogy: A model that only plays the violin (left/right) can't perform the whole symphony (depth, occlusion, navigation, 3D). SpatiaLab checks the full orchestra.
- Hiking analogy: It's not enough to read a map; you must notice cliffs (occlusion), estimate distances (scale), and pick safe paths (navigation). SpatiaLab measures all that.
- Sports analogy: Dribbling well (orientation) isn't enough; you must pass (position), block (occlusion), and run plays (navigation). SpatiaLab grades the whole team.
Before vs After: • Before: Benchmarks were narrow, synthetic, or templated. Models looked strong on easy, isolated skills. • After: SpatiaLab reveals that, in the wild, even top models fall far short of humans, especially in open-ended reasoning, so the community can aim at the right fixes (geometry, physics, grounding).
Why it works (intuition, not math): • Diversity breaks shortcuts: Real photos with reflections, shadows, and clutter stop models from coasting on simple patterns. • Dual formats expose truth: Multiple-choice can mask confusion (guessing, elimination). Open-ended forces models to generate grounded answers. • Fine-grained taxonomy diagnoses exactly which micro-skills (like betweenness or transparency) are brittle, guiding targeted improvements.
Building Blocks (with the Sandwich pattern for each key skill): Hook: You know how you say, "The book is to the left of the lamp"? Relative Positioning: What it is: Describing where things are (left/right/above/below/between/corner). How it works: 1) Identify reference objects. 2) Compare their locations. 3) Use consistent frames (like the viewer's left). Why it matters: Without it, directions and layout descriptions fall apart. Anchor: "Which object sits between the chair and the window?"
Hook: When someone stands in front of you, they block your view of the poster behind them. Depth & Occlusion: What it is: Knowing what's in front, behind, partially hidden, transparent, or reflected. How it works: 1) Use overlaps, size cues, shadows. 2) Handle glass/mirrors correctly. 3) Keep a layered picture of the scene. Why it matters: Without it, models mistake mirrors for windows or think hidden objects disappeared. Anchor: "What object has no reflection in this image?"
Hook: Imagine tilting a straw or rotating a toy car. Orientation: What it is: Understanding which way things face, rotate, or stack, and even tool handedness. How it works: 1) Detect object fronts and axes. 2) Track rotations and alignments. 3) Check if tools are left- or right-handed. Why it matters: Without it, a robot might grip the wrong way or place items unstably. Anchor: "Which straw tilts for a left-handed sip?"
Hook: A far truck can look smaller than a nearby bike, but you know the truck is bigger. Size & Scale: What it is: Comparing sizes while compensating for distance and perspective. How it works: 1) Estimate depth. 2) Adjust perceived size vs. true size. 3) Check if sizes make sense together. Why it matters: Without it, models think mice are larger than chairs in photos. Anchor: "Are these the smallest items of their kinds?"
Hook: Picture finding a path through chairs to reach a door without bumping them. Spatial Navigation: What it is: Reasoning about paths, obstacles, visibility from a viewpoint, and the order you'll meet things. How it works: 1) Check if a route exists. 2) Avoid blockers. 3) Track sequence and accessibility. Why it matters: Without it, path planning becomes unsafe. Anchor: "Does the ambulance have a clear path?"
Hook: Building with blocks: will the tower topple? 3D Geometry: What it is: Volumes, stability, containment, and shape projections. How it works: 1) Compare 3D sizes. 2) Predict balance under gravity. 3) Imagine silhouettes after rotation. Why it matters: Without it, models misjudge fits or topple stacks. Anchor: "Would the potted plant fit on the chair seat?"
Hook: In school, you answer both multiple-choice and short-answer questions. Dual-Format Evaluation (MCQ + Open-Ended): What it is: Two ways to test understanding: picking an option and explaining freely. How it works: 1) MCQ probes fine discrimination among distractors. 2) Open-ended checks compositional, grounded language. 3) Compare to human answers and analyze errors. Why it matters: Without open-ended testing, models can look better than they truly are. Anchor: The same scene is tested with both formats to uncover guessing vs. genuine reasoning.
03 Methodology
High-level Recipe: Image + Question → (A) Curate real, diverse scenes → (B) Write and review spatial questions → (C) Evaluate models in MCQ and open-ended formats → (D) Score vs. ground truth (LLM judge + human) → (E) Analyze errors and gaps.
Step A: Collecting Images from the Real World
- What happens: Images are gathered through web crawling, targeted searches, and manual snapshots in diverse indoor/outdoor scenes. They capture varied lighting (high/low contrast, shadows), textures (uniform/patterned/complex), materials (transparent, reflective, opaque), and spatial relations (stacked, scattered, aligned). Average stats: ~21.48 objects per image; ~11.88 partially visible; ~3.23 depth layers; 16.71% of questions require mental rotation; 56.07% require object specificity.
- Why this step exists: Synthetic or overly clean images let models cheat with shortcuts; diverse, messy photos force true spatial understanding.
- Example: A café photo with glass, mirrors, shadows, and many overlapping items challenges occlusion and reflection reasoning.
Step B: Writing and Reviewing Questions
- What happens: Trained annotators design questions across six main categories and 30 subcategories (e.g., layering order, betweenness, tool handedness). Each question is made in both multiple-choice (4 options) and open-ended forms. A three-tier review ensures clarity, correctness, and unambiguous mapping to subcategories; a gold set is finalized. (A sketch of one possible item representation follows this step.)
- Why this step exists: Precise wording and high-quality distractors prevent guessing and ensure each question really tests the intended skill.
- Example (Orientation, MCQ): "Which drinking straw tilts for a left-handed orientation? 1) Beige bubble tea straw 2) Green tea straw 3) Both 4) None." Answer: 1.
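To make the dual-format setup concrete, here is a minimal sketch of how one SpatiaLab-style item could be represented in code. The dataclass and field names are illustrative assumptions, not the benchmark's actual schema; only the example question and its four options come from the text above, and the image path is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpatialItem:
    """One benchmark question, carrying both formats (illustrative schema, not the official one)."""
    image_path: str          # real-world photo the question refers to
    category: str            # one of the six main categories, e.g. "Orientation"
    subcategory: str         # one of the 30 sub-skills, e.g. "tool handedness"
    question: str            # shared question stem
    options: list[str]       # four MCQ options
    answer_index: int        # 0-based index of the correct option
    open_ended_answer: str   # free-form gold answer used by the LLM judge

item = SpatialItem(
    image_path="images/example_straws.jpg",  # hypothetical path
    category="Orientation",
    subcategory="tool handedness",
    question="Which drinking straw tilts for a left-handed orientation?",
    options=["Beige bubble tea straw", "Green tea straw", "Both", "None"],
    answer_index=0,
    open_ended_answer="The beige bubble tea straw",
)
```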
Step C: Dual Evaluation Formats
- MCQ Evaluation: Models reply with an option number; accuracy is computed directly. Prompts fix temperature=0 to remove randomness. Why: this forces fine-grained recognition and reduces verbosity.
- Open-Ended Evaluation: Models generate free-form answers (temperature ≈ 0.7, up to ~1200 tokens). A separate LLM judge (Gemini-2.5-Flash) scores correctness against ground truth with strict instructions (output only 0 or 1). Why: this captures generative reasoning and checks whether models can explain or directly state correct answers without choices. (A code sketch of both calls follows this step.)
- Example (Navigation, open-ended): "The ambulance (orange-yellow at bottom) has a direct path to the cars ahead. True or false?" Ground truth: False.
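Below is a minimal sketch of the two evaluation calls, reusing the SpatialItem sketch from Step B and assuming an OpenAI-compatible chat client with vision input. The prompt wording, helper names, and client setup are assumptions, not the paper's harness; only the decoding settings (temperature 0 for MCQ, roughly 0.7 with up to ~1200 tokens for open-ended) follow the text.

```python
import base64
import re

from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint; swap in any VLM client

def _image_message(path: str, text: str) -> list[dict]:
    """Build a single user message containing the image (base64 data URL) and the question."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        {"type": "text", "text": text},
    ]}]

def ask_mcq(model: str, item: SpatialItem) -> bool:
    """MCQ format: deterministic decoding; the model replies with an option number."""
    prompt = item.question + "\n" + "\n".join(
        f"{i + 1}) {opt}" for i, opt in enumerate(item.options)
    ) + "\nReply with the option number only."
    reply = client.chat.completions.create(
        model=model, messages=_image_message(item.image_path, prompt), temperature=0
    ).choices[0].message.content
    match = re.search(r"\d+", reply or "")
    return match is not None and int(match.group()) - 1 == item.answer_index

def ask_open_ended(model: str, item: SpatialItem) -> str:
    """Open-ended format: free-form generation, judged separately against the gold answer."""
    reply = client.chat.completions.create(
        model=model,
        messages=_image_message(item.image_path, item.question),
        temperature=0.7,
        max_tokens=1200,
    ).choices[0].message.content
    return reply or ""
```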
Hook: Like having a teacher grade short answers for consistency with an answer key. LLM-as-a-Judge: What it is: An automatic scorer that compares a model's open-ended reply to the ground truth. How it works: 1) Provide the question, correct answer, and model answer. 2) Judge outputs 1 (correct) or 0 (incorrect). 3) Validated by human agreement stats (Cohen's kappa ≈ 0.738 overall; category kappas often substantial to near-perfect; Fleiss' kappa among humans ≈ 0.774). Why it matters: Enables large-scale, human-like scoring without hand-labeling every response. Anchor: For "Which object shows no reflection?", the judge checks if the model says "plate on the wall."
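A minimal sketch of the judging step, reusing the client from the previous sketch. The judge prompt wording and function name are assumptions; only the 0-or-1 output convention and the use of a Gemini-2.5-Flash-class judge come from the text.

```python
JUDGE_PROMPT = """You are grading an answer to a spatial-reasoning question.
Question: {question}
Ground-truth answer: {gold}
Model answer: {prediction}
Output only 1 if the model answer is correct, otherwise output only 0."""

def judge_open_ended(judge_model: str, question: str, gold: str, prediction: str) -> int:
    """Return 1 if the judge marks the free-form prediction correct, else 0."""
    reply = client.chat.completions.create(
        model=judge_model,  # e.g. a Gemini-2.5-Flash-compatible endpoint
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, prediction=prediction)}],
        temperature=0,
    ).choices[0].message.content
    return 1 if (reply or "").strip().startswith("1") else 0
```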
Step D: Models and Baselines
- Who's tested: 25+ models including proprietary (GPT-4o-mini, GPT-5-mini, Gemini-2/2.5, Claude 3.5 Haiku), open-source (InternVL3.5, Qwen2.5-VL, Llama-3.2 Vision, Gemma-3), reasoning variants (o3/o4, thinking modes), and spatial specialists (SpaceOm, SpaceThinker, SpaceQwen).
- Why: Broad coverage shows whether improvements come from scale, reasoning, or specialization.
Step E: Error Analysis Pipeline
- What happens: Break down performance per category and subcategory; identify recurring failure types like mislocalization, perspective mistakes, occlusion/order errors, attribute swaps, and ungrounded rationalization.
- Why this step exists: Pinpointing exact weaknesses tells researchers what to fix (e.g., geometry grounding, multi-cue fusion).
- Example: Many models confuse mirror reflections with see-through glass, causing wrong depth judgments.
Secret Sauce (What makes it clever):
- Realism: High object counts, partial visibility, and mixed materials push genuine spatial reasoning.
- Taxonomy: 30 sub-skills reveal precise failure modes (e.g., betweenness vs. corner positioning).
- Dual-format testing: Exposes when MCQ inflates apparent skill versus open-ended generation.
- Human baselines and LLM-judge validation: Anchor scores to reliable standards.
Interventions Tested (brief, with Sandwich mini-notes): Hook: Like showing your work in math. Chain-of-Thought (CoT): What it is: Prompting models to reason step by step. How it works: 1) Ask for steps. 2) Generate intermediate thoughts. 3) Produce an answer. Why it matters: Should help logic, but without good perception, steps repeat wrong assumptions. Anchor: CoT raised orientation slightly but often hurt depth/scale.
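As a concrete illustration, a chain-of-thought run can be as simple as appending a step-by-step instruction to the open-ended prompt. This sketch reuses the client and helpers from Step C; the exact wording of the instruction is an assumption, not the paper's prompt.

```python
COT_SUFFIX = (
    "\nThink step by step: list the relevant objects and their spatial cues "
    "(overlap, shadows, relative size), reason about the relation being asked, "
    "then state the final answer."
)

def ask_open_ended_cot(model: str, item: SpatialItem) -> str:
    """Open-ended call with an explicit step-by-step reasoning instruction appended."""
    reply = client.chat.completions.create(
        model=model,
        messages=_image_message(item.image_path, item.question + COT_SUFFIX),
        temperature=0.7,
        max_tokens=1200,
    ).choices[0].message.content
    return reply or ""
```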
Hook: Practicing with answer keys. Supervised Fine-Tuning (SFT): What it is: Training on part of SpatiaLab (40%) to improve. How it works: 1) Stratified sampling. 2) Tune parameters. 3) Evaluate held-out set. Why it matters: Boosted MCQ across all categories, but sometimes reduced open-ended performance (overfitting). Anchor: Navigation MCQ improved; open-ended size/scale sometimes dropped.
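For the SFT intervention, a typical lightweight setup takes a stratified 40% split of the benchmark and attaches LoRA adapters to an open-source VLM. The sketch below uses Hugging Face peft; the 40% fraction is the only detail taken from the text, while the LoRA hyperparameters, target modules, and the commented-out model-loading step are illustrative assumptions.

```python
import random
from collections import defaultdict

from peft import LoraConfig, get_peft_model  # get_peft_model is used in the commented step below

def stratified_split(items, train_frac=0.4, seed=0):
    """Sample 40% of items per subcategory for tuning; hold out the rest for evaluation."""
    rng = random.Random(seed)
    by_sub = defaultdict(list)
    for it in items:
        by_sub[it.subcategory].append(it)
    train, heldout = [], []
    for group in by_sub.values():
        rng.shuffle(group)
        cut = int(len(group) * train_frac)
        train.extend(group[:cut])
        heldout.extend(group[cut:])
    return train, heldout

# Illustrative adapter config; ranks and target modules are assumptions, not the paper's settings.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
# base_model = <load the open-source VLM with its usual transformers class>
# model = get_peft_model(base_model, lora_config)  # only adapter weights are trained during SFT
```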
Hook: A team dividing tasks (spotter, measurer, planner). Multi-agent Reasoning: What it is: Specialized agents for segmentation, attributes, relation maps, etc., combined to answer. How it works: 1) Break tasks. 2) Each agent reasons. 3) Merge into a structured representation. Why it matters: Helped orientation a lot, but hurt depth/nav when perception was shaky (errors snowball). Anchor: +36% in open-ended orientation; -24% in open-ended depth/occlusion.
04 Experiments & Results
The Test: We measured accuracy on 1,400 QAs across six categories, each with five subcategories. We evaluated both MCQ (option picking) and open-ended (free-form) answers, compared many model families, and included human baselines. We also analyzed agreement of the LLM judge with human raters (strong overall, κ ≈ 0.738).
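As a sanity check on the judge, agreement can be quantified with Cohen's kappa over paired 0/1 correctness labels. The snippet below uses scikit-learn's standard implementation; the label lists are placeholder values, not the study's data.

```python
from sklearn.metrics import cohen_kappa_score

# Paired 0/1 correctness labels for the same open-ended answers (placeholder values).
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(human_labels, judge_labels)
print(f"Judge vs. human agreement: kappa = {kappa:.3f}")  # the paper reports roughly 0.738 overall
```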
The Competition: Proprietary models (GPT-4o-mini, GPT-5-mini, Gemini-2/2.5), open-source (InternVL3.5, Qwen2.5-VL, Llama-3.2 Vision, Gemma-3), reasoning-tuned variants (o3/o4, thinking modes), and spatial specialists (SpaceOm/SpaceThinker/SpaceQwen) were tested.
Scoreboard (with context):
- MCQ Overall: Best: InternVL3.5-72B at 54.93%, with strong competitors like GPT-5-mini (54.29%) and o4-mini-medium (53.21%). Humans: 87.57%. Context: 54.93% is like barely passing when humans ace the test. Random guess is 25%.
- Open-Ended Overall: Best: GPT-5-mini at 40.93%. Others: o4-mini-medium (37.86%), InternVL3.5-72B (23.36%), Qwen-VL2.5-72B (24.64%). Humans: 64.93%. Context: The top model's score is far below human reliability, and often below "B" territory.
- Category patterns: Orientation and 3D Geometry are relatively better; Depth & Occlusion, Size & Scale, and especially Spatial Navigation are tough across the board.
Surprising Findings:
- Big Open-Ended Drop: Across 25 models, the MCQ-to-open-ended gap averages ~23 percentage points. Takeaway: Multiple choice can mask misunderstandings; generating the right answer is much harder.
- Specialization Isn't a Free Win: Spatial specialist models didn't dominate overall; they peaked in narrow spots but underperformed broadly. Takeaway: Real-world scenes demand integrated skills.
- Reasoning Helps... Sometimes: CoT and "thinking" modes helped orientation but often failed on depth/scale/navigation without stronger visual grounding.
- SFT Trade-off: Fine-tuning raised MCQ scores across all categories (e.g., +10.97% overall on Qwen-VL2.5-3B-Instruct), yet sometimes hurt open-ended answers, hinting at overfitting to option formats or forgetting linguistic priors.
- Agentic Pipelines Are Double-Edged: Multi-agent reasoning boosted orientation significantly (up to +36% open-ended) but worsened depth/occlusion and navigation when initial perception was shaky; error cascades can grow.
Concrete Category Snapshots:
- Orientation: Several models exceed 60% (MCQ) on sub-skills like stacking orientation; tool handedness also saw decent scores (e.g., 68% for GPT-5-mini in MCQ).
- 3D Geometry: Some models manage mid-50s MCQ on tasks like stability prediction, but open-ended results remain far from humans.
- Depth & Occlusion: Mixed; a few models hit ~60-66% (MCQ) on complete occlusion inference, yet open-ended performance lags, and reflections/transparency remain challenging.
- Spatial Navigation: Most models struggle, often under 50% in MCQ and even lower open-ended; planning with partial observability is a key bottleneck.
Take-home Message: On this realistic, comprehensive test, models that look solid in narrow settings reveal large, systematic gaps. Open-ended answers especially expose missing geometric grounding and fragile multi-step planning.
05 Discussion & Limitations
Limitations (honest view):
- 2D Visual Dependence: Many questions come from single images; full 3D or multi-view/video reasoning is limited, so true temporal occlusion dynamics aren't fully covered.
- Judge Sensitivity: While the LLM judge agrees well with humans overall, subtle phrasing or borderline cases can still cause small scoring errors.
- Domain Scope: Despite diverse scenes, some niches (e.g., underwater or extreme weather) are rare; long-horizon embodied tasks remain out of scope.
Required Resources: Using SpatiaLab well typically needs: 1) A VLM with vision input and instruction-following, 2) Enough compute to run both MCQ and long open-ended generations, 3) Optional fine-tuning resources (LoRA/adapters) if you plan to adapt models, and 4) Evaluation infrastructure (LLM-as-a-judge, human spot checks).
When NOT to Use:
- If your task doesn't rely on spatial reasoning (e.g., pure sentiment analysis).
- If you require continuous control or real-time robotics benchmarks with closed-loop feedback; SpatiaLab is diagnostic, not a control suite.
- If you need only synthetic, parameter-perfect 3D scenes; SpatiaLab intentionally embraces real-world messiness.
Open Questions:
- Architectural Grounding: What components best encode 3D geometry, depth layering, and physical constraints in VLMs?
- Data Signals: Which training signals (embodied interaction, action-conditioned feedback, physics simulation) most reliably close the gap?
- Robust CoT: How can step-by-step reasoning be made perception-aware instead of amplifying wrong priors?
- Generalization: How to prevent overfitting to MCQ while improving open-ended grounding?
- Multimodal Memory: Can models maintain consistent reference frames across views and time to support navigation and occlusion inference?
06 Conclusion & Future Work
3-Sentence Summary: SpatiaLab is a realistic, broad benchmark that tests six core kinds of spatial reasoning with 1,400 QAs, using both multiple-choice and open-ended formats. On this test, state-of-the-art VLMs fall far short of humans, especially on open-ended questions, revealing weak geometric grounding, fragile occlusion handling, and brittle navigation. By diagnosing precise failure modes and validating scalable evaluation, SpatiaLab points the way to models that truly understand space.
Main Achievement: Establishing a comprehensive, real-world spatial reasoning benchmark, complete with a rich taxonomy, dual-format testing, human baselines, and rigorous error analysis, that exposes where today's VLMs differ most from human spatial cognition.
Future Directions: Integrate geometry- and physics-aware architectures; train with embodied, action-conditioned signals; develop perception-aware chain-of-thought; and expand to multi-view/video for temporal occlusion and long-horizon planning. Explore robust alignment methods that lift open-ended grounding without overfitting MCQ formats.
Why Remember This: Spatial understanding is the bridge from "seeing" to "safely doing." SpatiaLab shows that present-day VLMs still miss that bridge in the wild, and it provides the map to build it: realistic data, fine-grained skills, dual evaluations, and actionable diagnostics.
Practical Applications
- Robot assistants that choose stable grasps and place items without toppling stacks.
- AR navigation that plans clear indoor routes while respecting occlusions and obstacles.
- Safety checks in warehouses to verify stack stability and object accessibility.
- Photo-analysis tools that accurately describe scenes, including what's hidden or reflected.
- Smart packing planners that judge fit and volume realistically despite perspective tricks.
- Assistive apps that tell users what lies between two landmarks and which paths are blocked.
- Education tools that teach 3D geometry with real-scene examples of shadows and projections.
- Home layout advisors that ensure scale consistency (e.g., couch fits the corner) from photos.
- Robust inspection systems that detect misalignment or improper stacking in factories.
- Simulation and game engines that evaluate AI agents' spatial planning under realistic clutter.