
SpatialTree: How Spatial Abilities Branch Out in MLLMs

Intermediate
Yuxi Xiao, Longfei Li, Shen Yan et al. · 12/23/2025
arXiv · PDF

Key Summary

  • SpatialTree is a new, four-level "ability tree" that tests how multimodal AI models (that see and read) handle space: from basic seeing to acting in the world.
  • The four levels are L1 Perception, L2 Mental Mapping, L3 Mental Simulation, and L4 Agentic Competence, inspired by how humans develop spatial skills.
  • The benchmark unifies 27 sub-abilities and evaluates many popular models with multiple-choice, numeric scores, and an AI judge for open answers.
  • Results show low-level perception skills are mostly independent, but higher-level skills are strongly linked and rely on the basics.
  • Targeted fine-tuning on simple skills like distance can hurt nearby L1 skills but surprisingly helps tougher, higher-level tasks like planning and manipulation.
  • Naive "think more" reinforcement learning boosts complex reasoning but harms snap-judgment perception; an "auto-think" strategy fixes this by thinking only when needed.
  • Gemini 3 Flash leads overall on SpatialTree-Bench (57.8), and Qwen3VL-235B leads open-source models (40.0).
  • A unified action space maps language to camera moves and robot-like actions, letting models be tested on navigation and manipulation.
  • SpatialTree offers a practical roadmap for growing spatial intelligence in AI step-by-step, not just piling on random tasks.

Why This Research Matters

Spatial intelligence powers robots that can safely navigate homes, assistants that can understand your room through a camera, and AR apps that guide you step-by-step. SpatialTree shows which basic skills to build first and how they combine to enable complex actions, saving training time and compute. It also introduces a practical auto-think strategy so models know when to respond quickly and when to reason carefully, improving both speed and accuracy. With a unified action space, it becomes possible to compare different models fairly on real tasks like moving and manipulating. This roadmap helps researchers, companies, and educators grow AI that can see, think, and act more like us.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how you first learn to notice shapes and sizes, then later you can draw a map from memory, and eventually you can plan a route and actually walk it? Our brains build spatial skills in steps.

🥬 The Concept (Spatial Intelligence): Spatial intelligence is the ability to understand where things are, how they move, and how to act in space.

  • How it works: (1) See basic stuff like size and distance; (2) Link what you see to words and memories; (3) Imagine what might happen next; (4) Use all that to take actions.
  • Why it matters: Without it, you can’t find your backpack, drive to school, or play soccer without bumping into people.

🍞 Anchor: When you ask a robot to grab a cup, it must see the cup (perception), describe its position (mapping), predict how it will tilt if picked (simulation), and then actually grab it (agentic action).

The world before this paper: Multimodal large language models (MLLMs) learned lots of individual spatial tricks—like pointing to an object or estimating which thing is closer. Researchers built many task-based tests (one task here, another task there). But this felt like a messy toolbox: useful tools with no clear order. We didn’t know how the skills fit together. Were they separate? Did they build on each other? Could learning one help another?

The problem: Because the tests were task-centric and scattered, we couldn’t see the bigger picture—the hidden structure of spatial skills in MLLMs. We lacked a way to tell which skills are foundational and which depend on others. This made it hard to train models efficiently. Should we spend more time teaching depth estimation, or invest in planning? Would teaching shape help with robot grasping? No one knew.

Failed attempts:

  • Pile-on benchmarks: People added more tasks, hoping coverage would equal understanding. It didn’t—results stayed fragmented.
  • Single-skill training: Fine-tuning one low-level skill often helped that skill but didn’t reliably lift others; sometimes it even hurt nearby skills.
  • Reason-more-everywhere: Pushing models to "think more" via reinforcement learning boosted hard reasoning but weirdly damaged snap-judgment perception (like quick estimates).

The gap: We needed a capability-centric, hierarchical benchmark—something organized like a tree that shows how simple skills grow into complex ones, and how training at one branch transfers (or interferes) with others. We also needed a better training strategy that knows when to think hard and when to act intuitively.

🍞 Hook (Hierarchical Benchmark): Imagine a video game with levels: you start easy (seeing), unlock maps (understanding), earn strategy powers (simulation), and finally control a character (agentic action).

🥬 The Concept (Hierarchical Benchmark): A hierarchical benchmark tests abilities from simple to complex, showing dependencies across levels.

  • How it works: (1) Define levels; (2) Build tasks for each; (3) Measure results; (4) Analyze how skills depend and transfer.
  • Why it matters: Without it, we can’t tell which skills are building blocks vs. late-stage outcomes.

🍞 Anchor: Like school math: addition before multiplication before algebra. Test in order to learn the structure.

Real stakes:

  • Safer robots: Grasping, stacking, and navigating rooms require solid basics before long plans.
  • Smarter assistants: AR glasses or home helpers must mix quick perception with careful reasoning.
  • Better training: Knowing which skill transfers can save time and compute.
  • Fair evaluation: Companies and researchers need a common yardstick for spatial intelligence.

This paper builds SpatialTree, a cognitive-science-inspired ability tree with four levels—Perception (L1), Mental Mapping (L2), Mental Simulation (L3), and Agentic Competence (L4)—plus a benchmark that covers 27 sub-abilities. It reveals a clean structure: low-level skills are mostly independent, while higher-level ones are tightly connected. It shows surprising training dynamics: single low-level fine-tuning can harm neighbors but helps higher levels; naive "think more" reinforcement learning helps planning but hurts perception. Finally, it proposes auto-think: encourage reasoning only when useful, suppress it when intuition is best, improving performance across the whole tree.

02 Core Idea

🍞 Hook: Imagine planting a tree. Roots drink water, the trunk supports, branches reach, and leaves collect sunlight. If roots are weak, the whole tree suffers; if branches coordinate, the tree thrives.

🥬 The Concept (SpatialTree Taxonomy): SpatialTree is a four-level "ability tree" that organizes spatial skills from basic seeing to acting, then measures them consistently.

  • How it works: (1) Define levels: L1 Perception, L2 Mental Mapping, L3 Mental Simulation, L4 Agentic Competence; (2) Curate tasks and metrics for each; (3) Evaluate many MLLMs; (4) Analyze correlations and training transfer; (5) Improve with targeted fine-tuning and a balanced, auto-think RL strategy.
  • Why it matters: Without a map of abilities, we waste training on the wrong things and can’t explain why models fail in the real world.

🍞 Anchor: A model that can perfectly describe a room (L2) but can’t estimate distance (L1) will still fail to plan a safe path (L3) or walk there (L4).

The “Aha!” moment in one sentence: Spatial abilities aren’t just a pile of tasks—they form a structured ladder where independent low-level senses feed strongly coupled higher-level reasoning and action, so training should respect that ladder and switch between fast intuition and slow thinking when appropriate.

Three analogies:

  1. School: You learn to read letters (L1), then summarize stories (L2), then predict endings (L3), then write your own story (L4).
  2. Sports: First balance and coordination (L1), then playbook language (L2), then game strategy (L3), then in-game execution (L4).
  3. Maps: Notice landmarks (L1), draw a labeled map (L2), simulate routes (L3), then actually drive there (L4).

Before vs. After:

  • Before: Disconnected tasks, unclear dependencies, and training choices made largely by guesswork; more thinking always seemed better.
  • After: A clear hierarchy; we know which basics to build first, where transfer happens, and when to think vs. trust perception.

Why it works (intuition):

  • Perception is like fast reflexes—independent senses (distance, shape, motion) don’t always help each other directly.
  • High-level reasoning and action weave many signals together, so their performances rise or fall together.
  • For training, slow careful thinking aids complex plans, but it can muddy quick perceptual judgments; switching modes (auto-think) matches the task’s cognitive need.

Building blocks (with sandwiches):

🍞 Hook: You know how your eyes quickly tell you which apple is bigger or closer without doing math? 🥬 The Concept (L1 Perception): L1 is raw seeing—size, distance, shape, motion, orientation, relations, and localization.

  • How it works: (1) Read visual cues; (2) Form fast judgments; (3) No long reasoning, just accurate signals.
  • Why it matters: If basics are wrong, everything built on top wobbles. 🍞 Anchor: Picking up a cup requires knowing where it is and how big it is right away.

🍞 Hook: Imagine telling a friend, “The red chair is left of the table” so they can picture the room. 🥬 The Concept (L2 Mental Mapping): L2 links perception to language and memory.

  • How it works: (1) Describe scenes; (2) Understand relations and affordances; (3) Build and query a mental map over time.
  • Why it matters: Without language-aligned memory, the model forgets where things were. 🍞 Anchor: "The exit is behind the couch; turn right at the lamp"—that’s mapping.

🍞 Hook: Think of playing chess in your head—imagining moves before touching a piece. 🥬 The Concept (L3 Mental Simulation): L3 is internal "what-if" thinking: causal reasoning and sequential planning.

  • How it works: (1) Predict dynamics; (2) Solve spatial puzzles; (3) Chain steps into a plan.
  • Why it matters: Actions should be safe and efficient before you try them. 🍞 Anchor: "First go to the door, then turn right, then grab the handle."

🍞 Hook: Picture a game character you control to reach a flag—seeing, planning, and moving all together. 🥬 The Concept (L4 Agentic Competence): L4 is executing actions based on perception, language, and reasoning.

  • How it works: (1) Observe; (2) Update memory and plan; (3) Output actions in a unified action space (move, look, grab).
  • Why it matters: This is where AI becomes useful in the world. 🍞 Anchor: From two photos (start and goal), the model outputs step-by-step moves to navigate there.
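To tie the four levels above together, here is a minimal sketch of the hierarchy as plain Python data. The level names follow the article; the sub-ability names and grouping are illustrative and do not reproduce the paper's exact list of 27 sub-abilities.

```python
# A minimal sketch of the SpatialTree hierarchy as plain data.
# Sub-ability names follow the descriptions above; the paper's exact
# grouping of 27 sub-abilities may differ (assumption).
SPATIAL_TREE = {
    "L1_Perception": [
        "distance", "size", "shape", "motion", "orientation",
        "relation", "localization",
    ],
    "L2_Mental_Mapping": [
        "spatial_captioning", "relations", "motion_semantics",
        "perspective_taking", "affordance", "memory_retrieval",
    ],
    "L3_Mental_Simulation": ["causal_reasoning", "sequential_planning"],
    "L4_Agentic_Competence": ["navigation", "manipulation"],
}

def level_of(ability: str) -> str:
    """Return the level an ability belongs to, or raise if unknown."""
    for level, abilities in SPATIAL_TREE.items():
        if ability in abilities:
            return level
    raise KeyError(f"unknown ability: {ability}")

print(level_of("sequential_planning"))  # -> L3_Mental_Simulation
```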

Training ideas (with sandwiches):

🍞 Hook: Like practicing scales on piano before playing a concerto. 🥬 The Concept (Supervised Fine-Tuning): SFT is studying from labeled examples to sharpen specific skills.

  • How it works: (1) Collect targeted Q&A (e.g., distance); (2) Train the model; (3) Test for transfer.
  • Why it matters: It’s a controlled way to boost foundations. 🍞 Anchor: Train on distance questions; later, the robot manipulates objects more precisely.

🍞 Hook: Like a puppy getting treats for good behavior. 🥬 The Concept (Reinforcement Learning): RL rewards good actions or answers to shape better policies.

  • How it works: (1) Define rewards; (2) Let model try; (3) Update to do better next time.
  • Why it matters: It fine-tunes decision-making beyond static examples. 🍞 Anchor: Reward for correct multi-step navigation plans leads to better route choices.

🍞 Hook: Playing piano helps you learn organ faster. 🥬 The Concept (Cross-level Transfer): Skills learned at one level help other, higher levels.

  • How it works: (1) Boost a low-level sense; (2) See gains appear in reasoning/action; (3) But beware: neighbors at the same level can compete.
  • Why it matters: It tells us what to prioritize. 🍞 Anchor: Better distance sense improves robot arm control.

🍞 Hook: Sometimes you should trust your gut; sometimes you should think it through. 🥬 The Concept (Auto-think Strategy): Auto-think tells the model when to think hard and when to answer fast.

  • How it works: (1) For perception, discourage long chains-of-thought; (2) For complex planning, encourage detailed reasoning; (3) Adjust rewards accordingly.
  • Why it matters: Overthinking hurts intuition; underthinking hurts planning. 🍞 Anchor: Estimate “which is closer?” quickly; plan a five-step route carefully.

03 Methodology

At a high level: Inputs (images/videos/multi-view) → Level-specific Data Engines (L1–L4) → QA Templates + LLM Rephrase → SpatialTree-Bench → Training Interventions (SFT, RL with auto-think) → Output: Measured abilities and improved models.

Step-by-step (with sandwiches for key ideas):

  1. Curate capability-aligned data (L1–L4)
  • What happens: The team reorganizes many datasets into the four levels and fills gaps using a "SpatialEngine" that combines expert vision models (for depth, correspondences, gravity, orientation, tracking) and 3D reconstruction. They generate multiple Q&A formats per problem to avoid template overfitting.
  • Why it exists: Task piles are scattered and inconsistent; this pipeline standardizes them by ability level.
  • Example: Use DepthAnything3 to estimate depth, OrientAnything for poses, then generate multi-choice or numeric questions like “which point is closer?” or “what’s the camera roll?”

🍞 Hook: Imagine labeling a photo album so you can search by moments, people, and places. 🥬 The Concept (Capability-centric Benchmarking): It builds tests around specific abilities, not random tasks.

  • How it works: (1) Group data by skill; (2) Make uniform questions; (3) Score in comparable ways.
  • Why it matters: We can finally compare apples to apples for each ability. 🍞 Anchor: A set of relation-only questions vs. motion-only questions.
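A capability-centric benchmark ultimately reports scores per ability rather than per task pile. A minimal sketch of that bookkeeping is below; the `results` record format is an assumption for illustration, not the benchmark's actual schema.

```python
from collections import defaultdict

def score_by_ability(results):
    """Group per-item results by (level, ability) and report mean accuracy,
    so models can be compared skill-by-skill. Each result is assumed to be a
    dict with 'level', 'ability', and 'correct' keys (illustrative format)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[(r["level"], r["ability"])].append(1.0 if r["correct"] else 0.0)
    return {key: sum(v) / len(v) for key, v in sorted(buckets.items())}

demo = [
    {"level": "L1", "ability": "distance", "correct": True},
    {"level": "L1", "ability": "distance", "correct": False},
    {"level": "L3", "ability": "sequential_planning", "correct": True},
]
print(score_by_ability(demo))
# {('L1', 'distance'): 0.5, ('L3', 'sequential_planning'): 1.0}
```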
  2. Build L1 Perception QAs
  • What happens: Questions target geometry (distance, size, shape), motion (egocentric/allocentric), orientation (gravity, object pose), relation (topology, correspondence), and localization (detection, grounding). Numeric answers and MCQs are common.
  • Why it exists: These are the fast senses models need before language or planning.
  • Example: “Which point matches in the right image?” (correspondence), “Estimate gravity roll and pitch.”
  3. Build L2 Mental Mapping QAs
  • What happens: From videos or multi-view, 3D reconstructions create bird’s-eye maps. Questions cover spatial captioning, relations, motion semantics, perspective-taking, affordance, and memory retrieval.
  • Why it exists: Ties perception to language and memory over time.
  • Example: “Describe chair–sofa–floor relations,” or “Recall where the red car appeared earlier.”
  4. Build L3 Mental Simulation QAs
  • What happens: Reasoning-heavy prompts include chain-of-thought templates. Tasks cover causal reasoning (puzzles, dynamics) and sequential planning (step-by-step routes/actions).
  • Why it exists: Tests if models can simulate outcomes before acting.
  • Example: “Are these images from the same side? Explain,” or “Give steps to reach the doorway.”
  5. Build L4 Agentic Competence QAs and Action Space
  • What happens: Curate navigation/manipulation videos (games, egocentric hands, robot grippers), then convert continuous motions into a unified, discrete action space (move/turn/roll; open/close gripper; push/pull/grab). Turn sequences into multi-step MCQs or structured action outputs.
  • Why it exists: Provides a consistent interface for models to output executable actions across embodiments.
  • Example: “From start to goal images, output at most 10 actions to move there.”

🍞 Hook: Think of video game controls: WASD to move, mouse to look, click to grab. 🥬 The Concept (Unified Action Mapping): A shared dictionary turns language plans into camera/robot moves.

  • How it works: (1) Decompose 6-DoF motion into primitives (truck, dolly, pedestal, pan, tilt, roll); (2) Add gripper/open/close and simple gestures; (3) Discretize steps for consistent scoring.
  • Why it matters: Without a shared control language, we can’t fairly test action skills. 🍞 Anchor: “Move forward 3, pan left 2, open gripper” is the same idea across a game camera or a robot wrist.
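To illustrate the unified action mapping, here is a small sketch that converts a continuous 6-DoF motion delta into discrete primitive tokens using the camera-motion terms above (truck, dolly, pedestal, pan, tilt, roll). The step sizes and token format are assumptions; the paper's actual discretization and gripper/gesture vocabulary are richer.

```python
# Hypothetical primitive names follow the camera-motion terms above;
# step sizes and output format are illustrative, not the paper's values.
TRANSLATION_PRIMS = ["truck", "pedestal", "dolly"]   # x, y, z translation
ROTATION_PRIMS = ["pan", "tilt", "roll"]             # yaw, pitch, roll

def discretize_motion(delta_xyz, delta_ypr, trans_step=0.1, rot_step=15.0):
    """Turn a continuous 6-DoF motion delta into discrete primitive tokens,
    e.g. 'dolly +3' ~ move forward 3 steps of 0.1 m each. A full mapping
    would also cover gripper open/close and simple gestures."""
    actions = []
    for name, value in zip(TRANSLATION_PRIMS, delta_xyz):
        steps = int(round(value / trans_step))
        if steps != 0:
            actions.append(f"{name} {steps:+d}")
    for name, value in zip(ROTATION_PRIMS, delta_ypr):
        steps = int(round(value / rot_step))
        if steps != 0:
            actions.append(f"{name} {steps:+d}")
    return actions

print(discretize_motion(delta_xyz=[0.0, 0.0, 0.32], delta_ypr=[-30.0, 0.0, 0.0]))
# ['dolly +3', 'pan -2']
```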
  6. Evaluation metrics
  • What happens: Three main types—(a) multiple choice (most items), (b) numeric accuracy (e.g., relative error for distance, angle difference for orientation), (c) LLM-as-a-judge for open descriptions and multi-step answers; plus special step-wise accuracy for action alignment.
  • Why it exists: Different abilities need different, reliable score types.
  • Example: Judge checks if a generated action sequence matches the goal constraints, even if wording differs.
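For the numeric and step-wise metrics, simple reference implementations look roughly like the sketch below. The tolerance value and exact matching rules are assumptions, not the benchmark's published thresholds.

```python
def relative_error_score(pred: float, gt: float, tol: float = 0.25) -> float:
    """Credit a numeric answer (e.g., a distance) if its relative error is
    within a tolerance; the 25% tolerance here is an assumption."""
    return 1.0 if abs(pred - gt) / max(abs(gt), 1e-6) <= tol else 0.0

def angle_error_deg(pred_deg: float, gt_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees
    (handles wrap-around, e.g. 350 vs 10 -> 20)."""
    diff = abs(pred_deg - gt_deg) % 360.0
    return min(diff, 360.0 - diff)

def stepwise_action_accuracy(pred_actions, gt_actions) -> float:
    """Fraction of ground-truth steps matched in order, a simple stand-in
    for the benchmark's step-wise action alignment metric."""
    matches = sum(p == g for p, g in zip(pred_actions, gt_actions))
    return matches / max(len(gt_actions), 1)

print(angle_error_deg(350, 10))                                                  # 20.0
print(stepwise_action_accuracy(["dolly +3", "pan -2"], ["dolly +3", "pan -1"]))  # 0.5
```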
  7. Ability dependency analysis
  • What happens: Compute Pearson correlations across abilities and models to see which skills move together.
  • Why it exists: To reveal the hidden structure—independence at L1 vs. interdependence at L3–L4.
  • Example: High correlation between planning and causal reasoning; low between size and motion.

🍞 Hook: If your grades in history and literature rise together, they’re probably related skills. 🥬 The Concept (Correlation Analysis): A way to see which abilities rise and fall together.

  • How it works: (1) Collect scores by ability; (2) Compute correlations; (3) Inspect clusters.
  • Why it matters: Guides where training will transfer (or collide). 🍞 Anchor: L3–L4 skills cluster tightly, L1 skills don’t.
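The dependency analysis itself is just a Pearson correlation over per-ability scores across models. A minimal sketch is below; the example score matrix is made up purely to show usage, not taken from the paper.

```python
import numpy as np

def ability_correlations(score_matrix, ability_names):
    """Pearson correlations between abilities across models.
    `score_matrix` has shape (num_models, num_abilities); each column is
    one ability's scores over all evaluated models."""
    corr = np.corrcoef(np.asarray(score_matrix), rowvar=False)
    pairs = []
    for i in range(len(ability_names)):
        for j in range(i + 1, len(ability_names)):
            pairs.append((ability_names[i], ability_names[j], float(corr[i, j])))
    return sorted(pairs, key=lambda p: -p[2])  # most correlated pairs first

# Toy input (5 models x 3 abilities) just to show the call; not real scores.
scores = [[0.40, 0.50, 0.30], [0.60, 0.62, 0.35], [0.30, 0.28, 0.50],
          [0.70, 0.75, 0.40], [0.50, 0.55, 0.45]]
for a, b, r in ability_correlations(scores, ["planning", "causal", "size"]):
    print(f"{a} vs {b}: r = {r:.2f}")
```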
  8. Training interventions: SFT and RL with auto-think
  • SFT (targeted fine-tuning): Train on 0.25M examples for distance, size, or correspondence (and mixes) on top of a general data recipe, then test transfer.
  • RL (GRPO, verifiable rewards): Train policies with rewards; test naive “think everywhere” vs. hierarchy-aware rewards that suppress or encourage thinking based on level.
  • Why both: SFT shapes specific senses; RL aligns multi-step behavior and reasoning.
  • Example: After distance SFT, models generalize to complex distance tasks and even improve robot-arm control.

🍞 Hook: Practice scales (SFT) to strengthen fingers; play full pieces with a coach giving feedback (RL). 🥬 The Concept (Auto-think RL): Teach the model when to think hard and when to be quick.

  • How it works: (1) Penalize long reasoning on L1 perceptual tasks; (2) Reward chain-of-thought on L3–L4; (3) Train policies to adapt per task.
  • Why it matters: One-size-fits-all thinking wastes tokens and harms quick judgments. 🍞 Anchor: Quick answer for “which is closer?”, but detailed steps for “plan a route around obstacles.”
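As a concrete illustration of the hierarchy-aware reward idea, here is a toy reward function in the spirit of auto-think: correctness dominates, long reasoning is penalized on L1 perception items and mildly rewarded on L3/L4 items. The weights and token budget are assumptions, and the actual GRPO training loop is not shown.

```python
def auto_think_reward(answer_correct: bool, thought_tokens: int, level: str,
                      think_budget: int = 64) -> float:
    """Toy hierarchy-aware reward sketch (weights are illustrative, not the
    paper's): reward correctness, discourage chains-of-thought on L1
    perception, and encourage bounded reasoning on L3/L4 planning and action."""
    reward = 1.0 if answer_correct else 0.0
    thinking = min(thought_tokens / think_budget, 1.0)
    if level == "L1":
        reward -= 0.2 * thinking   # overthinking a snap judgment is penalized
    elif level in ("L3", "L4"):
        reward += 0.2 * thinking   # deliberate reasoning is rewarded
    return reward

print(auto_think_reward(True, thought_tokens=200, level="L1"))  # 0.8
print(auto_think_reward(True, thought_tokens=200, level="L3"))  # 1.2
```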

Secret sauce:

  • A capability-centric, hierarchical design that mirrors human spatial development.
  • A unified action space that makes agentic evaluation comparable.
  • A training recipe that respects cognitive modes: fast perception vs. slow reasoning.
  • Multi-format QAs per skill to avoid overfitting to templates.
  • Correlation-driven insights that guide which basics to train for the biggest downstream gains.

04 Experiments & Results

The test: SpatialTree-Bench evaluates 27 sub-abilities across four levels using 70%+ multiple-choice, numeric metrics for precision tasks (like angles), LLM-as-a-judge for open responses, and step-wise action accuracy for navigation/manipulation. Why? To cover both snap judgments and long-horizon plans fairly.

The competition: Models included reasoning-augmented (thinking) systems like Gemini 3 Pro/Flash, Gemini 2.5 variants, GLM-4.5V, Seed1.5VL/1.8; non-thinking variants like Gemini 2.5 non-thinking and GPT-4o; and strong open-source models like Qwen2.5-VL (7B/32B/72B), Qwen3-VL (30B/235B), and Kimi-VL. This spread shows how reasoning styles and scales influence spatial abilities.

The scoreboard (with context):

  • Overall, Gemini 3 Flash tops the chart with 57.8—think of it like scoring an A- where most others are in the C-to-B range.
  • Among open-source, Qwen3VL-235B leads with 40.0—solid performance, roughly mid-pack among all entrants.
  • Pattern: L1 abilities are relatively independent (low cross-correlations), while L3–L4 are strongly correlated—doing well in planning often means doing well in causal reasoning and goal execution.

Surprising findings:

  1. Negative transfer within L1: Fine-tuning just one L1 skill (like distance) can slightly hurt neighbors (like motion or relation). It’s like overtraining one muscle and straining another.
  2. Cross-level transfer: That same distance fine-tuning can boost higher-level tasks (e.g., better robot manipulation and goal execution). Stronger metrics at the bottom help decisions at the top.
  3. Synergy by mixing fundamentals: Training distance + size + correspondence together yields overall gains greater than the sum of parts, even flipping some L1 drops into improvements.
  4. Thinking vs. perceiving trade-off in RL: Uniform “think more” reinforcement helps complex reasoning (L3–L4) but can damage intuitive L1 accuracy. Over-explaining a quick visual judgment leads to worse answers.
  5. Auto-think wins: A hierarchy-aware reward scheme—penalize long thoughts for L1, encourage reasoning for L3–L4—improves performance broadly, including on action-heavy agentic tasks. It shifts the model into the right cognitive gear.

Concrete examples:

  • Post-distance SFT, models answer complex, in-the-wild distance questions more reliably (e.g., sorting multiple points by depth with new coordinate prompts), and achieve notable gains in robotic arm sequences.
  • Prompting with low-level hints (correspondences, depth, size) improves high-level image-goal navigation: correspondences (+7.1%), distance (+5.5%), size (+2.1%). Grounding high-level problems with low-level signals helps.
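The hint-prompting result above is essentially prompt construction: prepend low-level perceptual signals to the navigation query. A minimal, hypothetical sketch of that augmentation is below; the hint format and field names are assumptions, not the paper's prompts.

```python
def augment_nav_prompt(base_prompt: str, hints: dict) -> str:
    """Prepend low-level hints (correspondences, depths) to an image-goal
    navigation prompt, mirroring the hint-prompting experiment described
    above. The hint wording here is an illustrative assumption."""
    lines = []
    if "correspondences" in hints:
        lines.append("Matched points between current and goal view: "
                     + "; ".join(f"{a} -> {b}" for a, b in hints["correspondences"]))
    if "depths" in hints:
        lines.append("Estimated depths (m): "
                     + ", ".join(f"{k}={v:.1f}" for k, v in hints["depths"].items()))
    lines.append(base_prompt)
    return "\n".join(lines)

print(augment_nav_prompt(
    "Plan at most 10 actions to reach the goal view.",
    {"correspondences": [("(120,80)", "(96,75)")], "depths": {"doorway": 3.2}},
))
```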

How big are the gains?

  • The paper’s tables show that naive RL with always-on thinking improves some reasoning and agentic categories but causes large drops in others (e.g., memory or open exploration in certain settings). Switching to full RL with auto-think consistently lifts averages and reduces collateral damage, boosting both perception scores and higher-level execution versus the same model without auto-think.

Takeaways:

  • The hierarchy is real: basics are independent; higher skills interlock.
  • Training must be level-aware: single-skill specialization can backfire locally but pay off globally; mixed fundamentals and auto-think RL deliver the best balance.
  • Benchmarks should reflect cognitive structure, not just task piles. SpatialTree does this and reveals practical training levers.

05 Discussion & Limitations

Limitations:

  • Coverage gaps: Even with 27 sub-abilities, space is huge (e.g., deformable objects, cluttered outdoor crowds, long-term 3D memory). L4 embodiments are representative but not exhaustive.
  • Evaluation noise: LLM-as-a-judge can introduce subjectivity, though constrained prompts mitigate this. Numeric metrics help but can miss partial credit nuances.
  • Synthetic and curated bias: Some data comes from reconstructions and curation pipelines; real-world messiness (lighting, occlusions, novel objects) may reduce transfer.
  • Action space simplification: The unified primitives enable fair scoring but may limit expressiveness compared to continuous control.
  • Overfitting risk in SFT: Targeted boosts can hurt nearby L1 skills (negative transfer) if not balanced with mixed training.

Required resources:

  • Datasets spanning images, videos, multi-view captures, and action sequences (games, robotics, egocentric tasks).
  • Expert perception tools (depth, correspondence, gravity, orientation) and 3D reconstruction pipelines.
  • Compute for fine-tuning (hundreds of thousands of QAs) and RL (policy optimization over multi-step tasks).

When not to use:

  • If your application needs only a single narrow skill (e.g., simple object detection), the full hierarchy may be overkill.
  • If you require unconstrained continuous-time control fidelity (e.g., torque-level robotics), the discretized action space may be too coarse.
  • If strict determinism is required, LLM-as-a-judge scoring on open tasks may be uncomfortable; prefer numeric-only subsets.

Open questions:

  • How to expand L4 with real closed-loop interaction (not just sequence prediction) while keeping evaluation safe and scalable?
  • Can we learn when to auto-think from data instead of hand-setting level rules, perhaps via meta-learning?
  • What’s the optimal cocktail of L1 fundamentals for maximum L3–L4 transfer across domains (indoor, outdoor, aerial, underwater)?
  • How to reduce negative transfer within L1—curriculum schedules, multi-task architectures, or shared bottlenecks?
  • Can world-model pretraining (video predictive models) further strengthen the bridge from L2 memory to L3 simulation and L4 action?

06 Conclusion & Future Work

Three-sentence summary: SpatialTree reorganizes spatial intelligence for MLLMs into a four-level hierarchy—Perception, Mental Mapping, Mental Simulation, and Agentic Competence—and builds a thorough benchmark across 27 sub-abilities. Evaluations reveal that low-level senses are mostly independent while higher-level reasoning and action are tightly coupled, with surprising training dynamics: single-skill fine-tuning can hurt neighbors but boosts higher levels, and naive “think more” reinforcement helps complex tasks but harms snap perception. A simple auto-think strategy that suppresses overthinking for perception and encourages reasoning for planning improves performance across the entire tree.

Main achievement: Turning a messy zoo of spatial tasks into a capability-centric, cognitive-science-grounded roadmap—and proving that level-aware training (including auto-think RL) can systematically lift spatial intelligence.

Future directions:

  • Broaden L4 to richer, closed-loop interactions and varied embodiments; integrate real-time feedback and safety constraints.
  • Automate the when-to-think policy via learned detectors; combine with stronger world models for simulation.
  • Explore curricula that mix the most transferable L1 skills to supercharge L3–L4 without negative L1 interference.

Why remember this: SpatialTree doesn’t just test models; it explains them, revealing how spatial skills grow and connect. It provides practical levers—what to train, how to reason, when to be quick—that help build AIs that can see, think, and act in our 3D world.

Practical Applications

  • Design training curricula that first boost the most transferable L1 skills (e.g., distance) to unlock L3–L4 gains.
  • Adopt auto-think inference: disable chain-of-thought on simple perception QAs; enable it on planning and causal reasoning.
  • Use the unified action space to prototype vision-to-action agents for navigation and robot manipulation.
  • Run correlation analysis on your model’s SpatialTree scores to identify bottleneck abilities to target.
  • Convert your video logs into step-wise action annotations to evaluate agentic competence consistently.
  • Augment navigation with correspondence/depth prompts to improve target-direction recognition.
  • Mix multiple fundamentals (distance + size + correspondence) in SFT to reduce negative transfer within L1.
  • Use LLM-as-a-judge only where needed; prefer numeric and MCQ metrics for foundational skills to reduce noise.
  • Create curricula that gradually shift from fast perception training to slow, reasoned planning as models mature.
  • Benchmark new MLLMs on SpatialTree-Bench to report level-wise strengths and weaknesses transparently.
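The auto-think advice above can be applied at inference time with a simple router that switches instructions by task level. The level labels follow the article; the prompt wording is an assumption for illustration, not taken from the paper.

```python
def build_system_prompt(level: str) -> str:
    """Route between fast and deliberate answering modes by task level,
    per the auto-think practical application above (wording is illustrative)."""
    if level == "L1":
        return "Answer directly and concisely. Do not write out reasoning steps."
    if level in ("L3", "L4"):
        return "Think step by step, then give the final answer or action sequence."
    return "Answer the question."

print(build_system_prompt("L1"))
print(build_system_prompt("L4"))
```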
#Spatial Intelligence #Multimodal Large Language Models #Hierarchical Benchmark #Perception #Mental Mapping #Mental Simulation #Agentic Competence #Supervised Fine-Tuning #Reinforcement Learning #Auto-think Strategy #Action Mapping #Correlation Analysis #Spatial Reasoning #Navigation and Manipulation #Capability-centric Evaluation