Papers11

#spatial reasoning

Utonia: Toward One Encoder for All Point Clouds

Yujia Zhang, Xiaoyang Wu et al.Mar 3arXiv

Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.

#Utonia#point cloud#self-supervised learning

Not triaged yet

Enhancing Spatial Understanding in Image Generation via Reward Modeling

Intermediate

Zhenyu Tang, Chaoran Feng et al.Feb 27arXiv

This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.

#spatial reasoning#reward modeling#preference learning

Not triaged yet

PyVision-RL: Forging Open Agentic Vision Models via RL

Intermediate

Shitian Zhao, Shaoheng Lin et al.Feb 24arXiv

PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.

#agentic multimodal models#reinforcement learning#dynamic tooling

Not triaged yet

A Very Big Video Reasoning Suite

Intermediate

Maijunxian Wang, Ruisi Wang et al.Feb 23arXiv

This paper builds a gigantic library of video puzzles (VBVR) so AI can practice not just making pretty videos, but actually thinking through what happens over time.

#video reasoning#rule-based evaluation#in-domain generalization

Not triaged yet

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Intermediate

Pingyue Zhang, Zihan Huang et al.Feb 4arXiv

This paper asks a simple question with big consequences: can today’s AI models actively explore a new space and build a trustworthy internal map of it?

#active exploration#cognitive map#spatial belief

Not triaged yet

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Intermediate

Yu Bai, MingMing Yu et al.Feb 4arXiv

EgoActor is a vision-language model that turns everyday instructions like 'Go to the door and say hi' into step-by-step, egocentric actions a humanoid robot can actually do.

#EgoActing#vision-language model#humanoid robot

Not triaged yet

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate

Azmine Toushik Wasi, Wahid Faisal et al.Feb 3arXiv

SpatiaLab is a new test that checks if vision-language models (VLMs) can understand real-world spatial puzzles, like what’s in front, behind, bigger, or reachable.

#SpatiaLab#spatial reasoning#vision-language models

Not triaged yet

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Intermediate

Baorui Ma, Jiahui Yang et al.Jan 29arXiv

Metric Anything is a new way to teach AI real, ruler-like distances (metric depth) from very mixed and noisy 3D data.

#metric depth estimation#sparse metric prompt#monocular depth

Not triaged yet

World Craft: Agentic Framework to Create Visualizable Worlds via Text

Intermediate

Jianwen Sun, Yukang Feng et al.Jan 14arXiv

World Craft lets anyone turn a short text description into a playable, visual game world without coding.

#AI Town#multi-agent framework#layout generation

Not triaged yet

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Intermediate

Rang Li, Lei Li et al.Dec 19arXiv

Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.

#visual grounding#multimodal large language models#benchmark

Not triaged yet

N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models

Intermediate

Yuxin Wang, Lei Ke et al.Dec 18arXiv

This paper teaches a vision-language model to first find objects in real 3D space (not just 2D pictures) and then reason about where things are.

#3D grounding#vision-language models#spatial reasoning

Not triaged yet