Papers (16)

#spatial reasoning

Utonia: Toward One Encoder for All Point Clouds

Yujia Zhang, Xiaoyang Wu et al. · Mar 3 · arXiv

Utonia is a single brain (encoder) that learns from many kinds of 3D point clouds, like indoor rooms, outdoor streets, tiny toys, and even city maps.

#Utonia #point cloud #self-supervised learning


MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Beginner

Jiachun Li, Shaoping Huang et al. · Mar 2 · arXiv

MMR-Life is a new test (benchmark) that checks how AI understands everyday situations using several real photos at once.

#multimodal reasoning #multi-image understanding #real-life benchmark


Enhancing Spatial Understanding in Image Generation via Reward Modeling

Intermediate

Zhenyu Tang, Chaoran Feng et al. · Feb 27 · arXiv

This paper teaches image generators to place objects in the right spots by building a special teacher called a reward model focused on spatial relationships.

#spatial reasoning #reward modeling #preference learning


PyVision-RL: Forging Open Agentic Vision Models via RL

Intermediate

Shitian Zhao, Shaoheng Lin et al. · Feb 24 · arXiv

PyVision-RL teaches vision-language models to act like curious agents that think in multiple steps and use Python tools to inspect images and videos.

#agentic multimodal models #reinforcement learning #dynamic tooling


A Very Big Video Reasoning Suite

Intermediate

Maijunxian Wang, Ruisi Wang et al. · Feb 23 · arXiv

This paper builds a gigantic library of video puzzles (VBVR) so AI can practice not just making pretty videos but actually reasoning about what happens over time.

#video reasoning #rule-based evaluation #in-domain generalization


Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

Intermediate

Pingyue Zhang, Zihan Huang et al. · Feb 4 · arXiv

This paper asks a simple question with big consequences: can today’s AI models actively explore a new space and build a trustworthy internal map of it?

#active exploration #cognitive map #spatial belief


EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Intermediate

Yu Bai, MingMing Yu et al. · Feb 4 · arXiv

EgoActor is a vision-language model that turns everyday instructions like "Go to the door and say hi" into step-by-step, egocentric actions a humanoid robot can actually perform.

#EgoActing #vision-language model #humanoid robot


SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Intermediate

Azmine Toushik Wasi, Wahid Faisal et al. · Feb 3 · arXiv

SpatiaLab is a new test (benchmark) that checks whether vision-language models (VLMs) can solve real-world spatial puzzles, such as which object is in front, behind, bigger, or within reach.

#SpatiaLab #spatial reasoning #vision-language models


MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Intermediate

Baorui Ma, Jiahui Yang et al. · Jan 29 · arXiv

MetricAnything is a new way to teach AI real, ruler-like distances (metric depth) from very mixed and noisy 3D data.

#metric depth estimation #sparse metric prompt #monocular depth


Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Beginner

Zengbin Wang, Xuecai Hu et al. · Jan 28 · arXiv

Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.

#text-to-image #spatial intelligence #occlusion


Think3D: Thinking with Space for Spatial Reasoning

Beginner

Zaibin Zhang, Yuhan Wu et al. · Jan 19 · arXiv

Think3D lets AI models stop guessing from flat pictures and start exploring real 3D space, like walking around a room in a video game.

#Think3D #spatial reasoning #3D reconstruction


STEP3-VL-10B Technical Report

Beginner

Ailin Huang, Chengyuan Yao et al. · Jan 14 · arXiv

STEP3-VL-10B is a small (10-billion-parameter) open multimodal model that sees images and reads text, yet scores like much larger models.

#multimodal foundation model #unified pre-training #perception encoder

