Big language models are great at words but waste lots of time and energy when they try random actions in non-language games like Sudoku, Sokoban, 2048, FrozenLake, and Rubikโs Cube.
This paper teaches AI to look around a 3D place step by step, instead of staring at a fixed set of pictures, so it can answer tricky spatial questions better.
SpatialTree is a new, four-level "ability tree" that tests how multimodal AI models (that see and read) handle space: from basic seeing to acting in the world.