The paper trains multimodal large language models (MLLMs) to stop guessing from text alone or images alone and instead cross-check both modalities before answering.
SpatialTree is a new four-level "ability tree" benchmark that tests how MLLMs handle spatial understanding, from basic perception to acting in the world.