The paper tackles a big blind spot in vision-language models: understanding how objects move and relate in 3D over time (dynamic spatial reasoning, or DSR).
This paper asks a simple question: do video AI models trained only on 2D videos secretly learn about 3D worlds?