The paper tackles a big blind spot in vision-language models: reasoning about how objects move and relate to one another in 3D over time (dynamic spatial reasoning, or DSR).
CRISP turns a normal phone video of a person into a clean 3D world and a virtual human that can move in it without breaking physics.
D4RT is a new AI model that quickly and accurately turns regular videos into moving 3D scenes (4D).