VidVec shows that video-capable multimodal language models already encode strong video-sentence matching signals in their intermediate layers.
This paper asks a simple question: do video models trained only on 2D video implicitly learn a representation of the 3D world?