The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning-space already used for text, so all modalities "speak" the same language.
SS4D is a new AI model that turns a short single-camera video into a full 3D object that moves over time (that's 4D), and it does this in about 2 minutes.