The paper builds v-Sonar, a bridge that maps images and videos into the same meaning space that Sonar uses for text, so all modalities "speak" the same language.
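The payoff of a shared meaning space is that cross-modal comparison becomes plain vector math. A minimal sketch of the idea, using tiny hand-written vectors in place of real encoder outputs (the vectors and the encoder names in the comments are illustrative, not from the paper):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for real encoder outputs.
text_dog   = np.array([0.9, 0.1, 0.0, 0.1])   # e.g. text encoder on "a dog"
image_dog  = np.array([0.8, 0.2, 0.1, 0.0])   # e.g. image bridge on a dog photo
text_plane = np.array([0.0, 0.1, 0.9, 0.2])   # e.g. text encoder on "an airplane"

# Because all modalities land in one space, cross-modal retrieval is just
# a nearest-neighbour comparison of vectors.
print(cosine(image_dog, text_dog) > cosine(image_dog, text_plane))  # True
```

The dog photo's vector sits closer to the caption "a dog" than to "an airplane", which is exactly what a shared-space bridge is meant to guarantee.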
ObjEmbed teaches a model to understand not just whole pictures but also each object inside them, and to link those objects to the right words.
Most image-similarity tools only notice how things look (color, shape, object class) and miss the deeper, more abstract connections a human would draw.