The paper builds v-Sonar, a bridge that maps images and videos into the same meaning-space as text called Sonar, so all modalities “speak” the same language.
RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by directly looking at the video’s visuals and audio, not just text captions.
Action100M is a gigantic video dataset with about 100 million labeled action moments built automatically from 1.2 million instructional videos.