Papers (3)

#text-to-video retrieval

Unified Vision-Language Modeling via Concept Space Alignment

Yifu Qiu, Paul-Ambroise Duquenne et al. | Mar 1 | arXiv

The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the meaning space already used for text, so that all modalities “speak” the same language.

#v-Sonar #OmniSONAR #concept space alignment


RANKVIDEO: Reasoning Reranking for Text-to-Video Retrieval

Intermediate

Tyler Skow, Alexander Martin et al. | Feb 2 | arXiv

RANKVIDEO is a video-native reasoning reranker that helps search engines find the right videos for a text query by looking directly at each video’s visuals and audio rather than only at its text captions.

#text-to-video retrieval #video-native reranking #multimodal reasoning


Action100M: A Large-scale Video Action Dataset

Intermediate

Delong Chen, Tejaswi Kasarla et al. | Jan 15 | arXiv

Action100M is a large-scale video dataset with about 100 million labeled action moments, built automatically from 1.2 million instructional videos.

#Action100M #open-vocabulary action recognition #hierarchical temporal segmentation
