๐ŸŽ“How I Study AIHISA
๐Ÿ“–Read
๐Ÿ“„Papers๐Ÿ“ฐBlogs๐ŸŽฌCourses
๐Ÿ’กLearn
๐Ÿ›ค๏ธPaths๐Ÿ“šTopics๐Ÿ’กConcepts๐ŸŽดShorts
๐ŸŽฏPractice
๐ŸงฉProblems๐Ÿ“Daily Log๐ŸŽฏPrompts๐Ÿง Review
SearchSettings
How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (2)


Unified Vision-Language Modeling via Concept Space Alignment

Intermediate
Yifu Qiu, Paul-Ambroise Duquenne et al. · Mar 1 · arXiv

The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning space already used for text, so all modalities "speak" the same language.

#v-Sonar #OmniSONAR #concept space alignment

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Intermediate
Jihao Qiu, Lingxi Xie et al. · Feb 24 · arXiv

LongVideo-R1 is a video-watching agent that navigates directly to the relevant moments in long videos instead of scanning every frame, cutting the cost of long video understanding.

#long video understanding #video navigation #multimodal large language model