Unified Vision-Language Modeling via Concept Space Alignment
Intermediate · Yifu Qiu, Paul-Ambroise Duquenne et al. · Mar 1 · arXiv
The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning space already used for text, so all modalities "speak" the same language.
#v-Sonar #OmniSONAR #concept space alignment