The paper builds v-Sonar, a bridge that maps images and videos into Sonar, the same meaning-space already used for text, so all modalities “speak” the same language.
LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.