How I Study AI - Learn AI Papers & Lectures the Easy Way


LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

Intermediate
Jihao Qiu, Lingxi Xie et al. · Feb 24 · arXiv

LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.

#long video understanding #video navigation #multimodal large language model

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

Intermediate
Qian Chen, Jinlan Fu et al. · Jan 20 · arXiv

FutureOmni is the first benchmark that tests whether multimodal AI models can predict what happens next from both sound and video, rather than just explaining what has already happened.

#multimodal LLM #audio-visual reasoning #future forecasting

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Intermediate
Christopher Clark, Jieyu Zhang et al. · Jan 15 · arXiv

Molmo2 is a family of vision-language models that can watch videos, understand them, and point to or track objects over time, released with fully open weights, data, and code.

#vision-language model #video grounding #pointing and tracking