Papers (9)

#temporal grounding

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

LongVideo-R1 is a smart video-watching agent that jumps to the right moments in long videos instead of scanning everything.

#long video understanding #video navigation #multimodal large language model

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Intermediate

Linli Yao, Yuancheng Wei et al. · Feb 9 · arXiv

This paper teaches AI to write movie-like scripts for videos by adding exact timestamps and rich details about what you see and hear.

#Omni Dense Captioning #time-aware video captioning #audio-visual understanding

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Intermediate

Shuming Liu, Mingchen Zhuge et al. · Jan 8 · arXiv

The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?

#video reasoning #adaptive reasoning #early exit

Factorized Learning for Temporally Grounded Video-Language Models

Intermediate

Wenzheng Zeng, Difei Gao et al. · Dec 30 · arXiv

This paper teaches video-language models to first find when the evidence appears in a video and then answer using that evidence, instead of mixing both steps together.

#temporal grounding #video-language models #evidence tokens

Active Perception Agent for Omnimodal Audio-Video Understanding

Intermediate

Keda Tao, Wenjie Du et al. · Dec 29 · arXiv

This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.

#active perception #omnimodal understanding #audio-guided event localization

Streaming Video Instruction Tuning

Intermediate

Jiaer Xia, Peixian Chen et al. · Dec 24 · arXiv

Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.

#streaming video LLM #real-time video understanding #instruction tuning

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Intermediate

Runtao Liu, Ziyi Liu et al. · Dec 23 · arXiv

LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.

#long video question answering #multi-agent reasoning #temporal grounding

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Intermediate

Jitesh Jain, Jialuo Li et al. · Dec 15 · arXiv

SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, much as people skim or rewind long videos.

#any-horizon reasoning #video agents #temporal grounding

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate

Le Thien Phuc Nguyen, Zhuoran Yu et al. · Dec 1 · arXiv

This paper introduces AV-SpeakerBench, a new benchmark that checks whether AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.

#audiovisual reasoning #speaker attribution #temporal grounding