The paper asks a simple question: do video AI models really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
This paper teaches video-language models to first locate when the supporting evidence appears in a video and then answer using that evidence, instead of mixing the two steps together.
This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.
LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.
SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, much as people skim or rewind long videos.
This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.