This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.
SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, much as people skim or rewind long videos.
The paper shows that video AIs do not need long, human-like chains of thought to reason well.