How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (7)

#temporal grounding

VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Intermediate
Shuming Liu, Mingchen Zhuge et al. · Jan 8 · arXiv

The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?

#video reasoning #adaptive reasoning #early exit

Factorized Learning for Temporally Grounded Video-Language Models

Intermediate
Wenzheng Zeng, Difei Gao et al. · Dec 30 · arXiv

This paper teaches video-language models to first find when the supporting evidence appears in a video and then answer using that evidence, instead of mixing both steps together.

#temporal grounding #video-language models #evidence tokens

Active Perception Agent for Omnimodal Audio-Video Understanding

Intermediate
Keda Tao, Wenjie Du et al. · Dec 29 · arXiv

This paper introduces OmniAgent, a smart video-and-audio detective that actively decides when to listen and when to look.

#active perception #omnimodal understanding #audio-guided event localization

Streaming Video Instruction Tuning

Intermediate
Jiaer Xia, Peixian Chen et al. · Dec 24 · arXiv

Streamo is a real-time video assistant that knows when to stay quiet, when to wait, and when to speak—while a video is still playing.

#streaming video LLM #real-time video understanding #instruction tuning

LongVideoAgent: Multi-Agent Reasoning with Long Videos

Intermediate
Runtao Liu, Ziyi Liu et al. · Dec 23 · arXiv

LongVideoAgent is a team of three AIs that work together to answer questions about hour‑long TV episodes without missing small details.

#long video question answering #multi-agent reasoning #temporal grounding

SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Intermediate
Jitesh Jain, Jialuo Li et al. · Dec 15 · arXiv

SAGE is a smart video-watching agent that decides when to answer quickly and when to take multiple steps, much as people skim or rewind long videos.

#any-horizon reasoning #video agents #temporal grounding

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate
Le Thien Phuc Nguyen, Zhuoran Yu et al. · Dec 1 · arXiv

This paper introduces AV-SpeakerBench, a new benchmark that checks whether AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.

#audiovisual reasoning #speaker attribution #temporal grounding