Long texts overwhelm many language models, which forget important bits and slow down as the context grows.
The paper asks a simple question: do video AIs really need to "think out loud" every time, or can they answer quickly most of the time and think deeply only when needed?