Real instructions often contain logic like "first do this, then that" and "if-else," and this paper teaches models to notice and obey that logic.
The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
Multimodal Large Language Models (MLLMs) often hallucinate on videos by trusting language priors and common sense more than what the frames actually show.
AdaTooler-V teaches an image-and-video AI to first ask, "Do I really need a tool?" before invoking one, which saves time and boosts accuracy.
This paper teaches vision-language models to reason about pictures using puzzles instead of expensive human labels.