The paper asks a simple question: do video AIs really need to “think out loud” every time, or can they answer quickly most of the time and think deeply only when needed?
This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.