BabyVision is a new test that checks whether AI can solve the same basic picture puzzles that young children can, without leaning on language tricks.
This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.
TimeLens studies how to teach AI to work out not just what happens in a video but exactly when it happens, a task called video temporal grounding (VTG).
GEPA is a new way to improve AI prompts by letting the AI read its own work, reflect in plain language on what went wrong, and then rewrite its instructions.
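The reflect-and-rewrite loop described for GEPA can be sketched roughly as follows. This is a minimal illustration, not GEPA's actual implementation: the function names (`evaluate`, `reflective_prompt_search`), the keep-only-if-better rule, and the toy stand-ins for the model calls (`toy_run_task`, `toy_reflect`, `toy_rewrite`) are all assumptions made for the example. In practice each of the three callables would be a call to a language model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


def evaluate(prompt: str, examples: List[Example],
             run_task: Callable[[str, str], str]) -> Tuple[float, List[str]]:
    """Run the task on every example; return accuracy and a trace per failure."""
    failures = []
    correct = 0
    for x, expected in examples:
        got = run_task(prompt, x)
        if got == expected:
            correct += 1
        else:
            failures.append(f"input={x!r} expected={expected!r} got={got!r}")
    return correct / len(examples), failures


def reflective_prompt_search(seed_prompt: str, examples: List[Example],
                             run_task: Callable[[str, str], str],
                             reflect: Callable[[str, List[str]], str],
                             rewrite: Callable[[str, str], str],
                             rounds: int = 5) -> Tuple[str, float]:
    """Read the failures, reflect in plain language, rewrite the prompt, repeat."""
    best_prompt = seed_prompt
    best_score, failures = evaluate(best_prompt, examples, run_task)
    for _ in range(rounds):
        if not failures:
            break  # nothing left to reflect on
        feedback = reflect(best_prompt, failures)    # plain-language diagnosis
        candidate = rewrite(best_prompt, feedback)   # revised instructions
        score, cand_failures = evaluate(candidate, examples, run_task)
        # Assumption for this sketch: keep a candidate only if it scores higher.
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, cand_failures
    return best_prompt, best_score


# Hypothetical stand-ins for model calls, so the sketch runs without an LLM.
def toy_run_task(prompt: str, x: str) -> str:
    # The "model" uppercases its answer only when the prompt asks for it.
    return x.upper() if "uppercase" in prompt.lower() else x


def toy_reflect(prompt: str, failures: List[str]) -> str:
    return f"{len(failures)} answers were not uppercased; the prompt never asks for it."


def toy_rewrite(prompt: str, feedback: str) -> str:
    if "not uppercased" in feedback:
        return prompt + " Always answer in uppercase."
    return prompt


examples = [("hi", "HI"), ("ok", "OK")]
best, score = reflective_prompt_search(
    "Repeat the input.", examples, toy_run_task, toy_reflect, toy_rewrite)
print(best, score)
```

The point of the sketch is that the feedback passed from `reflect` to `rewrite` is ordinary text rather than a numeric gradient or reward, which is the idea the summary describes.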