This paper builds Step-GUI, a pair of small-but-strong GUI agent models (4B/8B) that can use phones and computers by looking at screenshots and following instructions.

#GUI automation#multimodal large language models#trajectory-level calibration

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Intermediate

Jun Zhang, Teng Wang et al.Dec 16arXiv

TimeLens studies how to teach AI not just what happens in a video, but exactly when it happens, which is called video temporal grounding (VTG).

#video temporal grounding#multimodal large language models#benchmark re-annotation

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Intermediate

Lakshya A Agrawal, Shangyin Tan et al.Jul 25arXiv

GEPA is a new way to improve AI prompts by letting the AI read its own work, reflect in plain language on what went wrong, and then rewrite its instructions.

#GEPA#reflective prompt evolution#Pareto frontier