This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
Real people often ask vague questions with pictures, and today's vision-language models (VLMs) struggle with them.