This paper puts real AI agents into a safe, live playground and asks expert testers to mess with them to see what breaks.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.