This paper put real AI agents into a safe, live playground and asked expert testers to mess with them to see what breaks.