This paper put real AI agents into a safe, live playground and asked expert testers to mess with them to see what breaks.
Multi-agent AI teams are not automatically better; their success depends on matching the team's coordination style to the job's structure.