This paper puts real AI agents into a safe, live playground and asks expert testers to mess with them to see what breaks.
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.