This paper puts real AI agents into a safe, live playground and asks expert testers to mess with them to see what breaks.
OpenRT is a big, open-source test bench that safely stress-tests AI models that handle both text and images.