DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against.
Role-playing agents need to juggle several goals at once, like staying in character, following instructions, and using the right tone.