This paper teaches AI to look things up on the web and fix its own mistakes mid-thought instead of starting over from scratch.
DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against.