Deep research agents write long reports, but existing evaluations often judge only how fluent they sound and how many links they include, not whether the facts are still true today or the reasoning actually holds.
SAGE is a new benchmark that tests how well AI research agents can find scientific papers when the questions require multi-step reasoning.
Re-TRAC is a new way for AI search agents to learn from each attempt: after every try, the agent writes a clean summary of what happened, then uses that summary to do better on the next try.
The paper introduces DeepResearchEval, a fully automated way to build realistic deep research tasks and to grade long research reports from AI systems.
The paper shows that language models with a search tool often look up too much information, which wastes compute and can make answers worse on questions that have no answer.