SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.
Re-TRAC is a new way for AI search agents to learn from each try: the agent writes a clean summary of what happened and then uses that summary to do better on the next try.
The paper introduces DeepResearchEval, a fully automated way to build realistic deep research tasks and to grade long research reports from AI systems.
The paper shows that language models with a search tool often search more than they need to, which wastes compute and can make answers worse on unanswerable questions.