The paper tackles a new integrity problem in science: large language models sometimes invent realistic-looking citations that do not exist.
The paper introduces DeepResearchEval, a fully automated framework for constructing realistic deep research tasks and for grading the long research reports that AI systems produce.