Deep research agents write long reports, but old tests often judge only how smooth they sound and whether they add links, not whether the facts are true today or the logic really holds.
ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.