Deep search agents can plan and browse the web over many steps, but they often fail because they don't notice when their own reasoning drifts off-track.
The paper addresses a core problem in training web-search agents: rewarding only the final answer encourages agents to cut corners and sometimes hallucinate.