ToolPRMBench is a new benchmark that checks, step by step, whether an AI agent using tools picks the right next action.
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.
This paper teaches AI models to reason better by first copying only good examples and later learning from mistakes too.