AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
This paper shows that letting an AI search many places at the same time (in parallel) can beat making it think in long, slow chains.
LLMs can think for many steps, but when they keep every step forever, the extra tokens turn into noise and make answers worse, not better.
AgenticPay is a safe playground where AI agents practice buying and selling by talking, not just by typing numbers.
Reasoning Cache (RC) is a new way for AI to think in steps: it writes some thoughts, makes a short summary, throws away the long thoughts, and then keeps going using only the summary.
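The Reasoning Cache loop described above can be sketched in a few lines. This is only an illustration of the write, summarize, discard, continue cycle, not the paper's implementation: the function names `reasoning_cache_loop`, `think`, and `summarize`, the `rounds` parameter, and the toy stand-ins for the model are all assumptions made for this example.

```python
from typing import Callable


def reasoning_cache_loop(
    problem: str,
    think: Callable[[str, str], str],
    summarize: Callable[[str, str], str],
    rounds: int = 3,
) -> str:
    """Run `rounds` think-then-summarize cycles and return the final summary."""
    summary = ""  # the only state carried from one round to the next
    for _ in range(rounds):
        # Write some (possibly long) thoughts, seeing only the short summary.
        thoughts = think(problem, summary)
        # Fold the new thoughts into a short summary.
        summary = summarize(summary, thoughts)
        # `thoughts` is not kept: the next round starts from `summary` alone.
    return summary


# Toy stand-ins for a model (hypothetical, for demonstration only).
def toy_think(problem: str, summary: str) -> str:
    earlier = len(summary.split(";")) if summary else 0
    return f"long reasoning about '{problem}' after {earlier} earlier rounds"


def toy_summarize(summary: str, thoughts: str) -> str:
    note = f"round done ({len(thoughts)} chars discarded)"
    return f"{summary};{note}" if summary else note


final_summary = reasoning_cache_loop("2+2", toy_think, toy_summarize, rounds=3)
```

In this sketch the context passed to each round stays short no matter how many rounds run, because only `summary` survives between iterations.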
Deep search agents can plan and browse the web in many steps, but they often fail because they don’t notice when their own thinking drifts off-track.
This paper tackles a real problem: one-shot image or text searches often miss the right evidence (low hit rate), especially in noisy, cluttered pictures.
MemOCR is a new way for AI to remember long histories by turning important notes into a picture with big, bold parts for key facts and tiny parts for details.
PACEvolve is a new recipe that helps AI agents improve their ideas step by step over long periods without getting stuck.
This paper turns an AI agent’s memory from a flat list of notes into a logic map of events connected by cause-and-time links.
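The difference between a flat list of notes and a map of linked events can be shown with a small data structure. This is a minimal sketch under stated assumptions: the class `EventGraph`, its methods, and the relation names `"causes"` and `"before"` are hypothetical choices for illustration, since the summary above does not give the paper's actual schema.

```python
from collections import defaultdict


class EventGraph:
    """Events as nodes, with labeled links for cause and time order."""

    RELATIONS = ("causes", "before")  # assumed link types, not from the paper

    def __init__(self) -> None:
        self.events: dict[str, str] = {}
        # event id -> list of (relation, target event id)
        self.edges: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add_event(self, event_id: str, text: str) -> None:
        self.events[event_id] = text

    def link(self, src: str, relation: str, dst: str) -> None:
        if relation not in self.RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[src].append((relation, dst))

    def causes_of(self, event_id: str) -> list[str]:
        """Follow 'causes' links backwards to collect direct and indirect causes."""
        found: list[str] = []
        stack = [event_id]
        while stack:
            current = stack.pop()
            for src, links in self.edges.items():
                if ("causes", current) in links and src not in found:
                    found.append(src)
                    stack.append(src)
        return found


graph = EventGraph()
graph.add_event("a", "user reported a bug")
graph.add_event("b", "agent patched the code")
graph.add_event("c", "tests passed")
graph.link("a", "causes", "b")
graph.link("b", "causes", "c")
graph.link("a", "before", "c")

root_causes = graph.causes_of("c")
```

A flat list could only return the three notes in order; with links, a question such as "why did the tests pass?" can be answered by walking the `"causes"` edges back from event `c`.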
Turn-PPO is a new way to train chatty AI agents that act over many steps, by judging each conversation turn as one whole action instead of judging every single token.
NL2Repo-Bench is a new benchmark that tests if coding agents can build a whole Python library from just one long natural-language document and an empty folder.