AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
Long-horizon AI assistants can grab old, low-quality, or conflicting memories and then answer with too much confidence, which is dangerous.
GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
Most image search systems judge each photo by itself, an approach that fails when clues are split across many photos taken over time.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos rather than only in text.