AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
Long-horizon AI assistants can grab old, low-quality, or conflicting memories and then answer with too much confidence, which is dangerous.
GameDevBench is a new test that checks if AI agents can actually make parts of video games, not just write code in one file.
Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.