AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?
This paper builds a new test, called MURGAT, to check whether AI models can back up each small fact they say with the right part of a video, audio, or figure.
This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.