SAW-Bench is a new test that checks if AI can understand the world from a first-person view, like wearing smart glasses.
ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.
EcoGym is a new open playground where AI agents run small businesses over many days, testing whether they can plan well for the long term.
CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
AVMeme Exam is a new human-made test that checks whether AI can understand famous internet audio and video clips the way people do.
Robots need videos that not only look pretty but also follow real-world physics and actually complete the task they were asked to do.
This paper asks a new question for vision-language models: not just "What do you see?" but "How far along is the task right now?"
This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.
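To make the CLINSQL idea concrete, here is a minimal Python sketch of what one question-to-SQL task item might look like. The table and column names follow the public MIMIC-IV schema, but the specific question, the reference query, and the execution-match scoring are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a single CLINSQL-style task item.
# The table and column names (mimiciv_hosp.admissions, subject_id, hadm_id)
# follow the public MIMIC-IV schema, but this question, the reference query,
# and the execution-match check below are illustrative assumptions, not
# details taken from the benchmark itself.

example_task = {
    "question": "How many patients had more than one hospital admission?",
    "reference_sql": """
        SELECT COUNT(*) AS n_patients
        FROM (
            SELECT subject_id
            FROM mimiciv_hosp.admissions
            GROUP BY subject_id
            HAVING COUNT(hadm_id) > 1
        ) AS multi_admit
    """,
}

def execution_match(predicted_rows, gold_rows):
    """Judge a model's SQL by whether running it returns the same rows as
    the reference query (a common text-to-SQL metric; assumed here)."""
    return sorted(map(tuple, predicted_rows)) == sorted(map(tuple, gold_rows))

print(example_task["question"])
print(example_task["reference_sql"])
```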