CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
Text-to-image models draw pretty pictures but often put things in the wrong places or miss how objects interact.
AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.
Robots need videos that not only look pretty but also follow real-world physics and show the requested task being completed.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.
DrivingGen is a new, all-in-one test that fairly checks how well AI can imagine future driving videos and motions.
SVBench is the first benchmark that checks whether video generation models can show realistic social behavior, not just pretty pictures.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about. This paper shows that today’s big vision-language AIs are not as good at it as we thought.