The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
WorldVQA is a new test that checks whether multimodal AI models can correctly name what they see in pictures, without any extra reasoning needed.
AVMeme Exam is a new human-made test that checks whether AI can understand famous internet audio and video clips the way people do.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.
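To make that concrete, here is a minimal sketch (not from the paper) of what one question-plus-SQL pair might look like: the table and column names come from the public MIMIC-IV schema, but the item format, the local SQLite copy of the database, and the admission-type filter are assumptions for illustration.

```python
import sqlite3

# Hypothetical CLINSQL-style item: a clinician question paired with a
# reference SQL query. Table/column names follow the public MIMIC-IV
# schema; the item format and the filter value are illustrative assumptions.
item = {
    "question": "How many distinct patients aged 65 or older had an emergency admission?",
    "reference_sql": """
        SELECT COUNT(DISTINCT p.subject_id)
        FROM patients p
        JOIN admissions a ON a.subject_id = p.subject_id
        WHERE p.anchor_age >= 65
          AND a.admission_type LIKE '%EMER%';
    """,
}

def run_query(sql: str, db_path: str = "mimic4.db"):
    """Run a query against a local SQLite copy of MIMIC-IV (assumed path)."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

# A model's generated SQL would typically be judged by whether its result
# set matches the result set of the reference query.
# reference_rows = run_query(item["reference_sql"])
```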
DrivingGen is a new, all-in-one test that fairly checks how well AI models can generate future driving videos and motions.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows that today’s big vision-language models are not as good at it as we thought.
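As a rough illustration of how grounding is usually scored (this is a common convention, not a detail taken from the paper), the sketch below counts a predicted box as correct when its intersection-over-union with the ground-truth box reaches 0.5.

```python
# Minimal grounding-accuracy sketch. Boxes are (x1, y1, x2, y2); the 0.5
# IoU threshold is a standard convention assumed here, not a number
# reported by the paper.
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def grounding_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of phrases whose predicted box overlaps its target enough."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

# One prediction lands on its target, the other misses -> accuracy 0.5.
print(grounding_accuracy([(10, 10, 50, 50), (0, 0, 5, 5)],
                         [(12, 12, 52, 48), (60, 60, 90, 90)]))
```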
OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how easily multimodal AI models get tricked (jailbroken) and how well different defenses stop those attacks.
This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.