SAW-Bench is a new test that checks if AI can understand the world from a first-person view, like wearing smart glasses.
ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.
EcoGym is a new open playground where AI agents run small businesses over many days, testing whether they can plan well for the long term.
CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.
The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
This paper builds a new test called AgentIF-OneDay that checks if AI helpers can follow everyday instructions the way people actually give them.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
AVMeme Exam is a new human-made test that checks whether AI can understand famous internet audio and video clips the way people do.
Robots need videos that not only look pretty but also follow real-world physics and actually complete the task they were asked to do.
This paper asks a new question for vision-language models: not just "What do you see?" but "How far along is the task right now?"
This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.
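To make the CLINSQL idea concrete, here is a minimal Python sketch of what one question-to-SQL task item might look like. The table and column names follow the public MIMIC-IV schema, but the specific question, the reference query, and the execution-match scoring are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a single CLINSQL-style task item.
# The table and column names (mimiciv_hosp.admissions, subject_id, hadm_id)
# follow the public MIMIC-IV schema, but this question, the reference query,
# and the execution-match check below are illustrative assumptions, not
# details taken from the benchmark itself.

example_task = {
    "question": "How many patients had more than one hospital admission?",
    "reference_sql": """
        SELECT COUNT(*) AS n_patients
        FROM (
            SELECT subject_id
            FROM mimiciv_hosp.admissions
            GROUP BY subject_id
            HAVING COUNT(hadm_id) > 1
        ) AS multi_admit
    """,
}

def execution_match(predicted_rows, gold_rows):
    """Judge a model's SQL by whether running it returns the same rows as
    the reference query (a common text-to-SQL metric; assumed here)."""
    return sorted(map(tuple, predicted_rows)) == sorted(map(tuple, gold_rows))

print(example_task["question"])
print(example_task["reference_sql"])
```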