How I Study AI - Learn AI Papers & Lectures the Easy Way

Search results for "benchmarking" · 20 results (keyword match)

T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Intermediate
Qinsi Wang, Hancheng Ye et al. · Mar 4 · arXiv

This paper shows that teaching AI to first draw a simple map of a text (nodes and links) before answering questions makes it smarter and more reliable.

#Structure of Thought · #Text-to-Structure · #Intermediate Representation

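The structure-first idea in this summary (extract a graph of nodes and links, then answer) can be sketched as a prompt template. This is an illustrative assumption only: the function names and instruction wording below are not the paper's actual Structure-of-Thought prompt.

```python
# Minimal sketch of a structure-first ("draw the map, then answer") prompt.
# All names and wording are illustrative assumptions, not the paper's
# actual Structure-of-Thought prompt.

def build_structure_prompt(text: str) -> str:
    """Ask the model to extract a node/link graph before reasoning."""
    return (
        "Step 1: List the entities in the passage as NODES.\n"
        "Step 2: List relations between them as LINKS (node -> relation -> node).\n"
        "Step 3: Answer the question using only the NODES and LINKS.\n\n"
        f"Passage: {text}"
    )

def build_full_prompt(text: str, question: str) -> str:
    """Combine the structure-extraction instructions with the question."""
    return build_structure_prompt(text) + f"\nQuestion: {question}"

prompt = build_full_prompt(
    "Ada mentors Ben. Ben reviews Cara's code.",
    "Who is connected to Cara, and how?",
)
print(prompt)
```

The point of the pattern is that the graph acts as an intermediate representation: the model commits to a structure before committing to an answer.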

AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Intermediate
Xintong Zhang, Xiaowen Zhang et al. · Feb 2 · arXiv

AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.

#Adaptive Multimodal Reasoning · #Vision-Language Models · #Tool Invocation


RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

Beginner
Haonan Bian, Zhiyuan Yao et al. · Jan 11 · arXiv

RealMem is a new benchmark that tests how well AI assistants remember and manage long, ongoing projects across many conversations.

#RealMem · #long-term memory · #project-oriented interactions


See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Intermediate
Le Thien Phuc Nguyen, Zhuoran Yu et al. · Dec 1 · arXiv

This paper introduces AV-SpeakerBench, a new test that checks if AI can truly see, hear, and understand who is speaking, what they say, and when they say it in real videos.

#audiovisual reasoning · #speaker attribution · #temporal grounding


LOCA-bench: Benchmarking Language Agents Under Controllable and Extreme Context Growth

Intermediate
Weihao Zeng, Yuzhen Huang et al. · Feb 8 · arXiv

LOCA-bench is a test that challenges AI agents to keep working correctly as their to-do list and background information grow extremely long.

#LOCA-bench · #long-context agents · #context rot


PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Intermediate
Jianshu Zhang, Chengxuan Qian et al. · Jan 21 · arXiv

This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'

#progress reasoning · #vision-language models · #episodic retrieval


Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

Beginner
Honglin Lin, Chonghan Qin et al. · Jan 17 · arXiv

The paper studies how to generate and evaluate scientific images that are not just visually appealing but scientifically correct.

#scientific image synthesis · #text-to-image (T2I) · #programmatic diagram generation


LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Intermediate
Gilat Toker, Nitay Calderon et al. · Jan 15 · arXiv

This paper builds LIBERTy, a new way to fairly judge how well AI explains its decisions about big, human ideas like age, race, or experience.

#concept-based explanations · #structural counterfactuals · #structured causal models


MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Beginner
Zecheng Tang, Baibei Ji et al. · Jan 17 · arXiv

This paper builds MemoryRewardBench, a big test that checks if reward models (AI judges) can fairly grade how other AIs manage long-term memory, not just whether their final answers are right.

#reward models · #long-term memory · #long-context reasoning


ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Intermediate
Jie Yang, Honglin Guo et al. · Jan 16 · arXiv

ABC-Bench is a new test that checks if AI coding agents can really do backend work from start to finish, not just write a few lines of code.

#ABC-Bench · #agentic backend coding · #end-to-end API testing


AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate
Keyu Li, Junhao Shi et al. · Jan 16 · arXiv

AgencyBench is a giant test that checks how well AI agents can handle real, long, multi-step jobs, not just short puzzles.

#autonomous agents · #long-horizon evaluation · #agent benchmarking


DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Intermediate
Yang Zhou, Hao Shao et al. · Jan 4 · arXiv

DrivingGen is a new, all-in-one test that fairly checks how well AI can imagine future driving videos and motions.

#generative video · #autonomous driving · #world models


From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Intermediate
Kevin Cannons, Saeed Ranjbar Alvar et al. · Dec 4 · arXiv

This paper builds TAD, a brand-new test that checks if AI can understand what happens over time in real driving videos.

#Temporal understanding · #Autonomous driving · #Vision-language models


V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Intermediate
Chenrui Fan, Yijun Liang et al. · Dec 12 · arXiv

This paper introduces V-REX, a new benchmark that tests how AI systems reason about images through step-by-step exploration, not just final answers.

#V-REX · #Chain-of-Questions · #Exploratory visual reasoning


How Much 3D Do Video Foundation Models Encode?

Intermediate
Zixuan Huang, Xiang Li et al. · Dec 23 · arXiv

This paper asks a simple question: do video AI models trained only on 2D videos secretly learn about 3D worlds?

#video foundation models · #3D awareness · #temporal reasoning

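Questions like this are commonly studied with linear probing: freeze a model's features and check whether a simple linear model can read a 3D property off them. The sketch below is a generic illustration of that technique on synthetic data — the random "features" and depth labels are stand-ins, not this paper's models or evaluation protocol.

```python
import numpy as np

# Illustrative linear-probe sketch: can frozen features linearly predict
# a 3D property (here, synthetic depth-like labels)? The features and
# labels are synthetic stand-ins, not the paper's actual setup.

rng = np.random.default_rng(0)
n, d = 200, 32
true_w = rng.normal(size=d)
feats = rng.normal(size=(n, d))                      # stand-in: frozen video-model features
depth = feats @ true_w + 0.1 * rng.normal(size=n)    # stand-in: 3D ground truth

# Closed-form ridge-regression probe on the frozen features
lam = 1e-3
w = np.linalg.solve(feats.T @ feats + lam * np.eye(d), feats.T @ depth)
pred = feats @ w

# R²: fraction of label variance the linear probe explains
r2 = 1 - ((depth - pred) ** 2).sum() / ((depth - depth.mean()) ** 2).sum()
print(round(r2, 3))
```

A high R² from such a probe is taken as evidence that the property is linearly decodable from the features; a low R² suggests the model does not encode it in an easily accessible form.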

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

Intermediate
Haoyu Dong, Pengkun Zhang et al. · Dec 15 · arXiv

FINCH is a new test that checks whether AI can handle real finance and accounting work using messy, real spreadsheets, emails, PDFs, charts, and more.

#FINCH benchmark · #finance and accounting AI · #spreadsheet agents


From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Intermediate
Zongzhao Li, Xiangzhe Kong et al. · Dec 11 · arXiv

The paper defines Microscopic Spatial Intelligence (MiSI) as the skill AI needs to understand tiny 3D things like molecules from 2D pictures and text, just like scientists do.

#microscopic spatial intelligence · #vision-language models · #orthographic projection


MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Beginner
Yixin Wan, Lei Ke et al. · Dec 11 · arXiv

This paper creates MotionEdit, a high-quality dataset that teaches AI to change how people and objects move in a picture without breaking their looks or the scene.

#motion-centric image editing · #optical flow · #MotionEdit dataset


Rethinking Expert Trajectory Utilization in LLM Post-training

Intermediate
Bowen Ding, Yuhan Chen et al. · Dec 12 · arXiv

The paper asks how best to use experts' step-by-step solutions (expert trajectories) when teaching large AI models to reason after pretraining.

#Supervised Fine-Tuning · #Reinforcement Learning · #Expert Trajectories


Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Beginner
Sara Papi, Javier Garcia Gilabert et al. · Dec 18 · arXiv

This paper builds a big, fair test called Hearing to Translate to check how well different speech translation systems work in the real world.

#speech translation · #Speech-LLM · #cascaded ASR-MT
