Papers4

#error analysis

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao et al.Feb 26arXiv

AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.

#AgentVista#multimodal agents#visual grounding

Not triaged yet

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al.Jan 17arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench#command line interface#Docker containers

Not triaged yet

BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

Intermediate

Xin Guo, Rongjunchen Zhang et al.Jan 10arXiv

This paper builds BizFinBench.v2, a big bilingual (Chinese–English) test that checks how well AI models really handle finance using real business data from China and the U.S.

#BizFinBench.v2#financial benchmark#bilingual evaluation

Not triaged yet

MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Intermediate

Jingli Lin, Runsen Xu et al.Dec 11arXiv

This paper introduces MMSI-Video-Bench, a big, carefully hand-made test to check how well AI understands space and motion in videos.

#video-based spatial intelligence#multimodal large language models#spatial construction

Not triaged yet