Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.
This paper builds BizFinBench.v2, a big bilingual (Chinese–English) test, built from real business data from China and the U.S., that checks how well AI models really handle finance.
This paper introduces MMSI-Video-Bench, a big, carefully hand-made test that checks how well AI understands space and motion in videos.