Papers (6)

#rubric-based evaluation

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song et al. | Feb 10 | arXiv

LiveMedBench is a new, always-updating test for medical AIs that keeps test questions safely separated from training data to avoid cheating by memorization.

#LiveMedBench #medical benchmark #data contamination

CL-bench: A Benchmark for Context Learning

Beginner

Shihan Dou, Ming Zhang et al. | Feb 3 | arXiv

CL-bench is a new test that checks whether AI can truly learn new things from the information you give it right now, not just from what it memorized before.

#context learning #benchmark #rubric-based evaluation

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Intermediate

Keyu Li, Junhao Shi et al. | Jan 16 | arXiv

AgencyBench is a large benchmark that checks how well AI agents can handle real, long, multi-step jobs in contexts of up to 1M tokens, not just short puzzles.

#autonomous agents #long-horizon evaluation #agent benchmarking

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Intermediate

Yifei Shen, Yilun Zhao et al. | Jan 14 | arXiv

This paper introduces CLINSQL, a 633-task benchmark that turns real clinician-style questions into SQL challenges over the MIMIC-IV v3.1 hospital database.

#clinical text-to-SQL #EHR #MIMIC-IV

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Intermediate

Hao Bai, Alexey Taymanov et al. | Jan 5 | arXiv

WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.

#WebGym #visual web agents #vision-language models

A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Intermediate

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan et al. | Dec 18 | arXiv

This paper builds a new test, LongShOTBench, to check if AI can truly understand long videos by using sight, speech, and sounds together.

#long-form video understanding #multimodal reasoning #audio-visual-speech alignment
