How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (17)

OmniGAIA: Towards Native Omni-Modal AI Agents

Intermediate
Xiaoxi Li, Wenxiang Jiao et al. · Feb 26 · arXiv

OmniGAIA is a new test that checks if AI can watch videos, look at images, listen to audio, and use web and code tools in several steps to find a verified answer.

#OmniGAIA #OmniAtlas #Tool-Integrated Reasoning
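
To make "several steps to find a verified answer" concrete, here is a minimal sketch of what a GAIA-style task record and its exact-match check could look like. The field names and the normalization rule are illustrative assumptions, not OmniGAIA's actual schema.

```python
# Hypothetical sketch of a single omni-modal task and its GAIA-style check.
from dataclasses import dataclass

@dataclass
class OmniTask:
    question: str     # natural-language query about the media
    media: dict       # e.g. {"video": "clip.mp4", "audio": "track.wav"}
    gold_answer: str  # one short, verifiable string

def exact_match(prediction: str, gold: str) -> bool:
    """Normalize case and whitespace, then compare exactly."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(prediction) == norm(gold)

task = OmniTask(
    question="What year is printed on the poster shown at 0:42 in the video?",
    media={"video": "clip.mp4"},
    gold_answer="1987",
)
print(exact_match(" 1987 ", task.gold_answer))  # True
```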

QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Intermediate
Santiago Gonzalez, Alireza Amiri Bavandpour et al. · Feb 24 · arXiv

This paper shows that when AI models grade university-level math proofs, they often disagree with human experts in systematic ways.

#LLM-as-a-Judge #mathematical proof evaluation #alignment gap
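
One common way to quantify this kind of systematic disagreement is chance-corrected agreement between judge and human verdicts. The pass/fail grades below are invented for illustration; QEDBENCH's actual rubric and metrics may differ.

```python
# Invented pass/fail grades for eight proofs; QEDBENCH's real rubric differs.
from collections import Counter

human = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]

def cohens_kappa(a, b):
    """Agreement corrected for what two graders would agree on by chance."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[k] * pb.get(k, 0) for k in pa) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"kappa = {cohens_kappa(human, judge):.2f}")  # kappa = 0.33
# The disagreement is one-sided here: the judge flips three human "fail"s
# to "pass" and never the reverse -- a systematic, not random, gap.
```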

Computer-Using World Model

Intermediate
Yiming Guan, Rui Yu et al. · Feb 19 · arXiv

The paper builds a Computer-Using World Model (CUWM) that lets an AI “imagine” what a desktop app (like Word/Excel/PowerPoint) will look like after a click or keystroke—before doing it for real.

#world model #GUI agent #desktop automation
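
Here is a hedged sketch of the "imagine before acting" control loop such a world model enables. The world model and policy below are trivial stand-ins, not the paper's components; only the flow of predict, score, then act is the point.

```python
# Try each candidate action in imagination, execute only the best one for real.
def plan_with_world_model(world_model, propose, value, state, goal, budget=5):
    """Roll candidate actions through the world model and keep the one whose
    imagined next screen scores best against the goal, before acting for real."""
    best_action, best_score = None, float("-inf")
    for action in propose(state, goal, budget):
        imagined = world_model(state, action)  # dreamed next screen, no clicks yet
        score = value(imagined, goal)          # estimated progress toward goal
        if score > best_score:
            best_action, best_score = action, score
    return best_action                         # only this one touches the real app

# Trivial stand-ins so the loop actually runs:
actions = ["click:Bold", "click:Italic", "type:hello"]
best = plan_with_world_model(
    world_model=lambda s, a: s + [a],          # state = action history
    propose=lambda s, g, n: actions[:n],
    value=lambda s, g: 1.0 if g in s else 0.0,
    state=[], goal="click:Bold",
)
print(best)  # click:Bold
```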

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Beginner
Zhiling Yan, Dingjie Song et al. · Feb 10 · arXiv

LiveMedBench is a new, always-updating test for medical AIs that keeps test questions safely separated from training data to avoid cheating by memorization.

#LiveMedBench #medical benchmark #data contamination
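
A minimal sketch of the contamination-avoidance idea: only admit questions published after the model's training cutoff, so they cannot have been memorized. The dates, cutoff, and record shape are illustrative, not LiveMedBench's pipeline.

```python
# Keep only questions that postdate the model's assumed training cutoff.
from datetime import date

questions = [
    {"q": "First-line therapy for condition X?", "published": date(2025, 11, 3)},
    {"q": "Mechanism of drug Y?",                "published": date(2023, 5, 12)},
]

TRAINING_CUTOFF = date(2025, 6, 1)  # assumed cutoff of the model under test

live_set = [item for item in questions if item["published"] > TRAINING_CUTOFF]
print(len(live_set))  # 1 -- only the post-cutoff question survives
```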

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Intermediate
Changze Lv, Jie Zhou et al. · Feb 3 · arXiv

DeepResearch agents write long, evidence-based reports, but teaching and grading them is hard because there is no single 'right answer' to score against; this paper instead learns query-specific rubrics from human preferences to serve as the grading standard.

#DeepResearch #query-specific rubrics #human preference learning
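
To see what rubric-based grading replaces a gold answer with, here is a toy sketch. The criteria and weights are invented for one hypothetical query; in the paper they are learned from human preferences rather than hand-written.

```python
# Invented rubric for one query; weighted criteria stand in for a gold answer.
rubric = [
    ("cites at least 5 sources",      0.3),
    ("covers counterarguments",       0.3),
    ("states limitations explicitly", 0.2),
    ("answers the query directly",    0.2),
]

def score_report(judge_fn, report: str) -> float:
    """judge_fn(report, criterion) -> 0.0..1.0, e.g. one LLM call per criterion."""
    return sum(w * judge_fn(report, c) for c, w in rubric)

# Trivial stand-in judge with fixed per-criterion scores:
toy = {"cites at least 5 sources": 1.0, "covers counterarguments": 0.5,
       "states limitations explicitly": 1.0, "answers the query directly": 0.5}
print(score_report(lambda r, c: toy[c], "report text ..."))  # 0.75
```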

Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Intermediate
Chenlong Wang, Yuhang Chen et al. · Feb 2 · arXiv

This paper shows that many AI models that both read images and write images are not truly unified inside—they often understand well but fail to generate (or the other way around).

#Unified Multimodal Models #GAPEVAL #Gap Score
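
A worked toy example of one way an understanding-vs-generation gap can be put into a single number. The accuracies are made up and this is not GAPEVAL's actual formula, just the shape of the idea.

```python
# Illustrative numbers; GAPEVAL's actual protocol and formula may differ.
und_acc = 0.86  # accuracy answering questions about images (understanding)
gen_acc = 0.54  # accuracy generating images for the same concepts (generation)

gap = und_acc - gen_acc              # positive: understands more than it can draw
rel_gap = gap / max(und_acc, gen_acc)
print(f"gap = {gap:.2f}, relative gap = {rel_gap:.2f}")
# gap = 0.32, relative gap = 0.37
```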

Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Beginner
Shaohan Wang, Benfeng Xu et al. · Feb 2 · arXiv

This paper builds a live challenge that tests how well Deep Research Agents (DRAs) can write expert-level Wikipedia-style articles.

#Deep Research Agents #Wikipedia Good Articles #Benchmark

Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Intermediate
Zhuochun Li, Yong Zhang et al. · Jan 30 · arXiv

Big models are often used to grade AI answers, but they are expensive, slow, and depend too much on tricky prompts; this paper instead judges answers by reading a small language model's internal representations.

#Representation-as-a-Judge #Semantic Capacity Asymmetry #LLM-as-a-Judge
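
A minimal sketch of the representation-as-a-judge idea: rather than prompting a large model for a verdict, embed the question-answer pair with a small frozen LM and train a light probe on top. The model choice (DistilBERT) and mean pooling are assumptions, not the paper's setup.

```python
# Judge answers from a small LM's hidden states instead of prompted verdicts.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
lm = AutoModel.from_pretrained("distilbert-base-uncased").eval()

def embed(question: str, answer: str) -> torch.Tensor:
    inputs = tok(question, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = lm(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)         # mean-pool to one vector

probe = torch.nn.Linear(lm.config.dim, 2)        # good/bad answer classifier
logits = probe(embed("What is 2+2?", "4"))
print(logits.softmax(-1))                        # untrained probe: ~uniform
```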

CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty

Intermediate
Johannes Kirmayr, Lukas Stappen et al. · Jan 29 · arXiv

CAR-bench is a new 'driving test' for AI assistants that checks if they can stay careful, honest, and consistent during real back-and-forth conversations in a car.

#LLM agents #benchmarking #consistency
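
One toy way to operationalize the consistency part of such a test: ask the same underlying question several ways and measure whether the agent's answers agree. The prompts and the stand-in agent are invented, not CAR-bench's protocol.

```python
# Consistency rate across paraphrases of the same in-car request.
from collections import Counter

paraphrases = [
    "Find me a charging station nearby.",
    "Where can I charge the car around here?",
    "I need a charger close by.",
]

def consistency(agent, prompts):
    answers = [agent(p) for p in prompts]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)   # 1.0 = perfectly consistent

print(consistency(lambda p: "Station A", paraphrases))  # 1.0
```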

MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Intermediate
Peizhou Huang, Zixuan Zhong et al. · Jan 18 · arXiv

This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI “deep research agents” write long, citation-rich reports using both text and images.

#Multimodal Deep Research #Benchmark #Citation Grounding
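
A toy sketch of what a citation-grounding check can mean: every claim in the report should point at a source the agent actually retrieved. The data structures are illustrative, not MMDR-Bench's format.

```python
# Fraction of claims whose citation resolves to a retrieved source.
report = [
    {"claim": "X rose 12% in 2024.", "cite": "https://example.org/a"},
    {"claim": "Y was discontinued.", "cite": "https://example.org/zzz"},
]
retrieved = {"https://example.org/a", "https://example.org/b"}

grounded = [c for c in report if c["cite"] in retrieved]
print(f"citation grounding rate: {len(grounded) / len(report):.0%}")  # 50%
```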

Agent-as-a-Judge

Beginner
Runyang You, Hongru Cai et al. · Jan 8 · arXiv

This survey explains how AI judges are changing from single smart readers (LLM-as-a-Judge) into full-on agents that can plan, use tools, remember, and work in teams (Agent-as-a-Judge).

#Agent-as-a-Judge #LLM-as-a-Judge #multi-agent collaboration

KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Intermediate
Tingyu Wu, Zhisheng Chen et al. · Jan 8 · arXiv

KnowMe-Bench is a new test that checks if AI helpers truly understand a person, not just remember facts.

#person understanding #autobiographical narratives #cognitive stream