Papers (4)

#reference-free evaluation

DREAM: Deep Research Evaluation with Agentic Metrics

Elad Ben Avraham, Changhao Li et al. · Feb 21 · arXiv

Deep research agents write long reports, but older evaluations often judge only how smooth the reports sound and whether they include links, not whether the facts are true today or the logic really holds.

#deep research agents, #agentic evaluation, #capability parity


TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Intermediate

Elena Bruches, Vadim Alperovich et al. · Jan 26 · arXiv

This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.

#unit test maintenance, #LLM for software engineering, #reference-free evaluation


DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Intermediate

Yibo Wang, Lei Wang et al. · Jan 14 · arXiv

The paper introduces DeepResearchEval, a fully automated way to build realistic deep research tasks and to grade long research reports from AI systems.

#deep research agents, #agentic evaluation, #persona-driven tasks


SAM Audio: Segment Anything in Audio

Intermediate

Bowen Shi, Andros Tjandra et al. · Dec 19 · arXiv

SAM Audio is a new AI model that can pull out exactly the sound you want from a noisy mix using text, clicks on a video, and time ranges, together or separately.

#audio source separation, #multimodal prompting, #text-guided separation
