Papers (2)

#precision and recall

ContextBench: A Benchmark for Context Retrieval in Coding Agents

ContextBench is a new benchmark that checks not just whether a coding AI fixes a bug, but whether it found and used the right pieces of code along the way.

#context retrieval #coding agents #software engineering benchmarks

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Beginner

Nikita Gupta, Riju Chatterjee et al. · Jan 28 · arXiv

DeepSearchQA is a new benchmark of 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just a single fact.

#DeepSearchQA #agentic information retrieval #systematic collation
