How I Study AI - Learn AI Papers & Lectures the Easy Way

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Intermediate

Guoxin Chen, Fanzhe Meng et al. · Mar 3 · arXiv

BeyondSWE is a new benchmark that tests code agents on harder, more realistic tasks than single-repo bug fixing.

#BeyondSWE, #code agents, #software engineering benchmark


Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets

Intermediate

Hanna Yukhymenko, Anton Alexandrov et al. · Feb 25 · arXiv

The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping questions and answers correctly connected.

#machine translation, #multilingual benchmarks, #test-time compute scaling


SAGE: Benchmarking and Improving Retrieval for Deep Research Agents

Intermediate

Tiansheng Hu, Yilun Zhao et al. · Feb 5 · arXiv

SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.

#SAGE benchmark, #scientific literature retrieval, #deep research agents

