BeyondSWE is a new benchmark that tests code agents on tougher, more realistic tasks than single-repo bug fixing.
The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping each question correctly matched to its answer.
SAGE is a new test for how well AI research agents find scientific papers when questions require multi-step reasoning.