Papers (3)

#execution-based evaluation

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Qixing Zhou, Jiacheng Zhang et al. · Feb 11 · arXiv

FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.

#FeatureBench #agentic coding #execution-based evaluation


MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Intermediate

Chuanzhe Guo, Jingjing Wu et al. · Jan 30 · arXiv

This paper builds a smart team of AI helpers, called MEnvAgent, that automatically sets up the right computer environments for code projects in many languages.

#environment construction #software engineering agents #Fail-to-Pass (F2P)


NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Intermediate

Jingzhe Ding, Shengda Long et al. · Dec 14 · arXiv

NL2Repo-Bench is a new benchmark that tests whether coding agents can build a whole Python library from just one long natural-language document and an empty folder.

#NL2Repo-Bench #autonomous coding agents #long-horizon reasoning
