FeatureBench is a new benchmark that tests AI coding agents on building real software features, not just fixing small bugs.
This paper builds a smart team of AI helpers, called MEnvAgent, that automatically sets up the right computer environments for code projects in many languages.
NL2Repo-Bench is a new benchmark that tests if coding agents can build a whole Python library from just one long natural-language document and an empty folder.