BeyondSWE is a new benchmark that tests code agents on harder, more realistic tasks than single-repo bug fixing.
NL2Repo-Bench is a new benchmark that tests whether coding agents can build an entire Python library from a single long natural-language document and an empty folder.