How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (3)

Filter: #execution-based evaluation

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Intermediate
Qixing Zhou, Jiacheng Zhang et al. · Feb 11 · arXiv

FeatureBench is a new benchmark that tests AI coding agents on building complete, real-world software features rather than just fixing small bugs.

#FeatureBench · #agentic coding · #execution-based evaluation

Not triaged yet

MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering

Intermediate
Chuanzhe Guo, Jingjing Wu et al. · Jan 30 · arXiv

This paper introduces MEnvAgent, a team of AI agents that automatically sets up the right computer environments for code projects written in many different programming languages.

#environment construction · #software engineering agents · #Fail-to-Pass (F2P)

Not triaged yet

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Intermediate
Jingzhe Ding, Shengda Long et al. · Dec 14 · arXiv

NL2Repo-Bench is a new benchmark that tests whether coding agents can build a whole Python library starting from just one long natural-language requirements document and an empty folder.

#NL2Repo-Bench · #autonomous coding agents · #long-horizon reasoning

Not triaged yet