Papers (3)

#OpenHands

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Wayne Chi, Yixiong Fang et al. · Feb 11 · arXiv

GameDevBench is a new benchmark that tests whether AI agents can actually build parts of video games, not just write code in a single file.

#GameDevBench #Godot #multimodal agents

CLI-Gym: Scalable CLI Task Generation via Agentic Environment Inversion

Intermediate

Yusong Lin, Haiyang Wang et al. · Feb 11 · arXiv

CLI-Gym is a new method for generating large numbers of realistic command-line repair tasks for AI agents by safely breaking and then repairing software environments inside containers.

#agentic coding #command line interface #Dockerfile

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Intermediate

Jie Yang, Honglin Guo et al. · Jan 16 · arXiv

ABC-Bench is a new benchmark that tests whether AI coding agents can carry out backend development from start to finish, not just write a few lines of code.

#ABC-Bench #agentic backend coding #end-to-end API testing
