Papers (3)

#Docker containers

SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Ibragim Badertdinov, Maksim Nekrashevich et al. · Feb 27 · arXiv

SWE-rebench V2 is a large-scale, language-agnostic automated pipeline that turns real GitHub pull requests into safe, runnable software tasks for training AI coding agents.

#SWE-rebench V2 #software engineering agents #reinforcement learning

On Data Engineering for Scaling LLM Terminal Capabilities

Intermediate

Renjie Pi, Grace Lam et al. · Feb 24 · arXiv

This paper shows that you can vastly improve a model’s command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.

#Terminal-Bench 2.0 #terminal agents #synthetic task generation

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Intermediate

Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv

Terminal-Bench 2.0 is a challenging benchmark that measures how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench #command line interface #Docker containers
