Papers (3)

#reproducibility

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw et al. · Jan 17 · arXiv

Terminal-Bench 2.0 is a tough test that checks how well AI agents can solve real, professional tasks by typing commands in a computer terminal.

#Terminal-Bench #command line interface #Docker containers


DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Intermediate

Hao Liang, Xiaochen Ma et al. · Dec 18 · arXiv

DataFlow is a building-block system that helps large language models get better data by unifying how we create, clean, check, and organize that data.

#DataFlow #LLM data preparation #operator pipeline


OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

Intermediate

Mengzhang Cai, Xin Gao et al. · Dec 16 · arXiv

OpenDataArena (ODA) is a fair, open platform that measures how valuable different post‑training datasets are for large language models by holding everything else constant.

#OpenDataArena #post-training datasets #data-centric AI
