This paper shows that you can vastly improve a modelβs command-line (terminal) skills by carefully engineering the training data, not just by using a bigger model.
LOCA-bench is a test that challenges AI agents to work correctly as their to-do list and background information grow very, very long.
This paper fixes a common problem in reasoning AIs called Lazy Reasoning, where the model rambles instead of making a good plan.
ASTRA is a fully automated way to train tool-using AI agents by making both their practice stories (trajectories) and their practice worlds (environments) without humans in the loop.