Big AI models do great in the lab but stumble in the real world because the world keeps changing.
The paper introduces Trainee-Bench, a new way to test AI agents in a setting that feels like a real first day at work: tasks arrive over time, clues are hidden, and priorities change.
This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just ideas written out in text.
Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why and how to fix it.