AI helpers often don’t know new users’ tastes and can’t keep up when those tastes change.
ResearchGym is a new "gym" where AI agents are tested on real research projects end to end, not just on toy problems.
Big AI models do great in the lab but stumble in the real world because the world keeps changing.
The paper introduces Trainee-Bench, a new way to test AI agents that feels like a real first day at work, with tasks arriving over time, hidden clues, and changing priorities.
This paper teaches a computer agent to grow a toolbox of skills that are real, runnable programs, not just ideas written out as text.
Supervised fine-tuning (SFT) often makes a model great at a new task but worse at its old skills; this paper explains a key reason why that happens and how to fix it.