This paper introduces TAM-Eval, a benchmark for evaluating how well AI models can generate, repair, and update unit tests for real software projects.
It targets a practical problem: real codebases evolve, so useful test automation must not only write new tests but also fix failing ones and keep existing tests up to date as the code changes.