
TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Intermediate
Elena Bruches, Vadim Alperovich, Dari Baturova et al. · 1/26/2026
arXiv · PDF

Key Summary

  • This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.
  • Unlike older studies that only made single test cases for single functions, TAM-Eval works at the test file level and gives models the real project context they need.
  • The benchmark includes 1,539 carefully filtered scenarios from Python, Java, and Go, so results reflect real-world challenges.
  • TAM-Eval scores models without needing a 'gold' answer by using pass rate, code coverage change, and mutation testing change.
  • Modern AIs still struggle: even the best model’s pass rate was about 42% and mutation coverage gains rarely exceeded 12 percentage points.
  • Letting models try again with feedback (like error messages) helps, showing that iterative repair and verifier signals matter.
  • Go code was surprisingly friendlier for models than Java or Python, likely due to its simple syntax and strict typing.
  • Updating tests to match new code was the hardest task; creating tests from scratch showed the biggest coverage boosts.
  • Most failures were not syntax errors but runtime and import problems, meaning execution correctness is the main bottleneck.
  • TAM-Eval is open-source and extensible, enabling the community to benchmark and improve test maintenance agents.

Why This Research Matters

Every app you use depends on unit tests to avoid crashes and broken features after updates. Maintaining those tests is time-consuming and often neglected, which can lead to bugs slipping into production. TAM-Eval pushes AI to help with the real job: not just writing a test once, but keeping the whole suite healthy over time. By judging results through behavior (running, covering, and catching bugs), it focuses progress on what truly improves software quality. The findings show today’s AIs need verifier feedback and better context handling, guiding where researchers and engineers should invest. As this improves, teams can ship features faster with fewer regressions, keeping users safer and happier.

Detailed Explanation


01Background & Problem Definition

Let’s build the story step by step, using simple ideas first.

🍞 Hook: You know how a school has lots of rules, and teachers check that students follow them with small quizzes? Software has rules too, and unit tests are like those quizzes that make sure each tiny part of the program behaves. 🥬 The Concept (Unit Test Maintenance): It’s the ongoing job of keeping those little quizzes (unit tests) correct and up to date as the program changes. How it works: 1) When code changes, check which tests break. 2) Fix broken tests or add new ones. 3) Make sure the tests actually check important behavior. Why it matters: Without this, old tests lie or don’t catch new bugs, just like outdated quizzes would miss new classroom rules. 🍞 Anchor: When an app adds a dark mode, tests must be updated to check colors and contrast; otherwise, users might not see buttons properly.

🍞 Hook: Imagine a super-smart helper that can read and write code. 🥬 The Concept (Large Language Models, LLMs): These are powerful AIs that can understand and generate text, including code. How it works: 1) Read the prompt and code context. 2) Predict the next best tokens to write. 3) Use patterns learned from lots of examples. Why it matters: They can draft tests fast—but they still need checking because they can guess wrong. 🍞 Anchor: You ask the AI to write tests for a calculator's divide function; it writes several, including one for dividing by zero.

🍞 Hook: Think of a science fair where projects are judged by what they do, not how they look. 🥬 The Concept (TAM-Eval Benchmark): It’s a big, organized test to see how good AIs are at maintaining real unit tests across create, repair, and update tasks, using whole test files and real repositories. How it works: 1) Collect high-quality projects. 2) Turn them into tasks (create/repair/update). 3) Run AI-written tests and score them by execution results (no answer key needed). Why it matters: We measure what truly counts: do the tests run, cover code, and catch bugs? 🍞 Anchor: TAM-Eval takes a Go project, wipes a test file to ‘from scratch,’ asks the AI to rebuild it, runs the suite, and measures coverage and mutation score.

🍞 Hook: Imagine judging cookies by taste, not by comparing them to a perfect recipe you don’t have. 🥬 The Concept (Reference-free Evaluation): It evaluates AI outputs by how they perform when run, not by matching a prewritten ‘gold’ answer. How it works: 1) Run the tests. 2) Count passes and coverage. 3) Try tiny code changes (mutations) and see if tests catch them. Why it matters: Real software cares about behavior, not matching text. 🍞 Anchor: Two different-looking test files can both be great if they run cleanly and catch the same bugs.

🍞 Hook: Picture shooting basketballs and counting how many go in. 🥬 The Concept (Pass Rate): It’s the percentage of tests that successfully run and pass. How it works: 1) Run all tests. 2) Count how many pass. 3) Divide passes by total. Why it matters: Failing tests can’t be trusted to check behavior. 🍞 Anchor: If 8 out of 10 tests pass, the pass rate is 80%.

🍞 Hook: If you read more pages of a book, you know more of the story. 🥬 The Concept (Code Coverage): It’s the share of the program’s lines exercised by the tests. How it works: 1) Run tests with a coverage tool. 2) Mark lines touched. 3) Compute percentage of touched lines. Why it matters: Low coverage means large parts of the program might hide bugs. 🍞 Anchor: A test suite that touches 70 of 100 lines has 70% coverage.
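
To make the idea concrete, here is a minimal Python sketch of measuring line coverage programmatically with coverage.py, one of the tools used for Python in this paper; the tiny `divide` function and its test are hypothetical stand-ins for a real focal file and test file.

```python
# Minimal sketch: measuring line coverage with coverage.py.
# The focal function and test below are hypothetical stand-ins for a real project.
import coverage

def divide(a, b):
    if b == 0:
        raise ValueError("division by zero")
    return a / b

def test_divide():
    assert divide(10, 2) == 5

cov = coverage.Coverage()
cov.start()
test_divide()            # only lines executed while tracing count as covered
cov.stop()
cov.save()
total = cov.report()     # prints a per-file table and returns the overall percentage
print(f"line coverage: {total:.1f}%")
```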

🍞 Hook: Sprinkle a tiny bit of ‘wrong’ into a recipe to see if your taste-testers notice. 🥬 The Concept (Mutation Testing): It measures test strength by making small code changes (mutants) and checking if tests fail as they should. How it works: 1) Create mutants (like flipping conditions). 2) Run tests against them. 3) Count how many mutants are ‘killed’ (tests fail). Why it matters: High coverage alone can miss weak checks; mutation tests reveal weak or missing assertions. 🍞 Anchor: If tests don’t fail when you flip a > to < in a key function, they’re not protective enough.
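
Here is a hand-rolled toy version of that idea in Python (real tools such as MutPy, PIT, or go-mutesting generate and run many mutants automatically); the `is_adult` function and both tests are invented purely for illustration.

```python
# Toy illustration of mutation testing: flip one comparison and see whether the tests notice.
# Real tools (MutPy, PIT, go-mutesting) create and run many such mutants automatically.

def is_adult(age):             # original focal function
    return age >= 18

def is_adult_mutant(age):      # mutant: '>=' flipped to '<'
    return age < 18

def weak_test(fn):
    # Only checks that a boolean comes back, so the mutant slips through ("survives").
    return isinstance(fn(30), bool)

def strong_test(fn):
    # Asserts concrete expected behavior, so the mutant is "killed" (the test fails on it).
    return fn(30) is True and fn(10) is False

for name, test in [("weak", weak_test), ("strong", strong_test)]:
    killed = not test(is_adult_mutant)   # a mutant is killed when a test fails on it
    print(f"{name} test kills the mutant: {killed}")
```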

🍞 Hook: Think of caring for a garden—plant new seeds, fix broken fences, and trim overgrowth. 🥬 The Concept (Test Creation, Repair, and Updating): These are the three maintenance jobs: make new tests, fix broken tests, and update tests to match changed code. How it works: 1) Create: add tests where none or too few exist. 2) Repair: fix syntax, imports, or wrong assertions. 3) Update: adjust tests after code evolves. Why it matters: Without all three, your test garden wilts. 🍞 Anchor: After a refactor renames functions, update test calls; if an assert is wrong, repair it; if a new feature arrives, create tests.

🍞 Hook: Imagine a robot toy that can detect and fix itself when a gear is loose. 🥬 The Concept (Automated Program Repair, APR): Tools or AIs that automatically fix code problems. How it works: 1) Detect failing behavior. 2) Propose code edits. 3) Re-test for success. Why it matters: Reduces manual fixing time, especially for routine errors. 🍞 Anchor: A tool sees a missing import in a Java test and adds it automatically, making the test compile.

🍞 Hook: If you only keep players who perform well in real games, your team stays strong. 🥬 The Concept (Execution-based Filtering): Selecting data or outputs by whether they actually build, run, and pass within limits. How it works: 1) Try building and testing. 2) Drop flaky or slow samples. 3) Keep stable, runnable cases. Why it matters: Ensures the benchmark reflects practical, trustworthy scenarios. 🍞 Anchor: If a project’s tests are flaky, TAM-Eval excludes it so scores aren’t noisy.
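
A minimal sketch of this filtering logic, assuming hypothetical `run_suite()` and `measure_coverage()` helpers that wrap the real build/test tooling; the 30-second and 40% thresholds echo the ones quoted later in the methodology.

```python
# Sketch of execution-based filtering: run the suite twice to catch flakiness, enforce a
# time budget, and require a minimum baseline coverage. run_suite() and measure_coverage()
# are hypothetical stand-ins for the real build/test tooling.
from dataclasses import dataclass

@dataclass
class SuiteRun:
    passed: bool
    seconds: float

def keep_scenario(run_suite, measure_coverage, time_limit=30.0, min_coverage=0.40):
    first, second = run_suite(), run_suite()      # two runs; any disagreement means flaky
    if not (first.passed and second.passed):
        return False
    if max(first.seconds, second.seconds) > time_limit:
        return False                               # too slow for the benchmark
    return measure_coverage() >= min_coverage      # require a solid baseline

# Stubbed example: a stable 4.2-second suite with 57% baseline coverage is kept.
print(keep_scenario(lambda: SuiteRun(True, 4.2), lambda: 0.57))   # True
```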

The world before: AI could write single tests for small functions, but often ignored the hard, real-life part—keeping entire test files healthy as software grows. The problem: Companies spend huge time maintaining tests; when code changes, outdated tests break CI pipelines and miss regressions. Failed attempts: Early LLM approaches either wrote isolated assertions, produced non-runnable code, or optimized only coverage (not test strength). The gap: No unified, reference-free, repository-aware way to judge create/repair/update as a realistic maintenance loop. Real stakes: Better test maintenance means fewer bugs in the apps we use—payments work, photos upload, and updates don’t break your favorite features.

02Core Idea

The aha! moment in one sentence: Measure test maintenance by how tests behave when run (pass rate, coverage change, and mutation score change) across create–repair–update scenarios, using real projects and whole test files.

Three analogies for the same idea:

  1. Sports tryouts: Don’t compare new players to a ‘golden player’ on paper—watch them play real games (run tests), count points (pass rate), see how much of the field they cover (coverage), and how well they defend against tricky plays (mutation testing).
  2. Cooking: Don’t grade a dish by recipe matching—taste it (pass rate), check if all flavors were explored (coverage), and see if tasters can detect sneaky swaps like salt for sugar (mutation).
  3. School exams: Rather than matching answers word-for-word, credit students who show correct methods (pass), who solve across topics (coverage), and who catch trick questions (mutation).

Before vs. After:

  • Before TAM-Eval: LLMs were tested mainly on isolated test creation; results often overfit formatting and surface coverage, missing whether tests truly detect faults. Maintenance tasks (repair/update) were underexplored and lacked repository context.
  • After TAM-Eval: We have a unified, practical yardstick across create, repair, and update, at the test-file level, with full execution-based scoring. We can see which models actually improve suites, not just generate pretty code.

Why it works (intuition, no equations):

  • Execution is truth. If tests don’t run and pass, they’re not useful. If coverage doesn’t increase, you likely didn’t explore new behavior. If mutation score doesn’t rise, assertions aren’t strong enough to catch wrong behavior. By combining these, we capture not just quantity (lines touched) but quality (bugs caught).
  • Repository realism makes skills transfer. Handling imports, build steps, and project structure is half the battle in real maintenance; testing at file level forces models to learn that.
  • Iterative feedback helps models learn from verifiers. Error messages, stack traces, and compiler hints steer models toward fixes, like a coach giving targeted feedback after each play.

Building blocks of the idea:

  • Carefully filtered, runnable projects (execution-based filtering) ensure clean baselines.
  • Three task types reflect real maintenance: Create (from scratch, add tests, recover tests), Repair (syntax/import/assert fixes), Update (bring tests in line with evolved code).
  • Reference-free metrics: Pass rate, delta coverage (how much more code is exercised), and delta mutation coverage (how many more mutants are killed); a toy computation follows this list.
  • Unified prompting and multi-attempt recovery: One instruction template across tasks, plus up to k retries with feedback.
  • Modular infra: Docker sandboxes and language-specific tools (coverage.py, JaCoCo, Go cover; MutPy, PIT, go-mutesting) make it extensible.
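
As promised above, here is a toy computation of the three reference-free metrics; all numbers are invented, echoing examples used elsewhere in this article, and a real harness would read them from the test runner, coverage tool, and mutation tool.

```python
# Toy computation of pass rate, delta coverage, and delta mutation coverage.
# All numbers are invented for illustration.

def pass_rate(passed, total):
    return 100.0 * passed / total

def delta_points(after, before):
    return after - before                  # reported in percentage points

baseline_cov, new_cov = 52.0, 65.0         # line coverage before / after the rewrite
baseline_mut, new_mut = 18.0, 27.0         # mutation coverage before / after

print(f"pass rate: {pass_rate(8, 10):.1f}%")                          # 80.0%
print(f"ΔTestCov : {delta_points(new_cov, baseline_cov):+.1f} pp")    # +13.0 pp
print(f"ΔMutCov  : {delta_points(new_mut, baseline_mut):+.1f} pp")    # +9.0 pp
```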

Put simply, TAM-Eval shifts the question from “Can an AI write a test?” to “Can an AI keep a living test suite healthy as the code changes, and can we measure that fairly by running it?”

03Methodology

At a high level: Input (focal code + current/broken/old test file and minimal instructions) → Model rewrites the whole test file → Sanity and syntax checks → Run tests to get pass rate + line coverage → Run mutation testing → Aggregate deltas (coverage and mutation) → Score.
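
A rough per-scenario sketch of that flow, assuming hypothetical `generate_test_file()` (the model call) and `run_and_score()` (the build, test, coverage, and mutation harness) helpers; this illustrates the described loop, not the benchmark's actual code.

```python
# Rough sketch of the per-scenario loop described above, with up to k attempts and feedback.
# generate_test_file() and run_and_score() are hypothetical stand-ins, not TAM-Eval's API.
import ast

def evaluate_scenario(focal_code, test_file, generate_test_file, run_and_score, k=3):
    feedback, result = None, None
    for attempt in range(1, k + 1):
        candidate = generate_test_file(focal_code, test_file, feedback)

        # Sanity and syntax checks: reject empty output, verbatim copies, or unparsable code.
        if not candidate.strip() or candidate == test_file:
            feedback = "empty or unchanged test file"
            continue
        try:
            ast.parse(candidate)          # Python example; other languages use their own parsers
        except SyntaxError as err:
            feedback = f"syntax error: {err}"
            continue

        # Execute, then measure pass rate, coverage delta, and mutation delta.
        result = run_and_score(focal_code, candidate)
        if result["all_tests_passed"]:
            return attempt, result
        feedback = result["error_log"]    # stack traces and compiler errors steer the next try
    return k, result
```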

Step-by-step recipe with what, why, and examples:

  1. Collect real projects (Repository Selection)
  • What happens: Use GitHub API to gather open-source repos with recent updates, permissive licenses, stars, and multiple contributors; prefer recent commits to reduce training contamination.
  • Why this step exists: Ensures data is real, modern, and unlikely to be memorized by models.
  • Example: Keep a Python library updated in 2025 with at least 40 stars and multiple maintainers.
  2. Make sure projects build and run (Execution-based Filtering)
  • What happens: Auto-detect build/test commands; require tests to finish fast (≤30s); run twice to filter flaky tests; ensure the tests and focal files are properly linked; require baseline coverage ≥40%.
  • Why: If a project can’t run reliably, scores become noise. A solid baseline avoids trivial gains.
  • Example: Drop a Java project whose tests pass once but fail the second time (flaky), or a Python pair with 15% baseline coverage.
  3. Keep only meaningful, well-sized files (Content-based Filtering)
  • What happens: Enforce at least two test cases; focal functions with ≥5 executable lines; file length within sensible bounds; low comment-only content; remove auto-generated files; balance sampling across repos.
  • Why: Avoid toy or misleading samples and ensure variety.
  • Example: Exclude a test file with a single trivial test or a focal file that’s mostly comments.
  4. Create three kinds of maintenance tasks (Task Construction)
  • Test Creation: From Scratch (wipe the test file), Add New Tests (partial coverage <100%), Recover Tests (remove some cases from a good suite).
  • Why: Reflects common needs: writing fresh suites, expanding coverage, or rebuilding missing logic.
  • Example: Wipe a Go test file and ask the model to rebuild tests for all main functions.
  • Test Repair: Inject faults (syntax/import/runtime issues, weak assertions, coverage-reducing edits) using heuristics (tree-sitter) and LLM-generated defects; a minimal injection sketch follows this list.
  • Why: Real life includes broken tests and missing checks.
  • Example: Remove an import in a Java test so it fails to compile; ask the model to fix it.
  • Test Update: Revert the test file a few commits back while keeping current code; keep it only if this causes test failures or ≥5% drops in coverage/mutation.
  • Why: Tests must track code evolution.
  • Example: A Python module renamed a function; the old test calls the old name and now fails.
  5. Unified prompting with iterative attempts (Generation Stage)
  • What happens: Provide the focal file and the current/broken/old test file; ask the model to rewrite the test file to improve effectiveness; allow up to k attempts. If an attempt fails, feed back error logs and stack traces for the next try.
  • Why: One prompt avoids overfitting to task-specific instructions; feedback mimics developer loops.
  • Example: Attempt 1 returns a syntax error; Attempt 2 fixes syntax but has an import issue; Attempt 3 adds the missing import and passes.
  6. Sanity and syntax validation
  • What happens: Reject empty outputs or identical copies; parse with language-specific parsers to catch syntax issues.
  • Why: Filters trivial or invalid generations before wasting runtime.
  • Example: If the model returns a comment-only file, the pipeline discards it.
  7. Execution and coverage measurement
  • What happens: Run tests; record pass rate and line-level coverage using coverage.py (Python), JaCoCo (Java), or cover (Go); compute changes relative to the initial baseline.
  • Why: Running the tests reveals if they’re usable; coverage shows depth of code exercised.
  • Example: Coverage rises from 52% to 65% after the model adds new tests.
  8. Mutation testing
  • What happens: Generate mutants with MutPy (Python), PIT (Java), or go-mutesting (Go); run tests against fixed mutant sets; compute mutation coverage change.
  • Why: Strong tests should fail on mutated, subtly wrong code; it measures assertion quality.
  • Example: Mutation coverage increases from 18% to 27%, showing stronger bug-detection power.
  9. Aggregate metrics
  • What happens: For each sample, compute pass rate and deltas (ΔTestCov, ΔMutCov), then average across scenarios, languages, and tasks.
  • Why: Gives fair overall comparisons among models.
  • Example: GPT-OSS-120B averages +16.3 percentage points in coverage across all tasks.
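
To illustrate one of the simplest Repair-task injections mentioned in step 4 (dropping an import so the test no longer runs), here is a minimal Python sketch; it is a toy heuristic on a made-up test file, not the benchmark's tree-sitter-based implementation.

```python
# Toy fault injection for the Repair task: delete the first import so the test file breaks.
# The real pipeline uses tree-sitter heuristics and LLM-generated defects; this is only a sketch.
def drop_first_import(test_source: str) -> str:
    lines = test_source.splitlines()
    for i, line in enumerate(lines):
        if line.lstrip().startswith(("import ", "from ")):
            del lines[i]                      # remove the import to create a broken scenario
            return "\n".join(lines)
    return test_source                        # nothing to inject; leave the file unchanged

original = "import math\n\ndef test_sqrt():\n    assert math.sqrt(4) == 2\n"
print(drop_first_import(original))            # the resulting test fails with NameError: math
```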

The secret sauce (what’s clever):

  • Reference-free, behavior-first scoring: No need for a perfect ‘gold’ test; the code’s own execution, coverage, and mutation responses are the ground truth.
  • Maintenance realism: Working at test-file granularity with real repos forces handling imports, naming, build tools, and structure—the actual pain points.
  • Feedback-aware attempts: Compiler errors and stack traces guide the model like a coach, leading to steady gains by Attempt@3.
  • Strong filtering: Multi-stage curation yields stable, runnable, non-trivial scenarios, so improvements are meaningful.
  • Language-extensible design: Docker isolation and pluggable coverage/mutation tools make it easy to add more ecosystems later.

04Experiments & Results

The test: Measure whether models can keep tests healthy across creation, repair, and update by checking three things: Do the tests run and pass? Do they touch more of the code? Do they catch more subtle bugs (mutants)? This matters because useful tests must run, explore, and protect.

The competition: Six advanced models were evaluated (via a consistent API and settings): Devstral-Small, Qwen3 Coder 480B A35B, DeepSeek V3.1 671B, GPT-OSS-120B, Gemini 2.5 Flash, and GPT-5.

Scoreboard with context (Attempt@3):

  • GPT-5 led with a pass rate of about 42% (42.37%), ΔTestCov ≈ +20.8 points, and ΔMutCov ≈ +11.7 points—think of this as the top student getting a solid A- while others hover around B- or C+.
  • GPT-OSS-120B followed with pass rate ≈ 32.8%, ΔTestCov ≈ +16.3, and ΔMutCov ≈ +8.5—strong but behind the leader.
  • Qwen3 Coder 480B A35B and DeepSeek V3.1 671B showed moderate gains; Devstral-Small and Gemini 2.5 Flash trailed.

Attempts matter: Across models, metrics climbed from Attempt@1 to Attempt@3. Feeding back compiler errors and stack traces helped AIs fix their own outputs. GPT-5 started ahead even at Attempt@1 (~30.7% pass rate), but still improved with retries. This shows the power of ‘listen to the verifier’ loops.

By language:

  • Go was the friendliest for models: GPT-5 reached ~67.8% pass rate, ΔTestCov ~ +38.8, ΔMutCov ~ +28.1; GPT-OSS-120B also did very well. Likely reasons: simpler syntax, strict typing, lower semantic noise.
  • Java had surprisingly lower ΔTestCov and ΔMutCov even when pass rates were okay (e.g., GPT-5 ~29.3% pass rate but modest coverage gains), suggesting executability doesn’t always equal meaningful behavioral checks in verbose, statically typed settings.
  • Python sat in between; interestingly, GPT-OSS-120B topped Python coverage and mutation gains among non-GPT-5 models.

By task type:

  • Create (From Scratch) produced the biggest ΔTestCov and ΔMutCov because starting from zero makes any good suite a big lift. Add New Tests and Recover Tests had decent pass rates but smaller gains, meaning models often added runnable but not deeply probing tests.
  • Repair often succeeded at fixing syntax/imports but struggled with strengthening assertions and raising mutation coverage enough.
  • Update was hardest: All models showed reduced ΔMutCov. Precisely aligning tests to changed code requires strong context tracking and careful edits.

Surprising or notable findings:

  • Most failures weren’t syntax—they were execution issues (over 60% in many cases): missing imports, undefined names, wrong calls, and runtime exceptions. So models can write well-formed code but still not wire it correctly.
  • Go outputs had frequent unused imports and undefined references; Python from some models showed a wide variety of exceptions; Java commonly hit NullPointerException and IndexOutOfBoundsException.
  • Qwen3 Coder 480B A35B produced the shortest files on average; Gemini 2.5 Flash produced the longest. Despite shorter Java tests, Java saw the highest assert density, hinting that length alone doesn’t guarantee stronger tests.
  • Mutation gains rarely exceeded 12 percentage points on average, even for top models—clear headroom remains for better assertions and behavior checks.

Bottom line: Modern LLMs can help, especially with iterative feedback, but robust, behavior-strong test maintenance across languages and tasks is still an open challenge.

05Discussion & Limitations

Limitations (what this can’t do yet):

  • First-try weakness: Without feedback, initial attempts often fail at execution, especially import/wiring issues; practical use will likely need verifier-guided retries.
  • Mutation ceiling: Even top models show modest mutation gains, meaning assertions and behavior checks remain shallow in many outputs.
  • Update fragility: Aligning old tests to new code is hard; understanding semantic changes and intended behavior remains challenging.
  • Language bias: Results suggest language-dependent difficulty (Go easiest, Java hardest for meaningful gains), which may reflect training data or toolchain differences.
  • Repository scope: While large and diverse, 1,539 scenarios can’t cover all frameworks, build systems, or niche patterns.

Required resources to use TAM-Eval:

  • Containerized environment (Docker) with language-specific build/test tooling (coverage.py/JaCoCo/cover; MutPy/PIT/go-mutesting).
  • Access to LLM inference (through APIs or local models), with support for iterative prompting and feedback.
  • Reasonable compute for running tests and mutation analysis (mutation testing can be time-consuming for large files).

When not to use this approach:

  • Extremely time-constrained CI where mutation testing overhead is unacceptable.
  • Projects with heavy external dependencies or environment-specific setup that can’t be containerized or auto-detected.
  • Highly flaky test suites; TAM-Eval filters flakiness, but if a project is inherently flaky, execution-based scoring will be noisy.

Open questions:

  • How to better incorporate repository-wide context (build graphs, dependency resolution, fixtures) without exceeding model limits?
  • Can we train models to use verifier signals (compiler, coverage, mutation) more natively, e.g., with reinforcement or reward models tailored to testing?
  • What prompts or planning strategies help most for the Update task (e.g., diff-aware instructions, commit-message summaries)?
  • Can hybrid systems (static analysis + LLM + symbolic execution) raise mutation scores reliably across languages?
  • How do results change with broader ecosystems (JavaScript/TypeScript, C/C++, Rust) and with project-specific testing frameworks?

06Conclusion & Future Work

Three-sentence summary: TAM-Eval is an open, reference-free benchmark that tests whether AI models can create, repair, and update unit tests for real projects by measuring pass rate, code coverage change, and mutation score change. Experiments across 1,539 Python, Java, and Go scenarios show that today’s models improve with feedback but still struggle to deliver strong, reliable test maintenance—especially for updates and mutation robustness. This framework reveals clear gaps and provides a practical path for iterative, verifier-guided improvement.

Main achievement: Turning test maintenance into a realistic, runnable, and reference-free evaluation that captures both breadth (coverage) and strength (mutation), at test-file granularity and across languages.

Future directions: Enrich repository context and long-context handling; integrate structured diff/commit information for updates; train with verifier-derived rewards (compiler, coverage, mutation); explore hybrid static–dynamic analysis; extend to more languages and frameworks. Also, use TAM-Eval as a development loop: diagnose failure types, fine-tune models or agents, and re-measure progress over time.

Why remember this: It shifts the focus from writing pretty tests to keeping living test suites healthy—the real job developers face daily. By judging behavior instead of text matching, TAM-Eval spotlights what truly matters: runnable, exploring, bug-catching tests. Its open design invites the community to compete, learn, and steadily raise the bar for AI-assisted software quality.

Practical Applications

  • Add an AI-powered test maintenance step in CI that proposes fixes when tests fail to compile or run.
  • Use iterative AI attempts with compiler and stack-trace feedback to auto-repair broken unit tests.
  • Schedule AI-driven ‘from-scratch’ test generation for untested files to bootstrap coverage.
  • Run mutation testing nightly and ask the AI to strengthen assertions where mutants survive.
  • Apply the Update workflow after significant refactors: revert tests N commits, generate AI updates, and review diffs.
  • Adopt TAM-Eval locally to benchmark your chosen model mix on your codebase before deployment.
  • Build a guardrail that rejects AI outputs failing sanity/syntax checks and auto-triggers a second attempt.
  • Prioritize Go modules for AI-assisted maintenance if your stack includes Go, leveraging its higher success rates.
  • Train or fine-tune reward models using pass/coverage/mutation signals to steer AI toward stronger tests.
  • Create dashboards tracking ΔTestCov and ΔMutCov per module to spot weak areas and regressions.
#unit test maintenance #LLM for software engineering #reference-free evaluation #mutation testing #code coverage #test generation #test repair #test update #benchmarking LLMs #execution-based filtering #automated program repair #Docker sandboxing #Java Python Go testing #verifier feedback #test suite effectiveness