Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Key Summary
- This paper asks a simple question: do tests written by AI coding agents actually help them fix real software bugs, or do they just look helpful?
- Across six strong language models on the SWE-bench Verified benchmark, most agents wrote lots of tests, but success rates were similar whether tests were written or not.
- One model (GPT-5.2) almost never wrote new tests (0.6% of tasks) yet solved nearly as many issues (71.8%) as the top test-writing models.
- When agents did write tests, they mostly used value-revealing prints to observe program behavior instead of strict assertions to verify correctness.
- Encouraging agents to write more tests barely changed how often they solved problems, but it did increase API calls and token usage (cost).
- Discouraging test writing in test-heavy models sharply reduced cost (up to about half the input tokens) with only small drops in success rates.
- Agent-written tests behaved more like a way to peek at program values than a dependable check that improves the final patch.
- The big takeaway: writing more tests mainly changes the process budget and style, not the final outcome, under a light, high-autonomy setup.
- Future work should focus on smarter, higher-value testing strategies and on measuring test quality as code evolves during the agent’s run.
Why This Research Matters
If AI agents spend lots of time and tokens writing low-value tests, teams pay more without fixing more bugs. This study shows that many current agent-written tests act like observations (prints) rather than strong checks (assertions), so they rarely change whether a bug gets solved. Knowing this helps teams choose prompts and scaffolds that save budget and focus effort on reasoning and patch quality. It also encourages building better test oracles and smarter “when to test” policies instead of just writing more tests. In real software work, that means faster fixes, lower cloud bills, and more predictable engineering pipelines. Finally, these insights guide research toward testing strategies that genuinely move success rates, not just the process footprint.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how in class projects we sometimes add checklists because we think more checks will automatically make the project better? That only helps if the checks are truly useful.
🥬 The Concept (Large Language Model, LLM):
- What it is: An LLM is a computer program that reads and writes language to help with tasks like coding.
- How it works: 1) It reads your prompt. 2) It predicts likely next words based on huge training data. 3) It chains predictions into plans, code, and explanations. 4) It uses tools (like a terminal) when allowed.
- Why it matters: Without LLMs, AI agents couldn’t discuss code, plan fixes, or write patches. 🍞 Anchor: When you ask an AI to fix a bug, the LLM reads the error and code, plans a change, and writes the patch.
🥬 The Concept (Agent-Written Tests):
- What it is: These are tests that the AI agent creates during problem solving, not tests that came with the project.
- How it works: 1) The agent writes a test file. 2) It runs the test to see results. 3) It observes outputs or failures. 4) It edits code and repeats.
- Why it matters: If these tests are wrong or low-value, they can waste time and money without improving the fix. 🍞 Anchor: The agent writes test_bugfix.py to print a value and then adjusts the function until the print looks right.
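The write-run-observe loop described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: `buggy_add` and the test file contents are made-up stand-ins for whatever the agent is debugging.

```python
# Sketch of an agent's write-run-observe loop: write a test file, run it,
# read the output. Note the test only prints a value; it never asserts one.
import pathlib
import subprocess
import sys
import tempfile

TEST_SOURCE = """\
def buggy_add(a, b):
    return a + b  # imagine the agent just patched this line

print("buggy_add(2, 3) =", buggy_add(2, 3))  # observation, not verification
"""

with tempfile.TemporaryDirectory() as tmp:
    test_file = pathlib.Path(tmp) / "test_bugfix.py"
    test_file.write_text(TEST_SOURCE)        # 1) write the test file
    result = subprocess.run(
        [sys.executable, str(test_file)],
        capture_output=True, text=True,      # 2) run it and capture output
    )

print(result.stdout.strip())                 # 3) observe, then decide next edit
```

The key point is that nothing in this loop fails when the code is wrong: the agent must eyeball the printed value and judge it, which is exactly the observation-heavy style the paper documents.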
🥬 The Concept (SWE-bench Verified):
- What it is: A big set of real GitHub issues used to test how well agents fix bugs.
- How it works: 1) Give the agent an issue and a frozen repo snapshot. 2) Let it run commands and edit code. 3) Check if the final patch truly solves the issue with the official harness.
- Why it matters: Without a fair benchmark, we couldn’t compare agent behaviors or success reliably. 🍞 Anchor: It’s like a science fair where every team gets the same puzzle box and the same rules.
The world before: AI coding agents got good at editing real repositories. Many top agents learned a human-looking habit: write tests while working. The common belief was that more tests would guide agents to better fixes, just like how checklists guide students.
The problem: A surprise popped up. GPT-5.2 solved almost as many issues as the top model but barely wrote any tests. That raises a big question: are agent-written tests helping, or are agents just imitating what humans do without real gains?
Failed attempts: Earlier studies judged AI-generated tests in fixed settings (like unit tests for a single snapshot). But real bug fixing is messy: code and ideas change step by step. So these older evaluations didn’t tell us how tests behave inside a live, evolving agent workflow.
The gap: We lacked an empirical look at agent-written tests during full, real-world issue resolution: how often they’re written, what feedback they provide, and whether changing test-writing actually changes outcomes.
Real stakes: Writing and running tests costs API calls, tokens, and time. If the signals from those tests are weak, agents waste budget that could have gone into better reasoning or patching. That matters for teams paying real money per token and per minute.
🥬 The Concept (Agent Trajectories):
- What it is: The step-by-step path an agent takes (commands, edits, observations) from start to finish.
- How it works: 1) The agent plans. 2) It runs a command. 3) It reads outputs. 4) It revises code or tests. 5) Repeat.
- Why it matters: Without studying trajectories, we only see the final grade, not how the student studied. 🍞 Anchor: It’s the breadcrumb trail showing every move the agent made to fix a bug.
02 Core Idea
🍞 Hook: Imagine fixing a bike. You can keep checking the bell and flashing the lights, but if the chain is off, those checks won’t help you ride.
Aha! The key insight in one sentence: In high-autonomy code-fixing, more agent-written tests mostly change process cost—not whether the bug actually gets fixed—because those tests act like observations (prints) rather than solid verifications (assertions).
Three analogies:
- Detective vs. Note-taker: A good detective tests a suspect’s story with hard evidence (assertions). A note-taker just writes down what people say (prints). Most agent tests behaved like note-taking.
- Cooking: A chef tasting (prints) learns flavors but needs a recipe check (assertions) to ensure the cake truly set; many agent tests did lots of tasting, little recipe-checking.
- Sports practice: Filming drills (prints) shows what happened; score rules (assertions) say what must happen. Agents mostly filmed, less often set rules.
Before vs After:
- Before: People assumed more AI-written tests would boost success.
- After: The study shows success barely changes when you add or remove lots of agent-written tests, but cost can swing a lot.
Why it works (intuition):
- If tests mostly print internal values, they inform the agent but don’t enforce correctness. Without strong, well-designed assertions, the agent may chase outputs that look okay without truly fixing the root cause.
- Strong verification requires specifying expected behavior precisely. That’s harder than printing values and is where many agents under this light scaffold fall short.
Building blocks (the paper’s logic):
- Measure behaviors (how often, when, and how intensely tests are written and run).
- Measure feedback content (prints vs. assertions; what kinds of assertions).
- Change prompts to causally nudge test writing up or down and see if outcomes or costs move.
🥬 The Concept (Feedback Signals):
- What it is: The messages tests give during runs—either strict checks (assertions) or value peeks (prints).
- How it works: 1) The code runs. 2) Prints show variable values. 3) Assertions pass or fail based on rules. 4) The agent reads these signals and decides next steps.
- Why it matters: Weak signals lead to guesswork; strong signals steer fixes. 🍞 Anchor: Printing total=5 tells you what happened; asserting total==5 demands it be correct.
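The signal spectrum can be shown in four lines, from weakest to strongest; `total` here is a made-up debugging value, not an example from the paper.

```python
# Feedback-signal spectrum, weakest to strongest.
total = sum([2, 3])

print("total =", total)    # value-revealing print: shows state, enforces nothing
assert total is not None   # sanity check: rules out almost nothing
assert total > 0           # relational/range check: loosely constrains behavior
assert total == 5          # exact check: fails unless the value is correct
```

The paper finds agent tests concentrated at the top of this spectrum (prints, weak checks), with relational/range assertions the rarest.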
🥬 The Concept (mini-SWE-agent, light scaffold):
- What it is: A minimal agent loop that lets the model use a bash shell—no special testing tools.
- How it works: 1) The agent runs terminal commands. 2) It edits files. 3) It runs tests it writes. 4) It repeats until done or budget ends.
- Why it matters: With fewer built-in rules, we see the model’s natural habits—its native testing style. 🍞 Anchor: It’s like handing a student a pencil and paper instead of a full lab kit to see how they think on their own.
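A toy sketch of such a light, bash-only loop follows. `fake_model` is a hard-coded stand-in for the LLM call a real scaffold like mini-SWE-agent would make; the commands it emits are illustrative.

```python
# Minimal bash-only agent loop: the model proposes a shell command, the
# scaffold runs it, and the observation is fed back for the next turn.
import subprocess

def fake_model(history):
    # Stand-in for an LLM call: a real scaffold would send `history` to a
    # provider API and receive the next command; this stub scripts two steps.
    script = ["echo 'inspecting repo'", "echo 'done'"]
    return script[len(history)] if len(history) < len(script) else None

history = []  # the trajectory: (command, observation) pairs
while (command := fake_model(history)) is not None:
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    history.append((command, proc.stdout.strip()))

print(history)
```

Everything the agent does (editing files, writing tests, running pytest) flows through this one command channel, which is why the scaffold exposes the model's native testing habits rather than imposing a workflow.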
03 Methodology
At a high level: Issue + Repo snapshot → Agent runs with a light bash tool → We track if/when/how tests are written and used → We parse tests to extract feedback signals → We tweak prompts to induce/suppress tests → Output: impact on success and cost.
Step-by-step recipe:
- Collect realistic tasks
- What happens: Use SWE-bench Verified (500 real GitHub issues) with a fixed repo snapshot and a standard evaluation harness.
- Why this step exists: Ensures fair, repeatable testing on real problems.
- Example: Give an agent an issue from a Python project and see whether its final patch passes the benchmark’s harness.
- Run a light, tool-limited agent
- What happens: Use mini-SWE-agent so the model can only interact via bash (run commands, edit files, write tests).
- Why this step exists: Avoids special testing modules that could hide the model’s native testing behavior.
- Example: The agent writes tests/test_bugfix.py using a bash here-doc and runs python -m pytest or python test_bugfix.py.
- Measure testing behavior (RQ1)
- What happens: For each task, record whether tests were written, how many, when during the run, and how often they were executed (and if executions failed at the process level).
- Why this step exists: Without frequency/timing/execution data, we can’t compare process styles or see how testing fits into the workflow.
- Example with data: Some models wrote tests in over 90% of tasks, while others (like GPT-5.2) wrote them in almost none (0.6%). Unresolved runs tended to execute tests more times per test artifact.
- Extract feedback signals from tests (RQ2)
- What happens: Parse agent-written Python test files with an AST to count value-revealing prints and assertions, then categorize assertions into four types (sanity, property, relational, exact).
- Why this step exists: Counting and categorizing signals shows whether tests are mostly observations or true verifications.
- Example: A test that prints obj.attr five times and asserts obj is not None once is observation-heavy.
- Causally test the role of tests (RQ3)
- What happens: Change only the prompt to either encourage writing at least one new runnable test file or discourage writing any new tests, and rerun tasks.
- Why this step exists: Isolate whether changing test-writing behavior itself changes outcomes and/or cost.
- Example: For a test-heavy model, we add “do not write new test files” and see if success drops and how cost changes.
- Track outcome and efficiency
- What happens: Compare resolution rates (success/fail) and process costs (API calls, input/output tokens) between baseline and intervention runs.
- Why this step exists: Even if outcomes don’t change much, big cost swings matter to real users.
- Example: Discouraging tests in a test-heavy model reduces input tokens by roughly a third to a half, with only small dips in success.
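The AST-based signal extraction in the recipe above can be approximated as follows. The classification rules are illustrative guesses at the four categories (sanity, property, relational, exact), not the paper's actual implementation.

```python
# Approximate AST-based extraction of feedback signals from a test file:
# count value-revealing prints and classify assertions by strength.
import ast

def count_signals(source: str) -> dict:
    tree = ast.parse(source)
    counts = {"print": 0, "sanity": 0, "property": 0, "relational": 0, "exact": 0}
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name) and node.func.id == "print"):
            counts["print"] += 1
        elif isinstance(node, ast.Assert):
            test = node.test
            if isinstance(test, ast.Compare):
                op = test.ops[0]
                if isinstance(op, (ast.Is, ast.IsNot)):
                    counts["sanity"] += 1        # e.g. assert x is not None
                elif isinstance(op, ast.Eq):
                    counts["exact"] += 1         # e.g. assert total == 5
                else:
                    counts["relational"] += 1    # e.g. assert total > 0
            else:
                counts["property"] += 1          # e.g. assert obj.flag
    return counts

sample = "print(x)\nprint(y)\nassert x is not None\nassert x == 5\n"
print(count_signals(sample))
```

Running this over agent-written test files makes the paper's observation-vs-verification ratio directly measurable per task.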
The secret sauce:
- Light scaffold: By not forcing a testing workflow, we observe the model’s native tendencies.
- AST-based signal parsing: Reliable extraction of prints and different assertion types.
- Prompt-only interventions: Clean causal probes—change test-writing instructions, keep everything else the same.
🍞🥬 Concept blocks for key methods
🥬 The Concept (Prompt Interventions):
- What it is: Small instruction tweaks that nudge the agent to write more or fewer tests.
- How it works: 1) Append a line: “write at least one runnable new test” or “do not write new test files.” 2) Run again. 3) Compare behaviors and outcomes.
- Why it matters: Lets us test cause-and-effect without changing tools or models. 🍞 Anchor: Like telling a teammate, “Try solving it with examples,” vs “Solve it without extra examples,” then seeing what changes.
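A minimal sketch of how such a prompt-only intervention might be wired up; the base prompt and both instruction variants are paraphrases of the paper's description, not its exact text.

```python
# Prompt-only intervention: hold everything constant except one appended line.
BASE_PROMPT = "Resolve the GitHub issue below by editing the repository."

INTERVENTIONS = {
    "induce": "Additionally, write at least one runnable new test file.",
    "suppress": "Do not write any new test files.",
}

def build_prompt(mode=None):
    # The single appended line is the only difference across runs, which is
    # what makes the before/after comparison a clean causal probe.
    if mode is None:
        return BASE_PROMPT
    return BASE_PROMPT + "\n" + INTERVENTIONS[mode]

print(build_prompt("suppress"))
```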
🥬 The Concept (Causal Evaluation):
- What it is: Checking if changing X (test-writing) truly changes Y (success/cost).
- How it works: 1) Hold everything else steady. 2) Flip one instruction. 3) Measure before vs after on the same tasks.
- Why it matters: Otherwise we might mistake coincidence for cause. 🍞 Anchor: Water two identical plants, but only give extra light to one; compare growth to see if light caused a difference.
🥬 The Concept (Efficiency Analysis):
- What it is: Measuring how many API calls and tokens the agent spends to solve tasks.
- How it works: 1) Count requests (calls). 2) Count input and output tokens. 3) Compare across runs.
- Why it matters: Time and money are limited; waste hurts. 🍞 Anchor: It’s like tracking how many ingredients and steps your recipe used to make the same cake.
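A back-of-envelope comparison in this style, using made-up run statistics rather than the paper's numbers:

```python
# Compare resource usage between a baseline run and a "no new tests" run.
# All figures below are illustrative, not the paper's measurements.
baseline = {"api_calls": 60, "input_tokens": 900_000, "output_tokens": 40_000}
no_tests = {"api_calls": 35, "input_tokens": 500_000, "output_tokens": 30_000}

for key in baseline:
    saved = 1 - no_tests[key] / baseline[key]
    print(f"{key}: {saved:.0%} less than baseline")
```

If success rates are near-identical across the two runs, reductions like these are pure savings, which is the paper's efficiency argument in miniature.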
04 Experiments & Results
The test (what they measured and why):
- Testing behavior (RQ1): How often agents write tests, when they write them, and how intensely they run them—because process style might hint at effectiveness or cost.
- Feedback content (RQ2): How many prints vs assertions appear, and what kinds of assertions—because verification strength could explain impact.
- Outcome and efficiency (RQ3): Whether more/less test writing changes success rates and resource usage—because stakes are cost and correctness.
The competition (who/what was compared):
- Six strong models under the same light agent (mini-SWE-agent) and the same SWE-bench Verified tasks: claude-opus-4.5, gemini-3-pro-preview, GPT-5.2, kimi-k2-thinking, minimax-m2, deepseek-v3.2-reasoner.
The scoreboard (with context):
- Test-writing frequency (RQ1): Most models wrote tests in the majority of tasks (e.g., often 80–99%), but GPT-5.2 wrote tests in just 0.6% of tasks—yet still solved 71.8%, close to the top models (around 74%). That’s like a student who almost never uses practice quizzes but still scores nearly the same grade.
- Timing and execution (RQ1): Test writing usually wrapped up late in runs; unresolved tasks often ran tests more times per test. Think of struggling students rechecking notes more often.
- Feedback signals (RQ2): Prints consistently outnumbered assertions across all models. Assertions, when present, were mostly local-property or exact-value checks; relational/range checks were rare. That’s like many measurements but few pass/fail rules.
- Outcome impact (RQ3): Flipping test-writing (adding or removing tests for hundreds of tasks) left most outcomes unchanged—on average, about 83% of tasks had the same final result after intervention.
- Efficiency impact (RQ3): Costs shifted a lot. Encouraging tests in a low-test model (GPT-5.2) increased API calls and output tokens without improving success. Discouraging tests in test-heavy models cut input tokens by about one-third to one-half and reduced API calls substantially, with only small drops in success.
Surprising findings:
- A nearly no-test model (GPT-5.2) performed close to top models that wrote tests frequently, challenging the “more tests = more success” assumption.
- The biggest differences showed up in cost, not correctness: writing more tests mainly changed budgets and interaction patterns, not the final fix rate.
05 Discussion & Limitations
Limitations:
- Benchmark and scaffold: Results come from SWE-bench Verified under a light, bash-only agent. Heavier scaffolds or other languages/tools may yield different patterns.
- Model/version variance: Different providers or future model updates could change behaviors.
- Nondeterminism: Decoding randomness and environment quirks can affect behaviors and outcomes; comparisons are controlled but not perfect.
- Test detection: The study identifies tests via common file patterns and parses Python ASTs; unusual test formats or helpers might be undercounted.
Required resources:
- The SWE-bench Verified harness and mini-SWE-agent scaffold, provider APIs, and enough compute to run 500-task sweeps per model; the study reports roughly $1,600 in total API spend.
When not to use agent-written tests (as-is):
- If you pay high per-token costs, and your agent tends to spam prints rather than write strong assertions.
- When your CI requires rigorous verification that casual prints won’t provide.
- When tight budgets mean you must prefer reasoning and targeted edits over exploratory test writing.
Open questions:
- How to evaluate on-the-fly test quality when code changes between runs? We need snapshotting and metrics that work on transient states.
- How to teach agents when to test and how to design high-value oracles (assertions) without choking exploration?
- Can agents self-adapt their test strategies over time, balancing cost and verification strength under a fixed budget?
06 Conclusion & Future Work
Three-sentence summary:
- This paper shows that in a light, high-autonomy setup, agent-written tests are common but mostly observational (prints), and success rates barely change when you add or remove lots of them.
- However, costs can swing sharply: encouraging tests increases API calls and tokens in low-test models, while suppressing tests in test-heavy models cuts cost with only small drops in success.
- So, agent-written tests largely reflect process style and budget usage rather than reliably improving final bug fixes.
Main achievement:
- A careful, causal study that separates “habit” from “help,” revealing that current agent-generated tests often deliver low marginal utility for resolution under a light scaffold.
Future directions:
- Design higher-value testing strategies (better oracles, smarter timing), measure test quality in evolving code states, and explore self-evolving policies that adapt testing to task context and budget.
Why remember this:
- Because the expensive part of agent development isn’t just whether it works—it’s how it works. This study warns us not to pay for testing rituals that don’t move the scoreboard and encourages investing in verification that truly changes outcomes.
Practical Applications
- Turn off or limit agent-written tests by default in cost-sensitive runs; enable them only when needed.
- Prompt agents to prefer concise, high-value assertions over many prints (e.g., assert relationships and exact outputs).
- Add a budget monitor that flags heavy test-writing loops and suggests returning to reasoning or code inspection.
- Use a two-phase strategy: first reason and localize the bug, then write a minimal, targeted test to lock in behavior.
- Snapshot code states when running tests so you can replay and evaluate test quality even as code evolves.
- Create a library of assertion templates for the stronger categories (C2–C4: property, relational, and exact checks) to encourage real oracles instead of pure observation.
- Run A/B prompts in your pipeline to compare “with tests” vs “no new tests” modes and choose the cheaper option with similar success.
- Teach agents to delete or disable low-signal tests (mostly prints) after they’ve served their temporary debugging role.
- Track and rank tests by marginal utility (which test changed a decision) to prune noisy test artifacts.
- Expand CI to reward relational and exact checks while capping value-revealing prints in agent-written test files.
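The code-state snapshotting suggested above can be sketched with a plain directory copy; paths and file names here are illustrative, and a real pipeline might use git commits instead.

```python
# Snapshot the evolving working tree before each test run so a test can
# later be replayed against exactly the code it was written for.
import pathlib
import shutil
import tempfile

def snapshot(repo_dir, snapshots_dir, step):
    dest = pathlib.Path(snapshots_dir) / f"step_{step:03d}"
    shutil.copytree(repo_dir, dest)   # frozen copy of the repo at this step
    return dest

with tempfile.TemporaryDirectory() as tmp:
    repo = pathlib.Path(tmp) / "repo"
    repo.mkdir()
    (repo / "module.py").write_text("VALUE = 1\n")
    snap = snapshot(repo, pathlib.Path(tmp) / "snaps", step=1)
    (repo / "module.py").write_text("VALUE = 2\n")          # code keeps evolving
    preserved = (snap / "module.py").read_text().strip()    # snapshot is intact

print(preserved)
```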