Agentic Code Reasoning
Key Summary
- The paper teaches AI agents to understand big codebases without running the code by following a strict, step-by-step thinking template called semi-formal reasoning.
- This template makes the agent write down premises (facts), trace how tests would run through the code, and give a formal yes/no conclusion with evidence.
- On tricky "are these two patches equivalent?" problems, accuracy jumps from about 78% to nearly 89% on curated examples, and to 93% on real patches with tests.
- For code question answering, the structured method boosts accuracy to 87%, a big improvement over standard reasoning.
- For finding buggy lines in Java projects (Defects4J), the method improves Top-5 accuracy by 5-12 percentage points, depending on the setup.
- A real Django example shows why structure matters: the agent discovers a local function named format that shadows Python's builtin, changing the result and proving the patches are not equivalent.
- Because it works without executing repository code, this approach can make RL training and code review cheaper and safer.
- The method is general: it uses natural-language "certificates," not hard-to-build formal proofs, so it scales to real repositories across languages and frameworks.
- The tradeoff is cost: the agent takes about 2-3 times more steps, and it can still miss paths or library details.
Why This Research Matters
Software teams spend huge time and money setting up sandboxes and running tests just to check if a patch is good. This work shows that, with the right structure, agents can read and reason their way to reliable answers without executing the repo. That means faster reviews, cheaper RL training signals, and safer pipelines (no running untrusted code). It also scales across languages and frameworks because it uses natural-language certificates instead of rigid formal proof systems. Developers get more trustworthy reasoning they can quickly audit. And organizations get a practical middle ground that boosts quality while cutting costs.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how your teacher asks you to show your work in math, not just write the final answer? That way, the teacher can see if you really understood the steps, not just guessed.
🥬 Filling (What/How/Why):
- What it is: This paper studies how AI "code detectives" can understand big codebases by reading and reasoning, without running the programs, using a strict show-your-work template called semi-formal reasoning.
- How it works: The agent explores files, writes down clear premises (facts), traces how tests would flow through functions, and commits to a formal conclusion, backed by evidence it found in the code.
- Why it matters: Without showing work, agents can guess and get fooled by look-alike code. Structure keeps them honest and thorough.
🍞 Bottom Bread (Anchor): Imagine grading a math test: the student who neatly shows each step is easier to trust than the one who writes just "42." That's this paper's whole idea for code.
Now let's build the story of why this research exists.
The world before: AI coding agents could already search files, read functions, and write patches. But they often checked their own work by running tests in a sandbox. That's slow, costly, and sometimes impossible (dependencies, licenses, security). If we could trust an agent's reading and reasoning, with no execution, we'd save time and money and enable safer pipelines.
The problem: Agents using unstructured chain-of-thought often say things like "this function probably does X," or "these patches look the same," without proving it. That's risky in real repos, where behavior can hinge on tiny details: a shadowed function name, an import that changes semantics, or an edge case hidden in tests.
🥪 Agentic Code Reasoning (concept 1)
- Top Bread: Imagine being a detective in a library. You can open any book, take notes, and connect clues, but you can't run experiments in the lab.
- Filling:
- What it is: Agentic code reasoning is when an AI agent navigates a repository and reasons about code behavior by reading, not executing.
- How it works: It uses tools (like grep, cat) to find definitions, follows function calls across files, and gathers context step by step.
- Why it matters: When execution isn't allowed or is too expensive, this is the only way to judge changes safely.
- Bottom Bread: The agent sees two patches, opens the relevant files, follows calls from tests into code, and decides if tests would pass, all without pressing "Run."
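The read-only exploration loop can be sketched in a few lines. The helper below is our own illustration (the paper's agents use shell tools like grep and cat directly), assuming a repository of Python files:

```python
from pathlib import Path

# Read-only exploration sketch: a grep-like helper an agent could use to find
# where a symbol is defined, without ever executing repository code.
# The repo layout and file names are illustrative.

def grep_definitions(repo_root: str, symbol: str) -> list[str]:
    """Return 'file:line: text' hits for `def <symbol>` across the repo."""
    hits = []
    for path in Path(repo_root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if line.lstrip().startswith(f"def {symbol}"):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```

Each hit doubles as a file:line citation the agent can paste into its notes as evidence.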
🥪 Unstructured Chain-of-Thought (concept 2)
- Top Bread: Imagine trying to solve a puzzle while talking out loud with no plan: "maybe this piece fits... or that one..."
- Filling:
- What it is: Unstructured chain-of-thought is free-form reasoning with no required proof steps.
- How it works: The model writes whatever thoughts come to mind and jumps to a conclusion.
- Why it matters: Without structure, it can skip cases, assume behavior, and miss tricky details.
- Bottom Bread: An agent sees format(x, '04d') and assumes it's Python's builtin, never checking if the project defines its own format, leading to a wrong answer.
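The shadowing trap is easy to reproduce. The sketch below is illustrative (not the actual Django source), but it shows exactly how a module-level format changes which function a call resolves to:

```python
import builtins

# Illustrative sketch: a module-level `format` shadows Python's builtin
# of the same name, mirroring the Django example in this article.

def format(value, format_string):
    # Module-level helper that expects a datetime-like object.
    return value.strftime(format_string)

def year_with_builtin(year: int) -> str:
    # What a hasty reader assumes the code does: builtins.format(76, '04d')
    return builtins.format(year, '04d')

def year_with_shadowed(year: int) -> str:
    # What actually resolves inside this module: the local `format`,
    # which calls .strftime on an int and raises AttributeError.
    return format(year, '04d')

print(year_with_builtin(76))   # '0076'
try:
    year_with_shadowed(76)
except AttributeError as e:
    print("shadowed format failed:", type(e).__name__)
```

An unstructured reasoner that never greps for `def format` would confidently predict '0076' for both.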
🥪 Semi-Formal Reasoning (concept 3)
- Top Bread: Think of a science fair where you must list your hypothesis, method, data, and conclusion.
- Filling:
- What it is: Semi-formal reasoning is a structured prompt template that forces the agent to state premises, trace execution per test, and deliver a formal conclusion.
- How it works: The agent fills a certificate: what changed, which tests matter, how each test flows through the code, and whether outcomes match, citing file:line evidence.
- Why it matters: It prevents hand-waving. The agent can't claim "these are the same" without showing identical test outcomes step by step.
- Bottom Bread: In Django, the agent spots a module-level format function that expects a datetime, not an int, proving two patches behave differently under tests.
Failed attempts and the gap: Pure LLM judges and rubric checkers help but still allow unsupported claims. Full formal proofs (Lean/Coq/Datalog) are precise but impractical for messy, multi-language repos. What's missing is a middle path: human-checkable, natural-language certificates with strict structure, stronger than free-form thoughts, lighter than full formal verification.
🥪 Patch Equivalence Verification (concept 4)
- Top Bread: Imagine checking if two recipes bake the same cake by comparing steps and what the tasters would think, without actually baking.
- Filling:
- What it is: Patch equivalence asks whether two code changes produce the same test pass/fail results.
- How it works: The agent compares patches, reads tests (F2P and P2P), traces expected flows, and predicts pass/fail for each patch.
- Why it matters: If we can trust this prediction, we can train agents and review patches without running the full test suite every time.
- Bottom Bread: If Patch A breaks a date formatter because of a shadowed symbol but Patch B doesn't, the outcomes differ, so they are not equivalent.
Real stakes: This helps in RL training (cheaper rewards without execution), code review (safer approvals), static program analysis (finding subtle bugs), and security (no risky code execution). It lowers costs, reduces wait times, and avoids spinning up complex environments, while catching tricky issues that pretty diffs can hide.
02 Core Idea
The "Aha!" moment in one sentence: If you make the AI write a semi-formal certificate (premises, per-test execution traces, and a formal conclusion), it stops guessing and starts proving, which makes non-execution code reasoning much more reliable.
Three analogies for the same idea:
- Courtroom analogy: The agent is a lawyer who must present admissible evidence (file:line facts), walk the jury through the timeline (execution trace), and issue a verdict tied to a legal definition (equivalence modulo tests). No evidence, no claim.
- Science lab analogy: The agent states a hypothesis, records the method (which functions/tests were traced), logs observations (what each line does), and writes a conclusion, so others can replicate the reasoning by re-reading the same files.
- Road trip analogy: Before declaring you've reached the destination, you list every turn you took, the road signs you saw (code definitions), and a photo at each stop (citations). If a turn is missing, you didn't prove the route.
Before vs. After:
- Before: Agents often relied on surface similarity or function names, skipping deep checks. They got decent averages but failed on subtle, real-world traps.
- After: With semi-formal templates, agents enumerate the code paths tests would exercise, document evidence, and compare outcomes patch-by-patch. Accuracy jumps 5-12 percentage points across tasks, reaching 93% on real patch verification when test specs are visible.
Why it works (intuition, not equations):
- Structure enforces completeness: Listing tests and tracing each forces coverage of relevant paths.
- Evidence reduces hallucination: File:line citations anchor claims to reality.
- Definitions nail semantics: "Equivalent modulo tests" ties the verdict to objective outcomes, not vibes.
- Interprocedural discipline: To explain a test's path, you must follow calls across functions/files, so the agent naturally reasons deeper.
Building blocks (the idea broken into parts):
- Clear definitions: The agent writes down what "equivalent modulo tests" means.
- Premises: What each patch changes; what the tests check (F2P, P2P).
- Per-test tracing: For each test, predict PASS/FAIL for each patch, citing the exact call chain and lines.
- Counterexample logic: If not equivalent, show a specific test where outcomes differ, and why.
- Formal conclusion: A crisp YES/NO tied to the earlier definition.
🥪 Semi-Formal Certificate (concept 5)
- Top Bread: Imagine a checklist you must complete before leaving class: name on paper, answers filled, proof steps shown, final answer circled.
- Filling:
- What it is: A structured "certificate" the agent must complete, covering definitions, premises, traces, and conclusion.
- How it works: The prompt provides a template. The agent must fill every bracket with concrete evidence (files changed, test names, code paths), or it can't finish.
- Why it matters: It transforms fuzzy thoughts into a verifiable artifact that humans can skim to trust or challenge.
- Bottom Bread: In the Django case, the certificate includes: "grep -n 'def format' ... line 340 shows module-level format(value, format_string)." That one line changes the verdict.
🥪 Equivalent Modulo Tests (concept 6)
- Top Bread: You know how two kids might use different methods but both get all the quiz answers right? For the teacher, that's equivalent, because the grading is by questions, not by method.
- Filling:
- What it is: Two patches are equivalent modulo tests if they make exactly the same tests pass and fail.
- How it works: We compare predicted outcomes on F2P and P2P tests for both patches; identical outcomes mean equivalent.
- Why it matters: It gives an objective, repository-grounded target the agent can reason about without execution.
- Bottom Bread: If both patches make all F2P tests pass and keep all P2P tests passing, they're equivalent, even if the code looks very different.
What changes because of this idea:
- For patch verification: Moves from "looks similar" to "proven same outcomes," hitting 93% with test patches visible.
- For code QA: Replaces guessing with trace tables and data-flow notes, lifting Opus-4.5 from 78.3% to 87.0%.
- For fault localization: Turns vague hunches into ranked, evidence-cited predictions, improving Top-5 by up to 12 percentage points on smaller, fit-in-context sets.
In short, the key innovation is not a new model or a new training trick. It's a new way to think: make the agent produce a structured, evidence-backed certificate before it's allowed to answer.
03 Methodology
At a high level: Input → Explore repository (no execution) → Fill a semi-formal certificate (premises, per-test traces) → Produce YES/NO (with counterexample if NO).
Available tools and guardrails:
- The agent can navigate files using shell tools (like grep, cat, sed), but cannot install dependencies or run the repo's tests.
- It may run tiny, independent Python snippets only to probe language behavior, not the project code.
- Git history is off-limits to prevent peeking at the answer.
Step-by-step ârecipeâ for patch equivalence (the flagship task):
- Read the task and patches
- What happens: The agent loads both code diffs and the test patch (F2P tests) and scans P2P tests if relevant.
- Why this exists: You need the full picture of what changed and what is being checked; without it, you miss the target behavior.
- Example: Two Django patches both aim to format 2-digit years; tests focus on years < 1000.
- Write down definitions
- What happens: The agent copies Definition D1: "Equivalent modulo tests" means identical pass/fail across the full suite (F2P ∪ P2P).
- Why: This tethers the conclusion to a crisp goal; without it, the agent could reason vaguely and grade itself on the wrong scale.
- Example: If one patch throws AttributeError and the other returns '76', outcomes differ → not equivalent by D1.
- State premises (P1-P4)
- What happens: The agent specifies which files each patch changes, what each change does, and what F2P/P2P tests check.
- Why: These are the facts the rest of the proof leans on; without accurate premises, the traces don't mean much.
- Example: "P1: Patch 1 calls format(self.data.year, '04d')" and "P3: F2P tests exercise date formatting for years < 1000."
- For each test, trace the execution twice (once per patch)
- What happens: The agent follows the testâs call path into production code, stating expected behavior step-by-step with file:line evidence.
- Why: This prevents skipping edge cases; without per-test tracing, the agent might assume behavior from names or surface similarity.
- Example: "Claim: With Patch 1, test test_year_before_1000 FAILS because y() calls module-level format(value, format_string) at dateformat.py:340, which expects a datetime and raises AttributeError on int."
- Compare outcomes and seek a counterexample (if any)
- What happens: After predicting PASS/FAIL for each patch, the agent marks SAME/DIFFERENT per test. If any DIFFERENT appears, it writes a counterexample block explaining the divergence.
- Why: Counterexamples make "NOT EQUIVALENT" concrete and checkable; without one, a NO verdict would feel hand-wavy.
- Example: "Counterexample: test_year_before_1000 → Patch 1 FAILS (AttributeError), Patch 2 PASSES ('%02d' formatting)."
- Formal conclusion
- What happens: The agent summarizes predicted outcomes and answers YES/NO strictly by Definition D1.
- Why: This locks the verdict to the earlier definition, avoiding moving goalposts.
- Example: "Outcomes DIFFER; therefore NOT EQUIVALENT modulo tests."
- Sanity checks and alternative hypotheses (especially for QA and FL)
- What happens: The template asks the agent to consider the opposite answer and look for evidence that would support it.
- Why: This reduces confirmation bias; without it, the agent might stop after finding the first plausible path.
- Example (QA): Verify both API calls use the same enum keys; if not, at() vs [] could differ on missing keys.
Secret sauce (what's clever here):
- The certificate makes the agent do interprocedural reasoning naturally: to justify a testâs behavior, it must follow calls across files and layers.
- Evidence is grounded: claims cite file:line, making it easy for humans (or future tools) to audit.
- The structure is task-specific but reusable: different templates for patch equivalence, fault localization, and code QA share the same premise → trace → conclusion logic.
🥪 Interprocedural Reasoning (concept 7)
- Top Bread: Imagine following a rumor from one friend to another across the whole school to find where it started.
- Filling:
- What it is: Reasoning that follows function calls across multiple files and layers.
- How it works: Start at the test, step through each call, and record what each callee actually does.
- Why it matters: Many bugs and differences hide in other functions, not the one you first see.
- Bottom Bread: The Django bug only surfaces when y() calls a format that resolves to the module's function, not Python's builtin, discovered by tracing into the right file.
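This kind of name-resolution evidence can even be gathered statically. Here is a minimal sketch using Python's ast module, run on illustrative source rather than real Django code:

```python
import ast

# Static sketch: without executing anything, check whether a call to `format`
# inside a module would resolve to a module-level def rather than the builtin.
# The source below is illustrative, not the real Django module.

source = '''
def format(value, format_string):
    return value.strftime(format_string)

def y(data):
    return format(data.year, '02d')
'''

tree = ast.parse(source)
module_level_defs = {node.name for node in tree.body
                     if isinstance(node, ast.FunctionDef)}

# Evidence for the certificate: `format` is shadowed at module scope,
# so y() will NOT call the builtin.
print("format" in module_level_defs)   # True
```

A human auditor can re-derive the same fact by reading the file, which is exactly what makes the citation checkable.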
Extensions to other tasks:
- Fault localization: The template adds "Test Semantics," "Code Path Tracing," "Divergence Claims," and "Ranked Predictions," forcing each suspected line to be tied back to a specific failing test behavior.
- Code QA: The template requires a function trace table, data-flow analysis, and an alternative hypothesis check, curbing name-based guesswork.
Tradeoffs and controls:
- More steps: Semi-formal takes ~2.8× the steps versus standard reasoning (e.g., 10 → 28 on curated patch verification).
- Still static: It doesn't run project code; third-party libraries may require inferred semantics if source is unavailable.
- Safer by design: No executing untrusted repo code, which is valuable for security and cost.
🥪 Per-Test Execution Tracing (concept 8)
- Top Bread: Think of a referee watching an instant replay from start to finish for each disputed play.
- Filling:
- What it is: Simulating, on paper, how each test would move through the code.
- How it works: Start at test entry, follow calls, note key branches and data, and predict PASS/FAIL.
- Why it matters: It stops the agent from guessing that "most cases pass" while missing a crucial test branch.
- Bottom Bread: A session-related test fails because one patch forgets to update middleware; the trace shows the missing call that flips the outcome.
04 Experiments & Results
The tests and why they matter:
- Patch equivalence (main): Checks if two patches produce the same test pass/fail pattern, which is critical for execution-free rewards in RL and safe code review.
- Fault localization (Defects4J): Measures whether the agent can point to the actual buggy lines, which is vital for repair systems.
- Code QA (RubberDuckBench): Evaluates nuanced understanding with rubric-graded answers, which is important for developer assistance.
Competitors (what it's compared to):
- Standard reasoning: Free-form chain-of-thought, agentic or single-shot.
- Single-shot LLMs: No exploration; optionally given full modified files.
- difflib similarity: A code-diff similarity baseline (no LLM), used as a simple proxy reward in prior RL work.
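The difflib baseline can be sketched with Python's standard library. The patches below are made up; 0.4 is the best threshold reported for this baseline in the results:

```python
import difflib

# Sketch of the difflib similarity baseline: score two diffs by textual
# similarity and call them "equivalent" above a threshold. This ignores
# semantics entirely, which is why it trails the reasoning-based methods.

def difflib_equivalent(patch_a: str, patch_b: str, threshold: float = 0.4) -> bool:
    ratio = difflib.SequenceMatcher(None, patch_a, patch_b).ratio()
    return ratio >= threshold

patch_1 = "-    return str(year)\n+    return format(year, '04d')\n"
patch_2 = "-    return str(year)\n+    return '%04d' % year\n"
print(difflib_equivalent(patch_1, patch_2))  # True: the texts overlap heavily
```

Note that these two hypothetical patches score as "equivalent" on text alone even though only a semantic trace could say whether their behavior matches, which is precisely the baseline's weakness.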
Scoreboard with context:
- Patch equivalence, curated hard set (Opus-4.5)
- Standard: 78.2% accuracy, ~10 steps.
- Semi-formal: 88.8% accuracy, ~28 steps.
- Context: That's like going from a solid C+ to a high B+/A-, cutting errors nearly in half. The price is ~2.8× more reasoning steps.
- Patch equivalence, real agent patches with test specs visible (200 examples, balanced)
- difflib (best threshold 0.4): 73%.
- Opus-4.5 single call: 86%; +file context: 87.5%.
- Opus-4.5 agentic standard: 87.0%; semi-formal: 93.0% (~38 steps).
- Sonnet-4.5 single call: 80%; +file context: 82%.
- Sonnet-4.5 agentic standard: 84.5%; semi-formal: 91.5%.
- Context: 93% is like an A on a tough final. The jump from single-call to semi-formal agentic shows exploration plus structure is the winning combo.
- Surprises: Many cases truly need repo exploration beyond the patch diff; simply pasting changed files helps only a little.
- Fault localization, small-scale (fit-in-context, Opus-4.5)
- All-hunks Top-5: Standard single-shot 55.6% → Semi-formal single-shot 63.9% (+8pp); Standard agentic 60.5% → Semi-formal agentic 72.1% (+12pp).
- Any-hunk Top-5: Up to 88.4% with semi-formal agentic.
- Context: Structured tracing and divergence claims help the agent move from crash-site guesses to root-cause lines.
- Fault localization, larger sample (90 evaluable bugs)
- Opus-4.5 All-hunks Top-5: Standard 43.3% → Semi-formal 47.8% (+4.5-5pp).
- Any-hunk Top-5: Standard 65.6% → Semi-formal 68.9%.
- Context: Gains persist at scale, though harder, multi-file bugs reduce absolute numbers.
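For reference, the Top-5 scoring used above can be sketched as follows. This is our reading of standard fault-localization metrics (the paper's exact scoring may differ), and the file:line predictions are hypothetical:

```python
# Sketch of Top-5 scoring: "all-hunks" requires every buggy line group to be
# hit within the top 5 predictions; "any-hunk" requires at least one hit.
# File names and line numbers below are made up for illustration.

def top5_hit(predicted_lines: list[str], buggy_hunks: list[set[str]],
             mode: str = "any") -> bool:
    top5 = set(predicted_lines[:5])
    hits = [bool(hunk & top5) for hunk in buggy_hunks]
    return all(hits) if mode == "all" else any(hits)

preds = ["Foo.java:120", "Bar.java:42", "Foo.java:7", "Baz.java:9", "Foo.java:121"]
hunks = [{"Foo.java:120", "Foo.java:121"}, {"Qux.java:300"}]
print(top5_hit(preds, hunks, mode="any"))  # True: the first hunk is covered
print(top5_hit(preds, hunks, mode="all"))  # False: Qux.java:300 is never predicted
```

This also explains why any-hunk numbers run well above all-hunks numbers: multi-file bugs only count under all-hunks when every hunk is found.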
- Code QA, RubberDuckBench
- Opus-4.5: Single-shot 76.2% → Agentic standard 78.3% → Agentic semi-formal 87.0% (+8.7pp over standard agentic).
- Sonnet-4.5: Single-shot 78.7% → Agentic standard 84.2% → Agentic semi-formal 84.8% (near plateau).
- Context: Trace tables and data-flow sections especially help models prone to name-based guessing.
Most common failure modes:
- Incomplete tracing: The agent assumes a function's behavior instead of fully following the path.
- Third-party semantics: When source isn't visible, the agent infers from names and can be wrong.
- Dismissing subtle differences: It finds a real divergence but wrongly decides tests won't hit it.
Takeaway: Across very different tasks, the same structural idea (premises → traces → conclusion) consistently boosts reliability without running repo code. It's a broadly useful discipline, not a task-specific trick.
05 Discussion & Limitations
Limitations (honest view):
- More steps and time: Semi-formal reasoning costs ~2-3× more actions than standard reasoning.
- Static-only blind spots: With no repo execution, behavior hidden in binary libs or complex frameworks can be misread.
- Template fit: If the template doesn't match the task well, the agent may fill it neatly but miss the heart of the problem.
- Missed paths: Even with structure, agents can overlook a branch or assertion and draw the wrong verdict.
- Grader dependence (QA): Rubric-based LLM grading isn't a perfect oracle, though agreement is high.
Required resources:
- An agentic environment (SWE-style tools: search, file read) with context windows large enough for multi-file reading.
- A strong base LLM (e.g., Opus-4.5/Sonnet-4.5) that can follow templates and cite evidence.
- Access to tests (especially F2P) improves patch verification significantly.
When not to use:
- Time-critical responses where extra steps are too expensive.
- Tasks needing exact formal guarantees (e.g., safety-critical code proofs): semi-formal reasoning is checkable by humans but not machine-verified.
- Projects dominated by opaque third-party binaries where reading can't reveal true behavior.
Open questions:
- Can post-training bake the template into the model to reduce step count without losing rigor?
- How best to blend this with lightweight formal tools (symbolic execution, Datalog checks) for stronger guarantees?
- Can we auto-generate and tailor templates per repo or per framework to reduce template-task mismatch?
- How to robustly handle third-party semantics when sources aren't available (API mining, doc grounding, learned surrogates)?
- Can we convert natural-language certificates into partially machine-checkable artifacts to catch subtle slip-ups automatically?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that making AI agents fill a semi-formal certificate (premises, per-test execution traces, and a formal conclusion) dramatically improves their ability to reason about code without running it. Across patch equivalence, code question answering, and fault localization, the structured approach beats standard reasoning by 5-12 percentage points, reaching 93% verification accuracy on real patches with test specs. The result is a practical, scalable middle path between free-form thoughts and hard formal proofs.
Main achievement: Turning "show your work" into a reusable, task-specific template that consistently upgrades agent reliability for semantic code analysis without execution.
Future directions: Fine-tune models to internalize the structure (fewer steps, same rigor); extend templates to security, API misuse, and smells; and hybridize with symbolic or lightweight formal checks for added guarantees. Also, explore auto-adapting templates to languages/frameworks and partially machine-checkable certificates.
Why remember this: It reframes code reasoning from guessy storytelling to documented proof, usable today across real repositories. By staying in natural language while enforcing discipline, it offers a powerful new default for agent reliability, saving cost, improving trust, and opening safer, execution-free workflows in training and review.
Practical Applications
- Execution-free patch triage: Quickly decide if an incoming patch matches a reference fix before running CI.
- Cheaper RL rewards: Train coding agents using equivalence certificates instead of expensive test execution.
- Pre-merge code review: Ask the agent for a semi-formal certificate explaining risks and test impacts.
- Static bug hunting: Use fault-localization templates to pinpoint likely buggy lines without running code.
- Repository Q&A assistant: Answer deep "what does this code do?" questions with trace tables and citations.
- Regression risk scanning: Compare two candidate fixes and predict which tests would differ.
- Framework migration checks: Verify that refactors preserve behavior modulo tests across modules.
- Security pre-scan: Reason about potential misuse paths (e.g., dangerous call sites) without executing untrusted code.
- Third-party impact analysis: Trace how API changes ripple through the codebase based on call graphs and tests.
- Education and onboarding: Teach new engineers code structure with evidence-cited traces and data-flow summaries.