
ContextBench: A Benchmark for Context Retrieval in Coding Agents

Intermediate
Han Li, Letian Zhu, Bohan Zhang et al. Ā· 2/5/2026
arXiv

Key Summary

  • ContextBench is a new benchmark that checks not just whether a coding AI fixes a bug, but whether it found and used the right pieces of code along the way.
  • It comes with human-verified gold contexts—compact sets of code that experts say are necessary to solve each issue—so we can grade the AI’s scavenger hunt, not only its final answer.
  • The benchmark spans 1,136 real issues across 66 repositories and 8 programming languages, with gold contexts labeled at file, block (AST definition), and line levels.
  • An automated framework records what code the agent actually read and compares it to the gold context using recall, precision, and F1, plus new process metrics like efficiency, redundancy, and evidence drop.
  • Across five agents, fancier scaffolding did not reliably improve context retrieval; a simple baseline often matched or beat them (ā€œThe Bitter Lessonā€ for coding agents).
  • Across four frontier LLMs, models tended to favor recall over precision—grabbing lots of files/lines (coverage) but also extra noise—hurting overall F1.
  • Balanced retrieval (moderate steps and moderate context per step) led to better accuracy and lower costs than grabbing huge chunks or taking too many tiny steps.
  • Agents often looked at the right code during exploration but failed to keep or use it when writing the final patch, showing a big gap between retrieved and utilized context.
  • Gold contexts were robust: even when multiple correct patches existed for the same issue, their needed contexts matched with high Jaccard similarity (ā‰ˆ0.95).

Why This Research Matters

ContextBench helps teams build coding AIs that are reliable partners, not just lucky patch generators. By checking whether an agent read the right files, functions, and lines—and whether it kept and used them—teams can diagnose and fix the real causes of failures. This reduces wasted tokens, speeds up debugging, and cuts cloud costs by encouraging balanced, efficient retrieval. For safety and trust, it exposes when agents succeed for the right reasons instead of overfitting to tests. And because the benchmark spans many languages and real repositories, improvements transfer to real-world software work, from open source maintenance to enterprise codebases.

Detailed Explanation


01Background & Problem Definition

šŸž You know how when you’re solving a big jigsaw puzzle, finding the right pieces matters as much as finishing the picture? If a friend hands you the finished puzzle, you can’t tell whether they found the right pieces quickly or just tried every piece randomly until it worked.

🄬 The Concept: What the world was like before ContextBench

  • What it is: Before this research, coding AIs (agents) were judged mostly by whether their final code passed the tests, not by how they searched through the codebase to solve the problem.
  • How it works (before):
    1. Give an AI a real GitHub issue.
    2. Let it explore a large repository.
    3. If its final patch passes the test suite, we call it a success.
  • Why it matters: Without seeing how the AI found information, we don’t know if it truly understood the code or just got lucky. That makes it hard to improve agents or trust them in real projects.

šŸž Anchor: Imagine grading a math student only by the final answer, never checking their work. You’d miss whether they used the right steps, made lucky guesses, or copied.

šŸž You know how librarians don’t just care if you bring back a book, but also that you picked the right book in the first place? They want to know your search worked.

🄬 The Concept: The problem researchers faced

  • What it is: Existing benchmarks (like SWE-bench and SWE-bench Pro) focus on end-to-end success (Pass@k) but ignore how agents find code context inside huge repositories.
  • How it works (the gap):
    1. Agents might scroll through tons of files.
    2. They might miss the key function or pull in lots of irrelevant code.
    3. We have no standardized way to check if what they read matched what they needed.
  • Why it matters: If agents pass tests by trial-and-error or overfitting, they won’t be reliable teammates for developers. Companies care about cost, speed, and trust—not just green test bars.

šŸž Anchor: It’s like saying a detective is great because the culprit was caught, without checking whether they followed clues or just arrested everyone until someone confessed.

šŸž Imagine your teacher highlighting the exact paragraphs you need to answer a question. If you study those, you should solve the problem faster and better.

🄬 The Concept: Gold contexts

  • What it is: Human-annotated gold contexts are compact sets of files, functions/classes (blocks), and lines that experts say are sufficient to fix an issue.
  • How it works:
    1. Start from the real patch that fixed the issue.
    2. Trace code dependencies and related definitions.
    3. Keep only what’s necessary (minimal but sufficient).
    4. Verify sufficiency by asking a strong LLM to fix the issue using only that context and checking tests.
  • Why it matters: With gold contexts, we can fairly grade whether an agent found what matters—not just whether it shipped a passing patch.

šŸž Anchor: It’s like a treasure map that marks only the essential clues to find the treasure, not every rock and tree.

šŸž You know how sports coaches use stats during the game (passes, shots, turnovers), not just the final score? Those stats tell you what worked.

🄬 The Concept: Process-oriented evaluation

  • What it is: Instead of grading only the final patch, ContextBench tracks which code the agent looked at during the task and measures how well that matches the gold context.
  • How it works:
    1. Instrument the agent to log every file and line it reads (agent trajectory).
    2. Map both gold and agent-read code into the same coordinates: file, block (AST definition), and line.
    3. Compute recall, precision, and F1 over those granularities.
  • Why it matters: These signals reveal if the agent looked at the right places early, added noise, looped, or dropped crucial info before the final patch.

šŸž Anchor: Like a fitness tracker that shows not just that you reached 10,000 steps, but when you walked, how efficiently, and whether you repeated the same route again and again.

šŸž Imagine two ways to study: read the whole textbook (high recall, low precision) or read only a few pages (high precision, low recall). Neither alone is ideal.

🄬 The Concept: Recall, precision, and F1

  • What it is: Recall is ā€œhow much of the needed stuff did you find?ā€ Precision is ā€œhow much of what you grabbed was actually needed?ā€ F1 balances both.
  • How it works:
    1. Recall = overlap with gold / size of gold.
    2. Precision = overlap with gold / size of what agent grabbed.
    3. F1 = harmonic mean that rewards balanced performance.
  • Why it matters: Agents that read everything drown in noise; agents that read too little miss key clues. F1 shows the balance.

šŸž Anchor: If the gold context has 10 important paragraphs and you read 9 of them (recall 0.9) but you also read 90 random pages (low precision), you probably won’t write a clear essay.

šŸž Think of reading a chapter vs a specific paragraph vs a single sentence to answer a question.

🄬 The Concept: Multi-granularity (file, block, line)

  • What it is: ContextBench grades retrieval at three levels: whole files, definition blocks (functions/classes), and specific lines.
  • How it works:
    1. File level: Did you open the right files?
    2. Block level: Did you read the right functions/classes (AST definitions)?
    3. Line level: Did you inspect the exact lines that matter?
  • Why it matters: A model might find the right file but miss the key function; finer levels expose where retrieval breaks.

šŸž Anchor: Finding the right book (file) is good, the right chapter (block) is better, and the exact paragraph (line) is best for precise fixes.

šŸž You know how some students create super fancy study systems that don’t actually help them learn faster?

🄬 The Concept: The Bitter Lesson for coding agents

  • What it is: More complicated agent scaffolding didn’t reliably beat a simpler baseline in finding the right context.
  • How it works:
    1. Compare five agents with different retrieval designs.
    2. Measure recall/precision/F1 at three granularities.
    3. See that simple, iterative shell-based exploration often matches or outperforms fancy orchestration.
  • Why it matters: Over-engineering can add cost and confusion without better results.

šŸž Anchor: A neat, simple to-do list can beat a complicated app with too many buttons when you just need to study the right pages.

Finally, why this matters: in real software teams, we want agents we can trust. That means they must find, keep, and use the right code context—not just stumble into a passing patch. ContextBench gives us the flashlight to see that process clearly.

02Core Idea

šŸž Imagine giving a treasure hunter not only the treasure’s final location but also the exact set of clues that are sufficient to find it. Now you can judge how well they follow the clues—not just whether they reached the treasure.

🄬 The Concept: The ā€œAha!ā€ in one sentence

  • What it is: ContextBench introduces gold contexts and an automated, multi-granular comparison of what agents read vs what they needed, turning opaque end-to-end success into a transparent, process-level evaluation.
  • How it works:
    1. Build a benchmark of real issues with expert-verified, minimal-sufficient code contexts (gold contexts).
    2. Instrument agents to log every file/block/line they read (their trajectory).
    3. Align both sets via a shared coordinate system (file paths, AST definition blocks, line ranges) and compute recall/precision/F1 plus new dynamics metrics.
  • Why it matters: We can finally see whether agents retrieved the right evidence, how early they found it, how much they re-read, and what they dropped before patching.

šŸž Anchor: Like grading a science fair project by the quality of the lab notes and experiments, not only the shiny poster at the end.

Multiple analogies to lock it in:

  • Map analogy: The gold context is the essential route; the agent’s path is what it actually walked. We compare the two routes.
  • Study guide analogy: The gold context is the teacher’s highlighted notes; the agent’s reading log is what it studied. We check overlap and extra fluff.
  • Cooking analogy: The gold context is the minimal recipe; the agent’s shopping list is what it bought. Did it get the needed ingredients without overfilling the cart?

Before vs After:

  • Before: Benchmarks said ā€œpass or failā€ based on tests, hiding whether the agent understood the repo’s structure or just brute-forced.
  • After: We see retrieval quality at file/block/line levels, track efficiency (how early gold is found), redundancy (how much re-reading), cost patterns, and the painful gap between explored and actually used evidence.

Why it works (intuition, not equations):

  • If you know what evidence is sufficient (gold context), you can ask whether the agent ever saw that evidence (recall), how cleanly it selected it (precision), and how well it balanced the two (F1).
  • Breaking the comparison across file, block, and line levels shows exactly where retrieval fails (book vs chapter vs paragraph).
  • Logging the trajectory lets you diagnose process health: early coverage is good (efficiency), repeated re-reading wastes budget (redundancy), and dropping seen-good evidence signals consolidation problems (evidence drop).

Building blocks (each with a mini-sandwich):

šŸž You know how teachers remove duplicate questions in a test bank? 🄬 What it is: Task deduplication removes exact/near-duplicate issues from multiple sources so each task is unique. How: Rule-based ID checks, embedding similarity, and manual review. Why: Prevents double-counting easy patterns and keeps evaluation fair. šŸž Anchor: Like making sure your quiz doesn’t accidentally ask the same question twice.

šŸž Imagine picking puzzle challenges that really test your skills, not just easy border pieces. 🄬 What it is: Task selection uses difficulty signals—agent solvability, edit scope, and edit dispersion—to choose challenging cases. How: Prefer tasks unsolved or rarely solved; prefer patches that touch more files and are dispersed across the repo tree. Why: Forces real context retrieval, not toy lookups. šŸž Anchor: Like choosing puzzles where important pieces are scattered across the table, not clumped.

šŸž Think of a teacher highlighting only the necessary sentences to answer a question. 🄬 What it is: Gold context annotation traces from the true patch to find necessary files/blocks/lines, then trims extras. How: Experts trace dependencies, refine for compactness, and verify sufficiency with an LLM that gets only that context. Why: Creates a ground truth for what information actually matters. šŸž Anchor: A clean, minimal study guide that still lets you ace the test.

šŸž Picture turning a messy bookshelf into a neat catalog so you can compare two reading lists. 🄬 What it is: A shared coordinate system using file paths, AST blocks (definitions), and line ranges. How: Parse repos with tree-sitter, standardize blocks as function/class/interface/trait definitions across languages. Why: Makes apples-to-apples comparisons across languages possible. šŸž Anchor: Labeling every book by shelf, chapter, and page so two readers’ notes can be aligned.

šŸž Think of a coach tracking not just total points but also early leads and wasted moves. 🄬 What it is: Process metrics—efficiency (early gold coverage), redundancy (re-reading), and evidence drop (saw it but didn’t keep it). How: Compute coverage curves over steps, overlap between steps, and kept-vs-seen gold evidence. Why: Reveals where the process breaks: late discovery, looping, or forgetting. šŸž Anchor: A replay that shows you led early but kept running the same play and then forgot the key move at crunch time.

That’s the core idea: with gold contexts and trajectory-aware, multi-granular metrics, we can finally see and score how coding agents actually think with code.

03Methodology

At a high level: Real issues → (1) Deduplicate tasks → (2) Select hard tasks → (3) Annotate/verify gold context → (4) Run agents with trajectory logging → (5) Map agent-read vs gold to the same coordinates → (6) Compute recall/precision/F1 and dynamics → (7) Analyze results by agent and model.

Step 1: Task Deduplication

  • What happens: Combine issues from SWE-bench Verified, Multi-SWE-bench, SWE-PolyBench PB500, and SWE-bench Pro. Remove duplicates with rule checks and embedding-based similarity, plus human review.
  • Why this step exists: If the same or near-same issue appears twice, results are biased and inflated.
  • Example: Two tasks with nearly identical descriptions across forks—keep one to preserve fairness.

Step 2: Task Selection

  • What happens: Rank by (a) agent solvability, (b) edit scope (how many files changed), and (c) edit dispersion (how far apart edits are in the repo tree via tree-sitter structure). Prefer unsolved/rarely solved tasks with broad, scattered edits.
  • Why it matters: Hard, dispersed edits force real context retrieval across modules.
  • Example: Prefer a bug whose fix touches three libraries and a core module over a one-line change in a single file.

Step 3: Expert Annotation of Gold Contexts

  • What happens: Start from the ground-truth patch. Trace dependencies (functions, classes, control/data flow) to gather only the necessary files/blocks/lines. Then verify sufficiency by prompting a strong LLM with only that context to produce a passing patch (≄1 of 5 attempts). Trim any redundancies through second-annotator review and consensus.
  • Why it matters: Produces compact-yet-sufficient gold references to grade retrieval, not just correctness.
  • Example: For a change in manager.py, include the class definition it modifies, the helper function it calls in utils.py, and the constant in settings.py it relies on—but exclude unrelated neighbors.

Step 4: Agent Context Tracing (Trajectory Logging)

  • What happens: Instrument agents to log every viewed code segment: absolute file path and line ranges. Normalize to <PATCHCONTEXT> blocks. Require a final, explicit declaration of the code context used to craft the patch.
  • Why it matters: Creates a complete trail of what the agent actually read vs what it claims it used—crucial for grading process, not just outcomes.
  • Example: The log shows the agent opened parser.go lines 40–95 and model.py lines 120–168; later it declares only parser.go in the final context—revealing a potential evidence drop.

Step 5: Shared Coordinate System via tree-sitter

  • What happens: Parse repositories with tree-sitter to extract standardized units: files, definition-level blocks (functions/classes/interfaces/traits), and precise line ranges. Map both gold and agent contexts into this grid for alignment.
  • Why it matters: Ensures consistent, cross-language comparison at multiple granularities.
  • Example: A Python function definition and a Java method definition are both treated as blocks with start/end lines for fair comparison.
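The paper parses all 8 languages with tree-sitter; as a Python-only stand-in for illustration (an assumption, not the paper's tooling), the stdlib `ast` module can extract the same kind of definition-level blocks with start/end lines.

```python
# Stand-in sketch: extract definition-level blocks (name, start, end line)
# from Python source. The paper uses tree-sitter for cross-language support;
# stdlib `ast` is used here only to illustrate the block coordinate system.
import ast

SOURCE = '''\
class Manager:
    def save(self):
        return True

def helper(x):
    return x + 1
'''

def definition_blocks(source):
    blocks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            blocks.append((node.name, node.lineno, node.end_lineno))
    return blocks

print(definition_blocks(SOURCE))
# e.g. ("Manager", 1, 3), ("save", 2, 3), ("helper", 5, 6)
```

Each block's (start, end) line range is the shared coordinate that lets a gold annotation and an agent's read log be compared at block granularity, regardless of source language.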

Step 6: Metrics Computation (Retrieval and Dynamics)

  • What happens: Compute recall, precision, and F1 at file, block, and line levels using interval overlaps. Also compute process metrics:
    • Efficiency (AUC-Cov): how quickly cumulative gold coverage grows over steps.
    • Redundancy: how much each new step repeats previously read code.
    • Evidence drop: fraction of seen-gold not kept in the final context.
  • Why it matters: These reveal whether the agent finds the right places early, spins wheels, or forgets critical evidence before patching.
  • Example: An agent with high recall but low F1 might be grabbing too many unrelated lines (low precision). A high drop means it saw gold but didn’t carry it into the final patch.
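The three process metrics can be sketched under assumed definitions (the paper's exact formulas may differ): contexts are sets of gold line IDs, a trajectory is a list of per-step read sets, and the numbers in the example are invented.

```python
# Sketch of the three process metrics under assumed definitions
# (not the paper's exact formulas). A "step" is the set of lines read then.

def auc_cov(steps, gold):
    """Efficiency: normalized area under the cumulative gold-coverage curve."""
    seen, curve = set(), []
    for step in steps:
        seen |= step
        curve.append(len(seen & gold) / len(gold))
    return sum(curve) / len(curve)

def redundancy(steps):
    """Average fraction of each step that was already read earlier."""
    seen, ratios = set(), []
    for step in steps:
        if step:
            ratios.append(len(step & seen) / len(step))
        seen |= step
    return sum(ratios) / len(ratios) if ratios else 0.0

def evidence_drop(steps, gold, final_context):
    """Fraction of seen-gold missing from the final declared context."""
    seen_gold = set().union(*steps) & gold
    if not seen_gold:
        return 0.0
    return len(seen_gold - final_context) / len(seen_gold)

gold = set(range(10))
steps = [set(range(5)), set(range(3, 8)), set(range(8, 10))]
final = set(range(6))                       # agent kept only lines 0-5
print(auc_cov(steps, gold), redundancy(steps), evidence_drop(steps, gold, final))
```

In this toy run the agent covers all gold by step 3 (decent efficiency), re-reads a little at step 2 (low redundancy), but drops 4 of the 10 gold lines it saw before patching, which is exactly the consolidation failure the metric is meant to expose.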

Step 7: Evaluation Settings and Comparisons

  • What happens: Evaluate five agents (mini-SWE-agent baseline; Agentless; SWE-agent; OpenHands; Prometheus) and four LLMs (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Devstral 2). For LLM comparisons, use the same simple scaffold (mini-SWE-agent) to standardize conditions.
  • Why it matters: Separates the effect of agent scaffolding from the model’s own retrieval behavior.
  • Example: If a fancy scaffold underperforms the baseline with the same LLM, that suggests over-engineering.

Secret sauce (what’s clever):

  • Gold contexts are minimal-but-sufficient and verified by executable tests using an LLM restricted to that context, turning subjective annotation into actionably validated ground truth.
  • Multi-granularity alignment (file/block/line) finds exactly where retrieval fails—book, chapter, or paragraph.
  • Process metrics move from static snapshots to dynamic health checks of retrieval: early hits (efficiency), re-reading (redundancy), and memory/selection failures (evidence drop).

Mini-sandwiches for key methodology concepts:

šŸž You know how a librarian labels books by shelf, section, and page ranges so two readers can compare notes? 🄬 What it is: AST block alignment standardizes ā€œblocksā€ as definition-level units across languages. How: Use tree-sitter to extract function/class/interface/trait definitions with start/end lines. Why: Cross-language fairness. šŸž Anchor: Matching a Python function and a Java method as comparable recipe steps.

šŸž Picture checking if a student studied the teacher’s highlights early, or only at the last minute. 🄬 What it is: Efficiency (AUC-Cov) measures how early the agent covers gold context. How: Track cumulative recall over steps; compute normalized area under the curve. Why: Early coverage helps reasoning and reduces wasted steps. šŸž Anchor: Getting the key hints in the first few minutes of an exam.

šŸž Think of re-reading the same paragraph over and over. 🄬 What it is: Redundancy measures repeated retrieval of already seen regions. How: Compute overlap of each step with previous steps. Why: Wastes tokens, time, and cost. šŸž Anchor: Looping the same page instead of moving to the next clue.

šŸž Imagine noticing the right clue while studying but forgetting to bring it to the test. 🄬 What it is: Evidence drop measures gold evidence seen during exploration but missing from the final context used for patching. How: Compare seen-gold vs kept-gold in the final <PATCHCONTEXT>. Why: Reveals consolidation/selection failures. šŸž Anchor: You highlighted the answer but didn’t copy it to your cheat sheet (allowed one!).

Together, these steps turn opaque agent runs into measurable, comparable, and improvable processes.

04Experiments & Results

The test: what and why

  • What they measured: Context recall, precision, and F1 at file, block, and line granularity; process metrics (efficiency, redundancy, evidence drop); and final Pass@1 (did the patch pass on the first try?).
  • Why: To see whether agents and LLMs retrieve the right evidence, balance breadth vs focus, manage cost, and actually use what they find to produce correct patches.

The competition: who was compared

  • Agents: mini-SWE-agent (simple baseline), Agentless, SWE-agent, OpenHands, Prometheus—each run with a strong LLM (GPT-5) for a fair scaffold comparison.
  • LLMs: GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Devstral 2—each run inside the same simple scaffold (mini-SWE-agent) for a fair model comparison.

Scoreboard highlights (ContextBench Lite):

  • Agent comparison (with GPT-5):
    • Simple often wins: mini-SWE-agent achieved competitive or better context retrieval than more complex scaffolds. For example, file-level recall/precision/F1 ā‰ˆ 0.682/0.709/0.634; line-level F1 ā‰ˆ 0.312; Pass@1 ā‰ˆ 0.472.
    • Complex scaffolding did not guarantee better retrieval: some agents had higher recall but worse precision and F1 than the baseline, echoing the ā€œBitter Lesson.ā€
  • LLM comparison (in mini-SWE-agent):
    • GPT-5: file-level ā‰ˆ 0.682/0.709/0.634; block-level ā‰ˆ 0.645/0.369/0.375; line-level ā‰ˆ 0.606/0.301/0.312; Pass@1 ā‰ˆ 0.472.
    • Claude Sonnet 4.5: stronger balance; e.g., line-level precision/recall/F1 ā‰ˆ 0.374/0.588/0.344 and the best Pass@1 ā‰ˆ 0.530 among listed results.
    • Gemini 2.5 Pro: tended toward precision in some tiers but struggled on line-level F1 (ā‰ˆ 0.311) and Pass@1 ā‰ˆ 0.364.
    • Devstral 2: mixed performance; block-level precision relatively higher (ā‰ˆ 0.576) but overall line-level F1 ā‰ˆ 0.332 and Pass@1 ā‰ˆ 0.402.

Make the numbers meaningful:

  • File-level F1 around 0.6 is like finding the right books most of the time.
  • Block-level F1 under ~0.45 is like often missing the right chapters once inside the right books.
  • Line-level F1 under ~0.35 is like frequently skimming past the exact paragraphs that matter.
  • When Claude gets better F1 and Pass@1 than GPT-5 here, it’s like scoring an A- while others score B’s on picking and using the right clues.

Surprising (and useful) findings:

  1. More scaffolding ≠ better retrieval. Fancy orchestration sometimes added overhead and noise without improving F1 or Pass@1 over the simple baseline.
  2. Recall over precision bias. All LLMs tended to grab broad context (high recall) but with lots of extra lines (low precision), lowering F1—and sometimes final success.
  3. Balanced retrieval saves money. Models that took moderate steps with moderate chunk sizes did better and cost less than models that either grabbed huge chunks in few steps or took many tiny steps.
  4. Big gap between explored vs used context. Agents often saw the right lines during exploration but dropped them before patching (high evidence drop). This hurts even if recall looked good mid-run.

Process dynamics (examples from reported tables):

  • Steps, granularity, and cost:
    • GPT-5 retrieved in fewer rounds (~5.87) with bigger chunks (~119 lines/step), costing ~$0.45/instance.
    • Claude balanced rounds (~14.38) and chunk size (~29.74), costing ~$0.76/instance.
    • Gemini used ~7.57 rounds and ~26.29 lines/step, with ~$0.38/instance.
    • Devstral 2 used many rounds (~22.16) with tiny chunks (~11.98), costing ~$0.91/instance.
    • Takeaway: Moderate steps and chunk sizes correlated with better F1 and Pass@1; excessive rounds inflated cost.
  • Efficiency, redundancy, evidence drop:
    • Claude showed high efficiency (~0.658) but also high redundancy (~0.708), suggesting it found gold early but re-read a lot.
    • GPT-5: efficiency ~0.591, redundancy ~0.487, drop ~0.179.
    • Gemini and Devstral showed large evidence drops (~0.431 and ~0.435), aligning with lower Pass@1.

Case studies (why failures happen):

  • Missing constructor semantics: An agent read methods of a dict-like class but missed its init contract, causing many test failures despite partial retrieval.
  • File localization failure: Chasing surface keywords led an agent away from the true source file where the error code’s logic lived.
  • Cross-module tunnel vision: Grep anchored exploration to MySQL while the bug also lived in SQLite/Oracle; evidence from reproduction didn’t shift the search.

Overall: ContextBench turns these into measurable patterns so we can fix the root causes: over-broad grabs, late coverage, looping, and especially evidence consolidation.

05Discussion & Limitations

Limitations (honest look):

  • Gold contexts are ā€œcompact and sufficient,ā€ not guaranteed globally minimal. There might be other equally valid small sets. The benchmark mitigates this with LLM-based sufficiency checks and secondary annotator pruning.
  • Annotation depends on expert judgment. Guidelines and cross-checks reduce bias, but human choices still shape what’s considered sufficient.
  • Some issues admit multiple correct patches. A case study showed high consistency (ā‰ˆ0.95 Jaccard), but not perfect; rare cases may differ.
  • Repository parsing and AST alignment rely on tree-sitter grammars. Edge cases in language parsing or unconventional code can affect block boundaries.
  • Environment and harness quirks can influence agent behavior (e.g., tool output, CLI prompts), though the framework standardizes as much as possible.

Required resources:

  • Datasets: Access to the curated repositories and test harnesses.
  • Tooling: tree-sitter for parsing; instrumented agent frameworks to emit <PATCHCONTEXT>.
  • Human time (if extending): Experts to annotate/verify new gold contexts.
  • Compute: Running agents and models over 1k+ tasks; token budgets for retrieval-heavy runs.

When not to use ContextBench (or use with care):

  • Tiny, single-file toy tasks where retrieval is trivial—file/block/line granularity adds little value.
  • Pure generation tasks (e.g., writing a module from scratch) where the challenge isn’t locating existing context.
  • Non-code artifacts (design docs, tickets without code links) where tree-sitter-based alignment doesn’t apply.

Open questions:

  • How to reduce evidence drop? Better memory, summarization, or structured note-taking could help agents keep the gold they already saw.
  • Can active learning use gold-context feedback to teach agents better retrieval policies?
  • How to automatically tune the recall–precision balance per task to optimize F1 and cost?
  • Can we generalize block definitions beyond AST to semantic slices (e.g., dataflow regions) for languages with metaprogramming or heavy indirection?
  • How to integrate runtime signals (failing tests, stack traces) into retrieval policies without overfitting to symptoms?

Bottom line: ContextBench doesn’t solve retrieval, but it finally lets us see it clearly—where it succeeds, where it fails, and how to improve it.

06Conclusion & Future Work

Three-sentence summary:

  • ContextBench is a benchmark that evaluates how coding agents retrieve and use the right code context, not just whether their final patch passes tests.
  • It supplies expert-verified gold contexts and an automated framework that tracks agent trajectories and scores retrieval at file, block, and line levels, plus process dynamics.
  • Results across five agents and four LLMs show that simple scaffolds can match or beat complex ones, models often favor recall over precision, and many agents drop crucial evidence before patching.

Main achievement:

  • Turning opaque, end-to-end task success into transparent, process-level evaluation with gold contexts and multi-granular, trajectory-aware metrics—giving the community actionable signals to improve retrieval and reasoning.

Future directions:

  • Train agents with gold-context supervision to reduce evidence drop and improve consolidation.
  • Adaptive retrieval policies that balance recall and precision per task for higher F1 and lower cost.
  • Richer semantic blocks (dataflow/symbol graphs) beyond AST definitions to capture tricky dependencies.
  • Integrate runtime hints (error codes, stack traces) without drifting into symptom-only fixes.

Why remember this:

  • If you can’t see how an agent found its clues, you can’t fix its study habits. ContextBench shines light on that path—showing when agents read the right book, the right chapter, and the exact paragraph—and whether they kept those clues to craft a reliable fix.

Practical Applications

  • Tune agent retrieval policies to balance recall and precision, targeting higher F1 at line level.
  • Add consolidation steps (summaries or structured notes) to reduce evidence drop between exploration and patching.
  • Use efficiency and redundancy metrics to cap re-reading loops and cut token costs.
  • Train with gold-context supervision (e.g., reward models that cover gold early and keep it in final context).
  • Auto-adjust chunk size and step count per task difficulty (adaptive granularity).
  • Incorporate error-code backtracing (map symptoms to source files) to improve file localization.
  • Adopt multi-language AST definitions (via tree-sitter) to standardize retrieval units across stacks.
  • Use ContextBench Lite for quick regression checks when updating agents or prompts.
  • Diagnose failure modes by comparing explored vs kept gold to decide whether to improve search or consolidation.
  • Benchmark different scaffolds with the same LLM (and vice versa) to isolate where gains come from.
Tags: context retrieval Ā· coding agents Ā· software engineering benchmarks Ā· gold context Ā· precision and recall Ā· F1 score Ā· AST block alignment Ā· tree-sitter Ā· agent trajectories Ā· process-oriented evaluation Ā· evidence drop Ā· retrieval efficiency Ā· redundancy Ā· SWE-bench Ā· repository-level tasks