
SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents

Intermediate
Yuhang Wang, Yuling Shi, Mo Yang et al. · 1/23/2026
arXiv · PDF

Key Summary

  • Coding agents waste most of their tokens just reading giant files, which makes them slow and expensive.
  • SWE-Pruner teaches the agent to say what it’s looking for (a Goal Hint) and then trims code lines that don’t help with that goal.
  • A tiny helper model (0.6B parameters) acts like a high-speed skimmer, keeping only the relevant lines and preserving code structure.
  • Working at the line level protects syntax, so the agent can still understand and edit code correctly.
  • Across multi-turn agent tasks, SWE-Pruner cuts 23–54% of tokens while slightly improving success by about 1–1.4 percentage points.
  • On single-turn long-context tasks, it reaches up to 14.84× compression with minimal accuracy loss.
  • It also shortens conversations (up to 26% fewer rounds), so agents finish faster with fewer detours.
  • Unlike fixed, one-size-fits-all compression, SWE-Pruner adapts to the task step by step using the Goal Hint.
  • Latency overhead is tiny (first-token under ~100 ms), so savings in decoding time and API cost dominate.
  • SWE-Pruner can plug into existing agents with a small wrapper around file-read tools, making adoption easy.

Why This Research Matters

In real software projects, agents spend more time reading than thinking, which inflates cost and slows down delivery. SWE-Pruner lets agents declare their goal and trims the view to just the helpful lines, so they stay focused and act faster. This lowers API bills, reduces latency, and cuts down on mistaken edits that come from noisy inputs. Teams running automated bug fixes or CI checks can scale further without exploding costs. Individual developers get a snappier coding assistant that finds and fixes issues with fewer detours. Over time, this kind of goal-aware skimming will be a standard add-on for any coding LLM that works with large repos.

Detailed Explanation


01 Background & Problem Definition

You know how your backpack gets heavy if you pack every single thing you might need? Carrying too much slows you down and makes you tired. Coding agents (AI helpers for programmers) have the same problem: when they open big codebases, they read too much at once. That makes them slow and costly because large language models (LLMs) pay for every token they read and write.

Before this paper, coding agents got really good at using tools like a terminal, an editor, and file search. They could explore repositories, run tests, and even fix bugs end-to-end. But there was a big roadblock: the context wall. If an agent keeps stuffing more and more code into its short-term memory (the context window), three things happen: the bill goes up, answers take longer, and the model gets distracted by noise. In fact, a careful measurement showed something surprising: about three-quarters of all tokens in an agent’s session were spent just on reading files, not on thinking or editing.

People tried to shrink the context. Some tools delete low-importance tokens (like removing little words), and some write summaries. Those tricks help on regular text, but code is picky. If you delete the wrong token from code, you can break the syntax. If you summarize code too much, you might lose tiny but crucial details, like a colon or a variable name—exactly the stuff you need to debug. Another issue: earlier methods used fixed rules (like a fixed compression ratio or perplexity thresholds) that don’t care what the agent is trying to do right now. But coding is a moving target: what matters in one step (e.g., "Where’s the error handling?") might be irrelevant in the next (e.g., "How is MRO computed?").

Because of these problems, agents either over-read (wasting tokens and time) or over-shrink (losing the pieces they need). What was missing was a way to keep just the helpful lines of code, while knowing what the current task is. Think of a programmer skimming a file: they don’t read every line—they scan with a goal in mind, like “Find where we set the timeout.”

This paper fills that gap with SWE-Pruner, a self-adaptive, task-aware way to cut down context. The trick is to have the agent say its goal out loud in plain language (a Goal Hint), then use a small, fast model to keep only the lines that match that goal. It trims at the line level (not at the token level), so code structure stays intact. It also adapts every round, following the agent’s changing needs.

Why should anyone care? Less reading means lower API bills, faster results, and fewer hallucinations because the model isn’t drowning in irrelevant code. That’s good for everyday developers who want a helpful, fast assistant inside their IDE, for teams running CI bots that must be efficient, and for companies operating at scale where every second and cent matters. SWE-Pruner shows you can be both smart and frugal: keep the right details, skip the fluff, and stay focused on the task.

02 Core Idea

Aha! Moment in one sentence: Let the agent tell a small helper what it’s looking for right now, then keep only the lines that match that goal.

Multiple analogies:

  • Packing a trip bag: Before, the agent stuffed the whole closet into the suitcase. After, it reads the itinerary (Goal Hint) and packs only what’s needed for today.
  • A museum tour: Before, you wandered every hall. After, a guide (the skimmer) leads you straight to the exhibits that match your interests.
  • Highlighter reading: Before, you read every word. After, you highlight relevant sentences based on your question and skim the rest.

Before vs. After:

  • Before: Agents shovel in huge file dumps; tokens balloon; answers slow down; syntax often breaks with token-level pruning; summarizers drop key details.
  • After: Agents generate a Goal Hint; a tiny neural skimmer scores each line for relevance; only relevant lines are kept; tokens drop by 23–54% on agent tasks; success nudges up ~1–1.4 percentage points; conversations often end sooner (up to 26% fewer rounds).

Why it works (intuition without equations):

  • Query awareness: Relevance is defined by the task at hand, not by generic statistics like perplexity. This makes the filter pick the right things for the current step.
  • Line-level granularity: Keeping whole lines preserves structure (functions, blocks), so the code remains understandable and editable.
  • Sequence sense: A CRF head prefers smooth, sensible keep/remove patterns (e.g., don’t drop the start of a function but keep its middle), so syntax and logic stay coherent.
  • Lightweight helper: A 0.6B reranker-based skimmer is fast, so the time you spend skimming is much smaller than the time and money you save on the big model’s long inputs and outputs.

Building blocks explained with the Sandwich pattern:

🍞 Top Bread (Hook): Imagine you’re cleaning your room: you put only the things you need for school into your backpack. 🥬 The Concept: Self-Adaptive Context Pruning is a way for an agent to keep only the code lines it needs right now, based on its current goal.

  • How it works:
    1. The agent states its goal (Goal Hint).
    2. A small model scores each line for relevance to that goal.
    3. Lines with high scores are kept; others are removed.
  • Why it matters: Without it, the agent drags around too much code, gets distracted, and wastes tokens. 🍞 Bottom Bread (Anchor): When fixing a bug, the agent keeps only lines about error handling instead of the entire file.

🍞 Top Bread (Hook): You know how a teacher asks, “What are you trying to find?” before helping? 🥬 The Concept: Goal Hint Generation is the agent writing a short, clear question about what it needs (e.g., “How is authentication handled?”).

  • How it works:
    1. The agent adds a natural-language question to read commands.
    2. The skimmer uses the question to decide which lines to keep.
    3. If no hint is given, full output is returned (backward compatible).
  • Why it matters: Without a goal, filters guess and often keep the wrong things. 🍞 Bottom Bread (Anchor): “Focus on MRO resolution logic” helps the skimmer keep the class and method lines about MRO and drop unrelated lines. (A hypothetical tool-call shape follows this block.)
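
As a picture of what this looks like in practice, here is a hypothetical shape for an agent's read request carrying a Goal Hint. Only the context_focus_question field comes from the paper's description; the surrounding field names are invented for illustration.

```python
# Hypothetical read request with a Goal Hint; only `context_focus_question`
# is named in the paper, the other keys are illustrative.
read_request = {
    "tool": "read_file",
    "path": "django/db/models/sql/query.py",
    "context_focus_question": "Where is the combined query cloning logic handled?",
}

# Omitting the hint means the pruner is bypassed and the full file comes back.
read_request_full = {"tool": "read_file", "path": "django/db/models/sql/query.py"}
```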

🍞 Top Bread (Hook): Think of a lifeguard quickly scanning a crowded pool for swimmers who need help. 🥬 The Concept: A Lightweight Neural Skimmer is a small model (0.6B) that rapidly scores lines for relevance to the goal.

  • How it works:
    1. It reads the goal and the code.
    2. It scores tokens, averages them per line, and uses CRF to favor coherent keep/remove decisions.
    3. It keeps lines over a threshold and returns the pruned text.
  • Why it matters: A small, fast helper saves time and money while guiding the big model. 🍞 Bottom Bread (Anchor): It keeps the 7 lines where a docstring is inherited but drops hundreds of unrelated lines. (A tiny scoring sketch follows this block.)
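
To make the skimming idea concrete, here is a tiny illustration of goal-conditioned line scoring. It uses an off-the-shelf cross-encoder from the sentence-transformers library as a stand-in; the paper's actual skimmer is a purpose-trained 0.6B reranker, so the model name, the sigmoid squashing, and the 0.5 cut-off below are illustrative assumptions, not the paper's setup.

```python
# Illustration only: an off-the-shelf cross-encoder standing in for the
# paper's trained 0.6B skimmer. Model name and threshold are assumptions.
import math
from sentence_transformers import CrossEncoder

def score_lines(goal_hint: str, code: str) -> list[tuple[float, str]]:
    """Score every line of `code` for relevance to `goal_hint`."""
    scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    lines = code.splitlines()
    raw = scorer.predict([(goal_hint, line) for line in lines])  # one score per line
    # Squash raw logits into (0, 1) so a simple cut-off is easy to read.
    return [(1 / (1 + math.exp(-s)), line) for s, line in zip(raw, lines)]

snippet = (
    "def read_config(path):\n"
    "    try:\n"
    "        with open(path) as f:\n"
    "            return json.load(f)\n"
    "    except (OSError, ValueError):\n"
    "        return {}\n"
)
for score, line in score_lines("How is error handling performed?", snippet):
    print("KEEP" if score > 0.5 else "drop", f"{score:.2f}", line)
```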

🍞 Top Bread (Hook): When reading a book, you keep sentences whole; you don’t delete random letters. 🥬 The Concept: Line-Level Granularity means choosing entire lines to keep or drop, protecting code structure.

  • How it works:
    1. Score tokens.
    2. Average scores per line.
    3. Keep or drop whole lines.
  • Why it matters: Token-level cuts can break syntax; line-level keeps functions and blocks readable and valid. 🍞 Bottom Bread (Anchor): Keep the whole “if/else” block instead of slicing words out of it. (A minimal aggregation sketch follows below.)
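
A minimal sketch of this aggregation step, assuming the skimmer has already produced token-level scores for each line (the 0.5 cut-off echoes the threshold described later in the Methodology):

```python
# Token scores -> per-line averages -> whole-line keep/drop decisions.
# `token_scores[i]` is assumed to hold the skimmer's scores for the tokens of lines[i].
def prune_by_line(lines: list[str], token_scores: list[list[float]],
                  tau: float = 0.5) -> str:
    kept = []
    for line, scores in zip(lines, token_scores):
        avg = sum(scores) / len(scores) if scores else 0.0
        if avg >= tau:
            kept.append(line)   # keep or drop the entire line, never a slice of it
    return "\n".join(kept)

# The middle line's tokens score low, so the whole line is dropped.
print(prune_by_line(
    ["try:", "    log_request(r)", "    handle(r)"],
    [[0.9, 0.8], [0.2, 0.1, 0.3], [0.7, 0.9, 0.8]],
))
```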

🍞 Top Bread (Hook): A jigsaw puzzle makes more sense when you place connected pieces together. 🥬 The Concept: Conditional Random Fields (CRF) help choose a smooth sequence of keep/drop labels so nearby lines form sensible chunks.

  • How it works:
    1. Predict how likely each line should be kept.
    2. Add a bonus for consistent neighbors (e.g., keep lines within the same function).
    3. Decode the best overall sequence of decisions.
  • Why it matters: Without CRF, you might keep the middle of a function but drop its header or closing brace. 🍞 Bottom Bread (Anchor): If you keep a method body, CRF nudges you to also keep its def line and closing indentation. (A toy smoothing sketch follows this block.)
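
The CRF intuition can be sketched with a toy Viterbi pass over keep/drop labels. This is a simplified stand-in for the paper's CRF head: the per-line relevance scores act as emission scores, a fixed bias plays the role of the drop option, and a same-label bonus rewards consistent neighbours (all three values are illustrative assumptions).

```python
# Toy stand-in for the CRF head: Viterbi decoding over keep(1)/drop(0) labels,
# where a same-label bonus smooths choppy decisions. Values are illustrative.
def smooth_keep_drop(line_scores: list[float], bias: float = 0.5,
                     same_label_bonus: float = 0.3) -> list[int]:
    emit = [(bias, s) for s in line_scores]   # (score for drop, score for keep)
    best = [emit[0]]                          # best path score ending in each label
    back = []                                 # backpointers for each transition
    for t in range(1, len(line_scores)):
        scores, ptrs = [], []
        for label in (0, 1):
            cand = [best[t - 1][prev] + (same_label_bonus if prev == label else 0.0)
                    for prev in (0, 1)]
            prev = max((0, 1), key=lambda p: cand[p])
            scores.append(cand[prev] + emit[t][label])
            ptrs.append(prev)
        best.append(tuple(scores))
        back.append(tuple(ptrs))
    labels = [max((0, 1), key=lambda l: best[-1][l])]
    for ptrs in reversed(back):               # walk the backpointers
        labels.append(ptrs[labels[-1]])
    return labels[::-1]

# The 0.4 line sits inside a relevant block, so smoothing keeps it,
# while a plain 0.5 cut-off would have dropped it.
print(smooth_keep_drop([0.9, 0.4, 0.8, 0.1, 0.05]))   # -> [1, 1, 1, 0, 0]
```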

03 Methodology

At a high level: Raw file output → Agent adds Goal Hint → Neural skimmer scores lines → Keep relevant lines → Pruned context goes back to the agent. (A runnable toy sketch of this flow follows the step list below.)

Step-by-step (with what, why, example):

  1. Input comes in from file reads (grep, cat, nl -ba, sed)
  • What happens: The agent runs a read command; the environment captures the output (often thousands of lines).
  • Why it exists: Agents must explore unfamiliar repos to find where logic lives, but raw reads are huge and noisy.
  • Example: cat -n django/db/models/sql/query.py prints >1,000 numbered lines.
  2. The agent adds a Goal Hint (context_focus_question)
  • What happens: Alongside the command, the agent includes a plain-language question like “Focus on MRO resolution logic.” If omitted, the pruner is bypassed for full compatibility.
  • Why it exists: The hint tells the skimmer what the agent is trying to do right now.
  • Example: After cat -n, the agent adds “Where is the combined query cloning logic handled?”
  3. Token scoring by the lightweight neural skimmer
  • What happens: The 0.6B reranker backbone reads the goal and the code, producing relevance scores for tokens, then averages them per line.
  • Why it exists: Tokens provide fine detail; averaging per line preserves structure and readability.
  • Example: Lines 537–541 (docstring inheritance) score high; lines about JSON encoding score low.
  4. Structured keep/drop with CRF
  • What happens: A CRF layer encourages coherent sequences: if a function body is useful, it’s more likely to keep the function header too.
  • Why it exists: Prevents choppy outputs that break code understanding.
  • Example: Keeping the if super_method is not None: line also keeps the surrounding for loop and method signature.
  5. Adaptive selection with a threshold
  • What happens: Lines with average score above τ (e.g., 0.5) are kept; others are dropped. Chunks are processed in parallel for speed.
  • Why it exists: A simple, fast rule converts scores into usable text while meeting compression targets.
  • Example: Output shrinks from 1,500 lines to 120 lines that answer the goal.
  6. Return pruned context to the agent
  • What happens: The agent receives a smaller, cleaner view of the file and continues reasoning or editing.
  • Why it exists: Less noise leads to clearer, faster decisions and fewer interaction rounds.
  • Example: The agent pinpoints the bug in Query.clone() and edits it directly.
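
Putting the steps together, here is a runnable toy sketch of the whole flow. The skimmer_score_lines stub is a crude keyword-overlap stand-in for the real 0.6B skimmer, the threshold and chunk size are arbitrary, and the real system also applies CRF smoothing before thresholding; treat this as the shape of the pipeline, not the paper's code.

```python
def skimmer_score_lines(goal_hint: str, lines: list[str]) -> list[float]:
    """Crude stand-in for the 0.6B skimmer: keyword overlap with the goal."""
    goal_words = set(goal_hint.lower().rstrip("?").split())
    return [len(goal_words & set(line.lower().split())) / max(len(goal_words), 1)
            for line in lines]

def prune(raw_output: str, goal_hint: str, tau: float = 0.1,
          chunk_size: int = 200) -> str:
    """Goal-aware, line-level pruning of raw file-read output."""
    lines = raw_output.splitlines()
    kept = []
    # Chunks could be scored in parallel; sequential here for simplicity.
    for start in range(0, len(lines), chunk_size):
        chunk = lines[start:start + chunk_size]
        scores = skimmer_score_lines(goal_hint, chunk)
        # The real system smooths scores with a CRF before applying the threshold.
        kept.extend(line for line, s in zip(chunk, scores) if s >= tau)
    return "\n".join(kept)

raw = "def connect(url):\n    retry = 3\n    timeout = 30\n    return open_socket(url)"
print(prune(raw, goal_hint="Where do we set the timeout?"))   # keeps only the timeout line
```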

Training the skimmer (recipe-style; a rough sketch of the combined objective follows the list):

  • Data: 61,184 synthetic samples built with a teacher LLM. Each sample has (Goal, Code, Line Mask, Doc Score).
  • Labels: Line masks show which lines to keep; doc scores teach the reranker to rate whole files.
  • Objective: Two heads are trained together—CRF-NLL for line sequences; MSE for doc-level scores.
  • Coverage: Queries span 9 common tasks (debugging, locating, refactoring, etc.) to generalize across agent workflows.
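
For readers who want the shape of the objective in code, here is a rough sketch of the two heads trained together, using the pytorch-crf package for the sequence term. Tensor shapes, variable names, and the equal weighting of the two losses are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F
from torchcrf import CRF   # pip install pytorch-crf

crf = CRF(num_tags=2, batch_first=True)   # per-line labels: 0 = drop, 1 = keep

def skimmer_loss(line_emissions: torch.Tensor,   # (batch, num_lines, 2) skimmer scores
                 line_labels: torch.Tensor,      # (batch, num_lines) gold 0/1 line mask
                 line_mask: torch.Tensor,        # (batch, num_lines) mask of real lines
                 doc_pred: torch.Tensor,         # (batch,) predicted document score
                 doc_gold: torch.Tensor) -> torch.Tensor:
    # CRF negative log-likelihood over the keep/drop sequence (line head).
    crf_nll = -crf(line_emissions, line_labels, mask=line_mask, reduction="mean")
    # MSE on the document-level relevance score (doc head).
    doc_mse = F.mse_loss(doc_pred, doc_gold)
    # Equal weighting is an assumption; the paper may balance the terms differently.
    return crf_nll + doc_mse
```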

Integration (how to plug in; a minimal wrapper sketch follows this list):

  • Wrap read tools with an optional context_focus_question parameter.
  • If provided, route output through prune(); otherwise, pass through unchanged.
  • Works across agents (e.g., Mini SWE Agent, OpenHands) and models (Claude Sonnet 4.5, GLM-4.6), with minimal code changes.
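
A minimal sketch of that wrapper might look like the following; read_file_tool and its registration details are assumptions (each agent framework exposes tools differently), and prune refers to the toy pipeline sketched in the Methodology steps above.

```python
# Hypothetical wrapper: the exact tool-registration mechanics differ per
# framework (Mini SWE Agent, OpenHands, ...); only the optional-parameter
# idea is taken from the paper's description.
def read_file_tool(path: str, context_focus_question: str | None = None) -> str:
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        raw = f.read()
    if context_focus_question is None:
        return raw                                         # pass-through: backward compatible
    return prune(raw, goal_hint=context_focus_question)    # see the prune() sketch above

# Example call the agent might make:
# read_file_tool("django/db/models/sql/query.py",
#                "Where is the combined query cloning logic handled?")
```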

What breaks without each step:

  • No Goal Hint: The skimmer guesses relevance and may miss the point.
  • No line-level aggregation: Token chopping breaks syntax and logic, hurting understanding and edits.
  • No CRF: Output becomes choppy; you might lose function headers or close braces.
  • No thresholding: You can’t guarantee compression and might return too much.
  • No pass-through mode: You’d harm compatibility and sometimes need the full file.

Concrete mini example:

  • Goal Hint: “How is error handling performed in this function?”
  • Raw output: 800 lines; the function appears at lines 490–560 with try/except at 535–548.
  • Skimmer scores lines; averages; CRF smooths; threshold keeps 498–541 and nearby structure.
  • Pruned output: About 120 lines centered on the target logic, easy to scan and edit.

The secret sauce:

  • Task-aware filtering (Goal Hints) means the skimmer stays aligned with the agent’s changing objectives.
  • Line-level decisions plus CRF keep code syntactically and semantically coherent.
  • A tiny, fast skimmer adds under ~100 ms TTFT while saving far more on the big model’s decoding and API costs.

04 Experiments & Results

The test: Can SWE-Pruner cut tokens and time while keeping or improving accuracy? The authors measure success rate on real bug-fixing (SWE-Bench Verified), answer quality on repo Q&A (SWE-QA), and accuracy on long single-turn tasks (Long Code Completion, Long Code QA). They also track token counts, compression ratios, interaction rounds, and API costs.

The competition: Baselines include keeping everything (Full) or nothing (No Context), token-level pruning (LLMLingua-2, Selective-Context), retrieval (RAG), code-structure compression (LongCodeZip), and summarization (LLM Summarize). All are tested under matching 4× and 8× constraints when applicable.

Scoreboard with context:

  • SWE-Bench Verified (multi-turn bug-fixing):
    • Claude Sonnet 4.5 agent: Tokens drop 23.1% (0.911M → 0.701M), success rises from 70.6% to 72.0% (+1.4 pts), rounds fall 18.2% (51.0 → 41.7), cost down ~26.8%.
    • GLM-4.6 agent: Tokens drop 38.3% (0.791M → 0.488M), success rises 55.4% → 56.6% (+1.2 pts), rounds fall 25.7% (49.3 → 36.6), cost down ~36.4%.
    • Think of tokens like minutes spent reading: cutting a quarter to a third while scoring a little higher is like finishing your homework faster and getting a slightly better grade.
  • SWE-QA (repo Q&A): Across Streamlink, Reflex, Conan, tokens fall 29–54% with similar or slightly better average judge scores on Claude; GLM sometimes explores more rounds, but still uses far fewer tokens overall.
  • Single-turn tasks:
    • Long Code Completion: Up to 10.92× effective compression at the 8× target with ES ~57.6 and EM ~31.0, beating token-level baselines that drop sharply under strong compression.
    • Long Code QA: Up to 14.84× compression at 8× with ~58.7% accuracy, outpacing other methods at similar targets.

Head-to-head strategy test (SWE-Bench subset of 50):

  • SWE-Pruner gets the top success rate (64%) using 31% fewer tokens than the vanilla agent and beats LLMLingua-2 (54%) and RAG (50%). Summarization helps a bit (56%) but adds generation latency and can miss fine details.

Surprising findings:

  • GLM-4.6 sometimes takes more exploration steps after pruning, likely because the focused views encourage careful checking. Even then, total tokens still drop a lot, so the approach remains efficient.
  • Syntax stays healthy under line-level pruning. AST correctness remains high compared to token-level methods, which can nearly break code structure (near-zero AST correctness for some baselines). A simple parse-based check is sketched after this list.
  • The skimmer’s first-token latency is tiny (under ~100 ms even at 8K tokens), especially compared to big models that can exceed a second. This means the pruning cost is easily repaid by the savings during big-model decoding and fewer rounds.
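
For Python, the AST-correctness idea can be checked with a few lines; this is a rough sketch of the check, and the paper's exact metric definition may differ.

```python
import ast

def ast_correct(pruned_code: str) -> bool:
    """Rough check: does the pruned snippet still parse as valid Python?"""
    try:
        ast.parse(pruned_code)
        return True
    except SyntaxError:
        return False

# Whole-line pruning tends to leave parseable code; chopping tokens inside a line does not.
print(ast_correct("def f(x):\n    return x + 1\n"))   # True
print(ast_correct("def f(x:\n    return x + 1\n"))    # False: a ')' was lost mid-line
```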

Takeaway: SWE-Pruner is like getting an A- to A result while studying 30–50% less time, and sometimes even bumping your grade up a point—thanks to better focus.

05 Discussion & Limitations

Limitations:

  • Mostly evaluated on Python repos. While the idea is language-agnostic, broader multilingual tests are future work.
  • Synthetic training data is filtered to reduce leakage, but ongoing testing on fresh repos remains important.
  • The skimmer adds small latency (~40–100 ms). It’s typically far outweighed by decoding savings, but ultra-low-latency edge cases may care.

Required resources:

  • A small 0.6B model for pruning (CPU or a modest GPU works), plus your main coding LLM (e.g., Claude Sonnet 4.5 or GLM-4.6). Minimal engineering: wrap read commands to accept an optional Goal Hint and call prune().

When NOT to use:

  • Tiny files or short contexts where pruning overhead isn’t worth it.
  • Cases where you must preserve every character (e.g., exact byte-level diffs) and can’t risk removal.
  • When the agent can’t produce a good Goal Hint (the guidance is the engine of relevance).
  • Binary or generated artifacts where line-level meaning doesn’t match source logic.

Open questions:

  • How to best auto-tune thresholds by task difficulty or agent confidence?
  • Can we distill the skimmer further or add early-exit to make it even faster?
  • How does pruning interact with advanced history managers (e.g., learned conversation summarizers) in long projects?
  • What’s the ideal balance between structural compression (e.g., AST-aware chunks) and adaptive, goal-driven pruning across languages like Java, C++, or Rust?

06 Conclusion & Future Work

Three-sentence summary: SWE-Pruner lets coding agents say what they need (Goal Hint) and then keeps only the relevant lines using a tiny, fast neural skimmer with line-level and CRF smarts. This self-adaptive pruning cuts 23–54% of tokens on multi-turn tasks and up to 14.84× on single-turn tasks, while preserving syntax and nudging success rates slightly upward. It plugs into existing agents easily and pays for itself in lower latency, fewer rounds, and reduced costs.

Main achievement: Turning skimming into a first-class, goal-aware capability for coding agents—line-level, structure-preserving, and fast—so focus increases while waste shrinks.

Future directions:

  • Multilingual code support and broader repo coverage.
  • Even lighter skimmers via distillation or early-exit.
  • Smarter thresholds that adapt to agent confidence and task phase.
  • Deeper combinations with history compression and retrieval frameworks.

Why remember this: SWE-Pruner shows that a little, well-aimed attention beats a lot of unfocused reading. By making agents say their goal and trimming context accordingly, we get a practical, drop-in speed-and-cost upgrade that keeps code intact and thinking sharp.

Practical Applications

  • Add SWE-Pruner to IDE agents to read large files with a Goal Hint and return only relevant lines.
  • Speed up CI bots that triage issues by pruning logs and source around suspected faults.
  • Integrate into code review assistants to focus diffs and surrounding context tied to reviewer questions.
  • Use in repo Q&A tools to answer developer questions from code while minimizing token usage.
  • Enhance automated debugging agents to keep error-handling and stack-related lines front and center.
  • Combine with retrieval (RAG) so coarse chunks are fetched and fine-grained lines are pruned.
  • Deploy in low-latency settings (e.g., on-call tooling) where quick, focused reads matter.
  • Support education tools that show only the lines that illustrate a concept, preserving syntax for learning.
  • Install as middleware in agent frameworks (Mini SWE Agent, OpenHands) via a small tool wrapper.
  • Enforce privacy by pruning sensitive sections when the Goal Hint doesn’t require them.
#SWE-Pruner #context pruning #coding agents #Goal Hint #neural skimmer #line-level granularity #Conditional Random Fields #SWE-Bench Verified #Long Code QA #token reduction #compression ratio #AST correctness #latency #adaptive filtering #middleware