SWE-RM: Execution-free Feedback For Software Engineering Agents

Intermediate
KaShun Shum, Binyuan Hui, Jiawei Chen et al. Ā· 12/26/2025

Key Summary

  • Coding agents used to fix software rely on feedback; unit tests give only pass/fail signals that are often noisy or missing.
  • SWE-RM is a new execution-free reward model that scores whole coding attempts without running tests, giving rich and continuous feedback.
  • The paper shows that a verifier that looks good at test-time scaling (picking the best of many tries) may still fail at reinforcement learning.
  • Two extra qualities matter for RL: discrimination (AUC) and calibration (ECE), which tell how well scores separate good from bad and how trustworthy the scores are.
  • Large ablations reveal a practical recipe: more diverse data, a 2:1 positive-to-negative ratio, mixed policies, multiple sources, and very long context windows.
  • SWE-RM uses a 30B Mixture-of-Experts model (activating 3B) with up to 256k tokens, so it can read long trajectories and code histories.
  • On SWE-Bench Verified, SWE-RM boosts Qwen3-Coder-Flash from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% with test-time scaling.
  • As a reward for RL, SWE-RM improves training stability and adds about 3 absolute points over execution-based-only feedback.
  • Hybrid rewards (execution-free + tests) work best: smooth early learning from continuous scores plus trustworthy anchors from tests.
  • This establishes well-calibrated, execution-free reward modeling as a strong foundation for building better software engineering agents.

Why This Research Matters

Software bugs waste time, create outages, and can even put people at risk; better coding agents mean faster, safer fixes. SWE-RM helps agents learn from every attempt, not just a pass/fail at the end, so they grow more capable with fewer reruns. By caring about discrimination and calibration, the model gives scores you can trust, which keeps training stable and efficient. Long context means it can judge real-world, messy codebases with multi-file edits and long logs. As a result, teams can ship fixes sooner, reduce toil on repetitive debugging, and improve user experiences across apps, tools, and services.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: You know how when you’re doing a big school project, a teacher’s quick comments on every page help more than just the final grade at the end? That page-by-page feedback is rich, while the final grade is just pass/fail.

🄬 The Concept: Software engineering (SWE) agents are AI helpers that try to read issues, edit code, run tools, and fix bugs. They learn best when they get good feedback on their attempts.

  • What it is: SWE agents are multi-step problem solvers for real codebases.
  • How it works: They read a bug report, explore files, edit code, run tools, and propose a patch.
  • Why it matters: Without steady, helpful feedback after each try, they can’t learn what helped or hurt.

šŸž Anchor: Imagine an AI trying to fix a broken science fair robot. If it only hears ā€œpassā€ or ā€œfailā€ at the end, it doesn’t know which steps were useful. It needs guidance along the way.

šŸž Hook: Imagine your teacher only marking your test as ā€œpassā€ or ā€œfailā€ without telling you which questions were right. That’s not very helpful for learning.

🄬 The Concept (Execution-based feedback): Unit tests give a simple pass/fail after running code in a sandbox.

  • What it is: A test suite that runs your code and tells you if it passes.
  • How it works: For each candidate patch, run tests; count passes; choose the patch with the most passes.
  • Why it matters: If tests are missing, noisy, or unrelated, the signal is sparse and sometimes misleading.

šŸž Anchor: If a math worksheet only says ā€œfailā€ without pointing to the two wrong answers, you don’t know what to fix next.

šŸž Hook: Imagine a coach who can watch your whole routine and give a detailed score—even if they don’t make you perform on stage each time.

🄬 The Concept (Execution-free feedback / Reward model): A model that scores a whole coding attempt without running tests.

  • What it is: A ā€œverifierā€ that reads the full trajectory (thoughts, tool calls, patch) and returns a score from 0 to 1.
  • How it works: Turn the trajectory into text, ask the model ā€œYES or NO?ā€, and convert its confidence into a continuous score.
  • Why it matters: You get fine-grained signals on every attempt, even when tests are missing or weak.

šŸž Anchor: Like a writing tutor who scores your draft’s clarity and logic without sending it to a contest each time.

šŸž Hook: Picture picking the best cookie from a batch by tasting a few—grabbing the top one.

🄬 The Concept (Test-Time Scaling, TTS): Try multiple candidate patches and use a verifier to pick the best.

  • What it is: ā€œBest-of-kā€ selection among many attempts.
  • How it works: Sample k solutions; score them; pick the top one.
  • Why it matters: If your picker is good, your final answer improves without changing the underlying model.

šŸž Anchor: If you bake 10 cookies and choose the tastiest, your dessert improves even if your recipe didn’t change.

šŸž Hook: Think of training a puppy: give treats for good tricks, no treats for mistakes, and it learns which actions pay off.

🄬 The Concept (Reinforcement Learning, RL): The agent improves by getting reward signals for its actions.

  • What it is: Learning by trial-and-reward.
  • How it works: Generate trajectories, score them, adjust the policy to make high-score actions more likely.
  • Why it matters: If rewards are too sparse or wrong, learning slows or collapses.

šŸž Anchor: A puppy trained with late, confusing treats won’t learn; timely, accurate treats work wonders.

The world before: Most systems leaned on execution-based verifiers (unit tests). These gave a binary ā€œpass/fail,ā€ struggled when tests were missing or off-target, and couldn’t tell apart two fails or two passes with nuance. People tried auto-extracting tests from GitHub or having models write tests—but coverage was uneven and correctness not guaranteed. In practice, this limited which data you could trust for learning.

The problem: Researchers saw that verifiers that looked equally strong for TTS could behave very differently for RL. One verifier helped RL improve smoothly; another, with similar TTS, made RL unstable. So TTS alone wasn’t telling the full story about verifier quality.

šŸž Hook: You know how a thermometer might always read 5 degrees too high? You’ll dress wrong for the weather even if you check it carefully.

🄬 The Concept (Calibration): Scores should match reality; a 0.8 score should mean ā€œabout 80% chance of being correct.ā€

  • What it is: Reliability of confidence.
  • How it works: Compare predicted confidence to actual success frequency (ECE measures this gap).
  • Why it matters: Miscalibrated scores trick RL into over- or under-rewarding certain behaviors.

šŸž Anchor: If the weather app says ā€œ90% chance of rainā€ but it rains only half the time, you’ll carry umbrellas unnecessarily.

šŸž Hook: Imagine a judge who can really tell great dances from okay ones across the whole competition, not just the winner.

🄬 The Concept (AUC, discrimination): How well scores separate good from bad across many attempts.

  • What it is: A measure of pairwise ranking quality over all positives vs. negatives.
  • How it works: Compute the chance that a random good trajectory scores higher than a random bad one.
  • Why it matters: Low AUC means lots of mis-ordering, which sends RL gradients the wrong way.

šŸž Anchor: If a judge often ranks weaker routines above stronger ones, the trophy might go to the wrong team.

The gap: TTS measures top-1 picking, but misses discrimination (AUC) and reliability (calibration). The paper fills that gap by defining and optimizing all three.

The real stakes: Better verifiers mean coding agents that fix real bugs more reliably, save developer time, and make software safer—from school apps to hospital systems.

02Core Idea

šŸž Hook: Imagine choosing a team captain by only checking who won a single sprint, ignoring their passing skills and teamwork. You might pick the wrong leader.

🄬 The Concept (Key Insight): A verifier that’s good at picking the single best attempt (TTS) isn’t necessarily good for training; you also need strong discrimination (AUC) and reliable confidence (calibration).

  • What it is: The ā€œAha!ā€ is to judge and train execution-free verifiers on three pillars—TTS, AUC, and calibration—so they work for both selection and learning.
  • How it works: Build a reward model that reads full trajectories, outputs a YES/NO score, and is trained and evaluated to maximize top-1 selection, ranking quality, and score trustworthiness.
  • Why it matters: Without discrimination, RL updates get noisy; without calibration, rewards mislead training; with only TTS, you can still crash RL.

šŸž Anchor: Picking the best cookie helps dessert tonight; knowing what makes a cookie good helps you bake better forever. This work delivers both.

Three analogies for the same idea:

  1. Sports scout: TTS is like picking today’s top scorer. AUC is judging overall play quality across many matchups. Calibration is trusting that an 8/10 rating really means ā€œvery likely to help the team.ā€
  2. Teacher grading essays: TTS is selecting the single best paper. AUC is reliably ranking A’s over B’s over C’s across the whole stack. Calibration means a 90 truly predicts ā€œabout 90% mastery.ā€
  3. Weather forecaster: TTS is choosing the day with best picnic weather. AUC is separating clear vs. stormy days across the month. Calibration is 70% forecasts matching 70% outcomes.

Before vs. After:

  • Before: Verifiers were often tuned or judged mainly by TTS. Two models with similar TTS could act very differently in RL, sometimes destabilizing training.
  • After: SWE-RM is trained and validated on TTS + AUC + calibration, creating a reward model that both picks better solutions and teaches policies more safely and effectively. Result: state-of-the-art TTS and smoother, stronger RL.

Why it works (intuition):

  • RL updates weight actions by reward. If a bad trajectory gets a high score (poor AUC), gradients point the wrong way—like praising mistakes.
  • If scores don’t match reality (poor calibration), RL is either too bold or too timid; both slow or break learning.
  • If you only optimize top-1 picking (TTS), you learn little about the rest of the distribution that RL explores; training needs faithful signals everywhere, not just at the top.

Building Blocks (each with a purpose):

  • Execution-free scoring: fine-grained, continuous signals for every try.
  • Generative classification with YES/NO token: simple, stable mapping to a probability-like reward.
  • Mixture-of-Experts backbone (30B total, 3B active): strong coding priors with efficient inference.
  • Long context (up to 256k): read the whole story—multi-file code, long tool logs—so scores aren’t blind.
  • Data scaling and composition: enough diverse, labeled trajectories to generalize; a 2:1 positive:negative ratio balances learning.
  • Mixed policies and sources: variety reduces overfitting to one agent’s style or one dataset’s quirks.
  • Tri-metric evaluation (TTS, AUC, ECE): measure what matters to both pickers and learners.

šŸž Anchor: It’s like training a great coach: they must pick the best player for today’s game (TTS), rank players fairly across the roster (AUC), and give advice whose confidence matches reality (calibration). SWE-RM is that coach for coding agents.

03Methodology

At a high level: Input (multi-turn coding trajectory) → Reward model reads everything → Outputs YES/NO probabilities → Convert to a score in [0,1] → Use the score for test-time selection or as RL reward.

Step 1: Collect rich trajectories

  • What happens: Use agent scaffolds (like OpenHands) with different policy models (e.g., Qwen3-Coder, Claude) to roll out thousands of multi-turn attempts across sources (SWE-Gym, SWE-rebench, SWE-smith, R2E-Gym). Each attempt includes the problem, tool calls, code edits, and a patch.
  • Why it exists: A verifier must see realistic, messy, long sequences to learn what success looks like in the wild.
  • Example: For a Django bug, the agent opens several files, runs linters, edits two modules, and proposes a patch. This entire story becomes one ā€œtrajectory.ā€

Step 2: Label resolved vs. unresolved

  • What happens: Mark each trajectory positive (resolved) or negative (unresolved) based on execution results against fail-to-pass tests; filter out broken or uninformative items.
  • Why it exists: The reward model needs supervision to learn what patterns correlate with true fixes.
  • Example: If tests pass after the patch, label YES; if not, label NO. If tests look irrelevant or no attempt ever succeeds, drop the sample.

Step 3: Train a generative classifier

  • What happens: Format the whole trajectory as input; the model must output a special token YES or NO. Train with next-token prediction loss on this single token.
  • Why it exists: This forces the model to compress the entire trajectory’s evidence into a calibrated decision.
  • Example: The model reads tool logs that show failing import paths; it learns that such unresolved signals nudge toward NO, unless followed by a correct fix.
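A minimal sketch of this single-token supervision, assuming a Hugging Face-style causal LM and tokenizer; the helper name and the choice of " YES"/" NO" verdict strings are illustrative assumptions, not the paper's exact formatting. Only the verdict token contributes to the next-token loss; all trajectory tokens are masked with -100.

```python
import torch

def build_training_example(tokenizer, trajectory_text: str, resolved: bool):
    verdict = " YES" if resolved else " NO"   # assumed verdict strings
    prompt_ids = tokenizer(trajectory_text, return_tensors="pt").input_ids
    verdict_ids = tokenizer(verdict, add_special_tokens=False,
                            return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, verdict_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore loss everywhere except the verdict
    return input_ids, labels

# loss = model(input_ids=input_ids, labels=labels).loss   # standard causal-LM loss
```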

Step 4: Turn logits into a continuous score

  • What happens: Convert the YES/NO logits into a softmax probability r in [0,1].
  • Why it exists: Continuous rewards help both ranking (TTS) and smooth RL updates.
  • Example: A borderline attempt might get r = 0.58 (some promise), while a clean fix gets r = 0.95.
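A minimal sketch of this conversion, assuming access to the model's vocabulary logits at the verdict position and the token ids for YES and NO (both assumptions; the exact prompt formatting may differ).

```python
import torch

def verdict_score(logits_at_verdict: torch.Tensor, yes_id: int, no_id: int) -> float:
    """Softmax over just the YES/NO logits yields a reward r in [0, 1]."""
    pair = torch.stack([logits_at_verdict[yes_id], logits_at_verdict[no_id]])
    return float(torch.softmax(pair, dim=0)[0])   # probability mass on YES
```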

Step 5: Architecture and context for realism

  • What happens: Use a Mixture-of-Experts (30B total, 3B active) backbone with up to 256k tokens.
  • Why it exists: SWE trajectories can be long; reading more context reduces truncation errors and lets the verifier consider cross-file clues.
  • Example: A patch touches three files and references a design doc; with 256k context, all of it is in view.

Step 6: Data scaling and composition

  • What happens: Train on up to ~100k curated trajectories; prefer a 2:1 positive:negative ratio; mix on-policy (same as the target agent) and off-policy data; combine multiple sources.
  • Why it exists: More, cleaner, and more varied data improves generalization, AUC, and calibration. The 2:1 ratio uses scarce positives efficiently while keeping negatives informative.
  • Example: With few positives, a 1:8 ratio collapses calibration; with 2:1, the model learns sharper boundaries and more reliable scores.
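A minimal sketch of assembling the 2:1 positive:negative mix, assuming positives are scarce so all of them are kept and negatives are subsampled; the list-based data structure is an illustrative assumption.

```python
import random

def mix_two_to_one(positives: list, negatives: list, seed: int = 0) -> list:
    """Keep all positives and subsample negatives to roughly half their count (2:1 ratio)."""
    rng = random.Random(seed)
    n_neg = min(len(negatives), len(positives) // 2)
    mixed = positives + rng.sample(negatives, n_neg)
    rng.shuffle(mixed)
    return mixed
```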

Step 7: Evaluate on three pillars

  • What happens: For TTS, sample k candidates per instance and report RM@k (resolve rate of the chosen patch). For AUC, measure how well positives outrank negatives. For calibration, compute ECE from reliability diagrams.
  • Why it exists: TTS alone can hide problems that break RL. AUC captures global ranking; ECE captures score trustworthiness.
  • Example: Two verifiers tie on RM@32 but differ in AUC by 0.095 and ECE by 3x—one trains RL well; the other destabilizes it.

Step 8: Use the score for TTS and RL

  • What happens: For TTS, pick the top-scoring patch among k samples. For RL, combine the score with execution-based reward in a hybrid signal.
  • Why it exists: TTS improves pass@1 at inference; RL improves the policy itself. The hybrid reward is smoother (thanks to the continuous score) but still grounded (thanks to tests).
  • Example: r = 0.83 adds +0.83 to the reward whether tests pass or not; if tests pass, add a bonus (e.g., +1), creating a strong, confident signal.
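A minimal sketch of a hybrid reward in the spirit of this step; the bonus and penalty values are illustrative defaults (the example above uses +1 for passing tests), not the paper's exact constants.

```python
def hybrid_reward(rm_score: float, tests_passed: bool,
                  pass_bonus: float = 1.0, fail_penalty: float = 0.0) -> float:
    """Dense signal from the execution-free score plus a trusted anchor from tests."""
    execution_term = pass_bonus if tests_passed else fail_penalty
    return execution_term + rm_score

# With rm_score = 0.83 and passing tests, the reward would be 1.83,
# matching the example above.
```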

What breaks without each step

  • No data diversity: Overfits to one style; AUC drops on new agents.
  • No positives: The model can’t learn what success looks like; calibration drifts.
  • Short context: Long attempts can’t be scored; TTS and RL miss good solutions.
  • Only TTS evaluation: You can pick winners but still train policies in the wrong direction.

Concrete mini-walkthrough

  • Input: 3 candidate patches for a scikit-learn issue. Trajectory A shows careful diagnosis and a 2-line fix; B flips an unrelated flag; C adds logs but no fix.
  • Scoring: SWE-RM reads all three; assigns r(A)=0.92, r(B)=0.18, r(C)=0.36.
  • TTS: Choose A; pass@1 improves.
  • RL: If A passes tests, the reward might be 1 + 0.92; if it fails or is left unfinished, it might be āˆ’0.5 + 0.92. Either way, gradients nudge the policy toward trajectories like A.

The secret sauce

  • Tri-metric design (TTS + AUC + ECE) as success criteria.
  • Data recipe (2:1 positives, mixed policies, multiple sources) that empirically tightens discrimination and calibration.
  • Long-context MoE that can actually ā€œread the whole story,ā€ avoiding blind spots that cause score noise.
  • Simple, stable YES/NO head that maps naturally to a probability-like reward.

šŸž Hook: Think of it as teaching a judge to read the whole essay, grade fairly across the class, and give trustworthy scores. That judge then both picks better winners and teaches better writers.

🄬 The Concept (Putting it all together): SWE-RM is a long-context, execution-free reward model trained with a data-and-metrics recipe that optimizes selection, discrimination, and calibration.

  • What it is: A practical verifier that boosts TTS and powers RL.
  • How it works: It reads full trajectories, predicts YES/NO with calibrated confidence, and its scores are used for choosing patches and shaping policy gradients.
  • Why it matters: It sets a new state of the art for open-source SWE agents and makes RL training smoother and stronger.

šŸž Anchor: In practice, SWE-RM lifts Qwen3-Coder-Flash from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified with TTS, and adds ~3 points in RL when combined with execution-based signals.

04Experiments & Results

The test: Can execution-free scoring trained with the tri-metric recipe improve both selection (TTS) and learning (RL) on real SWE tasks?

  • Why these tests: TTS reflects inference-time selection power. AUC shows global ranking skill. ECE shows if scores match reality, crucial for RL. RL tests check whether the reward truly accelerates and stabilizes learning.

The competition (baselines):

  • Execution-based: Agentless; DeepSWE-EB style hybrids that rely on test outcomes.
  • Execution-free: SWE-Gym Verifier; DeepSWE-EF; OpenHands Critic-like approaches.
  • Notably, some baselines focus mainly on TTS; calibration and AUC are less explored.

Scoreboard with context:

  • TTS on SWE-Bench Verified:
    • Qwen3-Coder-Flash: 51.6% → 62.0% with SWE-RM (about a 10.4-point jump—going from a solid B- to a strong A-).
    • Qwen3-Coder-Max: 67.0% → 74.6% (a 7.6-point jump—A to A+ among open-source peers).
    • Across models including OpenHands-LM-32B, SWE-RM reaches top RM@32 with the best AUC and lowest ECE among execution-free verifiers.
  • Discrimination (AUC): SWE-RM consistently raises AUC over execution-free baselines (e.g., from ~0.72–0.76 to ~0.78 or higher depending on the policy), indicating fewer mis-ordered pairs and cleaner gradients for RL.
  • Calibration (ECE): SWE-RM shows substantially lower ECE (down to ~0.05), about 2–4x better than some baselines, which means ā€œ0.8ā€ really behaves like ā€œ80% likely to be correct.ā€

Two surprising and important findings:

  1. TTS ties, RL divides: Two verifiers can tie on TTS yet diverge in AUC and ECE; the one with weaker AUC/calibration destabilizes RL and collapses training, while the other trains smoothly. Lesson: TTS alone is not enough.
  2. Long context changes the game: Bumping context from 32k to 256k raises the fraction of scorable trajectories to nearly 100% and improves RM@32. Many hard issues simply don’t fit in short windows; being able to read everything boosts both selection and learning.

Data scaling and composition effects:

  • Training size: Below ~5k examples, TTS may degrade as k increases (overfitting and OOD fragility). Around 20–25k, curves improve; 100k shows diminishing but real gains and much better calibration (ECE drops from ~0.48 at 500 examples to ~0.07 at 100k).
  • Positive:Negative ratio: A 2:1 ratio generally yields the best AUC, ECE, and RM@32 across tested policies, likely due to scarce positives carrying high signal.
  • Policy mixing: Combining on-policy and off-policy rollouts improves overall robustness (AUC/ECE) beyond either alone.
  • Source mixing: SWE-rebench leads on TTS/AUC; adding SWE-smith and SWE-Gym further improves calibration and scaling.

RL outcomes (no test-time scaling at eval):

  • Rewards tested: Execution-based only (sparse 0/1), Execution-free only (continuous SWE-RM scores), Hybrid (execution-based + SWE-RM), and a poorly calibrated RM.
  • Results: Hybrid wins, gaining about +3 absolute points over execution-based-only and learning faster and more smoothly. Execution-free only learns faster early (thanks to dense signals) but converges lower than hybrid (due to occasional inaccuracies). Poorly calibrated RM hurts training and generalization across other SWE tasks.
  • Generalization: On SWE-Bench Live (Lite), SWE-Bench Multilingual, Multi-SWE-Bench Mini, and Terminal Bench, hybrid remains strongest, confirming the approach is not overfit to a single benchmark.

Takeaway with meaning: Think of the numbers like grades. Where others scored a B or B+, SWE-RM pushes into the A range and also becomes a better teacher for RL. The stronger AUC means it usually knows which attempt did better, and the low ECE means its confidence is trustworthy. Together, these turn the verifier from a good judge into a great coach that both picks winners today and trains better players for tomorrow.

05Discussion & Limitations

Limitations:

  • Long-context compute: Reading up to 256k tokens increases memory needs (only one verdict token is generated, but encoding the long prompt is expensive). Training used multi-node H100s; not every lab can match this.
  • Label noise from tests: Even though the verifier is execution-free, its supervision labels come from execution outcomes that can be imperfect. Data cleaning helps, but some noise remains.
  • Data hunger: The best gains appear after tens of thousands of examples; smaller datasets showed OOD fragility and poor calibration.
  • Scope: The method scores trajectories; it doesn’t invent tests or guarantee semantic correctness beyond patterns it learned.

Required resources:

  • Hardware: Multi-GPU nodes with sufficient memory for 256k context training; at inference, memory still scales with context size.
  • Data: Diverse, well-filtered trajectories from several sources and policies; positives are particularly valuable.
  • Tooling: An agent scaffold (e.g., OpenHands), long-context training stack (e.g., Megatron), and logging for reliability diagrams and AUC/ECE tracking.

When not to use:

  • Extremely resource-constrained settings where 256k context is infeasible and truncation would drop most of the signal.
  • Domains with no reliable proxy labels at all (even noisy ones), making initial supervision too weak.
  • Scenarios demanding binary, legal-grade guarantees; an execution-free score is guidance, not a formal proof.

Open questions:

  • Calibration without labels: Can we self-calibrate scores post-hoc or with small trusted sets, reducing dependence on noisy test labels?
  • Active data selection: Which trajectories most improve AUC and ECE per GPU-hour? Can we prioritize borderline or novel cases?
  • Architecture vs. data: How much of the gains come from MoE and long context versus data recipe and tri-metric training?
  • Beyond SWE: Do the tri-metric principles (TTS/AUC/ECE) hold for other agent domains (e.g., robotics, multi-step tool use)?
  • Human-in-the-loop: Can lightweight human spot-checks close the last gap in calibration and reduce reward hacking?

06Conclusion & Future Work

Three-sentence summary: This paper shows that test-time scaling alone cannot judge a verifier’s fitness for training coding agents; discrimination (AUC) and calibration (ECE) are equally crucial. Using a data-and-metrics recipe with long context and a Mixture-of-Experts backbone, the authors build SWE-RM—an execution-free reward model that achieves state-of-the-art TTS and stabilizes RL, adding roughly 3 points over execution-based-only reward. The result is a verifier that both picks better solutions and teaches policies more effectively.

Main achievement: Establishing a practical, tri-metric standard (TTS, AUC, ECE) and delivering a concrete model—SWE-RM—that measurably advances both inference-time selection and training-time learning for SWE agents.

Future directions: Improve calibration with lighter supervision, explore active data curation for maximum AUC/ECE gain per sample, test alternative backbones (dense/adapters) under the same data recipe, and extend the approach to other agent domains. Investigating hybrid verification schemes that combine symbolic checks, light execution, and model scores may further boost reliability.

Why remember this: SWE-RM reframes verifier quality from ā€œcan it pick a winner?ā€ to ā€œcan it pick fairly, score honestly, and teach well?ā€ That shift—from a single metric to three—turns a good judge into a great coach and sets a stronger foundation for the next generation of coding agents that are both more accurate today and easier to train tomorrow.

Practical Applications

  • Prioritize candidate patches in continuous integration by scoring them before expensive full test runs.
  • Accelerate triage of GitHub issues by ranking agent-generated fixes with trustworthy confidence.
  • Use hybrid rewards to stably fine-tune in-house coding agents for proprietary codebases.
  • Filter and deduplicate low-quality agent trajectories during dataset curation using calibrated scores.
  • Drive targeted test generation: prioritize writing tests for areas where the verifier is uncertain (low calibration).
  • Enable long-context code review bots that consider entire diffs, logs, and discussions in one pass.
  • Deploy safer auto-fix bots gated by minimum score thresholds tuned via reliability diagrams.
  • Benchmark internal agents with TTS/AUC/ECE dashboards to catch regressions beyond pass@1.
  • Guide human-in-the-loop review by surfacing top fixes plus score-based rationales.
  • Support multilingual repository maintenance by scoring cross-language fixes consistently.
#execution-free feedback Ā· #reward model Ā· #software engineering agents Ā· #test-time scaling Ā· #reinforcement learning Ā· #calibration Ā· #expected calibration error Ā· #AUC Ā· #mixture-of-experts Ā· #long-context modeling Ā· #SWE-Bench Verified Ā· #trajectory scoring Ā· #hybrid rewards Ā· #data scaling Ā· #policy mixture