SWE-RM: Execution-free Feedback For Software Engineering Agents
Key Summary
- Coding agents that fix software rely on feedback; unit tests give only pass/fail signals that are often noisy or missing.
- SWE-RM is a new execution-free reward model that scores whole coding attempts without running tests, giving rich, continuous feedback.
- The paper shows that a verifier that looks good at test-time scaling (picking the best of many tries) may still fail at reinforcement learning.
- Two extra qualities matter for RL: discrimination (AUC) and calibration (ECE), which measure how well scores separate good attempts from bad ones and how trustworthy the scores are.
- Large ablations reveal a practical recipe: more diverse data, a 2:1 positive-to-negative ratio, mixed policies, multiple sources, and very long context windows.
- SWE-RM uses a 30B Mixture-of-Experts model (activating 3B) with up to 256k tokens, so it can read long trajectories and code histories.
- On SWE-Bench Verified, SWE-RM boosts Qwen3-Coder-Flash from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% with test-time scaling.
- As a reward for RL, SWE-RM improves training stability and adds about 3 absolute points over execution-based-only feedback.
- Hybrid rewards (execution-free + tests) work best: smooth early learning from continuous scores plus trustworthy anchors from tests.
- This establishes well-calibrated, execution-free reward modeling as a strong foundation for building better software engineering agents.
Why This Research Matters
Software bugs waste time, create outages, and can even put people at risk; better coding agents mean faster, safer fixes. SWE-RM helps agents learn from every attempt, not just a pass/fail at the end, so they grow more capable with fewer reruns. By caring about discrimination and calibration, the model gives scores you can trust, which keeps training stable and efficient. Long context means it can judge real-world, messy codebases with multi-file edits and long logs. As a result, teams can ship fixes sooner, reduce toil on repetitive debugging, and improve user experiences across apps, tools, and services.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how when you're doing a big school project, a teacher's quick comments on every page help more than just the final grade at the end? That page-by-page feedback is rich, while the final grade is just pass/fail.
The Concept: Software engineering (SWE) agents are AI helpers that try to read issues, edit code, run tools, and fix bugs. They learn best when they get good feedback on their attempts.
- What it is: SWE agents are multi-step problem solvers for real codebases.
- How it works: They read a bug report, explore files, edit code, run tools, and propose a patch.
- Why it matters: Without steady, helpful feedback after each try, they can't learn what helped or hurt.
Anchor: Imagine an AI trying to fix a broken science fair robot. If it only hears "pass" or "fail" at the end, it doesn't know which steps were useful. It needs guidance along the way.
Hook: Imagine your teacher only marking your test as "pass" or "fail" without telling you which questions were right. That's not very helpful for learning.
The Concept (Execution-based feedback): Unit tests give a simple pass/fail after running code in a sandbox.
- What it is: A test suite that runs your code and tells you if it passes.
- How it works: For each candidate patch, run tests; count passes; choose the patch with the most passes.
- Why it matters: If tests are missing, noisy, or unrelated, the signal is sparse and sometimes misleading.
Anchor: If a math worksheet only says "fail" without pointing to the two wrong answers, you don't know what to fix next.
Hook: Imagine a coach who can watch your whole routine and give a detailed score, even if they don't make you perform on stage each time.
The Concept (Execution-free feedback / Reward model): A model that scores a whole coding attempt without running tests.
- What it is: A "verifier" that reads the full trajectory (thoughts, tool calls, patch) and returns a score from 0 to 1.
- How it works: Turn the trajectory into text, ask the model "YES or NO?", and convert its confidence into a continuous score.
- Why it matters: You get fine-grained signals on every attempt, even when tests are missing or weak.
Anchor: Like a writing tutor who scores your draft's clarity and logic without sending it to a contest each time.
Hook: Picture picking the best cookie from a batch by tasting a few and grabbing the top one.
The Concept (Test-Time Scaling, TTS): Try multiple candidate patches and use a verifier to pick the best.
- What it is: "Best-of-k" selection among many attempts.
- How it works: Sample k solutions; score them; pick the top one.
- Why it matters: If your picker is good, your final answer improves without changing the underlying model.
Anchor: If you bake 10 cookies and choose the tastiest, your dessert improves even if your recipe didn't change.
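To make best-of-k concrete, here is a minimal Python sketch of the selection loop; the helper name, candidate names, and toy scores are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch of best-of-k selection (test-time scaling).
# The helper name and toy scores are illustrative, not the paper's code.
from typing import Callable, Sequence

def best_of_k(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Score every candidate patch with the verifier and return the top one."""
    return max(candidates, key=score_fn)

# Toy usage: a real system would call the reward model instead of a lookup table.
toy_scores = {"patch_a": 0.92, "patch_b": 0.18, "patch_c": 0.36}
best = best_of_k(["patch_a", "patch_b", "patch_c"], lambda p: toy_scores[p])
print(best)  # patch_a
```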
Hook: Think of training a puppy: give treats for good tricks, no treats for mistakes, and it learns which actions pay off.
The Concept (Reinforcement Learning, RL): The agent improves by getting reward signals for its actions.
- What it is: Learning by trial-and-reward.
- How it works: Generate trajectories, score them, adjust the policy to make high-score actions more likely.
- Why it matters: If rewards are too sparse or wrong, learning slows or collapses.
Anchor: A puppy trained with late, confusing treats won't learn; timely, accurate treats work wonders.
The world before: Most systems leaned on execution-based verifiers (unit tests). These gave a binary "pass/fail," struggled when tests were missing or off-target, and couldn't tell apart two fails or two passes with nuance. People tried auto-extracting tests from GitHub or having models write tests, but coverage was uneven and correctness not guaranteed. In practice, this limited which data you could trust for learning.
The problem: Researchers saw that verifiers that looked equally strong for TTS could behave very differently for RL. One verifier helped RL improve smoothly; another, with similar TTS, made RL unstable. So TTS alone wasn't telling the full story about verifier quality.
Hook: You know how a thermometer might always read 5 degrees too high? You'll dress wrong for the weather even if you check it carefully.
The Concept (Calibration): Scores should match reality; a 0.8 score should mean "about 80% chance of being correct."
- What it is: Reliability of confidence.
- How it works: Compare predicted confidence to actual success frequency (ECE measures this gap).
- Why it matters: Miscalibrated scores trick RL into over- or under-rewarding certain behaviors.
Anchor: If the weather app says "90% chance of rain" but it rains only half the time, you'll carry umbrellas unnecessarily.
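A minimal sketch of how ECE can be computed, assuming the standard equal-width binning; the paper may use different binning details.

```python
# Minimal ECE sketch: per-bin gap between average confidence and accuracy,
# weighted by bin size. The binning scheme here is an assumption.
import numpy as np

def expected_calibration_error(scores, labels, n_bins=10):
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores <= hi) if hi == 1.0 else (scores >= lo) & (scores < hi)
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return ece

# Toy usage; perfectly calibrated scores would give an ECE of 0.
print(expected_calibration_error([0.9, 0.85, 0.2, 0.1], [1, 1, 0, 0]))
```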
Hook: Imagine a judge who can really tell great dances from okay ones across the whole competition, not just the winner.
The Concept (AUC, discrimination): How well scores separate good from bad across many attempts.
- What it is: A measure of pairwise ranking quality over all positives vs. negatives.
- How it works: Compute the chance that a random good trajectory scores higher than a random bad one.
- Why it matters: Low AUC means lots of mis-ordering, which sends RL gradients the wrong way.
Anchor: If a judge often ranks weaker routines above stronger ones, the trophy might go to the wrong team.
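A minimal sketch of AUC as pairwise ranking: the chance that a randomly chosen resolved trajectory outscores a randomly chosen unresolved one, with ties counting as half.

```python
# Minimal AUC sketch (Mann-Whitney form): fraction of positive/negative pairs
# the verifier orders correctly; ties count as half.
import numpy as np

def pairwise_auc(scores, labels):
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Every resolved attempt outranks every unresolved one -> AUC = 1.0.
print(pairwise_auc([0.92, 0.75, 0.36, 0.18], [1, 1, 0, 0]))
```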
The gap: TTS measures top-1 picking, but misses discrimination (AUC) and reliability (calibration). The paper fills that gap by defining and optimizing all three. The real stakes: Better verifiers mean coding agents that fix real bugs more reliably, save developer time, and make software safer, from school apps to hospital systems.
02 Core Idea
Hook: Imagine choosing a team captain by only checking who won a single sprint, ignoring their passing skills and teamwork. You might pick the wrong leader.
The Concept (Key Insight): A verifier that's good at picking the single best attempt (TTS) isn't necessarily good for training; you also need strong discrimination (AUC) and reliable confidence (calibration).
- What it is: The "Aha!" is to judge and train execution-free verifiers on three pillars (TTS, AUC, and calibration) so they work for both selection and learning.
- How it works: Build a reward model that reads full trajectories, outputs a YES/NO score, and is trained and evaluated to maximize top-1 selection, ranking quality, and score trustworthiness.
- Why it matters: Without discrimination, RL updates get noisy; without calibration, rewards mislead training; with only TTS, you can still crash RL.
Anchor: Picking the best cookie helps dessert tonight; knowing what makes a cookie good helps you bake better forever. This work delivers both.
Three analogies for the same idea:
- Sports scout: TTS is like picking today's top scorer. AUC is judging overall play quality across many matchups. Calibration is trusting that an 8/10 rating really means "very likely to help the team."
- Teacher grading essays: TTS is selecting the single best paper. AUC is reliably ranking A's over B's over C's across the whole stack. Calibration means a 90 truly predicts "about 90% mastery."
- Weather forecaster: TTS is choosing the day with best picnic weather. AUC is separating clear vs. stormy days across the month. Calibration is 70% forecasts matching 70% outcomes.
Before vs. After:
- Before: Verifiers were often tuned or judged mainly by TTS. Two models with similar TTS could act very differently in RL, sometimes destabilizing training.
- After: SWE-RM is trained and validated on TTS + AUC + calibration, creating a reward model that both picks better solutions and teaches policies more safely and effectively. Result: state-of-the-art TTS and smoother, stronger RL.
Why it works (intuition):
- RL updates weight actions by reward. If a bad trajectory gets a high score (poor AUC), gradients point the wrong way, like praising mistakes.
- If scores donāt match reality (poor calibration), RL is either too bold or too timid; both slow or break learning.
- If you only optimize top-1 picking (TTS), you learn little about the rest of the distribution that RL explores; training needs faithful signals everywhere, not just at the top.
Building Blocks (each with a purpose):
- Execution-free scoring: fine-grained, continuous signals for every try.
- Generative classification with YES/NO token: simple, stable mapping to a probability-like reward.
- Mixture-of-Experts backbone (30B total, 3B active): strong coding priors with efficient inference.
- Long context (up to 256k): read the whole story (multi-file code, long tool logs) so scores aren't blind.
- Data scaling and composition: enough diverse, labeled trajectories to generalize; a 2:1 positive:negative ratio balances learning.
- Mixed policies and sources: variety reduces overfitting to one agent's style or one dataset's quirks.
- Tri-metric evaluation (TTS, AUC, ECE): measure what matters to both pickers and learners.
Anchor: It's like training a great coach: they must pick the best player for today's game (TTS), rank players fairly across the roster (AUC), and give advice whose confidence matches reality (calibration). SWE-RM is that coach for coding agents.
03 Methodology
At a high level: Input (multi-turn coding trajectory) → Reward model reads everything → Outputs YES/NO probabilities → Convert to a score in [0,1] → Use the score for test-time selection or as an RL reward.
Step 1: Collect rich trajectories
- What happens: Use agent scaffolds (like OpenHands) with different policy models (e.g., Qwen3-Coder, Claude) to roll out thousands of multi-turn attempts across sources (SWE-Gym, SWE-rebench, SWE-smith, R2E-Gym). Each attempt includes the problem, tool calls, code edits, and a patch.
- Why it exists: A verifier must see realistic, messy, long sequences to learn what success looks like in the wild.
- Example: For a Django bug, the agent opens several files, runs linters, edits two modules, and proposes a patch. This entire story becomes one "trajectory."
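For illustration, here is one way such a trajectory could be represented in code; the field names are assumptions, not the paper's actual schema.

```python
# Hypothetical trajectory schema for illustration only; the paper's actual
# data format is not specified here.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Step:
    thought: str       # the agent's reasoning at this turn
    tool_call: str     # e.g., opening a file or running a linter
    observation: str   # tool output: file contents, logs, errors

@dataclass
class Trajectory:
    issue: str                       # bug report / problem statement
    steps: List[Step] = field(default_factory=list)
    patch: str = ""                  # final unified diff proposed by the agent
    resolved: Optional[bool] = None  # filled in later from execution results
```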
Step 2: Label resolved vs. unresolved
- What happens: Mark each trajectory positive (resolved) or negative (unresolved) based on execution results against fail-to-pass tests; filter out broken or uninformative items.
- Why it exists: The reward model needs supervision to learn what patterns correlate with true fixes.
- Example: If tests pass after the patch, label YES; if not, label NO. If tests look irrelevant or no attempt ever succeeds, drop the sample.
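A minimal sketch of that labeling rule; the function name, inputs, and filtering heuristic are assumptions for illustration.

```python
# Hypothetical labeling/filtering rule: positive only if the patch makes the
# fail-to-pass tests pass; drop samples whose tests look uninformative.
from typing import List, Optional

def label_trajectory(fail_to_pass_results: List[bool],
                     tests_look_relevant: bool) -> Optional[str]:
    """Return "YES"/"NO", or None to drop the sample from training."""
    if not tests_look_relevant or not fail_to_pass_results:
        return None  # uninformative: filter it out
    return "YES" if all(fail_to_pass_results) else "NO"

print(label_trajectory([True, True], tests_look_relevant=True))   # YES
print(label_trajectory([True, False], tests_look_relevant=True))  # NO
print(label_trajectory([], tests_look_relevant=False))            # None (dropped)
```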
Step 3: Train a generative classifier
- What happens: Format the whole trajectory as input; the model must output a special token YES or NO. Train with next-token prediction loss on this single token.
- Why it exists: This forces the model to compress the entire trajectory's evidence into a calibrated decision.
- Example: The model reads tool logs that show failing import paths; it learns that such unresolved signals nudge toward NO, unless followed by a correct fix.
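A minimal PyTorch sketch of the training objective just described: cross-entropy on the single YES/NO decision token. The tensor shapes and token ids are placeholders; a real setup would take them from the model and its tokenizer.

```python
# Sketch of the generative-classifier loss: next-token cross-entropy restricted
# to the YES/NO token. Token ids and shapes are placeholders.
import torch
import torch.nn.functional as F

def yes_no_loss(final_logits: torch.Tensor, label_is_yes: torch.Tensor,
                yes_id: int, no_id: int) -> torch.Tensor:
    """
    final_logits: (batch, vocab) logits at the position where YES/NO is emitted.
    label_is_yes: (batch,) bool tensor from execution-based labels.
    """
    yes = torch.full_like(label_is_yes, yes_id, dtype=torch.long)
    no = torch.full_like(label_is_yes, no_id, dtype=torch.long)
    target = torch.where(label_is_yes, yes, no)
    return F.cross_entropy(final_logits, target)

# Toy usage with a 5-token "vocabulary" where YES=3 and NO=4.
loss = yes_no_loss(torch.randn(2, 5), torch.tensor([True, False]), yes_id=3, no_id=4)
print(loss)
```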
Step 4: Turn logits into a continuous score
- What happens: Convert the YES/NO logits into a softmax probability r in [0,1].
- Why it exists: Continuous rewards help both ranking (TTS) and smooth RL updates.
- Example: A borderline attempt might get r = 0.58 (some promise), while a clean fix gets r = 0.95.
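A minimal sketch of that conversion: a two-way softmax over just the YES and NO logits, so r = P(YES) lies in [0, 1]. The logit values below are made up for illustration.

```python
# Two-way softmax over the YES/NO logits: r = exp(yes) / (exp(yes) + exp(no)),
# computed in a numerically stable way.
import math

def yes_probability(yes_logit: float, no_logit: float) -> float:
    m = max(yes_logit, no_logit)
    e_yes, e_no = math.exp(yes_logit - m), math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

print(round(yes_probability(2.1, -0.3), 2))  # 0.92: a confident YES
print(round(yes_probability(0.2, 0.0), 2))   # 0.55: a borderline attempt
```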
Step 5: Architecture and context for realism
- What happens: Use a Mixture-of-Experts (30B total, 3B active) backbone with up to 256k tokens.
- Why it exists: SWE trajectories can be long; reading more context reduces truncation errors and lets the verifier consider cross-file clues.
- Example: A patch touches three files and references a design doc; with 256k context, all of it is in view.
Step 6: Data scaling and composition
- What happens: Train on up to ~100k curated trajectories; prefer a 2:1 positive:negative ratio; mix on-policy (same as the target agent) and off-policy data; combine multiple sources.
- Why it exists: More, cleaner, and more varied data improves generalization, AUC, and calibration. The 2:1 ratio uses scarce positives efficiently while keeping negatives informative.
- Example: With few positives, a 1:8 ratio collapses calibration; with 2:1, the model learns sharper boundaries and more reliable scores.
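A minimal sketch of composing that 2:1 mix by keeping all (scarce) positives and subsampling negatives; this sampling strategy is one simple assumption, not necessarily how the authors built their dataset.

```python
# Hypothetical way to hit a 2:1 positive:negative ratio: keep every positive,
# subsample negatives to half the positive count.
import random

def mix_two_to_one(positives: list, negatives: list, seed: int = 0) -> list:
    rng = random.Random(seed)
    n_neg = min(len(negatives), len(positives) // 2)
    mixed = positives + rng.sample(negatives, n_neg)
    rng.shuffle(mixed)
    return mixed

pos = [f"pos_{i}" for i in range(10)]
neg = [f"neg_{i}" for i in range(40)]
mixed = mix_two_to_one(pos, neg)
print(len(mixed), sum(x.startswith("pos_") for x in mixed))  # 15 10
```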
Step 7: Evaluate on three pillars
- What happens: For TTS, sample k candidates per instance and report RM@k (resolve rate of the chosen patch). For AUC, measure how well positives outrank negatives. For calibration, compute ECE from reliability diagrams.
- Why it exists: TTS alone can hide problems that break RL. AUC captures global ranking; ECE captures score trustworthiness.
- Example: Two verifiers tie on RM@32 but differ in AUC by 0.095 and ECE by 3x; one trains RL well, the other destabilizes it.
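A minimal sketch of the RM@k computation: for each instance, take the top-scoring candidate and count how often it is actually resolved. The nested-list data layout is an assumption.

```python
# RM@k sketch: resolve rate of the candidate the reward model would pick.
# Each inner list holds (score, resolved) pairs for one instance's k samples.
from typing import List, Tuple

def rm_at_k(instances: List[List[Tuple[float, bool]]]) -> float:
    hits = sum(max(cands, key=lambda c: c[0])[1] for cands in instances)
    return hits / len(instances)

data = [
    [(0.92, True), (0.18, False), (0.36, False)],  # verifier picks the true fix
    [(0.70, False), (0.40, True)],                 # verifier picks a wrong patch
]
print(rm_at_k(data))  # 0.5
```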
Step 8: Use the score for TTS and RL
- What happens: For TTS, pick the top-scoring patch among k samples. For RL, combine the score with execution-based reward in a hybrid signal.
- Why it exists: TTS improves pass@1 at inference; RL improves the policy itself. The hybrid reward is smoother (thanks to the continuous score) but still grounded (thanks to tests).
- Example: r = 0.83 adds +0.83 to the reward whether tests pass or not; if tests pass, add a bonus (e.g., +1), creating a strong, confident signal.
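A minimal sketch of the hybrid reward in the example above: the continuous reward-model score is always added, passing tests adds a fixed bonus, and a failed or unfinished attempt gets a penalty. The exact bonus and penalty values are illustrative.

```python
# Hybrid reward sketch: execution-based term (bonus or penalty) plus the
# continuous reward-model score. The constants here are illustrative.
def hybrid_reward(rm_score: float, tests_passed: bool,
                  pass_bonus: float = 1.0, fail_penalty: float = -0.5) -> float:
    execution_term = pass_bonus if tests_passed else fail_penalty
    return execution_term + rm_score

print(round(hybrid_reward(0.92, tests_passed=True), 2))   # 1.92: verified fix
print(round(hybrid_reward(0.92, tests_passed=False), 2))  # 0.42: promising but unverified
```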
What breaks without each step
- No data diversity: Overfits to one style; AUC drops on new agents.
- No positives: The model can't learn what success looks like; calibration drifts.
- Short context: Long attempts canāt be scored; TTS and RL miss good solutions.
- Only TTS evaluation: You can pick winners but still train policies in the wrong direction.
Concrete mini-walkthrough
- Input: 3 candidate patches for a scikit-learn issue. Trajectory A shows careful diagnosis and a 2-line fix; B flips an unrelated flag; C adds logs but no fix.
- Scoring: SWE-RM reads all three; assigns r(A)=0.92, r(B)=0.18, r(C)=0.36.
- TTS: Choose A; pass@1 improves.
- RL: If A passes tests, the reward might be 1 + 0.92; if it fails or stays unresolved, the reward might be -0.5 + 0.92. Either way, gradients nudge the policy toward trajectories like A.
The secret sauce
- Tri-metric design (TTS + AUC + ECE) as success criteria.
- Data recipe (2:1 positives, mixed policies, multiple sources) that empirically tightens discrimination and calibration.
- Long-context MoE that can actually "read the whole story," avoiding blind spots that cause score noise.
- Simple, stable YES/NO head that maps naturally to a probability-like reward.
Hook: Think of it as teaching a judge to read the whole essay, grade fairly across the class, and give trustworthy scores. That judge then both picks better winners and teaches better writers.
The Concept (Putting it all together): SWE-RM is a long-context, execution-free reward model trained with a data-and-metrics recipe that optimizes selection, discrimination, and calibration.
- What it is: A practical verifier that boosts TTS and powers RL.
- How it works: It reads full trajectories, predicts YES/NO with calibrated confidence, and its scores are used for choosing patches and shaping policy gradients.
- Why it matters: It sets a new state of the art for open-source SWE agents and makes RL training smoother and stronger.
Anchor: In practice, SWE-RM lifts Qwen3-Coder-Flash from 51.6% to 62.0% and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified with TTS, and adds ~3 points in RL when combined with execution-based signals.
04 Experiments & Results
The test: Can execution-free scoring trained with the tri-metric recipe improve both selection (TTS) and learning (RL) on real SWE tasks?
- Why these tests: TTS reflects inference-time selection power. AUC shows global ranking skill. ECE shows if scores match reality, crucial for RL. RL tests check whether the reward truly accelerates and stabilizes learning.
The competition (baselines):
- Execution-based: Agentless; DeepSWE-EB style hybrids that rely on test outcomes.
- Execution-free: SWE-Gym Verifier; DeepSWE-EF; OpenHands Critic-like approaches.
- Notably, some baselines focus mainly on TTS; calibration and AUC are less explored.
Scoreboard with context:
- TTS on SWE-Bench Verified:
- Qwen3-Coder-Flash: 51.6% → 62.0% with SWE-RM (about a 10.4-point jump, going from a solid B- to a strong A-).
- Qwen3-Coder-Max: 67.0% → 74.6% (a 7.6-point jump, A to A+ among open-source peers).
- Across models including OpenHands-LM-32B, SWE-RM reaches top RM@32 with the best AUC and lowest ECE among execution-free verifiers.
- Discrimination (AUC): SWE-RM consistently raises AUC over execution-free baselines (e.g., from ~0.72-0.76 to ~0.78 or higher depending on the policy), indicating fewer mis-ordered pairs and cleaner gradients for RL.
- Calibration (ECE): SWE-RM shows substantially lower ECE (down to ~0.05), about 2-4x better than some baselines, which means "0.8" really behaves like "80% likely to be correct."
Two surprising and important findings:
- TTS ties, RL divides: Two verifiers can tie on TTS yet diverge in AUC and ECE; the one with weaker AUC/calibration destabilizes RL and collapses training, while the other trains smoothly. Lesson: TTS alone is not enough.
- Long context changes the game: Bumping context from 32k to 256k raises the fraction of scorable trajectories to nearly 100% and improves RM@32. Many hard issues simply donāt fit in short windows; being able to read everything boosts both selection and learning.
Data scaling and composition effects:
- Training size: Below ~5k examples, TTS may degrade as k increases (overfitting and OOD fragility). Around 20-25k, curves improve; 100k shows diminishing but real gains and much better calibration (ECE drops from ~0.48 at 500 examples to ~0.07 at 100k).
- Positive:Negative ratio: A 2:1 ratio generally yields the best AUC, ECE, and RM@32 across tested policies, likely due to scarce positives carrying high signal.
- Policy mixing: Combining on-policy and off-policy rollouts improves overall robustness (AUC/ECE) beyond either alone.
- Source mixing: SWE-rebench leads on TTS/AUC; adding SWE-smith and SWE-Gym further improves calibration and scaling.
RL outcomes (no test-time scaling at eval):
- Rewards tested: Execution-based only (sparse 0/1), Execution-free only (continuous SWE-RM scores), Hybrid (execution-based + SWE-RM), and a poorly calibrated RM.
- Results: Hybrid wins, gaining about +3 absolute points over execution-based-only and learning faster and more smoothly. Execution-free only learns faster early (thanks to dense signals) but converges lower than hybrid (due to occasional inaccuracies). Poorly calibrated RM hurts training and generalization across other SWE tasks.
- Generalization: On SWE-Bench Live (Lite), SWE-Bench Multilingual, Multi-SWE-Bench Mini, and Terminal Bench, hybrid remains strongest, confirming the approach is not overfit to a single benchmark.
Takeaway with meaning: Think of the numbers like grades. Where others scored a B or B+, SWE-RM pushes into the A range and also becomes a better teacher for RL. The stronger AUC means it usually knows which attempt did better, and the low ECE means its confidence is trustworthy. This combination turns SWE-RM from a good judge into a great coach that both picks winners today and trains better players for tomorrow.
05 Discussion & Limitations
Limitations:
- Long-context compute: Reading up to 256k tokens increases memory needs; generation is only a single token, but encoding such long prompts is expensive. Training used multi-node H100s; not every lab can match this.
- Label noise from tests: Even though the verifier is execution-free, its supervision labels come from execution outcomes that can be imperfect. Data cleaning helps, but some noise remains.
- Data hunger: The best gains appear after tens of thousands of examples; smaller datasets showed OOD fragility and poor calibration.
- Scope: The method scores trajectories; it doesn't invent tests or guarantee semantic correctness beyond the patterns it learned.
Required resources:
- Hardware: Multi-GPU nodes with sufficient memory for 256k context training; at inference, memory still scales with context size.
- Data: Diverse, well-filtered trajectories from several sources and policies; positives are particularly valuable.
- Tooling: An agent scaffold (e.g., OpenHands), long-context training stack (e.g., Megatron), and logging for reliability diagrams and AUC/ECE tracking.
When not to use:
- Extremely resource-constrained settings where 256k context is infeasible and truncation would drop most of the signal.
- Domains with no reliable proxy labels at all (even noisy ones), making initial supervision too weak.
- Scenarios demanding binary, legal-grade guarantees; an execution-free score is guidance, not a formal proof.
Open questions:
- Calibration without labels: Can we self-calibrate scores post-hoc or with small trusted sets, reducing dependence on noisy test labels?
- Active data selection: Which trajectories most improve AUC and ECE per GPU-hour? Can we prioritize borderline or novel cases?
- Architecture vs. data: How much of the gains come from MoE and long context versus data recipe and tri-metric training?
- Beyond SWE: Do the tri-metric principles (TTS/AUC/ECE) hold for other agent domains (e.g., robotics, multi-step tool use)?
- Human-in-the-loop: Can lightweight human spot-checks close the last gap in calibration and reduce reward hacking?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that test-time scaling alone cannot judge a verifier's fitness for training coding agents; discrimination (AUC) and calibration (ECE) are equally crucial. Using a data-and-metrics recipe with long context and a Mixture-of-Experts backbone, the authors build SWE-RM, an execution-free reward model that achieves state-of-the-art TTS and stabilizes RL, adding roughly 3 points over execution-based-only reward. The result is a verifier that both picks better solutions and teaches policies more effectively.
Main achievement: Establishing a practical, tri-metric standard (TTS, AUC, ECE) and delivering a concrete model, SWE-RM, that measurably advances both inference-time selection and training-time learning for SWE agents.
Future directions: Improve calibration with lighter supervision, explore active data curation for maximum AUC/ECE gain per sample, test alternative backbones (dense/adapters) under the same data recipe, and extend the approach to other agent domains. Investigating hybrid verification schemes that combine symbolic checks, light execution, and model scores may further boost reliability.
Why remember this: SWE-RM reframes verifier quality from "can it pick a winner?" to "can it pick fairly, score honestly, and teach well?" That shift, from a single metric to three, turns a good judge into a great coach and sets a stronger foundation for the next generation of coding agents that are both more accurate today and easier to train tomorrow.
Practical Applications
- Prioritize candidate patches in continuous integration by scoring them before expensive full test runs.
- Accelerate triage of GitHub issues by ranking agent-generated fixes with trustworthy confidence.
- Use hybrid rewards to stably fine-tune in-house coding agents for proprietary codebases.
- Filter and deduplicate low-quality agent trajectories during dataset curation using calibrated scores.
- Drive targeted test generation: prioritize writing tests for areas where the verifier is uncertain (low calibration).
- Enable long-context code review bots that consider entire diffs, logs, and discussions in one pass.
- Deploy safer auto-fix bots gated by minimum score thresholds tuned via reliability diagrams.
- Benchmark internal agents with TTS/AUC/ECE dashboards to catch regressions beyond pass@1.
- Guide human-in-the-loop review by surfacing top fixes plus score-based rationales.
- Support multilingual repository maintenance by scoring cross-language fixes consistently.