Agentic Rubrics as Contextual Verifiers for SWE Agents
Key Summary
- The paper introduces Agentic Rubrics, a new way to check code fixes without running the code by creating a smart checklist from the project itself.
- An expert agent explores the repository, writes a rubric.yaml with specific do's and don'ts, and then candidates are graded against it with no execution.
- On SWE-Bench Verified with 16 rollouts, Agentic Rubrics achieved 54.2% with Qwen3-Coder-30B-A3B and 40.6% with Qwen3-32B, beating strong baselines by 3.5–4.6 percentage points.
- Rubric scores align well with ground-truth tests (ROC-AUC 0.886; PR-AUC 0.722) and offer graded, interpretable feedback instead of only pass/fail.
- Rubrics catch issues that tests may miss, like unnecessary edits, missing edge-case handling, or wrong-layer fixes.
- Gathering repository context agentically is crucial; removing it drops performance, proving that codebase grounding matters.
- Strong models write better, more granular rubrics; distilled open-weight rubric agents also work and outperform patch classifiers.
- Rubrics are cost-efficient for Test-Time Scaling and less brittle than generating runnable tests or relying on stylistic patch similarity.
- Ablations show small sensitivity to the judge model, and combining rubrics with generated tests can perform even better.
- Overall, Agentic Rubrics provide a scalable, interpretable, and execution-free verification signal for training and selecting SWE agent patches.
Why This Research Matters
Better verification means better software, faster. Agentic Rubrics help teams choose the right fix from many options without the heavy cost of running every test suite. Because the rubric is grounded in the real project, the feedback is specific, interpretable, and harder to game than style-based checks. This improves reliability for code assistants in real-world repositories, not just toy examples. The approach is cost-effective for large-scale deployments where many candidates must be scored. Finally, rubrics can power training signals, teaching future agents what "good" really looks like.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how teachers use a grading checklist to mark your science project even if they don't run every experiment you did? That checklist makes judging fast and fair.
🥬 Filling (The Actual Concept): What this paper is about: software engineering (SWE) agents, the AI helpers that edit code, need a fair judge to tell which fix is best. Traditionally, the judge runs tests (actually executing code) to see if things work. That's reliable, but it's slow and tricky to set up for every project. On the other hand, not running code at all is fast but can be shallow and miss important details. How it works today vs. before: Before this work, many verifiers either (a) executed tests, which is great when they run but heavy and sometimes brittle, or (b) skipped execution, which is lightweight but less grounded and sometimes fooled by style. The community wanted something scalable, grounded in the real project, and easy to understand. Why it matters: Without a strong verifier, two big things suffer: (1) training agents using rewards (they need a good signal of what counts as a good fix), and (2) test-time scaling, i.e., picking the best from many candidate fixes. A fuzzy or costly verifier slows progress or picks the wrong patch.
🍞 Bottom Bread (Anchor): Imagine 16 classmates each submit a repair plan for a broken robot. If the teacher must fully assemble and test all 16 robots each time, that's slow. If the teacher only skims for neat handwriting, that's unfair. A smart, detailed checklist about the robot's real parts and connections lets the teacher grade quickly and fairly.
New concepts introduced here (using the Sandwich pattern):
🍞 Hook: Imagine a team of helpful robots that can fix your LEGO car when it breaks. 🥬 Concept: SWE Agents are AI programs that read, modify, and create code to fix bugs or add features. How it works:
1) Read the issue description (what's broken). 2) Explore files. 3) Propose code changes (a patch). 4) Submit the patch. Why it matters: Without capable SWE agents, fixing bugs at scale remains slow and manual. 🍞 Anchor: A SWE agent updates a Python function to avoid a crash when an input is missing.
🍞 Hook: You know how judges score gymnastics routines with clear criteria? 🥬 Concept: Verification is judging if a code patch is correct, safe, and complete. How it works: Look at a patch and the task, then decide if it meets the requirements. Why it matters: Without verification, we can't pick the best patch or train agents well. 🍞 Anchor: If a patch fixes a crash but breaks another feature, verification should catch that.
🍞 Hook: If you try a maze 10 different times, your chance of winning goes up if you pick the best path afterward. 🥬 Concept: Test-Time Scaling (TTS) means sampling many candidate patches and choosing the best using a verifier. How it works: 1) Generate K candidates. 2) Score each. 3) Pick the top one. Why it matters: Without TTS, we waste extra tries and often settle for a weaker answer. 🍞 Anchor: Try 16 code fixes, then use a judge to select the winner.
The problem the paper tackles: Many SWE verifiers need to run code (tests), which is costly in new repos (you have to set up environments, dependencies, and sandboxes). And execution-free verifiers often aren't grounded in the specific codebase, so they can be fooled by surface-level cues. Failed attempts: Patch classifiers (LLMs that say YES/NO) can be fast but are sometimes swayed by style or shallow similarity. Generated tests can be great but are brittle: writing runnable, discriminating tests in unfamiliar repos is hard, and setup time balloons. The gap: We need a verifier that is (1) scalable like execution-free methods, (2) grounded like tests, and (3) interpretable like a teacher's rubric. This paper's idea: Agentic Rubrics, where an expert agent explores the repository to build a project-specific rubric (a checklist), then uses that rubric to score candidate patches without executing the code. Real stakes: Faster, fairer verification helps pick better fixes and train better agents. That means fewer bugs in the apps you use, safer updates, and faster development.
🍞 Anchor: Think of a teacher who first walks around the science fair, studies each project's materials and instructions, then writes a custom checklist for that fair. Every student's work is scored against that checklist, quickly and fairly, without re-running every experiment.
02 Core Idea
🍞 Top Bread (Hook): Imagine you're grading a cookie recipe. You don't need to bake the cookie every time if you know what a correct recipe must include (butter, sugar, flour, oven temp) and what mistakes to avoid.
🥬 Filling (The Actual Concept): The aha! moment in one sentence: Let an expert agent read the repo and issue, write a precise, context-grounded rubric, and then grade code patches against that rubric without running the code. How it works (simple):
1) Explore the repository (files, functions, interfaces). 2) Write a rubric.yaml organized into File Change, Spec Alignment, Integrity, and Runtime. 3) For each candidate patch, a judge assigns 0/1 to each rubric item and aggregates a weighted score. 4) Pick the patch with the highest score. Why it matters: Without a grounded rubric, execution-free checks are vague; without execution-free scoring, scaling is slow and costly.
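Written out, the aggregation in step 3 is a weighted average (as the Methodology section below describes): S = (sum over items i of w_i · v_i) / (sum over items i of w_i), where v_i ∈ {0, 1} is the judge's verdict on rubric item i and w_i is its weight. A patch that satisfies every item scores 1.0; partial solutions land somewhere in between.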
Three analogies:
- Cooking: A head chef checks recipes for required steps and safe temperatures on paper before anyone turns on the oven.
- School: A teacher writes a clear grading rubric after reading the assignment and the textbookās exact chapter, then grades projects quickly and fairly.
- Sports: Judges know the elements of a perfect dive; they score routines even without jumping in the pool themselves.
Before vs. After:
- Before: Verifiers either ran tests (heavy, brittle) or skipped running (light, but shallow and ungrounded).
- After: Agentic Rubrics give you a grounded-yet-execution-free signal: specific to the codebase, easy to scale, and interpretable.
Why it works (intuition):
- Context grounding: The agentic phase ties criteria to real files, classes, and behaviors in this repository, cutting ambiguity.
- Decomposition: Breaking correctness into small, weighted checks gives a dense, graded signal that separates partial progress from full solutions.
- Robustness: Instead of matching a reference patch (which penalizes stylistic differences) or generating runnable tests (which may fail to build or be too narrow), rubrics state what should hold and check that.
Building blocks (each with a Sandwich explanation):
🍞 Hook: You know how editors prefer small, targeted changes over giant, risky rewrites? 🥬 Concept: File Change axis checks that edits are minimal, local, and in the right places. How it works: Count touched files, confirm the right file/symbol changed, and avoid unrelated churn. Why it matters: Without it, patches may over-edit and break other parts. 🍞 Anchor: Fix a bug by adding a two-line guard in xyz.py, not by renaming the whole utils folder.
🍞 Hook: If the assignment says "change timeout to 30s," you should change it to 30s, not 25s or 40s. 🥬 Concept: Spec Alignment axis checks the patch matches the issue's requirements. How it works: Look for exact constants, options, or behaviors promised in the PR description. Why it matters: Without it, a patch might look tidy but miss the actual user need. 🍞 Anchor: The rubric verifies the change_timeout(30) was truly applied where required.
🍞 Hook: No cheating allowed: don't delete the quiz! 🥬 Concept: Integrity axis prevents hacks like weakening tests or breaking APIs. How it works: Forbid skipping tests, mass renames, or API shape changes unless specified. Why it matters: Without integrity checks, a patch could "pass" by silencing the verifier instead of fixing the bug. 🍞 Anchor: The rubric flags a patch that comments out failing tests.
🍞 Hook: A bridge blueprint should guarantee cars won't wobble in the wind. 🥬 Concept: Runtime axis encodes intended behavior and safety (determinism, error messages, performance bounds) in natural language. How it works: Require guards, stable exceptions, and backward-compatible flows, checked via text, not execution. Why it matters: Without runtime intent, a patch might look right on paper but behave unstably. 🍞 Anchor: The rubric expects a specific ValueError message when a bad input arrives, not a vague crash.
🍞 Hook: Imagine trying 16 puzzle solutions and picking the best with a fair rubric. 🥬 Concept: BEST@K selection means scoring K candidates and choosing the top one. How it works: Compute the rubric score for each and take the max. Why it matters: Without reliable scoring, adding more candidates doesn't help as much. 🍞 Anchor: With 16 patches, the rubric-based choice beats random and shallow heuristics.
03 Methodology
At a high level: Problem + Repo → Agent explores and writes rubric.yaml → Candidates are scored (no execution) → Highest-scoring patch is selected.
Step-by-step, like a recipe:
- Inputs gathered
- What happens: You start with a SWE-Bench Verified issue (the PR description) and the full repository snapshot in a sandbox.
- Why this step exists: The rubric must reflect the real codebase; otherwise criteria are vague.
- Example: The issue says "Change timeout to 30s in xyz.py; don't modify utils.py." The repo has src/, tests/, and docs/.
- Agentic repository exploration
- What happens: An expert rubric agent uses tools like search and file viewers to find relevant files, symbols, and code paths.
- Why it matters: Without grounding, rubrics become generic ("touch the right file") and are hard to grade.
- Example: The agent searches for "timeout" references, opens xyz.py, finds change_timeout(), and checks call sites.
- Rubric construction (rubric.yaml)
- What happens: The agent writes a structured YAML file with axes: File Change (4–8 items), Spec Alignment (3–6), Integrity (3–6), Runtime (3–6). Each item has an id, a short description, and a weight in {1,2,3}. (A minimal rubric.yaml sketch, with the grading arithmetic, appears right after this recipe.)
- Why it matters: Decomposing correctness yields many small, clear checks. Weighted aggregation gives a dense score in [0,1].
- Example items:
- FC1 (weight 3): Edits only xyz.py and avoids utils.py.
- SA1 (weight 3): Sets the timeout constant to exactly 30 seconds.
- I1 (weight 2): Does not modify tests or public API signatures.
- R1 (weight 2): Adds a guard that returns a specific error for invalid inputs.
- Candidate patch generation (K rollouts)
- What happens: A separate SWE agent (e.g., Qwen3-32B or Qwen3-Coder-30B-A3B) produces K=16 independent patches by interacting with the sandbox.
- Why it matters: Test-Time Scaling needs multiple candidates to pick from.
- Example: 16 different diffs change timeout in slightly different ways.
- Rubric grading (execution-free)
- What happens: An LLM judge reads the problem, the rubric, and the candidate patch, then assigns 0/1 to each item. A weighted average yields the final score S in [0,1].
- Why it matters: Execution-free scoring scales: no environment boot, no flakiness from runtime.
- Example: A candidate that edits xyz.py (OK), doesn't touch utils.py (OK), uses 30s (OK), and preserves API (OK), but forgets the runtime guard (Fail) might score around 0.85.
- Best-of-K selection
- What happens: Choose the patch with the highest rubric score.
- Why it matters: Turning many tries into a better final answer is the heart of Test-Time Scaling.
- Example: If scores are [0.41, 0.64, 0.72, 0.90, ...], pick 0.90.
- Optional: Use rubric scores for training
- What happens: Treat the score as a reward signal to fine-tune agents.
- Why it matters: Denser, interpretable rewards can steer learning better than sparse pass/fail.
- Example: Encourage patches that consistently pass high-weight items.
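To make the rubric-construction and rubric-grading steps of the recipe concrete, here is a minimal sketch of a rubric.yaml for the running example, together with the weighted aggregation that turns a judge's 0/1 verdicts into a score. The YAML field names, the verdict values, and the helper code are illustrative assumptions rather than the paper's actual schema or implementation; only the four axes, the binary items, and the weights in {1, 2, 3} follow the description above.

```python
# Illustrative rubric.yaml content for the running example (the schema is an
# assumption), plus the weighted aggregation from the grading step. Uses PyYAML.
import yaml

RUBRIC_YAML = """
file_change:
  - {id: FC1, weight: 3, desc: "Edits only xyz.py and does not touch utils.py"}
spec_alignment:
  - {id: SA1, weight: 3, desc: "Sets the timeout constant to exactly 30 seconds"}
integrity:
  - {id: I1, weight: 2, desc: "Does not modify tests or public API signatures"}
runtime:
  - {id: R1, weight: 2, desc: "Adds a guard raising a specific error for invalid inputs"}
"""

def rubric_score(items, verdicts):
    """Weighted average of binary verdicts: sum(w_i * v_i) / sum(w_i), in [0, 1]."""
    total_weight = sum(item["weight"] for item in items)
    earned = sum(item["weight"] * verdicts[item["id"]] for item in items)
    return earned / total_weight

rubric = yaml.safe_load(RUBRIC_YAML)
items = [item for axis in rubric.values() for item in axis]  # flatten the four axes

# Hypothetical judge verdicts for the candidate in the grading example:
# everything passes except the runtime guard (R1).
verdicts = {"FC1": 1, "SA1": 1, "I1": 1, "R1": 0}
print(rubric_score(items, verdicts))  # (3 + 3 + 2) / 10 = 0.8
```

With only these four items, the missed guard costs 0.2 and the candidate scores 0.8; the "around 0.85" in the grading example presumes a fuller rubric in which more items pass. Selecting among K candidates then simply means keeping the patch with the highest score.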
The secret sauce (what makes it clever):
- Agentic grounding: The rubric is tied to real files, classes, and behaviors in the repo, making checks unambiguous.
- Decomposed scoring: Many small criteria capture partial progress, enabling finer reranking.
- Execution-free scalability: After rubric generation, grading many candidates is cheap.
- Interpretability: Natural-language criteria explain why a patch scores up or down, surfacing failure modes tests may miss.
Mini Sandwich explanations of key components:
🍞 Hook: Like keeping a house repair tiny and local so you don't break the plumbing next door. 🥬 File Change axis: Verifies small, targeted edits in the right places; penalizes unrelated churn. Why it matters: Prevents scope creep and accidental regressions. 🍞 Anchor: Change a constant in one file; don't rename the whole package.
🍞 Hook: If the instructions say "30s," then 30s it must be. 🥬 Spec Alignment axis: Checks the patch matches the problem statement's exact requirements. Why it matters: Ensures the user's need is truly met. 🍞 Anchor: The diff shows timeout=30; rubric marks SA1=1.
🍞 Hook: No cheating on the test. 🥬 Integrity axis: Forbids weakening tests, mass renames, or API breaks unless required. Why it matters: Preserves trust and backward compatibility. 🍞 Anchor: A patch that comments out a failing test gets I1=0.
🍞 Hook: A good blueprint must imply a safe, stable building. 🥬 Runtime axis: Encodes intended runtime properties (guards, exceptions, determinism) in words. Why it matters: Guards against silent crashes or flaky behavior. 🍞 Anchor: Expect a ValueError with a specific message for bad inputs.
Data flow summary:
- Input: PR description + repository
- Agentic phase: Explore → synthesize rubric.yaml
- Rerank phase: For each candidate patch → judge scores rubric → select best
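The same flow as a compact sketch; `rubric_agent` and `judge` are hypothetical stand-ins for the exploring rubric agent and the LLM judge, not a real API from the paper or any library.

```python
# Two-phase flow: agentic rubric writing, then execution-free reranking.
# All objects and method names here are hypothetical placeholders.
def select_with_agentic_rubric(problem, repo, candidate_patches, rubric_agent, judge):
    rubric = rubric_agent.write_rubric(problem, repo)        # agentic phase: explore repo, emit rubric.yaml
    scored = [(patch, judge.score(problem, rubric, patch))   # rerank phase: grade each patch, no execution
              for patch in candidate_patches]
    return max(scored, key=lambda pair: pair[1])             # keep the best-scoring patch
```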
What breaks without each step:
- No exploration: Rubrics become generic; grading gets ambiguous.
- No decomposition: One fuzzy score hides where the patch succeeds or fails.
- No execution-free scoring: Scaling to many candidates becomes expensive.
- No integrity checks: Cheating slips through.
Concrete example snippet:
- PR: "Change timeout to 30s in xyz.py; do not edit utils.py; add a guard for missing inputs."
- Rubric:
- FC1: Only xyz.py changed (w=3)
- SA1: timeout set to 30 (w=3)
- I1: No test weakening (w=2)
- R1: Guarded path raises ValueError("missing input") (w=2)
- Patch A: Changes xyz.py, sets 30, adds guard with correct message, no other edits → Score ~1.0
- Patch B: Changes xyz.py and utils.py, sets 30, no guard → FC1=0, SA1=1, I1=1, R1=0 → Score lower; A wins.
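- Worked score (applying the example weights): Patch A = (3 + 3 + 2 + 2) / 10 = 1.0, while Patch B = (0·3 + 1·3 + 1·2 + 0·2) / 10 = 0.5, so Patch A wins by a wide margin.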
04 Experiments & Results
The test: Evaluate verifiers for selecting the best patch out of K=16 candidates on SWE-Bench Verified. A problem counts as solved if the selected patch passes the hidden official tests (Fail-to-Pass and Pass-to-Pass). We compare several verifiers and report BEST@K.
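As a minimal sketch of how that reported number could be computed (the callables `verifier_score` and `passes_hidden_tests` are hypothetical stand-ins, not the benchmark's actual API):

```python
def best_at_k_resolved_rate(instances, verifier_score, passes_hidden_tests):
    """Fraction of problems where the verifier-selected patch passes the hidden tests.

    `instances` holds, for each problem, its K candidate patches; `verifier_score`
    and `passes_hidden_tests` stand in for the verifier under evaluation and the
    official test harness.
    """
    solved = 0
    for candidates in instances:
        chosen = max(candidates, key=verifier_score)   # BEST@K selection
        solved += int(passes_hidden_tests(chosen))     # hidden Fail-to-Pass / Pass-to-Pass check
    return solved / len(instances)
```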
The competition (baselines):
- Oracle Pass@16: Upper bound; choose by ground-truth tests (not available to real verifiers).
- Random@16: Pick randomly.
- Non-agentic verifiers: Self-Consistency (choose the patch most similar to others), Patch Classifier (LLM YES/NO score).
- Agentic verifiers: Agentic Tests (generate a runnable test file), Agentic Patch Similarity (compare to a proxy reference patch), and our Agentic Rubrics.
The scoreboard (with context):
- Qwen3-32B generator, K=16:
- Random: ~22.6% (like guessing on a tough quiz)
- Patch Classifier: 37.1% (solid B-)
- Agentic Patch Similarity: 35.0% (B- but slightly worse)
- Agentic Tests: 33.6% (B-, but hampered by setup/brittleness)
- Agentic Rubrics (ours): 40.6% (like turning that B- into a strong B+), +3.5 points over the best baseline
- Qwen3-Coder-30B-A3B generator, K=16:
- Random: 39.6%
- Patch Classifier: 50.2%
- Agentic Patch Similarity: 49.6%
- Agentic Tests: 49.0%
- Agentic Rubrics (ours): 54.2% (an A- when others hover around B+), +4.0 to +4.6 points improvement
- Scaling curves: As K increases, rubrics keep their advantage, showing benefits aren't tied to a single setting.
Alignment and grading quality:
- Rubric scores correlate strongly with ground-truth test outcomes:
- ROC-AUC = 0.886; PR-AUC = 0.722. High scores concentrate near 0.85–1.0 for true passes; false patches score lower and spread wider. (A sketch of how such alignment numbers are computed appears after this list.)
- By axis:
- Failing patches often stumble on File Change (unnecessary edits), Spec Alignment (missed requirements), and Runtime (unstable behavior), while usually preserving Integrity.
- Passing patches saturate Spec Alignment and Integrity, but sometimes still get dinged for over-scoped edits or subtle runtime checks.
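For readers who want to see how such alignment numbers are produced, here is a minimal sketch using scikit-learn on made-up data (the scores and labels are purely illustrative, not the paper's data; PR-AUC is approximated by average precision, a common convention):

```python
# Measure how well rubric scores separate passing from failing patches.
from sklearn.metrics import roc_auc_score, average_precision_score

passes_tests = [1, 0, 1, 0, 1, 0]                     # 1 = patch passes the hidden tests
rubric_scores = [0.95, 0.40, 0.62, 0.70, 0.90, 0.20]  # hypothetical rubric scores

print("ROC-AUC:", roc_auc_score(passes_tests, rubric_scores))
print("PR-AUC (average precision):", average_precision_score(passes_tests, rubric_scores))
```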
Surprising findings:
- Rubrics sometimes reject test-passing patches for high-utility reasons (about 54% of such disagreements): missed root causes, wrong-layer fixes, or missing edge cases the tests didnāt cover.
- Agentic grounding matters: Removing repository interaction to write "non-agentic" rubrics drops performance notably (e.g., −4.0 points on Qwen3-32B rollouts), proving that codebase-specific details improve scoring.
- Judge model sensitivity is small: Increasing judge reasoning nudges BEST@16 from roughly 54.2% to 55.0%, so fancy judges aren't required when rubrics are atomic and clear.
- Cost-effectiveness: Among agentic methods using the same model, rubrics deliver higher performance at lower total cost per instance than generated tests or patch similarity (after accounting for artifact generation and grading many rollouts).
- Distillation works: Fine-tuning an open-weight model as a rubric generator outperforms fine-tuning it as a patch classifier, suggesting rubric creation is a stronger, more robust objective for execution-free verification.
Mini Sandwich on Execution-free Verification: 🍞 Hook: Like checking a recipe's steps without baking every batch. 🥬 Concept: Execution-free verification scores patches without running the code. How it works: Read problem, rubric, and patch; assign item scores; aggregate. Why it matters: It scales to many candidates quickly. 🍞 Anchor: Grade 16 patches in seconds without spinning up containers.
05 Discussion & Limitations
Limitations:
- Rubric quality varies: A subset of rubrics are low-utility (over-specified, redundant, or mismatched with tests). This can be mitigated with human-in-the-loop refinement and better prompts/templates.
- Reward hacking risk: If used for training, models might learn to "game" rubric cues. Careful design, evolving rubrics, and audits are needed.
- Coverage gaps: While rubrics are grounded, they still can't fully replace dynamic checks for performance or concurrency bugs that only appear at runtime.
- Dependency on agent capability: Stronger rubric agents make better, more granular rubrics; weaker ones may miss key criteria.
Required resources to use this method:
- A sandboxed repository environment with tools for search and file viewing.
- One capable model to act as the rubric-generation agent, and a smaller judge model to score items.
- Optional: Storage and logging to audit rubrics and scores.
When not to use Agentic Rubrics:
- Pure performance tuning or timing-sensitive issues where only execution reveals the truth.
- Heavy integration or system-level behavior (e.g., networking) where runtime side effects are central.
- Repos with very sparse signals (few identifiers or unclear structure), where grounding becomes too ambiguous.
Open questions:
- How to best combine rubrics with execution-based tests (hybrid verifiers) for maximum benefit?
- How to evolve rubrics over time (e.g., curriculum or adaptive criteria) without becoming prescriptive?
- How to robustly use rubric signals for RL while avoiding reward hacking and ensuring generalization?
- Can we automate human-in-the-loop rubric refinement efficiently (e.g., quick edits, templates, auto-detection of over-specification)?
06 Conclusion & Future Work
Three-sentence summary: Agentic Rubrics let an expert agent read the repository and write a project-specific checklist, then grade code patches without running them. This approach is scalable, interpretable, and grounded, outperforming strong baselines on SWE-Bench Verified in Test-Time Scaling. Rubric scores align with real tests and even flag issues tests can miss.
Main achievement: Turning verification into a context-grounded, execution-free rubric that delivers both performance gains and clear, granular feedback for selection and training.
Future directions:
- Hybrid verifiers that blend rubrics with generated or existing tests for even better coverage.
- Human-in-the-loop rubric refinement and template reuse to boost fidelity and reduce low-utility items.
- Using rubric rewards for post-training (e.g., RL) while guarding against reward hacking.
Why remember this: It shows you don't always need to run code to judge code, if you can write the right grounded checklist first. That insight unlocks scalable verification, better test-time selection, clearer feedback, and stronger training signals for next-generation SWE agents.
Practical Applications
- Fast patch triage in large organizations by grading many candidate fixes without spinning up full environments.
- Pre-merge code review assistance that highlights risky scope creep or API breaks using rubric axes.
- Post-training rewards for coding agents, using rubric scores to reinforce grounded, high-quality edits.
- Repository onboarding: auto-generate checklists that encode key contracts, guards, and style for new contributors.
- Safety auditing for critical paths (e.g., auth, payments) with integrity and runtime criteria clearly spelled out.
- Hybrid verification pipelines that combine rubric checks with selective execution-based tests for high coverage.
- Refactoring guidance to keep edits minimal and localized, reducing regression risk.
- Automated feedback for LLM-generated pull requests, with itemized reasons for acceptance or rejection.
- Continuous integration optimization by using rubric scores as a fast first-pass filter before costly test runs.