SERA: Soft-Verified Efficient Repository Agents
Key Summary
- SERA is a new, low-cost way to train coding helpers (agents) that learn the style and secrets of your own codebase.
- Instead of running heavy tests for every example, SERA uses soft verification, which checks whether two code changes overlap line by line.
- SERA’s data factory, called Soft-Verified Generation (SVG), makes two passes: first it creates a change with a vague instruction, then it tries to recreate that same change from a synthetic pull request.
- This soft check is good enough to train strong agents and removes the need for complicated test infrastructure.
- On the standard SWE-bench Verified benchmark, SERA matches or beats other fully open methods while costing 26–57 times less to reach similar performance.
- SERA can specialize to a single repository and match or exceed its teacher model in about 8,000 examples, which costs roughly $1,300.
- Simple choices like keeping reasoning traces, limiting very long patches, and ordering truncated examples by how complete they are can noticeably boost results.
- Longer context helps evaluations a lot, but SERA already performs strongly at 32K context and remains competitive at 64K despite being trained at 32K.
- The team shows careful statistics and scaling laws, warning that many 1–3% gains in past papers can be noise without multiple random seeds.
- Everything is open: code, data, models, and a proxy to plug SERA into real coding tools.
Why This Research Matters
SERA makes it practical and affordable for small teams to build coding agents that know their own codebases deeply. This reduces the need to ship private code to external services, improving privacy and control. It also cuts costs dramatically compared to reinforcement learning or test-heavy synthetic pipelines, which lowers the barrier to entry for startups, researchers, and open-source maintainers. Because SERA’s soft checks work on any repository, organizations can train on exactly what they care about, quickly and repeatedly as their code evolves. The statistical care and scaling laws help teams budget realistically and avoid chasing noisy 1–3% gains. Overall, SERA turns the promise of open-weight specialization into a real, everyday tool.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you join a new school where every classroom has different rules. If a helpful robot can learn your classroom’s rules, it will help you a lot more than a general robot that doesn’t know them.
🥬 The Concept (Agent Workflow): A coding agent is a tool-using helper that reads files, edits code, runs commands, and then turns in a final patch.
- How it works:
- It sees the problem (like a GitHub issue).
- It uses tools to read files, search, and run scripts.
- It edits code and keeps notes (a trajectory of steps).
- It submits a final code change (a patch).
- Why it matters: Without a clear workflow and reliable tool use, the agent gets lost and can’t solve real tasks. 🍞 Anchor: Think of a chef following a recipe, tasting, adjusting seasoning, and finally plating a dish.
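To make the loop concrete, here is a minimal sketch in Python. It is illustrative only: the `llm.chat` interface and the `execute_tool` stub are hypothetical stand-ins, and only the view/edit/run/submit toolset is taken from the paper.

```python
# Illustrative agent loop, not SERA's actual scaffold.
# `llm` is any chat client exposing .chat(messages) -> reply with .text, .tool, .args
# (a hypothetical interface); view/edit/run/submit mirrors the paper's simple toolset.

def execute_tool(tool, args):
    """Stub: a real scaffold would read files, apply edits, or run commands in a sandbox."""
    return f"[observation from {tool} with {args}]"

def run_agent(llm, issue_text, max_steps=50):
    """One issue-to-patch episode; returns the final patch (diff text) or None."""
    messages = [{"role": "user", "content": f"Solve this issue:\n{issue_text}"}]
    for _ in range(max_steps):
        reply = llm.chat(messages)                      # model reasons, then picks a tool
        messages.append({"role": "assistant", "content": reply.text})
        if reply.tool == "submit":
            return reply.args.get("patch")              # the final code change
        observation = execute_tool(reply.tool, reply.args)
        messages.append({"role": "tool", "content": observation})
    return None                                         # gave up without submitting
```

The important part is the shape of the loop: observe, reason, act with a tool, and eventually submit a patch.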
🍞 Hook: You know how you write down each step when solving a hard math problem so your teacher can see how you thought it through?
🥬 The Concept (Trajectory): A trajectory is the full step-by-step record of what the agent saw, thought, and did during a task.
- How it works:
- Start when the agent reads the issue.
- Log every tool call and observation.
- Record edits and reasoning until the patch is submitted.
- Why it matters: Without trajectories, we can’t train the agent on the exact process of solving problems. 🍞 Anchor: It’s like a science fair journal: question, hypothesis, tests, results.
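A trajectory is just structured data, so one hypothetical way to store it is a list of step records (field names and contents below are made up for illustration, not the paper's exact format):

```python
# Hypothetical trajectory schema: one record per step, from issue to submitted patch.
trajectory = [
    {"role": "user",      "content": "Issue: is_active flag is not updated consistently."},
    {"role": "assistant", "content": "I'll inspect process_user() first.",
     "tool": "view", "args": {"path": "app/users.py"}},
    {"role": "tool",      "content": "def process_user(user):\n    ..."},
    {"role": "assistant", "content": "The flag is only set in one branch; fixing it.",
     "tool": "edit", "args": {"path": "app/users.py", "old": "...", "new": "..."}},
    {"role": "tool",      "content": "Edit applied."},
    {"role": "assistant", "content": "Done; submitting the change.",
     "tool": "submit", "args": {"patch": "--- a/app/users.py\n+++ b/app/users.py\n..."}},
]
```

Each assistant record keeps both the reasoning and the tool call, which is exactly the signal supervised fine-tuning will later imitate.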
🍞 Hook: When you fix a typo in an essay, you only change a few lines, not the whole book.
🥬 The Concept (Patch): A patch is the line-by-line difference showing what was added and removed in the code.
- How it works:
- Compare old and new code.
- List added lines with plus signs and removed lines with minus signs.
- Apply this diff to update the codebase.
- Why it matters: Patches are the final product we grade in coding tasks. 🍞 Anchor: It’s a before-and-after snapshot of a haircut, but for files instead of hair.
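Python's standard library can produce exactly this before-and-after view; a tiny, self-contained example (the file and function are invented):

```python
import difflib

old = ["def greet(name):", "    print('Hi ' + name)"]
new = ["def greet(name):", '    print(f"Hi {name}")']

# unified_diff marks removed lines with '-' and added lines with '+',
# which is the patch format graded in coding benchmarks.
patch = "\n".join(difflib.unified_diff(old, new,
                                       fromfile="a/greet.py", tofile="b/greet.py",
                                       lineterm=""))
print(patch)
# --- a/greet.py
# +++ b/greet.py
# @@ -1,2 +1,2 @@
#  def greet(name):
# -    print('Hi ' + name)
# +    print(f"Hi {name}")
```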
🍞 Hook: Schools use tests to check learning, but setting up tests for every tiny lesson takes lots of time.
🥬 The Concept (Synthetic Data Generation): Synthetic data generation makes practice tasks and answers using a strong teacher model so a smaller student model can learn.
- How it works:
- Pick a codebase.
- Ask a powerful model to create and solve pretend issues.
- Save the steps and patches as training data for a smaller model.
- Why it matters: Real labeled data is rare; synthetic data keeps training moving. 🍞 Anchor: Like practicing driving with cones in a parking lot before going on real roads.
🍞 Hook: Puppy training works with treats; computers can also learn from rewards.
🥬 The Concept (Reinforcement Learning): RL teaches a model by letting it try actions and giving rewards when it succeeds.
- How it works:
- The agent attempts tasks.
- It gets a reward if tests pass.
- It updates itself to get more rewards next time.
- Why it matters: RL can push models past their teachers—but it needs heavy, complex infrastructure. 🍞 Anchor: A game character learns to win levels by getting points.
🍞 Hook: Sometimes a helpful hint is enough—you don’t need the whole exam to see if a student understood.
🥬 The Concept (Supervised Fine-Tuning, SFT): SFT means showing the model good examples and asking it to imitate them.
- How it works:
- Collect high-quality trajectories.
- Train the model to predict the next action or tool call.
- Repeat until it follows the style well.
- Why it matters: SFT is simpler and cheaper than RL and is stable for many teams. 🍞 Anchor: A piano student improves by copying a teacher’s hand movements on a song.
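As a rough sketch of what SFT looks like in code, here is a single training step using Hugging Face transformers. This is not SERA's training code: the tiny stand-in model, the trajectory-to-text formatting, and the 4K truncation are placeholder choices (the paper trains Qwen3-32B on full trajectories at 32K context for 3 epochs), and real agent SFT recipes often mask non-assistant tokens out of the loss.

```python
# Minimal SFT sketch (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"   # tiny stand-in so the sketch runs anywhere
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def trajectory_to_text(trajectory):
    """Flatten a step-by-step trajectory into one training string (hypothetical format)."""
    return "\n".join(f"[{step['role']}] {step['content']}" for step in trajectory)

def sft_step(trajectory):
    """One gradient step of next-token imitation on a teacher trajectory."""
    text = trajectory_to_text(trajectory)
    batch = tok(text, return_tensors="pt", truncation=True, max_length=4096)
    out = model(**batch, labels=batch["input_ids"])   # standard causal-LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```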
The world before SERA: Most open-weight coding agents looked promising for private codebases, but training them well was hard. Reinforcement learning needed sandboxes, orchestration, and big teams. Synthetic pipelines that relied on unit tests (like bug injection) were fragile, limited to repos with good tests, and expensive per sample. Many results used single random seeds, making tiny improvements look bigger than they were.
The problem: How can small teams cheaply train strong coding agents—especially ones that specialize to their own codebases—without building huge testing machinery or RL systems?
Failed attempts and gaps:
- Heavy unit-test verification caps how much data you can make and where you can make it (only repos with strong tests).
- RL brings instability and infrastructure overhead.
- Long contexts make everything slower and pricier, so you can’t easily repeat experiments for reliable stats.
What was missing: A way to create tons of useful training trajectories from any repository, with minimal setup, while still filtering for quality—plus a reliable recipe that works with SFT.
Real stakes:
- Privacy: Many teams cannot send code to the cloud.
- Speed: Private specialization should be fast and repeatable as codebases evolve.
- Cost: Startups and labs need solutions that fit tight budgets. SERA shows that soft checks and clever data generation can get you there, turning a theoretical advantage of open models (specialization) into something very practical.
02 Core Idea
🍞 Hook: You know how two students comparing their homework can tell if they solved the same problem even if they used different words?
🥬 The Concept (Soft Verification): Soft verification checks if two code changes overlap enough line by line, instead of running full test suites.
- How it works:
- Generate a change once.
- Try to recreate that same change from a short description.
- Compare patches using line-level recall (how much of A appears in B).
- Why it matters: It removes the need for unit test infrastructure and lets you generate data from any repo at scale. 🍞 Anchor: Two essays may have different styles, but if they both fix the same grammar mistakes, you can see they match.
🍞 Hook: Imagine asking a friend, “Tidy up this room a bit,” then later seeing if another friend can tidy it the same way using your note.
🥬 The Concept (Soft-Verified Generation, SVG): SVG is a two-rollout data factory that makes and then remakes a change, keeping examples where the second change matches the first well enough.
- How it works:
- Rollout 1: Give a vague instruction at a random function; the teacher model edits code and produces a patch and trajectory.
- Turn that into a synthetic PR.
- Rollout 2: Ask the teacher to reproduce the patch from the PR.
- Soft-verify by line overlap and keep good pairs.
- Why it matters: It’s cheap, needs no tests, and creates lots of diverse, realistic coding data (not just bug fixes—also refactors, docs, clarity). 🍞 Anchor: It’s like tracing a drawing twice and keeping the pairs where the second tracing closely matches the first.
The “Aha!” in one sentence: You can train strong, specialized coding agents without heavy test scaffolding by asking a teacher model to make a change, restate the change as a PR, recreate it, and keep examples where both changes agree enough.
Three analogies:
- Music: A teacher plays a tune (first rollout), writes a short score (PR), then plays it again from the score (second rollout). If the two performances match in most notes (soft verification), it’s a good training clip.
- Crafts: A potter shapes a bowl, writes simple steps, then reshapes a similar bowl from those steps; if both bowls look similar, the steps are useful.
- Cooking: A chef tries a new dish, drafts a recipe, then cooks it again from the recipe; if the flavors overlap a lot, you keep the recipe.
Before vs. After:
- Before: Needed unit tests, bug injection, and complex verification; limited to well-tested repos; costly per sample.
- After: No unit tests required; works on any repo; cheap and fast per sample; realistic variety (refactors, docs) from vague prompts; enables private specialization.
Why it works (intuition, not equations): Learning to solve coding tasks isn’t only about perfect correctness—it’s also about skills like navigating files, planning tool calls, and turning intent into edits. SVG captures these skills at scale. Line-overlap is enough signal to filter out noisy pairs while keeping a lot of data. Vague instructions broaden the behavior space so the student learns general coding moves beyond fixing tests.
Building blocks:
- Vague prompts to diversify edits.
- Two-rollout loop to create and re-create changes.
- Line-level recall as a soft quality gate.
- SFT on full trajectories with rich reasoning traces.
- Practical heuristics: prefer high truncation ratios, filter overly long patches or tool outputs when specializing, and mind context length in evaluation.
🍞 Hook: Bigger notebooks help you keep more notes; longer context helps models too.
🥬 The Concept (Context Length): Context length is how much text the model can remember in one go.
- How it works:
- More context fits more files, tool outputs, and reasoning.
- But it also increases memory and cost.
- Evaluations with longer context often score higher, so compare fairly by context.
- Why it matters: Without fair context control, results can look better just because they used longer memory. 🍞 Anchor: A larger whiteboard lets a class keep more steps visible at once, making it easier to solve a long math problem.
🍞 Hook: If you study your own textbook, your quiz scores go up faster than if you study a random one.
🥬 The Concept (Repository Specialization): Specialization means fine-tuning on data from a specific codebase so the model learns its style and patterns.
- How it works:
- Generate SVG data on the target repo.
- Train with SFT on those trajectories.
- Evaluate on tasks from that repo.
- Why it matters: The student can match or beat the teacher because it encodes repo knowledge directly in its weights, not just in the prompt. 🍞 Anchor: A tour guide who studies one city’s streets can out-navigate a general GPS that has to load maps each time.
03 Methodology
High-level recipe: Input (repo + random function + vague instruction) → Rollout 1 (trajectory T1, patch P1) → Make a synthetic PR → Rollout 2 (trajectory T2, patch P2) → Soft-verify overlap between P1 and P2 → Keep good pairs → Supervised fine-tune the student model.
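Read as code, the recipe is a short loop. The sketch below is a paraphrase, not the released implementation: every helper (`list_functions`, `rollout`, `make_synthetic_pr`, `line_recall`) is passed in as a hypothetical callable.

```python
import random

def svg_generate(repo, list_functions, rollout, make_synthetic_pr, line_recall,
                 vague_prompts, n_samples, threshold=0.5):
    """Soft-Verified Generation, sketched: two rollouts per sample, kept when the
    recreated patch overlaps the original well enough."""
    kept = []
    while len(kept) < n_samples:
        func = random.choice(list_functions(repo))          # random starting function
        instruction = random.choice(vague_prompts).format(func=func)

        t1, p1 = rollout(repo, instruction)                 # Rollout 1: make a change
        pr_text = make_synthetic_pr(t1)                     # restate it as a synthetic PR
        t2, p2 = rollout(repo, pr_text)                     # Rollout 2: recreate from the PR

        r = line_recall(p1, p2)                             # fraction of P1's lines found in P2
        if r >= threshold:                                   # r == 1.0 would be hard-verified
            kept.append({"t1": t1, "t2": t2, "patch": p2, "recall": r})
    return kept
```

Note that no unit test is ever run; the only gate is the overlap score, and whether to train on T1, T2, or both is a later data-mix choice.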
Step-by-step, like a friend explaining it:
- Choose a starting point in the code.
- What happens: Pick a random function in the repository and a vague bug or improvement prompt from a list (51 types). Ask the teacher to make a change.
- Why it exists: Starting randomly avoids cherry-picking and explores the repo broadly.
- Example: “There is an issue with state handling near function process_user(). Improve it.”
- Rollout 1: Let the teacher act.
- What happens: Using a simple toolset (view, edit, run, submit), the teacher navigates, edits, and submits a patch, producing trajectory T1 and patch P1.
- Why it exists: This captures realistic problem-solving behavior and the final change.
- Example: Teacher edits 6 lines to standardize a flag and adds 3 tests or logs (still allowed in vague prompts).
- Turn T1 into a synthetic PR.
- What happens: Feed the trajectory and a demo PR format to the teacher and ask it to write a well-structured PR text that describes what changed and why.
- Why it exists: Real dev workflows pivot around PRs; this creates a compact, human-style task description.
- Example: “Refactor: Ensure is_active is consistently updated in process_user; add docstring and clarify edge case.”
- Rollout 2: Re-create the change from the PR.
- What happens: Give only the synthetic PR (no T1 or P1). The teacher tries to reproduce the same change, generating T2 and P2.
- Why it exists: This tests if the PR faithfully describes the change and if the change is reproducible.
- Example: Teacher edits near the same lines and submits P2.
- Soft verification: Check overlap.
- What happens: Compute line-level recall: what fraction of P1’s changed lines appear in P2?
- Why it exists: It’s a lightweight quality filter that needs no unit tests. At r = 1, the example is hard-verified; at 0 < r < 1, it is soft-verified.
- Example: If 8 of 10 changed lines match, r = 0.8 and we keep it.
- Build the training set and fine-tune.
- What happens: Collect many (T1 or T2) trajectories (with reasoning), filter out overly long ones or those that overflow context, and SFT a student model (e.g., Qwen3-32B) for 3 epochs at 32K context.
- Why it exists: Good SFT on diverse, tool-following trajectories reliably improves agent skills.
- Example: Train on 16k–25k samples for strong general agents; around 8k samples to specialize to a single repo.
The secret sauce:
- Vague instructions: They cause the teacher to perform a wider variety of edits—refactors, clarity fixes, documentation tweaks—mirroring real PRs beyond just bug fixes.
- Two-pass agreement: If two independent rollouts converge on similar edits, it’s a strong signal the PR is clear and the change is learnable.
- No unit tests needed: This removes the biggest bottleneck in synthetic pipelines and unlocks any repository for data generation.
Practical choices that matter:
- Reasoning traces: Keeping the teacher’s step-by-step thoughts dramatically boosts student performance; removing them hurts a lot.
- Truncation strategy: Long trajectories often exceed 32K. Instead of random slicing, sort by “truncation ratio” (how much of the original fits) and prefer high ratios (~0.95). This keeps the informative parts while avoiding the noisy tail (like redundant submit steps).
- Filtering for specialization: When targeting a repo, it can help to filter out overly long patches (e.g., >40 lines) or giant tool outputs (e.g., >600 tokens), depending on the repo; different repos benefit from different filters (a small curation sketch follows this list).
- Context length in evals: Longer contexts often score higher. Compare methods at the same context to be fair.
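Here is a rough sketch of the curation heuristics above. The sample fields (`text`, `patch`, `tool_outputs`) and the tokenizer interface are assumptions; the thresholds echo the numbers quoted above, but the exact formulas are illustrative, not the paper's code.

```python
MAX_CONTEXT = 32_000  # training context budget used in the paper

def truncation_ratio(sample, tok):
    """Roughly: how much of the full trajectory fits in the training context."""
    n_tokens = len(tok.encode(sample["text"]))
    return min(1.0, MAX_CONTEXT / n_tokens)

def curate(samples, tok, min_ratio=0.95, max_patch_lines=40, max_tool_tokens=600):
    """Keep samples that are nearly complete and (optionally) not too bulky."""
    kept = []
    for s in samples:
        changed = [line for line in s["patch"].splitlines()
                   if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
        if len(changed) > max_patch_lines:                   # optional specialization filter
            continue
        if any(len(tok.encode(o)) > max_tool_tokens for o in s["tool_outputs"]):
            continue                                          # drop giant tool observations
        if truncation_ratio(s, tok) >= min_ratio:             # prefer high truncation ratios
            kept.append(s)
    return kept
```

Treat the patch-size and tool-output limits as knobs rather than fixed rules, since different repos benefit from different filters.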
🍞 Hook: When grading class projects, a quick rubric can work as well as a full lab test—especially early on.
🥬 The Concept (Line-Level Recall in Soft Verification): This is a score showing what fraction of original changed lines reappear in the re-created patch.
- How it works:
- Count changed lines in P1 that also show up in P2.
- Divide by the number of changed lines in P1.
- Use thresholds (like 0.5 or 1.0) to select data.
- Why it matters: It’s fast, model-agnostic, and doesn’t require test rigs; at scale, it performed as well as strict tests for training quality. 🍞 Anchor: If your friend re-draws 8 of the 10 stars you drew, you know they captured most of your idea.
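One plausible implementation of this score, assuming unified-diff patches (the paper's exact matching rules, e.g., for duplicates or whitespace, may differ):

```python
def changed_lines(patch):
    """Added/removed lines of a unified diff, ignoring the +++/--- header lines."""
    return {line for line in patch.splitlines()
            if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))}

def line_recall(p1, p2):
    """Fraction of P1's changed lines that also appear in P2 (0.0 to 1.0)."""
    c1, c2 = changed_lines(p1), changed_lines(p2)
    return len(c1 & c2) / len(c1) if c1 else 0.0

# If 8 of P1's 10 changed lines reappear in P2, line_recall returns 0.8,
# which clears a 0.5 threshold (soft-verified); 1.0 would count as hard-verified.
```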
Putting it together: SVG makes lots of affordable, good-enough training pairs from any repo. SFT on these trajectories teaches reliable tool use and code-edit skills. Add a little smart filtering and careful truncation, and you get powerful general agents and repo specialists—without the heavy test or RL machinery.
04 Experiments & Results
🍞 Hook: If you want to compare soccer teams, you need the same field, same ball, same time—fair rules make scores meaningful.
🥬 The Concept (SWE-bench Verified): SWE-bench Verified is a widely used set of real GitHub issues with tests where a task is solved if failing tests now pass and no new tests break.
- How it works:
- Present an issue from repos like Django or Sympy.
- Agent submits a patch.
- Run tests before and after; if the fix is clean, it’s a win.
- Why it matters: It’s the common playing field for coding agents, so we can compare apples to apples. 🍞 Anchor: It’s like a league where all teams play under the same rules and refs.
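The grading rule can be approximated in a few lines. This is a simplified stand-in, not the benchmark's actual harness (which pins per-task environments and recorded test lists); the `fail_to_pass` / `pass_to_pass` parameters are just descriptive names here.

```python
import subprocess

def run_tests(test_ids, repo_dir):
    """Run each test with pytest and return the set that passed (sketch only)."""
    passed = set()
    for tid in test_ids:
        result = subprocess.run(["python", "-m", "pytest", "-q", tid],
                                cwd=repo_dir, capture_output=True)
        if result.returncode == 0:
            passed.add(tid)
    return passed

def is_resolved(repo_dir, patch_file, fail_to_pass, pass_to_pass):
    """Solved = previously failing tests now pass AND no previously passing test breaks."""
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    now_passing = run_tests(list(fail_to_pass) + list(pass_to_pass), repo_dir)
    return set(fail_to_pass) <= now_passing and set(pass_to_pass) <= now_passing
```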
The test: Use SWE-bench Verified and control for context length (32K vs 64K) to be fair. Report averages over three random seeds to avoid noise.
The competition: Compare against synthetic-data pipelines (e.g., SWE-smith), RL-based agents (e.g., SkyRL, DeepSWE), and strong open-weight models (e.g., Devstral-Small-2) and teachers (GLM-4.5-Air, GLM-4.6).
The scoreboard with context:
- General agents (32K): SERA-32B reaches about 49.5% at 32K, which is state-of-the-art among fully open-source methods and within the uncertainty range of strong open-weight baselines.
- At 64K eval: SERA-32B hits about 54.2%. Even though it was trained at 32K, it stays competitive at 64K, while some baselines were trained for longer contexts.
- Cost wins: Matching RL or prior synthetic baselines costs 26× to 57× less with SERA when self-hosted, and the savings are even larger with a low-cost API. That’s like getting an A at a fraction of the tutoring bill.
Surprising findings:
- Verification level matters less than expected: Training on soft-verified (or even unverified) data performed about as well as hard-verified data at the tested scales. The skill of turning intent into edits may dominate early gains.
- Truncation order matters a lot: Curating by high truncation ratio (~0.95) beats random truncation. Keeping the meaningful first steps and trimming the redundant tail works best.
- Reasoning traces are gold: Removing them caused big drops, confirming they carry essential learning signal.
- Vague prompts help: They produce diverse edits beyond bug fixes (like refactors and docs), which improved benchmark performance.
Repository specialization:
- With 8,000 trajectories per repo (about $1,300), SERA matched or beat its teacher on Django and Sympy and was competitive on Sphinx at 32K. This supports the idea that a student, once it encodes repo-specific knowledge, can outdo a teacher that only sees the repo through a prompt.
- Mixing specialized data for two repos (Django + Sympy) slightly lowers each individual score but improves the average vs. general data, suggesting multi-repo specialization is viable.
Scaling laws and costs:
- Performance vs. cost follows a predictable power law with very small average prediction error, letting you estimate the budget to match a target model (e.g., Devstral-Small-2) before you spend.
- With a cost of roughly $2,000 total (about 40 GPU-days), SERA achieved strong open-source results. The same approach predicts matching certain open-weight baselines at single-digit thousands of dollars, depending on setup.
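A sketch of how such a budget estimate can be made: fit a power law in log-log space to your own (cost, score) runs, then invert it for a target score. The approach below is the standard curve fit, not necessarily the paper's exact procedure, and the commented numbers are placeholders.

```python
import numpy as np

def fit_power_law(costs, scores):
    """Fit scores ≈ a * costs**b via a linear fit in log-log space."""
    b, log_a = np.polyfit(np.log(costs), np.log(scores), 1)
    return np.exp(log_a), b

def budget_for_target(target_score, a, b):
    """Invert the fitted curve to estimate the spend needed for a target score."""
    return (target_score / a) ** (1.0 / b)

# Hypothetical usage with made-up numbers:
# a, b = fit_power_law(np.array([250, 500, 1000, 2000]),   # dollars spent
#                      np.array([30.0, 37.0, 44.0, 49.0])) # benchmark scores
# print(budget_for_target(52.0, a, b))                     # estimated dollars for 52%
```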
Statistical care:
- Many 1–3% reported gains in the literature can be pure noise unless you average across multiple seeds.
- The authors consistently report means and standard deviations and analyze signal-to-noise to keep conclusions honest.
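A tiny example of what that reporting implies; the scores below are made-up placeholders, only the arithmetic matters:

```python
import numpy as np

def report(scores_by_seed):
    """Mean and sample standard deviation over seeds."""
    scores = np.asarray(scores_by_seed, dtype=float)
    return scores.mean(), scores.std(ddof=1)

# Two hypothetical methods, three seeds each: the ~1-point gap in means is
# comparable to the seed-to-seed spread, so it may well be noise.
mean_a, std_a = report([48.2, 49.6, 48.9])
mean_b, std_b = report([49.5, 50.6, 49.6])
print(f"A: {mean_a:.1f} ± {std_a:.1f}   B: {mean_b:.1f} ± {std_b:.1f}")
```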
Bottom line: SERA delivers state-of-the-art open-source results with much lower costs, works on any repo (tests or not), and makes private specialization practical and fast.
05 Discussion & Limitations
Limitations and where it might struggle:
- Saturation at higher levels: Soft verification worked as well as hard verification at the tested scales; however, at very high performance, you might need more strictly correct code to keep improving.
- Benchmark scope: Results are centered on SWE-bench Verified. While behaviors looked good in real usage, other benchmarks could reveal new quirks.
- Model family focus: The base models (Qwen3) and teachers (GLM-4.5/4.6) were the main focus. While some cross-model signs are positive, results may shift with other families.
- Public-repo bias: Specialization tests used public repos that base models might have seen during pretraining. True private repos could behave differently (though the core idea should transfer).
- Context mismatch: SERA was trained at 32K but evaluated at both 32K and 64K. Training natively at 64K+ might close remaining gaps but increases cost.
Required resources:
- One 80GB GPU (e.g., A100/H100) for serving SERA-32B comfortably; quantization can help.
- For data generation, either self-host the teacher with vLLM or use a low-cost API with cached input pricing.
- Compatibility with the simple agent scaffold matters: format mismatches can degrade performance.
When not to use:
- If you need the absolute frontier best across all public repos right now and can’t fine-tune, a massive closed model at long context could still win.
- If your task requires strict formal correctness guarantees from training data alone (e.g., safety-critical patches with hard proofs), soft verification may be insufficient.
Open questions:
- Where is the tipping point where hard verification begins to clearly beat soft verification?
- What’s the best recipe for multi-repo specialization across very different domains (e.g., data science libs + web frameworks)?
- How do these findings transfer to other languages and build systems (Java, Rust, Bazel, etc.)?
- Can we further automate filtering heuristics (e.g., learn-to-filter) and truncation policies?
- What’s the ideal mix of T1 vs. T2 trajectories for varying budgets and target repos?
06 Conclusion & Future Work
Three-sentence summary: SERA introduces Soft-Verified Generation (SVG), a cheap, test-free way to create high-quality training data by making and remaking code changes and keeping pairs that overlap enough. With straightforward supervised fine-tuning on these trajectories, SERA matches or beats fully open alternatives and rivals strong open-weight systems at a fraction of the cost. Crucially, it makes private repository specialization practical: about 8,000 examples can match or exceed the teacher on that repo.
Main achievement: Turning the theoretical advantage of open-weight models (they can be specialized to private codebases) into a practical, affordable, and reproducible pipeline.
Future directions:
- Train natively at longer contexts (64K–128K+) and across more languages and build systems.
- Study where hard verification reclaims the edge and how to combine it selectively with soft checks.
- Automate filtering and truncation strategies and refine multi-repo specialization mixes.
- Extend the statistical toolkit: stronger seed counts, broader benchmarks, and more robust scaling analyses.
Why remember this: SERA shows that you don’t need massive RL machinery or fragile test pipelines to build strong coding agents. With a simple two-pass agreement check, thoughtful data curation, and SFT, small teams can build agents that learn their own codebases and improve quickly—privately, cheaply, and reliably.
Practical Applications
- Train a private coding copilot specialized to your company’s monorepo without exposing code externally.
- Continuously re-train a repo-specific assistant after every major release to keep it aligned with evolving code.
- Create an internal PR reviewer that understands project style, common patterns, and preferred refactors.
- Build a lightweight assistant for legacy codebases with poor tests, using SERA’s test-free data generation.
- Prototype domain-specific agents (e.g., data pipelines, scientific computing) by specializing on targeted subrepos.
- Run cost-aware experiments using scaling laws to estimate budget needed to reach target performance.
- Improve reliability of tool use by fine-tuning on trajectories with rich reasoning traces and correct formats.
- Speed up onboarding by offering a specialized agent that explains and edits code following local conventions.
- Use filtering heuristics (patch-size and tool-output limits) to tailor a specialist for tricky repos.
- Stand up an on-prem inference service (e.g., vLLM) with a SERA model for privacy-sensitive environments.