
Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model

Intermediate
Tianyi Wu, Mingzhe Du, Yue Liu et al. Ā· 2/7/2026
arXiv

Key Summary

  • This paper introduces SecCoderX, a way to teach code-writing AIs to be secure without breaking what the code is supposed to do.
  • It reuses real-world vulnerability datasets both to create tricky practice tasks and to train a fast 'referee' model that scores security.
  • SecCoderX uses online reinforcement learning so the AI practices, gets feedback right away, and improves steadily like a student with a coach.
  • A special Vulnerability Reward Model (8B parameters) checks code for specific CWE weaknesses with clear reasoning, beating even some commercial models.
  • A balanced reward combines security checks with functionality guards (length and AST-structure similarity) so ā€˜safe’ doesn’t mean ā€˜empty’ or ā€˜broken’.
  • Across benchmarks, SecCoderX improves Effective Safety Rate by about 10% over the original models, while prior methods often made it 14–54% worse.
  • It works across five languages (C, C++, Java, JavaScript, Python) and 24 CWE categories using 24k realistic practice prompts.
  • The system avoids slow, limited static tools, enabling fast and scalable training loops.
  • Ablation studies show the security reward drives safety, while length and AST rewards keep code functional.
  • Bottom line: SecCoderX is the first to close the functionality–security gap for AI-generated code in a practical way.

Why This Research Matters

SecCoderX shows developers don’t have to choose between safety and getting real work done—AI can deliver both. That means fewer security bugs silently slipping into apps, websites, and services people rely on every day. It also saves time: the fast, learned referee replaces slow, limited tools that block training and feedback. Because the system trains on realistic, diverse tasks, it learns patterns that transfer to real development. Teams can keep their favorite code models and align them for security without losing productivity. Over time, this approach could raise the baseline for secure coding across the industry.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you ask a robot friend to write some code for your app. It types super fast and usually does what you say. But sometimes, it forgets important safety steps, like locking the door before leaving the house.

🄬 The Concept (Large Language Models for Code):

  • What it is: Large Language Models (LLMs) are computer programs that learned to write code by reading tons of examples.
  • How it works: They read your task, guess the next tokens one by one, and produce code that looks right based on patterns they’ve seen.
  • Why it matters: Without LLMs, we don’t get handy code assistants. But if they write unsafe code, real software can break or get hacked. šŸž Anchor: Ask an LLM to ā€œwrite a function that parses user input.ā€ It can write working code—but it might forget to sanitize inputs, opening a security hole.

šŸž Hook: You know how roads have warning signs like ā€œSlippery When Wetā€ so drivers know what to avoid?

🄬 The Concept (Common Weakness Enumeration—CWE):

  • What it is: CWE is a big, shared list of common coding mistakes that can cause security problems.
  • How it works: Experts catalog weaknesses (like SQL Injection or Cross-Site Scripting) with IDs and descriptions.
  • Why it matters: Without a common list, we can’t teach or test security clearly. šŸž Anchor: CWE-79 warns about Cross-Site Scripting (XSS)—showing user text on a web page without proper escaping.
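The CWE-79 pattern above is easy to see in miniature. Here is a minimal Python sketch (function names are illustrative, not from the paper) contrasting a vulnerable renderer with one that escapes user text before it reaches the page:

```python
import html

def render_comment_unsafe(user_text: str) -> str:
    # CWE-79 pattern: user text is interpolated into HTML verbatim,
    # so a <script> payload would execute in the viewer's browser.
    return f"<p>{user_text}</p>"

def render_comment_safe(user_text: str) -> str:
    # Secure pattern: escape HTML metacharacters so the payload
    # is displayed as plain text instead of being executed.
    return f"<p>{html.escape(user_text)}</p>"

payload = "<script>alert('xss')</script>"
```

Both functions "work" for friendly input, which is exactly why the weakness slips past functional testing.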

šŸž Hook: Think of an airport metal detector—its job is to spot hidden trouble before it gets on the plane.

🄬 The Concept (Vulnerability Detection):

  • What it is: Tools or models that check code for known weakness patterns and say if it’s vulnerable.
  • How it works: They read code, look for risky patterns, and reason about whether a CWE applies.
  • Why it matters: Without detection, we can’t tell if generated code is safe. šŸž Anchor: If code builds an SQL query by gluing strings from user input, a detector can flag CWE-89 (SQL Injection).
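The string-gluing pattern a detector flags as CWE-89 can be demonstrated in a few lines. This sketch (using Python's standard `sqlite3`; helper names are illustrative) shows why the glued query is exploitable while the parameterized one is not:

```python
import sqlite3

def find_user_unsafe(cur, name):
    # CWE-89 pattern: user input is glued into the SQL string,
    # so it can rewrite the query's logic.
    return cur.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(cur, name):
    # Parameterized query: the driver treats `name` purely as data.
    return cur.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# A classic injection payload: turns the WHERE clause into a tautology.
payload = "x' OR '1'='1"
```

Against the unsafe version, the payload matches every row; against the safe version, it matches none.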

šŸž Hook: If you lock a door by bricking it up, your house is ā€œsafer,ā€ but now you can’t use the room.

🄬 The Concept (Functionality–Security Paradox):

  • What it is: Many past methods make AI’s code safer but break what the code is supposed to do.
  • How it works: They push the model to avoid risks so hard that it removes needed logic or outputs tiny, non-working snippets.
  • Why it matters: Developers reject ā€œsafeā€ code that doesn’t work; security without usefulness isn’t helpful. šŸž Anchor: A model that returns an empty function is ā€œsafeā€ (no bug), but the app stops working.

The world before: LLMs wrote helpful code but often included real vulnerabilities. Early fixes tried supervised fine-tuning on secure examples or preference learning from vulnerable–fixed pairs. Others used rule-based reinforcement learning with static analyzers. These helped security scores but frequently hurt functionality—so in practice, teams couldn’t use the results.

The problem: We need code that’s both safe and still does the job. Security gains that destroy functionality are a hollow victory.

Failed attempts:

  • Supervised fine-tuning: Learns safer style, but can overfit or remove needed behavior.
  • Preference alignment (like DPO): Prefers fixed over vulnerable, but may forget how to solve tasks.
  • RL with static tools: Static tools are slow, limited in CWE coverage, and often need fully compilable projects, which many small coding tasks don’t have.

The gap: A fast, reliable, and scalable security signal that works during training, plus realistic practice tasks that cause the right mistakes (so the model learns to avoid them), and a way to keep functionality intact.

Real stakes:

  • Insecure AI code can leak data, allow break-ins, or crash services.
  • Developers won’t adopt assistants that return ā€œsafeā€ but non-working code.
  • Teams need tools that really reduce risk without slowing them down or breaking features.

This paper’s answer: SecCoderX—an online reinforcement learning framework that reuses real vulnerability datasets to (1) synthesize realistic, vulnerability-inducing coding tasks and (2) train a fast, reasoning-based Vulnerability Reward Model (a smart referee). Then it aligns a code LLM with a balanced reward that values both security and functionality. The result: higher real-world utility measured by Effective Safety Rate (security times functionality), finally breaking the paradox.

02Core Idea

šŸž Hook: Imagine training for a soccer game. You don’t just read rules—you practice realistic drills while a fast, fair referee gives feedback every play.

🄬 The Aha! Moment: Use real vulnerability data to create realistic practice tasks and to train a fast, accurate referee (the Vulnerability Reward Model), then let the code LLM practice online with balanced rewards so it learns to be both safe and still functional.

Multiple analogies:

  1. Driving school: You practice on roads that often cause accidents (tricky tasks), and your instructor (reward model) flags dangerous moves while still making sure you get to your destination (functionality checks).
  2. Cooking class: You try recipes where people often burn dishes (vulnerability-inducing prompts). A chef-taster (reward model) quickly spots unsafe steps (like raw chicken) but also expects an edible meal (functionality rewards).
  3. Gym training: Work on weak muscles (common CWEs) with targeted drills. Your trainer checks form (AST similarity) and effort (length control), not just whether you avoided injury.

šŸž Hook: You know how practicing on the right problems speeds up learning way more than random exercises?

🄬 The Concept (Vulnerability-Inducing Task Synthesis):

  • What it is: A pipeline that turns real vulnerable snippets into realistic prompts likely to trigger the same CWE during coding.
  • How it works: (1) Infer plausible app/repo contexts where the code logic arises; (2) craft language-specific tasks that nudge the same mistake (but never ask for unsafe behavior directly).
  • Why it matters: Without well-designed drills, the model won’t practice the exact mistakes we need it to fix. šŸž Anchor: Turn an XSS-prone web snippet into a task like ā€œbuild a comment viewer,ā€ which naturally tempts missing output encoding.
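To make the pipeline concrete, here is a hypothetical sketch of what one synthesized task record might look like. The field names (`cwe`, `functional_goal`, `prompt`) and the template are illustrative assumptions, not the paper's actual schema; the key property is that the prompt asks only for functionality, never for unsafe behavior:

```python
def synthesize_task(vuln_record: dict, context: str, language: str) -> dict:
    """Turn a labeled vulnerable snippet into a vulnerability-inducing prompt.
    Hypothetical schema for illustration only."""
    prompt = (
        f"In a {context}, implement in {language} a function that "
        f"{vuln_record['functional_goal']}."  # functional ask only
    )
    return {"cwe": vuln_record["cwe"], "language": language, "prompt": prompt}

task = synthesize_task(
    {"cwe": "CWE-89",
     "functional_goal": "checks a username/password pair against the database"},
    context="login service for a web forum",
    language="Python",
)
```

The tempting shortcut (gluing the username into an SQL string) is exactly the mistake the drill is designed to surface.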

šŸž Hook: Think of a fast, knowledgeable referee who knows the exact foul to watch for in each play.

🄬 The Concept (Vulnerability Reward Model):

  • What it is: A CWE-conditioned, reasoning-based model that checks generated code for a target vulnerability and says secure or not.
  • How it works: Train first with distilled, step-by-step reasoning from a strong teacher, then sharpen with RL for robustness; condition on the target CWE to convert broad detection into focused verification.
  • Why it matters: Slow, narrow tools can’t keep up; a quick, accurate referee makes online RL possible. šŸž Anchor: For a prompt labeled ā€œCWE-79,ā€ the referee analyzes code logic and flags missing HTML escaping.

šŸž Hook: Practicing while getting live feedback is how you improve fastest.

🄬 The Concept (Online Reinforcement Learning):

  • What it is: The model generates code in training, gets immediate rewards, and updates its behavior on the fly.
  • How it works: For each task, it produces code; the reward model judges security; extra rewards encourage staying close to working logic; then GRPO updates the policy.
  • Why it matters: Without live, aligned feedback, the model can’t reliably learn safe habits that keep functionality. šŸž Anchor: The model writes a function; the referee says ā€œsecure +1,ā€ length/structure checks pass, so it learns a safe patch style.

Before vs. After:

  • Before: Security boosts often broke functionality; static tools slowed training; data was scarce and not targeted.
  • After: SecCoderX improves safety and keeps functionality using learned, fast rewards and realistic tasks, raising Effective Safety Rate by about 10%.

Why it works (intuition):

  • The model already ā€œknowsā€ secure patterns from pretraining but needs targeted signals to bring them out.
  • CWE-conditioning narrows the judge’s focus, making detection easier and more accurate.
  • Balanced rewards (security + length + AST similarity) keep the fix close to the original logic, reducing breakage.

Building blocks:

  • CWE taxonomy: a shared language for weaknesses.
  • Vulnerability-inducing prompts: realistic, language-diverse tasks that invite specific mistakes.
  • Reasoning-based reward model: explains its call, improving accuracy and reliability.
  • Online RL with GRPO: stable, comparative updates from fresh rollouts.
  • Functionality anchors: length and AST-structure similarity prevent ā€œsafe but uselessā€ outputs.

šŸž Anchor: A pre-aligned model used strcpy/strcat (risky); after SecCoderX, it uses bounded copies and checks sizes—still returns the right string, now safely.

03Methodology

At a high level: Vulnerability data → [A. Synthesize tricky coding tasks] → [B. Train a CWE-aware security referee] → [C. Practice with online RL using balanced rewards] → Secure and working code.

A. Reality-Grounded Vulnerability-Inducing Task Synthesis šŸž Hook: Coaches design drills that recreate common mistakes, so athletes learn to avoid them.

🄬 The Concept (Task Synthesis):

  • What it is: Create realistic prompts that are likely to trigger a specific CWE error when implemented.
  • How it works (step-by-step):
    1. Start from real vulnerable code with its CWE tag (e.g., from PrimeVul, R2Vul).
    2. Infer multiple plausible app/repo contexts where similar logic appears (to boost diversity).
    3. For each context, write a clear, functional coding prompt in a target language (C/C++/Java/JS/Python) that nudges the same CWE—but never tells the model to be unsafe.
  • Why it matters: Without these grounded drills, the model can’t practice fixing the right mistakes at scale. šŸž Anchor: From a string-concatenated SQL snippet (CWE-89), create a ā€œlogin checkerā€ task where the obvious but unsafe shortcut would be gluing user input into a query.

B. CWE-Conditioned Vulnerability Reward Model (the Referee) šŸž Hook: A good referee doesn’t just blow the whistle; they explain the foul so players learn.

🄬 The Concept (Vulnerability Reward Model):

  • What it is: An 8B-parameter model that reads generated code and a target CWE, reasons about risks, and outputs ā€œvulnerableā€ or ā€œnot vulnerable.ā€
  • How it works (step-by-step):
    1. Collect labeled vulnerability datasets across many CWEs and languages.
    2. Distill structured reasoning (ā€œUnderstand, Speculate, Analyzeā€) from a strong teacher to train clear explanations.
    3. Improve robustness with online RL (GRPO) on held-out data so the referee generalizes beyond memorized patterns.
  • Why it matters: Static tools are slow, limited in coverage, and often need compilable projects. This learned referee is fast, broad, and works on short snippets—perfect for RL. šŸž Anchor: Given Java code and CWE-79, the referee explains missing HTML escaping and marks it vulnerable.

C. Online Secure Code Generation Alignment šŸž Hook: Learning to fix mistakes while keeping your essay’s main idea is trickier than just deleting risky sentences.

🄬 The Concept (Balanced Reward Design):

  • What it is: A reward that values security but also preserves function and structure.
  • How it works (step-by-step):
    1. Vulnerability reward: +2 if the referee says the code is secure for the target CWE; 0 otherwise.
    2. Length reward: Small bonus if code length stays within a reasonable window compared to the original (e.g., not 5 lines when it used to be 50), penalties if way too short/long.
    3. AST similarity reward: Compare abstract syntax trees (structure) of the new vs. original code to ensure the fix doesn’t rewrite the whole logic.
    4. Format reward: Nudge the model to output cleanly (e.g., enclosed in code fences) to reduce parsing issues.
    5. Interaction term: The big boost only happens if the code is secure and also keeps length and structure in line—discouraging ā€œsecure but brokenā€ outputs.
  • Why it matters: Without these anchors, the model could ā€œgameā€ the security score by deleting risky logic or outputting trivial code. šŸž Anchor: The model replaces unsafe string copies with bounded ones and adds checks, but keeps the same loops and function shape so the program still solves the task.
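The five components above can be sketched as a single scoring function. The weights, thresholds, and window below are illustrative placeholders, not the paper's actual values; what matters is the shape: a large security term, small functionality anchors, and an interaction bonus that pays only when both hold:

```python
def total_reward(is_secure: bool, len_ratio: float,
                 ast_sim: float, well_formatted: bool) -> float:
    """Toy balanced reward; all weights and thresholds are illustrative."""
    r_vuln = 2.0 if is_secure else 0.0                  # referee's verdict
    r_len = 0.5 if 0.5 <= len_ratio <= 1.5 else -0.5    # stay near original length
    r_ast = 0.5 * ast_sim                               # preserve code structure
    r_fmt = 0.25 if well_formatted else 0.0             # clean, parseable output
    # Interaction term: security only earns the bonus when the code
    # also stays functional-shaped (reasonable length AND similar structure).
    functional = (r_len > 0) and (ast_sim >= 0.6)
    r_inter = 1.0 if (is_secure and functional) else 0.0
    return r_vuln + r_len + r_ast + r_fmt + r_inter
```

Under this shape, a secure-and-intact fix outscores both a gutted "secure" stub and an untouched vulnerable solution, which is exactly the ordering the training loop needs.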

Secret sauce: CWE-conditioning narrows the search space for the referee; reasoning improves detection quality; the interaction reward ties security to structure and length, preventing reward hacking.

Additional Sandwiches for key tools: šŸž Hook: Skeletons make bodies look like bodies, no matter the clothes. 🄬 The Concept (AST Similarity):

  • What it is: A comparison of code structure (the ā€œskeletonā€) ignoring variable names.
  • How it works: Parse both codes into trees and measure how many subtrees match.
  • Why it matters: Preserves the program’s shape while adding safety checks. šŸž Anchor: Two functions both have the same while-loops and if-blocks; only one adds bounds checks.
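A rough version of this skeleton comparison can be built with Python's standard `ast` module. This sketch compares the multiset of node types (a Jaccard-style overlap), which is one simple structural proxy and not necessarily the paper's exact metric:

```python
import ast
from collections import Counter

def node_types(code: str) -> Counter:
    # Multiset of AST node types; identifier names and literal values vanish.
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(code)))

def ast_similarity(code_a: str, code_b: str) -> float:
    # Jaccard-style overlap of node-type counts: a rough structural proxy.
    a, b = node_types(code_a), node_types(code_b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

original = "def f(x):\n    total = 0\n    for i in range(x):\n        total += i\n    return total"
renamed  = "def g(y):\n    s = 0\n    for j in range(y):\n        s += j\n    return s"
gutted   = "def f(x):\n    return 0"
```

Renaming every variable leaves the skeleton identical, while deleting the loop body collapses the similarity, so a reward based on it penalizes "fixes" that just remove logic.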

šŸž Hook: If your book report shrinks to one sentence, your teacher is suspicious. 🄬 The Concept (Length Reward):

  • What it is: A gentle rule that says ā€œdon’t make code way shorter or longer than before.ā€
  • How it works: Slight bonus in a reasonable range, penalties if way off.
  • Why it matters: Stops cheating by outputting almost nothing to look ā€œsafe.ā€ šŸž Anchor: A 100-line function shouldn’t become 3 lines just to avoid risk.

šŸž Hook: A super-detailed microscope is great, but it’s slow and heavy. 🄬 The Concept (Static Analysis Tools—SAST):

  • What it is: Rule-based safety checkers like CodeQL.
  • How it works: Scan code with many rules, often need full projects and compilable code.
  • Why it matters: Great for audits, but too slow/limited for rapid RL training and short tasks. šŸž Anchor: Minutes per check vs. seconds for the learned referee.

šŸž Hook: Grading on a curve: compare students within the same class, not against last year’s. 🄬 The Concept (GRPO):

  • What it is: Group Relative Policy Optimization, a way to update the model by comparing candidates within each batch.
  • How it works: Generate several rollouts; prefer and learn from the better ones.
  • Why it matters: Stabilizes learning and focuses on relative improvements. šŸž Anchor: Out of 10 code attempts, the safest but still functional ones guide the update.

Putting it together: Generate a reference solution, sample new attempts, score them with the referee plus functionality anchors, and update the policy. Repeat over 24k prompts across 24 CWEs and 5 languages. The model internalizes safe coding habits while keeping the original problem-solving logic intact.

04Experiments & Results

The test: The team measured three things on security-focused coding tasks: (1) Safety Rate—how often the code is secure; (2) Functionality—how well the code works; and (3) Effective Safety Rate (ESR)—security multiplied by functionality, so ā€œsecure but brokenā€ gets no credit.

šŸž Hook: A safe car that won’t drive isn’t useful. 🄬 The Concept (Effective Safety Rate):

  • What it is: A real-world utility score that counts only secure code that also works.
  • How it works: Take each task’s security (yes/no) and weight it by how functional the code is.
  • Why it matters: Prevents models from gaming security by breaking the program. šŸž Anchor: If a solution is secure but fails tests, it adds zero to ESR.
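The metric itself is a one-liner. This sketch computes ESR over a batch of tasks, each recorded as (is_secure, functionality score in [0, 1]), matching the description above:

```python
def effective_safety_rate(results: list[tuple[bool, float]]) -> float:
    """ESR: security (0/1) weighted by functionality, averaged over tasks."""
    return sum((1.0 if secure else 0.0) * func
               for secure, func in results) / len(results)

# Four tasks: secure+working, secure+broken, insecure+working, secure+half-working.
batch = [(True, 1.0), (True, 0.0), (False, 1.0), (True, 0.5)]
```

Note that the secure-but-broken task and the working-but-insecure task both contribute zero, which is precisely what stops a model from gaming either axis alone.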

Benchmarks and comparisons:

  • Datasets: CyberSecEval SCG, CWEval for security+function; HumanEval+ and MBPP+ for general coding.
  • Baselines: SafeCoder and ProSec (strong prior alignment methods).
  • Models: CodeLlama-7B, Qwen2.5-Coder-3B/7B.

Scoreboard with context:

  • SecCoderX raised ESR by about 10% over the original (unaligned) models. Think of it like moving from a solid B to an A-, but in a way that matters most: safe and working code.
  • Prior methods often increased raw safety but dropped ESR by 14–54% because functionality tanked—like getting extra safety points by never leaving the garage.
  • Safety alone improved by 11–16% vs. unaligned, yet functionality stayed strong; general coding benchmarks (HumanEval+, MBPP+) remained on par or improved—without mixing extra general coding data.

Surprising findings:

  • The 8B Vulnerability Reward Model, conditioned on CWE and trained with reasoning then RL, matched or beat much larger commercial LLMs (e.g., GPT-4.1, Gemini-2.5-Flash) on average F1 across multiple vulnerability detection benchmarks—showing the value of targeted design.
  • SAST tools were both slower and less accurate in the RL loop context under similar compute, confirming they’re not ideal as online reward sources.
  • Ablations:
    • Removing the vulnerability reward slashed safety and ESR, proving the referee is the main driver of secure behavior.
    • Removing the length reward hurt safety, likely because the search wandered too far from good starting solutions.
    • Removing AST similarity raised safety a bit but cut functionality, showing structure anchoring is key for useful fixes.
    • Format rewards mattered less (models already format code well) but helped avoid edge-case formatting issues.

LLM-as-a-judge for missing unit tests: šŸž Hook: When there’s no answer key, ask a trusted teacher to grade fairly. 🄬 The Concept (LLM-as-a-judge):

  • What it is: A strong LLM scores functionality consistently when unit tests aren’t available.
  • How it works: The same judge models and prompts grade all outputs in the same way.
  • Why it matters: Keeps comparisons fair across methods. šŸž Anchor: CyberSecEval SCG used a fixed judge (Gemini-2.5-Flash) scoring 0–5, normalized to [0,1].

Bottom line: SecCoderX is the first to show consistent ESR gains across multiple models and languages, proving you don’t have to choose between safety and functionality.

05Discussion & Limitations

Limitations:

  • Coverage bound to data: The method relies on existing vulnerability datasets. Rare or brand-new vulnerability types may be underrepresented, limiting the referee’s early precision.
  • Language scope: Experiments focus on C, C++, Java, JavaScript, and Python. Other ecosystems (e.g., Rust, Kotlin, Swift) will need additional parsing and data.
  • Judge dependence: For some benchmarks, functionality is judged by an LLM. Although applied consistently and validated with spot checks, it’s still an approximation to unit tests.
  • CWE-conditioning assumes the task’s target CWE is known (as provided by the synthesized prompts). Open-ended, multi-CWE detection may be harder.

Required resources:

  • Data: Access to vulnerability datasets (e.g., PrimeVul, R2Vul, SVEN, ProSec) and compute to train the reward model and align target LLMs.
  • Tooling: An RL training stack (e.g., GRPO), AST parsers for multiple languages, and scalable inference infrastructure.
  • Teacher model (once): A strong LLM to distill initial reasoning for the reward model.

When not to use:

  • If you need instant training on a brand-new language or framework with no datasets, setup time may outweigh benefits.
  • If your codebase is always compiled end-to-end and you already rely on deep project-wide static checks, an offline SAST pipeline might suffice.
  • If tasks are tiny and trivial, the RL overhead may not pay off compared to simple prompting rules.

Open questions:

  • Can multi-CWE, open-ended judging (without a provided target CWE) match the precision and speed of CWE-conditioned judging?
  • How well does the approach transfer to strongly typed, memory-safe languages (e.g., Rust) where vulnerability shapes differ?
  • Can we jointly optimize for performance, maintainability, and security without tradeoffs, using multi-objective rewards?
  • How far can automatic patch suggestion go—can the model reliably propose minimal, review-ready diffs for large real repositories?
  • Can we design better, automated unit-test synthesis to replace LLM-as-a-judge in all cases?

06Conclusion & Future Work

Three-sentence summary: SecCoderX teaches code-writing AIs to be both safe and still useful by practicing on realistic, vulnerability-inducing tasks and learning from a fast, CWE-aware referee. Its balanced rewards tie security to functionality through length and structure anchors, preventing ā€œsecure but brokenā€ code. The result is a consistent lift in Effective Safety Rate (about +10%) across models and languages, unlike past methods that harmed real utility.

Main achievement: Bridging vulnerability detection and secure code generation—by repurposing detection datasets to build targeted drills and a high-quality reward model—then unifying them with online RL to finally beat the functionality–security trade-off.

Future directions:

  • Expand beyond 5 languages and 24 CWEs; add emerging vulnerability classes.
  • Move from CWE-conditioned (targeted) judging to strong open-ended, multi-issue detection while keeping speed.
  • Integrate auto-generated unit tests to reduce judging noise further.
  • Explore multi-objective alignment for security, performance, and maintainability.

Why remember this: SecCoderX shows that with the right practice tasks, a knowledgeable referee, and carefully balanced rewards, AI can learn to write code that is both safer and still works—a practical path toward trustworthy, real-world code assistants.

Practical Applications

  • Integrate SecCoderX-aligned models into IDEs to suggest secure code edits that keep existing logic intact.
  • Use the Vulnerability Reward Model as a fast pre-commit check for common CWEs across multiple languages.
  • Add a CI gate that scores both security and functionality (ESR) to block ā€˜secure but broken’ patches.
  • Automate code review comments that explain why a snippet is vulnerable and how to fix it.
  • Run targeted training on your codebase’s top CWE risks by synthesizing similar practice prompts.
  • Deploy the referee to triage security findings quickly before running heavier static tools.
  • Fine-tune internal code models with SecCoderX to match your org’s security policies and coding style.
  • Create developer learning modules: show vulnerable patterns, secure rewrites, and the reasoning behind them.
  • Generate minimal, structure-preserving patches that are easy for humans to review and accept.
  • Continuously align models as new CWE variants emerge by updating prompts and reward model training.
#secure code generation#reinforcement learning#vulnerability reward model#CWE-conditioned detection#online RL#AST similarity#functionality preservation#Effective Safety Rate (ESR)#vulnerability-inducing prompts#GRPO#LLM alignment#static analysis (SAST) comparison#reasoning distillation#code security benchmarks#Qwen and CodeLlama