X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
Key Summary
- X-Coder shows that models can learn expert-level competitive programming using data that is 100% synthetic; no real contest problems are needed.
- The team builds a domain-specific feature tree, turns it into challenging tasks via a two-stage process, and then auto-generates both solutions and tests.
- A dual-verification strategy first votes across many candidate solutions to label test outputs, then picks a single 'golden' solution using weighted tests plus a hold-out check.
- Trained with supervised fine-tuning and then reinforcement learning, X-Coder-7B reaches 62.9% avg@8 on LiveCodeBench v5 and 55.8% on v6, beating larger models trained on real data.
- Scaling more unique tasks helps more than adding extra solutions per task, and longer reasoning chains (long CoT) lead to higher final performance.
- Reinforcement learning provides a big extra boost and is surprisingly robust to slightly noisy synthetic tests.
- Domain-specific evolution (not generic prompts) is crucial to create solvable yet hard problems with realistic corner cases.
- The approach reduces data contamination risks found in real-world datasets and offers a scalable path to keep models improving.
- The system uses a distributed sandbox and careful test weighting to ensure accurate, fair rewards during training.
- Insights include the value of reasoning-heavy tasks, the 'good-gets-better' effect in RL, and detailed failure modes that guide future improvements.
Why This Research Matters
This work shows we can escape the limits of real contest data and still train excellent reasoners, which keeps progress moving even when public datasets run dry. It lowers leakage risk so benchmark scores remain trustworthy, helping the whole community compare models fairly. Schools and bootcamps can generate fresh, auto-graded practice that builds true problem-solving, not memorization. Companies can safely train internal coding assistants on synthetic corpora without licensing or privacy worries. Stronger tests and dual verification also improve reliability, so models learn to handle corner cases that matter in real software. Finally, the insights about scaling, long reasoning, and RL robustness guide future systems beyond coding.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your math teacher runs out of practice problems, but your class still wants to get top scores. What if the teacher could invent new, fair, challenging problems every day that teach you the same (or better) lessons as the real ones?
The Concept (Competitive Programming):
- What it is: Competitive programming is like puzzle-solving with code, where problems demand smart algorithms, speed, and careful handling of tricky corner cases.
- How it works: 1) You read a problem. 2) You design an algorithm (like Dijkstra or DP). 3) You write fast code. 4) Your code is judged on many hidden tests.
- Why it matters: Without strong skills in algorithms and testing, your code might be slow, wrong on edge cases, or fail under pressure.
Anchor: Think of being asked to route buses through a city with traffic that changes; your program must decide the fastest, most efficient paths quickly and correctly.
The World Before: Code models could already pass basic assignments (like small function writing on HumanEval/MBPP). But the frontier moved to competitive programming, which needs deeper reasoning and algorithmic mastery. Teams often trained on collections of real contest problems. This came with two big issues: scarcity (there aren't infinite curated problems) and contamination (models leak knowledge from benchmarks they saw during training, making scores less trustworthy).
The Problem: Could we get expert-level code reasoning without relying on real contests at all? That means synthesizing three things from scratch: tasks (that are challenging but solvable), solutions (that are logically sound), and tests (that are thorough and accurate enough to judge solutions for RL rewards). Off-the-shelf synthesis methods often made tasks too easy or fuzzy, solutions not fully reliable, and tests too weak to guide RL.
The Concept (Fully Synthetic Data):
- What it is: Fully synthetic data means all problems, their solutions, and their tests are invented by models/tools rather than copied from the real world.
- How it works: 1) Make features of algorithms and data structures. 2) Compose them into realistic tasks. 3) Generate varied test inputs. 4) Produce several candidate solutions. 5) Verify both solutions and tests.
- Why it matters: Without high-quality synthetic data, models either stall (not enough variety) or learn the wrong habits (noisy, trivial, or misleading problems).
Anchor: Like a music teacher composing brand-new songs to teach scales, rhythms, and styles so students don't just memorize old sheet music; they truly learn to play.
Failed Attempts: General-purpose task generators (designed for simpler coding) didn't transfer well. They tended to oversimplify problem prompts, missed competitive-level tricky constraints, and didn't produce strong tests. Some teams mixed math data for RL to compensate for limited code problems, but that was a workaround, not a fix.
The Gap: We needed a method that (1) constructs domain-specific, complex-but-solvable tasks; (2) ensures solution correctness without ground-truth; and (3) builds strong tests to power RL rewards without collapsing under noise.
Real Stakes:
- Better training without leaking test answers improves fairness of leaderboards.
- Programmers get smarter assistants for hard tasks, from optimizing routes to planning schedules.
- Education gains fresh, infinite practice problems with auto-grading.
- Companies can safely train in-house models without licensing or privacy risks.
- The approach scales: when you want more data, you synthesize moreâno waiting for the world to publish new contests.
02 Core Idea
Hook: You know how LEGO sets come with themed pieces that snap together to make new creations? If you pick the right pieces and follow a smart build-and-check routine, you can make reliably cool builds again and again.
The Concept (Key Insight):
- What it is: With a domain-aware recipe that evolves useful algorithmic features and double-checks both tests and solutions, fully synthetic data can train expert competitive programming models.
- How it works (recipe): 1) Build a feature tree from real algorithmic themes (graphs, DP, number theory). 2) Select compatible subsets and write realistic tasks (no handholding hints). 3) Generate rich test inputs via prompting and tools. 4) Sample multiple solutions from strong reasoners. 5) Use dual verification: vote to label outputs, then pick the best solution using weighted tests and a hold-out split. 6) Train with SFT, then boost with RL.
- Why it matters: Without this, tasks stay trivial, solutions stay shaky, and RL never gets a reliable reward, so models plateau.
Anchor: Like a science fair where you invent your own experiments, measure results with two different meters, and then study the best-performing setup to learn the most.
Three Analogies:
- Cooking: You curate ingredients (feature tree), follow a precise recipe (two-stage task generation), taste-test with multiple tasters (voting), and pick the winning dish (weighted validation) before teaching others how to cook it (SFT then RL).
- Sports Drills: You design drills targeting exact skills (domain-specific evolution), run them many times (candidate solutions), score them fairly (strong tests), and train players first with coaching (SFT) then scrimmages (RL).
- Mapmaking: You outline terrains (features), draft routes (tasks), drive test cars on different roads (test generation), compare their GPS traces (voting), and choose the most reliable map (hold-out validation) to teach future drivers (training).
Before vs After:
- Before: Most believed real contest data was required; synthetic sets lacked rigor and trust.
- After: Fully synthetic, if evolved for the domain and double-verified, not only works; it can outperform models trained on real data.
Why It Works (intuition):
- Feature evolution ensures tasks combine the right algorithmic gears (e.g., flow + shortest path + segment tree) in solvable, meaningful ways.
- Dual verification stabilizes labels: consensus voting filters out flukes, weighting prioritizes edge/stress cases, and hold-out rechecks generalization.
- SFT teaches long reasoning patterns; RL explores beyond what SFT covered, even with slightly noisy rewards.
Building Blocks (Sandwich intros):
- Hook: Imagine making a playlist from song genres, then mashing up styles that go well together. The Concept (Feature-Based Synthesis):
- What it is: A way to compose new coding tasks from a tree of algorithm/data-structure features.
- How it works: Extract features from code examples; expand them in depth and breadth; assemble compatible sets into a single scenario.
- Why it matters: Random mixes can be nonsense; feature-based design keeps tasks coherent and targeted. Anchor: Combining 'shortest paths,' 'flow,' and 'segment tree' to simulate a changing city network.
- Hook: Plants grow best when bred for local weather. The Concept (Domain-Specific Evolution):
- What it is: Tailoring feature evolution specifically for competitive programming.
- How it works: Emphasize contest-relevant skills (e.g., DP transitions, edge-case math, time/space trade-offs) and evolve deeper variants.
- Why it matters: Generic prompts miss contest-style constraints; domain focus yields solvable yet hard tasks. Anchor: Evolving 'prefix sums' into 'Fenwick tree' to handle fast range updates.
- Hook: First list ingredients, then write the recipe. The Concept (Two-Stage Task Generation):
- What it is: Separate step 1 (select compatible features) from step 2 (write the task text without hints).
- How it works: Choose a consistent subtree; state a realistic scenario; avoid giving away the method.
- Why it matters: One-shot prompts often collapse to trivial problems; two-stage preserves difficulty. Anchor: Turning a selected set (Dijkstra + flow + segment tree) into a Codeforces-style narrative.
- Hook: Testing a bridge with small cars, heavy trucks, and rush-hour traffic. The Concept (Test Input Generation):
- What it is: Creating diverse inputs (small, edge, stress) via prompting or a tool library.
- How it works: Prompting interprets constraints; tools (like CYaRon) systematically build random, boundary, and scalable cases.
- Why it matters: Weak tests miss bugs, misguide RL, and inflate scores. Anchor: For a graph task, include tiny graphs, max-sized graphs, and tricky corner connections.
- Hook: Have several students solve the same puzzle before grading. The Concept (Candidate Solutions):
- What it is: Multiple long-reasoning solution attempts from strong models.
- How it works: Enforce complete reasoning + valid code; filter out syntax issues; keep a diverse pool.
- Why it matters: A single attempt can be wrong or lucky; many give signal for voting. Anchor: Eight Python solutions compete; their outputs get compared.
- Hook: Two doctors confirm a diagnosis. The Concept (Dual-Verification):
- What it is: Two-step checking that validates tests via voting and selects a best solution via weighted scoring plus a hold-out set.
- How it works: 1) Majority vote labels outputs; weight hard tests more. 2) Pick the top solution by weighted score, then confirm on unseen tests.
- Why it matters: Prevents overfitting to easy cases and catches label noise. Anchor: The winner must also perform well on a secret, unweighted set.
- Hook: Learn from examples, then practice in the real gym. The Concept (SFT-then-RL):
- What it is: First imitate high-quality reasoning (SFT), then optimize behavior with rewards (RL, via GRPO).
- How it works: SFT teaches long chain-of-thought; RL rewards passing more tests; a stronger start leads to better RL ('good-gets-better').
- Why it matters: SFT alone plateaus; RL finds extra gains even with slightly noisy feedback. Anchor: A student studies detailed solutions, then plays many matches to tune strategies.
03 Methodology
At a high level: Features → (Select Subtree) → (Write Task) → (Generate Tests) → (Sample Solutions) → (Dual-Verify) → (SFT) → (RL) → Trained Model.
Step A: Build and Evolve a Feature Tree
- What happens: Extract competitive-programming features (algorithms, data structures, tricks) from code snippets and evolve them deeper/wider.
- Why it exists: Random or generic features produce incoherent tasks; evolution ensures a rich, contest-relevant library.
- Example: 'prefix sum' evolves into 'difference array' and 'Fenwick tree,' enabling harder update/query tasks.
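As a rough illustration of how such a feature library might be represented and deepened, here is a minimal, self-contained Python sketch. The seed features and the expand() stub (standing in for an LLM call that proposes harder variants) are purely illustrative, not the paper's actual data structures.

```python
# Minimal sketch of a feature tree plus one "depth evolution" pass.
# expand() stands in for an LLM prompt such as "list harder, contest-relevant
# variants of <feature>"; the demo mapping below is illustrative only.

feature_tree = {
    "graphs": ["shortest path", "max flow"],
    "arrays": ["prefix sum"],
}

def expand(feature: str) -> list[str]:
    """Stand-in for the LLM step that proposes deeper variants of a feature."""
    demo = {
        "prefix sum": ["difference array", "Fenwick tree"],
        "shortest path": ["Dijkstra with modified edge weights"],
    }
    return demo.get(feature, [])

def evolve_depth(tree: dict[str, list[str]]) -> dict[str, list[str]]:
    # Append proposed children so later task generation can draw on deeper skills.
    return {theme: feats + [child for f in feats for child in expand(f)]
            for theme, feats in tree.items()}

print(evolve_depth(feature_tree))
```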
Step B: Two-Stage Task Generation
- What happens: 1) Select a compatible subtree of features. 2) Write a realistic task (Codeforces/LeetCode/AtCoder style) that hides the solution method but requires those features.
- Why it exists: Putting selection and writing together often leads to trivial prompts; separating them preserves difficulty and coherence.
- Example Data: Choose {Dijkstra, Edmonds-Karp, segment tree}; produce a narrative about dynamic city transport capacity updates.
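Below is a hedged sketch of how the two stages could be wired together in code. The prompt wording and the `llm` callable are assumptions made for illustration; the paper's actual prompts and selection logic are more elaborate.

```python
from typing import Callable

def generate_task(features: list[str], llm: Callable[[str], str]) -> str:
    """Two-stage task generation sketch: (1) select a compatible feature
    subset, (2) write the statement without revealing the intended method.
    `llm` is any text-completion callable you supply (hypothetical here)."""
    # Stage 1: ask which features combine into ONE coherent, hard-but-solvable task.
    select_prompt = (
        "From these competitive-programming features, pick 2-4 that can be "
        "combined into one coherent, challenging but solvable problem:\n"
        + ", ".join(features)
    )
    chosen = llm(select_prompt)

    # Stage 2: write a contest-style statement that requires the chosen
    # features but never names the algorithms (no handholding hints).
    write_prompt = (
        f"Write a Codeforces-style problem that secretly requires: {chosen}.\n"
        "Include a realistic story, input/output format, and constraints.\n"
        "Do NOT mention any algorithm names in the statement."
    )
    return llm(write_prompt)
```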
Step C: Generate Test Inputs (Prompting + Tools)
- What happens: Use (i) prompting to read constraints and propose diverse inputs, and (ii) a tool library (CYaRon) to create random, boundary, scalable, and stress tests reproducibly.
- Why it exists: RL needs many, strong, diverse tests to measure progress fairly and avoid reward noise.
- Example: For a graph problem, include tiny graphs (n=2), max graphs near limits, disconnected cases, and stress tests with long paths.
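The paper combines prompting with the CYaRon tool library; the standard-library sketch below only illustrates the input categories (tiny, boundary, stress) for a hypothetical weighted-graph task, using an invented input format rather than CYaRon's own API.

```python
import random

def make_graph_case(n: int, m: int, w_max: int = 10**9) -> str:
    """Emit one test input: first line 'n m', then one 'u v w' line per edge.
    The format and constraint values are illustrative assumptions."""
    lines = [f"{n} {m}"]
    for _ in range(m):
        u, v = random.randint(1, n), random.randint(1, n)
        lines.append(f"{u} {v} {random.randint(1, w_max)}")
    return "\n".join(lines)

N_MAX, M_MAX = 200_000, 200_000
cases = {
    "tiny":     make_graph_case(2, 1),              # smallest legal case
    "boundary": make_graph_case(N_MAX, M_MAX),      # sizes at the constraint limits
    "stress":   make_graph_case(N_MAX // 2, M_MAX), # dense randomized stress case
}
```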
Step D: Sample Candidate Solutions
- What happens: Use advanced open-source reasoners to produce multiple solutions per task with complete reasoning and Python code; AST-check for syntax; filter for format.
- Why it exists: Multiple independent attempts let us use consensus to label outputs and spot robust solutions.
- Example: 8 solutions solve the same task; their outputs on each test input are collected.
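A minimal sketch of the syntax-level filter, assuming candidates return code inside a fenced Python block; the fence convention and regex are assumptions about the generator's output format, not the paper's exact pipeline.

```python
import ast
import re
from typing import Optional

def extract_valid_python(completion: str) -> Optional[str]:
    """Keep a candidate only if it contains a fenced code block that parses."""
    match = re.search(r"```python\n(.*?)```", completion, flags=re.DOTALL)
    if not match:
        return None            # drop "no code block" candidates outright
    code = match.group(1)
    try:
        ast.parse(code)        # cheap syntax check, no execution needed
    except SyntaxError:
        return None
    return code
```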
Step E: Dual-Verification (Secret Sauce)
- What happens: Two steps.
- Consensus voting for test outputs: For each test input, take the majority output across solutions as the provisional label; weight tougher tests higher (by size or category). This produces a weighted test suite.
- Select the golden solution: Pick the solution with the highest weighted score on the weighted suite, then confirm it also wins or ties on a separate, unseen hold-out set.
- Why it exists: Prevents labeling errors from any single model, prioritizes edge cases, and guards against overfitting to seen tests.
- Example with data: If 5/8 solutions say '42' for input X, the label becomes 42. If a candidate solves all stress tests (heavily weighted) but fails an easy one, its total score still reflects robustness. The final winner must also top an unweighted hold-out.
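To make the two verification steps concrete, here is a simplified Python sketch, assuming every candidate solution has already been executed on every test. Function names, the tie handling, and the rejection rule are illustrative simplifications, not the paper's exact procedure.

```python
from collections import Counter

def vote_labels(outputs_by_solution: list[list[str]]) -> list[str]:
    """outputs_by_solution[i][j] = output of solution i on test j.
    The provisional label for each test is the majority output."""
    n_tests = len(outputs_by_solution[0])
    return [Counter(sol[j] for sol in outputs_by_solution).most_common(1)[0][0]
            for j in range(n_tests)]

def weighted_score(outputs: list[str], labels: list[str], weights: list[float]) -> float:
    """Score one solution, weighting stress/boundary tests more than easy ones."""
    total = sum(weights)
    return sum(w for out, lab, w in zip(outputs, labels, weights) if out == lab) / total

def pick_golden(train_out, train_weights, hold_out, n_solutions):
    """Pick the top solution on the weighted (voted) suite, then require it to
    also lead on a separately labeled hold-out suite; otherwise reject the task."""
    train_labels, hold_labels = vote_labels(train_out), vote_labels(hold_out)
    best_train = max(range(n_solutions),
                     key=lambda i: weighted_score(train_out[i], train_labels, train_weights))
    uniform = [1.0] * len(hold_labels)
    best_hold = max(range(n_solutions),
                    key=lambda i: weighted_score(hold_out[i], hold_labels, uniform))
    return best_train if best_train == best_hold else None
```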
Step F: Supervised Fine-Tuning (SFT)
- What happens: Train the student model on (task, golden solution) pairs, including long chain-of-thought.
- Why it exists: Teaches the full reasoning patterns and code templates, not just final answers.
- Example: The model learns to plan, verify small examples, optimize for complexity, then implement cleanly.
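A minimal sketch of how one (task, golden solution) pair might be packaged as a chat-style SFT record; the think-tag delimiter and field names are assumptions to be matched to whatever trainer and chat template is actually used.

```python
def to_sft_record(task: str, golden: dict) -> dict:
    """Format one (task, golden solution) pair as a chat-style SFT example.
    `golden` is assumed to hold 'reasoning' (long CoT) and 'code' strings."""
    return {
        "messages": [
            {"role": "user", "content": task},
            {
                "role": "assistant",
                # Keep the full chain-of-thought, not just the final code,
                # so the student imitates the reasoning pattern.
                "content": f"<think>\n{golden['reasoning']}\n</think>\n"
                           f"```python\n{golden['code']}\n```",
            },
        ]
    }
```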
Step G: Reinforcement Learning (RL with GRPO)
- What happens: For each task, the model generates multiple rollouts; reward is fraction of passed tests. GRPO compares rollouts directly without a critic network, stabilizing learning with clipping and KL control to a reference policy.
- Why it exists: RL finds strategies beyond SFT's imitation and copes with slightly noisy tests by averaging across many rollouts.
- Example: If a rollout compiles and passes 60% of tests, it gets more reward than 0%; the policy improves toward higher pass fractions over time.
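The pass-fraction reward and group-relative advantage described above can be sketched in a few lines; the reward shaping and normalization details are simplified assumptions rather than the paper's exact GRPO implementation, which additionally applies clipping and KL control during the policy update.

```python
import statistics

def rollout_reward(passed: int, total: int, compiled: bool) -> float:
    """Reward = fraction of tests passed; non-compiling rollouts get 0.
    (The paper's exact shaping may differ; this is the described idea.)"""
    return passed / total if compiled and total > 0 else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantage: normalize each rollout's reward
    against the other rollouts of the SAME task, with no critic network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: 4 rollouts on one task, passing 0%, 60%, 60%, 100% of 10 tests.
rewards = [rollout_reward(p, 10, True) for p in (0, 6, 6, 10)]
print(grpo_advantages(rewards))
```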
Step H: Distributed Sandbox & Execution
- What happens: A microservice with FastAPI + Redis queues and worker pools executes code safely and in parallel, handling timeouts and capturing results.
- Why it exists: Training requires millions of executions; reliability and concurrency are essential to keep feedback fast and accurate.
- Example: Many test jobs are distributed to workers; results return through per-task queues; stuck workers are auto-detected.
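As a rough sketch of the worker side of such a sandbox, the loop below block-pops jobs from a Redis queue, runs the submitted code with a hard timeout, and pushes results to a per-task reply queue. Queue names and the job schema are invented, and a real deployment needs genuine isolation (containers, resource limits), which this snippet deliberately omits.

```python
import json
import subprocess

import redis  # pip install redis

r = redis.Redis()  # assumes a local Redis instance

def run_one(job: dict) -> dict:
    """Execute one submission on one test input with a hard timeout."""
    try:
        proc = subprocess.run(
            ["python3", "-c", job["code"]],
            input=job["stdin"], capture_output=True, text=True,
            timeout=job.get("time_limit", 5),
        )
        return {"status": "ok", "stdout": proc.stdout, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}

while True:
    # Workers share one job queue; results go to a per-task reply queue
    # so the trainer can aggregate pass/fail counts per rollout.
    _, raw = r.blpop("xcoder:jobs")
    job = json.loads(raw)
    r.rpush(f"xcoder:results:{job['task_id']}", json.dumps(run_one(job)))
```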
What breaks without each step:
- No evolved feature tree: tasks become generic or incoherent, lowering difficulty and learning value.
- No two-stage generation: prompts collapse to easy cases; the model stops improving.
- Weak tests: bugs slip by; RL follows bad signals.
- No dual-verification: mislabeled outputs and overfitted solutions contaminate data.
- No SFT: RL starts too weak and struggles to explore productively.
- No RL: performance plateaus below the methodâs ceiling.
Secret Sauce:
- Domain-specific feature evolution turns the generator into a real contest setter.
- Dual-verification converts noisy multi-solution outputs into reliable labels and robust 'golden' solutions.
- Weighted tests emphasize corner cases; hold-out confirmation avoids overfitting.
- Emphasis on long CoT teaches deeper reasoning habits the model can generalize.
04 Experiments & Results
The Test: The team evaluates on LiveCodeBench (v5 and v6), a leading contamination-resistant benchmark for competitive programming. They report avg@8, which means: sample 8 solutions per problem and average how often they pass, so a single lucky attempt cannot inflate the score. This fairly measures both the quality and the consistency of the model's reasoning.
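For readers who want the metrics stated precisely, here is a small sketch of avg@k alongside the standard unbiased pass@k estimator (Chen et al., 2021); the benchmark's actual evaluation harness may differ in implementation details.

```python
from math import comb

def avg_at_k(correct_flags: list[bool]) -> float:
    """avg@k: mean correctness over the k sampled solutions for one problem."""
    return sum(correct_flags) / len(correct_flags)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k draws from
    n samples (of which c are correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples for one problem, 3 of them correct.
flags = [True, False, True, False, False, True, False, False]
print(avg_at_k(flags))       # 0.375
print(pass_at_k(8, 3, 8))    # 1.0: drawing all 8 guarantees hitting a correct one
```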
Competition: X-Coder is compared with strong SFT-only models, RL-only models, and SFT-then-RL hybrids that rely on real-world data or math mixtures. Baselines include Qwen3-8B, Bespoke-Stratos, MiMo, Klear-Reasoner, DeepCoder-Preview-14B-RL, and more.
Scoreboard (with context):
- X-Coder-7B (fully synthetic) hits 62.9% avg@8 on v5 and 55.8% on v6. That's like getting an A when many others hover around B's, even some with larger model sizes and access to real data.
- On Qwen3-8B bases, X-Coder reaches 64.0% avg@8 on v5, confirming generality.
- Head-to-head on the same token budget, fully synthetic SFT beats OpenCodeReasoning's real-world-based dataset by +6.7 avg@8, especially on medium/hard problems.
SFT Scaling Insights:
- More unique tasks > more solutions per task: going from 32k to 200k unique tasks steadily increases scores, while swapping to 8 solutions for fewer tasks helps less. Diversity of challenges grows generalization.
- Long CoT wins big: long-reasoning training converges slower but ends much higher (roughly +17 points), showing deep chains teach robust strategies.
RL Insights:
- RL is a major booster: starting from a strong SFT checkpoint, RL adds roughly 4-5 absolute points on avg@8.
- 'Good-gets-better': stronger SFT initializers achieve better RL rewards under identical settings, implying SFT quality is crucial.
- RL tolerates some noise: despite synthetic tests not being perfect, rewards are still informative enough to improve policy.
Test-time Scaling:
- X-Coder's pass@k curves rise faster; with fewer rollouts it matches or surpasses stronger baselines. This means it explores more diverse, useful reasoning paths across attempts.
Error and Difficulty Analyses:
- Main failures are 'wrong answer,' 'no code block' (overlong thoughts get truncated), and 'time limit exceeded.' This underlines the need for efficient algorithms and controlled reasoning length.
- Longer generated reasoning at test time often signals harder problems and correlates with lower pass rates; difficulty drives length, not always accuracy.
Surprises:
- Tool-generated tests (CYaRon) produce higher-quality and more discriminative test suites than prompting alone.
- Domain-adapted evolution overwhelms general-purpose synthesis; injecting contest-specific knowledge is essential.
- Rationale-based task selection (favoring problems that elicit longer CoT) outperforms random or difficulty-only selection; longer thinking often marks richer training value.
Takeaway: Fully synthetic data, when evolved for the domain and double-verified, doesn't just match real-data pipelines; it can surpass them on the hardest public coding benchmark suites.
05 Discussion & Limitations
Limitations:
- Compute-heavy: Verifying 200k samples can require over a million solution traces and tens of millions of executions. Long-CoT SFT also needs more epochs and memory.
- Residual noise: Voting is strong but not perfect; a small error rate remains in labeled tests and selected solutions.
- Domain lock-in: The synthesis is tuned for competitive programming; translating it to other domains (e.g., systems programming, UI tooling) needs new feature trees and constraints.
- Context and efficiency: Some failures occur from overlong reasoning (truncation) or slow code (timeouts), hinting at the need for better planning and efficiency rewards.
- Reward hacking risk: RL can occasionally exploit edge cases for partial credit, requiring careful test design and monitoring.
Required Resources:
- Strong teacher models for solutions/tests; GPUs for long-CoT SFT and multi-rollout RL; a distributed sandbox (e.g., Redis + worker pool) to safely execute many programs concurrently.
- Time and storage for generating, filtering, and verifying large synthetic corpora.
When NOT to Use:
- If you already have abundant, clean, uncontaminated real contest data with verified tests.
- If you cannot afford large-scale code execution (e.g., limited compute or sandboxing constraints).
- For interactive or multi-file problems, which the current setup doesn't model well.
- When you need deterministic, regulation-audited datasets and cannot accept any labeling noise.
Open Questions:
- Can we cut verification costs while keeping quality? (e.g., smarter selection, active labeling, better confidence models.)
- How to design rewards that nudge efficiency (avoid TLE) and concise reasoning without hurting accuracy?
- Can we auto-balance task difficulty so models see just-right challenges over time?
- How to extend the method to new programming domains (e.g., concurrency, security) and languages beyond Python?
- What test-time search strategies (e.g., better sampling, reranking) most improve pass@k with minimal compute?
06 Conclusion & Future Work
Three-Sentence Summary: X-Coder proves that fully synthetic tasks, solutions, and tests, if carefully evolved for the domain and double-verified, can train expert competitive programming models. By pairing long-CoT SFT with RL over strong, weighted test suites, the approach achieves state-of-the-art results while avoiding data contamination. It also reveals practical scaling laws and insights that guide future data and training design.
Main Achievement: Turning fully synthetic data into a reliable, scalable, and superior alternative to real-world contest corpora through domain-specific feature evolution and dual verification.
Future Directions: Reduce verification cost via smarter selection and confidence modeling, design rewards that encourage both correctness and efficiency, broaden beyond single-file Python tasks, and continuously adapt the feature tree to emerging contest patterns. Explore test-time strategies and safety checks that preserve gains while preventing reward hacking.
Why Remember This: It flips the script. Great reasoning models don't have to depend on finite, possibly contaminated real data. With the right recipe, we can manufacture high-quality learning experiences at scale, pushing the frontier of code intelligence while keeping evaluations fair and forward-looking.
Practical Applications
- Create unlimited, auto-graded practice problems for classrooms or coding bootcamps with adjustable difficulty.
- Build interview-style question banks with thorough tests to fairly screen candidates without reusing public problems.
- Train internal coding copilots on synthetic corpora to avoid licensing/privacy risks tied to real contest data.
- Continuously fine-tune code LLMs using fresh synthetic tasks to prevent overfitting and keep improving.
- Stress-test code generators with tool-built boundary and stress cases to uncover hidden bugs.
- Prototype RL pipelines for code using the distributed sandbox and dual-verified tests for stable rewards.
- Curate domain-specific feature trees (e.g., graph theory, number theory) to target the skills your team needs.
- Use rationale-based data selection (favor tasks that elicit longer CoT) to maximize learning per training token.
- Adopt pass@k-aware generation and reranking to boost test-time success with fewer rollouts.
- Benchmark new base models quickly by plugging them into the X-Coder SFT-then-RL recipe.