UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
Key Summary
- The paper introduces UCoder, a way to teach a code-generating AI to get better without using any outside datasets, not even unlabeled code.
- It works by "probing" the model's own hidden knowledge: the model invents problems, writes tests, tries many solutions, then keeps the solutions that agree most when executed.
- A key idea is execution-driven consensus clustering: correct programs behave the same across tests, while wrong programs disagree in many different ways.
- This self-bootstrapping loop turns the model's best self-generated solutions into new training data, steadily improving performance.
- Across standard coding benchmarks, UCoder (7B, 14B, 32B) reaches accuracy close to or better than supervised baselines that used curated instruction data.
- Smaller models improve the most, showing an inverse-scaling effect that saves compute compared to simply training bigger models.
- Ablations show that consensus-based selection beats simpler filters like just picking low-perplexity code or randomly sampling from passing solutions.
- The approach still needs executable tests and multiple samples per problem, so it can be compute-heavy and is best when tests are available.
- The method mainly measures functional correctness and may miss style or maintainability issues.
- Overall, UCoder shows we can unlock LLMs' hidden coding skills using only their internal knowledge plus program execution feedback.
Why This Research Matters
This work shows a path to strong coding assistants that don't need massive, curated datasets, making advanced tooling more accessible to schools, startups, and privacy-sensitive teams. Because the method relies on running code, it aligns learning with what really matters in software: does the program work on real tests? Organizations can fine-tune models locally without sending data to third parties, improving security and compliance. Smaller models benefit the most, offering cost-effective upgrades without scaling parameter counts. Over time, this could democratize high-quality code generation and enable personalized, on-device coding help that improves itself responsibly. It also encourages a broader shift toward evaluation-grounded AI training, where behavior, not just style, guides learning.
Detailed Explanation
01 Background & Problem Definition
You know how some students learn best when a teacher gives them lots of example questions with answers, and others get better by practicing on their own and checking their work? Most code AIs today learn like the first kind: they rely on big, carefully prepared sets of programming problems and example solutions. That's powerful, but it is also expensive, slow, and not always possible when data is private.
Before this work, improving code generation usually meant supervised instruction tuning: humans (or other strong AIs) write and verify many problem-solution pairs. Even when people tried to skip labels by using huge piles of raw code from the internet, they still depended on external corpora and lots of compute. As models grew, the price of curating, cleaning, and verifying new instruction data kept rising. Teams faced a trade-off: pay more for better datasets or accept weaker models.
The problem the researchers faced was simple to say but hard to solve: Could a code LLM improve itself after pretraining without any outside data at all, not even unlabeled code? If that were possible, you could upgrade a base model anywhere, anytime, without collecting new data.
People tried some clever workarounds. One family of methods asked an LLM to write its own instructions and answers using existing open-source code as seeds, or used unlabeled code snippets to synthesize Q&A. These helped but still relied on external corpora and could inherit their issues (like duplication, license concerns, or contamination with benchmarks). Others tried to score generations by language fluency (perplexity), but fluent doesn't always mean correct: pretty-looking code can still be wrong.
So there was a gap: we needed a way to squeeze more skill out of a model using only what it already knows inside, and a way to check its answers that doesn't need gold solutions. Programs have a superpower here: you can run them. If a program passes well-designed tests, you don't need a human to say it's right.
This paper's big bet is to use the model's internal knowledge to generate the whole learning loop by itself. The model invents problems with clear function signatures, writes tests (including edge cases), produces many candidate solutions, runs them, and then keeps only the solutions that behave consistently across tests. Over and over, it retrains on that "best of itself" data.
Why should you care? In daily life, this means organizations without big data budgets, or with private code they can't share, could still improve code AIs. It could help students get custom coding tutors that get better locally without sending data away. It can reduce reliance on expensive annotation pipelines and lower the barrier to creating strong coding assistants. And because it uses program execution as the judge, it's grounded in what really matters: does the code work?
02 Core Idea
Aha moment in one sentence: If you let a code LLM create its own problems, write its own tests, try many solutions, and then learn only from the solutions that behave the same when executed, it can teach itself to code better without any external data.
Three analogies to see the same idea:
- Cooking analogy: Imagine a chef with a pantry of hidden skills. The chef invents recipes, taste-tests many versions, and keeps the ones several tasters agree are delicious. Next time, the chef starts from those winning versions and cooks even better.
- Science fair analogy: A student designs experiments (tests), runs many trials (solutions), and trusts results that repeat consistently. Repeated success becomes the student's new study notes.
- Sports practice analogy: A team tries many plays (solutions) against drills (tests). The plays that reliably score become the playbook for future games.
Now, let's introduce each concept using the Sandwich pattern, in the right order.
Hook: You know how you can sometimes figure out patterns just by exploring and practicing, even if no one gives you the answers? The Concept (Unsupervised Learning): It's when an AI learns without labeled examples, discovering structure and improving by using signals it can measure itself.
- How it works: (1) Generate data or views of data, (2) apply self-checks or rules, (3) update the model based on what seems consistent or useful, (4) repeat.
- Why it matters: Without it, we always need expensive answer keys; learning slows or stops when labeled data is scarce. Anchor: An AI that groups similar pictures together without labels is doing unsupervised learning.
Hook: Imagine you take a practice test, grade it yourself using an answer sheet you trust, and then study your own correct answers. The Concept (Self-Training): The model learns from its own best guesses by treating them like training data for the next round.
- How it works: (1) Produce multiple answers, (2) pick the reliable ones, (3) fine-tune on them, (4) generate better answers next time.
- Why it matters: Without self-training, the model can't steadily upgrade itself; it would be stuck with its current habits. Anchor: A spelling app that saves your correctly spelled words and uses them to teach you more similar words is self-training.
Hook: Think of a big library in your head you don't always remember until a clue jogs your memory. The Concept (Latent Knowledge in LLMs): Pretrained models store hidden know-how that isn't always directly visible.
- How it works: Training on huge text/code builds internal patterns; the right prompts or procedures can surface them.
- Why it matters: If we can't access this hidden knowledge, we waste what the model already knows. Anchor: A model that suddenly recalls a sorting trick when asked for a "divide-and-conquer" approach is tapping latent knowledge.
Hook: Imagine asking your brain targeted questions to discover what you already know. The Concept (Internal Probing): Systematically prompting a model to reveal specific abilities it holds inside.
- How it works: (1) Ask it to create problems, (2) rate and refine them, (3) sketch solution structures, (4) inspect patterns in its own responses.
- Why it matters: Without probing, the model's useful skills may remain hidden and unused. Anchor: You prompt a model, "Write a function signature and a brief spec for a graph problem," and it produces a solid starting point; probing found a skill.
Hook: When you dump out a box of puzzles, you first see what kinds you even have. The Concept (Problem Space Probing): Getting the model to generate diverse, well-specified programming tasks.
- How it works: (1) Create problems with clear function signatures and docstrings, (2) rate quality, (3) draft solution skeletons.
- Why it matters: Without good problems, you can't practice effectively or measure progress. Anchor: The model writes "def count_islands(grid: List[List[int]]) -> int:" plus a description, giving a clear target.
Hook: Before playing a game, you mark the boundaries so you know what counts as in or out. The Concept (Test Understanding Probing): Making lots of input tests, including edge cases, to check if solutions truly understand the problem.
- How it works: (1) Generate around 100 tests, (2) include boundaries and tricky cases, (3) ensure determinism.
- Why it matters: Without solid tests, wrong code can look right. Anchor: For a "reverse string" task, tests include empty strings, emojis, and very long inputs.
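To make the "reverse string" anchor concrete, here is a minimal, hedged sketch of what deterministic edge-case test inputs could look like (the exact format and names are illustrative, not taken from the paper):

```python
# Hypothetical edge-case inputs for a "reverse string" task. In this setup the
# inputs are what matter most: candidate solutions are later compared by how
# they behave on these inputs, so no gold outputs are needed up front.
REVERSE_STRING_TEST_INPUTS = [
    "",                # empty string (boundary case)
    "a",               # single character
    "racecar",         # palindrome (output equals input)
    "hello world",     # ordinary case with whitespace
    "héllo",           # non-ASCII characters
    "x" * 10_000,      # very long input (stress case)
]
```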
Hook: When building with LEGO, you try many ways before keeping your favorite castle. The Concept (Solution Space Probing): Sampling many different solution attempts for each problem.
- How it works: (1) Densely sample (e.g., 128 solutions), (2) run them on tests, (3) record behaviors.
- Why it matters: If you try too few, you might miss the correct idea. Anchor: For "fibonacci," the model tries recursion, iteration, and memoization.
Hook: If five clocks show the same time, you trust that time more than a single clock. The Concept (Execution-Driven Consensus Clustering): Grouping solutions that act the same on all tests and trusting the biggest agreeing group.
- How it works: (1) Execute candidates, (2) group by identical pass/fail patterns, (3) pick the largest nontrivial group, (4) optionally favor cleaner code.
- Why it matters: Without consensus, you might pick a fluent but wrong program. Anchor: Ten solutions for "two-sum" pass all tests the same way; that cluster is likely correct.
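As a minimal sketch of this idea (ours, with hypothetical helper names, not the paper's code): run every candidate on the same inputs, reduce each one to a behavior signature, and keep the largest group whose signatures match exactly.

```python
from collections import Counter
from typing import Callable, List, Tuple

def behavior_signature(candidate: Callable, inputs: List) -> Tuple:
    """Record what a candidate does on every input: its output, or the error it raises."""
    signature = []
    for x in inputs:
        try:
            signature.append(("ok", repr(candidate(x))))
        except Exception as exc:            # crashing is also a behavior
            signature.append(("error", type(exc).__name__))
    return tuple(signature)

def largest_consensus_cluster(candidates: List[Callable], inputs: List) -> List[Callable]:
    """Group candidates by identical behavior and return the biggest agreeing group."""
    if not candidates:
        return []
    signatures = {c: behavior_signature(c, inputs) for c in candidates}
    winning_signature, _ = Counter(signatures.values()).most_common(1)[0]
    return [c for c in candidates if signatures[c] == winning_signature]
```

For the "two-sum" anchor above, the ten identically behaving solutions would be exactly the cluster this helper returns.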
Hook: After practice, you write neat notes of the best tricks so you can reuse them. The Concept (Knowledge Consolidation): Turning the best, most consistent solutions into training data and fine-tuning on them.
- How it works: (1) Keep high-consensus code, (2) fine-tune the model, (3) repeat in iterations.
- Why it matters: Without consolidation, the model forgets or never locks in the winning patterns. Anchor: The model saves its best "binary search" implementation and learns to produce it more reliably next time.
Hook: When your friends independently pick the same restaurant, you feel confident it's a good choice. The Concept (Consensus-Based Filtering): Selecting outputs backed by agreement and stable behavior.
- How it works: (1) Filter low-success runs, (2) choose the largest behaviorally identical cluster, (3) within it, prefer code that's both successful and fluent.
- Why it matters: Without this filter, noise and lucky guesses pollute training. Anchor: From 128 candidate programs, the method keeps the 20 that pass tests identically and look clean.
Before vs. After: Before, we needed curated Q&A to upgrade code LLMs. After, we can upgrade them by mining their own internal know-how and letting program execution be the judge.
Why it works intuitively: Correct programs tend to agree; wrong ones fail in many different ways. By finding agreement under execution, we isolate correctness signals without ever seeing a gold solution.
Building blocks you can picture: problem maker, test builder, many-solution sampler, execution-based grouper, and a trainer that learns from the winners.
03 Methodology
At a high level: Natural-language prompt or seed idea → Problem Space Probing (make tasks, rate them, sketch code) → Test Understanding Probing (generate edge-case tests) → Solution Space Probing (sample many candidate programs) → Execution-Driven Consensus Clustering (group by behavior, pick the biggest correct-looking cluster) → Knowledge Consolidation (fine-tune on winners) → Improved model outputs.
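A hedged, high-level sketch of that loop in code (every stage callable here is a hypothetical stand-in for one of the probing steps, not an API from the paper):

```python
from typing import Callable, List, Tuple

def ucoder_round(
    model,
    propose_problems: Callable[[object, int], List[str]],           # problem space probing
    propose_tests: Callable[[object, str], List[str]],              # test understanding probing
    sample_solutions: Callable[[object, str, int], List[str]],      # solution space probing
    select_consensus: Callable[[List[str], List[str]], List[str]],  # consensus clustering
    fine_tune: Callable[[object, List[Tuple[str, str]]], object],   # knowledge consolidation
    num_problems: int = 1_000,
    samples_per_problem: int = 128,
):
    """One self-improvement round: probe, execute, filter, and fine-tune on the winners."""
    training_pairs: List[Tuple[str, str]] = []
    for problem in propose_problems(model, num_problems):
        tests = propose_tests(model, problem)
        candidates = sample_solutions(model, problem, samples_per_problem)
        for winner in select_consensus(candidates, tests):
            training_pairs.append((problem, winner))   # keep (q, r*) style pairs
    return fine_tune(model, training_pairs)            # returns the improved model
```

Running this function repeatedly, each time on the model it just returned, gives the Iter 1, Iter 2, ... dynamics described below.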
Step-by-step with what, why, and a concrete feel:
- Problem Space Probing: Make strong practice tasks.
- What happens: The base LLM is prompted to generate programming problems with function signatures, clear docstrings, arguments, and return types. It also rates the quality and drafts short skeletons that outline the approach.
- Why this step exists: Clear targets prevent ambiguity. If tasks are vague, tests will be weak and solutions misleading.
- Example: The model writes "def shortest_path(n: int, edges: List[Tuple[int,int,int]], s: int, t: int) -> int:" with a docstring mentioning Dijkstra or BFS on weighted graphs.
- Quality Oracle (internal rating): Keep good problems.
- What happens: The model assigns a quality score and explains why (clarity, completeness, difficulty). Low-quality prompts are discarded or refined.
- Why this step exists: Garbage in, garbage out. Weak problems lead to noisy training signals.
- Example: A task missing edge cases like negative weights gets a lower rating and is revised.
- Interface Synthesizer: Provide a skeleton.
- What happens: The model outputs imports, type hints, and a short plan inside a docstring, like a template that guides implementations.
- Why this step exists: Skeletons reduce off-target code and align solutions with the intended contract.
- Example: A merge-intervals problem includes a docstring: "Sort by start, merge overlapping ranges."
- Test Understanding Probing: Build tough tests.
- What happens: For each problem, the model generates around 100 tests, emphasizing boundary and edge cases (empty inputs, extremes, unusual characters, large sizes). Tests are deterministic and executable.
- Why this step exists: Without strong tests, wrong code can slip by. Tests turn "I think it's right" into "It works on evidence."
- Example: For a string normalization task, include empty strings, mixed Unicode, and already-normalized inputs.
- Solution Space Probing: Try many answers (dense sampling).
- What happens: The model samples n = 128 candidate solutions per problem. These include multiple algorithmic styles (recursion, iteration, memoization), library choices, and code structures.
- Why this step exists: More attempts increase the chance that at least a few are correct and diverse enough to form a reliable consensus.
- Example: For top-k frequent elements, some solutions use heapq, others use Counter.most_common, others manual buckets.
- Execution-Driven Consensus Clustering: Let tests be the judge.
- What happens: Every candidate is executed on the full test suite. Candidates are grouped into clusters by identical pass/fail patterns (their execution signature). The biggest nontrivial cluster is chosen. Before that, reliability filtering removes candidates with low execution success (e.g., threshold τ = 0.8). Within the chosen cluster, tie-breaking can prefer code that is both successful and reasonably fluent (a minimal code sketch of this selection appears after the last step below).
- Why this step exists: Correctness is singular; incorrectness is diverse. The largest agreement under execution likely corresponds to correct logic.
- Example with actual flavor: On a two-sum problem, 19 candidates pass all 100 tests identically; 5 others pass many but fail edge cases; dozens fail differently. The 19 form the consensus cluster to keep.
- Knowledge Consolidation and Reinforcement: Learn from the best.
- What happens: The problem description q and the selected solution r* become a training pair. The model fine-tunes on these pairs for a few epochs, then repeats the whole loop with the improved model, forming iterations (Iter 1, Iter 2, ...).
- Why this step exists: To lock in the correct patterns so future generations are better out of the box.
- Example: After fine-tuning on high-consensus code for sliding-window maximum, the model more reliably uses deque and handles boundaries next time.
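Putting the last two steps together, here is a hedged sketch of how selection and pair-building could look, assuming each candidate has already been executed and reduced to a boolean pass/fail vector over the test suite (the threshold τ = 0.8 follows the text above; everything else is illustrative):

```python
from collections import Counter
from typing import Dict, List, Tuple

TAU = 0.8  # reliability threshold on execution success rate, as described above

def build_training_pairs(
    problem: str,
    results: Dict[str, List[bool]],   # candidate source code -> pass/fail per test
) -> List[Tuple[str, str]]:
    """Filter unreliable candidates, keep the largest agreeing cluster, emit (q, r*) pairs."""
    # 1) Reliability filtering: drop candidates that pass too few tests.
    reliable = {
        code: tuple(passes)
        for code, passes in results.items()
        if passes and sum(passes) / len(passes) >= TAU
    }
    if not reliable:
        return []
    # 2) Consensus clustering: group by identical pass/fail signature.
    winning_signature, cluster_size = Counter(reliable.values()).most_common(1)[0]
    if cluster_size < 2:              # "nontrivial" cluster: require real agreement
        return []
    cluster = [code for code, sig in reliable.items() if sig == winning_signature]
    # 3) Consolidation: every winner becomes a (problem, solution) training pair.
    #    (A fluency tie-breaker could further rank `cluster`; see the findings later.)
    return [(problem, code) for code in cluster]
```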
What breaks without each piece:
- No problem probing: You get vague tasks and messy signals.
- No tests: You can't tell good from bad; fluent but wrong code sneaks in.
- No dense sampling: You might miss the correct idea entirely.
- No consensus clustering: You may trust a one-off lucky pass or stylish but incorrect code.
- No consolidation: Improvements don't stick; the next round repeats the same mistakes.
Secret sauce: Execution-driven consensus clustering. It converts raw generations into a behavior-based agreement test. Because correct solutions tend to behave identically across many inputs, the maximum-consensus cluster isolates correctness without needing gold labels. A simple intuition supports this: because wrong programs rarely make the same mistakes on every test, increasing the test count separates them from the correct cluster. The method also blends extra signals (execution success rate and code fluency) as tie-breakers, but behavior under tests remains the primary judge.
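A toy calculation (ours, not the paper's) makes that intuition tangible: if each incorrect candidate passed any given test independently with probability p, two incorrect candidates would share an identical pass/fail signature across t tests only with probability (p^2 + (1-p)^2)^t, which shrinks geometrically as tests are added.

```python
# Toy model, not from the paper: probability that two independent *incorrect*
# candidates show the exact same pass/fail pattern on t tests, if each passes
# any single test with probability p.
def accidental_agreement(p: float, t: int) -> float:
    return (p**2 + (1 - p) ** 2) ** t

for t in (10, 50, 100):
    print(f"t={t:3d}  P(same signature) ~ {accidental_agreement(0.5, t):.1e}")
# Roughly 1e-3 at 10 tests, 1e-15 at 50, 1e-30 at 100: accidental clusters of
# wrong programs become vanishingly rare, while correct programs keep agreeing.
```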
Data and signals the system pays attention to:
- Execution success rate e(r): how often a candidate runs and passes tests.
- Consensus strength s(r): how many siblings behave exactly the same.
- Code fluency f(r): a light proxy for readability/stability when needed as a tie-breaker.
Iteration dynamics:
- Start with a base model (no instruction tuning).
- Generate tasks, tests, and many solutions.
- Pick high-consensus winners; fine-tune.
- Repeat for several rounds until validation suggests diminishing returns.
What comes out:
- An improved UCoder model that solves more problems on the first try (higher Pass@1) and generalizes across varied coding benchmarks without ever seeing external curated data.
04 Experiments & Results
The test: The authors measured code correctness using Pass@1 (does the single top answer pass all tests?) across standard Python and multi-skill benchmarks: HumanEval and MBPP/MBPP+ (classic functions), BigCodeBench-Complete and -Instruct (broader context and API usage), LiveCodeBench (competitive programming with careful contamination checks), and FullStackBench (realistic, multi-skill coding tasks). Pass@1 makes results easy to compare: how often does the first try just work?
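For reference, Pass@1 is the k = 1 case of the standard pass@k metric popularized alongside HumanEval; a minimal sketch of the usual unbiased estimator (per problem, with n samples of which c are correct):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 this is just the fraction of correct samples per problem,
# averaged over the benchmark:
assert abs(pass_at_k(10, 3, 1) - 0.3) < 1e-9
```

With a single top answer per problem, as described above, Pass@1 reduces to simply counting the fraction of problems whose one answer passes all tests.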
The competition: UCoder is compared to popular supervised or instruction-tuned code LLMs like CodeLlama, DeepSeek-Coder (v1/v2), StarCoder2, OpenCoder, and Qwen2.5-Coder (base and instruct), as well as closed APIs like GPT-4o and Claude 3.5 Sonnet. UCoder uses Qwen2.5-Coder (7B/14B/32B) as the base but does not use any external datasets for post-trainingāonly its self-generated tasks and tests.
The scoreboard with context:
- Overall strength: UCoder models are competitive with instruction-tuned peers despite using no external instruction data. For example, UCoder-32B hits around 89.7% on BigCodeBench-Complete and 75.7% on MBPP+; UCoder-14B reaches 86.5% on MBPP and 74.3% on MBPP+. UCoder-7B reaches 85.2% on MBPP, nearly matching larger 32B baselines (~86.2%). Think of this like a student who studies only from their own practice still getting nearly the same scores as classmates who had a full answer key.
- Harder, broader tasks: UCoder often shines on the more diverse or realistic benchmarks (MBPP/MBPP+, BigCodeBench-Complete, FullStackBench) across all sizes. While some instruct models still lead on HumanEval, the gap narrows as model size increases.
What changes over iterations:
- Iterative self-improvement works. Compared to the starting point (Iter 0), UCoder-7B gains roughly +6 to +13 points on many benchmarks (e.g., up to 85.2% on MBPP). UCoder-14B sees +4 to +11 point jumps; UCoder-32B gains about +3 to +5 points.
- Inverse scaling of improvement: The smaller 7B model improves the most. This suggests self-bootstrapping helps unlock skills that smaller models know but don't reliably use; learning from their best consistent outputs narrows the gap to larger models without needing more parameters.
- Convergence: Best results occur after a handful of rounds (about 6 for 7B, 5 for 14B, 4 for 32B). Beyond that, scores wiggle but don't collapse, pointing to a sweet spot and the value of early stopping.
Why consensus matters (ablation evidence):
- The authors compared their consensus-based selection to several alternatives: random picks from passing code, clustering by output hashes, choosing lowest-perplexity code, or just using execution success rate thresholds. Across sizes (7B/14B/32B), consensus-based selection wins most often and by clear margins on tough suites like FullStackBench. This means agreement-under-execution is a stronger quality signal than just "looks fluent" (low perplexity) or "passed once."
Surprising and insightful findings:
- Clear fluency-quality break: High-quality solutions tend to have very low perplexity (around 1.01-1.05), with success rates dropping quickly above ~1.05. So fluency helps as a tie-breaker, but you still need behavioral consensus to avoid pretty-but-wrong code (a small tie-breaker sketch follows this list).
- Diversity isn't lost: After filtering by consensus, the kept data retains strong lexical and structural variety (entropy stays similar), while success rate and error-free rate jump. This shows you can raise quality without collapsing into a few code templates.
- Rich solution space: With n = 128 candidates per task and millions of samples overall, the generated code spans many AST node types and complexities. That diversity is the raw fuel consensus needs; without variety, you wouldnāt form reliable correct clusters.
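A hedged sketch of using that break point purely as a tie-breaker inside an already-winning cluster (`perplexity` is a hypothetical scoring helper for the base model, and 1.05 is only the approximate break point reported above):

```python
from typing import Callable, List

PPL_BREAK = 1.05  # approximate fluency break point noted above

def tie_break_by_fluency(cluster: List[str], perplexity: Callable[[str], float]) -> str:
    """Within a behaviorally consistent cluster, prefer the most fluent solution."""
    fluent = [code for code in cluster if perplexity(code) <= PPL_BREAK]
    pool = fluent or cluster            # never let the fluency gate empty the cluster
    return min(pool, key=perplexity)    # behavior already judged; fluency only breaks ties
```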
Big picture: The results show that just by using what the model already knows plus execution feedback, you can reach close-to-supervised performance. It's like building a high-scoring study routine with no answer book: just lots of thoughtful practice, strict grading, and keeping the best notes.
05 Discussion & Limitations
Limitations and where this may stumble:
- Needs executable tests: The whole method leans on running code against tests. If your domain can't be easily tested (e.g., GUI interactions, complex side effects, or proprietary systems you can't execute), building reliable signals is hard.
- Compute to sample and run: Generating 128 candidates per problem and executing ~100 tests each costs time and resources. It's far cheaper than building huge labeled datasets, but it's not free; practical deployments need batching, caching, or fewer samples when possible.
- Narrow view of quality: Passing tests measures functional correctness, not maintainability, security posture, or style guidelines. Additional checks or linters could help, but the core method doesn't optimize for them by default.
- Diminishing returns: After several iterations, improvements plateau or oscillate, hinting at overfitting to synthetic distributions. Validation-based early stopping is important.
- Language scope: Most analysis here is in Python. Other languages with different build and runtime behaviors (C++, Rust, Java) may require more sophisticated sandboxing, dependency handling, or test generation tweaks.
Required resources to use UCoder well:
- A safe executor/sandbox for running candidate code and collecting pass/fail signals (a minimal sketch follows this list).
- Enough compute to sample multiple candidates (dozens to low hundreds) and to fine-tune for a few epochs per iteration.
- Prompt templates for generating problems, skeletons, and robust tests.
- Simple quality gates (e.g., execution success threshold, cluster size threshold) and logging to monitor consensus health.
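As a starting point for the executor item above, here is a minimal sketch that runs one candidate against assert-style tests in a subprocess with a timeout. Assumptions: Python-only candidates and tests; a subprocess plus timeout is not a full security sandbox, so real deployments should add OS-level isolation.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> bool:
    """Return True if the candidate plus its assert-style tests exit cleanly in time."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(program)
        path = handle.name
    try:
        result = subprocess.run(
            [sys.executable, "-I", path],   # -I: isolated mode (ignores env and user site)
            capture_output=True,
            timeout=timeout_s,
        )
        return result.returncode == 0       # nonzero exit = failed assert or crash
    except subprocess.TimeoutExpired:
        return False                         # infinite loops count as failures
    finally:
        os.unlink(path)                      # clean up the temporary file
```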
When not to use it:
- If you can't run code safely or deterministically (e.g., heavy external I/O, networking, or stateful systems that produce flaky outcomes).
- If you already have high-quality labeled instruction data tailored to your domain; supervised fine-tuning might be simpler and faster.
- If your constraints demand strict style, documentation, or performance profiles beyond what tests capture, unless you add those checks.
Open questions worth exploring:
- Beyond functional tests: How to include maintainability, security, and performance signals into the consensus (e.g., static analysis, taint tracking, microbenchmarks)?
- Smarter sampling: Can we adaptively pick how many candidates to try, focusing compute where uncertainty is highest?
- Cross-language generalization: What changes are needed for compiled languages, complex build systems, or cross-file projects?
- Human-in-the-loop light touch: Can small amounts of expert review amplify gains without turning this back into fully supervised training?
- Curriculum shaping: Can we automatically schedule problems from easy to hard, using model uncertainty as a guide, to accelerate learning?
06 Conclusion & Future Work
Three-sentence summary: UCoder shows that a code LLM can teach itself using only its internal knowledge plus execution feedback, with no external datasets required. By generating its own problems and tests, sampling many solutions, and keeping the largest behaviorally consistent cluster, it builds reliable training data and improves iteratively. Across benchmarks, this unsupervised loop reaches performance close to supervised instruction tuning while reducing data dependence.
Main achievement: Turning program execution and consensus into a dependable teacher that extracts and strengthens the modelās hidden coding skills without any human-written labels or external corpora.
What's next: Incorporate richer quality signals (security, maintainability, speed), adapt to more languages and project scales, reduce compute with adaptive sampling, and mix in tiny amounts of expert feedback when it gives big boosts. Also, design smarter test generation to cover tricky behaviors and real-world edge cases.
Why remember this: It reframes how we think about improving code AIs: don't just feed them more curated data; let them practice, test themselves, agree on what works, and learn from their own best work. It's a practical path to stronger models in settings where data is scarce, private, or expensive.
Practical Applications
- Create a local coding tutor that invents practice problems, auto-writes tests, and improves itself without internet data.
- Fine-tune a private code assistant inside a company using only internally executed tests, preserving IP and compliance.
- Bootstrap coding support for niche domains (e.g., scientific tooling) by generating domain-specific tasks and tests on the fly.
- Add consensus-based filtering to existing code-gen pipelines to raise correctness without adding labeled data.
- Use the framework to continuously regress-test an in-house code model, catching drifts and reinforcing stable solutions.
- Deploy on mid-range hardware by reducing candidate count adaptively (e.g., 32-64 only for hard tasks) to save compute.
- Extend CI systems: when a model proposes patches, auto-generate tests and accept only consensus-passing fixes.
- Build classroom tools that let students see many solution styles, with the system highlighting the consensus-correct ones.
- Prototype multilingual code assistants by adapting execution sandboxes per language and reusing the same consensus logic.
- Combine with static analyzers and linters so the consensus also favors secure and maintainable code, not just passing code.