
BatCoder: Self-Supervised Bidirectional Code-Documentation Learning via Back-Translation

Intermediate
Jingwen Xu, Yiyang Lu, Zisu Huang et al. Ā· 1/30/2026
arXiv Ā· PDF

Key Summary

  • BatCoder teaches a code model to write both code and its documentation by doing a round trip: from code to docs and back to code.
  • It learns without needing paired code–doc examples by using a self-made reward: how close the rebuilt code is to the original.
  • This reward is measured with a semantic code similarity score, so the model gets credit when it preserves meaning, not just exact words.
  • Reinforcement learning updates the model in both directions, so better docs lead to better code, and better code leads to better docs.
  • The method scales well: as data and model size grow, the learning signal gets stronger and results improve.
  • On HumanEval and MBPP, BatCoder with a 7B model beats strong open-source models of similar or even larger size.
  • It shines in low-resource languages like Ruby and Go, where paired code–doc data is scarce.
  • Compared to supervised fine-tuning on synthetic pairs, reinforcing the round-trip similarity works better.
  • A careful filtering step checks that generated docs follow a clean format before using them to rebuild code.
  • This approach can be extended to related tasks like code completion and code translation in the future.

Why This Research Matters

Most code on the internet lacks high-quality documentation, and creating labeled pairs is expensive and uneven across languages. BatCoder learns from raw code by rewarding itself when documentation truly captures behavior and the code rebuilt from it matches the original. This makes it easier to support less popular languages where labeled data is scarce. For teams, it means faster onboarding, clearer codebases, and fewer bugs caused by misunderstandings. For tools, it means smarter code assistants that can both explain code and write code from explanations. In the long run, this reduces costs, boosts developer productivity, and makes high-quality software more accessible.

Detailed Explanation


01 Background & Problem Definition

You know how, when you’re building LEGO sets, the booklet (documentation) tells you how to assemble the pieces (code), and looking at the finished model helps you explain what the booklet meant? Software is like that: code and documentation are two ways to describe the same idea—one for computers, one for humans.

šŸž Hook: Imagine reading a recipe (the doc) and then cooking the dish (the code). If the dish turns out right, the recipe was good. If you can also taste the dish and write a matching recipe, you’ve got a solid two-way connection.

🄬 The Concept (Code–Documentation Alignment): It means the words in the documentation truly match what the code does. How it works:

  1. Read the code and understand its behavior.
  2. Write a description that captures the inputs, outputs, and steps.
  3. Check that the description is specific enough to recreate the behavior. Why it matters: Without alignment, people misunderstand code, bugs slip in, and teams move slower. šŸž Anchor: A doc that says 'add 100, multiply by 2.5, take the square root, and round to two decimals' matches a function that does exactly that.
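To make this concrete, here is a minimal sketch of an aligned pair, written in Python for illustration (the function name and values come from the anchor above; the paper itself works across several languages):

```python
import math

def calculate_value(x):
    """Add 100 to x, multiply by 2.5, take the square root, and round to two decimals.

    Example:
        >>> calculate_value(0)
        15.81
    """
    return round(math.sqrt((x + 100) * 2.5), 2)
```

Every sentence in the docstring maps to an operation in the body, so the documentation alone would be enough to rebuild the function.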

The world before BatCoder: Big code models could already generate code from English and write comments for code. But training them well often needed tons of high-quality paired examples: a piece of code and its perfect human-written description. These pairs are expensive to collect and uneven across languages, especially for less common ones like Ruby or Go compared to Python.

The problem: There’s lots of raw code on the internet, but not nearly enough trustworthy code–doc pairs. Without those pairs, models don’t learn the tight connection between what code does and how to explain it clearly—or how to go from an explanation back to accurate code.

Failed attempts: People tried to fix this by:

  • Using stronger external models to write synthetic docs (but then you depend on bigger 'teacher' models).
  • Doing supervised fine-tuning on those synthetic pairs (but the model just imitates fixed answers, without being judged on whether those answers actually help regenerate the code).
  • Generating pseudo-labels or rule-based pseudo-code (but quality varies, and rules can be brittle). These methods help, but they don’t close the loop: they don’t check whether the produced docs are truly good at guiding code generation.

The gap: Models needed a way to learn directly from raw code—without expensive labels—while being rewarded for producing documentation that really preserves the code’s meaning.

šŸž Hook: Think of teaching yourself to explain a magic trick so well that someone else can redo it exactly.

🄬 The Concept (Self-Supervised Learning): It’s learning from the data itself without human-provided labels. How it works:

  1. Use what you already have (just code).
  2. Make your own targets by transforming it (create docs, rebuild code).
  3. Score yourself on how faithful the round trip is. Why it matters: It unlocks huge unlabeled datasets and reduces reliance on costly curated pairs. šŸž Anchor: You have a pile of songs (code) but no descriptions. You write your own summaries (docs) and see if someone can recreate the song’s melody and structure from your summary.

Real stakes: In the real world, most repos don’t have great docs, and many teams work across multiple languages. Better self-learning means:

  • Faster onboarding for new developers.
  • Cleaner, more maintainable codebases.
  • Stronger support for less popular languages.
  • Reduced cost and dependence on big private datasets. All of this means fewer bugs, clearer communication, and quicker delivery of features that people use every day.

02 Core Idea

The 'Aha!' in one sentence: If documentation is truly good, you can use it to rebuild the original code—so train the model to write such docs and such code by rewarding success on that round trip.

šŸž Hook: You know how translators sometimes translate a sentence into another language and then back again to check if the meaning stayed the same?

🄬 The Concept (Back-Translation): Change from one form (code) to another (docs) and then back (code again) to check consistency. How it works:

  1. Start with real code.
  2. Generate documentation for it.
  3. Use that documentation to regenerate code.
  4. Compare the rebuilt code to the original. Why it matters: Without this round trip, we don’t know if the documentation truly captures the code’s meaning. šŸž Anchor: Translate 'It’s raining cats and dogs' to another language and back. If it returns to 'It’s pouring hard,' you kept the meaning.
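Here is a minimal sketch of one round trip, with the two generation directions and the similarity metric passed in as callables (their names and signatures are assumptions for illustration, not the paper's actual interfaces):

```python
from typing import Callable

def round_trip_score(
    code: str,
    code_to_doc: Callable[[str], str],        # Stage 1: code -> documentation
    doc_to_code: Callable[[str], str],        # Stage 2: documentation -> code
    similarity: Callable[[str, str], float],  # semantic similarity in [0, 1]
) -> float:
    """Score how well documentation generated for `code` preserves its meaning."""
    doc = code_to_doc(code)           # describe the original code in natural language
    rebuilt = doc_to_code(doc)        # rebuild the code from that description alone
    return similarity(code, rebuilt)  # high only if the doc kept the essentials
```

A high score is only possible when both directions cooperate: vague documentation or sloppy reconstruction both drag it down.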

Three analogies for the same idea:

  1. Recipe loop: Dish -> Recipe -> Dish again. If the second dish tastes the same, the recipe was great.
  2. Map loop: City -> Directions -> City tour. If you can navigate back to the same landmarks, the directions were accurate.
  3. LEGO loop: Model -> Instruction booklet -> Model. If the rebuilt model matches, the booklet did its job.

Before vs After:

  • Before: Models learned from fixed pairs and were not directly rewarded for writing docs that enable faithful code reconstruction.
  • After: The model is rewarded when its docs let it successfully rebuild the code, and when its code matches the doc—both directions get better together.

šŸž Hook: Think of a puppy learning tricks with treats only when the trick is done right.

🄬 The Concept (Reinforcement Learning): The model tries, gets a reward based on how close the rebuilt code is to the original, and learns to do better next time. How it works:

  1. Model generates docs, then regenerates code.
  2. A similarity score between original and rebuilt code becomes the reward.
  3. The model updates itself to get higher rewards next time. Why it matters: Without rewards, the model just imitates examples; with rewards, it practices and improves what really counts. šŸž Anchor: If your basketball shot goes in (good result), you practice that motion more. If it misses, you adjust.

Why it works (intuition, not equations):

  • The only way to get a high round-trip score is to write docs that keep the true meaning and to write code that follows those docs exactly.
  • This creates a natural feedback loop: better docs make code reconstruction easier; successful reconstruction proves the docs captured the essentials.
  • Because we use unlabeled code, we get lots of practice data without needing human-written docs.

Building blocks:

  • A way to generate documentation from code.
  • A way to generate code from documentation.
  • A similarity score that checks whether the rebuilt code matches the original in meaning (not just exact text).
  • A reinforcement learning loop that nudges the model toward choices that increase this score.

šŸž Hook: Comparing two songs to see if they have the same melody, even if played by different instruments.

🄬 The Concept (Semantic Code Similarity): It scores how alike two programs are in what they do, even if they look different. How it works:

  1. Analyze structure and data flow of both versions.
  2. Match important operations and relationships.
  3. Output a score from 0 to 1. Why it matters: Without a meaning-aware score, the model could cheat by matching surface details without truly preserving behavior. šŸž Anchor: Two functions that both sort a list—one uses quicksort, the other mergesort—are semantically similar even if their code looks different.
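The paper's metric is structure- and meaning-aware; as a crude stand-in to make the 0-to-1 scoring idea concrete, the toy function below compares the bags of syntax-node types of two Python snippets (it would not fully capture the quicksort-versus-mergesort case, but it shows the shape of such a score):

```python
import ast
from collections import Counter

def toy_code_similarity(code_a: str, code_b: str) -> float:
    """Rough structural similarity in [0, 1]: overlap of AST node-type counts."""
    bag_a = Counter(type(node).__name__ for node in ast.walk(ast.parse(code_a)))
    bag_b = Counter(type(node).__name__ for node in ast.walk(ast.parse(code_b)))
    shared = sum((bag_a & bag_b).values())  # node-type counts both programs share
    total = sum((bag_a | bag_b).values())   # everything either program uses
    return shared / total if total else 0.0

# Different implementations of the same task still overlap structurally:
loop_version = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
builtin_version = "def total(xs):\n    return sum(xs)"
print(toy_code_similarity(loop_version, builtin_version))  # a moderate, non-zero score
```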

03 Methodology

At a high level: Input code → Stage 1 (generate documentation) → Filter/format check → Stage 2 (rebuild code from the doc) → Compare with original → Give rewards → Update the model.

Step-by-step, like a recipe:

  1. Input: Take one unlabeled code snippet from a mixed-language code corpus (e.g., Python, Ruby, Go). Why this step exists: We want to learn directly from the abundant raw code available online. Example: A Ruby function that adds 100, multiplies by 2.5, takes a square root, and rounds to two decimals.

šŸž Hook: Like writing instructions for how to recreate a sandcastle you just built.

🄬 The Concept (Documentation Generation): Turn code into a clear, structured, and format-following description. How it works:

  1. Read the code and infer what it does, its inputs and outputs.

  2. Produce a doc that follows a language-specific template (e.g., tags, examples, function signature hints).

  3. Include at least one example showing input and expected output. Why it matters: Without good docs, the next step (rebuilding code) will fail or be vague. šŸž Anchor: From the Ruby function, generate a doc explaining each step and showing '>>> calculate_value(0) -> 15.81'.

  2. Sampling strategy in Stage 1: For each code snippet, create multiple candidate docs (e.g., K=8). This explores different wording styles and levels of detail. Why this step exists: Natural language is diverse; multiple tries help find docs that best preserve meaning. Example: Eight slightly different Ruby doc candidates that all mention the add–multiply–sqrt–round steps in various phrasings.

  3. Filtering and rewriting: Extract just the content between <doc> and </doc>, check that it meets structure rules (has a description, examples, and a function line), and trim extra trailing text (an illustrative sketch of this check follows the next step). Why this step exists: The next stage needs clean, well-formed docs; messy docs make reconstruction unreliable. Example: If a candidate has the right parts but keeps rambling after the function line, keep only the valid part and mark it as having 'redundant content'.

  4. Stage 2: For each valid doc, generate exactly one reconstructed code sample. Why this step exists: It forms a full round trip for each doc; using one sample per doc balances compute and memory. Example: From the cleaned Ruby doc, output a Ruby function matching the described behavior.
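Because the filtering rules are described only as regex-based and format-driven, the following is an illustrative sketch (the <doc> tag names and the specific checks are assumptions) of how a candidate doc might be extracted, validated, and flagged as redundant:

```python
import re

def check_doc(raw_output: str):
    """Return (cleaned_doc, status); status is 'valid', 'redundant', or 'invalid'."""
    match = re.search(r"<doc>(.*?)</doc>", raw_output, flags=re.DOTALL)
    if match is None:
        return None, "invalid"                       # no well-formed <doc> block at all
    doc = match.group(1).strip()
    has_example = ">>>" in doc                       # at least one input/output example
    has_description = bool(doc.split(">>>")[0].strip())  # some prose before the example
    has_function_line = bool(re.search(r"^\s*def\s+\w+\(", doc, flags=re.MULTILINE))
    if not (has_description and has_example and has_function_line):
        return None, "invalid"                       # a required part is missing
    trailing = raw_output[match.end():].strip()
    status = "redundant" if trailing else "valid"    # rambling after the doc is penalized
    return doc, status
```

Only docs that come back 'valid' or 'redundant' move on to Stage 2, and the status later scales the Stage 1 reward.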

šŸž Hook: Like checking whether two different recipes produce the same cake taste.

🄬 The Concept (Code Reconstruction): Build code directly from the documentation. How it works:

  1. Read the doc’s description, examples, and function signature hints.

  2. Implement code that satisfies the described behavior.

  3. Follow the language conventions (imports, function defs, indentation). Why it matters: This tests if the doc was precise enough to guide accurate coding. šŸž Anchor: Use '>>> calculate_value(0) -> 15.81' and the step-by-step description to write the Ruby function that matches.

  5. Reward design: Compute a semantic code similarity score S(original, rebuilt). For Stage 2 (doc→code), the reward is S. For Stage 1 (code→doc), the reward is S times a 'format factor' (0 for an invalid doc, 0.5 for a doc with redundant content, 1 for a perfectly formatted doc); a minimal sketch of this loop appears after the final step below. Why this step exists: We need a single, meaningful number that tells us whether the round trip preserved behavior, and we also want to encourage tidy, usable docs. Example: If the rebuilt code behaves like the original, S might be 0.83. If the doc was clean, the Stage 1 reward is 0.83; if it had extra trailing content, the reward is 0.83 Ɨ 0.5 = 0.415.

  6. Reinforcement learning updates: Store these trajectories and rewards, normalize them, and update the model to increase the chance of choices that led to higher rewards (an on-policy method like Reinforce++ is used). Why this step exists: Instead of memorizing fixed answers, the model keeps practicing what actually works—writing docs that enable faithful reconstruction and writing code that follows docs. Example: Over time, the model learns to consistently include key steps and useful examples in docs, and to produce code that matches those steps.
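Putting the recipe together, here is a minimal sketch of one training iteration. The model interface (generate_doc, generate_code, policy_gradient_update), the similarity callable, and the reuse of the check_doc helper sketched above are all assumptions for illustration; the paper's actual update is an on-policy method (Reinforce++) with reward normalization.

```python
FORMAT_FACTOR = {"valid": 1.0, "redundant": 0.5, "invalid": 0.0}

def batcoder_iteration(code_batch, model, similarity, k_docs=8):
    """One round of back-translation RL: sample docs, rebuild code, reward, update."""
    trajectories = []
    for code in code_batch:
        for _ in range(k_docs):                      # K candidate docs per snippet
            raw = model.generate_doc(code)           # Stage 1 sample (assumed API)
            doc, status = check_doc(raw)             # format filter from the earlier sketch
            if doc is None:
                trajectories.append((code, raw, None, 0.0, 0.0))  # invalid doc: zero reward
                continue
            rebuilt = model.generate_code(doc)       # Stage 2: one reconstruction per doc
            s = similarity(code, rebuilt)            # semantic score in [0, 1]
            doc_reward = s * FORMAT_FACTOR[status]   # Stage 1 reward: meaning x tidiness
            code_reward = s                          # Stage 2 reward: meaning only
            trajectories.append((code, doc, rebuilt, doc_reward, code_reward))
    model.policy_gradient_update(trajectories)       # normalize rewards, reinforce good choices
    return trajectories
```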

The secret sauce:

  • The round-trip reward ties both directions together—improving one helps the other.
  • The format check stabilizes training by filtering unhelpful docs early.
  • The semantic similarity score rewards true meaning preservation, not just surface matches.
  • Sampling multiple docs per code snippet boosts the chance of finding clear, reconstructable explanations.

šŸž Hook: Like comparing two songs by melody rather than by the exact notes on paper.

🄬 The Concept (Semantic Code Similarity): Judge if two programs perform the same task, even with different implementations. How it works:

  1. Build graphs that represent program structure and data flow.
  2. Compare important pieces and connections.
  3. Output a score from 0 (unrelated) to 1 (near-identical meaning). Why it matters: Without it, the model could match variable names but still change behavior. šŸž Anchor: Two sum functions—one loops, one uses a built-in—both get a high similarity score because they compute the same result.

04 Experiments & Results

The test: Measure pass@1 on standard benchmarks that check if the model’s first code attempt solves each problem using unit tests. This matters because in real use, you usually want the model to get it right the first time.
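As a quick illustration of the metric (not the benchmarks' official harness), pass@1 with a single sample per problem is simply the fraction of problems whose first generated solution passes all unit tests:

```python
def pass_at_1(results):
    """results: list of booleans, True if the first attempt passed every unit test."""
    return 100.0 * sum(results) / len(results) if results else 0.0

# e.g. solving 137 of HumanEval's 164 problems on the first try gives ~83.5% pass@1
print(round(pass_at_1([True] * 137 + [False] * 27), 1))
```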

The competition: BatCoder is built on top of Qwen2.5 Instruct models (3B and 7B). It’s compared to other strong open-source code models like CodeT5+, CodeLlama, StarCoder2, WizardCoder, Magicoder, and DeepSeek-Coder-Instruct, and contrasted with closed-source leaders (GPT-4o, O1 Mini) as upper bounds.

The scoreboard with context:

  • HumanEval (Python): BatCoder 7B scores 83.5% pass@1. Think of it as getting an A when many good students are getting Aāˆ’ or B+.
  • HumanEval+ (tougher, more tests): BatCoder 7B reaches 76.8%, showing the gains are not just on easy cases.
  • MBPP (Python): BatCoder 7B hits 81.0%, beating its base model by a healthy margin.
  • MBPP+ (tougher): 69.3%—again, improving over the base. Importantly, BatCoder 7B outperforms some larger open-source models (e.g., a 33B baseline) on several benchmarks, showing that better learning signals can beat brute force size.

Surprising findings:

  • Low-resource languages: On MultiPL-E Ruby, Qwen2.5-3B scores 0.0% pass@1 (can’t solve any). BatCoder-3B jumps to 10.6%. That’s like going from scoring no baskets to suddenly making consistent shots.
  • On Ruby with 7B, BatCoder goes from 3.1% to 13.0%, a big absolute gain.
  • On Go, where baselines are decent, BatCoder still adds several points (e.g., 33.8% to 37.7% at 3B, 34.4% to 39.0% at 7B).

Why these results make sense:

  • The model is trained to produce docs that truly capture behavior and code that truly follows docs. This two-way skill helps even when problems are phrased differently than training data.
  • In languages with fewer paired examples available, self-supervised round-trip learning exploits the abundant unlabeled code, creating strong practice without needing human labels.

Training dynamics:

  • As training steps increase, the average rewards for both directions steadily rise, and pass@1 on Ruby climbs too. The reward curves and accuracy curves move upward together, showing the reward signal is aligned with real-world performance.

Ablations (what if we remove parts?):

  • No Stage 1 updates (don’t optimize doc generation): tiny gains only (e.g., 0.0% to 1.9% on Ruby), far below full BatCoder. Conclusion: directly improving docs matters a lot.
  • Supervised fine-tuning on synthetic pairs: helps, but still trails BatCoder. Conclusion: being judged and rewarded on round-trip similarity beats just copying synthetic pairs.

05 Discussion & Limitations

Limitations:

  • Rewards focus on code similarity and doc formatting. They don’t yet include execution tests, style checks, or security constraints, which could further improve real-world robustness.
  • One set of hyperparameters was reused for both 3B and 7B. More careful tuning per size might unlock even better results.
  • The approach depends on the quality of the semantic similarity metric. If the metric misses subtle behavior changes, it may give misleading rewards.
  • The doc filtering rules are regex-based and format-driven; unusual but valid styles might be unfairly penalized.

Required resources:

  • A base instruction-tuned code model (e.g., 3B–7B scale).
  • A large unlabeled code corpus.
  • GPU resources to run two-stage sampling and reinforcement learning (the paper used A100s; smaller setups may need to reduce K or batch sizes).

When not to use:

  • If exact execution correctness (passing exhaustive test suites) is mandatory and you can’t add execution-based rewards yet.
  • In environments with extremely tight compute where two-stage generation and RL are too costly.
  • When documentation must follow strict house styles beyond the current format rules.

Open questions:

  • How much do results improve by mixing in execution-based rewards (e.g., unit tests) with semantic similarity?
  • What’s the best balance between number of doc candidates (K) and training stability/cost?
  • Can we adapt the method for tasks like code completion, refactoring, or multi-file projects where context is large?
  • How robust is the method to noisy or obfuscated code in the training corpus?
  • Which alternative RL algorithms or curriculum schedules yield the smoothest and fastest gains?

06 Conclusion & Future Work

Three-sentence summary: BatCoder learns from raw code by doing a round trip—code to documentation and back to code—and rewarding itself when the rebuilt code matches the original in meaning. This ties documentation quality and code generation together, letting the model improve both directions without needing expensive code–doc pairs. The result is strong performance on standard benchmarks and big gains in low-resource languages.

Main achievement: Turning round-trip semantic similarity into a powerful self-supervised reward that jointly upgrades documentation generation and code synthesis.

Future directions:

  • Add richer rewards (execution correctness, style, safety) and test alternative RL algorithms for stability.
  • Scale up data and model sizes, and explore architectures suited for long-context, multi-file projects.
  • Extend the loop to related tasks like code completion, code translation, and program repair.

Why remember this: BatCoder shows that you don’t need mountains of labeled pairs to align code and docs—you can bootstrap from raw code by closing the loop. That simple, elegant idea opens the door to stronger models in every programming language, not just the most popular ones.

Practical Applications

  • Auto-generate accurate function docstrings that include examples developers can trust.
  • Turn issue descriptions or specs into starter code that matches the intended behavior.
  • Improve legacy code understanding by producing clear, reconstructable explanations.
  • Boost low-resource language support (e.g., Ruby, Go) without needing massive labeled datasets.
  • Pre-train internal code assistants using only your company’s unlabeled repositories.
  • Create self-checking documentation workflows: if docs can’t rebuild the code, flag them for review.
  • Assist code review by comparing reconstructed code from the PR description to the proposed changes.
  • Enhance educational tools that teach programming by linking exercises (docs) to working solutions (code).
  • Support refactoring guidance by generating intent-preserving docs before changing code.
  • Seed test generation: use reconstructed behavior to suggest input-output cases for unit tests.
#back-translation Ā· #self-supervised learning Ā· #reinforcement learning for code Ā· #semantic code similarity Ā· #code summarization Ā· #code generation Ā· #documentation synthesis Ā· #round-trip correctness Ā· #low-resource programming languages Ā· #HumanEval Ā· #MBPP Ā· #MultiPL-E Ā· #policy gradient Ā· #code LLM