
Scaling Laws for Code: Every Programming Language Matters

Intermediate
Jian Yang, Shawn Guo, Lin Jing et al. Ā· 12/15/2025
arXiv Ā· PDF

Key Summary

  • Different programming languages scale differently when training code AI models, so treating them all the same wastes compute and lowers performance.
  • Interpreted languages like Python gain more from bigger models and more data than compiled languages like Rust.
  • Mixing languages during training often helps: pairs that look alike (such as Java and C# or JavaScript and TypeScript) give strong synergy boosts.
  • A ā€œparallel pairingā€ strategy—training on code plus its translation side-by-side—greatly improves translation skills and even helps zero-shot transfer to unseen language pairs.
  • The paper measures a synergy gain matrix that shows which language pairs help each other most.
  • A new proportion-dependent multilingual scaling law tells you how to split training tokens across languages for the best overall results.
  • Optimized token allocation (more Python, balanced JavaScript–TypeScript, fewer fast-saturating languages like Rust) beats uniform allocation under the same budget.
  • Over 1000 experiments (up to 1 trillion tokens and models up to 14B parameters) back up these findings.
  • Guided allocation improves average multilingual code generation and translation without hurting any single language.
  • This work offers a practical recipe for building stronger, more compute-efficient multilingual code LLMs.

Why This Research Matters

This work turns multilingual code model training from guesswork into a guided plan that saves compute and money. Teams can target the languages that benefit most from more data and pair those that lift each other, improving average performance under fixed budgets. Better zero-shot translation means tools can help developers move code between stacks faster, modernizing systems more easily. Language-aware training also supports fairer performance across the languages real teams use, instead of overfitting to just one. The approach is practical and measurable, ready to plug into real training pipelines. As models scale and costs rise, these insights help everyone do more with less.

Detailed Explanation


01Background & Problem Definition

šŸž You know how in a school orchestra, each instrument has its own sound? If you give every instrument the same sheet music and volume, the music won’t sound right. Violins, drums, and flutes need different parts to make the song beautiful.

🄬 What it is: Before this paper, many people trained code AIs by mixing programming languages as if they were all the same ā€œinstrument,ā€ following general scaling rules (more model size + more data → better). How it works (before):

  1. Collect lots of mixed-language code.
  2. Scale up the model and data using general scaling laws.
  3. Hope performance rises smoothly for every language.

Why it matters: If languages aren’t actually alike, this one-size-fits-all plan can waste compute and leave performance on the table.

šŸž For example, giving Python and Rust the same training diet can be like giving a tuba and a piccolo the same notes—neither shines.

šŸž You know how some video games are ā€œeasy to pick upā€ while others take lots of practice? Programming languages are like that for AI models, too—some are predictable and strict; others are flexible and tricky.

🄬 What it is: The core problem is that code AIs (Code LLMs) learn differently from different programming languages (PLs). Some languages benefit more from extra model size and data; others reach their best sooner. How it works:

  1. Each language has its own structure, typing rules, and style.
  2. These properties change how quickly an AI can predict code correctly.
  3. If we ignore these differences, we misjudge how much model or data each language needs.

Why it matters: Without language-specific understanding, predictions about training cost and performance can be wrong, making large projects very expensive.

šŸž Think of Python (flexible) vs. Rust (strict). If you treat them the same, you’ll likely overfeed Rust and underfeed Python.

šŸž Imagine a sports practice where soccer, basketball, and swimming teams all do the same drills. Some athletes will improve, but many will train the wrong muscles.

🄬 What it is: Earlier attempts mostly used language-agnostic pre-training or uniform token mixing, assuming all languages help equally. How it works:

  1. Mix code from many languages evenly.
  2. Train one big model.
  3. Evaluate average scores and move on.

Why it matters: This hides which languages truly help which others, and it wastes compute on languages that already saturate early.

šŸž Result: A model might be okay at everything but not excellent where it counts, like failing a company’s main language even after huge spending.

šŸž You know how group projects go better when teammates share skills or can translate ideas quickly? The same is true for programming languages.

🄬 What it is: The gap this paper fills is a full, language-aware map of how code models scale—per language, across language pairs, and under different data organizations—plus a practical rule for how to split training tokens. How it works:

  1. Measure scaling per language (how performance improves with more model/data).
  2. Measure synergy between language pairs (who helps whom).
  3. Test data organization strategies (random mix vs. parallel pairing of code + translation).
  4. Use these to compute the best token split (the new multilingual scaling law).

Why it matters: Now we can plan training like engineers, not guessers—spending compute where it pays off most.

šŸž Example: Allocate extra tokens to Python (high gain), balance JavaScript–TypeScript (high synergy), and reduce Rust (saturates fast). The model gets better on average without sacrificing any language.

šŸž Think of building a city: roads (data), buses (models), and fuel (compute) must be balanced. If you know which roads will carry the most traffic, you can plan smarter.

🄬 What it is: The stakes are real—training large code models is expensive and time-consuming. How it works:

  1. Wrong assumptions → wrong budgets and weaker models.
  2. Right assumptions → better results with the same money and time.
  3. Multilingual settings match real engineering teams who use many PLs.

Why it matters: Better planning saves money, speeds up research, and helps developers everywhere.

šŸž Example: A company can train one model that handles Python, Java, and TypeScript better than three separate models—if it splits data the smart way.

02Core Idea

šŸž You know how gardeners plant sun-loving flowers in bright spots and shade-loving ones under trees? If you put them all in the same place, many won’t thrive.

🄬 What it is: The key insight—every programming language scales differently, and the best multilingual model comes from measuring those differences, using language synergies, and then allocating training tokens proportionally. How it works:

  1. Measure language-specific scaling (how each PL improves with more model/data).
  2. Measure cross-language synergies (which pairs help each other).
  3. Organize data cleverly (parallel pairing: code + its translation).
  4. Use a new proportion-dependent scaling law to split tokens across languages.

Why it matters: Without this, models are compute-wasteful; with it, they learn faster and perform better across languages.

šŸž Anchor: Give Python more data (it benefits a lot), pair Java with C# (they help each other), and don’t overfeed Rust (it saturates). The whole model’s multilingual score rises.

— Multiple analogies —

  1. Cooking: Different ingredients need different cook times. You don’t boil lettuce for an hour. Language-aware token allocation is like timing each ingredient right.
  2. Sports: A basketball team practices shooting; swimmers practice strokes. Mixed but targeted training beats making everyone run laps.
  3. School: Pair a Spanish learner with a Portuguese speaker for vocabulary boosts—similar languages transfer skills better (like Java ↔ C#).

— Before vs. After —

  • Before: Uniform token mixing; assume all PLs scale similarly; limited view of cross-lingual transfer; weak translation on unseen pairs.
  • After: Language-specific scaling curves; a synergy gain matrix; parallel pairing for stronger alignment; a formula to allocate tokens optimally; better zero-shot translation.

— Why it works (intuition) —

  • Each PL has its own predictability and complexity. Strict typing and rigid syntax make code easier to predict; dynamic styles are harder and need more data.
  • Similar languages share patterns (like cousins), so learning one boosts the other.
  • Parallel pairing gives the model a clean ā€œbridgeā€ between languages, teaching it to line up ideas across syntax.
  • The new scaling law encodes all this: it says how much each language’s data counts, plus extra credit from synergy.

— Building blocks —

  • Language-specific scaling: Shows that interpreted languages (e.g., Python) keep improving with more data/model; some compiled languages (e.g., Rust) saturate earlier.
  • Synergy gain matrix: Measures who helps whom (e.g., Java + C# big boost; JS + TS big boost).
  • Parallel pairing: Concatenate a snippet and its translation; this sharpens cross-lingual alignment and helps zero-shot.
  • Proportion-dependent multilingual scaling law: A rule that uses all the above to calculate the best token split.
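
For readers who want the shape of the math, here is a minimal schematic of the kind of law being fitted. The symbols and the exact functional form below are illustrative assumptions (a standard power-law loss per language plus a proportion- and synergy-dependent effective-data term), not the paper's verbatim equation.

```latex
% Schematic sketch only (assumed form, not the paper's exact equation):
% per-language loss with an irreducible term, plus a proportion-dependent
% effective-data term that gets "extra credit" from synergistic languages.
\begin{aligned}
  L_i\bigl(N, D_i^{\mathrm{eff}}\bigr) &\approx E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{\bigl(D_i^{\mathrm{eff}}\bigr)^{\beta_i}}, \\
  D_i^{\mathrm{eff}} &= p_i\, D \Bigl(1 + \textstyle\sum_{j \neq i} s_{ij}\, p_j\Bigr),
  \qquad \textstyle\sum_i p_i = 1
\end{aligned}
```

Here E_i is language i's irreducible loss, N the model size, D the total token budget, p_i the token proportion given to language i, and s_ij the synergy gain language j contributes to target i.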

03Methodology

High-level recipe: Input (multilingual code + translations) → Measure per-language scaling and pairwise synergies → Choose data organization (parallel pairing vs. random mix) → Fit a multilingual scaling law that depends on language proportions → Allocate tokens optimally → Train and evaluate.

Step 1: Build the training and test data

šŸž Imagine sorting a giant library: you shelve books by language and also keep bilingual editions together.

🄬 What it is: A large multilingual code corpus plus a carefully curated translation test set. How it works:

  1. Gather about 900B code tokens across Python, Java, JavaScript, TypeScript, C#, Go, Rust, plus 100B natural language tokens.
  2. Build parallel code: Python ↔ each other language (but no direct JS ↔ TS, etc., in training).
  3. Create a held-out test of 50 programs translated into all 7 PLs, covering 42 directions.

Why it matters: Clean, parallel code lets us test alignment strategies; diverse monolingual code tests generalization.

šŸž Anchor: 1T total tokens (900B code + 100B text), with Python as a pivot for translations.

Step 2: Train many models to learn each language’s scaling curve

šŸž Think of testing how different seeds sprout by changing sunlight and water.

🄬 What it is: Train 420 single-language models (spread across the seven PLs) to map how performance improves with model size and data size. How it works:

  1. Fix the model architecture (LLaMA-style with modern components) for fairness.
  2. Vary model sizes from ~0.1B to ~3.1B and token budgets from 2B to 64B.
  3. Measure validation loss for each language and fit language-specific scaling trends.

Why it matters: This shows which languages need more data vs. more parameters, and reveals irreducible difficulty.

šŸž Anchor: Results show an intrinsic complexity order: C# < Java ā‰ˆ Rust < Go < TypeScript < JavaScript < Python.

Step 3: Measure cross-language synergy with bilingual mixtures

šŸž You know how study buddies who share similar skills can boost each other’s grades?

🄬 What it is: Train on 128B tokens split either as (L_i + L_i) or (L_i + L_j), then evaluate only on L_i. How it works:

  1. Keep the total budget fixed.
  2. Swap the second half from same-language to a different language.
  3. Compute synergy gain: how much better (or worse) L_i gets with L_j than with itself.

Why it matters: It reveals productive pairs (e.g., Java + C#) and warns about harmful mixtures.

šŸž Anchor: Java + C# shows a remarkable ~20% improvement over Java training twice on itself; JS + TS also strong. Python helps others but often loses a bit when mixed as the target.

Step 4: Choose a data organization strategy for cross-lingual learning

šŸž Picture two ways to learn a new language: reading random sentences versus reading a bilingual book with aligned paragraphs.

🄬 What it is: Compare random shuffling vs. parallel pairing (concatenate code with its translation). How it works:

  1. Random shuffling: mix all monolingual code—no explicit alignment.
  2. Parallel pairing: glue matching snippets (source + translation) to give a clear alignment signal.
  3. Train models of multiple sizes and compare translation loss (seen pairs) and zero-shot (unseen pairs).

Why it matters: Parallel pairing teaches the model to ā€œline upā€ ideas across languages, improving both seen and unseen directions.

šŸž Anchor: With parallel pairing, zero-shot (e.g., Java → Go) improves notably, as if the model composes via Python as a bridge.

Step 5: Fit a proportion-dependent multilingual scaling law

šŸž Think of a budgeting app that learns which expenses give you the best value and then suggests where to spend more or less.

🄬 What it is: A formula that uses (a) each language’s scaling behavior and (b) pairwise synergies to decide how to split training tokens. How it works:

  1. Combine per-language ā€œhow fast do I improve with more data/model?ā€ measurements.
  2. Add a synergy term that boosts effective data when helpful pairs are present.
  3. Solve for token proportions that maximize expected performance under a fixed budget.

Why it matters: This turns guesswork into a principled plan for multilingual pre-training.

šŸž Anchor: The optimized plan gives more tokens to Python, balances JavaScript–TypeScript and Java–C#, and trims languages that saturate early (e.g., Rust), improving average performance.

Step 6: Evaluate on code generation and translation

šŸž Like testing a new study schedule with both quizzes (translation) and projects (generation).

🄬 What it is: Compare uniform vs. optimized token splits on standard benchmarks (e.g., MultiPL-E for Pass@1, BLEU for translation). How it works:

  1. Train same-size models with the same total tokens but different language proportions.
  2. Measure Pass@1 across languages and BLEU on all directions.
  3. Check if any language is harmed and whether the average improves.

Why it matters: Proves that smarter allocation boosts the whole system without sacrificing parts.

šŸž Anchor: The optimized split raises average Pass@1 and BLEU; no language suffers significant drops. Secret Sauce: combining language-specific scaling, synergy mapping, and parallel pairing to guide token allocation.

04Experiments & Results

šŸž Imagine running over 1000 lab trials to learn how each plant (language) grows with water (data) and pot size (model). Then you chart which plants help each other when grown side-by-side.

🄬 What it is: A large-scale experimental campaign measuring (1) per-language scaling, (2) bilingual synergy, (3) cross-lingual organization strategies, and (4) token allocation outcomes. How it works:

  1. Per-language scaling: 420 runs across 7 PLs, varying model sizes and tokens, to fit language-specific curves and compare intrinsic difficulty.
  2. Bilingual mixtures: fixed total tokens, compare (L_i + L_i) vs. (L_i + L_j), measure synergy gain.
  3. Cross-lingual strategies: random shuffling vs. parallel pairing; test seen and unseen translation directions.
  4. Allocation test: uniform vs. optimized token splits under the same budget.

Why it matters: Each test answers a piece of the puzzle and, together, they form a practical training recipe.

šŸž Anchor: Think ā€œleague standingsā€ā€”not just scores, but who beats whom and why.

Key findings with context:

• Language-specific scaling

  • Interpreted languages (e.g., Python) benefit more from scaling, needing larger datasets to shine; compiled, strict languages (e.g., Rust) saturate earlier.
  • Intrinsic difficulty order (easier → harder): C# < Java ā‰ˆ Rust < Go < TypeScript < JavaScript < Python. This is like some school subjects being naturally trickier to predict.

• Bilingual synergies

  • Most languages gain from mixing; 6/7 see consistent positives.
  • Java + C#: a standout pair—about a 20% improvement compared to self-repetition. JavaScript + TypeScript also strong in both directions.
  • Python as an auxiliary often helps others, but when Python is the target, mixing can slightly hurt it (except modest gains with Java). This asymmetry is like a great tutor who helps many classmates but prefers solo study for their own test.

• Cross-lingual strategies

  • Parallel pairing beats random shuffling on seen and unseen translation tasks.
  • Zero-shot improvements (e.g., Java → Go) suggest models learn to compose translations using Python as a bridge (Java → Python → Go), like translating via a common language.
  • Scaling exponents for pairing are high, meaning bigger models can exploit the alignment signal extremely well—like giving them clearer maps.

• Optimized allocation vs. uniform

  • With the same total tokens and model size, optimized allocation improves average code generation (Pass@1) and translation (BLEU).
  • No language suffers major drops, showing that smart rebalancing can lift the average without pushing any one language down.

Scoreboard-style context:

  • Think of uniform mixing as a class average of B-. The optimized split is like nudging the class to an A- by giving extra practice to students who learn fastest from it (Python) and pairing look-alike learners (JS–TS), while not over-studying students who already top out early (Rust).

Surprises and notable patterns:

  • Asymmetric transfer: Python helps others more than others help Python.
  • Similar-language pairs shine (Java–C#, JS–TS), confirming that structural similarity aids transfer.
  • Even without explicit pairings, models show some zero-shot translation skill; with pairing, this jumps noticeably.

Bottom line: The data shows a clear win for language-aware planning plus parallel pairing and proportionate token allocation.

05Discussion & Limitations

šŸž Think of planning a road trip with a great map, but some roads aren’t drawn yet and weather could change. You still travel smarter, but you know the limits.

🄬 What it is: An honest look at what this approach can and can’t do, what it needs, and what’s still unknown. How it works:

  1. Limitations
  • Coverage: Only seven PLs were studied; very low-resource or niche languages (e.g., SQL, assembly) might behave differently.
  • Scale: Largest models (~14B) and 1T tokens—findings should be validated at 100B+ parameter scales.
  • Benchmarks: Focused on translation and single-file generation; multi-file, refactoring, or program repair may reveal new dynamics.
  • Data-dependence: Synergy numbers come from this corpus; other corpora may need recalibration.
  • Fixed budgets: Dynamic curricula or adaptive sampling could push performance further.
  2. Required resources
  • Large, clean multilingual code datasets; optional parallel code pairs.
  • Significant compute for running many controlled experiments (though the final recipe saves compute in production).
  3. When not to use
  • If you only care about a single language, monolingual scaling may be sufficient.
  • If your main target is extremely niche and lacks related languages, synergy gains may be limited.
  4. Open questions
  • How do findings extend to more languages, especially low-resource ones?
  • What is the best dynamic curriculum that changes proportions over time?
  • How does multi-file or repository-level context change the synergy map?
  • Can we automatically learn the synergy matrix on-the-fly during training?

Why it matters: Knowing boundaries helps teams apply the method wisely and target future research where it counts most.

šŸž Anchor: If your product is heavy in TypeScript and JavaScript, this paper already gives you a strong plan. If you rely on COBOL or Verilog, you’ll likely need to measure new synergies first.

06Conclusion & Future Work

Three-sentence summary: This paper shows that every programming language scales differently and that multilingual code models work best when we measure those differences, organize data with parallel pairing, and then split tokens using a proportion-dependent scaling law. The approach raises average performance across languages, strengthens zero-shot translation, and avoids harming any single language—all under the same compute budget. In short, language-aware planning turns expensive guesswork into efficient engineering.

Main achievement: A practical, tested blueprint—language-specific scaling + synergy mapping + parallel pairing—combined into a proportion-dependent multilingual scaling law that tells you exactly how to allocate training tokens.

Future directions: Extend to more (and rarer) languages, push to larger model scales, add repository-level and multi-file tasks, and explore adaptive sampling that updates proportions during training. Also, refine automatic discovery of synergy structures and investigate other alignment signals beyond simple concatenation.

Why remember this: Because it replaces ā€œmix everything and hopeā€ with a measured, math-guided strategy that saves compute and lifts real-world multilingual coding performance—exactly what teams need to build the next generation of code AIs.

Practical Applications

  • Design multilingual code corpora that prioritize high-gain languages (e.g., Python) and high-synergy pairs (e.g., Java–C#, JS–TS).
  • Adopt parallel pairing (code + translation) to boost cross-language translation and zero-shot abilities.
  • Use the synergy gain matrix to avoid harmful mixtures when your target language is sensitive (e.g., Python as target).
  • Plan compute budgets with language-specific scaling in mind, not one-size-fits-all assumptions.
  • Improve migration tools that translate legacy code (e.g., Java → C#) by training with aligned pairs.
  • Build better cross-language code search by aligning semantically equivalent snippets across PLs.
  • Tune token proportions for company-specific stacks (e.g., more TypeScript if that’s your frontend).
  • Increase ROI of training runs by reducing tokens for fast-saturating languages like Rust.
  • Enhance code assistants in multi-repo, multi-language environments by leveraging pivot-language alignment.
  • Forecast performance at larger scales using fitted scaling curves to decide whether to add data or parameters.
#multilingual code pre-training Ā· #scaling laws Ā· #language-specific scaling Ā· #synergy gain matrix Ā· #parallel pairing Ā· #cross-lingual transfer Ā· #token allocation Ā· #irreducible loss Ā· #interpreted vs compiled languages Ā· #zero-shot translation Ā· #code LLM Ā· #MultiPL-E Ā· #Pass@1 Ā· #BLEU Ā· #compute optimization