Self-Improving Multilingual Long Reasoning via Translation-Reasoning Integrated Training
Key Summary
- TRIT is a new training method that teaches AI to translate and think at the same time so it can solve hard problems in many languages without extra helper models.
- The method first checks which English questions the AI can answer in the target language, then uses only those reliable questions to train translation and target-language reasoning together.
- Translation gets a reward only if the translated question can be solved correctly in the target language, so reasoning accuracy becomes a smart proxy for translation quality.
- A special reward design also enforces clean formatting, correct language use, and no annoying repetition, which keeps answers readable.
- Across three backbone models, TRIT beats strong baselines on the MMATH benchmark by about 7 percentage points on average and reaches near-100% language consistency.
- Translation quality improves not only for math but also for general text, with up to +8.4 COMET on FLORES-200, showing strong out-of-domain generalization.
- TRIT raises cross-lingual question alignment (how similarly English and non-English versions are understood inside the model) by over 10 percentage points in later layers.
- Even when models are allowed to think in any language, TRIT still improves accuracy, proving it strengthens true question understanding, not just language control.
- An ablation study shows all three ingredients (cross-lingual reasoning, self-translation, and target-language reasoning) are necessary for the full gains.
- TRIT is self-improving, needs only English questions, and avoids external feedback models, making it practical and scalable to more languages.
Why This Research Matters
Many people learn and work in languages other than English, so AI must understand and reason well across languages to be truly helpful. TRIT improves both translation and reasoning together, making answers accurate, clear, and in the right language. This helps students study math in their native tongues, supports customer service in global markets, and aids public services where misunderstandings can be costly. Because TRIT needs only English questions and no external judge, it is cheaper and easier to scale to new languages. Its translation gains even carry over to general text, not just math, so it broadens real-world usefulness. Better cross-lingual alignment inside the model means the same problem is solved consistently no matter the language, improving trust and reliability.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're solving a tricky math puzzle with a friend who speaks another language. If you both understand the puzzle in the same way, you can solve it together. But if one of you misunderstands a key word, you might head in the wrong direction.
The Concept (Natural Language Processing, or NLP): NLP is how computers read, write, and understand human language. How it works: 1) Turn words into numbers the computer can handle. 2) Learn patterns from huge text collections. 3) Generate or analyze text based on learned patterns. Why it matters: Without NLP, computers can't really understand our questions or give helpful answers. Anchor: When you ask a voice assistant for the weather, NLP lets it understand your words and reply clearly.
Hook: You know how teachers give gold stars when you get answers right? Rewards make you try harder in the right direction.
The Concept (Reinforcement Learning): RL teaches a model by rewarding good behavior. How it works: 1) Try something. 2) Get a score (reward). 3) Do more of what earned higher scores. Why it matters: Without rewards, the model doesn't know which behaviors to prefer. Anchor: A robot dog learns to sit because it gets treats when it sits correctly.
Hook: Think about reading the same puzzle in English and in Japanese. If you can see they mean the same thing, you'll solve it just fine in either language.
The Concept (Multilingual Processing): This is the model's ability to understand and produce many languages. How it works: 1) Share a common "idea space" across languages. 2) Map different words in different languages to the same ideas. 3) Generate answers in the requested language. Why it matters: Without it, the model might only do well in one language. Anchor: A multilingual AI answering "What's 7×8?" in French: "56," and in Korean: "56," same math, different words.
Hook: If a recipe says "bake" but you translate it as "fry," your cake won't bake itself!
The Concept (Translation Quality): It measures how faithfully meaning moves from one language to another. How it works: 1) Capture exact facts (numbers, units, conditions). 2) Keep important math symbols the same. 3) Choose natural words without changing meaning. Why it matters: If translation warps the problem, the solution can go wrong from the start. Anchor: Translating "parallelogram" as "quadrilateral" changes the geometry and can cause a wrong answer.
Hook: A detective doesn't just read clues; he reasons carefully over many steps to catch the culprit.
The Concept (Long Reasoning Models, LRMs): These are AI models trained to think through multi-step problems. How it works: 1) Break problems into steps ("think-then-answer"). 2) Check progress with verifiable rewards (correct final answers). 3) Learn to avoid distractions and stick to logic. Why it matters: Without long reasoning, the AI may jump to a guess and miss the solution. Anchor: Solving a multi-step algebra puzzle by writing out each line before boxing the answer.
The world before: LRMs had become impressive at deep thinking, but mostly in English. When asked a question in another language, many models switched to thinking in English midway or lost accuracy if forced to think in the question's language. That shows two gaps: understanding questions across languages and reasoning in those languages.
The problem: If models misunderstand the question in non-English languages, even perfect reasoning won't help, because the reasoning starts from the wrong place. Past fixes mostly tried to align the thinking steps with English using external judges, but if you misread the question, aligning the steps won't fix the root cause. Plus, external judges are expensive.
Failed attempts: 1) Supervised fine-tuning on translated chains-of-thought: helps a bit but doesn't ensure the model truly understands the question meaning in each language. 2) Preference optimization or RL with external evaluators: costly, and it addresses reasoning style more than question understanding. 3) Two-stage pipelines that translate first and then reason in English: these still don't teach the model to reason natively in the target language.
The gap: We need a way to improve both multilingual question understanding and target-language reasoning together, without an expensive external teacher.
Real stakes: Better multilingual reasoning matters for students reading math in their native language, for customer support across countries, for health instructions, and for public services where mistakes cost time and trust.
02 Core Idea
Hook: Imagine learning a new sport and a new language at the same time. If your coach explains each move in the new language and also checks that you can play correctly, you improve both the sport and the language together.
The Concept (Translation-Reasoning Integrated Training, TRIT): TRIT trains a model to translate questions and reason about them in the target language in one loop so each skill boosts the other. How it works: 1) First, make sure the model can answer English questions in the target language (cross-lingual reasoning). 2) Keep only the questions it solves reliably. 3) Then train the model to translate those English questions and solve the translated versions in the target language. 4) Reward translation only if the translated question can be solved correctly, so reasoning accuracy becomes the translation coach. Why it matters: Without linking translation to reasoning success, translations can look nice but still twist meaning. Anchor: If the model translates "give away 7 apples" as "receive 7 apples," it will fail the reasoning, so translation gets no reward.
Multiple analogies:
- Coach-and-referee: Translation is the coach explaining the problem in the new language; reasoning is the player performing. The referee (answer correctness) decides if the coach explained well and the player performed right. Bad coaching gets bad scores.
- Mirror-building: If English and Japanese versions are mirrors of each other, the reflection (representation) should match. TRIT polishes the mirrors by translating and checking if the reflection still solves the puzzle.
- Quality filter + feedback loop: A coffee shop tests beans (filtering), then adjusts roasting based on taste tests (feedback) to improve both bean selection and roasting. TRIT filters solvable questions, then uses solution success to adjust translation and reasoning.
Before vs After:
- Before: Models often reasoned in English behind the scenes; forcing them to stay in the question language hurt accuracy. External judges were used, which was costly and didn't fix misunderstanding.
- After: TRIT ties translation quality to reasoning success. The model learns to truly understand the question in each language and to reason natively, with almost perfect language consistency and higher accuracy.
Why it works (intuition):
- Correctness is a strong, verifiable signal. If the final boxed answer is right and the reasoning is clean and in the right language, the model gets a solid reward. This avoids fuzzy style scores.
- Deferred translation reward: If the translated question can be solved, the translation preserved meaning. If not, translation likely lost key info. This turns reasoning into a natural translation judge (a minimal sketch of this idea appears right after this list).
- Alignment inside the model: Making the model produce translations itself forces English and non-English versions to map to similar internal representations. That cross-lingual alignment makes solving the same problem in any language feel the same to the model.
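To make the deferred translation reward concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the paper's implementation: `translate`, `solve`, and `is_valid` are hypothetical stand-ins for the model's own generation calls and the format/language gate, and the real TRIT reward also covers formatting, language consistency, and repetition.

```python
from typing import Callable

def deferred_translation_reward(
    translate: Callable[[str, str], str],   # (English question, target language) -> translated question
    solve: Callable[[str, str], str],       # (question, language) -> final answer string
    is_valid: Callable[[str, str], bool],   # format/language gate on the translation text
    question_en: str,
    target_lang: str,
    gold_answer: str,
    n_samples: int = 4,
) -> float:
    """Reward a self-translation only if the translated question can still be solved."""
    translation = translate(question_en, target_lang)
    if not is_valid(translation, target_lang):
        return 0.0  # broken tags or wrong language: rejected before any reasoning happens

    # Deferred reward: sample several reasoning paths on the translated question.
    # If any path reaches the correct answer, the translation is judged faithful.
    for _ in range(n_samples):
        if solve(translation, target_lang).strip() == gold_answer.strip():
            return 1.0
    return 0.0  # no path solved it, so the translation likely lost key meaning
```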
Building blocks (with sandwich mini-explanations):
- Hook: If your robot only gets points when its final LEGO build matches the picture, it learns what steps really matter. The Concept (Reinforcement Learning from Verifiable Rewards, RLVR): Train with rewards based on answers we can check automatically. How: 1) Generate an answer. 2) Check correctness + language + format + no repetition. 3) Reward only when all are good. Why it matters: Prevents models from getting high scores for messy or off-language reasoning. Anchor: A math problem with a clear numeric answer lets the model be graded automatically.
- Hook: In a class competition, you compare your solution with your table group to see who did better. The Concept (Group Relative Policy Optimization, GRPO): The model generates a small group of answers and learns by comparing rewards inside the group. How: 1) Sample several solutions. 2) Score each. 3) Push the model toward higher-scoring ones relative to the group. Why it matters: It stabilizes training without needing a separate value network. Anchor: Picking the best of five drafts and learning what made it better.
- Hook: If you first prove you can ride a bike on a gentle path, your coach trusts your feedback when you try new routes. The Concept (Cross-Lingual Reasoning): Answer English questions directly in the target language to prove you can handle that language. How: 1) Start with a small warm-up set. 2) Use rewards to ensure correct, consistent, non-repetitive answers. 3) Keep questions you can solve reliably. Why it matters: It prevents blaming translation when the real issue is that the model can't solve the problem in the target language yet. Anchor: If you can solve 2-digit multiplication in Spanish, we can trust later feedback about Spanish translations.
- Hook: If a map is copied carefully, you can still find treasure using the copy. The Concept (Self-Translation Training): The model translates the question itself, then tries to solve it. How: 1) Produce translation with strict tags and language checks. 2) Solve the translated question. 3) Reward translation only if solving succeeds. Why it matters: Forces translations to preserve exact meaning: numbers, conditions, terms. Anchor: If a single word like "give away" is mistranslated, the answer turns wrong and translation gets no reward.
03 Methodology
At a high level: Input (English question) → Phase 1: Cross-Lingual Reasoning and Filtering → Phase 2: Translation + Target-Language Reasoning with deferred rewards → Output (correct, language-consistent reasoning and answer).
Step-by-step (like a recipe):
1. Define a careful reward. What happens: Each response is checked for (a) answer correctness, (b) language consistency, (c) no degenerate repetition, and (d) proper think/answer format. Why it exists: Without these checks, models might be correct but unreadable, or correct but in the wrong language, or well-formed but wrong; none of these are acceptable. Example: A Japanese solution that repeats phrases endlessly fails the repetition check, so it won't get the big reward even if the final number is right. (A minimal sketch of such a composite check appears right after this list.)
2. Phase 1: Cross-Lingual Reasoning and Filtering. What happens: The model answers English questions directly in the target language. We compute an average reward per question by sampling several responses. Only questions with average reward above a threshold (like 1/3) move forward. Why it exists: This ensures later translation training won't be penalized unfairly because the model cannot solve the problem in the target language yet. It reduces false negatives (good translations blamed for reasoning failures). Example: If the model can solve "20 apples, give away 7" in Japanese with correct steps and answer, that question is safe to use later.
3. Phase 2: Translation + Target-Language Reasoning (deferred reward). 3a) Translation attempt. What happens: For each filtered English question, the model writes a translation inside <Translation> ... </Translation> tags. We first reject outputs that break format/language rules. Why it exists: Ensures clean data and prevents noisy training. Example: A translation into Portuguese that includes English words or breaks tags is invalid. 3b) Solve the translated question. What happens: For valid translations, the model solves the translated question in the target language, sampling several reasoning paths. If any path gets the right answer, we mark the translation as good and give it a reward; otherwise the translation gets no reward. Why it exists: This cleverly uses reasoning success as a proxy to judge whether the translation preserved key meaning. Example: If "receive 7 apples" accidentally replaced "give away 7 apples," solving fails and translation gets no reward.
4. Collect three kinds of training data. What happens: We gather (a) cross-lingual reasoning data from Phase 1, (b) translation pairs (English question, self-translation) from Phase 2, and (c) target-language reasoning data, but only from correctly translated questions. Why it exists: Each piece teaches a different part: (a) proves ability to reason in the target language, (b) improves faithful translation, (c) ensures the model can handle questions originally written in the target language. Example: The model learns both to translate "parallelogram" faithfully and to use that property when solving geometry.
5. Optimize with GRPO, group-wise. What happens: For each task, the model generates small groups of candidates (answers or translations), computes rewards, and adjusts itself to favor higher-scoring ones compared to the group average. Why it exists: Group-relative learning stabilizes training and removes the need for a separate value model, making RL simpler and more efficient. Example: Among five Japanese solutions, the clearest and correct one pulls the model's policy in that direction. (A small sketch of the group-relative advantage follows the reward sketch below.)
6. Iterate for self-improvement. What happens: As the model improves, more questions pass the Phase 1 filter, providing richer training data; better translations lead to better target-language reasoning, which in turn gives clearer feedback to translation. Why it exists: Creates a positive loop where each skill lifts the other without extra external teachers. Example: After one iteration, the model can handle trickier algebra in Korean; after another, it handles multi-step geometry too.
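As promised in step 1, here is a minimal sketch of a composite reward check. It is an illustrative assumption, not the paper's exact reward: the `<think>` tag format, the `\boxed{}` answer convention, the repetition threshold, and the injected `detect_language` helper are all placeholders chosen for the sketch.

```python
import re
from typing import Callable

def composite_reward(
    response: str,
    gold_answer: str,
    target_lang: str,
    detect_language: Callable[[str], str],  # placeholder language identifier (e.g. a fastText LID model)
) -> float:
    """All-or-nothing reward: right answer AND right language AND clean format AND no repetition."""
    # (d) format gate: require a think/answer structure (tag names are an assumption).
    m = re.search(r"<think>(.*?)</think>\s*(.*)", response, flags=re.S)
    if m is None:
        return 0.0
    thought, answer_part = m.group(1), m.group(2)

    # (a) correctness: compare the last boxed answer against the reference.
    boxed = re.findall(r"\\boxed\{([^}]*)\}", answer_part)
    if not boxed or boxed[-1].strip() != gold_answer.strip():
        return 0.0

    # (b) language consistency: the chain-of-thought must stay in the requested language.
    if detect_language(thought) != target_lang:
        return 0.0

    # (c) repetition gate: reject degenerate outputs that repeat the same line (threshold illustrative).
    lines = [ln.strip() for ln in thought.splitlines() if ln.strip()]
    if lines and max(lines.count(ln) for ln in set(lines)) > 5:
        return 0.0

    return 1.0
```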
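And for step 5, a short sketch of the group-relative advantage that GRPO builds on: each sampled response is scored, and its advantage is its reward minus the group mean, scaled by the group's standard deviation, so no separate value network is needed. The reward values below are made up for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: each sample is scored relative to its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: five sampled solutions for one question, three rewarded and two not.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0, 0.0]))
# Correct samples get positive advantages (pushed up); incorrect ones get negative.
```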
Secret sauce:
- Deferred translation reward from reasoning: Instead of guessing translation quality directly, we ask: did the translated version lead to a correct solution? This focuses the model on preserving key semantics (numbers, conditions, math terms).
- Strict quality gates (language, format, no repetition): These prevent bad habits (like switching to English mid-thought or repeating lines) from sneaking into the model during RL.
- Cross-lingual filtering: By only training translation on questions that are solvable in the target language, we avoid confusing signals that would punish good translations just because target-language reasoning is still too weak (a minimal filtering sketch follows this list).
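A minimal sketch of that cross-lingual filtering step, under assumptions: `sample_responses` is a hypothetical helper that draws target-language reasoning attempts from the current policy, `reward_fn` is a scorer like the composite reward sketched above, and the 1/3 threshold mirrors the setting discussed in the paper.

```python
from typing import Callable, Iterable

def filter_solvable_questions(
    questions: Iterable[str],
    sample_responses: Callable[[str, int], list[str]],  # (question, n) -> n target-language attempts
    reward_fn: Callable[[str, str], float],             # (response, question) -> reward in [0, 1]
    n_samples: int = 8,
    threshold: float = 1 / 3,
) -> list[str]:
    """Keep only the English questions the model already solves reliably in the target language."""
    kept = []
    for q in questions:
        responses = sample_responses(q, n_samples)
        avg_reward = sum(reward_fn(r, q) for r in responses) / n_samples
        if avg_reward >= threshold:  # reliable enough that later translation feedback can be trusted
            kept.append(q)
    return kept
```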
Mini sandwich notes for two supporting concepts:
- Hook: If you only reward the final tidy science report, students learn both good science and neat writing. The Concept (Language Consistency Reward): A check that the reasoning stays in the requested language. How: Identify the language of the chain-of-thought. Why it matters: Mixed-language reasoning can be hard to read and train on. Anchor: A French question expects French reasoning; otherwise no big reward.
- Hook: Imagine a classmate copying the same sentence 30 times; that's not real learning. The Concept (Repetition Penalty): A check that punishes degenerate repetition. How: Detect repeated n-grams and repeated lines beyond thresholds. Why it matters: Prevents unreadable outputs and keeps training signals meaningful. Anchor: Walls of text like "ですですです…" get penalized even if the final number is correct.
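A small sketch of how such an n-gram repetition check could look; the character n-gram length and the repeat threshold are illustrative assumptions, not the paper's exact values.

```python
from collections import Counter

def has_degenerate_repetition(text: str, n: int = 10, max_repeats: int = 5) -> bool:
    """Flag degenerate repetition: some character n-gram occurs more than max_repeats times.

    Character n-grams (rather than word n-grams) also work for languages written
    without spaces, such as Japanese; n and max_repeats are illustrative choices.
    """
    if len(text) < n:
        return False
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return max(counts.values()) > max_repeats

# A wall of repeated filler trips the check even if the final number would be right.
print(has_degenerate_repetition("です" * 30))                      # True
print(has_degenerate_repetition("20 - 7 = 13 個のりんごが残ります。"))  # False
```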
04 Experiments & Results
The test: The authors evaluated multilingual mathematical reasoning on MMATH, which mixes problems from AIME24, AIME25, CNMO, and MATH500 across several languages (French, Portuguese, Japanese, Korean, Thai, plus English as out-of-domain). They measured: (1) Language Consistency (reasoning language matches the question), (2) Accuracy (correct final answer), and (3) LC&Acc (both at once), their main score. They also measured translation quality on FLORES-200 with COMET to see if gains generalize beyond math.
The competition: TRIT was compared to Prompt Control (just tells the model which language to use), SFT (supervised fine-tuning on target-language Q&A), Naive RL (accuracy only), SLC-RL (adds a soft language reward), M-Thinker (uses an external model to align multilingual thinking to English), and External-Translation (uses a strong external translator but doesn't train translation inside the model).
The scoreboard with context:
- Across three backbones (a small DeepSeek 1.5B, Qwen3-1.7B, and Qwen3-4B), TRIT improves LC&Acc by about 7 percentage points over SLC-RL on average. Think of it as moving from a solid B to a clear A- when others hover around B- to B.
- Language consistency is near-perfect under TRIT, meaning the model almost always reasons in the right language, like following the classroom language rule 99% of the time.
- TRIT beats M-Thinker by around 5 points on Qwen3 models. Why surprising? M-Thinker uses an external evaluator but seems to hit a ceiling (reward saturation) when the base model is already good; TRIT keeps improving by focusing on question-level alignment through translation-and-reasoning.
- External-Translation helps, but not as much as TRIT. Why? External translations don't teach the model itself to align meaning across languages. TRIT's self-translation forces internal alignment, which boosts reasoning too.
Out-of-domain English gains: Even in English-only testing, TRIT lifts accuracy (e.g., Qwen3-1.7B goes from 41.7% to 53.3%). That's like practicing soccer drills in multiple languages and still getting better at the game itself. It shows improved core question understanding, not just language control.
Translation improves, and it travels:
- In-domain (math): Human-judged or model-judged preferences pick TRIT's translations more often than the base model's, especially for smaller/weaker models (e.g., 3.3 wins for every 1 loss on Qwen3-1.7B).
- Out-of-domain (general text): On FLORES-200, COMET jumps up to +8.4 for the smallest backbone and still rises for stronger models. That means the better translation skill is not stuck in math; it generalizes to regular text.
Alignment inside the model improves: Using MEXA-style analysis, TRIT increases the similarity between English and target-language question representations, especially in later layers (e.g., final-layer similarity from ~62.7% to ~78.6% on the 1.5B model). This suggests the model is truly learning that the same problem in English or Japanese is the same at its core.
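To illustrate how this kind of alignment can be probed, here is a simplified sketch (an assumption for illustration; the paper's MEXA-style procedure may differ in detail): mean-pool each layer's hidden states for the English question and its target-language counterpart, then compare them with cosine similarity, layer by layer.

```python
import numpy as np

def layerwise_alignment(hidden_en: list[np.ndarray], hidden_xx: list[np.ndarray]) -> list[float]:
    """Cosine similarity between mean-pooled hidden states of parallel questions, per layer.

    hidden_en[l] and hidden_xx[l] are (seq_len, dim) hidden-state matrices for layer l,
    taken from the English question and its target-language translation; this is a
    simplified stand-in for a MEXA-style cross-lingual alignment probe.
    """
    sims = []
    for h_en, h_xx in zip(hidden_en, hidden_xx):
        v_en, v_xx = h_en.mean(axis=0), h_xx.mean(axis=0)  # mean-pool over tokens
        cos = float(v_en @ v_xx / (np.linalg.norm(v_en) * np.linalg.norm(v_xx) + 1e-8))
        sims.append(cos)
    return sims  # after TRIT training, later layers should show noticeably higher similarity
```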
Flexible reasoning setting: When models are allowed to think in any language but must answer in the target language, TRIT still improves performance over SLC-RL (52.1% vs 48.0%). This shows TRIT strengthens real understanding, not just strict language discipline.
Sensitivity and ablation (what mattered most):
- Filtering threshold: A middle choice (around 1/3) best balances keeping enough data and avoiding noisy cases that cause false translation penalties.
- Ablations: Removing cross-lingual reasoning or target-language reasoning hurts a lot; removing self-translation also hurts (less but still meaningful). Using English-only filtering instead of cross-lingual filtering reduces performance, mainly by increasing false negatives (good translations mistakenly punished).
Takeaway: TRIT's gains are broad (multiple models, multiple languages), deep (consistency + accuracy), and durable (still helps when the rules are relaxed).
05 Discussion & Limitations
Limitations:
- Language coverage: The paper tests five target languages. While TRIT does not need extra multilingual labels, more languages, especially low-resource ones with very different scripts or morphology, need testing to confirm generality.
- Model size: Experiments go up to 4B parameters. Larger models might behave differently (often better), but training costs and stability could shift. Still, TRIT's logic is model-agnostic, so it should scale.
- Reward dependence on verifiability: TRIT shines when answers are easily checked (like a boxed number). For tasks without clear, verifiable answers, reward design would need adaptation.
- Deferred reward noise: If the model cannot yet reason in the target language, good translations may get unfairly punished. The Phase 1 filter reduces this, but it doesn't remove all noise.
Required resources:
- RL training with group sampling (multiple outputs per question) and long context windows (thousands of tokens) for chain-of-thought.
- The ability to run language detection, format checks, and repetition detection efficiently during training.
- A reasonable set of English questions (no multilingual labels required), and enough compute to perform iterative RL.
When not to use:
- If you already have excellent, large-scale, human-validated parallel data and you only need translation (not reasoning), a pure NMT approach might be cheaper.
- If your task lacks verifiable rewards (no clear correctness signal), you'll need a different or weaker proxy, and TRIT may not deliver the same gains.
- If you cannot afford RL sampling (multiple candidates per prompt), simpler SFT baselines might be preferable despite lower ceilings.
Open questions:
- How far does TRIT scale in very low-resource, long-tail languages with limited pretraining coverage? Does the filter let the system bootstrap reliably?
- Can we design better proxy rewards for tasks without numeric or easily-checked answers (e.g., creative writing, open-ended QA)?
- What happens at much larger model sizes (e.g., 70B+)? Do gains saturate or compound due to stronger base multilinguality?
- Can TRIT be extended to multimodal settings (e.g., translating and reasoning about math diagrams) with similar deferred rewards?
06 Conclusion & Future Work
Three-sentence summary: This paper presents TRIT, a self-improving training framework that integrates translation and reasoning so each skill teaches the other. By rewarding translation only when the translated question can be solved, TRIT improves multilingual question understanding, reasoning accuracy, and language consistency, all without external evaluators or extra multilingual data. Experiments show strong gains over multiple baselines, better cross-lingual alignment inside the model, and translation improvements that generalize beyond math.
Main achievement: Turning reasoning correctness into a dependable, self-contained judge of translation quality, which in turn strengthens multilingual reasoning.
Future directions: Scale to more languages (especially low-resource ones), larger models, and tasks without easy verifiable rewards by crafting new proxy signals; explore multimodal question translation and reasoning.
Why remember this: TRIT reframes multilingual reasoning as a closed learning loop (translate, reason, reward, repeat) so models learn to truly understand the same problem in any language and think natively in that language with clarity and accuracy.
Practical Applications
- Multilingual math tutoring systems that think and explain steps entirely in the student's native language.
- Customer support bots that correctly understand and solve complex requests across multiple languages without switching to English internally.
- Educational content platforms that auto-translate and verify problem accuracy via downstream solution checks.
- Government and NGO information services that deliver precise, language-consistent instructions in low-resource languages.
- Cross-border enterprise tools that maintain consistent reasoning quality across regions (e.g., policy Q&A, compliance checks).
- Assessment tools for translation vendors that validate translation fidelity by solving tasks on the translated text.
- Multilingual STEM help desks that preserve mathematical terms and notation while explaining solutions clearly.
- Internal model training pipelines that scale to new languages without external evaluators, using verifiable rewards to control quality.
- Voice assistants that offer native-language reasoning for complex tasks (e.g., budgeting, planning) with consistent accuracy.