Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners
Key Summary
- Large reasoning models can often find the right math answer in their "head" before finishing their written steps, but this works best in languages with lots of training data like English and Chinese.
- The authors test this by cutting off (truncating) the model's step-by-step reasoning at different points and checking how often the model still answers correctly.
- They design new measures (AUTC, AUGC, and LRS) to score how early and how independently from written steps the correct answer shows up.
- On the easier MGSM benchmark, latent reasoning is clear in high-resource languages, with solid accuracy even when only a small slice of the steps is shown.
- On the harder Multilingual AIME benchmark, this early, hidden correctness mostly disappears across all languages and model sizes.
- Inside the model, the path toward the correct answer looks very similar across languages and tends to align with English: an English-centered latent reasoning pathway.
- Bigger models (7B → 14B → 32B) help, but they don't close the gap for low-resource languages.
- A separate test shows some memorization exists, but models still change answers when numbers are edited and stay correct with paraphrases, supporting genuine reasoning.
- The study shares code and a clear protocol to measure multilingual latent reasoning and separate it from plain memorization.
- Bottom line: today's models can reason silently in many languages, but it's uneven, fragile on hard problems, and leans heavily on English.
Why This Research Matters
If AI is better at silently solving problems in English than in your language, that's a fairness issue for students, workers, and communities worldwide. This paper shows where and when the gap appears, and how it grows on harder problems, so builders can fix it. Early, hidden correctness can speed up help and reduce costs, but only if it's reliable across languages. Knowing that large models follow an English-centered internal route helps researchers design training that balances strengths across languages. The memorization checks show the models are not just parroting; they can adapt, which is encouraging for building trustworthy multilingual tools. These insights guide data curation, evaluation, and training strategies that make AI support more equal for everyone.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how you can do math in your head before you explain it out loud? You might already know 27 + 35 = 62 before you say the steps. Computers that talk can do that too, sometimes.
🥬 Filling (The Actual Concept): What it is: Large Reasoning Models (LRMs) are AI systems trained to solve multi-step problems like math and logic and to explain their steps if asked. How it works: 1) They read the question, 2) think through many tiny, hidden calculations, 3) write out reasoning steps (if prompted), and 4) give a final answer. Why it matters: If we don't understand whether the model solves problems by really thinking or just copying patterns, we can't trust its answers across different languages and difficulties.
🍞 Bottom Bread (Anchor): When you ask the model a word problem like "Sam has 3 apples and buys 2 more, how many total?", the model might already "know" it's 5 before writing any steps.
🍞 Top Bread (Hook): Imagine a librarian who sorts books in lots of languages. If most books are in English, the librarian might get really good at English first and only later learn to sort books in other languages.
🥬 Filling (The Actual Concept): What it is: Natural Language Processing (NLP) is how computers read, write, and understand human language. How it works: 1) The model turns words into numbers, 2) uses patterns it learned from tons of text, 3) predicts the next words or the final answer. Why it matters: If the training text is mostly English, NLP systems may do best in English and struggle elsewhere.
🍞 Bottom Bread (Anchor): Ask the model the same math problem in English and in Swahili. If it was trained mostly on English, it usually performs better in English.
🍞 Top Bread (Hook): Think of math word problems as puzzles with words and numbers glued together.
🥬 Filling (The Actual Concept): What it is: Mathematical reasoning tasks are word problems that need careful, step-by-step thinking to reach a numeric answer. How it works: 1) Understand the story, 2) pick the right math, 3) calculate carefully, 4) present a single number. Why it matters: If a model misses a detail, the final number can be wrong even if it sounds confident.
🍞 Bottom Bread (Anchor): "A train leaves at 3:00 PM and arrives at 4:30 PM. How long is the trip?" needs reading (find times) and math (90 minutes).
🍞 Top Bread (Hook): Imagine asking the same puzzle in 11 different languages and seeing if the solver is equally good in all of them.
🥬 Filling (The Actual Concept): What it is: Multilingual models are AI systems trained to handle many languages. How it works: 1) They learn shared patterns across languages, 2) they try to align meanings even with different scripts and grammars, 3) they reuse knowledge across languages. Why it matters: If training is uneven, some languages become "A-students" and others struggle.
🍞 Bottom Bread (Anchor): The study tests English, French, German, Chinese, Japanese, Russian, Spanish, Swahili, Bengali, Telugu, and Thai on the same math problems.
🍞 Top Bread (Hook): You know how teachers ask you to "show your work"? That's like writing out your thinking steps.
🥬 Filling (The Actual Concept): What it is: Chain-of-Thought (CoT) is the model writing step-by-step explanations before giving the answer. How it works: 1) Expand the plan in small sentences, 2) compute partial results, 3) combine them, 4) finish with the final number. Why it matters: Without CoT, the model might jump to answers and make mistakes or be untrustworthy.
🍞 Bottom Bread (Anchor): For 27 + 35, a CoT might say, "Add tens: 20 + 30 = 50. Add ones: 7 + 5 = 12. Total = 62."
🍞 Top Bread (Hook): Picture solving a problem in your head but not saying it until later. You're still thinking, just silently.
🥬 Filling (The Actual Concept): What it is: Latent reasoning is the model's hidden, non-verbal thinking that happens inside before any words are written. How it works: 1) The model forms clues and partial answers in hidden states, 2) these get refined layer by layer, 3) the final answer may be ready early, even if the model keeps talking. Why it matters: If the model already "knows" the answer inside, the written steps might just be a performance, and the real power lies in the hidden process.
🍞 Bottom Bread (Anchor): The model might hit the correct number halfway through its explanation, or even before it writes any steps at all.
The world before this paper: People knew LRMs could show their steps, and some papers in English showed that models often reach the right answer internally before finishing CoT. But we didn't know if this hidden thinking also worked well in other languages.
The problem: Are models equally good at this hidden, early answer formation across languages? Or is it uneven? And does task difficulty erase these hidden advantages?
Failed attempts: Most past work looked only at English, mixed languages in the reasoning steps by accident, or didn't measure how early the correct answer forms.
The gap: We needed a clean, multilingual test that (1) keeps the reasoning language under control, (2) peeks at how answers appear when you only show part of the steps, and (3) compares the insides of the model across languages.
Real stakes: If your calculator-in-a-chatbot is weaker in your language, students, doctors, or citizens might get worse help. Understanding and fixing this matters for fairness, education, and access.
02 Core Idea
🍞 Top Bread (Hook): Imagine you're watching a mystery show but only allowed to see the first 10%… then 20%… then 30% of each episode. If you can still guess the culprit early, you must be picking up hidden clues.
🥬 Filling (The Actual Concept): What it is: The key insight is to cut (truncate) the model's written steps at different points and ask for the answer, then measure how often the model is already right. How it works: 1) Make the model write a full solution in the target language, 2) cut the reasoning after only a small slice (like the first 10%), 3) immediately force the model to give the number, 4) repeat for bigger slices, 5) track how fast correctness appears and whether the answer was already written in the visible part. Why it matters: If accuracy is high even when almost no steps are shown, and the correct number isn't yet written, then the model is doing true latent reasoning.
🍞 Bottom Bread (Anchor): If English answers are often correct at 0% or 10% of the steps, that's strong evidence the model already computed the answer inside.
🍞 Top Bread (Hook): Think of tasting soup every few minutes as it cooks. If it tastes delicious way before the recipe is done, the flavor formed early.
🥬 Filling (The Actual Concept): Truncation-based Strategy. What it is: A way to test early understanding by only letting the model see a prefix of its own reasoning. How it works: 1) Generate the full step-by-step thoughts, 2) keep only the first r% of steps, 3) stop thinking and ask for the final number, 4) score correctness across many r values. Why it matters: Without truncation, you can't tell if the model needed all the steps or secretly had the answer long before.
🍞 Bottom Bread (Anchor): With r = 10% on a grade-school problem, English still gets many right, showing early internal solution formation.
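To make the recipe concrete, here is a minimal Python sketch of the truncation-and-elicit idea. It assumes sentence-level steps and an illustrative prompt template with a <think> block; the paper's exact splitting rule and prompt wording may differ.

```python
import re

def truncate_trace(trace: str, ratio: float) -> str:
    """Keep only the first `ratio` fraction of sentence-level reasoning steps."""
    steps = re.split(r"(?<=[.!?])\s+", trace.strip())
    keep = int(len(steps) * ratio)
    return " ".join(steps[:keep])

def forced_answer_prompt(question: str, trace: str, ratio: float) -> str:
    """Show the question plus a truncated prefix of the model's own reasoning,
    close the thinking block, and demand the final number immediately."""
    prefix = truncate_trace(trace, ratio)
    return f"{question}\n<think>\n{prefix}\n</think>\nThe final answer is:"
```

Sweeping `ratio` from 0.0 to 1.0 and scoring the forced answers at each level produces the truncation accuracy curve used by the metrics below.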
🍞 Top Bread (Hook): When you guess on a quiz 10 times, your chance to get at least one right goes up.
🥬 Filling (The Actual Concept): Pass@k. What it is: A score that asks, "If the model tries k times, does it get at least one correct?" How it works: 1) Sample up to k answers, 2) check if any match the gold answer, 3) count across problems. Why it matters: It shows how reliable the model is, even with some randomness.
🍞 Bottom Bread (Anchor): Pass@10 on a truncated trace tells us if at least one of ten tries is correct with limited steps.
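A minimal sketch of this per-problem check, assuming answers are compared as normalized strings (a real evaluation may parse numbers more carefully):

```python
def normalize(ans: str) -> str:
    """Illustrative normalization: strip whitespace and thousands separators."""
    return ans.strip().replace(",", "")

def pass_at_k(samples: list[str], gold: str, k: int) -> bool:
    """True if at least one of the first k sampled answers matches the gold answer."""
    return any(normalize(s) == normalize(gold) for s in samples[:k])

# Dataset-level pass@k is the mean of this flag over all problems
# at a given truncation level.
```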
🍞 Top Bread (Hook): Suppose you're reading your own notes. If the answer is literally written there, it's easy to repeat it.
🥬 Filling (The Actual Concept): Gold-in-Trace Rate. What it is: The fraction of correct predictions where the correct number already appears in the visible steps. How it works: 1) Look at the truncated text, 2) check if the gold number is present, 3) compute the rate among correct cases. Why it matters: High early gold-in-trace means the model might be copying; low early gold-in-trace hints at true latent reasoning.
🍞 Bottom Bread (Anchor): If at 10% truncation the model's correct answers rarely have the gold number written yet, it likely solved it silently.
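A sketch of that check in Python; the digit-boundary regex is an assumption about how "the gold number appears" is decided, not the paper's exact rule:

```python
import re

def gold_in_trace(visible_steps: str, gold: str) -> bool:
    """Does the gold number literally appear in the truncated reasoning text?"""
    return re.search(rf"(?<!\d){re.escape(gold)}(?!\d)", visible_steps) is not None

def gold_in_trace_rate(correct_cases: list[tuple[str, str]]) -> float:
    """Among correct predictions, the fraction whose visible prefix already
    contains the gold answer; each case is (truncated_trace, gold_answer)."""
    if not correct_cases:
        return 0.0
    return sum(gold_in_trace(t, g) for t, g in correct_cases) / len(correct_cases)
```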
🍞 Top Bread (Hook): Imagine plotting how quickly a runner pulls ahead in a race.
🥬 Filling (The Actual Concept): Area Under the Truncation Accuracy Curve (AUTC). What it is: A single score for how early accuracy rises as you reveal more steps. How it works: 1) Plot accuracy vs. truncation percent, 2) measure the area under the curve, 3) bigger area = earlier, stronger correctness. Why it matters: Without an early lead (big area), the model only gets good very late, meaning less latent reasoning.
🍞 Bottom Bread (Anchor): English on MGSM shows a high AUTC: accuracy is already strong with small prefixes.
🍞 Top Bread (Hook): If the answer shows up in the text early, that's different from discovering it silently.
🥬 Filling (The Actual Concept): Area Under the Gold-in-Trace Curve (AUGC). What it is: A score for how early the answer is explicitly written in the steps. How it works: 1) Plot gold-in-trace vs. truncation percent, 2) compute the area, 3) higher area = the answer appears early in text. Why it matters: It helps separate "I wrote it early" from "I knew it early but didn't write it yet."
🍞 Bottom Bread (Anchor): Low early AUGC with high early accuracy signals strong latent reasoning.
🍞 Top Bread (Hook): Think of giving extra credit for solving without seeing the solution written down.
🥬 Filling (The Actual Concept): Latent Reasoning Score (LRS). What it is: A metric that boosts correctness that happens before the gold answer appears in text (and downweights it after). How it works: 1) Weight accuracy at each truncation level by (1 − gold-in-trace rate), 2) add over all truncation levels, 3) higher LRS = stronger latent reasoning. Why it matters: It directly rewards hidden, not-just-copying skill.
🍞 Bottom Bread (Anchor): On MGSM, English and Chinese have high LRS; Swahili and Telugu are lower, showing a resource gap.
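All three scores can be computed directly from the two curves. Below is a minimal sketch that follows the definitions above, using trapezoidal area over the truncation ratios; the paper's exact normalization constants may differ.

```python
def area(y: list[float], x: list[float]) -> float:
    """Trapezoidal area under the curve y(x)."""
    return sum((y[i] + y[i + 1]) / 2 * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def autc(acc: list[float], ratios: list[float]) -> float:
    """Area under the accuracy-vs-truncation curve."""
    return area(acc, ratios)

def augc(git: list[float], ratios: list[float]) -> float:
    """Area under the gold-in-trace-vs-truncation curve."""
    return area(git, ratios)

def lrs(acc: list[float], git: list[float], ratios: list[float]) -> float:
    """Latent Reasoning Score: accuracy weighted by (1 - gold-in-trace rate),
    so only correctness reached before the answer is written gets full credit."""
    return area([a * (1.0 - g) for a, g in zip(acc, git)], ratios)

# Example inputs: ratios = [0.0, 0.1, ..., 1.0], with acc and git measured at each ratio.
```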
Multiple analogies for the core idea:
- Puzzle peek: Show only the first few pieces of a jigsaw and see if the model already guesses the picture.
- Soup taste-test: Sip early, mid, and late to learn when the flavor arrives.
- Race curve: Track who leads early; if the model leads with tiny prefixes, it's a fast starter.
Before vs After:
- Before: We knew about English latent reasoning but not if it survived across languages or tough tasks.
- After: Latent reasoning exists in many languages but is uneven (stronger in high-resource languages) and fades on hard problems. Inside, language paths align to an English-centered route.
Why it works (intuition): Truncation locks the model out of using later steps. If it still succeeds early, and the answer isn't visible, then the hidden states must already contain the solution. Comparing curves and the new metrics makes that signal clear and comparable across languages.
Building blocks:
- Careful language control so the model reasons in the same language as the prompt.
- Truncation at many ratios to map how early correctness appears.
- Metrics (AUTC, AUGC, LRS) to convert curves into easy-to-compare numbers.
- Internal probes (logit lens) and hidden-state similarity to see if languages share the same inside track.
03 Methodology
At a high level: Input (same math problem in 11 languages) → Make the model think in the target language → Generate a full step-by-step trace → Truncate to r% of steps → Immediately elicit the final number → Score with pass@k, gold-in-trace, AUTC, AUGC, LRS → Probe internals (logit lens, hidden-state similarity) → Check memorization vs reasoning (number edits and paraphrases) → Output findings.
Step 1: Keep the model thinking in one language
- What happens: The prompt says "think in [target language]" and inserts a tiny language reminder right after the <think> tag so the reasoning stays in that language.
- Why it exists: Without this, the model might switch languages mid-thought, which would break fair comparisons.
- Example: Prompt says, "By request, I will begin to think in German," and the model writes its steps in German.
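A sketch of how such a prompt could be assembled. The reminder sentence follows the example above; the surrounding instruction text and the function itself are illustrative assumptions, not the paper's template.

```python
def build_controlled_prompt(question: str, language: str) -> str:
    """Ask for reasoning in a target language and seed the <think> block with a
    reminder so the chain-of-thought continues in that language."""
    instruction = (
        f"Solve the problem. Reason step by step in {language}, "
        "then give the final number."
    )
    reminder = f"By request, I will begin to think in {language}."
    return f"{instruction}\n\n{question}\n<think>\n{reminder}\n"

# The model completes the <think> block in the target language,
# which keeps cross-language comparisons fair.
```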
🍞 Top Bread (Hook): Imagine a secret notebook where you do the work before sharing the answer.
🥬 Filling (The Actual Concept): Hidden States. What it is: The model's internal memory of the current problem, updated after each token. How it works: 1) Words become vectors, 2) layers transform them, 3) the final vector guides the next output. Why it matters: This is where latent reasoning lives: inside, before words.
🍞 Bottom Bread (Anchor): The model may store "the answer is 62" internally before it writes any sentence saying 62.
🍞 Top Bread (Hook): Picture rows of gears shaping a thought step by step.
🥬 Filling (The Actual Concept): Neural Network Representation. What it is: The way the model encodes meaning as numbers across layers. How it works: 1) Each layer mixes and refines information, 2) deeper layers form more abstract ideas, 3) the unembedding maps vectors back to words/numbers. Why it matters: If these representations agree across languages, the model might share one internal pathway.
🍞 Bottom Bread (Anchor): "Carrying the 1" in addition might show up as a pattern inside certain layers, no matter the language.
Step 2: Generate a full chain-of-thought (CoT)
- What happens: The model solves the math word problem, writing steps in the target language, then gives a final number.
- Why it exists: We need a complete trace so we can later cut it at many points.
- Example: For 27 + 35, it writes tens and ones steps in Chinese (if the prompt is Chinese).
Step 3: Truncate the reasoning to r% and force an answer
- What happens: Keep only the first r% of sentences from the trace, then immediately ask for the final number (no more steps allowed).
- Why it exists: It reveals whether the model can answer correctly early, without relying on later written steps.
- Example: With r = 10% on an MGSM problem, English still gets a surprising number right.
🍞 Top Bread (Hook): Like peeking at only the first chapter of a mystery and guessing the ending.
🥬 Filling (The Actual Concept): Representational Analysis. What it is: Studying how the model's inside changes as it thinks. How it works: 1) Collect hidden states across layers, 2) analyze how the answer gets more likely, 3) compare languages. Why it matters: It shows whether languages share the same inner route toward the right answer.
🍞 Bottom Bread (Anchor): If German and English hidden states evolve similarly layer by layer, they likely use a shared pathway.
🍞 Top Bread (Hook): Think of holding up a magnifying glass to each layer to see what answer it leans toward.
🥬 Filling (The Actual Concept): Logit Lens. What it is: A probe that projects hidden states into the model's vocabulary space to see which tokens look likely at each layer. How it works: 1) Take a layer's hidden state, 2) map it through the unembedding, 3) rank where the correct answer token stands. Why it matters: If the gold number ranks high early, the model's internals already favor it.
🍞 Bottom Bread (Anchor): On MGSM, the gold number's rank improves similarly across languages, suggesting shared dynamics.
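A rough logit-lens sketch using Hugging Face Transformers. The checkpoint name, the use of the model's final norm before the unembedding, and treating the gold answer as a single token are all simplifying assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def gold_rank_per_layer(prompt: str, gold: str) -> list[int]:
    """Rank of the gold answer's first token when each layer's hidden state
    is projected through the unembedding (lower rank = more favored)."""
    gold_id = tok(gold, add_special_tokens=False).input_ids[0]
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    ranks = []
    for h in out.hidden_states:                # embeddings plus one state per layer
        h_last = model.model.norm(h[0, -1])    # last position, final norm applied
        logits = model.lm_head(h_last)         # project into vocabulary space
        ranks.append(int((logits > logits[gold_id]).sum().item()) + 1)
    return ranks
```

Tracking how this rank improves across layers, separately for each language, is what lets the authors compare internal answer formation across languages.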
Step 4: Score performance with multiple metrics
- Pass@k: Does the model get at least one correct in k tries at each truncation level?
- Gold-in-Trace rate: Among correct cases, is the answer already written in the visible steps?
- AUTC: How early and strongly accuracy rises as r grows.
- AUGC: How early the answer gets written down.
- LRS: Accuracy that happens before the answer is explicitly written (a proxy for latent reasoning).
Concrete mini-example:
- Problem (English): "A box has 4 red balls and 3 blue balls. If you add 2 red balls, how many balls now?" Answer: 9.
- Full CoT might be 5 sentences. Try r = 20% (keep only 1 sentence): The model still outputs 9 often in English; the gold number 9 isn't written in that one sentence, so early correctness counts toward LRS.
Step 5: Compare easy vs. hard datasets and language groups
- Datasets: MGSM (grade-school, easier) vs. Multilingual AIME (competition-level, harder).
- Languages: High-resource (e.g., English, Chinese), mid-resource (e.g., Bengali, Thai), low-resource (e.g., Swahili, Telugu).
- Why it exists: To see if difficulty or training data availability changes latent reasoning.
Step 6: Check memorization vs real reasoning
- NumEdit: Change just one number so the right answer must change. If the model still says the old answer, that hints memorization; if it adapts, thatās reasoning.
- Paraphrase: Reword the question but keep its meaning. If the model stays correct, that's robust reasoning, not just copying phrasing.
- Finding: Some memorization exists, but models mostly adapt on NumEdit (lower match), and stay strong on Paraphrase (high accuracy), especially with a fresh trace.
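A sketch of how the two checks could be scored; the dictionary keys and the exact-match rule are illustrative, not the paper's code.

```python
def numedit_match_rate(results: list[dict]) -> float:
    """Fraction of number-edited problems where the model still outputs the
    ORIGINAL answer (now wrong); higher means more memorization-like behavior."""
    stuck = sum(r["prediction"] == r["original_gold"] for r in results)
    return stuck / len(results)

def paraphrase_accuracy(results: list[dict]) -> float:
    """Accuracy on reworded problems whose gold answer is unchanged."""
    correct = sum(r["prediction"] == r["gold"] for r in results)
    return correct / len(results)
```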
The secret sauce:
- Truncation converts a fuzzy idea ("hidden thinking") into a measurable curve.
- LRS directly rewards correctness that happens before the answer appears in text.
- Logit lens and hidden-state similarity show that, inside, languages follow very similar paths, often aligned with English, revealing an English-centered latent route.
04 Experiments & Results
The test: The authors measured how early correctness appears when only part of the reasoning is visible, across 11 languages and two datasets of different difficulty. They also looked inside the model to see how the correct answer becomes more likely across layers and how similar the hidden states are across languages.
The competition: Three distilled DeepSeek-R1 models based on Qwen2.5 (7B, 14B, 32B) solved MGSM (easier) and Multilingual AIME (harder), each in 11 languages spanning high-, mid-, and low-resource groups.
The scoreboard with context:
- On MGSM (easier):
- In high-resource languages like English and Chinese, accuracy was already nontrivial even at 0% truncation (like getting a solid head start with no visible steps). Their AUTC and LRS were high, meaning early, hidden correctness was common and not just copying from written steps.
- Example numbers from the paper: For English with the 7B model, AUTC ≈ 0.52 and LRS ≈ 0.38, a strong early lead, much of it not explained by the answer being visible in text.
- Low-resource languages (e.g., Swahili, Telugu) showed lower AUTC and LRS, so less early success and more dependence on full, explicit CoT.
- Scaling up to 32B improved all languages, but the gap remained: resource-rich languages still led.
- On Multilingual AIME (harder):
- Early latent correctness shrank dramatically across the board. English LRS dropped to about 0.03, like going from an A to barely passing on the early-guess test.
- This suggests hard problems force the model to actually lean on long, explicit reasoning steps rather than silently solving early.
Surprising findings:
- Latent, inside-the-model dynamics were strikingly similar across languages at a given model size. The rank of the correct answer (via the logit lens) improved over layers in very similar ways for German, French, Chinese, etc.
- Hidden-state similarity tied closely to English: high-resource languages aligned more with English's internal pathway; mid- and low-resource languages aligned less. This pattern held even for different scripts.
- Correctness didn't fully explain this alignment. For high-resource languages, even incorrect cases stayed strongly aligned with English's hidden trajectory, hinting the pathway itself is English-centered, not just a byproduct of getting the answer right.
Memorization vs reasoning:
- NumEdit (change one number): Without a new trace, the model sometimes stuck to the old answer (~30% match), but allowing a new trace reduced that matching rate, especially in bigger models, showing it recomputes rather than just recalls.
- Paraphrase (reword, same meaning): Accuracy stayed high and got even better with a fresh trace. High-resource languages with larger models neared perfection, supporting robust reasoning instead of phrase-matching.
Bottom line of results:
- Latent reasoning exists in many languages, especially on easier tasks, but is much stronger in high-resource languages.
- On hard tasks, early latent correctness mostly vanishes; explicit reasoning dominates.
- Inside, language trajectories look alike and often converge toward an English-centered path.
- Bigger models help but do not eliminate the multilingual gaps.
05 Discussion & Limitations
Limitations:
- Truncation was done at the sentence/step level, not token-by-token, so ultra-fine timing of when the answer forms remains unknown.
- Models capped at 32B: helpful but smaller than frontier systems; larger models might show different dynamics.
- The study observed an English-centered pathway but did not fully explain why. Training data mix, alignment methods, and scripts likely matter and deserve deeper, mechanistic study.
Required resources:
- Access to multilingual prompts and the ability to control the model's reasoning language.
- Enough compute to generate full traces, sample multiple outputs, and run hidden-state probes across layers.
- Multilingual datasets with shared underlying problems (like MGSM translations) to enable fair cross-language comparisons.
When NOT to use this approach:
- If your task is extremely hard and you only care about final accuracy, truncation curves may not show much early signal, and probing internals may add overhead without boosting performance.
- If you can't reliably control the model's reasoning language, cross-language comparisons might be misleading.
- If your domain isnāt numeric or cannot be fairly matched across languages, the metrics here (especially gold-in-trace checks) may not apply.
Open questions:
- Can targeted training (e.g., multilingual reasoning RL, curated datasets) flatten the resource gap in latent reasoning?
- Do token-level truncation and finer probes reveal earlier, clearer latent signals?
- What exact training ingredients cause the English-centered pathway: data quantity, quality, alignment choices, or script effects?
- Can we design models that reason well in latent space without over-relying on English, and that stay strong on hard tasks?
- How do these findings transfer beyond math to science, law, or everyday planning in many languages?
06 Conclusion & Future Work
Three-sentence summary: The paper shows that large reasoning models often find the right answer in their hidden states before finishing their written steps, but this works best in high-resource languages and on easier problems. On tough problems, the early hidden advantage mostly disappears, and models depend more on long, explicit reasoning. Inside the networks, many languages follow similar paths that tend to align with an English-centered latent route.
Main achievement: A clean, multilingual protocol (language-controlled reasoning traces + truncation curves + new metrics AUTC, AUGC, LRS + internal probes via logit lens and similarity) that quantifies and compares latent reasoning across 11 languages and two difficulty levels, while separating reasoning from memorization.
Future directions: Scale to larger models and richer languages, use token-level truncation, and run targeted training to strengthen latent reasoning in under-resourced languages. Explore mechanistic interpretability and data attribution to pinpoint the causes of the English-centered pathway. Extend beyond math to other domains that matter daily.
Why remember this: It reveals that today's models do have a real, but uneven, silent thinking engine that favors English and easier tasks. Knowing where it shines and where it breaks is the first step toward building fairer, stronger, multilingual reasoners for everyone.
Practical Applications
- Diagnose multilingual gaps: Use truncation curves and LRS to find where a specific language needs more training data or alignment.
- Improve language control: Apply prompt-language reminders after <think> to keep reasoning in the user's language.
- Benchmark model upgrades: Track AUTC/LRS before and after fine-tuning or RL to confirm real gains in early latent reasoning.
- Curriculum design: Start with MGSM-like tasks to build robust latent reasoning, then gradually introduce harder problems.
- Data curation: Add high-quality reasoning traces for under-resourced languages to reduce the English-centered bias.
- Safety checks: Use NumEdit and Paraphrase probes to detect memorization vs genuine reasoning during evaluation.
- Compute savings: If a language shows strong early correctness, use early stopping to save tokens at inference time.
- Crosslingual routing: For low-resource languages, consider translate-then-solve backbones if tests show higher LRS/AUTC in English.
- Model selection: Prefer larger models when early latent reasoning matters, acknowledging they won't fully close resource gaps.
- Interpretability audits: Use logit lens and hidden-state similarity to ensure multilingual reasoning paths are stable and aligned with goals.