Language of Thought Shapes Output Diversity in Large Language Models

Intermediate
Shaoyang Xu, Wenxuan Zhang Ā· 1/16/2026
arXiv Ā· PDF

Key Summary

  • The paper shows that changing the language a model 'thinks in' (its language of thought) can make its English answers more varied without making them much worse in quality.
  • Different thinking languages push the model’s hidden thoughts into different regions of its 'thinking space,' and languages farther from English give bigger diversity gains.
  • Sampling many times while thinking in one non-English language increases diversity compared to thinking in English.
  • Mixing several thinking languages (one per sample) boosts diversity even more than any single language alone, thanks to complementary effects.
  • As the number of samples grows, the mixed-languages approach keeps finding new ideas longer, raising the 'diversity ceiling.'
  • These findings hold across multiple models, two open-ended benchmarks, and several diversity metrics.
  • The approach also improves cultural pluralism on BLEND and WVS, covering more countries and value orientations than standard tricks like high temperature or 'please be diverse' prompts.
  • Quality drops are small on average, and the method works well with normal decoding temperatures (and stacks nicely with higher temperatures if desired).
  • This gives a practical, controllable knob—'thinking language'—to produce broader, fairer, and more creative outputs.
  • The authors share code and evaluate across 15 languages, showing consistent, repeatable gains.

Why This Research Matters

This work gives us a simple, controllable way to make AI brainstorming more creative and less repetitive by changing the model’s hidden thinking language while keeping the final answer in English. It helps reduce cultural bias by surfacing perspectives connected to many languages, not just English-centered habits. Teams can cover more possibilities in product names, lesson plans, policy options, or story ideas without heavy retraining or complex ensembles. The method scales: mixing multiple thinking languages keeps uncovering new angles longer than English-only sampling. It also combines well with familiar knobs like temperature, letting practitioners trade off speed, variety, and quality. Overall, it moves AI from random variation to structured exploration, leading to fairer, richer, and more useful outputs in everyday applications.

Detailed Explanation


01 Background & Problem Definition

Imagine you’re in a classroom where every student writes stories. If they all use the same outline and brainstorm in the same way, their stories start to sound alike. That was often the world of Large Language Models (LLMs): they could write well, but their answers to open-ended questions sometimes clustered around the same ideas. People wanted more variety—more colors on the palette—so the results could better reflect different tastes, cultures, and creative paths.

Before this research, many teams tried to make LLMs more diverse. One common trick was to turn up the ā€œtemperature,ā€ which makes the model more random when it picks words. Others invented fancy decoding methods, asked multiple models and combined their answers, or changed the prompts in many ways. Training-time methods tried to reward variety, too. These steps helped, but they mostly stayed within the same ā€˜mental groove’—often English—and often felt like shaking the same snow globe harder instead of looking at a new landscape.

The core problem was sneaky: even when you ask an LLM to vary its style, if it thinks through the problem in the same internal way each time—especially in the same language—it can still end up exploring a narrow set of ideas. That’s risky for two reasons. First, creativity: if we want AI to help with brainstorming, invention, and exploration, it needs to suggest truly different directions, not small rephrasings. Second, fairness: if most of the model’s thinking is steered by English patterns and dominant cultures, less-common viewpoints may get sidelined.

People tried several fixes. Higher temperature added randomness, but sometimes led to silly mistakes without opening genuinely new idea paths. Advanced decoders shuffled the output path, but still used the same underlying thinking route. Prompt mixing and model ensembles added variety, but at higher cost and still didn’t fully tap the model’s multilingual abilities. Training for diversity helped, but required data and compute, and can be tricky to balance with quality.

So what was missing? A way to structurally change the route the model takes while it reasons—before it writes the final answer. Cognitive science hints that language shapes how we organize thoughts (the Sapir–Whorf idea). Multilingual people often show stronger divergent thinking because different languages nudge the mind to group concepts and relationships in unique ways. Modern LLMs can also reason in many languages. Put those together, and you get a big question no one had really tested for diversity: What if we change the language the model uses for its internal chain-of-thought?

This paper fills that gap. The authors make the model do its intermediate ā€œthinkingā€ in different languages, then force the final answer to be in English. That way, they can fairly compare how varied the English outputs are, depending on the hidden thinking language used. They don’t just rely on the surface text—they also peek into the model’s hidden states and find that different thinking languages fall into different regions of the model’s ā€œthinking space.ā€ That’s like discovering there are many rooms in the model’s brain, and English is only one of them.

Why should anyone care? Think of real life: we want AI to suggest many different weekend plans, story ideas, product names, science project angles, or policy viewpoints, not just a few near-copies. We also want it to reflect many cultures fairly, not just the loudest one. By switching the language of thought, the model naturally walks through new neighborhoods of its idea city, uncovering paths English alone might miss. And importantly, it does this while still giving the final answer in English, so users don’t have to read other languages. The result is a simple, controllable knob—thinking language—that boosts diversity, scales with more samples, and plays nicely with existing methods like temperature, all with only small quality trade-offs.

02 Core Idea

Aha! Moment in one sentence: If you change the language the model uses while it thinks, you steer it into different hidden idea neighborhoods, which makes its final English answers more diverse.

Three friendly analogies:

  • Colored lenses: Imagine trying on different colored glasses before drawing a picture. Each color makes you notice different things, so your final drawing (still in pencil) varies more.
  • Kitchens of cuisine: A chef planning a dish in Italian, Japanese, or Mexican cooking traditions will reach for different flavor ideas—even if the final menu is printed in English.
  • Map rooms: The model’s brain is a building with many rooms. English thinking uses one room; other languages use different rooms with different tools. Exploring more rooms leads to more varied creations.

Before vs. After:

  • Before: We mostly shook the same English-centered idea jar (temperature up, fancy decoding, more prompt rewrites), hoping new surprises would fall out.
  • After: We guide the model into different internal thinking rooms (languages), then write the answer in English. This reliably opens new idea paths and increases diversity—especially when we mix multiple thinking languages.

Why it works (intuition, not equations):

  • The model’s hidden states (its quiet thoughts) form a ā€˜thinking space.’ Different languages push those thoughts to different regions in that space.
  • The farther a language’s thinking region is from English, the more it uncovers different idea routes, so repeated samples from that region overlap less.
  • Mixing several languages covers multiple regions at once, so the combined set of answers is even broader than any single region alone.

Building blocks (each explained with the Sandwich pattern):

šŸž Hook: You know how switching from solving a math problem in your head to sketching a diagram can change how you think? Different tools shape your thoughts. 🄬 The Concept (Language of Thought): It’s the language the model uses for its hidden step-by-step thinking before giving an answer.

  • How it works:
    1. Insert a tiny cue after a special <think> tag telling the model to think in a target language (like French or Hebrew).
    2. Let the model do its internal reasoning in that language.
    3. Then tell it to answer in English.
  • Why it matters: If all thinking happens in English, ideas can cluster; changing the thinking language shifts the thought paths, unlocking new ideas. šŸž Anchor: The model plans a story in Spanish (thinking) but writes the final story summary in English. The English summaries vary more.
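
To make this concrete, here is a minimal sketch of how the cue insertion could be wired up. The <think> tag and the French/Tagalog cue phrases come from the paper's description; the English baseline cue, the helper names, and the two-stage prompt assembly are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of thinking-language control (illustrative; not the authors' code).
# French and Tagalog cues are the examples quoted in the paper's description;
# the English cue and helper names are assumptions for this sketch.
THINK_CUES = {
    "French": "D'accord, l'utilisateur demande",
    "Tagalog": "Sige, nagtatanong ang gumagamit",
    "English": "Okay, the user is asking",  # assumed English baseline cue
}
ANSWER_CUE = "Let me provide my answer in English only:"

def seed_thinking(question: str, thinking_language: str) -> str:
    """Prefill the assistant turn so the hidden reasoning starts in the target language."""
    return f"{question}\n<think>\n{THINK_CUES[thinking_language]}"

def seed_english_answer(text_with_closed_think: str) -> str:
    """Once the model has emitted </think>, append an English cue so the visible answer stays English."""
    return f"{text_with_closed_think}\n{ANSWER_CUE}\n"

# Usage: feed seed_thinking(...) to your model, let it finish the thought and close </think>,
# then continue generation from seed_english_answer(full_text_so_far).
print(seed_thinking("Suggest five science fair topics.", "French"))
```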

šŸž Hook: Imagine a giant playground where different areas inspire different games. 🄬 The Concept (Thinking Space): It’s the invisible landscape of the model’s hidden thoughts.

  • How it works:
    1. Collect the model’s hidden states during thinking.
    2. Summarize them to represent a ā€˜spot’ in space for each language.
    3. Visualize and compare distances between languages’ spots.
  • Why it matters: If languages sit in different spots, sampling from each spot yields different idea styles. šŸž Anchor: French-thinking sits near English; Tagalog-thinking is farther. The far one tends to give more distinct answers.
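
A rough sketch of this geometry check, assuming the thinking-phase hidden states have already been captured elsewhere (the hidden_states_by_lang structure and the single-layer choice are assumptions for illustration):

```python
# 'Thinking space' geometry sketch: hidden_states_by_lang maps each language to an
# array of shape [num_thinking_tokens, hidden_dim] collected during the thinking phase.
import numpy as np
from sklearn.decomposition import PCA

def language_centroids(hidden_states_by_lang):
    """One averaged hidden-state vector per thinking language."""
    return {lang: states.mean(axis=0) for lang, states in hidden_states_by_lang.items()}

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def distances_from_english(centroids):
    """Per the paper, larger distances here tend to predict bigger diversity gains."""
    eng = centroids["English"]
    return {lang: cosine_distance(vec, eng) for lang, vec in centroids.items() if lang != "English"}

def project_2d(centroids):
    """Project the language centroids to 2-D for a quick visualization of the thinking space."""
    langs = list(centroids)
    coords = PCA(n_components=2).fit_transform(np.stack([centroids[l] for l in langs]))
    return dict(zip(langs, coords.tolist()))
```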

šŸž Hook: Using only one spice makes dishes taste similar. 🄬 The Concept (Single-Language Sampling): Take many samples while thinking in one chosen non-English language, then answer in English.

  • How it works:
    1. Fix one thinking language (e.g., Hebrew).
    2. Generate many English answers by sampling multiple times.
    3. Measure how different these answers are from each other.
  • Why it matters: It shows the diversity you can gain from a single non-English ā€˜idea room.’ šŸž Anchor: Brainstorming ten gift ideas while thinking in Danish yields more variety than doing all ten in English.
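
A minimal sketch of Single-Language Sampling; generate stands in for whatever inference call you use, and seed_thinking refers to the earlier prompt-control sketch (both names are assumptions, not an API from the paper):

```python
# Single-Language Sampling (SLS) sketch: M English answers, all planned in one thinking language.
def single_language_sampling(question, thinking_language, generate, m=15, temperature=1.0):
    prompt = seed_thinking(question, thinking_language)  # from the prompt-control sketch above
    # Each call re-samples the hidden reasoning in the same language, then answers in English.
    return [generate(prompt, temperature=temperature) for _ in range(m)]
```

Comparing this set against an English-thinking run with the same M and temperature isolates the effect of the thinking language itself.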

šŸž Hook: A smoothie with several fruits has a richer flavor than one fruit alone. 🄬 The Concept (Mixed-Language Sampling): Take one sample per different thinking language and collect them together (all final answers in English).

  • How it works:
    1. Pick a set of languages.
    2. For each sample, switch to a different thinking language.
    3. Combine all English outputs into one bigger, more varied set.
  • Why it matters: Different languages contribute complementary ideas, creating broader coverage than any single language. šŸž Anchor: One answer planned in German, one in Occitan, one in Norwegian… Together, they cover more angles for the same prompt.
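
The mixed variant is a small change to the loop: rotate the thinking language across samples instead of fixing it. As before, generate and seed_thinking are assumed wrappers from the earlier sketches:

```python
# Mixed-Language Sampling (MLS) sketch: one English answer per thinking language.
def mixed_language_sampling(question, thinking_languages, generate, temperature=1.0):
    outputs = []
    for lang in thinking_languages:  # e.g. 15 languages -> 15 samples
        prompt = seed_thinking(question, lang)
        outputs.append(generate(prompt, temperature=temperature))
    return outputs  # the union of these English answers is the (hopefully broader) idea set
```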

šŸž Hook: A fruit salad beats a bowl of only apples when you want variety. 🄬 The Concept (Output Diversity): It’s how many meaningfully different answers the model can produce to the same open-ended question.

  • How it works:
    1. Generate many answers.
    2. Group near-duplicates; count how many distinct groups you get.
    3. Also measure how similar the answers are in meaning.
  • Why it matters: Low diversity feels repetitive and can hide minority viewpoints. šŸž Anchor: For ā€œweekend plans,ā€ high diversity gives hiking, museum, cooking class, volunteering, stargazing—rather than five versions of ā€œwatch a movie.ā€
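
The two measurements can be approximated as below. The paper groups functionally equivalent outputs (for example, with a judge model); the greedy embedding-threshold grouping and the 0.85 cutoff here are simplified stand-ins, not the paper's exact procedure:

```python
# Simplified proxies for the diversity measures (not the paper's exact implementation).
import numpy as np

def similarity_score(embeddings: np.ndarray) -> float:
    """Average pairwise cosine similarity of answer embeddings; lower = more varied meanings."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    n = len(x)
    return float((sims.sum() - n) / (n * (n - 1)))  # exclude self-similarity on the diagonal

def distinct_score(embeddings: np.ndarray, threshold: float = 0.85) -> float:
    """Greedy grouping of near-duplicates; the share of distinct groups approximates diversity."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    group_reps = []
    for vec in x:
        if not any(float(vec @ rep) >= threshold for rep in group_reps):
            group_reps.append(vec)  # this answer starts a new 'distinct idea' group
    return len(group_reps) / len(x)
```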

šŸž Hook: Teamwork makes the dream work—different teammates bring different strengths. 🄬 The Concept (Compositional Effects): Mixing languages doesn’t just add ideas; they interact to unlock even more variety together.

  • How it works:
    1. Sample across several languages.
    2. Remove one language at a time—diversity barely drops.
    3. Remove several—diversity drops a lot, showing they complement each other.
  • Why it matters: Diversity gains come from the combo, not a single ā€˜magic’ language. šŸž Anchor: Removing one fruit from your smoothie is fine; remove half and the flavor flattens.
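
This removal test can be scripted as a leave-k-out ablation over the mixed pool. The outputs_by_lang bookkeeping (one answer embedding per thinking language) and the averaging over removal sets are assumptions for this sketch:

```python
# Leave-k-out ablation sketch for the compositional effect.
import itertools
import numpy as np

def leave_k_out_diversity(outputs_by_lang, k, metric):
    """Average diversity after removing every possible set of k languages from the mix."""
    langs = list(outputs_by_lang)
    scores = []
    for removed in itertools.combinations(langs, k):
        kept = np.stack([outputs_by_lang[l] for l in langs if l not in removed])
        scores.append(metric(kept))  # metric could be the distinct_score sketch above
    return float(np.mean(scores))

# Expected pattern from the paper: k=1 barely moves the score; larger k drops it sharply.
```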

šŸž Hook: A big potluck beats a one-chef dinner when you want everyone’s culture represented. 🄬 The Concept (Cultural Pluralism): Ensuring many cultures and value orientations show up in the model’s answers.

  • How it works:
    1. Ask cultural questions.
    2. Sample many times and see how spread-out the answers are across countries/values.
    3. Higher spread (entropy) = better pluralism.
  • Why it matters: Broader cultural coverage is fairer and more useful. šŸž Anchor: On world-views questions, mixed-language thinking covers more countries and value options than English-only thinking.
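
The spread measurement boils down to normalized entropy over the countries or value options that the sampled answers land on; a small sketch, assuming answers have already been mapped to categorical labels:

```python
# Normalized-entropy sketch for cultural pluralism (1.0 = answers spread evenly over all options).
from collections import Counter
import math

def normalized_entropy(labels, num_options):
    counts = Counter(labels)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(num_options) if num_options > 1 else 0.0

# e.g. normalized_entropy(["Japan", "Brazil", "Japan", "Kenya"], num_options=16)
```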

03 Methodology

High level recipe: English question → control the model’s thinking language → force final answer to English → repeat sampling (Single-Language or Mixed-Language) → evaluate diversity and quality.

Step-by-step, like a friendly lab guide:

  1. Input preparation (What): Start from an English prompt, like ā€œSuggest five science fair topics.ā€ (Why): Keeps the user-facing part consistent so we can fairly compare diversity across conditions. (Example): The prompt is always in English, regardless of hidden thinking language.

  2. Thinking language control (What): Right after a special <think> tag, insert a tiny phrase in the target language, such as ā€œD’accord, l’utilisateur demandeā€ for French. This nudges the model to carry out its hidden reasoning in that language. (Why): Without a clear cue, the model may default to English thinking, mixing conditions and muddying results. (Example): For Tagalog, insert ā€œSige, nagtatanong ang gumagamit.ā€

  3. Output language control (What): After </think>, insert an English cue like ā€œLet me provide my answer in English only:ā€ to make sure the final text is English. (Why): We want to compare apples-to-apples—English answers generated after different thinking languages. (Example): Even if the model thought in Hebrew, its final visible answer is English.

  4. Two sampling strategies (What):

  • Single-Language Sampling (SLS): Choose one thinking language and draw M samples (e.g., 15) from that same language’s thinking region.
  • Mixed-Language Sampling (MLS): For each of the M samples, switch to a different thinking language (one sample per language), but always answer in English.
  (Why): SLS shows the gain from one non-English room. MLS shows the extra gain from combining rooms. (Example): With 15 samples, SLS might use only Norwegian-thinking, while MLS uses 15 different languages.

  5. Thinking-space geometry study (What): To understand why languages differ, the authors inspect hidden states during thinking and average them into one vector per language per layer. They compare each language to English using cosine distance, then visualize with PCA. (Why): If languages truly land in different regions, that explains why they can yield different idea paths (and diversity) downstream. (Example): Occitan-thinking might be geometrically farther from English than French-thinking.

  6. Diversity metrics (What):

  • Distinct Score: Group functionally equivalent outputs; more groups per M samples means higher diversity.
  • Similarity Score: Use embeddings to compute average pairwise similarity; lower means more spread-out ideas.
  (Why): Two complementary angles—one counts truly distinct clusters; the other measures average closeness in meaning. (Example): If several answers reduce to ā€œwatch movies,ā€ they merge into one group, lowering Distinct Score.

  7. Output quality check (What): Use a consistent grader (gpt-4o-mini) to score instruction adherence and overall quality from 0–100. (Why): Ensures we aren’t trading too much quality for diversity. (Example): A diverse set that’s unreadable isn’t helpful; we want both.

  8. Parameters to probe (What): Number of samples M and temperature. (Why): Real users can’t sample forever, and temperature is a common knob. Understanding how these interact with thinking language helps pick good settings. (Example): MLS at temperature 1.0 often matches or beats English-only at 2.0; a simple way to trace this is sketched below.
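
One simple way to probe both knobs is to re-score growing prefixes of a sample pool and watch where the curve flattens; distinct_score comes from the earlier metric sketch, and this looping scheme is an assumption rather than the paper's exact protocol:

```python
# Saturation-curve sketch: how Distinct Score grows as the sample budget M increases.
def diversity_curve(embeddings, max_m=None):
    """Score the first m samples for m = 2..max_m; a later flattening point = a higher diversity ceiling."""
    max_m = max_m or len(embeddings)
    return [distinct_score(embeddings[:m]) for m in range(2, max_m + 1)]

# Compare the curve for an English-only pool (even at temperature 2.0) with a mixed-language pool
# at temperature 1.0; per the paper, the mixed pool should keep climbing longer before it plateaus.
```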

What breaks without each step:

  • Skip thinking-language control: You can’t test the main idea; thinking drifts back to English.
  • Skip output-language control: You can’t fairly compare because user-visible languages differ.
  • Skip multiple samples: You can’t measure diversity; one sample hides variation.
  • Skip metrics: You can’t quantify improvements or compare methods.
  • Skip geometry analysis: You miss the intuition for why languages differ.

Concrete walkthrough with toy data:

  • Prompt: ā€œList creative, low-cost weekend activities.ā€
  • SLS (Norwegian-thinking): Sample 15 answers in English. You might see hiking, library scavenger hunts, free outdoor concerts, DIY astronomy, neighborhood photo bingo, etc.
  • SLS (English-thinking): Sample 15 answers in English. You might get several overlaps (movie night variants) and fewer truly different themes.
  • MLS (15 languages): Sample once per language (Hebrew, Occitan, Tagalog, etc.). The combined 15 English answers cover even more ground—cultural festivals, traditional crafts, local history walks, sunrise yoga, language-exchange meetups—expanding the set beyond any single language’s samples.

The secret sauce:

  • Controlling the language of thought realigns the model’s hidden route through ideas.
  • Languages that are geometrically farther from English unlock more distinct routes, so repeats are less likely.
  • Mixing languages stacks complementary routes, so the union set keeps growing longer before it hits a saturation ceiling.

Putting it all together: The method is simple to implement (insert tiny language cues), model-agnostic (works across several LLMs), compositional (mix-and-match languages), and synergistic with temperature (you can still turn that knob). The final answers stay in English for easy use, while the hidden thinking becomes your new diversity dial.

04 Experiments & Results

The tests: The authors used two open-ended diversity benchmarks—NOVELTYBENCH and INFINITY-CHAT—each with 100 questions that don’t have one right answer. This lets them judge how many truly different ideas a model can produce. They tried four LLMs (Qwen3-8B, 14B, 32B; DeepSeek-14B) and 15 thinking languages, always outputting in English. They measured three things: Distinct Score (higher is better diversity), Similarity Score (lower is better diversity), and Output Quality (0–100).

The competition (what they compare against):

  • English-only thinking (the usual default).
  • Non-English Single-Language Sampling (one non-English thinking language, many samples).
  • Mixed-Language Sampling (one sample per different thinking language, aggregate).
  • Plus practical baselines for cultural pluralism: High Temperature (hotter decoding), Request Diversity (explicitly ask for novelty), and Multilingual Prompting (translate inputs into many languages instead of switching thinking language).

Scoreboard with context:

  • Single-Language Sampling (non-English) vs English-only:
    • On NOVELTYBENCH, switching from English-thinking to non-English-thinking improved Distinct Score by about 5.3 to 7.7 points on average, and reduced Similarity Score by about 1.0 to 2.6 points. That’s like going from a B to a solid A on variety.
    • Some languages (e.g., Hebrew, Norwegian, Occitan) often gave the biggest boosts; others closer to English (e.g., Chinese, Malay, French) gave smaller ones.
    • Importantly, the farther a language’s thinking representation was from English in the model’s hidden space, the higher the diversity: strong positive correlations (Pearson r ā‰ˆ 0.72–0.88 across models on NOVELTYBENCH; similarly strong on INFINITY-CHAT).
    • Output Quality dipped only a little on average (ā‰ˆ1–2 points), showing we didn’t have to throw quality out the window to gain variety.

  • Mixed-Language Sampling (MLS) vs Single-Language Sampling (SLS):
    • MLS consistently beat English-only and the average non-English SLS across models and datasets.
    • Often, MLS matched or even exceeded the best single language, without needing to know ahead of time which language would be best. That’s a big practical plus.
    • Removing one language from the MLS set barely changed diversity; removing several languages caused a much larger drop. This shows complementary, not redundant, contributions—like a team where losing one player is fine, but losing half the team really hurts.

  • Scaling effects (number of samples M and temperature):
    • As M increases, any method eventually slows down in finding new distinct answers (saturation), but MLS saturates later, effectively raising the ā€˜diversity ceiling.’
    • Temperature and thinking language stack: MLS at temperature 1.0 can rival or beat English-only at 2.0, suggesting smarter coverage beats just extra randomness.

  • Pluralistic alignment (cultural knowledge and values):
    • On BLEND (countries) and WVS (values), MLS achieved the highest pluralism (normalized entropy) across models, outperforming English-only, High Temperature, Request Diversity, and Multilingual Prompting.
    • In plain terms: thinking in multiple languages helped the model’s English answers cover more countries and value orientations fairly and broadly.

Surprising findings:

  • The hidden-space distances weren’t just pretty pictures; they predicted diversity outcomes well. Languages ā€˜farther’ from English thinking tended to give measurably more diverse answers.
  • Quality held up better than one might fear. Some non-English thinking languages even managed good diversity with strong quality at the same time.
  • Mixing languages delivered robust gains without any single ā€˜magic language’—the power came from the combination.

05 Discussion & Limitations

Limitations (honest and specific):

  • Correlation ≠ causation: We see strong links between hidden-space distance and diversity, but we didn’t train or alter models to prove mechanisms. Future controlled interventions (e.g., pushing languages closer/farther) could verify causality.
  • Cross-lingual alignment side effects: Some training techniques align non-English representations to English. That might shrink distances and reduce the diversity benefits we see here. We don’t yet know how to preserve diversity while aligning.
  • Proxy measures: We used entropy and Distinct/Similarity Scores as stand-ins for diversity/pluralism. Real deployments may need richer, context-aware measures (e.g., sensitivity to cultural nuance and correctness constraints).
  • Language control isn’t perfect: Automatic language ID and simple prefixes mostly work but are not flawless across all models/topics. Leaks or mixed-language thoughts can blur conditions.
  • Compute/latency trade-offs: Diversity comes from repeated sampling. If you need one fast answer, you may not realize the gains unless you sample a few times.

Required resources:

  • An LLM that supports explicit thinking traces (<think> … </think>) and follows short multilingual cues.
  • Access to several target languages (prefix snippets) and optionally translation tools to craft cues.
  • Embedding models and/or judge models to compute diversity and quality metrics if you want to evaluate rigorously.
  • Enough inference budget to draw multiple samples (M ā‰ˆ 5–30 is a practical range; bigger M for maximal diversity).

When NOT to use this approach:

  • Single-truth, high-stakes tasks (e.g., precise calculations, medical dosages, legal citations) where diversity is harmful and accuracy/safety dominate.
  • Extremely low-latency scenarios where multiple samples aren’t feasible.
  • Cases where the model’s multilingual control is weak (it can’t reliably think in the requested language), leading to noisy gains.
  • Strict compliance contexts where exploring culturally varied viewpoints could violate instructions or policies.

Open questions:

  • Smart routing: Can we pick the next thinking language adaptively (based on what we’ve already sampled) to cover gaps faster?
  • Beyond languages: Do dialects, scripts, or code (e.g., Python-thinking) create new useful ā€˜rooms’ in thinking space?
  • Fairness balancing: How do we ensure broader cultural coverage without amplifying stereotypes or harming accuracy?
  • Training-time synergy: Can fine-tuning encourage strong multilingual ā€˜room separation’ to boost inference-time diversity further?
  • Metrics: Can we design better, task-aware diversity/pluralism metrics that capture human-meaningful variety while respecting constraints?

06 Conclusion & Future Work

Three-sentence summary: This paper discovers that the language a model uses for its hidden reasoning acts like a steering wheel for idea exploration: switching from English to other languages moves thinking into different hidden regions and increases the diversity of final English answers. Sampling across multiple thinking languages compounds this effect, raising the diversity ceiling and outperforming common tricks like higher temperature or ā€˜please be diverse’ prompts. The gains generalize across models and tasks, improve cultural pluralism, and come with only small average quality trade-offs.

Main achievement: Establishing ā€œlanguage of thoughtā€ as a simple, controllable, and structurally grounded knob for output diversity—validated by clear geometry in hidden states and strong empirical results, including compositional benefits when mixing languages.

Future directions: Develop adaptive language-of-thought routing that targets uncovered idea regions; explore training-time methods to preserve or enhance beneficial cross-language separation; design richer diversity/pluralism metrics for real-world constraints; test more languages, dialects, and even non-natural ā€˜thinking languages’ like code or logic forms.

Why remember this: It transforms diversity generation from mostly ā€˜shake it harder’ randomness to ā€˜walk a new path’ structure. By shifting the hidden language of thought, we can systematically and predictably broaden the space of ideas an LLM explores—making AI brainstorming more creative, more culturally inclusive, and more useful in everyday life.

Practical Applications

  • Brainstorming sessions: Mix thinking languages to generate broader lists of product names, slogans, or campaign ideas in English.
  • Curriculum design: Produce more varied lesson plans or classroom activities while keeping final text in English for teachers and students.
  • Creative writing: Draft plot twists, character backstories, or settings with mixed-language thinking, then edit the best English outputs.
  • UX copy and microcopy: Explore multiple tones and angles (formal, playful, culturally nuanced) for on-screen text without manual prompt engineering.
  • Policy option generation: Surface more diverse trade-offs and stakeholder views by sampling across multiple thinking languages.
  • Market research: Collect a wider range of hypotheses about user needs or cultural preferences from the same English brief.
  • Ideation for science fairs or hackathons: Generate varied project ideas and experiment angles with minimal setup.
  • Content calendars: Create more diverse social-media post themes and hooks that avoid repetition across weeks.
  • Design alternatives: For product features or UI layouts, gather multiple concept directions explained in English but sourced from different thinking languages.
  • Cultural coverage checks: Use MLS to test if outputs represent multiple countries/values before publishing global content.
Tags: language of thought, output diversity, multilingual reasoning, thinking space, hidden states, cosine distance, PCA visualization, single-language sampling, mixed-language sampling, cultural pluralism, entropy, temperature scaling, diversity metrics, LLM creativity, pluralistic alignment