
The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving

Intermediate
Max Ruiz Luyten, Mihaela van der Schaar · 1/2/2026
arXiv · PDF

Key Summary

  ‱ Modern AI models can get very good at being correct, but in the process they often lose their ability to think in many different ways.
  ‱ This paper explains why that creativity loss happens and shows how to fix it without giving up accuracy.
  ‱ The key idea, called Distributional Creative Reasoning (DCR), trains the whole distribution of reasoning paths—not just the single best answer.
  ‱ DCR adds a 'diversity energy' that rewards both broad exploration (entropy) and truly different ideas (a creativity kernel).
  ‱ The Diversity Decay Theorem predicts three collapse styles: winner-takes-all (STaR), neutral drift (GRPO), and homogenization (DPO).
  ‱ Simply adding randomness or top-k sampling isn’t enough; once a strategy’s probability hits near-zero, these tricks can’t bring it back.
  ‱ With the right diversity design, DCR provably converges to a stable policy that is both correct and creatively diverse.
  ‱ A practical recipe is given: pick a semantic similarity kernel, gate it to only reward diversity among correct solutions, and tune two knobs (α and ÎČ).
  ‱ Batch noise doesn’t fix collapse and can actually speed it up; structured diversity pressure is necessary.
  ‱ This work turns diversity from a guess-and-check heuristic into a principled part of training.

Why This Research Matters

Real problems change their rules often, and models that think in only one way break easily when that happens. By keeping a portfolio of correct, distinct strategies, DCR makes AI more robust to surprises in schoolwork, coding, science, and day-to-day planning. This reduces brittle failures, improves creative brainstorming, and supports better problem-solving in unfamiliar territory. It also offers a principled way to balance accuracy with exploration, instead of relying on trial-and-error heuristics. In safety-critical settings, diversity acts like a backup plan when the main strategy fails. Ultimately, this work helps AI think more like a resourceful teammate than a single-track machine.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine your class always solves math problems the exact same way because it’s the fastest way to get full marks. It works—until a tricky new question shows up that needs a fresh idea. Suddenly, the class gets stuck.

đŸ„Ź The Concept (Creativity Collapse): Creativity collapse is when a model’s answers all start to look the same because training only rewards what’s already working.

  • What it is: The model’s spread of different reasoning paths shrinks, and a few “templates” dominate.
  • How it works: (1) Start with a model that knows many ways to think, (2) Train it to pick the highest-scoring reasoning path, (3) Repeatedly reward those same paths, (4) Other good paths fade away.
  • Why it matters: Without variety, the model fails on new or weird problems that need different thinking.

🍞 Anchor: It’s like a music playlist that only repeats one hit song. Great at first, but terrible when you want music for a different mood.

The World Before: Large language models (LLMs) were trained in two broad stages. First, supervised fine-tuning (SFT) taught them good manners and known solutions. Then, reinforcement learning (like RLHF, GRPO, STaR, DPO) pushed them to be more correct by rewarding high-scoring answers. This raised test scores but often squeezed out variety. People noticed the model’s “semantic entropy” (how surprising and varied its ideas are) dropped a lot—stories sounded samey, math steps repeated a few patterns, and brainstorming lost sparkle.

đŸ„Ș New Concept: KL Penalty 🍞 Hook: You know how your parents say “Stay close to home” when you go out? That keeps you safe but limits exploring new places. đŸ„Ź The Concept: KL-divergence penalties keep the new model close to the base model.

  • What it is: A rule that says “Don’t drift too far from where you started.”
  • How it works: (1) Compare the current model to the base, (2) Measure how different they are, (3) Add a penalty if the difference grows, (4) Training prefers staying near home.
  • Why it matters: It preserves some variety but can also block useful, far-away creative strategies. 🍞 Anchor: It’s like a leash: you won’t get lost, but you also can’t reach the cool treehouse down the street.
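
A tiny numeric sketch of the leash idea (illustrative only; the probabilities and the uniform base are made up, not taken from the paper): the KL divergence between the tuned policy and the base policy over a handful of reasoning paths grows as the tuned policy drifts, and the penalty grows with it.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions over the same reasoning paths."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# The base model spreads weight over four reasoning paths;
# the tuned model has started to concentrate on path 0.
base  = [0.25, 0.25, 0.25, 0.25]
tuned = [0.70, 0.15, 0.10, 0.05]

print(kl_divergence(tuned, base))   # ~0.47: noticeable drift, so a noticeable penalty
print(kl_divergence(base, base))    # 0.0: no drift, no penalty
```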

The Problem: When training only chases a single reward like correctness, the model’s reasoning becomes a monoculture. Prior work found:

  • STaR tends to reward the single best-looking chain of thought.
  • GRPO can float along with no push to diversify, so randomness nudges it toward fewer strategies over time.
  • DPO tends to make preferred solutions look more alike, evening out probabilities without guarding for truly different ideas.

Failed Attempts:

  • Entropy bonuses: These add general randomness, but they don’t specifically encourage different ideas; they often just keep noise.
  • Sampling tricks (top-k, temperature): They can spread answers at inference, but if training already pushed some strategies near zero, sampling won’t resurrect them.
  • External quality-diversity lists: They can generate diverse outputs, but when you distill back into the model, the variety often shrinks again.

The Gap: We needed a single, principled framework that (1) explains why collapse happens, (2) predicts how it looks under different algorithms, and (3) gives a solid recipe to keep both correctness and creativity.

Real Stakes:

  • Schoolwork: A model that only knows one way to solve fractions will freeze on a twisty word problem.
  • Coding: If it memorizes one design pattern, it struggles with unusual bugs.
  • Science & design: Discovery demands trying several good angles, not just the usual path.
  • Safety & robustness: Diverse reasoning is a hedge—if one method fails, others can still succeed.

đŸ„Ș New Concept: Portfolio of Strategies 🍞 Hook: Think of a soccer team. You don’t want 11 goalkeepers—you want strikers, defenders, and midfielders for different plays. đŸ„Ź The Concept: A portfolio of strategies is a healthy mix of ways to reason.

  • What it is: The model keeps multiple high-quality reasoning styles ready.
  • How it works: (1) Learn many solutions that work, (2) Keep them alive during training, (3) Use the best fit for each new problem, (4) Keep exploring new angles.
  • Why it matters: One trick pony = stuck on new tasks; many plays = resilience and generalization. 🍞 Anchor: When a puzzle changes rules mid-game, the team with more plays still scores.

This paper’s answer is Distributional Creative Reasoning (DCR): instead of training the one “best” trace, DCR trains the whole distribution over reasoning traces and adds a special diversity energy that rewards both breadth and meaningful differences. It unifies popular methods (like STaR, GRPO, DPO) and shows when and why they collapse—and how to stop it.

02 Core Idea

🍞 Hook: You know how a garden stays healthy when it has many kinds of plants, not just one? If you only grow one plant, a single pest can ruin everything.

đŸ„Ź The Concept (DCR – Distributional Creative Reasoning): DCR is a way to train AI so it keeps a healthy garden of reasoning paths—many correct and different ways to solve a problem.

  • What it is: A single training objective that balances utility (correctness) with structured diversity across the whole distribution of solution traces.
  • How it works: (1) Look at the full spread of possible solutions, (2) Score them for usefulness (utility), (3) Add diversity energy that rewards breadth (entropy) and penalizes piling onto similar ideas (kernel coverage), (4) Keep the model from drifting too far from its base using a KL term, (5) Follow the gradient flow so the whole distribution moves toward a stable, diverse sweet spot.
  • Why it matters: Without this, models collapse into a few templates; with DCR, they remain both right and resourcefully varied.

🍞 Anchor: Instead of training one champion runner, DCR trains a relay team—sprinter, long-distance runner, hurdler—so the team can win any race style.

The “Aha!” Moment in one sentence: Train the distribution of reasoning paths with a strictly concave diversity energy so the model naturally converges to a correct and creatively diverse policy, preventing collapse.

Three Analogies:

  1. Orchestra: Utility is playing the right notes; entropy ensures many instruments are active; the kernel discourages too many violins playing the same melody, nudging in woodwinds and brass for richer harmony; KL keeps the style close to the original score.
  2. City Traffic: Utility is getting to destinations fast; entropy opens more routes; the kernel prevents everyone from crowding one highway; KL keeps maps similar to the reliable base map.
  3. Sports Team: Utility picks players who score; entropy gets more players onto the field; the kernel prevents a team of only strikers; KL keeps the playbook grounded.

Before vs After:

  • Before: Objectives centered on a single scalar reward (correctness) made models narrow and brittle. Tricks like entropy bonuses or sampling couldn’t reliably revive lost strategies.
  • After: A principled objective (utility + diversity energy + KL) shapes learning so correct and distinct strategies co-exist, with a proof that training converges to a unique, diverse equilibrium.

Why It Works (intuition, no equations):

  • When you only reward the top answers, probabilities flow to a few patterns; others starve.
  • Adding raw entropy keeps some spread but doesn’t ensure those extra ideas are truly different.
  • The creativity kernel measures similarity between solutions and gently pushes probability away from clusters of near-duplicates, making room for qualitatively new strategies.
  • A small KL term and an entropy barrier steady the system so it doesn’t fall off the edge.
  • Because the diversity energy is concave, the optimization landscape has one clear hilltop inside the simplex: the training flow climbs there and stays.

Building Blocks (Sandwich-explained; a small code sketch of the diversity terms follows this list):

  • 🍞 Hook: Picking a snack from a vending machine—you like variety but still want healthy options. đŸ„Ź Shannon Entropy:

    • What it is: A measure that rewards spreading probability across many options.
    • How it works: (1) Notice when the model is putting all its weight on a few answers, (2) Reward it for distributing weight more evenly, (3) Keep exploration alive, (4) Avoid total randomness by balancing with utility.
    • Why it matters: Without it, the model gets too certain too fast and stops exploring better options. 🍞 Anchor: It’s like making sure you don’t always pick the same snack—you’ll try apples sometimes, not just chips.
  • 🍞 Hook: Two book reports that use different words but the same ideas aren’t really different. đŸ„Ź Kernel Coverage:

    • What it is: A similarity-aware penalty that discourages piling probability on very similar solutions.
    • How it works: (1) Measure how close two solutions are in meaning, (2) If a bunch cluster together, add a penalty, (3) Encourage shifting weight toward other, distinct approaches, (4) Focus this only among correct solutions using a verifier.
    • Why it matters: Entropy alone can protect nonsense or near-duplicates; kernel coverage carves out truly new thinking. 🍞 Anchor: If five kids say the same answer in different words, the teacher still wants a fresh idea from someone else.
  • 🍞 Hook: Training wheels stop you from wobbling too far while you learn. đŸ„Ź KL-Divergence:

    • What it is: A gentle tether to the base model.
    • How it works: (1) Compare current and base distributions, (2) Penalize big drifts, (3) Keep updates stable, (4) Reduce wild swings.
    • Why it matters: Prevents the model from forgetting useful base skills while exploring diversity. 🍞 Anchor: You can ride farther once you’re steady; until then, the tether keeps you safe.
  • 🍞 Hook: Sliding downhill to the valley’s lowest point is the fastest way to settle. đŸ„Ź Gradient Flow:

    • What it is: A smooth way to update the whole distribution toward better utility and diversity.
    • How it works: (1) Compute which directions increase the objective most, (2) Move a bit that way, (3) Repeat steadily, (4) Converge to the unique interior best mix.
    • Why it matters: Guarantees the training won’t bounce forever or collapse to a corner if we include the right diversity energy. 🍞 Anchor: It’s like water flowing to a calm lake at the bottom—once there, it stays.
  • 🍞 Hook: If everyone copies the same kid’s homework, the class stops learning. đŸ„Ź Diversity Decay Theorem:

    • What it is: A diagnosis of how popular training recipes collapse diversity.
    • How it works: (1) STaR: the earliest winner grabs more and more weight, (2) GRPO: balance is fragile; random bumps push to a corner, (3) DPO: probabilities get equalized among preferred answers but not made meaningfully different.
    • Why it matters: Knowing the failure mode tells us which tool to add (the diversity energy) to fix it. 🍞 Anchor: It’s a doctor’s note: here’s what’s sick (collapse), here’s how it shows up (three modes), here’s the medicine (DCR).
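
To make the entropy and kernel-coverage building blocks concrete, here is a minimal numpy sketch (illustrative only; the toy embeddings and the cosine kernel are assumptions, not the paper's exact choices). It shows why entropy alone is too blunt: a distribution spread over near-duplicate traces still looks fairly "diverse" by entropy, but kernel coverage flags the pile-up.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy: rewards spreading probability across many traces."""
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log(p + eps)))

def kernel_coverage(p, K):
    """Similarity-weighted concentration, sum_ij p_i * K_ij * p_j.
    High when probability piles onto traces the kernel considers alike."""
    p = np.asarray(p, float)
    return float(p @ K @ p)

# Toy embeddings for four traces: traces 0-2 are near-duplicates, trace 3 is distinct.
E = np.array([[1.00, 0.00],
              [0.98, 0.05],
              [0.97, 0.08],
              [0.00, 1.00]])
E = E / np.linalg.norm(E, axis=1, keepdims=True)
K = E @ E.T                                     # simple cosine-similarity kernel

spread_over_clones = [0.33, 0.33, 0.33, 0.01]   # "broad" by entropy, but really one idea
genuinely_diverse  = [0.35, 0.15, 0.15, 0.35]   # keeps the distinct strategy alive

for p in (spread_over_clones, genuinely_diverse):
    print(round(entropy(p), 3), round(kernel_coverage(p, K), 3))
# Entropy changes only modestly (~1.14 vs ~1.30), but kernel coverage drops sharply
# (~0.98 vs ~0.56) for the distribution that keeps the genuinely different strategy.
```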

03 Methodology

At a high level: Prompt → Sample many reasoning paths → Score correctness (utility) → Add diversity energy (entropy + kernel coverage) → Add KL-to-base stability → Combine into one objective → Update the whole distribution (gradient flow/SGD) → Repeat.

Step-by-step (with Sandwich explanations for the key pieces):

  1. Input and Sampling
  • What happens: For a prompt (like a math problem), the model samples multiple solution traces (chains of thought, code, or action steps). Think of these as many little journeys to the same destination.
  • Why this step exists: If you only look at one path, you can’t know whether there were other good routes.
  • Example: Solve 23×17. One trace multiplies directly; another breaks 17 into 10+7; a third uses (20+3)×17.
  2. Utility Scoring (Correctness)
  • What happens: Each trace gets a utility score, often 1 if correct and 0 if wrong (or a shaped score if partially right).
  • Why it exists: Without caring about correctness, the model could become diverse but useless.
  • Example: The trace that produces 391 is correct; others that reach a wrong number get 0.
  3. Diversity Energy = Entropy + Kernel Coverage
  • 🍞 Hook: Picture a buffet—you want lots of dishes (entropy) and you don’t want them all to be pasta (kernel coverage). đŸ„Ź Entropy (breadth):

    • What happens: Encourage spreading probability across multiple options so exploration stays alive.
    • Why it exists: Prevents early overconfidence in one path.
    • Example: If the model keeps picking only the (20+3)×17 path, entropy pushes it to try the (10+7) path too. 🍞 Anchor: It’s like a teacher saying, “Try at least two methods.”

    đŸ„Ź Kernel Coverage (distinctiveness):

    • What happens: Compute a similarity between pairs of traces (using embeddings, structure, or steps). If too many similar traces hog the probability, add a penalty.
    • Why it exists: Stops cosmetic rewordings from counting as diversity; pushes the model to find truly different reasoning styles.
    • Example: If five traces use the exact same algebra steps with slightly different words, kernel coverage says, “That’s one idea, not five—now try a factoring approach.” 🍞 Anchor: It’s like grading projects—you can’t all submit the same poster with different colors.

    Gating to correctness: To avoid rewarding “different ways to be wrong,” the kernel can be applied only among correct traces (using a verifier that flags which outputs are correct). This keeps pressure focused on finding multiple right ways.

  4. KL-to-Base Regularization
  • What happens: Compare current and base distributions; penalize big drifts.
  • Why it exists: Stabilizes learning so it doesn’t forget useful base knowledge.
  • Example: If the base model is decent at mental math routines, KL discourages throwing them all away at once.
  5. Combine into One Objective J(p)
  ‱ What happens: Add utility + λ·(α·entropy − ÎČ·kernel_coverage) − ÎČ_KL·KL into a single score J(p).
  ‱ Why it exists: One scoreboard makes trade-offs explicit and tunable via simple knobs (α for entropy weight inside diversity, ÎČ for kernel strength, and ÎČ_KL for stability).
  ‱ Example: Increase α to widen exploration; increase ÎČ to insist on distinct strategies among correct solutions; increase ÎČ_KL if updates get too wild. (A minimal code sketch of this combination appears after this list.)
  6. Gradient Flow / SGD Update
  • 🍞 Hook: Imagine nudging marbles on a rubber sheet so they roll toward the highest point. đŸ„Ź Gradient Flow over the Distribution:

    • What happens: Compute how to nudge the entire probability distribution so J(p) increases the most, then step a little.
    • Why it exists: Ensures steady, on-simplex updates and convergence to a unique interior solution when diversity energy is properly set.
    • Example: If similar correct traces hog mass, the kernel term pushes some mass toward alternative correct traces; if wrong traces keep weight, utility and entropy balance nudge them down. 🍞 Anchor: Like steering a canoe: small, steady paddles in the right direction keep you headed to shore.

    Practical tricks:

    • Mini-batches: Estimate diversity gradients using pairwise similarities among sampled traces (cost ~B^2 per batch, standard in contrastive learning).
    • Entropy barrier: Keep a tiny entropy term so no path probability hits zero (helps stability and uniqueness).
    • Clip-and-renormalize: Prevent tiny probabilities from vanishing and avoid numerical blow-ups.
  7. The Secret Sauce: The Creativity Kernel
  • What makes it clever: It’s not just “be random.” It’s “be meaningfully different.” By learning a semantic kernel (e.g., from embeddings of full chains of thought or math-proof graphs), the model can measure idea-level similarity and push toward genuinely new, correct strategies.
  • Why this matters: Entropy alone can protect fluff; the kernel focuses on diversity that actually expands the toolbox.
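
Following step 5 above, here is a minimal sketch of one way the combined objective could look for a batch of sampled traces. This is an illustrative reading of the text, not the paper's exact equation; the knob names follow the article, while the function name, the default values, and the toy kernel are assumptions, and `correct` would come from a verifier.

```python
import numpy as np

def dcr_objective(p, p_base, utility, K, correct,
                  lam=1.0, alpha=1.0, beta=0.5, beta_kl=0.1, eps=1e-12):
    """J(p) = utility + lam*(alpha*entropy - beta*gated kernel coverage) - beta_kl*KL.

    p, p_base : probabilities the tuned / base policy assigns to the sampled traces
    utility   : per-trace correctness scores (1/0 or shaped)
    K         : pairwise semantic-similarity kernel between traces
    correct   : verifier mask; gates the coverage penalty to correct traces only
    """
    p, p_base = np.asarray(p, float), np.asarray(p_base, float)
    utility = np.asarray(utility, float)
    gate = np.asarray(correct, float)           # 1.0 for verified-correct traces, else 0.0

    expected_utility = float(p @ utility)
    ent = float(-np.sum(p * np.log(p + eps)))
    # Penalize piling probability onto similar *correct* traces ("many right ways"),
    # so diversity pressure never rewards different ways of being wrong.
    coverage = float((p * gate) @ K @ (p * gate))
    kl = float(np.sum(p * (np.log(p + eps) - np.log(p_base + eps))))

    return expected_utility + lam * (alpha * ent - beta * coverage) - beta_kl * kl

# Toy check: traces 0 and 1 are near-duplicate correct solutions, trace 2 is a distinct
# correct solution, trace 3 is wrong. Shifting mass toward the distinct correct trace
# keeps the same accuracy but scores higher under the combined objective.
K = np.array([[1.0, 0.9, 0.2, 0.3],
              [0.9, 1.0, 0.2, 0.3],
              [0.2, 0.2, 1.0, 0.3],
              [0.3, 0.3, 0.3, 1.0]])
args = dict(p_base=[0.25] * 4, utility=[1, 1, 1, 0], K=K, correct=[1, 1, 1, 0])
print(dcr_objective(p=[0.45, 0.45, 0.05, 0.05], **args))   # ~1.54
print(dcr_objective(p=[0.30, 0.30, 0.35, 0.05], **args))   # ~1.90
```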

Concrete in-action example (word problem):

  • Prompt: “A farmer has 36 apples and wants to make bags of equal size with none left over. What bag sizes are possible?”
  • Traces: A) Factor 36 into primes, then list divisors. B) Enumerate 1..36 and test divisibility. C) Use pairs (1,36), (2,18), (3,12), (4,9), (6,6), reflect symmetry.
  • Utility: All three are correct when they list the same divisors.
  • Entropy: Keeps A, B, C from collapsing into just A.
  • Kernel coverage: Notices A and C are related but not identical; if A dominates, it boosts B or C to maintain distinct approaches.
  • KL: Keeps overall style similar to the base model’s math persona.
  • Update: Increase probability for A, B, C together but not as clones; some mass goes from near-duplicates back to underused, valid styles.
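
For concreteness, the three traces above can be written as three genuinely different little programs that reach the same answer. This is a hypothetical illustration of "multiple correct strategies", not code from the paper:

```python
def divisors_by_factoring(n=36):
    """Trace A: factor n into primes, then combine the prime powers."""
    factors, d, m = {}, 2, n
    while d * d <= m:
        while m % d == 0:
            factors[d] = factors.get(d, 0) + 1
            m //= d
        d += 1
    if m > 1:
        factors[m] = factors.get(m, 0) + 1
    divisors = [1]
    for p, e in factors.items():
        divisors = [x * p**k for x in divisors for k in range(e + 1)]
    return sorted(divisors)

def divisors_by_enumeration(n=36):
    """Trace B: test every candidate from 1 to n."""
    return [d for d in range(1, n + 1) if n % d == 0]

def divisors_by_pairing(n=36):
    """Trace C: collect pairs (d, n // d) up to sqrt(n), then use symmetry."""
    small = [d for d in range(1, int(n**0.5) + 1) if n % d == 0]
    return sorted(set(small + [n // d for d in small]))

# All three strategies agree (utility = 1 for each), yet they are semantically distinct,
# which is exactly the kind of variety the kernel term is meant to keep alive.
assert divisors_by_factoring() == divisors_by_enumeration() == divisors_by_pairing()
print(divisors_by_factoring())   # [1, 2, 3, 4, 6, 9, 12, 18, 36]
```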

Recipe summary:

  • Choose α (breadth) and ÎČ (distinctiveness) so that incorrect traces remain suppressed (tune using the paper’s inequality: kernel penalty among correct traces should not overpower the 1-point utility gain).
  • Use a correctness gate on the kernel.
  • Use small KL (and tiny entropy barrier) for stability.
  • Train with batched traces, compute pairwise kernel similarities, and step via SGD.
  • Monitor both accuracy and semantic diversity among correct solutions.
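
For the last bullet, one rough way to track both numbers per batch (an illustrative sketch, not the paper's protocol; the embedding source and the mean-pairwise-distance metric are assumptions):

```python
import numpy as np

def monitor_batch(embeddings, correct):
    """Return (accuracy, semantic diversity) for one batch of sampled traces.
    Diversity = mean pairwise cosine distance among the verified-correct traces."""
    correct = np.asarray(correct, bool)
    accuracy = float(correct.mean())

    E = np.asarray(embeddings, float)[correct]
    if len(E) < 2:
        return accuracy, 0.0                        # not enough correct traces to compare
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    off_diag = sims[~np.eye(len(E), dtype=bool)]
    return accuracy, float(1.0 - off_diag.mean())   # higher = more distinct correct solutions

# Example: 4 sampled traces, 3 verified correct, two of which are near-duplicates.
emb = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0], [0.5, 0.5]]
print(monitor_batch(emb, correct=[True, True, True, False]))
```

Under the paper's predictions, a collapsing run would show this diversity number falling even while accuracy stays flat.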

04 Experiments & Results

The Test (what they measured and why):

  • Because the paper is theoretical, the primary “tests” are mathematical: analyze how different training objectives change the distribution over reasoning paths. The goal is to understand and predict diversity dynamics—will the model keep many good ideas or collapse to a few?

The Competition (what was compared):

  • STaR: Self-Taught Reasoner
  ‱ GRPO: Group Relative Policy Optimization, a reinforcement learning approach widely used for math reasoning
  • DPO: Direct Preference Optimization
  • DCR: The proposed framework with diversity energy (entropy + kernel coverage) and optional KL

The Scoreboard (with context):

  • Under scalar-only objectives (little-to-no diversity energy):
    • STaR: Winner-takes-all. If one correct trace gets ahead, it keeps snowballing. That’s like running an election where early votes decide everything.
    • GRPO: Neutral drift. Initially balanced, but randomness slowly tilts the system until one or a few strategies dominate—like a fair coin that, over many flips with slight nudges, ends up always choosing heads.
    • DPO: Homogenization. It evens out probabilities among preferred solutions but doesn’t push for truly different ideas—like handing everyone the same uniform but calling it “variety.”
  • With DCR (diversity energy on):
    • Guaranteed convergence to a unique, stable, and diverse interior solution. This is like getting an A+ on both correctness and creativity where others got stuck balancing B- correctness with D-level creativity.

Surprising Findings:

  • Noise is not your friend: Mini-batch randomness doesn’t save diversity; it often speeds up collapse in GRPO and can break symmetries in DPO.
  • Entropy-only is too blunt: It can keep some spread but doesn’t ensure structural novelty; you need the kernel to prefer different strategies, not just different wordings.
  • DPO’s equalization seems helpful, but it tends to push probabilities toward similar, often longer traces without ensuring conceptual distinctness.

Interpretation of Theoretical Guarantees:

  • Diversity Decay Theorem: Formally proves the collapse modes for STaR (deterministic fixation), GRPO (stochastic drift to corners), and DPO (equalization without semantic novelty) when diversity energy is weak or absent.
  • Convergence Theorem for DCR: If you include the diversity energy with the right conditions (concavity via entropy barrier, PSD kernel), gradient flow climbs to a single best distribution that’s both high-utility and diverse.

What to look for in practice (testable predictions):

  • If you run RLHF-like pipelines that optimize a single reward, you should see entropy and embedding-spread drop after RL—less creative variety than after SFT.
  • If you add kernel-driven diversity focused on correct traces, you should measure increased semantic variety among correct solutions without losing accuracy.

A note on numbers: The paper centers on theory and provides a blueprint; where it references experiments, it does so to validate the predicted modes of collapse qualitatively (e.g., entropy drops, solution clustering).

05 Discussion & Limitations

Limitations:

  • Kernel quality matters: If your semantic kernel can’t tell truly different ideas apart, the model might still protect near-duplicates.
  • Verifier dependence: Gating diversity to correct traces needs a correctness signal; weak or noisy verifiers can misguide diversity pressure.
  • Compute overhead: Kernel coverage uses pairwise similarities in a batch (O(B^2)); this is standard for contrastive learning but costs more than simple rewards.
  • Tuning trade-offs: Set α (breadth) too high and you may protect wrong ideas; set ÎČ (distinctiveness) too high and you may over-push away from efficient shared patterns.
  • Domain specificity: What counts as “semantically similar” differs across math, code, or writing; kernels may need domain-tailored features.

Required Resources:

  • A base model and training loop that can sample multiple traces per prompt.
  • A way to score correctness (utility) and, ideally, a verifier to gate the kernel to correct traces.
  • An embedding model or structural features to build a PSD semantic kernel.
  • Enough batch size and compute to handle O(B^2) kernel gradients (like many contrastive setups).
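
For the embedding bullet above, here is a quick sketch of one standard way to get a positive semidefinite (PSD) kernel from trace embeddings, using an RBF kernel. The embeddings and the gamma value are placeholders; a real setup would use a sentence-embedding model or structure-aware features.

```python
import numpy as np

def rbf_kernel(embeddings, gamma=1.0):
    """PSD semantic kernel: K[i, j] = exp(-gamma * ||e_i - e_j||^2) over normalized embeddings."""
    E = np.asarray(embeddings, float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)     # compare directions, not magnitudes
    sq_dists = np.sum((E[:, None, :] - E[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# Toy embeddings for four chains of thought.
K = rbf_kernel([[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.2], [0.1, 0.2, 1.0]])
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))   # True: non-negative eigenvalues, so K is PSD
```

The RBF form is PSD by construction, which is the condition the convergence result asks of the kernel.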

When NOT to Use:

  • Tasks with a single canonical solution and no benefit from multiple strategies (e.g., fixed-format parsing) may not need kernel-driven diversity.
  • Extremely low-resource settings where the O(B^2) cost is prohibitive.
  • Situations where the base model is already dangerously off-target; KL stabilizers may need to be stronger first.

Open Questions:

  • How to learn the best kernel automatically per domain? Can we co-train the kernel with the model?
  • How to design verifiers that are accurate and cheap across domains (math, code, reasoning, creative writing)?
  • What are the best metrics for semantic diversity among correct solutions beyond entropy and embedding spread?
  • Can we combine DCR with safety alignment objectives without hurting either creativity or safety?
  • How does DCR interact with curriculum learning (easier-to-harder tasks) and tool use (like calling calculators or theorem provers)?

06 Conclusion & Future Work

3-Sentence Summary: This paper shows that popular ways of training models for correctness often shrink their creativity, making them brittle when problems change. It introduces Distributional Creative Reasoning (DCR), which adds a strictly concave diversity energy—entropy for breadth and a creativity kernel for distinctiveness—to train the whole distribution of reasoning paths. With DCR, training provably converges to a single, stable policy that is both correct and creatively diverse.

Main Achievement: A unified, principled framework (and proofs) that explains creativity collapse (the Diversity Decay Theorem), predicts how different algorithms fail, and gives a practical, tunable recipe to prevent collapse while keeping or improving accuracy.

Future Directions:

  • Learn better domain-specific semantic kernels and correctness gates automatically.
  • Develop lightweight verifiers and efficient approximations to reduce O(B^2) costs.
  • Extend DCR to multi-step tool use, program synthesis, and collaborative agents.
  • Design robust metrics for semantic diversity among correct solutions across domains.

Why Remember This: It reframes training from chasing a single best answer to sculpting a healthy ecosystem of good answers—like teaching not just the class valedictorian, but the whole classroom to think in strong, different ways. That shift makes models more robust, inventive, and ready for the real world’s surprises.

Practical Applications

  ‱ Train reasoning models with DCR to sustain multiple valid solution styles for math, code, and planning.
  ‱ Use a correctness-gated kernel so diversity rewards focus on correct solutions, not on creative mistakes.
  ‱ Tune α (entropy) for breadth and ÎČ (kernel) for distinctiveness; ensure kernel pressure among correct traces doesn’t overpower correctness.
  ‱ Design or learn a semantic kernel with embeddings of full chains of thought, or with structure-aware features (e.g., proof graphs).
  ‱ Add a small entropy barrier and modest KL-to-base to keep training stable and avoid probability mass collapsing to zero.
  ‱ Monitor semantic diversity among correct traces with embedding spread or clustering, not just entropy.
  ‱ Prefer DCR-style training over inference-only tricks (e.g., top-k) so lost strategies can be recovered during learning.
  ‱ Batch your training to compute pairwise kernel similarities (O(B^2)) and consider memory-saving approximations if needed.
  ‱ Gate exploration over time: start with higher α, then gradually increase ÎČ to solidify distinct correct strategies.
  ‱ Use DCR in curriculum learning to preserve new strategies learned at easier stages when moving to harder tasks.
Tags: Distributional Creative Reasoning ‱ diversity energy ‱ creativity kernel ‱ Shahshahani gradient flow ‱ Diversity Decay Theorem ‱ STaR ‱ GRPO ‱ DPO ‱ semantic entropy ‱ RLHF ‱ preference collapse ‱ quality-diversity ‱ kernel coverage ‱ entropy regularization ‱ policy simplex