
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs

Intermediate
Zhiyuan Hu, Yucheng Wang, Yufei He et al. Ā· 1/13/2026
arXiv Ā· PDF

Key Summary

  • The paper fixes a common problem in training AI reasoners: models get stuck using the same favorite solution style and stop exploring new ways to solve problems.
  • Instead of rewarding every correct answer equally, the method gives extra credit to correct answers that use rare, genuinely different strategies.
  • To tell which solutions are truly different, a separate judge model groups multiple attempts for the same problem by high-level strategy (not wording).
  • During learning, the model’s updates are reweighted so rare-but-correct strategies count more, while common-but-correct strategies count less.
  • This keeps exploration alive and improves pass@k, which measures how often at least one out of k tries is correct.
  • Across math, physics, and medical reasoning tests, the method raises the overall area under the pass@k curve (AUC@K) without hurting pass@1.
  • It also keeps token-level entropy healthier (less collapse) and covers more human-style solution ideas (higher cover@n).
  • The approach is simple to add on top of standard GRPO-style reinforcement learning: only the advantage term is changed.
  • Downsides include extra compute for the judge and possible misgrouping of strategies on ambiguous cases.
  • Overall, this uniqueness-aware RL makes AI problem solvers more creative and reliable when you can look at multiple answers.

Why This Research Matters

In real life, we often want more than one idea—think second opinions from doctors or alternative solution paths in math. This method trains AI to keep multiple correct strategies alive, so when you ask for several answers, they aren’t just rephrases of the same plan. That makes the AI more dependable on hard problems where one approach may fail. It also reduces the risk of ā€œgetting stuckā€ in one way of thinking as models keep learning. By directly rewarding rare, correct strategies, the AI becomes more creative without sacrificing accuracy. This leads to better tools for education, science, engineering, and healthcare where diverse reasoning is essential.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine a classroom where every student copies the same solution from the board. They get today’s homework right, but when the test changes, many of them struggle because they never learned other ways.

🄬 The Concept: Reinforcement Learning (RL)

  • What it is: RL is a way to teach AI by giving rewards for good behavior and penalties for bad behavior.
  • How it works:
    1. The AI tries an action (like writing a solution).
    2. A reward system checks if it’s good (e.g., correct answer = 1; wrong = 0).
    3. The AI updates itself to make good actions more likely next time.
  • Why it matters: Without RL, models may not steadily improve at tricky, multi-step reasoning where feedback guides better habits. šŸž Anchor: Like training a dog: sit = treat, don’t sit = no treat. Over time, the dog sits faster.

šŸž Hook: You know how you don’t just say words—you tell a story with steps? The same is true when an AI solves a problem.

🄬 The Concept: Rollout

  • What it is: A rollout is one full try at solving a problem, from first thought to final answer.
  • How it works:
    1. Start from the question.
    2. Think step by step (the reasoning trace).
    3. Produce a final answer.
  • Why it matters: If we only look at single words (tokens) and not full tries, we miss whether two attempts used the same plan or a different plan. šŸž Anchor: Two essays might use different sentences but still follow the same outline—that’s one plan, not two.

šŸž Hook: Having a bigger crayon box doesn’t mean you’re drawing new pictures—it might just be brighter colors of the same drawing.

🄬 The Concept: Token Diversity

  • What it is: Token diversity means using different words or wording in the text.
  • How it works:
    1. The model picks words with some randomness.
    2. Higher entropy means a wider variety of word choices.
    3. Outputs can look different on the surface.
  • Why it matters: If we only chase token variety, we may get different-looking answers that follow the same old plan, so we don’t truly explore new solution ideas. šŸž Anchor: Saying ā€œcomputeā€ vs ā€œcalculateā€ changes the word, not the strategy.

šŸž Hook: Imagine playing a new game but only pressing the one button that worked once—you’ll never learn the other moves.

🄬 The Concept: Exploration–Exploitation Trade-off

  • What it is: It’s the balance between trying new ideas (exploration) and repeating what already works (exploitation).
  • How it works:
    1. Early on, you try many ideas.
    2. As you find winners, you reuse them more.
    3. If you overdo reuse, you stop learning better or different ways.
  • Why it matters: Without healthy exploration, the AI improves a single favorite style but fails to grow a toolkit of strategies. šŸž Anchor: If your favorite chess opening is all you ever use, you’ll be stuck when your opponent counters it.

šŸž Hook: Think of a bakery that only bakes chocolate cookies because they sell well, and then customers who want variety leave.

🄬 The Concept: Exploration Collapse

  • What it is: When training keeps shrinking variety, the model gets stuck in a few dominant solution patterns.
  • How it works:
    1. Rewards boost the current best pattern.
    2. Competing patterns get less practice and fade.
    3. Over time, almost all attempts look the same.
  • Why it matters: Performance on one try (pass@1) may go up, but the chance that at least one of many tries is right (pass@k) stops improving. šŸž Anchor: If every brainstorm gives the same idea, you’re not really brainstorming.

šŸž Hook: Imagine a team where a coach only praises the most common play and ignores clever new ones.

🄬 The Concept: pass@k

  • What it is: pass@k measures how often at least one out of k tries is correct.
  • How it works:
    1. For each problem, sample k solutions.
    2. If any is correct, it’s a pass.
    3. Average over many problems.
  • Why it matters: Real users often look at more than one answer. Keeping multiple distinct strategies boosts the odds that one works. šŸž Anchor: If you can submit 10 guesses, having truly different guesses beats 10 copies of the same guess.
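To make this concrete, here is a small Python sketch of how pass@k is typically estimated from n sampled attempts per problem. The combinatorial formula is the standard unbiased estimator from the code-generation literature; whether this paper uses exactly this estimator or a simpler empirical count is not specified here, so treat it as an illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total attempts of which c were correct."""
    if n - c < k:                        # too few wrong attempts to fill k draws: guaranteed hit
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy benchmark: 4 problems, n = 8 attempts each, correct counts per problem.
correct_counts = [3, 0, 1, 8]
n = 8
for k in (1, 2, 4, 8):
    score = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    print(f"pass@{k} = {score:.3f}")
```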

šŸž Hook: You can say the same thing with different words, or you can use a totally different plan. Only one of these is real creativity.

🄬 The Concept: Strategy-Level Diversity

  • What it is: It’s variety in high-level solution plans, not just in wording or small steps.
  • How it works:
    1. Identify the big idea behind a solution (e.g., factorization vs quadratic formula).
    2. Group solutions by their big ideas.
    3. Prefer sets that cover multiple big ideas.
  • Why it matters: pass@k improves when the model carries several distinct ways to solve the same problem. šŸž Anchor: Solving x^2āˆ’5x+6=0 by factoring vs using the quadratic formula: two real strategies, not just two phrasings.

The world before this paper: RL for large language models (LLMs) often rewarded correctness and sometimes token-level variety (like entropy bonuses). That made outputs look different, but the big ideas stayed the same. As training continued, models would lock onto a small set of ā€œsafeā€ strategies—great for one-shot accuracy but not for creative problem solving across many tries.

The problem: We were regularizing the wrong thing. We nudged tokens (local choices) rather than sets of rollouts (full-solution attempts). The result: exploration collapse. We needed a way to measure and reward true strategy-level variety.

Failed attempts: Techniques like entropy bonuses, low-probability token protection, and pass@k-based tweaks helped a bit, but they mostly operated on tokens or proxy signals (like embedding distances). They could make text look diverse without guaranteeing diverse solution strategies.

The gap: A direct, per-problem, rollout-set view that says, ā€œIf two correct answers follow the same plan, they shouldn’t both get full credit. Reward correct answers that follow rare, genuinely different plans more.ā€

Real stakes: In math solvers, science assistants, and diagnostic support, people often check multiple samples. Keeping multiple valid strategies boosts the chance that at least one path works, especially on hard problems. Creative breadth is not just nice-to-have—it’s what makes an assistant reliable under pressure.

02Core Idea

šŸž Hook: Picture a talent show where judges give extra points to acts no one else tried—that pushes contestants to be original, not just polished.

🄬 The Concept: Uniqueness-Aware RL (UARL)

  • What it is: A way to train AI that gives more reward to correct solutions using rare, high-level strategies and less to repeated, common ones.
  • How it works:
    1. For each problem, generate several rollouts (tries).
    2. A judge groups rollouts by big strategy, ignoring wording.
    3. Compute how rare each strategy is (small group = rare).
    4. Reweight learning so rare-and-correct rollouts count more.
  • Why it matters: It keeps multiple solution ideas alive, boosting pass@k without hurting pass@1. šŸž Anchor: If two kids both solve a puzzle but one uses a fresh approach, the teacher gives a bonus for creativity.

Multiple analogies:

  • Orchestra analogy: You know how a conductor doesn’t want only violins? UARL is like making sure brass, woodwinds, and percussion all get featured, so the performance is richer.
  • Sports analogy: A coach rewards not only scoring but also trying new plays that keep defenses guessing. The team keeps a deeper playbook.
  • Cooking analogy: A chef encourages different techniques (grilling, steaming, baking) instead of ten versions of the same sauté—so the menu covers more tastes.

šŸž Hook: Before, models were trained like spelling bees—correct letters mattered most. After, it’s like a science fair—original methods matter too.

🄬 The Concept: Before vs After

  • What it is: A comparison of training behavior.
  • How it works:
    1. Before: RL upgrades the most common winning strategy, shrinking diversity over time.
    2. After: UARL spreads credit across different winning strategies, preserving diversity.
    3. Result: Higher pass@k, especially when you can sample many tries.
  • Why it matters: Users benefit when the model keeps several strong, distinct ways to solve a problem. šŸž Anchor: With many lottery tickets, it’s better if they’re from different number patterns, not clones.

šŸž Hook: Ever sort your LEGO by the kind of builds (cars vs castles), not by brick color? That’s the key move.

🄬 The Concept: Strategy Clustering (with a Judge)

  • What it is: Grouping solutions by their big idea rather than their words.
  • How it works:
    1. Show the judge multiple rollouts for the same problem.
    2. The judge groups them by plan (e.g., factoring vs quadratic formula).
    3. We store the group sizes as a measure of commonness vs rarity.
  • Why it matters: It turns fuzzy ā€œcreativityā€ into something measurable we can reward. šŸž Anchor: Two stories can have different sentences but the same plot; clustering finds the plots.

šŸž Hook: When you split a pie, you can give bigger slices to the pieces you have less of—that’s fair and keeps variety on the table.

🄬 The Concept: Advantage Reweighting

  • What it is: Scaling the learning signal of each rollout by how rare its strategy is (and whether it’s correct).
  • How it works:
    1. Compute the usual per-problem advantage (quality signal) from rewards.
    2. Divide by the cluster size to boost rare strategies.
    3. Multiply to get the final, uniqueness-aware advantage.
  • Why it matters: The model learns to allocate probability to multiple good strategies, not just one. šŸž Anchor: If three kids give the same correct answer and one gives a correct but different one, the different one gets extra points.

šŸž Hook: Big words ahead, but think of them as roles in a play.

🄬 The Concept: GRPO (Group Relative Policy Optimization)

  • What it is: A common RL method where several rollouts of the same problem form a group, and their rewards are normalized to compute advantages.
  • How it works:
    1. Sample K rollouts for one problem.
    2. Compute mean and spread of rewards in that group.
    3. Calculate advantage of each rollout relative to the group.
  • Why it matters: It stabilizes learning and makes comparisons fair within the same problem. šŸž Anchor: Grading on a curve within one class, not across different schools.
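Here is a minimal Python sketch of that group-relative advantage. The use of a population standard deviation and the small epsilon are implementation assumptions for the sketch, not details confirmed by the paper.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: center and scale every rollout's reward by the
    mean and spread of its own group (the K rollouts for one problem)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One problem, K = 8 rollouts, binary correctness rewards.
print([round(a, 2) for a in group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1])])
# Correct rollouts get a positive advantage, incorrect ones a negative one.
```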

Why it works (intuition):

  • If the goal is better pass@k, then we must keep several distinct, correct strategies around. Rewarding rare strategies directly lines up with that goal. By clustering at the strategy level, we stop confusing surface-level word changes with real novelty. And by blending uniqueness with correctness in the advantage, we avoid rewarding wrong-but-weird attempts.

Building blocks:

  • Rollouts per problem (several tries).
  • A correctness verifier (math checker, numeric tolerance, or LLM judge).
  • A strategy judge that clusters the full traces.
  • A uniqueness weight that downweights large clusters and upweights small ones.
  • A GRPO-style update that multiplies uniqueness with quality. Together, these blocks turn ā€œcreative and correctā€ into the winning recipe.

03Methodology

At a high level: Problem → Generate K rollouts → Verify correctness → Cluster by strategy → Compute uniqueness weights → Reweight advantages → Update policy.

šŸž Hook: Think of a science fair: many projects for one challenge, a judge groups similar projects, rare-and-excellent projects get bigger ribbons, and the school learns what to cultivate next.

🄬 The Concept: Sampling Multiple Rollouts

  • What it is: Making several complete solution attempts per problem.
  • How it works:
    1. For each training problem, the model writes K full solutions.
    2. Each solution includes its steps and final answer.
    3. These K attempts form a group.
  • Why it matters: You can measure diversity only if you have multiple tries to compare. šŸž Anchor: You can’t know if a class is creative from one essay; you need a stack.

Step A: Rewarding Quality with a Verifier

  • What happens: We check if each rollout is correct using a task-appropriate verifier: exact math equality for boxed answers, numeric tolerance for physics, or an LLM judge for diagnosis equivalence in medicine.
  • Why this step exists: Without correctness, we’d accidentally boost wrong-but-rare ideas.
  • Example: If the correct boxed answer is \boxed{588} and the rollout matches, reward=1; else 0.
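As an illustration, here is a hedged sketch of what such verifiers could look like in Python. The answer-extraction regex, the exact-match rule, and the 1% physics tolerance are assumptions made for this example, not the paper's actual verification code.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Pull the final \\boxed{...} answer out of a solution trace (assumed format)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def math_reward(rollout: str, gold: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the gold answer exactly."""
    return 1.0 if extract_boxed(rollout) == gold.strip() else 0.0

def physics_reward(predicted: float, gold: float, rel_tol: float = 1e-2) -> float:
    """Binary reward with a numeric tolerance (the tolerance value is an assumption)."""
    return 1.0 if abs(predicted - gold) <= rel_tol * max(abs(gold), 1e-9) else 0.0

print(math_reward("... so the answer is \\boxed{588}.", "588"))   # 1.0
print(physics_reward(9.79, 9.81))                                  # 1.0, within 1% tolerance
```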

šŸž Hook: Sorting by plot, not by pretty sentences.

🄬 The Concept: LLM Judge for Strategy Clustering

  • What it is: A larger or specialized model that groups rollouts by high-level plan.
  • How it works:
    1. Feed the problem and all K rollouts to the judge at once.
    2. The judge returns clusters like: {Group 1 (Factoring): Solutions 1,5; Group 2 (Quadratic Formula): Solutions 2,3; Group 3 (Graphing): Solution 4}.
    3. We ignore stylistic differences (variable names, order of simplifications).
  • Why it matters: It finds real strategic differences that affect pass@k. šŸž Anchor: Two students both use the quadratic formula, even if one shows more steps—it’s the same idea.
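A rough Python sketch of the prompt-and-parse pattern such a judge could use. The prompt wording, the reply format, and the singleton fallback for unlabeled rollouts are all assumptions made for this illustration; the paper's actual judge prompt may differ.

```python
import re

JUDGE_PROMPT = """You are grouping solutions to the same problem by their high-level strategy.
Ignore wording, variable names, and presentation; two solutions belong to the same group
only if they follow the same underlying plan.

Problem:
{problem}

{solutions}

Reply with one line per group, in the form:
Group <name>: <comma-separated solution numbers>"""

def build_judge_prompt(problem: str, rollouts: list[str]) -> str:
    numbered = "\n\n".join(f"Solution {i + 1}:\n{r}" for i, r in enumerate(rollouts))
    return JUDGE_PROMPT.format(problem=problem, solutions=numbered)

def parse_clusters(judge_reply: str, n_rollouts: int) -> list[int]:
    """Map each rollout index to a cluster id parsed from the judge's reply.
    Rollouts the judge did not mention fall back to their own singleton cluster."""
    labels = [-1] * n_rollouts
    for cluster_id, line in enumerate(judge_reply.strip().splitlines()):
        for num in re.findall(r"\d+", line.split(":", 1)[-1]):
            idx = int(num) - 1
            if 0 <= idx < n_rollouts:
                labels[idx] = cluster_id
    next_id = max(labels) + 1
    for i, lab in enumerate(labels):
        if lab == -1:
            labels[i], next_id = next_id, next_id + 1
    return labels

prompt = build_judge_prompt("Solve x^2 - 5x + 6 = 0.", ["By factoring ...", "Using the quadratic formula ..."])
# `prompt` would be sent to the judge model; below we just parse a canned reply.
reply = "Group Factoring: 1, 5\nGroup Quadratic Formula: 2, 3\nGroup Graphing: 4"
print(parse_clusters(reply, n_rollouts=5))   # [0, 1, 1, 2, 0]
```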

Step B: Measuring Uniqueness from Cluster Size

  • What happens: For each rollout, we get the size of its cluster. Small cluster = rare strategy; big cluster = common strategy.
  • Why this step exists: We need a simple knob to prefer rare-but-correct strategies without blowing up updates.
  • Example: With K=8 rollouts, if a rollout belongs to a cluster of size 2, it’s rarer than one in a cluster of size 5.

šŸž Hook: When candy is scarce, each piece is precious; when there’s a pile, each piece counts less.

🄬 The Concept: Uniqueness Weight (with strength alpha)

  • What it is: A number that shrinks as the cluster gets bigger; controlled by alpha (0 to 1).
  • How it works:
    1. Compute weight = 1 / (cluster_size^alpha).
    2. Alpha=0 → no uniqueness effect (plain GRPO).
    3. Larger alpha → stronger boost for rarities.
  • Why it matters: It softly shifts learning toward rare strategies without ignoring common good ones. šŸž Anchor: If three kids give the same correct idea and one gives a different correct idea, choose alpha to decide how much extra the different idea gets.
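In Python, the weight from the formula above is a one-liner; the tiny table printed below shows how different alpha settings treat cluster sizes 5, 2, and 1 out of K=8 rollouts (the specific sizes are just an example).

```python
def uniqueness_weight(cluster_size: int, alpha: float) -> float:
    """Weight that shrinks as a strategy becomes more common: 1 / size^alpha."""
    return 1.0 / (cluster_size ** alpha)

# K = 8 rollouts: cluster sizes 5, 2, and 1 at a few alpha settings.
for alpha in (0.0, 0.5, 1.0):
    weights = {size: round(uniqueness_weight(size, alpha), 3) for size in (5, 2, 1)}
    print(f"alpha={alpha}: {weights}")
# alpha = 0.0 -> every weight is 1 (plain GRPO);
# larger alpha -> the singleton strategy gets boosted the most.
```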

Step C: Group-Normalized Advantage (Quality Signal)

  • What happens: Within each problem’s group, compute an advantage that centers and scales rewards by the group’s average and spread.
  • Why this step exists: It stabilizes updates and adapts to problem difficulty.
  • Example: If most rollouts for a hard problem are wrong, a single correct one gets a large positive advantage relative to its group.

šŸž Hook: Mix creativity with quality so you don’t reward wild guesses.

🄬 The Concept: Uniqueness-Aware Advantage

  • What it is: The final learning signal = (uniqueness weight) Ɨ (group-normalized advantage).
  • How it works:
    1. Compute advantage_z from rewards within the group.
    2. Multiply by weight (from cluster size and alpha).
    3. Use this product in the policy update.
  • Why it matters: Correct-but-rare gets amplified; correct-but-common stays helpful but smaller; wrong stays penalized. šŸž Anchor: A gold star gets bigger if it’s for a fresh, correct method.
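A compact Python sketch that puts the two pieces together. One caveat: whether the uniqueness weight is also applied to incorrect rollouts, or only to correct ones, isn't spelled out in this summary, so the sketch below simply applies it uniformly.

```python
import statistics

def uniqueness_aware_advantages(rewards, cluster_labels, alpha=0.5, eps=1e-6):
    """Final per-rollout signal: (1 / cluster_size^alpha) * group-normalized advantage."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    adv_z = [(r - mean) / (std + eps) for r in rewards]            # quality signal (GRPO-style)
    sizes = [cluster_labels.count(c) for c in cluster_labels]      # how common each strategy is
    return [a / (s ** alpha) for a, s in zip(adv_z, sizes)]        # rare-and-correct amplified

# Two correct rollouts: one shares its plan with three others, one is a singleton strategy.
rewards  = [1, 1, 0, 0, 0, 0]
clusters = ["A", "B", "A", "A", "A", "C"]
print([round(a, 3) for a in uniqueness_aware_advantages(rewards, clusters, alpha=1.0)])
# The correct singleton ("B") ends up with a much larger positive advantage than the correct "A".
```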

Step D: Policy Update with GRPO Framework

  • What happens: Use the uniqueness-aware advantage in a standard GRPO objective (plus regularizers like KL) and update the model.
  • Why this step exists: It’s a drop-in change—only the advantage is altered—so we keep training stable and scalable.
  • Example with data: For a math problem sampled K=8 times—three rollouts use quadratic formula (2 correct), three use factoring (1 correct), two use graphing (1 correct). Correct rollouts in smaller clusters (graphing, size 2) get larger updates than those in bigger clusters (quadratic formula, size 3).
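Plugging the numbers from that example into the same recipe (with binary rewards and alpha = 1 assumed purely for illustration) shows the effect directly:

```python
import statistics

# The worked example above: K = 8 rollouts for one math problem.
#   "QF"  (quadratic formula, cluster size 3): 2 correct
#   "FAC" (factoring,         cluster size 3): 1 correct
#   "GRA" (graphing,          cluster size 2): 1 correct
clusters = ["QF", "QF", "QF", "FAC", "FAC", "FAC", "GRA", "GRA"]
rewards  = [1,    1,    0,    1,     0,     0,     1,     0   ]
alpha = 1.0   # assumption for illustration

mean, std = statistics.fmean(rewards), statistics.pstdev(rewards)
for c, r in zip(clusters, rewards):
    adv_z  = (r - mean) / (std + 1e-6)            # GRPO group-normalized advantage
    weight = 1.0 / (clusters.count(c) ** alpha)   # uniqueness weight from cluster size
    print(f"{c}: reward={r}, uarl_advantage={weight * adv_z:+.3f}")
# The correct graphing rollout (cluster size 2) gets a larger positive advantage
# than the correct quadratic-formula rollouts (cluster size 3).
```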

šŸž Hook: Keep track of your budget: more tries give more chances to show variety.

🄬 The Concept: Sampling Budget K

  • What it is: The number of rollouts per problem at training (and a related number at test time for pass@k).
  • How it works:
    1. Train with multiple samples per problem (e.g., K=8).
    2. Test with k samples (often larger than K), compute pass@k and AUC@K.
    3. Bigger k makes diversity more valuable.
  • Why it matters: The whole point is to improve sets of tries, not single answers. šŸž Anchor: If you can check 64 attempts, having different strategies matters a lot.

The secret sauce:

  • Direct set-level thinking: We don’t guess diversity from token entropy; we measure it at the strategy level per problem.
  • Gentle, bounded weighting: Cluster-size weights are bounded, so rare strategies help more without exploding updates.
  • Correctness first: Uniqueness is only a booster when quality is present, preventing reward to wrong-but-odd outputs.
  • Plug-and-play: Works as a replacement for the advantage term in common GRPO pipelines.

Implementation notes (kept simple and robust):

  • Use a larger LLM from the same family as a judge (inference only) to cluster strategies.
  • Use task-specific verifiers for correctness: exact math checking, numeric tolerance for physics, and LLM-based equivalence for medical diagnoses.
  • Regularize with standard KL and train with AdamW; sample with temperature around 1.0; limit generation length per model.
  • Tune alpha between 0 and 1 to control how much to favor rare strategies.
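For orientation, here is what such a setup might look like as a configuration sketch. Apart from the details stated above (temperature around 1.0, alpha in [0, 1], AdamW, KL regularization, a larger same-family judge), every concrete number below is an assumption, not a reported hyperparameter.

```python
# Illustrative hyperparameters only; values not stated in the summary are assumptions.
uarl_config = {
    "rollouts_per_problem_K": 8,       # multiple full attempts per training problem
    "sampling_temperature": 1.0,       # keep rollout generation stochastic
    "alpha": 0.5,                      # uniqueness strength; 0 recovers plain GRPO
    "kl_coefficient": 0.01,            # KL regularization toward the reference policy (assumed value)
    "optimizer": "AdamW",
    "learning_rate": 1e-6,             # assumed value
    "judge_model": "larger same-family variant, inference only",
    "verifier": {
        "math": "exact boxed-answer match",
        "physics": "numeric tolerance",
        "medicine": "LLM-based diagnosis equivalence",
    },
}
```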

04Experiments & Results

šŸž Hook: If you let a team submit several plays per turn, you want them to try different plays, not the same one over and over.

🄬 The Concept: The Test (What was measured and why)

  • What it is: We measured pass@k and AUC@K to see if keeping multiple strategies improves how often at least one try succeeds, and we tracked entropy and human-strategy coverage to ensure real diversity.
  • How it works:
    1. Pass@k: Check if any of k generations solves the problem.
    2. AUC@K: The area under the pass@k curve across k=1..K, summarizing overall performance.
    3. Entropy dynamics: Whether training keeps or collapses token-level exploration.
    4. cover@n: How many distinct, canonical human solution ideas are recovered among n tries.
  • Why it matters: These together show not just accuracy but breadth of thinking and sustained exploration. šŸž Anchor: It’s like grading both your test score and how many different methods you actually learned.
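A small Python sketch of how the first two metrics can be computed from n attempts per problem. The pass@k estimator is the standard unbiased one, and AUC@K is taken here as the average of pass@k over k = 1..K; the paper's exact normalization may differ.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate from n attempts with c correct."""
    return 1.0 if n - c < k else 1.0 - comb(n - c, k) / comb(n, k)

def auc_at_K(correct_counts: list[int], n: int, K: int) -> float:
    """Average of pass@k over k = 1..K: one simple normalization of the
    area under the pass@k curve."""
    curve = [sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
             for k in range(1, K + 1)]
    return sum(curve) / K

# Toy benchmark: 4 problems, n = 64 attempts each, with these correct counts.
print(round(auc_at_K([0, 2, 10, 64], n=64, K=64), 3))
```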

šŸž Hook: You don’t just race yourself—you race other teams.

🄬 The Concept: The Competition (Baselines)

  • What it is: We compared to standard instruction models and RL variants that address exploration differently.
  • How it works:
    1. Instruct backbones (no RL).
    2. SimpleRL (GRPO only).
    3. DAPO (diversity-aware RL recipe).
    4. Forking Token (protects rare, high-entropy tokens).
  • Why it matters: If UARL wins against strong diversity baselines, it’s adding something new. šŸž Anchor: Beating a team known for creative plays shows your strategy really helps.

Datasets and models:

  • Math: AIME 2024/2025, HLE-Math; Physics: OlympiadBench (text-only, competition); Medicine: MedCaseReasoning.
  • Backbones: Qwen2.5-7B, OLMo-3-7B, Qwen-3-8B.
  • Judge models: Larger variants from same families (e.g., Qwen2.5-72B for Qwen2.5-7B).

The scoreboard (with context):

  • Pass@k curves: Across math, physics, and medicine, UARL matches or beats baselines at most budgets, with the advantage growing as k increases (think: going from a B to an A- when you get more guesses).
  • AUC@K (Qwen2.5-7B): UARL leads at K=64/128/256 on all domains. Example: On AIME, +0.044 AUC@64 over SimpleRL and +0.058 AUC@128—like moving from an 84 to an 89 on a curve where others stay flat.
  • Additional families (OLMo-3-7B, Qwen-3-8B): On HLE and Physics where the metric is most discriminative, UARL tops Instruct, SimpleRL, DAPO, and Forking Token. Example (Qwen-3-8B @ K=64): UARL improves over DAPO on HLE (0.201→0.217) and Physics (0.361→0.365), showing complementary and stronger gains in strategy coverage.

šŸž Hook: If everyone starts sounding the same during training, creativity is in trouble.

🄬 The Concept: Entropy Dynamics

  • What it is: Tracking how much variety in token choices the model keeps during training.
  • How it works:
    1. Measure average token-level entropy over steps.
    2. Compare SimpleRL vs UARL.
    3. Look for collapse (downward drift) vs stability.
  • Why it matters: While we aim at strategy diversity, healthy entropy shows the model hasn’t become too deterministic. šŸž Anchor: UARL keeps the mixing bowl from drying out—still stirrable, still exploratory.
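For intuition, here is one way such an entropy diagnostic could be computed in Python from per-token probability distributions; in practice it would be read off the model's logits during rollout generation, so this is only a toy sketch.

```python
import math

def mean_token_entropy(token_distributions: list[list[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability
    distributions; a falling value over training suggests collapse."""
    entropies = [-sum(p * math.log(p) for p in probs if p > 0)
                 for probs in token_distributions]
    return sum(entropies) / len(entropies)

# Toy check: a peaked next-token distribution vs a flatter one.
peaked  = [[0.97, 0.01, 0.01, 0.01]] * 4
flatter = [[0.40, 0.30, 0.20, 0.10]] * 4
print(round(mean_token_entropy(peaked), 3), round(mean_token_entropy(flatter), 3))
# Lower number = less exploration at the token level.
```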

Human strategy coverage (cover@n):

  • Setup: For 20 tough AIME problems, we assembled 3–5 canonical human solution ideas per problem and checked how many the model recovered among 32 correct rollouts.
  • Result: On 4 of the most complex problems, UARL improved coverage where Instruct lagged. Example: On a geometry task, Instruct covered 2/5 ideas (40%), while UARL covered all 5 (100%), recovering rare insights like Symmedian Similarity and Pure Trigonometry.
  • Meaning: Gains reflect real strategy exploration, not just word shuffle.

Surprising findings:

  • No pass@1 trade-off: Despite favoring rarities, UARL didn’t hurt one-shot accuracy; it often matched or slightly improved it.
  • Scaling with k: Benefits became most visible at medium-to-large budgets (k ≳ 32), exactly where strategy diversity matters most to users.
  • Robust across domains: Even in medicine where accuracy plateaus quickly, UARL kept small but steady gains without regressions.

Takeaway: By directly rewarding correct-but-rare strategies, UARL sustains exploration, widens the solution portfolio, and translates that breadth into better pass@k and AUC@K—much like a team that keeps multiple winning plays ready instead of overusing one.

05Discussion & Limitations

šŸž Hook: Even smart plans have trade-offs—bringing a map helps, but you still need to trust the compass.

Limitations:

  • Judge dependence: The LLM judge that clusters strategies adds compute and can misclassify when strategies overlap or are ambiguous.
  • Local rarity only: Rarity is measured within each problem’s K rollouts, not across the entire training history—so it doesn’t reward global novelty over time.
  • Task-specific definitions: What counts as a ā€œhigh-level strategyā€ depends on the domain; prompts and instructions must be tuned.
  • Over-emphasis if alpha is mis-set: too-strong uniqueness weighting could over-reward rare but marginal strategies, so tuning alpha matters.

Required resources:

  • A capable judge model (often a larger variant) run in inference-only mode.
  • Verifiers (math equality, numeric tolerance, medical equivalence rubric) and infrastructure for multi-sample training (K rollouts per problem).
  • Standard RL compute for GRPO with added judging and grouping overhead.

When not to use:

  • Single-shot settings where k=1 is the whole story and diversity offers little benefit.
  • Tasks where correctness is hard to verify and the judge can’t reliably cluster strategies (risking noisy rewards).
  • Ultra-limited compute scenarios where the extra passes through a judge are too costly.

Open questions:

  • Judge-free clustering: Can we learn strategy embeddings or use lightweight classifiers to cut compute and reduce misgrouping?
  • Global novelty: How do we track and reward long-term, cross-problem novelty without gaming the signal?
  • Adaptive alpha: Can the system learn how strongly to favor rare strategies based on problem uncertainty (e.g., semantic entropy) or pass@k gaps?
  • Multi-objective tuning: How best to balance correctness, rarity, and efficiency when budgets, domains, and user needs differ?

šŸž Anchor: Think of UARL as a strong foundation—already useful—but with room to add floors like smarter judges and global novelty meters.

06Conclusion & Future Work

Three-sentence summary: This paper introduces Uniqueness-Aware RL, which rewards correct solutions that use rare, high-level strategies, not just correct solutions in general. By clustering multiple rollouts per problem into strategy groups and reweighting advantages inversely to cluster size, the method keeps diverse, correct approaches alive. As a result, pass@k and AUC@K improve across math, physics, and medicine without hurting pass@1, and exploration remains healthy.

Main achievement: Turning creativity into a first-class training signal—simple, bounded, and pluggable—so RL for LLMs optimizes sets of solutions (strategy coverage) rather than just single-token behavior.

Future directions:

  • Build lighter, judge-free clustering or learned strategy embeddings.
  • Add global novelty accounting across problems and time.
  • Learn to adapt the uniqueness strength (alpha) based on uncertainty or pass@k gaps.
  • Extend to coding, theorem proving, and multi-agent collaboration where diverse plans matter.

Why remember this: When users can look at several answers, variety of correct strategies is power. This work shows a practical way to reward that variety directly—keeping AI problem solvers creative, robust, and more likely to nail tough questions when given a handful of tries.

Practical Applications

  • Math tutoring that shows different correct methods (factoring, graphing, completing the square) so students learn multiple approaches.
  • Scientific assistants that explore distinct modeling strategies (energy methods vs force balance) for physics problems.
  • Medical decision support that presents several evidence-aligned differential diagnoses instead of repeating the same one.
  • Coding helpers that propose multiple algorithmic patterns (DP, greedy, divide-and-conquer) for the same task.
  • Theorem-proving or proof assistants that try alternative proof ideas (induction, contradiction, invariants).
  • Data analysis tools that offer different statistical models (GLMs, tree-based, Bayesian) to cross-check conclusions.
  • Design brainstorming agents that suggest diverse concepts rather than slight rewordings of a single idea.
  • Interview preparation bots that generate multiple distinct solution paths to common algorithm questions.
  • Education platforms that assess student understanding by comparing strategy types, not just final answers.
  • Research copilots that maintain varied hypothesis sets to avoid premature convergence during literature exploration.
#Uniqueness-Aware Reinforcement Learning Ā· #LLM reasoning Ā· #strategy clustering Ā· #GRPO Ā· #pass@k Ā· #AUC@K Ā· #exploration collapse Ā· #token entropy Ā· #novelty search Ā· #quality-diversity Ā· #cover@n Ā· #policy optimization Ā· #LLM judge Ā· #advantage reweighting Ā· #creative problem solving
Version: 1