Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
Key Summary
- This paper argues that to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.
- It fixes a hidden problem in a popular training method (GRPO) that accidentally gives medium-difficulty questions the biggest influence.
- Their new algorithm, DGPO, balances updates across questions and then gently upweights the harder ones.
- They also make existing questions tougher without changing the correct answer, using a strategy called MQR.
- MQR adds story noise, invents abstract terms, or turns a given number into a small sub-problem to raise difficulty safely.
- Together, the algorithm (DGPO) and the data strategy (MQR) form MathForge, a loop where harder data and better learning help each other.
- On several math benchmarks, MathForge beats strong baselines by a clear margin, including a +4.56 percentage-point average improvement over GRPO.
- DGPO also works with other training methods and even helps on multimodal math problems.
- The key insight is that "harder is better" when difficulty is measured and used carefully.
- All code and augmented data are released, aiming for reproducibility and broader use.
Why This Research Matters
MathForge shows that smarter training isn't just about more data; it's about the right difficulty and fair learning rules. By reliably making problems harder while keeping answers the same, we can safely stretch a model's abilities. Fixing the hidden imbalance in GRPO ensures that tough questions aren't sidelined by accident. This approach helps models become better problem solvers, not just better memorizers. Such improvements can translate into stronger tutoring tools, better planning systems, and more reliable scientific assistants. Because DGPO is compatible with other methods and even works with images, it can boost many future AI systems.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine practicing basketball. If you only shoot easy layups, you'll get good at layups, but not at three-pointers. To win real games, you must practice what's hard, on purpose.
The Concept: Before this work, large language models (LLMs) were getting better at solving math by learning from their own tries and receiving simple, checkable rewards (right or wrong). This is called Reinforcement Learning with Verifiable Rewards (RLVR). It's efficient because the reward is clear and trustworthy: did the answer match the gold solution or not?
Why it matters: Even with RLVR, many systems didn't pay special attention to hard questions. That meant models often improved on what they already found comfortable and didn't push hard enough on their weak spots.
Anchor: Think of a quiz app that marks answers correct or incorrect. If it keeps serving you medium questions where you score okay, you'll stay okay. To grow, the app should notice what's hardest for you and serve more of that.
New Concept #1: RLVR (Reinforcement Learning with Verifiable Rewards)
Hook: You know how teachers grade math answers with a simple checkmark if they're right? That's a clear, fair score.
The Concept: RLVR is training where an AI gets a reward it can fully trust, like 1 for a correct final answer and 0 for a wrong one. How it works (recipe):
1) Ask the model to solve a problem.
2) Use a rule-based checker to see if the final answer matches the gold one.
3) Give 1 for correct, 0 for wrong.
4) Nudge the model to make correct-like answers more likely next time.
Why it matters: Without trustworthy rewards, the model might chase tricks that don't reflect real understanding.
Anchor: When the AI says the area is 36 and the gold answer is 36, it gets a 1; otherwise, 0 (a minimal sketch of such a checker follows below).
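To make the 1-or-0 reward concrete, here is a minimal sketch of a verifiable reward check, assuming model outputs end with an `Answer:` marker. The function names and the marker convention are illustrative assumptions, not the paper's actual checker.

```python
# Minimal sketch of a verifiable reward (illustrative; not the paper's exact rule-based checker).

def extract_final_answer(response: str) -> str:
    """Take the text after the last 'Answer:' marker (a hypothetical output convention)."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the gold answer exactly, else 0.0."""
    return 1.0 if extract_final_answer(response) == gold_answer else 0.0

# The model says the area is 36 and the gold answer is "36": reward 1.0. Otherwise 0.0.
print(verifiable_reward("The side is 6, so the area is 6*6. Answer: 36", "36"))  # 1.0
print(verifiable_reward("Answer: 35", "36"))                                     # 0.0
```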
New Concept #2: GRPO (Group Relative Policy Optimization)
Hook: Imagine a class solves the same problem, and the teacher compares students within that one problem to see who did better.
The Concept: GRPO improves a model by generating several answers to the same question and comparing them to each other. How it works:
1) For one question, the model writes multiple answers.
2) Each answer gets a reward (correct/incorrect).
3) The model boosts answers that did better than the group average and reduces those that did worse.
4) Repeat for many questions.
Why it matters: Without comparing within the same question, the model needs a separate critic, which is slower and riskier to train.
Anchor: If three answers are wrong and one is right, GRPO learns to make the right-style answer more common (a minimal sketch of the group comparison follows).
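A minimal sketch of that within-question comparison, assuming binary rewards and the standard GRPO recipe of centering on the group mean and dividing by the group standard deviation; the variable names and the small epsilon are illustrative.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # spread of rewards within this one question's group
    return [(r - mean) / (std + eps) for r in rewards]

# One right answer out of four: it gets a positive advantage (reinforced),
# the three wrong ones get negative advantages (discouraged).
print(grpo_advantages([1, 0, 0, 0]))  # ~[1.73, -0.58, -0.58, -0.58]
```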
The Problem: The authors discovered a hidden imbalance in GRPO: mathematically, the size of the learning push is biggest for medium-difficulty questions (about half the answers right), and smaller for both really easy and really hard ones. That means super valuable hard-but-solvable questions don't get enough attention.
Past Attempts: Some tried reweighting by difficulty, or just rephrasing questions to increase variety. But these either didn't fix the root math imbalance in GRPO's update size, or they made data different without reliably making it harder.
The Gap: We needed a way to (1) fix the imbalance so all questions have a fair baseline push, and (2) then deliberately give extra weight to the currently hard-but-solvable ones. On the data side, we needed a way to make questions truly harder in systematic, safe ways while keeping the same correct answer.
Real Stakes: Better math reasoning shows up in everyday tech, like planning routes, budgeting, interpreting charts, or tutoring. If models learn mostly from medium stuff, they won't be ready for real-world puzzles, which are often messy, multi-step, and abstract.
02 Core Idea
Aha! Moment: If we train models to practice the hardest questions they can almost solve, and fix the math so hard questions count fairly, then the model's reasoning improves faster and farther.
Explain it three ways:
- Sports analogy: Instead of only scrimmaging against average teams (medium questions), schedule more games against strong teams you can still score on (hard-but-solvable). Also make sure the scoreboard counts every game fairly.
- Music analogy: Don't just play medium-speed songs. Balance the metronome so every practice counts the same, then spend more time on fast, tricky passages to grow your skill.
- Video game analogy: Don't farm mid-level areas forever. Make experience points fair for all levels, then farm tougher dungeons you can almost beat to level up quickly.
Before vs After:
- Before: GRPO gave the biggest update to medium-difficulty questions. Data augmentation often rephrased problems without really making them more challenging.
- After: DGPO first balances the update size across questions and then upweights the hard ones. MQR systematically reformulates questions to raise difficulty without changing the correct answer. Together, they form MathForge, a loop where harder data and better learning reinforce each other.
Why it works (intuition):
- Fixing the imbalance removes a hidden preference for medium difficulty, so all questions start on equal footing.
- Then, focusing a bit more on hard-but-solvable questions targets the model's weakest skills, the places where it's close to right, so learning is efficient and impactful.
- Making data harder in meaningful ways (noise tolerance, abstraction, multi-step logic) teaches general skills the model can reuse elsewhere, boosting test performance.
Building Blocks (with Sandwich explanations):
New Concept #3: Update Magnitude Imbalance (in GRPO)
Hook: You know how some games give more points for tasks in the middle of the map, even if the edges are tougher or easier? That changes where you play.
The Concept: GRPO's math makes medium-difficulty questions push learning the most, and hard/easy ones push less.
How it works: Because GRPO normalizes by the standard deviation of rewards, the summed "push" from a question is largest around 50% accuracy and shrinks near 0% or 100%.
Why it matters: Hard-but-solvable questions (gold for learning) get under-emphasized.
Anchor: If the model gets 1 out of 2 right (50%), GRPO pushes strongly; if 1 out of 10 or 9 out of 10, it pushes weakly, even if we really need to learn from that case. The tiny numeric check below shows the effect.
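The imbalance can be seen with a few lines of arithmetic. Treating a question's total "push" as the sum of absolute GRPO advantages is a simplification of the paper's analysis, but it shows the shape of the effect: the push peaks at 50% accuracy and shrinks toward the extremes.

```python
import statistics

def total_push_with_std_scaling(rewards, eps=1e-6):
    """Sum of |advantage| when advantages are scaled by the group's standard deviation (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return sum(abs(r - mean) / (std + eps) for r in rewards)

group_size = 10
for correct in (1, 3, 5, 7, 9):
    rewards = [1] * correct + [0] * (group_size - correct)
    print(f"{correct}/{group_size} correct -> total push {total_push_with_std_scaling(rewards):.2f}")
# Prints roughly 6.00, 9.17, 10.00, 9.17, 6.00: medium-difficulty questions push hardest.
```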
New Concept #4: MathForge
Hook: Imagine a smart coach who both designs tougher drills and scores every drill fairly, then schedules extra reps for what you struggle with.
The Concept: MathForge = DGPO (the fair, difficulty-aware trainer) + MQR (the safe, difficulty-raising question maker).
How it works: MQR expands the training set with harder versions of the same problems; DGPO learns from them by first balancing update sizes and then upweighting hard questions.
Why it matters: Without both parts, you either don't have enough hard data or you don't learn from it efficiently.
Anchor: Harder practice sheets (MQR) + smarter grading and focus (DGPO) → better test scores.
03 Methodology
High-level pipeline: Questions → MQR (make harder, same answer) → DGPO training (balance then upweight) → Stronger math reasoner.
Step-by-step, like a recipe:
New Concept #5: MQR (Multi-Aspect Question Reformulation)
Hook: You know how a riddle gets trickier if it adds a long story, uses fancy words, or makes you solve a small puzzle first?
The Concept: MQR rewrites a question in different ways to make it harder but keeps the final answer exactly the same. How it works:
1) Background: Add a story that sounds related but isn't needed for the math (noise).
2) Term: Invent an abstract term and restate the problem using it (abstraction).
3) Sub-Problem: Replace a given number with a small sub-problem whose solution supplies that number (multi-step reasoning).
4) Always check that the gold answer stays unchanged.
Why it matters: Without preserving the answer, you could teach the model the wrong target or need new solutions each time. With preserved answers, training stays simple and safe.
Anchor: Original: "Cake costs 6 euros." Background version: a long Paris story, same math. Term version: define "euro-gap", same math. Sub-problem version: solve a mini puzzle to get the exchange rate, same final euros to add. Hypothetical prompt templates for these three styles are sketched below.
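The paper uses a strong reformulator model to produce these rewrites; its actual prompts are not reproduced here, so the templates below are only a hypothetical sketch of how the three styles could be requested, with the answer-preservation requirement stated explicitly. The dictionary keys, function name, and example question are all made up for illustration.

```python
# Hypothetical MQR-style prompt templates (illustrative; not the paper's actual prompts).
MQR_PROMPTS = {
    "background": (
        "Rewrite the problem by adding a realistic background story that is NOT needed to solve it. "
        "Do not change any given numbers or the final answer.\n\nProblem: {question}"
    ),
    "term": (
        "Invent an abstract term, define it, and restate the problem using that term. "
        "The final answer must stay exactly the same.\n\nProblem: {question}"
    ),
    "sub_problem": (
        "Replace one given number with a small sub-problem whose solution equals that number. "
        "The final answer must stay exactly the same.\n\nProblem: {question}"
    ),
}

def build_reformulation_prompt(style: str, question: str) -> str:
    """Fill in the chosen template; the rewritten question is then checked for answer preservation."""
    return MQR_PROMPTS[style].format(question=question)

print(build_reformulation_prompt("sub_problem", "A cake costs 6 euros and a drink costs 3 euros. What is the total cost?"))
```

Whatever the real prompts look like, the non-negotiable part is the final check that the gold answer is unchanged.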
New Concept #6: DGPO (Difficulty-Aware Group Policy Optimization)
Hook: Picture a fair weightlifting competition: first weigh every lift on the same scale, then give a little extra spotlight to the toughest lifts the athlete can complete.
The Concept: DGPO is a training rule that first balances how much each question can push the model, then gives more weight to the hard-but-solvable ones. How it works (two key parts):
1) DGAE: Normalize advantages by mean absolute deviation so each question has a constant total "push".
2) DQW: Compute a difficulty score from accuracy (lower accuracy = harder) and softly upweight harder questions in the batch.
Why it matters: Without balancing, medium questions dominate. Without upweighting, the model doesn't focus where learning is most needed.
Anchor: If a batch has easy, medium, and hard questions, DGPO makes them all count fairly, then gently tilts attention toward the hard ones.
New Concept #7: DGAE (Difficulty-Balanced Group Advantage Estimation)
Hook: You know how a fair scale compares items the same way every time? If the scale is biased, some items always look heavier.
The Concept: DGAE replaces GRPO's standard-deviation scaling with mean absolute deviation (MAD) so every question contributes a constant total update size. How it works:
1) For one question, compare each answer's reward to the group average.
2) Divide by the group's MAD (not the standard deviation).
3) The sum of the absolute advantages across answers becomes a constant, independent of how hard the question is.
Why it matters: Without this, medium questions get outsized impact; with it, all questions start equally loud.
Anchor: Whether the model gets 1/10 or 5/10 right, the total "push" from that question is the same after DGAE (the sketch below checks this numerically).
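A minimal sketch of the MAD-based scaling described above, assuming binary rewards; the helper names and the small epsilon are illustrative. The quick check at the end verifies the constant-total-push property for 1/10 and 5/10 accuracy.

```python
def mean_absolute_deviation(values, eps=1e-6):
    """Average distance of each value from the group mean (plus a tiny epsilon for stability)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values) + eps

def dgae_advantages(rewards):
    """DGAE-style advantages: centre on the group mean, then scale by MAD instead of std."""
    mean = sum(rewards) / len(rewards)
    scale = mean_absolute_deviation(rewards)
    return [(r - mean) / scale for r in rewards]

for rewards in ([1] + [0] * 9, [1] * 5 + [0] * 5):  # 1/10 correct vs 5/10 correct
    total_push = sum(abs(a) for a in dgae_advantages(rewards))
    print(f"accuracy {sum(rewards)}/10 -> total push {total_push:.2f}")
# Both lines print ~10.00: after DGAE, every mixed-outcome question contributes the same total push.
```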
New Concept #8: MAD (Mean Absolute Deviation)
Hook: Imagine measuring how far each kid's height is from the class average, ignoring plus or minus, then averaging those distances.
The Concept: MAD is the average of absolute differences from the mean.
How it works: For the rewards in a group, compute each reward minus the average reward, take absolute values, then average.
Why it matters: MAD gives a stable scale that treats groups fairly, avoiding the medium-only boost that standard deviation causes in this setting.
Anchor: Rewards [1,0,0,1] and [1,1,0,0] get comparable normalization strength via MAD.
New Concept #9: DQW (Difficulty-Aware Question-Level Weighting)
Hook: When studying, you spend a bit more time on topics you often miss.
The Concept: DQW gives each question a weight based on how hard it currently is (measured by low accuracy across its sampled answers). How it works:
1) Compute the difficulty D as the negative mean accuracy for that question.
2) Use a softmax with temperature T to convert D into weights across the batch.
3) Multiply each question's loss by its weight.
Why it matters: Without DQW, you don't focus learning on where the model needs it most.
Anchor: A question that only 1/8 answers got right gets a higher weight than one with 6/8 right (see the sketch below).
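A minimal sketch of the weighting step, assuming difficulty is simply the negative mean accuracy and the softmax runs over the questions in a batch; the temperature value here is illustrative, not a recommended setting.

```python
import math

def dqw_weights(accuracies, temperature=1.0):
    """DQW-style weights: difficulty = -mean accuracy, turned into batch weights by a softmax,
    so lower-accuracy (harder) questions weigh more in the loss."""
    difficulties = [-a for a in accuracies]
    exps = [math.exp(d / temperature) for d in difficulties]
    total = sum(exps)
    return [e / total for e in exps]

# Accuracies 1/8, 6/8, 4/8: the 1/8 question gets the largest weight.
print(dqw_weights([1/8, 6/8, 4/8], temperature=0.5))  # ~[0.57, 0.16, 0.27]
```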
New Concept #10: Valid Token-Level Loss Averaging
Hook: If your class grade counted missing assignments unevenly, your average would jump around a lot.
The Concept: Only average losses over "valid" questions (those whose group of answers is not all-correct or all-wrong) to keep gradients stable. How it works:
1) Mark a question valid if its group responses aren't uniformly correct or uniformly wrong.
2) Average token-level losses over the valid queries only.
3) This stabilizes training and pairs neatly with DQW.
Why it matters: Without it, gradients can swing wildly from batch to batch, slowing or destabilizing training.
Anchor: If a question yields all zeros (no signal), skip it when averaging so it doesn't distort learning; a simplified filter is sketched below.
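A simplified version of the "valid question" filter. The paper averages token-level losses; this sketch works at the question level just to show the rule, and the function names are assumptions.

```python
def is_valid_question(rewards):
    """A question is valid when its sampled answers are neither all correct nor all wrong,
    i.e. the group comparison actually carries a learning signal."""
    return 0 < sum(rewards) < len(rewards)

def average_over_valid(question_losses, reward_groups):
    """Average losses only over valid questions (simplified stand-in for token-level averaging)."""
    kept = [loss for loss, rewards in zip(question_losses, reward_groups) if is_valid_question(rewards)]
    return sum(kept) / max(len(kept), 1)

print(is_valid_question([1, 0, 0, 1]))  # True: mixed outcomes, keep it
print(is_valid_question([0, 0, 0, 0]))  # False: no signal, skip it in the average
print(average_over_valid([0.8, 0.3, 0.5], [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 0]]))  # averages 0.8 and 0.5 only
```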
Putting it all together (with a small example):
- Suppose a batch has 3 questions, each with 4 generated answers. For each question: compute rewards (1/0), compute DGAE advantages using MAD, check validity, and compute a difficulty score from accuracy. Then scale the whole question's contribution by its DQW weight. Across the batch, the model learns more from the harder questions, but every question speaks clearly thanks to DGAE. A toy end-to-end walkthrough follows.
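Here is a self-contained toy walkthrough of that batch, combining the pieces above (DGAE advantages, validity filtering, DQW weights) in one place. It only tracks the bookkeeping, not the actual token-level loss or policy update, and the temperature and function names are illustrative assumptions.

```python
import math

def dgpo_batch_sketch(reward_groups, temperature=0.5, eps=1e-6):
    """Toy bookkeeping for one DGPO-style batch (illustrative, not the paper's implementation)."""
    questions = []
    for rewards in reward_groups:
        mean = sum(rewards) / len(rewards)
        mad = sum(abs(r - mean) for r in rewards) / len(rewards) + eps
        questions.append({
            "accuracy": mean,
            "advantages": [(r - mean) / mad for r in rewards],  # DGAE: constant total push
            "valid": 0 < sum(rewards) < len(rewards),           # skip all-right / all-wrong groups
        })

    # DQW over the valid questions: lower accuracy (harder) -> larger weight.
    valid = [q for q in questions if q["valid"]]
    exps = [math.exp(-q["accuracy"] / temperature) for q in valid]
    for q, e in zip(valid, exps):
        q["weight"] = e / sum(exps)
    return questions

batch = [
    [1, 1, 1, 0],  # easy-ish: 3/4 correct
    [1, 0, 0, 0],  # hard-but-solvable: 1/4 correct
    [0, 0, 0, 0],  # unsolved: no signal, filtered out of the weighting
]
for q in dgpo_batch_sketch(batch):
    print(q)
```

In this toy batch the 1/4-accuracy question ends up with more than twice the weight of the 3/4-accuracy one, while the all-wrong question contributes nothing: the "fair baseline, then tilt toward hard" behaviour described above.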
Secret sauce:
- The balance-then-reweight design fixes the math first (fair baseline), then applies a simple, controllable focus on hard cases (DQW). Combined with MQR's high-quality harder data, the model practices exactly the skills it lacks, safely and efficiently.
04 Experiments & Results
The Test: The authors trained several math-focused LLMs on the MATH dataset and tested on multiple benchmarks (AIME24, AIME25, AMC23, MATH500, Minerva, Olympiad). They measured average accuracy, like a report card across many subjects. They also tried a multimodal case (GeoQA) to check generality.
The Competition: Strong baselines included GRPO, Dr.GRPO, GPG, DAPO, GSPO, and a difficulty-aware variant (GRPO-AD). They also isolated the contributions of each new piece: DGPO alone, MQR alone, and the full MathForge combo.
Scoreboard (with context):
- Baseline GRPO average: 37.61%. Think of this as a solid B.
- DGPO: 39.79% (+2.18). That's moving from a B to a strong B+ by fixing imbalance and upweighting hard questions.
- MQR: 41.04% (+3.43). This is like taking tougher practice sheets and seeing your test grade rise to an A-.
- MathForge (DGPO + MQR): 42.17% (+4.56). That's an A-, clearly above the rest, showing the synergy of harder data plus smarter learning.
More models, same story:
- On smaller and different models (Qwen2.5-Math-1.5B, Qwen2.5-3B, DeepSeek-Math-7B), MathForge again wins. Gains of up to +4.45 percentage points show it's not just a one-model trick.
Compatibility bonus:
- Adding DGPO to other methods (GPG, DAPO, GSPO) improved them, too. For example, DAPO+DGPO outperformed DAPO alone, meaning DGPO is a good "plug-in" that helps many systems.
Multimodal case:
- On GeoQA with images, DGPO scored 59.95% vs GRPO's 57.43% (+2.52). The core idea, emphasizing the hard-but-solvable, still helps when pictures are involved.
Surprising findings:
- DGPO's answers got more concise over training. That suggests the model learned to take cleaner, shorter reasoning paths: not just to be correct, but to be efficient.
- Training on MQR (harder) data made training accuracy lower but test accuracy higher: "train harder, test better." This is the good kind of challenge that builds durable skills.
Difficulty matters (checked directly):
- When they measured how often a base model got the reformulated questions right, Background, Term, and Sub-Problem versions were indeed harder than Original, especially Sub-Problem. This confirms MQR truly raises difficulty, not just changes wording.
Quality control:
- They sampled many reformulations and used a strong model to check that the final answer stayed the same. Equivalence rates were high (about 97-99%), reducing the risk of teaching the wrong target.
05 Discussion & Limitations
Limitations:
- Hard focus, gentle guardrails: While DGPO upweights hard problems, it still needs to avoid neglecting easy ones (to prevent forgetting). The temperature T and valid-query logic help, but poor settings could over- or under-focus.
- Data quality matters: MQR must keep the same final answer. Although equivalence checks are strong, a small fraction can slip through and become invalid (usually yielding no harmful gradient, but still consuming compute).
- Binary rewards are simple but blunt: In many setups, correctness is all-or-nothing. Partial-credit rewards could teach more nuance but require careful design.
- Batch dependence: DQW computes difficulty within each batch, so batches should be well-shuffled and representative.
Required resources:
- Compute for RL training (they used 8 NVIDIA H20s) and a reformulator model for MQR (OpenAI o3 or capable open-source alternatives). Storage for augmented datasets.
When not to use:
- If you can't verify rewards reliably (no clear checker), RLVR and DGPO lose their main advantage.
- If your domain lacks a safe way to "make it harder but keep the same answer", MQR may be risky.
- If you only need short, factual recall (not reasoning), the extra complexity might not pay off.
Open questions:
- Best difficulty signals: Besides accuracy, could calibrated confidence or step-level signals help?
- Adaptive curricula: Can we auto-tune temperature T or weight schedules over time?
- Beyond math: How to define "harder but same answer" in other fields (logic puzzles, code synthesis), and validate it safely?
- Partial-credit rewards: Can fine-grained verifiable rewards accelerate learning even more?
- Human-AI co-creation: Could teachers guide MQR styles to target specific classroom skills?
06 Conclusion & Future Work
3-sentence summary: This paper introduces MathForge, a system that improves math reasoning by both making training data harder in safe ways (MQR) and training the model to learn more from those hard-but-solvable questions (DGPO). DGPO first balances update sizes across questions and then upweights hard ones, while MQR raises difficulty via Background, Term, and Sub-Problem rewrites without changing the correct answer. Together, they produce consistent gains over strong baselines across models and even in multimodal settings.
Main achievement: Showing that "harder is better" when difficulty is measured, controlled, and paired with a balanced training rule, and delivering a practical, general method (MathForge) that reliably lifts benchmark performance.
Future directions: Explore new difficulty signals, partial-credit verifiable rewards, adaptive weighting schedules, and broader domains (logical reasoning, programming). Improve automatic checks that confirm answer preservation in MQR. Study how to combine DGPO with curricula and self-play.
Why remember this: It reframes how we train reasoning models: don't just add more data; make it meaningfully harder and learn from it fairly. That simple shift, backed by a clean mathematical fix and careful data design, leads to stronger, more general problem-solving.
Practical Applications
- Build stronger AI math tutors that challenge students with carefully harder versions of the same problems.
- Improve automated homework checkers by focusing training on the toughest, near-miss cases students struggle with.
- Enhance financial planning bots to handle multi-step, abstract calculations with fewer errors.
- Train engineering assistants to navigate noisy, story-like specs and still extract the key numbers accurately.
- Upgrade in-text scientific calculators that must parse dense narratives and compute precise results.
- Boost multimodal reasoning (e.g., geometry with diagrams) by upweighting hard visual-text questions.
- Create fairer RL pipelines in other domains by adopting DGAE-style normalization and difficulty-aware weighting.
- Design study apps that adaptively reformulate practice questions to build robust reasoning skills.
- Strengthen code-generation or logic-puzzle solvers by crafting harder-but-equivalent prompts.
- Develop teacher tools that generate rigorous practice sheets with guaranteed same-answer variants.