Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation
Key Summary
- This paper argues that to make math-solving AIs smarter, we should train them more on the hardest questions they can almost solve.
- It fixes a hidden problem in a popular training method (GRPO) that accidentally gives medium-difficulty questions the biggest influence.
- Their new algorithm, DGPO, balances updates across questions and then gently upweights the harder ones.
- They also make existing questions tougher without changing the correct answer, using a strategy called MQR.
- MQR adds story noise, invents abstract terms, or turns a given number into a small sub-problem to raise difficulty safely.
- Together, the algorithm (DGPO) and the data strategy (MQR) form MathForge, a loop where harder data and better learning help each other.
- On several math benchmarks, MathForge beats strong baselines by a clear margin, including a +4.56 percentage-point average improvement over GRPO.
- DGPO also works with other training methods and even helps on multimodal math problems.
- The key insight is that "harder is better" when difficulty is measured and used carefully.
- All code and augmented data are released, aiming for reproducibility and broader use.
Why This Research Matters
MathForge shows that smarter training isn't just about more data; it's about the right difficulty and fair learning rules. By reliably making problems harder while keeping answers the same, we can safely stretch a model's abilities. Fixing the hidden imbalance in GRPO ensures that tough questions aren't sidelined by accident. This approach helps models become better problem solvers, not just better memorizers. Such improvements can translate into stronger tutoring tools, better planning systems, and more reliable scientific assistants. Because DGPO is compatible with other methods and even works with images, it can boost many future AI systems.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine practicing basketball. If you only shoot easy layups, you'll get good at layups, but not at three-pointers. To win real games, you must practice what's hard, on purpose.
The Concept: Before this work, large language models (LLMs) were getting better at solving math by learning from their own tries and receiving simple, checkable rewards (right or wrong). This is called Reinforcement Learning with Verifiable Rewards (RLVR). It's efficient because the reward is clear and trustworthy: did the answer match the gold solution or not?
Why it matters: Even with RLVR, many systems didn't pay special attention to hard questions. That meant models often improved on what they already found comfortable and didn't push hard enough on their weak spots.
Anchor: Think of a quiz app that marks answers correct or incorrect. If it keeps serving you medium questions where you score okay, you'll stay okay. To grow, the app should notice what's hardest for you and serve more of that.
New Concept #1: RLVR (Reinforcement Learning with Verifiable Rewards)
Hook: You know how teachers grade math answers with a simple checkmark if they're right? That's a clear, fair score.
The Concept: RLVR is training where an AI gets a reward it can fully trust, like 1 for a correct final answer and 0 for a wrong one. How it works (recipe):
1) Ask the model to solve a problem.
2) Use a rule-based checker to see if the final answer matches the gold one.
3) Give 1 for correct, 0 for wrong.
4) Nudge the model to make correct-like answers more likely next time.
Why it matters: Without trustworthy rewards, the model might chase tricks that don't reflect real understanding.
Anchor: When the AI says the area is 36 and the gold answer is 36, it gets a 1; otherwise, 0 (a minimal sketch of such a checker follows below).
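To make the 1-or-0 reward concrete, here is a minimal sketch of a verifiable reward check, assuming model outputs end with an `Answer:` marker. The function names and the marker convention are illustrative assumptions, not the paper's actual checker.

```python
# Minimal sketch of a verifiable reward (illustrative; not the paper's exact rule-based checker).

def extract_final_answer(response: str) -> str:
    """Take the text after the last 'Answer:' marker (a hypothetical output convention)."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted final answer matches the gold answer exactly, else 0.0."""
    return 1.0 if extract_final_answer(response) == gold_answer else 0.0

# The model says the area is 36 and the gold answer is "36": reward 1.0. Otherwise 0.0.
print(verifiable_reward("The side is 6, so the area is 6*6. Answer: 36", "36"))  # 1.0
print(verifiable_reward("Answer: 35", "36"))                                     # 0.0
```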
New Concept #2: GRPO (Group Relative Policy Optimization)
Hook: Imagine a class solves the same problem, and the teacher compares students within that one problem to see who did better.
The Concept: GRPO improves a model by generating several answers to the same question and comparing them to each other. How it works:
1) For one question, the model writes multiple answers.
2) Each answer gets a reward (correct/incorrect).
3) The model boosts answers that did better than the group average and reduces those that did worse.
4) Repeat for many questions.
Why it matters: Without comparing within the same question, the model needs a separate critic, which is slower and riskier to train.
Anchor: If three answers are wrong and one is right, GRPO learns to make the right-style answer more common (a minimal sketch of the group comparison follows).
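A minimal sketch of that within-question comparison, assuming binary rewards and the standard GRPO recipe of centering on the group mean and dividing by the group standard deviation; the variable names and the small epsilon are illustrative.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (reward - group mean) / group std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # spread of rewards within this one question's group
    return [(r - mean) / (std + eps) for r in rewards]

# One right answer out of four: it gets a positive advantage (reinforced),
# the three wrong ones get negative advantages (discouraged).
print(grpo_advantages([1, 0, 0, 0]))  # ~[1.73, -0.58, -0.58, -0.58]
```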
The Problem: The authors discovered a hidden imbalance in GRPO: mathematically, the size of the learning push is biggest for medium-difficulty questions (about half the answers right), and smaller for both really easy and really hard ones. That means super valuable hard-but-solvable questions don't get enough attention.
Past Attempts: Some tried reweighting by difficulty, or just rephrasing questions to increase variety. But these either didn't fix the root math imbalance in GRPO's update size, or they made data different without reliably making it harder.
The Gap: We needed a way to (1) fix the imbalance so all questions have a fair baseline push, and (2) then deliberately give extra weight to the currently hard-but-solvable ones. On the data side, we needed a way to make questions truly harder in systematic, safe ways while keeping the same correct answer.
Real Stakes: Better math reasoning shows up in everyday tech, like planning routes, budgeting, interpreting charts, or tutoring. If models learn mostly from medium stuff, they won't be ready for real-world puzzles, which are often messy, multi-step, and abstract.
02 Core Idea
Aha! Moment: If we train models to practice the hardest questions they can almost solve, and fix the math so hard questions count fairly, then the model's reasoning improves faster and farther.
Explain it three ways:
- Sports analogy: Instead of only scrimmaging against average teams (medium questions), schedule more games against strong teams you can still score on (hard-but-solvable). Also make sure the scoreboard counts every game fairly.
- Music analogy: Don't just play medium-speed songs. Balance the metronome so every practice counts the same, then spend more time on fast, tricky passages to grow your skill.
- Video game analogy: Don't farm mid-level areas forever. Make experience points fair for all levels, then farm tougher dungeons you can almost beat to level up quickly.
Before vs After:
- Before: GRPO gave the biggest update to medium-difficulty questions. Data augmentation often rephrased problems without really making them more challenging.
- After: DGPO first balances the update size across questions and then upweights the hard ones. MQR systematically reformulates questions to raise difficulty without changing the correct answer. Together, they form MathForge, a loop where harder data and better learning reinforce each other.
Why it works (intuition):
- Fixing the imbalance removes a hidden preference for medium difficulty, so all questions start on equal footing.
- Then, focusing a bit more on hard-but-solvable questions targets the model's weakest skills, the places where it's close to right, so learning is efficient and impactful.
- Making data harder in meaningful ways (noise tolerance, abstraction, multi-step logic) teaches general skills the model can reuse elsewhere, boosting test performance.
Building Blocks (with Sandwich explanations):
New Concept #3: Update Magnitude Imbalance (in GRPO)
Hook: You know how some games give more points for tasks in the middle of the map, even if the edges are tougher or easier? That changes where you play.
The Concept: GRPO's math makes medium-difficulty questions push learning the most, and hard/easy ones push less.
How it works: Because GRPO normalizes by the standard deviation of rewards, the summed "push" from a question is largest around 50% accuracy and shrinks near 0% or 100%.
Why it matters: Hard-but-solvable questions (gold for learning) get under-emphasized.
Anchor: If the model gets 1 out of 2 right (50%), GRPO pushes strongly; if 1 out of 10 or 9 out of 10, it pushes weakly, even if we really need to learn from that case. The tiny numeric check below shows the effect.
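The imbalance can be seen with a few lines of arithmetic. Treating a question's total "push" as the sum of absolute GRPO advantages is a simplification of the paper's analysis, but it shows the shape of the effect: the push peaks at 50% accuracy and shrinks toward the extremes.

```python
import statistics

def total_push_with_std_scaling(rewards, eps=1e-6):
    """Sum of |advantage| when advantages are scaled by the group's standard deviation (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return sum(abs(r - mean) / (std + eps) for r in rewards)

group_size = 10
for correct in (1, 3, 5, 7, 9):
    rewards = [1] * correct + [0] * (group_size - correct)
    print(f"{correct}/{group_size} correct -> total push {total_push_with_std_scaling(rewards):.2f}")
# Prints roughly 6.00, 9.17, 10.00, 9.17, 6.00: medium-difficulty questions push hardest.
```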
New Concept #4: MathForge
Hook: Imagine a smart coach who both designs tougher drills and scores every drill fairly, then schedules extra reps for what you struggle with.
The Concept: MathForge = DGPO (the fair, difficulty-aware trainer) + MQR (the safe, difficulty-raising question maker).
How it works: MQR expands the training set with harder versions of the same problems; DGPO learns from them by first balancing update sizes and then upweighting hard questions.
Why it matters: Without both parts, you either don't have enough hard data or you don't learn from it efficiently.
Anchor: Harder practice sheets (MQR) + smarter grading and focus (DGPO) → better test scores.
03 Methodology
High-level pipeline: Questions → MQR (make harder, same answer) → DGPO training (balance then upweight) → Stronger math reasoner.
Step-by-step, like a recipe:
New Concept #5: MQR (Multi-Aspect Question Reformulation)
Hook: You know how a riddle gets trickier if it adds a long story, uses fancy words, or makes you solve a small puzzle first?
The Concept: MQR rewrites a question in different ways to make it harder but keeps the final answer exactly the same. How it works:
1) Background: Add a story that sounds related but isn't needed for the math (noise).
2) Term: Invent an abstract term and restate the problem using it (abstraction).
3) Sub-Problem: Replace a given number with a small sub-problem whose solution supplies that number (multi-step reasoning).
4) Always check that the gold answer stays unchanged.
Why it matters: Without preserving the answer, you could teach the model the wrong target or need new solutions each time. With preserved answers, training stays simple and safe.
Anchor: Original: "Cake costs 6 euros." Background version: a long Paris story, same math. Term version: define "euro-gap", same math. Sub-problem version: solve a mini puzzle to get the exchange rate, same final euros to add. Hypothetical prompt templates for these three styles are sketched below.
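The paper uses a strong reformulator model to produce these rewrites; its actual prompts are not reproduced here, so the templates below are only a hypothetical sketch of how the three styles could be requested, with the answer-preservation requirement stated explicitly. The dictionary keys, function name, and example question are all made up for illustration.

```python
# Hypothetical MQR-style prompt templates (illustrative; not the paper's actual prompts).
MQR_PROMPTS = {
    "background": (
        "Rewrite the problem by adding a realistic background story that is NOT needed to solve it. "
        "Do not change any given numbers or the final answer.\n\nProblem: {question}"
    ),
    "term": (
        "Invent an abstract term, define it, and restate the problem using that term. "
        "The final answer must stay exactly the same.\n\nProblem: {question}"
    ),
    "sub_problem": (
        "Replace one given number with a small sub-problem whose solution equals that number. "
        "The final answer must stay exactly the same.\n\nProblem: {question}"
    ),
}

def build_reformulation_prompt(style: str, question: str) -> str:
    """Fill in the chosen template; the rewritten question is then checked for answer preservation."""
    return MQR_PROMPTS[style].format(question=question)

print(build_reformulation_prompt("sub_problem", "A cake costs 6 euros and a drink costs 3 euros. What is the total cost?"))
```

Whatever the real prompts look like, the non-negotiable part is the final check that the gold answer is unchanged.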
New Concept #6: DGPO (Difficulty-Aware Group Policy Optimization)
Hook: Picture a fair weightlifting competition: first weigh every lift on the same scale, then give a little extra spotlight to the toughest lifts the athlete can complete.
The Concept: DGPO is a training rule that first balances how much each question can push the model, then gives more weight to the hard-but-solvable ones. How it works (two key parts):
1) DGAE: Normalize advantages by mean absolute deviation so each question has a constant total "push".
2) DQW: Compute a difficulty score from accuracy (lower accuracy = harder) and softly upweight harder questions in the batch.
Why it matters: Without balancing, medium questions dominate. Without upweighting, the model doesn't focus where learning is most needed.
Anchor: If a batch has easy, medium, and hard questions, DGPO makes them all count fairly, then gently tilts attention toward the hard ones.
New Concept #7: DGAE (Difficulty-Balanced Group Advantage Estimation)
Hook: You know how a fair scale compares items the same way every time? If the scale is biased, some items always look heavier.
The Concept: DGAE replaces GRPO's standard-deviation scaling with mean absolute deviation (MAD) so every question contributes a constant total update size. How it works:
1) For one question, compare each answer's reward to the group average.
2) Divide by the group's MAD (not the standard deviation).
3) The sum of the absolute advantages across answers becomes a constant, independent of how hard the question is.
Why it matters: Without this, medium questions get outsized impact; with it, all questions start equally loud.
Anchor: Whether the model gets 1/10 or 5/10 right, the total "push" from that question is the same after DGAE (the sketch below checks this numerically).
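A minimal sketch of the MAD-based scaling described above, assuming binary rewards; the helper names and the small epsilon are illustrative. The quick check at the end verifies the constant-total-push property for 1/10 and 5/10 accuracy.

```python
def mean_absolute_deviation(values, eps=1e-6):
    """Average distance of each value from the group mean (plus a tiny epsilon for stability)."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values) + eps

def dgae_advantages(rewards):
    """DGAE-style advantages: centre on the group mean, then scale by MAD instead of std."""
    mean = sum(rewards) / len(rewards)
    scale = mean_absolute_deviation(rewards)
    return [(r - mean) / scale for r in rewards]

for rewards in ([1] + [0] * 9, [1] * 5 + [0] * 5):  # 1/10 correct vs 5/10 correct
    total_push = sum(abs(a) for a in dgae_advantages(rewards))
    print(f"accuracy {sum(rewards)}/10 -> total push {total_push:.2f}")
# Both lines print ~10.00: after DGAE, every mixed-outcome question contributes the same total push.
```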
New Concept #8: MAD (Mean Absolute Deviation)
Hook: Imagine measuring how far each kid's height is from the class average, ignoring plus or minus, then averaging those distances.
The Concept: MAD is the average of absolute differences from the mean.
How it works: For the rewards in a group, compute each reward minus the average reward, take absolute values, then average.
Why it matters: MAD gives a stable scale that treats groups fairly, avoiding the medium-only boost that standard deviation causes in this setting.
Anchor: Rewards [1,0,0,1] and [1,1,0,0] get comparable normalization strength via MAD.
New Concept #9: DQW (Difficulty-Aware Question-Level Weighting)
Hook: When studying, you spend a bit more time on topics you often miss.
The Concept: DQW gives each question a weight based on how hard it currently is (measured by low accuracy across its sampled answers). How it works:
1) Compute the difficulty D as the negative mean accuracy for that question.
2) Use a softmax with temperature T to convert D into weights across the batch.
3) Multiply each question's loss by its weight.
Why it matters: Without DQW, you don't focus learning on where the model needs it most.
Anchor: A question that only 1/8 answers got right gets a higher weight than one with 6/8 right (see the sketch below).
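A minimal sketch of the weighting step, assuming difficulty is simply the negative mean accuracy and the softmax runs over the questions in a batch; the temperature value here is illustrative, not a recommended setting.

```python
import math

def dqw_weights(accuracies, temperature=1.0):
    """DQW-style weights: difficulty = -mean accuracy, turned into batch weights by a softmax,
    so lower-accuracy (harder) questions weigh more in the loss."""
    difficulties = [-a for a in accuracies]
    exps = [math.exp(d / temperature) for d in difficulties]
    total = sum(exps)
    return [e / total for e in exps]

# Accuracies 1/8, 6/8, 4/8: the 1/8 question gets the largest weight.
print(dqw_weights([1/8, 6/8, 4/8], temperature=0.5))  # ~[0.57, 0.16, 0.27]
```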
New Concept #10: Valid Token-Level Loss Averaging
Hook: If your class grade counted missing assignments unevenly, your average would jump around a lot.
The Concept: Only average losses over "valid" questions (those whose group of answers is not all-correct or all-wrong) to keep gradients stable. How it works:
1) Mark a question valid if its group responses aren't uniformly correct or uniformly wrong.
2) Average token-level losses over the valid queries only.
3) This stabilizes training and pairs neatly with DQW.
Why it matters: Without it, gradients can swing wildly from batch to batch, slowing or destabilizing training.
Anchor: If a question yields all zeros (no signal), skip it when averaging so it doesn't distort learning; a simplified filter is sketched below.
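A simplified version of the "valid question" filter. The paper averages token-level losses; this sketch works at the question level just to show the rule, and the function names are assumptions.

```python
def is_valid_question(rewards):
    """A question is valid when its sampled answers are neither all correct nor all wrong,
    i.e. the group comparison actually carries a learning signal."""
    return 0 < sum(rewards) < len(rewards)

def average_over_valid(question_losses, reward_groups):
    """Average losses only over valid questions (simplified stand-in for token-level averaging)."""
    kept = [loss for loss, rewards in zip(question_losses, reward_groups) if is_valid_question(rewards)]
    return sum(kept) / max(len(kept), 1)

print(is_valid_question([1, 0, 0, 1]))  # True: mixed outcomes, keep it
print(is_valid_question([0, 0, 0, 0]))  # False: no signal, skip it in the average
print(average_over_valid([0.8, 0.3, 0.5], [[1, 0, 0, 1], [0, 0, 0, 0], [1, 1, 1, 0]]))  # averages 0.8 and 0.5 only
```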
Putting it all together (with a small example):
- Suppose a batch has 3 questions, each with 4 generated answers. For each question: compute rewards (1/0), compute DGAE advantages using MAD, check validity, and compute a difficulty score from accuracy. Then scale the whole question's contribution by its DQW weight. Across the batch, the model learns more from the harder questions, but every question speaks clearly thanks to DGAE. A toy end-to-end walkthrough follows.
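Here is a self-contained toy walkthrough of that batch, combining the pieces above (DGAE advantages, validity filtering, DQW weights) in one place. It only tracks the bookkeeping, not the actual token-level loss or policy update, and the temperature and function names are illustrative assumptions.

```python
import math

def dgpo_batch_sketch(reward_groups, temperature=0.5, eps=1e-6):
    """Toy bookkeeping for one DGPO-style batch (illustrative, not the paper's implementation)."""
    questions = []
    for rewards in reward_groups:
        mean = sum(rewards) / len(rewards)
        mad = sum(abs(r - mean) for r in rewards) / len(rewards) + eps
        questions.append({
            "accuracy": mean,
            "advantages": [(r - mean) / mad for r in rewards],  # DGAE: constant total push
            "valid": 0 < sum(rewards) < len(rewards),           # skip all-right / all-wrong groups
        })

    # DQW over the valid questions: lower accuracy (harder) -> larger weight.
    valid = [q for q in questions if q["valid"]]
    exps = [math.exp(-q["accuracy"] / temperature) for q in valid]
    for q, e in zip(valid, exps):
        q["weight"] = e / sum(exps)
    return questions

batch = [
    [1, 1, 1, 0],  # easy-ish: 3/4 correct
    [1, 0, 0, 0],  # hard-but-solvable: 1/4 correct
    [0, 0, 0, 0],  # unsolved: no signal, filtered out of the weighting
]
for q in dgpo_batch_sketch(batch):
    print(q)
```

In this toy batch the 1/4-accuracy question ends up with more than twice the weight of the 3/4-accuracy one, while the all-wrong question contributes nothing: the "fair baseline, then tilt toward hard" behaviour described above.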
Secret sauce:
- The balance-then-reweight design fixes the math first (fair baseline), then applies a simple, controllable focus on hard cases (DQW). Combined with MQR's high-quality harder data, the model practices exactly the skills it lacks, safely and efficiently.
04 Experiments & Results
The Test: The authors trained several math-focused LLMs on the MATH dataset and tested on multiple benchmarks (AIME24, AIME25, AMC23, MATH500, Minerva, Olympiad). They measured average accuracy, like a report card across many subjects. They also tried a multimodal case (GeoQA) to check generality.
The Competition: Strong baselines included GRPO, Dr.GRPO, GPG, DAPO, GSPO, and a difficulty-aware variant (GRPO-AD). They also isolated the contributions of each new piece: DGPO alone, MQR alone, and the full MathForge combo.
Scoreboard (with context):
- Baseline GRPO average: 37.61%. Think of this as a solid B.
- DGPO: 39.79% (+2.18). That's moving from a B to a strong B+ by fixing imbalance and upweighting hard questions.
- MQR: 41.04% (+3.43). This is like taking tougher practice sheets and seeing your test grade rise to an A-.
- MathForge (DGPO + MQR): 42.17% (+4.56). That's an A-, clearly above the rest, showing the synergy of harder data plus smarter learning.
More models, same story:
- On smaller and different models (Qwen2.5-Math-1.5B, Qwen2.5-3B, DeepSeek-Math-7B), MathForge again wins. Gains of up to +4.45 percentage points show it's not just a one-model trick.
Compatibility bonus:
- Adding DGPO to other methods (GPG, DAPO, GSPO) improved them, too. For example, DAPO+DGPO outperformed DAPO alone, meaning DGPO is a good "plug-in" that helps many systems.
Multimodal case:
- On GeoQA with images, DGPO scored 59.95% vs GRPO's 57.43% (+2.52). The core idea, emphasizing the hard-but-solvable, still helps when pictures are involved.
Surprising findings:
- DGPO's answers got more concise over training. That suggests the model learned to take cleaner, shorter reasoning paths: not just to be correct, but to be efficient.
- Training on MQR (harder) data made training accuracy lower but test accuracy higher: "train harder, test better." This is the good kind of challenge that builds durable skills.
Difficulty matters (checked directly):
- When they measured how often a base model got the reformulated questions right, Background, Term, and Sub-Problem versions were indeed harder than Original, especially Sub-Problem. This confirms MQR truly raises difficulty, not just changes wording.
Quality control:
- They sampled many reformulations and used a strong model to check that the final answer stayed the same. Equivalence rates were high (about 97-99%), reducing the risk of teaching the wrong target.
05 Discussion & Limitations
Limitations:
- Hard focus, gentle guardrails: While DGPO upweights hard problems, it still needs to avoid neglecting easy ones (to prevent forgetting). The temperature T and valid-query logic help, but poor settings could over- or under-focus.
- Data quality matters: MQR must keep the same final answer. Although equivalence checks are strong, a small fraction can slip through and become invalid (usually yielding no harmful gradient, but still consuming compute).
- Binary rewards are simple but blunt: In many setups, correctness is all-or-nothing. Partial-credit rewards could teach more nuance but require careful design.
- Batch dependence: DQW computes difficulty within each batch, so batches should be well-shuffled and representative.
Required resources:
- Compute for RL training (they used 8 NVIDIA H20s) and a reformulator model for MQR (OpenAI o3 or capable open-source alternatives). Storage for augmented datasets.
When not to use:
- If you can't verify rewards reliably (no clear checker), RLVR and DGPO lose their main advantage.
- If your domain lacks a safe way to "make it harder but keep the same answer", MQR may be risky.
- If you only need short, factual recall (not reasoning), the extra complexity might not pay off.
Open questions:
- Best difficulty signals: Besides accuracy, could calibrated confidence or step-level signals help?
- Adaptive curricula: Can we auto-tune temperature T or weight schedules over time?
- Beyond math: How to define "harder but same answer" in other fields (logic puzzles, code synthesis), and validate it safely?
- Partial-credit rewards: Can fine-grained verifiable rewards accelerate learning even more?
- Human-AI co-creation: Could teachers guide MQR styles to target specific classroom skills?
06 Conclusion & Future Work
3-sentence summary: This paper introduces MathForge, a system that improves math reasoning by both making training data harder in safe ways (MQR) and training the model to learn more from those hard-but-solvable questions (DGPO). DGPO first balances update sizes across questions and then upweights hard ones, while MQR raises difficulty via Background, Term, and Sub-Problem rewrites without changing the correct answer. Together, they produce consistent gains over strong baselines across models and even in multimodal settings.
Main achievement: Showing that "harder is better" when difficulty is measured, controlled, and paired with a balanced training rule, and delivering a practical, general method (MathForge) that reliably lifts benchmark performance.
Future directions: Explore new difficulty signals, partial-credit verifiable rewards, adaptive weighting schedules, and broader domains (logical reasoning, programming). Improve automatic checks that confirm answer preservation in MQR. Study how to combine DGPO with curricula and self-play.
Why remember this: It reframes how we train reasoning models: don't just add more data; make it meaningfully harder and learn from it fairly. That simple shift, backed by a clean mathematical fix and careful data design, leads to stronger, more general problem-solving.
Practical Applications
- Build stronger AI math tutors that challenge students with carefully harder versions of the same problems.
- Improve automated homework checkers by focusing training on the toughest, near-miss cases students struggle with.
- Enhance financial planning bots to handle multi-step, abstract calculations with fewer errors.
- Train engineering assistants to navigate noisy, story-like specs and still extract the key numbers accurately.
- Upgrade in-text scientific calculators that must parse dense narratives and compute precise results.
- Boost multimodal reasoning (e.g., geometry with diagrams) by upweighting hard visual-text questions.
- Create fairer RL pipelines in other domains by adopting DGAE-style normalization and difficulty-aware weighting.
- Design study apps that adaptively reformulate practice questions to build robust reasoning skills.
- Strengthen code-generation or logic-puzzle solvers by crafting harder-but-equivalent prompts.
- Develop teacher tools that generate rigorous practice sheets with guaranteed same-answer variants.