
Exploring Reasoning Reward Model for Agents

Intermediate
Kaixuan Fan, Kaituo Feng, Manyuan Zhang et al. Ā· 1/29/2026
arXiv Ā· PDF

Key Summary

  • The paper teaches AI agents more effectively by grading not just their final answers, but also how they think and use tools along the way.
  • They build a special helper called Agent-RRM that produces three things: a reasoning trace, a short critique, and a final quality score.
  • These rich signals are plugged into agents in three ways: Reagent-C (use the critique to fix answers at inference time), Reagent-R (add the score as extra reward during training), and Reagent-U (combine both).
  • On tough benchmarks like GAIA and WebWalkerQA, the unified method Reagent-U performs best, reaching 43.7% and 46.2% accuracy respectively.
  • Text-only critiques already help without retraining, showing that clear feedback can fix many small reasoning or tool-use mistakes.
  • Adding model-based rewards reduces the problem of ā€œsparse rewards,ā€ guiding agents even when the final answer is wrong but the reasoning is partly right.
  • Balancing the weight of the reward model matters: a moderate mix works best, while overweighting it hurts final-task success.
  • The team releases four datasets and open-source code to help others train and test reasoning-aware agents.
  • This approach makes agents more reliable on long, multi-step tasks involving search, coding, and multimodal inputs.
  • The work focuses on 8B models; scaling up and testing in messier, real-world settings are important next steps.

Why This Research Matters

Many real-life tasks require several correct steps in a row—finding sources, opening pages, checking details, computing results, and combining text with images or audio. If an AI only learns from final outcomes, it can’t tell which parts of its process to fix, so it repeats the same mistakes. This paper shows how to give rich, structured feedback that explains thinking, pinpoints actionable fixes, and still provides a clean score for training. The result is an agent that plans better, verifies sources more often, and self-corrects small errors before they become big ones. This approach can make research assistants, coding copilots, and data analysts more reliable. In the long run, it builds trust: we don’t just get answers—we get agents that learn good habits.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: You know how teachers sometimes mark your homework with notes like ā€œGreat idea here!ā€ or ā€œShow your steps,ā€ instead of just writing a big A or F? Those little notes help you learn way faster than a single final score.

🄬 The Concept (Agentic Reinforcement Learning – the world before): In AI, Agentic Reinforcement Learning is like teaching a robot to do multi-step tasks (search the web, read pages, write code) by letting it try things and then rewarding good results.

  • How it worked:
    1. The agent tried a long sequence of steps (a trajectory).
    2. At the end, it either got a reward for the right final answer or nothing if it was wrong.
    3. Over time, it learned which actions led to success.
  • Why it mattered: Agents could learn to use tools and plan over many steps.
  • What broke: For long tasks, a simple ā€œcorrect/incorrectā€ at the end is too vague. An agent that almost got it right but messed up one last click got the same zero as an agent that did everything wrong.

šŸž Anchor: Imagine baking a cake for the first time: if your parent only says ā€œWrongā€ at the end without telling you the batter was too runny or the oven was too hot, you won’t know what to fix next time.

šŸž Hook: Imagine a soccer coach who only looks at the final score and ignores how well the team passed, defended, or kept formation. The team would never know which parts to improve.

🄬 The Concept (The problem): Most agent training gives sparse, outcome-only rewards.

  • How it plays out:
    1. Long tasks have many decisions; a single end score hides which decisions were good.
    2. Agents can’t tell if their plan was solid but their last step failed, or if the plan was bad from the start.
    3. Training stagnates or learns the wrong habits.
  • Why it matters: Without guidance on intermediate thinking, agents struggle at web search, coding, and multimodal tasks that need several correct steps in a row.

šŸž Anchor: It’s like practicing piano for a big concert but only hearing ā€œpass/failā€ after the show. You’d never learn which parts of the song to rehearse.

šŸž Hook: Think about getting gold stars not just for finishing a puzzle, but also for using smart strategies like starting with edge pieces or sorting by color. That helps you improve faster.

🄬 The Concept (Failed attempts): People tried to add more feedback using step-level rewards or pairwise preferences.

  • How they worked:
    1. Step-level rewards label each action, but are costly to annotate and easy to ā€œgame.ā€
    2. Pairwise preference models say which of two trajectories is better but may miss fine-grained differences and don’t tell you how to fix errors.
  • Why it matters: These methods either cost too much, bias learning, or don’t give actionable guidance.

šŸž Anchor: It’s like a judge who can say ā€œThis try is better than that one,ā€ but can’t explain why or how to improve the worse try.

šŸž Hook: Imagine a coach who watches your whole play, writes clear notes on what went well, points out mistakes, and then gives a fair score. That’s the kind of feedback that speeds up learning.

🄬 The Concept (The gap): Agents needed multi-granular feedback that explains the reasoning, highlights fixable flaws, and still gives a clear overall score.

  • How it should work:
    1. Explain the thinking process (what worked, what didn’t).
    2. Give short, actionable critique (ā€œYou should have opened this link,ā€ ā€œDon’t trust snippetsā€).
    3. Provide a numeric score to guide training.
  • Why it matters: This blends human-friendly advice with machine-friendly rewards.

šŸž Anchor: It’s like getting a teacher’s margin notes plus a rubric score—words help you improve today, scores help you track progress over time.

šŸž Hook: Think of a long treasure hunt with maps, codes, and tools. If you only learn whether you found the treasure at the end, you’ll keep repeating the same detours. If a guide tells you which clue you misread and which tool to try next, you’ll get better quickly.

🄬 The Concept (Real stakes): Better feedback means more reliable agents for everyday multi-step tasks—researching health info, summarizing documents, debugging code, checking tables in images, or transcribing audio before analysis.

  • How it works out in life:
    1. Students can ask agents to find and verify sources.
    2. Journalists can trace claims to original pages.
    3. Analysts can mix code, web, and files in one reasoning loop.
  • Why it matters: When agents understand not just answers but also good process, they make fewer dangerous mistakes and are easier to trust.

šŸž Anchor: A homework helper that explains which step you skipped and how to fix it tomorrow is more helpful than one that just blurts out an answer.

02Core Idea

šŸž Hook: You know how a helpful tutor doesn’t just say ā€œrightā€ or ā€œwrong,ā€ but also shows you their scratch work and writes a short note like ā€œDouble-check your last stepā€?

🄬 The Concept (Aha! moment): The key idea is to teach agents with structured reasoning feedback: generate a reasoning trace, a focused critique, and a final score for each attempt, then use those signals to train better agents.

  • How it works:
    1. Build a reward model (Agent-RRM) that explains its judgment out loud (a think trace), gives actionable critique, and ends with a numeric score.
    2. Plug those signals into agents in three ways: use the critique at inference (Reagent-C), use the score during RL (Reagent-R), or combine both (Reagent-U).
    3. Evaluate across many tasks to confirm gains.
  • Why it matters: Without these multi-part signals, agents can’t tell which parts of their thinking to fix.

šŸž Anchor: It’s like grading math with ā€œshow your work,ā€ a sticky note of what to fix, and a final grade—together, they make you stronger next time.

šŸž Hook: Imagine three kinds of advice while practicing basketball: a video replay (reasoning trace), a coach’s quick tip (ā€œPlant your feet before shootingā€), and a scoreboard. All three help in different ways.

🄬 The Concept (Agent-RRM): Agent-RRM is a special judge that reads an agent’s whole trajectory and returns structured feedback:

  • What it is: A reasoning-aware reward model that outputs a think trace, a concise critique, and a scalar score.
  • How it works:
    1. Pretrain (SFT) to learn the output format and good judging habits.
    2. Fine-tune with RL to calibrate scores and keep reasoning consistent.
    3. Use it on any agent attempt to produce rich feedback without needing the ground-truth answer.
  • Why it matters: It gives both human-readable advice and machine-usable rewards, improving transparency and learning.

šŸž Anchor: Like a science-fair judge who explains their reasoning, gives you a tip sheet, and a final rating—all without knowing the ā€œone right way,ā€ just evaluating your process.

šŸž Hook: Picture Lego instructions (step-by-step trace), a sticky note saying ā€œUse the blue 2x2 brick here!ā€ (critique), and a star rating (score). Using all three beats guessing.

🄬 The Concept (Structured feedback components):

  • What it is: Three parts—think trace, critique, and score.
  • How it works:
    1. Think trace: describes strengths/weaknesses in logic and tool calls.
    2. Critique: short, focused, and actionable guidance to fix mistakes.
    3. Score: a 0–1 number summarizing quality.
  • Why it matters: The trace builds insight, the critique drives immediate fixes, and the score guides training.

šŸž Anchor: When searching ā€œWho first directed this award?ā€, the critique might say ā€œOpen the winners list page you found,ā€ which often flips a near-miss into a correct answer.

šŸž Hook: Imagine three training modes: reading a tip and trying again right now, earning points based on how well you played, or doing both at once.

🄬 The Concept (Reagent family):

  • What it is: Three agent variants that use Agent-RRM differently.
  • How it works:
    1. Reagent-C (text-augmented refinement): keep the model frozen and simply feed the critique back in; the agent tries a second pass with that advice.
    2. Reagent-R (reward-augmented guidance): during RL, add the Agent-RRM score to the usual rule-based reward, so even partial-good reasoning gets credit.
    3. Reagent-U (unified integration): combine both—optimize for good initial answers and good critique-guided refinements inside one loop.
  • Why it matters: This tests where language feedback, numeric feedback, or both together help most.

šŸž Anchor: In tests, Reagent-U consistently wins—like using both a map (critique) and a compass (score) instead of only one—reaching 43.7% on GAIA and 46.2% on WebWalkerQA.

šŸž Hook: If you’ve ever fixed a mistake because a friend said ā€œCheck step 3,ā€ you know why this works.

🄬 The Concept (Why it works):

  • What it is: Intuition behind the math.
  • How it works:
    1. Long tasks need many correct micro-decisions; a final-only score hides those signals.
    2. A continuous score rewards partially-correct reasoning, keeping learning on track.
    3. Short critiques directly point to fixable mistakes (e.g., ā€œDon’t trust snippets; open the sourceā€).
    4. Combining both trains agents to think better the first time and to self-correct when needed.
  • Why it matters: It reduces wasted trials and locks in robust habits for multi-tool, multi-turn work.

šŸž Anchor: On a web task, adding the critique ā€œOpen the page and verify namesā€ turns a near-miss into a hit. Over time, the agent learns to verify by itself, even without a critique.

03Methodology

šŸž Hook: Imagine a school project with three helpers: a coach who gives tips, a scoreboard for points, and a practice routine that tries several versions and keeps the best ideas.

🄬 The Concept (High-level recipe): At a high level: Input question → Agent produces attempts → Agent-RRM judges each with think/critique/score → The agent uses either the critique now (Reagent-C), the score in training (Reagent-R), or both (Reagent-U) → Output a better answer.

  • Why it matters: This loop teaches both good first tries and smart self-correction.

šŸž Anchor: Like drafting an essay, getting margin notes plus a grade, and then writing a stronger final draft.

šŸž Hook: Picture a set of tools on a desk: a web browser, a search bar, a calculator (Python), a file opener, a picture describer, and an audio transcriber.

🄬 The Concept (Agent’s tool belt): The agent can use Search, Web Browse, Python code, File reader, Image-to-text, and Audio-to-text.

  • How it works:
    1. Search finds candidate pages; Browse opens and summarizes content.
    2. Python runs calculations for math or data.
    3. File reader opens documents; Image-to-text reads charts/images; Audio-to-text makes transcripts.
  • Why it matters: Real tasks often mix reading, computing, and media understanding.

šŸž Anchor: If asked ā€œWhich director first won this award?ā€, the agent may Search, then Browse the winners page to confirm the earliest name.

šŸž Hook: Think of making a giant practice set that covers math, web research, images, and audio, so the agent learns many patterns.

🄬 The Concept (Datasets): The team built four datasets:

  • What they are: Reagent-RL-709K (for RL), Reagent-SFT-55.6K (correct full trajectories for supervised warm-up), and two for the reward model (RRM-SFT-28K, RRM-RL-90K).
  • How it works:
    1. Clean, filter, and deduplicate questions.
    2. Generate trajectories with multiple models to capture many error styles.
    3. Label with structured think/critique/score to teach Agent-RRM how to judge.
  • Why it matters: Good training data teaches robust reasoning and robust judging.

šŸž Anchor: Like practicing many puzzles of different kinds—arithmetic, reading, pictures—so the coach learns to give the right tip for each.

šŸž Hook: Imagine a fair contest where several answers compete, each gets feedback, and the agent shifts its habits toward the higher-rated ones.

🄬 The Concept (GRPO – the training engine):

  • What it is: A group-based RL method that samples multiple outputs, scores them, and nudges the model toward better ones while staying close to a safe reference.
  • How it works:
    1. For a question, sample several candidate answers.
    2. Assign rewards (rule-based correctness, plus Agent-RRM’s score when used).
    3. Compare within the group so partial credit stands out.
    4. Keep the policy from drifting too far, too fast.
  • Why it matters: It stabilizes learning and makes good attempts pull the model in the right direction.

šŸž Anchor: Like a bake-off where the tastier cakes get copied more next round, but the recipe doesn’t change too wildly at once.

šŸž Hook: Three ways to use the judge—read the tip and try again now, add the score to training, or both.

🄬 The Concept (Reagent-C – critique at inference):

  • What it is: A training-free, plug-in mode: the model is frozen; you just feed in the judge’s short critique and ask for a refined second answer.
  • How it works:
    1. Generate a first try.
    2. Ask Agent-RRM for a critique that names concrete mistakes.
    3. Produce a second try that follows the advice.
  • Why it matters: Many failures are small missteps (e.g., didn’t open the key page); a precise tip often flips to correct.

šŸž Anchor: The critique ā€œOpen the winners list and verify the earliest directorā€ often turns a guessed answer into the right one.

🄬 The Concept (Reagent-R – add the score to training):

  • What it is: Use Agent-RRM’s number as an extra reward during RL to reduce sparsity.
  • How it works:
    1. Generate multiple tries.
    2. Reward = rule-based correctness + Ī» Ɨ judge score.
    3. Train the agent so even good partial reasoning gets credit.
  • Why it matters: The agent learns which intermediate behaviors are promising, not just which final answers were correct.

šŸž Anchor: Like earning points for good passes and not only for goals—you improve the whole play, not just the finish.

🄬 The Concept (Reagent-U – unify critique and score):

  • What it is: One loop that optimizes initial answers and critique-guided refinements together.
  • How it works:
    1. Make an initial attempt and a critique-guided second attempt.
    2. Pool them and score all tries (rule + judge).
    3. Train toward the better ones across both stages.
  • Why it matters: The agent internalizes both good first thinking and good self-correction; at test time, it performs well without needing critiques.

šŸž Anchor: Reagent-U scores best across many tasks (43.7% GAIA text, 46.2% WebWalkerQA), like practicing both your first serve and your second serve until both are solid.

šŸž Hook: Too much of a good thing can be bad—like adding too much salt to soup.

🄬 The Concept (Balancing Ī» – the mixing knob):

  • What it is: Ī» controls how much the judge’s score affects the reward.
  • How it works:
    1. Small Ī»: too little reasoning feedback; training stays sparse.
    2. Medium Ī» (about 0.2–0.4): best results—balanced learning.
    3. Big Ī» (e.g., 0.5): over-focusing on intermediate steps can hurt final answers.
  • Why it matters: You want enough reasoning guidance without overshadowing the true goal.

šŸž Anchor: Like balancing practice time between dribbling and shooting—the sweet spot gets you more wins.

04Experiments & Results

šŸž Hook: Think of a school decathlon: math, reading, research, and problem-solving. A good student must do well across the board, not just in one event.

🄬 The Concept (The test): The team measured how well agents completed multi-step tasks across 12 benchmarks, spanning math (AIME24/25, MATH500, GSM8K), knowledge reasoning (HotpotQA, 2Wiki, Bamboogle, MuSiQue), and general agent/search tasks (GAIA, WebWalkerQA, Humanity’s Last Exam, xbench).

  • How it worked:
    1. Agents used tools like web search, browsing, coding, and media understanding.
    2. A strong judge model scored correctness.
    3. They compared Reagent variants and many baselines.
  • Why it matters: If the method helps across varied tasks, it’s building real, general skills.

šŸž Anchor: It’s like testing not just spelling but essays, science projects, and oral reports too.

šŸž Hook: Imagine racing against top clubs and community teams, small and big.

🄬 The Concept (The competition): They compared against proprietary systems and open-source baselines (7B–32B), including process-reward methods.

  • How it worked:
    1. Same evaluation rules: pass@1 under standard decoding.
    2. Use the same judge to be fair.
    3. Track gains over solid recent methods.
  • Why it matters: Fair comparisons show if new ideas truly help.

šŸž Anchor: Like running on the same track with the same stopwatch for everyone.

šŸž Hook: If a class average is Bāˆ’ and you score an A, that’s a big deal.

🄬 The Concept (The scoreboard): Reagent-U leads broadly.

  • General agent/search:
    • GAIA (text subset): 43.7% for Reagent-U, beating strong open baselines.
    • WebWalkerQA: 46.2% for Reagent-U; strong gains over similar-size models.
    • xbench: 43.0% for Reagent-U (vs. 41.0% for Reagent-R and 32.0% for rule-only Reagent).
  • Knowledge reasoning: Bamboogle 76.8% with Reagent-U—like moving from a B to a strong A.
  • Math: AIME24 60.0% with Reagent-U, and strong results on MATH500 and GSM8K.
  • Cross-modal (full GAIA): Reagent-U stays competitive on text and improves on the full set, showing breadth.
  • Context: Gains of 10–15 points in some settings are like jumping a whole letter grade versus peers.
  • Why it matters: The method helps not just on one niche task, but across varied and long-horizon challenges.

šŸž Anchor: Reagent-U is that student who aces the final, the lab, and the group project—not just quizzes.

šŸž Hook: What surprised the team? Sometimes a tiny tip flips a miss into a hit.

🄬 The Concept (Surprising findings):

  • Reagent-C (critique at inference, no training) already boosts performance. This shows many failures are fixable with pointed advice like ā€œOpen the source pageā€ or ā€œPrint your Python result.ā€
  • Reagent-R (score-augmented RL) consistently beats rule-only RL: partial credit for good reasoning helps the agent learn even when the final answer is wrong.
  • Reagent-U (unified) wins overall: advice + points beat either alone.
  • Ī» matters: Too high a weight on the model score over-emphasizes intermediate steps; moderate values (0.2–0.4) work best.
  • Cross-modal: Gains hold when mixing search, coding, images, audio, and files—showing generalization.

šŸž Anchor: A math example: the critique ā€œDon’t divide by the number of painters; the question asks hours each painter workedā€ turned a wrong 47.25 into the correct 189.

05Discussion & Limitations

šŸž Hook: Even great bikes have training limits—you still need a safe road, time to practice, and a good fit.

🄬 The Concept (Limitations):

  • What it is: Where the method falls short.
  • How it shows up:
    1. Scale: Most results use 8B models; behavior at 14B, 70B, or larger is promising but untested here.
    2. Benchmarks vs. real world: Web APIs, site structures, and media can be messier; more open-ended testing is needed.
    3. Reward trade-offs: Overweighting model-based scores can reduce final-task focus; careful tuning is required.
    4. Tool dependence: If search APIs or summarizers are noisy, agents can inherit those errors.
  • Why it matters: Knowing limits helps you deploy wisely and plan next steps.

šŸž Anchor: Like knowing your calculator sometimes rounds weirdly—you double-check results when it counts.

šŸž Hook: Before you cook a feast, make sure you have a kitchen, ingredients, and time.

🄬 The Concept (Required resources):

  • What it is: What you need to run or train this.
  • How it shows up:
    1. Hardware: About 8Ɨ A800-80G GPUs for the training runs reported.
    2. Data: Hundreds of thousands of examples across RL/SFT for both agent and reward model.
    3. Tools: Access to search/browse, image and audio models, code execution, and a reliable judging model for evaluation.
  • Why it matters: Reproducing results needs compute, data, and tool access.

šŸž Anchor: It’s like needing an oven, ingredients, recipes, and a tasting panel to bake at competition level.

šŸž Hook: Hammers aren’t for everything; sometimes you need a screwdriver.

🄬 The Concept (When not to use):

  • What it is: Situations where simpler methods may win.
  • How it shows up:
    1. Very short tasks with perfect verifiers—outcome-only reward may be enough.
    2. Settings without tool access—if you can’t search or browse, gains may shrink.
    3. Ultra-low compute budgets—training the judge and agent together may be too heavy.
  • Why it matters: Pick the right tool for the right job.

šŸž Anchor: For a one-step arithmetic drill, you don’t need a full debate and a coach—just check the answer.

šŸž Hook: The best questions spark the next discoveries.

🄬 The Concept (Open questions):

  • What it is: What we still don’t know.
  • How it shows up:
    1. Scaling: How do signals interact at 70B+?
    2. Safety: How to ensure critiques never leak answers or harmful steps?
    3. Robustness: How to resist reward hacking and style overfitting in critiques?
    4. Domain breadth: How to extend to scientific tools, databases, and more complex multi-modal labs?
  • Why it matters: Solving these unlocks more trustworthy, general-purpose agents.

šŸž Anchor: Like moving from a great school team to the world championship—you’ll meet tougher fields and must refine your playbook.

06Conclusion & Future Work

šŸž Hook: Imagine a tutor who explains their thinking, gives you a short tip sheet, and a clear score—now imagine your study app learning from that every day.

🄬 The Concept (3-sentence summary): This paper builds Agent-RRM, a reasoning reward model that outputs a think trace, a concise critique, and a scalar score for agent trajectories. It integrates these signals into agents in three ways (Reagent-C/R/U), showing that language critiques and numeric rewards each help—and together help most. Across 12 benchmarks, the unified approach reaches top scores (e.g., 43.7% GAIA text, 46.2% WebWalkerQA), improving long-horizon, multi-tool reasoning.

šŸž Anchor: It’s like combining teacher notes with grades to train a stronger student who both plans well and self-corrects.

Main achievement: Turning agent training from ā€œfinal-answer-onlyā€ into ā€œprocess-aware learningā€ using structured, multi-part feedback that is both human-readable and machine-optimizable.

Future directions: Scale to larger models and broader tools, reinforce safety and anti-hacking measures, and test on open-ended real-world workflows (e.g., scientific discovery, enterprise data analysis).

Why remember this: When tasks are long and tricky, feedback that explains thinking, pinpoints fixes, and still gives a clean score can transform near-misses into reliable wins—and make AI agents far more trustworthy.

Practical Applications

  • Research assistant that opens sources, verifies facts on the actual page, and flags unverified claims.
  • Coding helper that runs small Python checks, prints outputs, and uses critiques to avoid silent errors.
  • Data analyst agent that reads files, cross-checks numbers, and explains which cells or steps might be wrong.
  • Open-domain QA agent that searches, browses, and confirms details rather than trusting snippets.
  • Educational tutor that points out exactly which math step was mistaken and suggests a targeted fix.
  • Customer support bot that follows documented procedures and uses critiques to avoid skipping key verification steps.
  • News summarizer that links back to original sources and notes confidence or missing confirmations.
  • Compliance checker that inspects documents against rules, with critiques explaining missing evidence or improper tool use.
  • Multimodal assistant that reads charts/images/audio transcripts and explains how each input supports its answer.
  • Enterprise agent that chains tools (search, code, files) with process-aware feedback to reduce brittle failures.
#Agentic Reinforcement Learning Ā· #Reasoning Reward Model Ā· #Process Supervision Ā· #Structured Critique Ā· #GRPO Ā· #Web Agents Ā· #Tool Use Ā· #Sparse Reward Ā· #Preference Modeling Ā· #Unified Feedback Ā· #Reinforcement Learning with Verifiable Rewards Ā· #Multi-modal Reasoning Ā· #Agent-RRM Ā· #Reagent-C Ā· #Reagent-U