Training Data Efficiency in Multimodal Process Reward Models
Key Summary
- Multimodal Process Reward Models (MPRMs) teach AI to judge each step of a picture-and-text reasoning process, not just the final answer.
- Creating step-by-step training labels with Monte Carlo (many tries) is expensive and often redundant, so using all data wastes time and compute.
- The paper discovers that what really helps learning are rollouts that mix both right and wrong steps while keeping the 'right' labels trustworthy.
- They propose the Balanced-Information Score (BIS) to pick the best rollouts using only existing Monte Carlo scores, with no extra model calls.
- BIS combines two ideas: label mixture (a healthy mix of positives and negatives) and label reliability (how trustworthy the positive labels are).
- Across two strong backbones (InternVL2.5-8B and Qwen2.5-VL-7B), BIS subsets often match or beat training on the full dataset using just 10-25% of the data.
- With InternVL2.5-8B, BIS hit full-data performance using only 10% of the training rollouts, saving about 95% of the compute.
- Randomly dropping data plateaus quickly, proving there's a lot of redundancy; BIS keeps the most informative rollouts so training stays strong.
- Soft targets from MC scores underperform simple binary labels, and making the binary threshold stricter also hurts, so BIS steers clear of low-confidence 'pseudo-positives.'
- BIS-trained MPRMs also improve inference-time reranking (best-of-N), giving better final answers across multiple benchmarks.
Why This Research Matters
This work shows we can train models to reason better with far less data and compute by carefully choosing which examples to keep. That means faster research cycles, lower costs, and a smaller environmental footprint. Because BIS uses only existing MC signals, anyone with an MC-annotated dataset can apply it immediately without extra model calls. The approach stabilizes learning by avoiding noisy pseudo-positives while keeping enough mistakes to teach the model where boundaries lie. Better step-level supervision also improves final answers via reranking, making AI more dependable on real tasks. Overall, BIS helps turn massive, messy datasets into smaller, smarter ones that teach models more effectively.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're learning to solve math puzzles that include pictures and words. A helpful teacher doesn't just say if your final answer is right or wrong; they point out which steps in your reasoning were good and which went off track. That makes you learn much faster.
The Concept: Multimodal Process Reward Models (MPRMs) do that teacher job for AI. They score each intermediate step in a visual-and-text reasoning chain so the AI can learn where it's thinking clearly and where it's slipping. How it works (before this paper):
- Collect lots of reasoning rollouts (step-by-step explanations) from models on visual problems.
- For each step, run Monte Carlo (MC) sampling (generate many possible continuations) and see how often they eventually lead to a correct final answer.
- Turn those MC scores into labels for training an MPRM (usually binary: positive if any continuation succeeds; otherwise negative). Why it matters: Without step-level guidance, models only learn from final answers, missing clues on how to fix their thinking.
Anchor: Think of a coach who gives you feedback after each move in chess, not just at checkmate. You improve way faster.
Hook: You know how taking 100 photos of the same scene doesn't make the picture much better after a point? More isn't always more.
The Concept: The problem here is training data efficiency. MPRMs are usually trained on huge MC-annotated datasets (millions of steps), which is costly and slow. But the authors found that randomly using only a small slice of the data gives almost the same performance. How it works:
- They tried random subsampling: keep only a fraction of the training rollouts.
- Performance climbed as more data was kept but quickly flattened, showing lots of redundancy.
- Even training the small subset for longer didn't fully bridge the gap, hinting the issue isn't just quantity. Why it matters: If we can intelligently pick which rollouts to keep, we can train much faster and cheaper without losing accuracy.
Anchor: It's like studying a smaller but smarter set of practice problems that still teaches you everything you need.
Hook: Imagine some stickers in your homework book are mislabeled: some "Great job!" stickers are on mistakes because the grader glanced too quickly. If you copy from those, you'll learn the wrong lesson.
The Concept: Label noise is when the labels (like "this step is good") are unreliable, often because the MC process is based on a limited number of samples. Particularly dangerous are low-MC positive steps: steps marked positive even though only 1 or 2 out of 16 tries eventually succeeded. How it works:
- MC scores come from N sampled continuations (N=16 here).
- If just one continuation randomly lucks into a correct final answer, the step gets a positive label.
- These "pseudo-positives" act like flipped labels, making gradients noisy and training harder. Why it matters: Too many noisy positives drown out the real learning signal.
Anchor: It's like trusting a shaky rumor; it can send your studying in the wrong direction.
Hook: Picture a classroom where the teacher knows the perfect answer key and the student is trying to learn from imperfect worksheets.
The Concept: A teacher-student framework explains what makes training data informative. The "teacher" represents the ideal, noiseless model. The "student" is your real model learning from noisy MC-labeled data. How it works:
- The teacher knows the true chance that a step is correct.
- The student updates its parameters using gradients from the (noisy) data.
- Strong learning happens when the teacher is uncertain (around 50/50), because that's where feedback teaches the most. Why it matters: This shows we want steps where there's meaningful uncertainty, but with reliable labels; otherwise updates get noisy.
Anchor: When you're unsure between two chess moves and your coach clearly explains which is better and why, you learn a lot.
Hook: You know how mixing easy wins with clear mistakes in your practice set helps you spot patterns? The contrast sharpens your judgment.
The Concept: The authors discovered two rollout qualities that make training efficient: mixture and reliability. How it works:
- Mixture: rollouts that include both positive and negative steps (not all one way) provide contrast, which strengthens learning.
- Reliability: among positive steps, higher average MC scores mean those positives are trustworthy (fewer pseudo-positives). Why it matters: Mixture without reliability invites noise; reliability without mixture lacks contrast. You need both.
Anchor: It's like a practice sheet that shows both great examples and clear mistakes, but where the "great" labels are actually correct.
Hook: Imagine packing a lunchbox: you want a good balance (protein and veggies) and fresh ingredients. A lunchbox with balance but spoiled food is bad; all fresh candy but no balance is also not great.
The Concept: The Balanced-Information Score (BIS) is a simple rollout-level score that picks data with both good mixture and good reliability. How it works:
- For a rollout, compute mixture as p_pos(1 - p_pos), where p_pos is the fraction of positive steps.
- Compute reliability R as the average MC score over positive steps (or 1 if there are none).
- BIS = [mixture + α] × R, with a small α to not ignore low-mixture-but-very-reliable rollouts. Why it matters: BIS uses only the MC signals already in the dataset (no extra model calls), so it's cheap and effective.
Anchor: Using BIS is like choosing the best lunchboxes that are both well-balanced and fresh, so students get energy and health without extra shopping trips.
02 Core Idea
Hook: You know how a good study guide mixes tricky questions with clear answer keys, so you learn fast without wasting time?
The Concept: The "aha!" is that the most helpful training rollouts are both mixed (contain right and wrong steps) and reliable (their positive steps have trustworthy MC scores). The paper turns this into a single score, BIS, to pick the best rollouts and skip the rest, cutting data and compute while keeping (or improving) performance. How it works:
- Notice training saturates quickly when you randomly drop data; so lots of data is redundant.
- Theory says learning is strongest where the true model is uncertain (around 50/50), but only if labels are reliable.
- Create BIS that multiplies mixture and reliability, then keep the top rollouts per source. Why it matters: Without BIS, training wastes time on flatly easy or misleading steps; with BIS, you spend updates where they teach the most.
Anchor: It's like studying 10 carefully chosen pages instead of 100 random ones and still acing the test.
Multiple analogies:
- Detective analogy: Mixing suspects (right/wrong steps) with trustworthy clues (reliable positives) cracks the case; unreliable clues mislead you.
- Workout analogy: Best gains come from sets that are challenging (uncertain) but with good form (reliable labels); sloppy reps (noisy positives) risk injury (bad gradients).
- Cooking analogy: Balance flavors (mixture) and freshness (reliability); too much of one or spoiled ingredients ruin the dish.
Before vs After:
- Before: People believed more data would steadily help; they used giant MC-labeled sets and sometimes random subsets.
- After: We now know much of that data is redundant; what you keep matters more than how much you keep. BIS-selected 10-25% can match or beat full-data training.
Why it works (intuition):
- Learning signal is strongest near uncertainty (around 0.5), where mistakes and successes both show up. That's what mixture measures.
- But uncertainty helps only if labels are correct enough; otherwise gradients get noisy. That's what reliability tracks.
- These two effects multiply: high mixture × high reliability = big, clean learning signals.
Building blocks:
- Rollouts: step-by-step reasoning traces over text+images.
- Mixture: percent of positive steps balanced with negatives, measured by p_pos(1 - p_pos).
- Reliability: average MC score of positive steps; higher means fewer pseudo-positives.
- BIS: [p_pos(1 - p_pos) + α] × R; rank rollouts by BIS within each source and keep the top fraction.
- Train the MPRM on the kept rollouts with the same simple binary loss as usual.
- Evaluate on a human-annotated benchmark to verify gains hold up beyond the training pool.
03 Methodology
At a high level: MC-labeled rollouts → compute BIS per rollout → rank within each source and keep the top τ% → train the MPRM exactly as usual → evaluate on VisualProcessBench and best-of-N reranking.
Step A: Inputs and labels
- What happens: Each training example is a visual question, images, and a step-by-step solution (a rollout). Every step has an MC score from N=16 sampled continuations (how often continuing from that step eventually leads to a correct final answer). The standard rule makes a binary label: positive if the MC score > 0, else negative.
- Why it exists: Binary labels are robust here because raw MC scores are noisy; soft targets encouraged overfitting to sampling noise.
- Example: If at step 3, 2 of 16 continuations eventually lead to a correct solution, its MC score is 2/16; it becomes a positive label under the default binarization.
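To make the default labeling rule concrete, here is a minimal Python sketch. It assumes each rollout is stored as a plain list of per-step MC scores in [0, 1]; the function name is illustrative, not taken from the paper's code.

```python
def binarize_step_labels(mc_scores: list[float], threshold: float = 0.0) -> list[int]:
    """Default binarization: a step is positive if its MC score exceeds the threshold (0 by default)."""
    return [1 if score > threshold else 0 for score in mc_scores]

# Step A's example: 2 of 16 continuations succeed -> MC score 2/16 = 0.125 -> positive label.
print(binarize_step_labels([0.0, 0.5, 0.125]))  # [0, 1, 1]
```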
Step B: Compute mixture p_pos(1 - p_pos)
- What happens: For a rollout with n steps, compute p_pos = (# positive steps)/n, then mixture = p_pos(1 - p_pos). This is largest near 0.5 and zero when all steps are positive or all negative.
- Why it exists: Mixture approximates teacher uncertainty, the sweet spot where gradients teach the most. All-positive or all-negative rollouts offer little contrast, so less signal.
- Example: In a 10-step rollout with 6 positives and 4 negatives, p_pos = 0.6, mixture = 0.6 × 0.4 = 0.24 (a healthy mix).
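A matching one-function sketch of the mixture term from Step B (again with illustrative names):

```python
def mixture(step_labels: list[int]) -> float:
    """p_pos * (1 - p_pos): largest near a 50/50 mix, zero when every step shares one label."""
    p_pos = sum(step_labels) / len(step_labels)
    return p_pos * (1.0 - p_pos)

# Step B's example: 6 positives and 4 negatives out of 10 steps.
print(mixture([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))  # 0.6 * 0.4 = 0.24 (up to float rounding)
```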
Step C: Compute reliability R
- What happens: Average the MC scores of the positive steps in the rollout. If there are zero positives, set R = 1 by convention (so purely negative rollouts aren't unfairly punished).
- Why it exists: Among positives, low MC scores often mean pseudo-positives (lucky one-offs), which behave like label flips and add noise. High R means trustworthy positive anchors.
- Example: If the positive steps have MC scores [0.25, 0.5, 0.75], R = (0.25 + 0.5 + 0.75)/3 = 0.5.
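And the reliability term from Step C, including the paper's convention of R = 1 when a rollout has no positive steps:

```python
def reliability(mc_scores: list[float], step_labels: list[int]) -> float:
    """Average MC score over the positive steps; 1.0 by convention if the rollout has no positives."""
    positive_scores = [s for s, y in zip(mc_scores, step_labels) if y == 1]
    if not positive_scores:
        return 1.0
    return sum(positive_scores) / len(positive_scores)

# Step C's example: positive steps with MC scores 0.25, 0.5, 0.75 (plus one negative step).
print(reliability([0.25, 0.5, 0.75, 0.0], [1, 1, 1, 0]))  # 0.5
```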
Step D: Balanced-Information Score (BIS)
- What happens: BIS = [mixture + α] × R, with a small α (e.g., 0.05) to avoid ignoring low-mixture-but-very-reliable rollouts. This couples contrast with trustworthiness.
- Why it exists: Theory shows gradient signal scales with uncertainty times label correctness; BIS mirrors that multiplicative structure.
- Example: Two rollouts: • Rollout A: mixture=0.24, R=0.6 → BIS=(0.24+0.05)*0.6=0.174. • Rollout B: mixture=0.10, R=0.9 → BIS=(0.10+0.05)*0.9=0.135. Result: A beats B because its better contrast outweighs B's higher reliability.
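Combining the pieces gives the rollout-level score. This sketch assumes the binarize_step_labels, mixture, and reliability helpers from the sketches above and uses α = 0.05, the value the paper reports working well:

```python
ALPHA = 0.05  # smoothing constant so very reliable but low-mixture rollouts are not ignored

def balanced_information_score(mc_scores: list[float], alpha: float = ALPHA) -> float:
    """BIS = (mixture + alpha) * reliability, computed from the rollout's own MC scores only."""
    labels = binarize_step_labels(mc_scores)                           # Step A
    return (mixture(labels) + alpha) * reliability(mc_scores, labels)  # Steps B-D

# Worked comparison from the example above:
#   Rollout A: (0.24 + 0.05) * 0.6 = 0.174
#   Rollout B: (0.10 + 0.05) * 0.9 = 0.135  ->  A ranks higher
```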
Step E: Per-source ranking and selection with keep ratio τ
- What happens: Within each of the 38 sources in the training set, rank rollouts by BIS and keep the top τ fraction (e.g., 10%, 25%). Concatenate all kept rollouts.
- Why it exists: Equalizes coverage across sources and avoids collapsing onto a few sources with unusual scoring.
- Example: If a source has 10,000 rollouts and τ=10%, keep the 1,000 highest-BIS rollouts from that source.
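A minimal sketch of the per-source selection in Step E. It assumes each rollout is a dict with a 'source' tag and its per-step 'mc_scores'; those field names are illustrative, not the dataset's actual schema.

```python
from collections import defaultdict

def select_top_rollouts(rollouts: list[dict], score_fn, keep_ratio: float = 0.10) -> list[dict]:
    """Rank rollouts by score_fn within each source and keep the top keep_ratio fraction of each."""
    by_source = defaultdict(list)
    for rollout in rollouts:
        by_source[rollout["source"]].append(rollout)

    kept = []
    for group in by_source.values():
        group.sort(key=lambda r: score_fn(r["mc_scores"]), reverse=True)
        n_keep = max(1, int(len(group) * keep_ratio))  # keep at least one rollout per source
        kept.extend(group[:n_keep])
    return kept

# Usage with the BIS sketch above:
# subset = select_top_rollouts(all_rollouts, balanced_information_score, keep_ratio=0.10)
```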
Step F: Training the MPRM (unchanged recipe)
- What happens: Train the same way as standard MPRMs: freeze the vision encoder, fine-tune the language + fusion parts, predict a Yes/No at each <prm> token with cross-entropy.
- Why it exists: Keeps the comparison fair; the only change is which rollouts we include.
- Example: The model outputs a probability of "Yes" per step; thresholds are tuned on a held-out split to maximize micro-F1.
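The objective in Step F amounts to a per-step binary classification loss. The sketch below is schematic, not the authors' training code: it assumes the model already exposes one logit per <prm> position, and writes the Yes/No cross-entropy in its binary form.

```python
import torch
import torch.nn.functional as F

def step_classification_loss(step_logits: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over per-step Yes/No predictions.

    step_logits: shape (num_steps,), one raw score per <prm> token.
    step_labels: shape (num_steps,), binarized MC labels (1 = positive step).
    """
    return F.binary_cross_entropy_with_logits(step_logits, step_labels.float())

# Example: a 3-step rollout whose binarized labels are [1, 1, 0].
loss = step_classification_loss(torch.tensor([2.0, 0.3, -1.5]), torch.tensor([1, 1, 0]))
```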
Step G: Evaluation
- What happens: Evaluate on VisualProcessBench (human-labeled step correctness) reporting micro-F1 overall and macro-F1 per source; also test best-of-N reranking where the MPRM scores candidate solutions.
- Why it exists: Ensures step-level gains transfer to real tasks and final answers.
- Example: In best-of-16, average the step scores for each candidate rollout and pick the highest-scoring one.
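Best-of-N reranking, as in the Step G example, reduces to averaging step scores per candidate and keeping the argmax. A sketch, where prm_score_fn stands in for whatever interface the trained MPRM exposes (an assumption, not a fixed API):

```python
def best_of_n(candidates: list[list[str]], prm_score_fn) -> list[str]:
    """Pick the candidate rollout whose steps receive the highest average MPRM score.

    candidates: N candidate solutions, each a list of step strings.
    prm_score_fn: maps a list of steps to a list of per-step 'Yes' probabilities.
    """
    def mean_step_score(steps: list[str]) -> float:
        scores = prm_score_fn(steps)
        return sum(scores) / len(scores)

    return max(candidates, key=mean_step_score)
```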
What breaks without each step:
- Skip mixture: You might pick easy all-positive rollouts with little contrast; the model learns slowly and overfits.
- Skip reliability: You might pick low-MC positives (pseudo-positives), amplifying gradient noise and hurting learning.
- Skip per-source selection: Selection may skew toward a few sources, reducing diversity and generalization.
- Use soft labels: The model chases MC sampling noise; results degrade.
The secret sauce:
- Multiplicative coupling: BIS encodes that informative steps are both uncertain and correctly labeled.
- No extra cost: Uses only existing MC scores, so selection is free compared to re-annotating or running new models.
- Robustness: BIS avoids the ambiguous low-MC regime while keeping enough negatives to learn clear boundaries.
- Speed: With a smaller, smarter subset, training reaches high accuracy in fewer steps and with far less compute.
04 Experiments & Results
The test: The authors trained MPRMs on subsets chosen either randomly or by BIS and measured step-level accuracy on VisualProcessBench (human-annotated), reporting overall micro-F1 and per-source macro-F1. They also tested best-of-16 reranking on multiple benchmarks to see if step scoring improves final answers.
The competition: BIS-selected subsets vs random subsets of the same size (5-50%), and against full-data training. Additional ablations compare BIS to simple heuristics: Mixed-only (favoring rollouts with both positives and negatives), Reliability-only (favoring high average MC positives), and Low-MC (favoring low average MC rollouts).
The scoreboard with context:
- InternVL2.5-8B backbone:
  - BIS-10% reached 65.46% micro-F1, matching or slightly beating full-data (about 65.12%) while using just one tenth of the rollouts and updates. That's like getting an A+ with only 10% of the homework.
  - BIS consistently outperformed random subsets at the same budget; at 10%, BIS beat Random-10% by +2.6 points.
  - At 25%, BIS again outperformed the random subset across sources.
- Qwen2.5-VL-7B backbone:
  - BIS delivered especially large gains at low budgets: +10.9 points over Random-5% and +5.5 points over Random-15%.
  - BIS at 25% already reached the full-data reference.
- Labeling strategy:
  - Soft labels (using raw MC scores as targets) underperformed hard labels.
  - Tightening the hard-label threshold above 0 hurt performance, likely mislabeling hard-but-correct steps as negative.
- Best-of-N reranking:
  - BIS-trained MPRMs improved final-answer accuracy over both random subsets and even some full-data runs on MM-K12, OlympiadBench, MathVerse, and MathVista.
Surprising findings:
- Random subsampling plateaus quickly; there's a lot of redundancy. Even keeping 25% and training longer didn't fully close the small remaining gap to full data.
- Low-MC positives are risky: they introduce pseudo-positives and degrade learning; BIS explicitly avoids over-favoring them.
- Performance scales nonlinearly with subset size: rapid gains at small budgets, peaking around a moderate keep ratio (often ~25%), with slight dips beyond that as less-informative rollouts get included.
- BIS is robust to the smoothing constant α (e.g., 0.05 worked best), confirming the method does not rely on delicate tuning.
Takeaway: Smaller but smarter beats bigger and random. BIS systematically finds the high-value rollouts so the model learns faster and better with far less data and compute.
05 Discussion & Limitations
Limitations:
- Assumptions in the theory (like simplified label-noise models and linearized views of learning) are approximations; real MPRMs are more complex.
- BIS depends on having MC scores; if a dataset lacks per-step MC signals, BIS can't be computed without extra annotation effort.
- Some domains may have fewer naturally mixed rollouts (many all-positive solutions), reducing the advantage of mixture-based selection.
- If the MC annotator itself is heavily biased or extremely noisy, even "reliable" positives may still be misleading.
Required resources:
- A pool of MC-annotated rollouts with step-level MC scores (like VisualPRM400K-v1.1).
- A capable MLLM backbone (e.g., InternVL2.5-8B or Qwen2.5-VL-7B), plus standard fine-tuning resources (multi-GPU recommended).
- A benchmark with human-annotated step labels (e.g., VisualProcessBench) for honest evaluation.
When not to use:
- When you have no step-level MC scores; BIS offers no advantage without that signal.
- When tasks are outcome-only with no meaningful step structure; process reward modeling may not apply.
- When your goal is to explore rare edge cases irrespective of label reliability; BIS may under-select very rare but crucial patterns if their positives are too low-MC.
Open questions:
- Can we learn BIS-like selection without explicit MC scores (e.g., via proxy uncertainty or self-consistency signals)?
- How does BIS interact with reinforcement learning fine-tuning and more complex training schemes?
- Can we refine reliability beyond averaging positive MC scores (e.g., calibrating per-source or per-problem difficulty)?
- What's the best way to mix BIS with active learning: select, then query fewer but smarter MC samples to refine labels?
- How broadly does BIS generalize beyond visual math to other multimodal or purely textual reasoning domains?
06 Conclusion & Future Work
Three-sentence summary: This paper shows that MPRM training is limited more by noisy gradients than by data scarcity, so random subsampling quickly hits a plateau. The key is to prioritize rollouts that are both mixed (positive and negative steps) and reliable (trustworthy positive labels), which the Balanced-Information Score (BIS) captures using only existing MC signals. With BIS, small subsets (often 10-25%) can match or even surpass full-data performance while saving most of the compute.
Main achievement: A simple, theory-guided, zero-extra-cost selection rule, BIS, that consistently outperforms random selection and often reaches full-data results using a small fraction of rollouts.
Future directions: Combine BIS with active learning and better uncertainty estimation; extend BIS-like selection to domains without MC scores; and integrate selection more tightly with reinforcement learning pipelines. Further study could refine the reliability term and adapt BIS per source or per task type.
Why remember this: It reframes data efficiency for process supervisionāwhat you keep matters more than how much you keep. BIS offers a practical recipe, grounded in simple theory and validated across models and benchmarks, to cut training cost dramatically without sacrificing accuracy.
Practical Applications
- Curate smaller, high-impact training sets for multimodal reasoning models to reduce compute costs.
- Speed up experiment cycles by training on BIS-selected subsets that converge faster to strong performance.
- Improve inference-time reranking (best-of-N) by training stronger MPRMs with less data.
- Filter noisy MC-annotated corpora to avoid pseudo-positive steps and stabilize training.
- Balance dataset composition across sources by ranking and selecting top rollouts per source.
- Deploy data-efficient training in edge or resource-constrained settings where full datasets are impractical.
- Combine BIS with active learning to first select, then refine a small set of ambiguous rollouts.
- Retrofit existing datasets (with MC scores) to increase quality without recollecting or relabeling data.
- Benchmark data-selection strategies by comparing BIS against random or heuristic filters under fixed budgets.