Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Key Summary
- Reward models are like scorekeepers that tell AI which answers people like more, and this paper builds the first big test for scorekeepers that judge both pictures and words together.
- The new benchmark, called MMRB2, covers four real tasks: making pictures from text, editing pictures, creating mixed stories with text and pictures in order, and solving puzzles by thinking with images.
- Each task has 1,000 carefully checked A-vs-B pairs with strong human-expert agreement, so we can see if a judge model picks the same winner as people.
- Top frontier models still miss a lot: Gemini 3 Pro gets about 75–80% agreement with humans, GPT-5 and Gemini 2.5 Pro get 66–75%, and humans are over 90%.
- Popular older judges like GPT-4o score only around 51–65%, which means they are no longer reliable for grading today's best multimodal models.
- Open-source judges improved: Qwen3-VL-32B reaches about 64–70% on many tasks, close to some fast API models.
- Scores on MMRB2 strongly predict real-world wins: using better judges to pick the "best of 8" generations boosts downstream benchmarks a lot.
- The study uncovers key weaknesses: judges struggle more when comparing two answers from the same model, and they are biased to prefer answers that include images, even when the text-only answer is actually better.
- Test-time scaling (asking the same judge multiple times) gives only tiny gains, so new methods are needed to make multimodal judges more reliable.
- MMRB2 sets a clear, challenging target so researchers can build better reward models that truly understand mixed text-and-image content.
Why This Research Matters
Good judges create better AI. When reward models can reliably prefer the outputs humans like, AI learns to produce clearer posters, cleaner edits, and more accurate step-by-step guides with matching images. MMRB2 gives a trustworthy way to spot which judges are ready and which need work, so teams don't waste time training on weak feedback. Because MMRB2 scores predict real-world improvements, it directly helps companies and researchers choose judges that boost product quality. Exposing biases, like overvaluing answers that include images, tells us exactly what to fix so systems don't get dazzled by flashy visuals. In short, this benchmark speeds up progress toward creative, correct, and dependable multimodal AI that helps in school, work, and everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a talent show needs fair judges who can watch dancing, listen to singing, and read poetry, even when these performances happen back-to-back? If the judges only knew about singing, the whole show would feel unfair.
The Concept (Reward Models): A reward model is a scorekeeper that tells AI, "Answer A is better than Answer B," based on what people prefer.
- How it works:
- Show the scorekeeper a question (or task) and two candidate answers.
- The scorekeeper picks the one people would like more.
- AI uses these scores to learn to give better answers next time.
- Why it matters: Without good scorekeepers, AI may practice the wrong thingsâlike cheering for off-key singing.
Anchor: When an AI writes two photo captions, a reward model should prefer the one that matches the picture and is well written.
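To make the scorekeeper idea concrete, here is a minimal sketch of a generic scalar reward model used as a pairwise judge; the Bradley-Terry-style probability is a common convention and the function names are illustrative, not necessarily how the models studied in this paper are built.

```python
import math

def prefer(score_a: float, score_b: float) -> str:
    """A scalar reward model scores each answer; the higher score is the preferred one."""
    return "A" if score_a >= score_b else "B"

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry-style probability that answer A beats answer B given their scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Example: a reward model that scores caption B higher prefers B.
print(prefer(0.4, 1.1))                              # -> "B"
print(round(preference_probability(0.4, 1.1), 2))    # -> 0.33
```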
Hook: Imagine reading a comic where pictures and speech bubbles appear in a specific order. If someone shuffled them, you'd lose the story.
The Concept (Omni/Multimodal Models): Omni models handle mixed media, text and images, together in one interleaved sequence.
- How it works:
- Take a prompt that might include words and pictures.
- Produce a reply that might also mix words and new pictures in order.
- Keep the sequence and content consistent.
- Why it matters: If we judge only text or only images, we miss whether the whole comic flows.
Anchor: A "how to bake" guide that shows step-by-step photos and captions must keep steps in order and match each image to its text.
The World Before: For years, AI judges (reward models) mainly graded text-only tasks, like summaries or math reasoning. For pictures, people used simple automatic checks like "Do the image and caption look similar?" These helped, but missed tricky stuff like multiple objects, exact positions, or whether rendered text in an image was spelled right.
The Problem: As omni models got better at mixing text and images (stories, edits, multi-image reasoning), there was no standard way to measure how good multimodal reward models were. Evaluating pictures is hard to automate, and evaluating whole mixed sequences is even harder. Also, existing datasets didn't cover the everyday, practical prompts people really ask for.
Hook: Think of a soccer referee trained only on kids' games trying to ref a pro match; lots of fouls get missed.
The Concept (Task-specific Metrics): These are automatic shortcuts (like CLIPScore or VQAScore) that try to guess what people would prefer.
- How it works:
- Compute a similarity or answer-correctness score.
- Use it as a proxy for quality.
- Why it matters: When models and tasks become harder and more creative, these shortcuts can fail.
Anchor: A metric might say a poster is "similar" to the prompt, but ignore misspelled text or a wrong number of objects.
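As an illustration of such a shortcut metric, here is a small CLIPScore-style sketch using the Hugging Face transformers CLIP model; the checkpoint name and the 2.5 rescaling factor follow the original CLIPScore convention and are illustrative assumptions, not the benchmark's exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """CLIPScore-style check: cosine similarity between image and caption embeddings."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum().item()
    return 2.5 * max(cosine, 0.0)  # original CLIPScore rescales by w = 2.5 and clips negatives

# Usage (hypothetical file name):
# print(clip_score(Image.open("poster.png"), "a yellow van at the beach"))
```

A score like this only measures overall similarity, which is exactly why it can miss misspelled rendered text or a wrong object count.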
Failed Attempts: Older visual metrics or small preference datasets worked okay for simple images but broke on today's rich tasks (like multi-image edits or "think with images" puzzles). Even some trained reward models learned from older-generation outputs and didn't generalize to the newest models.
The Gap: We needed a single, challenging, human-grounded benchmark covering all the major multimodal jobs (text-to-image, editing, interleaved generation, and reasoning), built from real, practical prompts and frontier model outputs, with reliable expert preferences.
Hook: You know how a science fair needs clear rubrics, lots of examples, and several judges to agree, so the winners feel fair?
The Concept (Human Expert Consensus & Preference Pairs): Gather many high-quality A-vs-B comparisons where experts strongly agree which is better.
- How it works:
- Collect tough prompts across diverse tasks.
- Generate candidate answers from many strong models.
- Ask multiple experts to pick A or B (and say why), keeping only high-agreement cases.
- Why it matters: Strong consensus pairs let us tell whether an AI judge truly matches human taste.
Anchor: If three experts all say "A" because it follows the edit exactly and preserves the subject, a good reward model should also pick A.
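As a rough sketch of what a consensus-filtered preference pair might look like as data, here is an illustrative record and keep-rule; the field names and the simple A/B votes are assumptions for clarity (the paper's annotators actually use a 7-point scale with rationales).

```python
from dataclasses import dataclass, field

@dataclass
class PreferencePair:
    prompt: str                    # may reference input images for editing/reasoning tasks
    response_a: str                # interleaved text/image content, simplified to a string here
    response_b: str
    expert_votes: list[str] = field(default_factory=list)  # e.g. ["A", "A", "A"]

def keep_pair(pair: PreferencePair, min_agreement: float = 1.0) -> bool:
    """Keep only pairs where a large enough share of experts picked the same side."""
    votes = pair.expert_votes
    top = max(votes.count("A"), votes.count("B"))
    return len(votes) > 0 and top / len(votes) >= min_agreement

pair = PreferencePair("Remove the trees", "edit keeps the subject", "edit blurs the subject",
                      expert_votes=["A", "A", "A"])
print(keep_pair(pair))  # -> True
```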
Real Stakes: These evaluations guide how we train the next generation of creative and helpful AIs. This affects posters with readable text, safe product edits, step-by-step learning content with consistent images, and tools that can actually reason about what's in a photo. Better judges mean better everyday results: fewer weird hands in photos, fewer confusing stories, and more accurate multimodal answers.
02 Core Idea
Hook: Imagine building a fair judge panel for a talent show that mixes singing, dancing, and magic, then discovering most judges only know how to grade singing.
The Concept (MMRB2 Benchmark): MMRB2 is a big, carefully built test that checks whether multimodal reward models make the same choices as human experts across four tasks.
- How it works:
- Collect 1,000 expert-approved preference pairs for each task: text-to-image, image editing, interleaved generation, and multimodal reasoning.
- Use real, practical prompts and state-of-the-art model outputs (including agents) near the frontier.
- Keep only pairs with strong human consensus, curated with an ensemble filtering step.
- Why it matters: Without a trustworthy test, you can't tell which reward models are truly learning what people value in mixed text-image content.
Anchor: If a judge model consistently agrees with humans on which of two edited images better follows the instruction, that judge is reliable for training and evaluating real systems.
The "Aha!" Moment in one sentence: If we standardize tough, human-verified A-vs-B comparisons across all key multimodal jobs, we can finally measure (and improve) omni reward models.
Three Analogies:
- Orchestra Conductor: A conductor needs to judge strings, brass, and percussion together; MMRB2 is the audition piece that reveals whether the conductor truly hears the whole orchestra.
- Recipe Taste Test: When a dish mixes sweet, sour, and spicy, a good taster can judge the overall balance; MMRB2 is the tasting menu for mixed text-and-image creations.
- Comics Editor: A comics editor checks art, dialogue, and panel flow; MMRB2 is the editorial checklist ensuring everything lines up.
Before vs After:
- Before: Judges were piecemeal: okay at text, shaky at images, and lost on interleaved sequences.
- After: There's one rigorous yardstick across four major tasks, so we can compare judges fairly and see real progress.
Hook: You know how reviewing a book is easier if many trusted reviewers agree on what's good?
The Concept (Ensemble Filtering Strategy): Use a panel of diverse models to remove trivial pairs before asking humans, keeping only tricky, informative comparisons.
- How it works:
- Nine multimodal judges rate each pair twice (A vs B and B vs A to avoid position bias).
- If almost everyone agrees, we drop the pair (too easy).
- We send the remaining, challenging pairs to human experts.
- Why it matters: This concentrates human effort on the comparisons that best reveal judge skill.
Anchor: If all judges already agree a poster with correct spelling is better than one with gibberish text, humans don't need to label it again.
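A minimal sketch of the "drop the easy pairs" idea, assuming each ensemble judge returns a simple "A" or "B" verdict per presentation order; the 0.9 threshold mirrors the agreement cutoff described in the pipeline below, and the function name is illustrative.

```python
def is_too_easy(verdicts: list[str], threshold: float = 0.9) -> bool:
    """verdicts: one 'A' or 'B' pick per (judge, presentation order), e.g. 9 judges x 2 orders.
    If nearly everyone already agrees, the pair is trivial and skips human annotation."""
    share_a = verdicts.count("A") / len(verdicts)
    return max(share_a, 1.0 - share_a) >= threshold

# 17 of 18 passes pick A: too easy, drop it.
print(is_too_easy(["A"] * 17 + ["B"]))       # -> True
# A 10-8 split: genuinely hard, send to human experts.
print(is_too_easy(["A"] * 10 + ["B"] * 8))   # -> False
```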
Why It Works (intuition):
- Breadth: Four task families cover creativity, precision edits, sequence planning, and genuine reasoning.
- Depth: Near-frontier prompts plus frontier outputs expose subtle errors (like object binding, spatial logic, or text rendering).
- Trust: High-consensus human labels give a solid ground truth.
- Fairness: Dual-order evaluation reduces "first-item" bias.
Building Blocks (each with a Sandwich):
- Hook: Imagine ranking two posters with tiny differences. Concept (Preference Pair): A prompt with two responses; pick the better one. How: Show A and B; choose which aligns better with the prompt and quality rubric. Why: Pairwise choices sharpen distinctions judges must learn. Example: Two edits: one keeps the dog's face sharp; the other blurs it, so experts pick the sharp one.
- Hook: Think of a judge who favors the first act just because it went first. Concept (Positional Consistent Dual Evaluation): Evaluate both A-vs-B and B-vs-A. How: Ask twice, flip the order, count agreement. Why: Prevents left/right or first/second bias. Example: If a judge always picks "A," flipping reveals the bias.
- Hook: Picking the best cookie from eight tastes better than baking just one. Concept (Best-of-N Sampling with Rewards): Use the judge to choose the best among multiple candidates. How: Generate N outputs; the reward picks the winner. Why: Better judges mean better final outputs in real tasks. Example: Out of 8 images for a travel poster, the judge selects the one with correct text and layout.
- Hook: Asking three friends for advice can beat asking one. Concept (Test-time Scaling for Judges): Sample a judge multiple times and take majority vote. How: Get K independent judgments; pick the most common. Why: Can smooth out randomness, though gains are small here. Example: Three passes slightly boost a judge's accuracy on hard pairs. (A small sketch of best-of-N selection and majority voting follows this list.)
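Below is a minimal sketch of the last two building blocks, best-of-N selection and test-time majority voting; `score_fn` and `judge_fn` are hypothetical callables standing in for whatever reward model or judge you use.

```python
from collections import Counter
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Iterable[T], score_fn: Callable[[T], float]) -> T:
    """Best-of-N: generate several candidates and let the reward model pick the top scorer."""
    return max(candidates, key=score_fn)

def majority_vote(judge_fn, prompt, resp_a, resp_b, k: int = 3) -> str:
    """Test-time scaling: ask the same judge k times and take the most common 'A'/'B' verdict."""
    votes = [judge_fn(prompt, resp_a, resp_b) for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```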
03 Methodology
At a high level: Prompt → Generate many candidate responses (text, images, or both) → Filter easy pairs with an ensemble of judges → Human experts label the tricky pairs → Build preference pairs → Evaluate any reward model by how often it agrees with humans (using dual order) → Analyze results and correlations.
Key Tasks (each with a Sandwich):
- Hook: You ask for "a yellow van at the beach with 'Adventure Awaits!' in bold." Concept (Text-to-Image): Turn a written description into a picture. How: Feed prompt to multiple generators; collect their images. Why: Checks composition, object binding, and text rendering. Example: One image spells the slogan perfectly and shows a van by the sea; another misspells the text, so experts prefer the first.
- Hook: Like photoshopping a poster without breaking what you didn't touch. Concept (Image Editing): Change an image exactly as instructed while preserving the rest. How: Provide input image(s) and an edit instruction; gather edited results. Why: Tests faithfulness, preservation, and reasoning-heavy edits (e.g., spatial changes). Example: "Remove trees so cows are clear." Good edit removes only trees; bad edit removes cows, too.
- Hook: Imagine a DIY guide where each step has a mini paragraph and a matching photo. Concept (Interleaved Generation): Produce a sequence mixing text and images in the right order. How: Ask models/agents for multi-step text-image outputs; collect candidates. Why: Evaluates planning, consistency across steps, and alignment. Example: Crop growth over seasons: images and captions match each phase exactly.
- Hook: Solving a puzzle by drawing arrows and notes right on the picture. Concept (Multimodal Reasoning): Reason about visual content, sometimes with helper sketches. How: Use prompts with ground-truth answers; gather responses that include reasoning (text or text+images). Why: Judges must reward correct perception, logic, and explanations. Example: "From the stacked chairs' view, what's nearest on your right?" Correct answer plus clear directional sketch beats a flashy but wrong one.
Data Pipeline (step-by-step, each critical step says why it exists):
- Prompt Collection and Stratification
- What: Sample prompts from 21 sources, balancing difficulty and subtypes; add new practical tasks (e.g., multi-image editing, text-heavy edits).
- Why: Without diverse, realistic prompts, judges overfit to narrow tricks.
- Example: Mix creative posters, spatial reasoning puzzles, and story sequences.
- Candidate Response Generation
- What: For each prompt, collect outputs from 7–11 frontier models and specialized agents that can call tools (image generation/editing, Python) to follow complex instructions.
- Why: If candidates are too weak or too similar, comparisons arenât informative.
- Example: Agents help produce exactly four images for a step-by-step task when single models fail to hit the requested count.
- Ensemble Filtering (pre-human pass)
- What: Nine different judges rate each A/B pair twice (A vs B, then B vs A). Pairs with ≥90% agreement across both orders are dropped as "too easy."
- Why: Saves human attention for fine-grained, high-value comparisons.
- Example: Everyone agrees the misspelled billboard loses, so skip it; keep the tricky cases where layout is good but object count is subtly wrong.
- Human Annotation with Quality Control
- What: Three trained experts evaluate each remaining pair using task-specific rubrics (faithfulness to instruction, faithfulness to input images, image quality, cross-generation coherence, text-image alignment, and text quality). They give a 7-point preference and rationales.
- Why: Builds trustworthy, consistent labels; removes ties and ambiguous ratings.
- Example: If scores disagree widely, the pair is dropped to protect label quality.
- Special Pair Building for Reasoning
- What: Construct pairs from responses where the correct answer and clean reasoning are pitted against either incorrect reasoning (with a correct final answer) or an incorrect answer.
- Why: Forces judges to value both the right conclusion and the reasoning quality.
- Example: Two answers say "B," but one misreads the image; the other uses correct spatial logic, so humans prefer the second.
- Evaluation Protocol (Positional Consistent Dual Evaluation)
- What: Every pair is judged in both orders; matches with the human majority are counted as correct (a small sketch follows this list).
- Why: Reduces bias toward "the first item."
- Example: If a judge flips its choice when order flips, accuracy drops.
- Analyses and Scaling Tests
- What: Study same-model vs different-model comparisons, mixed-modality biases (text vs text+image), correlations to downstream benchmarks via best-of-N, and test-time scaling (K votes per decision).
- Why: Reveals where judges still fail and which improvements actually matter in practice.
- Example: Finding strong bias toward image-containing responses in reasoning pairs highlights a concrete target for future training.
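The sketch below illustrates the dual-order protocol from the "Evaluation Protocol" step; it assumes a hypothetical `judge_fn(prompt, first, second)` that answers "first" or "second", and it scores each order as half a point, which is one simple way to aggregate (the paper's exact scoring may differ).

```python
def dual_order_score(judge_fn, prompt, resp_a, resp_b, human_choice: str) -> float:
    """Judge the pair in both presentation orders and compare each verdict to the
    human-preferred side ('A' or 'B'). Returns 0.0, 0.5, or 1.0 for this pair."""
    # Order 1: A shown first.
    pick_1 = "A" if judge_fn(prompt, resp_a, resp_b) == "first" else "B"
    # Order 2: B shown first; map the positional verdict back to A/B.
    pick_2 = "B" if judge_fn(prompt, resp_b, resp_a) == "first" else "A"
    return (int(pick_1 == human_choice) + int(pick_2 == human_choice)) / 2

# A judge that always answers "first" gets only 0.5 here, exposing its position bias.
biased_judge = lambda prompt, first, second: "first"
print(dual_order_score(biased_judge, "Edit the sky to sunset", "img_a", "img_b", "A"))  # -> 0.5
```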
The Secret Sauce:
- Frontier Coverage: Models and agents produce strong, diverse candidates, so passing means real skill.
- High-Agreement Labels: >90% human agreement on kept pairs makes the target reliable.
- Bias Controls: Dual-order judging and balanced modality comparisons reduce hidden shortcuts.
- Practical Prompts: Near real-world requests stress what people truly care about (e.g., spelling, layout, step counts, spatial truth).
04 Experiments & Results
The Test: Measure how often a judge model agrees with human preferences on MMRB2's A-vs-B pairs, across four tasks. Also test classic metrics and trained reward models. Check whether high MMRB2 scores predict real-world gains using best-of-N sampling.
The Competition:
- API-based multimodal LLMs: GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3 Pro.
- Open-source: Gemma 3 family, Qwen2.5-VL, Qwen3-VL (8B to large variants).
- Task-specific evaluators: CLIPScore, ImageReward, HPSv2/v3, PickScore, VQAScore, EditReward, UnifiedReward.
Scoreboard with Context:
- Gemini 3 Pro leads at about 74–80% agreement across tasks, like an A- to B+ when humans score above 90% (A+).
- GPT-5 and Gemini 2.5 Pro reach roughly 66–75%, clearly improved but still trailing human reliability by 15–25 points.
- GPT-4o, a commonly used older judge, lands around 51–65%, no longer dependable for frontier evaluations.
- Best open-source judge Qwen3-VL-32B scores around 64–70%, competitive on generation tasks and much closer to APIs than before, though still behind on hard reasoning.
Task-specific Evaluators:
- Older CLIP-based or VQA-like metrics fall behind on MMRB2's harder prompts (e.g., ImageReward ~54% on text-to-image; VQAScore ~58%).
- Preference-trained newer models help (e.g., HPSv3 ~60% on text-to-image, EditReward ~67% on single-image editing), but still often trail strong MLLM judges like Qwen3-VL-32B or Gemini 3 Pro.
- Takeaway: Training on human preferences improves metrics, but distribution shift to frontier outputs hurts many older reward models; large, general MLLMs remain tough-to-beat judges.
Surprising/Diagnostic Findings:
- Same-Model vs Different-Model Pairs: All judges agree more with humans on different-model pairs (bigger quality gaps) than on same-model pairs (subtle differences), with gaps up to 12 points for top models. This shows fine-grained discrimination is still hard.
- Modality Bias in Reasoning: Judges are biased to pick responses that include images. Accuracy is far higher when the preferred answer contains images than when the preferred answer is text-only (gaps of 27.7–49.3 points for many models). Even the best model, Gemini 3 Pro, shows a notable but smaller gap. This is a key failure mode (a small analysis sketch follows this list).
- Test-time Scaling: Asking the same judge multiple times and taking a vote yields only tiny gains (~0.8–1.2 points at K=9 for some API models; none for some open-source models). Unlike text-only LLMs, this doesn't rescue multimodal judging much.
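To see how such diagnostics can be computed, here is an illustrative sketch that splits judge-vs-human agreement by a metadata field (e.g., same-model pairs, or whether the human-preferred answer contains an image); the record fields and example numbers are assumptions, not the benchmark's released schema or results.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key: str) -> dict:
    """records: dicts like {"preferred_has_image": True, "judge_pick": "A", "human_pick": "A"}.
    Returns agreement rate per subgroup so bias gaps become visible."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        hits[g] += int(r["judge_pick"] == r["human_pick"])
    return {g: hits[g] / totals[g] for g in totals}

records = [
    {"preferred_has_image": True,  "judge_pick": "A", "human_pick": "A"},
    {"preferred_has_image": True,  "judge_pick": "A", "human_pick": "A"},
    {"preferred_has_image": False, "judge_pick": "B", "human_pick": "A"},
    {"preferred_has_image": False, "judge_pick": "A", "human_pick": "A"},
]
print(accuracy_by_group(records, "preferred_has_image"))  # -> {True: 1.0, False: 0.5}
```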
Downstream Correlation via Best-of-N:
- Better MMRB2 judges pick better generations on real benchmarks (GenAI-Bench, GEdit-Bench, ISG-Bench, EMMA). Correlations exceed 0.8 (Pearson's r) across tasks (illustrated with a toy example after this list).
- Concrete wins: FLUX's GenAI-Bench score rises from 73% to 79% when GPT-5 selects best-of-8; GPT-4o's EMMA accuracy jumps from 32% to 45% with a better selector.
- Meaning: MMRB2 is not just an academic test; it predicts practical improvements when you use the judge to choose outputs.
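For readers who want to reproduce this kind of check on their own judges, the correlation is an ordinary Pearson's r between benchmark accuracy and downstream best-of-N gains; the numbers below are made up purely for illustration and are not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical values, NOT the paper's data: one point per judge model.
mmrb2_accuracy = [0.55, 0.62, 0.68, 0.74, 0.79]   # agreement with humans on MMRB2
best_of_8_gain = [0.5, 1.4, 2.9, 4.1, 5.0]        # downstream benchmark improvement (points)

r, p_value = pearsonr(mmrb2_accuracy, best_of_8_gain)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```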
Bottom Line:
- Human evaluation is still the gold standard (>90%).
- Gemini 3 Pro is the current best automated judge but still leaves 20–26% disagreement to close.
- Older or narrower evaluators struggle; new, stronger MLLMs and newer preference-trained rewards help but arenât perfect.
- The clear link from MMRB2 scores to real-world gains makes this benchmark a reliable compass for progress.
05 Discussion & Limitations
Limitations:
- Coverage Boundaries: While broad, MMRB2 focuses on core single-turn tasks. It doesn't yet cover multi-turn dialogues, safety-sensitive choices, or bias-sensitive preferences in depth, and it omits video/audio.
- Frontier Drift: As models evolve, today's "hard" pairs may become easier. The benchmark will need periodic refreshes to stay challenging.
- Annotation Cost: High-consensus expert labels are expensive and time-consuming, limiting how fast we can scale.
- Agent Variability: Agent outputs depend on tool stacks; other combinations might reveal new failure modes.
Required Resources:
- Access to multiple API models and open-source models for response generation and judging.
- Budget and time for expert annotation with robust quality control.
- Infrastructure to store and serve interleaved text-image data, plus tooling for dual-order evaluation.
When NOT to Use:
- Safety or Bias Audits: MMRB2 isnât designed to judge sensitive harms or fairness outcomes; specialized safety/bias benchmarks are better.
- Audio/Video Tasks: If your system's main modality is speech or video, MMRB2 won't fully reflect your needs yet.
- Very Domain-Specific Workflows: Highly specialized medical or legal visuals may require domain-oriented evaluations.
Open Questions:
- Can we train reward models that weigh reasoning quality over visual flash, reducing bias toward image-containing answers?
- What architectures or training signals best improve same-model fine-grained discrimination?
- How can we extend to multi-turn, multilingual, and in-the-wild settings without losing label reliability?
- Can we develop scalable, semi-automated labeling pipelines that still achieve human-level agreement for multimodal tasks?
- What new forms of test-time scaling or self-verification work for multimodal judging beyond simple majority votes?
06 Conclusion & Future Work
Three-Sentence Summary:
- MMRB2 is a comprehensive benchmark that fairly tests whether multimodal reward models agree with human preferences across text-to-image, image editing, interleaved generation, and multimodal reasoning.
- Even the best current judges, like Gemini 3 Pro, still disagree with humans about 20–26% of the time, while many older metrics and models underperform badly on frontier tasks.
- MMRB2 scores strongly predict real-world gains when judges pick the best of multiple candidates, making it a practical tool for improving multimodal systems.
Main Achievement:
- Creating a reliable, human-grounded, frontier-level testbed, with expert-consensus preference pairs and bias-aware protocols, that finally lets the community measure and improve omni reward models in a unified way.
Future Directions:
- Expand to safety/bias-sensitive preferences, multilingual prompts, multi-turn/agentic workflows, and new modalities like video and audio.
- Develop training strategies that reduce modality bias and improve fine-grained discrimination on same-model pairs.
- Explore better scaling methods (beyond majority voting) for more stable, trustworthy multimodal judging.
Why Remember This:
- MMRB2 marks a turning point for AI that reads, writes, and draws: we now have a fair, challenging, human-aligned scoreboard that tells us which judges actually understand mixed text-and-image content and which don't, so the next generation of creative, accurate, and reliable multimodal AI can be trained with confidence.
Practical Applications
- Use MMRB2 to pick the best judge for best-of-N selection in your image generation pipeline to immediately improve output quality.
- Benchmark your in-house reward model against MMRB2 before deploying it to guide RLHF or DPO training for multimodal tasks.
- Stress-test your image editing model on the MMRB2 editing subset to discover faithfulness and preservation failures.
- Evaluate interleaved text-image storytelling systems with MMRB2 to ensure step counts, ordering, and text-image alignment are correct.
- Audit your multimodal reasoning agent on MMRB2 reasoning pairs to detect modality bias toward image-containing answers.
- Compare open-source judges (e.g., Qwen3-VL-32B) to API judges to balance cost, latency, and accuracy in production.
- Train new reward models on recent, frontier-like data and validate generalization by checking their MMRB2 accuracy lift over older metrics.
- Use MMRB2's dual-order evaluation protocol in your own A/B tests to reduce position bias in internal model comparisons.
- Incorporate MMRB2 tasks as curriculum checkpoints while scaling omni models to ensure balanced progress across generation and reasoning.
- Run periodic MMRB2 evaluations to track regressions after model updates, especially for text rendering and spatial logic.