Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
Key Summary
- Reward models are like scorekeepers that tell AI which answers people like more, and this paper builds the first big test for scorekeepers that judge both pictures and words together.
- The new benchmark, called MMRB2, covers four real tasks: making pictures from text, editing pictures, creating mixed stories with text and pictures in order, and solving puzzles by thinking with images.
- Each task has 1,000 carefully checked A-vs-B pairs with strong human-expert agreement, so we can see if a judge model picks the same winner as people.
- Top frontier models still miss a lot: Gemini 3 Pro gets about 75–80% agreement with humans, GPT-5 and Gemini 2.5 Pro get 66–75%, and humans are over 90%.
- Popular older judges like GPT-4o score only around 51–65%, which means they are no longer reliable for grading today's best multimodal models.
- Open-source judges improved: Qwen3-VL-32B reaches about 64–70% on many tasks, close to some fast API models.
- Scores on MMRB2 strongly predict real-world wins: using better judges to pick the "best of 8" generations boosts downstream benchmarks a lot.
- The study uncovers key weaknesses: judges struggle more when comparing two answers from the same model, and they are biased to prefer answers that include images, even when the text-only answer is actually better.
- Test-time scaling (asking the same judge multiple times) gives only tiny gains, so new methods are needed to make multimodal judges more reliable.
- MMRB2 sets a clear, challenging target so researchers can build better reward models that truly understand mixed text-and-image content.
Why This Research Matters
Good judges create better AI. When reward models can reliably prefer the outputs humans like, AI learns to produce clearer posters, cleaner edits, and more accurate step-by-step guides with matching images. MMRB2 gives a trustworthy way to spot which judges are ready and which need work, so teams don't waste time training on weak feedback. Because MMRB2 scores predict real-world improvements, it directly helps companies and researchers choose judges that boost product quality. Exposing biases, like overvaluing answers that include images, tells us exactly what to fix so systems don't get dazzled by flashy visuals. In short, this benchmark speeds up progress toward creative, correct, and dependable multimodal AI that helps in school, work, and everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a talent show needs fair judges who can watch dancing, listen to singing, and read poetry, even when these performances happen back-to-back? If the judges only knew about singing, the whole show would feel unfair.
The Concept (Reward Models): A reward model is a scorekeeper that tells AI, "Answer A is better than Answer B," based on what people prefer.
- How it works:
- Show the scorekeeper a question (or task) and two candidate answers.
- The scorekeeper picks the one people would like more.
- AI uses these scores to learn to give better answers next time.
- Why it matters: Without good scorekeepers, AI may practice the wrong thingsâlike cheering for off-key singing.
Anchor: When an AI writes two photo captions, a reward model should prefer the one that matches the picture and is well written.
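To make the scorekeeper idea concrete, here is a minimal sketch of a generic scalar reward model used as a pairwise judge; the Bradley-Terry-style probability is a common convention and the function names are illustrative, not necessarily how the models studied in this paper are built.

```python
import math

def prefer(score_a: float, score_b: float) -> str:
    """A scalar reward model scores each answer; the higher score is the preferred one."""
    return "A" if score_a >= score_b else "B"

def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry-style probability that answer A beats answer B given their scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Example: a reward model that scores caption B higher prefers B.
print(prefer(0.4, 1.1))                              # -> "B"
print(round(preference_probability(0.4, 1.1), 2))    # -> 0.33
```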
Hook: Imagine reading a comic where pictures and speech bubbles appear in a specific order. If someone shuffled them, you'd lose the story.
The Concept (Omni/Multimodal Models): Omni models handle mixed media, text and images, together in one interleaved sequence.
- How it works:
- Take a prompt that might include words and pictures.
- Produce a reply that might also mix words and new pictures in order.
- Keep the sequence and content consistent.
- Why it matters: If we judge only text or only images, we miss whether the whole comic flows.
Anchor: A "how to bake" guide that shows step-by-step photos and captions must keep steps in order and match each image to its text.
The World Before: For years, AI judges (reward models) mainly graded text-only tasks, like summaries or math reasoning. For pictures, people used simple automatic checks like "Do the image and caption look similar?" These helped, but missed tricky stuff like multiple objects, exact positions, or whether rendered text in an image was spelled right.
The Problem: As omni models got better at mixing text and images (stories, edits, multi-image reasoning), there was no standard way to measure how good multimodal reward models were. Evaluating pictures is hard to automate, and evaluating whole mixed sequences is even harder. Also, existing datasets didn't cover the everyday, practical prompts people really ask for.
Hook: Think of a soccer referee trained only on kids' games trying to ref a pro match; lots of fouls get missed.
The Concept (Task-specific Metrics): These are automatic shortcuts (like CLIPScore or VQAScore) that try to guess what people would prefer.
- How it works:
- Compute a similarity or answer-correctness score.
- Use it as a proxy for quality.
- Why it matters: When models and tasks become harder and more creative, these shortcuts can fail.
Anchor: A metric might say a poster is "similar" to the prompt, but ignore misspelled text or a wrong number of objects.
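As an illustration of such a shortcut metric, here is a small CLIPScore-style sketch using the Hugging Face transformers CLIP model; the checkpoint name and the 2.5 rescaling factor follow the original CLIPScore convention and are illustrative assumptions, not the benchmark's exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """CLIPScore-style check: cosine similarity between image and caption embeddings."""
    inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cosine = (img_emb * txt_emb).sum().item()
    return 2.5 * max(cosine, 0.0)  # original CLIPScore rescales by w = 2.5 and clips negatives

# Usage (hypothetical file name):
# print(clip_score(Image.open("poster.png"), "a yellow van at the beach"))
```

A score like this only measures overall similarity, which is exactly why it can miss misspelled rendered text or a wrong object count.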
Failed Attempts: Older visual metrics or small preference datasets worked okay for simple images but broke on today's rich tasks (like multi-image edits or "think with images" puzzles). Even some trained reward models learned from older-generation outputs and didn't generalize to the newest models.
The Gap: We needed a single, challenging, human-grounded benchmark covering all the major multimodal jobs (text-to-image, editing, interleaved generation, and reasoning), built from real, practical prompts and frontier model outputs, with reliable expert preferences.
Hook: You know how a science fair needs clear rubrics, lots of examples, and several judges to agree, so the winners feel fair?
The Concept (Human Expert Consensus & Preference Pairs): Gather many high-quality A-vs-B comparisons where experts strongly agree which is better.
- How it works:
- Collect tough prompts across diverse tasks.
- Generate candidate answers from many strong models.
- Ask multiple experts to pick A or B (and say why), keeping only high-agreement cases.
- Why it matters: Strong consensus pairs let us tell whether an AI judge truly matches human taste.
Anchor: If three experts all say "A" because it follows the edit exactly and preserves the subject, a good reward model should also pick A.
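As a rough sketch of what a consensus-filtered preference pair might look like as data, here is an illustrative record and keep-rule; the field names and the simple A/B votes are assumptions for clarity (the paper's annotators actually use a 7-point scale with rationales).

```python
from dataclasses import dataclass, field

@dataclass
class PreferencePair:
    prompt: str                    # may reference input images for editing/reasoning tasks
    response_a: str                # interleaved text/image content, simplified to a string here
    response_b: str
    expert_votes: list[str] = field(default_factory=list)  # e.g. ["A", "A", "A"]

def keep_pair(pair: PreferencePair, min_agreement: float = 1.0) -> bool:
    """Keep only pairs where a large enough share of experts picked the same side."""
    votes = pair.expert_votes
    top = max(votes.count("A"), votes.count("B"))
    return len(votes) > 0 and top / len(votes) >= min_agreement

pair = PreferencePair("Remove the trees", "edit keeps the subject", "edit blurs the subject",
                      expert_votes=["A", "A", "A"])
print(keep_pair(pair))  # -> True
```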
Real Stakes: These evaluations guide how we train the next generation of creative and helpful AIs. This affects posters with readable text, safe product edits, step-by-step learning content with consistent images, and tools that can actually reason about what's in a photo. Better judges mean better everyday results: fewer weird hands in photos, fewer confusing stories, and more accurate multimodal answers.
02 Core Idea
Hook: Imagine building a fair judge panel for a talent show that mixes singing, dancing, and magic, then discovering most judges only know how to grade singing.
The Concept (MMRB2 Benchmark): MMRB2 is a big, carefully built test that checks whether multimodal reward models make the same choices as human experts across four tasks.
- How it works:
- Collect 1,000 expert-approved preference pairs for each task: text-to-image, image editing, interleaved generation, and multimodal reasoning.
- Use real, practical prompts and state-of-the-art model outputs (including agents) near the frontier.
- Keep only pairs with strong human consensus, curated with an ensemble filtering step.
- Why it matters: Without a trustworthy test, you can't tell which reward models are truly learning what people value in mixed text-image content.
Anchor: If a judge model consistently agrees with humans on which of two edited images better follows the instruction, that judge is reliable for training and evaluating real systems.
The "Aha!" Moment in one sentence: If we standardize tough, human-verified A-vs-B comparisons across all key multimodal jobs, we can finally measure (and improve) omni reward models.
Three Analogies:
- Orchestra Conductor: A conductor needs to judge strings, brass, and percussion together; MMRB2 is the audition piece that reveals whether the conductor truly hears the whole orchestra.
- Recipe Taste Test: When a dish mixes sweet, sour, and spicy, a good taster can judge the overall balance; MMRB2 is the tasting menu for mixed text-and-image creations.
- Comics Editor: A comics editor checks art, dialogue, and panel flow; MMRB2 is the editorial checklist ensuring everything lines up.
Before vs After:
- Before: Judges were piecemeal: okay at text, shaky at images, and lost on interleaved sequences.
- After: There's one rigorous yardstick across four major tasks, so we can compare judges fairly and see real progress.
Hook: You know how reviewing a book is easier if many trusted reviewers agree on what's good?
The Concept (Ensemble Filtering Strategy): Use a panel of diverse models to remove trivial pairs before asking humans, keeping only tricky, informative comparisons.
- How it works:
- Nine multimodal judges rate each pair twice (A vs B and B vs A to avoid position bias).
- If almost everyone agrees, we drop the pair (too easy).
- We send the remaining, challenging pairs to human experts.
- Why it matters: This concentrates human effort on the comparisons that best reveal judge skill.
Anchor: If all judges already agree a poster with correct spelling is better than one with gibberish text, humans don't need to label it again.
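A minimal sketch of the "drop the easy pairs" idea, assuming each ensemble judge returns a simple "A" or "B" verdict per presentation order; the 0.9 threshold mirrors the agreement cutoff described in the pipeline below, and the function name is illustrative.

```python
def is_too_easy(verdicts: list[str], threshold: float = 0.9) -> bool:
    """verdicts: one 'A' or 'B' pick per (judge, presentation order), e.g. 9 judges x 2 orders.
    If nearly everyone already agrees, the pair is trivial and skips human annotation."""
    share_a = verdicts.count("A") / len(verdicts)
    return max(share_a, 1.0 - share_a) >= threshold

# 17 of 18 passes pick A: too easy, drop it.
print(is_too_easy(["A"] * 17 + ["B"]))       # -> True
# A 10-8 split: genuinely hard, send to human experts.
print(is_too_easy(["A"] * 10 + ["B"] * 8))   # -> False
```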
Why It Works (intuition):
- Breadth: Four task families cover creativity, precision edits, sequence planning, and genuine reasoning.
- Depth: Near-frontier prompts plus frontier outputs expose subtle errors (like object binding, spatial logic, or text rendering).
- Trust: High-consensus human labels give a solid ground truth.
- Fairness: Dual-order evaluation reduces "first-item" bias.
Building Blocks (each with a Sandwich):
- Hook: Imagine ranking two posters with tiny differences. Concept (Preference Pair): A prompt with two responses; pick the better one. How: Show A and B; choose which aligns better with the prompt and quality rubric. Why: Pairwise choices sharpen distinctions judges must learn. Example: Two edits: one keeps the dog's face sharp; the other blurs it, so experts pick the sharp one.
- Hook: Think of a judge who favors the first act just because it went first. Concept (Positional Consistent Dual Evaluation): Evaluate both A-vs-B and B-vs-A. How: Ask twice, flip the order, count agreement. Why: Prevents left/right or first/second bias. Example: If a judge always picks "A," flipping reveals the bias.
- Hook: Picking the best cookie from eight tastes better than baking just one. Concept (Best-of-N Sampling with Rewards): Use the judge to choose the best among multiple candidates. How: Generate N outputs; the reward picks the winner. Why: Better judges mean better final outputs in real tasks. Example: Out of 8 images for a travel poster, the judge selects the one with correct text and layout.
- Hook: Asking three friends for advice can beat asking one. Concept (Test-time Scaling for Judges): Sample a judge multiple times and take majority vote. How: Get K independent judgments; pick the most common. Why: Can smooth out randomness, though gains are small here. Example: Three passes slightly boost a judge's accuracy on hard pairs. (A small sketch of best-of-N selection and majority voting follows this list.)
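Below is a minimal sketch of the last two building blocks, best-of-N selection and test-time majority voting; `score_fn` and `judge_fn` are hypothetical callables standing in for whatever reward model or judge you use.

```python
from collections import Counter
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Iterable[T], score_fn: Callable[[T], float]) -> T:
    """Best-of-N: generate several candidates and let the reward model pick the top scorer."""
    return max(candidates, key=score_fn)

def majority_vote(judge_fn, prompt, resp_a, resp_b, k: int = 3) -> str:
    """Test-time scaling: ask the same judge k times and take the most common 'A'/'B' verdict."""
    votes = [judge_fn(prompt, resp_a, resp_b) for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```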
03 Methodology
At a high level: Prompt → Generate many candidate responses (text, images, or both) → Filter easy pairs with an ensemble of judges → Human experts label the tricky pairs → Build preference pairs → Evaluate any reward model by how often it agrees with humans (using dual order) → Analyze results and correlations.
Key Tasks (each with a Sandwich):
- Hook: You ask for "a yellow van at the beach with 'Adventure Awaits!' in bold." Concept (Text-to-Image): Turn a written description into a picture. How: Feed prompt to multiple generators; collect their images. Why: Checks composition, object binding, and text rendering. Example: One image spells the slogan perfectly and shows a van by the sea; another misspells the text, so experts prefer the first.
- Hook: Like photoshopping a poster without breaking what you didn't touch. Concept (Image Editing): Change an image exactly as instructed while preserving the rest. How: Provide input image(s) and an edit instruction; gather edited results. Why: Tests faithfulness, preservation, and reasoning-heavy edits (e.g., spatial changes). Example: "Remove trees so cows are clear." Good edit removes only trees; bad edit removes cows, too.
- Hook: Imagine a DIY guide where each step has a mini paragraph and a matching photo. Concept (Interleaved Generation): Produce a sequence mixing text and images in the right order. How: Ask models/agents for multi-step text-image outputs; collect candidates. Why: Evaluates planning, consistency across steps, and alignment. Example: Crop growth over seasons: images and captions match each phase exactly.
- Hook: Solving a puzzle by drawing arrows and notes right on the picture. Concept (Multimodal Reasoning): Reason about visual content, sometimes with helper sketches. How: Use prompts with ground-truth answers; gather responses that include reasoning (text or text+images). Why: Judges must reward correct perception, logic, and explanations. Example: "From the stacked chairs' view, what's nearest on your right?" Correct answer plus clear directional sketch beats a flashy but wrong one.
Data Pipeline (step-by-step, each critical step says why it exists):
- Prompt Collection and Stratification
- What: Sample prompts from 21 sources, balancing difficulty and subtypes; add new practical tasks (e.g., multi-image editing, text-heavy edits).
- Why: Without diverse, realistic prompts, judges overfit to narrow tricks.
- Example: Mix creative posters, spatial reasoning puzzles, and story sequences.
- Candidate Response Generation
- What: For each prompt, collect outputs from 7–11 frontier models and specialized agents that can call tools (image generation/editing, Python) to follow complex instructions.
- Why: If candidates are too weak or too similar, comparisons arenât informative.
- Example: Agents help produce exactly four images for a step-by-step task when single models fail to hit the requested count.
- Ensemble Filtering (pre-human pass)
- What: Nine different judges rate each A/B pair twice (A vs B, then B vs A). Pairs with ≥90% agreement across both orders are dropped as "too easy."
- Why: Saves human attention for fine-grained, high-value comparisons.
- Example: Everyone agrees the misspelled billboard loses, so skip it; keep the tricky cases where layout is good but object count is subtly wrong.
- Human Annotation with Quality Control
- What: Three trained experts evaluate each remaining pair using task-specific rubrics (faithfulness to instruction, faithfulness to input images, image quality, cross-generation coherence, text-image alignment, and text quality). They give a 7-point preference and rationales.
- Why: Builds trustworthy, consistent labels; removes ties and ambiguous ratings.
- Example: If scores disagree widely, the pair is dropped to protect label quality.
- Special Pair Building for Reasoning
- What: Construct pairs from responses where the correct answer and clean reasoning are pitted against either incorrect reasoning (with a correct final answer) or an incorrect answer.
- Why: Forces judges to value both the right conclusion and the reasoning quality.
- Example: Two answers say "B," but one misreads the image; the other uses correct spatial logic, so humans prefer the second.
- Evaluation Protocol (Positional Consistent Dual Evaluation)
- What: Every pair is judged in both orders; matches with the human majority are counted as correct (a small sketch follows this list).
- Why: Reduces bias toward "the first item."
- Example: If a judge flips its choice when order flips, accuracy drops.
- Analyses and Scaling Tests
- What: Study same-model vs different-model comparisons, mixed-modality biases (text vs text+image), correlations to downstream benchmarks via best-of-N, and test-time scaling (K votes per decision).
- Why: Reveals where judges still fail and which improvements actually matter in practice.
- Example: Finding strong bias toward image-containing responses in reasoning pairs highlights a concrete target for future training.
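The sketch below illustrates the dual-order protocol from the "Evaluation Protocol" step; it assumes a hypothetical `judge_fn(prompt, first, second)` that answers "first" or "second", and it scores each order as half a point, which is one simple way to aggregate (the paper's exact scoring may differ).

```python
def dual_order_score(judge_fn, prompt, resp_a, resp_b, human_choice: str) -> float:
    """Judge the pair in both presentation orders and compare each verdict to the
    human-preferred side ('A' or 'B'). Returns 0.0, 0.5, or 1.0 for this pair."""
    # Order 1: A shown first.
    pick_1 = "A" if judge_fn(prompt, resp_a, resp_b) == "first" else "B"
    # Order 2: B shown first; map the positional verdict back to A/B.
    pick_2 = "B" if judge_fn(prompt, resp_b, resp_a) == "first" else "A"
    return (int(pick_1 == human_choice) + int(pick_2 == human_choice)) / 2

# A judge that always answers "first" gets only 0.5 here, exposing its position bias.
biased_judge = lambda prompt, first, second: "first"
print(dual_order_score(biased_judge, "Edit the sky to sunset", "img_a", "img_b", "A"))  # -> 0.5
```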
The Secret Sauce:
- Frontier Coverage: Models and agents produce strong, diverse candidates, so passing means real skill.
- High-Agreement Labels: >90% human agreement on kept pairs makes the target reliable.
- Bias Controls: Dual-order judging and balanced modality comparisons reduce hidden shortcuts.
- Practical Prompts: Near real-world requests stress what people truly care about (e.g., spelling, layout, step counts, spatial truth).
04 Experiments & Results
The Test: Measure how often a judge model agrees with human preferences on MMRB2's A-vs-B pairs, across four tasks. Also test classic metrics and trained reward models. Check whether high MMRB2 scores predict real-world gains using best-of-N sampling.
The Competition:
- API-based multimodal LLMs: GPT-4o, GPT-4.1, GPT-5, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3 Pro.
- Open-source: Gemma 3 family, Qwen2.5-VL, Qwen3-VL (8B to large variants).
- Task-specific evaluators: CLIPScore, ImageReward, HPSv2/v3, PickScore, VQAScore, EditReward, UnifiedReward.
Scoreboard with Context:
- Gemini 3 Pro leads at about 74–80% agreement across tasks, like an A- to B+ when humans score above 90% (A+).
- GPT-5 and Gemini 2.5 Pro reach roughly 66–75%, clearly improved but still trailing human reliability by 15–25 points.
- GPT-4o, a commonly used older judge, lands around 51–65%, no longer dependable for frontier evaluations.
- Best open-source judge Qwen3-VL-32B scores around 64–70%, competitive on generation tasks and much closer to APIs than before, though still behind on hard reasoning.
Task-specific Evaluators:
- Older CLIP-based or VQA-like metrics fall behind on MMRB2's harder prompts (e.g., ImageReward ~54% on text-to-image; VQAScore ~58%).
- Preference-trained newer models help (e.g., HPSv3 ~60% on text-to-image, EditReward ~67% on single-image editing), but still often trail strong MLLM judges like Qwen3-VL-32B or Gemini 3 Pro.
- Takeaway: Training on human preferences improves metrics, but distribution shift to frontier outputs hurts many older reward models; large, general MLLMs remain tough-to-beat judges.
Surprising/Diagnostic Findings:
- Same-Model vs Different-Model Pairs: All judges agree more with humans on different-model pairs (bigger quality gaps) than on same-model pairs (subtle differences), with gaps up to 12 points for top models. This shows fine-grained discrimination is still hard.
- Modality Bias in Reasoning: Judges are biased to pick responses that include images. Accuracy is far higher when the preferred answer contains images than when the preferred answer is text-only (gaps of 27.7–49.3 points for many models). Even the best model, Gemini 3 Pro, shows a notable but smaller gap. This is a key failure mode (a small analysis sketch follows this list).
- Test-time Scaling: Asking the same judge multiple times and taking a vote yields only tiny gains (~0.8–1.2 points at K=9 for some API models; none for some open-source models). Unlike text-only LLMs, this doesn't rescue multimodal judging much.
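To see how such diagnostics can be computed, here is an illustrative sketch that splits judge-vs-human agreement by a metadata field (e.g., same-model pairs, or whether the human-preferred answer contains an image); the record fields and example numbers are assumptions, not the benchmark's released schema or results.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key: str) -> dict:
    """records: dicts like {"preferred_has_image": True, "judge_pick": "A", "human_pick": "A"}.
    Returns agreement rate per subgroup so bias gaps become visible."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        hits[g] += int(r["judge_pick"] == r["human_pick"])
    return {g: hits[g] / totals[g] for g in totals}

records = [
    {"preferred_has_image": True,  "judge_pick": "A", "human_pick": "A"},
    {"preferred_has_image": True,  "judge_pick": "A", "human_pick": "A"},
    {"preferred_has_image": False, "judge_pick": "B", "human_pick": "A"},
    {"preferred_has_image": False, "judge_pick": "A", "human_pick": "A"},
]
print(accuracy_by_group(records, "preferred_has_image"))  # -> {True: 1.0, False: 0.5}
```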
Downstream Correlation via Best-of-N:
- Better MMRB2 judges pick better generations on real benchmarks (GenAI-Bench, GEdit-Bench, ISG-Bench, EMMA). Correlations exceed 0.8 (Pearson's r) across tasks (illustrated with a toy example after this list).
- Concrete wins: FLUX's GenAI-Bench score rises from 73% to 79% when GPT-5 selects best-of-8; GPT-4o's EMMA accuracy jumps from 32% to 45% with a better selector.
- Meaning: MMRB2 is not just an academic test; it predicts practical improvements when you use the judge to choose outputs.
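For readers who want to reproduce this kind of check on their own judges, the correlation is an ordinary Pearson's r between benchmark accuracy and downstream best-of-N gains; the numbers below are made up purely for illustration and are not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical values, NOT the paper's data: one point per judge model.
mmrb2_accuracy = [0.55, 0.62, 0.68, 0.74, 0.79]   # agreement with humans on MMRB2
best_of_8_gain = [0.5, 1.4, 2.9, 4.1, 5.0]        # downstream benchmark improvement (points)

r, p_value = pearsonr(mmrb2_accuracy, best_of_8_gain)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```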
Bottom Line:
- Human evaluation is still the gold standard (>90%).
- Gemini 3 Pro is the current best automated judge but still leaves 20–26% disagreement to close.
- Older or narrower evaluators struggle; new, stronger MLLMs and newer preference-trained rewards help but arenât perfect.
- The clear link from MMRB2 scores to real-world gains makes this benchmark a reliable compass for progress.
05 Discussion & Limitations
Limitations:
- Coverage Boundaries: While broad, MMRB2 focuses on core single-turn tasks. It doesn't yet cover multi-turn dialogues, safety-sensitive choices, or bias-sensitive preferences in depth, and it omits video/audio.
- Frontier Drift: As models evolve, today's "hard" pairs may become easier. The benchmark will need periodic refreshes to stay challenging.
- Annotation Cost: High-consensus expert labels are expensive and time-consuming, limiting how fast we can scale.
- Agent Variability: Agent outputs depend on tool stacks; other combinations might reveal new failure modes.
Required Resources:
- Access to multiple API models and open-source models for response generation and judging.
- Budget and time for expert annotation with robust quality control.
- Infrastructure to store and serve interleaved text-image data, plus tooling for dual-order evaluation.
When NOT to Use:
- Safety or Bias Audits: MMRB2 isnât designed to judge sensitive harms or fairness outcomes; specialized safety/bias benchmarks are better.
- Audio/Video Tasks: If your system's main modality is speech or video, MMRB2 won't fully reflect your needs yet.
- Very Domain-Specific Workflows: Highly specialized medical or legal visuals may require domain-oriented evaluations.
Open Questions:
- Can we train reward models that weigh reasoning quality over visual flash, reducing bias toward image-containing answers?
- What architectures or training signals best improve same-model fine-grained discrimination?
- How can we extend to multi-turn, multilingual, and in-the-wild settings without losing label reliability?
- Can we develop scalable, semi-automated labeling pipelines that still achieve human-level agreement for multimodal tasks?
- What new forms of test-time scaling or self-verification work for multimodal judging beyond simple majority votes?
06 Conclusion & Future Work
Three-Sentence Summary:
- MMRB2 is a comprehensive benchmark that fairly tests whether multimodal reward models agree with human preferences across text-to-image, image editing, interleaved generation, and multimodal reasoning.
- Even the best current judges, like Gemini 3 Pro, still disagree with humans about 20–26% of the time, while many older metrics and models underperform badly on frontier tasks.
- MMRB2 scores strongly predict real-world gains when judges pick the best of multiple candidates, making it a practical tool for improving multimodal systems.
Main Achievement:
- Creating a reliable, human-grounded, frontier-level testbed, with expert-consensus preference pairs and bias-aware protocols, that finally lets the community measure and improve omni reward models in a unified way.
Future Directions:
- Expand to safety/bias-sensitive preferences, multilingual prompts, multi-turn/agentic workflows, and new modalities like video and audio.
- Develop training strategies that reduce modality bias and improve fine-grained discrimination on same-model pairs.
- Explore better scaling methods (beyond majority voting) for more stable, trustworthy multimodal judging.
Why Remember This:
- MMRB2 marks a turning point for AI that reads, writes, and draws: we now have a fair, challenging, human-aligned scoreboard that tells us which judges actually understand mixed text-and-image content and which don't, so the next generation of creative, accurate, and reliable multimodal AI can be trained with confidence.
Practical Applications
- Use MMRB2 to pick the best judge for best-of-N selection in your image generation pipeline to immediately improve output quality.
- Benchmark your in-house reward model against MMRB2 before deploying it to guide RLHF or DPO training for multimodal tasks.
- Stress-test your image editing model on the MMRB2 editing subset to discover faithfulness and preservation failures.
- Evaluate interleaved text-image storytelling systems with MMRB2 to ensure step counts, ordering, and text-image alignment are correct.
- Audit your multimodal reasoning agent on MMRB2 reasoning pairs to detect modality bias toward image-containing answers.
- Compare open-source judges (e.g., Qwen3-VL-32B) to API judges to balance cost, latency, and accuracy in production.
- Train new reward models on recent, frontier-like data and validate generalization by checking their MMRB2 accuracy lift over older metrics.
- Use MMRB2's dual-order evaluation protocol in your own A/B tests to reduce position bias in internal model comparisons.
- Incorporate MMRB2 tasks as curriculum checkpoints while scaling omni models to ensure balanced progress across generation and reasoning.
- Run periodic MMRB2 evaluations to track regressions after model updates, especially for text rendering and spatial logic.