MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Intermediate
Peizhou Huang, Zixuan Zhong, Zhongwei Wan et al. · 1/18/2026

Key Summary

  • This paper introduces MMDeepResearch-Bench (MMDR-Bench), a new test that checks how well AI “deep research agents” write long, citation-rich reports using both text and images.
  • It contains 140 expert-made tasks across 21 domains, split into Daily (everyday problems) and Research (analysis-heavy) regimes.
  • The benchmark grades reports with a three-part pipeline: FLAE for writing quality, TRACE for citation faithfulness, and MOSAIC for matching text to visuals.
  • A strict rule called Visual Evidence Fidelity (VEF) makes sure any claim tied to the images matches what the images truly show, using a hard pass/fail threshold.
  • Experiments on 25 advanced systems show trade-offs: good writing doesn’t always mean reliable citations, and visual grounding is still a big challenge.
  • Gemini Deep Research ranks first overall on MMDR-Bench, driven by strong evidence coverage and grounding.
  • Adding vision helps only when the model reads images accurately; otherwise, it can introduce mistakes that spread through the report.
  • Agent systems can gather more evidence but sometimes mix up entities during long, multi-step synthesis.
  • The evaluation is interpretable and robust: it blends fixed formulas with careful LLM judging and remains stable across different judge models.
  • All data, code, and metrics are released to help the community build safer and more trustworthy research agents.

Why This Research Matters

When people use AI to make sense of charts, diagrams, and documents, they need more than pretty sentences—they need faithful, well-cited answers that match the visuals. MMDR-Bench sets a higher bar by checking writing, sources, and image alignment all at once, so we can see where models truly stand. This reduces the risk of confident-sounding mistakes that could mislead readers in areas like health, policy, or finance. It also gives builders clear, fine-grained feedback to fix specific weaknesses, like misreading small numbers on a chart. Because the scoring is interpretable and versioned, organizations can track real progress and avoid regressions. By releasing data and code, the authors enable a community effort toward safer, more trustworthy research AIs.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re doing a science fair project. You don’t just read words—you also study charts, photos, and diagrams. Your final report needs to be clear, true, and show where every fact came from.

🥬 The Concept (Deep Research Agents and why we need a new test): Deep Research Agents (DRAs) are AI helpers that search the web in multiple steps, collect evidence, and write long reports with citations. How it works: 1) read a question, 2) look up sources (sometimes many times), 3) collect evidence from text and images, 4) write a report, and 5) show exactly which source supports each claim. Why it matters: Without careful checking, an AI might write something that sounds great but doesn’t match the evidence—especially when images like charts and diagrams are involved.

🍞 Anchor: Think of an AI asked to compare two vaccines using a chart and a paper. If it misreads the chart’s numbers, the whole report could be wrong even if the writing sounds smart.

The World Before: A lot of AI tests focused on short answers (like multiple choice) or only on text. These tests didn’t check whether a full report correctly tied claims to proper citations, nor whether text matched pictures like charts or diagrams. Even when tests used images, they were usually quick Q&A tasks, not long, research-style write-ups.

The Problem: Real research needs more than just good writing. It needs faithful use of evidence. In multimodal tasks (text + images), models often: 1) cite the wrong sources or skip citations, 2) misread tiny chart details (numbers, axis labels), 3) write nice summaries that don’t match the figures. Because these tasks are open-ended, there isn’t always one “right answer,” so we must judge the whole report—its structure, its citations, and its visual grounding.

Failed Attempts: Earlier benchmarks tried three paths: 1) Test only browsing skill (can you find a page?), 2) Test only text-based deep research (can you write a long report with text citations?), and 3) Test short multimodal perception (can you answer a small visual question?). Each one missed the end-to-end challenge of using images and text together while writing a citation-rich report.

The Gap: No benchmark asked models to do the full job: understand a multimodal task, plan searches, gather sources, connect claims to specific citations, and make sure text aligns with visuals—all judged in a clear, reproducible way.

Real Stakes: People use AI to make health choices, understand news, and learn from scientific visuals. If an AI misreads a medical chart or links the wrong figure to a claim, it can lead to bad decisions. We need a trustworthy way to see if an AI is careful, honest, and visually accurate.

🍞 Anchor: Suppose you upload a photo of your eye drops and ask if they fit your symptoms, and also request safe alternatives near London. A good AI must: identify the product from the image, check reliable sources, cite them, and give an answer that matches the picture and the facts. That’s exactly what this benchmark tests.

02Core Idea

🍞 Hook: You know how a teacher grades a big project using a rubric—checking writing quality, sources, and whether pictures match the text? Now imagine a super-rubric for AI that does the same thing.

🥬 The Concept (Main innovation in one sentence): MMDeepResearch-Bench (MMDR-Bench) is a full end-to-end test that scores AI research reports on writing quality, citation faithfulness, and text–image consistency—using a clear, interpretable, multi-stage evaluator.

How it works (the building blocks):

  1. The benchmark gives each task as an image–text bundle (the question plus important images).
  2. The AI writes a citation-rich report.
  3. The evaluator runs three modules: FLAE (writing quality), TRACE (evidence and citation faithfulness, including a strict visual evidence check called VEF), and MOSAIC (does the text truly match the images?).
  4. Scores are fused into a final 0–100 score with weights: FLAE 20%, TRACE 50%, MOSAIC 30%.
  5. VEF is enforced with a pass/fail gate using a task-specific visual ground truth.

Why it matters: Without all three checks, models can seem smart while quietly drifting from the evidence—especially around visuals. This design makes it hard to “sound good” without actually being correct.

🍞 Anchor: Think of a book report that includes a graph. MMDR-Bench not only checks if your paragraphs read well, it also verifies each claim’s footnote, and confirms your words truly match the graph’s lines and labels.

Multiple Analogies:

  • Coach analogy: The AI is a player; MMDR-Bench is a three-coach team—one coach checks form (FLAE), the second checks you played by the rulebook with proof (TRACE), and the third checks you and the replay video agree (MOSAIC).
  • Museum guide analogy: The report is a tour. FLAE checks the guide’s clarity, TRACE checks the labels under each artwork (citations) match what the guide says, and MOSAIC checks that descriptions fit what’s on the wall.
  • Detective analogy: The AI is a detective; FLAE scores the case file’s organization, TRACE checks each clue is real and linked to its source, and MOSAIC confirms the photo evidence actually shows what the report claims.

Before vs After:

  • Before: Benchmarks judged either short multimodal questions or long text-only reports; models could pass by writing smoothly without proving alignment to images.
  • After: We demand honest, citation-grounded, visually faithful reports end-to-end. Strong prose isn’t enough—you must show your work and match the pictures.

Why it works (intuition without equations):

  • Decomposing the score forces discipline: Writing quality (FLAE) can’t hide missing or wrong evidence (TRACE); and even good citations can’t hide mismatched claims about images (MOSAIC + VEF). The fixed VEF threshold (pass if ≥ 6/10, else fail) acts like a seatbelt—non-negotiable safety around visuals.

Building Blocks (each with a sandwich):

🍞 Hook: You know how teachers grade different parts of an assignment differently? 🥬 FLAE: What it is: A writing-quality grader that blends simple, transparent text formulas with an LLM judge. How it works: 1) compute text features (structure, readability, sections), 2) get LLM scores for Readability, Insightfulness, Structure, 3) fuse them with adaptive weights per task. Why it matters: Without FLAE, a model might dump citations randomly or write messy reports that are hard to trust. 🍞 Anchor: Like grading an essay for clarity, depth, and organization.
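
To make the dual-channel idea concrete, here is a minimal Python sketch of a FLAE-style fusion. The feature formulas, the 50/50 blends, and the example weights are illustrative assumptions, not the paper’s exact definitions; only the three dimensions (Readability, Insightfulness, Structure) come from the description above.

```python
def formula_signals(report: str) -> dict:
    """Cheap, transparent signals about the report's form (all scaled to [0, 1])."""
    lines = report.splitlines()
    n_headers = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    sentences = [s for s in report.replace("\n", " ").split(".") if s.strip()]
    avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    return {
        "structure": min(n_headers / 5.0, 1.0),                  # enough sections?
        "readability": 1.0 if 10 <= avg_words <= 30 else 0.5,    # sentence-length band
        "completeness": 1.0 if "References" in report else 0.0,  # references block present?
    }

def flae_score(report: str, llm_judge: dict, task_weights: dict) -> float:
    """Blend formula signals with LLM-judge scores under task-adaptive weights.

    llm_judge:    {"readability", "insightfulness", "structure"} scores in [0, 1]
    task_weights: per-dimension weights chosen for this task, summing to 1
    """
    signals = formula_signals(report)
    per_dim = {
        "readability": 0.5 * signals["readability"] + 0.5 * llm_judge["readability"],
        "insightfulness": llm_judge["insightfulness"],  # judged by the LLM alone in this sketch
        "structure": 0.25 * signals["structure"] + 0.25 * signals["completeness"]
                     + 0.5 * llm_judge["structure"],
    }
    return sum(task_weights[d] * s for d, s in per_dim.items())
```

A research-heavy task might use something like `task_weights = {"structure": 0.4, "insightfulness": 0.4, "readability": 0.2}`, echoing the idea that some tasks prize depth and organization over smooth prose.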

🍞 Hook: When you make a claim, you should show your source. 🥬 TRACE: What it is: A citation auditor that checks if claims are actually supported by the cited URLs and the task’s intent. How it works: 1) extract claim–URL pairs, 2) fetch pages, 3) judge support and contradictions, 4) combine Consistency, Coverage, and Fidelity with a special VEF visual check. Why it matters: Without TRACE, an AI could quote the wrong page or cherry-pick evidence. 🍞 Anchor: Like checking each footnote in a school report to make sure it points to a page that truly says what you claimed.
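
A toy version of the audit loop may help. The regex, the stand-in metric definitions, and the equal-weight average below are assumptions for illustration; only the named dimensions (Consistency, Coverage, Fidelity) and the separate VEF gate come from the paper. `fetch_page` and `judge_support` are hypothetical callables the caller would supply.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # in-body markers like [3]

def extract_claim_pairs(report_body: str, references: dict) -> list:
    """Pair each citing sentence with the URLs its [n] markers point to."""
    pairs = []
    for sentence in report_body.split("."):
        ids = CITATION.findall(sentence)
        if ids:
            urls = [references[i] for i in ids if i in references]
            pairs.append((sentence.strip(), urls))
    return pairs

def trace_text_score(report_body: str, references: dict, fetch_page, judge_support) -> float:
    """fetch_page(url) -> page text; judge_support(claim, page) -> support in [0, 1]."""
    pairs = extract_claim_pairs(report_body, references)
    if not pairs:
        return 0.0  # nothing auditable at all
    support = [
        max((judge_support(claim, fetch_page(u)) for u in urls), default=0.0)
        for claim, urls in pairs
    ]
    consistency = sum(s >= 0.5 for s in support) / len(support)          # share of supported claims
    coverage = min(len(pairs) / max(report_body.count("."), 1), 1.0)     # how densely claims are cited
    fidelity = sum(support) / len(support)                               # average strength of support
    # The strict visual check (VEF) is handled separately and folded in with a
    # fixed share; see the aggregation sketch in the Methodology section.
    return (consistency + coverage + fidelity) / 3.0                     # equal weights: an assumption
```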

🍞 Hook: Ever describe a picture and someone says, “That’s not what it shows!” 🥬 MOSAIC: What it is: A visual integrity checker that ensures image-referenced text matches the actual images. How it works: 1) find image-linked parts of the report, 2) sort images by type (chart, photo, diagram), 3) apply type-appropriate checks (numbers for charts, objects for photos), 4) aggregate per-item scores. Why it matters: Without MOSAIC, a model could say a chart rises when it really falls. 🍞 Anchor: Like verifying your caption truly describes the photo.
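
The routing idea can be sketched as follows, assuming hypothetical checker callables; the real benchmark uses multimodal judges rather than the keyword heuristic shown here.

```python
def classify_image(image_meta: dict) -> str:
    """Crude type routing by caption keywords; a real pipeline would use a vision model."""
    caption = image_meta.get("caption", "").lower()
    if any(k in caption for k in ("chart", "plot", "curve", "axis", "bar")):
        return "chart"
    if any(k in caption for k in ("diagram", "architecture", "flow", "schematic")):
        return "diagram"
    return "photo"

def score_mm_item(passage: str, image_meta: dict, checkers: dict) -> float:
    """checkers maps image type -> callable(passage, image_meta) -> per-dimension dict."""
    dims = checkers[classify_image(image_meta)](passage, image_meta)
    # Three per-item dimensions: visual-semantic alignment, data-interpretation accuracy, complex VQA.
    return (dims["alignment"] + dims["data_accuracy"] + dims["complex_vqa"]) / 3.0

def mosaic_score(mm_items: list, checkers: dict) -> float:
    """Average per-item scores over all image-linked passages (MM-items)."""
    if not mm_items:
        return 0.0
    return sum(score_mm_item(p, m, checkers) for p, m in mm_items) / len(mm_items)
```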

🍞 Hook: A referee sometimes makes a call that can’t be argued. 🥬 VEF: What it is: A strict pass/fail gate for visual claims using task-written visual ground truth. How it works: 1) experts write a minimal text description of what’s clearly in the images, 2) the judge compares the report’s visual claims to this ground truth, 3) if score < 6/10 or identity-critical error occurs, it fails. Why it matters: It blocks pretty words from over-ruling the facts shown in the image. 🍞 Anchor: Like a strict ID check: if the wrong person is in the photo, the claim is rejected.
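
The gate itself is simple enough to show directly. Only the ≥ 6/10 threshold and the automatic failure on identity-critical errors come from the paper; the data shapes are assumptions.

```python
VEF_THRESHOLD = 6  # judge score out of 10; below this, the visual claims fail

def vef_gate(judge_score: float, identity_critical_error: bool) -> bool:
    """Hard PASS/FAIL for a report's visual claims against the textualized ground truth."""
    if identity_critical_error:  # e.g., the wrong product or person identified in the image
        return False
    return judge_score >= VEF_THRESHOLD

# Example: a report that scores 7/10 but misidentifies the product still fails.
assert vef_gate(7.0, identity_critical_error=True) is False
assert vef_gate(6.0, identity_critical_error=False) is True
```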

Together, these parts enforce honest, well-structured, and visually faithful research reports.

03Methodology

High-level recipe: Input (task with images + question) → Agent writes a citation-rich report → FLAE and TRACE score in parallel → If both are valid, run MOSAIC on image-linked parts → Weighted sum builds the final 0–100 score.

Step-by-step with purpose, pitfalls, and examples:

  1. Task packaging (image–text bundle)
  • What happens: Each of the 140 tasks includes a question and a few key images (charts, diagrams, screenshots). Tasks are split into Daily (40 tasks; everyday visuals like app screenshots) and Research (100 tasks; info-dense figures). Domains span 21 areas (e.g., Health, Environment, Computer Science).
  • Why this step exists: It ensures that visuals are not optional—they’re necessary evidence. Without this, models might ignore images and guess from text.
  • Example: “Is this eye drop suitable for my symptoms? Also list safe alternatives near London.” The image shows the product label; the model must read it, cite trustworthy sources, and answer location-specific availability.
  2. Report generation protocol
  • What happens: The agent must produce a long-form report with in-body citations (numbers mapped to a single URL in a References block). For image-dependent claims, the input images should be embedded or clearly referenced before making conclusions.
  • Why it exists: Standardized formatting makes checking easier and prevents “citation hide-and-seek.” Without this, claim–source matching would be unreliable.
  • Example: If the report says, “The ROC curve’s AUC is high [3],” reference [3] must be a page that actually defines/mentions that value and matches the figure discussed.
  3. FLAE: Formula–LLM Adaptive Evaluation (20% of final)
  • What happens: FLAE scores three dimensions: Readability, Insightfulness, Structural Completeness. It mixes: (a) formula-based features (section headers, sentence distribution, reference presence), and (b) an LLM judge. Task-adaptive weights pick which dimension matters more for that task.
  • Why it exists: Different tasks value different writing qualities (e.g., some want heavy structure; others prize insight). Without FLAE, a report could be sloppy or shallow yet slip through.
  • Example: A Research task on self-attention complexity expects clear sections and non-trivial reasoning with citations; FLAE emphasizes Structure and Insightfulness.
  4. TRACE: Trustworthy Retrieval–Aligned Citation Evaluation (50% of final)
  • What happens: TRACE parses the report’s citations, fetches the cited pages, and scores whether claim–URL pairs are supported. It aggregates three citation-fidelity metrics—Consistency, Coverage, and Textual Fidelity—plus Visual Evidence Fidelity (VEF). VEF has a strict PASS/FAIL using a task-specific visual ground truth (threshold score 6/10; identity-critical errors fail automatically). The VEF portion has a fixed share within TRACE (λ_VEF = 0.4), giving VEF 0.2 of the total score.
  • Why it exists: Models often sound confident but mis-cite, over-specify, or twist a source. Without TRACE (and especially VEF), the benchmark couldn’t guarantee that claims are both source-backed and visually faithful.
  • Example: If the report states, “The chart’s y-axis is log-scaled and peaks at 10,000 [5],” TRACE checks [5] and—via VEF—the actual provided chart image to confirm the axis type and peak value.
  5. MOSAIC: Multimodal Support–Aligned Integrity Check (30% of final, gated)
  • What happens: If FLAE and TRACE yield non-zero valid scores, MOSAIC activates. It gathers all text referring to images (MM-items), routes each image to a type-specific checker (chart, diagram, photo), and scores three dimensions per item: Visual–Semantic Alignment, Visual Data Interpretation Accuracy, and Complex VQA. It then aggregates these item scores.
  • Why it exists: Visuals differ: reading a bar chart is not the same as judging a photograph. Without MOSAIC’s type-aware checks, models could misread numbers or mis-describe scenes.
  • Example: For a bar chart, MOSAIC checks numeric plausibility (are maxima/minima, trends, and units correctly stated?). For a museum photo, it checks whether the described object truly appears.
  6. Gating and aggregation
  • What happens: FLAE and TRACE run in parallel. If both are valid (thresholds τ_F = τ_T = 0 in this study), MOSAIC runs; otherwise its score is set to zero. The final score = 0.2·FLAE + 0.5·TRACE + 0.3·MOSAIC (see the sketch after this list).
  • Why it exists: We don’t want to reward visual integrity when the basic writing or citation isn’t valid. Gating keeps evaluations honest and interpretable.
  • Example: A report that is well-written but has poor citations won’t reach the MOSAIC stage and thus cannot score high overall.
  7. Visual Evidence Fidelity (VEF) details
  • What happens: Experts write a minimal “text version” of key visual facts (titles, axes, numbers, labels, entities). The judge compares the report’s visual claims with this ground truth. Scores below 6/10 or identity-critical mistakes are a hard FAIL.
  • Why it exists: It prevents “maybe” answers from passing when visuals are misread. Fixed rules make outcomes auditable and stable across judge models.
  • Example: If the image shows a line chart peaking at 2022 with value 3.4, claiming a 2021 peak at 3.6 fails VEF.
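
Putting steps 4–6 together, a minimal sketch of the gating and aggregation might look like this. The weights (0.2/0.5/0.3), the VEF share inside TRACE (λ_VEF = 0.4), and the τ_F = τ_T = 0 gates are taken from the description above; the function signatures and the 0–1 score scale are assumptions.

```python
W_FLAE, W_TRACE, W_MOSAIC = 0.2, 0.5, 0.3  # module weights in the final score
LAMBDA_VEF = 0.4                           # VEF's fixed share inside TRACE (0.4 * 0.5 = 0.2 of total)
TAU_F = TAU_T = 0.0                        # validity gates used in this study

def trace_with_vef(textual_fidelity: float, vef_passed: bool) -> float:
    """Blend textual citation metrics with the hard VEF gate (all scores in [0, 1])."""
    return (1 - LAMBDA_VEF) * textual_fidelity + LAMBDA_VEF * (1.0 if vef_passed else 0.0)

def final_score(flae: float, trace: float, run_mosaic) -> float:
    """run_mosaic() is only invoked when FLAE and TRACE clear their gates."""
    mosaic = run_mosaic() if (flae > TAU_F and trace > TAU_T) else 0.0
    return 100.0 * (W_FLAE * flae + W_TRACE * trace + W_MOSAIC * mosaic)
```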

The Secret Sauce (what’s clever):

  • Dual-channel FLAE blends transparent formulas with an LLM judge and adapts weights per task. TRACE goes beyond link-checking to verify claim-level support and add a strict visual guardrail (VEF). MOSAIC uses type-specific judging to fairly evaluate diverse images. Together, the pipeline blocks “good-sounding but unfaithful” reports and surfaces fine-grained errors for diagnosis.

Concrete mini-walkthrough: Confusion matrix + ROC task

  • Input: Two images (confusion matrix diagram, ROC curve), prompt asking for analysis and calibration advice with citations.
  • Agent: Extracts 10+ facts from the images, explains imbalance effects, derives formulas, cites docs and papers.
  • FLAE: Rewards clear sections, correct math narrative, and structured references.
  • TRACE: Checks if claims (e.g., precision depends on class ratio) match cited sources and whether image-based statements agree with the prompt’s visual intent via VEF.
  • MOSAIC: Verifies the text’s ROC and confusion-matrix descriptions match the actual diagrams (e.g., TP/TN positions, curve shape) and that any numbers or qualitative statements align.

04Experiments & Results

The Test: The authors evaluate 25 advanced systems, including single-turn LLMs, web-enabled multimodal models, and full deep research agents. They measure three things: 1) FLAE (writing quality), 2) TRACE (citation faithfulness with VEF), and 3) MOSAIC (text–image integrity). The overall MMDR-Bench score combines them with weights 0.2/0.5/0.3.

The Competition: Systems span from strong single models (e.g., GPT-4.1, GPT-5.2, Qwen 3 VL) to tool-using models (Gemini 2.5/3 series, Claude 4.5 series) and end-to-end Deep Research Agents (Gemini Deep Research, Perplexity Sonar Deep Research, Tongyi Deep Research, ChatGPT Deep Research).

The Scoreboard (context):

  • Gemini Deep Research (Gemini 3 Pro backbone) ranks first overall with a final score of 49.41 on a 0–100 scale—like scoring an A- when many others get Cs on this hard exam. Its edge comes from stronger evidence coverage and grounded reporting.
  • Gemini 3 Flash and Gemini 3 Pro are the strongest among non-agent, web-enabled baselines, hovering in the mid-40s.
  • GPT-4.1, GPT-5.1/5.2, and related models show complementary strengths: some excel at multimodal extraction accuracy, others at visual-fidelity pass rates.
  • Clear trade-offs appear: smooth prose (high FLAE) doesn’t guarantee strong TRACE or MOSAIC. In other words, sounding good isn’t the same as being correct and well-cited.

Surprising Findings:

  1. Vision helps only when it’s accurate: Comparing text-only to vision-enabled versions in the same family shows that adding vision can backfire if the model misreads fine-grained details (like small numbers, dates, or axis labels). These detail errors (DTE) can seed wrong premises that contaminate retrieval and synthesis.
  2. Multimodal alignment vs citation grounding can diverge: Agent systems that gather lots of sources sometimes drift on entities during long synthesis, causing EMI (entity mis-identification) despite good multimodal alignment. Early correct entities can become mismatched after multiple retrieval and summarization steps.
  3. Tools help, but the backbone and retrieval quality matter most: Larger, better-trained backbones with well-tuned retrieval and cross-checking outperform smaller agent systems. Some offline models even beat certain web-enabled models in coverage, hinting that retrieval orchestration (not just access) is a bottleneck.

Domain-level patterns:

  • Daily tasks are noisier and require robust handling of user screenshots and casual visuals. Gemini 2.5 Flash and GPT-5.2 are consistently strong here.
  • Research tasks show domain specialization. Gemini Deep Research and Gemini 3 Flash perform well across most research domains. GPT-5.2 shines on structured technical areas like Computer/Data Science. Qwen 3 VL 235B excels on visually dense scientific domains (e.g., Environment & Energy) where chart/diagram reading is decisive.

Human consistency and robustness:

  • Human agreement: The full evaluator (with VEF and MOSAIC) matches expert preferences better than a plain prompt judge (pairwise agreement 73.5% vs 61.2%, and higher system-level score correlation). Dropping VEF or MOSAIC reduces alignment.
  • Judge stability: Re-scoring the same reports with different judge backbones shifts module tendencies (e.g., one judge is stricter on VEF) but barely changes the overall score (about 0.3 absolute points), showing the pipeline’s balanced design.

What it means:

  • The benchmark exposes where today’s models struggle: precise reading of visuals and ironclad claim–citation discipline. It also shows that agentic search improves coverage but can introduce new errors (entity drift) that must be managed.

Concrete illustrations:

  • Example 1 (health daily): The best systems read eye-drop labels correctly, cite trustworthy sources about symptoms and ingredients, and suggest London-available alternatives. Lower-scoring systems misread labels or cite blogs that don’t support the claim.
  • Example 2 (CS research): Strong models extract accurate facts from transformer and attention diagrams and correctly state self-attention’s time and memory complexity with citations; weaker ones flip labels or over-claim without proper sources.

05Discussion & Limitations

Limitations:

  • Coverage, not infinity: MMDR-Bench has 140 tasks across 21 domains—broad but not exhaustive. Some real-world multimodal scenarios (e.g., videos, interactive maps) aren’t included yet.
  • Visual GT dependency: VEF relies on expert-written textualized visual ground truth. While versioned and audited, writing GT is labor-intensive and may miss rare edge-cases.
  • Live retrieval variability: Although the pipeline handles inaccessible links with reason-aware penalties, web drift and regional blocking can still affect TRACE.
  • Judge biases: Different LLM judges show slightly different tendencies (e.g., strictness on prompt faithfulness). The multi-stage design reduces but does not eliminate this.

Required resources:

  • Compute and APIs: Running agents with browsing and then judging with parallel modules (including OCR/routing for MOSAIC) requires stable APIs and moderate compute.
  • Data engineering: Storing images, GT versions, and evaluation artifacts; maintaining canary tests for regression protection.

When not to use:

  • Pure vision or pure text tasks: If your use case is only OCR or only text summarization, more focused benchmarks may be simpler and faster.
  • Non-citation writing: If your application never requires sources (e.g., creative writing), TRACE/VEF constraints might be unnecessary overhead.

Open questions:

  • Beyond images: How should we extend to video, audio, and interactive content while keeping evaluation interpretable?
  • Agent policies: What retrieval and synthesis strategies reduce entity drift without sacrificing coverage?
  • Ground truth scaling: Can semi-automatic methods help expand reliable visual ground truth at scale?
  • Safety and fairness: How do we detect and reduce subtle biases in source selection and visual interpretation across languages and cultures?

06Conclusion & Future Work

Three-sentence summary: MMDeepResearch-Bench is a new end-to-end benchmark that evaluates how well AI deep research agents write citation-rich reports using both text and images. It grades reports with three coordinated modules—FLAE for writing, TRACE (with strict VEF) for citation faithfulness, and MOSAIC for text–image integrity—producing interpretable, fine-grained diagnostics. Tests on 25 systems reveal that great prose is not enough; trustworthy evidence use and visual grounding remain key bottlenecks.

Main achievement: The paper delivers the first unified, multimodal deep-research benchmark with an interpretable, judge-in-the-loop pipeline that strictly enforces visual faithfulness and claim–citation alignment.

Future directions: Expand beyond images to video and interactive media; improve automatic ground-truthing for visuals; design agent strategies that curb entity drift; and explore domain-specialized rubrics while maintaining cross-domain comparability.

Why remember this: MMDR-Bench raises the standard from “sounds smart” to “proves it,” setting a clear path for building research AIs that are readable, reliable, and truly grounded in both text and visuals.

Practical Applications

  • Select the right AI for evidence-heavy tasks by comparing MMDR-Bench scores, especially TRACE and VEF.
  • Debug model failures by inspecting fine-grained errors (e.g., entity drift vs detail extraction) surfaced by the evaluator.
  • Train better retrieval strategies by targeting low Coverage or Consistency in TRACE.
  • Improve visual reading by fine-tuning on chart/diagram tasks that hurt MOSAIC or VEF.
  • Harden report templates using FLAE feedback to boost structure and readability without sacrificing faithfulness.
  • Adopt strict citation formats (one index → one URL) to simplify claim–source auditing in production (see the sketch after this list).
  • Monitor regressions over time with versioned visual ground truths and canary tasks.
  • Design agent loops (plan–search–verify) that reduce entity misattribution during multi-step synthesis.
  • Benchmark domain readiness (e.g., Health vs Energy) before deploying models to specific verticals.
  • Use Daily vs Research regimes to stage model rollouts: start with casual tasks, then graduate to analysis-heavy cases.
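
As a small illustration of the “one index, one URL” convention recommended above, here is a toy parser. The exact References format is an assumption, and real reports may need more forgiving parsing.

```python
import re

REF_LINE = re.compile(r"^\[(\d+)\]\s+(https?://\S+)\s*$")  # e.g., "[3] https://example.org/page"

def parse_references(references_block: str) -> dict:
    """Map each citation index to exactly one URL; duplicates are rejected."""
    mapping = {}
    for line in references_block.splitlines():
        match = REF_LINE.match(line.strip())
        if not match:
            continue
        idx, url = match.groups()
        if idx in mapping:
            raise ValueError(f"Citation [{idx}] maps to more than one URL")
        mapping[idx] = url
    return mapping

def unresolved_citations(report_body: str, mapping: dict) -> list:
    """Indices cited in the body that have no entry in the References block."""
    used = set(re.findall(r"\[(\d+)\]", report_body))
    return sorted(used - set(mapping), key=int)
```
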
#Multimodal Deep Research#Benchmark#Citation Grounding#Visual Evidence Fidelity#MOSAIC#TRACE#FLAE#Retrieval-Augmented Generation#Long-Form Report Evaluation#Image–Text Alignment#Agentic Search#LLM-as-a-Judge#Human Consistency#Failure Mode Analysis#Multimodal Integrity
Version: 1