Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Key Summary
- The paper introduces VDR-Bench, a new test with 2,000 carefully built questions that truly require both seeing (images) and reading (web text) to find answers.
- Older tests often let AI models guess answers using words alone or by finding a perfect copy of the whole image online, which is unrealistic.
- VDR-Bench fixes this by making models focus on specific parts of an image (crops), search in rounds, and connect facts step by step like a detective.
- The authors also add a new score called Entity Recall that checks whether a model actually found the right things (people, places, logos) before answering.
- A simple multi-round cropped-search workflow and a guidance trick called Multi-turn Visual Forcing help models do better at real visual search.
- On VDR-Bench, models do poorly without search but improve a lot when they use cropped image search plus text search; Gemini 2.5 Pro jumps to 30.0% with the full method.
- Open-source models that actively search can outperform bigger closed models that rely on memory and skip searching, a failure mode the authors call the “lazy search” effect.
- The benchmark was built through a strict, human-verified pipeline to block text-only shortcuts and stop easy one-shot image matches.
- Results show a strong link between finding the right entities and getting the right answer—good search leads to good answers.
- This work gives practical guidance for building stronger, more realistic multimodal research agents.
Why This Research Matters
Real-world problems often mix pictures and words, like checking if a product photo matches its description or verifying a news image. VDR-Bench pushes AI models to handle that reality by forcing them to look closely, search carefully, and prove their answers. This means we can build assistants that don’t just guess but actually check the facts across images and text. Better evaluation leads to better tools for shoppers, students, reporters, and safety teams. It also encourages models to provide evidence you can see and trust, improving transparency. Over time, this could reduce misinformation spread by misleading images and create more reliable AI helpers in daily life.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re solving a mystery with a photo and the internet. If you only read comments about the photo without looking closely at what’s in it, you might guess the wrong answer.
🥬 The Concept: Multimodal Large Language Models (MLLMs)
- What it is: MLLMs are AI systems that can understand and use both pictures and words together.
- How it works: 1) Look at the image; 2) Read the text; 3) Connect clues from both; 4) Answer the question.
- Why it matters: Without using both, the AI might miss key visual clues or misunderstand the text. 🍞 Anchor: If you ask, “What brand is on the umbrella in this photo?”, the model must see the logo—not just guess from the caption.
🍞 Hook: You know how you can ask a friend, “What color is the car in this picture?” and they just tell you after looking?
🥬 The Concept: Visual Question Answering (VQA)
- What it is: VQA is an AI task where the model answers questions about an image.
- How it works: 1) Read the question; 2) Inspect the picture; 3) Match visual parts to question words; 4) Produce a short answer.
- Why it matters: Without careful looking, the AI might answer from memory instead of from the image. 🍞 Anchor: Ask, “What animal is sitting on the bench?” The model should look and say “cat,” not guess “dog.”
🍞 Hook: Think of a detective who uses both a photo of a suspect and articles from the web to solve a case.
🥬 The Concept: Vision-DeepResearch systems
- What it is: These are AI agents that search the web and images across many steps to answer tough visual-and-text questions.
- How it works: 1) Look at the image; 2) Search the web; 3) Compare results to what’s in the image; 4) Repeat until confident; 5) Answer with proof.
- Why it matters: Real questions often need both seeing and reading, not just one. 🍞 Anchor: To identify a stadium from a game photo, the agent must spot the team’s colors, search for likely stadiums, and confirm with multiple sources.
🍞 Hook: Ever try to solve a puzzle by only reading the instructions and ignoring the picture?
🥬 The Concept: Textual search bias (shortcut using language priors)
- What it is: When AI leans too much on text hints or its own world knowledge and skips visual checking.
- How it works: 1) Notice text clues; 2) Cross-check with other words; 3) Guess the answer; 4) Never verify with the image.
- Why it matters: It gives inflated scores that don’t measure real visual understanding. 🍞 Anchor: If a question lists team names, the model might guess the stadium without looking at the photo’s details.
🍞 Hook: Imagine winning a treasure hunt by googling a photo and finding the exact same photo with the answer in the title.
🥬 The Concept: Overly idealized whole-image retrieval (perfect-retrieval bias)
- What it is: When a benchmark lets models upload the full image and find an identical copy online with the answer.
- How it works: 1) Use entire image as search query; 2) Get near-duplicate; 3) Read the title/caption; 4) Answer without real reasoning.
- Why it matters: It’s unrealistic and too easy; real-world search rarely has perfect copies. 🍞 Anchor: Searching the entire concert photo returns the original poster page with the band’s name right in the title.
🍞 Hook: Think of a teacher who says, “Show your work.”
🥬 The Concept: Visual-search–centric evaluation
- What it is: Tests designed so the model must actually use the image, not just words.
- How it works: 1) Focus on parts of the image; 2) Search with those parts; 3) Cross-check with text; 4) Build an evidence chain.
- Why it matters: It measures true skill in seeing, searching, and verifying. 🍞 Anchor: To answer “Which brand is on the umbrella?”, the model must crop the logo, search it, and confirm “Ferrari.”
The World Before: MLLMs became good at answering image questions, but tests often allowed shortcuts. Models could guess from text hints or rely on memorized facts. On the image side, feeding the whole picture to search engines frequently brought back exact matches with the answer. This made scores look high even when models didn’t truly perform visual search.
The Problem: We lacked a benchmark that forces genuine, step-by-step visual search and multi-hop reading—just like how people solve tricky, real-world puzzles.
Failed Attempts: Prior datasets mixed images and facts but often allowed: (1) text-only solving; (2) one-shot whole-image matches; (3) shallow, one-step questions.
The Gap: A benchmark was needed that blocks shortcuts, emphasizes entity-level visual localization, and requires cross-modal, multi-hop reasoning.
Real Stakes: This affects everyday tools—shopping apps that verify products from photos, news-checkers that confirm image claims, or education helpers that explain museum photos accurately. If we can’t measure real visual search, we can’t build trustworthy assistants.
02 Core Idea
🍞 Hook: You know how a good detective doesn’t just look once—they zoom in on clues, search again, and link facts together?
🥬 The Concept: The “Aha!”
- What it is: Build a vision-first benchmark (VDR-Bench) that forces models to do real visual search by cropping image regions, searching in multiple rounds, and linking facts through multi-hop reasoning—then score whether they actually found the right entities.
- How it works: 1) Curate questions that need the image; 2) Require crop-based visual queries; 3) Expand difficulty via knowledge graphs; 4) Judge answers and entity discovery; 5) Encourage multi-turn visual forcing so models really search.
- Why it matters: Without this, we reward guessing and perfect matches instead of real understanding. 🍞 Anchor: To identify a building’s original purpose, the model must crop a tower detail, find the landmark, read trusted pages, and then answer.
🍞 Hook: Remember using a map and zooming in to find a tiny street sign?
🥬 The Concept: Multi-round cropped-search workflow (CIS)
- What it is: A strategy where the model crops important image parts and searches repeatedly to reduce noise.
- How it works: 1) Pick a region (like a logo); 2) Search images; 3) Update the crop/scale; 4) Search again; 5) Keep the best matches.
- Why it matters: Whole-image search can be messy; focusing on parts makes retrieval realistic and accurate. 🍞 Anchor: Crop just the Ferrari logo on an umbrella, search it, and confirm the brand.
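To make this concrete, here is a minimal Python sketch of what a multi-round cropped-search loop could look like. The `image_search` callable, the PIL-style `image.crop` call, and the score thresholds are illustrative assumptions, not the paper’s actual tooling.

```python
from dataclasses import dataclass

@dataclass
class SearchHit:
    title: str
    url: str
    score: float  # similarity between the query crop and the retrieved image

def cropped_image_search(image, box, image_search, rounds=3, expand=0.2, min_score=0.6):
    """Multi-round cropped-search sketch: query with a crop and, if results are
    weak, widen the crop to add context and search again.
    `image_search(crop) -> list[SearchHit]` is a hypothetical search API and
    `image.crop(box)` assumes a PIL-style image object."""
    best_hits = []
    for _ in range(rounds):
        hits = sorted(
            (h for h in image_search(image.crop(box)) if h.score >= min_score),
            key=lambda h: h.score, reverse=True,
        )
        if hits and (not best_hits or hits[0].score > best_hits[0].score):
            best_hits = hits
        if best_hits and best_hits[0].score > 0.85:
            break  # confident match found, stop searching
        # Widen the crop on every side so the next query carries more context.
        left, top, right, bottom = box
        dw, dh = expand * (right - left), expand * (bottom - top)
        box = (max(0, left - dw), max(0, top - dh), right + dw, bottom + dh)
    return best_hits
```

In practice, `box` would be seeded with a region the model (or an annotator) proposed, such as a bounding box around a logo.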
🍞 Hook: Imagine a family tree or a web of related facts.
🥬 The Concept: Knowledge graph and multi-hop reasoning
- What it is: A knowledge graph connects entities (like companies, people, places). Multi-hop means following several links to reach an answer.
- How it works: 1) Start from the visual entity; 2) Walk to related nodes (e.g., founder → birth year); 3) Collect facts; 4) Answer.
- Why it matters: Real questions often need more than one step. 🍞 Anchor: From a company logo in the image → to its headquarters city → to the year it moved there → final answer.
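As a toy illustration of a multi-hop walk, the snippet below stores a tiny knowledge graph as nested dictionaries and follows a chain of relations from the visual entity to the answer. The graph contents are placeholder facts for illustration, not data from the benchmark.

```python
# Toy knowledge graph: {entity: {relation: value}}. Entities and facts are
# illustrative placeholders, not the benchmark's data.
KG = {
    "Ferrari": {"founder": "Enzo Ferrari", "headquarters": "Maranello"},
    "Enzo Ferrari": {"birth_year": "1898"},
    "Maranello": {"country": "Italy"},
}

def multi_hop(start_entity, relations, kg=KG):
    """Follow a chain of relations (hops) from a visual entity to an answer."""
    node = start_entity
    for rel in relations:
        facts = kg.get(node, {})
        if rel not in facts:
            return None  # the chain breaks: a needed fact was never retrieved
        node = facts[rel]
    return node

# Example: logo -> company -> founder -> birth year
print(multi_hop("Ferrari", ["founder", "birth_year"]))  # 1898
```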
🍞 Hook: When you do a scavenger hunt, finding the right item matters as much as knowing the final riddle.
🥬 The Concept: Entity-level retrieval
- What it is: Checking whether the model actually found the correct named things (e.g., “Norman Tower”).
- How it works: 1) Log all searched entities; 2) Compare to gold entities; 3) Give credit for correct discoveries.
- Why it matters: Right entities → right evidence → right answers. 🍞 Anchor: If the gold entity is “Signal Iduna Park,” the model only gets full credit if it found that stadium, not just “a stadium in Germany.”
🍞 Hook: Think of a coach who encourages you to look again, closer, and in different spots.
🥬 The Concept: Multi-turn Visual Forcing (MVF)
- What it is: A prompting strategy that nudges the model to keep cropping, re-searching, and verifying with cross-modal evidence.
- How it works: 1) Propose regions; 2) Crop and search; 3) Re-assess; 4) Repeat until confident; 5) Cite sources.
- Why it matters: It combats the habit of guessing from memory and pushes real visual investigation. 🍞 Anchor: When unsure about a jersey, MVF guides the model to crop the badge, then the sponsor, then the stadium banner, building converging proof.
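Here is one way such a forcing loop might be wired up, assuming a hypothetical `chat(messages)` interface that returns either a tool call or an answer, and a `run_tool` executor for the cropped-image and text search tools. The prompt wording is illustrative, not the paper’s actual template.

```python
def multi_turn_visual_forcing(chat, run_tool, image, question, max_turns=6):
    """Sketch of an MVF-style loop: keep nudging the model until its answer is
    grounded in at least one round of visual search."""
    force_prompt = ("Do not answer from memory. Crop the most informative region, "
                    "search the crop, cross-check with a text search, then answer "
                    "with the evidence you found.")
    messages = [{"role": "user", "content": question, "image": image}]
    searched = False
    answer = None
    for _ in range(max_turns):
        reply = chat(messages)  # assumed to return {"tool_call": ..., "answer": ...}
        if reply.get("tool_call"):
            searched = True
            messages.append({"role": "tool", "content": run_tool(reply["tool_call"])})
        elif reply.get("answer") and searched:
            return reply["answer"]  # answer backed by at least one search round
        else:
            # The model tried to answer from memory: push it back to visual search.
            answer = reply.get("answer")
            messages.append({"role": "user", "content": force_prompt})
    return answer  # best effort if the turn budget runs out
```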
Before vs After:
- Before: Benchmarks allowed text-only shortcuts and perfect whole-image matches.
- After: VDR-Bench requires crop-based, iterative visual search and multi-hop reasoning, judged on both answers and discovered entities.
Why It Works (intuition): Cropping reduces background noise, like tuning a radio to the right station. Multi-hop reasoning builds a solid chain from “what I see” to “what I know.” Entity-level checks ensure the model didn’t just guess. MVF counteracts laziness by rewarding careful, repeated looking.
Building Blocks:
- Vision-first curation (pre-filter, crop, human-verify)
- Seed VQA from verified visual entities
- Knowledge-graph expansion to add hops
- Solvability checks to block shortcuts
- Two metrics: Answer Accuracy (LLM judge) and Entity Recall (did you find the right entities?)
- The CIS+TS+MVF recipe: cropped image search + text search + guided multi-turn verification.
03 Methodology
At a high level: Input image and question → [A: Vision-first cropping and search] → [B: Entity extraction and human verification] → [C: Seed VQA creation] → [D: Knowledge-graph expansion] → [E: Solvability and quality checks] → Output: Final benchmark item with gold entities and answer.
Step A: Vision-first Cropping and Search
- What happens: Annotators crop meaningful regions (logos, faces, landmarks) from the image and use each crop as a query in web-scale image search.
- Why this step exists: Whole-image search often returns identical photos with the answer in the title. Cropping forces realistic, noisier retrieval focused on the entity.
- Example: From a race crowd photo, crop only the red umbrella’s logo and search that crop.
Step B: Entity Extraction and Human Verification
- What happens: From the search results, candidate entity names (e.g., “Ferrari,” “Norman Tower”) are extracted using an MLLM filter and then confirmed by humans to ensure consistency with the crop.
- Why this step exists: It ensures the visual entity truly matches the crop and can’t be trivially sourced via full-image search.
- Example: If the cropped logo matches multiple brands, humans remove ambiguous results until a single correct brand remains.
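A simplified sketch of this filtering step is shown below; `name_from_hit` stands in for the MLLM filter, and in the actual pipeline every accepted entity is still confirmed by human annotators.

```python
from collections import Counter

def extract_candidate_entities(hits, name_from_hit):
    """Step B sketch: pull candidate entity names out of the retrieved results.
    `name_from_hit` stands in for the MLLM filter that reads a hit's title and
    snippet and returns an entity name or None (hypothetical)."""
    names = (name_from_hit(hit) for hit in hits)
    return [name for name in names if name]

def resolve_entity(candidates, min_agreement=0.6):
    """Route the crop: auto-accept only when candidates clearly agree, otherwise
    flag the crop as ambiguous and send it to human review."""
    if not candidates:
        return None, "needs_human_review"
    name, count = Counter(candidates).most_common(1)[0]
    agreed = count / len(candidates) >= min_agreement
    return name, "auto_accepted" if agreed else "needs_human_review"

# Made-up candidates for one cropped logo:
print(resolve_entity(["Ferrari", "Ferrari", "Scuderia Ferrari", "Ferrari"]))
# -> ('Ferrari', 'auto_accepted')
```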
Step C: Seed VQA Generation
- What happens: Verified entities are turned into clear VQA pairs that require recognizing and grounding that entity (e.g., identify the building in the image).
- Why this step exists: It links the visual entity to a concrete, image-dependent question.
- Example: “In the image, what brand does the red umbrella belong to?”
Step D: Knowledge-Graph–Based Complexity Expansion
- What happens: Starting from the visual entity, the pipeline walks a knowledge graph to create multi-hop questions (e.g., logo → company → founder → award year).
- Why this step exists: Many real tasks require more than simple recognition—they need reasoning across several facts.
- Example: From the “Norman Tower” crop, ask: “What was the tower’s original purpose when first built?” requiring historical lookup.
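The snippet below sketches how a knowledge-graph walk could be phrased as a multi-hop question without leaking the intermediate entities. The graph, relation phrases, and facts are illustrative stand-ins, not the benchmark’s templates.

```python
# Toy graph and phrase templates (placeholders for illustration only).
KG = {
    "Norman Tower": {"located_in": "Bury St Edmunds", "original_purpose": "gatehouse"},
    "Bury St Edmunds": {"county": "Suffolk"},
}
PHRASES = {
    "located_in": "the town where {x} is located",
    "county": "the county containing {x}",
    "original_purpose": "the original purpose of {x}",
}

def expand_question(visual_entity, path, kg=KG):
    """Nest relation phrases so intermediate entities never appear in the
    question text, then walk the graph to obtain the gold answer."""
    description = "the structure shown in the image"
    answer = visual_entity
    for rel in path:
        description = PHRASES[rel].format(x=description)
        answer = kg[answer][rel]
    return f"What is {description}?", answer

question, gold = expand_question("Norman Tower", ["located_in", "county"])
print(question)  # What is the county containing the town where the structure shown in the image is located?
print(gold)      # Suffolk
```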
Step E: Solvability and Quality Checks
- What happens: Automatic checks confirm the answer can be recovered using only the recorded visual crops, their retrieved pages, and the knowledge-graph hops. Human reviewers then remove any item with text-only shortcuts, ambiguity, or near-duplicate retrieval.
- Why this step exists: It guarantees each question truly needs the image and a defensible evidence chain.
- Example: If a question can be answered from the text alone without touching the cropped visual evidence, it gets rejected.
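A minimal version of such a solvability filter might look like the sketch below, where `answer_from` stands in for a solver run over a given evidence set and `judge_equivalent` for an LLM judge of answer equivalence (both hypothetical).

```python
def passes_solvability_checks(item, answer_from, judge_equivalent):
    """Step E sketch. `item` carries the question, gold answer, pages retrieved
    from the cropped-image searches, and the knowledge-graph hops."""
    full_evidence = item["crop_pages"] + item["kg_hops"]

    # 1) Solvable: the recorded visual evidence chain must recover the answer.
    recovered = answer_from(item["question"], full_evidence)
    if not judge_equivalent(recovered, item["gold_answer"]):
        return False

    # 2) No text-only shortcut: with the visual evidence withheld, the item
    #    must NOT be answerable; if it is, reject it.
    text_only = answer_from(item["question"], [])
    if judge_equivalent(text_only, item["gold_answer"]):
        return False

    return True
```

Items that pass this automatic filter still go through the human review described above.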
The Secret Sauce
- Crop-First, Not Image-First: Cropping reduces background noise and perfect-match shortcuts.
- Multi-Hop by Design: Knowledge-graph walks force reasoning beyond one step.
- Dual Metrics: You’re graded on both the final answer and whether you found the right entities.
- MVF Guidance: Prompts that say, “Look again—crop, re-search, verify,” lift performance.
Now, the evaluation tools and metrics:
🍞 Hook: When you’re graded on a science project, the teacher checks both your result and your process notes.
🥬 The Concept: Answer Accuracy via LLM-as-judge
- What it is: A large model evaluates whether the final answer is correct.
- How it works: 1) Extract the model’s final answer; 2) Use a consistent judging prompt; 3) Mark correct/incorrect.
- Why it matters: It standardizes grading across many answer styles. 🍞 Anchor: If the correct answer is “Signal Iduna Park,” “BVB Stadion Dortmund” may also be accepted when judged equivalent.
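In code, an LLM-as-judge check can be as simple as the sketch below. The prompt template and the generic `llm` callable are assumptions for illustration, not the paper’s exact grading setup.

```python
JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Gold answer: {gold}
Model answer: {prediction}
Reply with exactly one word: CORRECT if the model answer refers to the same
entity or fact as the gold answer (alternate names count), otherwise INCORRECT."""

def judge_answer(llm, question, gold, prediction):
    """Return True if the judge model deems the prediction correct."""
    verdict = llm(JUDGE_PROMPT.format(question=question, gold=gold, prediction=prediction))
    return verdict.strip().upper().startswith("CORRECT")

def answer_accuracy(llm, records):
    """records: iterable of (question, gold_answer, model_answer) tuples."""
    verdicts = [judge_answer(llm, q, g, p) for q, g, p in records]
    return sum(verdicts) / max(len(verdicts), 1)
```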
🍞 Hook: Think of a scavenger hunt where you must actually find the list items—not just describe them.
🥬 The Concept: Entity Recall (ER)
- What it is: A score that checks whether the model’s search trajectory discovered the key entities that matter.
- How it works: 1) Collect all entities the model searched; 2) Compare to the gold entity sequence using an LLM judge for semantic matches; 3) Mark success if coverage is sufficient.
- Why it matters: It rewards real searching and penalizes lucky guesses. 🍞 Anchor: If the gold path includes “Ferrari,” the model earns credit only if it actually found and cited “Ferrari,” not just “a racing brand.”
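A minimal Entity Recall computation might look like the sketch below, with a pluggable `same_entity` judge standing in for the LLM-based semantic matcher; the coverage threshold is illustrative.

```python
def entity_recall(searched_entities, gold_entities, same_entity, threshold=1.0):
    """What fraction of gold entities were actually found along the search
    trajectory? `same_entity(a, b)` stands in for the LLM judge that decides
    semantic equivalence between entity names."""
    if not gold_entities:
        return 1.0, True
    found = sum(
        any(same_entity(searched, gold) for searched in searched_entities)
        for gold in gold_entities
    )
    recall = found / len(gold_entities)
    return recall, recall >= threshold  # success if coverage is sufficient

# Toy usage with exact string matching as a stand-in judge:
recall, ok = entity_recall(
    ["Borussia Dortmund", "Signal Iduna Park"],
    ["Signal Iduna Park"],
    same_entity=lambda a, b: a.lower() == b.lower(),
)
print(recall, ok)  # 1.0 True
```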
Finally, the agent-side recipe used in experiments:
- Direct Answer: No external tools—just image + question.
- CIS+TS: Cropped Image Search plus Text Search tools for iterative retrieval.
- CIS+TS+MVF: Same tools, but with Multi-turn Visual Forcing prompts that push more rounds of cropping and evidence checks.
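For clarity, the three modes can also be written down as a small configuration in code; the tool names and config structure below are assumptions for illustration, not the paper’s actual interface.

```python
# Illustrative configuration of the three evaluation modes.
AGENT_MODES = {
    "direct_answer": {"tools": [], "mvf_prompting": False},
    "cis_ts": {"tools": ["cropped_image_search", "text_search"], "mvf_prompting": False},
    "cis_ts_mvf": {"tools": ["cropped_image_search", "text_search"], "mvf_prompting": True},
}

def build_agent(model, mode, toolbox):
    """Assemble an agent for one mode; `toolbox` maps tool names to callables
    and `model` is the MLLM backend (both hypothetical)."""
    config = AGENT_MODES[mode]
    tools = {name: toolbox[name] for name in config["tools"]}
    return {"model": model, "tools": tools, "use_mvf": config["mvf_prompting"]}
```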
🍞 Hook: When someone is very confident, they may skip reading instructions.
🥬 The Concept: Lazy search phenomenon
- What it is: Strong models rely on memory and avoid searching, which hurts performance on vision-heavy tasks.
- How it works: 1) Model trusts prior knowledge; 2) Skips or minimizes tool use; 3) Misses visual evidence; 4) Answers worse.
- Why it matters: Good memory isn’t enough; real-world tasks demand active search. 🍞 Anchor: An AI that “knows” popular stadiums but won’t crop and check logos can still get the stadium wrong.
Putting it all together: The benchmark’s pipeline makes each item depend on verified visual entities and multi-hop facts. The agent’s tools (CIS, TS, MVF) are set up to mirror how a careful person would investigate images online—zooming in, searching again, and confirming before answering.
04 Experiments & Results
The Test: The authors measured two things—(1) Answer Accuracy: did the model give the right final answer? and (2) Entity Recall: did the model actually find the key entities along the way? This double-check means models are rewarded for both results and real evidence gathering.
The Competition: They evaluated top multimodal models like Gemini 2.5 Pro, GPT-5, Claude-4-Sonnet, and strong open-source Qwen3-VL variants, comparing three modes: Direct Answer, CIS+TS (cropped image + text search), and CIS+TS+MVF (add Multi-turn Visual Forcing).
The Scoreboard (with context):
- On VDR-Bench, Direct Answer scores are low for all models (e.g., Gemini 2.5 Pro: 8.2%), proving you can’t rely on memory alone. That’s like taking a test without studying the actual chapter.
- Adding CIS+TS helps, but only modestly in the cross-benchmark analysis table (e.g., Gemini 2.5 Pro: 10.6%). This shows that just having tools isn’t enough if you don’t use them well.
- With CIS+TS+MVF, scores jump a lot on VDR-Bench (e.g., Gemini 2.5 Pro reaches 30.0% overall accuracy; Qwen3-VL-235B-A22B reaches 27.4%). That’s like moving from a low D to a solid C or better because you finally followed the full research process.
- In the CIS+TS setting, Qwen3-VL-235B-A22B achieved 21.2%, outperforming all closed-source models in that setup: evidence that active search can beat the passive, memory-reliant behavior the authors call “lazy search.”
- Entity Recall rises alongside accuracy when MVF is used, showing a strong positive link: models that find the right entities also answer better.
Surprising Findings:
- Open-source models that embrace search can beat larger, closed models that lean on priors. Searching well matters more than just “being big.”
- The “lazy search” effect is real: powerful models sometimes skip tools and underperform until prompted to try multi-round visual search.
- Whole-image search alone is not a silver bullet on VDR-Bench; focusing on entity crops is essential.
Category Patterns:
- Gains from MVF appear across many visual domains (sports, architecture, art & music, etc.), suggesting the improvements are general, not niche.
- Sports and “other” categories often show large boosts, likely because logos, uniforms, and banners benefit most from crop-based, iterative search.
Takeaway Numbers:
- Gemini 2.5 Pro: 8.2% (Direct) → 16.2% (CIS+TS, in one table variant) → 30.0% (CIS+TS+MVF). Big MVF jump signals that guided iteration produces real benefits.
- Qwen3-VL-235B-A22B: 8.8% (Direct) → 21.2% (CIS+TS) → 27.4% (CIS+TS+MVF). Active, scaled search pays off.
In plain words: When models are nudged to zoom in on image parts, search multiple times, and verify with text, they stop guessing and start finding—accuracy and entity recall climb together.
05 Discussion & Limitations
Limitations:
- Judge Dependence: Both Answer Accuracy and Entity Recall use an LLM-as-judge to assess correctness and semantic matches. While consistent prompts reduce noise, judge choices can still introduce bias.
- Engine Bias and Drift: Visual and text search depend on external engines. Changes in indexing or ranking could affect difficulty over time.
- Domain Coverage: VDR-Bench spans 10 diverse domains, but no benchmark can cover everything (e.g., very niche scientific diagrams or rare regional signage).
- Cost and Latency: Multi-round cropping and iterative searching increase compute and time costs, which can be high for large models.
- Ambiguity at Scale: Despite strict filtering, some edge cases may still allow partial shortcuts or ambiguous interpretations.
Required Resources:
- Access to robust image and web search APIs; ability to send cropped images as queries.
- An MLLM capable of reading images and text, plus running multi-step tool use.
- Enough compute budget for multiple search rounds (MVF increases steps but improves reliability).
When NOT to Use:
- Purely text-only tasks where images don’t add value—simpler text benchmarks are cheaper and faster.
- Time-critical settings where extra search rounds are infeasible (e.g., hard real-time systems).
- Closed environments with no web access or where privacy forbids external search.
Open Questions:
- Can we learn when to stop searching automatically—balancing accuracy with cost?
- How can we reduce reliance on LLM judges while keeping semantic fairness (better automatic metrics)?
- Can the system learn smarter cropping policies (which region, which scale, which order) from experience?
- How to handle videos and dynamic scenes where entities move and change over time?
- Can we personalize search to user goals while keeping verification rigorous and reproducible?
06 Conclusion & Future Work
3-Sentence Summary: VDR-Bench is a vision-first benchmark that forces models to perform real visual search by cropping, searching in multiple rounds, and linking facts through multi-hop reasoning. It blocks text-only shortcuts and perfect whole-image matches, then grades both final answers and whether key entities were truly found. A simple, practical workflow—cropped-image search plus Multi-turn Visual Forcing—substantially boosts performance and offers a roadmap for building better multimodal research agents.
Main Achievement: The paper redesigns evaluation for multimodal deep research—introducing a carefully curated, human-verified benchmark with crop-first retrieval, knowledge-graph expansion, and dual metrics (Answer Accuracy and Entity Recall) that finally measure what matters: grounded, cross-modal evidence gathering.
Future Directions: Smarter, learned cropping policies; cost-aware stopping rules; lighter, fairer judging metrics; extensions to video and dynamic scenes; and training recipes that reduce “lazy search” by rewarding strong evidence chains.
Why Remember This: It shifts the field from guessing and lucky duplicates to genuine seeing-and-searching, giving researchers a realistic yardstick and practitioners a practical recipe for building trustworthy, visual-first AI detectives.
Practical Applications
- E-commerce photo verification: Confirm the brand/model in a listing by cropping logos and matching them to trusted sources.
- News fact-checking: Validate whether an event photo truly matches the claimed place and time using entity-level search.
- Museum and education guides: Identify artwork or architecture in photos and explain historical context via multi-hop reasoning.
- Technical support: Recognize device models from partial photos (ports, badges) and retrieve accurate manuals or specs.
- Sports analytics: Identify teams and venues from uniforms, crests, and stadium features, then pull related stats.
- Travel planning: Spot landmarks from trip photos and surface official visiting info (hours, tickets, history).
- Enterprise asset auditing: Verify equipment types and serial badges from factory floor images to link correct documentation.
- Safety and compliance: Detect and confirm safety signage, gear compliance marks, and manufacturer certifications from images.
- Counterfeit detection: Compare cropped brand marks against authentic references and report mismatches with evidence.
- Academic research assistants: Build evidence chains from images in field notes (e.g., plant species) to authoritative references.