Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Key Summary
- The paper tackles a real problem: one-shot image or text searches often miss the right evidence (low hit-rate), especially in noisy, cluttered pictures.
- Vision-DeepResearch teaches a multimodal model to search like a careful detective: crop different regions of an image, try multiple scales, run many text queries, and keep iterating until it finds solid proof.
- It connects visual searching with strong text-only research skills by turning the image into a detailed description (text bridging) so a text research model can continue the long investigation.
- The team builds high-quality training data with verified questions, multi-step search trajectories, and clever obfuscation to make problems more realistic and harder to shortcut.
- Training happens in two phases: supervised fine-tuning (to learn the basic behaviors) and reinforcement learning (to practice on the real web and get rewards for correct answers).
- An asynchronous rollout system speeds up long, tool-heavy training by running many searches in parallel and safely stopping broken loops.
- Across six benchmarks, Vision-DeepResearch beats previous multimodal research agents and even strong closed-source agent workflows in the same setup.
- Ablations show that multi-scale image cropping plus text search together are crucial: each alone helps, but together they deliver the best balance and highest accuracy.
- The method is practical for real-world, noisy search engines and long questions that need evidence from many places, but it still needs web access, compute, and careful data curation.
- This work matters because it moves AI closer to being a reliable visual-and-text researcher that can verify facts, cite sources, and handle messy situations like humans do.
Why This Research Matters
Real-world information is messy: photos are cluttered, and search engines respond differently to tiny changes. An AI that can zoom into the right parts of an image, try multiple search angles, and keep reasoning over many steps is far more likely to find reliable, source-backed answers. This helps students verify facts, journalists check images, shoppers confirm products, and analysts investigate events. By training on realistic, multi-hop data and practicing in the live web, Vision-DeepResearch behaves more like a patient human researcher than a guesser. That shift makes AI answers more trustworthy and useful in daily life.
Detailed Explanation
01 Background & Problem Definition
You know how when you look for a friend in a crowded playground, you don't just stare at the whole crowd once? You scan faces, zoom in on hats, and try different spots until you spot them. Computers that answer questions about pictures used to behave more like the first way: one quick glance and a guess. That worked sometimes, but often failed in real, noisy scenes.
🍞 Hook: Imagine asking, "Who is giving this lecture?" while showing a photo of a busy auditorium. If the computer searches the entire picture at once, it might latch onto a banner, a logo, or a random audience member and miss the actual speaker.
🥬 Visual Question Answering (VQA)
- What it is: A computer answers questions about an image.
- How it works: 1) Read the question; 2) Look at the picture; 3) Find the parts that matter; 4) Use what it sees (and sometimes outside info) to answer.
- Why it matters: Without VQA, the system can't connect the question to the picture; it would be like reading a question with your eyes closed. 🍞 Anchor: Ask, "What team is this basketball player on?" The model looks at the jersey and logo in the image and replies "Los Angeles Lakers."
🍞 Hook: You know how typing the same question slightly differently in a search engine can bring up totally different results? Real web search is messy like that.
🥬 Hit-Rate Problem
- What it is: The chance that a search query actually finds the right evidence.
- How it works: 1) You make a query (full image, a crop, or text); 2) The search engine returns results; 3) If the correct clue doesnāt show up, you try again with a different crop/wording/scale.
- Why it matters: If hit-rate is low and you only try once, you often miss the needed proof and answer wrong. 🍞 Anchor: Searching a full concert photo for "singer" might return venue ads. Cropping just the singer's face and trying again suddenly pulls up the singer's official page.
🍞 Hook: Think of solving a tough puzzle: you plan, try an action, look at what happened, then plan again.
🥬 ReAct Paradigm (Reason-then-Act)
- What it is: A loop where the model thinks first, then calls a tool (like a search engine), then thinks again using the new info.
- How it works: 1) Plan; 2) Call a tool; 3) Read the result; 4) Update the plan; 5) Repeat until done.
- Why it matters: Without ReAct, the model either never checks sources or never updates its plan, so it gets stuck or guesses. 🍞 Anchor: To find "What year did this museum open?" the model plans a query, searches, reads a page, notices it's about renovations rather than the opening year, and tries a refined query until it finds the correct date.
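To make the loop concrete, here is a minimal Python sketch of a ReAct-style cycle. The `llm_plan` and `call_tool` helpers are hypothetical stand-ins for the model and its tool environment, not code from the paper.

```python
# Minimal ReAct-style loop: plan, call a tool, observe, update, repeat.
# `llm_plan` and `call_tool` are hypothetical stand-ins for the model and
# its search tools (not the paper's released code).

def llm_plan(question, history):
    """Pretend model call: returns ('search', query) or ('answer', text)."""
    if not history:
        return ("search", question)        # first turn: try the question verbatim
    return ("answer", "1952")              # toy placeholder answer after one observation

def call_tool(action, argument):
    """Pretend tool call: a real agent would hit a search engine here."""
    return f"search results for: {argument!r}"

def react(question, max_turns=10):
    history = []
    for _ in range(max_turns):
        action, argument = llm_plan(question, history)
        if action == "answer":             # the model decides it has enough evidence
            return argument, history
        observation = call_tool(action, argument)
        history.append((action, argument, observation))
    return None, history                   # ran out of turns without a verified answer

answer, trace = react("What year did this museum open?")
print(answer, len(trace))
```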
The world before this paper: Multimodal models (that see images and read text) got better at everyday tasks, but complex, fact-heavy questions were tricky. Earlier systems often assumed a single full-image search or a couple text queries would find everything. In reality, images are cluttered, engines vary, and questions may require hopping between multiple pages and visual regions.
🍞 Hook: Learning to research is like learning to ride a bike: you fall, adjust, and try again. Models need practice on long, multi-step journeys.
🥬 Multimodal Deep-Research Paradigm
- What it is: A method where models mix images and text, repeatedly searching and reasoning across both.
- How it works: 1) Inspect the image; 2) Propose entity crops at different sizes; 3) Search images and the web; 4) Read and summarize pages; 5) Repeat and connect clues.
- Why it matters: Without this, models stop too early or miss fine details, like a small name tag or a subtle logo. 🍞 Anchor: To answer "What date was this Ivy League talk?" the model tries many crops (speaker's face, poster corner), finds the speaker's bio page, verifies Ivy League membership, and finally extracts the exact date.
Failed attempts: One-shot full-image searches got drowned by background noise; short, two-to-four-step workflows couldn't reach faraway facts; purely text-based agents lacked visual grounding. The gap: a robust, long-horizon, multimodal search-and-reason loop that handles noisy images and unstable engines, while being trained on data that actually reflect those messy realities.
Real stakes: This matters for verifying product photos, identifying people or places responsibly, helping students learn with source-backed answers, and assisting journalists or analysts who must piece together facts from images and text under uncertainty. Without deeper multimodal research, AI behaves like a hurried guesser, not a careful helper who checks the evidence.
02 Core Idea
🍞 Hook: You know how good detectives zoom into clues, look again from another angle, and ask more people until the story fits? That's the spirit here.
🥬 The "Aha!" in one sentence: Teach multimodal models to research like persistent detectives, cropping images at multiple scales, running many rounds of visual and text searches, and connecting the dots over many steps, so they actually find and verify evidence in noisy, real settings.
Multiple analogies:
- Library hunt: Instead of grabbing the first big book (full image), you look up specific chapters (crops), check footnotes (text pages), and follow references (multi-turn). You stop only when the citation matches.
- Treasure map: You don't just follow one arrow; you zoom into different map squares, read travel notes, and try side paths, then combine clues until X marks the spot.
- Cooking a complex dish: You watch short clips (image crops), check a blog (text), test a spice (refine query), and keep tasting (verify) until it's right.
Before vs After:
- Before: One or two searches; often miss small but key visual entities; stop early; answers are shallow or wrong.
- After: Dozens of reasoning steps; many image crops at different scales; broad text searches; verified, sourced answers even in cluttered scenes.
Why it works (intuition, no equations):
- Breaking a big, noisy image into focused crops raises the chance that search engines match the right entity (better hit-rate).
- Long-horizon ReAct lets the model learn from each try and adapt queries, like a human iterating.
- Bridging the image to a text description lets a strong text research model continue the chain, carrying visual context forward.
- Training on tough, verified, multi-hop data plus RL in the real web environment bakes these habits into the model.
Building blocks (each with a mini sandwich):
🥪 Multi-entity, Multi-scale Visual Search
- What: Search using many carefully chosen crops at different sizes.
- How: 1) Propose boxes; 2) Crop at multiple scales; 3) Query image search; 4) Visit and summarize pages; 5) Keep useful evidence.
- Why: One shot often fails; multiple zoomed looks greatly improve matching. Example: A tiny lapel pin crop finally reveals a university name.
🥪 Long-Horizon Multi-Turn Trajectories
- What: Reason-search cycles that can run dozens of steps.
- How: 1) Plan; 2) Tool-call; 3) Observe; 4) Update; 5) Repeat with stop checks.
- Why: Some facts hide behind several hops; short runs stall too soon. Example: 20 steps to go from a face crop to a bio page to a dated event listing.
🥪 Text Bridging
- What: Turn the image into a rich text description so a text research LLM can keep going.
- How: 1) Describe the image; 2) Keep the history of visual crops and observations; 3) Let the text model continue ReAct.
- Why: Text models are already great at long, tool-heavy research; bring that power into vision tasks. Example: "A woman at a Dartmouth podium" becomes text context the model uses to search university sites.
🥪 Cold-Start Supervision
- What: Teach the basics with curated, verified trajectories before RL.
- How: Feed 30K high-quality examples with step-by-step thoughts, tool calls, and answers.
- Why: Without a good start, RL wanders; with it, the model learns stable habits. Example: Show it how to crop, search, summarize, and stop when evidence is enough.
🥪 Reinforcement Learning (RL)
- What: Practice live on the web and earn rewards for right answers.
- How: 1) Roll out long trajectories; 2) Judge correctness; 3) Update the policy (GRPO); 4) Repeat.
- Why: Offline data canāt cover everything; RL sharpens decision-making in the wild. Example: The model learns to avoid dead-end pages and pick better queries next time.
Altogether, the model becomes a careful multimodal researcher that doesn't give up early and doesn't guess without checking.
03 Methodology
At a high level: Image + Question → Propose crops and reason → Multi-scale image search + page visit + summarize → Decide if enough visual evidence → Bridge to text → Long text-based research with search/visit/python → Verified answer.
Step-by-step (with what/why/examples):
- Input and Setup
- What: Receive an image I and question q; prepare prompts to encourage ReAct-style planning.
- Why: Clear prompting nudges the model to plan, act, and reflect in cycles instead of guessing.
- Example: q = "Given this took place at an Ivy League school, what is the lecture date?"
- Propose Multi-Entity, Multi-Scale Crops
- What: The MLLM reasons about likely regions (faces, logos, posters) and proposes many bounding boxes at different sizes.
- Why: Small logos and name tags can be decisive; multi-scale crops boost hit-rate.
- Example: Boxes around the speaker's face, podium sign, poster corner, and background banner; each is cropped at small/medium/large.
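As a concrete illustration of this step, here is a small Python sketch that expands each proposed box at several scales and crops the image. The (x, y, w, h) box format, the scale factors, and the use of Pillow are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: expand each proposed box at several scales and crop the image.
# Box format (x, y, w, h) in pixels and the scale factors are illustrative
# choices, not the paper's exact settings.
from PIL import Image

def multiscale_crops(image, boxes, scales=(1.0, 1.5, 2.5)):
    W, H = image.size
    crops = []
    for (x, y, w, h) in boxes:
        cx, cy = x + w / 2, y + h / 2                # box center
        for s in scales:
            half_w, half_h = w * s / 2, h * s / 2
            left, top = max(0, int(cx - half_w)), max(0, int(cy - half_h))
            right, bottom = min(W, int(cx + half_w)), min(H, int(cy + half_h))
            crops.append(image.crop((left, top, right, bottom)))
    return crops

img = Image.new("RGB", (1280, 720))                  # stand-in for the auditorium photo
boxes = [(600, 200, 80, 80), (100, 650, 200, 40)]    # e.g. speaker's face, podium sign
print(len(multiscale_crops(img, boxes)))             # 2 boxes x 3 scales = 6 crops
```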
- Visual Search → Visit → Summarize
- What: For each crop set in a turn: (a) image search; (b) visit top URLs; (c) summarize pages with an auxiliary model.
- Why: Directly feeding raw pages can overflow the context and contain clutter; summarization filters noise and keeps matches.
- Example: Face crop yields a faculty page; the summarizer confirms the photo match and extracts "Dr. Caroline Robertson."
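Below is a minimal sketch of one such turn. `image_search`, `fetch_page`, and `summarize` are hypothetical stubs standing in for the image-search API, the page fetcher, and the auxiliary summarizer model.

```python
# Sketch of one visual-search turn: image search -> visit top URLs -> summarize.
# `image_search`, `fetch_page`, and `summarize` are hypothetical stubs for the
# search API, the page fetcher, and the auxiliary summarizer model.

def image_search(crop):                      # returns candidate URLs for a crop
    return ["https://example.edu/faculty/robertson"]

def fetch_page(url):                         # raw page text (long and noisy in practice)
    return "Dr. Caroline Robertson, Department of Psychological and Brain Sciences ..."

def summarize(page_text, question):          # the auxiliary model keeps only what matters
    return page_text[:120]                   # toy stand-in for an LLM-written summary

def visual_search_turn(crops, question, top_k=2):
    evidence = []
    for crop in crops:
        for url in image_search(crop)[:top_k]:
            evidence.append({"url": url, "summary": summarize(fetch_page(url), question)})
    return evidence                          # only compact, filtered evidence is kept

print(visual_search_turn(["face_crop"], "Who is giving this lecture?"))
```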
- Judge Visual Sufficiency
- What: A judge model looks at accumulated visual evidence and decides if itās enough to switch to text research.
- Why: Prevents stopping too soon (missing key facts) or going too long (wasting steps).
- Example: After several rounds, the judge says "enough" once a correct identity and affiliation are confirmed.
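A sketch of how such a sufficiency check might look. The `judge_llm` stub and the prompt wording are illustrative assumptions, not the paper's actual judge prompt.

```python
# Sketch of the visual-sufficiency check: a judge model reads the accumulated
# evidence and answers yes/no.  `judge_llm` and the prompt text are illustrative.

def judge_llm(prompt):
    # Stand-in for a real judge model call.
    return "yes" if "confirmed" in prompt.lower() else "no"

def enough_visual_evidence(question, evidence_summaries):
    prompt = (
        "Question: " + question + "\n"
        "Visual evidence so far:\n- " + "\n- ".join(evidence_summaries) + "\n"
        "Is this enough to identify the key entities and switch to text research? (yes/no)"
    )
    return judge_llm(prompt).strip().lower().startswith("yes")

evidence = ["Speaker confirmed as Dr. Caroline Robertson (Dartmouth faculty page)."]
print(enough_visual_evidence("What is the lecture date?", evidence))  # True
```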
- Text Bridging
- What: Convert the image into a detailed text description D; keep the chain of reasoning, tool calls, and observations; then let a strong text-only deep-research model continue.
- Why: Text research LLMs excel at long ReAct sequences; bridging lets them inherit the visual context and push further.
- Example: D: "A speaker at a Dartmouth podium labeled 'Department of Psychological and Brain Sciences.'"
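Here is a minimal sketch of the hand-off: the image description and the visual-phase findings are folded into a purely textual prompt for the text-only research model. `describe_image` and the prompt layout are illustrative assumptions.

```python
# Sketch of text bridging: turn the image into a description, then pack the
# description, the question, and the visual-phase findings into one text prompt
# for the text-only deep-research model.  The layout here is illustrative.

def describe_image(image):
    return ("A speaker at a Dartmouth podium labeled "
            "'Department of Psychological and Brain Sciences'.")

def build_bridged_prompt(image, question, visual_history):
    lines = ["Image description: " + describe_image(image),
             "Question: " + question,
             "Findings from the visual-search phase:"]
    for turn in visual_history:
        lines.append(f"- cropped {turn['crop']} -> {turn['summary']}")
    lines.append("Continue researching with web search until the answer is verified.")
    return "\n".join(lines)

history = [{"crop": "speaker's face",
            "summary": "matched the faculty page of Dr. Caroline Robertson"}]
print(build_bridged_prompt(None, "What is the lecture date?", history))
```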
- Text-Based Deep Research
- What: The text model runs web search, visits pages, summarizes, and can use Python for small computations.
- Why: Many facts (dates, names, places) live in text; careful multi-hop browsing is often required.
- Example: It finds a Dartmouth events page listing "Neuroscience lecture by Dr. Caroline Robertson – 2018-10-18."
- Merge and Verify the Full Trajectory
- What: Combine the visual and text phases into one coherent trace: plans, actions, observations, and final answer.
- Why: This complete example becomes training data, showing not just the answer but how to find it.
- Example: The merged trace spans ~25+ turns with multiple crops and searches, ending with the correct date.
Data generation "secret sauce" components:
🥪 Fuzzy Multi-hop VQA Synthesis
- What: Build tougher, realistic questions by starting from verified entity crops and then obfuscating entities and answers.
- How: 1) Generate simple entity Q&A; 2) Obfuscate by chaining relations (answer obfuscation) and swapping related entities via random web walks (entity obfuscation); 3) Use judges to keep solvable, non-shortcut questions.
- Why: Prevents trivial or templated patterns and forces true multi-hop reasoning. Example: "What's the cat's name?" evolves to "What's the name of the teacher of the cat owner's daughter?" while still grounded by the image's specific cat.
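The two obfuscation moves can be sketched as follows. The relation chain and the lookup table are toy stand-ins; the real pipeline uses LLMs, live web pages, and judge models to keep only solvable, non-shortcut questions.

```python
# Sketch of the two obfuscation moves used to harden a seed VQA question.
# The relation chain and the lookup table are toy stand-ins for LLM- and
# web-driven generation plus judge filtering.
import random

def chain_relations(entity_phrase, relations):
    # ["daughter", "teacher"] around "the cat's owner"
    # -> "the teacher of the daughter of the cat's owner"
    for relation in relations:
        entity_phrase = f"the {relation} of {entity_phrase}"
    return entity_phrase

def answer_obfuscation(target_phrase, relations):
    """Wrap the original target in extra relation hops (harder final answer)."""
    return f"What is the name of {chain_relations(target_phrase, relations)}?"

def entity_obfuscation(entity, related_entities, hops=1):
    """Swap a named entity for one reached by a short random 'web walk'."""
    for _ in range(hops):
        entity = random.choice(related_entities.get(entity, [entity]))
    return entity

print(answer_obfuscation("the cat's owner", ["daughter", "teacher"]))
# -> What is the name of the teacher of the daughter of the cat's owner?

walk = {"Dartmouth College": ["an Ivy League school"]}
print(entity_obfuscation("Dartmouth College", walk))
# -> "an Ivy League school": the question now names the league, not the school
```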
🥪 Cold-Start Supervision
- What: 30K curated trajectories (16K verified VQA with multimodal traces, 8K text-only QA, 6K fuzzy multi-hop VQA), trained with next-token prediction on thoughts, tool calls, and answers.
- How: Mix data so the model learns to integrate visual and textual evidence and plan long sequences.
- Why: Gives the model a solid playbook before RL. Example: Trajectories show when to crop more, when to switch to text search, and when to stop.
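One plausible way to implement the supervised objective is next-token prediction over the full trajectory while masking out tool observations, so the loss only covers the model-generated thoughts, tool calls, and answers. This masking convention is a common pattern for agent SFT and is an assumption here, not a detail quoted from the paper.

```python
# Sketch of cold-start SFT with loss masking: compute next-token loss only on
# model-generated spans (thoughts, tool calls, answer) and ignore tool
# observations.  The masking convention is an assumption, not the paper's code.
import torch
import torch.nn.functional as F

def sft_loss(logits, token_ids, is_model_span):
    """
    logits:        (seq_len, vocab) model outputs for the whole trajectory
    token_ids:     (seq_len,) target tokens
    is_model_span: (seq_len,) True where the policy produced the token
    """
    labels = token_ids.clone()
    labels[~is_model_span] = -100                 # ignored by cross_entropy
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

# Toy shapes: an 8-token trajectory over a 50-token vocabulary
logits = torch.randn(8, 50)
tokens = torch.randint(0, 50, (8,))
mask = torch.tensor([1, 1, 1, 0, 0, 1, 1, 1], dtype=torch.bool)  # 0s = tool observation
print(sft_loss(logits, tokens, mask).item())
```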
🥪 Reinforcement Learning with GRPO and LLM-as-Judge
- What: Practice on 15K VQA instances in the live web; reward = 1 if correct, 0 otherwise; optimize with Group Relative Policy Optimization (GRPO).
- How: Sample long rollouts (up to ~50 turns), evaluate final answers with a judge model, update policy with leave-one-out baselines.
- Why: Online interaction teaches the model to make better decisions and avoid traps not seen in offline data. Example: The model learns that "searching the speaker's lab page" works better than "generic university search" for event dates.
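A minimal sketch of the group-relative credit assignment with a leave-one-out baseline and 0/1 rewards. Normalization and clipping details vary between GRPO implementations; this only illustrates how correct and incorrect rollouts in the same group get opposite-signed advantages.

```python
# Sketch of group-relative advantages with a leave-one-out baseline and binary
# rewards.  Details (normalization, clipping) vary between GRPO implementations.

def leave_one_out_advantages(rewards):
    """Baseline for each rollout = mean reward of the other rollouts in its group."""
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# One group: 4 rollouts for the same question, scored 1/0 by the LLM-as-Judge
rewards = [1.0, 0.0, 0.0, 1.0]
print(leave_one_out_advantages(rewards))
# -> roughly [0.67, -0.67, -0.67, 0.67]: correct rollouts pushed up, wrong ones down
```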
🥪 Asynchronous Rollout System
- What: A multi-threaded, queued scheduler that runs many tool calls in parallel and returns results asynchronously.
- How: Batch crop searches together; handle slow pages without blocking others; detect loops and format errors to stop bad runs.
- Why: Long, tool-heavy RL can be painfully slow; async rollouts increase throughput >10× and stabilize training. Example: Ten crop searches fire in parallel; valid ones return quickly while broken ones get cut off and don't stall the batch.
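The paper describes a multi-threaded, queued scheduler; the asyncio toy below only illustrates the non-blocking idea: many tool calls fire concurrently, each with a timeout, so a slow or broken call is dropped without blocking the rest. `fake_search` and the timeout value are illustrative.

```python
# Toy sketch of asynchronous tool rollouts: fire tool calls concurrently with a
# per-call timeout so slow or broken calls are cut off without stalling the batch.
# `fake_search` and the timeout are illustrative stand-ins.
import asyncio

async def fake_search(query, delay):
    await asyncio.sleep(delay)                    # stand-in for a real HTTP request
    return f"results for {query!r}"

async def run_batch(queries, delays, timeout=1.0):
    async def one(query, delay):
        try:
            return await asyncio.wait_for(fake_search(query, delay), timeout)
        except asyncio.TimeoutError:
            return None                           # slow/broken call is dropped, not blocking
    results = await asyncio.gather(*(one(q, d) for q, d in zip(queries, delays)))
    return [r for r in results if r is not None]

# Three crop searches in parallel; the 5-second one times out and is skipped.
print(asyncio.run(run_batch(["crop A", "crop B", "crop C"], [0.1, 0.2, 5.0])))
```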
Safeguards and stability tricks:
- Trajectory interruption: detect long repetitive text or repeated tool-call failures and stop early.
- Masking bad trajectories: exclude severely broken runs from gradient updates to avoid overpowering negative signals.
- Numerical stability: train in BF16 for long contexts to avoid FP16 overflow issues.
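Two of these safeguards are easy to sketch: a repetition check that interrupts looping rollouts, and a filter that keeps badly broken trajectories out of the gradient update. The thresholds and trajectory format are illustrative assumptions.

```python
# Sketch of two training safeguards: interrupt a rollout that keeps repeating
# itself, and drop severely broken rollouts from the gradient update.
# Thresholds and the trajectory format are illustrative, not the paper's values.

def is_stuck(turn_outputs, window=3):
    """Interrupt if the last few turns are literally identical (a loop)."""
    if len(turn_outputs) < window:
        return False
    return len(set(turn_outputs[-window:])) == 1

def keep_for_update(trajectory, max_tool_failures=5):
    """Mask out trajectories dominated by tool errors or cut off mid-loop."""
    failures = sum(1 for turn in trajectory if turn.get("error"))
    return failures <= max_tool_failures and not trajectory[-1].get("interrupted")

print(is_stuck(["search X", "search X", "search X"]))        # True -> stop early
print(keep_for_update([{"error": False}, {"error": True}]))  # True -> still usable
```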
Output: A multimodal agent that can run long ReAct loops, crop images at multiple scales, search broadly, summarize, bridge to text, and deliver verified answers with higher accuracy in noisy conditions.
04 Experiments & Results
The test: Does multi-scale, long-horizon multimodal research actually beat simpler methods on real, messy tasks? The team evaluates on six tough benchmarks: VDR-Bench, FVQA, MMSearch-Plus, MMSearch, LiveVQA, and BrowseComp-VL (BC-VL). They compare three paradigms: (1) direct answer (no tools), (2) ReAct-style agent workflows with a shared toolset, and (3) specialized multimodal deep-research agents (including theirs).
The competition: Strong closed-source models (GPT-5, Gemini-2.5 Pro/Flash, Claude-4/3.7-Sonnet) and open models/agents (Qwen3-VL family, WebWatcher, MMSearch-R1). Everyone uses a fair, unified tool environment when applicable.
The scoreboard with context:
- Direct answering (no tools) is weak on open-domain, fact-heavy multimodal questions. For example, a capable baseline like Qwen3-VL-30B-A3B-Thinking averages ~24%. That's like taking a quiz without being allowed to look up references.
- ReAct-style agent workflows help a lot; Gemini-2.5 Pro averages ~50.7%. That's a big jump, like moving from a C to a solid B by actually consulting sources.
- Vision-DeepResearch sets new marks among open agents and challenges closed-source agent workflows in the same setup. The 8B model improves over the comparable agentic backbone by +10.4 average points, with huge gains on MMSearch (+17.6) and LiveVQA (+13.7). The 30B-A3B version reaches ~56.9 average (about an A-), with consistent jumps on VDR (+17.6), FVQA (+16.5), and MMSearch-Plus (+18.5). That means the model not only looks carefully at images but also tracks down long-tail facts reliably.
Ablations that make numbers meaningful:
- Retrieval strategies (Table 2):
  - Direct Answer: ~12.0% avg. Very low, proving external evidence is essential.
  - Whole-Image Search (WIS): ~16.0% avg. Small gains; still distracted by clutter.
  - WIS + Text Search: ~29.3% avg. Combining visual anchors with textual proof; a big leap.
  - Cropped-Image Search (CIS): ~23.0% avg. Multi-scale cropping raises hit-rate, especially on VDR (4.8 → 15.4), but stalls without text.
  - CIS + Text Search: ~40.0% avg. Best balance: cropping finds the right visual entities, and text fills in long-tail knowledge. Like using both a microscope and an encyclopedia.
- Training data and methods (Table 3):
  - Base model (no deep-research SFT): ~24.3% avg; missing long-horizon habits.
  - + Verified VQA trajectories (SFT): boosts MMSearch-Plus and BC-VL strongly; the model starts to use tools sensibly.
  - + Text-only QA trajectories (SFT): similar boosts, confirming that text bridging transfers know-how from text-only deep research to multimodal tasks.
  - + Fuzzy multi-hop VQA (SFT): further gains, showing the value of hard, realistic, non-shortcut data.
  - + RL on top: best results (e.g., VDR 37.8%, MMS+ 28.5%, BC-VL 53.7%), meaning live practice with rewards sharpens decisions.
Surprising findings:
- Just cropping isn't enough; you need both precise visual anchors and broad text search to excel across tasks.
- The text-bridging trick works: long-horizon behaviors from text-only research models transfer meaningfully to the image+text setting.
- RL didn't just make trajectories longer; over time, the model learned to get higher rewards with shorter, smarter paths: evidence of improved planning and stopping.
Takeaway: In head-to-head comparisons under the same agentic setup, Vision-DeepResearch's strategy of multi-scale crops + long-horizon ReAct + text bridging + SFT+RL delivers consistent, sizable gains. It's not one magic ingredient; it's the full recipe working together.
05 Discussion & Limitations
Limitations (honest and specific):
- Web dependence: The agent needs stable access to image/text search and page content. If sites block bots, change layouts, or throttle access, performance can drop.
- Compute and latency: Long-horizon searches (dozens of steps, many crops) cost time and money. Async rollouts help during training, but real-time use on slow networks can still feel heavy.
- Visual ambiguity: In crowded scenes with very similar entities (e.g., two people with near-identical badges), even multi-scale crops can retrieve mixed or misleading matches.
- Judge sensitivity: The LLM-as-Judge can occasionally pass or fail borderline answers, which affects both training rewards and evaluation.
- Domain shifts: If tasks move far beyond the training distribution (e.g., rare domains with unusual imagery), hit-rates and reasoning chains may weaken until further fine-tuning.
Required resources:
- A capable base MLLM (e.g., Qwen3-VL scale) and access to robust text-only deep-research LLMs for bridging and data generation.
- Tooling: image search, web search, webpage visit/summarize, optional Python execution, and an LLM judge.
- Infrastructure for long contexts (up to ~64K tokens), multi-threaded async execution, and storage for large trajectories.
When NOT to use it:
- Simple, closed-book questions where the answer is obvious from the image alone; the overhead of long research isn't worth it.
- Strictly offline settings (no internet) where trusted retrieval is impossible.
- Real-time or on-device scenarios with tight latency/power budgets; the multi-turn web workflow may be too slow or costly.
Open questions:
- Robustness under adversarial web noise: Can we detect and discount misleading look-alike images or SEO spam more reliably?
- Better stopping criteria: Can we learn finer-grained confidence to stop earlier when enough evidence is present, without missing edge cases?
- Richer rewards: Beyond 0/1 correctness, could source quality, citation completeness, or evidence diversity become part of the training signal?
- Generalization to video and GUIs: How well do multi-scale cropping and long-horizon research extend to moving scenes or interactive interfaces?
- Self-correction: Can the agent autonomously flag contradictions and run mini-audits of its own evidence chains before answering?
Bottom line: Vision-DeepResearch is a big step toward reliable, source-backed multimodal answers in the wild, but it needs the web, compute, and good data. With smarter stop rules, richer rewards, and broader domains, it could become an even stronger everyday research partner.
06 Conclusion & Future Work
Three-sentence summary: Vision-DeepResearch turns a multimodal model into a persistent researcher that crops images at multiple scales, runs many visual and text searches, and connects evidence over long reasoning chains. It learns these habits through a carefully built data pipeline (verified and obfuscated questions, full trajectories) plus supervised fine-tuning and reinforcement learning with an asynchronous rollout system. The result is state-of-the-art performance on six factual multimodal benchmarks, surpassing prior open agents and rivaling strong closed-source workflows under the same setup.
Main achievement: Showing that multi-scale visual retrieval, long-horizon ReAct, and text bridging, trained via high-quality trajectories and RL, dramatically raise hit-rates and accuracy in noisy, real-world multimodal research.
Future directions: Add richer rewards (evidence quality and citations), improve stopping confidence, extend to video and GUI tasks, and scale RL further with cheaper, more parallel rollouts. Also, explore stronger safety checks for misleading web content and better de-duplication of near-identical entities.
Why remember this: It's a recipe for building AI that doesn't just guess; it investigates. By combining careful visual cropping, broad text search, and many thoughtful steps, the system finds and verifies facts like a patient human researcher, making AI answers more trustworthy and useful in everyday, messy reality.
Practical Applications
- Photo fact-checking for newsrooms: verify who is in a picture and when/where an event happened, with links.
- Shopping assistants: match product photos via multi-scale crops and confirm specs from official pages.
- Academic helpers: identify speakers, venues, and dates from event photos and retrieve citations.
- Customer support: recognize device models from images (logos, ports, labels) and fetch the right manuals.
- Real estate or travel search: match landmarks or building features from photos to authoritative sources.
- Compliance and audits: confirm brand use (logos, uniforms) and link to documented approvals or policies.
- Education: build homework helpers that cite sources for image-based questions (e.g., museum exhibits).
- Digital asset management: tag and de-duplicate images by reliably identifying entities at multiple scales.
- Insurance claims: verify vehicle make/model and incident details via multi-scale crops and official records.
- Scientific curation: identify lab equipment or specimen labels from images and link to reference databases.