InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
Key Summary
- This paper builds a tough new test called O3-BENCH to check if AI can truly think with images, not just spot objects.
- It splits the job into two helpers: a thinker (vReasoner) and a finder (vSearcher), so the system can solve tricky, real-world picture problems.
- The finder learns a new skill called generalized visual search: locating fuzzy, relational, or conceptual regions described in natural language.
- They train the finder with a hybrid reinforcement learning recipe that mixes on-the-fly tasks with ground-truth guidance for efficient and stable learning.
- When plugged into strong models like GPT-5-mini or Gemini-2.5-Flash, the trained finder boosts scores across many benchmarks.
- On O3-BENCH, GPT-5-mini jumps from 39.0% to 61.5% accuracy when teamed with the trained finder, a big step forward.
- O3-BENCH focuses on high-resolution, information-dense images such as composite charts and detailed maps that demand multi-hop reasoning.
- The system shows that a divide-and-conquer multi-agent design can bring open multimodal systems closer to frontier o3-style capabilities.
- Results show the approach generalizes across different reasoner families and resolutions, though performance still depends on good region descriptions.
- This work makes image thinking more reliable for tasks like reading reports, navigating maps, and cross-checking scattered visual details.
Why This Research Matters
Many real-world tasks involve big, busy images where clues are scattered, like reading report tables, navigating venue maps, and checking legends. This work makes AI better at those tasks by teaching it to find exactly what matters, even when directions are fuzzy, and to reason step by step. The multi-agent design means improvements can be shared widely: one strong searcher can help many different reasoners. Tough tests like O3-BENCH reveal true progress on skills people actually need from visual assistants. As the approach matures, it can power reliable help in education, travel, accessibility, and more. It's a practical path toward AI that doesn't just see but genuinely understands what it's looking at.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you and a friend trying to solve a puzzle map to find a treasure. One of you reads clues and plans the route, while the other zooms in to spot tiny symbols and street names. You can only win if both jobs are done well and you cooperate.
🥬 The Concept (Thinking with images): Thinking with images means using pictures (like charts, diagrams, and maps) not just to see objects, but to reason step by step with the details inside them. How it works: 1) Look over the picture to decide what matters; 2) Zoom in to pull out small but important parts; 3) Combine pieces from different spots; 4) Do math or logic; 5) Decide the answer. Why it matters: Without this, AI can miss key clues in busy images and fail at real tasks like reading a financial report or navigating a theme park map.
🍞 Anchor: If you ask, "Which ride near Draken Snack can a 128 cm child ride alone?" the AI must find the snack icon, look nearby, read the ride names and height rules, compare them, and choose correctly.
🍞 Hook: You know how some tests are too easy and only ask for the most obvious thing, like "What color is the car?" But real life is messier: lots of small labels, legends, and rules hidden in different corners of a page.
🥬 The Concept (O3-BENCH): O3-BENCH is a new, hard test for AI that checks both careful looking and clever reasoning on high-resolution, information-dense images. How it works: 1) Use big, cluttered charts and maps; 2) Ask questions that need multiple steps; 3) Force the AI to gather clues from separate regions (like legends and map areas); 4) Make answers multiple-choice with realistic distractors (including a "No Right Choice" option). Why it matters: Without a tough test like this, models can look good on easy problems but fail on real tasks that need cross-region, multi-step thinking.
🍞 Anchor: For example, O3-BENCH may ask you to find a meal stand on a map, read nearby attraction rules from a legend, compare heights, and then pick the only ride a 128 cm kid can take alone.
🍞 Hook: Think of two teammates playing a game: one plans, the other searches. If one tries to do both perfectly alone, they might get overwhelmed and make mistakes.
🥬 The Concept (Divide-and-conquer, multi-agent): Divide-and-conquer splits a hard job into parts and assigns each to a specialist. How it works: 1) A reasoning specialist (vReasoner) plans and decides what to look for; 2) A searching specialist (vSearcher) locates exactly the right image regions; 3) They talk back and forth until the answer is found. Why it matters: Without splitting roles, a single model often loses track in busy images and can't juggle searching and long reasoning well.
🍞 Anchor: The reasoner says, "Find the legend for Draken Valley's height rules." The searcher zooms, crops, and returns the exact region so the reasoner can continue.
🍞 Hook: When you tell a friend, "Find the area to the left of the wooden chair," they don't look for a single object name; they understand a fuzzy description of a place.
🥬 The Concept (Generalized visual search): Generalized visual search means finding regions of interest described in free-form language, including relational or conceptual hints (not just object names). How it works: 1) Read a natural-language description; 2) Understand relationships like "left of" or "near"; 3) Infer concepts like "the chart showing revenue by year"; 4) Propose and verify the best-fitting region. Why it matters: Without it, AI is stuck looking only for named objects, missing many real information targets like legends, titles, or cross-chart sections.
🍞 Anchor: "The small box that lists attraction height rules" is not an object class; it's a conceptual region the AI must find among many boxes.
🍞 Hook: Picture a coach who rewards good moves in practice so players repeat them more.
🥬 The Concept (Reinforcement learning): Reinforcement learning (RL) teaches an AI to make better decisions through rewards for good actions. How it works: 1) The AI tries a sequence of steps; 2) It gets a score when it helps solve the task or matches ground truth; 3) Over time, it learns to make better choices to earn higher scores. Why it matters: Without RL, the finder won't reliably learn to home in on the right region from fuzzy instructions.
🍞 Anchor: If the AI's crop overlaps well with the true answer box, it gets a higher reward and learns that kind of search is good.
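The "overlaps well" score is typically measured as Intersection-over-Union (IoU): the area shared by the predicted and true boxes divided by the area they cover together. A minimal sketch of that computation (illustrative; the paper's exact reward shaping may differ):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A perfect crop scores 1.0, a disjoint crop 0.0, and a half-overlapping crop somewhere in between, which gives RL the graded feedback it needs.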
🍞 Hook: Ever solve a big math word problem by breaking it into smaller steps? That's what great problem solvers do.
🥬 The Concept (Multi-step/interleaved reasoning): Multi-step (interleaved) reasoning means repeatedly switching between planning, looking, and calculating. How it works: 1) Break the question into sub-goals; 2) Search the image for each clue; 3) Update the plan; 4) Combine clues to answer. Why it matters: Without interleaving, the AI either overthinks without seeing details or stares at details without a plan.
🍞 Anchor: "Find the meals legend → locate Draken Snack on map → read nearby rides → compare height rules → answer."
The world before: Many multimodal models were good at basic perception (spotting objects or reading single labels) but struggled with big, busy images that needed multiple lookups and reasoning across far-apart regions. People tried single-agent systems that handled everything in one place. These worked okay for simple tasks but broke down on dense charts or maps where you must bounce between legends, indexes, and local details.
Failed attempts: Tool-using pipelines with fixed detectors could find one object but didn't handle fuzzy, relational instructions or multi-hop paths well. End-to-end search methods often focused on one region in natural images and didn't generalize to arbitrary documents, posters, and maps. Many benchmarks didn't truly measure the "think with images" skill.
The gap: There was no tough, reasoning-first benchmark for high-density visual tasks, and no well-trained, cooperative "finder" specialized in generalized visual search that could plug into various "thinkers."
Real stakes: This matters for daily life: navigating venues with complex maps, reading reports with many charts and tables, planning routes with constraints (like height limits or opening hours), and verifying cross-referenced information. Better image thinking makes AI assistants more accurate and trustworthy in the real world.
02 Core Idea
🍞 Hook: Imagine a school project where one teammate is great at planning the report and another is amazing at digging up the exact quotes and figures from a giant document. Together, they crush it.
🥬 The Concept (The Aha!): The key insight is to split "thinking with images" into two cooperating agents, one that reasons (vReasoner) and one that searches (vSearcher), and to train the searcher with RL so it can find fuzzy, conceptual regions described in plain language. How it works: 1) The reasoner plans and asks for specific regions; 2) The searcher localizes those regions (even if described fuzzily) using a crop tool; 3) The reasoner uses the returned close-ups to continue multi-step reasoning; 4) RL teaches the searcher to pick better boxes over time. Why it matters: A single model juggling all of this tends to miss details or over-summarize; specialization plus RL makes the whole system more reliable and flexible.
🍞 Anchor: To answer, "What ride near Draken Snack can a 128 cm kid ride alone?", the reasoner asks for the meals legend, then the map area near icon #2, then the attraction legend with height rules; the searcher brings back each exact region; the reasoner compares and answers "Valkyrie."
Three analogies:
- Librarian + Researcher: The librarian (searcher) fetches the right shelves and pages; the researcher (reasoner) reads and concludes. Before, the researcher had to run around the whole library, slow and error-prone. After, the duo works efficiently.
- Photographer + Editor: The photographer (searcher) frames the perfect close-ups; the editor (reasoner) assembles the story. Without crisp crops, the editor can't tell a compelling, accurate narrative.
- Scout + Strategist: The scout (searcher) pinpoints critical terrain features; the strategist (reasoner) plans the route. Mixing both roles in one person gets chaotic; splitting them makes victories more likely.
Before vs. After:
- Before: Single-agent systems often fail on dense charts/maps that need multiple hops across different regions.
- After: A trained vSearcher lets many different reasoners gather the exact visual evidence they need, turning messy multi-region tasks into solvable steps.
Why it works (intuition):
- Cognitive load sharing: The reasoner invests effort in planning and logic, while the searcher invests effort in localization, so neither is overloaded.
- Language-to-vision grounding: The searcher is taught to map fuzzy text descriptions (like "the table showing Year 10 fees") to precise boxes, which general detectors don't cover.
- Feedback loops: The reasoner's feedback and final answer correctness help the searcher learn what's genuinely useful.
- Two kinds of supervision: On-the-fly tasks mimic real usage (good for alignment), while ground-truth box supervision (IoU) sharpens accuracy.
🍞 Anchor: When a chart question requires reading units from one legend, numbers from a second table, and a date range from a title, the reasoner decomposes these sub-asks and the searcher returns each region perfectly cropped, making the math straightforward.
Building blocks (explained with sandwiches):
- 🍞 Hook: You know how you sometimes say, "Show me the part to the left of the graph title"? That's not an object; it's a relation. 🥬 The Concept (Generalized visual search): Find regions from natural, fuzzy descriptions. How: parse the text, reason about relations ("left of," "near"), infer concepts (legends/tables), propose a box, optionally crop to verify. Why: Real tasks often describe areas, not just objects. 🍞 Anchor: "The chart comparing revenue over the last decade" → Find the multi-line chart with a time axis and the right title.
- 🍞 Hook: Coaches reward good plays so teams repeat them. 🥬 The Concept (Hybrid RL): Train the searcher with two kinds of practice: (a) on-the-fly tasks scored by usefulness to the reasoner, and (b) ground-truth boxes scored by overlap (IoU). Why: Mixing realistic tasks with precise supervision stabilizes and speeds learning. 🍞 Anchor: If a proposed crop actually helps answer correctly, that behavior gets reinforced.
- 🍞 Hook: A clean close-up beats a blurry wide shot when you need to read tiny labels. 🥬 The Concept (Crop tool use): The searcher can request crops to return just the needed region. Why: Without cropping, the reasoner might miss small fonts or cluttered details. 🍞 Anchor: Cropping the attraction legend makes the height rules easy to read.
- 🍞 Hook: Solving a maze means looking ahead, peeking at new paths, and adjusting your plan. 🥬 The Concept (Interleaved reasoning): Alternate between planning and searching. Why: Without interleaving, you either stare at the wrong place or plan without evidence. 🍞 Anchor: "Find the meals key → locate #2 Draken Snack → find closest ride → check height rule → answer."
- 🍞 Hook: A universal plug turns many devices into one power strip. 🥬 The Concept (Plug-and-play searcher): The trained vSearcher can be attached to many different reasoners (GPT-5, Gemini, etc.) and still help. Why: This spreads benefits widely without retraining every model. 🍞 Anchor: GPT-5-mini plus the searcher jumps from 39.0% to 61.5% on O3-BENCH; Gemini-2.5-Flash also gains on multiple tests.
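Conceptually, the crop tool in the building blocks above can be as simple as slicing a pixel grid by a bounding box. A toy sketch with a nested-list "image" (a hypothetical stand-in, not the system's actual tool):

```python
def crop(image, box):
    """Return the close-up inside box = (x1, y1, x2, y2).
    `image` is a row-major grid indexed as image[y][x]."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

# A tiny 4x4 "image" whose pixels encode their own (x, y) coordinates.
img = [[(x, y) for x in range(4)] for y in range(4)]
patch = crop(img, (1, 1, 3, 3))  # 2x2 close-up of the center
```

The same idea, applied to a high-resolution map, is what turns unreadable tiny text into a legible close-up for the reasoner.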
03 Methodology
High-level pipeline: Input (a high-resolution image and a question) → Step A: vReasoner plans and issues region descriptions → Step B: vSearcher localizes and returns crops → Step C: vReasoner integrates evidence and repeats as needed → Output: final answer with reasoning.
Step-by-step, like a recipe:
- Read the question and make a plan (vReasoner).
- What happens: The reasoner breaks the problem into smaller sub-asks, such as "Find the meals legend," "Locate Draken Snack," "Check nearby ride's height rule."
- Why this step exists: Without decomposition, the system might try to answer from a single glance and miss crucial pieces.
- Example: For the Draken Valley map, the plan first needs the meals list to map names to icons and numbers.
- Describe a target region in plain language (vReasoner → vSearcher).
- What happens: The reasoner sends a free-form description like "left-side legend listing Draken Valley attractions and their height/guardian rules."
- Why: Real images don't come with object tags; free-form descriptions match how humans point to regions ("the box under the title," "the chart on the right").
- Example: "Zoom the lower-right Draken Valley around meal marker #2 'Draken Snack' and show nearby attractions."
- Localize and crop the region (vSearcher).
- What happens: The searcher translates the description into a bounding box, and may call the crop tool to return a precise close-up.
- Why: Without accurate crops, the reasoner may read the wrong legend or miss tiny text.
- Example: The searcher returns the legend panel with height rules that includes Valkyrie and Klake.
- Read and reason with the evidence (vReasoner).
- What happens: The reasoner extracts needed details (names, numbers, symbols), compares options, and possibly asks for the next region.
- Why: Complex tasks need multiple hops and cross-checks (e.g., legend → map → index), not just one lookup.
- Example: After seeing the legend (Valkyrie 120 cm+), it requests the map region near Draken Snack to check proximity.
- Repeat the loop as needed (interleaving).
- What happens: The duo alternates description → localization → reading until the plan's sub-asks are satisfied.
- Why: New evidence may change what to look for next.
- Example: If Klake is nearby but 135 cm+, the reasoner continues to verify that Valkyrie (120 cm+) is the closest allowed ride.
- Produce the final answer with justification (vReasoner).
- What happens: The reasoner states the choice and the reasoning chain.
- Why: Explanations help verify correctness and improve reliability.
- Example: "Valkyrie (E) because its height rule allows 128 cm alone, and it is nearest to Draken Snack."
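The recipe above can be sketched as a short driver loop. The `Reasoner` and `Searcher` classes here are toy stubs that replay the Draken Snack walkthrough; they stand in for the real models, and the message format is invented for illustration:

```python
def run_loop(question, image, reasoner, searcher, max_steps=8):
    """Interleave planning (reasoner) with localization (searcher)."""
    evidence = []
    for _ in range(max_steps):
        move = reasoner.step(question, evidence)      # plan the next sub-ask
        if move["type"] == "answer":
            return move["text"]                       # done: final answer
        crop = searcher.locate(image, move["query"])  # fuzzy text -> close-up
        evidence.append(crop)                         # feed the crop back in
    return None  # search budget exhausted

class Reasoner:  # toy stub: fixed two-step plan, then answer
    plan = ["meals legend", "map area near Draken Snack"]
    def step(self, question, evidence):
        if len(evidence) < len(self.plan):
            return {"type": "search", "query": self.plan[len(evidence)]}
        return {"type": "answer", "text": "Valkyrie"}

class Searcher:  # toy stub: pretends to return the requested crop
    def locate(self, image, query):
        return f"crop of: {query}"

answer = run_loop("Which ride can a 128 cm kid take alone?", "map.png",
                  Reasoner(), Searcher())
```

The `max_steps` cap mirrors the practical need to bound how many search calls the duo may exchange before committing to an answer.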
The secret sauce: a hybrid RL-trained vSearcher that generalizes to fuzzy, conceptual search tasks and plugs into many reasoners.
Training the vSearcher (hybrid RL):
- Out-of-loop RL (ground-truth supervision):
- What: Pre-generate (image, region description, true box) triples using layout detection on InfographicVQA images, with high-level descriptions created by a small model.
- How: The searcher proposes a box; it is rewarded by overlap with the true box (high IoU → higher reward) and correct tool formatting.
- Why: Gives precise, efficient guidance to learn accurate localization.
- Example: "The right table showing Year 10 fees" → reward scales with how well the predicted box overlaps the labeled table.
- In-loop RL (realistic cooperation):
- What: During training, a strong reasoner (e.g., GPT-5-mini) generates region descriptions on the fly while solving collage-style hard problems.
- How: The searcher gets a pseudo reward when its crop is judged helpful by the reasoner and the final answer is correct. Rewards are globally normalized for stability.
- Why: Teaches the searcher to be useful in real multi-hop workflows, not just match static boxes.
- Example: If returning a legend crop helps the reasoner answer correctly, the searcher gets rewarded.
Data construction:
- Collages for in-loop RL: Multiple images are stitched into a big canvas so the target is tiny relative to the whole, forcing true search. Problems come with QA and target boxes filtered for difficulty.
- Infographic layouts for out-of-loop RL: A document layout detector proposes candidate regions; a small model writes human-like region descriptions; these triples supervise precise box learning.
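The collage construction boils down to coordinate bookkeeping: paste a source image at some offset on a large canvas, then shift and normalize its ground-truth box accordingly. A sketch of that arithmetic (hypothetical helper name; the authors' pipeline likely differs in detail):

```python
def place_box(box, offset, canvas_size):
    """Shift a ground-truth box by the paste offset, then normalize to [0, 1].
    box = (x1, y1, x2, y2) in source-image pixels; offset = (dx, dy)."""
    (x1, y1, x2, y2), (dx, dy), (w, h) = box, offset, canvas_size
    return ((x1 + dx) / w, (y1 + dy) / h, (x2 + dx) / w, (y2 + dy) / h)

# A 50x60 legend box pasted at (900, 400) on a 1000x500 collage canvas:
target = place_box((10, 20, 60, 80), (900, 400), (1000, 500))
# The target now covers a small fraction of the canvas, forcing real search.
```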
Reward shaping (intuitively):
- Format reward: Encourages correct tool use and output structure.
- IoU reward: Encourages accurate localization against ground truth in out-of-loop training.
- Pseudo-IoU (usefulness) reward: Encourages returning crops that actually help the reasoner succeed during in-loop training.
- KL regularization to a reference policy: Keeps learning stable and avoids drifting too far.
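Those terms combine into a single scalar per rollout. A toy sketch of the combination (the weights are made up for illustration, and in practice the KL term is usually folded into the RL objective rather than the per-step reward):

```python
def shaped_reward(format_ok, iou, kl_to_ref, w_iou=1.0, w_kl=0.05):
    """Toy combination: format bonus + localization quality - KL penalty.
    `iou` stands for either true IoU (out-of-loop training) or the
    pseudo-IoU usefulness signal (in-loop). Coefficients are illustrative."""
    reward = 0.1 if format_ok else -0.1  # format reward for valid tool calls
    reward += w_iou * iou                # IoU or pseudo-IoU term
    reward -= w_kl * kl_to_ref           # stay near the reference policy
    return reward

good = shaped_reward(True, iou=0.8, kl_to_ref=0.2)   # valid call, tight box
bad = shaped_reward(False, iou=0.0, kl_to_ref=0.2)   # malformed, missed box
```

The point of the sketch is the ordering it induces: well-formatted, well-localized proposals earn clearly more than malformed misses, which is the gradient the searcher climbs.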
What breaks without each component:
- No planning (reasoner): The system guesses and often overlooks needed regions.
- No generalized search (searcher): It can't handle fuzzy descriptions like "the chart comparing years."
- No cropping: Tiny text remains unreadable; accuracy drops.
- No out-of-loop RL: Localization stays sloppy; slow learning.
- No in-loop RL: The searcher may be precise but not helpful in real, interleaved tasks.
Concrete walkthrough with data:
- Input: A 4K mall map with legends, symbols, and floor guides. Question: "Which floor has the Prayer room, and near which zone is it located?"
- Steps: Reasoner asks: "Show the legend symbol for Prayer room" → Searcher returns legend crop → Reasoner asks: "Show the floor guide area for where Prayer room appears" → Searcher returns 2F guide crop → Reasoner asks: "Show the 2F map area near H&M to the north" → Searcher returns crop → Reasoner answers: "2F, North zone."
04 Experiments & Results
The test: Do multi-agent systems with a trained vSearcher actually help different reasoners solve tough, high-resolution, multi-hop tasks? The authors measured accuracy on O3-BENCH (their new benchmark) and also on existing ones like V*-Bench, HR-Bench, Tree-Bench, VisualProbe-Hard, and MME-RealWorld (lite). They focused on how much a reasoner improves when given access to the trained vSearcher versus no searcher or an untrained searcher.
The competition: They compared open and proprietary systems, including GPT-4o, GPT-5-nano, GPT-5-mini, GPT-5, OpenAI o3, and Gemini-2.5-Flash. They also referenced recent research systems like DeepEyes, Pixel Reasoner, and Mini-o3 for context.
The scoreboard (with context):
- O3-BENCH difficulty: Even OpenAI o3 reached only 40.8% accuracy, showing that the benchmark is hard and truly multi-hop.
- GPT-5-mini + InSight-o3-vS (trained vSearcher): Jumps from 39.0% to 61.5% on O3-BENCH. That's like moving from a low C to a solid B+ on a tough exam most students find confusing.
- Gemini-2.5-Flash: Also gains across multiple benchmarks when paired with InSight-o3-vS (for example, up to roughly +7 to +12 points on O3-BENCH depending on settings), showing cross-family generalization.
- Plug-and-play: The same trained vSearcher helps different reasoners (GPT-5-nano, GPT-5-mini, Gemini-2.5-Flash), showing it is not overfitted to just one model.
- Versus untrained searcher: A reasoner with an untrained Qwen2.5-VL-7B as searcher gains little or even loses performance in some cases, highlighting the importance of the specialized RL training.
Why these numbers matter: On O3-BENCH, many questions require hopping between legends, indexes, and main panels. Big gains here mean the system is truly getting better at image thinking, not just memorizing object names.
Surprising and nuanced findings:
- Resolution matters, but less for a good searcher: Higher input resolution helps reasoners on their own. However, the trained vSearcher still brings meaningful gains even at lower resolutions and remains effective across a wide range of its own input sizes.
- Usefulness over frequency: As the vSearcher gets better, the reasoner often needs fewer calls. Early in training, calls go up (learning to format and cooperate), then go down as localization becomes more accurate.
- Task dependence: Some strong reasoners (like Gemini-2.5-Flash) already do fairly well on O3-BENCH solo, yet still benefit from the vSearcher. In a few narrow cases on other benchmarks, tool-calling habits of certain models limited gains.
Ablations (what changed what):
- Hybrid RL vs. single-RL: Combining in-loop (realistic cooperation) and out-of-loop (ground-truth IoU) training produced the best overall results. Dropping either component hurt performance.
- Reward design: Removing reasoner feedback or outcome supervision (final-answer correctness) reduced gains. Global normalization for rewards stabilized training.
- Cropping/tool formatting: Encouraging correct tool use was necessary; without it, crops were less reliable.
Big picture: The vSearcher substantially boosts performance where problems require interleaved search and reasoning across multiple regions and tiny details, exactly the situations where people want image-savvy assistants to shine.
05 Discussion & Limitations
Limitations:
- Dependent on good requests: The searcher follows the reasoner's descriptions. If the reasoner asks for an unhelpful or wrong region, the searcher can't fix the plan by itself.
- Coverage of concepts: While generalized search is broad, some highly unusual or domain-specific layouts may still confuse the searcher without fine-tuning.
- Tool reluctance in some backbones: Certain base models are hesitant to call tools or don't use them well, which can bottleneck the benefit.
- Remaining errors: Even strong pairs (e.g., GPT-5-mini + vSearcher) still make mistakes on very dense or tricky items, especially when legends, scales, and units combine in surprising ways.
Required resources:
- Computing for RL: Training the searcher with both in-loop and out-of-loop RL requires GPU time and careful orchestration with a strong reasoner.
- High-resolution handling: Systems should handle big images and multiple crops smoothly.
- Layout tools (for data building): Generating out-of-loop supervision relies on a layout detector and a small captioning model for region descriptions.
When NOT to use:
- Simple, single-region tasks: If the problem is a one-glance lookup (e.g., "What color is the car?"), the overhead of a multi-agent setup may not pay off.
- Extremely low-res inputs: If the image is too blurry for any crop to reveal legible text, search cannot rescue the task.
- Non-visual problems: If the challenge is mostly text or world knowledge, a visual searcher adds little.
Open questions:
- Joint training: What happens if we also train the reasoner end-to-end with the searcher so both improve together without instability?
- Richer tools: Beyond cropping, could the searcher learn to rotate, enhance, or segment regions, or chain multiple tools for trickier documents?
- Self-correction loops: Can the searcher proactively suggest better sub-asks when the reasoner's request seems off?
- Trust and transparency: How should the duo present evidence and uncertainty to users so choices are easy to verify?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces O3-BENCH, a hard benchmark that truly tests whether AI can think with images by requiring multi-hop, cross-region reasoning on high-resolution charts and maps. It proposes INSIGHT-O3, a two-agent framework where a reasoning agent (vReasoner) cooperates with a reinforcement-learned visual searcher (vSearcher) specialized in generalized visual search. The trained vSearcher is plug-and-play and significantly boosts multiple reasoners across benchmarks, including a large jump on O3-BENCH.
Main achievement: Showing that a hybrid RL-trained, generalized vSearcher, working in tandem with a separate reasoner, converts messy, multi-region visual tasks into reliable, step-by-step solutions and delivers strong, cross-model gains.
Future directions:
- End-to-end co-training of reasoner and searcher for tighter coordination.
- Expanding the searcher's toolset (e.g., rotate, enhance, segment) to handle more document quirks.
- Building richer datasets of interleaved search-and-reasoning traces to further improve generalization.
- Applying the approach to robotics and UI agents that must visually navigate complex environments.
Why remember this: It's a clear, practical step toward AI that doesn't just see pictures but actually thinks with them, like a careful student who reads the legend, checks the map, does the math, and explains the answer. By splitting the work and training the searcher well, the system becomes both more accurate and more useful in the kinds of visual tasks people face daily.
Practical Applications
- Interactive map help: Find facilities (restrooms, ATMs, info desks) and plan routes across large venue maps.
- Report reading: Pull legends, units, and numbers from different tables/charts to answer what-if questions.
- Education support: Guide students through multi-chart science problems by locating and comparing key panels.
- Travel planning: Cross-check attractions, nearby services, and restrictions (e.g., age/height) from theme-park maps.
- Business analytics: Scan dashboards to extract KPIs across multiple widgets and verify trends and time ranges.
- Accessibility tools: Read cluttered signs, legends, and indexes for users who need help with small text or complex layouts.
- Document QA: Answer questions about contracts or manuals by locating relevant clauses, tables, and footnotes.
- UI assistance: For screenshots, find buttons, menus, and status panels from fuzzy instructions to guide users.
- Robotics and AR: Identify conceptual regions (storage areas, hazard signs) from natural-language guidance.
- Customer support: Triage image-based tickets by locating serial numbers, labels, and error panels across device photos.