
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning

Intermediate
Jiawei Chen, Xintian Shen, Lihao Zheng et al. Ā· 12/29/2025
arXiv Ā· PDF

Key Summary

  • MindWatcher is a smart AI agent that can think step by step and decide when to use tools like web search, image zooming, and a code calculator to solve tough, multi-step problems.
  • It mixes its own thoughts with tool use in one smooth loop, so it can look, think, search, and look again until it finds a solid answer.
  • Instead of copying examples (which can lead to bad habits), it learns with reinforcement learning and a clever reward recipe that praises correct answers and clean formatting while penalizing tool-use mistakes.
  • A special training trick called step-wise normalization keeps learning fair for every reasoning step, not just the longest parts.
  • MindWatcher can reason with pictures by cropping, zooming, and searching a high-quality local image library so it isn’t stuck waiting on expensive external APIs.
  • They built a new test called MWE-Bench to fairly check tool-using skills across cars, animals, plants, people, landmarks, and sports.
  • The big 32B model beats or ties larger and newer models on tool-using tasks, and even the tiny 2B–4B distilled versions do great thanks to good tool habits.
  • Results reveal a 'genetic inheritance' effect: the base model’s strengths and limits still shape the agent, even after strong reinforcement learning.
  • Real-world performance depends a lot on the tool environment (for example, which search engine you use can swing scores a lot).
  • Code, data pipelines, tools, and several smaller distilled models are being open-sourced to help others build similar agents.

Why This Research Matters

MindWatcher shows how to turn a clever AI into a practical helper that can see, read, and act to get reliable answers. It reduces dependence on memorized facts by smartly using tools for fresh, long-tail, or visual details. A curated, local image library and disciplined rewards make training faster, cheaper, and more trustworthy. The same ideas help students research better, drivers understand car parts, and support teams solve customer issues from screenshots. With fairer evaluation and strong habits against hallucination, this approach moves AI closer to safe, real-world use. Smaller distilled versions mean more people and companies can deploy capable agents without giant hardware. In short, it’s a practical blueprint for building doers, not just talkers.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine you’re doing a school project with a photo, a map, and the internet. You peek at the photo, think a bit, search online, zoom into the photo, think some more, and then write your answer. That back-and-forth is how real problem solving works.

🄬 The Concept: Tool-Integrated Reasoning (TIR)

  • What it is: TIR is when an AI not only thinks but also decides when to use tools—like web search, image zoom, or a calculator—while thinking.
  • How it works: The AI thinks, chooses a tool, gets results, thinks again, and repeats until it’s ready to answer.
  • Why it matters: Without tools, an AI is stuck with only what it memorized, so it can’t handle fresh news, rare facts, or tiny image details. šŸž Anchor: When you ask, ā€œWho is the person on the right wearing jersey number 8, and what team did they just join?ā€, TIR lets the AI zoom into the image to read the jersey, search the web for sports news, and then answer correctly.

The World Before:

  • Big language models knew a lot but were trapped by ā€˜parametric knowledge’—what they learned during training. That means they missed long-tail facts, up-to-date events, and fine-grained details in images.
  • Agents often used hand-made workflows: If question type A, call tool X; if type B, call tool Y. This was rigid and broke easily in messy, real-life situations.
  • Multi-agent systems split thinking and acting into several models, which helped flexibility but added delays, cost, and complexity.

šŸž Hook: You know how sometimes you need to switch between different subjects to solve a tricky puzzle?

🄬 The Concept: Interleaved Thinking

  • What it is: A way for the AI to switch smoothly between thinking and tool use at any point.
  • How it works: The AI can pause its thoughts to call a tool, then use the tool’s response to guide its next thought.
  • Why it matters: Without it, the AI would think too long without checking facts or would overuse tools without planning. šŸž Anchor: Solving a plant ID question, the AI can think, crop the leaf region, think, search a plant database, and then finish the answer.

Problems Researchers Faced:

  • Most agents were text-only; few could truly use images during reasoning (not just image search, but also image manipulation like cropping or zooming).
  • Supervised fine-tuning (SFT) often made models imitate the look of tool-using chats instead of learning smart strategies. That caused tool overuse on easy questions and loops on hard ones.
  • Visual search APIs are expensive, slowing training and deployment.

šŸž Hook: Think of a comic book where words and pictures together make the story clear.

🄬 The Concept: Multimodal Chain-of-Thought (CoT)

  • What it is: The AI’s step-by-step reasoning includes both text and image operations.
  • How it works: It mixes ā€œI thinkā€¦ā€ with actions like crop, zoom, and object grounding so pictures actively guide the reasoning.
  • Why it matters: Without multimodal CoT, the AI might miss small visual clues that change the whole answer. šŸž Anchor: To find ā€œthe king’s tomb on the hill,ā€ the AI zooms into a palace photo, identifies the landmark, then searches text to confirm the king’s name.

The Gap MindWatcher Fills:

  • A single model plans and acts, can think with images, calls tools only when needed, and learns these habits using reinforcement learning instead of just copying examples.
  • It also uses a large local image retrieval database to reduce cost and improve reliability.

šŸž Hook: When you play a game, you pick moves based on the current situation.

🄬 The Concept: Markov Decision Process (MDP)

  • What it is: A way to model decisions step by step, where each action changes what you know next.
  • How it works: State (what’s known) → choose action (think or tool) → observe result → repeat until done.
  • Why it matters: Without an MDP view, it’s hard to train the agent to make good choices at every step. šŸž Anchor: The agent’s state includes the question, the image, and past tool results; each new tool call updates the state.
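
To make the MDP picture concrete, here is a minimal sketch (not the paper's actual code) of how the agent's state could be stored and updated after each think or tool step; the field names are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Everything the agent knows at the current step of the MDP (hypothetical layout)."""
    question: str
    images: list                                   # original image plus any crops/zooms so far
    history: list = field(default_factory=list)    # interleaved thoughts and tool observations

def transition(state: AgentState, action: dict, observation=None) -> AgentState:
    """Apply one action; a tool observation (if any) becomes part of the next state."""
    state.history.append(action)
    if observation is not None:
        state.history.append({"type": "observation", "content": observation})
        if isinstance(observation, dict) and "image" in observation:
            state.images.append(observation["image"])  # e.g. a cropped region feeds later steps
    return state
```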

Real Stakes in Daily Life:

  • In cars: identify parts or traffic signs from images and fetch up-to-date rules.
  • In school: combine photos, charts, and the web to answer research questions.
  • In customer help: read screenshots, search manuals, and compute refunds correctly.
  • In medicine or farming: zoom into images, ground objects, and fetch reliable references.
  • In safety: avoiding hallucinated tool results makes answers more trustworthy.

02 Core Idea

šŸž Hook: Picture a helpful classmate who not only thinks out loud but also grabs a ruler, a map, or a calculator at just the right time.

🄬 The Concept: The Aha! Insight

  • What it is: Teach one multimodal model to interleave thought and tool use at any time and learn these habits with reinforcement learning that fairly rewards each reasoning step.
  • How it works: The model writes thoughts, calls tools, reads results, and keeps going. Training uses a step-wise normalized GRPO objective and a hybrid reward that checks correctness, formatting, and good tool manners (like not hallucinating tool results).
  • Why it matters: Without this, agents either overthink without checking facts, spam tools without planning, or get stuck in loops. šŸž Anchor: On a landmark question, the agent crops the photo, searches a local visual database, verifies details by web search, and then answers.

Three Analogies:

  1. Chef and Kitchen: The brain is the chef; the pantry is the web; the knife is image cropping; the oven is a code interpreter. Interleaving means tasting and adjusting seasoning at every step.
  2. Detective Work: Examine the scene (zoom), run a database lookup (visual search), read a file (webpage extraction), do a calculation (code), and stitch clues together.
  3. Video Game Inventory: Swap tools mid-quest—binoculars to spot, journal to research, and calculator to solve—then proceed.

šŸž Hook: You know how fair grading looks at each part of your work, not just the last page?

🄬 The Concept: Step-wise Normalized GRPO (training stability)

  • What it is: A training rule that balances learning across every thought and tool step, so long steps don’t drown out short ones.
  • How it works: It normalizes by action steps and by token length within each step, with rewards computed per trajectory.
  • Why it matters: Without it, the model may overfit long tool segments and ignore important short decisions, causing unstable tool behavior. šŸž Anchor: If one problem needs 2 short tool calls and another needs 6 long ones, both still teach the model fairly.

šŸž Hook: Like a teacher’s rubric that scores both the right answer and neat work.

🄬 The Concept: Hybrid Reward

  • What it is: A reward recipe that mixes outcome accuracy, format correctness, and penalties for tool-use hallucinations.
  • How it works: 1) A model-judge checks if the final answer is correct; 2) a parser checks tag formatting; 3) a penalty fires if the model calls tools without waiting for real responses.
  • Why it matters: Without this, the agent might be sloppy with tags, believe made-up tool results, or chase the wrong goal. šŸž Anchor: The agent gets points for the right sports score, loses points for messy outputs, and gets penalized if it pretends a search already returned when it didn’t.
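
Here is a rough sketch of how such a hybrid reward could be combined. The helpers `judge_correct`, `tags_well_formed`, and `hallucinated_tool_result`, and the weights, are illustrative assumptions, not the paper's exact recipe.

```python
def hybrid_reward(trajectory, final_answer, ground_truth,
                  w_acc=1.0, w_fmt=0.2, w_tool=0.5):
    """Combine outcome accuracy, format correctness, and a tool-hallucination penalty.
    judge_correct, tags_well_formed, hallucinated_tool_result are hypothetical helpers."""
    # 1) Outcome: a model-based judge compares the final answer to the ground truth.
    accuracy = 1.0 if judge_correct(final_answer, ground_truth) else 0.0
    # 2) Format: strict <think>/<tool_call> tagging with no stray text.
    fmt = 1.0 if tags_well_formed(trajectory) else 0.0
    # 3) Discipline: penalize fabricating tool results instead of waiting for real ones.
    penalty = 1.0 if hallucinated_tool_result(trajectory) else 0.0
    return w_acc * accuracy + w_fmt * fmt - w_tool * penalty
```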

Before vs After:

  • Before: Agents copied the look of tool use but often overcalled tools or looped; visual reasoning was shallow; external APIs were costly and slow.
  • After: One model plans and acts with images; uses tools only when needed; learns robust habits; and relies on a curated local visual database to work faster and cheaper.

Why It Works (intuition, no equations):

  • Fair gradients keep all steps important, so early planning and later verification both improve.
  • Turn-taking rules stop the model from imagining tool results that never came.
  • Multimodal CoT keeps the picture in the loop, so the model aims its searches precisely.
  • A local visual database cuts cost and latency, making practice frequent and safe.

Building Blocks (mini sandwiches):

  • šŸž Hook: Like switching between thinking and doing during homework. 🄬 Thought-and-Action Tags: The model writes thoughts and tool calls with special tags so everything stays organized. Without tags, thoughts and actions would blur together. šŸž Anchor: The agent writes a think block, then a tool_call block, then continues thinking.
  • šŸž Hook: Like having the right school supplies in your backpack. 🄬 Tool Suite: crop/zoom, object grounding and visual search, external text retrieval, webpage extraction, and a Python interpreter. Without these, the agent couldn’t check facts, zoom details, or compute. šŸž Anchor: Identify a car model from an image, then verify specs by web search, then compute totals with code.
  • šŸž Hook: Like a library that you trust more than random blogs. 🄬 Local Visual Retrieval Database: A curated image library across eight categories with expert filtering. Without it, searches could be noisy, slow, or expensive. šŸž Anchor: Finding a plant species by searching a clean, verified local index.
  • šŸž Hook: Like a referee who watches the whole game. 🄬 Model-based Judge: A strong model checks if the final answer matches the ground truth. Without it, grading open-ended answers would be unreliable. šŸž Anchor: The judge marks the agent’s sports result as correct or not.
  • šŸž Hook: Like leveling up from easy puzzles to hard ones. 🄬 Curriculum Learning: Data is graded by how many and which tools are needed. Without it, the agent might be overwhelmed early or bored later. šŸž Anchor: Start with single zoom-and-answer tasks, then progress to zoom+search+read tasks.

03 Methodology

High-Level Overview: Input question and image → Interleaved reasoning loop (think or tool) → Observe results → Repeat until confident → Final answer.
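
Here is a minimal sketch of that loop, assuming hypothetical `model.generate_step` and tool helpers; the real system's interfaces will differ.

```python
def solve(question, image, model, tools, max_steps=20):
    """Interleaved reasoning: think or call a tool, observe, repeat, then answer.
    model.generate_step and the tools dict are hypothetical stand-ins."""
    state = {"question": question, "images": [image], "history": []}
    for _ in range(max_steps):
        step = model.generate_step(state)          # emits a think, tool_call, or answer segment
        state["history"].append(step)
        if step["type"] == "tool_call":
            observation = tools[step["name"]](**step["arguments"])  # run the requested tool
            state["history"].append({"type": "observation", "content": observation})
        elif step["type"] == "answer":
            return step["content"]                 # confident enough to stop
    return None  # no confident answer within the step budget
```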

Step-by-step Recipe:

  1. State and Action Space
  • What happens: The agent treats the whole process as an MDP. At each step, it either writes a think segment or a tool_call segment. Tools return observations that become part of the next state.
  • Why this exists: It cleanly separates planning from acting and ensures each tool result actually changes what the agent knows.
  • Example: Given a palace photo and the question about a king’s tomb, the agent plans to crop the hill area, then search for that landmark’s details.
  2. Thought-and-Action Serialization
  • What happens: The model uses tags like <think> ... </think> and <tool_call> ... </tool_call> so the system can parse, execute, and feed back results safely.
  • Why this exists: Without rigid tagging, the environment couldn’t tell what to execute or when to wait for tool results, causing crashes or confusion.
  • Example: The agent writes a think block to plan a crop; then emits a tool_call with bbox coordinates; then gets a new image back.
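
One plausible way such tags could be parsed before execution, shown as a small sketch with a simple regular expression (the paper's environment likely uses a stricter parser):

```python
import re

SEGMENT = re.compile(r"<(think|tool_call)>(.*?)</\1>", re.DOTALL)

def parse_segments(text: str):
    """Split model output into ordered think / tool_call segments."""
    return [{"type": kind, "content": body.strip()} for kind, body in SEGMENT.findall(text)]

example = "<think>Crop the hill region first.</think>" \
          "<tool_call>{\"name\": \"crop\", \"bbox\": [120, 40, 480, 300]}</tool_call>"
print(parse_segments(example))
```
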
  3. Multimodal CoT with Visual Operations
  • What happens: The agent can crop or zoom images, then ground objects in regions and run a local visual search to get likely identities and confidence scores.
  • Why this exists: Many answers depend on small visual clues (logos, plates, leaf veins). Without active image operations, the agent might miss them.
  • Example: It crops the car grille, runs object grounding, and searches the local car database to identify the model.
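
A small sketch of a crop-and-zoom operation using Pillow; the bounding-box format and zoom factor here are illustrative, not the paper's exact tool interface.

```python
from PIL import Image

def crop_and_zoom(image_path: str, bbox, zoom: int = 2) -> Image.Image:
    """Crop a region of interest and upscale it so fine details (logos, leaf veins) are readable."""
    img = Image.open(image_path)
    left, top, right, bottom = bbox
    region = img.crop((left, top, right, bottom))
    return region.resize((region.width * zoom, region.height * zoom), Image.LANCZOS)

# e.g. zoom into a car grille before running object grounding and local visual search
grille = crop_and_zoom("car.jpg", bbox=(250, 180, 520, 340))
```
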
  4. External Text Retrieval
  • What happens: The agent forms a focused query and asks a search engine for top-ranked results (title, abstract). It can follow up by visiting URLs and extracting page content or summaries with a helper tool.
  • Why this exists: Photos rarely contain the whole answer. Without text retrieval, the agent can’t confirm history, specs, or news updates.
  • Example: After identifying Sanssouci Palace, it searches for the tomb on the hill and opens a trusted page to confirm that it’s Frederick II.
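
A hedged sketch of a search-then-read tool wrapper; the endpoint, parameters, and response fields are placeholders, since the actual search backend (Sogou, Bing, Quark) and extraction tool aren't specified here.

```python
import requests

SEARCH_URL = "https://example-search.internal/api"  # placeholder endpoint, not a real service

def web_search(query: str, top_k: int = 5):
    """Return top-ranked results as (title, snippet, url) triples."""
    resp = requests.get(SEARCH_URL, params={"q": query, "k": top_k}, timeout=10)
    resp.raise_for_status()
    return [(r["title"], r["snippet"], r["url"]) for r in resp.json()["results"]]

def read_page(url: str) -> str:
    """Fetch a page the agent chose to open; a real system would also extract or summarize it."""
    return requests.get(url, timeout=10).text
```
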
  5. Local Code Interpreter
  • What happens: The agent can run Python code offline to compute, transform data, or check logic.
  • Why this exists: Some tasks need exact math or data reshaping. Without code, you risk arithmetic mistakes or clumsy reasoning.
  • Example: After reading ticket prices and discounts, it multiplies and sums to get the final cost.
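
A minimal sketch of a local Python interpreter tool that runs generated code in a subprocess with a timeout; a production sandbox would add much stronger isolation.

```python
import subprocess, sys

def run_python(code: str, timeout: int = 10) -> str:
    """Execute model-written Python in a separate process and return stdout (or the error)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "error: execution timed out"

# e.g. exact arithmetic for ticket prices and a discount
print(run_python("prices=[12.5, 9.0, 9.0]; print(round(sum(prices)*0.9, 2))"))
```
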
  6. Training with Step-wise Normalized GRPO
  • What happens: During RL, the model generates multiple trajectories per question. Each trajectory gets a reward based on correctness, format, and tool discipline. Gradients are balanced per action step and per token length within steps.
  • Why this exists: Ensures fair learning across all parts of the reasoning chain, stopping long tool calls from dominating.
  • Example: A short crop-and-answer sample and a long zoom+search+read sample both shape the policy meaningfully.
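
The exact GRPO objective (importance ratios, clipping, and so on) isn't reproduced here; the sketch below only illustrates the normalization idea, dividing each step's loss by its token count and by the number of steps so long tool segments don't dominate.

```python
import torch

def stepwise_normalized_loss(step_logprobs, advantage):
    """Illustrative only, not the paper's objective.
    step_logprobs: list of 1-D tensors, one per think/tool step (tokens the policy generated).
    advantage: scalar group-relative advantage for the whole trajectory."""
    num_steps = len(step_logprobs)
    loss = 0.0
    for logprobs in step_logprobs:
        # normalize within the step by its token length, then across the trajectory by step count,
        # so long tool segments do not drown out short but important decisions
        step_term = -(advantage * logprobs).sum() / logprobs.numel()
        loss = loss + step_term / num_steps
    return loss
```
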
  7. Hybrid Reward and Turn-taking Discipline
  • What happens: Rewards include outcome accuracy (via a judge), format checks (strict tags, no stray text), and a penalty if the model calls tools without waiting for real responses.
  • Why this exists: Keeps outputs clean, goals aligned, and prevents tool-call hallucinations that derail training.
  • Example: The agent that guesses tool outputs loses points; the one that waits for actual responses gains stability.
  8. Data Pipelines and Curriculum
  • What happens: Two automated pipelines create image–text QA with verified facts and graded difficulty; sports news provides objectively checkable answers; private images are linked to knowledge graphs; human reviewers ensure uniqueness and time consistency.
  • Why this exists: RL needs reliable rewards and non-ambiguous questions. Without strong data curation, the agent would learn from noisy or shifting labels.
  • Example: Converting ā€œthis seasonā€ to an exact year avoids future answer drift.
  9. Local Visual Retrieval Database (MWRD)
  • What happens: Experts build a curated, high-precision image library across eight categories, updated regularly.
  • Why this exists: Reduces API cost and provides trusted, fast visual evidence. Without it, repeated training calls would be slow and expensive.
  • Example: Identifying plant species by searching a vetted, labeled collection.
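
A simplified sketch of retrieval over a local image library using precomputed embeddings and cosine similarity; the embedding model and index layout are assumptions, not the MWRD implementation.

```python
import numpy as np

class LocalVisualIndex:
    """Nearest-neighbor search over a curated, labeled image library (hypothetical layout)."""
    def __init__(self, embeddings: np.ndarray, labels: list):
        # embeddings: (N, D) matrix of image features, one row per library image
        self.embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.labels = labels

    def search(self, query_embedding: np.ndarray, top_k: int = 5):
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.embeddings @ q                       # cosine similarity against the library
        best = np.argsort(scores)[::-1][:top_k]
        return [(self.labels[i], float(scores[i])) for i in best]  # likely identities + confidence
```
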
  10. Asynchronous Tool Invocation Inside a Synchronous Loop
  • What happens: The rollout loop stays synchronized for stability, but actual tool calls run in parallel to save time, respecting rate limits.
  • Why this exists: Tool latency, not model decoding, is the main bottleneck. Without parallel calls, training would crawl.
  • Example: While a web page is loading, another batch crops images and runs local search.
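
A sketch of how tool calls inside one synchronous rollout step could run in parallel with a thread pool; the paper's actual infrastructure (rate limiting, batching) is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor

def run_tool_batch(tool_calls, tools, max_workers=16):
    """Keep the rollout loop synchronous, but let slow tool calls (web search, page reads) overlap."""
    def execute(call):
        return tools[call["name"]](**call["arguments"])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # results come back in the same order as tool_calls, so rollouts stay reproducible
        return list(pool.map(execute, tool_calls))
```
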
  11. Distillation for Small Models
  • What happens: The trained 32B teacher rolls out tool-using trajectories for many inputs. Small 2B–4B students train for one epoch to copy these habits.
  • Why this exists: Not everyone can run a 32B model in production. Distillation packs good strategies into smaller, cheaper models.
  • Example: A 3B student learns when to crop, search, and compute, achieving strong scores.
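
A high-level sketch of the distillation step with hypothetical `teacher.rollout` and `student.fit` helpers; the real pipeline would filter trajectories and use a standard SFT trainer.

```python
def distill(teacher, student, questions, epochs=1):
    """Roll out tool-using trajectories with the 32B teacher, then fine-tune a small student on them.
    teacher.rollout and student.fit are hypothetical stand-ins for the actual tooling."""
    trajectories = []
    for q in questions:
        traj = teacher.rollout(q)              # interleaved thoughts, tool calls, and observations
        if traj.answer_is_correct:             # keep only successful demonstrations
            trajectories.append(traj.to_training_example())
    student.fit(trajectories, epochs=epochs)   # one epoch is enough to copy the tool habits
    return student
```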

Secret Sauce:

  • Interleaved thinking plus multimodal CoT keeps the agent’s eyes and brain working together.
  • Step-wise normalized GRPO fairly trains every step.
  • Hybrid rewards enforce clean, honest tool use.
  • A curated local visual database and parallel tool layer make training practical and fast.

04 Experiments & Results

The Tests and Why:

  • The team built MWE-Bench to test real multimodal tool use in six areas: Car, Animal, Plant, Person, Landmark, and Sports. This checks whether the agent can combine image operations, retrieval, reading web pages, and reasoning.
  • They also evaluated on filtered subsets of public benchmarks (MMSearch, SimpleVQA) and a pure-text tool-using benchmark (WebWalkerQA) to see if MindWatcher generalizes.

The Competition:

  • Closed and open models like Gemini 2.5 Flash, GPT-5 mini, Qwen2.5-VL-32B, Qwen3-VL 32B Thinking, WebWatcher, and more. This is like racing against both famous sprinters and new up-and-comers.

The Scoreboard (contextualized):

  • On MWE-Bench with agent-style reasoning, MindWatcher-32B scores about 75.35 overall. Think of this like getting an A when strong rivals are at B+ to A-.
  • Category highlights: top accuracy in Vehicle, Animal, Plant, and Person. It remains competitive in Landmark and Sports.
  • Distilled models shine: MindWatcher-4B reaches about 69.63, while the 2B and 3B land near mid-60s, which is impressive for their size.
  • On MMSearch and SimpleVQA subsets, MindWatcher matches or beats strong baselines and stays competitive on WebWalkerQA, showing that the focus on tools didn’t ruin plain text reasoning.

Surprising Findings:

  1. Tool Capacity Matters a Lot: Changing the search engine (Sogou vs Bing vs Quark) can swing results dramatically, especially by domain and language. This means environment choice can matter as much as model choice.
  2. Genetic Inheritance: Even after RL, the agent’s performance curves and tool-call patterns still echo its base model. RL sharpens habits but doesn’t erase deep strengths or weaknesses.
  3. Decision Trigger Boundary: Some big models refuse to call tools on round 0 and pay for it in accuracy. MindWatcher learns to trigger tools earlier when needed.

What the Numbers Mean in Practice:

  • The 32B agent achieves state-of-the-art tool-using performance on MWE-Bench, proving that interleaved thinking and multimodal CoT plus RL training translate to real gains.
  • Smaller distilled versions inherit robust tool habits, offering great cost-performance for teams who can’t deploy huge models.
  • Tool environment design (like picking the right search engine for your language and domain) is critical to fair evaluation and real deployments.

Anecdotes from Case Studies:

  • A landmark-and-history question shows how interleaved cropping and web reading leads to the precise king’s tomb answer.
  • A sports example shows how turning vague time (ā€œthis seasonā€) into exact dates prevents answer drift, keeping rewards reliable.
  • A gadget example (smart glasses) shows how the agent zooms components, then searches for brand specs to list functions accurately.

05 Discussion & Limitations

Limitations:

  • Base Model Ceiling: The agent keeps a ā€œgeneticā€ link to its foundation model; long-range reasoning and some multimodal limits remain.
  • Tool Dependence and Latency: Real-time performance can bottleneck on search engines and websites; slow tools slow agents.
  • Visual Library Coverage: The local image database is curated but finite; rare categories may be missing.
  • Reward Model Fragility: Model-as-judge can misgrade edge cases; strict formatting parsers might be brittle.
  • Language and Domain Variance: Different languages and topics favor different search engines, making results uneven.

Required Resources:

  • A capable multimodal base model (32B for best results, or 2B–4B distilled for cost-sensitive cases).
  • RL infrastructure with vLLM-like batching, asynchronous tool layers, and judge models.
  • A curated local visual database and access to reliable search engines or internal knowledge sources.
  • Data pipelines for building verified, time-stable multimodal QA with curriculum difficulty.

When NOT to Use:

  • Fully offline settings with no tool access or where external content is prohibited.
  • Ultra-low-latency tasks where any network call is too slow.
  • Purely memory-based Q&A where a simpler model already answers perfectly.
  • Domains where tools are known to be noisy or untrustworthy.

Open Questions:

  • How to break the genetic ceiling so RL can add truly new deep reasoning skills beyond the base model.
  • How to evaluate fairly when tool quality and latency differ by environment and language.
  • How to scale to longer horizons with better memory and context management.
  • How to make model-judges more reliable and less biased.
  • How to expand the visual library safely while keeping high precision and low cost.

06 Conclusion & Future Work

Three-Sentence Summary: MindWatcher is a multimodal tool-using agent that interleaves thinking with actions like zooming images, searching the web, reading pages, and running code to solve complex questions. It learns these habits through reinforcement learning with step-wise normalization and a hybrid reward that rewards correct answers, clean formats, and honest tool use, all supported by a curated local visual database. On MWE-Bench and other tests, it reaches state-of-the-art tool-augmented performance, and distilled small models inherit much of its power.

Main Achievement: It shows that a single model can plan and act with images and tools in one loop, and that careful RL training plus curated tools can outperform larger or newer models on real, multi-step tasks.

Future Directions:

  • Break the genetic ceiling by pairing stronger base models with RL that targets long-horizon compositional skills.
  • Build fairer benchmarks that factor in tool-environment variability and multilingual needs.
  • Grow the local visual library while keeping expert-level precision and low inference cost.
  • Improve judge models and add better safety checks for tool outputs.

Why Remember This: MindWatcher turns an AI from a smart talker into a smart doer—thinking with pictures, checking facts with tools, and learning good habits. It’s a blueprint for practical, trustworthy agents that can help in school, work, and the real world.

Practical Applications

  • Automated product identification: crop logos or parts, search a local visual library, and fetch specifications.
  • Sports fact checking: read match photos, identify players by jerseys, and confirm results via web search.
  • Travel assistant: recognize landmarks from photos, retrieve historical facts, and summarize trustworthy pages.
  • Customer support triage: read screenshots, search internal docs, and compute refunds or settings with a code tool.
  • Education helper: combine textbook images and web pages to answer research questions with sources.
  • Field botany or farming: zoom into leaf regions, identify species or diseases, and retrieve treatment guidelines.
  • Automotive maintenance: recognize car models/parts and fetch up-to-date manuals or recall info.
  • Compliance and auditing: extract page content, cross-check claims, and compute metrics reproducibly.
  • News QA: resolve dates to exact timestamps and verify outcomes to avoid answer drift over time.
  • R&D search: ground objects in figures, retrieve similar images or datasets, and summarize technical web pages.
#Tool-Integrated Reasoning #Interleaved Thinking #Multimodal Chain-of-Thought #Reinforcement Learning #GRPO #Hybrid Reward #Object Grounding #Visual Retrieval #Local Image Corpus #Asynchronous Tool Invocation #Distillation #Curriculum Learning #Model-based Judge #MWE-Bench #Agentic RL