AdaTooler-V: Adaptive Tool-Use for Images and Videos

Intermediate
Chaoyang Wang, Kaituo Feng, Dongyang Chen et al. Ā· 12/18/2025
arXiv Ā· PDF

Key Summary

  • AdaTooler-V teaches an image-and-video AI to first ask, ā€œDo I really need a tool?ā€ before using one, which saves time and boosts accuracy.
  • The key idea is a new reward system (AT-GRPO) that pays the model only when tool use actually helps, and penalizes it when tools are unnecessary.
  • A sample-specific Tool Benefit Score (āˆ†S) estimates how much tools improve answers for that exact problem, so rewards fit each case.
  • Two datasets power training: a 100k Chain-of-Thought set for a good starting brain, and a 300k verifiable set for reinforcement learning.
  • The model learns a think–act–see loop: think about what to do, use a visual tool if needed, look at the result, and repeat until ready to answer.
  • Across 12 benchmarks, AdaTooler-V-7B hits 89.8% on the tough high-resolution V* test, beating GPT-4o and Gemini 1.5 Pro.
  • On video reasoning, performance steadily rises as more frames are provided, showing strong temporal understanding gains.
  • Ablations show both supervised warm-up (SFT) and the adaptive reward are crucial; removing either reduces accuracy.
  • The approach cuts ā€œoverthinkingā€ and avoids meaningless tool calls, reducing inference cost while improving reliability.
  • All code, models, and data are released, making it practical to build smarter, more efficient multimodal agents.

Why This Research Matters

Adaptive tool-use means AI acts more like a thoughtful helper than a gadget hoarder, saving time and compute while improving accuracy. This can make image and video assistants faster on phones and browsers, where every millisecond and watt count. It also boosts reliability for critical tasks, like reading diagrams, analyzing surveillance clips, or checking forms with small text. By rewarding usefulness, not just activity, the approach encourages transparent, focused reasoning that’s easier to trust. Finally, releasing code, data, and models makes it practical for teams to build efficient multimodal agents for real-world applications.

Detailed Explanation

01Background & Problem Definition

šŸž Hook: You know how you don’t grab a magnifying glass to read a giant street sign, but you might use one to inspect a tiny ant? Smart helpers use special tools only when they need them.

🄬 The Concept: Multimodal Large Language Models (MLLMs) are AIs that understand both words and visuals (images and videos) and can use visual tools (like zooming/cropping or picking video frames) while they think. Many recent systems embraced ā€œthinking with images,ā€ where the AI interleaves its thoughts with quick tool uses to look closer at details.

  • How it works (before this paper): Models learned long Chain-of-Thought (CoT) reasoning and sprinkled in tool calls throughout. This often improved tough visual questions because the AI could zoom in, crop, or fetch frames to verify details.
  • Why it matters: Without tools, the AI can miss tiny clues (like small text on a sign or a subtle shape) and make mistakes.

šŸž Anchor: Imagine a detective AI solving a picture puzzle. It can think step by step and zoom in to read a tiny label before concluding.

šŸž Hook: Imagine a student who always grabs a calculator—even for 2 + 2. That’s overkill and slows them down.

🄬 The Concept: Blind tool-use is when a model calls tools even when they aren’t needed.

  • How it works: Training setups and rewards sometimes push the model to use tools by default, whether the problem is simple or hard.
  • Why it matters: Using tools all the time wastes time, costs more compute, and can cause overthinking—wandering away from the best solution path.

šŸž Anchor: On a simple clock-reading question, zooming three times before answering is like taking a detour for no reason.

šŸž Hook: Think of learning a sport—you get better by trying things and getting points for good choices.

🄬 The Concept: Reinforcement Learning (RL) is a way to train AIs by rewarding good decisions and discouraging bad ones.

  • How it works:
    1. The AI tries ways to solve a task.
    2. It gets a reward based on how good the result is.
    3. Over time, it repeats strategies that earn more reward.
  • Why it matters: RL can teach not just what to answer, but how to think and when to use tools.

šŸž Anchor: A robot learning to clean a room gets points for finishing faster and neater, so it learns efficient moves.

šŸž Hook: When you explain your thinking in math, you write steps, not just the final answer.

🄬 The Concept: Chain-of-Thought (CoT) is the AI’s step-by-step reasoning.

  • How it works: The model writes out intermediate steps, checks ideas, and then concludes.
  • Why it matters: CoT helps the AI make fewer logic mistakes and stay on track.

šŸž Anchor: To find the time difference between two clocks, the AI lists the times, subtracts them, and shows the minutes.

šŸž Hook: Sometimes you need to look closer at a picture or review a key video moment.

🄬 The Concept: Vision tool interactions are actions like CropImg (zoom/crop), FrameAt (grab one video frame), VideoClip (cut a span of frames), or PathTracer (draw a path) that help the AI inspect visuals.

  • How it works: The AI decides an action, runs the tool, and gets a new image/clip to examine.
  • Why it matters: These tools surface tiny details text-only reasoning might miss.

šŸž Anchor: To answer ā€œWhat color is the handbag?ā€ the AI can crop the image to the bag and then answer confidently.

šŸž Hook: You don’t use a microscope for every homework problem, just the ones with tiny organisms.

🄬 The Concept: The problem before this paper was that models lacked a way to decide if a tool was necessary for each question.

  • How it works (failed attempts):
    • Always encourage tool-use: leads to slow, costly, and sometimes worse answers.
    • Reward long thoughts only: can cause overthinking trails.
    • Fixed rules: don’t generalize well across tasks.
  • Why it matters: Without a ā€œShould I use a tool?ā€ decision, models waste resources and sometimes get less accurate.

šŸž Anchor: A good student knows when to show work and when a quick mental math is enough.

šŸž Hook: Picture a coach who adjusts praise depending on how helpful a tool really was for that play.

🄬 The Concept: The gap this paper fills is a missing knob that connects tool-use to its actual usefulness on each question.

  • How it works: We need a way to measure ā€œDid tools help on this exact sample?ā€ and then use this to shape rewards.
  • Why it matters: If we reward tool-use only when it truly helps, the model learns to be efficient and accurate.

šŸž Anchor: If a zoom helps spot a hidden sign and fixes the answer, that deserves a reward; if zooming did nothing, it shouldn’t.

šŸž Hook: Imagine your tablet snapping from ā€œsimple modeā€ to ā€œpro modeā€ based on task difficulty.

🄬 The Concept: AdaTooler-V is a model that adaptively chooses between plain text thinking and tool-augmented thinking.

  • How it works: It first asks if tools seem useful; if yes, it enters a think–act–see loop with tools; if not, it answers with text-based CoT.
  • Why it matters: This keeps answers fast when possible and careful when necessary.

šŸž Anchor: On a multi-image clock puzzle, AdaTooler-V solves it with text only; on a tiny-object question, it zooms in first.

02Core Idea

šŸž Hook: You know how a chef doesn’t use a blender for every dish—only when ingredients need it?

🄬 The Concept: The key insight in one sentence: Reward a model for using visual tools only when those tools measurably improve its answer on that specific problem.

  • How it works:
    1. For each training example, estimate how much tool-use helps (Tool Benefit Score, āˆ†S).
    2. During training, add a bonus when tools help, and add a penalty when they don’t.
    3. Balance this with the normal correctness reward so accuracy and efficiency rise together.
  • Why it matters: This turns tools from a habit into a smart choice, cutting overthinking and cost.

šŸž Anchor: If zooming in turns a wrong guess into the right answer, that earns a reward; if it changes nothing or harms, it doesn’t.

Three analogies for the same idea:

  • Chef analogy: Use the whisk for fluffy eggs (helpful), skip it for pouring water (not helpful).
  • Student analogy: Use a calculator for long division, but not for 5 + 3.
  • Detective analogy: Use a magnifying glass to read a tiny serial number, not to spot a giant billboard.

Before vs. After:

  • Before: Models often called tools by default, padding long thoughts and sometimes drifting off-track.
  • After: The model first judges whether tools are likely to help. If yes, it interleaves thoughts with tools; if not, it answers directly.

šŸž Hook: Imagine scoring how much a helper tool actually helped you finish homework.

🄬 The Concept: Tool Benefit Score (āˆ†S) says how much better a model performs with tools than without them on the same question.

  • How it works:
    1. Solve the same problem multiple times with tools and without.
    2. Compare accuracies and subtract.
    3. Positive means tools helped; negative means they didn’t.
  • Why it matters: This makes the reward smart and sample-specific.

šŸž Anchor: If a reference model gets 6/8 correct with zooms but 3/8 without, āˆ†S is positive—tools helped.

šŸž Hook: Think of a video game that gives more points for the right move and gently reduces points if you repeat it too many times.

🄬 The Concept: AT-GRPO is a reinforcement learning recipe that mixes normal correctness rewards with an adaptive tool-use reward shaped by āˆ†S and the number of tool calls.

  • How it works:
    1. Compute a base reward for being correct and well-formatted.
    2. Add a tool reward: positive if āˆ†S > 0, negative if āˆ†S < 0, smoothly adjusted by how many tools were used.
    3. Normalize rewards within a group and update the policy.
  • Why it matters: The model learns to prefer the shortest, most useful tool trajectories—or none at all.

šŸž Anchor: On a tiny-text question, two smart crops may earn a good bonus; on an easy color question, calling three tools earns a penalty.

Why it works (intuition):

  • It reduces overthinking by not paying for unnecessary steps.
  • It preserves attention to the original image/video instead of drowning in extra crops/clips.
  • It aligns incentives: correctness first, tools only if they help get there.

Building blocks:

  • A two-stage training plan: SFT warm-up with 100k interleaved CoT traces, then RL with 300k verifiable samples.
  • A think–act–see loop with four tools (CropImg, FrameAt, VideoClip, PathTracer).
  • A sample-specific usefulness estimate (āˆ†S) to guide adaptive tool rewards.
  • Grouped normalization and a small KL regularizer to keep learning stable and on-distribution.

šŸž Hook: It’s like teaching a helper to say, ā€œDo I need a tool?ā€ before grabbing one.

🄬 The Concept: The main change is not making new tools, but making a better brain that knows when to use them.

  • How it works: By connecting rewards to real usefulness per sample, the strategy generalizes across images and videos.
  • Why it matters: This delivers both higher accuracy and fewer wasted tool calls—win-win.

šŸž Anchor: Results show AdaTooler-V beats strong baselines on high-res images and complex videos while becoming more efficient.

03Methodology

At a high level: Input (question + image/video) → Decide if tools are needed → If yes: loop [Think → Act (tool) → See (observation)] → Answer; If no: Text-only Chain-of-Thought → Answer.

šŸž Hook: Picture a student who either answers from memory or pulls out a ruler and calculator if the problem is tricky.

🄬 The Concept: Adaptive reasoning pipeline with thoughts, actions, and observations.

  • What it is: A loop where the model reasons, calls a visual tool if helpful, inspects the result, and keeps going until it’s ready to answer.
  • How it works:
    1. Thought (T): The model plans the next step (e.g., ā€œcrop the top-left cornerā€).
    2. Action (C): It picks a tool—CropImg, FrameAt, VideoClip, or PathTracer—and applies it.
    3. Observation (E): It receives a new image patch or clip and updates its context.
    4. Stop when enough evidence is gathered, then answer.
  • Why it matters: This lets the model zoom in for hard details or skip tools for simple tasks.

šŸž Anchor: For ā€œWhat color is the handbag?ā€, the model crops the area with the bag, sees it clearly, then answers ā€œwhite.ā€

Step-by-step details and why each step exists:

  1. Input parsing
  • What happens: The model reads the question and loads the image(s) or video.
  • Why it exists: Without understanding the task and media, it can’t plan sensible actions.
  • Example: ā€œFind the time difference between these two clocksā€ with two clock images.
  2. Decision to use tools
  • What happens: The model judges if tool-use seems necessary.
  • Why it exists: Skipping this leads to blind tool-use and wasted time.
  • Example: For the clock difference, it likely answers via text-only CoT; no tools needed.
  3. Thought (T)
  • What happens: The model writes an internal plan (e.g., ā€œzoom into the sign at the bottomā€).
  • Why it exists: Planning makes tool calls purposeful instead of random.
  • Example: ā€œI’ll crop the lower-right to read the tiny label.ā€
  4. Action (C) using vision tools
  • Tools:
    • CropImg: zoom/crop a region of an image.
    • FrameAt: grab a single frame at a time t from a video.
    • VideoClip: extract frames from t_start to t_end.
    • PathTracer: draw a line/path for spatial reasoning.
  • Why it exists: Tools make subtle evidence visible and checkable.
  • Example: On a surveillance video, grab a key frame when a person enters a room.
  5. Observation (E)
  • What happens: The model gets the resulting crop/clip and adds it to its context.
  • Why it exists: New evidence updates the plan—just like checking your work.
  • Example: After cropping, the brand name becomes legible.
  6. Answering
  • What happens: When confident, the model produces the final answer.
  • Why it exists: The goal is a correct, concise answer, not infinite exploring.
  • Example: ā€œAnswer: 275 minutes.ā€
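
As a rough sketch, the whole loop can be written as one short control function. Everything here is a hypothetical stand-in: `model.generate`, the fields on `step`, and `run_tool` are placeholders for however the real system encodes thoughts, tool calls, and observations.

```python
def solve(model, run_tool, question, media, max_turns: int = 6) -> str:
    """Think–act–see loop: reason, optionally call a tool, inspect, repeat, answer."""
    context = [("question", question), ("media", media)]
    for _ in range(max_turns):
        step = model.generate(context)                 # Thought (T): plan the next move
        if not step.wants_tool:                        # Decision: no tool needed?
            return step.answer                         # Answer directly via text-only CoT
        observation = run_tool(step.tool_name,         # Action (C): CropImg, FrameAt,
                               step.tool_args, media)  #             VideoClip, PathTracer
        context.append(("observation", observation))   # Observation (E): new evidence
    return model.generate(context).answer              # Turn budget reached: answer now
```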

šŸž Hook: Learning to use tools wisely starts with guided practice, then careful rewards during free play.

🄬 The Concept: Two-phase training—Supervised Fine-Tuning (SFT) then Reinforcement Learning (RL).

  • What it is: First teach patterns with examples, then refine by rewards.
  • How it works:
    1. SFT (AdaTooler-V-CoT-100k): The model studies many multi-turn traces showing how to interleave thoughts and tools.
    2. RL (AdaTooler-V-300k): The model explores, gets verifiable rewards, and learns when tools truly help.
  • Why it matters: SFT gives a stable starting brain; RL sharpens decision-making.

šŸž Anchor: Like practicing worked examples before tackling timed quizzes that score your strategy.

Data and rewards (verifiable tasks):

  • Multiple-choice: reward = 1 for exact match, else 0.
  • Numeric QA: reward = 1 if exact number matches.
  • OCR: reward based on how close the text is to ground truth (e.g., word error rate).
  • Free-form QA: reward from text similarity scores (e.g., ROUGE variants).
  • Why this matters: Clear, checkable rewards make RL stable and fair.
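
A minimal sketch of what those verifiable checks can look like in Python. The OCR scorer (a crude word error rate) and the free-form scorer (unigram overlap) are simplified stand-ins for the metrics the paper actually uses.

```python
def mc_reward(pred: str, gold: str) -> float:
    """Multiple-choice: 1 for an exact match, else 0."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def numeric_reward(pred: str, gold: str) -> float:
    """Numeric QA: 1 if the numbers match exactly, else 0."""
    try:
        return 1.0 if float(pred) == float(gold) else 0.0
    except ValueError:
        return 0.0


def ocr_reward(pred: str, gold: str) -> float:
    """OCR: 1 minus a crude word error rate, clipped to [0, 1]."""
    p, g = pred.split(), gold.split()
    errors = sum(a != b for a, b in zip(p, g)) + abs(len(p) - len(g))
    return max(0.0, 1.0 - errors / max(len(g), 1))


def freeform_reward(pred: str, gold: str) -> float:
    """Free-form QA: unigram overlap as a stand-in for a ROUGE-style score."""
    p, g = set(pred.lower().split()), set(gold.lower().split())
    return len(p & g) / max(len(g), 1)
```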

šŸž Hook: Imagine grading how much a tool helped, not just whether the final answer was right.

🄬 The Concept: Tool Benefit Score (āˆ†S) estimation with a reference model.

  • What it is: A per-sample number showing if tools helped.
  • How it works:
    1. A strong model solves the same question multiple times with tools and without.
    2. Compute average accuracies; subtract without-tools from with-tools.
    3. Use this āˆ†S during RL to scale tool-use rewards.
  • Why it matters: This personalizes rewards to each problem’s true needs.

šŸž Anchor: If zooming helped the reference model go from 40% to 80% on a puzzle, āˆ†S is high and positive.

Adaptive Tool-use GRPO (AT-GRPO) in plain words:

  • Base reward: credit for correct, well-formatted answers.
  • Tool-use reward: scaled by āˆ†S and gently adjusted by how many tools were used (few useful calls are great; excessive calls add little or even hurt).
  • Total reward = base reward + α Ɨ tool-use reward (α balances how much we care about tool efficiency).
  • Group normalization: compare answers within a batch so the model learns from relative improvements.
  • KL regularization: keep the new policy from drifting too far from a stable reference, for steady training.
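
A minimal sketch of the group-normalization step, assuming each rollout's total reward has already been computed as base reward + α Ɨ tool-use reward; the KL term lives in the policy-update loss and is left out here.

```python
import statistics


def group_advantages(total_rewards: list) -> list:
    """GRPO-style normalization: score each rollout relative to its group."""
    mean = statistics.mean(total_rewards)
    std = statistics.pstdev(total_rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in total_rewards]


# Four rollouts of the same question: the best-rewarded trajectories get
# positive advantages, the wasteful ones negative.
group_advantages([1.37, 1.10, 0.89, 0.0])
```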

Concrete mini-examples:

  • Single image, tiny text: 1–2 crops reveal the answer; tool reward positive → model learns to crop.
  • Two clocks: No tools needed; calling crops earns a penalty → model learns to skip tools.
  • Video timeline: Selecting a short clip that shows the event sequence gets a positive bonus.

Secret sauce:

  • The reward depends on usefulness per sample, not a blanket rule. That’s what makes tool-use adaptive and generalizable across images, multi-image sets, and videos.

04Experiments & Results

The Test: What was measured and why

  • Accuracy on 12 multimodal benchmarks spanning image detail, logic/math, spatial reasoning, charts, and video understanding.
  • Efficiency signals like response length (shorter often means fewer unnecessary tool calls) and scaling with frame count on video.
  • Why: To see if adaptive tool-use raises correctness while avoiding overthinking and extra cost.

The Competition: Strong baselines

  • Proprietary: GPT-4o, Gemini 1.5 Pro.
  • Open-source: Qwen2.5-VL-7B-Instruct (base), Pixel Reasoner, DeepEyes, Mini-o3, Video-R1, and others.

The Scoreboard (with context):

  • High-resolution V*: AdaTooler-V-7B scores 89.8%—like an A+—surpassing GPT-4o and Gemini 1.5 Pro and beating specialized tool-based models (Pixel Reasoner 84.3%, DeepEyes 85.6%, Mini-o3 88.2%).
  • General image reasoning: Consistent gains on MME, InfoVQA, and MMBench show broad improvements, not just niche wins.
  • MathVista: 74.5% vs. the base model’s 68.2%—over a 6-point jump—indicating better disciplined reasoning.
  • Multi-image spatial reasoning: MMSI-Bench (36.8) and SPAR-Bench (40.3) improvements show the system knows when to extract visual evidence across images.
  • Video reasoning: With 32 frames, AdaTooler-V-7B hits 46.7% (VSI-Bench), 54.6% (VideoMMMU), and 68.4% (MVBench), surpassing the Qwen2.5-VL-7B-Instruct and Video-R1 baselines. On Video-Holmes, it reaches 55.6% vs. 27.8% (base) and 36.5% (Video-R1)—double the base model's score—showing strong causal, sequential reasoning.
  • More frames, more power: Performance rises as input frames grow from 32 → 64 → 128, showing the method scales with richer temporal context.

Training dynamics that make numbers meaningful:

  • Accuracy climbs during RL training (about 0.60 → 0.70), showing that adaptive rewards steadily improve reasoning.
  • Response length drops then stabilizes, reflecting fewer unnecessary tool calls after the model learns to avoid blind tool-use.

Ablations (what mattered):

  • SFT + AT-GRPO > SFT + standard GRPO > GRPO alone: Both the supervised warm-up and adaptive reward are important; removing either reduces average accuracy by several points.
  • The α balance (weight on tool reward) is robust: 0.6–0.8 performs best; too small under-emphasizes efficiency decisions.
  • Tool-use matters: An end-to-end text-only RL variant (no tools) falls behind—e.g., V* drops from 89.8% to 84.4%—proving tools add unique value when used wisely.

Surprising findings:

  • Avoiding tools can be as valuable as using them: Penalties for unnecessary calls shorten responses and often raise accuracy.
  • Gains amplify on complex, long-range video tasks: Adaptive tool-use shines where targeted frame/clip selection is critical.
  • The approach generalizes across modalities (single image, multi-image, video) using one unified reward idea tied to āˆ†S.

05Discussion & Limitations

Limitations:

  • āˆ†S uses one reference model to judge tool helpfulness, which can bias the usefulness estimate. A learned or ensemble-based estimator could be more robust.
  • Rewards are best for verifiable formats (multiple-choice, numeric, OCR). Open-ended generation is only roughly rewarded via text overlap, which may miss nuance.
  • Data is mostly from public benchmarks, so real-world long-tail noise, domain shifts, and tricky edge cases are underrepresented.
  • Tool set is fixed (crop, frame, clip, path). Some domains may need different tools (e.g., table parsers, 3D viewers).

Required resources:

  • Hardware: The reported setup used 8Ɨ NVIDIA H100 80GB GPUs, long context windows (~4096 tokens), and multi-turn, tool-augmented training.
  • Software: A training stack like verl-tool/verl/vLLM, plus robust data curation and verifiable reward scripts.

When not to use:

  • Fully creative, open-ended tasks (e.g., storytelling about an image) where ā€œcorrectnessā€ isn’t easily checked.
  • Domains needing specialized tools not provided (e.g., medical DICOM viewers) unless extended.
  • Extremely noisy videos/images where any small crop/frame offers limited signal; the model may still struggle.

Open questions:

  • Can we learn āˆ†S directly (from a reward model or judge ensemble) to reduce bias and cover open-ended tasks?
  • How to expand the tool library and teach tool selection (which tool, how many times) even more precisely?
  • Can we couple tool-use decisions with latency/energy budgets to optimize real-world deployment costs?
  • How to better preserve and cite visual evidence (traceable reasoning) for higher trust and debuggability?
  • What curriculum best mixes easy and hard cases so the model masters when to skip vs. when to inspect?

06Conclusion & Future Work

Three-sentence summary: AdaTooler-V is a multimodal model that decides whether a visual tool is truly needed before using it, making answers both faster and more accurate. It does this with AT-GRPO, a reinforcement learning method that rewards tool-use only when it measurably helps on each sample (using a Tool Benefit Score, āˆ†S). Trained with a 100k CoT warm-up and a 300k verifiable RL set, it achieves state-of-the-art results across images and videos, including 89.8% on the V* benchmark.

Main achievement: Turning tool-use from a habit into a smart, per-question decision by tying rewards to actual usefulness, which lifts accuracy while reducing unnecessary compute.

Future directions:

  • Learn a better, less biased usefulness estimator (beyond a single reference model).
  • Strengthen rewards for open-ended tasks with learned multimodal judges.
  • Broaden toolkits (e.g., tables, 3D, domain-specific viewers) and align them with efficiency budgets and trustable evidence trails.

Why remember this: AdaTooler-V shows that asking ā€œDo I need a tool?ā€ before acting—then rewarding that wisdom—can transform how AI thinks with images and videos: more precise, less wasteful, and easier to trust.

Practical Applications

  • Smart document readers that decide when to zoom into signatures, stamps, or tiny footnotes to verify details.
  • Video analysts that pull only the key frames or short clips needed to answer event-order or cause–effect questions.
  • Retail shelf auditors that crop product labels only when needed, cutting processing time for thousands of photos.
  • Education tools that solve visual math problems by zooming into figures only when measurements are unclear.
  • Customer support bots that check product photos, zoom in on problem areas, and avoid extra steps when the issue is obvious.
  • Medical triage assistants (non-diagnostic) that enlarge relevant text/markers on consent forms and instructions, skipping unhelpful crops.
  • Security review tools that automatically fetch crucial video moments (entrances/exits) instead of scanning full footage blindly.
  • Quality control systems that inspect high-resolution images of parts, invoking magnification only when defects are suspected.
  • News/media summarizers that select representative video clips to explain sequences without downloading or scrubbing entire videos.
  • Field inspection apps that dynamically zoom into utility meters or serial numbers to record precise readings.
#adaptive tool-use #multimodal chain-of-thought #visual tool interactions #reinforcement learning #AT-GRPO #Tool Benefit Score #image reasoning #video reasoning #crop and frame selection #overthinking mitigation #verifiable rewards #SFT warm-up #high-resolution V* benchmark #Qwen2.5-VL-7B-Instruct #efficiency-aware AI