AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process
Key Summary
- AdaptMMBench is a new test that checks if AI models know when to just look and think, and when to use extra visual tools like zooming or brightening an image.
- It judges the "mode choice" (text-only vs. tool-augmented) using a balanced score called MCC, so results are fair even when hard and easy problems are unevenly distributed.
- It also grades the thinking process itself with key step coverage (did the model follow the important steps?), tool effectiveness (were tools used correctly?), and efficiency (steps, tools, tokens).
- The benchmark spans five areas (real-world photos, OCR documents, GUIs, knowledge diagrams, and math), so it tests many kinds of seeing-and-thinking.
- Bigger and closed-source models generally choose modes better, but a good choice (high MCC) doesn't always mean high final accuracy.
- Good key step coverage matches higher accuracy, suggesting that following the right steps matters a lot for getting answers right.
- There is a big gap between adaptive performance and an oracle setting where perfect visual evidence is given, showing that tool use is a main bottleneck.
- Some models overuse tools even when not needed, while others barely use tools at all; both behaviors hurt adaptive reasoning.
- Token count doesn't neatly track the number of steps or tools used, so speed and cost can't be judged by tokens alone.
- Future work should focus on smarter tool selection and better visual generation to close the gap with oracle performance.
Why This Research Matters
In everyday life, we switch tools only when needed: a flashlight in the dark, a magnifier for tiny print. Good AI should do the same. AdaptMMBench helps build assistants that answer faster and cheaper by avoiding unnecessary visual tool use. It also makes them more reliable, since following the right steps and using tools correctly reduces guesswork. This matters for scanning receipts, navigating software interfaces, reading forms, and solving math and science problems from pictures. By showing exactly where models succeed or stumble (decision choice, step coverage, or tool execution), teams can target improvements that deliver real user benefits. Over time, this leads to AI that is not just smart but also thoughtful and efficient.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're doing homework with a photo in your book. Sometimes you can answer by just looking. Other times you need a magnifying glass or to brighten the image on your tablet. Knowing when to use a tool saves time and avoids mistakes.
Machine Learning
- What it is: Machine Learning is when computers learn patterns from examples to make predictions or decisions.
- How it works:
- Collect lots of examples (pictures, text, answers).
- Let the computer find patterns that connect inputs to correct outputs.
- Test it on new problems to see if it learned well.
- Why it matters: Without learning from examples, the computer can't improve or handle new, tricky tasks. Anchor: A spam filter learns from many emails (examples) to spot new spam messages.
Multimodal Models
- What it is: These are models that can understand more than one kind of data at once, like images and text.
- How it works:
- Read the text (question).
- Look at the image (picture or diagram).
- Combine both to answer better than using either alone.
- Why it matters: Many real problems need both seeing and reading; without both, answers are incomplete. Anchor: Reading a map's legend (text) and the roads (image) together to find the best route.
Vision-Language Models (VLMs)
- What it is: VLMs are multimodal models that connect vision (images) with language (text).
- How it works:
- Encode the picture into features (what's where and how it looks).
- Read the question.
- Use language reasoning guided by visual features to answer.
- Why it matters: Without joining vision and language, the model can't answer questions about pictures. Anchor: "What color is the stop sign?" requires both seeing the sign and understanding the question's words.
Tool-based Reasoning (Thinking with Images)
- What it is: It's when the model actively uses visual tools (zoom, rotate, brighten) to get better visual details before answering.
- How it works:
- Check if the current view is enough.
- If not, call a tool (e.g., zoom into a tiny number).
- Re-check the image and continue reasoning.
- Why it matters: Without tools, small text or tilted images can cause wrong answers. Anchor: Zooming into a receipt total so the model can read "$7.99."
Reasoning Processes
- What it is: The step-by-step chain of thoughts the model uses to reach an answer.
- How it works:
- Identify whatās being asked.
- Gather needed info from the image and text.
- Follow logical steps to compute or conclude.
- Why it matters: Skipping key steps leads to lucky guesses or brittle answers. Anchor: To average numbers from a chart, the model must first read the values, then sum them, then divide by the count.
Computation Efficiency
- What it is: How much work (steps, tools, tokens) the model spends to get an answer.
- How it works:
- Count steps and tool calls.
- Track tokens generated.
- Compare time and cost with accuracy.
- Why it matters: Wasting tools on easy tasks slows everything and can even hurt accuracy. Anchor: If you check every tiny corner with a magnifying glass for a big, easy sign, you waste time.
The world before: Early VLMs mostly stared at one fixed picture and answered using text-only thinking. This "first glance" strategy struggled with tiny text (OCR), dense UIs, and rotated pages. People tried pushing higher resolutions, adding more text "chain-of-thought," or counting fewer tokens to claim efficiency. But these missed a key idea: difficulty is not one-size-fits-all. A big, smart model might read small text without tools, while a smaller model needs to zoom.
The problem: Existing evaluations used static difficulty labels (easy, hard) and simple metrics (final accuracy, token count, how often tools were called). These don't tell us whether the model chose the right mode (text-only vs. tool-augmented) for its own ability. They also don't deeply check the process: did the model cover the essential steps? Did it use tools correctly?
Failed attempts: Token reduction looked like efficiency, but sometimes models reduced tokens by skipping needed steps. Counting tool calls seemed informative, but overusing tools on easy tasks and underusing them on hard tasks both looked "bad" without context. Accuracy alone didn't show whether a correct answer came from good reasoning or a lucky guess.
The gap: We needed a benchmark that (1) lets each model define which tasks are hard for it, (2) fairly scores the modelās mode choice, and (3) inspects the reasoning process quality, not just the final answer.
Real stakes: This matters for apps that read receipts, navigate software UIs, grade homework with diagrams, or help scientists analyze charts. Smarter mode selection means faster, cheaper, and more reliable AI assistants in everyday life.
02 Core Idea
Hook: You know how you decide when to use a flashlight? In bright daylight, you don't need it. In a dark room, you do. A smart helper knows when to switch tools.
Aha in one sentence: Make models judge, for themselves, when to think with just text and when to call visual tools, then grade both that choice (fairly) and the quality of the steps they take.
Adaptive Multimodal Reasoning
- What it is: The model dynamically chooses between text-only reasoning and tool-augmented visual reasoning.
- How it works:
- Start with text+image.
- Ask: "Do I have enough visual evidence?"
- If no, call a visual tool (zoom/rotate/brighten); if yes, continue with text-only.
- Why it matters: Without adaptiveness, models waste tools on easy tasks or fail hard tasks by refusing tools. Anchor: Reading a big street sign? No tools. Reading a tiny serial number? Zoom first, then answer. (A code sketch of this loop follows.)
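The sketch below is a minimal Python rendering of that loop. The helper names (model.assess, model.answer, apply_tool) and the MAX_TOOL_CALLS cap are hypothetical placeholders for whatever VLM interface and visual tools are in use; they are not an API defined by AdaptMMBench.

```python
# Sketch of an adaptive reasoning loop (hypothetical helper names, not a real API).
MAX_TOOL_CALLS = 3  # cap tool use so easy samples are not over-processed

def adaptive_answer(model, image, question):
    view = image
    tool_log = []
    for _ in range(MAX_TOOL_CALLS):
        # Ask the model whether the current view already holds enough visual evidence.
        decision = model.assess(view, question)  # e.g. {"enough": bool, "tool": str, "params": dict}
        if decision["enough"]:
            break
        # Otherwise call the requested tool (zoom/rotate/brighten) with its parameters.
        view = apply_tool(view, decision["tool"], decision["params"])
        tool_log.append((decision["tool"], decision["params"]))
    answer = model.answer(view, question)
    return answer, tool_log  # the log is what the benchmark later audits
```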
Confusion Matrix
- What it is: A table that counts correct vs. wrong decisions in different categories.
- How it works:
- Label cases where tools are truly required vs. redundant.
- Count: TP (used tools when required), TN (skipped tools when redundant), FP (used tools but redundant), FN (skipped tools but required).
- Use these counts to judge decision quality.
- Why it matters: Without tracking all four outcomes, we can't fairly judge decision-making. Anchor: A coach tracks right shots taken and smart passes skipped, not just points scored.
Statistical Metrics
- What it is: Math scores that summarize performance (fairly) across different outcomes.
- How it works:
- Take counts from the confusion matrix.
- Compute a single number that balances classes.
- Compare models even when data is imbalanced.
- Why it matters: Simple accuracy can be misleading if one choice (use tools / don't use tools) dominates the dataset. Anchor: In a class where 90% of true/false answers are "true," always guessing "true" looks good unless you use a balanced metric.
Matthews Correlation Coefficient (MCC)
- What it is: A balanced score from -1 to 1 that fairly measures how well the model picks the right mode (tools or not).
- How it works:
- Plug TP, TN, FP, FN into the MCC formula.
- Get a single fairness-aware score.
- 1 means perfect, 0 means chance, -1 means always wrong.
- Why it matters: Without MCC, overusing or underusing tools can look okay just because the dataset is imbalanced. Anchor: In a quiz with few "hard" questions, MCC stops a student who always picks "easy" from looking unfairly great. (The formula appears below.)
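For reference, the standard MCC formula over those four counts (this is the usual textbook definition; nothing here is specific to this benchmark):

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```

If the denominator is zero (an empty row or column in the confusion matrix), MCC is conventionally treated as 0.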
Key Step Coverage
- What it is: A check that the modelās reasoning includes the essential human-verified steps.
- How it works:
- List the key steps (like a recipe).
- Read the modelās reasoning.
- See how many steps (in order) it meaningfully covered.
- Why it matters: Correct answers without key steps are fragile; coverage shows real understanding. Anchor: If you bake a cake but skip "add eggs," you might get lucky once, but it's not solid cooking. (A small coverage sketch follows this card.)
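A rough sketch of how ordered coverage could be computed. In the benchmark an LLM judge decides whether a reasoning step "meaningfully covers" a key step; the covers callback below is only a stand-in so the in-order counting is concrete.

```python
def key_step_coverage(reasoning_steps, key_steps, covers):
    """Fraction of key steps covered, in order, by the model's reasoning (sketch).

    reasoning_steps: the model's reasoning steps (list of strings)
    key_steps:       human-verified essential steps K (list of strings)
    covers(r, k):    True if reasoning step r meaningfully covers key step k
                     (judged by an LLM or a human; abstract here)
    """
    matched, pos = 0, 0
    for key in key_steps:
        for i in range(pos, len(reasoning_steps)):
            if covers(reasoning_steps[i], key):
                matched += 1
                pos = i + 1  # only scan forward, so coverage respects step order
                break
    return matched / len(key_steps) if key_steps else 1.0
```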
Tool Invocation
- What it is: The moment the model calls a tool (zoom, rotate, brighten, generate lines) with specific settings.
- How it works:
- State the goal (e.g., read tiny text).
- Pick the right tool and parameters (where to zoom, how much to rotate/brighten).
- Use the output to continue reasoning.
- Why it matters: The wrong tool or the wrong settings create bad evidence and wrong answers. Anchor: Zooming into the wrong corner of a page won't help you read the address. (An example tool-call record is sketched below.)
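One way to picture a tool invocation is as a small record: a goal, a tool name, and concrete parameters. The field names and values below are illustrative, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """Illustrative record of one tool invocation (field names are hypothetical)."""
    goal: str                      # e.g. "read the tiny total at the bottom of the receipt"
    tool: str                      # "zoom" | "rotate" | "brighten" | "draw_lines"
    params: dict = field(default_factory=dict)

# Example: straighten a tilted page, then zoom into the region that holds the address.
calls = [
    ToolCall(goal="straighten the scanned page", tool="rotate",
             params={"angle_degrees": -12}),
    ToolCall(goal="read the address block", tool="zoom",
             params={"bbox": [120, 300, 480, 360]}),
]
```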
Reasoning Efficiency
- What it is: A view of how many steps, tools, and tokens the model uses to solve a task.
- How it works:
- Count steps and tool calls.
- Track token usage.
- Compare cost vs. benefit.
- Why it matters: Spending too much effort on simple tasks wastes time and money. Anchor: Using ten hints on an easy puzzle is not efficient.
Three analogies for the idea:
- Doctor's checkup: Sometimes a glance and a few questions are enough (text-only). Sometimes you need an X-ray (tool). MCC judges how well the doctor decides.
- Detective work: Don't dust for fingerprints on a billboard-sized clue; do it when the clue is tiny. Key step coverage checks if the detective followed the right trail.
- Cooking: Taste first (text-only). If the flavor's unclear, use a thermometer (tool). Efficiency checks you didn't dirty ten pans for a sandwich.
Before vs After:
- Before: Models were judged mostly by accuracy, token count, or how often they used tools, mixing up decision quality with final results.
- After: We separately grade (1) the choice of mode (with MCC), and (2) the thinking process (key steps, tool correctness, efficiency), revealing hidden strengths and weaknesses.
Why it works (intuition):
- Difficulty is model-specific: A taller kid reaches the top shelf without a stool; a shorter kid needs one. MCC grades whether each kid chooses the stool appropriately. Key step coverage proves they actually followed the recipe, not guessed.
Building blocks:
- A diverse dataset across five domains.
- Three modes: text-only, adaptive, and oracle-visual.
- Dynamic tool-required vs. tool-redundant labels per model.
- MCC for the mode decision, plus process metrics (coverage, tool correctness, efficiency).
03 Methodology
At a high level: Input (image + question) → Decide mode (text-only or tools) → Reason step by step (possibly with tools) → Output answer and logs → Evaluate choice (MCC) and process (coverage, tool use, efficiency).
Data format: Each sample is a 5-tuple (I, Q, A, E, K):
- I: the image; Q: the question; A: the ground-truth answer.
- E: visual tool annotations (where to zoom; how to rotate or brighten; sometimes auxiliary line drawing).
- K: the key human-verified reasoning steps (a sample record is sketched below).
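As a concrete picture, one sample could be stored like the record below. The file path, field layout, and annotation values are assumptions for illustration; only the five components (I, Q, A, E, K) come from the description above, and the content echoes the OCR worked example later in this section.

```python
# Hypothetical on-disk form of one (I, Q, A, E, K) sample (illustrative values).
sample = {
    "I": "images/ocr/letter_0042.png",            # the image (path is made up)
    "Q": "What is the zip code?",                 # the question
    "A": "20418",                                 # ground-truth answer
    "E": {                                        # visual tool annotations
        "rotate": {"angle_degrees": 90},
        "brighten": {"factor": 1.6},
        "zoom": {"bbox": [120, 300, 480, 360]},   # the address block
    },
    "K": [                                        # key human-verified reasoning steps
        "Correct the page orientation and lighting.",
        "Locate the address block.",
        "Read the five-digit zip code on the address line.",
    ],
}
```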
Three evaluation modes:
Text-Reasoning Mode
- What it is: The model answers using text reasoning without actively transforming the image.
- How it works:
- See I and Q.
- Think step-by-step in text only.
- Produce an answer.
- Why it matters: It shows what the model can do without extra help and defines when tools are truly needed. Anchor: Reading a big, clear label without zooming.
Adaptive Reasoning Mode
- What it is: The model can choose to use tools if it believes more visual evidence is needed.
- How it works:
- Start with I and Q.
- If uncertain, call a tool with parameters (like a bounding box to zoom).
- Use the new view and continue reasoning; repeat if needed.
- Why it matters: This mirrors real-world solving: only use tools when they help. Anchor: Zooming into a small chart legend before calculating an average.
Oracle-Visual Mode
- What it is: The model is given the perfect visual evidence (e.g., the exact zoomed region or corrected image) and then reasons with text only.
- How it works:
- Provide I and the ideal crop or enhanced view I_E.
- Do text reasoning.
- Output answer.
- Why it matters: It sets an upper bound: if the model still fails, the issue is reasoning; if it succeeds, the earlier problem was tool use. Anchor: Handing the student the exact paragraph to read removes the need to search the page. (A dispatch sketch of the three modes follows.)
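The three modes differ only in what visual input the model receives and whether it may call tools. A minimal dispatch sketch in Python; load_image and apply_annotations are placeholders, and adaptive_answer is the loop sketched in the Core Idea section, none of them part of the official harness.

```python
def run_sample(model, sample, mode):
    """Run one sample in one of the three evaluation modes (sketch, not the official harness)."""
    image, question = load_image(sample["I"]), sample["Q"]

    if mode == "text":
        # Text-reasoning mode: the original image only, no tool calls allowed.
        return model.answer(image, question), []
    if mode == "adaptive":
        # Adaptive mode: the model may call zoom/rotate/brighten as it sees fit.
        return adaptive_answer(model, image, question)
    if mode == "oracle":
        # Oracle-visual mode: the annotated evidence E is applied for the model,
        # which then reasons over the ideal view with text only.
        ideal_view = apply_annotations(image, sample["E"])
        return model.answer(ideal_view, question), []
    raise ValueError(f"unknown mode: {mode}")
```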
Adaptive mode selection labels (per model):
- Tool-Required: The sample can't be solved via text-only reasoning for this model; tools are needed.
- Tool-Redundant: The sample can be solved via text-only for this model; tools are unnecessary and may add noise. This makes difficulty dynamic and fair across small and large models.
Scoring the decision with MCC:
- Build a confusion matrix with TP (used tools when required), TN (skipped tools when redundant), FP (used tools when redundant), FN (skipped tools when required).
- Compute MCC to fairly summarize the decision quality even if the dataset has more of one type (a code sketch follows).
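Putting the two steps together: the model's own text-only result defines the label, the adaptive run's tool usage defines the prediction, and the four counts feed MCC. A sketch under the assumption that per-sample results are already reduced to booleans:

```python
import math

def mode_selection_mcc(samples):
    """samples: list of dicts with two booleans per sample:
         'text_only_correct' : did the text-only run solve it? (defines the label)
         'used_tools'        : did the adaptive run invoke any visual tool?
    """
    tp = tn = fp = fn = 0
    for s in samples:
        required = not s["text_only_correct"]     # Tool-Required for this model
        used = s["used_tools"]
        if required and used:
            tp += 1        # tools needed and used
        elif not required and not used:
            tn += 1        # tools redundant and skipped
        elif not required and used:
            fp += 1        # tools redundant but used anyway
        else:
            fn += 1        # tools needed but skipped
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```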
Scoring the process:
- Key Step Coverage: Check how many essential steps (K) the model's reasoning covered, in order. This allows condensed reasoning but rewards staying on the correct path.
- Tool Effectiveness: For each tool call, judge if it fit the stated goal and produced useful evidence (e.g., correct region, correct rotation). Count the fraction that are valid.
- Efficiency: Record the number of reasoning steps, number of tool calls, and token usage; compare cost and conciseness across models. (See the sketch below.)
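Once each logged tool call carries a validity judgment (from an LLM judge or human review), tool effectiveness and efficiency reduce to simple ratios and counts. A sketch over an assumed trace format:

```python
def process_metrics(trace):
    """trace: dict with (assumed layout)
         'steps'      : list of reasoning steps
         'tool_calls' : list of dicts, each with a boolean 'valid'
                        (did the call fit its goal and yield useful evidence?)
         'tokens'     : total tokens generated
    """
    calls = trace["tool_calls"]
    tool_effectiveness = (
        sum(c["valid"] for c in calls) / len(calls) if calls else None  # undefined with no calls
    )
    return {
        "num_steps": len(trace["steps"]),
        "num_tool_calls": len(calls),
        "tokens": trace["tokens"],
        "tool_effectiveness": tool_effectiveness,
    }
```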
Worked example 1 (OCR):
- Q: "What is the zip code?" I is a rotated, dark document.
- Text-Only: The model fails to read the blurred text.
- Adaptive: The model rotates and brightens the page, then zooms to the address block, reads "20418," and answers.
- Oracle: Given the perfectly corrected crop, the model answers "20418" directly.
- Labels: For this model, the sample is Tool-Required. If the model used tools, that's a TP.
- Process: Did the model's steps include locating the address line? Did its zoom hit the right box? Was the rotation angle appropriate? Were the steps efficient?
Worked example 2 (GUI):
- Q: "How many tabs are visible along the top of the 'Fields' dialog?"
- Text-Only: The model can count large, clear tabs without zoom.
- Adaptive: If it zooms multiple times unnecessarily, that's an FP and could even confuse the layout.
- Oracle: A provided crop of the tab row confirms the count.
- Process: Key steps include finding the dialog, locating the tab row, and counting tabs. Tool effectiveness checks whether any zoom targeted the right strip.
The secret sauce:
- Dynamic difficulty boundaries: Instead of a human saying "this is hard," text-only success or failure per model defines whether tools are required, personalizing difficulty.
- MCC for fair decision scoring: This resists imbalance tricks (like always using tools).
- Multi-dimensional process audit: Key step coverage checks logic; tool effectiveness checks execution; efficiency checks costāall alongside final accuracy.
Putting it together like a recipe:
- Input → Choose mode (text vs. tools) → If tools, specify parameters → Reason step-by-step → Produce answer + logs → Evaluate MCC (decision) + coverage, tool validity, efficiency (process) + accuracy (outcome).
04 Experiments & Results
The test: Models were evaluated on 1,420 samples across five domains: real-world photos, OCR documents, GUIs, knowledge diagrams/science, and math (including some problems that need drawing auxiliary lines). Each sample is evaluated in text-only, adaptive, and oracle-visual modes. We measured: mode selection (MCC), key step coverage, tool effectiveness, efficiency (steps, tools, tokens), and accuracy.
The competition: Both closed-source (GPT-5, Gemini-3-Pro) and open-source (Qwen3-VL 8B/32B/235B, PixelReasoner, Deepeyes, Deepeyes v2, Thyme, PyVision, AdaptVision) models.
Scoreboard highlights (with context):
- Mode selection (MCC): GPT-5 reached about 0.41 (like getting a solid A on choosing when to use tools), while the top open-source model, Qwen3-VL-235B, scored about 0.26. Some specialized models were near zero because they either overused tools or almost never used them.
- Process quality: Qwen3-VL-235B led key step coverage (~84.83%), and larger models tended to use tools more effectively (correct region/transform). This aligns with better accuracy, suggesting that "following the right steps" strongly predicts success.
- Accuracy by mode: Adaptive generally beats text-only, but there is a big jump to oracle-visual. For example, GPT-5 improved from ~78.69% (adaptive) to ~88.69% (oracle), showing that tool use, not reasoning itself, is often the bottleneck.
- Efficiency isn't just tokens: Token usage didn't cleanly match the number of steps or tools. A model could take fewer steps but still use many tokens (longer explanations), or vice versa.
Domain stories:
- OCR and GUI: Adaptive zooming/rotation helped read tiny text and count UI elements. But extra zooms on already clear items caused confusion.
- Real-world photos: Tool use mattered for tiny details (e.g., a cat's eye color), but unnecessary zooms on obvious features slowed models down.
- Knowledge and Math: Oracle-visual mode unlocked higher scores, especially where small diagram details mattered. For geometry tasks that benefit from drawing auxiliary lines, closed-source models did better; open-source models still need stronger visual generation.
Surprising findings:
- Decoupling of MCC and final accuracy: A model could pick modes wisely (good MCC) yet still miss answers due to weak tool execution or reasoning; another could get high accuracy but make poor mode choices (e.g., overuse tools), suggesting hidden inefficiency and fragility.
- Model scaling helps selection: Within Qwen3-VL, bigger models chose modes more reliably, implying capacity improves meta-cognition about difficulty.
- Tool effectiveness varies widely: Some models frequently zoomed the wrong spot or rotated incorrectly, hurting adaptive runs even on solvable tasks.
- Visual generation matters: On tasks needing auxiliary lines, oracle-visual gains were large; adaptive runs lagged unless the model could create or use generated visuals well.
- Error anatomy (GPT-5 sample): About 42.3% of errors came from visual reasoning failures (wrong region/transform), 7.3% from context noise despite correct visuals, 8.3% from needless tool use on text-sufficient tasks, and 28.8% from capability ceilings. This maps where to improve next.
In plain terms: The best models are getting good at deciding when to pull out the magnifying glass, but many still fumble when actually using it. Following the right steps goes hand-in-hand with answering correctly, and there is still a big prize waiting if models learn to gather perfect visual evidence on their own.
05 Discussion & Limitations
Limitations:
- Domain coverage is broad but finite; real-world use can include videos, 3D scenes, handwriting quirks, or complex GUIs beyond those seen here.
- Oracle-visual assumes annotations perfectly capture the needed evidence; rare corner cases might still need more context.
- LLM judges (for key steps and tool validity) introduce small evaluation noise, though prompts and reviews aim to keep this stable.
- Some models cannot yet generate visuals (e.g., draw auxiliary lines), limiting their adaptive ceiling on those tasks.
Required resources:
- A VLM that can run in three modes (text-only, adaptive with tools, oracle-visual with supplied evidence) and can log its reasoning/tool calls.
- Access to simple visual tools (zoom, rotate, brightness/contrast) and, optionally, visual generation utilities.
- Enough compute to process 1,420 diverse samples and store reasoning traces for process evaluation.
When NOT to use:
- Pure text tasks: If no image reasoning is needed, a text-only benchmark is more efficient.
- Highly interactive environments (e.g., full web browsing or multi-window OS control) beyond the provided GUIs; a dedicated agentic benchmark may fit better.
- Real-time systems with tight latency budgets, if you canāt enable or analyze tool calls.
Open questions:
- How can models better self-diagnose visual uncertainty and choose the minimal helpful tool (right place, right transform, first try)?
- What training signals best teach tool timing and parameter selection (e.g., rewards for correct early stopping, penalties for redundant calls)?
- How to robustly evaluate and train visual generation (auxiliary lines, overlays) for diagrams and charts?
- Can we design unified strategies that connect meta-cognition (mode choice), tool skill (execution), and logic (key steps) into one learning loop?
- How can efficiency be improved without sacrificing correctness or step coverage, especially under long-context constraints?
06 Conclusion & Future Work
Three-sentence summary: AdaptMMBench is a benchmark that fairly tests whether a vision-language model knows when to use visual tools and whether it reasons through the right steps. It separates the quality of the mode choice (with MCC) from the quality of the reasoning process (key step coverage, tool effectiveness, efficiency) and the final accuracy. Results show that many models still underperform in tool execution and that smarter, more reliable tool use could unlock big gains.
Main achievement: Turning adaptive multimodal reasoning into a clear, fair, and multi-view evaluation: dynamic difficulty per model, MCC for the decision, and deep process auditing for how the answer was reached.
Future directions: Improve tool selection and parameterization, strengthen visual generation for diagrams, train models to stop using tools earlier when not needed, and connect meta-cognition, tool skill, and logic into a single learning objective.
Why remember this: Knowing the answer is good, but knowing when to grab the magnifying glass, and how to use it well, is what makes an assistant fast, accurate, and trustworthy. AdaptMMBench is the flashlight that shows us where models are careful thinkers and where they're still guessing in the dark.
Practical Applications
- Receipt and invoice readers that only zoom or enhance when totals are small or low-contrast.
- Document OCR assistants that auto-rotate and brighten pages only when text is unreadable.
- GUI navigation bots that count tabs or buttons without needless zooms, improving speed and stability.
- Homework helpers that draw auxiliary lines on geometry diagrams only when the logic requires them.
- Scientific chart analyzers that first try direct reading, then selectively zoom into axes or legends if needed.
- Customer support tools that decide when a screenshot needs enhancement before extracting information.
- Mobile accessibility apps that adaptively magnify or re-orient camera views for tiny labels and signs.
- Industrial inspection systems that selectively zoom into potential defects while skipping clear regions.
- E-learning graders that check key solution steps in student reasoning, not just the final numeric answer.
- Agent frameworks that train better tool timing and parameter choices to cut latency and cloud costs.