AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
Key Summary
- AdaReasoner teaches AI to pick the right visual tools, use them in the right order, and stop using them when they aren't helping.
- It learns this not by memorizing one tool at a time, but by treating tool use as a general reasoning skill.
- A special training recipe mixes carefully built multi-step examples with reinforcement learning that rewards good tool planning.
- An adaptive trick hides tool names and rephrases their descriptions so the AI understands what tools do, not just what they're called.
- On tough tasks like finding safe paths on a grid or choosing the right jigsaw piece, AdaReasoner beats many big-name models.
- The AI can even use brand-new tools at test time (like A* pathfinding) and quickly figure out when they're helpful.
- It learns to adopt helpful tools, discard distracting ones, and adjust how often it uses them based on the task.
- The biggest limit becomes the tools' quality, not the model's size, letting smaller open models reach state-of-the-art results.
- This approach makes visual AI more reliable in long, multi-step problems that need checking and correcting along the way.
Why This Research Matters
In real life, we rarely solve visual problems in one shot; we look closely, use tools, check our work, and adjust. AdaReasoner gives AI that same practical rhythm, so it can organize helpers like OCR, pathfinders, and editors to finish tricky tasks. This means more reliable assistants for navigation on maps, picking the right product on websites, or verifying that a diagram's route is safe. Because the model learns what tools do, not just their names, it can adapt when companies change APIs or when new tools appear. Smaller, open models become far more capable, making advanced assistance more accessible and affordable. As a result, everyday apps can become smarter, safer, and more helpful without needing the biggest model.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're building a LEGO castle. Sometimes your hands are enough. Other times, you need a ruler to measure, a pencil to sketch, or a phone to check the design. You don't always use every tool; you choose what helps, when it helps.
🥬 The Concept: Multimodal Large Language Models (MLLMs) are AIs that understand pictures and text. They're great at chatting about images, but they struggle with careful, step-by-step visual reasoning, like planning a safe path on a grid or testing which puzzle piece fits, unless they can call external tools (like a pathfinder, text reader, or image editor). How it works (before this paper):
- People trained models to follow rigid scripts: "First crop, then read text," no matter the situation.
- Others used one tool in a loop (e.g., just zooming in repeatedly), which helped a little but wasn't flexible.
- When new tools or new tasks appeared, these models got confused because they had memorized patterns instead of learning to plan. Why it matters: Without adaptive tool use, AI wastes time, makes more mistakes, and breaks when tasks change. It's like trying to build every LEGO castle with the same three steps, even when the pieces are different.
🍞 Anchor: Think of a maze picture where you must go from Start to Goal without falling into holes. A smart human might first point to key spots, then let a pathfinder tool suggest a route, then draw the route to double-check. Old models either did the same steps every time or only used one tool. They needed a better way.
🍞 Hook: You know how a good coach doesn't just yell plays; they watch the game, swap players, and change tactics on the fly?
🥬 The Concept: Dynamic Tool Orchestration is teaching AI to be that coach: choosing which tools to use, when to call them, and how to combine them across many steps. How it works (before this paper):
- Prompt recipes told the model exactly when to call a tool, which was brittle.
- Single-tool RL agents improved perception but couldnāt flexibly coordinate diverse tools.
- Models overfit to specific tool names ("Calculator") instead of understanding what a tool actually does. Why it matters: Without orchestration, the AI neglects useful tools or overuses the wrong ones, like calling a pathfinder when it's time to verify text.
🍞 Anchor: In a jigsaw task, you should first find the hole, then try piece A, then B, and compare. Old systems often either never tried the insert tool or kept trying it when it wasn't needed.
🍞 Hook: Imagine learning a new board game. At first, a friend shows you ideal moves (examples). Then you play many rounds and learn from wins and losses (practice).
🥬 The Concept: This paper proposes a two-part recipe plus a crucial twist:
- Tool Cold Start (examples): curated, multi-turn tool-using trajectories that include reflection and even tool failures, so the model learns robust patterns.
- Tool-GRPO (practice): a reinforcement learning stage that rewards good multi-step tool choices and final correct answers.
- Adaptive Learning (the twist): randomizing tool names and paraphrasing descriptions so the model understands functions, not labels. Why it matters: Without the examples, the AI struggles to discover good plans. Without the practice, it doesn't stabilize adaptive habits. Without the twist, it memorizes names and fails on new tools.
🍞 Anchor: After this training, when the A* pathfinding tool appears at test time, the model recognizes its purpose from the description and uses it wisely for navigation, but not for unrelated verification steps.
🍞 Hook: Think about switching from a small toolbox to a smart workshop where the right tool rolls over when you need it.
🥬 The Concept: The gap this paper fills is making tool use a general reasoning skill, not a hard-coded script. How it works:
- High-quality, long, multi-step examples show not just what to do, but why (with Chain-of-Thought).
- Reinforcement learning optimizes the whole sequence based on success.
- Adaptive randomization protects against overfitting to tool names. Why it matters: Now the model can coordinate multiple tools, handle unfamiliar tools, and switch strategies as the task changes.
🍞 Anchor: In tests, a 7B open model with AdaReasoner jumps from middling scores to near-perfect on visual spatial planning, beating even larger, closed systems on several tasks. That's like a junior varsity team outplaying varsity by using smarter plays and better teamwork.
02 Core Idea
🍞 Hook: You know how a Swiss Army knife has many tools, but the trick is knowing which one to flip out, in what order, and when to stop?
🥬 The Concept (Aha in one sentence): Teach the AI to treat tool use as a flexible reasoning skill, so it can pick, sequence, and adapt tools across multiple steps based on the task and feedback. How it works:
- Show multi-turn, high-quality examples that include reflection and even failures.
- Use Tool-GRPO (a group-based RL method) to reward correct formatting, good tool calls, and correct final answers across turns.
- Randomize tool identities and reword descriptions so the model learns meaning, not names. Why it matters: Without this, models overuse or misuse tools, crumble on new tools, and fail to plan over many steps.
🍞 Anchor: In a maze, the model learns to point to start/goal, call A* to find a path, draw the path to verify, and then answer, skipping tools if it already knows enough.
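To make that pathfinder step concrete before moving on, here is a minimal sketch of what a grid pathfinding tool of this kind could look like. The grid encoding (0 = safe, 1 = hole), the move labels, and the function name are illustrative assumptions, not the paper's actual tool.

```python
import heapq

def astar_path(grid, start, goal):
    """Return a move string such as 'R,D,R,D' from start to goal, stepping only on safe cells (0s)."""
    rows, cols = len(grid), len(grid[0])
    moves = {(0, 1): "R", (0, -1): "L", (-1, 0): "U", (1, 0): "D"}

    def h(cell):  # Manhattan-distance heuristic toward the goal
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), start, [])]  # (estimated total cost, cell, moves so far)
    seen = {start}
    while frontier:
        _, (r, c), path = heapq.heappop(frontier)
        if (r, c) == goal:
            return ",".join(path)
        for (dr, dc), label in moves.items():
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                heapq.heappush(frontier, (len(path) + 1 + h((nr, nc)), (nr, nc), path + [label]))
    return ""  # no safe route exists

# 0 = safe ice, 1 = hole; start at the top-left, goal at the bottom-right.
grid = [[0, 0, 1],
        [1, 0, 0],
        [0, 0, 0]]
print(astar_path(grid, (0, 0), (2, 2)))  # prints a safe route, e.g. 'R,D,R,D'
```

In AdaReasoner's setting, the model never sees this code; it only reads the tool's description and decides whether calling it will actually help.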
Three analogies for the same idea:
- Orchestra Conductor: The AI is a conductor. Tools are instruments. The model cues the right instruments, blends them, and quiets them when not needed.
- Chefās Kitchen: The AI is a chef. Tools are utensils and appliances. It picks a whisk for batter, a knife for chopping, and a thermometer to check doneness, adjusting steps if the dish needs more time.
- Sports Playbook: The AI is a quarterback. Tools are plays. It reads the defense (task context), calls the right play (tool), audibles if needed (adapts), and sequences plays to drive downfield (multi-step plan).
Before vs After:
- Before: Fixed scripts, single-tool loops, brittle to new tools and tasks.
- After: Adaptive tool planning, coordinated multi-turn strategies, and zero-shot use of unseen tools.
Why it works (intuition, no equations):
- Good examples teach structure; RL teaches priorities and trade-offs; randomization forces real understanding. Together, they push the model to consider, "Do I need a tool? Which one? What next?"
- A multi-part reward (format + tool-call quality + final accuracy) keeps the model honest: follow the rules, use tools well, and end with a correct answer.
- Adaptive rewards make tools optional when you're sure, but helpful when you're uncertain, just like humans.
Building blocks (each with sandwich pattern):
- Dynamic Tool Orchestration 🍞 Hook: Like packing for a trip: you don't bring everything, just what you'll actually use. 🥬 The Concept: Choosing, ordering, and combining tools over several steps based on context and feedback. How it works: read the problem, try a tool, look at its output, update the plan, repeat until confident. Why it matters: Prevents wasted steps and wrong turns. 🍞 Anchor: In jigsaw, first detect the hole, then insert a candidate piece, inspect the result, and pick the best fit.
- Tool Cold Start (TC) 🍞 Hook: Think of training wheels that teach balance before you ride fast. 🥬 The Concept: Curated, multi-step examples that demonstrate correct tool use, reflection, and resilience to failure. How it works: create step-by-step demos, execute real tools to fill in inputs/outputs, and write the reasoning between steps. Why it matters: Without TC, the model struggles to discover good sequences on its own. 🍞 Anchor: The maze demo points to start/goal/holes, tests a path, redraws after errors, and then answers.
- Tool-GRPO (TG) 🍞 Hook: Like playing practice matches and learning from every round. 🥬 The Concept: A reinforcement learning method that compares multiple generated solutions and boosts the better ones. How it works: sample several multi-turn solutions, score each (format, tool use, final answer), and nudge the model toward the best (a small group-scoring sketch follows this list). Why it matters: Stabilizes adaptive behaviors like adopting A* for navigation but skipping it for verification. 🍞 Anchor: With TG, calls per sample for A* rise in navigation (useful) and fall in verification (distracting).
- Adaptive Learning (ADL) 🍞 Hook: Imagine all tool labels are stickered over; can you still pick the right one by reading the manual? 🥬 The Concept: Hide tool names with random strings and paraphrase descriptions so the model learns what a tool does, not just its name. How it works: randomize identifiers (Func_X7a2), reword descriptions while keeping meaning, and train across varied versions. Why it matters: Improves zero-shot generalization to new tools and new tasks. 🍞 Anchor: Even when A* appears only at test time with a new name, the model recognizes it as a pathfinder and uses it correctly.
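As referenced in the Tool-GRPO block above, here is a tiny sketch of the group-relative scoring idea behind GRPO-style training. The helper name and the toy reward numbers are assumptions for illustration, not the authors' training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Turn a group of rollout rewards into advantages relative to the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled multi-turn rollouts for the same maze question, each scored by
# format + averaged tool-call quality + final-answer accuracy (toy numbers).
rollout_rewards = [2.8, 0.0, 3.5, 1.2]
print(group_relative_advantages(rollout_rewards))
# Rollouts above the group average get positive advantages and are reinforced;
# the malformed rollout (reward 0.0) is pushed down.
```

The full method then nudges the policy toward the higher-advantage rollouts, turn by turn, which is what stabilizes the adaptive tool habits described above.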
03 Methodology
High-level recipe: Input (image + question + tool list) → Step A (Tool Cold Start data) → Step B (Tool-GRPO reinforcement learning) → Step C (Adaptive Learning throughout) → Output (an AI policy that plans and adapts tools across turns).
Step A: High-quality Trajectory Data Curation (Tool Cold Start)
- Abstract plans with reflection and robustness 🍞 Hook: When you learn math, good worksheets include easy problems, tricky ones, and places to check your work. 🥬 The Concept: Build ideal, human-like multi-turn plans for each task that include reflection, backtracking, and even tool failures. How it works:
- Write a blueprint: perception → planning → verification.
- Add Reflection and Backtracking so the model learns to notice mistakes and fix them.
- Add Explicit Tool Failure so the model practices switching to its own reasoning when a tool misbehaves. Why it matters: Without these, the model over-trusts tools and can't recover from hiccups. 🍞 Anchor: In VSP, if a drawn path hits a hole, the trajectory shows the model revising the route.
- Execute real tool calls 🍞 Hook: A recipe isn't real until you crack eggs and stir batter. 🥬 The Concept: Programmatically run each tool in the plan to fill in real inputs/outputs (a toy trajectory record is sketched after this list). How it works: call POINT to get coordinates, then DRAW2DPATH to overlay paths, etc. Why it matters: The model sees authentic tool I/O, not placeholders. 🍞 Anchor: In Jigsaw, DETECTBLACKAREA returns the missing region box, then INSERTIMAGE actually places each candidate patch.
- Chain-of-Thought (CoT) linking 🍞 Hook: Good teachers show their work so you can follow along. 🥬 The Concept: Generate the reasoning text that connects tool calls and decisions. How it works: use a strong LLM to write clear "why" between steps. Why it matters: Without CoT, the model copies actions without understanding. 🍞 Anchor: "I'll call A* because holes block the direct route; then I'll draw the path to check."
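Putting Step A together, one curated trajectory might be stored as a record like the sketch below once the real tool calls have been executed and the CoT text filled in. The schema, file names, and error message are illustrative assumptions; only the tool names (POINT, DRAW2DPATH) come from the paper's examples.

```python
# An illustrative sketch of one Tool Cold Start trajectory, including a
# reflection step and an explicit tool failure. Not the paper's data format.
trajectory = {
    "image": "vsp_maze_0042.png",
    "question": "Give a safe move sequence from Start to Goal.",
    "turns": [
        {"think": "First I should locate the start, the goal, and every hole.",
         "tool_call": {"name": "POINT", "arguments": {"targets": ["start", "goal", "holes"]}},
         "tool_result": {"start": [0, 0], "goal": [2, 2], "holes": [[1, 0], [0, 2]]}},
        {"think": "Going down first would hit the hole at [1, 0]; try going right first and verify by drawing.",
         "tool_call": {"name": "DRAW2DPATH", "arguments": {"moves": "R,D,D,R"}},
         "tool_result": {"error": "rendering timeout"}},  # explicit tool failure
        {"think": "The drawing tool failed, so I re-check the route myself: R,D,D,R avoids both holes.",  # reflection, fall back to own reasoning
         "response": "R,D,D,R"},
    ],
}
```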
Step B: Multi-turn Tool-GRPO (Reinforcement Learning)
Overview: Sample several complete multi-turn answers; score each; update toward higher-scoring ones. Two key designs: multi-turn reward accumulation and adaptive tool reward.
- Multi-turn Reward Accumulation 🍞 Hook: In a relay race, every runner must pass the baton correctly, or the whole team is disqualified. 🥬 The Concept: The reward checks every turn for correct format, averages tool-call quality, and gives accuracy credit only at the end. How it works:
- Format Reward (all-or-nothing across turns): if any turn breaks the structure (<think>, then either <tool_call> or <response>), reward = 0 for the whole run.
- Tool Reward (0-4 per tool turn): score structure, valid name, correct parameter names, and sensible parameter values; average across tool turns.
- Accuracy Reward (final turn): 1 if the final answer is right, else 0. Why it matters: Keeps multi-step reasoning neat, precise, and goal-focused. One sloppy turn spoils the run. 🍞 Anchor: In VSP, a pathfinding attempt with malformed JSON gets zeroed out, training the model to respect tool-call syntax.
- Adaptive Reward (tools-as-needed) 🍞 Hook: When you're sure of an answer, you don't need a calculator. When you're unsure, using one is smart. 🥬 The Concept: Give full credit for correct answers whether or not tools were used, but when the final answer is wrong, award partial credit for informative tool use and penalize blind guesses. How it works: asymmetric reward; concise success is fine, but if you fail, you'd better have tried helpful tools (a toy reward sketch follows this list). Why it matters: Prevents tool spamming and encourages thoughtful, uncertainty-aware tool use. 🍞 Anchor: In GUIQA, if the model gets it wrong but used CROP+OCR well, it gets some credit; if it guessed without tools, it doesn't.
- Emergent adaptive behaviors (from TG)
- Adopt helpful tools: A* calls per sample climb for navigation.
- Discard irrelevant tools: A* calls fade for verification.
- Modulate frequency: POINT is used more in navigation than verification.
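As referenced in the reward blocks above, here is a toy sketch of how the multi-turn reward might be computed. The field names, weights, and the adaptive partial-credit scale are assumptions for illustration; the paper's exact reward shaping may differ.

```python
def score_tool_turn(turn):
    """0-4 points: structure, valid tool name, correct parameter names, sensible values."""
    return sum([turn.get("well_formed", False),
                turn.get("valid_tool_name", False),
                turn.get("valid_param_names", False),
                turn.get("sensible_param_values", False)])

def trajectory_reward(turns, final_correct):
    # Format reward is all-or-nothing: one malformed turn zeroes the whole rollout.
    if not all(t.get("format_ok", False) for t in turns):
        return 0.0
    tool_turns = [t for t in turns if t.get("is_tool_call", False)]
    tool_reward = (sum(score_tool_turn(t) for t in tool_turns) / len(tool_turns)
                   if tool_turns else 0.0)
    accuracy = 1.0 if final_correct else 0.0
    # Adaptive, asymmetric credit (assumed scale): a correct answer earns full
    # accuracy reward with or without tools, but a wrong answer earns a little
    # extra only if informative tools were actually tried; a blind wrong guess gets nothing extra.
    adaptive = 0.5 * (tool_reward / 4.0) if (not final_correct and tool_turns) else 0.0
    return accuracy + tool_reward + adaptive
```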
Step C: Adaptive Learning for Generalization (applied in both A and B)
- Token-level randomization 🍞 Hook: If every wrench is labeled with a random code, a real mechanic still picks the right one by shape and use. 🥬 The Concept: Replace all tool and parameter names with random strings (e.g., Func_X7a2, Para_9a7y). How it works: strip away name hints so the model reads the description to infer function (see the randomization sketch after this list). Why it matters: Beats overfitting to names like "Calculator." 🍞 Anchor: The model still picks the pathfinder tool from its description of "shortest obstacle-free path," even with a gibberish name.
- Semantic-level paraphrasing 🍞 Hook: Teachers can explain the same rule in many different words. 🥬 The Concept: Rephrase tool and parameter descriptions while keeping the meaning. How it works: generate diverse, equivalent documentation text. Why it matters: Makes the model robust to different manuals and APIs. 🍞 Anchor: Whether the docs say "Draw a 2D path" or "Overlay a route using U/D/L/R," the model understands it's the same action.
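As referenced in the Step C list, here is a minimal sketch of both adaptive-learning operations: swapping names for random identifiers and picking a paraphrased description. The helper names and the paraphrase list are assumptions for illustration.

```python
import random
import string

def randomize_tool_spec(tool):
    """Hide the tool's identity: random name/parameter aliases plus a reworded description."""
    rand = lambda k: "".join(random.choices(string.ascii_letters + string.digits, k=k))
    aliased_params = {f"Para_{rand(4)}": desc for desc in tool["parameters"].values()}
    return {
        "name": f"Func_{rand(4)}",                          # token-level randomization
        "description": random.choice(tool["paraphrases"]),  # semantic-level paraphrasing
        "parameters": aliased_params,
    }

astar_tool = {
    "name": "ASTAR",
    "parameters": {"start": "start cell as [row, col]", "goal": "goal cell as [row, col]"},
    "paraphrases": [
        "Finds the shortest obstacle-free path between two grid cells.",
        "Given a start and a goal on a grid, returns a safe route that avoids holes.",
    ],
}
print(randomize_tool_spec(astar_tool))  # e.g. {'name': 'Func_X7a2', 'description': ..., 'parameters': {...}}
```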
Concrete example (VSP navigation): Input: grid image with start/goal/holes; tools include POINT, ASTAR, DRAW2DPATH (names randomized). Process: model POINTs start/goal/holes → ASTAR proposes path → DRAW2DPATH overlays route → model inspects → final path string. Output: "R,R,U,L,D,D ..." in a boxed answer.
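That process can be pictured as a simple agent loop. This is a hedged sketch of the control flow only; the policy interface, message schema, and stubbed tool registry are assumptions, and the real tools would run on a separate tool server.

```python
def run_episode(policy, image, question, tools, max_turns=8):
    """Alternate model turns and tool executions until the model answers or the budget runs out."""
    history = [{"role": "user", "image": image, "question": question,
                "tool_specs": [t["spec"] for t in tools.values()]}]
    for _ in range(max_turns):
        step = policy.generate(history)              # yields <think> plus either a tool call or a response
        history.append({"role": "assistant", "content": step})
        if "response" in step:                       # the model decided it knows enough to answer
            return step["response"], history
        call = step["tool_call"]                     # e.g. POINT, ASTAR, DRAW2DPATH (possibly renamed)
        result = tools[call["name"]]["run"](**call["arguments"])
        history.append({"role": "tool", "content": result})  # feed the tool output back for the next turn
    return None, history
```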
04 Experiments & Results
The Test: What they measured and why
- Visual Spatial Planning (VSP & VSPO): can the model plan safe routes (Navigation) and check paths (Verification)? These require multi-step perception + planning + verification.
- Jigsaw: can it detect the hole, try pieces, and judge the best fit? This probes visual compositional reasoning.
- GUIQA and WebMMU: can it focus on the right UI region and extract text/actions? Tests perception-to-action grounding.
- Visual Search (V*): tougher, open scenes; checks general visual ability.
The Competition: Baselines
- Direct SFT (plain supervised fine-tuning), Direct GRPO (rule-based RL), and prior tool-planning agents like DeepEyes and PixelReasoner.
- Strong closed models (e.g., GPT-5, Claude Sonnet 4, Gemini 2.5 Flash) and open models (Qwen2.5-VL 3B/7B/32B/72B, InternVL3 78B).
The Scoreboard (with context)
- With AdaReasoner's full recipe (TC + TG + Adaptive Learning), the 7B model jumps by roughly +24.9 points on average across demanding benchmarks.
- On VSP/VSPO (spatial planning), accuracy rises from the low 30s to the mid-to-high 90s, like going from a C to an A+.
- On Jigsaw (structured visual reasoning), AdaReasoner-7B reaches about 88.6% and leads tool-planning models, like solving almost 9 out of 10 puzzles correctly.
- On GUI/Web tasks, it delivers solid gains; while open-ended tasks can still reward sheer model size, tools add meaningful reliability.
- Against GPT-5 + tools, AdaReasoner surpasses it on multiple structured tasks (e.g., VSP and Jigsaw), showing that smarter tool planning can beat raw scale.
Surprising Findings
- Tools shift the bottleneck: once tool planning is learned, model size matters less; tool quality matters more. Smaller open models perform near or above bigger closed models.
- Zero-shot tool use: when A* appears only at test time, the model quickly adopts it for navigation but avoids it where it doesn't help, a behavior that reinforcement learning then stabilizes.
- Frequency control: the model naturally dials up or down how often it calls each tool per task, without being explicitly told how many times to call them.
Concrete stats glimpses (mirroring paper trends)
- Adoption: A* calls per sample > 1.0 for navigation after RL, driving navigation scores into the mid-to-high 90s.
- Discarding: A* usage decays toward zero for verification, keeping verification near-perfect (~99%).
- Reliability: High tool execution success on jigsaw (roughly 98-100%) even with new tool definitions, indicating genuine understanding rather than memorization.
What this means
- The system learns ātool senseā: pick what helps, ignore what doesnāt, and check your work.
- Generalization improves: randomized tool names and paraphrased docs force real comprehension, letting the model operate in new toolboxes and new tasks.
05 Discussion & Limitations
Limitations
- Unseen tools need stabilization: while the model can adopt brand-new tools at test time, consistent and safe usage usually improves further with RL exposure.
- Tool quality becomes the ceiling: if a perception tool is inaccurate, the policy may inherit those errors despite perfect planning.
- Open-ended tasks still favor scale: in very broad, ambiguous settings with no deterministic tools, sheer model size can retain an edge.
Required Resources
- A tool server or equivalent setup to host perception/manipulation/planning tools, including heavier expert models for OCR or detection.
- Compute for two stages: supervised fine-tuning on curated trajectories and RL (Tool-GRPO) with multi-sample rollouts.
- A curated dataset of multi-turn traces that include reflection and tool failures.
When NOT to Use
- If the task is trivial and tools add overhead (e.g., obvious answers where tools only slow response).
- If tools are unreliable or unavailable (e.g., latency or API instability) causing more harm than help.
- If strict privacy forbids sending images/text to external tools and no on-device equivalents exist.
Open Questions
- Automatic tool discovery: can the agent propose new tools it wishes it had, or learn capabilities end-to-end without explicit APIs?
- Safety and robustness: how to detect and recover from adversarial or buggy tool outputs at scale?
- Cost/latency trade-offs: how to optimize for speed and price while preserving accuracy (e.g., learned budgets for tool calls)?
- Beyond vision: how well does this extend to audio/video streams and real-world robotics with noisy sensors?
- Reward shaping: what other multi-turn rewards best encourage verification, calibration, and minimal hallucination?
06 Conclusion & Future Work
Three-sentence summary
- AdaReasoner turns tool use into a general reasoning skill: the model learns which visual tools to call, when to call them, and how to compose them across many steps.
- It achieves this with a three-part recipe: curated multi-turn examples (with reflection and failures), Tool-GRPO reinforcement learning, and Adaptive Learning that hides names and paraphrases docs.
- The result is strong, adaptive, and generalizable tool planning that lifts smaller open models to state-of-the-art performance on structured visual reasoning.
Main achievement
- Showing that dynamic tool orchestration, rather than model size alone, can dominate performance on multi-step visual tasks, enabling zero-shot use of new tools and robust cross-task generalization.
Future directions
- Broaden the toolset (e.g., richer planners, 3D geometry, temporal video tools), add self-checkers for tool reliability, and learn cost-aware policies that balance speed, accuracy, and price.
- Extend to real-world agents (robots, UI assistants) and streaming inputs, where timing and feedback loops matter.
Why remember this
- AdaReasoner reframes visual reasoning: with the right training signals, AI can behave like a thoughtful problem solver, adopting helpful tools, discarding distractions, and verifying its own work, bringing practical, reliable progress to everyday visual tasks.
Practical Applications
- Smart map helpers that plan safe walking routes in puzzles or educational apps and visually verify the path.
- Shopping assistants that crop product areas and use OCR to extract prices and buttons accurately from screenshots.
- Document processors that detect blank regions, insert signatures or stamps, and check alignment automatically.
- Robotics planning where the agent identifies obstacles, computes routes, and rechecks plans after sensor updates.
- Accessibility tools that focus on key UI regions and read labels aloud with high accuracy.
- Education apps that guide students through multi-step visual reasoning (e.g., geometry proofs with drawn helpers).
- Quality control in imaging (e.g., comparing a target layout to a proposed fix by inserting and visually checking parts).
- Web automation agents that find the right button or link, verify it with OCR, and then propose safe actions.
- Game agents that coordinate perception (locate items), planning (shortest path), and verification (draw and check route).
- Data labeling assistants that pre-locate objects, overlay suggestions, and ask humans to confirm with fewer clicks.