Thinking with Images via Self-Calling Agent
Key Summary
- This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
- Instead of mixing pictures and words at every step (which is hard to train), the model breaks a problem into tiny image tasks for its 'subagents' and then reads their text results.
- This new method is called Self-Calling Chain-of-Thought (sCoT), and it turns multimodal reasoning into a language-only reasoning path with tool-like calls.
- The training uses reinforcement learning (GRPO) with smart rewards that only count tool calls made before the final answer, which prevents reward hacking.
- On tough tests with huge images (V* and HR-Bench 4K), sCoT beats the best interleaved method (DeepEyes) by up to 1.9 points while using about 75% less GPU time.
- The gains come from learning when and how to call subagents, not from changing low-level vision skills like OCR or grounding.
- Strict formats for tool calls (task type, prompt, and bounding box) help the model learn richer strategies; removing them makes it lazy and repetitive.
- Training works best with fine-grained visual data (like tiny regions and charts) and gets worse if too much abstract text-only reasoning data is added.
- Masking subagent outputs during learning keeps the main agent from copying answers and forces it to learn better planning.
- The big idea is simple: plan in words, see in small focused views, and combine the pieces. This makes image reasoning faster, cheaper, and stronger.
Why This Research Matters
Many real-life questions are visual but small and specific, like reading a tiny label or confirming an object's position. This work shows we can train AI to handle such tasks better and cheaper by planning in words and only looking when needed. That means more capable assistants for students, analysts, travelers, and inspectors without requiring giant training budgets. The method avoids complex, fragile image-text interleaving and instead uses clean, reward-friendly language reasoning. It also reduces hallucinations by encouraging evidence-seeking via targeted subcalls. Overall, it points to a practical, scalable path for trustworthy, high-resolution visual reasoning.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're solving a big Where's Waldo puzzle with a friend. If you both try to talk and point and flip pages at the same time, you get confused. But if one of you calmly says, "You check the top-left corner, I'll check the bottom-right," you finish faster.
The Concept (Chain-of-Thought, CoT): What it is: Chain-of-Thought is a way for AI to show its steps, like writing a scratchpad of thoughts before answering. How it works: 1) The model writes down small steps. 2) It checks each step to guide the next step. 3) It uses the steps to reach a final answer. Why it matters: Without a clear chain, the AI can jump to wrong answers with no way to fix mistakes. Anchor: When asked "What's 27 + 35?", a CoT might write: "27 + 30 = 57; add 5 = 62," then answer 62.
Hook: You know how comics use both pictures and speech bubbles to tell a story? That's like computers using more than just text.
The Concept (Multimodal Machine Learning): What it is: Multimodal learning lets AI understand different kinds of data (like images and text) together. How it works: 1) The AI reads text. 2) It looks at images. 3) It connects what it sees with what it reads. Why it matters: Without combining sources, it can't answer visual questions like "What color is the sign?" Anchor: If you ask "What animal is on this poster?" the AI needs both the image (to see the animal) and the text (to read labels).
Hook: Think of a cooking show where the chef talks while showing close-ups of chopping and frying; the story switches between words and pictures.
The Concept (Interleaved Multimodal Chain-of-Thought, iMCoT): What it is: iMCoT is a reasoning style where the AI flips back and forth between text steps and looking at images. How it works: 1) The AI writes a thought. 2) It zooms or reads part of the image. 3) It writes the next thought using what it saw. 4) Repeat. Why it matters: Without interleaving, the AI might miss key visual details; but tightly mixing words and pictures makes training hard and data-hungry. Anchor: To answer "Is the bell above or below the clock?", iMCoT alternates: think → zoom to the clock → think → zoom to the bell → answer.
Hook: Training a puppy is easier if you have lots of treats and simple commands; complicated tricks are harder to teach.
The Concept (Reinforcement Learning, RL): What it is: RL teaches AI by rewarding good actions and not rewarding bad ones. How it works: 1) The AI tries something. 2) It gets a reward or not. 3) It changes its strategy to get more rewards next time. Why it matters: Without rewards, the AI can't learn which reasoning paths are helpful. Anchor: If the AI answers a visual question correctly and used helpful steps, RL gives it points, so it learns to repeat that behavior.
The World Before: AI models could caption images or read text in pictures (OCR). With iMCoT, they could even 'think with images' by mixing steps of text and visual checks. But there's a catch: training that mix with RL needs lots of high-quality example traces that show exactly how to interleave images and text. Those traces are scarce and messy, so models often learned slowly, got confused when many images were involved, or wasted compute on long mixed chains.
The Problem: Incentivizing iMCoT with RL is hard because: 1) There's not enough clean, interleaved reasoning data. 2) The AI must keep the story straight across two modalities, which is cognitively heavy. 3) Long mixed chains increase memory and compute costs. The result: models underuse tools, struggle with multi-image situations, and cost too much to train.
Failed Attempts: People tried: 1) Handcrafted tool workflows (fixed recipes). These were brittle and didn't generalize. 2) Heavier interleaving with more visual peeks. This often made chains longer and costlier, not smarter. 3) In-context examples. Helpful, but still didn't solve the data and reward issues.
The Gap: We needed a way to keep the benefits of 'thinking with images' without forcing text and images to interleave at every step: a way to let RL focus on language planning while still getting accurate visual details.
The Real Stakes: In real life, you zoom into receipts to read totals, crop a corner of a chart to read labels, or look closely at a building plaque to find its location. If AI can cheaply and reliably do that (plan in words, see in small bites, and stitch the answers), we get faster assistants for homework, data analysis, travel, safety audits, and more.
02 Core Idea
Hook: Picture a coach who stays on the sidelines making a game plan while sending players to handle specific plays, then combines their reports to call the final move.
The Concept (Self-Calling Chain-of-Thought, sCoT): What it is: sCoT is a way for an AI to keep its reasoning purely in language while 'calling' its own virtual copies to perform tiny visual tasks, then reading their short text replies. How it works: 1) The main agent writes a plan in words. 2) It sends focused subtasks (like OCR or grounding) to its subagents with a task type, a prompt, and a bounding box. 3) Subagents return short text results. 4) The main agent aggregates these results and answers. Why it matters: Without sCoT, training must optimize mixed image-text steps; with sCoT, RL can reward clean language plans while still getting accurate visual info via subcalls. Anchor: To answer "Where was this photo taken?", the main agent asks a subagent to read a monument plaque (OCR), gets back the place name as text, then answers.
The Aha! Moment (one sentence): Keep the 'thinking' all in words, and treat looking at the image as small, tool-like calls to yourself.
Three Analogies:
- Teacher and helpers: The teacher (main agent) assigns tiny worksheet problems (subtasks) to helpers (subagents), collects their answers, and writes the final solution.
- Chef and prep cooks: The chef plans the dish; prep cooks peel, chop, and measure in small bowls; the chef assembles the final plate.
- Detective and lab: The detective drafts a theory, sends samples to the lab for specific tests, then updates the conclusion using the lab's written reports.
Before vs After:
- Before (iMCoT): The model constantly flips between 'thinking' and 'looking,' which is hard to reward and costly to train.
- After (sCoT): The model thinks in text the whole time and only 'calls' quick, isolated visual checks whose results arrive as short text, making rewards simple and learning stable.
Hook: Imagine you can solve a maze by discussing a plan with a friend; no need to stare at the map the whole time.
The Concept (Language-only Reasoning): What it is: Planning and deciding using only words. How it works: 1) Write down a plan. 2) Decide what small info you need. 3) Ask for that info. 4) Update the plan and conclude. Why it matters: Without a clean language thread, rewards get tangled with images; with language-only, RL can focus on better planning. Anchor: "First, read the sign; second, check the color of the car; finally, combine both to answer."
Hook: Think of a video game where your character can clone themselves to fetch items while you stay at the base planning the route.
The Concept (Subagents and Virtual Replicas): What it is: Subagents are pretend copies of the same model that handle tiny, local visual tasks one at a time. How it works: 1) The main agent creates a subtask with a task type, prompt, and bounding box. 2) A subagent (same weights) processes only that crop. 3) It returns a short text result. Why it matters: Without subagents, the main agent must juggle everything at once; with them, it delegates and stays focused on reasoning. Anchor: "Subtask: OCR this corner (x1, y1, x2, y2). Result: 'Feodosia'."
Why It Works (intuition):
- Clean rewards: RL can reward good plans and proper tool use without mixing image tokens into the chain.
- Modularity: Each subtask is simple, so the model is less likely to get lost.
- Efficiency: Short, isolated image calls save compute compared to long mixed chains.
- Generalization: Because the main logic learns to plan and delegate, it transfers better across tasks.
Building Blocks:
- Main agent: writes the plan and final answer.
- Subagents (virtual replicas): do tiny visual tasks independently.
- Tool calling protocol: every call includes task type, prompt, and bounding box.
- Slightly enlarged crops: bounding boxes are expanded a bit to include helpful context.
- Reward design: accuracy + format + a tool bonus that only counts if tools are used before the answer.
- GRPO (a type of RL optimizer): stabilizes learning across many reasoning attempts.
- Masking subagent text during training: prevents the model from just copying and forces real planning.
Anchor: On a clock tower photo, the main agent asks for grounded boxes for 'bell' and 'clock,' gets back coordinates and labels as text, and then answers, "The bell is below the clock."
03 Methodology
High-level recipe: Input (question + image) → Plan in words (main agent) → Self-call subagents with (task type, prompt, bounding box) → Get short text replies → Aggregate and answer.
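To make the recipe concrete, here is a minimal Python sketch of that loop. The `model.generate(...)` interface and the `<call>{...}</call>` / `<answer>...</answer>` tags are illustrative assumptions, not the paper's exact implementation; the real system formats each call with a task type, prompt, and bounding box as described in the steps below.

```python
import json
import re

from PIL import Image

CALL_RE = re.compile(r"<call>(.*?)</call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def scot_answer(model, image: Image.Image, question: str, max_calls: int = 8) -> str:
    """Main agent plans purely in text; subagents (same weights) see only small crops."""
    transcript = f"Question: {question}\n"
    for _ in range(max_calls):
        step = model.generate(transcript)                   # language-only reasoning step
        transcript += step
        answer = ANSWER_RE.search(step)
        if answer:                                          # main agent has concluded
            return answer.group(1).strip()
        call = CALL_RE.search(step)
        if call is None:                                    # neither a call nor an answer: keep thinking
            continue
        args = json.loads(call.group(1))                    # {"task_type": ..., "prompt": ..., "bbox": [...]}
        crop = image.crop(tuple(args["bbox"]))              # subagent sees only this (slightly enlarged) region
        reply = model.generate(args["prompt"], image=crop)  # isolated subagent execution, same weights
        transcript += f"\n[{args['task_type']} result] {reply}\n"
    return ""  # no answer produced within the call budget
```

Note that the subagent reply enters the transcript as plain text; this is exactly the text that later gets masked out of the training loss.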
Step-by-step:
- Understand and plan (main agent)
- What happens: The main agent reads the question and drafts a language-only plan: what to check, in what order, and why.
- Why it exists: Without a plan, the agent might guess or waste calls. Planning aligns steps with rewards.
- Example: "To find the location, first read the plaque (OCR), then check nearby signs (caption/grounding), then combine clues."
- Make a subtask (tool calling protocol)
- What happens: The main agent issues a self-call with three arguments: (a) task type (like OCR, caption, grounding), (b) prompt (the small question), (c) bounding box (the crop coordinates). The crop is slightly enlarged by mixing the box with the full image canvas (small alpha), so the subagent sees a bit of context.
- Why it exists: Clear, constrained calls help the model learn rich, structured strategies; unconstrained calls cause lazy, repeated patterns.
- Example: task_type="OCR"; prompt="Read this text"; bbox=[1101, 1092, 1331, 1214] → returns "Feodosia".
- Isolated subagent execution
- What happens: The subagent (a virtual replica with the same weights) looks only at the crop and answers in short text using a simple system prompt.
- Why it exists: Isolation keeps subtasks simple and prevents cross-talk; each subagent focuses on one bite-sized visual question.
- Example: "What object is in this crop?" → "A bell."
- Aggregate results (main agent)
- What happens: The main agent reads all subagent text replies and reasons, still in words, to produce the final answer.
- Why it exists: The model needs to stitch the pieces into one coherent conclusion.
- Example: "Plaque text = Feodosia; style = memorial; therefore location = Feodosia memorial."
- Train with reinforcement learning (GRPO)
- What happens: The model rolls out multiple reasoning attempts (language-only sCoT) and receives a reward = accuracy + format + tool bonus (only if a tool was used before the answer); a minimal sketch of this reward appears after this list. GRPO updates the policy to favor better reasoning and proper tool timing. Subagent outputs are masked during loss so the model can't just memorize them.
- Why it exists: Rewards guide planning and delegation; GRPO stabilizes learning across sampled attempts; masking avoids reward leakage and copying.
- Example: If the answer is correct, formatted well, and includes at least one tool call before </answer>, the model gets extra points and learns to repeat that sequence.
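Here is a minimal sketch of that reward, assuming the `<call>` and `<answer>` tags from the earlier sketch, an exact-match stand-in for the accuracy check (the paper scores answers with an LLM-as-judge), and illustrative weights.

```python
def scot_reward(trace: str, prediction: str, gold: str,
                w_acc: float = 1.0, w_fmt: float = 0.1, w_tool: float = 0.1) -> float:
    """Reward = accuracy + format + tool bonus, where the tool bonus only counts
    calls that appear BEFORE the final answer (this blocks post-hoc fake tool calls)."""
    accuracy = float(prediction.strip().lower() == gold.strip().lower())  # stand-in for LLM-as-judge
    well_formed = float("<answer>" in trace and "</answer>" in trace)
    answer_pos = trace.find("<answer>")
    tool_before_answer = answer_pos != -1 and "<call>" in trace[:answer_pos]
    return w_acc * accuracy + w_fmt * well_formed + w_tool * float(tool_before_answer)
```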
The Secret Sauce (what's clever):
- Reformulation: Turn a hard mixed-modality problem into a clean language-only planning task plus tiny vision calls.
- Strict protocol: Force (task type, prompt, bbox) to encourage diverse, meaningful strategies.
- Ordering rule: Only reward tool use that happens before the final answer; this blocks reward hacking (e.g., adding fake tool calls after answering).
- Masking: Don't backpropagate through subagent replies; this keeps the main agent honest about planning.
Sandwich explanations of key components:
Hook: Like handing a librarian a well-formed request card: what you need, where to look, and why. The Concept (Tool Calling Protocol): What it is: A fixed format for each subcall: task type, prompt, bounding box. How it works: 1) Pick task type. 2) Write the tiny question. 3) Choose the crop box (slightly enlarged). Why it matters: Without structure, the model collapses to boring, one-call habits and misses complex reasoning. Anchor: "OCR: 'Read the small label'; bbox=[x1,y1,x2,y2]" returns just the label text.
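As a concrete illustration, a structured call could be represented and validated like this; the field names mirror the protocol above, while the allowed task list and the validation rules are assumptions for the sketch.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_TASKS = {"OCR", "caption", "grounding"}  # task types named in the text; the full set may differ

@dataclass
class SubagentCall:
    task_type: str   # what kind of tiny visual task to run
    prompt: str      # the small question for the subagent
    bbox: List[int]  # [x1, y1, x2, y2] crop coordinates in image pixels

    def validate(self) -> None:
        """Reject malformed calls so only complete, well-structured ones earn the format reward."""
        if self.task_type not in ALLOWED_TASKS:
            raise ValueError(f"unknown task type: {self.task_type}")
        if not self.prompt.strip():
            raise ValueError("empty prompt")
        if len(self.bbox) != 4 or self.bbox[0] >= self.bbox[2] or self.bbox[1] >= self.bbox[3]:
            raise ValueError(f"invalid bounding box: {self.bbox}")

# Example: an OCR call using the crop coordinates from the step-by-step example above.
SubagentCall(task_type="OCR", prompt="Read the small label", bbox=[1101, 1092, 1331, 1214]).validate()
```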
Hook: When taking a picture of a bug, you frame a bit of the leaf too, so you don't miss context. The Concept (Bounding Box Enlargement): What it is: Slightly expand the crop toward the canvas to include helpful context. How it works: Interpolate the crop with the full canvas by a small factor (alpha). Why it matters: Without context, subagents can misread cut-off text or miss nearby clues. Anchor: A crop of a street sign includes a sliver of the pole or wall to interpret shadows and fonts.
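A minimal sketch of such an enlargement, assuming simple linear interpolation between the crop and the full canvas; the exact blending rule and alpha value in the paper may differ.

```python
from typing import Sequence, Tuple

def enlarge_bbox(bbox: Sequence[float], width: int, height: int,
                 alpha: float = 0.1) -> Tuple[int, int, int, int]:
    """Blend the crop box toward the full canvas [0, 0, width, height] by a small
    factor alpha, so the subagent sees a sliver of surrounding context."""
    x1, y1, x2, y2 = bbox
    return (int((1 - alpha) * x1),                  # top-left corner moves slightly toward (0, 0)
            int((1 - alpha) * y1),
            int((1 - alpha) * x2 + alpha * width),  # bottom-right corner moves slightly toward the canvas corner
            int((1 - alpha) * y2 + alpha * height))

# Example: the OCR crop from Section 03 grows a little on every side of a 4096x3072 image.
print(enlarge_bbox([1101, 1092, 1331, 1214], width=4096, height=3072))
```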
Hook: A coach improves a team by comparing several plays at once and nudging the strategy that did better. The Concept (Group Relative Policy Optimization, GRPO): What it is: An RL optimizer that compares outcomes across a group of sampled attempts to guide updates. How it works: 1) Sample multiple reasoning traces. 2) Score them. 3) Shift the policy toward the relatively better ones. Why it matters: Without GRPO-style comparisons, updates can be unstable or slow. Anchor: The model tries several sCoT plans; the ones that got correct answers with proper tool timing are favored.
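A minimal sketch of the group-relative part of GRPO: standardize each rollout's reward against the group sampled for the same question. The clipped policy-gradient update and KL regularization used in practice are omitted here.

```python
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Score each sampled sCoT trace relative to its group by standardizing the reward;
    traces that beat the group mean get positive advantages and are reinforced."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four rollouts for one question; the two high-reward traces are pushed up,
# the two low-reward traces are pushed down.
print(group_relative_advantages([1.2, 0.1, 1.1, 0.0]))
```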
Hook: You don't get credit for using a calculator after you've already handed in your test. The Concept (Reward Ordering Constraint): What it is: The tool-use bonus counts only if tools are used before the final answer. How it works: Add an indicator that checks tool calls appear before </answer>. Why it matters: Without it, the model might cheat by adding fake calls afterward (reward hacking). Anchor: The model must read the plaque before answering "Feodosia," not after.
Hook: If your friend whispers an answer to you, your teacher won't give you points for showing your own work. The Concept (Masking Subagent Outputs): What it is: Don't train on the text that subagents return; only train on what the main agent writes. How it works: Mask those tokens from the loss. Why it matters: Without masking, the model could just copy subagent text instead of learning to plan. Anchor: The model learns to ask good questions, not just to parrot replies.
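A PyTorch-style sketch of the masking, assuming each token in the rollout is tagged with who produced it; the tagging mechanism and the exact loss are illustrative, not the paper's implementation.

```python
from typing import List

import torch

def build_loss_mask(token_roles: List[str]) -> torch.Tensor:
    """1.0 for tokens the main agent wrote (plan, calls, final answer), 0.0 for tokens
    returned by subagents, so only the main agent's own text contributes to the loss."""
    return torch.tensor([1.0 if role == "main" else 0.0 for role in token_roles])

def masked_nll(token_logprobs: torch.Tensor, token_roles: List[str]) -> torch.Tensor:
    """Average negative log-likelihood over main-agent tokens only."""
    mask = build_loss_mask(token_roles)
    return -(token_logprobs * mask).sum() / mask.sum().clamp(min=1.0)

# Example: the subagent's reply tokens are excluded from the objective.
roles = ["main", "main", "subagent", "subagent", "main"]
print(masked_nll(torch.log(torch.tensor([0.5, 0.4, 0.9, 0.9, 0.6])), roles))
```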
04 Experiments & Results
The Test: The researchers measured visual reasoning on very large images (2K–8K resolution) where important details are tiny, using two tough benchmarks: V* and HR-Bench 4K/8K. These stress whether a model can zoom in, read small text (OCR), ground objects, and combine clues.
The Competition: They compared sCoT (SubagentVL, built on Qwen2.5-VL-7B) to: 1) DeepEyes (an iMCoT method trained with RL), 2) strong baselines like GPT-4o and Qwen2.5-VL, and 3) hand-crafted workflows (SEAL, DyFo, ZoomEye).
The Scoreboard (with context):
- V*: sCoT scores about 91.6 overall, roughly "an A," beating DeepEyes by about 1.2 points. That's like winning a race by a clear stride when others are neck-and-neck.
- HR-Bench 4K: sCoT improves by up to about +1.9 points over DeepEyes, turning some close "B+" runs into solid "A-" results.
- Compute: sCoT needs about 75% fewer GPU hours (about one-quarter the compute), which is like finishing homework faster with fewer pencils and erasers.
- On 8K, sCoT remains competitive even when details get extremely tiny.
Surprising Findings:
- General visual skills like OCR and grounding barely change; the big win is in planning and delegation: learning when and how to call subagents.
- Hallucinations (making things up) actually improve a bit, likely because the model is trained to seek evidence via targeted subcalls.
Training Dynamics (what changed over time):
- Stage 1: Fewer tool calls but rising rewards; initially, the model tries to solve problems directly while it learns the rules.
- Stage 2: More tool calls and a swift reward jump; now it realizes targeted subcalls pay off.
- Stage 3: Stable, richer calling patterns; consistent, coordinated delegation emerges.
Ablations (what mattered most):
- Protocol constraints (must provide task type, prompt, and bbox): Removing them causes the model to collapse into one-call, low-diversity habits that score worse. Keeping them leads to better, more varied strategies.
- Reward ordering: Without the "tools must come before answer" rule, the model tries to game the system by adding tool calls after answering (reward hacking). The rule stops this and stabilizes learning.
- Data mix: High-resolution, fine-grained and chart data stabilize training and keep scores high. Adding too much abstract reasoning data (not focused on images) hurts performance because it distracts the model from region-based perception and tool use.
Bottom line: Measuring only what matters (correct answers, clean format, and correctly timed tool use) and structuring calls tightly (task type, prompt, bbox) make sCoT both stronger and cheaper than tightly interleaved methods.
05 Discussion & Limitations
Limitations:
- Modest perception gains: Low-level vision skills (OCR, grounding) improve only slightly because the RL mainly targets planning and delegation, not raw perception weights.
- Latency from sequential calls: Subtasks run one after another (not in parallel), so long chains can add time on very complex problems.
- Base model dependence: sCoT assumes the base VLM already does simple subtasks (OCR, captioning) well; weak bases limit benefits.
- Data sensitivity: Performance drops when training data emphasizes abstract text-only reasoning instead of fine-grained visual details.
- Still needs careful rewards: The ordering rule and format checks are crucial; sloppy rewards risk hacking or unstable learning.
Required Resources:
- A capable vision-language base model (e.g., 7B-level) with OCR/grounding competence.
- RL training framework supporting GRPO-style rollouts, token masking, and LLM-as-judge rewards.
- High-resolution visual datasets with fine-grained questions.
When NOT to Use:
- Pure text tasks with no visual grounding; simpler language models may suffice.
- Real-time edge scenarios with strict latency budgets and many subtasks.
- Settings lacking any reliable high-res visual data or where subtask isolation makes context too narrow.
Open Questions:
- Can we parallelize subagent calls safely to cut latency without breaking causal consistency?
- How far can sCoT scale to video and multi-image stories while keeping rewards clean?
- Can we adaptively learn the best crop sizes instead of using a fixed enlargement rule?
- How to blend small, targeted visual training to nudge perception skills while keeping the language-only planning benefits?
- Can we auto-craft better subtask prompts and types over time (learned protocols) without losing structure?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces Self-Calling Chain-of-Thought (sCoT), which keeps all planning in language while calling virtual copies of the model to perform small, isolated visual tasks. With strict call formats, a reward that only counts tools used before the answer, and masking of subagent outputs, reinforcement learning cleanly incentivizes planning and delegation. The result is better accuracy on hard, high-resolution benchmarks using about 75% less compute than interleaved methods.
Main achievement: Reformulating multimodal reasoning into a language-only trajectory with self-calling subagents, so the model plans clearly in text and uses vision as focused, tool-like lookups, yielding both higher performance and much lower training cost.
Future directions: Parallelize calls to reduce latency; extend to video and multi-image narratives; learn adaptive crop sizing and smarter call protocols; carefully add perception tuning without losing planning focus.
Why remember this: sCoT shows that you don't have to think in pictures to think with pictures. Plan in words, look in small bites, and stitch the pieces. That simple shift makes training cheaper, behaviors cleaner, and visual reasoning stronger, pointing to a scalable path for AI that sees and thinks effectively.
Practical Applications
- Document assistants that read specific fields (totals, dates, signatures) from high-resolution scans using targeted OCR subcalls.
- Data analysts who query charts by asking focused subagents to read axes, legends, and small annotations before computing answers.
- Museum or landmark guides that extract plaque text and icon positions to explain history and locations reliably.
- Quality control in manufacturing where subagents check tiny defects or labels on ultra-high-resolution product images.
- Traffic and safety monitoring that confirms sign text, light status, or object positions by grounding targeted regions.
- Education helpers that walk students through visual puzzles step-by-step, showing how to plan and verify with small crops.
- Accessibility tools that read small on-screen UI text or labels and describe what's in key regions of an image.
- E-commerce image QA that validates product specs on packaging (size, ingredients) with strict crop-based OCR.
- Scientific figure readers that extract numbers and legends from plots to support reproducible analysis.
- Legal/evidence review tools that locate and transcribe specific parts of scanned contracts or photos, ensuring auditability.