
A4-Agent: An Agentic Framework for Zero-Shot Affordance Reasoning

Intermediate
Zixin Zhang, Kanghao Chen, Hanqing Wang et al. · 12/16/2025
arXiv · PDF

Key Summary

  • This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.
  • It splits the job into thinking (what part to use) and pointing (exactly where it is), which makes the system both smarter and more precise.
  • A4-Agent adds an imagination step that edits the image to show a likely interaction (like a hand on a handle), helping the model reason better.
  • The three roles are Dreamer (imagine the action), Thinker (decide the correct object part), and Spotter (pinpoint the pixels).
  • Because each role uses powerful pre-trained models, the whole system works in a zero-shot way and generalizes well to new objects and scenes.
  • On hard benchmarks like ReasonAff, RAGNet, and UMD, A4-Agent reaches state-of-the-art results, beating models that were specially trained for the task.
  • This modular design is easy to upgrade: swap in a better vision-language model, detector, or segmenter, and it gets stronger without retraining.
  • The method is interpretable: we can see the imagined images, the chosen part in text, and the final mask, making debugging and trust easier.
  • It can help robots and apps understand not just what things are, but how to use them safely and effectively in real life.

Why This Research Matters

A4-Agent helps machines do more than just recognize objects—it helps them use objects correctly, which is crucial for robots, AR assistants, and accessibility tools. Its zero-shot design means it can handle new tools and environments without expensive retraining, making deployment faster and cheaper. The imagination step improves understanding of tricky instructions, which boosts safety and reliability in real-world tasks. Its modular design lets teams upgrade parts independently, keeping pace with rapid progress in AI models. The approach is interpretable, so people can inspect the imagined scene, the chosen part, and the final mask to build trust. In everyday life, this translates to smarter home robots, better industrial automation, and helpful guidance for people in unfamiliar settings.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re in a new kitchen and someone says, “Open the fridge.” Even if the fridge looks different from the one at home, you still know to pull the handle.

🥬 The Concept (Affordance Prediction):

  • What it is: Affordance prediction means figuring out how you can use an object, and especially which exact part you should touch to do a task.
  • How it works: 1) Read the instruction (like “open the fridge”), 2) Understand which part matters (the handle), 3) Find that part’s exact spot in the image.
  • Why it matters: Without it, a robot (or app) could know there’s a fridge but not where to pull, push, or press, so tasks would fail.

🍞 Anchor: If the task is “turn on the lamp,” affordance prediction highlights the switch or button on the lamp, not the lampshade.

🍞 Hook: You know how you first think about what to do, and then you actually do it?

🥬 The Concept (High-Level Reasoning):

  • What it is: High-level reasoning is the brainy part that interprets the instruction and decides the correct object part to use.
  • How it works: 1) Read the words, 2) Match them with what you see, 3) Pick the correct part (e.g., handle vs. door surface).
  • Why it matters: Without reasoning, the system might pick a random part, like focusing on the refrigerator’s logo instead of its handle.

🍞 Anchor: Given “open the microwave,” reasoning selects the door button or handle, not the display screen.

🍞 Hook: After you decide what to do, you still need to point exactly where.

🥬 The Concept (Low-Level Grounding):

  • What it is: Low-level grounding is precisely locating the chosen part in the pixels of the image.
  • How it works: 1) Get a hint (like “the handle”), 2) Find its region (box/points), 3) Trace its boundaries (mask).
  • Why it matters: Without grounding, instructions stay vague, and a robot arm wouldn't know exactly where to move.

🍞 Anchor: On “press the elevator button,” grounding draws a tight outline around the correct button, not the whole panel.

🍞 Hook: Picture learning what a kiwi is by hearing a description, not by seeing thousands of photos.

🥬 The Concept (Zero-Shot Learning):

  • What it is: Zero-shot learning means doing new tasks or recognizing new things without getting specific training examples for them.
  • How it works: 1) Learn general knowledge from big, broad data, 2) Read a new instruction, 3) Apply your general knowledge to act right away.
  • Why it matters: Without zero-shot ability, models would need retraining for every new tool or scene.

🍞 Anchor: If you’re told, “Use the squeegee to wipe the window,” zero-shot learning lets the system figure out what part to hold and where to swipe—even if it’s never seen that exact squeegee before.

🍞 Hook: Think of a friend who’s great at both reading and looking at pictures at the same time.

🥬 The Concept (Vision-Language Models):

  • What it is: Vision-language models (VLMs) understand images and text together.
  • How it works: 1) Read the instruction, 2) Look at the image, 3) Connect words to visual details to reason about the scene.
  • Why it matters: Without VLMs, models can’t smartly connect “open,” “refrigerator,” and “handle” across both text and picture.

🍞 Anchor: When asked, “Where would you hold the mug to drink?”, a VLM helps choose “the handle of the mug.”

🍞 Hook: Imagine a detective first circling suspects in a photo before taking a careful look.

🥬 The Concept (Object Detection):

  • What it is: Object detection finds and names things in an image by drawing boxes around them.
  • How it works: 1) Scan the image, 2) Propose likely objects, 3) Label and box them.
  • Why it matters: Without detection, the system may not even know where the fridge or mug is to begin with.

🍞 Anchor: Detectors quickly box the “door handle” area before a finer step cleans up the shape.

🍞 Hook: Coloring inside the lines is different from just pointing at the general area.

🥬 The Concept (Segmentation):

  • What it is: Segmentation is coloring exactly which pixels belong to the target part.
  • How it works: 1) Take a hint (like a box or a point), 2) Expand to the exact outline, 3) Produce a precise mask.
  • Why it matters: Without segmentation, a robot might grab part of the door and the wall at the same time.

🍞 Anchor: For “hold the spoon by the bowl,” segmentation outlines just the spoon’s bowl pixels.

The world before this paper: Many systems tried to do everything in one end-to-end model—both the thinking (reasoning) and the pointing (grounding) at once. These models needed lots of training on special datasets, didn’t generalize well to new objects, and struggled to balance understanding with precision. People also tried fine-tuning big models to output masks directly, but they still ran into trade-offs: getting better at reasoning could make the pixel outlines worse, and vice versa. The gap: we needed a way to keep the brainy part and the pixel-precise part strong—without forcing them into one tangled system—and to work zero-shot in the real world. That’s the space A4-Agent fills, with a decoupled, modular design that uses imagination to make reasoning clearer and grounding easier.

02 Core Idea

🍞 Hook: You know how coaches split a team into roles—strategy, playmaking, and finishing—so everyone does what they do best?

🥬 The Concept (Decoupling):

  • What it is: Decoupling means separating the “think about what to do” part from the “pinpoint exactly where” part.
  • How it works: 1) First choose the right object part in words, 2) Then find that part’s exact pixels with vision tools.
  • Why it matters: Without decoupling, a single model must juggle too much at once, hurting both reasoning and precision.

🍞 Anchor: Like deciding “turn the oven knob” first (plan), then precisely grasping that knob (action).

Aha! moment in one sentence: If we let different expert models each do their specialty—imagine, reason, and localize—then we can do zero-shot affordance prediction better than one big model trained end-to-end.

Multiple analogies:

  • School project: One kid sketches (Dreamer), one writes the plan (Thinker), one builds the model (Spotter).
  • Cooking: The chef imagines the dish, the sous-chef chooses the ingredients, the line cook plates with precision.
  • Sports: The strategist draws the play, the playmaker sets up the pass, the striker places the ball exactly.

🍞 Hook: Imagine drawing a cartoon panel showing a hand on a handle before you explain how to open the door.

🥬 The Concept (A4-Agent):

  • What it is: A4-Agent is a three-stage, training-free framework for affordance prediction: Dreamer (imagine), Thinker (decide), Spotter (locate).
  • How it works: 1) Dreamer edits the image to show a likely interaction, 2) Thinker reads the instruction plus both images to choose the correct part in text, 3) Spotter turns that text into precise boxes, points, and a segmentation mask.
  • Why it matters: Without this split-and-coordinate design, the system either misunderstands the task or misses the exact pixels.

🍞 Anchor: For “preheat the oven,” A4-Agent imagines a hand on the dial, decides “the knob of the oven,” then masks exactly that knob.

🍞 Hook: Pretend you act out the move before you actually do it.

🥬 The Concept (Dreamer – imagination-assisted reasoning):

  • What it is: Dreamer uses a generative image editor to visualize a plausible interaction (e.g., a hand grasping a handle).
  • How it works: 1) Build a short, physically plausible edit prompt from the task and image, 2) Edit the image to add the interaction, 3) Produce the imagined scene.
  • Why it matters: Without seeing an example interaction, the Thinker can misinterpret vague instructions like “open,” “turn,” or “lift.”

🍞 Anchor: For “turn on the faucet,” Dreamer adds a hand pressing the right handle; that picture guides the next step.

🍞 Hook: Think of a good reader who explains exactly which part of a picture matters.

🥬 The Concept (Thinker – semantic decision):

  • What it is: Thinker is a vision-language model that decides which object part matches the task and outputs it in simple text (e.g., “the handle of the bucket”).
  • How it works: 1) Look at both original and imagined images, 2) Reason about the instruction, 3) Output a clean, machine-readable description of the part.
  • Why it matters: Without a clean decision in words, the next step can’t target the right region.

🍞 Anchor: Given “hold the wine glass,” the Thinker outputs “the stem of the wine glass.”

🍞 Hook: After picking the right part in words, someone still needs to point exactly where to touch.

🥬 The Concept (Spotter – pixel-precise grounding):

  • What it is: Spotter turns the text description into exact boxes, keypoints, and masks using detectors and segmenters.
  • How it works: 1) Detect with open-vocabulary text (e.g., “door handle”), 2) Get boxes and points, 3) Refine with segmentation to a crisp mask.
  • Why it matters: Without Spotter, the answer remains a sentence, not a place you can actually act on.

🍞 Anchor: For “press the microwave button,” Spotter outlines only the correct button pixels.

Before vs. After:

  • Before: One big model tries to both understand the task and draw pixel-perfect masks; performance and generalization suffer.
  • After: Separate experts do what they do best—and coordinating them zero-shot beats trained baselines across datasets.

Why it works (intuition):

  • Imagination reduces ambiguity, text reasoning locks in the correct part, and coarse-to-fine grounding nails the pixels.
  • Each piece uses the strongest pre-trained model for that job, so the whole is better than any single model.

Building blocks:

  • Generative image editor for Dreamer (makes plausible interaction scenes)
  • Vision-language model for Thinker (decides object part in text)
  • Open-vocabulary detector + segmentation for Spotter (boxes/points → precise mask)
  • Simple interfaces between parts (prompts in, text out, boxes/points in, masks out)

03 Methodology

High-level recipe: Input (image + instruction) → Dreamer (imagine interaction) → Thinker (choose object part in text) → Spotter (boxes/points → mask) → Output (affordance regions).
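
Below is a minimal sketch of this recipe in Python. The function and parameter names (`predict_affordance`, `dreamer_edit`, `thinker_decide`, `spotter_ground`) are illustrative assumptions rather than the paper's actual code; each stage is injected as a callable wrapping the underlying editor, VLM, detector, and segmenter.

```python
# Minimal sketch of the A4-Agent recipe. The stage callables are hypothetical
# wrappers around an image editor, a VLM, a detector, and a segmenter; names
# and signatures are assumptions, not the paper's implementation.

def predict_affordance(image, instruction,
                       dreamer_edit, thinker_decide, spotter_ground):
    """Return affordance regions for `instruction` on `image` (zero-shot)."""
    # Stage 1 (Dreamer): imagine a plausible interaction in the scene.
    imagined = dreamer_edit(image, instruction)

    # Stage 2 (Thinker): decide which object part to act on, as clean text.
    decision = thinker_decide(image, imagined, instruction)
    part_text = decision["object_part"]   # e.g. "the handle of the refrigerator"

    # Stage 3 (Spotter): ground the text coarse-to-fine into boxes, points, a mask.
    boxes, points, mask = spotter_ground(image, part_text)
    return {"part": part_text, "boxes": boxes, "points": points, "mask": mask}
```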

Step A: Dreamer – Imagine how to operate

  • What happens: The system creates a short, careful prompt (e.g., “Edit the input image to show a right hand grasping the vertical refrigerator handle, photorealistic, keep others unchanged.”) and uses a generative editor to add the plausible interaction into the original image.
  • Why this step exists: Many instructions are ambiguous (“open,” “turn,” “lift”). Seeing a likely interaction (a hand on the right spot) disambiguates which part matters. Without it, the Thinker can choose the wrong part.
  • Example with data: Input image: a fridge with double doors. Task: “Open the refrigerator.” Dreamer’s edited image: adds a hand pulling the right door handle. This directly hints that the handle is the actionable part.
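
A hedged sketch of how a Dreamer-style stage might assemble its edit prompt. `draft_interaction` (a model that drafts a short interaction phrase) and `edit_image` (the generative editor call) are hypothetical helpers; in the pipeline sketch above they would be bound in advance, e.g. with `functools.partial`, so the stage only takes the image and instruction.

```python
# Illustrative Dreamer stage. `draft_interaction` and `edit_image` are
# hypothetical helpers (a VLM drafting a short interaction phrase and a
# generative editor applying it); neither name comes from the paper.

def dreamer_edit(image, instruction, draft_interaction, edit_image):
    # e.g. for "Open the refrigerator" the drafted phrase might be
    # "grasping the vertical refrigerator handle".
    interaction = draft_interaction(image, instruction)
    prompt = (f"Edit the input image to show a right hand {interaction}, "
              "photorealistic, keep everything else unchanged.")
    return edit_image(image, prompt)
```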

Step B: Thinker – Decide what to operate

  • What happens: A vision-language model reads the original image, the imagined image, and the task, then outputs a simple JSON with fields like task, object_name, and object_part (e.g., “the handle of the refrigerator”). The model’s free-form “thinking” isn’t used—only the clean output JSON.
  • Why this step exists: Converting the visual situation plus instruction into a clear text target keeps the next stage focused. Without it, the Spotter might search too broadly or latch onto wrong regions.
  • Example with data: For “Can you preheat the oven for me?”, Thinker returns: {"task": "…", "object_name": "oven", "object_part": "the knob of the oven"}.
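
One way the Thinker step could be wired up, assuming a hypothetical `vlm_chat` callable that accepts images plus a text prompt and returns a string; the prompt wording and JSON extraction below are illustrative, not the paper's exact implementation.

```python
import json

# Sketch of the Thinker stage. `vlm_chat` is a hypothetical interface to a
# vision-language model; its name and signature are assumptions.

THINKER_PROMPT = (
    "Given the task, the original image, and the imagined interaction image, "
    'answer ONLY with JSON of the form '
    '{"task": ..., "object_name": ..., "object_part": ...}.'
)

def thinker_decide(original, imagined, instruction, vlm_chat):
    raw = vlm_chat(images=[original, imagined],
                   prompt=f"{THINKER_PROMPT}\nTask: {instruction}")
    # Keep only the clean JSON; any free-form reasoning around it is discarded.
    return json.loads(raw[raw.find("{"): raw.rfind("}") + 1])
```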

Step C: Spotter – Locate where to operate (coarse-to-fine)

  • What happens: 1) Open-vocabulary detection uses the text description (e.g., “the knob of the oven”) to propose bounding boxes and keypoints for the target. 2) A segmentation model refines those hints into a crisp pixel mask.
  • Why this step exists: Segmentation shines with good prompts (boxes/points) but not with raw text. Detection bridges text to geometry; segmentation draws perfect boundaries. Without detection, segmentation can’t be reliably aimed; without segmentation, final masks are too rough.
  • Example with data: Description: “the handle of the bucket.” Detector proposes 2 boxes; keypoints land near the handle center; segmentation returns a precise handle mask.
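
A sketch of the coarse-to-fine Spotter under the same assumptions: `detect` stands for an open-vocabulary detector and `segment` for a promptable (SAM-style) segmenter, with made-up signatures, return shapes, and a placeholder score threshold.

```python
# Coarse-to-fine grounding sketch; `detect` and `segment` are hypothetical
# wrappers, not real library APIs.

def spotter_ground(image, part_text, detect, segment, score_thresh=0.3):
    """Text description -> boxes/points -> precise mask."""
    # 1) Coarse: the detector proposes boxes for the open-vocabulary text query.
    boxes, scores = detect(image, text_query=part_text)
    boxes = [b for b, s in zip(boxes, scores) if s >= score_thresh]

    # 2) Keypoints (here simply box centers) serve as extra prompts.
    points = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]

    # 3) Fine: promptable segmentation refines boxes/points into a crisp mask.
    mask = segment(image, boxes=boxes, points=points)
    return boxes, points, mask
```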

🍞 Hook: Like tracing a shape with a pencil outline before coloring it in.

🥬 The Concept (Coarse-to-Fine Grounding):

  • What it is: First get a rough location (box/point), then refine to exact boundaries (mask).
  • How it works: 1) Detector → boxes/points, 2) Segmenter → mask.
  • Why it matters: Skipping the coarse step often leads to messy masks; skipping the fine step leaves sloppy boundaries.

🍞 Anchor: Box the faucet area, then segment the exact handle pixels you press.

Interface details (kept simple and robust):

  • Dreamer input: image + instruction → short edit prompt → edited image
  • Thinker input: (original image, edited image, instruction) → JSON text of object and part
  • Spotter input: JSON text → detector outputs boxes + points → segmentation outputs mask

The Secret Sauce:

  • Imagination-assisted reasoning: The edited image injects concrete, physically plausible contact cues into the reasoning process, boosting accuracy, especially for tricky verbs (press, twist, scoop).
  • Decoupled modularity: Each stage can be upgraded independently (better editor, better VLM, better detector/segmenter) without retraining the whole pipeline.
  • Interpretability: We can inspect the imagined image, the textual part choice, and the final mask to debug why a decision was made.

Failure handling and what breaks without each step:

  • Without Dreamer: The Thinker can misread instructions; accuracy drops on ambiguous actions.
  • Without Thinker: The Spotter may chase the wrong text target and find the wrong region.
  • Without Spotter’s two-step grounding: Either the boxes are too coarse or the masks are unreliable.

Concrete walk-through:

  • Input: Image of a sink and faucet; Task: “Turn on the water.”
  • Dreamer: Adds a hand pressing the correct faucet handle.
  • Thinker: Outputs “the handle of the faucet.”
  • Spotter: Detects handle region → refines to exact handle mask → returns {box, keypoint, mask} for action.
  • Output: A precise segmentation mask ready for a robot gripper or AR overlay.
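
To see how the pieces compose, here is a toy run of the `predict_affordance` sketch above, with stub stages standing in for the real editor, VLM, detector, and segmenter; every value is a placeholder, not output from the paper's models.

```python
# Toy end-to-end run of the predict_affordance sketch with stub stages.

def stub_dreamer(image, instruction):
    return image  # pretend the imagined interaction was added

def stub_thinker(image, imagined, instruction):
    return {"task": instruction, "object_name": "faucet",
            "object_part": "the handle of the faucet"}

def stub_spotter(image, part_text):
    box = (120, 40, 180, 90)                    # made-up pixel coordinates
    return [box], [(150.0, 65.0)], "mask-placeholder"

result = predict_affordance("sink_and_faucet.jpg", "Turn on the water.",
                            stub_dreamer, stub_thinker, stub_spotter)
print(result["part"])   # -> "the handle of the faucet"
```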

04 Experiments & Results

🍞 Hook: When you take a test, you don’t just want a score—you want to know what that score means.

🥬 The Concept (IoU-style Metrics):

  • What it is: IoU and related scores measure how much the predicted mask overlaps with the true mask.
  • How it works: 1) Count the overlap area (intersection), 2) Divide by total covered area (union), 3) Higher is better. Variants: gIoU (average overlap), cIoU (dataset-level cumulative overlap), P@50 (percent of predictions with IoU ≥ 0.5), P@50:95 (stricter average precision across thresholds).
  • Why it matters: Without good overlap, a model could be “close” but still miss the target—bad for real tasks.

🍞 Anchor: If you predict the microwave’s wrong button, your overlap score drops a lot.
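
A small sketch of these overlap metrics computed on boolean masks; gIoU here is the per-sample IoU averaged over the dataset and cIoU the cumulative (summed-intersection over summed-union) variant, though the benchmarks' exact definitions may differ in details.

```python
import numpy as np

# Overlap metrics on boolean masks, following the descriptions above.

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def evaluate(pred_masks, gt_masks):
    ious = np.array([iou(p, g) for p, g in zip(pred_masks, gt_masks)])
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    thresholds = np.linspace(0.5, 0.95, 10)         # 0.50, 0.55, ..., 0.95
    return {
        "gIoU":    ious.mean(),                     # average per-sample overlap
        "cIoU":    inter / union,                   # dataset-level cumulative overlap
        "P@50":    (ious >= 0.5).mean(),            # fraction with IoU >= 0.5
        "P@50:95": np.mean([(ious >= t).mean() for t in thresholds]),
    }
```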

The tests and why: The authors evaluated A4-Agent on three major benchmarks: ReasonAff (complex instructions needing deep reasoning), RAGNet (large-scale reasoning-based affordance segmentation with subsets 3DOI and HANDAL), and UMD Part Affordance (classic affordance types over everyday tools). They also tried open-world images to test generalization, because real life is messier than datasets.

The competition: A4-Agent was compared with open-vocabulary segmenters (like OVSeg, SAN), end-to-end MLLM segmenters (like AffordanceLLM, LISA), two-stage MLLM+SAM pipelines (like Seg-Zero, Vision-Reasoner, Affordance-R1), and strong open VLMs (Qwen2.5-VL, InternVL3). Many baselines are fine-tuned for this task, while A4-Agent is zero-shot.

The scoreboard with context:

  • ReasonAff: A4-Agent reached about 70.5 gIoU and 64.6 cIoU, with strong precision (P@50 ≈ 75.2, P@50:95 ≈ 55.2). That’s like getting an A when others get B’s, beating even specialized trained models like Affordance-R1.
  • RAGNet (3DOI): A4-Agent hit about 63.9 gIoU and 58.3 cIoU, roughly 24 gIoU points higher than a strong baseline (Vision-Reasoner). On HANDAL-easy and HANDAL-hard, it also led the pack (≈61.1/61.7 gIoU and ≈61.0/59.6 cIoU), which is like winning not just one race, but the sprint and the marathon too.
  • UMD: A4-Agent achieved ≈65.4 gIoU and ≈59.8 cIoU, with high precision (P@50 ≈ 77.3, P@50:95 ≈ 43.8), outpacing fine-tuned methods by big margins. That’s like acing both the basic skills test and the tricky questions.

Surprising findings:

  • Imagination helps a lot: Adding Dreamer (the think-with-imagination step) boosted performance for both open and closed models; in some cases, an open model with imagination beat a stronger closed model without it.
  • Modular upgrades pay off: Swapping in a more powerful VLM (e.g., GPT-4o) or a stronger segmenter improved results without retraining—evidence that decoupling is not just neat, but useful.
  • Even without training: The zero-shot setup still topped specialized, fine-tuned baselines on multiple datasets, showing that coordinating foundation models can beat end-to-end training for complex reasoning + grounding.

Qualitative highlights:

  • Novel objects and scenes: The system handled tools and layouts beyond standard kitchens—choosing the right screwdriver tip, identifying a projector’s key control, or picking a slotted spoon to drain water.
  • Interpretable chain: You can see the imagined hand placement, read the chosen part (e.g., “the stem of the wine glass”), and view the precise mask—making behavior easier to trust and debug.

05 Discussion & Limitations

Limitations:

  • Computation and latency: Running a generator (Dreamer), a large VLM (Thinker), and detector+segmenter (Spotter) can be slow, which is tough for real-time robotics.
  • Dependence on pre-trained models: If the generator imagines an implausible interaction or the VLM misreads context, downstream steps may falter.
  • Occlusion and tiny parts: Very small or heavily occluded parts (e.g., a tiny latch) remain challenging, even with coarse-to-fine grounding.
  • Domain shifts: Highly unusual tools or non-standard affordances (e.g., broken handles or improvised tools) can still confuse the system.

Required resources:

  • Access to strong foundation models (a capable image editor, a high-quality VLM, a robust open-vocabulary detector, and a strong segmenter like modern SAM variants).
  • Moderate-to-high compute and memory; batching and caching help, but edge devices may struggle.

When NOT to use:

  • Hard real-time control loops (fast-moving robots) where milliseconds matter.
  • Safety-critical actions without additional verification (e.g., medical tools) because imagination or grounding errors could be risky.
  • Environments with strict privacy or compute limits that prevent using large models.

Open questions:

  • How to make imagination safer and more physically grounded (e.g., contact mechanics, kinematics-aware edits)?
  • Can we make stronger guarantees about the VLM’s output (format, reliability) under heavy noise or adversarial scenes?
  • How to unify 2D affordance masks with 3D understanding for better manipulation planning?
  • Can we learn lightweight adapters that keep modularity but reduce latency without full end-to-end retraining?

06 Conclusion & Future Work

Three-sentence summary: A4-Agent is a training-free, three-stage framework that predicts where to interact with objects from just an image and instruction, by separating reasoning from grounding. It adds an imagination step to visualize plausible interactions, helping a vision-language model choose the correct part and a detector+segmenter to localize it precisely. The result is strong zero-shot performance that beats fine-tuned baselines across multiple benchmarks and generalizes to open-world scenes.

Main achievement: Proving that decoupling (Dreamer–Thinker–Spotter) plus imagination-assisted reasoning can outperform monolithic, fine-tuned systems for affordance prediction.

Future directions:

  • Faster, lighter components for real-time use; physics-aware imagination to reduce implausible edits; and integration with 3D perception for robust manipulation.
  • Stronger reliability checks (e.g., self-consistency, multi-view reasoning) and better on-device performance.

Why remember this: A4-Agent shows that coordinating the right experts—imagine, decide, and locate—can unlock zero-shot, accurate, and interpretable affordance reasoning that’s ready to help robots and apps act effectively in the real world.

Practical Applications

  • Home robots that can follow natural language commands like “open the dishwasher” or “turn on the lamp” and act on the exact part.
  • AR assistants that highlight the correct knob, button, or handle for tasks like setting an oven or adjusting a camera.
  • Factory co-bots that localize safe grasp points on new parts and tools without retraining.
  • Assistive technology that guides users to the right part to press, turn, or hold in unfamiliar devices.
  • Warehouse picking systems that identify the correct graspable region on novel items.
  • Maintenance and repair guidance that pinpoints exact screws, latches, or release tabs for disassembly.
  • Educational apps that teach tool usage by showing the actionable part and demonstrating interaction.
  • Remote support where a technician’s instruction is turned into an on-screen highlight for the user’s camera view.
  • Household inventory and safety checks that localize parts like child-locks or gas valves for quick verification.
  • Robotic manipulation planners that use precise masks to compute safe, stable grasps or contact points.
#affordance prediction · #zero-shot learning · #vision-language models · #generative image editing · #open-vocabulary detection · #segmentation · #imagination-assisted reasoning · #grounding · #robotic manipulation · #multimodal reasoning · #coarse-to-fine localization · #interpretable AI · #agentic framework · #Dreamer-Thinker-Spotter
Version: 1