XR: Cross-Modal Agents for Composed Image Retrieval
Key Summary
- XR is a new, training-free team of AI helpers that finds images using both a reference picture and a short text edit (like “same jacket but red”).
- Instead of one big guess, XR works in steps: imagine the goal, do wide-but-smart matching, then double-check facts before deciding.
- Three agent types share the work: imagination agents draft target captions, similarity agents score candidates from text and image views, and question agents verify details.
- XR mixes signals from both words and pixels using Reciprocal Rank Fusion, so strong hints from either side can lift the right images to the top.
- In tests on three popular benchmarks (FashionIQ, CIRR, CIRCO), XR beats strong systems, with gains up to about 38% over baselines.
- Ablations show every agent matters: remove any piece and accuracy drops, especially on fine-grained edits like color or sleeve length.
- XR works without extra training, travels well across domains, and prefers medium-size multimodal models for a great speed–accuracy tradeoff.
- Asking a small set of true/false questions (around three) is enough to catch tricky mistakes and keep results faithful to the user’s edit.
- Setting a modest candidate pool (about 100) balances coverage and cost, and fusing text and image with a light text weight (lambda≈0.15) works best.
Why This Research Matters
Online shopping, education, and everyday search often need “same-but-with-a-twist” results, like keeping a product’s style while changing color or features. XR’s team-based approach respects both the original look (from the image) and the requested change (from the text), so users actually get what they asked for. Because XR is training-free and relies on general-purpose backbones, it is easier to deploy and adapt across domains. Its verification step builds trust: the system can check whether crucial edits are truly present before ranking results. This reduces wasted clicks, returns, and frustration, and makes multimodal search feel more like collaborating with a careful assistant. Over time, the same imagine–match–verify recipe can power richer assistants that handle video, audio, and interactive queries.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you show a friend a photo of a jacket and say, “I want this, but in red with a hood”? It’s not just finding any red jacket—it has to look like your photo AND match your changes.
🥬 Filling (The Actual Concept)
- What it is: Composed Image Retrieval (CIR) is a way to search for images using both a reference picture and a short text edit (for example, “same shoes but white laces”).
- How it works (old world): Before this paper, most systems fused image and text into a single embedding and picked the closest images in one shot.
- Why it matters: Without careful reasoning, systems mix up edits (like color vs. pattern), lose tiny details (like “long sleeves”), or ignore visual hints the text didn’t mention.
🍞 Bottom Bread (Anchor) Imagine uploading a blue dress photo and typing “make it red with long sleeves.” You want a red, long-sleeved version of that same style—not just any red dress.
The World Before
- Search engines and shops mostly depended on text-only keywords or simple image matching. If you typed “red,” they might return any red thing, even if it looked nothing like your reference.
- CIR appeared to fix this by letting you combine an image with a small text edit. But most methods still used a single embedding (a compact vector) or generated one caption and matched once, which often missed fine-grained edits.
The Problem
- Three popular strategies struggled:
- Joint embedding: Put the “image+text” into one vector and match. But it blurs delicate edits like “remove stripes but keep the same collar.”
- Caption-to-Image: First write one target caption from the query, then match to images. This can drop the tiny yet important visual cues.
- Caption-to-Caption: Compare only texts (candidate captions vs. target caption), throwing away useful visual evidence.
- None of these had a careful, step-by-step checker to confirm the requested edits actually appear in the final choice.
🍞 Top Bread (Hook) Imagine building a LEGO model by guessing once and never checking. If your guess is off, the whole structure is wrong.
🥬 Filling: Cross-Modal Reasoning
- What it is: Cross-modal reasoning means connecting clues from different types—like pictures and words—so they agree.
- How it works: Look at the image for visual details, read the text for requested changes, and make them fit together.
- Why it matters: If the text says “add a hood,” the final image must actually show a hood, not just be “kind of similar.”
🍞 Bottom Bread (Anchor) A “blue shirt → make it red, keep the logo” should return a red shirt with the same logo. Text explains the change; the picture preserves style.
Failed Attempts and the Gap
- One-shot similarity is brittle: it weighs every clue in a single pass, then commits. If anything gets lost (like sleeve length), there’s no second chance.
- Single-modality views (only text or only image) ignore half the story.
- Missing piece: a progressive, multi-step process that imagines the goal, matches from multiple angles, and then verifies facts across text and image.
Real Stakes
- E-commerce: “Same sneakers but with black soles” should give you the same silhouette with the requested change—saves time and returns what customers truly want.
- Personal media and education: Find the exact photo version you need (“same classroom, but with the projector on”).
- Search engines: More precise answers when users mix pictures and short edits.
🍞 Top Bread (Hook) Imagine a school project team: one teammate drafts, another shortlists, the last fact-checks. That’s reliable.
🥬 Filling: Progressive Retrieval Process
- What it is: A step-by-step pipeline that refines results as it learns more from the query.
- How it works: 1) Imagine the target, 2) Coarsely filter many candidates, 3) Finely verify details.
- Why it matters: Each stage catches mistakes from the previous one, so fewer errors sneak through.
🍞 Bottom Bread (Anchor) First, sketch the ideal jacket (draft). Next, pick the 100 closest options (shortlist). Finally, check “is it truly red, with a hood?” (verify).
02 Core Idea
🍞 Top Bread (Hook) You know how a coach assigns positions—striker, defender, goalie—so a soccer team wins more often? Teamwork beats solo.
🥬 Filling (The Actual Concept)
- What it is: XR’s key insight is to turn image retrieval into a team sport: specialized agents imagine, match, and verify, working step by step.
- How it works: 1) Imagination agents create target captions from both text and image perspectives, 2) Similarity agents score candidates from text and visual angles, 3) Question agents ask true/false checks to confirm the edits are really there.
- Why it matters: Without this team and process, tiny edits get lost, and results drift away from the user’s actual intent.
🍞 Bottom Bread (Anchor) “Same backpack but green, keep front pocket.” XR imagines the green-pocket version, shortlists likely images, then checks: is it green, and does it still have the pocket?
The “Aha!” Moment in One Sentence
Treat retrieval as progressive, cross-modal reasoning with multiple agents that first imagine the goal, then combine multi-view similarities, and finally verify facts.
Three Analogies
- Chef Kitchen: One chef imagines the dish from a recipe and a photo (imagination), sous-chefs gather the best ingredients (coarse matching), and a taster checks the final flavor notes (verification).
- Detective Work: Profile the suspect (imagination), round up likely suspects from multiple clues (coarse), then verify alibis line by line (fine).
- School Newspaper: Writer drafts (imagination), editor shortlists stories (coarse), fact-checker verifies names, dates, quotes (fine).
Before vs After
- Before: One embedding or one caption, one pass, and hope for the best.
- After: A deliberate loop—imagine → shortlist with hybrid scores → verify facts → re-rank. The final list aligns better with both the reference look and the text edit.
Why It Works (Intuition, no equations)
- Dual imagination (text-based and vision-based) captures different strengths: words explain changes cleanly; images preserve tiny visual cues.
- Multi-perspective similarity lets strong hints from either modality lift a candidate—if text screams “add hood,” or image look-alike is great, both can help.
- Verification questions convert fuzzy similarity into concrete checks (“Is it red?”), shutting the door on near-misses.
- Reciprocal Rank Fusion (RRF) merges rankings instead of raw scores, so a noisy scale in one view can’t wreck the final order.
Building Blocks (each with a Sandwich)
- 🍞 You know how artists sketch before painting? 🥬 Imagination Agents
- What it is: Two agents write target captions—one guided by text+reference caption, one by text+reference image.
- How it works: They generate C_t (text-perspective caption) and C_v (vision-rich caption), plus lists of edits/attributes (M_t, M_v).
- Why it matters: A clearer target prevents early mistakes from snowballing. 🍞 Anchor: From a blue sneaker + “make it red, keep white sole,” they produce captions describing a red sneaker with a white sole.
- 🍞 Imagine trying both “matches the description” and “looks like the picture” when shopping. 🥬 Similarity Agents
- What it is: One agent matches via text space; another matches via image–text space.
- How it works: Each scores candidates against C_t and C_v from both text and visual views, then XR fuses them.
- Why it matters: One view alone misses clues; two views catch more. 🍞 Anchor: An image slightly off in color but perfect in style can still rank high if another view vouches for it.
- 🍞 Think of a science fair judge asking yes/no questions to check claims. 🥬 Question Agents
- What it is: Agents create a few true/false checks from the edit list (M_t, M_v, and user text).
- How it works: They ask both the candidate image and its caption, awarding points only if both agree.
- Why it matters: It stops near-misses that “look” right but fail the exact edit. 🍞 Anchor: “Does it have long sleeves?” If the image says no, that candidate drops.
- 🍞 Friends vote on where to eat; combining their rankings is fairer than averaging star ratings. 🥬 Reciprocal Rank Fusion (RRF)
- What it is: A way to merge ranked lists from different views.
- How it works: Convert each view’s scores into ranks, then sum the reciprocal of each rank (plus a smoothing constant), so top positions count most.
- Why it matters: Different views use different scales; ranks normalize them. 🍞 Anchor: If text-view ranks an image 2nd and image-view ranks it 5th, the combo still keeps it near the top.
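To make the fusion concrete, here is a minimal Python sketch of Reciprocal Rank Fusion, assuming each view hands over a dict of candidate-id → similarity score. The smoothing constant `k = 60` is a common default in the RRF literature, not a value taken from the paper.

```python
def rrf_fuse(views, k=60):
    """Merge several {candidate: score} dicts into one ranking via reciprocal ranks."""
    fused = {}
    for scores in views:
        # Turn raw scores into ranks within this view (1 = best).
        ranked = sorted(scores, key=scores.get, reverse=True)
        for rank, cand in enumerate(ranked, start=1):
            # Each view contributes 1 / (k + rank); top positions count most.
            fused[cand] = fused.get(cand, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)  # best candidate first

# Toy example: the text view likes B best, the image view likes A best.
text_view = {"A": 0.71, "B": 0.90, "C": 0.40}
image_view = {"A": 0.95, "B": 0.10, "C": 0.80}
print(rrf_fuse([text_view, image_view]))  # ['A', 'B', 'C']: A wins on combined ranks
```

Because only rank positions enter the sum, a view whose raw scores sit on a different or noisier scale cannot swamp the fused order.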
03 Methodology
At a high level: Input (reference image Ir + text edit Tm) → Imagination (C_t, C_v, M_t, M_v) → Coarse Filtering (multi-view similarity + RRF) → Fine Filtering (questions + re-ranking) → Output (final ranked images)
Step 0: Prepare Candidate Captions
- What happens: A caption agent writes a caption for every candidate image and for the reference image.
- Why: Having text for each image lets us compare in text space and ask text-based questions later.
- Example: Candidate #17 caption: “red hoodie with front pocket.” Reference caption: “blue hoodie with front pocket.”
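A minimal sketch of this preparation step, assuming a hypothetical `generate_caption` callable that wraps whatever multimodal captioner is in use; the function name and prompt are illustrative, not the paper’s exact setup.

```python
def build_caption_index(image_paths, generate_caption):
    """Return {image_path: caption} so later stages can compare candidates in text space."""
    captions = {}
    for path in image_paths:
        # One short, literal caption per image, e.g. "red hoodie with front pocket".
        captions[path] = generate_caption(path, prompt="Describe this item in one short sentence.")
    return captions

# The reference image is captioned the same way, giving the text imagination
# agent a reference caption C_r to work from.
```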
🍞 You know how authors outline a story before writing? 🥬 Imagination Stage (A_it and A_iv)
- What happens:
- Text Imagination Agent (A_it): Uses the edit text Tm and the reference caption C_r to write a clean, semantic target caption C_t and an edit list M_t (explicit changes).
- Vision Imagination Agent (A_iv): Uses Tm and the reference image Ir to write a vision-grounded target caption C_v and an attribute checklist M_v (which visual traits must be present/absent).
- Why this step exists: If we don’t define the target clearly from both language and picture angles, later matching can drift.
- Example with data:
- Reference image: blue running shoe with white sole.
- Tm: “make it red; keep white sole; add mesh upper.”
- C_t (text view): “a red running shoe with a white sole and mesh upper, same silhouette as reference.”
- M_t: {color: red, keep: white sole, add: mesh upper}
- C_v (vision view): “red athletic shoe, white outsole, breathable mesh, similar shape to reference image.”
- M_v: {has_mesh: true, color_red: true, white_sole: true}
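The sketch below shows one plausible way to prompt the two imagination agents. `ask_llm` and `ask_mllm` are placeholder callables for a text-only model and a multimodal model, and the prompts and JSON fields are illustrative assumptions rather than the paper’s exact templates.

```python
import json

def text_imagination(edit_text, reference_caption, ask_llm):
    """A_it: produce a semantic target caption C_t plus an explicit edit list M_t."""
    prompt = (
        f"Reference caption: {reference_caption}\n"
        f"Requested edit: {edit_text}\n"
        'Describe the edited target and list the changes as JSON: '
        '{"target_caption": "...", "edits": ["..."]}'
    )
    out = json.loads(ask_llm(prompt))
    return out["target_caption"], out["edits"]        # C_t, M_t

def vision_imagination(edit_text, reference_image, ask_mllm):
    """A_iv: produce a vision-grounded caption C_v plus an attribute checklist M_v."""
    prompt = (
        f"Requested edit: {edit_text}\n"
        "Looking at the reference image, describe the edited target and list the "
        'visual attributes that must hold, as JSON: '
        '{"target_caption": "...", "attributes": ["..."]}'
    )
    out = json.loads(ask_mllm(reference_image, prompt))
    return out["target_caption"], out["attributes"]   # C_v, M_v
```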
🍞 Think of trying on clothes under two lights: daylight and indoor. 🥬 Coarse Filtering: Multi-Perspective Similarity (A_st and A_sv)
- What happens:
- For each candidate, the Text Similarity Agent (A_st) compares candidate’s caption (C_a) to both C_t and C_v.
- The Vision Similarity Agent (A_sv) compares the candidate image (I_a) to both C_t and C_v using a vision-language encoder.
- Within each modality (text side and visual side), the two comparison scores are summed to get a text-score and a visual-score per candidate.
- Reciprocal Rank Fusion (RRF) merges the text and visual rank lists into one shortlist of top k' candidates (e.g., 100).
- Why this step exists: One view is brittle. Combining text-and-vision views catches different clues and keeps recall high.
- Example with data:
- Candidate #4: Style matches perfectly but caption forgot to mention “mesh.” Text view rank = 8, visual view rank = 2 → RRF keeps it near the top.
- Candidate #19: Caption says “mesh,” but the image looks like leather. Text rank = 3, visual rank = 30 → RRF balances this uncertainty.
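Putting the coarse stage together, the sketch below assumes `encode_text` / `encode_image` return L2-normalized embedding vectors from a CLIP-style encoder (so a dot product is a cosine similarity) and reuses the `rrf_fuse` helper sketched earlier; the names and shortlist size are illustrative.

```python
def coarse_filter(candidates, C_t, C_v, encode_text, encode_image, k_prime=100):
    """candidates: {cid: {"caption": str, "image": ...}} -> top-k' candidate ids."""
    t_text, t_vision = encode_text(C_t), encode_text(C_v)  # embeddings of the two imagined captions
    text_scores, visual_scores = {}, {}
    for cid, cand in candidates.items():
        cap_emb = encode_text(cand["caption"])
        img_emb = encode_image(cand["image"])
        # Each modality sums its similarity to both imagined captions (C_t and C_v).
        text_scores[cid] = float(cap_emb @ t_text + cap_emb @ t_vision)
        visual_scores[cid] = float(img_emb @ t_text + img_emb @ t_vision)
    # Merge the text-view and image-view rankings and keep roughly 100 candidates.
    return rrf_fuse([text_scores, visual_scores])[:k_prime]
```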
🍞 Picture a referee asking a few clear questions to avoid a bad call. 🥬 Fine Filtering: Question-Based Verification (A_q, A_qt, A_qv)
- What happens:
- Question Agent (A_q) turns edits into a few true/false checks, using M_t, M_v, and Tm. Examples: “The shoe is red.” “The sole is white.” “Upper is mesh.”
- Text Question Scorer (A_qt) asks these about the candidate’s caption.
- Vision Question Scorer (A_qv) asks these about the candidate image directly.
- A candidate gets credit for a question only if the answer is correct in both views.
- Multiply the question score by a normalized blend of the earlier similarity scores (with a small text weight, e.g., lambda≈0.15) to re-rank the k' candidates into the final top-k.
- Why this step exists: Similarity is fuzzy. Yes/no checks make sure crucial edits really show up, preventing near-miss winners.
- Example with data:
- Candidate #4 passes “red?” and “white sole?” but fails “mesh upper?” → loses points and can drop below a fully correct candidate.
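The sketch below is one plausible reading of this re-ranking rule. `answer_from_caption` and `answer_from_image` stand in for the two question scorers (A_qt, A_qv) and return True/False; the min-max normalization and the exact way the blend multiplies the question score are assumptions of this sketch, with the small text weight lam ≈ 0.15 taken from the description above.

```python
def minmax(scores):
    """Rescale a {cid: score} dict to [0, 1] so text and visual scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {cid: (s - lo) / (hi - lo + 1e-8) for cid, s in scores.items()}

def fine_rerank(shortlist, questions, captions, images, text_scores, visual_scores,
                answer_from_caption, answer_from_image, lam=0.15, k=10):
    t_norm, v_norm = minmax(text_scores), minmax(visual_scores)
    final = {}
    for cid in shortlist:
        # A question counts only if the caption view AND the image view both say "true".
        passed = sum(
            1 for q in questions
            if answer_from_caption(captions[cid], q) and answer_from_image(images[cid], q)
        )
        question_score = passed / max(len(questions), 1)
        # Blend the earlier similarities with a light text weight, then gate by the questions.
        similarity_blend = lam * t_norm[cid] + (1 - lam) * v_norm[cid]
        final[cid] = question_score * similarity_blend
    return sorted(final, key=final.get, reverse=True)[:k]
```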
The Secret Sauce
- Dual imagination reduces the language–vision gap, anchoring the goal from two sides.
- Hybrid similarity plus RRF keeps high-recall coverage while resisting noisy scales.
- True/false verification adds crisp, interpretable signals so tiny but vital edits don’t get lost.
- Progressive narrowing (imagine → shortlist → verify) mirrors how people decide and prevents early mistakes from locking in.
04 Experiments & Results
🍞 You know how a report card makes more sense if you know the class average? 🥬 The Test
- What they measured: How often the correct image appears in the top results. Two main metrics:
- Recall@k (R@k): Did the right answer show up in the top k? Like checking if your favorite song made the top-10 chart.
- mean Average Precision at k (mAP@k): Rewards putting all correct answers high on the list when multiple targets exist.
- Why: CIR needs both precision (edits are right) and coverage (don’t miss the right item). 🍞 Anchor: If R@10 = 83%, that’s like getting an A when most are near a C.
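For readers who want the metrics spelled out, here is a small sketch of Recall@k and average precision at k for a single query (mAP@k is the mean of the latter over all queries); `ranked` and `relevant` are illustrative names for the system’s output list and the ground-truth set.

```python
def recall_at_k(ranked, relevant, k):
    """1.0 if any correct image appears in the top-k results, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))

def average_precision_at_k(ranked, relevant, k):
    """Rewards placing all correct images as high as possible within the top-k."""
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i            # precision at this cut-off
    return precision_sum / min(len(relevant), k) if relevant else 0.0

# R@k is averaged over all queries; mAP@k averages average_precision_at_k over queries.
```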
The Competition
- Training-based and training-free baselines, including PALAVRA, SEARLE/iSEARLE, LinCIR, FTI4CIR, CIReVL, LDRE, ImageScope.
- XR is training-free, so it’s impressive when it beats systems that were trained specifically for the task.
The Datasets
- FashionIQ: Clothes with fine-grained edits (color, sleeves, prints).
- CIRR: Natural photos with subset evaluation to distinguish very similar images.
- CIRCO: Large, distractor-heavy benchmark with multiple correct answers (good for real-world messiness).
Scoreboard with Context
- FashionIQ (average across categories): XR hits about 36.7% R@10 and 57.1% R@50 with CLIP-B/32, jumping roughly 8+ points in R@10 over strong training-free baselines. That’s like moving from “good” to “honor roll.”
- CIRCO: XR reaches about 31% mAP@5 and ~31% mAP@50 (with CLIP-L/14 it’s even higher), beating the best baseline by around 7+ points at mAP@50—big gains in a noisy, real-world-like setting.
- CIRR: XR scores roughly 83% R@10 and about 95% R@3 in the fine-grained subset task—like acing trick questions others keep missing.
Surprising/Notable Findings
- A few true/false questions (around three) are enough; more gives little extra and costs time.
- Medium-size multimodal backbones (e.g., InternVL3-8B or Qwen2.5-VL-7B) are the sweet spot: strong grounding without huge cost.
- Reciprocal Rank Fusion outperforms naive score-averaging because it focuses on rank, not noisy scales.
- A shortlist of about 100 candidates balances diversity and speed; going to 500 yields small gains at much higher cost.
Ablations (Why Each Piece Matters)
- Visual similarity alone boosts recall but can wander semantically.
- Text similarity alone keeps meaning but misses tiny looks.
- Combining both yields large gains: cross-modal views complement each other.
- Adding question-based checks prunes false positives and lifts edit faithfulness; using both image and text questions works best.
Bottom Line
- Across fine-grained fashion, natural photos, and distractor-heavy sets, XR’s agentic, progressive reasoning keeps the user’s intent intact and wins by a clear margin, despite requiring no extra training.
05 Discussion & Limitations
Limitations
- Narrow modality focus: XR is tailored for image–text. It doesn’t yet handle video timing, audio cues, or interactive, multi-turn edits.
- Dependency on generated language: Captions and verification questions come from large multimodal models and can carry biases or occasional hallucinations.
- Efficiency at extreme scale: Very large candidate pools raise latency; more optimization or pruning strategies could help.
Required Resources
- A vision-language encoder (e.g., CLIP) for similarity scoring.
- A medium multimodal LLM (e.g., InternVL3-8B) for imagination and question answering.
- GPU memory and time to score a few hundred candidates (k'≈100 is a good balance in practice).
When NOT to Use
- When edits involve motion or time (“same person, but walking instead of sitting, 2 seconds later”)—that’s video-focused.
- When only text is available and image cues matter a lot (XR shines with both modalities).
- Ultra-low-latency scenarios where even a small set of verification questions is too slow.
Open Questions
- How to extend XR to video, audio, and 3D while keeping the clean imagine–match–verify loop?
- Can we reduce reliance on heavy models using lighter agents or distillation while preserving accuracy?
- Can agents learn to ask the “best” few questions adaptively, per query difficulty?
- How to detect and correct biased or incomplete captions automatically during imagination?
06 Conclusion & Future Work
Three-Sentence Summary
- XR reframes composed image retrieval as a team effort: imagine the target, shortlist with hybrid similarities, then verify facts across text and image before deciding.
- This progressive, cross-modal reasoning is training-free yet outperforms strong baselines on FashionIQ, CIRR, and CIRCO, especially on fine-grained edits.
- Ablations confirm each agent’s necessity: imagination anchors the goal, similarity widens coverage, and questions lock in correctness.
Main Achievement
- Establishing a practical, training-free, multi-agent recipe—imagine → coarse-match → fine-verify—that reliably preserves both the reference look and the requested edits.
Future Directions
- Extend to richer modalities (video timing, audio), design lighter agents, and make question-asking adaptive.
- Explore human-in-the-loop feedback and better bias checks for generated captions.
- Integrate with large-scale search systems to power interactive, edit-aware web and shopping experiences.
Why Remember This
- XR shows that careful teamwork across modalities—drafting, matching, and fact-checking—beats one-shot matching.
- The pattern is simple yet powerful and general: imagine the goal, keep multiple views, then verify. That recipe travels well to many multimodal problems.
Practical Applications
- E-commerce search: “Same sneakers but with black soles,” returning faithful style-preserving options.
- Personal wardrobe apps: Find photos of “this coat, but zipped and with the hood up.”
- Design and prototyping: Rapidly explore variations (“same chair, darker wood, leather seat”).
- Digital asset management for media teams: Retrieve exact scene variations (“same shot, but at night”).
- Education: Students search class photos for specific changes (“same lab setup, safety goggles on”).
- Content moderation support: Verify specific visual attributes before flagging content.
- Recommendation systems: Suggest closely related items with precise attribute tweaks.
- Photography workflows: Locate images matching client edit notes (“keep pose, change background to brick”).
- Interior decoration: “This lamp, but brass and taller,” for catalog search.
- Visual documentation: Find product revisions across time (“same PCB but with added connector”).