
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Intermediate
Tingyu Song, Yanzhao Zhang, Mingxin Li et al. · 1/22/2026
arXiv · PDF

Key Summary

  • This paper introduces EDIR, a new and much more detailed test for Composed Image Retrieval (CIR), where you search for a target image using a starting image plus a short text change.
  • EDIR is built using image editing so the authors can precisely control the kinds of changes (like color, shape, weather, or adding an object) and cover 15 fine-grained subcategories.
  • The benchmark has 5,000 carefully checked queries and a gallery of 178,645 images, with hard negatives that look very similar to make the test tough but fair.
  • Across 13 popular models, even the strongest ones struggled to be good at every category, revealing big gaps in handling negation, counting, spatial relations, viewpoint, and fine details like texture.
  • EDIR exposes modality bias in older benchmarks where models could over-rely on text; here you must truly use both the image and the text together.
  • A custom model trained in-domain on EDIR-style data (EDIR-MLLM) jumped to 59.9% Recall@1, showing many issues are solvable with targeted data.
  • However, reasoning-heavy skills like counting or viewpoint improved only a little, suggesting current model architectures have intrinsic limits.
  • The paper provides a clear taxonomy (5 main categories, 15 subcategories) that maps real-world edits and makes evaluation more complete and diagnostic.
  • Human checks and multi-stage automated filters help ensure high data quality, with low false positive and negative rates.
  • EDIR aims to guide future models toward real compositional understanding instead of shortcutting through benchmark biases.

Why This Research Matters

EDIR makes image search feel more like how people think: “like this, but change that,” across real edit types we use in daily life. It reduces shortcuts where models ignore the picture and only read the text, pushing true multimodal understanding. This helps shopping sites find the right variant, design tools suggest precise edits, and photo apps surface the exact moment you imagine. By pinpointing weaknesses (like negation, counting, viewpoint), EDIR guides researchers to build models that are more helpful and reliable. Over time, this means faster, smarter, and more accurate visual assistants for everyone. It also sets a new, fair standard for judging progress so results actually reflect real-world skill. Ultimately, EDIR nudges AI from “good at tests” to “good at helping people.”

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how when you look for a picture on your phone, you might think, “I want one like this beach photo, but at sunset, and with no people”? Regular search doesn’t let you say all that clearly.

🥬 Filling (New Concept: Composed Image Retrieval, CIR)

  • What it is: CIR is finding a target image using a starting (reference) image plus a short text that says how to change it.
  • How it works: 1) Start with a picture you like. 2) Add a tiny wish in text, like “make the sky orange.” 3) The system hunts for another image that matches both the original look and the requested change. 4) It ranks candidates and picks the best match.
  • Why it matters: Without CIR, you can only search by words or by picture alone, missing the key idea: “like this, but changed in this specific way.” A small code sketch of this image-plus-text matching follows below.

🍞 Bottom Bread (Anchor): Imagine showing a blue sneaker photo and typing “same shoe but red shoelaces.” CIR helps you find that exact style change.
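
To make the “image plus text change” idea concrete, here is a minimal sketch of how a retrieval system could score candidates once the reference image and the modification text have been turned into vectors. The additive fusion and the random vectors are illustrative placeholders, not the method used by any specific model in the paper.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    return v / (np.linalg.norm(v) + 1e-8)

def composed_retrieval(ref_image_vec, text_vec, gallery_vecs, top_k=3):
    """Rank gallery images against a fused (reference image + edit text) query vector."""
    # Simple additive fusion; real CIR models typically learn a fusion module instead.
    query = l2_normalize(l2_normalize(ref_image_vec) + l2_normalize(text_vec))
    gallery = np.stack([l2_normalize(g) for g in gallery_vecs])
    scores = gallery @ query                 # cosine similarity to every candidate
    return np.argsort(-scores)[:top_k]       # indices of the best-matching candidates

# Toy usage: random vectors stand in for real image/text encoder outputs.
rng = np.random.default_rng(0)
ref, txt = rng.normal(size=256), rng.normal(size=256)
gallery = rng.normal(size=(1000, 256))
print(composed_retrieval(ref, txt, gallery))
```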

The World Before: Before this paper, researchers built CIR tests (benchmarks) like CIRR, FashionIQ, and CIRCO. These helped, but they usually covered just a few kinds of changes (like fashion attributes) or had fuzzy categories. As a result, models could look good on paper yet still fail at everyday requests such as removing an object, changing time of day, or rotating the viewpoint.

🍞 Top Bread (Hook): Imagine your teacher only quizzes you on addition and not subtraction, fractions, or word problems. You might get an A, but we don’t really know if you’re good at math.

🥬 Filling (New Concept: Fine-Grained Evaluation)

  • What it is: A fine-grained evaluation checks many small, specific skills instead of just a few big ones.
  • How it works: 1) Make a list of tiny, important abilities (color change, count, spatial move, style, etc.). 2) Create clear tests for each one. 3) Score models per skill, not just overall. 4) See where they shine or stumble.
  • Why it matters: Without this, a model can hide weaknesses. With it, we can target fixes.

🍞 Bottom Bread (Anchor): Think of a spelling test that checks each tricky sound (ph, gh, sh) so you know exactly which part needs practice.

The Problem: Existing CIR benchmarks were either too narrow (only fashion), too coarse (few categories), or biased in a way that let models rely mostly on text. That means they didn’t force true teamwork between the image and the text.

🍞 Top Bread (Hook): Imagine grading a painting by only reading its title. That would miss the whole picture.

🥬 Filling (New Concept: Modality Bias)

  • What it is: When a model overuses one input type (like text) and underuses the other (the image).
  • How it works: 1) The test allows shortcuts. 2) The model learns to win by reading words and ignoring visuals. 3) Scores look high, but real understanding is low.
  • Why it matters: If the image doesn’t matter, you’re not truly solving CIR. A quick way to measure this bias is sketched below.

🍞 Bottom Bread (Anchor): If a quiz lets you copy your friend’s answer, you might get 100%, but you didn’t learn the topic.
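
A simple way to probe this bias, sketched below under the assumption that all embeddings are already L2-normalized: score the gallery once with a text-only query vector and once with the fused image-plus-text vector, then compare Recall@1. If the text-only score is just as high, the benchmark is letting models ignore the image.

```python
import numpy as np

def recall_at_1(query_vecs, gallery_vecs, target_idx):
    """Fraction of queries whose top-ranked gallery item is the true target."""
    sims = query_vecs @ gallery_vecs.T                  # (num_queries, gallery_size)
    return float(np.mean(np.argmax(sims, axis=1) == target_idx))

def modality_bias_gap(text_only_vecs, fused_vecs, gallery_vecs, target_idx):
    """Positive gap means the image genuinely helps; near zero means a text-only shortcut."""
    r_text = recall_at_1(text_only_vecs, gallery_vecs, target_idx)
    r_fused = recall_at_1(fused_vecs, gallery_vecs, target_idx)
    return r_fused - r_text
```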

Failed Attempts: People tried tagging broad categories or writing queries after choosing targets, but that often missed important types of edits or mixed categories in confusing ways. Some tests even rewarded text-only tricks.

The Gap: We needed a test that covers real-life edits in a balanced way and controls what’s being changed—so we know exactly what skill the model is being graded on.

🍞 Top Bread (Hook): Picture a chef who can precisely add salt, sugar, or spice to taste. Control matters.

🥬 Filling (New Concept: Image Editing for Query Synthesis)

  • What it is: Using image editing tools to create target images that match specific, controlled text changes.
  • How it works: 1) Pick a source photo. 2) Write an edit instruction (e.g., “make it nighttime”). 3) Use an editing model to create the new image. 4) Turn the instruction into a natural search query.
  • Why it matters: This gives fine control over what’s changed, so each test truly measures that skill.

🍞 Bottom Bread (Anchor): Start with a room photo, then edit in a lamp. Now you have the perfect test for “addition.”

Real Stakes: In shopping (“same jacket but leather”), creative tools (remove a sign, change style), and everyday organizing (“like this photo but no people and at dusk”), CIR powers intuitive, fast finds. Getting the test right pushes models to become truly helpful, accurate, and trustworthy in daily life.

02Core Idea

🍞 Top Bread (Hook): Imagine a playground with stations for every skill: hopping, balancing, throwing. If you only test hopping, you won’t know who’s great at balancing.

🥬 Filling (New Concept: EDIR Benchmark)

  • What it is: EDIR is a new, balanced, fine-grained test for CIR built from controlled image edits across 5 main categories and 15 subcategories.
  • How it works: 1) Define a clear taxonomy (color, shape, add/remove, spatial, style, time, weather, and more). 2) Use image editing to create target images that exactly match chosen changes. 3) Rewrite edits into natural queries. 4) Add hard negatives (very similar but wrong in one key way). 5) Filter and human-check for quality. 6) Evaluate many models fairly.
  • Why it matters: Without EDIR, models can look smart while skipping hard skills. With EDIR, we see real strengths and weaknesses.

🍞 Bottom Bread (Anchor): “Same living room, but remove the rug.” EDIR gives you a target image with the rug gone and a distractor where the rug is still there, forcing true understanding.

The “Aha!” Moment in one sentence: If you generate the target images by precisely editing the source, you can control what skill is being tested and finally evaluate CIR fairly and deeply.

Three Analogies:

  1. Music Mixer: Instead of guessing which knob was turned, we, the testers, turn one knob (color, style) ourselves and then check if models notice it.
  2. Science Experiment: Keep most things the same (control) and change just one variable (the edit) so the result clearly shows cause and effect.
  3. Sports Drills: Test dribbling, passing, and shooting separately to know what to practice next.

Before vs After:

  • Before: Few categories, fuzzy labels, and shortcuts let models pass without really seeing the image.
  • After: Clear subcategories, precise edits, and tough hard negatives expose true visual-text reasoning.

🍞 Top Bread (Hook): You know how a to-do list breaks a big job into bite-size tasks so you don’t get overwhelmed?

🥬 Filling (New Concept: Taxonomy)

  • What it is: A structured list of edit types organized into 5 main groups and 15 subgroups.
  • How it works: 1) Attribute: color, material, shape, texture. 2) Object: addition, remove, replace, count. 3) Relationship: spatial, action, viewpoint. 4) Global environment: style, time, weather. 5) Complex: mixes several changes.
  • Why it matters: It covers real life broadly and evenly, so no single skill dominates. The full structure is written out as a small lookup table below.

🍞 Bottom Bread (Anchor): “Same car but matte black (material)” is different from “same scene but snowy (weather).” The taxonomy cleanly separates them.
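
Here is that taxonomy as a small lookup table. This is a sketch that follows the listing above, treating the Complex group as its own single subcategory to reach the stated total of 15.

```python
# The 5 main categories and 15 subcategories described above.
EDIR_TAXONOMY = {
    "Attribute":          ["color", "material", "shape", "texture"],
    "Object":             ["addition", "remove", "replace", "count"],
    "Relationship":       ["spatial", "action", "viewpoint"],
    "Global environment": ["style", "time", "weather"],
    "Complex":            ["complex"],   # mixes several atomic edits at once
}

assert sum(len(subs) for subs in EDIR_TAXONOMY.values()) == 15
```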

Why it works (intuition, no equations):

  • Control: Editing makes the change unambiguous.
  • Contrast: Hard negatives keep everything else similar so models must pay attention to the requested change.
  • Coverage: The taxonomy prevents blind spots (like ignoring spatial moves or texture).
  • Checks: Automated and human filters reduce mislabeled or unclear cases.

Building Blocks:

  • Source images filtered for quality.
  • Edit instructions per selected subcategories.
  • Composite edits to form targets and hard negatives.
  • Query rewriting (direct and negation forms).
  • Two-stage model-based filtering plus human validation.
  • Balanced sampling to reach 5,000 queries across 15 subcategories and a 178,645-image gallery.

🍞 Bottom Bread (Anchor): Like a well-designed obstacle course with stations for crawling, jumping, and balance beams, EDIR ensures every skill gets tested, not just running fast.

03Methodology

At a high level: Input (source image) → Choose fine-grained edit(s) → Edit the image to make target(s) + hard negatives → Rewrite edit into a natural query → Filter for quality → Build the benchmark query with candidates → Output evaluation-ready triplets.
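
The whole arrow chain can be read as one function. The sketch below is a hypothetical outline in which every callable (pick_edits, edit_image, rewrite_query, passes_filters) is a placeholder for the corresponding pipeline stage, not the authors’ actual code.

```python
def build_benchmark_entry(source_image, pick_edits, edit_image,
                          rewrite_query, passes_filters):
    """Hypothetical end-to-end sketch of the pipeline above; every callable is a placeholder."""
    edits = pick_edits(source_image)                          # choose fine-grained edit instruction(s)
    target, hard_negatives = edit_image(source_image, edits)  # controlled image editing
    query = rewrite_query(edits)                              # natural-language CIR query
    if not passes_filters(source_image, target, query):       # quality filtering
        return None                                           # discarded entry
    return {                                                  # evaluation-ready record
        "source": source_image,
        "query": query,
        "target": target,
        "hard_negatives": hard_negatives,
    }
```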

Step 1: Seed Image Selection

  • What happens: Start from LAION-400M, then use an MLLM (Qwen2.5-VL-32B) to filter out low-quality, text-only, abstract, or unusable images.
  • Why this exists: Poor source images make edits unclear or impossible, which would weaken the test.
  • Example: Remove a blurry document scan; keep a clear kitchen photo with identifiable objects.

🍞 Top Bread (Hook): Imagine starting an art project. If your paper is torn or soggy, your drawing won’t turn out well.

🥬 Filling (New Concept: Multimodal Embedding Models)

  • What it is: Models that turn both images and text into vectors in the same space so they can be compared.
  • How it works: 1) Encode image into numbers. 2) Encode text into numbers. 3) Bring them into one shared space. 4) Match by closeness.
  • Why it matters: CIR needs images and words to “speak” the same language to compare them fairly.

🍞 Bottom Bread (Anchor): Like translating both English and Spanish into a shared sign language so they can be matched.

Step 2: Subcategory Selection and Instruction Generation

  • What happens: For each source image, an MLLM suggests 5–6 suitable subcategories (e.g., color, remove, spatial) and writes 2–3 atomic edit instructions per subcategory.
  • Why this exists: Ensures coverage across many edit types while keeping each instruction concrete and realistic.
  • Example: For a castle photo: “Add a stone texture to the walls (texture),” “Change the sky to cloudy (weather),” “Remove the tourists (remove).”

Step 3: Composite Edits and Image Editing

  • What happens: Build composite instruction sets like {a, b, c, d}, where a and b are base edits shared across several outputs, and c or d are the distinctive edits. Use an editing model (Qwen-Image-Edit) to generate multiple edited images.
  • Why this exists: Shared base edits make images look related; distinctive edits force the model to notice the exact requested change.
  • Example: Base = “make evening” + “apply watercolor style”; Distinctive = “add a lamp” vs “remove the rug.”

🍞 Top Bread (Hook): Think of a spot-the-difference game where most things stay the same and one or two details change.

🥬 Filling (New Concept: Hard Negatives)

  • What it is: Tricky candidate images that look very similar to the target but miss the requested change.
  • How it works: 1) Keep base edits the same. 2) Change a different attribute than the query asks for. 3) Present both to the model. 4) Only one matches the text perfectly.
  • Why it matters: Without hard negatives, the task is too easy and doesn’t test careful understanding. A small sketch of how a target and a hard negative are built from shared base edits follows below.

🍞 Bottom Bread (Anchor): If the query says “remove the rug,” a hard negative might keep the rug but match everything else.
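
A sketch of the composite-edit construction, using the example edits from Step 3; `edit` stands in for a call to an image-editing model and is not a real API.

```python
def make_target_and_hard_negative(source, edit):
    """Apply shared base edits plus one distinctive edit per output image."""
    base_edits = ["make it evening", "apply watercolor style"]  # shared, so images look related
    distinctive = {"target": "add a lamp", "hard_negative": "remove the rug"}

    images = {}
    for role, extra_edit in distinctive.items():
        img = source
        for instruction in base_edits + [extra_edit]:
            img = edit(img, instruction)      # one atomic edit at a time
        images[role] = img

    # A query asking for the target's distinctive edit ("add a lamp") is only
    # satisfied by the target; the hard negative matches everything else.
    return images["target"], images["hard_negative"]
```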

Step 4: Query Rewriting (Direct and Negation)

  • What happens: An LLM (Qwen3-32B) rewrites raw edit instructions into natural CIR queries. It supports direct statements (“make it snowy”) and negation (“show this scene but not sunny”).
  • Why this exists: Real users speak naturally and often say “not red” or “without the sign.” Models must understand both styles.
  • Example: Edit “Change dress to red” → Query “Find this dress but in a different color” (negation form) or “A dress in red” (direct form when appropriate).

🍞 Top Bread (Hook): You know how people ask the same thing in different ways—“turn off the lights” or “don’t keep the lights on”?

🥬 Filling (New Concept: Negation Queries)

  • What it is: Queries that say what not to keep rather than exactly what to change to.
  • How it works: 1) Identify the attribute to avoid. 2) Write the request without naming a specific final value. 3) The model must exclude the original trait.
  • Why it matters: People often ask “not this” when they don’t know the exact target. The two phrasing styles are illustrated in the sketch below.

🍞 Bottom Bread (Anchor): “Same shirt but not striped.”
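
A toy illustration of the two phrasing styles. The paper uses an LLM (Qwen3-32B) for rewriting, so these string templates are only a stand-in for the idea.

```python
def direct_query(attribute, new_value):
    """Name the exact final value the user wants."""
    return f"the same scene, but with the {attribute} changed to {new_value}"

def negation_query(attribute, old_value):
    """Exclude the original trait without naming a final value."""
    return f"the same scene, but the {attribute} is not {old_value}"

print(direct_query("shirt pattern", "polka dots"))   # names the target value
print(negation_query("shirt pattern", "striped"))    # only rules out the original
```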

Step 5: Two-Stage Filtering and Human Validation

  • What happens: First automatic check compares the composite instruction with the generated image; second check compares the final CIR query with the source–target pair. Then humans validate a 12% sample.
  • Why this exists: Multiple filters reduce errors like mismatched edits or over/under-description.
  • Example: If the text says “add a vase,” but the target lacks it, the pair is filtered out. Reported error rates are low: 8.0% false positives, 7.3% false hard negatives, 11.7% global false negatives in sampling. A minimal sketch of the two-stage check appears below.
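
A minimal sketch of that two-stage check, assuming a generic `ask_mllm(image, question)` helper that returns a short yes/no answer; the helper and the prompts are hypothetical, not the paper’s exact prompts.

```python
def passes_quality_filters(source, target, composite_instruction, cir_query, ask_mllm):
    """Keep a query only if both automatic checks agree with the intended edit."""
    # Stage 1: does the edited image actually reflect the composite instruction?
    stage1 = ask_mllm(image=target,
                      question=f"Does this image show: {composite_instruction}? Answer yes or no.")
    if stage1.strip().lower() != "yes":
        return False
    # Stage 2: does the final CIR query correctly describe the source-to-target change?
    stage2 = ask_mllm(image=(source, target),
                      question=f"Is '{cir_query}' an accurate description of the change? Answer yes or no.")
    return stage2.strip().lower() == "yes"
```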

Step 6: Building the Benchmark

  • What happens: Sample 300 queries per simple subcategory and 800 for the Complex category to reach 5,000 queries; assemble a 178,645-image gallery with hard negatives and additional edited images.
  • Why this exists: Balanced coverage across all 15 subcategories enforces fairness and breadth.
  • Example: Attribute (color/material/shape/texture), Object (addition/remove/replace/count), Relationship (spatial/action/viewpoint), Global (style/time/weather), Complex (mixtures).

🍞 Top Bread (Hook): On sports day, you don’t only do sprints; you also do long jump, relays, and throws so everyone’s skills show.

🥬 Filling (New Concept: Recall@1)

  • What it is: A score that checks if the top-ranked result is the correct target.
  • How it works: 1) Model ranks candidate images. 2) If the first one is right, it’s a hit. 3) Count hits over all queries. 4) Higher is better.
  • Why it matters: It measures whether the model confidently picks the correct image, not just somewhere in the top few. A tiny worked example is sketched below.

🍞 Bottom Bread (Anchor): Like asking, “Did you guess the mystery picture on your first try?”
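
A tiny worked example of the metric, assuming each model returns a ranked list of candidate image IDs for every query:

```python
def recall_at_k(ranked_ids, gold_id, k=1):
    """1.0 if the correct target is within the top-k ranked candidates, else 0.0."""
    return float(gold_id in ranked_ids[:k])

# One query: the gold target "img_042" was ranked first, so Recall@1 = 1.0.
print(recall_at_k(["img_042", "img_977", "img_113"], "img_042", k=1))

def mean_recall(all_rankings, all_gold_ids, k=1):
    """Average over every query; this is how benchmark-level Recall@1 / Recall@3 is reported."""
    hits = [recall_at_k(r, g, k) for r, g in zip(all_rankings, all_gold_ids)]
    return sum(hits) / len(hits)
```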

04Experiments & Results

The Test: The authors evaluate 13 models on EDIR using Recall@1 (and also report Recall@3). The aim is to see whether models can pick the exact target when shown a source image and a fine-grained change in text.

The Competition: They include non-MLLM methods built on CLIP (PIC2WORD, SEARLE, MAGICLENS) and many modern MLLM-based embedding models (RzenEmbed-7B, Ops-embedding, GME-2B/7B, MMRet-MLLM, E5-V, VLM2Vec-2B, UniME-2B/7B, mmE5). They also train an in-domain model, EDIR-MLLM, using 225k synthesized triplets to see what improves with targeted data.

The Scoreboard (with context):

  • Non-MLLM models average about 18.4% Recall@1. That’s like getting a low grade when the test asks for very specific details; they can find the general group but miss the exact match.
  • MLLM-based models do better, averaging around 36.9% across categories, with stronger results in addition, replace, and action but weaker in texture, remove, and shape. That’s like moving from a C- to a C+/B- depending on the skill, but still far from straight A’s.
  • The in-domain model (EDIR-MLLM) jumps to 59.9% Recall@1, like moving from average to a solid B+/A-, proving many gaps are fixable with the right practice data.
  • Recall@3 numbers rise for all models, but challenging categories (remove, viewpoint) remain tough, confirming the benchmark’s difficulty.

Surprising Findings:

  1. Modality Bias in older benchmarks: On CIRCO, some models do better using only text than using both image and text, meaning they can “cheat” by ignoring the image. EDIR reduces such shortcuts by requiring true multimodal reasoning.
  2. Category disparities: Even top models can be uneven—good at adding or replacing objects but poor at negation (“not red”), counting, spatial moves, viewpoint shifts, and tiny visual cues (texture/material/shape).
  3. Correlation study: EDIR performance correlates positively with prior benchmarks but reveals coverage gaps and biases in them (e.g., missing remove or spatial queries), explaining why models that looked great elsewhere struggle here.

🍞 Top Bread (Hook): Imagine a spelling bee that adds tricky, rarely tested letter pairs—suddenly you see who truly mastered spelling.

🥬 Filling (New Concept: Compositional Reasoning)

  • What it is: The ability to understand and combine multiple conditions or relationships at once.
  • How it works: 1) Parse each condition (e.g., count=2, spatial=left of, viewpoint=low angle). 2) Keep them all in mind. 3) Check that a candidate image satisfies every single one.
  • Why it matters: Real requests are often multi-step and relational; missing one part leads to a wrong answer. A toy constraint-checking sketch follows below.

🍞 Bottom Bread (Anchor): “Two mugs on the shelf, the blue one on the left, seen from above” requires combining number, position, and camera angle flawlessly.
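
A toy constraint checker in the same spirit; the attribute names and values are illustrative, not drawn from the benchmark’s actual annotations.

```python
def satisfies_all(candidate, constraints):
    """A candidate matches only if every single parsed condition holds."""
    return all(candidate.get(key) == value for key, value in constraints.items())

query_constraints = {"count": 2, "spatial": "blue mug left of red mug", "viewpoint": "top-down"}

candidate_a = {"count": 2, "spatial": "blue mug left of red mug", "viewpoint": "top-down"}
candidate_b = {"count": 2, "spatial": "blue mug left of red mug", "viewpoint": "eye-level"}

print(satisfies_all(candidate_a, query_constraints))  # True  (all conditions met)
print(satisfies_all(candidate_b, query_constraints))  # False (viewpoint condition missed)
```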

Error Taxonomy from EDIR:

  • Negation handling fails: “not red” or “remove the hat” often trips models.
  • Compositional reasoning gaps: count, spatial, viewpoint, style changes need deeper reasoning than many models have.
  • Multiple constraints: In complex queries, models match only some parts, not all.
  • Fine-grained sensitivity: Texture, material, shape are subtle and frequently missed.

Takeaway: EDIR’s fine-grained design, hard negatives, and balanced categories reveal where models truly need work and which improvements require better data versus better model architectures.

05Discussion & Limitations

Limitations:

  • Cost and scale: The image editing and multi-stage filtering pipeline is computationally heavy, which limits even larger datasets.
  • Bounded complexity: Complex queries here usually combine up to three edits; ultra-complex, interdependent edits (four or more) remain an open challenge.
  • Evaluation-first: EDIR is built mainly to diagnose model skills, not to be a universal training recipe.

Required Resources:

  • Access to large-scale image data (e.g., LAION-400M) and reliable MLLMs for filtering and rewriting.
  • A capable image editing model to produce controlled target images.
  • Compute for generating, filtering, and validating hundreds of thousands to millions of triplets.

When NOT to Use:

  • Narrow, specialized domains where the taxonomy doesn’t fit (e.g., medical scans with domain-specific edits) without adapting categories.
  • Settings where text-only or image-only search is sufficient; EDIR’s strength is testing multimodal composition.
  • If you cannot support the compute cost for reproducing or extending the pipeline.

Open Questions:

  • How to push beyond three simultaneous constraints and still keep queries clear and evaluable?
  • What model changes (architectures, training losses, spatial reasoning modules) most help with counting, viewpoint, and negation?
  • Can we reduce modality bias further, ensuring models always need both image and text?
  • How to scale data synthesis cheaply while keeping edits realistic and diverse?
  • Can better verification (automatic or human-in-the-loop) further cut false positives/negatives without huge cost?

Honest Assessment: EDIR is a rigorous, balanced “skills test” that reveals real weaknesses. Many shortcomings improve with targeted, in-domain data (e.g., color, material, texture), but reasoning-heavy abilities (count, spatial, viewpoint, negation) likely need architectural advances, not just more examples.

06Conclusion & Future Work

Three-Sentence Summary: EDIR is a fine-grained, image-editing-derived benchmark for Composed Image Retrieval that covers 5 main categories and 15 subcategories with 5,000 queries and a large image gallery. By precisely controlling edits and adding hard negatives, EDIR exposes real strengths and weaknesses in 13 modern models, especially around negation, counting, spatial relations, viewpoint, and tiny visual details. In-domain training boosts some skills significantly, but reasoning-heavy gaps remain, pointing to intrinsic model limits.

Main Achievement: Turning controlled image edits into a balanced, diagnostic benchmark that fairly tests true multimodal, compositional understanding—and reveals where current systems fall short.

Future Directions: Scale synthesis more efficiently; increase complexity beyond three constraints; add better spatial/logic modules to models; reduce modality bias; explore improved verification and human-in-the-loop quality checks. Domain adaptations (e.g., indoor design, robotics, scientific imagery) could extend the taxonomy where needed.

Why Remember This: EDIR changes how we judge progress in CIR—from broad, sometimes biased tests to a precise, fine-grained exam that maps directly to everyday user needs. It gives researchers a clear scoreboard for the right skills and a roadmap for building models that truly understand “like this, but change that.”

Practical Applications

  • E-commerce visual search: “Same jacket but leather and dark brown,” retrieving the exact variant.
  • Interior design assistance: “This room but remove the rug and add a floor lamp near the sofa.”
  • Creative photo editing: Find reference images matching “like this portrait, but watercolor style at dusk.”
  • Inventory organization: “This product but with two units side by side from a low angle.”
  • Education and training: Drill specific visual skills (texture, viewpoint) in AI systems and measure progress.
  • Digital asset management: “Same logo scene but remove the background sign,” to quickly locate the right version.
  • Content moderation support: Spot subtle attribute changes (e.g., symbols removed/added) with fine-grained checks.
  • Robotics perception testing: Evaluate spatial and viewpoint understanding for robot camera inputs.
  • Fashion assistants: “Same dress but not striped” to find non-striped alternatives efficiently.
  • A/B creative testing: Retrieve style or weather variants of scenes for marketing and media production.
#Composed Image Retrieval #EDIR #fine-grained benchmark #image editing data synthesis #multimodal embedding #hard negatives #negation queries #compositional reasoning #Recall@1 #modality bias #LAION-400M #Qwen-Image-Edit #taxonomy of edits #multimodal evaluation #in-domain training
Version: 1