Half-Truths Break Similarity-Based Retrieval
Key Summary
- Similarity-based image–text models like CLIP can be fooled by “half-truths,” where adding one plausible but wrong detail makes a caption look more similar to an image instead of less similar.
- This happens because training mostly teaches models to match whole sentences to images, not to check each small part (like objects and relations) carefully.
- The paper introduces CS-CLIP, which teaches the model at the part level by breaking captions into entities (things) and relations (how things connect) and contrasting each correct part with a minimally edited wrong version (a foil).
- On the COCO dataset, standard CLIP prefers the correct short caption over a half-truth only 40.6% of the time; CS-CLIP lifts this “Half-Truth Accuracy” to 69.3%.
- Incorrect relation additions (like swapping who-is-doing-what or changing spatial words) are the hardest; CS-CLIP improves this tough case from 32.9% (CLIP) to 65.5%.
- CS-CLIP also improves across 16 compositional benchmarks, boosting average Image-to-Text accuracy to 57.8%, about +5.7 points over CLIP.
- The trick works without changing the CLIP architecture or how retrieval is done at test time; it only changes how we fine-tune.
- There is a small trade-off: zero-shot classification drops a bit (e.g., Acc@1 from 63.6% to 59.9%), which is common when fine-tuning on a smaller dataset.
- Ablations show that supervising with matched unit foils is the key driver of gains, and updating both encoders helps most for relation understanding.
- Making models reject half-truths can make everyday search, recommendation, and accessibility tools more reliable when users add details to their queries.
Why This Research Matters
When people refine searches by adding details (“blue mug on the top shelf”), systems should get more accurate, not more easily tricked. Fixing half-truth failures makes photo search, e-commerce filters, and creative asset libraries return matches that truly fit all requested details. Accessibility tools that describe scenes become more trustworthy by avoiding confident but wrong add-on facts. Dataset curation and content moderation also benefit from better rejection of near-miss, misleading matches. Because CS-CLIP keeps test-time speed and simplicity, it can slot into existing CLIP-style pipelines without slowing them down. Over time, part-level supervision could help AI handle complex, multi-part instructions with fewer subtle mistakes.
Detailed Explanation
01 Background & Problem Definition
You know how when you tell someone a story, if you add a made-up detail, a good listener should become more suspicious, not more convinced? That’s how image–text AI should behave too: when a description adds a wrong detail, the match to the picture should go down. But today’s popular models often do the opposite.
🍞 Hook: Imagine you describe a photo with “a dog.” If you wrongly extend it to “a dog on a skateboard” (and there’s no skateboard), the score should drop. 🥬 The Concept: Dual Encoder Architecture (what it is, how it works, why it matters)
- What it is: Two separate encoders turn an image and a text into vectors and compare them by similarity.
- How it works:
- An image encoder makes an image vector.
- A text encoder makes a sentence vector.
- A single similarity score (like cosine) tells how well they match.
- Why it matters: Without this simple scoring, fast retrieval (search) across millions of items would be slow or clumsy. 🍞 Anchor: When you search “red bus,” the image encoder and text encoder meet in the same space, so the most “similar” images (red buses) bubble to the top quickly.
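The single score this concept describes can be sketched in a few lines. The vectors below are toy stand-ins for real encoder outputs, and `cosine_similarity` is an illustrative helper, not CLIP's actual code:

```python
import numpy as np

def cosine_similarity(image_vec: np.ndarray, text_vec: np.ndarray) -> float:
    """One number telling how well an image embedding matches a text embedding."""
    return float(np.dot(image_vec, text_vec) /
                 (np.linalg.norm(image_vec) * np.linalg.norm(text_vec)))

# Toy embeddings standing in for real encoder outputs.
image = np.array([0.9, 0.1, 0.3])          # photo of a red bus (pretend)
caption_match = np.array([0.8, 0.2, 0.4])  # "red bus"
caption_other = np.array([-0.5, 0.9, 0.1]) # unrelated caption

# The matching caption should land closer in the shared space.
assert cosine_similarity(image, caption_match) > cosine_similarity(image, caption_other)
```

Because every comparison reduces to one dot product, ranking millions of candidates stays fast.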
🍞 Hook: Think of learning by comparing things: you learn “cat vs. not-cat” by seeing both. 🥬 The Concept: Contrastive Training
- What it is: A training style that pulls true image–text pairs together and pushes mismatched pairs apart.
- How it works:
- Show a batch of images and their captions.
- Pull each image toward its caption.
- Push it away from all other captions (negatives).
- Why it matters: Without contrasts, the model doesn’t learn sharp boundaries; retrieval gets fuzzy. 🍞 Anchor: If a class trains by checking homework against answer keys (right vs. wrong), they spot mistakes faster later.
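The pull/push steps above correspond to a symmetric InfoNCE-style loss. Here is a minimal NumPy sketch of CLIP-style contrastive training over a batch; the temperature value, function names, and random embeddings are assumptions for illustration only:

```python
import numpy as np

def clip_contrastive_loss(image_embs: np.ndarray, text_embs: np.ndarray,
                          temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: row i of each matrix is a matched pair;
    every other row in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy(mat):
        # Diagonal entries are the correct matches; penalize low probability there.
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return float((cross_entropy(logits) + cross_entropy(logits.T)) / 2)

rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(aligned, aligned)            # perfect pairs
loss_random = clip_contrastive_loss(aligned, rng.normal(size=(4, 8)))
assert loss_aligned < loss_random  # matched embeddings yield lower loss
```

The key property: the loss only cares about *relative* similarity, which is exactly why whole-sentence supervision alone leaves part-level details underchecked.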
The world before: CLIP-style models made retrieval fast and surprisingly strong at matching topics and objects. But people noticed a pattern: these models often behave like a “bag of words” across modalities. That means they catch the right words (dog, park, ball) but don’t reliably check how words fit together (who holds what, left vs. right, near vs. under). Many tests swapped colors, roles, or order to see if models cared. Results showed gaps, especially for relations.
🍞 Hook: You know how a story can be 90% true but still wrong because of one sneaky extra detail? 🥬 The Concept: Half-Truth Vulnerability
- What it is: Adding one plausible but incorrect detail to a correct caption can increase similarity instead of decreasing it.
- How it works:
- Start with a short, correct anchor like “a dog.”
- Append a wrong but realistic add-on: “and a skateboard.”
- The model sometimes scores the longer, wrong version higher.
- Why it matters: If adding wrong info raises scores, search can prefer a more wrong description just because it’s longer or sounds plausible. 🍞 Anchor: On COCO, CLIP prefers the correct anchor over the half-truth only 40.6% of the time overall, and just 32.9% when the add-on is a relation.
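The diagnostic described here reduces to a pairwise preference count. A minimal sketch, using hypothetical similarity scores rather than the paper's data:

```python
def prefers_anchor(score_anchor: float, score_half_truth: float) -> bool:
    """Half-truth test for one example: the short, fully correct caption should
    score strictly higher than the same caption plus one wrong add-on."""
    return score_anchor > score_half_truth

def half_truth_accuracy(pairs) -> float:
    """Fraction of (anchor_score, half_truth_score) pairs where the model
    prefers the correct anchor. Random guessing sits near 50%."""
    wins = sum(prefers_anchor(a, h) for a, h in pairs)
    return wins / len(pairs)

# Hypothetical cosine similarities for five image examples.
scores = [(0.31, 0.28), (0.27, 0.33), (0.40, 0.35), (0.22, 0.25), (0.36, 0.30)]
assert half_truth_accuracy(scores) == 0.6  # anchor preferred in 3 of 5 cases
```

A score below 0.5 on this metric means the model is tricked *more often than not*, which is exactly where zero-shot CLIP lands on relation add-ons.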
Failed attempts: Prior fixes tried sentence-level hard negatives (rewriting the whole caption to be slightly wrong) or added extra modules. These help somewhat, especially for easy object tweaks, but relations (who does what to whom; where things are) stayed hard. Even stronger pretraining (SigLIP2) still stumbled on relations.
The gap: Training supervises full sentences, not the little parts (entities and relations) that build meaning. So the model may celebrate any extra familiar word (like “skateboard”) even when it’s wrong for the image.
🍞 Hook: Imagine grading a paragraph only overall, never checking each sentence or fact. 🥬 The Concept: Retrieval Scoring
- What it is: A single similarity number that decides which image–text pairs match.
- How it works:
- Compute image and text vectors.
- Take their cosine similarity.
- Rank candidates by this one score.
- Why it matters: If this score doesn’t reflect part-level correctness, tiny wrong details can slip through and be rewarded. 🍞 Anchor: Two captions with the same nouns but different relations can get nearly the same score, even if one is wrong.
Real stakes: When you refine a query (“cat” → “cat sitting under a chair”), the system should get stricter, not more gullible. This affects photo search, shopping (“blue mug on top shelf”), accessibility tools that describe scenes, and content filtering. If the score balloons for half-truths, users can be misled by confident but inaccurate matches—all because one wrong detail sounded nice.
02 Core Idea
🍞 Hook: Think of checking a Lego build: you don’t only look at the whole castle; you also inspect each brick and how bricks connect. 🥬 The Concept: Component-Supervised CLIP (CS-CLIP)
- What it is: A way to fine-tune CLIP that teaches the model to verify each caption part (entities and relations) by contrasting it with a minimally edited wrong version.
- How it works:
- Break captions into units: entities (things) and relations (how things connect).
- For each unit, make a tiny, realistic mistake (a foil).
- Train the image to score the correct unit higher than its foil.
- Still keep normal sentence-level training in parallel.
- Why it matters: Without checking parts, the model rewards extra plausible words. With part-level checks, it learns that one wrong brick weakens the whole build. 🍞 Anchor: After CS-CLIP training, adding “skateboard” when none exists lowers similarity instead of boosting it.
The “Aha!” in one sentence: If we supervise the meaning-building parts (entities and relations) directly, the single similarity score starts reflecting true compositional correctness, not just word overlap.
Three analogies:
- Recipe checker: Don’t just taste the soup; also verify each ingredient. A single wrong spice should matter.
- Math proof: Don’t only grade the final answer; check each step. One faulty step invalidates the result.
- Airport security: Don’t just look at the suitcase; scan each item inside. One dangerous piece changes the decision.
Before vs. After:
- Before: Models often like longer, plausible-sounding captions, even if the added detail is wrong—especially for relations.
- After: Models penalize the incorrect add-on. Half-Truth Accuracy rises from 40.6% (CLIP) to 69.3% (CS-CLIP); relation additions go from 32.9% to 65.5%.
Why it works (intuition): The single similarity score learns from whatever supervision we give it. If we only supervise whole sentences, the score mainly tracks coarse overlap. By adding part-level contrasts, we give the score finer “teeth” so it bites on specific details (right color, right role, right spatial preposition). Now, longer but wrong captions get correctly downgraded.
Building blocks (introduced with sandwich explanations):
🍞 Hook: Think of labeling the pieces in a diorama: “red car,” “big tree,” “two birds.” 🥬 The Concept: Entity Units
- What it is: Noun phrases with their attributes and counts (e.g., “a brown horse,” “three dogs”).
- How it works:
- Parse the caption to find concrete things.
- Keep modifiers attached (color, number).
- Use them as small meaning blocks.
- Why it matters: Without clean entity pieces, the model can’t learn fine object/attribute checks. 🍞 Anchor: From “a woman riding a brown horse,” we extract “woman” and “brown horse” as entity units.
🍞 Hook: In board games, who chases whom matters: “dog chases cat” is not “cat chases dog.” 🥬 The Concept: Relation Units
- What it is: Directed links between entities (subject, predicate, object), like “person riding horse” or “ball in park.”
- How it works:
- Choose two entities from the caption.
- Identify a predicate (action/spatial) with direction.
- Store the triple (subject, predicate, object).
- Why it matters: Without relation checks, the model can’t tell role swaps or spatial flips. 🍞 Anchor: “horse near barn” differs crucially from “barn near horse” in many scenes.
🍞 Hook: Spot-the-difference puzzles change just one tiny thing. 🥬 The Concept: Minimal Editing Foils
- What it is: Tiny, realistic edits to a unit that change meaning (object, attribute, predicate, or role) while keeping fluency.
- How it works:
- For an entity: swap the color or the object head (horse→giraffe; brown→white).
- For a relation: flip the predicate, swap roles (when asymmetric), or replace one argument.
- Keep everything else the same.
- Why it matters: Without tightly matched foils, the model might learn shortcuts (topic drift) instead of precise distinctions. 🍞 Anchor: “brown horse” vs. “white horse” is a single-attribute foil; “person riding horse” vs. “horse riding person” is a role-swap foil.
🍞 Hook: Imagine giving grades both for the whole essay and for each paragraph’s facts. 🥬 The Concept: Half-Truth Vulnerability (revisited as the target)
- What it is: The failure that CS-CLIP directly addresses—wrong add-ons getting rewarded.
- How it works: Teach the score to prefer each correct unit over its matched foil.
- Why it matters: Fixing this aligns with broader compositional gains across many benchmarks. 🍞 Anchor: When this target improves, models also do better on attribute binding, spatial reasoning, and role sensitivity tests.
03 Methodology
At a high level: Image + Caption → Parse into units → Generate matched foils → Fine-tune with unit-level and sentence-level contrast → Standard retrieval at test time.
Step 1: Parse captions into units
- What happens: A text-only parser (LLM) splits each caption into entity units (things with attributes/counts) and relation units (who-does-what-to-whom; spatial links).
- Why this step exists: If we don’t isolate the small meaning pieces, we can’t supervise them directly; the model stays coarse.
- Example: “a woman riding a brown horse in a park” → Entities: [“woman”, “brown horse”, “park”]; Relations: [“woman riding horse”, “horse in park”].
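The target output of Step 1 can be sketched as plain data structures. The parser itself is an LLM in the paper; the `Relation` dataclass and the hand-written lists below are only an assumed illustration of the unit format for the example caption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """A directed triple: who/what (subject) does or is (predicate) to whom/what (obj)."""
    subject: str
    predicate: str
    obj: str

caption = "a woman riding a brown horse in a park"

# What a unit parser would be expected to produce for the caption above.
entities = ["woman", "brown horse", "park"]   # modifiers stay attached to their noun
relations = [
    Relation("woman", "riding", "horse"),     # action link, direction matters
    Relation("horse", "in", "park"),          # spatial link
]

assert "brown horse" in entities              # attribute kept with its noun
assert relations[0] != Relation("horse", "riding", "woman")  # role swap is a different unit
```

Keeping direction explicit in the triple is what later lets the training signal distinguish “person riding horse” from its role-swapped foil.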
🍞 Hook: Like sorting Lego bricks by shape and color before building. 🥬 The Concept: Entity Units (recap)
- What it is: Noun phrases with attributes/quantities.
- How it works: Keep modifiers attached so checks become precise.
- Why it matters: Otherwise the model may know “horse” but ignore “brown” or “three.” 🍞 Anchor: “three red apples” stays one unit, not split into “three,” “red,” and “apples.”
🍞 Hook: In a relay race, who hands the baton to whom matters. 🥬 The Concept: Relation Units (recap)
- What it is: Directed triples (subject, predicate, object).
- How it works: Extract short, visually checkable predicates (riding, under, holding, next to).
- Why it matters: If direction is ignored, “cat under table” might be confused with “table under cat.” 🍞 Anchor: “person holding umbrella” ≠ “umbrella holding person.”
Step 2: Generate minimally edited foils
- What happens: For each unit, create 1–4 tiny, realistic edits that change meaning while keeping fluency.
- Why this step exists: If foils are too different (off-topic), the model learns to spot topics, not meanings. Minimal foils force part-precision.
- Example (entity): “brown horse” → “white horse” (attribute change); “brown horse” → “brown giraffe” (object change).
- Example (relation): “person riding horse” → “person leading horse” (predicate change); or swap roles when appropriate.
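Foil generation can be sketched with simple substitution rules. The real pipeline is LLM-based, so the swap tables and helper names below are illustrative assumptions that merely reproduce the example edits above:

```python
# Hypothetical substitution tables; the paper generates foils with an LLM.
ATTRIBUTE_SWAPS = {"brown": "white", "red": "blue"}
OBJECT_SWAPS = {"horse": "giraffe", "dog": "cat"}
PREDICATE_SWAPS = {"riding": "leading", "on": "under"}

def entity_foils(unit: str) -> list[str]:
    """Change exactly one attribute OR one object head per foil, never both."""
    words = unit.split()
    foils = []
    for i, w in enumerate(words):
        for table in (ATTRIBUTE_SWAPS, OBJECT_SWAPS):
            if w in table:
                foils.append(" ".join(words[:i] + [table[w]] + words[i + 1:]))
    return foils

def relation_foils(subject: str, predicate: str, obj: str) -> list[str]:
    """Flip the predicate, or swap the roles of subject and object."""
    foils = []
    if predicate in PREDICATE_SWAPS:
        foils.append(f"{subject} {PREDICATE_SWAPS[predicate]} {obj}")
    foils.append(f"{obj} {predicate} {subject}")  # role swap
    return foils

assert entity_foils("brown horse") == ["white horse", "brown giraffe"]
assert relation_foils("person", "riding", "horse") == [
    "person leading horse", "horse riding person"]
```

The discipline enforced here is the one the section describes: every foil differs from its unit by exactly one component, so the training signal points at one specific place where meaning changed.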
🍞 Hook: Like changing one puzzle piece to see if you notice. 🥬 The Concept: Minimal Editing Foils (recap)
- What it is: Single-component changes that flip meaning.
- How it works: Edit exactly one attribute, object, predicate, or role.
- Why it matters: Without single-step edits, we can’t pinpoint where the model fails. 🍞 Anchor: “vase on table” vs. “vase under table” changes just the predicate but reverses the scene.
Step 3: Encode images and units
- What happens: Use the usual CLIP encoders to get vectors for images and for unit texts (entities/relations), plus the full caption.
- Why this step exists: We keep the same architecture and cosine scoring so test-time stays standard and fast.
- Example: Image vector v; unit vectors u (correct) and ũ (foil); caption vector t.
Step 4: Unit-level contrastive training
- What happens: For each image, we tell the model: “Score the correct unit higher than its foil and higher than other images’ units.” We also include the symmetric direction (units should match their own image more than others).
- Why this step exists: This directly teaches the score to reflect part-level truth. Without it, added plausible words may inflate similarity.
- Example: v·u > v·ũ and v·u > v·(other images’ units).
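The inequality above can be turned into a small softmax contrast. This is a simplified sketch of a unit-level loss (one foil per unit, no in-batch negatives), not the paper's exact objective; the temperature and toy vectors are assumptions:

```python
import numpy as np

def unit_loss(v: np.ndarray, u_correct: np.ndarray, u_foil: np.ndarray,
              temperature: float = 0.07) -> float:
    """Two-way contrast: the image embedding v should score the correct unit
    higher than its matched foil."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(v, u_correct), cos(v, u_foil)]) / temperature
    # Cross-entropy with the correct unit as the target (index 0).
    return float(-logits[0] + np.log(np.exp(logits).sum()))

v = np.array([1.0, 0.2, 0.1])        # image embedding (toy)
u_good = np.array([0.9, 0.3, 0.0])   # e.g. "brown horse", correct for this image
u_foil = np.array([0.1, 0.9, 0.4])   # e.g. "white horse", wrong for this image
assert unit_loss(v, u_good, u_foil) < unit_loss(v, u_foil, u_good)
```

Minimizing this pushes the single cosine score to care about the exact component the foil changed, which is the “teeth” the intuition section describes.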
Step 5: Keep sentence-level training in parallel
- What happens: We also keep the normal sentence-level contrast (with hard negatives like shuffled captions) so global alignment remains strong.
- Why this step exists: Without full-sentence alignment, retrieval can lose its broad matching strength.
- Example: v·t > v·(other captions) and v·t > v·(shuffled hard negatives).
Step 6: Combine losses, fine-tune, and stop
- What happens: We add the unit-level loss (weighted by λu) to the global sentence loss and fine-tune for ~25 epochs on COCO.
- Why this step exists: The combined signal balances fine detail with big-picture alignment.
- Example settings: ViT-B/32, batch 128, AdamW, small learning rate; sample ~2 unit–foil pairs per image per step.
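The loss combination in this step is just a weighted sum. A trivial sketch, with placeholder numbers rather than the paper's actual settings:

```python
def combined_loss(sentence_loss: float, unit_level_loss: float,
                  lambda_u: float = 1.0) -> float:
    """Total training signal: global sentence alignment plus lambda_u times
    the part-level (unit vs. foil) contrast.  Values here are illustrative."""
    return sentence_loss + lambda_u * unit_level_loss

# Raising lambda_u puts more pressure on part-level correctness.
low = combined_loss(0.8, 0.5, lambda_u=0.5)   # 0.8 + 0.25 = 1.05
high = combined_loss(0.8, 0.5, lambda_u=2.0)  # 0.8 + 1.00 = 1.80
assert high > low
```

The ablations later in the paper report that increasing λu mainly lifts Half-Truth Accuracy while leaving other metrics largely steady, so this one knob trades little to gain robustness.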
Test-time (no change):
- What happens: We use the standard dual-encoder cosine similarity for retrieval—no extra modules, no decomposition at inference.
- Why this step exists: We want plug-and-play improvements without slowing down or complicating deployment.
- Example: Given a query caption, rank images by cosine similarity to the caption embedding.
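Test-time retrieval stays the standard cosine ranking, which is why CS-CLIP drops into existing pipelines unchanged. A minimal NumPy sketch with toy embeddings:

```python
import numpy as np

def rank_images(query_vec: np.ndarray, image_vecs: np.ndarray) -> np.ndarray:
    """Standard dual-encoder retrieval: rank gallery images by cosine
    similarity to one caption embedding.  No decomposition at inference."""
    q = query_vec / np.linalg.norm(query_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    scores = imgs @ q
    return np.argsort(-scores)  # indices, best match first

query = np.array([0.9, 0.1])            # caption embedding (toy)
gallery = np.array([[0.1, 0.9],         # poor match
                    [0.8, 0.2],         # best match
                    [0.5, 0.5]])        # middling match
order = rank_images(query, gallery)
assert order[0] == 1                    # the aligned image ranks first
```

Everything CS-CLIP changes happens during fine-tuning; this scoring path is byte-for-byte the same as vanilla CLIP retrieval.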
The Secret Sauce:
- Matched unit foils focus the model on the exact place meaning changes. This makes the single score sensitive to entities, attributes, roles, and spatial words. Ablations show foils drive most of the gains; updating both encoders especially helps relations; increasing λu mainly boosts Half-Truth robustness with little harm to other metrics.
🍞 Hook: Think of a teacher who grades both the whole essay and each fact-box. 🥬 The Concept: Contrastive Training (recap)
- What it is: Pull true pairs together; push false ones apart—now at both sentence and unit levels.
- How it works: Two losses, same cosine space; balance them.
- Why it matters: If we only grade the essay, not the fact-boxes, a polished but wrong detail can still earn high marks. 🍞 Anchor: After training, “a dog near a log” will beat “a dog away from the log” when the picture shows “near.”
04 Experiments & Results
The Test: Half-truth diagnostic
- What they measured: Half-Truth Accuracy—the chance the model prefers the correct short anchor over the anchor plus a single incorrect add-on. Random is 50%. If accuracy is below 50%, the model is more often tricked by half-truths.
- Why: Real users refine queries by adding details. Models should become stricter when details are added, not more gullible.
The Competition: Baselines include zero-shot CLIP, SigLIP, SigLIP2, and fine-tuned methods like NegCLIP, FSC-CLIP, ReadCLIP, CE-CLIP, DeGLA, and others.
The Scoreboard (with context):
- Overall Half-Truth Accuracy on COCO: • CLIP: 40.6% (worse than a coin flip—gets fooled more often than not) • NegCLIP: 56.5% (a helpful bump) • CS-CLIP: 69.3% (like moving from a shaky D to a solid B+)
- By addition type: • Entity additions: CLIP 52.9% → CS-CLIP 75.4% (clear gains) • Relation additions: CLIP 32.9% → CS-CLIP 65.5% (the hardest case is now above a strong pass)
- Similarity gap (anchor minus half-truth): CLIP negative on relations; CS-CLIP positive (+0.014), meaning the model now reliably lowers scores for wrong relation add-ons.
Surprising Findings:
- Many models—even large ones—still fall below chance on relation additions. This shows how sticky relation errors are without targeted supervision.
- CS-CLIP’s improvements on half-truths correlate with better scores across 16 compositional benchmarks (average I2T: 57.8%), suggesting we fixed a root cause, not just the test.
Beyond half-truths: Compositional benchmarks
- Average compositional I2T accuracy: • CLIP: 52.1% • NegCLIP: 55.3% • CS-CLIP: 57.8% (best among evaluated models)
- Group Accuracy on paired datasets also improves with CS-CLIP, meaning both directions (image→text and text→image) benefit together.
- Capability-wise gains: Strong jumps in role sensitivity and predicate sensitivity; solid attribute binding improvements.
Downstream tasks:
- Zero-shot classification (e.g., ImageNet family): Small trade-off (Acc@1: 63.6% CLIP → 59.9% CS-CLIP), similar to other COCO fine-tuned methods.
- Image–text retrieval (COCO, Flickr8k): CS-CLIP achieves top or tied best Recall@1, especially in T2I (71.7%), indicating better fine-grained alignment useful in practical retrieval.
Ablations that make the numbers meaningful:
- Freezing one encoder (text-only or image-only) hurts relation accuracy most; full fine-tuning works best.
- Larger backbones (ViT-L/14) further improve compositional and half-truth metrics.
- Raising the unit-loss weight λu mainly boosts Half-Truth Accuracy (esp. relations) while leaving other metrics steady.
- Using matched unit foils is the key driver of Half-Truth gains; when combined with sentence negatives, you get the best overall package.
Takeaway: CS-CLIP doesn’t just memorize special cases; it learns to check the little building blocks of meaning. That’s why it wins both the new half-truth test and a broad set of compositional challenges.
05 Discussion & Limitations
Limitations
- Text-only parsing: The unit extractor never sees the image. It can miss details not said in captions or introduce slight artifacts, which might limit the quality of unit–foil pairs.
- Data scope: Fine-tuning on COCO (smaller than CLIP’s pretraining data) trades some zero-shot accuracy for compositional sensitivity, a common and expected compromise.
- Not a general truth-checker: CS-CLIP reduces half-truth errors but doesn’t guarantee full factual correctness or remove demographic and association biases present in data.
- Narrow perturbations: Minimal foils check single-step mistakes well, but they don’t cover all tricky errors (like invented entities far off-topic).
Required Resources
- A CLIP-style model (e.g., ViT-B/32 or larger) and a GPU setup (the paper used 8×A100s) for fine-tuning.
- A caption parsing + foil generation pipeline (LLM-based), plus light post-processing filters.
- Standard training code (e.g., OpenCLIP) and datasets like COCO.
When NOT to Use
- If you need strongest zero-shot classification on many unrelated domains and cannot afford any fine-tuning trade-off.
- If you lack captions suitable for clean unit parsing (e.g., very noisy/no-caption settings).
- If inference-time speed is not the top priority and you prefer complex architectures that might squeeze a bit more accuracy.
Open Questions
- Joint image–text parsing: Can we improve unit extraction by looking at the picture too, not just the words?
- Scaling up: If we add unit-level supervision during huge pretraining, can we keep or even improve zero-shot accuracy while gaining compositional strength?
- Beyond single edits: How do we handle multi-step, longer-range reasoning and negation more robustly?
- Fairness and safety: Does part-level supervision interact with bias mitigation, and can it help reduce harmful correlations?
- Image-side half-truths: What happens if we minimally edit images (not texts) to create visual foils—can the same idea help there too?
06 Conclusion & Future Work
Three-sentence summary: CLIP-style models often reward “half-truths,” where adding one wrong but plausible detail makes a caption seem more similar to an image. CS-CLIP fixes this by supervising the small building blocks of meaning—entities and relations—contrasting each correct unit with a minimally edited foil while keeping the standard dual-encoder setup. This raises Half-Truth Accuracy to 69.3% on COCO and improves broad compositional benchmarks without changing test-time inference.
Main Achievement: Showing that targeted, unit-level supervision can make a single similarity score reflect true compositional correctness—especially for relations—without changing the model’s architecture or retrieval pipeline.
Future Directions: Add vision-aware parsing to create cleaner units and foils; scale unit-level supervision to pretraining to reduce zero-shot trade-offs; explore image-side foils; and deepen coverage for negation, counting, and multi-step reasoning.
Why Remember This: It turns a big idea—“grade the parts, not just the whole”—into a practical recipe that measurably reduces a common failure (half-truths) and lifts performance across many compositional tests, all while preserving CLIP’s beloved speed and simplicity at inference.
Practical Applications
- Improve image search so adding more details (color, count, position) narrows results correctly instead of drifting.
- Enhance product discovery (e-commerce) by reliably matching multi-attribute queries like “small red ceramic bowl on a wooden shelf.”
- Boost accessibility captions by reducing confident errors when adding specifics (e.g., “man holding umbrella under bridge”).
- Strengthen content moderation by catching misleading near-matches that include a single wrong attribute or relation.
- Curate datasets by filtering out image–caption pairs that seem similar overall but fail on key details (e.g., wrong role or spatial word).
- Power creative asset tools (stock libraries) to honor fine-grained prompts (binding colors to the right objects, correct spatial layouts).
- Support robotics or AR systems that need precise grounding of relations like “tool left of the blue box.”
- Improve retrieval-based evaluation pipelines for VLM training by using unit-level foils as targeted tests.
- Assist education tools with visual questions that require verifying who-does-what-to-whom, not just spotting objects.
- Aid safety audits by stress-testing models with half-truth probes to surface overconfidence on wrong details.