Relational Visual Similarity
Key Summary
- Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.
- This paper introduces relational visual similarity, which matches images by their shared logic and relationships, not just their looks.
- The authors build an image–caption dataset where captions are anonymous and describe the idea (like 'transformation of {subject} over time') instead of the objects.
- They fine-tune a Vision–Language Model so image features line up with these abstract, relational captions.
- Their model, called relsim, retrieves images that share the same underlying idea, even when they look totally different.
- In tests judged by GPT-4o and by people, relsim beats popular baselines like LPIPS, DINO, dreamsim, and CLIP.
- Group-based captioning (using sets of images with the same logic) works much better than captioning single images.
- Relational similarity complements attribute similarity; together, they form a richer map of how images are related.
- This unlocks creative search and analogical image generation—finding or making new images with the same core idea.
- Limitations include dataset scale and bias, VLM hallucinations, and handling images with multiple valid relational readings.
Why This Research Matters
Relational visual similarity lets us search and create by idea, not just by object, which matches how people think and learn. Teachers and students can find visual analogies that make tough concepts click—like seeing plate tectonics through a cracked orange. Designers and artists can discover fresh inspiration by retrieving images that share a clever pattern or layout, not just a shared noun. Scientists and analysts can compare processes (growth, diffusion, symmetry) across domains, sparking cross-field insights. Generative tools can be evaluated on whether they preserved the concept, not only the style, making edits more meaningful. Overall, this brings AI vision a step closer to human-like reasoning, powered by analogies.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how you can say two things are alike in different ways—like two apples look alike, but also Earth is like a peach because both have layers? That's two kinds of 'sameness.'
🥬 Filling (The Actual Concept): What it is: Attribute similarity is when things match by looks (color, shape, texture, class), while relational similarity is when things match by how their parts relate or behave (like 'layers inside,' 'growing over time,' or 'many smalls make one big'). How it works: 1) Your eyes notice features like red, round, furry. 2) Your mind also notices patterns like 'sequence,' 'inside vs. outside,' 'big supports small,' or 'changes over time.' 3) You can match two scenes by the same pattern even if the objects are different. Why it matters: If we only notice looks, we miss clever connections humans use for learning and creativity—like analogies.
🍞 Bottom Bread (Anchor): A line of matches burning from left to right is like a banana ripening from green to yellow—same 'changes over time' pattern, very different objects.
The world before: Computer vision models got great at noticing attribute similarity. Early tools compared pixels or handcrafted features (like SSIM, SIFT). Then deep nets (VGG, ResNet) and modern encoders (DINO, CLIP, dreamsim) learned to match images more like people do: two different dogs are 'the same class' and often counted as similar. This was a huge leap past raw pixels. But even these advanced models mostly asked, 'Do they look alike or belong to the same category?'
🍞 Top Bread (Hook): Imagine you’re sorting picture cards. One pile is 'all cats,' easy by looks. Another pile is 'things that show stages'—a butterfly’s life cycle, a flower blooming, a moon’s phases. That second pile needs understanding, not just seeing.
🥬 Filling (The Actual Concept): What it is: Relational visual similarity is matching images by their internal relationships—like sequences, symmetry, containment, or analogy—regardless of object identity. How it works: 1) Identify the key players in the scene (parts, objects). 2) Spot how they relate (order, cause, mirror, nest, transform). 3) Describe the pattern without naming the actual objects (use placeholders like {subject}). Why it matters: Models that miss relationships fail at analogies, creative search, and deeper comprehension.
🍞 Bottom Bread (Anchor): A 'walnut brain' and a 'strawberry heart' both show the pattern 'object shaped or arranged to look like another thing'—a shared idea, not shared category.
The problem: State-of-the-art similarity tools (LPIPS, CLIP, DINO) still lean heavily on attributes. They often fail to link images that share the same logic but have different objects. They can’t easily say, 'These both show a transformation over time' or 'This design turns food into animals by carving.'
Failed attempts: 1) Just tuning vision encoders on more data—still mostly learns attributes. 2) Caption-first retrieval from a single image—VLMs tend to leak object words instead of staying abstract, so they drift back to semantics and looks. 3) Perceptual metrics (LPIPS) and self-supervised features (DINO) don’t encode the higher-level concepts needed for analogy.
🍞 Top Bread (Hook): Imagine giving directions using only road shapes and turns, not street names. That’s how you describe a route’s logic without specific labels.
🥬 Filling (The Actual Concept): What it is: Anonymous captions are object-agnostic descriptions that name the idea, not the objects (e.g., 'transformation of {subject} over time'). How it works: 1) Gather multiple images that share a logic. 2) Write one caption that fits them all using placeholders. 3) Pair each image with that caption so the model learns to connect pictures to the shared idea. Why it matters: Without abstract captions, models slide back into naming objects and colors.
🍞 Bottom Bread (Anchor): 'Arrange many small {object}s to form a larger {shape}' can describe seashell mosaics, cookie collages, or bottle-cap art.
The gap: There was no large, clean dataset that teaches models to connect images by relational logic using object-agnostic language. Without that, training a model to notice relationships was like teaching without a curriculum.
What this paper fills: The authors create a 114k-image dataset with anonymous captions and train a Vision–Language Model (VLM) to align image features with these relational descriptions. This produces a new metric and retriever—relsim—that can say, 'These two images share the same idea,' even when they look different.
Real stakes: This helps people find inspiration by idea (not just by object), teach with analogies (finding visuals that share a concept), support designers and scientists (pattern-based search), and evaluate new generative tools (did the edit keep the idea, not just the style?). It’s a step toward AI that reasons visually more like we do.
02 Core Idea
🍞 Top Bread (Hook): Imagine two music pieces played on different instruments but with the same melody. Even if the sounds differ, you recognize the tune.
🥬 Filling (The Actual Concept): What it is: The key insight is to teach a model to match images by abstract captions that capture the shared logic (the 'melody') rather than the surface objects (the 'instruments'). How it works: 1) Filter a giant image pool to keep 'relationally interesting' pictures. 2) Create anonymous, object-agnostic captions from groups of images with the same idea. 3) Train a Vision–Language Model so the image features line up with these abstract text features. Why it matters: Without aligning images to idea-only language, the model keeps grabbing onto looks and misses analogies.
🍞 Bottom Bread (Anchor): A lineup of socks from small to large and a lineup of trees from seedling to tall both fit the caption 'growth stages of {subject} arranged left to right.'
Multiple analogies:
- Music: Different instruments, same melody (different visuals, same logic).
- Recipes: Different ingredients, same cooking method (different objects, same transformation pattern).
- Lego: Different colored bricks, same blueprint (different textures/colors, same arrangement rules).
Before vs After:
- Before: Similarity meant 'looks alike' or 'same label' (e.g., 'two cats' or 'two round red fruits').
- After: Similarity can also mean 'shares a pattern' (e.g., 'sequence of stages,' 'inside-outside diagram,' 'object arranged to mimic another object'), even when the items look unrelated.
Why it works (intuition):
- Language can name invisible ideas (sequence, symmetry, analogy) that pixels alone can’t. By pairing images with idea-only captions, we push the model’s vision features to care about relationships over raw appearance. Group-based captions reduce leakage of object names and make the abstract pattern clearer. A VLM brings world knowledge, helping map visual cues to conceptual patterns.
Building blocks (bite-sized):
- 🍞 Hook: Imagine choosing the best coach to learn dance—one who explains steps, not just shows them.
- 🥬 Concept: Filter for 'interesting' images (those likely to have a logic) so the dataset teaches patterns, not just pretty pictures.
- 🍞 Anchor: 'Food carved into animal shapes' beats 'photo of toast' for learning a reusable idea.
- 🍞 Hook: Think of a teacher writing a rule that works for multiple examples.
- 🥬 Concept: Use groups of images with the same logic to write one anonymous caption with placeholders; pair that caption to each image.
- 🍞 Anchor: 'Use {object} pieces to form a portrait of {subject}' can match bottle caps, paper scraps, or cereal.
- 🍞 Hook: Imagine lining up puzzle pieces so the picture matches the box cover.
- 🥬 Concept: Train the VLM so image features match the abstract text features (contrastive learning), nudging images with the same idea closer.
- 🍞 Anchor: Two very different collages both move near each other in feature space because they share 'many smalls form one big.'
- 🍞 Hook: Picture a librarian who can shelve books by theme, not only by author.
- 🥬 Concept: Use the learned features to retrieve by idea—return images with the same logic, not just similar looks.
- 🍞 Anchor: Search by 'transformation of {subject} over time' and get seasons changing, bread rising, shadows moving—same tune, new instruments.
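To make 'retrieve by idea' concrete, here is a minimal sketch (not the authors' code) of nearest-neighbor search over precomputed relational image features using cosine similarity; the feature dimension and the random vectors standing in for real embeddings are purely illustrative.

```python
import numpy as np

def cosine_sim(query: np.ndarray, gallery: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of gallery vectors."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

# Hypothetical precomputed relational features (e.g., from a relsim-style encoder).
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512))   # 1000 images, 512-d "idea" features
query = rng.standard_normal(512)             # the query image's feature

scores = cosine_sim(query, gallery)
top5 = np.argsort(-scores)[:5]               # indices of the 5 most idea-similar images
print("nearest neighbors by shared logic:", top5)
```

In practice the gallery and query vectors would come from the same trained encoder, so 'nearby' means 'same underlying logic' rather than 'same objects.'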
03 Methodology
High-level recipe: Input (images from LAION-2B) → Step A (filter for relationally interesting images) → Step B (make group-based anonymous captions) → Step C (train relsim to align image features with abstract text) → Output (a model that measures and retrieves by relational similarity).
Step A: Filter images for relational potential
- 🍞 Hook: Imagine panning for gold—you sift out the sand to keep only the shiny bits.
- 🥬 Concept: What happens: Fine-tune a Vision–Language Model (Qwen2.5-VL-7B) to label images as 'interesting' if they likely contain reusable relational patterns (sequence, analogy, composition). Why this step: LAION-2B is huge and noisy; without filtering, the model sees too many plain 'just a sofa' images and learns weak patterns. Example: Keep 'strawberries arranged into a heart' and drop 'a single plain chair.'
- 🍞 Anchor: The filter agrees with humans about 93% of the time and keeps about 114k 'interesting' images from a massive pool.
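The paper's filter is a fine-tuned Qwen2.5-VL-7B; the snippet below is only a hedged sketch of the general pattern (ask a VLM a yes/no question about relational interestingness and keep the 'yes' images). The prompt wording and the `ask_vlm` helper are hypothetical, not the authors' implementation.

```python
from pathlib import Path

PROMPT = (
    "Does this image contain a reusable relational pattern "
    "(e.g., a sequence, an analogy, parts arranged to form something else)? "
    "Answer 'yes' or 'no'."
)

def ask_vlm(image_path: Path, prompt: str) -> str:
    """Placeholder: call your vision-language model here
    (the paper fine-tunes Qwen2.5-VL-7B for this filtering role)."""
    raise NotImplementedError

def filter_interesting(image_paths: list[Path]) -> list[Path]:
    """Keep only the images the filter judges relationally 'interesting'."""
    keep = []
    for path in image_paths:
        answer = ask_vlm(path, PROMPT).strip().lower()
        if answer.startswith("yes"):
            keep.append(path)
    return keep
```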
Step B: Create anonymous captions from groups
- 🍞 Hook: It’s easier to explain a rule when you show multiple examples at once.
- 🥬 Concept: What happens: Manually collect 532 small groups (each 2–10 images) where all share one logic (e.g., 'food turned into animals by carving'). Use a frozen VLM to write a single abstract caption with placeholders (like {subject}, {object}) that fits every image in the group. Humans verify the caption; then apply a trained captioner to the rest of the filtered set. Why this step: Writing an idea from one image is tricky and often leaks object words; groups make the shared logic obvious and keep the caption abstract. Example: 'Transform a {fruit} into an {animal} by carving' pairs with many images beyond strawberries.
- 🍞 Anchor: This produces a dataset of {image, anonymous caption} for 114,881 images where the text focuses on the idea, not the nouns.
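A hedged sketch of the group-based captioning pattern follows: show a frozen VLM several images that share one logic, ask for a single placeholder-based caption, and pair that caption with every image in the group. The prompt text and the `caption_group` helper are illustrative stand-ins, not the paper's exact pipeline.

```python
GROUP_PROMPT = (
    "You are shown several images that all share one underlying idea. "
    "Write ONE short caption that fits every image, using placeholders such as "
    "{subject} or {object} instead of naming any specific thing. "
    "Describe the relationship or logic, not the appearance."
)

def caption_group(image_paths: list[str], prompt: str = GROUP_PROMPT) -> str:
    """Placeholder: send the whole group plus the prompt to a frozen VLM
    and return its single anonymous caption."""
    raise NotImplementedError

# Hypothetical seed group: every image shows food carved into an animal shape.
group = ["apple_swan.jpg", "melon_fish.jpg", "radish_mouse.jpg"]
anonymous_caption = caption_group(group)
# Expected flavor of output: "transform a {food} into an {animal} by carving"
pairs = [(img, anonymous_caption) for img in group]  # one {image, caption} pair per image
```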
Step C: Train the relational visual similarity model (relsim)
- 🍞 Hook: Imagine teaching two kids—one hums the melody (text), one plays it on piano (image); you help them sync up.
- 🥬 Concept: What happens: Use a VLM as the visual feature extractor (Qwen2.5-VL-7B) and a fixed sentence encoder for text (all-MiniLM-L6-v2). Normalize both features and learn to make the image and its caption embeddings close (and non-matching pairs far) using contrastive learning (InfoNCE) with a learnable temperature. Optionally prepend a short instruction to the image input to 'think about underlying logic.' Why this step: Pure vision encoders cling to attributes; aligning with idea-only text pushes the image features toward relations. Example with data: Image: line of candles at different heights; Caption: 'arrange {subject} from least to most to show progression.' The model pulls these together in feature space.
- 🍞 Anchor: After training, two unrelated-looking images with the same logic will sit near each other in the model’s space, making relational retrieval possible.
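The alignment objective described here is standard symmetric InfoNCE with a learnable temperature. The minimal PyTorch sketch below assumes both sides have already been pooled to fixed-size vectors (image features from the VLM, caption features from the frozen sentence encoder) and uses random tensors as stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce(img_feats: torch.Tensor, txt_feats: torch.Tensor,
             log_temp: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE: matching image/caption pairs are pulled together,
    all other pairs in the batch are pushed apart."""
    img = F.normalize(img_feats, dim=-1)          # (B, D)
    txt = F.normalize(txt_feats, dim=-1)          # (B, D)
    logits = img @ txt.t() / log_temp.exp()       # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for VLM / sentence-encoder outputs.
B, D = 8, 384
img_feats = torch.randn(B, D, requires_grad=True)
txt_feats = torch.randn(B, D)                     # frozen text side: no gradient needed
log_temp = torch.nn.Parameter(torch.tensor(0.0))  # learnable temperature (exp(0) = 1)
loss = info_nce(img_feats, txt_feats, log_temp)
loss.backward()
```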
Implementation details (kid-friendly clarity):
- 🍞 Hook: Think of adding a special bookmark so the model knows where to focus.
- 🥬 Concept: A learnable query token is appended (like a pointer) so the VLM’s last-layer feature there becomes the 'relational image feature.' Training uses LoRA (a memory-saving tuning trick) for 15k iterations on 8×A100 GPUs; text encoder stays frozen. Why: The query token stabilizes where we read the visual idea from; keeping text fixed turns captions into a steady 'idea ruler.' Without it, features wobble or drift to surface details.
- 🍞 Anchor: The result is a compact vector for each image that encodes 'idea-ness,' not just 'looks-ness.'
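As a hedged illustration of the query-token idea (not Qwen2.5-VL's real interface), the sketch below appends one learnable token to a generic transformer backbone's input embeddings and reads the last-layer state at that slot as the relational image feature; in the actual setup the backbone would additionally carry LoRA adapters while the text encoder stays frozen.

```python
import torch
import torch.nn as nn

class RelationalFeatureHead(nn.Module):
    """Minimal sketch: append one learnable query token to the backbone's input
    embeddings and read the last-layer hidden state at that position as the
    'relational image feature'. `backbone` is assumed to map (B, T, H) token
    embeddings to (B, T, H) hidden states; real code would use the VLM with
    LoRA adapters instead of this generic stand-in."""
    def __init__(self, backbone: nn.Module, hidden_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone
        self.query_token = nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
        self.proj = nn.Linear(hidden_dim, out_dim)   # project to the text embedding dimension

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, T, hidden_dim) embeddings of the image (+ instruction) tokens
        query = self.query_token.expand(token_embeds.size(0), -1, -1)
        hidden = self.backbone(torch.cat([token_embeds, query], dim=1))  # (B, T+1, hidden_dim)
        return self.proj(hidden[:, -1])              # feature read at the query-token slot
```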
Evaluation protocol (how they checked it works):
- 🍞 Hook: Imagine a fair judge who ignores costumes and only scores the dance steps.
- 🥬 Concept: Build a retrieval test set with 14k labeled images plus 14k random distractors. For each query image, each method returns its nearest neighbor; GPT-4o scores relational similarity from 0–10 based only on shared logic. A human AB test also asks people to pick which of two retrieved images matches the query’s idea. Why: If the model truly learned relations, its top neighbor should share the idea, not just the objects. Without this, the model might just return 'another cat.'
- 🍞 Anchor: relsim wins the highest GPT-judged score and is preferred by humans over baselines in most comparisons.
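A hedged sketch of the evaluation loop: each method embeds the pool, the query's nearest neighbor is retrieved, and an LLM judge scores only the shared logic on a 0–10 scale. The `judge_relational_similarity` wrapper below is hypothetical; the paper uses GPT-4o in this role.

```python
import numpy as np

def nearest_neighbor(query_vec: np.ndarray, gallery: np.ndarray) -> int:
    """Index of the gallery image closest to the query under cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return int(np.argmax(g @ q))

def judge_relational_similarity(query_img: str, retrieved_img: str) -> float:
    """Placeholder for the LLM judge: score 0-10 how well the two images share
    the same underlying logic, ignoring objects and style."""
    raise NotImplementedError

def evaluate(query_vecs: np.ndarray, gallery_vecs: np.ndarray,
             query_paths: list[str], gallery_paths: list[str]) -> float:
    """Mean judge score over all queries for one retrieval method's embeddings."""
    scores = []
    for i, q_path in enumerate(query_paths):
        j = nearest_neighbor(query_vecs[i], gallery_vecs)
        scores.append(judge_relational_similarity(q_path, gallery_paths[j]))
    return float(np.mean(scores))
```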
The secret sauce (what’s clever):
- Group-based anonymous captions: Captions written from sets make abstraction stronger and prevent object-word leakage.
- VLM backbone: Language gives names to patterns pixels can’t easily express (sequence, containment, analogy), and the VLM brings that knowledge into vision features.
- Contrastive alignment to abstract text: Forces the image representation to care about logic over surface attributes.
- Instruction steering + learnable query token: Gentle nudges help the model harvest 'idea features' from the right internal spot.
What would break without each step:
- No filtering: Too much noise; the model underfits relational patterns.
- No groups for captions: Captions become object-specific; retrieval collapses into semantic or attribute matching.
- No VLM/abstract text alignment: Features revert to 'what things are' instead of 'how things relate.'
- No contrastive training: Images and captions don’t meet in the same idea space; retrieval fails.
- No evaluation by logic-only judge: You can’t tell if improvements are due to looks or true relations.
04 Experiments & Results
The test (what they measured and why):
- Task: Relational image retrieval—given a query, find the most logically similar image in a large pool (not the same object, but the same idea).
- Metric: GPT-4o gives a 0–10 score focused only on relational similarity. A human AB test asks which of two retrievals better matches the query’s idea.
- Why: These two lenses (an automated judge and human choices) check both consistency and human alignment.
The competition (baselines compared):
- Attribute-heavy metrics: LPIPS (perceptual), DINO (self-supervised features), dreamsim (synthetic similarity learning), CLIP-I (image-to-image in CLIP space).
- Captioned paths from a single image: CLIP-T (caption → image) and Qwen-T (text-to-text), both using captions generated from only one image (more leakage risk). Also, fine-tuned pure vision encoders (CLIP, DINO) on the same dataset to see if they can catch up.
The scoreboard (with context):
- relsim achieves a GPT score of 6.77. Think of it as a solid A when most others land around B.
- CLIP-I reaches 5.91 (best baseline), meaning it sometimes catches abstractions, but still often sticks to object/semantic anchors.
- DINO (5.14) and LPIPS (4.56) trail—good for looks, weak for logic.
- Captioned single-image routes (5.33 for CLIP-T, 4.86 for Qwen-T) underperform because the caption often leaks objects or misses the core pattern when written from just one picture.
- Fine-tuned CLIP/DINO improve but still don’t surpass relsim; adding language-grounded knowledge clearly helps.
Human preferences:
- In AB tests, people pick relsim’s retrieval more often across all baselines (roughly 42.5%–60.7% win rates), with some ties. This shows the model’s outputs feel more 'idea-matching' to humans, not just 'look-matching.'
Surprising or notable findings:
- Group-based captioning strongly beats single-image captioning—seeing multiple examples helps VLMs write truly abstract captions.
- Vision-only encoders benefit from training on anonymous captions, but a VLM backbone still wins, reinforcing that language scaffolds relational understanding.
- Attribute and relational similarity are complementary. A visualization shows how combining both can separate 'same logic, same appearance' from 'same logic, different appearance' and from 'random.' This hints at richer search and exploration when both axes are available.
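As a loose illustration of the complementarity point (not the paper's visualization code), a pair of images can be placed on a two-axis plane of attribute similarity and relational similarity and bucketed into the three regimes described above; the thresholds here are arbitrary.

```python
def categorize(attr_sim: float, rel_sim: float,
               attr_thresh: float = 0.6, rel_thresh: float = 0.6) -> str:
    """Toy two-axis reading of an image pair. Threshold values are
    illustrative, not taken from the paper."""
    if rel_sim >= rel_thresh and attr_sim >= attr_thresh:
        return "same logic, same appearance"
    if rel_sim >= rel_thresh:
        return "same logic, different appearance"
    return "weakly related / random"

print(categorize(attr_sim=0.8, rel_sim=0.9))   # looks alike and shares the idea
print(categorize(attr_sim=0.2, rel_sim=0.85))  # looks different but same idea
print(categorize(attr_sim=0.1, rel_sim=0.2))   # neither
```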
Qualitative examples:
- Only relsim retrieves a 'food carved to look like an animal' image when the query shows a different food–animal pair.
- For 'one element highlighted among many similar ones,' relsim finds scenes with conceptual outliers, not just the same object.
Broader evaluation: Analogical image generation benchmark
- The team assembled 200 triplets (input, text instruction, example output) representing 'keep the idea, change the instance.' They measured: LPIPS (perceptual), CLIP (semantic), and relsim (relational). Results show that logical similarity can be high even when visual/semantic similarity is low, and that proprietary models better preserve relational transformations than current open-source ones—pointing to a training data and capability gap.
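A hedged sketch of how such a triplet benchmark could be scored under the three lenses; the metric functions are placeholders (note that LPIPS proper is a distance, where lower means more similar), and the field names are illustrative rather than the authors' schema.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    input_image: str       # source image
    instruction: str       # text edit that should keep the idea, change the instance
    output_image: str      # image produced by a generative model

def perceptual_sim(a: str, b: str) -> float:
    """Placeholder for a perceptual metric such as LPIPS (a distance: lower = more alike)."""
    raise NotImplementedError

def semantic_sim(a: str, b: str) -> float:
    """Placeholder for CLIP image-to-image similarity."""
    raise NotImplementedError

def relational_sim(a: str, b: str) -> float:
    """Placeholder for relsim: do the two images share the same underlying logic?"""
    raise NotImplementedError

def score_triplet(t: Triplet) -> dict[str, float]:
    """A generation can score low on looks or semantics yet high on shared logic."""
    return {
        "perceptual": perceptual_sim(t.input_image, t.output_image),
        "semantic": semantic_sim(t.input_image, t.output_image),
        "relational": relational_sim(t.input_image, t.output_image),
    }
```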
05 Discussion & Limitations
Limitations (clear-eyed view):
- Dataset scale and bias: The 532 seed groups for caption training are manually curated and may reflect certain creative tropes better than others. This limits coverage and might encode cultural bias.
- VLM hallucinations: Like all large models, the captioner can insert incorrect or overly specific details. Some 'anonymous' captions may sneak in semantics, weakening abstraction.
- Multi-relation images: A single image can express several valid logics (e.g., 'symmetry' and 'inside/outside'). The current system doesn’t let users easily pick which relation they mean unless guided by prompts.
- Evaluation proxy: GPT-4o as judge is helpful but still a proxy; broader human studies and task-based metrics would deepen validation.
Required resources:
- Compute: Training used 8×A100 GPUs with LoRA for ~15k iterations. Inference is modest once trained, but reproducing the pipeline (filtering, captioning, finetuning) needs VLM access and GPU time.
- Data: Access to LAION-scale images and careful filtering, plus human verification for seed groups.
When not to use this:
- Purely perceptual tasks (e.g., compression artifacts, super-resolution evaluation) where pixel fidelity matters more than concepts.
- Category retrieval where users just want 'more of the same object' quickly—attribute similarity or semantics may be faster and simpler.
- Tight safety filters that rely on detecting known classes (e.g., policy enforcement) where high recall on attributes is key.
Open questions:
- Interactive disambiguation: How can users specify 'which relation' they want (sequence vs. analogy vs. containment) in a friendly way?
- Scaling abstraction: Can we automatically discover and expand relational templates from video and multi-image stories?
- Fairness and coverage: How to ensure diverse creative logics across cultures and domains, avoiding overfitting to popular internet tropes?
- Beyond images: Extending to video (temporal logic), diagrams (structural logic), and cross-modal analogy (sound patterns ↔ visual rhythms).
- Joint axes: How to best combine attribute and relational similarity into a single controllable retrieval knob for search and creative tools?
06 Conclusion & Future Work
Three-sentence summary: This paper introduces relational visual similarity—matching images by shared logic rather than surface looks—and builds relsim, a model trained to align images with anonymous, object-agnostic captions. Using group-based captioning and a Vision–Language Model with contrastive learning, relsim retrieves images that share ideas (like 'progression over time') even when they look different. It outperforms strong baselines in GPT-4o scoring and human studies and enables creativity-centric tasks like analogical search and generation.
Main achievement: Turning relational similarity into a measurable, learnable signal by pairing images with abstract, placeholder-based captions and aligning a VLM’s visual features to this idea space.
Future directions: Scale up automated discovery of relational groups; add user controls to pick specific relations; blend attribute and relational axes into unified, steerable retrieval; extend to video and multimodal analogies; and build richer benchmarks for analogical generation, especially for open-source models.
Why remember this: It’s a shift from 'seeing what is there' to 'understanding how things relate,' bringing machines closer to the human knack for analogy—the engine of learning, problem-solving, and creativity.
Practical Applications
- Idea-based image search: Find images that share 'progression over time' or 'many smalls form one big' regardless of objects.
- Creative mood boards: Build boards around a relational theme (e.g., symmetry, inversion, containment) for design brainstorming.
- Classroom analogies: Retrieve visuals that explain science concepts via everyday analogies (e.g., Earth layers ↔ peach).
- Storyboarding and comics: Search for panel layouts that share narrative logic (before/after, cause/effect) across topics.
- UX pattern mining: Find interface screenshots with the same relational layout (highlighting, grouping, hierarchy) across products.
- Scientific figure retrieval: Locate diagrams that encode the same structure (flow, feedback loops) even in different fields.
- Analogical image generation: Prompt models to create new images that keep the input’s idea while changing objects.
- Curation and recommendation: Suggest content by shared conceptual pattern to diversify feeds beyond categories.
- Model evaluation: Score edits by whether they preserve the intended idea (relsim), not only visual fidelity (LPIPS) or semantics (CLIP).
- Assistive authoring: Recommend abstract captions for an image that highlight its underlying logic, aiding documentation.