Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views
Key Summary
- The paper shows that judging vector search only by distance-based recall and speed can be very misleading for real tasks.
- They introduce the Information Loss Funnel to explain three places where useful information gets lost: during embedding, from choosing the wrong similarity metric, and from method sensitivity to data shape.
- Iceberg is a new benchmark that tests vector search end-to-end on real tasks like image classification, face recognition, text retrieval, and recommendation.
- Across eight large datasets (1M–100M vectors) and 13 popular methods, the rankings change a lot when you measure task success instead of synthetic recall.
- Sometimes methods hit 99.9% synthetic recall but return almost all wrong answers for the task (like EVA02 with inner product on ImageNet).
- Iceberg provides a simple, explainable decision tree that picks the right metric and method using fast meta-features such as DBI, CV, RA, and RC.
- Metric choice matters: face recognition preferred inner product, while text retrieval and many image embeddings worked better with Euclidean distance.
- Partition-based methods can beat graph methods on some datasets, but graph methods win on others; data distribution decides.
- The authors outline future directions: task-aware, metric-aware, and distribution-aware vector search.
- Iceberg helps practitioners avoid hidden pitfalls and choose vector search setups that actually improve downstream results.
Why This Research Matters
Real-world systems care about correct answers, not just close vectors. A face ID system must return the right person, a RAG system must fetch the correct passages, and a shop must recommend items users actually want. Iceberg shows how popular scores can look perfect while the task fails badly, and it gives a simple way to avoid those traps. By using the decision tree and meta-features, teams can pick the right metric and method quickly, saving time and computation. This leads to safer security checks, more helpful chatbots, and higher-quality recommendations. As embeddings and tasks evolve, Iceberg's task-centric view keeps evaluations honest and aligned with user outcomes.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're looking for your lost blue pencil in a giant box of crayons. If you only measure how close crayons are in the box, you might grab many crayons that are near your hand, but not the blue pencil you need.
The Concept (Vector Similarity Search, VSS): VSS is how computers quickly find things that are "most similar" by comparing their number-based fingerprints (vectors).
- What it is: A way to find items that are alike by comparing their vectors instead of searching everything.
- How it works: 1) Turn data (images, text, users) into vectors. 2) Build a structure to search fast. 3) For each query, return close vectors. 4) Use a metric (like Euclidean or inner product) to judge closeness.
- Why it matters: Without fast, accurate VSS, apps like recommendations, search, or RAG would be slow or wrong. Anchor: When you ask a chatbot to find relevant passages, it compares vector fingerprints to fetch matching chunks.
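The four steps above can be sketched in a few lines of Python. This is a minimal brute-force illustration (no index structure yet) with made-up toy data; it is not code from the paper.

```python
import numpy as np

def brute_force_search(base, query, k=5, metric="euclidean"):
    """Return the indices of the k most similar base vectors to the query."""
    if metric == "euclidean":
        scores = np.linalg.norm(base - query, axis=1)  # smaller = closer
        return np.argsort(scores)[:k]
    if metric == "inner_product":
        scores = base @ query                          # larger = more aligned
        return np.argsort(-scores)[:k]
    raise ValueError(f"unknown metric: {metric}")

# Toy data: 1,000 random 128-D "fingerprints" and one query vector.
rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 128)).astype(np.float32)
query = rng.normal(size=128).astype(np.float32)
print(brute_force_search(base, query, k=5))
```

Real systems replace the brute-force loop with an index (covered in the Methodology section), but the interface stays the same: vectors in, top-K IDs out.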
Hook: You know how report cards show grades, but grades alone don't say if you actually learned what you needed?
The Concept (Synthetic Recall): Synthetic recall@K measures how many true nearest neighbors (by distance) were found among the top K results.
- What it is: A distance-based accuracy score, not tied to the real task goal.
- How it works: 1) Compute the true top-K by the chosen distance. 2) See how many of those the method retrieved. 3) Divide by K.
- Why it matters: Alone, it can be misleading because it ignores whether the retrieved items are actually useful for the task. Anchor: A search might grab items that are close by vector, but not the right class or answer.
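As a minimal sketch (toy IDs, not figures from the paper), synthetic recall@K compares an index's answers against the exact top-K found by brute force under the same distance:

```python
import numpy as np

def exact_top_k(base, query, k):
    """Ground truth: the true top-K under Euclidean distance, found by brute force."""
    return np.argsort(np.linalg.norm(base - query, axis=1))[:k]

def synthetic_recall_at_k(retrieved_ids, ground_truth_ids):
    """Fraction of the true top-K that the index actually returned."""
    k = len(ground_truth_ids)
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids)) / k

# Toy example: the exact top-5 vs. what an approximate index returned.
ground_truth = [12, 7, 90, 3, 55]
approximate = [12, 7, 55, 41, 3]
print(synthetic_recall_at_k(approximate, ground_truth))  # 0.8 (4 of 5 true neighbors found)
```

Notice that nothing in this score asks whether ID 12 or 55 is actually useful for the task; that is exactly the gap the next concept addresses.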
Hook: Think of judging a soccer team by goals scored, not how nice their uniforms look.
The Concept (Task-Centric Metrics): These measure success for the real job (like correct label, hit rate, or match quality).
- What it is: Metrics like Label Recall@K, Hit@K, or Matching Score@K that reflect task utility.
- How it works: 1) Define what counts as a "correct" result (same label, right paragraph, preferred item). 2) Check top-K results. 3) Score based on task labels.
- Why it matters: If these don't improve, a high synthetic recall doesn't help users. Anchor: If you search for "penguin photos," task-centric means the top results are actually penguins, not just images nearby in vector space.
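Hit@K, one of these task metrics, can be as simple as checking whether any truly relevant item appears in the top K. The sketch below is a plausible minimal version; the paper's exact definition (and the Matching Score@K used for recommendation) may be computed differently.

```python
def hit_at_k(retrieved_ids, relevant_ids, k=100):
    """1.0 if any task-relevant item appears in the top-K results, else 0.0."""
    return float(any(i in relevant_ids for i in retrieved_ids[:k]))

# Toy example: the correct passage for this query has id 42.
print(hit_at_k(retrieved_ids=[17, 42, 5], relevant_ids={42}))  # 1.0
# Averaging hit_at_k over many queries gives a dataset-level Hit@K.
```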
The World Before: Vector databases and search engines exploded in popularity because they help with recommendation, semantic search, face ID, and RAG. Benchmarks mostly tested a trade-off: recall vs. latency. But these tests used only distance-based ground truths (Euclidean or inner product), not whether results helped the final task.
The Problem: High synthetic recall often doesnāt equal good task outcomes. For example, a method could find vectors close by inner product but still return the wrong class of images.
Hook: Imagine pouring water through a funnel, but tiny gems sink into the filter and never make it to your cup.
The Concept (Information Loss Funnel): A three-stage story of where useful meaning gets lost before results reach the task.
- What it is: A model explaining why a perfect distance score can still fail real tasks.
- How it works: 1) Embedding Loss: turning raw data into vectors discards details. 2) Metric Misuse: wrong similarity metric breaks alignment. 3) Distribution Sensitivity: method choices perform unevenly across datasets.
- Why it matters: If any stage breaks, your final results can be bad even if earlier numbers look great. Anchor: A face search might retrieve very "close" vectors but pick the wrong person if the metric or method doesn't match the data's shape.
Failed Attempts: Communities kept tuning indexes to push recall and speed but rarely checked whether label accuracy, hit rate, or recommendation utility improved. That led to shiny numbers but sometimes worse user experience.
The Gap: We lacked a benchmark that connects vector search to real tasks, exposes the three funnel layers, and explains how to choose the right method for a given dataset.
Real Stakes: In daily life, wrong face matches affect security; bad retrieval harms chatbots; mismatched recommendations lower sales; and mislabeled image retrieval wastes time. We need end-to-end evaluation that mirrors reality.
Hook: Picking a running shoe just by color won't make you faster.
The Concept (ANNS vs. MIPS): Two common vector problems, finding nearest neighbors by Euclidean distance (ANNS) or finding the best alignment by inner product (MIPS).
- What it is: ANNS finds smallest distances; MIPS finds largest inner products.
- How it works: ANNS uses geometric closeness; MIPS considers direction and magnitude alignment.
- Why it matters: Picking the wrong one can sink your task results even if recall is high. Anchor: For face recognition with angular losses, MIPS tends to fit better; for many image/text embeddings, Euclidean ANNS often wins.
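A tiny numpy example makes the difference concrete: the same two database vectors rank in opposite orders under Euclidean distance (ANNS) and inner product (MIPS) once their norms differ. The vectors are invented for illustration.

```python
import numpy as np

query = np.array([1.0, 0.0])
base = np.array([
    [0.9, 0.1],   # geometrically close to the query, modest inner product
    [3.0, 2.0],   # far away in Euclidean terms, but large inner product (big norm)
])

anns_order = np.argsort(np.linalg.norm(base - query, axis=1))  # smallest distance first
mips_order = np.argsort(-(base @ query))                       # largest inner product first

print("ANNS (Euclidean) ranking:", anns_order)      # [0 1]
print("MIPS (inner product) ranking:", mips_order)  # [1 0]
```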
Hook: Grading a spelling test by counting how fast you wrote, not how many words you spelled correctly, is silly.
The Concept (Label Recall@K): Measures if top-K results share the correct label/class with the query.
- What it is: Fraction of retrieved items that match the query's label.
- How it works: 1) Retrieve top-K. 2) Count matching labels. 3) Divide by K (averaged over queries).
- Why it matters: It reflects true classification/identity usefulness. Anchor: For "dog" images, label recall@10 says how many of the 10 are actually dogs.
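A minimal sketch of Label Recall@K for one query (toy labels; averaging over queries gives the dataset-level score the paper reports):

```python
def label_recall_at_k(retrieved_ids, id_to_label, query_label, k=10):
    """Fraction of the top-K retrieved items whose label matches the query's label."""
    top = retrieved_ids[:k]
    return sum(id_to_label[i] == query_label for i in top) / len(top)

# Toy example: a "dog" query retrieves 10 images, 7 of which are actually dogs.
id_to_label = {i: ("dog" if i < 7 else "cat") for i in range(10)}
print(label_recall_at_k(list(range(10)), id_to_label, query_label="dog"))  # 0.7
```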
02 Core Idea
Hook: You know how a telescope lets you see the moon clearly, but clouds can still block your view? Even a great tool can be fooled by things in the way.
The Concept (Key Insight): The big idea is to judge vector search by how well it helps real tasks, and to explain mistakes using the Information Loss Funnel.
- What it is: A task-centric benchmark (Iceberg) that measures downstream utility, diagnoses where information is lost, and guides metric/method choices.
- How it works: 1) Evaluate both synthetic and task metrics. 2) Analyze three funnel layers: embeddings, metrics, distribution. 3) Use fast meta-features and a decision tree to pick metric and method.
- Why it matters: It prevents being tricked by high recall that doesn't help the actual job. Anchor: An index that gets 99.9% synthetic recall but <1% label recall is like a perfect-looking bridge that collapses when cars drive on it.
Multiple Analogies:
- Glasses Analogy: Synthetic recall is like checking how clean your glasses are; task-centric metrics test whether you can read the board. You need both.
- Treasure Map Analogy: Distance-based recall is like following a compass perfectly; metric misuse is using the wrong north; task metrics confirm you found the treasure.
- Cooking Analogy: Great chopping (recall) isn't dinner; the meal (task success) needs the right recipe (metric) and oven settings (method/data fit).
Before vs After:
- Before: Benchmarks crowned winners using recall–latency curves tied to a single metric and synthetic ground truth.
- After: Iceberg re-ranks methods by end-task success and shows where and why synthetic scores fail, offering a simple, interpretable decision tree for choosing what actually works.
Hook: Imagine a detective solving a case by checking fingerprints (vectors) but also confirming the alibi (task labels).
The Concept (Information Loss Funnel Layers):
- Embedding Loss
- What it is: Turning rich data into fixed vectors drops details.
- How it works: The training objective keeps what helps it most and discards the rest; class boundaries or semantics can blur.
- Why it matters: It sets a ceilingāno search method can rise above lost information.
- Anchor: On ImageNet-DINOv2, label recall@100 caps at ~71% even when synthetic recall is ~99.9%.
- Metric Misuse
- What it is: Using a distance that doesn't match how embeddings were trained.
- How it works: If embeddings are angular, inner product or cosine suits; if they form tight Euclidean clusters, Euclidean suits.
- Why it matters: A wrong metric can make 99.9% synthetic recall return nearly all wrong labels.
- Anchor: With EVA02 and inner product, label recall drops below 1% despite near-perfect synthetic recall.
- Distribution Sensitivity
- What it is: Some methods fit certain data shapes better (clusters, angles, density).
- How it works: Partition-based can shine with clear partitions; graph-based can excel with good connectivity and contrast.
- Why it matters: Picking a mismatched method hurts both speed and task utility.
- Anchor: On BookCorpus and Commerce, graph-based methods were up to 3–4× faster; on Glint360K-IR101, partition-based won.
Hook: Choosing a game plan by asking a few smart questions beats guessing.
The Concept (Explainable Decision Tree):
- What it is: A simple, two-layer tree that first picks the metric (IP vs Euclidean), then the method type (partition vs graph) using meta-features.
- How it works: 1) Metric: If DBI_E ≥ DBI_C and CV ≤ 0.1, choose inner product; else choose Euclidean. 2) Method: If RA ≥ 60 or RC < 1.5, pick partition-based; else pick graph-based.
- Why it matters: It's fast, interpretable, and tied to downstream success. Anchor: On Glint360K-IR101 (DBI_E=5.51, DBI_C=2.09, CV=0.08), the tree recommends inner product and partition-based methods like ScaNN, matching the best task results.
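Because the tree has only two layers, it can be written down directly. The sketch below encodes the thresholds quoted above; the function names are mine, and the RA/RC values come from the Glint360K-IR101 figures reported later in the methodology section.

```python
def choose_metric(dbi_e, dbi_c, cv):
    """Layer 1: pick the similarity metric from cluster structure and norm spread."""
    return "inner_product" if (dbi_e >= dbi_c and cv <= 0.1) else "euclidean"

def choose_method(ra, rc):
    """Layer 2: pick the index family from angular spread and relative contrast."""
    return "partition" if (ra >= 60 or rc < 1.5) else "graph"

# Glint360K-IR101 meta-features reported in the paper:
print(choose_metric(dbi_e=5.51, dbi_c=2.09, cv=0.08))  # inner_product
print(choose_method(ra=87, rc=1.38))                   # partition (e.g., ScaNN)
```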
Hook: Think of meta-features as quick health checks before a race.
The Concept (Meta-features: DBI, CV, RA, RC):
- DBI (Davies–Bouldin Index)
- What it is: Measures how tight and separate clusters are (lower is better) under Euclidean (DBI_E) or cosine (DBI_C).
- How it works: Compares within-cluster scatter to between-cluster separation.
- Why it matters: Tells whether angular or Euclidean structure dominates, guiding metric choice.
- Anchor: If DBI_E ≥ DBI_C, angular separation is stronger, so favor inner product.
- CV (Coefficient of Variation of norms)
- What it is: Variation of vector lengths; low CV means similar norms.
- How it works: Standard deviation of norms divided by mean norm.
- Why it matters: With small CV (≤ 0.1), inner product focuses on angles instead of being hijacked by length.
- Anchor: Face embeddings trained with angular losses often have low CV; inner product works well.
- RA (Relative Angle)
- What it is: Average angle to the global mean; higher means bigger angular spread.
- How it works: Computes mean angular separation across data.
- Why it matters: High RA can make graph traversal harder; partitioning may be safer.
- Anchor: With RA ≥ 60, partition-based often wins.
- RC (Relative Contrast)
- What it is: How different nearest neighbors are from random points; higher means clearer neighbors.
- How it works: Mean distance divided by nearest-neighbor distance.
- Why it matters: Low RC (≤ 1.5) weakens graph guidance; partition-based can help.
- Anchor: If RC is small, choose partition-based to avoid graph traps.
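The meta-features are cheap to estimate on a sample of the base vectors. The sketch below is one plausible implementation, assuming k-means clusters for DBI, an L2-normalized copy of the data for the cosine-view DBI, degrees for RA, and a per-point mean-distance/nearest-neighbor-distance ratio for RC; the paper's exact estimators and sample sizes may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, pairwise_distances

def estimate_meta_features(x, n_clusters=50, sample=2000, seed=0):
    """Rough estimates of DBI_E, DBI_C, CV, RA, RC from a random sample of vectors."""
    rng = np.random.default_rng(seed)
    x = x[rng.choice(len(x), size=min(sample, len(x)), replace=False)].astype(np.float64)
    norms = np.linalg.norm(x, axis=1)

    # DBI under Euclidean geometry, and under an angular view (unit-normalized copy).
    dbi_e = davies_bouldin_score(
        x, KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(x))
    xu = x / norms[:, None]
    dbi_c = davies_bouldin_score(
        xu, KMeans(n_clusters, n_init=10, random_state=seed).fit_predict(xu))

    # CV: coefficient of variation of the vector norms.
    cv = norms.std() / norms.mean()

    # RA: average angle (in degrees) between each vector and the global mean vector.
    m = x.mean(axis=0)
    cosines = (x @ m) / (norms * np.linalg.norm(m) + 1e-12)
    ra = float(np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0))).mean())

    # RC: per-point mean distance divided by nearest-neighbor distance, averaged.
    d = pairwise_distances(x)
    np.fill_diagonal(d, np.nan)
    rc = float(np.nanmean(np.nanmean(d, axis=1) / np.nanmin(d, axis=1)))

    return {"DBI_E": dbi_e, "DBI_C": dbi_c, "CV": cv, "RA": ra, "RC": rc}
```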
Why It Works (Intuition): Iceberg aligns the whole pipeline with the task. It separates where meaning is lost (embedding), where geometry goes wrong (metric), and where search stumbles (distribution). The meta-features quickly reveal the datasetās shape, so the decision tree can pick a metric and method that carry more task-relevant signal to the finish line.
03 Methodology
High-Level Recipe: Input (raw data) → Embedding (vectors) → Index Build (choose metric & method) → Query (search top-K) → Evaluation (synthetic + task metrics) → Decision Tree Guidance (meta-features → metric → method).
Step 1: Embedding Generation
- What happens: Images, faces, texts, and user–item logs are encoded into vectors using modern models (e.g., DINOv2, EVA02, ConvNeXt for images; ArcFace for faces; Stella for text; ResFlow for recommendation).
- Why it exists: Raw data is too big and messy; vectors let us compare items numerically.
- Example: An ImageNet picture becomes a 768–1536-D vector (depending on the model). A paragraph in BookCorpus becomes a 1024-D vector.
Hook: Like summarizing a long book into a short book report. The Concept (Embedding Loss):
- What it is: Some details vanish when compressing data to fixed-length vectors.
- How it works: Training keeps whatās most helpful for its loss; other nuances fade.
- Why it matters: It sets an upper limit: no search can recover lost information. Anchor: On ImageNet-DINOv2, label recall@100 saturated near 71% despite near-perfect synthetic recall.
Step 2: Metric Selection (IP vs Euclidean)
- What happens: Pick the similarity metric that matches the embedding geometry.
- Why it exists: A mismatched metric makes "close" neighbors semantically wrong.
- Example: Face embeddings trained with angular margins prefer inner product; image or text models often align better with Euclidean.
Hook: Choosing a ruler or a protractor depending on whether you measure length or angle. The Concept (Metric Misuse):
- What it is: Using a similarity rule that clashes with the embeddings' structure.
- How it works: Angular embeddings + Euclidean, or cluster-friendly embeddings + inner product, can both mis-rank neighbors.
- Why it matters: Even 99.9% synthetic recall can return mostly wrong labels. Anchor: EVA02 + IP yields near-zero task label recall while synthetic recall looks perfect.
Step 3: Index Construction (Partition vs Graph)
- What happens: Build a data structure to speed up search.
- Partition-based: cluster or quantize to narrow search space (e.g., ScaNN, IVFPQ, RaBitQ, DB-LSH, Fargo).
- Graph-based: build a proximity graph and greedily traverse (e.g., HNSW, NSG, Vamana, Mobius, ip-NSW, ip-NSW+, NAPG, MAG).
- Why it exists: Brute-force is too slow at scale; indexes trade tiny accuracy for big speedups.
- Example: On Glint360K-IR101 with inner product, ScaNN (partition) beat top graph methods in speed at similar recall.
Hook: Sorting books into shelves (partitions) vs. drawing a map of cross-aisle shortcuts (graph). The Concept (Distribution Sensitivity):
- What it is: Different indexes thrive on different data shapes.
- How it works: Clear partitions favor clustering/quantization; good connectivity and contrast favor graphs.
- Why it matters: Wrong choice kills speed and sometimes caps recall. Anchor: Graphs excelled on BookCorpus; partitions excelled on Glint360K-IR101.
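For context on Step 3, here is a hedged sketch of building one index from each family with Faiss (IVF+PQ for the partition style, HNSW for the graph style). The parameters are illustrative defaults, not the tuned settings used in the benchmark, and the paper also evaluates non-Faiss implementations such as ScaNN and standalone graph libraries.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d = 128
xb = np.random.default_rng(0).normal(size=(100_000, d)).astype(np.float32)
xq = xb[:10]  # reuse a few base vectors as stand-in queries

# Partition-style: cluster into cells (IVF), compress with product quantization (PQ).
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, 1024, 16, 8)  # 1024 cells, 16 sub-quantizers, 8 bits
ivfpq.train(xb)
ivfpq.add(xb)
ivfpq.nprobe = 32  # number of cells visited per query

# Graph-style: HNSW proximity graph traversed greedily at query time.
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 links per node
hnsw.hnsw.efSearch = 128           # traversal breadth at query time
hnsw.add(xb)

for name, index in [("IVFPQ (partition)", ivfpq), ("HNSW (graph)", hnsw)]:
    distances, ids = index.search(xq, 100)  # top-100 under Euclidean distance
    print(name, ids[0][:5])
```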
Step 4: Querying
- What happens: Given a query vector, search the index to retrieve top-K neighbors under the chosen metric.
- Why it exists: Real systems must answer quickly across millions to hundreds of millions of vectors.
- Example: For a face image, retrieve top-100 candidate faces by inner product.
Step 5: Evaluation (Synthetic + Task-Centric)
- What happens: Measure synthetic recall@K and task utility (Label Recall@K, Hit@K, Matching Score@K).
- Why it exists: Synthetic recall checks distance retrieval; task metrics confirm real usefulness.
- Example Data:
- ImageNet-DINOv2: Label Recall@100 saturates near 71% at ~99.9% synthetic recall.
- EVA02 + IP: Synthetic recall ~99.9% but label recall <1%.
- BookCorpus: Euclidean ANNS hit@100 near 1.0; IP MIPS under 0.5.
- Glint360K-IR101: MIPS ~98% label recall@100 vs Euclidean ~92%.
- Commerce: Matching Score@100 rises then falls as synthetic recall increases.
Step 6: Decision Tree Guidance
- What happens: Compute four meta-features fast and traverse the tree.
- Metric: If DBI_E ≥ DBI_C and CV ≤ 0.1 → choose inner product; else Euclidean.
- Method: If RA ≥ 60 or RC < 1.5 → choose partition-based; else graph-based.
- Why it exists: Give practitioners a quick, explainable path to good defaults.
- Example: Glint360K-IR101 (DBI_E=5.51, DBI_C=2.09, CV=0.08; RA=87, RC=1.38) → inner product + partition (e.g., ScaNN).
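Putting the earlier sketches together, the guidance step reduces to a few lines. This glue code reuses the hypothetical estimate_meta_features, choose_metric, and choose_method helpers sketched above; it is not the paper's released tooling.

```python
# Hypothetical glue code; base_vectors is your embedding matrix (n x d).
mf = estimate_meta_features(base_vectors)
metric = choose_metric(mf["DBI_E"], mf["DBI_C"], mf["CV"])
method = choose_method(mf["RA"], mf["RC"])
print(f"Build a {method}-based index and search with {metric}.")
```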
Secret Sauce:
- Iceberg doesn't stop at recall–latency; it connects to the task. The three-layer funnel makes problems visible. The meta-features are quick to compute and surprisingly predictive. The decision tree is simple enough to trust and fast enough to use in production pipelines.
Concrete Example Walkthrough (Image Classification, ImageNet-EVA02):
- Input: Image → EVA02 vector (1024-D)
- Meta-features: DBI_E vs. DBI_C show angular structure and CV is small; suppose the rule suggests inner product, so try IP.
- Observation: Synthetic recall soars, but task label recall collapses (<1%). Diagnosis: metric misuse. Switch to Euclidean.
- Result: Label recall improves dramatically; now optimize method type (graphs often good here).
- Outcome: HNSW or RaBitQ chosen based on RA/RC and speed–utility curves.
04 Experiments & Results
The Test: Iceberg evaluates eight large datasets (1M–100M vectors) across four domains: image classification (ImageNet with DINOv2, EVA02, ConvNeXt, AlexNet), face recognition (Glint360K with ViT and IR101 backbones), text retrieval (BookCorpus with Stella), and recommendation (Commerce with ResFlow). It measures both synthetic recall@K and task-centric metrics (Label Recall@K, Hit@K, Matching Score@K), plus throughput (QPS/latency).
The Competition: Thirteen state-of-the-art methods spanning partition-based (ScaNN, RaBitQ, IVFPQ, DB-LSH, Fargo) and graph-based (HNSW, NSG, Vamana, Mobius, ip-NSW, ip-NSW+, NAPG, MAG) approaches were compared for ANNS (Euclidean) and MIPS (inner product).
Scoreboard with Context:
- ImageNet-DINOv2: Label Recall@100 caps near 71% even when synthetic recall@100 is ~99.9%, a hard ceiling from embedding loss.
- ImageNet-EVA02: With inner product, synthetic recall is ~99.9% but label recall is <1% (extreme metric misuse). Switching to Euclidean restores task performance.
- BookCorpus (Text Retrieval): Euclidean ANNS achieves near 100% hit@100, while inner product MIPS stays under 50%; the metric matters.
- Glint360K-IR101 (Face): Inner product outperforms Euclidean on label recall@100 (~98% vs. ~92%), matching the angular training objective.
- Commerce (Recommendation): Matching Score@100 rises with synthetic recall at first, then declines; too much focus on synthetic neighbors can hurt conversions.
Speed Comparisons (with similar synthetic recall@100):
- Partition wins: On ImageNet-DINOv2 and Glint360K-IR101, ScaNN (MIPS) is 1.7–3× faster than top graph methods; RaBitQ (ANNS) is ~2× faster than HNSW on Glint360K-IR101.
- Graph wins: On BookCorpus and Commerce, graph indexes are up to 3–4× faster than ScaNN for MIPS; HNSW beats RaBitQ by 2.7–3.2× for ANNS.
Surprising Findings:
- Near-perfect synthetic recall can be useless for tasks (EVA02 + IP on ImageNet). That's like acing the wrong test.
- More recall isn't always better for recommendation; beyond a point, utility drops.
- Older embeddings (AlexNet) showed very low label recall (~21%) even with high synthetic recall, highlighting that embedding quality dominates the upper bound.
- Some methods hit recall bottlenecks depending on data; e.g., Mobius struggles to reach high synthetic recall on Glint360K-IR101.
Task-Centric Leaderboard Highlights (Iceberg 1.0):
- ImageNet-DINOv2: Under IP, ScaNN (partition) led task utility; under ED, HNSW (graph) dominated.
- ImageNet-EVA02: MAG looked great on synthetic recall but ip-NSW+ won on task utility.
- ImageNet-ConvNeXt: ED favored RaBitQ (partition) for task utility; IP favored ScaNN.
- Glint360K (ViT, IR101): IP favored ScaNN (partition) for task utility.
- BookCorpus: ED favored HNSW (graph) for task utility.
- Commerce: IP favored ip-NSW+ (graph) for task utility.
Meaning: Method rankings reshuffle when judged by the task. The decision tree's metric/method picks are consistent with these outcomes, offering a fast, explainable path to better real-world performance.
05 Discussion & Limitations
Limitations:
- Metric Coverage: Focuses on Euclidean and inner product (cosine as a special case). Other geometries (e.g., hyperbolic or Mahalanobis) and sparse embeddings are future work.
- Systems Scope: Emphasizes algorithmic behavior; does not benchmark cloud-native deployments or specialized hardware optimizations.
- Upstream Embedding Loss: Identifies but does not fix losses introduced by training objectives and model capacity.
- Evolving Thresholds: Decision-tree thresholds (like CV ≤ 0.1) may shift as embedding practices evolve; Iceberg will need updates.
Required Resources:
- Compute: Multi-core CPU servers with ample RAM for building/searching large indexes (1M–100M vectors).
- Software: Implementations from Faiss, ScaNN, and C++ graph-based libraries; Iceberg's open-source scripts and datasets.
- Data: Domain datasets with labels (ImageNet, Glint360K, BookCorpus, Commerce) and modern embedding models.
When NOT to Use:
- If your application relies on non-Euclidean metrics (e.g., hyperbolic) or sparse high-dimensional features not covered yet.
- If you only need small-scale exact search where brute force is trivial.
- If your embeddings are rapidly changing online and you can't recompute meta-features or rebuild indexes as needed.
Open Questions:
- Task-Aware Search: How to co-optimize embeddings, indexes, and stopping rules for end-to-end utility without heavy retraining?
- Metric-Aware Search: Can indexes self-detect the right metric or blend metrics adaptively per query or region?
- Distribution-Aware Search: Can indexes auto-adapt to RA/RC/DBI/CV shifts over time and across segments (e.g., new users, cold items)?
- Robustness: How to guarantee no catastrophic failure modes (like the EVA02 + IP case) through automatic detection/safeguards?
- Efficiency: How to keep task-centric evaluation light enough for production monitoring and A/B testing?
06 Conclusion & Future Work
Three-Sentence Summary: Iceberg is a task-centric benchmark that reveals why high distance-based recall can still fail real applications, using the Information Loss Funnel to pinpoint embedding loss, metric misuse, and distribution sensitivity. It evaluates eight large datasets and 13 methods with both synthetic and task metrics, showing that method rankings change when you measure what truly matters. A simple decision tree powered by fast meta-features guides the choice of metric and method, improving downstream results.
Main Achievement: Turning vector search evaluation from synthetic recall races into end-to-end, task-aligned decision-making, with an interpretable, practical toolkit that consistently avoids hidden pitfalls.
Future Directions: Build task-aware search that co-optimizes embeddings and indexes; develop metric-aware systems that auto-detect and adapt similarity measures; create distribution-aware methods that reconfigure themselves to data geometry in real time. Expand beyond Euclidean/IP to alternative metrics and include systems-level, hardware-aware evaluations.
Why Remember This: Because perfect-looking vector scores can still give you the wrong answers. Iceberg shows where the truth gets lost and hands you a clear, fast way to choose the right metric and method so your system helps the real task, not just the benchmark.
Practical Applications
- Choose the right metric (inner product vs. Euclidean) for your embeddings using DBI and CV before you build your index.
- Pick between partition-based and graph-based methods using RA and RC to match your dataset's geometry.
- Validate your vector database with task metrics (Label Recall@K, Hit@K, Matching Score@K), not just synthetic recall.
- Diagnose failures: if synthetic recall is high but task metrics are low, check for metric misuse (Layer 2 of the funnel).
- Set realistic goals by estimating the task ceiling from embedding loss (Layer 1) to avoid over-optimizing indexes.
- Improve recommendation retrieval by stopping at the utility sweet spot instead of maximizing synthetic recall.
- Benchmark new embeddings on Iceberg's datasets to see if representation quality raises the task ceiling.
- Use the decision tree to select default index types and parameters per dataset in production deployments.
- Monitor meta-features over time; if RA/RC/DBI/CV drift, auto-switch or retune indexes to maintain utility (see the sketch after this list).
- Design A/B tests that track both recall–latency and downstream utility to ensure improvements matter.
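As a closing sketch of the monitoring idea above, one simple approach is to compare periodic meta-feature snapshots and flag large relative changes; the 20% threshold, the helper name, and the example numbers are illustrative assumptions, not from the paper.

```python
def metafeature_drift(baseline, current, rel_tol=0.2):
    """Return the meta-features whose relative change exceeds rel_tol (illustrative threshold)."""
    return {k: (baseline[k], current[k])
            for k in baseline
            if abs(current[k] - baseline[k]) > rel_tol * abs(baseline[k])}

# Example snapshots (values illustrative): CV crossing ~0.1 could flip the metric choice.
baseline = {"DBI_E": 5.51, "DBI_C": 2.09, "CV": 0.08, "RA": 87.0, "RC": 1.38}
this_week = {"DBI_E": 5.60, "DBI_C": 2.00, "CV": 0.15, "RA": 85.0, "RC": 1.40}
drifted = metafeature_drift(baseline, this_week)
if drifted:
    print("Re-run the decision tree; drifted meta-features:", drifted)
```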