
ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios

Intermediate
António Loison, Quentin Macé, Antoine Edy et al. · 1/13/2026
arXiv · PDF

Key Summary

  ‱ ViDoRe V3 is a big, carefully built test that checks how well AI systems find and use information from both text and pictures (like tables and charts) in real documents.
  ‱ It includes 26,000 pages from 10 professional domains and 3,099 human-verified queries, each translated into 6 languages, with 12,000 hours of expert annotation.
  ‱ Every query comes with the relevant pages, human-written answers, and bounding boxes that point to exactly where the evidence lives on the page.
  ‱ Visual retrievers beat text-only retrievers at finding the right pages, and late-interaction retrievers are stronger than dense single-vector ones.
  ‱ Adding a strong textual reranker on top of retrieval boosts scores a lot (+13.2 NDCG@10); current visual rerankers help less.
  ‱ Giving the generator visual page images often leads to better answers than giving only text, especially for hard questions.
  ‱ Combining text and image contexts (a hybrid setup) works best on hard queries (54.7% vs. 52.1% text-only and 54.5% image-only), but there’s still a big gap to the oracle (64.7%).
  ‱ Models struggle with open-ended, multi-hop questions, non-textual elements (tables/charts), multi-page reasoning, and precise visual grounding.
  ‱ Cross-language queries (e.g., Spanish question with English or French docs) are harder, dropping retrieval by about 2–3 points.
  ‱ The benchmark is public (with two private hold-out sets) and tied into the MTEB leaderboard to drive fair, real-world RAG progress.

Why This Research Matters

In real jobs—finance, maintenance, HR, government—answers often live in tables, charts, and diagrams across many pages, and people need AI that can find, explain, and prove those answers. ViDoRe V3 tests exactly that: not just whether an AI gives a response, but whether it retrieves the right evidence and points to the precise spot on the page. This helps companies pick and improve the right RAG pipelines so employees and customers can trust the results. It also pushes open research toward better cross-lingual performance, stronger visual rerankers, and tighter grounding—key for global deployments. By providing a realistic, multilingual, multimodal benchmark with human-verified data, ViDoRe V3 accelerates progress toward truly useful, verifiable AI assistants.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re doing a school project and your sources aren’t just storybooks. You also have photos, charts, tables, and slides. You need to find the exact pages with answers, explain those answers clearly, and show exactly where you found them so your teacher can check. That’s what grown-up companies need from AI too.

đŸ„Ź The World Before: For years, Retrieval-Augmented Generation (RAG) helped AI “look things up” before answering, like a student who searches a textbook before writing. But most tests for RAG were simple: a single page of plain text, short fact questions, one document at a time. Real life isn’t like that. Real documents often mix words and visuals (tables, charts, diagrams), and the answer can be spread across multiple pages or even across different documents. Plus, people ask many types of questions—not just what-is or how-many, but also compare-contrast, list-all, or open-ended why/how questions.

đŸ„Ź The Problem: Existing benchmarks didn’t test these real-life needs. They often skipped visual elements, didn’t check if answers were correctly grounded to exact spots on the page, and didn’t test the full RAG pipeline end to end (retrieval + answer writing + source pointing). So systems looked good on paper but stumbled in real scenarios: they missed chart-only facts, failed on multi-page synthesis, or gave answers without trustworthy citations.

đŸ„Ź Failed Attempts: Some datasets focused only on one page at a time or only on pure text. Others tried multimodal data but mostly used extractive “copy a short span” tasks, which don’t demand deeper reasoning. Some benchmarks evaluated only retrieval or only generation, not both together. A few recent efforts moved toward complex multimodal RAG, but they relied on synthetic (machine-made) queries, English-only sources, or restricted grounding to pre-parsed elements—missing flexible, human-validated difficulty and precise visual pointing.

đŸ„Ź The Gap: The community needed a single, realistic, multilingual, multimodal benchmark that (1) uses visually rich documents, (2) includes many query types and formats (questions, keywords, instructions), (3) evaluates retrieval, generation, and grounding together, and (4) is human-verified at scale so we can trust the labels.

🍞 Anchor: Think of a question like, “Where is the S-3’s single point receptacle located?” A good RAG system must find the relevant aircraft manual pages, read the diagram and labels, write the correct answer, and point to the exact spot on the diagram. That’s precisely the kind of scenario ViDoRe V3 was built to test.

— New Concept Explainers —

🍞 Hook: You know how a librarian first finds the right books, then you read and write your report? đŸ„Ź Concept: Retrieval-Augmented Generation (RAG) is an AI method where the model first retrieves relevant documents and then generates an answer using them. How it works: (1) search your library, (2) gather likely pages, (3) read them, (4) write the answer with citations. Why it matters: Without retrieval, AI may guess from memory and get facts wrong or outdated. 🍞 Anchor: When you ask, “What’s Apple’s 2024 net margin?”, RAG first pulls Apple’s 10-K pages, then answers using those pages.
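
If you like seeing the moving parts, here is a tiny, self-contained sketch of that retrieve-then-generate loop. The bag-of-words scoring and the answer template are toy stand-ins (a real pipeline would use a neural retriever and an LLM), and the page snippets are invented.

```python
# Minimal retrieve-then-generate loop (illustrative only; real systems use
# neural embedders and an LLM, not the toy scoring and template used here).
from collections import Counter
import math

PAGES = {
    "10k_p12": "Apple fiscal 2024 net sales and net income by segment ...",
    "10k_p47": "Risk factors: supply chain, foreign exchange, regulation ...",
    "manual_p3": "Single point receptacle location diagram, left fuselage ...",
}

def bow(text):                      # bag-of-words vector (toy stand-in for an embedder)
    return Counter(text.lower().split())

def cosine(a, b):                   # cosine similarity between two sparse vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=2):           # steps 1-2: search the "library", keep the top-k pages
    q = bow(query)
    scored = sorted(PAGES.items(), key=lambda kv: cosine(q, bow(kv[1])), reverse=True)
    return scored[:k]

def generate(query, evidence):      # steps 3-4: an LLM would write the answer; here we just cite
    cites = ", ".join(page_id for page_id, _ in evidence)
    return f"Answer to '{query}' drafted from pages: {cites}"

question = "What is Apple's 2024 net income?"
print(generate(question, retrieve(question)))
```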

🍞 Hook: Imagine solving a mystery by reading text and studying photos. đŸ„Ź Concept: Multi-modal Retrieval finds helpful info across text, images, tables, and charts. How it works: (1) encode each page as text and/or image signals, (2) compare the query to these signals, (3) rank the most relevant pages, (4) return the best mix. Why it matters: If you ignore visuals, you’ll miss answers that live only in a chart or table. 🍞 Anchor: A table of quarterly revenue growth might be the only place the number appears; multi-modal retrieval brings that page to you.

🍞 Hook: When you cook, sometimes you ask for a recipe, other times just the ingredient list. đŸ„Ź Concept: Diverse Query Types means questions can be extractive, numerical, boolean, enumerative, compare-contrast, multi-hop, or open-ended, and can be phrased as questions, keywords, or instructions. How it works: The benchmark includes all these types and formats to reflect real user needs. Why it matters: A system tuned only for short fact lookups will fail on “compare two firms’ risk trends” or “summarize cross-section insights.” 🍞 Anchor: “List all risk categories” (enumerative) versus “Did risk go up?” (boolean) versus “Explain why risk changed” (open-ended) all appear here.

🍞 Hook: Teachers write good exam questions; that’s how you’re fairly tested. đŸ„Ź Concept: Human Annotation Methodology for Queries is a careful process to write, filter, and verify realistic queries and their answers. How it works: (1) gather varied documents, (2) create summaries, (3) generate human and synthetic queries with controls for type and format, (4) pre-filter candidate pages with a vision-language model, (5) have trained human annotators confirm relevance, draw bounding boxes, and write answers, (6) merge and verify final answers. Why it matters: Without this rigor, benchmarks grow biased, too easy, or untrustworthy. 🍞 Anchor: Experts mark exactly which chart cell supports “What is the 2023 growth rate?”, so future systems are graded on the true evidence.

🍞 Hook: If you say “Look at the graph,” your friend asks, “Which part?” đŸ„Ź Concept: Visual Grounding ties each answer back to a precise spot on the page. How it works: Annotators draw bounding boxes around the evidence; models are asked to do the same. Why it matters: Without grounding, an answer might be right by accident—or impossible to verify. 🍞 Anchor: The answer “74% made in France” points to a highlighted table cell in the annual report.

🍞 Hook: When you circle a single sentence in a photo of a page, you’re zooming in on proof. đŸ„Ź Concept: Bounding Box Localization means drawing a rectangle around the exact visual evidence. How it works: (1) detect the relevant region, (2) mark [x_min, y_min, x_max, y_max], (3) compare to human boxes by overlap. Why it matters: Without precise boxes, users can’t quickly check the source. 🍞 Anchor: For “Where is the refueling port?”, the box encloses the label and arrow on the aircraft diagram.
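
Here is a minimal sketch of how a predicted box can be compared to a human box using intersection over union (IoU). The coordinates and the 0.5 match threshold are illustrative conventions, not the paper's exact scoring rule (the benchmark uses a zone-based F1 across annotators).

```python
# Compare a predicted bounding box to a human-annotated one via IoU
# (intersection over union). Boxes are [x_min, y_min, x_max, y_max] in pixels.

def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle (empty if the boxes do not intersect)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

human_box = [120, 340, 260, 380]   # e.g. the "74%" table cell
model_box = [118, 335, 270, 390]   # model's prediction
score = iou(human_box, model_box)
print(f"IoU = {score:.2f}, match: {score >= 0.5}")  # 0.5 is a common (illustrative) threshold
```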

02Core Idea

🍞 Hook: You know how a great science fair judge doesn’t just look at the poster—they also check the data, ask follow-up questions, and see if you can point to your sources?

đŸ„Ź The "Aha!" Moment (One Sentence): To truly test modern RAG systems, we must evaluate retrieval, answer generation, and exact visual grounding together on realistic, multilingual, visually rich documents using carefully human-verified data.

— Three Analogies —

  1. School Research Report: Before vs. After
  • Before: You were graded only on how neatly you copied a definition from one page.
  • After: You’re graded on finding sources across books and pictures, writing a clear summary, and highlighting the exact lines and images you used.
  2. Cooking with a Pantry and Recipe Photos
  • Before: Taste-test judged only on one text recipe.
  • After: You must pick ingredients from a giant pantry, read notes on sticky photos (charts/tables), cook, and then point to which photo or note supported each choice.
  3. Detective Work
  • Before: Solve a case from one typed witness statement.
  • After: You must inspect pictures (charts), maps (tables), and transcripts (text), answer questions, and mark exactly where each clue is on the evidence board.

— Before vs. After —

  • Before: Benchmarks mostly used single-document text, short extractive questions, and separated retrieval from generation or grounding.
  • After: ViDoRe V3 brings 10 real-world corpora, 3,099 queries (in 6 languages), mixed modalities (text, tables, charts, infographics, images), verified page relevance, human-written answers, and precise bounding boxes—so systems must retrieve, reason, and cite visually.

— Why It Works (Intuition) —

  • Realistic Inputs: Many answers live in visuals (tables/charts). If you don’t see them, you miss the truth.
  • Diverse Questions: A system gets tested across simple facts, calculations, lists, comparisons, multi-document hops, and open-ended synthesis—like actual users ask.
  • End-to-End Scoring: If you only test retrieval, a weak writer can hide. If you only test writing, a bad retriever can seem fine. Testing all three (retrieve+write+ground) exposes real strengths and weaknesses.
  • Human Verification at Scale: With 12,000 hours of annotation (relevance, answers, boxes), the ground truth is trusted enough to compare different pipelines fairly.

— Building Blocks (Explained with Sandwich Patterns) —

🍞 Hook: Ever notice some search engines do better when they can compare pieces side-by-side rather than squish everything into one dot? đŸ„Ź Concept: Late-Interaction Retrievers compare query and document pieces interactively instead of compressing each into a single vector. How it works: (1) split content into tokens/patches, (2) encode them, (3) compute many-to-many matches, (4) aggregate rich signals. Why it matters: One-vector methods can blur away fine details, especially in complex pages. 🍞 Anchor: Asking “find the table cell with 2023 growth” needs fine-grained matching; late-interaction models shine here.
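
A small sketch of the late-interaction idea, using random vectors in place of real token/patch embeddings. The dimensions are arbitrary; the point is the MaxSim pattern (each query token keeps its best document match) versus pooling everything into one vector first.

```python
# Late-interaction (ColBERT/ColPali-style) scoring: keep one embedding per
# query token and per document token/patch, then sum each query token's best
# match (MaxSim). Random vectors stand in for real encoder outputs here.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query_tokens = normalize(rng.normal(size=(8, 128)))     # 8 query tokens, dim 128
page_patches = normalize(rng.normal(size=(600, 128)))   # 600 page patches / tokens

def maxsim_score(q_tok, d_tok):
    sims = q_tok @ d_tok.T            # (num_query_tokens, num_doc_tokens) similarities
    return sims.max(axis=1).sum()     # best doc match per query token, summed

# Contrast with a single-vector "dense" score that pools everything first
dense_score = normalize(query_tokens.mean(0)) @ normalize(page_patches.mean(0))
print(maxsim_score(query_tokens, page_patches), float(dense_score))
```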

🍞 Hook: After a quick first pass, a librarian gives the pile to a topic expert who reorders it. đŸ„Ź Concept: Reranking is a second stage that reorders top results to improve quality. How it works: (1) retrieve top-K, (2) deeply compare each with the query, (3) sort by better relevance. Why it matters: Without reranking, the best pages can be buried. 🍞 Anchor: Adding a strong textual reranker in ViDoRe V3 gave a big boost (+13.2 NDCG@10) to a text pipeline.
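
A sketch of the two-stage retrieve-then-rerank pattern. Both scoring functions are placeholders (a real system would pair a fast retriever with a heavier reranker such as a cross-encoder); only the control flow is the point.

```python
# Two-stage retrieval: a fast first pass proposes top-K candidates, then a
# slower, more accurate scorer re-orders only those candidates.
# Both scorers below are toy placeholders, not real models.
import random

random.seed(0)
pages = [f"page_{i:03d}" for i in range(1000)]

def fast_retriever_score(query, page):   # cheap first-stage score (placeholder)
    return random.random()

def cross_scorer(query, page):           # expensive joint query-page score (placeholder)
    return random.random()

def retrieve_then_rerank(query, k_first=100, k_final=10):
    # Stage 1: score everything cheaply, keep top-K candidates
    candidates = sorted(pages, key=lambda p: fast_retriever_score(query, p), reverse=True)[:k_first]
    # Stage 2: re-score only the candidates with the heavier model, keep the final top results
    return sorted(candidates, key=lambda p: cross_scorer(query, p), reverse=True)[:k_final]

print(retrieve_then_rerank("Where is the single point receptacle located?"))
```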

🍞 Hook: Sometimes reading the page image beats reading the transcript, especially for tricky diagrams. đŸ„Ź Concept: Visual Context for Generation gives the model the page images themselves. How it works: (1) feed images to a VLM, (2) let it read charts/tables directly, (3) write answers grounded in visuals. Why it matters: OCR/text-only views can miss layout, labels, or tiny numbers. 🍞 Anchor: On hard queries, image context helped Gemini 3 Pro beat text-only context.

🍞 Hook: When two eyes see the world, you get depth. đŸ„Ź Concept: Hybrid Retrieval combines top pages from visual and textual retrievers. How it works: (1) retrieve top-N images and top-N texts, (2) merge deduplicated set, (3) feed to the generator. Why it matters: Text and image signals capture different clues; together they’re stronger. 🍞 Anchor: Hybrid hit 54.7% on hard queries, better than text-only (52.1%) and matching/edging image-only (54.5%).
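
A sketch of how a hybrid context can be assembled: take the top pages from a visual retriever and a textual retriever, merge them, and drop duplicates before handing the set to the generator. The ranked lists and the interleaving order are illustrative choices, not the paper's exact procedure.

```python
# Hybrid context: union of the top-N pages from a visual retriever and a
# textual retriever, deduplicated while preserving rank order.
# The example ranked lists are invented for illustration.

def hybrid_pages(visual_ranked, textual_ranked, n=5):
    merged, seen = [], set()
    # Interleave the two lists so both modalities contribute early pages
    for vis, txt in zip(visual_ranked[:n], textual_ranked[:n]):
        for page in (vis, txt):
            if page not in seen:
                seen.add(page)
                merged.append(page)
    return merged

visual_top = ["p12", "p47", "p03", "p88", "p19"]
textual_top = ["p47", "p05", "p12", "p61", "p90"]
print(hybrid_pages(visual_top, textual_top))  # deduplicated pages fed to the generator
```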

🍞 Hook: Switching languages is like switching sports—skills transfer, but not perfectly. đŸ„Ź Concept: Cross-Lingual Retrieval answers queries in one language using documents in another. How it works: (1) encode both languages in a shared space, (2) match query to pages regardless of language, (3) return relevant pages. Why it matters: Real users ask in their own language; systems must still find answers. 🍞 Anchor: ViDoRe V3 shows a ~2–3 point drop cross-lingually—room to improve.
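
A sketch of cross-lingual dense retrieval in a shared embedding space. The multilingual model named here is a common open choice, not the one evaluated in the paper, and the page snippets are invented.

```python
# Cross-lingual retrieval sketch: a Spanish query and English page snippets are
# embedded into one shared multilingual space, so cosine similarity works
# across languages. Model choice is illustrative only.
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "¿Cuál fue el margen neto de Apple en 2024?"                    # Spanish question
pages = [
    "Net income as a percentage of net sales for fiscal 2024 ...",      # English 10-K text
    "Risk factors related to supply chain and foreign exchange ...",
]

q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(pages, normalize_embeddings=True)
scores = (q_emb @ p_emb.T)[0]          # cosine similarity, since vectors are normalized
best = int(scores.argmax())
print(f"best page: {best}, score: {scores[best]:.3f}")
```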

03Methodology

At a high level: Documents → (Summaries + Query Generation) → (VLM Pre-filter) → (Human Relevance + Bounding Boxes + Answers) → (Final Merged Answers) → Benchmark Splits and Evaluation.

Step-by-step, like a recipe:

  1. Document Collection
  • What happens: The team curates 10 diverse, openly licensed corpora (finance, HR, pharma, industrial maintenance, CS textbooks, physics slides, French energy/government reports, etc.), totaling ~26,000 pages. Documents come as images, text, and PDFs.
  • Why it exists: Real RAG must handle long and varied professional documents—not just neat, short web pages.
  • Example: A corporate 10-K, an FDA slide deck, and an aircraft maintenance manual all live side-by-side to mirror realistic workloads.
  2. Context Preparation and Summarization
  ‱ What happens: The pipeline extracts text and image captions (Docling), makes section summaries (Qwen3-235B-Instruct), clusters them (Qwen3-Embedding + UMAP + HDBSCAN; a minimal clustering sketch follows this list), and produces cross-section summaries that blend 2–3 sections.
  • Why it exists: Summaries create diverse contexts for generating rich queries without overfitting to a single page; clustering ensures coverage of different topics and modalities.
  • Example: A cluster for “risk factors” across a finance report yields cross-section summaries mixing text and a chart description.
  3. Query Generation (Three Streams)
  • Synthetic blind contextual: Uses LLM prompts to create varied queries from summaries with enforced diversity of types (open-ended, numerical, etc.) and formats (question, keyword, instruction). Outputs are filtered by an LLM-as-a-judge for informativeness, clarity, domain fit, and type/format correctness; half are rephrased for variety.
  • Human blind contextual: Annotators see only summaries (not the original pages) and write realistic queries. This reduces the bias toward easy copy-and-paste extractive questions.
  • Human extractive (image-based): Annotators view specific pages and write queries most natural for that content (often extractive/numerical), reflecting how users ask about visible details.
  • Why it exists: Real users ask many kinds of questions; mixing streams produces a healthy balance (e.g., extractive dominates human-image; open-ended is richer in synthetic).
  • Example: From a finance cross-section summary, a human writes: “Compare changes in temporary contract usage for EU movers vs. nationals from 2018 to 2023 and explain stability impacts.”

— Concept Sandwich: Diverse Query Types — 🍞 Hook: Sometimes you ask “What’s the number?”; other times you ask “Why did it change?” đŸ„Ź Concept: The benchmark intentionally includes extractive, numerical, boolean, enumerative, compare-contrast, multi-hop, and open-ended queries, phrased as questions, keywords, or instructions. How: prompts enforce balance; humans add realism. Why: Without diversity, systems overfit to easy cases. 🍞 Anchor: “List ISCO codes” (enumerative) and “What caused the stability gap to shrink?” (open-ended) both appear.

  4. VLM Pre-filtering of Candidate Pages
  • What happens: Given a query, Qwen2.5-VL-32B-Instruct reviews page images and flags possibly relevant pages. Queries whose answers would span >30 flagged pages are dropped (too diffuse/ambiguous).
  • Why it exists: The corpora are large. This step narrows the pool before humans carefully label each query-page pair.
  • Example: A technical aircraft question triggers 12 candidate pages (diagrams, parts lists) for human review.

— Concept Sandwich: Late-Interaction Retrievers (used later, but motivated here) — 🍞 Hook: Finding a single important word in a giant page is easier when you compare pieces, not just page-level blobs. đŸ„Ź Concept: Late-interaction retrievers compare query tokens to many document tokens/patches. How: compute fine-grained similarities and pool. Why: Page-level vectors can hide small-but-crucial evidence. 🍞 Anchor: Matching “SPR location” to a diagram label benefits from piecewise comparisons.

  5. Human Relevance Rating + Bounding Boxes + Page Modality Marking
  • What happens: Multiple annotators review each candidate page for each query, rate it on a 0–2 relevance scale, draw bounding boxes around supporting evidence, and label each box’s modality (Text, Table, Chart, Infographic, Image, Mixed, Other). Senior reviewers enforce quality. Agreement is measured with metrics robust to skew (e.g., Gwet’s AC2=0.760).
  • Why it exists: Human verification ensures trustworthy ground truth for all three tasks—retrieval, generation, and grounding.
  • Example: For “Where is the SPR located?”, annotators mark the diagram label and a procedure line; both get boxes and modality tags (Infographic/Text).

— Concept Sandwich: Bounding Box Localization — 🍞 Hook: If you can circle the exact proof, anyone can check your work fast. đŸ„Ź Concept: Annotators draw rectangles that enclose only the evidence. How: Merge boxes per annotator into a single “zone,” compare overlaps across annotators for consistency. Why: Without precise boxes, grounding quality can’t be measured. 🍞 Anchor: A small rectangle around “74%” in a table cell proves the answer.

  6. Answer Writing and Final Merging
  • What happens: Annotators draft answers using only the pages they marked as relevant. Because humans may be partially complete, a VLM (Qwen2.5-VL-32B) merges annotator answers and page images into a single, clean final answer (with checks for accuracy and clarity).
  • Why it exists: Produces one authoritative answer per query, grounded only in the provided evidence.
  • Example: Multiple draft answers about “temporary contract usage” get merged into a precise comparison across years and groups.
  7. Multilingual Expansion and Splits
  • What happens: Queries are translated into six languages (English, French, Spanish, German, Italian, Portuguese). Source docs stay English/French, enabling cross-lingual testing. Eight datasets are public; two are private hold-outs to prevent overfitting.
  • Why it exists: Real users ask in many languages; private sets maintain benchmark integrity.
  • Example: A Spanish query is evaluated against an English 10-K; the system must bridge the language gap.

— Extra Concept Sandwiches Used in Experiments —

🍞 Hook: After a quick search, you let a specialist re-check the top picks. đŸ„Ź Concept: Reranking reorders retrieved results with a powerful model for better precision. How: take top-K, score deeply, sort. Why: First-pass retrieval is fast but imperfect. 🍞 Anchor: In ViDoRe V3, a textual reranker (zerank-2) gave +13.2 NDCG@10 overall.

🍞 Hook: Two teammates—one great at reading pictures, one great at reading text—work together. đŸ„Ź Concept: Hybrid Retrieval combines visual and textual top results. How: union of top-5 image and top-5 text pages. Why: Each sees clues the other misses. 🍞 Anchor: Hybrid reached 54.7% accuracy on hard queries—best non-oracle result.

Quality Controls and Human Measures

  • Inter-annotator agreement for relevance: AC2=0.760 (strong under skew).
  • Bounding boxes: human-human F1≈0.60 (moderate, reflecting subjectivity in box tightness).
  • Ethical and licensing: open licenses, GDPR-compliant annotations, fair compensation, safe content.

Secret Sauce (What makes it clever)

  • End-to-end, multimodal, multilingual, human-verified pipeline—not just one component.
  • Balanced query taxonomy and formats with both human and synthetic generation.
  • Visual grounding baked into the gold labels, enabling precise trust checks.
  • Public + private split and MTEB integration to encourage robust, non-overfit progress.

04Experiments & Results

The Test: What did they measure and why?

  ‱ Retrieval: How well do different retrievers (text-only, visual, late-interaction, dense) find the right pages across languages and modalities? Metric: NDCG@10 (a small computation sketch follows this list).
  • Generation: Given retrieved pages (images, text, or both), how often do LLMs produce the correct final answer compared to the gold? Metric: judged correctness by an LLM judge (GPT‑5.2), with easy/hard splits based on parametric knowledge.
  • Visual Grounding: Can models draw bounding boxes that match human evidence? Metric: zone-based F1 versus human boxes (best-match across annotators).
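
Since NDCG@10 is the retrieval metric used throughout, here is a small computation sketch. It uses the common exponential-gain formulation and the benchmark's 0–2 relevance scale; the example ranking is invented, and the exact gain variant used in the paper is not stated here.

```python
# NDCG@10: reward putting highly relevant pages near the top of the ranking.
# Relevance grades use the benchmark's 0-2 scale; the example ranking is made up.
import math

def dcg(relevances, k=10):
    # rank is 0-based, so the discount for the top result is log2(2) = 1
    return sum((2**rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    ideal = sorted(ranked_relevances, reverse=True)   # best possible ordering
    ideal_dcg = dcg(ideal, k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg else 0.0

# Relevance (0-2) of the 10 pages a retriever returned, in the order it returned them
retrieved = [2, 0, 1, 0, 0, 2, 0, 0, 1, 0]
print(f"NDCG@10 = {ndcg_at_k(retrieved):.3f}")
```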

The Competition: Who/what was compared?

  • Textual retrievers (e.g., Jina‑v4, Qwen3‑8B, BGE‑M3, BM25S, LFM2‑ColBERT) vs. visual retrievers (e.g., ColEmbed family, ColNomic, Jina‑v4 visual).
  • Rerankers: textual (zerank-2) vs. visual (jina-reranker-m0).
  • Generators: Gemini 3 Pro/Flash, GPT‑5.2, Qwen3‑VL‑235B variants.
  • Context types: text-only, image-only, hybrid (union), and oracle (gold pages in text and/or images).

The Scoreboard (with context):

  • Visual retrievers outperform text-only ones: Best visual retriever (ColEmbed‑3B‑v2) averaged about 59.8 NDCG@10 across the mixed-language setting, topping comparable text models. Think of this like scoring a B+ where others often get Bs—consistently better at finding the right pages.
  • Late-interaction > dense: Models that compare many tokens/patches tended to win, especially on visually dense pages.
  • Textual reranker gives a big lift: Adding zerank‑2 to a text pipeline boosted performance by +13.2 NDCG@10 on average—like moving from a mid‑B to a solid A‑ in ranking the right pages near the top. By contrast, the visual reranker provided only a tiny average improvement (+0.2) and sometimes hurt.
  • Question format wins: Across query formats, question-style inputs beat instructions and keywords, suggesting better alignment with how retrievers are trained.
  • Visual content is hard: Mixed, Image, and Table queries were harder than Text; Mixed scored the lowest, meaning pages that require integrating multiple visual types are tough.
  • More relevant pages, lower retrieval score: When a query truly spans many pages, NDCG@10 drops; multi-document synthesis remains challenging for retrievers.
  • Cross-lingual penalty: Retrieval drops by ~2–3 points when query and document languages differ—systems still need better multilingual alignment.

Generation Findings:

  • Visual context helps, especially on hard queries: With Gemini 3 Pro, image context beat text context by ~2.4–2.8 points on the hard subset for both oracle and ColEmbed‑3B‑v2 settings. Seeing the page images preserves layout and tiny details.
  • Hybrid context is best overall on hard queries: Hybrid (top-5 image + top-5 text) reached 54.7% accuracy on hard questions, beating text-only (52.1%) and edging visual-only (54.5%). The win suggests complementary strengths.
  • Oracle gap shows room for better retrieval: Even with the best non-oracle setup at 54.7%, the image-oracle hits 64.7% on hard questions—about a 10‑point headroom purely from better retrieval/selection.
  • Parametric knowledge still matters: On easy queries (answerable from model memory), GPT‑5.2 and others can excel; on hard queries, rankings shift, with Gemini 3 Pro often doing better when true reading/grounding is needed.

Visual Grounding Results:

  • Humans vs. models: Human‑human F1 ≈ 0.602; best models much lower—Qwen3‑VL‑30B‑A3B at 0.089, Gemini 3 Pro at 0.065. This is like humans scoring 60/100 on overlap while models score 7–9/100.
  • Recall is the main problem: Only ~16–17% of human‑boxed pages were also boxed by models; ~26–27% of human‑boxed pages got no model box at all.
  • Qualitative patterns: Gemini draws tighter, smaller boxes (risking misses or off-by-one pages); Qwen tends to draw larger, more inclusive boxes (slightly higher overlap). Both make omissions.

Surprising/Notable Nuggets:

  • Visual reranking didn’t help much on average and sometimes hurt, despite strong visual retrieval—suggesting reranking training/data may lag behind.
  • Questions consistently outperform instructions/keywords as input formats; query wording strongly affects retrieval.
  • Even with perfect page selection (oracle), generation isn’t solved; multi-page reasoning and precise grounding remain open challenges.

Bottom Line: ViDoRe V3 shows that to succeed in real document AI, you need strong multimodal retrieval (ideally late-interaction), a capable reranker (today, textual ones shine), hybrid contexts for generation, and major improvements in visual grounding. There’s clear, quantified room to grow, especially for hard, open-ended, and cross-lingual scenarios.

05Discussion & Limitations

Limitations (be specific):

  • Language scope: Source documents are English/French, queries are in six Western European languages. This leaves out non-Latin scripts and low-resource languages; cross-lingual claims shouldn’t be overgeneralized.
  • Document bias: The corpora are long-form public documents (10‑Ks, manuals, reports). Enterprise settings also include short/noisy items (emails, tickets, scans, handwriting) not represented here.
  • Subjectivity in grounding and open-ended answers: Humans reasonably differ on how tight a box should be and how to phrase synthesis. The benchmark mitigates this (multi-annotator, best‑match F1), but some ambiguity is inherent.
  • Synthetic assist: While queries are human-verified and many are human-authored, synthetic generation/filters are used to scale. This is practical but introduces its own distributional quirks.

Required Resources:

  • Compute for multimodal retrieval and LLM/VLM generation (the paper reports ~3,000 H100 hours for evaluation).
  • Skilled annotators (12,000 hours were needed to create ground truth).
  • Storage and tooling for images, PDFs, text extraction, and evaluation harnesses; integration with MTEB if comparing on the leaderboard.

When NOT to Use:

  • If your use case is purely short-form, clean text (e.g., FAQs), simpler text-only RAG benchmarks might suffice and be more cost-effective.
  • If you need handwriting/OCR-heavy, low-quality scans, or many low-resource languages, ViDoRe V3 doesn’t yet cover those distributions.
  • If you only need retrieval or only need generation, component-specific benchmarks may give quicker insights.

Open Questions:

  • How to build stronger visual rerankers that reliably beat or complement textual rerankers across languages?
  • What training signals best improve multi-page reasoning and long-context synthesis in generators, especially for open-ended and multi-hop queries?
  • Can we narrow the 10‑point oracle gap with smarter hybrid selection, query reformulation, or iterative retrieval (agentic RAG)?
  • How to improve visual grounding recall and fine-grained alignment so models approach human F1 (~0.60)?
  • How do results change with broader languages, scripts, and noisier enterprise document ecosystems?

Honest Takeaway: ViDoRe V3 is not “just another dataset.” It’s a demanding, end-to-end, human-validated testbed. It proves visual retrieval’s edge, the power of textual reranking, the benefits of hybrid context, and how far we still are from robust visual grounding and multi-page synthesis. The benchmark’s strengths—and the measured gaps—give practitioners a concrete roadmap for what to fix next.

06Conclusion & Future Work

Three-Sentence Summary:

  • ViDoRe V3 is a multilingual, multimodal, human-verified benchmark that evaluates the full RAG pipeline: retrieval, answer generation, and precise visual grounding, on 26k pages across 10 professional domains.
  • Experiments show visual retrievers outperform text-only ones; textual reranking yields large gains; hybrid (text+image) context improves answer accuracy on hard queries; yet models still struggle with open-ended questions, multi-page synthesis, cross-lingual retrieval, and fine-grained grounding.
  • By releasing public data (with private hold-outs) and integrating with MTEB, ViDoRe V3 creates a fair, realistic proving ground to push RAG systems closer to trustworthy real-world performance.

Main Achievement:

  • The paper delivers the first comprehensive, human-verified, end-to-end multimodal RAG benchmark with fine-grained visual grounding—showing not just whether models answer correctly, but whether they retrieved the right evidence and can point to it precisely.

Future Directions:

  • Train stronger multimodal retrievers and, crucially, visual rerankers; design hybrid selectors that close the 10‑point oracle gap; improve long-context generation for open-ended and multi-hop tasks; and build better visual grounding heads that raise recall and F1 toward human levels. Expand sources to more languages, scripts, and noisier enterprise formats to match real deployments.

Why Remember This:

  • ViDoRe V3 changes the question from “Can your model answer?” to “Can your model find, understand, and prove it across text and visuals, in multiple languages?” That end-to-end bar—retrieval + reasoning + grounding—is the standard real users need. This benchmark sets it and shows exactly where today’s systems shine and where they still fall short.

Practical Applications

  ‱ Benchmark and compare your RAG pipeline end-to-end (retrieval + generation + grounding) on realistic multimodal tasks.
  ‱ Choose retrievers: favor late-interaction visual retrievers for visually rich documents; measure gains on your domains.
  ‱ Add a strong textual reranker (e.g., zerank-2) to significantly improve page ordering and downstream answer quality.
  ‱ Adopt hybrid retrieval (top visual + top textual) to boost accuracy on hard, multi-faceted queries.
  ‱ Prefer question-style query templates over keywords/instructions when building user interfaces or agents.
  ‱ Feed image pages (not just OCR text) into your generator to improve answers on charts/tables/diagrams.
  ‱ Audit trust with visual grounding: require models to output bounding boxes and spot-check against gold boxes.
  ‱ Plan for cross-lingual drops: add multilingual training or translation steps to mitigate 2–3 point retrieval losses.
  ‱ Use the oracle gap (~10 points on hard queries) to prioritize retrieval improvements before tuning generation.
  ‱ Replicate the human-in-the-loop annotation workflow (on your private docs) to create in-house gold data for evaluation and fine-tuning.
#Retrieval-Augmented Generation · #Multimodal RAG · #Visual Document Understanding · #Late-interaction retrievers · #Reranking · #Hybrid retrieval · #Visual grounding · #Bounding box localization · #Cross-lingual retrieval · #MTEB integration · #Tables and charts reasoning · #Open-ended queries · #Multi-page synthesis · #Benchmarking · #Document AI