
M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Intermediate
Adithya S Kolavi, Vyoman Jain · 12/3/2025
arXiv · PDF

Key Summary

  • The paper introduces M3DR, a way for computers to find the right document image no matter which of 22 languages the query or the document uses.
  • Instead of relying on error-prone OCR text, it learns directly from document pictures plus text using vision-language models (VLMs).
  • The team builds a huge synthetic training set by translating layouts into many scripts and generating smart queries with big vision-language models.
  • They train two retrievers: NetraEmbed (one vector per document) and ColNetraEmbed (many token vectors per document).
  • A special trick called Matryoshka learning lets the same model produce short or long embeddings, trading tiny accuracy drops for big storage savings.
  • On a new benchmark (Nayana-IR) with 23 datasets across 22 languages, NetraEmbed beats prior methods by about 152% on cross-lingual search (NDCG@5: 0.716 vs 0.284).
  • Surprisingly, simple in-batch contrastive learning worked better than complex hard-negative mining for multilingual data.
  • Dense vectors are far cheaper and faster to deploy than multi-vector models while staying more accurate for cross-lingual tasks.
  • The approach keeps strong English performance while finally making non-English document retrieval work well.
  • All data, models, and code are released to help others build fair, multilingual document search systems.

Why This Research Matters

Many workplaces, schools, and governments store documents in multiple languages and scripts, often filled with tables and figures. A system that understands both visuals and text—and works across languages—means people can find the right information faster and more fairly. This reduces dependence on fragile OCR pipelines that break on non-English fonts and complex layouts. It also allows smaller devices and large data centers to run the same model by choosing smaller or larger embeddings as needed. As a result, more communities gain access to reliable search over their own documents, supporting inclusion, efficiency, and better decision-making.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: Imagine a giant global library where books, reports, and forms are written in many languages and full of charts, tables, and pictures. You want to ask a question in your language and instantly find the right page—even if that page is in a different language and packed with visuals.

🄬 The Concept (Multilingual Multimodal Document Retrieval — what/why/how):

  • What it is: It’s a way for computers to find the right document image using both vision (how a page looks) and language (what it says), across many languages.
  • How it works (recipe):
    1. Turn a query (text) and a document page (image) into numbers in the same ā€œmeaning space.ā€
    2. Make sure matching query–page pairs are close together in that space; push mismatches far apart.
    3. Search for the closest document numbers to the query numbers.
  • Why it matters: Without this, the system falls apart when text is in different scripts (like Arabic or Devanagari), when OCR fails on fancy fonts, or when visuals like tables and charts carry key meaning.

šŸž Anchor: You ask in Spanish, ā€œĀæCuĆ”l es la capital de Mongolia?ā€ and the system correctly retrieves an English PDF page where a map labels ā€œUlaanbaatar,ā€ even if there’s no perfect OCR text—because it learned from the picture-plus-text together.

šŸž Hook: You know how reading a photo of a worksheet is harder than reading plain typed text? Computers used to first turn images into text with OCR, then search that text.

🄬 The Concept (OCR pipelines — what/why/how):

  • What it is: OCR tries to read characters from images to get plain text.
  • How it works: 1) Detect letters, 2) Guess each character, 3) Stitch together lines and paragraphs.
  • Why it matters: OCR throws away layout and visuals, struggles with unusual fonts/scripts, and makes mistakes that break search—especially in low-resource languages.

šŸž Anchor: If a Hindi form uses special numerals in a table, OCR might misread them; the old system won’t find the page when you ask about those numbers.

šŸž Hook: Think of a friend who’s great at looking at a poster and explaining it in words.

🄬 The Concept (Vision-Language Models — VLMs):

  • What it is: Models that understand images and text together.
  • How it works: 1) Turn the image into visual tokens, 2) Read the text tokens, 3) Fuse them to learn joint meaning.
  • Why it matters: They keep layout, fonts, figures, and words together so nothing important is lost.

šŸž Anchor: A VLM can look at a chart with labels in Japanese and connect it to your English question about ā€œannual rainfall.ā€

šŸž Hook: Imagine sorting family photos by which ones match a caption like ā€œGrandma baking cookies,ā€ putting matches close and non-matches far.

🄬 The Concept (Contrastive Learning):

  • What it is: A training style where matching pairs are pulled together and mismatches are pushed apart.
  • How it works: 1) Embed queries and pages, 2) Score similarities, 3) Increase score for the true pair, decrease for others (in-batch negatives).
  • Why it matters: Without it, the model won’t learn a clean shared meaning space, and retrieval becomes noisy and unreliable.

šŸž Anchor: After training, the English question ā€œWhere is the methods section?ā€ ends up near the right Spanish page image that clearly has the ā€œMĆ©todosā€ heading.

The world before: Most visual document retrievers were English-centric. They did okay on English PDFs but stumbled on other scripts (Arabic, Kannada, Japanese), fancy typography, and layout-heavy pages. Text-only multilingual embeddings could cross languages, but they lost the visual context (tables, figures, layout cues). OCR pipelines broke too often on non-English, causing cascading errors.

The problem: Could we create a universal retriever that works across many scripts and keeps English strong, without depending on fragile OCR?

Failed attempts and why: 1) Pure OCR+text search: loses layout/visuals and fails on scripts; 2) English-only vision-based retrievers: don’t transfer to other languages; 3) Separate text and image encoders: miss fine-grained interactions between words and visuals.

The gap: No wide-coverage, multilingual, multimodal benchmark and training data existed to teach a model to align across languages and visual layouts at scale.

The stakes: Real workplaces store mixed-language documents—policies, invoices, scientific papers. Students and researchers read sources in multiple languages. Governments archive forms and records across scripts. Without reliable multilingual, multimodal retrieval, people waste time or miss critical info. With it, we unlock fairer access to knowledge for everyone, not just English readers.

02 Core Idea

šŸž Hook: Imagine building one ā€œuniversal bookshelfā€ where every page from every language sits in the same order of meaning, so you can find it with any language key.

🄬 The Concept (Key insight):

  • What it is: Train a vision-based retriever on a massive, synthetic, layout-aware, multilingual document set so that text queries and document images from 22 scripts live in one shared meaning space; add a Matryoshka trick so the same model serves tiny to large embeddings.
  • How it works (recipe):
    1. Make parallel document images by translating real layouts into many scripts and rendering with proper fonts.
    2. Generate diverse queries (short, long, reasoning) using strong VLMs.
    3. Train two retrievers with contrastive learning: a single-vector one (NetraEmbed) and a multi-vector one (ColNetraEmbed).
    4. Use Matryoshka representation learning so embeddings can be truncated to 768, 1536, or full size without retraining.
  • Why it matters: Without a shared space, cross-lingual matches fail; without Matryoshka, storage and speed limits block deployment at scale.

šŸž Anchor: A Thai question about a chart title retrieves the right Japanese page image because both live near each other in the shared space; on your phone, you choose the 768-dim version to save memory and it still works almost as well.

Three analogies for the same idea:

  1. World map analogy: You redraw all cities (documents) from different countries (languages) onto one global map of meaning. Now any traveler (query) can navigate to the right place using any language.
  2. Music key analogy: Songs (documents) are transposed into the same key (embedding space). Whether a singer (query) starts in C or G (language), harmony lines up and the right song is found.
  3. Universal adapter analogy: You bring a plug adapter (shared space) that fits any country’s outlet (script). Your device (query) always connects to the power source (document) no matter where you are.

Before vs. after:

  • Before: English-strong, non-English weak; OCR fragile; visuals lost; separate encoders miss details; no common benchmark.
  • After: Cross-lingual, OCR-free alignment across 22 scripts; visuals preserved; single- and multi-vector options; standardized evaluation with Nayana-IR.

šŸž Hook: You know how a librarian can either remember one short summary per book or many sticky notes inside the pages?

🄬 The Concept (Single dense vector vs. ColBERT-style multi-vector):

  • What it is: Two retrieval styles—one summary vector per doc (fast, tiny) or many token vectors per doc (fine-grained, bigger).
  • How it works: Single-vector pools tokens into one number vector; multi-vector keeps per-token vectors and uses MaxSim to match each query token to its best document token.
  • Why it matters: Without single-vector, deployment gets expensive; without multi-vector, you lose some interpretability and ultra-fine matches.

šŸž Anchor: For a billion-page index, single-vectors mean quick lookups; for tricky long-tail questions, multi-vectors can show heatmaps of exactly which region matched.

šŸž Hook: Think of Russian nesting dolls—big on the outside, smaller inside, but each is a complete doll.

🄬 The Concept (Matryoshka Representation Learning):

  • What it is: Train embeddings that still work well when you keep only the first part (e.g., first 768 numbers of a 2560-length vector).
  • How it works: During training, compute the contrastive loss at several truncation lengths so each slice is useful.
  • Why it matters: Without it, shrinking embeddings would break accuracy; with it, you flexibly pick speed/storage vs. accuracy later.

šŸž Anchor: The 768-dim version keeps about 95% of performance but saves ~70% storage—handy for mobile apps or giant indices.

Why it works (intuition):

  • The synthetic multilingual corpus forces the model to see the same layouts/semantics rendered in different scripts, so it must align meaning rather than script shapes.
  • Contrastive learning with diverse in-batch negatives across languages gives a strong, stable training signal.
  • VLM backbones preserve visual+text cues (headings, tables, figures) that pure text encoders miss.
  • Matryoshka regularizes the embedding so important information lives up front, enabling graceful truncation.

Building blocks:

  • Synthetic, layout-aware parallel documents + proper fonts.
  • Query synthesis with multiple types (short, long, reasoning) to cover real needs.
  • Two retrievers: NetraEmbed (single vector) and ColNetraEmbed (multi vector, late interaction/MaxSim).
  • Simple, stable contrastive objectives (InfoNCE) with multilingual batches.
  • Nayana-IR benchmark to measure monolingual and cross-lingual performance fairly across 22 languages.

03 Methodology

At a high level: Input (query text in any language + document images in many languages) → [Embed both into one shared space] → [Compare similarities] → Output (top-k matching document images).

šŸž Hook: Imagine making practice worksheets in many languages that all keep the same layout, then asking different kinds of questions about them.

🄬 The Concept (Synthetic Parallel Corpus Generation):

  • What it is: A pipeline that turns real English document pages into parallel pages across 22 languages while keeping layout and visuals.
  • How it works:
    1. Layout detection: Find text boxes, tables, figures (using tools like DocLayout-YOLO/Docling).
    2. Neural translation: Translate with context using NLLB-200 and language-specific models.
    3. Visual rendering: Rebuild the page with authentic fonts (e.g., Noto) and correct line breaks/directions at high resolution.
  • Why it matters: Without faithful layout+font rendering, the model can’t learn that the same meaning can wear different ā€œclothesā€ (scripts/layouts).

šŸž Anchor: A technical report’s ā€œMethodsā€ section with a table becomes ā€œMĆ©todosā€ (Spanish) or ā€œę–¹ę³•ā€ (Chinese) on the same-looking page.

šŸž Hook: You know how teachers ask short questions, long questions, and tricky reasoning ones so you really learn?

🄬 The Concept (Query Synthesis):

  • What it is: Use strong VLMs (like Llama 3.1 90B Vision, Llama 4 Scout) to make varied, high-quality queries for each page.
  • How it works: For each image, generate 5 query types: 2 basic facts, 1 long answer, 1 multiple choice/short answer, 1 cross-paragraph reasoning; then filter for quality.
  • Why it matters: Without varied queries, models overfit to only easy lookups and fail on real-world questions.

šŸž Anchor: For a slide with a chart, one query asks ā€œWhat is the x-axis label?ā€ and another asks ā€œCompare 2021 vs 2022 values and summarize the trend.ā€

šŸž Hook: Picture two ways to summarize a book: one short blurb or many sticky notes inside.

🄬 The Concept (Two Retrieval Architectures):

  • What it is: NetraEmbed (single dense vector) and ColNetraEmbed (ColBERT-style multi-vector).
  • How it works:
    • Single-vector: Use a VLM (Gemma 3 4B-IT) to get token states; apply last-token pooling to obtain one 2560-dim vector; L2-normalize; compare with cosine similarity. Matryoshka training keeps the 768/1536/2560-dim slices usable.
    • Multi-vector: Keep per-token vectors (e.g., 256 visual tokens); compute MaxSim: for each query token, find its best-matching doc token; sum over tokens; train with InfoNCE.
  • Why it matters: Single-vector is tiny/fast for huge indices; multi-vector offers fine-grained matching and heatmaps.

šŸž Anchor: Single-vector is like a 10 KB card in a rolodex; multi-vector is like storing 256 mini-cards per page—more detail, more space.

šŸž Hook: Think of ā€œfind the right partnerā€ games—pull matches close, push mismatches away.

🄬 The Concept (Contrastive Training with InfoNCE):

  • What it is: A simple, stable loss that lifts the true pair’s score above all others in the batch.
  • How it works: In each batch, compute similarities between every query and every doc; raise the true pair’s score; lower the rest; temperature Ļ„ tunes sharpness.
  • Why it matters: Without this, the model doesn’t learn a clean landscape where true pairs cluster together across scripts.

šŸž Anchor: In a batch with Hindi, French, and Arabic queries, each one learns to hug its matching page image and ignore the rest.

Key training choices (and what breaks without them):

  • Last-token pooling (for single-vector): Decoder-style VLMs pack summary info in the last token; mean-pooling diluted it and dropped NDCG@5 by 13+ points.
  • In-batch negatives over hard negatives: With multilingual batches, diversity was enough; mined hard negatives caused instability without better results.
  • Matryoshka slices: Enable deployment-time dimension choices; without pretraining slices, truncation would cut crucial info.

Example with data:

  • Query (Hindi): ā€œą¤šą¤æą¤¤ą„ą¤° 2 का ą¤¶ą„€ą¤°ą„ą¤·ą¤• ą¤•ą„ą¤Æą¤¾ ą¤¹ą„ˆ?ā€ (ā€œWhat is the title of Figure 2?ā€)
  • Documents: A set of page images in English, Japanese, and Hindi.
  • Embedding step: The Hindi query and the correct English page image land close in the embedding space because the model learned figure-title patterns independent of script.
  • Retrieval: Cosine similarity ranks the English page on top; the user gets the right figure fast.

Evaluation benchmark: šŸž Hook: You know how sports leagues need fair rules and scoreboards so everyone agrees who’s winning?

🄬 The Concept (Nayana-IR Benchmark):

  • What it is: A standardized test suite with 23 datasets across 22 languages, for both cross-lingual and monolingual retrieval.
  • How it works: Provides ~28K document images and ~5.4K queries in BEIR format; measures NDCG@5, Recall@10, MAP@10, MRR@10.
  • Why it matters: Without a fair, broad benchmark, you can’t tell if a model truly handles different scripts or just English.

šŸž Anchor: The same question in Russian, Hindi, and Japanese is asked against a mixed-language pool; the benchmark checks if your system finds the right page regardless of script.

Secret sauce:

  • Layout-aware multilingual synthesis + query variety forces true semantic alignment.
  • Simple, stable InfoNCE with multilingual batches beats complex tricks.
  • Matryoshka gives practical knobs for cost vs. accuracy.
  • Two architectures let teams balance scale (single-vector) vs interpretability (multi-vector).

04 Experiments & Results

The test: Measure how well models find the right pages when the query and the document can be in different languages (cross-lingual) or the same language (monolingual). Use NDCG@5 as the main score (like a report card grade where higher is better at ranking correct answers up top). Also report Recall@10, MAP@10, and MRR@10 for depth and ranking quality.

The competition: Strong English-focused visual retrievers like ColPali and ColQwen variants; multimodal embedders (GME-Qwen2-VL-2B; Jina-embeddings-v4); and our two M3DR models: NetraEmbed (single-vector) and ColNetraEmbed (multi-vector).

Scoreboard with context:

  • Cross-lingual (hardest test):
    • NetraEmbed: NDCG@5 = 0.716. Think of this as getting an A+, while prior leaders (e.g., ColPali-v1.3 at 0.284) got a C–. That’s ~152% relative improvement.
    • ColNetraEmbed: 0.637—still a big jump over baselines.
    • Many English-centric baselines fell to 0.00–0.32, showing they didn’t truly generalize beyond English.
  • Monolingual (same-language search):
    • NetraEmbed: 0.738 vs. ColPali-v1.3 at 0.410 (about an 80% relative boost).
    • ColNetraEmbed: 0.670.
  • English (ViDoRe v2):
    • NetraEmbed: 0.554 (competitive while prioritizing multilingual strength).
    • ColNetraEmbed: 0.551.

Per-language results: NetraEmbed stays strong across Latin, Devanagari, Dravidian scripts, CJK, Arabic, and more—evidence of robust script-agnostic understanding.

Surprising findings:

  • Simple beats fancy: Training with in-batch negatives (no heavy hard-negative mining) worked best in multilingual settings—likely because each batch’s language diversity already provides rich contrast.
  • Last-token pooling wins: Mean pooling dropped scores by double digits; decoder-style VLMs store crucial summary info at the end.
  • OCR pretraining doesn’t transfer: OCR-focused models did poorly for retrieval; recognizing characters isn’t the same as understanding semantic similarity for search.

Matryoshka trade-offs (choose your index size):

  • 2560-dim: NDCG@5 = 0.716 (full accuracy), ~10 KB/doc.
  • 1536-dim: 0.706 (~98.6% of full), ~6 KB/doc.
  • 768-dim: 0.680 (~95% of full), ~3 KB/doc. This means you can shrink storage 70% and still keep most accuracy—great for billion-scale deployments or edge devices.
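
The per-document storage figures above follow directly from the embedding length; a quick sanity check assuming 4-byte float32 values per dimension:

```python
# Approximate index size per document for float32 embeddings (4 bytes each).
for dim in (2560, 1536, 768):
    kb = dim * 4 / 1024
    print(f"{dim:>4} dims ā‰ˆ {kb:.0f} KB per page")   # 10 KB, 6 KB, 3 KB
```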

Dense vs. multi-vector comparison:

  • Accuracy: Single-vector (NetraEmbed) outperforms for multilingual (0.716 vs 0.637 cross-lingual NDCG@5).
  • Efficiency: Single-vector is ~250Ɨ smaller per document than a naive multi-vector index and much faster at search via standard ANN libraries.
  • Explainability: Multi-vector provides token-level heatmaps (which regions matched which words), helpful for audits and tricky questions.

Takeaway: The combination of layout-aware multilingual synthesis, diverse queries, simple contrastive training, and Matryoshka-enabled embeddings creates a system that finally makes cross-lingual, multimodal document retrieval work reliably—without tanking English performance.

05 Discussion & Limitations

Limitations (be specific):

  • Rare language pairs: Asking in a low-resource language for a document in another distant low-resource language (e.g., Tamil → Russian) shows a 10–12% drop vs. common pairs; alignment there is still tough.
  • Tables and numerals: Complex tables with language-specific number formats (e.g., Arabic-Indic digits) remain challenging; a table-aware objective might help.
  • Granularity: Focus is on document/page-level retrieval; passage- or region-level retrieval within a page is not fully addressed here.
  • Scale beyond 22 languages: True zero-shot transfer to unseen scripts/languages isn’t proven yet; more data and methods are needed.

Required resources:

  • Training: A few A100 GPUs and LoRA fine-tuning suffice (reported ~64 GPU hours), making this attainable for many labs.
  • Serving: A vector database for single-vector (Faiss/Milvus/etc.); more memory/compute if using multi-vector indexes.
  • Data: The synthetic pipeline plus Nayana-IR benchmark; fonts and renderers to keep typography authentic.

When NOT to use:

  • If you only need English and have flawless OCR text, a simpler text-only retriever may be cheaper/easier.
  • If you require pinpoint word-level grounding within crowded tables, a specialized table retriever or OCR+structure system might be better today.
  • If you must explain every match visually, favor the multi-vector variant over the single-vector one.

Open questions:

  • Can we further boost rare-pair alignment (e.g., via multilingual teacher-student distillation or language-specific adapters)?
  • How to make numerals, currency, and date formats robust across scripts?
  • What’s the best way to do region-level retrieval inside pages without exploding index size?
  • Can we extend to more scripts with zero-shot generalization and still keep accuracy high?
  • How do we monitor and ensure fairness across languages in production over time?

06 Conclusion & Future Work

Three-sentence summary: M3DR trains vision-based retrievers on a large, layout-aware, multilingual synthetic corpus so that text queries and document images from 22 scripts live in one shared meaning space. With simple but robust contrastive learning and a Matryoshka trick, NetraEmbed (single-vector) and ColNetraEmbed (multi-vector) achieve strong cross-lingual and monolingual retrieval, setting new state-of-the-art results on the Nayana-IR benchmark. The approach preserves English competitiveness while finally making non-English visual document search reliable.

Main achievement: NetraEmbed reaches 0.716 NDCG@5 on cross-lingual retrieval (about 152% relative improvement over a leading baseline) and offers practical deployment via small, flexible embeddings.

Future directions: Improve rare language-pair alignment, build table-aware or numeral-aware objectives, move from page-level to region-level retrieval, and scale to more low-resource languages with zero-shot transfer. Explore hybrid systems that mix single- and multi-vector strengths for both efficiency and explainability.

Why remember this: It shows that with the right data, training recipe, and embedding design, we can make document search work fairly across languages and scripts—bringing the world’s knowledge closer to everyone, not just English readers.

Practical Applications

  • Search inside multilingual corporate wikis and policy manuals without needing OCR.
  • Find the right page in mixed-language scientific papers and theses for cross-border collaborations.
  • Retrieve invoices, receipts, and forms with varied scripts in finance and supply-chain audits.
  • Power document-centric RAG systems that cite exact pages (with figures/tables) across languages.
  • Enable mobile apps to search scanned notes and worksheets on-device using compact embeddings.
  • Help libraries and archives surface relevant pages from digitized cultural heritage collections.
  • Support multilingual help desks to quickly locate the right troubleshooting page regardless of script.
  • Improve compliance and legal discovery by finding the precise clause across multilingual contracts.
  • Assist educators and students to discover relevant textbook pages in their native language or others.
  • Speed up customer support by retrieving the exact product manual page from global editions.
#multilingual retrieval #multimodal retrieval #document image search #vision-language models #contrastive learning #Matryoshka embeddings #ColBERT late interaction #dense retrieval #synthetic data generation #cross-lingual alignment #Nayana-IR benchmark #OCR-free document understanding #ANN vector search #NetraEmbed #ColNetraEmbed