
ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Intermediate
Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah et al. · 2/12/2026
arXiv

Key Summary

  • ExStrucTiny is a new test (benchmark) that checks whether AI can pull many connected facts from all kinds of documents and neatly put them into JSON, even when the question style and schema change.
  • It blends ideas from key-entity extraction, relation extraction, and visual question answering to better match real office tasks like processing forms, slides, reports, and web pages.
  • Every answer must include the exact text, the page number, and a bounding box of where that text came from, so models must both read and point.
  • The dataset has 304 query–answer pairs across 110 multi-page documents, with three query types: closed with plain text, closed with a schema, and on-demand (vague) requests.
  • To build it, the authors combined careful human-made examples with many synthetic ones from a strong VLM, then had experts fix and validate them.
  • They also invented a fair scoring method that uses a small text-only LLM to align different JSON shapes before computing accuracy, so models aren’t punished for harmless formatting differences.
  • Closed-source models currently lead text extraction by a wide margin (roughly 18 ANLS points over the best open model) and stay strong even when many values are requested.
  • All models struggle to precisely point to answer locations (low bounding-box IoU), showing a gap between ‘getting the right text’ and ‘proving where it came from.’
  • Models perform worse on harder query types (schema-heavy and on-demand), on reformulated questions with fewer word matches, and when answers are missing (unanswerable queries).
  • Visual information clearly helps: using OCR text alone drops performance by about 10% ANLS compared to using the document images.

Why This Research Matters

Many everyday processes—paying bills, approving loans, onboarding patients, or tracking shipments—depend on quickly and accurately reading mixed documents. ExStrucTiny checks whether AI can adapt to different question styles and changing schemas, which mirrors how real users ask for information. By requiring exact locations (page and boxes), it supports trust, auditing, and compliance needs where proof matters. The benchmark’s tough cases (low word-overlap, missing answers, multi-entity requests) discourage shortcuts and push true understanding. Findings reveal where current models fall short—especially grounding—so engineers know what to fix. As models improve on ExStrucTiny, businesses can safely automate more of their document workflows, saving time and reducing errors.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine your class has a giant folder full of different papers—permission slips, lunch menus, report cards, and posters—and your teacher asks you to fill a spreadsheet with names, dates, and totals from all of them. Doing this by hand would take forever and you could still make mistakes.

🥬 The Concept (Structured Information Extraction): What it is: It’s teaching computers to find important bits (like names, dates, amounts) in documents and put them into a tidy structure (like JSON) so other programs can use them. How it works (recipe):

  1. Look at the document (image + text + layout).
  2. Find the parts the user asked for (the ‘entities’).
  3. Copy the exact text and also record where it was found (page and box).
  4. Place all of that into the right spots in a structured answer. Why it matters: Without structure, computers can’t easily search, check, or combine the data; it’s like dumping puzzle pieces in a bag without building the picture. 🍞 Anchor: A company receives scanned invoices and needs ‘invoice number,’ ‘vendor,’ and ‘total.’ Structured extraction fills those cells automatically.
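The recipe above can be sketched as data. Here is a minimal Python example, assuming the {"text", "page", "bbox"} leaf format the benchmark requires; the field names ("invoice_number", "vendor", "total") and the coordinate values are invented for illustration:

```python
# A minimal sketch of structured extraction output: every final value
# is an "extraction leaf" carrying its text, page, and bounding box.
# Field names and coordinates are illustrative, not the benchmark's.

extraction = {
    "invoice_number": {"text": "INV-2024-0042", "page": 1, "bbox": [0.72, 0.05, 0.95, 0.08]},
    "vendor": {"text": "Acme Supplies Ltd.", "page": 1, "bbox": [0.05, 0.10, 0.40, 0.14]},
    "total": {"text": "$1,284.50", "page": 2, "bbox": [0.70, 0.85, 0.92, 0.89]},
}

def is_leaf(node):
    """Check that a value is a well-formed extraction leaf."""
    return (
        isinstance(node, dict)
        and set(node) == {"text", "page", "bbox"}
        and isinstance(node["bbox"], list)
        and len(node["bbox"]) == 4
    )
```

A downstream system can validate outputs with `is_leaf` before ingesting them, which is exactly what makes structured (rather than free-text) answers machine-checkable.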

🍞 Hook: You know how some worksheets always ask the same few things—like your name and date—no matter the subject?

🥬 The Concept (Closed IE and Narrow Ontologies): What it is: Closed information extraction asks for a fixed set of entities (like always ‘name,’ ‘date,’ ‘amount’). How it works: Models train on one document type with a short list of target fields; they tag and extract those exact fields. Why it matters: It works great when you always see the same form, but it struggles when the form or the target fields change. 🍞 Anchor: A receipt dataset might always ask for ‘store name’ and ‘total,’ which is perfect for receipts but not for slide decks.

🍞 Hook: Have you ever tried to answer a question about a picture and realized the question can be simple (“What color?”) or tricky (“Which item is most expensive?”)?

🥬 The Concept (VQA vs. Real-World Extraction): What it is: Visual Question Answering (VQA) asks a question about an image and expects an answer, often a short text span. How it works: The model reads the image and text, matches the question words to likely spots, and returns the answer. Why it matters: Many VQA tasks use simple, single answers and often share lots of words with the document, so string matching can sometimes cheat; real business tasks need multi-field, structured outputs. 🍞 Anchor: “What’s the invoice number?” is easy; “List all line items with names, quantities, and prices” is the real-world challenge.

🍞 Hook: Suppose your teacher sometimes gives you a blank table to fill (schema), sometimes writes a sentence, and sometimes just says, “Find all info about the authors.”

🥬 The Concept (Open, Closed, and On-Demand Queries): What it is: Three ways people ask for info.

  • Closed (plain text): “Give me X and Y.”
  • Closed (with schema): “Fill this JSON with X and Y.”
  • On-demand: “Get all info about Z” (you must decide which children fields matter). How it works: The model must read both the request style and the document, then adapt the output. Why it matters: Real users don’t always know exact field names. A good system must flex with any schema or vague request. 🍞 Anchor: “Extract signer ID and role” (closed), “[{"signer name":"","signer role":""}]” (schema), “All details about the signers” (on-demand).

🍞 Hook: Think of a backpack that can carry books (text) and a water bottle (images) at once.

🥬 The Concept (Vision-Language Models, VLMs): What it is: VLMs understand both pictures and words together. How it works: They encode images and text, connect them with attention, then generate or extract answers. Why it matters: Documents are visual (layout, tables, charts) and textual. Ignoring either side loses key clues. 🍞 Anchor: A VLM can read a chart’s legend (visual) and labels (text) to correctly extract the value of the red bar.

Before this paper: Most datasets focused on simple, fixed lists of things or single short answers, often from one document type. That was fine for narrow tasks but didn’t test flexible, multi-entity, structured extraction across many document styles.

The problem: General-purpose VLMs need to handle various documents and changing schemas, but we lacked a fair, realistic test to measure this.

Failed attempts: KEE datasets used small, fixed ontologies; many VQA sets asked single, easy questions with high word overlap—models could answer by matching strings, not true understanding; table-only tasks ignored full document context.

The gap: No benchmark required models to: 1) adapt to user-provided schemas, 2) extract many related fields at once, 3) include exact locations, and 4) survive low word-overlap and missing-answer cases across diverse document types.

Real stakes: In banks, hospitals, and schools, lots of workflows depend on correctly reading mixed documents. If the AI can’t adapt, humans must fix things by hand, slowing everything down and risking errors.

02Core Idea

🍞 Hook: You know how a universal remote can control lots of different TVs and speakers because it adapts to each device’s buttons? Imagine a ‘universal test’ that checks whether an AI can adapt to any document and any set of fields you ask for.

🥬 The Concept (The Aha!): What it is: ExStrucTiny is a benchmark that unifies closed, schema-based, and on-demand extraction on real document images and scores models on both what they extract and where they found it. How it works:

  1. Provide diverse documents (forms, reports, slides, web pages) and three query styles (closed plain text, closed schema, on-demand).
  2. Require answers in JSON with exact text, page, and bounding boxes for every extracted value.
  3. Use a smart ‘schema mapper’ (a small text-only LLM) to align differently-shaped JSON outputs before scoring, so fair comparison doesn’t depend on naming or nesting.
  4. Evaluate text accuracy, structure similarity, and grounding (page and box). Why it matters: Without this, we could wrongly judge models just because they used a different but correct JSON shape—or because the question was flexible and they didn’t adapt. This benchmark checks real skills businesses need. 🍞 Anchor: A user asks, “Extract all details about the signers.” The model must discover the children fields (like name, role, date), pull the exact strings, show where they came from, and organize them in a neat list of objects.

Multiple analogies:

  1. Swiss Army Knife Test: Not just “Can you cut?” but “Can you cut, open, twist, and file?” ExStrucTiny tests many extraction skills at once.
  2. Treasure Map with GPS: Don’t just bring the treasure (text). Show the GPS coordinates (page + box) to prove where you found it.
  3. Build-Your-Own Shelf: Sometimes we hand you the shelf blueprint (schema); sometimes we just say “store author stuff” and you must decide which compartments (fields) to add.

Before vs After:

  • Before: Datasets often had single answers, fixed fields, and high word-overlap.
  • After: ExStrucTiny demands multi-entity, schema-flexible, low-overlap, and sometimes unanswerable queries, across varied document types, with location evidence.

Why it works (intuition):

  • Flexible queries + required JSON leaves (text, page, bbox) force models to both understand and ground answers.
  • The LLM schema-mapper removes grading unfairness from different-but-equivalent JSON shapes.
  • Low lexical overlap and unanswerable cases prevent ‘string matching’ shortcuts and check true comprehension.

Building blocks (each explained with Sandwich):

  • 🍞 Hook: Like labeling each Lego piece you use in a build. 🥬 The Concept (Extraction Leaves): What it is: Each final value comes with its text, page, and bounding box. How: For every field, store {"text", "page", "bbox"}. Why: So we know exactly what you used and where it came from. 🍞 Anchor: ‘Total: $123.45,’ page 2, box [120, 540, 280, 575].
  • 🍞 Hook: Different people organize binders differently. 🥬 The Concept (Schema Mapping LLM): What it is: A small LLM aligns your JSON shape to the gold shape. How: Flatten keys, match by values and meaning, then compute scores. Why: Prevents penalizing correct answers just for using different key names. 🍞 Anchor: ‘buyer.name’ can map to ‘customer.full_name’ if the value matches.
  • 🍞 Hook: A good report card doesn’t just say ‘good’; it shows subject grades. 🥬 The Concept (Multi-part Scoring): What it is: Separate scores for text similarity, structure match, page correctness, and box overlap. How: Compute ANLS for text, tree-edit for structure, page accuracy, and IoU/proximity for boxes. Why: One number can hide weaknesses; multiple scores show where models need help. 🍞 Anchor: A model could get the right number but mark the wrong page—text score high, page score low.
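The text-similarity part of this scoring can be made concrete. Below is a common formulation of ANLS (one minus normalized Levenshtein distance, snapped to 0 below a 0.5 threshold); the benchmark’s exact variant and threshold may differ:

```python
# A sketch of ANLS (Average Normalized Levenshtein Similarity).
# The 0.5 cutoff follows common ANLS practice; the paper's exact
# settings may differ.

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Similarity in [0, 1]; scores below the threshold snap to 0."""
    if not pred and not gold:
        return 1.0
    sim = 1 - levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold))
    return sim if sim >= threshold else 0.0
```

Because ANLS is character-based, a near-miss string still earns partial credit, which is exactly the nuance the Limitations section flags for numbers and dates.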

03Methodology

At a high level: Document images → Query (closed, schema, or on-demand) → Model outputs JSON with extraction leaves → Schema mapper aligns predicted vs. gold → Metrics computed (text, structure, page, bbox).

Step 1: Task setup (queries and answers)

  • What happens: Each example has multi-page images and one query of three types: closed with plain text, closed with a schema, or on-demand (vague parent field). All answers must be JSON, and every final value must be an extraction leaf with {"text", "page", "bbox"}.
  • Why this step: Forces consistent, structured results across many styles of asking.
  • Example: Query: “[{"signer name":"","signer role":""}]”. Answer: a list of signer objects, each field holding a leaf with exact text, page index, and normalized bbox.

Step 2: Manual seed data

  • What happens: Experts annotate 102 high-quality examples across four sources: forms (FUNSD), financial reports (TAT-DQA), slide decks (SlideVQA), and web pages (VisualMRC). They design hard queries with multiple entities, low word-overlap, missing fields, cross-page values, and tricky layouts (like charts, checkboxes).
  • Why: Sets the gold standard for difficulty and realism; provides strong few-shot examples.
  • Example: “Extract cost center ID, department, and any events (type and date).” Some fields might be missing; answers include null leaves where data doesn’t exist.

Step 3: Synthetic expansion with a strong VLM

  • What happens: Using Gemini-2.5-Flash-Thinking, the team generates many more QAs, including extra-hard ones. They add chain-of-thought in the prompt for better structure-following and use higher temperature for diversity. Then they add targeted augmentations:
    • Reformulations: rename entities to reduce word overlap.
    • Unanswerables: request fields that don’t exist (fully or partially).
  • Why: Scales up variety and difficulty, mirroring real-life queries.
  • Example: Change “study name” to “report topic,” or add “project sponsor” when none exists.

Step 4: Human validation with editing

  • What happens: Experts validate 202 synthetic QAs, fixing queries, text values, pages, and boxes—and ensuring every leaf is correctly formatted. Average 25.5 edits per QA; only 2 rejected.
  • Why: Ensures correctness and standards compliance.
  • Example: If a box was slightly off, a validator tightens it; if a value didn’t match the document exactly, they correct the string.

Step 5: Final dataset composition

  • What happens: Combine 102 manual + 202 validated synthetic = 304 QAs over 110 documents. Keep a mix of difficulties: ~55% basic, ~25% reformulated, ~15% partially unanswerable, ~5% fully unanswerable.
  • Why: Balanced, realistic test-bed for generalist extraction.
  • Example: A slide deck query might ask for all authors’ details (names, titles, emails) across multiple pages.

Step 6: Fair evaluation via schema mapping

  • What happens: Models often produce valid but differently-structured JSON. To grade fairly, ExStrucTiny flattens gold and prediction trees and uses a small reasoning LLM (gpt-oss-20b) to map gold keys to predicted keys, with near-perfect mapping F1 (~0.976 in tests).
  • Why: Prevents ‘format fights’ from hiding true extraction quality.
  • Example: ‘0.person.name’ can match ‘authors.0.full_name’ if the strings match closely.
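The flattening half of this step can be sketched in a few lines; the semantic alignment of the resulting paths is done by the LLM in the paper and is not reproduced here. The example structures are invented to mirror the ‘0.person.name’ vs. ‘authors.0.full_name’ case:

```python
# Flatten nested gold and predicted JSON trees into dotted key paths,
# the representation the schema-mapper LLM aligns. Example data is
# invented for illustration.

def flatten(node, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, prefix + str(key) + ".")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from flatten(value, prefix + str(i) + ".")
    else:
        yield prefix.rstrip("."), node

gold = {"authors": [{"full_name": "A. Smith"}]}
pred = [{"person": {"name": "A. Smith"}}]

# Same value under different paths: 'authors.0.full_name' vs.
# '0.person.name' -- exactly the pairs the mapper must align.
gold_paths = dict(flatten(gold))
pred_paths = dict(flatten(pred))
```

Flattening makes the mapping problem a key-to-key matching task over short strings, which is why a small text-only LLM suffices.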

Step 7: Metrics

  • What happens: Compute:
    • Text similarity (ANLS) per matched leaf.
    • Page accuracy.
    • Box overlap (IoU) and proximity.
    • Structure similarity (tree-edit distance).
  • Why: Breaks the problem into ‘read correctly,’ ‘found the right place,’ and ‘organized it well.’
  • Example: If a model copies the right total but from the wrong page, text score is high but page score is low.
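The box-overlap metric above is standard intersection-over-union. A sketch, assuming [x1, y1, x2, y2] boxes in a shared coordinate system; the benchmark’s exact box convention and its separate proximity measure may differ:

```python
# A sketch of bounding-box IoU, the grounding metric listed in Step 7.
# Boxes are assumed to be [x1, y1, x2, y2]; this is a standard
# formulation, not necessarily the paper's exact implementation.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes, in [0, 1]."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection rectangle (empty if the boxes don't overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

IoU is unforgiving: a predicted box that is merely near the gold box scores close to 0, which explains why the results report low IoU alongside much higher ‘proximity’ numbers.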

The secret sauce:

  • Requiring extraction leaves (text + page + bbox) prevents shortcutting and enables grounding checks.
  • The schema-mapper LLM makes evaluation robust to naming and layout differences.
  • The dataset design (multi-entity, low overlap, unanswerables, multiple doc types) pushes beyond string-matching into true understanding.

Concept mini-lessons (Sandwich style):

  • 🍞 Hook: Like returning a library book with its exact shelf address. 🥬 The Concept (Answer Localization): What: Prove where each answer comes from (page and box). How: Include page index and normalized bbox for every value. Why: Trust and verification matter for real workflows. 🍞 Anchor: Total amount = “$210,910,” page 3, bbox [410, 612, 580, 640].
  • 🍞 Hook: Sometimes you and a friend organize notes differently but mean the same thing. 🥬 The Concept (Tree-Edit Distance): What: A way to compare two JSON shapes. How: Count minimal edits to turn one structure into the other. Why: Rewards matching organization, not just matching words. 🍞 Anchor: Two nested lists of signers vs. a flat list with index tags can still be structurally close.
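Tree-edit distance proper requires a dynamic program (e.g., the Zhang–Shasha algorithm). To illustrate only the intuition of ‘comparing two JSON shapes,’ here is a much simpler proxy that compares sets of key paths; it is an illustration, not the paper’s metric:

```python
# A toy structure-similarity proxy (NOT tree-edit distance): compare
# the sets of key paths in two JSON trees, with list indices collapsed
# to '*' so lists of different lengths still share a shape.

def key_paths(node, prefix=""):
    """Collect structural key paths from nested JSON."""
    paths = set()
    if isinstance(node, dict):
        for key, value in node.items():
            paths |= key_paths(value, prefix + key + "/")
    elif isinstance(node, list):
        for value in node:
            paths |= key_paths(value, prefix + "*/")
    else:
        paths.add(prefix.rstrip("/"))
    return paths

def shape_similarity(a, b):
    """Jaccard overlap of key paths: 1.0 means identical shapes."""
    pa, pb = key_paths(a), key_paths(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)
```

Under this proxy, two lists of signer objects with the same field names score 1.0 even if they contain different numbers of signers, capturing the ‘same organization, different content’ idea that tree-edit distance formalizes properly.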

04Experiments & Results

The test: Evaluate many open- and closed-source VLMs of different sizes on ExStrucTiny with three few-shot examples (one per query type). Measure text extraction (ANLS), structure similarity, and grounding (page, bbox).

The competition: Open models (e.g., Qwen2.5-VL, Gemma-3, Pixtral, Mistral-Small, Kimi-VL) vs. closed models (Gemini-2.5 family). Inference used consistent settings and the schema-mapper for fair scoring.

Scoreboard with context:

  • Text extraction (ANLS): Closed models lead. The best closed model (Gemini-2.5-Pro) averages ~79.5% ANLS overall and reaches ~81.2% on the manual subset; the best open model (Qwen2.5-VL-72B-FP8) averages ~61.4% ANLS. That’s like an A- vs. a solid C.
  • Query type difficulty: Closed-with-schema and on-demand queries score lower than closed plain text across almost all models—because schema queries request many more fields, and on-demand queries force the model to infer the right children fields.
  • Size helps: Within model families, bigger models do better (e.g., Qwen2.5-VL-3B → 72B shows large gains).
  • Extraction length hurts open models: As the number of required values climbs (50+), open models’ scores drop sharply, while closed models stay steadier.
  • Manual vs. synthetic: Manual QAs are ~13.6% harder on average; no evidence of favoritism toward Gemini-2.5-Flash despite generating synthetic items, likely thanks to rigorous human validation.
  • Reformulations and unanswerables: On synthetic data, ANLS ≈ 62.4 for basic, ≈ 56.6 for unanswerables, and ≈ 45.7 for reformulated queries—showing sensitivity to wording changes and to correctly returning null for missing fields.
  • Visual vs. text-only: A text-only baseline (OCR text without images) scores ~10% ANLS lower than the image-based run, proving layout/visual cues matter.

Grounding results:

  • Page accuracy: Best ≈ 84.3% (closed).
  • Bounding-box IoU: Low across the board (best ≈ 14.4%), meaning models often get the text right but not the exact rectangle.
  • Box proximity: Best ≈ 74.8%, suggesting models get ‘nearby’ but not ‘perfectly overlapping.’ Interpretation: There’s a gap between extracting correct text and precisely pointing to it—crucial for audits and compliance.

Structure and validity:

  • Valid JSON and leaves: Larger models tend to produce more valid extraction leaves; some smaller ones output JSON that runs but with too few correctly formed leaves.
  • Schema mapping recall: Closed models recall more requested entities (Gemini-2.5-Pro/Flash ≈ 88%/87%). Big open models improve with scale.
  • Structure similarity: Several large open models approach closed models on tree similarity, showing that organizing outputs well is learnable with scale.

Context breakdown:

  • Hardest contexts: Charts and dense free text. Many models register their lowest scores there, aligning with known challenges in chart understanding and distraction from surrounding content.

Surprises:

  • Even when text is correct, grounding lags—so ‘answer found’ ≠ ‘evidence precisely located.’
  • Reformulating entity names (lower word overlap) hurts a lot, signaling over-reliance on keyword matching.
  • Unanswerable detection remains tough; models still try to ‘hallucinate’ answers instead of returning null leaves.

05Discussion & Limitations

Limitations:

  • ANLS is character-based and not ideal for numbers and dates (about 26% of values): a number that is off by a single digit still earns a high ANLS score even though it is substantively wrong, while a correct value written in a different format can score low.
  • English-only: The benchmark doesn’t yet test multilingual extraction.
  • Schema-mapper dependency: Using a text-only LLM to align schemas is accurate but slower than pure code and not 100% perfect.

Required resources:

  • A VLM with multi-page image handling and JSON generation.
  • GPU memory for inference at scale (the authors used four NVIDIA L40S GPUs).
  • The schema-mapper LLM (e.g., gpt-oss-20b) for evaluation.

When NOT to use:

  • If you only need a few fixed fields from one static form—traditional KEE datasets and specialist models may be simpler and faster.
  • If perfect pixel-level bounding boxes are mandatory today—current models’ IoU is low.
  • If your use case is multilingual right now—ExStrucTiny is currently English-only.

Open questions:

  • Better grounding: How do we tie text decoding and box prediction so ‘what’ and ‘where’ improve together?
  • Metric design: Can we add number/date-aware scoring and partial-credit rules that reflect business utility?
  • Robustness: How can we make models resilient to paraphrases (low word overlap) and confidently return null for missing fields?
  • Efficiency: Can we speed up schema alignment without losing fairness?
  • Coverage: What happens with more domains (legal, medical), more layouts (handwriting), and more languages?

06Conclusion & Future Work

Three-sentence summary: ExStrucTiny is a realistic benchmark that asks models to extract many related facts from varied document images and to return them in flexible JSON formats with exact locations. It fairly scores both ‘what you found’ and ‘where you found it,’ even when models choose different but equivalent JSON structures. Results show closed models currently lead, bigger models help, and everyone struggles with precise grounding, paraphrases, and unanswerables.

Main achievement: A unified, schema-variable, grounding-aware benchmark plus a fair evaluation framework (with schema mapping) that mirrors real business extraction needs.

Future directions: Add multilingual data, number/date-aware metrics, stronger grounding training, and broader document types (e.g., handwritten, stamps, signatures). Explore faster, programmatic schema alignment and richer few-shot guidance.

Why remember this: ExStrucTiny moves the field from toy questions toward the messy, structured, and verifiable extractions that real workflows demand—pushing models to understand, adapt, and prove their answers.

Practical Applications

  • Invoice processing: Extract vendor, invoice number, due date, and totals with page-and-box evidence for audits.
  • Form intake: Pull patient or customer details (including checkboxes) into structured records with nulls for missing fields.
  • Report mining: Gather all occurrences of metrics (e.g., revenue by quarter) from long financial reports and link them to their exact pages.
  • Slide summarization for compliance: List authors, titles, and affiliations from multi-slide decks, with grounding for each value.
  • Webpage capture: Extract product specifications from screenshots, preserving the field structure and locations.
  • Quality control: Flag unanswerable fields explicitly (null leaves) to avoid hallucinations in automated pipelines.
  • Workflow routing: Use schema-flexible extraction to populate different downstream APIs that expect different JSON shapes.
  • Analytics preparation: Aggregate multi-entity extractions (e.g., all line items) for dashboards, keeping provenance for traceability.
  • RPA integration: Drive robotic process automation steps using structured outputs that include where to click (approximate boxes).
  • Dataset curation: Generate and validate synthetic but realistic QAs to extend internal testing for new document types.
#structured information extraction · #document understanding · #vision-language models · #schema adaptation · #on-demand extraction · #visual question answering · #grounding · #bounding box localization · #ANLS · #tree edit distance · #schema mapping · #synthetic data generation · #VRDU · #zero-shot extraction · #document VQA