GutenOCR: A Grounded Vision-Language Front-End for Documents
Key Summary
- GutenOCR turns a general vision-language model into a single, smart OCR front-end that can read, find, and point to text on a page using simple prompts.
- It returns both the words and the exact places (bounding boxes) where those words appear, so humans and software can verify results quickly.
- A single checkpoint supports four tasks: full-page reading, full-page detection, localized reading inside a box, and string-based search (conditional detection).
- On 10.5K business and science pages, the 7B model more than doubles its backbone's overall grounded OCR score (0.40 → 0.82).
- On the Fox benchmark, GutenOCR greatly improves region- and line-level OCR but loses some page-level order and color-guided OCR ability.
- On OmniDocBench v1.5, GutenOCR boosts text-detection recall dramatically (from ≈0.02 to 0.55–0.62) but gets slightly worse at reading formulas.
- A layout-aware "text2d" output preserves spacing and line breaks so downstream tools can keep columns and gaps aligned.
- The training recipe uses only public data and a length curriculum to handle long pages, without changing tokenizers or adding adapters.
- The unified, prompt-based API makes it easy to plug into business workflows and human-in-the-loop review.
- Trade-offs remain: color-guided cues and formula-heavy layouts need extra training to match the model's strengths in grounded reading and detection.
Why This Research Matters
Documents drive business, science, law, and everyday life, and much of that information lives in scans or images. GutenOCR makes OCR not just accurate but verifiable, linking each word to its exact spot so people and systems can trust the results. This grounding speeds up human review: missing text shows up as gaps in box coverage, and hallucinations appear as boxes over empty areas. It also powers precise automation (read only the box you care about, or find every line that matches a key phrase) without rewriting entire pipelines. Because the interface is prompt-based and unified, teams can plug it into existing workflows and RAG stacks more easily. In short, GutenOCR turns OCR into a reliable, controllable front door for downstream AI systems.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you're reading a busy newspaper page with columns, photos, and captions. You don't just copy letters; you also keep track of where each sentence is on the page so you don't get lost.
🥬 The Concept (Vision-Language Model, VLM): A VLM is a model that understands pictures and words together in one brain. How it works:
- It looks at the page image and turns it into visual features.
- It reads or writes text tokens while paying attention to those visuals.
- It connects what it sees (pixels) with what it says (words). Why it matters: Without a VLM, OCR either breaks complex layouts or treats text as if it floats without location, making it hard to trust or verify. 🍞 Anchor: When you ask for the title on a scanned report, a VLM can find the words and their spot near the top center.
Before: Classical OCR pipelines split the job into detection (find lines), recognition (read text), and layout analysis (order the text). They were good at pointing to boxes but could be brittle on complex pages, and adapting them to new tasks took a lot of engineering. New "OCR-free" vision-language systems learned to read directly from the pixels and did great on many benchmarks, but they often treated text location as a hidden secret. You got words out, but not always where they came from.
Problem: Real-world systems, like filing insurance claims or searching scientific PDFs, need more than just words. They need to know where those words came from (grounding), to read only in certain places (localized reading), and to search for a phrase and highlight all its spots (conditional detection). If the OCR can't do these, downstream tools can't trust or correct mistakes easily. Human review becomes slow because you must compare the whole page to a long transcript.
🍞 Hook: You know how your teacher asks you "Where did you find that answer in the book?" and you point to the exact sentence? That pointing is crucial.
🥬 The Concept (Grounded OCR): Grounded OCR reads the page and links each token or span to its exact 2D location. How it works:
- It finds text regions on the image.
- It reads the text of each region.
- It saves both text and its bounding box coordinates. Why it matters: Without grounding, you can't quickly check where text came from, so mistakes hide more easily. 🍞 Anchor: If the OCR says "Total: $125.00," grounded OCR also shows the small box around that line in the bottom-right of the invoice.
Failed attempts: End-to-end page-to-Markdown models can produce beautiful transcripts, but they often expose unstable or limited token-to-box alignment. They may do well when the target output format exactly matches the evaluation, yet they struggle when a workflow needs fine control: change the reading order, re-read a specific area, or return only boxes for lines matching "Invoice #". Classical pipelines had APIs for these parts, but struggled with diverse page designs.
Gap: The world needed a single, modern model that behaves like a classical OCR pipeline (clear boxes, controllable reading, reliable search) while keeping the flexibility of VLMs. In short, an API-like OCR front-end that downstream systems can plug into.
Real stakes:
- Business: Miss a number on a tax form and the automated process fails or pays the wrong amount.
- Science: If equations or figure captions are lost, research search tools miss key evidence.
- Safety and auditing: Without locations (boxes), it's hard to prove where an answer came from.
- Human-in-the-loop: With grounding, reviewers can spot hallucinations (boxes over empty space) or gaps (no boxes over visible text) fast.
🍞 Hook: Think of a librarian who not only hands you a page's text but also sticks a tiny flag where every sentence lives.
🥬 The Concept (Bounding Box): A bounding box is a rectangle, defined by its top-left and bottom-right pixel coordinates, that marks where text appears. How it works:
- Measure x1,y1 (top-left) and x2,y2 (bottom-right).
- Clip to page edges and ensure width/height are positive.
- Attach the box to its text so you can jump back to the right spot. Why it matters: Without boxes, you can't control or verify what was actually read on the page. 🍞 Anchor: The line "Subtotal" gets saved with a box like [310, 540, 690, 565], pinpointing its exact stripe on the scan.
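To make this concrete, here is a tiny Python sketch of box clipping and validation; the function name and the page size in the example are made up for illustration and are not GutenOCR's actual code.

```python
def clip_box(box, page_width, page_height):
    """Clip an [x1, y1, x2, y2] pixel box to the page and reject degenerate boxes."""
    x1, y1, x2, y2 = box
    # Keep every coordinate inside the page bounds.
    x1, x2 = max(0, min(x1, page_width)), max(0, min(x2, page_width))
    y1, y2 = max(0, min(y1, page_height)), max(0, min(y2, page_height))
    # A usable box must have positive width and height.
    if x2 <= x1 or y2 <= y1:
        return None
    return [x1, y1, x2, y2]

# The "Subtotal" box from the anchor above, on a hypothetical 1000x1400 px page.
print(clip_box([310, 540, 690, 565], page_width=1000, page_height=1400))
# -> [310, 540, 690, 565]
```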
This paper introduces GutenOCR, a grounded, prompt-driven, single-checkpoint VLM front-end trained on business documents, scientific articles, and synthetic data. It produces page transcripts, line/paragraph JSON with boxes, localized reading inside user boxes, and conditional detection ("where is x?"). It closes the gap by making the model act like an API that controls reading, detection, and grounding with one consistent interface, giving both machines and humans precise, verifiable outputs.
02 Core Idea
Aha! Moment (one sentence): Treat OCR as a grounded, API-like front door built from a single vision-language model that you can steer with prompts to read, detect, and point to text anywhere on the page.
Three analogies:
- Swiss Army Knife: One tool flips out different blades (read the full page, find lines, read just inside a box, or search for a phrase) without switching models.
- Librarian with a Map: The model gives you the quote and marks its page coordinates, so you can check it instantly.
- GPS for Documents: You don't just know the destination text; you get precise directions to its spot on the page.
Before vs After:
- Before: Choose between (a) modular but brittle pipelines with boxes or (b) strong end-to-end readers that hide where text came from and are hard to control.
- After: One VLM that behaves like a classic OCR pipeline via prompts: it can return plain text, layout-preserving text, or structured JSON with boxes; it can read just inside a user box; and it can find all lines matching a query string.
Why it works (intuition, not equations):
- The same VLM learns a handful of simple but powerful "primitives" (read, detect, ground) that cover many OCR tasks.
- Prompt templates act like switches telling the model which primitive to use and what output schema to follow.
- Training on a mix of real pages and grounding-focused synthetic data teaches stable line/paragraph geometry and robust local reading.
- A length curriculum grows the model's ability to handle long pages and long outputs in stages, like practicing short runs before a marathon.
Building blocks, each introduced with a Sandwich:
🍞 Hook: You know how saying "Please list ingredients" or "Just show steps" changes how a recipe is written?
🥬 The Concept (Prompt-Based Interface): A prompt-based interface is a way to tell the same model which task to do and what shape the answer should take. How it works:
- You give the page image plus a short instruction (prompt).
- You specify the output type: plain text, text2d, JSON lines/paragraphs, or just boxes.
- The single model follows the schema and responds accordingly. Why it matters: Without prompts and schemas, you'd need many separate models or ad-hoc formats that are hard to integrate. 🍞 Anchor: "Detect all LINES and return their boxes as JSON." makes the model reply with [[x1,y1,x2,y2], …].
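As a caller-side illustration (not GutenOCR's actual API), the whole interface can be wrapped in a couple of small functions; `run_model` below is a stand-in for whatever inference endpoint serves the checkpoint, and the prompt wording is paraphrased rather than quoted.

```python
import json

def run_model(image_path: str, prompt: str) -> str:
    """Stand-in for the real VLM call; here it just returns a canned reply."""
    return "[[120, 300, 480, 325], [120, 330, 480, 355]]"

def detect_lines(image_path: str):
    # The prompt names the task family (detection) and the output schema (JSON boxes).
    prompt = "Detect all LINES on this page and return their boxes as a JSON array of [x1, y1, x2, y2]."
    reply = run_model(image_path, prompt)
    # Because the schema is fixed, the reply parses directly as JSON.
    return json.loads(reply)

def read_page_text2d(image_path: str) -> str:
    # Same model, different switch: ask for a layout-preserving transcript instead.
    return run_model(image_path, "Read all text on this page and return it as text2d.")

print(detect_lines("invoice.png"))
```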
🍞 Hook: Picture typing text in a fixed-width editor where spaces and blank lines control how things line up.
🥬 The Concept (text2d): text2d is a whitespace-preserving transcript that keeps 2D layout using only spaces and newlines. How it works:
- Order lines in a natural top-down, left-right reading order.
- Place them on a rough grid based on their boxes.
- Use spaces to keep columns aligned and blank lines to show vertical gaps. Why it matters: Without layout-aware text, columns can collapse and tables scramble, confusing downstream tools. 🍞 Anchor: Two columns stay visually separated in text2d because long runs of spaces keep the right column to the right.
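Here is a rough sketch of how line boxes could be rendered into such a whitespace grid; the cell sizes and rounding rules are assumptions, since the exact rendering parameters are not spelled out here.

```python
def render_text2d(lines, cell_w=10, cell_h=25):
    """lines: list of (text, [x1, y1, x2, y2]) pixel boxes.
    Returns a string that approximates the page layout with spaces and newlines."""
    # Map pixel positions onto a coarse character grid, then sort top-down, left-right.
    placed = sorted((y1 // cell_h, x1 // cell_w, text)
                    for text, (x1, y1, x2, y2) in lines)
    rows = {}
    for row, col, text in placed:
        rows.setdefault(row, []).append((col, text))

    out, prev_row = [], -1
    for row in sorted(rows):
        out.extend([""] * (row - prev_row - 1))           # blank lines for vertical gaps
        line = ""
        for col, text in rows[row]:
            pad = max(col - len(line), 1 if line else 0)  # runs of spaces keep columns apart
            line += " " * pad + text
        out.append(line)
        prev_row = row
    return "\n".join(out)

page = [("Left column", [0, 0, 200, 20]), ("Right column", [600, 0, 800, 20]),
        ("Page 7", [700, 100, 780, 120])]
print(render_text2d(page))
```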
🍞 Hook: Imagine using a magnifying glass to read just one square on a busy poster.
🥬 The Concept (Localized Reading): Localized reading returns only the text inside a user-provided box. How it works:
- Take the box [x1,y1,x2,y2] from the user.
- Focus reading on that region only.
- Output just that text, not the whole page. Why it matters: Without local control, you must re-read entire pages, slowing down extraction and reviews. 🍞 Anchor: "Read the text inside [120, 300, 480, 520]" returns only the paragraph in that rectangle.
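For intuition only, this is what localized reading amounts to if you already hold grounded lines (the model itself answers the prompt directly); the overlap threshold below is an assumption.

```python
def overlap_fraction(inner, outer):
    """Fraction of box `inner` covered by box `outer` (both [x1, y1, x2, y2])."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area else 0.0

def read_region(grounded_lines, region, min_overlap=0.5):
    """Keep the text of lines that fall mostly inside the requested region."""
    kept = [text for text, box in grounded_lines
            if overlap_fraction(box, region) >= min_overlap]
    return "\n".join(kept)

lines = [("Shipping Address", [120, 300, 480, 325]),
         ("123 Main St.",     [120, 330, 480, 355]),
         ("Total: $125.00",   [310, 540, 690, 565])]
print(read_region(lines, [120, 300, 480, 520]))  # only the two address lines
```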
🍞 Hook: When you press Ctrl+F for a phrase, you want every place it appears to light up.
🥬 The Concept (Conditional Detection): Conditional detection finds all lines whose text contains a given query. How it works:
- Normalize the page text and the query.
- Check each line for the query substring.
- Return boxes for every matching line; return [] if none. Why it matters: Without this, building search-and-highlight or quick QA evidence tools is clumsy and slow. 🍞 Anchor: Querying "Invoice #" returns all line boxes that include "Invoice #".
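The matching rule is essentially Ctrl+F over grounded lines; in this sketch the lowercase-and-collapse-whitespace normalization is an assumption about how matching could be done, not the model's internal procedure.

```python
import re

def normalize(s: str) -> str:
    # Lowercase and collapse whitespace so small formatting differences do not block matches.
    return re.sub(r"\s+", " ", s.strip().lower())

def conditional_detect(grounded_lines, query):
    """Return the boxes of every line whose text contains the query; [] if none match."""
    q = normalize(query)
    return [box for text, box in grounded_lines if q in normalize(text)]

lines = [("Invoice #2041", [80, 60, 320, 90]),
         ("Due Date: 2026-01-31", [80, 100, 360, 130]),
         ("Invoice # reference on page 2", [80, 700, 520, 730])]
print(conditional_detect(lines, "Invoice #"))   # two matching boxes
print(conditional_detect(lines, "Refund"))      # []
```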
🍞 Hook: Think of outlining sentences with a rectangle to show your friend exactly where to look.
🥬 The Concept (Bounding Box): A bounding box is a rectangle [x1,y1,x2,y2] that marks where text lives on the page. How it works:
- Coordinates are in pixels with origin at top-left.
- Clip coordinates to the page and ensure width/height > 0.
- Attach each box to its recognized text. Why it matters: Without clear boxes, you can't verify, re-read, or attribute results. 🍞 Anchor: The heading "Results" is stored with its box so clicking it zooms right to the title area.
Together, these pieces let GutenOCR act like a thin, reliable front-end: you specify the task family and format in the prompt, and the model returns exactly the structured text and boxes you need for extraction, search, RAG, or human review.
03 Methodology
High-level flow: Input (page image, plus optional query or box) → Prompt specifies task and output schema → Single VLM runs reading/detection/grounding → Output (plain text, text2d, JSON with boxes, or boxes-only).
Step-by-step recipe:
- Define unified tasks and schemas.
- What happens: The model supports four task families: (a) Full-page reading (text, text2d, lines, paragraphs), (b) Full-page detection (boxes only), (c) Localized reading (inside a given box), and (d) Conditional detection (find boxes for a query string).
- Why it exists: A single, predictable interface removes glue code and lets downstream tools switch tasks by changing prompts.
- Example: "Detect all LINES and return a JSON array of [x1,y1,x2,y2]" or "Read all text as text2d".
- Grounded outputs with clean geometry.
- What happens: Output with boxes uses integer pixel coordinates [x1,y1,x2,y2], clipped to the page, no rotated boxes; text+box outputs use JSON arrays of {"text", "bbox"}.
- Why it exists: Stable, simple geometry is easy to parse, score, and display for humans.
- Example: {"text": "Subtotal", "bbox": [310,540,690,565]}.
- Layout-sensitive text2d.
- What happens: The model aggregates line-level boxes, orders them top-to-bottom then left-to-right, and renders them into a notional grid so spaces and blank lines keep alignment and vertical gaps.
- Why it exists: It preserves 2D cues using just a string, handy for search, diffing, and lightweight UIs.
- Example: A right-justified page number stays on the far right thanks to inserted spaces.
- Data curation: real + synthetic.
- What happens: Training mixes real business pages (OCR-IDL, TabMe++) and scientific articles (PubMed-OCR) with synthetic data targeting grounding (SynthDoG with line boxes; Grounded LaTeX for equations).
- Why it exists: Real data covers noise and domain variety; synthetic data teaches precise line-level geometry and localized math.
- Example: A synthetic page with many short lines provides dense supervision for line detection.
- Prompt templates and phrasing variation.
- What happens: Each (task, input, output) has multiple templates; the model sees variations like "this/that/attached page" and "return boxes as JSON".
- Why it exists: Improves robustness to small wording changes in production prompts.
- Example: "Where is 'Invoice #' in this document?" vs "Find occurrences of 'Invoice #' and return boxes."
- Backbone and training setup.
- What happens: Fine-tune Qwen2.5-VL-3B/7B with all modules trainable, long context (≈2k–16k tokens), and greedy decoding for stable outputs.
- Why it exists: Keeps the model general and scalable without adapters/tokenizer changes and supports very long page transcripts.
- Example: PubMed-OCR pages can exceed 8k tokens; stage 3b handles 8k–16k.
- Length curriculum over stages.
🍞 Hook: Like learning to run farther each week: start with short jogs, then 5K, then 10K.
🥬 The Concept (Curriculum Training): Curriculum training increases difficulty over time, in this case longer sequences and more complex layouts. How it works:
- Stage 1 (<2k tokens): mixed synthetic + real, core reading/detection grounding.
- Stage 2 (2k–8k): real pages only, emphasize structured JSON.
- Stage 3a (2k–8k): add PubMed-OCR, refine paragraphs and long-form handling.
- Stage 3b (8k–16k): specialize on very long scientific pages. Why it matters: Without a curriculum, the model may struggle with long transcripts and complex layouts, leading to unstable outputs. 🍞 Anchor: Just like practicing shorter piano pieces before a concerto, the model masters short pages before long multi-column papers. (A configuration sketch of this schedule appears at the end of this recipe.)
- Task routing at inference.
- What happens: Routing is prompt-only: the same checkpoint runs all tasks; no external heads or routers.
- Why it exists: Simplicity; one model to deploy and maintain, consistent behavior across tasks.
- Example: Switching from full-page reading to conditional detection is a one-line prompt change.
- Localized reading and conditional detection details.
🍞 Hook: Using a highlighter (a box) or a search term to jump straight to the part you care about.
🥬 The Concept (Localized Reading): Read text only inside a user-specified box. How it works:
- Intersect the box with line/paragraph boxes.
- Concatenate overlapping transcripts.
- Output plain text for that region. Why it matters: Without it, reviewers must scan entire pages to check a small field. 🍞 Anchor: A user drags a box around "Shipping Address" to fetch only that block.
🍞 Hook: Pressing Ctrl+F, but for scanned images, and getting highlight boxes instead of just matches.
🥬 The Concept (Conditional Detection): Given a query, return boxes for every matching line. How it works:
- Normalize text and query.
- Check if the line contains the query.
- Return all matching line boxes; [] if none. Why it matters: Enables fast evidence finding and clickable highlights in review tools. 🍞 Anchor: Query "Due Date" returns boxes for all lines that include that phrase across multi-page statements.
- Secret sauce.
- Unified API surface: Stable schemas (text, text2d, lines/paragraphs JSON, boxes) make the model act like an OCR front-end rather than a single-task parser.
- Grounding-first bias: Training emphasizes finding and boxing lines reliably, so humans can quickly see misses/hallucinations.
- Length curriculum + prompt variation: Boosts robustness for long, messy documents and real-world prompt phrasing.
- One checkpoint: Easier deployment, consistent behavior, and transferable skills (e.g., region training helping line pointers on Fox).
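For reference, the length curriculum described in the recipe above can be written down as plain configuration; the field names and the exact data mix per stage (especially 3b) are illustrative guesses, only the token ranges and stage focus come from the description.

```python
# Stage schedule from the curriculum step above, expressed as a simple table.
CURRICULUM = [
    {"stage": "1",  "tokens": (0, 2_000),      "data": ["synthetic", "real"],
     "focus": "core reading, detection, grounding"},
    {"stage": "2",  "tokens": (2_000, 8_000),  "data": ["real"],
     "focus": "structured JSON outputs"},
    {"stage": "3a", "tokens": (2_000, 8_000),  "data": ["real", "PubMed-OCR"],
     "focus": "paragraphs and long-form handling"},
    {"stage": "3b", "tokens": (8_000, 16_000), "data": ["PubMed-OCR"],
     "focus": "very long scientific pages"},
]

for s in CURRICULUM:
    lo, hi = s["tokens"]
    print(f"stage {s['stage']}: {lo}-{hi} target tokens, focus: {s['focus']}")
```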
04 Experiments & Results
The test: Researchers measured three things (text accuracy, detection quality, and end-to-end fidelity) on held-out in-domain pages (business + science) and on public benchmarks Fox and OmniDocBench v1.5.
🍞 Hook: Think about grading spelling tests by letters and by words.
🥬 The Concept (Character Error Rate, CER): CER measures per-character mistakes in a transcript. How it works:
- Normalize strings (e.g., Unicode, whitespace).
- Compute minimum edits (insert/delete/replace) to match the ground truth.
- Divide by total characters. Why it matters: Without CER, tiny letter mistakes hide; with it, we see precise reading accuracy. 🍞 Anchor: If "Total" is read as "Tctal," that's one character error.
🍞 Hook: Now grade whole words to penalize missing or wrong terms.
🥬 The Concept (Word Error Rate, WER): WER counts mistakes at the word level. How it works:
- Tokenize into words.
- Compute edit distance on words.
- Divide by total words. Why it matters: Without WER, a jumbled word might look better than it is; WER captures bigger reading slips. 🍞 Anchor: Reading "Jan 2026" as "Jan 2025" is one word off.
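Both CER and WER reduce to the same edit-distance computation, applied to characters or to words; this minimal sketch uses a deliberately simplified normalization compared with whatever the evaluation actually applies.

```python
def edit_distance(ref, hyp):
    """Minimum number of single-element edits (delete, insert, substitute) between sequences."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (r != h))  # substitute
    return dp[-1]

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("Total", "Tctal"))        # 0.2: one wrong character out of five
print(wer("Jan 2026", "Jan 2025"))  # 0.5: one wrong word out of two
```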
🍞 Hook: Think of how much two rectangles overlap to see if they point to the same place.
🥬 The Concept (Intersection over Union, IoU): IoU measures how much two boxes overlap. How it works:
- Compute intersection area of two boxes.
- Compute union area.
- IoU = intersection / union. Why it matters: Without IoU, we can't score how well detection boxes match ground truth. 🍞 Anchor: A predicted line box that almost perfectly covers the true line gets IoU close to 1.0.
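IoU translates directly into a few lines of code; boxes are [x1, y1, x2, y2] in pixels, as elsewhere in this write-up, and the example numbers are made up.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A prediction that almost covers the reference line scores close to 1.0.
print(round(iou([310, 540, 690, 565], [312, 541, 688, 566]), 3))  # ~0.914
```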
🍞 Hook: Report card time: precision and recall get blended into one score.
🥬 The Concept (F1@0.5): F1@0.5 is the F1 score when boxes are matched at IoU ≥ 0.5. How it works:
- Match predictions to references if IoU ≥ 0.5.
- Count precision and recall.
- Compute harmonic mean (F1). Why it matters: Without F1@0.5, we can't compare detectors fairly on both misses and false alarms. 🍞 Anchor: A detector that finds most lines with few extras scores a high F1.
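A simplified scorer under one common convention (greedy one-to-one matching at IoU ≥ 0.5); the benchmark's exact matching procedure may differ, so treat this as an illustration only.

```python
def iou(a, b):
    # Compact IoU for [x1, y1, x2, y2] boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union else 0.0

def f1_at_05(predictions, references, thr=0.5):
    """Greedily match each predicted box to one reference box at IoU >= thr, then score."""
    unmatched, tp = list(references), 0
    for p in predictions:
        best = max(unmatched, key=lambda r: iou(p, r), default=None)
        if best is not None and iou(p, best) >= thr:
            tp += 1
            unmatched.remove(best)  # each reference can be matched only once
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(references) if references else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

refs = [[0, 0, 100, 20], [0, 30, 100, 50]]
preds = [[2, 1, 98, 19], [0, 200, 100, 220]]   # one good match, one false alarm
print(round(f1_at_05(preds, refs), 2))         # precision 0.5, recall 0.5 -> F1 0.5
```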
🍞 Hook: A single scoreboard that combines reading and detection quality.
🥬 The Concept (Composite Score): The composite grounded OCR score averages (1 − reading errors) with detection F1 across tasks. How it works:
- Convert reading error ε (e.g., CER) into a score (1 − ε).
- Average with F1 scores across task families.
- Result: a 0–1 score summarizing grounded OCR. Why it matters: Without it, we might miss trade-offs between reading accuracy and grounding quality. 🍞 Anchor: Moving from 0.40 to 0.82 is like boosting from a struggling C to a solid A- across balanced skills.
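The exact aggregation is not reproduced here; the sketch below only illustrates the "convert error to 1 − ε, then average with detection F1" description above, using made-up per-task numbers rather than the paper's results.

```python
def composite_score(task_scores):
    """task_scores: per task family, either {"cer": e} (reading) or {"f1": f} (detection).
    Each entry is mapped to a 0-1 score and the plain mean is taken."""
    per_task = [(1.0 - v["cer"]) if "cer" in v else v["f1"] for v in task_scores.values()]
    return sum(per_task) / len(per_task)

# Illustrative numbers only, not the paper's actual per-task results.
scores = {
    "full_page_reading":     {"cer": 0.10},
    "localized_reading":     {"cer": 0.08},
    "full_page_detection":   {"f1": 0.88},
    "conditional_detection": {"f1": 0.79},
}
print(round(composite_score(scores), 2))  # 0.87 with these made-up inputs
```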
In-domain results (10.5K held-out pages): GutenOCR-7B more than doubles the composite score over its Qwen2.5-VL-7B base (≈0.40 → 0.82). Localized reading errors drop sharply (e.g., cutting CER by roughly 4–6× relative to the backbone), and detection F1 jumps to around 0.79–0.88 depending on task (full vs. conditional). The 3B model also shows large gains, approaching the 7B model while being lighter.
Fox benchmark (page, region, line, color tasks):
- Region OCR: Best-in-class region CER at 3B (~0.053), beating backbones significantly; 7B is also strong (~0.067).
- Line OCR: Large improvements over backbones (down to ~0.211–0.240 CER from ~0.701–0.817), even without explicit line-pointer training.
- Page OCR: Very high Page F1 (reads the right content), but worse Page CER (order-sensitive) than some baselines, reflecting GutenOCR's layout-driven linearization rather than Fox's exact canonical order.
- Color-guided OCR: Major weakness; CER of roughly 0.94–0.963 vs much lower for color-aware baselines. The model often misreads color cues as words, a sign of catastrophic forgetting.
🍞 Hook: When you learn a new trick, sometimes you forget an old one.
🥬 The Concept (Catastrophic Forgetting): Catastrophic forgetting is when fine-tuning for a new skill makes a model lose an older skill. How it works:
- Focused training overwrites useful earlier patterns.
- The model optimizes for new objectives.
- Old niche abilities fade unless protected. Why it matters: Without guarding older skills, specialized training can break helpful behaviors (like color-guided cues). 🍞 Anchor: After OCR-focused fine-tuning, the model forgets how to follow colored pointers and instead tries to read the color names.
OmniDocBench v1.5 (out-of-domain stress test):
- Text recognition (cropped spans): Backbones slightly outperform GutenOCR on CER, especially on colorful backgrounds, likely a domain mismatch.
- Text detection recall: Big win for GutenOCR: recall jumps from ≈0.02 (backbones) to ≈0.55–0.62. This confirms transfer of line-level detection skill.
- Formula recognition: Backbones do better. GutenOCR slightly degrades (more at 3B), showing that math-awareness wasn't prioritized enough.
Surprises:
- Region/line OCR got much stronger, and that skill transferred even to line-pointer tasks the model wasn't directly trained for.
- Page-level order on Fox looked worse by CER but content coverage (Page F1) was excellent, highlighting the difference between "what" and "in what order."
- Color-guided tasks regressed sharply, illuminating a clear fine-tuning trade-off.
Bottom line: GutenOCR reshapes the VLM into a controllable, grounded front-end, excellent at finding and reading the right places, while revealing where extra training (color cues, formulas) is needed.
05 Discussion & Limitations
Limitations:
- Color-guided OCR: The model largely lost the ability to follow color pointers; prompts mentioning colored boxes often got misread as text. This is a prime target for adding color-conditioned supervision.
- Formula-heavy pages: Backbones outperformed fine-tuned models on LaTeX-like math. Improving symbol-level grounding and math-aware data would help.
- Complex structures: Current outputs are line/paragraph-centric; explicit table structure, math layout trees, and cross-page links remain out of scope.
- Reading order vs canonical formats: Layout-preserving text2d can clash with benchmarks expecting a particular page-to-Markdown order.
Required resources:
- Training used 8× H100 GPUs with bf16 and ZeRO-3 memory sharding; long contexts up to 16k tokens require careful batch sizing.
- At inference, a single 3B or 7B checkpoint serves all tasks; greedy decoding ensures stable structured outputs.
When not to use:
- If you need strict, benchmark-specific page-to-Markdown ordering and color-guided cues out of the box, specialized models may fit better.
- If equations are the main content (e.g., math textbooks), the un-fine-tuned backbone or a math-specialized model could be preferable.
- If you need detailed table structure extraction (cells, spanning rules), pair GutenOCR with a downstream table parser.
Open questions:
- How to preserve color-guided skills while specializing for grounded OCR (e.g., multi-task regularization or rehearsal)?
- Whatās the best way to add math-rich supervision so formulas improve without hurting general OCR?
- Can we extend grounded outputs beyond lines/paragraphs to tables, figures, cross-references, and multi-page relations, toward a "document hologram"?
- How do we jointly optimize accuracy, grounding, and latency/throughput for production constraints?
🍞 Hook: Think of reading a report while keeping sticky notes that point to exact evidence.
🥬 The Concept (Reading Order): Reading order is the sequence the model uses to present text from the page. How it works:
- Decide a top-down, left-right order.
- Keep columns and spacing so meaning isn't scrambled.
- Optionally match an external canonical order if required. Why it matters: Without consistent order, downstream tools can misinterpret columns or sections. 🍞 Anchor: Two-column news articles should not get merged left-right-left-right; each column should stay intact in order.
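One common heuristic for a column-aware order is to cluster lines into columns by horizontal overlap and read each column top to bottom; this is a generic sketch, not GutenOCR's actual ordering logic.

```python
def reading_order(lines):
    """lines: list of (text, [x1, y1, x2, y2]).
    Group lines into columns by horizontal overlap, then read each column top to bottom."""
    columns = []  # each column: {"x1": ..., "x2": ..., "lines": [...]}
    for text, box in sorted(lines, key=lambda l: l[1][0]):   # scan left to right
        for col in columns:
            # A line joins a column if its horizontal span overlaps that column.
            if box[0] < col["x2"] and box[2] > col["x1"]:
                col["lines"].append((text, box))
                col["x1"], col["x2"] = min(col["x1"], box[0]), max(col["x2"], box[2])
                break
        else:
            columns.append({"x1": box[0], "x2": box[2], "lines": [(text, box)]})
    ordered = []
    for col in sorted(columns, key=lambda c: c["x1"]):
        ordered += [t for t, b in sorted(col["lines"], key=lambda l: l[1][1])]
    return ordered

page = [("Left para line 1",  [50, 100, 380, 120]),
        ("Right para line 1", [420, 100, 760, 120]),
        ("Left para line 2",  [50, 130, 380, 150]),
        ("Right para line 2", [420, 130, 760, 150])]
print(reading_order(page))
# ['Left para line 1', 'Left para line 2', 'Right para line 1', 'Right para line 2']
```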
Overall, GutenOCR shows that an API-like, grounded OCR front-end can be both practical and high-performing, especially for detection and fine-grained reading, while candidly exposing the trade-offs to fix next.
06 Conclusion & Future Work
Three-sentence summary: GutenOCR is a single vision-language checkpoint that acts like a grounded OCR front-end: it reads, detects, and points to text using a unified, prompt-based interface. It delivers big gains in localized reading and detection across in-domain pages and public benchmarks, while revealing trade-offs in color-guided OCR, page linearization vs canonical order, and formula-heavy content. The training recipe is open, uses public data, and turns general VLMs into controllable, verifiable OCR modules suitable for downstream automation and review.
Main achievement: Recasting OCR as a grounded, API-like front end, combining reading, detection, and grounding in one model with stable schemas (text, text2d, lines/paragraphs JSON, boxes) and strong region/line performance.
Future directions:
- Add color-conditioned grounding and math-rich supervision to recover and surpass baseline color and formula skills.
- Extend grounded interfaces to tables, figures, and cross-page references; measure evidential QA with provenance.
- Optimize for efficiency (latency/throughput) while keeping long-context and grounding quality.
Why remember this: GutenOCR shows that the most valuable OCR is not just "what was read," but "where it came from and how you can control the reading." This grounded, prompt-driven approach is a practical foundation for reliable extraction, searchable archives, and evidence-first RAG, stepping stones toward richer "document holograms" that keep content and its proof tightly linked.
Practical Applications
- Invoice and receipt processing with clickable evidence boxes for totals, dates, and line items.
- Contract review where each extracted clause links back to its exact location for auditing.
- Scientific paper triage: read only figure captions or references by drawing a box or searching key phrases.
- Customer support: quickly highlight every occurrence of an account number across scanned forms.
- Compliance checks that verify fields (e.g., signatures) by localized reading inside target regions.
- RAG pipelines that fetch grounded snippets, preserving layout via text2d for better retrieval.
- Data labeling tools that pre-box lines and let annotators confirm or fix grounding easily.
- Batch QA for OCR: detect gaps where visible text lacks boxes to trigger re-scans or flags.
- Accessibility support: read selected regions on complex flyers or menus without reading the whole page.
- Financial audits: localized reading of specific table regions to confirm totals and subtotals with evidence.