OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
Key Summary
- OCRVerse is a new AI model that can read both plain text in documents and the visual structures in charts, webpages, and science plots, all in one system.
- It learns in two stages: first by supervised fine-tuning (SFT) on a big mixed dataset to get general skills, then by reinforcement learning (RL) with custom rewards to get really good at each type of task.
- On the tough OmniDocBench v1.5 document test, OCRVerse scores 89.23 overall, beating many much larger general models like Gemini-2.5 Pro (88.03) and Qwen2.5-VL-72B (87.02).
- It handles formulas well (87.13 CDM) and does solidly on tables (85.77 TEDS), though the team notes tables and reading order still have room to improve.
- For vision-heavy tasks, it turns images into code: charts to Python, webpages to HTML, and icons to SVG, with results close to or better than some very large models despite being only 4B parameters.
- On ChartMimic it achieves strong low-level quality (72.2), on UniSVG it ranks near the top (76.3), and on Image2LaTeX-plot it leads with 63.1 EMS and 88.7% rendering success.
- The secret sauce is personalized rewards during RL, like format checks for tables and image-similarity checks for charts, so the model knows exactly what "good" looks like in each domain.
- A carefully built dataset covers nine text-centric types (like books, slides, and exam papers) and six vision-centric types (like charts, webpages, circuits, and molecules).
- This holistic approach means one lightweight model can power document digitization, data visualization reuse, and web UI understanding across everyday and professional scenarios.
Why This Research Matters
In the real world, information isn't just words; it's tables, charts, webpages, and diagrams. OCRVerse turns static pictures of all these into living, editable text and code you can search, analyze, and reuse. That means faster digitization of school notes and business documents, easier rebuilding of charts for new data, and automatic HTML from webpage screenshots for rapid prototyping. Scientists and students can convert plots and formulas back into LaTeX, and chemists can turn molecular drawings into structured descriptions. Because the model is small yet strong, it's practical to deploy, making advanced document and visual understanding widely accessible.
Detailed Explanation
01 Background & Problem Definition
You know how sometimes you need both words and pictures to really understand a page, like when a school worksheet has a paragraph, a table, and a graph? Old-school OCR (Optical Character Recognition) mostly learned to read just the letters, not the full story that the layout, charts, and diagrams tell. That worked fine for scanning simple text, but it broke down whenever information was locked inside visual structures like bar charts, math plots, webpages, electronics diagrams, or molecular drawings.

Before OCRVerse, there were two big families of tools. Pipeline tools split the job into steps: first find regions like paragraphs or tables, then run specialized readers for each part. These were sturdy but fussy; if step one missed a box, step two couldn't read it, and fixing or fine-tuning each piece was expensive. The newer end-to-end vision-language models (VLMs) could read text directly from images in one go and generalize better, but they sometimes got confused about layout order, repeated lines, or skipped tiny details, especially in dense tables or long math. People tried hybrids: detect the layout with classic OCR, then let a VLM explain the content. This helped a lot for complex documents, but most systems still focused on plain text.

Meanwhile, the internet exploded with visual information: charts for science, webpages for apps, SVG icons for design, TikZ/LaTeX for math figures, and even chemistry diagrams. Reading just the letters wasn't enough. You needed the structure, often best expressed as code: HTML for webpages, Python or plotting libraries for charts, LaTeX/TikZ for scientific graphics, and SVG for icons.

The problem: the research community had lots of separate, single-skill tools. One could read tables; another could turn a chart into code; another could parse a web UI. But there wasn't one small, unified brain that could do both text-centric and vision-centric OCR together. Combining them naively made models argue with themselves; each domain wants different outputs and formats, which can clash during training. What was missing was a way to teach one model to be a great generalist first, then a respectful specialist for each domain, without forgetting what it already knew. Also missing was the data: clean, broad, and balanced coverage of both everyday documents and professional visual content, plus a training recipe that tells the model exactly what "good" means for text, for tables, for charts, for HTML, and so on.

Why this matters in real life: imagine digitizing your school notes with proper reading order, recreating a chart you found online so you can tweak the colors and data, translating a screenshot of a webpage into editable HTML, or turning a chemistry diagram into a reusable description. Businesses and researchers need this too: faster document processing, searchable archives, editable scientific visuals, and smarter agents that can understand UIs. Holistic OCR turns static pictures into living, editable knowledge. OCRVerse aims to be that all-in-one reader, keeping the model small (only 4B parameters) but skilled across many tasks.
02 Core Idea
Aha! Moment in one sentence: Teach one small model to first learn the basics across everything, then polish its skills per domain with custom rewards that match each domain's idea of "right." Hook: You know how a librarian reads words, but a designer reads layouts and shapes too? To truly understand a page, you need both. The Concept (Holistic OCR): It's one system that reads plain text and also interprets visual structures by generating code that recreates them. How it works:
- Learn from mixed data that includes documents and visual composites.
- Output either text (for paragraphs, formulas) or code (for tables, charts, webpages, SVG).
- Use special rewards so each domain guides the model toward the right format and structure. Why it matters: Without holistic OCR, you need many separate tools, and you miss the meaning hidden in layouts and visuals. Anchor: From a PDF page: it reads paragraphs as text, the table as HTML, and the chart as plotting code, so you can edit all three.
Hook: Imagine a translator who can understand both pictures and words at once. The Concept (Vision-Language Model, VLM): A VLM is a model that connects what it sees (images) with what it can say or write (text/code). How it works:
- A vision encoder turns images into features.
- A language model turns features into text or code.
- Training pairs images with the correct outputs. Why it matters: Without a VLM, the model can't tie visual clues to the right words or code. Anchor: Given a screenshot of a webpage, the VLM outputs HTML that matches the layout.
Hook: Think of school: you first learn the basics in class, then practice with feedback to get sharper. The Concept (SFT-RL Multi-Domain Training): First, supervised fine-tuning (SFT) teaches general skills across domains; then reinforcement learning (RL) uses rewards tailored to each domain to perfect the skills. How it works:
- SFT mixes all domains so the model learns formats and patterns.
- RL gives domain-specific rewards (e.g., table structure, chart visual match).
- The model adjusts to do better per domain without forgetting others. Why it matters: Without this two-step recipe, the model either stays generic and sloppy or becomes great at one thing but forgets the rest. Anchor: After SFT it can read most pages; after RL it nails tables, charts, and complex plots.
Hook: You know how a video game gives you points when you do the right thing? The Concept (Reinforcement Learning, RL): RL trains by giving rewards for good outputs so the model learns strategies that score higher. How it works:
- The model tries answers.
- A reward function scores each answer.
- The model updates to prefer higher-scoring answers. Why it matters: Without rewards, the model doesn't know what success looks like in each domain. Anchor: If the HTML has the right tags in the right order, the model gets a bigger reward and learns to do that more often.
Hook: Picture a teacher using different rubrics for essays, lab reports, and art projects. The Concept (Custom Reward Strategies): Each domain uses its own scorecard: for text, formulas, tables, charts, webpages, and SVGs. How it works:
- Text: reward by how close the letters match (edit distance).
- Formulas: reward structure-aware similarity (e.g., CDM/BLEU on normalized LaTeX).
- Tables and code: reward structure/format correctness or visual similarity when rendered. Why it matters: Without domain-specific rewards, one size fits none and training signals conflict. Anchor: A chart-to-code output gets rewarded for looking like the original chart, not for having fancy but wrong code.
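To make the text-centric scorecard concrete, here is a minimal sketch of an edit-distance reward of the kind described above. The exact scaling OCRVerse uses is not given here, so normalizing the score to a 0–1 range is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

def text_reward(prediction: str, reference: str) -> float:
    """Reward in [0, 1]: 1.0 for an exact match, lower as edits pile up.
    The length normalization is an assumption, not the paper's formula."""
    if not reference and not prediction:
        return 1.0
    dist = levenshtein(prediction, reference)
    return max(0.0, 1.0 - dist / max(len(reference), len(prediction)))

# A near-perfect transcription scores close to 1.
print(text_reward("Holistic OCR turns images into text.",
                  "Holistic OCR turns images into text"))
```

Formula and table rewards follow the same spirit but swap in structure-aware comparisons (normalized LaTeX with BLEU/CDM, TEDS for tables) instead of raw character edits.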
Hook: Imagine checking a drawing by comparing it side-by-side to the original. The Concept (Visual Fidelity Rewards): For vision-centric tasks, the model's code is rendered into an image and compared to the target using image features. How it works:
- Render predicted code to an image.
- Extract visual features (global and local).
- Reward higher similarity to the ground truth. Why it matters: Without visual checks, code can be correct-looking text but render the wrong picture. Anchor: If a bar chart's bars are in the wrong order, the rendered image won't match and the reward will drop.
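A hedged sketch of this render-and-compare loop, assuming a generic renderer and a placeholder feature extractor (neither is the paper's actual component):

```python
import numpy as np

def extract_features(image: np.ndarray) -> np.ndarray:
    """Placeholder for a visual feature extractor (in practice a pretrained
    vision encoder). Here: a flattened, normalized downsample so the sketch
    stays self-contained."""
    small = image[::8, ::8].astype(np.float32).ravel()
    return small / (np.linalg.norm(small) + 1e-8)

def visual_fidelity_reward(render_fn, predicted_code: str,
                           target_image: np.ndarray) -> float:
    """Render the predicted code, then reward similarity between rendered
    and target features. Code that fails to render earns zero reward."""
    try:
        rendered = render_fn(predicted_code)  # e.g., a Matplotlib/SVG/HTML renderer
    except Exception:
        return 0.0  # unrenderable code gets no credit
    # Crop both images to a common size so the features are comparable.
    h = min(rendered.shape[0], target_image.shape[0])
    w = min(rendered.shape[1], target_image.shape[1])
    f_pred = extract_features(rendered[:h, :w])
    f_true = extract_features(target_image[:h, :w])
    return float(np.dot(f_pred, f_true))  # cosine-style similarity
```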
Hook: Think of mixing ingredients from many cuisines to learn what flavors go together. The Concept (Cross-Domain Knowledge Fusion): The model learns shared patterns from many data types so skills transfer across tasks. How it works:
- Mix data from documents, charts, webpages, and more in SFT.
- Learn common concepts (titles, legends, grids, boxes, reading order).
- Reuse those concepts when facing new layouts. Why it matters: Without fusion, skills stay siloed and don't help each other. Anchor: Learning grid alignment in tables helps with chart axes and webpage columns.
Hook: A tidy library helps you find the right book fast. The Concept (Data Engineering for OCR): It's the careful collecting, cleaning, and labeling of diverse text and visual data for training. How it works:
- Gather text-centric (books, slides, exam papers) and vision-centric (charts, webpages, SVG, circuits, molecules).
- Clean and normalize formats; fix bad labels; generate structured annotations.
- Use synthetic and self-annotated data to fill gaps. Why it matters: Without high-quality, balanced data, the model learns the wrong lessons or misses entire skills. Anchor: Rendering webpage screenshots paired with correct HTML teaches the model exactly how visuals map to code.
03 Methodology
At a high level: Image → Vision encoder + prompt → Language model → Output as text (for paragraphs/formulas) or code (for tables/HTML/SVG/plots).
Stage 1: Supervised Fine-Tuning (SFT)
What happens: OCRVerse (built on Qwen3-VL-4B) is trained on a big, mixed dataset that includes nine text-centric types (e.g., books, magazines, notes, slides, exam papers) and six vision-centric types (e.g., charts, webpages, SVG icons, geometry, circuits, molecules). The visual encoder and adapter are kept frozen to preserve strong image understanding, while the language model learns to produce the right outputs and formats. Why this step exists: Mixing domains teaches the model shared visual-language patterns and all required output styles in one brain. Without it, the model would specialize too early and fail to generalize. Example: Given a PDF page with a table and a formula, the model learns to output the body text verbatim, convert the table into HTML with correct rows/columns, and convert the formula into LaTeX. How SFT is fed with data:
- Text-centric pipeline: collect open-source sets (e.g., street signs, documents, handwriting), real-world PDFs, and synthetic exam/math content; clean issues like missing content or wrong reading order; and generate annotations using advanced OCR/VLM tools. This includes converting tables to HTML and formulas to LaTeX.
- Vision-centric pipeline: gather chart-to-code, webpage-to-HTML, image-to-SVG, LaTeX/TikZ diagrams, and chemistry data; clean corrupted or incomplete samples; and expand coverage via self-annotation (bootstrap a small model per domain to label more data). Why this data design matters: The model must see both words and structures. Without structured targets (like HTML, SVG, LaTeX), it cannot learn to generate faithful, editable code.
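As an illustration of how such mixed-domain SFT samples might be packaged, here is a hypothetical schema; the field names and the tiny targets are invented for readability, and only the image-prompt-target pairing and the output formats come from the description above.

```python
# Hypothetical schema for mixed-domain SFT samples: each pairs an image
# with a prompt and the target output in that domain's format.
sft_samples = [
    {   # text-centric: full-page recognition, tables as HTML, formulas as LaTeX
        "image": "data/docs/exam_page_001.png",
        "prompt": "Please recognize all the text.",
        "target": "Question 1. Solve for x ...\n"
                  "<table><tr><td>x</td><td>y</td></tr></table>\n"
                  "\\frac{a}{b} = c",
    },
    {   # vision-centric: chart-to-code
        "image": "data/charts/bar_chart_042.png",
        "prompt": "Please generate code to recreate the chart.",
        "target": "import matplotlib.pyplot as plt\n"
                  "plt.bar(['A', 'B'], [3, 5])\nplt.show()",
    },
    {   # vision-centric: webpage-to-HTML
        "image": "data/web/screenshot_017.png",
        "prompt": "Generate the HTML layout.",
        "target": "<html><body><nav>...</nav><main>...</main></body></html>",
    },
]

# During SFT these domains are mixed in one training stream so the language
# model learns every output format while the vision encoder stays frozen.
```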
Stage 2: Reinforcement Learning (RL) with Personalized Rewards
What happens: After SFT builds general knowledge, RL focuses on hard, format-heavy cases per domain using custom rewards. The model samples possible outputs; a reward function scores how good they are; policy optimization nudges the model toward higher-reward outputs. Why this step exists: Different domains define "correct" differently. A great chart answer might be visually perfect but use different code; a great table answer must have precise row/column structure. Without domain-specific rewards, training signals can conflict and the model plateaus. Example (text-centric):
- Plain text uses a reward based on edit distance (closer character match → higher reward).
- Formulas use normalized LaTeX comparison with structure-aware metrics (e.g., BLEU/CDM) to avoid penalizing harmless formatting differences.
- Tables use structure-based similarity (TEDS/variants) after normalizing headers and spans. Example (vision-centric):
- Render the predicted code (e.g., chart Python, SVG, HTML) and compare the image to the ground truth using robust image features. Combine a global score (whole image) with local scores (patches) so small mistakes get noticed. Why render-and-compare: If the code "looks right" when rendered, users can reuse it. Without visual fidelity checks, the model might produce code that seems reasonable but draws the wrong picture.
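Building on the earlier render-and-compare sketch, this is one way to mix a whole-image score with patch-level scores; the equal weighting below is an assumption, not the paper's reported setting.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def split_patches(img: np.ndarray, grid: int = 4):
    """Cut an HxW(xC) image into a grid x grid set of patches."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [img[i*h:(i+1)*h, j*w:(j+1)*w]
            for i in range(grid) for j in range(grid)]

def render_reward(pred_img: np.ndarray, gt_img: np.ndarray,
                  feat_fn, w_global: float = 0.5) -> float:
    """Global score catches overall layout; local patch scores catch small
    errors (a misplaced legend, a wrong bar) that a global average would
    wash out. feat_fn maps any image crop to a fixed-size feature vector."""
    global_score = cosine(feat_fn(pred_img), feat_fn(gt_img))
    local_scores = [cosine(feat_fn(p), feat_fn(g))
                    for p, g in zip(split_patches(pred_img), split_patches(gt_img))]
    return w_global * global_score + (1 - w_global) * float(np.mean(local_scores))
```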
RL Data Curation
What happens: For text-centric cases, choose challenging, high-uncertainty pages (e.g., dense tables, long formulas) so RL time is spent where it matters. For vision-centric cases, ensure diverse types and clean format constraints so rendered comparisons are reliable. Why this step exists: RL is expensive; aim it at the biggest pain points. Without careful selection, you waste steps on easy wins. Example: Pick tables with row/column spans and formulas with multi-line derivations; pick charts with legends, multiple series, and tricky colors.
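One plausible way to operationalize "challenging, high-uncertainty" selection, sketched under the assumption that we can sample a few outputs per page from the SFT checkpoint and score them with the same reward functions (the authors' exact criterion may differ):

```python
import statistics

def select_rl_pages(pages, sample_fn, reward_fn, k: int = 4,
                    mean_cap: float = 0.8, var_floor: float = 0.01):
    """Keep pages whose k sampled outputs score poorly or inconsistently,
    so RL compute is spent where the SFT model is still uncertain.
    `page.ground_truth` is a hypothetical attribute of each sample."""
    selected = []
    for page in pages:
        rewards = [reward_fn(sample_fn(page), page.ground_truth) for _ in range(k)]
        if statistics.mean(rewards) < mean_cap or statistics.pvariance(rewards) > var_floor:
            selected.append(page)
    return selected
```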
Policy Optimization (intuitive view)
What happens: For each input, the model proposes several outputs, each of which gets a reward. The model then shifts probability toward the better ones while keeping training stable with standard clipping tricks. Why this step exists: Sampling multiple attempts gives context (what is better or worse for this exact input), so the model learns faster. Without stability controls, RL can swing wildly and forget skills. Example: For a webpage screenshot, one HTML guess places the nav bar correctly and gets a higher reward; the model learns to favor that structure next time.
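A minimal sketch of this sample-score-update step in the style of group-relative policy optimization with ratio clipping; treat it as an illustration of the idea, not as OCRVerse's exact objective or hyperparameters.

```python
import torch

def group_relative_loss(logprobs_new: torch.Tensor,
                        logprobs_old: torch.Tensor,
                        rewards: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """All tensors have shape [G] for G sampled outputs of one input.
    Advantage = reward standardized within the group; the clipped ratio
    keeps any single update from swinging the policy too far."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logprobs_new - logprobs_old)           # pi_new / pi_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()             # minimize the negative objective
```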
The Secret Sauce
- Unified SFT builds a shared mental map of how visuals relate to text and code.
- Personalized RL rewards tell the model exactly what "right" means per domain (characters for text, structure for tables, visual match for charts/SVG/HTML).
- Lightweight backbone (4B) plus high-quality data and rewards deliver big-model performance without big-model cost.
Concrete 3-sample walkthrough
- Document page with paragraphs, a formula, and a table:
- Input: page image + prompt "Please recognize all the text."
- Output: paragraphs as plain text, formula as normalized LaTeX, table as HTML with correct spans.
- Why it works: SFT learned formats; RL rewards refined structure fidelity.
- Chart screenshot:
- Input: chart image + prompt "Please generate code to recreate the chart."
- Output: plotting code (e.g., Python/Matplotlib) that, when run, draws a visually matching chart (see the illustrative snippet after this walkthrough).
- Why it works: Visual fidelity rewards punish wrong bar order, colors, or legend mapping.
- Webpage screenshot:
- Input: screenshot + prompt "Generate the HTML layout."
- Output: valid HTML structure with correct nesting and sections.
- Why it works: Format alignment and visual comparison encourage faithful layout reconstruction.
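For the chart walkthrough above, the expected output is ordinary plotting code. The snippet below is purely illustrative of what such an answer looks like; the data, labels, and colors are invented, not drawn from any benchmark sample.

```python
import matplotlib.pyplot as plt

# Illustrative chart-to-code answer: bar order, colors, labels, and legend
# all matter, because the render-and-compare reward checks the drawn image.
categories = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 150, 170]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(categories, revenue, color="#4c72b0", label="Revenue")
ax.set_title("Quarterly Revenue")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue (M$)")
ax.legend()
fig.tight_layout()
plt.show()
```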
04 Experiments & Results
The Test: The team measured both traditional document reading (text-centric) and visual-to-code understanding (vision-centric). For documents, they used OmniDocBench v1.5, which mixes languages and layouts and checks text, formulas, and tables rigorously. For vision-centric tasks, they used benchmarks that require generating code that recreates what you see: ChartMimic (charts), Design2Code (webpages), UniSVG (vector graphics), Image2LaTeX-plot (scientific plots), and ChemDraw (molecules).
The Competition: OCRVerse (4B parameters) was compared to specialized OCR systems (e.g., Deepseek-OCR, dots.ocr, HunyuanOCR), large open-source VLMs (Qwen, InternVL series), and strong closed-source models (Gemini-2.5 Pro, GPT-5). Think of OCRVerse as a talented 9th grader competing against college students: it's smaller but trained very smartly. The Scoreboard (text-centric):
- Overall on OmniDocBench v1.5: 89.23. That's like scoring an A when many big general models scored B+ to A- (Gemini-2.5 Pro: 88.03; Qwen2.5-VL-72B: 87.02).
- Formulas (CDM): 87.13, stronger than several larger end-to-end models, showing that the formula-specific data and rewards paid off.
- Text edit distance: 0.052; Reading order: 0.068. Competitive, but with noted room for layout-aware gains.
- Tables (TEDS): 85.77, solid but behind the very best table-focused models, highlighting a future focus area. Why this matters: It shows a small model can rival or beat much larger general VLMs on serious document tasks when trained with the right data and rewards. The Scoreboard (vision-centric):
- ChartMimic: OCRVerse achieves strong low-level quality (72.2) and high execution success, outperforming many similar or larger open models.
- UniSVG-ISVGEN: 76.3 composite, near the top, suggesting excellent semantic fidelity in generated SVGs.
- Design2Code: High-level 87.4, indicating reliable web layout reconstruction.
- Image2LaTeX-plot: 88.7% rendering success and 63.1 EMS, clearly leading, like getting the highest grade in the class where most others struggle.
- ChemDraw: 89.1% execution success and 54.7 Tanimoto similarity, approaching or surpassing strong baselines in similarity while being slightly lower in execution success than the very largest proprietary models. Context for the numbers: "Execution success" means the generated code actually runs; "visual fidelity" means the output image looks like the original; "structure scores" (e.g., TEDS) measure whether the bones of the layout are correct. OCRVerse keeps these all high at once, a rare and valuable combo. Surprising Findings:
- Parameter efficiency: With only 4B parameters, OCRVerse keeps up with or beats some 32B–72B models on several tasks. That's like lifting the same weight with a smaller, better-trained muscle.
- Plot understanding: The large gains on Image2LaTeX-plot suggest the visual-fidelity reward plus domain coverage really help with tricky scientific figures.
- Balanced skills: Many models are great at either text or visuals; OCRVerse holds its own in both, evidence that holistic SFT plus custom RL resolves cross-domain conflicts. Takeaway: By telling the model exactly how to be right in each domain and giving it the right practice data, you can get big-model results with a small, practical system.
05 Discussion & Limitations
Limitations:
- Layout awareness: While competitive, reading order and fine-grained spatial logic still lag the most layout-aware systems. Explicit region-level conditioning could help.
- Tables: Complex tables with multi-row/column spans and nested headers remain challenging. More diverse training data and stronger structure constraints could raise TEDS.
- Code style vs. visual match: Different code can draw the same picture. Visual rewards help, but ensuring readable, standardized code across domains is still an open problem.
- Rendering/toolchain dependence: Vision-centric evaluation depends on rendering environments and packages (e.g., LaTeX/TikZ). Mismatches can penalize good logic. Required Resources:
- A modern GPU for training or fine-tuning; CPU is fine for small-scale inference, but fast batch rendering (for RL rewards) benefits from GPUs.
- Rendering stacks for HTML, SVG, plots, and LaTeX to compute visual-fidelity rewards and to run evaluations robustly. When NOT to Use:
- Ultra-high-precision archival OCR where regulatory-grade, 100% exact layout recreation is legally required; specialized pipelines with human verification may be safer.
- Edge devices with tiny memory and no rendering capability; use a distilled or task-specific model instead.
- Domains with exotic or proprietary formats the model hasn't seen, unless you can supply examples and rewards for fine-tuning. Open Questions:
- Can we add explicit layout tokens or region prompts to improve reading order without bloating the model?
- What's the best universal visual reward that's robust across resolutions, styles, and renderers?
- How do we guarantee not just visual match but semantically clean, standardized code across domains?
- Can we compress further (e.g., <2B parameters) while retaining holistic performance via better data and rewards?
06 Conclusion & Future Work
Three-sentence summary: OCRVerse is a small but mighty model that reads both text and visual structures by outputting either clean text or the code that recreates charts, webpages, and scientific figures. It learns broadly with mixed supervised data, then sharpens per domain using custom rewards that define exactly what "correct" means for text, tables, formulas, charts, HTML, and SVG. The result is competitive performance across both document OCR and vision-to-code tasks, rivaling much larger models while staying lightweight.
Main achievement: A practical, end-to-end holistic OCR system, unifying character-level reading with code-level reconstruction, trained via a two-stage SFT→RL recipe that resolves cross-domain conflicts using personalized rewards.
Future directions: Add stronger layout-aware cues to boost reading order and complex tables; expand table diversity and scientific figure coverage; standardize code style rewards; and explore even more efficient backbones and distillation. Broader agent integration (e.g., tools that read, edit, and test outputs) could close the loop for real-world workflows.
Why remember this: OCRVerse shows that with the right data and the right rewards, one compact model can understand both the words and the drawings on a page, turning flat images into living, editable knowledge you can search, tweak, and reuse.
Practical Applications
- Digitize mixed documents by extracting text, converting tables to HTML, and formulas to LaTeX in one pass.
- Recreate online charts from screenshots into editable plotting code for data updates and style changes.
- Convert webpage screenshots into HTML scaffolds to speed up front-end prototyping.
- Turn icon images into SVG code for clean, scalable graphics in design systems.
- Parse geometry or scientific diagrams into LaTeX/TikZ for easy editing in research papers.
- Translate chemistry drawings into structured code/notations for database storage and search.
- Automate processing of exam papers, preserving reading order and complex math formatting.
- Build AI agents that can read a UI screenshot and generate code to interact with it.
- Create searchable archives of PDFs by extracting structured content rather than just plain text.
- Fine-tune OCRVerse on company-specific templates (reports, forms) using domain-tailored rewards.