
Typhoon OCR: Open Vision-Language Model For Thai Document Extraction

Beginner
Surapon Nonesung, Natapong Nitarach, Teetouch Jaknamon et al. Ā· 1/21/2026
arXiv Ā· PDF

Key Summary

  • Typhoon OCR is an open, lightweight vision-language model that reads Thai and English documents and returns clean, structured text.
  • It was built because most existing AI readers work well for big languages but struggle with Thai’s special script and complex page layouts.
  • The team created a careful, step-by-step data pipeline that mixes real documents, synthetic pages, and quality checks to teach the model structure and text at the same time.
  • Typhoon OCR can both transcribe text and rebuild layouts like tables, charts, figures, and equations into Markdown, HTML, and LaTeX.
  • Two training styles were first used (Default Mode for simple pages and Structure Mode for complex pages) and later unified in V1.5 to make usage simpler.
  • Despite being small, the models match or beat some larger proprietary systems on Thai financial reports and government forms.
  • V1.5 (2B parameters) is faster, less dependent on PDF metadata, and more robust thanks to better data and training methods.
  • Performance is slightly weaker on visually heavy infographics and very degraded scans, which remain open challenges.
  • The code, models, and datasets are open, enabling reproducible research and practical deployment in Thai-centric workflows.

Why This Research Matters

Reliable, open Thai document extraction removes a major bottleneck in banks, schools, and government offices where paperwork is dense and layout-heavy. A compact 2B model means lower costs and faster responses, enabling on-device or private deployments that respect sensitive data. Structure-aware outputs like HTML tables and LaTeX equations let teams plug results directly into analytics, spreadsheets, and publishing tools. By reducing manual retyping and cleanup, organizations save time and reduce errors that can affect decisions and compliance. The open release empowers local developers and researchers to adapt and extend the system to new Thai domains. Finally, the approach offers a blueprint for bringing high-quality OCR and layout understanding to other low-resource languages.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re trying to read a busy bulletin board full of sticky notes, flyers, and doodles. If you only look at the letters but ignore where things are placed, you’ll miss which note goes with which title or table.

🄬 The Concept (Optical Character Recognition – OCR): OCR is a tool that turns what we see on a page (pixels) into text a computer can type and search. How it works: (1) Look at the image of the page, (2) find the characters, (3) turn those shapes into letters, (4) output the letters as text. Why it matters: Without OCR, computers can’t read scanned pages, so you can’t search, copy, or analyze them. šŸž Anchor: Scanning a Thai restaurant menu and converting it into editable text is OCR in action.
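
To make steps (1)–(4) concrete, here is a minimal sketch using an off-the-shelf engine (pytesseract), not the model described in this article; it assumes Tesseract is installed with its Thai language pack, and the file name is hypothetical.

```python
# Plain OCR: image in, unstructured text out (no layout reconstruction).
# Assumes the Tesseract binary and its Thai ("tha") language data are installed.
from PIL import Image
import pytesseract

def ocr_page(image_path: str) -> str:
    page = Image.open(image_path)                      # (1) look at the page image
    # (2)-(4) find characters, map shapes to letters, and return them as text;
    # "tha+eng" asks Tesseract to recognize Thai and English on the same page.
    return pytesseract.image_to_string(page, lang="tha+eng")

print(ocr_page("thai_menu_scan.png"))                  # hypothetical file name
```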

šŸž Hook: You know how a comic book uses both pictures and speech bubbles to tell a story the words alone can’t?

🄬 The Concept (Vision–Language Model – VLM): A VLM understands images and text together. How it works: (1) See the image, (2) read any text in it, (3) match visual parts (like a table or chart) with the words, (4) produce a meaningful answer or a structured output. Why it matters: Without VLMs, an AI might read the words but miss what the layout, figures, or tables are trying to say. šŸž Anchor: When you ask an AI to pull a table from a bank statement image and return it as an HTML table, that’s a VLM doing both seeing and reading.

The world before: AI document readers were built mainly around high-resource languages like English or Chinese. They worked decently on clean, simple pages but stumbled when documents were multi-column, table-heavy, or visually messy. For Thai, the problems were tougher: stacked diacritics, vowels that appear above or below consonants, and no spaces between words. That makes it hard to decide where a word starts and ends and to keep characters from colliding.

šŸž Hook: Think of Thai text like musical notes stacked on a staff—if notes shift up or down or sit very close, reading them cleanly gets tricky.

🄬 The Concept (Thai script challenges): Thai has complex letter placement and no explicit word boundaries. How it works: (1) Characters can stack, (2) vowels can float around consonants, (3) there are no spaces to split words, (4) layout is often dense in forms and financial reports. Why it matters: Without handling these, AI makes more mistakes—mixing characters, splitting words wrongly, and confusing columns. šŸž Anchor: On a government form, the word boundaries aren’t marked, so a naive system might read two neighboring fields as one long word.
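
To see why missing word boundaries are the hard part, here is a minimal sketch with PyThaiNLP’s tokenizer (the same library the authors later use for synthetic data); the sentence and the exact split shown are illustrative.

```python
# Thai is written without spaces, so even finding word boundaries needs a model.
from pythainlp.tokenize import word_tokenize

text = "รายงานประจำปีของธนาคาร"   # "the bank's annual report", written with no spaces
print(word_tokenize(text, engine="newmm"))
# roughly: ['รายงาน', 'ประจำปี', 'ของ', 'ธนาคาร']; one wrong boundary can merge two fields
```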

The problem: Open VLMs trained mostly on big-language datasets rarely see enough Thai pages to learn these patterns. So they misread characters, scramble the order of parts on a page, or miss the structure of tables and charts. Even strong proprietary models can slip on Thai-heavy, layout-dense documents.

Failed attempts: Teams tried plain OCR plus separate layout tools. But stitching many tools together often breaks on real-world pages: if OCR slightly misreads, the layout tool gets confused. Others tried generic VLMs, but with little Thai training, they didn’t generalize to financial tables, government forms, or handwritten notes.

šŸž Hook: You know how baking a cake from mismatched recipes can flop—too much sugar from one recipe, not enough eggs from another?

🄬 The Concept (Data curation and synthesis pipeline): It’s a careful recipe to build clean training data using multiple steps—extract text, reorganize it, auto-check it, then human-verify a sampled set. Why it works: Without curated data, the model learns from messy labels and copies the mistakes. šŸž Anchor: If OCR says a table row is split in half, the pipeline catches the mismatch and fixes or removes it before training.

The gap: What was missing was a single open model tuned especially for Thai that learns not just to read characters but also to reconstruct structure—tables, headings, figures, and equations—and to be efficient enough for real deployments.

Real stakes: If your bank, school, or city office can’t reliably read Thai documents, work slows down, errors creep in, and people have to retype things manually. Accurate, open, and lightweight models lower costs, speed up processes, and keep data private by running on local machines.

02Core Idea

šŸž Hook: Imagine sorting a backpack full of mixed papers—receipts, homework, forms—then magically turning them into tidy digital documents that look right and read right.

🄬 The Concept (Typhoon OCR’s Aha!): Fine-tune an open vision–language backbone on Thai-focused, structure-aware data so the model learns to both read text and rebuild layout in one pass. How it works: (1) Gather real Thai documents plus synthetic ones with tables, charts, and math, (2) run a multi-stage pipeline to clean and structure labels, (3) fine-tune a VLM to output Markdown/HTML/LaTeX, (4) evaluate and iterate. Why it matters: Without structure-aware training, models can read characters but lose the page’s organization, making the results hard to use. šŸž Anchor: A Thai financial report goes in; a clean HTML table with correct headers and rows comes out.

Three analogies:

  • Librarian analogy: The model is a librarian who not only reads the book but also knows where chapters, tables, and figures belong on each shelf.
  • Map analogy: The model doesn’t just list street names (words); it draws the map (layout) so you see how streets connect.
  • Recipe analogy: Instead of handing you raw ingredients (characters), it cooks the full dish (structured document) with the right plating (formatting).

Before vs after:

  • Before: OCR pipelines read text, then separate tools tried to guess layout; errors cascaded and Thai-specific quirks tripped the system.
  • After: A unified VLM, steeped in Thai documents and structure labels, emits text plus structure together, cutting error chains and boosting reliability.

šŸž Hook: You know how choosing the right game mode makes a video game more fun and fair?

🄬 The Concept (Default Mode vs. Structure Mode in V1): Two supervision styles taught the model how much layout to preserve. Default Mode emphasized readable Markdown for loose layouts; Structure Mode emphasized precise HTML tables and figure tags for complex pages. Why it matters: Without separate modes, simple pages became too complicated or complex pages lost important structure. šŸž Anchor: A handwritten note becomes clean Markdown in Default Mode; a government form becomes HTML tables in Structure Mode.

šŸž Hook: Picking the right mode every time can feel like switching shoes for each sport.

🄬 The Concept (Unified Mode in V1.5): The team simplified training and inference by unifying modes so the model directly learns from visuals without relying on PDF anchors or manual mode switching. How it works: Better labeling models, broader data, and synthetic pages enable one robust output style. Why it matters: Fewer knobs for users; faster, more consistent results across document types. šŸž Anchor: You drag in a Thai bank PDF or a scanned form, and the model returns the best-structured output automatically.

Why it works (intuition):

  • Teach the model the patterns it must recognize—Thai character stacking, no spaces between words, and regular structures like tables—using a lot of examples, including synthetic ones that cover rare or tricky cases.
  • Make outputs structure-aware so the model learns not just what letters say but where blocks belong.
  • Reduce dependency on external metadata and keep inputs consistent (e.g., image width) so the model doesn’t get distracted by noise.

Building blocks:

  • Thai & English document images as inputs.
  • A cleaned, structure-rich supervision dataset (Markdown, HTML tables, figure tags, LaTeX for math).
  • A vision–language backbone (Qwen family) fine-tuned end-to-end.
  • Resolution-aware preprocessing and long-context training.
  • An evaluation suite covering financial reports, government forms, books, infographics, handwriting.

šŸž Anchor: Think of teaching—first you show clear Thai text examples, then you show messy scans, then you add tables and charts, and finally you quiz the student on all of them. Typhoon OCR follows that learning journey.

03Methodology

At a high level: Input image or PDF → Preprocess (resolution, context) → VLM encodes visuals + text cues → Structured decoding (Markdown/HTML/LaTeX) → Output clean, layout-aware document.

šŸž Hook: Building a treehouse needs a plan, good wood, and careful steps; a document model needs a plan, good data, and careful training.

🄬 The Concept (Multi-Stage Dataset Construction): Create training labels in steps so they’re accurate and consistent. How it works:

  1. Stage 1: Extract text with conventional OCR or parse PDF text layers for clean transcription.
  2. Stage 2: Use open-source VLMs with prompts to reorganize text into layout-aware formats (Markdown/HTML/figure tags).
  3. Stage 3: Agentic QC automatically checks for missing pieces, wrong order, or duplication and filters out bad cases.
  4. Stage 4: Human reviewers spot-check samples and remove flawed ones.

Why it matters: Without staged curation, noisy labels teach the model bad habits (a toy version of one such check is sketched below). šŸž Anchor: If a table’s last row is accidentally dropped, the QC step flags it so it won’t mis-train the model.
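
The paper’s Stage 3 uses model-based (ā€œagenticā€) checks; the sketch below only illustrates the filtering idea with a simple heuristic, and all names and thresholds are made up for illustration.

```python
# Toy stand-in for Stage 3 QC: compare the Stage 1 transcription with the Stage 2
# structured label and drop pairs that look truncated or duplicated. The real
# pipeline uses model-based checks; names and thresholds here are hypothetical.
import re

def strip_markup(structured: str) -> str:
    """Remove HTML tags and Markdown symbols so only visible text remains."""
    return re.sub(r"<[^>]+>|[#*_`|-]", " ", structured)

def looks_consistent(raw_text: str, structured_label: str,
                     min_ratio: float = 0.85, max_ratio: float = 1.15) -> bool:
    raw = re.sub(r"\s+", "", raw_text)
    label = re.sub(r"\s+", "", strip_markup(structured_label))
    if not raw:
        return False
    ratio = len(label) / len(raw)
    return min_ratio <= ratio <= max_ratio   # too short = dropped rows; too long = duplication

# Only (image, label) pairs that pass checks like this move on to human spot-checks.
```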

Input handling:

  • Images are resized to a consistent width (V1: 1,800 px; V1.5: keep the original if small, cap at 1,800 px if large) to stabilize training; see the resize sketch after this list.
  • Long-context limits (around 16k–17k tokens) let the model handle long books or reports without chopping important parts.
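
A minimal sketch of the width rule above using Pillow; only the 1,800 px cap comes from the paper, while the function shape and interpolation filter are assumptions.

```python
# Width rule from the bullets above: V1 always resizes to 1,800 px wide; V1.5
# keeps small images untouched and only caps wider ones at 1,800 px.
from PIL import Image

MAX_WIDTH = 1800

def preprocess_page(image: Image.Image, always_resize: bool = False) -> Image.Image:
    """always_resize=True mimics the V1 rule; False mimics the V1.5 rule."""
    if image.width <= MAX_WIDTH and not always_resize:
        return image                               # V1.5: leave small pages as-is
    scale = MAX_WIDTH / image.width
    new_size = (MAX_WIDTH, round(image.height * scale))
    return image.resize(new_size, Image.LANCZOS)   # LANCZOS choice is an assumption
```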

šŸž Hook: Like assigning seats in a classroom so everyone can see the board.

🄬 The Concept (Anchor text length and long context): Set a generous token budget so the model can keep track of many page elements. How it works: allocate a large sequence length; manage attention so far-apart blocks can still connect. Why it matters: Without enough context, the model forgets early sections, mixing up headers and tables across pages. šŸž Anchor: A 20-page Thai report keeps its table of contents aligned with later sections because the context window is big enough.

Two supervision modes in V1 (later unified in V1.5):

  • Default Mode: Simple, readable Markdown for loosely structured pages (receipts, notes).
  • Structure Mode: Rich HTML, figure tags, and LaTeX for complex layouts (financial reports, forms, academic pages).

šŸž Hook: Think of picking a pencil for sketching and a ruler for precise diagrams.

🄬 The Concept (Structured document supervision): Teach the model with outputs that reflect the page’s true structure. How it works: Annotate tables with headers and merges, mark figures, and preserve order and hierarchy. Why it matters: Without structure labels, the model guesses and breaks row/column relations. šŸž Anchor: A balance sheet becomes an HTML table with correctly merged header cells for assets and liabilities.
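
Here is a hypothetical supervision pair in that spirit, shown as Python data; a real target would be in Thai, and the authors’ exact tag conventions may differ.

```python
# Hypothetical Structure Mode supervision pair: the target keeps the merged
# header cell (colspan) instead of flattening the table into plain text.
# English labels are used only for readability; real targets would be in Thai.
example = {
    "image": "balance_sheet_page1.png",            # illustrative file name
    "target": (
        "<table>\n"
        "  <tr><th colspan=\"2\">Assets</th></tr>\n"
        "  <tr><td>Cash and equivalents</td><td>1,250,000</td></tr>\n"
        "  <tr><td>Trade receivables</td><td>830,000</td></tr>\n"
        "</table>"
    ),
}
```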

V1.5 data upgrades:

  • Stronger labeling models (Qwen3-VL, Dots.OCR) replace reliance on PDF anchors.
  • Wider corpus: real Thai docs, DocLayNet subsets, Thai-translated VQA to retain general vision–language grounding, and synthetic documents for rare cases (math, charts, typographic variety).

šŸž Hook: When grocery stores lack rare spices, chefs sometimes make their own blends.

🄬 The Concept (Synthetic data generation): Programmatically create Thai pages with diverse fonts, charts, and equations. How it works: sample Thai words (PyThaiNLP), render text in many fonts/sizes, add charts (ChartCap), add culture-relevant visuals (SEA-VL Crawling), include equations (LaTeX OCR, OleehyO), and apply image noise (Augraphy). Why it matters: Without synthetic data, the model won’t see enough tricky cases to generalize. šŸž Anchor: A generated page with a pie chart, a Thai caption, and an equation like ∫x dx is used to teach the model how to handle all three together.
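
As a toy version of this idea, the sketch below samples Thai words with PyThaiNLP, renders them with Pillow, and adds simple pixel noise; the font path, layout logic, and noise model are placeholders for the richer tooling (ChartCap, SEA-VL, Augraphy) named above.

```python
# Toy synthetic-page generator: sample Thai words, render them, add light noise.
# The real pipeline also adds charts, equations, and Augraphy-style degradations.
import random
import numpy as np
from PIL import Image, ImageDraw, ImageFont
from pythainlp.corpus import thai_words

FONT_PATH = "fonts/ThaiFont.ttf"   # placeholder: any Thai-capable TrueType font

def synth_page(n_words: int = 40, width: int = 1200, height: int = 800) -> Image.Image:
    words = random.sample(sorted(thai_words()), n_words)   # random Thai vocabulary
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(FONT_PATH, size=28)
    x, y = 40, 40
    for word in words:                                      # naive left-to-right layout
        draw.text((x, y), word, font=font, fill="black")
        x += 30 + 18 * len(word)
        if x > width - 200:
            x, y = 40, y + 48
    pixels = np.array(page).astype(np.int16)
    pixels += np.random.randint(-12, 13, pixels.shape, dtype=np.int16)  # scanner-like noise
    return Image.fromarray(np.clip(pixels, 0, 255).astype(np.uint8))

synth_page().save("synthetic_thai_page.png")
```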

Training recipe:

  • Backbone: Qwen2.5-VL (3B/7B) for V1; Qwen3-VL (2B) for V1.5.
  • Full-parameter supervised fine-tuning using open frameworks (olmOCR for V1; Axolotl for V1.5) with long-context support.
  • Quantization-aware training (V1.5) to make low-precision inference fast without big accuracy loss.

šŸž Hook: You know how practicing while wearing a light backpack makes the real race easier?

🄬 The Concept (Quantization-aware training): Train the model to expect lower-precision math at inference time. How it works: simulate quantized weights/activations during training so the model learns to be robust. Why it matters: Without this, shrinking the model later might hurt accuracy too much. šŸž Anchor: A 2B model runs fast on an edge GPU while still producing reliable tables.
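
The sketch below shows the mechanics of quantization-aware training with generic eager-mode PyTorch on a toy network; it is not the authors’ recipe for the 2B VLM, just the pattern of fake-quantizing during training and converting afterwards.

```python
# Generic eager-mode QAT in PyTorch on a toy model (not the authors' 2B recipe):
# train with fake-quantized weights/activations, then convert to int8 modules.
import torch
import torch.nn as nn
import torch.ao.quantization as tq

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.train()
model.qconfig = tq.get_default_qat_qconfig("fbgemm")   # x86 backend config
tq.prepare_qat(model, inplace=True)                    # insert fake-quant observers

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for _ in range(100):                                   # stand-in for real fine-tuning
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

quantized = tq.convert(model.eval())                   # low-precision model for deployment
```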

Inference flow example (a minimal client-side sketch follows this list):

  • Input: A scanned Thai government form with typed fields and handwritten notes.
  • Preprocess: Resize to width 1,800 px; pack pages within context.
  • Encode: Visual encoder detects text blocks, lines, checkboxes, and figures; language model aligns text and structure.
  • Decode: Output HTML tables for fields, Markdown for paragraphs, <figure> tags for images, and LaTeX for math.
  • Result: A clean, structured document that preserves the original layout and content.
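
A minimal client-side sketch of this flow, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, model name, file name, and prompt wording are placeholders, not an official API.

```python
# Client-side sketch of the flow above against an OpenAI-compatible server
# (e.g., vLLM). Base URL, model name, file name, and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("thai_government_form.png", "rb") as f:       # hypothetical scanned form
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="typhoon-ocr",                                # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Extract this page as Markdown; use HTML for tables, "
                     "<figure> tags for images, and LaTeX for equations."},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)              # structured, layout-aware output
```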

Secret sauce:

  • Pair Thai-specific training (script quirks, no spaces) with structure-aware outputs.
  • Balance real and synthetic data so the model sees clean and challenging cases.
  • Reduce reliance on external PDF metadata so deployment is simpler and faster.

04Experiments & Results

The test: Measure how well the model reads text and preserves structure across Thai document types. Three metrics make this meaningful:

šŸž Hook: When you grade a report, you check spelling, structure, and how many fixes are needed.

🄬 The Concept (BLEU, ROUGE-L, Levenshtein):

  • BLEU checks how many word pieces match the reference, like vocabulary accuracy.
  • ROUGE-L checks longest matching sequences, capturing structure/order.
  • Levenshtein counts character edits needed to fix errors; lower is better.

Why it matters: Without these, we can’t fairly compare models on both text and layout (a small sketch of how the three scores can be computed follows). šŸž Anchor: A BLEU of 0.90 on a financial report means the model’s text matches the gold version very closely; a Levenshtein of 0.07 means only small character tweaks are needed.
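
The sketch below computes the three scores with common open-source libraries (sacrebleu, rouge-score, rapidfuzz); the paper’s exact tokenization and normalization, especially for unsegmented Thai, may differ.

```python
# Reference vs. prediction scoring with common libraries; the paper's exact
# tokenization/normalization (especially for unsegmented Thai) may differ.
import sacrebleu
from rouge_score import rouge_scorer
from rapidfuzz.distance import Levenshtein

reference  = "<table><tr><th>Assets</th><th>2024</th></tr></table>"
prediction = "<table><tr><th>Assets</th><th>2024</th></tr></table>"

bleu = sacrebleu.sentence_bleu(prediction, [reference]).score / 100      # 0..1, higher is better
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, prediction)["rougeL"].fmeasure
lev = Levenshtein.normalized_distance(prediction, reference)             # 0 = perfect match

print(f"BLEU={bleu:.2f}  ROUGE-L={rouge_l:.2f}  Levenshtein={lev:.2f}")
```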

Competition: Typhoon OCR (3B/7B in V1; 2B in V1.5) was compared with GPT-4o, Gemini 2.5 Flash/Pro, and the prior Typhoon OCR version. The documents covered Thai financial reports, government forms, books, infographics, handwriting, and a mixed ā€œOthersā€ set.

Scoreboard highlights (V1, Structure Mode):

  • Financial reports: Typhoon OCR 7B (Image/PDF) reached roughly BLEU 0.91, ROUGE-L 0.94, and Levenshtein 0.07–0.08, outperforming GPT-4o and Gemini in this category.
  • Government forms: Typhoon OCR 3B reached roughly BLEU 0.92–0.93, ROUGE-L 0.96, and Levenshtein 0.04–0.05, holding up strongly even against the 7B and the proprietary systems.
  • Thai books: all models scored lower; Typhoon landed around BLEU 0.63–0.64, ROUGE-L 0.71–0.72, and Levenshtein 0.31–0.32, since books contain more figures and non-standard elements.

V1 takeaways:

  • Small gap between PDF-with-metadata and image-only for Typhoon models suggests strong visual-text alignment, not over-reliance on metadata.
  • Standardizing width at 1,800 px stabilized learning and improved results.

V1.5 results (2B) vs V1 (7B) and proprietary baselines (BLEU/ROUGE-L/Levenshtein):

  • Average performance rose from 0.558 to 0.644 BLEU and from 0.686 to 0.774 ROUGE-L, while Levenshtein dropped from 0.332 to 0.251. That’s like improving from a solid B to a strong A- overall.
  • Government forms and financial reports: V1.5 either matched or beat larger proprietary models and the older 7B Typhoon on lexical and structural metrics, with lower character error rates.
  • Infographics and handwriting: Proprietary models sometimes had lower character error, but V1.5 narrowed the gap significantly versus V1, showing clear progress.

Surprising findings:

  • The 3B V1 model occasionally matched or exceeded the 7B on forms, suggesting the data recipe mattered more than size.
  • V1.5 (2B), with better data and training, outperformed the older 7B on average—evidence that smart curation and unified supervision beat raw parameter count.
  • Books remained challenging due to heavy figures and ambiguous layouts. This points to future work in figure understanding and diagram reasoning.

Contextual meaning of numbers:

  • BLEU near 0.9 on financial/government pages is like transcribing a dense worksheet with only a few small slips.
  • ROUGE-L near 0.95 means the order and grouping of sections, rows, and headers are preserved, a key for re-usable outputs.
  • Levenshtein near 0.05–0.08 means few per-character edits, which drastically reduces manual cleanup.

Bottom line: Across structured Thai documents—the ones businesses and agencies rely on—Typhoon OCR is accurate, layout-faithful, and efficient, rivaling bigger closed models while remaining open and deployable.

05Discussion & Limitations

Limitations:

  • Degraded inputs (very low resolution, blur, glare, occlusion) still reduce accuracy, especially in handwriting and image-heavy infographics.
  • Figure-heavy and unconventional layouts (e.g., creative posters) are harder to parse consistently.
  • Multilingual breadth is limited mainly to Thai and English; other low-resource scripts need extra adaptation.
  • V1.5 simplifies modes, but rare edge cases may still need light post-processing.

Required resources:

  • A single GPU (even modest) can run the 2B model with low-precision inference; for training or large-batch processing, multi-GPU setups help.
  • Storage for the model, tokenizer, and data; and a pipeline to render outputs to Markdown/HTML/LaTeX as needed.

When not to use:

  • Extremely poor scans where text is unreadable to humans.
  • Documents where the exact pixel-perfect visual design (fonts, kerning) must be preserved (Typhoon optimizes for semantic structure, not DTP-level reproduction).
  • Highly specialized diagrams or scientific plots that require domain-specific reasoning beyond layout reconstruction.

Open questions:

  • How to further improve figure and chart understanding without inflating model size?
  • What’s the best balance of real vs synthetic data as we extend to new languages and domains?
  • Can lightweight post-OCR repair models fix residual Thai word segmentation errors in a plug-and-play way?
  • How to integrate higher-level reasoning (e.g., extracting relationships, validating totals) without sacrificing speed?

Overall assessment: Typhoon OCR shows that targeted data recipes and structure-aware supervision can unlock strong performance in Thai document understanding, all while staying small, open, and practical. The main growth area is robust handling of highly visual, irregular pages and very degraded scans—ripe targets for the next iteration.

06Conclusion & Future Work

Three-sentence summary: Typhoon OCR fine-tunes open vision–language backbones on Thai-focused, structure-aware data so the model can both read text and rebuild complex layouts. It matches or beats larger proprietary systems on key Thai document types while staying compact and deployable. V1.5 simplifies usage with a unified mode, reduces reliance on metadata, and lifts accuracy through better data and training.

Main achievement: Demonstrating that an open, small-footprint VLM—trained with a curated, Thai-centric, structure-rich corpus—can deliver high-quality transcription and layout reconstruction competitive with frontier closed models.

Future directions:

  • Stronger figure/chart reasoning and diagram understanding.
  • More robust handling of degraded images via targeted augmentations and noise modeling.
  • Expansion to other low-resource languages using the same data pipeline blueprint.
  • Adding light-weight reasoning for structured information extraction (e.g., totals, dates, entities).

Why remember this: It’s a proof that smart data and structure-aware supervision can beat brute-force size, especially in low-resource scripts like Thai. It opens the door to affordable, private, and accurate document automation across banks, schools, and government offices—without locking users into closed ecosystems.

Practical Applications

  • Automate data entry from Thai government forms by exporting fields as structured HTML tables.
  • Convert Thai financial reports into clean Markdown and HTML for quick analysis in dashboards.
  • Digitize Thai books and educational materials with preserved headings, figures, and equations.
  • Process receipts and invoices at scale to extract totals, dates, and vendor info for accounting.
  • Archive scanned records with searchable text and layout, improving retrieval in knowledge bases.
  • Prepare datasets for analytics by turning mixed documents into standardized, structured outputs.
  • Enable on-prem or edge deployments for privacy-sensitive sectors like banking and healthcare.
  • Support handwriting-heavy workflows by transcribing form fields and aligning them to table cells.
  • Accelerate regulatory reporting by extracting tables and charts from official PDF releases.
  • Pre-clean documents for downstream NLP tasks (e.g., entity recognition) using layout-aware text.
#Thai OCR #Vision-Language Model #Document Layout Reconstruction #Structured Document Supervision #Thai Script Recognition #Synthetic Data Generation #Quantization-Aware Training #Markdown/HTML/LaTeX Output #Financial Report Extraction #Government Forms Parsing #Handwriting Recognition #Long-Context Modeling #Open-Source OCR #PDF Metadata #Thai Document Understanding