PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
Key Summary
- This paper introduces PaddleOCR-VL-1.5, an upgraded small but mighty vision-language model that reads and understands real-world, messy documents more accurately than previous document-parsing models.
- It adds a smarter layout analyzer (PP-DocLayoutV3) that can handle bent, skewed, and oddly lit pages by drawing precise shapes around content and deciding reading order in one go.
- The model learns multiple skills at once (OCR, tables, formulas, charts, seals, and text spotting), so it sees documents more like a human would.
- A new test set, Real5-OmniDocBench, checks performance under scanning, warping, screen photos, bad lighting, and skew; the model sets a new record with 92.05% accuracy.
- On the standard OmniDocBench v1.5 benchmark, it reaches a top score of 94.5%, with strong gains on formulas and complex tables.
- Text spotting is done end-to-end by producing both the words and their 4-corner locations in a single answer using special location tokens.
- Seal recognition is added and performs much better than far larger models, showing excellent parameter efficiency at only 0.9B.
- A distortion-aware data augmentation pipeline and uncertainty-aware sampling make the model robust to real-world photography problems.
- Despite higher accuracy, the system also gets faster thanks to a pipelined, multi-threaded serving design and optimized batching.
- This makes document understanding more reliable for everyday tools like search, RAG chatbots, and automated office workflows.
Why This Research Matters
In real life, most documents aren’t perfect: they’re photographed at angles, bent by hands, or dimly lit. When software misreads these, invoices get misbilled, reports lose key formulas, and chatbots answer from scrambled data. PaddleOCR-VL-1.5 proves that a small, specialized model can read messy documents accurately and fast, which boosts reliability for search, RAG, and business automation. Its new benchmark (Real5-OmniDocBench) gives the community a fair way to measure real-world robustness. By adding seal recognition and end-to-end text spotting, it handles office and field scenarios beyond standard OCR. This means better tools for schools, hospitals, labs, and companies where precision truly matters.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how reading a book is easy at your desk, but reading a wrinkled homework sheet in a dim hallway or a photo of a whiteboard can be tough? Computers struggle with that even more.
🥬 Filling (The Actual Concept) — Vision-Language Model (VLM):
- What it is: A VLM is a computer model that looks at pictures and reads and writes words, so it can understand documents like a person who both sees and reads.
- How it works: (1) See the image; (2) Turn picture parts into features; (3) Use a language brain to describe and structure what it sees; (4) Output text plus structure like tables or math.
- Why it matters: Without seeing and reading together, computers either miss the picture details or misunderstand the words.
🍞 Bottom Bread (Anchor): Imagine a robot reading a scanned worksheet—it recognizes text, finds the table, and writes it cleanly into a spreadsheet.
🍞 Top Bread (Hook): Imagine packing your school bag with math, science, and art assignments all at once, but still doing each well.
🥬 Filling — Multi-task Learning:
- What it is: Teaching one model several related skills at the same time.
- How it works: (1) Share a common vision-language backbone; (2) Train on many tasks (OCR, tables, formulas, charts, seals, text spotting); (3) Let skills help each other; (4) Fine-tune instructions for each task.
- Why it matters: If tasks are learned separately, the model forgets connections (like how table lines help text reading) and performs worse.
🍞 Bottom Bread (Anchor): Learning to read music helps with math patterns; here, learning tables helps with reading order and text quality.
The world before: Document AI had grown quickly. Earlier tools could read clean PDFs and simple scans well. Models like PaddleOCR-VL 1.0, DeepSeek-OCR, MonkeyOCR, and MinerU2.5 made big strides in accuracy and efficiency. But most were tuned for tidy pages: flat, bright, and straight.
The problem: Real life is messy. Phone photos skew pages, notebooks bend, screens add moiré stripes, and lights cast shadows. Normal detectors draw simple boxes and get confused when paragraphs curve or tilt, and reading order breaks across columns.
Failed attempts:
- One-size-fits-all rectangular boxes: Too much background or cut-off content when pages bend.
- Separate steps for reading order: Small mistakes pile up, causing cascades of errors.
- Clean-only training: Models panic when photos are dark, warped, or reflective.
The gap: We needed a layout engine that understands curvy, tilted shapes precisely and decides reading order at the same time, plus a training plan that practices “hard mode” (bad lighting, skew, screen photos). And we lacked a fair robustness test set for these physical distortions.
🍞 Top Bread (Hook): Think of tidying a desk: you place books (layout), decide which to read first (reading order), and then read the pages (recognition). Now do it while the desk is tilted.
🥬 Filling — Layout Analysis:
- What it is: Figuring out where parts of a page are (paragraphs, tables, figures) and in what order to read them.
- How it works: (1) Detect each element; (2) Outline its exact shape; (3) Decide the reading sequence; (4) Pass clean crops to a recognizer.
- Why it matters: Without correct layout, you mix columns, scramble tables, and break formulas.
🍞 Bottom Bread (Anchor): A newspaper has two columns; wrong layout reads right column before finishing the left.
Real stakes: If your company or school feeds messy documents into a chatbot, the answers can be wrong. Bills get misread, lab reports lose math, and tables go out of order. RAG systems depend on accurate, structured text; garbage in means garbage out.
What this paper brings: PaddleOCR-VL-1.5—a compact 0.9B-parameter model—raises clean accuracy and becomes far more robust in the wild. It adds a new, unified layout analyzer (PP-DocLayoutV3), expands tasks (including seals and text spotting), and introduces a brand-new benchmark, Real5-OmniDocBench, covering five tricky real-world distortions.
🍞 Top Bread (Hook): Imagine practicing soccer not just on perfect grass, but also on bumpy fields so game day doesn’t surprise you.
🥬 Filling — Distortion-Aware Data Augmentation:
- What it is: Training by adding realistic bends, skews, lighting changes, and moiré to images.
- How it works: (1) Simulate warping and tilt; (2) Add glare and dark spots; (3) Use screen-photo artifacts; (4) Mix in varied real samples.
- Why it matters: Without practicing on “hard fields,” the model slips in real use.
🍞 Bottom Bread (Anchor): A homework photo taken at night under a yellow lamp still gets read correctly.
Bottom line: Before, models excelled on tidy pages. After, PaddleOCR-VL-1.5 keeps the top accuracy on clean docs and sets records when pages are scanned, warped, skewed, or screen-photographed—all while staying small and fast.
02 Core Idea
The “Aha!” moment in one sentence: Do precise layout shapes and reading order inside one vision model, then hand clean, corrected regions to a multi-skilled recognizer that was trained to handle the same messy distortions.
🍞 Top Bread (Hook): You know how cutting puzzle pieces more exactly makes the final picture easier to solve?
🥬 Filling — Instance Segmentation:
- What it is: Instead of just drawing a box, the model draws the exact outline (mask) of each page element.
- How it works: (1) Propose elements as queries; (2) Predict pixel-accurate masks and labels; (3) Fit shapes even on curved or tilted text; (4) Output clean regions with little background.
- Why it matters: Sloppy boxes include clutter; exact masks keep only the piece you need.
🍞 Bottom Bread (Anchor): A curved paragraph on a photo page gets outlined tightly, not trapped in a crooked rectangle.
Three analogies for the same idea:
- Glasses + highlighter: The layout model is the glasses that fix blur and the highlighter that traces exact shapes. The recognizer is the reader who then reads highlighted parts in order.
- Airport baggage: One team sorts luggage by exact shape and sends them in the right queue; another team scans barcodes and details. No reshuffling needed later.
- Orchestra: The layout conductor sets who plays and when (reading order) while the section leaders (recognizers) perform their parts (text, tables, formulas) perfectly.
Before vs. After:
- Before: Rectangles overlapped; reading order came later; errors stacked up; real photos broke the system.
- After: One-shot layout finds exact shapes and order; the recognizer reads each element with tailored skills; distortions are expected, not surprising.
Why it works (intuition, no equations):
- Shared vision features: When detection, segmentation, and ordering learn together, the same visual clues (like column lines) support all three jobs, so they agree rather than fight.
- Anti-symmetric pairwise ordering: The model learns a consistent “who comes first” signal across all elements, and a simple voting sorts them globally without loops.
- Quadrilateral locations for spotting: Four corners fit rotated or slanted text better than plain boxes, so recognition improves.
- Special location tokens: By giving coordinates their own tokens, the language model learns spatial meaning without getting confused by random numbers.
🍞 Top Bread (Hook): Imagine reading a poster on a windy day; it flaps and warps, but you still figure out what to read first.
🥬 Filling — Reading Order Prediction (inside the layout model):
- What it is: The model decides which element to read first, second, third… using learned relationships between elements.
- How it works: (1) Compare each pair of elements; (2) Predict who comes before whom; (3) Collect everyone’s “votes”; (4) Sort by total votes, as sketched in the code below.
- Why it matters: Without order, stories jumble, tables read wrong, and instructions become nonsense.
🍞 Bottom Bread (Anchor): A two-column report is read left-top to right-bottom, even if photographed at an angle.
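To make the voting idea above concrete, here is a minimal Python sketch. The `pairwise_score(a, b)` callable is a hypothetical stand-in for the learned head that estimates how likely element `a` is to be read before element `b`; PP-DocLayoutV3 learns that signal jointly with detection and masks, so this only illustrates the sorting step.

```python
# Minimal sketch of reading order by pairwise voting. `pairwise_score(a, b)` is a
# hypothetical stand-in for the learned head returning P(a is read before b).
def order_by_votes(elements, pairwise_score):
    votes = [0.0] * len(elements)
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            p = pairwise_score(elements[i], elements[j])  # P(element i precedes element j)
            votes[i] += p                                 # anti-symmetric: i's win is j's loss
            votes[j] += 1.0 - p
    ranked = sorted(range(len(elements)), key=lambda k: votes[k], reverse=True)
    return [elements[k] for k in ranked]

# Toy usage with a purely geometric stand-in score (top of page reads first).
boxes = [{"name": "footer", "y": 900}, {"name": "title", "y": 40}, {"name": "body", "y": 300}]
top_first = lambda a, b: 1.0 if a["y"] < b["y"] else 0.0
print([e["name"] for e in order_by_votes(boxes, top_first)])  # ['title', 'body', 'footer']
```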
Building blocks (the recipe pieces):
- PP-DocLayoutV3: A unified RT-DETR-based transformer with mask heads for exact shapes plus an integrated order head.
- PaddleOCR-VL-1.5-0.9B: A NaViT-style dynamic resolution visual encoder, an adaptive connector, and a compact ERNIE-4.5-0.3B language model trained for six tasks.
- Distortion-aware augmentation: Practice on warped, skewed, and low-light samples.
- Uncertainty-Aware Cluster Sampling (UACS): Focus extra training on the hardest visual clusters.
- Text spotting with 4-point quads + location tokens: Output words and their corners in one pass.
- GRPO reinforcement learning: Smooth out style mismatches and push the model on tricky cases.
🍞 Top Bread (Hook): Like using a magnifying glass only where needed so you don’t slow down the whole book.
🥬 Filling — Dynamic Resolution Encoder (NaViT-style):
- What it is: A vision encoder that adapts to different image sizes and crops efficiently.
- How it works: (1) Patch and pack images; (2) Keep important detail while saving compute; (3) Feed flexible shapes into the model.
- Why it matters: Fixed-size inputs either waste detail or waste time.
🍞 Bottom Bread (Anchor): A small seal is “zoomed” just enough to read the tiny curved letters without exploding memory.
🍞 Top Bread (Hook): Ever ask a friend to help explain a tricky diagram?
🥬 Filling — ERNIE-4.5-0.3B (Language Backbone):
- What it is: A compact language model that turns visual features into structured, readable outputs.
- How it works: (1) Receive fused visual tokens; (2) Follow task instructions; (3) Generate text, tables (Markdown/JSON), formulas, or location-tagged words.
- Why it matters: Without a strong language brain, outputs are messy, inconsistent, or wrong.
🍞 Bottom Bread (Anchor): The model writes a clean Markdown table and the correct LaTeX-like math while keeping reading order.
Finally, the benchmark:
🍞 Top Bread (Hook): A fair obstacle course tells you who’s truly best, not just who practiced on perfect tracks.
🥬 Filling — Real5-OmniDocBench:
- What it is: A benchmark that tests scanning, warping, screen photos, lighting issues, and skew, all matched 1-to-1 with clean ground truth.
- How it works: (1) Start from OmniDocBench v1.5; (2) Create five real-world variants; (3) Keep labels identical; (4) Score text, formulas, tables, and order.
- Why it matters: Without a robust test, claims of “works in the wild” are just guesses.
🍞 Bottom Bread (Anchor): Two photos of the same page—one clean, one skewed—are graded against the same answer key.
03 Methodology
At a high level: Input (PDF page or photo) → Stage A: PP-DocLayoutV3 (find exact shapes + reading order) → Stage B: PaddleOCR-VL-1.5-0.9B (recognize text/tables/formulas/charts/seals or do spotting) → Post-process to Markdown/JSON.
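Before diving into each stage, here is a rough code-level picture of that hand-off. The `layout_model.predict` and `recognizer.generate` interfaces below are hypothetical stand-ins for Stage A and Stage B, not the actual PaddleOCR-VL API.

```python
# Rough sketch of the two-stage hand-off. `layout_model.predict` and
# `recognizer.generate` are hypothetical interfaces; `image` is an HxWxC numpy array.
def parse_page(image, layout_model, recognizer):
    regions = layout_model.predict(image)                 # [{"label", "bbox", "mask", "order"}, ...]
    regions = sorted(regions, key=lambda r: r["order"])   # Stage A already decided reading order
    blocks = []
    for r in regions:
        x0, y0, x1, y1 = r["bbox"]
        crop = image[y0:y1, x0:x1]                        # in practice the mask gives a tighter crop
        task = {"table": "table_to_markdown",
                "formula": "formula_to_latex",
                "chart": "chart_to_table",
                "seal": "seal_ocr"}.get(r["label"], "ocr")
        blocks.append({"label": r["label"], "content": recognizer.generate(crop, task=task)})
    return blocks                                         # post-processing stitches these into Markdown/JSON
```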
Stage A: PP-DocLayoutV3 (Unified Layout Analysis)
- What happens: A transformer built on RT-DETR predicts, for each element, its class (paragraph, table, figure, title, formula, seal…), a pixel-accurate mask, a tight box if needed, and its place in the reading order.
- Why it exists: If we don’t nail the shape and order in one go, later steps stumble: boxes include clutter, columns get mixed, and tables split.
- Example: A warped textbook page: the paragraph and formula regions are outlined precisely even on a curve; the model decides the formula comes after the paragraph.
Secret sauce pieces:
- Instance segmentation head: draws exact outlines so slanted or curved content is isolated.
- Integrated reading order: learns pairwise precedence and uses a simple voting sort to produce a clean sequence.
- End-to-end training: detection, masks, and order share the same features, aligning geometry and logic from the start.
Stage B: PaddleOCR-VL-1.5-0.9B (Element-level Recognition + Spotting)
- What happens: The vision encoder (NaViT-style dynamic resolution) turns each cropped region into visual tokens; an adaptive connector feeds them into a compact ERNIE-4.5-0.3B language model that follows task instructions to produce structured outputs.
- Why it exists: Specialized reading per element (OCR, tables, formulas, charts, seals) delivers higher fidelity than one generic read.
- Example: A complex financial table with merged cells is converted into Markdown with correct headers and cell contents, then merged across pages if needed.
Text Spotting (end-to-end):
- What happens: For natural scenes or mixed layouts, the model directly outputs words plus 4-corner coordinates in reading order in a single generation.
- Why it exists: Many texts are rotated or curved; 4-point quads fit reality better than plain boxes; one pass keeps speed and consistency.
- Example: "DREAM <LOC_253> <LOC_286> <LOC_346> <LOC_298> <LOC_345> <LOC_339> <LOC_252> <LOC_330>" encodes the word DREAM and the TL/TR/BR/BL corners.
Post-processing:
- What happens: Outputs are stitched into Markdown/JSON; tables can merge across pages; heading hierarchies are refined.
- Why it exists: Users need ready-to-use, structured content for RAG and analytics.
- Example: A report becomes clean Markdown with correct section levels and joined cross-page tables.
Training Recipe
- Layout Analysis (PP-DocLayoutV3):
- Data: 38k carefully labeled pages across 25 component types (paragraphs, titles, tables, figures, footnotes, seals, vertical text, etc.), each with exact boundaries and absolute reading order.
- Distortion-aware augmentation: simulate warping, skew, screen artifacts, and lighting changes so the model practices “hard mode.”
- Optimization: AdamW, small weight decay, stable learning rate, 150 epochs, end-to-end so queries learn both geometry and topology simultaneously.
🍞 Top Bread (Hook): Practicing on hills makes running flat easy.
🥬 Filling — Distortion-Aware Data Augmentation:
- What it is: Training with images bent, skewed, and oddly lit.
- How it works: Add realistic camera and lighting artifacts to clean pages.
- Why it matters: Without this, field performance collapses.
🍞 Bottom Bread (Anchor): A screen photo with glare still yields crisp text.
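As a rough illustration (not the paper's actual pipeline), the sketch below bends and relights a page image with OpenCV: a random perspective skew plus a gamma change. The parameter ranges are made up for the example; the real augmentation also simulates warping, moiré, and glare.

```python
import cv2
import numpy as np

def distort_page(img, max_shift=0.06, gamma_range=(0.6, 1.5), rng=None):
    """Illustrative 'hard mode' augmentation for a uint8 page image: a random
    perspective skew plus a gamma (lighting) change."""
    rng = rng or np.random.default_rng()
    h, w = img.shape[:2]
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * [w, h]
    dst = (src + jitter).astype(np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    warped = cv2.warpPerspective(img, M, (w, h), borderValue=(255, 255, 255))
    gamma = rng.uniform(*gamma_range)                      # <1 brightens, >1 darkens
    lut = ((np.arange(256) / 255.0) ** gamma * 255).astype(np.uint8)
    return cv2.LUT(warped, lut)
```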
- Element Recognition + Spotting (PaddleOCR-VL-1.5-0.9B):
- Pre-training: 46M image-text pairs (up from 29M), broader languages and messy scenarios; higher spotting resolution to better localize tiny text; seal and spotting priors introduced early.
- Instruction post-training: Retain OCR/table/formula/chart; add seals and text spotting; use special <LOC_0>…<LOC_1000> tokens for normalized coordinates so the LM learns spatial meaning.
- Reinforcement learning (GRPO): Grouped rollouts and relative advantages help stabilize style and focus on tricky, high-value samples.
🍞 Top Bread (Hook): Like telling a friend to say answers in a certain format so the teacher can grade quickly.
🥬 Filling — Location Tokens + 4-Point Quads:
- What it is: Special tokens that mean “this is a coordinate” for each of four corners.
- How it works: Insert eight location tokens (x,y for TL, TR, BR, BL) right after the word.
- Why it matters: Prevents number confusion and captures rotation/tilt precisely.
🍞 Bottom Bread (Anchor): The word “Total” in a rotated invoice cell returns with exact slanted box corners.
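Going the other way, a training-style target for a spotted word could be assembled roughly as below, with the quantization grid matching the <LOC_0>…<LOC_1000> vocabulary; the rounding and clamping conventions are assumptions for illustration.

```python
def quad_to_loc_tokens(quad, img_w, img_h, grid=1000):
    """Turn a 4-corner quad (TL, TR, BR, BL, pixel coords) into eight <LOC_k>
    tokens on a 0..grid integer grid; the model's exact quantization may differ."""
    tokens = []
    for x, y in quad:
        kx = min(grid, max(0, round(x / img_w * grid)))
        ky = min(grid, max(0, round(y / img_h * grid)))
        tokens += [f"<LOC_{kx}>", f"<LOC_{ky}>"]
    return " ".join(tokens)

# The word "Total" in a rotated invoice cell (corner values are illustrative).
quad = [(120, 80), (300, 95), (296, 140), (116, 125)]      # TL, TR, BR, BL
print("Total " + quad_to_loc_tokens(quad, img_w=1000, img_h=800))
```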
🍞 Top Bread (Hook): When classmates grade each other’s work, you learn faster from tough questions.
🥬 Filling — GRPO (Group Relative Policy Optimization):
- What it is: A reinforcement learning method that compares outputs within a group to improve policies.
- How it works: Run parallel generations; score them; update based on relative gains; sample harder cases more often.
- Why it matters: Without it, the model may overfit easy styles and falter on weird layouts.
🍞 Bottom Bread (Anchor): Handwritten notes with messy arrows still get read and ordered correctly.
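The group-relative core of GRPO fits in a few lines: score several rollouts of the same input, then measure each against its own group rather than an absolute baseline. The reward function itself (format and accuracy scoring) is task-specific and not shown here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: each rollout is scored against its
    own group's mean and spread, so 'better than its siblings' drives the
    update rather than an absolute reward scale."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts for the same tricky page, scored by some output-quality reward.
print(group_relative_advantages([0.92, 0.88, 0.40, 0.95]))
```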
Data selection: Uncertainty-Aware Cluster Sampling (UACS)
- What happens: Use visual features (e.g., CLIP) to cluster images; estimate uncertainty by running the model multiple times; sample more from hard clusters with a weighted plan (a minimal sketch follows below).
- Why it exists: Training time is precious; spend more on examples the model finds confusing.
- Example: The system upsamples warped seals and dense borderless tables.
🍞 Top Bread (Hook): You practice piano pieces that you mess up most, not the ones you already ace.
🥬 Filling — UACS:
- What it is: A way to pick training data that is both diverse and challenging.
- How it works: Cluster by visual style; measure uncertainty; allocate more samples to hard clusters.
- Why it matters: Without targeted sampling, the model wastes time on easy repeats.
🍞 Bottom Bread (Anchor): More practice time goes to faint photos of forms with tiny text.
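A simplified version of this sampling scheme might look like the sketch below, assuming visual embeddings (for example from CLIP) and per-sample uncertainty scores are already computed; the cluster count, temperature, and allocation rule are illustrative choices rather than the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def uacs_weights(features, uncertainties, n_clusters=8, temperature=1.0):
    """Weight each training sample by the mean uncertainty of its visual cluster."""
    features = np.asarray(features)
    uncertainties = np.asarray(uncertainties)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)
    cluster_unc = np.array([uncertainties[labels == c].mean() for c in range(n_clusters)])
    cluster_prob = np.exp(cluster_unc / temperature)       # hard clusters get more mass
    cluster_prob /= cluster_prob.sum()
    counts = np.bincount(labels, minlength=n_clusters)
    # Each sample inherits its cluster's probability mass, split evenly inside the cluster.
    return cluster_prob[labels] / counts[labels]

# Usage: idx = np.random.choice(len(feats), size=batch, p=uacs_weights(feats, unc))
```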
Serving and Speed
- Pipeline: Three threads—PDF-to-image, layout analysis, VLM inference—with queue buffers; dynamic minibatches group blocks across pages (see the sketch after this list).
- Why it matters: Keeps GPUs busy, shortens wait time, and scales to large document sets.
- Example: On an A100, FastDeploy reaches about 1.43 pages/s and over 2000 tokens/s end-to-end on OmniDocBench v1.5.
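A minimal sketch of that three-thread, queue-buffered structure is shown below; `render_pages`, `analyze_layout`, and `recognize_blocks` are placeholder callables, and the production FastDeploy pipeline additionally batches blocks dynamically across pages and tunes each backend.

```python
import queue
import threading

def run_pipeline(pdf_paths, render_pages, analyze_layout, recognize_blocks):
    """Three-stage, queue-buffered pipeline sketch: rendering, layout analysis,
    and VLM recognition overlap so the GPU stages stay busy."""
    pages_q, layout_q, results = queue.Queue(maxsize=8), queue.Queue(maxsize=8), []

    def renderer():
        for path in pdf_paths:
            for img in render_pages(path):                 # PDF -> page images
                pages_q.put(img)
        pages_q.put(None)                                  # end-of-stream marker

    def layouter():
        while (img := pages_q.get()) is not None:
            layout_q.put((img, analyze_layout(img)))       # masks + reading order
        layout_q.put(None)

    def recognizer():
        while (item := layout_q.get()) is not None:
            img, regions = item
            results.append(recognize_blocks(img, regions)) # element-level VLM inference

    threads = [threading.Thread(target=fn) for fn in (renderer, layouter, recognizer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```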
Secret sauce summary:
- Unified layout (masks + order) reduces cascading errors.
- Quad-based spotting with location tokens captures rotation.
- Distortion-aware training and UACS focus on real trouble spots.
- GRPO stabilizes style and boosts hard-case performance.
- Efficient serving squeezes more throughput from the same hardware.
04 Experiments & Results
The tests: The team measured how well the model reads text, formulas, and tables and keeps the right reading order, both on clean benchmarks (OmniDocBench v1.5) and on a new real-world benchmark (Real5-OmniDocBench) covering scanning, warping, screen photos, bad lighting, and skew.
The competition: They compared against pipeline tools (Marker, PP-StructureV3), huge general VLMs (Qwen, Gemini, GPT), and specialized document parsers (MinerU2.5, MonkeyOCR, DeepSeek-OCR, dots.ocr). Some competitors had over 200 billion parameters; PaddleOCR-VL-1.5 has just 0.9B.
The scoreboard with context:
- OmniDocBench v1.5 overall: 94.5%. Think of it like scoring an A+ while most solid classmates get A− to B+.
- Text Edit Distance: 0.035 (lower is better), meaning very few character errors.
- Formula (CDM): 94.21%, a big leap—like catching almost all math symbols in the right places.
- Tables (TEDS / TEDS-S): 92.76% / 95.79%, showing strong structure recovery even in complex layouts.
- Reading Order Edit: 0.042 (lower is better), indicating near-correct sequencing.
Real5-OmniDocBench overall: 92.05%—a new record. This is the “in-the-wild” stress test.
- Scanning: 93.43% — steady on typical office scanner quality.
- Warping: 91.25% — robust on bent or curved pages.
- Screen Photography: 91.76% — resists moiré and reflections.
- Illumination: 92.16% — reads well under uneven or dim light.
- Skew: 91.66% — especially strong where others struggle; a huge jump over the prior version.
New capabilities:
- Text Spotting: Across 9 dimensions (Ancient, Blur, Common, handwriting in CN/EN, printed text in CN/EN, table, Japanese), it leads overall (0.8621). That’s like topping every event at a track meet, not just one.
- Seal Recognition: NED 0.138 vs. 0.382 for a 235B model. Picture a tiny player outscoring a giant star by a wide margin.
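For context on the "lower is better" scores (text edit distance above, seal NED here), a character-level normalized edit distance can be computed as in this sketch; the benchmarks' exact normalization and text preprocessing may differ.

```python
def normalized_edit_distance(pred, truth):
    """Character-level NED (lower is better): Levenshtein distance divided by
    the longer string's length."""
    m, n = len(pred), len(truth)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # delete from pred
                        dp[j - 1] + 1,                          # insert into pred
                        prev + (pred[i - 1] != truth[j - 1]))   # substitute
            prev = cur
    return dp[n] / max(m, n, 1)

print(normalized_edit_distance("T0tal: 42", "Total: 42"))       # ~0.111
```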
Speed and efficiency:
- End-to-end on A100 with FastDeploy: ~1.43 pages/s and ~2017 tokens/s, beating the older version by ~17–19%.
- Works well across different backends (vLLM, SGLang) and GPUs, showing good portability and tuning headroom.
Surprising findings:
- The compact 0.9B model often beats huge general models in document-specific tasks. Specialization plus the right training beats brute force size here.
- Skew robustness jumped dramatically—integrating reading order into the vision layout module seems to pay off the most on tilted pages.
- Special location tokens for coordinates reduced fragmentation and improved spotting accuracy more than treating coordinates like normal numbers.
Takeaway: It’s not just top grades on clean tests; it’s also the class champion on the obstacle course that mimics real life.
05 Discussion & Limitations
Limitations:
- Extremely poor scans (very low DPI or heavy blur) can still fail; masks and OCR both need a minimum of detail.
- The Real5 benchmark covers five key distortions; other issues like heavy occlusion, stains, or torn pages are less explored.
- Handwritten math with dense, tiny symbols remains tough, especially when curved and faint.
- Ultra-long tables spanning many pages with inconsistent styles can still cause merging hiccups.
Required resources:
- A single modern GPU (e.g., A100, H20, L20, or even 4090-class) for best throughput; CPU-only is possible but much slower.
- For peak performance, deploy with FastDeploy or well-tuned vLLM/SGLang; set batch and token limits to balance speed and memory.
- Storage for diverse training/eval data if you plan to fine-tune on your domain.
When NOT to use:
- If you only have perfect, tiny documents and latency is ultra-critical on a phone CPU, a very lightweight OCR-only pipeline may be enough.
- If documents are mostly artistic posters with extreme graphic effects and little text, a scene-text model specialized for art may outperform.
- If you need deep reasoning about content (not just parsing), pair this with a strong downstream LLM.
Open questions:
- Can we further unify spotting and parsing so that layout and recognition happen truly in one pass for all tasks without losing accuracy?
- How to handle severe occlusions (stickers, stamps over text) and physical damage (rips, folds) more reliably?
- Can coordinate tokens be made continuous (learned embeddings) without losing stability or format simplicity?
- What are the best ways to align reading order across languages with different typography (vertical scripts, right-to-left) under distortion?
- How far can parameter-efficient tuning (LoRA, adapters) push domain adaptation without retraining the full model?
06 Conclusion & Future Work
Three-sentence summary: PaddleOCR-VL-1.5 is a compact, multi-task vision-language model that combines a new unified layout analyzer (with exact masks and built-in reading order) and a robust recognizer to parse text, tables, formulas, charts, seals, and spotted text in challenging conditions. It sets state-of-the-art accuracy on OmniDocBench v1.5 (94.5%) and a new real-world benchmark, Real5-OmniDocBench (92.05%), while running efficiently on common GPU backends. The design choices—distortion-aware training, 4-point spotting with location tokens, and reinforcement learning—deliver robustness that rivals or beats much larger general models.
Main achievement: Showing that a carefully engineered 0.9B model can outperform giant models on real-world document parsing by unifying precise layout (masks + order) with multi-skill recognition and practical training/serving tricks.
Future directions:
- Even tighter unification of spotting and parsing; improved handling of occlusions and damaged pages.
- Smarter coordinate representations and language-specific reading-order models.
- Parameter-efficient domain adaptation for enterprise-specific forms and handwriting.
- Deeper integration with RAG systems for end-to-end, trustworthy question answering from messy sources.
Why remember this: It proves that “small and smart” can beat “big and general” when the problem is well understood—especially in the wild, where pages skew, bend, glare, and still must be read correctly.
Practical Applications
- Automated invoice and receipt processing from mobile photos taken on-site.
- Reliable parsing of scanned academic papers with complex formulas and tables for research search engines.
- Digitizing government forms and contracts, including reading official seals accurately.
- Building robust RAG chatbots that answer from photographed documents without misordering content.
- Processing field reports and maintenance logs captured in poor lighting or at awkward angles.
- Archiving historical documents and traditional scripts with improved spotting and layout understanding.
- Extracting tables across multiple pages in financial reports into clean, analysis-ready spreadsheets.
- Parsing ID cards, certificates, and permits from phone photos while preserving reading order and structure.
- Transcribing classroom whiteboard or slide photos into structured notes with correct headings and tables.
- Mobile document scanners that output clean Markdown/JSON even under glare and skew.