
DeepSeek-OCR 2: Visual Causal Flow

Intermediate
Haoran Wei, Yaofeng Sun, Yukun Li Ā· 1/28/2026
arXiv Ā· PDF

Key Summary

  • DeepSeek-OCR 2 teaches a computer to ā€œreadā€ pictures of documents in a smarter order, more like how people read.
  • It swaps the usual vision encoder for a small language model so the system can think causally about what to look at next.
  • Special learnable tokens (causal flow tokens) decide a new reading order by looking at all image pieces and earlier decisions.
  • A custom attention mask lets image pieces see each other both ways, while the learnable tokens build the reading order one step at a time, looking only at earlier steps like a storyteller.
  • Only the reordered tokens are sent to the decoder LLM, creating two stages of causal reasoning that handle images with fewer tokens.
  • On OmniDocBench v1.5, DeepSeek-OCR 2 reaches 91.09% and cuts reading-order mistakes (edit distance) from 0.085 to 0.057.
  • It keeps the token budget small (256–1120) yet beats or matches bigger systems on reading order, making it faster and cheaper.
  • The method is a path toward one unified encoder that could work for images, text, and audio by swapping query tokens.
  • It works great on many document types but still struggles with very text-heavy newspapers due to token limits and data scarcity.

Why This Research Matters

Documents power daily life—bills, schoolwork, medical forms, and research all rely on the right content in the right order. DeepSeek-OCR 2 reads pages more like people do, so captions match the right figures and steps stay in sequence, reducing confusion. Because it uses fewer tokens, it’s faster and cheaper, which helps services scale to millions of pages. Its LM-style encoder is a step toward one shared brain that can also handle audio and text by swapping the learnable queries. Cleaner reading order also makes downstream apps—like search, summarization, and question answering—much more reliable. In short, it turns messy layouts into understandable stories that computers and people can trust.

Detailed Explanation


01Background & Problem Definition

You know how you don’t always read a page from the very top-left to the very bottom-right? If there’s a big title, your eyes jump there first. If there’s a picture with a caption, you peek at the picture, then the caption, then maybe the paragraph it belongs to. Your eyes follow meaning, not just straight lines.

šŸž Hook: Imagine a scavenger hunt on a worksheet. You don’t look at each square in order; you jump to the boxes that look like clues. 🄬 The Concept: Visual Tokens are tiny chunks of an image that a computer uses to describe what it sees, like puzzle pieces of the picture.

  • How it works:
    1. Split the image into small patches.
    2. Turn each patch into a number vector (a token).
    3. Use these tokens to reason about the whole picture.
  • Why it matters: Without tokens, the computer only has raw pixels and can’t easily compare parts or focus on important pieces. šŸž Anchor: A homework page becomes 256–1120 small tokens, each summarizing a tiny area like a letter, a line, or a cell in a table.
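
To make the patch-to-token steps above concrete, here is a minimal PyTorch sketch. The patch size (16), embedding width (768), and the `nn.Unfold`/`nn.Linear` pieces are illustrative assumptions, not DeepSeek-OCR 2's actual tokenizer, which also compresses roughly 16Ɨ further (down to 256 tokens for a 1024Ɨ1024 view).

```python
# Minimal sketch: cut an image into patches and embed each patch as a token.
# Patch size, embedding width, and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

patch_size = 16                                   # assumed patch size
image = torch.randn(1, 3, 1024, 1024)             # one 1024x1024 RGB page

# 1) Split the image into non-overlapping 16x16 patches and flatten each one.
unfold = nn.Unfold(kernel_size=patch_size, stride=patch_size)
patches = unfold(image).transpose(1, 2)           # (1, 4096, 3*16*16)

# 2) Turn each flattened patch into a number vector: a visual token.
embed = nn.Linear(3 * patch_size * patch_size, 768)
visual_tokens = embed(patches)                    # (1, 4096, 768)

# 3) Later stages reason over these tokens; the real tokenizer additionally
#    compresses them (~16x) down to 256 tokens per global view.
print(visual_tokens.shape)                        # torch.Size([1, 4096, 768])
```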

šŸž Hook: You know how you underline the most helpful sentence in a paragraph? Your brain is paying extra attention to it. 🄬 The Concept: Attention Mechanism helps the computer focus on the most useful tokens when making decisions.

  • How it works:
    1. Look at all tokens.
    2. Score how helpful each token is for the current step.
    3. Mix information, giving higher weight to high-scoring tokens.
  • Why it matters: Without attention, the computer treats every token the same, like underlining every word in a book. šŸž Anchor: To answer ā€œWhat’s the table title?ā€, attention lifts the title tokens higher than the page number in the corner.
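
A bare-bones version of this score-and-mix step, assuming single-head scaled dot-product attention; real models add learned projections and many heads.

```python
# Minimal single-head attention sketch: score every token, then mix by weight.
import torch
import torch.nn.functional as F

d = 64
query = torch.randn(1, 1, d)       # the "current step" asking a question
tokens = torch.randn(1, 10, d)     # 10 visual tokens (used as keys and values)

# 1) Score how helpful each token is for this query (scaled dot product).
scores = query @ tokens.transpose(1, 2) / d ** 0.5   # (1, 1, 10)

# 2) Turn scores into weights that sum to 1.
weights = F.softmax(scores, dim=-1)

# 3) Mix the tokens, giving more say to the high-scoring ones.
answer = weights @ tokens                             # (1, 1, d)
```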

šŸž Hook: Sometimes you skim a paragraph, then reread it backward to catch details you missed. 🄬 The Concept: Bidirectional Attention lets tokens see each other both ways, gathering global context across the image.

  • How it works:
    1. Each image token can look at all other image tokens.
    2. It blends details from everywhere (left, right, up, down) at once.
  • Why it matters: Without two-way visibility, the model might miss connections between, say, a figure and its caption. šŸž Anchor: A chart token can directly see the caption token that explains its y-axis, even if they’re far apart on the page.

šŸž Hook: Think of wearing special glasses that only let you see what’s allowed during a game. 🄬 The Concept: Custom Attention Mask is a rulebook that tells which tokens are allowed to look at which other tokens.

  • How it works:
    1. Build a 2-part mask: one part allows all-to-all for image tokens (bidirectional), one part allows only look-back for special query tokens (causal).
    2. Apply the mask to control information flow in each transformer layer.
  • Why it matters: Without the mask, the model either can’t reorder tokens causally or loses global image context. šŸž Anchor: Image pieces freely talk to each other, but the ā€œreaderā€ tokens only peek at earlier steps, just like reading left-to-right.

šŸž Hook: Sometimes reading a map with a good guidebook makes everything click. 🄬 The Concept: Language Model as Vision Encoder means using a small language model to process image tokens so it can reason about sequence and meaning.

  • How it works:
    1. Feed image tokens as a prefix to a compact LLM.
    2. Add learnable query tokens that act like a smart reader.
    3. Let the LLM-style layers reorder information with the custom mask.
  • Why it matters: Without an LLM-style encoder, the system sticks to a rigid grid order and struggles with the true reading logic of complex pages. šŸž Anchor: The model uses Qwen2-0.5B-sized layers to think about images like a sentence that needs the right word order.

The world before: Most vision-language models flattened images into a single long line of patches and processed them like text—top-left to bottom-right—with positional encodings that assume this order is meaningful. That makes simple pages okay but tangles up multi-column layouts, sidebars, math formulas, and tables. When the page’s true reading order zigzags, the fixed order makes the model guess.

šŸž Hook: In class, students improve by asking better questions over time. 🄬 The Concept: Learnable Queries are special tokens that the model can adjust during training to pull out the most useful information.

  • How it works:
    1. Create query tokens that attend to the image tokens.
    2. Update them as the model learns, so they become skilled ā€œaskers.ā€
    3. Use their outputs as a smart summary of what matters.
  • Why it matters: Without learnable queries, the model can’t adapt what it asks for different layouts or tasks. šŸž Anchor: Some queries become great at finding headers; others specialize in table rows or figure captions.
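
The general mechanism can be sketched as a trainable tensor that cross-attends to the image tokens; the shapes, head count, and use of `nn.MultiheadAttention` below are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of learnable queries: trainable vectors that "ask" the image tokens.
import torch
import torch.nn as nn

d, num_queries, num_image_tokens = 256, 32, 544
queries = nn.Parameter(torch.randn(num_queries, d))        # updated during training
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

image_tokens = torch.randn(1, num_image_tokens, d)          # from the tokenizer
q = queries.unsqueeze(0)                                     # (1, 32, d)

# Each query pulls out whatever mix of image tokens it has learned to look for.
summary, _ = cross_attn(q, image_tokens, image_tokens)       # (1, 32, d)
```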

Failed attempts: Object detectors (like DETR) and projectors (like BLIP-2’s Q-former) used parallel queries to compress or find objects, but those queries talk to each other in both directions and don’t enforce a one-direction reading flow. Cross-attention encoders separated from LLM decoders sometimes fail to converge because image pieces can’t interact richly enough.

šŸž Hook: When you solve a mystery, each clue you find changes what you look for next. 🄬 The Concept: Causal Reasoning means picking the next step based on what you’ve already discovered.

  • How it works:
    1. Start with what you know.
    2. Choose the next place to look using logic from previous steps.
    3. Repeat to build a meaningful sequence.
  • Why it matters: Without causality, the model can’t create a sensible reading path through a messy page. šŸž Anchor: After spotting a section title, the next fixations go to its paragraph, not a random footnote.

The gap: We needed a way to let the encoder itself learn a flexible, meaningful order of image tokens before handing them to the language model, so the language model isn’t forced to untangle a jumbled, grid-fixed sequence.

šŸž Hook: Think of a conductor pointing to which instrument should play next. 🄬 The Concept: Causal Flow Tokens are learnable queries that produce a new, meaningful reading order of the image tokens.

  • How it works:
    1. Give them causal attention so each step can only look back.
    2. Let them attend to all image tokens plus earlier flow tokens.
    3. Output a reordered sequence that captures the page’s logic.
  • Why it matters: Without these tokens, the model can’t turn 2D layouts into a smart 1D story for the decoder to read. šŸž Anchor: They pick ā€œTitle → Author → Abstract → Figure 1 → Caption,ā€ not just ā€œleft-to-right.ā€

Real stakes: Misreading order can scramble facts: prices matching the wrong items, steps in the wrong sequence on a lab sheet, or mixing formulas with unrelated text. That’s why people care—better reading order means fewer mistakes when processing bills, forms, homework, research papers, and reports.

02Core Idea

The ā€œAha!ā€ moment in one sentence: If we let a small language-model-style encoder causally reorder image tokens before decoding, the whole system can read images like a person—following meaning, not just the grid.

šŸž Hook: Imagine turning a messy desk into a neat to-do list. 🄬 The Concept: DeepEncoder V2 is a vision encoder that uses a compact LLM with a custom attention mask and learnable causal flow tokens to reorder visual tokens by meaning.

  • How it works:
    1. Compress the image into visual tokens (global and local crops).
    2. Feed those tokens as a prefix into an LLM-style stack with a special mask.
    3. Add the same number of causal flow tokens as visual tokens.
    4. Let flow tokens look at all image tokens and earlier flow tokens.
    5. Keep only the flow tokens’ outputs and pass them to the decoder LLM.
  • Why it matters: Without this reordering stage, the decoder starts from a scrambled line of patches, wasting capacity fixing ordering instead of understanding content. šŸž Anchor: A two-column article with sidebars becomes a sensible reading list the decoder can easily narrate.
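
A hedged sketch of this five-step flow, assuming a single standard transformer layer stands in for the LLM-style stack and that visual tokens are blocked from attending to flow tokens (the text only specifies the other mask rules). The helper name `flow_mask` and all shapes are invented for the example.

```python
# Illustrative data flow only (not the released model): visual tokens and an
# equal number of learnable flow tokens share one masked self-attention stack,
# and only the flow-token outputs are kept for the decoder.
import torch
import torch.nn as nn

d, m = 256, 544          # embedding width, number of visual tokens
n = m                    # same number of causal flow tokens

def flow_mask(m: int, n: int) -> torch.Tensor:
    """Boolean mask where True marks attention that must be blocked."""
    mask = torch.zeros(m + n, m + n, dtype=torch.bool)
    # Flow tokens may not see later flow tokens (causal within the flow block).
    mask[m:, m:] = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    # Assumption: visual tokens do not attend to flow tokens at all.
    mask[:m, m:] = True
    return mask

visual = torch.randn(1, m, d)               # stand-in for tokenizer output
flow = nn.Parameter(torch.randn(1, n, d))   # learnable causal flow tokens

layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
hidden = layer(torch.cat([visual, flow], dim=1), src_mask=flow_mask(m, n))

reordered = hidden[:, m:, :]    # keep only the flow-token states (the new order)
print(reordered.shape)          # torch.Size([1, 544, 256]) -> fed to the decoder
```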

Three analogies for the same idea:

  • Tour guide: The visual tokens are all the sites in a city; the causal flow tokens plan the best route so you don’t zigzag randomly.
  • Recipe card: Ingredients (visual tokens) are scattered; the flow tokens arrange them into steps so cooking (decoding) is smooth.
  • School notes: You have facts everywhere; the flow tokens turn them into an outline the teacher can read aloud.

Before vs. After:

  • Before: Fixed raster order; decoder struggles with layout logic; needs more tokens and time to make sense of complex pages.
  • After: Causally reordered sequence; decoder focuses on meaning; equal or fewer tokens achieve better accuracy and lower reading-order errors.

Why it works (intuition without equations):

  • The image tokens keep bidirectional attention so they form a strong global picture—like everyone in a class discussing freely.
  • The flow tokens have causal attention, so they build a one-way story—like a student presenting slides in order.
  • Because flow tokens can see all image tokens but only earlier flow tokens, each decision depends on what’s already been decided, not the future—this enforces a stable, human-like reading path.
  • Using a compact LLM here aligns image processing with the decoder’s own causal logic, creating a two-cascade system of 1D reasoners that approximates the 2D reading process.

Building blocks, introduced with the sandwich pattern:

šŸž Hook: Think of cutting a big poster into sticker tiles. 🄬 The Concept: Visual Tokens are the tiles we get from compressing global and local image crops.

  • How it works: A light tokenizer turns a 1024Ɨ1024 view into 256 tokens and each 768Ɨ768 local crop into 144 tokens; up to 6 locals give at most 1120 tokens total.
  • Why it matters: This keeps compute small while preserving important detail. šŸž Anchor: A full page plus a few close-ups turn into a compact set of tokens.

šŸž Hook: When the whole class chats, ideas fly both ways. 🄬 The Concept: Bidirectional Attention among visual tokens lets far-apart pieces (like figure and caption) inform each other.

  • Why it matters: It preserves CLIP-like global understanding without forcing a fixed order. šŸž Anchor: The model knows the table body belongs to its header, even across a page.

šŸž Hook: A game rule sheet tells who can talk to whom and when. 🄬 The Concept: Custom Attention Mask stitches two worlds: free-talk for image tokens and one-way talk for flow tokens.

  • Why it matters: It’s the guardrail that makes reordering causal while keeping global vision intact. šŸž Anchor: The left half of the mask is open chat; the right half is a one-way line.

šŸž Hook: A small storyteller makes sense of scattered facts. 🄬 The Concept: Language Model as Vision Encoder uses a compact LLM (Qwen2-0.5B scale) to perform sequence-aware reasoning over images.

  • Why it matters: It naturally supports causal flow and inherits efficient LLM tricks. šŸž Anchor: It’s like turning picture pieces into a sensible sentence the big decoder can expand.

šŸž Hook: Students who ask sharper questions learn faster. 🄬 The Concept: Learnable Queries (flow tokens) adapt during training to pull the right information at the right time.

  • Why it matters: Static questions miss important patterns in diverse layouts. šŸž Anchor: Some queries get great at math regions; others at dense paragraphs.

šŸž Hook: Solving a maze means each turn depends on your last one. 🄬 The Concept: Causal Reasoning ensures each new step follows from what’s already chosen, building a stable reading order.

  • Why it matters: It avoids peeking into the future and keeps the sequence logical. šŸž Anchor: After picking a header, the next step picks its paragraph, not a random footer.

Put together, DeepEncoder V2 is the conductor: bidirectional vision for rich context, causal flow for order, and a compact LLM to align both with the decoder’s language-style reasoning.

03Methodology

At a high level: Image → Tokenizer (compress to visual tokens) → LLM-style encoder with custom mask (add causal flow tokens; reorder) → Keep only flow tokens → MoE decoder LLM → Output text/structure.

Step 1: Vision Tokenizer (compact, 16Ɨ compression)

  • What happens: The image is seen in one global view (1024Ɨ1024 → 256 tokens) plus 0–6 local crops (each 768Ɨ768 → 144 tokens). Total tokens range from 256 to 1120.
  • Why this exists: It squeezes the image into a small, information-rich set of tokens so later global attention is affordable.
  • Example: A research paper page with a figure gets 256 global tokens plus 2 locals near the figure and table (2Ɨ144), totaling 544 tokens.
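
The budget arithmetic in Step 1 can be checked with a tiny helper; the 256 and 144 per-view counts and the 6-crop cap come from the text, while `visual_token_budget` itself is just an illustrative name.

```python
# Token budget for one page: 256 global tokens plus 144 per local crop (max 6).
def visual_token_budget(num_local_crops: int) -> int:
    assert 0 <= num_local_crops <= 6, "the paper caps local crops at 6"
    return 256 + 144 * num_local_crops

print(visual_token_budget(0))   # 256  (global view only)
print(visual_token_budget(2))   # 544  (the research-paper example above)
print(visual_token_budget(6))   # 1120 (the maximum budget)
```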

Step 2: Prefix the visual tokens into an LLM-style encoder

  • What happens: All visual tokens are concatenated and fed as a prefix into a compact decoder-only transformer initialized from Qwen2-0.5B-scale weights.
  • Why this exists: Keeping image tokens inside the same stack as the queries lets them interact richly layer by layer. Attempts to isolate them in a separate encoder-decoder with cross-attention didn’t converge well because image tokens couldn’t talk enough.
  • Example: The 544 tokens enter the stack and immediately gain global, bidirectional visibility among themselves.

Step 3: Add causal flow tokens with equal count

  • What happens: Append the same number of learnable causal flow tokens as visual tokens (m visual tokens, n = m flow tokens). These queries can attend to all visual tokens and to earlier flow tokens, never the future ones.
  • Why this exists: Equal capacity gives space for re-fixations—if the page needs to revisit a region, some flow tokens can do that without crowding others out.
  • Example: With 544 visual tokens, we add 544 flow tokens, creating space for both first-pass picks (titles) and second-pass details (captions).

Step 4: Apply the custom attention mask (the secret traffic rules)

  • What happens: Build a block mask where the visual-token block is fully bidirectional and the flow-token block is lower-triangular (causal), so each flow token attends only to itself and earlier flow tokens. Cross-block attention lets flow tokens see all visual tokens at every step.
  • Why this exists: It preserves CLIP-like global modeling for the image while enforcing a one-way, human-like reading path in the flow.
  • Example: Flow token #37 can see all image tokens and flow tokens #1–#36, but not #38+.
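
These rules can be written out explicitly, with the flow-token-#37 example checked at the end. Letting each flow token see itself, and blocking visual tokens from seeing flow tokens, are standard-causal assumptions rather than details spelled out in the text.

```python
# The mask as explicit rules; indices 0..m-1 are visual tokens, m.. are flow tokens.
m, n = 544, 544

def allowed(i: int, j: int) -> bool:
    """May token i attend to token j under the custom mask?"""
    if i < m:                      # visual tokens: free, bidirectional chat
        return j < m               # (assumed: they never look at flow tokens)
    return j < m or j <= i         # flow tokens: all visual + self and earlier flow

# Flow token #37 (1-indexed) sits at row m + 36.
row = m + 36
assert all(allowed(row, j) for j in range(m))   # sees every visual token
assert allowed(row, m + 35)                     # sees flow token #36 (earlier)
assert not allowed(row, m + 37)                 # cannot see flow token #38 (later)
print("mask rules check out for flow token #37")
```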

Step 5: Two-stage cascade causal reasoning

  • What happens: Inside the encoder, flow tokens progressively impose a semantic order on visual content. After the last layer, we discard the visual-token outputs and keep only the final flow-token states (the reordered sequence). Then the decoder LLM autoregressively reads these in order to produce text, structure, and answers.
  • Why this exists: This splits the job: the encoder does ordering (what to read next), the decoder does reasoning (what to say next). Two 1D causal stages cooperate to approximate 2D page logic.
  • Example: The encoder outputs a sequence like [Title → Authors → Abstract → Fig1 → Caption → Section 1 → Table 1 → Header → Body…]; the decoder then writes a clean, correctly ordered transcript or JSON structure.

Step 6: Multi-crop strategy and token budgeting

  • What happens: Use up to 6 local crops to focus on dense areas, capping the total at 1120 tokens to match practical budgets (e.g., Gemini-3 Pro).
  • Why this exists: Keeps the method fast and usable in production while allowing detail zoom-ins when needed.
  • Example: A long table gets two extra crops so the model captures headers and body clearly without exploding the token count.

Step 7: Training pipeline (three stages)

  • Stage 1 (Encoder pretraining): Train the tokenizer + LLM-style encoder jointly with a lightweight decoder using next-token prediction over image-text pairs at 768 and 1024 resolutions. Outcome: the encoder learns feature extraction, compression, and early reordering.
  • Stage 2 (Query enhancement): Integrate with the DeepSeek MoE decoder; freeze the tokenizer, train encoder + decoder together using multi-crop, strengthening the flow tokens’ ability to order and compress.
  • Stage 3 (Continue-training LLM): Freeze the entire encoder; train only the decoder for speed and to better consume the reordered outputs.
  • Why these stages exist: Separate concerns—first learn good tokens and ordering, then align with the decoder, then scale training throughput.
  • Example: After stage 2, the model gets much better at picking captions right after figures; after stage 3, it writes cleaner outputs with fewer repetitions.
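
A minimal sketch of the freeze schedule across the three stages, assuming three module handles (`tokenizer`, `encoder`, `decoder`) and placeholder layers; the real components, data mix, and optimizers are of course far larger and not reproduced here.

```python
# Sketch of the three-stage freeze schedule (module names are assumptions).
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder modules standing in for the real components.
tokenizer, encoder, decoder = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

# Stage 1: pretrain tokenizer + encoder (with a lightweight decoder) end to end.
for m in (tokenizer, encoder, decoder):
    set_trainable(m, True)

# Stage 2: freeze the tokenizer; train encoder + decoder together with multi-crop.
set_trainable(tokenizer, False)
set_trainable(encoder, True)
set_trainable(decoder, True)

# Stage 3: freeze the whole encoder side; continue training only the decoder.
set_trainable(tokenizer, False)
set_trainable(encoder, False)
set_trainable(decoder, True)
```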

The secret sauce

  • Equal cardinality of flow and visual tokens: enough slots for re-fixations and multi-pass ordering.
  • Custom mask that marries bidirectional vision with causal flow: global context plus narrative order.
  • LM-as-encoder: aligns image processing with language-style causality and inherits LLM efficiencies (MoE, efficient attention).
  • Keep-only-flow to the decoder: saves compute and forces the encoder to distill the right order and content.

Concrete data walk-through

  • Input: A magazine page with a main title, two columns, a photo, and a caption.
  • Tokenizer: 256 global + 2 locals near the photo and the end of column two = 544 visual tokens.
  • Encoder: Flow tokens learn to output [Main Title → Byline → Left Column Para 1–3 → Photo → Caption → Right Column Para 1–3 → Footer].
  • Decoder: Generates text in that order and a structured JSON (sections, figures, captions) with low reading-order edit distance.
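
Purely as an illustration of what a structured, order-preserving output for this page could look like, here is a hypothetical record; DeepSeek-OCR 2's actual output schema is not specified in this summary.

```python
# Hypothetical structured output for the magazine-page walk-through above.
import json

page = {
    "reading_order": [
        {"type": "title",     "text": "Main Title"},
        {"type": "byline",    "text": "Byline"},
        {"type": "paragraph", "column": "left",  "text": "Left column, paragraphs 1-3"},
        {"type": "figure",    "id": "photo-1"},
        {"type": "caption",   "refers_to": "photo-1", "text": "Photo caption"},
        {"type": "paragraph", "column": "right", "text": "Right column, paragraphs 1-3"},
        {"type": "footer",    "text": "Footer"},
    ]
}
print(json.dumps(page, indent=2))
```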

04Experiments & Results

The test and why it matters

  • Benchmark: OmniDocBench v1.5 with 1,355 real document pages (magazines, papers, reports; English and Chinese).
  • What’s measured: Overall accuracy and edit distances for text, formulas, tables, and—crucially—reading order (how well the model follows the page’s true sequence).
  • Why: Reading order is the heartbeat of understanding documents. Getting the right content in the wrong order still confuses people and downstream apps.
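
Edit distance here means the minimum number of insert, delete, or substitute operations needed to turn the predicted sequence into the reference, usually normalized by length. The snippet below is one common way to compute it and is not necessarily the benchmark's exact protocol.

```python
# Normalized Levenshtein (edit) distance between predicted and reference order.
def edit_distance(pred, ref):
    dp = list(range(len(ref) + 1))          # distances against an empty prediction
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            # delete p, insert r, or substitute/match p with r
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (p != r))
    return dp[-1] / max(len(pred), len(ref), 1)

pred = ["title", "caption", "figure", "body"]
ref  = ["title", "figure", "caption", "body"]
print(edit_distance(pred, ref))   # 0.5 -> two of four positions need fixing
```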

The competition

  • Pipelines and end-to-end models, including strong baselines: DeepSeek-OCR (previous gen), Gemini-3 Pro (similar token budget), InternVL, Qwen-VL families, MinerU, OCRVerse, and more.

Scoreboard with context

  • DeepSeek-OCR 2: 91.09% overall with a max of 1120 visual tokens. This is like getting an A when many peers are at B+ to Aāˆ’, but using fewer study notes.
  • Improvement over DeepSeek-OCR: +3.73% overall under similar data sources and lower token cap. That’s moving from a solid 87.36% to a standout 91.09% while shrinking the visual token budget ceiling by 36 tokens (1156 → 1120).
  • Reading order edit distance: 0.085 → 0.057 (lower is better). Think of this as cutting order mistakes by about a third, with fewer mix-ups of what comes first.
  • Category-level edit distances (vs. Gemini-3 Pro, similar 1120 budget): DeepSeek-OCR 2 reaches 0.100 overall ED vs. Gemini’s 0.115, showing stronger parsing with similar resources.

Surprising and notable findings

  • Small but mighty: With an LLM-style encoder of about 0.5B parameters, the system improves order and content without inflating compute like massive multimodal stacks.
  • Order first, reason second: Cascading two causal reasoners (encoder then decoder) appears more effective than relying on the decoder alone to untangle grid-ordered tokens.
  • Practical readiness: In production, where there’s no ground truth, repetition rate dropped from 6.25% to 4.17% on images and from 3.69% to 2.88% on PDFs—fewer loops and cleaner outputs.

Where it shines and where it stumbles (from 9 document types)

  • Big wins: Magazines, academic papers, reports, colorful textbooks—reading order improves across the board, often by large margins.
  • Weak spot: Newspapers remain tough (very dense text). Causes:
    1. Token budget pressure—too many small items for a 1120 cap.
    2. Limited training data—about 250k samples is not enough for that style.
  • Easy fix ideas: Add more local crops on very dense pages or increase training data for newspapers.

Takeaway numbers you can remember

  • 256–1120 visual tokens per page.
  • 3.73% overall boost vs. previous DeepSeek-OCR.
  • Reading order ED down to 0.057.
  • Production repetition down ~2 percentage points on images and ~0.8 on PDFs.

What these mean in plain terms

  • Better order = fewer misunderstandings: Captions go with the right figures, steps stay in sequence, and table headers match their rows.
  • Better compression = faster and cheaper: Doing more with fewer tokens means quicker, less expensive processing at scale.
  • Strong promise for unified multimodality: If this works for images, similar LLM-style encoders with modality-specific queries could compress and reorder audio and text, too.

05Discussion & Limitations

Limitations

  • Token budget sensitivity: Extremely dense pages (e.g., newspapers) can exceed the comfortable 1120-token cap, leading to missed details or higher text edit distance.
  • Data imbalance: Underrepresentation of certain document types (like newspapers) limits specialization and hurts performance there.
  • Single-pass flow length: Using flow tokens equal to visual tokens is good, but deeper multi-hop re-examinations might need even more flow capacity.
  • Domain specificity: Trained heavily on OCR/document tasks; general visual reasoning (e.g., open-world scenes) remains to be fully validated.

Required resources

  • Training: Multi-node GPU clusters (e.g., 160 GPUs) for staged training; staged freezing helps manage cost.
  • Inference: Modest compared to huge multimodal models; fits real-world budgets thanks to 16Ɨ compression and 256–1120 token caps.

When not to use

  • Ultra-dense microprint scans where even many local crops cannot capture all details within budget.
  • Non-document images that require spatially precise pixel outputs (e.g., medical segmentation) rather than reading order and text extraction.
  • Tasks needing strict 2D geometric outputs without a textual or sequential endpoint.

Open questions

  • How long should the causal flow be for true multi-hop reordering? Would 2Ɨ or 3Ɨ flow length over visual tokens boost revisiting power?
  • Can the same LM-as-encoder handle audio and video streams with modality-specific query embeddings without new architectures?
  • What’s the best curriculum for training order—do we teach titles and captions first, or all at once?
  • How does this approach scale with even tighter token budgets or more aggressive compression ratios?
  • Can we learn when to add more local crops on-the-fly based on early flow-token signals?

06Conclusion & Future Work

Three-sentence summary

  • DeepSeek-OCR 2 introduces DeepEncoder V2, an LLM-style vision encoder that causally reorders image tokens before decoding so pages are read by meaning, not just by grid position.
  • A custom attention mask lets visual tokens see globally while learnable causal flow tokens build a one-way reading path; only the reordered flow outputs go to the decoder.
  • This two-cascade causal setup improves accuracy (91.09% on OmniDocBench v1.5) and slashes reading-order errors, all within a compact token budget (256–1120).

Main achievement

  • Proving that an LM-as-vision-encoder with causal flow tokens can outperform fixed-order pipelines on document reading, offering a practical route to genuine 2D reasoning through two 1D causal stages.

Future directions

  • Increase flow capacity for multi-hop revisits; adaptive cropping for dense pages; broaden training to newspaper-heavy and low-resource layouts; extend the same encoder design to audio and video with modality-specific queries.

Why remember this

  • It’s a simple, powerful idea: first reorder by meaning, then reason about content. By aligning vision with language’s causal nature, DeepSeek-OCR 2 reads like we do—one sensible step at a time—making document AI more accurate, efficient, and closer to truly understanding pages.

Practical Applications

  • Automated PDF-to-structured-JSON conversion with correct reading order for analytics.
  • Enterprise invoice and receipt processing with fewer mismatched line items and prices.
  • Academic paper ingestion where figures, captions, and references are correctly linked.
  • K-12 worksheet digitization that preserves step-by-step instructions and answer keys.
  • Compliance document review (policies, contracts) with faithful section ordering.
  • Accessible reading tools that narrate complex pages in a sensible order for screen readers.
  • Batch pretraining data generation for LLMs with lower repetition and cleaner layouts.
  • Form understanding in banking and healthcare with robust table header-to-row matching.
  • Summarization pipelines that avoid mixing captions with unrelated paragraphs.
  • Knowledge-base building from reports where section hierarchies are preserved.
Tags: DeepSeek-OCR 2 Ā· DeepEncoder V2 Ā· visual tokens Ā· causal flow tokens Ā· learnable queries Ā· custom attention mask Ā· bidirectional attention Ā· language model as vision encoder Ā· document OCR Ā· reading order Ā· token compression Ā· multi-crop strategy Ā· Mixture-of-Experts decoder Ā· OmniDocBench