DeepSeek-OCR 2: Visual Causal Flow
Key Summary
- DeepSeek-OCR 2 teaches a computer to "read" pictures of documents in a smarter order, more like how people read.
- It swaps the usual vision encoder for a small language model so the system can think causally about what to look at next.
- Special learnable tokens (causal flow tokens) decide a new reading order by looking at all image pieces and earlier decisions.
- A custom attention mask lets image pieces see each other both ways, while the learnable tokens move forward one step at a time, each seeing only earlier steps, like a storyteller.
- Only the reordered tokens are sent to the decoder LLM, creating two stages of causal reasoning that handle images with fewer tokens.
- On OmniDocBench v1.5, DeepSeek-OCR 2 reaches 91.09% overall and cuts reading-order mistakes (edit distance) from 0.085 to 0.057.
- It keeps the token budget small (256–1120) yet beats or matches bigger systems on reading order, making it faster and cheaper.
- The method is a path toward one unified encoder that could work for images, text, and audio by swapping query tokens.
- It works well on many document types but still struggles with very text-heavy newspapers due to token limits and data scarcity.
Why This Research Matters
Documents power daily life: bills, schoolwork, medical forms, and research all rely on the right content in the right order. DeepSeek-OCR 2 reads pages more like people do, so captions match the right figures and steps stay in sequence, reducing confusion. Because it uses fewer tokens, it's faster and cheaper, which helps services scale to millions of pages. Its LM-style encoder is a step toward one shared brain that can also handle audio and text by swapping the learnable queries. Cleaner reading order also makes downstream apps, such as search, summarization, and question answering, much more reliable. In short, it turns messy layouts into understandable stories that computers and people can trust.
Detailed Explanation
01 Background & Problem Definition
You know how you don't always read a page from the very top-left to the very bottom-right? If there's a big title, your eyes jump there first. If there's a picture with a caption, you peek at the picture, then the caption, then maybe the paragraph it belongs to. Your eyes follow meaning, not just straight lines.
Hook: Imagine a scavenger hunt on a worksheet. You don't look at each square in order; you jump to the boxes that look like clues. The Concept: Visual Tokens are tiny chunks of an image that a computer uses to describe what it sees, like puzzle pieces of the picture.
- How it works:
- Split the image into small patches.
- Turn each patch into a number vector (a token).
- Use these tokens to reason about the whole picture.
- Why it matters: Without tokens, the computer only has raw pixels and can't easily compare parts or focus on important pieces. Anchor: A homework page becomes 256–1120 small tokens, each summarizing a tiny area like a letter, a line, or a cell in a table.
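If you like code, here is a minimal sketch of the patch-to-token idea in PyTorch. It is not DeepSeek's actual tokenizer (the class name `PatchTokenizer` and all sizes are invented for illustration); it only shows the standard trick of cutting an image into patches with a strided convolution and treating each patch as one token vector. The real system compresses these further before the encoder sees them.

```python
import torch
import torch.nn as nn

class PatchTokenizer(nn.Module):
    """Toy patch tokenizer: cut the image into square tiles and project each
    tile to a vector, giving one visual token per tile."""
    def __init__(self, patch_size=16, channels=3, dim=256):
        super().__init__()
        # A convolution with stride == kernel size is "split into patches and
        # linearly project each one" in a single step.
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, image):                       # image: (batch, 3, H, W)
        grid = self.proj(image)                     # (batch, dim, H/16, W/16)
        return grid.flatten(2).transpose(1, 2)      # (batch, num_tokens, dim)

tokens = PatchTokenizer()(torch.randn(1, 3, 1024, 1024))
print(tokens.shape)  # torch.Size([1, 4096, 256]); the paper's tokenizer compresses further, to 256
```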
Hook: You know how you underline the most helpful sentence in a paragraph? Your brain is paying extra attention to it. The Concept: Attention Mechanism helps the computer focus on the most useful tokens when making decisions.
- How it works:
- Look at all tokens.
- Score how helpful each token is for the current step.
- Mix information, giving higher weight to high-scoring tokens.
- Why it matters: Without attention, the computer treats every token the same, like underlining every word in a book. Anchor: To answer "What's the table title?", attention lifts the title tokens higher than the page number in the corner.
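As a rough picture of those three steps, here is a tiny single-query attention function in plain NumPy. It is a simplified illustration (one query, one head, no learned projections), not the model's actual attention code.

```python
import numpy as np

def attend(query, keys, values):
    """Score every token against the query, softmax the scores into weights,
    and return the weighted mix of the token values."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # step 2: how helpful is each token?
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # weights now sum to 1
    return weights @ values                            # step 3: mix, favoring high scores

rng = np.random.default_rng(0)
keys = values = rng.normal(size=(5, 8))   # 5 tokens, 8 numbers each
query = rng.normal(size=8)                # e.g. "what's the table title?"
print(attend(query, keys, values).shape)  # (8,)
```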
Hook: Sometimes you skim a paragraph, then reread it backward to catch details you missed. The Concept: Bidirectional Attention lets tokens see each other both ways, gathering global context across the image.
- How it works:
- Each image token can look at all other image tokens.
- It blends details from everywhere (left, right, up, down) at once.
- Why it matters: Without two-way visibility, the model might miss connections between, say, a figure and its caption. Anchor: A chart token can directly see the caption token that explains its y-axis, even if they're far apart on the page.
Hook: Think of wearing special glasses that only let you see what's allowed during a game. The Concept: Custom Attention Mask is a rulebook that tells which tokens are allowed to look at which other tokens.
- How it works:
- Build a 2-part mask: one part allows all-to-all for image tokens (bidirectional), one part allows only look-back for special query tokens (causal).
- Apply the mask to control information flow in each transformer layer.
- Why it matters: Without the mask, the model either can't reorder tokens causally or loses global image context. Anchor: Image pieces freely talk to each other, but the "reader" tokens only peek at earlier steps, just like reading left-to-right.
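Here is one way such a two-part mask could be written down, as a toy NumPy sketch based only on the description above (the actual implementation details, such as whether image tokens may also look at the reader tokens, could differ).

```python
import numpy as np

def build_two_part_mask(num_visual, num_flow):
    """True means "row token may look at column token".
    Sequence layout: [visual tokens | flow tokens]."""
    size = num_visual + num_flow
    allowed = np.zeros((size, size), dtype=bool)
    allowed[:num_visual, :num_visual] = True                # image tokens: all-to-all
    allowed[num_visual:, :num_visual] = True                # flow tokens see every image token
    allowed[num_visual:, num_visual:] = np.tril(            # flow tokens: look-back only
        np.ones((num_flow, num_flow), dtype=bool))
    return allowed

print(build_two_part_mask(3, 3).astype(int))
# Rows 0-2 (image) see only columns 0-2; flow row 3+k sees columns 0-2 plus flow tokens 0..k.
```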
Hook: Sometimes reading a map with a good guidebook makes everything click. The Concept: Language Model as Vision Encoder means using a small language model to process image tokens so it can reason about sequence and meaning.
- How it works:
- Feed image tokens as a prefix to a compact LLM.
- Add learnable query tokens that act like a smart reader.
- Let the LLM-style layers reorder information with the custom mask.
- Why it matters: Without an LLM-style encoder, the system sticks to a rigid grid order and struggles with the true reading logic of complex pages. Anchor: The model uses Qwen2-0.5B-sized layers to think about images like a sentence that needs the right word order.
The world before: Most vision-language models flattened images into a single long line of patches and processed them like text, top-left to bottom-right, with positional encodings that assume this order is meaningful. That makes simple pages okay but tangles up multi-column layouts, sidebars, math formulas, and tables. When the page's true reading order zigzags, the fixed order makes the model guess.
Hook: In class, students improve by asking better questions over time. The Concept: Learnable Queries are special tokens that the model can adjust during training to pull out the most useful information.
- How it works:
- Create query tokens that attend to the image tokens.
- Update them as the model learns, so they become skilled "askers."
- Use their outputs as a smart summary of what matters.
- Why it matters: Without learnable queries, the model can't adapt what it asks for different layouts or tasks. Anchor: Some queries become great at finding headers; others specialize in table rows or figure captions.
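A minimal sketch of the generic "learnable queries" idea is below, in the parallel Q-former/DETR style the next paragraph mentions: trainable vectors that cross-attend to the image tokens. Note that DeepSeek's causal flow tokens differ from this generic version because they sit inside the LLM-style encoder under the custom mask; the class name and sizes here are invented for illustration.

```python
import torch
import torch.nn as nn

class LearnableQueries(nn.Module):
    """Generic learnable queries: trainable vectors that ask the image tokens
    for a summary via cross-attention. Their values are updated by training."""
    def __init__(self, num_queries=16, dim=256, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, image_tokens):                           # (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        summary, _ = self.cross_attn(q, image_tokens, image_tokens)
        return summary                                         # (batch, num_queries, dim)

summary = LearnableQueries()(torch.randn(2, 400, 256))
print(summary.shape)  # torch.Size([2, 16, 256])
```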
Failed attempts: Object detectors (like DETR) and projectors (like BLIP-2's Q-former) used parallel queries to compress or find objects, but those queries talk to each other in both directions and don't enforce a one-direction reading flow. Cross-attention encoders separated from LLM decoders sometimes fail to converge because image pieces can't interact richly enough.
Hook: When you solve a mystery, each clue you find changes what you look for next. The Concept: Causal Reasoning means picking the next step based on what you've already discovered.
- How it works:
- Start with what you know.
- Choose the next place to look using logic from previous steps.
- Repeat to build a meaningful sequence.
- Why it matters: Without causality, the model can't create a sensible reading path through a messy page. Anchor: After spotting a section title, the next fixations go to its paragraph, not a random footnote.
The gap: We needed a way to let the encoder itself learn a flexible, meaningful order of image tokens before handing them to the language model, so the language model isn't forced to untangle a jumbled, grid-fixed sequence.
Hook: Think of a conductor pointing to which instrument should play next. The Concept: Causal Flow Tokens are learnable queries that produce a new, meaningful reading order of the image tokens.
- How it works:
- Give them causal attention so each step can only look back.
- Let them attend to all image tokens plus earlier flow tokens.
- Output a reordered sequence that captures the page's logic.
- Why it matters: Without these tokens, the model can't turn 2D layouts into a smart 1D story for the decoder to read. Anchor: They pick "Title → Author → Abstract → Figure 1 → Caption," not just "left-to-right."
Real stakes: Misreading order can scramble facts: prices matching the wrong items, steps in the wrong sequence on a lab sheet, or mixing formulas with unrelated text. That's why people care: better reading order means fewer mistakes when processing bills, forms, homework, research papers, and reports.
02 Core Idea
The "Aha!" moment in one sentence: If we let a small language-model-style encoder causally reorder image tokens before decoding, the whole system can read images like a person, following meaning, not just the grid.
Hook: Imagine turning a messy desk into a neat to-do list. The Concept: DeepEncoder V2 is a vision encoder that uses a compact LLM with a custom attention mask and learnable causal flow tokens to reorder visual tokens by meaning.
- How it works:
- Compress the image into visual tokens (global and local crops).
- Feed those tokens as a prefix into an LLM-style stack with a special mask.
- Add the same number of causal flow tokens as visual tokens.
- Let flow tokens look at all image tokens and earlier flow tokens.
- Keep only the flow tokens' outputs and pass them to the decoder LLM.
- Why it matters: Without this reordering stage, the decoder starts from a scrambled line of patches, wasting capacity fixing ordering instead of understanding content. Anchor: A two-column article with sidebars becomes a sensible reading list the decoder can easily narrate.
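To make the shape of the idea concrete, here is a toy end-to-end sketch of that list in PyTorch: visual tokens as a prefix, an equal number of flow tokens appended, a transformer stack run under the two-part mask, and only the flow outputs kept. Everything here (class names, layer counts, dimensions) is illustrative, not the real DeepEncoder V2.

```python
import torch
import torch.nn as nn

def two_part_mask(m, n):
    """True = allowed: visual block bidirectional, flow block causal,
    and every flow token sees every visual token."""
    allowed = torch.zeros(m + n, m + n, dtype=torch.bool)
    allowed[:m, :m] = True
    allowed[m:, :m] = True
    allowed[m:, m:] = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return allowed

class ToyDeepEncoderV2(nn.Module):
    """Toy reordering encoder: append n = m learnable flow tokens, run the
    stack under the custom mask, return only the flow-token outputs."""
    def __init__(self, dim=256, heads=4, layers=2, max_tokens=1120):
        super().__init__()
        self.flow = nn.Parameter(torch.randn(max_tokens, dim) * 0.02)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.stack = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, visual_tokens):                 # (batch, m, dim)
        b, m, _ = visual_tokens.shape
        flow = self.flow[:m].unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([visual_tokens, flow], dim=1)   # prefix image tokens, then flow tokens
        blocked = ~two_part_mask(m, m)                # PyTorch convention: True = blocked
        hidden = self.stack(x, mask=blocked)
        return hidden[:, m:, :]                       # keep only the reordered flow outputs

reordered = ToyDeepEncoderV2()(torch.randn(1, 544, 256))
print(reordered.shape)  # torch.Size([1, 544, 256]) -> handed to the decoder LLM
```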
Three analogies for the same idea:
- Tour guide: The visual tokens are all the sites in a city; the causal flow tokens plan the best route so you don't zigzag randomly.
- Recipe card: Ingredients (visual tokens) are scattered; the flow tokens arrange them into steps so cooking (decoding) is smooth.
- School notes: You have facts everywhere; the flow tokens turn them into an outline the teacher can read aloud.
Before vs. After:
- Before: Fixed raster order; decoder struggles with layout logic; needs more tokens and time to make sense of complex pages.
- After: Causally reordered sequence; decoder focuses on meaning; equal or fewer tokens achieve better accuracy and lower reading-order errors.
Why it works (intuition without equations):
- The image tokens keep bidirectional attention so they form a strong global picture, like everyone in a class discussing freely.
- The flow tokens have causal attention, so they build a one-way story, like a student presenting slides in order.
- Because flow tokens can see all image tokens but only earlier flow tokens, each decision depends on what's already been decided, not the future; this enforces a stable, human-like reading path.
- Using a compact LLM here aligns image processing with the decoder's own causal logic, creating a two-cascade system of 1D reasoners that approximates the 2D reading process.
Building blocks, introduced with the sandwich pattern:
Hook: Think of cutting a big poster into sticker tiles. The Concept: Visual Tokens are the tiles we get from compressing global and local image crops.
- How it works: A light tokenizer turns a 1024×1024 view into 256 tokens and each 768×768 local crop into 144 tokens; up to 6 locals give at most 1120 tokens total.
- Why it matters: This keeps compute small while preserving important detail. Anchor: A full page plus a few close-ups turn into a compact set of tokens.
Hook: When the whole class chats, ideas fly both ways. The Concept: Bidirectional Attention among visual tokens lets far-apart pieces (like figure and caption) inform each other.
- Why it matters: It preserves CLIP-like global understanding without forcing a fixed order. Anchor: The model knows the table body belongs to its header, even across a page.
Hook: A game rule sheet tells who can talk to whom and when. The Concept: Custom Attention Mask stitches two worlds: free-talk for image tokens and one-way talk for flow tokens.
- Why it matters: It's the guardrail that makes reordering causal while keeping global vision intact. Anchor: The left half of the mask is open chat; the right half is a one-way line.
Hook: A small storyteller makes sense of scattered facts. The Concept: Language Model as Vision Encoder uses a compact LLM (Qwen2-0.5B scale) to perform sequence-aware reasoning over images.
- Why it matters: It naturally supports causal flow and inherits efficient LLM tricks. Anchor: It's like turning picture pieces into a sensible sentence the big decoder can expand.
Hook: Students who ask sharper questions learn faster. The Concept: Learnable Queries (flow tokens) adapt during training to pull the right information at the right time.
- Why it matters: Static questions miss important patterns in diverse layouts. Anchor: Some queries get great at math regions; others at dense paragraphs.
Hook: Solving a maze means each turn depends on your last one. The Concept: Causal Reasoning ensures each new step follows from what's already chosen, building a stable reading order.
- Why it matters: It avoids peeking into the future and keeps the sequence logical. Anchor: After picking a header, the next step picks its paragraph, not a random footer.
Put together, DeepEncoder V2 is the conductor: bidirectional vision for rich context, causal flow for order, and a compact LLM to align both with the decoder's language-style reasoning.
03 Methodology
At a high level: Image → Tokenizer (compress to visual tokens) → LLM-style encoder with custom mask (add causal flow tokens; reorder) → Keep only flow tokens → MoE decoder LLM → Output text/structure.
Step 1: Vision Tokenizer (compact, 16× compression)
- What happens: The image is seen in one global view (1024×1024 → 256 tokens) plus 0–6 local crops (each 768×768 → 144 tokens). Total tokens range from 256 to 1120.
- Why this exists: It squeezes the image into a small, information-rich set of tokens so later global attention is affordable.
- Example: A research paper page with a figure gets 256 global tokens plus 2 locals near the figure and table (2×144), totaling 544 tokens.
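The token arithmetic in this step is simple enough to check directly. A tiny helper (the function itself is hypothetical, just mirroring the numbers above):

```python
def visual_token_count(num_local_crops):
    """Budget from the setup above: one 1024x1024 global view = 256 tokens,
    each 768x768 local crop = 144 tokens, at most 6 crops."""
    assert 0 <= num_local_crops <= 6, "the setup caps local crops at 6"
    return 256 + 144 * num_local_crops

print(visual_token_count(0))  # 256  (global view only)
print(visual_token_count(2))  # 544  (the research-paper example above)
print(visual_token_count(6))  # 1120 (the maximum budget)
```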
Step 2: Prefix the visual tokens into an LLM-style encoder
- What happens: All visual tokens are concatenated and fed as a prefix into a compact decoder-only transformer initialized from Qwen2-0.5B-scale weights.
- Why this exists: Keeping image tokens inside the same stack as the queries lets them interact richly layer by layer. Attempts to isolate them in a separate encoder-decoder with cross-attention didn't converge well because image tokens couldn't talk enough.
- Example: The 544 tokens enter the stack and immediately gain global, bidirectional visibility among themselves.
Step 3: Add causal flow tokens with equal count
- What happens: Append the same number of learnable causal flow tokens as visual tokens (m visual tokens, n = m flow tokens). These queries can attend to all visual tokens and to earlier flow tokens, never the future ones.
- Why this exists: Equal capacity gives space for re-fixations: if the page needs to revisit a region, some flow tokens can do that without crowding others out.
- Example: With 544 visual tokens, we add 544 flow tokens, creating space for both first-pass picks (titles) and second-pass details (captions).
Step 4: Apply the custom attention mask (the secret traffic rules)
- What happens: Build a block mask where the visual-token block is fully bidirectional and the flow-token block is lower-triangular (causal: each flow token sees itself and earlier flow tokens only). The cross-block region lets flow tokens see all visual tokens at every step.
- Why this exists: It preserves CLIP-like global modeling for the image while enforcing a one-way, human-like reading path in the flow.
- Example: Flow token #37 can see all image tokens and flow tokens #1–#36, but not #38+.
Step 5: Two-stage cascade causal reasoning
- What happens: Inside the encoder, flow tokens progressively impose a semantic order on visual content. After the last layer, we discard the visual-token outputs and keep only the final flow-token states (the reordered sequence). Then the decoder LLM autoregressively reads these in order to produce text, structure, and answers.
- Why this exists: This splits the job: the encoder does ordering (what to read next), the decoder does reasoning (what to say next). Two 1D causal stages cooperate to approximate 2D page logic.
- Example: The encoder outputs a sequence like [Title → Authors → Abstract → Fig1 → Caption → Section 1 → Table 1 → Header → Body…]; the decoder then writes a clean, correctly ordered transcript or JSON structure.
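A toy sketch of the second stage's hand-off is below: the decoder reads the reordered flow states as a prefix, then emits output tokens one at a time, each conditioned only on that prefix and on what it has already written. The real decoder is a DeepSeek MoE LLM; the tiny GRU here is only a stand-in to show the causal, step-by-step consumption.

```python
import torch
import torch.nn as nn

class ToyCascadeDecoder(nn.Module):
    """Stand-in for stage two: absorb the reordered flow states, then narrate
    them one token at a time (autoregressively)."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)   # placeholder for a causal LLM
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def generate(self, flow_states, steps=5, bos_id=0):
        _, h = self.rnn(flow_states)                    # first, read the reordered "reading list"
        token = torch.full((flow_states.size(0), 1), bos_id, dtype=torch.long)
        written = []
        for _ in range(steps):                          # then write, one step at a time
            out, h = self.rnn(self.embed(token), h)
            token = self.head(out[:, -1]).argmax(dim=-1, keepdim=True)
            written.append(token)
        return torch.cat(written, dim=1)

ids = ToyCascadeDecoder().generate(torch.randn(1, 544, 256))
print(ids.shape)  # torch.Size([1, 5])
```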
Step 6: Multi-crop strategy and token budgeting
- What happens: Use up to 6 local crops to focus on dense areas, capping the total at 1120 tokens to match practical budgets (e.g., Gemini-3 Pro).
- Why this exists: Keeps the method fast and usable in production while allowing detail zoom-ins when needed.
- Example: A long table gets two extra crops so the model captures headers and body clearly without exploding the token count.
Step 7: Training pipeline (three stages)
- Stage 1 (Encoder pretraining): Train the tokenizer + LLM-style encoder jointly with a lightweight decoder using next-token prediction over image-text pairs at 768 and 1024 resolutions. Outcome: the encoder learns feature extraction, compression, and early reordering.
- Stage 2 (Query enhancement): Integrate with the DeepSeek MoE decoder; freeze the tokenizer, train encoder + decoder together using multi-crop, strengthening the flow tokens' ability to order and compress.
- Stage 3 (Continue-training LLM): Freeze the entire encoder; train only the decoder for speed and to better consume the reordered outputs.
- Why these stages exist: Separate concerns. First learn good tokens and ordering, then align with the decoder, then scale training throughput (a parameter-freezing sketch follows this list).
- Example: After stage 2, the model gets much better at picking captions right after figures; after stage 3, it writes cleaner outputs with fewer repetitions.
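The staged freezing described above boils down to toggling which parameters receive gradients. A rough sketch (the stage split is from the paper; the helper functions and stand-in modules are invented for illustration):

```python
import torch.nn as nn

def set_trainable(module, trainable):
    """Freeze or unfreeze a module by toggling requires_grad on its parameters."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(stage, tokenizer, encoder, decoder):
    if stage == 1:    # encoder pretraining: everything learns (with a lightweight decoder)
        set_trainable(tokenizer, True); set_trainable(encoder, True); set_trainable(decoder, True)
    elif stage == 2:  # query enhancement: freeze the tokenizer, align encoder + MoE decoder
        set_trainable(tokenizer, False); set_trainable(encoder, True); set_trainable(decoder, True)
    elif stage == 3:  # continue-training: freeze the whole encoder side, train only the decoder
        set_trainable(tokenizer, False); set_trainable(encoder, False); set_trainable(decoder, True)

tok, enc, dec = nn.Linear(4, 4), nn.Linear(4, 4), nn.Linear(4, 4)  # stand-in modules
configure_stage(3, tok, enc, dec)
print(any(p.requires_grad for p in enc.parameters()))  # False: encoder frozen in stage 3
```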
The secret sauce
- Equal cardinality of flow and visual tokens: enough slots for re-fixations and multi-pass ordering.
- Custom mask that marries bidirectional vision with causal flow: global context plus narrative order.
- LM-as-encoder: aligns image processing with language-style causality and inherits LLM efficiencies (MoE, efficient attention).
- Keep-only-flow to the decoder: saves compute and forces the encoder to distill the right order and content.
Concrete data walk-through
- Input: A magazine page with a main title, two columns, a photo, and a caption.
- Tokenizer: 256 global + 2 locals near the photo and the end of column two = 544 visual tokens.
- Encoder: Flow tokens learn to output [Main Title → Byline → Left Column Para 1–3 → Photo → Caption → Right Column Para 1–3 → Footer].
- Decoder: Generates text in that order and a structured JSON (sections, figures, captions) with low reading-order edit distance.
04 Experiments & Results
The test and why it matters
- Benchmark: OmniDocBench v1.5 with 1,355 real document pages (magazines, papers, reports; English and Chinese).
- What's measured: Overall accuracy and edit distances for text, formulas, tables, and, crucially, reading order (how well the model follows the page's true sequence).
- Why: Reading order is the heartbeat of understanding documents. Getting the right content in the wrong order still confuses people and downstream apps.
The competition
- Pipelines and end-to-end models, including strong baselines: DeepSeek-OCR (previous gen), Gemini-3 Pro (similar token budget), InternVL, Qwen-VL families, MinerU, OCRVerse, and more.
Scoreboard with context
- DeepSeek-OCR 2: 91.09% overall with a max of 1120 visual tokens. • This is like getting an A when many peers are at B+ to A-, but using fewer study notes.
- Improvement over DeepSeek-OCR: +3.73% overall under similar data sources and a lower token cap. • That's moving from a solid 87.36% to a standout 91.09% while shrinking the visual token budget ceiling by 36 tokens (1156 → 1120).
- Reading order edit distance: 0.085 → 0.057 (lower is better). • Think of this as cutting order mistakes by about a third, with fewer mix-ups of what comes first (see the edit-distance sketch after this list).
- Category-level edit distances (vs. Gemini-3 Pro, similar 1120 budget): DeepSeek-OCR 2 reaches 0.100 overall ED vs. Gemini's 0.115, showing stronger parsing with similar resources.
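For readers who want the "edit distance" numbers to feel concrete, here is a minimal normalized Levenshtein sketch over reading-order block lists (0.0 means identical order, 1.0 means completely different). The benchmark's exact scoring protocol may differ; this only illustrates what the metric measures.

```python
def normalized_edit_distance(pred, ref):
    """Levenshtein distance between two sequences, divided by the longer length."""
    m, n = len(pred), len(ref)
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = row[j]
            row[j] = min(row[j] + 1,                           # deletion
                         row[j - 1] + 1,                       # insertion
                         prev + (pred[i - 1] != ref[j - 1]))   # substitution (free if equal)
            prev = cur
    return row[n] / max(m, n, 1)

truth = ["Title", "Abstract", "Figure 1", "Caption", "Section 1"]
guess = ["Title", "Figure 1", "Abstract", "Caption", "Section 1"]  # two blocks swapped
print(normalized_edit_distance(guess, truth))  # 0.4
```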
Surprising and notable findings
- Small but mighty: With an LLM-style encoder of about 0.5B parameters, the system improves order and content without inflating compute like massive multimodal stacks.
- Order first, reason second: Cascading two causal reasoners (encoder then decoder) appears more effective than relying on the decoder alone to untangle grid-ordered tokens.
- Practical readiness: In production, where there's no ground truth, repetition rate dropped from 6.25% to 4.17% on images and from 3.69% to 2.88% on PDFs, meaning fewer loops and cleaner outputs.
Where it shines and where it stumbles (from 9 document types)
- Big wins: Magazines, academic papers, reports, colorful textbooks: reading order improves across the board, often by large margins.
- Weak spot: Newspapers remain tough (very dense text). Causes:
- Token budget pressure: too many small items for a 1120-token cap.
- Limited training data: about 250k samples is not enough for that style.
- Easy fix ideas: Add more local crops on very dense pages or increase training data for newspapers.
Takeaway numbers you can remember
- 256–1120 visual tokens per page.
- 3.73% overall boost vs. previous DeepSeek-OCR.
- Reading order ED down to 0.057.
- Production repetition down ~2 percentage points on images and ~0.8 on PDFs.
What these mean in plain terms
- Better order = fewer misunderstandings: Captions go with the right figures, steps stay in sequence, and table headers match their rows.
- Better compression = faster and cheaper: Doing more with fewer tokens means quicker, less expensive processing at scale.
- Strong promise for unified multimodality: If this works for images, similar LLM-style encoders with modality-specific queries could compress and reorder audio and text, too.
05 Discussion & Limitations
Limitations
- Token budget sensitivity: Extremely dense pages (e.g., newspapers) can exceed the comfortable 1120-token cap, leading to missed details or higher text edit distance.
- Data imbalance: Underrepresentation of certain document types (like newspapers) limits specialization and hurts performance there.
- Single-pass flow length: Using flow tokens equal to visual tokens is good, but deeper multi-hop re-examinations might need even more flow capacity.
- Domain specificity: Trained heavily on OCR/document tasks; general visual reasoning (e.g., open-world scenes) remains to be fully validated.
Required resources
- Training: Multi-node GPU clusters (e.g., 160 GPUs) for staged training; staged freezing helps manage cost.
- Inference: Modest compared to huge multimodal models; fits real-world budgets thanks to 16× compression and the 256–1120 token cap.
When not to use
- Ultra-dense microprint scans where even many local crops cannot capture all details within budget.
- Non-document images that require spatially precise pixel outputs (e.g., medical segmentation) rather than reading order and text extraction.
- Tasks needing strict 2D geometric outputs without a textual or sequential endpoint.
Open questions
- How long should the causal flow be for true multi-hop reordering? Would 2× or 3× flow length over visual tokens boost revisiting power?
- Can the same LM-as-encoder handle audio and video streams with modality-specific query embeddings without new architectures?
- What's the best curriculum for training order: do we teach titles and captions first, or all at once?
- How does this approach scale with even tighter token budgets or more aggressive compression ratios?
- Can we learn when to add more local crops on-the-fly based on early flow-token signals?
06 Conclusion & Future Work
Three-sentence summary
- DeepSeek-OCR 2 introduces DeepEncoder V2, an LLM-style vision encoder that causally reorders image tokens before decoding so pages are read by meaning, not just by grid position.
- A custom attention mask lets visual tokens see globally while learnable causal flow tokens build a one-way reading path; only the reordered flow outputs go to the decoder.
- This two-cascade causal setup improves accuracy (91.09% on OmniDocBench v1.5) and slashes reading-order errors, all within a compact token budget (256–1120).
Main achievement
- Proving that an LM-as-vision-encoder with causal flow tokens can outperform fixed-order pipelines on document reading, offering a practical route to genuine 2D reasoning through two 1D causal stages.
Future directions
- Increase flow capacity for multi-hop revisits; adaptive cropping for dense pages; broaden training to newspaper-heavy and low-resource layouts; extend the same encoder design to audio and video with modality-specific queries.
Why remember this
- It's a simple, powerful idea: first reorder by meaning, then reason about content. By aligning vision with language's causal nature, DeepSeek-OCR 2 reads like we do, one sensible step at a time, making document AI more accurate, efficient, and closer to truly understanding pages.
Practical Applications
- Automated PDF-to-structured-JSON conversion with correct reading order for analytics.
- Enterprise invoice and receipt processing with fewer mismatched line items and prices.
- Academic paper ingestion where figures, captions, and references are correctly linked.
- K-12 worksheet digitization that preserves step-by-step instructions and answer keys.
- Compliance document review (policies, contracts) with faithful section ordering.
- Accessible reading tools that narrate complex pages in a sensible order for screen readers.
- Batch pretraining data generation for LLMs with lower repetition and cleaner layouts.
- Form understanding in banking and healthcare with robust table header-to-row matching.
- Summarization pipelines that avoid mixing captions with unrelated paragraphs.
- Knowledge-base building from reports where section hierarchies are preserved.