CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding
Key Summary
- The paper tests a simple but bold idea: show code to AI as pictures instead of plain text, then shrink those pictures to save tokens and time.
- Modern multimodal models (that can see and read) often understand these code images as well as, or better than, raw text across four tasks: completion, summarization, clone detection, and Q&A.
- Because images can be smoothly downscaled, the same code can cost 2×–8× fewer tokens while staying readable to strong models.
- Gemini-3 models stayed accurate even at 8× compression and sometimes beat their own text baselines, showing remarkable robustness.
- Visual cues like syntax highlighting and bold fonts help at low-to-moderate compression (1×–4×) but matter less when images get very small (8×).
- Clone detection was especially strong with pictures, hinting that seeing structure at a glance helps compare meaning, not just matching tokens.
- A careful “code OCR” test showed how errors grow with more compression: first characters, then lines, then whole blocks—but high-level understanding can still survive small character mistakes.
- They package the approach into a practical tool, CODEOCR, that renders code into images with chosen styles and compression levels, making it easy to deploy.
- Overall, representing code as images points to a new, token-efficient path for AI code understanding without slowing down inference.
Why This Research Matters
Developers often need models to understand entire files or projects, but text tokens make that slow and expensive. This work shows a simple switch—render code as images—that can slash token costs by 2×–8× while keeping or even improving accuracy. It means faster IDE assistants that can handle long files and big repos without hitting context limits. Visual cues like syntax highlighting help models focus on the right parts, especially at moderate compression. Strong models already handle this well, and a practical tool (CODEOCR) makes it easy to start using today. Over time, specialized training could further boost accuracy at extreme compression, bringing cheaper and better code intelligence to more people.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how when you read a comic, you can see the whole scene at once, not just one word at a time? That big-picture view helps your brain understand more with less effort.
🥬 Filling (The Actual Concept):
- What it is: This paper explores showing code to AI as pictures instead of plain text, so the AI can use its vision to understand code and we can shrink those pictures to save tokens and cost.
- How it works:
- Render the source code into clean, readable images (like screenshots from an editor).
- Feed those images, plus a text instruction, to a multimodal model that can handle both pictures and words.
- Adjust image resolution to control “visual tokens,” trading tiny bits of clarity for big savings in token costs.
- Test this across key coding tasks to see what breaks and what stays strong.
- Why it matters: If models can see code well enough as images, we can fit much more code into the same budget and run faster without losing understanding.
🍞 Bottom Bread (Anchor): Imagine sending a whole file as one or two pictures instead of thousands of text tokens—your AI assistant can still answer questions about it, and your bill is smaller.
The World Before:
- Big language models love text, so we fed them code as long lines of tokens. That works, but it’s costly: longer code means more tokens, more memory, more time, and sometimes hitting hard context limits.
- Text compression for code (like pruning or rewriting) often drops important pieces. If you throw away the wrong token or rename a variable poorly, the code’s meaning can blur.
🍞 Top Bread (Hook): Imagine a giant LEGO instruction booklet. If you must flip one page at a time, it’s slow. If you could view a whole spread, you’d spot patterns faster.
🥬 Filling:
- What it is: Multimodal language models (MLLMs) are AIs that can read both text and images.
- How it works: They break images into small patches (like LEGO tiles), turn those into embeddings (numbers with meaning), then mix them with text using attention.
- Why it matters: They can “see” indentation, braces, colors, and layout—the visual structure of code—without needing every token spelled out.
🍞 Bottom Bread (Anchor): When an MLLM looks at a syntax-highlighted Python function, it notices the blocks and keywords at a glance, like you noticing headings and bullet points in a handout.
The Problem:
- As projects grow, token counts explode. That means higher cost and slower responses. Developers want their tools to read whole files, not just little snippets.
Failed Attempts:
- Text-only tricks like pruning, summarizing, or rewriting help a bit but can delete context or change semantics. You risk losing the very clue the model needs.
The Gap:
- Could we switch the medium from text to images and rely on compression by resolution instead of chopping tokens? Strangely, no one had done a large, careful study on code-as-images for understanding tasks.
Real Stakes:
- Faster IDE assistants that handle big files and sprawling repos without timing out.
- Lower API bills because images can be fairly compressed.
- More stable performance for tasks that need the “shape” of code (like clone detection or high-level summaries), even if some characters blur.
🍞 Top Bread (Hook): Think of zooming a photo. Even when it’s smaller, you still recognize your friend’s face.
🥬 Filling:
- What it is: Image compression by downscaling preserves the overall look while using fewer tokens.
- How it works: Start with a crisp code image. Downscale to hit a target “visual token” budget. The model gets fewer—but denser—patches.
- Why it matters: Unlike text, where every token you cut might remove a meaning nugget, images shrink smoothly and keep structure.
🍞 Bottom Bread (Anchor): A code file that would cost 110 text tokens can be an image costing 27 visual tokens and still be readable to a vision-capable model.
02 Core Idea
🍞 Top Bread (Hook): Imagine swapping a 20-page script for a neat storyboard of the same movie. You can still follow the plot, and it’s easier to carry.
🥬 Filling (The Actual Concept):
- What it is: The key idea is to represent code as images and let multimodal models understand it visually, then save tons of tokens by scaling the image resolution.
- How it works:
- Render code into images that capture layout, indentation, and colors.
- Feed images plus a text instruction into an MLLM that “sees and reads.”
- Downscale the images (1×, 2×, 4×, 8×) to use fewer visual tokens while keeping the big-picture structure.
- Evaluate on code completion, summarization, clone detection, and code Q&A.
- Why it matters: If models keep their accuracy while images get smaller, we get cheaper, faster, longer-context code understanding.
🍞 Bottom Bread (Anchor): With 8× compression, Gemini-3-Pro got 79.5% accuracy on code Q&A—beating its 74.8% raw text baseline—while using only 12.5% as many tokens for the code.
Multiple Analogies:
- City map vs. street list: A city map shows neighborhoods and roads at a glance (code structure). A street list (text tokens) is precise but not holistic.
- Class photo vs. attendance roll: The photo gives group layout instantly (indentation, blocks). The roll gives names line by line.
- Recipe card vs. narrated instructions: The formatted card shows sections (ingredients, steps) so your brain groups them better than a long spoken paragraph.
Before vs. After:
- Before: Code = long token strings; compression = risky token pruning; context windows = tight.
- After: Code = images; compression = smooth downscaling; context windows feel bigger because visual tokens carry more structure per token.
Why It Works (intuition):
- MLLMs’ vision encoders excel at spotting patterns like color-coded keywords, balanced brackets, and indentation cliffs.
- Spatial structure helps separate “what belongs together,” boosting tasks that rely on shape and flow, like clone detection and summaries.
- Language priors fill in minor gaps when characters blur, so small OCR mistakes don’t always harm understanding.
Building Blocks (each with a mini sandwich):
- 🍞 Hook: You know how teachers color-code notes to guide your eyes? 🥬 Syntax Highlighting: It’s a way to color parts of code (keywords, strings) so models see roles quickly. Steps: (1) tokenize code, (2) color by type, (3) render. Why: Without it, everything is gray and harder to parse at a glance. 🍞 Anchor: “def” in blue, strings in green—easy to spot.
- 🍞 Hook: Picture cutting a poster into tiles. 🥬 Visual Patches/Tokens: The image is split into small patches the model embeds as tokens. Steps: (1) divide image, (2) encode each patch, (3) feed sequence. Why: Without patches, the model can’t digest the image. 🍞 Anchor: A 2240×2240 image with 14×14 patches becomes a tidy grid of tokens.
- 🍞 Hook: Think of merging four small photos into one collage square. 🥬 Pooling/Alignment: A V-L adapter pools neighboring patches to compress while preserving meaning density. Why: Without alignment, you spend too many tokens on redundant details. 🍞 Anchor: 2×2 pooling turns four patch vectors into one.
- 🍞 Hook: Imagine choosing small, medium, or large pizza sizes. 🥬 Compression Ratio: Set 1×, 2×, 4×, 8× to trade clarity for token savings. Why: No ratio means no budget control. 🍞 Anchor: 4× uses 25% of tokens yet often keeps performance.
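The token arithmetic behind these building blocks can be made concrete with a short sketch. It uses the figures quoted in the anchors above (a 2240×2240 base render, 14×14 pixel patches, 2×2 pooling) and assumes the visual token count scales with pixel area, so an N× compression shrinks each side by √N; real encoder budgets vary by model.

```python
import math

# Assumed settings taken from the anchors above; treat the absolute
# counts as illustrative, since real encoder budgets differ by model.
BASE_SIDE = 2240   # pixels per side of the 1x render
PATCH = 14         # patch size in pixels
POOL = 2           # 2x2 pooling merges four patch vectors into one token

def visual_tokens(ratio: float) -> int:
    """Approximate visual tokens left after an N-times compression."""
    side = BASE_SIDE / math.sqrt(ratio)        # N-times fewer tokens ~ shrink each side by sqrt(N)
    patches_per_side = side / PATCH            # patch grid dimension
    pooled_per_side = patches_per_side / POOL  # after 2x2 pooling
    return round(pooled_per_side ** 2)

for ratio in (1, 2, 4, 8):
    print(f"{ratio}x compression -> ~{visual_tokens(ratio)} visual tokens "
          f"({100 / ratio:.1f}% of the 1x budget)")
```

The percentages match the 25% and 12.5% figures quoted above; the absolute counts depend on how large a page is rendered, so a short snippet on a small canvas costs far fewer tokens.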
03 Methodology
At a high level: Input (code + instruction) → Render code to image → Downscale to target compression → MLLM encodes image + text → Unified attention → Output.
Step-by-step recipe with sandwiches for the key pieces:
- Code Rendering
- 🍞 Hook: Imagine printing your homework neatly so it’s easy to read.
- 🥬 What: Turn source code into clean, editor-like images (plain, bold, highlighted) at a high base resolution (e.g., 2240×2240). How: a) Use a monospace font so columns align. b) Optionally color-code tokens (syntax highlighting) or thicken strokes (bold). c) Split into multiple pages if the file is long, keeping line order. Why: Without rendering, there’s no visual input to compress—only raw text.
- 🍞 Anchor: A 180-line Python module becomes two crisp pages that look like VS Code screenshots.
- Resolution Compression
- 🍞 Hook: Like zooming out on a photo to fit it into a collage.
- 🥬 What: Downscale images to meet token budgets (1×, 2×, 4×, 8×), where higher × means fewer visual tokens. How: a) Start with a high-res image. b) Bilinear downsample to target resolution so the visual token count matches the budget. c) Keep the instruction as text; only the code is an image. Why: Without controlled downscaling, you can’t trade cost for clarity or compare to text baselines fairly.
- 🍞 Anchor: A code snippet that cost 110 text tokens can become a 27-token image at ~8×; a runnable rendering-and-downscaling sketch follows this recipe.
- Multimodal Encoding
- 🍞 Hook: Think of cutting a poster into tiles and labeling each tile.
- 🥬 What: The vision encoder splits the image into fixed-size patches (e.g., 14×14 pixels) and turns each into a vector (“visual token”). How: a) Patchify the image; b) run a Vision Transformer to get embeddings; c) pool/align with a V-L adapter to compress and match the text space; d) tokenize and embed the instruction text. Why: Without this, the model can’t mix what it sees with what it reads.
- 🍞 Anchor: “def” in blue, braces, and indentation become numerical patterns the model can attend to.
- Fusion and Modeling
- 🍞 Hook: Like stacking picture tiles and caption words in one row so a reader can look across both.
- 🥬 What: Concatenate visual embeddings with text embeddings and pass them through the model’s self-attention layers. How: a) [Visual tokens; Text tokens] → Transformer blocks. b) The model learns which patches/words matter for the task. Why: Without fusion, the model either sees text alone or image alone—not both together.
- 🍞 Anchor: Prompt: “Summarize this module.” The model attends to function headings, docstrings, and key identifiers in the image while following the instruction text.
- Task Setups and Metrics
- 🍞 Hook: Think of four class quizzes testing different skills.
- 🥬 What: Four tasks, each stressing a different level of understanding. How: a) Code Summarization (CompScore): Read long modules as images; write a summary. b) Code Completion (ES/EM): Provide relevant context as images + a text prefix; fill in the missing code. c) Clone Detection (ACC/F1): See two code images; answer if they do the same job. d) Code Q&A (ACC): Read code images; pick the correct multiple-choice answer. Why: Without diverse tasks, we wouldn’t know where images shine or struggle.
- 🍞 Anchor: A clone pair might look different line-by-line but “feel” alike structurally, and the visual signal helps spot that.
- Baselines and Controls
- 🍞 Hook: Every good science fair project needs a fair comparison.
- 🥬 What: Compare Image vs. Text and include a No-Context lower bound. How: a) NoCtx: remove code context; only keep instructions and options (where applicable). b) Text: feed raw code tokens (standard approach). c) Image: feed code as pictures at matched token budgets. Why: Without baselines, we can’t tell if pictures really help.
- 🍞 Anchor: If Image beats Text and both beat NoCtx, the pictures are carrying real signal.
- Models and Repetitions
- 🍞 Hook: Testing with just one student doesn’t tell you about the whole class.
- 🥬 What: Seven multimodal LLMs, from open-weight to proprietary (e.g., Qwen-3-VL, GLM-4.6v, GPT-5-mini/5.1, Gemini-2.5/3). How: a) Same prompts and inputs per task; b) 5 runs each; c) average and standard deviation; d) Wilcoxon tests for significance. Why: Without multiple models and trials, results might be flukes. (A small sketch of this protocol appears at the end of this section.)
- 🍞 Anchor: Gemini-3 models repeatedly handle compression gracefully; GLM/Qwen vary more.
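To make the Code Rendering and Resolution Compression steps above concrete, here is a minimal sketch built on Pygments and Pillow, the libraries named later under Required Resources. The helper names, the "bw" style standing in for plain rendering, and the mapping from an N× token reduction to a 1/√N per-side resize are illustrative assumptions, not the paper's exact pipeline.

```python
import math
from io import BytesIO

from PIL import Image
from pygments import highlight
from pygments.formatters import ImageFormatter
from pygments.lexers import PythonLexer


def render_code(source: str, highlighted: bool = True) -> Image.Image:
    """Code Rendering: turn source text into an editor-like image."""
    formatter = ImageFormatter(
        style="default" if highlighted else "bw",  # "bw" approximates a plain rendering
        line_numbers=False,                        # the formatter's default font is monospace
    )
    buffer = BytesIO()
    highlight(source, PythonLexer(), formatter, outfile=buffer)  # writes PNG bytes
    buffer.seek(0)
    return Image.open(buffer)


def compress(page: Image.Image, ratio: float) -> Image.Image:
    """Resolution Compression: bilinear downscale to cut visual tokens by `ratio`."""
    scale = 1.0 / math.sqrt(ratio)  # assume tokens scale with area: N-times fewer ~ 1/sqrt(N) per side
    size = (max(1, int(page.width * scale)), max(1, int(page.height * scale)))
    return page.resize(size, Image.BILINEAR)


page = render_code("def add(a, b):\n    return a + b\n")
for ratio in (1, 2, 4, 8):
    compress(page, ratio).save(f"snippet_{ratio}x.png")
```

In practice you would also paginate long files before rendering (step c of Code Rendering) and keep the instruction itself as plain text alongside the images.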
Secret Sauce (why this method is clever):
- Smooth compression: Images let you dial compression up or down continuously, unlike text pruning which is yes/no per token.
- Structure at a glance: Visual layout carries meaning (indentation, braces, alignment) that helps with high-level tasks.
- Visual cues: Syntax highlighting and bold act like road signs that survive moderate shrinking.
Concrete data example:
- A Python file with 6,000+ tokens as text becomes a few images; at 4× compression, models like Gemini-3 still summarize well (CompScore gains), and in Q&A, Gemini-3-Pro reaches 79.5% at 8×—higher than text baseline—using far fewer tokens.
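Returning to the Models and Repetitions step above, the runs-plus-Wilcoxon protocol can be sketched as below. The per-example scores are placeholders, not the paper's data, and the pairing of Text vs. Image scores per benchmark item is an assumption about how the comparison is set up.

```python
import statistics

from scipy.stats import wilcoxon

# Placeholder paired scores for one model on one task (not real data):
# each index is the same benchmark item scored under Text and Image inputs.
text_scores  = [0.71, 0.64, 0.80, 0.55, 0.69, 0.77, 0.62, 0.73, 0.58, 0.66]
image_scores = [0.74, 0.69, 0.79, 0.61, 0.71, 0.81, 0.70, 0.82, 0.65, 0.76]

print(f"text:  mean={statistics.mean(text_scores):.3f} sd={statistics.stdev(text_scores):.3f}")
print(f"image: mean={statistics.mean(image_scores):.3f} sd={statistics.stdev(image_scores):.3f}")

# Paired, non-parametric significance test: do Image inputs differ from Text?
result = wilcoxon(image_scores, text_scores)
print(f"Wilcoxon statistic={result.statistic:.1f}, p-value={result.pvalue:.4f}")
```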
04 Experiments & Results
The Test: What and Why
- They measured how well models perform on four code tasks when the code is given as images vs. as text, across compression levels (1×, 2×, 4×, 8×) and rendering styles (plain, bold, highlight). This checks both feasibility and efficiency.
The Competition: Who and Against What
- Seven multimodal models competed: Qwen-3-VL, GLM-4.6v, GPT-5-mini, GPT-5.1, Gemini-2.5-Pro, Gemini-3-Flash, Gemini-3-Pro.
- Baselines: NoCtx (lower bound) and Text (standard).
Scoreboard with Context:
- Overall feasibility: • Many models matched or beat Text using Image at the same token budget. For clone detection, GPT-5-mini’s F1 jumped about 42% with images over text; GPT-5.1 also rose strongly. • Gemini-3-Flash/Pro often improved on completion ES and Q&A ACC with images.
- Compression resilience: • Gemini-3-Pro hit 79.5% ACC in Q&A at 8× compression, beating its own text baseline (74.8%). That’s like scoring higher on the test while studying a smaller, blurrier handout. • Summarization and clone detection stayed solid up to 4×–8× for strong models. Completion and Q&A were more sensitive for weaker models beyond 2×–4×.
- Visual aids (highlighting, bold): • Helped most at 1×–4× (often +1–3% or more); less helpful at 8× because colors/thickness blur. • In some edge cases at 8×, bold slightly hurt by smudging character shapes.
- Cross-language generalization: • The main trends repeated in Java. Gemini family beat text baselines broadly; clone detection improvements held across models. Qwen-3-VL showed big F1 gains under compression in Java, suggesting compression nudges attention toward semantics.
Surprising Findings:
- “Less can be more”: moderate compression sometimes worked like gentle denoising, blurring distracting surface differences and making semantic comparison (clone detection) easier.
- Tiny OCR mistakes didn’t always hurt meaning: downstream tasks relying on structure or summaries stayed strong even when character-perfect transcription failed.
- Strong models degraded “gracefully” at high compression; weaker ones hit a cliff near 8×.
Deep Dive: Code Reconstruction (OCR-style)
- They asked models to transcribe code images exactly at 1×–8×. Errors grew in a clear ladder (a tiny measurement sketch follows this list):
- Token errors first (e.g., 1 vs l, 0 vs O, missed commas),
- Then line errors at moderate compression,
- Then block errors at high compression (hallucinated chunks).
- Gemini-3 kept block errors low even at 8×, matching its strong task scores under heavy compression.
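A tiny, self-contained way to measure the first two rungs of that ladder is sketched below; the difflib-based scoring and the example transcription slip are illustrative, not the paper's exact OCR metric.

```python
import difflib

def char_similarity(reference: str, transcription: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means an exact transcription."""
    return difflib.SequenceMatcher(None, reference, transcription).ratio()

reference = "def add(a, b):\n    return a + b\n"
# A typical character-level slip under compression: 'b' misread as '1'.
transcription = "def add(a, 1):\n    return a + 1\n"

ref_lines = reference.splitlines()
hyp_lines = transcription.splitlines()
exact_lines = sum(r == h for r, h in zip(ref_lines, hyp_lines))

print(f"character similarity: {char_similarity(reference, transcription):.3f}")
print(f"exact line matches:   {exact_lines}/{len(ref_lines)}")
```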
Bottom line scores in plain words:
- If the class average is a B-, Gemini-3-Pro getting an A at 8× while carrying a thinner notebook is impressive.
- GPT-5-mini’s clone detection leap with images says “seeing structure” matters when judging similarity.
- Visual cues are like highlighters that help—until the page gets too tiny to see the colors.
05 Discussion & Limitations
Limitations (specific and honest):
- Not all models benefit equally. Open-weight models showed mixed results and sometimes degraded with images, especially at high compression.
- At extreme compression (8×), visual cues can blur; bold can even hurt legibility in some setups.
- Only Python and Java were tested; while the main trends carried over to Java, more languages (C++, JS, Go, TypeScript) deserve full study.
- OCR-style exact transcription still struggles at high compression for some models—even if downstream semantics survive, tasks that truly need character-perfect precision (e.g., strict patch generation) may suffer.
- Rendering choices (font, theme, line spacing) were reasonable defaults, not exhaustively tuned; other IDE styles might help or hurt.
Required Resources:
- Access to multimodal models (API or local) and a rendering pipeline (e.g., Pygments + Pillow).
- GPU resources if running large open-weight models locally; otherwise, reliable API access.
- Enough token budget to test multiple compression ratios and styles fairly; careful prompt engineering and benchmarks.
When NOT to Use:
- Tasks demanding exact, character-perfect output (e.g., cryptic bug reproduction or byte-exact diffs) under very high compression.
- Low-vision or accessibility contexts where colored cues may not be appropriate without parallel text.
- Models known to have weak visual encoders; for them, text may still be safer at normal budgets.
Open Questions:
- Can we design code-specific visual pretraining to boost OCR fidelity at 8×–16× compression?
- What are the best adaptive rendering rules (when to use highlighting, bold, line spacing) for each task and compression range?
- Can hybrid inputs (some text tokens + compressed images) beat either alone across all tasks?
- How does this interact with retrieval at repository scale—should we retrieve patches of images, or dynamically mix text and images?
- Can we learn an automatic “sweet spot” that picks the lowest-cost compression with no accuracy loss per model and per task in real time?
06 Conclusion & Future Work
Three-sentence summary:
- This paper shows that turning code into images and feeding them to multimodal models can match or beat text-based performance while using far fewer tokens through simple downscaling.
- Strong models, especially Gemini-3, stay accurate even at 8× compression and sometimes surpass text baselines, and visual cues like syntax highlighting help most at 1×–4×.
- A code OCR study explains why: small character errors appear first, but structure and meaning often survive, powering summarization and clone detection despite compression.
Main Achievement:
- A clear, data-backed case that image-based code representation is a practical, token-efficient alternative to raw text for multiple code understanding tasks—and a working tool (CODEOCR) to make it easy.
Future Directions:
- Train code-specialized multimodal models for higher OCR fidelity at extreme compression.
- Develop adaptive rendering that auto-selects syntax highlighting, bold, spacing, and resolution per task and budget.
- Explore hybrid text+image pipelines and repository-scale retrieval with visual snippets.
Why Remember This:
- It flips the usual perspective: instead of squeezing text harder, switch the medium to pictures that compress smoothly while keeping structure.
- For long code and tight budgets, this can mean cheaper, faster, and often better results—pointing to a “vision-first” era for AI code tools.
Practical Applications
- Set your IDE assistant to send code as images at 2×–4× compression for cheaper, faster summaries without quality loss.
- Use syntax highlighting in code images for completion tasks to gain a few extra percentage points in accuracy at moderate compression.
- Adopt 8× compression for code Q&A with strong models like Gemini-3-Pro to cut token usage massively while keeping accuracy high.
- Switch clone detection pipelines to code images to emphasize semantic similarity over surface token matches.
- Combine retrieval (RAG) with image inputs: render retrieved snippets as images and keep prompts in text for best of both worlds.
- Build CI bots that summarize long diffs as images to stay within token budgets on large pull requests.
- For byte-perfect needs (e.g., patches), keep images at 1×–2× or mix in key text spans to avoid OCR-sensitive errors.
- Use CODEOCR’s dynamic compression to auto-fit large files into a fixed token budget before sending to an API.
- Experiment with bold rendering at 1×–2× and avoid it at 8× to prevent character smudging.
- Benchmark your own model family: find the compression “sweet spot” per task where cost drops but accuracy stays flat.
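As a closing sketch for the last two tips, here is a hypothetical budget-fitting helper in the spirit of CODEOCR's dynamic compression: given a rendered page size and a visual-token budget, it picks the lowest compression ratio that fits. The patch and pooling constants, the token estimator, and the candidate ratios are assumptions for illustration, not CODEOCR's actual API.

```python
import math

PATCH = 14  # assumed patch size in pixels
POOL = 2    # assumed 2x2 pooling in the vision-language adapter

def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough visual-token estimate for one rendered page (hypothetical helper)."""
    return math.ceil(width / (PATCH * POOL)) * math.ceil(height / (PATCH * POOL))

def pick_compression(width: int, height: int, token_budget: int,
                     ratios=(1, 2, 4, 8)):
    """Return the lowest ratio whose downscaled page fits the budget, else None."""
    for ratio in ratios:                # lowest ratio first = least blur
        scale = 1.0 / math.sqrt(ratio)  # N-times fewer tokens ~ 1/sqrt(N) per side
        if estimate_visual_tokens(int(width * scale), int(height * scale)) <= token_budget:
            return ratio
    return None                         # even 8x will not fit; split into more pages instead

# Example: fit a 2240x2240 page into a 1,000-visual-token budget (picks 8x).
print(pick_compression(2240, 2240, token_budget=1000))
```

Pairing a helper like this with the per-task sweet spots from your own benchmarks gives a simple way to keep cost low without giving up accuracy.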