
VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?

Intermediate
Hongbo Zhao, Meng Wang, Fei Zhu et al. · 12/17/2025
arXiv · PDF

Key Summary

  ‱ Long texts are expensive for AI to read because each extra token costs a lot of compute and memory.
  ‱ A clever workaround called vision‑text compression (VTC) turns long text into compact images that feed into vision‑language models (VLMs), shrinking input tokens by 3–20×.
  ‱ This paper builds VTCBench, the first big test to see whether VLMs can still understand long contexts after text is compressed into images.
  ‱ VTCBench checks three skills: finding facts (Retrieval), connecting clues without exact word matches (Reasoning), and remembering conversations (Memory).
  ‱ Most VLMs can read text in images (good OCR) and do simple find-the-fact tasks, but they struggle badly with deeper reasoning and long-term memory in VTC form.
  ‱ Performance drops as the context gets longer, fonts get smaller (higher compression), or the needed clue sits in the “middle” of the document (lost‑in‑the‑middle).
  ‱ A few models, like Gemini‑3‑Pro, show that strong performance with VTC is possible, nearly matching text‑only baselines on the wild, varied benchmark.
  ‱ VTCBench‑Wild adds many real‑world render styles (fonts, sizes, spacing) and reveals big robustness gaps across models.
  ‱ The study shows VTC is promising but today’s VLMs need new training and architectures to reason over dense visual text, not just read it.
  ‱ These benchmarks give designers a roadmap to build faster, more scalable models that still think clearly over long contexts.

Why This Research Matters

Long documents are everywhere—school handouts, medical records, legal files, company wikis—and reading them is costly for AI. Vision‑text compression can cut costs by 3–20×, but only if models still think clearly after text is turned into images. This work shows where that clarity breaks—reasoning, memory, and the middle of long pages—and which design choices (like font size and token budgets) matter most. It also proves strong performance is achievable, as seen with Gemini‑3‑Pro on the wild benchmark. With these insights, builders can make faster, cheaper assistants that remain reliable on real‑world, long contexts. That means better summarization, search, and decision support in everyday tools.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how carrying a giant stack of books is tiring, but putting them into a compact backpack makes it easier? Computers feel the same way about long texts.

đŸ„Ź Filling (The Actual Concept): What it is: Long-context understanding is an AI’s ability to read, keep track of, and use information that stretches over very long inputs—like whole books or long chats. How it works: 1) The model reads the whole input, 2) hangs onto key facts, 3) links ideas that may be far apart, and 4) answers questions that might depend on many scattered clues. Why it matters: Without it, the model forgets earlier details, misses links between far-apart facts, and answers break down when inputs get long.

🍞 Bottom Bread (Anchor): Imagine asking, “Which two classmates who met in chapter 2 worked together again in chapter 14?” Without long-context understanding, the model loses the trail.

The World Before: Large language models (LLMs) became great at many tasks—writing, coding, answering questions—but they stumble when inputs get very long. That’s because the regular Transformer attention grows expensive fast: every extra token adds new attention links to compute and store. Many clever ideas tried to fix this—efficient attention, extrapolated position encodings, prompt compression, external memory—but often quality drops once sequences get really long.

🍞 Top Bread (Hook): Imagine turning a super-long shopping list into a snapshot photo so you can bring it easily.

đŸ„Ź Filling (The Actual Concept): What it is: Vision‑Text Compression (VTC) turns long text into dense images so a vision‑language model (VLM) can read fewer tokens. How it works: 1) Render the text as one or more compact pages (images), 2) a vision encoder turns each page into visual tokens, 3) a language model reasons over those tokens. Why it matters: It shrinks input from thousands of tokens to far fewer visual tokens (3×–20× compression), saving compute and memory.

🍞 Bottom Bread (Anchor): Instead of feeding 20,000 words as tokens, you supply 10 images; the VLM “reads” them like pages.

The Problem: We knew VTC can do OCR-like tasks well (it reads text in images), but we didn’t know if VLMs still understand long contexts after this compression. Do they just “see” words, or can they also connect distant clues, reason when the question doesn’t share exact words with the answer, and remember conversation histories?

Failed Attempts: Prior tests often relied on literal matches—if the question says “red apple,” the answer is easy to find when the page also says “red apple.” That’s good for OCR but not for real understanding. When benchmarks don’t break literal matching, models can look smarter than they are.

🍞 Top Bread (Hook): Think of searching a messy room: it’s one thing to spot a red toy (literal match), and another to realize the toy you need is the one that fits the puzzle (reasoned match).

đŸ„Ź Filling (The Actual Concept): What it is: Associative reasoning checks if models can link clues that don’t share the exact same words. How it works: 1) Find related ideas, 2) connect them through world knowledge or context, 3) choose the correct fact even if wording differs. Why it matters: Without it, models freeze when the question doesn’t copy-paste the answer’s words.

🍞 Bottom Bread (Anchor): If the text says “Katie is vegan” and the question asks “Who can’t eat fish-based meals?”, the model must infer “Katie.”

The Gap: No one had a rigorous, VTC‑focused benchmark that tests three pillars of long-context understanding: retrieval, associative reasoning, and long-term memory—especially under different render styles and compression levels. That meant we didn’t know whether VTC preserves deep understanding or mainly helps with surface reading.

Real Stakes: Why care? Because long-context is everywhere: reading whole PDFs, summarizing multi-day chat threads, scanning legal or medical records, or analyzing large codebases. If we can shrink inputs with VTC while keeping brains (reasoning and memory) intact, we can make AI assistants faster, cheaper, and more capable in day-to-day life. If not, we risk building speedy readers that can’t actually connect the dots.

02Core Idea

🍞 Top Bread (Hook): Imagine stuffing a long novel into a few crisp photo pages, then asking a friend to answer tricky questions from just those photos.

đŸ„Ź Filling (The Actual Concept): What it is: The paper’s key insight is that we must directly measure whether vision‑language models truly understand long contexts after texts are compressed into images—beyond just reading the words. How it works: Build VTCBench, a suite that (1) compresses text into images in controlled ways, (2) tests retrieval, reasoning without word overlap, and long-term memory, and (3) varies rendering and compression to see what breaks. Why it matters: Without such a test, we can overestimate VTC’s success; with it, we discover big gaps and a roadmap to fix them.

🍞 Bottom Bread (Anchor): It’s like giving students photo-copies of a long book and grading not just if they can find a sentence, but if they can solve the mystery, too.

Multiple Analogies:

  1. Library Map Analogy: Before, models read a whole shelf book-by-book (lots of tokens). VTC turns the shelf into a dense map (images). VTCBench checks if students can still find every book, connect plots, and remember who did what after only seeing the map.
  2. Suitcase Analogy: You compress a week’s clothes into packing cubes (images). VTCBench tests not just if you packed socks (retrieval), but whether you dressed for the weather (reasoning) and remembered what you wore each day (memory).
  3. Puzzle Analogy: Text becomes a high-density jigsaw. VTCBench tests if the model can spot pieces (retrieval), see the hidden picture (reasoning), and recall where pieces were used (memory), even when the pieces look similar.

Before vs After:

  • Before: We assumed strong OCR meant strong understanding; many benchmarks encouraged matching exact words.
  • After: We learn most VLMs can read but struggle to reason and remember across long, compressed pages—especially with small fonts, long lengths, and “middle” placements. Yet one model (Gemini‑3‑Pro) shows near‑text‑baseline performance, proving VTC can work.

Why It Works (Intuition): VTCBench isolates key pressure points. By controlling compression (e.g., font size) and fixing or varying rendering, it shows whether failures come from perception (can’t see text), attention (lost in the middle), or cognition (can’t connect clues). Using tasks that minimize word overlap forces real understanding rather than word matching.

Building Blocks (each introduced with a sandwich):

🍞 Hook: You know how a magnifying glass helps you read tiny print? đŸ„Ź Concept: Vision‑Language Models (VLMs) are AIs that look at images and talk about them. How: 1) A vision encoder turns pixels into visual tokens, 2) a language model reasons over those tokens to produce answers. Why: Without this two-step, images stay as raw pixels and the model can’t discuss their contents. 🍞 Anchor: Show a photo of a menu; the VLM reads and answers “What’s the cheapest dessert?”

🍞 Hook: Think of counting beads instead of paragraphs. đŸ„Ź Concept: VTC Ratio is how many original text tokens you replace with each visual token after rendering to images. How: 1) Render text with operator R (font, size, spacing), 2) the vision encoder outputs visual tokens, 3) ratio r_VTC = text tokens / visual tokens. Why: Without tracking this, you can’t compare compression fairly across models. 🍞 Anchor: If 20,000 text tokens become 2,000 visual tokens, r_VTC=10.
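To make the ratio concrete, here is a minimal Python sketch of the arithmetic, assuming you already know the text‑token and visual‑token counts for a rendered document (the function name is illustrative, not from the paper’s code):

```python
def vtc_ratio(num_text_tokens: int, num_visual_tokens: int) -> float:
    """Compression ratio r_VTC = original text tokens / visual tokens after rendering."""
    if num_visual_tokens <= 0:
        raise ValueError("A rendered document must yield at least one visual token.")
    return num_text_tokens / num_visual_tokens

# The anchor example above: 20,000 text tokens rendered into 2,000 visual tokens.
print(vtc_ratio(20_000, 2_000))  # -> 10.0, i.e. 10x compression
```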

🍞 Hook: Like choosing a print size for your school poster. đŸ„Ź Concept: Rendering Operator R is the styling rulebook (font, size, colors, line height, image size) that turns text into images. How: Pick settings, print to pages, feed to the VLM. Why: Change R and you change readability and compression at once. 🍞 Anchor: 12‑pt Helvetica on 896×896 images vs 16‑pt changes how much fits and how easy it is to read.
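Below is a toy sketch of such a rendering operator using Pillow. The font file, page size, margins, and the average‑glyph‑width wrapping rule are all illustrative assumptions, not the paper’s actual pipeline:

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_pages(text: str, page_px: int = 896, font_path: str = "Helvetica.ttc",
                 font_size: int = 12, line_spacing: float = 1.2, margin: int = 16):
    """Render long text into square image pages (a toy rendering operator R)."""
    font = ImageFont.truetype(font_path, font_size)  # assumes this font file is available locally
    avg_char_w = font.getlength("x")                 # rough per-character width estimate
    chars_per_line = max(1, int((page_px - 2 * margin) / avg_char_w))
    line_h = int(font_size * line_spacing)
    lines_per_page = max(1, (page_px - 2 * margin) // line_h)

    lines = textwrap.wrap(text, width=chars_per_line)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", (page_px, page_px), "white")
        draw = ImageDraw.Draw(page)
        for i, line in enumerate(lines[start:start + lines_per_page]):
            draw.text((margin, margin + i * line_h), line, fill="black", font=font)
        pages.append(page)
    return pages
```

Changing `font_size` or `line_spacing` here changes how many pages (and therefore how many visual tokens) the same text needs—exactly the knob the benchmark turns.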

🍞 Hook: Hide a needle in a haystack. đŸ„Ź Concept: VTC‑Retrieval checks if the model can find exact facts inside long, compressed pages. How: Insert key‑value “needles” into long distractor text and ask for them. Why: Without retrieval, the model can’t even start reasoning. 🍞 Anchor: “What’s the magic number for long‑context?” when the page states “One magic number is: 2026.”

🍞 Hook: Solve a riddle, not just copy a sentence. đŸ„Ź Concept: VTC‑Reasoning tests associative reasoning with minimal word overlap between question and evidence. How: Use clues that require commonsense links. Why: Without this, a model can only parrot matches and can’t think. 🍞 Anchor: From “Katie is vegan,” answer “Who can’t eat fish-based meals?”

🍞 Hook: Remember who said what in a long group chat. đŸ„Ź Concept: VTC‑Memory checks whether the model can recall and integrate facts over long dialogues. How: Single‑hop, multi‑hop, temporal, and open‑domain questions. Why: Without memory, assistants forget user details and timelines. 🍞 Anchor: “What did Caroline research?” after a long threaded chat.

03Methodology

At a high level: Long text → Render to images (R) → Vision encoder makes visual tokens → Language model reasons → Answers to Retrieval, Reasoning, Memory questions.

Step‑by‑step (with sandwiches for new ideas as they appear):

  1. Build the Inputs
  • What happens: The authors take long contexts (essays, book snippets, multi‑turn dialogues) and convert them into image pages using a rendering operator R. Two setups are used: (a) Predefined VTC ratio: adjust font size per model so everyone sees the same compression level; (b) Predefined rendering: fix the style (e.g., 12‑pt Helvetica, 896×896) so images are identical, even if models produce different token counts.
  • Why this step exists: It controls the experiment so we can compare apples to apples—either same compression or same look.
  • Example: The same 8k‑token essay is rendered to a handful of 896×896 pages. In ratio mode, a small font might give r_VTC≈2; in fixed‑style mode, r_VTC differs by model.

🍞 Hook: Like measuring how much you can fit on a poster depending on font size. đŸ„Ź Concept: Font size strongly controls readability and compression. How: Smaller font fits more text per page (higher r_VTC) but is harder to read; larger font lowers compression but boosts OCR accuracy. Why: Without tuning font size, results can be unfair or misleading. 🍞 Anchor: Table 2 shows that increasing font size often lifts accuracy, especially for harder tasks.
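As a rough illustration of that tuning loop, here is a hedged sketch that sweeps font sizes from large to small until a target compression is reached. It reuses the toy `render_pages` helper above and assumes a fixed number of visual tokens per page; neither assumption comes from the paper:

```python
def pick_font_size(text: str, text_tokens: int, target_ratio: float,
                   visual_tokens_per_page: int = 400, sizes=range(20, 7, -1)):
    """Largest (most legible) font size whose compression still reaches target_ratio.

    Larger fonts need more pages (lower r_VTC), so scan from large to small and
    stop at the first size that compresses enough.
    """
    for size in sizes:
        pages = render_pages(text, font_size=size)      # toy renderer sketched earlier
        ratio = text_tokens / (len(pages) * visual_tokens_per_page)
        if ratio >= target_ratio:
            return size, ratio
    return min(sizes), None  # even the smallest font cannot hit the target
```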

  2. Define the Tasks
  ‱ VTC‑Retrieval: Four needle‑in‑a‑haystack (NIAH) flavors—Single, Multi‑Keys (find the right one among many), Multi‑Values (collect all values for a key), Multi‑Queries (answer many keys at once).
  • VTC‑Reasoning: Questions require linking clues with minimal lexical overlap, pushing true understanding.
  • VTC‑Memory: Long chats adapted from LoCoMo; four subtypes—Single‑hop, Multi‑hop, Temporal, Open‑domain.
  • Why this step exists: Retrieval tests seeing; Reasoning tests thinking; Memory tests remembering and integrating. All three are needed for real long‑context use.
  ‱ Example: The benchmark might hide “One magic number is 2026” on page 3 and ask for it, or ask “Who can’t eat fish-based meals?” after saying “Katie is vegan.” (A minimal construction sketch follows this list.)
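For the retrieval flavor, a single‑needle sample can be assembled in a few lines of Python. The needle wording matches the anchor example above; the distractor text and field names are placeholders, not the benchmark’s actual data:

```python
def build_niah_sample(distractor_sentences: list[str], depth: float = 0.5,
                      value: str = "2026") -> dict:
    """Insert a key-value 'needle' at a chosen relative depth inside distractor text."""
    needle = f"One magic number is: {value}."
    pos = int(len(distractor_sentences) * depth)   # 0.0 = start of context, 1.0 = end
    haystack = distractor_sentences[:pos] + [needle] + distractor_sentences[pos:]
    return {
        "context": " ".join(haystack),             # later rendered to image pages
        "question": "What is the magic number for long-context?",
        "answer": value,
        "depth": depth,
    }

# A needle buried exactly in the middle of 1,000 filler sentences.
sample = build_niah_sample([f"Filler sentence {i}." for i in range(1000)], depth=0.5)
```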

🍞 Hook: Like grading a quiz with different sections: find, think, remember. đŸ„Ź Concept: ContainsAll vs ContainsAny are ways to score aggregation. How: ContainsAll checks if all required items are returned; ContainsAny accepts at least one. Why: Without strict ContainsAll, a model that finds only half the facts looks better than it should. 🍞 Anchor: If two values are correct, “A and B,” returning only “A” fails ContainsAll.
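In code, the two scoring rules are just `all` versus `any` over the required items. This is a minimal sketch using plain substring checks; the real benchmark may normalize answers differently:

```python
def contains_all(answer: str, required: list[str]) -> bool:
    """Strict aggregation: every required item must appear in the answer."""
    answer = answer.lower()
    return all(item.lower() in answer for item in required)

def contains_any(answer: str, required: list[str]) -> bool:
    """Lenient scoring: at least one required item appears in the answer."""
    answer = answer.lower()
    return any(item.lower() in answer for item in required)

# Returning only one of two required values passes ContainsAny but fails ContainsAll.
print(contains_any("One value is 2026.", ["2026", "7341"]))  # True
print(contains_all("One value is 2026.", ["2026", "7341"]))  # False
```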

  3. Control the Settings
  • Predefined VTC Ratio: The benchmark picks a target r_VTC (e.g., 2× compression), then adjusts R (mostly font size) per model so everyone sees equivalent density. Images are standardized to 896×896 to align with common patch sizes.
  • Predefined Rendering: Fix R (e.g., 12‑pt Helvetica, 96‑dpi, tight spacing) so every model sees identical images; the actual r_VTC then varies with the model’s vision encoder.
  • Why this step exists: Separates what’s due to compression (density) from what’s due to vision architecture (tokenization style).
  • Example: In fixed‑render mode, a model that heavily pools features may produce very few tokens and lose detail; another that tiles images might produce more tokens and maintain detail.

🍞 Hook: Think of two races—one where all bikes have the same gear ratio, and one where all ride the same track but choose their own gearing. đŸ„Ź Concept: Visual tokens are the pieces the VLM actually reasons over after seeing the image. How: The vision encoder patches, pools, or tiles the image into a set of token embeddings. Why: Without enough, small, legible tokens, text details blur and reasoning collapses. 🍞 Anchor: A thumbnail‑based model might waste 256 tokens on a tiny, unreadable overview for dense text.
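To see why token budgets matter, here is a back‑of‑the‑envelope estimate of visual tokens per page. The 14‑pixel patches, 2×2 token merging, and 256‑token thumbnail are illustrative assumptions about common VLM designs, not measurements from the paper:

```python
import math

def visual_tokens_per_page(page_px: int = 896, patch_px: int = 14,
                           pool: int = 2, thumbnail_tokens: int = 0) -> int:
    """Estimate visual tokens for one square page: patches, pooling, optional thumbnail."""
    patches_per_side = math.ceil(page_px / patch_px)   # 896 / 14 = 64 patches per side
    tokens = (patches_per_side // pool) ** 2           # 2x2 merging -> 32 * 32 = 1024 tokens
    return tokens + thumbnail_tokens

print(visual_tokens_per_page())                       # ~1024 tokens of legible detail
print(visual_tokens_per_page(thumbnail_tokens=256))   # 256 extra tokens spent on a tiny overview
```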

  4. VTCBench‑Wild for Realism
  • What happens: The authors build a pool of 99 render styles (fonts, sizes 10–20 px, line heights 1.0–1.5) and sample randomly, plus a large question pool spanning lengths and needle positions. They then sample a final set (800 retrieval, 800 reasoning, 600 memory) to test robustness.
  • Why this step exists: Real life has varied PDFs and chat apps; a single neat style can hide brittleness.
  • Example: A model might ace 12‑pt Helvetica but falter on 10‑pt Times New Roman with tighter lines.

🍞 Hook: Like checking if a student can read any handwriting, not just the teacher’s favorite font. đŸ„Ź Concept: Robustness to rendering means the model keeps working across styles. How: Test many fonts, sizes, spacings, and colors. Why: Without robustness, a model fails on documents that look slightly different. 🍞 Anchor: Glyph improves a lot when the background color changes back to white.
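A wild‑style sampler can be as simple as drawing from a precomputed pool of styles. The font list below is an illustrative subset; the size range (10–20 px) and line heights (1.0–1.5) come from the section above, while the exact pool construction is an assumption:

```python
import itertools
import random

FONTS = ["Helvetica", "Times New Roman", "Courier New"]   # illustrative subset
SIZES_PX = range(10, 21)                                   # 10-20 px
LINE_HEIGHTS = [1.0, 1.25, 1.5]                            # within the 1.0-1.5 range

STYLE_POOL = list(itertools.product(FONTS, SIZES_PX, LINE_HEIGHTS))

def sample_style(rng: random.Random) -> dict:
    """Draw one random rendering style for a wild-setting sample."""
    font, size, line_height = rng.choice(STYLE_POOL)
    return {"font": font, "size_px": size, "line_height": line_height}

rng = random.Random(0)       # fixed seed so the sampled test set is reproducible
print(sample_style(rng))
```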

  5. Measure and Analyze
  • What happens: For retrieval and reasoning, the benchmark checks whether the exact correct items are in the answer (ContainsAll). For memory, an LLM judge grades short answers. Needle positions sweep from start to end to expose positional biases.
  • Why this step exists: It ties scores to clear skills and reveals phenomena like “lost in the middle.”
  ‱ Example: Heatmaps show accuracy vs. needle depth and context length; a U‑shape means edges are easier than the middle (a toy version of this binning is sketched below).
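A toy version of that analysis just bins per‑question results by needle depth and context length and averages accuracy per cell. The record fields used here are assumptions about how results might be logged:

```python
from collections import defaultdict

LENGTH_BINS = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000]   # context lengths in text tokens

def depth_length_accuracy(records):
    """Average accuracy per (needle-depth bin, context-length bin).

    Each record is assumed to carry 'depth' in [0, 1], 'length' in text tokens,
    and a boolean 'correct'. Along the depth axis, high accuracy at 0 and 1 with
    a dip near 0.5 is the lost-in-the-middle pattern.
    """
    scores = defaultdict(list)
    for r in records:
        depth_bin = min(int(r["depth"] * 10), 9)    # ten depth bins across the document
        length_bin = next((b for b in LENGTH_BINS if r["length"] <= b), LENGTH_BINS[-1])
        scores[(depth_bin, length_bin)].append(r["correct"])
    return {cell: sum(v) / len(v) for cell, v in scores.items()}
```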

The Secret Sauce: VTCBench precisely disentangles reading vs. thinking vs. remembering under compression, controls both compression and style, and adds a wild setting to expose hidden brittleness. It turns vague “does VTC work?” into concrete, fixable engineering questions: font/readability thresholds, token budgets, layout‑aware attention, and training for associative reasoning.

04Experiments & Results

The Test (What and Why):

  • Retrieval: Can models accurately find planted facts in long, compressed pages? This checks basic visual reading and precise grounding.
  • Reasoning: Can models answer questions whose wording doesn’t literally match the evidence? This probes associative reasoning, not just word matching.
  • Memory: Can models answer questions about long dialogues—single‑hop, multi‑hop, temporal, and open‑domain—after text is compressed into images? This tests integration over time.

The Competition (Who):

  • Open‑source VLMs: Qwen2.5‑VL (7B/72B), Qwen3‑VL (8B/235B‑A22B), InternVL3.5 (8B/38B), GLM‑4.1V‑9B‑Thinking, Kimi‑VL‑A3B‑Instruct, Gemma3‑27B, Glyph, DeepSeek‑OCR.
  • Proprietary VLMs: Gemini‑2.5‑Pro, GPT‑5; plus text‑only LLM baselines Qwen3‑8B and Gemini‑3‑Pro (LLM).

The Scoreboard with Context:

  • Retrieval: Many VLMs do well on shorter contexts, reflecting good OCR and literal matching. But as contexts grow (up to 32k tokens) performance usually drops. Text‑only Qwen3‑8B stays above ~95% at long lengths, highlighting a VTC gap for most VLMs.
  • Reasoning: Here’s the big collapse. Even strong retrievers stumble when questions don’t share exact words with answers. Qwen3‑8B (LLM) still does relatively well (e.g., ~94% at 1k), but most VLMs fall far lower. Higher compression (smaller fonts) hurts more.
  • Memory: Multi‑hop, temporal, and open‑domain questions are tough. Some models like Qwen2.5‑VL‑72B perform competitively on single‑hop, but overall, memory under VTC is much weaker than text‑only baselines.

Clear Standouts and Caveats:

  • Gemini‑3‑Pro (VLM) on VTCBench‑Wild: Overall 87.64%, nearly matching its text‑only version and outperforming Qwen3‑8B (text LLM baseline) on the wild set. This proves VTC can work very well with the right architecture/training.
  • Glyph: Balanced and relatively strong among open‑source VLMs, but sensitive to background and style; switching to a white background can boost memory a lot.
  ‱ InternVL3.5 (and likely GPT‑5) allocate tokens to a thumbnail overview; for dense text, that thumbnail is illegible, so a chunk of the token budget is wasted, likely dragging performance down.

Surprising Findings:

  • Lost‑in‑the‑Middle (Spatial Version): Accuracy vs. needle position forms a U‑shape—models do best at the start and end of pages and worst in the center, especially as length increases. This echoes text‑only LLM behavior but now in 2D layout.
  • Font Size Dominates: Increasing font size (reducing compression) often yields big gains across tasks. Style changes (font family, colors) matter far less than legibility.
  • Refusal Behavior: Some models (e.g., Qwen3‑VL series) often refuse to answer associative questions when words don’t match literally, sinking reasoning scores.
  • Aggregation Gaps: In multi‑value retrieval, models often find one of several needed items (ContainsAny) but miss others (ContainsAll), showing retrieval, not synthesis, is the main bottleneck over long contexts.

Making the Numbers Meaningful:

  • Think of ContainsAll near 90% as an A grade; many VLMs drop toward C or worse as context grows, while the text LLM baseline stays around A.
  • On VTCBench‑Wild, Gemini‑3‑Pro’s 87.64% is like finishing a marathon near the front, while many open‑source VLMs finish in the middle of the pack.

Takeaway: Most VLMs can “read” compressed text, but deep understanding (reasoning, memory) suffers under high density, long lengths, and challenging layouts—unless the architecture is specially prepared, as Gemini‑3‑Pro suggests.

05Discussion & Limitations

Limitations:

  • Language Scope: The benchmark is English‑only and mainly general‑domain prose and conversations; we don’t yet know how well VTC behaves for scripts like Chinese/Arabic or specialized formats like code and legal briefs.
  • API Black Boxes: Closed‑source model APIs may resize or pre‑process images in hidden ways, muddying comparisons of rendering settings.
  • Snapshot of a Moving Target: As new VLM architectures appear, results may shift; ongoing updates are needed.

Required Resources:

  • Rendering pipeline to generate images at various styles and sizes.
  • GPUs capable of VLM inference over many images (e.g., 896×896 pages per sample).
  • Evaluation infrastructure for ContainsAll metrics and an LLM judge for memory scoring.

When NOT to Use:

  • Ultra‑tiny fonts or extreme compression where legibility crashes; models will likely fail both retrieval and reasoning.
  • Tasks requiring subtle multi‑hop reasoning or temporal tracking over very long contexts unless you have evidence your model is robust under VTC.
  • Architectures that hard‑allocate many tokens to thumbnails/global overviews if your inputs are dense text rather than natural images.

Open Questions:

  • Can we train vision encoders specifically for dense text pages, avoiding wasted thumbnail tokens and boosting center‑of‑page attention?
  • What pretraining or alignment best teaches associative reasoning over visually encoded text, not just OCR?
  • Can layout‑aware attention or recurrence reduce the lost‑in‑the‑middle effect for 2D text grids?
  • Where is the sweet spot between compression and comprehension across languages, scripts, and document types?
  • How can prompts or small adapters lower refusal rates on non‑literal reasoning without harming safety?

06Conclusion & Future Work

3‑Sentence Summary: This paper introduces VTCBench and VTCBench‑Wild—the first thorough benchmarks to test whether vision‑language models truly understand long contexts after compressing text into images. Results show most models read well but struggle to reason and remember under high compression, long lengths, and varied styles—though Gemini‑3‑Pro proves VTC can work at near text‑only levels. The study maps out where and why failures happen (font size, layout, thumbnail tokens, lost‑in‑the‑middle), giving clear targets for better designs.

Main Achievement: Turning “VTC seems fine because OCR is good” into a rigorous, multi‑task, multi‑style evaluation that reveals the real gap between surface reading and deep understanding—and demonstrating that top performance is possible with the right architecture.

Future Directions: Train vision encoders and attention specifically for dense, uniform text; develop layout‑aware long‑range mechanisms; reduce refusal on associative queries; optimize token budgets (ditch wasted thumbnails); and extend to more languages and document types (code, legal, scientific).

Why Remember This: VTC promises big speed and memory wins for long documents, but only if models keep their thinking caps on. VTCBench shows exactly where today’s systems falter and where a few already shine, setting the path to fast, frugal, and truly thoughtful long‑context AI.

Practical Applications

  ‱ Build VLM readers for long PDFs that auto‑select legible font rendering to balance compression with accuracy.
  ‱ Tune font size dynamically based on model and task (retrieval vs reasoning) to hit a safe readability threshold.
  ‱ Adopt layout‑aware prompting (“important facts appear near the middle”) to counter lost‑in‑the‑middle effects.
  ‱ Avoid thumbnail‑heavy encoders for dense documents; reallocate tokens to high‑resolution tiles for text.
  ‱ Use ContainsAll‑style evaluations in production to ensure multi‑fact answers aren’t partially correct.
  ‱ Train with associative‑reasoning curricula (non‑literal question/answer pairs) to reduce refusals and guesswork.
  ‱ Standardize document rendering presets (e.g., 12‑pt Helvetica, 96‑dpi) for consistent performance across teams.
  ‱ Add render‑style augmentation (fonts, sizes, spacing) during training to improve robustness to real‑world docs.
  ‱ Cache per‑page visual tokens for repeated queries over the same document to cut latency and costs (see the sketch after this list).
  ‱ Introduce center‑focused attention or sliding‑window rereads for middle‑page facts in very long contexts.
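For the caching idea above, a minimal sketch might key cached visual tokens on a hash of each rendered page’s bytes. The `encode_page` callable and its signature are hypothetical stand‑ins for whatever vision encoder you use:

```python
import hashlib

def page_fingerprint(page_bytes: bytes) -> str:
    """Stable cache key for a rendered page: hash of its pixel bytes."""
    return hashlib.sha256(page_bytes).hexdigest()

class VisualTokenCache:
    """Reuse visual tokens for pages already encoded (e.g. repeated queries on one PDF)."""

    def __init__(self, encode_page):
        self._encode_page = encode_page   # hypothetical: bytes -> visual-token tensor
        self._store = {}

    def get_tokens(self, page_bytes: bytes):
        key = page_fingerprint(page_bytes)
        if key not in self._store:
            self._store[key] = self._encode_page(page_bytes)   # pay encoding cost only once
        return self._store[key]
```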
Tags: vision‑text compression, VTCBench, vision‑language models, long‑context understanding, needle‑in‑a‑haystack, associative reasoning, long‑term memory, OCR, compression ratio, rendering operator, visual tokens, lost in the middle, robustness to rendering, benchmarking VLMs, document understanding