InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Key Summary
- InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
- It keeps speed and memory use nearly constant no matter how long the input gets, so it can handle unlimited images or video streams.
- On a single RTX 4090, InfiniteVL holds a steady real-time 24 FPS in streaming video while a comparable Transformer slows down and eventually runs out of memory.
- Per-token generation is over 3.6× faster than similar Transformer VLMs at around 50K tokens, and the speed gap grows to about 8× at 300-350K tokens.
- It performs competitively with leading Transformer VLMs on general benchmarks and does especially well on text-rich tasks like charts, documents, and OCR (e.g., 82% ChartQA, 78.5% TextVQA, 91.7% DocVQA).
- A three-stage training plan (distillation pretraining, instruction tuning, and long-sequence fine-tuning) lets InfiniteVL learn fast with much less data.
- The model is self-contained (no external memory banks) yet remembers earlier context through a compact, learned memory.
- Even with unlimited inputs, it runs within about 9 GB of VRAM, making it friendly for edge devices.
- A small number of window-attention layers greatly boosts fine-detail tasks, while most layers use linear memory for long-range context.
- Overall, InfiniteVL shows you can have long-term memory, strong short-task skills, and real-time speed, all in one VLM.
Why This Research Matters
InfiniteVL makes long, continuous understanding practical on everyday hardware. It can read entire documents, watch long videos, and remember early details without slowing down or running out of memory. That enables real-time assistance for drivers, robots, and AR devices that need to react now while recalling what happened minutes ago. It also helps students, journalists, and doctors process complex charts, tables, and reports quickly. Because it's self-contained, deployment is simpler: no external memory servers to manage. In short, it brings always-on, detail-aware, long-memory AI closer to phones, cars, and small edge devices.
Detailed Explanation
01 Background & Problem Definition
šž Top Bread (Hook): You know how telling a really long story is hard if you don't take notes? After a while, you forget who did what and when.
š„¬ Filling (The Actual Concept): What it is: Classic Transformer-based vision-language models (VLMs) pay attention to every piece of the story at once. This is powerful but gets very slow and memory-hungry as the story gets longer. How it works (step by step):
- 1) The model turns images and words into tokens (little pieces). 2) It makes every token look at every other token (full attention). 3) To answer a question, it mixes all these connections. Why it matters: Without a smarter plan, time and memory grow quickly as inputs get longer, making real-time video understanding or huge-document reading too slow or impossible.
šž Bottom Bread (Anchor): If you stream a long video to a Transformer VLM, it starts fast but slows down, needs more and more memory, and can crash when it runs out, like trying to carry every class note in your backpack forever.
šž Top Bread (Hook): Imagine trying to follow a parade by looking only through a small window in a fence: you see details up close, but miss the floats that already passed.
š„¬ The Concept: Sliding Window Attention (SWA) What it is: SWA makes the model focus on a recent local window of tokens instead of the whole history, keeping compute small and fast. How it works:
- 1) Keep a moving window (e.g., the last 8K tokens). 2) Attend only inside that window. 3) Slide the window forward as new tokens arrive. Why it matters: It's speedy and good at details right now, but it forgets what fell outside the window (see the sketch below).
šž Anchor: In a live soccer stream, SWA lets the model clearly read the scoreboard this minute, but it forgets a goal from 10 minutes ago if it's outside the window.
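To make the windowed-masking idea concrete, here is a minimal sketch of sliding-window attention in PyTorch. The single-head setup, window size, and dimensions are illustrative choices for this explainer, not InfiniteVL's actual implementation.

```python
# Minimal sliding-window attention sketch (toy, single-head).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=8):
    # q, k, v: (seq_len, dim). Each position attends only to itself and
    # the `window - 1` positions immediately before it.
    seq_len = q.size(0)
    scores = (q @ k.T) / q.size(-1) ** 0.5                 # (seq_len, seq_len)
    idx = torch.arange(seq_len)
    # Causal + local mask: keep key j for query i only when i - window < j <= i.
    keep = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                   # (seq_len, dim)

q = k = v = torch.randn(32, 16)
out = sliding_window_attention(q, k, v, window=8)
print(out.shape)  # torch.Size([32, 16])
```

Tokens older than the window contribute exactly nothing, which is why a pure SWA model forgets the goal scored ten minutes ago.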
šž Top Bread (Hook): Picture a librarian who doesn't keep every book open at once. Instead, they keep a tidy summary card that grows smarter as more books arrive.
š„¬ The Concept: Linear Attention What it is: A faster attention method that keeps a fixed-size summary (state) instead of storing every old token. How it works:
- 1) Convert keys/queries into features. 2) Add each new token's info into a compact state. 3) Read from that state to answer questions. Why it matters: It avoids a growing cache, so speed and memory stay steady even for very long inputs, but fine details can get squished in the summary (see the sketch below).
šž Anchor: When scanning a 200-page comic, linear attention keeps a running summary so you remember the plot, but you might miss the tiny text on page 47.
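Here is a minimal sketch of that "running summary" idea, assuming a simple positive feature map; it is meant to show the constant-size state, not the exact kernel InfiniteVL or any particular linear-attention variant uses.

```python
# Linear attention as a recurrent, fixed-size summary (toy sketch).
import torch
import torch.nn.functional as F

def linear_attention_stream(qs, ks, vs):
    # qs, ks, vs: (seq_len, dim). The state S is (dim, dim) and the
    # normalizer z is (dim,); neither grows with sequence length.
    dim = qs.size(-1)
    phi = lambda x: F.elu(x) + 1                    # simple non-negative feature map (assumed)
    S = torch.zeros(dim, dim)
    z = torch.zeros(dim)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S = S + torch.outer(fk, v)                  # fold the new token into the summary
        z = z + fk
        fq = phi(q)
        outputs.append((fq @ S) / (fq @ z + 1e-6))  # read from the compact state
    return torch.stack(outputs)

out = linear_attention_stream(torch.randn(100, 16), torch.randn(100, 16), torch.randn(100, 16))
print(out.shape)  # torch.Size([100, 16]); memory cost is the same at step 100 or step 1,000,000
```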
šž Top Bread (Hook): Think of a smart notebook that decides what to keep, what to fade, and how to rearrange notes so they don't all blur together.
š„¬ The Concept: Gated DeltaNet What it is: A linear-attention-style module that updates its memory with gates and a clever rotation so it stores more useful, less tangled summaries. How it works:
- 1) Keep a fixed-size memory matrix. 2) Use gates to decide how much of the old memory to keep. 3) Apply a rotation-like step so new info doesn't collapse the memory. 4) Write the new info compactly. 5) Read from memory with the current query. Why it matters: Without gates and rotation, long histories mix together and details collide; Gated DeltaNet reduces collisions and keeps long-term memory clearer (see the sketch below).
šž Anchor: It's like tidying your binder with dividers and sticky notes: older pages aren't lost, and you can still find last week's homework fast.
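The following is a heavily simplified schematic of a gated delta-rule update in the spirit of Gated DeltaNet; the constant gate values, the single state matrix, and the normalization here are illustrative assumptions, not the paper's exact recurrence.

```python
# Schematic gated delta-rule memory update (simplified, not the paper's exact form).
import torch

def gated_delta_step(S, k, v, alpha, beta):
    # S: (dim, dim) fixed-size memory; k, v: (dim,) key and value for the new token.
    # alpha in (0, 1): decay gate, how much old memory to keep.
    # beta in (0, 1): write strength for the new key-value association.
    k = k / (k.norm() + 1e-6)
    stored = k @ S                                   # what the memory currently returns for this key
    S = alpha * (S - beta * torch.outer(k, stored))  # erase the stale association (delta rule) + decay
    S = S + beta * torch.outer(k, v)                 # write the fresh association
    return S

dim = 16
S = torch.zeros(dim, dim)
for _ in range(1000):                                # arbitrarily long stream, constant memory
    S = gated_delta_step(S, torch.randn(dim), torch.randn(dim), alpha=0.98, beta=0.5)
readout = torch.randn(dim) @ S                       # query the compact memory
print(readout.shape)  # torch.Size([16])
```

The erase-then-write step is what keeps new facts from simply piling on top of old ones, which is the "collision" problem plain linear attention suffers from.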
šž Top Bread (Hook): Training a marathon runner is different from training a sprinter: you warm up, practice form, then add long runs.
š„¬ The Concept: Long-Sequence Fine-Tuning (SFT) What it is: A training phase where the model practices very long inputs so its memory skills actually turn on and stabilize. How it works:
- 1) Start with regular instruction tasks. 2) Gradually lengthen the sequences (images, videos, documents). 3) Teach the model to stay coherent over tens of thousands of tokens. Why it matters: Without this practice, the model can have the right architecture but still stumble on long contexts.
šž Anchor: It's like practicing to perform a 2-hour play: you rehearse longer and longer scenes until you can keep the whole story straight.
šž Top Bread (Hook): What if we could have both the telescope for local detail and a map for the whole journey?
š„¬ The Concept: InfiniteVL What it is: A hybrid VLM that interleaves Sliding Window Attention (for sharp local detail) with Gated DeltaNet (for compact, long-term memory) so it runs fast forever. How it works:
- 1) Use a vision encoder to turn pictures/video frames into tokens. 2) Mix layers: a few SWA layers for crisp local reading, and mostly Gated DeltaNet layers for long-range memory. 3) Train in three stages: distill from a strong teacher, instruction-tune, then long-sequence SFT. 4) Stream inputs with constant memory. Why it matters: You get real-time speed, strong detail reading, and long-term recall, all together, without giant memory growth.
šž Anchor: On a phone, InfiniteVL can summarize an hour-long video, remember early scenes, read tiny subtitles, and still stay smooth at 24 FPS.
02 Core Idea
šž Top Bread (Hook): You know how wearing both glasses and binoculars helps? You wear glasses to see the whole room clearly and binoculars to zoom in on a bird in a tree.
š„¬ The Aha! Moment (One sentence): Put a long-memory, linear-time brain (Gated DeltaNet) and a local-detail, windowed eye (SWA) into the same model, then train it in stages so it's fast, sharp, and remembers for a very long time.
Multiple Analogies:
- Map + Magnifying Glass: Gated DeltaNet is your city map (big picture over time), SWA is your magnifying glass (street-level signs). You need both to navigate well.
- Notes + Highlighter: Gated DeltaNet is your neat summary notes; SWA is the highlighter to capture tiny facts in the moment.
- Pantry + Spice Rack: Gated DeltaNet is the pantry storing essentials compactly; SWA is the spice rack giving detailed flavor where needed.
Before vs. After:
- Before: Window-only models ran fast but forgot everything outside the window. Linear-only models remembered long context but often missed tiny visual details like OCR.
- After: InfiniteVL keeps long-term memory through a compact state and preserves local fine-grained detail with a few SWA layers, so it reads charts, tables, and documents well and still scales to unlimited inputs.
Why It Works (Intuition, no equations):
- Long context needs a memory that doesn't grow with length. Linear attention provides a fixed-size state. Gated DeltaNet improves that state so it doesn't collapse into mush over time.
- Local detail needs focused attention. A small number of SWA layers gives the model a "close-up" view where precision matters (like reading text or interpreting a small figure in a chart).
- Interleaving them lets local details flow into a global memory and vice versa, so the model can recall old facts while still seeing new details clearly.
Building Blocks (each with a mini-sandwich):
- šž Hook: Imagine moving from checking eyes close-up to checking long-distance vision. š„¬ What it is: Hybrid Block (1 SWA layer + 3 Gated DeltaNet layers). How it works: 1) SWA captures fresh local details. 2) Gated DeltaNet updates long-term memory. 3) Repeat across 9 blocks so details and memory reinforce each other (see the layer-stack sketch after this list). Why it matters: Without the SWA layers, tiny details fade; without Gated DeltaNet, long stories get lost. šž Anchor: Reading a comic series for months: SWA reads the tiny speech bubbles today; Gated DeltaNet remembers plots from last season.
- šž Hook: Learning from a top student in class. š„¬ What it is: Distillation Pretraining. How it works: 1) Start from a strong Transformer teacher's weights. 2) Replace attention with Gated DeltaNet. 3) Make the student match the teacher layer-by-layer and end-to-end. Why it matters: It quickly transfers general knowledge so the student doesn't start from zero. šž Anchor: It's like copying the best notes before adding your own improvements.
- šž Hook: Practicing how to answer questions nicely. š„¬ What it is: Instruction Tuning. How it works: 1) Use curated Q&A/dialogue data. 2) Train to follow instructions, use formats, and reason. Why it matters: Without it, the model may know things but won't respond helpfully. šž Anchor: You learn to show your work, not just say the answer.
- šž Hook: Rehearsing a full-length play. š„¬ What it is: Long-Sequence SFT. How it works: 1) Use longer videos/documents. 2) Extend context up to 32,768 tokens (and beyond in streaming). 3) Stabilize long-memory behavior. Why it matters: The memory muscles get strong only by practicing long contexts. šž Anchor: Training from 5-minute scenes to full 2-hour performances.
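As a concrete picture of the Hybrid Block from the first bullet, here is a tiny sketch of the 1:3 layer schedule (1 SWA layer, then 3 Gated DeltaNet layers, repeated across 9 blocks); the layer names are placeholders, not the authors' class names.

```python
# Illustrative layer schedule for the 1:3 hybrid pattern (placeholder names).
def build_layer_schedule(num_blocks=9, swa_per_block=1, gdn_per_block=3):
    schedule = []
    for _ in range(num_blocks):
        schedule += ["SWA"] * swa_per_block + ["GatedDeltaNet"] * gdn_per_block
    return schedule

layers = build_layer_schedule()
print(len(layers))    # 36 layers in total
print(layers[:8])     # ['SWA', 'GatedDeltaNet', 'GatedDeltaNet', 'GatedDeltaNet', 'SWA', ...]
```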
Key Takeaway: By weaving memory-friendly linear layers (Gated DeltaNet) with detail-friendly windows (SWA), and training in three smart stages, InfiniteVL achieves real-time, long-range, and fine-detail multimodal understanding.
03 Methodology
High-Level Recipe: Input (images/video + text) → Vision encoder + tokenizer → Hybrid Blocks [SWA → Gated DeltaNet ×3] × 9 → Language head → Output tokens.
Step 1: Turn images and words into tokens
- What happens: A high-resolution vision encoder (from Qwen2.5-VL) converts each image or frame into visual tokens; a text tokenizer converts words into text tokens; a small MLP projects visual tokens into the same space as text.
- Why this step: Without a shared space, the model can't mix vision and language well.
- Example: A 10-second clip at 1 FPS gives 10 frames. If each frame becomes ~256 tokens, that's ~2,560 visual tokens plus a few dozen text tokens for the question.
Step 2: Process tokens with Hybrid Blocks (1 SWA + 3 Gated DeltaNet)
- What happens: • The SWA layer (with RoPE positions and GQA: 16 query heads, 2 key-value heads) attends within a local window (e.g., 8K tokens) to sharpen fine details. • Then three Gated DeltaNet layers (16 heads each, with a 1D conv of window size 4 and an output gate) read and write a fixed-size memory state, with no growing KV cache, capturing long-range dependencies. • Residual connections and layer norms stabilize training; MLPs (hidden size ~11008, SiLU) add nonlinearity and capacity. (A configuration sketch follows this step.)
- Why this step: SWA alone forgets older context; linear memory alone can blur tiny details. Interleaving them preserves both.
- Example: When reading a scanned contract, SWA helps read a small clause number; Gated DeltaNet helps recall an earlier definition from page 1 while you're now on page 8.
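To collect the hyperparameters mentioned in this step in one place, here is an illustrative configuration sketch; the dataclass and its field names are my own framing, with the values taken from the description above.

```python
# Illustrative per-block configuration (field names are assumptions, values from the text above).
from dataclasses import dataclass

@dataclass
class HybridBlockConfig:
    # Sliding Window Attention layer (local detail)
    swa_window: int = 8192        # local attention window in tokens
    swa_query_heads: int = 16     # GQA: many query heads ...
    swa_kv_heads: int = 2         # ... share a few key/value heads
    use_rope: bool = True         # rotary position embeddings
    # Gated DeltaNet layers (long-range memory), three per block
    gdn_layers: int = 3
    gdn_heads: int = 16
    gdn_conv_window: int = 4      # short 1D conv before the recurrent update
    gdn_output_gate: bool = True
    # Shared feed-forward settings
    mlp_hidden: int = 11008
    activation: str = "silu"

print(HybridBlockConfig())        # nine such blocks are stacked in the full model
```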
Step 3: Output generation
- What happens: The decoder-only LLM head produces answer tokens one by one, conditioned on the hybrid representation.
- Why this step: This is how the model turns understanding into useful text.
- Example: For "What does the legend say in the bottom-right chart?", the model outputs a sentence quoting the legend text.
Training: Three Stages (the "coach plan")
- Stage I: Distillation Pretraining • What: Initialize from a strong Transformer VLM (Qwen2.5-VL). Replace its attention with Gated DeltaNet while inheriting other weights. • How: First align each layer's outputs (MSE) using the teacher's previous layer as input for both; then do end-to-end KL matching of token distributions. Max image 512×512; max sequence 8,192. • Why: It transfers broad knowledge fast and stabilizes the linear layers. • Example data: ~1M multimodal QA/caption pairs for layer-wise and ~1M for end-to-end. (A loss sketch follows this stage.)
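Here is a hedged sketch of the two distillation signals just described, with random tensors standing in for real activations: a layer-wise MSE on hidden states (the student layer fed the teacher's previous-layer output), followed by an end-to-end KL between token distributions. The shapes and vocabulary size are illustrative.

```python
# Toy illustration of the two Stage I objectives (dummy tensors, illustrative shapes).
import torch
import torch.nn.functional as F

# Layer-wise alignment: both layers see the teacher's previous-layer output,
# and the student's hidden states are pulled toward the teacher's via MSE.
teacher_hidden = torch.randn(4, 128, 2048)   # (batch, seq, hidden) from the frozen teacher layer
student_hidden = torch.randn(4, 128, 2048)   # from the Gated DeltaNet student layer
layerwise_loss = F.mse_loss(student_hidden, teacher_hidden)

# End-to-end alignment: match the token distributions with a KL divergence.
teacher_logits = torch.randn(4, 128, 32000)  # (batch, seq, vocab)
student_logits = torch.randn(4, 128, 32000)
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
print(layerwise_loss.item(), kd_loss.item())
```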
- Stage II: Instruction Supervised Fine-Tuning (SFT) • What: Tune on diverse, high-quality instruction data (general VQA, charts/tables, OCR/docs, math/reasoning, code, science/education, text-only) to improve helpfulness and reasoning. • How: Cross-entropy training with curated prompts; raise image resolution to 1344×1344; keep max length at 8,192 for efficiency. • Why: Distillation teaches "what," instruction tuning teaches "how to answer well." • Example data: ~8M multimodal SFT samples.
- Stage III: Long-Sequence SFT • What: Teach the model to stay coherent over long contexts and streaming. • How: Extend max context to 32,768 tokens; mix ~200K long video QA/caption pairs (sampled from LLaVA-Video-178K at 10 FPS, up to 224 frames, ≤256 tokens per frame) with ~800K general SFT samples; use LoRA for efficient updates (a minimal LoRA sketch follows this stage). • Why: Without targeted long-sequence practice, the memory may exist but not perform steadily over time. • Example: Ask questions about frames seen 10 minutes earlier; the model should recall them without slowing down.
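Since Stage III relies on LoRA for efficient updates, here is a minimal from-scratch LoRA sketch showing the core trick of freezing a base weight and training only a small low-rank correction; the rank, scaling, and target layer are illustrative, not the paper's settings.

```python
# Minimal LoRA adapter sketch (illustrative rank and scaling).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weight stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(2048, 2048))
out = layer(torch.randn(2, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                         # only the two small adapter matrices are trained
```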
Secret Sauce (why it's clever):
- The architecture is self-contained: no external memory bank, yet it remembers long context via a compact, gated, rotated state (Gated DeltaNet).
- Only a small fraction of layers need to be SWA to recover fine detail (boosting OCR, document, and chart tasks), while most layers stay linear for speed and constant memory.
- The staged training warms up knowledge (distill), behavior (instruction), and endurance (long-sequence) so the model is both smart and steady.
Efficiency and Streaming:
- Because linear layers don't need a growing KV cache, latency per token stays almost constant, even at 300K+ tokens.
- The fixed execution path enables CUDA Graph capture for streaming, shaving off overhead and stabilizing 24 FPS prefill (see the capture sketch after this list).
- Example measurement: On an RTX 4090, InfiniteVL keeps ≈24 FPS in streaming with ~274 tokens per frame, while a Transformer baseline drops from ~10 FPS to ~1 FPS and hits OOM near 300 frames.
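For readers curious how CUDA Graph capture looks in practice, here is a hedged PyTorch sketch of capturing and replaying a fixed-shape step; the stand-in model, buffer sizes, and warmup count are assumptions, not InfiniteVL's streaming code. A CUDA GPU is required.

```python
# Sketch: capture a fixed-shape inference step into a CUDA Graph and replay it per frame.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()      # stand-in for one fixed-shape prefill step
static_input = torch.zeros(1, 1024, device="cuda")     # static buffer: copy each frame's features into it

# Warm up on a side stream (recommended before capture).
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one fixed execution path into a replayable graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Streaming loop: refill the static input, replay with minimal launch overhead.
for _ in range(100):
    static_input.copy_(torch.randn(1, 1024, device="cuda"))
    graph.replay()
    result = static_output.clone()                      # read the refreshed output buffer
print(result.shape)  # torch.Size([1, 1024])
```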
Putting It All Together:
- Input → Encode → [SWA (local) → Gated DeltaNet ×3 (memory)] ×9 → Output. Train with Distill → SFT → Long-SFT. Result: A model that reads tiny details now and remembers what happened a long time ago, without slowing down.
04 Experiments & Results
The Test (what and why):
- General Understanding: MME, MMStar, MMBench, SeedBench (image), ScienceQA, RealWorldQA, and AI2D check broad multimodal skills.
- Text-Rich & Reasoning: ChartQA, TextVQA, DocVQA, OCRBench, MMMU, and MathVista stress fine-detail reading, structure, and reasoning.
- Long-Context & Streaming: Video-MME and LongVideoBench test whether performance stays steady as frames and tokens grow, alongside FPS and latency measurements.
The Competition:
- Similar-scale Transformer VLMs (e.g., Qwen2.5-VL-3B, InternVL2.5-4B, Phi-3.5-Vision-4B, SmolVLM2-2B, PaliGemma2-3B, TinyLLaVA-3B).
- Linear/Hybrid VLMs (e.g., Cobra-3B, MaTVLM-3B).
The Scoreboard (with context):
- Text-rich wins: InfiniteVL hits 82.0% on ChartQA, 78.5% on TextVQA, and 91.7% on DocVQA, scores in the ballpark of strong Transformers at similar size, and clearly ahead of prior linear/hybrid baselines. That's like getting an A when earlier linear models got C's on reading charts and documents.
- Overall performance: InfiniteVL's averages sit in the mid-70s across general multimodal suites, comparable to leading Transformer models of similar size and training scale, while using a much more efficient architecture.
- Long-context resilience: On Video-MME and LongVideoBench, InfiniteVL's performance stays stable (or improves slightly) as the number of frames increases beyond 32, while a window-only Transformer degrades once its window is exceeded. Like a runner who keeps pace after 10 miles, InfiniteVL's memory doesn't fade.
Speed and Memory (the headline wins):
- Per-token latency: a >3.6× speedup over a similar Transformer at ~50K tokens; the advantage grows to ~8× by 300-350K tokens, where the Transformer hits out-of-memory.
- Streaming FPS: InfiniteVL sustains ≈24 FPS prefill with ~274 tokens/frame; the Transformer baseline slides from ~10 FPS to ~1 FPS and crashes around 300 frames.
- Memory footprint: About 9 GB of VRAM remains steady even with unlimited inputs; there is no ever-growing cache.
Surprising Findings:
- A little SWA goes a long way: Even a small ratio of SWA layers produces big jumps on OCR/doc/chart tasks; gains keep coming with more SWA but with diminishing returns. The 1:3 SWA-to-Gated-DeltaNet ratio balances detail and memory best.
- Training stages matter: Skipping distillation or long-sequence SFT hurts. Stage I+II yields the strongest short/general results; adding Stage III trades a tiny dip in short tasks for much better long-context generalization.
- Not all linear modules are equal: Vanilla Linear Attention struggled to converge; Mamba/GLA converged but lagged on text-rich tasks; Gated DeltaNet significantly improved stability and accuracy on DocVQA, TextVQA, and OCRBench, showing that smarter memory updates really help fine-grained vision-language understanding.
Bottom Line:
- InfiniteVL matches or approaches Transformer performance at similar sizes while breaking the long-input efficiency constraints: constant memory, stable latency, and real-time streaming.
05 Discussion & Limitations
Limitations:
- Memory is compact, not infinite: Even smart compression can blur extremely fine details over very long durations. For ultra-precise pixel-level reasoning that spans hours, some information may still fade.
- Slight trade-off after long-sequence SFT: Adding Stage III can nudge down short-task scores a bit while boosting long-context strength; finding the perfect data mix remains an art.
- Teacher dependence: Distillation quality depends on the teacher's strengths and biases; errors can be inherited.
- Fixed hybrid ratio: A static SWA/Gated-DeltaNet layout may be suboptimal for some domains; dynamic routing could help.
Required Resources:
- Training: Multi-GPU setup (authors used NVIDIA H20s, BF16), ~10M open-source samples, three-stage pipeline, LoRA for Stage III.
- Inference: A single RTX 4090 runs unlimited streams within about 9 GB of VRAM; lower-VRAM edge devices benefit most from the constant-memory design.
When NOT to Use:
- Tiny, short problems that fit comfortably in a small attention window where a simple windowed Transformer is already fast enough.
- Tasks demanding ultra-precise global pixel alignment (e.g., some dense segmentation pipelines) without any tolerance for compression.
- Scenarios where the teacher model for distillation is weak or misaligned.
Open Questions:
- Can the hybrid ratio be adapted on the fly based on input complexity (dynamic SWA placement)?
- How can we further reduce long-term collisions in the compact memory while keeping it small and fast?
- Can the model learn to "bookmark" rare but crucial details (e.g., a legal clause) for guaranteed later retrieval?
- How far can streaming stability go (multi-hour, multi-day logs) before we need hierarchical memory?
- What interpretability tools best reveal what the memory actually stores over time?
06 Conclusion & Future Work
Three-Sentence Summary:
- InfiniteVL interleaves Sliding Window Attention (for local detail) with Gated DeltaNet (for long-term, linear-time memory) and trains in three stages to be both sharp and steady.
- It achieves competitive accuracy with similar-sized Transformer VLMs but keeps latency and memory nearly constant, enabling real-time 24 FPS streaming and ultra-long context understanding on a single GPU.
- The result is a self-contained, deployment-friendly VLM that handles unlimited inputs without external memory.
Main Achievement:
- Showing that a carefully designed hybrid (mostly linear memory with a touch of windowed attention) plus a staged training strategy can deliver Transformer-level capability while breaking the long-input speed and memory bottleneck.
Future Directions:
- Stronger, more interpretable long-term memory updates; dynamic hybrid ratios; smarter data mixes that keep short-task scores high while pushing long-context further; and broader tests on edge devices.
Why Remember This:
- InfiniteVL proves you don't have to pick between long memory, fine detail, and real-time speed. With the right mix, you can have all three, opening the door to always-on, low-cost multimodal AI that remembers what it sees.
Practical Applications
- Real-time dashcam copilots that remember earlier traffic events and read road signs or license plates on the fly.
- AR assistants that describe scenes continuously while recalling past instructions and reading small on-screen text.
- On-device document assistants that scan long PDFs, extract tables, and answer questions without cloud memory.
- Customer-support bots that watch long tutorial videos and provide step-by-step help with reliable recall.
- Robotics perception that keeps track of objects over minutes of navigation while recognizing tiny labels or warnings.
- Meeting and lecture companions that align slides, charts, and transcripts over long sessions and summarize later.
- News and research summarizers that connect facts across long articles and embedded figures or tables.
- Healthcare intake tools that read multi-page forms, maintain patient context, and flag key details in scans.
- Security monitoring that watches continuous video feeds, remembering earlier anomalies and reading small overlays.
- Education tools that help students analyze long lab videos and complex diagrams with accurate, context-aware explanations.