InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Key Summary
- InfiniteVL is a vision-language model that mixes two ideas: local focus with Sliding Window Attention and long-term memory with a linear module called Gated DeltaNet.
- It keeps speed and memory use nearly constant no matter how long the input gets, so it can handle unlimited images or video streams.
- On a single RTX 4090, InfiniteVL holds a steady real-time 24 FPS in streaming video while a comparable Transformer slows down and eventually runs out of memory.
- Per-token generation is over 3.6× faster than similar Transformer VLMs at around 50K tokens, and the speed gap grows to about 8× at 300-350K tokens.
- It performs competitively with leading Transformer VLMs on general benchmarks and does especially well on text-rich tasks like charts, documents, and OCR (e.g., 82% ChartQA, 78.5% TextVQA, 91.7% DocVQA).
- A three-stage training plan (distillation pretraining, instruction tuning, and long-sequence fine-tuning) lets InfiniteVL learn fast with much less data.
- The model is self-contained (no external memory banks) yet remembers earlier context through a compact, learned memory.
- Even with unlimited inputs, it runs within about 9 GB of VRAM, making it friendly for edge devices.
- A small number of window-attention layers greatly boosts fine-detail tasks, while most layers use linear memory for long-range context.
- Overall, InfiniteVL shows you can have long-term memory, strong short-task skills, and real-time speed, all in one VLM.
Why This Research Matters
InfiniteVL makes long, continuous understanding practical on everyday hardware. It can read entire documents, watch long videos, and remember early details without slowing down or running out of memory. That enables real-time assistance for drivers, robots, and AR devices that need to react now while recalling what happened minutes ago. It also helps students, journalists, and doctors process complex charts, tables, and reports quickly. Because it's self-contained, deployment is simpler: no external memory servers to manage. In short, it brings always-on, detail-aware, long-memory AI closer to phones, cars, and small edge devices.
Detailed Explanation
01 Background & Problem Definition
šž Top Bread (Hook): You know how telling a really long story is hard if you don't take notes? After a while, you forget who did what and when.
š„¬ Filling (The Actual Concept): What it is: Classic Transformer-based vision-language models (VLMs) pay attention to every piece of the story at once. This is powerful but gets very slow and memory-hungry as the story gets longer. How it works (step by step):
- 1) The model turns images and words into tokens (little pieces). 2) It makes every token look at every other token (full attention). 3) To answer a question, it mixes all these connections. Why it matters: Without a smarter plan, time and memory grow quickly as inputs get longer, making real-time video understanding or huge-document reading too slow or impossible.
šž Bottom Bread (Anchor): If you stream a long video to a Transformer VLM, it starts fast but slows down, needs more and more memory, and can crash when it runs out, like trying to carry every class note in your backpack forever.
šž Top Bread (Hook): Imagine trying to follow a parade by looking only through a small window in a fence: you see details up close, but miss the floats that already passed.
š„¬ The Concept: Sliding Window Attention (SWA) What it is: SWA makes the model focus on a recent local window of tokens instead of the whole history, keeping compute small and fast. How it works:
- 1) Keep a moving window (e.g., the last 8K tokens). 2) Attend only inside that window. 3) Slide the window forward as new tokens arrive. Why it matters: It's speedy and good at details right now, but it forgets what fell outside the window (see the sketch below).
šž Anchor: In a live soccer stream, SWA lets the model clearly read the scoreboard this minute, but it forgets a goal from 10 minutes ago if it's outside the window.
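To make the windowed-masking idea concrete, here is a minimal sketch of sliding-window attention in PyTorch. The single-head setup, window size, and dimensions are illustrative choices for this explainer, not InfiniteVL's actual implementation.

```python
# Minimal sliding-window attention sketch (toy, single-head).
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=8):
    # q, k, v: (seq_len, dim). Each position attends only to itself and
    # the `window - 1` positions immediately before it.
    seq_len = q.size(0)
    scores = (q @ k.T) / q.size(-1) ** 0.5                 # (seq_len, seq_len)
    idx = torch.arange(seq_len)
    # Causal + local mask: keep key j for query i only when i - window < j <= i.
    keep = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = scores.masked_fill(~keep, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                   # (seq_len, dim)

q = k = v = torch.randn(32, 16)
out = sliding_window_attention(q, k, v, window=8)
print(out.shape)  # torch.Size([32, 16])
```

Tokens older than the window contribute exactly nothing, which is why a pure SWA model forgets the goal scored ten minutes ago.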
šž Top Bread (Hook): Picture a librarian who doesn't keep every book open at once. Instead, they keep a tidy summary card that grows smarter as more books arrive.
š„¬ The Concept: Linear Attention What it is: A faster attention method that keeps a fixed-size summary (state) instead of storing every old token. How it works:
- 1) Convert keys/queries into features. 2) Add each new token's info into a compact state. 3) Read from that state to answer questions. Why it matters: It avoids a growing cache, so speed and memory stay steady even for very long inputs, but fine details can get squished in the summary (see the sketch below).
šž Anchor: When scanning a 200-page comic, linear attention keeps a running summary so you remember the plot, but you might miss the tiny text on page 47.
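Here is a minimal sketch of that "running summary" idea, assuming a simple positive feature map; it is meant to show the constant-size state, not the exact kernel InfiniteVL or any particular linear-attention variant uses.

```python
# Linear attention as a recurrent, fixed-size summary (toy sketch).
import torch
import torch.nn.functional as F

def linear_attention_stream(qs, ks, vs):
    # qs, ks, vs: (seq_len, dim). The state S is (dim, dim) and the
    # normalizer z is (dim,); neither grows with sequence length.
    dim = qs.size(-1)
    phi = lambda x: F.elu(x) + 1                    # simple non-negative feature map (assumed)
    S = torch.zeros(dim, dim)
    z = torch.zeros(dim)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S = S + torch.outer(fk, v)                  # fold the new token into the summary
        z = z + fk
        fq = phi(q)
        outputs.append((fq @ S) / (fq @ z + 1e-6))  # read from the compact state
    return torch.stack(outputs)

out = linear_attention_stream(torch.randn(100, 16), torch.randn(100, 16), torch.randn(100, 16))
print(out.shape)  # torch.Size([100, 16]); memory cost is the same at step 100 or step 1,000,000
```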
šž Top Bread (Hook): Think of a smart notebook that decides what to keep, what to fade, and how to rearrange notes so they don't all blur together.
š„¬ The Concept: Gated DeltaNet What it is: A linear-attention-style module that updates its memory with gates and a clever rotation so it stores more useful, less tangled summaries. How it works:
- 1) Keep a fixed-size memory matrix. 2) Use gates to decide how much of the old memory to keep. 3) Apply a rotation-like step so new info doesn't collapse the memory. 4) Write the new info compactly. 5) Read from memory with the current query. Why it matters: Without gates and rotation, long histories mix together and details collide; Gated DeltaNet reduces collisions and keeps long-term memory clearer (see the sketch below).
šž Anchor: It's like tidying your binder with dividers and sticky notes: older pages aren't lost, and you can still find last week's homework fast.
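The following is a heavily simplified schematic of a gated delta-rule update in the spirit of Gated DeltaNet; the constant gate values, the single state matrix, and the normalization here are illustrative assumptions, not the paper's exact recurrence.

```python
# Schematic gated delta-rule memory update (simplified, not the paper's exact form).
import torch

def gated_delta_step(S, k, v, alpha, beta):
    # S: (dim, dim) fixed-size memory; k, v: (dim,) key and value for the new token.
    # alpha in (0, 1): decay gate, how much old memory to keep.
    # beta in (0, 1): write strength for the new key-value association.
    k = k / (k.norm() + 1e-6)
    stored = k @ S                                   # what the memory currently returns for this key
    S = alpha * (S - beta * torch.outer(k, stored))  # erase the stale association (delta rule) + decay
    S = S + beta * torch.outer(k, v)                 # write the fresh association
    return S

dim = 16
S = torch.zeros(dim, dim)
for _ in range(1000):                                # arbitrarily long stream, constant memory
    S = gated_delta_step(S, torch.randn(dim), torch.randn(dim), alpha=0.98, beta=0.5)
readout = torch.randn(dim) @ S                       # query the compact memory
print(readout.shape)  # torch.Size([16])
```

The erase-then-write step is what keeps new facts from simply piling on top of old ones, which is the "collision" problem plain linear attention suffers from.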
šž Top Bread (Hook): Training a marathon runner is different from training a sprinter: you warm up, practice form, then add long runs.
š„¬ The Concept: Long-Sequence Fine-Tuning (SFT) What it is: A training phase where the model practices very long inputs so its memory skills actually turn on and stabilize. How it works:
- 1) Start with regular instruction tasks. 2) Gradually lengthen the sequences (images, videos, documents). 3) Teach the model to stay coherent over tens of thousands of tokens. Why it matters: Without this practice, the model can have the right architecture but still stumble on long contexts.
šž Anchor: It's like practicing to perform a 2-hour play: you rehearse longer and longer scenes until you can keep the whole story straight.
šž Top Bread (Hook): What if we could have both the telescope for local detail and a map for the whole journey?
š„¬ The Concept: InfiniteVL What it is: A hybrid VLM that interleaves Sliding Window Attention (for sharp local detail) with Gated DeltaNet (for compact, long-term memory) so it runs fast forever. How it works:
- 1) Use a vision encoder to turn pictures/video frames into tokens. 2) Mix layers: a few SWA layers for crisp local reading, and mostly Gated DeltaNet layers for long-range memory. 3) Train in three stages: distill from a strong teacher, instruction-tune, then long-sequence SFT. 4) Stream inputs with constant memory. Why it matters: You get real-time speed, strong detail reading, and long-term recall, all together, without giant memory growth.
šž Anchor: On a phone, InfiniteVL can summarize an hour-long video, remember early scenes, read tiny subtitles, and still stay smooth at 24 FPS.
02 Core Idea
šž Top Bread (Hook): You know how wearing both glasses and binoculars helps? You wear glasses to see the whole room clearly and binoculars to zoom in on a bird in a tree.
š„¬ The Aha! Moment (One sentence): Put a long-memory, linear-time brain (Gated DeltaNet) and a local-detail, windowed eye (SWA) into the same model, then train it in stages so it's fast, sharp, and remembers for a very long time.
Multiple Analogies:
- Map + Magnifying Glass: Gated DeltaNet is your city map (big picture over time), SWA is your magnifying glass (street-level signs). You need both to navigate well.
- Notes + Highlighter: Gated DeltaNet is your neat summary notes; SWA is the highlighter to capture tiny facts in the moment.
- Pantry + Spice Rack: Gated DeltaNet is the pantry storing essentials compactly; SWA is the spice rack giving detailed flavor where needed.
Before vs. After:
- Before: Window-only models ran fast but forgot everything outside the window. Linear-only models remembered long context but often missed tiny visual details like OCR.
- After: InfiniteVL keeps long-term memory through a compact state and preserves local fine-grained detail with a few SWA layers, so it reads charts, tables, and documents well and still scales to unlimited inputs.
Why It Works (Intuition, no equations):
- Long context needs a memory that doesn't grow with length. Linear attention provides a fixed-size state. Gated DeltaNet improves that state so it doesn't collapse into mush over time.
- Local detail needs focused attention. A small number of SWA layers gives the model a "close-up" view where precision matters (like reading text or interpreting a small figure in a chart).
- Interleaving them lets local details flow into a global memory and vice versa, so the model can recall old facts while still seeing new details clearly.
Building Blocks (each with a mini-sandwich):
- šž Hook: Imagine moving from checking eyes close-up to checking long-distance vision. š„¬ What it is: Hybrid Block (1 SWA layer + 3 Gated DeltaNet layers). How it works: 1) SWA captures fresh local details. 2) Gated DeltaNet updates long-term memory. 3) Repeat across 9 blocks so details and memory reinforce each other (see the layer-stack sketch after this list). Why it matters: Without the SWA layers, tiny details fade; without Gated DeltaNet, long stories get lost. šž Anchor: Reading a comic series for months: SWA reads the tiny speech bubbles today; Gated DeltaNet remembers plots from last season.
- šž Hook: Learning from a top student in class. š„¬ What it is: Distillation Pretraining. How it works: 1) Start from a strong Transformer teacher's weights. 2) Replace attention with Gated DeltaNet. 3) Make the student match the teacher layer-by-layer and end-to-end. Why it matters: It quickly transfers general knowledge so the student doesn't start from zero. šž Anchor: It's like copying the best notes before adding your own improvements.
- šž Hook: Practicing how to answer questions nicely. š„¬ What it is: Instruction Tuning. How it works: 1) Use curated Q&A/dialogue data. 2) Train to follow instructions, use formats, and reason. Why it matters: Without it, the model may know things but won't respond helpfully. šž Anchor: You learn to show your work, not just say the answer.
- šž Hook: Rehearsing a full-length play. š„¬ What it is: Long-Sequence SFT. How it works: 1) Use longer videos/documents. 2) Extend context up to 32,768 tokens (and beyond in streaming). 3) Stabilize long-memory behavior. Why it matters: The memory muscles get strong only by practicing long contexts. šž Anchor: Training from 5-minute scenes to full 2-hour performances.
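As a concrete picture of the Hybrid Block from the first bullet, here is a tiny sketch of the 1:3 layer schedule (1 SWA layer, then 3 Gated DeltaNet layers, repeated across 9 blocks); the layer names are placeholders, not the authors' class names.

```python
# Illustrative layer schedule for the 1:3 hybrid pattern (placeholder names).
def build_layer_schedule(num_blocks=9, swa_per_block=1, gdn_per_block=3):
    schedule = []
    for _ in range(num_blocks):
        schedule += ["SWA"] * swa_per_block + ["GatedDeltaNet"] * gdn_per_block
    return schedule

layers = build_layer_schedule()
print(len(layers))    # 36 layers in total
print(layers[:8])     # ['SWA', 'GatedDeltaNet', 'GatedDeltaNet', 'GatedDeltaNet', 'SWA', ...]
```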
Key Takeaway: By weaving memory-friendly linear layers (Gated DeltaNet) with detail-friendly windows (SWA), and training in three smart stages, InfiniteVL achieves real-time, long-range, and fine-detail multimodal understanding.
03 Methodology
High-Level Recipe: Input (images/video + text) → Vision encoder + tokenizer → Hybrid Blocks [SWA → Gated DeltaNet ×3] × 9 → Language head → Output tokens.
Step 1: Turn images and words into tokens
- What happens: A high-resolution vision encoder (from Qwen2.5-VL) converts each image or frame into visual tokens; a text tokenizer converts words into text tokens; a small MLP projects visual tokens into the same space as text.
- Why this step: Without a shared space, the model can't mix vision and language well.
- Example: A 10-second clip at 1 FPS gives 10 frames. If each frame becomes ~256 tokens, that's ~2,560 visual tokens plus a few dozen text tokens for the question.
Step 2: Process tokens with Hybrid Blocks (1 SWA + 3 Gated DeltaNet)
- What happens: • The SWA layer (with RoPE positions and GQA: 16 query heads, 2 key-value heads) attends within a local window (e.g., 8K tokens) to sharpen fine details. • Then three Gated DeltaNet layers (16 heads each, with a 1D conv of window size 4 and an output gate) read and write a fixed-size memory state, with no growing KV cache, capturing long-range dependencies. • Residual connections and layer norms stabilize training; MLPs (hidden size ~11008, SiLU) add nonlinearity and capacity. (A configuration sketch follows this step.)
- Why this step: SWA alone forgets older context; linear memory alone can blur tiny details. Interleaving them preserves both.
- Example: When reading a scanned contract, SWA helps read a small clause number; Gated DeltaNet helps recall an earlier definition from page 1 while you're now on page 8.
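To collect the hyperparameters mentioned in this step in one place, here is an illustrative configuration sketch; the dataclass and its field names are my own framing, with the values taken from the description above.

```python
# Illustrative per-block configuration (field names are assumptions, values from the text above).
from dataclasses import dataclass

@dataclass
class HybridBlockConfig:
    # Sliding Window Attention layer (local detail)
    swa_window: int = 8192        # local attention window in tokens
    swa_query_heads: int = 16     # GQA: many query heads ...
    swa_kv_heads: int = 2         # ... share a few key/value heads
    use_rope: bool = True         # rotary position embeddings
    # Gated DeltaNet layers (long-range memory), three per block
    gdn_layers: int = 3
    gdn_heads: int = 16
    gdn_conv_window: int = 4      # short 1D conv before the recurrent update
    gdn_output_gate: bool = True
    # Shared feed-forward settings
    mlp_hidden: int = 11008
    activation: str = "silu"

print(HybridBlockConfig())        # nine such blocks are stacked in the full model
```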
Step 3: Output generation
- What happens: The decoder-only LLM head produces answer tokens one by one, conditioned on the hybrid representation.
- Why this step: This is how the model turns understanding into useful text.
- Example: For "What does the legend say in the bottom-right chart?", the model outputs a sentence quoting the legend text.
Training: Three Stages (the "coach plan")
- Stage I: Distillation Pretraining • What: Initialize from a strong Transformer VLM (Qwen2.5-VL). Replace its attention with Gated DeltaNet while inheriting other weights. • How: First align each layer's outputs (MSE) using the teacher's previous layer as input for both; then do end-to-end KL matching of token distributions. Max image 512×512; max sequence 8,192. • Why: It transfers broad knowledge fast and stabilizes the linear layers. • Example data: ~1M multimodal QA/caption pairs for layer-wise and ~1M for end-to-end. (A loss sketch follows this stage.)
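Here is a hedged sketch of the two distillation signals just described, with random tensors standing in for real activations: a layer-wise MSE on hidden states (the student layer fed the teacher's previous-layer output), followed by an end-to-end KL between token distributions. The shapes and vocabulary size are illustrative.

```python
# Toy illustration of the two Stage I objectives (dummy tensors, illustrative shapes).
import torch
import torch.nn.functional as F

# Layer-wise alignment: both layers see the teacher's previous-layer output,
# and the student's hidden states are pulled toward the teacher's via MSE.
teacher_hidden = torch.randn(4, 128, 2048)   # (batch, seq, hidden) from the frozen teacher layer
student_hidden = torch.randn(4, 128, 2048)   # from the Gated DeltaNet student layer
layerwise_loss = F.mse_loss(student_hidden, teacher_hidden)

# End-to-end alignment: match the token distributions with a KL divergence.
teacher_logits = torch.randn(4, 128, 32000)  # (batch, seq, vocab)
student_logits = torch.randn(4, 128, 32000)
kd_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.softmax(teacher_logits, dim=-1),
    reduction="batchmean",
)
print(layerwise_loss.item(), kd_loss.item())
```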
- Stage II: Instruction Supervised Fine-Tuning (SFT) • What: Tune on diverse, high-quality instruction data (general VQA, charts/tables, OCR/docs, math/reasoning, code, science/education, text-only) to improve helpfulness and reasoning. • How: Cross-entropy training with curated prompts; raise image resolution to 1344×1344; keep max length at 8,192 for efficiency. • Why: Distillation teaches "what," instruction tuning teaches "how to answer well." • Example data: ~8M multimodal SFT samples.
- Stage III: Long-Sequence SFT • What: Teach the model to stay coherent over long contexts and streaming. • How: Extend max context to 32,768 tokens; mix ~200K long video QA/caption pairs (sampled from LLaVA-Video-178K at 10 FPS, up to 224 frames, ≤256 tokens per frame) with ~800K general SFT samples; use LoRA for efficient updates (a minimal LoRA sketch follows this stage). • Why: Without targeted long-sequence practice, the memory may exist but not perform steadily over time. • Example: Ask questions about frames seen 10 minutes earlier; the model should recall them without slowing down.
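Since Stage III relies on LoRA for efficient updates, here is a minimal from-scratch LoRA sketch showing the core trick of freezing a base weight and training only a small low-rank correction; the rank, scaling, and target layer are illustrative, not the paper's settings.

```python
# Minimal LoRA adapter sketch (illustrative rank and scaling).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weight stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(2048, 2048))
out = layer(torch.randn(2, 2048))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                         # only the two small adapter matrices are trained
```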
Secret Sauce (why it's clever):
- The architecture is self-contained: no external memory bank, yet it remembers long context via a compact, gated, rotated state (Gated DeltaNet).
- Only a small fraction of layers need to be SWA to recover fine detail (boosting OCR, document, and chart tasks), while most layers stay linear for speed and constant memory.
- The staged training warms up knowledge (distill), behavior (instruction), and endurance (long-sequence) so the model is both smart and steady.
Efficiency and Streaming:
- Because linear layers don't need a growing KV cache, latency per token stays almost constant, even at 300K+ tokens.
- The fixed execution path enables CUDA Graph capture for streaming, shaving off overhead and stabilizing 24 FPS prefill (see the capture sketch after this list).
- Example measurement: On an RTX 4090, InfiniteVL keeps ≈24 FPS in streaming with ~274 tokens per frame, while a Transformer baseline drops from ~10 FPS to ~1 FPS and hits OOM near 300 frames.
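For readers curious how CUDA Graph capture looks in practice, here is a hedged PyTorch sketch of capturing and replaying a fixed-shape step; the stand-in model, buffer sizes, and warmup count are assumptions, not InfiniteVL's streaming code. A CUDA GPU is required.

```python
# Sketch: capture a fixed-shape inference step into a CUDA Graph and replay it per frame.
import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()      # stand-in for one fixed-shape prefill step
static_input = torch.zeros(1, 1024, device="cuda")     # static buffer: copy each frame's features into it

# Warm up on a side stream (recommended before capture).
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one fixed execution path into a replayable graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Streaming loop: refill the static input, replay with minimal launch overhead.
for _ in range(100):
    static_input.copy_(torch.randn(1, 1024, device="cuda"))
    graph.replay()
    result = static_output.clone()                      # read the refreshed output buffer
print(result.shape)  # torch.Size([1, 1024])
```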
Putting It All Together:
- Input → Encode → [SWA (local) → Gated DeltaNet ×3 (memory)] ×9 → Output. Train with Distill → SFT → Long-SFT. Result: A model that reads tiny details now and remembers what happened a long time ago, without slowing down.
04 Experiments & Results
The Test (what and why):
- General Understanding: MME, MMStar, MMBench, SeedBench (image), ScienceQA, RealWorldQA, and AI2D check broad multimodal skills.
- Text-Rich & Reasoning: ChartQA, TextVQA, DocVQA, OCRBench, MMMU, and MathVista stress fine-detail reading, structure, and reasoning.
- Long-Context & Streaming: Video-MME and LongVideoBench test whether performance stays steady as frames and tokens grow, alongside FPS and latency measurements.
The Competition:
- Similar-scale Transformer VLMs (e.g., Qwen2.5-VL-3B, InternVL2.5-4B, Phi-3.5-Vision-4B, SmolVLM2-2B, PaliGemma2-3B, TinyLLaVA-3B).
- Linear/Hybrid VLMs (e.g., Cobra-3B, MaTVLM-3B).
The Scoreboard (with context):
- Text-rich wins: InfiniteVL hits 82.0% on ChartQA, 78.5% on TextVQA, and 91.7% on DocVQA, scores in the ballpark of strong Transformers at similar size, and clearly ahead of prior linear/hybrid baselines. That's like getting an A when earlier linear models got C's on reading charts and documents.
- Overall performance: InfiniteVL's averages sit in the mid-70s across general multimodal suites, comparable to leading Transformer models of similar size and training scale, while using a much more efficient architecture.
- Long-context resilience: On Video-MME and LongVideoBench, InfiniteVL's performance stays stable (or improves slightly) as the number of frames increases beyond 32, while a window-only Transformer degrades once its window is exceeded. Like a runner who keeps pace after 10 miles, InfiniteVL's memory doesn't fade.
Speed and Memory (the headline wins):
- Per-token latency: a >3.6× speedup over a similar Transformer at ~50K tokens; the advantage grows to ~8× by 300-350K tokens, where the Transformer hits out-of-memory.
- Streaming FPS: InfiniteVL sustains ≈24 FPS prefill with ~274 tokens/frame; the Transformer baseline slides from ~10 FPS to ~1 FPS and crashes around 300 frames.
- Memory footprint: About 9 GB of VRAM remains steady even with unlimited inputs; there is no ever-growing cache.
Surprising Findings:
- A little SWA goes a long way: Even a small ratio of SWA layers produces big jumps on OCR/doc/chart tasks; gains keep coming with more SWA but with diminishing returns. The 1:3 SWA-to-Gated-DeltaNet ratio balances detail and memory best.
- Training stages matter: Skipping distillation or long-sequence SFT hurts. Stage I+II yields the strongest short/general results; adding Stage III trades a tiny dip in short tasks for much better long-context generalization.
- Not all linear modules are equal: Vanilla Linear Attention struggled to converge; Mamba/GLA converged but lagged on text-rich tasks; Gated DeltaNet significantly improved stability and accuracy on DocVQA, TextVQA, and OCRBench, showing that smarter memory updates really help fine-grained vision-language understanding.
Bottom Line:
- InfiniteVL matches or approaches Transformer performance at similar sizes while breaking the long-input efficiency constraints: constant memory, stable latency, and real-time streaming.
05 Discussion & Limitations
Limitations:
- Memory is compact, not infinite: Even smart compression can blur extremely fine details over very long durations. For ultra-precise pixel-level reasoning that spans hours, some information may still fade.
- Slight trade-off after long-sequence SFT: Adding Stage III can nudge down short-task scores a bit while boosting long-context strength; finding the perfect data mix remains an art.
- Teacher dependence: Distillation quality depends on the teacher's strengths and biases; errors can be inherited.
- Fixed hybrid ratio: A static SWA/Gated-DeltaNet layout may be suboptimal for some domains; dynamic routing could help.
Required Resources:
- Training: Multi-GPU setup (authors used NVIDIA H20s, BF16), ~10M open-source samples, three-stage pipeline, LoRA for Stage III.
- Inference: A single RTX 4090 runs unlimited streams within about 9 GB of VRAM; lower-VRAM edge devices benefit most from the constant-memory design.
When NOT to Use:
- Tiny, short problems that fit comfortably in a small attention window where a simple windowed Transformer is already fast enough.
- Tasks demanding ultra-precise global pixel alignment (e.g., some dense segmentation pipelines) without any tolerance for compression.
- Scenarios where the teacher model for distillation is weak or misaligned.
Open Questions:
- Can the hybrid ratio be adapted on the fly based on input complexity (dynamic SWA placement)?
- How can we further reduce long-term collisions in the compact memory while keeping it small and fast?
- Can the model learn to "bookmark" rare but crucial details (e.g., a legal clause) for guaranteed later retrieval?
- How far can streaming stability go (multi-hour, multi-day logs) before we need hierarchical memory?
- What interpretability tools best reveal what the memory actually stores over time?
06 Conclusion & Future Work
Three-Sentence Summary:
- InfiniteVL interleaves Sliding Window Attention (for local detail) with Gated DeltaNet (for long-term, linear-time memory) and trains in three stages to be both sharp and steady.
- It achieves competitive accuracy with similar-sized Transformer VLMs but keeps latency and memory nearly constant, enabling real-time 24 FPS streaming and ultra-long context understanding on a single GPU.
- The result is a self-contained, deployment-friendly VLM that handles unlimited inputs without external memory.
Main Achievement:
- Showing that a carefully designed hybrid (mostly linear memory with a touch of windowed attention) plus a staged training strategy can deliver Transformer-level capability while breaking the long-input speed and memory bottleneck.
Future Directions:
- Stronger, more interpretable long-term memory updates; dynamic hybrid ratios; smarter data mixes that keep short-task scores high while pushing long-context further; and broader tests on edge devices.
Why Remember This:
- InfiniteVL proves you don't have to pick between long memory, fine detail, and real-time speed. With the right mix, you can have all three, opening the door to always-on, low-cost multimodal AI that remembers what it sees.
Practical Applications
- Real-time dashcam copilots that remember earlier traffic events and read road signs or license plates on the fly.
- AR assistants that describe scenes continuously while recalling past instructions and reading small on-screen text.
- On-device document assistants that scan long PDFs, extract tables, and answer questions without cloud memory.
- Customer-support bots that watch long tutorial videos and provide step-by-step help with reliable recall.
- Robotics perception that keeps track of objects over minutes of navigation while recognizing tiny labels or warnings.
- Meeting and lecture companions that align slides, charts, and transcripts over long sessions and summarize later.
- News and research summarizers that connect facts across long articles and embedded figures or tables.
- Healthcare intake tools that read multi-page forms, maintain patient context, and flag key details in scans.
- Security monitoring that watches continuous video feeds, remembering earlier anomalies and reading small overlays.
- Education tools that help students analyze long lab videos and complex diagrams with accurate, context-aware explanations.