CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

Intermediate
Moritz Bƶhle, AmƩlie Royer, Juliette Marrie et al. Ā· 12/22/2025
arXiv Ā· PDF

Key Summary

  • CASA is a new way to mix images and text inside a language model that keeps speed and memory low while keeping accuracy high.
  • Instead of stuffing image tokens into the text like a giant sandwich (token insertion), CASA lets text look at the image and at nearby text in small windows.
  • This small-window trick creates a natural on/off switch (implicit gating) so the model knows when to trust the image and when to trust the text.
  • CASA beats classic cross-attention models on tough tasks like reading charts and documents, where tiny details matter.
  • CASA comes close to the accuracy of full token insertion but uses much less memory, especially for long conversations and streaming videos.
  • It can be added to a plain text model to make it see images, or used to convert an existing vision-language model to be more efficient.
  • For live video captioning, CASA keeps latency low and memory almost flat over time, while token insertion slows down and runs out of memory.
  • Ablations show the self-attention part of CASA is crucial; removing it causes big performance drops.
  • Updating image tokens through heavy layers brings tiny gains but big costs, so CASA avoids it by design.

Why This Research Matters

CASA lets vision-language models stay fast and memory-friendly without giving up much accuracy, which is essential for devices we actually use every day. It means assistants can reliably read bills, forms, and charts without needing a giant server. It also keeps up with live video, producing captions or explanations in real time without freezing or crashing. For teams with limited GPUs, CASA makes training and serving multimodal models more practical. In education, accessibility, and on-the-go translation, CASA’s balance of detail and efficiency can reach more people and more places. Overall, it moves multimodal AI toward being useful, affordable, and responsive in the real world.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine trying to read a picture book while someone keeps sliding more and more big pictures between the words. Soon, your notebook is stuffed, and it’s hard to turn the pages.

🄬 The Concept (Self-Attention):

  • What it is: Self-attention is how a language model lets each word look at other words to figure out what’s important.
  • How it works:
    1. Take a word.
    2. Compare it to all earlier words.
    3. Give higher scores to the most helpful words.
    4. Mix the information, guided by those scores.
  • Why it matters: Without self-attention, the model forgets which words relate, like mixing up who did what in a sentence. šŸž Anchor: In ā€œThe cat sat on the mat,ā€ self-attention helps connect ā€œcatā€ and ā€œsat,ā€ not ā€œcatā€ and some random word.
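
A minimal, single-head causal self-attention sketch in PyTorch makes those four steps concrete. The names and sizes here (`d_model`, `w_q`, `w_k`, `w_v`) are illustrative, not taken from any particular model:

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) word embeddings; single head, no batching."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                  # 1. project each word to query/key/value
    scores = q @ k.T / k.shape[-1] ** 0.5                # 2. compare each word to every other word
    future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))   # a word may only look at earlier words
    weights = scores.softmax(dim=-1)                      # 3. higher score = more helpful word
    return weights @ v                                    # 4. mix information, guided by the scores

d_model = 16
x = torch.randn(6, d_model)                               # "The cat sat on the mat"
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)             # (6, d_model) updated word representations
```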

šŸž Hook: You know how a friend can point to a picture while you read the caption, so you use both signals at once?

🄬 The Concept (Cross-Attention):

  • What it is: Cross-attention lets text look at another source (like an image) to bring in extra clues.
  • How it works:
    1. Make image tokens (tiny pieces of the image) using a vision encoder.
    2. Let text tokens ask questions to the image tokens.
    3. Gather the most relevant visual bits.
    4. Blend them into the text’s thinking.
  • Why it matters: Without cross-attention, text can’t use the image’s details, like reading a chart without seeing the chart. šŸž Anchor: When asked, ā€œWhat color is the sign?ā€, cross-attention helps the word ā€œcolorā€ focus on the sign’s pixels.
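
The same sketch, rewritten as classic cross-attention: only the text forms queries, and the image tokens supply the keys and values. Again, every name and size is just for illustration:

```python
import torch

def cross_attention(text, image, w_q, w_k, w_v):
    """text: (n_text, d) text tokens; image: (n_img, d) tokens from a vision encoder."""
    q = text @ w_q                          # 2. text tokens ask the questions
    k, v = image @ w_k, image @ w_v         # 1. image tokens hold the visual clues
    scores = q @ k.T / k.shape[-1] ** 0.5
    weights = scores.softmax(dim=-1)        # 3. how strongly each word looks at each patch
    return weights @ v                      # 4. blend the visual bits into the text

d = 16
text = torch.randn(5, d)                    # e.g. "What color is the sign?"
image = torch.randn(64, d)                  # 64 patch tokens for one image
fused = cross_attention(text, image, *(torch.randn(d, d) for _ in range(3)))
```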

šŸž Hook: Picture a backpack where you shove every photo into your notebook between the words. Heavy!

🄬 The Concept (Token Insertion):

  • What it is: Token insertion puts image tokens directly into the text stream so everything attends to everything.
  • How it works:
    1. Convert an image to many tokens.
    2. Insert those tokens among the text tokens.
    3. Run normal self-attention over the giant mixed sequence.
    4. Let the model use both sources freely.
  • Why it matters: It’s accurate but expensive; big images and long videos mean thousands of extra tokens, which explode memory and compute. šŸž Anchor: Reading a PDF page at high resolution can add thousands of tokens; the model slows down and quickly fills its memory.
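
A back-of-the-envelope sketch of why token insertion gets heavy: the image tokens join the text in one long sequence, and self-attention scores every pair of tokens in it. All numbers below are made up for illustration:

```python
# Illustrative cost of token insertion (counts only, no model involved).
n_text = 200                      # a short question plus its answer
tokens_per_image = 1024           # one high-resolution image
n_images = 8                      # a multi-page document or a few video frames

seq_len = n_text + n_images * tokens_per_image   # everything shares one sequence
attention_pairs = seq_len ** 2                   # self-attention touches every pair
print(seq_len, attention_pairs)                  # 8392 tokens, ~70 million scored pairs
```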

šŸž Hook: Think of a whiteboard that remembers what you already wrote, so you don’t have to rewrite it every time you add a new sentence.

🄬 The Concept (KV Cache):

  • What it is: The KV cache stores past attention info so generation is faster.
  • How it works:
    1. When you process tokens, save their keys and values.
    2. Reuse them for the next steps instead of recomputing.
    3. Grow the cache as the sequence grows.
    4. Speed up future attention lookups.
  • Why it matters: If you insert lots of image tokens, the cache balloons, slowing everything and eating memory. šŸž Anchor: In streaming video, every new frame adds tokens; with insertion, your cache becomes a heavy backpack you must carry forever.
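
A toy KV cache, stripped down to one layer and one head so the growth is easy to see (real caches store keys and values per layer and per head):

```python
import torch

d = 16
cache_k, cache_v = [], []                  # the "whiteboard": keys/values saved so far

def generate_step(new_k, new_v):
    """Save this step's key/value and return the full cache for reuse."""
    cache_k.append(new_k)
    cache_v.append(new_v)
    return torch.stack(cache_k), torch.stack(cache_v)

for step in range(5):                      # each generated token adds one entry
    k, v = generate_step(torch.randn(d), torch.randn(d))
print(k.shape)                             # (5, 16): the cache grows with every token
# With token insertion, every new frame also dumps ~1000 entries in here;
# with CASA, image tokens are never cached, so the cache tracks only the text.
```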

šŸž Hook: When your phone storage is full, you zip files, but zipping images too much makes them blurry.

🄬 The Concept (Token Compression):

  • What it is: Compression reduces the number of image tokens before using them.
  • How it works:
    1. Group or pool image patches.
    2. Keep only summary tokens (like queries).
    3. Drop the rest.
    4. Use fewer tokens to save memory.
  • Why it matters: Over-compressing deletes fine details (like tiny text on a receipt), hurting accuracy. šŸž Anchor: On chart-reading tasks, heavy compression lowers scores because small bars and labels vanish.
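
A minimal sketch of compression using plain average pooling. Real systems often use learned query tokens (e.g., a Q-Former) instead; pooling is shown here only because it makes the information loss easy to see:

```python
import torch
import torch.nn.functional as F

d = 16
grid = torch.randn(1, d, 32, 32)                  # 32x32 = 1024 image patch tokens (channels first)
pooled = F.avg_pool2d(grid, kernel_size=2)        # merge each 2x2 block of patches into one token
compressed = pooled.flatten(2).transpose(1, 2)    # (1, 256, d): 4x fewer tokens to attend to
# Cheaper, but any 2x2 block that held a tiny axis label or receipt digit
# is now blurred into its neighbours -- exactly the detail OCR/chart tasks need.
```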

The World Before: Vision-language models got great by inserting visual tokens into the text stream. It was simple and strong because self-attention handled both. But with high-res images, long chats, or videos, token counts exploded, making training and inference slow and memory-hungry.

The Problem: Researchers revisited cross-attention for efficiency: only the text queries the image, and image tokens don't pass through all the heavy layers. It runs fast and keeps the KV cache small. But a performance gap appeared on fine-grained tasks like charts, documents, and small-text reading.

Failed Attempts: Researchers tried adding gates (manual on/off switches), extra visual tokens, and special modules to update visual features. These helped a bit but didn’t close the gap, especially for high-detail reading.

The Gap: Cross-attention lacked local text-to-text interaction during visual fusion. Text was influenced by the image without also checking in with nearby text to keep context steady.

Real Stakes: In daily life, this means slow or inaccurate assistants that can’t reliably read your bills, forms, menus, and charts—or keep up with live video. We need something both accurate and efficient so phones, laptops, and edge devices can handle real-world multimodal tasks smoothly.

02Core Idea

šŸž Hook: Imagine you’re taking notes while looking at a picture. You don’t just stare at the photo—you also re-read the last few words you wrote to make sure the next word fits.

🄬 The Concept (CASA, Cross-Attention via Self-Attention):

  • What it is: CASA is a fusion layer where a text token looks at the image tokens and also at nearby text tokens in a small, causal window.
  • How it works:
    1. Split the conversation into windows: each new image starts a new window, followed by its related text.
    2. For a current text token, let it attend to (a) the image tokens and (b) the text tokens back to the image.
    3. The attention softmax balances how much to take from image vs. text (implicit gating).
    4. Keep image tokens out of heavy feed-forward layers and out of the KV cache.
  • Why it matters: Without the local text-to-text check-in, cross-attention can overwrite or destabilize text meaning. CASA keeps the text grounded while pulling in just the right visual bits. šŸž Anchor: When answering ā€œWhat’s the y-axis label of the chart?ā€, CASA lets the word ā€œlabelā€ glance at the chart tokens and at the nearby words that describe what you’re asking, so it doesn’t drift.
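
A minimal sketch of that recipe: one softmax runs over the image tokens plus the window's text tokens, so the same weights that pick helpful neighbouring words also decide how much to trust the image. The function name, single head, and random weight matrices are illustrative, not the paper's implementation:

```python
import torch

def casa_attention(text_window, image, w_q, w_k, w_v):
    """text_window: (n_text, d) text since the image; image: (n_img, d) patch tokens."""
    n_text, n_img = text_window.shape[0], image.shape[0]
    q = text_window @ w_q
    kv = torch.cat([image, text_window], dim=0)            # keys/values: image + local text
    k, v = kv @ w_k, kv @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5                   # (n_text, n_img + n_text)
    # Causal mask on the text part only: word i may see the image and words 0..i.
    mask = torch.zeros(n_text, n_img + n_text, dtype=torch.bool)
    mask[:, n_img:] = torch.triu(torch.ones(n_text, n_text, dtype=torch.bool), diagonal=1)
    weights = scores.masked_fill(mask, float("-inf")).softmax(dim=-1)
    image_share = weights[:, :n_img].sum(dim=-1)            # the implicit gate, per text token
    return weights @ v, image_share

d = 16
fused, gate = casa_attention(torch.randn(4, d), torch.randn(64, d),
                             *(torch.randn(d, d) for _ in range(3)))
```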

The ā€œAha!ā€ Moment in one sentence: Let text attend locally to text while attending to the image, so cross-attention behaves like a tiny, well-mannered self-attention window that naturally gates visual influence.

Three analogies:

  • Air traffic control: CASA is like a tower that watches planes (image) and the runway logs (recent text) at the same time, deciding safely who lands now.
  • Cooking with a recipe: You taste the sauce (image) but also re-read the last recipe line (text) before adding spice, so you don’t ruin the dish.
  • Classroom whisper: You peek at the board (image) but also check your last sentence (text) before writing the next, keeping your notes consistent.

Before vs After:

  • Before (cross-attention): Text looked only at the image in the fusion step, risking overwriting local context and losing details.
  • After (CASA): Text looks at image + nearby text, creating a stabilizing loop that preserves meaning and pulls in fine-grained visual clues.

šŸž Hook: Think of a dimmer switch that smoothly blends lamp light with sunlight in your room.

🄬 The Concept (Implicit Gating):

  • What it is: CASA’s attention softmax itself decides how much to trust image vs. text—no extra, manual gate needed.
  • How it works:
    1. Compute attention scores to image tokens and to nearby text tokens.
    2. Softmax turns scores into weights.
    3. The model learns to give high weight to ā€œitselfā€ and useful text, and—when helpful—to the right image patches.
    4. Visual info helps without flooding the text stream.
  • Why it matters: Without this, models may over- or under-use the image, especially with tricky details (like tiny OCR text). šŸž Anchor: Ablations show that blocking a token’s attention to itself makes scores crash—proof that this natural gating is key to CASA’s power.
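
A tiny standalone sketch of the dimmer switch: because a single softmax covers both image and text candidates, the attention mass a word spends on the image and the mass it keeps on the text always add up to one. The scores below are random, purely to show the bookkeeping:

```python
import torch

n_text, n_img = 4, 64
scores = torch.randn(n_text, n_img + n_text)   # per text token: image columns, then text columns
weights = scores.softmax(dim=-1)               # one softmax over all candidates
image_share = weights[:, :n_img].sum(dim=-1)   # attention mass spent on the image
text_share = weights[:, n_img:].sum(dim=-1)    # mass kept on "self + nearby text"
print(image_share + text_share)                # sums to 1: the dimmer has no extra knob to tune
```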

šŸž Hook: Imagine reading a comic: each panel (image) is followed by a few speech bubbles (text) that belong together.

🄬 The Concept (Local Attention Windows):

  • What it is: A local window starts at the image and includes the related follow-up text.
  • How it works:
    1. New image arrives → open a new window.
    2. Text tokens in that window attend to the image and to earlier text in the same window.
    3. The next image starts a fresh window; past windows stay cached efficiently.
    4. The rest of the language model still does global text self-attention, so long-range story flow is preserved.
  • Why it matters: Without local windows, compute scales badly and the model can get confused across unrelated images. šŸž Anchor: In streaming video, each frame forms a tiny window with its caption bits, keeping latency low and memory stable.
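
A small sketch of how windows could be carved out of a token stream. The -1 marker for image slots and the "<image>" placeholder are assumptions made just for this illustration:

```python
# Token stream: -1 marks an image slot, other values are text token ids (illustrative).
stream = [101, 102, -1, 5, 6, 7, -1, 8, 9]

windows, current = [], []
for tok in stream:
    if tok == -1:                # a new image arrives -> open a fresh window
        if current:
            windows.append(current)
        current = ["<image>"]
    else:
        current.append(tok)      # text joins the window of the most recent image
if current:
    windows.append(current)

print(windows)   # [[101, 102], ['<image>', 5, 6, 7], ['<image>', 8, 9]]
# CASA fusion runs inside each image-led window; the model's ordinary text
# self-attention elsewhere still connects the story across windows.
```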

Building Blocks (simple view):

  • Visual encoder: turns images/frames into tokens (but these tokens don’t go through heavy feed-forward layers).
  • CASA layer placements: three variants—CASAāŠ• (parallel add), CASA→ (before self-attention), CASA∨ (replacing some self-attention blocks to save even more compute).
  • Blockwise attention: efficient training/inference by handling windows as blocks so cost grows slowly with more images.
  • Causal masking: text can’t peek into the future; each token sees only what it should.
  • Modularity: drop CASA into an existing text LLM or adapt a token-insertion VLM with minimal extra parameters.

Why it works (intuition): CASA keeps the text’s ā€œidentityā€ alive by letting a token attend to itself and its recent context while looking at the image. The attention softmax becomes an automatic mixer, preventing the image from shouting over the text unless the task truly needs it (like reading a small axis label).

03Methodology

High-level recipe: Input (image + text) → Vision encoder (image → tokens) → CASA layers (text attends to image + local text) → Standard LLM layers (text-only self-attention + FFNs) → Output tokens (answer/caption).

Step 1: Turn images into tokens

  • What happens: A vision encoder converts each image or video frame into a grid of tokens.
  • Why: Models work best on token streams.
  • Example: An 896Ɨ896 image may become 1024 image tokens.
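
One way that arithmetic can work out, assuming an effective patch size of 28 pixels (for example, 14-pixel patches merged 2Ɨ2); the exact numbers depend on the vision encoder:

```python
image_side, effective_patch = 896, 28    # assumed patch size, for illustration only
grid = image_side // effective_patch     # 32 patches per side
print(grid * grid)                       # 1024 image tokens
```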

Step 2: Form local windows

  • What happens: Each time an image appears, we start a new window that includes that image’s tokens and the following related text tokens.
  • Why: Windows keep attention local and efficient, and prevent unrelated images from interfering.
  • Example: Chat: ā€œHere’s an image of a receiptā€ [image] ā€œWhat is the total?ā€ → window contains that image + ā€œWhat is the total?ā€ and the model’s reply.

šŸž Hook: Imagine looking at a photo and then glancing back at the last line of your notes before writing the next word.

🄬 The Concept (CASA Layer, the fusion step):

  • What it is: A CASA layer lets the current text token attend to both the image tokens and the earlier text tokens in its window.
  • How it works:
    1. Take the current text token as a query.
    2. Keys/values are the image tokens plus the text tokens from after the image up to the current position (causal).
    3. Compute attention scores and softmax weights.
    4. Mix information using those weights and add as a residual update to the text stream (CASAāŠ•/→), or replace some self-attention blocks (CASA∨).
  • Why it matters: If we remove this local text-to-text interaction, cross-attention becomes brittle; the image can overly dominate or confuse the text. šŸž Anchor: In ā€œWhat’s the y-axis label?ā€, the token for ā€œlabelā€ gets info from the chart patches and nearby words like ā€œy-axis,ā€ not from a random earlier part of the chat.
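
Packaging those four steps as a module gives a hedged sketch of a CASA fusion layer with a residual update (CASAāŠ•/→ style). The single head, projection names, and sizes are simplifications, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CASALayer(nn.Module):
    """Text attends to image tokens plus earlier window text; single head for clarity."""
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.k = nn.Linear(d_model, d_model, bias=False)
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, text, image):
        n_text, n_img = text.shape[0], image.shape[0]
        q = self.q(text)                                  # 1. current text tokens as queries
        kv = torch.cat([image, text], dim=0)              # 2. keys/values: image + window text
        k, v = self.k(kv), self.v(kv)
        scores = q @ k.T / k.shape[-1] ** 0.5             # 3. scores, causal mask, softmax
        mask = torch.zeros(n_text, n_img + n_text, dtype=torch.bool)
        mask[:, n_img:] = torch.triu(torch.ones(n_text, n_text, dtype=torch.bool), diagonal=1)
        mixed = scores.masked_fill(mask, float("-inf")).softmax(dim=-1) @ v
        return text + self.out(mixed)                     # 4. residual update to the text stream

layer = CASALayer(32)
new_text = layer(torch.randn(4, 32), torch.randn(64, 32)) # image tokens are read, never rewritten
```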

Step 3: Keep image tokens out of heavy layers

  • What happens: Only text tokens go through the model’s feed-forward networks (FFNs) and get saved in the KV cache. Image tokens act as keys/values in CASA attention but don’t pass through FFNs and aren’t stored in the cache.
  • Why: This slashes memory and compute, especially for long videos or many images.
  • Example: In a 2-minute video at 2 fps, token insertion would balloon the cache; CASA’s cache stays mostly about the text.
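
A rough cache-size comparison for that streaming example, counting cached entries only; the per-frame and per-caption token counts below are assumptions, and real memory also depends on layers, heads, and head dimension:

```python
frames, tokens_per_frame, caption_tokens = 240, 1024, 12   # ~2 min at 2 fps (illustrative numbers)

insertion_cache = frames * (tokens_per_frame + caption_tokens)  # every frame's tokens are cached
casa_cache = frames * caption_tokens                            # only the text is cached
print(insertion_cache, casa_cache)   # 248640 vs 2880 cached entries, roughly 86x smaller
```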

šŸž Hook: Think of a kitchen where you can bring a bowl to the counter to mix, but you don’t carry all your pantry shelves back and forth every time.

🄬 The Concept (Blockwise Attention):

  • What it is: A fast way to do attention over windows as blocks during training/inference.
  • How it works:
    1. Split sequences into natural windows around image insertions.
    2. Run attention per window efficiently.
    3. Use text-only queries to avoid quadratic blow-ups.
    4. Maintain causal masks so no one peeks ahead.
  • Why it matters: Without blockwise attention, costs would grow too fast as images multiply. šŸž Anchor: When packing many QA pairs per batch, each pair forms its own block, preventing wasteful padding and unnecessary cross-sample attention.
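
A minimal sketch of the blockwise idea: each image-led window becomes its own block with text-only queries and a causal mask over its text part. Real implementations use fused kernels (FlashAttention-style), but the structure looks roughly like this:

```python
import torch
import torch.nn.functional as F

d, n_img, n_txt = 16, 64, 5
windows = [(torch.randn(n_img, d), torch.randn(n_txt, d)) for _ in range(3)]  # (image, text) blocks

# Boolean mask per block: True = may attend. Image columns are open; text is causal.
mask = torch.ones(n_txt, n_img + n_txt, dtype=torch.bool)
mask[:, n_img:] = torch.tril(torch.ones(n_txt, n_txt, dtype=torch.bool))

outputs = []
for image, text in windows:                  # each window is processed as its own block
    kv = torch.cat([image, text], dim=0)     # keys/values never cross a window boundary
    out = F.scaled_dot_product_attention(
        text.unsqueeze(0), kv.unsqueeze(0), kv.unsqueeze(0), attn_mask=mask)
    outputs.append(out.squeeze(0))
fused = torch.cat(outputs, dim=0)            # text-only outputs; cost grows window by window
```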

Step 4: Choose a CASA placement

  • CASAāŠ• (parallel): Compute self-attention and CASA in parallel and add the results.
    • Why: Minimal disruption to a frozen model; great for adapting existing VLMs.
    • Example: Converting a token-insertion model like Qwen2.5-VL to efficient CASA with only a small accuracy drop.
  • CASA→ (before SA): Run CASA first, then self-attention.
    • Why: Strong accuracy when training end-to-end from a text LLM.
    • Example: Upgrading a 2B text model to a VLM with CASA layers.
  • CASA∨ (replace some SA): Swap a subset of self-attention layers with CASA.
    • Why: Even more efficient inference; place them sparsely to keep accuracy strong.
    • Example: Replace every 4th self-attention layer for speed-memory gains with minor trade-offs.
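
A hedged sketch of how the three placements could be wired around one transformer block. The callables `self_attn`, `casa`, and `ffn` are stand-ins invented for this illustration; the real models integrate CASA inside their own block definitions:

```python
def block_forward(x, image, self_attn, casa, ffn, placement):
    """One transformer block with a CASA fusion step (illustrative wiring only)."""
    if placement == "parallel":        # CASAāŠ•: self-attention and CASA computed side by side
        x = x + self_attn(x) + casa(x, image)
    elif placement == "before":        # CASA→: fuse with the image first, then self-attend
        x = x + casa(x, image)
        x = x + self_attn(x)
    elif placement == "replace":       # CASA∨: this block drops self-attention entirely
        x = x + casa(x, image)
    return x + ffn(x)                  # image tokens never pass through the FFN
```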

Step 5: Train efficiently

  • What happens: Use multimodal sequence packing (many short QA pairs packed together), windows for CASA, and an attention kernel like FlashAttention2.
  • Why: Saves memory, avoids padding waste, and matches the streaming behavior we want at inference.
  • Example: Pack up to 2048 text tokens and tens of thousands of image tokens per GPU step during training, yet keep memory under control due to CASA’s design.

What breaks without each step:

  • No windows: Attention touches unrelated images; compute explodes.
  • No text-to-text in CASA: Visual fusion becomes unstable; accuracy drops on fine detail tasks.
  • Image tokens through FFNs: Memory skyrockets; training/inference slow greatly.
  • No blockwise attention: Training becomes impractical on long multimodal sequences.

The secret sauce:

  • CASA’s local self-attend during fusion creates implicit gating: the model naturally favors ā€œself + helpful neighbors,ā€ blending in just the right visual patches. This balances precision (fine details) with efficiency (fast, low memory), especially crucial for streaming video where frames never stop.

04Experiments & Results

The tests: The authors measured accuracy on diverse benchmarks and also tracked compute/memory in both training and real-time inference. They grouped tasks into (1) High-res document/chart reading (tiny text, lines, labels), (2) OCR in natural images, and (3) General visual QA. They also tested streaming video understanding and live video captioning where latency and memory growth really matter.

The competition: CASA was compared to token insertion models (the heavy but strong baseline) and to modern cross-attention VLMs (efficient but usually weaker on fine detail). They trained CASA both from a text-only LLM (Helium1-2B) and by adapting a strong token-insertion VLM (Qwen2.5-VL-3B) using only added CASA layers.

Scoreboard with context:

  • Versus cross-attention: CASA consistently and clearly wins on fine-grained tasks like ChartQA, DocVQA, and InfoVQA—think moving from a B- to a solid A when tiny details matter. On general QA, CASA matches or beats these models too.
  • Versus token insertion: CASA narrows the gap a lot. On average, there’s still about a 7-point drop versus full insertion in the most demanding settings, but the compute and memory savings are huge, especially on long, image-heavy dialogs and videos.
  • Adapting a big VLM: Swapping Qwen2.5-VL-3B to CASAāŠ• recovers most of its accuracy with far better efficiency, outperforming many larger cross-attention baselines while training fewer parameters.
  • Video understanding: CASA-based models land close to the token-insertion base and above larger cross-attention rivals on several video QA benchmarks.
  • Live video captioning: Despite being smaller (3B vs. 7B baselines), a CASA-adapted model achieves competitive win rates using an LLM-as-judge setup, while keeping latency low and memory nearly flat as frames stream in.

Surprising findings and ablations:

  • Self-attention is critical inside CASA. If you prevent a token from attending to itself in CASA at inference, performance drops sharply across all benchmarks—evidence for the implicit gating effect.
  • Updating image tokens via feed-forward layers brings slight gains but costs a lot of memory and time; CASA avoids this by design.
  • CASA∨ (replacing a few self-attention layers) gives extra speed and memory savings; spreading them uniformly works better than clumping them.
  • Compressing image tokens (e.g., Q-Former) helps short-term efficiency but hurts fine detail tasks fast, and still struggles with long streaming where caches blow up.

Efficiency data made simple:

  • Memory: Not pushing image tokens through FFNs and not caching them makes CASA’s memory use much lower than token insertion, often by large factors in long runs.
  • Speed: CASA maintains high generation speed across many frames; insertion slows down as it keeps piling tokens into the KV cache. In streaming plots, insertion lines hit out-of-memory while CASA keeps going.

Bottom line: On detailed reading benchmarks, CASA lifts cross-attention to near token-insertion quality, keeps general QA strong, and shines in streaming tasks by staying fast and memory-thrifty.

05Discussion & Limitations

Limitations:

  • Still a small gap vs. full token insertion on the most detail-hungry tasks (e.g., fine infographic or diagram questions). If absolute top accuracy on tiny text is your only goal and you can afford the cost, insertion may still win.
  • CASA relies on a good vision encoder; weak visual features limit performance.
  • CASA’s windows are local in the fusion step; if a task truly needs cross-image global fusion at once, special handling may be needed.
  • Current efficient training uses blockwise attention libraries; tooling constraints (like mask alignment rules) can affect setup.

Required resources:

  • A capable GPU setup for training (the paper used H100s); inference is much lighter than insertion-based methods, especially for long sequences.
  • A decent multimodal dataset mix (documents, charts, OCR, general QA; videos if streaming is desired).

When NOT to use:

  • Ultra short prompts with a single small image where cost doesn’t matter and you want absolute best accuracy—token insertion may be simpler.
  • Extreme micro-detail reading at the limit of resolution where every last bit matters and resources are abundant.
  • Pipelines that fundamentally depend on pushing image tokens through FFNs at every layer for custom reasons.

Open questions:

  • Can dynamic windowing (adapting window sizes on-the-fly) further improve both accuracy and speed?
  • Could lightweight, smarter implicit gates enhance CASA beyond the softmax’s natural balance without adding heavy parameters?
  • How best to combine CASA with gentle compression for ultra-limited devices without losing chart/OCR detail?
  • What are the theoretical properties of CASA’s attention distributions across layers and heads, and can these guide better designs?
  • Can CASA’s idea generalize to audio-text, sensor-text, or tri-modal settings while keeping the same efficiency benefits?

06Conclusion & Future Work

Three-sentence summary: CASA fuses images and text by letting each text token look at both the image and nearby text in a small, causal window. This creates a natural, implicit gate that preserves meaning and adds visual detail without blowing up memory or latency. As a result, CASA closes most of the gap to full token insertion while keeping the efficiency of cross-attention, especially in long, streaming scenarios.

Main achievement: Showing that adding local text-to-text attention inside cross-attention is the key missing piece to make efficient fusion competitive on fine-grained tasks.

Future directions: Explore adaptive window sizes, smarter implicit gating, and hybrid strategies with light compression; extend CASA to more modalities; refine training kernels and masking to further boost speed. Investigate theoretical insights into attention patterns to auto-tune CASA placement and windowing.

Why remember this: CASA turns a known efficient idea (cross-attention) into a high-accuracy tool simply by restoring a bit of self-attention at the right place and time. It keeps models nimble on long chats and videos, bringing practical multimodal AI closer to real-time, on-device, and everyday use without sacrificing much accuracy.

Practical Applications

  • Real-time live video captioning on laptops or edge devices with stable latency and memory.
  • Smart document readers that handle high-resolution receipts, invoices, and forms efficiently.
  • Chart and infographic assistants that answer detailed questions without heavy server costs.
  • On-device accessibility tools (e.g., reading signs or menus) with fast, reliable OCR-based help.
  • Customer support bots that understand product photos and long troubleshooting chats without slowing down.
  • AR glasses that describe scenes live, staying within battery and memory limits.
  • Classroom tools that explain diagrams step-by-step without lag.
  • Telemedicine helpers that read medical charts and annotate images during calls.
  • Industrial monitoring systems that caption and summarize long surveillance or process videos in real time.
  • News and sports highlight generators that keep up with streaming feeds to produce timely summaries.
#CASA Ā· #cross-attention Ā· #self-attention Ā· #vision-language fusion Ā· #token insertion Ā· #implicit gating Ā· #local attention windows Ā· #blockwise attention Ā· #KV cache efficiency Ā· #streaming video captioning Ā· #OCR and document VQA Ā· #chart understanding Ā· #multimodal LLM Ā· #efficient VLM Ā· #FlashAttention