
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Intermediate
Yue Ding, Yiyan Ji, Jungang Li et al. · 2/4/2026
arXiv · PDF

Key Summary

  • OmniSIFT is a new way to shrink (compress) audio and video tokens so omni-modal language models can think faster without forgetting important details.
  • It works asymmetrically: first clean up the video (find the truly useful visual bits), then use that cleaned video to decide which audio pieces to keep.
  • The video step (STVP) keeps spatially unique regions and moments that actually change over time, throwing away repetitive patches.
  • The audio step (VGAS) uses the pruned video as a guide so it only keeps the sounds that match what’s on screen.
  • On Qwen2.5-Omni-7B with just 35% tokens kept, OmniSIFT scored 50.0 on WorldSense, beating the full-token model’s 49.7, and reached 73.2 on DailyOmni (full tokens: 72.2).
  • With only 25% of the original tokens, OmniSIFT still outperforms all compression baselines and sometimes even beats full-token performance.
  • It adds only 4.85M parameters, trains end-to-end using a straight-through estimator, and stays compatible with efficient attention implementations.
  • Latency and memory use drop a lot: over 40% faster total inference time and 4.6 GB lower peak memory on WorldSense for the 7B model versus full tokens.
  • Performance remains stable even when audio is compressed very hard, unlike symmetric methods whose accuracy falls off quickly.
  • The key idea mirrors how people watch videos: our eyes anchor the scene first, then our ears focus on sounds that match what we see.

Why This Research Matters

OmniSIFT makes powerful audio–video AI models faster and lighter without losing smarts, which means they can run on cheaper hardware and handle longer videos. It reduces delays for real-time uses like meetings, classes, and live events by removing tokens that add little value. By keeping the right cross-modal pairs (what you see with what you hear), it can even beat the performance of using all tokens. This approach can lower the cost of deploying multimodal assistants at scale. It also opens the door to more accessible tools for education, accessibility (e.g., better captions), and safety monitoring. The idea of asymmetric compression can inspire similar advances in other multimodal systems.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook): You know how a movie night can get confusing if you try to watch three screens and listen to two podcasts at the same time? Your brain gets overloaded because there’s just too much happening.

🄬 Filling (The Actual Concept): What it is: Omni-modal Large Language Models (Omni-LLMs) are AI systems that can watch videos, listen to audio, and read text all together to answer questions or describe what’s happening. How it works:

  1. A vision encoder turns video frames into little pieces called tokens.
  2. An audio encoder turns sound into tokens too.
  3. These tokens are mixed with text tokens and sent to a big language model that reasons across them.
Why it matters: Without a smart way to handle all those tokens, the AI slows down and may need huge memory, making it hard to use on long clips or in real time (a rough token count follows the example below).

šŸž Bottom Bread (Anchor): Imagine a 20-second clip of a kid playing piano while a dog barks. An Omni-LLM must connect the visuals (kid, piano keys) and audio (music notes, barking) to answer, ā€œWho is making the music?ā€

šŸž Top Bread (Hook): Imagine a giant jigsaw puzzle with way too many extra pieces. If you try to use all of them, building the picture takes forever.

🄬 Filling (The Actual Concept): What it is: A token is like a tiny puzzle piece of information. Token compression is packing those pieces smarter so the model only keeps the most useful ones. How it works:

  1. Score tokens by how informative they are.
  2. Keep the top-scoring ones.
  3. Discard the rest to save time and memory.
Why it matters: Without compression, long videos and high-resolution audio create sequences of tens of thousands of tokens that are too slow and expensive to process (a tiny code sketch of this idea follows the example below).

šŸž Bottom Bread (Anchor): Instead of feeding 20,000 tokens from a 20-second clip, compression might keep only 5,000 critical tokens and still answer questions correctly.

šŸž Top Bread (Hook): Think of cleaning your room. If you just shove stuff under the bed without looking, you’ll probably hide the homework you actually need.

🄬 Filling (The Actual Concept): What it is: Earlier methods tried two strategies for token compression in audio-video models. Modality-decoupled compression pruned audio and video separately. Modality-symmetric compression treated audio and video as equally informative and compressed them the same way. How it works:

  1. Decoupled: apply vision-only rules to video and audio-only rules to audio, independently.
  2. Symmetric: compute joint saliency and compress both streams in a uniform fashion.
Why it matters: These miss cross-modal dependencies—what’s important in audio often depends on what’s on screen—and can cut tokens that become crucial when paired together.

šŸž Bottom Bread (Anchor): If a speaker is on stage, the video shows a person talking; the matching audio (their voice) matters a lot. Decoupled or symmetric methods might drop either the face or the voice, making the model confused.

šŸž Top Bread (Hook): You know how your eyes usually guide your ears when you watch someone speak? You focus on their lips and then listen harder to their voice.

🄬 Filling (The Actual Concept): What it is: Modality-asymmetric compression means treating video and audio differently and letting video guide which audio parts to keep. How it works:

  1. First, prune video to remove repeated patches and still keep the big visual clues.
  2. Then, use those visual anchors to pick the matching, meaningful audio.
Why it matters: Without this asymmetry, the model may keep loud but irrelevant sounds or drop quiet but important ones that line up with the scene.

šŸž Bottom Bread (Anchor): If a scoreboard flips from 27–26 to 28–26, video anchors help the model keep the right audio moment (the referee’s call) to explain why the score changed.

The world before: Omni-LLMs were great at understanding mixed media, but sequences exploded in length because videos have lots of frames and audio is sampled densely. For a short clip, token counts often reached 20,000 or more. This made models slower, memory-hungry, and hard to deploy for long videos or real-time tasks.

The problem: How can we ruthlessly cut tokens without cutting understanding? People tried directly borrowing image/video token pruning for both audio and video (decoupled) or compressing both modalities the same way (symmetric). But these approaches ignored that audio saliency depends on visual context. A clap sound matters only when someone claps on screen. Symmetric methods also struggled with compatibility and efficiency, sometimes requiring expensive extra layers before compressing.

Failed attempts:

  • Decoupled pruning ignored cross-modal ties; it could drop tokens that were only meaningful when matched across audio and video.
  • Symmetric pruning treated audio and video equally; it often turned into picking time positions, not preserving modality-specific cues, and could be slow or incompatible with fast attention kernels.

The gap: We needed a light, trainable, end-to-end method that first stabilizes the video signal (remove spatial and temporal redundancy) and then uses these visual anchors to guide which audio tokens to keep.

Real stakes: Faster, cheaper, and smarter compression means:

  • Watching and explaining longer videos on everyday GPUs or even edge devices.
  • Lower latency for live captioning, meetings, and streaming events.
  • Better accuracy by tossing noisy, redundant tokens and focusing on key moments.

Enter OmniSIFT, which turns human-like perception (eyes guide ears) into a practical, efficient compression pipeline.

02 Core Idea

šŸž Top Bread (Hook): Imagine packing for a trip: you fold and stack your clothes first, then you tuck in socks and toiletries in the gaps your clothes create. You don’t treat everything the same.

🄬 Filling (The Actual Concept): What it is: OmniSIFT is a modality-asymmetric token compression method that first prunes video in space and time, then uses those visual anchors to select the most relevant audio. How it works:

  1. Spatio-Temporal Video Pruning (STVP) finds unique regions in the first frame (spatial) and changes between consecutive frames (temporal).
  2. Vision-Guided Audio Selector (VGAS) uses the pruned video tokens to highlight audio tokens that match the scene.
  3. A straight-through estimator lets the whole system learn end-to-end even though selecting Top-K tokens is discrete.
Why it matters: Without visually guided selection, the model may drop small but important sounds (like a whistle) or keep irrelevant background noise, hurting understanding.

šŸž Bottom Bread (Anchor): On a sports clip, STVP keeps the scoreboard and player movement patches; VGAS then keeps the whistle and commentary matching those visuals, letting the model explain a score change correctly.

The ā€œAha!ā€ moment in one sentence: Let the video clean itself first, then use those clean visual anchors to pick the audio—because in real life, our eyes lead our ears.

Three analogies:

  1. Museum tour: Curators first hang the key paintings (video anchors), then docents choose the stories (audio) that match those paintings, not random background chatter.
  2. Cooking: You prepare the main dish (video) before seasoning (audio). The spices only make sense once the main flavors are set.
  3. School play: The stage setup (video) comes first; then microphones (audio) are placed where the actors stand, not randomly in the audience.

Before vs After:

  • Before: Compress audio and video separately or treat them the same; drop cross-modal clues; performance dips under tight budgets.
  • After: Prune video into anchors; guide audio with those anchors; keep the right multimodal cues; accuracy can match or beat full tokens even at 25–35% retention.

Why it works (intuition, no equations):

  • Videos have a lot of static, repeated patches and slow changes; removing spatially bland and temporally similar patches preserves scene structure and motion cues.
  • Sounds are ambiguous without visuals (is that a clap, a thud, or a door slam?), so audio saliency should be scored with visual context.
  • By aligning audio selection to visual anchors, the method preserves the cross-modal pairs that most help reasoning and throws out unaligned noise.
  • End-to-end training with a straight-through estimator teaches the selector what the LLM truly needs to answer questions correctly.

Building Blocks (with sandwich explanations):

šŸž Top Bread (Hook): You know how photographers frame a shot to capture the most important parts of a scene?

🄬 Filling: What it is: Spatio-Temporal Video Pruning (STVP) keeps visually unique spots in the first frame and new/changed spots in the next frame. How it works:

  1. Spatial saliency: compare each patch to the frame’s overall look; keep the ones that stand out.
  2. Temporal saliency: compare matching patches across consecutive frames; keep the ones that changed.
  3. Take the top patches per frame to form compact visual anchors.
Why it matters: Without this, you keep piles of nearly identical patches and waste compute on non-changing areas.

šŸž Bottom Bread (Anchor): In a lecture video, STVP keeps the speaker’s face and the slide text (spatial), plus the moment when the slide changes (temporal).

šŸž Top Bread (Hook): Imagine choosing which sounds to pay attention to at a party by first spotting who’s actually talking to you.

🄬 Filling: What it is: Vision-Guided Audio Selector (VGAS) chooses audio tokens based on how well they align with the pruned video anchors. How it works:

  1. Cross-attention: each audio token looks at the visual anchors to get context.
  2. A small MLP turns that context into a saliency score per audio token.
  3. Keep Top-K audio tokens that match the visuals best.
Why it matters: Without visual guidance, you risk keeping loud but irrelevant noises and dropping quiet but important speech.

šŸž Bottom Bread (Anchor): If the video shows a person raising a phone on stage while speaking, VGAS keeps the speech tokens and the subtle applause that matches that moment.

šŸž Top Bread (Hook): Picking the best items isn’t smooth—you either keep them or you don’t—so how does a model learn which ones to keep during training?

🄬 Filling: What it is: A Straight-Through Estimator (STE) is a training trick that lets gradients flow through a hard selection step as if it were soft. How it works:

  1. Forward pass: pick Top-K tokens (hard, 0/1 mask).
  2. Backward pass: pretend the mask changes smoothly with the scores, so gradients pass through.
  3. Update the selector to keep tokens that improved answers.
Why it matters: Without STE, the model can’t learn end-to-end which tokens actually help the LLM’s final predictions.

šŸž Bottom Bread (Anchor): It’s like grading a talent show where you can only say yes/no to acts, but during practice you give them sliding-scale feedback so they know how to improve.

03 Methodology

At a high level: Video + Audio → STVP (prune video spatially and temporally) → VGAS (use pruned video to score and pick audio) → Compressed sequence → LLM answers.

Inputs and setup:

  • The video is split into small time chunks; each chunk has two consecutive frames turned into visual tokens.
  • The audio in the same time span is turned into audio tokens.
  • Tokens from both streams are aligned per chunk so they talk about the same moment in time.

Step A: Spatio-Temporal Video Pruning (STVP)

  • What happens: For each chunk, STVP creates visual anchors by keeping only (a) spatially distinctive patches in the first frame and (b) temporally changed patches in the second frame.
  • Why this step exists: Video is full of redundancy—static backgrounds and near-duplicate patches across frames. Pruning saves compute while keeping the scene layout and motion clues. Without it, the next step (picking audio) won’t have clean anchors.
  • Example with data: Suppose a chunk has 2 frames, each with 196 tokens (patches), for 392 visual tokens. If the visual retention ratio keeps half, STVP selects the top 98 tokens from frame 1 (spatially distinctive) and top 98 from frame 2 (temporal changes), leaving 196 compact visual anchors.

How STVP scores tokens (a minimal code sketch follows this list):

  • Spatial saliency (frame 1): Compare each patch to the average look of the whole frame; patches that differ more are kept.
  • Temporal saliency (frame 2): Compare each patch to its counterpart in frame 1; patches that changed more are kept.
  • Top-K per frame: Keep the highest-scoring patches to match the desired retention.
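Under one reading of the scoring rules above, a minimal STVP-style pruner could look like the sketch below. The cosine-based saliency and strict per-frame Top-K are assumptions about the exact formulation; the paper's implementation may differ in its details.

```python
import torch
import torch.nn.functional as F

def stvp_prune(frame1: torch.Tensor, frame2: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Prune two consecutive frames of one chunk into compact visual anchors.

    frame1, frame2: (P, D) patch tokens. Returns (2 * k, D) retained anchors.
    """
    k = max(1, int(frame1.shape[0] * keep_ratio))

    # Spatial saliency: how much each frame-1 patch differs from the frame's average appearance.
    frame_mean = frame1.mean(dim=0, keepdim=True)                    # (1, D)
    spatial = 1.0 - F.cosine_similarity(frame1, frame_mean, dim=-1)  # (P,)

    # Temporal saliency: how much each frame-2 patch changed versus its frame-1 counterpart.
    temporal = 1.0 - F.cosine_similarity(frame2, frame1, dim=-1)     # (P,)

    keep1 = torch.topk(spatial, k).indices.sort().values
    keep2 = torch.topk(temporal, k).indices.sort().values
    return torch.cat([frame1[keep1], frame2[keep2]], dim=0)

anchors = stvp_prune(torch.randn(196, 1024), torch.randn(196, 1024))
print(anchors.shape)   # torch.Size([196, 1024]): 98 spatial + 98 temporal anchors, matching the 392 -> 196 example
```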

Step B: Vision-Guided Audio Selector (VGAS)

  • What happens: Audio tokens use cross-attention to look at the pruned visual tokens and form context-aware audio representations. A tiny MLP turns each audio token’s context into a score; Top-K audio tokens are kept (a minimal sketch follows this list).
  • Why this step exists: Audio saliency is ambiguous without the scene. By aligning to visual anchors, we keep speech, whistles, alarms, or music that matter for what’s on screen and drop irrelevant noise. Without this, you might prune away the very sounds needed to explain visible events.
  • Example with data: If a chunk has 320 audio tokens and we keep 40%, VGAS selects 128 audio tokens whose scores are highest after seeing the visual anchors.
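A self-contained sketch of a vision-guided selector in that spirit is shown below. The layer sizes, the use of nn.MultiheadAttention, and the GELU MLP are my assumptions; the paper only specifies a single lightweight cross-attention layer plus a small MLP scorer (about 4.85M parameters in total).

```python
import torch
import torch.nn as nn

class VGASSelector(nn.Module):
    """Vision-guided audio selection: audio attends to visual anchors, gets scored, Top-K survives."""

    def __init__(self, dim: int = 1024, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.scorer = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(), nn.Linear(dim // 4, 1))

    def forward(self, audio: torch.Tensor, anchors: torch.Tensor, keep_ratio: float = 0.4) -> torch.Tensor:
        # audio: (B, A, D) audio tokens; anchors: (B, V, D) pruned visual tokens from STVP.
        ctx, _ = self.cross_attn(query=audio, key=anchors, value=anchors)  # visual context per audio token
        scores = self.scorer(ctx).squeeze(-1)                              # (B, A) saliency scores
        k = max(1, int(audio.shape[1] * keep_ratio))
        keep = torch.topk(scores, k, dim=1).indices.sort(dim=1).values     # Top-K, in original order
        return torch.gather(audio, 1, keep.unsqueeze(-1).expand(-1, -1, audio.shape[-1]))

selector = VGASSelector()
kept = selector(torch.randn(1, 320, 1024), torch.randn(1, 196, 1024))
print(kept.shape)   # torch.Size([1, 128, 1024]): 40% of 320 audio tokens, as in the example above
```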

Training trick: Straight-Through Estimator (STE)

  • What happens: During the forward pass, the selector makes a hard choice (keep vs drop). During the backward pass, it pretends that choice was smooth so gradients can update the scoring network.
  • Why this step exists: Top-K is not differentiable; without STE, the selector can’t learn from end-to-end supervision.
  • Example: If the model answered a question wrong, gradients tell the selector, ā€œyou should’ve kept those speech tokens aligned with the face,ā€ and next time it scores them higher. The code sketch below shows this trick.
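One common way to implement such a straight-through Top-K mask is sketched below. It is a generic construction consistent with the description above, not necessarily the authors' exact formulation.

```python
import torch

def topk_mask_ste(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Hard 0/1 keep-mask in the forward pass, smooth gradients in the backward pass."""
    soft = torch.sigmoid(scores)                                     # differentiable surrogate
    hard = torch.zeros_like(scores)
    hard.scatter_(-1, torch.topk(scores, k, dim=-1).indices, 1.0)    # 1 for kept tokens, 0 for dropped
    # Forward value equals `hard`; gradients flow through `soft` as if selection were smooth.
    return (hard - soft).detach() + soft

scores = torch.randn(1, 320, requires_grad=True)   # selector scores for 320 audio tokens
mask = topk_mask_ste(scores, k=128)                # keep 40% of them
loss = (mask * torch.randn(1, 320)).sum()          # stand-in for the downstream LLM loss
loss.backward()
print(scores.grad.abs().sum() > 0)                 # tensor(True): the selector still receives gradients
```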

Putting it all together across chunks (a glue sketch follows this list):

  • Each chunk: STVP → VGAS → keep top visual and audio tokens.
  • Concatenate compressed chunks with the text prompt.
  • Feed the shorter sequence into the LLM for reasoning and generation.
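The three bullets above compose roughly as in the sketch below. The scoring here uses crude stand-ins (norm-based saliency, dot-product matching) instead of the learned modules, purely to show the asymmetric ordering: video is pruned first, then audio is selected against the surviving anchors.

```python
import torch

def compress_chunk(frame1, frame2, audio, v_keep=0.5, a_keep=0.4):
    """Per-chunk STVP-then-VGAS ordering with stand-in scoring (not the real modules)."""
    # 1) Video first: spatial saliency on frame 1, temporal change on frame 2.
    kv = max(1, int(frame1.shape[0] * v_keep))
    spatial = (frame1 - frame1.mean(dim=0)).norm(dim=-1)
    temporal = (frame2 - frame1).norm(dim=-1)
    anchors = torch.cat([frame1[spatial.topk(kv).indices], frame2[temporal.topk(kv).indices]])
    # 2) Audio second, guided by the surviving visual anchors.
    ka = max(1, int(audio.shape[0] * a_keep))
    scores = (audio @ anchors.T).max(dim=-1).values    # how well each audio token matches any anchor
    kept_audio = audio[scores.topk(ka).indices]
    # 3) The compressed chunk is what the LLM sees for this time span.
    return torch.cat([anchors, kept_audio])

chunk = compress_chunk(torch.randn(196, 64), torch.randn(196, 64), torch.randn(320, 64))
print(chunk.shape)   # torch.Size([324, 64]): 196 visual anchors + 128 audio tokens
```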

Concrete walkthrough on a 20-second clip:

  • Before: 20,000 tokens total (video + audio).
  • After: Keep 25% (about 5,000 tokens). STVP keeps key visual anchors like faces, screens, moving objects; VGAS keeps synchronized speech and event sounds.
  • Result: The LLM answers, ā€œThe man raises his phone to show the audience while speaking into a microphone,ā€ both faster and with fewer hallucinations.

Design choices that make it clever (the secret sauce):

  • Asymmetry matches human perception: eyes first, ears second. This avoids keeping noisy sounds that don’t match visuals.
  • Two-part video pruning: split the job into spatial distinctiveness (what stands out) and temporal change (what’s new), so anchors capture both layout and motion.
  • Lightweight VGAS: a single cross-attention layer with a small hidden size (about 4.85M extra params total) provides strong guidance without heavy overhead.
  • Operator-friendly: saliency methods that depend on raw attention scores can clash with fast attention kernels, whereas OmniSIFT does not need those scores and stays compatible with efficient attention implementations such as FlashAttention.
  • End-to-end training: The STE bridges the gap between hard selection and gradient learning, so the system learns which tokens the LLM truly needs.

šŸž Top Bread (Hook): Think of it like editing a vlog. You first trim boring shots and near-duplicates, then you align which bits of narration and background sound to keep with the shots you kept.

🄬 Filling: What it is: Modality-Asymmetric Compression is the overall recipe: prune video first (STVP), then select audio guided by video (VGAS). How it works:

  1. Chunk the input.
  2. Compute spatial and temporal saliency; keep top visual tokens as anchors.
  3. Cross-attend audio to anchors; score and keep top audio tokens.
  4. Feed the compressed sequence to the LLM.
Why it matters: Without this recipe, compression either ignores cross-modal ties or wastes tokens on redundancy, lowering accuracy.

šŸž Bottom Bread (Anchor): In a class lecture, the compressed stream keeps the teacher’s face, the changing slide bullets, and the spoken explanation aligned to each bullet—just what the model needs to answer, ā€œWhat did the teacher say after showing the third point?ā€

04 Experiments & Results

The test: The authors evaluated OmniSIFT on five audio–video benchmarks that need smart cross-modal reasoning: DailyOmni, WorldSense, Video-MME (with audio), OmniVideoBench, and the video-SALMONN-2 captioning test set. They measured accuracy on QA tasks, plus efficiency metrics like inference latency and GPU memory use. They also checked stability across different compression ratios (how hard you squeeze tokens).

The competition: OmniSIFT competed against three baselines:

  • OmniZip (symmetric, audio-guided)
  • DyCoke (adapted to prune video and audio independently)
  • Random pruning (drops tokens uniformly)

The scoreboard with context:

  • WorldSense (Qwen2.5-Omni-7B, 35% retained): OmniSIFT scores 50.0, beating the full-token model’s 49.7 and clearly topping OmniZip (48.9). DyCoke’s reported numbers vary by table and setting (47.3 in the efficiency table, 52.7 in the main table under a different configuration), but at this 35% budget OmniSIFT is the best of the compressors and matches or exceeds full tokens on several tasks. Think of 50.0 versus 49.7 as winning by a photo finish after running only about a third of the race distance.
  • DailyOmni (7B, 35% retained): OmniSIFT hits 73.2; full tokens get 72.2; OmniZip 67.7. That’s like getting an A when the full-token model got an Aāˆ’ and competitors got B’s.
  • At 25% retained, OmniSIFT again tops all compression baselines: 72.5 vs OmniZip’s 66.2 on DailyOmni, and 49.9 vs OmniZip’s 48.7 on WorldSense in the ablation table (51.2 in the main table, depending on configuration), showing it stays strong even under very tight budgets.

Surprising findings:

  • Less can be more: By removing redundant tokens that add noise, OmniSIFT sometimes outperforms the full-token model.
  • Robustness to hard audio compression: When the audio compression ratio increases (keep fewer audio tokens), OmniZip’s accuracy drops notably (e.g., from ~48.9% to ~44.0%), but OmniSIFT stays above ~49.3%, showing the power of vision-guided audio selection.

Efficiency gains:

  • On WorldSense with Qwen2.5-Omni-7B at 35% retained:
    • Peak GPU memory drops by more than 4.6 GB vs full tokens.
    • Total inference time drops by over 40%.
    • Latency remains on par with training-free baselines despite adding a small, learned module (about 4.85M params).
  • FLOPs comparison shows large savings: full tokens ~555.74T vs OmniSIFT ~250.83T at 25% retention, a cut of roughly 55%.

Ablations (what makes it tick):

  • Remove spatial or temporal parts of STVP? Accuracy drops—both parts matter.
  • Replace vision-guided audio with audio-only selection? Accuracy drops by around 3–4 points on DailyOmni and WorldSense. Visual guidance is key.
  • Make the VGAS selector deeper (3 layers)? No gain—slightly worse accuracy and a bit more memory. The lightweight design is enough.

Case study (why asymmetry wins): In a badminton clip, OmniZip prunes critical scoreboard patches and some matching audio, leading to the wrong explanation of a score change. OmniSIFT keeps the scoreboard visuals and the referee’s call in audio, producing the correct answer.

Bottom line: Across multiple datasets, models (7B and 3B), and retention settings (35% and 25%), OmniSIFT consistently leads among compression methods and occasionally beats full tokens while being faster and lighter.

05 Discussion & Limitations

Limitations:

  • Visual anchors are the backbone of the method. If the visual encoder misses small but crucial cues (e.g., a tiny indicator light), VGAS may under-score the matching audio and drop it.
  • Very dark, occluded, or off-camera events (important sound with no visual counterpart) can challenge the vision-guided approach.
  • The Top-K selection is discrete; while STE enables learning, it is an approximation and may not perfectly reflect the true gradient of hard choices.

Required resources:

  • A base Omni-LLM (e.g., Qwen2.5-Omni-7B/3B), plus training on synchronized audio–video SFT data (e.g., ~107K AVoCaDO pairs used here).
  • Modest extra parameters (~4.85M) and compute to train the lightweight VGAS and fine-tune the decoder.

When not to use:

  • Pure audio tasks (podcasts without any video) where audio context alone is sufficient—vision guidance won’t help.
  • Scenarios where the video is extremely low quality or unrelated to the audio (e.g., a static thumbnail over a radio program); the visual anchors won’t be informative.

Open questions:

  • Can the method adaptively decide how much to compress per chunk based on moment-to-moment difficulty?
  • How to handle off-screen audio events that matter (e.g., a siren you hear but don’t see)?
  • Can we extend asymmetry further (e.g., text guiding vision or audio in certain tasks)?
  • How robust is the approach to different encoders and tokenizers, or to domain shifts like egocentric footage and surveillance video?
  • Could a tiny, learned motion detector improve temporal saliency beyond simple frame-to-frame differences without adding heavy cost?

06 Conclusion & Future Work

Three-sentence summary: OmniSIFT compresses audio–video tokens asymmetrically by pruning video first and then using those visual anchors to select the most relevant audio. This simple, trainable, and lightweight design preserves key cross-modal cues, often matching or beating full-token accuracy while cutting memory and latency dramatically. It remains robust even at tight retention ratios like 25%, outperforming prior compression methods.

Main achievement: Turning the human-inspired rule—eyes lead ears—into a practical, end-to-end token compression framework (STVP + VGAS + STE) that boosts both efficiency and accuracy for omni-modal LLMs.

Future directions: Make compression more adaptive over time, better handle off-screen audio events, explore multi-way asymmetry (e.g., text or audio guiding vision in special cases), and test across broader domains and encoder choices. Add tiny motion-aware cues or confidence estimates to improve selection when visuals are weak.

Why remember this: OmniSIFT shows that smarter, not just smaller, is the path to efficient multimodal AI—by keeping the right cross-modal pairs, models can think faster and sometimes even think better than when given everything.

Practical Applications

  • Live captioning and summarization of lectures and meetings with lower latency.
  • Real-time sports highlights that correctly explain score changes and key events.
  • Customer support agents that understand tutorial videos with voice-over on standard GPUs.
  • Mobile or edge deployment of multimodal assistants for AR glasses and IoT cameras.
  • Faster video QA for content moderation and compliance checks.
  • Efficient video analytics in surveillance with better alignment between events seen and sounds heard.
  • Smart video editing assistants that identify and keep the most relevant audio–visual moments.
  • Interactive education tools that answer questions about science demos recorded on video.
  • Accessibility features that generate richer audio descriptions aligned with on-screen actions.
  • Cost-effective large-scale processing of video archives for search and retrieval.
#Omni-LLM · #token compression · #modality-asymmetric · #video pruning · #audio selection · #cross-attention · #straight-through estimator · #multimodal efficiency · #spatio-temporal saliency · #Qwen2.5-Omni · #FlashAttention compatibility · #audio-visual reasoning · #long-context video · #saliency-based pruning