
SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer

Intermediate
Tongcheng Fang, Hanling Zhang, Ruiqi Xie et al. · 1/23/2026
arXiv · PDF

Key Summary

  • Videos are made of very long lists of tokens, and regular attention looks at every pair of tokens, which is slow and expensive.
  • Sparse attention skips most pairs to go fast, but it can miss important long-distance connections in videos.
  • LoRA fine-tuning helps a bit after making attention sparse, but at ultra-high sparsity it can’t fully fix lost information.
  • SALAD adds a tiny, fast linear-attention branch in parallel with the sparse branch to mix information globally at low cost.
  • A smart input-dependent gate decides how much the linear branch should help at each layer, so it complements rather than overwhelms sparse attention.
  • With only 2,000 videos and 1,600 steps, SALAD hits 90% sparsity and a 1.72× speedup while keeping quality close to full attention.
  • The method shares most weights and adds only ~5% parameters, using zero-initialization so it starts as a pure sparse model.
  • Experiments on VBench show SALAD beats prior sparse/linear baselines in the quality–speed trade-off, especially for consistency and image quality.
  • Adding 3D Rotary Position Embeddings helps the linear branch understand space and time in videos.
  • You can even drop about 20% of linear branches after training for extra speed, with little or no quality loss.

Why This Research Matters

Fast, high-quality video generation lowers costs and speeds up creative workflows for filmmakers, game designers, educators, and advertisers. With SALAD, laptops or modest servers can produce stable, coherent clips closer to real time, enabling more rapid iteration. This helps small studios and indie creators compete without massive GPU farms. It also improves user experiences in apps that generate or edit videos on the fly, like social media tools or learning platforms. By keeping long-range coherence at high sparsity, SALAD brings advanced video AI closer to everyday use. Over time, similar ideas could make long, high-resolution video generation practical for live previews and interactive storytelling.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine trying to watch a whole movie through a keyhole. You’d only see tiny bits at a time, and it would take forever to understand the whole story.

🥬 The Concept: Transformers use “attention” to compare every part of the input with every other part. For videos, that means comparing a huge number of tokens with each other, which grows quadratically.

  • What it is: Full attention checks all pairs of tokens, so work grows like tokens Ă— tokens.
  • How it works: 1) Turn video frames into tokens. 2) For each token, score it against every other token. 3) Use those scores to blend information. 4) Repeat across layers.
  • Why it matters: With long videos, this becomes too slow and memory-hungry, blocking fast, high-resolution generation.

🍞 Anchor: If a video has 30,000 tokens, full attention means roughly 30,000 × 30,000 ≈ 900 million pairwise comparisons per layer: like everyone in a packed stadium trying to high-five everyone else.
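
To make the quadratic growth concrete, here is a minimal single-head attention sketch in plain PyTorch (illustrative code, not the paper's implementation). The N × N score matrix is exactly the part that explodes for long videos.

```python
# Minimal full-attention sketch: every token is scored against every other token.
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    # q, k, v: (N, d) token features; single head for clarity
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5          # (N, N): one score per token pair
    weights = F.softmax(scores, dim=-1)  # N * N attention weights
    return weights @ v                   # blend values with those weights

N, d = 1024, 64                          # a real video clip can exceed 30,000 tokens
q, k, v = (torch.randn(N, d) for _ in range(3))
out = full_attention(q, k, v)
print(out.shape, "score entries:", N * N)  # cost and memory grow like N^2
```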

🍞 Hook: You know how skimming a book’s bolded words gives you the gist faster? That’s like “sparse attention.”

🥬 The Concept: Sparse attention only looks at a small, chosen set of token pairs instead of all of them.

  • What it is: A speed trick that limits which tokens can talk to which.
  • How it works: 1) Pick a pattern (like a sliding window) or top-k important pairs. 2) Only compute attention for those. 3) Skip the rest.
  • Why it matters: It speeds things up a lot, but can miss long-distance connections—like forgetting that a dog in frame 1 is the same dog in frame 50.

🍞 Anchor: A sliding window is like only talking to your desk neighbors in class. Fast, but you may miss the announcements from the front.
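
A toy sliding-window version makes the skipping visible. The 1-D window below is a simplified stand-in for the paper's spatial-temporal window, and it masks a dense score matrix purely for clarity, so a real speedup still needs a proper sparse kernel.

```python
# Sliding-window sparse attention sketch: each token only attends to tokens
# within +/- `window` positions; all other pairs are skipped.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window=128):
    N, d = q.shape
    idx = torch.arange(N)
    keep = (idx[:, None] - idx[None, :]).abs() <= window   # local pairs only
    scores = q @ k.T / d ** 0.5
    scores = scores.masked_fill(~keep, float("-inf"))      # drop distant pairs
    return F.softmax(scores, dim=-1) @ v

N, d = 1024, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
out = sliding_window_attention(q, k, v)
# Kept pairs are about N * (2 * window + 1) instead of N * N: ~75% skipped here.
```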

🍞 Hook: Imagine you wear a headset that lets you hear the teacher’s voice clearly no matter where they stand in the room—cheaply.

🥬 The Concept: Linear attention is a faster way to mix information across all tokens using a special trick that makes the work grow linearly with tokens.

  • What it is: A global-mixing attention whose cost grows like tokens, not tokens squared.
  • How it works: 1) Map tokens with a simple function (like ReLU). 2) Pre-summarize all keys and values once. 3) Mix each query with the summary.
  • Why it matters: It can spread information cheaply, but by itself may be too weak to model super complex, long videos well.

🍞 Anchor: It’s like making class notes into a handout once, then each student quickly uses the handout to study, instead of everyone re-copying the entire chalkboard.
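
A minimal ReLU-kernel linear attention sketch (single head, no positional encoding, purely illustrative) shows how the key/value summary is built once and then reused by every query:

```python
# Linear attention sketch: summarize keys/values once, then each query reads
# the summary, so the cost grows linearly with the number of tokens N.
import torch

def linear_attention(q, k, v, eps=1e-6):
    phi_q, phi_k = torch.relu(q), torch.relu(k)   # simple ReLU feature map
    kv_summary = phi_k.T @ v                      # (d, d): one global summary
    k_sum = phi_k.sum(dim=0)                      # (d,): normalization summary
    out = phi_q @ kv_summary                      # each query mixes the summary
    norm = phi_q @ k_sum                          # per-query normalizer
    return out / (norm.unsqueeze(-1) + eps)

N, d = 1024, 64
q, k, v = (torch.randn(N, d) for _ in range(3))
out = linear_attention(q, k, v)   # cost ~ N * d^2 instead of N^2 * d
```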

🍞 Hook: Think of a dimmer switch that brightens or darkens a lamp based on the time of day.

🥬 The Concept: An input-dependent scalar gate decides how much to use the linear branch’s help at each layer, based on the current features.

  • What it is: A tiny learned controller that scales the linear branch’s output.
  • How it works: 1) Look at the layer’s input. 2) Pass it through a small linear layer and a sigmoid. 3) Average across tokens to get one gate value. 4) Multiply the linear-branch output by this gate before adding it to the sparse output.
  • Why it matters: Without it, the linear branch can be too loud or too quiet. The gate keeps balance so the linear branch complements sparse attention.

🍞 Anchor: Like a traffic light that adapts to rush hour, letting more cars (linear info) through only when roads (layers) need it.
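
A small sketch of such a gate, following the recipe above (the module name and sizes are our own illustrative choices):

```python
# Input-dependent scalar gate: tiny linear layer -> sigmoid -> mean over tokens.
import torch
import torch.nn as nn

class ScalarGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 1)    # the tiny learned controller

    def forward(self, x):                # x: (N, dim) layer input
        per_token = torch.sigmoid(self.proj(x))  # (N, 1), each value in [0, 1]
        return per_token.mean()          # one scalar gate for this layer and input

gate = ScalarGate(dim=64)
x = torch.randn(1024, 64)
g = gate(x)                              # scalar tensor, roughly 0.5 at random init
# Downstream use (see the fusion step later): fused = sparse_out + g * proj(linear_out)
```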

🍞 Hook: Imagine building a video from TV snow. Each step makes it a little clearer.

🥬 The Concept: Video Diffusion Transformers start with noisy video and gradually denoise it into a coherent clip using attention.

  • What it is: A model that adds and removes noise to learn how to generate realistic videos from text prompts.
  • How it works: 1) Add noise to real videos during training. 2) Learn to predict and remove that noise. 3) At test time, start from noise and repeatedly denoise using attention blocks.
  • Why it matters: It’s currently one of the best ways to make high-quality videos from text.

🍞 Anchor: Like sculpting from a rough block: chip away the noise step by step until the picture appears.
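
A deliberately oversimplified sampling loop, just to show the "denoise step by step" structure (the real scheduler, text conditioning, and model are far more involved):

```python
# Toy denoising loop: start from noise and remove a little of it each step.
import torch

def toy_denoiser(x, t):
    # Hypothetical stand-in for the video diffusion transformer, which would
    # predict the noise present in x at step t (its attention blocks live here).
    return 0.1 * x

def sample(shape, steps=50):
    x = torch.randn(shape)                # start from pure "TV snow"
    for t in reversed(range(steps)):
        predicted_noise = toy_denoiser(x, t)
        x = x - predicted_noise           # each step makes the clip a bit cleaner
    return x

clip = sample((16, 3, 32, 32))            # toy (frames, channels, height, width)
```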

The world before: Full attention gave great quality but was too slow for long, high-res videos. Sparse attention made it faster but lost long-range connections, causing weird artifacts (like duplicate subjects or flicker). People tried “training-free” sparse masks (fast but limited sparsity) and “training-based” sparse methods (can be very sparse but need tons of data and compute). LoRA fine-tuning helped some after compressing attention, but at ultra-high sparsity and small budgets, it couldn’t fully repair long-distance context.

The problem: How can we keep the speed of very sparse attention (like 90% sparse!) without losing the long-range glue that holds a video together?

Failed attempts: 1) Just sparse—fast, but misses global context. 2) Just linear—global, but too weak alone for complex video patterns. 3) Sparse + LoRA—better, but still can’t fully recover with tiny training budgets. 4) Sparse + linear with a fixed projector—still unbalanced; linear can be too strong or too weak.

The gap: We need a way to cheaply add back the missing global info, and to control it precisely per layer and per input.

Real stakes: Faster, high-quality video generation helps creators preview and iterate quickly, reduces GPU costs for studios, and makes advanced video tools more accessible to students, indie artists, and small companies.

02Core Idea

🍞 Hook: You know how a good basketball team has both sprinters and passers? The sprinters get you down the court fast, and the passers see the whole floor.

🥬 The Concept: SALAD runs sparse attention (the sprinter) in parallel with a tiny linear-attention branch (the passer), and a smart gate decides how much passing help to use.

  • What it is (one sentence): A parallel sparse–linear attention with an input-dependent scalar gate that restores long-range info at ultra-high sparsity.
  • How it works: 1) Compute sparse attention for speed. 2) Compute linear attention for cheap global mixing. 3) Use a learned gate to scale the linear output. 4) Add them together and continue.
  • Why it matters: Without the linear helper, sparse attention drops long links; without a gate, linear can swamp or vanish—both hurt quality.

🍞 Anchor: It’s like pairing a fast runner with a map reader, then using a volume knob to make sure the directions are just loud enough to help.
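
In symbols, the fusion can be paraphrased roughly like this (notation ours, not taken from the paper):

```latex
% Rough paraphrase of the SALAD fusion: sparse output plus a gated,
% projected linear-attention output; G is one scalar per layer and input.
\[
\mathrm{Out}(X) = \mathrm{SparseAttn}(Q, K, V)
  + G(X)\,W_{p}\,\mathrm{LinAttn}(Q, K, V),
\qquad
G(X) = \frac{1}{N}\sum_{i=1}^{N} \sigma\!\left(w^{\top} x_i + b\right) \in [0, 1]
\]
```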

The “Aha!” moment in one sentence: Make ultra-sparse attention work by adding a tiny global-mixing branch and teach a small gate to blend them just right.

Three analogies:

  1. Orchestra: Sparse attention is the rhythm section (steady, efficient). Linear attention is the conductor’s cues (global timing). The gate is the mixer adjusting volumes so music stays balanced.
  2. Cooking: Sparse attention is your quick stovetop cook; linear attention is your slow marinade that spreads flavor evenly. The gate is the timer that decides how much marinade to add back for perfect taste.
  3. City traffic: Local roads (sparse) move cars nearby; the highway (linear) connects distant spots quickly. The ramp meter (gate) controls how many cars enter the highway to keep flow smooth.

Before vs. After:

  • Before: Ultra-sparse attention often produced subject duplication, flicker, and weak text alignment unless you spent huge training budgets.
  • After: With SALAD, you reach 90% sparsity, about 1.72× end-to-end speedup, and keep quality near full attention using only 2k videos and 1.6k steps.

Why it works (intuition, no equations):

  • Sparse attention excels at local detail and efficiency but loses some distant links. Linear attention cheaply spreads a global summary across tokens. Their errors are complementary. The gate learns how much global summary each layer needs from the current features. Zero-initializing the linear branch’s projector starts from the known-good sparse model and lets learning add only helpful global hints. 3D RoPE gives the linear branch a sense of where tokens are in space and time so its global signals are meaningful.

Building blocks (mini concepts):

  • 🍞 Hook: Like grouping classmates by desk rows and also passing a class-wide announcement sheet.
  • 🥬 Sparse attention: local neighbors for speed.
  • 🥬 Linear attention: global summary for reach.
  • 🥬 Gate: a small learned scalar to balance the two.
  • 🥬 Shared Q/K/V: both branches reuse the same projections to save parameters.
  • 🥬 Projection after linear: balances scale before fusing.
  • 🥬 3D RoPE: space–time positional sense for the linear branch.
  • 🍞 Anchor: The result is a clean blend: the local chat stays clear, and the announcements are just loud enough to help everyone stay in sync.

Concept sandwiches (new terms not yet defined above):

  • 🍞 Hook: Imagine labeling each ingredient so you know where and when it was added to a stew.
    🥬 3D Rotary Position Embedding (3D RoPE): a way to mark tokens with their 3D position (height, width, frame) so attention understands space and time.
    • How it works: 1) Rotate Q/K features by angles tied to positions. 2) This encodes relative positions. 3) Both sparse and linear branches can leverage it.
    • Why it matters: Without position sense, global mixing gets blurry; with it, long-range links stay meaningful.
    🍞 Anchor: Like a map grid telling you “row, column, page,” so you don’t confuse two similar-looking streets in different neighborhoods.
  • 🍞 Hook: Think of turning down the volume of a new instrument until you’re sure it blends nicely.
    🥬 Zero-initialization of the linear projector: start the linear branch at zero output, then learn only what helps.
    • How it works: Initialize the projector to zeros; training gradually grows its contribution via the gate (a tiny code sketch follows this list).
    • Why it matters: Prevents the linear branch from blasting noise at the start.
    🍞 Anchor: Like adding salt bit by bit until the soup tastes just right.
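
A tiny sketch of the zero-initialization idea (layer sizes and names are illustrative): at step 0 the fused output equals the sparse output exactly, so training starts from the known-good sparse model.

```python
# Zero-initialized projector: the linear branch contributes nothing at first.
import torch
import torch.nn as nn

proj = nn.Linear(64, 64)
nn.init.zeros_(proj.weight)   # projector output starts at exactly zero
nn.init.zeros_(proj.bias)

sparse_out = torch.randn(1024, 64)
linear_out = torch.randn(1024, 64)
fused = sparse_out + proj(linear_out)      # == sparse_out at initialization
print(torch.allclose(fused, sparse_out))   # True: training then grows the contribution
```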

03Methodology

At a high level: Input video tokens → Sparse attention (fast local) in parallel with Linear attention (cheap global) → Gate scales linear output → Fuse → Output to the next layer → Repeat across layers and denoising steps.

Step-by-step, like a recipe:

  1. Prepare tokens with positions
  • What happens: Convert video latents into a token sequence; apply 3D RoPE to Q and K so both branches understand space and time.
  • Why it exists: Without positions, attention can’t tell if two similar patches come from different frames or spots.
  • Example: Token 12 might be “dog ear at (x=40, y=60) in frame 5.” 3D RoPE tags it so attention remembers this context.
  2. Compute shared Q, K, V
  • What happens: Use the same Wq, Wk, Wv to get Q/K/V that feed both branches.
  • Why it exists: Saves parameters (about +5% total overhead instead of duplicating projections) and keeps representations aligned.
  • Example: Q(12), K(12), V(12) are computed once and sent to both the sparse and linear branches.
  3. Sparse attention branch (local, efficient)
  • What happens: Apply a mask to limit whom each query can see, using either Spatial-Temporal Sliding Window Attention (ST-SWA) or dynamic Top-K.
  • Why it exists: Reduces compute from quadratic to far less by skipping most pairs.
  • Example: In ST-SWA, token 12 only attends to nearby tokens in the same frame and aligned spots in neighboring frames. In Top-K, it keeps only the k most similar blocks.
  4. Linear attention branch (global, cheap)
  • What happens: Using ReLU-based linear attention, pre-summarize keys/values once and mix the summary into each query. Add a small projection layer afterward (zero-initialized).
  • Why it exists: Provides long-range mixing at O(N) cost; the projector balances scales before fusion.
  • Example: A “dog” token gets a gentle global hint from all frames to avoid duplicating the dog or forgetting its color.
  5. Input-dependent scalar gate (the balance knob)
  • What happens: Pass the branch’s input through a tiny linear layer + sigmoid, average across tokens to get a single scalar G in [0,1], then scale the linear branch’s output by G.
  • Why it exists: Keeps the linear branch helpful but not overwhelming; adjusts per input and per layer.
  • Example: If the scene already has strong local consistency, G might be 0.2; if long-range links are needed, G might rise to 0.4.
  6. Fuse and project
  • What happens: Add sparse_output + G × proj(linear_output), then apply the usual output projection Wo (steps 2–6 are sketched in code right after this list).
  • Why it exists: This creates a balanced blend for the next layer, keeping the residual structure familiar to the pretrained model.
  • Example: The fused vector keeps crisp edges (from sparse) and stable identity/motion across frames (from linear).
  7. Train efficiently
  • What happens: Fine-tune on just 2,000 open-source videos for 1,600 steps (batch size 8). LoRA adapts the shared Wq/Wk/Wv/Wo; the linear projector and gate are fully trainable, and the projector starts at zero.
  • Why it exists: Achieves high sparsity without massive datasets or compute.
  • Example: After fine-tuning, the model reaches ~90% sparsity and ~1.72× speedup at 480p × 77 frames while keeping quality near full attention.
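
The sketch below ties steps 2–6 together in one illustrative module (our approximation, not the authors' code): shared Q/K/V projections, a 1-D sliding-window stand-in for the sparse branch, a ReLU linear branch, a zero-initialized projector, and the scalar gate that fuses the two outputs.

```python
# Combined sketch of a gated sparse + linear attention block (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSparseLinearAttention(nn.Module):
    def __init__(self, dim, window=128):
        super().__init__()
        self.window = window
        self.wq = nn.Linear(dim, dim)      # shared projections feed both branches
        self.wk = nn.Linear(dim, dim)
        self.wv = nn.Linear(dim, dim)
        self.wo = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)    # linear-branch projector ...
        nn.init.zeros_(self.proj.weight)   # ... zero-initialized
        nn.init.zeros_(self.proj.bias)
        self.gate = nn.Linear(dim, 1)      # input-dependent scalar gate

    def sparse_branch(self, q, k, v):
        # 1-D sliding window as a stand-in for ST-SWA / Top-K block masking
        N, d = q.shape
        idx = torch.arange(N, device=q.device)
        keep = (idx[:, None] - idx[None, :]).abs() <= self.window
        scores = (q @ k.T / d ** 0.5).masked_fill(~keep, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    def linear_branch(self, q, k, v, eps=1e-6):
        phi_q, phi_k = torch.relu(q), torch.relu(k)
        out = phi_q @ (phi_k.T @ v)                     # cheap global summary mixing
        norm = phi_q @ phi_k.sum(dim=0)
        return out / (norm.unsqueeze(-1) + eps)

    def forward(self, x):                               # x: (N, dim) video tokens
        q, k, v = self.wq(x), self.wk(x), self.wv(x)    # (3D RoPE would rotate q, k here)
        sparse_out = self.sparse_branch(q, k, v)
        linear_out = self.linear_branch(q, k, v)
        g = torch.sigmoid(self.gate(x)).mean()          # one scalar in [0, 1]
        fused = sparse_out + g * self.proj(linear_out)  # gated fusion
        return self.wo(fused)

block = GatedSparseLinearAttention(dim=64)
tokens = torch.randn(2048, 64)
out = block(tokens)   # (2048, 64); equal to the pure sparse path at initialization
```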

What breaks without each piece:

  • No 3D RoPE: The linear branch’s global mix blurs space–time, hurting consistency.
  • No projector: Linear and sparse outputs can be on mismatched scales, making fusion unstable.
  • No gate: Linear might be too strong (collapse) or too weak (no benefit).
  • No shared Q/K/V: Parameter count and memory rise, reducing practicality.

Concrete mini example with numbers:

  • Suppose we have 10,000 tokens.
  • Sparse branch (ST-SWA) lets each token attend to 256 neighbors → big savings over 10,000².
  • Linear branch pre-summarizes all tokens once, then cheaply mixes a global signal into each token.
  • Gate computes G=0.32 for this layer.
  • Final: output = sparse + 0.32 × proj(linear), then Wo to next layer.

Secret sauce (why it’s clever):

  • The gate learns the right mix per input and layer. Zero-init projector avoids early interference. Shared projections keep the extra parameters tiny. And 3D RoPE gives the linear branch the space–time sense it needs to be genuinely helpful rather than just “spread everything everywhere.”

04Experiments & Results

The test: Measure video quality and consistency while counting real end-to-end speed on a single GPU. Use VBench metrics—Subject Consistency (does the subject stay the same?), Background Consistency (does the scene stay stable?), Image Quality (how sharp/clean?), Text Consistency (does it match the prompt?)—and VisionReward. Also report sparsity and speedup.

The competition: Compare against training-free sparse methods (ST-SWA, SVG2, PARO), training-based methods (SLA), and LoRA-tuned sparse baselines (including Top-K+LoRA). Use the same Wan 2.1-1.3B base, 480p, 77 frames. For tuning methods, keep the compute budget small (2k videos, 1.6k steps, bs=8) so the comparison is fair under tight resources.

The scoreboard (contextualized):

  • Full attention is the quality reference but slow.
  • Training-free sparse methods at ~45–63% sparsity give speedups like 1.2–1.5× but lose quality—like getting a B- when you wanted an A.
  • SLA needs much larger data and batch sizes to shine; under the small budget here, it underperforms.
  • LoRA helps sparse models, but at 77–90% sparsity still leaves artifacts like subject duplication and temporal glitches.
  • SALAD (with ST-SWA): About 90% sparsity and ~1.72× speedup, while reaching or surpassing the dense baseline on several VBench scores (e.g., strong subject/background consistency and image quality). That’s like running as fast as the sprinter while keeping the orchestra in tune.
  • SALAD (with Top-K): Also improves over Top-K+LoRA, showing the gate+linear combo works with dynamic sparsity too.

Surprising findings:

  • A small scalar gate matters a lot: dialing down the linear branch slightly at inference improved several metrics, proving balance is critical.
  • Zero-initializing the linear projector trains better and ends up higher quality than random init, because it starts as a known-good sparse model.
  • The linear branch tends to have lower rank than the sparse branch, which fits its “helper” role—global but gentle.
  • You can drop about 20% of linear branches (post-training) based on low gate importance and keep quality roughly the same, gaining ~5% extra speed.
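
A hedged sketch of how such post-training branch dropping could be scripted, assuming the average gate value per layer has been logged on a calibration set (all numbers below are made up):

```python
# Rank layers by average gate value and disable the linear branch in the
# least important ~20% of them; those layers then run sparse attention only.
import torch

avg_gate_per_layer = torch.tensor(
    [0.31, 0.05, 0.22, 0.04, 0.18, 0.27, 0.09, 0.35, 0.12, 0.29]
)

drop_fraction = 0.2
n_drop = int(len(avg_gate_per_layer) * drop_fraction)
drop_layers = torch.argsort(avg_gate_per_layer)[:n_drop]   # lowest gates first
print("disable linear branch in layers:", sorted(drop_layers.tolist()))
# -> [1, 3] in this toy example, for a small extra inference speedup.
```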

Qualitative observations:

  • Without the linear helper, sparse+LoRA can produce two dogs that slowly merge into one or lose small objects (like a boat). With SALAD, details return and motion stays coherent across frames.
  • Attention maps show the linear branch spreads weights across long distances, adding the global glue sparse attention misses.

Bottom line: On a tight budget, SALAD hits high sparsity and real speedup while matching or beating dense attention on consistency and quality—something prior sparse-only or LoRA-only fixes struggled to achieve.

05Discussion & Limitations

Limitations:

  • Linear attention alone isn’t strong enough to model entire long sequences; SALAD relies on sparse attention to do the heavy lifting. If the gate is mis-tuned, linear can under- or over-contribute, hurting quality.
  • While training is light (2k videos, ~20.6 GPU hours), you still need a compatible kernel stack for sparse and linear attention to get end-to-end speedups.
  • The method is validated at 480p × 77 frames; scaling to much longer, higher-res videos may need re-tuning of windows, Top-K, or gate behavior.
  • The gate is a single scalar per layer per sample (after averaging tokens). Finer-grained gates (per head or per channel) might help but add complexity.
  • Domain shifts (e.g., very stylized animation) could require modest extra fine-tuning to keep the balance right.

Required resources:

  • A pretrained video diffusion transformer (e.g., Wan 2.1-1.3B) and 2k short videos for tuning.
  • GPU support for sparse kernels (ST-SWA/Top-K) and a linear-attention implementation.
  • Standard LoRA tooling for Wq/Wk/Wv/Wo plus training the linear projector and gate.

When NOT to use:

  • If latency isn’t a concern and you can afford full attention, SALAD’s complexity may not be needed.
  • For extremely tiny models or super short clips, overheads of extra branches/gating might not pay off.
  • If your stack lacks efficient sparse kernels, theoretical sparsity won’t translate to real speed.

Open questions:

  • Could per-head or per-channel gating further improve balance?
  • How does SALAD scale to minute-long, 4K video while keeping speedups?
  • Can we learn when to drop linear branches per layer automatically during inference for maximum efficiency?
  • Are there better kernel choices than ReLU-based linear attention for video?
  • Could a small cross-attention to a learned global memory replace or complement linear attention?

06Conclusion & Future Work

3-sentence summary: SALAD pairs ultra-sparse attention with a tiny linear-attention helper and an input-dependent gate that blends them just right. With shared projections, zero-initialized fusion, and 3D RoPE, it restores lost long-range information at low cost. Using only 2k videos and 1.6k steps, it reaches 90% sparsity, ~1.72× speedup, and quality on par with full attention.

Main achievement: Showing that a carefully gated, parallel linear branch can make ultra-sparse attention practical for video diffusion—preserving global coherence without large training budgets.

Future directions: Explore finer-grained gates (per head/channel), smarter branch-dropping policies at inference, alternative linear kernels, and scaling to much longer, higher-resolution videos. Investigate integrating a lightweight global memory or mixture-of-experts routing to further enhance global reasoning.

Why remember this: SALAD turns the usual speed–quality trade-off into a win–win for video generation—keeping the sprinting speed of sparse attention while adding just enough global guidance to keep stories consistent across time.

Practical Applications

  • Speed up text-to-video generation for creative previews and storyboarding without sacrificing consistency.
  • Accelerate iterative ad/video design cycles where many prompt variations must be tested quickly.
  • Enable smoother, more coherent motion in AI video editing tools (e.g., inpainting or style changes across frames).
  • Reduce cloud GPU costs for video platforms by serving sparse+linear attention models with real end-to-end speedups.
  • Improve on-device or edge video generation by cutting memory and compute demands.
  • Provide faster educational content creation (animated explainers) with stable characters and scenes.
  • Boost research prototyping for new prompts/datasets by enabling rapid, high-sparsity fine-tuning on small video sets.
  • Enhance generative game assets (cutscenes, trailers) with consistent subjects over time.
  • Support batch generation pipelines where throughput and per-video latency both matter.
  • Facilitate A/B testing of video styles with coherent subjects and backgrounds under tight compute budgets.
#SALAD#sparse attention#linear attention#video diffusion transformer#input-dependent gate#3D RoPE#LoRA fine-tuning#VBench#high sparsity#inference speedup#sliding-window attention#Top-K attention#Wan 2.1-1.3B#Mixkit dataset#diffusion models