
SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching

Intermediate
Yasaman Haghighi, Alexandre Alahi · 2/27/2026
arXiv

Key Summary

  • SenCache speeds up video diffusion models by reusing past answers only when the model is predicted to change very little.
  • Instead of guesswork, it uses a principled 'sensitivity' score that measures how much the model reacts to small changes in its inputs.
  • This score looks at two things at once: how much the noisy picture (the latent) moved and how far the time step jumped.
  • If the predicted change stays under a chosen tolerance (epsilon), SenCache safely skips a costly model call and uses a cached result.
  • A tiny calibration set (as few as 8 videos) is enough to estimate the model’s sensitivity for later use.
  • Across Wan 2.1, CogVideoX, and LTX-Video, SenCache delivers equal or better visual quality than prior caching methods at similar speed.
  • Early denoising steps are fragile, so SenCache uses a stricter tolerance there to protect quality.
  • The method needs no retraining, no model edits, and works with different architectures and samplers.
  • A simple cap on consecutive cache reuses (n) prevents drift and keeps results stable.
  • This sensitivity-first view explains why past heuristics sometimes worked and sometimes failed, and turns caching into an adaptive, per-sample strategy.

Why This Research Matters

Video diffusion models are powerful but slow, making creative tools feel laggy and costly. SenCache speeds them up without retraining by skipping work only when it’s safe, so artists and developers get faster turnarounds. The method is adjustable: users can pick more speed or more quality depending on the task. It also needs just a tiny setup (a small calibration set), so it’s practical for many teams. Because it’s model- and sampler-agnostic, it can travel across systems and likely extends to other domains like audio and motion. Lower compute means lower energy, which is good for budgets and the planet. In short, it unlocks more responsive, more sustainable generative experiences.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) You know how when you’re doing homework, you sometimes reuse work you did before—like a solved math step—if you’re sure it still applies? But if the problem changes a lot, you redo the step so you don’t make a mistake.

🥬 Filling (The Actual Concept)

  • What it is: This paper is about making video-generating AIs (diffusion models) faster by reusing earlier results only when it’s safe.
  • How it works (step by step):
    1. Notice that diffusion models take many tiny steps to turn noise into a video. Each step is expensive.
    2. If two steps are very similar, we can reuse the old answer (cache) instead of paying for a new one.
    3. Past methods guessed when to reuse, using rules of thumb (heuristics).
    4. This paper replaces guesswork with a sensitivity test: it asks, “Will the answer change a lot if the inputs change a little?”
    5. If the test says “No, the change is tiny,” we safely reuse the cached answer and skip the heavy computation.
  • Why it matters: Without smart reuse, diffusion models are slow and costly, which makes high-quality video generation impractical for many users.

🍞 Bottom Bread (Anchor) Imagine editing a 5‑second clip: if the next frame looks almost the same as the last, why recalculate everything? Just reuse the last result and move on.

🍞 Top Bread (Hook) Imagine a sculptor gently sanding a statue. Most strokes are tiny. If two strokes are in the same spot with the same pressure, the result won’t change much.

🥬 Filling (Diffusion Models)

  • What it is: Diffusion models are AIs that start with pure noise and carefully remove it step by step to create images or videos.
  • How it works:
    1. Start with random noise.
    2. At each time step, a denoiser predicts how to remove a bit of noise.
    3. Repeat many times until a clear picture or video appears.
  • Why it matters: The many steps make diffusion models very high quality—but also slow, especially for long videos.

🍞 Bottom Bread (Anchor) Think of un-crumpling a paper ball slowly, one small tug at a time, until it’s flat and readable.
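The step-by-step denoising loop above can be sketched numerically. This is a toy stand-in (a tiny array and a fake denoiser), not a real video model; it only shows the shape of the loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x, t):
    """Stand-in for the expensive network: points from x toward a clean target."""
    target = np.zeros_like(x)      # pretend the "clean" video is all zeros
    return target - x

x = rng.normal(size=(4, 4))        # tiny toy "latent" instead of a real video
x0 = x.copy()
steps = 50
for i in range(steps):
    t = 1.0 - i / steps            # time runs from 1 (noisy) down to 0
    x = x + (1.0 / steps) * toy_denoiser(x, t)

# After many small steps the latent has moved most of the way to the target.
print(float(np.abs(x).mean()) < float(np.abs(x0).mean()))  # → True
```

Each iteration here is cheap; in a real model, each one is a full forward pass of a large network, which is exactly the cost SenCache tries to skip.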

🍞 Top Bread (Hook) You know how your web browser loads pages faster when it has saved copies of images from before? That’s caching.

🥬 Filling (Caching and Caching Mechanism)

  • What it is: Caching saves previous results so you don’t have to recompute them.
  • How it works:
    1. Store a model’s output from an earlier step.
    2. When you reach a similar step later, reuse that stored output.
    3. Only recompute if things have changed a lot.
  • Why it matters: Caching cuts down on repeated work, making the whole process much faster.

🍞 Bottom Bread (Anchor) Like keeping your favorite snack handy on your desk so you don’t walk to the kitchen every time.
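The three steps above can be sketched in a few lines. The `expensive_model`, the tolerance, and the input values are all illustrative, not taken from the paper:

```python
calls = []

def expensive_model(x):
    calls.append(x)                 # count how often we really compute
    return x * 2.0

cache = {"input": None, "output": None}

def cached_model(x, tol=0.05):
    # Reuse the stored output when the new input is close to the cached one.
    if cache["input"] is not None and abs(x - cache["input"]) < tol:
        return cache["output"]      # reuse: skip the heavy call
    cache["input"], cache["output"] = x, expensive_model(x)
    return cache["output"]

outputs = [cached_model(v) for v in [1.00, 1.01, 1.02, 1.50, 1.51]]
print(len(calls))                   # → 2 real computations for 5 requests
```

The open question, which the rest of the paper answers, is how to choose that "close enough" test in a principled way.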

🍞 Top Bread (Hook) Imagine racing a friend to finish chores. If you do fewer steps or make each step quicker, you finish sooner.

🥬 Filling (Inference Efficiency)

  • What it is: Inference efficiency means how quickly and cheaply a trained model can produce results.
  • How it works:
    1. Reduce the number of big steps the model takes.
    2. Or make each step cost less.
    3. Or both, without hurting quality too much.
  • Why it matters: Faster generation saves time, money, and energy.

🍞 Bottom Bread (Anchor) If it takes minutes for each short video, that’s too slow for creative work or interactive tools.

🍞 Top Bread (Hook) You know how sometimes we use shortcuts like “If it looks cloudy, bring an umbrella,” even though it’s not perfect science?

🥬 Filling (Heuristics and Heuristic Caching Methods)

  • What it is: Heuristics are quick rules of thumb based on experience, not guaranteed by theory.
  • How it works:
    1. Watch a simple signal (like a change in embedding or output size).
    2. If the signal is small, skip the step; if big, recompute.
    3. Tune a bunch of thresholds to make it work okay.
  • Why it matters: Heuristics can be helpful but may fail when the situation changes or when two different causes of change aren’t both tracked.

🍞 Bottom Bread (Anchor) Past caching methods (TeaCache, MagCache) worked in some cases but stumbled when the model was sensitive in ways the rules didn’t track.

02Core Idea

🍞 Top Bread (Hook) Imagine you’re walking on a path and deciding when to check your map. If the path isn’t turning much and the ground feels the same, you skip checking and keep walking. But if the path bends or the terrain changes, you check the map again.

🥬 Filling (Aha! Moment)

  • What it is: The key insight is to reuse a denoiser’s past output only when the model is locally insensitive—when tiny input changes won’t change the output much.
  • How it works:
    1. Measure how sensitive the denoiser is to two things: the noisy picture (latent) and the time step.
    2. Predict how much the output would change between two steps using these sensitivities.
    3. If the predicted change is below a tolerance (epsilon), reuse the cached output; otherwise recompute.
  • Why it matters: This replaces guesswork with a clear, adjustable rule that adapts per sample.

🍞 Bottom Bread (Anchor) If your GPS says the road ahead is straight for a while, you don’t need to keep checking it every few seconds—you just go.

🍞 Top Bread (Hook) Think of a sleeping baby: if the room gets only a tiny bit brighter or a tiny bit noisier, the baby stays asleep; big changes wake them up.

🥬 Filling (Network Sensitivity)

  • What it is: Network sensitivity measures how much a model’s answer changes when its inputs change a little.
  • How it works:
    1. Look at two directions of change: in the picture itself (latent) and in time.
    2. Estimate how responsive the model is to each direction.
    3. Add their contributions to predict the total change.
  • Why it matters: If we ignore either direction, we’ll sometimes reuse when we shouldn’t, causing artifacts.

🍞 Bottom Bread (Anchor) If the light changes a lot or the sound changes a lot, the baby wakes. We must watch both, not just one.

🍞 Top Bread (Hook) Suppose you’re adjusting the volume knob a tiny bit and want to guess how much louder it gets without turning it all the way—just from that tiny nudge.

🥬 Filling (First-Order Approximation)

  • What it is: A simple way to predict small output changes from small input changes using local slopes (no heavy math needed here).
  • How it works:
    1. Nudge the input slightly and see how the output moves.
    2. Do the same for time.
    3. Use these small moves to predict the change over the next real step.
  • Why it matters: This quick estimate is cheap and good enough to decide when to reuse.

🍞 Bottom Bread (Anchor) Like checking how sensitive your faucet is: a tiny twist tells you how much more water will come out, so you can predict a bigger twist.

🍞 Top Bread (Hook) Imagine setting a guardrail: “As long as the wobble is smaller than my limit, keep going. If it’s bigger, slow down and recheck.”

🥬 Filling (Tolerance Epsilon)

  • What it is: Epsilon is a knob that sets how strict we are about reusing cached results.
  • How it works:
    1. Choose a small epsilon for high quality (fewer reuses, more checks).
    2. Choose a larger epsilon for speed (more reuses, risk a bit of quality).
    3. Use a very small epsilon for early, fragile steps and relax later steps.
  • Why it matters: This single knob lets users trade speed for quality in a controlled way.

🍞 Bottom Bread (Anchor) Like setting how crispy you want your toast: lighter means safer but slower; darker is faster but riskier.

🍞 Top Bread (Hook) Think of packing snacks for a hike. You can reuse snacks for a few stops, but not forever—they go stale.

🥬 Filling (Cache Lifetime n)

  • What it is: A small number n limits how many steps in a row we reuse before forcing a refresh.
  • How it works:
    1. Count how many times in a row you reused.
    2. If you hit n, recompute once to reset drift.
    3. Then start counting again.
  • Why it matters: Prevents small errors from slowly piling up.

🍞 Bottom Bread (Anchor) You won’t keep eating yesterday’s sandwich for a week—you make a fresh one after a few meals.
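The counting logic above can be sketched in a few lines (an illustrative counter, not the paper's code):

```python
# After n consecutive reuses, force a real compute to reset any drift.
n = 4
reuse_streak = 0
schedule = []

for step, wants_reuse in enumerate([True] * 10):
    if wants_reuse and reuse_streak < n:
        schedule.append("reuse")
        reuse_streak += 1
    else:
        schedule.append("compute")  # refresh the cache, reset the counter
        reuse_streak = 0

print(schedule.count("compute"))    # → 2 forced refreshes in 10 steps
```

Even when every step looks safe to reuse, the cap guarantees a periodic fresh computation.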

🍞 Top Bread (Hook) Imagine choosing between two weather tips: one watches the thermometer (temperature), the other watches the clock (time of day). You’d do best by checking both.

🥬 Filling (Why It Works vs. Before/After)

  • Before: Heuristics watched one clue (like time embedding change or residual size) and missed the other.
  • After: SenCache watches both the latent change and the time change, predicting output drift better.
  • Why it works: Locally, models behave like smooth functions; the combined sensitivity gives a faithful picture of near-future change.

🍞 Bottom Bread (Anchor) If sunset is near (time) and a cold front arrives (state), the temperature will drop faster than from either clue alone.

03Methodology

🍞 Top Bread (Hook) Imagine rollerblading downhill. You check your speedometer only when the slope or your pace might change a lot; otherwise you keep rolling and save time.

🥬 Filling (High-Level Recipe)

  • What it is: A step-by-step plan to decide when to reuse a cached denoiser output.
  • How it works (overview): Input (noisy latent, time, prompt) → Compute denoiser once → Track how much the latent and time change → Predict output change using sensitivity → If small, reuse; if big, recompute → Output frames.
  • Why it matters: Each reuse skips a heavy forward pass, cutting latency without retraining.

🍞 Bottom Bread (Anchor) Like reusing a good map reading for nearby turns, only checking again when the road bends.

— Step A: Define the pieces —

🍞 Top Bread (Hook) You know how a recipe needs ingredients and tools before cooking begins?

🥬 Filling (Latent x_t and Timestep t)

  • What it is: x_t is the current noisy picture; t is the current step in the denoising schedule.
  • How it works:
    1. Start at high t with lots of noise.
    2. Step to lower t while x_t becomes cleaner.
    3. At each step, the denoiser predicts how to clean.
  • Why it matters: Reuse decisions depend on how much x_t and t change between steps.

🍞 Bottom Bread (Anchor) If t only drops a tiny bit and x_t barely changes, it’s a good reuse candidate.

— Step B: Estimate model sensitivity —

🍞 Top Bread (Hook) Before a hike, you check how your shoes grip on small test steps.

🥬 Filling (Sensitivity Score: combining latent and time)

  • What it is: A single score predicting how much the denoiser’s output will change over the next step.
  • How it works:
    1. Precompute two numbers per time step: sensitivity to latent changes and sensitivity to time changes.
    2. During sampling, measure how much x_t moved and how much t moved since the last real compute.
    3. Multiply each movement by its sensitivity and add them.
  • Why it matters: Watching both parts avoids surprises when either latent or time effects are big.

🍞 Bottom Bread (Anchor) If the predicted change is 0.03 and epsilon is 0.05, we reuse; if it’s 0.07, we recompute.
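The multiply-and-add rule above can be sketched as follows; the function name and the numbers (0.4, 0.8, 0.05) are illustrative, echoing the anchor example rather than any released code:

```python
def sensitivity_score(latent_move, time_move, sens_latent, sens_time):
    """Predicted output change: each movement weighted by its sensitivity."""
    return sens_latent * latent_move + sens_time * time_move

epsilon = 0.05
score = sensitivity_score(latent_move=0.025, time_move=0.025,
                          sens_latent=0.4, sens_time=0.8)
print(score <= epsilon)     # → True: 0.03 is under the tolerance, so we reuse
```

Tracking only one of the two terms is exactly the failure mode of the earlier heuristics this score replaces.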

— Step C: Calibrate once, reuse many —

🍞 Top Bread (Hook) You taste a new soup with a few sips to learn its flavor before seasoning; you don’t need the whole pot.

🥬 Filling (Calibration)

  • What it is: A tiny, one-time procedure to estimate the model’s typical sensitivities.
  • How it works:
    1. Pick a small, diverse set of videos (e.g., 8).
    2. For each relevant time step, nudge inputs a little to see how outputs react (finite differences).
    3. Save these sensitivity values in a lookup table.
  • Why it matters: SenCache can then make fast decisions later without extra forward passes each time.

🍞 Bottom Bread (Anchor) With just 8 videos, the sensitivity curves looked almost the same as with thousands.
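A hedged sketch of such a calibration pass is below. The denoiser is a toy stand-in, and the function names are my own; the real procedure runs the actual network on real video latents:

```python
import numpy as np

def toy_denoiser(x, t):
    """Stand-in for the real network during this illustration."""
    return np.tanh(x) * (1.0 - t)

def calibrate(latents, timesteps, h=1e-3):
    """Average finite-difference sensitivities over a tiny calibration set."""
    table = {}
    for t in timesteps:
        sx, st = [], []
        for x in latents:
            base = toy_denoiser(x, t)
            sx.append(np.abs(toy_denoiser(x + h, t) - base).mean() / h)
            st.append(np.abs(toy_denoiser(x, t + h) - base).mean() / h)
        table[t] = (float(np.mean(sx)), float(np.mean(st)))
    return table

rng = np.random.default_rng(1)
calib = [rng.normal(size=8) for _ in range(8)]   # stand-in for "8 videos"
table = calibrate(calib, timesteps=[0.9, 0.5, 0.1])
print(sorted(table))
```

The output is a small per-timestep lookup table, which is all the sampler needs later to score each step cheaply.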

— Step D: Make per-step decisions —

🍞 Top Bread (Hook) Think of a traffic light that changes based on real traffic, not a fixed timer.

🥬 Filling (Adaptive Cache Policy)

  • What it is: A rule that decides reuse vs. recompute at every step for every sample.
  • How it works:
    1. Track cumulative changes since the last real compute (latent drift, time gap).
    2. Compute the sensitivity score using the precomputed sensitivities and these changes.
    3. If the score ≤ epsilon and you haven’t hit the reuse cap n, reuse; else recompute and reset.
  • Why it matters: Samples differ. Easy ones get more reuses; tricky ones get more recomputes.

🍞 Bottom Bread (Anchor) A calm scene of a blue sky reuses often; a sudden explosion scene refreshes more.
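The per-step rule can be sketched as a small predicate; the names and numbers are illustrative, not the paper's API:

```python
def should_reuse(latent_drift, time_gap, sens_latent, sens_time,
                 epsilon, streak, cap):
    """Reuse only if the predicted change is small and the streak is short."""
    predicted = sens_latent * latent_drift + sens_time * time_gap
    return predicted <= epsilon and streak < cap

# A calm step reuses; a big latent jump forces a recompute.
print(should_reuse(0.02, 0.02, 0.4, 0.8, epsilon=0.05, streak=1, cap=4))  # → True
print(should_reuse(0.10, 0.02, 0.4, 0.8, epsilon=0.05, streak=1, cap=4))  # → False
```

Because the drift inputs differ per sample and per step, the same code yields a different cache schedule for a calm scene than for a fast-moving one.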

— Step E: Guardrails for stability —

🍞 Top Bread (Hook) Even smooth roads have speed limits.

🥬 Filling (Tolerance Epsilon and Reuse Cap n)

  • What it is: Two knobs to balance speed and quality.
  • How it works:
    1. Use a smaller epsilon early (fragile steps) and a larger one later (more forgiving).
    2. Set n to avoid too many reuses in a row; beyond n, force a refresh.
    3. Tune them to match your quality and latency goals.
  • Why it matters: Keeps quality from drifting and prevents rare but large errors.

🍞 Bottom Bread (Anchor) In tests, NFE dropped as epsilon increased; beyond n≈4, quality started to dip without further speed gains.

— Step F: Putting it all together (example) —

🍞 Top Bread (Hook) Let’s do a mini, pretend run.

🥬 Filling (End-to-End Reuse Decision)

  • What it is: A snapshot of one caching decision.
  • How it works:
    1. We just computed at step t=0.50 and cached the output.
    2. The sampler moves to t=0.49; x_t shifts slightly.
    3. From the table, sensitivity-to-latent = 0.4 and sensitivity-to-time = 0.8. Predicted change = 0.4 × latent_move + 0.8 × time_move.
    4. If the predicted change is 0.03 and epsilon is 0.05, reuse (count = 1). If the next step predicts 0.06, or the count hits n, recompute and reset.
  • Why it matters: This tiny bit of math guides when to skip, saving big compute over many steps.

🍞 Bottom Bread (Anchor) Like deciding to skip checking a map for a few straight blocks, but looking again when streets curve.
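Replaying this pretend run in code: the sensitivities and tolerance come from the example above, while the specific latent and time moves are chosen to reproduce its 0.03 and 0.06 scores:

```python
sens_latent, sens_time = 0.4, 0.8       # from the (pretend) lookup table
epsilon, cap = 0.05, 4

def predicted_change(latent_move, time_move):
    return sens_latent * latent_move + sens_time * time_move

count = 0
# Step 1: t goes 0.50 -> 0.49 and the latent shifts slightly.
change = predicted_change(latent_move=0.055, time_move=0.01)
if change <= epsilon:
    count += 1                          # reuse the cached output
# Step 2: a bigger shift pushes the prediction over epsilon.
change2 = predicted_change(latent_move=0.10, time_move=0.025)
if change2 > epsilon or count >= cap:
    count = 0                           # recompute and reset the streak
print(round(change, 3), round(change2, 3), count)  # → 0.03 0.06 0
```

One reuse, then a forced recompute: exactly the two branches of the decision rule.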

The Secret Sauce

  • Combining both latent and time sensitivities into one score is the clever bit: it matches how the model actually changes.
  • Decisions are per-sample and per-step, not fixed in advance, so the cache schedule adapts to easy vs. hard moments.
  • A tiny, once-only calibration gives you the sensitivity map without retraining or model edits.

04Experiments & Results

🍞 Top Bread (Hook) When you try a new running shoe, you care about speed and comfort. Here, speed is how few heavy steps the model takes; comfort is how good the videos look.

🥬 Filling (The Test and Competitors)

  • What it is: The authors tested SenCache on three popular video diffusion models and compared it against two strong caching baselines (TeaCache and MagCache).
  • How it works:
    1. Measure speed by the number of function evaluations (NFE) and the cache ratio (how often we reuse).
    2. Measure visual quality with LPIPS (lower is better), PSNR (higher is better), and SSIM (higher is better).
    3. Try both conservative (slow) and aggressive (fast) settings to probe the speed–quality trade.
  • Why it matters: Real users want fast videos that still look great.

🍞 Bottom Bread (Anchor) It’s like comparing cars on both lap time and comfort: a rocket-fast car that shakes your teeth out isn’t a winner.

🍞 Top Bread (Hook) Think of three race tracks with different terrain: smooth (Wan), twisty (LTX), and bumpy (CogVideoX). The same driving style won’t work best on all three.

🥬 Filling (Scoreboard with Context)

  • Wan 2.1 (81 frames, 832×480, 50 steps):
    1. Slow regime: All methods look similar in quality, but TeaCache used more NFEs (slower). MagCache and SenCache matched NFEs, and quality stayed high (PSNR ≈ 30.7, SSIM ≈ 0.939, LPIPS ≈ 0.039).
    2. Fast regime: At matched NFEs, SenCache produced better quality than MagCache (e.g., lower LPIPS ≈ 0.054 vs 0.060, higher SSIM ≈ 0.922 vs 0.914), like getting an A when the other gets a B+.
  • CogVideoX (49 frames, 720×480, 50 steps): At similar speed, SenCache beat MagCache on all three metrics (LPIPS ≈ 0.190 vs 0.195; PSNR ≈ 22.09 vs 21.85; SSIM ≈ 0.779 vs 0.733), a noticeable bump in clarity and structure.
  • LTX-Video (161 frames, 768×512, 50 steps): SenCache again edged out MagCache (LPIPS ≈ 0.163 vs 0.180; PSNR ≈ 23.67 vs 23.37; SSIM ≈ 0.829 vs 0.822), a solid win in detail preservation.

🍞 Bottom Bread (Anchor) Imagine two photo apps that both export fast; SenCache’s photos look sharper and more faithful at the same speed.

🍞 Top Bread (Hook) Sometimes a tiny test tells you a lot—like poking a cake with a toothpick to see if it’s done.

🥬 Filling (Surprising/Useful Findings)

  • Small calibration sets work: Sensitivity curves from just 8 videos closely matched those from thousands, saving setup time.
  • Early steps are delicate: Using a very strict epsilon (e.g., 1% error) early protects final quality.
  • Cap on reuses: Increasing the max reuse count n improves speed up to around n=4; after that, quality drops without extra speed gains.
  • Model differences: CogVideoX and LTX needed larger epsilons to hit low NFEs, suggesting they change more per step (higher effective sensitivity) than Wan; bigger epsilon means more approximate reuse and a steeper quality trade.

🍞 Bottom Bread (Anchor) Like different breads needing different oven times, each model has its own best settings for speed vs. taste.

🍞 Top Bread (Hook) You don’t just want theory—you want a stopwatch result.

🥬 Filling (Wall-Clock and Compute)

  • On a GH200 GPU for Wan 2.1, SenCache cut end-to-end time from ~182 s to ~107 s (≈41% speedup), slightly better than MagCache (~110 s, ≈39%).
  • Total FLOPs dropped by about 58% for both caching methods, turning a roughly three-minute generation into under two minutes while keeping visuals crisp.

🍞 Bottom Bread (Anchor) That’s like trimming a 3-minute wait to under 2 minutes without making the video look worse—in many cases, it looks better.
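As a quick sanity check on the reported wall-clock numbers (~182 s down to ~107 s):

```python
# Speedup from the reported Wan 2.1 end-to-end times on a GH200 GPU.
baseline_s, sencache_s = 182.0, 107.0
speedup_pct = (baseline_s - sencache_s) / baseline_s * 100
print(round(speedup_pct))   # → 41, matching the ~41% figure in the text
```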

05Discussion & Limitations

🍞 Top Bread (Hook) Even a great compass has limits: it works best when the terrain isn’t too wild.

🥬 Filling (Honest Assessment)

  • Limitations:
    1. First-order estimates are local: in strongly nonlinear regions, they can underpredict changes and cause artifacts.
    2. Sensitivity tables are per model; unusual content outside the tiny calibration set might shift behavior.
    3. A fixed epsilon might be too strict in some steps and too loose in others; dynamic schedules could do better.
    4. Very long reuse chains can drift; hence the need for an n cap.
  • Required Resources:
    1. A small calibration run (as few as 8 videos) to estimate sensitivities; storage for a compact lookup table.
    2. No retraining, no architecture edits; works with standard samplers.
  • When NOT to Use:
    1. Ultra-aggressive speedups with zero tolerance for quality loss in highly nonlinear scenes.
    2. Extremely low-step solvers (e.g., after heavy distillation) where there’s little to cache.
    3. If memory is too tight to keep cached outputs.
  • Open Questions:
    1. Can we learn better (possibly higher-order) sensitivity predictors cheaply?
    2. What’s an optimal dynamic epsilon(t) schedule across steps?
    3. How best to combine SenCache’s local rule with global step-planning methods?
    4. How well does this extend to audio, motion, or multimodal diffusion with different dynamics?

🍞 Bottom Bread (Anchor) Like switching from a fixed bedtime to a smart bedtime that adapts to how tired you are, a dynamic epsilon could boost both rest (quality) and schedule (speed).

06Conclusion & Future Work

🍞 Top Bread (Hook) If a path barely turns, you don’t need to check the map every few steps—save time and keep walking.

🥬 Filling (Takeaway)

  • 3-Sentence Summary: SenCache accelerates diffusion-model inference by reusing denoiser outputs only when a sensitivity test predicts tiny changes. It measures how responsive the model is to shifts in both the noisy image (latent) and the time step, then uses a tolerance (epsilon) and a reuse cap (n) to keep quality high. This turns caching from guesswork into a principled, adaptive, per-sample policy that needs no retraining.
  • Main Achievement: A simple, theory-backed sensitivity score that unifies latent and time effects into a single, actionable cache rule.
  • Future Directions: Smarter sensitivity estimators, dynamic epsilon schedules, and hybrids that mix SenCache’s local decisions with global step-planning.
  • Why Remember This: It reframes “when to skip” as “how much will the output move,” offering a clean, controllable speed–quality trade that travels well across models and likely across modalities.

🍞 Bottom Bread (Anchor) It’s the difference between randomly skipping pages in a book and skipping only the lines you’re sure repeat the same idea—faster reading without losing the story.

Practical Applications

  • Speed up text-to-video preview rendering for creators to iterate scripts and storyboards quickly.
  • Accelerate video ads and social media clip generation at scale while keeping brand quality.
  • Enable near-real-time visual feedback in video editing tools that rely on diffusion-based effects.
  • Reduce inference costs for content platforms that auto-generate short clips and trailers.
  • Improve latency in interactive applications like motion prototyping or AR content generation.
  • Batch-generate educational or training videos faster under fixed compute budgets.
  • Deploy lighter, greener video generation services by cutting GPU hours and power usage.
  • Prototype sensitivity-aware skipping for audio diffusion (e.g., voiceovers) to speed synthesis.
  • Integrate adaptive caching into multimodal pipelines (text+image+video) for end-to-end gains.
#diffusion models #video generation #caching #sensitivity analysis #Jacobian norm #inference acceleration #adaptive caching #first-order approximation #DiT (Diffusion Transformer) #LPIPS #PSNR #SSIM #flow matching #classifier-free guidance #dynamic scheduling