
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Intermediate
Haowei Zhang, Shudong Yang, Jinlan Fu et al. Ā· 1/21/2026
arXiv Ā· PDF

Key Summary

  • HERMES is a training-free way to make video-language models understand live, streaming video quickly and accurately.
  • It treats the model’s KV cache like a layered memory: shallow layers remember the newest frames, middle layers connect recent details, and deep layers keep long-term summaries.
  • By smartly keeping only the most useful video tokens in each layer, HERMES cuts video tokens by up to 68% while staying accurate.
  • It answers in real time without any extra retrieval step when you ask a question, giving up to 10Ɨ faster time to first token (TTFT) than prior state of the art.
  • HERMES adds three plug-and-play pieces: hierarchical KV cache management, cross-layer memory smoothing, and position re-indexing.
  • Across StreamingBench and OVO-Bench, HERMES improves accuracy over base models and prior training-free methods, with gains up to 11.4% on streaming tasks.
  • GPU memory stays steady even as the video gets longer because the cache has a fixed budget per layer.
  • It works across different popular models (like LLaVA-OneVision and Qwen2.5-VL) and sizes (from 0.5B to 32B).
  • Lazy re-indexing is best for streaming (low cost, stable), while eager re-indexing is better for offline long-range tasks.
  • HERMES shows that we can manage streaming video memory inside the model, without retraining and without heavy external systems.

Why This Research Matters

Real-time video understanding powers everyday tools like live sports explainers, classroom demos, safety monitors, and wearable assistants. HERMES shows that we can get fast, stable answers without retraining models or relying on slow, last-second retrieval. It keeps GPU memory usage steady even as videos grow long, which makes deployment cheaper and more reliable. By matching what each layer naturally remembers, it preserves both fresh details and long-term summaries. That means better accuracy with fewer tokens, which is greener and more scalable. As a training-free, plug-and-play method, it can be adopted widely across different open-source models. This brings practical, low-latency streaming AI closer to everyday reality.

Detailed Explanation


01 Background & Problem Definition

You know how watching a long live video can feel like drinking from a firehose—you see new things every second, but you also need to remember earlier moments to understand what’s going on now? That same challenge hits AI models that try to follow a video stream in real time. Before this paper, Multimodal Large Language Models (MLLMs) had gotten quite good at understanding pre-recorded (offline) videos: they could load a clip, examine it carefully, and then answer questions. But livestreams are different: the video never stops, future frames are unknown, questions can arrive anytime, and answers must be quick.

The problem looked like this: models had to keep enough of the past to reason about the present, while not running out of fast GPU memory and still replying instantly when asked. Many earlier attempts stored video memories outside the model (as captions or raw patches) and then did retrieval when a user asked a question. That helped memory pressure but slowed everything down at the worst moment—right when the user wanted an answer. Other works tried compressing video tokens for offline clips, but those methods didn’t fit streaming because they often assumed you could see the whole video ahead of time or run extra computations later.

Failed attempts taught two lessons: (1) extra retrieval at question time hurts latency and breaks the smooth, end-to-end flow; and (2) one-size-fits-all compression treats every layer and every token the same, ignoring that different decoder layers specialize in different kinds of information.

The missing piece was a better view of what the model’s own memory (the KV cache) is actually storing during a stream. If we could understand how layers use this cache, we might keep the right tokens at the right layers and toss the rest—without retraining and without pausing for retrieval.

This matters in daily life. Imagine a teacher live-commenting on a science demo, a sports app explaining a replay seconds after it happens, a wearable assistant helping someone find their keys in a long home video, or a safety system monitoring a factory floor. All need fast, steady answers in the moment, with limited hardware. The stakes are practical: save memory, keep latency tiny, and don’t sacrifice accuracy.

šŸž Hook: Imagine your notebook during a classroom experiment. You scribble fresh observations on the top, summarize useful facts in the middle, and keep a neat final summary at the bottom for later. 🄬 The Concept (KV Cache): The KV cache is the model’s short-term memory that stores key and value vectors for past tokens so it can quickly look back without recomputing everything.

  • How it works:
    1. As each video frame is turned into tokens, the model saves their keys and values in a cache.
    2. When new tokens arrive (or a question appears), attention looks back into this cache to find relevant info fast.
    3. A fixed budget decides how many past tokens each layer can keep.
  • Why it matters: Without a cache, the model would need to reprocess the whole past every time, making real-time answers too slow and memory-hungry. šŸž Anchor: Like keeping sticky notes of recent steps in a lab procedure so you can quickly check what you just did without rereading the entire textbook.
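
To make the idea concrete, here is a minimal sketch (not the paper's code) of one layer's KV cache with a fixed token budget. The class name, tensor shapes, and the `keep` helper are illustrative assumptions; real decoders keep separate key/value tensors per attention head and per layer.

```python
import torch

class LayerKVCache:
    """Toy per-layer KV cache with a fixed token budget (illustrative sketch)."""

    def __init__(self, budget: int, head_dim: int):
        self.budget = budget                        # max tokens this layer may keep
        self.keys = torch.empty(0, head_dim)        # cached key vectors, one row per token
        self.values = torch.empty(0, head_dim)      # cached value vectors

    def append(self, new_keys: torch.Tensor, new_values: torch.Tensor) -> None:
        # Add the newest frame tokens (already projected to keys/values) to the cache.
        self.keys = torch.cat([self.keys, new_keys], dim=0)
        self.values = torch.cat([self.values, new_values], dim=0)

    def over_budget(self) -> bool:
        return self.keys.shape[0] > self.budget

    def keep(self, idx: torch.Tensor) -> None:
        # Keep only the selected token rows (e.g., Top-K by an importance score),
        # preserving their temporal order.
        idx = torch.sort(idx).values
        self.keys = self.keys[idx]
        self.values = self.values[idx]

# Usage sketch: a 4K-token budget per layer and 128-dim heads (both assumed numbers).
cache = LayerKVCache(budget=4096, head_dim=128)
cache.append(torch.randn(196, 128), torch.randn(196, 128))   # one 196-token frame
```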

šŸž Hook: Think of how you remember a field trip: vivid recent moments, a working list of important details, and a long-term summary of the whole day. 🄬 The Concept (Recency Bias): Recency bias means the model tends to focus more on the newest tokens in early layers.

  • How it works:
    1. Shallow layers pay strong attention to the most recent frames.
    2. Their attention fades quickly for older frames.
    3. Deeper layers gradually care less about recency and more about stable summaries.
  • Why it matters: If you treat all layers the same, you either keep too much recent fluff everywhere or throw away valuable long-term anchors. šŸž Anchor: When following instructions, you glance most at the latest step, not the first one from ten minutes ago.

02 Core Idea

The aha! HERMES sees the KV cache as a layered (hierarchical) memory: shallow layers act like sensory memory for the latest frames, middle layers act like working memory to connect recent pieces, and deep layers act like long-term memory holding frame-level anchors—so we keep different tokens at different layers, on purpose.

šŸž Hook: You know how a school has lockers for daily items, binders for current units, and an archive for year-long records? 🄬 The Concept (Hierarchical Memory Framework): A hierarchical memory framework organizes what to keep at each layer so the model remembers the right stuff at the right depth.

  • How it works:
    1. Shallow layers keep mostly very recent tokens (sensory memory).
    2. Middle layers blend recent info with important earlier tokens (working memory).
    3. Deep layers keep sparse, rhythmically spaced anchor tokens that summarize each frame (long-term memory).
  • Why it matters: Without this structure, memory gets clogged or misses key anchors, causing either slow responses or confused answers. šŸž Anchor: Like keeping today’s homework in your backpack, current notes in a binder, and final summaries in a portfolio.
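
A tiny sketch of how a layer's depth could set its memory role, i.e., how heavily it weighs recency versus attention-based anchors. The 25%/25% shallow/deep split and the linear ramp in the middle are assumptions for illustration, not the paper's exact partition.

```python
def recency_attention_blend(layer_idx: int, num_layers: int,
                            shallow_frac: float = 0.25, deep_frac: float = 0.25) -> float:
    """Return alpha in [0, 1]: the weight on attention-based importance for this layer.
    0 = pure recency (sensory memory), 1 = pure frame anchors (long-term memory).
    The fractions and the linear ramp are illustrative assumptions."""
    shallow_end = int(num_layers * shallow_frac)
    deep_start = int(num_layers * (1.0 - deep_frac))
    if layer_idx < shallow_end:
        return 0.0                                   # shallow: keep what just happened
    if layer_idx >= deep_start:
        return 1.0                                   # deep: keep frame-level anchors
    # middle: shift gradually from recency toward attention as depth grows
    return (layer_idx - shallow_end) / max(1, deep_start - shallow_end)
```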

šŸž Hook: When you listen to a story, your brain highlights words like names and actions while downplaying fillers like ā€œum.ā€ 🄬 The Concept (Attention Mechanism): Attention lets the model score tokens by importance so it can focus on what matters for the task.

  • How it works:
    1. Look at all tokens in the cache.
    2. Score each token by relevance to the current focus.
    3. Weigh important tokens more when predicting the answer.
    4. Different layers develop different attention patterns (recency vs. anchors).
  • Why it matters: Without attention, the model would treat ā€œtheā€ and ā€œgoal scoredā€ equally, missing crucial events. šŸž Anchor: To answer ā€œWho kicked the ball into the goal?ā€, attention zooms in on the player and the scoring moment, not on the crowd.

Three analogies for the same idea:

  • Library analogy: New returns go on a sorting cart (shallow). Active course shelves hold key chapters (middle). The archives keep one copy of each book (deep).
  • Kitchen analogy: Freshly chopped veggies on the cutting board (shallow), ongoing pot on the stove (middle), recipe card summary on the fridge (deep).
  • Timeline analogy: Current minute-by-minute notes (shallow), a running timeline with key timestamps (middle), a chapter-per-scene outline (deep).

Before vs. After:

  • Before: Uniform token eviction across layers, extra retrieval at question time, and unstable latency/memory.
  • After: Layer-aware keeping/evicting of tokens, no retrieval at question time, stable low memory, and real-time answers.

Why it works (intuition): Decoder layers specialize. Shallow layers are great at perceiving what just happened, deep layers condense each frame into a compact ā€œanchor,ā€ and middle layers bridge the two. By aligning compression with these natural roles, the model preserves exactly what each layer needs, wasting fewer tokens and avoiding rework at question time.

Building blocks inside HERMES:

  • Importance scoring per layer: recency-based in shallow layers, attention-based in deep layers, and a blend in the middle.
  • Cross-layer memory smoothing so layers don’t end up with mismatched memories.
  • Position re-indexing to keep positional encodings healthy during long streams, without recomputing everything.
  • A fixed per-layer cache budget so GPU memory stays flat no matter how long the video gets.
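
For reference, these knobs could be collected in a small config like the sketch below; the field names are hypothetical, and the default values follow the examples used in this article (4K per-layer budget, 196 tokens per frame, 16-frame chunks, 0.5 fps).

```python
from dataclasses import dataclass

@dataclass
class HermesConfig:
    """Illustrative hyperparameters; names are hypothetical, values follow the article's examples."""
    cache_budget_per_layer: int = 4096   # fixed KV budget per decoder layer (e.g., 4K or 6K)
    tokens_per_frame: int = 196          # visual tokens per frame (LLaVA-OV-style)
    chunk_frames: int = 16               # frames processed per streaming chunk
    fps: float = 0.5                     # sampled frames per second
    smoothing_beta: float = 0.5          # cross-layer score smoothing strength (assumed value)
    lazy_reindex: bool = True            # lazy re-indexing for streaming, eager for offline
```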

03 Methodology

At a high level: Video stream → encode frames into tokens → hierarchical KV cache management (keep the right tokens per layer) → cross-layer memory smoothing (align memories) → position re-indexing (keep positions sane) → answer in real time.

Step A: Hierarchical KV Cache Management (what to keep/evict per layer)

  • What happens: As frames arrive (e.g., 0.5 fps, 196 tokens per frame in LLaVA-OV), tokens flow into each decoder layer’s KV cache. Once the cache hits its budget (e.g., 4K or 6K tokens per layer), HERMES scores each token’s importance and keeps only the Top-K per layer.
  • Why this step exists: If we don’t pick smartly, we either blow memory or drop the wrong tokens, hurting accuracy or latency.
  • How (layer-specific scoring):
    • Shallow layers (sensory memory): Use a recency score that decays with time (recent tokens score high). Think of Ebbinghaus’s forgetting curve—older tokens fade.
    • Deep layers (long-term memory): Use attention magnitude (with a generic guidance prompt as a stand-in for unpredictable questions) to find frame-level anchors—sparse peaks often every 196 tokens (one per frame).
    • Middle layers (working memory): Blend recency and attention with a layer-dependent weight that shifts gradually from recency to attention the deeper you go.
  • Example with numbers: Suppose we have a 4K token budget per layer, chunk size 16 frames, and we’ve streamed 256 frames (ā‰ˆ 50K tokens total at 196 tokens/frame). Shallow layers will keep mostly the last few chunks; deep layers will keep the strongest anchors (about one per frame), and middle layers will keep a mix.
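
Below is a hedged sketch of this per-layer scoring plus Top-K selection. The exponential recency decay, the softmax attention against a single guidance-prompt query, and the normalization that makes the two scores comparable are assumptions; the paper's exact formulas may differ.

```python
import torch

def layer_importance(keys: torch.Tensor, guidance_query: torch.Tensor,
                     alpha: float, decay_rate: float = 1e-3) -> torch.Tensor:
    """Score each cached token for one layer (illustrative, not the exact HERMES formula).
    keys: (num_tokens, head_dim) cached keys, oldest first.
    guidance_query: (head_dim,) query from a generic guidance prompt (stand-in for the future question).
    alpha: blend weight for this layer, 0 = recency only, 1 = attention only."""
    n = keys.shape[0]
    # Recency score: newest tokens score highest, older ones decay (forgetting-curve style).
    age = torch.arange(n - 1, -1, -1).to(keys.dtype)         # 0 for the newest token
    recency = torch.exp(-decay_rate * age)
    recency = recency / recency.sum()                         # normalize so both scores are comparable
    # Attention score: similarity of each cached key to the guidance query.
    attn = torch.softmax(keys @ guidance_query / keys.shape[-1] ** 0.5, dim=0)
    return (1.0 - alpha) * recency + alpha * attn

def select_top_k(scores: torch.Tensor, budget: int) -> torch.Tensor:
    """Indices of the tokens to keep, returned in temporal order."""
    k = min(budget, scores.shape[0])
    return torch.sort(torch.topk(scores, k).indices).values
```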

šŸž Hook: Imagine classmates copying answers from different pages if their notebooks don’t match—chaos! 🄬 The Concept (Cross-Layer Memory Smoothing): Cross-layer memory smoothing gently passes importance from deeper layers to shallower ones so layers don’t disagree about which positions matter.

  • How it works:
    1. Compute raw importance per layer.
    2. Blend each layer’s scores with the next deeper layer via a smoothing factor.
    3. Select Top-K tokens after smoothing to keep alignment.
  • Why it matters: Without smoothing, the same position might be kept deep but dropped shallow, breaking helpful cross-layer interactions. šŸž Anchor: Like using a class-wide study guide so everyone highlights the same key facts, not random ones.
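
A minimal sketch of the smoothing step, assuming every layer currently caches the same token positions. Blending from the deepest layer downward with a single factor `beta` is an assumption about the exact recurrence.

```python
import torch

def smooth_scores(raw_scores: list[torch.Tensor], beta: float = 0.5) -> list[torch.Tensor]:
    """Blend each layer's importance scores with the next deeper layer's (illustrative).
    raw_scores[l] holds one score per cached token position at layer l; beta is an assumed factor."""
    num_layers = len(raw_scores)
    smoothed = list(raw_scores)                      # copy; the deepest layer keeps its own scores
    for l in range(num_layers - 2, -1, -1):          # walk from deep to shallow
        smoothed[l] = (1.0 - beta) * raw_scores[l] + beta * smoothed[l + 1]
    return smoothed
```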

Secret sauce inside Step A: Summary tokens in deep layers. When deep layers evict many old tokens, HERMES averages their values and phase-aligns their keys (using a RoPE delta) into a single summary token, preserving a compact memory of far history.

šŸž Hook: When you clean your room, you don’t throw away your old certificates—you put them into a single folder. 🄬 The Concept (Summary Tokens): A summary token compresses many evicted deep-layer tokens into one phase-aligned placeholder.

  • How it works:
    1. Phase-align keys to a target position (so RoPE phases line up).
    2. Average aligned keys and values.
    3. Insert the single summary token back into the cache.
  • Why it matters: Without summaries, you lose long-term clues when you run out of space. šŸž Anchor: Like turning a pile of worksheets into one neat summary page you can keep.
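
Here is a hedged sketch of building one summary token: rotate the evicted keys by a RoPE phase delta so they all sit at one target position, then average keys and values. The RoPE dimension pairing and the choice of target position are assumptions.

```python
import torch

def rope_rotate(x: torch.Tensor, delta_pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate already-RoPE'd vectors by a position delta (illustrative RoPE; pairing is an assumption).
    x: (num_tokens, head_dim); delta_pos: (num_tokens,) = target_position - original_position."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).to(x.dtype) / d))     # (d/2,)
    angles = delta_pos[:, None].to(x.dtype) * inv_freq[None, :]            # (n, d/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                                        # paired even/odd dims
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def make_summary_token(evicted_keys: torch.Tensor, evicted_values: torch.Tensor,
                       original_pos: torch.Tensor, target_pos: int):
    """Collapse many evicted deep-layer tokens into one phase-aligned summary token (sketch)."""
    delta = target_pos - original_pos                    # shift every key to the same target slot
    aligned_keys = rope_rotate(evicted_keys, delta)
    return aligned_keys.mean(dim=0), evicted_values.mean(dim=0)
```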

Step B: Position Re-Indexing (keep positional encodings healthy)

  • What happens: As we prune and keep tokens over a long stream, positions can get huge and gappy. Re-indexing remaps kept tokens to a compact contiguous range while correcting rotary positions so we can still reuse cached keys.
  • Why this step exists: Positional indices growing too large can hurt generation quality; gaps can misalign RoPE phases; recomputing keys would be expensive.
  • Two strategies:
    • Lazy re-indexing (best for streaming): Only re-index when you approach limits; keeps overhead low and preserves recent position stability.
    • Eager re-indexing (best for offline): Re-index at each compression for perfectly compact positions; costs more compute but stabilizes very long-range semantics.
  • Example: With a fixed text prefix and a moving video region, we left-compact video positions and apply a RoPE delta to adjust keys without recomputing them.

šŸž Hook: Think of street addresses: if houses get demolished randomly, address numbers become messy and confusing. 🄬 The Concept (Position Re-Indexing): Re-assign token positions to a tidy, gap-free range and correct rotary phases so attention still works.

  • How it works:
    1. Keep system text positions fixed.
    2. Left-compact the kept video token positions.
    3. Apply a rotary phase delta to cached keys to match new positions.
  • Why it matters: Without re-indexing, positions drift, attention misfires, and answers degrade over time. šŸž Anchor: Like renumbering houses on a block after renovations so mail gets delivered correctly again.
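
A sketch of the lazy variant: keep the text prefix positions fixed, left-compact the kept video positions, and phase-correct the cached keys for the position shift. It reuses the hypothetical `rope_rotate` helper from the summary-token sketch above; the exact layout details are assumptions.

```python
import torch

def reindex_positions(kept_video_pos: torch.Tensor, prefix_len: int,
                      cached_keys: torch.Tensor):
    """Left-compact kept video token positions and phase-correct their cached keys (sketch).
    kept_video_pos: (n,) original absolute positions of the kept video tokens (after the text prefix).
    prefix_len: number of fixed system/text tokens whose positions stay untouched.
    cached_keys: (n, head_dim) RoPE'd keys for the kept video tokens."""
    n = kept_video_pos.shape[0]
    new_pos = prefix_len + torch.arange(n)                 # contiguous, gap-free positions
    delta = new_pos - kept_video_pos                       # shift applied to each token
    corrected_keys = rope_rotate(cached_keys, delta)       # RoPE helper sketched above
    return new_pos, corrected_keys
```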

Putting it together:

  • Input: A video stream split into chunks (e.g., 16 frames per chunk).
  • Pipeline: Encode frames → insert tokens into each layer’s KV cache → if over budget, score and keep tokens per layer (plus deep summaries) → smooth across layers → re-index positions when needed → answer instantly when asked.
  • Secret sauce: All of this happens training-free and without any extra retrieval or offloading at question time, so TTFT stays tiny and memory stays flat.
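
Stringing the sketches together, one streaming step might look like the loop below. It relies on the hypothetical helpers sketched earlier (`LayerKVCache`, `recency_attention_blend`, `layer_importance`, `smooth_scores`, `select_top_k`, `HermesConfig`); obtaining per-layer keys/values from the model's forward pass, batching, and the re-indexing trigger are omitted.

```python
def process_chunk(caches, chunk_kv_per_layer, cfg, guidance_query):
    """One streaming step (sketch): insert a chunk's tokens, then compress each layer if over budget."""
    num_layers = len(caches)
    for layer_cache, (k, v) in zip(caches, chunk_kv_per_layer):
        layer_cache.append(k, v)                           # new frame tokens enter every layer
    if any(layer_cache.over_budget() for layer_cache in caches):
        alphas = [recency_attention_blend(l, num_layers) for l in range(num_layers)]
        raw = [layer_importance(c.keys, guidance_query, a) for c, a in zip(caches, alphas)]
        for layer_cache, scores in zip(caches, smooth_scores(raw, cfg.smoothing_beta)):
            layer_cache.keep(select_top_k(scores, cfg.cache_budget_per_layer))
```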

04 Experiments & Results

The test: Researchers measured accuracy on streaming and offline video QA benchmarks, plus speed and memory. Key speed metrics were Time To First Token (TTFT)—how fast the first answer token appears after a question—and Time Per Output Token (TPOT). They also tracked peak GPU memory usage as videos got longer.

šŸž Hook: Imagine a buzzer-beater quiz: speed and correctness both count. 🄬 The Concept (TTFT): TTFT is how long it takes to start answering after you ask a question.

  • How it works:
    1. Time starts when the user asks.
    2. The model reuses its compact cache—no extra retrieval.
    3. The first token appears in under ~30 ms in tests.
  • Why it matters: If TTFT is slow, the system feels laggy even if later tokens are fast. šŸž Anchor: Like pressing a light switch and seeing the bulb turn on immediately instead of waiting.
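
Measuring TTFT is just a stopwatch around the first generated token; `answer_stream` below is a hypothetical stand-in for a model's streaming generation API, not a real library call.

```python
import time

def measure_ttft_ms(answer_stream, question: str) -> float:
    """Time (ms) from asking a question to receiving the first answer token.
    answer_stream is assumed to be a function returning a generator that yields tokens as produced."""
    start = time.perf_counter()
    _first_token = next(answer_stream(question))     # returns as soon as the first token arrives
    return (time.perf_counter() - start) * 1000.0
```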

Benchmarks and competitors:

  • Streaming: StreamingBench (real-time), OVO-Bench (real-time and backward tracing), RVS-Ego and RVS-Movie (open-ended streaming QA).
  • Offline: MVBench (short), EgoSchema and VideoMME (long).
  • Baselines: Base models (e.g., LLaVA-OV-7B, Qwen2.5-VL-7B/32B) and training-free methods like ReKV, LiveVLM, StreamMem.

Scoreboard with context:

  • StreamingBench + OVO-Bench:
    • LLaVA-OV-7B base averaged about 53.35%. With HERMES (4K tokens, 0.5 fps), the average rose to 58.27%—like moving from a B- to a solid B+/A- among peers.
    • Qwen2.5-VL-7B base averaged ~52.28%. With HERMES (4K tokens, 1 fps), the average jumped to 59.21%—a sizable boost over the base model and prior training-free approaches.
    • Qwen2.5-VL-32B also improved with HERMES (up to ~64.82% average), showing scalability to larger models.
  • RVS-Ego and RVS-Movie (open-ended): HERMES improved accuracy by up to 11.4% over uniformly sampled frames, surpassing other training-free methods.
  • Offline long-video (VideoMME, EgoSchema): With limited tokens, HERMES matched or beat base models. For example, on VideoMME with LLaVA-OV-7B, HERMES reached 58.85% (4K tokens), beating the base model’s 57.67% despite using fewer tokens.

Efficiency highlights:

  • TTFT stayed under ~30 ms across different numbers of input frames and chunk sizes; GPU memory remained flat because the per-layer budget was fixed.
  • Compared to prior SOTA, HERMES achieved up to 10Ɨ faster TTFT (e.g., vs. StreamingTOM) and reduced peak memory compared to LiveVLM.

Surprising findings:

  • Even at 4K tokens per layer, performance stabilized for streaming tasks—meaning many of the original video tokens were redundant.
  • Backward tracing (answering about earlier scenes) improved thanks to deep-layer anchor tokens and summary tokens.
  • Lazy re-indexing worked best for streaming (saving compute while keeping positions stable), while eager re-indexing helped offline long-range reasoning.

Bottom line: HERMES delivered both speed and accuracy. It kept memory steady, answered fast without retrieval, and held up across different models and video lengths.

05 Discussion & Limitations

Limitations:

  • HERMES relies on the observed layer specializations (recency in shallow, anchors in deep). If a future model organizes information very differently, the same heuristics may need adjustment.
  • The guidance prompt used to estimate deep attention (for unpredictable queries) might not perfectly match a user’s eventual question in niche cases.
  • Summary tokens are lossy compressions; extremely fine details far in the past may be blurred.
  • Hyperparameters (cache budgets, smoothing strengths, layer partitions) may need tuning across models and domains.
  • Audio or subtitle-heavy tasks were not the main focus here; integrating multi-modality summaries may require extensions.

Required resources:

  • A single modern GPU (e.g., A800, H200) can run HERMES with FP16 and fixed cache budgets; no extra servers or databases are needed at question time.
  • A compatible MLLM backbone (e.g., LLaVA-OV, Qwen2.5-VL) and its vision encoder are required.

When NOT to use:

  • If your application demands proactive planning with external knowledge bases at query time, or very heavy audio/text fusion not summarized visually.
  • If you can afford long pre-processing and want the absolute best offline-only accuracy, eager re-indexing plus task-specific training may outperform training-free setups.
  • If the stream contains ultra-rapid micro-events that must all be preserved, an extremely small cache budget could miss some.

Open questions:

  • Can we learn the layer partitions and smoothing weights automatically, per model and per domain?
  • How can we better generate pseudo-queries for deep attention in a truly query-agnostic way?
  • Can summary tokens be made adaptive (e.g., multiple summaries per storyline or per object)?
  • How does this approach extend to richer multimodal anchors (audio, subtitles) and to higher frame rates/resolutions?
  • Could external memory be added as a slow, optional tier without hurting TTFT when not used?

06 Conclusion & Future Work

In three sentences: HERMES treats the model’s KV cache as a layered memory—recent details in shallow layers, a working blend in the middle, and long-term frame anchors in deep layers—to keep the right tokens at the right depths. With cross-layer smoothing, summary tokens, and smart position re-indexing, it delivers real-time answers without retrieval, flat GPU memory, and better accuracy—even with up to 68% fewer video tokens. It works across different open-source models and sizes, achieving up to 10Ɨ faster TTFT than prior training-free methods.

Main achievement: A training-free, plug-and-play, hierarchical memory management strategy that turns the KV cache itself into an efficient streaming memory, providing both stability and speed.

Future directions: Automate layer partitioning and smoothing, enrich deep anchors with audio/subtitle signals, learn adaptive budgets, and fuse optional slow external memory as a fallback tier. Also, explore robustness to rapid camera motion and higher frame rates.

Why remember this: HERMES shows that the model’s own cache can serve as an effective, hierarchical memory for streaming video—no retraining, no last-second retrieval—unlocking practical, low-latency video assistants that scale to long, unpredictable streams.

Practical Applications

  • Live sports commentary: instantly identify scorers, fouls, and key plays during a match.
  • Classroom demonstrations: answer students’ questions about ongoing science experiments in real time.
  • Security monitoring: quickly detect and recall earlier incidents without pausing the stream.
  • Wearable assistants: help users find objects or recall steps during daily activities from egocentric video.
  • Customer support for livestream shopping: describe products and compare items as the host presents them.
  • Robotics teleoperation: summarize recent actions while remembering critical waypoints and hazards.
  • Traffic cams: report accidents and track events as they unfold, with quick backward tracing.
  • Live event captioning: provide low-latency descriptions of who is speaking and what’s happening on stage.
  • Healthcare observation: summarize key patient activities during long monitoring sessions while staying responsive.
  • Esports analysis: explain strategies and key moments in real time, with instant recall of earlier plays.
#HERMES Ā· #KV cache Ā· #hierarchical memory Ā· #streaming video understanding Ā· #real-time inference Ā· #token compression Ā· #attention mechanism Ā· #position re-indexing Ā· #cross-layer memory smoothing Ā· #summary tokens Ā· #TTFT Ā· #LLM multimodal Ā· #LLaVA-OneVision Ā· #Qwen2.5-VL Ā· #anchor tokens