Papers (4)

#KV cache compression

FASA: Frequency-aware Sparse Attention

FASA is a training-free method that makes large language models faster and lighter on memory by keeping only the most useful past tokens during decoding.

#FASA #Frequency-aware sparse attention #KV cache compression

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Intermediate

Dvir Samuel, Issar Tzachor et al. · Feb 2 · arXiv

The paper makes long video generation much faster and lighter on memory by compressing the temporal cache and applying sparse attention, which cuts redundant work in attention.

#autoregressive video diffusion #KV cache compression #sparse attention

Efficient Autoregressive Video Diffusion with Dummy Head

Intermediate

Hang Guo, Zhaoyang Jia et al. · Jan 28 · arXiv

This paper finds that about one in four attention heads in autoregressive video diffusion models attends mostly to the current frame and almost ignores past frames, which wastes memory and time.

#autoregressive video diffusion #multi-head self-attention #KV cache compression

Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Intermediate

Jang-Hyun Kim, Dongyoon Han et al. · Jan 25 · arXiv

Fast KVzip is a new way to shrink an LLM’s memory (the KV cache) while keeping answers just as accurate.

#KV cache compression #gated KV eviction #sink attention
