How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (4)

#KV cache compression

FASA: Frequency-aware Sparse Attention

Intermediate
Yifei Wang, Yueqi Wang et al. · Feb 3 · arXiv

FASA is a training-free method that makes large language models faster and lighter on memory by keeping only the most useful past tokens during decoding.

#FASA #Frequency-aware sparse attention #KV cache compression
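The summary leaves the scoring rule implicit; as a rough sketch of the frequency-aware idea (my reading, not the paper's algorithm), the cache can track how much attention each stored token has accumulated across decoding steps and evict the least-used entries once a budget is exceeded. All names here (`freq`, `budget`, `decode_step`) are illustrative:

```python
import torch

def evict_kv(keys, values, freq, budget):
    # Keep only the `budget` most frequently attended KV entries.
    if keys.shape[0] <= budget:
        return keys, values, freq
    keep = torch.topk(freq, budget).indices.sort().values  # preserve temporal order
    return keys[keep], values[keep], freq[keep]

def decode_step(q, keys, values, freq, budget=512):
    # Attend over the cache, fold this step's attention weights into each
    # token's running "usefulness" score, then evict if over budget.
    attn = torch.softmax(q @ keys.T / keys.shape[-1] ** 0.5, dim=-1)  # [T]
    out = attn @ values
    freq = freq + attn
    keys, values, freq = evict_kv(keys, values, freq, budget)
    return out, keys, values, freq

# toy usage: a 600-token cache squeezed to a 512-entry budget
d, T = 64, 600
q, keys, values = torch.randn(d), torch.randn(T, d), torch.randn(T, d)
out, keys, values, freq = decode_step(q, keys, values, torch.zeros(T))
```

Because eviction only reads accumulated attention statistics, a scheme like this needs no retraining, which matches the "training-free" claim in the summary.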

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Intermediate
Dvir Samuel, Issar Tzachor et al. · Feb 2 · arXiv

The paper makes long video generation much faster and lighter on memory by compressing the temporal KV cache and sparsifying attention, cutting out redundant work on content that repeats across frames.

#autoregressive video diffusion #KV cache compression #sparse attention
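One plausible form of temporal cache compression (an assumption on my part, not the paper's method): since consecutive frames repeat content, many cached keys are near-duplicates, so the cache can keep a token only when it is sufficiently novel. The threshold `tau` below is illustrative:

```python
import torch
import torch.nn.functional as F

def compress_temporal_cache(keys, values, tau=0.95):
    # Drop cached tokens whose key is nearly identical (by cosine
    # similarity) to an already-kept one, exploiting frame-to-frame
    # redundancy in video.
    k = F.normalize(keys, dim=-1)
    kept = [0]                                  # always keep the first token
    for i in range(1, keys.shape[0]):
        if (k[i] @ k[kept].T).max() < tau:      # keep only if novel enough
            kept.append(i)
    idx = torch.tensor(kept)
    return keys[idx], values[idx]
```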

Efficient Autoregressive Video Diffusion with Dummy Head

Intermediate
Hang Guo, Zhaoyang Jia et al. · Jan 28 · arXiv

This paper finds that roughly one in four attention heads in autoregressive video diffusion models attends almost exclusively to the current frame and nearly ignores the past, so caching those heads' history wastes memory and time.

#autoregressive video diffusion #multi-head self-attention #KV cache compression
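The finding suggests a simple diagnostic, sketched below under my own assumptions (the function name and the 5% threshold are illustrative): measure how much attention mass each head puts on past-frame tokens, and skip caching KV history for heads below the threshold.

```python
import torch

def find_dummy_heads(attn, cur_frame_len, thresh=0.05):
    # attn: [heads, q_len, kv_len] attention weights for one layer; the
    # last `cur_frame_len` KV positions belong to the current frame.
    # A head whose mass on past frames is tiny is effectively "dummy":
    # its past KV cache can be dropped with little effect.
    past_mass = attn[..., :-cur_frame_len].sum(-1).mean(-1)  # [heads]
    return past_mass < thresh   # True => drop this head's past KV cache

# toy usage: 8 heads, 16 queries, 64 cached positions
attn = torch.rand(8, 16, 64)
attn = attn / attn.sum(-1, keepdim=True)   # normalize rows to valid attention
print(find_dummy_heads(attn, cur_frame_len=16))
```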

Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction

Intermediate
Jang-Hyun Kim, Dongyoon Han et al. · Jan 25 · arXiv

Fast KVzip is a new way to shrink an LLM's memory (the KV cache) while keeping answers just as accurate.

#KV cache compression #gated KV eviction #sink attention
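The card names "gated KV eviction" without detail; here is one hypothetical reading, not the paper's actual design: a small gate scores each cached key and entries scoring below a threshold are dropped, while attention-sink tokens (the first positions, per the #sink attention tag) are always retained. The gate's form and all parameters below are assumptions:

```python
import torch

class GatedKVEviction(torch.nn.Module):
    # Hypothetical sketch: a tiny linear gate scores each cached key, and
    # low-scoring entries are evicted. In practice such a gate would be
    # trained or calibrated; here it is randomly initialized for illustration.
    def __init__(self, d_model, thresh=0.5, n_sinks=4):
        super().__init__()
        self.gate = torch.nn.Linear(d_model, 1)
        self.thresh = thresh
        self.n_sinks = n_sinks

    def forward(self, keys, values):
        score = torch.sigmoid(self.gate(keys)).squeeze(-1)  # [T]
        keep = score >= self.thresh
        keep[: self.n_sinks] = True   # retain attention-sink tokens
        keep[-1] = True               # never evict the newest token
        return keys[keep], values[keep]
```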