FASA is a training-free method that makes large language models faster and lighter on memory by keeping only the most useful past tokens during decoding.
The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.
This paper finds that about one in four attention heads in autoregressive video diffusion models attends almost entirely to the current frame and nearly ignores past frames, which wastes memory and time.
Fast KVzip is a new way to shrink an LLM's memory (the KV cache) while keeping answers just as accurate.