Long texts make language models slow because, for every new word they write, they must store and re-read a growing memory called the KV cache, which holds an entry for every word seen so far.
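A minimal sketch of that growing memory, using NumPy and made-up toy values (the function and variable names here are illustrative, not from any real model): each generation step appends one key/value row to the cache, and attention then re-reads the whole cache.

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
rng = np.random.default_rng(0)
K_cache = np.empty((0, d))   # cached keys, one row per past token
V_cache = np.empty((0, d))   # cached values

for step in range(100):                  # generate 100 tokens
    k, v, q = rng.normal(size=(3, d))    # new token's key, value, query
    K_cache = np.vstack([K_cache, k])    # the cache grows by one row per token...
    V_cache = np.vstack([V_cache, v])
    out = attend(q, K_cache, V_cache)    # ...and every step re-reads all of it

print(K_cache.shape)                     # one cached entry per generated token
```

The cost of each step therefore grows with everything written so far, which is why long texts get slow and memory-hungry.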
The paper makes long video generation much faster and lighter on memory by cutting out repeated work in attention.
Videos are made of very long lists of tokens, and regular attention compares every pair of tokens, so the cost grows with the square of the video's length, which quickly becomes slow and expensive.
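A toy sketch of that quadratic cost (the function name and sizes are illustrative, not from the paper): full attention builds an n-by-n score matrix, so doubling the number of tokens quadruples the pairwise work.

```python
import numpy as np

def full_attention(X):
    # Every token attends to every other token: an n x n score matrix.
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                    # n*n pairwise comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X

rng = np.random.default_rng(0)
for n in (1_000, 2_000):                 # doubling the token count...
    X = rng.normal(size=(n, 64))
    out = full_attention(X)
    print(n, n * n)                      # ...quadruples the pairwise work
```

For a long video with hundreds of thousands of tokens, that n-squared matrix is what makes naive attention impractical.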
This paper fixes a common problem in video-generation models where tiny mistakes snowball over time and ruin long videos.