A video turns into a very long list of tokens, and regular attention compares every pair of tokens, so its cost grows with the square of the list length and quickly becomes slow and expensive.
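To put rough numbers on that cost, here is a minimal sketch that counts how many token pairs get scored. The function names and the causal-window setup are illustrative assumptions, not taken from VideoSSM itself.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Count (query, key) pairs scored when every token attends to every token."""
    return n_tokens * n_tokens


def windowed_attention_pairs(n_tokens: int, window: int) -> int:
    """Count pairs when each token attends only to itself and the
    previous (window - 1) tokens. This is a hypothetical causal window."""
    return sum(min(i + 1, window) for i in range(n_tokens))


if __name__ == "__main__":
    for n in (1_000, 10_000):
        print(n, full_attention_pairs(n), windowed_attention_pairs(n, window=100))
```

Making the sequence ten times longer multiplies the full-attention count by one hundred, while the windowed count grows only about tenfold. That gap is why a bounded window is attractive for long videos.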
VideoSSM is a new way to make long, stable, and lively videos by giving the model two kinds of memory: a short-term window and a long-term state-space memory.
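The two-memory idea can be sketched as a toy data structure. Everything here is an assumption made for illustration: the class name `HybridMemory`, the rule that tokens leaving the window are folded into the long-term state, and the simple decay update (a diagonal linear recurrence standing in for a real state-space layer). It is not the VideoSSM implementation.

```python
from collections import deque
from typing import List, Optional, Tuple


class HybridMemory:
    """Toy memory: exact recent tokens plus a fixed-size running summary."""

    def __init__(self, window: int, decay: float = 0.9) -> None:
        self.max_window = window          # how many recent tokens to keep exactly
        self.decay = decay                # how strongly the old summary is kept
        self.window: deque = deque()      # short-term memory
        self.state: Optional[List[float]] = None  # long-term memory

    def write(self, token: List[float]) -> None:
        """Store a new token; fold the oldest one into the state if the window is full."""
        self.window.append(token)
        if len(self.window) > self.max_window:
            evicted = self.window.popleft()
            if self.state is None:
                self.state = [0.0] * len(evicted)
            # Linear recurrence: state = decay * state + (1 - decay) * evicted
            self.state = [
                self.decay * s + (1.0 - self.decay) * x
                for s, x in zip(self.state, evicted)
            ]

    def read(self) -> Tuple[List[List[float]], Optional[List[float]]]:
        """Return the recent tokens and the compressed long-term state."""
        return list(self.window), self.state
```

In this sketch the short-term window holds a fixed number of tokens no matter how long the video gets, and the long-term state stays the same size too, so memory use does not grow with video length.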