Video generators are slow largely because of attention: every token attends to every other token, so the cost grows quadratically with sequence length, and a video clip flattens into a very long token sequence.
Transformers in general slow down on long inputs for the same reason: standard self-attention scores all O(n²) token pairs, which dominates both compute and memory as n grows.
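To make the quadratic cost concrete, here is a minimal sketch of standard (naive) self-attention in NumPy. The function name and shapes are illustrative, not from the source; the point is that the intermediate score matrix is (n, n), so both compute and memory scale with the square of the sequence length.

```python
import numpy as np

def naive_attention(q, k, v):
    # q, k, v: (n, d) arrays. The score matrix below is (n, n):
    # every token is compared with every other token, which is
    # why cost grows quadratically in the sequence length n.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (n, n)
    # Numerically stable row-wise softmax over the scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                               # (n, d)

n, d = 1024, 64  # hypothetical sizes for illustration
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = naive_attention(q, k, v)
print(out.shape)  # the output is (n, d), but the hidden (n, n)
                  # score matrix holds n*n = 1,048,576 entries
```

Doubling n doubles the output size but quadruples the score matrix, which is the bottleneck this passage refers to.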