Videos are made of very long lists of tokens, and regular attention compares every pair of tokens, so its cost grows with the square of the list length, which makes it slow and expensive.
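To see why comparing every pair of tokens gets expensive, here is a minimal sketch of regular (full) attention in plain Python. It is only an illustration, not the method of any particular paper: the function name `full_attention` and the `pair_count` counter are hypothetical, and the softmax-over-scaled-dot-products form is the standard textbook one, assumed here for concreteness.

```python
import math


def full_attention(queries, keys, values):
    """Naive full attention over lists of float vectors (illustrative only).

    Returns the attended outputs and the number of query-key pairs scored.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    pair_count = 0  # hypothetical counter: one increment per token pair

    for q in queries:
        # Score this query against every key: this inner loop is what
        # makes the total work grow with (number of tokens) squared.
        scores = []
        for k in keys:
            scores.append(scale * sum(qi * ki for qi, ki in zip(q, k)))
            pair_count += 1

        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)

        # Weighted average of the value vectors.
        out = [0.0] * len(values[0])
        for w, v in zip(weights, values):
            for j, vj in enumerate(v):
                out[j] += (w / total) * vj
        outputs.append(out)

    return outputs, pair_count
```

Calling it on 4 tokens scores 16 pairs, and on 8 tokens it scores 64, so doubling the list length quadruples the work. For a video with many thousands of tokens, that growth is what makes regular attention slow.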
VINO is a single AI model that can make and edit both images and videos by following text instructions and looking at reference pictures and clips at the same time.