The paper tackles a central problem in long video generation: models either forget what happened earlier or slowly drift off-topic as the video gets longer.
Most image-similarity tools notice only surface features (color, shape, object class) and miss deeper, human-like connections.