Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).
The paper asks a simple question: what must a vision model's internal pictures (embeddings) look like if it can recognize new mixes of things it already knows?
The paper tackles a big problem in long video generation: models either forget what happened earlier or slowly drift off-topic over time.
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.