The paper asks a simple question: what must a vision model's internal pictures (embeddings) look like for it to recognize new mixes of things it already knows?
Similarity-based image–text models like CLIP can be fooled by "half-truths": adding one plausible but wrong detail to a caption can make it score as more similar to the image, not less.
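To make that failure mode concrete, here is a minimal sketch (not from the paper) that scores one image against a faithful caption and a "half-truth" caption using the Hugging Face CLIP API; the blank stand-in image and both captions are hypothetical, and whether the half-truth actually wins depends on the image.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), "white")  # stand-in; use a real photo in practice
captions = [
    "a dog running on the beach",                      # faithful caption
    "a dog running on the beach carrying a red ball",  # adds one plausible but wrong detail
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # scaled image-text cosine similarities
print(dict(zip(captions, logits[0].tolist())))
# The "half-truth" failure: the partly wrong caption can come out MORE similar.
```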
WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those predictions to pick better actions.
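As a hypothetical PyTorch sketch of that idea, not WoG's actual architecture: a small world model predicts a compact future feature, and the policy conditions on both the current observation and that prediction (all module names and sizes below are assumptions).

```python
import torch
import torch.nn as nn

OBS_DIM, FUT_DIM, ACT_DIM = 128, 32, 7  # assumed sizes

class WorldModel(nn.Module):
    """Predicts a compact feature summarizing the relevant near future."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, FUT_DIM))
    def forward(self, obs):
        return self.net(obs)

class FutureConditionedPolicy(nn.Module):
    """Picks actions from the current observation plus the imagined future."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM + FUT_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))
    def forward(self, obs, future):
        return self.net(torch.cat([obs, future], dim=-1))

world_model, policy = WorldModel(), FutureConditionedPolicy()
obs = torch.randn(1, OBS_DIM)   # current observation features
future = world_model(obs)       # imagine the relevant bits of the near future
action = policy(obs, future)    # choose an action using that prediction
```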
Robots often act like goldfish with short memories; HiF-VLA fixes this by letting them use motion to remember the past and predict the future.
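A toy illustration of motion-as-memory, assuming nothing about HiF-VLA's real design: frame differences stand in for motion features, a pooled motion summary serves as cheap memory of the past, and a small head extrapolates future motion for the action head to use.

```python
import torch
import torch.nn as nn

H = W = 64  # assumed frame size

frames = torch.randn(5, 3, H, W)           # short history of RGB frames
motion = frames[1:] - frames[:-1]          # crude motion features: frame differences

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * H * W, 64))
past_motion = encoder(motion).mean(dim=0)  # compact memory of how things moved
current = encoder(frames[-1:]).squeeze(0)  # features of the latest frame

predict_future_motion = nn.Linear(64 + 64, 64)  # look-ahead head (hypothetical)
pick_action = nn.Linear(64 + 64 + 64, 7)        # action head: current + past + future

future = predict_future_motion(torch.cat([current, past_motion]))
action = pick_action(torch.cat([current, past_motion, future]))
```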
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
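Here is a minimal sketch of that compression step under assumed names and sizes (the paper's actual layer may differ): a few learned query tokens cross-attend once over frozen vision-encoder features, producing a short, low-dimensional code for a generator to condition on.

```python
import torch
import torch.nn as nn

feat_dim, num_feats = 1024, 256  # e.g., patch features from a large vision encoder
code_dim, code_len = 64, 16      # the small, easy-to-use code

queries = nn.Parameter(torch.randn(code_len, feat_dim))  # learned query tokens
attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=8, batch_first=True)
proj = nn.Linear(feat_dim, code_dim)

features = torch.randn(1, num_feats, feat_dim)  # frozen encoder output (one image)
q = queries.unsqueeze(0).expand(1, -1, -1)      # one set of queries per image
compressed, _ = attn(q, features, features)     # the single attention layer
code = proj(compressed)                         # (1, 16, 64) code for generation
print(code.shape)
```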