Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.
This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patchβs embedding in an image sequence, just like language models predict the next word.