Robots used to copy actions from videos without truly understanding how the world changes, so they often messed up long, multi-step jobs.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch's embedding in an image sequence, just like language models predict the next word.
This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.