Robots used to copy actions from videos without truly understanding how the world changes, so they often messed up long, multi-step jobs.
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.