The paper turns video avatars from passive puppets into active doers that can plan, act, check their own work, and fix mistakes over many steps.
Robots that follow spoken instructions used to be slow and jerky because one big model tried to think and move at the same time.