The paper argues that an AI that truly understands and simulates the real world must be consistent in three ways at once: across different senses (modal), across 3D space (spatial), and across time (temporal).
This paper shows how a video generator can improve its own videos during sampling, without extra training or outside checkers.