This paper shows a simple way to turn any strong autoregressive model (which decodes step by step) into a diffusion vision-language model (which decodes in parallel, block by block) without changing the architecture.
ProPhy is a new two-step method that helps video generation models follow real-world physics rather than just produce visually appealing footage.