The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
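To make the "borrowing motion knowledge" idea concrete, here is a minimal sketch of one way such supervision could look: a frozen video tracker scores the motion in both real and generated clips, and the generator is trained to make those motion readouts agree. The module names (`ToyVideoGenerator`, `ToyTracker`) and the L1 motion-matching loss are illustrative assumptions, not the paper's actual method.

```python
# Hedged sketch: distilling motion signals from a frozen tracker into a
# video generator. Everything here is a toy stand-in, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoGenerator(nn.Module):
    """Maps a conditioning vector to a short clip of shape (B, T, C, H, W)."""
    def __init__(self, t=8, c=3, h=32, w=32):
        super().__init__()
        self.t, self.c, self.h, self.w = t, c, h, w
        self.net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                                 nn.Linear(256, t * c * h * w))

    def forward(self, z):
        return self.net(z).view(-1, self.t, self.c, self.h, self.w)

class ToyTracker(nn.Module):
    """Stand-in for a strong, frozen video tracker: produces per-frame 2D
    displacements for a small set of query points, shape (B, T, N, 2)."""
    def __init__(self, n_points=16):
        super().__init__()
        self.n_points = n_points
        self.head = nn.Conv3d(3, 2 * n_points, kernel_size=1)

    def forward(self, video):                              # (B, T, C, H, W)
        feat = self.head(video.transpose(1, 2))            # (B, 2N, T, H, W)
        tracks = feat.mean(dim=(-1, -2))                    # pool space -> (B, 2N, T)
        return tracks.transpose(1, 2).reshape(video.size(0), video.size(1), -1, 2)

def motion_distillation_loss(generator, tracker, z, real_video):
    """Make the tracker's motion readout on generated video match its
    readout on real video; only the generator receives gradients."""
    fake_video = generator(z)
    with torch.no_grad():
        target_tracks = tracker(real_video)
    pred_tracks = tracker(fake_video)
    return F.l1_loss(pred_tracks, target_tracks)

if __name__ == "__main__":
    gen, trk = ToyVideoGenerator(), ToyTracker()
    for p in trk.parameters():
        p.requires_grad_(False)            # the tracker's knowledge is only borrowed
    z = torch.randn(2, 64)
    real = torch.randn(2, 8, 3, 32, 32)
    loss = motion_distillation_loss(gen, trk, z, real)
    loss.backward()                         # gradients flow into the generator only
    print(float(loss))
```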
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
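The sketch below illustrates the planning idea under stated assumptions, not MetaCanvas itself: a planner module stands in for the MLLM and writes a coarse plan onto a low-resolution latent grid, and a toy denoiser then conditions each latent patch on the plan cell covering it, so the plan is followed patch by patch. The names `ToyPlanner` and `PatchDenoiser` are invented here for illustration.

```python
# Hedged sketch: an MLLM-style planner drawing a coarse plan on the latent
# canvas, consumed patch by patch by a toy diffusion denoiser.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPlanner(nn.Module):
    """Stand-in for the MLLM: turns a prompt embedding into a coarse plan
    grid of shape (B, Dp, Hp, Wp) drawn on the latent canvas."""
    def __init__(self, prompt_dim=128, plan_dim=64, grid=4):
        super().__init__()
        self.grid, self.plan_dim = grid, plan_dim
        self.proj = nn.Linear(prompt_dim, plan_dim * grid * grid)

    def forward(self, prompt_emb):
        plan = self.proj(prompt_emb)
        return plan.view(-1, self.plan_dim, self.grid, self.grid)

class PatchDenoiser(nn.Module):
    """Toy denoiser over latent patches: each patch token is concatenated
    with the plan cell that covers it, so the plan guides every patch."""
    def __init__(self, latent_dim=16, plan_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim + plan_dim, 128),
                                 nn.GELU(),
                                 nn.Linear(128, latent_dim))

    def forward(self, latents, plan):
        # latents: (B, D, H, W); plan: (B, Dp, Hp, Wp) with Hp <= H, Wp <= W
        b, d, h, w = latents.shape
        plan_up = F.interpolate(plan, size=(h, w), mode="nearest")  # broadcast plan cells
        tokens = torch.cat([latents, plan_up], dim=1)               # (B, D+Dp, H, W)
        tokens = tokens.permute(0, 2, 3, 1).reshape(b, h * w, -1)   # one token per latent patch
        noise_pred = self.mlp(tokens)                                # per-patch noise prediction
        return noise_pred.reshape(b, h, w, d).permute(0, 3, 1, 2)

if __name__ == "__main__":
    planner, denoiser = ToyPlanner(), PatchDenoiser()
    prompt_emb = torch.randn(2, 128)         # stand-in for MLLM text/image features
    noisy_latents = torch.randn(2, 16, 8, 8)
    plan = planner(prompt_emb)               # coarse 4x4 plan on the latent canvas
    eps = denoiser(noisy_latents, plan)      # each latent patch reads its plan cell
    print(eps.shape)                          # torch.Size([2, 16, 8, 8])
```

Broadcasting a small plan grid over the full latent resolution is just one plausible way to let every patch "see" the plan; the actual conditioning mechanism in the paper may differ.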