SurgWorld teaches surgical robots using videos plus text, then guesses the missing robot moves so we can train good policies without collecting tons of real robot-action data.
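A common way to realize "guessing the missing moves" is an inverse dynamics model that pseudo-labels action-free video so a policy can train on the result. The sketch below assumes that recipe; `InverseDynamicsModel`, `pseudo_label`, and all dimensions are hypothetical illustrations, not SurgWorld's actual code.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Guesses the action taken between two consecutive frames.

    Hypothetical stand-in; SurgWorld's real architecture is not shown here.
    """
    def __init__(self, frame_dim=512, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, frame_t, frame_t1):
        return self.net(torch.cat([frame_t, frame_t1], dim=-1))

def pseudo_label(idm, frame_feats):
    """Turn an action-free video (T x D frame features) into (state, action) pairs."""
    with torch.no_grad():
        actions = idm(frame_feats[:-1], frame_feats[1:])
    return frame_feats[:-1], actions  # a policy can now be trained on these pairs
```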
Yume1.5 is a model that turns text or a single image into a living, explorable video world you can move through with keyboard controls.
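One way such keyboard control can work is to condition each generation step on an embedding of the pressed key. Below is a minimal sketch under that assumption; the `KEYS` set, `KeyConditionedWorldModel`, and the latent rollout are hypothetical stand-ins, not Yume1.5's real interface.

```python
import torch
import torch.nn as nn

KEYS = ["W", "A", "S", "D"]  # assumed control set; the real key mapping may differ

class KeyConditionedWorldModel(nn.Module):
    """Hypothetical sketch of an interactive video world model: each new
    latent state is produced from the previous state plus an embedding
    of the pressed key, then decoded into the next chunk of frames."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.key_embed = nn.Embedding(len(KEYS), latent_dim)
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)

    def step(self, state, key_idx):
        return self.dynamics(self.key_embed(key_idx), state)

def explore(model, init_state, key_presses):
    """Roll the world forward one latent state per key press."""
    state, states = init_state, []
    for k in key_presses:
        state = model.step(state, torch.tensor([KEYS.index(k)]))
        states.append(state)  # each state would be decoded into video frames
    return states
```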
HiStream makes 1080p video generation much faster by removing repeated work across space, time, and denoising steps.
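Step-level redundancy removal often amounts to feature caching: recompute the expensive backbone only occasionally and reuse the result in between. The loop below sketches that general pattern; `model.features` and `model.head` are assumed interfaces for illustration, not HiStream's API.

```python
def cached_denoise(model, latents, timesteps, cache_every=4):
    """Hypothetical sketch of step-level caching: the expensive backbone
    pass runs only every `cache_every` denoising steps, and the cheaper
    update head reuses the cached features on the steps in between."""
    cache = None
    for i, t in enumerate(timesteps):
        if cache is None or i % cache_every == 0:
            cache = model.features(latents, t)   # expensive backbone pass
        latents = model.head(latents, t, cache)  # cheap update reusing the cache
    return latents
```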
SemanticGen is a new way to make videos that starts by planning in a small, high-level 'idea space' (semantic space) and then adds the tiny visual details later.
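The coarse-to-fine split might look like the toy pipeline below: a cheap planner runs in a compact semantic space, and a decoder adds pixels only afterwards. All names and dimensions are assumptions for illustration, not SemanticGen's architecture.

```python
import torch
import torch.nn as nn

class TwoStageVideoGen(nn.Module):
    """Hypothetical coarse-to-fine sketch: stage 1 generates a sequence
    of compact semantic latents; stage 2 decodes each latent into
    high-resolution pixels."""
    def __init__(self, sem_dim=64, pix_dim=3 * 64 * 64):
        super().__init__()
        self.planner = nn.GRU(sem_dim, sem_dim, batch_first=True)  # stage 1: idea space
        self.decoder = nn.Linear(sem_dim, pix_dim)                 # stage 2: visual detail

    def forward(self, prompt_emb, num_frames=16):
        # prompt_emb: (B, sem_dim) embedding of the text prompt
        seed = prompt_emb.unsqueeze(1).repeat(1, num_frames, 1)
        plan, _ = self.planner(seed)   # cheap planning in the semantic space
        return self.decoder(plan)      # expensive details are added last
```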
The paper teaches a video generator to move things realistically by borrowing motion knowledge from a strong video tracker.
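"Borrowing motion knowledge" typically means distillation: a frozen tracker provides target point trajectories on generated frames, and the generator is trained to match them. The loss below sketches that idea; both callables are hypothetical interfaces, not the paper's published code.

```python
import torch
import torch.nn.functional as F

def motion_distillation_loss(gen_video, teacher_tracker, student_motion_head):
    """Hypothetical sketch: a frozen point tracker supplies target
    trajectories, and a motion head inside the generator learns to
    predict the same motion."""
    with torch.no_grad():
        target_tracks = teacher_tracker(gen_video)   # (B, T, N, 2) point tracks
    pred_tracks = student_motion_head(gen_video)     # same shape, from the generator
    return F.mse_loss(pred_tracks, target_tracks)
```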
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
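One plausible reading of patch-by-patch guidance is that each diffusion patch attends to the plan token at its grid position on the canvas. The sketch below assumes exactly that; `CanvasConditioner` and its shapes are hypothetical, not MetaCanvas's actual design.

```python
import torch
import torch.nn as nn

class CanvasConditioner(nn.Module):
    """Hypothetical sketch: an MLLM emits a grid of plan tokens (the
    'canvas'); diffusion patch features cross-attend to those tokens
    so the generator can follow the plan patch by patch."""
    def __init__(self, dim=256, grid=8):
        super().__init__()
        self.grid = grid
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, patch_feats, canvas_tokens):
        # patch_feats:   (B, grid*grid, dim) diffusion patch features
        # canvas_tokens: (B, grid*grid, dim) MLLM plan, one token per patch
        out, _ = self.attn(patch_feats, canvas_tokens, canvas_tokens)
        return patch_feats + out  # plan-guided patch features
```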