Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.
TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together but not at making new pictures.