PixelGen is a new image generator that works directly on pixels and is guided by a perceptual loss (a measure of what looks good to people) to improve quality.
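To make the perceptual-loss idea concrete, here is a minimal sketch that scores images by comparing deep VGG16 features instead of raw pixels; the backbone, layer choice, and helper name are assumptions for illustration, not PixelGen's actual recipe.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Frozen feature extractor (early conv blocks of VGG16). Distances in this
# feature space track human similarity judgments better than raw pixel error.
_features = vgg16(weights=VGG16_Weights.DEFAULT).features[:16].eval()
for p in _features.parameters():
    p.requires_grad_(False)

def perceptual_loss(generated: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Both inputs: (N, 3, H, W), ImageNet-normalized. Lower = looks more alike."""
    return F.mse_loss(_features(generated), _features(target))

loss = perceptual_loss(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```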
PromptRL teaches a language model to rewrite prompts while a flow-based image model learns to draw, and both are trained together using the same rewards.
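Here is a minimal, self-contained sketch of one shared-reward training step with toy stand-in modules; the class names, placeholder reward, and REINFORCE-style update are illustrative assumptions, not PromptRL's published algorithm.

```python
import torch
import torch.nn as nn

class ToyRewriter(nn.Module):
    """Stand-in for the prompt-rewriting LM: picks 1 of 16 canned rewrites."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(32, 16)
    def forward(self, prompt_emb):
        dist = torch.distributions.Categorical(logits=self.head(prompt_emb))
        choice = dist.sample()
        return choice, dist.log_prob(choice)

class ToyImageModel(nn.Module):
    """Stand-in for the flow-based generator: maps a rewrite id to an 'image'."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(16, 64)
    def forward(self, choice):
        return self.embed(choice)

rewriter, image_model = ToyRewriter(), ToyImageModel()
optim = torch.optim.Adam([*rewriter.parameters(), *image_model.parameters()], lr=1e-3)

prompt_emb = torch.randn(4, 32)         # a batch of 4 prompt embeddings
choice, logprob = rewriter(prompt_emb)  # LM picks a rewrite
image = image_model(choice)             # generator draws from the rewrite
reward = -image.pow(2).mean(dim=-1)     # ONE placeholder score judges the image

# The same reward updates both models. REINFORCE moves the rewriter toward
# rewrites that scored well; the toy reward is differentiable, so the
# generator climbs it directly (a real flow model would instead use a
# reward-weighted likelihood loss).
loss = -(reward.detach() * logprob).mean() - reward.mean()
optim.zero_grad()
loss.backward()
optim.step()
```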
DenseGRPO teaches image models using many small, step-by-step rewards during generation instead of one final score at the end.
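The sparse-versus-dense contrast is easy to show in code: a sparse setup broadcasts one terminal score across every step, while a dense setup normalizes a score at each step within the group, GRPO-style; the exact formulas below are plausible assumptions, not DenseGRPO's published ones.

```python
import torch

def sparse_advantages(final_rewards: torch.Tensor, num_steps: int) -> torch.Tensor:
    """One terminal score per trajectory, broadcast to every step -> (G, T)."""
    adv = (final_rewards - final_rewards.mean()) / (final_rewards.std() + 1e-8)
    return adv.unsqueeze(1).expand(-1, num_steps)

def dense_advantages(step_rewards: torch.Tensor) -> torch.Tensor:
    """A score at every step, normalized within the group per step -> (G, T)."""
    mean = step_rewards.mean(dim=0, keepdim=True)
    std = step_rewards.std(dim=0, keepdim=True)
    return (step_rewards - mean) / (std + 1e-8)

# A group of G=4 trajectories with T=8 generation steps each.
step_rewards = torch.rand(4, 8)
print(sparse_advantages(step_rewards[:, -1], num_steps=8))  # one value repeated per row
print(dense_advantages(step_rewards))                       # feedback varies per step
```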
Before this work, most text-to-image models generated in VAE latents (small, compressed image codes) and struggled with slow training and with overfitting on high-quality fine-tuning sets.
Diffusion models make pictures from noise but often miss what people actually want in the prompt or what looks good to humans.
This paper turns a video model into a step-by-step visual thinker that makes one final, high-quality picture from a text prompt.
This paper shows that great image understanding features alone are not enough for making great images; you also need strong pixel-level detail.
StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.
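A minimal sketch of the idea, assuming a simple two-model split: spend full compute on the early structure-setting steps, then hand off to a cheaper pass for detail polishing. The switch point and toy update rule are illustrative, not StageVAR's exact recipe.

```python
import torch
import torch.nn as nn

big_model = nn.Linear(64, 64)    # stand-in for the full generator
small_model = nn.Linear(64, 64)  # stand-in for a cheaper polishing pass

@torch.no_grad()
def staged_generate(x: torch.Tensor, num_steps: int = 12, switch_at: int = 4) -> torch.Tensor:
    for step in range(num_steps):
        # Early steps fix meaning and structure, so they get the big model;
        # later steps only polish details, so a cheap pass suffices.
        model = big_model if step < switch_at else small_model
        x = x + 0.1 * model(x)  # toy refinement update
    return x

sample = staged_generate(torch.randn(1, 64))
```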
Sparse-LaViDa makes diffusion-style AI models much faster by skipping unhelpful masked tokens during generation while keeping quality the same.
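As a rough sketch, assuming a masked-diffusion decoder: keep the visible tokens plus only the few masked slots being unmasked this step, and drop the rest from the forward pass entirely. The selection rule below is a toy placeholder for Sparse-LaViDa's actual criterion.

```python
import torch
import torch.nn as nn

MASK_ID = 0
model = nn.Embedding(1000, 16)  # stand-in for the denoising network

def sparse_step(tokens: torch.Tensor, num_to_unmask: int = 2):
    """tokens: (T,) ids. Forward only the visible tokens plus the few masked
    slots we will actually unmask this step; the rest get no compute."""
    visible_pos = (tokens != MASK_ID).nonzero(as_tuple=True)[0]
    masked_pos = (tokens == MASK_ID).nonzero(as_tuple=True)[0]
    chosen = masked_pos[:num_to_unmask]            # toy selection rule
    keep = torch.cat([visible_pos, chosen]).sort().values
    hidden = model(tokens[keep])                   # shorter sequence, less compute
    return keep, hidden

keep, hidden = sparse_step(torch.tensor([0, 5, 0, 7, 0, 9]))
```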
Big text-to-image models make amazing pictures but are slow because they take hundreds of tiny steps to turn noise into an image.
MetaCanvas lets a multimodal language model (MLLM) sketch a plan inside the generator’s hidden canvas so diffusion models can follow it patch by patch.
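A speculative sketch of the canvas idea: a planner writes a coarse layout grid, which is upsampled so every patch of the generator's hidden canvas sees its cell's instruction. The shapes and the additive injection are assumptions, not MetaCanvas's actual mechanism.

```python
import torch
import torch.nn.functional as F

plan = torch.randn(1, 8, 4, 4)       # planner's coarse layout over a 4x4 grid
latents = torch.randn(1, 8, 16, 16)  # generator's hidden canvas of 16x16 patches

# Upsample the plan so each patch sees the planner's instruction for its
# cell, then inject it into the canvas the diffusion model denoises.
guidance = F.interpolate(plan, size=latents.shape[-2:], mode="nearest")
conditioned = latents + guidance
```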
EMMA is a single AI model that can understand images, write about them, create new images from text, and edit images—all in one unified system.