The paper shows that using information from many layers of a language model (not just one) helps text-to-image diffusion transformers follow prompts much better.
This paper introduces XDLM, a single model that blends two popular diffusion styles (masked and uniform) so it both understands and generates text and images well.
StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.
RecTok is a new visual tokenizer that teaches the whole training path of a diffusion model (the forward flow) to be smart about image meaning, not just the starting latent features.
Normalizing Flows are models that learn how to turn real images into simple noise and then back again.