This paper shows that using information from many layers of a language model, not just one, helps text-to-image diffusion transformers follow prompts much better.
This paper introduces XDLM, a single model that blends two popular diffusion styles (masked and uniform) so it both understands and generates text and images well.
This paper shows a simple way to make image-generating models (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.
DiffThinker turns hard picture-based puzzles into an image-to-image drawing task instead of a long text-based reasoning task.
StageVAR makes image-generating AI much faster by recognizing that early steps set the meaning and structure, while later steps just polish details.
RecTok is a new visual tokenizer that makes the whole training path of a diffusion model (the forward flow), not just the starting latent features, carry information about image meaning.
Normalizing Flows are models that learn how to turn real images into simple noise and then back again.
This paper shows a new way to teach an autoencoder to shape its hidden space (the 'latent space') to look like any distribution we want, not just a simple bell curve.