DREAM is one model that both understands images (like CLIP) and makes images from text (like top text-to-image models).
The Sphere Encoder is a new way to make images fast by teaching an autoencoder to place all images evenly on a big imaginary sphere and then decode random spots on that sphere back into pictures.
FrankenMotion is a new AI that generates human motion by controlling each body part over time, like a careful puppeteer.
This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.
This paper introduces Log-linear Sparse Attention (LLSA), a new way for Diffusion Transformers to focus only on the most useful information using a smart, layered search.
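
To make the Sphere Encoder summary above a little more concrete, here is a minimal sketch of what "picking a random spot on a sphere" can look like. It uses the standard trick of normalizing a Gaussian vector to get a uniformly random point on the unit sphere. The function name `sample_on_sphere` and the latent size of 8 are assumptions chosen for illustration, and the decoder that would turn such a point into a picture is not shown, since the summary gives no details about it.

```python
import math
import random


def sample_on_sphere(dim, rng):
    """Draw one point uniformly at random from the unit sphere in `dim` dimensions.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    # An isotropic Gaussian has no preferred direction, so scaling the
    # vector to length 1 gives a uniformly distributed point on the sphere.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)            # fixed seed so the run is repeatable
latent = sample_on_sphere(8, rng)  # 8 is an arbitrary latent size for the demo
radius = math.sqrt(sum(x * x for x in latent))
print(len(latent), round(radius, 6))
```

In a real system, a point like `latent` would be passed to a trained decoder. If the encoder really does spread images evenly over the sphere, every such random point should land near something that decodes to a plausible picture.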