Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
Latent diffusion models are great at making images but are slow to learn the meaning of scenes, because their training goal mostly teaches them to clean up noise rather than to understand objects and layouts.
This paper introduces Log-linear Sparse Attention (LLSA), a new way for Diffusion Transformers to focus only on the most useful information using a smart, layered search.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
Normalizing Flows are models that learn an invertible mapping: they turn real images into simple noise and can turn that same noise back into images exactly.
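The invertibility idea behind normalizing flows can be shown with a toy example (a minimal sketch of the general concept, not the paper's actual model): a simple affine map pushes data toward noise, and its exact inverse recovers the data, so no information is lost in either direction.

```python
import numpy as np

# Toy "flow": an invertible affine map (hypothetical parameters, for
# illustration only). Real flows stack many learned invertible layers.

def forward(x, mu, log_sigma):
    """Data -> noise: z = (x - mu) * exp(-log_sigma)."""
    return (x - mu) * np.exp(-log_sigma)

def inverse(z, mu, log_sigma):
    """Noise -> data: x = z * exp(log_sigma) + mu."""
    return z * np.exp(log_sigma) + mu

x = np.array([2.0, -1.0, 0.5])        # stand-in for image pixels
mu, log_sigma = 1.0, 0.3              # assumed (untrained) parameters
z = forward(x, mu, log_sigma)         # image -> noise
x_rec = inverse(z, mu, log_sigma)     # noise -> image
assert np.allclose(x, x_rec)          # round-trip is exact
```

Because every step is invertible, the same model can both encode images into noise and generate images from noise.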
This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.