This paper fixes a hidden flaw in a popular image tokenizer (FSQ, Finite Scalar Quantization) with a simple one-line change to its activation function.
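For context only, here is a minimal sketch of how a generic FSQ-style quantizer uses a bounding activation (typically a scaled tanh) before rounding each latent dimension to a few discrete levels; this is standard FSQ background under assumed defaults, not the paper's one-line fix.

```python
# Minimal sketch of a generic FSQ-style quantizer; the bounding activation is
# the part the summary says the paper edits. Illustrative assumptions only.
import torch

def fsq_quantize(z, levels=5):
    """Bound each latent dimension, then round it to `levels` discrete values."""
    half = (levels - 1) / 2
    z_bounded = torch.tanh(z) * half          # bounding activation
    z_rounded = torch.round(z_bounded)        # snap to the finite grid
    # Straight-through estimator: rounded values forward, smooth gradients back.
    return z_bounded + (z_rounded - z_bounded).detach()

codes = fsq_quantize(torch.randn(2, 8))
```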
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
FrankenMotion is a new AI that makes human motion by controlling each body part over time, like a careful puppeteer.
This paper shows how to make long, camera-controlled videos much faster by generating only a few smart keyframes with diffusion, then filling in the rest using a 3D scene and rendering.
This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.
This paper shows a simple way to make image-generating AIs (diffusion Transformers) produce clearer, more accurate pictures by letting the model guide itself from the inside.
Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it spreads equally in all directions, with no preferred direction built in.
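To make the arrows concrete, here is a minimal PyTorch sketch of the standard conditional flow-matching objective with a Gaussian source: sample a point on the straight line between a noise sample and a data sample, and regress the network onto that line's velocity. The tiny `velocity_net` and the toy 2-D data are placeholders for illustration, not this paper's setup.

```python
# Minimal sketch of the standard conditional flow-matching loss with a
# Gaussian source; `velocity_net` is a placeholder model, not the paper's.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 2))

def flow_matching_loss(x1):
    """x1: a batch of target samples ("real pictures"), shape (B, 2)."""
    x0 = torch.randn_like(x1)                      # Gaussian source cloud
    t = torch.rand(x1.size(0), 1)                  # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                     # point on the straight path
    v_target = x1 - x0                             # velocity of that path
    v_pred = velocity_net(torch.cat([xt, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()       # regress onto the "arrows"

loss = flow_matching_loss(torch.randn(128, 2) * 0.5 + 3.0)
loss.backward()
```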
Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.
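For reference, this is roughly what that "clean up noise" training goal looks like: the network only has to predict the noise that was mixed into a latent, as in the minimal sketch below. The tiny `eps_net` and the toy linear schedule are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of the usual noise-prediction (denoising) objective that the
# summary refers to; model size and noise schedule are placeholder choices.
import torch
import torch.nn as nn

eps_net = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 4))

def denoising_loss(z):
    """z: a batch of clean latents, shape (B, 4)."""
    t = torch.rand(z.size(0), 1)                     # diffusion time in [0, 1]
    alpha = 1.0 - t                                  # toy linear schedule
    noise = torch.randn_like(z)
    z_noisy = alpha.sqrt() * z + (1 - alpha).sqrt() * noise
    eps_pred = eps_net(torch.cat([z_noisy, t], dim=-1))
    # The target is the noise itself, so nothing here directly rewards
    # understanding objects or layout -- the point the summary is making.
    return ((eps_pred - noise) ** 2).mean()

loss = denoising_loss(torch.randn(64, 4))
```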
This paper introduces Log-linear Sparse Attention (LLSA), a new way for Diffusion Transformers to focus only on the most useful information using a smart, layered search.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
Normalizing Flows are models that learn how to turn real images into simple noise and then back again.
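As a rough illustration of that "there and back" idea, here is one affine coupling layer, a common invertible building block of a normalizing flow: the forward pass maps data toward noise and the inverse pass maps noise back exactly. Layer sizes and names are illustrative only, not from this paper.

```python
# Minimal sketch of an affine coupling layer, a standard invertible block in
# normalizing flows; dimensions and architecture here are illustrative only.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Predicts a log-scale and shift for the second half from the first half.
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x):
        """Data -> noise direction; also returns log|det Jacobian| per sample."""
        x1, x2 = x.chunk(2, dim=-1)
        log_s, b = self.net(x1).chunk(2, dim=-1)
        z2 = x2 * log_s.exp() + b
        return torch.cat([x1, z2], dim=-1), log_s.sum(dim=-1)

    def inverse(self, z):
        """Noise -> data direction: the exact inverse of forward()."""
        z1, z2 = z.chunk(2, dim=-1)
        log_s, b = self.net(z1).chunk(2, dim=-1)
        x2 = (z2 - b) * (-log_s).exp()
        return torch.cat([z1, x2], dim=-1)

layer = AffineCoupling(dim=8)
z, logdet = layer(torch.randn(4, 8))
x_back = layer.inverse(z)   # recovers the input up to numerical precision
```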