This paper fixes a hidden flaw in a popular image tokenizer (FSQ) with a simple one-line change to its activation function.
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
This paper shows how to make long, camera-controlled videos much faster by generating only a few smart keyframes with diffusion, then filling in the rest using a 3D scene and rendering.
This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.
Flow Matching is like teaching arrows to push points from a simple cloud (source) to real pictures (target); most people start from a Gaussian cloud because it spreads equally in all directions.
Alchemist is a smart data picker for training text-to-image models that learns which pictures and captions actually help the model improve.
Latent diffusion models are great at making images but learn the meaning of scenes slowly because their training goal mostly teaches them to clean up noise, not to understand objects and layouts.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
Normalizing Flows are models that learn how to turn real images into simple noise and then back again.
This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).
This paper shows that we can turn big, smart vision features into a small, easy-to-use code for image generation with just one attention layer.
SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.
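
To make the Flow Matching picture above concrete (arrows pushing points from a Gaussian source toward target data), here is a minimal sketch. It assumes the common straight-line path between a source point and a target point, which the summary does not specify. The function name `flow_matching_pair` and the toy 2D "data" are hypothetical stand-ins, not anything from the paper.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Build one training example per row for a straight-line path (an assumption).

    x0: source samples, x1: target samples, t: times in [0, 1], one per row.
    Returns the point x_t on the line and the "arrow" (velocity) a model
    would be trained to predict at that point.
    """
    t = t[:, None]                      # broadcast time over feature dims
    x_t = (1.0 - t) * x0 + t * x1       # position along the line at time t
    v_target = x1 - x0                  # constant velocity of a straight line
    return x_t, v_target

rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, size=(4, 2))   # toy stand-in for real pictures
x0 = rng.standard_normal(size=x1.shape)    # Gaussian source cloud
t = rng.uniform(0.0, 1.0, size=4)

x_t, v_target = flow_matching_pair(x0, x1, t)

# Following the arrow for the remaining time (1 - t) lands on the target.
reached = x_t + (1.0 - t)[:, None] * v_target
```

In a real setup a neural network would take `x_t` and `t` and be trained to output something close to `v_target`, typically with a squared-error loss; the sketch only shows how the training pairs are formed.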