Unified Latents (UL) is a way to learn the hidden code (latents) for images and videos by training three parts together: an encoder, a diffusion prior, and a diffusion decoder.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.