This paper introduces Self-E, a text-to-image model trained from scratch that can generate high-quality images with a flexible number of sampling steps, from just a few to many.
Before this work, large vision-language models (VLMs) were strong at jointly understanding images and text, but weak at generating new images.
VQRAE is a new kind of image tokenizer that lets a single model both understand images (via continuous features) and generate or reconstruct them (via discrete tokens).
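To make the continuous-versus-discrete split concrete, here is a minimal sketch of a dual-representation tokenizer: the encoder output is used directly as continuous features (understanding path), while nearest-neighbor lookup in a codebook yields discrete tokens (generation/reconstruction path). This is a generic vector-quantization illustration; the names, shapes, and the stand-in linear encoder are all hypothetical, not VQRAE's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE, DIM, PATCH_DIM = 8, 4, 16
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))  # learned in practice
proj = rng.normal(size=(PATCH_DIM, DIM))          # stand-in encoder weights

def encode_continuous(patches):
    # Continuous features: what the understanding side consumes directly.
    return patches @ proj

def quantize(features):
    # Discrete tokens: index of the nearest codebook vector per patch.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

patches = rng.normal(size=(3, PATCH_DIM))  # 3 toy "image patches"
feats = encode_continuous(patches)         # continuous -> understanding
tokens = quantize(feats)                   # discrete  -> generation
recon = codebook[tokens]                   # a decoder would rebuild from these
print(feats.shape, tokens.shape, recon.shape)
```

The key point the sketch captures is that both representations come from one encoder pass, so a single model can serve both roles.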