Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).