Before this work, most text-to-image models generated images in a VAE's latent space (small, compressed image codes) and struggled with slow training and with overfitting on small, high-quality fine-tuning sets.
This paper shows that strong image-understanding features alone are not enough to make great images; the model also needs strong pixel-level detail.
Big text-to-image models make amazing pictures, but they are slow because they turn noise into an image through many small denoising steps.
TwinFlow is a new way to make big image models draw great pictures in just one step instead of 40–100 steps.
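The speed gap between the two sampling styles comes down to how many times the network is called. A toy sketch of that difference, with a hypothetical `denoiser` stand-in for the real neural network and an `nfe` counter tracking network-function evaluations (both names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
nfe = {"count": 0}  # number of network-function evaluations (denoiser calls)

def denoiser(x, t):
    # Hypothetical stand-in for the neural network: a real model would
    # predict a slightly less noisy image; here we just damp the input.
    nfe["count"] += 1
    return 0.9 * x

def multi_step_sample(steps=50):
    # Diffusion-style sampling: start from pure noise, then call the
    # denoiser once per step, so cost grows linearly with `steps`.
    x = rng.normal(size=(8, 8))
    for t in range(steps, 0, -1):
        x = denoiser(x, t)
    return x

def one_step_sample():
    # One-step sampling: a single call maps noise straight to an image.
    z = rng.normal(size=(8, 8))
    return denoiser(z, 1)

nfe["count"] = 0
img_slow = multi_step_sample(steps=50)
slow_nfe = nfe["count"]  # 50 denoiser calls

nfe["count"] = 0
img_fast = one_step_sample()
fast_nfe = nfe["count"]  # 1 denoiser call
```

At 40–100 steps, the multi-step sampler pays 40–100x the network cost of the one-step version per image, which is exactly the cost a one-step method aims to remove.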
Before this work, big vision-language models (VLMs) were great at understanding pictures and words together, but not at generating new pictures.