UniT teaches a single multimodal model to reason step by step with both images and text, so it can check its own work and fix mistakes as it goes.
This paper introduces Self-E, a text-to-image model that is trained from scratch and can generate good images in any number of steps, from just a few to many.