The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.
Big idea: Make image-making AIs stop, think, check, and fix their own work so they get better at both creating pictures and understanding them.
This paper teaches an AI to pay attention better by training where it looks, not just what it says.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when they answer.