The paper trains one model from scratch to both read text and see images/videos, instead of starting from a language-only model.
Before this work, most text-to-image models used VAEs (small, squished image codes) and struggled with slow training and overfitting on high-quality fine-tuning sets.
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).