OpenVision 3 is a single vision encoder that learns one shared set of image tokens that works well for both understanding images (such as answering questions about them) and generating images (such as producing new pictures).
This paper shows that large, semantically rich vision features can be compressed into a small, easy-to-use latent code for image generation using just one attention layer.
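
To make the idea of compressing features with one attention layer more concrete, here is a minimal sketch in plain Python. It is an illustration only and not the paper's architecture: the weights are random and untrained, the sizes (`d_feat`, `d_attn`, `d_code`) are made-up toy values, and the helper names (`compress_tokens`, `random_matrix`) are hypothetical. It assumes the compression happens in the value projection of a single self-attention layer, which is one plausible reading of the summary.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random Gaussian matrix; stands in for learned weights (assumption)."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def compress_tokens(tokens, w_q, w_k, w_v):
    """One self-attention layer whose value projection maps to a smaller width.

    tokens: (n, d_feat) wide vision features
    returns: (n, d_code) compact codes and the (n, n) attention weights
    """
    q = matmul(tokens, w_q)              # (n, d_attn)
    k = matmul(tokens, w_k)              # (n, d_attn)
    v = matmul(tokens, w_v)              # (n, d_code), the narrow output width
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose k to (d_attn, n)
    scores = matmul(q, k_t)              # (n, n) pairwise token similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy sizes, chosen only for illustration.
rng = random.Random(0)
num_tokens, d_feat, d_attn, d_code = 4, 16, 8, 2

tokens = random_matrix(num_tokens, d_feat, rng)  # stand-in for encoder features
w_q = random_matrix(d_feat, d_attn, rng)
w_k = random_matrix(d_feat, d_attn, rng)
w_v = random_matrix(d_feat, d_code, rng)

codes, weights = compress_tokens(tokens, w_q, w_k, w_v)
print(len(codes), "tokens, each compressed from", d_feat, "to", len(codes[0]), "numbers")
```

In this sketch each token still attends to every other token, so the small code for one image patch can draw on information from the whole image. In a real system the weights would be trained so that the compact codes remain useful to an image generator.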