OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
Humans keep a big-picture memory (a βmindscapeβ) when reading long things; this paper teaches AI to do the same.