OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
This paper shows that great image understanding features alone are not enough for making great images; you also need strong pixel-level detail.
VQRAE is a new kind of image tokenizer that lets one model both understand images (continuous features) and generate/reconstruct them (discrete tokens).