OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
Intermediate · Letian Zhang, Sucheng Ren et al. · Jan 21 · arXiv
OpenVision 3 is a single vision encoder that learns one shared set of image tokens suited to both image understanding (such as visual question answering) and image generation (such as synthesizing new images).
#Unified Visual Encoder #VAE #Vision Transformer