Youtu-VL is a new kind of vision-language model that learns to predict both words and tiny image pieces, not just words.
This paper introduces HUVR, a single vision model that can both recognize what's in an image and reconstruct or generate images from tiny codes.