This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.
This paper turns the “holes” (missing spots) in depth-camera images into helpful training hints instead of treating them as garbage.
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.
Ministral 3 is a new family of small-but-mighty AI language models (3B, 8B, 14B) that learn from a larger model using a step-by-step tutoring method called Cascade Distillation.
InfiniDepth is a new way to predict depth that treats the image as a smooth, continuous space where you can ask for the depth at any location, not just at the fixed pixels of a grid.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.