Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).
VidEoMT shows that a single, well‑trained Vision Transformer (ViT) can segment and track objects in videos without extra tracking gadgets.
The Sphere Encoder is a new way to make images fast by teaching an autoencoder to place all images evenly on a big imaginary sphere and then decode random spots on that sphere back into pictures.
This paper shows how to make a whole picture in one go, directly in pixels, without using a hidden “latent” space or many tiny steps.
The paper turns the “holes” (missing spots) in depth camera images into helpful training hints instead of treating them as garbage.
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
This paper introduces HUVR, a single vision model that can both recognize what’s in an image and reconstruct or generate images from tiny codes.
LightOnOCR-2-1B is a single, compact AI model that reads PDF pages and scans and turns them into clean, well-ordered text without using fragile multi-step OCR pipelines.
Ministral 3 is a new family of small-but-mighty AI language models (3B, 8B, 14B) that learn from a larger model using a step-by-step tutoring method called Cascade Distillation.
InfiniDepth is a new way to predict depth that treats the image as a smooth, continuous surface you can query for depth at any location, not just at the fixed pixels of a grid.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
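
The Sphere Encoder summary above mentions decoding "random spots" on a sphere. As a rough illustration of that one step only, the sketch below draws a point uniformly at random on a sphere by normalizing a Gaussian vector, which is a standard trick. The function name `sample_on_sphere`, the dimension, and the radius are assumptions made for this example; the summary does not say how the paper actually samples, and the trained decoder itself is not shown.

```python
import math
import random


def sample_on_sphere(dim, radius=1.0, rng=None):
    """Return a point drawn uniformly from the surface of a sphere in `dim` dimensions.

    Hypothetical helper for illustration: a Gaussian vector has no preferred
    direction, so rescaling it to a fixed length gives a uniform point on the sphere.
    """
    rng = rng or random.Random()
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # guard against the (practically impossible) all-zero draw
            return [radius * x / norm for x in v]


# Example: one random latent point; a trained decoder (not shown here)
# would map a point like this back to an image.
latent = sample_on_sphere(dim=8, radius=1.0, rng=random.Random(0))
```

Every point returned has exactly the requested length, so a decoder fed such samples only ever sees inputs on the sphere's surface, which is the region the summary says the autoencoder learns to fill evenly.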