This paper teaches an AI to segment any object you name (open-vocabulary) much better by adding a few example pictures with pixel labels and smart retrieval.
This paper teaches a computer to find the same object when seen from two very different cameras, like a body camera (first-person) and a room camera (third-person).
DINO-SAE is a new autoencoder that keeps both the meaning of an image (semantics) and tiny textures (fine details) at the same time.
C-RADIOv4 is a single vision model that learns from several expert models at once and keeps their best skills while staying fast.
AnyDepth is a new, simple way for a computer to tell how far things are in a picture using just one image (monocular depth).
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.