DINO-SAE is a new autoencoder that preserves both the meaning of an image (semantics) and its tiny textures (fine details).
C-RADIOv4 is a single vision model that learns from several expert models at once and keeps their best skills while staying fast.
AnyDepth is a new, simple way for a computer to tell how far away things are in a picture using just one image (monocular depth estimation).
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
This paper shows you can train a big text-to-image diffusion model directly on the features of a vision foundation model (like DINOv3) without using a VAE.
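
To make the "two students correct each other" idea in the Co2S summary concrete, here is a toy sketch of cross pseudo-labeling, one common way such co-training is set up. It is an assumption for illustration only: the summary does not say that Co2S uses exactly this loss, and the function name `cross_pseudo_label_loss`, the `threshold` parameter, and the array shapes are all hypothetical. Each student's confident predictions on unlabeled pixels become training targets for the other student.

```python
import numpy as np

def cross_pseudo_label_loss(probs_a, probs_b, threshold=0.9):
    """Toy cross pseudo-labeling loss for two students on unlabeled pixels.

    probs_a, probs_b: arrays of shape (num_pixels, num_classes) holding each
    student's class probabilities (rows sum to 1).
    Returns (loss_a, loss_b): cross-entropy of each student against the
    other student's confident hard pseudo-labels.
    """
    eps = 1e-12  # avoids log(0)

    def one_way(teacher, student):
        labels = teacher.argmax(axis=1)                           # hard pseudo-labels
        idx = np.flatnonzero(teacher.max(axis=1) >= threshold)    # confident pixels only
        if idx.size == 0:
            return 0.0  # nothing confident enough to teach with
        picked = student[idx, labels[idx]]  # student's probability of the pseudo-label
        return float(-np.log(picked + eps).mean())

    loss_a = one_way(probs_b, probs_a)  # student B corrects student A
    loss_b = one_way(probs_a, probs_b)  # student A corrects student B
    return loss_a, loss_b

# Hypothetical predictions for two pixels and two classes.
probs_a = np.array([[0.95, 0.05], [0.60, 0.40]])
probs_b = np.array([[0.50, 0.50], [0.05, 0.95]])
loss_a, loss_b = cross_pseudo_label_loss(probs_a, probs_b)
```

In this example student A is confident only about the first pixel and student B only about the second, so each student is supervised on exactly the pixel where the other one is sure. In a real setup these terms would be added to the ordinary supervised loss on the few labeled images.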