The paper asks a simple question: what must a vision model’s internal pictures (embeddings) look like for it to recognize new combinations of things it already knows?
Similarity-based image–text models like CLIP can be fooled by “half-truths”: adding one plausible but wrong detail to a caption can make it score as more similar to the image, not less.
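As a toy sketch of the mechanism being fooled (my own illustration, not the paper’s code or data): CLIP-style scoring is just a cosine between normalized embeddings, so a false detail that happens to point in the same direction as the image embedding raises the score rather than lowering it. All vectors here are made-up three-dimensional stand-ins.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """CLIP-style similarity: cosine between the two embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(image_emb @ text_emb)

# Made-up 3-D embeddings; the third axis stands for a spurious
# "red collar" direction that the image embedding weakly activates.
image = np.array([1.0, 1.0, 0.4])               # photo of a dog on grass
caption_true = np.array([1.0, 1.0, 0.0])        # "a dog on grass"
caption_half_truth = np.array([1.0, 1.0, 0.8])  # "... wearing a red collar" (false)

print(clip_score(image, caption_true))
print(clip_score(image, caption_half_truth))    # higher, despite the false detail
```

Nothing here depends on real CLIP weights; the point is only that a pure similarity score has no notion of “true vs. false detail,” just alignment of directions.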
This paper teaches an AI to segment any object you name (open-vocabulary segmentation) much more accurately by adding a few example pictures with pixel labels and smart retrieval.
People often pick CLIP-like models for image labeling, but this paper shows that large multimodal models (LMMs) can be just as good—or even better—when you give them a few examples in the prompt (in-context learning).
This paper speeds up image and video generators called diffusion transformers by changing how big their puzzle pieces (patches) are at each step.
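A rough sketch of the underlying knob (my own illustration, not the paper’s implementation): for a fixed image size, the patch size directly sets how many tokens a diffusion transformer must process at a step, so using bigger patches at some steps shrinks the sequence quadratically.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch*patch*C).
    Assumes H and W are divisible by `patch`.
    """
    h, w, c = image.shape
    tokens = (image
              .reshape(h // patch, patch, w // patch, patch, c)
              .transpose(0, 2, 1, 3, 4)   # group pixels by patch
              .reshape(-1, patch * patch * c))
    return tokens

image = np.random.rand(64, 64, 3)
coarse = patchify(image, 16)   # e.g. early, noisy steps: few big patches
fine = patchify(image, 8)      # e.g. late, detailed steps: many small patches
print(coarse.shape, fine.shape)  # (16, 768) (64, 192)
```

Going from 8×8 to 16×16 patches cuts the token count from 64 to 16, and attention cost scales with the square of that count.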
OpenVision 3 is a single vision encoder that learns one set of image tokens that work well for both understanding images (like answering questions) and generating images (like making new pictures).
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
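The co-training idea can be sketched in a few lines (a toy version under my own assumptions; names like `cross_supervision_loss` are illustrative, and the real method uses two full segmentation networks rather than random logits): each student turns its confident per-pixel predictions into pseudo-labels that supervise the other.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def pseudo_label(logits, threshold=0.9):
    """Keep only confident per-pixel predictions; -1 marks 'ignore'."""
    probs = softmax(logits)
    labels = probs.argmax(-1)
    labels[probs.max(-1) < threshold] = -1
    return labels

def masked_ce(logits, labels):
    """Cross-entropy over pixels whose pseudo-label is not 'ignore'."""
    mask = labels >= 0
    if not mask.any():
        return 0.0
    probs = softmax(logits)
    picked = probs[mask, labels[mask]]  # prob assigned to the pseudo-label class
    return float(-np.log(picked + 1e-9).mean())

def cross_supervision_loss(logits_a, logits_b):
    """Each student is trained on the other's confident pseudo-labels."""
    return (masked_ce(logits_a, pseudo_label(logits_b))
            + masked_ce(logits_b, pseudo_label(logits_a)))

# Toy per-pixel logits for two students (12 pixels, 5 classes).
logits_a = np.random.randn(12, 5)
logits_b = np.random.randn(12, 5)
loss = cross_supervision_loss(logits_a, logits_b)
```

Because the two students start from different backbones (here, CLIP and DINOv3), their mistakes tend to differ, so each can filter out some of the other’s errors.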
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).
StoryMem is a new way to make minute-long, multi-shot videos that keep the same characters, places, and style across many clips.
The paper tackles a paradox: visual tokenizers that get great pixel reconstructions often make worse images when used for generation.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.