OpenVision 3 is a single vision encoder that learns one shared set of image tokens that works well for both understanding images (like answering questions about them) and generating images (like making new pictures).
Co2S is a new way to train segmentation models with very few labels by letting two different students (CLIP and DINOv3) learn together and correct each other.
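Co2S's actual losses and training recipe live in the paper; the sketch below only illustrates the generic "two students correct each other" idea (cross pseudo-labeling on unlabeled images), with `clip_student` and `dino_student` as placeholder segmentation heads rather than the real models.

```python
# Minimal, hypothetical sketch of two segmentation students supervising each
# other: each student's prediction on unlabeled images becomes a pseudo-label
# for the other. Not the paper's Co2S code; model objects are placeholders.
import torch
import torch.nn.functional as F

def co_training_step(clip_student, dino_student, labeled_batch, unlabeled_images):
    images, masks = labeled_batch  # masks: integer class maps [B, H, W]

    # Supervised loss on the few labeled images, for both students.
    sup_loss = (
        F.cross_entropy(clip_student(images), masks)
        + F.cross_entropy(dino_student(images), masks)
    )

    # Pseudo-labels: each student labels the unlabeled images for the other,
    # so their disagreements get corrected as training goes on.
    with torch.no_grad():
        pseudo_from_clip = clip_student(unlabeled_images).argmax(dim=1)
        pseudo_from_dino = dino_student(unlabeled_images).argmax(dim=1)

    cross_loss = (
        F.cross_entropy(clip_student(unlabeled_images), pseudo_from_dino)
        + F.cross_entropy(dino_student(unlabeled_images), pseudo_from_clip)
    )

    return sup_loss + cross_loss
```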
The paper builds YearGuessr, a giant, worldwide photo-and-text dataset of 55,546 buildings with their construction years (1001–2024), GPS coordinates, and popularity (page views).
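The released schema isn't reproduced here; the sketch below just illustrates, with invented field names, what one record described above might carry (photo, caption, construction year, GPS coordinates, page views), plus the obvious year-error metric such a benchmark invites.

```python
# Hypothetical record layout for a building-age dataset like YearGuessr;
# field names are invented for illustration, not the released schema.
from dataclasses import dataclass

@dataclass
class BuildingRecord:
    image_path: str         # photo of the building
    caption: str            # accompanying text description
    construction_year: int  # ground truth, anywhere from 1001 to 2024
    latitude: float         # GPS coordinates
    longitude: float
    page_views: int         # popularity signal

def year_error(predicted_year: int, record: BuildingRecord) -> int:
    """Absolute error in years, a natural metric for year guessing."""
    return abs(predicted_year - record.construction_year)
```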
StoryMem is a new way to make minute‑long, multi‑shot videos that keep the same characters, places, and style across many clips.
The paper tackles a paradox: visual tokenizers that reconstruct pixels very accurately often produce worse images when used for generation.
The paper shows that many AI image generators are trained to prefer one popular idea of beauty, even when a user clearly asks for something messy, dark, blurry, or emotionally heavy.
Most image-similarity tools only notice how things look (color, shape, class) and miss deeper, human-like connections.
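For reference, "how things look" similarity is typically scored as the cosine between two image embeddings from a visual encoder such as CLIP; the tiny sketch below shows that baseline computation, with the encoder itself left as a stand-in.

```python
# Baseline appearance-level similarity: cosine between two image embeddings.
# The embeddings are assumed to come from some visual encoder (e.g., CLIP);
# producing them is out of scope here.
import torch
import torch.nn.functional as F

def visual_similarity(embedding_a: torch.Tensor, embedding_b: torch.Tensor) -> float:
    """Cosine similarity of two 1-D image embeddings; high when images look alike."""
    a = F.normalize(embedding_a, dim=-1)
    b = F.normalize(embedding_b, dim=-1)
    return float((a * b).sum())
```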
SpaceControl lets you steer a powerful 3D generator with simple shapes you draw, without retraining the model.