Big idea: use a small, already-trained model to help a bigger model learn good habits early, so the big one trains faster and ends up performing better.
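To make the idea concrete, here is a minimal PyTorch sketch of one common way to do this: train the big model on its main task while an auxiliary loss nudges one of its intermediate features toward those of a small, frozen, already-trained model. The toy architecture, the cosine alignment loss, and the `align_weight` trade-off are all assumptions for illustration, not the paper's specific recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small frozen "teacher" (in practice, an already-trained model).
small_teacher = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 128))
small_teacher.requires_grad_(False)

class BigStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 1024), nn.ReLU()
        )
        self.head = nn.Linear(1024, 10)   # main-task head (toy 10-way classification)
        self.proj = nn.Linear(1024, 128)  # maps student features into the teacher's space

    def forward(self, x):
        h = self.backbone(x)
        return self.head(h), self.proj(h)

student = BigStudent()
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
align_weight = 0.5  # assumed weighting of the guidance term

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, student_feat = student(x)
with torch.no_grad():
    teacher_feat = small_teacher(x)

task_loss = F.cross_entropy(logits, y)
# Guidance: early in training, pull the big model's features toward the teacher's.
align_loss = 1 - F.cosine_similarity(student_feat, teacher_feat, dim=-1).mean()
(task_loss + align_weight * align_loss).backward()
opt.step()
```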
DINO-SAE is a new autoencoder that keeps both the meaning of an image (semantics) and tiny textures (fine details) at the same time.
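As a rough illustration (not DINO-SAE's actual design), an autoencoder can be pushed toward both goals with two terms: a pixel reconstruction loss for the fine details and an alignment loss against a frozen semantic encoder such as DINO for the meaning. The toy encoder/decoder, the random stand-in for DINO, and the 0.5 weight below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

semantic_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))
semantic_encoder.requires_grad_(False)            # stands in for a frozen DINO model

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 384))
decoder = nn.Sequential(nn.Linear(384, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

x = torch.rand(8, 3, 32, 32)                      # a batch of toy images
z = encoder(x)
recon = decoder(z)

pixel_loss = F.mse_loss(recon, x)                 # keeps tiny textures / fine details
with torch.no_grad():
    sem_target = semantic_encoder(x)
semantic_loss = 1 - F.cosine_similarity(z, sem_target, dim=-1).mean()  # keeps meaning

(pixel_loss + 0.5 * semantic_loss).backward()     # 0.5 is an assumed trade-off weight
opt.step()
```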
UniX is a new medical AI that both understands chest X-rays (writes accurate reports) and generates chest X-ray images (high visual quality) without making the two jobs fight each other.
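Purely as a generic sketch of training one network on both jobs at once (not UniX's actual architecture, and not how it avoids interference between the tasks): a shared trunk feeds an "understanding" head scored against report tokens and a "generation" head scored against the image. The toy shapes, the single-token stand-in for a report, and the equal loss weights are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

trunk = nn.Sequential(nn.Flatten(), nn.Linear(1 * 64 * 64, 512), nn.ReLU())
report_head = nn.Linear(512, 1000)                   # logits over an assumed 1000-token vocab
image_head = nn.Sequential(nn.Linear(512, 1 * 64 * 64), nn.Unflatten(1, (1, 64, 64)))
params = list(trunk.parameters()) + list(report_head.parameters()) + list(image_head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

xray = torch.rand(4, 1, 64, 64)                      # toy grayscale "X-rays"
report_tokens = torch.randint(0, 1000, (4,))         # one token stands in for a full report

h = trunk(xray)
understand_loss = F.cross_entropy(report_head(h), report_tokens)  # report accuracy
generate_loss = F.mse_loss(image_head(h), xray)                   # image quality
(understand_loss + generate_loss).backward()         # equal weighting is an assumption
opt.step()
```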
This paper shows that the best VAEs for image generation are the ones whose latents neatly separate object attributes, a property called semantic disentanglement.
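To give "semantic disentanglement" a concrete reading: one informal probe is to check whether each known attribute correlates strongly with only one or a few latent dimensions. The NumPy toy below fabricates disentangled latents just to show the diagnostic; it is not the metric, model, or data used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_latent, d_attr = 1000, 16, 4
attributes = rng.normal(size=(n, d_attr))             # e.g. color, size, pose, shape
# Toy disentangled latents: each attribute lands in its own latent dimension.
latents = np.zeros((n, d_latent))
latents[:, :d_attr] = attributes
latents += 0.1 * rng.normal(size=latents.shape)

# |correlation| between every latent dimension and every attribute.
corr = np.abs(np.corrcoef(latents.T, attributes.T)[:d_latent, d_latent:])  # (16, 4)

# For each attribute, how concentrated is its correlation mass? Values near 1.0
# mean a single latent dimension "owns" the attribute (more disentangled).
concentration = corr.max(axis=0) / corr.sum(axis=0)
print(concentration)
```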
This paper introduces NExT-Vid, a way to teach a video model by asking it to guess the next frame of a video while parts of the past are hidden.
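A minimal sketch of that training signal, assuming a simple patch-masking scheme and an MSE loss (NExT-Vid's real architecture, masking ratio, and objective may differ): hide parts of the past frames and train the model to predict the next one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, H, W = 4, 8, 3, 32, 32                      # batch, past frames, channels, size
video = torch.rand(B, T + 1, C, H, W)                # T past frames + 1 future frame
past, next_frame = video[:, :T], video[:, T]

# Randomly mask about half of the 8x8 patches in the past frames.
mask = (torch.rand(B, T, 1, H // 8, W // 8) > 0.5).float()
mask = F.interpolate(mask.flatten(0, 1), scale_factor=8).view(B, T, 1, H, W)
masked_past = past * mask

model = nn.Sequential(nn.Flatten(), nn.Linear(T * C * H * W, 1024), nn.ReLU(),
                      nn.Linear(1024, C * H * W), nn.Unflatten(1, (C, H, W)))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

pred = model(masked_past)                            # guess the next frame from hidden past
loss = F.mse_loss(pred, next_frame)
loss.backward()
opt.step()
```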
This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).
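To make the contrast tangible, the sketch below pulls both signals out of the same (stand-in) encoder: a pooled vector for global semantics, the full feature map for spatial structure, and a position-shuffled map as a control that keeps the content but scrambles the arrangement. This only illustrates the distinction; it is not the paper's experimental setup.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 64, kernel_size=8, stride=8)   # stand-in for a pretrained encoder
x = torch.rand(2, 3, 64, 64)
feat = encoder(x)                                      # (2, 64, 8, 8): a feature per location

global_semantics = feat.mean(dim=(2, 3))               # (2, 64): big-picture summary, no layout
spatial_structure = feat                               # (2, 64, 8, 8): what sits where

# Control: same per-location features, scrambled positions (content without structure).
B, C, H, W = feat.shape
perm = torch.randperm(H * W)
shuffled = feat.flatten(2)[:, :, perm].view(B, C, H, W)
print(global_semantics.shape, spatial_structure.shape, shuffled.shape)
```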