DINO-SAE is a new autoencoder that preserves both the meaning of an image (semantics) and its tiny textures (fine details).
This paper asks whether generation training benefits more from an encoder's big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).