Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).
This paper asks whether generation training benefits more from an encoderβs big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).