Papers (2)

#patch tokens

Locality-Attending Vision Transformer

Sina Hajimiri, Farzad Beizaee et al. · Mar 5 · arXiv

Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).

#Vision Transformer #self-attention #segmentation


What matters for Representation Alignment: Global Information or Spatial Structure?

Intermediate

Jaskirat Singh, Xingjian Leng et al. · Dec 11 · arXiv

This paper asks whether generation training benefits more from an encoder’s big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).

#representation alignment #REPA #iREPA
