Next-Embedding Prediction Makes Strong Vision Learners
BeginnerSihan Xu, Ziqiao Ma et al.Dec 18arXiv
This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patchβs embedding in an image sequence, just like language models predict the next word.
#self-supervised learning#vision transformer#autoregression