Papers (2)

#stop-gradient

FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

Robots learn better when they predict short, meaningful summaries of future images instead of drawing every pixel of the future scene.

#world modeling · #vision-language-action (VLA) · #diffusion policy

Next-Embedding Prediction Makes Strong Vision Learners

Beginner

Sihan Xu, Ziqiao Ma et al. · Dec 18 · arXiv

This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch’s embedding in an image sequence, just like language models predict the next word.

#self-supervised learning · #vision transformer · #autoregression
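
To make the next-embedding idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the use of negative cosine similarity as the loss, and the toy 2-dimensional embeddings are all assumptions made for illustration. In a real autodiff framework the target embeddings would presumably be treated as constants (a stop-gradient), which plain Python cannot show.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def next_embedding_loss(predicted, targets):
    """Average negative cosine similarity for next-embedding prediction.

    predicted[i] is the guess made at patch i for the embedding of patch i + 1.
    targets[i] is the actual embedding of patch i.
    Hypothetical helper: the loss choice is an assumption, not taken from the paper.
    """
    if len(targets) < 2 or len(predicted) != len(targets):
        raise ValueError("need two or more patches and matching sequence lengths")
    # Shift by one: the prediction at position i is scored against patch i + 1,
    # the same way a language model at token i is scored against token i + 1.
    # The last prediction has no following patch, so it is dropped.
    pairs = list(zip(predicted[:-1], targets[1:]))
    return -sum(cosine_similarity(p, t) for p, t in pairs) / len(pairs)


# Toy example with three patches and 2-dimensional embeddings.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
good_guess = [[0.0, 2.0], [3.0, 3.0], [5.0, 5.0]]     # points the same way as the next patch
bad_guess = [[0.0, -2.0], [-3.0, -3.0], [5.0, 5.0]]   # points the opposite way

print(next_embedding_loss(good_guess, targets))
print(next_embedding_loss(bad_guess, targets))
```

Lower is better here: the loss approaches -1 when every prediction points in the same direction as the next patch's embedding, and approaches +1 when every prediction points the opposite way.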