Papers (4)

#causal masking

Causal World Modeling for Robot Control

Robots used to copy actions from videos without truly understanding how the world changes, so they often messed up long, multi-step jobs.

#robot world model  #autoregressive diffusion  #causal masking

Not triaged yet

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate

Shijie Lian, Bin Yu et al. · Jan 21 · arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.

#Vision-Language-Action  #Bayesian decomposition  #Latent Action Queries

Not triaged yet

Next-Embedding Prediction Makes Strong Vision Learners

Beginner

Sihan Xu, Ziqiao Ma et al. · Dec 18 · arXiv

This paper introduces NEPA, a very simple way to teach vision models by having them predict the next patch’s embedding in an image sequence, just as language models predict the next word.

#self-supervised learning  #vision transformer  #autoregression

Not triaged yet

End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Intermediate

Yuwei Guo, Ceyuan Yang et al. · Dec 17 · arXiv

This paper fixes a common problem in video-making AIs where tiny mistakes snowball over time and ruin long videos.

#autoregressive video diffusion  #exposure bias  #teacher forcing

Not triaged yet