WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.
Robots learn better when they get small hints at every step instead of only a final thumbs-up or thumbs-down.
This paper studies Vision–Language–Action (VLA) models for robots under one fair setup to find which design choices truly matter.
LingBot-VLA is a robot brain that listens to language, looks at the world, and chooses smooth actions to get tasks done.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking and ignore the instructions you give them.
TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.
FOFPred is a new AI model that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move over the next few moments.
Robots learn best from what they would actually see, a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.
RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.
FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.