LingBot-VLA is a robot brain that listens to language, looks at the world, and chooses smooth actions to get tasks done.
IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of pictures while following language instructions.
This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'
Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.
TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.
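The core "freeze one half, train the other" idea can be pictured in a few lines of PyTorch. Everything in this sketch is a hypothetical illustration under assumed names (GeneralistVLM, SpecialistActionExpert), dimensions, and a placeholder loss; it is not the paper's actual code.

```python
import torch
import torch.nn as nn

class GeneralistVLM(nn.Module):
    """Stands in for a pretrained vision-language backbone (kept frozen)."""
    def __init__(self, dim=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
    def forward(self, obs_tokens):
        return self.encoder(obs_tokens)

class SpecialistActionExpert(nn.Module):
    """Trainable half: maps frozen features to continuous robot actions."""
    def __init__(self, dim=512, action_dim=7):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim))
    def forward(self, features):
        return self.head(features)

generalist = GeneralistVLM()
specialist = SpecialistActionExpert()

# Freeze the generalist so its pretrained world knowledge is not overwritten.
for p in generalist.parameters():
    p.requires_grad = False

# Only the specialist's parameters are optimized.
optimizer = torch.optim.AdamW(specialist.parameters(), lr=1e-4)

obs_tokens = torch.randn(2, 768)        # fake fused visual/language features
with torch.no_grad():                   # the frozen half runs in inference mode
    features = generalist(obs_tokens)
actions = specialist(features)          # e.g. a 7-DoF end-effector command
loss = actions.pow(2).mean()            # placeholder loss for illustration
loss.backward()
optimizer.step()
```

Because the generalist never receives gradients, its general knowledge stays intact while the specialist learns precise motor control.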
FOFPred is a new AI model that reads one or two images plus a short instruction like “move the bottle left to right” and then predicts how every pixel will move in the moments that follow.
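As a rough sketch of what "predicting how every pixel will move" means in code: a model maps an image plus an instruction embedding to a dense two-channel flow field, one horizontal and one vertical displacement per pixel. The FOFPredictor class and its layers below are made-up placeholders for illustration, not the paper's released model.

```python
import torch
import torch.nn as nn

class FOFPredictor(nn.Module):
    """Toy stand-in: an image plus an instruction embedding -> a dense flow field."""
    def __init__(self, text_dim=64):
        super().__init__()
        self.image_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.text_proj = nn.Linear(text_dim, 16)
        self.flow_head = nn.Conv2d(16, 2, kernel_size=3, padding=1)  # (dx, dy) per pixel

    def forward(self, image, text_embedding):
        feat = self.image_conv(image)                           # [B, 16, H, W]
        cond = self.text_proj(text_embedding)[:, :, None, None] # broadcast over pixels
        return self.flow_head(feat + cond)                      # [B, 2, H, W]

model = FOFPredictor()
image = torch.randn(1, 3, 128, 128)   # current frame
instruction = torch.randn(1, 64)      # e.g. an embedding of "move the bottle left to right"
flow = model(image, instruction)      # predicted per-pixel motion
print(flow.shape)                     # torch.Size([1, 2, 128, 128])
```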
Robots learn best from what they would actually see, which is a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.
RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.
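A "spatial trace" can be pictured as an ordered list of 3D waypoints the robot's hand should pass through. The Waypoint schema and the straight-line interpolation below are assumptions for illustration only; the paper's actual trace format may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Waypoint:
    x: float            # metres in the robot's base frame (assumed convention)
    y: float
    z: float
    gripper_open: bool

def interpolate_trace(start: Waypoint, goal: Waypoint, steps: int) -> List[Waypoint]:
    """Linearly interpolate a straight-line trace between two waypoints."""
    trace = []
    for i in range(steps + 1):
        t = i / steps
        trace.append(Waypoint(
            x=start.x + t * (goal.x - start.x),
            y=start.y + t * (goal.y - start.y),
            z=start.z + t * (goal.z - start.z),
            gripper_open=start.gripper_open if t < 1.0 else goal.gripper_open,
        ))
    return trace

# An instruction like "lift the cup 20 cm" might reduce to a short trace:
trace = interpolate_trace(Waypoint(0.4, 0.1, 0.05, False),
                          Waypoint(0.4, 0.1, 0.25, False), steps=5)
for wp in trace:
    print(wp)
```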
Robots often see only flat pictures but must act in a 3D world, which makes accurate actions hard.
FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.
LEO-RobotAgent is a simple but powerful framework that lets a language model think, plan, and operate many kinds of robots using natural language.