How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (11) · #robot manipulation

A Pragmatic VLA Foundation Model

Intermediate
Wei Wu, Fan Lu et al. · Jan 26 · arXiv

LingBot-VLA is a robot brain that listens to language, looks at the world, and decides smooth actions to get tasks done.

#Vision-Language-Action #foundation model #Flow Matching
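
The Flow Matching tag above refers to how LingBot-VLA generates its "smooth actions." The snippet below is not from the paper; it is a minimal, generic flow-matching action head in PyTorch, where a velocity network learns to carry a noisy action chunk toward the expert action chunk, conditioned on vision-language features, and sampling integrates that velocity with a few Euler steps. The module names, sizes, and conditioning scheme are illustrative assumptions.

```python
# Minimal flow-matching action head (illustrative; not LingBot-VLA's actual code).
# A velocity net v(a_t, t, context) is trained so that integrating it from noise
# reproduces expert action chunks, conditioned on vision-language features.
import torch
import torch.nn as nn

class FlowMatchingActionHead(nn.Module):
    def __init__(self, action_dim=7, horizon=16, ctx_dim=512, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        in_dim = horizon * action_dim + ctx_dim + 1  # noisy actions + context + time
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def velocity(self, a_t, t, ctx):
        x = torch.cat([a_t.flatten(1), ctx, t[:, None]], dim=-1)
        return self.net(x).view_as(a_t)

    def loss(self, actions, ctx):
        """Flow-matching regression: v(a_t, t) should match (actions - noise)."""
        noise = torch.randn_like(actions)
        t = torch.rand(actions.shape[0], device=actions.device)
        a_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions
        target = actions - noise
        return ((self.velocity(a_t, t, ctx) - target) ** 2).mean()

    @torch.no_grad()
    def sample(self, ctx, steps=10):
        """Euler integration from noise to a smooth action chunk."""
        a = torch.randn(ctx.shape[0], self.horizon, self.action_dim, device=ctx.device)
        for i in range(steps):
            t = torch.full((ctx.shape[0],), i / steps, device=ctx.device)
            a = a + self.velocity(a, t, ctx) / steps
        return a

head = FlowMatchingActionHead()
ctx = torch.randn(4, 512)  # stand-in for the VLM's vision-language features
print(head.loss(torch.randn(4, 16, 7), ctx).item(), head.sample(ctx).shape)
```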

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Beginner
Jongwoo Park, Kanchana Ranasinghe et al. · Jan 22 · arXiv

IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of pictures while following language instructions.

#Vision-Language-Action #affinity map #training-free guidance
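
The summary above says IVRA is training-free and works with an affinity map over visual tokens. How the paper actually injects its hints is specific to IVRA; the sketch below is only a generic illustration of the idea under my own assumptions: compute pairwise similarities between patch tokens at inference time and use them to gently pull mutually similar tokens together, so the policy keeps more of the image's 2D layout without any fine-tuning.

```python
# Illustrative training-free guidance with a visual-token affinity map.
# Not IVRA's actual procedure; a generic sketch of nudging patch features
# toward their affine neighbors at inference time, with no training involved.
import torch
import torch.nn.functional as F

def affinity_guided_tokens(vis_tokens, strength=0.3):
    """vis_tokens: (B, N, D) patch features from a frozen vision encoder."""
    normed = F.normalize(vis_tokens, dim=-1)
    affinity = torch.softmax(normed @ normed.transpose(1, 2) / 0.07, dim=-1)  # (B, N, N)
    smoothed = affinity @ vis_tokens        # each token mixed with similar tokens
    return (1 - strength) * vis_tokens + strength * smoothed  # a hint, not an overwrite

tokens = torch.randn(2, 196, 768)            # e.g. a 14x14 ViT patch grid
print(affinity_guided_tokens(tokens).shape)  # guided tokens fed on to the action policy
```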

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Intermediate
Jianshu Zhang, Chengxuan Qian et al. · Jan 21 · arXiv

This paper asks a new question for vision-language models: not just 'What do you see?' but 'How far along is the task right now?'

#progress reasoning #vision-language models #episodic retrieval

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate
Shijie Lian, Bin Yu et al. · Jan 21 · arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.

#Vision-Language-Action #Bayesian decomposition #Latent Action Queries

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Intermediate
Bin Yu, Shijie Lian et al. · Jan 20 · arXiv

TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.

#Vision-Language-Action #catastrophic forgetting #Asymmetric Mixture-of-Transformers
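
The "two halves" described above (a frozen generalist plus a trainable specialist) can be pictured with the small sketch below. It is not TwinBrainVLA's architecture: the layer sizes, the mean-pooled fusion, and the action head are all assumptions, and it only shows the generic pattern of freezing one branch to preserve world knowledge while training the other for control.

```python
# Illustrative frozen-generalist / trainable-specialist policy (not TwinBrainVLA itself).
import torch
import torch.nn as nn

class TwinBranchPolicy(nn.Module):
    def __init__(self, dim=512, action_dim=7):
        super().__init__()
        gen = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.generalist = nn.TransformerEncoder(gen, num_layers=2)   # stands in for a pretrained VLM
        for p in self.generalist.parameters():
            p.requires_grad_(False)                                  # keep world knowledge intact
        spec = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.specialist = nn.TransformerEncoder(spec, num_layers=2)  # learns precise control
        self.action_head = nn.Linear(2 * dim, action_dim)

    def forward(self, tokens):
        with torch.no_grad():                      # frozen half never updates
            general = self.generalist(tokens)
        special = self.specialist(tokens)          # trainable half adapts to the robot
        fused = torch.cat([general.mean(dim=1), special.mean(dim=1)], dim=-1)
        return self.action_head(fused)

policy = TwinBranchPolicy()
print(policy(torch.randn(2, 64, 512)).shape)       # (2, 7) action vector
```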

Future Optical Flow Prediction Improves Robot Control & Video Generation

Intermediate
Kanchana Ranasinghe, Honglu Zhou et al. · Jan 15 · arXiv

FOFPred is a new AI that reads one or two images plus a short instruction like “move the bottle left to right,” and then predicts how every pixel will move in the next moments.

#optical flow #future optical flow prediction #vision-language model
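
The summary above describes an input/output contract: frames plus a short instruction go in, a dense future flow field (a per-pixel motion vector) comes out. The placeholder below only makes those shapes concrete; the real FOFPred internals are in the paper, and the encoder here is an invented stand-in.

```python
# Shape-level sketch of "image + instruction -> future optical flow" (not FOFPred's model).
import torch
import torch.nn as nn

class FutureFlowPredictor(nn.Module):
    def __init__(self, text_dim=512):
        super().__init__()
        self.image_enc = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # placeholder image encoder
        self.text_proj = nn.Linear(text_dim, 64)
        self.flow_head = nn.Conv2d(64, 2, kernel_size=3, padding=1)   # 2 channels: dx, dy per pixel

    def forward(self, frame, text_emb):
        feat = self.image_enc(frame) + self.text_proj(text_emb)[:, :, None, None]
        return self.flow_head(feat)                # (B, 2, H, W) predicted future motion

model = FutureFlowPredictor()
frame = torch.randn(1, 3, 128, 128)                # current camera image
text = torch.randn(1, 512)                         # embedding of "move the bottle left to right"
flow = model(frame, text)
print(flow.shape)                                  # flow can condition a controller or a video generator
```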

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Intermediate
Xiaopeng Lin, Shijie Lian et al. · Dec 18 · arXiv

Robots learn best from what they would actually see, which is a first-person (egocentric) view, but most AI models are trained on third-person videos and get confused.

#egocentric vision #first-person video #vision-language model

RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics

Intermediate
Enshen Zhou, Cheng Chi et al. · Dec 15 · arXiv

RoboTracer is a vision-language model that turns tricky, word-only instructions into safe, step-by-step 3D paths (spatial traces) robots can follow.

#RoboTracer #spatial trace #3D spatial referring
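
A "spatial trace" as described above is, in essence, an ordered set of 3D waypoints tied to an instruction. The tiny container below illustrates that idea only; it is not RoboTracer's actual output format, which the paper defines.

```python
# Illustrative container for a spatial trace: an instruction plus ordered 3D waypoints.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpatialTrace:
    instruction: str                               # e.g. "move the cup 20 cm to the right"
    waypoints: List[Tuple[float, float, float]]    # x, y, z in the robot's frame, in order

    def length(self) -> float:
        """Total path length, a quick sanity check on a generated trace."""
        return sum(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for p, q in zip(self.waypoints, self.waypoints[1:])
        )

trace = SpatialTrace("move the cup 20 cm to the right", [(0.4, 0.0, 0.2), (0.4, 0.2, 0.2)])
print(trace.length())   # 0.2
```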

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Intermediate
Yicheng Feng, Wanpeng Zhang et al. · Dec 15 · arXiv

Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.

#Vision-Language-Action #3D spatial grounding #visual-physical alignment

FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Intermediate
Yulu Gan, Ligeng Zhu et al. · Dec 11 · arXiv

FoundationMotion is a fully automatic pipeline that turns raw videos into detailed motion data, captions, and quizzes about how things move.

#motion understanding #spatio-temporal reasoning #video question answering
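
The pipeline described above (raw video in; motion data, captions, and quizzes out) can be pictured with the toy skeleton below. Every stage is a hypothetical stand-in operating on fabricated tracks, invented here for illustration; it is not FoundationMotion's method.

```python
# Toy auto-labeling skeleton: video -> object tracks -> motion caption -> QA pairs.
from typing import Dict, List, Tuple

Track = List[Tuple[float, float]]   # (x, y) center of one object per frame

def extract_tracks(video_frames: List[object]) -> Dict[str, Track]:
    """Stand-in for a tracker: here we simply fabricate one moving object."""
    return {"cup": [(0.1 * t, 0.5) for t in range(len(video_frames))]}

def describe_motion(tracks: Dict[str, Track]) -> str:
    name, track = next(iter(tracks.items()))
    direction = "right" if track[-1][0] > track[0][0] else "left"
    return f"The {name} moves to the {direction}."

def make_qa(caption: str) -> List[Tuple[str, str]]:
    return [("Which way does the object move?", caption.split()[-1].rstrip("."))]

frames = list(range(10))             # placeholder for decoded video frames
caption = describe_motion(extract_tracks(frames))
print(caption, make_qa(caption))     # auto-generated motion caption plus a quiz item
```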

LEO-RobotAgent: A General-purpose Robotic Agent for Language-driven Embodied Operator

Intermediate
Lihuang Chen, Xiangyu Luo et al. · Dec 11 · arXiv

LEO-RobotAgent is a simple but powerful framework that lets a language model think, plan, and operate many kinds of robots using natural language.

#LEO-RobotAgent #language-driven robotics #LLM agent
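
The "think, plan, and operate" framing above is, at its core, an LLM agent loop. The bare-bones loop below is only in that spirit: the llm() call and the robot interface are stubs invented for illustration, while LEO-RobotAgent's real prompting, tools, and robot bindings are defined in the paper.

```python
# Bare-bones language-driven agent loop (stubs for illustration; not LEO-RobotAgent's code).
from typing import List

def llm(prompt: str) -> str:
    """Hypothetical language-model call; returns one canned plan here."""
    return "move_to(shelf); grasp(book); move_to(table); release()"

class StubRobot:
    def execute(self, action: str) -> str:
        return f"ok: {action}"                 # a real robot would report success or failure

def run_agent(task: str, robot: StubRobot) -> List[str]:
    plan = llm(f"Task: {task}\nWrite robot actions, separated by ';'.")
    feedback = []
    for action in [a.strip() for a in plan.split(";")]:
        feedback.append(robot.execute(action))   # feedback could be fed back into llm() for replanning
    return feedback

print(run_agent("put the book on the table", StubRobot()))
```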