How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (15)

#Vision-Language-Action

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Intermediate
I. Apanasevich, M. Artemyev et al. · Jan 31 · arXiv

Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.

#Vision-Language-Action #Unified Action Space #Multi-embodiment Pretraining

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Intermediate
Haozhe Xie, Beichen Wen et al. · Jan 29 · arXiv

DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving.

#Dynamic object manipulation #Vision-Language-Action #Continuous inference

IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Beginner
Jongwoo Park, Kanchana Ranasinghe et al. · Jan 22 · arXiv

IVRA is a simple, training-free add-on that helps robot brains keep the 2D shape of pictures while following language instructions.

#Vision-Language-Action #affinity map #training-free guidance

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate
Shijie Lian, Bin Yu et al. · Jan 21 · arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.

#Vision-Language-Action #Bayesian decomposition #Latent Action Queries

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Intermediate
Bin Yu, Shijie Lian et al. · Jan 20 · arXiv

TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely.

#Vision-Language-Action #catastrophic forgetting #Asymmetric Mixture-of-Transformers

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Intermediate
Linqing Zhong, Yi Liu et al. · Jan 16 · arXiv

Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.

#Vision-Language-Action #Action Chain-of-Thought #Explicit Action Reasoner

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning

Intermediate
Chi-Pin Huang, Yunze Man et al. · Jan 14 · arXiv

Fast-ThinkAct teaches a robot to plan with a few tiny hidden "thought tokens" instead of long paragraphs, making it much faster while staying smart.

#Vision-Language-Action #latent reasoning #verbalizable planning

VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory

Intermediate
Shaoan Wang, Yuanfei Luo et al. · Jan 13 · arXiv

VLingNav is a robot navigation system that sees, reads instructions, and acts, while deciding when to think hard and when to just move.

#Vision-Language-Action #embodied navigation #adaptive chain-of-thought

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Intermediate
Tianshuai Hu, Xiaolu Liu et al. · Dec 18 · arXiv

Traditional self-driving used separate boxes for seeing, thinking, and acting, but tiny mistakes in early boxes could snowball into big problems later.

#Vision-Language-Action #End-to-End Autonomous Driving #Dual-System VLA

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Intermediate
Yicheng Feng, Wanpeng Zhang et al. · Dec 15 · arXiv

Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.

#Vision-Language-Action #3D spatial grounding #visual-physical alignment

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Intermediate
Zhe Liu, Runhui Huang et al. · Dec 14 · arXiv

DrivePI is a single, small (0.5B) multimodal language model that sees with cameras and LiDAR, talks in natural language, and plans driving actions all at once.

#DrivePI #Vision-Language-Action #3D occupancy

An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges

Intermediate
Chao Xu, Suyu Zhang et al. · Dec 12 · arXiv

Vision-Language-Action (VLA) models are robots’ “see–think–do” brains that connect cameras (vision), words (language), and motors (action); a minimal sketch of that interface follows this entry.

#Vision-Language-Action #Embodied AI #Multimodal Alignment
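
To make the see–think–do framing above concrete, here is a minimal toy sketch of the interface a VLA policy exposes: a camera image and a language instruction go in, a continuous motor command comes out. Everything in it (the ToyVLAPolicy class, its encode/act methods, the 7-dimensional action) is an illustrative assumption, not the method of any paper listed here.

```python
import numpy as np


class ToyVLAPolicy:
    """Illustrative see-think-do interface (a hypothetical sketch, not from any listed paper)."""

    def __init__(self, action_dim: int = 7, seed: int = 0):
        # action_dim is an assumption: e.g. a 6-DoF end-effector delta plus a gripper command.
        self.action_dim = action_dim
        self.rng = np.random.default_rng(seed)

    def encode(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Stand-in for a vision-language backbone: pools the image per channel and
        # appends a crude text feature, producing a fixed-size joint embedding.
        vis = image.astype(np.float32).mean(axis=(0, 1))
        txt = np.array([float(len(instruction))], dtype=np.float32)
        return np.concatenate([vis, txt])

    def act(self, image: np.ndarray, instruction: str) -> np.ndarray:
        # Stand-in for an action head: maps the joint embedding to a bounded motor command.
        features = self.encode(image, instruction)
        weights = self.rng.standard_normal((self.action_dim, features.shape[0]))
        return np.tanh(weights @ features / 100.0)


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    frame = np.zeros((224, 224, 3), dtype=np.uint8)        # dummy camera frame ("see")
    action = policy.act(frame, "pick up the red block")    # instruction ("think") -> action ("do")
    print(action.shape)                                    # (7,)
```

The point is only the signature: the papers in this list differ in how the encoder and action head are built and trained, not in this overall image-plus-instruction-to-action contract.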