How I Study AI - Learn AI Papers & Lectures the Easy Way

Papers (21)

#Vision-Language-Action

Chain of World: World Model Thinking in Latent Motion

Intermediate
Fuxiang Yang, Donglin Di et al. · Mar 3 · arXiv

Robots learn better when they reason about how things move over time, rather than by redrawing every pixel of a video (see the sketch below).

#Vision-Language-Action · #World Model · #Latent Motion
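
To make the one-liner above concrete, here is a minimal sketch of the general idea of world-model thinking in latent motion: encode frames into compact latents and predict how the latent moves, instead of reconstructing pixels. Everything here (module sizes, the 4-D action, the MSE objective) is an illustrative assumption, not the paper's architecture.

```python
# A minimal sketch (not the paper's code) of "think in latent motion":
# instead of predicting every pixel of the next frame, encode frames into
# small latent vectors and predict how that latent moves over time.
import torch
import torch.nn as nn

LATENT_DIM = 64  # hypothetical size

class LatentMotionModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 64x64 RGB frame -> compact latent (sizes are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, LATENT_DIM),
        )
        # Dynamics head: (current latent, action) -> predicted latent *motion*.
        self.dynamics = nn.Sequential(
            nn.Linear(LATENT_DIM + 4, 128), nn.ReLU(),
            nn.Linear(128, LATENT_DIM),
        )

    def forward(self, frame_t, action_t):
        z_t = self.encoder(frame_t)
        delta_z = self.dynamics(torch.cat([z_t, action_t], dim=-1))
        return z_t + delta_z  # predicted next latent; no pixels redrawn

model = LatentMotionModel()
frame = torch.randn(8, 3, 64, 64)   # batch of fake frames
action = torch.randn(8, 4)          # hypothetical 4-D robot actions
z_next_pred = model(frame, action)

# Train by matching the *encoded* next frame, never reconstructing pixels.
# Stop-gradient on the target is a common anti-collapse trick (assumption).
next_frame = torch.randn(8, 3, 64, 64)
loss = nn.functional.mse_loss(z_next_pred, model.encoder(next_frame).detach())
```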

World Guidance: World Modeling in Condition Space for Action Generation

Intermediate
Yue Su, Sijin Chen et al. · Feb 25 · arXiv

WoG (World Guidance) teaches a robot to imagine just the right bits of the near future and use those bits to pick better actions.

#Vision-Language-Action · #world modeling · #condition space

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

Intermediate
Jingxuan Zhang, Yunta Hsieh et al. · Feb 23 · arXiv

Vision-Language-Action (VLA) robots are powerful but too big and slow for many real-world devices (see the quantization sketch below).

#Vision-Language-Action · #Post-Training Quantization · #Diffusion Transformer
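
For readers new to the tag, here is a generic post-training quantization sketch: symmetric per-channel INT8 scales computed from the weights, then a quick error check on held-out inputs. This is the textbook PTQ recipe, not QuantVLA's scale-calibration method; the toy layer and sizes are assumptions.

```python
# A minimal sketch of post-training quantization (generic PTQ, not
# QuantVLA's algorithm): pick a scale per weight channel, round to INT8,
# and check the resulting output error on a few calibration inputs.
import torch

def quantize_per_channel(w: torch.Tensor, n_bits: int = 8):
    """Symmetric per-output-channel quantization of a [out, in] weight."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = w.abs().amax(dim=1, keepdim=True) / qmax  # one scale per row
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale

w = torch.randn(256, 512)            # a toy linear layer's weights
q, scale = quantize_per_channel(w)
w_hat = dequantize(q, scale)

# "Calibration" in the simplest sense: measure the output error on a few
# held-out inputs and keep the scales only if the error is acceptable.
x = torch.randn(32, 512)
err = (x @ w.T - x @ w_hat.T).abs().mean()
print(f"mean output error after INT8 PTQ: {err.item():.4f}")
```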

VLANeXt: Recipes for Building Strong VLA Models

Intermediate
Xiao-Ming Wu, Bin Fan et al. · Feb 20 · arXiv

This paper studies Vision-Language-Action (VLA) robots under one fair setup to find out which design choices truly matter (see the flow-matching sketch below).

#Vision-Language-Action · #robot manipulation · #flow matching
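
Since flow matching appears in the tags, a minimal sketch of that general technique for action generation may help: a small network learns the straight-line velocity carrying noise to expert actions, then integrates it at inference. The 7-D action space and network sizes are hypothetical, not from the paper.

```python
# A minimal sketch of flow matching for action generation (the general
# technique named in the tags, not VLANeXt's recipe).
import torch
import torch.nn as nn

ACTION_DIM = 7  # hypothetical, e.g. a 7-DoF arm

velocity_net = nn.Sequential(      # v(x_t, t) -> predicted velocity
    nn.Linear(ACTION_DIM + 1, 128), nn.ReLU(),
    nn.Linear(128, ACTION_DIM),
)
opt = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def flow_matching_step(actions: torch.Tensor):
    """One training step: regress the straight-line velocity x1 - x0."""
    x1 = actions                                  # real expert actions
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(x1.shape[0], 1)                # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                   # point on the straight path
    target_v = x1 - x0                            # constant velocity target
    pred_v = velocity_net(torch.cat([x_t, t], dim=-1))
    loss = nn.functional.mse_loss(pred_v, target_v)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample_action(steps: int = 10):
    """Inference: integrate from noise to an action with Euler steps."""
    x = torch.randn(1, ACTION_DIM)
    for i in range(steps):
        t = torch.full((1, 1), i / steps)
        x = x + velocity_net(torch.cat([x, t], dim=-1)) / steps
    return x

flow_matching_step(torch.randn(64, ACTION_DIM))
print(sample_action())
```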

GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

Intermediate
GigaBrain Team, Boyuan Wang et al. · Feb 12 · arXiv

GigaBrain-0.5M* is a robot brain that sees, reads, and acts, and it gets smarter by imagining the future before moving.

#Vision-Language-Action · #World Model · #Reinforcement Learning

RISE: Self-Improving Robot Policy with Compositional World Model

Intermediate
Jiazhi Yang, Kunyang Lin et al. · Feb 11 · arXiv

RISE lets a robot learn safely and cheaply by practicing in its imagination instead of always in the real world.

#Reinforcement Learning · #World Models · #Compositional World Model

Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

Intermediate
Yalcin Tur, Jalal Naghiyev et al. · Feb 8 · arXiv

Robots often spend the same amount of thinking on every move, which wastes time on easy steps and isn't enough for tricky ones (see the sketch below).

#Recurrent depth · #Latent iterative reasoning · #Vision-Language-Action
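
A minimal sketch of the general recurrent-depth idea: one weight-shared block is applied to a latent state a variable number of times, so test-time compute scales with difficulty. The dimensions and the fixed iteration counts are illustrative assumptions; the paper's model and any adaptive stopping rule may differ.

```python
# A minimal sketch of recurrent depth / latent iterative reasoning (the
# general mechanism the title names, not this paper's architecture): one
# shared block is applied repeatedly to a latent state, so the model can
# "think longer" at test time simply by running more iterations.
import torch
import torch.nn as nn

class RecurrentDepthHead(nn.Module):
    def __init__(self, dim: int = 256, action_dim: int = 7):
        super().__init__()
        # One block, reused at every iteration (weights are shared).
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim),
        )
        self.readout = nn.Linear(dim, action_dim)

    def forward(self, z: torch.Tensor, n_iters: int):
        for _ in range(n_iters):
            z = z + self.block(z)   # residual latent update = one "thought"
        return self.readout(z)

head = RecurrentDepthHead()
z = torch.randn(1, 256)             # latent from a (hypothetical) VLA encoder

easy_action = head(z, n_iters=2)    # cheap: few iterations for easy steps
hard_action = head(z, n_iters=16)   # same weights, more compute for hard steps
```

In practice the iteration count could be chosen adaptively, for example by stopping once the latent stops changing, though that policy is an assumption here rather than something the one-line summary specifies.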

Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

Intermediate
I. Apanasevich, M. Artemyev et al. · Jan 31 · arXiv

Green-VLA is a step-by-step training recipe that teaches one model to see, understand language, and move many kinds of robots safely and efficiently.

#Vision-Language-Action · #Unified Action Space · #Multi-embodiment Pretraining

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Intermediate
Haozhe Xie, Beichen Wen et al. · Jan 29 · arXiv

DynamicVLA is a small and fast robot brain that sees, reads, and acts while things are moving.

#Dynamic object manipulation · #Vision-Language-Action · #Continuous inference

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Intermediate
Shijie Lian, Bin Yu et al. · Jan 21 · arXiv

Robots often learn a bad habit called the vision shortcut: they guess the task just by looking, and ignore the words you tell them.

#Vision-Language-Action · #Bayesian decomposition · #Latent Action Queries

TwinBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

Intermediate
Bin Yu, Shijie Lian et al. · Jan 20 · arXiv

TwinBrainVLA is a robot brain with two halves: a frozen generalist that keeps world knowledge safe and a trainable specialist that learns to move precisely (see the sketch below).

#Vision-Language-Action · #catastrophic forgetting · #Asymmetric Mixture-of-Transformers
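
A minimal sketch of the freeze-half/train-half pattern the summary describes, under the assumption of a simple feature-passing interface (TwinBrainVLA's actual Asymmetric Mixture-of-Transformers wiring is more involved): the generalist's parameters are frozen so its knowledge cannot be overwritten, and only the specialist receives gradients.

```python
# A minimal sketch of the general freeze-one-half, train-the-other idea,
# not TwinBrainVLA's architecture: a frozen "generalist" keeps its
# pretrained knowledge intact while a small trainable "specialist" learns
# the new motor task.
import torch
import torch.nn as nn

generalist = nn.Sequential(          # stands in for a pretrained VLM
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512),
)
for p in generalist.parameters():    # freeze: no forgetting possible here
    p.requires_grad = False

specialist = nn.Sequential(          # trainable action expert
    nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7),
)

opt = torch.optim.Adam(specialist.parameters(), lr=1e-4)  # specialist only

obs = torch.randn(16, 512)                 # fake observation features
target_actions = torch.randn(16, 7)        # fake expert actions

with torch.no_grad():                      # generalist runs inference-only
    features = generalist(obs)
loss = nn.functional.mse_loss(specialist(features), target_actions)
opt.zero_grad(); loss.backward(); opt.step()
```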

ACoT-VLA: Action Chain-of-Thought for Vision-Language-Action Models

Intermediate
Linqing Zhong, Yi Liu et al. · Jan 16 · arXiv

Robots usually think in words and pictures, but their hands need exact motions, so there is a gap between understanding and doing.

#Vision-Language-Action · #Action Chain-of-Thought · #Explicit Action Reasoner