This paper studies Vision–Language–Action (VLA) models for robots under a single, fair setup to find which design choices truly matter.
Robots often use the same amount of thinking for easy and hard moves, which wastes time on easy steps and falls short on tricky ones.
Robots usually learn by copying many demonstrations, which is expensive and makes them brittle when things change.
Robots often see the world as flat pictures but must move in a 3D world, which makes accurate actions hard.