Vision-Language-Action (VLA) models for robots are powerful, but they are too large and too slow to run on many real-world devices.
Robots often learn good hand motions during training but get confused at test time when the scene or the instructions change, even slightly.