Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Intermediate · Yicheng Feng, Wanpeng Zhang et al. · Dec 15 · arXiv
Robots often see the world as flat 2D pictures but must move in a 3D world, which makes it hard for them to act accurately.
#Vision-Language-Action #3D spatial grounding #visual-physical alignment