Robots learn faster and more flexibly when they can use human touch data, but humans and robots feel touch with very different sensors.
This paper asks whether generation training benefits more from an encoderβs big-picture meaning (global semantics) or from how features are arranged across space (spatial structure).