This paper shows a simple, repeatable way to teach general Vision-Language Models (VLMs) to understand e-commerce items much better without forgetting their general skills.
Robots like cars and drones see the world with many different sensors (cameras, LiDAR, radar, and even event cameras), and this paper shows a clear roadmap for teaching them to understand space by learning from all of these together.