This paper shows how a humanoid robot can find and pick up many different objects in unfamiliar places, guided by plain-language requests like 'grab the orange mug.'
VG-Refiner is a new way for AI to find the right object in a picture from a text description, even when its helper tools make mistakes.