Vision Transformers (ViTs) are great at recognizing what is in a whole image but often blur the tiny details needed to label each pixel (segmentation).
This paper builds A4-Agent, a smart three-part helper that figures out where to touch or use an object just from a picture and a written instruction, without any extra training.