This paper improves open-vocabulary segmentation, where an AI segments any object you name, by adding a few example pictures with pixel-level labels and a retrieval step.
This paper teaches a humanoid robot to find and pick up many different objects in new places using plain-language requests like 'grab the orange mug.'
Multimodal Large Language Models (MLLMs) often hallucinate on videos because they trust words and common sense more than what the frames actually show.