Text-to-image models draw pretty pictures, but often put things in the wrong places or miss how objects interact.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today's big vision-language AIs are not as good at it as we thought.