This paper teaches AI to pay attention better by training its focus, not just its words.
ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
Real-life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools while answering.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows today’s big vision-language AIs are not as good at it as we thought.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.
VideoCoF is a new way to edit videos that first figures out WHERE to edit and then does the edit, like thinking before acting.
VG-Refiner is a new way for AI to find the right object in a picture when given a description, even if helper tools make mistakes.
The paper shows how a vision-language model (VLM) can train itself to be a fair judge of answers about images without using any human preference labels.