AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
This paper teaches AI to pay attention better by training where the model looks (its visual attention), not just the words it produces.
ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
Real-life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when answering.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows that today’s big vision-language AIs are not as good at it as we thought.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.