AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
This paper builds GUI-Owl-1.5, an AI that can use phones, computers, and web browsers like a careful human helper.
This paper teaches AI to pay attention better by training where the model looks (its visual attention), not just the words it produces.
ObjEmbed teaches an AI to understand not just whole pictures, but each object inside them, and to link those objects to the right words.
WorldVQA is a new test that checks if multimodal AI models can correctly name what they see in pictures without doing extra reasoning.
LaViT is a new way to teach smaller vision-language models to look at the right parts of an image before they speak.
The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.
Real-life directions are often vague, so the paper creates a task where a robot can ask questions while it searches for a very specific object in a big house.
The paper teaches vision-language models (AIs that look and read) to pay attention to the right parts of a picture without needing extra tools when answering.
Visual grounding is when an AI finds the exact thing in a picture that a sentence is talking about, and this paper shows that today’s big vision-language AIs are not as good at it as we thought.
This paper teaches a vision-language model to think about images by talking to copies of itself, using only words to plan and decide.