Papers4

#benchmark evaluation

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao et al.Feb 26arXiv

AgentVista is a new test (benchmark) that checks whether AI agents can solve tough, real-life picture-based problems by using multiple tools over many steps.

#AgentVista#multimodal agents#visual grounding

Not triaged yet

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

Beginner

You Li, Chi Chen et al.Feb 26arXiv

The paper asks a simple question: do the model’s invisible “imagination tokens” actually help it reason about images?

#multimodal large language model#visual reasoning#latent visual reasoning

Not triaged yet

Multimodal Fact-Level Attribution for Verifiable Reasoning

Beginner

David Wan, Han Wang et al.Feb 12arXiv

This paper builds a new test, called MURGAT, to check whether AI models can back up each small fact they say with the right part of a video, audio, or figure.

#multimodal grounding#fact-level attribution#atomic fact decomposition

Not triaged yet

Making Avatars Interact: Towards Text-Driven Human-Object Interaction for Controllable Talking Avatars

Beginner

Youliang Zhang, Zhengguang Zhou et al.Feb 2arXiv

This paper teaches talking avatars not just to speak, but to look around their scene and handle nearby objects exactly as a text instruction says.

#grounded human-object interaction#talking avatars#diffusion transformer

Not triaged yet