Papers (4)

#multimodal agents

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Zhaochen Su, Jincheng Gao et al. · Feb 26 · arXiv

AgentVista is a new benchmark that tests whether AI agents can solve difficult, realistic image-based problems by using multiple tools over many steps.

#AgentVista #multimodal agents #visual grounding

MMA: Multimodal Memory Agent

Intermediate

Yihao Lu, Wanru Cheng et al. · Feb 18 · arXiv

Long-horizon AI assistants can retrieve stale, low-quality, or conflicting memories and then answer overconfidently, which is dangerous.

#memory-augmented LLMs #multimodal agents #reliability scoring

GameDevBench: Evaluating Agentic Capabilities Through Game Development

Intermediate

Wayne Chi, Yixiong Fang et al. · Feb 11 · arXiv

GameDevBench is a new benchmark that tests whether AI agents can actually build parts of video games, rather than just write code in a single file.

#GameDevBench #Godot #multimodal agents

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

Intermediate

Chenlong Deng, Mengjie Deng et al. · Feb 11 · arXiv

Most image search systems judge each photo by itself, which fails when clues are split across many photos taken over time.

#context-aware image retrieval #multimodal agents #visual history exploration
