Papers (5)

#LLM-as-a-judge

Multimodal Fact-Level Attribution for Verifiable Reasoning

This paper builds a new test, called MURGAT, to check whether AI models can back up each small fact they say with the right part of a video, audio, or figure.

#multimodal grounding #fact-level attribution #atomic fact decomposition

DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Beginner

Nikita Gupta, Riju Chatterjee et al. · Jan 28 · arXiv

DeepSearchQA is a new test with 900 real-world style questions that checks if AI agents can find complete lists of answers, not just one fact.

#DeepSearchQA #agentic information retrieval #systematic collation

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Beginner

Dasol Choi, Guijin Son et al. · Jan 7 · arXiv

Real people often ask vague questions with pictures, and today’s vision-language models (VLMs) struggle with them.

#vision-language models #under-specified queries #query explicitation

Video-Browser: Towards Agentic Open-web Video Browsing

Beginner

Zhengyang Liang, Yan Shu et al. · Dec 28 · arXiv

The paper tackles how AI agents can truly research the open web when the answers are hidden inside long, messy videos, not just text.

#agentic video browsing #pyramidal perception #video understanding

DoVer: Intervention-Driven Auto Debugging for LLM Multi-Agent Systems

Beginner

Ming Ma, Jue Zhang et al. · Dec 7 · arXiv

LLM multi-agent systems often fail quietly (no crash) and leave long, twisty logs that are hard to debug by hand.

#DoVer #intervention-driven debugging #LLM multi-agent systems
