The paper introduces VDR-Bench, a new benchmark of 2,000 carefully constructed questions that genuinely require both seeing (images) and reading (web text) to answer.
The paper tackles a real problem: a single round of image or text retrieval often misses the right evidence (low hit rate), especially in noisy, cluttered images.
Real users often ask vague questions alongside pictures, and today's vision-language models (VLMs) struggle to handle them.
AuditDM is a friendly 'auditor' model that hunts down the cases where vision-language models go wrong and then generates the right practice examples (training data) to fix those weaknesses.