DeepSearchQA is a new benchmark of 900 real-world-style questions that checks whether AI agents can find complete lists of answers, not just a single fact.
Real people often ask vague questions alongside pictures, and today's vision-language models (VLMs) struggle with them.
The paper tackles how AI agents can genuinely research the open web when the answers are hidden inside long, messy videos rather than text.
LLM multi-agent systems often fail quietly (no crash), leaving behind long, twisty logs that are hard to debug by hand.