Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Intermediate | Yu Zeng, Wenxuan Huang et al. | Feb 2 | arXiv
The paper introduces VDR-Bench, a new benchmark of 2,000 carefully built questions that can only be answered by combining visual search (images) with textual search (web text).
#multimodal large language model #visual question answering #vision deep research