Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Intermediate · Wenxuan Huang, Yu Zeng et al. · Jan 29 · arXiv
The paper tackles a real problem: a single one-shot image or text search often fails to retrieve the right evidence (low hit rate), especially for noisy, cluttered images.
#multimodal deep research #visual question answering #ReAct reasoning