Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Intermediate

Christopher Clark, Jieyu Zhang et al. | Jan 15 | arXiv

Molmo2 is a family of vision-language models, released with fully open weights, data, and code, that can watch videos, understand them, and point to or track things over time.

#vision-language model  #video grounding  #pointing and tracking
