This paper builds SVG2, a giant, automatically generated video library that records who is in each video, what they look like, and how they interact over time.
Molmo2 is a family of vision-language models, released with fully open weights, data, and code, that can watch videos, understand them, and point to or track things over time.