Molmo2 is a family of vision-language models, released with fully open weights, data, and code, that can watch videos, understand them, and point to or track things over time.
Action100M is a gigantic video dataset with about 100 million labeled action moments built automatically from 1.2 million instructional videos.
This paper teaches video-making AIs to follow real-world physics better without retraining them.
HeartMuLa is a family of open-source music AI models that can understand and generate full songs with clear lyrics and strong musical structure.
Cities are full of places defined by people, like schools and parks, which are hard to see clearly from space without extra clues.
This paper builds an AI agent, ML-Master 2.0, that can work on machine learning projects for a very long time without forgetting what matters.
Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.
The paper shows a new way to teach AI assistants how to use tools in many-step conversations by mining ordinary text on the internet for step-by-step “how-to” knowledge.
Most text-to-image models act like word-to-pixel copy machines and miss the hidden meaning in our prompts.
DanQing is a fresh, 100-million-pair Chinese image–text dataset collected from 2024–2025 web pages and carefully cleaned for training AI that understands pictures and Chinese text together.
Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.
ToolSafe is a new way to keep AI agents safe when they use external tools, by checking each action before it runs.