The paper trains a single model from scratch to both read text and see images and videos, instead of starting from a language-only model.
DanQing is a fresh, 100-million-pair Chinese image–text dataset collected from 2024–2025 web pages and carefully cleaned for training AI that understands pictures and Chinese text together.