DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Key Summary
- DanQing is a fresh, 100-million-pair Chinese image–text dataset collected from 2024–2025 web pages and carefully cleaned for training AI that understands pictures and Chinese text together.
- It turns about 1 billion noisy web pairs into high-quality data using a four-step pipeline: source selection, text refinement, visual diversification, and cross-modal cross-batch filtering.
- DanQing helps models recognize new and trending Chinese concepts (like Xiaomi SU7 and Black Myth: Wukong) that older datasets miss.
- When used to continue pretraining SigLIP2 models, DanQing beats prior Chinese datasets (Wukong, Zero, TaiSu) on zero-shot classification and cross-modal retrieval.
- On long-caption retrieval benchmarks, DanQing delivers especially large gains even with a short context length (64 tokens), thanks to denser, higher-quality text.
- The dataset shows a more balanced spread of topics and less duplication, which helps models learn rare ideas better and scale smoothly with more data and bigger models.
- Every image–caption pair is checked for safety, language consistency, alignment quality, and duplication; roughly 90% of the raw pairs are discarded as noise along the way.
- DanQing is open-source (CC-BY 4.0), so researchers and developers can freely use it to build better Chinese vision-language systems.
Why This Research Matters
DanQing makes AI that understands Chinese pictures and text more accurate, current, and useful in daily life. It helps search engines find the right images for Chinese queries and allows shopping apps to match products with user descriptions more reliably. News and education platforms benefit from better retrieval of relevant visuals for Chinese articles and lessons. Multimodal assistants become better at answering Chinese questions about photos, charts, and documents. Startups and researchers get an open, legally usable dataset to build and share new tools. Because the data is fresh (2024–2025), models grasp trending terms and products in China that older datasets miss.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re learning from a picture book in Chinese. If the book has missing pages, blurry pictures, and old news, it’s hard to learn well, right?
🥬 The Concept (Vision-Language Pre-training, VLP): VLP is how we teach computers to match what they see in pictures with what they read in text so they can understand both together. How it works (simple recipe):
- Show the computer a photo and a matching caption.
- Ask it to make the photo and caption “feel close” in its memory space.
- Repeat millions of times with many topics so it learns general skills. Why it matters: Without VLP, the computer may see pixels and words as separate, never linking 'panda' with the actual animal image. 🍞 Anchor: When you type '雪山上的熊猫' (a panda on a snowy mountain) and the AI finds the right picture, that’s VLP in action.
The World Before:
- English models like CLIP grew strong because they had huge, updated datasets like LAION-5B. But Chinese models didn’t have a recent, large, clean source. Many Chinese datasets were smaller, noisier, or partly broken (lots of dead links), and often several years old.
- That meant Chinese models lagged in everyday tasks—like finding the right images for Chinese captions or recognizing new Chinese trends.
🍞 Hook: You know how a matching game works better when every card actually has its pair?
🥬 The Concept (Image-Text Pairs): Image-text pairs are a picture and the words that describe it. How it works: Pair the right caption with the right image; keep many such pairs across many topics. Why it matters: Without good pairs, the model learns wrong matches (like calling a cat a dog). 🍞 Anchor: A photo of 热干面 (Wuhan hot dry noodles) with the caption '武汉热干面加芝麻酱' ("Wuhan hot dry noodles with sesame paste") is a strong pair.
The Problem:
- The Chinese web is huge, but raw data is messy. Many captions are ads, spam, duplicates, or off-topic. Many images are too small, blurry, or not accessible anymore.
- Old datasets miss current events and buzzwords, so models stumble on modern concepts.
Failed Attempts:
- Simple filters (like short rules) toss out some junk but keep lots of hidden noise (grammar errors, off-topic captions, near-duplicates).
- Focusing only on text cleanup ignores visual quality; focusing only on images ignores caption quality; neither solves cross-modal mismatch.
- Heavily synthetic captions may match certain benchmarks but don’t always improve broad, real-world generalization.
🍞 Hook: Imagine cleaning your room by only folding clothes but never taking out the trash. Still messy, right?
🥬 The Concept (Training Dataset): A training dataset is the big study book the AI uses to learn from examples. How it works: Collect many examples, clean them, and organize them so the AI can learn patterns. Why it matters: If the book is messy or outdated, the student (the AI) learns poorly. 🍞 Anchor: If the AI sees millions of clean Chinese image–caption pairs, it gets better at Chinese vision-language tasks.
The Gap:
- China needed a large, up-to-date, open dataset with careful, multi-stage cleaning—across text, images, and their alignment—to power modern dual-encoder models (like SigLIP2) and multimodal LLMs.
Real Stakes (Why you should care):
- When you search Chinese terms, shop online, read news, or use a multimodal assistant, you want it to understand present-day China: new products, festivals, slang, and places.
- Schools, museums, and media apps need systems that retrieve accurate images from Chinese descriptions.
- Startups and researchers need a reliable, legal, living dataset to innovate faster.
🍞 Hook: Imagine building a giant Lego city: you need clean pieces, enough variety, and instructions that match the pieces. Otherwise, the towers fall.
🥬 The Concept (Data Filtering Pipeline): A data filtering pipeline is the step-by-step cleanup line that turns raw web data into high-quality training data. How it works:
- Pick good sources and the right language.
- Fix and check captions for grammar, info density, and safety.
- Keep clear, diverse, non-duplicate images.
- Double-check that each image truly matches its caption. Why it matters: Without this pipeline, models waste time learning from junk and become confused. 🍞 Anchor: DanQing starts with ~1B rough pairs and keeps ~100M strong ones by passing every pair through this pipeline.
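To make the four steps concrete, here is a minimal, hypothetical sketch of such a pipeline in Python. The stage functions, thresholds, blacklist, and sample data are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class RawPair:
    caption: str   # Chinese alt-text / surrounding text
    domain: str    # source website, used for blacklist checks

# Each stage is a predicate over one pair: True = keep, False = drop.
Stage = Callable[[RawPair], bool]

def run_pipeline(pairs: Iterable[RawPair], stages: list[Stage]) -> Iterator[RawPair]:
    """A pair survives only if every stage accepts it."""
    for pair in pairs:
        if all(stage(pair) for stage in stages):
            yield pair

BLACKLIST = {"spam-shop.example"}  # hypothetical blocked domains

def source_ok(p: RawPair) -> bool:
    # Stage 1 (source selection): trusted domain, caption length in a sane band.
    # Character count is a crude proxy; real Chinese word counting needs segmentation.
    return p.domain not in BLACKLIST and 5 <= len(p.caption) <= 60

def text_ok(p: RawPair) -> bool:
    # Stage 2 (text refinement): placeholder for language/grammar/safety checks.
    return "买买买" not in p.caption

def image_ok(p: RawPair) -> bool:
    # Stage 3 (visual diversification): placeholder; real checks need the image bytes.
    return True

def aligned_ok(p: RawPair) -> bool:
    # Stage 4 (cross-modal filtering): placeholder; real check needs an expert encoder.
    return True

raw = [RawPair("武汉热干面加芝麻酱", "food-blog.example"),
       RawPair("买买买!", "spam-shop.example")]
kept = list(run_pipeline(raw, [source_ok, text_ok, image_ok, aligned_ok]))
print(len(kept))  # -> 1: only the informative food caption survives
```

DanQing's real stages are far heavier (safety classifiers, expert encoders, deduplication), but they compose in exactly this keep-or-drop fashion.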
By building DanQing—a modern, carefully cleaned, CC-BY 4.0 dataset of 100 million Chinese image–text pairs collected in 2024–2025—the authors give Chinese vision-language models the “right book” to study from. And just like any good book, it’s clear, current, and complete enough to help them learn fast and well.
02 Core Idea
🍞 Hook: Think of updating a map app. If your map stops at 2021, it won’t show new roads. Drivers get lost.
🥬 The Concept (Aha! Moment): The key insight is that a fresh, large, and rigorously filtered Chinese image–text dataset—from 2024–2025—unlocks better learning and keeps models fluent with today’s Chinese world. How it works:
- Crawl modern Chinese web data at scale.
- Clean captions deeply (language, grammar, info density, safety).
- Clean images deeply (clarity, diversity, no duplicates, safety).
- Verify image–text alignment using expert models and remove cross-batch duplicates. Why it matters: Without freshness plus strong filtering, models won't learn new concepts well and will keep memorizing noise. 🍞 Anchor: Ask for '小米SU7蓝色运动版侧面照' (a side view of a blue Xiaomi SU7 sport edition) and the model retrieves the right 2024 car image instead of an old Xiaomi phone.
Three analogies for the same idea:
- Library analogy: Build a brand-new Chinese library. Toss outdated or low-quality books. Keep clear, well-written ones on many topics. Result: smarter readers (models).
- Kitchen analogy: Start with a giant basket of mixed produce (web data). Wash, trim, and sort by freshness and quality. Cook clean recipes (training) that taste better (results).
- Team-sports analogy: Recruit many players (data) but only keep healthy, skilled, and cooperative teammates (filtered pairs) who pass well together (alignment). The team wins more games (benchmarks).
Before vs After:
- Before: Chinese models trained on older or noisier data struggled with up-to-date terms, had uneven topic balance, and plateaued when scaled.
- After DanQing: They recognize new trends, learn from balanced topics, scale better with more data and bigger models, and improve in zero-shot classification, retrieval, and multimodal LLM tasks.
🍞 Hook: You know how a translator helps two friends speak different languages?
🥬 The Concept (Cross-Modal Alignment): Cross-modal alignment is about making sure the picture's meaning and the caption's meaning really match. How it works: Use a strong model to measure how close the image and text are in meaning; keep the good matches, drop the poor ones. Why it matters: If '猫' (cat) is matched to a dog photo, the model's understanding collapses. 🍞 Anchor: The caption '川菜馆麻婆豆腐特写' (a close-up of mapo tofu at a Sichuan restaurant) should pair with a photo of Mapo Tofu, not hot pot or pizza.
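A minimal sketch of this keep-or-drop decision, assuming image and caption embeddings already produced by an expert encoder such as Chinese-CLIP; the toy vectors and the 0.3 cutoff are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between one image embedding and one text embedding."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def well_aligned(img_emb: np.ndarray, txt_emb: np.ndarray, threshold: float = 0.3) -> bool:
    # Keep the pair only if the picture's meaning and the caption's meaning are close.
    # 0.3 is a made-up cutoff; a real pipeline would tune it on held-out data.
    return cosine(img_emb, txt_emb) >= threshold

# Toy 4-d embeddings standing in for real CLIP features.
mapo_tofu_img = np.array([0.9, 0.1, 0.0, 0.1])
mapo_tofu_txt = np.array([0.8, 0.2, 0.1, 0.0])   # '川菜馆麻婆豆腐特写'
pizza_txt     = np.array([0.0, 0.1, 0.9, 0.2])   # an off-topic caption

print(well_aligned(mapo_tofu_img, mapo_tofu_txt))  # True  -> keep
print(well_aligned(mapo_tofu_img, pizza_txt))      # False -> drop
```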
Why it works (intuition, not equations):
- Models learn patterns best from clean, diverse, well-aligned examples. DanQing improves the “signal-to-noise ratio” by removing blurry images, low-content text, and mismatched pairs.
- Fresh data introduces new words and styles that older data lacks, preventing the model from becoming out-of-date.
- Balanced topics mean the model doesn’t overfit to a few popular areas (like fashion) and forget long-tail areas (like agriculture).
Building blocks of DanQing’s idea:
- Source Selection: Start with Common Crawl 2024–2025, pick Chinese pages, remove blacklisted domains, and drop unsafe, overly short, or overly long captions.
- Text Refinement: Confirm Chinese language, standardize to Simplified, check grammar tokens, keep info-dense text, and ensure safety.
- Visual Diversification: Keep clear images, remove duplicates via embeddings, ensure variety in sizes and content.
- Cross-Modal Cross-Batch Filtering: Use an expert Chinese-CLIP to keep only well-aligned image–text pairs and deduplicate across batches.
🍞 Hook: Sorting your toy box by kind, color, and size makes playtime smoother.
🥬 The Concept (Semantic Distribution): Semantic distribution is how different ideas and topics are spread across the dataset. How it works: Cluster images and captions to see which topics are big or small and how evenly they're covered. Why it matters: If a few topics dominate, the model gets lopsided and weak on rare concepts. 🍞 Anchor: DanQing's clusters are more even than Wukong's, so the model learns both '旅游' (tourism) and '农业' (agriculture) reliably.
🍞 Hook: If you sift a bucket of mixed beads, you’ll find groups like 'sports', 'food', and 'tech'.
🥬 The Concept (Topic Modeling): Topic modeling finds common themes in lots of text. How it works: Turn captions into vectors, reduce dimensions, cluster them, and pull keywords per cluster. Why it matters: It checks that your dataset truly spans real-life topics, not just a few. 🍞 Anchor: DanQing's topics include 时尚, 科技, 美食, 家居, 旅游, 体育 (fashion, technology, food, home, travel, sports), mirroring daily Chinese life.
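A minimal sketch of such a balance check, assuming caption embeddings are already available from any Chinese text encoder; the scikit-learn KMeans setup, the six-cluster count, and the toy data are illustrative assumptions, not the paper's exact analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_balance(caption_embs: np.ndarray, n_topics: int = 6) -> dict:
    """Cluster caption embeddings and report how evenly topics are covered."""
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(caption_embs)
    counts = np.bincount(labels, minlength=n_topics)
    share = counts / counts.sum()
    # Normalized entropy: 1.0 means perfectly even topics, near 0 means one topic dominates.
    balance = -(share * np.log(share + 1e-12)).sum() / np.log(n_topics)
    return {"cluster_sizes": counts.tolist(), "balance": round(float(balance), 3)}

# Toy embeddings: 300 captions drawn from 6 loosely separated "topics".
rng = np.random.default_rng(0)
embs = np.concatenate([rng.normal(loc=i, scale=0.3, size=(50, 8)) for i in range(6)])
print(cluster_balance(embs))   # a balanced dataset should score close to 1.0
```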
Finally, the model choice: 🍞 Hook: Think of wearing better running shoes during practice; you can train longer and safer.
🥬 The Concept (SigLIP2 Models): SigLIP2 are advanced vision–language encoders that match images and text using a scalable sigmoid-based loss. How it works: Encode images and text, compare their similarity independently (no global batch coupling), and learn to push true pairs close and false pairs apart. Why it matters: With big batches and distributed training, this loss is stable and efficient. 🍞 Anchor: Trained on DanQing, SigLIP2 learns faster and generalizes better to new Chinese tasks.
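To make the "no global batch coupling" point concrete, here is a minimal PyTorch sketch of a SigLIP-style pairwise sigmoid loss. The temperature and bias values are illustrative (in SigLIP they are learned parameters), and this is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def sigmoid_pair_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                      t: float = 10.0, b: float = -10.0) -> torch.Tensor:
    """Pairwise sigmoid loss: every (image, text) pair in the batch is an independent
    binary decision, match (diagonal) vs. non-match (off-diagonal)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = t * img @ txt.T + b                                   # [B, B] similarity logits
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1   # +1 diagonal, -1 elsewhere
    # log-sigmoid of (label * logit); no softmax over the whole batch is required.
    return -F.logsigmoid(labels * logits).mean()

img = torch.randn(4, 16)   # toy image embeddings
txt = torch.randn(4, 16)   # toy text embeddings (row i matches image i)
print(sigmoid_pair_loss(img, txt).item())
```

Because every entry of the similarity matrix is scored independently, the loss can be computed chunk by chunk across devices, which is what makes very large distributed batches practical.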
03 Methodology
At a high level: Raw Chinese web pairs (2024–2025) → Data Source Selection → Text Refinement → Visual Diversification → Cross-Modal Cross-Batch Filtering → DanQing 100M clean pairs.
Step A: Data Source Selection
- What happens: Crawl Common Crawl pages tagged as Chinese (zho), split into seven batches, remove blacklisted sources, keep captions of 5–60 words, and filter unsafe content with a lightweight classifier. Download images and keep only accessible links.
- Why this step exists: Starting clean reduces wasted effort later. If we keep spammy domains, broken links, or extreme caption lengths, later filters work harder and still miss issues.
- Example: A page from a trusted news site with a 24-word caption about '2024冬奥绿色建筑材料' (green building materials for the 2024 Winter Games) passes; a spammy ad page with 2 words like '买买买!' ("Buy, buy, buy!") fails.
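As a toy illustration of Step A, here is a hedged sketch of the blacklist, caption-length, and link-accessibility checks; the domain list and thresholds are assumptions, character count stands in for the paper's word count, and the lightweight safety classifier is not shown.

```python
import requests

BLACKLISTED_DOMAINS = {"spam-ads.example"}   # hypothetical blocked sources

def image_still_accessible(url: str, timeout: float = 3.0) -> bool:
    """Drop dead links early so later stages never try to download broken images."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return resp.status_code == 200
    except requests.RequestException:
        return False

def passes_source_selection(domain: str, caption: str, image_url: str) -> bool:
    # Blacklist + caption-length band + reachable image, as described in Step A.
    # Character count is only a proxy; reproducing the paper's 5-60 word rule
    # exactly would require Chinese word segmentation.
    return (domain not in BLACKLISTED_DOMAINS
            and 5 <= len(caption) <= 60
            and image_still_accessible(image_url))
```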
Step B: Text Refinement
- What happens:
- Language and script: Confirm Chinese with FastText; convert Traditional to Simplified with OpenCC.
- Quality checks: Remove captions missing basic parts (e.g., no nouns) or having too many [UNK] tokens after SigLIP2 tokenization.
- Information density: Remove emojis/special chars; compute entropy to drop ultra-low-content texts.
- Safety: Use NSFW detectors and Baidu DataBuilder to filter ads, sensitive political content, or territorial disputes.
- Why this step exists: Captions must be clear, grammatical, and informative to teach the model precise language-image links.
- Example: '石库门建筑灰砖细节与拱形门窗' (grey-brick details and arched doors and windows of Shikumen architecture; rich, specific) stays; '好看!!😍😍' ("So pretty!!"; low info) is removed.
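A hedged sketch of two of the Step B checks: script normalization with OpenCC (assuming the opencc-python-reimplemented interface) and a character-entropy test for information density. The entropy cutoff is a made-up value, and the grammar, [UNK], and safety checks are omitted.

```python
import math
from collections import Counter

from opencc import OpenCC   # pip install opencc-python-reimplemented

t2s = OpenCC("t2s")          # Traditional -> Simplified converter

def char_entropy(text: str) -> float:
    """Shannon entropy over characters: near 0 for repetitive, low-content strings."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def refine_caption(caption: str, min_entropy: float = 2.0) -> str | None:
    """Return a normalized caption, or None if it carries too little information."""
    caption = t2s.convert(caption.strip())
    if char_entropy(caption) < min_entropy:   # e.g. '好看!!' scores low
        return None
    return caption

print(refine_caption("石庫門建築灰磚細節與拱形門窗"))  # converted to Simplified, kept
print(refine_caption("好看!!"))                        # dropped (None)
```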
Step C: Visual Diversification
- What happens:
- Visual fidelity: Keep reasonable aspect ratios (1:3 to 3:1), minimum edge >100 px, normal intensity variation, and sharpness (via Laplacian variance ≥1000).
- Information density: Use image entropy to remove nearly empty or flat images.
- Redundancy control: Compute image embeddings via Chinese-CLIP-L/14; group near-duplicates with Union-Find; keep only the central image of each group.
- Safety: Apply a stronger NSFW detector to remove risky visuals.
- Why this step exists: Blurry, tiny, or duplicate images don’t help learning; they waste compute and skew topics.
- Example: Ten near-identical product shots of the same sneaker are merged; only the most representative stays.
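A hedged sketch of the per-image fidelity checks with OpenCV; the thresholds mirror the values quoted above, while the image-entropy test, the embedding-based Union-Find deduplication, and the NSFW check are left out.

```python
import cv2

def image_fidelity_ok(path: str,
                      min_edge: int = 100,
                      max_aspect: float = 3.0,
                      min_sharpness: float = 1000.0) -> bool:
    """Keep images that are large enough, not extreme in shape, and not blurry."""
    img = cv2.imread(path)
    if img is None:                       # unreadable / corrupted file
        return False
    h, w = img.shape[:2]
    if min(h, w) <= min_edge:             # tiny thumbnails are dropped
        return False
    if max(h, w) / min(h, w) > max_aspect:   # outside the 1:3 .. 3:1 band
        return False
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()   # variance of the Laplacian
    return sharpness >= min_sharpness     # low variance = blurry or flat image
```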
Step D: Cross-Modal Cross-Batch Filtering
- What happens: Use an expert Chinese-CLIP to measure image–text alignment quality (via L2 or cosine distance). Keep pairs whose distance falls in a middle band: too close often means OCR-heavy text rendered inside the image, while too far means a mismatch. Then deduplicate across batches to eliminate repeats.
- Why this step exists: Even after image-only and text-only cleanup, some pairs don’t actually match. This step protects the core promise: each caption truly describes its image.
- Example: A caption about '杭州西湖断桥冬景' (a winter scene of the Broken Bridge at Hangzhou's West Lake) paired with a summer beach photo gets removed.
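A minimal sketch of the Step D decision rule over precomputed embeddings: keep pairs whose image–text distance falls in a middle band, and skip images already seen in earlier batches. The band limits and the rounded-embedding fingerprint are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def l2_distance(img_emb: np.ndarray, txt_emb: np.ndarray) -> float:
    # Embeddings are unit-normalized first, as CLIP-style encoders usually assume.
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return float(np.linalg.norm(img - txt))

class CrossModalCrossBatchFilter:
    """Keep well-aligned pairs and drop repeats seen in earlier batches."""

    def __init__(self, lo: float = 0.6, hi: float = 1.2):
        # lo guards against "too close" pairs (often OCR text rendered in the image),
        # hi guards against mismatches; both values are made up for this sketch.
        self.lo, self.hi = lo, hi
        self.seen: set[bytes] = set()     # fingerprints of images kept so far

    def keep(self, img_emb: np.ndarray, txt_emb: np.ndarray) -> bool:
        if not (self.lo <= l2_distance(img_emb, txt_emb) <= self.hi):
            return False                  # badly aligned (or suspiciously OCR-like)
        key = np.round(img_emb, 3).tobytes()   # crude cross-batch duplicate fingerprint
        if key in self.seen:
            return False
        self.seen.add(key)
        return True
```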
Secret sauce (what’s clever):
- Freshness: Data from 2024–2025 means the model learns new brands, memes, vehicles, and events.
- Multi-axis filtering: Text quality + image quality + cross-modal checks + cross-batch dedup together kill subtle noise sources that single-axis filters miss.
- Balanced semantics: Redundancy control and topic checks lead to a more even concept spread, so the model learns long-tail topics.
- Scalable loss: Training SigLIP2 with sigmoid-based loss avoids batch coupling, making large-batch, distributed training smoother on this big dataset.
🍞 Hook: Imagine labeling every photo in your album with the correct caption so you can search it later.
🥬 The Concept (Cross-Modal Retrieval): Cross-modal retrieval finds images from text (and text from images). How it works: Encode both sides, compare similarity, and rank results. Why it matters: If your data was noisy, search is poor; if data is clean and aligned, search feels magic. 🍞 Anchor: Typing '川剧变脸舞台近景' (a stage close-up of Sichuan-opera face changing) yields stage photos with masks, not random opera posters.
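A tiny sketch of the ranking step over precomputed embeddings; the encoder producing them (for example, a SigLIP2 model continued-pretrained on DanQing) is assumed and not shown, and the toy vectors are made up.

```python
import numpy as np

def top_k_images(query_txt_emb: np.ndarray, image_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Rank all candidate images for one text query by cosine similarity (highest first)."""
    q = query_txt_emb / np.linalg.norm(query_txt_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q                       # one similarity score per candidate image
    return np.argsort(-scores)[:k]          # indices of the k best matches

# Toy example: 3 candidate images, the query matches index 2 best.
gallery = np.array([[1.0, 0.0], [0.7, 0.7], [0.1, 1.0]])
query = np.array([0.0, 1.0])               # stand-in embedding of '川剧变脸舞台近景'
print(top_k_images(query, gallery, k=2))   # -> [2 1]
```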
🍞 Hook: You know how sometimes you can guess a new word from the sentence?
🥬 The Concept (Zero-shot Classification): Zero-shot classification lets a model recognize categories it hasn't explicitly seen during training, by reading their names. How it works: Encode class names as text prompts and pick the closest class to the image embedding. Why it matters: It tests whether the model's learned concepts transfer to new labels. 🍞 Anchor: The model may label a new car model as '电动车' (electric vehicle) without a specific training label if it understands the concept of EVs.
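Zero-shot classification reuses the same similarity machinery, just ranked over class-name prompts instead of a gallery of images. A minimal sketch with toy prompt embeddings (real ones would come from the text encoder):

```python
import numpy as np

def zero_shot_label(img_emb: np.ndarray, class_embs: dict[str, np.ndarray]) -> str:
    """Pick the class whose prompt embedding is closest to the image embedding."""
    img = img_emb / np.linalg.norm(img_emb)
    best, best_score = None, -np.inf
    for name, emb in class_embs.items():
        score = float(img @ (emb / np.linalg.norm(emb)))
        if score > best_score:
            best, best_score = name, score
    return best

# Toy prompt embeddings for three Chinese class names.
classes = {
    "电动车 (electric vehicle)": np.array([0.9, 0.1]),
    "自行车 (bicycle)":          np.array([0.1, 0.9]),
    "公交车 (bus)":              np.array([0.6, 0.6]),
}
print(zero_shot_label(np.array([0.95, 0.05]), classes))  # -> '电动车 (electric vehicle)'
```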
Putting it together: Input (web pages, 2024–2025) → Language-safe, source-filtered pairs → Clear, informative captions → Sharp, diverse, non-duplicate images → Expert-checked image–text matches → DanQing 100M pairs. Train SigLIP2 encoders on this and you get stronger Chinese retrieval, better zero-shot classification, and improved multimodal LLM performance.
04 Experiments & Results
The Test (what they measured and why):
- Zero-shot Image Classification: Tests broad concept understanding without task-specific fine-tuning.
- Cross-Modal Retrieval (short and long captions): Tests how well image and text spaces align for practical search.
- Chinese-centric LMM tasks (e.g., MMBench-CN, MME-RW CN, CMMMU, OCRBench V2): Tests if a DanQing-pretrained vision encoder helps multimodal reasoning in Chinese.
- Scaling Study: See if performance keeps improving with more data and larger models (a hallmark of a good dataset).
The Competition (what they compared against):
- Prior Chinese web datasets: Wukong (100M), TaiSu (~166M), Zero (250M; a 100M random subset is used here for fairness).
- Baseline: Original SigLIP2 models without continued pretraining on Chinese datasets.
Scoreboard (with context):
- Zero-shot Classification: Across SigLIP2-B/32, B/16, and L/16, DanQing achieved average gains of around 7–8 points over the unadapted baseline and about +0.5 to +1.9 over the best prior Chinese datasets. Think of it as going from a strong B to a solid A on multiple tests.
- Short-Caption Retrieval (Flickr30K-CN, MSCOCO-CN, MUGE): DanQing improved average retrieval results by roughly 2–3 points over Wukong and Zero across backbones. That’s like finding the right photo a couple more times out of every 50 queries—very noticeable at scale.
- Long-Caption Retrieval (DCI-CN, DOCCI-CN): With the SigLIP2-L/16 model and only 64 text tokens, DanQing still led by large margins: +12.8% vs Wukong, +9.0% vs Zero, +8.9% vs TaiSu (averages). This is like moving from a B- to an A in a hard reading-comprehension exam while using the same study time (same context length).
- Chinese LMM Tasks: Plugging DanQing-pretrained SigLIP2-L/16 into an LLaVA-NeXT-style setup reached a new average SOTA (~50.1% vs 49.5%) across multiple Chinese evaluations. Small percentage bumps here reflect real, end-to-end user experience improvements.
Surprising Findings:
- Long-caption strength despite short context: Even capped at 64 tokens, DanQing’s higher semantic density helped recover more meaning, boosting long-caption retrieval substantially.
- New concept mastery: DanQing-pretrained models recognized 2024+ concepts (e.g., 小米SU7, 黑神话:悟空) better than models trained on older datasets, proving the power of freshness.
- Better scaling curves: DanQing didn’t plateau early. As data size or model size grew, performance kept climbing more steeply than with Wukong, signaling higher data quality.
- More balanced clusters: Topic clustering revealed DanQing’s semantic distribution is more uniform, which helps models handle both popular and long-tail content.
Why these results make sense:
- If your training pairs are cleaner and more evenly spread, the model learns clearer concepts.
- If your captions carry more content words and better grammar, the model reads more meaning per token.
- If your data includes current culture and products, the model won’t feel stuck in the past.
In short: Across tasks, architectures, and scales, DanQing made Chinese vision-language models more accurate, more current, and more robust.
05 Discussion & Limitations
Limitations (be specific):
- Web bias remains: Even with strict filtering, web data can reflect popularity biases (e.g., over-representation of e-commerce or trending topics) and may under-cover niche cultural content.
- Safety filters trade-offs: Some safe-but-unusual content might be wrongly filtered, slightly narrowing diversity.
- OCR-heavy edge cases: Very text-dense images (posters, charts) are partly controlled by thresholds; some useful ones may be excluded.
- Temporal decay returns: Although fresh now (2024–2025), concepts will age; future refreshes will be needed.
Required Resources:
- Storage: ~12 TB for 100M pairs.
- Compute: Multi-GPU training (e.g., 16×A800 80G in the paper) to pretrain SigLIP2 efficiently.
- Tooling: Access to language ID, NSFW filters, OpenCC, image processing (OpenCV), and embedding models for filtering.
When NOT to Use:
- If you need strictly curated, domain-specific medical or legal datasets with verified expert annotations.
- If your application focuses on non-Chinese or minority languages significantly different from Simplified Chinese.
- If your system needs full-resolution photography for fine-grained details beyond the model’s input size (256×256 used in training here)—you may need domain-specific fine-tuning.
Open Questions:
- How to maintain freshness continuously? A rolling pipeline could keep models aligned with 2026+ trends.
- Can we keep even more OCR/text-rich images without harming retrieval quality, perhaps by training a specialized branch?
- How does DanQing integrate with longer-context training (e.g., 256–512 tokens) to fully leverage long captions?
- What is the optimal balance between real and synthetic captions for generalization without dataset overfitting to benchmarks?
- How to best represent under-represented regions and dialects in future versions while keeping quality and safety?
Overall, DanQing is a big step forward, but like any living dataset, it will benefit from steady refreshes, deeper coverage of rare domains, and careful balancing of safety with inclusivity.
06 Conclusion & Future Work
3-Sentence Summary: DanQing is a large, up-to-date (2024–2025) Chinese image–text dataset of about 100 million high-quality pairs, built by passing raw web data through a rigorous multi-stage filtering pipeline. When used to continue pretraining SigLIP2 encoders, DanQing consistently outperforms previous Chinese datasets on zero-shot classification, cross-modal retrieval (short and long captions), and boosts Chinese LMMs. Analyses show DanQing has denser text, balanced topics, stronger alignment, and better scaling behavior.
Main Achievement: Providing a modern, open, meticulously curated Chinese vision–language pretraining dataset that measurably improves downstream performance and keeps models fluent with current Chinese concepts.
Future Directions:
- Regularly refresh DanQing to track new cultural and technological trends.
- Explore longer text contexts and multi-granular captions.
- Add carefully selected OCR-rich images and structured documents.
- Balance real and synthetic captions to cover rare topics without overfitting.
Why Remember This: Quality data is the fuel of foundation models. DanQing shows that freshness plus multi-axis filtering can transform noisy web data into a reliable engine for Chinese vision–language understanding—making AI that searches, reads, and reasons in Chinese feel far more natural and up-to-date.
Practical Applications
- Improve Chinese image search so users find the exact photo that matches their query.
- Boost product discovery in e-commerce by matching user text to the most relevant product images.
- Enhance news and blog platforms with accurate image recommendations for Chinese articles.
- Power multimodal chatbots that can describe images and answer questions in Chinese.
- Support educational tools that retrieve illustrations for Chinese lessons and homework.
- Assist cultural archives and museums in organizing and retrieving images with Chinese metadata.
- Strengthen social media content understanding (e.g., tagging, moderation) for Chinese posts.
- Enable better OCR-related reasoning by combining image and Chinese text understanding.
- Improve content safety by training detectors on clean, well-labeled Chinese multimodal data.
- Accelerate research on Chinese vision-language models with an open, large-scale dataset.