OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value
Key Summary
- OpenDataArena (ODA) is a fair, open platform that measures how valuable different post-training datasets are for large language models by holding everything else constant.
- It fine-tunes the same base models on one dataset at a time, evaluates them on the same set of 22 benchmarks, and uses the results as a direct score for the dataset's value.
- ODA profiles each dataset with a multi-dimensional scoring system (clarity, difficulty, correctness, diversity, and more) to explain why some data helps models more than others.
- A data lineage explorer maps where datasets come from and how they were built, revealing reuse, overlap, and even benchmark contamination.
- Across 600+ training runs on 120+ datasets, ODA finds that response quality (especially longer, step-by-step reasoning) predicts better performance, particularly in Math and Science.
- The Code domain behaves differently: concise answers often work better than long ones, so coding data needs its own evaluation rules.
- Bigger isn't always better: high-density, well-curated datasets beat large but noisy ones, yet tiny datasets can hit a ceiling or even hurt weaker models.
- Lineage tracing uncovers hidden redundancy and direct leakage from test benchmarks into training sets, which can inflate scores without real understanding.
- All code, tools, configs, and results are open-sourced so anyone can reproduce, check, and extend the findings.
- ODA shifts AI work from trial-and-error data curation to a transparent, testable, data-centric science.
Why This Research Matters
ODA helps teams pick the right training data faster, saving money and reducing guesswork. By exposing data lineage, it prevents accidentally training on test answers and keeps leaderboards honest. The multi-dimensional scores guide data creators to improve what really matters, like step-by-step correctness in math or concise accuracy in code. Open tools and configs let students, startups, and labs reproduce results and build on each other's work. Over time, this pushes AI from trial-and-error toward a reliable science of data. It also lays groundwork for fair evaluation in new areas, like safety alignment and multimodal learning.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a school science fair isn't just about cool gadgets; it's also about showing your steps so everyone can check your work? In AI, we've had lots of shiny models, but the "steps" (the data used after pretraining) have often been hidden.
🥬 Filling (The Actual Concept):
- What it is: This paper introduces OpenDataArena (ODA), a fair and open way to measure how good different post-training datasets are for teaching large language models (LLMs) to follow instructions, reason, and code.
- How it works: 1) Take the same base model and the same training settings. 2) Fine-tune on one dataset at a time. 3) Test each resulting model on the same set of benchmarks. 4) Score and compare datasets directly. 5) Add multi-angle quality scores and a family-tree (lineage) map to explain why results happen. 6) Release all tools and results so anyone can repeat the process.
- Why it matters: Before ODA, data was a black box. Without fair tests and clear records, we couldn't tell which datasets truly helped or why, making progress slow and hard to trust.
🍞 Bottom Bread (Anchor): Imagine three sports teams using the exact same coach, drills, and field, but each team practices with a different kind of ball. If the team with Ball A wins more games, you can fairly say Ball A practice helped the most. That's ODA for datasets.
New Concepts (explained with the Sandwich pattern as they first appear):
- Large Language Model (LLM) 🍞 Hook: Imagine a super-helpful librarian who has read almost every book and can answer your questions. 🥬 The Concept: An LLM is a computer program trained on huge amounts of text so it can understand and generate language. How it works: 1) Learn patterns from lots of text, 2) Predict the next word repeatedly, 3) Use this to answer questions and follow instructions. Why it matters: It powers chatbots, tutors, and coding assistants. 🍞 Anchor: When you ask, "What's the capital of France?" an LLM replies "Paris."
- Post-Training (SFT and Alignment) 🍞 Hook: You know how a bike comes from the store but still needs seat and handlebar adjustments to fit you? 🥬 The Concept: Post-training means fine-tuning a pretrained model to follow instructions and match human values. How: 1) Show Q&A examples (SFT), 2) Use preference data or judges (alignment) to reinforce good behavior, 3) Repeat. Why it matters: It turns a general model into a helpful assistant. 🍞 Anchor: After post-training, a model stops giving random facts and starts answering the exact question you asked.
- Dataset Quality 🍞 Hook: Fresh ingredients make better meals. 🥬 The Concept: Dataset quality is how accurate, clear, diverse, safe, and useful the training examples are. How: 1) Check if answers are correct, 2) See if steps are clear, 3) Ensure variety and safety, 4) Remove duplicates and leaks. Why it matters: Poor data teaches bad habits; great data raises skill. 🍞 Anchor: A math set with detailed, correct solutions teaches better than one with short, wrong answers.
- Benchmark 🍞 Hook: Think of a standardized test that compares everyone fairly. 🥬 The Concept: A benchmark is a fixed test set and scoring method to measure model skills. How: 1) Present tasks, 2) Gather answers, 3) Score with rules, 4) Compare. Why it matters: Without it, results are just opinions. 🍞 Anchor: GSM8K is a math benchmark where models solve grade-school word problems (a toy scoring sketch follows this list).
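To make "Score with rules" concrete, here is a minimal sketch of how a rule-based math benchmark can be scored: pull the final number out of each answer and count exact matches against the references. The extraction regex and sample data are illustrative assumptions, not the official GSM8K evaluation harness.

```python
# A minimal sketch of rule-based benchmark scoring (GSM8K-style exact match).
# The answer-extraction regex and the sample data are illustrative assumptions,
# not official evaluation code. Requires Python 3.10+.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number mentioned in an answer, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of answers whose final number matches the reference answer."""
    hits = sum(
        extract_final_number(p) == extract_final_number(r)
        for p, r in zip(predictions, references)
    )
    return hits / max(len(references), 1)

preds = ["... so the answer is 42.", "She pays $18 in total."]
refs = ["42", "17"]
print(exact_match_accuracy(preds, refs))  # 0.5
```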
The World Before: LLMs like Llama, Qwen, and GPT grew powerful, but most attention went to bigger models and smarter tricks. The data used after pretraining (post-training datasets) was messy: different sizes, mixed sources, unclear origins, and uneven quality. People often tried whatever datasets were popular on Hugging Face, with varied settings, then posted results that were hard to reproduce.
The Problem: We didn't have a fair way to say, "This dataset improves instruction following," or "That dataset boosts math reasoning." Even worse, some training sets accidentally included test answers (benchmark contamination), which can inflate scores without real learning.
Failed Attempts: Model leaderboards flourished, but dataset evaluation stayed ad hoc. Some argued "Less Is More" with tiny high-quality sets; others amassed massive collections. Without a shared, open pipeline, different training knobs and secret sauce made comparisons unfair.
The Gap: We needed an "apples-to-apples" system that fixes the model and training recipe, varies only the dataset, and tests on the same benchmarks, plus extra tools to rate data quality and trace where data came from.
Real Stakes: Fair data evaluation saves time and money, avoids training on leaked test answers, improves safety and reliability, and helps everyone, from students building open models to labs planning the next generation, move from guesswork to a clear, testable science of data.
02 Core Idea
🍞 Top Bread (Hook): Imagine a cooking contest where every chef must use the same oven, same timer, and same judges; the only thing they can choose is the ingredient basket. Now we can finally say which ingredients are truly best.
🥬 Filling (The Actual Concept):
- What it is: OpenDataArena's key insight is to hold the model, training settings, and evaluation constant, and change only the dataset, so the model's final scores fairly reflect the dataset's value.
- How it works: 1) Pick strong base models (e.g., Qwen, Llama). 2) For each dataset, fine-tune one model with a fixed recipe. 3) Test on a wide suite of 22 benchmarks. 4) Record scores on a public leaderboard. 5) Add a multi-dimensional scoring profile and a data lineage map to explain why results look the way they do. 6) Release all tools and configs for full reproducibility. (A minimal code sketch of this loop follows the analogies below.)
- Why it matters: Without fixing the setup, you can't tell if gains came from clever training tricks or the dataset. This method isolates data value.
🍞 Bottom Bread (Anchor): If Team A keeps beating Team B when both use the same drills, coaches, and field, but Team A trains with Dataset X while Team B uses Dataset Y, you can fairly say Dataset X trains better players.
Multiple Analogies (same idea, three ways):
- Cooking: Same oven, same judges, different ingredient baskets; taste reveals which basket is best.
- Gardening: Same soil, same water schedule, different fertilizers; plant growth shows which fertilizer works best.
- School: Same teacher, same class time, different workbooks; test scores show which workbook teaches better.
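Here is a minimal sketch of that controlled loop, assuming placeholder finetune and evaluate functions (not ODA's real API) and an illustrative recipe and benchmark subset: fix everything except the dataset, then rank datasets by the average benchmark score of the models they produce.

```python
# Sketch of ODA's core idea: hold the base model, recipe, and benchmarks fixed,
# vary only the dataset. Function names and values here are placeholders.
FIXED_RECIPE = {"learning_rate": 1e-5, "epochs": 3, "batch_size": 64}  # illustrative
BENCHMARKS = ["GSM8K", "HumanEval", "IFEval"]                          # small subset

def finetune(base_model: str, dataset: str, recipe: dict) -> str:
    """Placeholder: fine-tune base_model on one dataset with the fixed recipe."""
    return f"{base_model}+{dataset}"

def evaluate(model: str, benchmarks: list[str]) -> dict[str, float]:
    """Placeholder: run the identical benchmark suite for every model."""
    return {bench: 0.0 for bench in benchmarks}

def rank_datasets(base_model: str, datasets: list[str]) -> dict[str, float]:
    scores = {}
    for dataset in datasets:
        model = finetune(base_model, dataset, FIXED_RECIPE)  # only the dataset changes
        results = evaluate(model, BENCHMARKS)                # same tests for everyone
        scores[dataset] = sum(results.values()) / len(results)
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

print(rank_datasets("Qwen2.5-7B", ["Dataset-A", "Dataset-B"]))
```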
Before vs After:
- Before: Dataset choices were guided by hype and luck; results weren't comparable across labs; data origin was unclear; contamination could sneak in.
- After: Datasets are ranked by direct impact; quality is profiled across many axes; lineage shows where data came from; leaks can be detected; results are reproducible.
Why It Works (intuition, not equations):
- Control variables: Keeping the model, training recipe, and evaluation fixed removes confounders.
- Rich diagnostics: Multi-angle scores (clarity, correctness, difficulty, diversity) and lineage tracing explain the "why" behind raw scores.
- Scale and coverage: Testing 120+ datasets across 22 benchmarks and multiple models reduces randomness and reveals stable patterns.
Building Blocks (each with Sandwich explanations):
- Unified Training-Evaluation Pipeline 🍞 Hook: Picture a factory line that builds products and checks quality at each station. 🥬 The Concept: A single, shared process that trains and evaluates models the same way for every dataset. How: 1) Normalize data, 2) Fine-tune with fixed hyperparameters, 3) Evaluate on the same benchmarks, 4) Log and publish results. Why it matters: Ensures fair, comparable scores. 🍞 Anchor: Two runners race on the same track with identical shoes; the only difference is their practice plan (dataset).
- Multi-Dimensional Scoring Framework 🍞 Hook: You know how report cards grade math, reading, and science, not just one subject? 🥬 The Concept: Judge datasets on many qualities: difficulty, correctness, clarity, coherence, diversity, and more. How: 1) Score the question (Q) and the full Q&A separately, 2) Use models, LLM judges, and rules, 3) Combine into a profile. Why it matters: A single number can hide problems; a profile shows strengths and weaknesses. 🍞 Anchor: A dataset may be super-clear but often wrong; that profile tells you to fix accuracy before size.
- Data Lineage 🍞 Hook: Family trees tell you who's related; datasets have families too. 🥬 The Concept: Track where datasets come from, which sources they combine, and how they were transformed. How: 1) Parse READMEs, repos, and papers, 2) Extract and verify sources, 3) Build a graph of relationships, 4) Flag low-confidence links for human review. Why it matters: Reveals reuse, redundancy, and contamination. 🍞 Anchor: If a training set secretly contains test questions, lineage can reveal that link.
- LLM-as-Judge 🍞 Hook: When a teacher can't grade every essay alone, they ask trained assistants to help. 🥬 The Concept: Use strong LLMs to assess qualities like answer helpfulness or coherence. How: 1) Prompt a judge model with scoring criteria, 2) Collect ratings, 3) Cross-check with other signals. Why it matters: Scales up human-like evaluation when humans can't grade millions of examples. 🍞 Anchor: A judge model marks that an explanation is complete and non-contradictory.
- Benchmark Contamination 🍞 Hook: It's not fair to take the test if you already saw the answer sheet. 🥬 The Concept: Contamination happens when training data includes benchmark items. How: 1) Trace lineage to find overlaps, 2) Flag risky links, 3) Re-evaluate results. Why it matters: Inflated scores don't mean real learning. 🍞 Anchor: If a coding set contains LiveCodeBench tasks, high pass@1 might just be memorization. (A toy overlap-check sketch follows this list.)
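As flagged in the contamination item above, one simple way to catch direct leakage is to look for long shared character n-grams between training items and benchmark questions. This is a hedged, simplified stand-in for illustration; the paper's approach relies on lineage tracing rather than this exact check.

```python
# Toy contamination check: flag training items that share a long character
# n-gram with any benchmark question. A simplified illustration, not ODA's
# lineage-based detection.
def ngrams(text: str, n: int = 30) -> set[str]:
    text = " ".join(text.lower().split())          # normalize whitespace and case
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def flag_contaminated(train_items: list[str], bench_items: list[str], n: int = 30) -> list[int]:
    bench_grams = set().union(*(ngrams(b, n) for b in bench_items))
    return [i for i, item in enumerate(train_items) if ngrams(item, n) & bench_grams]

train = ["Solve: if 3x + 2 = 11, what is x? Show every step.",
         "Write a function to reverse a string."]
bench = ["Solve: if 3x + 2 = 11, what is x?"]
print(flag_contaminated(train, bench))  # [0] -> the first training item overlaps
```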
03 Methodology
🍞 Top Bread (Hook): Imagine a four-stop conveyor belt: 1) collect and label ingredients, 2) cook with the same recipe, 3) taste and analyze, 4) show the results on a big scoreboard.
🥬 Filling (The Actual Concept):
- What it is: ODA's recipe is a standardized, end-to-end pipeline: Input → Training & Evaluation → Analysis → Visualization.
- How it works: Step by step (with why each step matters) and concrete examples below.
- Why it matters: A fixed recipe makes results fair, repeatable, and explainable.
High-Level Overview: Input → [Data Input Layer] → [Data Evaluation Layer] → [Data Analysis Layer] → [Visualization Layer] → Output (leaderboards, profiles, lineage graphs)
Step-by-Step Details (with Sandwich explanations for new ideas):
- Data Input Layer 🍞 Hook: Like organizing a messy pantry before you start cooking. 🥬 The Concept: Collect datasets, convert them into a common format, and tag them by domain (General, Math, Code, Science). How: 1) Fetch from sources (e.g., Hugging Face), 2) Standardize fields (instruction, response), 3) Safety checks, 4) Size limits, 5) Domain labels. Why it matters: Without clean, consistent inputs, comparisons are unfair or break. 🍞 Anchor: Two math datasets might use different field names; normalization makes them look the same to the training code.
- Data Evaluation Layer (Training + Testing) 🍞 Hook: Same oven, same temperature, same timer for every dish. 🥬 The Concept: Fine-tune the same base models with identical hyperparameters, then evaluate with the same tools. How: 1) Use an open fine-tuning framework, 2) Fix learning rate, epochs, batch sizes, and adapters, 3) Train one dataset at a time, 4) Evaluate with OpenCompass and task-specific harnesses, 5) Use judge models to fairly extract and score answers. Why it matters: Keeps the dataset as the only changing variable. 🍞 Anchor: Train Llama3.1-8B on Dataset A and Dataset B separately; test both on GSM8K and HumanEval with the same prompts and scoring.
- Data Scoring System 🍞 Hook: A health checkup doesn't just take your temperature; it checks heart rate, blood pressure, and more. 🥬 The Concept: Score data along many axes, using three methods: model-based evaluation, LLM-as-judge, and heuristics. How: 1) Model-based: specialized predictors estimate difficulty or thinking probability, 2) LLM-as-judge: powerful models rate coherence, helpfulness, and correctness, 3) Heuristics: simple counts like tokens or lengths. Why it matters: A rich profile explains why datasets help or hurt. 🍞 Anchor: If a dataset's QA responses are long and consistently judged correct, it's more likely to boost math scores. (A scoring-profile sketch follows this list.)
- Data Lineage (Multi-Agent Tracing) 🍞 Hook: Detectives gather clues from many places to solve a case. 🥬 The Concept: A multi-agent system builds a graph of who-derived-from-whom. How: 1) Validate candidate datasets and their timelines, 2) Retrieve multi-source info (READMEs, repos, papers), 3) Extract sources with a structured record (Source, Relationship, Confidence, Evidence), 4) Aggregate and canonicalize names, 5) Flag low-confidence edges for human review. Why it matters: Reveals hidden overlaps, hubs, and contamination chains. 🍞 Anchor: The system shows that a math SFT set includes Omni-MATH items through an upstream distillation step.
- Visualization Layer and Leaderboard 🍞 Hook: Scoreboards make games exciting because you can see who's winning and why. 🥬 The Concept: Interactive views to compare datasets, filter by domain, inspect quality profiles, and browse lineage graphs. How: 1) Publish per-domain ranks, 2) Show metric heatmaps, 3) Render lineage networks with node sizes and colors, 4) Link to raw configs and logs. Why it matters: Transparency builds trust and accelerates learning. 🍞 Anchor: You spot that Dataset X ranks top-3 in Math and has very long, correct solutions; lineage shows it aggregates several strong sources.
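The sketch below mirrors the three scoring routes of the Data Scoring System (heuristics, LLM-as-judge, model-based) and averages them into a per-dataset profile. The judge and difficulty predictors are stubbed out, and the criteria, field names, and scales are assumptions rather than ODA's exact rubric.

```python
# Hedged sketch of a multi-method data scoring profile. The judge and difficulty
# predictors are stubs; a real system would call external models here.
from statistics import mean

def heuristic_scores(example: dict) -> dict:
    """Simple counts, e.g., response length in whitespace tokens."""
    return {"response_tokens": len(example["response"].split())}

def judge_scores(example: dict) -> dict:
    """Stub for an LLM-as-judge call with a scoring rubric in the prompt."""
    prompt = (
        "Rate the answer from 1 to 10 for correctness and coherence.\n"
        f"Question: {example['instruction']}\nAnswer: {example['response']}"
    )
    _ = prompt  # no API call in this sketch; return fixed illustrative ratings
    return {"correctness": 8.0, "coherence": 9.0}

def model_based_scores(example: dict) -> dict:
    """Stub for specialized predictors such as difficulty estimation."""
    return {"difficulty": 0.6}

def profile_dataset(examples: list[dict]) -> dict:
    per_example = [
        {**heuristic_scores(e), **judge_scores(e), **model_based_scores(e)}
        for e in examples
    ]
    return {key: mean(row[key] for row in per_example) for key in per_example[0]}

data = [{"instruction": "What is 2+2?", "response": "2 + 2 = 4, so the answer is 4."}]
print(profile_dataset(data))
```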
Concrete Data Flow Example:
- Input: A new Code dataset (50k items) is uploaded. It's standardized into the fields instruction/response and tagged as Code (a field-normalization sketch follows this example).
- Training: Qwen2.5-7B is fine-tuned for 3 epochs with fixed hyperparameters.
- Evaluation: The resulting model is tested on HumanEval, HumanEval+, MBPP, and LiveCodeBench(v5), using official scoring tools.
- Scoring: The datasetās responses are rated for correctness and conciseness; token lengths are recorded.
- Analysis: Results are compared against the base model and other Code datasets; efficiency (performance gain per example) is computed.
- Visualization: The leaderboard updates; the datasetās profile and lineage links appear.
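For the standardization step mentioned in the Input line above, a normalizer might map heterogeneous field names onto the shared instruction/response schema and attach a domain tag. The alias lists below are hypothetical, not ODA's actual mapping table.

```python
# Sketch of input normalization: map assorted field names onto the common
# instruction/response schema and tag the domain. Aliases are hypothetical.
INSTRUCTION_ALIASES = ("instruction", "question", "prompt", "query")
RESPONSE_ALIASES = ("response", "answer", "output", "completion")

def normalize(record: dict, domain: str) -> dict:
    instruction = next((record[k] for k in INSTRUCTION_ALIASES if k in record), None)
    response = next((record[k] for k in RESPONSE_ALIASES if k in record), None)
    if instruction is None or response is None:
        raise ValueError(f"Cannot map record fields: {sorted(record)}")
    return {"instruction": instruction, "response": response, "domain": domain}

raw = {"prompt": "Reverse a string in Python.", "completion": "Use s[::-1]."}
print(normalize(raw, domain="Code"))
```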
The Secret Sauce:
- Isolation of the dataset variable enables honest comparisons.
- Multi-angle diagnostics turn "black-box" scores into understandable stories.
- Lineage tracing catches redundancy and contamination early, protecting benchmark integrity (a lineage-record sketch follows these bullets).
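As the last bullet notes, lineage tracing rests on structured source records. The sketch below shows one plausible shape for the (Source, Relationship, Confidence, Evidence) record plus a tiny parent lookup with low-confidence flagging; the dataset names, relationship labels, and review threshold are invented for illustration, not taken from ODA's implementation.

```python
# Sketch of a lineage edge record and a minimal parent lookup. Names and the
# review threshold are illustrative.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class LineageEdge:
    dataset: str        # the derived dataset
    source: str         # where (some of) its data came from
    relationship: str   # e.g., "includes_items", "distilled_from", "aggregates"
    confidence: float   # 0-1; low values get flagged for human review
    evidence: str       # pointer to the README/repo/paper snippet that supports it

edges = [
    LineageEdge("MathSFT-v2", "Omni-MATH", "includes_items", 0.95, "README source list"),
    LineageEdge("MathSFT-v2", "GSM8K-train", "distilled_from", 0.60, "paper appendix"),
]

parents = defaultdict(list)
for edge in edges:
    parents[edge.dataset].append(edge)

needs_review = [e for e in edges if e.confidence < 0.7]   # send to human reviewers
print([e.source for e in parents["MathSFT-v2"]], [e.source for e in needs_review])
```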
Additional Concepts Clarified:
- Data Efficiency 🍞 Hook: Getting more learning for every minute you study feels great, right? 🥬 The Concept: Data efficiency measures performance gain per data example. How: 1) Compute score improvement over base, 2) Divide by dataset size, 3) Compare across sets. Why it matters: Shows which datasets give the most "bang for the buck." 🍞 Anchor: A 10k-example set that adds +5 points can be more efficient than a 200k set adding +6 (a short calculation follows below).
- Chain-of-Thought (CoT) 🍞 Hook: Teachers love seeing your steps, not just the final answer. 🥬 The Concept: CoT is detailed, step-by-step reasoning in answers. How: 1) Write reasoning steps, 2) Explain transitions, 3) Conclude clearly. Why it matters: Models learn problem-solving procedures, not just facts. 🍞 Anchor: A math solution that shows each algebra step tends to teach the model better than a one-line result.
04 Experiments & Results
🍞 Top Bread (Hook): Think of a mega-tournament where every training set gets to coach the same players; then we see whose coaching creates the strongest team.
🥬 Filling (The Actual Concept):
- What it is: ODA ran 600+ fine-tuning runs on 120+ datasets, tested across 22 benchmarks (General, Math, Code, Science/Reasoning), processing about 40 million data points.
- How it works: For each dataset, fine-tune a fixed base model (e.g., Llama3.1-8B, Qwen2.5-7B, Qwen3-8B), then evaluate on standardized benchmarks with official or widely used scoring tools.
- Why it matters: Big, consistent testing turns anecdotes into evidence and reveals patterns that hold across models and time.
- The Test: What they measured and why
- Absolute performance: final scores show overall capability.
- Performance delta: gain over the base model isolates dataset value.
- Efficiency: gain per data example shows costāeffectiveness.
- Correlations: which quality metrics predict success (e.g., response length, correctness)?
- Lineage: structure of the data ecosystem, reuse patterns, hubs, and contamination.
- The Competition: Compared against what?
- Models: Llama3.1-8B, Qwen2.5-7B, Qwen3-8B.
- Benchmarks: 22 tasks covering instruction following (e.g., IFEval), knowledge (MMLU-PRO), math (GSM8K, Omni-MATH, AIME), code (HumanEval, MBPP, LiveCodeBench v5), and advanced reasoning (BBH, GPQA diamond, ARC-c, CaLM).
- The Scoreboard: Results with context
- Stronger Base, Higher Ceiling: Qwen3 tends to achieve the best absolute scores, with Qwen2.5 next, then Llama3.1. Think of Qwen3 as starting with an A- baseline and still climbing.
- Sensitive Domains: Math and Code show big spread. Great data can earn an A+, while weak data can drop you to a C or worse, especially on weaker base models.
- Time Trends: Math datasets jumped from roughly mid-30s to mid-50s (on a shared scale) after 2024, thanks to improved step-by-step data. Code stayed volatile; General stayed steady and somewhat saturated.
- Cross-Model Consistency: Math rankings are very consistent between Qwen2.5 and Qwen3 (correlation ≈ 0.90), meaning great math data helps no matter which one you use. General shows negative correlation, suggesting saturation: newer, stronger models already internalize many general patterns.
- Surprising Findings
- Response Length Dominates (except for Code): Longer, well-structured answers (more CoT) strongly predict better performance, especially in Math and Science. The correlation is globally positive, reaching as high as 0.81 for Math. It's like getting extra credit for showing your work.
- Code is Different: In coding, verbosity can hurt; concise, correct solutions win. Some signals flip sign in Code compared to Math (a small correlation sketch follows the Concrete Examples below).
- Instruction-Only Metrics Are Weak Predictors: Fancy or clear prompts aren't enough if the responses are low quality. QA metrics that judge the final pair (question + answer) are much more predictive.
- Efficiency vs Peak: Tiny, super-efficient datasets can't always reach top scores and may even hurt weaker models. High-density, well-curated medium/large sets (like AM-Thinking variants) deliver stable, top performance.
- Lineage Reveals Hubs and Leaks: A few mega-aggregators sit at the center of many datasets. Tracing showed direct inclusion of benchmark items in some training sets (e.g., Omni-MATH or LiveCodeBench), which risks inflated leaderboard results without real generalization.
Concrete Examples:
- AM-Thinking (Math/Code variants) consistently reaches top Math scores on both Qwen2.5-7B and Llama3.1-8B, indicating robust, transferable value.
- OpenThoughts-style data with very long, detailed reasoning rises in global ranks, supporting the "teach by showing steps" hypothesis.
- Datasets optimized for extreme efficiency (e.g., LIMO) look great per-example but can underperform on weaker base models, confirming the stability limits of tiny sets.
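As mentioned under Surprising Findings, the correlation analysis relates a dataset-level property (such as mean response length) to the performance delta it produces. The sketch below uses invented numbers that only mimic the reported sign pattern (strongly positive in Math, flipping negative in Code); it is not ODA's data or analysis code.

```python
# Toy correlation between mean response length and benchmark gain per dataset.
# The values are invented purely to illustrate the sign pattern.
from statistics import correlation  # Pearson correlation, Python 3.10+

math_len   = [300, 650, 900, 1200, 1500]   # mean response tokens per Math dataset
math_delta = [2.0, 6.5, 9.0, 12.5, 14.0]   # benchmark gain over the base model

code_len   = [300, 650, 900, 1200, 1500]   # same lengths for Code datasets
code_delta = [5.0, 6.0, 4.5, 3.0, 2.5]     # longer answers do not help here

print(f"Math: r = {correlation(math_len, math_delta):+.2f}")  # strongly positive
print(f"Code: r = {correlation(code_len, code_delta):+.2f}")  # negative
```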
- Why These Results Matter
- If you need dependable gains on Math, pick datasets with long, correct CoT. If you care about Code, favor concise correctness and domain-specific checks.
- Always check lineage for contamination; otherwise you may be measuring memorization, not learning.
- Choose "high-density volume" over "extreme minimalism" for real-world robustness, especially with weaker base models.
🍞 Bottom Bread (Anchor): It's like track practice: short, super-intense drills help, but to win the big meet you need full workouts that build stable stamina. The winning teams used the right balance of quality and volume, and their training logs (lineage) were clean.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best maps can miss a road or two; knowing the limits helps you travel smarter.
🥬 Filling (The Actual Concept):
- What it is: An honest look at ODA's limitations, resources needed, when not to use it, and what we still don't know.
- How it works: Spell out constraints, practical needs, edge cases, and open questions to guide future work.
- Why it matters: Clear boundaries prevent misuse and point the way to the next breakthroughs.
Limitations (be specific):
- Scope of Data: Focus is on public, post-2023 SFT datasets. Private corpora or earlier resources may behave differently.
- Compute Costs: Running hundreds of fine-tunings and 10k+ evaluations is expensive; small teams may need to subset.
- Judge Reliability: LLM-as-judge reduces human labor but can introduce bias; cross-checking helps but isn't perfect.
- Contamination Detection: Lineage tracing improves detection but can still miss subtle leaks or over-flag weak links; human review remains vital.
- Domain Coverage: Science and other verticals (law, medicine) are still maturing; results there can be noisy and model-dependent.
Required Resources:
- GPUs capable of consistent fine-tuning (e.g., multiple A100s) and storage for checkpoints and logs.
- Access to evaluation frameworks (OpenCompass, harnesses) and judge models.
- Engineering to integrate new datasets and maintain consistent configs.
When NOT to Use:
- If you only need a quick, approximate signal and can't afford fine-tuning, consider training-free estimators (a planned future direction).
- If your application is highly niche (e.g., a specialized medical subfield) with no matching benchmarks, first build appropriate tests.
- If you can't fix training settings (e.g., product constraints require custom recipes), ODA's fairness guarantee won't apply.
Open Questions:
- Mixing Laws: What's the best recipe for combining datasets across domains and difficulty levels?
- Domain-Specific Scoring: What specialized metrics best predict value for Code, Science, or safety alignment data?
- Robust Judging: How can we further de-bias LLM judges and triangulate with lightweight human audits?
- Efficient Valuation: Can we develop reliable, training-light predictors of dataset value that track ODA's rankings closely?
🍞 Bottom Bread (Anchor): Think of ODA as a well-built lab: it gives clean results when experiments follow the protocol, but you still need enough supplies, the right tools, and domain-aware tests to make the most of it.
06 Conclusion & Future Work
🍞 Top Bread (Hook): If models are students, then datasets are their textbooks, and ODA is the fair exam that finally grades the books, not just the students.
🥬 Filling (The Actual Concept):
- 3-Sentence Summary: OpenDataArena fairly measures how much different post-training datasets improve LLMs by holding the model and training recipe constant and changing only the data. It adds multi-dimensional quality profiles and a data lineage explorer to explain which data helps and why, and to detect redundancy or contamination. Large-scale experiments show that response quality (especially long, correct reasoning) is key in Math and Science, while Code prefers concise correctness, and that curated medium/large datasets often beat tiny, hyper-efficient ones for robust gains.
- Main Achievement: Turning dataset evaluation from a black-box guess into a transparent, reproducible, multi-angle science, with open tools, open configs, and open results.
- Future Directions: Extend to multimodal data and alignment/preference datasets; develop training-light valuation methods; expand domain-specific scorers (especially for Code and Science); and co-create shared standards with the community.
- Why Remember This: ODA reframes progress in AI as not just better models, but better data, measured fairly, explained clearly, and shared openly, so everyone can build smarter, safer systems faster.
🍞 Bottom Bread (Anchor): Next time you pick a dataset, don't guess: check the ODA leaderboard, read its quality profile, scan its lineage for leaks, and choose with confidence.
Practical Applications
- Select the highest-value SFT dataset for a target domain (e.g., Math) using ODA leaderboard ranks and profiles.
- Audit a dataset's lineage to detect benchmark contamination before training.
- Design better synthetic data by maximizing response quality (e.g., detailed step-by-step solutions for Math).
- Choose between tiny efficient sets and larger curated sets based on stability needs and base model strength.
- Tune data mixtures by checking which domains transfer well (e.g., code logic reinforcing math reasoning).
- Set up a reproducible fine-tuning pipeline by copying ODA's open configs and training recipes.
- Create domain-specific scoring rules (especially for Code) informed by ODA's correlation findings.
- Prioritize datasets with proven cross-model consistency (e.g., Math sets that rank well on both Qwen2.5 and Qwen3).
- Use efficiency plots to maximize performance gain per example under compute constraints.
- Plan future benchmarks or datasets by studying lineage hubs and redundancy hotspots.