Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Key Summary
- Training big language models works best when you mix the right kinds of data (general, math, code), but finding the best mix used to be slow and very expensive.
- DeMix is a new way to search for the best data mixture without retraining a bunch of models over and over.
- It trains a few "component" models (one per dataset), then blends their weights with chosen percentages to create unlimited proxy models instantly.
- These merged proxy models predict which data recipes will work best, and they line up with real training results much better than tiny, cheap proxies.
- With DeMix, the search becomes fast (no extra training), accurate (rankings match large runs), and sufficient (you can try as many mixtures as you want).
- Using this method, the paper finds better mixtures for general understanding, math reasoning, and code generation than prior methods like RegMix and CLIMB, using far less compute.
- They also release DeMix Corpora, a 22-trillion-token, high-quality dataset with validated mixtures so others can pre-train large models more reliably.
- Merging worked best with simple linear merging and when each component model kept about 50% general data to avoid losing broad language skills.
- The approach reached similar mixture-picking accuracy to big, costly searches while using about one-sixth of the training budget.
- Bottom line: DeMix separates searching from training so you can explore data recipes cheaply and confidently before spending big on full pre-training.
Why This Research Matters
DeMix makes it practical to explore many data recipes for large language models without wasting enormous compute on each trial. That means teams can build models that are strong at everyday language, math reasoning, and code at the same time, and do it faster and cheaper. This improves tools we rely on—like coding assistants, smart tutors, and multilingual chatbots—by giving them better-balanced skills. It also lowers the barrier for smaller labs and startups to run serious pre-training research because the search step is no longer a compute wall. The released DeMix Corpora gives the community a high-quality, validated starting point, speeding up progress. Altogether, this approach turns mixture tuning from guesswork into a repeatable, data-driven process with real cost savings.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how cooking a great soup needs the right mix of ingredients—too much salt or not enough veggies and the whole bowl tastes off? Training a large language model (LLM) is like that: what you feed it (the data) must be mixed just right.
🥬 Filling (The Actual Concept)
- What it is: A data mixture is the recipe of how much general text, math problems, code, and other sources you feed into a model during pre-training.
- How it works (step by step):
- Gather lots of different datasets (general web, math, code, multilingual).
- Decide what fraction of the training tokens come from each dataset (the mixture ratios).
- Train the model on this mix and see how it performs on tasks.
- Why it matters: If the mixture is off, the model might ace everyday reading but flop on math or code—or the other way around. The right balance creates a model that’s good at many things.
🍞 Bottom Bread (Anchor) Imagine building a backpack for school: if you pack only storybooks (general text) but no calculator (math) or laptop (code), you’ll struggle in some classes. A smart pack (data mixture) helps you in all subjects.
The World Before
LLMs got good by training on enormous piles of text. But teams learned that “more” wasn’t enough—the model’s superpowers depend on which kinds of text it sees and in what amounts. Researchers tried to find the best mix by training many “proxy” models: smaller or mid-sized models trained on sampled mixtures to guess what would work for the big run. This did give signals, especially when the proxies were mid-sized with big token budgets, but it was very expensive.
🍞 Top Bread (Hook) Imagine rehearsing many times with full costumes, lights, and props just to see which version of a play works best—that’s pricey!
🥬 Filling (The Actual Concept)
- What it is (Proxy Models): Proxies are stand-in models trained cheaply to predict how a big, final model might behave with a certain data recipe.
- How it works (step by step):
- Pick a data mixture.
- Train a smaller or cheaper model (“proxy”) on that mixture.
- Evaluate it on benchmarks and use the result to guess if that mixture is good.
- Why it matters: Proxies help you explore options before spending huge compute on the final training.
🍞 Bottom Bread (Anchor) It’s like testing cookie recipes with mini cookies first, so you don’t waste flour and time on a giant bad batch.
The Problem
Two big issues popped up:
- Cost: Even mid-sized proxies with enough tokens to be trustworthy cost a lot. You can’t try hundreds of mixtures easily.
- Accuracy: Super tiny proxies don’t behave like the real target model, especially for hard tasks like math and code. So they can point you to the wrong recipe.
Failed Attempts
- Fully automated methods like RegMix and CLIMB use many tiny proxies plus a predictor to pick mixtures. But the tiny proxies often fail to generalize to math and code, so the chosen mixtures underperform when scaled up.
- Large, accurate proxies (e.g., 8B models with 100B tokens) are more faithful—but they are prohibitively expensive for broad searches.
The Gap
We needed a way to test tons of data mixtures:
- Without training a new model each time (cheap),
- While still getting rankings that match big, real runs (accurate),
- And being able to try as many combinations as we want (sufficient).
Real Stakes
Why should you care? Because better mixtures mean models that:
- Understand everyday language and also reason about math and code,
- Cost less to develop (less wasted training),
- Arrive faster (quicker search),
- And are more reliable (the chosen recipe actually works at scale).
This affects search engines, coding assistants, tutoring tools, and multilingual chatbots we use daily.
Enter DeMix
This paper proposes DeMix, which separates the “searching” from the “training” by using model merging to create unlimited, training-free proxy models. You train a handful of component models once, then mix their weights in any proportion you want to instantly mimic training on that data mixture. That way, you can test countless recipes quickly and accurately before doing any expensive full training.
02 Core Idea
🍞 Top Bread (Hook) Imagine you baked seven flavors of cupcakes (vanilla, chocolate, lemon, etc.). Instead of baking a new tray every time you want a new flavor mix, you mash pieces from the existing cupcakes to taste-test any blend instantly.
🥬 Filling (The Actual Concept)
- What it is (DeMix): DeMix is a method that predicts the best data mixture for pre-training by merging already-trained component models instead of retraining new proxies for every mixture.
- How it works (step by step):
- Train a base model on general data; then train component models, each specialized on one candidate dataset (e.g., math, code), still keeping some general data mixed in.
- To test a data recipe, merge the component models’ weights using the same mixture ratios.
- Evaluate the merged model as a proxy (no extra training) on benchmarks.
- Use a regression predictor to map mixture ratios to performance and iteratively home in on the best ratios.
- Why it matters: This decouples search from training—now you can evaluate unlimited mixtures quickly and cheaply while keeping strong alignment with real, large training outcomes.
🍞 Bottom Bread (Anchor) Instead of baking a new tray each time, you blend bites from the cupcakes you already made to find the yummiest mix.
Multiple Analogies for the Same Idea
- Smoothie Bar: You don’t grow new fruit for every smoothie test. You keep fruit (component models) on hand and change the pour ratios (merging weights) to try unlimited recipes (proxies).
- DJ Mixer: You don’t re-record songs to test playlists. You mix tracks (models) with different volume levels (weights) to preview the vibe (performance) before a live show (full pre-train).
- Paint Palette: You don’t repaint the whole wall to test a color. You mix from your existing paints (models) to preview any shade (mixture) on a small canvas (proxy evaluation).
🍞 Top Bread (Hook) You know how homework is faster when you separate thinking up a plan from actually writing the full essay?
🥬 Filling (The Actual Concept)
- What it is (Decoupling Search from Training): It means we do the exploring (search) without the heavy lifting (training) each time.
- How it works (step by step):
- Pay the cost once to train component models.
- After that, generate proxies by weight-merging—no retraining needed.
- Evaluate many candidates quickly and pick the best.
- Why it matters: Trying lots of options becomes cheap, so you can find better answers.
🍞 Bottom Bread (Anchor) Plan your essay outline first (cheap), then write only the final, best version (expensive) once.
🍞 Top Bread (Hook) You know how you can add more of your favorite topping to a pizza to change its taste?
🥬 Filling (The Actual Concept)
- What it is (Model Merging): Combining the weights of trained models to create a new model.
- How it works (step by step):
- Start from a shared base model.
- Train separate models on different datasets (each learns a “delta” over the base).
- Add or average these deltas to combine skills.
- Why it matters: If updates are small, adding deltas approximates training on the union/mixture of data.
🍞 Bottom Bread (Anchor) It’s like stacking skills: one model brings math spice, another brings code crunch, and together they make a balanced flavor.
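In symbols (notation ours, not quoted from the paper): each component θ_i is the shared base plus a learned delta, θ_i = θ_base + Δθ_i, so merging with ratios that sum to 1 just adds a weighted blend of deltas to the base:

```latex
\theta_{\text{merged}} = \sum_i \alpha_i\,\theta_i
= \theta_{\text{base}} + \sum_i \alpha_i\,\Delta\theta_i,
\qquad \sum_i \alpha_i = 1,\ \alpha_i \ge 0
```

When every Δθ_i is small, this blended update approximates the update you would get from actually training on the corresponding data mixture, which is the approximation DeMix leans on.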
🍞 Top Bread (Hook) Imagine mixing juice: more apple than orange gives a different taste than the reverse.
🥬 Filling (The Actual Concept)
- What it is (Weighted Model Merging): A special merging where each component model gets a chosen weight.
- How it works (step by step):
- Pick mixture weights that sum to 1.
- Multiply each component model’s weights by its chosen fraction.
- Add them up to get the merged proxy.
- Why it matters: The weights mirror the data mixture, so the merged model simulates training on that exact recipe—without training.
🍞 Bottom Bread (Anchor) Give 40% “math model,” 30% “code model,” 30% “general model,” and you’ve created a proxy for a 40/30/30 data recipe.
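To make the arithmetic concrete, here is a minimal sketch of linear weighted merging over PyTorch-style state dicts. The function name merge_models and the variable names are ours for illustration, not the paper's released code:

```python
def merge_models(state_dicts, weights):
    """Linear weighted merging: theta_merged = sum_i w_i * theta_i.

    state_dicts: component state dicts (tensor-valued dicts) that share one
    architecture and one base model.
    weights: mixture ratios mirroring the data recipe (non-negative, sum to 1).
    """
    assert abs(sum(weights) - 1.0) < 1e-6, "mixture weights must sum to 1"
    return {
        key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# A proxy for a 40% math / 30% code / 30% general recipe, with no training step:
# proxy_sd = merge_models([math_sd, code_sd, general_sd], [0.4, 0.3, 0.3])
# model.load_state_dict(proxy_sd)  # then evaluate on benchmarks as-is
```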
Before vs After
- Before: Either run few, expensive mid-sized proxies (accurate but pricey) or many tiny proxies (cheap but unreliable for hard skills).
- After: Train a few strong components once, then create unlimited accurate-enough proxies by merging to search widely and cheaply.
Why It Works (Intuition)
If each component model only moved a little from the base (small update), then combining those moves like vectors adds up close to the move you’d get from training on the merged data. This makes the merged weights behave like the model you would have trained on that data mix—good enough to rank mixtures reliably.
Building Blocks
- Base model trained on general data.
- Component models specialized per candidate dataset, still with some general data retained (regularization).
- Weighted model merging to generate proxies for any mixture.
- A regression predictor to score ratios and guide the search.
- Iterative sampling to zoom in on the top mixtures.
03 Methodology
At a High Level: Inputs → [Data Preprocessing] → [Component Model Training] → [Weighted Model Merging (Proxies)] → [Mixture Weight Optimization via Predictor] → Output: Best Data Mixture
Step 0: Data Preprocessing
- What happens: Collect general, math, code, and multilingual data; deduplicate; filter by perplexity and quality (e.g., FastText); and categorize into candidate datasets.
- Why it exists: Garbage-in equals garbage-out. Clean, clear categories make later mixing meaningful and fair.
- Example: Removing near-duplicate web pages and poor-quality snippets so the model doesn’t memorize junk.
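As a toy illustration of this stage, the sketch below drops whitespace-normalized duplicates with a hash and gates quality with a FastText classifier. The model file quality_classifier.bin, the label name, and the 0.9 threshold are hypothetical placeholders; the paper's actual filters and thresholds may differ:

```python
import hashlib

import fasttext  # pip install fasttext

quality_model = fasttext.load_model("quality_classifier.bin")  # hypothetical model file

def keep_document(text: str, seen_hashes: set, threshold: float = 0.9) -> bool:
    """Return True if a document survives dedup and the quality gate."""
    digest = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
    if digest in seen_hashes:  # whitespace-normalized exact duplicate
        return False
    seen_hashes.add(digest)
    labels, probs = quality_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__high_quality" and probs[0] >= threshold  # hypothetical label
```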
🍞 Top Bread (Hook) Think of building with LEGO: you first sort bricks by color and shape before you start.
🥬 Filling (The Actual Concept)
- What it is (Component Models): Separate models trained on each candidate dataset (e.g., math, code), all starting from the same base.
- How it works (step by step):
- Train a base model on general data to learn language basics.
- For each candidate dataset, continue training a copy of the base on that data plus some general data (e.g., 50/50) to keep broad skills.
- Now you have multiple specialized component models.
- Why it matters: These are the building blocks we can later merge to simulate any data recipe.
🍞 Bottom Bread (Anchor) It’s like having a math-strong friend, a code-strong friend, and a language-arts-strong friend—all ready to team up.
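A sketch of the ~50% general-data regularization during component training: every batch draws half its documents from the general pool and half from the candidate domain. The names (mixed_batches, general_docs, math_docs) are ours:

```python
import random

def mixed_batches(general_pool, domain_pool, batch_size=8, general_frac=0.5):
    """Yield batches that keep ~50% general data while specializing on one domain."""
    n_general = int(batch_size * general_frac)
    while True:
        batch = random.sample(general_pool, n_general)
        batch += random.sample(domain_pool, batch_size - n_general)
        random.shuffle(batch)
        yield batch

# e.g., for the math component: continue training the base model on
# mixed_batches(general_docs, math_docs)
```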
🍞 Top Bread (Hook) Want to taste a new smoothie without buying fruit again? Blend what’s already in your fridge.
🥬 Filling (The Actual Concept)
- What it is (Model Merging as Proxy): We create a stand-in model by mixing the weights of component models according to chosen ratios.
- How it works (step by step):
- Pick a set of weights that sum to 1.
- Multiply each component’s weights by its weight.
- Add them to get a merged model—your proxy.
- Why it matters: No training is needed per mixture, so you can try unlimited mixtures quickly.
🍞 Bottom Bread (Anchor) 40% math model + 60% general model instantly gives you a proxy for a 40/60 data recipe.
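With the merge_models sketch from the Core Idea section, that 40/60 proxy is two lines plus one evaluation; run_benchmarks is a stand-in for whatever evaluation harness you use:

```python
proxy_sd = merge_models([math_sd, general_sd], [0.4, 0.6])  # 40% math / 60% general
model.load_state_dict(proxy_sd)                             # no gradient steps taken
scores = run_benchmarks(model)                              # placeholder eval harness
```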
🍞 Top Bread (Hook) You know how you compare different pizza toppings by ranking your favorites?
🥬 Filling (The Actual Concept)
- What it is (Ranking-Based Evaluation): We score each proxy on multiple benchmarks and look at their rank orders across mixtures.
- How it works (step by step):
- Evaluate each merged proxy on general, math, and code benchmarks.
- Compare scores and compute overall rankings.
- Prefer mixtures that rank higher across domains.
- Why it matters: Ranking is robust when comparing across varied tasks and scales, helping pick consistent winners.
🍞 Bottom Bread (Anchor) If Mixture A beats Mixture B on most tests, its rank is higher—it’s like winning more events at a school field day.
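One plausible way to compute such an overall ranking, using SciPy; the paper may aggregate differently, so treat this as an illustration:

```python
import numpy as np
from scipy.stats import rankdata

# scores[i, j] = score of mixture i on benchmark j (general, math, code)
scores = np.array([[0.62, 0.31, 0.28],
                   [0.58, 0.35, 0.36],
                   [0.60, 0.29, 0.33]])

# Rank mixtures within each benchmark (1 = best), then macro-average the ranks.
ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, scores)
print(ranks.mean(axis=1))  # ~[2.0, 1.67, 2.33]: the second mixture wins most "events"
```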
🍞 Top Bread (Hook) If you’ve ever drawn a line through scatter points to predict trends, you’ve done regression.
🥬 Filling (The Actual Concept)
- What it is (Regression-Based Methods): Tools that learn to predict an outcome (like rank) from inputs (mixture weights).
- How it works (step by step):
- Collect pairs: (mixture weights → proxy ranking score).
- Train a regression model to map weights to performance.
- Use it to score many new candidate mixtures cheaply.
- Why it matters: It guides the search toward promising mixtures without evaluating all of them.
🍞 Bottom Bread (Anchor) It’s like predicting your final grade from how much you study each subject.
🍞 Top Bread (Hook) Think of LightGBM as a smart, speedy librarian who quickly learns which books you’ll like based on what you’ve enjoyed.
🥬 Filling (The Actual Concept)
- What it is (LightGBM): A fast, tree-based regression model used to predict rankings from mixture weights.
- How it works (step by step):
- Train LightGBM on (weights, ranks) collected from merged proxies.
- Score thousands of new mixtures.
- Keep the top predictions and iterate.
- Why it matters: It makes the search efficient and scalable.
🍞 Bottom Bread (Anchor) Like using a recommendation system to shortlist movies before you watch any.
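A minimal fit-and-score sketch with the lightgbm package; the hyperparameters and the random stand-in data are illustrative, not the paper's settings:

```python
import lightgbm as lgb
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training data: 128 sampled mixtures over 7 candidate datasets, each
# with a ranking score; in practice the scores come from evaluating merged proxies.
X = rng.dirichlet(np.ones(7), size=128)
y = rng.random(128)

predictor = lgb.LGBMRegressor(n_estimators=200, learning_rate=0.05)
predictor.fit(X, y)

# Score 100k fresh candidate mixtures essentially for free; keep the top 16.
candidates = rng.dirichlet(np.ones(7), size=100_000)
shortlist = candidates[np.argsort(predictor.predict(candidates))[-16:]]
```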
Iterative Mixture Weight Optimization (Recipe)
1. Sample many random mixtures (weights sum to 1).
2. Merge components to create proxies and evaluate them.
3. Train LightGBM to map weights → rank.
4. Use the predictor to score new mixtures; pick the top ones.
5. Repeat steps 2–4 a few times to zoom in on the best area.
6. Average the best few ratios to form the final mixture (a condensed code sketch of this loop follows).
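Putting the recipe together in one function: merge_models is the earlier sketch, evaluate_proxy is a placeholder for your benchmark harness, and all defaults are illustrative:

```python
import lightgbm as lgb
import numpy as np

def search_mixture(component_sds, evaluate_proxy, n_dims=7, rounds=3, n_eval=64):
    """Condensed version of steps 1-6 above."""
    rng = np.random.default_rng(0)
    weights = rng.dirichlet(np.ones(n_dims), size=n_eval)          # step 1
    X, y = [], []
    for _ in range(rounds):
        for w in weights:                                          # step 2: merge, no training
            proxy = merge_models(component_sds, w.tolist())
            X.append(w)
            y.append(evaluate_proxy(proxy))                        # benchmark the proxy
        model = lgb.LGBMRegressor().fit(np.array(X), np.array(y))  # step 3
        cands = rng.dirichlet(np.ones(n_dims), size=100_000)       # step 4: score cheaply
        weights = cands[np.argsort(model.predict(cands))[-n_eval:]]  # step 5: repeat on top picks
    return weights[-8:].mean(axis=0)                               # step 6: average the best few
```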
Secret Sauce
- Keep updates small: Each component is a moderate step from the base, so their deltas add up well.
- Regularize components with general data (~50%): Keeps them compatible and prevents losing general skills.
- Use simple linear merging: It worked best overall and required no extra hyperparameters.
04 Experiments & Results
🍞 Top Bread (Hook) You know how a good science fair test compares your project to others, measures carefully, and explains the results in plain numbers?
🥬 Filling (The Actual Concept)
- What it is (The Test): The team measured how well merged proxies predict real training outcomes (proxy consistency), and how good the final chosen mixtures are (mixture quality) across general, math, and code benchmarks.
- How it works (step by step):
- Train reference models on 96 different mixtures with large token budgets (50B tokens each) to create ground truth rankings.
- Create DeMix proxies by merging component models for the same 96 mixtures.
- Compare rankings using Spearman’s rank correlation (higher is better alignment).
- Measure capability recovery: how much of the absolute score the proxy keeps vs. the reference.
- Use DeMix’s predictor to pick an optimal mixture; train a model on that mixture (50B tokens) and compare its rank vs. baselines.
- Why it matters: If proxies rank mixtures like real training does, then search becomes cheap and trustworthy.
🍞 Bottom Bread (Anchor) If your practice quiz scores match your real test order (who scores higher or lower), your practice is a good predictor.
🍞 Top Bread (Hook) Imagine judging a lineup of runners by their finishing order rather than exact times.
🥬 Filling (The Actual Concept)
- What it is (Spearman’s Rank Correlation): A number between -1 and 1 that tells how similar two orderings are (1 is perfect match).
- How it works (step by step):
- Rank mixtures by proxy performance.
- Rank mixtures by real, big-training performance.
- Compute Spearman’s rho to see how closely the rankings agree.
- Why it matters: High rho means proxies reliably predict which mixtures win.
🍞 Bottom Bread (Anchor) If both you and your friend rank ice cream flavors in almost the same order, your tastes align—rho is high.
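A quick check with SciPy, on made-up numbers, showing what a perfect rank match looks like:

```python
from scipy.stats import spearmanr

proxy_scores = [0.41, 0.37, 0.52, 0.44]      # merged-proxy scores for four mixtures
reference_scores = [0.58, 0.50, 0.69, 0.61]  # matching large-run reference scores

rho, _ = spearmanr(proxy_scores, reference_scores)
print(rho)  # 1.0: the two orderings agree exactly
```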
🍞 Top Bread (Hook) If a photo shrinks but keeps most of its details, the compression worked well.
🥬 Filling (The Actual Concept)
- What it is (Capability Recovery): How much of the original performance a merged proxy keeps compared to its matching trained model.
- How it works (step by step):
- For each mixture, compute proxy score and reference score.
- Divide proxy by reference to get recovery rate (closer to 1 is better).
- Why it matters: Shows proxies don’t just rank well—they also keep strong absolute performance.
🍞 Bottom Bread (Anchor) It’s like saying, “This thumbnail keeps 85% of the sharpness of the big photo.”
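As a formula (notation ours): for mixture i with proxy score S_proxy and reference score S_ref,

```latex
\text{recovery}_i = \frac{S^{\text{proxy}}_i}{S^{\text{ref}}_i}
```

so a proxy scoring 51 where its reference model scores 60 has recovery 0.85, in line with the roughly 85% figure reported below.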
The Competition
- RegMix and CLIMB (state-of-the-art automated data mixture methods) rely on many trained proxies. They get better with larger proxies and more tokens—but that’s costly.
The Scoreboard (with Context)
- Proxy Accuracy: With component models trained on about 30B tokens each, DeMix’s merged proxies reached a macro-average Spearman’s rho of about 0.81, matching the ranking accuracy of expensive large-proxy searches at roughly one-sixth the training budget.
- Top-25% Agreement: In the best-quarter mixtures (the ones we really care about), DeMix maintained strong alignment, meaning it picks winners well.
- Capability Recovery: Proxies kept up to around 85% of absolute performance, showing merging isn’t just ranking—it’s faithful.
- Final Mixture Quality: When using DeMix to choose the final 50B-token mixture for a 1.7B model, it beat RegMix and CLIMB on overall rank across general, math, and code, at significantly lower compute costs. Increasing the number of proxies helped up to a point; too many could overfit noise.
Surprising Findings
- Simple linear merging worked best overall among merging strategies, and it was hyperparameter-free.
- Including about 50% general data in each component model was key: reducing that proportion hurt both ranking accuracy and recovery.
- More proxies improve search—but after a threshold (e.g., 448), returns diminish or even reverse due to overfitting to noise.
05 Discussion & Limitations
Limitations
- The small-update assumption: Merging relies on each component model not drifting too far from the base. If components are trained too long or too far, merging may be less faithful.
- Component quality matters: If a component model is undertrained or low quality, proxies inherit that weakness, hurting search accuracy.
- Benchmark dependence: The predictor learns from chosen benchmarks. If your production needs are very different, you may need to adjust the benchmark suite.
- Dimensionality vs. samples: As you add more candidate datasets (more mixture dimensions), you need more proxy evaluations. DeMix makes this far cheaper, but not free.
Required Resources
- A solid base model and compute to train a handful of component models (e.g., seven components at tens of billions of tokens each in the paper).
- A clean, well-labeled candidate dataset pool, including general data mixed into each component’s training.
- Modest compute for evaluation and training a lightweight regression predictor (e.g., LightGBM).
When NOT to Use
- If your components are extremely specialized with huge weight shifts (large deltas), merging may deviate from real training results.
- If you only care about a single, narrow skill and can afford a direct large-scale proxy on that specific domain, traditional training might be simpler.
- If your datasets change drastically mid-search (e.g., new data distributions), you may need to retrain components to keep merging reliable.
Open Questions
- Can merging remain faithful for even larger component updates or more extreme domain shifts?
- How does merging behave across different architectures (e.g., MoE) or with instruction-tuned components?
- Can we design smarter sampling strategies over the simplex to need even fewer proxy evaluations?
- Are there better predictors than LightGBM for specific domains (e.g., neural predictors that model interactions between domains more precisely)?
- How far can DeMix scale in the number of candidate datasets before we hit practical evaluation limits?
06 Conclusion & Future Work
3-Sentence Summary
DeMix separates searching for the best data mixture from expensive model training by using weighted model merging to create unlimited, training-free proxy models. These proxies accurately predict how real, large training would rank different mixtures, letting you try many recipes cheaply and then commit compute to only the best one. The method beats strong baselines like RegMix and CLIMB on overall performance across general, math, and code—using far less compute—and comes with a high-quality, validated 22T-token DeMix Corpora.
Main Achievement
The paper shows that simple, linear weighted model merging can stand in for costly proxy training when optimizing data mixtures, achieving strong rank alignment and capability recovery while slashing compute costs.
Future Directions
- Explore merging beyond small updates, new architectures, and more domains.
- Improve sampling and prediction methods to reduce proxy counts further without losing accuracy.
- Integrate with curriculum schedules so mixture search adapts across pre-training stages automatically.
- Extend to multimodal mixtures (text, code, math, images, audio) using unified predictors.
Why Remember This
DeMix turns mixture search from a compute sink into a quick, trustworthy preview. By paying the training cost once for components and then merging, you can explore the space of data recipes widely and confidently, leading to better, cheaper, and faster LLM pre-training.
Practical Applications
- Quickly search for the best pre-training data mixture for a new LLM without training dozens of proxies.
- Balance general language, math, and code performance by adjusting merging weights and evaluating merged proxies.
- Reduce compute costs in R&D by training a small set of component models once and reusing them for unlimited mixture tests.
- Customize mixtures for target markets (e.g., more multilingual or domain-specific data) and validate with fast proxy evaluations.
- Run iterative mixture optimization with LightGBM to home in on top-performing ratios for your benchmarks.
- Compare merging strategies (linear vs. others) on your data to confirm the simplest approach works well.
- Maintain general capabilities by ensuring each component model includes a healthy portion of general data during training.
- Rapidly A/B test curriculum schedules by swapping mixtures across pre-training stages without retraining proxies.
- Use the DeMix Corpora as a high-quality baseline and fine-tune mixture weights for your specific downstream needs.