Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Intermediate | Shengrui Li, Fei Zhao et al. | Jan 31 | arXiv
Training big language models works best when you mix the right kinds of data (general, math, code), but finding the best mix used to be slow and very expensive.
#data mixture optimization #model merging #weighted model merging