Papers (2)

#LLM pre-training

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Pre-training large language models works best with the right mixture of data domains (general text, math, code), but searching for the best mixture has been slow and very expensive.

#data mixture optimization #model merging #weighted model merging

Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation

Advanced

Andrei Panferov, Erik Schultheis et al. · Jan 30 · arXiv

This paper shows how to pre-train large language models faster and more cheaply in the 4-bit NVFP4 number format without losing much accuracy.

#NVFP4 #FP4 training #quantization-aware training