Training large language models works best with the right mix of data types (general, math, code), but finding the best mix has traditionally been slow and very expensive.
This paper shows how to train large language models faster and more cheaply by using a 4-bit floating-point format (NVFP4) without losing much accuracy.
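To give a feel for what "4-bit numbers" means in practice, here is a minimal sketch of block-scaled 4-bit quantization. It is an illustration under stated assumptions, not the paper's training recipe: the value grid assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit), the block size of 16 is assumed, and the per-block scale is kept as an ordinary Python float rather than a low-precision format. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed layout:
# 1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed number of values that share one scale


def quantize_block(block):
    """Return (scale, codes), where each code is a signed value from FP4_GRID."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = max_abs / FP4_GRID[-1]
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(-mag if x < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate real values from a scale and its 4-bit codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize, block by block."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[i:i + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out
```

Each value is stored as one of only eight magnitudes plus a sign, so the shared per-block scale is what keeps the rounding error small relative to the values in that block. Comparing `fake_quantize(x)` with `x` shows how much precision is given up.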