This paper shows, step by step, how to train a 1.36-billion-parameter science-focused language model directly from raw arXiv LaTeX source using only two A100 GPUs.
Big idea: use a small, already-trained model to guide a larger model early in training, so the larger one converges faster and ends up more capable.
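One common way to realize this small-guides-big idea is a distillation-style loss whose teacher term fades out over training. The sketch below is an assumption about the mechanism, not the paper's actual method: the function name `guided_loss`, the linear fade-out schedule, and the `warmup_steps` parameter are all illustrative choices.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def guided_loss(student_logits, teacher_logits, target_ids,
                step, warmup_steps=1000):
    """Cross-entropy on the data plus a KL term pulling the large
    (student) model toward the small teacher's distribution. The
    teacher's weight decays linearly to zero over `warmup_steps`,
    so the large model eventually trains on the data alone.
    (Illustrative sketch; the schedule and weighting are assumptions.)"""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    rows = np.arange(len(target_ids))
    ce = -np.log(p_student[rows, target_ids] + 1e-12).mean()
    kl = (p_teacher * (np.log(p_teacher + 1e-12)
                       - np.log(p_student + 1e-12))).sum(-1).mean()
    alpha = max(0.0, 1.0 - step / warmup_steps)  # guidance fades out
    return (1 - alpha) * ce + alpha * kl

# Toy batch of one token position over a 3-word vocabulary.
student = np.array([[2.0, 0.5, -1.0]])
teacher = np.array([[1.5, 0.7, -0.5]])
targets = np.array([0])
early = guided_loss(student, teacher, targets, step=0)     # pure teacher KL
late = guided_loss(student, teacher, targets, step=1000)   # pure data CE
```

Early in training the loss is dominated by the teacher term (the "good habits"), and by the end it reduces to ordinary next-token cross-entropy.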