ArXiv-to-Model: A Practical Study of Scientific LM Training
Intermediate · Anuj Gupta · Feb 19 · arXiv
This paper shows, step by step, how to train a 1.36-billion-parameter, science-focused language model directly from raw arXiv LaTeX files using only two A100 GPUs.
#scientific language model, #arXiv LaTeX, #tokenization