This paper shows, step by step, how to train a 1.36-billion-parameter science-focused language model directly from raw arXiv LaTeX files using only 2 A100 GPUs.
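To make the "raw LaTeX in, training text out" idea concrete, here is a minimal sketch of one possible cleaning step. The `clean_latex` helper and its regexes are illustrative assumptions, not the paper's actual pipeline, which the summary does not describe.

```python
import re

def clean_latex(source: str) -> str:
    """Roughly convert raw arXiv LaTeX into plain training text.

    Hypothetical cleaner for illustration only; the paper's real
    preprocessing is not specified in this summary.
    """
    # Drop comments (an unescaped % to end of line).
    text = re.sub(r"(?<!\\)%.*", "", source)
    # Keep only the document body if a preamble is present.
    m = re.search(r"\\begin\{document\}(.*)\\end\{document\}", text, re.S)
    if m:
        text = m.group(1)
    # Remove figure/table environments wholesale.
    text = re.sub(r"\\begin\{(figure|table)\*?\}.*?\\end\{\1\*?\}",
                  "", text, flags=re.S)
    # Unwrap simple formatting commands like \textbf{...} or \emph{...}.
    text = re.sub(r"\\(textbf|textit|emph)\{([^{}]*)\}", r"\2", text)
    # Collapse runs of blank lines.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

sample = r"""
\documentclass{article}
\begin{document}
We train a \textbf{1.36B} model. % internal note
\begin{figure}\includegraphics{plot.png}\end{figure}
Equations like $E = mc^2$ are kept verbatim.
\end{document}
"""
print(clean_latex(sample))
```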
Solar Open is a 102-billion-parameter bilingual language model focused on bringing underserved languages like Korean up to English-level AI quality.
TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.
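A toy sketch of that controlled-variable setup: the model configuration is held fixed while only the tokenizer changes. The tokenizers and the config dictionary below are invented for illustration; TokSuite's actual tokenizers and training recipe are not described in this summary.

```python
# Everything but the tokenizer stays fixed across runs.
FIXED_MODEL_CONFIG = {"layers": 12, "hidden": 768, "heads": 12}

def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

def char_tokenize(text: str) -> list[str]:
    return list(text)

def byte_tokenize(text: str) -> list[str]:
    return [f"{b:02x}" for b in text.encode("utf-8")]

TOKENIZERS = {
    "whitespace": whitespace_tokenize,
    "character": char_tokenize,
    "byte": byte_tokenize,
}

sample = "Tokenizers split text differently."
for name, tok in TOKENIZERS.items():
    tokens = tok(sample)
    # Same model config, very different sequence length per tokenizer.
    print(f"{name:>10}: {len(tokens):3d} tokens, "
          f"config={FIXED_MODEL_CONFIG['layers']} layers")
```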
The paper shows that having a model write a number digit by digit and grading the complete number at the end works better than grading each digit separately.
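A tiny sketch of that contrast, with hypothetical reward functions (the paper's actual objective is not given in this summary): per-digit grading hands out partial credit even when the final number is wrong, while whole-number grading only pays off for the exact answer.

```python
def per_digit_reward(pred_digits: list[str], target: str) -> float:
    """Partial credit: fraction of digit positions matching the target."""
    matches = sum(p == t for p, t in zip(pred_digits, target))
    return matches / max(len(target), 1)

def whole_number_reward(pred_digits: list[str], target: str) -> float:
    """All-or-nothing: reward only if the assembled number is exactly right."""
    return 1.0 if "".join(pred_digits) == target else 0.0

target = "1024"
prediction = ["1", "0", "2", "5"]  # last digit wrong

print(per_digit_reward(prediction, target))     # 0.75 -- rewards a wrong answer
print(whole_number_reward(prediction, target))  # 0.0  -- wrong number, no credit
```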