TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Beginner · Gül Sena Altıntaş, Malikeh Ehghaghi et al. · Dec 23 · arXiv
TokSuite is a science lab for tokenizers: it trains 14 language models that are identical in every way except for how they split text into tokens.
#tokenization #tokenizer robustness #Byte Pair Encoding (BPE)