Most language models are trained on tokens, a compressed encoding of raw text produced by a tokenizer. This makes training fast but ties the model to that specific tokenizer.
The paper shows that many AI systems work best when a small 'compressor' model first condenses long text into a short, information-dense summary and a larger 'predictor' model then reasons over that summary.
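
The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the paper's method: the names `run_pipeline`, `toy_compressor`, and `toy_predictor` are hypothetical, and both "models" are trivial string functions standing in for real neural networks.

```python
from typing import Callable

# Hypothetical type aliases: a compressor maps a long document to a short
# summary; a predictor answers a query given only that summary.
Compressor = Callable[[str], str]
Predictor = Callable[[str, str], str]


def run_pipeline(document: str, query: str,
                 compressor: Compressor, predictor: Predictor) -> str:
    """Compress first, then let the predictor reason over the summary only."""
    summary = compressor(document)
    return predictor(summary, query)


def toy_compressor(document: str, max_sentences: int = 2) -> str:
    """Stand-in for the small model: keep the first few sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return ". ".join(sentences[:max_sentences]) + "."


def toy_predictor(summary: str, query: str) -> str:
    """Stand-in for the large model: report whether the summary mentions the query."""
    return "yes" if query.lower() in summary.lower() else "no"


if __name__ == "__main__":
    doc = "Tokenizers compress text. Models read tokens. Unrelated filler sentence."
    print(run_pipeline(doc, "tokens", toy_compressor, toy_predictor))
    print(run_pipeline(doc, "filler", toy_compressor, toy_predictor))
```

The point of the sketch is that the predictor never sees the original document, so anything the compressor drops (here, the third sentence) is unavailable to it. That is why the quality of the summary matters so much in this kind of design.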