The paper asks a simple question: which kind of step-by-step reasoning helps small language models learn best, and why?
Traditional supervised fine-tuning (SFT) trains a model to reproduce a single reference answer token by token, which can cause it to overfit to the exact wording rather than the underlying idea.
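To make the overfitting concern concrete, here is a minimal sketch of the token-level cross-entropy objective SFT uses. The toy next-token probability table and token names are hypothetical stand-ins for a real language model; the point is only that the loss rewards matching the reference's surface form, so a paraphrase with the same meaning is penalized.

```python
import math

# Hypothetical toy "model": next-token probabilities conditioned on the
# previous token (an illustrative stand-in for a real LM).
toy_probs = {
    ("<s>",): {"The": 0.6, "A": 0.4},
    ("The",): {"answer": 0.7, "result": 0.3},
    ("answer",): {"is": 0.9, "equals": 0.1},
    ("is",): {"42": 0.8, "forty-two": 0.2},
    ("result",): {"equals": 0.5, "is": 0.5},
    ("equals",): {"forty-two": 0.3, "42": 0.7},
}

def sft_loss(reference):
    """Token-level cross-entropy: sum of -log p(token | context)."""
    loss = 0.0
    prev = "<s>"
    for tok in reference:
        loss += -math.log(toy_probs[(prev,)][tok])
        prev = tok
    return loss

# The exact reference wording gets a lower loss than a paraphrase
# that expresses the same idea in different words.
exact = sft_loss(["The", "answer", "is", "42"])
paraphrase = sft_loss(["The", "result", "equals", "forty-two"])
print(exact < paraphrase)  # True: SFT scores surface form, not meaning
```

Because the objective sums per-token log-probabilities against one fixed answer string, any deviation in wording raises the loss even when the content is equivalent, which is the overfitting risk described above.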