The paper asks a simple question: which kind of step-by-step reasoning helps small language models learn best, and why?
When you tune the learning rate carefully, plain old LoRA fine-tuning works about as well as the fancy new LoRA variants.
The paper asks which small, add-on training tricks (parameter-efficient fine-tuning, or PEFT) work best when we teach language models through reinforcement learning with yes/no rewards we can check (RLVR).