Binary right/wrong rewards for training reasoning in large language models are hard to design and often too sparse to learn from.
The paper introduces DASD-4B-Thinking, a small (4B-parameter) open-source reasoning model that performs comparably to much larger models on challenging math, science, and coding benchmarks.