Large language models usually get only a final thumbs-up or thumbs-down at the end of their answer, which is too late to fix mistakes made in the middle.
DiffCoT treats a model's step-by-step thinking (Chain-of-Thought) like a messy draft that can be cleaned up over time, not something fixed forever.