This paper introduces YaPO, a method for gently nudging a language model's internal representations, its "hidden thoughts", so that it behaves better without any retraining.
DiffCoT treats the model's step-by-step reasoning (its Chain-of-Thought) as a rough draft that can be progressively cleaned up, rather than as something written once and fixed forever.
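To make the "gentle nudge" idea concrete, here is a minimal sketch of generic activation steering: a fixed vector is added to one layer's hidden states at inference time, so no weights are updated. Everything in this sketch (the `add_steering_hook` helper, `steering_vector`, `alpha`) is an illustrative assumption, not the paper's actual procedure.

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      steering_vector: torch.Tensor,
                      alpha: float = 1.0):
    """Nudge a layer's hidden states by a fixed vector at inference time.

    No parameters are changed; the hook simply shifts the activations
    flowing out of `layer` on every forward pass.
    """
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        nudged = hidden + alpha * steering_vector.to(hidden.dtype)
        if isinstance(output, tuple):
            return (nudged,) + output[1:]
        return nudged

    return layer.register_forward_hook(hook)

# Hypothetical usage: steer one block of a loaded decoder-only model.
# handle = add_steering_hook(model.transformer.h[10], steering_vector, alpha=0.5)
# ... generate as usual ...
# handle.remove()  # the model is unchanged once the hook is removed
```

Removing the hook restores the original model exactly, which is what "without retraining" buys you: the behavioral change lives entirely in the added vector, not in the weights.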