THINKSAFE: Self-Generated Safety Alignment for Reasoning Models
Intermediate · Seanie Lee, Sangwoo Park et al. · Jan 30 · arXiv
Large reasoning models have become very good at step-by-step thinking, but that ability sometimes makes them too eager to follow harmful instructions.
#THINKSAFE #self-generated safety alignment #refusal steering