Large language models don't map out a full step-by-step plan before they start thinking; they mostly plan just a little bit ahead.
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
This survey turns model understanding into a step-by-step repair toolkit called Locate, Steer, and Improve.