The paper shows that three popular ways to control language models (fine-tuning a few weights, LoRA, and activation steering) are really the same kind of operation: a dynamic weight update driven by a control knob.
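To make the "same kind of operation" claim concrete, here is a tiny numerical sketch (the names `W`, `x`, `v`, and `alpha` are illustrative, not the paper's notation): adding a steering vector to a layer's output can be rewritten, for that particular input, as a rank-1 update to the layer's weights, with `alpha` acting as the control knob.

```python
# Minimal numerical sketch (illustrative names, not the paper's notation):
# adding a steering vector to a layer's output equals applying an
# input-dependent rank-1 update to that layer's weight matrix.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 16
W = rng.normal(size=(d_out, d_in))   # frozen layer weight
x = rng.normal(size=d_in)            # one input activation
v = rng.normal(size=d_out)           # steering direction ("concept vector")
alpha = 2.0                          # control knob (steering strength)

# Activation steering: nudge the layer output along v.
steered = W @ x + alpha * v

# The same effect, written as a dynamic rank-1 weight update for this input.
delta_W = alpha * np.outer(v, x) / (x @ x)
updated = (W + delta_W) @ x

print(np.allclose(steered, updated))  # True
```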
RAPTOR is a simple, fast way to find a direction (a concept vector) inside a frozen language model that points toward a concept like 'sarcasm' or 'positivity.'
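For intuition, here is a minimal sketch of one common recipe for finding such a concept vector: take the difference of mean activations between prompts that show the concept and prompts that don't. This only illustrates the idea of a concept direction and is not necessarily RAPTOR's exact procedure; all names are made up.

```python
# Sketch: concept direction as a difference of mean hidden activations
# (a common recipe, not necessarily RAPTOR's exact method).
import numpy as np

def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """pos_acts/neg_acts: (num_prompts, hidden_dim) activations from a frozen model."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)   # unit-length concept vector

# Toy example with synthetic activations standing in for real model states.
rng = np.random.default_rng(1)
hidden = 64
true_dir = rng.normal(size=hidden)
pos = rng.normal(size=(100, hidden)) + 0.5 * true_dir   # e.g. sarcastic prompts
neg = rng.normal(size=(100, hidden))                     # e.g. neutral prompts
v = concept_direction(pos, neg)
print(v.shape, float(v @ true_dir / np.linalg.norm(true_dir)))  # aligns with true_dir
```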
Language models store ideas along straight-line directions inside their “brains” (their internal representations), like sliders for “truth” or “ethics.”
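As a concrete picture of the "slider" idea, the sketch below (a toy PyTorch module, not a real language model) adds `alpha * direction` to one layer's activations via a forward hook, so turning `alpha` up or down slides the model along that direction.

```python
# Sketch: steering a layer's activations along a chosen direction,
# with alpha playing the role of the slider.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 32
model = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
direction = torch.randn(hidden)
direction = direction / direction.norm()   # unit-length "truth"/"ethics" axis
alpha = 3.0                                # slider position (steering strength)

def steer(module, inputs, output):
    # Shift this layer's output along the chosen direction.
    return output + alpha * direction

handle = model[0].register_forward_hook(steer)
x = torch.randn(1, hidden)
print(model(x).shape)   # same shape, but computed from steered activations
handle.remove()
```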
Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.
This survey turns model understanding into a step-by-step repair toolkit called Locate, Steer, and Improve.
Personalized AI helpers can accidentally copy a user’s past opinions instead of telling objective facts, which the authors call personalization-induced hallucinations.
Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.
This paper shows a new way (called RISE) to find and control how AI models think without needing any human-made labels.
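For intuition about label-free direction finding, here is a minimal sketch that runs PCA (via SVD) over hidden activations and treats the top components as candidate concept axes. This only illustrates the unsupervised idea; it is not RISE's actual algorithm, and all names are made up.

```python
# Sketch: unsupervised candidate directions from activations alone,
# with no human-made labels (illustration only, not RISE's algorithm).
import numpy as np

def top_directions(acts: np.ndarray, k: int = 3) -> np.ndarray:
    """acts: (num_samples, hidden_dim) activations from a frozen model."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]   # k unit-length candidate directions

rng = np.random.default_rng(2)
acts = rng.normal(size=(500, 64))   # stand-in for collected hidden states
dirs = top_directions(acts, k=3)
print(dirs.shape)   # (3, 64)
```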