This paper makes language models safer, more factual, and higher quality during pretraining, not just afterward, by using reinforcement learning with a stronger model as a helper.
Language models store ideas along straight-line directions inside their internal representations (their "brains"), like sliders for "truth" or "ethics."
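To make the "slider" picture concrete, here is a minimal sketch of nudging a model's hidden states along a single concept direction during generation. The model choice, the layer index, and the randomly initialized `truth_direction` vector are all illustrative assumptions, not the paper's actual method; in practice such a direction would be estimated from data rather than drawn at random.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: model, layer, and direction are assumptions, not the paper's.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_size = model.config.hidden_size
# Hypothetical "truth" direction; a real one might come from, e.g., the
# difference of mean activations on true vs. false statements.
truth_direction = torch.randn(hidden_size)
truth_direction = truth_direction / truth_direction.norm()
alpha = 4.0  # slider strength: how far to push along the direction

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + alpha * truth_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

# Add the offset at one middle layer while generating.
handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The moon landing was", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Turning `alpha` up or down (or flipping its sign) is the "slider" intuition: moving along one direction shifts how strongly the associated concept shows up in the model's output.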
Personalized AI helpers can accidentally echo a user's past opinions instead of stating objective facts, a failure the authors call personalization-induced hallucination.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.