Deep research agents write long reports, but older tests often judge only how fluent the reports sound and whether they include links, not whether the facts are still true today or the logic really holds.
This paper teaches language models to be safer, more factual, and better at producing high-quality text during pretraining, not just afterward, by using reinforcement learning with a stronger model as a helper.
Language models store ideas along straight-line directions in their internal representations, like sliders for “truth” or “ethics.”
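To make the slider picture concrete, here is a minimal sketch of one common way to estimate such a direction: take the difference between the mean activation of two groups of examples, then move a vector along it. This is an illustration on made-up 3-dimensional toy vectors, not the paper's own method; the names `unit_direction`, `project`, and `steer` and all of the numbers are assumptions chosen for this example.

```python
import math

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def unit_direction(positive, negative):
    """Difference-of-means direction between two groups, scaled to length 1."""
    diff = [p - q for p, q in zip(mean(positive), mean(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def project(vector, direction):
    """Read the slider: how far the vector lies along the direction."""
    return sum(v * d for v, d in zip(vector, direction))

def steer(vector, direction, alpha):
    """Move the slider: shift the vector by alpha along the direction."""
    return [v + alpha * d for v, d in zip(vector, direction)]

# Hypothetical toy activations (assumed values, not real model data).
true_acts = [[2.0, 0.0, 1.0], [2.0, 1.0, 0.0]]
false_acts = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

direction = unit_direction(true_acts, false_acts)
before = project(false_acts[1], direction)
after = project(steer(false_acts[1], direction, 1.5), direction)
print(direction, before, after)
```

In this toy setup the two groups differ only in their first coordinate, so the estimated direction points along that axis, and steering by 1.5 raises the projection by exactly 1.5. Real model activations have thousands of dimensions and the directions are only approximately linear, so this should be read as a picture of the idea rather than a recipe.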
Personalized AI helpers can accidentally echo a user’s past opinions instead of stating objective facts, a problem the authors call personalization-induced hallucinations.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.