SLIME is a new way to train chatbots so they follow human preferences without forgetting how to write well.
Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.