Large language models can quietly absorb hidden preferences from training data that appears harmless.
Large reasoning models have become very good at step-by-step thinking, but that same capability sometimes makes them too eager to comply with harmful instructions.
Language models can play many characters, but after post-training they usually settle into the role of a helpful Assistant.