Large language models can quietly pick up hidden preferences from training data that looks harmless.
RAPTOR is a simple, fast method for finding a direction (a concept vector) inside a frozen language model that points toward a concept such as 'sarcasm' or 'positivity.'
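The source does not spell out how RAPTOR computes its concept vector, so the sketch below illustrates only the general idea behind such methods: a common difference-of-means recipe, where the direction is the gap between mean hidden activations on prompts that express the concept and prompts that do not. The data here is synthetic (random vectors standing in for a frozen model's layer activations), and all names (`fake_activations`, `true_direction`, `hidden_dim`) are illustrative assumptions, not part of RAPTOR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen model's hidden states at some layer.
# In practice these would come from running the model on real prompts.
hidden_dim = 16
true_direction = rng.normal(size=hidden_dim)
true_direction /= np.linalg.norm(true_direction)  # the "planted" concept axis

def fake_activations(has_concept: bool, n: int = 50) -> np.ndarray:
    """Simulate n activation vectors, shifted along the concept axis if present."""
    base = rng.normal(size=(n, hidden_dim))
    return base + (2.0 * true_direction if has_concept else 0.0)

pos = fake_activations(True)    # activations for concept-bearing prompts
neg = fake_activations(False)   # activations for neutral prompts

# Difference-of-means: the concept vector points from "without" to "with".
concept_vector = pos.mean(axis=0) - neg.mean(axis=0)
concept_vector /= np.linalg.norm(concept_vector)

# New activations can be scored by projecting onto the direction.
scores = fake_activations(True, 5) @ concept_vector
```

Because the model stays frozen, the only cost is a forward pass per prompt plus a mean and a subtraction, which is why this family of methods is fast.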