Large language models can act unpredictably in sensitive places like schools, hospitals, and customer support, so we need reliable ways to guide how they talk and behave.
Large language models can quietly pick up hidden preferences from training data that looks harmless.
RAPTOR is a simple, fast way to find a direction (a concept vector) inside a frozen language model that points toward a concept like 'sarcasm' or 'positivity.'
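To make the idea of a concept vector concrete, here is a minimal sketch of what "a direction that points toward a concept" can mean. It does not reproduce RAPTOR's procedure, which this passage does not describe. As a stand-in, it uses a plain difference of means between activations from examples with and without the concept. The function names (`mean_vector`, `concept_direction`, `concept_score`) and the 3-dimensional toy activations are hypothetical and invented for illustration; in practice the activations would be read from a hidden layer of the frozen model.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(with_concept, without_concept):
    """Unit vector from the 'without' mean toward the 'with' mean.

    Stand-in technique (difference of means), not RAPTOR itself.
    """
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [d / norm for d in diff]


def concept_score(activation, direction):
    """Projection of one activation onto the concept direction."""
    return sum(a * d for a, d in zip(activation, direction))


# Toy activations (assumed values, not taken from any real model).
sarcastic = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
literal = [[0.0, 0.2, 0.0], [0.2, 0.1, 0.1]]

direction = concept_direction(sarcastic, literal)
print(concept_score(sarcastic[0], direction))
print(concept_score(literal[0], direction))
```

In this toy setup, activations from the "sarcastic" group project further along the direction than those from the "literal" group. That is the property a concept vector is meant to capture, whatever method is used to find it.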