From Data to Behavior: Predicting Unintended Model Behaviors Before Training
Intermediate · Mengru Wang, Zhenqian Xu et al. · Feb 4 · arXiv
Large language models can quietly pick up hidden preferences from training data that looks harmless.
#Data2Behavior, #Manipulating Data Features, #activation injection