Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Intermediate | Hongjun An, Yiliang Song et al. | Jan 10 | arXiv
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.
#Preference-Undermining Attacks #PUA #sycophancy