Large language models can quietly absorb hidden preferences from training data that appears harmless.
Large reasoning models have become very good at step-by-step thinking, but that same capability sometimes makes them too eager to comply with harmful instructions.
Language models can play many characters, but after post-training they usually settle into the role of a helpful Assistant.