Papers2

#weak-to-strong generalization

Steering LLMs via Scalable Interactive Oversight

The paper tackles a common problem: people can ask AI to do big, complex tasks, but they can’t always explain exactly what they want or check the results well.

#scalable oversight#interactive alignment#requirement elicitation

Shaping capabilities with token-level data filtering

Intermediate

Neil Rathi, Alec RadfordJan 29arXiv

The paper shows a simple way to teach AI models what not to learn by removing only the exact words (tokens) related to unwanted topics during pretraining.

#token-level data filtering#capability shaping#sparse autoencoders