Papers (4)

#sparse autoencoders

Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

Intermediate

Anton Korznikov, Andrey Galichin et al. · Feb 15 · arXiv

Sparse autoencoders (SAEs) are popular for explaining what large language models are doing, but this paper shows they often don’t learn real, meaningful features.

#sparse autoencoders #interpretability #dictionary learning

Shaping capabilities with token-level data filtering

Intermediate

Neil Rathi, Alec Radford · Jan 29 · arXiv

The paper shows a simple way to teach AI models what not to learn by removing only the exact words (tokens) related to unwanted topics during pretraining.

#token-level data filtering #capability shaping #sparse autoencoders

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Intermediate

Hengyuan Zhang, Zhihao Zhang et al. · Jan 20 · arXiv

This survey turns model understanding into a step-by-step repair toolkit called Locate, Steer, and Improve.

#mechanistic interpretability #residual stream #attention heads

Reasoning Models Generate Societies of Thought

Intermediate

Junsol Kim, Shiyang Lai et al. · Jan 15 · arXiv

The paper shows that top reasoning AIs don’t just think longer; they act like a small team of internal voices that question one another, disagree, and then reach agreement.

#society of thought #reasoning reinforcement learning #conversational behaviors