Papers5

All Beginner Intermediate Advanced

All Sources arXiv

#residual stream

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Intermediate

Hengyuan Zhang, Zhihao Zhang et al.Jan 20arXiv

This survey turns model understanding into a step-by-step repair toolkit called Locate, Steer, and Improve.

#mechanistic interpretability#residual stream#attention heads

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Intermediate

Christina Lu, Jack Gallagher et al.Jan 15arXiv

Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.

#Assistant Axis#persona drift#activation capping

Mechanistic Interpretability of Large-Scale Counting in LLMs through a System-2 Strategy

Intermediate

Hosein Hasani, Mohammadali Banayeeanzade et al.Jan 6arXiv

Large language models (LLMs) are good at many math problems but often mess up simple counting when the list gets long.

#mechanistic interpretability#counting in LLMs#System-2 prompting

Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Beginner

Zhenyu Zhang, Shujian Zhang et al.Dec 30arXiv

This paper shows a new way (called RISE) to find and control how AI models think without needing any human-made labels.

#RISE#sparse auto-encoder#reasoning vectors

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Beginner

Yuqiao Tan, Minzheng Wang et al.Dec 22arXiv

Large language models (LLMs) don’t act as a single brain; inside, each layer and module quietly makes its own mini-decisions called internal policies.

#Bottom-up Policy Optimization#internal layer policy#internal modular policy