Papers (9)

#activation steering

ASA: Training-Free Representation Engineering for Tool-Calling Agents

The paper identifies a gap: the model’s internal representations almost perfectly indicate when it should call a tool, but its generated text often fails to trigger the tool call under strict rules.

#activation steering #representation engineering #tool calling


Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Intermediate

Ziwen Xu, Chenyan Wu et al. · Feb 2 · arXiv

The paper shows that three popular ways to control language models—fine-tuning a few weights, LoRA, and activation steering—are actually the same kind of action: a dynamic weight update driven by a control knob.

#language model steering #dynamic weight updates #activation steering

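For the "Why Steering Works" entry above, a toy check may help make the claimed equivalence concrete. This is a minimal sketch for a single linear layer, not the paper's formulation: the names `W`, `b`, `v`, and `alpha` are illustrative assumptions, and only the simplest case (an additive steering vector behaving like a bias update) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3

# A hypothetical linear layer h = W @ x + b with random parameters.
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
x = rng.normal(size=d_in)

v = rng.normal(size=d_out)  # hypothetical steering direction
alpha = 2.0                 # the "control knob" (steering strength)

# Activation-space view: add alpha * v to the layer's output.
steered = (W @ x + b) + alpha * v

# Weight-space view: leave the activations alone and shift the bias instead.
updated = W @ x + (b + alpha * v)

# Both views produce the same output for any input x.
print(np.allclose(steered, updated))
```

Changing `alpha` scales the update in both views at once, which is the sense in which a single knob drives the effective weight change in this toy case.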

RAPTOR: Ridge-Adaptive Logistic Probes

Intermediate

Ziqi Gao, Yaotian Zhu et al. · Jan 29 · arXiv

RAPTOR is a simple, fast way to find a direction (a concept vector) inside a frozen language model that points toward a concept like 'sarcasm' or 'positivity.'

#probing #concept vectors #activation steering

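For the RAPTOR entry above, the general idea of reading a concept vector off a linear probe can be sketched as follows. This uses plain L2-regularized (ridge) logistic regression with a fixed penalty on synthetic data, not RAPTOR's adaptive scheme, which the summary does not describe. The synthetic "activations", the planted direction `true_dir`, and the helper `fit_probe` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400

# Synthetic stand-in for frozen-model activations: the two classes differ
# only along one planted direction (an assumption for this sketch).
true_dir = np.zeros(d)
true_dir[0] = 1.0
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + 2.0 * np.outer(2 * y - 1, true_dir)

def fit_probe(X, y, l2=1.0, lr=0.1, steps=500):
    """Fit an L2-regularized logistic probe with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))           # predicted probabilities
        grad_w = X.T @ (p - y) / len(y) + l2 * w / len(y)  # log-loss + ridge term
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit_probe(X, y)

# The normalized probe weights serve as the concept vector.
concept_vector = w / np.linalg.norm(w)
cosine = float(concept_vector @ true_dir)
accuracy = float(np.mean(((X @ w + b) > 0) == y))
print(f"cosine to planted direction: {cosine:.3f}, probe accuracy: {accuracy:.3f}")
```

With real models, `X` would be hidden states collected at a chosen layer and `y` the concept labels; the model itself stays frozen and only the probe is trained.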

Linear representations in language models can change dramatically over a conversation

Intermediate

Andrew Kyle Lampinen, Yuxuan Li et al. · Jan 28 · arXiv

Language models store concepts along linear directions in their internal representations, like sliders for “truth” or “ethics.”

#linear representations #factuality #ethics


Privacy Collapse: Benign Fine-Tuning Can Break Contextual Privacy in Language Models

Intermediate

Anmol Goel, Cornelius Emde et al. · Jan 21 · arXiv

Benign fine-tuning meant to make language models more helpful can accidentally make them overshare private information.

#contextual privacy #privacy collapse #fine-tuning


Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

Intermediate

Hengyuan Zhang, Zhihao Zhang et al. · Jan 20 · arXiv

This survey turns model understanding into a step-by-step repair toolkit called Locate, Steer, and Improve.

#mechanistic interpretability #residual stream #attention heads


When Personalization Misleads: Understanding and Mitigating Hallucinations in Personalized LLMs

Intermediate

Zhongxiang Sun, Yi Zhan et al. · Jan 16 · arXiv

Personalized AI assistants can accidentally echo a user’s past opinions instead of stating objective facts, which the authors call personalization-induced hallucinations.

#personalized large language models #hallucination #factuality


The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Intermediate

Christina Lu, Jack Gallagher et al. · Jan 15 · arXiv

Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.

#Assistant Axis #persona drift #activation capping


Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Beginner

Zhenyu Zhang, Shujian Zhang et al. · Dec 30 · arXiv

This paper introduces RISE, a method for finding and controlling how AI models reason without needing any human-made labels.

#RISE #sparse auto-encoder #reasoning vectors
