Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Intermediate | Anton Korznikov, Andrey Galichin et al. | Feb 15 | arXiv
Sparse autoencoders (SAEs) are popular for explaining what large language models are doing, but this paper shows they often don't learn real, meaningful features.
#sparse autoencoders #interpretability #dictionary learning