PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
Key Summary
- PolySAE is a new kind of sparse autoencoder that keeps a simple, linear way to find features but uses a smarter decoder that can multiply features together.
- This multiplication lets the model tell the difference between words that just happen to appear together and true combinations with new meaning (like star × coffee → Starbucks).
- It adds only about 3% more parameters on GPT-2 Small thanks to a low-rank, shared interaction space, so it stays efficient.
- Across four language models and three SAE variants, PolySAE boosts probing F1 by about 8% on average without hurting reconstruction error.
- It also makes class meanings more separated, giving 2–10× larger Wasserstein distances between classes than standard SAEs.
- Learned interaction strengths barely track co-occurrence frequency (r = 0.06), unlike linear SAE covariance (r = 0.82), so PolySAE is learning real composition, not just counting.
- Because the encoder stays linear and sparse, features remain interpretable, while the polynomial decoder cleanly captures pair and triplet interactions.
- PolySAE concentrates meaning in fewer features, often needing fewer active features to perform well.
- Ranks for the quadratic and cubic terms can stay small (around 64), showing that most interaction structure is low-dimensional.
- It helps us see how models bind parts like stems and suffixes, or words and context, making interpretability tools more faithful to how language actually composes.
Why This Research Matters
Language is built by combining parts, and real meanings often appear when pieces interact, not just add. PolySAE finally lets interpretability tools match that reality: the encoder keeps features clear and sparse, while the decoder captures pair and triple bindings that create new meanings. This helps auditors and safety teams detect and measure sensitive compositions (like demographic + attribute) with more precision. It allows developers to debug model behavior at the level where errors actually happen: during combinations, not just individual parts. It makes targeted steering more effective, since we can adjust interaction weights rather than just bluntly turning features up or down. And because it's efficient and a strict generalization of SAEs, teams can adopt it without throwing away existing workflows.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how LEGO bricks click together to make bigger creations: cars, castles, even spaceships? Simple pieces combine to form something new. That's how language works too: small parts like stems and endings, or words in a phrase, come together and sometimes make brand-new meanings.
🥬 Filling (The Actual Concept):
- What it is: Sparse autoencoders (SAEs) are tools that take a big, messy brain-signal from a neural network and represent it using a small number of neat, human-friendly features.
- How it works:
- Take a network's hidden activations (like a snapshot of its thoughts).
- Use a linear encoder to score many possible features (directions) and keep just a few with Top-K sparsity.
- Use a decoder to rebuild the original activation from those few features.
- Why it matters: If the features match meaningful ideas (like city, sports, or past-tense), we can inspect, measure, and even edit a model's internal reasoning.
🍞 Bottom Bread (Anchor): For example, an SAE might light up a "country" feature when the text says "France." That lets us see when and how the model represents countries.
The World Before:
- SAEs took off because they found interpretable features inside large language models (LLMs) while keeping the encoder linear and sparse. This made it easy to visualize, cluster, and intervene on features.
- Everyone assumed a "strong linear" story: learned features add up to make meaning. If you have features for star and coffee, you just add them for contexts where they co-occur.
The Problem:
- Language composition isn't just addition. "Star + coffee" doesn't just mean "a star and some coffee"; it can point to Starbucks, a specific brand. Morphology is also non-linear: "administrate" + "-ors" → "administrators" has properties not captured by a simple sum.
- Standard SAEs can't, in principle, tell apart mere co-occurrence from true composition. They either create a single, monolithic feature for the whole compound (hiding its parts) or leave it as two loose pieces (missing the special combo).
Failed Attempts:
- Make bigger dictionaries: You get more features, but you still can't explain how parts bind; you just memorize more wholes.
- Bilinear models on inputs: Helpful, but interactions at the raw input level don't preserve the interpretability of the sparse, learned features.
- Purely linear probes: They pick up separable signals but can confuse frequency with structure.
The Gap:
- We need a method that keeps the beloved, interpretable linear encoder, but lets the decoder model non-additive composition, capturing how features interact (pairs, triples) efficiently and in a way that aligns with the same feature space.
Real Stakes (Why care?):
- Safety and auditing: If we can see how models combine parts into sensitive ideas (e.g., demographic terms + attribute), we can detect, measure, and correct issues like bias more precisely.
- Debugging: When a mistake comes from a specific interaction (say, a stem combining with the wrong suffix), targeted fixes become possible.
- Steering: If you want to nudge an LLM away from certain composed meanings (e.g., harmful phrases), you need tools that represent those compositions as such, not as vague co-occurrences.
New Concept Sandwiches introduced:
- 🍞 You know how you can add toppings to a pizza, but some toppings together make a new flavor (pineapple + ham → Hawaiian)? 🥬 Feature interactions: Different features don't just add; they can multiply to create new effects.
- How it works: (i) Identify active features; (ii) form pair/triple products; (iii) map those products back to the activation space.
- Why it matters: Without interactions, models confuse special combos with simple co-occurrence. 🍞 Anchor: star × coffee acts like brand.
- 🍞 Imagine baking: mixing flour, sugar, and eggs changes texture; multiplying ingredients changes outcomes. 🥬 Polynomial decoding: A decoder that includes linear (add), quadratic (pairwise multiply), and cubic (triple multiply) terms of features.
- How it works: (i) Compute z; (ii) build z, z×z, z×z×z; (iii) weight and sum them.
- Why it matters: Captures composition that linear sums miss. 🍞 Anchor: "ing" × "stock/market" × "invest" clarifies true investing contexts.
- 🍞 Think of a choir: many voices blend, but only a few harmony patterns really matter. 🥬 Low-rank shared subspace: Interactions happen through a small set of shared directions (U), keeping things efficient and coherent.
- How it works: (i) Project features into U; (ii) interact inside this space; (iii) decode back.
- Why it matters: You avoid memorizing every pair/triple; you reuse a few meaningful patterns. 🍞 Anchor: A handful of "binding modes" can explain many morpho-semantic combinations.
02 Core Idea
🍞 Top Bread (Hook): Imagine colored lights on a stage. One red light and one blue light can overlap to make purple, a new color you can't get by just laying the beams side-by-side.
🥬 Filling (The Actual Concept):
- What it is: The key insight is to keep the encoder simple and linear for interpretability, but let the decoder be polynomial so it can multiply features and create new semantic "colors."
- How it works:
- Use a linear, sparse encoder to find a few active features (clear and interpretable).
- Project those features into a small shared space U.
- Build three streams: linear (z), quadratic (z×z), and cubic (z×z×z), all inside that shared space.
- Map each stream back to the model's activation space and sum them with learned weights.
- Why it matters: Without multiplication, we blur together co-occurrence and true composition; with it, we unlock brand-new semantic directions (like brand) that weren't in the original span of features.
🍞 Bottom Bread (Anchor): The model can represent Starbucks as the interaction star × coffee, not as a giant, opaque, single feature.
Multiple Analogies (3 ways):
- LEGO + Connectors: You don't just stack bricks; you also use special hinges (interactions) that let your build do new things.
- Music Chords: Single notes (features) are clear, but chords (interactions) create harmonies with unique feelings you can't get from notes in isolation.
- Recipes: Ingredients are features; certain mixes (quadratic, cubic) cause chemical reactions (new meanings) beyond simple addition.
Before vs After:
- Before: SAEs assumed add-only. They often created monolithic features for whole phrases or brands and couldn't explain how parts combined.
- After: PolySAE multiplies features in a small, shared space, capturing pairwise and triple compositions with low overhead. Features stay interpretable; compositions become explicit.
Why It Works (intuition):
- Multiplicative terms produce directions outside the original linear span, so the model can "lift" into new semantic dimensions (e.g., brand, capability, evaluation); see the tiny demo after this list.
- Sharing a low-rank interaction space U forces coherence: many pairs/triples reuse the same few binding patterns, reducing overfitting.
- Keeping the encoder linear preserves clarity: each feature still corresponds to a direction we can visualize and probe.
- Orthogonal constraints on U keep the interaction modes distinct and identifiable.
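To make the "lift" intuition concrete, here is a tiny NumPy illustration (not from the paper; the feature names and vectors are made up): a purely linear decoder can only reconstruct points inside the span of its columns, while a pairwise product routed to its own output direction adds a component no linear combination of those columns could produce.

```python
import numpy as np

# Two toy features (star, coffee) decoded into a 3-D activation space.
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])           # linear decoder columns span only the first two axes
v_brand = np.array([0.0, 0.0, 1.0])  # hypothetical output direction for the star*coffee interaction

z = np.array([0.8, 0.6])             # both features active on a "Starbucks"-like token

linear_recon = D @ z                           # always stays inside span(D)
poly_recon = linear_recon + (z[0] * z[1]) * v_brand

# Project the polynomial reconstruction back onto span(D); the leftover component
# is exactly what no purely additive decoder could have produced.
P = D @ np.linalg.pinv(D)                      # orthogonal projector onto span(D)
print(poly_recon - P @ poly_recon)             # -> [0. 0. 0.48], a genuinely new direction
```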
Building Blocks (mini-sandwiches):
- 🍞 You know how sunglasses filter light into a few key colors? 🥬 Shared projection U: A small set of directions where interactions happen.
- How it works: (i) Compute zU; (ii) multiply element-wise for pairs/triples; (iii) decode.
- Why it matters: Efficient and consistent composition rules. 🍞 Anchor: The same U helps bind stems with many suffixes.
- 🍞 Picture three dials (low, medium, high) controlling how much each effect contributes. 🥬 Order weights (λ2, λ3): Scalars that set how strong the quadratic and cubic terms are.
- How it works: Learn λ2, λ3 from data.
- Why it matters: Prevents higher-order noise from overpowering clear linear signals. 🍞 Anchor: If cubic adds little, the model learns a small λ3 automatically.
- 🍞 Think of keeping your tools tidy so you don't grab the same wrench twice. 🥬 Orthonormal U: Columns of U are perpendicular and unit length.
- How it works: Enforce with QR retraction after each update.
- Why it matters: Prevents redundant or tangled interaction modes. 🍞 Anchor: Distinct modes for morphology vs. phrasal binding.
- 🍞 Imagine using a few standard puzzle-piece shapes to build many pictures. 🥬 Low ranks (R2, R3): Small numbers of interaction channels are enough.
- How it works: Set R2, R3 ≪ R1; learn C(2), C(3) to decode.
- Why it matters: Keeps parameter cost small (~3%) while covering many compositions. 🍞 Anchor: With R2 = R3 = 64, GPT-2 Small already shows big semantic gains.
03 Methodology
High-Level Recipe: Input activations → Linear, sparse encoding (z) → Project into shared space (U) → Build linear, pairwise, and triple interaction streams → Decode and sum → Compare to original activations → Update parameters (with U kept orthonormal)
Step 1: Get clean, sparse features (Encoding)
- What happens: We take a hidden activation x from a chosen transformer layer and compute h = E^T x + b_enc; apply ReLU; keep only the Top-K values to form z (sketched in code below).
- Why this step exists: Linear + sparse means each feature is a clear direction we can interpret, and only a few fire per token.
- What breaks without it: If encoding isn't linear, feature directions get murky; if it isn't sparse, we lose crisp, human-readable signals.
- Example: For the token "Starbucks," z might include high scores for features like star, coffee, and perhaps brand-ish hints.
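A minimal PyTorch sketch of this encoding step, assuming an encoder matrix E of shape (d_model, n_features); the function name and exact tensor layout are illustrative, not the authors' code.

```python
import torch

def topk_encode(x, E, b_enc, k=64):
    """Linear encode + ReLU + Top-K sparsification (illustrative sketch).

    x:     (batch, d_model) hidden activations from the chosen LLM layer
    E:     (d_model, n_features) encoder directions
    b_enc: (n_features,) encoder bias
    """
    h = torch.relu(x @ E + b_enc)               # score every candidate feature
    topk = torch.topk(h, k, dim=-1)             # keep only the K strongest per token
    z = torch.zeros_like(h)
    z.scatter_(-1, topk.indices, topk.values)   # sparse code z with at most K non-zeros
    return z
```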
Step 2: Project features into a shared interaction space (U)
- What happens: Compute a compact representation r1 = zU (size R1). Think of this as picking a handful of reusable interaction dials.
- Why this step exists: It forces many different feature combos to share a small set of binding patterns, avoiding a parameter explosion.
- What breaks without it: Modeling every pair/triple directly would be enormous and prone to overfitting; you'd also lose coherence across combinations.
- Example: r1's channels might align with "morphology binding," "phrasal composition," or "domain conditioning."
Step 3: Build the three streams (linear, quadratic, cubic)
- What happens (sketched in code after this step):
- Linear: y1 = (zU) C(1)^T.
- Quadratic: r2 = (z U_{1:R2}) ⊙ (z U_{1:R2}); y2 = r2 C(2)^T.
- Cubic: r3 = (z U_{1:R3}) ⊙ (z U_{1:R3}) ⊙ (z U_{1:R3}); y3 = r3 C(3)^T.
- Sum: ŷ = b_dec + y1 + λ2 y2 + λ3 y3.
- Notation: U_{1:Rk} denotes the first Rk columns of U, and ⊙ is the element-wise product.
- Why this step exists: It's where composition happens: pairs (quadratic) and triples (cubic) can express new semantic directions.
- What breaks without it: Linear-only decoding can't separate true compositions from co-occurrence; meanings get blurred.
- Example: "investing.com – Philippines stocks were higher after …" triggers cubic binding among ing, stock/market, invest, sharpening finance-specific -ing contexts.
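The decode side of Steps 2 and 3 can be sketched as follows; the shapes and names (U, C1, C2, C3, lam2, lam3) follow the description above, but the exact parameterization is an assumption, not the authors' released code.

```python
import torch

def poly_decode(z, U, C1, C2, C3, b_dec, lam2, lam3, R2=64, R3=64):
    """Polynomial decoding sketch: linear + pairwise + triple interaction streams.

    z:  (batch, n_features) sparse code from the encoder
    U:  (n_features, R1)    shared interaction subspace with orthonormal columns
    C1: (d_model, R1), C2: (d_model, R2), C3: (d_model, R3) per-order decoders
    """
    r1 = z @ U                        # project the sparse code into the shared space
    p2, p3 = r1[:, :R2], r1[:, :R3]   # first R2 / R3 interaction channels

    y1 = r1 @ C1.T                    # linear stream
    y2 = (p2 * p2) @ C2.T             # quadratic stream: element-wise pairwise products
    y3 = (p3 * p3 * p3) @ C3.T        # cubic stream: element-wise triple products
    return b_dec + y1 + lam2 * y2 + lam3 * y3
```

Setting lam2 = lam3 = 0 in this sketch collapses it to a factored linear decoder, which is the sense in which PolySAE strictly generalizes a standard SAE.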
Step 4: Keep interaction modes distinct (orthonormal U)
- What happens: After each gradient step, we QR-retract U to keep its columns orthonormal (U^T U = I) with consistent signs (see the sketch below).
- Why this step exists: Makes interaction channels geometrically clean and prevents degenerate mixing.
- What breaks without it: Redundant or rotated modes can make interpretation unclear and training unstable.
- Example: One mode can stay focused on "suffix binding" instead of drifting into "topic binding."
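One standard way to implement the retraction, assuming U is stored as an (n_features, R1) parameter; torch.linalg.qr plus the sign fix below is a common recipe and may differ from the authors' exact implementation.

```python
import torch

@torch.no_grad()
def qr_retract(U):
    """Snap U back onto the set of matrices with orthonormal columns (U^T U = I).

    Intended to be called right after each optimizer step.
    """
    Q, R = torch.linalg.qr(U, mode="reduced")
    signs = torch.sign(torch.diagonal(R))   # QR is only unique up to column signs...
    signs[signs == 0] = 1.0
    U.copy_(Q * signs)                      # ...so fix them for a deterministic retraction
    return U
```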
Step 5: Train with reconstruction loss and shared pipeline
- What happens: Minimize mean squared error ‖ŷ − x‖^2 on held-out activations from the LLM; use the same encoder, sparsifier (TopK, BatchTopK, Matryoshka), optimizer, and data pipelines as standard SAEs (a minimal loop is sketched below).
- Why this step exists: Ensures PolySAE remains a drop-in upgrade that doesn't sacrifice fidelity.
- What breaks without it: Apples-to-apples comparisons would fail; improvements might just reflect different training setups.
- Example: On GPT-2 Small layer 8 with K = 64 and 16,384 features, MSE stays comparable to the standard SAE while semantics improve.
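Putting the pieces together, a training loop might look like the sketch below. It reuses the illustrative topk_encode, poly_decode, and qr_retract helpers from the earlier sketches and assumes a loader of cached activations; it shows the flow (MSE objective, standard optimizer, retraction each step), not the authors' training code.

```python
import torch

# Assumes the topk_encode / poly_decode / qr_retract sketches above, plus an
# activation_loader yielding cached LLM activations x of shape (batch, d_model).
d_model, n_feat, R1, R2, R3, K = 768, 16384, 768, 64, 64, 64

E     = torch.nn.Parameter(0.01 * torch.randn(d_model, n_feat))
b_enc = torch.nn.Parameter(torch.zeros(n_feat))
b_dec = torch.nn.Parameter(torch.zeros(d_model))
U     = torch.nn.Parameter(torch.linalg.qr(torch.randn(n_feat, R1)).Q)   # start orthonormal
C1    = torch.nn.Parameter(0.01 * torch.randn(d_model, R1))
C2    = torch.nn.Parameter(0.01 * torch.randn(d_model, R2))
C3    = torch.nn.Parameter(0.01 * torch.randn(d_model, R3))
lam2  = torch.nn.Parameter(torch.tensor(0.1))
lam3  = torch.nn.Parameter(torch.tensor(0.1))

opt = torch.optim.Adam([E, b_enc, b_dec, U, C1, C2, C3, lam2, lam3], lr=1e-4)

for x in activation_loader:
    z = topk_encode(x, E, b_enc, k=K)
    x_hat = poly_decode(z, U, C1, C2, C3, b_dec, lam2, lam3, R2=R2, R3=R3)
    loss = ((x_hat - x) ** 2).mean()        # plain MSE reconstruction objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    qr_retract(U)                           # keep U's columns orthonormal after each update
```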
Step 6: Keep it efficient (ranks and parameters)
- What happens: Choose ranks like (R1, R2, R3) = (768, 64, 64) for GPT-2 Small; similar small ranks for other models.
- Why this step exists: Interactions are powerful even when low-dimensional; small ranks keep overhead ~3%.
- What breaks without it: Too many parameters risk overfitting and slow training; too few can underfit compositions.
- Example: Ablations show increasing ranks beyond ~64 doesn't help reconstruction, so the sweet spot is modest.
Secret Sauce (why this is clever):
- Multiplication lifts meanings into new, orthogonal directions, separating co-occurrence from composition.
- A single shared, low-rank subspace U forces reusable, interpretable binding patterns, instead of memorizing every combo.
- The encoder stays linear and sparse, preserving the clarity that made SAEs interpretability-friendly.
- The whole system is a strict generalization of SAEs: set λ2 = λ3 = 0 and you're back to vanilla.
Mini-Sandwiches for supporting ideas:
- 🍞 You know how two dancers must coordinate to perform a lift? 🥬 Pairwise interactions (quadratic): Multiply two feature activations to model their joint effect.
- How it works: r2 = (zU) ⊙ (zU); decode with C(2).
- Why it matters: Distinguishes "A and B appear" from "A and B form a new thing." 🍞 Anchor: star × coffee → brand.
- 🍞 Think of a three-way handshake in networking: you need all three. 🥬 Triple interactions (cubic): Multiply three projected features to condition pairs on context.
- How it works: r3 = (zU) ⊙ (zU) ⊙ (zU); decode with C(3).
- Why it matters: Disambiguates meaning with extra context (domain, evaluation, capability). 🍞 Anchor: historic × UFC × strong → domain-calibrated "historic."
- 🍞 Picture a rulebook that many teams use to play fair. 🥬 Shared U across orders: The same projection underlies the linear, quadratic, and cubic streams.
- How it works: Use U for all orders; vary only the element-wise products and decoders.
- Why it matters: Keeps interactions aligned with the same features and easier to interpret. 🍞 Anchor: The morphology mode helps both pairs (stem + suffix) and triples (stem + suffix + domain).
04 Experiments & Results
The Test (what they measured and why):
- Reconstruction fidelity: Mean squared error (MSE) between the decoded ŷ and the original activations x. This checks that we didn't break the basic job of an autoencoder.
- Semantic quality: Two views.
- Probing F1: Train simple linear classifiers on individual features to predict labels across datasets like AG News, EuroParl, GitHub languages, Amazon sentiment, etc. Higher F1 means clearer signals in single features.
- 1-Wasserstein distance: Measure how far apart the feature activation distributions are between classes. Bigger distance means cleaner class separation overall, not just at one threshold.
The Competition (baselines):
- Standard SAEs using three popular sparsifiers: TopK, BatchTopK, and Matryoshka.
- Four LLMs and layers: GPT-2 Small, Pythia-410M, Pythia-1.4B, Gemma-2-2B.
- Same training tokens (hundreds of millions), same K=64 sparsity, same width (16,384 features), same evaluation suite (SAEBench).
The Scoreboard (with context):
- Reconstruction: PolySAE matches standard SAEs on MSE across all models and sparsifiers. That's like getting the same score on "copy the picture" while learning a better way to see shapes.
- Probing F1: About +8% average improvement across models and sparsifiers, with >10% on GPT-2 Small. Think of it as raising a B to a solid A on tests that read individual features.
- Wasserstein distance: 2–10× larger than standard SAEs. That's like pulling two clouds farther apart in the sky so you can't mistake one for the other.
Surprising/Notable Findings:
- Low correlation with co-occurrence: SAE feature covariance correlates strongly with frequency (r = 0.82), but PolySAE's learned interaction strengths barely correlate (r = 0.06). This suggests PolySAE is learning true composition, not just counting how often things appear together (see the sketch after this list).
- Sparser codes still shine: PolySAE can perform well with fewer active features, concentrating semantic meaning more tightly; adding more features (K from 1 to 5) yields smaller extra gains than for vanilla SAEs in most models.
- Small ranks suffice: Quadratic and cubic ranks around 64 are enough; pushing them higher didn't improve reconstruction, hinting that interaction structure is low-dimensional.
- Interpretable examples: Pairwise (star × coffee → Starbucks) and triplet bindings (ing × stock/market × invest) repeatedly show up in contexts where the linear SAE either misses them or uses broader, less specific features.
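The co-occurrence check can be reproduced in spirit with a sketch like the one below: count how often feature pairs fire together, read off a pairwise strength for each pair (feature covariance for a plain SAE, or some magnitude of the effective quadratic coefficients for PolySAE), and correlate the two. The exact way the paper extracts interaction strengths is not spelled out here, so interaction_strength is a stand-in.

```python
import numpy as np
from scipy.stats import pearsonr

def cooccurrence_vs_interaction(Z, interaction_strength):
    """Correlate pairwise co-activation frequency with pairwise interaction strength.

    Z:                    (n_tokens, n_features) binary matrix, 1 where a feature fires
    interaction_strength: (n_features, n_features) symmetric matrix of pairwise
                          strengths (stand-in; e.g., |covariance| for a plain SAE)
    """
    cooc = Z.T @ Z                              # co-activation counts for every feature pair
    iu = np.triu_indices(cooc.shape[0], k=1)    # upper triangle, excluding the diagonal
    r, _ = pearsonr(cooc[iu], interaction_strength[iu])
    return r
```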
Why these metrics matter:
- If MSE got worse, we'd worry we were just trading interpretability tricks for sloppier reconstructions. It didn't.
- If F1 and Wasserstein get better, it means features aren't just easier to separate with a clever line; the whole geometry of meaning is cleaner and more distinct.
Mini-Sandwiches on key metrics:
- 🍞 You know how measuring the distance between two neighborhoods tells you if they're really separate areas? 🥬 Wasserstein distance: A way to measure how far apart two distributions are, not just whether a simple fence can be drawn.
- How it works: Computes the minimal "earth-moving" cost to transform one distribution into the other (see the sketch below).
- Why it matters: Bigger distance = cleaner semantic separation. 🍞 Anchor: Positive vs. negative sentiment distributions become farther apart.
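For a single feature, the 1-Wasserstein separation between two classes can be computed directly with SciPy; the arrays below are stand-ins for real feature activations and labels.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
feature_acts = rng.random(1000)            # one feature's activations over a labeled dataset
labels = rng.integers(0, 2, size=1000)     # e.g., 0 = negative, 1 = positive sentiment

pos = feature_acts[labels == 1]
neg = feature_acts[labels == 0]

# 1-Wasserstein ("earth mover's") distance between the two activation distributions:
# larger values mean the classes occupy more clearly separated regions.
print(wasserstein_distance(pos, neg))
```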
- 🍞 Think of a quick quiz that checks if a single clue can guess the answer. 🥬 Probing F1: Train a simple classifier on a single feature's activations to predict a label.
- How it works: For each task, pick the feature with the biggest mean difference between classes; score its F1 (see the sketch below).
- Why it matters: High F1 means one feature alone carries clear meaning (a monosemanticity signal). 🍞 Anchor: A "programming language" feature helps classify GitHub snippets correctly.
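A sketch of this probe using scikit-learn: pick the feature whose class means differ most, fit a one-dimensional linear probe on it, and report F1 on held-out data. The paper's exact probing protocol may differ in details (e.g., how the feature is selected or thresholded).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def single_feature_probe_f1(Z, y):
    """Probe with the single most class-separating feature (illustrative sketch).

    Z: (n_tokens, n_features) SAE/PolySAE feature activations
    y: (n_tokens,) binary labels for the task
    """
    # 1) Feature with the largest mean difference between the two classes.
    gap = np.abs(Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0))
    best = int(np.argmax(gap))

    # 2) Fit a 1-D probe on that feature alone and score F1 on held-out tokens.
    X_tr, X_te, y_tr, y_te = train_test_split(Z[:, [best]], y, test_size=0.2, random_state=0)
    probe = LogisticRegression().fit(X_tr, y_tr)
    return f1_score(y_te, probe.predict(X_te))
```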
05 Discussion & Limitations
Limitations (be specific):
- Scale: Evaluations cover up to 2B-parameter LLMs and one layer per model; results on much larger models and many layers remain to be shown.
- Order: Interactions go up to cubic; some linguistic phenomena may need higher orders or structured binding beyond degree 3.
- Variant scope: Experiments focus on forced-sparsity SAEs (TopK, BatchTopK, Matryoshka); gated or other SAE families weren't deeply explored.
- Causality caveat: Better separation and interpretable interactions don't automatically prove causal circuits; edits may need careful validation.
- Data coverage: Trained on standard text corpora; highly specialized domains might require tuning ranks or data.
Required Resources:
- Activation dumps from target LLM layers, hundreds of millions of training tokens, and typical SAE training compute (GPUs).
- Slightly more memory/compute than SAE (~3% parameter overhead) and QR retractions for U each step.
- SAELens-style tooling for training/evaluation and SAEBench for standardized probes.
When NOT to Use:
- If your only goal is minimal reconstruction error with no interpretability needs, vanilla methods may suffice.
- Extremely tiny datasets: Polynomial terms may overfit rare combos; keep ranks small or avoid interactions.
- Real-time, tight latency budgets: The extra polynomial streams add some compute.
- Tasks where features are very dense or inherently non-sparse: The benefits of sparse, interpretable z may not apply.
Open Questions:
- Higher orders and structure: Do quartic terms help, or can we design smarter binding operators that remain efficient and interpretable?
- Layerwise consistency: How do interaction modes vary across layers; can U be shared or adapted across multiple layers?
- Editing and control: Can we directly edit pair/triple interaction weights to safely steer model behavior on composed meanings?
- Cross-modal generalization: Do similar low-rank interaction spaces help in vision or audio SAEs?
- Theory: Can we formally link learned interaction modes to known linguistic or cognitive binding operations?
06 Conclusion & Future Work
Three-Sentence Summary:
- PolySAE keeps the encoder linear and sparse for clear features, but makes the decoder polynomial so it can multiply features and represent true composition (pairs and triples) efficiently.
- It matches standard SAEs on reconstruction while improving semantic quality: about +8% probing F1 on average and 2–10× larger Wasserstein distances, with only ~3% parameter overhead.
- Interaction strengths barely track frequency, showing PolySAE captures compositional structure (morphology, phrasal binding, contextual disambiguation) rather than just co-occurrence.
Main Achievement:
- A practical, drop-in generalization of SAEs that separates co-occurrence from composition by lifting into new semantic dimensions through low-rank polynomial decoding, all while preserving interpretability.
Future Directions:
- Explore higher-order or structured bindings, multi-layer interaction sharing, and direct interaction-aware editing for safety and controllability.
- Extend to other modalities and domains, and develop richer, causal evaluations of interaction circuits.
- Automate rank/λ selection and improve visualizations for pair/triple feature dictionaries.
Why Remember This:
- PolySAE shows you can keep the simple, interpretable encoder people love and still capture the non-linear way language truly composes, unlocking clearer, more faithful windows into how LLMs build meaning.
Practical Applications
- Audit compositional bias by inspecting pair/triple interactions between demographic and attribute features.
- Targeted safety steering by adjusting or suppressing specific interaction modes (e.g., harmful phrase bindings).
- Causal debugging: test whether an error comes from a feature or from a specific feature interaction using activation patching.
- Lexical morphology analysis: study how stems and affixes bind across contexts to understand a model's word formation.
- Phrase and entity composition: map which feature pairs form brands, idioms, or domain-specific terms.
- Curriculum design for interpretability: build libraries of reusable interaction modes (U) across tasks or layers.
- Resource-efficient interpretability: maintain small ranks (like 64) to model interactions with minimal overhead.
- Sparser deployment: achieve similar or better probe performance with fewer active features (smaller K).
- Monitoring drift: track Wasserstein distances over time to see if class meanings become less separated.
- Feature editing: tweak λ2/λ3 or specific decoder rows to strengthen or weaken selected compositions.