PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Beginner
Panagiotis Koromilas, Andreas D. Demou, James Oldfield et al. · 2/1/2026
arXiv · PDF

Key Summary

  • PolySAE is a new kind of sparse autoencoder that keeps a simple, linear way to find features but uses a smarter decoder that can multiply features together.
  • This multiplication lets the model tell the difference between words that just happen to appear together and true combinations with new meaning (like star × coffee → Starbucks).
  • It adds only about 3% more parameters on GPT‑2 Small thanks to a low‑rank, shared interaction space, so it stays efficient.
  • Across four language models and three SAE variants, PolySAE boosts probing F1 by about 8% on average without hurting reconstruction error.
  • It also makes class meanings more separated, giving 2–10× larger Wasserstein distances between classes than standard SAEs.
  • Learned interaction strengths barely track co‑occurrence frequency (r = 0.06), unlike linear SAE covariance (r = 0.82), so PolySAE is learning real composition, not just counting.
  • Because the encoder stays linear and sparse, features remain interpretable, while the polynomial decoder cleanly captures pair and triplet interactions.
  • PolySAE concentrates meaning in fewer features, often needing fewer active features to perform well.
  • Ranks for quadratic and cubic terms can stay small (like 64), showing that most interaction structure is low‑dimensional.
  • It helps us see how models bind parts like stems and suffixes, or words and context, making interpretability tools more faithful to how language actually composes.

Why This Research Matters

Language is built by combining parts, and real meanings often appear when pieces interact, not just add. PolySAE finally lets interpretability tools match that reality: the encoder keeps features clear and sparse, while the decoder captures pair and triple bindings that create new meanings. This helps auditors and safety teams detect and measure sensitive compositions (like demographic + attribute) with more precision. It allows developers to debug model behavior at the level where errors actually happen—during combinations, not just individual parts. It makes targeted steering more effective since we can adjust interaction weights rather than just bluntly turning features up or down. And because it’s efficient and a strict generalization of SAEs, teams can adopt it without throwing away existing workflows.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how LEGO bricks click together to make bigger creations—cars, castles, even spaceships? Simple pieces combine to form something new. That’s how language works too: small parts like stems and endings, or words in a phrase, come together and sometimes make brand‑new meanings.

đŸ„Ź Filling (The Actual Concept):

  • What it is: Sparse autoencoders (SAEs) are tools that take a big, messy brain‑signal from a neural network and represent it using a small number of neat, human‑friendly features.
  • How it works:
    1. Take a network’s hidden activations (like a snapshot of its thoughts).
    2. Use a linear encoder to score many possible features (directions) and keep just a few with Top‑K sparsity.
    3. Use a decoder to rebuild the original activation from those few features (a minimal code sketch follows the example below).
  • Why it matters: If the features match meaningful ideas (like city, sports, or past‑tense), we can inspect, measure, and even edit a model’s internal reasoning.

🍞 Bottom Bread (Anchor): For example, an SAE might light up a “country” feature when the text says “France.” That lets us see when and how the model represents countries.
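
To make the three encode–sparsify–decode steps above concrete, here is a minimal PyTorch sketch of a vanilla Top‑K SAE. The class and parameter names (TopKSAE, n_features, k) are illustrative, not taken from the paper's code; the shapes follow the GPT‑2 Small setup described later (width 768, 16,384 features, K=64).

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder sketch (illustrative names and shapes)."""
    def __init__(self, d_model: int = 768, n_features: int = 16384, k: int = 64):
        super().__init__()
        self.k = k
        self.E = nn.Parameter(torch.randn(d_model, n_features) * 0.01)   # encoder directions
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(x @ self.E + self.b_enc)              # score all candidate features
        topk = torch.topk(h, self.k, dim=-1)                 # keep only the K largest scores
        z = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        return z                                             # sparse, interpretable code

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encode(x)
        return z @ self.W_dec + self.b_dec                   # purely linear reconstruction
```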

The World Before:

  • SAEs took off because they found interpretable features inside large language models (LLMs) while keeping the encoder linear and sparse. This made it easy to visualize, cluster, and intervene on features.
  • Everyone assumed a “strong linear” story: learned features add up to make meaning. If you have features for star and coffee, you just add them for contexts where they co‑occur.

The Problem:

  • Language composition isn’t just addition. “Star + coffee” doesn’t just mean “a star and some coffee”; it can point to Starbucks, a specific brand. Morphology is also non‑linear: “administrate” + “‑ors” → “administrators” has properties not captured by a simple sum.
  • Standard SAEs can’t, in principle, tell apart mere co‑occurrence from true composition. They either create a single, monolithic feature for the whole compound (hiding its parts) or leave it as two loose pieces (missing the special combo).

Failed Attempts:

  • Make bigger dictionaries: You get more features, but you still can’t explain how parts bind; you just memorize more wholes.
  • Bilinear models on inputs: Helpful, but interactions at the raw input level don’t preserve the interpretability of the sparse, learned features.
  • Purely linear probes: They pick up separable signals but can confuse frequency with structure.

The Gap:

  • We need a method that keeps the beloved, interpretable linear encoder, but lets the decoder model non‑additive composition—capturing how features interact (pairs, triples) efficiently and in a way that aligns with the same feature space.

Real Stakes (Why care?):

  • Safety and auditing: If we can see how models combine parts into sensitive ideas (e.g., demographic terms + attribute), we can detect, measure, and correct issues like bias more precisely.
  • Debugging: When a mistake comes from a specific interaction (say, a stem combining with the wrong suffix), targeted fixes become possible.
  • Steering: If you want to nudge an LLM away from certain composed meanings (e.g., harmful phrases), you need tools that represent those compositions as such—not as vague co‑occurrences.

New Concept Sandwiches introduced:

  1. 🍞 You know how you can add toppings to a pizza, but some toppings together make a new flavor (pineapple + ham → Hawaiian)? đŸ„Ź Feature interactions: Different features don’t just add; they can multiply to create new effects.

    • How it works: (i) Identify active features; (ii) Form pair/triple products; (iii) Map those products back to the activation space.
    • Why it matters: Without interactions, models confuse special combos with simple co‑occurrence. 🍞 Anchor: star × coffee acts like brand.
  2. 🍞 Imagine baking: mixing flour, sugar, and eggs changes texture—multiplying ingredients changes outcomes. đŸ„Ź Polynomial decoding: A decoder that includes linear (add), quadratic (pairwise multiply), and cubic (triple multiply) terms of features.

    • How it works: (i) Compute z; (ii) Build z, z×z, z×z×z; (iii) Weight and sum them.
    • Why it matters: Captures composition that linear sums miss. 🍞 Anchor: “ing” × “stock/market” × “invest” clarifies true investing contexts.
  3. 🍞 Think of a choir: many voices blend, but only a few harmony patterns really matter. đŸ„Ź Low‑rank shared subspace: Interactions happen through a small set of shared directions (U), keeping things efficient and coherent.

    • How it works: (i) Project features into U; (ii) interact inside this space; (iii) decode back.
    • Why it matters: You avoid memorizing every pair/triple; you reuse a few meaningful patterns. 🍞 Anchor: A handful of “binding modes” can explain many morpho‑semantic combinations.

02Core Idea

🍞 Top Bread (Hook): Imagine colored lights on a stage. One red light and one blue light can overlap to make purple, a new color you can’t get by just laying the beams side‑by‑side.

đŸ„Ź Filling (The Actual Concept):

  • What it is: The key insight is to keep the encoder simple and linear for interpretability, but let the decoder be polynomial so it can multiply features and create new semantic “colors.”
  • How it works:
    1. Use a linear, sparse encoder to find a few active features (clear and interpretable).
    2. Project those features into a small shared space U.
    3. Build three streams: linear (z), quadratic (z×z), and cubic (z×z×z), all inside that shared space.
    4. Map each stream back to the model’s activation space and sum them with learned weights.
  • Why it matters: Without multiplication, we blur together co‑occurrence and true composition; with it, we unlock brand‑new semantic directions (like brand) that weren’t in the original span of features.

🍞 Bottom Bread (Anchor): The model can represent Starbucks as the interaction star × coffee, not as a giant, opaque, single feature.

Multiple Analogies (3 ways):

  • LEGO + Connectors: You don’t just stack bricks; you also use special hinges (interactions) that let your build do new things.
  • Music Chords: Single notes (features) are clear, but chords (interactions) create harmonies with unique feelings you can’t get from notes in isolation.
  • Recipes: Ingredients are features; certain mixes (quadratic, cubic) cause chemical reactions (new meanings) beyond simple addition.

Before vs After:

  • Before: SAEs assumed add‑only. They often created monolithic features for whole phrases or brands and couldn’t explain how parts combined.
  • After: PolySAE multiplies features in a small, shared space, capturing pairwise and triple compositions with low overhead. Features stay interpretable; compositions become explicit.

Why It Works (intuition):

  • Multiplicative terms produce directions outside the original linear span, so the model can “lift” into new semantic dimensions (e.g., brand, capability, evaluation).
  • Sharing a low‑rank interaction space U forces coherence: many pairs/triples reuse the same few binding patterns, reducing overfit.
  • Keeping the encoder linear preserves clarity: each feature still corresponds to a direction we can visualize and probe.
  • Orthogonal constraints on U keep the interaction modes distinct and identifiable.

Building Blocks (mini‑sandwiches):

  1. 🍞 You know how sunglasses filter light into a few key colors? đŸ„Ź Shared projection U: A small set of directions where interactions happen.

    • How it works: (i) Compute zU; (ii) multiply element‑wise for pairs/triples; (iii) decode.
    • Why it matters: Efficient and consistent composition rules. 🍞 Anchor: The same U helps bind stems with many suffixes.
  2. 🍞 Picture three dials—low, medium, high—controlling how much each effect contributes. đŸ„Ź Order weights (λ2, λ3): Scalars that set how strong quadratic and cubic terms are.

    • How it works: Learn λ2, λ3 from data.
    • Why it matters: Prevents higher‑order noise from overpowering clear linear signals. 🍞 Anchor: If cubic adds little, the model learns a small λ3 automatically.
  3. 🍞 Think of keeping your tools tidy so you don’t grab the same wrench twice. đŸ„Ź Orthonormal U: Columns of U are perpendicular and unit length.

    • How it works: Enforce with QR retraction after each update.
    • Why it matters: Prevents redundant or tangled interaction modes. 🍞 Anchor: Distinct modes for morphology vs. phrasal binding.
  4. 🍞 Imagine using a few standard puzzle‑piece shapes to build many pictures. đŸ„Ź Low ranks (R2, R3): Small numbers of interaction channels are enough.

    • How it works: Set R2, R3 ≪ R1; learn C(2), C(3) to decode.
    • Why it matters: Keeps parameter cost small (~3%) while covering many compositions. 🍞 Anchor: With R2=R3=64, GPT‑2 Small already shows big semantic gains.

03Methodology

High‑Level Recipe: Input activations → Linear, sparse encoding (z) → Project into shared space (U) → Build linear, pairwise, and triple interaction streams → Decode and sum → Compare to original activations → Update parameters (with U kept orthonormal)

Step 1: Get clean, sparse features (Encoding)

  • What happens: We take a hidden activation x from a chosen transformer layer and compute h = E^T x + b_enc; apply ReLU; keep only the Top‑K values to form z.
  • Why this step exists: Linear + sparse means each feature is a clear direction we can interpret, and only a few fire per token.
  • What breaks without it: If encoding isn’t linear, feature directions get murky; if it isn’t sparse, we lose crisp, human‑readable signals.
  • Example: For the token “Starbucks,” z might include high scores for features like star, coffee, and perhaps brand‑ish hints.

Step 2: Project features into a shared interaction space (U)

  • What happens: Compute a compact representation r1 = zU (size R1). Think of this as picking a handful of reusable interaction dials.
  • Why this step exists: It forces many different feature combos to share a small set of binding patterns, avoiding a parameter explosion.
  • What breaks without it: Modeling every pair/triple directly would be huge and overfitty; you’d also lose coherence across combinations.
  • Example: r1’s channels might align with “morphology binding,” “phrasal composition,” or “domain conditioning.”

Step 3: Build the three streams (linear, quadratic, cubic)

  • What happens (sketched in code after this step):
    • Linear: y1 = (z U) C(1)^T.
    • Quadratic: r2 = (z U_{1:R2}) ⊙ (z U_{1:R2}); y2 = r2 C(2)^T.
    • Cubic: r3 = (z U_{1:R3}) ⊙ (z U_{1:R3}) ⊙ (z U_{1:R3}); y3 = r3 C(3)^T.
    • Sum: ŷ = b_dec + y1 + λ2 y2 + λ3 y3.
  • Why this step exists: It’s where composition happens—pairs (quadratic) and triples (cubic) can express new semantic directions.
  • What breaks without it: Linear‑only decoding can’t separate true compositions from co‑occurrence; meanings get blurred.
  • Example: “investing.com — Philippines stocks were higher after 
” triggers cubic binding among ing, stock/market, invest, sharpening finance‑specific -ing contexts.
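
Below is a minimal sketch of the three decoding streams, assuming the shapes described in this section: a shared projection U with orthonormal columns, per‑order decoders C(1), C(2), C(3), and scalar weights λ2, λ3. Class and variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PolyDecoder(nn.Module):
    """Sketch of a polynomial decoder over sparse codes z (shapes follow the text)."""
    def __init__(self, d_model=768, n_features=16384, r1=768, r2=64, r3=64):
        super().__init__()
        self.r2, self.r3 = r2, r3
        # Shared projection with orthonormal columns (kept orthonormal during training).
        self.U = nn.Parameter(torch.linalg.qr(torch.randn(n_features, r1))[0])
        self.C1 = nn.Parameter(torch.randn(d_model, r1) * 0.01)
        self.C2 = nn.Parameter(torch.randn(d_model, r2) * 0.01)
        self.C3 = nn.Parameter(torch.randn(d_model, r3) * 0.01)
        self.lam2 = nn.Parameter(torch.tensor(0.1))   # order weight λ2
        self.lam3 = nn.Parameter(torch.tensor(0.1))   # order weight λ3
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        zu = z @ self.U                         # project codes into the shared interaction space
        y1 = zu @ self.C1.T                     # linear stream
        q = zu[..., : self.r2]
        y2 = (q * q) @ self.C2.T                # quadratic stream: element-wise square of the projection
        c = zu[..., : self.r3]
        y3 = (c * c * c) @ self.C3.T            # cubic stream: triple products
        return self.b_dec + y1 + self.lam2 * y2 + self.lam3 * y3
```

Note that squaring the projection zU already contains cross‑products of the original features, which is how pairwise (and, with cubes, triple) interactions arise without enumerating every pair explicitly.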

Step 4: Keep interaction modes distinct (orthonormal U)

  • What happens: After each gradient step, we QR‑retract U to keep its columns orthonormal (U^T U = I) with consistent signs (see the sketch after this step).
  • Why this step exists: Makes interaction channels geometrically clean and prevents degenerate mixing.
  • What breaks without it: Redundant or rotated modes can make interpretation unclear and training unstable.
  • Example: One mode can stay focused on “suffix binding” instead of drifting into “topic binding.”
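
A QR retraction takes only a few lines; this sketch assumes U is stored as a torch.nn.Parameter and applies a sign convention on the diagonal of R so columns don't flip between steps. The function name is hypothetical.

```python
import torch

@torch.no_grad()
def qr_retract_(U: torch.nn.Parameter) -> None:
    """Re-orthonormalize U in place after a gradient step (sketch)."""
    Q, R = torch.linalg.qr(U)                  # thin QR: Q has orthonormal columns
    signs = torch.sign(torch.diagonal(R))      # fix signs so modes stay consistent across steps
    signs[signs == 0] = 1.0
    U.copy_(Q * signs)                         # afterwards U^T U = I
```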

Step 5: Train with reconstruction loss and shared pipeline

  • What happens: Minimize mean squared error ‖ŷ − x‖^2 on held‑out activations from the LLM; use the same encoder, sparsifier (TopK, BatchTopK, Matryoshka), optimizer, and data pipelines as standard SAEs (a minimal training step is sketched after this step).
  • Why this step exists: Ensures PolySAE remains a drop‑in upgrade that doesn’t sacrifice fidelity.
  • What breaks without it: Apples‑to‑apples comparisons would fail; improvements might just reflect different training setups.
  • Example: On GPT‑2 Small layer 8 with K=64 and 16,384 features, MSE stays comparable to the standard SAE while semantics improve.
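
Putting the pieces together, one training step is a standard reconstruction update followed by the retraction. This reuses the TopKSAE, PolyDecoder, and qr_retract_ sketches above; the optimizer choice and batch handling are assumptions, not details from the paper.

```python
import torch

def train_step(encoder, decoder, optimizer, x):
    """One reconstruction step on a batch of activations x (sketch)."""
    z = encoder.encode(x)                  # linear, Top-K sparse codes
    x_hat = decoder(z)                     # polynomial reconstruction ŷ
    loss = torch.mean((x_hat - x) ** 2)    # MSE ‖ŷ − x‖^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    qr_retract_(decoder.U)                 # keep U orthonormal after the update
    return loss.item()
```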

Step 6: Keep it efficient (ranks and parameters)

  • What happens: Choose ranks like (R1, R2, R3) = (768, 64, 64) for GPT‑2 Small; similar small ranks for other models (a configuration sketch follows this step).
  • Why this step exists: Interactions are powerful even when low‑dimensional; small ranks keep overhead ~3%.
  • What breaks without it: Too many parameters risk overfitting and slow training; too few can underfit compositions.
  • Example: Ablations show increasing ranks beyond ~64 doesn’t help reconstruction, so the sweet spot is modest.
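
For reference, the settings quoted in this step could be gathered into a small configuration object; the field names are illustrative, and only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass
class PolySAEConfig:
    """Hyperparameters quoted in the text for GPT-2 Small (field names illustrative)."""
    d_model: int = 768        # residual-stream width of GPT-2 Small
    n_features: int = 16384   # dictionary size
    k: int = 64               # Top-K active features per token
    r1: int = 768             # linear-stream rank
    r2: int = 64              # quadratic-stream rank
    r3: int = 64              # cubic-stream rank
```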

Secret Sauce (why this is clever):

  • Multiplication lifts meanings into new, orthogonal directions, separating co‑occurrence from composition.
  • A single shared, low‑rank subspace U forces reusable, interpretable binding patterns, instead of memorizing every combo.
  • The encoder stays linear and sparse, preserving the clarity that made SAEs interpretability‑friendly.
  • The whole system is a strict generalization of SAEs: set λ2=λ3=0 and you’re back to vanilla.

Mini‑Sandwiches for supporting ideas:

  1. 🍞 You know how two dancers must coordinate to perform a lift? đŸ„Ź Pairwise interactions (quadratic): Multiply two feature activations to model their joint effect.

    • How it works: r2 = (zU) ⊙ (zU); decode with C(2).
    • Why it matters: Distinguishes “A and B appear” from “A and B form a new thing.” 🍞 Anchor: star × coffee → brand.
  2. 🍞 Think of a three‑way handshake in networking—you need all three. đŸ„Ź Triple interactions (cubic): Multiply three projected features to condition pairs on context.

    • How it works: r3 = (zU) ⊙ (zU) ⊙ (zU); decode with C(3).
    • Why it matters: Disambiguates meaning with extra context (domain, evaluation, capability). 🍞 Anchor: historic × UFC × strong → domain‑calibrated “historic.”
  3. 🍞 Picture a rulebook that many teams use to play fair. đŸ„Ź Shared U across orders: The same projection underlies linear, quadratic, and cubic streams.

    • How it works: Use U for all orders; vary only element‑wise products and decoders.
    • Why it matters: Keeps interactions aligned with the same features and easier to interpret. 🍞 Anchor: The morphology mode helps both pairs (stem + suffix) and triples (stem + suffix + domain).

04Experiments & Results

The Test (what they measured and why):

  • Reconstruction fidelity: Mean squared error (MSE) between decoded ŷ and the original activations x. This checks that we didn’t break the basic job of an autoencoder.
  • Semantic quality: Two views.
    1. Probing F1: Train simple linear classifiers on individual features to predict labels across datasets like AG News, EuroParl, GitHub languages, Amazon sentiment, etc. Higher F1 means clearer signals in single features.
    2. 1‑Wasserstein distance: Measure how far apart the feature activation distributions are between classes. Bigger distance means cleaner class separation overall, not just at one threshold.

The Competition (baselines):

  • Standard SAEs using three popular sparsifiers: TopK, BatchTopK, and Matryoshka.
  • Four LLMs and layers: GPT‑2 Small, Pythia‑410M, Pythia‑1.4B, Gemma‑2‑2B.
  • Same training tokens (hundreds of millions), same K=64 sparsity, same width (16,384 features), same evaluation suite (SAEBench).

The Scoreboard (with context):

  • Reconstruction: PolySAE matches standard SAEs on MSE across all models and sparsifiers. That’s like getting the same score on “copy the picture” while learning a better way to see shapes.
  • Probing F1: About +8% average improvement across models and sparsifiers, with >10% on GPT‑2 Small. Think of it as raising a B to a solid A on tests that read individual features.
  • Wasserstein distance: 2–10× larger than standard SAEs. That’s like pulling two clouds farther apart in the sky so you can’t mistake one for the other.

Surprising/Notable Findings:

  1. Low correlation with co‑occurrence: SAE feature covariance correlates strongly with frequency (r = 0.82), but PolySAE’s learned interaction strengths barely correlate (r = 0.06). This suggests PolySAE is learning true composition, not just counting how often things appear together.
  2. Sparser codes still shine: PolySAE can perform well with fewer active features, concentrating semantic meaning more tightly; adding more features (K from 1 to 5) yields smaller extra gains vs. vanilla SAEs in most models.
  3. Small ranks suffice: Quadratic and cubic ranks around 64 are enough; pushing them higher didn’t improve reconstruction, hinting that interaction structure is low‑dimensional.
  4. Interpretable examples: Pairwise (star × coffee → Starbucks) and triplet bindings (ing × stock/market × invest) repeatedly show up in contexts where the linear SAE either misses or uses broader, less specific features.

Why these metrics matter:

  • If MSE got worse, we’d worry we were just trading interpretability tricks for sloppier reconstructions. It didn’t.
  • If F1 and Wasserstein get better, it means features aren’t just easier to separate with a clever line; the whole geometry of meaning is cleaner and more distinct.

Mini‑Sandwiches on key metrics:

  1. 🍞 You know how measuring the distance between two neighborhoods tells you if they’re really separate areas? đŸ„Ź Wasserstein distance: A way to measure how far apart two distributions are, not just whether a simple fence can be drawn.

    • How it works: Computes the minimal “earth‑moving” cost to transform one distribution into the other.
    • Why it matters: Bigger distance = cleaner semantic separation. 🍞 Anchor: Positive vs. negative sentiment distributions become farther apart (both metrics are sketched in code after this list).
  2. 🍞 Think of a quick quiz that checks if a single clue can guess the answer. đŸ„Ź Probing F1: Train a simple classifier on a single feature’s activations to predict a label.

    • How it works: For each task, pick the feature with the biggest mean difference between classes; score its F1.
    • Why it matters: High F1 means one feature alone carries clear meaning (monosemanticity signal). 🍞 Anchor: A “programming language” feature helps classify GitHub snippets correctly.
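
Both evaluation views can be approximated in a few lines with SciPy and scikit‑learn. The feature‑selection rule (largest mean difference between classes) follows the description above; the probe choice (logistic regression) and function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def class_separation(acts_pos: np.ndarray, acts_neg: np.ndarray, feat: int) -> float:
    """1-Wasserstein distance between one feature's activation distributions per class."""
    return wasserstein_distance(acts_pos[:, feat], acts_neg[:, feat])

def probe_f1(train_z, train_y, test_z, test_y) -> float:
    """Pick the feature with the largest mean gap between classes, then probe it."""
    gap = np.abs(train_z[train_y == 1].mean(0) - train_z[train_y == 0].mean(0))
    feat = int(gap.argmax())                                   # single most separating feature
    clf = LogisticRegression().fit(train_z[:, [feat]], train_y)
    return f1_score(test_y, clf.predict(test_z[:, [feat]]))
```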

05Discussion & Limitations

Limitations (be specific):

  • Scale: Evaluations cover up to 2B‑parameter LLMs and one layer per model; results on much larger models and many layers remain to be shown.
  • Order: Interactions go up to cubic; some linguistic phenomena may need higher orders or structured binding beyond degree‑3.
  • Variant scope: Experiments focus on forced‑sparsity SAEs (TopK, BatchTopK, Matryoshka); gated or other SAE families weren’t deeply explored.
  • Causality caveat: Better separation and interpretable interactions don’t automatically prove causal circuits; edits may need careful validation.
  • Data coverage: Trained on standard text corpora; highly specialized domains might require tuning ranks or data.

Required Resources:

  • Activation dumps from target LLM layers, hundreds of millions of training tokens, and typical SAE training compute (GPUs).
  • Slightly more memory/compute than SAE (~3% parameter overhead) and QR retractions for U each step.
  • SAELens‑style tooling for training/evaluation and SAEBench for standardized probes.

When NOT to Use:

  • If your only goal is minimal reconstruction error with no interpretability needs, vanilla methods may suffice.
  • Extremely tiny datasets: Polynomial terms may overfit rare combos; keep ranks small or avoid interactions.
  • Real‑time tight latency budgets: The extra polynomial streams add some compute.
  • Tasks where features are very dense or inherently non‑sparse: The benefits of sparse, interpretable z may not apply.

Open Questions:

  • Higher orders and structure: Do quartic terms help, or can we design smarter binding operators that remain efficient and interpretable?
  • Layerwise consistency: How do interaction modes vary across layers; can U be shared or adapted across multiple layers?
  • Editing and control: Can we directly edit pair/triple interaction weights to safely steer model behavior on composed meanings?
  • Cross‑modal generalization: Do similar low‑rank interaction spaces help in vision or audio SAEs?
  • Theory: Can we formally link learned interaction modes to known linguistic or cognitive binding operations?

06Conclusion & Future Work

Three‑Sentence Summary:

  • PolySAE keeps the encoder linear and sparse for clear features, but makes the decoder polynomial so it can multiply features and represent true composition (pairs and triples) efficiently.
  • It matches standard SAEs on reconstruction while improving semantic quality: about +8% probing F1 on average and 2–10× larger Wasserstein distances, with only ~3% parameter overhead.
  • Interaction strengths barely track frequency, showing PolySAE captures compositional structure (morphology, phrasal binding, contextual disambiguation) rather than just co‑occurrence.

Main Achievement:

  • A practical, drop‑in generalization of SAEs that separates co‑occurrence from composition by lifting into new semantic dimensions through low‑rank polynomial decoding, all while preserving interpretability.

Future Directions:

  • Explore higher‑order or structured bindings, multi‑layer interaction sharing, and direct interaction‑aware editing for safety and controllability.
  • Extend to other modalities and domains, and develop richer, causal evaluations of interaction circuits.
  • Automate rank/λ selection and improve visualizations for pair/triple feature dictionaries.

Why Remember This:

  • PolySAE shows you can keep the simple, interpretable encoder people love and still capture the non‑linear way language truly composes—unlocking clearer, more faithful windows into how LLMs build meaning.

Practical Applications

  • Audit compositional bias by inspecting pair/triple interactions between demographic and attribute features.
  • Targeted safety steering by adjusting or suppressing specific interaction modes (e.g., harmful phrase bindings).
  • Causal debugging: test whether an error comes from a feature or from a specific feature interaction using activation patching.
  • Lexical morphology analysis: study how stems and affixes bind across contexts to understand a model's word formation.
  • Phrase and entity composition: map which feature pairs form brands, idioms, or domain-specific terms.
  • Curriculum design for interpretability: build libraries of reusable interaction modes (U) across tasks or layers.
  • Resource-efficient interpretability: maintain small ranks (like 64) to model interactions with minimal overhead.
  • Sparser deployment: achieve similar or better probe performance with fewer active features (smaller K).
  • Monitoring drift: track Wasserstein distances over time to see if class meanings become less separated.
  • Feature editing: tweak λ2/λ3 or specific decoder rows to strengthen or weaken selected compositions.
#Sparse Autoencoder#Polynomial Decoder#Feature Interactions#Low-rank Factorization#Mechanistic Interpretability#Compositionality#Wasserstein Distance#Probing F1#Stiefel Manifold#QR Retraction#Top-K Sparsity#Volterra Expansion#Monosemantic Features#Tensor Factorization#Residual Stream
Version: 1