
Geometric Stability: The Missing Axis of Representations

Intermediate · Prashant C. Raju · 1/14/2026
arXiv · PDF

Key Summary

  • Similarity tells you if two models seem to think about things the same way, but it doesn’t tell you if that thinking is sturdy when the world wiggles.
  • This paper introduces geometric stability, a new axis that measures how well a model’s internal shape of knowledge (its geometry) holds together under small changes.
  • Shesha is a simple, general recipe to measure geometric stability by checking how consistent a model’s pairwise distances are across splits and perturbations.
  • Across 2,463 model setups in seven domains, stability and similarity were basically uncorrelated (ρ ≈ −0.01), showing they capture different properties.
  • Shesha flags representation drift earlier and with almost 2× the sensitivity of CKA on average (up to 5.23× in Llama), while avoiding the false alarms that Procrustes often triggers (44% vs 7.3%).
  • Supervised stability predicts how easily a model can be steered with simple linear directions (ρ = 0.89–0.96), a key step for safe, controllable AI.
  • In vision, top transfer models like DINOv2 can have low stability, revealing a “geometric tax” of transferability and guiding smarter model choice.
  • Beyond ML, stability tracks CRISPR perturbation coherence (ρ up to 0.96) and neural–behavior coupling in the brain, showing its broad scientific value.

Why This Research Matters

If a model’s internal shape collapses under small changes, it can behave unpredictably even when standard similarity tests say “all good.” Geometric stability gives builders an early-warning canary that catches subtle fractures before user harm or system failure. It helps teams choose the right model for the job: pick high-transfer models when you’ll retrain a lot, pick high-stability models when you need consistent, safe behavior. In safety monitoring, it reduces noisy false alarms while still detecting real drift sooner. In controllability, it predicts when simple, auditable linear steering will work, enabling safer interventions. And in science, it distinguishes real biological structure from noise, improving discoveries in genomics and neuroscience.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) You know how a bookshelf can look neat today, but if every time you nudge it the books slide around, it becomes hard to find anything? “Looks neat” (similar) isn’t the same as “stays neat” (stable).

🥬 Filling (The Actual Concept)

  • What it is: The field mostly judged model representations by similarity—how closely two models organize information—not by whether that organization stays reliable when you poke it.
  • How it works (before this paper): Tools like RSA and CKA compare two sets of embeddings to see if their pairwise relationships match. If they do, we say the internal representations are similar.
  • Why it matters: Two models can look very similar but react totally differently to small changes (like paraphrases, resampling, or noise). That’s risky for safety, control, and science.

🍞 Bottom Bread (Anchor) Two libraries can list the same books (similar), but if one loses its shelf order whenever the floor vibrates, it’s useless to readers (unstable).

🍞 Top Bread (Hook) Imagine maps drawn by two students. If both draw Paris near the Seine, their maps look similar. But if one map smears when it rains, it’s not reliable on a field trip.

🥬 Filling

  • What it is: Representational geometry is the “shape” formed by distances among embeddings—who is close, who is far.
  • How it works: We look at all pairs of items and record their distances. That distance web is the geometry.
  • Why it matters: If the web tears under small changes, the model’s behavior may wobble in surprising ways.

🍞 Bottom Bread (Anchor) When you ask “What’s the capital of France?”, geometry helps the model keep “France” close to “Paris” across many sentences and wordings.

🍞 Top Bread (Hook) You know how teachers don’t just check if you copied the right answers, but if you can solve similar problems again? That’s testing stability, not just similarity.

🥬 Filling

  • The problem: Existing work mostly stopped at “Are two models’ representations aligned?” It didn’t ask “Does one model’s shape hold steady inside itself?”
  • Failed attempts: Pure distance metrics can be too rigid (false alarms), while alignment scores (like CKA) overlook subtle structure beyond the biggest components.
  • The gap: We need a metric that checks the model’s own internal consistency under splits and perturbations—no external reference required.
  • What’s missing: A way to measure how robust the full manifold structure is, not just its loudest directions.

🍞 Bottom Bread (Anchor) If you split a class into two groups and both groups solve problems in similar ways, you trust the lesson stuck. That’s internal consistency.

🍞 Top Bread (Hook) Imagine you test a bridge by sending bikes, cars, and then trucks across. If the bridge shape holds steady each time, you’d say it’s stable.

🥬 Filling

  • Real stakes: Safety monitoring needs early warnings before failures. Steering needs geometry that accepts simple nudges. Model selection shouldn’t confuse “adapts well” with “stays coherent.” Biology and neuroscience need to know if patterns reflect real structure, not noise.
  • Consequences without it: You can miss quiet fractures (drift) that later cause bad outputs, overreact to harmless jitters (false alarms), or pick the wrong model for the job.

🍞 Bottom Bread (Anchor) In this paper, the new metric (Shesha) acts like a “geometric canary,” chirping early when the representation’s shape starts to crack—even while other meters stay quiet.

02 Core Idea

🍞 Top Bread (Hook) Imagine two sandcastles. From far away they look the same. But one survives waves; the other crumbles. What you need is not just a camera (similarity) but a shake test (stability).

🥬 Filling (The Actual Concept)

  • The “Aha!”: Measure how consistently a model’s internal distance structure stays the same under controlled splits and perturbations—geometric stability—using Shesha.
  • How it works (big picture): Build pairwise distance tables (RDMs) from two independent “views” of the same representation (e.g., different feature halves or different sample halves), then correlate them. High agreement = stable geometry.
  • Why it matters: Similarity captures “what’s represented,” stability captures “whether that structure is sturdy.” Both are needed for safe, controllable, and trustworthy systems.

🍞 Bottom Bread (Anchor) If you quiz two halves of a class separately and both halves keep the same ranking of which problems are hard or easy, the teaching geometry is stable.

Three analogies (same idea, different lenses)

  1. Library index: Same books ≠ same usability. Stability checks if the catalog order survives reshuffles.
  2. GPS routes: Two apps can suggest similar paths today; stability checks if their directions stay sensible after a detour or road closure.
  3. Lego city: Similarity counts the pieces; stability checks if the buildings stand when the table shakes.

Before vs After

  • Before: We compared models to references (CKA, RSA) and declared victory if they lined up on big, dominant features (top principal components).
  • After: We also test internal self-consistency (Shesha). Now we can catch fine-grained fractures in the spectral tail that similarity misses, flag early drift, and predict steerability.

Why it works (intuition, no equations)

  • If geometry is truly coherent, you can look at it from many angles (different feature splits, sample splits) and the relative ordering of distances hardly changes.
  • Spearman rank on RDMs focuses on the order of distances, making the score robust to small scale changes and sensitive to structural reshuffles.
  • Deleting top principal components removes loud structure: similarity collapses, but if the manifold’s fine details remain, Shesha still detects them.

Building blocks (the simple pieces)

  • Representational Dissimilarity Matrix (RDM): a square table of pairwise distances between items.
  • Split-half views: make two independent RDMs by splitting features (Shesha_FS) or samples (Shesha_SS).
  • Rank correlation: compare the two RDMs by how similarly they rank all pairs.
  • Supervised variants: when labels exist, align geometry to task structure to predict control and transfer.

🍞 Bottom Bread (Anchor) In tests across 2,463 configurations and 7 domains, stability and similarity were basically uncorrelated (ρ ≈ −0.01). That’s like discovering height and shoe size don’t predict each other in this crowd—you need both measurements to buy the right gear.

New Concepts (Sandwich style)

🍞 Top Bread (Hook) You know how you compare two drawings by seeing if their shapes line up?

🥬 The Concept: Similarity Metrics (e.g., CKA, RSA) compare two representations to see if their pairwise structures align.

  • How it works: Build pairwise distances/kernels for both, align/center them, compute a similarity score.
  • Why it matters: It tells us if two models organize information in the same way.

🍞 Bottom Bread (Anchor) Two sentence-embedders that keep “Paris” close to “France” and far from “potato” score high in similarity.

🍞 Top Bread (Hook) Imagine pinning strings between all pairs of cities on a map.

🥬 The Concept: Representational Geometry is the full pattern of who’s near and far in embedding space.

  • How it works: Compute distances for all pairs; that pattern is the geometry.
  • Why it matters: The pattern guides decisions like classification or retrieval.

🍞 Bottom Bread (Anchor) “Cat” near “kitten,” far from “bulldozer” is useful geometry.

🍞 Top Bread (Hook) Think of a class seating chart where you can measure who sits close to whom.

🥬 The Concept: RDMs (Representational Dissimilarity Matrices) store all pairwise distances among items.

  • How it works: Fill a table with distances; compare tables from different views.
  • Why it matters: It’s the canvas Shesha uses to judge stability.

🍞 Bottom Bread (Anchor) Two RDMs that order pairs the same way signal stable geometry.
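
To make the RDM concrete, here is a minimal sketch in Python. The helper name `build_rdm` and the use of SciPy are my own choices for illustration, not the paper’s code.

```python
# Minimal RDM sketch (illustrative; helper names like build_rdm are mine, not the paper's).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def build_rdm(X: np.ndarray) -> np.ndarray:
    """Return the n x n table of pairwise cosine distances (the RDM)."""
    return squareform(pdist(X, metric="cosine"))  # pdist returns the condensed upper triangle

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))   # 100 items, 768-D embeddings
rdm = build_rdm(X)
print(rdm.shape)                  # (100, 100), zeros on the diagonal
```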

🍞 Top Bread (Hook) Summarizing a long book into a few bullet points keeps the gist but loses details.

🥬 The Concept: PCA keeps dominant variance (big trends) while compressing fine structure.

  • How it works: Re-express data along ordered components from largest to smallest variance.
  • Why it matters: Similarity loves the big components; stability also cares about the tail details.

🍞 Bottom Bread (Anchor) Removing just the top PC made CKA collapse, but Shesha still saw structure deeper down.
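
Here is a small sketch of the spectral-deletion idea using scikit-learn’s PCA; `remove_top_pcs` is a hypothetical helper for illustration, not the paper’s implementation.

```python
# Hedged sketch: delete the top-k principal components, keeping only the spectral "tail".
import numpy as np
from sklearn.decomposition import PCA

def remove_top_pcs(X: np.ndarray, k: int) -> np.ndarray:
    """Project out the k highest-variance directions and return the residual."""
    Xc = X - X.mean(axis=0)               # center, as PCA does internally
    pca = PCA(n_components=k).fit(Xc)
    top = pca.components_                 # shape (k, d), orthonormal rows
    return Xc - Xc @ top.T @ top          # subtract the projection onto the top PCs

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
X_tail = remove_top_pcs(X, k=1)           # what remains after deleting 1 PC
# Feeding X_tail to a similarity metric vs. a stability metric is the contrast
# behind the spectral-deletion experiment described in this article.
```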

🍞 Top Bread (Hook) If you have a strong ruler, you can measure shapes even after rotating them.

🥬 The Concept: CKA compares representations while being robust to certain transforms, emphasizing dominant subspaces.

  • How it works: Centered kernel alignment between two sets of features.
  • Why it matters: Great for alignment; misses fragile tail geometry.

🍞 Bottom Bread (Anchor) Two encoders trained differently can get high CKA but differ in stability under perturbations.
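
For reference, a minimal sketch of the widely used linear CKA formula (the paper also discusses a debiased variant, which this sketch does not implement):

```python
# Minimal linear-CKA sketch (one common variant; not necessarily the paper's exact code).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with matched rows (items)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return float(hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")))

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))   # random orthogonal rotation
print(round(linear_cka(A, A @ Q), 3))            # ~1.0: CKA is invariant to rotations
```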

03 Methodology

At a high level: Input embeddings → build two “views” (split features or samples) → compute two distance tables (RDMs) → compare them with rank correlation → stability score.

Step-by-step (like a recipe)

  1. Gather representations
  • What happens: Take your n×d embedding matrix X (n items, d features).
  • Why it exists: You need the raw “coordinates” that define geometry.
  • Example: 1,600 CIFAR images encoded into 768-D vectors by a ViT.
  2. Build pairwise distance tables (RDMs)
  • What happens: Compute a distance for every pair of items to fill a square table.
  • Why it exists: The RDM captures the geometry (who’s near/far) independent of axis directions.
  • Example: Use cosine distance so identical items have 0 and very different ones approach 1.
  3. Make two independent “views” of the same geometry
  • What happens: Create two RDMs from complementary subsets:
    • Feature-Split Shesha (Shesha_FS): randomly split the d features into two disjoint halves and compute an RDM from each half; repeat K times and average.
    • Sample-Split Shesha (Shesha_SS): split the n samples into two disjoint groups and compute RDMs; correlate on the overlapping item-pairs or use class centroids.
  • Why it exists: If geometry is truly distributed and robust, each view preserves the same pairwise ordering.
  • Example: 768 features → split into 384/384; or 1,600 samples → split into 800/800.
  4. Compare the two RDMs by rank, not raw numbers
  • What happens: Use Spearman rank correlation on the upper triangles of the two RDMs.
  • Why it exists: Ranking focuses on order (who’s closer than whom), making it robust to small scaling or noise.
  • Example: If (cat, tiger) stays closer than (cat, truck) across splits, ranks stay aligned → high stability.
  5. Average across multiple random splits
  • What happens: Repeat steps for K = 30 splits and average the correlations.
  • Why it exists: Reduces variance and prevents a lucky/unlucky split from misleading you.
  • Example: Seeds fixed for reproducibility; mean score becomes your stability.
  6. Optional: Supervised variants when you have labels
  • What happens: Align geometry to task structure:
    • Label-conditioned centroids: compute class centroids in two splits and compare RDMs of centroids.
    • Supervised RDM alignment: correlate the model RDM with an “ideal” label RDM.
    • Variance/separation ratios or LDA-direction stability.
  • Why it exists: In real tasks, we care that the stable geometry matches the task’s categories or relations.
  • Example: On SST‑2 (sentiment), supervised Shesha predicts whether a simple linear “steering” direction will work (ρ up to 0.96).
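
Putting steps 1–5 together, here is a minimal, illustrative sketch of the unsupervised feature-split recipe. The function name `shesha_fs` and the exact defaults are my shorthand for the text above (cosine distances, Spearman rank, K = 30 splits, fixed seed); the authors’ official implementation may differ in details.

```python
# Sketch of feature-split stability (Shesha_FS) following the recipe above.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def shesha_fs(X: np.ndarray, k_splits: int = 30, seed: int = 0) -> float:
    """Average Spearman correlation between RDMs built from disjoint feature halves."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(k_splits):
        perm = rng.permutation(d)
        half_a, half_b = perm[: d // 2], perm[d // 2:]
        # pdist returns each RDM's upper triangle directly (condensed form).
        rdm_a = pdist(X[:, half_a], metric="cosine")
        rdm_b = pdist(X[:, half_b], metric="cosine")
        rho, _ = spearmanr(rdm_a, rdm_b)
        scores.append(rho)
    return float(np.mean(scores))

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 768))                  # e.g., 400 items, 768-D embeddings
print(round(shesha_fs(X, k_splits=10), 3))       # fewer splits for a quick demo; higher = more stable geometry
```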

Secret sauce: Why this method is clever

  • Uses self-consistency (no external reference needed) to detect whether geometry is redundantly and robustly encoded.
  • Rank-based comparison focuses on relational order, which matters for decisions and resists small metric jitters.
  • Feature splits test redundancy across dimensions; sample splits test generalization across data.
  • Retains sensitivity to fine-grained manifold structure beyond the top principal components (the spectral tail) that similarity metrics underweight.

Concrete mini “Sandwich” explainers for key steps

🍞 Top Bread (Hook) If two halves of a choir sing the same harmony, the song is stable.

🥬 The Concept: Feature-Split Shesha (Shesha_FS) tests if different feature halves preserve the same pairwise distance ordering.

  • How it works: Randomly split features, compute two RDMs, rank-correlate them, average over splits.
  • Why it matters: Catches whether geometry is spread out (robust) or crammed into a few fragile dimensions.

🍞 Bottom Bread (Anchor) If any half of the notes keeps the melody recognizable, the song’s structure is stable.

🍞 Top Bread (Hook) If two teams, each seeing different students, still agree which pairs are most alike, the grading is consistent.

🥬 The Concept: Sample-Split Shesha (Shesha_SS) checks stability across different subsets of items.

  • How it works: Split samples, compute two RDMs, correlate on shared pairs or class centroids.
  • Why it matters: Detects overfitting to particular examples; robust models keep orderings.

🍞 Bottom Bread (Anchor) Different test versions that rank answers the same way show a solid exam.
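
A companion sketch of the sample-split idea using class centroids; `shesha_ss_centroids` is my own illustrative helper, and the paper also describes a variant that correlates shared item-pairs instead of centroids.

```python
# Hedged sketch of a sample-split, class-centroid variant (Shesha_SS-style).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def shesha_ss_centroids(X: np.ndarray, y: np.ndarray, seed: int = 0) -> float:
    """Rank-correlate class-centroid RDMs built from two disjoint sample halves."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    half_a, half_b = idx[: len(X) // 2], idx[len(X) // 2:]
    classes = np.unique(y)
    # Assumes every class appears in both halves (fine for this toy example).
    cents_a = np.stack([X[half_a][y[half_a] == c].mean(axis=0) for c in classes])
    cents_b = np.stack([X[half_b][y[half_b] == c].mean(axis=0) for c in classes])
    rho, _ = spearmanr(pdist(cents_a, metric="cosine"), pdist(cents_b, metric="cosine"))
    return float(rho)

rng = np.random.default_rng(1)
means = 3.0 * rng.normal(size=(10, 128))                # shared class geometry
y = rng.integers(0, 10, size=1000)
X = means[y] + 0.5 * rng.normal(size=(1000, 128))       # mild within-class noise
print(round(shesha_ss_centroids(X, y), 3))              # high when both halves recover the same centroid geometry
```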

🍞 Top Bread (Hook) If street signs stay in the same relative order after a gust of wind, you can still navigate.

🥬 The Concept: Spearman rank on RDMs compares the order of distances rather than exact values.

  • How it works: Sort pairwise distances and compare rankings.
  • Why it matters: Changes that don’t reshuffle order won’t cause false alarms; real reshuffles do.

🍞 Bottom Bread (Anchor) “Cat closer to tiger than to truck” staying true earns a high stability score.
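
A tiny toy example of why rank correlation behaves this way (illustrative only):

```python
# Toy sketch: Spearman on RDM entries ignores monotone rescaling but reacts to reorderings.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
distances = rng.uniform(0.0, 1.0, size=45)     # e.g., the upper triangle of a 10-item RDM

stretched = 7.0 * distances ** 3               # monotone rescale: same ordering of pairs
shuffled = rng.permutation(distances)          # reordering: the geometry is reshuffled

rho_same, _ = spearmanr(distances, stretched)
rho_diff, _ = spearmanr(distances, shuffled)
print(round(rho_same, 2), round(rho_diff, 2))  # 1.0 vs roughly 0.0
```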

Worked examples with real data

  • Spectral deletion: Removing just the top principal component made similarity scores (CKA, PWCCA, Procrustes) drop below 0.4, but Shesha stayed above 0.4 until 26 PCs were removed; at 30 PCs removed, Shesha kept 110× more signal than CKA.
  • Massive cross-domain test: Across 2,463 configurations in 7 domains, stability vs similarity had ρ ≈ −0.01 (negligible), proving distinct axes.
  • Drift canary: On fine-tuned LMs, Shesha registered almost 2× more geometric change than CKA on average (up to 5.23× in Llama) and flagged drift earlier, while avoiding Procrustes’ high false-alarm rate (44% vs 7.3%).

Implementation notes

  • Defaults: K = 30 splits; cosine distances; Spearman rank; cap samples at ~1600 for efficiency.
  • Reproducibility: Fixed seeds; deterministic ops; average over splits.
  • Cost trade-off: Needs multiple passes/splits (more compute) in exchange for reliability.

Extra “Sandwich” on a frequent comparison metric

🍞 Top Bread (Hook) Trying to match two photos by rotating them until they line up is like a puzzle.

🥬 The Concept: Procrustes distance aligns two feature sets by optimal rotation/scale to measure mismatch.

  • How it works: Find the best rigid transform minimizing residual error.
  • Why it matters: Good for shape matching but sensitive to tiny tail noise in high dimensions, causing false alarms.

🍞 Bottom Bread (Anchor) In stable regimes, Procrustes cried “drift!” 44% of the time despite <1% accuracy drop; Shesha did so only 7.3% of the time.
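
For intuition, a short sketch with SciPy’s standard Procrustes routine, which reports a disparity after the best rotation/scaling; this is a generic illustration, not necessarily the paper’s exact Procrustes metric.

```python
# Hedged sketch: Procrustes disparity after optimally aligning one point set onto another.
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 32))

# A pure rotation: Procrustes can undo it, so the disparity stays essentially zero.
Q, _ = np.linalg.qr(rng.normal(size=(32, 32)))
_, _, disparity_rotation = procrustes(A, A @ Q)

# Small isotropic noise leaves a residual Procrustes cannot rotate away, so disparity rises.
_, _, disparity_noise = procrustes(A, A + 0.1 * rng.normal(size=A.shape))

print(round(disparity_rotation, 4), round(disparity_noise, 4))
```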

04 Experiments & Results

The Test: What was measured and why

  • Distinctness: Do stability (Shesha) and similarity (CKA) track different properties?
  • Sensitivity: Does Shesha warn earlier about representation drift without crying wolf?
  • Controllability: Does supervised stability predict linear steering success?
  • Selection trade-offs: How does stability relate to transferability in vision?
  • Biology and brain: Does stability capture real, coherent structure in CRISPR and neural data?

The Competition: Baselines and peers

  • Similarity metrics: CKA (and also PWCCA, Procrustes).
  • Transfer metric: LogME for vision.
  • Separability metrics: Fisher discriminant, silhouette (controls for steering).

Scoreboard (with context)

  1. Universality of distinct axes
  • Across 2,463 configurations in 7 domains, Shesha vs CKA had ρ ≈ −0.01 (95% CI [−0.06, +0.03]). That’s like saying they share <0.1% variance—basically independent.
  • Domain specifics: Four domains showed negligible correlations (|ρ|<0.10). Audio and Video showed small negatives; Protein had a moderate negative (likely due to low‑dimensional PCA compression).
  • Regimes explained: Random projections/noise preserved both (ρ up to +0.90), PCA compression flipped the sign (ρ ≈ −0.47), and natural encoders were weakly positive (ρ ≈ +0.31–0.34). Opposing effects cancel in aggregate.
  2. Spectral sensitivity (why Shesha sees more)
  • Deleting the top PC made debiased CKA, PWCCA, and Procrustes drop below 0.4.
  • Shesha stayed >0.4 until removing ~26 components; at 30 components removed, it retained ~110× more signal than CKA.
  • Whitening can let CKA “recover” by flattening the spectrum, but that’s an artifact; Shesha keeps tracking the tail’s true structure.
  3. Steering and control (do simple linear nudges work?)
  • Supervised Shesha predicted steering success extremely well: ρ = 0.89 (synthetic), 0.96 (SST‑2), 0.96 (MNLI), all p < 1e−18.
  • Even after controlling for separability (Fisher, silhouette), partial ρ stayed large (0.62–0.76). Stability adds unique signal: separability may be necessary; stability makes it reliably controllable.
  • Unsupervised Shesha worked on synthetic data (ρ = 0.77) but failed on real tasks (ρ ≤ 0.35, n.s.), proving task alignment matters for control.
  4. Vision: the stability–transferability split
  • On simple datasets (CIFAR‑10): near-zero correlation (ρ = −0.07). On more complex ones (CIFAR‑100, Flowers‑102): negative trends (down to ρ = −0.42), revealing a “geometric tax” of high transfer.
  • The DINOv2 paradox: top LogME ranks (state‑of‑the‑art transfer) but last or near-last in stability on most datasets—except EuroSAT, where both were high.
  • Architectural signals: Contrastive alignment (CLIP/EVA-02) and hierarchical transformers (Swin/PVT/CoAtNet) tended to have higher stability in several natural-image tasks.
  5. Drift detection: a geometric canary for safety
  • Post-training shifts: Shesha measured nearly 2× more geometric change than CKA (25.1% vs 12.8% on average), up to 5.23× in Llama.
  • Predictive validity: All metrics tracked accuracy drops similarly (ρ ≈ 0.90+), so Shesha’s extra sensitivity is genuine signal, not noise.
  • Earlier warning: Using a 5% threshold, Shesha detected first in 73% of models; CKA 0%; ties 27%. ROC on LoRA: AUC 0.990 (Shesha) vs 0.988 (Procrustes) vs 0.987 (CKA).
  • False alarms: In stable regimes, Procrustes flagged drift 44% of the time vs 7.3% for Shesha/CKA; at minimal perturbation, Procrustes showed 1.50% drift vs 0.04% for Shesha (~37× inflation).
  6. CRISPR screens: biological coherence
  • Stability tracked perturbation effect magnitudes with strong monotonic correlations (ρ from ~0.75 to ~0.96 across datasets; pooled ρ ≈ 0.915 after calibration).
  • Combinatorial guides showed higher stability than single-gene ones; stability was largely independent of sample size and only weakly tied to intrinsic variance.
  7. Neuroscience: behaviorally relevant geometry
  • Across 228 area-session observations, Shesha predicted neural–behavior coupling (ρ = 0.18, p = 0.005), while simple temporal centroids did not.
  • Regional hierarchy emerged: Striatum highest stability; Hippocampus lowest. Drift accumulated gradually with no acceleration.

Surprising findings

  • Similarity can stay high while stability collapses under PCA: high CKA ≠ safe geometry.
  • Transfer stars (e.g., DINOv2) can pay a “geometric tax,” ranking low in stability.
  • Procrustes, despite strong validity, overreacts to harmless spectral tail noise, causing many false alarms.

05 Discussion & Limitations

Limitations (honest assessment)

  • Global focus: Shesha works on whole-geometry RDMs and may miss token- or position-level quirks unless you zoom in with tailored variants.
  • Family effects: Some stability patterns cluster by architecture/training family, so cross-family comparisons need care.
  • Compute cost: Stability needs multiple splits/resamples (extra forward passes), unlike one-shot similarity—this is a reliability-for-compute trade.
  • Label dependence (for control): In semantic tasks, unsupervised stability alone isn’t predictive of steerability—you need supervised alignment.

Required resources

  • Access to embeddings for enough samples (hundreds to ~1600 typical) and repeated passes.
  • Deterministic pipelines (fixed seeds) for reproducibility if benchmarking models.
  • Optional labels for supervised variants and steering predictions.

When NOT to use

  • If you only need quick cross-model alignment on dominant structure (e.g., sanity checks on similar checkpoints), CKA/RSA might suffice.
  • If you have extremely tiny datasets where RDMs are too small to be informative (very low n), stability estimates will be noisy.
  • If your application rewards aggressive adaptability over consistency (e.g., broad transfer exploration), prioritize transfer metrics first.

Open questions

  • Local stability maps: Can we localize which features or regions of the manifold are brittle vs sturdy?
  • Causality: How do training objectives and data curricula causally shape stability, not just correlate with it?
  • Efficiency: Can we design low-cost estimators that approximate Shesha with far fewer splits?
  • Trade-offs: How to jointly optimize for transferability and stability—can we reduce the “geometric tax” of top transfer models?
  • Safety thresholds: What stability levels guarantee acceptable behavior under well-defined deployment shifts?

06 Conclusion & Future Work

Three-sentence summary

  • This paper adds a missing axis—geometric stability—to representation analysis, measuring how reliably a model’s internal geometry holds under perturbations.
  • Shesha, a self-consistency framework based on split-half RDM correlations, reveals that stability and similarity are empirically distinct (ρ ≈ −0.01) and that stability predicts control, detects drift earlier, and captures fine structure that similarity overlooks.
  • The findings generalize beyond ML, tracking CRISPR perturbation coherence and neural–behavior coupling, showing stability’s broad scientific value.

Main achievement

  • Turning internal consistency of representational geometry into a practical, robust metric that complements similarity and delivers actionable insights for safety, controllability, and model selection.

Future directions

  • Build fast, local, and task-aware stability diagnostics; co-train objectives that explicitly preserve stability; design architectures/training schemes that balance transferability and stability.
  • Develop deployment playbooks linking stability thresholds to intervention policies (e.g., rollback, re-tune, or alert).

Why remember this

  • Similarity says “what’s there”; stability says “will it hold.” For safe, steerable, and scientific AI—and even for cells and brains—you need both. Shesha gives you the shake test your models were missing.

Practical Applications

  • Set up production drift monitors that trigger on Shesha drops to catch risky representation changes earlier with fewer false alarms (see the sketch after this list).
  • Screen pretrained models with both stability (Shesha) and transferability (LogME) to pick the best fit for safety-critical vs. flexible-fine-tuning scenarios.
  • Precheck linear steering feasibility by computing supervised Shesha; only deploy activation additions on models with high task-aligned stability.
  • Audit instruction-tuned or LoRA-updated LMs by tracking Shesha across versions to quantify alignment-induced geometric reorganization.
  • Choose vision backbones for zero-shot or long-lived deployments by preferring families with higher stability (e.g., CLIP/EVA, hierarchical transformers).
  • Harden pipelines by alerting when PCA or aggressive compression starts degrading stability even if accuracy and CKA look fine.
  • In CRISPR analyses, rank perturbations by Shesha to prioritize coherent, reproducible regulatory effects for follow-up experiments.
  • In neuroscience datasets, use trial-split Shesha to find brain regions whose stable geometry best predicts behavior.
  • During model distillation or quantization, track Shesha to ensure size/speed gains don’t shatter fine-grained manifold structure.
  • Design training objectives that explicitly regularize Shesha (e.g., stability-aware contrastive losses) to build more controllable models.
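
As a sketch of the first item above (the drift monitor), here is a minimal alarm rule assuming you can recompute a stability score on a fixed probe set; the 5% relative-drop threshold mirrors the detection experiment described earlier, and all names here are mine, not the paper’s.

```python
# Hedged sketch of a stability-based drift monitor (illustrative names and numbers).
def stability_drift_alarm(baseline: float, current: float, rel_drop: float = 0.05) -> bool:
    """Alarm when the stability score falls more than rel_drop relative to baseline."""
    return (baseline - current) / max(baseline, 1e-12) > rel_drop

# Usage: score a fixed probe set with a stability metric (e.g., the shesha_fs sketch
# from the Methodology section) for each new checkpoint and compare to the approved one.
baseline_score = 0.82   # stability of the approved checkpoint (hypothetical number)
current_score = 0.74    # stability of the latest fine-tuned checkpoint (hypothetical)
if stability_drift_alarm(baseline_score, current_score):
    print("Geometric drift detected: review this checkpoint before serving it.")
```
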
Tags: geometric stability, representation similarity, CKA, RSA, RDM, split-half reliability, PCA spectral tail, random projection, Procrustes, transferability (LogME), representation drift, steering controllability, contrastive alignment, hierarchical transformers, safety monitoring