RAPTOR: Ridge-Adaptive Logistic Probes
Key Summary
- RAPTOR is a simple, fast way to find a direction (a concept vector) inside a frozen language model that points toward a concept like 'sarcasm' or 'positivity.'
- It uses logistic regression with ridge regularization and tunes just one knob (lambda) on a validation set to balance accuracy and stability.
- These concept vectors can be added to a model's hidden activations during inference to steer behavior without changing any weights.
- Across six human-written datasets and seven instruction-tuned models, RAPTOR matches or beats strong baselines in accuracy while being more stable and cheaper to train.
- Directional stability means the learned direction barely wiggles when the training data is slightly changed; RAPTOR improves this over many competitors.
- The paper also explains why the ridge knob helps, using high-dimensional theory (CGMT) to show how lambda controls accuracy and stability.
- RAPTOR's vectors enable reliable downstream steering, with near-perfect success at hitting target probe probabilities using minimal-strength edits.
- Because it's lightweight, you can probe many layers and concepts quickly, making it practical for real-world 'probe-then-steer' workflows.
- Overall, RAPTOR shows that a well-tuned, simple method can outperform or match complex ones for concept extraction and steering.
Why This Research Matters
RAPTOR makes it practical to control large language models at inference time without changing their weights. This means safer, more reliable text generation can happen quickly, even across many layers and concepts. Because the method is fast and stable, product teams can sweep dozens of concepts and layers to find the best places to intervene. The single tuning knob (lambda) keeps everything understandable and repeatable, which is crucial for debugging and audits. The high-dimensional theory gives confidence that the method's stability is not a fluke. Taken together, this enables dependable content filtering, tone adjustment, and capability control in real-world applications.
Detailed Explanation
01 Background & Problem Definition
You know how you can check what's inside a mystery box by gently shaking it or listening for sounds? Before this work, researchers did something similar to large language models (LLMs): they used 'probing' to see what ideas (like 'sarcasm' or 'city names') were hiding inside the model's layers. Probing means freezing the big model and training a tiny classifier on top of its hidden activations to predict a simple label. This tells us whether the model's layers already encode that concept.
But here's the twist: people didn't just want to measure concepts; they wanted to use them. Imagine you find the 'sarcasm' direction in the model's brain and then gently push the model along that direction so it talks more or less sarcastically. That is called additive activation steering. It's like nudging a toy car in the exact direction you want without opening the engine.
The world before: Probing was popular for measurement (how much of a concept is in which layer?). Some teams went further and built probe-then-steer pipelines: train a probe, read off its weight vector as a 'concept vector,' and add that vector into the model's hidden state at inference time. This let you nudge behavior without finetuning the model. However, three headaches kept popping up: (1) accuracy: does the probe really recognize the concept? (2) directional stability: if you resample the data a little, does the direction change a lot? (If it does, you can't reuse it reliably.) (3) cost: if a method is slow or fussy, you can't sweep many layers and concepts.
People tried many estimators, from simple linear classifiers to more complex, nonlinear methods. Some improved raw accuracy but turned out brittle: tiny dataset changes spun the direction around. Others were heavy to train, making large layer-by-layer sweeps too expensive. So, lots of probing papers accidentally focused only on 'Is my accuracy a bit higher?' and underplayed 'Is my direction stable and can I afford to do this everywhere?'
The gap: What was missing was a simple, one-knob probe that is accurate enough, directionally stable in few-shot, high-dimensional settings (where the number of features is huge compared to examples), and cheap to train so you can probe many layers and models.
Real stakes: Why care? Because probe-then-steer lets you control model behavior without risky retraining. Think safer content generation (reduce toxic outputs), better personalization (more formal or more cheerful tone), or fast debugging (turn up/down a capability to see what changes). If your concept direction is wobbly or expensive to obtain, these everyday controls become unreliable or impractical.
So this paper introduces RAPTOR: a ridge-adaptive logistic probe. It keeps things minimal: just logistic regression plus ridge regularization, tuning a single knob (lambda) by validation. Despite the simplicity, this directly targets the needs: strong linear accuracy, stability under small data changes, and low cost. The authors also explain the 'why' with a clean high-dimensional theory, giving a sturdy backbone to the empirical wins.
Now, before we go deeper, let's quickly meet the core ideas using simple, sandwich-style explanations:
Hook: Imagine digging in a garden to see which plants are growing under the soil. Probing: Probing is a tiny classifier trained on frozen model activations to tell if a concept is present.
- How it works: (1) Collect texts with labels (e.g., sarcastic vs. not), (2) run the model and grab a layer's hidden states, (3) train a small classifier on those states.
- Why it matters: Without probing, we don't know which layers encode which concepts, and we can't extract directions to steer the model. Anchor: A probe trained on movie reviews predicts 'positive vs. negative' from a layer's activation.
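To make the probing setup concrete, here is a minimal sketch of collecting last-token activations from a frozen model with the Hugging Face transformers library; the model name and layer index are illustrative choices, not the paper's exact configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; any LM that exposes hidden states works
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def last_token_features(texts, layer=12):
    """One feature vector per text: the chosen layer's hidden state at the last token."""
    feats = []
    for text in texts:
        inputs = tok(text, return_tensors="pt")
        out = model(**inputs, output_hidden_states=True)
        feats.append(out.hidden_states[layer][0, -1])  # shape: (hidden_dim,)
    return torch.stack(feats)

# X = last_token_features(["That movie was great!", "Worst two hours of my life."])
# y = [1, 0]  # e.g., positive = 1, negative = 0 for a sentiment probe
```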
Hook: Think of a treasure map arrow pointing toward gold. Concept Vectors: A concept vector is a direction in activation space that points toward more of that concept.
- How it works: The probe's learned weight vector shows the direction that best separates 'has concept' vs. 'not.'
- Why it matters: Without a clean direction, steering becomes guesswork. Anchor: The 'positivity' vector points from neutral to happier-sounding representations.
Hook: Like adding a gentle push to a bicycle to go slightly uphill. Additive Activation Steering: You add a scaled concept vector to a hidden activation to nudge the model toward or away from a concept.
- How it works: h ← h + α v, with v the concept vector and α the strength.
- Why it matters: Without this, behavior control would require full finetuning. Anchor: Add the 'sarcasm' vector to get a more sarcastic tone.
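And here is roughly what the 'h ← h + α v' edit looks like in code: a forward hook on one decoder layer that adds a scaled concept vector to its output. The layer path (model.model.layers[12]) and the tuple handling are assumptions that vary by architecture, so treat this as a sketch rather than a drop-in recipe.

```python
import torch

def make_steering_hook(v, alpha):
    """Forward hook that adds alpha * v (a unit-length concept vector) to a layer's output."""
    v = v / v.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v.to(hidden.device, hidden.dtype)
        return (steered,) + tuple(output[1:]) if isinstance(output, tuple) else steered

    return hook

# Hypothetical usage on layer 12 of a decoder-only model:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(v, alpha=4.0))
# ...generate text with the concept nudged up; negate alpha to push away...
# handle.remove()
```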
02 Core Idea
The 'Aha!' in one sentence: A single, validation-tuned ridge logistic probe can produce concept vectors that are accurate, directionally stable, and cheap to train, making probe-then-steer practical and reliable.
Three analogies for the same idea:
- Radio tuner: Ridge is the fine-tune knob that reduces static (noise) so the station (concept direction) comes in clearly.
- Compass stabilizer: The magnet (regularization) keeps your compass needle from wobbling when you jog slightly (small data changes).
- Backpack weight: Adding a small, well-chosen weight (ridge) keeps you balanced on a windy bridge (high-dimensional space) so you don't tip over (ill-posed optimization).
Before vs. After:
- Before: Unregularized logistic regression could blow up in separable data; complex probes sometimes overfit and became unstable; training costs limited broad sweeps.
- After: RAPTOR's one knob (lambda) restores well-posedness, tunes stability vs. accuracy, and keeps training fast. You get robust, reusable directions across layers.
Why it works (intuition, no equations): In high dimensions with few examples, many separating lines exist; unregularized training can chase razor-thin, unstable boundaries. Ridge regularization gently pulls the solution toward the origin, suppressing spiky, orthogonal noise and increasing alignment with the true signal. Tuning lambda is like choosing just the right amount of smoothing: too little and you wobble; too much and you blur away the concept; just right gives both accuracy and stability. The authors back this with a teacher-student model: as you dial lambda, you trade off how much of the learned vector aligns with the true concept vs. how much spills into random directions, which predicts both accuracy and directional stability.
Building blocks (with sandwich-style mini-lessons):
Hook: Deciding 'yes/no' like a ref deciding offside. Logistic Regression: It predicts a probability for a binary label from a weighted sum of features.
- How it works: (1) Make a score w·x + b, (2) push through a sigmoid to get a probability, (3) learn w, b to fit labels.
- Why it matters: It's a strong, simple baseline that turns layer activations into concept predictions. Anchor: From a review's layer features, it outputs P(positive).
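In symbols, a logistic probe scores an activation x and squashes the score into a probability (standard form; the paper's exact notation may differ):

$$ P(y = 1 \mid x) \;=\; \sigma(w^{\top} x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}. $$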
Hook: Like adding a small stretch band that keeps your handwriting neat. Ridge Regularization: Add a penalty on weight size to avoid wild, unstable solutions.
- How it works: (1) Train logistic regression, (2) add lambda times the squared norm of w, (3) pick lambda on validation for best generalization and stability.
- Why it matters: Without ridge, solutions in separable, high-dimensional data can blow up and wobble. Anchor: With ridge, two training subsets give nearly the same weight direction.
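One common way to write the ridge-penalized logistic objective the probe minimizes (details such as the 1/n factor or the label coding are conventions and may differ from the paper):

$$ \hat{w}_{\lambda},\, \hat{b}_{\lambda} \;=\; \arg\min_{w,\, b}\; \frac{1}{n} \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i (w^{\top} x_i + b)}\right) \;+\; \lambda \lVert w \rVert_2^2, \qquad y_i \in \{-1, +1\}. $$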
Hook: Sorting with many traits at once (color, size, shape, texture). High-dimensional Linear Classification: Classifying when features are very numerous compared to examples.
- How it works: (1) Many possible separating lines exist, (2) regularization picks a stable one, (3) validation finds the sweet spot.
- Why it matters: Without handling this regime, probes can be brittle or diverge. Anchor: With 10,000 features and 1,000 examples, ridge picks a reliable boundary.
Hook: A compass that still points north after a little bump. Directional Stability: The learned concept direction stays consistent under small training changes.
- How it works: (1) Train on slightly different subsets, (2) compare directions with cosine similarity, (3) higher is better.
- Why it matters: Unstable directions can't be reused for steering. Anchor: Two runs produce vectors that point almost the same way.
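A minimal sketch of that stability score, assuming you already have probe directions trained on several resampled subsets:

```python
import numpy as np

def directional_stability(directions):
    """Mean absolute cosine similarity across probe directions from resampled training sets."""
    D = np.stack([d / np.linalg.norm(d) for d in directions])
    cos = np.abs(D @ D.T)                # pairwise |cosine similarity|
    i, j = np.triu_indices(len(D), k=1)  # each unordered pair once
    return cos[i, j].mean()

# Two nearly parallel directions score close to 1.0; unstable probes drift toward 0.
print(directional_stability([np.array([1.0, 0.1]), np.array([0.9, 0.15])]))
```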
Hook: Swapping a messy obstacle course for a clean practice track that predicts race results. Convex Gaussian Min-max Theorem (CGMT): A math tool to analyze high-dimensional estimators by comparing a hard problem to a simpler 'auxiliary' one.
- How it works: (1) Rewrite the estimator as a min-max game, (2) replace the random matrix with simpler Gaussian surrogates, (3) solve a small set of equations describing performance.
- Why it matters: It explains how lambda controls alignment and noise in the learned direction, predicting accuracy/stability trends. Anchor: The theory forecasts the best ridge strength pattern seen in real LLM embeddings.
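A schematic way to state what the theory tracks (the paper's exact CGMT parameterization may differ): the learned probe direction splits into a piece along the true concept direction and a piece in random orthogonal directions, and λ controls the balance:

$$ \hat{w}_{\lambda} \;\approx\; a(\lambda)\, w^{\star} \;+\; s(\lambda)\, \xi, \qquad \xi \perp w^{\star}, $$

where a(λ) measures alignment with the true direction and s(λ) the energy spilled into orthogonal noise; the analysis describes how both accuracy and directional stability follow this trade-off as λ is dialed.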
03 Methodology
High-level recipe: Inputs (layer-wise embeddings, binary labels) → Standardize features (train-only) → Fit ridge logistic regression with validation-tuned lambda → Fold weights back to original space → Normalize to get concept vector → During inference, steer by adding α times this vector to the layer activation.
Step-by-step, like a kitchen recipe (a code sketch follows the list):
- Collect data from a frozen LLM.
- What: For each text and each layer ℓ, take the last-token hidden state h_ℓ,T as the sentence's feature vector; pair it with a binary label (e.g., sarcastic=1, not=0).
- Why: We need consistent, comparable features per example to train a probe. Using the last token is a standard, simple summary.
- Example: 5,000 sentences, each with a 4,096-dim vector at layer 12, labeled positive vs. negative.
- Standardize features using train split only.
- What: Subtract the train-mean and divide by the train-std for each feature coordinate; apply the same transform to val/test.
- Why: Prevent leakage and keep feature scales balanced so ridge tuning isn't hijacked by large-variance coordinates.
- Example: If feature j has mean 10 and std 2 on train data, we transform x_j ← (x_j - 10)/2 for all splits.
- Ridge-adaptive logistic probe training and tuning.
- What: Train logistic regression with ridge penalty λ on the train split. Sweep a small grid of λ values; pick the one with best validation accuracy; refit on train+val.
- Why: Logistic regression is a strong linear baseline. Ridge ensures existence, uniqueness, numeric stability, and better generalization in high dimensions. Validation finds the best balance.
- Example: Try λ in {1e-4, …, 1e2}. Suppose λ* = 0.1 wins on validation; refit with λ* on (train ∪ val).
- Fold weights back to original space and extract the concept vector.
- What: Because we trained in standardized space, rescale the learned weights to the original embedding units, then optionally normalize to unit length to get v_ℓ.
- Why: Steering edits are done in the model's native activation space, so the vector must be in original units.
- Example: If standardized weight for feature j is 0.2 and the train std was 2, the unstandardized weight becomes 0.1.
- Steering at inference with minimal strength.
- What: To nudge toward a concept, set h_ℓ ← h_ℓ + α v_ℓ. Choose α as the minimal amount that makes the probe's probability cross a target (e.g., 0.9999 toward, 0.0001 away), skipping layers where the probe is unreliable.
- Why: Minimal edits reduce side effects; skipping low-accuracy layers avoids misaligned pushes.
- Example: If the current probe logit is too low, compute α = (target_logit - current_logit)/||θ_ℓ||, where θ_ℓ is the probe's un-normalized weight vector at layer ℓ; if the layer already meets the target, set α = 0 (no change).
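Here is the promised code sketch tying the steps together with scikit-learn. The λ grid, solver settings, and refit details are assumptions, and sklearn's C plays the role of an inverse ridge strength (up to scaling conventions), so treat this as an illustration of the recipe rather than the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def fit_raptor_probe(X_train, y_train, X_val, y_val,
                     lambdas=(1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0)):
    """Ridge-adaptive logistic probe sketch: standardize on train only, sweep lambda,
    pick the best on validation, refit on train+val, fold weights back, and normalize."""
    scaler = StandardScaler().fit(X_train)          # train-only statistics (no leakage)
    Xtr, Xva = scaler.transform(X_train), scaler.transform(X_val)

    best_lam, best_acc = None, -np.inf
    for lam in lambdas:
        clf = LogisticRegression(C=1.0 / lam, max_iter=5000)   # L2 penalty by default
        clf.fit(Xtr, y_train)
        acc = clf.score(Xva, y_val)
        if acc > best_acc:
            best_lam, best_acc = lam, acc

    # Refit on train + val with the selected lambda, reusing the train-fit scaler.
    clf = LogisticRegression(C=1.0 / best_lam, max_iter=5000)
    clf.fit(np.vstack([Xtr, Xva]), np.concatenate([y_train, y_val]))

    theta = clf.coef_.ravel() / scaler.scale_       # fold weights back to original units
    v = theta / np.linalg.norm(theta)               # unit-length concept vector
    return v, theta, best_lam
```

The un-normalized θ (together with clf.intercept_) is what the minimal-strength rule in the last step uses to compute probe logits; the unit vector v is what gets added to the activation.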
What breaks without each step:
- No standardization: Some features dominate; tuning λ becomes inconsistent; optimization can be slow or unstable.
- No ridge: In separable data, the optimizer can chase infinite weights; small data changes swing the direction.
- No validation tuning: Fixed λ may be too weak (wobbly, overfit) or too strong (underfit), harming both accuracy and direction reuse.
- No fold-back: Steering adds the wrong-scale vector, causing ineffective or chaotic edits.
- No minimal-strength rule: You might over-steer, causing unintended language artifacts.
Concrete toy example (tiny numbers):
- Features (layer-12) for a sentence: x = [happy_word_count=3, angry_word_count=1].
- After standardizing, train logistic+ridge. Suppose learned standardized weights are w = [0.8, -0.6], bias b = 0.1.
- Fold back to original space (assume stds = [2, 1]): θ = [0.4, -0.6], b_orig adjusted accordingly.
- Concept vector v = θ / ||θ|| ≈ [0.55, -0.83]. To push toward positivity, add α v to the layer-12 activation. Choose α just big enough to make the probe's probability ≥ 0.9999.
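Continuing the toy example, a small sketch of the minimal-strength rule, assuming the probe logit at this layer is θ·h + b in original units:

```python
import numpy as np

def minimal_alpha(h, theta, b, target_p=0.9999):
    """Smallest alpha so that steering h by alpha * theta/||theta|| lifts the probe
    probability to target_p; returns 0.0 if the target is already met."""
    target_logit = np.log(target_p / (1.0 - target_p))
    current_logit = float(theta @ h + b)
    gap = target_logit - current_logit
    # Moving by alpha along v = theta/||theta|| raises the logit by alpha * ||theta||.
    return max(gap, 0.0) / np.linalg.norm(theta)

theta, b = np.array([0.4, -0.6]), 0.1                  # folded-back weights from the toy example
print(minimal_alpha(np.array([3.0, 1.0]), theta, b))   # push this activation toward positivity
```

Steering away is the mirror image: aim for a low target probability and push in the opposite direction (-v) by the analogous amount.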
Secret sauce:
- One knob (λ) that you can reliably tune with a small validation set controls both accuracy and stability in exactly the hard regime probing lives in (few-shot, high-dimensional). This simplicity is what makes RAPTOR fast, predictable, and surprisingly strong.
Mini sandwich for completeness: Hook: A tiny twist on a familiar recipe makes cookies come out perfect. Ridge-Adaptive Logistic Probe: Train logistic regression with a tuning knob (λ) that penalizes large weights.
- How it works: (1) Standardize, (2) sweep λ, (3) pick best on validation, (4) refit and fold back, (5) normalize weights as a concept vector.
- Why it matters: Delivers accurate, stable, and cheap-to-compute directions. Anchor: With the tuned λ, two different subsamples give nearly the same 'positivity' direction you can reuse for steering.
04 Experiments & Results
The test: The authors measured three things that matter for probe-then-steer: (1) accuracy (does the probe predict the concept?), (2) directional stability (does the learned direction stay similar under small data changes?), and (3) computational cost (how long does training take per layer?). They ran this across seven instruction-tuned LLMs (Llama, Qwen, Gemma series) and six human-written binary datasets: STSA (sentiment), Cities, Common, CounterFact (factual flips), HateXplain, and Sarcasm.
The competition: RAPTOR was compared to xRFM and GCS, two strong concept-estimation baselines.
Scoreboard with context:
- Accuracy: Across all 42 model-dataset pairs, RAPTOR improved average-over-layers accuracy over GCS in every case and matched/exceeded best-layer accuracy in 41/42 (one tie). Versus xRFM, RAPTOR matched or exceeded best-layer accuracy in 26/42 and average accuracy in 27/42. Overall best-layer averages: RAPTOR 0.874, GCS 0.854 (RAPTOR is like an A- when GCS is a solid B+), xRFM 0.871 (RAPTOR slightly ahead).
- Harder tasks: Gains over GCS were largest on HateXplain (+3.51 pts) and Sarcasm (+2.12 pts), where stable directions are especially valuable.
- Directional stability: On a focused robustness study (Llama-3.1-8B, Qwen-2.5-7B, Llama-3.1-70B; STSA, HateXplain, Sarcasm), RAPTOR consistently out-stabilized xRFM and came close to GCS (which was the most stable). In many cases, RAPTOR's mean absolute-cosine similarity across ablated runs was above 0.9, indicating very consistent directions.
- Cost: Median seconds per layer (log scale) show RAPTOR is consistently faster than both xRFM and GCS across the full grid. That's like finishing your homework faster without losing quality.
Steering outcomes:
- Using concept vectors from RAPTOR, the team performed additive activation steering with minimal strength to reach extreme probe probabilities (targets of 0.9999 when steering toward a concept and 0.0001 when steering away). Success was near-perfect (Succ. = 1.000 across tasks) after filtering out low-accuracy layers. Intervention rates varied (about 0.54 to 0.83), meaning sometimes the model was already near the target and needed no push. Strengths had long tails: median |α| was modest (roughly 3.6 to 12.4), but a few cases needed big pushes (max up to 249), especially for away directions or late layers.
Surprising findings:
- Simple wins: A carefully tuned ridge-logistic probe can match or beat fancier methods on accuracy, be nearly as stable as the most stable method, and train faster. The simplest tool, well tuned, can be the best tool.
- Theory lines up: High-dimensional predictions about how lambda trades off alignment (signal) versus orthogonal noise help explain observed non-monotonic trends and the dominance of the aspect ratio (n/p) over raw sample size. Correlations between predicted and true accuracy trends were strong (median Spearman 0.86, Pearson 0.90) in structure-validation sweeps.
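As a sketch of how such trend agreement can be checked, with made-up stand-in numbers rather than the paper's actual curves:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical accuracy curves over a lambda grid: theory-predicted vs. measured on embeddings.
predicted = np.array([0.71, 0.78, 0.84, 0.86, 0.83, 0.77])
measured = np.array([0.70, 0.79, 0.85, 0.85, 0.82, 0.78])

rho, _ = spearmanr(predicted, measured)   # rank (trend) agreement
r, _ = pearsonr(predicted, measured)      # linear agreement
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```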
05 Discussion & Limitations
Limitations:
- Linear readout: RAPTOR uses a linear probe. If a concept is strongly nonlinear in the embedding space, a linear direction may be insufficient.
- Reliance on validation: Picking lambda requires a validation split; tiny validation sets can be noisy, and bad splits can mis-set the knob.
- Steering tails: Some layers or 'away' directions require large α, which can introduce artifacts; filtering low-accuracy layers helps but does not eliminate tail risks.
- Domain shifts: Stability was tested under small ablations and similar splits; larger distribution shifts may still change directions.
- Teacher-student idealization: The CGMT analysis assumes Gaussian features; real embeddings are not Gaussian, so theory provides trends, not exact numbers.
Required resources:
- Frozen LLM with access to intermediate activations.
- Small labeled datasets per concept (often a few hundred to a few thousand examples work well).
- Standard CPU/GPU for quick per-layer logistic training (RAPTOR is light enough for broad sweeps).
When NOT to use RAPTOR:
- Concepts requiring multi-direction subspaces (e.g., multi-faceted, disjoint patterns) may benefit from subspace models (like GCS) rather than a single vector.
- Highly nonlinear concepts where linear separability is weak; an MLP or feature engineering might be needed.
- If you cannot obtain reliable labels or a validation split, tuning lambda becomes guesswork, and results may wobble.
Open questions:
- Multi-direction generalization: When and how should we expand from one vector to low-dimensional subspaces while keeping the one-knob simplicity?
- Layer selection: Can we learn which layers to steer automatically, balancing control and minimal side effects?
- Strength scheduling: Can the minimal-strength rule be improved with global constraints across layers to reduce tail risk?
- Theory beyond Gaussian: Can we extend precise analyses to correlated, heavy-tailed, or structured embeddings seen in real LLMs?
06 Conclusion & Future Work
Three-sentence summary: RAPTOR is a ridge-adaptive logistic probe that turns frozen LLM activations into accurate, stable, and inexpensive concept vectors. With just one validation-tuned knob (lambda), it matches or beats strong baselines on accuracy, improves directional stability, and reduces training time across many models and datasets. A high-dimensional analysis explains why this works, showing how regularization controls alignment and robustness.
Main achievement: Demonstrating that a minimalist, one-knob linear probe can power reliable probe-then-steer pipelines, delivering strong accuracy, competitive stability, and much lower cost, all backed by theory and broad experiments.
Future directions:
- Extend from single directions to compact subspaces without sacrificing speed or stability.
- Smarter layer selection and global strength scheduling to cut tail risk while preserving control.
- Richer theory for non-Gaussian, correlated embeddings closer to real LLMs.
Why remember this: RAPTOR changes the default for practical concept extraction: simple, tuned ridge logistic regression is often all you need to get robust, reusable directions for steering big models without retraining.
Practical Applications
- Safety steering: Reduce toxic or hateful outputs by pushing away from hate-related concept vectors.
- Tone control: Increase positivity or formality for customer support responses using corresponding concept vectors.
- Fact emphasis: Nudge generations toward grounded facts or away from speculative claims in QA systems.
- Style editing: Add or remove sarcasm, humor, or excitement without retraining the model.
- Debugging behavior: Temporarily magnify or suppress a capability (e.g., named entity focus) to trace model internals.
- Personalization: Align outputs with brand voice by steering toward brand-specific tone vectors.
- Dataset triage: Rapidly scan layers to see where a concept is encoded most strongly for targeted interventions.
- Prototype governance: Test policy levers (e.g., reduce stereotypes) by steering relevant concept directions.
- Low-resource control: Achieve targeted edits with few labels using the stability of ridge-tuned probes.