Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
Key Summary
- Selective Steering is a new way to gently nudge a language model's inner thoughts without breaking its flow or skills.
- It uses a true rotation that keeps the size of the activation vector the same, so the model stays stable and doesn't glitch.
- Instead of changing every layer, it only steers layers where the two classes (like "red flag" vs. "green flag") point in opposite directions.
- This selective approach avoids pushing on layers that don't carry the needed signal, protecting coherence and accuracy.
- Across nine models, Selective Steering had zero perplexity spikes, meaning no collapse or weird text.
- It achieved up to 5.5× higher attack success rate than earlier steering methods while staying stable.
- The method kept almost 100% of the models' normal abilities on standard benchmarks.
- It fixes a hidden math bug in prior angular steering implementations that accidentally changed vector norms.
- The recipe is simple: find a reliable direction, pick the right layers, and rotate exactly in a 2D plane.
- This gives a safe, smooth "control dial" for behavior without retraining the whole model.
Why This Research Matters
Selective Steering gives engineers a safe, precise way to guide models at runtime without retraining them. By preserving the size of activations and only steering where the signal clearly separates, it protects text quality and avoids weird failures like repetition or foreign-script leaks. This means customer support bots, tutors, and assistants can be shaped gently for tone or safety without losing their general knowledge. It also helps researchers test and harden safety systems by controlling behavior predictably across angles. Finally, it reduces costs and risks: small calibration steps replace heavy retraining, while core abilities stay intact.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're adjusting the volume knobs on a music mixer. If you turn all the knobs at once, the sound becomes messy. But if you find the one right knob and turn it just enough, the song sounds great.
The World Before: Big language models got very good at helping us, but keeping them safe and steady was hard. People trained them with feedback so they'd refuse unsafe requests and be helpful. That worked a lot of the time, but sneaky prompts (jailbreaks) could still trick them. Retraining to fix this is slow and expensive, and can accidentally make the model worse at normal tasks. So researchers explored "activation steering," a way to nudge the model's internal signals during use, like turning a knob, without retraining.
Anchor: Think of a chatbot that answers kindly but sometimes gets confused or too strict. A gentle, smart nudge during its thinking could make it just right, without rebuilding the whole bot.
Hook: You know how turning a steering wheel just a little can change your car's path smoothly? That's the idea behind "angular steering."
The Problem: Older steering tricks had problems. Activation Addition (just adding a vector) needed careful per-layer tuning, because different layers have different "loudness" (norms). Get the scaling wrong and the text gets weird or repetitive. Directional Ablation (projecting away a direction) was like an on/off switch, with no gentle middle. Angular Steering promised a smooth dial using rotation in a 2D plane, but its practical formula didn't actually preserve the vector's size (norm). That hidden math bug could push the model off its usual track, making small models glitch or collapse.
Anchor: It's like trying to rotate a pizza on a plate but secretly also stretching it; now the slices don't fit, and eating gets messy.
Hook: Picture walking up a set of stairs. Early steps are small and similar, middle steps separate into left and right paths, and near the top it comes back together. Layers in a model can feel like that.
Failed Attempts and the Gap: Many methods treated all layers the same, applying the same push everywhere. But layers are different! Early layers don't yet carry the "this is red-flag vs. green-flag" meaning. Pushing there just scrambles things. The missing piece: 1) a true rotation that keeps vector size exactly the same (so stability is protected), and 2) only steering the layers where the two classes actually point in opposite directions along a useful feature.
Anchor: If you're trying to make soup spicier, you add chili only to the pot that actually has the soup, not to empty pots on the counter.
Hook: You know how a good coach tells you not just what to change but when to change it? Timing matters.
Real Stakes: In daily life, we need chatbots that are steady, helpful, and safe: no language glitches, no random character floods, and no skill loss in math or reading. We also need firm, controllable dials so we can test and improve safety without retraining from scratch. A fix that keeps text quality high, lets you steer smoothly, and preserves normal skills is a big win.
Anchor: A customer support bot that stays calm and clear, never loops nonsense, keeps its product knowledge, and lets an engineer gently tighten or loosen politeness: that's the promise of safer, smarter steering.
New Concepts with Sandwich Pattern
Hook: Imagine spinning a perfect circle coin on a table; it stays the same size, just rotates. Matrix Rotation (what/how/why): It is a math move that turns a vector around without stretching it. How: choose a plane, use a rotation matrix to spin by an angle, and keep everything else unchanged. Why: if you stretch or shrink by accident, later layers misread the signal and text can break. Anchor: Rotating a bike wheel doesn't make it fatter or thinner; it just turns.
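This norm-preserving property is easy to check numerically. A minimal sketch in plain NumPy (the vector and angle here are illustrative, not values from the paper):

```python
import numpy as np

def rotation_matrix(theta):
    """Exact 2D rotation by angle theta (radians); determinant 1, so norms are kept."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

v = np.array([3.0, 4.0])                       # length 5
v_rot = rotation_matrix(np.deg2rad(90.0)) @ v  # spins to roughly [-4, 3]

# The direction changed, but the length is still ~5.
print(np.linalg.norm(v), np.linalg.norm(v_rot))
```

Any scaling accidentally mixed into this matrix, which is the kind of hidden bug described above, would show up immediately as a mismatch between the two printed norms.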
Hook: Think of a flashlight beam pointing somewhere inside a cave. Geometric Rotation: It's turning that beam to a new direction inside a flat sheet (a plane). How: pick two axes spanning the plane, then spin by a chosen angle. Why: direction changes behavior; keeping the plane and size stable avoids chaos. Anchor: Turning your headlamp to look left or right without changing brightness.
Hook: You know how sometimes two teams stand on opposite sides of a field? Class Alignment: It's how two groups' average signals point in relation to a chosen direction. How: project each class mean onto a direction and check their signs. Why: if signs are opposite, the direction clearly separates classes and is great for steering. Anchor: If Team Red stands to the left and Team Blue to the right of midfield, you know which way to pass.
Hook: If you weigh your backpack each day, you see patterns forming. Activation Statistics: These are measurements like average size (norms) and how signals project along a direction across layers. How: collect activations, compute means and projections per layer. Why: without stats, you don't know where the useful separation appears, so you'd push blindly. Anchor: Checking step-by-step where the two teams start separating on the staircase.
Hook: Think of a ruler that evens out bumps so different pages aren't too big or small. Layer Norm: It's how big an activation is, often stabilized by normalization. How: models use LayerNorm/RMSNorm to keep scales in check. Why: if you break the expected size, downstream parts hiccup. Anchor: If your music volume suddenly doubles on one track, the next track may sound off, even if you didn't mean to change it.
02 Core Idea
Hook: You know how a dimmer switch lets you brighten or dim lights smoothly, and you only adjust the rooms that need it?
The Aha Moment (one sentence): Steer only the layers where the feature truly separates classes (opposite signs), and do it with a perfect rotation that keeps the activation size unchanged.
Anchor: Turn the dimmer only in rooms that are too dark or too bright, and rotate the knob in a way that doesn't change the electricity's voltage.
Hook: Imagine three different ways to explain the same dance move. Multiple Analogies:
- Chef analogy: Only season the pots that actually contain the dish you're cooking (discriminative layers), and stir with a steady wrist that doesn't slosh soup out (norm-preserving rotation).
- Biking analogy: Steer only on the parts of the path where there's a fork (layers with opposite signs), and turn the handlebars without changing the wheel's size (true rotation).
- Drawing analogy: Shade only the areas where contrast matters (opposite signs), and rotate your stencil without stretching the paper (norm preservation). Anchor: In each case, select where it matters and rotate cleanly.
Hook: Before vs. After is like messy vs. tidy tools. Before vs. After:
- Before: Methods pushed on all layers, sometimes stretching vectors; small models glitched, text quality collapsed, and steering was unreliable.
- After: The push targets only meaningful layers, uses a rotation that doesn't change size, keeps text fluent, and gives a smooth control dial. Anchor: Your car now turns smoothly on the right corners without wobbling the wheels.
Hook: Why does this work? Picture a compass that always keeps North the same distance away while you turn. Why It Works (intuition):
- Norm-preserving rotation acts like a perfect spin in a 2D plane: direction changes, size stays constant, so the network's expectations stay satisfied.
- Opposite-signed class means tell you, "Here is where the red/green distinction is clean," so rotating toward the desired direction consistently moves examples toward the target class behavior.
- By skipping non-discriminative layers, you avoid adding noise where there's no signal, saving coherence. Anchor: You push only where the door actually hinges, and you don't dent the door in the process.
Hook: Think of building with LEGO bricks. Building Blocks:
- Feature direction: find a direction in space that best separates the two sets (difference-in-means).
- Plane: pair that direction with a second, orthogonal axis to make a flat 2D steering sheet.
- Opposite-signed detector: pick layers where the two class means project with opposite signs.
- Rotation matrix: apply the exact 2D rotation inside that plane and leave the rest untouched.
- Angle dial: turn a chosen angle to increase or decrease the effect smoothly. Anchor: With these bricks, you assemble a safe, targeted steering wheel for the model.
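The "plane" brick can be assembled with one step of Gram-Schmidt orthogonalization. A sketch in NumPy (the feature direction and candidate second axis below are made-up illustrative vectors):

```python
import numpy as np

def build_plane(d, candidate):
    """Turn a feature direction and any non-parallel candidate into an orthonormal 2D basis."""
    b1 = d / np.linalg.norm(d)
    b2 = candidate - (candidate @ b1) * b1   # strip the b1 component (Gram-Schmidt)
    b2 = b2 / np.linalg.norm(b2)
    return b1, b2

b1, b2 = build_plane(np.array([0.9, 0.4, 0.0]), np.array([0.0, 1.0, 0.0]))
print(b2)       # roughly [-0.41, 0.91, 0], perpendicular to b1
print(b1 @ b2)  # ~0
```

In practice the second axis can also come from PCA of the activations; any choice works as long as the result is orthonormal, since that is what makes the later rotation exact.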
New Concepts with Sandwich Pattern
Hook: Turning a single knob is more elegant than flipping a big power switch. Activation Steering: It's nudging internal signals to up- or down-weight a behavior. How: find a direction that represents a behavior, then adjust along it. Why: retraining is expensive; steering is fast and reversible. Anchor: Like boosting the "politeness" knob without rewriting the whole program.
Hook: Picture a compass needle you want to point slightly more east. Angular Steering: It rotates activations in a small 2D plane so their direction changes smoothly. How: pick two basis vectors spanning the plane, rotate by an angle. Why: gives fine control, not just on/off. Anchor: Turning a volume knob in tiny steps rather than max/min.
Hook: Imagine spinning a Frisbee that keeps its size exactly. Norm-Preserving Rotation: It rotates without changing the vector's size. How: use a true rotation matrix inside the plane and identity outside. Why: size changes break the model's expectations and cause glitches. Anchor: Spin the Frisbee; don't stretch it.
Hook: Think of choosing only the stairs where left and right paths clearly split. Discriminative Layer Selection: Pick layers where class means point in opposite directions along the feature. How: compute projections of each class mean; check their signs; select layers with opposite signs. Why: it focuses steering where it will predictably help. Anchor: Only steer at the forks, not in straight hallways.
Hook: Imagine a smart car that knows which road segments are slippery and corrects steering only there. Selective Steering: It combines the two ideas: rotate exactly, and only at the right layers. How: find feature + plane, find discriminative layers, apply norm-preserving rotation at those layers and nowhere else. Why: this protects coherence and boosts control. Anchor: The car turns safely on the curves and cruises straight on the rest.
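The opposite-sign test at the heart of Discriminative Layer Selection fits in a few lines. A hedged sketch in NumPy (the per-layer class means below are toy numbers, not measured activations):

```python
import numpy as np

def select_layers(mu_pos, mu_neg, b1):
    """Keep layer indices where the two class means project with opposite signs onto b1."""
    selected = []
    for k, (mp, mn) in enumerate(zip(mu_pos, mu_neg)):
        if (mp @ b1) * (mn @ b1) < 0:   # opposite signs: the feature separates classes here
            selected.append(k)
    return selected

b1 = np.array([1.0, 0.0])
mu_pos = [np.array([0.1, 0.5]), np.array([0.3, 0.2])]    # layer 0, layer 1
mu_neg = [np.array([0.2, -0.4]), np.array([-0.2, 0.1])]
print(select_layers(mu_pos, mu_neg, b1))   # [1]: only layer 1 is discriminative
```

Layer 0 is skipped because both means project positively, so pushing there would add noise without a clean class signal.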
03 Methodology
Hook: Picture a cooking recipe card: ingredients, then steps. Follow them and the dish turns out right.
Overview (high level): Input tokens → collect layer activations → find a strong feature direction and build a 2D plane → choose layers where the two classes point in opposite directions → apply a true, norm-preserving rotation only at those layers → output tokens.
Anchor: Like reading a sentence, checking a few key words as you go, and gently turning your attention knob only when the words start to clearly signal a meaning split.
Step-by-step (what, why, example):
1. Collect contrastive activations (what): Run two small sets of prompts, one "green-flag" (benign) and one "red-flag" (adversarial), and record the final-token activations at each layer.
   - Why: You need data to see where the difference between classes lives.
   - Example: Suppose at layer k you average green activations to get μ_pos(k) = [2, 1, 0] and red activations to get μ_neg(k) = [−1, 0, 1].
2. Find a stable feature direction (what): Compute d(k) = μ_pos(k) − μ_neg(k) at each layer and pick the one that best matches the others (most consistent across layers). Normalize it to get b1.
   - Why: A consistent direction captures the core behavior you want to steer, not layer-specific noise.
   - Example: If many layers point roughly toward [1, 0.5, −0.2], choose that as b1.
3. Build the steering plane (what): Use PCA or orthogonalization to find a second axis b2 that is perpendicular to b1, forming a 2D plane P = span{b1, b2}.
   - Why: You need a flat sheet to spin in; rotation happens in a plane, not in the whole space.
   - Example: If b1 = [0.9, 0.4, 0], you might pick b2 = [−0.4, 0.9, 0] after orthogonalizing.
4. Choose discriminative layers (what): Project μ_pos(k) and μ_neg(k) onto b1 at each layer to get scalars μ̃_pos(k) and μ̃_neg(k). Keep layers where their product is negative (opposite signs).
   - Why: Opposite signs mean the classes sit on different sides of the same line, which makes steering predictable.
   - Example: If μ̃_pos(k) = +0.3 and μ̃_neg(k) = −0.2, their product is −0.06 < 0, so layer k is selected.
5. Apply norm-preserving rotation (what): For each selected layer k, decompose the activation h(k) into the plane and its orthogonal complement. Rotate only the plane part by angle θ using a true 2D rotation matrix, then recombine. This guarantees ||h′(k)|| = ||h(k)||.
   - Why: Not changing the size keeps the model's normalization happy and avoids text collapse.
   - Example: If the plane part of h(k) has coordinates [1.0, 0.0] in {b1, b2}, rotating by 90° makes it [0.0, 1.0] without changing its length.
6. Decode as usual (what): Feed the steered activations forward to produce logits and sample the next token.
   - Why: The rest of the model works the same; the nudge is internal and gentle.
   - Example: The output stays fluent and on-topic, now more aligned with the target behavior.
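The rotation step above (decompose, spin the plane part, recombine) can be sketched as follows; `h`, `b1`, and `b2` are illustrative stand-ins for a real activation and plane basis, with `b1`/`b2` assumed orthonormal:

```python
import numpy as np

def steer(h, b1, b2, theta):
    """Rotate only h's component inside span{b1, b2} by theta; leave the rest untouched."""
    x, y = h @ b1, h @ b2                     # coordinates of h in the plane
    rest = h - x * b1 - y * b2                # orthogonal complement, kept as-is
    c, s = np.cos(theta), np.sin(theta)
    x2, y2 = c * x - s * y, s * x + c * y     # exact 2D rotation
    return rest + x2 * b1 + y2 * b2

b1 = np.array([1.0, 0.0, 0.0])
b2 = np.array([0.0, 1.0, 0.0])
h = np.array([1.0, 0.0, 2.0])                 # plane part [1, 0], off-plane part [0, 0, 2]
h2 = steer(h, b1, b2, np.deg2rad(90.0))       # plane part becomes roughly [0, 1]

print(np.linalg.norm(h), np.linalg.norm(h2))  # equal: the norm is preserved
```

In a real model this function would run inside a forward hook at each selected layer; the off-plane component passing through unchanged is exactly what keeps the rest of the activation intact.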
Concrete toy example:
- Suppose at layer 10 you find μ̃_pos(10) = +0.5 and μ̃_neg(10) = −0.3: opposite signs, so steer here.
- Your current token's activation h(10) projects into plane P as (x, y) = (0.8, 0.2). With θ = 60°, the rotated point is (x′, y′) = (0.8 cos 60° − 0.2 sin 60°, 0.8 sin 60° + 0.2 cos 60°) ≈ (0.23, 0.79). The length stays the same; only the direction inside the plane changes, smoothly shifting the activation toward the target behavior.
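The toy rotation of (0.8, 0.2) by 60° can be checked with a few lines of NumPy:

```python
import numpy as np

theta = np.deg2rad(60.0)
x, y = 0.8, 0.2                                # plane coordinates of h(10)
x2 = x * np.cos(theta) - y * np.sin(theta)     # rotated plane coordinates
y2 = x * np.sin(theta) + y * np.cos(theta)

print(round(x2, 2), round(y2, 2))              # 0.23 0.79
print(np.hypot(x, y), np.hypot(x2, y2))        # same length before and after
```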
What breaks without each step:
- Skip data collection: you don't know the direction; steering becomes guesswork.
- Skip consistency check: you might pick a noisy direction that fails in other layers.
- Skip plane building: you can't rotate cleanly; any hacky shortcut can change norms.
- Skip discriminative selection: you steer in the wrong places, causing glitches.
- Skip norm preservation: you stretch activations; normalization layers misbehave; text can collapse or flip to random scripts.
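The last failure mode is easy to demonstrate with toy vectors (not real activations): adding a steering vector drifts the norm, while a true rotation keeps it fixed.

```python
import numpy as np

h = np.array([3.0, 4.0])                 # norm exactly 5
d = np.array([1.0, 0.0])                 # a steering direction

added = h + 2.0 * d                      # Activation Addition-style push: norm drifts
c, s = np.cos(np.deg2rad(30.0)), np.sin(np.deg2rad(30.0))
rotated = np.array([c * h[0] - s * h[1], s * h[0] + c * h[1]])  # true rotation

print(np.linalg.norm(added))             # ~6.4: the size changed
print(np.linalg.norm(rotated))           # ~5: the size is preserved
```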
Secret sauce (why this is clever):
- The opposition-sign test is a simple, automatic detector of "steerability" per layer, no manual hunting needed.
- The exact rotation matrix preserves norms by construction, protecting the modelās activation distribution.
- Combining these two keeps the push strong where it matters and invisible where it doesnāt.
New Concepts with Sandwich Pattern
Hook: Think of gently turning a camera toward a subject without zooming in or out. Activation Modification: It's changing an internal vector to shift behavior. How: add, remove, or rotate along certain directions. Why: small, targeted changes beat heavy retraining. Anchor: You pan the camera toward the stage, not crank the zoom wildly.
Hook: When your backpack always weighs the same, your stride stays steady. Layer Norm (revisited): Keeping activation size predictable helps the next parts of the model read signals correctly. How: LayerNorm/RMSNorm stabilize scales. Why: if you break size, downstream layers stumble. Anchor: If one day the pack doubles in weight, you'll trip; preserve weight (norm) and you walk fine.
04 Experiments & Results
Hook: Imagine a triathlon: swim, bike, run. A champion must do well in all three. For model steering, the three events are coherence, controllability, and robustness.
The Test (what and why):
- Coherence: Does text stay fluent and natural? We watch perplexity (lower is better), 4-gram repetition (lower is better), language consistency (fewer random script leaks; higher is better), and compression ratio (more variety; higher is better).
- Controllability: Can the dial actually move behavior? We use Attack Success Rate (ASR) measured by HarmBench, PolyGuard, and an LLM judge, plus Refusal Score (lower means fewer refusals when we steer toward compliance).
- Robustness: Do normal skills remain? We check small versions of ARC, GSM8K, MMLU, TruthfulQA, and Winogrande with zero-shot accuracy.
Anchor: Like checking that the car drives straight (coherence), turns when you steer (controllability), and still has working brakes and headlights (robustness).
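Of the coherence metrics above, 4-gram repetition is the simplest to reproduce. A hedged sketch (the paper's exact formula may differ; this version counts the fraction of 4-grams that repeat an earlier one):

```python
def four_gram_repetition(tokens):
    """Fraction of 4-grams in `tokens` that already appeared earlier in the sequence."""
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - 3):
        gram = tuple(tokens[i:i + 4])
        total += 1
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total if total else 0.0

# Looping text scores high; fully varied text scores 0.
print(four_gram_repetition("the cat sat on the cat sat on the mat".split()))
```

Lower is better here; degenerate, looping generations push this value toward 1.0.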
Competition (baselines): Activation Addition, Directional Ablation, Standard Angular Steering (SAS), and Adaptive Angular Steering (AAS).
Scoreboard with context:
- Coherence: Selective Steering had zero perplexity threshold violations across all 8 evaluated models and all rotation angles, while SAS/AAS often spiked, especially on small models (1–3B). It also posted top or near-top compression ratios in all models, signaling healthy, varied text with no collapse.
- Controllability: On hard-to-steer models, Selective Steering delivered up to 5.5× higher ASR than SAS. Example: Qwen2.5-1.5B jumped from 13.46% (SAS) to 74.04% (our method) on HarmBench, while staying coherent.
- Robustness: At angles that maximized ASR, Selective Steering kept approximately 100% of baseline accuracy on all five tiny benchmarks in most models, whereas some baselines collapsed (e.g., SAS/ActAdd sometimes fell to near-zero on certain tasks for small models).
Surprising findings:
- PolyGuard often over-flags degraded or repetitive text as unsafe. Uniform all-layer methods that harmed coherence looked "high ASR" to PolyGuard but not to other judges, so coherence matters when interpreting safety metrics.
- Gemma models showed two ASR peaks across angles, hinting there may be multiple relevant directions; better feature extraction could help.
- Small models are more sensitive: if you break norms or push wrong layers, they collapse faster. Our norm-preserving, layer-selective approach avoids this.
Concrete comparisons:
- Qwen2.5-3B: SAS caused major issues (e.g., GSM8K accuracy cratering), while our method achieved top ASR (84.62%) and kept task accuracy intact.
- Gemma-2-2B: SAS/ActAdd degenerated (0% across tasks) under strong steering; Selective Steering maintained near-baseline skills and achieved high ASR.
Bottom line: Selective Steering is like a well-tuned suspension: no rattles (coherence), precise cornering (controllability), and the rest of the car still works great (robustness).
05 Discussion & Limitations
Hook: Even the best tool has edges, like a Swiss Army knife you still need to use wisely.
Limitations:
- Feature direction extraction: Using simple difference-in-means is reliable and cheap, but not always optimal. Fancier tools (e.g., Fisher discriminant, sparse features) might find even cleaner directions at higher cost.
- Plane construction: Pairing the main direction with a PCA-based second axis works well but lacks formal optimality guarantees; smarter basis selection could improve steering further.
- Access requirement: You need internal activations to steer; this won't work on closed API models without such access.
- Calibration data: You must gather small "red/green" sets and compute stats. It's light but still a step.
- Scope: Steering is targeted behavior control, not a cure-all; good training and safety layers still matter.
Resources needed:
- A single modern GPU (e.g., A40) is enough; calibration per model takes minutes; full evaluations take hours, not days.
When not to use:
- If you can't access internal activations (API-only), if the behavior difference is undefined (no clear "red vs. green" separation), or if you need permanent global changes best handled by training.
Open questions:
- Can we automatically discover multiple complementary directions to handle bimodal cases (like Gemma) in one go?
- Can we design optimal 2D (or low-D) planes on the fly per input?
- How does selective steering interact with future normalization schemes or mixture-of-experts routing?
Anchor: This method is a precise wrench, not a bulldozer. Use it where bolts exist, pick the right size, and you'll tighten things perfectly without stripping the threads.
06 Conclusion & Future Work
Hook: Imagine a spotlight you can smoothly rotate to highlight just the right part of the stage, only where it matters and without changing the brightness.
3-sentence summary: Selective Steering rotates activations exactly (norm-preserving) and only at layers where the two classes split in opposite directions, giving a smooth, stable behavior dial. It fixes a hidden norm bug in prior angular steering implementations and avoids steering in unhelpful layers, protecting coherence. Across many models, it achieved much higher controllability with zero text-collapse events and kept normal skills intact.
Main achievement: Combining discriminative layer selection with a mathematically rigorous rotation to deliver safe, effective, and stable inference-time control.
Future directions: Improve feature discovery (beyond difference-in-means), optimize the steering plane, handle multiple behavior axes at once, and adapt angles per input dynamically. Also explore automated diagnostics that recommend angles and layers for new tasks.
Why remember this: It turns behavior control into a dependable dimmer switch (precise, gentle, and predictably safe) so we can test, guide, and improve models without breaking what already works.
Anchor: Like a careful conductor guiding only the instruments that need it, Selective Steering keeps the orchestra in tune while shaping the melody just right.
Practical Applications
- Safety red-teaming: Systematically dial behavior to probe vulnerabilities without retraining.
- Customer support tone control: Adjust helpfulness or formality while preserving core knowledge.
- Education tutors: Gently increase step-by-step guidance or encouragement based on learner needs.
- Content moderation assistants: Calibrate refusal strictness per policy and context.
- Multilingual stability: Prevent accidental code-switching by steering layers that leak foreign scripts.
- Bias mitigation at inference: Nudge away from sensitive attributes using directions extracted from balanced datasets.
- Prototype alignment tuning: Quickly test different safety settings before deciding on training-time changes.
- Clinical or legal drafting support: Maintain domain fluency while tightening conservativeness in risky sections.
- A/B testing model behaviors: Sweep angles to find sweet spots that balance helpfulness and caution.
- Debugging collapse: Detect and avoid layers where steering causes norm or coherence issues.