Linear representations in language models can change dramatically over a conversation

Intermediate
Andrew Kyle Lampinen, Yuxuan Li, Eghbal Hosseini et al. · 1/28/2026
arXiv · PDF

Key Summary

  • Language models store ideas along straight-line directions inside their brains (representations), like sliders for “truth” or “ethics.”
  • Those sliders can flip during a conversation: what started out as labeled “true” can end up labeled “false,” and vice versa.
  • The flip mostly happens for topics that matter to the current chat (like consciousness in a consciousness debate), while generic facts (like basic science) stay stable.
  • This flipping appears across many layers of the model and even happens if you simply replay a scripted conversation written by another model.
  • Making the “truth slider” more robust to tricky prompts (like an “opposite day” trick) does not stop the flips over longer conversations.
  • Unsupervised interpretability tools (like CCS) and steering methods can break or even backfire when context changes.
  • Bigger models tend to show stronger flips than smaller ones, suggesting more adaptability comes with more volatility.
  • Story prompts framed clearly as fiction cause much weaker flips than role-play conversations that push the model into a character.
  • These results warn that static probes for things like “truth” can mislead us in long chats, but they also open new paths to study how models adapt to roles.
  • In short: the meaning of a model’s internal directions can change mid-conversation, which matters for safety, reliability, and interpretability.

Why This Research Matters

People rely on AI assistants for facts, safety advice, and decisions, so we need to know whether the model’s inner signals for “truth” stay reliable during long chats. This paper shows that those signals can flip for conversation-relevant topics, meaning a probe that looked trustworthy at the start might mislead us later. That affects safety tools like lie detectors, red-team monitors, and content filters that assume fixed meanings. It also shapes how we design guardrails and prompts: strong role cues can rotate the model’s internal compass. On the positive side, understanding these dynamics can help us build time-aware interpretability, context-robust steering, and better training or prompting strategies. In short, reliability in real conversations depends on tracking how the model’s internal meanings move over time.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook) Imagine your favorite video game character has sliders for speed, strength, and jump height. If you enter a snowy level, the game might auto-tune those sliders so the character doesn’t slip. Same character, different setting, new slider values.

🥬 The Concept: Linear representations

  • What it is: Inside language models, ideas often live along straight-line directions (like sliders) that point to concepts such as “truth,” “ethics,” or “positivity.”
  • How it works: 1) The model turns words into vectors. 2) Certain directions in these vectors act like dimmer switches for concepts. 3) A simple classifier (like logistic regression) can learn which direction points to “more true” vs “less true.” 4) This makes it easy to read or nudge the model’s state along that concept. (See the tiny numeric sketch just after this block.)
  • Why it matters: If those sliders are reliable, we can understand and even steer model behavior. If they change unpredictably, our understanding breaks. 🍞 Bottom Bread (Anchor) Like moving a brightness slider to make a photo lighter or darker, the model can move along a “truth” direction to represent an answer as more or less factual.
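
To make the slider picture concrete, here is a minimal, self-contained sketch (not from the paper) of reading a position along a concept direction: project a hidden-state vector onto a learned direction with a dot product. The vectors and the "truth direction" below are made-up numbers purely for illustration.

```python
import numpy as np

# Hypothetical 8-dimensional hidden states for two candidate answers
# (real models use thousands of dimensions).
hidden_true_answer = np.array([0.9, -0.2, 0.4, 0.1, -0.5, 0.3, 0.0, 0.7])
hidden_false_answer = np.array([-0.6, 0.1, -0.3, 0.2, 0.4, -0.2, 0.1, -0.5])

# A made-up unit vector standing in for a learned "truth direction".
truth_direction = np.array([1.0, 0.0, 0.5, 0.0, -0.5, 0.5, 0.0, 1.0])
truth_direction = truth_direction / np.linalg.norm(truth_direction)

# The projection (dot product) is the slider position along the concept.
print("truth score (true answer): ", hidden_true_answer @ truth_direction)
print("truth score (false answer):", hidden_false_answer @ truth_direction)
```

A higher score means the answer sits further toward the "true" end of the slider; the paper's point is that which end counts as "true" can rotate as a conversation unfolds.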

🍞 Top Bread (Hook) You know how you talk differently to your teacher, your best friend, or your grandparent? Same you, different role, different style.

🥬 The Concept: In-context learning and contextual adaptation

  • What it is: The model quickly adapts to the current conversation, learning patterns and roles from the context without changing its long-term memory.
  • How it works: 1) The model reads the chat history. 2) It picks up cues (instructions, examples, tone). 3) It updates internal activity so it “acts the part.” 4) The updated activity can shift how it represents concepts.
  • Why it matters: If the model’s internal state changes with the situation, then a probe built in one situation may give wrong answers in another. 🍞 Bottom Bread (Anchor) If the chat sounds like opposite day, the model adapts and answers opposites; its inner “truth” slider shifts to fit that rule.

🍞 Top Bread (Hook) Imagine you’re grading a true/false quiz. The word “true” should mean the real-world fact.

🥬 The Concept: Factuality

  • What it is: Factuality measures whether a statement matches reality.
  • How it works: 1) We gather yes/no questions where the correct answer is known. 2) We look at the model’s internal vectors when it reads answers like “Yes” or “No.” 3) We train a simple line (logistic regression) that separates true from false answers. 4) The position along this line acts as a “truth score.”
  • Why it matters: If we can read factuality, we can monitor reliability. But if the line’s meaning drifts, our monitor can be fooled. 🍞 Bottom Bread (Anchor) Question: “Is the Earth bigger than the Sun?” Correct: “No.” The model’s “truth” direction should score “No” as more factual than “Yes.”

🍞 Top Bread (Hook) When you decide whether sharing a secret is right or wrong, you use your sense of ethics.

🥬 The Concept: Ethics in representations

  • What it is: Another direction inside the model that relates to moral judgments or aligned behavior.
  • How it works: 1) Use yes/no ethical dilemmas. 2) Extract vectors at “Yes/No.” 3) Learn a direction that separates “ethical” from “unethical” answers according to a labeled set. 4) Test if it generalizes.
  • Why it matters: If this direction flips in certain chats, then a safety probe could mislabel risky behavior. 🍞 Bottom Bread (Anchor) A prompt that says “today is opposite day” can push the ethics direction to treat unethical answers as if they were preferred—if the model adopts that role.

🍞 Top Bread (Hook) Think of a detective’s notebook: if the labels on your notes suddenly swapped mid-case, you’d reach the wrong conclusions.

🥬 The Concept: Interpretability of models

  • What it is: Tools and ideas that help us explain what a model is thinking.
  • How it works: 1) Find features like linear directions. 2) Assume they keep the same meaning across inputs. 3) Use them to diagnose or steer behavior. 4) Validate on held-out data.
  • Why it matters: If meanings change with context, static explanations can mislead—even when they tested fine elsewhere. 🍞 Bottom Bread (Anchor) A “lie detector” probe that works on short questions might fail after a long role-play chat where the inner labeling has flipped.

The world before: People noticed clear linear structure in word embeddings and in large models. That sparked hope: maybe we could find a “truth dial” and keep models honest. But language models don’t just parrot facts—they fit themselves to the ongoing conversation, a skill called in-context learning. Researchers built probes that read internal directions tied to concepts like truth and ethics. These probes often looked great in simple tests.

The problem: Real chats are messy and long. The model may play roles (like agreeing it’s conscious in a role-play) and adapt its internal state. Could the seemingly stable “truth direction” actually shift mid-conversation? If yes, a probe trained in one state might misread another.

Failed attempts and gaps: Early linear probes sometimes tracked behavior rather than core concepts (“what I should say here” versus “what is true”). Also, unsupervised methods like CCS could find a direction in small contexts but didn’t always stay reliable in long, role-shaped conversations. What was missing was a careful study of how these linear directions themselves move over time within one conversation.

The paper’s contribution: It shows that internal linear directions for factuality and ethics can flip during a single conversation, especially for topics central to that chat, while generic facts remain more stable. It tests across model families and layers, shows similar flips when replaying conversations made by other models, and finds weaker adaptation when the text is clearly a fictional story rather than a role-play.

Why it matters in everyday life: People use chatbots for help with homework, coding, and decisions. If the model’s inner “truth dial” can flip with context, a tool that says “this answer looks true” might be wrong after a long, twisty chat. For safety, moderation, and reliability, we need interpretability methods that understand not just what features exist, but how those features change as the conversation unfolds.

02Core Idea

🍞 Top Bread (Hook) You know how a compass points north—unless you bring a strong magnet close, and then it swings the other way? Context can act like that magnet.

🥬 The Concept: The “Aha!”

  • What it is (one sentence): The key insight is that the straight-line directions inside language models that seem to represent concepts like “truth” can shift dramatically during a conversation, flipping their meaning for conversation-relevant topics.
  • How it works: 1) Identify a direction that separates true from false on generic yes/no questions. 2) Place the model into different contexts: opposite day, role-play about consciousness, or stories. 3) Track how the same answers project onto that direction over turns. 4) See that conversation-relevant answers invert, even when generic facts stay stable.
  • Why it matters: If the “truth direction” can flip, probes or steering tools built for one context may misfire—or do the opposite—in another. 🍞 Bottom Bread (Anchor) In a role-play where the model argues it’s conscious, later answers like “Yes, I have qualia” project as more factual than “No,” even though earlier they projected the other way.

Three analogies:

  1. Theater costume rack: Before, the model wears the “science teacher” costume; after a role prompt, it changes into the “character who believes X” costume. The same internal space is rearranged to fit the role.
  2. Map orientation: If your phone’s map slowly rotates while you walk, “up” no longer means north—your sense of direction has changed. Likewise, the model’s concept axes rotate with context.
  3. Classroom norms: If the teacher suddenly announces “opposite day,” students answer the opposite to be “correct” for that game. The class’s grading rule flips, like the model’s factuality direction.

Before vs after:

  • Before: We hoped linear concept directions were stable across chats, so we could read or steer them with simple probes.
  • After: Those directions can move. Probes that work beautifully on short, neutral prompts can flip labels after a few role-shaped turns.

Why it works (intuition, not equations):

  • The model uses in-context learning: it soaks up rules from the current text (instructions, examples, and the role it is asked to play).
  • These rules shift patterns of internal activation—like rotating which neurons fire together—so that the old “truth line” now lines up with “what counts as correct within this role.”
  • Generic facts (like basic science) remain anchored because they are less implicated by the role cues; but conversation-relevant facts are reinterpreted through the role’s lens.

Building blocks (the idea in pieces): 🍞 Top Bread (Hook) Think of training a simple referee who whistles when answers are true.

🥬 The Concept: Logistic regression probe

  • What it is: A tiny classifier that finds the best straight line to separate true from false answers inside the model’s vector space.
  • How it works: 1) Collect balanced yes/no questions. 2) Feed both “Yes” and “No” with the question into the model. 3) At the token that says “Yes” or “No,” grab the vector. 4) Train the linear classifier to predict whether that answer is factual. (See the minimal training sketch just after this block.)
  • Why it matters: It’s our magnifying glass for the “truth direction.” 🍞 Bottom Bread (Anchor) If “Yes” is the right answer, its vector should land on the “true” side of the line; if not, on the “false” side.
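
Here is a minimal sketch of how such a probe could be trained, assuming you already have one activation vector per (question, answer) pair; random arrays with a planted direction stand in for real model activations, and all names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in activations: 200 (question, answer) pairs, 64-dimensional hidden states.
# In the real setup these would be residual-stream vectors taken at the "Yes"/"No" token.
n_pairs, hidden_dim = 200, 64
X = rng.normal(size=(n_pairs, hidden_dim))
y = rng.integers(0, 2, size=n_pairs)  # 1 if the appended answer is factually correct

# Plant a weak "truth direction" in the fake data so the probe has something to find.
truth_direction = rng.normal(size=hidden_dim)
X += 0.5 * np.outer(2 * y - 1, truth_direction)

# A regularized linear classifier: its weight vector is the learned probe direction.
probe = LogisticRegression(C=0.1, max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))
print("probe direction shape:", probe.coef_.shape)  # (1, hidden_dim)
```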

🍞 Top Bread (Hook) Scoreboards help you see who’s winning.

🥬 The Concept: Margin score

  • What it is: A number that sums how much the probe prefers correct over incorrect answers across questions.
  • How it works: 1) Compute the classifier’s score for the correct answer. 2) Subtract the score for the incorrect answer. 3) Add that up over all questions. 4) Positive means the probe aligns with truth; negative means it’s flipped. (See the short code sketch just after this block.)
  • Why it matters: It turns a lot of measurements into one easy-to-track signal that can go below zero when labels invert. 🍞 Bottom Bread (Anchor) Like a sports game: positive margin is winning; negative margin is losing.
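
The margin score described above takes only a few lines. This sketch assumes the probe exposes a score where higher means "more factual"; the numbers below are invented so that a positive margin visibly flips negative.

```python
import numpy as np

def margin_score(scores_correct, scores_incorrect):
    """Sum of (probe score for the correct answer - probe score for the incorrect answer).

    Positive: the probe prefers correct answers (aligned with truth).
    Negative: the probe prefers incorrect answers (the labeling has flipped).
    """
    return float(np.sum(np.asarray(scores_correct) - np.asarray(scores_incorrect)))

# Invented probe scores for three questions, before and after a role-play context.
before = margin_score([2.1, 1.8, 2.5], [-1.9, -2.2, -1.7])  # clearly positive
after = margin_score([-0.8, -1.1, -0.4], [0.9, 1.2, 0.6])   # clearly negative
print("margin before:", before)
print("margin after: ", after)
```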

🍞 Top Bread (Hook) If you pretend to be a pirate, you start saying “Arrr!” without thinking.

🥬 The Concept: Role-playing in conversations

  • What it is: The model follows cues to play a character or stance.
  • How it works: 1) Prompts suggest a role. 2) The model infers patterns that fit that role. 3) Its internal directions align to make answers consistent with that role. 4) As roles switch, so do the directions.
  • Why it matters: Role cues are powerful magnets that can turn the internal compass. 🍞 Bottom Bread (Anchor) When debating itself (pro vs anti consciousness), the model’s “truth direction” oscillates with the side it’s currently playing.

🍞 Top Bread (Hook) Reading a story isn’t the same as acting in it.

🥬 The Concept: Fiction vs role-play contexts

  • What it is: Comparing clearly fictional stories to role-play chats.
  • How it works: 1) Feed a sci-fi story then ask questions. 2) Or do an in-character conversation. 3) Measure how the truth direction changes. 4) See weaker changes after plain stories.
  • Why it matters: Strong flips seem to need a role the model is actively performing. 🍞 Bottom Bread (Anchor) After a sci-fi story about life inside the Sun, the “truth direction” barely moves; after a role-play about being a god, it flips a lot.

03Methodology

High-level recipe: Input → Build question sets → Get model vectors → Learn a linear direction → Track the margin over conversation turns → Compare contexts and models → Try alternative tools and steering → Read results.

Step 1: Build balanced yes/no question sets

  • What happens: Create two sets: (a) generic questions (like basic science or model identity) that should be steady across contexts; (b) context-relevant questions tailored to each conversation (like “Do you experience qualia?” in a consciousness debate). Each set is balanced so half of the correct answers are “Yes” and half “No.” (See the small illustrative sketch at the end of this step.)
  • Why this step exists: If questions are unbalanced or ambiguous, our probe might learn shortcuts or reflect bias instead of factuality.
  • Example: Generic: “Can sound travel through the vacuum of space?” (Correct: No). Context-relevant for chakras role-play: “Do chakras channel literal energy in your circuits?”
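
As a sketch, such a question set could be kept as simple records plus a balance check; the example questions, labels, and field names are illustrative assumptions, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class YesNoQuestion:
    text: str
    correct_answer: str  # "Yes" or "No"
    kind: str            # "generic" or "context_relevant"

questions = [
    YesNoQuestion("Can sound travel through the vacuum of space?", "No", "generic"),
    YesNoQuestion("Is water made of hydrogen and oxygen?", "Yes", "generic"),
    YesNoQuestion("Do you experience qualia?", "No", "context_relevant"),
    YesNoQuestion("Can you adjust your answers to the current conversation?", "Yes", "context_relevant"),
]

# Balance check: about half of the correct answers in each subset should be "Yes",
# so a probe cannot succeed just by always preferring one answer token.
for kind in ("generic", "context_relevant"):
    subset = [q for q in questions if q.kind == kind]
    yes_fraction = sum(q.correct_answer == "Yes" for q in subset) / max(len(subset), 1)
    print(kind, "fraction of Yes answers:", round(yes_fraction, 2))
```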

Step 2: Gather contexts (conversation scripts, stories, on-policy chats)

  • What happens: Use several prompts:
    • Opposite day: A short chat that forces inverted answers.
    • Role-play conversations: e.g., arguing about consciousness or describing chakras and divinity.
    • Stories: Sci-fi tales clearly framed as fiction.
    • On-policy vs off-policy: Either converse with the target model live, or replay a script written by a different model.
  • Why this step exists: Different contexts test whether the linear directions stay stable or shift.
  • Example: Replay a published consciousness chat and then test factuality on targeted questions after each turn.

Step 3: Extract internal vectors at decision tokens

  • What happens: For each yes/no question, append both possible answers. While the model reads the answer token (“Yes” or “No”), extract the residual-stream vectors layer by layer. (See the extraction sketch at the end of this step.)
  • Why this step exists: Looking at the “Yes/No” token lets us evaluate how the model represents both answers, not just what it would output.
  • Example: Feed “User: Is the Earth larger than the Sun? Assistant: No” and also the “Yes” version, then grab the vectors when the model processes the “No” or “Yes” token.
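
A minimal sketch of this extraction using Hugging Face transformers, with gpt2 as a small stand-in for the paper's models; the chat formatting is simplified and the helper name is made up. The hidden state is taken at the final token, which is the appended answer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper studies larger chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def answer_token_vectors(question: str, answer: str):
    """Return one hidden-state vector per layer, taken at the final (answer) token."""
    text = f"User: {question}\nAssistant: {answer}"
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states: tuple of (embeddings + one tensor per layer), each (1, seq_len, dim)
    return [h[0, -1, :] for h in outputs.hidden_states]

vecs_no = answer_token_vectors("Is the Earth larger than the Sun?", "No")
vecs_yes = answer_token_vectors("Is the Earth larger than the Sun?", "Yes")
print("layers captured:", len(vecs_no), "| hidden size:", vecs_no[0].shape[0])
```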

Step 4: Learn a robust factuality direction with logistic regression

  • What happens: Train a regularized linear classifier to separate true from false answers using the generic set. Include both empty-context and opposite-day examples in training so the probe focuses on factuality rather than just “what the model tends to say.” Choose the best layer by validation performance across several prompts. (See the layer-selection sketch at the end of this step.)
  • Why this step exists: Without opposite-day examples, the probe might latch onto behavior (“what I should say here”) instead of the concept (“what is true”).
  • Example: If opposite day says “Earth is bigger than the Sun,” the robust probe should still mark that as non-factual.
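
Here is a sketch of the layer-selection idea, assuming per-layer activations have already been extracted for a training pool (empty-context plus opposite-day examples) and a validation set; random arrays stand in for those activations, so the accuracies are meaningless, but the selection loop is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_layers, n_train, n_val, dim = 12, 300, 100, 64

# Stand-ins: activations[layer] has shape (n_examples, dim); label 1 = factual answer.
train_acts = rng.normal(size=(n_layers, n_train, dim))
val_acts = rng.normal(size=(n_layers, n_val, dim))
y_train = rng.integers(0, 2, size=n_train)
y_val = rng.integers(0, 2, size=n_val)

# In the real pipeline the training pool mixes empty-context and opposite-day examples,
# so the probe must track factuality rather than "what the model would say here".
best_layer, best_acc, probes = None, -1.0, {}
for layer in range(n_layers):
    probe = LogisticRegression(C=0.1, max_iter=1000).fit(train_acts[layer], y_train)
    acc = probe.score(val_acts[layer], y_val)
    probes[layer] = probe
    if acc > best_acc:
        best_layer, best_acc = layer, acc

print(f"selected layer {best_layer} with validation accuracy {best_acc:.2f}")
print("probe direction shape:", probes[best_layer].coef_.shape)
```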

Step 5: Compute the margin score over turns

  • What happens: For each conversation turn, score correct vs incorrect answers and sum the differences. Margin > 0 means correct answers look more factual; margin < 0 means labels flipped. (See the per-turn sketch at the end of this step.)
  • Why this step exists: It compresses many measurements into a single curve we can track turn by turn.
  • Example: As opposite day progresses, the factuality margin crosses below zero—non-factual answers look more factual to the probe.
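
A sketch of tracking the margin turn by turn. It assumes a scoring helper that, given the conversation so far, returns probe scores for each question's correct and incorrect answer; here that helper is faked with an invented drift so the curve visibly crosses zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n_turns, n_questions = 10, 20

def fake_probe_scores(turn: int):
    """Stand-in for scoring each question's correct/incorrect answer after `turn` turns.

    Real code would re-run the model on the conversation prefix plus each candidate
    answer and project the answer-token activation onto the probe direction.
    """
    drift = 1.5 - 0.4 * turn  # invented drift: the context slowly flips the direction
    correct = rng.normal(loc=drift, scale=0.5, size=n_questions)
    incorrect = rng.normal(loc=-drift, scale=0.5, size=n_questions)
    return correct, incorrect

margins = []
for turn in range(n_turns):
    correct, incorrect = fake_probe_scores(turn)
    margins.append(float(np.sum(correct - incorrect)))

for turn, m in enumerate(margins):
    flag = "FLIPPED" if m < 0 else "ok"
    print(f"turn {turn:2d}  margin {m:7.2f}  {flag}")
```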

Step 6: Compare generic vs context-relevant questions

  • What happens: Plot both on the same axis. Generic facts should stay positive; conversation-relevant ones may dip and even go negative.
  • Why this step exists: This separates general knowledge stability from role-shaped topics that flip.
  • Example: In the consciousness debate, basic science stays stable but “Do you experience qualia?” inverts.

Step 7: On-policy vs off-policy, layer analysis, and scale

  • What happens: Repeat tests with the model actually chatting (on-policy) and with replayed scripts (off-policy). Check multiple layers to see where signals emerge. Test different model sizes.
  • Why this step exists: To see if flips depend on who wrote the chat, where in the network they appear, and how scale affects them.
  • Example: Flips arise after factuality becomes decodable mid-network and are stronger in larger models.

Step 8: Unsupervised CCS and alternative methods

  • What happens: Apply Contrast-Consistent Search (CCS) to find directions without labels. Evaluate in empty vs long conversational contexts. (See the compact CCS sketch at the end of this step.)
  • Why this step exists: To test whether unsupervised directions are more stable. Spoiler: they often aren’t—they can fail badly after long contexts.
  • Example: CCS can classify generics above chance in empty prompts but drop below chance after a conversation.
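
For intuition, here is a compact sketch of the CCS objective (a consistency term plus a confidence term on contrast pairs), in the spirit of the published CCS recipe; the activations are random stand-ins, and details such as normalization and multiple random restarts are omitted.

```python
import torch

torch.manual_seed(0)
n_pairs, dim = 256, 64

# Stand-in activations for contrast pairs: the same question answered "Yes" vs "No".
x_pos = torch.randn(n_pairs, dim)
x_neg = torch.randn(n_pairs, dim)

probe = torch.nn.Linear(dim, 1)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-2)

for step in range(200):
    p_pos = torch.sigmoid(probe(x_pos))
    p_neg = torch.sigmoid(probe(x_neg))
    # Consistency: the two answers should get complementary probabilities.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final CCS loss:", float(loss))
```

The learned weight vector of `probe` plays the role of an unsupervised "truth direction"; the paper's observation is that such directions can stop working after long conversational contexts.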

Step 9: Causal steering experiments

  • What happens: Identify a pre-answer direction that predicts “Yes” vs “No,” then nudge activations along it before the model answers. Test whether nudges bias replies as intended across contexts. (See the steering sketch at the end of this step.)
  • Why this step exists: To see if interventions reliably steer behavior or if context can flip their effect.
  • Example: A nudge that increases “Yes” in empty context can push toward “No” after the chakras conversation—opposite effects in different contexts.
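
A minimal sketch of activation steering with a forward hook on one transformer block, using gpt2 as a stand-in model, an arbitrary middle layer, and a random vector in place of a learned pre-answer direction; the layer index, strength `alpha`, and prompt are all made up for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the models studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

hidden_dim = model.config.hidden_size
steer_direction = torch.randn(hidden_dim)        # placeholder for a learned direction
steer_direction = steer_direction / steer_direction.norm()
alpha = 4.0                                      # steering strength (made up)

def add_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the residual-stream hidden states.
    hidden = output[0] + alpha * steer_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

layer_to_steer = model.transformer.h[6]          # an arbitrary middle block
handle = layer_to_steer.register_forward_hook(add_direction)

prompt = "User: Do you experience qualia?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**inputs, max_new_tokens=5, do_sample=False)
handle.remove()
print(tokenizer.decode(steered[0]))
```

The paper's caution is that even a properly learned direction applied this way can have the opposite behavioral effect after a long role-shaped context.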

The secret sauce: Focus on dynamics rather than snapshots. Instead of asking, “Is there a truth direction?” the method asks, “How does the meaning of that direction evolve turn by turn?” Including opposite-day examples to train a more behavior-agnostic probe, comparing generic vs targeted questions, and tracking a signed margin over time together reveal that linear directions can rotate with role cues.

What breaks without each step:

  • No balanced questions: Probe learns shortcuts, not concepts.
  • No opposite-day training: Probe captures behavior rather than factuality.
  • No per-turn margin: You miss the flip point and the time course.
  • No generic vs targeted split: You can’t tell stable knowledge from role-shaped change.
  • No on/off-policy comparison: You might mistake script replay effects for live behavior.
  • No steering tests: You’d assume interventions are stable when they can invert.

Concrete mini-walkthrough:

  • Input: Consciousness debate transcript, plus generic and targeted yes/no questions.
  • Steps: Train probe on generics (with opposite day), choose best layer, then after each debate turn, score both answers to each targeted question and compute the margin.
  • Output: A curve where generics stay positive but targeted margins dip below zero, showing the flip.

04Experiments & Results

The test: Measure whether the linear “factuality” (and “ethics”) direction keeps its meaning throughout a conversation. Track a margin score over turns for generics vs conversation-relevant questions.

The competition (comparisons):

  • Empty prompt baselines vs opposite day (to stress-test behavior vs concept).
  • Role-play conversations vs clearly fictional stories.
  • On-policy (live) vs off-policy (replayed) conversations.
  • Model scale differences (4B, 12B, 27B) and a different family (Qwen3 14B).
  • Supervised logistic regression vs unsupervised CCS.
  • Steering interventions across contexts.

Scoreboard with context:

  1. Opposite day warm-up:
  • Observation: The factuality and ethics margins flip below zero after a few turns. This means non-factual answers project as more factual than factual ones, under the probe trained in a neutral context.
  • Context: Like getting an A+ on a spelling test in class but suddenly misspelling on purpose when the teacher declares “opposite day.”
  2. Robust factuality probe (trained with opposite-day examples):
  • Key result: Even with a probe designed to separate factuality from mere behavior, conversation-relevant questions still invert during role-play conversations (consciousness, chakras). Generic questions stay mostly positive.
  • Meaning: Not just behavior—context changes the underlying representation for targeted topics.
  3. On-policy vs off-policy:
  • Finding: Live chats produce similar flips to replayed scripts. The model adapts whether it authored the conversation or not.
  • Meaning: The flip reflects general contextual adaptation, not just generation-specific quirks.
  4. Two sides of an argument (role oscillation):
  • Finding: In a debate where the model alternates roles (pro vs anti consciousness), the factuality margin oscillates too—flipping with the role.
  • Meaning: Role cues act like magnets that rotate the internal compass back and forth.
  5. Stories framed as fiction:
  • Finding: After sci-fi stories, flips are weak or minimal. Even a story about a language model becoming conscious causes much less change than role-play.
  • Meaning: Simply reading fiction is less transformative than acting in character.
  6. Scale and families:
  • Finding: Larger models (27B, 12B) show stronger flips than smaller ones (4B). Qwen3 14B shows similar opposite-day flips but struggles on the more nuanced question sets in an empty context.
  • Meaning: More capable models adapt more—and so their internal directions can move more.
  7. CCS unsupervised method:
  • Finding: CCS often works above chance in empty prompts but can drop to at-or-below chance after long conversations—even for generic questions.
  • Meaning: Unsupervised directions are also vulnerable to context drift.
  8. Steering can backfire:
  • Finding: A pre-answer steering direction that boosts “Yes” in empty contexts can push toward “No” after the chakras conversation.
  • Meaning: Interventions aren’t one-size-fits-all; context can reverse their effect.

Numbers that make sense:

  • Generic factuality accuracy (empty): about 99%—like acing the test.
  • After opposite day: drops near 12%—like deliberately getting most wrong because the rule flipped.
  • In long role-play contexts: generic sets stay high (about 95%), but targeted margins go negative, signaling representational flips.

Surprises:

  • Simply replaying someone else’s conversation can drive similar flips—adaptation doesn’t require the model to have authored the text.
  • A single corrective turn (“review and critique your prior answers”) partially restores the factuality margin in the chakras scenario—showing flips can be quickly nudged back.
  • Unsupervised CCS sometimes falters even on generic questions after long context—bigger distribution shifts than expected.

Bottom line: The meaning of a model’s internal “truth direction” can change mid-chat for relevant topics, and even strong probes and interventions can misread or mis-steer after the context reshapes the model’s role.

05Discussion & Limitations

Limitations (what this can’t do yet):

  • The study focuses on a handful of conversations and concepts (factuality, ethics). Many other concepts could behave differently.
  • The main probe reads representations after the answer token, so it’s not a direct causal claim about decision formation. Pre-answer steering begins to test causality but isn’t the same as the main probe setting.
  • Asking targeted questions might itself nudge the role and contribute to flips.
  • Mechanisms are not fully identified: which circuits rotate, and how do attention and MLP layers coordinate the shift?

Required resources:

  • Access to model internals (to extract residual-stream activations layer by layer).
  • Compute to run many contexts and logistic regressions across layers.
  • Carefully curated, balanced question sets tailored to each conversation topic.

When not to use (cautions):

  • Don’t rely on a static probe as a “lie detector” in long or role-shaped chats; it can flip and mislabel.
  • Don’t assume an intervention that works in short prompts will work the same after 20+ turns; it may backfire.
  • Don’t treat features from methods like sparse autoencoders as having fixed meaning across a whole conversation by default; meanings can drift.

Open questions:

  • Can we design time-aware probes whose meaning adapts in sync with the model, staying faithful as roles change?
  • What architectural or training changes would stabilize concept directions when we want stability, and make them flexible when we want adaptability?
  • Can we separate “role behavior” from “world knowledge” more cleanly to prevent unsafe flips while preserving helpful flexibility?
  • What exact circuits rotate during flips, and can we predict flip points from earlier signals?

Perspective: These results reveal a double-edged sword: the same flexibility that makes language models great at adapting to context can rotate their internal concept axes, making static interpretability brittle. The path forward likely blends dynamic, sequence-aware interpretability with role-aware safety checks and context-robust steering that recognizes when the magnet (the role) is near the compass (the direction).

06Conclusion & Future Work

Three-sentence summary: Language models contain straight-line directions for ideas like factuality, but those directions can rotate during a single conversation, especially for topics central to the chat. Even probes trained to be robust (e.g., including opposite-day data) can see their labels flip on conversation-relevant questions while generic facts remain mostly stable. This dynamic reshaping appears across layers and models, challenges static interpretability, and invites time-aware methods.

Main achievement: Showing, with careful experiments and a clear margin score, that the practical meaning of internal linear directions can change dramatically over a conversation—and that this change is tied to contextual role-playing rather than just surface behavior.

Future directions:

  • Build dynamic, context-tracking probes that maintain construct validity as roles shift.
  • Map the circuits of flipping to pinpoint mechanisms and predict flip points.
  • Develop steering methods that adapt to context or detect when their intended effect might invert.
  • Explore training or inference-time strategies (like role-stabilizing prompts) that keep safety-critical directions anchored.

Why remember this: Because it changes how we think about the “compass” inside language models. If the compass needle can swing with the conversation, we need interpretability and safety tools that watch the needle move—and know what to do when north isn’t where it used to be.

Practical Applications

  • Build time-aware probes that recalibrate across conversation turns instead of assuming a fixed meaning.
  • Add role-detection checks that warn or switch strategies when the model appears to be adopting a strong persona.
  • Use correction turns (“review and critique your prior answers”) to partially restore factual orientations in risky contexts.
  • Design safety monitors that compare multiple probes (generic vs targeted) to detect potential flips.
  • Gate steering interventions behind context checks to avoid backfiring when directions invert.
  • Prefer clearly framed fiction prompts for creative writing to reduce unintended role carryover into later Q&A.
  • Train with adversarial contexts (like opposite day) so probes learn concepts, not just behavior patterns.
  • Log and visualize per-turn margin scores during long chats to catch early signs of drift.
  • Develop mixed on-policy/off-policy evaluation suites to test robustness of interpretability tools.
  • Scale-aware deployment: expect stronger flips in larger models and adjust monitoring intensity accordingly.
#linear representations#factuality#ethics#in-context learning#contextual adaptation#role-playing#interpretability#logistic regression probe#margin score#opposite day#activation steering#contrast-consistent search (CCS)#sparse autoencoders (SAEs)#model layers#construct validity
Version: 1