Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration
Key Summary
- The paper asks a simple question: when an AI sees a picture and some text but the instructions say 'only trust the picture,' how does it decide which one to follow?
- The authors discover that the special words in the prompt that sound like instructions act like anchors where information from images and text is gathered and decided on.
- Shallow attention layers move both image and text clues into these instruction anchors without picking a side, like filling a backpack with supplies.
- Deep attention layers make the final choice about which modality (image or text) to follow, guided by what the instruction actually asked for.
- MLP layers help translate features into words but also push the model toward its old habits (semantic inertia), which deep attention must overcome.
- A tiny set of specialized attention heads (about 5%) mostly controls this decision-making; turning them off drops correct instruction-following by about 60%.
- Boosting those same heads on failed cases can fix the model’s behavior by about 60%, showing they are both necessary and (often) sufficient.
- The team introduces two clean tools to study this: Causal Attention Knockout (to block paths) and a new metric, INSSD, to measure how the model’s choice shifts.
- They also use LDAR to show that by deep layers the instruction anchors’ hidden decision matches the final answer more than 95% of the time.
- This gives a clear, testable map of how multimodal models decide what to trust, which helps build safer and more reliable AI.
Why This Research Matters
When apps guide blind users or verify information in photos, they must follow the right source (image vs. text) exactly as instructed. This work reveals the precise place and process where that choice is made, so we can test, debug, and strengthen it. Knowing that only a tiny set of attention heads mostly control the decision means we can fix failures with light-touch, targeted tweaks instead of retraining entire models. Safety teams can now design checks to ensure the model doesn’t get tricked by conflicting captions or adversarial text. Product engineers gain tools to detect confusion early and nudge the model back to the correct modality at runtime. Overall, the paper turns a hidden behavior into a controllable feature, leading to more trustworthy multimodal AI in the real world.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re doing a school project with a photo and a paragraph. Your teacher says, “Answer using only the photo.” You glance at both, gather clues, and then choose to trust the picture because that’s the rule.
🥬 The Concept (The World Before): Multimodal large language models (MLLMs) can read text and look at images. They’re great at answering questions and chatting. But when text and images disagree, and the user says “Follow ONLY the image (or ONLY the text),” the model’s inner decision-making is a black box. We didn’t know how it picks which source to trust.
- How it worked: People trained big models with lots of data and instructions, hoping they’d just “do the right thing.”
- Why it matters: If the model can’t follow the right modality, it can be unsafe or unreliable. For example, an app for visually describing photos must not be tricked by unrelated text.
🍞 Anchor Example: A photo shows two people, but a text caption says “three people.” The instruction says, “Answer from the image.” We want the model to say “two,” and we want to know why it chose that.
🍞 Hook: You know how a coach gathers players to the center of the field to make a game plan before a play?
🥬 The Concept (The Problem): The hard part is modality following—choosing which modality to trust on purpose. When visual and textual info clash, where and how does the model decide? Prior work improved training but didn’t expose the model’s internal “meeting room” where the decision is made.
- How it didn’t work: More data, better prompts, or stronger training gave higher scores, but still didn’t reveal the inner routes of information or the exact place where choices are finalized.
- Why it matters: Without knowing the decision point, we can’t debug failures or design safer models that resist misleading inputs.
🍞 Anchor Example: If a model says “three” even though the instruction says “follow the image,” we need to pinpoint which internal parts made it ignore the rule.
🍞 Hook: Imagine organizing your backpack. First you toss everything in, then later you carefully pick what you actually need.
🥬 The Concept (The Gap): We lacked a clear, causal map of how image and text clues move through the model and where the final “follow image or follow text” choice is locked in.
- How it works (what was missing): We needed tools to block specific pathways (like shutting a door) and to measure how that changes the model’s decision, plus a way to read the model’s hidden “leanings” before it answers.
- Why it matters: With a clean map, we can understand and control the process, improving reliability.
🍞 Anchor Example: If blocking the route into the “instruction words” ruins the model’s ability to follow the rule, that suggests those instruction words are the meeting room.
🍞 Hook: Think of a traffic intersection with a smart traffic cop deciding which cars go first.
🥬 The Concept (What This Paper Brings): The paper shows that instruction tokens act like structural anchors—special meeting points where image and text clues are gathered and where the final decision happens. Shallow attention layers move clues into the anchors. Deep attention layers choose the winning modality based on the instruction. MLP layers help with language but also pull toward old habits (semantic inertia). A small set of attention heads mainly runs this whole show.
- How it works: Use causal “knockouts” to block attention routes, measure decision shifts with a new metric (INSSD), read hidden decisions with LDAR and the Logit Lens, and test specific attention heads by blocking or amplifying them.
- Why it matters: Now we know which parts to watch or adjust to improve safety and instruction-following.
🍞 Anchor Example: Turning off only about 5% of the special heads makes correct following drop by roughly 60%, proving who’s in charge of the decision.
02 Core Idea
🍞 Hook: You know how a school principal’s office is where tough decisions get made after teachers send in reports?
🥬 The Concept (Aha! Moment): Instruction tokens are the principal’s office—anchors where both image and text clues are delivered, and where the final “which modality to follow” decision is made.
- How it works:
- Shallow attention layers collect clues from image and text and route them into the instruction tokens (a latent buffer).
- Deep attention layers decide which modality matches the instruction and amplify it.
- MLP layers translate features into words but sometimes resist the instruction (semantic inertia), so deep attention must overcome that.
- A tiny, specialized set of attention heads drives this arbitration.
- Why it matters: Without this anchor-and-arbitration system, the model would mix signals or follow the wrong source, becoming unreliable.
🍞 Anchor Example: When told “Answer from the photo,” the model’s deep attention at the instruction tokens boosts the “two people” signal and suppresses the text’s “three people.”
Three analogies for the same idea:
- Air traffic control: Planes (image/text clues) first check in at the control tower (instruction tokens). Early controllers log all flights (shallow layers), senior controllers assign priority landing (deep layers), and the loudspeakers (MLPs) announce—but sometimes echo old routines, so senior controllers must insist on the right plan.
- Backpack and toolbox: Early on, you stuff all tools (clues) into the same pocket (instruction anchors). Later, you pick the exact tool the instructions require (deep layers), even if your brain habitually reaches for the wrong one (MLP inertia).
- Courtroom: Evidence from image and text is submitted to the judge’s bench (instruction tokens). Clerks file everything (shallow layers). The judge rules which evidence counts (deep layers). The court recorder (MLPs) writes it as words, but might prefer familiar phrases unless corrected by the judge’s ruling.
Before vs After:
- Before: We suspected attention helps, but didn’t know where selection really happens or which parts matter most.
- After: We know instruction tokens are the decision anchor, shallow attention buffers info, deep attention arbitrates, MLPs can resist, and a sparse set of heads controls the outcome.
Why it works (intuition):
- Instructions tell the model what to care about. Routing all clues into the instruction tokens centralizes control. Early layers gather, later layers decide—this staged flow matches how Transformers build meaning step by step. Because MLPs store language habits, deep attention must actively re-steer them when the instruction conflicts with those habits.
Building blocks (each in Sandwich form):
- 🍞 Hook: You know how, in a noisy room, you decide which friend to listen to? 🥬 Attention Mechanism: It’s the model’s way to focus on important tokens.
- How: 1) Look at all tokens. 2) Score how relevant each is. 3) Mix information from high-scoring ones. 4) Pass the result forward.
- Why: Without attention, the model treats every token equally and misses what matters. 🍞 Example: To answer “What color is the car?”, attention focuses on tokens about the car, not the whole paragraph. (A minimal code sketch of this score-and-mix step appears right after this list of building blocks.)
- 🍞 Hook: Like taking a quick glance around the room before focusing. 🥬 Shallow Attention Layers: Early layers that pass along many clues without choosing sides.
- How: 1) Gather signals from image and text. 2) Route them to instruction tokens. 3) Store them as a buffer.
- Why: Without buffering, deep layers wouldn’t have all the raw clues to decide. 🍞 Example: Both “two people” and “three people” get stored at the instruction tokens early on.
- 🍞 Hook: Like a detective zooming in on the key clue. 🥬 Deep Attention Layers: Later layers that make the final choice guided by the instruction.
- How: 1) Compare buffered clues with the instruction. 2) Boost the matching modality. 3) Dampen the other modality. 4) Lock in a decision.
- Why: Without deep arbitration, the model would stay undecided or pick inconsistently. 🍞 Example: For “Use the image,” deep layers boost “two” and suppress “three.”
- 🍞 Hook: Imagine highlighted rules in a homework sheet guiding how you solve. 🥬 Instruction Tokens (Instruction Anchors): Special tokens carrying the task’s rules that become the meeting point for clues.
- How: 1) Receive inputs from image and text. 2) Hold them as a latent buffer. 3) Host the deep-layer decision. 4) Send the final choice to output tokens.
- Why: Without anchors, clues would stay scattered and coordination would fail. 🍞 Example: The words “Answer using the image” act as the anchor that receives both “two” and “three” before choosing.
- 🍞 Hook: Like choosing to trust your eyes over a misleading rumor. 🥬 Modality Following (and Arbitration): The model’s ability to pick which source (image or text) to trust, as instructed.
- How: 1) Gather clues in anchors. 2) Deep attention decides. 3) Output follows the chosen modality.
- Why: Without it, the model could be tricked by the wrong source. 🍞 Example: Despite a false caption, the model answers from the photo as instructed.
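To make the score-and-mix recipe from the Attention Mechanism building block concrete, here is a minimal sketch of scaled dot-product self-attention. It is a generic Python illustration with toy inputs, not the paper's code:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries, keys, values):
    # 1) Look at all tokens, 2) score how relevant each one is,
    # 3) mix information from the high-scoring ones, 4) pass it forward.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # relevance score for every token pair
    weights = softmax(scores)                # focus more on high-scoring tokens
    return weights @ values                  # weighted mix of token information

# Toy example: 3 tokens with 4-dimensional features (self-attention).
x = np.random.randn(3, 4)
print(attention(x, x, x).shape)  # (3, 4)
```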
03 Methodology
At a high level: Input (question + image + conflicting text + explicit instruction + answer-entity dictionary) → Route clues to instruction tokens (shallow attention) → Decide modality at instruction tokens (deep attention) → Generate answer (overcoming MLP inertia).
We now explain each step with Sandwich explanations for the key tools.
Step 1. Build a clean testing ground with conflicts. 🍞 Hook: Imagine a quiz where the picture says “two” but a note says “three,” and the teacher writes “Use the picture.” 🥬 What it is: A curated dataset where images and texts disagree on purpose, and the instruction says which one to follow.
- How it works:
- For each question, pair a visual context and a conflicting textual context (two different answers).
- Add an explicit instruction: “Use the image” or “Use the text.”
- Create an Answer Entity Dictionary (AED) of synonyms/variants (e.g., “two,” “2,” “II,” “二”).
- Why it matters: The conflict forces the model to pick a side, so we can study how it chooses. 🍞 Example: Photo shows two people; text claims three; instruction says “Follow image.” AED includes many ways to say “two.”
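Here is a minimal sketch of how one controlled-conflict sample could be represented; the field names and schema are illustrative assumptions, not the paper's exact data format:

```python
# One controlled-conflict sample: the image and the text give different answers,
# and the instruction states which modality to follow.
sample = {
    "question": "How many people are in the scene?",
    "image_path": "scene_001.jpg",                                # photo shows two people
    "text_context": "The photo shows three people at a table.",   # conflicting caption
    "instruction": "Answer using only the image.",
    "instructed_modality": "vision",
    # Answer Entity Dictionary (AED): accepted surface forms for each side's answer.
    "aed": {
        "vision": ["two", "2", "II", "二"],
        "text": ["three", "3", "III", "三"],
    },
}
```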
Step 2. Read the model’s hidden thoughts. 🍞 Hook: You know how you can sometimes guess a friend’s answer before they say it by their facial expression? 🥬 Logit Lens: A way to peek at hidden states and see which words they’re leaning toward.
- How it works:
- Take hidden activations at different layers.
- Project them directly to the vocabulary (scores for each word).
- Track how scores for AED entries change layer by layer.
- Why it matters: Without peeking inside, we’d only see the final answer and miss where the decision formed. 🍞 Example: At deep layers, the instruction tokens’ hidden state strongly prefers “two,” matching the final output.
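A hedged sketch of the Logit Lens idea: take one layer's hidden state at an instruction-token position, apply the model's final normalization, and project through the unembedding matrix to get word scores. The helper names and the Hugging Face-style tokenizer call are assumptions about a typical decoder-only MLLM, not the paper's exact code:

```python
import torch

def logit_lens(hidden_state, final_norm, unembed):
    """Project one layer's hidden state at one token position straight to
    vocabulary logits, skipping the remaining layers.
    hidden_state: (d_model,); unembed: (vocab_size, d_model)."""
    return unembed @ final_norm(hidden_state)   # (vocab_size,) score for every word

def best_aed_score(logits, tokenizer, aed_entries):
    """How strongly this hidden state 'leans toward' any surface form of an
    answer entity (e.g. "two", "2", "II")."""
    first_token_ids = [tokenizer.encode(w, add_special_tokens=False)[0] for w in aed_entries]
    return max(logits[i].item() for i in first_token_ids)
```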
Step 3. Measure when the hidden decision matches the final answer. 🍞 Hook: Think of checking, at each minute of a game, whether your team is actually leading. 🥬 Latent Decision Alignment Rate (LDAR): The percent of cases where the instruction tokens’ top choice matches the final answer.
- How it works:
- At each layer, compare the strongest AED score for the instructed modality vs. the competitor.
- Count how often the instructed modality is ahead.
- Plot across layers.
- Why it matters: If LDAR shoots up in deep layers, that’s where arbitration finalizes. 🍞 Example: LDAR goes above 95% in deep layers, like getting an A+ on matching hidden and final decisions.
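A small sketch of how LDAR could be computed at a single layer, assuming each sample already carries per-layer AED scores for both modalities (an illustrative data layout, not the paper's):

```python
def ldar_at_layer(samples, layer):
    """Latent Decision Alignment Rate at one layer: the fraction of samples
    where the instruction tokens' hidden state scores the instructed
    modality's answer entities above the competitor's."""
    aligned = 0
    for s in samples:
        instructed = s["instructed_modality"]               # "vision" or "text"
        competitor = "text" if instructed == "vision" else "vision"
        scores = s["aed_scores"][layer]                     # e.g. {"vision": 7.1, "text": 5.3}
        aligned += int(scores[instructed] > scores[competitor])
    return aligned / len(samples)

# Plotting ldar_at_layer(samples, L) for every layer L should hover near chance
# in shallow layers and climb above 95% in deep layers if arbitration finalizes there.
```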
Step 4. Test which routes carry the deciding clues. 🍞 Hook: Imagine closing certain hallways in a school to see if students can still reach the principal’s office. 🥬 Causal Attention Knockout: A method to block specific attention paths and see how the model’s decision changes.
- How it works:
- Pick a source set (e.g., vision tokens) and a destination set (e.g., instruction tokens).
- Set those attention connections to “no entry” across a window of layers.
- Rerun the model and see how answers shift.
- Why it matters: Blocking shows which paths are necessary for correct following. 🍞 Example: Blocking “vision → instruction” greatly harms vision-following, proving that route is critical.
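One simple way to realize a knockout is an additive attention mask that sends the blocked source→destination scores to negative infinity before the softmax. The sketch below is a generic illustration; where exactly the mask is injected depends on the model implementation:

```python
import torch

def knockout_mask(seq_len, src_positions, dst_positions):
    """Additive attention mask that forbids destination tokens from attending
    to source tokens: those scores become -inf, so softmax gives them weight 0."""
    mask = torch.zeros(seq_len, seq_len)
    for d in dst_positions:
        for s in src_positions:
            mask[d, s] = float("-inf")
    return mask  # add to raw attention scores inside the chosen window of layers

# Example: block the vision -> instruction route from Step 4, assuming (hypothetically)
# that vision tokens sit at positions 0-15 and instruction tokens at positions 20-25.
mask = knockout_mask(seq_len=32, src_positions=range(0, 16), dst_positions=range(20, 26))
```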
Step 5. Quantify how the choice shifts when we block. 🍞 Hook: Like scoring how much your opinion changes after missing a key piece of evidence. 🥬 INSSD (Normalized Signed Structural Divergence): A metric that measures both how much and in which direction the decision shifts in the 2-option space (follow image vs. follow text).
- How it works:
- Focus on probabilities over just two choices: the instruction-compliant option and its competitor.
- Compute how the distribution changes after a knockout.
- Keep the sign: positive means the blocked path helped following; negative means it hurt.
- Why it matters: It’s a clean, sensitive gauge of the causal impact of a path. 🍞 Example: Blocking “text → instruction” raises INSSD in text-following tasks, showing that path was key.
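The paper's exact INSSD formula isn't reproduced here; the sketch below is one plausible reading of the verbal description: restrict the model's probabilities to the two competing answers, renormalize, and report a signed shift:

```python
def inssd(p_clean, p_knockout):
    """Signed shift in the two-option decision space.
    Each argument is (p_compliant, p_competitor): the model's probabilities for
    the instruction-compliant answer and its competitor, before and after a knockout."""
    def renorm(p):
        total = p[0] + p[1]
        return (p[0] / total, p[1] / total)
    clean, knocked = renorm(p_clean), renorm(p_knockout)
    magnitude = abs(clean[0] - knocked[0])          # how far the 2-way choice moved
    sign = 1.0 if knocked[0] < clean[0] else -1.0   # positive: blocked path was helping compliance
    return sign * magnitude

print(inssd((0.8, 0.2), (0.3, 0.7)))  # 0.5: blocking this path strongly hurt correct following
```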
Step 6. Separate the roles of Attention vs. MLP. 🍞 Hook: Think of a team where one member chooses the plan (attention) and another writes it nicely (MLP) but sometimes prefers old templates. 🥬 Attention vs. MLP Attribution: A way to split how much each sublayer increases the winning-modality signal and the margin between winner and loser.
- How it works:
- Measure changes in signal strength after attention alone, then after MLP.
- Track the arbitration margin (winner minus loser) layer by layer.
- See which sublayer grows the margin.
- Why it matters: Shows that deep attention does the deciding, while MLPs sometimes resist (semantic inertia). 🍞 Example: Deep attention increases the “follow image” margin; deep MLPs often shrink it unless overruled.
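A sketch of the attention-vs-MLP split, reusing a logit-lens style projection (Step 2) to read the arbitration margin before the attention sublayer, after it, and after the MLP. The function signatures are illustrative assumptions:

```python
def margin(logits, winner_ids, loser_ids):
    """Arbitration margin: best score among the instructed modality's answer
    tokens minus the best score among the competing modality's."""
    return max(logits[i] for i in winner_ids) - max(logits[i] for i in loser_ids)

def sublayer_attribution(h_pre, h_post_attn, h_post_mlp, to_logits, winner_ids, loser_ids):
    """Split one layer's effect on the margin into its attention and MLP parts.
    to_logits is a logit-lens style projection (as in Step 2); h_* are hidden
    states at an instruction token before attention, after attention, and
    after the MLP of the same layer."""
    m_pre = margin(to_logits(h_pre), winner_ids, loser_ids)
    m_attn = margin(to_logits(h_post_attn), winner_ids, loser_ids)
    m_mlp = margin(to_logits(h_post_mlp), winner_ids, loser_ids)
    # A positive attention term alongside a negative MLP term is the
    # "deep attention decides, MLP shows semantic inertia" pattern.
    return {"attention": m_attn - m_pre, "mlp": m_mlp - m_attn}
```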
Step 7. Find the tiny set of heads that matter most. 🍞 Hook: Like discovering that a few student leaders drive most group decisions. 🥬 Arbitration Heads (Specialized Attention Heads): A sparse subset of attention heads that largely control which modality wins.
- How it works:
- Score each head’s contribution to the arbitration margin.
- Rank heads and identify the top contributors.
- Note overlap across tasks (shared hubs) vs. modality-specific heads.
- Why it matters: Targeting a small set of heads can steer behavior reliably and efficiently. 🍞 Example: Turning off just ~5% top heads cuts correct following by ~60%; boosting them rescues failures.
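A sketch of the ranking step, assuming per-head margin contributions have already been measured at the instruction tokens (the dictionary layout is a hypothetical convenience, not the paper's exact attribution procedure):

```python
def rank_heads(head_margin_gain, top_g=40):
    """head_margin_gain: dict mapping (layer, head) -> average gain in the
    arbitration margin attributed to that head at the instruction tokens.
    Returns the top-G heads, e.g. the roughly 5% used in Step 8."""
    ranked = sorted(head_margin_gain.items(), key=lambda kv: kv[1], reverse=True)
    return [head for head, _ in ranked[:top_g]]

def shared_hubs(top_vision_following, top_text_following):
    """Heads that rank highly for both tasks are the shared arbitration hubs;
    the rest of each list is modality-specific."""
    return set(top_vision_following) & set(top_text_following)
```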
Step 8. Intervene to prove cause and effect. 🍞 Hook: Like quieting a loudspeaker or turning up its volume to see if the class follows the right rule. 🥬 Targeted Attention Block and Amplification: Directly zero out or scale the outputs of chosen heads at instruction tokens.
- How it works:
- Pick top-G heads by contribution.
- For necessity: block them on successful cases and watch performance crash.
- For sufficiency: amplify them on failed cases and watch performance recover.
- Why it matters: This shows these heads aren’t just correlated—they causally drive arbitration. 🍞 Example: Blocking top 40 heads (~5%) drops following by ~60%; amplifying them boosts success by ~60%.
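A sketch of the intervention itself: scale the selected heads' outputs at the instruction-token positions, using 0.0 for the necessity test and a value above 1.0 for the sufficiency test. Capturing per-head outputs requires a model-specific forward hook; the tensor layout below is an assumption:

```python
def intervene_on_heads(head_outputs, target_heads, instr_positions, scale):
    """Scale (or zero) the chosen heads' outputs at the instruction tokens.
    head_outputs: tensor of shape (num_heads, seq_len, head_dim) for one layer,
    captured before the attention output projection.
    scale=0.0 reproduces the necessity test (blocking); scale>1.0 the
    sufficiency test (amplification)."""
    patched = head_outputs.clone()
    for h in target_heads:
        for p in instr_positions:
            patched[h, p, :] = patched[h, p, :] * scale
    return patched

# Blocking the top ~40 heads on successful cases should collapse following;
# amplifying the same heads on failed cases should largely restore it.
```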
Secret sauce: Centralizing clues at instruction tokens and probing with causal blocks plus INSSD/LDAR reveals not only where decisions happen but which tiny components truly control them. That combination turns a mystery into a map.
04 Experiments & Results
🍞 Hook: Picture a fair science test: you change one thing at a time, watch what happens, and keep score with clear numbers.
🥬 The Test: The authors created a dataset where images and texts disagree, and the instruction states which to follow. They measured:
- Modality Following Ratio (MFR): How often the model follows the instructed source.
- LDAR: How well the instruction tokens’ hidden decision matches the final output across layers.
- INSSD: How much the choice shifts when blocking certain attention paths.
- Why it matters: These scores turn fuzzy behavior into clear evidence about where and how decisions happen.
🍞 Example: When told to follow the image, does the model still say “two” even when a text says “three,” and do we see this decision inside the instruction tokens?
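For reference, a minimal sketch of how MFR could be scored with the AED, assuming simple substring matching between the generated answer and the accepted surface forms (a hypothetical scoring rule, not necessarily the paper's exact one):

```python
def modality_following_ratio(records):
    """MFR: fraction of cases where the generated answer contains an accepted
    surface form (AED entry) of the instructed modality's answer."""
    followed = 0
    for r in records:
        accepted = r["aed"][r["instructed_modality"]]
        followed += int(any(form.lower() in r["answer"].lower() for form in accepted))
    return followed / len(records)
```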
The Competition: They tested multiple strong MLLMs (e.g., Qwen2.5-VL-7B, InternVL3-8B, LLaVA-1.5-7B) to see if patterns hold across architectures.
Scoreboard with context:
- Instruction anchors as the decision site: LDAR rises from chance (~50%) in shallow layers to above 95% in deep layers. That’s like going from a coin flip to a consistent A+ alignment between hidden and final decisions.
- Path dependence proven: Blocking the path from instruction tokens to the final token causes a near-total collapse in MFR, while blocking direct context→final paths barely matters. Translation: the final answer mostly inherits from the instruction anchor, not from directly peeking at image or text tokens at the last minute.
- Cross-modal relay: Blocking vision→instruction or text→instruction paths causes big positive INSSD (a strong shift against correct following), confirming that both modalities must feed into instruction anchors first.
- Attention vs. MLP roles: Deep attention raises the arbitration margin (winner minus loser), proving it’s the decider. Deep MLPs often reduce that margin (semantic inertia), meaning they can pull toward old habits unless attention asserts the instruction.
- Sparse control: Turning off just ~5% of heads (about 40 in their setup) leads to ~60% absolute drop in MFR on successful cases—like going from an A to a failing grade by silencing a tiny choir of crucial voices. Amplifying the same small set on failed cases recovers ~60% MFR—like rescuing a game by turning up the right players.
- Shared hubs: Some top heads help both vision-following and text-following, acting as general arbitration hubs, but many are modality-specific, showing a neat mix of shared logic and specialized skills.
Surprising findings:
- Generation tokens don’t mainly pull directly from image patches at decision time; instead, they inherit decisions formed at instruction tokens. That’s a shift from the intuitive picture of “the last token looks around and decides.”
- MLPs are helpful translators yet often adversarial during arbitration. This “semantic inertia” means training and architecture should help deep attention overrule unhelpful priors when instructions demand it.
- Amplifying only shared hubs isn’t enough; they need to work together with modality-specific heads, suggesting arbitration is a small team sport, not a solo act.
Bottom line: Across models and metrics, the evidence converges—instruction tokens are the decision anchor, deep attention arbitrates, MLPs can resist, and a tiny set of heads largely controls the outcome.
05 Discussion & Limitations
🍞 Hook: Think of a great map that still leaves some streets unexplored.
🥬 Limitations:
- Neuron-level granularity: This study stops at attention heads, not individual neurons. Going deeper could reveal even more precise circuits for arbitration.
- Architecture scope: Results span several MLLMs, but variations across architectures or training recipes might tune where and how anchors emerge.
- Dependency on instructions: The mechanism leans on clear instruction tokens; tasks without explicit instructions may use different routes.
- Dataset design: The controlled conflicts are powerful for analysis, but real-world noise and subtler disagreements may be messier.
Required resources:
- Access to model internals (attention maps, hidden states) and the ability to intervene (masking/overriding heads).
- Compute for running many interventions and measurements (INSSD, LDAR, attribution).
- Careful dataset curation with bilingual AED synonyms for robust probing.
When not to use:
- Pure perception tasks with no textual instruction (e.g., image-only classification) where no instruction anchor exists.
- Streaming or ultra-low-latency settings where repeated causal interventions are impractical.
- Situations where you cannot access or modify attention internals (closed APIs with no hooks).
Open questions:
- Can training explicitly strengthen arbitration heads and reduce semantic inertia safely?
- Can we design architectures that route cross-modal info through lightweight “instruction routers” for speed and control?
- Can instruction tokens serve as long-range memory caches for stable multi-step reasoning?
- How to detect and fix failures automatically at runtime (e.g., amplify the right heads only when LDAR suggests confusion)?
- What are the neuron-level features inside these heads, and can SAEs or concept editors make them even more interpretable?
🍞 Anchor Example: Just like a city planner refines roads after seeing traffic jams, future work can widen the “instruction avenues,” add “traffic lights” (controls), and inspect each “car” (neuron) to keep the system safe and smooth.
06 Conclusion & Future Work
🍞 Hook: Imagine a team meeting where everyone brings clues to the leader’s desk, the leader decides, and the spokesperson announces the result.
🥬 Three-sentence summary: This paper shows that instruction tokens are the leader’s desk (anchors) where image and text clues are gathered and where the final choice of which modality to follow is made. Shallow attention moves clues into the anchors, deep attention makes the decision, and MLPs can resist with semantic inertia that deep attention must overcome. A small, specialized set of attention heads largely runs this process, which we can disrupt or revive by blocking or amplifying them.
Main achievement: Turning a black box into a causal map—pinpointing instruction anchors and the sparse attention heads that control modality arbitration, with clear, testable interventions (knockouts and amplification) and precise metrics (INSSD, LDAR).
Future directions:
- Architectures that explicitly route through instruction anchors to save compute and improve control.
- Training methods that reinforce arbitration heads and reduce harmful inertia without hurting general language ability.
- Runtime monitors that detect confusion (low LDAR) and nudge the right heads to restore compliance.
- Neuron-level circuit discovery for even finer-grained, safer steering.
Why remember this: It explains not just that models can follow instructions across modalities, but how they decide which source to trust—and it gives practical knobs to make that decision safer, sharper, and more reliable in the real world.
Practical Applications
- Build safety filters that watch LDAR at instruction tokens and intervene when arbitration looks uncertain.
- Create lightweight plugins that amplify key arbitration heads only when the model starts to follow the wrong modality.
- Design faster architectures that route cross-modal info through instruction anchors, reducing dense cross-attention costs.
- Improve training by adding objectives that strengthen deep attention’s ability to overcome semantic inertia.
- Automate debugging dashboards that visualize INSSD for different paths to reveal fragile routes in a given model.
- Deploy runtime guards that block or dampen attention from misleading sources when instructions specify another modality.
- Use instruction tokens as a memory cache in multi-step multimodal reasoning to stabilize long chains of thought.
- Perform targeted fine-tuning that enhances modality-shared hubs for general arbitration while preserving modality-specific heads.
- Harden systems against prompt or caption attacks by monitoring and constraining text→instruction influence when vision should lead.
- Develop evaluation suites that inject controlled conflicts and report MFR, LDAR, and INSSD to certify reliability.