
Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models

Intermediate
Hyunjong Ok, Jaeho Lee · 1/20/2026
arXiv · PDF

Key Summary

  • Putting the reading passage before the question and answer choices (CQO) makes language models much more accurate than putting it after (QOC), by about 15 percentage points on average.
  • The reason is causal attention: in decoder-only models, tokens can only look backward, so options placed before the passage cannot use the passage to judge which option fits.
  • This isn’t because the model forgot the options or hasn’t seen this format in training; both ideas were tested and ruled out.
  • Encoder and encoder-decoder models, which can look both ways in the text, do not suffer from this order problem, confirming the attention mask is the culprit.
  • Removing the passage in QOC barely changes accuracy, showing the model was ignoring the passage anyway when options come first.
  • Longer passages make the problem worse, and correct answers placed later (like option D) get hurt less because they are closer to the passage.
  • Three targeted interventions back this up: (1) repeat options after the passage, (2) patch option activations from a good order into a bad one, and (3) as a diagnostic, deliberately block option→passage attention in CQO and watch accuracy collapse.
  • This paper gives a clear, testable mechanism for prompt-order sensitivity and practical tips to write better prompts or pick better architectures.
  • The findings matter for exams, customer support, and any app where a model must pick the best answer using a long passage.

Why This Research Matters

Many real systems ask models to pick the best answer from a long passage, like tutoring apps, customer support bots, and medical triage helpers. If we put the parts in the wrong order, decoder-only models can ignore the very evidence we want them to use. Knowing that causal attention blocks options from seeing later text lets us fix prompts (use CQO or repeat options) or select architectures that avoid the issue. This improves reliability without needing bigger models or costly retraining. It also guides benchmark design so we evaluate models fairly and in ways that reflect real-world use.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how taking a test goes better when you read the passage before looking at the answer choices? If you peek at the choices first, they can feel confusing until you see the story.

🥬 Filling (The Actual Concept) — Multiple-Choice Question Answering (MCQA)

  • What it is: MCQA is when a model picks A, B, C, or D to answer a question using a given passage.
  • How it works:
    1. Read a passage (context),
    2. Read a question about it,
    3. Compare each option to the passage to find the best match.
  • Why it matters: If the model can’t compare options to the passage, it guesses from patterns instead of evidence.

🍞 Bottom Bread (Anchor): Like a reading test at school: a short story, then a question, then four choices. You must check the story to know the answer.

The World Before: Large language models (LLMs) already showed that tiny prompt changes (like phrasing or layout) can swing performance up or down. People noticed models are sensitive to formatting, the order of examples, and even punctuation. Still, most reports said “it changes a lot” without pinning down a clear “why.” In multiple-choice reading, a prompt has three parts: Context (C), Question (Q), and Options (O). Intuitively, the order shouldn’t matter, because the words are the same. But surprisingly, it does: putting the context first (CQO) is far better than putting it last (QOC).

The Problem: Across many models and datasets, CQO beats QOC by over 14 percentage points. That’s huge—like turning a B- into an A. The mystery is: why would the order, not the content, cause such a big difference? Without an explanation, we can’t trust that models are actually using the passage.

Failed Attempts (two guesses that didn’t explain it):

  1. Training bias guess: Maybe models just saw more CQO during training, so QOC feels unfamiliar. Tests with instruction-tuned vs. base models and with extra few-shot QOC examples didn’t shrink the gap much.
  2. Memory guess: Maybe the model forgets the options after reading a long passage (“lost in the middle”). But when asked to repeat the options, models recalled them just fine—sometimes even better in QOC—so memory loss wasn’t the main issue.

🍞 Top Bread (Hook): Imagine reading a book and only being allowed to look at earlier pages while writing a summary—you can’t peek ahead even if the clues you need are on the next page.

🥬 Filling — Attention Mechanism

  • What it is: Attention is the model’s way of deciding which words to focus on when processing text.
  • How it works:
    1. Look at all tokens (words/pieces),
    2. Score how helpful each token is,
    3. Mix information from high-score tokens more strongly,
    4. Use that mix to make the next decision.
  • Why it matters: Without attention, the model would treat every word equally and miss the important bits.

🍞 Bottom Bread (Anchor): When asked “What’s the capital of France?”, attention spotlights “capital” and “France,” helping the model say “Paris.”

🍞 Top Bread (Hook): Now add a rule: you can only look backward, never forward—no peeking at words that come later.

🥬 Filling — Causal Attention

  • What it is: Causal attention is attention with a time rule: each token can only attend to earlier tokens.
  • How it works:
    1. Read token by token,
    2. At each step, attend only to tokens already seen,
    3. Build representations that never use “future” text,
    4. Generate the next token from that past-only view.
  • Why it matters: If key evidence appears later, earlier tokens cannot use it, creating a wall between options and the passage.

🍞 Bottom Bread (Anchor): If options come before the passage, the options can’t “see” the passage because it’s in their future.
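To make the “only look backward” rule concrete, here is a minimal toy sketch of scaled dot-product attention with a lower-triangular causal mask. It is not any particular model’s implementation; the tensor sizes and the attention function are made up for illustration. Masked positions get a score of negative infinity, so after the softmax a token mixes in nothing from later tokens.

```python
# Minimal sketch of causal (lower-triangular) masking in self-attention.
# Toy example; not the implementation of any specific model.
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=True):
    """Single-head scaled dot-product attention over a (seq_len, dim) sequence."""
    scores = q @ k.T / (q.shape[-1] ** 0.5)                   # (seq, seq) raw affinities
    if causal:
        seq = scores.shape[0]
        keep = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
        scores = scores.masked_fill(~keep, float("-inf"))     # block attention to future tokens
    return F.softmax(scores, dim=-1) @ v

# In a QOC prompt the option tokens occupy the earliest rows, so every
# later (passage) column in their rows is masked out: the passage can
# never flow into the option representations.
torch.manual_seed(0)
x = torch.randn(6, 8)           # 6 toy tokens, 8-dim embeddings
out = attention(x, x, x)
print(out.shape)                # torch.Size([6, 8])
```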

The Gap This Paper Fills: The authors show the big CQO > QOC gap comes from this causal attention rule in decoder-only models. In QOC, options are computed before the passage arrives, so options can’t use the passage to judge correctness. That’s an information bottleneck: the evidence can’t reach the place where decisions are formed.

Real Stakes: This matters for classrooms, tests, customer support, medical triage, and any tool where the model must pick the best answer from a long passage. If we place parts in the wrong order, models might guess based on option wording instead of actually checking the evidence. The paper turns a fuzzy worry into a clear mechanism and gives simple fixes: prefer CQO, or if you must use QOC, repeat the options after the passage or choose an architecture that can look both ways.

02 Core Idea

🍞 Top Bread (Hook): Imagine you’re judging four dessert choices, but you’re forced to decide before you can read the menu. Later, you can read the menu—but you can’t go back and change how you thought about each dessert earlier.

🥬 Filling (The Actual Concept)

  • What it is: The key insight is that in decoder-only models with causal attention, options placed before the passage cannot use the passage to form their representations, so they’re chosen using option wording and priors rather than the actual evidence.
  • How it works:
    1. In QOC order (Question → Options → Context), options are processed before the context.
    2. Causal attention blocks tokens from attending to future tokens, so options can’t “see” the context.
    3. By the time the model reaches the final answer token, it can read the context, but it can’t rewrite the earlier, context-blind option representations.
    4. The model must compare the passage to frozen, context-blind options, leading to worse choices.
  • Why it matters: Without context-conditioned options, the model underuses evidence and accuracy drops sharply.

🍞 Bottom Bread (Anchor): It’s like picking an outfit before checking the weather. Even if you see the forecast later, you can’t change the clothes you already packed.

Three Analogies (same idea, fresh views):

  1. Check-out line analogy: In a one-way line, once the options pass the “scanner,” you can’t add info from the passage that comes after; in CQO, the passage scans first, so options can be checked against it.
  2. Flashlight-in-time analogy: The model’s flashlight only shines backward in time. In QOC, options are in the dark about the future passage; in CQO, options stand after the passage and are fully lit by it.
  3. Recipe analogy: If you mix ingredients (options) before reading the recipe (context), you’ll guess the dish. If you read the recipe first, you’ll choose ingredients that fit.

Before vs. After:

  • Before: People knew order mattered but didn’t have a solid, mechanistic reason why CQO beats QOC.
  • After: We know the cause is the causal mask in decoder-only transformers blocking the option→context pathway in QOC. Encoder or encoder-decoder models (which have bidirectional attention in the encoder) don’t show this gap.

🍞 Top Bread (Hook): Picture two readers. One can look both forward and backward in a story; the other can only look back.

🥬 Filling — Encoder-Decoder Architecture

  • What it is: An encoder-decoder model reads the whole prompt with a bidirectional encoder and then lets a decoder generate the answer.
  • How it works:
    1. Encoder reads all tokens and lets every token attend to every other (bidirectional),
    2. Builds context-aware representations for question and options,
    3. Decoder uses those rich representations to output the answer.
  • Why it matters: Because tokens can look both ways, options always get to use the passage—no matter the order.

🍞 Bottom Bread (Anchor): Like a translator who reads the full page (not just earlier lines) before writing the translation.

Why It Works (intuition without math): In transformers, each token builds an internal summary of what it knows so far. With causal masking, an option’s summary can never use future text. In QOC, all options get “baked” without the passage. Even though the final answer token can look at the passage, it must rely on these already-baked, context-blind option summaries—like judging a book by its table of contents you wrote before reading the book. This is the information bottleneck: the direct path from context to options is cut.

🍞 Top Bread (Hook): Imagine a narrow doorway between the evidence room and the decision room.

🥬 Filling — Information Bottleneck (as used here)

  • What it is: An information bottleneck is a choke point that blocks important information from reaching where it’s needed.
  • How it works:
    1. A rule (causal mask) blocks future-to-past flow,
    2. In QOC, that rule blocks context→option flow,
    3. The model’s later steps can’t retroactively enrich options with evidence,
    4. Decisions rely on partial information.
  • Why it matters: When the doorway is too narrow (or closed), accuracy suffers—even if the evidence exists in the prompt.

🍞 Bottom Bread (Anchor): If a detective can’t bring clues into the suspect lineup room, the lineup will be judged by looks, not evidence.

Building Blocks (what changes with this idea):

  • CQO order gives options a full view of the passage; QOC denies it.
  • Decoder-only models are sensitive; encoder or encoder-decoder models are robust.
  • Simple fixes—like repeating options after the passage—restore the path from evidence to options without changing the model internals.

03 Methodology

At a high level: Prompt → Tokenize → Model processes with attention → Score options A/B/C/D → Pick highest → Measure accuracy in two orders (CQO vs. QOC) → Run tests to find the cause → Try fixes that target the suspected cause.

Step 1: Build two prompt orders

  • What happens: Create the exact same content in two layouts: CQO (Context → Question → Options) and QOC (Question → Options → Context).
  • Why this step exists: Holding content constant isolates order as the only difference—so any change in accuracy must come from order, not wording.
  • Example: Using the paper’s style, the LogiQA item is printed as either Passage first then Q&A, or Question and Options first then Passage.
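The sketch below shows one way such paired prompts could be assembled. The template wording, the build_prompt helper, and the toy item are illustrative assumptions, not the paper’s exact format or the LogiQA item it uses; the point is simply that both layouts contain exactly the same words.

```python
# Minimal sketch: the same MCQA content rendered in two orders (CQO vs. QOC).
# Template wording and the toy item are illustrative, not the paper's prompt.

def render_options(options):
    return "\n".join(f"{label}. {text}" for label, text in zip("ABCD", options))

def build_prompt(context, question, options, order="CQO"):
    parts = {
        "C": f"Passage:\n{context}",
        "Q": f"Question: {question}",
        "O": f"Options:\n{render_options(options)}",
    }
    return "\n\n".join(parts[p] for p in order) + "\n\nAnswer:"

item = {
    "context": "The museum opens at nine and closes at five on weekdays.",
    "question": "When does the museum open on weekdays?",
    "options": ["At five", "At nine", "At noon", "It never opens"],
}

cqo = build_prompt(order="CQO", **item)
qoc = build_prompt(order="QOC", **item)
assert sorted(cqo.split()) == sorted(qoc.split())   # identical content, different order
print(qoc)
```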

🍞 Top Bread (Hook): Think of racing two identical toy cars on the same track but starting them from different positions.

🥬 Filling — Attention Mechanism (recap for use in steps)

  • What it is: The model’s way to focus on the most useful tokens when forming each token’s internal summary.
  • How it works: Each token looks at allowed neighbors, assigns importance scores, and mixes information accordingly.
  • Why it matters: The allowed-neighbor rule (mask) determines which information can flow.

🍞 Bottom Bread (Anchor): If a toy car can only drive backward, where you place it on the track changes what it can reach.

Step 2: Compare architectures to test the mask hypothesis

  • What happens: Evaluate three model types on both orders: decoder-only (causal attention), encoder-decoder (bidirectional encoder), and encoder-only (bidirectional).
  • Why this step exists: If the mask is the culprit, only decoder-only models should show a big gap; bidirectional ones shouldn’t.
  • Example: Decoder-only models show a gap of about 14.7 percentage points; encoder-decoder models about 2.3; encoder-only models about 0.02 (basically none).

Step 3: Context removal test

  • What happens: Compare QOC to QO (Question + Options, no Context). If context is effectively invisible in QOC, accuracy should be similar.
  • Why this step exists: To see if the model was using the context at all when options came first.
  • Example: QOC ≈ QO, confirming the passage wasn’t helping in QOC.

Step 4: Memory (option-recall) test

  • What happens: After the prompt, ask the model to repeat each option exactly.
  • Why this step exists: To check if the drop is due to forgetting options in the middle of the prompt.
  • Example: Recall is high in both orders (over 93%), sometimes even higher in QOC—so memory isn’t the problem.

Step 5: Attention and gradient attribution analyses

  • What happens: Inspect where attention flows across layers and use Gradient×Input to measure how much the context contributes to the final choice.
  • Why this step exists: To verify that in QOC, the path from context to options is blocked and the model relies more on option text itself.
  • Example: Context attribution is much higher in CQO (about 0.80) than QOC (about 0.34). In QOC, attention to options grows with depth (relying on options), while in CQO it declines as the model integrates context.
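Here is a hedged sketch of how a Gradient×Input attribution could be computed for a decoder-only model with Hugging Face transformers. The gpt2 checkpoint is a small stand-in for the paper’s models, and the toy prompt and chosen answer token are made up; the idea is to backpropagate the candidate answer’s logit to the input embeddings and read off how much each prompt token contributed.

```python
# Sketch of Gradient x Input attribution for a causal LM (illustrative setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # stand-in; the paper analyzes larger decoder-only models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = ("Question: When does the museum open?\n"
          "Options:\nA. At five\nB. At nine\n"
          "Passage: The museum opens at nine.\n"
          "Answer:")
enc = tok(prompt, return_tensors="pt")

# Embed the tokens ourselves so we can take gradients w.r.t. the inputs.
embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])

answer_id = tok(" B", add_special_tokens=False)["input_ids"][0]
out.logits[0, -1, answer_id].backward()      # gradient of the candidate answer's logit

# Per-token attribution: |sum over embedding dims of grad * input|
attr = (embeds.grad * embeds).sum(-1).squeeze(0).abs()
for token, a in zip(tok.convert_ids_to_tokens(enc["input_ids"][0]), attr.tolist()):
    print(f"{token:>12s}  {a:.4f}")
# The paper aggregates such per-token scores over the context span vs. the
# option span to compare how much each part drives the final choice.
```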

Modulating factors

  • What happens: Measure how the gap changes with passage length and the correct answer’s position (A/B/C/D).
  • Why this step exists: To see when the bottleneck hurts most.
  • Example: Longer passages cause bigger gaps; answers later in the list (like D) are closer to the context in QOC and are hurt less.

🍞 Top Bread (Hook): Suppose your school blocks hallway A during a fire drill. You can test what happens by closing A on purpose, or by letting students re-enter from another door.

🥬 Filling — The Secret Sauce: Three targeted interventions that poke at the suspected pathway

  1. Attention pruning (diagnostic, degrades CQO)
  • What it is: Manually block option-to-context attention in CQO to mimic QOC’s constraint (a toy sketch of this mask appears after this list).
  • How it works: Set the attention mask so option tokens can’t attend to context tokens, leaving everything else unchanged.
  • Why it matters: If CQO now collapses, it proves the option→context pathway is necessary for high accuracy.
  • Example: CQO drops by ~26.8 points when pruned.
  2. Activation patching (improves QOC)
  • What it is: Copy the internal hidden states of option tokens from a CQO run (where options can see context) into a QOC run.
  • How it works: Align option tokens by exact string match and swap their mid-to-late layer representations.
  • Why it matters: If QOC improves, then context-conditioned option states are sufficient to boost accuracy.
  • Example: QOC rises by ~6 points on average.
  3. Option repetition: QOCO prompt (improves QOC)
  • What it is: Repeat the options after the context so the repeated options do attend to the context under the causal mask.
  • How it works: Keep the original QOC, then list options again at the end (Q → O → C → O).
  • Why it matters: It fixes the blocked path using only prompt engineering.
  • Example: QOC improves by ~8.2 points.

🍞 Bottom Bread (Anchor): Like closing a hallway and watching traffic slow (pruning), then letting kids re-enter from a different door (repetition) to see traffic pick up again.

🍞 Top Bread (Hook): Imagine you teach a new trick by showing a few examples.

🥬 Filling — In-Context Learning (ICL)

  • What it is: Giving examples inside the prompt so the model picks up the format and pattern on the fly.
  • How it works:
    1. Add 1–5 solved examples before the test item,
    2. The model notices the pattern,
    3. It tries to apply the same steps to the new question.
  • Why it matters: If QOC were just unfamiliar formatting, ICL should fix it. But here, ICL adds only ~3.1 points—far from enough—because a structural mask, not unfamiliarity, is the problem.

🍞 Bottom Bread (Anchor): Showing a student a few QOC examples doesn’t help much if the student isn’t allowed to look at the passage when judging the options.

Scoring and fairness details (kept simple)

  • Decoder-only: Score the next-token probabilities for A/B/C/D directly.
  • Encoder-only: Use a [MASK] and score A/B/C/D fills.
  • Encoder-decoder: Feed the whole prompt to the encoder and score the decoder’s output.
  • This keeps comparisons fair and avoids messy free-form generations.
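A minimal sketch of the decoder-only scoring recipe follows, assuming a Hugging Face causal LM; gpt2 and the toy prompt are stand-ins for the paper’s setup, and the label tokenization (" A", " B", …) is an assumption that should be checked for each tokenizer.

```python
# Sketch of likelihood-based option scoring for a decoder-only model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # stand-in; swap in the model you actually evaluate
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def score_choices(prompt, labels=("A", "B", "C", "D")):
    """Return the label whose next-token probability is highest after the prompt."""
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]          # next-token logits
    label_ids = [tok(" " + l, add_special_tokens=False)["input_ids"][0] for l in labels]
    probs = torch.softmax(logits, dim=-1)[label_ids]
    return labels[int(probs.argmax())], probs.tolist()

prompt = ("Passage: The museum opens at nine on weekdays.\n"
          "Question: When does the museum open on weekdays?\n"
          "Options:\nA. At five\nB. At nine\nC. At noon\nD. Never\n"
          "Answer:")
print(score_choices(prompt))
```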

04 Experiments & Results

The Test: What did they measure and why?

  • They measured multiple-choice accuracy under two orders—CQO and QOC—across four reading datasets (LogiQA, RACE-M, RACE-H, SciQ) and many models. The key metric is the gap Δ = Acc(CQO) − Acc(QOC), which tells how much the order matters.

The Competition: Who/what was this compared against?

  • 21 decoder-only models (LLaMA 3, Qwen 2.5/3, Gemma 2; 0.5B–9B parameters, base and instruct).
  • Encoder-decoder models (Flan-T5 family) and encoder-only models (BERT, RoBERTa, ALBERT) for architecture comparisons.
  • In-context learning (1–5 examples) to test if familiarity helps.

The Scoreboard: Results with context

  • Big gap in decoder-only: On average, CQO ≈ 69.3% accuracy versus QOC ≈ 54.6% (Δ ≈ +14.7 percentage points). That’s like going from a B- to an A.
  • Little/no gap in bidirectional setups: encoder-only gap ≈ 0.02 percentage points (basically zero), encoder-decoder gap ≈ 2.3 points. This strongly points to the causal mask as the cause.
  • QOC ≈ QO: Removing the passage in QOC barely moves accuracy, showing the model wasn’t using the passage when options came first.
  • ICL helps only a little: Few-shot QOC improved by ~3.1 points on average, still far from CQO.
  • Memory ruled out: Option recall was very high in both orders (≈95%), sometimes even higher in QOC, showing the model didn’t forget the options.
  • Attention and gradients tell the same story: Context attribution was about 2.38× higher for CQO than QOC, and in QOC the model’s attention leaned more heavily on the options themselves as layers deepen.

Surprising Findings

  • Later answers are safer in QOC: If the correct answer is D, the gap is smaller because option D sits closer to the context at the end, so it’s less harmed by the mask.
  • Longer passages, bigger problem: As passages grow, more evidence is locked behind the mask when options come first, increasing the gap.
  • Patch the inside, fix the outside: Swapping in context-aware option states (from CQO) into QOC raised accuracy by ~6 points—clear evidence that the missing ingredient is context-conditioned options.

Targeted Interventions: Make numbers meaningful

  • Attention pruning (degrading CQO): When researchers intentionally blocked option→context attention in CQO, accuracy fell by ~26.8 points on average. This is like taking the brakes off a bike going downhill—it proves that pathway is doing important work.
  • Activation patching (improving QOC): Replacing QOC option hidden states with CQO versions bumped QOC by ~6 points, showing those enriched states are sufficient to help the model pick better answers (a hands-on sketch follows this list).
  • Option repetition (QOCO, improving QOC): Repeating options after the passage improved QOC by ~8.2 points, a prompt-only fix that restores the missing view.
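For readers who want to try the patching idea themselves, here is a hedged sketch using forward hooks on a small Hugging Face model. The gpt2 checkpoint, the patched layer, and the prompt wording are illustrative assumptions rather than the paper’s exact setup; tokenizing the Context/Question/Options segments separately keeps the option-token spans aligned between the two runs, mirroring the paper’s exact-string alignment.

```python
# Sketch of activation patching: cache option-token hidden states from a CQO run
# and overwrite the same option tokens' states in a QOC run at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"   # stand-in for the paper's decoder-only models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

context = "Passage: The museum opens at nine on weekdays.\n"
question = "Question: When does the museum open on weekdays?\n"
options = "Options:\nA. At five\nB. At nine\nC. At noon\nD. Never\n"
tail = "Answer:"

def encode(order):
    """Tokenize segments separately so each segment's token span is known."""
    segs = {"C": context, "Q": question, "O": options}
    ids, spans, cur = [], {}, 0
    for key in order:
        seg_ids = tok(segs[key], add_special_tokens=False)["input_ids"]
        spans[key] = (cur, cur + len(seg_ids))
        ids += seg_ids
        cur += len(seg_ids)
    ids += tok(tail, add_special_tokens=False)["input_ids"]
    return torch.tensor([ids]), spans

layer = 8   # a mid-to-late GPT-2 layer (illustrative choice)
cqo_ids, cqo_spans = encode("CQO")
qoc_ids, qoc_spans = encode("QOC")

cache = {}
def save_hook(module, inputs, output):
    s, e = cqo_spans["O"]
    cache["opt"] = output[0][:, s:e, :].detach().clone()   # context-aware option states

def patch_hook(module, inputs, output):
    hs = output[0].clone()
    s, e = qoc_spans["O"]
    hs[:, s:e, :] = cache["opt"]    # identical option strings, so spans align 1:1
    return (hs,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(save_hook)
with torch.no_grad():
    model(cqo_ids)                  # clean CQO run fills the cache
handle.remove()

handle = model.transformer.h[layer].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(qoc_ids).logits[0, -1]   # patched QOC run
handle.remove()

label_ids = [tok(" " + l, add_special_tokens=False)["input_ids"][0] for l in "ABCD"]
print("patched QOC answer:", "ABCD"[int(logits[label_ids].argmax())])
```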

Why the final token can’t rescue QOC: Even though the final answer token in QOC can attend to the whole prompt, the option representations it reads were formed earlier without the passage and cannot be retroactively updated. It’s like reading the clue card after you already sealed your guesses in envelopes—you can see the clue, but your guesses are already locked in.

Takeaway from the scoreboard: The pattern across datasets, model sizes, and families is consistent: decoder-only models are strongly order-sensitive due to causal masking, while bidirectional encoders are robust. The effect grows with context length and shrinks when the right answer sits later among the options, exactly as the bottleneck explanation predicts.

05 Discussion & Limitations

Limitations (honest look)

  • Task scope: The study focuses on four-option MCQA reading comprehension. While the mechanism (causal masking) is general, exact numbers may differ for other tasks or more/fewer options.
  • Model scale: Experiments go up to ~9B parameters. Larger models might partially mitigate or re-route information, though the architectural constraint still applies.
  • Intervention practicality: Activation patching and attention pruning are diagnostic tools; they’re not standard features in production systems without custom hooks.
  • Token alignment: Patching requires exact alignment of option tokens, which is straightforward here but can be tricky with different tokenizers or noisy text.
  • External validity: The fix sizes (e.g., +8.2 from repeating options) might vary with different instruction styles, domains, or languages.

Required Resources

  • Access to multiple model families (decoder-only, encoder-decoder, encoder-only) to compare architectures.
  • Ability to run attention/activation analyses (GPU time, libraries that expose attention/hidden states).
  • Datasets with long contexts to reveal the bottleneck clearly.
  • Prompting and evaluation harness to control templates and score A/B/C/D reliably.

When NOT to Use This Approach (or expect big gains)

  • Very short contexts: If the passage is tiny, the harm from QOC may be mild.
  • Open-ended generation without fixed options: The specific bottleneck (options formed before context) doesn’t apply the same way.
  • Settings where UI or workflow constraints force the context to follow the options and you can neither rearrange the prompt nor switch to an encoder(-decoder) architecture; in that case, expect smaller gains from these fixes.

Open Questions

  • Training-time fixes: Can we train decoder-only models with objectives or curricula that teach them to compare options to future context more robustly (e.g., special heads, additional auxiliary losses)?
  • Smarter attention: Could hybrid or “lookahead-lite” mechanisms (e.g., local bidirectional windows, two-pass decoders) preserve streaming benefits but avoid this bottleneck?
  • Prompt engineering limits: Beyond repeating options, are there formatting tricks that consistently restore context-to-option flow (e.g., inline evidence tags or per-option mini-context summaries)?
  • Scale and specialization: Do very large decoder-only models (50B+) or domain-specialized models learn workarounds that narrow the gap?
  • Evaluation norms: Should MCQA benchmarks standardize prompt order (CQO) or report both orders to fairly assess model reliability?

06 Conclusion & Future Work

Three-Sentence Summary: This paper shows that in decoder-only language models, putting the passage after the options (QOC) causes a big accuracy drop because causal attention prevents options from using the later passage. Encoder/encoder-decoder architectures, which allow bidirectional attention in the encoder, don’t suffer this problem, confirming the cause is architectural. Simple changes—like repeating options after the passage—or targeted interventions can recover much of the lost accuracy.

Main Achievement: The paper provides a mechanistic, experimentally validated explanation for prompt-order sensitivity in MCQA: the causal mask creates an information bottleneck that blocks evidence from reaching the options when they come first.

Future Directions: Explore training-time or architectural tweaks (local bidirectionality, two-pass decoding, or retrieval-aware heads) that allow safe context-to-option flow without sacrificing causal generation; develop standardized prompting recommendations and evaluation protocols that report both CQO and QOC; design UI patterns that naturally present the passage before options or repeat options afterward.

Why Remember This: Prompt order isn’t just a style choice; it can decide whether a model actually uses the evidence. Knowing the cause—causal attention’s time rule—lets practitioners choose better prompts (CQO), pick architectures that are robust (encoder(-decoder)), or apply simple fixes (repeat options). This turns a mysterious performance swing into clear, practical guidance for building more trustworthy systems.

Practical Applications

  • Write MCQA prompts as Context → Question → Options (CQO) by default to maximize accuracy in decoder-only models.
  • If forced to use QOC (e.g., UI constraints), repeat the options after the context (QOCO) so options can attend to the passage.
  • Prefer encoder(-decoder) architectures for reading-comprehension apps where prompt order cannot be guaranteed.
  • For long contexts, be extra careful: use CQO or QOCO since the gap worsens as passages grow.
  • When designing datasets or interfaces, place the answer options after the context so they can attend to the evidence, reducing order-induced errors.
  • Use activation patching during debugging to confirm whether poor performance comes from context-blind option states.
  • Adopt likelihood-based scoring of A/B/C/D to get reliable comparisons without parsing free-form generations.
  • Report results in both CQO and QOC to audit model robustness and avoid overestimating abilities.
  • Design retrieval pipelines to place retrieved evidence before the question and options to ensure it’s used.
  • Create prompt validators that automatically reorder or duplicate options after context when risky formats are detected.
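As a sketch of that last idea, the snippet below detects an options-before-passage layout and repairs it by repeating the option block after the context (QOCO). The "Passage:"/"Options:" markers and the regexes are assumptions about how prompts are labeled, not a standard; adapt them to your own templates.

```python
# Sketch of a prompt validator that turns a risky QOC layout into QOCO.
import re

OPTION_BLOCK = re.compile(r"Options:\n(?:[A-D]\. .+\n?)+", re.IGNORECASE)
PASSAGE_MARK = re.compile(r"Passage:", re.IGNORECASE)

def ensure_options_after_context(prompt: str) -> str:
    opt = OPTION_BLOCK.search(prompt)
    ctx = PASSAGE_MARK.search(prompt)
    if not opt or not ctx:
        return prompt                      # markers missing: nothing to validate
    if opt.start() > ctx.start():
        return prompt                      # options already follow the context
    body, answer = prompt.rstrip(), ""
    if body.endswith("Answer:"):
        body, answer = body[: -len("Answer:")].rstrip(), "\nAnswer:"
    # Risky QOC layout: repeat the option block after the context (Q -> O -> C -> O).
    return body + "\n\n" + opt.group(0).rstrip() + answer

qoc = ("Question: When does the museum open?\n"
       "Options:\nA. At five\nB. At nine\nC. At noon\nD. Never\n"
       "Passage: The museum opens at nine on weekdays.\n"
       "Answer:")
print(ensure_options_after_context(qoc))
```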
#causal attention #prompt order sensitivity #multiple-choice question answering #decoder-only transformers #encoder-decoder models #bidirectional attention #information bottleneck #activation patching #attention masking #in-context learning #long-context reasoning #option repetition #evaluation protocols #reading comprehension #LLM reliability
Version: 1