
Prompt Repetition Improves Non-Reasoning LLMs

Beginner
Yaniv Leviathan, Matan Kalman, Yossi Matias (12/17/2025)
arXiv

Key Summary

  • Repeating the entire prompt once (QUERY→QUERY+QUERY) helps many large language models answer better when you are not asking them to show their reasoning.
  • This trick works because, in causal transformers, tokens can only look left; repeating the prompt lets every token appear in a spot where it can see all the others.
  • Across 70 model–benchmark tests, prompt repetition won 47 times and lost 0, improving accuracy without increasing output length.
  • Measured latencies stayed about the same for most models because the extra work happens in the parallelizable prefill stage, not the slow, token-by-token generation stage.
  • The method keeps outputs in the same format, so it’s easy to drop into existing systems without breaking anything.
  • On custom list-following tasks (NameIndex, MiddleMatch), gains were especially big (in one case, from 21.33% to 97.33%).
  • When chain-of-thought reasoning is enabled, the effect is neutral to slightly positive, since the model already tends to repeat parts of the prompt on its own.
  • Variants like repeating three times sometimes help even more; padding with periods to match length does not help, proving it’s the repetition that matters.
  • Very long prompts may see higher latency on some models, or even hit context limits, so use guardrails.
  • This simple, no-format-change tweak can be a strong default for non-reasoning use cases.

Why This Research Matters

This technique makes everyday AI tools smarter without slowing them down or changing how they answer, which is rare and valuable. It is easy to deploy because you do not have to retrain models or redesign outputs; you just repeat the prompt. For students, tutors, and customer-service bots, better answers at the same speed mean a smoother experience. For companies, keeping latency and output length steady helps control costs and maintain service-level agreements. On list-heavy or ordering-sensitive tasks, the improvements can be dramatic, making tools more reliable. And because it mostly helps non-reasoning settings, it fills a gap where chain-of-thought is too costly or undesired.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: You know how rereading a homework question helps you catch the little details you missed the first time? Sometimes just seeing it twice clears things up.

🥬 The Concept (Large Language Models - LLMs):

  • What it is: An LLM is a computer program that reads and writes words by guessing the next word, very fast and very cleverly.
  • How it works: (1) It takes your words as input. (2) It turns them into numbers the model understands. (3) It uses attention to decide which earlier words matter most. (4) It predicts the next word, one step at a time. (5) It keeps going until it finishes the answer.
  • Why it matters: Without a smart way to read the question, the model can miss what is important and give a wrong or weak answer.

🍞 Anchor: When you ask “What’s the capital of France?”, the model focuses on the key words “capital” and “France” to say “Paris.”

🍞 Hook: Imagine reading a quiz where the choices come before the question. You’d see A, B, C, D first and only later learn what you’re supposed to pick. Confusing, right?

🥬 The Concept (Prompting):

  • What it is: Prompting is how we talk to a model—what we write and the order we write it in.
  • How it works: (1) You write instructions, facts, and the question. (2) The model reads in order, left to right. (3) It uses what it saw earlier to understand what comes later.
  • Why it matters: Change the order, and you can change how well the model understands the task.

🍞 Anchor: If you write “Options first, question later,” the model might not connect the dots as well as “Question first, options after.”

🍞 Hook: Picture a street you can only drive one way. If your friend is behind you, you can’t see them in your rearview mirror before you pass the corner.

🥬 The Concept (Causal Language Models):

  • What it is: Causal language models can only look left—to earlier words—when deciding what matters for the next step.
  • How it works: (1) The model reads the sequence from start to finish. (2) Each new position can attend to previous positions only. (3) Later words can use earlier context; earlier words cannot use later context.
  • Why it matters: If important details appear later, earlier parts cannot “see” them, and some connections are missed.

🍞 Anchor: If answer choices appear first and the question appears later, the choice tokens never get to “see” the question in that first pass.

🍞 Hook: Think of a team huddle where everyone hears the plan twice. The second time, each player knows how their role fits with everyone else’s.

🥬 The Concept (Attention):

  • What it is: Attention is how a model decides which earlier words are most important for each new word.
  • How it works: (1) For a position, compare it to all earlier positions. (2) Give higher weights to more relevant words. (3) Mix the information, focusing on high-weight parts. (4) Use the result to predict better.
  • Why it matters: Without attention, the model would treat all earlier words equally and get easily confused.

🍞 Anchor: When answering “What is 7×8?”, attention helps the model care more about the numbers than extra filler words.

The world before this paper: People already knew that prompt order affects results. Tricks like “Think step by step” (chain-of-thought) often boost accuracy but make answers much longer, which raises waiting time and compute cost. Some models trained with special reasoning even repeat parts of the prompt themselves before solving, which hints that repetition might be helpful.

The problem: Can we get better accuracy without paying extra in long answers or slower responses, especially when we’re not asking the model to show its work?

Failed attempts: (1) Just padding the prompt with dots to make it longer doesn’t help. (2) Repeating only the question, not the whole prompt, didn’t show gains in separate work. (3) Asking for step-by-step reasoning helps, but it increases output length and latency.

The gap: We needed a drop-in method that (a) improves accuracy in non-reasoning settings, (b) doesn’t inflate output length, (c) barely affects latency, and (d) keeps answer format unchanged so systems don’t break.

Real stakes: This matters for study helpers, customer support, quick lookups, and tools that must answer fast and cheaply. If we can get a better answer with the same time and cost, lots of everyday apps become both smarter and snappier.

02Core Idea

🍞 Hook: Imagine you photocopy a worksheet and then stack the two sheets together. Now any part on the second sheet can line up with any part from the first, making it easy to compare everything with everything.

🥬 The Concept (Prompt Repetition):

  • What it is: Prompt repetition means you send the entire prompt twice in a row: QUERY → QUERY + QUERY.
  • How it works: (1) You write your prompt normally. (2) You paste the same prompt immediately after it. (3) The model now sees two copies back-to-back. (4) Because the model can only look left, every token that appears in the second copy can attend to all tokens from the first copy, making full cross-connections possible.
  • Why it matters: Without repetition, some parts of the prompt can’t “see” later parts. With repetition, each part appears again, later, in a place where it can see everything before it.

🍞 Anchor: If choices come before the question, repeating the entire prompt lets the second-copy choices attend to the first-copy question, fixing the mismatch.

The “Aha!” in one sentence: By simply repeating the prompt, every piece of the question ends up in a position where it can attend to every other piece, improving understanding without making the answer longer.

Three analogies:

  1. Reading twice: When you read a tricky riddle two times, the second read lets you connect clues you missed the first time.
  2. Echo on stage: Saying the script again ensures every actor hears every line and can react properly the second time.
  3. Puzzle row: Laying the same puzzle pieces in two rows lets any piece in the second row line up with any in the first, making matching easy.

Before vs. After:

  • Before: Token order could trap some parts so they never see the important bits that appear later. Fixes like chain-of-thought helped but made outputs longer and slower.
  • After: With repetition, each part shows up again later, able to attend to everything; accuracy rises while answer length and measured latency stay about the same for most models.

Why it works (intuition, no math):

  • Causal models only let you look left. If the question is to the right of the options, the options can’t use the question. Duplicating the whole sequence means every element exists again in a later spot where it can look back at all the earlier tokens (the entire first copy). So the model can compute richer, more globally aware representations during the input pass.

Building blocks:

  • Duplicate the full prompt, not just the question. That ensures every token gets a later, look-left-friendly copy.
  • Keep the output format unchanged. Systems that parse answers (like “The answer is C.”) won’t break.
  • Lean on the prefill stage. The extra work is mostly in parallelizable input processing, not in the slow, step-by-step generation.
  • Variants exist. Verbose repetition (adding a phrase like “Let me repeat that:”) or repeating three times sometimes helps even more, especially on list-tracking tasks.
  • Controls matter. Padding with dots to the same length doesn’t help, proving it’s the repetition structure, not length alone, that boosts performance.
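
The building blocks above amount to a one-line string transform. A minimal Python sketch (the function name and the newline separator are our choices, not the paper's):

```python
def repeat_prompt(query: str, copies: int = 2, separator: str = "\n") -> str:
    """Return the full prompt repeated back-to-back: QUERY -> QUERY+QUERY.

    Duplicating the whole prompt (not just the question) gives every token
    a later copy that can attend to the entire first copy during prefill.
    """
    return separator.join([query] * copies)

# Example: the doubled prompt contains two identical copies; the requested
# output format ("The answer is <LETTER>.") is left unchanged.
doubled = repeat_prompt("Which is a mixture? ... Reply with: The answer is <LETTER>.")
```

Whether to join the copies with a newline, a space, or nothing at all is an implementation detail the summary does not pin down; any separator that keeps both copies intact should preserve the structure the method relies on.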

03Methodology

At a high level: Input → Repeat the entire prompt → Feed to the model (prefill builds attention across copies) → Generate the answer (same length as before).

🍞 Hook: Think of a relay race baton passed twice around the track so every runner gets a perfect view of the whole team plan before the final sprint.

🥬 The Concept (Prefill vs. Generation):

  • What it is: Prefill is the fast, parallel reading of your input; generation is the slow, one-token-at-a-time writing of the answer.
  • How it works: (1) Prefill: the model scans the whole prompt, building internal memories (keys/values). (2) Generation: it uses those memories to produce output tokens step by step.
  • Why it matters: Doubling the prompt affects prefill, which is parallel and often not the main bottleneck; it doesn’t make the output longer, so the slow generation stage isn’t stretched.

🍞 Anchor: It’s like laying out all ingredients (prefill) before you start cooking (generation). Adding one more glance over the ingredient list doesn’t lengthen the cooking time much.

Step-by-step recipe:

  1. Start with your normal prompt (QUERY). Example (multiple choice):
    • “Which is a mixture? A. oxygen and nitrogen in air B. sodium and chlorine in salt C. hydrogen and oxygen in water D. nitrogen and hydrogen in ammonia Reply with: The answer is <LETTER>.”
  2. Create QUERY+QUERY by pasting the exact same text after itself.
  3. Send that to the model with reasoning turned off (no “think step by step”).
  4. The model does prefill over the longer input. In the second copy, every token can attend to the first copy’s tokens, enabling full cross-connections.
  5. The model generates the answer. Length and format stay the same (for example, still “The answer is A.”).

Why each step exists:

  • Duplicating the entire prompt: If you only duplicate parts, some tokens might still never see others. Full duplication ensures universal visibility from the second copy.
  • Keeping the format identical: Many systems downstream expect a certain answer shape. Changing it could break tools.
  • Turning off reasoning for this test: The goal is to boost non-reasoning performance; otherwise, chain-of-thought can naturally add its own repetition.

Concrete examples:

  • Options-first vs. question-first: If your prompt is “Options… Question…,” then in the first pass, options can’t see the question. In the repeated version, the second-copy options can attend to the first-copy question. This often raises accuracy.
  • NameIndex task: A list of 50 names; ask for the 25th. Repetition helps the model form clearer, global pointers to the right position by letting every token be compared against the whole list in the second copy.
  • MiddleMatch task: A list with repeats; ask for the single name directly between two others. The second copy lets the model cross-check the pattern from multiple angles without increasing the final answer length.
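
The exact wording of the paper's NameIndex prompts is not reproduced here; the sketch below is an illustrative reconstruction of the task shape (50 names, ask for the 25th) with repetition applied:

```python
def make_nameindex_prompt(names, k):
    """Build a NameIndex-style prompt: list the names, then ask for the
    k-th one (1-based). Illustrative reconstruction; the paper's exact
    phrasing may differ."""
    listing = ", ".join(names)
    return (f"Here is a list of names: {listing}. "
            f"What is name number {k} in the list? Reply with just the name.")

names = [f"Name{i:02d}" for i in range(1, 51)]
query = make_nameindex_prompt(names, 25)
repeated = query + "\n" + query  # apply prompt repetition to the hard task
```

In the repeated version, every name token in the second copy can attend to the full list and the question from the first copy, which is the mechanism credited for the large jump on this task.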

Variants and ablations:

  • Prompt Repetition (Verbose): Add a short phrase like “Let me repeat that:” before the second copy. Often similar performance.
  • Prompt Repetition ×3: Repeat three times (with or without phrases). Sometimes outperforms twice-repeated prompts on list-tracking tasks.
  • Padding control: Replace the second copy with dots to match length. This does not help, showing that structure—not length—is the key.

Efficiency details:

  • Measured output lengths don’t increase; answers stay short.
  • Measured latencies are similar in non-reasoning mode for most models; for very long prompts on some models (e.g., certain Anthropic models), latency can rise due to extra prefill work.

Secret sauce:

  • The second copy provides a later position for every token, so each token’s information can attend to every other token’s information from the first copy. This cures the “I can only look left” limitation for within-prompt relationships, without changing how long the final answer is.

Safety rails and deployment tips:

  • Add a length guard: if the prompt is already very long, skip repetition to avoid hitting context limits or raising latency.
  • Keep the answer schema identical so downstream parsers keep working.
  • Use with non-reasoning endpoints by default; add A/B tests to confirm gains on your data.
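
The length guard in the first tip can be a simple token-budget check before repeating. Here the whitespace token count, the context limit, and the reserve are illustrative stand-ins; a real deployment would use the provider's tokenizer and documented limits:

```python
def maybe_repeat(query: str, context_limit: int = 128_000, reserve: int = 1_024) -> str:
    """Repeat the prompt only if the doubled input still fits the context window.

    `reserve` leaves room for the answer; token counts use a crude
    whitespace split as a stand-in for the provider's real tokenizer.
    """
    n_tokens = len(query.split())
    if 2 * n_tokens + reserve <= context_limit:
        return query + "\n" + query
    return query  # near the limit: fall back to the single prompt
```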

04Experiments & Results

The test: Measure whether repeating the prompt improves accuracy while keeping outputs short and latency similar, especially without chain-of-thought reasoning.

Models: Gemini 2.0 Flash, Gemini 2.0 Flash Lite, GPT-4o, GPT-4o-mini, Claude 3 Haiku, Claude 3.7 Sonnet, and Deepseek V3. All via official APIs (Feb–Mar 2025).

Benchmarks and settings:

  • Standard: ARC (Challenge), OpenBookQA, GSM8K, MMLU-Pro, MATH.
  • Custom: NameIndex and MiddleMatch (list-following tasks).
  • Multiple choice evaluated in two orders: question-first and options-first (the latter is harder because options can’t see the question without repetition).

Competition (baselines):

  • Standard single prompt (no repetition).
  • Prompt Repetition (Verbose) and Prompt Repetition ×3 variants.
  • Padding control to match input length without repeating structure.

Scoreboard (context added):

  • Overall wins: Prompt repetition achieved 47 wins out of 70 model–benchmark cases (p-value < 0.1; McNemar test), with 0 losses. Think of this like winning two-thirds of your games with no defeats.
  • Multiple choice: Smaller gains with question-first (already a good order); larger gains with options-first (repetition fixes the order handicap).
  • Custom tasks: Very large gains. Example: Gemini 2.0 Flash Lite on NameIndex jumped from 21.33% to 97.33%—from a low F to an A+.
  • Reasoning enabled: Neutral to slightly positive (5 wins, 1 loss, 22 ties). That’s expected because reasoning often starts by restating parts of the prompt, mimicking repetition internally.
  • Variants: Verbose and ×3 usually similar to simple repetition; ×3 sometimes best on NameIndex and MiddleMatch. Padding shows no improvement, proving length alone isn’t the cause.
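
The McNemar test compares paired per-example outcomes. A sketch of the exact (binomial) form, where `b` and `c` count the discordant pairs (variable names are ours; the paper may use the chi-square approximation instead):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: examples the baseline got right but repetition got wrong.
    c: examples the baseline got wrong but repetition got right.
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, b=2 and c=8 gives p ≈ 0.109, just above a 0.1 threshold; more lopsided splits in favor of repetition push the p-value well below it.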

Efficiency and latency:

  • Output lengths: About the same across methods when reasoning is off.
  • Latency: Similar across methods in non-reasoning settings; with reasoning on, latencies grow a lot across the board because outputs get longer. An exception: some Anthropic models see latency rise for very long repeated inputs, likely from longer prefill work.
  • Provider variability: Measured via official APIs; network and load can add noise, but patterns match expectations. Notably, Deepseek’s measured latencies were high overall in these runs.

Surprising findings:

  • How strong the gains are on list-heavy tasks suggests repetition is especially good at building global, order-aware representations without asking for chain-of-thought.
  • Repeating only the question (reported elsewhere) isn’t enough; it’s the full-structure repeat that unlocks the benefit.

Takeaway: A simple, one-line transformation—QUERY→QUERY+QUERY—wins often, rarely (if ever) loses in non-reasoning mode, doesn’t bloat answers, and usually doesn’t slow things down.

05Discussion & Limitations

Limitations:

  • Very long prompts: Repetition can increase prefill cost, sometimes raising latency and risking context-window overflows, especially on models sensitive to input length.
  • Task type: When chain-of-thought is encouraged or required, gains are mostly neutral; repetition may be redundant.
  • Provider differences: Latency patterns vary by API and infrastructure; not all environments will show identical improvements.
  • Cost: Even if output length stays the same, input tokens double. This can slightly increase input-side compute or cost in some billing models.
  • Multi-turn chat and streaming: The paper targets single-turn prompts; how repetition behaves in long conversations or streaming scenarios remains to be fully tested.

Required resources:

  • Ability to modify the prompt text before sending it.
  • A context window large enough to handle twice the prompt length (or a guardrail to skip when near limits).
  • Basic A/B testing to verify gains on your domain.

When NOT to use:

  • Prompts already near the context limit or with strict latency budgets on models known to slow down with long inputs.
  • Tasks that explicitly require long chain-of-thought outputs; repetition won’t reduce that overhead.
  • Ultra-short, trivial prompts where accuracy is already near perfect and extra input cost brings little value.

Open questions and future directions (echoing the paper’s ideas):

  • Fine-tuning with repeated prompts: Can models internalize the benefit and avoid needing explicit duplication?
  • Reasoning models + repetition: Can training with repetition reduce generation-time overhead by learning to skip restating the prompt?
  • Selective repetition: Repeat only key parts of long prompts or reorder content to save tokens while keeping the benefit.
  • KV-cache tricks: Keep only the second copy in cache to be completely neutral for generation-time memory.
  • Non-text modalities: Does repeating images or audio descriptors help similarly?
  • Attention pattern analysis: How exactly do heads change their focus with repetition?
  • Robust rules: Predict when repetition helps the most and how token representations differ between copies.

06Conclusion & Future Work

Three-sentence summary: Repeating the entire prompt once lets every part of the input see every other part during the model’s reading phase, fixing order-related blind spots in causal attention. This simple change boosts accuracy across many models and tasks in non-reasoning mode without lengthening the final answer or usually increasing latency. The method is drop-in, safe for output formats, and especially strong on list-following tasks.

Main achievement: Showing that a one-line prompt transformation (QUERY→QUERY+QUERY) yields widespread accuracy gains—47 wins, 0 losses—while keeping outputs short and latencies comparable.

Future directions: Train models with repeated prompts; combine with reasoning to cut overhead; selectively repeat parts of long prompts; analyze attention patterns; try multimodal repetition; and engineer KV-cache/storage tricks to make it even more efficient.

Why remember this: It’s a rare “free lunch” in prompting—a tiny change that often makes models smarter without making them slower or chattier, and it plugs neatly into existing systems without breaking anything.

Practical Applications

  • Wrap all non-reasoning prompts in your app so they are sent twice (with a length guard to skip if near context limits).
  • Use repetition by default for multiple-choice tasks, especially when options must appear before the question.
  • Add A/B tests to confirm gains on your domain and measure any latency changes for very long prompts.
  • Enable a repetition ×3 fallback for tricky list-tracking tasks (like indexing or middle-matching).
  • Keep the output format identical (e.g., 'The answer is <LETTER>.') so downstream parsers work unchanged.
  • Avoid repetition when prompts are huge or latency budgets are extremely tight on models sensitive to input length.
  • Combine with reasoning only if needed; expect neutral to slightly positive effects and longer outputs from reasoning itself.
  • Implement a token-budget policy: repeat only the most crucial sections if the full prompt is too long.
  • For production APIs, log input and output token counts and latency before and after rollout to verify no regressions.
  • Create a simple prompt router that applies repetition to non-reasoning endpoints and skips it otherwise.
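
The router in the last bullet can be sketched in a few lines. The reasoning flag, the whitespace token count, and the limits are illustrative placeholders, not any provider's real API:

```python
def route_prompt(query: str, reasoning_enabled: bool,
                 context_limit: int = 128_000, reserve: int = 1_024) -> str:
    """Apply prompt repetition only on non-reasoning calls that fit the window."""
    if reasoning_enabled:
        return query  # chain-of-thought already restates the prompt; skip
    if 2 * len(query.split()) + reserve > context_limit:
        return query  # would overflow the context or inflate prefill cost; skip
    return query + "\n" + query  # QUERY -> QUERY+QUERY
```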
Tags: prompt repetition, non-reasoning LLMs, causal attention, prefill stage, latency, token order sensitivity, multiple-choice prompting, chain-of-thought, padding control, McNemar test, KV-cache, NameIndex, MiddleMatch, Gemini/GPT/Claude/Deepseek, options-first vs question-first