RePo: Language Models with Context Re-Positioning
Key Summary
- Large language models usually line words up in fixed order slots, which can waste mental energy and make it harder to find the important parts of a long or noisy text.
- This paper introduces REPO, a tiny learnable module that lets the model rearrange where tokens 'sit' in its mental workspace based on how relevant they are.
- REPO assigns each token a continuous (not just 1, 2, 3, ...) position value that better reflects the structure of the input.
- Plugged into standard attention with RoPE, REPO helps the model pay more attention to faraway but important tokens and less to nearby distractions.
- On noisy-context tests, REPO scores much higher than usual methods, showing it can sift through cluttered inputs more effectively.
- On structured data like tables and graphs, REPO beats normal positioning and even matches or tops special 'no position' tricks.
- With longer documents (up to 16K tokens), REPO stays strong and generalizes better than baselines extended by popular length tricks.
- Despite these gains, REPO keeps performance on normal short tasks essentially the same as standard methods.
- The module is lightweight, easy to train with backprop, and adds minimal compute cost.
- REPO's learned positions often look like smart hybrids (sometimes flat, sometimes ordered), capturing real context structure automatically.
Why This Research Matters
Many real tasks involve long, cluttered, or structured inputs: contracts, manuals, scientific articles, spreadsheets, and knowledge bases. REPO helps models cut through the clutter by virtually sliding the right facts closer to the questions that need them. This makes long-document Q&A more accurate, retrieval-augmented systems more reliable, and agent chains more stable over extended contexts. Because REPO is lightweight and compatible with standard attention, it can be adopted without major architectural overhauls. Crucially, it preserves everyday short-task performance while bringing big gains where existing models struggle most. As contexts keep getting longer, learned positioning is a practical step toward dependable, scalable reasoning.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your desk during a big school project. If your notes are just stacked in the order you wrote them, finding what matters is slow and stressful. But if you regroup them, headings with headings and diagrams with diagrams, you think faster.
The Concept (Cognitive Load Theory): What it is: Cognitive Load Theory (CLT) says our working memory is limited, so messy organization wastes brainpower that should go to real thinking. How it works: 1) Some load is needed to understand the material (intrinsic), 2) some is spent just dealing with how info is presented (extraneous), 3) some goes to building good mental models (germane). Reducing the extraneous part frees energy for reasoning. Why it matters: If info is arranged poorly, we tire out juggling junk, and we miss the key ideas. Anchor: A well-organized study guide helps you ace a test because it removes clutter and highlights what matters.
Hook: You know how, when you read a story, your brain jumps back and forth to connect characters and clues, not just the next word in line?
The Concept (Self-Attention): What it is: Self-attention is how a language model decides which other words to look at when understanding each word. How it works: 1) Each word asks, "Who's relevant to me?", 2) it scores other words, 3) it mixes information from high-scoring words more. Why it matters: Without this, a model would treat every word equally and miss long-distance connections. Anchor: When answering "Who solved the case?" the model pays more attention to "detective" and "confession" than to "the" or "yesterday."
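To make the scoring-and-mixing step concrete, here is a minimal single-head sketch of standard scaled dot-product self-attention. It illustrates the general mechanism only, not code from the paper; the shapes and random weights are assumptions chosen for the example.

```python
# Minimal single-head self-attention (illustrative sketch, not the paper's code).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projections."""
    q = x @ w_q                                  # each token asks: "who is relevant to me?"
    k = x @ w_k                                  # each token advertises what it offers
    v = x @ w_v                                  # the content that gets mixed in
    scores = q @ k.T / k.shape[-1] ** 0.5        # pairwise relevance scores
    weights = F.softmax(scores, dim=-1)          # normalize into attention weights
    return weights @ v                           # mix high-scoring tokens' content more

seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)                # toy token representations
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # (6, 8): one context-mixed vector per token
```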
Hook: Think about seats in a theater. You can sit in Row 1, Seat 8, or Row 20, Seat 3. The seat number tells you where someone is.
The Concept (Position Encoding): What it is: Position encoding gives each token a sense of order so attention can use word positions. How it works: 1) Assign positions (usually 0, 1, 2, ...), 2) turn those into signals the model can understand, 3) use them when computing attention. Why it matters: Without positions, the model wouldn't know "cat chased dog" vs "dog chased cat." Anchor: Seat numbers stop chaos; position encoding stops word-order confusion.
Hook: Imagine pointing two clock hands differently to show how far apart times are; rotations can encode differences.
The Concept (RoPE, Rotary Position Encoding): What it is: RoPE encodes how far apart tokens are by rotating their representations, so attention can sense relative distance. How it works: 1) Convert positions into rotation angles, 2) rotate the query and key vectors, 3) attention reads off relative spacing. Why it matters: Relative distances generalize better to longer inputs. Anchor: If words are hours on a clock, RoPE's rotations help the model "feel" how many hours apart two words are.
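The key property is that the attention score ends up depending only on the difference between two positions. The toy sketch below rotates a single 2-D feature pair (full RoPE uses many pairs with different frequencies); the single frequency and the vectors are made-up values for illustration.

```python
# Toy rotary encoding on one 2-D feature pair (full RoPE uses many pairs and frequencies).
import math
import torch

def rotate(vec2, pos, freq=1.0):
    """Rotate a 2-D feature pair by an angle proportional to its (possibly real-valued) position."""
    theta = pos * freq
    rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                        [math.sin(theta),  math.cos(theta)]])
    return rot @ vec2

q = torch.tensor([1.0, 0.2])   # toy query features
k = torch.tensor([0.5, 1.0])   # toy key features

# The dot product (attention logit) depends only on the position *difference*:
print(torch.dot(rotate(q, 3.0), rotate(k, 1.0)))    # positions 3 and 1  -> gap of 2
print(torch.dot(rotate(q, 10.0), rotate(k, 8.0)))   # positions 10 and 8 -> same gap, same value
```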
The world before: Most large language models simply number tokens linearly or give them all the same position. That's rigid: the 50th token is always "farther" than the 49th even if the 10th token is the real clue. These fixed choices often introduce extraneous load: the model wastes attention on nearby tokens simply because they are nearby (locality bias), not because they are helpful.
The problem: Long or noisy contexts, like a 10-page document with only one relevant paragraph, trip models up. They lack a built-in way to reorganize context so the important parts are easier to find.
Failed attempts:
- NoPE (no explicit positions) sometimes helps on graphs/tables but then loses clear ordering cues.
- Hybrids that mix linear and no-position layers give partial relief but still rely on hard-coded schedules and can't adapt to each input's structure.
The gap: We need a learnable way for the model to re-seat tokens, based on relevance, before attention uses positions, so the model's "mental desk" gets tidied automatically.
Real stakes:
- Reading long manuals to answer a pinpoint question,
- searching citations across many papers,
- handling noisy retrieval results,
- browsing complex tables or graphs,
- and running multi-step agents that gather context over time.
Better organization means fewer wrong answers, faster understanding, and more reliable AI helpers.
02 Core Idea
Hook: You know how a librarian can move books to the front shelf when they become popular, so people find them faster?
The Concept (REPO, Context Re-Positioning): What it is: REPO lets the model learn where to "place" each token along a learned position line so important tokens sit closer, even if they were far apart in the original text. How it works: 1) For each token, a tiny neural module looks at its hidden state, 2) it produces a compact "position representation", 3) it converts that into a real-valued position, 4) attention then uses these learned positions (via RoPE or other differentiable encodings) to compute smarter connections. Why it matters: Without REPO, models waste attention on nearby-but-irrelevant words and miss faraway-yet-crucial clues. Anchor: In a 10-page report, REPO virtually "slides" the key table and its caption next to your question so attention locks onto them.
The "Aha!" moment in one sentence: Instead of fixing every token's seat number ahead of time, let the model learn better seat numbers on the fly so attention can spend its energy where it counts.
Three analogies:
- Classroom seating: The teacher rearranges seats so collaborators sit closer; REPO rearranges token positions so related words feel closer.
- Grocery bagging: Put heavy and fragile items in smart spots; REPO places tokens so important ones don't get crushed by noise.
- Map pinning: You drop pins near related landmarks; REPO drops tokens on a learned "map" so connections are short hops, not long trips.
Before vs After:
- Before: Positions are linear or constant; distant facts remain distant even if vital; attention falls prey to locality bias.
- After: Positions reflect relevance; distant-but-important facts get pulled closer in position space; attention mass increases on true signals and decreases on distractors.
Hook: Imagine a rubber band stretched along your sentence; REPO can gently slide tokens along it without tearing the order completely.
The Concept (Why It Works, Intuition): What it is: The trick is letting positions be continuous and trainable so the model can compress or expand parts of the context. How it works: 1) Tokens get a learned scalar position, 2) RoPE or similar converts differences in these learned positions into attention-friendly signals, 3) training nudges positions so helpful tokens become easier to attend to. Why it matters: This reduces extraneous load and boosts germane processing; attention has a cleaner landscape to search. Anchor: On "needle in a haystack" tests, the "needle" tokens are repositioned closer, so attention finds them faster and more reliably.
Building blocks (introduced with mini-sandwiches):
- Hook: Think of a snapshot of what the model "knows" at a word. The Concept (Hidden State): What it is: A hidden state is the model's internal summary for a token at a layer. How it works: It stores context-aware features learned so far. Why it matters: REPO reads hidden states to judge where tokens should go. Anchor: Like a student's notes at that exact moment of reading.
- Hook: You know how flipping a light switch gates electricity? The Concept (SwiGLU): What it is: A lightweight neural layer that mixes "gate" and "content" to extract a compact position representation. How it works: It multiplies a gated signal with content to highlight useful features. Why it matters: It efficiently teases out position-relevant info from hidden states. Anchor: Like a camera setting that sharpens the parts you care about.
- Hook: A slider that can move smoothly is easier to tune than a staircase. The Concept (Differentiable Module): What it is: A module trained end-to-end by gradients. How it works: Small changes to its weights nudge outputs (positions), improving them over time. Why it matters: It lets REPO learn positions that minimize errors. Anchor: Like practicing piano; tiny adjustments make the tune cleaner.
- Hook: A compass only helps if it can sense direction differences. The Concept (Relative Encoding via RoPE): What it is: Translates position differences into rotations attention can use. How it works: Learned positions feed into RoPE to produce meaningful relative signals. Why it matters: Keeps the method compatible with modern attention and long-context tricks. Anchor: Like comparing two time zones to schedule a call.
Put together, REPO turns rigid word order into a flexible, relevance-aware layout that helps models understand long, messy, or structured inputs much better.
03 Methodology
High-level recipe: Input tokens → (A) Build position representations → (B) Assign learned positions → (C) Feed positions into attention's positional encoding → Output predictions.
Step A: Build position representations from hidden states
- What happens: For each token at a given layer, REPO passes its hidden state through a light SwiGLU block to extract a smaller "position representation" that focuses on features helpful for placement.
- Why this exists: Hidden states mix many signals. We need a focused summary that spotlights cues like "this token is a heading" or "this is part of a question". Without it, we risk noisy or trivial placements.
- Example: In a prompt with multiple Q&A pairs, the "Q:" and "A:" markers and the question text produce a representation that differs from random filler text, making them easier to cluster or separate as needed.
Hook: Like tuning a radio to the right station so you hear the song clearly. The Concept (Position Representation): What it is: A compact vector emphasizing features relevant to learned positions. How it works: SwiGLU gates content to enhance position-relevant signals. Why it matters: It gives REPO a clean, low-dimensional lens to judge where to place each token. Anchor: The "Q:" tag's representation makes it more likely to be grouped with its corresponding answer.
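A minimal sketch of what such a SwiGLU-style extractor could look like in PyTorch is shown below. The class name `PositionRepr`, the output width `d_pos`, and the bias-free projections are assumptions made for illustration, not the authors' exact module.

```python
# Sketch of a SwiGLU-style block mapping hidden states to compact position representations.
# Names and dimensions are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionRepr(nn.Module):
    def __init__(self, d_model: int, d_pos: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_pos, bias=False)     # "gate" branch
        self.content = nn.Linear(d_model, d_pos, bias=False)  # "content" branch

    def forward(self, hidden):  # hidden: (batch, seq_len, d_model)
        # SwiGLU-style gating: silu(gate) * content highlights placement-relevant features.
        return F.silu(self.gate(hidden)) * self.content(hidden)

repr_net = PositionRepr(d_model=512, d_pos=64)
h = torch.randn(2, 128, 512)   # hidden states for a batch of two 128-token sequences
z = repr_net(h)                # (2, 128, 64) position representations
```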
Step B: Assign continuous positions per attention head
- What happens: A simple linear layer converts the position representation into a real-valued position (a single number) for each token, independently per attention head.
- Why this exists: Different heads specialize differently; one might group examples, another might align evidence with a question. Head-specific positions let each head arrange tokens for its task. Without it, we'd force a one-size-fits-all layout.
- Example: In a table QA context, one head might spread columns across a range to preserve order, while another compresses rows with the same entity, so their comparisons are cheap for attention.
Hook: Imagine several magnet boards; each board lets you slide the same tokens around in the layout that best suits that board's job. The Concept (Per-Head Positioning): What it is: Each attention head gets its own learned positions. How it works: A small linear projection maps representations to scalar positions per head. Why it matters: Specialization improves flexibility and reduces conflicts. Anchor: One head lines up dates chronologically; another head stacks symptoms together for a medical question.
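Continuing the sketch above, the per-head assignment can be as simple as one linear projection that emits one real number per head for every token. The head count, shapes, and names below are illustrative assumptions.

```python
# Sketch: project each position representation to one real-valued position per head.
# Head count, shapes, and names are illustrative assumptions.
import torch
import torch.nn as nn

class PerHeadPositions(nn.Module):
    def __init__(self, d_pos: int, n_heads: int):
        super().__init__()
        self.proj = nn.Linear(d_pos, n_heads)  # one scalar position per attention head

    def forward(self, pos_repr):  # pos_repr: (batch, seq_len, d_pos)
        # Returns (batch, n_heads, seq_len): every head gets its own continuous layout.
        return self.proj(pos_repr).transpose(1, 2)

pos_net = PerHeadPositions(d_pos=64, n_heads=8)
z = torch.randn(2, 128, 64)      # position representations from the previous sketch
positions = pos_net(z)           # (2, 8, 128): one head may compress rows, another keep order
```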
Step C: Integrate learned positions into attention with RoPE (or other differentiable encodings)
- What happens: The difference between two tokens' learned positions feeds a positional encoding function (like RoPE), which then shapes the attention score between them.
- Why this exists: Attention needs position-aware signals to judge relevance and distance. Using learned positions makes these signals match the input's real structure. Without it, attention reverts to rigid or flat layouts.
- Example: In a "needle in a haystack" prompt, the answer sentence's tokens are effectively "closer" to the question tokens, so attention mass shifts toward the true evidence even if they're thousands of tokens apart in raw order.
Hook: If the original text is a straight road, REPO adds shortcuts so important places feel next door. The Concept (Relative Position Differences): What it is: Learned positions define meaningful "distances" between tokens. How it works: Position differences become rotations (or biases) that make attention find relevant pairs. Why it matters: It removes unnecessary travel time across the context. Anchor: The question "What year did it happen?" and the matching year value get a short path, so the model finds the year quickly.
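Putting the pieces together for one head: the sketch below applies a rotary-style rotation using real-valued learned positions in place of the integer indices 0, 1, 2, ..., so the attention logits depend on differences of those learned positions. The single-head shapes, frequency schedule, and random positions are assumptions for illustration, not the paper's configuration.

```python
# Sketch: rotary-style attention logits driven by learned, real-valued positions.
# Shapes, the frequency schedule, and the random positions are illustrative assumptions.
import torch

def rope_rotate(x, pos, freqs):
    """x: (seq, n_pairs, 2); pos: (seq,) real-valued positions; freqs: (n_pairs,)."""
    theta = pos[:, None] * freqs[None, :]            # (seq, n_pairs) rotation angles
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., 0], x[..., 1]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

seq, n_pairs = 128, 4
q = torch.randn(seq, n_pairs, 2)
k = torch.randn(seq, n_pairs, 2)
freqs = 1.0 / (100.0 ** (torch.arange(n_pairs) / n_pairs))

learned_pos = torch.rand(seq) * 32.0     # stand-in for one head's learned positions
q_rot = rope_rotate(q, learned_pos, freqs)
k_rot = rope_rotate(k, learned_pos, freqs)

# Logits now reflect differences of *learned* positions, not raw token order.
logits = (q_rot.reshape(seq, -1) @ k_rot.reshape(seq, -1).T) / (2 * n_pairs) ** 0.5
```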
Training and efficiency tricks
- Start in mid-to-upper layers: Lower layers catch local patterns (spelling, short syntax). REPO begins from a later layer (e.g., the 5th) to focus on higher-level organization. Without this, we might overcomplicate early processing.
- Don't reorder the KV cache: For fast autoregressive decoding, we keep the physical order; learned positions only influence the math inside attention. Avoids heavy recomputation.
- Lightweight overhead: REPO adds a tiny number of parameters and keeps inference speed nearly unchanged.
Hook: Like labeling shelves without moving the whole store; shoppers (attention) still find items faster. The Concept (KV Cache Constraint): What it is: Keep the original processing order for speed while using learned positions internally. How it works: We do not physically rearrange tokens; we just feed better positional signals to attention. Why it matters: We get the gains without breaking fast inference. Anchor: The aisles stay in place, but the signs (positions) get smarter, so you still shop quickly.
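A toy decoding loop can make this constraint concrete: cache entries are appended in arrival order as usual, and only the list of positions handed to the rotary math is learned. The helper `learned_position_of` is a hypothetical stand-in for REPO's small module, used purely for illustration.

```python
# Toy illustration of the KV-cache constraint: no physical reordering; only the
# positions fed to attention's rotary math are learned. Everything here is a stand-in.
import torch

def learned_position_of(hidden_state: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for REPO's position module: a fixed projection to one scalar.
    w = torch.linspace(-1.0, 1.0, hidden_state.shape[-1])
    return hidden_state @ w

k_cache, v_cache, learned_positions = [], [], []

for step in range(5):                         # pretend we decode five tokens
    h = torch.randn(16)                       # new token's hidden state
    k_cache.append(torch.randn(8))            # keys/values appended in arrival order, as usual
    v_cache.append(torch.randn(8))
    learned_positions.append(learned_position_of(h))

# The cache keeps its physical order; attention would consume `learned_positions`
# (instead of 0, 1, 2, ...) when computing the rotary phase for each cached key.
print([round(p.item(), 2) for p in learned_positions])
```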
Concrete walk-through with data
- Input: A 4K-token prompt with 5 Q&A examples, a long background document, and a final user question.
- Step A: For every token, build position representations. Example: "Q:" and "A:" tags produce distinctive vectors; irrelevant filler produces bland ones.
- Step B: Each head assigns smooth scalar positions. Example: One head compresses all Q parts together; another aligns each A with its Q; a third flattens filler into a nearly constant band.
- Step C: Attention uses these positions via RoPE. Result: The final question's tokens attend strongly to the right evidence sentence and its matching answer pattern, not to nearby noise.
- Output: The model answers correctly, even though the relevant bit was far away and the context was cluttered.
The secret sauce
- Continuous, non-linear, per-head positioning that is trained end-to-end.
- Compatible with modern efficient attention and long-context extrapolation tricks (like YaRN for RoPE).
- Learns hybrid patterns (flat where order doesn't help; monotonic where order matters) from data, with no handcrafting.
04 Experiments & Results
The test: What and why
- Noisy Context (RULER subsets): Measures whether the model can find important info buried under irrelevant text, like spotting the right Lego brick in a huge bucket. This probes extraneous load.
- Structured Data (NLGraph, HybridQA): Checks if linearized tables/graphs are still handled well by reorganizing their parts so relationships don't get lost.
- Longer Context (RULER long, LongBench): Tests generalization beyond training length (4K → 8K → 16K) and whether learned positions keep attention meaningful at long range.
The competition (baselines)
- RoPE: The standard linear positions with rotary encoding; strong and widely used.
- NoPE: Drop explicit position encodings (equivalent to giving everyone the same seat); sometimes helps on structured inputs but risks confusion.
- R2N1 and N2R1: Hybrid schedules that interleave RoPE and NoPE layers; better than pure forms in some cases but still fixed, not input-adaptive.
The scoreboard (with context)
- Noisy Context: Within the 4K training length, REPO outperforms RoPE by about 11 points on average across noisy subtasks. That's like jumping from a B to a solid A when the test is full of trick questions.
- Longer Context: With YaRN extending RoPE to 8K–16K, REPO widens the gap versus baselines. On long QA and needle-in-a-haystack-style tasks, gains exceed 13 points in exact match over the next best baseline, like finishing a marathon minutes ahead instead of seconds. On LongBench (realistic long tasks), REPO leads by at least about 5.5 points.
- Structured Data: REPO beats standard RoPE by around 1.9 EM on average across graph/table tasks, showing better preservation of structure. Interestingly, NoPE shines on one graph set, hinting that local order isn't always the best assumption; REPO flexibly learns when to be flat or ordered.
Surprising findings
- Attention reallocation: On needle-in-a-haystack, REPO increases attention mass on the distant "needle" and reduces it on the nearby "query" fluff, overcoming the usual bias to look locally first.
- Dense, non-linear position space: Learned positions span a tighter, non-uniform range than raw token length. The model doesn't need the full 0…L-1 spread; it prefers compact, adaptive layouts that extrapolate better to long inputs.
- Hybrid patterns: REPO often learns chunks that are nearly constant (NoPE-like), some monotonic stretches (RoPE-like), and many hybrids. This emergent mix matches the real structure, like grouping example blocks while preserving order within each example.
General tasks and efficiency
- On standard short benchmarks (ARC, BoolQ, HellaSwag, MMLU-Pro, etc.), REPO stays essentially tied with RoPE, so the gains don't cost everyday performance.
- Compute overhead is tiny (~0.9% additional parameters), and inference time is nearly unchanged, keeping deployment practical.
Takeaway: REPO's adaptive positioning helps most when contexts are long, noisy, or structured, precisely the settings where fixed positions struggle, while maintaining parity on short, clean tasks.
05 Discussion & Limitations
Limitations
- Training dependence: REPO is learned during continued pretraining; models not exposed to diverse data might learn suboptimal positioning.
- Compatibility: It relies on differentiable positional encodings (like RoPE or ALiBi). Purely discrete or non-differentiable schemes would need modification.
- KV cache constraint: For speed, tokens aren't physically reordered; only attention "feels" the new positions. True reordering might unlock more gains but at high compute cost.
- When context is trivial: On very short, clean prompts where reorganization isn't needed, REPO won't help and could add a tiny bit of overhead.
- Interpretability at scale: While case studies show meaningful patterns, interpreting the learned positions of all heads and layers in huge models remains challenging.
Required resources
- Hardware: The paper used 4×H100 GPUs to continue pretraining on ~50B tokens for a 1B-parameter backbone; modest for LLM research but non-trivial.
- Software: A standard training stack plus an implementation of REPO's small modules and a differentiable positional encoding (e.g., RoPE).
- Data: General pretraining corpora; no new labels or special supervision are required.
When NOT to use it
- Ultra-latency-critical micro-deployments with tiny prompts where every microsecond matters and attention reorganization brings no benefit.
- Pipelines already using heavy external context engineering (manual chunking, retrieval filtering) that solves the same problem sufficiently.
- Tasks where absolute linear order is sacred and any non-linear positioning could confuse the model (rare, but e.g., strict sequence labeling without noise).
Open questions
- Can we safely sort or partially reorder the KV cache head-by-head at inference without huge cost?
- How does REPO interact with retrieval-augmented generation when the retriever also ranks relevance: do the two add up or step on each other's toes?
- Can we extend REPO to multi-dimensional layouts (like 2D for tables/graphs) to capture structure even better?
- What's the best curriculum for teaching REPO to generalize to 32K, 128K, or beyond?
- Can we couple REPO with explicit interpretability tools so the learned layout becomes a map for humans as well as machines?
06 Conclusion & Future Work
Three-sentence summary: REPO teaches language models to learn where tokens should "sit" in position space based on relevance, not just raw order, reducing extraneous cognitive load. By feeding these learned positions into standard attention (e.g., with RoPE), the model focuses more on faraway-but-important information and less on nearby distractions. This yields large gains on long, noisy, and structured inputs while keeping short-task performance steady, with minimal compute overhead.
Main achievement: Turning positional assignment from a rigid, hand-designed rule into a small, trainable module that discovers hybrid, context-appropriate layouts automatically.
Future directions: Explore partial reordering of caches at inference, multi-dimensional learned layouts for tables/graphs, tighter integration with retrieval systems, and curricula that push context windows far beyond 16K while maintaining stability. Add simple interpretability probes to visualize and debug learned position maps, making them useful to human readers and downstream tools.
Why remember this: REPO reframes "where tokens are" as something the model can learn rather than merely accept. That simple shift, like moving books to the right shelf, lets attention do its best work on what matters, opening the door to more reliable long-document understanding, robust retrieval, and smarter agentic reasoning.
Practical Applications
- Long-document Q&A: Improve accuracy when answering questions about lengthy reports, research papers, or legal documents.
- Enterprise search: Reduce distraction from irrelevant retrieved passages by repositioning the most relevant evidence closer to the query.
- Table and graph reasoning: Preserve structure when data is linearized, boosting performance on analytics and BI-style queries.
- Agentic workflows: Maintain focus over many tool calls and memory steps by emphasizing key context across long chains.
- Customer support: Sift through knowledge bases and ticket histories to find the exact troubleshooting steps faster.
- Educational tutoring: Highlight crucial definitions, theorems, or worked examples in long study materials.
- Healthcare summarization: Bring lab results and critical notes closer to clinical questions for safer decision support.
- Code understanding: In large codebases, reposition relevant function definitions or docs near the query for better navigation and refactoring.
- Legal and compliance checks: Surface key clauses across contracts and policies while de-emphasizing boilerplate.
- Research assistants: Help literature reviews by clustering claims, evidence, and citations so conclusions are easier to verify.