Query-focused and Memory-aware Reranker for Long Context Processing
Key Summary
- QRRanker is a lightweight way to sort many long text chunks by how helpful they are to a question, using the model's own attention to score relevance.
- It turns special attention heads (called query-focused retrieval heads) into a practical reranker that works listwise over all candidates at once.
- Instead of asking the model to generate messy 1–5 ratings, it produces smooth, continuous scores directly from attention, so it can train on any retrieval dataset.
- A simple training recipe aligns attention scores with ground-truth evidence using a group contrastive objective and a safe max–min normalization.
- It runs without text generation at inference: just prefill the prompt, read attention, score, and sort; fast and stable even with a 4B model.
- Adding a short summary prefix (memory) boosts performance on long stories and dialogues, but is unnecessary for tightly localized Wikipedia facts.
- Across HotpotQA, MuSiQue, NarrativeQA, DetectiveQA, and LoCoMo, QRRanker beats strong pointwise and listwise baselines, including Qwen-Reranker-4B and GroupRank-32B.
- Using middle-layer heads preserves accuracy while allowing higher-layer truncation for faster, cheaper inference.
- On LoCoMo, QRRanker achieves state-of-the-art accuracy with only top-3 chunks (about 854 tokens) fed to the generator.
Why This Research Matters
Better reranking means assistants pull the right facts fast, even from massive stories, wikis, or long chats. By training attention itself to be the scoring function, QRRanker avoids fragile score generation and works with any retrieval dataset. This keeps systems cheaper and quicker at inference because there's no generation step during ranking. Optional summaries help with messy, long contexts without building heavy memory graphs. The approach scales down to small models, making high-quality retrieval more accessible. Stronger reranking directly improves end answers in QA tasks, customer support, research copilots, and agent workflows.
Detailed Explanation
01 Background & Problem Definition
You know how when you're trying to find an answer in a huge book, skimming each page one by one takes forever, and you often miss the most important parts? Computers face the same struggle when they must search long texts or huge piles of documents to answer a single question. Before this work, people mostly used embeddings (short numeric summaries of text) to quickly grab a shortlist of candidate passages. That's great for speed, but there's a known geometric bottleneck: a fixed-size vector often can't capture all the detailed ways a question and a passage might match, like cause-and-effect or multi-step references across chapters. To fix that, researchers added rerankers that take the shortlist and re-order it more smartly. Pointwise rerankers score each passage independently, which is fast but blind to how the other passages in the list relate. Listwise rerankers look at the whole list at once, which better matches how we decide what's most relevant when we compare choices side by side. The problem was that listwise rerankers with large language models (LLMs) usually rely on generation: they ask the model to output scores or a final ranked list. That creates new headaches. Generated scores can be finicky (format problems, inconsistencies, and unstable confidence). To keep things simple, many systems fall back to 1–5 or 1–10 "Likert" ratings, limiting what training data you can use and squishing rich differences into coarse bins. People noticed something intriguing: some attention heads inside LLMs naturally behave like tiny retrievers. When you paste the query and many chunks together, certain heads place higher attention on the truly relevant chunks. Prior work called these retrieval heads or, more specifically, Query-focused Retrieval (QR) heads. That's like discovering your flashlight already shines brighter on the useful clues. However, these QR heads were mostly probed and measured, not trained as a full solution.
They could work out-of-the-box on some tasks but weren't always stable across domains, and performance could drift. Meanwhile, more complex memory systems tried to build graphs, timelines, or event trees to help models remember long dialogues or stories, but they could be heavy to maintain and didn't always beat a really good search. Here's the gap: we needed a reranker that (1) keeps the strong holistic view of listwise methods, (2) avoids generation-time fussiness, (3) outputs smooth, continuous confidence scores (not just 1–5), (4) runs efficiently on small models, and (5) can be nudged with simple context summaries when the world gets really long and tangled. This paper's answer is QRRanker, which trains those special QR heads so their attention weights become accurate, listwise relevance scores. No generation at inference, just prefill-and-score. It supports continuous scores, so any retrieval dataset works for training; no Likert labels required. It's small (4B) yet strong. And when stories or dialogues get sprawling, you can prepend a shared summary prefix (a tiny memory) to guide the attention more globally. Why care? Because every assistant, agent, or search tool we use depends on grabbing the right snippets from mountains of text: from answering homework questions about a history chapter, to helping support agents remember a customer's long conversation history, to letting a tool follow plot threads in a mystery novel. Better reranking means better answers, less context bloat, and faster, cheaper systems that actually scale.
02 Core Idea
🍞 Top Bread (Hook): Imagine you're sorting a big stack of notes to answer a tricky question. Instead of writing long comments about each note, you just glance at each and trust your eyes to spot which ones match. What if we could use the model's own "glance" (its attention) to rank notes? 🥬 Filling (The Actual Concept): The big "aha!" is to train the model's Query-focused Retrieval (QR) heads so their attention becomes the actual scoring function for each candidate passage, all in one pass over the whole list. How it works (intuitively):
- Put the query and the candidate passages together in one prompt (optionally with a short summary prefix).
- Read the attention from special QR heads: how strongly does the query attend to each passage?
- Sum these attention weights into a continuous score per passage.
- Normalize safely (max–min per sample) and train with a group contrastive objective so true-evidence passages score above distractors.
- At inference, skip generation: just prefill, read attention, score, and sort.

Why it matters: Without this, listwise LLM rerankers often need to generate ratings (fragile and coarse) or use bigger models. QRRanker stays lightweight, stable, and data-flexible. 🍞 Bottom Bread (Anchor): For a question about chapter 12 of a long novel, QRRanker looks at all 50 candidate chunks together, uses its trained heads to spot where the query's attention actually lands, and picks the top-3 chunks that truly contain the answer.

Multiple analogies:
- Spotlight analogy: The query is a spotlight sweeping over a stage full of scenes (chunks). QRRanker trains the spotlight operators (QR heads) to shine brightest on the right scenes, then ranks scenes by brightness.
- Metal detector analogy: Walking a beach (the candidate list), the detector's beep loudness (attention weight) tells you how promising each spot is. Training teaches it to beep loudest over real treasure.
- Teacher's glance analogy: A teacher glances over many essays. Their quick attention tells which ones actually answer the question well. Training sharpens that instinct.

Before vs. After:
- Before: Listwise rerankers often asked models to generate scores or orders, risking formatting issues, unstable numbers, and needing Likert labels.
- After: QRRanker gets continuous, listwise relevance from attention itself: no generation, no Likert, just direct, smooth scores.

Why it works (intuition):
- Attention already measures alignment between query tokens and passage tokens. By summing these weights across the query and across trained QR heads, you get a natural, fine-grained relevance signal.
- Max–min normalization stabilizes scale differences between samples.
- Group contrastive loss pushes all true-evidence passages up together rather than optimizing just one at a time.

Building blocks (each with a mini Sandwich):
- 🍞 You know how some team members are naturally good at spotting clues? 🥬 QR heads are attention heads that focus on the right passages for a given query; train them so their focus becomes a ranking score; without them, the model's attention stays uncalibrated. 🍞 Example: Among 50 chunks, certain heads light up only over the ones that mention the exact plot twist you asked about.
- 🍞 Imagine using a highlighter to mark important parts as you read. 🥬 Attention scores are the highlight strength between query words and passage words; we sum these to score passages; without this, we'd need clunky generated ratings. 🍞 Example: The word "alibi" in the question strongly attends to sentences about where the suspect was at 8 pm.
- 🍞 Think of lining up books and deciding the order from most to least helpful. 🥬 Listwise reranking looks at all candidates together and orders them by relevance; without the listwise view, you miss relationships like duplicates or complementary evidence. 🍞 Example: Seeing two nearly identical chunks, you keep the clearer one higher.
- 🍞 Like adding a short cheat sheet before a long test. 🥬 Memory augmentation is an optional summary prefix that gives the model global context; without it, long narratives or dialogues may scatter clues too widely. 🍞 Example: A brief chapter summary reminds the model who betrayed whom before it scans detailed chunks.
- 🍞 Instead of grading with only A–F, use any number 0–100. 🥬 Continuous relevance scores are smooth attention-based numbers, not coarse 1–5 ratings; without them, training data is limited and nuance is lost. 🍞 Example: Two chunks both help, but one slightly more; 0.82 vs. 0.77 captures that.
- 🍞 Picture a taste test where every good cookie must beat all the bad ones. 🥬 The group contrastive objective makes all positives outrank negatives together; without it, extra good passages in the list might get ignored. 🍞 Example: If three chunks are correct evidence, the loss pushes all three above the distractors at once.
03 Methodology
At a high level: Query + Candidate Passages (± Summary Prefix) → Prefill model (no generation) → Read attention from selected heads → Sum to scores → Normalize → Contrastive training → Ranked output at inference. Step-by-step (with the Sandwich pattern for key steps already introduced):
- Build listwise training instances.
- What happens: For each question, retrieve top-50 candidate chunks with an embedding retriever. Mark which chunks are true evidence (positives) and which aren't (negatives). Optionally, build a short global summary prefix from block summaries (for books) or event summaries (for dialogues) and prepend it.
- Why it exists: The model learns in the same setting it will see at inference: comparing many candidates at once. Without listwise instances, we'd lose the global view.
- Example: For "Who helped the detective escape?", collect 50 chunks; label the few that describe the escape as positives; prepend a brief chapter summary.
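The listwise instance this step builds can be pictured as a small data structure. A minimal sketch, assuming hypothetical field names (`query`, `chunks`, `positive_ids`, `summary_prefix`) rather than the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ListwiseInstance:
    """One listwise training instance: a query plus its candidate chunks.
    All names here are illustrative, not the paper's schema."""
    query: str
    chunks: list            # top-50 candidates from the embedding retriever
    positive_ids: set       # indices of chunks labeled as true evidence
    summary_prefix: str = ""  # optional global summary (memory) to prepend

    def prompt(self) -> str:
        # Prompt layout follows the order described above:
        # [optional summary; candidate chunks; query], processed in one prefill.
        parts = [self.summary_prefix] if self.summary_prefix else []
        parts.extend(self.chunks)
        parts.append(self.query)
        return "\n\n".join(parts)

inst = ListwiseInstance(
    query="Who helped the detective escape?",
    chunks=["Chunk about breakfast.", "Chunk describing the escape."],
    positive_ids={1},
    summary_prefix="Chapter summary: the detective is captured, then freed.",
)
print(inst.prompt())
```

The summary prefix is shared by every candidate in the instance, matching the memory-aware variant described later.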
- Select and use QR heads.
- What happens: Start with QR heads identified on seed data (e.g., from NarrativeQA) by measuring which heads' query attention concentrates on gold evidence. Pick the top 16 heads, located mostly in middle layers.
- Why it exists: These heads already act like mini-retrievers; starting with them gives a strong initialization. Without them, training may be less stable or slower.
- Example: Heads from layers 17–24 that consistently light up on true evidence are chosen.
- Compute attention-based scores for each passage.
- What happens: Prefill the prompt with [Optional Summary; Candidate Chunks; Query]. For each selected head, sum its attention from query tokens to each passageās tokens. Add up across the 16 heads to get a passageās raw score.
- Why it exists: This turns the model's innate attention into a continuous relevance meter. Without direct attention scoring, we'd fall back to generation or brittle score parsing.
- Example: If the query token "alibi" strongly attends to lines in chunk #12, its score grows.
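This attention read-out can be sketched as follows, assuming attention weights from one prefill are available as a `[heads, seq, seq]` array; shapes and names are illustrative, not the actual implementation:

```python
import numpy as np

def passage_scores(attn, query_span, passage_spans, qr_heads):
    """attn: [num_heads, seq_len, seq_len] attention weights from one prefill.
    query_span: (start, end) token indices of the query.
    passage_spans: list of (start, end) token spans, one per candidate passage.
    qr_heads: indices of the selected QR heads (e.g., 16 middle-layer heads)."""
    qs, qe = query_span
    scores = []
    for ps, pe in passage_spans:
        # Sum over query tokens (rows), passage tokens (cols), and QR heads.
        scores.append(float(attn[qr_heads, qs:qe, ps:pe].sum()))
    return scores

# Toy check: a head that attends only to the second passage span should give
# that passage the highest raw score.
attn = np.zeros((4, 30, 30))
attn[1, 20:25, 10:15] = 0.2  # head 1: query tokens 20..24 attend to span 10..14
scores = passage_scores(attn, (20, 25), [(0, 5), (10, 15)], qr_heads=[0, 1])
print(scores)  # second passage scores higher
```

In practice these raw sums are what the next step normalizes per sample.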
- Normalize per sample with max–min.
- What happens: Within each training instance, subtract the minimum raw score and divide by the range, then scale to a fixed band (e.g., 0–8). This removes sample-specific quirks (like attention sinks or instruction sensitivity).
- Why it exists: Scores can otherwise vary wildly across prompts; normalization makes contrastive training stable. Without it, the loss could behave erratically.
- Example: If raw scores are [2.0, 2.1, 3.0], they become [0.0, 0.8, 8.0] after scaling to the 0–8 band, keeping relative order but stabilizing magnitude.
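The max–min step fits in a few lines; the band value of 8 mirrors the 0–8 band mentioned above, and `eps` is an added guard against a degenerate (zero) range:

```python
def max_min_normalize(scores, band=8.0, eps=1e-8):
    # Per-sample max–min normalization: shift so the minimum is 0, divide by
    # the range, then scale into a fixed band (0 to `band`).
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo + eps) * band for s in scores]

print(max_min_normalize([2.0, 2.1, 3.0]))  # ≈ [0.0, 0.8, 8.0]
```

Because normalization is per instance, scores are comparable within a list but not across different queries.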
- Train with a group contrastive objective.
- What happens: Treat every positive in the top-50 as a target simultaneously. Push positivesā scores above all negatives using a softmax-like objective. This differs from sampling just one positive.
- Why it exists: Many queries have multiple true evidence chunks. Ignoring extra positives wastes signal. Without group treatment, some positives may be suppressed.
- Example: If chunks #7, #12, and #33 are all correct, the loss boosts all three over the rest together.
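An illustrative form of such a group objective (not necessarily the paper's exact loss) lets all positives share the softmax probability mass, so every true-evidence chunk is pushed above the negatives together:

```python
import math

def group_contrastive_loss(scores, positive_ids, tau=1.0):
    # Softmax over all candidate scores; the loss is the negative log of the
    # total mass assigned to the positives, so all positives rise together
    # instead of sampling a single one.
    exps = [math.exp(s / tau) for s in scores]
    pos_mass = sum(exps[i] for i in positive_ids)
    return -math.log(pos_mass / sum(exps))

# Raising the positives' scores (chunks 1 and 3) lowers the loss.
base = group_contrastive_loss([1.0, 3.0, 0.5, 2.5], positive_ids={1, 3})
better = group_contrastive_loss([1.0, 4.0, 0.5, 3.5], positive_ids={1, 3})
print(better < base)  # True
```

The temperature `tau` is a hypothetical knob here; the key property is that no positive can be ignored when several exist.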
- Memory-aware augmentation (optional but powerful for narratives/dialogues).
- What happens: Prepend a compact summary prefix that covers the retrieved chunks (e.g., a block summary for books or an event list for dialogues). The same prefix guides all candidate chunks.
- Why it exists: Long stories and dialogues spread clues across time and characters. Without a global hint, attention can get diluted.
- Example: A 100-word block summary reminding "Alice frames Bob; the heist happens at midnight; the diary is a red herring."
- Middle-layer efficiency trick.
- What happens: QR heads often live in middle layers. You can truncate higher layers at inference (e.g., keep up to layer 24) to reduce latency, FLOPs, and memory while preserving accuracy.
- Why it exists: If the ranking signal is already strong mid-model, upper layers add cost with little gain. Without truncation, you pay more for similar results.
- Example: The "middle" variant shows best latency and compute while matching core accuracy.

Concrete example with toy data:
- Query: "Where did the suspect hide the necklace?"
- Top-5 chunks:
- Talk about breakfast (irrelevant)
- Note about a "loft above the garage" (maybe relevant)
- Dialogue about the suspect's alibi at 8 pm (some relevance)
- Description: "He tucked the necklace into a loose floorboard in the attic" (very relevant)
- Weather details (irrelevant)
- Attention from the query word "necklace" lands heavily on chunk 4; "hide" also attends to chunk 4 and a bit to chunk 2. Summed scores: chunk 4 > chunk 2 > chunk 3 > chunk 1 ≈ chunk 5. Normalize, then train to push chunk 4 (and chunks 2/3 if truly evidence) above the others.

The secret sauce:
- Use what the model already does well (attention) as the ranking signal, and train it, don't just probe it.
- Make it listwise and continuous, so it compares candidates directly and avoids coarse Likert bins.
- Keep it lightweight: no generation at inference, small backbone, optional summaries, and mid-layer truncation for speed.
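Once attention-derived scores are in hand, the inference-time flow above (prefill, read attention, score, sort) reduces to a plain sort. A minimal sketch, with scores echoing the toy necklace example:

```python
def rerank(chunk_ids, raw_scores, top_k=3):
    # Sort candidates by their attention-derived score, descending, and keep
    # the top-k. No text generation is involved at this stage.
    order = sorted(range(len(raw_scores)), key=raw_scores.__getitem__, reverse=True)
    return [chunk_ids[i] for i in order[:top_k]]

# Chunk 4 (the attic floorboard) dominates; chunk 2 (the loft) is second.
print(rerank(["c1", "c2", "c3", "c4", "c5"], [0.1, 1.2, 0.7, 3.4, 0.1]))
# ['c4', 'c2', 'c3']
```

The returned top-k chunks are what gets handed to the answer generator.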
04 Experiments & Results
The test: Can QRRanker pick the right passages better than strong baselines across different worlds: Wikipedia facts (HotpotQA, MuSiQue), long stories (NarrativeQA, DetectiveQA), and long dialogues (LoCoMo)? We measure with Recall@k: the fraction of questions where at least one true-evidence chunk is found among the top-k. For LoCoMo, we also report end-task F1 of the generated answers. The competition: Embedding retrievers (Qwen3-Embedding 4B/8B; an SFT variant), pointwise and listwise rerankers (Qwen-Reranker-4B, GroupRank-32B), and out-of-box QRHeads (untrained). For dialogues, we also compare against many memory systems (A-Mem, MemoryOS, Zep, Mem0, Nemori, LightMem, TiMem, Synapse, Membox, CompassMem, ES-Mem, SimpleMem). The scoreboard (with context):
- Wikipedia QA (MuSiQue, HotpotQA): QRRanker sets a new bar. For example, on HotpotQA, it reaches around mid-90s Recall@5, which is like consistently finding the right page on the first handful of tries while simpler methods drop more often. It also surpasses graph-heavy methods (HippoRAG) and even a much larger GroupRank-32B, meaning the attention-based listwise signal beats complexity and size.
- Long stories (NarrativeQA, DetectiveQA): This is where holistic reading matters most. On NarrativeQA, QRRanker hits Recall@10 ≈ 54.9%, compared to ≈ 48.8% for GroupRank-32B and ≈ 48.9% for untrained QRHeads. That's like jumping from a solid B to a strong A- in finding the right scenes. In downstream QA, it boosts NarrativeQA F1 to 33.61 (vs. 30.51 for a trained Qwen-Reranker-4B) and raises DetectiveQA accuracy to 67.25 (vs. 62.85 for the best embedding-only baseline).
- Long dialogues (LoCoMo): With only the top-3 chunks (about 854 tokens) fed to the generator, QRRanker achieves Overall F1 ≈ 57.0–57.3 with GPT-4o-mini/GPT-5-mini, topping prior reported systems in the comparison. Many memory frameworks need far more tokens or complex structures; QRRanker's lean rerank-then-generate pipeline keeps inputs small and precise.

Surprising findings:
- Memory prefix helps for stories and dialogues (global context), but not for Wikipedia facts (highly localized). Adding summaries slightly improves Recall@3 for NarrativeQA and LoCoMo but can slightly hurt on HotpotQA/MuSiQue.
- Middle-layer heads shine. Training/selecting heads in middle layers matches or beats alternatives, and you can truncate higher layers to speed up inference without hurting performance, lowering latency (P50/P95), TFLOPs per query, and peak memory.

Numbers made meaningful:
- Think of Recall@10 ≈ 55% on NarrativeQA as: out of 10 tries to show helpful scenes, QRRanker shows a correct one more than half the time, while others lag by several points, enough to noticeably boost final answer accuracy.
- LoCoMo F1 gains with such a tiny context (top-3) mean: instead of dumping long chat histories into the model, QRRanker filters laser-precise memories, saving budget and improving clarity.

Efficiency:
- Compared to a 4B cross-encoder-style reranker, QRRanker cuts latency and compute; the middle-layer variant is fastest and most memory-friendly. No generation at ranking time reduces errors and cost.

Takeaway: Across domains, QRRanker reliably ranks better with fewer resources, and its improvements carry through to better end answers.
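The Recall@k measure used throughout this section is a direct reading of the definition above; a minimal sketch (not the authors' evaluation code):

```python
def recall_at_k(ranked_lists, gold_lists, k):
    # Fraction of questions where at least one true-evidence chunk appears
    # among the top-k reranked chunks for that question.
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_lists)
        if set(ranked[:k]) & set(gold)
    )
    return hits / len(ranked_lists)

ranked = [["c4", "c2", "c3"], ["a1", "a9", "a2"]]
gold = [["c4"], ["a7"]]  # the second query's evidence was never retrieved
print(recall_at_k(ranked, gold, k=3))  # 0.5
```

Note the second query illustrates the shortlist limitation discussed below: if the true evidence never reaches the candidate list, no reranker can recover it.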
05 Discussion & Limitations
Limitations:
- Domain shift in head behavior: Preselected QR heads may not be perfect on every new task. Training helps, but some tasks could need reselection or brief adaptation, especially if the language or structure is very different.
- Summary isn't always a win: Global summaries help in narratives/dialogues but can dilute signal in tightly localized corpora like Wikipedia, where details matter more than big-picture context.
- Reliance on initial retriever: QRRanker reranks the top-50 from an embedding model. If the true evidence never makes the shortlist, no reranker can recover it.
- Attention quirks: Raw attention can include sinks or format sensitivity. Max–min normalization helps, but careful prompting and stable instructions still matter.

Required resources:
- A 4B LLM backbone suffices; training used 8 H20 GPUs with DeepSpeed ZeRO2. Inference is light: prefill-only scoring, optional middle-layer truncation, and small context windows (often just top-3 to generators).

When not to use:
- Very short contexts where a simple retriever already nails it.
- Settings demanding pairwise calibration across corpora (e.g., strict global comparability of scores beyond a single list), since QRRanker normalizes per sample.
- Scenarios where generation-time rationales or chain-of-thought are mandatory during ranking itself (QRRanker avoids generation at ranking time).

Open questions:
- Automatic, domain-agnostic head selection: Can we learn to pick the best heads per domain without seeds, or adapt on-the-fly?
- Unified retrieval-generation training: What happens if we co-train the reranker and generator so ranked evidence and final answers reinforce each other?
- Uncertainty and calibration: How do we best reflect confidence in attention-derived scores across diverse prompts and instructions?
- Beyond text: Can QR-style head training extend to multimodal retrieval (e.g., aligning queries with image regions or audio spans)?
06 Conclusion & Future Work
In three sentences: QRRanker trains special attention heads so their query-to-passage attention becomes a precise, continuous, listwise relevance score, with no generation needed. This lightweight approach beats strong baselines across Wikipedia QA, long stories, and long dialogues, and even powers state-of-the-art results on LoCoMo with tiny input budgets. Optional summaries and middle-layer truncation add accuracy and speed without complicating the pipeline.

Main achievement: Turning the LLM's own attention into a trained, listwise scoring function that is continuous, stable, and efficient, unlocking strong reranking with small models and broad training data.

Future directions: Smarter head selection across domains; joint training with generators; uncertainty calibration of attention-derived scores; and applying QR heads to multimodal retrieval.

Why remember this: It shows that inside every big model lies a capable reranker waiting to be trained; the attention you already have can become the ranking you need, delivering better answers faster and cheaper.
Practical Applications
- Boost RAG systems: Use QRRanker to pick top-3 evidence chunks before answer generation for higher-quality responses.
- Long-document QA: Quickly surface the right chapters or scenes from books and reports.
- Customer support memory: Retrieve the most relevant moments from long, multi-session chats for accurate resolutions.
- Enterprise search: Rank policy or legal passages that align most closely with complex queries, without manual labeling at scale.
- Educational tools: Find the exact textbook sections that answer multi-step homework questions.
- Medical literature triage: Rank candidate studies or guidelines that best match a clinician's specific question.
- Incident analysis: In logs or transcripts, surface the key events leading to an outage from lengthy records.
- Creative writing assistants: Track plot and character arcs by ranking relevant scenes over long drafts.
- Agent planning: Let agents recall the few most critical prior steps from long histories for stable decision-making.
- Meeting assistants: From hours of transcripts, retrieve the top snippets that answer action-item queries.