
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG

Intermediate
Fengran Mo, Zhan Su, Yuchen Hui et al. Ā· 1/13/2026
arXiv Ā· PDF

Key Summary

  • OpenDecoder teaches large language models (LLMs) to pay more attention to better documents during Retrieval-Augmented Generation (RAG).
  • It adds simple, explicit quality signals (like relevance scores) into the LLM’s attention during decoding, instead of hoping the model figures it out alone.
  • Three signals are used: the retriever’s relevance score, an LLM-based ranking score, and a query performance prediction (QPP) score.
  • During training, it also mixes in partly relevant and irrelevant documents to build noise tolerance (robustness).
  • OpenDecoder reshapes token probabilities at decode time by modulating attention with the external scores.
  • Across five QA datasets and three noisy settings, OpenDecoder consistently beats strong baselines, including robust fine-tuning (RbFT).
  • Simple normalization (max or min-max) often works best; adding more complex rank decay can hurt performance.
  • Reordering or shuffling documents during training helps reduce position bias and improves robustness.
  • The method is flexible: it can work with other indicators (like trust or authority) and with different LLMs.
  • This makes RAG systems more reliable when search results are messy or partially wrong.

Why This Research Matters

In real life, search results are often mixed: some great, some mediocre, some wrong. OpenDecoder makes LLMs better at listening to the great parts and ignoring the noise. That means more trustworthy homework help, customer support, and professional research tools. It reduces the chance of hallucinations by treating document quality as a first-class citizen during decoding. It’s flexible enough to plug in other signals like credibility or recency to further improve trust. Overall, it moves RAG from ā€œhope the model figures it outā€ to ā€œshow the model which sources to trust.ā€

Detailed Explanation


01 Background & Problem Definition

You know how when you do a school project, you might grab some books and websites, but not all of them are helpful? If you treat every source like it’s equally great, your report can get messy. That’s what often happens in RAG (Retrieval-Augmented Generation): the model pulls in documents, but then treats them all as roughly equally trustworthy.

šŸž Hook: Imagine packing a lunchbox. Some items are fresh and tasty; others might be stale. If you close your eyes and grab randomly, your lunch won’t be great. 🄬 The Concept (RAG, at a glance): RAG is when a model first retrieves documents and then generates an answer using them.

  • What it is: A two-step system that searches for info and then writes an answer using that info.
  • How it works: (1) Search the corpus; (2) Pick top-k documents; (3) Add them to the prompt; (4) Decode an answer (see the code sketch after this list).
  • Why it matters: Without good retrieval or careful use of the retrieved info, answers can be wrong or confused. šŸž Anchor: A homework helper looks up Wikipedia pages and then writes a short answer to your question.
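
To make the four steps concrete, here is a minimal sketch of the RAG loop in Python. The word-overlap retriever and the stubbed decode step are illustrative placeholders, not the components used in the paper (a real system would use a dense retriever and an actual LLM).

```python
# Minimal sketch of the two-step RAG pipeline: retrieve, then generate.
# The retriever is a toy word-overlap scorer, used only for illustration.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[tuple[str, float]]:
    """(1)-(2): score every document against the query and keep the top-k."""
    q_words = set(query.lower().split())
    scored = [(doc, len(q_words & set(doc.lower().split())) / max(len(q_words), 1))
              for doc in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def build_prompt(query: str, docs: list[tuple[str, float]]) -> str:
    """(3): append the retrieved documents to the prompt."""
    context = "\n".join(f"[doc {i + 1}] {text}" for i, (text, _) in enumerate(docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Paris is the capital of France.",
    "French cuisine includes croissants and baguettes.",
    "The Eiffel Tower is a landmark in Paris.",
]
query = "What is the capital of France?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)  # (4): a real system would now decode an answer from this prompt
```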

The World Before: LLMs are great at writing, but they can forget facts or be outdated. RAG helps by pulling up-to-date info from a database or the web. But there’s a catch: the retrieved documents aren’t all equally useful. Some are on-topic; some are partly related; some are off-base. Traditional RAG hopes the LLM’s internal attention mechanism will sort this out—like hoping a student just ā€œgets it.ā€

šŸž Hook: You know how your brain focuses on key sentences when reading? 🄬 The Concept (Attention Mechanism): Attention lets the model focus more on important tokens.

  • What it is: A way for the model to give higher weights to useful words/sentences.
  • How it works: (1) Compare tokens; (2) Score how related they are; (3) Turn scores into weights; (4) Mix information using those weights (see the code sketch after this list).
  • Why it matters: Without attention, the model would treat every word and sentence equally and miss what matters. šŸž Anchor: When answering ā€œWhat’s the capital of France?ā€, the model focuses on ā€œcapitalā€ and ā€œFrance,ā€ not filler words.
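
For readers who like code, here is a textbook implementation of scaled dot-product attention in NumPy, matching steps (1)-(4) above. Real LLMs add learned projections, multiple heads, and causal masking, which this sketch omits.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (1)-(2) compare and score tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # (3) softmax turns scores into weights
    return weights @ V                               # (4) mix information with the weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)  # (4, 8): each query token mixes the 6 context tokens
```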

The Problem: In practice, LLMs often don’t fully use the relevance signals that the retriever already computed. Once documents are stuffed into the prompt, the LLM re-judges relevance on its own. That can go wrong when noise is high: irrelevant bits can distract the model, leading to hallucinations or wrong answers.

Failed Attempts: Two families of fixes tried to help.

  • Workflow-based prompts (LLM-as-judge, isolate-then-aggregate, step-by-step filtering) try to make the model pick useful chunks before answering. These can be slow (many steps), sensitive to prompts, and still make mistakes.
  • Fine-tuning approaches teach models to better handle retrieval defects, but decoding still depends only on the model’s internal attention—ignoring the retriever’s explicit quality scores.

The Gap: No one was directly plugging external quality indicators (like relevance scores) into the LLM’s attention during decoding. It’s like having a teacher’s grading notes but not showing them to the student taking the test.

Real Stakes: This matters for everyday tools—homework helpers, customer support bots, coding assistants, medical or legal search—where the retrieved info might be mixed quality. A system that can downweight bad sources and upweight good ones at decoding time can give more reliable, safer, and clearer answers.

02 Core Idea

The ā€œAha!ā€ Moment: Don’t just hope the model figures out which documents are good—feed the model the document quality signals and use them to steer the attention during decoding.

Three Analogies:

  1. Classroom microphones: šŸž Hook: Imagine a panel of students each with a mic. Some have the right answer; some are guessing. If you can turn up the volume on the right mic and turn down the wrong ones, the class hears the best answer. 🄬 The Concept: OpenDecoder is the volume knob—relevance scores become the volume levels that steer attention. šŸž Anchor: For a geography question, turn up the student with the atlas, turn down the one talking about sports.
  2. Recipe ingredients with freshness labels: šŸž Hook: You know how food labels say ā€œfreshā€ or ā€œexpiring soonā€? 🄬 The Concept: Each retrieved document gets a freshness-like score (relevance/quality), and the model uses those labels to decide which ingredients to use most. šŸž Anchor: If the milk is very fresh (high score), use it; if it’s suspect (low score), don’t.
  3. Binocular focus ring: šŸž Hook: When you look through binoculars, you twist the focus ring so the important object gets sharp. 🄬 The Concept: The external scores are the focus ring that sharpens relevant tokens and blurs noisy ones in attention. šŸž Anchor: Focus on ā€œParisā€ and blur unrelated travel blog chatter.

Before vs. After:

  • Before: LLMs largely rely on internal attention shaped by training and prompts, treating all appended documents more similarly than we’d like. Relevance computed by the retriever is ignored at decode time.
  • After: OpenDecoder injects explicit quality indicators (retriever relevance, LLM ranker score, and QPP difficulty) into the attention math. The probability distribution over next tokens is reshaped so better documents influence the answer more and noisy ones less.

Why It Works (intuition): Attention is the model’s ā€œspotlight.ā€ If you multiply or modulate that spotlight using reliable external scores, you prevent the light from shining equally on junk. When all context is noisy, the scores guide the model to rely more on its internal knowledge (the instruction and query get high scores), reducing confusion.

Building Blocks (each with the Sandwich pattern):

  1. šŸž Hook: You know how a library search gives each book a match score? 🄬 Relevance Score (from the retriever):
    • What it is: A number showing how well a document matches the query.
    • How it works: The retriever measures similarity between the query and documents, then returns top-k with scores.
    • Why it matters: Without it, the model might overuse off-topic docs. šŸž Anchor: For ā€œcapital of France,ā€ a Wikipedia page on Paris gets a high score; a page on French cuisine gets a lower one.
  2. šŸž Hook: Think of a debate judge scoring speakers’ arguments. 🄬 LLM-based Ranking Score:
    • What it is: A semantic score from an LLM ranker judging how relevant each doc is.
    • How it works: The ranker reads (query, doc) and outputs a relevance logit.
    • Why it matters: A second opinion from an LLM can catch nuances the retriever misses. šŸž Anchor: The ranker boosts a doc that directly states ā€œParis is the capital of France.ā€
  3. šŸž Hook: Before a test, teachers can guess which questions will be tricky. 🄬 QPP (Query Performance Prediction):
    • What it is: A score predicting how hard the query is (and how reliable retrieval might be).
    • How it works: A QPP model analyzes the query (and doc) to estimate likely retrieval quality.
    • Why it matters: If it’s a tough query, the model should be extra careful about noisy docs. šŸž Anchor: For a vague or multi-hop question, QPP says ā€œhard,ā€ so the system leans more on high-score docs.
  4. šŸž Hook: When combining flavors, a chef balances salt, sour, and sweet. 🄬 Aggregation and Normalization:
    • What it is: Combine multiple scores and normalize them (e.g., to 0–1) so they’re comparable.
    • How it works: Weighted sum (retriever dominant; others supplemental), then normalize (often max or min-max) before using in attention.
    • Why it matters: Without fair scaling, one score could overpower others or become meaningless. šŸž Anchor: Mix 1.0 for retriever, 0.5 for ranker, 0.5 for QPP; then scale so the biggest becomes 1.
  5. šŸž Hook: Think of dimmer switches on a row of lights. 🄬 Attention Modulation (OpenDecoder):
    • What it is: Use the normalized scores to directly modulate attention weights during decoding.
    • How it works: Build a token-level score matrix that boosts tokens from higher-scored docs (and always gives query/instruction a high score), then apply it inside the attention computation before softmax.
    • Why it matters: Without modulation, attention may over-focus on distracting text. šŸž Anchor: The softmax now favors tokens from the doc saying ā€œParis is the capital.ā€
  6. šŸž Hook: Practicing with tricky practice tests makes you tougher. 🄬 Robustness Training:
    • What it is: During training, deliberately mix in partly relevant and irrelevant docs and sometimes shuffle order.
    • How it works: Replace half of top-k with noisy docs and change positions; train the model to succeed anyway.
    • Why it matters: Without this, the model may crumble when retrieval is messy. šŸž Anchor: Even when 30–50% of docs are off-topic, the model still finds and trusts the good ones.

03 Methodology

At a high level: Query + Retrieved Docs + Quality Scores → Build indicator matrix → Modulate attention during decoding → Output answer.

Step-by-step (with mini Sandwich explanations as new ideas appear):

  1. Retrieval and Indicator Construction
  • What happens: Given a query q, the retriever finds top-k documents and gives each a relevance score. Two extra signals are computed: an LLM-ranker relevance score and a QPP score.
  • Why this step exists: If we don’t measure document quality, the model can’t prefer the best sources.
  • Example: For ā€œWhat’s the capital of France?ā€, suppose we retrieve 10 docs. The Paris page gets 0.95, a page on French history 0.68, and a page on desserts 0.12.

šŸž Hook: Like stacking sticky notes with grades on each source. 🄬 Aggregation:

  • What it is: Combine the retriever score (main), ranker score, and QPP score into one guidance score per document.
  • How it works: Weighted sum (retriever dominant; others scaled by 0.5), then prepare for normalization.
  • Why it matters: Each score sees quality from a different angle; together they give a fuller picture. šŸž Anchor: Final doc score = 0.95 + 0.5*(ranker 0.90 + QPP 0.80); see the sketch below.
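
A tiny sketch of this weighted sum, using the illustrative weights from this article (retriever weight 1.0, ranker and QPP weight 0.5); the weights actually used in the paper may differ.

```python
def aggregate(retriever: float, ranker: float, qpp: float,
              w_ranker: float = 0.5, w_qpp: float = 0.5) -> float:
    """Retriever score dominates; ranker and QPP act as supplements."""
    return retriever + w_ranker * ranker + w_qpp * qpp

print(aggregate(0.95, 0.90, 0.80))  # 0.95 + 0.5*0.90 + 0.5*0.80 ā‰ˆ 1.80
```
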
  2. Normalization and Token-level Expansion
  • What happens: Normalize the aggregated doc scores to a friendly 0–1 range. Set the instruction and query scores to 1. Expand these per-doc scores to token-level (each token inherits its doc’s score). This forms S_norm.
  • Why this step exists: Attention operates over tokens; we need a per-token map of how much to trust each token’s source.
  • Example: If doc A has 1.0, all its tokens have 1.0; doc B has 0.6, its tokens get 0.6; instruction and query tokens get 1.0.

šŸž Hook: Think of highlighting strong sources in bright yellow and weaker ones in light yellow. 🄬 Normalization:

  • What it is: A scaling step so the biggest score is 1.0 and others are relative.
  • How it works: Max or min-max normalization; avoid exotic rank-decay unless you’ve tested it.
    • Why it matters: Bad scaling can drown out useful signals or over-amplify weak ones. šŸž Anchor: After max-normalizing, the Paris page tokens are 1.0; desserts page tokens are 0.13 (see the sketch below).
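
Here is a minimal sketch of max-normalization followed by token-level expansion into S_norm; the token counts and variable names are made up for illustration.

```python
import numpy as np

doc_scores = np.array([1.80, 1.10, 0.24])        # aggregated scores for 3 docs
doc_scores = doc_scores / doc_scores.max()       # max-normalize: biggest becomes 1.0

tokens_per_doc = [5, 4, 3]                       # toy token counts per document
doc_token_scores = np.concatenate(
    [np.full(n, s) for s, n in zip(doc_scores, tokens_per_doc)]
)
instr_query_scores = np.ones(6)                  # instruction + query tokens pinned to 1.0
S_norm = np.concatenate([instr_query_scores, doc_token_scores])
print(S_norm.round(2))
```
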
  3. Modulating Attention in Decoding (the heart of OpenDecoder)
  • What happens: During each decode step, the model computes attention scores (QK^T/√d) and applies softmax to get weights. OpenDecoder multiplies/modulates these scores by S_norm first, so high-scored tokens are more likely to influence the next-token distribution.
  • Why this step exists: This is the only place where the LLM decides ā€œwhat to listen toā€ next; shaping it here changes the final answer in a principled way.
  • Example: If two tokens compete—one from the Paris page (1.0) and one from desserts (0.13)—the Paris token gets a much bigger say in the next word.

šŸž Hook: It’s like giving the front-row, well-prepared student a clearer microphone during Q&A. 🄬 Attention Modulation:

  • What it is: A direct, learnable change to the attention computation that ingests S_norm.
  • How it works: Use S_norm inside the attention to scale the score matrix before softmax, then proceed with normal value-weighted sums.
    • Why it matters: Without this, the model may over-focus on long, noisy text even if it’s irrelevant. šŸž Anchor: The model now locks onto ā€œParisā€ tokens when predicting the answer (see the sketch below).
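
A minimal sketch of the modulation step. This version adds log(S_norm) to the pre-softmax scores, which is mathematically equivalent to multiplying the unnormalized attention weights by S_norm; the paper’s exact (learnable) parameterization may differ.

```python
import numpy as np

def modulated_attention(Q, K, V, s_norm):
    """Attention whose pre-softmax scores are shifted by log(S_norm).

    softmax(scores + log(s)) is proportional to exp(scores) * s, so tokens
    from low-scored documents receive proportionally less attention weight.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + np.log(s_norm)[None, :]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(1, 8))                        # the current decode-step query
K, V = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
# 3 instruction/query tokens (1.0), 4 tokens from a good doc (1.0),
# 3 tokens from a weak doc (0.13):
s_norm = np.array([1.0] * 3 + [1.0] * 4 + [0.13] * 3)
print(modulated_attention(Q, K, V, s_norm).shape)  # (1, 8)
```
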
  4. Training Objective and Robustness Training
  • What happens: Fine-tune the model to maximize the likelihood of the ground-truth answers while using the modulated attention. Also train with noisy inputs by replacing half the retrieved docs with partially relevant and irrelevant ones; sometimes shuffle doc order to reduce position bias.
  • Why this step exists: The model must learn to use S_norm effectively and become sturdy against messy retrieval.
  • Example: For a multi-hop question, even if two of ten docs are irrelevant and two are loosely related, the model still learns to trust the top-scored docs.

šŸž Hook: Practicing with tougher drills makes game day easier. 🄬 Robustness Training:

  • What it is: A curriculum where we deliberately introduce noise and order changes.
  • How it works: Swap in noisy docs from outside top-5; sometimes reverse or shuffle positions.
    • Why it matters: Models often assume ā€œearlier = better.ā€ Shuffling breaks this habit and forces reliance on scores, not position. šŸž Anchor: After training, even if good docs land later in the list, the model still attends to them because their scores say they’re good (see the sketch below).
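
A sketch of this data recipe: keep half of the top-k, swap in noisy documents, and shuffle positions. Function and variable names are illustrative, not from the paper’s code.

```python
import random

def make_noisy_context(top_k_docs: list[str], noisy_pool: list[str],
                       rng: random.Random) -> list[str]:
    """Replace the back half of top-k with noisy docs, then shuffle the order."""
    k = len(top_k_docs)
    kept = top_k_docs[: k // 2]                     # keep the better half
    noise = rng.sample(noisy_pool, k - len(kept))   # partly relevant / irrelevant docs
    context = kept + noise
    rng.shuffle(context)                            # break the "earlier = better" habit
    return context

top_k = [f"relevant doc {i}" for i in range(10)]
pool = [f"noisy doc {i}" for i in range(50)]
print(make_noisy_context(top_k, pool, random.Random(0)))
```
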
  5. Inference
  • What happens: At test time, compute the same scores, normalize, build S_norm, and decode with modulated attention to produce the answer.
  • Why this step exists: The training-time trick becomes an everyday superpower: listen more to better sources.
  • Example: For ā€œWho discovered penicillin?ā€, the model strongly weights the doc naming Alexander Fleming and downweights a blog post guessing someone else.

Secret Sauce:

  • Simple, explicit, external guidance (relevance, ranker, QPP) becomes a control panel for attention. Instead of complicated multi-step judge pipelines, OpenDecoder changes one key internal knob—attention—making the model robust with little overhead.

04 Experiments & Results

The Test: The authors measured how well models answer questions when the retrieved documents are clean, somewhat noisy, or very noisy. They used Exact Match (EM), which rewards exactly correct answers, and F1, which also gives credit for partial overlap.
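
As a reference, here is how EM and token-level F1 are commonly computed for QA (SQuAD-style); the paper’s exact normalization rules may differ.

```python
def exact_match(prediction: str, truth: str) -> bool:
    """EM: credit only for an exactly matching (case-insensitive) answer."""
    return prediction.strip().lower() == truth.strip().lower()

def f1(prediction: str, truth: str) -> float:
    """Token-level F1: partial credit for overlapping answer tokens."""
    pred, gold = prediction.lower().split(), truth.lower().split()
    common = sum(min(pred.count(w), gold.count(w)) for w in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(gold)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                  # True
print(round(f1("the capital is Paris", "Paris"), 2))  # 0.4: partial overlap still scores
```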

Datasets:

  • General QA: NQ, TriviaQA, PopQA.
  • Multi-hop QA: HotpotQA, 2WikiMultiHopQA (2Wiki).

Noisy Settings:

  • Normal: top-10 retrieved docs (already can contain mild noise).
  • Noisy: second half of the 10 docs replaced with partly relevant and irrelevant docs.
  • Extreme: all docs sampled from irrelevant sets (simulated retrieval failure).

The Competition (Baselines):

  • Vanilla RAG, Vanilla SFT (no explicit indicators), RobustRAG (isolate-then-aggregate), InstructRAG (self-synthesized rationales), AstuteRAG (retrieval refinement), RbFT (robust fine-tuning with special instructions).

Scoreboard with Context:

  • Big picture: Across all five datasets and all three noise levels, OpenDecoder consistently matches or beats strong baselines and clearly surpasses Vanilla SFT.
  • Normal setting highlights: On PopQA (fact-heavy), OpenDecoder’s F1 ā‰ˆ 56.1 vs. RbFT ā‰ˆ 53.5—like scoring an A while the strong baseline gets an A-. On NQ and TriviaQA, OpenDecoder is competitive or better overall; small dips can occur on a metric for a single dataset, but average performance improves.
  • Noisy setting: OpenDecoder widens the gap. Average F1 and EM both rise above baselines, showing that attention modulation plus robustness training helps the model keep its cool when half the input is junk—like getting a solid B+ while others drop to C.
  • Extreme setting: Even when every document is irrelevant, OpenDecoder still outperforms others (e.g., higher F1/EM than RbFT on multiple datasets). This shows the model learns to fall back on its internal knowledge (query/instruction tokens keep high scores) when external context is untrustworthy.

Surprising Findings:

  • Simple normalization is strong: Max or min-max often works best. Fancy rank-decay (exponential) can hurt, suggesting we should keep score handling simple unless data proves otherwise.
  • Aggregation depends on task: For general QA, the retriever’s score alone can be enough; for multi-hop (harder reasoning), adding LLM ranker and QPP helps.
  • Position tricks matter: Reversing or shuffling document order during training reduces position bias and improves robustness—evidence that models were leaning too much on ā€œearlier means better.ā€
  • Scaling helps: Larger backbones make the benefits of explicit indicators even clearer; smaller models may not fully capitalize on the modulation without enough capacity.

Takeaway: When documents vary in quality—a very common real-world situation—explicitly injecting document quality into attention produces steadier, stronger answers than relying on LLM internals alone.

05 Discussion & Limitations

Limitations:

  • Dependence on retrieval quality: If the retriever is very weak, the signals may still mislead. OpenDecoder lessens the damage but can’t fix totally broken search.
  • Indicator calibration: Choosing weights (e.g., retriever vs. ranker vs. QPP) and normalization matters; poor choices can reduce gains.
  • Extra components: You need a retriever, an LLM-ranker, and a QPP model (or some subset). That adds engineering and compute.
  • Model capacity: Larger models show bigger benefits; small models may underuse the signals.
  • Domain shift: Scores trained on one domain might not transfer perfectly; tuning or adaptive weighting can help.
  • Security/robustness: If an attacker can manipulate scores (e.g., poison retrieval), the guidance can be gamed; safeguards are needed.

Required Resources:

  • A capable retriever (e.g., E5) and access to a large corpus.
  • Optional but helpful: LLM-based ranker and QPP model.
  • Fine-tuning compute for adding the attention modulation parameters.

When NOT to Use:

  • Ultra-low-latency, minimal-stack scenarios where adding ranker/QPP isn’t feasible.
  • Settings where retrieval is nearly perfect and cost/complexity must be minimized.
  • Tiny models that can’t effectively learn the modulation.

Open Questions:

  • Adaptive weighting: Can the model learn per-query weights for different indicators automatically?
  • Better normalizations: Are there principled, data-driven normalizers that consistently beat simple max/min-max?
  • Broader indicators: Can we plug in credibility, recency, or authority scores to further boost trustworthiness?
  • Joint training: What if the retriever, ranker, QPP, and generator are co-trained end-to-end under this decoding modulation?
  • Safety: How to detect and neutralize adversarial manipulation of scores?

06 Conclusion & Future Work

Three-sentence summary: OpenDecoder injects explicit document quality signals into the LLM’s attention during decoding, so the model listens more to better documents and less to noisy ones. It also trains with mixed-quality inputs to build robustness, making RAG systems sturdier when retrieval is messy. Across five datasets and three noise levels, it consistently matches or beats strong baselines.

Main Achievement: Turning external relevance/quality scores into a direct control knob inside the LLM’s attention—reshaping token probabilities at the exact place where answers are formed.

Future Directions: Learn per-query indicator weights automatically; explore new indicators (credibility, recency, authority); co-train retrieval and generation; and scale to larger backbones for even stronger gains.

Why Remember This: RAG isn’t only about finding documents—it’s also about how the model listens to them. OpenDecoder shows that letting the LLM ā€œseeā€ and use quality signals inside attention makes answers more reliable, especially when the world is noisy.

Practical Applications

  • Enterprise search chatbots that prioritize high-quality internal documents when answering employee questions.
  • Customer support assistants that downweight forum chatter and upweight verified help articles.
  • Legal or policy research tools that emphasize authoritative statutes and recent case law over opinion blogs.
  • Medical guideline retrieval systems that prioritize validated sources and de-emphasize outdated or low-credibility pages.
  • Coding assistants that prefer official documentation and trusted repos when suggesting fixes.
  • Educational tutors that highlight accurate textbook paragraphs and ignore distractors.
  • Scientific literature assistants that weight peer-reviewed studies over non-reviewed summaries.
  • News assistants that prioritize primary sources and credible outlets to reduce misinformation.
  • Compliance tools that focus on regulatory documents and ignore marketing fluff.
  • Internal knowledge base Q&A where recency and credibility indicators can be added as extra signals.
#Retrieval-Augmented Generation #LLM Decoding #Attention Modulation #Relevance Scores #LLM Ranker #Query Performance Prediction #Robustness Training #Document Quality Indicators #Noise Tolerance #Position Bias #Score Normalization #RAG Robustness #External Signals #Token Probabilities