
Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Intermediate
Shuangshuang Ying, Zheyu Wang, Yunjian Peng et al. | 1/29/2026
arXiv | PDF

Key Summary

  • This paper builds a safe science ā€œplaygroundā€ called DeR that fairly tests how AI finds facts (retrieval) and how it thinks with those facts (reasoning) without mixing them up.
  • It evaluates the exact same hard question under four controlled modes—no help, concept hints, clean documents, and documents plus distractors—so we can see where models fail.
  • A special two-step check makes sure problems cannot be solved from memory alone but can be solved when the right concepts are provided.
  • Every question comes with a frozen mini-library of recent (2023–2025) theory papers, expert-picked key concepts, and verified reasoning steps to keep results stable and reproducible.
  • Across many top models, Concepts-only scored best on average (about 75%), while Full-set with distractors dropped to about 51%, showing that noise hurts reasoning.
  • Surprisingly, some models do worse when given extra documents than with no documents at all—showing a fragile switch from memory-mode to evidence-mode.
  • Another common failure is naming the right concept but not executing it, like knowing the recipe title but not cooking the dish correctly.
  • The benchmark reports clear ā€œregime gapsā€ that translate into knowledge loss, retrieval loss, and noise-induced loss so teams can pinpoint and fix problems.
  • This work gives researchers a clean way to compare models, debug agent pipelines, and train systems that truly reason over new science, not just recall old facts.

Why This Research Matters

In the real world, we need AI that can both find the right information and reason with it correctly—especially in science, health, law, and engineering. This benchmark separates those skills so builders can fix the exact weak link instead of guessing. Because each question uses a frozen library and fresh, frontier topics, results are stable and not driven by old memorized facts. The clear regime gaps make it easier to choose the right model for a job—one that can ignore noise, extract the right ideas, and execute them step-by-step. Over time, this helps create trustworthy research assistants that don’t just sound confident but earn that confidence with evidence.

Detailed Explanation


01 Background & Problem Definition

šŸž You know how a student can either answer from memory or look things up in books and then explain them? Those are two different skills: remembering and using evidence. Most AI tests used to mix those together.

🄬 The Concept: Retrieval-augmented generation (RAG)

  • What it is: RAG is when an AI fetches outside documents and then uses them to answer a question.
  • How it works:
    1. Receive a question
    2. Search for relevant documents
    3. Read and pick useful parts
    4. Combine them to form an answer
  • Why it matters: Without RAG, an AI might guess from memory and get new or niche facts wrong. šŸž Anchor: Like asking the librarian for the latest science book before writing your report.
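
The loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the paper's pipeline: the word-overlap retriever and the `generate` stub are stand-ins for a real search index and a real LLM call.

```python
# Minimal RAG sketch. The overlap-based `search` and the `generate` stub are
# illustrative stand-ins, not components from the paper.

def search(question: str, corpus: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(corpus, key=lambda d: -len(q_words & set(d.lower().split())))[:k]

def generate(prompt: str) -> str:
    """Placeholder for the LLM call a real system would make here."""
    return "(answer grounded in the evidence above)"

def rag_answer(question: str, corpus: list[str]) -> str:
    docs = search(question, corpus)                    # 2. search for relevant documents
    evidence = "\n".join(f"- {d}" for d in docs)       # 3. read and pick useful parts
    prompt = f"Question: {question}\nEvidence:\n{evidence}\nAnswer:"
    return generate(prompt)                            # 4. combine them to form an answer
```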

šŸž Imagine writing a report using real sources instead of guessing. 🄬 The Concept: Document-grounded reasoning

  • What it is: Thinking that is explicitly supported by documents as evidence.
  • How it works:
    1. Cite where each fact comes from
    2. Connect facts step by step
    3. Conclude only what the evidence supports
  • Why it matters: Without grounding, answers can sound smart but be unfounded. šŸž Anchor: Your teacher gives you an A because you show page numbers for every claim.

šŸž Imagine deciding to bring an umbrella because you read the weather forecast, not because you feel like it might rain. 🄬 The Concept: Evidence-based conclusion making

  • What it is: Drawing conclusions strictly from verified evidence.
  • How it works:
    1. Gather sources
    2. Check relevance and reliability
    3. Build a conclusion that matches the facts
  • Why it matters: Without it, we get confident but incorrect answers. šŸž Anchor: You choose sneakers for PE after reading the school rulebook, not just guessing.

šŸž Think of cooking a fancy dish: you combine multiple ingredients in the right order. 🄬 The Concept: Multi-step synthesis

  • What it is: Combining pieces of information across steps to solve complex problems.
  • How it works:
    1. Select needed ideas
    2. Order them logically
    3. Apply one step at a time
    4. Check that each step supports the next
  • Why it matters: Without synthesis, complex tasks fall apart mid-way. šŸž Anchor: A science fair project that needs planning, data collection, analysis, and a conclusion.

šŸž Imagine following a treasure map that tells you each turn until you find the X. 🄬 The Concept: Chain of thought (CoT)

  • What it is: A step-by-step reasoning trail from question to answer.
  • How it works:
    1. Write each step clearly
    2. Link steps to evidence
    3. Arrive at a checkable conclusion
  • Why it matters: Without CoT, it’s hard to see where thinking went wrong. šŸž Anchor: Showing your math steps so the teacher can spot the one line you miscalculated.

šŸž Imagine a quiz where someone glanced at the answer key before starting. That wouldn’t be a fair test of memory. 🄬 The Concept: Parametric leakage

  • What it is: When an AI solves a test from memorized training data rather than from provided evidence.
  • How it works:
    1. The model has seen similar answers during training
    2. It recalls the answer without re-deriving
    3. The test wrongly looks easy
  • Why it matters: Without guarding against leakage, we overestimate real reasoning ability. šŸž Anchor: A spelling test feels easy if you secretly practiced the exact same word list yesterday.

Before this paper, many evaluations judged the whole RAG pipeline at once: search, rerank, summarize, stitch, and reason. That made it hard to tell which part failed. Was the model missing the key idea? Did it find the right idea but misuse it? Or did extra, noisy documents distract it? Worse, open-web tests are unstable because websites change, links break, and search engines update. Also, if the task can be solved from memory, we can’t tell whether the model truly used the evidence.

This paper’s gap to fill: a clean, reproducible way to separately measure ā€œfinding the right evidenceā€ and ā€œreasoning correctly with it,ā€ using fresh scientific topics that aren’t already in the model’s memory.

Why it matters to daily life: Think of health advice, legal checks, or school research—getting facts right and then reasoning well with those facts is crucial. A model that mixes them up may be confident but wrong. A benchmark that cleanly splits these skills helps build systems you can trust.

02 Core Idea

šŸž Imagine two games: one where you must find the puzzle pieces, and another where the pieces are handed to you and you must assemble them. If you play both, we can tell whether you’re bad at searching for pieces or bad at assembling them.

🄬 The Concept: The Aha! Moment—Decouple retrieval from reasoning

  • What it is: Test the same hard question under four controlled modes so we can tell whether failure comes from not finding the right evidence or from misusing it.
  • How it works:
    1. Instruction-only: no help—tests memory
    2. Concepts-only: hand over gold concepts—tests concept composition
    3. Related-only: give only relevant docs—tests extraction + reasoning without noise
    4. Full-set: mix relevant docs with distractors—tests denoising + reasoning
  • Why it matters: Without this split, a single score hides whether search or thinking is broken. šŸž Anchor: Like grading both the scavenger hunt (finding clues) and the logic puzzle (using clues) separately.

Multiple analogies:

  • Kitchen: Before vs. after. Before, you judged a cook on groceries plus cooking at once—you couldn’t tell if bad taste came from poor shopping or poor cooking. After, you run two tests: (1) hand them perfect ingredients; (2) let them shop themselves. Now you know where it went wrong.
  • Treasure map: Before, finding the treasure mixed map-reading with digging skill. After, you test map-reading alone (a clear map) and digging alone (you mark the spot) to see which skill needs training.
  • Science fair: Before, you graded research quality and presentation together. After, you evaluate literature review (finding the right papers) separately from analysis (correctly applying methods).

šŸž Imagine a science lab that never changes its tools mid-experiment. 🄬 The Concept: Deep-research sandbox (DeR)

  • What it is: A controlled testing environment that preserves real research difficulties but isolates causes of failure.
  • How it works:
    1. Freeze a per-question mini-library (recent theory papers)
    2. Add distractors to mimic messy search
    3. Include expert-annotated concepts and verified reasoning traces
    4. Run the same question under four modes and compare gaps
  • Why it matters: Without a sandbox, web changes and memorization muddy the signal. šŸž Anchor: Like testing sneakers on the same indoor track, not on changing streets.

šŸž Think of a bookshelf where the books never change pages overnight. 🄬 The Concept: Frozen document library

  • What it is: A fixed set of documents attached to each question that never updates.
  • How it works:
    1. Curate relevant papers from 2023–2025
    2. Insert topically similar but non-helpful distractors
    3. Lock the set to prevent future drift
  • Why it matters: Without freezing, results vary with time and place. šŸž Anchor: A classroom test packet that’s the same for every student in every period.
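
One way to picture a frozen item is as an immutable record attached to each question. The field names below are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)            # frozen: the item never changes once curated
class DeRItem:
    instruction: str               # the question
    answer: str                    # concise reference answer
    concepts: tuple[str, ...]      # expert-picked, minimal necessary concepts
    chain_of_thought: str          # verified step-by-step trace using every concept
    related_docs: tuple[str, ...]  # at least one relevant 2023-2025 paper per concept
    noise_docs: tuple[str, ...]    # on-topic distractors that lack solution concepts
```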

šŸž Picture a teacher who hands you exactly the ideas you must use—no more, no less. 🄬 The Concept: Concept-level supervision

  • What it is: Provide and evaluate against the exact set of necessary concepts for a correct solution.
  • How it works:
    1. Experts list the needed concepts
    2. The model must assemble them correctly
    3. Evaluators match used vs. needed concepts for precision/recall
  • Why it matters: Without it, we can’t tell if the model missed or misused a key idea. šŸž Anchor: A checklist of science laws you must apply to solve a physics problem.
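
Concept matching comes down to set precision and recall. The sketch below assumes the concepts the model actually used have already been extracted (e.g., from its chain of thought); the names and numbers are illustrative.

```python
def concept_precision_recall(used: set[str], required: set[str]) -> tuple[float, float]:
    hits = used & required
    precision = len(hits) / len(used) if used else 0.0       # share of used concepts that were needed
    recall = len(hits) / len(required) if required else 0.0  # share of needed concepts that were used
    return precision, recall

p, r = concept_precision_recall({"lemma A", "generic heuristic"},
                                {"lemma A", "lemma B", "algorithm X"})
print(p, r)  # 0.5 0.333... -> one useful concept applied, two required concepts never used
```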

šŸž Imagine giving someone the answer sheet to check they truly needed help. 🄬 The Concept: Two-phase anti-leak validation (guarding against parametric leakage)

  • What it is: Each problem must fail without evidence but become solvable with gold concepts.
  • How it works:
    1. Run models with instruction only: they must fail repeatedly
    2. Run with concepts: they must succeed sometimes
  • Why it matters: Without this, a model could ace the test from memory and we’d be fooled. šŸž Anchor: A tough puzzle you can’t do alone, but you can solve with the right clues.
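
A rough sketch of this two-phase filter is below. `solve` is assumed to wrap an offline model plus answer grading; it is not an API from the paper.

```python
from typing import Callable, Optional

def passes_anti_leak(question: str, gold_concepts: str,
                     solve: Callable[[str, Optional[str]], bool],
                     memory_trials: int = 3) -> bool:
    """Keep a problem only if memory alone fails but the gold concepts rescue it."""
    # Phase 1: with the instruction alone, every attempt must fail (no parametric leakage).
    if any(solve(question, None) for _ in range(memory_trials)):
        return False
    # Phase 2: with the gold concepts provided, at least one attempt must succeed.
    return solve(question, gold_concepts)
```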

Before vs. After:

  • Before: One tangled RAG score hid retrieval errors, reasoning mistakes, distractions, and memorization.
  • After: Four-mode testing shows which part breaks: knowledge gap, document-to-concept extraction gap, or noise-induced derailment.

Why it works (intuition, not equations): Bright lines between modes create interpretable gaps. If you do well with concepts but fail with related docs, extraction is the bottleneck. If you do well with related docs but fail when noise is added, denoising is the bottleneck. If you fail even with concepts, concept composition and scheduling need work.
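
As a toy illustration of that intuition (the 0.7 cutoff is an arbitrary choice for the sketch, not a value from the paper):

```python
def bottleneck(concepts: float, related: float, full: float, ok: float = 0.7) -> str:
    """Toy diagnosis from per-mode accuracy; the 0.7 cutoff is illustrative."""
    if concepts < ok:
        return "concept composition and scheduling"   # fails even with gold concepts
    if related < ok:
        return "document-to-concept extraction"       # fine with concepts, weak on clean docs
    if full < ok:
        return "denoising / noise robustness"         # fine on clean docs, derailed by distractors
    return "no obvious bottleneck"

print(bottleneck(concepts=0.75, related=0.63, full=0.51))  # document-to-concept extraction
```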

Building blocks:

  • Four evaluation modes (memory, concepts, clean docs, noisy docs)
  • Frozen doc libraries with distractors
  • Expert concepts + validated reasoning traces
  • Regime gaps that map to knowledge, retrieval, and noise-induced losses
  • Simple, reproducible scoring for answer correctness and concept usage

03 Methodology

At a high level: Input (Frontier question) → Pick a mode (Instruction-only / Concepts-only / Related-only / Full-set) → Model reasons and answers → Scoring + gap analysis.

šŸž You know how a teacher can test memory first, then give hints, then give the textbook, and finally give the textbook plus extra off-topic pages? 🄬 The Concept: Four evaluation regimes (modes)

  • What it is: Four controlled input setups for the same question to isolate skills.
  • How it works:
    1. Instruction-only: only the question—tests parametric knowledge
    2. Concepts-only: question + gold concepts—tests concept composition and scheduling
    3. Related-only: question + only relevant docs—tests extraction + reasoning without noise
    4. Full-set: question + relevant docs + distractors—tests denoising + reasoning
  • Why it matters: Without comparing these, we cannot attribute errors precisely. šŸž Anchor: A spelling test with no word list, then a hint list, then the right dictionary pages, and finally those pages mixed with look-alike but wrong pages.
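
A test harness could assemble the four inputs from one item roughly as below; the prompt wording and the shuffling choice are assumptions, not the paper's exact format.

```python
import random

def build_input(question: str, concepts: list[str],
                related: list[str], noise: list[str], mode: str) -> str:
    if mode == "instruction-only":
        hints, docs = [], []
    elif mode == "concepts-only":
        hints, docs = concepts, []
    elif mode == "related-only":
        hints, docs = [], list(related)
    elif mode == "full-set":
        # relevant docs mixed with distractors, order shuffled
        hints, docs = [], random.sample(related + noise, len(related) + len(noise))
    else:
        raise ValueError(f"unknown mode: {mode}")
    parts = [f"Question: {question}"]
    if hints:
        parts.append("Concepts to use:\n" + "\n".join(f"- {c}" for c in hints))
    if docs:
        parts.append("Documents:\n\n" + "\n\n".join(docs))
    return "\n\n".join(parts)
```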

Data recipe (how each item is built):

  1. Source selection: Annotators (PhD students) pick 2023–2025 theory papers—no pure experiments—so answers come from concepts and derivations, not lab measurements.
  2. Item construction: They write the Instruction (question), a concise Answer, the Concepts list (only necessary theory items), and a step-by-step Chain of Thought (CoT) that uses every listed concept.
  3. Difficulty control (anti-leak): Offline LLMs must fail three times with Instruction-only, but succeed at least once with Concepts-only.
  4. Document set curation: For every concept, include at least one Related document; add Noise documents that are on-topic but do not contain solution concepts; ensure no doc leaks the answer verbatim.
  5. Review and QA: Senior reviewers re-run the checks, verify science correctness, and confirm no answer leakage.

šŸž Imagine puzzle makers who test that you can’t guess the solution from memory but can solve it if given the right clues. 🄬 The Concept: Oracle concepts

  • What it is: The exact set of needed ideas, handed to the model to remove retrieval errors.
  • How it works:
    1. Experts list the minimal necessary concepts
    2. The model must assemble them properly
    3. If it still fails, the issue is reasoning, not retrieval
  • Why it matters: Without oracle concepts, we can’t cleanly separate thinking from searching. šŸž Anchor: The teacher gives you Newton’s laws and asks you to derive the result step-by-step.

šŸž Think of worksheets that include tricky extra pages to see if you can ignore irrelevant stuff. 🄬 The Concept: Distractors (noise documents)

  • What it is: On-topic but unhelpful docs mixed in with the useful ones.
  • How it works:
    1. Curate plausible but non-solution papers
    2. Mix them with relevant docs
    3. Test if the model can ignore them and stay on track
  • Why it matters: Without distractors, we overestimate robustness in real-world messy search. šŸž Anchor: A recipe page mixed with similar pages that use the wrong ingredients.

Evaluation protocol:

  • Same interface across models, no web access. All signals must come from the provided inputs (concepts or docs).
  • Overlong inputs are trimmed deterministically (keeping the front and back, with a truncation marker in between) so all models face the same context budget; see the sketch after this list.
  • Decoding is standardized (e.g., temperature and nucleus sampling) and each setting is run twice, reporting average accuracy.
  • Answers are scored for correctness with task-specific normalization (numeric tolerance, symbolic equivalence, or checklist).
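
Deterministic front-and-back truncation might look like the sketch below; the character budget and marker text are placeholders, since the paper's exact settings aren't reproduced here.

```python
def truncate(text: str, budget: int = 8000,
             marker: str = "\n[... middle truncated ...]\n") -> str:
    """Keep the front and back of an overlong input, drop the middle deterministically."""
    if len(text) <= budget:
        return text
    keep = (budget - len(marker)) // 2
    return text[:keep] + marker + text[-keep:]
```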

šŸž Imagine measuring how much score you lose when someone takes away your shopping list vs. when someone puts junk items in your cart. 🄬 The Concept: Retrieval loss vs. noise-induced loss (interpretable gaps)

  • What it is: Gaps between mode scores that tell you what broke.
  • How it works:
    1. Knowledge loss: Concepts-only vs. Instruction-only (how much concepts help)
    2. Retrieval loss: Concepts-only vs. Related-only/Full-set (document-to-concept extraction cost)
    3. Noise-induced loss: Related-only vs. Full-set (cost of distractors)
  • Why it matters: Without gaps, we can’t diagnose precise failure sources. šŸž Anchor: If you ace cooking with perfect ingredients but struggle when you must shop yourself, shopping (retrieval) is the issue.
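
In code, the three gaps are simple differences between mode accuracies (a sketch; the function and key names are ours):

```python
def regime_gaps(instruction: float, concepts: float,
                related: float, full: float) -> dict[str, float]:
    return {
        "knowledge_loss": concepts - instruction,  # what the gold concepts add over memory alone
        "retrieval_loss": concepts - related,      # cost of extracting concepts from documents
        "noise_induced_loss": related - full,      # cost of the added distractors
    }
```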

Illustrative example:

  • Suppose the true answer requires two lemmas and an algorithm.
    • Instruction-only: the model guesses and gets it wrong.
    • Concepts-only: given the lemmas and algorithm steps, it gets 9/10—still slips on ordering.
    • Related-only: given only the relevant papers, it extracts lemmas but misses a setup detail—scores 7/10.
    • Full-set: with distractors added, it chases a look-alike method—scores 5/10.
  • Diagnosis: Reasoning is decent but brittle; document extraction and noise robustness need work.

The secret sauce:

  • Carefully constructed, frozen mini-libraries + expert concept checklists + four clean modes = interpretable, reproducible gaps. These make hidden problems (like getting derailed by noise or mis-executing a known concept) visible and fixable.

04 Experiments & Results

The test: Measure answer accuracy under four modes to see whether the model (a) knows enough from memory, (b) can assemble given concepts, (c) can extract the right ideas from clean docs, and (d) can stay on track when noise is added. Secondary analyses inspect concept precision/recall and classify reasoning errors using CoTs.

The competition: A wide range of frontier systems were evaluated under the same rules (no web, same decoding, same truncation), including leading closed and open models. The goal wasn’t to crown a single winner but to expose where capabilities differ and where there’s headroom.

Scoreboard with context (averages across models):

  • Concepts-only ā‰ˆ 75%: Like getting a solid A when the teacher hands you the correct formulas. This shows many models can assemble concepts reasonably well once retrieval is removed.
  • Related-only ā‰ˆ 63%: A noticeable drop—like going from an A to a C+/B-—because extracting the correct concepts from documents is still hard.
  • Full-set ā‰ˆ 51%: Another drop with distractors—like sliding to a C, showing noise often derails reasoning.
  • Instruction-only ā‰ˆ 56%: Surprisingly, this sits above the Full-set average, meaning extra documents sometimes hurt more than help.
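
Plugging these approximate cross-model averages into the gap definitions from the Methodology section gives a rough back-of-envelope picture (rounded; exact per-model numbers will differ):

```python
concepts_only, related_only, full_set, instruction_only = 0.75, 0.63, 0.51, 0.56
print("knowledge loss:     ", round(concepts_only - instruction_only, 2))  # ~0.19
print("retrieval loss:     ", round(concepts_only - related_only, 2))      # ~0.12
print("noise-induced loss: ", round(related_only - full_set, 2))           # ~0.12
```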

šŸž Imagine switching from ā€˜I’ll try from memory’ to ā€˜I’ll now use the evidence,’ but your brain stumbles during the switch. 🄬 The Concept: Mode-switch fragility

  • What it is: Performance gets worse when adding documents because the model fumbles the shift from memory-based to evidence-based reasoning.
  • How it works:
    1. Documents enter the context
    2. The model abandons a workable memory path
    3. It fails to anchor a new, document-grounded chain
  • Why it matters: Without reliable mode control, retrieval can paradoxically reduce accuracy. šŸž Anchor: You were solving a math problem fine in your head, but when someone opened a textbook beside you, you got distracted and lost your train of thought.

šŸž Think of a cook who can say the name of a recipe but can’t follow the steps. 🄬 The Concept: Structural concept misuse

  • What it is: The model names the right concepts but fails to execute them as procedures.
  • How it works:
    1. Recognizes the relevant theorem/algorithm
    2. Replaces precise steps with generic heuristics
    3. Produces plausible but wrong conclusions
  • Why it matters: Without execution fidelity, knowledge of labels doesn’t become correct solutions. šŸž Anchor: You know ā€œlong divisionā€ by name but do the steps out of order and get the wrong quotient.

Other controlled findings:

  • Noise isn’t just dilution: As noise documents increase, scores decline nonlinearly—small early distractions can cause irreversible ā€˜trajectory drift.’
  • Concept count hurts: Even with gold concepts provided, more required concepts reduce accuracy—coordination and ordering are real bottlenecks.
  • Deep chains struggle: More reasoning steps amplify extraction and state-tracking errors, especially for formula derivations that need constructive procedures.

Concept extraction metrics (from CoTs) show concept precision/recall well below perfect outside Concepts-only, confirming that document-to-concept extraction is a major pain point. Error labeling highlights four frequent categories: missing core concepts, errors in reasoning process, misuse of core concepts, and numeric/formal slips. In instruction-only, missing concepts dominate (as expected); in full-set, missing concepts remain common but are joined by derailments caused by noise.

Bottom line: ā€˜Concepts-only is best’ reveals that removing retrieval makes models look smart, but the drop from Related-only to Full-set proves that handling realistic, noisy evidence remains a core unsolved challenge.

05 Discussion & Limitations

Limitations:

  • Curated data quality: The benchmark’s value depends on careful concept lists, well-chosen distractors, and precise CoTs; weak curation would blur diagnoses.
  • Domain scope: Focus on theory-heavy papers (2023–2025) maximizes concept-driven reasoning but under-represents experimental or multimodal evidence.
  • Evaluation proxy: Answer accuracy and concept matching are informative but still imperfect stand-ins for truly understanding reasoning quality.
  • Context limits: Some models struggle with long inputs; deterministic truncation helps fairness but may prune helpful content.

Required resources:

  • Expert annotators/reviewers for high-quality concepts and CoTs
  • Storage and tooling for frozen libraries and standardized interfaces
  • Models with sufficient context windows and stable decoding APIs

When not to use:

  • If your task requires live web browsing, rapidly changing sources, or multimodal inputs (e.g., figures/labs), DeR’s frozen text libraries won’t capture full complexity.
  • If you need pure retrieval system benchmarking at web scale, an open-web retrieval benchmark is more appropriate.
  • If your goal is general chit-chat or stylistic generation, DeR’s scientific focus may be overkill.

Open questions:

  • Can we train robust ā€˜mode controllers’ that reliably switch to evidence-grounded reasoning without derailment?
  • What architectures best execute procedural concepts (algorithms, theorems) step-by-step from text?
  • How can we make models resistant to early-step trajectory drift in noisy contexts?
  • Can concept-coordination scaffolds (like planners or symbolic checkers) raise Full-set performance without handholding?
  • How should we extend DeR to multimodal scientific evidence while preserving the same clean decoupling?

06 Conclusion & Future Work

Three-sentence summary: This paper introduces DeR, a controlled benchmark that cleanly separates retrieval from reasoning by testing the same hard scientific questions under four input modes. Frozen document libraries, expert concepts, and validated reasoning traces enable precise, reproducible diagnosis of knowledge gaps, document-to-concept extraction failures, and noise-induced derailments. Results across leading models show large headroom and highlight two key failures: mode-switch fragility and structural concept misuse.

Main achievement: Turning a single, tangled RAG score into interpretable regime gaps that pinpoint whether models fail to find the right ideas, to execute them, or to stay focused amid distractors.

Future directions: Train explicit mode controllers, build procedure-execution modules for constructive concepts, add adaptive denoising strategies, and expand to multimodal evidence while keeping reproducibility. Use regime gaps as training targets to directly optimize retrieval, extraction, and coordination.

Why remember this: When you truly separate ā€˜finding facts’ from ā€˜thinking with facts,’ you get clear insights that help choose better models, fix the right problems, and build trustworthy research agents. DeR makes that separation practical, fair, and stable—pushing AI from sounding smart to being reliably right over new science.

Practical Applications

  • Choose the best model for your RAG system by comparing regime gaps (e.g., prefer models with low noise-induced loss for messy corpora).
  • Diagnose pipeline issues: if Concepts-only is high but Related-only is low, invest in better document-to-concept extraction.
  • Train ā€˜mode controllers’ using Full-set failures to improve switching into evidence-grounded reasoning.
  • Create curriculum fine-tuning sets that emphasize procedural concept execution where structural misuse is common.
  • Design retrieval filters and rerankers by measuring how different distractor types affect noise-induced loss.
  • Evaluate prompt engineering changes under all four modes to see whether improvements reflect real reasoning or just better guessing.
  • Benchmark long-context upgrades by testing whether Related-only accuracy rises without harming Full-set robustness.
  • Use concept-level precision/recall dashboards to guide targeted data augmentation for missing concepts.
  • Set acceptance gates in production: require Concepts-only accuracy above a threshold before deploying retrieval-heavy features (see the sketch after this list).
  • Run ablations on document set size and distractor density to find your system’s safe operating zone.
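
As one concrete example of the acceptance-gate idea above, a deployment check might look like this; the thresholds are illustrative assumptions, not recommendations from the paper.

```python
def ready_to_ship(concepts_acc: float, related_acc: float, full_acc: float,
                  min_concepts: float = 0.80, max_noise_loss: float = 0.10) -> bool:
    """Toy acceptance gate built on the regime scores; thresholds are placeholders."""
    if concepts_acc < min_concepts:                       # gate 1: must compose given concepts well
        return False
    return (related_acc - full_acc) <= max_noise_loss     # gate 2: distractors must not cost too much

print(ready_to_ship(0.82, 0.70, 0.55))  # False: a noise-induced loss of 0.15 exceeds the gate
```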
Tags: retrieval-augmented generation, document-grounded reasoning, deep research benchmark, frozen document library, concept-level supervision, chain-of-thought evaluation, error attribution, mode-switch fragility, noise robustness, procedural concept execution, RAG diagnosis, evidence-based reasoning, long-context evaluation, scientific synthesis, decoupled retrieval and reasoning
Version: 1