A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces
Key Summary
- A-RAG lets the AI choose how to search, what to read, and when to stop, instead of following a fixed recipe.
- It gives the model three tools (keyword search, semantic search, and chunk read) so it can zoom in or out like using a microscope.
- Across multi-hop question-answering benchmarks, A-RAG beats popular GraphRAG and workflow-based methods while reading similar or fewer tokens.
- Even a very simple "Naive A-RAG" (only semantic search) already outperforms many strong baselines, proving agentic retrieval matters.
- With stronger backbone models (like GPT-5-mini), A-RAG's gains get bigger, showing it scales with model capability.
- A-RAG improves further when you allow more thinking steps at test time, meaning performance rises with extra compute.
- An ablation study shows all three tools matter: removing keyword search, semantic search, or chunk read hurts results.
- A simple ReAct-style loop and a context tracker help the agent avoid rereading the same text and wasting tokens.
- Most A-RAG failures come from reasoning chain issues (like confusing entities), suggesting future work should improve reasoning, not just retrieval.
- The team will release code and an evaluation suite, making it easier for others to build and test agentic RAG systems.
Why This Research Matters
A-RAG helps AI act more like a careful student: search broadly, focus where it counts, and verify facts before answering. That means better answers for everyday tools like homework helpers, customer support bots, and coding assistants. It saves reading costs by pulling fewer, more relevant tokens, which makes systems faster and cheaper to run. Because the model controls the strategy, performance improves naturally as models get smarter, without rebuilding complex pipelines. The approach works especially well on multi-hop questions that require piecing together clues from different places. By shifting the bottleneck from "can we find it?" to "can we reason about it correctly?", A-RAG points future work toward better reasoning and verification tools.
Detailed Explanation
01 Background & Problem Definition
You know how when you study for a test, you don't always flip through the whole textbook at once? Sometimes you quickly scan the index (keywords), sometimes you look for ideas even if the exact words are different (meaning), and sometimes you read a full page carefully. That smart switching is how people save time and find better answers. Traditional Retrieval-Augmented Generation (RAG) systems didn't work that way. They usually did one of two things: they either grabbed a handful of passages in one big gulp and jammed them into the model's input, or they followed a fixed, step-by-step workflow someone designed ahead of time. Both approaches limited the model's ability to act like a thoughtful student.
🍞 Top Bread (Hook): Imagine you're searching a library. You might try exact words from your homework question first. 🥬 The Concept (Keyword Search): It is a way to find information by matching the exact words you type. How it works:
- Pick 1-3 specific words or names.
- Search for places where those words appear.
- Rank spots where the words show up most clearly. Why it matters: Without keyword search, you'd miss exact names, dates, or terms that are crucial. 🍞 Bottom Bread (Anchor): If you want to know who wrote "Hamlet," searching the keyword "Shakespeare" jumps right to the answer.
🍞 Top Bread (Hook): Sometimes you don't know the exact words, but you know the idea, like "dogs that help blind people." 🥬 The Concept (Semantic Search): It is a way to find information based on meaning, not just exact words. How it works:
- Turn your question into a meaning-number (an embedding).
- Compare it to meaning-numbers of sentences.
- Pick the most similar ones, even if the words differ. Why it matters: Without semantic search, you'd miss answers phrased differently than your question. 🍞 Bottom Bread (Anchor): Asking "guide dogs" can still find "seeing-eye dogs," even if the exact phrase is different.
🍞 Top Bread (Hook): When you think a section looks promising, you read the whole page to be sure. 🥬 The Concept (Chunk Read): It is reading a whole chunk (about a page) after a search says, "This looks relevant." How it works:
- Use search to find snippets.
- Choose promising chunk IDs.
- Open the full chunks (and sometimes neighbors) to get complete details. Why it matters: Without chunk read, you'd rely on tiny snippets and risk missing key context. 🍞 Bottom Bread (Anchor): If a snippet mentions a person's birth year, the full chunk might also explain their career.
Before A-RAG, many RAG systems used graphs or fixed workflows. Graph RAGs built networks of entities and relationships, which can be helpful but are heavy to prebuild and still follow preset retrieval rules. Workflow RAGs let models call tools but stick to a designer's plan (e.g., always "do A, then B, then C"). The problem? Today's language models have grown great at reasoning and using tools, but these older styles didn't let the model decide its own strategy, change plans mid-way, or know when it had enough evidence.
The gap: a system that lets the model drive, deciding which search style to use, when to zoom in, and when to stop. That's where A-RAG comes in with "hierarchical retrieval interfaces", which work like three dials the model can turn: keyword-level, sentence-level (semantic), and chunk-level (full content). The real-world stakes are big: better customer support that finds accurate answers faster; coding helpers that fetch the right API details; study assistants that collect facts without stuffing the model's context window; research tools that can chase multi-step clues across many documents. If the model can act more like a careful student (searching, checking, and deciding), it can answer more correctly while reading less.
People tried many things before: complex graphs, multi-agent workflows, query rewriting, and reranking. These helped sometimes, but they were still rigid or expensive to build. They didn't naturally improve as models' reasoning got better. A-RAG's idea is to expose simple, layered tools directly to the model and let it orchestrate. That way, as models improve, performance rises without redesigning the whole pipeline.
02 Core Idea
The "Aha!" in one sentence: Give the model simple, layered search-and-read tools (keywords, meaning, and full chunks) and let it choose how to use them, step by step.
🍞 Top Bread (Hook): You know how a detective switches between scanning headlines, following leads, and reading whole case files? 🥬 The Concept (Agentic RAG): It is a RAG setup where the model decides the retrieval strategy itself: what to search, how to search, what to open, and when to stop. How it works:
- The model thinks aloud and chooses a tool.
- It gets results, updates its plan, and chooses the next tool.
- It repeats until it has enough evidence to answer. Why it matters: Without agentic choices, the model follows rigid recipes and can't adapt to tricky, multi-step questions. 🍞 Bottom Bread (Anchor): For "Where was the author of Dune born?", an agentic model might search for "author of Dune" (keyword), find "Frank Herbert," then semantically search for his birthplace, then read the specific chunk to confirm "Tacoma, Washington."
🍞 Top Bread (Hook): Think of a camera lens that can switch between wide, medium, and close-up shots. 🥬 The Concept (Hierarchical Retrieval Interfaces): It is a set of tools arranged from fine to coarse: keyword matches (fine), sentence-level meaning matches (medium), and full chunk reads (coarse). How it works:
- Start with quick scans (keywords or semantics) to gather clues.
- Pick the most promising chunks.
- Open the full chunks only when necessary for details. Why it matters: Without this ladder of detail, you either miss precision (too coarse) or waste tokens (too fine, everywhere). 🍞 Bottom Bread (Anchor): To find a treaty's exact clause, first keyword-search the treaty name, then semantically match the clause topic, then read the exact chunk containing the legal wording. (A sketch of how these three tools could be declared for a function-calling model follows below.)
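The sketch below shows one way the three interfaces might be exposed to a tool-calling LLM as JSON-Schema-style declarations. The tool names match those used later in this explainer (keyword_search, semantic_search, chunk_read); the parameter names and descriptions are illustrative assumptions, not the paper's exact schema.

```python
# A minimal, hypothetical declaration of the three retrieval tools in a generic
# JSON-Schema style, as commonly used for LLM function calling. Parameter names
# and descriptions are illustrative assumptions, not the paper's exact schema.
RETRIEVAL_TOOLS = [
    {
        "name": "keyword_search",
        "description": "Literal match on 1-3 exact terms; returns snippets with chunk IDs.",
        "parameters": {
            "type": "object",
            "properties": {"keywords": {"type": "array", "items": {"type": "string"}}},
            "required": ["keywords"],
        },
    },
    {
        "name": "semantic_search",
        "description": "Embedding-based match for a natural-language query; returns snippets with chunk IDs.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "chunk_read",
        "description": "Return the full text of the given chunk IDs (coarse, token-expensive).",
        "parameters": {
            "type": "object",
            "properties": {"chunk_ids": {"type": "array", "items": {"type": "integer"}}},
            "required": ["chunk_ids"],
        },
    },
]
```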
Three analogies for the idea:
- Detective kit: magnifying glass (keywords), map of related clues (semantic), complete case file (chunk read).
- Web browsing: site search box (keywords), Google-style meaning search (semantic), opening the full article (chunk read).
- Kitchen prep: scanning labels (keywords), smelling and matching flavors (semantic), cooking the full recipe (chunk read).
Before vs. After:
- Before: One-shot retrieval or fixed workflows; models couldn't change strategy mid-task.
- After: The model adapts on the fly: sometimes a quick keyword is enough; other times it chains semantic hops and then reads just a few full chunks to verify.
Why it works (intuition, no equations):
- Different questions need different "zoom levels." Exact names and dates pop with keywords. Fuzzy concepts need semantic meaning. Final facts require full reading.
- By interleaving decisions and observations, the model uses feedback to refine the next step, like a scientist forming and testing hypotheses.
- Progressive disclosure saves context: snippets guide you; full reads are reserved for the most relevant places.
Building blocks:
- Multi-granularity tools: keyword search, semantic search, chunk read.
- Lightweight index: chunking and sentence embeddings enable fast meaning matches.
- Simple agent loop: a ReAct-style think-act-observe-repeat process.
- Context tracker: avoids rereading the same chunks, saving tokens, nudging exploration.
03 Methodology
At a high level: Question → (Search: keyword or semantic) → (Select promising chunks) → Chunk Read (full text) → Reason/decide next step → Final Answer.
🍞 Top Bread (Hook): You know how you first skim a textbook chapter title, then look at a paragraph, and finally read the page fully if it looks right? 🥬 The Concept (Hierarchical Index): It is a way of organizing the corpus so the model can match at the sentence level but still jump to full chunks when needed. How it works:
- Chunking: Split documents into ~1000-token, sentence-aligned chunks to keep meaning intact.
- Embedding: Turn each sentence into a meaning-number (embedding) for semantic search.
- Keyword layer: Do literal text matches at query time, no heavy prebuilt index. Why it matters: Without this structure, searches are either too vague or too expensive, and the agent can't smoothly move from snippets to full context. 🍞 Bottom Bread (Anchor): If the question is about "Einstein's 1905 papers," sentence embeddings can find related sentences; chunk links let the agent open the full section to read details. (A minimal indexing sketch follows below.)
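The sketch assumes a generic `embed(sentences)` callable standing in for any sentence-embedding model, approximates the ~1000-token chunk budget with a word count, and keeps sentences aligned to chunk boundaries. Names such as `Chunk` and `build_index` are illustrative, not from the paper.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: int
    sentences: list          # sentence strings, in order
    embeddings: list = None  # one vector per sentence, filled in by build_index

def split_sentences(text):
    # Naive sentence splitter; a real system would use a proper sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_index(documents, embed, max_words=750):
    """Split documents into sentence-aligned chunks and embed every sentence.

    `embed(list_of_sentences) -> list_of_vectors` is an assumed external
    embedding model; `max_words` is a rough stand-in for the ~1000-token budget.
    """
    chunks = []
    for doc in documents:
        current, count = [], 0
        for sent in split_sentences(doc):
            n = len(sent.split())
            if current and count + n > max_words:
                chunks.append(Chunk(len(chunks), current))  # close the current chunk
                current, count = [], 0
            current.append(sent)
            count += n
        if current:
            chunks.append(Chunk(len(chunks), current))
    for chunk in chunks:
        chunk.embeddings = embed(chunk.sentences)  # sentence-level vectors per chunk
    return chunks
```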
Step A: Keyword Search (fine-grained, literal)
- What happens: The agent picks 1-3 precise terms (names, dates, technical words) and retrieves chunks that contain them. It gets back short snippets showing the matched sentences plus chunk IDs. (A code sketch follows after this list.)
- Why it exists: Grabs exact entities and numbers quickly; great for pinpoint facts or confirming names.
- Example: For "What city hosted the 1900 World's Fair?", keywords like "Exposition Universelle 1900" retrieve chunks with the exact phrase.
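Here is a minimal sketch of the literal keyword layer, operating directly on a chunk_id → sentences mapping. Ranking by the number of matching sentences and keeping three snippet sentences per chunk are illustrative assumptions; the paper does not prescribe a scoring rule.

```python
def keyword_search(chunks, keywords, top_k=5):
    """Literal keyword match over chunks.

    `chunks` maps chunk_id -> list of sentences. Ranking by number of matching
    sentences is an illustrative assumption.
    """
    terms = [k.lower() for k in keywords]
    scored = []
    for chunk_id, sentences in chunks.items():
        hits = [s for s in sentences if any(t in s.lower() for t in terms)]
        if hits:
            scored.append((len(hits), chunk_id, hits[:3]))  # keep a few snippet sentences
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(chunk_id, snippets) for _, chunk_id, snippets in scored[:top_k]]

# Example with a tiny, made-up corpus:
corpus = {
    0: ["Dune is a 1965 novel by Frank Herbert.", "It is set on the desert planet Arrakis."],
    1: ["Tacoma is a city in Washington state."],
}
print(keyword_search(corpus, ["Dune", "Herbert"]))
# -> [(0, ['Dune is a 1965 novel by Frank Herbert.'])]
```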
Step B: Semantic Search (meaning-based, sentence level)
- What happens: The agent writes a short natural-language query, which is embedded and compared to all sentence embeddings; top matches are grouped by chunk and returned as snippets with chunk IDs.
- Why it exists: Finds related ideas even when the words differ (synonyms, paraphrases).
- Example: Asking "capital of the country between France and Spain" matches sentences mentioning "Andorra" and "Andorra la Vella," even if the exact question words aren't present. (A code sketch follows after this list.)
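A minimal sketch of the sentence-level semantic layer. It assumes an `embed(text)` callable (any embedding model) and a `sentence_index` of (chunk_id, sentence, vector) triples prepared at indexing time; the plain-Python cosine similarity and the top-sentence cutoff are illustrative choices.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_search(query, embed, sentence_index, top_sentences=10):
    """Meaning-based search: rank sentences by embedding similarity, group by chunk.

    `embed(text) -> vector` is an assumed embedding model; `sentence_index` is a
    list of (chunk_id, sentence, vector) triples built at indexing time.
    """
    q = embed(query)
    ranked = sorted(sentence_index, key=lambda item: cosine(q, item[2]), reverse=True)
    grouped = {}
    for chunk_id, sentence, _ in ranked[:top_sentences]:
        grouped.setdefault(chunk_id, []).append(sentence)  # snippets grouped by chunk ID
    return grouped
```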
Step C: Chunk Read (full context)
- What happens: Given promising chunk IDs, the agent opens the full chunk text (and neighbors if needed) to confirm details.
- Why it exists: Snippets can be incomplete; answers often require surrounding context for disambiguation.
- Example: After finding "Herbert" in a snippet, reading the chunk confirms it's "Frank Herbert" and provides birth details. (A code sketch follows after this list.)
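A minimal sketch of the coarse read step over the same chunk_id → sentences mapping. The optional neighbor expansion illustrates the "and neighbors if needed" behaviour and is an assumption about how that might be implemented.

```python
def chunk_read(chunks, chunk_ids, include_neighbors=False):
    """Return the full text of the requested chunks.

    `chunks` maps chunk_id -> list of sentences. Pulling in adjacent chunk IDs
    when `include_neighbors` is set is an illustrative choice.
    """
    wanted = set(chunk_ids)
    if include_neighbors:
        for cid in list(wanted):
            wanted.update(i for i in (cid - 1, cid + 1) if i in chunks)
    return {cid: " ".join(chunks[cid]) for cid in sorted(wanted) if cid in chunks}
```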
🍞 Top Bread (Hook): Think of solving a maze by moving step-by-step, checking each turn before picking the next. 🥬 The Concept (ReAct-style Agent Loop): It is an iterative think-act-observe cycle where the model chooses a tool, sees the result, reasons again, and repeats until it answers. How it works:
- The model reads the history and available tools.
- It decides to call keyword_search, semantic_search, or chunk_read.
- It observes outputs (snippets or full text), updates its plan, and continues. Why it matters: Without iteration, the model can't course-correct when early guesses are off. 🍞 Bottom Bread (Anchor): For a two-hop question, the agent might first find the person, then find that person's birthplace in a second step. (A minimal loop sketch follows below.)
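The loop can be pictured as a few lines of control flow. Here `llm_decide(history)` is an assumed LLM call that returns either a final answer or a tool call, and `tools` maps the three tool names above to callables; this is a sketch of the pattern, not the authors' implementation.

```python
def run_agent(question, llm_decide, tools, max_steps=20):
    """Minimal ReAct-style loop: think, act, observe, repeat.

    `llm_decide(history) -> dict` is an assumed LLM call returning either
    {"final_answer": "..."} or {"tool": name, "args": {...}}. `tools` maps
    keyword_search, semantic_search, and chunk_read to callables.
    """
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = llm_decide(history)                              # think + choose an action
        if "final_answer" in decision:
            return decision["final_answer"]                         # enough evidence: stop
        observation = tools[decision["tool"]](**decision["args"])   # act: call the chosen tool
        history.append({"action": decision, "observation": observation})  # observe: feed back
    return "No answer within the step budget."
```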
🍞 Top Bread (Hook): Imagine placing sticky notes on pages you've already read so you don't waste time rereading them. 🥬 The Concept (Context Tracker): It is a memory of which chunks were fully read, so the system avoids sending them again (saving tokens). How it works:
- Maintain a set of read chunk IDs.
- If the agent tries to reread, return a tiny note: "Already read."
- Encourage exploration of new chunks. Why it matters: Without it, the model can loop and balloon token usage. 🍞 Bottom Bread (Anchor): If the agent tries to reopen chunk #42 again, it gets a reminder and stays efficient. (A minimal tracker sketch follows below.)
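A minimal sketch of such a tracker; the class name and the exact reminder string are illustrative assumptions.

```python
class ContextTracker:
    """Remember which chunks were already returned in full, so rereads cost
    almost nothing (a sketch of the token-saving idea described above)."""

    def __init__(self):
        self.read_ids = set()

    def filter(self, chunk_texts):
        """Replace already-read chunks with a short reminder instead of the full text."""
        filtered = {}
        for chunk_id, text in chunk_texts.items():
            if chunk_id in self.read_ids:
                filtered[chunk_id] = "[Already read earlier; not resent to save tokens.]"
            else:
                self.read_ids.add(chunk_id)
                filtered[chunk_id] = text
        return filtered

tracker = ContextTracker()
tracker.filter({42: "full text of chunk 42 ..."})  # first read: full text is returned
tracker.filter({42: "full text of chunk 42 ..."})  # second read: only the short reminder
```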
Putting it together, here is why this recipe is clever:
- Progressive disclosure: Start cheap (snippets), pay more only where it counts (full reads).
- Multi-granularity control: The agent picks the right tool for names, meanings, or full details.
- Minimal orchestration: A simple loop highlights that the win comes from the interface, not fancy plumbing.
- Token thriftiness: The context tracker and snippet-first design enable higher accuracy with fewer retrieved tokens.
Concrete data example:
- Suppose the question: "Which city is the birthplace of the author of Dune?"
- Keyword search: ["Dune", "author"] → snippet shows "Frank Herbert".
- Semantic search: query "Frank Herbert birthplace" → snippet hints "Tacoma, Washington".
- Chunk read: open that chunk to confirm and cite. Final answer: "Tacoma, Washington (see chunk X)." (This walkthrough is written out as a tool-call trace below.)
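The same walkthrough, written as data: a hypothetical sequence of (tool, arguments, observation) records. The snippet text and chunk IDs are invented for illustration; the biographical facts match the example above.

```python
# A hypothetical tool-call trace for the Dune walkthrough; snippet text and
# chunk IDs are invented for illustration only.
trace = [
    ("keyword_search", {"keywords": ["Dune", "author"]},
     {17: ["Dune is a 1965 novel by American author Frank Herbert."]}),
    ("semantic_search", {"query": "Frank Herbert birthplace"},
     {42: ["Herbert was born in Tacoma, Washington."]}),
    ("chunk_read", {"chunk_ids": [42]},
     {42: "Frank Herbert (1920-1986) was born in Tacoma, Washington. ..."}),
]
for tool, args, observation in trace:
    print(f"{tool}({args}) -> {observation}")
```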
04 Experiments & Results
The test: Can agentic retrieval with hierarchical tools answer multi-hop questions better, using similar or fewer tokens? The authors evaluated on HotpotQA, 2WikiMultiHopQA, MuSiQue, and GraphRAG-Bench, using two metrics: (1) LLM-Acc (does the answer mean the same as the ground truth?) and (2) Contain-Acc (does the exact answer text appear in the response?), with LLM-Acc used alone for long-form GraphRAG-Bench.
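To make the two metrics concrete, here is a minimal sketch: Contain-Acc as a normalized substring check and LLM-Acc delegated to an assumed `judge(question, prediction, gold)` callable. The normalization details are illustrative, not the paper's exact protocol.

```python
import re
import string

def normalize(text):
    # Lowercase, strip punctuation, collapse whitespace (an assumed normalization).
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def contain_acc(predictions, golds):
    """Contain-Acc: fraction of examples whose normalized gold answer appears in the response."""
    hits = sum(normalize(g) in normalize(p) for p, g in zip(predictions, golds))
    return hits / len(golds)

def llm_acc(questions, predictions, golds, judge):
    """LLM-Acc: fraction of answers an LLM judge deems equivalent in meaning to the gold.

    `judge(question, prediction, gold) -> bool` is an assumed external judge call."""
    hits = sum(judge(q, p, g) for q, p, g in zip(questions, predictions, golds))
    return hits / len(golds)

# Example: contain_acc(["Born in Tacoma, Washington."], ["Tacoma, Washington"]) -> 1.0
```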
The competition: Baselines included Naive RAG, GraphRAG, HippoRAG2, LinearRAG, FaithfulRAG, MA-RAG, and RAGentA. They also compared two A-RAG variants: A-RAG (Naive) with only an embedding search tool, and A-RAG (Full) with keyword search, semantic search, and chunk read.
The scoreboard with context:
- Even A-RAG (Naive) beat many graph and workflow baselines. For example, on MuSiQue with GPT-4o-mini, A-RAG (Naive) reached 43.8% LLM-Acc vs. Naive RAG's 38.6% (like moving from a C+ to a solid B- on a tough quiz).
- A-RAG (Full) did even better. With GPT-5-mini backbones, it often topped all methods. On GraphRAG-Bench, A-RAG (Full) hit 74.1% LLM-Acc, outperforming alternatives, like scoring an A when others hovered around B's.
- Context efficiency: With GPT-5-mini, A-RAG (Full) often retrieved fewer tokens than several baselines while achieving higher accuracy. For example, on HotpotQA A-RAG (Full) retrieved ~2,737 tokens versus Naive RAG's ~5,358, about half the reading for better answers.
Ablations (what matters most?):
- Removing any one tool (keyword search, semantic search, or chunk read) hurt performance. This shows multi-granularity is not a luxury; it's essential.
- Notably, the "w/o Chunk Read" variant underperformed the full system, proving that snippets alone aren't enough; full-chunk verification is critical to avoid mistakes.
🍞 Top Bread (Hook): Think of practicing piano: if you spend more focused time, you generally play better. 🥬 The Concept (Test-Time Scaling): It is the idea that giving the model more steps or more reasoning tokens during testing can boost accuracy. How it works:
- Allow more agent loop steps (e.g., 5 → 20).
- Permit longer reasoning (more thinking tokens).
- Let the agent explore and verify more. Why it matters: Without extra time, complex multi-hop problems may remain partially explored. 🍞 Bottom Bread (Anchor): On MuSiQue-300, increasing steps from 5 to 20 improved GPT-5-mini by ~8%. Increasing reasoning effort yielded ~25% gains for GPT-5-mini and GPT-5. (A minimal budget-sweep sketch follows below.)
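A tiny sketch of how such a step-budget sweep might be run, assuming an `evaluate(max_steps)` helper that scores the agent on a benchmark split with the given limit; the budgets mirror the 5-to-20-step range quoted above.

```python
def sweep_step_budgets(evaluate, budgets=(5, 10, 20)):
    """Run the same benchmark under several agent-step budgets.

    `evaluate(max_steps) -> float` is an assumed helper that runs the agent loop
    over a benchmark split with the given step limit and returns accuracy.
    """
    return {budget: evaluate(budget) for budget in budgets}

# Hypothetical usage: accuracy is expected to rise as the step budget grows.
# scores = sweep_step_budgets(lambda steps: run_on_musique_300(max_steps=steps))
```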
Surprising findings:
- Simple can be strong: A-RAG (Naive) with just one tool already beat many more complicated systems.
- More isn't always better: A-RAG (Naive) pulled many more tokens than A-RAG (Full) but still performed worse, showing that good interfaces beat brute force.
Failure modes:
- Most failures came from reasoning chain issues (like entity confusion), not inability to retrieve. This means the bottleneck has shifted: the model usually finds the right pages but sometimes mis-links the facts.
- Judge errors and corpus-missing cases existed but were smaller slices.
05 Discussion & Limitations
Limitations:
- Tool menu not exhaustive: The paper shows that hierarchical interfaces help, but it doesn't explore every possible tool or configuration. Different domains might benefit from more specialized tools.
- Big models not fully tested: Due to compute limits, the team hasn't validated with the largest frontier models (e.g., full GPT-5 or Gemini-3), though they expect even bigger gains there.
- Task generalization: Results focus on multi-hop QA; other tasks like fact verification, dialog, or long-form writing need more study.
Required resources:
- A reasoning-capable LLM with tool-use (function calling) ability.
- An embedding model for semantic search and a preprocessed corpus split into sentence-aligned chunks.
- Some compute budget for multi-step loops (test-time scaling helps performance but costs tokens/latency).
When NOT to use:
- Extremely small questions where plain LLM answers suffice; the overhead of retrieval may not pay off.
- Ultra-low-latency settings where iterative searching is too slow.
- Corpora that are tiny, highly homogeneous, or already fully known to the model's parameters; external retrieval may add little.
Open questions:
- How to reduce reasoning chain errors (especially entity confusion)? Perhaps add lightweight verification steps or entity disambiguation tools while staying agentic.
- What's the best minimal toolset per domain? Can we auto-tune tool menus?
- How to schedule test-time compute: dynamic stopping rules, budget-aware planning, or adapt step counts per question difficulty?
- Can we combine agentic retrieval with reinforcement learning so the model learns better search strategies over time without losing flexibility?
06 Conclusion & Future Work
In three sentences: A-RAG turns retrieval from a fixed pipeline into an agent-driven process by giving the model simple, layered tools (keyword search, semantic search, and chunk read) and letting it decide how to use them. Across multi-hop QA benchmarks, this agentic approach consistently beats graph-based and workflow-based RAG, often while reading fewer tokens. It also scales with stronger models and more test-time compute, suggesting a future where better interfaces, not just bigger indexes, unlock better answers.
Main achievement: Showing that hierarchical, agent-friendly retrieval interfaces unlock the reasoning power modern LLMs already have, yielding higher accuracy and better context efficiency.
Future directions:
- Expand the toolset (e.g., lightweight entity linking, citation verification) while keeping the agent free to choose.
- Train retrieval strategies (via RL or SFT) on top of these interfaces for even stronger performance.
- Validate on more tasks (fact checking, dialog, long-form generation) and with larger backbone models.
Why remember this: A-RAG reframes RAG as an interface design problem. Give the model the right dials (fine, medium, coarse) and the freedom to turn them. That shift, from rigid recipes to agentic choices, lets performance grow naturally as models and compute improve, without rebuilding the whole system.
Practical Applications
- Customer support search that first keyword-matches product names, then semantically finds troubleshooting steps, and finally opens only the needed sections.
- Coding assistants that keyword-search API names, semantically match example usage, and read full docs to confirm parameters.
- Academic research helpers that semantically find related work, then read the most relevant chunks to extract citations and quotes.
- Enterprise knowledge bases that reduce context costs by reading just enough to answer employee questions accurately.
- Medical literature triage tools that semantically locate relevant studies and then verify findings in full-text chunks.
- Legal document review that keyword-searches clauses, semantically finds interpretations, and reads exact sections for due diligence.
- Data governance assistants that locate exact policy terms and read surrounding context to ensure compliant actions.
- Education study buddies that gather facts across textbooks and confirm details with chunk reads before presenting answers.
- News analysis systems that track entities via keywords, infer relationships semantically, and read select articles fully to avoid misinformation.
- Threat intelligence tools that keyword-match indicators (hashes, domains), semantically link campaigns, and confirm details only where needed.