
FABLE: Forest-Based Adaptive Bi-Path LLM-Enhanced Retrieval for Multi-Document Reasoning

Intermediate
Lin Sun, Linglin Zhang, Jingang Huang et al. · 1/26/2026
arXiv · PDF

Key Summary

  • FABLE is a new retrieval system that helps AI find and combine facts from many documents by letting the AI both organize the library and choose the right shelves to read.
  • It builds a forest of trees (one tree per document) with titles and summaries at the top and detailed chunks at the leaves, so the AI can zoom in or zoom out as needed.
  • When answering a question, FABLE uses two paths at once: an AI-guided path that reasons over summaries and a vector path that finds text by similarity, then fuses them smartly.
  • It adapts to a token budget: if a few whole documents fit, it stops early; if not, it dives into just the most relevant sections and chunks.
  • Across benchmarks like HotpotQA and 2Wiki, FABLE beats strong RAG baselines and matches long-context LLMs while using up to 94% fewer tokens.
  • Its bi-path design reduces hallucinations and irrelevant text, improving completeness and faithfulness of answers.
  • In large agent tasks (BrowseComp-plus), swapping in FABLE as the retriever boosts accuracy and recall without changing the agent model.
  • This shows long-context LLMs are helpful but do not replace structured retrieval; the best results come from combining them.

Why This Research Matters

When answers depend on multiple sources, grabbing a flat pile of text is risky and expensive; you can miss key facts and waste tokens. FABLE helps AI act like a good researcher: organize the library first, then navigate it differently depending on the question. This means clearer, more faithful answers with fewer hallucinations and far lower cost. Teams can upgrade retrieval quality without buying a bigger model or exploding token bills. In education, medicine, law, finance, and software docs, this makes multi-document tasks practical and reliable. Agents also become stronger just by swapping in FABLE as the retriever. Overall, it’s a blueprint for making AI both smarter and thriftier.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re in a giant library with millions of pages. Even if you could carry a huge backpack, stuffing every page inside won’t help if the one sentence you need is buried and you forget where it is.

🥬 The Concept (Long-context Large Language Models): What it is: Long-context LLMs are big readers that can remember very long texts at once. How it works: 1) You feed lots of text, 2) the model pays attention to parts it thinks matter, 3) it predicts answers or writes summaries. Why it matters: Without long context, the model can’t see all the pieces of a big puzzle at the same time. 🍞 Anchor: If you paste a whole Wikipedia article about two cities into a long-context model, it can compare them directly—if it doesn’t miss the key lines.

🍞 Hook: You know how chefs don’t memorize every recipe—they grab what they need from cookbooks when cooking?

🥬 The Concept (Retrieval-Augmented Generation, RAG): What it is: RAG is when an AI looks up helpful passages before answering. How it works: 1) Split documents into chunks, 2) turn chunks and your question into vectors, 3) pull the best-matching chunks, 4) feed them to the model to answer. Why it matters: Without RAG, the model might guess or forget facts it hasn’t seen. 🍞 Anchor: Ask, “Who wrote The Hobbit?” RAG finds the passage saying “J.R.R. Tolkien” and the model answers correctly.
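
To make the four steps concrete, here is a minimal, self-contained sketch of flat RAG retrieval. The bag-of-words `embed()` is a toy stand-in for a real embedding model (e.g., BGE-M3), and the chunk texts are purely illustrative.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real RAG system would use a dense model such as BGE-M3."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm + 1e-9)

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Classic flat RAG recall: score every chunk against the question, keep the top k."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The Hobbit was written by J.R.R. Tolkien and published in 1937.",
    "Bilbo Baggins lives in the Shire.",
    "Paris is the capital of France.",
]
print(retrieve("Who wrote The Hobbit?", chunks, k=1))
# -> ['The Hobbit was written by J.R.R. Tolkien and published in 1937.']
# The retrieved chunk(s) are then pasted into the generator LLM's prompt.
```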

🍞 Hook: Think about putting your room in order—labels on bins so you can find toys fast.

🥬 The Concept (Knowledge Organization): What it is: It’s arranging information so the right piece is easy to find. How it works: 1) Group related parts, 2) give each group a clear label, 3) link big ideas to details. Why it matters: Without organization, you waste time digging through piles and often grab the wrong thing. 🍞 Anchor: A binder with tabs (Math, Science, History) helps you jump straight to what you need.

The World Before: People tried two main paths. One was “just jam it all in” with long-context LLMs. But research showed the “lost in the middle” problem: if the key fact sits in the middle of a giant context, models often miss it. Also, attention over huge texts is pricey—cost grows fast with more tokens—so it’s not efficient at scale. The other path was classic RAG: fast and scalable, but chunks are flat and sometimes misleading; they look similar to your question without containing the answer, especially in multi-document tasks that need careful stitching.

The Problem: We want reliable, efficient answers that pull the right evidence from many documents and combine them correctly. But long-context alone can be distracted or too expensive; flat RAG can be noisy and weak at cross-document reasoning (like comparing sources, finding contradictions, or building timelines).

Failed Attempts: Graph RAGs connected entities (people, places) but could miss bigger themes and document-wide ideas. Hierarchical RAGs built trees of summaries but often used fixed, static structures and one-size-fits-all retrieval. Iterative/agent methods could call tools but treated retrieval as separate from how knowledge was stored; the AI still got flat piles, not guided maps.

The Gap: We needed a system where the AI helps organize knowledge into meaningful hierarchies and then navigates those hierarchies differently depending on the question—sometimes skimming high-level summaries, sometimes drilling into fine details—while staying within a token budget.

Real Stakes: Think of students researching history, doctors reviewing reports, or teams auditing policies. If the system grabs the wrong passages or misses a crucial comparison across documents, answers become incomplete or untrustworthy. We want fewer hallucinations, less irrelevant text, and smarter use of tokens so answers are faithful, complete, and affordable.

02 Core Idea

🍞 Hook: You know how you use a map differently depending on your trip—zoomed out for highways, zoomed in for street turns?

🥬 The Concept (Forest-Based Adaptive Bi-Path Retrieval, FABLE): What it is: FABLE is a system where the AI builds trees of knowledge for each document, then navigates them with two coordinated paths—one guided by AI reasoning over summaries, the other by vector similarity—adapting depth based on a token budget. How it works: 1) Offline, the AI splits each document into meaningful chunks and builds a tree with titles and summaries (a forest across documents). 2) Online, given a question, it first tries to select whole documents using high-level nodes plus vector search. 3) If the budget is tight, it drills down to selected sections and chunks using an AI-guided path and a structure-aware expansion path, then fuses results in reading order. Why it matters: Without this, you either waste tokens on too much text or miss key evidence; you also risk mixing noisy passages that look similar but aren’t truly relevant. 🍞 Anchor: Asking “Which two companies’ 2024 Q2 profits diverged the most and why?” FABLE first picks the right filings at a high level, then zooms into just the earnings and notes sections needed to compare, staying within budget.

Three Analogies for the Aha Moment:

  1. Map App: Overview roads to choose the city (AI-guided summaries), then street-level turns (fine chunks), while a second layer (vector search) suggests nearby routes; both get merged into the best path.
  2. Museum Tour: A curator explains gallery highlights (summaries) while a search assistant quickly fetches specific artworks (vectors); your tour blends both so you see the right rooms and the exact paintings.
  3. Grocery Run: A shopping list is grouped by aisle (tree structure); one helper points you to the right aisles (AI-guided), another spots exact brands on shelves (vectors); you save time and don’t forget anything.

Before vs After:

  • Before: Static structures and flat retrieval; the model receives a bag of chunks and must guess which matter. Long-context models read everything but still miss middle parts and waste tokens.
  • After: The AI organizes knowledge and navigates it. Retrieval becomes active, query-conditioned movement over a forest, saving tokens and improving faithfulness.

🍞 Hook: Imagine a family tree where you can talk about great-grandparents or zoom into one cousin’s story.

🥬 The Concept (Hierarchical Knowledge Forests): What it is: A set of trees (one per document) with titles and summaries at internal nodes and detailed chunks at leaves. How it works: 1) The AI segments text into meaningful chunks, 2) creates section/subsection nodes with ToC-like titles and roll-up summaries, 3) embeds both internal nodes (title+path+summary) and leaf chunks for flexible retrieval. Why it matters: Without hierarchies, you can’t smoothly switch between big-picture and detail, so you either overshoot (too much text) or undershoot (miss the needed bit). 🍞 Anchor: An annual report tree lets you jump from “Risk Factors” to the exact paragraph on “Supply Chain Delays.”
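
As a rough picture of what one tree in the forest might look like in code, here is a minimal sketch; the field names are assumptions for illustration, not the paper's data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    title: str                                   # ToC-like label
    summary: str = ""                            # roll-up summary (internal nodes)
    text: str = ""                               # original chunk text (leaf nodes)
    depth: int = 0
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

# One tree per document; the forest is just a list of roots.
report = Node(
    title="Annual Report 2024",
    summary="Financial results, risk factors, and outlook.",
    children=[
        Node(
            title="Risk Factors",
            summary="Key operational and market risks.",
            depth=1,
            children=[
                Node(title="Supply Chain Delays", depth=2,
                     text="Shipping disruptions delayed component deliveries by six weeks..."),
            ],
        ),
    ],
)
forest = [report]
```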

🍞 Hook: Think of choosing between a bird’s-eye view and a magnifying glass.

🥬 The Concept (Bi-Path Retrieval): What it is: Two coordinated retrieval paths—(1) AI-guided semantic navigation over summaries, and (2) vector-based similarity search and structural propagation—fused and ordered to provide clean evidence. How it works: 1) At document level, pick candidates via AI over high-level nodes and vectors; 2) If needed, at node level, navigate with AI and expand via structure-aware scores; 3) Deduplicate ancestor/descendant overlaps and preserve document order. Why it matters: One path alone either misses nuance (vectors) or breadth (AI overviews); together they cover both. 🍞 Anchor: For “Compare two court rulings,” the AI picks the right cases (path 1), while vectors find exact paragraphs on the precedent (path 2), merged neatly.
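
A hedged sketch of the document-level half of this bi-path idea: the two callables stand in for the AI-guided and vector paths (assumed interfaces), and the fusion is a de-duplicated union that orders the AI picks first (an illustrative choice).

```python
def fuse_document_candidates(question, forest, llm_select_docs, vector_recall_docs):
    """Bi-path recall at document level.

    llm_select_docs(question, forest)    -> doc ids chosen by the LLM from high-level nodes
    vector_recall_docs(question, forest) -> doc ids whose node embeddings match the query
    Both callables are assumed stand-ins for the real components.
    """
    ai_picks = llm_select_docs(question, forest)
    vector_picks = vector_recall_docs(question, forest)
    fused, seen = [], set()
    for doc_id in list(ai_picks) + list(vector_picks):   # AI-guided picks keep priority
        if doc_id not in seen:
            seen.add(doc_id)
            fused.append(doc_id)
    return fused
```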

🍞 Hook: A librarian asks, “Are you writing a summary or hunting a quote?” and then guides you differently.

🥬 The Concept (Query-Conditioned Traversal): What it is: Retrieval that changes how it moves through hierarchies based on the question and token budget. How it works: 1) If few documents fit, return them whole; 2) if not, drill down only into the most relevant sections; 3) stop when the budget is full. Why it matters: Without this, you waste tokens or miss crucial details, hurting both cost and accuracy. 🍞 Anchor: For “List the main causes,” you keep high-level summaries; for “What date was the policy signed?” you jump to a specific subsection.

Why It Works (Intuition): The summaries give the AI a stable semantic map to make good high-level choices quickly (cheap tokens), while vector/structure paths ensure fine-grained recall of the exact evidence. Fusion rules (deduplicate, priority, and ordering) keep the final context clean and readable, which reduces hallucinations.

Building Blocks: semantic chunking, tree construction with titles and summaries, multi-granularity embeddings, document-level bi-path+fusion, budget-adaptive routing, node-level bi-path (AI navigation + TreeExpansion), and structure-aware ordering for the final context.

03 Methodology

At a high level: Input (documents) → LLM-enhanced indexing (build a forest) → Budget-adaptive bi-path retrieval (doc level, then node level if needed) → Ordered evidence context → Answer generation.
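
The same flow as a dependency-injected sketch; all five callables are assumed interfaces for the stages detailed below, not an API from the paper.

```python
def fable_answer(question, forest, budget, *,
                 doc_recall, count_tokens, node_select, order_evidence, generate):
    """End-to-end FABLE-style flow: recall documents, route by budget, order evidence, answer.

    doc_recall(question, forest)         -> candidate documents (bi-path, Step A)
    count_tokens(doc)                    -> token length of a document
    node_select(question, docs, budget)  -> sections/chunks within budget (Step B fallback)
    order_evidence(items)                -> de-duplicated items in reading order
    generate(question, context)          -> final answer string
    """
    docs = doc_recall(question, forest)                      # document-level bi-path recall
    if sum(count_tokens(d) for d in docs) <= budget:         # budget-adaptive routing
        evidence = docs                                      # a few whole documents fit: stop early
    else:
        evidence = node_select(question, docs, budget)       # drill into sections and chunks
    return generate(question, order_evidence(evidence))      # ordered evidence -> answer
```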

🍞 Hook: Cutting a cake into sensible slices makes serving easy.

🥬 The Concept (Semantic Chunking): What it is: The AI splits documents into meaningful chunks that match topics or paragraphs, not random lengths. How it works: 1) Read the document, 2) find natural topic boundaries, 3) create clean chunks without overlap. Why it matters: Random cuts break sentences and ideas, making retrieval noisy. 🍞 Anchor: Splitting a science article into Introduction, Methods, Results, Discussion, not every 128 words.
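
A hedged sketch of the idea: ask an LLM for topic boundaries and slice the text there. The prompt wording and the `llm_complete` callable are assumptions; the paper's actual chunking prompt is not reproduced here.

```python
import json

def semantic_chunk(document: str, llm_complete) -> list[str]:
    """Split a document at LLM-proposed topic boundaries instead of fixed-size windows.

    llm_complete(prompt) -> str is an assumed interface returning the model's raw reply,
    expected here to be a JSON list of character offsets where new topics begin.
    """
    prompt = (
        "Identify the natural topic boundaries in the document below. "
        "Reply with a JSON list of character offsets where each new topic starts.\n\n"
        + document
    )
    offsets = sorted({o for o in json.loads(llm_complete(prompt)) if 0 < o < len(document)})
    bounds = [0] + offsets + [len(document)]
    return [document[a:b].strip() for a, b in zip(bounds, bounds[1:]) if document[a:b].strip()]
```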

Indexing Recipe (Offline, one-time):

  1. Semantic-aware chunking: Use an LLM to segment each document into coherent chunks.
  2. Tree building per document: Create a rooted tree: root → sections → subsections → leaf chunks. Internal nodes store (title, summary, ordered children). Leaf nodes point to original chunk text. Depth is bounded (e.g., D=4).
  3. Multi-granularity embeddings: For internal nodes, embed (path of titles + summary). For leaf nodes, embed the chunk content. This supports both big-picture and detail retrieval.
  4. Progressive build for long docs: If a file is too long, build partial trees per part and merge—keeps coherence without busting context length.
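
A small sketch of the multi-granularity embedding step (step 3 above), reusing the `Node` shape from the forest sketch earlier; the "title path : summary" string format and the `embed` callable are illustrative assumptions.

```python
def embed_forest(roots, embed):
    """One embedding per node: internal nodes from their title path plus summary,
    leaves from their raw chunk text. Returns {id(node): vector}.

    `roots` are tree roots shaped like the Node sketch above; `embed(text) -> vector`
    is an assumed embedding callable (e.g., wrapping a dense encoder).
    """
    vectors = {}

    def walk(node, title_path):
        path = title_path + [node.title]
        if node.is_leaf():
            vectors[id(node)] = embed(node.text)                                # detail-level signal
        else:
            vectors[id(node)] = embed(" > ".join(path) + " : " + node.summary)  # big-picture signal
            for child in node.children:
                walk(child, path)

    for root in roots:
        walk(root, [])
    return vectors
```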

Retrieval Recipe (Online, per query): Step A — Document-level bi-path recall

  • Path 1: Depth-adaptive LLM selection. The model only reads high-level nodes (up to a depth L) across the forest and picks candidate documents—cheap and global.
  • Path 2: Vector retrieval over all node embeddings to catch semantically similar docs the LLM might miss.
  • Fusion: Deduplicate and unite both sets for strong recall and precision.

Step B — Budget-adaptive routing

  • If the total size of these candidates fits your max token budget, stop and return full docs. Great when only a few documents are relevant.
  • If too big, go to node-level selection.

🍞 Hook: You don’t always read the whole book; sometimes you jump to the right chapter or even a single paragraph.

🥬 The Concept (Node-Level Retrieval): What it is: Picking just the right sections and chunks from the chosen documents. How it works: 1) AI-guided navigation reviews non-leaf nodes’ titles and summaries to find the most relevant branches; 2) structure-aware TreeExpansion scores nodes using (a) query-node similarity with depth decay, (b) inherited ancestor relevance, (c) aggregated child relevance, then selects nodes greedily within budget; 3) fuse results. Why it matters: Without node-level selection, you waste tokens; without structure-aware scoring, you might miss the perfect paragraph hiding under a relevant section. 🍞 Anchor: From “Climate Impacts,” jump directly to the leaf chunk listing “heatwaves increased by X%,” not the whole chapter.

🍞 Hook: Think of water flowing from trunk to branches and back, carrying importance up and down.

🥬 The Concept (TreeExpansion): What it is: A structural propagation method to score nodes using three signals—direct similarity, inherited ancestor strength, and child aggregation. How it works: 1) Compute similarity between query and node embeddings (favor higher-level nodes via depth decay), 2) pass the best score down from ancestors, 3) pass average strength up from children, 4) rank nodes and pick until the budget is filled, prioritizing ancestors to avoid redundancy. Why it matters: Without propagation, you ignore the tree’s relationships and miss nodes closely tied to the right topics. 🍞 Anchor: If a “Mergers” section strongly matches the query, its sub-sections about “Regulatory Approval” become good bets even if their individual text is short.
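
A hedged sketch of TreeExpansion's three signals and greedy, budget-bounded selection. The depth-decay constant, the equal weighting of signals, and the tie-breaking toward shallower nodes are illustrative choices rather than the paper's exact formula; nodes are assumed to expose `.depth`, `.parent`, and `.children`.

```python
import math

def tree_expansion(query_vec, nodes, vectors, budget, token_len, decay=0.8):
    """Score every node by (a) query similarity with depth decay, (b) best ancestor score,
    (c) mean child score, then greedily pack the top-scoring nodes into the token budget.

    nodes: all nodes of the candidate trees, each with .depth, .parent, .children (assumed)
    vectors: {id(node): embedding}; token_len(node) -> token cost of a node (assumed helpers)
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-9)

    # (a) direct query-node similarity, discounted for deeper nodes
    direct = {id(n): cos(query_vec, vectors[id(n)]) * decay ** n.depth for n in nodes}

    # (b) inherit the strongest score seen on the path from the root (top-down pass)
    inherited = {}
    for n in sorted(nodes, key=lambda n: n.depth):
        above = inherited.get(id(n.parent), 0.0) if getattr(n, "parent", None) else 0.0
        inherited[id(n)] = max(direct[id(n)], above)

    # (c) aggregate evidence coming up from children (bottom-up pass)
    aggregated = {}
    for n in sorted(nodes, key=lambda n: n.depth, reverse=True):
        kids = [aggregated[id(c)] for c in n.children]
        aggregated[id(n)] = (direct[id(n)] + sum(kids) / len(kids)) / 2 if kids else direct[id(n)]

    score = {id(n): (direct[id(n)] + inherited[id(n)] + aggregated[id(n)]) / 3 for n in nodes}

    # Greedy packing: best score first, shallower (ancestor) nodes win ties to curb redundancy.
    picked, used = [], 0
    for n in sorted(nodes, key=lambda n: (-score[id(n)], n.depth)):
        if used + token_len(n) <= budget:
            picked.append(n)
            used += token_len(n)
    return picked
```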

Node Fusion and Ordering:

  • Deduplicate ancestor/descendant overlaps so you don’t include both a section and its full sub-sections redundantly.
  • Prioritize AI-picked nodes first (explicitly chosen), then TreeExpansion nodes (structurally inferred).
  • Inside each document, keep original reading order for clarity.
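
A minimal sketch of these fusion rules; `is_ancestor(a, b)` and the `.doc_id` / `.position` attributes are assumed helpers for tree containment and reading order.

```python
def fuse_and_order(ai_nodes, expansion_nodes, is_ancestor):
    """Merge the two node sets, drop ancestor/descendant redundancy, restore reading order.

    is_ancestor(a, b) -> True if node a contains node b in its subtree (assumed helper);
    each node is assumed to carry .doc_id and .position (its offset within the document).
    """
    candidates = list(ai_nodes) + [n for n in expansion_nodes if n not in ai_nodes]  # AI picks first
    kept = []
    for n in candidates:
        if any(is_ancestor(k, n) for k in kept):
            continue                                           # already covered by a kept ancestor
        kept = [k for k in kept if not is_ancestor(n, k)]      # a broader node replaces its descendants
        kept.append(n)
    return sorted(kept, key=lambda n: (n.doc_id, n.position))  # original reading order per document
```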

Budget Control:

  • Treat the final selection like packing a suitcase: add the most important pieces in order until the token capacity is full.

Concrete Example:

  • Query: “Which of the 2023 reports on City A and City B show opposite trends in air quality, and what’s the main reason?”
  • Document-level: AI path selects the two city reports based on summaries; vector path also flags a health department note. Fusion keeps the two reports.
  • Budget check: Whole docs don’t fit. Move to node-level.
  • Node-level: AI navigation picks “Air Quality Trend” sections in both reports. TreeExpansion boosts “Method Notes” sub-sections that mention sensor calibration. Fusion orders: City A trend → City B trend → short method note. The generator then answers with evidence and fewer tokens.

The Secret Sauce:

  • LLMs don’t just read; they organize and steer. The bi-path design balances human-like reasoning (AI summaries) with robust similarity search (vectors + structure), then packs results cleanly within budget. This synergy raises completeness, lowers hallucinations, and saves tokens—especially on multi-document reasoning.

04 Experiments & Results

The Test: Researchers checked whether FABLE can find and combine evidence across many documents accurately and efficiently. They measured completeness (did it gather what was needed), hallucination (made-up stuff), irrelevance (extra fluff), recall (did it fetch the gold evidence), and standard QA metrics like EM/F1.

The Competition: FABLE was compared to classic BM25, strong dense retrievers like BGE-M3, structured RAGs such as LongRefiner and HippoRAG2, and even powerful long-context LLMs like Gemini-2.5-Pro reading full documents.

Scoreboard with Context:

  • Synthetic multi-doc QA (DragonBall/DragBalance): FABLE (docs) hit about 92.07% completeness with only ~31K tokens vs a long-context LLM using ~517K tokens. It kept hallucinations low (~5.37%) and irrelevance low (~2.52%). That’s like getting an A+ while writing a short, sharp essay instead of a super long one.
  • Real-world multi-hop QA (HotpotQA, 2Wiki): FABLE variants achieved higher EM/F1 than strong RAG baselines, gaining about +7 EM on HotpotQA and +8 EM on 2Wiki over structured RAG baselines, showing better cross-document synthesis.
  • Long-context comparison: Even when a long-context model reads everything (perfect recall by construction), it can still miss the right bits or waste tokens. FABLE matched or beat full-context accuracy while cutting tokens by up to 94%.
  • Agents (BrowseComp-plus, 100K+ docs): Simply swapping in FABLE as the retriever (keeping the same agent LLM) boosted accuracy to ~66.60% (+22.14 over a weaker pairing) and recall to ~76.60% (+14.28), while keeping search calls efficient (~21.74). That shows retrieval quality alone can supercharge agents.

Surprising Findings:

  • More context isn’t always better. Chunk-based retrieval performance plateaued or even degraded at very large inputs, while FABLE kept improving completeness and reducing hallucinations by staying structured and budget-aware.
  • Under tiny budgets (e.g., ~1K tokens), AI-guided node navigation outperformed pure vector approaches, showing that smart high-level reasoning wins when every token counts.
  • The bi-path fusion mattered: combining AI navigation with structure-aware expansion added ~+1.6 completeness points at moderate budgets (e.g., 4K tokens) over single-path choices.

What the Numbers Mean:

  • High completeness + low hallucination/irrelevance translates to answers that are not just correct but also trustworthy and concise.
  • Matching full-context LLMs with a fraction of tokens means lower costs and faster responses—crucial for real deployments and large-scale workflows.
  • Better agent scores without changing the LLM backbone show FABLE is a strong, drop-in upgrade to retrieval modules.

Takeaway: Structure + adaptive navigation beat flat piles of text. FABLE’s forest plus bi-path approach gets the right evidence, in the right order, within budget—making the final answers clearer and more reliable.

05 Discussion & Limitations

Limitations:

  • FABLE needs an upfront indexing step (building the forest). This is offline and amortized, but it’s still extra work before queries start.
  • It shines on semantically structured documents (reports, articles, policies). On messy, highly unstructured text or pure keyword tasks, the advantages shrink and simple keyword search may suffice.
  • The simple equal-weighting in TreeExpansion is robust but not tuned; in edge cases, task-specific weighting could help.
  • If summaries at internal nodes are poor, AI-guided traversal may misprioritize; quality of indexing prompts and models matters.

Required Resources:

  • An LLM to build trees and summaries once (offline), an embedding model for vectors, and a generator model at query time. Vector indices (e.g., FAISS) and modest compute are needed for fast lookups.
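
For the vector index, a minimal FAISS sketch (exact inner-product search over L2-normalized vectors, which is equivalent to cosine similarity); the dimension and random vectors here are placeholders for real node embeddings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 128                                                  # embedding dimension (illustrative)
node_vecs = np.random.rand(1000, dim).astype("float32")    # stand-in for forest node embeddings
faiss.normalize_L2(node_vecs)                              # normalize so inner product == cosine

index = faiss.IndexFlatIP(dim)                             # exact inner-product index
index.add(node_vecs)

query = np.random.rand(1, dim).astype("float32")           # stand-in for the embedded question
faiss.normalize_L2(query)
scores, node_ids = index.search(query, k=10)               # top-10 nodes for the vector path
```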

When NOT to Use:

  • Tiny corpora where reading everything is cheaper and simpler.
  • Pure lookup tasks where a single exact string match answers the question; a full hierarchy may be overkill.
  • Real-time streaming text that changes every second; frequent reindexing may be too costly unless batched.

Open Questions:

  • How to best learn the fusion and propagation weights automatically per domain/task?
  • Can we enrich node metadata (e.g., temporal tags, source reliability) to further reduce hallucinations?
  • How to extend forests across modalities (tables, figures, audio) while keeping clean traversal and budgets?
  • What’s the optimal collaboration loop between the retriever and the generator (e.g., iterative refinement) without exploding tokens?

Balanced View: FABLE doesn’t replace long-context LLMs; it complements them. The big win is turning retrieval into smart, query-conditioned navigation over AI-built structures. This boosts faithfulness and efficiency on multi-document reasoning, but you still need good indexing, embeddings, and reasonable document structure.

06 Conclusion & Future Work

3-Sentence Summary: FABLE lets LLMs organize documents into a forest of semantic trees and then retrieve answers by navigating those trees with two coordinated paths, adjusting depth to fit a token budget. This unified approach cuts semantic noise, reduces hallucinations, and improves multi-document synthesis, often matching full-context LLMs with up to 94% fewer tokens. The result is faster, cheaper, and more trustworthy answers across research-style and agent tasks.

Main Achievement: Turning retrieval from static, flat matching into active, query-conditioned navigation over LLM-built hierarchies, powered by a bi-path fusion of AI reasoning and structure-aware similarity.

Future Directions: Learn adaptive weights for TreeExpansion and fusion; add reliability/time metadata to nodes; broaden to multimodal trees (tables, figures, audio); and explore tight retriever–generator loops that keep budgets small.

Why Remember This: It shows that longer context alone isn’t the winning move; organizing knowledge and navigating it wisely is. FABLE’s forest + bi-path recipe is a practical blueprint for getting more faithful, efficient answers from today’s and tomorrow’s LLMs.

Practical Applications

  • Enterprise search: Build forests over policy docs and reports so employees get precise, budget-aware answers with evidence.
  • Regulatory audits: Navigate filings by summaries first, then drill to clauses and footnotes that prove compliance.
  • Healthcare literature reviews: Skim systematic-review sections, then extract key trial outcomes within budget.
  • Legal research: Compare rulings by selecting case summaries first, then pulling precedent paragraphs for citations.
  • Financial analysis: Jump from earnings overview to exact KPI tables and management commentary for side-by-side comparison.
  • Academic study helpers: For a student’s question, fetch chapter summaries before showing minimal needed passages.
  • Customer support: Organize manuals into trees so agents retrieve just the relevant steps and warnings, not whole manuals.
  • Software docs Q&A: From API overview to exact function signatures and edge-case notes with minimal tokens.
  • News synthesis: Summarize a developing story by selecting top-level briefs and only the most relevant quoted details.
  • Autonomous research agents: Swap in FABLE to improve document selection and reduce tool calls without changing the LLM.
#FABLE #Structured RAG #Hierarchical retrieval #Bi-path retrieval #LLM-enhanced indexing #Semantic chunking #Multi-document reasoning #TreeExpansion #Budget-adaptive routing #Vector retrieval #Long-context LLMs #HotpotQA #2Wiki #BrowseComp-plus