
Legal RAG Bench: an end-to-end benchmark for legal RAG

Beginner
Abdur-Rahman Butler, Umar Butler Ā· 3/2/2026
arXiv

Key Summary

  • Legal RAG Bench is a new, end-to-end test that checks how well legal AI systems find information and use it to answer tough, real-world legal questions.
  • It uses 4,876 passages from the Victorian Criminal Charge Book and 100 expert-written questions with long, detailed answers and supporting evidence.
  • The benchmark measures three things at once: correctness (is the answer right), groundedness (is it supported by the retrieved text), and retrieval accuracy (did the system fetch the key passage).
  • A key finding is that retrieval quality is the main driver of overall performance in legal RAG; better embedding models matter more than which big LLM you choose.
  • Kanon 2 Embedder gave the biggest boost, raising correctness by about 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points over weaker embedders.
  • Many so-called hallucinations actually start with retrieval failures—when the right passage isn’t found, the LLM is more likely to make things up.
  • The paper introduces a hierarchical error decomposition that separates hallucinations, retrieval errors, and reasoning errors so teams know exactly what to fix.
  • A full factorial design tests every combination of embedders and LLMs to fairly measure main effects and interactions, not just overall scores.
  • Results are statistically tested: embedder choice is significantly more important than LLM choice for correctness and overall performance, with LLM effects showing up mainly in groundedness.
  • All code, data, and results are openly released to help others reproduce and build better legal RAG systems.

Why This Research Matters

Legal advice must be accurate and backed by clear sources, because real people’s lives and freedoms are affected. Legal RAG Bench shows how to measure both truth and proof at the same time, so builders don’t ship systems that sound smart but can’t be trusted. It guides teams to invest first in better retrieval, which delivers the biggest, safest gains for users. Courts, lawyers, and clients benefit from answers that cite exactly where the rule comes from. Open data and code mean anyone can reproduce and improve the results, raising the bar for the whole field. Over time, this can reduce hallucinations, prevent costly legal mistakes, and make AI a more reliable helper in the justice system.

Detailed Explanation


01Background & Problem Definition

šŸž Top Bread (Hook): Imagine you’re doing a school report and you have a friendly librarian who can fetch the right book pages for you while you write. If the librarian brings the perfect pages, your report shines. If not, you might guess and get things wrong.

🄬 Filling (The Actual Concept): Retrieval-Augmented Generation (RAG) is a way for AI to first fetch helpful text from a library of documents and then use that text to write an answer. How it works:

  1. The AI reads your question.
  2. A retrieval helper searches the library to find the most relevant passages.
  3. The AI reads those passages and writes an answer using them.

Why it matters: Without RAG, the AI might rely on memory and make mistakes. With poor retrieval, even a smart writer can’t answer well.

šŸž Bottom Bread (Anchor): If you ask, ā€œWhat must a judge tell the jury about eyewitness identification?ā€ RAG retrieves the exact guidance from the Victorian Criminal Charge Book so the AI can quote and explain it correctly.

The world before: Lots of AI benchmarks checked either how well models write (generation) or how well they fetch passages (retrieval), but not both together in a realistic, lawyerly setting. Many legal datasets had weak labels, fuzzy rules, or questions that didn’t match real legal work. That’s like grading a cooking contest by only tasting the sauce or only checking the recipe—never seeing the whole dish.

šŸž Top Bread (Hook): You know how you and a friend might describe the same movie differently? If a teacher graded your summaries but didn’t check the actual movie, the scores might be unfair.

🄬 Filling (The Actual Concept): Groundedness is whether an AI’s answer is supported by the specific text it was given to use. How it works:

  1. Look at the answer.
  2. Check the retrieved passages.
  3. Confirm the answer’s claims are actually present or directly supported by those passages.

Why it matters: Without groundedness, an answer might sound right but be unproven, which is dangerous in law.

šŸž Bottom Bread (Anchor): If the AI says, ā€œThe judge must warn the jury about unreliable identification,ā€ groundedness means that warning appears in the retrieved charge book passages.

šŸž Top Bread (Hook): Think of a math quiz where you either get the question right or wrong.

🄬 Filling (The Actual Concept): Correctness is whether the AI’s answer matches the expert solution. How it works:

  1. Compare the AI’s answer to the reference answer.
  2. Decide if it fully entails (covers) the correct points.
  3. Mark 1 for correct, 0 for not.

Why it matters: Without correctness, we can’t tell if the AI actually solved the legal problem.

šŸž Bottom Bread (Anchor): If the official answer says ā€œfive conditions must be metā€ and the AI lists and explains those same five correctly, that’s correct.

šŸž Top Bread (Hook): Imagine a treasure map where X marks the spot. If you can’t find X, you can’t get the treasure.

🄬 Filling (The Actual Concept): Retrieval accuracy is whether the system fetched the annotated supporting passage. How it works:

  1. Run the retriever for each question.
  2. Check if the known key passage is in the results.
  3. Score 1 if found, 0 if not.

Why it matters: Missing the key passage often caps how good the final answer can be.

šŸž Bottom Bread (Anchor): For a question about alibi directions, retrieval accuracy means the passage on alibi instructions actually shows up in the top results.

The problem: Legal AI needs both to be right (correctness) and to show receipts (groundedness). But past benchmarks often asked oversimplified yes/no questions, used mislabeled data, or skipped the messy parts of real legal work. Some even tested things that didn’t reflect actual legal tasks. That left builders guessing which part of their system needed fixing.

Failed attempts: Separate benchmarks for retrieval or for generation didn’t reveal how mistakes chain together. Multiple-choice tests couldn’t catch answers that sounded right but weren’t supported by evidence. Some datasets paired facts with citations in unrealistic ways, which confused models and humans alike.

The gap: We needed an end-to-end, realistic legal RAG test with expert-written hard questions, long-form answers, and exact supporting passages—plus a way to tease apart where errors come from: did the retriever miss, did the writer misread, or did it just make stuff up?

Real stakes: In law, an ungrounded answer is like advice without proof—it can’t be trusted. Lawyers must cite sources. Clients can face real harms if AI fabricates rules. A trustworthy benchmark that mirrors true legal tasks helps teams build safer tools for real people.

02Core Idea

The ā€œAha!ā€ in one sentence: If you cleanly separate and measure retrieval, reasoning, and grounding together—with hard, expert questions and a full grid of model combinations—you can see that retrieval sets the ceiling for legal RAG.

Analogy 1 (Coach and Scout): Imagine a basketball team where the scout (retriever) picks players and the coach (LLM) plans plays. If the scout brings the wrong players, the coach can’t win, no matter how smart the plays are. Fix the scouting first.

Analogy 2 (Librarian and Writer): A librarian (retriever) brings sources; a writer (LLM) crafts the essay. If the librarian fetches unrelated books, the writer may guess—or hallucinate. Better books lead to better essays.

Analogy 3 (Ingredients and Chef): Great dishes need fresh, right ingredients (retrieval). Even a master chef (LLM) can’t make a great meal with stale or wrong ingredients.

Before vs After:

  • Before: Benchmarks often mixed up where errors came from, used unrealistic questions, and didn’t show how retrieval and generation interact.
  • After: Legal RAG Bench uses expert-crafted, long-form Q&A tied to exact passages, a hierarchical error breakdown, and a full factorial design to fairly compare embedders and LLMs together and apart. Now we can say, with evidence, that retrieval dominates performance in end-to-end legal RAG.

Why it works (intuition):

  • Legal answers must be both true and provable. If the retriever consistently supplies the most relevant, specific authority, the LLM doesn’t need to invent facts and is less likely to reason off-track.
  • Measuring correctness, groundedness, and retrieval accuracy together closes loopholes: you can’t hide a correct-but-ungrounded guess, and you can see when the right passage was available but misused.
  • A full factorial grid (every embedder with every LLM) reveals main effects and interactions, so results aren’t biased by lucky pairings.

Building Blocks (with Sandwich explanations when first introduced):

šŸž Top Bread (Hook): You know how a storyteller might fill in blanks when they forget details?

🄬 Filling (The Actual Concept): Hallucinations are when the AI invents facts not supported by the retrieved texts. How it works:

  1. Compare each claim in the answer to the provided passages.
  2. If a claim isn’t supported, mark it as hallucinated.
  3. Classify this as a hallucination error.

Why it matters: In law, made-up details erode trust and can be dangerous.

šŸž Bottom Bread (Anchor): Saying ā€œthe law requires three warningsā€ when the passage lists two is a hallucination.

šŸž Top Bread (Hook): When a gadget breaks, you troubleshoot step by step to find the faulty part.

🄬 Filling (The Actual Concept): Hierarchical error decomposition is a flow that classifies errors as hallucinations first, then retrieval errors, then reasoning errors. How it works:

  1. Is the answer grounded? If no, it’s a hallucination.
  2. If grounded but incorrect, did we retrieve the key passage? If no, it’s a retrieval error.
  3. If yes, it’s a reasoning error.

Why it matters: Without this ladder, teams can’t tell whether to fix search, reading, or writing.

šŸž Bottom Bread (Anchor): If the right jury-direction passage was retrieved but the model still got the rule wrong, that’s a reasoning error.

šŸž Top Bread (Hook): Picture testing every pizza topping combo to see which ones really make people happiest.

🄬 Filling (The Actual Concept): Full factorial design means testing every embedder with every LLM to measure main effects and interactions. How it works:

  1. Pick several embedders and several LLMs.
  2. Run all pairings on the same questions.
  3. Analyze which component drives gains and whether certain pairs click.

Why it matters: Without it, you might praise a model that only looks good next to a weak partner.

šŸž Bottom Bread (Anchor): Trying Kanon 2 Embedder with both Gemini 3.1 Pro and GPT-5.2 shows Kanon 2 helps no matter which LLM you use, proving the retriever’s broad impact.

03Methodology

High-level overview: Question + Legal Corpus → [Embedding-based Retrieval] → [Top-k Context] → [LLM Answer Generation] → [LLM-as-Judge Scoring] → Output: Correctness, Groundedness, Retrieval Accuracy, and Error Type.

Corpus and questions:

  • 4,876 passages from the Victorian Criminal Charge Book were converted to clean text, split by sections and subsections, and semantically chunked so each piece stayed under 512 tokens.
  • 100 expert, long-form questions were written to mimic real legal tasks. Experts also wrote reference answers and pointed to the single most supportive passage for each question.
  • Questions were crafted to be lexically different from their answers’ passages to stress true semantic retrieval, not keyword matching.

šŸž Top Bread (Hook): Think of turning words into map coordinates so you can quickly find nearby places.

🄬 Filling (The Actual Concept): Embedding models convert text into vectors (number lists) so similar meanings sit close together. How it works:

  1. Turn each corpus chunk and question into vectors.
  2. Measure similarity (closeness) between the question and chunks.
  3. Pick the top-k closest passages as context.

Why it matters: Without good embeddings, you fetch off-target passages and hurt everything downstream.

šŸž Bottom Bread (Anchor): A question about ā€œidentification warningsā€ should land near passages that discuss ā€œjury directions about eyewitness reliability,ā€ even if words differ.

Retrieval step (what and why):

  • What happens: For each question, the retriever ranks all passages by semantic similarity and returns top-k.
  • Why it exists: This gives the LLM the most helpful evidence; wrong passages trigger confusion and hallucinations.
  • Example: If the query asks about when to give an alibi direction, the retriever should bring the ā€œalibiā€ section, not general sentencing policy.

Answer generation step (what and why):

  • What happens: The LLM reads the question plus the retrieved passages and composes a reasoned, cited answer.
  • Why it exists: Legal answers must synthesize rules, apply to facts, and explain clearly.
  • Example: For a self-defense instruction question, the LLM should quote or paraphrase the extracted elements and apply them to the hypothetical.

Scoring step (LLM-as-judge):

  • What happens: A judging model (GPT-5.2 in high reasoning mode) scores each answer on correctness and groundedness, following a clear rubric with binary outcomes.
  • Why it exists: Consistent, scalable evaluation of thousands of answers is needed; binary choices reduce ambiguity.
  • Example: If an answer claims a requirement absent from the passages, groundedness = 0.

Metrics calculated:

  • Correctness: 1 if the answer entails the reference answer; else 0.
  • Groundedness: 1 if the answer is supported by retrieved passages; else 0.
  • Retrieval accuracy: 1 if the annotated supporting passage was retrieved; else 0.
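Aggregating the three binary metrics over a question set is then a simple per-metric average. The per-question records below are invented for illustration.

```python
# Each question yields three 0/1 scores; the benchmark reports their means.
records = [
    {"correct": 1, "grounded": 1, "retrieved": 1},
    {"correct": 0, "grounded": 1, "retrieved": 0},
    {"correct": 1, "grounded": 1, "retrieved": 1},
    {"correct": 0, "grounded": 0, "retrieved": 0},
]

def mean(metric: str) -> float:
    return sum(r[metric] for r in records) / len(records)

scores = {m: mean(m) for m in ("correct", "grounded", "retrieved")}
print(scores)  # {'correct': 0.5, 'grounded': 0.75, 'retrieved': 0.5}
```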

šŸž Top Bread (Hook): Like checking if your map actually led you to the museum you wanted—not just any building.

🄬 Filling (The Actual Concept): Retrieval accuracy checks if the specific, expert-labeled key passage for the question was returned. How it works:

  1. Compare the top-k list against the gold passage.
  2. If included, mark as 1; else 0.
  3. Aggregate across questions for each embedder.

Why it matters: It shows whether you can fetch the essential anchor text, not just something related.

šŸž Bottom Bread (Anchor): For a hearsay exception question, retrieval accuracy means the hearsay exception section appears among the top retrieved chunks.

Hierarchical error decomposition (secret sauce):

  • First, check groundedness. If the answer isn’t tied to the provided passages, that’s a hallucination. In law, ungrounded correct guesses still fail because they can’t be verified.
  • Next, if grounded but still incorrect, check retrieval accuracy.
    • If the gold passage wasn’t retrieved, label a retrieval error (the LLM tried but lacked the best support).
    • If it was retrieved but the answer is still wrong, it’s a reasoning error (the LLM misread or misapplied the rule).
  • Why this is clever: It mirrors real legal workflows where evidence comes first. It also tells teams exactly which component to improve.

Full factorial design (how comparisons stay fair):

  • Test every embedder (Kanon 2 Embedder, Gemini Embedding 001, Text Embedding 3 Large) with every LLM (Gemini 3.1 Pro, GPT-5.2) using the same RAG pipeline and defaults.
  • This produces apples-to-apples comparisons and lets the authors run statistical tests for main effects (which component matters most) and interactions (do certain pairs click or clash?).

Examples with data:

  • With Kanon 2 Embedder, average retrieval accuracy is about 86%, and correctness and groundedness are roughly 94% and 96%, showing a strong foundation.
  • Weaker embedders (around 52–53% retrieval accuracy) lead to lower correctness (about 74–77%) and lower groundedness (about 87–92%), and higher hallucination rates.

What breaks without each step:

  • Without good embeddings: You retrieve the wrong passages; correctness and groundedness fall, hallucinations rise.
  • Without long-form references: You can’t fairly judge reasoning or see if the model understands nuances.
  • Without the judge rubric: Scores wobble; comparisons become unreliable.
  • Without the error ladder: You misdiagnose problems—blaming the writer (LLM) when the librarian (retriever) missed the book.

Secret sauce in one line: Tie expert-grade questions to exact supporting passages, score correctness and groundedness together, and use a factorial grid plus an error ladder to pinpoint whether to fix search or reasoning first.

04Experiments & Results

The test: The benchmark measures three outcomes per question, per embedder–LLM pair: correctness (is the answer right), groundedness (is it backed by retrieved text), and retrieval accuracy (did we fetch the gold passage). This captures both the quality of the librarian (retrieval) and the writer (LLM), and whether the writer sticks to the sources.

The competition: Three state-of-the-art embedders—Kanon 2 Embedder, OpenAI Text Embedding 3 Large, and Gemini Embedding 001—were each paired with two frontier LLMs—Gemini 3.1 Pro and GPT-5.2—in a full factorial grid. Everyone ran through the same simple RAG pipeline to keep things fair, and GPT-5.2 (high reasoning mode) judged the outputs with a strict rubric.

The scoreboard with context:

  • Kanon 2 Embedder averaged about 94% correctness, 96% groundedness, and 86% retrieval accuracy across LLMs. That’s like getting an A in answering, an A in citing sources, and a strong A- in finding the key passage.
  • Text Embedding 3 Large averaged roughly 76.5% correctness, 91.5% groundedness, and 52% retrieval accuracy. Think B in answering, still strong in citing, but a D in finding the key passage.
  • Gemini Embedding 001 averaged around 74% correctness, 87% groundedness, and 53% retrieval accuracy—similar difficulty in finding the right passage, with slightly lower groundedness.
  • For LLMs, Gemini 3.1 Pro averaged about 82.3% correctness and 94.3% groundedness; GPT-5.2 averaged about 80.7% correctness and 88.7% groundedness. So LLM choice matters moderately, especially for groundedness.

Headline result: Retrieval quality dominates overall performance. Swapping to Kanon 2 Embedder improved correctness by about 17.5 points, groundedness by about 4.5 points, and retrieval accuracy by about 34 points compared to weaker embedders—a huge swing that lifts the whole RAG pipeline.

Surprising (and important) finding: Many ā€œhallucinationsā€ weren’t random LLM leaps; they were triggered by bad retrieval. When the retriever brought irrelevant or weak passages, the LLM was more likely to invent missing details. With stronger retrieval (Kanon 2), hallucinations dropped noticeably.

More detail on hallucinations and errors:

  • Gemini 3.1 Pro showed an average hallucination rate around 5.7%, while GPT-5.2 was around 11.3%—but crucially, hallucinations fell across the board when retrieval improved.
  • As retrieval errors shrink (with Kanon 2), remaining failures shift into reasoning errors, revealing where LLMs genuinely need improvement.

Overall RAG accuracy (after accounting for all error types):

  • Kanon 2 Embedder delivered about 91.5% accuracy against a sample average near 77.3%, a massive lift of roughly 14 percentage points (about 18% in relative terms) above average.
  • Gemini 3.1 Pro and GPT-5.2 showed more modest overall impacts (about +3% and āˆ’3% versus average), underlining that embedders drive the biggest gains.

Statistical significance and interactions:

  • Main effects: Embedder choice was statistically significant for correctness, groundedness, and the combined metric, confirming that retrieval quality is the dominant factor.
  • LLM main effect: Not significant for correctness; significant for groundedness. That means which LLM you pick mostly changes how faithfully it sticks to sources.
  • Interactions: Detected for groundedness. Switching from Gemini 3.1 Pro to GPT-5.2 reduced groundedness by about 9–10 points with weaker embedders, but produced no detectable change with Kanon 2. In other words, a strong retriever cushions the LLM’s tendency to drift from sources.

Bottom line with a plain-language analogy: If your librarian consistently finds the right law pages, your writer rarely goes off-script. If the librarian stumbles, even the best writer may guess. Fix the librarian first.

05Discussion & Limitations

Limitations:

  • Domain specificity: The dataset centers on Victorian criminal law. While the approach generalizes, exact numbers may change in other jurisdictions or civil/commercial domains.
  • Single-corpus focus: The benchmark is anchored to one high-quality source (the Criminal Charge Book). Multi-source settings (cases, statutes, practice notes) might introduce additional retrieval complexity.
  • LLM-as-judge: Although the judging pipeline reportedly achieved about 99% internal accuracy with a strict rubric and binary choices, any automated judge can carry hidden biases or edge-case errors.
  • Passage-level gold: Retrieval accuracy keys off a single annotated ā€œbestā€ passage. In practice, multiple passages can be jointly sufficient; the methodology partly accounts for this via groundedness but still simplifies a complex reality.
  • Vendor involvement: The authors’ company built Kanon 2 Embedder and sponsored the benchmark, which underscores the importance of open data/code for independent replication.

Required resources:

  • Legal expertise to craft challenging questions, gold answers, and gold passages.
  • High-quality legal corpora and careful chunking/tokenization.
  • Compute and API access for multiple embedders/LLMs; capacity to run factorial experiments.
  • Evaluation tooling (LLM-as-judge with a firm rubric or human review for audits).

When not to use:

  • Pure closed-book reasoning tasks where retrieval is disallowed.
  • Domains with radically different text structure (e.g., patents or cross-language law) unless you adapt the corpus and chunking strategy.
  • Situations demanding ultra-long context beyond chunk sizes without an appropriate long-context retriever.

Open questions:

  • Cross-jurisdiction scaling: How do results change with mixed sources (cases, statutes, secondary materials) and multiple jurisdictions?
  • Multi-hop reasoning: How to best evaluate questions that require weaving together several passages across documents?
  • Judge robustness: What’s the best mix of human review and LLM judges to ensure reliability and fairness across edge cases?
  • Interaction patterns: Do certain LLMs pair better with specific domain embedders in other legal areas?
  • Guardrails: Which prompting or citation formats reduce residual hallucinations when retrieval is already strong?

06Conclusion & Future Work

Three-sentence summary: Legal RAG Bench is an end-to-end benchmark that pairs expert-level legal questions, long-form answers, and exact supporting passages to test real-world legal RAG systems. By measuring correctness, groundedness, and retrieval accuracy together—and breaking errors into hallucinations, retrieval failures, and reasoning failures—it shows that retrieval quality dominates overall performance. A full factorial design and statistical tests confirm that better embedders lift the ceiling, while LLM choice mainly affects groundedness.

Main achievement: The paper cleanly proves, with data and rigorous design, that in practical legal RAG, retrieval sets the ceiling: upgrade the embedder first if you want the biggest gain in correctness and faithful, evidence-backed answers.

Future directions: Expand to multi-source, multi-jurisdiction corpora; push multi-hop evaluations; test more retrievers (re-ranking, hybrid sparse-dense, long-context) and more LLMs; integrate human-in-the-loop judging for audits; and explore prompts/citation styles that further reduce residual hallucinations.

Why remember this: It gives legal AI builders a trustworthy, reproducible way to see exactly where their RAG pipeline fails and how to fix it. It replaces guesswork with a map: first improve retrieval, then refine reasoning. And because all data and code are open, the community can iterate quickly toward safer, more reliable legal assistants.

Practical Applications

  • Choose embedders first: Pilot multiple retrieval models on Legal RAG Bench to pick the one that maximizes retrieval accuracy and groundedness.
  • Debug smarter: Use the error ladder to see if failures are hallucinations, retrieval errors, or reasoning errors, then fix that layer.
  • Prompt for grounding: Add instructions that require quotes and citations, and verify groundedness automatically.
  • Hybrid retrieval: Combine dense embeddings with keyword or re-ranking to raise retrieval accuracy above single-method limits.
  • Chunking strategy: Tune chunk sizes and overlaps using benchmark feedback to avoid splitting key rules across chunks.
  • LLM selection: After retrieval is strong, compare LLMs for groundedness and reasoning clarity on the same pipeline.
  • Judge audits: Run an LLM-as-judge with a strict rubric, then spot-check with humans to ensure evaluation reliability.
  • Domain expansion: Replicate the methodology on other legal corpora (statutes, case law, practice notes) to build broader legal RAG tests.
  • Safety checks: Block ungrounded answers in production by detecting groundedness=0 and asking the retriever to try again.
  • Monitoring: Track correctness, groundedness, and retrieval accuracy over time in production to catch regressions early.
#legal RAG Ā· #retrieval-augmented generation Ā· #embedding models Ā· #groundedness Ā· #hallucinations Ā· #retrieval accuracy Ā· #hierarchical error decomposition Ā· #full factorial design Ā· #legal benchmarks Ā· #Victorian Criminal Charge Book Ā· #LLM-as-judge Ā· #legal information retrieval Ā· #RAG evaluation Ā· #semantic chunking Ā· #statistical significance