BEAVER: An Efficient Deterministic LLM Verifier
Key Summary
- BEAVER is a new way to check, with guaranteed certainty, how likely a language model is to give answers that obey important rules.
- Instead of randomly sampling full sentences and hoping to see good ones, BEAVER explores partial sentences smartly and prunes bad paths early.
- It uses two clever data helpers, a token trie (a tree of prefixes) and a frontier (the current leaves to explore), to keep track of possibilities and their probabilities.
- For any prefix-closed rule (where a broken prefix stays broken), BEAVER can give sound lower and upper bounds on the chance the model stays within the rules.
- These bounds are deterministic and always valid as the system runs, tightening over time so you can stop early with confidence.
- Across math correctness, email privacy, and secure code tests, BEAVER produced 6-8x tighter probability bounds than rejection sampling with the same compute.
- It revealed 3-4x more high-risk cases where models could leak private emails or generate insecure code, which looser methods missed.
- The approach is an anytime algorithm: at any moment, you have certified bounds you can trust.
- It works best when you can check prefixes quickly and have white-box access to the model's next-token probabilities.
- This makes safety auditing and deployment decisions much clearer and more reliable.
Why This Research Matters
LLMs are used in places where mistakes can be costly: handling private data, writing code, or following strict formats. BEAVER provides sound, deterministic probability bounds on whether an LLM follows the rules, instead of relying on loose sampling guesses. This helps teams make safer deployment decisions, set policies, and prioritize fixes using real, certified risk levels. It also spots hidden dangers earlier by pruning bad paths as soon as they appear. The result is clearer accountability, stronger safety margins, and more trust in AI systems used in the real world.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're exploring a giant maze. At each fork, you can peek down the corridor to see which turns look safe. Would you rather sprint to the end every time and then discover you went the wrong way, or would you like to spot dead ends early and skip them?
The Concept (The World Before): Large language models (LLMs) don't produce just one answer; they produce a whole probability distribution over many possible answers. Before tools like BEAVER, people checked safety and quality mostly by sampling lots of full answers and counting how many were okay. This is called rejection sampling. It's like running to the end of many maze paths and then tossing out the bad ones. It gives you a feel for behavior but no guaranteed safety line, just an estimate that might miss rare but dangerous failures.
Anchor: Teams would sample hundreds of model replies, then say, "Looks safe!", but a hidden 5% risk could still be there, unspotted.
Hook: You know how your teacher gives you both a minimum and maximum possible score range before grading all your work? That range tells you what's guaranteed so far.
The Concept (The Problem): For LLMs, we want guaranteed ranges, sound probability bounds, on how often outputs satisfy rules like "no private info," "valid JSON," or "secure code." But the output space is astronomically big (huge vocabularies and many token steps), so enumerating every possible output is impossible. Classic verification tools for vision models don't fit well either, because LLMs generate tokens one by one, mixing probabilities with discrete decoding. We needed a new way to reason about sequences and their chances.
Anchor: If an app promises that less than 1 in 1000 chats could leak an email, you want a proof-backed bound, not just a guess from random samples.
Hook: Think of house rules like "no shouting." If someone starts a sentence with a shout, you already know the rule is broken; you don't need to hear the rest.
The Concept (Failed Attempts): Rejection sampling keeps finishing entire replies even after the rule is already broken at the start. It also wastes time rediscovering the same popular answers again and again. And because it's statistical, it can miss risky corners of the distribution, especially if failures are rare but real.
Anchor: It's like grading whole essays when the title already breaks the rule.
Hook: Imagine a checklist that lets you stop early the moment you see a violation, and also keeps track of where you've already looked.
The Concept (The Gap): We were missing a principled, deterministic method that (1) explores partial generations, (2) prunes violating branches as soon as they appear, (3) avoids duplicates, and (4) keeps certified lower and upper probability bounds that tighten over time. We needed data structures that mirror how LLMs write token by token, and rules that let us prove soundness.
Anchor: A good map of the maze that shades out blocked hallways in real time.
Hook: Picture a tree where each step adds a letter, forming longer words. The moment a prefix breaks a rule, you stop growing that branch.
The Concept (Why This Paper Exists): BEAVER brings two key tools, a token trie (a prefix tree of valid partial generations) and a frontier (the current leaves to explore), plus a branch-and-bound search that keeps sound bounds at every step. For prefix-closed constraints (if a prefix is bad, all extensions are bad), BEAVER can safely discard huge chunks of the space early.
Anchor: If your rule is "no 'rm' command in a bash line," as soon as the model writes "rm", BEAVER prunes that entire region without wasting more compute.
Hook: Why care? Because the stakes are real.
The Concept (Real Stakes):
- In privacy: accidentally producing someone's email or phone number.
- In coding: slipping in a vulnerable pattern that hackers can exploit.
- In math or structured formats: returning invalid expressions that break downstream systems.
If you can't certify a safe probability range, you might ship a model that quietly fails 10% of the time.
Anchor: In the paper's tests, BEAVER found 3-4x more risky cases than the baseline, the kind you really want to catch before launching an app.
02 Core Idea
Hook: You know how when you build with LEGO, if the base piece is wrong, you stop right there; you don't keep stacking more bricks on top of a bad start.
The Concept (Aha! Moment): If a rule is prefix-closed (once a prefix breaks the rule, all completions are bad), then we can safely grow only valid prefixes and prune the rest, while keeping guaranteed probability bounds that tighten with each step.
How it works (big picture):
- Build a token trie that stores all explored valid prefixes and their probabilities.
- Keep a frontier of the current leaf prefixes (some complete, some incomplete).
- At each step, pick an incomplete leaf, do one model forward pass for next-token probabilities, add only the continuations that still satisfy the rule, and update the lower/upper bounds.
- Repeat; bounds monotonically tighten, and you can stop anytime with sound intervals (see the code sketch just after the Anchor below).
Why it matters: Without pruning and prefix tracking, you waste compute on full bad sequences and duplicates, and you can't provide deterministic guarantees.
Anchor: For a bash safety rule like "no rm or chmod," BEAVER never expands any prefix starting with rm or chmod. That instantly reduces uncertainty and tightens bounds.
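To make prefix-closure concrete, here is a minimal Python sketch of a blocklist rule in the spirit of this bash example; the names (BLOCKED, violates) are illustrative, not the paper's API.

```python
# Minimal sketch of a prefix-closed constraint, mirroring the bash example.
BLOCKED = {"rm", "chmod"}  # tokens that make any continuation unsafe

def violates(prefix: list[str]) -> bool:
    """True once any blocked token appears; every extension stays violating."""
    return any(tok in BLOCKED for tok in prefix)

# Prefix-closure in action: once a prefix violates, all extensions do too.
assert violates(["rm"])                  # broken immediately
assert violates(["rm", "-rf", "/tmp"])   # still broken after any extension
assert not violates(["ls", "-al"])       # a safe prefix can keep growing
```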
Hook (Analogy 1, Detective): When solving a mystery, if a suspect's alibi is already impossible halfway through the story, you drop that suspect without reading more pages.
The Concept: BEAVER does the same: it drops any partial path that already violates the rule and spends time only on promising suspects (prefixes with higher probability).
Why it matters: Time saved on dead ends is time invested in the most likely truths.
Anchor: That's how BEAVER got 6-8x tighter probability bounds than sampling on the math benchmark.
Hook (Analogy 2, Grocery Budget): You keep a running minimum of what you've already bought and a maximum of what you could still buy based on what's left.
The Concept: The lower bound is the total probability of completed, rule-following sequences found so far. The upper bound is that plus the probability of the remaining incomplete prefixes (pretending they might all finish validly). As you explore, you settle the uncertainty.
Why it matters: Even halfway through, you have guaranteed ranges you can trust.
Anchor: Mid-run, BEAVER might say, "The chance this reply is safe is between 0.34 and 0.36," which is already deployment-worthy precision.
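In symbols, with notation chosen to match this walkthrough (μ for path probability, φ for the rule, F_complete and F_incomplete for the two kinds of frontier leaves; a sketch, not necessarily the paper's exact notation):

```latex
\mathrm{LB} \;=\; \sum_{x \in F_{\mathrm{complete}}} \mu(x)
\;\le\;
\Pr_{y \sim P_M(\cdot \mid \mathrm{prompt})}\!\big[\, y \models \varphi \,\big]
\;\le\;
\sum_{x \in F_{\mathrm{complete}}} \mu(x) \;+\; \sum_{x \in F_{\mathrm{incomplete}}} \mu(x)
\;=\; \mathrm{UB}
```

Each expansion either finishes a leaf (its mass moves into the lower bound), prunes it (its mass leaves the upper bound), or splits it among surviving children, so LB never decreases and UB never increases.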
Hook (Analogy 3, Maps App): To reach a destination fastest, you try the most promising road first and prune roads you know are closed.
The Concept: A greedy selection called Max-μ always expands the highest-probability incomplete prefix first, shrinking uncertainty quickly. There's also Sample-μ, which samples prefixes proportional to their probabilities, trading determinism for diversity.
Why it matters: Smarter exploration means fewer forward passes to get tight, certified bounds.
Anchor: In experiments, Max-μ reduced the bound gap much faster than rejection sampling for the same compute.
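A minimal sketch of the two strategies over a hypothetical frontier mapping prefixes to their path probabilities μ (function names are illustrative):

```python
import random

# Hypothetical frontier: incomplete prefixes mapped to path probabilities μ.
frontier = {("ls",): 0.60, ("echo",): 0.25, ("cat",): 0.05}

def select_max_mu(frontier):
    """Max-μ: deterministically expand the highest-probability prefix."""
    return max(frontier, key=frontier.get)

def select_sample_mu(frontier, rng=random):
    """Sample-μ: pick a prefix with probability proportional to its μ."""
    prefixes = list(frontier)
    weights = [frontier[p] for p in prefixes]
    return rng.choices(prefixes, weights=weights, k=1)[0]

print(select_max_mu(frontier))     # always ('ls',)
print(select_sample_mu(frontier))  # usually ('ls',), sometimes the others
```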
Building blocks (broken into smaller pieces):
- Hook: Like sorting your LEGO pieces before building. Token Trie: A tree storing valid prefixes and their path probabilities; no illegal branches are grown. Why it matters: It records explored good prefixes so we don't redo work. Anchor: Paths like [ls, -al, ⟨eos⟩] become a branch; [rm, …] never appears.
- Hook: Like a to-do list of rooms to explore in a mansion. Frontier: The set of current leaves (complete + incomplete). Why it matters: It defines what we'll explore next and how we compute bounds at any point. Anchor: All safe leaf prefixes (like [ls] and [echo]) sit on the frontier, ready to expand.
- Hook: Like setting a floor and ceiling on your test score while papers are still being graded. Probability Bounds: Lower bound = sum of completed safe paths; upper bound = sum of completed safe paths + all incomplete valid prefixes. Why it matters: These are deterministic, sound, and tighten monotonically. Anchor: On the math benchmark, bounds like [0.343, 0.356] tell you a model's true correctness is pinned tightly.
- Hook: Like choosing the hallway that seems widest first. Max-μ Selection: Always pick the incomplete prefix with the highest path probability to expand. Why it matters: It reduces uncertainty fastest in practice. Anchor: This is one big reason BEAVER beats rejection sampling on tightness.
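One plausible shape for a trie node, as a hedged Python sketch; the field names are assumptions for illustration, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    """One explored, rule-satisfying prefix (illustrative field names)."""
    tokens: tuple[str, ...]                        # the prefix itself
    mu: float                                      # path probability μ(prefix)
    children: dict = field(default_factory=dict)   # next token -> TrieNode
    complete: bool = False                         # ends with <eos>?

root = TrieNode(tokens=(), mu=1.0)  # the empty prefix always has μ = 1
```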
03 Methodology
At a high level: Prompt + Model + Prefix-Closed Rule → Select Best Incomplete Prefix → One Forward Pass → Keep Only Rule-Following Continuations → Update Lower/Upper Bounds → Repeat until tight enough.
We'll explain each step with the Sandwich pattern (Hook, Concept, Anchor) for the core pieces.
- Inputs and Setup. Hook: Think of setting up a treasure hunt: you need the map (prompt), the terrain rules (constraint), and a way to estimate which path is more promising (model probabilities). The Concept: We start with a prompt, a language model that gives next-token probabilities, and a prefix-closed semantic constraint (a rule where a bad prefix is permanently bad). We initialize a token trie with just the empty sequence and set the frontier to that single leaf. The lower bound starts at 0 (we've confirmed nothing), and the upper bound starts at 1 (everything might still be okay). Why it matters: This gives us a clean, sound starting point for bounds. Anchor: Before exploring any token, we say, "Safety probability is somewhere between 0% and 100%."
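The same starting state in a minimal dict-based sketch (illustrative names, consistent with the other code sketches in this section):

```python
# A minimal starting state matching the setup above.
complete = {}             # finished, rule-satisfying leaves: prefix -> μ
incomplete = {(): 1.0}    # the empty prefix is the only leaf; μ(()) = 1
lower_bound = sum(complete.values())                  # 0.0: nothing confirmed yet
upper_bound = lower_bound + sum(incomplete.values())  # 1.0: everything still possible
```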
- The Token Trie (storing explored good prefixes). Hook: Like a family tree where each child adds one new letter to a word. The Concept: Nodes store sequences (prefixes) and their path probabilities (the product of next-token probabilities along the path). We only add children for tokens that keep the sequence within the rule. How it works (sketched in code after this item):
- Begin at the root (empty prefix).
- For a chosen prefix x, do a forward pass to get P_M(t | prompt·x) for all tokens t.
- For each t that keeps x·t within the rule, add a child node labeled by t with its conditional probability; record the updated path probability μ(x·t) = μ(x) × P_M(t | prompt·x). Why it matters: The trie mirrors legal generations, never wasting time on illegal branches. Anchor: In the bash example, children for rm or chmod don't appear at all because they instantly violate safety.
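A sketch of one such expansion step; here next_token_probs stands in for a single model forward pass, and violates can be any prefix-closed checker:

```python
def expand(prefix, mu, next_token_probs, violates):
    """One expansion: keep only children that still satisfy the rule.

    next_token_probs: dict token -> P_M(token | prompt, prefix), a stand-in
    for one forward pass. Returns surviving children with updated path
    probabilities μ(prefix + token) = μ(prefix) * P(token).
    """
    children = {}
    for token, p in next_token_probs.items():
        child = prefix + (token,)
        if violates(child):
            continue  # prefix-closed: this branch can never recover; prune it
        children[child] = mu * p
    return children
```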
- The Frontier (what to explore next and how we bound). Hook: Imagine a row of doors at the edge of explored space; the frontier is those doors. The Concept: The frontier is all current leaves. Some leaves are complete (end with ⟨eos⟩), and some are incomplete. We compute bounds from these:
- Lower bound = sum of μ over complete, rule-satisfying leaves.
- Upper bound = sum of μ over all leaves (complete + incomplete), pretending incomplete ones could all finish validly. Why it matters: Even mid-search, you have sound, provable bounds you can trust. Anchor: After a few expansions in the bash task, you might have LB=0.21 from finished safe commands and UB=0.80 because some high-probability, safe-looking prefixes still need to be resolved.
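Computing both bounds from the frontier is just two sums; this hedged helper mirrors the hypothetical bash snapshot above:

```python
def bounds(complete, incomplete):
    """Sound anytime bounds from the frontier (illustrative helper).

    complete:   {prefix: μ} for finished, rule-satisfying sequences
    incomplete: {prefix: μ} for leaves still being written
    """
    lb = sum(complete.values())         # mass already proven safe
    ub = lb + sum(incomplete.values())  # plus mass that might still be safe
    return lb, ub

# e.g. the bash snapshot above:
print(bounds({("ls", "<eos>"): 0.21}, {("echo",): 0.39, ("cat",): 0.20}))
# -> approximately (0.21, 0.80)
```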
- Selection Strategy (which prefix to expand). Hook: When time is short, start with the path most likely to lead somewhere. The Concept: Max-μ picks the incomplete leaf with the highest path probability. There's also Sample-μ, which samples leaves in proportion to their μ. You do exactly one forward pass per expansion. Why it matters: Max-μ shrinks uncertainty fast in practice, making the bounds tighten quickly. Anchor: On math problems, Max-μ helped BEAVER get 6-8x tighter gaps than sampling for the same compute.
- Expansion and Bound Update (the main loop). Hook: Open one door from the frontier, peek inside, and add only the safe rooms beyond it. The Concept: For the selected prefix x (an end-to-end sketch follows this item):
- Temporarily remove its μ(x) from the upper bound (we're replacing that leaf with its children).
- Run one forward pass to get next-token probabilities.
- For each token t:
- If x·t violates the rule, ignore it forever (prefix-closed pruning).
- If t is ⟨eos⟩ and x·t satisfies the rule, add μ(x·t) to both the frontier's complete set and the lower bound.
- If t is a normal token and x·t satisfies the rule, add μ(x·t) to the frontier's incomplete set and to the upper bound. Why it matters: This bookkeeping ensures the bounds remain sound at every iteration and tighten monotonically. Anchor: In the email privacy task, as soon as a prefix starts forming a known leaked address, that entire branch is pruned and its probability cannot contribute to safe completions.
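Here is a hedged end-to-end sketch of this loop; model is a hypothetical callable returning one next-token distribution per call, and the code illustrates the described procedure rather than reproducing the authors' implementation:

```python
EOS = "<eos>"

def beaver_bounds(model, violates, budget):
    """Anytime branch-and-bound sketch of the loop described above.

    model(prefix) -> {token: prob} stands in for one forward pass;
    violates(prefix) -> bool must be prefix-closed. Returns (LB, UB)
    after `budget` expansions; the interval is sound at every step.
    """
    complete = {}            # finished, rule-satisfying leaves: prefix -> μ
    incomplete = {(): 1.0}   # unfinished leaves: prefix -> μ
    for _ in range(budget):
        if not incomplete:
            break  # fully resolved: LB == UB
        # Max-μ selection: expand the most probable unfinished prefix.
        prefix = max(incomplete, key=incomplete.get)
        mu = incomplete.pop(prefix)  # its mass moves to surviving children
        for token, p in model(prefix).items():
            child = prefix + (token,)
            if violates(child):
                continue  # pruned forever (prefix-closed)
            if token == EOS:
                complete[child] = mu * p    # raises the lower bound
            else:
                incomplete[child] = mu * p  # stays in the upper bound
    lb = sum(complete.values())
    ub = lb + sum(incomplete.values())
    return lb, ub
```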
- Secret Sauce (why it's clever). Hook: If you can spot doomed paths early, you save tons of time. The Concept: Prefix-closure plus a trie lets BEAVER prune massive parts of the space immediately, while the frontier and Max-μ focus compute where probability mass really lies. The bounds are anytime: you can stop at any point and still have valid, certified ranges. Why it matters: Deterministic, sound progress beats noisy estimates when safety decisions matter. Anchor: After just 10 expansions in the bash toy example, BEAVER narrowed the probability to a 0.1-wide window, far tighter than rejection sampling achieved with much more compute.
- Example with actual data (bash safety mini-walkthrough). Hook: Checking a command line for safety is like checking lunch for allergies: one bad ingredient spoils the meal. The Concept: Constraint: no rm, no chmod, no /etc/passwd. Start at the empty prefix (a runnable toy version follows this item):
- First pass: big mass on ls, some on echo, a tiny bit on ⟨eos⟩; rm and chmod violate and are pruned.
- Bounds become LB = prob(⟨eos⟩), UB = sum of all current safe leaves.
- Next, expand ls (it is the highest-μ leaf). Add safe children like -al or ⟨eos⟩ and update bounds.
- Repeat: expand the highest-μ leaf, add safe children, update LB/UB. Why it matters: UB drops as you prune impossibilities; LB rises as you confirm completed safe commands. Anchor: After several steps, you get a tight band like [0.70, 0.80], a precise safety slice you can trust.
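A runnable toy version of this walkthrough, written as a usage example for the beaver_bounds sketch above; every probability below is made up for illustration:

```python
EOS = "<eos>"

def toy_model(prefix):
    """Hypothetical next-token distributions for a tiny bash-like model."""
    if prefix == ():
        return {"ls": 0.55, "echo": 0.20, "rm": 0.15, "chmod": 0.07, EOS: 0.03}
    if prefix == ("ls",):
        return {"-al": 0.60, EOS: 0.40}
    return {EOS: 1.0}  # every other continuation simply terminates

def violates(prefix):
    return any(tok in {"rm", "chmod"} for tok in prefix)

# After one expansion, rm/chmod are pruned: 0.22 of the mass can never be safe.
# LB = 0.03 (the bare <eos>), UB = 0.03 + 0.55 + 0.20 = 0.78.
print(beaver_bounds(toy_model, violates, budget=1))   # ≈ (0.03, 0.78)
print(beaver_bounds(toy_model, violates, budget=10))  # bounds meet: ≈ (0.78, 0.78)
```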
- What breaks without each step:
- No trie: you'll repeat work and can't systematically remember explored prefixes.
- No frontier: you can't compute sound bounds or choose the best next expansion.
- No prefix-closure: you can't prune early, so you lose the main efficiency gain.
- No Max-μ: your progress on tightening may be much slower.
- No bound updates per step: you lose anytime guarantees.
Putting it all together: Input → Initialize trie and frontier → While budget remains: select best leaf (Max-μ) → one forward pass → add only safe children → recompute LB/UB from frontier → stop when tight enough or budget ends. The result: deterministic, provably sound bounds on the true probability that the LLM satisfies your rule.
04 Experiments & Results
Hook: Think of a scoreboard that not only shows who's winning, but by exactly how much, with a confidence band that gets tighter as the game goes on.
The Concept (The Test): The authors measured how tightly BEAVER could pin down the true probability that an LLM stays within a rule, compared to rejection sampling with the exact same compute budget (100 forward passes per instance). They tested three high-stakes areas:
- Math correctness (GSM-Symbolic): The rule says the final symbolic expression must be grammatically valid and functionally equivalent to the ground truth (checked by Z3). The grammar part is prefix-closed, enabling early pruning.
- Email privacy (Enron): The rule says do not output an email from a known leaked set. Violations are prefix-detectable as the address forms, making the rule prefix-closed.
- Secure code (CyberSecEval Rust autocomplete): The rule says do not produce insecure patterns (detected by ICD). Pattern detection is prefix-closed, so insecure prefixes are pruned.
Anchor: In everyday terms: "Don't say private emails," "Don't write dangerous code," and "Write a correct math expression."
Hook: It's not fun to run a race if your opponent gets more time or more shoes, so both sides used the same compute.
The Concept (The Competition): BEAVER vs. rejection sampling (RS), both with 100 forward passes. They compared:
- Bound tightness (Gap = UB − LB) for math correctness; tighter is better.
- Risky Distribution Ratio (RDR) for privacy/security: the share of tasks where UB < 0.9, meaning there is sound evidence of non-trivial violation probability (> 0.1). Higher RDR (with tight bounds) means you've revealed real risks that looser methods might hide.
- How quickly bounds converged (average passes to reach a 0.01 gap threshold).
Anchor: Same budget, but BEAVER acted like a skilled detective, using clues to focus on the most likely paths.
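As a small sketch, RDR is just the fraction of per-instance upper bounds that fall below the 0.9 threshold (the numbers here are hypothetical):

```python
def risky_distribution_ratio(upper_bounds, threshold=0.9):
    """Share of instances where UB < threshold, i.e. sound evidence that
    the violation probability exceeds 1 - threshold (here > 0.1)."""
    flagged = sum(1 for ub in upper_bounds if ub < threshold)
    return flagged / len(upper_bounds)

# Hypothetical per-instance upper bounds on "stays within the rule":
print(risky_distribution_ratio([0.99, 0.42, 0.88, 0.95]))  # -> 0.5
```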
Hook: Scoreboard time!
The Concept (The Scoreboard, with context):
- GSM-Symbolic (Correctness): Across models like Qwen3-4B and Llama3.3-70B, BEAVER achieved 6-8x tighter gaps than RS. Example: Qwen3-4B, RS gap 0.092 vs. BEAVER gap 0.013; BEAVER certified bounds [0.343, 0.356], which is like getting an exact B instead of a wide B-to-A range. It often converged in fewer passes (e.g., ~25 vs. ~49).
- Enron Email Leakage (Privacy): BEAVER flagged far more risky cases (UB < 0.9). Example: Qwen3-4B, RS flagged 15/100 risky instances; BEAVER flagged 67/100. That's like finding four times more hidden leaks, because the bounds are tight enough to prove real risk.
- Secure Code (Rust, adversarial): With a jailbreak prefix, BEAVER uncovered many more risky cases than RS. Example: Qwen3-30B-A3B, RS found 2/204 risky cases; BEAVER found 86/204. That's an order-of-magnitude difference, like discovering a minefield where RS saw only a few pebbles.
Why it matters: When deploying systems, underestimating risk is dangerous. BEAVER's bounds reduce false safety and reveal where attention is needed.
Anchor: If a hospital's code assistant could produce insecure snippets 30% of the time under pressure, you'd want to know that with certainty, not from a loose estimate.
Hook: Any surprises?
The Concept (Surprising Findings):
- Lower temperature (sharper distributions) tightened bounds even faster, because mass concentrates on a few high-probability paths that BEAVER can resolve quickly. For Qwen3-4B on GSM-Symbolic, the gap shrank from 0.013 (T=1) to 0.001 (T=0.33).
- Sample-μ (probabilistic frontier selection) performed similarly to Max-μ in final tightness, showing robustness to strategy choice, though Max-μ was typically a bit tighter and faster.
Anchor: Like focusing a camera: sharper focus helps you identify what's there sooner, but even a slightly softer focus (Sample-μ) still gets the job done well.
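The temperature effect is easy to see in a sketch of temperature-scaled softmax: dividing logits by a smaller T concentrates mass on fewer tokens, so fewer prefixes carry almost all the probability:

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature-scaled next-token distribution: lower T sharpens it."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))   # ≈ [0.66, 0.24, 0.10]
print(softmax_with_temperature(logits, 0.33))  # ≈ [0.95, 0.05, 0.00]
```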
Summary in plain language: With the same compute, BEAVER narrows uncertainty much more than rejection sampling, finds many more real risks in privacy and security settings, and gives you trustworthy ranges you can base decisions on.
05 Discussion & Limitations
Hook: Even the best tool has instructions and places where it shines, and places where it doesn't.
The Concept (Limitations):
- Prefix-closed rules only: BEAVER relies on pruning prefixes that already violate the rule. If your rule isn't prefix-closed (like "must match a full date exactly"), you'll need to convert it to a prefix-closed version (e.g., "is this prefix completable to a valid date?") or use a different approach (see the sketch after this list's Anchor).
- White-box access: BEAVER needs full next-token probability distributions (noisy or filtered outputs can change the distribution). If you only have black-box sampling APIs, you can't run BEAVER as-is.
- Constraint cost: If your rule requires heavy tools (e.g., SMT solvers or deep static analysis) at every step, checking each continuation can become the runtime bottleneck.
Anchor: It's like a powerful microscope: you need a clear sample (white-box access) and enough time to analyze it (constraint cost) to see the fine details.
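To illustrate the conversion trick for the date example, here is a hand-rolled, prefix-closed "still completable?" check; it is a sketch that only validates digit positions, not real month/day ranges:

```python
def completable_to_date(prefix: str) -> bool:
    """Can this prefix still be completed to match YYYY-MM-DD?

    Pruning on `not completable_to_date(prefix)` is prefix-closed:
    once a prefix is dead, every extension of it is dead too.
    """
    template = "NNNN-NN-NN"  # N marks a digit position
    if len(prefix) > len(template):
        return False
    for ch, slot in zip(prefix, template):
        if slot == "N" and not ch.isdigit():
            return False
        if slot == "-" and ch != "-":
            return False
    return True

assert completable_to_date("2024-0")   # still completable -> keep exploring
assert not completable_to_date("20x")  # dead prefix -> prune all extensions
```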
Hook: What gear do you need to run it well?
The Concept (Required Resources):
- Access to the model's logits or exact next-token probabilities.
- A fast, decidable, prefix-closed constraint checker (or a prefix-completable version of the rule).
- Compute to run many forward passes (the paper used A100 GPUs) and memory to hold frontiers.
Anchor: If your rule is simple (like pattern checks) and you have the model locally, BEAVER can sing.
Hook: When should you skip BEAVER?
The Concept (When NOT to Use):
- Purely non-prefix-closed constraints with no practical prefix-closure trick.
- Strict black-box API access with only sampled outputs.
- Extremely expensive constraint checks per token, where even caching can't help.
Anchor: If you can't prune prefixes or see probabilities, BEAVER can't do its magic.
Hook: What's next on the roadmap?
The Concept (Open Questions):
- Smarter frontier strategies: Can we learn which prefix to expand for even faster tightening?
- Caching and incremental checks: How far can we push speed-ups for heavy constraints?
- Beyond single prompts: Can BEAVER verify distributions over prompts or multi-turn dialogues?
- New constraint classes: How can sound bounds extend beyond prefix-closed families?
Anchor: Think of BEAVER today as a strong engine; the next upgrades will make it faster, broader, and easier to plug into complex, real-world pipelines.
06 Conclusion & Future Work
Hook: You know how a good coach gives you a safe, sure plan even before the game is over? That's what BEAVER does for language models.
The Concept (3-Sentence Summary): BEAVER is a deterministic, branch-and-bound verifier for LLMs that computes sound lower and upper bounds on the probability that outputs satisfy prefix-closed rules. It uses a token trie to track only legal prefixes and a frontier to update bounds at every step, pruning violating branches early and focusing on high-probability paths. In math correctness, email privacy, and secure code generation, BEAVER produced much tighter bounds and revealed far more risky cases than rejection sampling with the same compute.
Main Achievement: By turning "it's probably fine" into "here's a certified safe range," BEAVER shows that deterministic verification of LLMs is not only possible but practical.
Future Directions: Improve frontier strategies, speed up constraint checks via caching and incrementality, support prompt distributions and multi-turn safety, and grow beyond prefix-closed rules without losing soundness. Exploring compatibility with limited-access models is another important step.
Why Remember This: In a world where models write code, handle private data, and make decisions, knowing the guaranteed safety odds matters. BEAVER gives you an anytime, sound safety gauge, tightening as it runs, so you can make deployment calls based on proofs, not just hopes. That shift, from loose estimates to certified bounds, can raise the safety bar for real-world AI systems.
Practical Applications
- Certify that a code-generation assistant stays below a maximum allowed probability of insecure completions before deployment.
- Verify that a customer-service bot's chance of leaking private data is under a strict policy threshold, with deterministic bounds.
- Audit math/finance models to ensure structured, grammar-valid expressions with tight correctness probability bounds.
- Gate model releases: require BEAVER's bounds to meet safety bars (e.g., UB < 0.1 for policy violations) before shipping.
- Continuously monitor safety drift by tracking changes in BEAVER's LB/UB over time and across model versions.
- Prioritize red-teaming by focusing on prompts where BEAVER's UB indicates high residual risk.
- Tune decoding temperature and observe how bounds change to pick safer runtime configurations.
- Select among multiple models using certified lower bounds of task correctness rather than noisy estimates.
- Design prefix-closed policies (e.g., toxic-token blocklists, secure-API-only patterns) that BEAVER can verify efficiently.
- Use bounds to inform human-in-the-loop workflows, escalating only instances where uncertainty remains high.