
Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Intermediate
Mingqian Feng, Xiaodong Liu, Weiwei Yang et al. Ā· 1/30/2026
arXiv Ā· PDF

Key Summary

  • Real attackers can try many prompts in parallel until a model slips, so testing safety with only one try badly underestimates risk.
  • This paper introduces SABER, a math-based method that predicts large-budget adversarial risk (Best-of-N) using only small-budget measurements.
  • SABER models per-try success chances with a Beta distribution and derives a simple scaling law that links small-N to big-N attack success rates.
  • With just n=100 samples per harmful query, SABER predicts ASR@1000 with mean absolute error around 1.66%, an 86% reduction versus a common baseline.
  • Different attack methods scale differently: some look weak at one try but explode in success when given more attempts, reversing who looks 'strongest'.
  • SABER has multiple practical estimators (Anchored, Plugin, Fit) and can even answer inverse questions like 'How many attempts to reach 95% success?'.
  • The method works under uneven budgets and partial data, making it suitable for real-world logging and red-teaming pipelines.
  • SABER helps shift safety evaluation from single-shot scores to scaling-aware risk forecasts that match realistic adversarial pressures.
  • The framework is low-cost, statistically grounded, and provides confidence intervals to quantify uncertainty.
  • Even when the Beta assumption is imperfect, SABER remains accurate in practice, offering a robust tool for safety teams.

Why This Research Matters

Real attackers don’t stop after one try—they scale up. SABER equips safety teams with a low-cost, mathematically grounded way to predict how risky things get when attackers run many attempts in parallel. It reveals that models that look safe in single-shot tests can show rapid, nonlinear risk growth at scale, changing how we rank threats and prioritize defenses. This helps organizations set realistic guardrails, allocate monitoring resources, and decide when to retrain or deploy stronger filters. SABER’s confidence intervals and inverse queries (like Budget@95%) turn abstract risk into concrete operational plans. By shifting evaluation from snapshots to scaling-aware forecasts, SABER brings safety measurement closer to real-world conditions.

Detailed Explanation


01Background & Problem Definition

You know how a locked door might feel safe if you only jiggle the handle once, but if someone tries many different keys and angles, they might eventually get it open? That’s the challenge with measuring the safety of Large Language Models (LLMs). For a long time, many tests checked whether a model refused a harmful request after just one or a few tries. If the model said 'no' once, it was often labeled safe enough. But real attackers don’t stop after one try—they try lots of variations, fast and in parallel, until something works.

Before this research, most benchmarks focused on single-shot or tiny-budget evaluations: one prompt, one answer, maybe a few retries. This made testing cheaper and simpler, but it missed what actually happens in the wild. Attackers can use scripts and cloud machines to send thousands of slightly tweaked prompts at once. Even if each try has a small chance to trick the model, the chance that at least one try succeeds can become very large when there are many tries. That means we were underestimating risk.

The problem researchers faced was twofold: first, how to measure realistic, large-scale adversarial risk without spending a fortune running countless trials; and second, how to predict what will happen at large budgets (like 1,000 or more attempts) using only small-budget measurements (like 10–100 attempts per question). Prior work showed that repeating attempts makes success go up, sometimes sharply, but there wasn’t a principled, math-driven way to predict that growth curve and compare attackers fairly across budgets.

People tried a few approaches. A common baseline was to estimate, for each harmful query, the per-try success chance from a small sample and then use a simple formula to forecast success after N tries. This helps a bit, but it treats those small-sample estimates as if they were perfectly accurate, which they’re not—especially when you only have a few attempts per query. Others fit straight lines to curves on log scales, noticing patterns but not explaining why the pattern shows up or how to get reliable confidence bounds. These methods could be noisy or biased, especially when extrapolating far beyond the observed range.

What was missing was a statistical bridge—a way to connect what we measure at small budgets to what actually happens at big budgets, while properly accounting for uncertainty and differences across queries. Also missing was a simple knob that tells us how quickly risk grows as we add more attempts—a parameter that could explain why two attacks might swap which one looks worse when N gets larger.

Here are the real stakes. If we rely on single-try safety checks, we might approve a model that appears robust but actually fails under parallel adversarial pressure. That matters for content moderation, personal assistants, enterprise copilots, education tools, code assistants, and more. In many of these settings, a rare failure can be enough to cause harm or violate policy. We need a way to predict, with small data and low cost, how risky things get when attackers scale up. This is exactly the gap the paper fills: a mathematically grounded method—SABER—that turns a handful of measurements into a reliable forecast of large-scale adversarial risk, so safety teams can plan defenses and deploy with eyes open.

02Core Idea

Aha! Moment in one sentence: Model each per-try success chance as a random variable (captured by a Beta distribution), then use an analytic scaling law to forecast how attack success rises with more tries, letting us predict big-budget risk from small-budget tests.

Three analogies:

  1. Fishing trips: Imagine each cast (attempt) has some chance to catch a fish (jailbreak). Different lakes (queries) have different fish densities (per-try success probabilities). If you model those densities across lakes and know how many casts you’ll make, you can predict the chance you’ll catch at least one fish.
  2. Arcade games: Each token you spend has some chance to hit the jackpot, and each machine (query) has its own hidden odds. If you understand how those odds vary across machines, you can predict how likely you are to get at least one jackpot after N tokens.
  3. Seeds and sprouting: You plant N seeds. Each patch of soil (query) has its own sprouting probability. Knowing how those probabilities are distributed tells you how likely at least one seed sprouts when you plant more.

Before vs. After:

  • Before: We checked safety mostly at ASR@1 (one try) or a tiny budget, assuming relative rankings would stay the same as N grows. If Attack A was better than Attack B at one try, people often assumed A would remain better at all N.
  • After: SABER shows that rankings can flip as N grows because what matters is not just the average per-try success, but how the low-probability tail behaves. A method that looks weak at one try might explode in success with more attempts, overtaking the other.

Why it works (intuition, no equations):

  • Each harmful query has its own hidden per-try success chance. Across many queries, these chances vary. The Beta distribution is a flexible, standard way to represent probabilities that range between 0 and 1 and to capture uncertainty cleanly.
  • When you take N independent tries, the chance that none succeed shrinks in a predictable way, tied to how much probability mass lies near zero (the left tail). A single parameter, alpha, controls how quickly that chance shrinks. Smaller alpha usually means risk rises faster as N increases.
  • By estimating the Beta parameters (alpha and beta) from small-budget data with a careful likelihood (that respects sampling noise), we get a reliable picture of how risk will scale at big N without brute-force trials. A toy simulation after this list makes the idea concrete.
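Here is a minimal toy Monte Carlo sketch of that intuition (not the paper's code; the alpha and beta values are just illustrative), showing how ASR@N climbs when per-try success chances are drawn from a Beta distribution:

```python
# Toy illustration of the core idea (hypothetical parameters): per-try success
# chances vary across queries, and ASR@N is the fraction of queries broken at
# least once within N independent attempts.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.4, 4.0          # hypothetical Beta shape parameters
num_queries = 5000              # simulated harmful queries

p = rng.beta(alpha, beta, size=num_queries)   # hidden per-try success chance per query

for N in [1, 10, 100, 1000]:
    asr = np.mean(1.0 - (1.0 - p) ** N)       # P(at least one of N tries succeeds), averaged
    print(f"ASR@{N} ~= {asr:.3f}")
```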

Building Blocks (explained with the Sandwich pattern):

šŸž Hook: Imagine picking the highest score from many dice rolls—the more rolls, the better your best score gets. 🄬 Best-of-N Sampling: It means you try N times in parallel and take the first success. How it works: (1) You make many slightly different attempts. (2) If any attempt slips past safety, that’s counted as a success. (3) More attempts → higher chance at least one works. Why it matters: Single tries miss the fact that attackers can scale attempts and dramatically raise success. šŸž Anchor: An attacker runs 100 prompt tweaks at once; even if each tweak rarely works alone, one often does.

šŸž Hook: Think of a goalie facing penalty kicks—how many goals go in out of all kicks? 🄬 Attack Success Rate (ASR): It’s the fraction of harmful queries where at least one attempt succeeds. How it works: (1) For each harmful query, run attempts. (2) Mark success if any attempt breaks through. (3) Average over all queries. Why it matters: ASR tells us overall risk of harmful outputs under a given number of attempts. šŸž Anchor: If 40 out of 100 harmful prompts are broken after 50 tries each, ASR@50 is 40%.

šŸž Hook: When you don’t know a coin’s fairness, you can model your belief about its heads chance with a smooth curve from 0 to 1. 🄬 Beta Distribution: A flexible way to represent probabilities between 0 and 1. How it works: (1) Start with two shape knobs (alpha, beta). (2) These knobs control where most probability sits (near 0, near 1, or in the middle). (3) Update them as you see data. Why it matters: We need a principled way to model how per-try success chances vary across queries. šŸž Anchor: If many queries are very hard to break, the Beta puts more mass near 0; if many are easy, more mass near 1.

šŸž Hook: Each shot on goal either scores or not—yes/no. 🄬 Bernoulli Trial: A single attempt with success or failure. How it works: (1) Assign a success probability. (2) Flip a weighted coin to decide outcome. Why it matters: Each jailbreak attempt is a Bernoulli trial, and we need that to model sequences of attempts. šŸž Anchor: For one prompt attempt, the model either gives a harmful answer (1) or not (0).

šŸž Hook: If you test multiple times per query, your measured success rate for that query is noisy but improves with more tries. 🄬 Beta–Binomial Model: Combines a Beta prior for per-try success with Binomial counting of successes. How it works: (1) For each query, suppose its per-try success is drawn from Beta(alpha, beta). (2) Given that, k successes in n attempts follow a Binomial with that hidden probability. (3) Integrate out the hidden probability to get a Beta–Binomial likelihood across queries. Why it matters: It correctly accounts for small-sample noise when estimating alpha and beta. šŸž Anchor: With only 5 attempts per query, treating per-query rates as exact is risky; Beta–Binomial properly reflects uncertainty.

šŸž Hook: The more you try keys on a lock, the more likely one will fit, and this likelihood follows a predictable curve. 🄬 Scaling Law: A simple rule connecting small-N measurements to large-N risk. How it works: (1) Derive how the probability that all N attempts fail shrinks with N. (2) Show that a single parameter (alpha) largely controls the speed of shrinkage. (3) Use this to extrapolate from small N to big N. Why it matters: This is the heart of forecasting realistic adversarial risk without massive compute. šŸž Anchor: If alpha is small, ASR grows fast with N; if alpha is larger, it grows slower.

šŸž Hook: You know how a map shows you where you’ll end up if you follow a path? We want a map from small trials to big risk. 🄬 SABER: A method to estimate large-N adversarial risk from small-N data. How it works: (1) Collect n attempts per harmful query. (2) Fit Beta–Binomial to estimate alpha and beta. (3) Plug into the scaling law (or anchor with a known ASR@n) to predict ASR@N. Why it matters: It’s low-cost, principled, and accurate, turning small tests into realistic forecasts. šŸž Anchor: With n=100, SABER predicted ASR@1000 within ~1.66%, far better than the baseline’s ~12% error.

03Methodology

At a high level: Harmful queries + small-budget attempts per query → Estimate how per-try success varies (Beta–Binomial fit) → Use the scaling law to predict big-budget ASR.

Step 0: Define the scenario

  • What happens: We fix an attacker, a victim model, and a judge that labels model outputs as jailbroken (1) or not (0). We care about ASR@N: the fraction of harmful queries for which at least one of N attempts succeeds.
  • Why it exists: Real attackers run many attempts; we need a method that predicts ASR@N for large N from limited measurements.
  • Example: Suppose we have 159 harmful queries from HarmBench and run n=100 attempts per query.

Step 1: Collect small-budget data per query

  • What happens: For each harmful query, run n attempts (small, like 5, 10, 50, or 100) using the chosen attack method. Record the number of successes k out of n.
  • Why it exists: We need per-query evidence to understand how success varies across queries. Without this, we can’t estimate the shape of the hidden per-try probabilities.
  • Example: Query 1 gets k=3 successes in n=100 attempts; Query 2 gets k=0/100; Query 3 gets k=7/100; and so on.

Step 2: Fit the Sample-ASR distribution with a one-stage likelihood

  • What happens: Treat each query’s hidden per-try success chance as a Beta random variable. Then the observed k out of n is Beta–Binomial distributed. We maximize the Beta–Binomial likelihood over alpha and beta using standard optimizers.
  • Why it exists: A naive two-stage fit (first estimate per-query rate, then fit Beta) ignores the noise from finite n and can miscalibrate, especially when n is small. The one-stage approach respects uncertainty and is more stable.
  • Example: After optimization, we might get alpha≈0.4 and beta≈4.0 for a given attacker–victim–judge setup, indicating most queries are hard (mass near zero) but with a tail of easier ones. A minimal code sketch of this fit appears below.
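The following is a minimal sketch of the one-stage fit, using hypothetical counts and the L-BFGS-B optimizer mentioned under Required Resources (an illustration of the idea, not the authors' code):

```python
# Minimal sketch of the one-stage fit: maximize the Beta-Binomial likelihood over
# (alpha, beta) given per-query success counts. The counts below are hypothetical.
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln, gammaln

def log_comb(n, k):
    # log of the binomial coefficient C(n, k)
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def neg_log_likelihood(log_params, k, n):
    alpha, beta = np.exp(log_params)            # optimize in log space to keep both positive
    ll = log_comb(n, k) + betaln(k + alpha, n - k + beta) - betaln(alpha, beta)
    return -np.sum(ll)

# Hypothetical small-budget data: successes k out of n = 100 attempts per query.
k = np.array([3, 0, 7, 0, 1, 0, 12, 0, 2, 5])
n = np.full_like(k, 100)

res = minimize(neg_log_likelihood, x0=np.log([1.0, 1.0]), args=(k, n), method="L-BFGS-B")
alpha_hat, beta_hat = np.exp(res.x)
print(f"alpha ~= {alpha_hat:.2f}, beta ~= {beta_hat:.2f}")
```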

Step 3: Use the scaling law to extrapolate ASR@N

  • What happens: The scaling law tells us how the chance that all N attempts fail shrinks with N, effectively governed by alpha (and a constant involving beta). There are three practical estimators:
    ā—¦ SABER-Plugin: Plug the fitted alpha and beta into the scaling law to get ASR@N.
    ā—¦ SABER-Anchored: If we already measured ASR@n for some small n, we anchor the curve there and only need alpha to predict ASR@N. This reduces sensitivity to beta and often performs best.
    ā—¦ SABER-Fit: Fit the small-N ASR curve directly in log space (as seen empirically) and extrapolate. This matches the derived scaling in the large-N regime.
  • Why it exists: Different operational constraints suggest different choices. Anchoring is great when you trust your ASR@n; Plugin is convenient and handles uneven budgets well; Fit is a line-fitting shortcut consistent with the theory at large N.
  • Example: With alpha≈0.4 and ASR@100 measured, SABER-Anchored predicts ASR@1000 by shrinking the failure rate according to (100/1000)^alpha (see the sketch after this step).
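Below is a hedged sketch of the Anchored and Plugin estimators as described above: Anchored scales the measured failure rate by (n/N)^alpha, while Plugin here uses the Beta closed form for the failure probability as one natural reading of "plug alpha and beta into the scaling law". All numbers are illustrative:

```python
import numpy as np
from scipy.special import betaln

def saber_anchored(asr_at_n, n, N, alpha):
    # Anchor at a trusted ASR@n, then shrink the failure rate by (n / N) ** alpha.
    return 1.0 - (1.0 - asr_at_n) * (n / N) ** alpha

def saber_plugin(N, alpha, beta):
    # Failure probability under the fitted Beta model: B(alpha, beta + N) / B(alpha, beta).
    return 1.0 - np.exp(betaln(alpha, beta + N) - betaln(alpha, beta))

# Illustrative inputs: ASR@100 measured at 45%, fitted alpha ~= 0.4, beta ~= 4.0.
print(f"Anchored ASR@1000 ~= {saber_anchored(0.45, n=100, N=1000, alpha=0.4):.3f}")
print(f"Plugin   ASR@1000 ~= {saber_plugin(N=1000, alpha=0.4, beta=4.0):.3f}")
```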

Step 4: Small-N correction (optional)

  • What happens: For very small N (like 5–20), a refined formula slightly adjusts N in the scaling law to cancel first-order approximation error. This tightens accuracy when both N and n are small.
  • Why it exists: The main scaling law is asymptotic (best for moderate-to-large N). The correction improves predictions in the tiny-N corner.
  • Example: At n=10 and N=20, the correction can cut error by another ~0.6% on average.

Step 5: Confidence intervals

  • What happens: Compute uncertainty in alpha and beta from the observed information (the curvature of the likelihood at the fit). Then propagate that uncertainty through the chosen estimator (Plugin or Anchored) to get a confidence interval.
  • Why it exists: Safety requires not just point estimates but also uncertainty bands. Without them, we can be overconfident.
  • Example: If the standard error of alpha is SE, then for Anchored we map alpha ± z·SE into upper/lower ASR@N bounds (a small sketch follows below).
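A small sketch of that propagation for the Anchored estimator, with hypothetical numbers standing in for the fitted value and its standard error:

```python
# Sketch: propagate uncertainty in alpha through the Anchored estimator.
# alpha_hat and its standard error would come from the likelihood fit; the numbers
# here are hypothetical placeholders.
def anchored_asr(asr_at_n, n, N, alpha):
    return 1.0 - (1.0 - asr_at_n) * (n / N) ** alpha

alpha_hat, se_alpha = 0.40, 0.05      # hypothetical point estimate and standard error
z = 1.96                              # ~95% normal interval
asr_at_n, n, N = 0.45, 100, 1000

endpoints = [anchored_asr(asr_at_n, n, N, a)
             for a in (alpha_hat - z * se_alpha, alpha_hat + z * se_alpha)]
point = anchored_asr(asr_at_n, n, N, alpha_hat)
print(f"ASR@{N} ~= {point:.3f}, ~95% CI ~= ({min(endpoints):.3f}, {max(endpoints):.3f})")
```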

Step 6: Extensions for realism (if needed)

  • Unbreakable fraction: Some queries may be truly unbreakable for a given pipeline. Add a spike at zero probability so ASR saturates below 100%.
  • Online mix: In streaming settings with a mix of benign and harmful prompts, adapt the scaling to account for the fraction of harmful prompts.
  • Why it exists: Real deployments have quirks (filters, non-determinism, traffic mix). These extensions make estimates more faithful.
  • Example: If 10% of queries are unbreakable, even infinite attempts max out below 90% ASR (see the quick sketch below).
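A back-of-the-envelope sketch of that unbreakable-fraction cap (hypothetical numbers, just to show the arithmetic):

```python
def asr_with_unbreakable(asr_breakable_at_N, unbreakable_frac):
    # A spike at zero success probability caps ASR below 100%:
    # only the breakable share of queries can ever be jailbroken.
    return (1.0 - unbreakable_frac) * asr_breakable_at_N

# Even if every breakable query is eventually broken, a 10% unbreakable fraction
# caps overall ASR at 90%.
print(asr_with_unbreakable(asr_breakable_at_N=1.0, unbreakable_frac=0.10))   # 0.9
```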

The Secret Sauce

  • Careful uncertainty modeling: Using the Beta–Binomial one-stage MLE stabilizes the parameter estimates even when per-query budgets are small.
  • A single scaling knob (alpha): Alpha is an interpretable control for how fast risk grows with N. This provides intuition and comparability across pipelines.
  • Anchoring to observed small-N: Anchoring cancels unknown constants and mismodeling quirks, improving robustness and accuracy from minimal data.

Sandwich explanations for key steps and variants:

šŸž Hook: If you already know your score at mile 1, you can estimate where you’ll be at mile 10 using your pace. 🄬 SABER-Anchored: Predicts ASR@N by anchoring at a trustworthy ASR@n and scaling by how fast risk grows (alpha). How it works: (1) Measure ASR@n. (2) Estimate alpha. (3) Scale failure rate by (n/N)^alpha. Why it matters: Reduces sensitivity to other parameters and works great with small data. šŸž Anchor: With ASR@100 and alpha in hand, you forecast ASR@1000 reliably.

šŸž Hook: Sometimes you don’t have a reliable anchor point, but you still want a forecast. 🄬 SABER-Plugin: Directly plugs alpha and beta into the scaling law. How it works: (1) Fit alpha and beta. (2) Compute predicted ASR@N from the formula. Why it matters: Simple, flexible, and handles uneven per-query budgets. šŸž Anchor: When user logs provide different n per query, Plugin absorbs this variation through the fit.

šŸž Hook: If points on a graph line up straight, a ruler is enough to guess future points. 🄬 SABER-Fit: Fits the small-N ASR curve in a transformed space (consistent with the theory at large N). How it works: (1) Plot a transformed ASR curve. (2) Fit a straight line. (3) Extrapolate. Why it matters: A practical shortcut when the large-N regime applies. šŸž Anchor: If your small-N points are already in the high-ASR zone, Fit can work well.

šŸž Hook: A single dimmer switch can make a room go from dim to bright quickly. 🄬 Alpha as the risk amplifier: Alpha controls how fast ASR climbs with more attempts. How it works: (1) Estimate alpha from data. (2) Smaller alpha → faster growth with N. (3) Larger alpha → slower growth. Why it matters: Two attacks can swap who looks worse when N increases, depending on alpha. šŸž Anchor: An attack with lower ASR@1 but smaller alpha can overtake a rival as N grows.

04Experiments & Results

The Test: The authors evaluated SABER on HarmBench (159 harmful queries) under three different attackers (Text Augmentation, ADV-LLM, Jailbreak-R1), two victim models (Llama-3.1-8B-Instruct and GPT-4.1-mini), and two judges (HarmBench Classifier and an LLM Classifier). They measured how well SABER could predict big-budget ASR (like ASR@1000) using only small budgets per query (like n=100), and compared against a common baseline that estimates per-query success naively and multiplies it out to N attempts.

The Competition: The main baseline treats each query’s small-sample success rate as exact and predicts ASR@N using a simple formula. It’s widely used but ignores the noise from small n. SABER’s estimators (especially Anchored) were put head-to-head against this baseline across many setups.

The Scoreboard (with context):

  • Big-N prediction with small n: Using only n=100 per query, SABER-Anchored predicted ASR@1000 with a mean absolute error (MAE) of about 1.66 percentage points, versus 12.04 for the baseline. That’s like getting almost perfect instead of a wrong-by-a-letter grade—an 86.2% error reduction.
  • Mid-range regimes: The biggest wins happened when true ASR@1000 was high but not fully saturated (not near 100%). For example, against GPT-4.1-mini with HarmBench Classifier under ADV-LLM, ground-truth ASR@1000 was ~75.2%. The baseline low-balled it at ~63.4%, but SABER hit ~74.3%.
  • Ranking reversals: In some cases, an attack with lower ASR@1 overtook another at larger N because its risk scaled faster (smaller alpha). This confirms that single-shot rankings can be misleading.
  • Stability across N and n: As expected, increasing the small-budget n improves accuracy. Even as target N gets larger, SABER’s error grows slowly and remains low, while the baseline’s error is much larger. SABER-Anchored consistently achieved 4–6Ɨ lower MAE than the baseline across settings.

Surprising and practical findings:

  • Early saturation handled well: In some triplets (strong attacks or weak defenses), ASR grew high by N=20–50. SABER still predicted these small-N targets accurately using even tinier budgets (e.g., n=10), where the baseline could be off by 20+ points.
  • Uneven budgets: When per-query budgets varied (like real user logs), overall error increased for all methods (less uniform data), but SABER still substantially beat the baseline. In some uneven-budget, smaller-N cases, the Plugin estimator edged out Anchored, showing that SABER offers flexible tools for different regimes.
  • Partial data: Using only a subset of the dataset (e.g., K=80 out of 159) led to only modest degradation, and SABER still outperformed the baseline. This suggests robustness when full data access isn’t possible.
  • Inverse queries (Budget@τ): SABER can answer questions like, "How many attempts to reach 95% success?" It did so accurately using small-budget measurements, letting teams plan defenses or estimate attacker resources. A small calculation sketch follows below.
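One way to compute such inverse queries is to invert the anchored relation described earlier (failure rate scaled by (n/N)^alpha); the inputs below are illustrative, not results from the paper:

```python
import math

def budget_at_tau(tau, asr_at_n, n, alpha):
    # Solve 1 - tau = (1 - ASR@n) * (n / N) ** alpha for N.
    return n * ((1.0 - asr_at_n) / (1.0 - tau)) ** (1.0 / alpha)

# Illustrative: ASR@100 = 45%, alpha = 0.4 -> attempts needed to reach 95% success
# (roughly 40,000 under these made-up inputs).
print(math.ceil(budget_at_tau(tau=0.95, asr_at_n=0.45, n=100, alpha=0.4)))
```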

Putting numbers in plain terms:

  • If the baseline was off by about 12 percentage points on average at ASR@1000 (that’s like missing by a whole letter grade), SABER-Anchored brought the miss down to around 1.7 points (a tiny slip). That shift is the difference between rough guesses and decision-grade predictions.
  • In very small-budget regimes (n=10), SABER could predict small-N targets (like N=50) within fractions of a point in some settings, while the baseline missed by over 20 points. That’s the difference between a safe green light and a risky false sense of security.

Takeaway: SABER turns a little data into high-confidence forecasts that match what large-scale attackers can achieve. It captures the true scaling behavior (via alpha), explains why rankings can flip at larger N, and gives accurate predictions and uncertainty bounds where prior approaches stumble.

05Discussion & Limitations

Limitations:

  • Binary judges: The analysis assumes a binary judge (success/fail). Real evaluators sometimes have multi-level severity or continuous scores. Extending SABER to multi-class or continuous judgments (e.g., via a Dirichlet-categorical model) would provide more nuance.
  • Distributional fit: SABER assumes per-try success probabilities across queries follow a Beta distribution. While this fit passes most tested setups, a few show deviations (e.g., bimodality or heavy boundary mass). Anchoring helps, but extreme mismatches could reduce accuracy.
  • Saturation and spikes: If some queries are truly unbreakable for a given pipeline, risk plateaus below 100%. SABER includes a spike-and-slab extension, but estimating the unbreakable fraction reliably may require more data.
  • Scope of evaluation: Experiments focus on HarmBench and a handful of attackers, victims, and judges. Broader coverage (multi-modal, other datasets, more frontier models) would further validate generality.

Required Resources:

  • Data: A set of harmful queries and the ability to run small numbers of attempts per query (often 5–100 suffices). More queries (K) improve stability faster than greatly increasing n per query.
  • Compute: Fitting the Beta–Binomial model is lightweight. The heavy lift is generating attempts, but SABER keeps this small.
  • Software: Any standard optimizer (e.g., L-BFGS-B) and basic statistical tooling; optional plotting or bootstrap for diagnostics.

When NOT to Use:

  • Non-parallel or heavily dependent attempts: If attempts are not approximately independent (e.g., strong memory effects or adaptive defenses that change per try), the simple scaling may fail.
  • Noisy or biased judges: If the judge frequently mislabels successes/failures or drifts over time, all methods (including SABER) will inherit those errors.
  • Extremely small datasets: With very few queries (tiny K) and tiny per-query attempts, estimates of alpha can be too uncertain to trust, though confidence intervals will reflect this.

Open Questions:

  • Better modeling of deviations: Can we design mixtures (e.g., spike-and-slab Beta mixtures) or nonparametric priors that retain analytic scaling while fitting edge cases better?
  • Adaptive defenses: How does SABER interact with defenses that change with load, rate-limiting, or detection? Can we co-model attacker and defender scaling dynamics?
  • Beyond binary: What are the best scaling laws for graded harm or probabilistic judges, and can we deliver the same simplicity and accuracy in those settings?
  • Bias calibration: The residual bias tends to be underestimation. Can we build principled calibration layers without sacrificing extrapolation guarantees?
  • Multimodal attacks: Does the same scaling hold for images, audio, or code traces combined with text, and how do cross-modal judges affect parameters like alpha?

06Conclusion & Future Work

Three-sentence summary: SABER is a statistical framework that predicts large-scale adversarial risk (Best-of-N) for LLMs using only small-budget measurements. It models per-try success variability across queries with a Beta distribution and derives a simple scaling law controlled by a single parameter (alpha) that governs how fast risk grows with more attempts. Across diverse attacker–victim–judge setups, SABER achieves dramatically lower prediction error than common baselines, revealing that single-shot evaluations can hide rapid, nonlinear risk amplification at scale.

Main achievement: Turning a small number of measurements into reliable, mathematically grounded forecasts of big-budget adversarial risk—complete with interpretable parameters and uncertainty estimates—so teams can assess realistic vulnerabilities without brute-force testing.

Future directions: Extend beyond binary judges to graded harms, support spike-and-slab and mixture models seamlessly, expand to multimodal and frontier models, integrate with adaptive defense modeling, and develop bias correction for even sharper calibration.

Why remember this: SABER reframes LLM safety from one-shot snapshots to scaling-aware forecasts, matching how real attackers operate. With a single, intuitive knob (alpha) and low-cost data collection, it helps practitioners anticipate where risk is headed as attempts scale, prioritize defenses, and communicate safety with clear, decision-grade metrics like Budget@τ.

Practical Applications

  • Estimate large-N adversarial risk from small red-team budgets to inform deployment go/no-go decisions.
  • Compute Budget@τ (e.g., 90% or 95%) to plan rate limits, monitoring thresholds, or defense activation points.
  • Compare attackers fairly across budgets using alpha to understand who scales risk faster as N grows.
  • Detect ranking reversals early: identify attacks that look weak at ASR@1 but will dominate at higher N.
  • Prioritize hardening efforts on pipelines with small alpha (fast risk growth) before those with slower scaling.
  • Quantify uncertainty with confidence intervals for ASR@N to communicate risk credibly to stakeholders.
  • Handle uneven user logs by fitting SABER-Plugin, enabling realistic, passive risk estimation from organic data.
  • Assess the impact of making a subset of queries unbreakable (policy filters) using the spike-and-slab extension.
  • Track improvements over time: re-fit alpha/beta after defenses change to verify reduced risk amplification.
  • Design better benchmarks: report scaling-aware metrics like ASR@N curves and Budget@τ, not just ASR@1.
#Best-of-N sampling#Adversarial risk#Attack Success Rate (ASR)#Beta distribution#Beta–Binomial#Scaling law#Jailbreak evaluation#Risk extrapolation#Uncertainty quantification#Red teaming#Safety benchmarking#Alpha parameter#Budget@tau#LLM robustness