Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems

Intermediate
Eddie Landesberg, Manjari Narayan · 12/11/2025
arXiv · PDF

Key Summary

  ‱ LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
  ‱ Causal Judge Evaluation (CJE) calibrates a cheap judge to a small set of expensive oracle labels, then audits whether that calibration still works for each policy.
  ‱ With only 5% oracle labels, CJE ranks models correctly 99% of the time on a 4,961‑prompt benchmark, at about 14× lower cost than labeling everything with the oracle.
  ‱ Naive confidence intervals around raw judge scores can have 0% coverage; CJE’s bootstrap with bias correction restores about 95% coverage.
  ‱ Off‑policy evaluation with importance sampling can fail even when effective sample size looks great; the real blocker is low coverage of target‑typical responses (CLE/TTC diagnostics catch this).
  ‱ The mean transport test turns a fragile assumption (“the judge’s bias won’t change”) into a checkable pass/fail audit for each policy.
  ‱ Direct mode (fresh generations + calibrated rewards) is the safest default for open‑ended text; IPS/DR need good overlap and reliable teacher forcing to help.
  ‱ CJE refuses absolute level claims when the audit fails, preventing confident but wrong numbers in dashboards.
  ‱ A simple square‑root budgeting rule helps pick how many oracle labels vs. cheap scores to buy to minimize variance at fixed cost.

Why This Research Matters

CJE lets teams use inexpensive LLM judges without getting tricked by their biases, so they can scale evaluations that would otherwise be too costly. It builds an audit step that catches when a calibration stops working, which prevents shipping bad models or bragging about misleading metrics. By preferring Direct when logs lack coverage, it avoids common off-policy traps and saves engineering time. Honest confidence intervals that include calibration uncertainty keep dashboards from being confidently wrong. The cost savings (often >10×) make regular, high-quality evaluation practical for real products. In short, CJE turns “cheap but risky” judge scores into “cheap and reliable” measurements with clear rules about when to trust them.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine you’re picking the best new video game by watching lots of short trailers. The trailers are quick and cheap to watch, but sometimes they exaggerate the action and hide the boring parts. If you only trust the trailers, you might buy the wrong game.

đŸ„Ź The Concept: LLM judges are like those trailers. They are cheap, quick scores that say how good a model’s answer looks. Real user satisfaction or expert ratings are like actually playing the game: accurate but expensive. Teams often choose models by these cheap judge scores and hope they match what users truly like.

  • How it works (before this paper):
    1. Generate answers from different model setups (policies).
    2. Ask a cheap judge LLM to score each answer.
    3. Pick the model with the highest average score and report confidence intervals as if those scores were truth.
  • Why it matters: Judges can be biased (for example, loving long, flowery answers). Optimizing for the judge can reward fluff over real quality, shipping the worse model.

🍞 Anchor: A team compares two chatbots. The judge gives higher scores to the extra‑chatty one, so they ship it. Users complain it’s unhelpful. They optimized for the trailer, not the game.

🍞 Hook: You know how bathroom scales can be off by a few pounds? If you don’t calibrate the scale, your weight trend and numbers can mislead you.

đŸ„Ź The Concept: Calibration means fixing the judge’s readings so that a score really matches expected true quality (the oracle label).

  • How it works:
    1. Take a small sample of responses and label them with the expensive oracle (experts or a gold standard).
    2. Learn a mapping from cheap judge scores to expected oracle scores.
    3. Apply this map to lots of new judge scores to get calibrated estimates.
  ‱ Why it matters: Without calibration, ranks can invert: higher judge score can mean lower true quality (see the code sketch below).

🍞 Anchor: If the judge gives 85 to both short and long answers, but experts say short=90 and long=70, calibration learns that 85 from a long answer should be lowered.
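If you like to see ideas as code, here is a minimal sketch of that calibration step using a one‑stage isotonic fit from scikit‑learn. The numbers are made up for illustration, and the paper's full recipe is the two‑stage, mean‑preserving version covered later in the Methodology section.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Tiny oracle-labeled slice: cheap judge scores paired with expensive oracle labels.
judge_scores_labeled = np.array([0.60, 0.70, 0.80, 0.85, 0.85, 0.90])
oracle_labels        = np.array([0.55, 0.72, 0.78, 0.70, 0.90, 0.92])

# Learn a monotone map from judge score to expected oracle score.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores_labeled, oracle_labels)

# Apply the learned map to many new, unlabeled judge scores.
new_judge_scores = np.array([0.65, 0.85, 0.95])
print(calibrator.predict(new_judge_scores))  # calibrated estimates of true quality
```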

🍞 Hook: Imagine you estimate class grades by the number of words students write. If the teacher later changes the rubric to reward clarity over length, your word‑count trick breaks.

đŸ„Ź The Concept: Surrogate validity is checking whether a cheap stand‑in (the judge score) still points to the right thing (true quality) when circumstances change (new policy, new style).

  • How it works:
    1. Use a small oracle‑labeled sample in each new context to test whether the calibration still makes unbiased predictions on average.
    2. If the test fails, recalibrate or don’t report absolute levels.
  • Why it matters: Otherwise, a style shift (like making answers longer) can fool the judge.

🍞 Anchor: A new chatbot becomes very verbose. The cheap judge loves it, but experts don’t. A quick validity check (mean residual test) catches the mismatch.

🍞 Hook: Suppose you want to guess a school’s average height using last year’s students. If this year’s class is mostly basketball players, last year’s data won’t cover the tall region well.

đŸ„Ź The Concept: Importance sampling reweights old data to mimic a new policy, but it needs good overlap—your old data must include the kinds of answers the new policy would give.

  • How it works:
    1. Compute a weight for each logged answer: how likely the new policy would have produced it compared with the old.
    2. Average outcomes using these weights.
  ‱ Why it matters: If the old policy rarely produced what the new one likes, estimates become noisy or wrong, even if weights look stable (see the code sketch below).

🍞 Anchor: If last year had few tall students, reweighting can’t reliably estimate this year’s much taller average.
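Here is a tiny, self‑contained sketch of the reweighting mechanic (self‑normalized importance sampling) with hypothetical log‑probabilities; it illustrates the idea, not the paper's estimator.

```python
import numpy as np

# Logged responses with rewards, plus (hypothetical) response-level
# log-probabilities under the old (logging) and new (target) policies.
rewards         = np.array([0.6, 0.8, 0.7, 0.9])
logp_old_policy = np.array([-12.0, -15.0, -11.0, -20.0])
logp_new_policy = np.array([-13.0, -14.0, -11.5, -10.0])

# Importance weight: how much more likely the new policy is to produce
# each logged response than the old one.
weights = np.exp(logp_new_policy - logp_old_policy)

# Self-normalized importance sampling (SNIPS) estimate of the new policy's value.
snips_value = np.sum(weights * rewards) / np.sum(weights)

# Effective sample size: how many "equally weighted" samples the weights act like.
ess = np.sum(weights) ** 2 / np.sum(weights ** 2)
print(snips_value, ess)
```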

🍞 Hook: Think of gluing a wobbly table leg. If the glue layer is uneven, the table still wobbles.

đŸ„Ź The Concept: Weight stabilization tries to make the reweighting less wobbly by smoothing and bounding weights, often using the judge score as a stabilizing index.

  • How it works:
    1. Create simple, monotone weight candidates tied to judge scores.
    2. Stack/average them to minimize variance.
    3. Cap variance if needed, then renormalize.
  • Why it matters: Without stabilization, a few giant weights can dominate and wreck estimates.

🍞 Anchor: After stabilization, no single super‑tall student dominates your average.

🍞 Hook: Picture a speedometer that you re‑zero to 0 when parked, but after a repair you verify it still reads 0 at rest.

đŸ„Ź The Concept: A transport audit checks if the earlier calibration still works in a new policy or time window.

  • How it works:
    1. Collect a small oracle slice under the target policy.
    2. Test if the average residual (oracle minus calibrated prediction) is zero.
    3. Pass → reuse calibration; fail → recalibrate or refuse absolute levels.
  • Why it matters: It turns an uncheckable hope (“calibration still holds”) into a testable rule.

🍞 Anchor: The audit flags an “unhelpful” policy whose calibrated score is off by −0.31 on average; rankings remain fine but absolute scores are refused.

🍞 Hook: If your camera misses the corners of a photo, you can’t judge what’s there, even if the center looks sharp.

đŸ„Ź The Concept: Coverage‑Limited Efficiency (CLE) says you hit a hard accuracy floor when your logged data barely covers the region your target policy likes.

  • How it works:
    1. Define the target‑typical region (answers the new policy would often produce).
    2. Measure how much the logger covered that region (TTC).
    3. Low TTC means large, unavoidable uncertainty.
  • Why it matters: Even perfect math can’t beat missing coverage; logs‑only methods will struggle.

🍞 Anchor: In the benchmark, TTC was only 19–49% for non‑clone policies, explaining why weighted logs failed despite high effective sample size.

🍞 Hook: Think of two ways to know if cookies taste good: bake fresh cookies and taste a few (direct), or try to guess using notes from last week’s batches (off‑policy reweighting).

đŸ„Ź The Concept: The Direct method generates fresh responses from the target policy and uses calibrated judge predictions, avoiding overlap problems.

  • How it works:
    1. Generate new answers under the target policy.
    2. Apply the calibrated mapping to get predicted oracle scores.
    3. Use bootstrap with bias correction to get valid uncertainty.
  • Why it matters: If logs don’t cover your new behavior, Direct remains reliable and efficient.

🍞 Anchor: In experiments, Direct with two‑stage calibration reached 94–99% correct rankings with small oracle budgets.

Putting it together, the world before this paper often trusted uncalibrated judges, leading to flipped rankings, broken confidence intervals, and off‑policy failures that no one noticed. The paper fills the gap with CJE: calibrate once with a small oracle slice, audit per policy, prefer Direct for open‑ended text, diagnose coverage with TTC/CLE, and compute honest intervals that include calibration uncertainty. The stakes are real: picking the wrong chatbot, wasting money, or reporting confident but wrong numbers can hurt users and products.

02 Core Idea

🍞 Hook: You know how you tune a musical instrument to a tuning fork and then do a quick sound check in each new room because acoustics change? That keeps your notes true wherever you play.

đŸ„Ź The Concept: The key insight of CJE in one sentence is: Calibrate a cheap judge to a small set of oracle labels, then audit that calibration in each policy context, so you can evaluate at scale with valid uncertainty—and refuse claims when the audit fails.

  • How it works:
    1. Learn a mean‑preserving mapping from judge scores to oracle scores using a small labeled slice.
    2. For each target policy, collect a tiny audit sample and test whether the mapping remains unbiased on average (mean transport test).
    3. If it passes, apply the mapping to many cheap judge scores; use bootstrap with bias correction to get honest confidence intervals.
    4. If it fails, recalibrate for that policy or refuse absolute level claims.
  • Why it matters: This turns a fragile assumption (“judge bias won’t change”) into a routine, checkable procedure.

🍞 Anchor: After tuning, the band does a short sound check (audit). If a mic is off, they fix it before the concert (recalibrate) rather than performing out of tune (reporting biased levels).

Multiple analogies:

  • Thermometer analogy: Calibrate your thermometer once, then in each new kitchen, dip it in ice water (audit) to confirm 0°C before cooking lots of dishes (scaling evaluation).
  • Map‑and‑compass analogy: Adjust your compass to true north (calibration), then each hike you compare to a known landmark (audit) so you don’t drift miles off course when trails twist (policy shift).
  • Photo filter analogy: You learn how a filter changes colors (calibration). On a new camera (policy), you snap a color chart (audit). If colors look right, you can edit thousands of photos confidently (scale with valid uncertainty).

Before vs. after:

  • Before: Teams used raw judge scores, assumed ranks matched reality, and reported tight but wrong intervals.
  • After: Teams calibrate once, audit per policy, prefer Direct mode for open‑ended text, and report intervals that include calibration uncertainty. If the audit fails, they refuse levels instead of guessing.

Why it works (intuition without equations):

  • Mean‑preserving isotonic regression learns a monotone map from judge score to expected oracle score and exactly preserves the average on the calibration slice, which prevents rank inversions.
  • The mean transport test checks a single number: do residuals average to zero under the target policy? Passing that test is enough to ensure unbiased average values.
  • Bootstrap with bias correction adds back the uncertainty from learning the calibration, fixing the 0% coverage problem of naive intervals.
  • Encoding justified restrictions—monotonicity, unit‑mean weights, and convex stacking—shrinks the space of possibilities in a way that only reduces variance, never increasing it.
  • CLE/TTC diagnostics reveal when logs‑only methods face a structural precision floor because the logger rarely produced target‑typical answers; then Direct is the right tool.

Building blocks (each piece as a mini‑sandwich):

  • 🍞 Hook: You know how taller plants usually got more sun?
    đŸ„Ź The Concept: Reward calibration maps judge score to expected oracle score and keeps the average right.
    How: Fit a smooth index (allowing covariates like response length), convert to ranks, then do isotonic regression that preserves the mean.
    Why: Without it, longer answers could look better just for being long.
    🍞 Anchor: Two answers both score 0.8 by the judge; experts say the short one is better. The map adjusts the long one down.
  • 🍞 Hook: Mixing two paints can match a tricky color better than either alone.
    đŸ„Ź The Concept: Influence‑function stacking blends estimators to minimize variance.
    How: Compute influence signals for candidate estimators, then pick nonnegative weights that minimize their covariance.
    Why: One estimator might be noisy where another is stable.
    🍞 Anchor: The blend is steadier than any single estimator.
  • 🍞 Hook: A seatbelt keeps you safe even when roads are bumpy.
    đŸ„Ź The Concept: Calibration‑aware uncertainty (bootstrap with bias correction) treats the learned mapping as uncertain too.
    How: Refit calibration in each bootstrap, add a residual correction term, and build intervals from the full variance.
    Why: Ignoring calibration error makes intervals way too narrow.
    🍞 Anchor: Naive 95% CIs missed the truth almost always; with bootstrap + correction they hit ~95%.
  • 🍞 Hook: If your old map doesn’t show a new neighborhood, you can’t navigate it well.
    đŸ„Ź The Concept: CLE/TTC says logs‑only reweighting breaks when the old data barely covers what the new policy does.
    How: Define target‑typical responses by surprisal, measure how often logs land there (TTC). Low TTC → big, unavoidable error.
    Why: No amount of weight smoothing can invent missing coverage.
    🍞 Anchor: TTC below ~0.70 warned that IPS would fail in the benchmark.
  • 🍞 Hook: Quick taste test before serving a big batch.
    đŸ„Ź The Concept: Mean transport test checks if calibrated predictions are unbiased on average for a specific policy.
    How: Collect a small labeled audit slice and test whether residuals average to zero.
    Why: If not, refuse level claims or recalibrate.
    🍞 Anchor: The adversarial “unhelpful” policy failed with a −0.31 shift and was correctly flagged.

Bottom line: CJE’s single big idea is to make cheap judging point in the right direction by calibrating once, auditing often, and quantifying uncertainty honestly—switching to Direct when logs can’t cover what you need.

03 Methodology

High‑level recipe: Input (prompts, policy candidates) → Generate answers → Get cheap judge scores → Calibrate judge to a small oracle slice → Audit mean transport per policy → Evaluate policies at scale with calibrated scores → Build calibration‑aware confidence intervals → Report rankings and (if audits pass) absolute levels.

Step‑by‑step with mini‑sandwiches:

  1. Data and roles
  • 🍞 Hook: Think of a school science fair. You have many projects (prompts) and several judges (policies) giving presentations (answers).
  • đŸ„Ź The Concept: We have prompts X, responses A from each policy, cheap judge scores S for every response, and a small set of expensive oracle labels Y.
    How:
    1. For each prompt and policy, generate an answer A.
    2. Score it with the cheap judge S.
    3. On a small subset, also get the oracle Y.
    Why: Judge scores are fast and cheap for all rows; oracle labels are sparse but trustworthy.
  • 🍞 Anchor: 4,961 prompts × 5 policies → judge scores for all; oracle labels for a small fraction (5–25%).
  2. Reward calibration (two‑stage isotonic)
  • 🍞 Hook: You know how you first roughly straighten a picture frame and then fine‑tune it so it’s perfectly level?
  • đŸ„Ź The Concept: Two‑stage calibration learns a flexible index and then forces a monotone, mean‑preserving map to oracle scores.
    How:
    1. First stage: learn a smooth index Z=g(S, X) (e.g., splines + response length) to allow non‑linear patterns and remove confounders like verbosity.
    2. Convert Z to mid‑ranks U via its empirical CDF.
    3. Second stage: fit isotonic regression h↑(U) that is monotone and exactly preserves the slice mean.
    4. Predict calibrated rewards R = h↑(U) for all samples.
    Why: Monotonicity prevents rank inversions; mean preservation keeps averages honest; using X (like length) removes known judge biases (a code sketch follows this step).
  • 🍞 Anchor: If the judge over‑rewards long answers, conditioning on length lets the calibrator “un‑favor” them so 0.85 from long ≈ 0.75 oracle, while 0.85 from short ≈ 0.88 oracle.
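Below is a rough sketch of the two‑stage idea. A plain linear model stands in for the paper's flexible first‑stage learner, and the rank handling for new points is a simple approximation; names and data are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LinearRegression

def fit_two_stage_calibrator(judge, length, oracle):
    """Sketch of two-stage calibration: a flexible index, then a monotone map."""
    # Stage 1: learn an index Z = g(S, X) from judge score and response length.
    X = np.column_stack([judge, length])
    index_model = LinearRegression().fit(X, oracle)
    z = index_model.predict(X)

    # Convert the index to mid-ranks U in (0, 1) via its empirical CDF.
    u = rankdata(z, method="average") / (len(z) + 1)

    # Stage 2: isotonic regression of oracle on U. Isotonic fits are block
    # means, so the fitted values keep the calibration slice's average.
    iso = IsotonicRegression(out_of_bounds="clip").fit(u, oracle)

    z_sorted = np.sort(z)

    def predict(judge_new, length_new):
        z_new = index_model.predict(np.column_stack([judge_new, length_new]))
        # Rank new points against the calibration slice (a simple approximation).
        u_new = np.searchsorted(z_sorted, z_new) / (len(z_sorted) + 1)
        return iso.predict(u_new)

    return predict

# Illustrative use: the same judge score gets different calibrated values
# depending on response length, pulling verbose answers back toward the oracle.
judge  = np.array([0.85, 0.85, 0.70, 0.90, 0.60, 0.80])
length = np.array([900.0, 120.0, 150.0, 800.0, 100.0, 500.0])
oracle = np.array([0.70, 0.88, 0.75, 0.72, 0.60, 0.74])
calibrate = fit_two_stage_calibrator(judge, length, oracle)
print(calibrate(np.array([0.85, 0.85]), np.array([900.0, 120.0])))
```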
  3. Transport audit (mean residual test)
  • 🍞 Hook: Before serving soup to everyone, you taste a spoonful from the new pot.
  • đŸ„Ź The Concept: The mean transport test checks whether the calibrated predictions are unbiased on average for each target policy.
    How:
    1. Collect a small oracle audit slice per policy (e.g., 50–200 labels).
    2. Compute residuals Δ = Y − f(S, X).
    3. Test if E[Δ]=0 (with correction for multiple policies if needed).
    4. Pass → reuse calibration for levels; fail → recalibrate or refuse absolute levels (still safe to rank if gaps are large).
    Why: Styles change across policies; this ensures we’re not silently biased (a sketch of the test follows this step).
  • 🍞 Anchor: Clone, premium, and prompt‑engineered policies passed; the adversarial “unhelpful” policy failed with a −0.31 shift and was flagged.
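A minimal sketch of the audit, using a one‑sample t‑test on the residuals as a stand‑in for the paper's exact test statistic and multiple‑testing correction.

```python
import numpy as np
from scipy import stats

def mean_transport_audit(oracle_audit, calibrated_pred, alpha=0.05):
    """Test whether residuals (oracle minus calibrated prediction) average to
    zero for this target policy."""
    residuals = oracle_audit - calibrated_pred
    t_stat, p_value = stats.ttest_1samp(residuals, popmean=0.0)
    return {
        "mean_residual": float(residuals.mean()),
        "p_value": float(p_value),
        "passes": p_value >= alpha,  # pass -> reuse calibration for level claims
    }

# Illustrative audit slice for one target policy (numbers are made up).
print(mean_transport_audit(
    oracle_audit=np.array([0.42, 0.50, 0.38, 0.45, 0.41]),
    calibrated_pred=np.array([0.74, 0.80, 0.70, 0.77, 0.73]),
))  # a large negative mean residual -> fail -> recalibrate or refuse levels
```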
  4. Direct evaluation (recommended default for open‑ended text)
  • 🍞 Hook: When you can retake a photo, you just take a fresh, sharp shot instead of enhancing a blurry old one.
  • đŸ„Ź The Concept: Direct mode generates fresh answers from the target policy, avoids overlap issues, and uses calibrated rewards plus calibration‑aware inference.
    How:
    1. Generate n fresh responses under the target policy.
    2. Apply calibrated mapping R_i = f(S_i, X_i).
    3. Use a bias‑corrected, bootstrap estimator that adds back calibration uncertainty.
    4. Report value and 95% CIs.
    Why: It’s robust even when logs don’t cover the target policy’s behavior.
  • 🍞 Anchor: With 5% oracle, Direct reached ~99% pairwise ranking accuracy at n≈5k, with ~14× lower cost than full‑oracle evaluation across 5 policies.
  5. Off‑policy estimators (when fresh generation is not possible)
  • 🍞 Hook: If you can’t bake new cookies, you try to guess taste from last week’s batches by giving more weight to similar recipes.
  • đŸ„Ź The Concept: IPS/DR reuse logged data by reweighting but need coverage; weight stabilization helps with variance but can’t fix missing regions.
    How:
    1. Compute teacher‑forcing propensities/log‑ratios for the target policy.
    2. Stabilize weights by projecting onto S‑monotone, unit‑mean candidates and stacking them to minimize variance; cap variance if needed and renormalize.
    3. For DR, combine an outcome model with a weighted residual correction.
    Why: Raw weights can be heavy‑tailed and unstable; stabilization helps—but only if overlap is decent (a simplified sketch follows this step).
  • 🍞 Anchor: ESS jumped from <1% to >80% after stabilization, but IPS still ranked near random because TTC was low (coverage problem), not just variance.
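The sketch below is a heavily simplified stand‑in for the paper's stabilization (quantile clipping plus renormalization to unit mean, rather than the monotone projection and stacking the paper uses); it is mainly here to show why a healthy ESS alone is not enough.

```python
import numpy as np

def stabilize_weights(raw_weights, clip_quantile=0.99):
    """Simplified stand-in for weight stabilization: clip the heaviest weights
    at a high quantile, then renormalize to unit mean."""
    cap = np.quantile(raw_weights, clip_quantile)
    w = np.minimum(raw_weights, cap)
    return w / w.mean()

def effective_sample_size(w):
    return w.sum() ** 2 / (w ** 2).sum()

# Heavy-tailed raw weights (simulated) vs. their stabilized version.
raw = np.exp(np.random.default_rng(0).normal(0.0, 3.0, size=1000))
stab = stabilize_weights(raw)
print(effective_sample_size(raw), effective_sample_size(stab))
# Caution: a healthy ESS after stabilization does NOT guarantee coverage (TTC).
```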
  6. CLE/TTC diagnostics (decide if logs‑only can work)
  • 🍞 Hook: A map that misses half the neighborhood can’t guide you, no matter how nice the paper is.
  • đŸ„Ź The Concept: Coverage‑Limited Efficiency (CLE) sets a hard floor on precision when Target‑Typicality Coverage (TTC) is low.
    How:
    1. Define target‑typical answers using surprisal under the target policy (e.g., 90% typical region).
    2. Measure what fraction of logs fall in that region (TTC).
    3. If TTC < ~0.70, expect IPS/DR to struggle; prefer Direct.
    Why: No reweighting trick can invent missing coverage (a sketch follows this step).
  • 🍞 Anchor: In the benchmark, TTC=19–49% for non‑clone policies explained why IPS stayed near 47% pairwise accuracy despite weight stabilization.
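Here is an illustrative version of the TTC check, assuming "surprisal" means a response's negative log‑probability under the target policy; the paper's exact region and estimator may differ.

```python
import numpy as np

def target_typicality_coverage(logged_surprisal, fresh_surprisal, typical_mass=0.90):
    """Illustrative TTC check (not the paper's exact estimator). Fresh target
    generations define a 'typical' surprisal band; TTC is the fraction of
    logged responses that land inside it."""
    lo = np.quantile(fresh_surprisal, (1 - typical_mass) / 2)
    hi = np.quantile(fresh_surprisal, 1 - (1 - typical_mass) / 2)
    inside = (logged_surprisal >= lo) & (logged_surprisal <= hi)
    return float(inside.mean())

# Rule of thumb from the paper: if TTC < ~0.70, prefer Direct over IPS/DR.
```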
  7. Calibration‑aware uncertainty (bootstrap + bias correction)
  • 🍞 Hook: When you estimate the height of a forest from a small sample, you must include both sampling variation and how well your measuring stick was made.
  • đŸ„Ź The Concept: Confidence intervals must include the extra variance from learning the calibration.
    How:
    1. Use cross‑fitting to avoid overfitting in nuisance steps.
    2. Use a bias‑corrected one‑step estimator that adds residuals from labeled points with out‑of‑fold predictions.
    3. Bootstrap by refitting the calibration to propagate its uncertainty into the final interval.
    Why: Naive intervals that treat calibration as fixed can have 0% coverage (a sketch follows this step).
  • 🍞 Anchor: With bootstrap and bias correction, coverage rose to ~95% across oracle fractions.
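A minimal sketch of the "refit the calibration inside the bootstrap" idea, using a one‑stage isotonic calibrator and plain NumPy arrays; it omits the cross‑fitting and bias‑correction term the paper adds.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def direct_value_with_bootstrap(judge_all, judge_labeled, oracle_labeled,
                                n_boot=2000, seed=0):
    """Sketch of calibration-aware uncertainty: refit the calibration inside
    every bootstrap replicate so its error shows up in the interval."""
    rng = np.random.default_rng(seed)
    n_lab, n_all = len(judge_labeled), len(judge_all)
    boot_means = []
    for _ in range(n_boot):
        lab = rng.integers(0, n_lab, n_lab)   # resample the oracle slice
        evl = rng.integers(0, n_all, n_all)   # resample the evaluation set
        iso = IsotonicRegression(out_of_bounds="clip").fit(
            judge_labeled[lab], oracle_labeled[lab])
        boot_means.append(iso.predict(judge_all[evl]).mean())
    point = IsotonicRegression(out_of_bounds="clip").fit(
        judge_labeled, oracle_labeled).predict(judge_all).mean()
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return point, (lo, hi)
```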
  8. Budgeting (square‑root rule)
  • 🍞 Hook: If apples are expensive and oranges are cheap, you choose a mix that gives the most juice per dollar.
  • đŸ„Ź The Concept: Spend the oracle budget and judge budget to equalize marginal variance reduction per dollar.
    How:
    1. Track the share of variance coming from calibration vs. evaluation.
    2. If calibration uncertainty dominates, buy more oracle labels; if not, buy more judged samples.
    3. Use the square‑root law derived in the paper to set the optimal ratio given costs.
    Why: This minimizes total variance for a fixed budget (a sketch follows this step).
  • 🍞 Anchor: At 5% oracle, calibration variance was ~90% of total, so adding oracle labels helps most; at high oracle fractions, more prompts help more.
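As a sketch, here is the textbook square‑root allocation for a simple two‑term variance model; the paper's derived rule may differ in its details, so treat the function below as illustrative.

```python
import numpy as np

def sqrt_budget_split(var_calibration, var_evaluation,
                      cost_oracle, cost_judge, total_budget):
    """Square-root allocation: minimize var_calibration/n_oracle +
    var_evaluation/n_judge subject to
    cost_oracle*n_oracle + cost_judge*n_judge = total_budget."""
    k_oracle = np.sqrt(var_calibration / cost_oracle)
    k_judge = np.sqrt(var_evaluation / cost_judge)
    scale = total_budget / (cost_oracle * k_oracle + cost_judge * k_judge)
    return k_oracle * scale, k_judge * scale   # (n_oracle, n_judge)

# Example: calibration variance dominates (90% of total) and oracle labels
# cost 16x as much as cheap judge scores.
n_oracle, n_judge = sqrt_budget_split(0.9, 0.1, cost_oracle=16.0,
                                      cost_judge=1.0, total_budget=10_000.0)
print(round(n_oracle), round(n_judge))
```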

Concrete walkthrough (Arena example):

  • Inputs: 4,961 prompts × 5 policies; cheap judge scores for all; oracle for a small fraction (e.g., 5%).
  • Calibrate: Learn f(S, length) with two‑stage isotonic on the oracle slice; cross‑fitted.
  • Audit: Run mean residual test per policy; 3 pass, 1 adversarial fails (flagged).
  • Evaluate: Use Direct to compute mean calibrated reward per policy; build bootstrap CIs.
  • Output: Rankings with ~99% accuracy at 5% oracle and ~95% CI coverage; refuse absolute levels for the failing policy.

Secret sauce:

  • Make assumptions auditable (mean transport test) instead of hoping they hold.
  • Encode structure as projections (monotonicity, unit‑mean weights, convex stacking) to reduce variance for free.
  • Prefer Direct when coverage is poor; logs‑only IPS can’t beat CLE’s floor, even with great‑looking ESS.

04 Experiments & Results

The test (what they measured and why):

  • Goal: Can calibrated judge scores pick the right policy and report honest uncertainty at low cost?
  • Measures:
    • Pairwise ranking accuracy (do we order policies correctly?).
    • RMSE (how close are estimated values to oracle truth, minus irreducible oracle noise).
    • CI coverage (do 95% intervals actually include the truth ~95% of the time?).
    • Diagnostics: ESS for weights, TTC for coverage, and transport audit pass/fail.

The competition (what it was compared against):

  • Naive direct (no calibration; treat judge as truth).
  • SNIPS/IPS (logs‑only off‑policy reweighting), with and without covariates.
  • Doubly robust (DR) variants, with and without calibrated rewards and weight stabilization.
  • CJE Direct and stacked‑DR with calibration‑aware inference.
  • A separate binary baseline (misclassification correction/Rogan–Gladen) on a preference dataset.

Scoreboard with context:

  • Accuracy:
    • CJE (Direct + two‑stage calibration) averaged ~94% pairwise; at 5% oracle and n≈5k, peaked at ~99% pairwise—like scoring 99/100 when others got ~50–80.
    • SNIPS was ~38% pairwise (worse than a coin flip in this setup), showing logs‑only reweighting failed here.
    • Calibrated IPS hovered ~47% pairwise even after big ESS gains—near random ranking.
    • DR did not dominate; under low coverage it collapsed toward the outcome model, matching Direct but not beating it.
  • Uncertainty:
    • Naive CIs on uncalibrated scores had ~0% coverage (catastrophically over‑confident).
    • CJE’s bootstrap with bias correction restored ~95% coverage in Direct and stacked‑DR modes.
  • Cost:
    • With a 16× oracle/judge price ratio, CJE at 5% oracle reached ~99% pairwise accuracy at ~14× lower total cost than full‑oracle across 5 policies.

Surprising findings:

  • High ESS did not save IPS. Weight stabilization boosted ESS from <1% to >80%, but rankings stayed near random (≈47% accuracy). TTC (19–49%) revealed the true blocker: the logger rarely visited target‑typical regions, so CLE imposed a hard precision floor.
  • Clone wasn’t 100% ESS. Even the clone policy (same model, new seed) had raw ESS ~26% instead of the ideal ~100%, likely due to teacher‑forcing brittleness (tokenization, API nondeterminism). Stabilization pushed this near 99%, but this highlighted practical fragility.
  • DR ≈ Direct under low overlap. Because the IPS part added noise without useful signal, DR’s advantage vanished; Direct slightly edged it on average.
  • Transport audit worked as an alarm. The adversarial “unhelpful” policy failed the mean residual test with a −0.31 shift. CJE flagged it and refused absolute level claims, preventing misleading dashboards.
  • Binary correction loses lots of signal. On a preference dataset designed for binary methods, CJE’s continuous calibration achieved ~93% lower RMSE and ~9× narrower CIs than Rogan–Gladen, even though both achieved near‑nominal coverage. Binary can be robust to class imbalance shifts, but it throws away useful continuous information.

Concrete example in action:

  • Suppose base and premium policies both earn judge score ≈0.80 on average. After two‑stage calibration (accounting for response length), premium’s calibrated mean becomes 0.78 while base becomes 0.74. Bootstrap 95% CIs don’t overlap, so we call premium better with high confidence. For the adversarial policy, the audit shows a −0.31 bias; we still rank it lowest by a wide margin but refuse to print an absolute level like “0.48” without recalibration.

Take‑home numbers:

  • Pairwise ranking: 94% average across settings; 99% at 5% oracle, n≈5k.
  • Coverage: ~95% for calibrated Direct/stacked‑DR; 0% for naive.
  • IPS with pretty ESS still ≈47% pairwise when TTC is low (19–49%).
  • Cost: ~14× cheaper than full‑oracle when amortized across 5 policies at 5% oracle.

What changed because of CJE:

  • Teams can calibrate once, audit per policy, and scale cheaply with honest uncertainty.
  • Logs‑only IPS isn’t the default anymore for open‑ended text; Direct is.
  • Diagnostics (TTC/CLE + transport audit) tell you when to trust which method.

05 Discussion & Limitations

Limitations:

  • 🍞 Hook: Even the best map can be wrong if the legend uses the wrong units.

  • đŸ„Ź The Concept: Oracle choice matters. If the oracle labels don’t match stakeholder values, you can be precisely wrong.
    Why it matters: CJE faithfully chases the chosen oracle; picking the right one is a governance decision.

  • 🍞 Anchor: If your “oracle” rewards verbosity, you’ll reward verbosity—calibrated or not.

  • 🍞 Hook: Replaying a piano piece from the notes only works if the sheet music matches the performance.

  • đŸ„Ź The Concept: Teacher forcing can be brittle. When log‑probs or tokenization drift, importance ratios are noisy.
    Why it matters: IPS/DR can misfire even for near‑clone policies; Direct avoids this.

  • 🍞 Anchor: Clone policy had raw ESS ~26% (not ~100%), revealing TF fragility.

  • 🍞 Hook: A thermometer that’s right on average might still be off at very high or very low temps.

  • đŸ„Ź The Concept: The mean transport test ensures unbiased means but not perfect calibration within every subgroup or tail.
    Why it matters: For fairness or tail‑risk use cases, add subgroup audits or richer checks.

  • 🍞 Anchor: A policy may pass the mean test overall but show bias for very long answers; run subgroup diagnostics.

Required resources:

  ‱ A small oracle audit slice per policy or deployment cell (often ~50–200 labels suffice to check mean transport).
  • Cheap judge scoring for bulk responses.
  • For IPS/DR only: reliable teacher forcing/log‑probs.
  • Compute for bootstrap refits (modest compared to model generation and TF costs).

When not to use (or when to switch modes):

  • TTC < ~0.70 (poor coverage): prefer Direct over IPS/DR.
  • No oracle available at all: you can rank heuristically, but don’t claim absolute levels.
  • Severe temporal drift in judge or domain: refresh calibration and rerun transport audits.
  • Self‑judging without audit: models judging themselves can be biased; always audit.

Open questions:

  • Selection‑aware inference when scanning many policies (winner’s‑curse control).
  • Privacy‑robust or differentially private isotonic calibration.
  • Active oracle budgeting that automatically targets high‑value labels.
  • Sequential/agent evaluations with stepwise diagnostics and prefix‑aware weighting.
  • Fairness: stronger subgroup‑specific transport tests and guarantees.

Overall assessment: CJE is strongest in the common, practical regime—open‑ended generation where you can re‑generate answers. It upgrades cheap judges into reliable, auditable estimators with honest uncertainty. In logs‑only settings, CJE offers the right diagnostics to decide when reweighting is viable—and when it simply cannot beat a coverage floor.

06 Conclusion & Future Work

Three‑sentence summary:

  • This paper introduces Causal Judge Evaluation (CJE): calibrate a cheap judge to a small oracle slice, audit per policy to ensure unbiasedness, and evaluate at scale with calibration‑aware uncertainty.
  • On 4,961 prompts and five policies, CJE achieves ~94% average ranking accuracy (up to 99% at 5% oracle) with ~95% CI coverage, while reducing cost ~14× versus full‑oracle labeling.
  • It also explains why logs‑only off‑policy methods can fail despite high ESS (coverage‑limited efficiency) and provides diagnostics and gates to prevent silent failures.

Main achievement:

  • Turning an uncheckable assumption (“the judge’s bias won’t change”) into a standard, auditable test (mean transport), and combining it with mean‑preserving calibration and bootstrap with bias correction to produce accurate, honest, and affordable evaluations.

Future directions:

  • Selection‑aware inference over many candidate policies, privacy‑robust calibration, active oracle budgeting, and stepwise evaluation for multi‑turn agents with prefix‑aware diagnostics.

Why remember this:

  • CJE makes the common practice of “LLM‑as‑judge” safe and scalable by adding three missing ingredients: calibration, audits, and honest uncertainty. It gives teams a practical protocol: calibrate once, audit per policy, prefer Direct when coverage is low, and refuse absolute levels when the audit fails. That’s how you aim cheap measurements at the right target without fooling yourself.

Practical Applications

  ‱ Model selection: Pick between prompts/models with calibrated rankings that match user value.
  ‱ A/B testing at scale: Use Direct + calibration to compare many variants cheaply with valid CIs.
  ‱ Safety and red-teaming triage: Calibrate judges to expert safety labels, then audit per release.
  ‱ Regression monitoring: Periodically run transport audits to catch drift in judge behavior over time.
  ‱ Cost-optimized evaluation: Apply the square-root budgeting rule to split spend between oracle labels and cheap scores.
  ‱ Policy gating: Refuse absolute level claims for any policy that fails the mean transport test; report rankings only.
  ‱ Data collection planning: Use the calibration-uncertainty share to decide whether to buy more oracle labels or more judged samples.
  ‱ Diagnostic gating for OPE: Use TTC ≄ 0.70 and Bhattacharyya affinity thresholds before trusting IPS/DR.
  ‱ Judge selection: Prefer richer, more informative judges (multi-dimensional rubrics) to tighten intervals via better calibration.
  ‱ Fairness checks: Run subgroup transport audits and reliability diagrams to detect uneven calibration.
#LLM-as-judge · #calibration · #isotonic regression · #surrogate index · #off-policy evaluation · #importance sampling · #doubly robust estimation · #bootstrap inference · #transportability · #coverage-limited efficiency · #target-typicality coverage · #teacher forcing · #effective sample size · #influence-function stacking · #evaluation uncertainty