
Are We on the Right Way to Assessing LLM-as-a-Judge?

Intermediate
Yuanning Feng, Sinan Wang, Zhengxiang Cheng et al. · 12/17/2025
arXiv | PDF

Key Summary

  • This paper asks whether we are judging AI answers the right way and introduces Sage, a new way to test AI judges without using human-graded answers.
  • Sage checks two things: local self-consistency (does the judge flip when you swap the order?) and global logical consistency (do the rankings make sense overall?).
  • It measures these with two metrics: IPI (how often pairwise choices flip) and TOV (how many fixes you need to make the whole ranking logical).
  • On a 650-question set, even top models like Gemini-2.5-Pro and GPT-5 became inconsistent on harder cases, with nearly a quarter of pairwise judgments unstable.
  • Writing down a judging rubric first reduces inconsistency a lot (IPI down 16.1% and TOV down 11.0%), which helps stop 'situational preferences.'
  • Fine-tuned judges usually improve, and using a diverse panel of models can beat the best single judge by up to 15%, while debate-style systems often got worse.
  • Sage’s scores are extremely stable across runs and temperatures and strongly correlate with human-labeled benchmarks like LLMBar and RewardBench2.
  • Humans are inconsistent too: on hard tasks, people’s IPI hit 0.332 and TOV 6.523, so human labels aren’t a perfect gold standard.
  • Running Sage costs under $7 and about an hour, versus roughly $82K and 100 days for humans at the same scale.
  • Sage helps pick steadier judges for automated arenas and could extend to multimodal judging.

Why This Research Matters

AI systems increasingly judge other AIs, shaping what models learn and which answers users see. If those judges are inconsistent, we can train models in the wrong direction and pick worse answers at runtime. Sage gives a fast, cheap, and label-free way to check whether judges are steady and logical, so teams can trust their evaluation pipelines. It also reveals simple fixes—like writing a one-time rubric or using a small panel—that make judges more reliable today. Because it correlates strongly with human-labeled benchmarks but avoids human bias and cost, Sage is practical at scale. And since even people are inconsistent, shifting toward intrinsic consistency checks helps keep our metrics honest.

Detailed Explanation


01 Background & Problem Definition

šŸž Top Bread (Hook) Imagine you and your friends are voting on the best school lunch. Some of you like pizza more than tacos today, but tomorrow you change your mind just because pizza was listed second. That would make your voting a bit wobbly, right?

🄬 Filling (The Actual Concept)

  • What it is: LLM-as-a-Judge is when a big AI model acts like a referee, deciding which AI answer is better.
  • How it works (before this paper):
    1. People collect questions and several AI answers for each question.
    2. Humans label which answer is better (or give a score).
    3. An AI judge is trained or checked against those human labels.
  • Why it matters: If the judge is shaky or biased, it can reward the wrong answers during training and pick the wrong outputs at test time. That breaks the whole process of improving AI with feedback and choosing the best response on the fly.

šŸž Bottom Bread (Anchor) Example: If you ask, "What’s the capital of France?" a good judge should always pick "Paris" over "Lyon" regardless of which one is listed first or if the longer answer sounds fancier.

The World Before You know how: Teachers sometimes grade essays differently if they are tired, rushed, or see the work in a different order. Similarly, most benchmarks for AI judges relied on human-annotated ground truth. That means humans decided which answer was better, and AI judges were trained or tested to match them. This approach helped kickstart the field but had big cracks:

  • Human bias: People can favor longer answers (verbosity bias), first-listed answers (position bias), or even answers that feel familiar.
  • Disagreement: Researchers repeatedly found low agreement between different annotators on the same question (around 63–66% on popular datasets), especially for subjective tasks.
  • Cognitive limits: Very long answers overwhelm attention; tiny errors get missed.
  • Scalability: Hiring humans is time-consuming and expensive, limiting dataset size and diversity.

The Problem If your so-called "gold labels" are noisy, your judge learns those noises. Worse, when the judge is later used to train other models (as a reward model) or to pick the best answer among many at inference time, the whole pipeline inherits that shakiness.

Failed Attempts

  • More humans: Adding more annotators reduces but doesn’t remove bias and costs even more.
  • Stricter instructions: Helps a bit, but people still disagree and miss subtle issues in long texts.
  • Fancy prompts to judges: Can reduce some biases, but without a solid way to measure judge stability, it’s hard to know what truly improved.

The Gap We’ve been asking, "Does the judge agree with humans?" when we should also be asking, "Is the judge internally consistent and logically coherent?" The field lacked a human-free, scalable way to check whether an AI judge makes stable, rational choices by itself.

Real Stakes

  • Training: If a reward model is inconsistent, reinforcement learning can optimize for the wrong behaviors.
  • Test-time selection: If the judge wobbles when comparing close answers, you might discard the best one.
  • Benchmarks: If judges aren’t steady, leaderboard rankings and research comparisons become unreliable.
  • Costs: Human labels are pricey; we need fast, affordable checks that scale.

Introducing LLM-as-a-Judge (concept sandwich) šŸž You know how a referee decides fouls in a game? They call what’s fair or not so the match makes sense. 🄬 The concept: LLM-as-a-Judge is an AI model that decides which answer is better.

  • How it works:
    1. Show the judge a question and two answers.
    2. Ask which answer is better (or if they tie).
    3. Use these decisions to grade models, train better ones, or choose the best answer at runtime.
  • Why it matters: Without a reliable judge, we can’t confidently improve or compare AI systems. šŸž Anchor: When two chatbots answer "Explain photosynthesis," the judge picks the clearer, more accurate one.

This paper proposes a shift: instead of relying on human labels as the only truth, first check if the judge is self-consistent and logically coherent. If it flips decisions when you reverse answer order, or if its overall preferences create contradictions, that’s a red flag, regardless of any human labels.

02 Core Idea

šŸž Top Bread (Hook) You know how when you sort your stickers from best to worst, you don’t want to end up in a loop like "A better than B, B better than C, but somehow C better than A"? That means your sorting rule is broken.

🄬 Filling (The Actual Concept)

  • What it is: The key idea is to evaluate AI judges by how consistent they are locally (pair by pair) and globally (across all choices), without needing human labels.
  • How it works:
    1. For each question, generate several answers.
    2. Compare every pair twice: forward order and reversed order (to check for position bias).
    3. Measure local flips with IPI (Intra-Pair Instability).
    4. Measure global contradictions with TOV (Weak Total Order Violation).
  • Why it matters: A reliable judge should have a steady "inner rulebook". If it flips when you swap order or creates circular rankings, it’s not trustworthy.

šŸž Bottom Bread (Anchor) Example: If the judge prefers Answer A over B in one prompt, but prefers B over A when you just switch positions, that’s a local inconsistency (counts toward IPI). If it says A > B, B > C, but C > A, that’s a global inconsistency (raises TOV).

Multiple Analogies

  • Map compass: A good compass points north no matter how you turn it. IPI checks if the judge’s compass changes when you rotate the inputs. TOV checks if the map of all directions makes sense together.
  • Teacher grading: IPI is seeing if the same two essays, when swapped in order, get opposite grades. TOV is checking that the teacher’s overall ranking doesn’t contain loops.
  • Tournament ranking: IPI is a referee who calls different winners when teams swap jerseys. TOV is a league table that says Team A beats B, B beats C, but C somehow beats A.

Before vs After

  • Before: We mostly asked ā€œDoes the judge agree with humans?ā€ and hoped that meant it was good.
  • After: We can now also ask ā€œIs the judge rational on its own?ā€ If it isn’t even self-consistent, agreement with humans might be accidental or fragile.

Why It Works (intuition)

  • Local check (IPI) targets first-order bias: If reversing order changes the decision, the judge is sensitive to position or surface framing.
  • Global check (TOV) targets logic over the whole set: If preferences form cycles or contradictions, no single, stable rule explains the choices. Fixes needed to create a tidy ranking directly count the contradictions.
  • Together, they isolate two failure modes: wobbly pairwise choices and broken overall reasoning.
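
For readers who like symbols, here is one hedged way to write these two checks down, following the descriptions in this article rather than the paper's exact notation (the symbols below are introduced here purely for illustration):

```latex
% For a question q, let A be its candidate answers and P the unordered pairs from A.
% J(a,b) is the judge's verdict when a is shown first: "a wins", "b wins", or "tie".
\[
\mathrm{IPI}(q) \;=\; \frac{1}{|P|}\sum_{\{a,b\}\in P}
  \mathbf{1}\!\left[\, J(a,b)\ \text{and}\ J(b,a)\ \text{are not mirror verdicts}\,\right]
\]
\[
\mathrm{TOV}(q) \;=\; \min_{\preceq\ \text{a weak order on}\ A}\;
  \#\left\{\{a,b\}\in P \;:\; \text{the judged outcome contradicts}\ \preceq\right\}
\]
% Suite-level scores average IPI(q) and TOV(q) over all questions.
```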

Building Blocks (each in sandwich form)

  1. Sage šŸž Imagine a science fair where projects are auto-graded by consistent rules instead of by tired judges. 🄬 What it is: Sage is a human-free evaluation suite that tests an AI judge’s local and global consistency.
  • How it works:
    1. Build a question set (650 questions from structured tasks and real chats).
    2. Generate six candidate answers per question.
    3. Compare all pairs in both orders.
    4. Compute IPI and TOV; average across questions.
  • Why it matters: It’s scalable, cheap, and avoids injecting human bias into the measurement itself. šŸž Anchor: Running Sage costs under $7 and finishes in under an hour, versus months of human grading.
  2. Intra-Pair Instability (IPI) šŸž You know how flipping a coin shouldn’t change your opinion of two drawings if you already decided which is better? 🄬 What it is: IPI measures how often a judge’s choice flips when you reverse the order of the same pair.
  • How it works:
    1. Ask: A vs B.
    2. Ask again: B vs A.
    3. If the decisions aren’t opposites (or a tie handled symmetrically), that’s an inconsistency.
    4. Count inconsistencies across all pairs.
  • Why it matters: High IPI means the judge is swayed by order or noise rather than content. šŸž Anchor: If a judge sometimes likes the first-listed answer just because it’s first, IPI will catch it.
  3. Weak Total Order Violation (TOV) šŸž Imagine ranking your favorite snacks. If your list says chips > cookies > fruit > chips, your list can’t be right. 🄬 What it is: TOV counts the minimum number of pairwise decisions you’d need to change to make all rankings consistent (allowing ties); a minimal brute-force sketch of this search appears right after this list.
  • How it works:
    1. Collect all pairwise outcomes for the six answers.
    2. Search for the closest consistent ranking (with possible ties).
    3. Count how many pair results must change to reach that ranking.
  • Why it matters: High TOV means the judge’s global logic contradicts itself. šŸž Anchor: A judge with A > B, B > C, but C > A must flip at least one decision; TOV measures that.
  4. Local Self-Consistency šŸž Think of a referee making the same call on identical plays. 🄬 What it is: A judge’s ability to stick to the same decision on the same pair when superficial details (like order) change.
  • How it works: Repeat the same comparison in both orders and expect mirrored outcomes.
  • Why it matters: Without it, every pairwise comparison is unreliable. šŸž Anchor: If swapping answer order flips the winner, local self-consistency is broken.
  5. Global Logical Consistency šŸž Like making sure your rules for a whole tournament never create impossible loops. 🄬 What it is: The judge’s preferences across all answers should form a sensible, possibly tied, overall ranking.
  • How it works: Check for cycles and contradictions; TOV quantifies the fixes needed.
  • Why it matters: Without it, the judge can’t produce a trustworthy leaderboard or final choice. šŸž Anchor: A coherent ranking lets you say confidently which answer is best overall.
  6. Situational Preference šŸž You know how a teacher might prefer creativity in one essay but reward neatness in another, shifting the goalposts? 🄬 What it is: When a judge changes its criteria depending on the particular pair, not the question.
  • How it works: The model lacks a stable internal rubric; with different pairs, it emphasizes different things.
  • Why it matters: It causes both local and global inconsistency. A fixed rubric per question helps stop this. šŸž Anchor: Generating a one-time rubric for the question and judging all pairs by it reduced IPI by 16.1% and TOV by 11.0%.
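
The TOV search described in building block 3 can be brute-forced when each question has only six answers. Below is a minimal Python sketch; the verdict encoding ("a", "b", "tie" per unordered pair) and the assumption that the two presentation orders have already been consolidated into one outcome per pair are illustrative choices, not the paper's implementation.

```python
from itertools import product

def weak_total_order_violation(verdicts: dict[tuple[int, int], str], n: int) -> int:
    """Minimum number of pairwise verdicts that must change so that all
    preferences agree with some ranking that allows ties (a weak order).
    `verdicts` maps each unordered pair (i, j) with i < j to "a" (answer i
    preferred), "b" (answer j preferred), or "tie"."""
    best = len(verdicts)  # upper bound: change every verdict
    # Enumerate candidate rankings by assigning each answer a score level;
    # equal levels mean a tie. 6 answers -> 6**6 = 46,656 assignments, cheap.
    for levels in product(range(n), repeat=n):
        disagreements = 0
        for (i, j), verdict in verdicts.items():
            if levels[i] > levels[j]:
                implied = "a"      # this ranking says answer i is better
            elif levels[i] < levels[j]:
                implied = "b"      # this ranking says answer j is better
            else:
                implied = "tie"
            if verdict != implied:
                disagreements += 1
        best = min(best, disagreements)
    return best

# Example: a 3-cycle A > B, B > C, C > A needs at least one flip -> TOV = 1.
cycle = {(0, 1): "a", (1, 2): "a", (0, 2): "b"}
print(weak_total_order_violation(cycle, n=3))  # -> 1
```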

03 Methodology

At a high level: Question → Generate 6 answers → Pairwise comparisons (both orders) → Compute IPI and TOV → Average over questions → Report stability
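
A minimal Python sketch of this flow for a single question, assuming a hypothetical judge_fn(question, first, second) callable that returns "first", "second", or "tie"; the interface and the mirroring check are assumptions of this sketch, not the paper's code.

```python
from itertools import combinations
from typing import Callable

# judge_fn(question, first, second) must return "first", "second", or "tie".
# Wrapping a real LLM API behind this signature is left to the reader; the
# signature itself is an assumption of this sketch.
Judge = Callable[[str, str, str], str]

def question_ipi(question: str, answers: list[str], judge_fn: Judge) -> float:
    """Symmetrized round-robin over one question's candidate answers.
    Returns Intra-Pair Instability (IPI): the fraction of unordered pairs
    whose two presentation orders give non-mirrored verdicts."""
    pairs = list(combinations(range(len(answers)), 2))  # 6 answers -> 15 pairs
    unstable = 0
    for i, j in pairs:
        forward = judge_fn(question, answers[i], answers[j])   # answer i shown first
        backward = judge_fn(question, answers[j], answers[i])  # answer j shown first
        mirrored = (
            (forward == "first" and backward == "second")
            or (forward == "second" and backward == "first")
            or (forward == "tie" and backward == "tie")
        )
        if not mirrored:
            unstable += 1
    return unstable / len(pairs)

# Suite-level IPI is the mean of question_ipi over all questions, e.g.:
#   ipi = sum(question_ipi(q, answers[q], judge_fn) for q in questions) / len(questions)
```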

Key Recipe Steps (with concept sandwiches where new ideas appear)

  1. Build the Dataset
  • What happens: Mix 650 questions from two sources—structured categories (Factuality, Focus, Precise Instruction Following, Math, Safety) and real-world chats (WildChat-1M). For each question, create 6 answers.
  • Why it exists: This blends clean, test-like problems with messy real queries, making the test realistic and diverse.
  • Example: A math puzzle, a safety-sensitive instruction, and a casual user question like ā€œHow can I plan a weekend trip?ā€

Sage-Easy vs Sage-Hard (mini sandwich) šŸž Imagine judging a race where the speed gap is obvious (easy) versus a race where everyone finishes within a second (hard). 🄬 What it is: Two tiers: Easy uses answers from different models with a big quality spread; Hard uses six answers from one strong model, so differences are subtle.

  • How it works:
    1. Easy: Six diverse models produce a wide range of answer quality.
    2. Hard: One capable model generates six similar answers.
  • Why it matters: Judges often look fine on Easy but stumble on Hard, which mirrors fine-grained choices needed in training and test-time selection. šŸž Anchor: On Hard, even top judges’ IPI rose to around 0.25, revealing difficulty with close calls.
  2. Symmetrized Evaluation Protocol (concept sandwich) šŸž You know how a fair coin should give the same odds no matter which side you look at first? 🄬 What it is: Always compare A vs B and B vs A to expose order bias.
  • How it works:
    1. For each unordered pair, do two passes: forward and reversed.
    2. Record both outcomes.
    3. IPI counts when these two outcomes aren’t logical opposites (or consistent ties).
  • Why it matters: A single-pass setup hides position bias; the symmetrized protocol surfaces it. šŸž Anchor: Some models had large position effects; the double-pass reveals and quantifies them.
  3. Round-Robin Pairing
  • What happens: With 6 answers, compare all 15 unique pairs, both orders (total 30 judgments per question).
  • Why it exists: Full coverage ensures global logic can be checked; partial pairs could miss cycles.
  • Example: For answers A–F, judge every pair: A-B, A-C, …, E-F, plus each reversed.
  4. Compute IPI (Intra-Pair Instability)
  • What happens: For every pair, check if reversing order flips the choice incorrectly; average over all pairs to get IPI per question, then average over all questions.
  • Why it exists: Captures local wobbliness and position bias directly.
  • Example: If 3 of the 15 pairs for a question with 6 answers flip inconsistently, IPI = 3/15 = 0.2.
  5. Compute TOV (Weak Total Order Violation)
  • What happens: Find the minimum number of pairwise decisions that must change so that all preferences can be arranged into a consistent ranking with ties allowed.
  • Why it exists: Quantifies global contradictions; the fewer edits needed, the more coherent the judge.
  • Example: If resolving A > B > C > A requires changing 2 pairwise outcomes, TOV += 2.
  6. Aggregate and Validate Stability
  • What happens: Average IPI and TOV across all questions. Repeat runs show tiny variance; theoretical analysis bounds the variance extremely low (ā‰ˆ1e-5 for IPI across the whole suite).
  • Why it exists: Proves that Sage’s scores reflect stable reasoning patterns, not random sampling.
  • Example: Two different runs of the same judge produce nearly identical metrics.
  7. Rubrics to Reduce Situational Preference (concept sandwich) šŸž You know how a teacher writes a grading checklist before reading any papers so they grade fairly? 🄬 What it is: The judge first writes its judging rubric once per question, then uses it for every pair.
  • How it works:
    1. Prompt the judge: ā€œState your criteria for this question.ā€
    2. Lock the rubric.
    3. Judge all pairs using the same rubric.
  • Why it matters: Stops the judge from shifting criteria between pairs; reduced IPI by 16.1% and TOV by 11.0%. šŸž Anchor: On a tricky how-to question, the rubric keeps the judge from favoring chatty answers in one pair and concise answers in another.
  8. Panels vs Debates (concept sandwiches) Panel-based Judge šŸž Picture a talent show with multiple judges who vote, so one person’s bias doesn’t decide everything. 🄬 What it is: A panel (jury) of different models votes on each comparison.
  • How it works:
    1. Gather a diverse set of strong (or weak) models.
    2. Each makes a judgment; aggregate (e.g., majority vote).
    3. Compare panel result to the best single model.
  • Why it matters: Panels often improved robustness (up to 15%) over the best individual judge. šŸž Anchor: A trio of strong models beat the single best model in both IPI and TOV.

Debate-based Judge šŸž Imagine two clones arguing loudly; that doesn’t guarantee truth—sometimes it just amplifies confusion. 🄬 What it is: Multiple agents (versions of the same model) debate, then a judge decides.

  • How it works: Agents exchange arguments across rounds and a judge declares a winner.
  • Why it matters: In tests, debates often worsened IPI and TOV compared to no debate. šŸž Anchor: Changing rounds, judge types, and exchange style still didn’t outperform the simpler baseline.
  9. Deep Reasoning Mode (concept sandwich) šŸž You know how thinking out loud helps you catch mistakes? 🄬 What it is: Prompting models to reason in more depth before deciding.
  • How it works: Switch the model to higher ā€œreasoning depthā€ mode (when available) or require chain-of-thought style analysis before verdict.
  • Why it matters: Generally improved consistency, especially on hard cases. šŸž Anchor: Moving from low to high reasoning mode reduced both IPI and TOV.
  10. Cross-Checks and Sensitivity
  • Temperature and prompts: Scores stayed stable across temperatures and across varied prompts.
  • Model-agnostic hardness: Replacing the generator model for Sage-Hard barely changed scores, showing the difficulty is real, not tied to writing style.
  • Scoring vs Pairwise: Direct scoring often disagreed with pairwise choices, especially on hard cases, hinting that absolute scales are poorly calibrated.
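
One way to quantify that scoring-vs-pairwise mismatch yourself is to compare the ranking implied by per-answer scores against the pairwise verdicts. The sketch below assumes the same illustrative verdict encoding as the TOV sketch earlier; it is a diagnostic idea, not the paper's exact analysis.

```python
def score_pairwise_mismatch(scores: list[float],
                            verdicts: dict[tuple[int, int], str]) -> float:
    """Fraction of pairs where a direct-scoring ranking disagrees with the
    pairwise verdict ("a" = lower-indexed answer preferred, "b", or "tie")."""
    mismatches = 0
    for (i, j), verdict in verdicts.items():
        if scores[i] > scores[j]:
            implied = "a"
        elif scores[i] < scores[j]:
            implied = "b"
        else:
            implied = "tie"
        if implied != verdict:
            mismatches += 1
    return mismatches / len(verdicts)
```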

Secret Sauce

  • Two simple, label-free lenses (IPI and TOV) catch common, damaging judge failures.
  • The symmetrized, full round-robin makes biases visible and logic checkable.
  • Rubric-first judging and panels are practical knobs to reduce inconsistency right now.
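
A minimal sketch of the rubric-first knob, assuming a generic complete(prompt) helper wrapping whatever LLM you use; the helper and the prompt wording are illustrative assumptions, not the paper's template.

```python
from typing import Callable

# complete(prompt) -> model text. Any chat/completions client can be wrapped
# to fit this one-argument interface.
Complete = Callable[[str], str]

def rubric_first_judge(question: str, answers: list[str], complete: Complete):
    """Write the judging criteria once per question, then reuse the frozen
    rubric for every pairwise comparison in both presentation orders."""
    rubric = complete(
        f"Question:\n{question}\n\n"
        "Before seeing any answers, state the criteria you will use to judge "
        "answers to this question, as a short numbered rubric."
    )
    verdicts = {}
    for i in range(len(answers)):
        for j in range(len(answers)):
            if i == j:
                continue  # every ordered pair: forward and reversed presentations
            reply = complete(
                f"Question:\n{question}\n\nRubric (use only this):\n{rubric}\n\n"
                f"Answer 1:\n{answers[i]}\n\nAnswer 2:\n{answers[j]}\n\n"
                "Which answer better satisfies the rubric? "
                "Reply with exactly one of: 'Answer 1', 'Answer 2', 'Tie'."
            )
            verdicts[(i, j)] = reply.strip()
    return rubric, verdicts
```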

04 Experiments & Results

The Test

  • What they measured: Two things—local flips (IPI) and global contradictions (TOV). They also tested stability across multiple runs, temperatures, and prompts, and checked correlation with human-labeled benchmarks (LLMBar, RewardBench2) to show Sage is meaningful.
  • Why it matters: If IPI/TOV are stable and track external difficulty/accuracy, they’re reliable gauges for judge quality.

The Competition

  • Baselines: Human-labeled evaluations like LLMBar and RewardBench2.
  • Models: 13 popular LLM judges (proprietary and open-source), plus six fine-tuned judges, plus multi-model panels and debate frameworks.
  • Datasets: 650 questions pulled from structured categories and real chat logs, with two difficulty tiers (Easy: varied models; Hard: one strong model produces all six answers).

The Scoreboard (with context)

  • Strong correlations: Spearman correlations between Sage metrics and external benchmarks were high (ā‰ˆ0.80–0.89), meaning worse IPI/TOV comes with more human-labeled errors too.
  • Stability: Variance across independent runs was tiny (as low as 2.2e-6 for IPI); theory bounds overall variance to around 1e-5, showing reproducibility.
  • Big picture: Everyone gets worse on Sage-Hard. IPI and TOV roughly doubled or tripled compared to Sage-Easy. That’s like students who get As on basic quizzes but drop to Cs when questions get subtle.
  • Top models still wobble on Hard: Even leaders like Gemini-2.5-Pro and GPT-5 had an IPI of roughly 0.25 on hard tasks, meaning about 1 in 4 local comparisons showed instability.

Surprising Findings

  • Rubrics help a lot: Having the judge write a one-time rubric per question reduced IPI by 16.1% and TOV by 11.0%—a big gain from a simple prompt trick.
  • Panels beat solos: A panel of strong models slightly but consistently outperformed the best individual model (up to ~15% better), suggesting diversity helps cancel idiosyncratic biases; a minimal voting sketch follows this list.
  • Debates can hurt: Multi-agent debates frequently made metrics worse than doing no debate at all, even after trying different rounds, judge roles, and argument styles.
  • Deep reasoning helps: Turning up explicit reasoning generally improved consistency, especially on hard sets.
  • Humans aren’t perfectly consistent either: On Sage-Hard, human IPI hit 0.332 and TOV 6.523. That’s like getting a wobbly B- when we expected a solid A. It shows human labels aren’t a flawless gold standard.
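
A minimal sketch of panel aggregation by majority vote, assuming each panel member exposes the same judge_fn-style interface as the earlier sketch; the fallback to "tie" when the vote splits is an arbitrary illustrative choice, not necessarily the paper's aggregation rule.

```python
from collections import Counter

def panel_verdict(question: str, first: str, second: str, judges) -> str:
    """Ask every panel member for a verdict and take the majority vote.
    `judges` is a list of callables returning "first", "second", or "tie"."""
    votes = Counter(j(question, first, second) for j in judges)
    (top, top_count), *rest = votes.most_common()
    if rest and rest[0][1] == top_count:
        return "tie"  # no strict majority among the panel
    return top
```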

Additional Context

  • Scoring vs Pairwise mismatch: Direct scoring often disagreed with pairwise winners, especially on Hard, suggesting absolute score calibration is a weak spot.
  • Cost and speed: Running Sage cost under $7 and took about an hour; a human replication would cost roughly $81,981 and take about 100 days for the same volume.
  • Model-agnostic difficulty: Switching the generator for Sage-Hard barely changed results, confirming hardness is from close-answer similarity, not a model’s writing style.

Bottom Line Sage reveals a consistent pattern: judges look fine when differences are obvious, but many crack under fine-grained pressure. Simple interventions—rubrics, panels, and deeper reasoning—offer immediate improvements.

05 Discussion & Limitations

Limitations

  • Label-free doesn’t mean bias-free: Sage removes human labels, but the answers and judges still come from models that can carry their own biases.
  • Hardness definition: ā€œHardā€ relies on generating similar-quality answers from one capable model; though validated, this is still a design choice.
  • No absolute correctness: Sage checks internal consistency, not factual truth. A perfectly consistent judge could still be consistently wrong without external facts.
  • Computational pairing cost: Full round-robin scales quadratically with answer count (though still cheap vs humans), so very large candidate sets need careful budgeting.

Required Resources

  • Access to several LLMs (for generating answers and serving as judges) and modest compute budget (<$10 per full run at cited rates).
  • Prompt templates for symmetrized comparisons and optional rubric generation.
  • Basic tooling to compute IPI/TOV and to orchestrate pairwise runs.

When NOT to Use

  • If you need domain truth (e.g., medical correctness) rather than rational consistency alone.
  • If you only have one answer per question (no pairs), since Sage’s power comes from pairwise and global checks.
  • If your application never encounters close-call comparisons; then simpler, cheaper checks might suffice.

Open Questions

  • Can we combine Sage with automated fact-checkers to capture both consistency and correctness?
  • What are the best ways to learn or enforce rubrics automatically during training so judges internalize stable criteria?
  • Can panel formation be optimized (diversity vs cost) to guarantee improvements over the best single model?
  • Are there lightweight approximations to TOV that scale to even larger candidate sets while preserving insight?
  • How do these insights transfer to multimodal judging (images, audio, video) and to safety-critical settings?

06 Conclusion & Future Work

3-Sentence Summary Sage evaluates AI judges without human labels by checking whether their choices are consistent for pairs (IPI) and logically coherent across full rankings (TOV). Across 650 questions, even top models became unreliable on close-call comparisons, but simple fixes—like writing one rubric per question, using panels, and encouraging deeper reasoning—made them steadier. Sage’s metrics are cheap, fast, stable, and correlate strongly with human-labeled benchmarks, making them a practical proxy for robustness and even accuracy.

Main Achievement Shifting the focus from agreement-with-humans to internal rationality, Sage introduces two stable, scalable metrics (IPI and TOV) that expose and quantify judge instability without any human annotation.

Future Directions

  • Fuse Sage with automated factuality tools to capture both consistency and objective correctness.
  • Train judges with built-in rubric formation to reduce situational preferences by design.
  • Optimize panels and voting rules for maximum robustness per dollar.
  • Extend to multimodal judging and safety-critical evaluations with additional safeguards.

Why Remember This Sage reframes what it means to be a ā€œgood judgeā€: not just matching human labels, but being stable and logically coherent on its own. That mindset—and the simple, powerful tools behind it—can make our AI evaluation systems more trustworthy, scalable, and fair.

Practical Applications

  • Screen and select the most consistent judges for automated model arenas to stabilize rankings.
  • Add a one-time, per-question rubric step to your judge prompt to reduce situational preferences.
  • Use a small, diverse panel of models to vote on judgments and outperform a single judge.
  • Increase explicit reasoning depth for judges on close-call tasks to boost consistency.
  • Run Sage regularly as a regression test to catch new biases or instability after model updates.
  • Prefer pairwise comparisons over direct scoring when evaluating close-quality answers.
  • Tune inference-time selection (e.g., rejection sampling) by choosing judges with lower IPI/TOV on Sage-Hard.
  • Benchmark fine-tuned judge models with Sage before deployment to avoid overfitting to biased labels.
  • Estimate evaluation risk: use TOV as a lower bound on error rate when ground truth is scarce.
  • Extend Sage-style checks to multimodal judging (images, audio, video) to detect cross-modal inconsistencies.
#LLM-as-a-Judge#Sage evaluation#Intra-Pair Instability#Weak Total Order Violation#judge consistency#transitivity#symmetrized evaluation#pairwise comparison#rubric prompting#panel-based evaluation#reasoning depth#automated arenas#benchmark stability#situational preference#human-free evaluation
Version: 1