
Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Intermediate
Zhuochun Li, Yong Zhang, Ming Li et al. Ā· 1/30/2026
arXiv Ā· PDF

Key Summary

  • Big models are often used to grade AI answers, but they are expensive, slow, and depend too much on tricky prompts.
  • This paper shows that small models secretly store good judging clues inside their hidden layers, even if their writing is weak.
  • The authors propose the Semantic Capacity Asymmetry Hypothesis: judging needs far less ā€˜brain power’ than writing, so small models can judge well using what they already know.
  • They introduce Representation-as-a-Judge, which reads the model’s internal representations instead of asking it to write an evaluation.
  • Their system, called INSPECTOR, freezes a small model and trains tiny classifiers (probes) on its hidden states to predict scores for five aspects (consistency, logic, informativeness, fluency, factuality).
  • Across math and science benchmarks (GSM8K, MATH, GPQA), these probes beat prompt-based small models by large margins and get close to big-model judges.
  • Binary judging (high vs low quality) is especially strong (often ~80–90% F1), making it great for fast, cheap data filtering.
  • Ablations show simple ingredients work best: mean pooling of layer embeddings plus logistic regression.
  • Filtered data chosen by these probes trains better student models than random selection and nearly matches filtering done by a powerful LLM.
  • This shifts evaluation from ā€˜LLM-as-a-Judge’ to ā€˜Representation-as-a-Judge’—cheaper, steadier, and more interpretable.

Why This Research Matters

Fast, affordable, and reliable evaluation lets teams curate better datasets and improve AI systems without relying on costly proprietary judges. By tapping into small models’ hidden states, organizations can run large-scale, stable assessments that are less sensitive to prompt phrasing. This approach makes evaluation more transparent, since we can inspect which layers and features carry the signal. It also democratizes evaluation by enabling open-source, on-prem setups that respect privacy and budgets. Finally, higher-quality filtering leads to better-trained assistants for math, science, and everyday tasks, benefiting students, developers, and educators.

Detailed Explanation

01 Background & Problem Definition

šŸž Hook: You know how a good teacher can quickly tell if a math solution makes sense just by scanning the steps, without rewriting the whole solution themselves? They don’t have to be the best writer to be a great grader.

🄬 The Concept: Language models (LMs) are computer programs that learn patterns in text so they can understand and generate language. How it works: (1) They read lots of text, (2) they learn to predict the next word, (3) they build up internal ā€œthoughtsā€ (hidden states) about meaning, and (4) they use those thoughts to answer questions. Why it matters: Without this, models couldn’t understand questions or write any replies.

šŸž Anchor: When you ask an AI, ā€œWhat’s 7Ɨ8?ā€, it uses what it learned from text to predict ā€œ56ā€.

šŸž Hook: Imagine a notebook where you don’t write your final essay, but you keep your outlines and key points—that’s often enough to judge if an idea is good.

🄬 The Concept: Hidden state representations are the model’s internal notes—vectors that store meaning about words, steps, and relationships while the model thinks. How it works: (1) Each layer turns the input into richer features, (2) attention connects related tokens, (3) the final layers prepare to generate words. Why it matters: If we can read these notes, we might judge quality without asking the model to write a new explanation.

šŸž Anchor: Even if a student’s handwriting is messy, a teacher can still see the correct steps by looking at the outline.

šŸž Hook: Think of a talent show where judges give scores without needing to perform themselves.

🄬 The Concept: LLM-as-a-Judge is when a big model is asked (via a prompt) to grade other models’ outputs without seeing the right answers. How it works: (1) Provide a rubric (e.g., logic, fluency), (2) show the question and response, (3) ask the LLM to reason and give scores, (4) read its generated judgment. Why it matters: It works well, but it’s slow (lots of text generation), expensive (big models), and touchy about prompt wording (results can change).

šŸž Anchor: Changing the wording of the judging prompt can swing scores, like how different instructions can confuse a substitute teacher.

The World Before:

  • People used big models like GPT-4 to judge answers in tasks like reasoning and summarization. They were accurate but costly and opaque.
  • Small models were tempting because they’re fast and cheap, but when prompted to judge, they made inconsistent or weak evaluations.

The Problem:

  • We need reliable, scalable evaluation that doesn’t depend on huge models, heavy decoding, or fragile prompts.

Failed Attempts:

  • Prompt-engineering small models to judge better usually didn’t fix inconsistency.
  • Fine-tuning small judges helped a bit but still lagged and remained sensitive to prompts.
  • Using classic metrics (like ROUGE or BLEU) missed deeper reasoning quality.

The Gap:

  • Everyone focused on models’ final written judgments (the surface text), not on the rich signals hidden inside their layers.

šŸž Hook: You know how sometimes you know the right answer in your head but can’t explain it perfectly out loud? The knowledge is there, just not expressed well.

🄬 The Concept: The paper’s key observation is that small models often store strong judging clues in their hidden states—even when their generated judgments are poor. How it works: (1) Collect judgments from a strong LLM on many examples, (2) extract hidden layers from a small model for those examples, (3) train tiny probes to map hidden states to scores, (4) use probes to judge new samples. Why it matters: This reveals that judging might be easier than writing, and we can do it with small models efficiently.

šŸž Anchor: A shy student might ace multiple-choice tests (they know the material) even if they struggle to write essays.

Real Stakes:

  • Cheaper, faster grading lets teams curate huge datasets for training without breaking the bank.
  • More stable, interpretable evaluation helps researchers compare methods fairly.
  • Open-source, small-model judging reduces dependence on proprietary systems.
  • Better data filtering improves downstream models used in math help, coding assistants, and study tools.

02 Core Idea

šŸž Hook: Imagine sorting good apples from bad ones by feeling them through a bag—you don’t open the bag (no fancy display), but your hands tell you enough to decide quickly and accurately.

🄬 The Concept: Aha! The key insight is that evaluation needs much less ā€˜semantic capacity’ than generation, so we can judge quality by reading a small model’s internal representations instead of asking it to write a judgment. How it works: (1) Get strong-LLM scores as training labels, (2) run a small model on the same inputs and grab its hidden layers, (3) train lightweight probes to predict the labels from those hidden layers, (4) use the probes as fast, decoding-free judges. Why it matters: This avoids expensive generation, reduces prompt sensitivity, and makes evaluation interpretable and scalable.

šŸž Anchor: Like checking if a Lego tower is sturdy by gently pressing it (feeling the structure) rather than rebuilding it yourself.

Multiple Analogies:

  1. Airport security scanner: You don’t open every suitcase (no full decoding); the scanner’s internal image (hidden states) shows what matters to decide if it’s safe (the score).
  2. Teacher’s rubric highlights: You scan key steps (representations) instead of rewriting the solution (generation) and still grade reliably.
  3. X-ray of a book: You see the outline (structure of meaning) without reading every word aloud; enough to judge if the plot holds together.

Before vs After:

  • Before: Evaluation = ask a big LLM to write a judgment; accurate but slow, pricey, and prompt-fragile.
  • After: Evaluation = decode-free probing of a small LM’s internal signals; accurate enough, fast, cheap, and steadier.

šŸž Hook: Ever notice it’s easier to check if a math step is wrong than to write a perfect explanation from scratch?

🄬 The Concept: Semantic Capacity Asymmetry Hypothesis says judging takes less capacity than writing. How it works: (1) Generation needs planning, long chains, and style—heavy lifting; (2) Evaluation can spot inconsistencies with localized cues in mid/upper layers; (3) Small models already hold these cues in compressed form. Why it matters: We don’t need huge models (or full decoding) to judge well.

šŸž Anchor: It’s easier to spot a wobbly Jenga block than to build the whole tower.

Building Blocks of the Method:

  • Labels: Use a strong LLM (DeepSeek-V3) to score five aspects: semantic consistency, logicality, informativeness, fluency, factuality.
  • Representations: Feed the same inputs to a small LM (e.g., Qwen3-1.7B) and extract hidden states and attention.
  • Features: Pool per-layer embeddings (mean/last/min/max/concat), add compact stats, and optionally reduce with PCA.
  • Probes: Train tiny classifiers (often just logistic regression) to predict aspect scores (binary and multiclass).
  • Selection: Rank layers/pooled features, pick top layers, concatenate features, and choose the simplest, most stable probe.
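
As a rough sketch of how the feature-related building blocks above fit together, the snippet below concatenates pooled features from a couple of layers, appends simple statistics, and optionally applies PCA. The layer choices, array shapes, and random placeholder data are made up purely for illustration.

```python
# Sketch: assembling probe features from cached per-layer pooled embeddings.
# `pooled` is assumed to be {layer_index: array of shape (n_samples, hidden_dim)}
# built elsewhere; here we fake it with random numbers just to show the shapes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, hidden_dim = 200, 1024
pooled = {layer: rng.normal(size=(n_samples, hidden_dim)) for layer in (15, 18)}

# Concatenate a few selected layers, then add cheap per-sample stats (norm, variance).
X = np.concatenate([pooled[15], pooled[18]], axis=1)
stats = np.stack([np.linalg.norm(X, axis=1), X.var(axis=1)], axis=1)
X = np.concatenate([X, stats], axis=1)

# Optional PCA keeps the probes tiny and guards against overfitting.
X_reduced = PCA(n_components=64, random_state=0).fit_transform(X)
print(X.shape, "->", X_reduced.shape)
```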

Why It Works (intuition, no equations):

  • Hidden states capture structure and facts before words are generated.
  • Mid-to-upper layers tend to bundle the exact cues needed to spot mistakes.
  • Simple linear probes are enough because the signal is already organized by the LM.

šŸž Anchor: A magnifying glass doesn’t rewrite the story—it just helps you see the important letters clearly enough to check for typos.

03 Methodology

At a high level: Input (question + model’s response) → Get gold aspect scores from a strong LLM → Run the same input through a small LM and collect hidden representations → Build features from selected layers → Train tiny probes to predict scores → Output fast, decoding-free judgments.

Step 1: Build a labeled judging set using a strong LLM

  • What happens: Use a medium LM (e.g., Llama-3-8B-Instruct) to generate many responses so there’s a mix of good and bad solutions. Then ask a strong judge (DeepSeek-V3) to score each response on five aspects (semantic consistency, logicality, informativeness, fluency, factuality) using clear rubrics.
  • Why this step exists: We need reliable labels to teach our probes what ā€˜good’ and ā€˜bad’ look like. Without it, probes don’t know what to predict.
  • Example: A math problem’s solution gets scores like Consistency=5, Logic=4, Info=5, Fluency=5, Factuality=4.

šŸž Hook: Like filling a sticker chart where a trusted teacher gives stars for different skills.

🄬 The Concept: Evaluation aspects are the rubric categories that define quality. How it works: Each aspect is scored from 1–5; we can also make a simple pass/fail (≄4 vs <4). Why it matters: Clear rubrics turn fuzzy ā€˜goodness’ into learnable targets for our probes.

šŸž Anchor: A spelling test has separate points for accuracy, neatness, and completeness.

Step 2: Extract hidden representations from a small LM

  • What happens: Feed the evaluation prompt (question + response within the aspect-specific instruction) into a small LM (e.g., Qwen3-0.6B/1.7B, Llama-3.2-1B, Llama-3.1-8B) and collect per-layer hidden states and attention maps while keeping the LM frozen.
  • Why this step exists: We want the model’s ā€˜internal notes’ without forcing it to write a new judgment. Without these, there’s nothing for probes to read.
  • Example: For a 24-layer model, we get a sequence of 24 embedding grids—one per layer—that encode how the model is interpreting the input.

šŸž Hook: Think of photographing each page of a student’s scratch work as they solve a problem.

🄬 The Concept: Hidden state pooling is how we turn a whole sequence of token embeddings into a single compact vector per layer. How it works: Use mean, last, min, max, or concatenations; compute small stats (norm, variance, entropy); optionally include attention entropy summaries. Why it matters: Compact features make simple probes work well and prevent overfitting.

šŸž Anchor: Summarizing a chapter with a short paragraph that still keeps the main idea.

Step 3: Probe each layer and rank configurations

  • What happens: For each layer and pooling type, train a tiny classifier (often logistic regression) to predict aspect scores using cross-validation; record performance for both multiclass (1–5) and binary (high vs low) tasks.
  • Why this step exists: Different layers capture different signals; we need to find where the evaluative cues are strongest and most stable. Without ranking, we might pick weak or noisy layers.
  • Example: Mean-pooled features from upper-mid layers might best predict logicality, while last-token features from a higher layer might suit fluency.

šŸž Hook: Trying different pairs of glasses to see which makes the page clearest.

🄬 The Concept: Probing classifiers are tiny models placed on top of frozen representations to test what information is linearly recoverable. How it works: Keep the LM frozen; fit simple classifiers (logistic regression, linear SVM, small MLP, random forest) on pooled features; use cross-validation to ensure generalization. Why it matters: If simple probes do well, it means the LM already organized the needed signals.

šŸž Anchor: A thermometer (simple tool) can read your body’s temperature because your body already carries that signal.

Step 4: Assemble a final multi-layer probe

  • What happens: Start from the best single layer, add the next best only if it helps; concatenate features across a few chosen layers; tune a small set of classifiers and pick the simplest, most stable winner for each aspect.
  • Why this step exists: Combining a few strong layers usually beats any single layer while staying compact. Without careful selection, we either miss signal or overfit.
  • Example: For factuality, concatenating mean-pooled features from layers 15 and 18 with logistic regression might yield the best binary F1.

šŸž Hook: You don’t stack every lens in the world on a camera—just two or three that together give a crisp picture.

🄬 The Concept: Representation-as-a-Judge is the overall strategy: judge by reading inside the model rather than asking it to talk. How it works: Freeze the small LM, extract features, train tiny probes, and output scores—no decoding. Why it matters: It’s efficient, reliable, and interpretable compared to prompt-based judging.

šŸž Anchor: Checking a car’s dashboard sensors instead of asking the car to write you a paragraph about its health.

The Secret Sauce:

  • Capacity asymmetry: Detecting errors needs fewer resources than writing explanations, so small LMs suffice for judging.
  • Layer sweet spots: Mid-to-upper layers consistently concentrate evaluative signals.
  • Simple beats complex: Mean pooling + logistic regression often wins—showing signals are already well-structured.
  • Balanced training: Downsampling scores (1–5) avoids bias toward common labels.
  • Decoding-free pipeline: By skipping generation, evaluation becomes fast, cheap, and reproducible.
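
The "balanced training" point above can be as simple as downsampling every score to the size of the rarest one; here is a sketch with made-up labels.

```python
# Sketch: downsampling so every score (1-5) appears equally often in probe training.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([1, 2, 3, 4, 5], size=500, p=[0.05, 0.1, 0.15, 0.3, 0.4])  # fake labels

per_class = min(np.sum(y == s) for s in range(1, 6))   # size of the rarest score
keep = np.concatenate([
    rng.choice(np.where(y == s)[0], size=per_class, replace=False) for s in range(1, 6)
])
print("kept", len(keep), "of", len(y), "examples;", per_class, "per score")
```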

04 Experiments & Results

The Test: The authors measured how well probes predict aspect scores for reasoning benchmarks (GSM8K, MATH, GPQA), using both multiclass (1–5) and binary (high vs low) classification, reporting weighted F1. They also tested whether these probes can filter data for supervised fine-tuning (SFT) and how well they generalize across datasets (OOD tests).

The Competition:

  • Prompt-based small LMs acting as judges (directly generating scores and justifications).
  • Fine-tuned small LMs (e.g., Qwen3-0.6B) trained to generate judgments.
  • A strong text encoder baseline (RoBERTa) trained on the same scoring data.

Scoreboard with Context:

  • Probing vs prompting: Probes beat prompt-based small LMs by large margins across all aspects and datasets. Think of it like moving from a B- average to consistent A-range on many tasks.
  • Binary tasks shine: High/low-quality prediction often reaches around 80–90% F1 for several setups—like getting an A–A+ when baselines hover around C+/B–.
  • Multiclass is modest but useful: Predicting exact 1–5 scores typically lands around 50–60% F1—reasonable given the strong teacher LLM is far larger.
  • Model size surprises: Bigger small models don’t always win; Qwen3-1.7B sometimes outperforms Llama-3.1-8B and vice versa, depending on aspect and dataset. Scaling isn’t a silver bullet for judging.
  • Secret simplicity: Mean pooling + logistic regression repeatedly tops more complex setups in ablations—simple tools read strong signals.

Surprising Findings:

  1. Hidden strength: Small models encode reliable evaluative cues even when their generated judgments are poor. The signal is inside; generation can hide it.
  2. Binary robustness: Under distribution shift (train on GSM8K, test on MATH, or vice versa), binary probes keep reasonable F1 (often ~35–62%), suggesting coarse quality cues transfer; fine-grained (1–5) scores transfer poorly.
  3. Data filtering works: Using probe scores to filter training data produces SFT gains comparable to filtering with a strong LLM judge and clearly better than random filtering—especially at smaller data scales.
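
Finding 3 suggests a simple filtering recipe: score candidate training examples with a binary probe and keep those whose pass probability clears a threshold. The probe, features, and 0.8 cutoff below are illustrative, not the paper's exact settings.

```python
# Sketch: using a trained binary probe to filter candidate SFT data.
# `probe` would be the fitted classifier from earlier; here we fit one on fake data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 128)), rng.integers(0, 2, size=300)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

X_candidates = rng.normal(size=(1000, 128))            # features of new model responses
pass_prob = probe.predict_proba(X_candidates)[:, 1]    # probability of "high quality"

threshold = 0.8                                        # illustrative cutoff
keep_idx = np.where(pass_prob >= threshold)[0]
print(f"kept {len(keep_idx)} of {len(X_candidates)} candidates for fine-tuning")
```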

Concrete Numbers (summarized):

  • Binary F1 often around 80–90% for aspects like fluency, logicality, or factuality with the best small-model/probe combos, versus much lower for prompt-based small LMs.
  • Multiclass F1 often around 50–60% with probes, substantially above prompt-based small LMs and RoBERTa on these tasks.
  • Ablations: Mean pooling dominates; logistic regression commonly wins; more complex classifiers add little.

Why these results matter:

  • Probes provide near-LLM-judge fidelity for yes/no-style gatekeeping and filtering at a tiny fraction of cost.
  • They reduce prompt fragility and improve reproducibility since evaluation no longer depends on generated text.
  • They open a path to interpretable diagnostics (which layers and features carry evaluative signals).

Example in Action:

  • A GSM8K solution judged as perfect (5s across aspects) by a strong LLM is also scored highly by the Qwen3-1.7B probe, while Qwen3-1.7B’s prompt-based judgment oddly penalizes irrelevant style issues—showing probes align better with gold labels than the same model’s generated text.

05 Discussion & Limitations

Limitations:

  • Rubric dependence: The five aspects (semantic consistency, logicality, informativeness, fluency, factuality) are well-motivated but not universal. Some domains may need other aspects (e.g., safety, fairness, code correctness nuances).
  • Teacher bias: Labels come from a single strong LLM (DeepSeek-V3). Different judges might score differently; mixing judges could reduce bias.
  • Domain focus: Experiments emphasize reasoning datasets; broader coverage (commonsense, code, dialogue safety) needs exploration.
  • Fine-grained transfer: Multiclass (1–5) scores transfer poorly across datasets; probes are best for coarse filtering (binary) in new domains.
  • Representation access: You need to run the small model with hooks to get hidden states; not all deployments expose this easily.

Required Resources:

  • A small open-source LM (0.6B–8B) with API or code access to hidden states and attention.
  • A labeled evaluation set: questions, responses, and aspect scores from a strong LLM (or human raters if available).
  • Modest compute for feature caching and training simple classifiers; far less than decoding with large LLMs.

When Not to Use:

  • If you cannot access internal representations (e.g., closed APIs with no hidden states), probing isn’t feasible.
  • If you need very fine-grained ordinal scores across new domains without task-specific tuning, expect weaker transfer than binary.
  • If the evaluation aspects are poorly defined for your task, probes may learn inconsistent targets.

Open Questions:

  • Multi-judge fusion: How best to combine labels from multiple strong LLMs or humans to reduce bias and improve robustness?
  • Aspect design: What’s the right universal set of aspects for different domains (coding, safety, multimodal tasks)?
  • Better features: Can learned subspace methods or sparse autoencoders reveal even crisper evaluative signals without overfitting?
  • Causality: Are these evaluative features causal for good performance, or correlational fingerprints of internal processing?
  • Interpretability: Can we map probe weights back to tokens or heads to produce human-friendly rationales for the scores?

06 Conclusion & Future Work

Three-Sentence Summary:

  1. The paper discovers that small language models hide strong judging signals in their internal representations, even when they write weak judgments.
  2. It proposes the Semantic Capacity Asymmetry Hypothesis and a practical framework, INSPECTOR, that reads those hidden signals with tiny probes to score quality without generation.
  3. Experiments on GSM8K, MATH, GPQA (and an open-ended set) show big gains over prompt-based small models and approach large-LLM judges, especially for binary filtering.

Main Achievement:

  • A paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge: decoding-free, cheap, stable evaluation using small models’ internal states, with simple probes (often mean pooling + logistic regression) delivering near-LLM fidelity for coarse judgments.

Future Directions:

  • Expand aspect sets and domains (commonsense, coding, safety) and use multiple teacher judges to reduce bias.
  • Improve feature learning (sparse or interpretable subspaces) and produce token-level attributions for transparent rationales.
  • Explore active learning: use probe uncertainty to choose which samples need expensive big-LLM judgments.

Why Remember This:

  • It challenges the assumption that only huge models can judge well and reveals a cost-effective, interpretable alternative.
  • It shows that ā€˜understanding’ can live in hidden layers, even when generation struggles—an important clue for building practical, trustworthy AI evaluators.
  • It offers immediate utility: fast, scalable data filtering that boosts downstream training quality.

Practical Applications

  • Filter large pools of model-generated solutions to keep only high-quality reasoning traces for training.
  • Build lightweight, on-prem evaluation services that score responses without sending data to external APIs.
  • Continuously monitor production models by probing internal signals for drops in logic, consistency, or factuality.
  • Pre-screen data before expensive human or LLM review, reducing labeling costs.
  • Score intermediate chain-of-thought steps to prune weak rationales during self-training.
  • Select cleaner instruction-tuning datasets to boost downstream performance of student models.
  • Benchmark multiple small models consistently without prompt-engineering headaches.
  • Diagnose which layers capture specific quality aspects to guide model editing or distillation.
  • Create fast binary gates (pass/fail) in data pipelines for scalable quality control.
  • Combine probe uncertainty with active learning to route only hard cases to big LLM judges.
#Representation-as-a-Judge #Semantic Capacity Asymmetry #LLM-as-a-Judge #Probing Classifier #Hidden States #Evaluation Metrics #Reference-free Evaluation #GSM8K #MATH #GPQA #PCA #Attention Entropy #Logistic Regression #Data Filtering #Supervised Fine-Tuning
Version: 1