Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

Beginner
Pedro Memoli Buffa, Luciano Del Corro Ā· 1/13/2026
arXiv Ā· PDF

Key Summary

  • The paper introduces Entropy Sentinel, a simple way to watch how accurate an AI is by reading its ā€œuncertainty heartbeatā€ during generation.
  • It only needs the top-20 next-token probabilities that many APIs already return, so it works for both open and closed models without extra cost.
  • Each answer’s uncertainty trace is turned into 11 summary features (like mean, max, and quantiles of entropy), then a tiny classifier predicts how likely that answer is to be correct.
  • Averaging those per-answer probabilities over a topic slice gives a direct estimate of slice-level accuracy (in accuracy units, not just a vague score).
  • Across 10 STEM benchmarks and 9 models (3B–20B), the method often tracks true accuracy well and orders domains almost correctly (median Spearman around 0.94).
  • Training on a difficulty-mixed set (easy + hard) works best, often beating training on medium-only tasks and improving as more diverse benchmarks are added.
  • Some models (like PHI-3.5-MINI) show near-perfect agreement; others (like Qwen3-8B) have weaker entropy–correctness coupling and need extra validation.
  • Compared to classic uncertainty scores (like entropy sum or NLL), Entropy Sentinel is similarly strong for ranking but uniquely outputs calibrated accuracy estimates.
  • The approach is cheap, API-friendly, and practical for continuous monitoring and for prioritizing data collection toward weak domains.
  • Limits include STEM-only evaluation, dependence on decoding settings, top-k entropy approximation, and model-specific reliability, so deployments should validate calibration first.

Why This Research Matters

Many organizations deploy LLMs across changing topics and users, and need a reliable way to see when performance dips without constantly paying for labels. Entropy Sentinel turns what the model already produces (top-k probabilities) into live, understandable accuracy estimates per slice. This makes it practical to spot weak areas quickly and focus data collection where it helps the most. Because it’s cheap, it scales to production traffic and works with both open and closed models. Even when absolute accuracy isn’t perfect, the strong rankings guide teams to prioritize effectively. In short, it helps keep AI systems trustworthy, efficient, and responsive as real-world usage evolves.

Detailed Explanation


01 Background & Problem Definition

šŸž Hook: You know how a teacher doesn’t grade the whole class every single day, but still wants to know who needs help in fractions or grammar? They’d love a quick daily checkup, not a giant test every month.

🄬 The Situation Before: Big language models (LLMs) are used by lots of people asking many kinds of questions that change over time. Companies try to track how well models are doing with hand-built tests and human labeling. This is slow, expensive, and can miss new problem types that show up suddenly. If you only check once in a while, you discover mistakes late and don’t know which topic slices (like ā€œword problemsā€ or ā€œphysicsā€) are weakest today.

🄬 The Problem: Teams need two things at once: (1) Monitoringā€”ā€œWhere is the model performing poorly right now as usage shifts?ā€ and (2) Improvementā€”ā€œWhich data should we collect next to fix the biggest gaps?ā€ Current workflows aren’t continuous and aren’t fine-grained enough (for example, per customer segment, or per topic cluster). Worse, raw uncertainty numbers from models aren’t in accuracy units, so you can’t easily say, ā€œAccuracy is about 64% on this slice.ā€

šŸž Anchor: Imagine running a homework help bot. You want a dashboard that says, ā€œAlgebra word problems: 62% today, Geometry: 78%,ā€ so you know where to focus practice problems and tutoring.

šŸž Hook: Think of a heart monitor—doctors don’t need to open your chest to check health; they read the heartbeat pattern and decide if things look okay.

🄬 Failed Attempts: People tried using raw uncertainty scores like entropy, NLL, or perplexity to tell if answers are correct. Those scores do help rank which answers are riskier, but they aren’t directly accuracy. Their scale shifts across models and domains, so a score like ā€œ1.5ā€ doesn’t translate to ā€œ75% accuracy.ā€ Other methods probe hidden layers or run multiple samples, but these can be expensive, hard to access from closed models, or slow.

šŸž Anchor: It’s like having a thermometer that says ā€œunits: mystery.ā€ You can tell hotter vs. colder, but not the real temperature.

šŸž Hook: Imagine if we could turn the model’s built-in uncertainty trace into a simple, trustworthy ā€œpercent correctā€ estimate, with no special access.

🄬 The Gap: We needed an inference-time signal that is cheap (no extra runs), API-compatible (works with closed models that only return top-k logprobs), and robust under domain shift (learn on a few benchmarks, transfer to new ones). Most of all, we needed accuracy numbers, not just relative risk scores.

šŸž Anchor: Like turning a fuzzy weather vibe (ā€œseems cloudyā€) into a clear forecast (ā€œ60% chance of rainā€).

šŸž Hook: Picture listening to how confident the model sounds at every word it writes, like hearing a singer’s pitch wobble.

🄬 Why This Paper: The authors show that the model’s uncertainty over time—its ā€œentropy traceā€ā€”can be summarized into 11 simple features. A tiny classifier turns those into a probability the answer is correct. Average those probabilities over a topic slice to get slice-level accuracy. This gives a continuous, low-cost monitor you can run from logs.

šŸž Anchor: Your dashboard now reads, ā€œAlgebra word problems: estimated 63% ± a bit,ā€ updated all the time, so you know exactly which slices to improve first.

02 Core Idea

šŸž Hook: Imagine you’re guessing answers on a quiz. When you’re unsure, your voice hesitates; when you’re sure, it’s steady. If we tracked that hesitation, we could predict how often you’re right.

🄬 The Aha in One Sentence: The pattern of uncertainty while an LLM writes (its entropy trace) can be mapped to the chance its final answer is correct; averaging these chances across a slice estimates slice accuracy in accuracy units.

🄬 Multiple Analogies:

  1. Heartbeat/EKG: Doctors read the rhythm to infer health. We read the entropy rhythm to infer correctness.
  2. Detective’s Confidence: A detective’s tone tightens when stuck and relaxes when certain. High entropy means ā€œI’m not sureā€; low entropy means ā€œI’ve got it.ā€
  3. Seismograph: Small shakes vs. big quakes. Entropy spikes signal risky moments; steady low entropy signals confidence.

🄬 Before vs. After:

  • Before: Raw uncertainty scores told you relative risk but not calibrated accuracy. Monitoring was slow and required labels.
  • After: A small classifier converts entropy summaries into correctness probabilities per answer. Averaging them yields slice accuracy, letting you monitor continuously and target data collection.

🄬 Why It Works (Intuition):

  • Correct chains often show sustained lower entropy once the solution path clicks, while incorrect ones waver or spike.
  • Summaries like max, mean, quantiles, and accumulation capture both steady confidence and shaky moments.
  • A lightweight classifier learns which mix of these features signals correctness for your model and transfers to new domains when trained on diverse difficulties.

🄬 Building Blocks (with the Sandwich pattern for each concept):

šŸž Hook: You know how when you type, your phone suggests the next word and shows a few likely options? 🄬 Next-Token Probabilities:

  • What it is: The chances the model assigns to each possible next token it might write.
  • How it works: At every step, the model computes a probability for each token; many APIs return only the top-k highest ones.
  • Why it matters: These probabilities are the raw material for measuring uncertainty without peeking inside the model. šŸž Anchor: If the top suggestion has 70% and others are tiny, the model is confident; if lots of tokens are around 5–10%, it’s uncertain.

šŸž Hook: Imagine listening to a singer: steady pitch means confidence; shaky pitch means doubt. 🄬 Output Entropy Profile:

  • What it is: A timeline showing how uncertain the model is at each generated token, summarized into simple stats.
  • How it works: Compute entropy at every step from top-k probabilities; then take features like max, mean, quantiles, std, skew, kurtosis, and the sum (SEA).
  • Why it matters: This captures the ā€œuncertainty fingerprintā€ of the whole answer, not just one moment. šŸž Anchor: A smooth, low-entropy trace often matches correct solutions; frequent spikes often match mistakes.

šŸž Hook: Think of a quick referee who can judge a play in seconds. 🄬 Lightweight Classifier:

  • What it is: A small model (like a random forest or logistic regression) that predicts if an answer is likely correct.
  • How it works: It learns from labeled examples to map the 11 entropy features to a correctness probability.
  • Why it matters: It’s fast, cheap, and avoids complex modeling. šŸž Anchor: Like a coach who glances at form (features) and estimates the chance of a successful shot.

šŸž Hook: Picture a weather forecast saying, ā€œ70% chance of rainā€ā€”not just ā€œfeels rainy.ā€ 🄬 Probabilistic Correctness Prediction:

  • What it is: Turning features into a number between 0 and 1 for how likely the answer is correct.
  • How it works: The classifier outputs a probability; optional calibration helps align it with real frequencies.
  • Why it matters: Probabilities are easy to average and act on. šŸž Anchor: If ten answers average 0.62, expect about 62% correct.

šŸž Hook: Think of grades per subject: math 85%, science 72%. 🄬 Domain-Level Accuracy Estimation:

  • What it is: Averaging per-answer probabilities over a slice (domain) to estimate accuracy for that slice.
  • How it works: Collect many answers from the same slice, average their correctness probabilities.
  • Why it matters: Gives you slice accuracy in plain accuracy units for monitoring. šŸž Anchor: ā€œOlympiad-style problems: 41%,ā€ so you know where to get more data.
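As a tiny illustration of this averaging (assuming pandas and made-up column names, not the authors’ code):

```python
import pandas as pd

# Hypothetical per-answer predictions, each tagged with its topic slice.
df = pd.DataFrame({
    "slice": ["algebra", "algebra", "algebra", "geometry", "geometry"],
    "p_correct": [0.88, 0.52, 0.31, 0.71, 0.57],
})

# Estimated accuracy per slice = mean predicted correctness probability.
slice_accuracy = df.groupby("slice")["p_correct"].mean()
print(slice_accuracy)  # algebra ~0.57, geometry ~0.64, read directly in accuracy units
```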

šŸž Hook: A dashboard that blinks when something drifts off. 🄬 Entropy-Based Monitoring:

  • What it is: Using the entropy-to-accuracy pipeline to watch performance over time and alert on drops.
  • How it works: Log top-k probabilities during normal inference, compute features, predict probabilities, average by slice, trend over time.
  • Why it matters: It’s continuous, cheap, and works with many APIs. šŸž Anchor: Your ops board updates hourly: ā€œAlgebra down 6 points; prioritize new labels there.ā€
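Below is a hedged sketch of the alerting half of such a monitor: compare today’s estimated slice accuracy against a trailing baseline and flag big drops. The threshold, function name, and data layout are assumptions, not the paper’s tooling.

```python
from typing import Dict

def flag_drifting_slices(
    today: Dict[str, float],      # slice -> estimated accuracy today
    baseline: Dict[str, float],   # slice -> trailing (e.g., 7-day) average
    drop_threshold: float = 0.05, # alert on a 5-point drop (assumed value)
) -> Dict[str, float]:
    """Return slices whose estimated accuracy fell by more than the threshold."""
    alerts = {}
    for name, acc_now in today.items():
        acc_before = baseline.get(name)
        if acc_before is not None and (acc_before - acc_now) > drop_threshold:
            alerts[name] = acc_before - acc_now
    return alerts

# Example: algebra dropped from 0.72 to 0.64 -> flagged for label collection.
print(flag_drifting_slices({"algebra": 0.64, "geometry": 0.78},
                           {"algebra": 0.72, "geometry": 0.77}))
```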

03 Methodology

At a high level: Prompt → Generate with top-k probabilities → Compute entropy per token → Summarize into 11 features → Predict per-answer correctness probability → Average by slice → Slice accuracy estimate.

Step-by-step (with Sandwich where needed):

  1. Input and Logging
  • What happens: For each question, the LLM generates an answer. During decoding, we log the top-k next-token probabilities at each step (k=20 to match common APIs).
  • Why this step exists: We need an API-friendly, cheap signal. Top-k logprobs are widely available without internal states.
  • Example: Suppose the model writes 30 tokens. At each step, you get the top 20 probabilities.
  2. Token-Level Entropy from Top-k
  • What happens: For each step, compute truncated Shannon entropy using only those top-20 probabilities.
  • Why this step exists: Entropy measures uncertainty; even truncated entropy tracks ā€œhow unsureā€ the model is as it writes.
  • Example: If probabilities are very peaked (like 0.9, then tiny others), entropy is low; if they’re spread out (like many around 0.05), entropy is higher.
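A minimal sketch of steps 1 and 2, assuming each decoding step has already been logged as a list of top-k probabilities (how you obtain them depends on your serving stack):

```python
import numpy as np

def truncated_entropy(top_k_probs):
    """Shannon entropy (in nats) over the logged top-k probabilities only.
    Ignoring the tail of the vocabulary slightly underestimates the true
    entropy when the distribution has a heavy tail."""
    p = np.asarray(top_k_probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_trace(steps):
    """One entropy value per generated token: the 'uncertainty heartbeat'."""
    return [truncated_entropy(step) for step in steps]

# Two toy decoding steps: one peaked (confident), one spread out (uncertain).
peaked = [0.90, 0.04, 0.03, 0.02, 0.01]
spread = [0.12, 0.11, 0.10, 0.10, 0.09]
print(entropy_trace([peaked, spread]))  # low entropy first, higher second
```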

šŸž Hook: Think of ā€œhow scattered are your guesses?ā€ 🄬 Top-k Approximation (new concept):

  • What it is: Calculating entropy from only the top-20 tokens, not the whole vocabulary.
  • How it works: Compute āˆ’Ī£ p log p over the top-20 probabilities and ignore the tail.
  • Why it matters: It’s fast and API-compatible, but slightly underestimates true entropy when the tail is big. šŸž Anchor: Like checking the top 20 most popular ice creams instead of the whole menu—you get a good sense quickly.
  3. Summarize the Entropy Trajectory into 11 Features
  • What happens: From the full sequence of entropies H(t), compute: max, mean, std, quantiles (Q10, Q25, Q50, Q75, Q90), skewness, kurtosis, and sum (SEA). This 11D vector is the ā€œentropy profile.ā€
  • Why this step exists: Different parts of the profile capture different error clues: peaks (max), overall level (mean), spread (std), tails (quantiles), shape (skew/kurt), and total uncertainty mass (SEA).
  • Example with numbers: Suppose H(t) over 6 tokens is [1.2, 0.9, 0.8, 1.5, 0.7, 0.6]. Then max=1.5, meanā‰ˆ0.95, stdā‰ˆ0.31, Q50ā‰ˆ0.85, SEAā‰ˆ5.7.
  4. Train a Lightweight Classifier
  • What happens: Using labeled data (was the final answer correct?), train a small model (random forest, logistic regression, or MLP) to output a probability of correctness from the 11 features. Optionally apply class balancing.
  • Why this step exists: A learned map turns raw features into probabilities that better transfer across domains than a single threshold on one metric.
  • Example: After training, input [max=1.5, mean=0.95, …, SEA=5.7] → predict 0.23 (likely incorrect).
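To ground steps 3 and 4, here is a minimal sketch assuming scikit-learn and SciPy; the toy training data is random and the code is an illustration, not the authors’ implementation.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier

def entropy_profile(H):
    """Summarize an entropy trace H(t) into the 11 features described above:
    max, mean, std, Q10/Q25/Q50/Q75/Q90, skewness, kurtosis, and sum (SEA)."""
    H = np.asarray(H, dtype=float)
    q10, q25, q50, q75, q90 = np.quantile(H, [0.10, 0.25, 0.50, 0.75, 0.90])
    return np.array([
        H.max(), H.mean(), H.std(),
        q10, q25, q50, q75, q90,
        skew(H), kurtosis(H), H.sum(),  # H.sum() is the SEA feature
    ])

# The worked example from the text: mean ~0.95, Q50 ~0.85, SEA ~5.7.
example = entropy_profile([1.2, 0.9, 0.8, 1.5, 0.7, 0.6])
print(example.round(2))

# Train a lightweight classifier on labeled traces (toy random data here).
rng = np.random.default_rng(0)
X = np.vstack([entropy_profile(rng.uniform(0.2, 2.0, size=40)) for _ in range(200)])
y = rng.integers(0, 2, size=200)  # 1 = the final answer was correct
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced").fit(X, y)
print(clf.predict_proba(example.reshape(1, -1))[:, 1])  # probability of correctness
```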

šŸž Hook: Imagine a bathroom scale that reads 2 pounds heavy and needs adjustment. 🄬 Calibration (new concept):

  • What it is: A post-processing step (like isotonic calibration) that makes predicted probabilities match real frequencies.
  • How it works: Fit a monotonic mapping on a validation set so that, e.g., all 0.7 predictions are correct about 70% of the time.
  • Why it matters: Since we average probabilities to estimate accuracy, calibration helps those averages be meaningful. šŸž Anchor: After calibration, ā€œ0.62ā€ really behaves like 62%.
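A small sketch of post-hoc isotonic calibration with scikit-learn; the validation scores and labels below are made up.

```python
from sklearn.isotonic import IsotonicRegression

# raw_val_probs: uncalibrated classifier outputs on a held-out validation set
# val_labels:    1 if the corresponding answer was actually correct, else 0
raw_val_probs = [0.15, 0.30, 0.55, 0.60, 0.80, 0.90]
val_labels    = [0,    0,    1,    0,    1,    1]

# Fit a monotonic map from raw scores to empirical correctness frequencies.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_val_probs, val_labels)

# At inference time, calibrate new predictions before averaging them per slice.
print(calibrator.predict([0.25, 0.62, 0.85]))
```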

šŸž Hook: Think of the world changing—last month’s easy homework might be harder this month. 🄬 Domain Shift (new concept):

  • What it is: When the kinds of questions you face at test time differ from those you trained on.
  • How it works: Train on a few benchmarks, then test on unseen benchmarks to stress generalization.
  • Why it matters: Real traffic drifts; we need the estimator to work on new slices. šŸž Anchor: Training on easy algebra and hard olympiad tasks helps you handle middling geometry later.
  5. From Per-Answer Probabilities to Slice Accuracy
  • What happens: For a domain slice D (e.g., ā€œOlympiad-style mathā€), average the predicted correctness probabilities across its answers to get an accuracy estimate.
  • Why this step exists: Monitoring needs accuracy in accuracy units, not just relative scores.
  • Example with numbers: 100 answers in the slice, average predicted probability = 0.41 → estimate 41% accuracy.
  6. Implementation Details that Keep It Practical
  • Benchmarks: 10 STEM datasets (GSM8K, SVAMP, GSM-Symbolic, MATH, TheoremQA, SciBench, MatSciBench, OlympiadBench, LiveMathBench, GPQA).
  • Models: 9 LLMs from 6 families, 3B–20B.
  • Generation: vLLM serving, temp=0.5, max length 2048, zero-shot chain-of-thought prompting (see the sketch after this list).
  • Labels: An external validator LLM checks the final answer against the gold; spot audits showed about 97% agreement with humans.
  7. The Secret Sauce
  • Use only what APIs commonly return (top-k logprobs); no hidden states needed.
  • Capture the whole answer’s uncertainty story (entropy profile), not just a single point.
  • Train on mixed-difficulty data; this strongly boosts transfer to new domains.
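Based on the generation details above, here is a minimal sketch of requesting top-20 logprobs from a vLLM-served model. The model name is a placeholder, and the exact layout of the returned logprob objects can vary across vLLM versions.

```python
from vllm import LLM, SamplingParams

# Placeholder model name; substitute whichever 3B-20B checkpoint you serve.
llm = LLM(model="your-org/your-model")

# Mirror the reported settings: temperature 0.5, max length 2048,
# and top-20 logprobs per generated token (k=20 matches common APIs).
params = SamplingParams(temperature=0.5, max_tokens=2048, logprobs=20)

outputs = llm.generate(
    ["Q: If 3x + 5 = 20, what is x? Think step by step."], params
)

# Each generated token carries a dict of top-k logprobs; convert these to
# probabilities and feed them into the entropy-trace pipeline above.
# (Attribute layout may differ slightly by vLLM version.)
completion = outputs[0].outputs[0]
print(len(completion.logprobs), "steps of top-20 logprobs logged")
```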

Practical Example End-to-End:

  • A slice has 50 geometry questions. For each answer, compute the 11 features from its entropy trace.
  • The classifier outputs probabilities like [0.88, 0.52, 0.31, …].
  • Average = 0.64 → Geometry slice accuracy ā‰ˆ 64%.
  • If yesterday was 72%, you flag drift and prioritize getting more geometry labels/data.

04 Experiments & Results

šŸž Hook: Think of testing a new thermometer: you compare its readings with real temperatures across different rooms to see if it’s trustworthy.

🄬 The Test: The authors asked, ā€œCan entropy profiles predict slice accuracy under domain shift?ā€ They trained on k benchmarks (k in {1,2,3,4}) and estimated accuracy on the remaining 10āˆ’k unseen benchmarks. They repeated this across 9 models (3B–20B) and multiple estimator variants (classifier type, with/without calibration, with/without class balancing), totaling over 41,000 configurations. They measured two things: (1) AEE (Accuracy Estimation Error)—how far the estimate is from the real accuracy, and (2) Spearman ρ—how well domains are ranked by difficulty.
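For reference, a small sketch of the two metrics (not the authors’ evaluation code): AEE as the absolute gap between estimated and true slice accuracy, and Spearman ρ over the domain ordering, using SciPy. The per-domain numbers are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-domain values on unseen benchmarks.
true_acc      = np.array([0.82, 0.65, 0.41, 0.30])  # measured with labels
estimated_acc = np.array([0.78, 0.69, 0.45, 0.28])  # from averaged probabilities

# Accuracy Estimation Error: how far each estimate is from the real accuracy.
aee = np.abs(estimated_acc - true_acc)
print("mean AEE:", aee.mean().round(3))

# Spearman rho: do the estimates order the domains correctly?
rho, _ = spearmanr(estimated_acc, true_acc)
print("Spearman rho:", round(rho, 2))
```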

šŸž Anchor: It’s like checking if your forecast says ā€œRoom A is cooler than Room Bā€ and also how close it is to the real temperatures.

Competition (Baselines): They compared Entropy Sentinel against classic uncertainty metrics built from logprobs (like entropy sum/SEA, entropy max, average NLL, perplexity, etc.). Those can rank domains but don’t produce calibrated accuracy directly.

Scoreboard with Context:

  • Cross-domain estimates often tracked real accuracies well. With a ā€œdifficulty-extremesā€ training set (GSM8K + OlympiadBench), many models hit very high rank agreement (ρ ≄ ~0.90) with low AEE (~0.03–0.12). That’s like getting an A for ranking and a small off-by-a-few-points error bar for accuracy.
  • PHI-3.5-MINI stood out: near-perfect ordering (ρ=1.00) and tiny error (AEEā‰ˆ0.03) in the extremes setup—like guessing almost every class’s score correctly.
  • Using ā€œintermediate-onlyā€ supervision (MATH + SciBench) hurt performance: AEE got bigger (~0.06–0.17) and ranking weakened. That’s like practicing only medium puzzles and then struggling with both very easy and very hard ones.
  • Compared to baselines, Entropy Sentinel was as good or slightly better for ranking in most cases, and uniquely returned accuracy estimates (not just scores). For example, SEA and entropy max were strong rankers but couldn’t say ā€œabout 63% correct.ā€

Surprising/Key Findings:

  • Training Composition Dominates: The most important factor wasn’t the classifier family but which benchmarks you train on. Mixed-difficulty training (easy + hard) consistently beat difficulty-homogeneous training. A ā€œU-shapedā€ trend appeared: training groups that are all-easy or all-hard did worse than groups in the middle (mixing difficulties).
  • More Supervision Helps: As k grows from 1 to 4, median AEE drops and results become more stable (smaller IQR). With up to 4 benchmarks, errors fell into a tighter band across all 9 models.
  • Model Dependence: Some models (like Qwen3-8B) showed weaker entropy–correctness coupling, giving lower ρ (~0.75) and higher AEE. Others (like PHI-3.5-MINI) were excellent. Even within a family, bigger wasn’t always better.
  • Reduced Feature Sets Often Sufficed: Sometimes just SEA (sum of entropy) or a tiny subset (like max + SEA) nearly matched or beat the full 11D profile, especially for ranking. This suggests redundancy among features and reinforces that data composition is king.
  • Leave-One-Out Near-Max Supervision: Training on 9/10 and testing on the held-out 10th showed strong rankings (ρ up to ~0.98) for several models, but some models still had systematic absolute error, indicating that calibration quality can be model-specific.

Bottom Line Numbers (Plain-speak):

  • Best-case models got ā€œA-levelā€ rankings (ρ near 0.95–1.00) and ā€œsmall off-by-a-littleā€ absolute errors (AEE ~0.03–0.10) using mixed-difficulty supervision.
  • Harder cases still ranked well (ρ around 0.9) but could be off by a larger margin in accuracy points (AEE ~0.12–0.17), so you should validate and, if needed, recalibrate for your specific model.

šŸž Anchor: It’s like a coach who can usually tell which teams are stronger and estimate their scores pretty well—but for a few teams, the coach needs more practice tape to calibrate better.

05 Discussion & Limitations

šŸž Hook: Even great thermometers have limits—you still keep an eye on how they were built and where you use them.

🄬 Limitations:

  • STEM-Only Evaluation: The study focuses on math and science with clear right/wrong answers. Open-ended tasks (like creative writing) don’t have a single gold answer, so the same trick won’t directly apply.
  • Top-k Approximation: Using only the top-20 tokens to compute entropy is fast and API-friendly but can miss tail probability mass, slightly warping uncertainty scales—especially on very uncertain steps.
  • Sensitivity to Decoding/Formatting: Temperature, max length, prompting style, and how verbose the model is can all shift entropy traces without reflecting true ability changes.
  • Model Dependence: Some models’ entropies align tightly with correctness; others, less so. Post-training (like RLHF) can change how confidence relates to being right, so validate on your target model.
  • Absolute vs. Relative: Ranking (ρ) can be strong while absolute accuracy estimates (AEE) remain off by a few points or show bias. Use the method to prioritize, and double-check with labels in high-stakes cases.

Resources Needed:

  • Access to top-k next-token probabilities (kā‰ˆ20) from your serving stack.
  • A small labeled set from a few representative benchmarks (ideally mixed difficulty) to train and calibrate the classifier.
  • Light compute: training a tiny classifier on 11D features is cheap.

When NOT to Use:

  • Domains without verifiable correctness or where you can’t define a consistent success label.
  • Settings where APIs don’t expose logprobs, or where decoding settings change wildly and frequently without recalibration.
  • High-stakes decisions requiring exact accuracy numbers without room for calibration error.

Open Questions:

  • Transfer to Open-Ended Tasks: How can we adapt to tasks with fuzzy correctness (e.g., summaries, dialogue)?
  • Better Tail Handling: Can we recover more tail mass (beyond top-20) without losing API compatibility?
  • Adaptive Decoding: Can we stabilize entropy traces across decoding changes automatically?
  • Robust Calibration: Can we build calibration that travels well across models and time, or self-corrects online?
  • Multi-Signal Fusion: What gains come from combining entropy profiles with other cheap signals (e.g., length, self-consistency from a few low-cost samples) while staying budget-friendly?

šŸž Anchor: Treat this like a reliable early-warning light: great for spotting where to look next, but you still pop the hood (label a sample) before big repairs.

06 Conclusion & Future Work

šŸž Hook: Think of turning a model’s ā€œconfidence heartbeatā€ into a simple, readable speedometer for accuracy.

🄬 Three-Sentence Summary:

  • The paper shows that the entropy pattern while an LLM writes can be summarized into 11 features and fed to a tiny classifier to predict per-answer correctness probabilities.
  • Averaging those probabilities over a slice yields slice-level accuracy in accuracy units, enabling continuous, API-friendly monitoring—even under domain shift.
  • Across 10 STEM benchmarks and 9 models, this approach often tracks true accuracy and ranks domains well, especially when trained on mixed-difficulty data; reliability varies by model and should be validated.

Main Achievement:

  • Turning cheap, widely available entropy traces into calibrated, slice-level accuracy estimates that support continuous monitoring and targeted data acquisition.

Future Directions:

  • Extend beyond STEM to open-ended tasks with softer correctness signals.
  • Improve calibration robustness across decoding settings and model families.
  • Explore fusing entropy with other low-cost signals for even better transfer.

Why Remember This:

  • It’s a practical, lightweight recipe you can run from logs today, turning uncertainty into action. You get a living map of where your model struggles, so you can collect the right data at the right time—and keep your model sharp as the world shifts.

šŸž Anchor: Like a smart dashboard for your AI tutor that says, ā€œFractions dropped this week—let’s practice those now,ā€ helping you fix the biggest problems first.

Practical Applications

  • Set up a live dashboard that shows estimated accuracy by topic slice (e.g., algebra, physics) using only logged top-k probabilities.
  • Trigger alerts when slice accuracy drops by a chosen threshold (e.g., 5 points week-over-week) to catch regressions early.
  • Prioritize data labeling and augmentation for the lowest-accuracy slices first to maximize improvement ROI.
  • Compare model candidates or new fine-tunes by their estimated slice accuracies before running costly human evaluations.
  • Continuously monitor customer segments (e.g., enterprise vs. education) to discover where the model underperforms in real traffic.
  • Run A/B tests on prompts or decoding settings and track how entropy-based accuracy estimates change per slice.
  • Use reduced feature sets (like SEA-only) for ultra-cheap monitoring when compute or storage is constrained.
  • Calibrate the classifier for each new model version to maintain accuracy units as training or RLHF changes confidence behavior.
  • Combine estimates with sampling-based audits (label a small batch) to re-check calibration on high-stakes slices.
  • Schedule targeted data acquisition sprints (e.g., more Olympiad-style problems) when those slices consistently estimate low accuracy.
#LLM monitoring#entropy profile#top-k probabilities#uncertainty quantification#calibration#domain shift#STEM reasoning#slice-level accuracy#random forest#Shannon entropy#SEA#NLL#perplexity#Spearman correlation#accuracy estimation error
Version: 1