
Spilled Energy in Large Language Models

Intermediate
Adrian Robert Minut, Hazem Dewidar, Iacopo Masi · 2/21/2026
arXiv

Key Summary

  • The paper treats the last layer of a Large Language Model (the softmax over tokens) as an Energy-Based Model, which lets us measure a new signal called spilled energy.
  • Spilled energy is the gap between two energy numbers that should match from one step to the next while the model is writing; a big gap warns that the token might be wrong.
  • A second signal, marginal energy, is a single-step energy that also helps flag suspicious tokens.
  • Both signals are training-free: you don’t need to train extra probes or build new classifiers; you just read values the model already computes.
  • The method focuses on the exact answer tokens (like “Rome” in “The capital of Italy is Rome”), where truth signals are concentrated.
  • Across nine real-world benchmarks and several LLM families (LLaMA, Mistral, Gemma, Qwen), spilled energy gives competitive or better hallucination detection than logits and trained probes.
  • On synthetic math tasks with easy-to-hard wrong answers, spilled energy separates correct from incorrect answers even when the mistake is tiny.
  • Instruction tuning makes spilled energy even more helpful, while traditional confidence scores can become overconfident.
  • A simple threshold on spilled energy often splits correct vs. incorrect answers cleanly, enabling practical deployment.
  • This provides a principled, fast, and general way to catch LLM errors without fine-tuning or heavy tooling.

Why This Research Matters

This work gives a simple, fast way to catch likely LLM mistakes without training extra detectors. That means safer chatbots that can warn users before stating a wrong fact, and tutoring systems that can double-check their own math steps. It helps teams ship reliable features faster because it works across many tasks and model families out of the box. It can reduce bias-related harms by flagging suspicious outputs for review. In coding and reasoning assistants, it can filter weak steps and keep only the strong ones. Over time, it could even guide new training methods to build models that keep their internal “books” balanced and their confidence honest.

Detailed Explanation

01 Background & Problem Definition

🍞 Top Bread (Hook) You know how when you tell a story step by step, each part should fit with the last one—like puzzle pieces that click together? If a piece doesn’t fit, you can feel something is off.

🥬 Filling (The Actual Concept)

  • What it is: This paper introduces a new way to spot when language models “make things up” by checking whether their internal puzzle pieces (energies) fit together from one word to the next.
  • How it works: The authors look at the model’s final decision layer (softmax) through the lens of Energy-Based Models (EBMs) and measure a mismatch—called spilled energy—between two energy values that should match across steps.
  • Why it matters: If the pieces don’t fit, the model might be making an error, like stating a wrong fact or messing up a calculation.

🍞 Bottom Bread (Anchor) Imagine asking, “What is the capital of Italy?” If the model writes “Rome,” the puzzle pieces usually fit well. If it writes “Sydney,” the pieces don’t fit nicely, and spilled energy grows.

The World Before:

  • LLMs could produce amazing text but sometimes confidently say things that aren’t true (hallucinations). People tried to measure confidence using logits (how strong the model votes for a token) or probabilities like p(true), but these often didn’t reflect truthfulness reliably.
  • Another approach trained separate “probe classifiers” that read the model’s internal activations to guess if the output is right. These probes worked on the dataset they were trained on but often failed to generalize to different tasks.

🍞 Top Bread (Hook) You know how a thermometer tells you temperature directly, without teaching it a new trick every day? A good detector shouldn’t need retraining for every single test.

🥬 Filling (The Actual Concept)

  • What it is: A training-free detector is a tool that works out of the box—no extra training, no custom probes.
  • How it works: Read values the LLM already computes (its logits and softmax denominator), reinterpret them as energies, and compare them from one step to the next.
  • Why it matters: It generalizes to many tasks and models because it doesn’t depend on task-specific training.

🍞 Bottom Bread (Anchor) Instead of building a new detector for movie reviews, math, and trivia, you use the same simple measurement—the energy gap—everywhere.

The Problem:

  • LLMs sometimes produce factual errors, biased outputs, or broken reasoning. Traditional metrics can be misleading, and trained probes don’t travel well across tasks. We need a principled, general, and fast way to flag likely mistakes right when they happen.

Failed Attempts:

  • Logit confidence: can be overconfident, especially after instruction tuning.
  • Trained probe classifiers: often don’t transfer to new datasets; require compute, labels, and careful selection of layers/tokens.
  • Activation ablations or steering: can help reduce errors but need careful design and don’t measure truth directly.

The Gap:

  • What’s missing is a universal, training-free signal rooted in probability theory that reflects when the model’s own internal bookkeeping doesn’t add up from token to token.

🍞 Top Bread (Hook) Imagine balancing a checkbook: yesterday’s balance plus today’s deposits minus withdrawals should equal today’s balance. If it doesn’t, something’s off.

🥬 Filling (The Actual Concept)

  • What it is: Spilled energy is like a balance mismatch between two numbers that should match as the model writes the next token.
  • How it works: At token i, read the chosen token’s logit as one energy; at token i+1, read the softmax denominator as a marginal energy. In a perfect world, they’d match; in practice, the gap signals trouble.
  • Why it matters: This “mismatch meter” correlates with hallucinations across many tasks without any special training.

🍞 Bottom Bread (Anchor) When the model writes “120 eggs” after a math problem, the energies line up. If it writes “470 eggs,” the energy gap grows, flagging a likely error.

Real Stakes:

  • Safer assistants: catch wrong facts before users trust them.
  • Better tools: filter low-quality steps in chain-of-thought, math, or coding.
  • Fairness: flag biased or implausible outputs.
  • Efficiency: deployable across many models and tasks without extra training.

02 Core Idea

🍞 Top Bread (Hook) Imagine two friends passing a baton in a relay race. If the handoff is smooth, the team keeps its speed. If the pass is fumbled, you see the stumble right away.

🥬 Filling (The Actual Concept)

  • What it is: The aha! moment is that the last layer of an LLM can be viewed as an Energy-Based Model, and two energies that should match across steps often don’t; the size of this mismatch—spilled energy—predicts errors.
  • How it works: Treat the picked token’s logit as an energy at step i, and treat the softmax’s denominator (the log-sum over all tokens) as a marginal energy read at step i+1. Compare them. A big gap means a likely mistake.
  • Why it matters: This gap is training-free, portable across tasks and models, and focuses on the exact answer tokens where truth lives.

🍞 Bottom Bread (Anchor) Q: “What’s the capital of Italy?” If the model writes “Rome,” the baton pass (energies) align. If it writes “Sydney,” the pass stumbles—spilled energy spikes.
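The baton-pass comparison above can be sketched in a few lines of code. This is a minimal illustration, not the paper's exact formula: the sign convention (energies as negated log-quantities) and the use of an absolute gap are assumptions made for clarity, and `spilled_energy` is a name chosen here, not an official API.

```python
import math

def logsumexp(logits):
    """Numerically stable log-sum-exp: the log of the softmax denominator."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def spilled_energy(chosen_logit_step_i, logits_step_i_plus_1):
    """Gap between the chosen token's energy at step i and the marginal
    energy (log of the softmax denominator) read at step i+1.
    In a perfectly consistent model these would line up; a larger gap
    flags a possibly wrong token."""
    energy_chosen = -chosen_logit_step_i            # energy of the picked token
    energy_marginal = -logsumexp(logits_step_i_plus_1)  # marginal energy at i+1
    return abs(energy_chosen - energy_marginal)
```

With matching values the gap is zero (a smooth baton pass); when the chosen token's logit is far from the next step's log-denominator, the gap grows.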

Multiple Analogies:

  1. Cash register: Yesterday’s total plus today’s sales minus returns should match today’s total. A mismatch flags an accounting error (spilled energy).
  2. Jigsaw puzzle: The new piece should fit the growing picture. If it doesn’t click (energy mismatch), the piece may be wrong.
  3. Heartbeat monitor: A steady rhythm means healthy. A sudden jump (energy spike) can signal an issue with the just-written token.

Before vs After:

  • Before: We relied on raw confidence (logits) or trained probes tied to specific datasets. Results varied a lot and didn’t travel well between tasks.
  • After: We have a math-grounded, training-free signal (spilled energy) that generalizes across QA, reasoning, sentiment, bias tests, and more.

🍞 Top Bread (Hook) You know how a recipe works better when each step prepares for the next? The sauce from step one should pair smoothly with the pasta in step two.

🥬 Filling (The Actual Concept)

  • What it is: Autoregressive decoding means the model writes one token at a time, each step leaning on the past. The chain rule says certain quantities should match between steps.
  • How it works: Reinterpret the softmax classifier as two energies: one for the chosen token and one that compresses all choices. Across steps, those should agree—if they don’t, that’s spilled energy.
  • Why it matters: It turns the model’s own math into a detector for wrong turns, without any external training.

🍞 Bottom Bread (Anchor) As the model solves “12 chickens, 2 eggs per day, 5 days,” the step-by-step energies stay in sync if it writes “120.” They fall out of sync if it writes “470.”

Why It Works (Intuition):

  • The chain rule of probability ties together the model’s per-token choices into a consistent story. If the next step’s “total mass” (marginal energy) doesn’t align with the last step’s chosen token energy, it hints the model’s internal story got shaky.
  • Errors, biases, or brittle reasoning make the model’s distribution over next tokens more spread out or oddly peaked, which shows up as a larger mismatch.

Building Blocks (with Sandwich mini-explanations):

  • 🍞 You know how you stand on tiptoe to reach a shelf? That last stretch is the softmax deciding which word wins. 🥬 Softmax as a classifier: It turns raw scores (logits) into probabilities over all tokens. If softmax sees one score much higher, it picks that token confidently. Without it, the model can’t choose words. 🍞 Example: Between “Rome” and “Sydney,” softmax turns higher “Rome” score into a higher probability.
  • 🍞 Picture a scoreboard where bigger numbers mean more likely to win. 🥬 Logit: The raw score before softmax that says how strongly the model prefers a token. Without logits, there’s no ranking to pick from. 🍞 Example: A higher logit for “Rome” than “Sydney” leads the model to choose “Rome.”
  • 🍞 Think of a crowd vote vs. one judge. 🥬 Marginal energy: A single-step measurement that summarizes how spread-out the model’s probabilities are over all tokens. Without it, you can’t tell if the model is confused or confident at that step. 🍞 Example: If many words seem plausible, marginal energy reflects that broader spread.
  • 🍞 Imagine comparing the recipe’s plan with the actual step you just cooked. 🥬 Spilled energy: The difference between the chosen token’s energy at step i and the all-tokens energy at step i+1. Without comparing them, you miss the telltale sign of a stumble. 🍞 Example: A large gap when the model writes “Sydney” flags a likely hallucination.
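The building blocks above (softmax, logits, marginal energy) can be made concrete with a small sketch. These are textbook formulas, not the paper's code; `marginal_energy` here is one natural reading of "negative log of the softmax denominator," labeled as an assumption.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities over all tokens."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_energy(logits):
    """One interpretation (assumed here): the negated log of the softmax
    denominator, which compresses the whole step into a single number."""
    m = max(logits)
    return -(m + math.log(sum(math.exp(x - m) for x in logits)))
```

A much higher logit for "Rome" than "Sydney" turns into a much higher probability for "Rome," while the marginal energy summarizes how the whole distribution is spread.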

03 Methodology

At a high level: Prompt → Model generates an answer → Find the exact answer tokens → Read two energies from the model → Compute spilled energy and/or marginal energy → Pool across the answer span → Threshold to flag likely errors.

Step-by-step (like a recipe):

  1. Input and Generation
  • What happens: Give the model a prompt (e.g., “What is the capital of Italy?”). Let it generate an answer.
  • Why it exists: We need a real output to analyze; hallucinations show up in the produced tokens.
  • Example data: Prompt: “What is the capital of Italy?” Output: “The capital of Italy is Rome.”
  2. Localize Exact Answer Tokens
  • 🍞 Hook: You know how a quiz grades just the final answer, not your whole paragraph?
  • 🥬 Concept: Exact answer tokens are the minimal span that carries the answer (e.g., the token(s) for “Rome”). We find them with light heuristics or a helper instruction-tuned model.
  • Why it exists: Truth signals concentrate on the exact answer span. Without focusing here, punctuation or filler words can create false alarms.
  • 🍞 Anchor: From “The capital of Italy is Rome,” extract just “Rome” (maybe split into multiple tokens like “Ro”, “me”).
  3. Read Energies from the LLM
  • 🍞 Hook: Imagine peeking at the oven thermometer to see if your cake is baking right.
  • 🥬 Concept: Two values are read directly from the model’s last layer:
    • Chosen-token energy (from the picked token’s logit at step i).
    • Marginal energy (a single-step summary of the whole distribution, read at step i or i+1 depending on the comparison you need).
  • Why it exists: These are the ingredients of spilled energy. Without them, you can’t measure the mismatch.
  • 🍞 Anchor: For the token “Rome,” grab its logit (chosen-token energy). Also read the softmax’s denominator summary (marginal energy).
  4. Compute Spilled Energy and Marginal Energy
  • 🍞 Hook: Think of balancing your notebook totals; you compare yesterday’s and today’s numbers for consistency.
  • 🥬 Concept: Spilled energy = chosen-token energy (step i) compared to marginal energy (step i+1). Marginal energy alone (single-step) can also flag uncertainty.
  • Why it exists: If these don’t align, it signals the model’s internal story is wobbly at the answer token.
  • 🍞 Anchor: If the gap is small for “Rome,” that’s good. If big for “Sydney,” that’s suspicious.
  5. Pool Across Multi-Token Answers
  • 🍞 Hook: Like checking the weakest link in a chain to judge overall strength.
  • 🥬 Concept: If the answer has multiple tokens (e.g., “San Francisco”), compute energy for each token and pool them. Min-pooling (take the smallest/most suspicious value) works best on average in experiments.
  • Why it exists: Some answer pieces are more informative than others. Without pooling, you might miss the single token that reveals the problem.
  • 🍞 Anchor: For “San Fran cis co,” compute per-token spilled energy and take the minimum as the final score.
  6. Threshold and Decision
  • What happens: Use a simple threshold on the pooled score to decide if the answer is likely correct or a hallucination.
  • Why it exists: Practical systems need a yes/no or risk score; a threshold turns the continuous signal into an action.
  • Example data: If spilled energy > T, flag the answer; else accept it (or pass it along with a green light).
  7. Optional: Combine Signals
  • What happens: You can combine spilled energy with the magnitude of marginal energy (scaled spilled energy) for added sensitivity in some settings.
  • Why it exists: Sometimes the scale of uncertainty matters; combining both can help.
  • Example data: Final score = |marginal energy| × spilled energy.
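The recipe above, from per-token energies through pooling and thresholding, can be sketched end to end. This is an illustrative pipeline under assumptions, not the paper's reference implementation: `answer_steps` is a hypothetical input format (one `(chosen_logit, next_step_logits)` pair per answer token, produced by steps 1–3), and the sign/gap conventions are chosen for readability.

```python
import math

def logsumexp(logits):
    """Log of the softmax denominator, computed stably."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def answer_risk_score(answer_steps, pool="min"):
    """Pool per-token spilled energy over the exact-answer span.

    answer_steps: list of (chosen_logit_at_step_i, logits_at_step_i_plus_1)
    pairs, one per answer token (e.g., "San", "Fran", "cis", "co").
    """
    gaps = []
    for chosen_logit, next_logits in answer_steps:
        e_chosen = -chosen_logit              # chosen-token energy at step i
        e_marginal = -logsumexp(next_logits)  # marginal energy at step i+1
        gaps.append(abs(e_chosen - e_marginal))
    # The paper reports min-pooling working best on average.
    return min(gaps) if pool == "min" else max(gaps)

def flag_hallucination(answer_steps, threshold):
    """Step 6: a simple threshold turns the score into an action.
    True means 'flag as likely wrong'."""
    return answer_risk_score(answer_steps) > threshold
```

The optional step 7 (scaled spilled energy) would just multiply each gap by `abs(e_marginal)` before pooling.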

Secret Sauce (What makes it clever):

  • It’s training-free and principled: You don’t train new probes, and you don’t poke the model’s internals. You just read the values it already computes and apply probability logic.
  • It localizes the answer: By zooming in on exact answer tokens, it avoids false alarms from punctuation or filler words.
  • It generalizes: It works across very different datasets (math, trivia, reading comprehension, bias tests) and model families (LLaMA, Mistral, Gemma, Qwen).

Concrete Examples:

  • Factual QA: “Capital of Italy → Rome.” Small spilled energy → accept. “Sydney.” Large spilled energy → flag.
  • Math: “12 chickens × 2 eggs × 5 days = 120.” Small spill → accept. “470.” Big spill → flag.
  • Sentiment (IMDB): If the answer token that signals ‘positive’ or ‘negative’ sentiment has a large spill, warn that the classification might be off.

What breaks without each step:

  • Skip exact answer localization → higher false positives from commas or opening words.
  • Skip pooling → multi-token answers hide the suspicious token.
  • Skip marginal energy → you lose how spread-out/uncertain the step is.
  • Skip spilled energy → you miss the step-to-step consistency check that catches subtle mistakes.

04 Experiments & Results

The Test (What they measured and why):

  • They measured how well spilled energy and marginal energy can separate correct from incorrect answers (hallucination detection), summarized by AuROC.
  • They tested two worlds: (1) synthetic math with known right/wrong answers and controlled difficulty (tiny to large numeric offsets), and (2) nine real benchmarks spanning QA, reasoning, bias, and sentiment.
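AuROC, the headline metric here, has a simple probabilistic reading worth making concrete: it is the chance that a randomly chosen incorrect answer gets a higher risk score than a randomly chosen correct one. The brute-force sketch below (a standard equivalence, not the paper's evaluation code) makes that literal.

```python
def auroc(scores_incorrect, scores_correct):
    """AuROC as a pairwise win rate: probability that an incorrect answer
    receives a higher risk score than a correct one (ties count half).
    0.5 = no better than chance; 1.0 = perfect separation."""
    wins = 0.0
    for s_bad in scores_incorrect:
        for s_good in scores_correct:
            if s_bad > s_good:
                wins += 1.0
            elif s_bad == s_good:
                wins += 0.5
    return wins / (len(scores_incorrect) * len(scores_correct))
```

So an average AuROC "in the low-to-mid 70s" means that roughly 7 times out of 10, a hallucinated answer scores as riskier than a correct one.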

The Competition (Baselines):

  • Logit confidence (classic next-token confidence).
  • Probing classifiers (Orgad et al.), trained on internal activations but known to generalize poorly across datasets.
  • p(true)-style measures.

Scoreboard with Context:

  • Synthetic math (Qwen, LLaMA, Mistral): Spilled energy shows clear separation between correct and incorrect even when the wrong answer differs by only 1–10 (the hard case). Think of it like getting an A when others barely pass on the toughest questions.
  • Real benchmarks (HotpotQA, TriviaQA, Math, Winogrande, Winobias, Movies, MNLI, IMDB, HotpotQA-WC):
    • Spilled energy consistently beats logits and substantially beats trained probes when evaluating across datasets (transfer), with average AuROC around the low-to-mid 70s on several model/dataset combos.
    • A simple threshold on spilled energy often splits correct vs. incorrect distributions cleanly.
    • Min-pooling over the answer span works best overall.

Surprising/Notable Findings:

  • Instruction-tuned models: Traditional confidence scores (logits) often become overconfident after instruction tuning, but spilled energy gets stronger. For example, on LLaMA-3 and Mistral families, spilled energy’s average detection improves after instruction tuning.
  • Exact answer localization: Just like prior work, focusing on exact answer tokens is crucial. For spilled and marginal energy, using the exact span boosted average AuROC by roughly a quarter.
  • Cross-dataset transfer: Probing classifiers shine in-distribution but drop near chance off-distribution. Spilled energy, with no training at all, often outperforms these probes in the transfer setting.

Concrete Comparison Highlights:

  • In many settings, spilled energy outperforms Orgad et al.’s probes by notable margins even when the probes are evaluated in their own trained setting; in cross-dataset evaluations, spilled energy’s advantage is clearer.
  • On Gemma (1B and 4B), the method maintains strong performance without any special tuning, showing model-family generality.

Takeaway:

  • Spilled energy is an easy-to-read, general-purpose signal that detects errors across tasks and model families. It’s especially valuable where training new detectors is impractical or where tasks change frequently.

05 Discussion & Limitations

Limitations (be specific):

  • False positives on low-information tokens (punctuation, first words) if you don’t localize the exact answer span.
  • Sensitivity varies by domain: tasks with very flat or peaky token distributions can shift the absolute scale of energies, affecting thresholds.
  • It flags likely errors but doesn’t explain them (e.g., whether the issue is bias, missing knowledge, or reasoning failure), so it’s a detector, not a doctor.

Required Resources:

  • White-box access to the model’s output logits (standard for many LLM deployments).
  • Light compute to read energies and pool across a small span—no training or fine-tuning needed.
  • Optional helper LLM or heuristics to extract exact answer tokens.

When NOT to Use:

  • If you cannot access logits/softmax internals (strict black-box APIs only).
  • If outputs have no identifiable answer span (e.g., unconstrained creative writing without a checkable nugget), unless you devise a good proxy span.
  • If your application needs root-cause explanations rather than a risk flag.

Open Questions:

  • Can we turn spilled energy into a “fixer” by adjusting decoding when the gap spikes (e.g., dynamic temperature, re-ranking, or lookahead)?
  • How stable are thresholds across domains and model sizes, and can we auto-calibrate them?
  • Can we combine spilled energy with other uncertainty signals (e.g., semantic entropy, self-consistency) for even stronger detection with minimal overhead?
  • What architectural or training changes would explicitly reduce spilled energy, improving calibration by design?

06 Conclusion & Future Work

3-Sentence Summary:

  • The paper reinterprets an LLM’s softmax layer as an Energy-Based Model and defines spilled energy: a gap between two energies that should match across steps.
  • This training-free signal, read directly from logits, strongly correlates with errors and generalizes across many tasks and model families.
  • Across synthetic math and nine real benchmarks, spilled energy reliably flags hallucinations and often outperforms both logits and trained probes, especially in cross-dataset transfer.

Main Achievement:

  • A simple, principled, and training-free hallucination detector—spilled energy—that leverages the model’s own probability bookkeeping to spot likely mistakes at the exact answer tokens.

Future Directions:

  • Use spilled energy during decoding to steer away from risky tokens in real time.
  • Combine with semantic entropy or self-consistency checks for even stronger, low-overhead detection.
  • Explore training objectives that directly reduce spilled energy, improving calibration by construction.

Why Remember This:

  • It turns a universal, theory-backed property (step-to-step consistency) into a practical, portable tool. Instead of building new detectors for every task, you can read the model’s own signals to catch errors, making LLMs safer and more trustworthy in everyday use.

Practical Applications

  • Add a risk score to answers: compute spilled energy on exact answer tokens and warn users when the score is high.
  • Filter chain-of-thought: drop intermediate steps with large spilled energy and regenerate them for better reasoning quality.
  • Guardrails for factual QA: if spilled energy is high, trigger retrieval or ask the user to clarify before finalizing an answer.
  • Math and code checking: flag lines with high spilled energy and request a re-check or run unit tests.
  • Human-in-the-loop review: route high-spill outputs to moderators or domain experts for approval.
  • Adaptive decoding: when spill spikes, lower temperature or switch to a safer decoding policy.
  • Monitoring dashboards: track average spilled energy over time to detect model drift or domain shifts.
  • Bias auditing: inspect high-spill tokens in sensitive contexts (e.g., gendered pronouns) for fairness checks.
  • API integration: expose spilled energy alongside token probabilities so downstream apps can make informed decisions.
Tags: spilled energy · energy-based models · marginal energy · logits · softmax · hallucination detection · autoregressive decoding · calibration · uncertainty estimation · cross-dataset generalization · instruction tuning · exact answer tokens · AuROC · LLM reliability · error detection