Can LLMs Predict Their Own Failures? Self-Awareness via Internal Circuits
Key Summary
- Large language models often sound confident even when they are wrong, and existing ways to catch mistakes are slow or not very accurate.
- This paper introduces Gnosis, a tiny add-on (about 5 million parameters) that watches a model's own hidden activity to predict if its answer is correct.
- Gnosis reads two kinds of internal signals, hidden states and attention maps, then compresses them into small summaries and outputs a correctness score.
- It runs almost instantly, and its cost doesn't grow with the length of the answer, staying around 25 milliseconds even for very long outputs.
- Across math reasoning, open-domain trivia, and academic tests, Gnosis beats big external judges, including 8B-parameter reward models and even a proprietary judge.
- A single Gnosis head trained on a small 1.7B model can judge larger sibling models without retraining, saving time and money.
- Gnosis can judge partial answers, flagging likely failures early so the system can stop, retry, or escalate to a stronger model.
- Results show that reliable correctness clues are already present in the model's own activity and can be decoded efficiently.
- This makes LLMs more trustworthy and cheaper to deploy, especially for long, step-by-step reasoning tasks.
Why This Research Matters
If AI can sense when it's likely wrong, it can pause, ask for help, or switch to a safer strategy before giving a bad answer. That means fewer hallucinations in tools we rely on for studying, coding, or advice. Because Gnosis is tiny and fast, companies can improve reliability without paying the cost of big external judges or many repeated runs. Early warnings save compute and time, especially on long, step-by-step problems. As this approach spreads, everyday AI experiences can feel more trustworthy, with confidence scores that actually mean something. It also opens doors to smarter control: retry when uncertain, escalate tough cases, or stop early when failure is detected. Over time, this could make AI assistants safer and cheaper at scale.
Detailed Explanation
01 Background & Problem Definition
You know how when you take a test, sometimes you can tell mid-problem whether you're on the right track or not? Computers that write answers, like large language models (LLMs), often don't have that sense. They can write long, smart-sounding explanations and still be wrong, without realizing it.
Hook: Imagine building a LEGO castle while blindfolded. You can place bricks, but you don't notice when a wall is crooked until the end. The Concept (Hidden States): Hidden states are a model's private notes at each step while it thinks. How it works: (1) As the model reads and writes, it updates a hidden vector for each token; (2) These vectors carry what it thinks is important; (3) Across many steps, they trace its reasoning path. Why it matters: Without seeing hidden states, we only judge by the final words, missing where thinking went wrong. Anchor: A student's scratch work shows whether the steps make sense even if the final answer is wrong.
The world before this work looked like this:
- LLMs were great at sounding fluent but not at noticing their own mistakes. They could hallucinate facts or flip signs in math and keep going confidently.
- To check answers, people used three main tricks:
- External judges: Another big model reads the answer and scores it. Accurate but expensive and slow.
- Multi-sample consistency: Ask the same model the same question many times and see if it agrees with itself. More reliable but cost grows with the number of samples.
- Text-based self-critique or token probabilities: Use the words or their probabilities to guess confidence. Fast but often tracks "how nice it sounds" instead of "is it correct," and it gets brittle for long, multi-step problems.
Hook: You know how a detective doesn't just read the final confession; they also study the clues collected along the way? The Concept (Attention Mechanism): Attention is the model's way of deciding which earlier words matter for the current step. How it works: (1) For each new word, the model looks back at prior words; (2) It assigns weights (focus) to the most relevant tokens; (3) It uses those focused pieces to write the next token. Why it matters: If attention bounces around chaotically or ignores key steps, the reasoning may be shaky even if the final sentence sounds fluent. Anchor: When you ask, "What's the capital of France?", attention focuses on "capital" and "France," helping the model say "Paris."
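To make the weighting concrete, here is a minimal NumPy sketch of standard scaled dot-product attention; the function name and the toy token vectors are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight earlier tokens by relevance to the current one, then mix their values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # relevance of every prior token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row sums to 1
    return weights @ V, weights                          # mixed values + the attention map

# Toy usage: 4 tokens with 8-dimensional vectors (random, for illustration only).
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
mixed, attn_map = scaled_dot_product_attention(Q, K, V)
print(attn_map.round(2))   # each row shows where one token "looks"
```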
The problem researchers faced was: can we tell, cheaply and reliably, whether an answer will be right just by peeking at what the model is already doing internally while it writes? Not another model judging the output after the fact, not many retries; just read the "brainwaves" of the model itself.
Previous attempts at using internal signals often looked at simple, fragile clues: single-token probes (only the last token's hidden state), or hand-crafted stats that didn't scale across tasks. They missed the big picture: the full path of how the model's representations evolve over time and how its attention routes information.
Hook: Picture a car dashboard while driving: speed, fuel, and temperature are all live signals that tell you if the trip is going smoothly. The Concept (Internal Signals): Internal signals are the model's own hidden states and attention maps, the live telemetry of its thinking. How it works: (1) While the model generates text, it produces hidden vectors and attention weights; (2) These signals change in patterns when reasoning is stable vs. when it's drifting; (3) Reading them can reveal trouble early. Why it matters: If we only read the final text, we miss early warning signs that were visible all along. Anchor: A car might still be moving, but the overheating light warns you to pull over before the engine fails.
The gap this paper fills is a simple, tiny, and fast way to decode those internal signals into a correctness score, one that doesn't depend on the length of the answer and doesn't need a huge external judge. That matters in real life because:
- Safety: Catching hallucinations reduces harmful advice and misinformation.
- Cost: Avoiding multiple re-tries or big judges cuts latency and cloud bills.
- Control: If we can spot a failing path early, we can stop, recover, or escalate to a stronger model.
- Scale: The method should work across model sizes and tasks without constant retraining.
In short, the story is: People used to bolt on big judges or ask for lots of samples. This paper shows you can instead listen carefully to the model's own inner signals and tell, with surprising accuracy and speed, whether it's right, often before it even finishes its answer.
02 Core Idea
The key insight in one sentence: The internal "fingerprints" of correctness already exist inside an LLM's hidden states and attention patterns, and a tiny, length-invariant decoder can reliably read them to predict when the model is right or wrong.
Three analogies to feel it:
- Baking thermometer: You can tell if the cake is baked by measuring inside temperature (internal states), not by guessing from the frosting (final text). A small thermometer (Gnosis) is all you need.
- Coach's eye: A coach can tell from an athlete's posture and rhythm (trajectory over time) whether a jump will end well, even before landing.
- Car telemetry: Mechanics read engine logs (attention + hidden signals) to predict failures without driving the car for miles.
Before vs. after:
- Before: Confidence came from extra judges, repeated sampling, or surface text measures, which were accurate but slow or misleading for complex reasoning.
- After: Confidence comes from the model's own inner activity, which is fast, cheap, and better calibrated, even for long chains of thought.
Hook: You know how you can sometimes feel you're solving a math problem correctly because each step clicks into place? The Concept (Gnosis): Gnosis is a tiny add-on that watches an LLM's hidden states and attention maps during generation and outputs a probability that the answer is correct. How it works: (1) Read final-layer hidden states and attention maps as the model generates; (2) Compress them to fixed-size summaries so cost doesn't grow with length; (3) Feed them through two small encoders (one for hidden states, one for attention); (4) Fuse both signals with a gated head to predict correctness. Why it matters: Without Gnosis, you need slow external judges or shaky text-based guesses; with it, you get fast, intrinsic, and reliable self-checks. Anchor: Like a smartwatch tracking your heartbeat and steps to decide if your workout is on target right now, not just after it's over.
Why it works (the intuition):
- Correct and incorrect generations leave different "shapes" in hidden states over the sequence: stable progress vs. meandering detours.
- Attention routes information differently when reasoning is coherent vs. noisy, e.g., appropriate locality vs. scattered focus.
- By combining both streams, Gnosis captures complementary signals: hidden states carry broad semantic/reasoning cues; attention patterns reveal routing stability, which especially helps on long, multi-step problems.
- Length-invariant compression (adaptive pooling and fixed grids) preserves essential patterns while keeping compute tiny and constant.
Building blocks (small pieces, big picture):
- Hidden-state encoder: Treats the hidden vectors like a time-series, using lightweight multi-scale convolutions plus set-style attention pooling to summarize the whole trajectory.
- Attention-map encoder: Summarizes each head's map (like a small image), extracts a compact feature, then mixes across layers and heads with axial convolutions; finally, attention-based pooling creates a single descriptor.
- Gated fusion head: Learns when to trust hidden vs. attention cues more, outputting a single correctness probability.
Hook: Imagine you can tell you're off-track halfway through a maze, not just when you hit a dead end. The Concept (Trajectory-Level Self-Awareness): This is the ability to sense correctness from partial generations, not just final answers. How it works: (1) Gnosis can read any prefix of the internal signals; (2) Its fixed-size summaries still make sense midway; (3) It predicts correctness early, enabling early stopping or escalation. Why it matters: Without it, you waste compute on bad paths or only discover failure at the end. Anchor: A GPS reroutes you as soon as it sees you drifting off course, saving time and fuel.
Hook: Like watching two camera feeds to direct a live show. The Concept (Dual-Stream Introspection): Gnosis analyzes two parallel views, hidden states and attention maps, to form a stronger prediction. How it works: (1) Build a descriptor from hidden states; (2) Build another from attention maps; (3) Fuse them with a gated head; (4) Output a score. Why it matters: Either stream alone helps, but together they're more reliable across tasks. Anchor: A doctor looks at both your heartbeat and blood pressure to judge your health, not just one number.
03 Methodology
At a high level: Prompt + Partial/Full Generation → Read internal traces (hidden states + attention maps) → Compress to fixed-size summaries → Encode each stream (hidden vs. attention) → Fuse with a tiny gated head → Output a correctness probability.
Step-by-step recipe with the "why" and an example:
1. Inputs: hidden states and attention maps from a frozen LLM
- What happens: As the LLM answers, we capture the final-layer hidden state for every token (shape S×D) and all attention maps across layers/heads (shape L×H×S×S).
- Why it exists: These are the live "brain signals" of the model. Without them, we'd be guessing from text only.
- Example: For a 2,000-token math solution, we get 2,000 hidden vectors and many attention maps across layers/heads.
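As a rough illustration of how such traces can be captured with Hugging Face Transformers (the model name is a placeholder, and a single forward pass over an already-written answer is shown; the paper's exact extraction pipeline may differ):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"   # placeholder backbone; any causal LM exposing internals works
tok = AutoTokenizer.from_pretrained(model_name)
# "eager" attention so the attention weights are actually materialized and returned
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager").eval()

text = "Q: What is 17 * 24? A: 17 * 24 = 408."   # prompt plus a (possibly partial) answer
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True, output_attentions=True)

hidden = out.hidden_states[-1][0]           # final-layer hidden states, shape (S, D)
attn = torch.stack(out.attentions)[:, 0]    # all attention maps, shape (L, H, S, S)
print(hidden.shape, attn.shape)
```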
2. Length-invariant compression (fixed-budget summaries)
- What happens: We adaptively pool the hidden sequence down to K_hid positions (e.g., 192), and downsample each attention map to a fixed k×k grid (e.g., 256×256).
- Why it exists: To keep compute tiny and constant, no matter how long the answer is. Without this, judging long outputs would be slow and memory-heavy.
- Example: A 24k-token response compresses to the same small summary size as a 2k-token response.
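A minimal sketch of this fixed-budget step, assuming simple adaptive average pooling as the downsampler; the 192 and 256 budgets come from the text above, but the operator choice is an assumption:

```python
import torch.nn.functional as F

K_HID, K_GRID = 192, 256   # fixed budgets: 192 pooled positions, 256x256 attention grids

def compress_traces(hidden, attn):
    """hidden: (S, D) final-layer states; attn: (L, H, S, S) attention maps.
    Returns summaries whose size does not depend on the sequence length S."""
    h = F.adaptive_avg_pool1d(hidden.T.unsqueeze(0), K_HID).squeeze(0).T   # (K_HID, D)
    a = F.adaptive_avg_pool2d(attn, (K_GRID, K_GRID))                      # (L, H, 256, 256)
    return h, a
```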
Hook: Like skimming a long movie by selecting key frames. The Concept (Dual-Stream Introspection): Two parallel encoders distill different types of evidence: hidden (content/trajectory) and attention (routing/stability). How it works: (1) The hidden stream turns the time-series into a compact vector; (2) The attention stream summarizes each map, then mixes across heads/layers; (3) Both outputs are fused. Why it matters: One stream alone can miss important clues the other catches. Anchor: A sports analyst uses both player motion tracks (hidden) and pass maps (attention) to judge team performance.
3. Hidden-state circuit encoder
- What happens: Treat hidden states like a time-series: (a) local temporal mixing with multi-scale dilated 1D convolutions and channel gating (to denoise and catch multi-step rhythms); (b) global set-style attention (SAB) to let all positions interact; (c) Pooling by Multihead Attention (PMA) to learn a few "reliability prototypes" and produce a vector z_hid.
- Why it exists: Local mixing cleans and compresses patterns; global pooling captures whole-trajectory structure. Without both, signals are either too noisy or too averaged out.
- Example: A steady, staircase-like evolution of hidden states (coherent reasoning) vs. jittery, back-and-forth shifts (drifting) lead to different z_hid.
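A simplified PyTorch sketch of such an encoder, with the SAB stage omitted for brevity; the layer sizes, gating form, and number of "reliability prototype" seeds are illustrative assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class HiddenStateEncoder(nn.Module):
    """Sketch: multi-scale dilated convs over the trajectory, then PMA-style attention pooling."""
    def __init__(self, d_model, d_feat=128, n_seeds=4):
        super().__init__()
        self.proj = nn.Linear(d_model, d_feat)
        # local temporal mixing at several dilation rates (catches multi-step "rhythms")
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_feat, d_feat, kernel_size=3, padding=d, dilation=d) for d in (1, 2, 4)]
        )
        self.gate = nn.Linear(d_feat, d_feat)                     # channel gating
        self.seeds = nn.Parameter(torch.randn(n_seeds, d_feat))   # learned "reliability prototypes"
        self.pma = nn.MultiheadAttention(d_feat, num_heads=4, batch_first=True)
        self.out = nn.Linear(n_seeds * d_feat, d_feat)

    def forward(self, hidden):                 # hidden: (K_hid, d_model), already length-compressed
        x = self.proj(hidden)                  # (K, d_feat)
        c = x.T.unsqueeze(0)                   # (1, d_feat, K) for Conv1d
        mixed = sum(conv(c) for conv in self.convs).squeeze(0).T          # (K, d_feat)
        x = x + torch.sigmoid(self.gate(mixed)) * mixed                   # gated residual mixing
        pooled, _ = self.pma(self.seeds.unsqueeze(0), x.unsqueeze(0), x.unsqueeze(0))
        return self.out(pooled.flatten(1)).squeeze(0)                     # z_hid: (d_feat,)

# Toy usage with an assumed 2048-dimensional backbone hidden size:
z_hid = HiddenStateEncoder(d_model=2048)(torch.randn(192, 2048))
```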
4. Attention circuit encoder
- What happens: For each head's downsampled map, extract features in two ways: (i) a tiny CNN (learned visual patterns), and (ii) interpretable stats (entropy, locality to the diagonal, spectral texture, center/spread). Concatenate them into a per-head vector. Arrange all heads/layers into a grid tensor, mix across it with lightweight axial convolutions (row/column), then PMA to produce z_attn.
- Why it exists: Attention routing stability is informative, especially for long reasoning. Without cross-head/layer mixing, we'd ignore how patterns interact across the network depth.
- Example: For step-by-step math, healthy attention keeps a strong near-diagonal focus (locality); chaotic maps hint at confusion.
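The learned CNN and the cross-head axial mixing are harder to show compactly, but the interpretable statistics can be sketched directly; the exact definitions below (band width, spread measure) are assumptions, not the paper's formulas:

```python
import torch

def attention_head_stats(A, eps=1e-8):
    """Interpretable features from one downsampled attention map A of shape (k, k).
    Rows are treated as (approximately) normalized attention distributions."""
    A = A / (A.sum(dim=-1, keepdim=True) + eps)
    k = A.shape[-1]
    # Row entropy: how spread out the focus is (low = sharp, high = diffuse).
    entropy = -(A * (A + eps).log()).sum(dim=-1).mean()
    # Locality: how much mass sits near the diagonal (step-by-step reasoning tends to stay local).
    idx = torch.arange(k)
    dist = (idx.view(-1, 1) - idx.view(1, -1)).abs().float()
    locality = (A * (dist <= k // 16).float()).sum(dim=-1).mean()
    # Spread: average attention distance from the current position.
    spread = (A * dist).sum(dim=-1).mean()
    return torch.stack([entropy, locality, spread])

# Toy usage on a random, already-downsampled map:
print(attention_head_stats(torch.rand(64, 64)))   # [entropy, locality, spread]
```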
Hook: Like combining two reports into one decision. The Concept (Correctness Prediction): A small gated MLP fuses z_hid and z_attn and outputs a correctness probability. How it works: (1) Concatenate z_hid and z_attn; (2) A gated MLP learns when to trust each more; (3) Apply a sigmoid to get a score in [0,1]. Why it matters: Without adaptive fusion, the model can't prefer the best signal for each case. Anchor: A meteorologist merges temperature and pressure readings, then gives a single rain probability.
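A minimal sketch of this fusion step, assuming a per-dimension sigmoid gate over the concatenated descriptors (the paper's exact gating form may differ):

```python
import torch
import torch.nn as nn

class GatedFusionHead(nn.Module):
    """Sketch: fuse z_hid and z_attn with a learned gate, then output a correctness probability."""
    def __init__(self, d_feat=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_feat, d_feat), nn.Sigmoid())
        self.mlp = nn.Sequential(nn.Linear(d_feat, d_feat), nn.GELU(), nn.Linear(d_feat, 1))

    def forward(self, z_hid, z_attn):
        g = self.gate(torch.cat([z_hid, z_attn], dim=-1))   # per-dimension trust in the hidden stream
        fused = g * z_hid + (1 - g) * z_attn
        return torch.sigmoid(self.mlp(fused)).squeeze(-1)   # probability the answer is correct

# Toy usage with random stream descriptors:
prob = GatedFusionHead()(torch.randn(128), torch.randn(128))
```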
5. Training with auto-labels, backbone frozen
- What happens: Generate answers on training sets; match to ground truth to label correct (1) or wrong (0). Train only the tiny Gnosis encoders and head with binary cross-entropy. The LLM stays frozen.
- Why it exists: No costly human labels or full-model fine-tuning. Without freezing, you'd risk changing the model you're trying to judge.
- Example: For math and trivia data, the pipeline labels each attempt automatically, creating a balanced dataset cheaply.
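A sketch of this training loop, assuming `gnosis` is any module that maps a captured (hidden, attn) trace to a probability (for example, the encoders and fusion head sketched above wired together) and that the traces were captured offline from the frozen backbone:

```python
import torch
import torch.nn as nn

def train_gnosis(gnosis, examples, epochs=3, lr=1e-3):
    """examples: list of (hidden, attn, label) with label 1.0 if the auto-checked answer
    matched ground truth, else 0.0. Only the tiny Gnosis head is updated."""
    opt = torch.optim.AdamW(gnosis.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for hidden, attn, label in examples:
            score = gnosis(hidden, attn)                 # predicted correctness probability
            loss = bce(score, torch.tensor(label))       # binary cross-entropy vs. auto-label
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gnosis
```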
6. Early and partial judgments
- What happens: Because summaries are length-invariant, Gnosis can read prefixes and still produce meaningful scores mid-generation.
- Why it exists: Early stopping or escalation saves compute and time. Without it, you discover failure only at the end.
- Example: After 40% of a solution is written, Gnosis often reaches near-final accuracy on whether the answer will be correct.
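A hypothetical control loop built on this property; `step_fn`, the chunk size, and the abort threshold are illustrative assumptions, not settings from the paper:

```python
CHECK_EVERY, STOP_BELOW = 256, 0.2   # illustrative: check every 256 tokens, abort below 0.2

def generate_with_early_stop(step_fn, gnosis, max_tokens=4096):
    """step_fn(n) is assumed to return (text_so_far, hidden_prefix, attn_prefix) after
    generating n more tokens; gnosis maps a (possibly partial) trace to a score."""
    produced = 0
    while produced < max_tokens:
        text, hidden, attn = step_fn(CHECK_EVERY)
        produced += CHECK_EVERY
        score = float(gnosis(hidden, attn))      # length-invariant, so prefixes are fine
        if score < STOP_BELOW:
            return text, score, "abort"          # likely failure: stop, retry, or escalate
    return text, score, "complete"
```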
The secret sauce:
- Two complementary streams (content trajectory + routing stability) distilled into tiny, fixed-size descriptors.
- Learned pooling (PMA) that finds reliability prototypes instead of naively averaging everything.
- A gated fusion that adapts per example, shifting trust between hidden and attention cues.
- All of this with a ~5M-parameter head whose speed doesn't depend on output length, making it practically free to run.
Hook: You know how you can guess your quiz score pretty well if you remember how each question felt as you solved it? The Concept (Self-Awareness): Here, self-awareness means the model can estimate whether its own current answer is likely correct, by reflecting on its internal activity. How it works: (1) The model produces internal signals as it thinks; (2) Gnosis reads those signals; (3) The score reflects the model's own trajectory, not external opinions. Why it matters: Without this, we rely on slow, expensive judges or unreliable surface cues. Anchor: A musician knows mid-performance whether the piece is going well by the feel of their timing and fingerwork, not by the audience's applause later.
04 Experiments & Results
The test: Can Gnosis predict correctness better and faster than baselines across very different tasks? The authors evaluated on:
- Math-Reasoning: AMC12, AIME 2024/2025, HMMT Feb 2025 (long, multi-step problems).
- Open-Domain QA: TriviaQA (short factual answers; catching hallucinations).
- Academic Knowledge: MMLU-Pro (diverse topics; out-of-distribution challenge).
They measured both ranking and probability quality using the following metrics. Hook: Imagine a gradebook that checks both how well you sort right vs. wrong and how honest your confidence is. The Concept (Calibration Metrics): Calibration metrics check whether predicted probabilities match reality (e.g., a 70% score should be right about 70% of the time). How it works: (1) AUROC/AUPR measure how well correct answers get higher scores than incorrect ones; (2) Brier Skill Score (BSS) and Expected Calibration Error (ECE) check how truthful the probabilities are. Why it matters: A system that's "right for the wrong reasons" or overconfident is risky, even if it ranks well. Anchor: If you say you're 90% sure every time but are right only half the time, your calibration is bad, even if you sometimes rank answers well.
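These metrics can be reproduced with standard tooling; here is a small sketch using scikit-learn plus a simple equal-width-bin ECE (the paper's exact binning may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def calibration_report(labels, scores, n_bins=10):
    """labels: 1 if the answer was actually correct; scores: predicted correctness probabilities."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    # Brier Skill Score: 1 - Brier / Brier of a baseline that always predicts the base rate.
    brier = brier_score_loss(labels, scores)
    base = labels.mean()
    bss = 1.0 - brier / brier_score_loss(labels, np.full_like(scores, base))
    # Expected Calibration Error: |accuracy - confidence| averaged over equal-width bins.
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    ece = sum(abs(labels[bins == b].mean() - scores[bins == b].mean()) * (bins == b).mean()
              for b in range(n_bins) if (bins == b).any())
    return {"AUROC": auroc, "AUPR": aupr, "BSS": bss, "ECE": ece}

# Toy usage on made-up labels and scores:
print(calibration_report([1, 0, 1, 1, 0], [0.9, 0.2, 0.8, 0.6, 0.4]))
```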
Competitors:
- Training-free internal stats: logit entropy, mean token probability, attention eigenvalue score.
- Trajectory/spectral internal indicators (e.g., Chain-of-Embedding variants).
- Big external judges: two strong 8B reward models (SkyworkRM variants) and a proprietary judge via API.
- A learnable final-token probe (MLP-Prob) that only reads the last hidden state.
Scoreboard with context:
- Math-Reasoning: Gnosis reached ~0.95–0.96 AUROC, lifting far above training-free baselines (often mid-0.7s) and beating large judges (including an 8B reward model and the proprietary judge). That's like getting an A+ when others are getting B's.
- TriviaQA: Gnosis achieved ~0.86–0.89 AUROC across backbones, competitive with or better than large judges and clearly ahead of internal heuristics. That's like reliably spotting real facts vs. confident guesses.
- MMLU-Pro: Gnosis scored ~0.74–0.82 AUROC depending on backbone, stronger than or on par with big judges and notably better calibrated (lower ECE, higher BSS). That's like keeping cool and honest on a pop quiz in unfamiliar subjects.
- Latency: About 25 ms and nearly constant even for 12k–24k token answers, yielding 37× to 99× speedups over an 8B reward model as outputs get longer.
Surprising findings:
- Early prediction: After seeing only ~40% of a completion, Gnosis already matches or exceeds the full-solution performance of external judges and single-token probes. This enables substantial compute savings.
- Cross-scale transfer: A Gnosis head trained on a small 1.7B model can judge 4B or 8B sibling models and still beat big external judges. This hints that error "fingerprints" are structurally similar across sizes.
- Bimodal, decisive scores: Gnosis tends to give clear low or high probabilities with good calibration, while a big reward model often clusters around middling scores. Clearer separation is safer and more actionable.
- Hidden vs. attention: Hidden-state features were broadly strong across domains; attention features especially shined on long, reasoning-heavy tasks. Fusing both was consistently best.
Why these numbers matter:
- High AUROC/AUPR shows Gnosis is excellent at ranking correct above incorrect answers, even when incorrect ones are common (safety-critical).
- Positive BSS and low ECE mean the probabilities are trustworthy. A trustworthy confidence is essential for downstream decisions like early stopping or escalation.
- The tiny size and near-zero added latency make the method practical for real deployments, not just lab demos.
In short, across tasks, models, and lengths, Gnosis was both sharper (better accuracy) and steadier (better calibration) than much bigger or costlier alternatives, while being dramatically faster.
05 Discussion & Limitations
Limitations (be specific):
- Family-bound transfer: A Gnosis head trained on one model family (e.g., Qwen3 variants) transfers well to siblings but may not generalize to models with very different architectures or generation styles.
- Not a universal judge: It's a self-awareness mechanism, not a world-knowledge verifier; it reads internal patterns rather than checking facts against external databases.
- Access required: You need glass-box access to hidden states and attention maps; pure black-box APIs won't expose these signals.
- Final-layer focus: It reads final-layer states and downsampled attention; subtle cues from earlier layers could be missed (though ablations show strong signals remain).
- Domain shifts: While robust across tested domains, extreme distribution shifts could change error fingerprints and require light retraining.
Required resources:
- Minimal model size (~5M params) and tiny inference cost (≈25 ms) independent of sequence length.
- Training data built automatically by generating answers and labeling with ground truth; no human annotation is needed.
- Modest compute to train each head (e.g., up to 12 hours on 2×A100 for the largest backbone in the paper).
When NOT to use:
- Black-box models where internal traces are unavailable.
- Tasks where correctness demands external retrieval or databases and you specifically want fact-checking beyond internal confidence.
- Cross-family judging (e.g., very different tokenization or prompting styles) without validation, as transfer may degrade.
- Extremely terse, trivial tasks where simple heuristics already work and the overhead of hooking internals isnât justified.
Open questions:
- Interpretability: Which specific heads/layers carry the strongest correctness cues, and can we map them to human-understandable operations?
- Robustness: How stable are the learned fingerprints under adversarial prompting or stylistic perturbations?
- Steering vs. judging: Can the same signals guide the model mid-generation to correct itself, not just predict failure?
- Layer selection: Would combining select mid-layer signals with final-layer states further improve early detection?
- Beyond text: How does this approach extend to multimodal models where attention spans vision, audio, and text?
Bottom line: Gnosis convincingly shows that reliable correctness signals are already inside the model, but broader universality, interpretability, and robustness under strong shifts remain important frontiers.
06 Conclusion & Future Work
Three-sentence summary: This paper presents Gnosis, a tiny, length-invariant mechanism that reads an LLM's own hidden states and attention maps to predict whether its answer is correct. Across math reasoning, trivia, and academic tests, Gnosis outperforms much larger external judges in both accuracy and calibration while adding only ~5M parameters and ≈25 ms latency. It even generalizes to partial answers and larger sibling models, enabling early stopping and compute-aware control.
Main achievement: Proving that correctness signals are intrinsic to the generation process and can be decoded efficiently and reliably without external supervision or large auxiliary judges.
Future directions: Expand transfer across unrelated model families, deepen interpretability to identify the most informative circuits, integrate earlier-layer signals, and turn passive judgment into active guidance that nudges reasoning back on track in real time. Exploring multimodal extensions and coupling with retrieval could further improve factual grounding.
Why remember this: It changes the default from "hire a big external judge" to "listen to the model's own heartbeat," delivering faster, cheaper, and more trustworthy AI systems, especially for long, step-by-step reasoning where early warnings save both time and errors.
Practical Applications
- Early-stop long chain-of-thought generations when Gnosis flags a likely failure, saving compute and latency.
- Auto-escalate hard questions from a small model to a larger one only when Gnosis predicts low correctness (see the sketch after this list).
- Filter or down-rank hallucination-prone answers in search, chatbots, or tutoring systems before showing them to users.
- Gate deployment decisions (e.g., require human review) when the correctness score is below a threshold in high-stakes domains.
- Guide sampling budgets: allocate more attempts only when the initial trajectory looks unreliable, reducing average cost.
- Enable safer agents by checking step-wise confidence and pausing plans that look unstable mid-execution.
- Train-time curation: select high-confidence self-generated rationales for distillation or reinforcement learning.
- On-device use: add a small, fast verifier to edge models where large judges are infeasible due to compute limits.
- Mixed-expertise routing: use Gnosis scores to choose which specialized tool or model to call next.
- Monitoring dashboards: track correctness calibration over time to catch regressions after model or prompt updates.
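For the escalation use case above, a minimal routing policy might look like the following sketch; the callables, the threshold, and the reuse of one Gnosis head across sibling models are illustrative assumptions:

```python
ESCALATE_BELOW = 0.5   # illustrative threshold, not from the paper

def answer_with_escalation(question, small_model, large_model, gnosis):
    """small_model/large_model are assumed callables returning (answer, hidden, attn);
    gnosis maps a trace to a correctness probability."""
    answer, hidden, attn = small_model(question)
    score = float(gnosis(hidden, attn))
    if score >= ESCALATE_BELOW:
        return answer, {"model": "small", "confidence": score}
    answer, hidden, attn = large_model(question)      # pay for the big model only when needed
    return answer, {"model": "large", "confidence": float(gnosis(hidden, attn))}
```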