EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs
Key Summary
- This paper teaches AI models not just how to solve problems but also how to tell when their own answers might be wrong.
- The method, called EPICAR, trains models to both reason and self-evaluate by learning from correct and incorrect attempts.
- Traditional self-training (like STaR) boosts accuracy but often makes models overconfident; EPICAR fixes that calibration problem.
- EPICAR uses a simple trick: after each answer, the model also says how sure it is (yes/no confidence) and learns from that.
- Across Llama-3 and Qwen-3 models, EPICAR improved both accuracy and trustworthiness (better AUROC, lower ECE and Brier Score).
- It works beyond math; it also helps with code generation, showing the idea generalizes to other reasoning tasks.
- With better built-in confidence, EPICAR lets models reach the performance of 30 samples using only 10, cutting inference compute by about 3×.
- A special decoding helper (AID) keeps formats clean so the model isn't punished for tiny formatting mistakes.
- EPICAR shines most in models with enough reasoning capacity (about 3B parameters or more).
- The big takeaway: teaching models to know what they don't know makes them both smarter and safer.
Why This Research Matters
AI that can say, "I might be wrong," is far safer and more useful in everyday life. EPICAR shows how to teach that skill directly during training, not just patch it later. With more honest confidence, students, coders, and scientists can trust AI assistance more and know when to double-check. Better calibration also reduces wasted compute at inference time, since fewer samples are needed to get reliable results. This makes powerful reasoning more affordable and greener. By learning from both correct and incorrect attempts, models become not only smarter but also better at self-awareness. That's a big step toward trustworthy AI partners.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how during a math test you don't just write answers; you also have a feeling about which ones you're sure about and which ones you're not? That feeling helps you double-check the shaky ones.
The Concept: Large Language Models (LLMs)
- What it is: LLMs are computer programs that read, write, and solve problems using patterns learned from lots of text.
- How it works: 1) They read your question. 2) They predict the next word over and over to form an answer. 3) They can also show steps if we ask.
- Why it matters: Without a sense of when they're likely wrong, they can sound confident but be incorrect. Anchor: When you ask an AI to solve "27 × 14," it can write steps and an answer, but without calibration, it might say a wrong number very confidently.
Hook: Imagine you're solving a puzzle by breaking it into smaller clues first.
The Concept: Chain-of-Thought (CoT) prompting
- What it is: A way to ask AI to show its steps, not just the final answer.
- How it works: 1) The prompt says "think step by step." 2) The model writes reasoning. 3) It outputs a final answer at the end.
- Why it matters: Without steps, errors hide; with steps, we can check where things went wrong. Anchor: To find "How many apples after 3 days if 2 are eaten daily from 15?", the model writes: Day 1: 13, Day 2: 11, Day 3: 9 → Answer: 9.
Hook: Picture a teacher who only praises your correct answers and ignores your mistakes.
The Concept: Self-training like STaR
- What it is: A method where the model practices on its own answers, keeping only the ones that match the ground truth.
- How it works: 1) The model tries problems. 2) Keep only successful paths. 3) Fine-tune on those paths. 4) Repeat.
- Why it matters: Without learning from mistakes, the model becomes narrow and overconfident, missing where it tends to fail. Anchor: If the AI tries five solutions and only the two correct ones are kept, it never learns the "signs" of its three wrong tries.
Hook: Think about your inner voice that says, "I'm 60% sure," or "I'm 95% sure."
The Concept: Calibration
- What it is: Matching how sure the model sounds to how often it's actually right.
- How it works: 1) The model gives an answer. 2) It also gives a confidence score. 3) We measure how close those confidences are to reality.
- Why it matters: Without calibration, the model can be very sure but wrong, which is risky. Anchor: If the model says "90% sure" 100 times, it should be right about 90 of those times.
Hook: Imagine tying super-tight shoelaces that make you run fast but cut off blood flow.
The Concept: Calibration cost
- What it is: The hidden penalty where making models more accurate with self-training often makes them overly confident.
- How it works: 1) Training focuses on correct answers only. 2) The model learns to compress uncertainty. 3) It reports high confidence too often.
- Why it matters: Without addressing this cost, models become less trustworthy even as they get more accurate. Anchor: A model that improved its math accuracy might start answering tricky questions with 99% confidence even when it's wrong.
Hook: Imagine only practicing the moves you already got right and forgetting that tricky corner you always miss.
The Concept: Model collapse (in this context)
- What it is: When training on only "winning" examples pushes the model to ignore uncertainty and alternative possibilities.
- How it works: 1) Positive-only feedback. 2) Predictions become narrow. 3) The model loses signals about being unsure.
- Why it matters: Without those signals, the model can't tell when it's likely making a mistake. Anchor: If an AI always studies perfect solutions, it may stop recognizing the telltale signs of a wrong turn in its own reasoning.
The world before: LLMs were getting better at showing steps (CoT) and self-improving (STaR/ReST). But as they boosted accuracy, they often grew more sure of themselves, even when wrong. People tried temperature scaling (re-tuning the "spice" of logits), prompting the model to verbalize confidence, or using multiple samples to vote (self-consistency). These helped sometimes, but either cost lots of compute or didn't fix the root problem: the base model's sense of certainty was off.
The problem: The model didn't learn when to trust its own reasoning. It learned how to get to correct answers, but not how to recognize likely mistakes.
Failed attempts:
- Positive-only training (STaR) raised accuracy but worsened overconfidence.
- External verifiers or separate calibrators helped but added complexity and compute.
- Prompting for confidence was fragile: change the wording, get different certainty.
The gap: We needed training that teaches two skills at once: solve the problem and judge the solution's trustworthiness.
Real stakes: In daily life, a model that "knows what it doesn't know" is safer for homework help, coding hints, study guides, and more, and it saves time and cost because it needs fewer tries to reach a reliable answer.
02 Core Idea
Hook: Imagine you're doing homework with a smart buddy who not only shows their steps but also says, "I'm sure," or "I'm not sure; let's check."
The Concept: EPICAR (Epistemically-Calibrated Reasoning)
- What it is: A way to train models to reason and to judge their own answers' reliability at the same time.
- How it works: 1) The model generates solutions and marks them as correct/incorrect. 2) It also answers a yes/no question, "Is this answer correct?", to produce a confidence. 3) Training mixes two tasks: reinforce correct reasoning and learn to say "yes" for correct cases and "no" for incorrect ones. 4) Repeat over iterations.
- Why it matters: Without EPICAR, models become overconfident. With EPICAR, they learn to solve well and to know when to slow down or double-check. Anchor: For a math problem, the model shows steps, gives an answer, then answers "Is this correct? yes/no." It trains on both.
The "Aha!" in one sentence: Don't just teach the model to get answers; teach it to grade its own answers, too, and learn from both the wins and the misses.
Three analogies:
- Basketball coach: You practice shots (reasoning) and also learn to feel which shots are likely to miss (self-evaluation).
- Chef tasting: You cook (reasoning) and taste-test (confidence), learning when a dish needs more salt.
- Spell-checker in your brain: You write (reasoning) and your inner voice flags likely typos (self-evaluation) before submitting.
Before vs. After:
- Before: Models chased correct paths and ignored how to recognize likely mistakes; accuracy up, trust down.
- After: Models practice both success and failure signals; accuracy up and trust up, reaching Pareto-superior trade-offs (better in both at once).
Hook: Think of a student who learns from wrong answers as carefully as from right ones.
The Concept: Dual-objective training (reason + self-evaluation)
- What it is: Training on two tasks in one go: solve the problem and judge its own correctness.
- How it works: 1) Correct solutions go into the reasoning dataset. 2) All attempts (right or wrong) become yes/no self-evaluation data. 3) The model learns both simultaneously.
- Why it matters: Without learning from incorrect attempts, the model can't recognize its own pitfalls. Anchor: Even when the model's final number is wrong, it still learns to say "no" to "Is this correct?"
Hook: You know how saying your confidence out loud can make you catch shaky answers?
The Concept: Verbalized confidence (yes/no)
- What it is: Instead of reading tiny probability scores, the model says "yes" or "no" to how likely it's correct, and we use that probability.
- How it works: 1) After solving, the model gets a mini-prompt: "Is this correct? yes/no." 2) We read the probability of "yes." 3) That number becomes its confidence.
- Why it matters: Words capture the model's high-level certainty better for reasoning tasks than raw token scores. Anchor: For "Is my final answer 42 correct?", the model might assign 0.82 to "yes," so confidence is 0.82 (see the sketch below).
Hook: Picture tidying your desk so the teacher can grade your work fairly.
The Concept: Adaptive Injection Decoding (AID)
- What it is: A decoding helper that enforces clean answer formatting so the model isn't punished for tiny syntax slips.
- How it works: 1) If the model is about to end early, we gently insert the expected answer format (like "So, the answer is {…}"). 2) We make sure brackets close. 3) We stop when the format is valid.
- Why it matters: Without AID, a correct solution could be mislabeled "wrong" just for a missing brace, teaching the wrong lesson. Anchor: If the model forgets the closing }, AID adds it so the grader can read the answer.
Why it works (intuition):
- The model sees both sides of the coin: what correctness looks like and what failure smells like. That builds a sense of epistemic uncertainty: knowing when it doesn't know.
- This creates a natural curriculum: early on, there are more "no" labels, so the model learns caution; later, as it improves, it sees more "yes," and it grows confident, but appropriately so.
- Because the model internalizes a trustworthy confidence signal, test-time ensembles can use fewer samples (weighted by confidence) to reach the same accuracy as big ensembles.
Building blocks:
- Reasoning traces (the steps).
- Verbalized self-evaluation (yes/no with a probability).
- Dual-objective mixing (train on correct reasoning + all self-evals).
- AID for clean formatting and less label noise.
- Optional: Confidence-weighted voting at inference to save compute.
Anchor: With EPICAR, answering "What is 17 × 23?" includes steps, the final answer, and a confidence judgment, so when it's unsure, it signals that clearly and invites a double-check.
03 Methodology
High-level pipeline: Input question → Generate multiple reasoning paths → Check correctness and ask the model to self-evaluate (yes/no) → Mix data (reasoning + self-evaluation) → Supervised fine-tuning → Repeat over iterations.
Hook: Imagine solving many riddles, and after each, you also say how sure you are.
The Concept: Generation phase (K samples per question)
- What it is: For each problem, the model writes K different step-by-step solutions and final answers.
- How it works: 1) Sample K reasoning paths. 2) For each path, get the final answer. 3) Compare to ground truth (if available). 4) Ask, "Is this correct? yes/no," and record the probability.
- Why it matters: Without several tries, we can't learn what confident-correct vs. confident-incorrect looks like. Anchor: For a MATH problem, the model tries 10 solution paths, gets 3 right, 7 wrong, and reports confidence for each (see the sketch below).
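A rough sketch of this generation loop follows. The helpers `generate_solution` and `extract_answer` are hypothetical stand-ins for sampling one chain-of-thought path and parsing its final answer; `verbalized_confidence` is the confidence read-out sketched earlier.

```python
# Sketch of the generation phase; helper functions are hypothetical stand-ins.
def generate_attempts(model, tokenizer, question, ground_truth, k=10):
    attempts = []
    for _ in range(k):
        solution = generate_solution(model, tokenizer, question)  # one sampled CoT path
        answer = extract_answer(solution)                         # parse the final boxed answer
        attempts.append({
            "solution": solution,
            "correct": answer == ground_truth,                    # compare to ground truth
            "confidence": verbalized_confidence(model, tokenizer, question, solution),
        })
    return attempts
```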
Hook: Think of two baskets on your desk: one for solutions to copy from, one for solutions to judge.
The Concept: Mixing phase (dual datasets)
- What it is: Build two datasets from those K attempts: (1) reasoning reinforcement (only correct paths), (2) self-evaluation (all paths labeled yes/no).
- How it works: 1) Correct paths go to the reasoning set. 2) All paths go to the self-eval set (correct → yes; incorrect → no). 3) Shuffle and combine into a single training stream.
- Why it matters: Without the "no" examples, the model can't learn what "wrongness" looks like. Anchor: A wrong answer still teaches the model to say "no" to "Is this correct?" (sketched in code below).
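Under the same assumptions, the mixing step might look like this; the exact prompt formatting used by the authors may differ.

```python
# Sketch of the mixing phase: correct paths reinforce reasoning, while every path
# (right or wrong) becomes a yes/no self-evaluation example.
import random

def build_training_stream(question, attempts):
    examples = []
    for a in attempts:
        if a["correct"]:  # (1) reasoning reinforcement: correct paths only
            examples.append(f"Question: {question}\n{a['solution']}")
        label = "yes" if a["correct"] else "no"  # (2) self-evaluation: all paths
        examples.append(
            f"Question: {question}\n{a['solution']}\nIs this answer correct? {label}"
        )
    random.shuffle(examples)  # one combined training stream
    return examples
```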
Hook: Picture taking notes and grading them in the same notebook so you learn faster.
The Concept: Dual-objective SFT (single loss)
- What it is: Train with one standard language-model loss across both reasoning tokens and the yes/no token.
- How it works: 1) Concatenate reasoning text and the yes/no response into one target sequence. 2) Optimize the usual next-token loss. 3) Repeat across iterations.
- Why it matters: A simple, stable setup with no fragile weighting tricks lets the model jointly learn reasoning and self-judgment. Anchor: After an answer, the training target includes "… Is this correct? yes" or "… Is this correct? no." (A minimal training-step sketch follows.)
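Here is a minimal training step under these assumptions, using the usual Hugging Face pattern where passing `labels` to a causal LM returns the shifted next-token cross-entropy; padding tokens are masked out of the loss.

```python
# Minimal single-loss SFT step over the mixed stream.
# Assumes tokenizer.pad_token is set; both reasoning tokens and the final
# yes/no token share the same next-token cross-entropy, with no special weighting.
def sft_step(model, tokenizer, batch_texts, optimizer):
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # don't compute loss on padding
    loss = model(**enc, labels=labels).loss    # standard causal-LM cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```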
Hook: Imagine your notebook reminds you to neatly box your final answer so the teacher can grade it easily.
The Concept: Adaptive Injection Decoding (AID)
- What it is: A guardrail that keeps answers parseable (e.g., ensures an answer box is opened and closed) so we don't confuse format errors with logic errors.
- How it works: 1) If the model forgets the closing brace, gently force it. 2) If it's ending too soon, nudge it to finish the format. 3) Prevent endless output inside the answer.
- Why it matters: Without AID, the model might get punished for typos, which would corrupt the self-evaluation learning signal. Anchor: AID adds the missing } so an otherwise correct 42 isn't marked wrong (see the simplified sketch below).
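The snippet below is only a simplified stand-in for AID's idea: the real method intervenes during decoding, while this sketch patches the finished string so a formatting slip never gets an attempt graded as wrong. The answer-prefix wording is an assumption.

```python
# Simplified stand-in for AID (the authors' method acts during decoding, not post hoc):
# if a sampled output lacks the expected answer format, inject or close it so the
# grader can still parse the answer.
def enforce_answer_format(text, answer_prefix="So, the answer is {"):
    if answer_prefix not in text:
        # The model stopped early: append the expected answer phrase to be completed.
        text = text.rstrip() + "\n" + answer_prefix
    if text.count("{") > text.count("}"):
        text += "}"  # close an unbalanced answer box instead of marking the attempt wrong
    return text
```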
Hook: Think about report cards that show not only grades but also how "honest" the student's confidence was.
The Concept: Calibration metrics (ECE, AUROC, Brier Score)
- What it is: Ways to measure whether the model's confidence matches reality and whether it ranks correct answers higher than wrong ones.
- How it works: 1) ECE: groups predictions by confidence and compares average confidence vs. actual accuracy per group. 2) AUROC: checks if correct answers tend to get higher confidence than incorrect ones. 3) Brier: measures the squared error between confidence and actual correctness.
- Why it matters: Without these metrics, we can't tell if the model is just loud or truly trustworthy. Anchor: If the model says "I'm 80% sure" many times, ECE checks if it's right about 8 out of 10; AUROC checks if rights usually score higher than wrongs; Brier sums how far off its confidence is (see the sketch below).
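These three metrics are standard, so a small sketch with NumPy and scikit-learn on made-up toy numbers shows how each is computed.

```python
# Sketch of the three reliability metrics on toy data.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight each bin's |confidence - accuracy| gap by its share of samples
            ece += in_bin.mean() * abs(confidences[in_bin].mean() - correct[in_bin].mean())
    return ece

conf = [0.9, 0.8, 0.3, 0.95, 0.4]   # toy confidences
hits = [1, 1, 0, 0, 1]              # toy correctness labels
print("ECE  :", expected_calibration_error(conf, hits))
print("AUROC:", roc_auc_score(hits, conf))       # ranking quality
print("Brier:", brier_score_loss(hits, conf))    # squared error of confidence
```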
Example walk-through:
- Input: "What is 29 × 17?"
- Generation: The model samples 10 solutions with steps and answers. Suppose 4 match the truth.
- Self-eval: For each path, it answers "Is this correct? yes/no" and gives a probability for "yes."
- Mixing: The 4 correct paths teach reasoning; all 10 paths teach self-evaluation (the 6 incorrect → "no").
- Training: One standard language-model loss trains on both sequences.
- Iterate: Repeat for more data, improving both solving and confidence honesty.
Secret sauce:
- The model learns from its failures, not just successes, so it captures the "smell" of likely mistakes.
- AID removes noisy labels from formatting slips, keeping the self-evaluation signal clean.
- A "natural curriculum" happens automatically: early training is caution-heavy (more "no"), later training grows confident responsibly (more "yes").
04 Experiments & Results
Hook: Imagine testing two things about a student: how often they get the right answer, and how truthful their self-confidence is.
The Concept: What was tested
- What it is: The researchers measured both reasoning accuracy and reliability (calibration/discrimination).
- How it works: 1) Datasets: MATH for training/eval, GSM8K for out-of-distribution (OOD) math, MBPP for code. 2) Models: Llama-3 and Qwen-3 families, multiple sizes. 3) Metrics: Accuracy, ECE, AUROC, Brier; sometimes temperature scaling (TS) to test intrinsic vs. fixable calibration.
- Why it matters: Without balanced tests, you might boost accuracy but accidentally teach overconfidence. Anchor: Think of a scoreboard that shows "points scored" (accuracy) and "honesty score" (calibration).
The competition: Baselines included the base model, STaR (iterative self-training that keeps only correct paths), Slow Thinking (ICL prompts that encourage careful checking), and Model Merging (weight interpolation to balance capability and calibration).
Scoreboard highlights (with context):
- Llama-3-3B: Accuracy 7.56% → 8.58% (EPICAR), AUROC 0.555 → 0.568; ECE 0.376 → 0.108; Brier 0.216 → 0.097. That's like improving the grade while also getting way better at predicting when you're right.
- Llama-3-8B: Accuracy 13.30% → 14.42% (EPICAR), AUROC 0.544 → 0.595; ECE 0.496 → 0.415; Brier 0.368 → 0.298. Better answers and more honest confidence.
- Qwen-3-8B: Accuracy 49.52% (STaR) → 49.76% (EPICAR); AUROC 0.727 → 0.797; ECE 0.196 → 0.131; Brier 0.259 → 0.206. Stronger discrimination and calibration.
- MBPP (code): Llama-3-8B accuracy 37.74% (STaR) → 39.30% (EPICAR); ECE(+TS) 0.390 → 0.113; Brier 0.387 → 0.246. That's like moving from a shaky coder with overly bold claims to a steadier one who admits uncertainty.
Hook: You know how a careful student double-checks their work before handing it in?
The Concept: Slow Thinking (ICL)
- What it is: Prompting that nudges models to self-verify and backtrack within longer reasoning.
- How it works: 1) Provide few-shot examples that demonstrate checking. 2) The model writes a more reflective chain of thought. 3) It stabilizes confidence on capable models.
- Why it matters: Without a stable base, slow thinking can wobble; with EPICAR's calibration, it shines, especially in 8B models. Anchor: On Qwen-3-8B, pairing EPICAR with slow thinking reached 55.56% in one reported setup.
Hook: Picture a classroom vote where votes from more confident students count a bit more.
The Concept: Confidence-Informed Self-Consistency (CISC)
- What it is: A way to ensemble multiple sampled answers by giving more weight to paths the model feels confident about.
- How it works: 1) Sample K solutions. 2) Each path has a "yes" probability. 3) Weight votes by confidence (softmax with temperature). 4) Pick the answer with the highest total weighted confidence.
- Why it matters: Without CISC, frequent wrong answers can win the vote; with CISC, honest confidence tips the balance toward the right path. Anchor: On MATH-500, EPICAR + CISC matched STaR's K=30 performance using only K=10, cutting compute by about 3× (see the sketch below).
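A small sketch of this weighted vote is below; the softmax temperature and the toy numbers are illustrative, not the paper's settings.

```python
# Sketch of confidence-informed self-consistency: each sampled answer's vote is
# weighted by a softmax over the paths' confidences.
from collections import defaultdict
import numpy as np

def cisc_vote(answers, confidences, temperature=0.5):
    conf = np.asarray(confidences)
    weights = np.exp(conf / temperature)
    weights /= weights.sum()            # softmax over per-path confidences
    totals = defaultdict(float)
    for ans, w in zip(answers, weights):
        totals[ans] += w                # accumulate weight per distinct answer
    return max(totals, key=totals.get)

# The frequent but low-confidence answer "12" loses to the confident answer "14".
print(cisc_vote(["12", "12", "14"], [0.30, 0.35, 0.95]))
```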
Hook: Think of tuning a blend between two instruments until the music sounds just right.
The Concept: Model Merging
- What it is: Interpolating between base weights and fine-tuned weights to trade off capability and calibration.
- How it works: 1) Pick λ between 0 and 1. 2) Merge: (1 − λ) · base + λ · fine-tuned. 3) Evaluate across λ values to find sweet spots.
- Why it matters: Without merging, you might be stuck with overconfidence; with merging, you can often improve calibration further. Anchor: Llama-3-8B with EPICAR + merging reached 15.02% accuracy, its highest in that study (see the sketch below).
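Weight interpolation itself is a one-liner over the two checkpoints' state dicts, sketched below under the assumption that both models share the same architecture (and that all parameters are floating-point tensors).

```python
# Sketch of linear weight interpolation between base and fine-tuned checkpoints.
# lam=0 keeps the base model, lam=1 keeps the fine-tuned one.
def merge_state_dicts(base_sd, tuned_sd, lam=0.5):
    return {k: (1 - lam) * base_sd[k] + lam * tuned_sd[k] for k in base_sd}

# Usage (hypothetical variable names):
# merged = merge_state_dicts(base_model.state_dict(), tuned_model.state_dict(), lam=0.7)
# model.load_state_dict(merged)
```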
Surprises:
- Smaller models (≈1B) struggled to gain accuracy while improving calibration, suggesting a capacity threshold for the full benefits.
- EPICAR often boosted AUROC (ranking correctness well) even when raw ECE needed extra tuning (e.g., with temperature scaling) in OOD settings like GSM8K.
OOD generalization (GSM8K): EPICAR improved accuracy and discriminative reliability for many settings (e.g., Llama-3-3B accuracy 14.94% → 21.46%; AUROC 0.527 → 0.606), showing the self-evaluation skill travels beyond the training domain.
05 Discussion & Limitations
Limitations:
- Domain scope: EPICAR was tested where correctness can be checked automatically (math, code). It's less clear how to apply the same clear yes/no supervision to fuzzy domains like legal advice or creative writing.
- Capacity threshold: Very small models (around 1B parameters) sometimes couldn't benefit fully; there may be a "critical mass" of reasoning skill required.
- Absolute calibration shift OOD: In new domains, the model's relative confidence ordering (AUROC) often stays strong, but the exact numeric probabilities (ECE) can drift.
- Prompt sensitivity: Verbalized confidence ("Is this correct? yes/no") can be somewhat sensitive to phrasing; EPICAR improves robustness but doesn't eliminate this challenge.
- Training cost: Iterative training adds compute during training time, though it later saves inference compute by needing fewer samples.
Required resources:
- A base reasoning-capable LLM (ideally 3B+ parameters).
- Datasets with verifiable answers (for correctness labels).
- Compute for iterative sampling (K per problem) and fine-tuning.
When not to use:
- Extremely small models with very low baseline accuracy (little signal to learn honest self-evaluation).
- Domains with no clear ground truth or verifiable checker.
Open questions:
- Can we make verbalized probabilities numerically stable across domains (absolute calibration) without hurting accuracy?
- How does EPICAR combine with RL methods (e.g., using self-eval as intrinsic rewards) to further improve reasoning?
- Can we design prompt-invariant confidence queries that are robust to wording changes?
- How low can the capacity threshold go with smarter data augmentation or curriculum design?
06 Conclusion & Future Work
Three-sentence summary:
- EPICAR trains models to both solve problems and honestly judge their own answers by learning from correct and incorrect attempts.
- This dual-objective approach improves accuracy and trustworthiness together, fixing the usual overconfidence seen in positive-only self-training.
- With better internal confidence, models can match high-ensemble performance using far fewer samples, cutting inference compute by about 3×.
Main achievement:
- Turning reasoning training into an epistemic learning problem, making "know what you don't know" a first-class training goal, so models become both smarter and safer.
Future directions:
- Improve absolute calibration in new domains, integrate EPICAR-style self-eval into RL frameworks, and harden verbalized confidence against prompt sensitivity.
Why remember this:
- Because knowing when you might be wrong is a superpower for both people and AI. EPICAR shows that teaching models this superpower boosts performance, trust, and efficiency all at once.
Practical Applications
- Homework helpers that flag uncertain answers so students know where to ask a teacher or re-check.
- Coding assistants that warn when a suggested function is likely buggy before you run it.
- Math tutoring tools that show steps and confidence per step, guiding targeted practice.
- Customer support bots that escalate tricky cases when confidence is low, reducing bad answers.
- Scientific assistants that highlight low-confidence claims in summaries for expert review.
- Medical triage chatbots that defer or seek human confirmation when uncertainty is high.
- Legal research tools that mark low-confidence citations, prompting additional verification.
- Enterprise analytics agents that report uncertainty bands with conclusions for safer decisions.
- Education platforms that grade not only answers but also students' confidence calibration.
- Research LLMs that use confidence-weighted ensembling to reach strong answers with fewer samples.