VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning
Key Summary
- VERGE is a teamwork system where an AI writer (an LLM) works with a strict math checker (an SMT solver) to make answers both smart and logically sound.
- It breaks a big answer into tiny, checkable facts called atomic claims, then tries to prove each one with logic when possible.
- If the logic checker finds a mistake, VERGE pinpoints the exact few claims to fix using Minimal Correction Subsets (MCS) so the AI knows what to change.
- Not every sentence can be turned into hard math, so VERGE routes clear math/logic facts to the solver and sends fuzzy or commonsense parts to a group of AI judges.
- To avoid string-matching tricks, VERGE uses semantic equivalence: different formalizations must mean the same thing logically to count as a match.
- It keeps score across all claims and penalizes mixes of mutually inconsistent statements, then iterates with targeted feedback until the answer is stable and verified.
- Across multiple reasoning benchmarks, VERGE boosts performance at convergence by about 18.7% compared to single-pass answers.
- It shines in messy, real-world language by giving formal guarantees where possible and consensus confidence elsewhere.
- There is a "Formalization Barrier": smaller models struggle to write valid logic code, so bigger models benefit more from VERGE.
- The result is a more trustworthy AI that knows when to prove, when to ask for consensus, and exactly what to fix when it's wrong.
Why This Research Matters
VERGE makes AI answers not just sound good but hold together logically, which is crucial in areas like medicine, finance, and law. It proves what can be proved and admits uncertainty where language is fuzzy, giving people a clearer picture of what to trust. By pinpointing the smallest fixes when answers break, it helps AI improve faster and more safely. This hybrid design lets AI handle both strict math and everyday language without pretending everything fits the same mold. As models grow stronger, VERGE scales up their reliability, turning fluency into trustworthy reasoning. In short, it's a big step toward AI you can check, correct, and confidently use.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how a smart friend can talk really smoothly, but sometimes gets facts mixed up? Talking nicely isn't the same as being right.
The Concept: Large Language Models (LLMs) are computer programs that are great at writing and explaining things in human language.
- How it works: They learn patterns from lots of text, predict likely next words, and generate answers that sound fluent.
- Why it matters: Without a way to check if their words are logically correct, they can sound confident but still be wrong. Anchor: Imagine asking, "If all cats are mammals and Felix is a cat, is Felix a mammal?" An LLM might say yes (good!), but it might also add, "Felix barks," which isn't supported. We need a checker that catches that.
Hook: Think of building a LEGO bridge. It can look perfect, but will it hold? You test it.
The Concept: Formal verification is a way to prove whether statements follow strict logical rules.
- How it works: Turn sentences into precise math-like rules, then use a solver to test if they fit together without breaking.
- Why it matters: Without formal checks, an answer can include hidden contradictions that humans might miss. Anchor: "All cats are mammals; Felix is a cat" formally proves "Felix is a mammal," but doesn't prove "Felix barks."
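To make the idea concrete, here is a minimal sketch of this entailment check using the Z3 SMT solver (the solver VERGE builds on) via its Python bindings; the sort and predicate names mirror the Felix example and are purely illustrative.

```python
# A claim is entailed when the premises plus its negation are unsatisfiable.
from z3 import BoolSort, Const, DeclareSort, ForAll, Function, Implies, Not, Solver, unsat

Thing = DeclareSort("Thing")
Cat = Function("Cat", Thing, BoolSort())
Mammal = Function("Mammal", Thing, BoolSort())
Barks = Function("Barks", Thing, BoolSort())
felix, x = Const("Felix", Thing), Const("x", Thing)

premises = [ForAll([x], Implies(Cat(x), Mammal(x))), Cat(felix)]

def entailed(claim):
    s = Solver()
    s.add(*premises)
    s.add(Not(claim))            # look for a world where the claim is false
    return s.check() == unsat    # no such world means the claim is proven

print(entailed(Mammal(felix)))   # True: forced by the premises
print(entailed(Barks(felix)))    # False: nothing says Felix barks
```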
The World Before: LLMs got very good at step-by-step solutions (Chain-of-Thought), self-consistency (sampling multiple answers and voting), self-refinement (critiquing themselves), and even multi-agent debates. These improved reliability a bit, but none gave hard, mathematical guarantees. They measured what sounded right, not what must be right.
The Problem: In high-stakes settings like health, finance, or law, "probably correct" is not enough. We need proofs when we can have them, and honest uncertainty when we can't. LLMs still hallucinate (make things up), mix in unsupported claims, or contradict their own steps.
Hook: Imagine translating a poem into numbers. Some ideas are crisp (like dates), but others are fuzzy (like beauty). Forcing all of it into numbers breaks the meaning.
The Concept: The semantic gap is the mismatch between messy, human language and rigid, formal logic.
- How it works: Clear claims (like 2 < 5) can be formalized exactly; vague claims (like "likely" or "small") don't fit cleanly.
- Why it matters: If we force every sentence into strict logic, we either fail often or invent shaky rules, causing brittle systems. Anchor: "Alice is taller than Bob" is formal-ish; "Alice looks tall for her age" is fuzzy.
Failed Attempts: Pure logical pipelines require every claim to be formalized, and they crumble on open-world language. Pure LLM pipelines rely on probabilities and can't give proofs. Even tool-using LLMs often check results (like with a calculator) but not the logical structure that ties claims together.
The Gap: We need a system that proves what can be proved and sensibly checks the rest, while giving exact, useful feedback on what to fix when things clash.
Real Stakes: Think medical advice that must not contradict known risks, contracts that must meet regulations, or scheduling decisions that must satisfy constraints. People need answers that are not just eloquent but also verifiably sound.
Enter VERGE: a framework that teams up LLMs (for language and creativity) with SMT solvers (for strict logic), routes claims to the right checker, and uses pinpoint feedback to refine answers until they are consistent and, when possible, formally proven.
02 Core Idea
Hook: Imagine writing an essay with three helpers: a friendly writer, a tough math teacher, and a coach who tells you exactly which sentence to fix.
The Concept: VERGE is a neurosymbolic system that mixes an LLM (the writer) with an SMT solver (the math teacher) and a guidance engine (the coach) to refine answers until they are logically sound.
- How it works: It breaks your answer into small claims, proves the provable ones with logic, checks fuzzy ones by consensus, and when there's a conflict, it finds the smallest set of claims to change.
- Why it matters: Without this team effort, AI answers can look good but hide logical cracks, or waste time fixing the wrong parts. Anchor: "Felix is a mammal. Felix barks." The solver says the second claim isn't supported and points to it. The AI revises: "Felix is a mammal. Unknown if he barks."
The Aha! Moment (one sentence): Don't force all language into math; instead, formally verify what you can, use trusted consensus for the rest, then apply precise, minimal feedback to repair contradictions.
Three Analogies:
- School Project: The writer drafts, the checker proves, the coach marks the exact lines to change; repeat until the paper earns an A.
- Building Inspection: The blueprint (logic) must pass code (SMT). If something fails, the inspector circles the exact beams to replace (MCS), not the whole house.
- Sports Team: The playbook (logic) covers strict moves; player instincts (commonsense) handle fuzzier calls; the replay booth (routing) decides which reviewer judges which play; penalties (feedback) name the exact foul.
Before vs. After:
- Before: LLMs voted or debated but couldn't prove; logic systems broke on fuzzy language.
- After: VERGE proves math/logic claims, gets consensus for fuzzy ones, and fixes only what's needed when things collide.
Why It Works (intuition, no equations):
- Logic is a truth engine for crisp facts; use it to lock down what's provable.
- Language is rich and sometimes vague; use judged consensus when logic can't apply cleanly.
- Instead of a red X that says "wrong," give a map of the smallest fixes (MCS) so the model improves fast and stays consistent.
- Check that different formalizations really mean the same thing (semantic equivalence), so agreement is about meaning, not matching words.
Building Blocks (in simple pieces):
Hook: You know how big ideas are easier to check when split into bullet points? The Concept: Atomic claims are tiny statements that can each be checked.
- How it works: Split the answer into simple, stand-alone facts (no pronouns like "he").
- Why it matters: Small facts are easier to verify and repair. Anchor: "All cats are mammals" + "Felix is a cat" + "Felix is a mammal" are clean, checkable pieces.
Hook: Think of assigning chores to whoever does them best. The Concept: Semantic routing sends each claim to the best checker based on type.
- How it works: Math/logic/temporal go to the solver; vague/commonsense/probabilistic go to consensus judges; hybrid when needed.
- Why it matters: Forcing fuzzy claims into hard math causes brittle failures. Anchor: "3 is less than 5" → solver. "It's likely to rain" → consensus.
Hook: Picture three students solving the same equation differently but getting the same truth. The Concept: Multi-model semantic equivalence means different formalizations are accepted only if they are truly the same logically.
- How it works: Instead of word-matching, the solver checks if two formulas always agree in meaning.
- Why it matters: Prevents fake agreement caused by similar wording. Anchor: A ∧ B equals B ∧ A in logic, even if written differently.
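In Z3 this check is tiny: two formalizations count as equivalent only if the solver can prove they never disagree on any assignment. A minimal propositional sketch (the formulas are illustrative):

```python
from z3 import And, Bools, Solver, Xor, unsat

A, B = Bools("A B")
f1 = And(A, B)                 # one candidate formalization
f2 = And(B, A)                 # same meaning, different word order

s = Solver()
s.add(Xor(f1, f2))             # satisfiable only if they can ever disagree
print(s.check() == unsat)      # True: semantically equivalent
```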
Hook: When a bike squeaks, you want to oil the exact bolt, not replace the whole wheel. The Concept: Minimal Correction Subsets (MCS) point to the smallest set of claims to change to fix a contradiction.
- How it works: Keep as many consistent claims as possible and mark only the minimal troublemakers.
- Why it matters: Saves time, keeps good reasoning intact, and speeds convergence. Anchor: "Green is left of Blue" + "Blue is at 2" + "Green is at 3" can't all be true; MCS singles out "Green is at 3."
Put together, VERGE is a friendly but strict editor: prove what you can, sensibly judge what you canāt, and when thereās a clash, fix the fewest lines necessary.
03 Methodology
High-level flow: Input → Entity extraction → Draft answer → Decompose into atomic claims → Classify & route → Formalize + consensus (when formal) → Verify (hard/soft/hybrid) → Score + joint consistency → Targeted feedback (MCS) → Refine → Output.
Step 1. Entity Extraction. Hook: Before solving a puzzle, you label the pieces. The Concept: Entity extraction finds the people, things, times, and numbers in the problem to use as named pieces in logic.
- What happens: From the context and question, VERGE lists entities like Felix, Monday, or Process A.
- Why it matters: The solver needs exact names to talk about.
- Example: From "All cats are mammals. Felix is a cat," we create entities Cat, Mammal, Felix. Anchor: Declare Felix as an object; use predicates like Cat(Felix).
Step 2. Generation (First Draft)
- What happens: The LLM writes an initial answer using the context.
- Why it matters: We need a starting point to refine.
- Example: "Felix is a mammal. Felix also barks."
Step 3. Atomic Decomposition. Hook: Checking a checklist is easier than checking a paragraph. The Concept: Split the answer into atomic claims: tiny, self-contained facts.
- What happens: Turn the draft into a list like [C1: Felix is a mammal; C2: Felix barks].
- Why it matters: Small facts can be proven or fixed precisely.
- Example: Resolve pronouns and make each claim stand alone. Anchor: "He barks" becomes "Felix barks."
Step 4. Claim Classification and Semantic Routing. Hook: Send math problems to the math teacher, and art questions to the art teacher. The Concept: Semantic routing picks the right verifier based on meaning: math/logic/temporal → solver; vague/commonsense/probabilistic → consensus judges; hybrid if formalization fails.
- Why it matters: Prevents brittle failures from forcing fuzzy language into strict logic.
- Example: "3 < 5" → solver; "It's likely to rain" → consensus; syntax error? → hybrid fallback. Anchor: "Blue is right of Red" goes to logic; "The painting is beautiful" goes to soft judges.
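As a toy stand-in for the router (VERGE's real classifier works on meaning, not keywords), the sketch below uses crude, invented heuristics just to show the three-way decision:

```python
# Illustrative only: keyword rules standing in for an LLM-based semantic router.
import re

FUZZY_MARKERS = ("likely", "probably", "usually", "beautiful", "seems")

def route(claim: str) -> str:
    text = claim.lower()
    if any(m in text for m in FUZZY_MARKERS):
        return "soft"      # vague/commonsense/probabilistic -> consensus judges
    if re.search(r"\d|<|>|=|before|after|left of|right of", text):
        return "hard"      # math/order/temporal -> SMT solver
    return "hybrid"        # unclear -> try formalizing, fall back if it fails

print(route("3 < 5"))                      # hard
print(route("It's likely to rain"))        # soft
print(route("The painting is beautiful"))  # soft
```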
Step 5. Autoformalization with Semantic Equivalence. Hook: Translating a sentence into math is like writing music notes from a melody. The Concept: Autoformalization turns clear claims into logic formulas; multiple candidates must be semantically equivalent to count.
- What happens: The LLM proposes 3 formal versions; the solver checks if they mean the same thing logically.
- Why it matters: Avoids accepting formulas that look similar but differ in meaning.
- Example: "All cats are mammals" may be encoded in different but equivalent forms; the solver verifies equivalence, not wording. Anchor: Accept only if a majority forms an equivalence clique; otherwise, refine or route to soft.
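A sketch of the clique test, assuming three candidate encodings (the propositional formulas below are invented): each pair is checked for logical equivalence, and a candidate is accepted only when a strict majority agrees with it.

```python
from z3 import And, Bools, Or, Solver, Xor, unsat

A, B = Bools("A B")
candidates = [And(A, B), And(B, A), Or(A, B)]   # three LLM-proposed encodings

def equivalent(f, g):
    s = Solver()
    s.add(Xor(f, g))               # satisfiable only if f and g can disagree
    return s.check() == unsat

for i, f in enumerate(candidates):
    clique = [g for g in candidates if equivalent(f, g)]
    if len(clique) > len(candidates) // 2:      # strict majority agrees with f
        print(f"accept candidate {i}:", f)      # here: candidate 0, And(A, B)
        break
else:
    print("no majority clique: refine or route to soft verification")
```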
Step 6. Verification Cascade (Hard/Soft/Hybrid). Hook: Different tools for different jobs. The Concept: VERGE verifies each claim using the strongest suitable method.
- Hard (SMT): four statuses: Entailed (proven), Possible (consistent but not forced), Contradictory (clashes), Unknown (timeout/error).
- Soft (Consensus): LLM judges vote Supported/Plausible/Unsupported/Uncertain with confidence weights.
- Hybrid: If formalization chokes (syntax/timeout), reroute to soft to keep moving.
- Why it matters: Guarantees when possible; graceful fallback otherwise. Anchor: "Alan drinks beer" gets proved by contradiction; "It's likely to rain" gets a supported consensus.
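Here is a compact sketch of the hard path's four statuses over a ZebraLogic-style toy context; the helper name, timeout, and constraints are illustrative choices, not VERGE's actual code.

```python
from z3 import And, Int, Not, Solver, sat, unsat

def hard_verify(context, claim, timeout_ms=2000):
    def check(*formulas):
        s = Solver()
        s.set("timeout", timeout_ms)
        s.add(*formulas)
        return s.check()

    r_neg = check(context, Not(claim))   # can the claim be false?
    if r_neg == unsat:
        return "Entailed"                # no: the context forces it
    r_pos = check(context, claim)        # can the claim be true?
    if r_pos == unsat:
        return "Contradictory"           # no: it clashes with the context
    if r_pos == sat and r_neg == sat:
        return "Possible"                # consistent, but not forced
    return "Unknown"                     # timeout/error -> reroute to soft judges

green, blue = Int("green"), Int("blue")
ctx = And(green >= 1, green < blue, blue == 2)   # houses numbered from 1
print(hard_verify(ctx, green == 1))              # Entailed: green must be at 1
print(hard_verify(ctx, green == 3))              # Contradictory: green < 2
```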
Step 7. Joint Consistency and Scoring. Hook: Even if each puzzle piece looks fine alone, the full picture can still have gaps. The Concept: Joint consistency checks whether all accepted claims fit together; scoring rewards rigor and penalizes contradictions.
- What happens: Combine claims and context; run a joint SAT check; compute a score (Entailed = 1.0, Supported = 0.9, Possible = 0.7, Contradictory = 0.0), subtracting a variance penalty when the claims' grades disagree.
- Why it matters: Stops answers that mix confident truths with hidden conflicts.
- Example: "Green < Blue," "Blue = 2," "Green = 3" fail jointly; the score drops. Anchor: A uniform-A report card beats a mix of A's and F's.
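The scoring rule fits in a few lines. The status weights come straight from the description above, while the penalty coefficient LAMBDA is an assumed value for illustration (the joint check itself is just one more solver call over all accepted claims plus the context).

```python
from statistics import mean, pvariance

STATUS_SCORE = {"Entailed": 1.0, "Supported": 0.9, "Possible": 0.7, "Contradictory": 0.0}
LAMBDA = 0.5   # assumed penalty weight: mixed grades drag the total down

def answer_score(statuses):
    scores = [STATUS_SCORE[s] for s in statuses]
    return mean(scores) - LAMBDA * pvariance(scores)   # reward rigor, punish spread

print(answer_score(["Entailed", "Entailed", "Entailed"]))        # 1.0: uniform A's
print(answer_score(["Entailed", "Supported", "Contradictory"]))  # ~0.53: mixed grades
```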
Step 8. Targeted Feedback via MCS. Hook: When a circuit shorts, you want the tiny bad wire, not a full house rewire. The Concept: Minimal Correction Subsets mark the smallest set of claims to change to restore consistency.
- What happens: Keep the highest-confidence consistent subset; label only the minimal offenders for revision.
- Why it matters: Speeds convergence and preserves good reasoning.
- Example: In ZebraLogic, MCS pinpoints "Green = 3" as the fix. Anchor: The AI revises just those claims, not the whole answer.
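One standard way to compute an MCS, sketched below with Z3's MaxSAT engine (the paper's exact algorithm may differ): soft-assert every claim, maximize how many hold, and the claims left unsatisfied form a minimal correction set.

```python
from z3 import And, Int, Optimize, is_true, sat

green, blue = Int("green"), Int("blue")
claims = {
    "Green is left of Blue": green < blue,
    "Blue is at 2":          blue == 2,
    "Green is at 3":         green == 3,
}

opt = Optimize()
opt.add(And(green >= 1, blue >= 1))   # hard background: valid house positions
for formula in claims.values():
    opt.add_soft(formula)             # soft: each claim may be dropped at unit cost

assert opt.check() == sat
m = opt.model()
mcs = [name for name, f in claims.items()
       if not is_true(m.eval(f, model_completion=True))]
print("revise:", mcs)                 # a smallest fix, e.g. ['Green is at 3']
```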
Step 9. Iteration until Convergence
- What happens: The LLM rewrites using feedback (unsat cores, MCS, joint conflicts, low-confidence formalizations). Repeat until the score passes a threshold and the joint check is satisfied, or scores stop improving.
- Why it matters: Each loop tightens logic, reduces contradictions, and improves clarity.
- Example: From "Felix barks" to "Unknown if Felix barks," with "Felix is a mammal" proven.
Secret Sauce (why this recipe is clever):
- Semantic routing respects the semantic gap, applying the right level of rigor to each claim.
- Semantic equivalence ensures agreement on meaning, not just wording.
- MCS turns vague failure into precise, actionable edits, making refinement reliable and fast.
- Variance-penalized scoring incentivizes harmony across all claims, not just isolated wins.
04 Experiments & Results
The Test: VERGE was tried on six types of reasoning tasks: pure logic (FOLIO), synthetic proofs (ProofWriter), constraint puzzles (ZebraLogic), exam-style scheduling and logic (AR-LSAT), very hard mixed reasoning (BBEH), and a broad, real-world exam (HLE). These cover both crisp logic and fuzzy human-language questions.
The Competition: It was compared to popular prompting techniques (Chain-of-Thought (CoT), Self-Consistency, Self-Refinement) and to neurosymbolic/logical baselines like DSB, LogicLM, LINC, and Proof of Thought (PoT). To be fair, all methods got similar iteration budgets.
The Scoreboard (with context):
- Overall, VERGE improved final accuracy at convergence by about 18.7% over single-pass methods. Think of moving from a low B to a solid A in tough classes.
- On HLE (broad human questions), a jump from 14.2% (CoT) to 30.5% with VERGE is like doubling the right answers when the test mixes fuzzy and crisp items.
- On AR-LSAT scheduling, VERGE reached over 90% with a strong model, like acing the logic games by proving constraints instead of guessing.
- On ZebraLogic puzzles, it reached around 91%, showing the solver-plus-feedback excels at constraint satisfaction.
- Exception: ProofWriter. PoT still wins here (≈98%) because that dataset is perfectly rigid and fits a single, monolithic program-execution style. VERGE's modular checks add overhead that isn't needed on such clean, synthetic logic.
Surprising and Notable Findings:
- Monotonic Improvement: Across up to 10 iterations, VERGE kept getting better each step (monotonic), like climbing stairs steadily. In contrast, plain self-refinement often got worse over time (the "correlation cliff"), fixing true steps into false ones without a solid ground-truth checker.
- Ablations (what matters most): • Remove MCS? Big drops on strict constraint tasks (e.g., AR-LSAT); without precise "change just this" pointers, the AI struggles to repair conflicts. • Remove Routing? Scores fall sharply on fuzzy, open-ended tasks (e.g., HLE); forcing all language into strict logic backfires. • Soft-Only? Largest overall drop; consensus alone lacks the ironclad precision needed for tight logic puzzles.
- Router Reliability: The semantic router classified claims to the right verifiers with about 94% accuracy in stress tests, including tricky idioms like "gave 110%." It mostly erred on the safe side, avoiding sending fuzzy claims to the solver.
- Formalization Barrier: Smaller models (around 20B parameters) write invalid logic code too often, so the solver ends up catching typos, not proving logic. Larger models (around 120B and up) cross the barrier: they produce valid logic, enabling true verification and high-quality feedback. In plain terms: the better your writer, the more helpful the strict checker becomes.
Takeaway: VERGE's hybrid design (prove where possible, judge where needed, fix minimally and precisely) beats both pure logic and pure language approaches on most real-world reasoning mixes. Where problems are purely synthetic and perfectly formal, a direct program-synthesis method can be faster; but in the messy real world, VERGE's flexibility pays off.
05 Discussion & Limitations
Limitations (honest talk):
- Latency: VERGE does more work each turn: splitting claims, generating multiple formalizations, checking equivalence, calling the solver, running judges, and crafting feedback. Answers can take many seconds per iteration, so it's not ideal for instant chat.
- Formalization Barrier: Smaller LLMs struggle to write valid logic code; the full benefits appear with large models. That means the best performance currently favors teams with access to big models.
- Expressiveness Trade-offs: To keep proofs decidable and fast, VERGE restricts logic to manageable fragments (like linear arithmetic). Highly complex math (non-linear, heavy quantifiers) often must fall back to consensus instead of formal proof.
- Verified Hallucinations Risk: If a claim is mistranslated into logic (an autoformalization error), the solver can "prove" the wrong thing. Clear UI labels that distinguish "formally proved" from "consensus-supported" are essential.
Required Resources:
- A capable LLM (ideally ≥70B parameters for reliable formalization), an SMT solver like Z3, and enough CPU time for iterative checks.
When NOT to Use:
- Real-time conversations needing sub-second replies.
- Domains requiring very expressive, non-linear math proofs that exceed solver fragments used here.
- Tasks where you can already run a monolithic, exact program that fully models the domain (e.g., certain synthetic theorem datasets).
Open Questions:
- Can we make autoformalization both faster and more faithful, reducing the risk of "proving the wrong thing"?
- How can routing become even smarter, perhaps mixing partial formalization with soft evidence dynamically within one claim?
- Can we learn better feedback than greedy MCS while staying fast, maybe with learned repair policies?
- How do we align logical verification with factuality and ethics, adding trustworthy sources and value-aware checks on top of logic?
- Can we compress the pipeline to deliver near-real-time performance without losing rigor?
06 Conclusion & Future Work
Three-Sentence Summary: VERGE teams a language model with a logic solver to build answers that are not just fluent, but also provably consistent where possible. It routes clear, formal claims to the solver, fuzzy claims to consensus judges, and uses Minimal Correction Subsets to point out exactly which bits to fix when conflicts pop up. The result is a steady, reliable refinement loop that delivers formal guarantees when the math allows and sensible confidence elsewhere.
Main Achievement: Turning a simple "right/wrong" buzz from a solver into precise, minimal, actionable feedback (MCS) and combining that with semantic routing and equivalence-based consensus to guide an LLM toward verifiable reasoning.
Future Directions: Improve autoformalization faithfulness, expand logic coverage safely, learn smarter repair strategies, integrate factuality and ethical checks, and reduce latency for broader deployment. Better interfaces should clearly label which claims are formally proved versus consensus-supported so users trust the right parts the right amount.
Why Remember This: VERGE shows a practical path to trustworthy AI by proving what can be proved and honestly handling the rest: a blueprint for combining human-like language with machine-grade logic. It's not just about getting higher scores; it's about knowing which parts of an answer are rock-solid, which are well-supported, and how to fix the few that aren't.
Practical Applications
- Contract and policy checking: Prove clauses don't conflict and highlight the exact sentences to revise.
- Scheduling and planning: Guarantee constraints (like availability rules) are satisfied and fix minimal conflicts.
- Math and logic tutors: Provide step-by-step solutions with formal proofs where possible and honest uncertainty labels.
- Compliance audits: Verify that procedures meet regulations and flag precise corrections.
- Code review assistants: Check logical conditions, invariants, and configuration constraints for contradictions.
- Data pipeline validation: Ensure schema and business rules remain consistent after changes, pinpointing minimal fixes.
- Scientific reasoning support: Separate proven derivations from plausible hypotheses with clear labels.
- Legal exam prep and LSAT-style training: Generate verifiable reasoning paths and corrections.
- Risk assessments: Combine provable constraints (budgets, capacities) with consensus on qualitative risks.
- Question-answering systems for enterprises: Deliver responses with proof status and targeted corrections when needed.