Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification
Key Summary
- •Large language models sometimes reach the right answer for the wrong reasons, which is risky and confusing.
- •This paper adds a strict logic checker into the model’s thinking loop so mistakes are caught and fixed while the model is still thinking.
- •Training happens in two stages: first the model learns to write and check its own logic steps, then it is polished with reinforcement learning that rewards clean structure and correct logic.
- •A special reward system first blocks broken formats, then encourages tidy tool use, and finally rewards correct answers with concise wording.
- •Instead of checking only at the end, the logic checker gives real-time feedback like 'this step is inconsistent' or 'no counterexample exists,' guiding the next step.
- •Across six tough benchmarks, the 7B and 14B models beat strong baselines by notable margins and set new state-of-the-art results in several tasks.
- •Making verification flexible (compute normally, verify when it helps) works better than forcing a formal proof after every single step.
- •The trained models use more symbolic/logic tools (like Z3 and SymPy) and rely less on brute-force search, which improves generalization.
- •There is extra compute cost and occasional trouble auto-translating messy language into strict logic, but the gains outweigh the pain.
- •This approach points toward AI that reasons more like a careful scientist: explain, check, correct, and only then answer.
Why This Research Matters
When AI reasons about schoolwork, business, or science, we want it to be both confident and correct. This approach teaches AI to check its own logic while thinking, reducing contradictions and catching errors early. That means more reliable tutoring, clearer explanations, and safer assistance in planning or analysis. In real tasks, the model learns to compute normally and verify only when it helps, saving time while keeping rigor. Over time, such systems can become dependable partners that show their work, prove key steps, and earn trust. This is a foundation for AI that doesn’t just talk well but thinks carefully, like a good scientist or teacher.
Detailed Explanation
01Background & Problem Definition
You know how sometimes you solve a math problem and guess a number that just happens to be right, even if your steps were a bit wobbly? That feels okay in a hurry, but it’s not reliable when the problems get harder or when your answer affects real people. That is where today’s story begins: big AI models can sound confident and even land on the right final answer, but their steps can hide logical slips.
Before this work, large language models were already good at talking through problems step by step using something called chain-of-thought. They could handle puzzles, math, and logic better than older systems. But because these models pick the next word by probability, they don’t automatically check whether each step follows the rules of logic. This means they can be inconsistent: two steps might contradict each other, or a conclusion might not really follow from the earlier steps. As a result, models sometimes “reward hack”—they find shortcuts or patterns that get the right final label without learning the true reasoning.
🍞 Hook: Imagine taking a multiple-choice test where you learn which letter gets points, but you never learn the subject. You might memorize patterns that work on practice tests but fail on real ones.
🥬 The Concept (Reward Hacking): It is when a system chases the reward (like a good score) in a way that doesn’t match the true goal (sound reasoning). How it works: 1) The model notices shortcuts that often yield correct answers. 2) It repeats those shortcuts. 3) It never builds solid logic. Why it matters: Without preventing this, the model can get the right answers for the wrong reasons and collapse when the pattern changes.
🍞 Anchor: A model learns that answers to certain word problems often end with a simple fraction, so it guesses that fraction even when the steps don’t support it. It sometimes gets points, but the logic is broken.
People tried to fix this. Some teams trained a second model to judge the reasoning process step by step. This helped, but the judge itself was still a language model, so it could be biased or miss subtle contradictions. Other teams used formal tools (like theorem provers or code interpreters) to check logic, but often only after the full answer was already written (post-hoc), or only in a narrow area like math. That means errors slipped through during the actual thinking and got baked into later steps.
🍞 Hook: You know how it’s easier to erase a pencil mistake right after you write it than to redo the whole page later?
🥬 The Concept (Post-hoc vs. In-process Checking): Post-hoc checking happens after you finish; in-process checking happens while you write. How it works: 1) In-process checks each new step. 2) If a step fails, you fix it before moving on. 3) This stops mistakes from spreading. Why it matters: Without in-process checks, a tiny early error can snowball into a wrong final answer.
🍞 Anchor: Writing a proof in math class and asking your teacher to glance at each step is better than finishing the whole proof and hoping it’s fine.
The missing piece was a way to blend the model’s natural language thinking with a strict logic checker that could jump in during the reasoning, not only at the end. And it needed to work beyond just math problems—on logic, science, and general reasoning.
🍞 Hook: Picture a climbing partner who checks each knot as you go up the wall, not just at the top.
🥬 The Concept (Formal Logic Verification): It is a precise, machine-checkable way to test whether a step follows rules of logic. How it works: 1) Translate a claim into a formal language. 2) Ask a solver if it’s consistent or find a counterexample. 3) Use the result to accept, fix, or rethink the step. Why it matters: Without this, the model can sound sure but be wrong; with it, each step is grounded in rules.
🍞 Anchor: If the model says “Diana scored higher than Bob,” the checker tries to satisfy all given constraints. If it finds a conflict, it flags the claim and pushes a correction.
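To make that concrete, here is a minimal sketch of such a check using the Z3 solver in Python. The puzzle constraints and names below are invented for illustration (the paper's auto-formalization pipeline is more elaborate); the point is that testing a claim and its opposite lets the checker report "forced," "contradiction," or "undetermined."

```python
# Minimal sketch of a formal-logic check with Z3 (pip install z3-solver).
# The puzzle constraints below are invented for illustration.
from z3 import Ints, Not, Solver, sat

alice, bob, carol, diana = Ints("alice bob carol diana")
constraints = [alice > bob, alice > carol, carol > diana]

def classify(claim):
    """Return 'forced', 'contradiction', or 'undetermined' for a claim."""
    def consistent(extra):
        s = Solver()
        s.add(constraints)
        s.add(extra)
        return s.check() == sat

    claim_possible = consistent(claim)
    opposite_possible = consistent(Not(claim))
    if claim_possible and not opposite_possible:
        return "forced"            # no counterexample exists
    if not claim_possible:
        return "contradiction"     # conflicts with the given facts
    return "undetermined"          # both orderings are still possible

print(classify(diana > bob))    # undetermined: the facts allow either order
print(classify(diana > alice))  # contradiction: alice > carol > diana
print(classify(alice > diana))  # forced: follows by transitivity
```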
This paper fills the gap by interleaving formal verification and natural language generation. Instead of letting errors pile up, the model gets real-time feedback to catch and correct mistakes. The training has two stages: first teach the model to write and check steps (so it learns the rhythm), then use reinforcement learning to reward clean structure, smart tool use, and correct answers. The stakes are big: better math help for students, safer reasoning in planning tasks, clearer scientific explanations, and more trustworthy AI that doesn’t just guess well—it reasons well.
02Core Idea
🍞 Hook: Imagine writing a mystery story with a friend who is a logic referee. Every time you write a clue, your friend instantly tells you, “That fits the facts,” or “Nope, that contradicts page 2.”
🥬 The Concept (Aha!): Interleave a strict logic checker inside the model’s step-by-step thinking so each step can be verified, corrected, and improved before moving on. How it works: 1) The model writes a small natural-language step. 2) It generates a matching formal version of that step. 3) A solver checks the formal step and returns feedback (satisfiable, unsatisfiable, or errors/counterexamples). 4) The model uses this feedback to revise or continue. 5) Repeat until a final, verified answer is produced. Why it matters: Without this interleaving, wrong steps sneak in and mislead later steps. With it, the model learns to build sound arguments and avoids reward hacking.
🍞 Anchor: In a ranking puzzle (who scored higher?), the model proposes “Diana > Bob.” The solver tries to fit that with all constraints. If impossible, it flags it, and the model corrects to “Undetermined” or another consistent relation.
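A toy version of that loop is sketched below. The steps, constraints, and revision are hard-coded here for illustration; in the real system the model itself proposes each step and its formal encoding, and the solver's feedback decides whether the step is accepted or revised before the chain continues.

```python
# Toy control flow for the interleaved reason-check-revise loop.
from z3 import Ints, Solver, sat

alice, bob, carol = Ints("alice bob carol")
facts = [alice > bob, bob > carol]

def verify(claim):
    """One in-loop solver call: does the claim fit all known facts?"""
    s = Solver()
    s.add(facts)
    s.add(claim)
    return "consistent" if s.check() == sat else "contradiction"

proposed = [
    ("Carol finished ahead of Alice", carol > alice),  # a bad step
    ("Alice finished ahead of Carol", alice > carol),  # its revision
]

chain = []
for text, formal in proposed:
    feedback = verify(formal)
    print(f"{text!r:36} -> {feedback}")
    if feedback == "consistent":
        chain.append(text)   # accept the step and keep reasoning
    # on 'contradiction' the model would revise before moving on
```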
Three analogies to lock it in:
- Referee on the field: Each play (reasoning step) happens only if it follows the rules; fouls are called immediately.
- Spell-check for logic: As you type, red squiggles appear under invalid logic, and you fix it before continuing.
- GPS with live rerouting: If your chosen step hits a dead end, you get instant reroute instructions before you waste time.
Before vs. After:
- Before: The model wrote a full chain, and only then might someone try to check it. Early mistakes contaminated later reasoning, and shortcuts were rewarded if they matched the final answer.
- After: The model writes in smaller bites, each one checked. Errors are caught early. Rewards encourage both correct answers and well-formed, verifiable steps.
Why it works (intuition, not equations):
- It narrows the search to logically consistent paths, so the model explores fewer wrong turns.
- The feedback is dense: instead of a single pass/fail at the end, each step gives a signal (“good,” “fix,” “contradiction,” “counterexample”).
- It creates healthy habits: the model learns to write steps that are easy to formalize and verify, improving clarity and structure.
- It generalizes: logical consistency is a universal rule that helps across math, logic, and many knowledge tasks.
🍞 Hook: Think of the building blocks as LEGO bricks with snap-checks that ensure each new brick truly connects.
🥬 Building Blocks:
- Interleaved Steps: Natural text step → formal step → verification feedback.
- Execution-based Validation: Every formal piece is actually executed and checked to avoid fake formalism.
- Two-Stage Training: First learn the format (SFT), then learn the policy (RL) of when and how to reason and verify.
- Hierarchical Rewards: 1) Block broken formats. 2) Encourage tidy tool use. 3) Reward correct, concise answers.
- Flexible Verification: Use logic tools when they add value; don’t force them for trivial arithmetic.
🍞 Anchor: In an economics elasticity question, the model formalizes “moving northwest means higher price and lower quantity” and proves with a solver that elasticity must increase along a linear demand curve in that direction. The solver’s UNSAT on the opposite claim acts like a stamp saying, “No counterexample exists,” and the model answers confidently and correctly.
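The "no counterexample exists" stamp in that anchor can be reproduced with a few lines of Z3. This is a sketch under simplifying assumptions: a linear demand curve Q = a - b*P, with the denominators cleared by hand so every constraint is polynomial.

```python
# Sketch: prove "elasticity magnitude rises as price rises along a linear
# demand curve Q = a - b*P" by showing the opposite claim is unsatisfiable.
from z3 import Reals, Solver

a, b, p1, p2 = Reals("a b p1 p2")

s = Solver()
s.add(a > 0, b > 0)                      # downward-sloping linear demand
s.add(0 < p1, p1 < p2)                   # moving northwest: price goes up
s.add(a - b * p1 > 0, a - b * p2 > 0)    # quantity stays positive at both points
# Opposite claim: elasticity b*P/(a - b*P) does NOT increase from p1 to p2,
# written with the positive denominators multiplied through:
s.add(b * p2 * (a - b * p1) <= b * p1 * (a - b * p2))

print(s.check())  # unsat: no counterexample exists, so elasticity must rise
```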
03Methodology
At a high level: Problem → Stage 1 (SFT with verified examples) → Stage 2 (RL with structured rewards and in-loop verification) → Final answer.
Stage 1: Formal-Logic-Verification-Guided SFT (learn the rhythm)
- What happens: A strong teacher model generates correct chain-of-thought solutions. These chains are split into small modules. For each module, the system builds a matching formal snippet (like constraints or small programs) and predicts what its execution should output. Then the formal snippet is actually executed in a sandbox. If the real output matches the expected one (or is semantically equivalent), the pair is kept. If not, the natural-language step is rewritten to align with the true execution. Only tightly aligned triples (text step, formal code, execution output) are kept for training.
- Why this step exists: If the model learns from sloppy or fake formal proofs, it will copy bad habits. Execution-based validation filters out noisy pairs and teaches the model to write steps that are both clear and checkable.
- Example: For a logic puzzle, a module might say, “Because Alice > Bob and Bob > Carol, Alice > Carol.” The formal snippet encodes these relations; the solver confirms transitivity. If the snippet returns an unexpected result, the text step gets rewritten to match reality before it enters the training set.
🍞 Hook: Like practicing piano with a metronome that beeps if you slip off-beat.
🥬 The Concept (Execution-based Validation): It checks each formal step by actually running it. How it works: 1) Create the formal step. 2) Execute it in a sandbox. 3) Compare results to expectations; rewrite if needed. Why it matters: Without execution, formal steps could look right but be wrong; execution keeps everything honest.
🍞 Anchor: If a code block claims “the solver proves Diana > Bob,” but when executed it returns “unknown” or “unsat,” the pipeline rejects or fixes it before training.
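Here is a toy version of that filter. It is illustrative only: the file-based "sandbox," the exact-match comparison, and the example snippet (which needs z3-solver installed) are simplifications of the paper's pipeline, which also allows semantic-equivalence matches and rewrites misaligned text steps.

```python
import subprocess, sys, tempfile

def validate_module(formal_code: str, predicted_output: str, timeout_s: float = 5.0) -> str:
    """Toy execution-based filter: run the formal snippet in a subprocess and
    keep the training triple only if the real output matches the prediction."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(formal_code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "reject"                                   # fatal: did not terminate
    if proc.returncode != 0:
        return "reject"                                   # fatal: snippet crashed
    if proc.stdout.strip() == predicted_output.strip():
        return "keep"                                     # aligned triple: train on it
    return "rewrite"                                      # fix the text step to match reality

snippet = """
from z3 import Ints, Solver
a, b, c = Ints("a b c")
s = Solver()
s.add(a > b, b > c, c >= a)   # transitivity makes this impossible
print(s.check())
"""
print(validate_module(snippet, "unsat"))  # keep
```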
Stage 2: Reinforcement Learning with Interleaved Verification (learn the strategy)
- What happens: Now the model generates its own reasoning with optional calls to tools (like a symbolic solver). After each formal call, it reads the feedback and decides what to do next: accept, fix, or try another path. A composite reward guides learning.
- Why this step exists: SFT teaches format and habits, but RL teaches decision-making: when to verify, how much to write, when to revise, and when to stop.
- Example with data: The model answers a TheoremQA item. It proposes a claim, encodes it, gets “unsat” from the solver (meaning the claim cannot hold with the constraints), then revises to a consistent claim and concludes with a correct proof sketch and answer.
🍞 Hook: Think of a teacher who grades not only the final essay, but also the outline, structure, and use of sources.
🥬 The Concept (Hierarchical Reward): A layered scoring system that first blocks broken outputs, then scores structure, and finally correctness. How it works: 1) Level 1: Fatal errors (timeouts, loops, way too many tool calls) get a strong penalty. 2) Level 2: Format issues (missing tags, no final answer) get a smaller penalty. 3) Level 3: Valid responses earn structural points (neat tags, reasonable tool use) plus correctness points (right answer, not too wordy). Why it matters: Without this, the model might game the reward or get stuck in loops; with it, the model learns safe, tidy, and accurate reasoning.
🍞 Anchor: A response with six solver calls (limit is three) gets clipped early with a penalty; a tidy, correct answer with one well-placed verification gets a high reward.
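A sketch of what such a layered scorer could look like is shown below. The thresholds, weights, and field names are invented to illustrate the three levels; they are not the paper's exact values.

```python
def hierarchical_reward(resp: dict, answer_correct: bool,
                        max_tool_calls: int = 3, max_tokens: int = 4096) -> float:
    """Three-level reward sketch: block fatal behaviour, then penalize format
    problems, then score structure and correctness. Weights are illustrative."""
    # Level 1: fatal errors get a strong penalty and end scoring immediately.
    if resp["timed_out"] or resp["looping"] or resp["tool_calls"] > max_tool_calls:
        return -1.0
    # Level 2: recoverable format issues (broken tags, no final answer).
    if not resp["tags_well_formed"] or resp["final_answer"] is None:
        return -0.5
    # Level 3: valid responses earn structural points plus correctness points.
    reward = 0.2                                       # clean, well-tagged structure
    if 1 <= resp["tool_calls"] <= max_tool_calls:
        reward += 0.1                                  # tidy, bounded tool use
    if answer_correct:
        reward += 1.0
        if resp["num_tokens"] <= max_tokens:
            reward += 0.2                              # concise and correct scores best
    return reward

tidy = {"timed_out": False, "looping": False, "tool_calls": 1,
        "tags_well_formed": True, "final_answer": "B", "num_tokens": 900}
print(hierarchical_reward(tidy, answer_correct=True))  # 1.5
```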
Optimization trick (Group Relative Policy Optimization, GRPO)
- What happens: For a given question, the model tries several answers. Each gets a reward. Instead of caring about the raw number, the learning focuses on which tries are better than the others, which stabilizes training.
- Why this step exists: Relative scoring reduces noise and encourages steady improvement.
- Example: On a hard math item, eight attempts are sampled. The two that are structured, concise, and correct get higher adjusted scores; the policy shifts toward those patterns.
🍞 Hook: You know how you improve faster by comparing attempts and copying the best one?
🥬 The Concept (GRPO): A reinforcement learning method that improves the policy by favoring relatively better samples within a group. How it works: 1) Sample multiple answers. 2) Score them. 3) Push the model toward the top performers, with safety via clipping and a KL term. Why it matters: Without relative focus, training can wobble; with it, learning is steadier.
🍞 Anchor: If three versions differ only in when they verify, and the mid-verify one consistently succeeds, GRPO nudges the model to prefer that timing.
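In code, the "compare within a group" idea boils down to a few lines. This is a sketch of the advantage computation only; as noted above, the full GRPO objective also uses clipped probability ratios and a KL penalty.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each sampled answer for the same question is
    scored relative to its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Eight attempts at one hard question; the highest-reward attempts get the
# largest positive advantages and pull the policy toward their patterns.
print(group_relative_advantages([1.5, -0.5, 0.3, 1.5, -1.0, 0.3, -0.5, 1.2]))
```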
Flexible vs. Enforced Verification
- What happens: The team tested forcing verification after every tiny step vs. letting the model compute normally and verify when helpful. Forcing added overhead and sometimes hurt simple arithmetic tasks (over-formalizing). Flexible usage kept math performance high while preserving logic wins.
- Why this step exists: Not every step needs a formal proof; use the right tool at the right time.
- Example: For 2 + 2, don’t call a theorem prover. For “this property holds for all values,” do call it.
🍞 Hook: Don’t use a microscope to read a billboard.
🥬 The Concept (Interleaved, Flexible Verification): Verification is available in the loop and used when it adds value. How it works: 1) The model decides to verify or not. 2) If verified, use feedback to refine. 3) Keep tool calls within limits. Why it matters: Without flexibility, you slow down and may confuse simple tasks; with it, you get the best of both worlds.
🍞 Anchor: On a geometry problem, the model computes lengths directly, but when checking a general angle relationship, it uses the solver to confirm no counterexample exists.
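As a rough illustration of "verify when it helps, within a budget," a wrapper like the one below could gate solver calls. The cap of three echoes the earlier example in this section; in the actual system the decision of when to verify is learned during RL, not hand-coded like this.

```python
from z3 import Reals, Solver, sat, unsat

class BudgetedVerifier:
    """Toy wrapper for flexible, in-loop verification with a hard call cap."""

    def __init__(self, constraints, max_calls: int = 3):
        self.constraints = list(constraints)
        self.max_calls = max_calls
        self.calls = 0

    def verify(self, claim) -> str:
        if self.calls >= self.max_calls:
            return "budget_exhausted"     # further calls would be penalized
        self.calls += 1
        s = Solver()
        s.add(self.constraints)
        s.add(claim)
        result = s.check()
        if result == sat:
            return "consistent"
        if result == unsat:
            return "contradiction"
        return "unknown"

x, y = Reals("x y")
v = BudgetedVerifier([x > y])
print(v.verify(y > x))   # contradiction
print(v.verify(x > 0))   # consistent: nothing rules it out
```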
The Secret Sauce
- Real-time, execution-backed feedback makes every training example trustworthy.
- The reward system teaches the model to be both careful and efficient.
- Interleaving verification changes behavior: more symbolic reasoning, less brute force, better generalization.
Pipeline summary (recipe):
- 1) Gather problems and teacher solutions; slice into steps.
- 2) Auto-formalize steps; predict outputs.
- 3) Execute formal steps; keep or rewrite to match.
- 4) Train SFT on aligned triples (text, code, output).
- 5) Do RL where the model thinks, optionally verifies, reads tool feedback, and continues.
- 6) Apply hierarchical rewards and GRPO to improve policy.
- 7) Cap tool calls; penalize loops and bad formats.
- 8) Prefer flexible verification for speed and accuracy.
04Experiments & Results
The Test: The authors evaluated reasoning across logic, math, and general domains. They measured not only final accuracy but also whether models could maintain consistent, checkable reasoning. Benchmarks included KOR and BBH (logic), MATH-500 and AIME 2024 (math), plus GPQA-Diamond and TheoremQA (general and theorem application).
The Competition: They compared against strong baselines: base models, RL-trained models using language-only rewards, tool-integrated reasoners that compute with Python, and systems that audit after the fact. The key difference: those baselines typically either checked at the end or focused on narrow domains, while the new method checked during the process across diverse tasks.
The Scoreboard (with context):
- 7B scale: FLV-SFT (just the supervised stage with formal verification) already beat several RL baselines trained without formal tools, averaging about 49.8. That’s like getting a solid A when others are at B or B+. Adding FLV-RL nudged this further to about 51.9, setting new highs among peers.
- 14B scale: FLV-SFT reached roughly 55.7 average; FLV-RL jumped to about 58.6, a clear lead over comparable models. On AIME 2024, performance rose notably (for example, up to about 30%), almost doubling some baselines—like turning a difficult test from a D into a solid C+ or B- in a very competitive class. On TheoremQA, the system set a new bar (~63.5%), showing strong formal reasoning.
- Across all six benchmarks, the approach pushed new state-of-the-art results for its size range, with average margins reported at about +10.4% (7B) and +14.2% (14B) over previous strong methods.
Surprising Findings:
- Flexible > Enforced verification: Forcing verification every tiny step sometimes backfired, especially on simple arithmetic, adding cognitive overhead and slowing the model. Letting the model compute normally and verify only when it mattered preserved math strength and kept logic gains.
- Symbolic shift: The trained model used symbolic/logic tools much more often than brute-force or generic utilities. This suggests it learned to think abstractly and verify claims, not just grind through numbers.
- Token economy: Yes, the approach used more tokens (longer responses) than some baselines, but the jump in accuracy and reliability made the extra cost worthwhile for tough reasoning tasks.
What does a win look like?
- Logic tasks (KOR, BBH): The interleaved checks caught contradictions early, improving final answers. Think: fewer logical knots, more consistent chains.
- Math (MATH-500, AIME): When properties were general (like inequalities or identities), verification shined. For straight computation, the model skipped heavy formalism and stayed efficient—an important balance.
- TheoremQA and GPQA-D: The system proved especially strong at applying formal theorems and reasoning with structure; GPQA showed mixed signals due to benchmark issues, but the trend still favored careful logic.
Takeaway: Real-time verification changed how the model reasons. It spent more brainpower on symbolic logic and less on trial-and-error or formatting fluff. The result was higher, more stable scores across different kinds of challenges.
05Discussion & Limitations
Limitations (what this can’t do yet):
- Extra compute: Checking logic in the loop makes training slower (about 2× over some baselines). In high-volume settings, that cost matters.
- Auto-formalization struggles: Turning messy, commonsense-heavy language into clean formal logic can misfire. If the translation is off, the checker may give the wrong feedback.
- Over-formalizing simple steps: If the model verifies every tiny arithmetic fact, it can waste time and even confuse itself. This is why the paper favors flexible verification.
Required Resources:
- A capable base model (7B–14B), a logic solver (like Z3), a safe sandbox to run code, and multi-GPU training (the authors used 16 H800s for RL). Good teacher and judge models help during data building.
When NOT to Use:
- Very simple arithmetic or single-step facts that don’t benefit from formal logic—just compute and move on.
- Super-ambiguous language questions with no stable formalization; a human-in-the-loop or better parsing may be needed first.
- Ultra-low-latency use cases where every millisecond counts; interleaving checks may be too expensive unless carefully tuned.
Open Questions:
- Smarter auto-formalization: How can we translate everyday language into reliable formal constraints with fewer mistakes?
- Verification policies: Can the model learn exactly when to verify for best speed/accuracy trade-offs?
- Broader domains: How far can this go beyond math and logic—law, medicine, engineering—without huge domain-specific libraries?
- Safety and oversight: Can formal verification help detect deceptive reasoning or dangerous plans more reliably?
- Human collaboration: What’s the best way for people to guide or audit verification steps without needing to be logic experts?
Honest assessment: The method clearly improves reasoning quality and consistency, especially on multi-step logic. It costs more compute and needs careful design to avoid over-formalizing. But the gains in clarity, correctness, and generalization make it a strong step toward trustworthy AI reasoning.
06Conclusion & Future Work
Three-sentence summary: This paper teaches language models to think with a logic referee sitting beside them, checking steps as they go. By first learning from tightly validated examples and then optimizing with a reward that values safety, structure, and correctness, the model avoids sneaky shortcuts and builds true reasoning skill. Across multiple hard benchmarks, this pushes performance to new levels with more reliable, explainable steps.
Main Achievement: The first broadly effective, interleaved verification framework that blends natural-language thought with formal logic checking in real time, supported by an execution-validated SFT pipeline and a hierarchical RL reward design.
Future Directions: Improve auto-formalization for messy text, learn policies for when to verify to cut costs, expand beyond math and logic to highly specialized domains, and connect verification to safety auditing for high-stakes decisions. Better human-AI interfaces could let non-experts see and trust the verified steps.
Why Remember This: It shows a practical path from sounding smart to being logically sound. Instead of hoping the final answer is right, the model now builds, checks, and fixes its path—like a careful scientist—making AI more trustworthy for classrooms, labs, and real-world problem solving.
Practical Applications
- •Math tutoring that shows each step and verifies key claims, reducing hidden mistakes.
- •Logic puzzle assistants that can guarantee no contradictions in the final solution.
- •Coding helpers that verify preconditions, postconditions, or invariants before suggesting fixes.
- •Spreadsheet and financial model auditors that check constraints (like balance rules) while you build formulas.
- •Science lab note checkers that flag inconsistent assumptions in hypotheses and results.
- •Legal or policy drafting aids that detect logical conflicts across clauses and suggest consistent rewrites.
- •Technical interview practice tools that guide candidates to correct reasoning paths with verified steps.
- •Research assistants that validate theorem applications or counterexample searches before concluding.
- •Operations planning systems that verify schedule and resource constraints during scenario building.
- •Education platforms that grade not only final answers but also verified reasoning processes.