
Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics

Intermediate
Junqi Liu, Zihao Zhou, Zekai Zhu et al. Ā· 1/20/2026
arXiv Ā· PDF

Key Summary

  • Numina-Lean-Agent is a new open system that uses a general coding agent to write and check formal math proofs in Lean without special training.
  • It plugs into Lean through MCP tools, retrieves the right theorems with LeanDex, and even drafts human-style solution ideas using an Informal Prover.
  • On the Putnam 2025 benchmark, it solved all 12 problems, matching the best closed system and beating others on several tasks.
  • A clever 'subagent' trick splits very long proofs into smaller goals so the model can focus better and avoid getting lost.
  • A 'blueprint' plan helps the agent map out big theorems into bite-sized lemmas and revise the plan when Lean’s feedback shows gaps.
  • When stuck, a 'Discussion Partner' lets the agent ask other models for hints, like a study buddy brainstorming new proof paths.
  • The system can collaborate with mathematicians; together they formalized parts of the Effective Brascamp–Lieb theorem (over 8,000 lines).
  • Proofs are correct but can be verbose, and type conversions (like Real to NNReal) can slow things down.
  • Because it’s a general coding agent, you can swap in a stronger base model to get better results without retraining.
  • Everything is released openly so others can reproduce, inspect, and extend the system.

Why This Research Matters

Math underlies software, science, and engineering, so having proofs that are both smart and machine-checked boosts trust in real-world systems. A general coding agent that can plan, retrieve facts, and verify every step helps catch subtle mistakes that humans might miss. Because the tools are modular and open, the community can extend the system to new areas, from algebra to analysis to geometry, and reproduce results. The approach also supports human–AI collaboration, speeding up research by letting experts guide the blueprint while the agent fills in details. As base models improve, you can swap them in for instant gains without retraining, so progress compounds. In education, students can see precise, checked solutions and learn how informal ideas map to formal proofs.

Detailed Explanation


01Background & Problem Definition

šŸž Hook: Imagine you’re building a Lego castle. If every block snaps together perfectly, you know the castle is sturdy. In math, we want the same kind of snap-together certainty.

🄬 The Concept (Lean Theorem Prover): What it is: Lean is a computer system that checks whether each step of a math proof snaps together correctly, so the whole proof is rock-solid. How it works:

  1. You write a proof in a special math language.
  2. Lean’s checker inspects each step using strict logic rules.
  3. If every step fits, Lean says ā€œOKā€; if not, it shows exactly where things don’t match.

Why it matters: Without a checker like Lean, mistakes can hide inside long proofs; with Lean, errors are caught early and precisely.

šŸž Anchor: Just like a Lego block that won’t fit if it’s the wrong shape, Lean refuses a proof step that doesn’t match the rules.
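
To make this concrete, here is a tiny example of the kind of check Lean performs (a minimal sketch, assuming Lean 4 with Mathlib available):

```lean
import Mathlib.Tactic

-- Lean accepts this proof only because the `ring` step checks.
-- Change `b + a` to `b * a` and Lean rejects it with a precise
-- error pointing at the failing step.
theorem snap_together (a b : ā„•) : a + b = b + a := by
  ring
```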

šŸž Hook: You know how you plan a trip—pick a goal, choose tools (car, bus, map), and make decisions along the way?

🄬 The Concept (Agentic Reasoning): What it is: Agentic reasoning means an AI acts like a planner: it has goals, chooses tools, tries ideas, learns from feedback, and adjusts. How it works:

  1. Set a goal (prove a theorem).
  2. Pick helpful tools (search, draft, check).
  3. Try a plan; read feedback.
  4. Improve the plan and repeat until it works.

Why it matters: Without agency, the AI would just guess; with agency, it learns from each attempt and gets smarter about what to try next.

šŸž Anchor: Like a traveler who reroutes after hitting traffic, an agentic prover changes tactics when a proof path gets blocked.

The world before: Early AI provers usually did one thing at a time: predict the next little tactic or generate an entire proof in one go. They could be fast or clever, but they often got lost on long, twisty proofs. People tried search trees (like exploring many paths in a maze) and whole-proof generators (like writing the whole essay at once). Each had limits: searches were slow; one-shot essays missed tiny details Lean cares about. Hybrid systems then mixed informal reasoning (human-like explanations) with formal tactics, but still struggled with really long, structured arguments and often needed special training and custom pipelines.

šŸž Hook: Think of USB ports on a computer: you plug in new devices without rebuilding the whole machine.

🄬 The Concept (Model Context Protocol, MCP): What it is: MCP is a standard way for an AI to plug into outside tools (like Lean or search) and call them when needed. How it works:

  1. Tools describe what they can do (like a menu).
  2. The agent requests a tool, gives inputs.
  3. The tool runs and returns results.
  4. The agent uses the results to decide next steps.

Why it matters: Without MCP, adding or swapping tools is messy; with it, tools are plug-and-play.

šŸž Anchor: Like plugging a printer into a USB port, the agent can ā€œplugā€ into Lean or a theorem searcher without rewiring itself.
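
As a rough illustration, MCP is built on JSON-RPC 2.0, so a tool call is just a structured message. This sketch shows the general shape of a ā€œtools/callā€ request; the tool name and arguments are hypothetical stand-ins, not the paper's actual interface:

```python
import json

# Hedged sketch: the shape of an MCP "tools/call" request (MCP is
# built on JSON-RPC 2.0). The tool name and arguments below are
# hypothetical stand-ins for the Lean tools described in the paper.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "lean_goal",  # hypothetical Lean MCP tool
        "arguments": {"file": "Putnam2025/A1.lean", "line": 42},
    },
}
print(json.dumps(request, indent=2))
```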

The problem: Powerful agentic provers existed, but many were closed-source, hard to reproduce, tightly tied to special training, or used hand-crafted, task-specific pipelines. That made them hard to extend, inspect, or adapt.

šŸž Hook: Imagine upgrading your game console by just switching the game cartridge to a newer, stronger one—no need to rebuild the console.

🄬 The Concept (Numina-Lean-Agent): What it is: Numina-Lean-Agent is a general coding agent that proves math in Lean by flexibly choosing tools via MCP—no retraining required when you swap in a stronger base model. How it works:

  1. Read the math goal.
  2. Choose tools (Lean checker, theorem searcher, informal solver, discussion partner).
  3. Try proof steps; read Lean’s feedback.
  4. Revise the plan, retry, and finalize a Lean-verified proof.

Why it matters: Without a flexible, general agent, provers are brittle and hard to extend; with it, you can adapt quickly to new problems and tools.

šŸž Anchor: It’s like a smart coder who can compile, search docs, ask a friend, and refactor code until the program runs—except here the ā€œprogramā€ is a formal proof.
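
A minimal sketch of that loop, with stub functions standing in for the real MCP tools (an illustration of the idea, not the authors' implementation):

```python
from typing import Optional, Tuple

# Hypothetical stubs standing in for the real MCP tools.
def draft_informal(goal: str) -> str:           # Informal Prover
    return f"high-level plan for: {goal}"

def search_lemmas(goal: str) -> list[str]:      # LeanDex-style retrieval
    return ["Nat.add_comm"]

def write_lean(goal: str, plan: str, facts: list[str]) -> str:
    return f"theorem t : {goal} := by ring"     # base model drafts Lean code

def run_lean(proof: str) -> Tuple[bool, str]:   # Lean is the ground-truth checker
    return True, ""

def prove(goal: str, max_iters: int = 10) -> Optional[str]:
    plan = draft_informal(goal)
    for _ in range(max_iters):
        facts = search_lemmas(goal)
        proof = write_lean(goal, plan, facts)
        ok, errors = run_lean(proof)
        if ok:
            return proof                         # Lean-verified proof
        plan = f"{plan}\nrepair using feedback: {errors}"
    return None

print(prove("āˆ€ a b : ā„•, a + b = b + a"))
```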

Real stakes: Getting robust, checkable math matters for safety-critical software, science, and education. A system that can both think like a coder and verify like a logician helps catch hidden errors, speed up research, and make advanced math more accessible. And because Numina-Lean-Agent is open and general, the community can reproduce results, plug in new tools, and steadily improve the ecosystem.

Failed attempts and the gap: Prior work either trained specialized provers (powerful but rigid) or hand-built pipelines (clever but narrow). The missing piece was a general coding agent, connected through MCP, that can call the right tool at the right time, learn from Lean’s live feedback, and scale to long, subtle arguments without bespoke training.

šŸž Hook: Think of a team with roles: a code editor, a search engine, a writing buddy. You get more done when they cooperate.

🄬 The Concept (Claude Code): What it is: Claude Code is the base coding model that pilots the agent: it writes Lean code, reads errors, calls tools, and revises. How it works:

  1. Interpret the goal and the current Lean state.
  2. Decide which MCP tool to call.
  3. Generate or fix Lean code based on results.
  4. Iterate until Lean accepts the proof.

Why it matters: Without a strong coding pilot, the tools don’t coordinate well; with it, the agent stays on course.

šŸž Anchor: Like a skilled programmer who reads compiler errors and fixes code piece by piece, Claude Code steers proofs to completion.

02Core Idea

šŸž Hook: You know how a Swiss Army knife can switch tools on the fly—scissors, knife, screwdriver—so you solve many problems with one handle?

🄬 The Concept (Core Idea of the Paper): What it is: Use a single, general coding agent as the ā€œone handle,ā€ and plug in many math tools (via MCP) so the agent can prove theorems, search libraries, draft human-style ideas, and ask for help—all without retraining. How it works:

  1. Treat formal proving as coding with feedback: write Lean, compile, fix.
  2. Add plug-in tools: Lean interaction, semantic retrieval, informal solution drafting, multi-model discussion.
  3. Let the agent autonomously pick and chain tools as needed.
  4. Swap in a stronger base model to boost performance with zero training.

Why it matters: This unifies many tasks (proving, planning, searching, explaining) into one flexible workflow, making the system powerful, extensible, and reproducible.

šŸž Anchor: Like a coder using IDE features (errors, search, refactor) plus StackOverflow, the agent mixes tools to finish the job.

Three analogies:

  1. Workshop analogy: One master craftsperson (the agent) handles a complex build by picking the right tool (saw, drill, sander) at each step, checking measurements (Lean), and adjusting.
  2. Orchestra analogy: The conductor (agent) cues different sections (retrieval, informal reasoning, formal tactics), listens to feedback (errors), and refines the performance.
  3. Hiking analogy: The hiker (agent) uses a map (blueprint), asks rangers (discussion partner), checks signposts (Lean goals), and chooses different trails (tools) when blocked.

Before vs. After:

  • Before: Many systems were rigid, tuned for one pipeline or trained model, hard to reproduce or extend.
  • After: A general coding agent, plus MCP, means you can add tools like Lego bricks, replace the base model for instant boosts, and keep a single, clear workflow.

Why it works (intuition, not equations):

  • Tight feedback loop: Lean’s exact error messages and goal states are like a compass; the agent never flies blind.
  • Division of labor: Retrieval brings facts, the Informal Prover suggests a game plan, the discussion partner adds fresh angles, and Lean ensures correctness.
  • Modularity: MCP tools are swappable; improvements don’t ripple into complex rewrites.
  • Scaling with model strength: Better base model in, better reasoning out—no retraining necessary.

šŸž Hook: Imagine a research lab where planners, librarians, writers, and reviewers collaborate in a loop to produce a flawless paper.

🄬 The Concept (Building Blocks):

  • Lean-LSP-MCP What it is: A tool that lets the agent see Lean’s goals, run code, and try multiple strategies in parallel attempts. How it works: 1) Query goals and diagnostics; 2) run snippets; 3) compare outcomes; 4) pick the best next move. Why it matters: Without real-time Lean insight, the agent would guess; with it, each step is grounded in truth. Anchor: Like a coder reading compile errors and rerunning tests instantly.

  • LeanDex What it is: A semantic theorem/definition retriever across Lean packages (e.g., mathlib). How it works: 1) Read a natural-language query; 2) interpret intent; 3) fetch matching Lean facts; 4) feed them into the proof. Why it matters: Without good retrieval, the agent invents facts or misses key lemmas; with it, proofs cite real, relevant theorems. Anchor: Like searching a library by meaning, not just exact titles.

  • Informal Prover What it is: A generator–verifier loop that drafts human-style solutions and checks them repeatedly. How it works: 1) Draft a solution; 2) verify it three times independently; 3) if any check fails, provide feedback; 4) refine and repeat (a minimal sketch follows this list). Why it matters: Without high-level ideas, the agent can over-focus on low-level tactics; with them, it follows a smarter plan. Anchor: Like a math student writing an outline, getting teacher notes, and revising.

  • Discussion Partner What it is: A way for the agent to ask other models for hints when stuck. How it works: 1) Share the current state; 2) get multiple perspectives; 3) pick promising trails; 4) continue formalization. Why it matters: Without outside perspectives, the agent can get tunnel vision; with them, it escapes bottlenecks. Anchor: Like phoning a friend during a tough puzzle.
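
A minimal sketch of the Informal Prover's generator–verifier loop mentioned above, with `generate` and `verify` as hypothetical stand-ins for the underlying model calls:

```python
from typing import Callable, Optional, Tuple

# Hedged sketch: each draft must pass three independent checks;
# failed checks feed back into the next draft, up to 20 rounds.
def informal_prove(problem: str,
                   generate: Callable[[str, str], str],
                   verify: Callable[[str], Tuple[bool, str]],
                   max_rounds: int = 20) -> Optional[str]:
    feedback = ""
    for _ in range(max_rounds):
        draft = generate(problem, feedback)
        checks = [verify(draft) for _ in range(3)]  # triple independent check
        if all(ok for ok, _ in checks):
            return draft                            # accepted solution sketch
        feedback = "; ".join(msg for ok, msg in checks if not ok)
    return None
```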

What changes because of this idea: You can attack contest problems (Putnam 2025), sustain long research proofs (Brascamp–Lieb), and transparently share the whole system. It’s not just another trained prover; it’s a general, inspectable, extendable proving lab in one agent.

03Methodology

At a high level: Input (a math statement) → Plan and choose tools → Try Lean steps and retrieve facts → Draft/refine ideas → Ask for help if stuck → Final Lean-verified proof.

šŸž Hook: Picture a chef tackling a complex recipe: read the dish, gather ingredients, try steps, taste, adjust seasoning, and plate perfectly.

🄬 The Concept (Workflow as a Recipe): What it is: A repeatable loop where the agent writes Lean code, checks it, uses retrieval and informal reasoning, and improves until Lean accepts the proof. How it works:

  1. Understand the Goal: The agent inspects the Lean goal and context using Lean-LSP-MCP (goals, diagnostics, file structure).
  2. Make a Plan (Blueprint): For big theorems, the agent drafts a blueprint—a list of lemmas and dependencies (a toy Lean sketch of this appears after this list).
  3. Gather Facts (LeanDex): It searches across mathlib and other packages for relevant theorems and definitions.
  4. Try Proof Steps: The agent writes Lean tactics or structured proofs and runs them with Lean; it can also spin multiple small attempts at a node (multi-attempt) and keep the winner.
  5. Add High-Level Ideas (Informal Prover): If the path is unclear, it drafts a human-style solution and verifies it, repeating up to 20 iterations with triple-checks each time.
  6. Ask for Hints (Discussion Partner): When blocked, it invites alternative reasoning from other models and then adapts.
  7. Refine and Repeat: Lean’s error messages and goal states guide revisions until the proof compiles.

Why it matters: Without this loop, the agent either guesses blindly or drowns in details; the loop keeps it guided, informed, and flexible.

šŸž Anchor: Like a chef tasting and adjusting salt, the agent tastes (compiles) and corrects (edits) until the dish (proof) is right.
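
As a toy illustration of the blueprint in step 2 (placeholder statements, not from the paper): the target theorem is decomposed into named helper lemmas that can be proved, and repaired, independently before being combined:

```lean
import Mathlib

-- Toy blueprint: helper lemmas first, target theorem last.
-- During drafting, unfinished lemmas can be left as `sorry` gaps
-- and filled in one at a time as Lean's feedback arrives.
lemma helper_one (n : ā„•) : n ≤ n + 1 := Nat.le_succ n

lemma helper_two (n : ā„•) : n + 1 ≤ n + 2 := by omega

theorem target (n : ā„•) : n ≤ n + 2 :=
  le_trans (helper_one n) (helper_two n)
```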

Each step in detail with examples:

  • Step 1 (Lean insight): The agent calls lean_goal and lean_diagnostic_messages. If Lean says ā€œtype mismatch between ā„ and ā„ā‰„0,ā€ the agent plans a type conversion lemma first. What breaks without it: You’d fix the wrong thing and waste time. Example: A failed ā€˜ring’ tactic error leads the agent to try field_simp first, then ring—just like adjusting the order of operations (a hedged Lean illustration follows this list).

  • Step 2 (Blueprint planning): The agent lists lemmas L1→L2→…→Theorem, with dependencies. If proving Effective Brascamp–Lieb, it identifies needed linear-algebra bounds and positivity conditions before the final inequality. What breaks without it: The agent commits to a bad path and gets stuck. Example: After Lean rejects an underspecified lemma, the agent strengthens assumptions and inserts a missing helper lemma.

  • Step 3 (LeanDex retrieval): For a goal involving sums of squares, it queries by meaning (e.g., ā€œsum i^2 closed formā€) and surfaces mathlib lemmas that match. What breaks without it: The agent may invent or misremember facts. Example: It retrieves a lemma about Finset.range sums to rewrite the goal correctly.

  • Step 4 (Lean-LSP-MCP execution): It tries lean_run_code on a small snippet and, with lean_multi_attempt, explores two tactic strategies in parallel attempts at the same node. What breaks without it: The agent can waste many steps on a single bad idea. Example: It compares a simp-first script vs. a rewrite-first script and keeps the one that advances the goal.

  • Step 5 (Informal Prover): The generator drafts a solution; the verifier checks it three times independently. If any check fails, it gives feedback and the generator revises. Up to 20 cycles. What breaks without it: The agent might miss the right high-level structure. Example: On B4, iterative refinement solved the problem in 5 rounds, while independent sampling failed within 10 rounds given the same budget.

  • Step 6 (Discussion Partner): When a proof bottleneck appears, the agent asks another model for a different line of attack (e.g., switch from combinatorial counting to a recurrence argument), evaluates the suggestion, and proceeds. What breaks without it: The agent can loop on the same bad idea. Example: For a symmetry lemma, an outside hint suggests an invariant that shortens the proof.

  • Step 7 (Refine): Lean’s messages drive targeted edits: add missing coercions, restructure a proof by cases, or call field_simp before ring. What breaks without it: Random edits won’t converge to a solution. Example: A ā€˜norm’ inequality fails; the agent imports the correct normed group instance and re-tries.
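
Two hedged Lean illustrations of the repairs described in Steps 1 and 7 (toy examples assuming Mathlib, not taken from the paper's transcripts):

```lean
import Mathlib

-- Step 1/7 repair pattern: `ring` alone cannot prove this identity
-- (it would be false at x = 0), so denominators are cleared with
-- `field_simp` first, and `ring` closes the polynomial goal.
example (x : ā„) (hx : x ≠ 0) : (x + 1) ^ 2 / x = x + 2 + 1 / x := by
  field_simp
  ring

-- A coercion fix for the ā„ vs ā„ā‰„0 mismatch from Step 1: state the
-- inequality over ā„ and discharge it with the standard coercion lemma.
example (x : NNReal) : (0 : ā„) ≤ x := x.coe_nonneg
```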

šŸž Hook: For very long puzzles, you don’t carry the whole maze in your head—you split it into rooms and solve them one by one.

🄬 The Concept (Subagent Mechanism): What it is: A way to break huge proofs into subgoals, each handled by a focused mini-agent to avoid context overload. How it works:

  1. Identify a stubborn subgoal (e.g., a key lemma in A5).
  2. Spin up a subagent dedicated to that lemma.
  3. Have it get an informal hint (e.g., via GPT-5.2), then formalize it.
  4. Merge the finished lemma back into the main proof.

Why it matters: Without subagents, long contexts muddle instruction-following and stall progress.

šŸž Anchor: Like solving one room of a maze, marking it done, then moving to the next.
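
A minimal sketch of that decomposition (hypothetical helper names; the real system orchestrates model subagents over MCP):

```python
from typing import Callable, Dict, List, Tuple

# Hedged sketch of subagent decomposition: each lemma is proved in a
# fresh, focused context instead of the whole proof history.
def prove_with_subagents(
    blueprint: List[Tuple[str, str]],                 # (name, statement), dependencies first
    prove_one: Callable[[str, Dict[str, str]], str],  # hypothetical focused subagent
) -> Dict[str, str]:
    done: Dict[str, str] = {}
    for name, statement in blueprint:
        done[name] = prove_one(statement, done)  # small context: finished lemmas only
    return done                                  # merged back into the main proof
```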

The secret sauce:

  • Tool autonomy via MCP: The agent chooses when to retrieve, when to compile, when to brainstorm.
  • Feedback-first proving: Lean’s precise signals keep the loop efficient.
  • Hybrid reasoning: Informal ideas guide formal steps; formal checks keep ideas honest.
  • Modularity and swap-ability: Better base models mean instant upgrades without training.
  • Planning with repair: Blueprints adapt mid-flight based on real compiler evidence.

Resource/budgeting note: For Putnam 2025, each problem had about a $50 budget by default, with tougher ones (A5 ~$1000, B6 ~$300) getting more room to explore. Runs were strictly sequential and internet search was disabled to ensure fair, reproducible conditions.

04Experiments & Results

šŸž Hook: Think of a math contest where different teams try the same 12 problems. Who solves more, who finishes faster, and who writes the cleanest solutions?

🄬 The Concept (Benchmarking on Putnam 2025): What it is: The team tested Numina-Lean-Agent on the 12 problems from Putnam 2025, a tough college-level contest, and compared it to leading provers. How it works:

  1. Use standardized formal statements (from Seed-Prover 1.5) for fairness.
  2. Run the agent sequentially with no internet search, within set budgets.
  3. Compare how many problems are solved, how long they take, and how long the Lean proofs are.

Why it matters: Without head-to-head tests, you can’t tell if a new method really helps.

šŸž Anchor: Like grading each team’s test with the same answer key and time limits.

The test: The main metrics were (a) problems solved, (b) time per problem, and (c) proof length (code lines without comments/blank lines). Competitors included AxiomProver (closed-source), Aristotle, and Seed-Prover 1.5.
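
The proof-length metric is mechanical enough to sketch; this counter is our illustration under stated assumptions, not the authors' script:

```python
# Hedged sketch: count Lean source lines, skipping blanks and `--`
# line comments (block comments ignored here for simplicity).
def proof_length(lean_source: str) -> int:
    return sum(
        1
        for line in lean_source.splitlines()
        if line.strip() and not line.strip().startswith("--")
    )

assert proof_length("theorem t : 1 = 1 := rfl\n\n-- a comment\n") == 1
```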

The scoreboard:

  • Problems solved: Numina-Lean-Agent got 12/12, matching AxiomProver’s perfect score and surpassing Aristotle and Seed-Prover 1.5 on some problems.
  • Time: Despite running without any parallelization, it was faster on a subset of problems compared to others; for example, it beat Axiom on certain instances like A1.
  • Proof length: It often produced shorter Lean proofs than AxiomProver and Seed-Prover 1.5 on many problems, with clear wins on A3, B1, and B5. Step-based Aristotle sometimes yields very short scripts by design, so comparisons there are nuanced; still, Numina-Lean-Agent was notably concise among agentic peers.

Surprising findings:

  • Iterative refinement in the Informal Prover outperformed independent sampling given identical call budgets: on B4, independent sampling failed within 10 rounds, while refinement finished in 5.
  • A subagent strategy on A5 (splitting off a key alternating-permutation lemma) overcame long-context instruction drift. The subagent obtained an informal hint (via GPT-5.2) and then successfully formalized the isolated lemma.
  • Even with strict settings (no internet retrieval, sequential execution), the system maintained strong efficiency, suggesting the tool mix and feedback loops are pulling real weight.

Context for the numbers: ā€œ12/12ā€ is like getting an A+ on a famously tough exam, where capable peers got A or B. Faster times on some tasks show the loop is efficient, not just powerful. Shorter proofs indicate the agent often found cleaner lines of reasoning once facts were retrieved effectively and the blueprint guided structure.

Ablation-style evidence:

  • Without the Informal Prover, several problems remained out of reach; adding it closed more cases.
  • With subagents, very long problems became manageable by reducing context bloat.
  • The tool autonomy (choosing when to retrieve vs. when to compile vs. when to discuss) appears key to both success rates and proof compactness.

Takeaway: The combination of a general coding agent, MCP-based tooling, and feedback-driven planning can match the best closed systems on hard benchmarks while staying open and extensible.

05Discussion & Limitations

šŸž Hook: Even a great backpack has limits—you still need to pack wisely, know the trail, and sometimes ask for help.

🄬 The Concept (Honest Assessment): What it is: A clear look at what the system can’t do yet, what it needs, when not to use it, and open questions for the community. How it works:

  1. List limitations.
  2. State resource needs.
  3. Point to bad-fit scenarios.
  4. Ask questions that guide future work.

Why it matters: Without honest boundaries, users get false expectations; with them, progress is faster and safer.

šŸž Anchor: Like a trail map that marks cliffs and closed paths so hikers choose smart routes.

Limitations:

  • Proof verbosity: While correct, some generated Lean proofs are long or less idiomatic, especially for entire-lemma ā€˜sorry’ gaps.
  • Type-level friction: Conversions (e.g., Real to NNReal) can derail progress and cost time, even when the mathematical idea is fine.
  • Style vs. substance: The code often passes checks but lacks the elegant abstractions beloved by mathlib contributors.

Required resources:

  • Access to a strong base coding model (e.g., Claude Code with Opus 4.5) and auxiliary LLMs (e.g., Gemini for informal proving or hints).
  • API budgets that scale with problem difficulty (e.g., higher for A5/B6).
  • A configured Lean + MCP environment (Lean-LSP-MCP, LeanDex) and compute to run iterative loops.

When not to use:

  • Very small, routine lemmas where a human or a lightweight tactic suffices more quickly.
  • Environments where external tool calls or LLMs are restricted and the MCP ecosystem can’t be deployed.
  • Tasks where code elegance and community style are top priority without human refactoring time.

Open questions:

  • Can we distill the agent’s long scripts into human-quality abstractions automatically?
  • How far can type-level guidance be automated (e.g., proactive coercion planning) to prevent stalls?
  • What’s the best balance between informal brainstorming and formal step-by-step tactics for different domains?
  • How do we measure and improve blueprint quality—granularity, dependency health, and repair speed—in a principled way?
  • Can subagent orchestration be generalized to dynamic, learnable decomposition policies?

06Conclusion & Future Work

Three-sentence summary: Numina-Lean-Agent treats formal proving like coding with a toolbox: it writes Lean, reads precise feedback, retrieves facts, drafts high-level ideas, and even asks for help—all through MCP. With this general agentic approach, it solved all 12 Putnam 2025 problems, matched the best closed system, and supported a real research formalization (Effective Brascamp–Lieb). Because the base model is swappable and the tools are modular, the system is open, reproducible, and ready to grow.

Main achievement: Showing that a general coding agent, not a task-specific trained prover, can deliver state-of-the-art formal math performance while remaining extensible and open.

Future directions:

  • Auto-refactoring to make proofs shorter and more idiomatic after they pass.
  • Smarter type-handling and coercion planning to reduce friction.
  • Learned subagent orchestration and adaptive blueprinting for even longer, trickier theorems.
  • Broader tool ecosystem via MCP (e.g., domain-specific solvers) and richer retrieval.

Why remember this: It’s a template for how AI can do careful, verifiable reasoning by combining flexible planning with strict checking. The approach turns proving into an engineerable process—plan, try, test, revise—so progress compounds as tools and base models improve. In short, it’s a sturdy bridge between human-style ideas and machine-level certainty.

Practical Applications

  • Automating contest-level problem solving with verifiable Lean proofs for math competitions and practice.
  • Assisting researchers in formalizing advanced theorems by generating blueprints and filling in lemma proofs.
  • Serving as a teaching assistant that converts human-style solutions into formal proofs students can study.
  • Auditing mathematical steps in scientific papers by attempting formal reconstructions in Lean.
  • Building safety-critical proofs for algorithms (e.g., optimization, control) with strict machine checks.
  • Refactoring long proofs into shorter, cleaner variants using retrieval and planning to find tighter arguments.
  • Prototyping new math libraries by introducing definitions/lemmas that compile and integrate with mathlib.
  • Exploring alternative proof strategies via the Discussion Partner to escape local dead ends.
  • Benchmarking and comparing theorem-proving tools under reproducible, open conditions.
  • Rapidly upgrading performance by swapping in a stronger base coding model without retraining.
#formal theorem proving Ā· #Lean Ā· #agentic reasoning Ā· #MCP Ā· #semantic retrieval Ā· #proof assistant Ā· #blueprint planning Ā· #subagent decomposition Ā· #informal-to-formal Ā· #mathlib Ā· #Putnam benchmark Ā· #autonomous tool use Ā· #Lean-LSP-MCP Ā· #LeanDex Ā· #Claude Code