Learning to Repair Lean Proofs from Compiler Feedback
Key Summary
- This paper teaches AI how to fix broken Lean math proofs by learning from the compiler's feedback, not just from finished, perfect proofs.
- The authors build APRIL, a 260,000-example dataset that pairs an incorrect proof with the exact error message, the local proof state, a correct repair, and a short human-friendly diagnosis.
- They create errors in four realistic ways (theorem swap, tactic swap, wrong line, wrong multi-line) so models see many natural failure types.
- Training a 4B-parameter model on APRIL boosts single-shot repair accuracy from 1.1% to 27.4%, slightly beating a much larger 32B baseline in the same setting.
- Tactic mistakes are the easiest to fix, theorem swaps are medium, and hallucinated line completions are the hardest.
- Training to both fix the proof and explain the fix makes the model's reasoning clearer, with only a small drop in raw repair accuracy.
- Explanations from the trained model help other models repair better, showing that good feedback can be shared across agents.
- The work shows that diagnostic-conditioned supervision (learning from the red pen) is a powerful, underused training signal for theorem proving agents.
Why This Research Matters
When AI can read precise feedback and apply minimal fixes, it becomes far more efficient and reliable at formal math. This helps researchers and students iterate proofs faster, just like learning from a teacher's notes. Smaller models that use feedback well can rival or beat larger ones, reducing compute and cost. Clear, short explanations make the AI's reasoning transparent and reusable by humans and other agents. The approach generalizes beyond math: compilers and analyzers in programming can provide similar guidance for automatic bug fixing. Ultimately, learning from the red pen enables collaborative, explainable AI assistants that improve rapidly during real work.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine doing a big math worksheet. You don't get an A the first time: you try, see the teacher's red marks, fix mistakes, and try again.
The concept: "Supervised learning" in AI is like guided practice: show the computer examples and the right answers so it learns patterns. How it works:
- Give lots of input–output pairs (like problems and solutions),
- Let the model guess,
- Show it the correct answer,
- Nudge it to guess better next time. Why it matters: Without the right examples, the model practices the wrong thing and doesn't improve where it struggles most. Anchor: If we only show perfect math proofs, an AI never learns how to recover when it goes off track.
Hook: You know how a spelling checker underlines a mistake and says, "I expected this letter here"? That's feedback.
The concept: "Compiler feedback" is the proof assistant's red pen: it tells you exactly where and why a Lean proof fails. How it works:
- You run the proof in Lean,
- Lean pinpoints the failing spot,
- It shows the current goal and error message,
- You use this to decide the next edit. Why it matters: Without feedback, you'd be guessing blindly which line broke and why. Anchor: If Lean says "type mismatch at line 7" and shows the goal, you know where to fix the proof, just like fixing a misspelled word.
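To make the diagnostic concrete, here is a minimal Lean 4 sketch of that situation; the theorem and wording are invented for this summary, not taken from the paper's data.

```lean
-- Repaired version of a tiny proof. In the broken attempt the last line was
-- `exact h`, and Lean flagged that exact spot with a mismatch between the
-- term's type (a = b) and the goal, showing the local state roughly as
--   h : a = b
--   ⊢ b = a
-- That pinpointed diagnostic tells you which edit to make.
example (a b : Nat) (h : a = b) : b = a := by
  exact h.symm
```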
Hook: Think of a chess coach who watches your wrong move and explains exactly what went wrong so you can fix it.
The concept: "Lean proof repair" means fixing a broken formal proof using the proof assistant's feedback. How it works:
- Start from a failing proof attempt,
- Read Lean's error and the local goal,
- Make a targeted edit (tactic/theorem/line),
- Recheck until the proof compiles. Why it matters: End-to-end proof generation can fail often; repair turns a near-miss into a success. Anchor: If a rewrite used the wrong theorem, swapping to the right one may let the rest of the proof work unchanged.
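A minimal Lean 4 sketch of such a one-lemma repair (an illustrative example, not one drawn from the dataset):

```lean
-- Broken attempt: `rw [Nat.mul_comm]`. The goal contains no multiplication,
-- so Lean reports that the rewrite pattern could not be found. Swapping in
-- the intended lemma repairs the proof with no other changes.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```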
The world before: AI theorem provers got better at producing whole proofs from scratch, often with search and reinforcement learning. But training data mostly contained only correct, final proofs. That means models didn't get much practice reading error messages or making small, smart fixes.
The problem: When a proof breaks, today's systems rarely treat that failure as a learning example. They toss it and try again, or use the failure only as a yes/no signal. Models aren't directly taught to interpret Lean's feedback, explain the error, and propose a precise repair.
Failed attempts elsewhere: In programming, researchers learned that training on error–fix pairs works great. But in math proving, public datasets like mathlib-based corpora, Lean Workbook, and Herald almost never include the failed states and corresponding corrections.
The gap: We need a large dataset where each example pairs a broken proof, Lean's diagnostics (error message and goal state), and the exact repaired proof, plus a short, human-friendly explanation grounded in the same feedback.
The real stakes: Iterative repair can be 32–128x more efficient than repeatedly guessing. For students, it's like learning to use the teacher's notes to get better fast. For developers building AI provers, it means smaller models can act smarter by reading and acting on precise feedback.
Hook: Think of practicing piano: playing only perfect pieces won't teach you how to correct a slip mid-song.
The concept: "Diagnostic-conditioned supervision" is training that uses the feedback itself (Lean's red pen) as an input to guide the model's fix. How it works:
- Pair each error with Lean's message and goal,
- Train the model to output both an explanation and the corrected proof,
- Make the explanation refer to the same feedback,
- Repeat across many error types. Why it matters: Without conditioning on the diagnostic, the model can't learn to read and leverage the most informative signal. Anchor: If the diagnostic says "expected A → B but got A × B," the model learns to pick a theorem that outputs an arrow, not a product.
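A Lean 4 sketch of that arrow-versus-product pattern (invented for illustration):

```lean
-- The goal wants a function α → β, but `p` is a pair α × β. A broken attempt
-- that writes `exact p` gets exactly the arrow-vs-product mismatch described
-- above; the repair builds an actual function from the pair instead.
example (α β : Type) (p : α × β) : α → β := by
  exact fun _ => p.2
```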
02 Core Idea
Hook: You know how a coach's notes on your practice sheet tell you exactly what to fix on the next try?
The concept: The key insight is to teach AI to read Lean's feedback and fix just what's wrong, by training on many paired examples of (broken proof, compiler message) → (explanation, corrected proof). How it works:
- Start with correct proofs,
- Create realistic mistakes (four kinds),
- Ask Lean for the exact error message and goal state,
- Provide a matching correct repair and a short explanation,
- Fine-tune models to take the feedback and output both the diagnosis and the fix. Why it matters: Without this, models guess broadly; with it, they make small, smart, feedback-aware edits. Anchor: If a proof fails because a tactic "simp" can't close the goal, try "linarith" in arithmetic contexts; the dataset teaches the model when and why.
Three analogies for the same idea:
- GPS reroute: When a road is closed (error), the GPS reads the sign (diagnostic) and finds the closest detour (targeted repair) instead of restarting the trip.
- Spellcheck: A red underline (error) plus a suggestion (feedback) helps you correct one word, not rewrite the paragraph.
- Taste-and-tweak cooking: If the soup is too salty (feedback), you add a potato (repair) rather than cooking a new pot from scratch.
Before vs After:
- Before: Models mostly learned to output full proofs; when they failed, they tried again without deep training on reading diagnostics.
- After: Models learn to interpret the exact error and propose a minimal, correct edit, making small models competitive in single-shot repair.
Why it works (intuition): Lean's diagnostics carry precise, local information: where the mismatch is, what the goal expects, and which hypotheses are available. By aligning each failure with a ground-truth fix and a short natural-language diagnosis, the model learns a map from specific feedback patterns to specific edits.
Hook: Imagine a well-organized toolbox where each drawer has the exact tool for a known problem.
The concept: "APRIL dataset" is that toolbox: 260,000 examples that pair broken proofs with Lean's messages, corrected proofs, and explanations. How it works:
- Collect many correct proofs,
- Make controlled mistakes (swap theorem/tactic, wrong line/multi-line),
- Run Lean to capture error and goal state,
- Store the correct repair and a concise explanation. Why it matters: Without many labeled failures, the model can't practice feedback-driven fixing. Anchor: If you replace theorem T with related theorem T′ that needs a different hypothesis, the example shows the exact complaint and the exact swap back to T that fixes it.
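To picture one such example as data, here is a minimal Python sketch; the field names and error text are illustrative assumptions, not the paper's exact schema.

```python
# One APRIL-style record (field names and contents are illustrative).
april_record = {
    "mutation_type": "theorem",   # theorem | tactic | line | multi_line
    "broken_proof": "example (a b : Nat) : a + b = b + a := by\n  rw [Nat.mul_comm]",
    "error_message": "<Lean's complaint that the rewrite pattern was not found>",
    "goal_state": "a b : Nat\n⊢ a + b = b + a",
    "repaired_proof": "example (a b : Nat) : a + b = b + a := by\n  rw [Nat.add_comm]",
    "explanation": "The goal involves addition, not multiplication; "
                   "replace Nat.mul_comm with Nat.add_comm.",
}
```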
Hook: Think of swapping a tool with a similar-looking one that doesn't quite fit the bolt.
The concept: "Mutation strategies" generate realistic errors that look plausible but fail for precise reasons. How it works:
- Theorem mutation: swap in a semantically similar but type-incompatible theorem.
- Tactic mutation: replace a tactic with a cousin (e.g., linarith vs nlinarith).
- Line mutation: LLM completes one redacted line incorrectly.
- Multi-line mutation: LLM completes a redacted tail of the proof incorrectly. Why it matters: Without realistic mistakes, the model won't learn real-world repair patterns. Anchor: Using ring instead of nlinarith in a non-linear arithmetic goal often stalls; seeing that pattern teaches when each tactic fits.
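A minimal Lean 4 sketch of that tactic-mutation pattern (illustrative; assumes Mathlib is available):

```lean
import Mathlib

-- Non-linear arithmetic goal. A tactic mutation that swaps `nlinarith` for
-- `ring` fails: `ring` proves ring identities, not inequalities. With a
-- squared-term hint, `nlinarith` closes the goal.
example (x y : ℝ) : x * y ≤ (x ^ 2 + y ^ 2) / 2 := by
  nlinarith [sq_nonneg (x - y)]
```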
Hook: Imagine a friend who can explain a tough puzzle in simple words so you see the fix.
The concept: "Natural-language explanations" summarize why a proof fails and hint at the precise edit. How it works:
- Read Lean's message and goal state,
- Point to the mismatch (types, hypotheses, target),
- Propose a minimal change,
- Keep the wording short and grounded in the feedback. Why it matters: Without explanations, humans and downstream agents can't easily trust or reuse the model's reasoning. Anchor: "The goal expects α → β, but the theorem returns α × β; swap to the version with an arrow conclusion."
03 Methodology
Hook: Picture an assembly line that takes a sturdy toy, makes a safe small break, measures exactly how it broke, and then shows how to fix it.
The concept: The pipeline turns correct Lean proofs into training examples for repair. How it works at a high level: Correct proof → (Mutation) → Erroneous proof → (Lean) → Error + goal state → (Annotator LLM) → Explanation + Fix suggestion → (SFT) → Model that repairs and explains. Why it matters: Without this pipeline, we'd lack large, clean pairs of (failure, fix) to teach the model. Anchor: Swap a theorem, watch Lean complain, record the complaint and the correct swap-back, then teach the model that pattern.
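Before the step-by-step details, here is a minimal Python sketch of that loop, assuming hypothetical helpers `mutate`, `run_lean`, and `annotate` (none of these are real library functions):

```python
# Sketch of turning one correct proof into one (failure, diagnostic, fix) example.
def build_repair_example(correct_proof: str, mutation_type: str) -> dict | None:
    broken = mutate(correct_proof, mutation_type)      # induce a plausible error
    result = run_lean(broken)                          # compile with the Lean toolchain
    if result.ok:                                      # mutation did not actually break it
        return None
    explanation, fix_hint = annotate(                  # annotator LLM, grounded in feedback
        broken, result.error_message, result.goal_state
    )
    return {
        "input": {"proof": broken,
                  "error": result.error_message,
                  "goal": result.goal_state},
        "target": {"explanation": explanation,
                   "fix_hint": fix_hint,
                   "repair": correct_proof},           # the known, minimal fix
    }
```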
Step-by-step details and why each matters:
- Collect correct proofs
- What: Gather 39,492 Lean 4.22.0-rc4-compiling proofs from Herald, Lean Workbook, and NuminaMath-Lean.
- Why: We need trustworthy starting points so each induced failure has a known, minimal fix.
- Example: A human-annotated Numina proof using have and structured tactics gives richer contexts.
- Create controlled errors (four mutation types)
- What: Induce plausible mistakes that often appear in real work.
- Why: Models must practice realistic missteps, not random noise.
- Examples: • Theorem mutation: Replace eq_comm with mul_comm in a spot needing equality symmetry rather than multiplication commutativity. • Tactic mutation: Swap nlinarith with ring in a non-linear arithmetic goal, so Lean stalls. • Line mutation: Redact one line and let an LLM fill it; keep only completions that produce a Lean error. • Multi-line mutation: Redact the tail and let an LLM complete multiple lines; keep only realistic failures.
- Ask Lean for precise feedback
- What: Run the mutated proof; capture the error message, failing location, and goal state (context/hypotheses/target).
- Why: This diagnostic is the most informative teaching signalālocalized, structured, and actionable.
- Example: "Type mismatch: expected Nat, got Int at line 12" plus the goal and available hypotheses.
- Generate concise explanations and fix hints
- What: Use a separate LLM (DeepSeek-V3-0324) to write a short error explanation and a fix suggestion grounded in that same feedback; for theorem/tactic swaps, include contrastive metadata (intended vs substituted item).
- Why: These short texts train the model to articulate what went wrong and how to repair it; they also help humans and tool-using agents.
- Example: "Replace simp with linarith because the goal is linear arithmetic and simp cannot close it."
- Build training prompts
- What: Pack the failing proof, error message, goal state, and request format into a chat-style input, with targets containing (a) an explanation first, then (b) the repaired proof.
- Why: This ordering encourages reasoning-before-acting and ties the fix to the diagnostic.
- Example: The assistant first writes a 1–3 sentence diagnosis, then outputs the corrected Lean block.
- Fine-tune models with LoRA
- What: Apply Low-Rank Adaptation to Qwen3-4B-Instruct, Kimina-Prover-8B, and Goedel-Prover-8B under identical SFT hyperparameters (AdamW, lr 1e-4, cosine, bf16, seq len 2048).
- Why: Parameter-efficient tuning adapts versatile backbones to the repair task at modest compute cost.
- Example: LoRA rank 32 on attention and MLP projections; early stopping by validation loss.
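A minimal sketch of such a LoRA fine-tuning setup, assuming the Hugging Face peft and trl libraries (the paper's exact training code, the alpha value, the model id, and some argument names, which vary across library versions, are assumptions here):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora_cfg = LoraConfig(
    r=32,                                   # LoRA rank from the paper
    lora_alpha=64,                          # alpha is an assumption, not stated above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],      # MLP projections
    task_type="CAUSAL_LM",
)
train_cfg = SFTConfig(
    output_dir="april-lora",
    learning_rate=1e-4,                     # AdamW + cosine schedule, bf16
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    bf16=True,
    max_seq_length=2048,                    # renamed in some trl versions
)
trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",    # model id is an assumption
    args=train_cfg,
    peft_config=lora_cfg,
    train_dataset=april_train,              # chat-formatted APRIL examples (assumed)
)
trainer.train()
```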
Hook: Think of two keys that open the same door faster together.
The concept: "Joint supervision of explanation + repair" means training the model to say why and then to fix. How it works:
- The input includes the diagnostic,
- The model outputs a grounded explanation,
- Then outputs the corrected proof,
- Both are scored during training. Why it matters: Without the explanation head, the model may fix but cannot communicate; with it, we get transparent, reusable reasoning for agents and humans. Anchor: Providing the explanation to another model (DeepSeek) raises its repair success from 4% to 29% on the same failures.
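A small Python sketch of how an explanation-then-repair training example could be laid out as a chat pair (the exact prompt wording used in the paper is not given here, so this layout is an assumption):

```python
# Build one chat-style training example: diagnosis first, repaired proof second.
def format_example(broken_proof, error_msg, goal_state, explanation, repaired_proof):
    user = (
        "The following Lean proof fails to compile.\n\n"
        f"Proof:\n{broken_proof}\n\n"
        f"Error message:\n{error_msg}\n\n"
        f"Goal state:\n{goal_state}\n\n"
        "Explain the error, then give a corrected proof."
    )
    assistant = (
        f"{explanation}\n\n"        # short, feedback-grounded diagnosis first
        f"{repaired_proof}"         # then the minimal corrected proof
    )
    return [{"role": "user", "content": user},
            {"role": "assistant", "content": assistant}]
```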
The secret sauce:
- Controlled, semantically plausible mutations create authentic failure patterns.
- Aligning Lean's exact diagnostics with minimal correct repairs teaches targeted editing.
- Adding short, grounded explanations turns black-box fixes into understandable, shareable guidance.
04 Experiments & Results
Hook: Imagine a pop quiz where you must fix one broken line in a single try, with no do-overs.
The concept: "Single-shot repair evaluation" checks if the model can produce a compiling fix from one failing attempt plus Lean's feedback. How it works:
- Give the model a broken proof, error message, and goal state,
- Ask for one repaired output,
- Count it as correct only if Lean accepts it,
- Report accuracy across many failures. Why it matters: This isolates true feedback-using repair skill from search or multiple retries. Anchor: Like getting one chance to correct a math step after the teacher's hint: either the fix works, or it doesn't.
The test: 1,835 held-out erroneous proofs covering theorem, tactic, line, and multi-line mutations. Metric: pass@1 (compiles or not). All models use the same interface and no iterative search.
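A minimal Python sketch of this pass@1 scoring, assuming hypothetical helpers `generate_repair` (one sampled fix per failure) and `lean_compiles` (runs the Lean toolchain):

```python
# Single-shot repair accuracy: one attempt per failing proof, no retries or search.
def pass_at_1(model, test_examples) -> float:
    successes = 0
    for ex in test_examples:
        candidate = generate_repair(model, ex)   # model sees proof + error + goal state
        if lean_compiles(candidate):             # correct only if Lean accepts the fix
            successes += 1
    return successes / len(test_examples)
```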
The competition: Baselines include the untuned Qwen3-4B-Instruct (a small general LLM) and Goedel-Prover-V2 (8B and 32B), plus Kimina-Prover-8B. The main comparison is fair because everyone does single-shot repair with the same inputs.
The scoreboard (with context):
- Qwen3-4B-Instruct jumps from 1.1% (base) to 27.4% after training on APRIL: like going from almost never fixing to getting more than one in four right on the first try.
- Goedel-Prover-V2-32B scores 26.8% in the same setting, so the fine-tuned 4B slightly edges a model 8× larger.
- Fine-tuned 8B models (Goedel/Kimina) reach roughly 32–35% overall, pulling clearly ahead.
By error type:
- Tactic mutations: easiest; top repair rates land around the low 40s after fine-tuning (local, well-cued fixes).
- Theorem mutations: medium difficulty; mid-to-high 20s.
- Line mutations: hardest; tops out around the low-to-mid teens because the wrong line can be semantically far from the goal.
- Multi-line: between line and theorem cases.
Surprising findings:
- Small can beat big when trained right: a 4B model, taught to read Lean's red pen, slightly beats a 32B baseline in this strict one-shot setting.
- Joint training on all error types works well; specialized training per type is only slightly better, so patterns transfer.
- Training to produce explanations slightly lowers raw pass@1 (e.g., Qwen from 31.2% repair-only to 27.4% joint), but the explanations greatly help downstream agents: DeepSeek's success rises from 4% (with base explanations) to 29% (with trained explanations).
Hook: Like learning both to solve and to show your work: your friend can now use your notes to solve similar problems.
The concept: "Explanations as a resource" means the diagnosis text is valuable beyond the original model. How it works:
- The trained model writes a short, grounded explanation,
- Another model reads it as a hint,
- That model repairs more often. Why it matters: Without shareable explanations, repair skills stay trapped in one model. Anchor: With trained explanations attached, DeepSeek improves more than 7× on the same failures.
05 Discussion & Limitations
Hook: Think of a practice book: it's great, but it can't include every kind of problem you'll ever see.
The concept: "Limitations" are the edges of what this method covers well. How it works:
- Synthetic bias: Errors are induced by controlled mutations and LLM completions; real human errors may differ.
- Coverage gaps: Four mutation types cover many failures, but not all (e.g., very deep library misuse or long-range dependencies).
- Version lock: Trained on Lean 4.22.0-rc4; messages or behaviors could shift across versions.
- Length/complexity: Extremely long or intricate proofs may exceed the 2048-token input window. Why it matters: Knowing the edges helps decide when to complement with search, RL, or retrieval. Anchor: If the failure depends on a subtle library change in a different Lean version, a repair-trained model might misread the diagnostic.
Required resources:
- A Lean toolchain for compiling candidate fixes,
- GPUs for fine-tuning (LoRA helps keep costs reasonable),
- Access to curated correct proofs and a way to create controlled mutations.
When not to use:
- Full end-to-end proving from scratch with no failing attempt; use search or RL-guided provers instead.
- Domains where compiler diagnostics are minimal or misleading.
- Tasks needing multi-try exploration (this work isolates single-shot repair).
Open questions:
- How best to combine repair training with iterative search or RL so that each failure during search becomes new supervision?
- Can we harvest real human error trajectories from IDE histories to reduce synthetic bias?
- How well does the approach transfer to other proof assistants (Coq/Rocq, Isabelle, HOL Light)?
- Can explanations be made even more structured (e.g., references to goal/hypothesis IDs) to further boost downstream agent performance?
06 Conclusion & Future Work
Three-sentence summary: The paper reframes formal proving as learn-to-repair: given a failing Lean proof and its diagnostic feedback, predict both a clear explanation and a minimal, correct fix. To support this, the authors build APRIL, a 260K-example dataset of paired (error, diagnostic) with aligned (explanation, repair), and fine-tune models to read and act on Lean's red pen. In single-shot repair, a fine-tuned 4B model reaches 27.4%, slightly surpassing a strong 32B baseline in the same setup.
Main achievement: Showing that diagnostic-conditioned supervision (explicitly training on error/feedback/fix triplets) unlocks strong, scalable repair skill and clear explanations, enabling smaller models to act smarter.
Future directions:
- Blend repair training with iterative search/RL so every failed branch becomes labeled supervision.
- Collect real human error trajectories from development tools to complement synthetic mutations.
- Port the approach to other proof assistants and programming languages.
- Enrich explanations with structured references for tool-using agents.
Why remember this: It proves that learning from the red pen, not just from gold solutions, can change the game: models become better debuggers, clearer communicators, and more efficient collaborators in formal math.
Practical Applications
- Build IDE assistants that suggest targeted Lean proof fixes grounded in the current error and goal.
- Teach students formal proving by showing concise, feedback-grounded explanations next to suggested repairs.
- Speed up continuous integration for math libraries by auto-repairing common breaking changes with diagnostics.
- Pretrain small, cost-efficient repair models to triage failures before invoking heavy search-based provers.
- Create curriculum generators that produce realistic failing proofs for classroom exercises and auto-grading.
- Use explanations as hints for a second agent (or human) to boost joint repair success in multi-agent systems.
- Port the repair pipeline to other proof assistants (Coq/Rocq, Isabelle) to accelerate cross-system development.
- Mine failing attempts from real projects and fine-tune repair models to each team's code/proof style.
- Use repair-first passes to prune search spaces in theorem provers, improving overall wall-clock efficiency.