MMFormalizer: Multimodal Autoformalization in the Wild
Key Summary
- MMFormalizer is a new system that turns problems with pictures and words (like physics scenes or geometry diagrams) into strict, checkable math statements and proofs.
- Its secret is a step-by-step "recursive grounding" process that builds big ideas from small, picture-linked facts, and it only stops when it reaches a solid anchor: a basic unit (like meters or seconds) or a trusted axiom (like a law of physics).
- It uses Lean, a proof assistant, plus mathlib and PhysLean, to make sure every step compiles and is logically correct.
- A new test set called PHYX-AF checks many skills across math and physics, making sure the picture actually matters, not just the words.
- Frontier models (like GPT-5 and Gemini-3-Pro) do best overall; GPT-5 shines on modern physics, but geometry remains the hardest area for everyone.
- The system includes a semantic checker that verifies each formal statement is supported by both the image and the text.
- Ablation studies show that clear stopping rules, using the image during grounding, and trying multiple candidates (pass@k) all improve results.
- The framework is the first to autoformalize across classical mechanics, relativity, quantum mechanics, and thermodynamics from multimodal inputs.
- It bridges perception and formal logic, making scientific reasoning more reliable, reusable, and verifiable.
- This work highlights how dimensional analysis (units) keeps formal math tied to real-world measurements so the results stay meaningful.
Why This Research Matters
Many real problems mix pictures and words, like engineering blueprints, physics labs, and textbook diagrams, and we need AI that respects both. MMFormalizer ties every step of reasoning to what it sees and to real units, so results stay physically meaningful and verifiable. This makes scientific reasoning more trustworthy for classrooms, research, and industry. It can help catch subtle mistakes early, saving time and improving safety. By compiling inside a proof assistant, it creates reusable, shareable proofs that others can audit. Over time, this could raise the standard of how we write, teach, and check science. It's a step toward AI that not only answers but also proves why the answer is right.
Detailed Explanation
01 Background & Problem Definition
You know how when you solve a puzzle, you look at the picture on the box and also the shapes of the pieces? Old AI "puzzle solvers" were great at reading the box (the text) but not so great at feeling the pieces (the image). That made it hard to trust their final picture.
🍞 Hook: You know how a math word problem often includes a diagram, and if you ignore the diagram, you'll probably miss something important? 🥬 The Concept: Autoformalization is turning everyday math and science language into strict, computer-checkable math statements.
- How it works:
- Read the text and find the key ideas.
- Translate those ideas into a formal language (like writing precise rules).
- Use a proof checker to make sure everything is 100% correct.
- Why it matters: Without autoformalization, computers can't reliably check long, tricky solutions; they might miss mistakes. 🍞 Anchor: If a question says "triangle ABC is isosceles," autoformalization writes a precise statement like "AB = AC" in a proof system so it can be verified.
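To make the anchor concrete, here is a minimal Lean 4 sketch of that statement. The `Isosceles` definition and the import path are illustrative choices on our part, not code from the paper; the point is simply that "AB = AC" becomes a machine-checkable proposition.

```lean
-- Hedged sketch: "triangle ABC is isosceles" as a checkable statement.
-- `Isosceles` is an illustrative name, not a mathlib definition.
import Mathlib.Topology.MetricSpace.Basic

def Isosceles {P : Type*} [MetricSpace P] (A B C : P) : Prop :=
  dist A B = dist A C

-- A goal Lean can discharge: the hypothesis is exactly the claim.
example {P : Type*} [MetricSpace P] (A B C : P)
    (h : dist A B = dist A C) : Isosceles A B C := h
```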
The world before: Researchers built strong text-only autoformalizers. In geometry, many systems could parse symbolic descriptions and prove theorems. Proof assistants like Lean and big libraries like mathlib helped ensure strict correctness. But there was a gap: math and physics in the real world are multimodal; pictures, graphs, and scenes carry crucial details that words alone don't capture.
🍞 Hook: Imagine trying to measure the height of a tree only from a story, without looking at the picture that shows the shadow. 🥬 The Concept: Multimodal reasoning is thinking with words and pictures together.
- How it works:
- Look at the image for shapes, distances, or interactions.
- Read the text for constraints and goals.
- Combine both to form the full problem.
- Why it matters: Without it, an AI might solve the wrong problem by ignoring important clues hidden in the picture. 🍞 Anchor: A function graph plus a question about slope at a point: multimodal reasoning uses the curve shape (image) and the description (text) to get the right derivative.
The problem: Physics isn't just about symbols; it's about units and measurements. "3" is not the same as "3 meters" or "3 seconds." If a system ignores units, it may write pretty-looking equations that mean nothing in reality.
🍞 Hook: Imagine baking cookies with "3 of sugar." Three what: cups, teaspoons, or bags? 🥬 The Concept: Dimensional grounding ties every quantity to real-world units (like meters, seconds, kilograms) so equations stay physically meaningful.
- How it works:
- Detect physical quantities (mass, length, time, charge, temperature).
- Track their dimensions through each step.
- Reject any equation where units donāt match.
- Why it matters: Without dimensional grounding, you could "add meters to seconds," which makes no sense. 🍞 Anchor: F = ma only works if mass is in kilograms and acceleration in meters per second squared; the result must be in newtons.
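The mechanism is easy to sketch in code. Below is a tiny, self-contained Python illustration of the idea (ours, not the paper's implementation): quantities carry dimension exponents, multiplication adds them, and addition is rejected unless the dimensions match.

```python
# Illustrative dimension tracking; not the paper's implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    M: int = 0  # mass exponent
    L: int = 0  # length exponent
    T: int = 0  # time exponent

    def __mul__(self, other: "Dim") -> "Dim":
        # Multiplying quantities adds dimension exponents.
        return Dim(self.M + other.M, self.L + other.L, self.T + other.T)

@dataclass(frozen=True)
class Qty:
    value: float
    dim: Dim

    def __add__(self, other: "Qty") -> "Qty":
        if self.dim != other.dim:  # reject "meters + seconds"
            raise TypeError(f"dimension mismatch: {self.dim} vs {other.dim}")
        return Qty(self.value + other.value, self.dim)

    def __mul__(self, other: "Qty") -> "Qty":
        return Qty(self.value * other.value, self.dim * other.dim)

mass = Qty(2.0, Dim(M=1))                # 2 kg
accel = Qty(9.8, Dim(L=1, T=-2))         # 9.8 m/s^2
force = mass * accel                     # F = m * a
assert force.dim == Dim(M=1, L=1, T=-2)  # newtons: kg*m/s^2
```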
Failed attempts: Prior systems often (1) treated images as optional, so the model over-relied on text; (2) skipped dimensional checks, letting nonsense equations pass; (3) used retrieval so heavily that they got stuck when the exact code didn't exist; and (4) lacked smart stopping rules, so they kept decomposing forever, generating huge, unusable proof trees.
The gap: We needed a framework that (a) grounds visual pieces into formal statements step by step, (b) stops at the right time using either a fundamental unit (dimensions) or a trusted law (axiom), and (c) compiles inside a real proof assistant using reusable libraries from math and physics.
Real stakes: This matters for safe engineering (diagrams + specs → checked constraints), science education (diagrams in textbooks → verified proofs), research reproducibility (paper figures → formal derivations), and robotics (seeing a scene → making physics-consistent plans). If the AI can see, read, and prove, all while respecting units, we can trust it more in the real world.
02 Core Idea
Aha! Moment in one sentence: Start from the picture and text, build a ladder of statements that stay tied to what you see, and stop only when you hit solid ground: a basic unit or a basic law.
Explain it three ways:
- Detective story: Gather clues from the scene (image) and the witness (text), line them up into facts, and end the case when you reach a known law that the court (Lean) accepts.
- LEGO build: Snap small bricks (visual facts) into medium assemblies (lemmas), then into a sturdy model (axioms), stopping when you reach pieces labeled "fundamental."
- Cooking: Prep ingredients (points, lines, masses), follow the recipe (grounding rules), and plate the dish when the taste test (units or axioms) passes.
🍞 Hook: You know how you stack blocks higher only if each layer sits firmly on the one below? 🥬 The Concept: Recursive grounding builds complex formal claims from simpler, visually supported pieces.
- How it works:
- Parse the image into primitives (points, lines, regions; or objects, forces, trajectories).
- Propose small claims linked to those visuals (lemmas).
- Retrieve helpful known facts from libraries.
- Keep composing until you reach a reliable anchor (dimension or axiom).
- Why it matters: Without this, the AI might jump to big claims that aren't actually supported by the picture or physics. 🍞 Anchor: From a pulley photo: identify masses and strings → relate their directions and tensions → stop at Newton's law with proper units.
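The recursion itself is simple to picture. Here is a toy Python sketch of the control flow with its two stopping conditions; the claim names, `AXIOMS`/`CLOSED` sets, and decomposition table are all invented for illustration.

```python
# Toy sketch of recursive grounding; all data here is invented for illustration.
AXIOMS = {"newton_second_law"}           # trusted laws
CLOSED = {"mass_kg", "accel_m_per_s2"}   # dimensionally closed leaves
CHILDREN = {"net_force_eq": ["mass_kg", "accel_m_per_s2", "newton_second_law"]}

def ground(claim: str, depth: int = 0, max_depth: int = 8) -> list[str]:
    """Return a chain of sub-claims ending at solid anchors."""
    if claim in AXIOMS or claim in CLOSED:  # stop: axiom or base unit
        return [claim]
    if depth >= max_depth:
        raise RuntimeError(f"no anchor found for: {claim!r}")
    chain: list[str] = []
    for sub in CHILDREN.get(claim, []):     # smaller, visually supported claims
        chain.extend(ground(sub, depth + 1, max_depth))
    chain.append(claim)                     # parent rests on grounded children
    return chain

print(ground("net_force_eq"))
# ['mass_kg', 'accel_m_per_s2', 'newton_second_law', 'net_force_eq']
```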
🍞 Hook: Think of the simplest truths in math and physics as the base rules everyone agrees on. 🥬 The Concept: Axiom composition assembles verified small facts into bigger truths using accepted base rules.
- How it works:
- Take grounded lemmas.
- Combine them with axioms (e.g., conservation laws, geometric postulates).
- Produce a final formal statement the prover can check.
- Why it matters: Without composition, you'd have many tiny facts but no finished proof. 🍞 Anchor: In geometry, "AB = AC" and "angles at the base are equal" assemble into "triangle ABC is isosceles," which Lean can verify.
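In Lean terms, composition is just function application: grounded lemmas are hypotheses, an accepted base rule is an implication, and the composed claim is its conclusion. The proposition names below are illustrative placeholders.

```lean
-- Sketch of axiom composition; proposition names are illustrative.
variable (SidesABEqualAC IsIsoscelesABC : Prop)

example (h1 : SidesABEqualAC)               -- grounded lemma from the image
    (ax : SidesABEqualAC → IsIsoscelesABC)  -- accepted base rule
    : IsIsoscelesABC :=
  ax h1                                     -- composed, checkable result
```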
🍞 Hook: Imagine a super-strict math teacher who never lets a sloppy step slide. 🥬 The Concept: The Lean system is a proof checker that accepts only fully precise, well-typed statements.
- How it works:
- You write definitions and theorems.
- Lean checks each step follows the rules.
- If everything passes, it compiles; if not, it points to the error.
- Why it matters: Without Lean, you might trust a proof that looks right but contains hidden mistakes. 🍞 Anchor: "Force equals mass times acceleration" is encoded and checked so the system won't accept mismatched units or undefined terms.
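A two-line illustration of that strictness (our example, not the paper's encoding): the first statement type-checks, while the commented one would be rejected because the types don't match.

```lean
import Mathlib.Data.Real.Basic

-- Well-typed: force as mass times acceleration over the reals compiles.
example (m a : ℝ) : ℝ := m * a

-- Ill-typed: Lean rejects adding a real number to a string.
-- example (m : ℝ) (s : String) : ℝ := m + s  -- error: type mismatch
```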
Before vs After:
- Before: Systems often ignored images, forgot units, or overfit to known code snippets. Proof trees could spiral without ending.
- After: MMFormalizer builds statements that are visually and dimensionally grounded, composes them into axioms, and compiles in Lean. It also defines explicit stopping rules so reasoning halts safely.
Why it works (intuition, no equations): Images anchor symbols to reality; dimensions act like a ruler that rejects nonsense; axioms give you trusted glue to assemble parts; and the prover enforces strict logic. Together, these prevent drift and keep the reasoning chain short, sensible, and checkable.
Building blocks introduced in this work:
- 🍞 Hook: When you read a map, you first notice cities and roads.
🥬 The Concept: SceneGraph is a structured list of visual primitives (like points, lines, or objects) plus their relations.
- How it works: Parse the image → detect primitives → record relations such as parallel, touching, inside.
- Why it matters: Without a clean map of the scene, you can't link visuals to formal statements. 🍞 Anchor: In a triangle picture, the SceneGraph lists A, B, C as points and AB, BC, CA as lines, including angle relations.
- 🍞 Hook: Think of a diary recording every solid mini-fact you've confirmed.
🥬 The Concept: PropChain is the chain of small lemmas, each connected to something you saw.
- How it works: For each sub-part of the scene, propose a lemma, retrieve relevant theorems, and confirm it.
- Why it matters: Without PropChain, you jump too far and risk errors. 🍞 Anchor: "Point D is the midpoint of AB" enters the chain because the image shows equal ticks on the segment.
- 🍞 Hook: Picture locking your build by snapping in a few special key pieces.
🥬 The Concept: AxiomChain is the final list where each claim is either an axiom or tied to dimensions that close the loop.
- How it works: Elevate PropChain items into axiom-backed or dimension-anchored statements.
- Why it matters: Without AxiomChain, your proof can't finalize. 🍞 Anchor: In physics, once momentum and force are tied through Newton's law with correct units, the chain can terminate.
- 🍞 Hook: Like searching a library by meaning, not just by exact words.
🥬 The Concept: LeanSearch retrieves helpful definitions and lemmas from mathlib and PhysLean.
- How it works: Index the library, embed statements, and fetch top matches to guide formalization.
- Why it matters: Without retrieval, you reinvent the wheel or get stuck. 🍞 Anchor: Need "midpoint" properties? LeanSearch brings the exact lemmas to use.
- 🍞 Hook: Think of a referee who checks both the instant replay (image) and the rulebook (text).
🥬 The Concept: Semantic checking verifies that each formal statement is supported by both modalities.
- How it works: Compare the image, the text, and the code; approve only if they match.
- Why it matters: Without this, the system might pass a good-looking but wrong proof. 🍞 Anchor: If the diagram shows parallel lines, the checker rejects any proof that treats them as intersecting.
Put together, MMFormalizer is a disciplined builder: see → ground → compose → check → stop at dimensions/axioms, all inside a prover.
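As a rough picture only, the core data structures might be shaped like the Python below; the field names are our guesses, not the paper's schema.

```python
# Hypothetical shapes for SceneGraph, PropChain, and AxiomChain.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    primitives: list[str] = field(default_factory=list)  # "point A", "line AB", ...
    relations: list[tuple[str, str, str]] = field(default_factory=list)
    # e.g., ("AB", "parallel", "CD") or ("block", "on", "incline")

@dataclass
class Lemma:
    statement: str   # informal note, e.g., "D is the midpoint of AB"
    lean_code: str   # proposed formal statement
    evidence: str    # which visual primitive/relation supports it

@dataclass
class PropChain:
    lemmas: list[Lemma] = field(default_factory=list)

@dataclass
class AxiomChain:
    # Each entry is axiom-backed or dimension-anchored, so the proof can end.
    closed_statements: list[Lemma] = field(default_factory=list)
```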
03 Methodology
At a high level: Input (image + text) → Parse scene → Propose grounded lemmas (PropChain) → Retrieve supporting facts (LeanSearch) → Check for termination (dimension or axiom) → Compose into AxiomChain → Compile in Lean → Semantic check → Output verified formal code.
Step-by-step, with purpose and examples:
- Parse the image into a SceneGraph.
- What happens: The system detects primitives (points, lines, regions for geometry; objects, vectors, forces for physics) and relations (parallel, touching, contained_in).
- Why this exists: You canāt reason about what you canāt name. SceneGraph is the vocabulary of the picture.
- Example: In a triangle diagram, identify points A, B, C; lines AB, BC, CA; and a marked right angle at B.
- Create initial hypotheses tied to visuals.
- What happens: For each primitive or relation, the system writes a short, informal note like "AB ⊥ BC" or "Block on incline with angle θ."
- Why: These are the seeds for formal lemmas; they keep reasoning anchored to whatās visible.
- Example: Seeing equal tick marks on two sides suggests "AB = AC."
- Build the PropChain by turning notes into lemmas.
- What happens: The model queries LeanSearch with each note to find relevant definitions and lemmas, then proposes a formal lemma with a proof sketch.
- Why: Small, verified steps prevent big errors.
- Example: Use a midpoint lemma from mathlib to formalize "D is the midpoint of AB" when the diagram shows equal halves.
- Retrieve from libraries (mathlib + PhysLean) via LeanSearch.
- What happens: The system gets the top-k relevant items (names, types, comments) to guide formalization and proof.
- Why: Reuse trusted math and physics facts instead of starting from scratch.
- Example: For "momentum is mass times velocity," PhysLean provides precise definitions and links to Hamilton's equations.
- Check for recursive termination with dimensional or axiomatic grounding.
- What happens: After each lemma, the system asks, "Can I end here?" Two ways to stop:
- Dimensional grounding: The statement reduces to base units like [M], [L], [T], [Q], [Θ] or approved derived units (force, energy).
- Axiomatic grounding: The statement rests directly on a fundamental axiom or law (e.g., Newton's laws, conservation laws, Maxwell's equations).
- Why: This prevents infinite decomposition into unnecessary sub-lemmas and keeps proofs meaningful.
- Example: In a pulley system, once tensions and accelerations are tied to Newton's second law with correct units (N = kg·m/s²), stop.
- Compose lemmas into higher-level results (AxiomChain).
- What happens: Combine child lemmas to form non-leaf theorems, promoting them to the axiom/dimension-closed level.
- Why: Composition turns many small truths into a final, reusable theorem.
- Example: "midpoint D," "medians intersect at centroid," and angle facts build into a triangle center theorem.
- Syntax and type checking (compilation in Lean).
- What happens: Lean verifies that the definitions and proofs compile: no missing variables, wrong types, or illegal steps.
- Why: Guarantees machine-level correctness.
- Example: A proof that compiles in Lean is like a bridge that passes all safety inspections.
- Semantic checking across modalities.
- What happens: A separate checker rates each image-text-formula triplet as 1 (accept) or 0 (reject) based on cross-modal consistency.
- Why: Prevents accepting a formal proof that doesnāt match the picture or statement.
- Example: If the picture shows 60°, but the proof uses 90°, the checker rejects it.
- Output verified formal code.
- What happens: The system returns Lean code (theorem + proof) plus the dependency graph (what depended on what).
- Why: This makes the reasoning transparent, reusable, and checkable by others.
- Example: A theorem like "force equals rate of change of momentum" compiled from a specific scene.
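Putting the steps together, a driver loop could look roughly like this. Every function name is a hypothetical placeholder for the corresponding stage above, not an actual MMFormalizer API, so treat this as pseudocode with Python syntax.

```python
# End-to-end pipeline sketch; all stage functions are hypothetical placeholders.
def formalize(image, text):
    scene = parse_scene(image)                     # 1. build the SceneGraph
    notes = propose_hypotheses(scene, text)        # 2. visual-grounded notes
    prop_chain = []
    for note in notes:
        hits = lean_search(note, k=5)              # 3-4. retrieve mathlib/PhysLean facts
        lemma = formalize_lemma(note, hits)
        if not (dimensionally_closed(lemma) or rests_on_axiom(lemma)):
            lemma = ground(lemma)                  # 5. recurse until anchored
        prop_chain.append(lemma)
    theorem = compose(prop_chain)                  # 6. promote to the AxiomChain
    if not lean_compiles(theorem):                 # 7. syntax/type check
        raise ValueError("Lean compilation failed")
    if semantic_check(image, text, theorem) != 1:  # 8. cross-modal referee
        raise ValueError("semantic check rejected the formalization")
    return theorem                                 # 9. verified formal code
```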
The secret sauce:
- Explicit termination: Reasoning stops when grounded in units or axioms, avoiding runaway trees (the paper shows a failure case without this rule).
- Dimension tracking: Units act like guardrails to block physically meaningless equations.
- Library retrieval: LeanSearch injects just-in-time knowledge from mathlib/PhysLean so the model composes the right building blocks.
- Proof assistant backbone: Lean enforces rigor at every step, turning ideas into certifiable artifacts.
Mini "sandwich" intros for supporting notions:
- 🍞 Hook: Mixing cups with teaspoons ruins recipes.
🥬 The Concept: Dimensional analysis ensures units align in every step.
- How it works: Tag quantities with units, propagate them, reject mismatches.
- Why it matters: Stops nonsense like adding meters to seconds. 🍞 Anchor: Validates that energy has units of joules and momentum of kg·m/s.
- 🍞 Hook: A city map needs a legend.
🥬 The Concept: SceneGraph is the map legend of the image.
- How it works: List primitives and relations.
- Why it matters: It's the foundation for grounding. 🍞 Anchor: Points, lines, "parallel," and "perpendicular" all live here.
- 🍞 Hook: A diary of facts builds trust.
🥬 The Concept: PropChain records grounded mini-theorems.
- How it works: One lemma at a time, linked to the picture.
- Why it matters: Prevents leaps. 🍞 Anchor: "D is midpoint of AB" before proving centroid facts.
- 🍞 Hook: Final locks make furniture sturdy.
🥬 The Concept: AxiomChain locks proofs with axioms or dimensions.
- How it works: Promote lemmas to closed statements.
- Why it matters: Finishes the proof. 🍞 Anchor: Newton's laws cap a mechanics proof.
- 🍞 Hook: Ask a librarian who knows meanings, not just titles.
🥬 The Concept: LeanSearch finds relevant theorems.
- How it works: Embed, index, retrieve top matches.
- Why it matters: Avoids reinventing the wheel. 🍞 Anchor: Fetch "centroid" lemmas when needed.
- 🍞 Hook: A referee watches the play and the replay.
🥬 The Concept: Semantic checking compares image, text, and code.
- How it works: Accept only when all agree.
- Why it matters: Blocks pretty-but-wrong proofs. 🍞 Anchor: Rejects a 90° claim when the diagram shows 60°.
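Concretely, the referee can be any multimodal model prompted for a binary verdict. Below is a minimal, runnable sketch under that assumption; `judge` stands in for whatever model call is available and is stubbed so the snippet executes.

```python
# Minimal cross-modal referee sketch; `judge` is a stand-in for a real VLM call.
from typing import Callable

def semantic_check(image_ref: str, problem_text: str, lean_code: str,
                   judge: Callable[[str], str]) -> int:
    """Return 1 (accept) if image, text, and code agree, else 0 (reject)."""
    prompt = (
        f"Image: {image_ref}\n"
        f"Problem text: {problem_text}\n"
        f"Lean code:\n{lean_code}\n"
        "Does the code faithfully formalize BOTH the text and the diagram? "
        "Answer ACCEPT or REJECT."
    )
    return 1 if "ACCEPT" in judge(prompt).upper() else 0

# Stubbed judge so the sketch runs; swap in a real multimodal model.
print(semantic_check("triangle.png", "Angle B is 60 degrees.",
                     "theorem angleB_eq : angleB = 60 := sorry",
                     judge=lambda p: "ACCEPT"))  # -> 1
```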
04 Experiments & Results
The test: The authors built PHYX-AF, 115 carefully chosen multimodal problems spanning MathVerse (plane/solid geometry, function graphs), PhyX (mechanics, electromagnetism, thermodynamics, modern physics: relativity and quantum), Synthetic Geometry (hard, out-of-distribution constructions), and Analytic Geometry (coordinate-heavy). Crucially, they filtered out any problem solvable by text alone, so the image really matters.
What they measured and why:
- Compile accuracy: Does the generated Lean code type-check and run? This is the first gate to ensure formal correctness.
- Semantic correctness: Is the solution meaningfully right, even if written differently than a reference? This captures true understanding.
- Human verification: Experts check logic and multimodal consistency. This guards against corner cases.
The competitors:
- Frontier closed models: GPT-5, Gemini-3-Pro, Gemini-2.5-Pro.
- Open models: Qwen3-VL-235B, Qwen2.5-VL-72B.
- All were used inside the same MMFormalizer pipeline to keep comparisons fair.
The scoreboard (with context):
- GPT-5 excelled on the PhyX problems, especially modern physics (relativity and quantum). Think of it as getting an A in the toughest physics class.
- Gemini-3-Pro led on many math categories (MathVerse and Analytic Geometry), showing strong general multimodal reasoning. That's like topping the class in geometry and graphs.
- Geometry remains hardest overall. Synthetic and Analytic Geometry showed big gaps: models struggled to connect precise visual relations (angles/lengths) to formal proofs, especially out of distribution. That's like students doing well on homework but stumbling on tricky Olympiad-style puzzles.
- Advanced open-source models lagged notably in physical domains and out-of-distribution geometry. They often failed to produce compilable code or correct semantics compared to frontier systems.
Surprising and useful findings (ablations):
- Removing retrieved code snippets sometimes helped in Synthetic Geometry. Why? It freed the model to invent new dependent types instead of forcing it into mismatched templates, like letting a student derive a fresh formula instead of copying the wrong one.
- Explicit termination rules were crucial. Without them, the reasoning tree grew too deep and wide, wasting compute and failing to finish, like writing an essay that never ends.
- Using the image during grounding improved performance in tougher areas, including modern physics and synthetic geometry. Seeing matters!
- Trying multiple candidates (pass@k) improved performance on hard problems, showing that test-time scaling (more tries) can pay off.
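For reference, pass@k is usually computed with the standard unbiased estimator from the code-generation literature (Chen et al., 2021); whether the paper uses exactly this formula is our assumption.

```python
# Common unbiased pass@k estimator; its use here is our assumption, not the paper's statement.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n candidates generated, c of which verify; P(at least 1 of k samples verifies)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must verify
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=2, k=4), 3))  # 0.667: 2 verified out of 10, budget 4
```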
Semantic checking analysis:
- Multiple LLMs were tested as semantic referees over code produced by strong and weaker models. Some "weaker" models still performed surprisingly well at flagging mismatches when verifying code from open-source systems. This hints at a practical workflow: cross-check models can catch each other's mistakes.
Takeaway in plain terms: The framework works across many domains, but geometry with complex visuals is still the Achilles' heel. Frontier models handle physics better (especially with dimensional grounding), while images are truly needed for top performance. Smart stopping rules, image grounding, and multiple attempts make a real difference.
05 Discussion & Limitations
Limitations:
- Visual parsing brittleness: If the diagram is cluttered, low-resolution, or ambiguous, the SceneGraph can be wrong, and the whole chain inherits that error.
- Geometry perception remains hard: Precise angle/length relations and auxiliary constructions are frequent failure points, especially for unfamiliar setups.
- Library coverage: Even with mathlib and PhysLean, some needed definitions or modern physics constructs may be missing, forcing the model to synthesize types under pressure.
- Compute and latency: Recursive grounding with retrieval and compilation can be slow, especially if pass@k is large or trees grow deep.
- Semantic checking is not perfect: Automated cross-modal checks still sometimes accept or reject borderline cases; human verification remains valuable.
Required resources:
- A working Lean 4 toolchain with mathlib and PhysLean, plus the LeanSearch index built over them.
- A capable multimodal LLM (frontier models do best) with enough context window and tool-use ability.
- GPUs/CPUs for retrieval and compilation at scale; storage for indices and logs; a sandboxed environment for code execution.
When not to use:
- Extremely noisy or stylized images where key geometry/physics cues are unreadable.
- Tasks that are purely text-based (no need for multimodal overhead).
- Domains with missing axioms or unit systems (no reliable dimensional/axiomatic anchors yet).
- Real-time control loops that can't tolerate compilation latency.
Open questions:
- Can we learn better visual parsers specialized for formal geometry and physics cues to reduce SceneGraph errors?
- How can the system automatically expand libraries (safely) when it discovers missing types or lemmas, without hallucination?
- Can we unify stronger unit-aware type systems across more sciences (chemistry, biology) for robust dimensional grounding?
- What's the best strategy to allocate compute across grounding, retrieval, and composition (e.g., dynamic pass@k) for reliability vs. speed?
- How can multiple LLMs serve as mutual referees to raise semantic checking reliability without human review?
06 Conclusion & Future Work
In three sentences: MMFormalizer turns problems that mix pictures and words into strict, checkable math and physics proofs by building them up from what it can see and read. It stops only when every step is anchored by real units or trusted axioms, then compiles everything inside Lean so mistakes can't slip through. A new benchmark, PHYX-AF, shows frontier models excel in physics while geometry remains the toughest frontier.
Main achievement: A unified, multimodal autoformalization pipeline (recursive grounding + dimensional/axiomatic termination + axiom composition) capable of formalizing across classical mechanics, relativity, quantum mechanics, thermodynamics, and multiple geometry settings.
Future directions: Improve visual parsing for geometry, expand physics and math libraries, strengthen semantic checking with multi-model referees, and develop smarter compute allocation (dynamic pass@k, adaptive termination). Extend dimensional/type-theoretic grounding into other sciences with richer unit systems.
Why remember this: It's a blueprint for connecting what we see to what we can prove, bridging perception and logic. By baking units and axioms into the core, it keeps formal reasoning tied to the real world, making automated science more trustworthy, reusable, and verifiable.
Practical Applications
- Auto-checking geometry proofs in digital textbooks using both the diagram and the text.
- Verifying physics lab reports by turning graphs and photos into formal, unit-checked derivations.
- Assisting engineers by flagging unit mismatches or geometric inconsistencies in CAD-like diagrams.
- Helping robots plan safe motions by grounding forces and contacts in unit-aware formal logic.
- Building searchable libraries of formalized physics and geometry facts from classroom problems.
- Auditing research figures by linking plots and schematics to machine-checkable statements.
- Generating hint-by-hint tutoring that explains each grounded lemma before the final theorem.
- Creating synthetic, formally verified practice problems for competitions and curricula.
- Cross-checking AI models by using one model to semantically verify another model's formal code.
- Providing regulatory compliance checks where standards require unit-consistent calculations.