MMFormalizer: Multimodal Autoformalization in the Wild
Key Summary
- MMFormalizer is a new system that turns problems with pictures and words (like physics scenes or geometry diagrams) into strict, checkable math statements and proofs.
- Its secret is a step-by-step "recursive grounding" process that builds big ideas from small, picture-linked facts, and it only stops when it reaches a solid anchor: a basic unit (like meters or seconds) or a trusted axiom (like a law of physics).
- It uses Lean, a proof assistant, plus mathlib and PhysLean, to make sure every step compiles and is logically correct.
- A new test set called PHYX-AF checks many skills across math and physics, making sure the picture actually matters, not just the words.
- Frontier models (like GPT-5 and Gemini-3-Pro) do best overall; GPT-5 shines on modern physics, but geometry remains the hardest area for everyone.
- The system includes a semantic checker that verifies each formal statement is supported by both the image and the text.
- Ablation studies show that clear stopping rules, using the image during grounding, and trying multiple candidates (pass@k) all improve results.
- The framework is the first to autoformalize across classical mechanics, relativity, quantum mechanics, and thermodynamics from multimodal inputs.
- It bridges perception and formal logic, making scientific reasoning more reliable, reusable, and verifiable.
- This work highlights how dimensional analysis (units) keeps formal math tied to real-world measurements so the results stay meaningful.
Why This Research Matters
Many real problems mix pictures and words, like engineering blueprints, physics labs, and textbook diagrams, and we need AI that respects both. MMFormalizer ties every step of reasoning to what it sees and to real units, so results stay physically meaningful and verifiable. This makes scientific reasoning more trustworthy for classrooms, research, and industry. It can help catch subtle mistakes early, saving time and improving safety. By compiling inside a proof assistant, it creates reusable, shareable proofs that others can audit. Over time, this could raise the standard of how we write, teach, and check science. It's a step toward AI that not only answers but also proves why the answer is right.
Detailed Explanation
01 Background & Problem Definition
You know how when you solve a puzzle, you look at the picture on the box and also the shapes of the pieces? Old AI "puzzle solvers" were great at reading the box (the text) but not so great at feeling the pieces (the image). That made it hard to trust their final picture.
🍞 Hook: You know how a math word problem often includes a diagram, and if you ignore the diagram, you'll probably miss something important? 🥬 The Concept: Autoformalization is turning everyday math and science language into strict, computer-checkable math statements.
- How it works:
- Read the text and find the key ideas.
- Translate those ideas into a formal language (like writing precise rules).
- Use a proof checker to make sure everything is 100% correct.
- Why it matters: Without autoformalization, computers can't reliably check long, tricky solutions; they might miss mistakes. 🍞 Anchor: If a question says "triangle ABC is isosceles," autoformalization writes a precise statement like "AB = AC" in a proof system so it can be verified.
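To make the anchor concrete, here is a minimal Lean 4 sketch of that statement. The `Isosceles` definition and the import path are illustrative choices on our part, not code from the paper; the point is simply that "AB = AC" becomes a machine-checkable proposition.

```lean
-- Hedged sketch: "triangle ABC is isosceles" as a checkable statement.
-- `Isosceles` is an illustrative name, not a mathlib definition.
import Mathlib.Topology.MetricSpace.Basic

def Isosceles {P : Type*} [MetricSpace P] (A B C : P) : Prop :=
  dist A B = dist A C

-- A goal Lean can discharge: the hypothesis is exactly the claim.
example {P : Type*} [MetricSpace P] (A B C : P)
    (h : dist A B = dist A C) : Isosceles A B C := h
```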
The world before: Researchers built strong text-only autoformalizers. In geometry, many systems could parse symbolic descriptions and prove theorems. Proof assistants like Lean and big libraries like mathlib helped ensure strict correctness. But there was a gap: math and physics in the real world are multimodal; pictures, graphs, and scenes carry crucial details that words alone don't capture.
🍞 Hook: Imagine trying to measure the height of a tree only from a story, without looking at the picture that shows the shadow. 🥬 The Concept: Multimodal reasoning is thinking with words and pictures together.
- How it works:
- Look at the image for shapes, distances, or interactions.
- Read the text for constraints and goals.
- Combine both to form the full problem.
- Why it matters: Without it, an AI might solve the wrong problem by ignoring important clues hidden in the picture. 🍞 Anchor: A function graph plus a question about slope at a point: multimodal reasoning uses the curve shape (image) and the description (text) to get the right derivative.
The problem: Physics isn't just about symbols; it's about units and measurements. "3" is not the same as "3 meters" or "3 seconds." If a system ignores units, it may write pretty-looking equations that mean nothing in reality.
🍞 Hook: Imagine baking cookies with "3 of sugar." Three what: cups, teaspoons, or bags? 🥬 The Concept: Dimensional grounding ties every quantity to real-world units (like meters, seconds, kilograms) so equations stay physically meaningful.
- How it works:
- Detect physical quantities (mass, length, time, charge, temperature).
- Track their dimensions through each step.
- Reject any equation where units donāt match.
- Why it matters: Without dimensional grounding, you could "add meters to seconds," which makes no sense. 🍞 Anchor: F = ma only works if mass is in kilograms and acceleration in meters per second squared; the result must be in newtons.
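The mechanism is easy to sketch in code. Below is a tiny, self-contained Python illustration of the idea (ours, not the paper's implementation): quantities carry dimension exponents, multiplication adds them, and addition is rejected unless the dimensions match.

```python
# Illustrative dimension tracking; not the paper's implementation.
from dataclasses import dataclass

@dataclass(frozen=True)
class Dim:
    M: int = 0  # mass exponent
    L: int = 0  # length exponent
    T: int = 0  # time exponent

    def __mul__(self, other: "Dim") -> "Dim":
        # Multiplying quantities adds dimension exponents.
        return Dim(self.M + other.M, self.L + other.L, self.T + other.T)

@dataclass(frozen=True)
class Qty:
    value: float
    dim: Dim

    def __add__(self, other: "Qty") -> "Qty":
        if self.dim != other.dim:  # reject "meters + seconds"
            raise TypeError(f"dimension mismatch: {self.dim} vs {other.dim}")
        return Qty(self.value + other.value, self.dim)

    def __mul__(self, other: "Qty") -> "Qty":
        return Qty(self.value * other.value, self.dim * other.dim)

mass = Qty(2.0, Dim(M=1))                # 2 kg
accel = Qty(9.8, Dim(L=1, T=-2))         # 9.8 m/s^2
force = mass * accel                     # F = m * a
assert force.dim == Dim(M=1, L=1, T=-2)  # newtons: kg*m/s^2
```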
Failed attempts: Prior systems often (1) treated images as optional, so the model over-relied on text; (2) skipped dimensional checks, letting nonsense equations pass; (3) used retrieval so heavily that they got stuck when the exact code didn't exist; and (4) lacked smart stopping rules, so they kept decomposing forever, generating huge, unusable proof trees.
The gap: We needed a framework that (a) grounds visual pieces into formal statements step by step, (b) stops at the right time using either a fundamental unit (dimensions) or a trusted law (axiom), and (c) compiles inside a real proof assistant using reusable libraries from math and physics.
Real stakes: This matters for safe engineering (diagrams + specs → checked constraints), science education (diagrams in textbooks → verified proofs), research reproducibility (paper figures → formal derivations), and robotics (seeing a scene → making physics-consistent plans). If the AI can see, read, and prove, all while respecting units, we can trust it more in the real world.
02 Core Idea
Aha! Moment in one sentence: Start from the picture and text, build a ladder of statements that stay tied to what you see, and stop only when you hit solid ground: a basic unit or a basic law.
Explain it three ways:
- Detective story: Gather clues from the scene (image) and the witness (text), line them up into facts, and end the case when you reach a known law that the court (Lean) accepts.
- LEGO build: Snap small bricks (visual facts) into medium assemblies (lemmas), then into a sturdy model (axioms), stopping when you reach pieces labeled "fundamental."
- Cooking: Prep ingredients (points, lines, masses), follow the recipe (grounding rules), and plate the dish when the taste test (units or axioms) passes.
🍞 Hook: You know how you stack blocks higher only if each layer sits firmly on the one below? 🥬 The Concept: Recursive grounding builds complex formal claims from simpler, visually supported pieces.
- How it works:
- Parse the image into primitives (points, lines, regions; or objects, forces, trajectories).
- Propose small claims linked to those visuals (lemmas).
- Retrieve helpful known facts from libraries.
- Keep composing until you reach a reliable anchor (dimension or axiom).
- Why it matters: Without this, the AI might jump to big claims that aren't actually supported by the picture or physics. 🍞 Anchor: From a pulley photo: identify masses and strings → relate their directions and tensions → stop at Newton's law with proper units.
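The recursion itself is simple to picture. Here is a toy Python sketch of the control flow with its two stopping conditions; the claim names, `AXIOMS`/`CLOSED` sets, and decomposition table are all invented for illustration.

```python
# Toy sketch of recursive grounding; all data here is invented for illustration.
AXIOMS = {"newton_second_law"}           # trusted laws
CLOSED = {"mass_kg", "accel_m_per_s2"}   # dimensionally closed leaves
CHILDREN = {"net_force_eq": ["mass_kg", "accel_m_per_s2", "newton_second_law"]}

def ground(claim: str, depth: int = 0, max_depth: int = 8) -> list[str]:
    """Return a chain of sub-claims ending at solid anchors."""
    if claim in AXIOMS or claim in CLOSED:  # stop: axiom or base unit
        return [claim]
    if depth >= max_depth:
        raise RuntimeError(f"no anchor found for: {claim!r}")
    chain: list[str] = []
    for sub in CHILDREN.get(claim, []):     # smaller, visually supported claims
        chain.extend(ground(sub, depth + 1, max_depth))
    chain.append(claim)                     # parent rests on grounded children
    return chain

print(ground("net_force_eq"))
# ['mass_kg', 'accel_m_per_s2', 'newton_second_law', 'net_force_eq']
```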
🍞 Hook: Think of the simplest truths in math and physics as the base rules everyone agrees on. 🥬 The Concept: Axiom composition assembles verified small facts into bigger truths using accepted base rules.
- How it works:
- Take grounded lemmas.
- Combine them with axioms (e.g., conservation laws, geometric postulates).
- Produce a final formal statement the prover can check.
- Why it matters: Without composition, you'd have many tiny facts but no finished proof. 🍞 Anchor: In geometry, "AB = AC" and "angles at the base are equal" assemble into "triangle ABC is isosceles," which Lean can verify.
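In Lean terms, composition is just function application: grounded lemmas are hypotheses, an accepted base rule is an implication, and the composed claim is its conclusion. The proposition names below are illustrative placeholders.

```lean
-- Sketch of axiom composition; proposition names are illustrative.
variable (SidesABEqualAC IsIsoscelesABC : Prop)

example (h1 : SidesABEqualAC)               -- grounded lemma from the image
    (ax : SidesABEqualAC → IsIsoscelesABC)  -- accepted base rule
    : IsIsoscelesABC :=
  ax h1                                     -- composed, checkable result
```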
🍞 Hook: Imagine a super-strict math teacher who never lets a sloppy step slide. 🥬 The Concept: The Lean system is a proof checker that accepts only fully precise, well-typed statements.
- How it works:
- You write definitions and theorems.
- Lean checks each step follows the rules.
- If everything passes, it compiles; if not, it points to the error.
- Why it matters: Without Lean, you might trust a proof that looks right but contains hidden mistakes. 🍞 Anchor: "Force equals mass times acceleration" is encoded and checked so the system won't accept mismatched units or undefined terms.
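A two-line illustration of that strictness (our example, not the paper's encoding): the first statement type-checks, while the commented one would be rejected because the types don't match.

```lean
import Mathlib.Data.Real.Basic

-- Well-typed: force as mass times acceleration over the reals compiles.
example (m a : ℝ) : ℝ := m * a

-- Ill-typed: Lean rejects adding a real number to a string.
-- example (m : ℝ) (s : String) : ℝ := m + s  -- error: type mismatch
```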
Before vs After:
- Before: Systems often ignored images, forgot units, or overfit to known code snippets. Proof trees could spiral without ending.
- After: MMFormalizer builds statements that are visually and dimensionally grounded, composes them into axioms, and compiles in Lean. It also defines explicit stopping rules so reasoning halts safely.
Why it works (intuition, no equations): Images anchor symbols to reality; dimensions act like a ruler that rejects nonsense; axioms give you trusted glue to assemble parts; and the prover enforces strict logic. Together, these prevent drift and keep the reasoning chain short, sensible, and checkable.
Building blocks introduced in this work:
- 🍞 Hook: When you read a map, you first notice cities and roads.
🥬 The Concept: SceneGraph is a structured list of visual primitives (like points, lines, or objects) plus their relations.
- How it works: Parse the image → detect primitives → record relations such as parallel, touching, inside.
- Why it matters: Without a clean map of the scene, you can't link visuals to formal statements. 🍞 Anchor: In a triangle picture, the SceneGraph lists A, B, C as points and AB, BC, CA as lines, including angle relations.
- 🍞 Hook: Think of a diary recording every solid mini-fact you've confirmed.
🥬 The Concept: PropChain is the chain of small lemmas, each connected to something you saw.
- How it works: For each sub-part of the scene, propose a lemma, retrieve relevant theorems, and confirm it.
- Why it matters: Without PropChain, you jump too far and risk errors. 🍞 Anchor: "Point D is the midpoint of AB" enters the chain because the image shows equal ticks on the segment.
- 🍞 Hook: Picture locking your build by snapping in a few special key pieces.
🥬 The Concept: AxiomChain is the final list where each claim is either an axiom or tied to dimensions that close the loop.
- How it works: Elevate PropChain items into axiom-backed or dimension-anchored statements.
- Why it matters: Without AxiomChain, your proof can't finalize. 🍞 Anchor: In physics, once momentum and force are tied through Newton's law with correct units, the chain can terminate.
- 🍞 Hook: Like searching a library by meaning, not just by exact words.
🥬 The Concept: LeanSearch retrieves helpful definitions and lemmas from mathlib and PhysLean.
- How it works: Index the library, embed statements, and fetch top matches to guide formalization.
- Why it matters: Without retrieval, you reinvent the wheel or get stuck. 🍞 Anchor: Need "midpoint" properties? LeanSearch brings the exact lemmas to use.
- 🍞 Hook: Think of a referee who checks both the instant replay (image) and the rulebook (text).
🥬 The Concept: Semantic checking verifies that each formal statement is supported by both modalities.
- How it works: Compare the image, the text, and the code; approve only if they match.
- Why it matters: Without this, the system might pass a good-looking but wrong proof. 🍞 Anchor: If the diagram shows parallel lines, the checker rejects any proof that treats them as intersecting.
Put together, MMFormalizer is a disciplined builder: see → ground → compose → check → stop at dimensions/axioms, all inside a prover.
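As a rough picture only, the core data structures might be shaped like the Python below; the field names are our guesses, not the paper's schema.

```python
# Hypothetical shapes for SceneGraph, PropChain, and AxiomChain.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    primitives: list[str] = field(default_factory=list)  # "point A", "line AB", ...
    relations: list[tuple[str, str, str]] = field(default_factory=list)
    # e.g., ("AB", "parallel", "CD") or ("block", "on", "incline")

@dataclass
class Lemma:
    statement: str   # informal note, e.g., "D is the midpoint of AB"
    lean_code: str   # proposed formal statement
    evidence: str    # which visual primitive/relation supports it

@dataclass
class PropChain:
    lemmas: list[Lemma] = field(default_factory=list)

@dataclass
class AxiomChain:
    # Each entry is axiom-backed or dimension-anchored, so the proof can end.
    closed_statements: list[Lemma] = field(default_factory=list)
```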
03 Methodology
At a high level: Input (image + text) → Parse scene → Propose grounded lemmas (PropChain) → Retrieve supporting facts (LeanSearch) → Check for termination (dimension or axiom) → Compose into AxiomChain → Compile in Lean → Semantic check → Output verified formal code.
Step-by-step, with purpose and examples:
- Parse the image into a SceneGraph.
- What happens: The system detects primitives (points, lines, regions for geometry; objects, vectors, forces for physics) and relations (parallel, touching, contained_in).
- Why this exists: You canāt reason about what you canāt name. SceneGraph is the vocabulary of the picture.
- Example: In a triangle diagram, identify points A, B, C; lines AB, BC, CA; and a marked right angle at B.
- Create initial hypotheses tied to visuals.
- What happens: For each primitive or relation, the system writes a short, informal note like "AB ⊥ BC" or "Block on incline with angle θ."
- Why: These are the seeds for formal lemmas; they keep reasoning anchored to whatās visible.
- Example: Seeing equal tick marks on two sides suggests "AB = AC."
- Build the PropChain by turning notes into lemmas.
- What happens: The model queries LeanSearch with each note to find relevant definitions and lemmas, then proposes a formal lemma with a proof sketch.
- Why: Small, verified steps prevent big errors.
- Example: Use a midpoint lemma from mathlib to formalize "D is the midpoint of AB" when the diagram shows equal halves.
- Retrieve from libraries (mathlib + PhysLean) via LeanSearch.
- What happens: The system gets the top-k relevant items (names, types, comments) to guide formalization and proof.
- Why: Reuse trusted math and physics facts instead of starting from scratch.
- Example: For "momentum is mass times velocity," PhysLean provides precise definitions and links to Hamilton's equations.
- Check for recursive termination with dimensional or axiomatic grounding.
- What happens: After each lemma, the system asks, "Can I end here?" Two ways to stop:
- Dimensional grounding: The statement reduces to base units like [M], [L], [T], [Q], [Θ] or approved derived units (force, energy).
- Axiomatic grounding: The statement rests directly on a fundamental axiom or law (e.g., Newton's laws, conservation laws, Maxwell's equations).
- Why: This prevents infinite decomposition into unnecessary sub-lemmas and keeps proofs meaningful.
- Example: In a pulley system, once tensions and accelerations are tied to Newton's second law with correct units (N = kg·m/s²), stop.
- Compose lemmas into higher-level results (AxiomChain).
- What happens: Combine child lemmas to form non-leaf theorems, promoting them to the axiom/dimension-closed level.
- Why: Composition turns many small truths into a final, reusable theorem.
- Example: "midpoint D," "medians intersect at centroid," and angle facts build into a triangle center theorem.
- Syntax and type checking (compilation in Lean).
- What happens: Lean verifies that the definitions and proofs compile: no missing variables, wrong types, or illegal steps.
- Why: Guarantees machine-level correctness.
- Example: A proof that compiles in Lean is like a bridge that passes all safety inspections.
- Semantic checking across modalities.
- What happens: A separate checker rates each image-text-formula triplet as 1 (accept) or 0 (reject) based on cross-modal consistency.
- Why: Prevents accepting a formal proof that doesnāt match the picture or statement.
- Example: If the picture shows 60°, but the proof uses 90°, the checker rejects it.
- Output verified formal code.
- What happens: The system returns Lean code (theorem + proof) plus the dependency graph (what depended on what).
- Why: This makes the reasoning transparent, reusable, and checkable by others.
- Example: A theorem like "force equals rate of change of momentum" compiled from a specific scene.
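Putting the steps together, a driver loop could look roughly like this. Every function name is a hypothetical placeholder for the corresponding stage above, not an actual MMFormalizer API, so treat this as pseudocode with Python syntax.

```python
# End-to-end pipeline sketch; all stage functions are hypothetical placeholders.
def formalize(image, text):
    scene = parse_scene(image)                     # 1. build the SceneGraph
    notes = propose_hypotheses(scene, text)        # 2. visual-grounded notes
    prop_chain = []
    for note in notes:
        hits = lean_search(note, k=5)              # 3-4. retrieve mathlib/PhysLean facts
        lemma = formalize_lemma(note, hits)
        if not (dimensionally_closed(lemma) or rests_on_axiom(lemma)):
            lemma = ground(lemma)                  # 5. recurse until anchored
        prop_chain.append(lemma)
    theorem = compose(prop_chain)                  # 6. promote to the AxiomChain
    if not lean_compiles(theorem):                 # 7. syntax/type check
        raise ValueError("Lean compilation failed")
    if semantic_check(image, text, theorem) != 1:  # 8. cross-modal referee
        raise ValueError("semantic check rejected the formalization")
    return theorem                                 # 9. verified formal code
```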
The secret sauce:
- Explicit termination: Reasoning stops when grounded in units or axioms, avoiding runaway trees (the paper shows a failure case without this rule).
- Dimension tracking: Units act like guardrails to block physically meaningless equations.
- Library retrieval: LeanSearch injects just-in-time knowledge from mathlib/PhysLean so the model composes the right building blocks.
- Proof assistant backbone: Lean enforces rigor at every step, turning ideas into certifiable artifacts.
Mini "sandwich" intros for supporting notions:
- 🍞 Hook: Mixing cups with teaspoons ruins recipes.
🥬 The Concept: Dimensional analysis ensures units align in every step.
- How it works: Tag quantities with units, propagate them, reject mismatches.
- Why it matters: Stops nonsense like adding meters to seconds. 🍞 Anchor: Validates that energy has units of joules and momentum of kg·m/s.
- 🍞 Hook: A city map needs a legend.
🥬 The Concept: SceneGraph is the map legend of the image.
- How it works: List primitives and relations.
- Why it matters: It's the foundation for grounding. 🍞 Anchor: Points, lines, "parallel," and "perpendicular" all live here.
- 🍞 Hook: A diary of facts builds trust.
🥬 The Concept: PropChain records grounded mini-theorems.
- How it works: One lemma at a time, linked to the picture.
- Why it matters: Prevents leaps. 🍞 Anchor: "D is midpoint of AB" before proving centroid facts.
- 🍞 Hook: Final locks make furniture sturdy.
🥬 The Concept: AxiomChain locks proofs with axioms or dimensions.
- How it works: Promote lemmas to closed statements.
- Why it matters: Finishes the proof. 🍞 Anchor: Newton's laws cap a mechanics proof.
- 🍞 Hook: Ask a librarian who knows meanings, not just titles.
🥬 The Concept: LeanSearch finds relevant theorems.
- How it works: Embed, index, retrieve top matches.
- Why it matters: Avoids reinventing the wheel. 🍞 Anchor: Fetch "centroid" lemmas when needed.
- 🍞 Hook: A referee watches the play and the replay.
🥬 The Concept: Semantic checking compares image, text, and code.
- How it works: Accept only when all agree.
- Why it matters: Blocks pretty-but-wrong proofs. 🍞 Anchor: Rejects a 90° claim when the diagram shows 60°.
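Concretely, the referee can be any multimodal model prompted for a binary verdict. Below is a minimal, runnable sketch under that assumption; `judge` stands in for whatever model call is available and is stubbed so the snippet executes.

```python
# Minimal cross-modal referee sketch; `judge` is a stand-in for a real VLM call.
from typing import Callable

def semantic_check(image_ref: str, problem_text: str, lean_code: str,
                   judge: Callable[[str], str]) -> int:
    """Return 1 (accept) if image, text, and code agree, else 0 (reject)."""
    prompt = (
        f"Image: {image_ref}\n"
        f"Problem text: {problem_text}\n"
        f"Lean code:\n{lean_code}\n"
        "Does the code faithfully formalize BOTH the text and the diagram? "
        "Answer ACCEPT or REJECT."
    )
    return 1 if "ACCEPT" in judge(prompt).upper() else 0

# Stubbed judge so the sketch runs; swap in a real multimodal model.
print(semantic_check("triangle.png", "Angle B is 60 degrees.",
                     "theorem angleB_eq : angleB = 60 := sorry",
                     judge=lambda p: "ACCEPT"))  # -> 1
```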
04 Experiments & Results
The test: The authors built PHYX-AF, 115 carefully chosen multimodal problems spanning MathVerse (plane/solid geometry, function graphs), PhyX (mechanics, electromagnetism, thermodynamics, modern physics: relativity and quantum), Synthetic Geometry (hard, out-of-distribution constructions), and Analytic Geometry (coordinate-heavy). Crucially, they filtered out any problem solvable by text alone, so the image really matters.
What they measured and why:
- Compile accuracy: Does the generated Lean code type-check and run? This is the first gate to ensure formal correctness.
- Semantic correctness: Is the solution meaningfully right, even if written differently than a reference? This captures true understanding.
- Human verification: Experts check logic and multimodal consistency. This guards against corner cases.
The competitors:
- Frontier closed models: GPT-5, Gemini-3-Pro, Gemini-2.5-Pro.
- Open models: Qwen3-VL-235B, Qwen2.5-VL-72B.
- All were used inside the same MMFormalizer pipeline to keep comparisons fair.
The scoreboard (with context):
- GPT-5 excelled on the PhyX problems, especially modern physics (relativity and quantum). Think of it as getting an A in the toughest physics class.
- Gemini-3-Pro led on many math categories (MathVerse and Analytic Geometry), showing strong general multimodal reasoning. That's like topping the class in geometry and graphs.
- Geometry remains hardest overall. Synthetic and Analytic Geometry showed big gaps: models struggled to connect precise visual relations (angles/lengths) to formal proofs, especially out of distribution. That's like students doing well on homework but stumbling on tricky Olympiad-style puzzles.
- Advanced open-source models lagged notably in physical domains and out-of-distribution geometry. They often failed to produce compilable code or correct semantics compared to frontier systems.
Surprising and useful findings (ablations):
- Removing retrieved code snippets sometimes helped in Synthetic Geometry. Why? It freed the model to invent new dependent types instead of forcing it into mismatched templates, like letting a student derive a fresh formula instead of copying the wrong one.
- Explicit termination rules were crucial. Without them, the reasoning tree grew too deep and wide, wasting compute and failing to finish, like writing an essay that never ends.
- Using the image during grounding improved performance in tougher areas, including modern physics and synthetic geometry. Seeing matters!
- Trying multiple candidates (pass@k) improved performance on hard problems, showing that test-time scaling (more tries) can pay off.
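For reference, pass@k is usually computed with the standard unbiased estimator from the code-generation literature (Chen et al., 2021); whether the paper uses exactly this formula is our assumption.

```python
# Common unbiased pass@k estimator; its use here is our assumption, not the paper's statement.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n candidates generated, c of which verify; P(at least 1 of k samples verifies)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must verify
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(n=10, c=2, k=4), 3))  # 0.667: 2 verified out of 10, budget 4
```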
Semantic checking analysis:
- Multiple LLMs were tested as semantic referees over code produced by strong and weaker models. Some "weaker" models still performed surprisingly well at flagging mismatches when verifying code from open-source systems. This hints at a practical workflow: cross-check models can catch each other's mistakes.
Takeaway in plain terms: The framework works across many domains, but geometry with complex visuals is still the Achilles' heel. Frontier models handle physics better (especially with dimensional grounding), while images are truly needed for top performance. Smart stopping rules, image grounding, and multiple attempts make a real difference.
05 Discussion & Limitations
Limitations:
- Visual parsing brittleness: If the diagram is cluttered, low-resolution, or ambiguous, the SceneGraph can be wrong, and the whole chain inherits that error.
- Geometry perception remains hard: Precise angle/length relations and auxiliary constructions are frequent failure points, especially for unfamiliar setups.
- Library coverage: Even with mathlib and PhysLean, some needed definitions or modern physics constructs may be missing, forcing the model to synthesize types under pressure.
- Compute and latency: Recursive grounding with retrieval and compilation can be slow, especially if pass@k is large or trees grow deep.
- Semantic checking is not perfect: Automated cross-modal checks still sometimes accept or reject borderline cases; human verification remains valuable.
Required resources:
- A working Lean 4 toolchain with mathlib and PhysLean, plus the LeanSearch index built over them.
- A capable multimodal LLM (frontier models do best) with enough context window and tool-use ability.
- GPUs/CPUs for retrieval and compilation at scale; storage for indices and logs; a sandboxed environment for code execution.
When not to use:
- Extremely noisy or stylized images where key geometry/physics cues are unreadable.
- Tasks that are purely text-based (no need for multimodal overhead).
- Domains with missing axioms or unit systems (no reliable dimensional/axiomatic anchors yet).
- Real-time control loops that can't tolerate compilation latency.
Open questions:
- Can we learn better visual parsers specialized for formal geometry and physics cues to reduce SceneGraph errors?
- How can the system automatically expand libraries (safely) when it discovers missing types or lemmas, without hallucination?
- Can we unify stronger unit-aware type systems across more sciences (chemistry, biology) for robust dimensional grounding?
- What's the best strategy to allocate compute across grounding, retrieval, and composition (e.g., dynamic pass@k) for reliability vs. speed?
- How can multiple LLMs serve as mutual referees to raise semantic checking reliability without human review?
06 Conclusion & Future Work
In three sentences: MMFormalizer turns problems that mix pictures and words into strict, checkable math and physics proofs by building them up from what it can see and read. It stops only when every step is anchored by real units or trusted axioms, then compiles everything inside Lean so mistakes can't slip through. A new benchmark, PHYX-AF, shows frontier models excel in physics while geometry remains the toughest frontier.
Main achievement: A unified, multimodal autoformalization pipeline (recursive grounding + dimensional/axiomatic termination + axiom composition) capable of formalizing across classical mechanics, relativity, quantum mechanics, thermodynamics, and multiple geometry settings.
Future directions: Improve visual parsing for geometry, expand physics and math libraries, strengthen semantic checking with multi-model referees, and develop smarter compute allocation (dynamic pass@k, adaptive termination). Extend dimensional/type-theoretic grounding into other sciences with richer unit systems.
Why remember this: It's a blueprint for connecting what we see to what we can prove, bridging perception and logic. By baking units and axioms into the core, it keeps formal reasoning tied to the real world, making automated science more trustworthy, reusable, and verifiable.
Practical Applications
- Auto-checking geometry proofs in digital textbooks using both the diagram and the text.
- Verifying physics lab reports by turning graphs and photos into formal, unit-checked derivations.
- Assisting engineers by flagging unit mismatches or geometric inconsistencies in CAD-like diagrams.
- Helping robots plan safe motions by grounding forces and contacts in unit-aware formal logic.
- Building searchable libraries of formalized physics and geometry facts from classroom problems.
- Auditing research figures by linking plots and schematics to machine-checkable statements.
- Generating hint-by-hint tutoring that explains each grounded lemma before the final theorem.
- Creating synthetic, formally verified practice problems for competitions and curricula.
- Cross-checking AI models by using one model to semantically verify another model's formal code.
- Providing regulatory compliance checks where standards require unit-consistent calculations.