Semantic Search over 9 Million Mathematical Theorems

Intermediate
Luke Alexander, Eric Leonen, Sophie Szeto et al. · 2/5/2026
arXiv · PDF

Key Summary

  • This paper builds a Google-for-theorems: a semantic search engine that finds exact theorems, lemmas, and propositions instead of just entire papers.
  • They gathered 9.2 million human-written, research-level theorem statements from arXiv and seven other open sources.
  • Each theorem is rewritten as a short plain-English “slogan” by a language model so that natural-language queries can match theorems written in LaTeX.
  • Embedding these slogans into vectors and using fast nearest-neighbor search lets the system retrieve relevant theorems at web scale.
  • On 111 real queries authored by professional mathematicians, the system achieves 45.0% Hit@20 at the theorem level, beating ChatGPT 5.2 with search (19.8%) and Gemini 3 Pro (27.0%).
  • At the paper level, it reaches 56.8% Hit@20, outperforming Google Search (37.8%).
  • More context when writing slogans (using the paper’s introduction) improves retrieval significantly over using the theorem body alone.
  • A reranker further improves top-1 accuracy, showing that late, fine-grained matching helps distinguish very similar results.
  • The tool especially shines at surfacing auxiliary lemmas buried deep in papers, which paper-level search often misses.
  • A public demo and dataset are released so researchers and AI systems can use and build on this capability.

Why This Research Matters

This work lets mathematicians and students find the exact theorem they need within seconds, rather than hunting through entire papers. It reduces duplicated research and helps authors confirm novelty before submitting, avoiding painful retractions or withdrawals. For AI systems, precise theorem-level retrieval provides the right building blocks for proof search and reasoning, improving reliability. Teachers and learners can locate crisp statements and standard lemmas for lessons, homework, and self-study. The public dataset and demo invite the community to build better tools and extend coverage, supporting open science. As this approach spreads, we can expect faster literature discovery and more accurate, grounded AI in mathematics.

Detailed Explanation

01Background & Problem Definition

You know how finding a specific movie scene is much harder than just finding the movie title? If you want the moment when the hero says one line, scrolling the whole film is slow and frustrating. Mathematicians face the same problem: the exact thing they need is a theorem, lemma, proposition, or corollary tucked inside a long paper—and most search engines only find the whole paper.

🍞 Hook: Imagine you’re trying to find the one LEGO piece that makes your spaceship work, but every store only sells full mixed boxes. You don’t want the box; you want that exact piece. 🥬 The Concept (Semantic Search): What it is: A way to find information by meaning, not just matching exact words. How it works:

  1. Turn text into meaning-rich numbers (embeddings).
  2. Put those numbers in a big map where “close” means “similar meaning.”
  3. For a query, find the closest spots on the map. Why it matters: If we only match words, we miss results that say the same idea with different wording, and we waste time opening entire papers. 🍞 Anchor: If you search “a rational variety is simply connected,” good semantic search finds the exact statement—even if the paper phrases it as “smooth projective rational varieties have trivial fundamental group.”
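
To make those three steps concrete, here is a minimal sketch of meaning-based search in Python: embed a few slogans, embed a query, and rank by cosine similarity. The small open embedding model and the toy slogans are illustrative stand-ins; the paper's system embeds 9.2 million slogans with Qwen3-Embedding-8B.

```python
# A minimal sketch of semantic search: embed slogans, embed a query, rank by
# cosine similarity. Model and slogans are stand-ins, not the paper's setup.
from sentence_transformers import SentenceTransformer
import numpy as np

slogans = [
    "Every smooth projective rational variety has trivial fundamental group.",
    "Over a separably closed field, rational points are dense in affine n-space.",
    "Abelian restricted Lie algebras over an algebraically closed field split into standard pieces.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")            # small stand-in embedder
corpus = model.encode(slogans, normalize_embeddings=True)  # one unit-length vector per slogan

query = "a rational variety is simply connected"
q = model.encode([query], normalize_embeddings=True)[0]

scores = corpus @ q                 # cosine similarity, since all vectors are normalized
best = int(np.argmax(scores))
print(f"best match ({scores[best]:.3f}):", slogans[best])
```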

The world before: arXiv has millions of papers, and math alone has hundreds of thousands. Tools like Google Scholar and arXiv search help you find papers, but not the specific theorem you remember from a lecture or need for your proof. Both human mathematicians and AI proof agents often need one precise statement, not a whole PDF. This mismatch has real consequences. Humans sometimes re-prove results that already exist, leading to painful paper withdrawals. AI tools may “solve” open problems that were actually closed long ago, because they couldn’t find the prior theorem.

🍞 Hook: You know how kids can understand stories better than code? Computers are the opposite—they love code but struggle with stories. 🥬 The Concept (Natural Language Processing, NLP): What it is: Technology that helps computers understand and use human language. How it works:

  1. Read text and break it into pieces (tokens).
  2. Learn patterns of meaning from huge amounts of text.
  3. Turn sentences into vectors that capture meaning. Why it matters: Without NLP, a computer can’t “get” what a math query in plain English is asking, so it can’t match it to the right theorem. 🍞 Anchor: When someone types, “over an algebraically closed field, abelian p-Lie algebras split in a special way,” NLP helps the system understand and compare that idea to millions of theorem statements.

The problem: Mathematical theorems are written in LaTeX with symbols and specialized notation, while search queries are usually in informal English. If we directly embed raw LaTeX, many general-purpose models struggle to understand the math symbols. If we stick to paper-level results, we miss the deeply buried lemmas.

Failed attempts: Past MathIR efforts did great work on formula search (finding matching equations) and on paper-level search, but not on general meaning-level theorem retrieval across millions of statements. Classic sparse keyword methods (like BM25) stumble when different authors use different words for the same concept. Even dense retrieval over raw LaTeX is brittle because symbol-heavy text confuses models trained mostly on normal language.

The gap: We needed a bridge that lets informal, plain-English queries find formal theorem statements accurately, at massive scale.

Real stakes:

  • Researchers can avoid re-proving known results.
  • Students and teachers can find the exact facts they need for learning or homework.
  • Reviewers can quickly check references.
  • AI agents can pick the right lemmas during automated theorem proving instead of wandering blindly.

🍞 Hook: Think of writing a book report: it’s easier to find a book if you know its short, clear summary. 🥬 The Concept (Slogan Generation): What it is: Making a short, plain-English summary of a theorem’s main idea. How it works:

  1. Read the theorem (and some context like the abstract or introduction).
  2. Write a few clear sentences in everyday English, avoiding symbols.
  3. Store that summary as the searchable representation. Why it matters: Now informal questions match formal theorems because both live in the same language space. 🍞 Anchor: From “If k is separably closed, k-points are dense in A^n,” a slogan might be “Over a separably closed field, the rational points are dense in affine n-space,” which a natural-language query can hit directly.

Bottom line: The paper shows that turning 9.2 million theorem statements into concise English slogans, then embedding and indexing them for semantic search, lets users find the exact theorem they had in mind far more often than with general-purpose tools.

02Core Idea

The “Aha!” in one sentence: Represent every theorem with a short, symbol-free English slogan and search over those—because people ask in English while theorems live in LaTeX.

Three analogies for the same idea:

  1. Recipe cards: Instead of searching whole cookbooks (papers), write one clear recipe card (slogan) per dish (theorem). Then, when you ask “quick vegetarian pasta,” you get the exact card.
  2. Song summaries: If each song had a one-line mood/tag summary, a request like “calm piano for studying” would match the right song—even if the song title is unrelated.
  3. Map pins: Every theorem gets a GPS pin (a vector) based on its meaning. Your question becomes a pin too. Search is just “find nearby pins.”

🍞 Hook: You know how computers can’t really compare stories unless we give them numbers? 🥬 The Concept (LLM Embedding): What it is: Turning sentences into number lists (vectors) that capture meaning. How it works:

  1. A language model reads the text.
  2. It outputs a fixed-length vector (like 4096 numbers) that encodes the text’s meaning.
  3. Similar meanings end up as nearby vectors. Why it matters: Now computers can compare ideas by distance: close means similar. 🍞 Anchor: The query “every smooth projective rational variety has trivial fundamental group” and a slogan that says the same idea will have vectors pointing in nearly the same direction.

Before vs. after:

  • Before: Search engines mostly returned whole papers, and symbol-heavy LaTeX confused general models. You often had to sift manually.
  • After: The system treats each theorem as a first-class item with a clear English summary, so informal queries snap to exact statements—even deep inside a paper.

Why it works (intuition):

  • People phrase the same math idea in many ways. Dense embeddings capture meaning beyond exact words.
  • Stripping symbols and using natural language reduces ambiguity and puts queries and theorems in the same space.
  • Large-scale vector indexes can scan millions of items quickly, so this is fast enough for real use.

Building blocks (small pieces that make it go):

🍞 Hook: Imagine a librarian who first gives you a handful of likely shelves, then helps you pick the single best book. 🥬 The Concept (Dense Passage Retrieval): What it is: A two-step search that quickly surfaces likely matches using embeddings. How it works:

  1. Encode everything as vectors.
  2. Use a fast index to fetch the nearest pieces.
  3. Optionally rerank those few using a more precise method. Why it matters: It makes web-scale search both fast (coarse step) and accurate (fine step). 🍞 Anchor: Ask “splitting of abelian p-Lie algebras”; DPR brings you the top 100 matches instantly, then a reranker sorts them so the exact theorem lands near the top.

🍞 Hook: Think of a clever city map with highways and local streets to get anywhere fast. 🥬 The Concept (HNSW Indexing): What it is: A data structure that finds nearest vectors quickly by hopping through layers of shortcuts. How it works:

  1. Build a layered graph where nearby vectors are linked.
  2. Start at the top (coarse map), move to promising areas.
  3. Dive down layers to precise neighbors. Why it matters: Without it, you’d compare your query to millions of items one by one—too slow. 🍞 Anchor: For 9.2 million theorems, HNSW lets you jump to the right neighborhood in milliseconds.
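
A toy version of this idea, using the hnswlib library on random vectors: build the layered graph once, then answer nearest-neighbor queries quickly. Sizes and parameters (M, ef_construction, ef) are illustrative defaults, not the paper's settings.

```python
# A minimal HNSW sketch with hnswlib: build an index, query for top-20 neighbors.
import hnswlib
import numpy as np

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)  # build the layered graph
index.add_items(data, np.arange(n))

index.set_ef(100)                                            # search-time accuracy/speed knob
query = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query, k=20)             # top-20 approximate neighbors
print(labels[0][:5], distances[0][:5])
```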

🍞 Hook: When we use simplified versions of data, we can compute differences super fast. 🥬 The Concept (Hamming Distance): What it is: Counting how many bits differ between two binary codes. How it works:

  1. Convert vectors into compact binary codes.
  2. Compare two codes by counting bit mismatches.
  3. Use it to quickly shortlist candidates. Why it matters: It’s extremely fast, so you get a small pool of likely matches before detailed scoring. 🍞 Anchor: 4096-bit codes let the system fetch a tight candidate set for your query almost instantly.
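
Here is a small numpy sketch of the same idea, assuming the common sign-based binary quantization: each coordinate becomes one bit, and Hamming distance is an XOR followed by a popcount. Corpus size and the 200-candidate shortlist are illustrative.

```python
# Sign-based binary quantization plus Hamming distance as a fast first filter.
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Pack each vector's sign pattern into bytes, one bit per coordinate."""
    return np.packbits(vectors > 0, axis=1)

def hamming(query_bits: np.ndarray, corpus_bits: np.ndarray) -> np.ndarray:
    """Count differing bits between one packed code and many packed codes."""
    return np.unpackbits(np.bitwise_xor(query_bits, corpus_bits), axis=1).sum(axis=1)

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 4096)).astype(np.float32)   # toy slogan vectors
query = rng.standard_normal((1, 4096)).astype(np.float32)

dists = hamming(binarize(query), binarize(corpus))
shortlist = np.argsort(dists)[:200]              # cheap first-stage candidate pool
print(shortlist[:10], dists[shortlist[:10]])
```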

🍞 Hook: If two arrows point in almost the same direction, they’re saying the same thing. 🥬 The Concept (Cosine Similarity): What it is: A way to measure how much two vectors point the same way. How it works:

  1. Normalize both vectors.
  2. Compute their angle’s cosine.
  3. Bigger cosine = more similar meaning. Why it matters: It’s a strong final check to rank the most semantically aligned theorems. 🍞 Anchor: Among candidates for “density of k-points in A^n,” the one with the highest cosine similarity is likely your theorem.

Put together, the core idea is simple but powerful: say the theorem clearly in English, turn it into a vector, and use smart maps and distances to find it again from a human question.

03Methodology

At a high level: User query → embed into a vector → fast approximate search to get candidate theorems → precise reranking → ranked list of theorems with links and LaTeX bodies.

Step 1: Collect and parse theorems

  • What happens: The authors gather 9.2 million theorems from arXiv and seven other open sources. For arXiv LaTeX, they use three tactics: (1) plasTeX node parsing, (2) a custom TeX package that logs theorem environments during compilation, and (3) regex as a fallback. For other sources (Stacks Project, ProofWiki, etc.), they adapt to each site’s structure and extract theorem names, bodies, labels, and links.
  • Why it exists: Theorems are the atoms of math arguments; you can’t search them if you can’t reliably pull them out of papers.
  • Example: From “Theorem 3.9 (Shokurov reduction)” they capture the name and the full LaTeX body, expanding simple macros so symbols are consistent.

🍞 Hook: Think of taking apart a complex LEGO build to list every special piece. 🥬 The Concept (Theorem Parsing): What it is: Extracting the type, number, note, and body of each theorem-like environment from source files. How it works:

  1. Scan LaTeX (or wikitext) to find theorem blocks.
  2. Record their names, labels, and bodies.
  3. Clean/normalize content (e.g., expand simple macros, strip noise). Why it matters: Without clean, structured theorem data, we can’t index or retrieve precise statements. 🍞 Anchor: From a Stacks Project .tex file, the system pulls “Lemma 056U” plus its statement and a link so you can verify the source.
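
A miniature version of the regex fallback: find theorem-like environments in LaTeX source and record their type, optional name, and body. The paper's primary parsers (plasTeX and the logging TeX package) are more robust than this single pattern, which is only illustrative.

```python
# Regex fallback sketch for extracting theorem-like environments from LaTeX.
import re

latex = r"""
\begin{theorem}[Shokurov reduction]\label{thm:shokurov}
If the log minimal model program holds in dimension $d$, then ...
\end{theorem}
\begin{lemma}
Over a separably closed field $k$, the $k$-points are dense in $\mathbf{A}^n$.
\end{lemma}
"""

ENV = re.compile(
    r"\\begin\{(theorem|lemma|proposition|corollary)\}"  # environment type
    r"(?:\[(?P<note>[^\]]*)\])?"                         # optional name/note
    r"(?P<body>.*?)"                                     # statement body
    r"\\end\{\1\}",
    re.DOTALL,
)

for match in ENV.finditer(latex):
    kind, note, body = match.group(1), match.group("note"), match.group("body").strip()
    print(kind, "|", note, "|", body[:60])
```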

Step 2: Write natural-language slogans

  • What happens: A large language model (e.g., DeepSeek V3, Gemini 3 Pro, Claude Opus 4.5) reads the theorem (and sometimes the abstract or introduction) and writes a short, clear English summary—no symbols, no proof steps.
  • Why it exists: People search in English; models embed English better than raw LaTeX; slogans bridge the gap.
  • Example: The dense-k-points fact becomes a 1–4 sentence summary in plain English.
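
A hedged sketch of this step against an OpenAI-compatible chat API. The paper uses models such as DeepSeek V3, Gemini 3 Pro, and Claude Opus 4.5; the prompt wording and the model name below are illustrative assumptions, not the authors' actual prompt.

```python
# Slogan generation sketch: ask a chat model for a short, symbol-free summary.
from openai import OpenAI

client = OpenAI()  # assumes an API key / compatible endpoint is configured

theorem_latex = r"If $k$ is separably closed, then the $k$-points are dense in $\mathbf{A}^n_k$."
intro_context = "The paper studies rational points on affine varieties over various base fields."

prompt = (
    "Summarize the following theorem in 1-4 plain-English sentences, "
    "avoiding LaTeX symbols and proof details. Use the context only to resolve notation.\n\n"
    f"Context:\n{intro_context}\n\nTheorem:\n{theorem_latex}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # the searchable "slogan"
```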

Step 3: Embed everything

  • What happens: They embed every slogan and user query with Qwen3-Embedding-8B into 4096-dimensional vectors. These are stored in PostgreSQL with pgvector.
  • Why it exists: Vectors let meaning be compared by distance.
  • Example: The query “splitting of abelian p-Lie algebras over algebraically closed fields” and a matching slogan land close in vector space.
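
A sketch of this step, assuming the Qwen3-Embedding-8B checkpoint loads through sentence-transformers and that queries get an instruction-style prefix (the "task instructions" ablated later in the experiments). The model id and instruction wording are assumptions based on common usage, not the paper's exact configuration.

```python
# Embedding slogans and an instruction-prefixed query into 4096-d vectors.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")   # assumed model id; 4096-d output

slogans = [
    "Over a separably closed field, rational points are dense in affine n-space.",
    "Abelian restricted Lie algebras over an algebraically closed field decompose explicitly.",
]
doc_vecs = model.encode(slogans, normalize_embeddings=True)

instruction = ("Instruct: Given a math question, retrieve theorem statements that answer it\n"
               "Query: ")  # assumed instruction text
query_vec = model.encode([instruction + "splitting of abelian p-Lie algebras"],
                         normalize_embeddings=True)[0]
print(doc_vecs.shape, query_vec.shape)   # e.g. (2, 4096) and (4096,)
```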

🍞 Hook: When you want to find nearby houses, you need a map and a way to measure closeness. 🥬 The Concept (Dense Passage Retrieval): What it is: Searching with embeddings so semantically close texts end up as near neighbors. How it works:

  1. Make vectors for all items and the query.
  2. Use an index to find near neighbors quickly.
  3. Optionally rerank them with a finer model. Why it matters: It scales to millions of items and still finds good matches fast. 🍞 Anchor: For “rational varieties are simply connected,” DPR fetches a compact set of likely theorems from millions.

Step 4: Index for fast search

  • What happens: They build an HNSW index over binary-quantized vectors. First, they shortlist candidates by Hamming distance; then they rerank by cosine similarity on the original vectors.
  • Why it exists: Two-stage speeds (binary/Hamming) plus accuracy (cosine) balances performance.
  • Example: For top-k results, they retrieve clamp(max(200, 12×k), 200, 800) candidates, filter by metadata, then compute cosine similarity to sort.
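
A hedged sketch of this two-stage lookup on PostgreSQL with pgvector, assuming pgvector ≥ 0.7 (which provides binary_quantize() and the <~> Hamming operator) and an assumed table theorems(id, slogan, embedding vector(4096)). The SQL, names, and connection string are illustrative, not the authors' code.

```python
# Two-stage retrieval sketch: Hamming shortlist over binary codes, then exact
# cosine rerank on the original vectors.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=theorems")   # placeholder connection string
register_vector(conn)                       # lets numpy arrays bind to the vector type

# One-time setup: HNSW index over binary-quantized embeddings.
conn.execute("""
    CREATE INDEX IF NOT EXISTS theorems_bits_idx ON theorems
    USING hnsw ((binary_quantize(embedding)::bit(4096)) bit_hamming_ops)
""")

def search(query_vec: np.ndarray, k: int = 20):
    n_candidates = min(max(12 * k, 200), 800)   # mirrors clamp(max(200, 12×k), 200, 800)
    # Stage 1: Hamming shortlist on binary codes; Stage 2: cosine rerank on the
    # full vectors (<=> is pgvector's cosine-distance operator).
    return conn.execute("""
        WITH shortlist AS (
            SELECT id, slogan, embedding
            FROM theorems
            ORDER BY binary_quantize(embedding)::bit(4096) <~> binary_quantize(%(q)s)
            LIMIT %(n)s
        )
        SELECT id, slogan, 1 - (embedding <=> %(q)s) AS cosine_similarity
        FROM shortlist
        ORDER BY cosine_similarity DESC
        LIMIT %(k)s
    """, {"q": query_vec, "n": n_candidates, "k": k}).fetchall()
```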

🍞 Hook: Use highways to get near your destination, then local streets to pick the exact house. 🥬 The Concept (HNSW Indexing): What it is: A layered graph that makes nearest-neighbor search fast and accurate. How it works:

  1. Build links among vectors at multiple layers.
  2. Start high, jump near the right zone.
  3. Descend to fine layers to find precise neighbors. Why it matters: Massive speed-up over checking every item. 🍞 Anchor: With 9.2 million theorems, HNSW returns strong candidates in milliseconds.

🍞 Hook: Counting how many light switches differ tells you if two rooms’ lighting patterns match. 🥬 The Concept (Hamming Distance): What it is: The count of bit differences between two binary codes. How it works:

  1. Turn high-d vectors into binary codes.
  2. Compare codes by counting mismatched bits.
  3. Use as a lightning-fast first filter. Why it matters: It narrows millions to hundreds almost instantly. 🍞 Anchor: A 4096-bit code lets the engine shortlist theorems that roughly match your question very quickly.

🍞 Hook: Two arrows pointing in almost the same direction describe nearly the same path. 🥬 The Concept (Cosine Similarity): What it is: A normalized dot product indicating how aligned two vectors are. How it works:

  1. Normalize vectors to length 1.
  2. Compute their dot product.
  3. Higher is more similar. Why it matters: It’s a robust fine-grained score for final ranking. 🍞 Anchor: Among candidates, the theorem whose slogan points most like your query is ranked first.

Step 5: Optional reranking

  • What happens: A cross-encoder reranker (Qwen3-Reranker-0.6B) reads each query–slogan pair and gives a precise relevance score, improving P@1 and MRR.
  • Why it exists: Bi-encoders are fast but sometimes coarse; rerankers inspect details that help pick the true best match.
  • Example: Two lemmas may look alike; the reranker notices subtle scope differences that matter.
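
A sketch of the reranking step using sentence-transformers' CrossEncoder interface; whether Qwen3-Reranker-0.6B loads through this exact class is an assumption, and the query and candidate slogans are made up for illustration.

```python
# Cross-encoder reranking sketch: score each (query, slogan) pair jointly.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("Qwen/Qwen3-Reranker-0.6B", max_length=512)  # loading path assumed

query = "splitting of abelian p-Lie algebras over algebraically closed fields"
candidates = [
    "Abelian restricted Lie algebras over an algebraically closed field decompose as a direct sum of cyclic pieces.",
    "Nilpotent Lie algebras admit a filtration by ideals of codimension one.",
]

scores = reranker.predict([(query, slogan) for slogan in candidates])  # one score per pair
ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
print(ranked[0])   # the reranker's best match moves to position 1
```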

Step 6: Evaluate fairly

  • What happens: Three mathematicians wrote 111 natural-language queries from memory, blind to the slogans. Each query–target pair was double-checked by an LLM and a second mathematician. Baselines included Google/arXiv (paper-level), ChatGPT 5.2 with web search, and Gemini 3 Pro.
  • Why it exists: Realistic queries and careful verification reveal actual utility.
  • Example: Queries centered on Algebraic Geometry, Analysis, PDEs; the system had to find exact theorems or at least the right paper.

Secret sauce:

  • Slogans make formal math live in the same space as human questions.
  • Extra context (like introductions) lets the LLM write better slogans, which boosts retrieval.
  • Two-stage search (Hamming then cosine) balances speed and accuracy, and reranking sharpens top hits.
  • Metadata filters and optional citation-weighting refine results for practical use.

04Experiments & Results

The test: The team measured how often the system finds the exact theorem (or at least the right paper) for 111 expert-written queries, using standard information-retrieval metrics.

🍞 Hook: When teachers grade a quiz, they don’t just look at the final score—they check how fast and how early you got correct answers. 🥬 The Concept (Precision@k): What it is: Of the top k results shown, what fraction are correct? How it works:

  1. Look at the first k items.
  2. Count how many are right.
  3. Divide by k. Why it matters: It measures early accuracy—how “clean” the top results are. 🍞 Anchor: If 4 of the top 10 results are the exact theorem, Precision@10 = 0.4.

🍞 Hook: If you throw 10 darts, did at least one hit the bullseye? 🥬 The Concept (Hit@k): What it is: Did any of the top k results contain the correct item (1 = yes, 0 = no)? How it works:

  1. Check the first k.
  2. If the right one appears anywhere, score 1.
  3. Average this over all queries. Why it matters: It shows whether the system surfaces the right theorem somewhere near the top. 🍞 Anchor: If the exact theorem is 7th, Hit@10 = 1 even if Precision@10 isn’t perfect.

🍞 Hook: Imagine scoring a treasure hunt: finding the treasure first gets you more points than finding it later. 🥬 The Concept (MRR@k): What it is: The average of 1/rank of the first correct result within the top k. How it works:

  1. Note the rank (1 for first, 2 for second, ...).
  2. Take 1/rank.
  3. Average across queries (truncate at k). Why it matters: Rewards earlier discovery; higher MRR means correct items are found sooner. 🍞 Anchor: If the match is at rank 2, you get 1/2 = 0.5 for that query; earlier is better.
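
The three metrics are easy to pin down in code. This minimal sketch computes Precision@k, Hit@k, and MRR@k from a ranked list of theorem ids and the set of correct ids; the ids below are made up for illustration.

```python
# Retrieval metrics: Precision@k, Hit@k, MRR@k for a single query's ranking.
def precision_at_k(ranked, relevant, k):
    return sum(1 for item in ranked[:k] if item in relevant) / k

def hit_at_k(ranked, relevant, k):
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def mrr_at_k(ranked, relevant, k):
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

ranked = ["thm_041", "thm_777", "thm_123", "thm_999"]   # correct theorem sits at rank 2
relevant = {"thm_777"}
print(precision_at_k(ranked, relevant, 10),   # 0.1
      hit_at_k(ranked, relevant, 10),         # 1.0
      mrr_at_k(ranked, relevant, 10))         # 0.5
```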

Competition and setup: The authors compared multiple embedders (Gemma 0.3B, Qwen3 0.6B, Qwen3 8B) and benchmarked against baseline tools: Google Search and arXiv (paper-level only), plus ChatGPT 5.2 with web search and Gemini 3 Pro. The LLMs used to generate slogans varied (DeepSeek V3, Gemini 3 Pro, Claude Opus 4.5). They also ablated context (body only vs. body+abstract vs. body+first section) and embeddings with and without task instructions.

Scoreboard with context:

  • Theorem-level: Qwen3-Embedding-8B achieved Hit@20 = 45.0%, clearly beating ChatGPT 5.2 (19.8%) and Gemini 3 Pro (27.0%). With a reranker, P@1 improved from 17.1% to 18.9% and MRR@20 from 24.3% to 27.0%.
  • Paper-level: Qwen3-Embedding-8B reached Hit@20 = 56.8%, outperforming Google Search at 37.8%. Interpretation: 45% Hit@20 is like consistently getting the right answer on the first page of results nearly half the time—while general-purpose systems often struggle to name the specific theorem even when they find the paper.

Surprising findings:

  1. Slogans beat raw LaTeX: Embedding LaTeX statements directly performed far worse than embedding LLM-written English slogans.
  2. Context helps: Using the paper’s introduction to write the slogan improved retrieval over body-only or abstract-only settings. Intros often summarize goals and clarify terminology.
  3. Better slogan LLMs help: Proprietary LLMs (e.g., Claude Opus 4.5, Gemini 3 Pro) produced slogans that led to stronger retrieval than some open models; choice of slogan generator matters.
  4. Reranking matters: A cross-encoder reranker improved top-1 and MRR, indicating that late, fine-grained reading of query–slogan pairs resolves tough ties.
  5. Diversity effect: General LLMs with web search tended to cluster multiple results from one paper in a row, reducing diversity. The proposed system surfaced a broader range of papers per query, improving chances of a hit.
  6. Deep lemmas advantage: Because each theorem gets its own slogan, the system finds auxiliary lemmas buried in later sections—something paper-level search often misses.

User anecdotes make it real: One user needed a density-of-k-points fact; the first result was exactly the relevant Stacks Project tag. Another needed a structure theorem for abelian p-Lie algebras; the first result was directly on target (Usefi, 2009), saving time compared to chasing indirect references.

Overall: The results show that theorem-level semantic search at web scale not only works but can outperform widely used tools, especially when the goal is to name the exact statement rather than just the containing paper.

05Discussion & Limitations

Limitations:

  • Parsing fragility: LaTeX is diverse; some theorem bodies may be truncated or malformed. Heuristics help, but errors remain.
  • Slogan quality varies: If the LLM misunderstands a statement or the provided context is thin, the slogan may drift, hurting retrieval.
  • Coverage constraints: The public release excludes arXiv items with non-permissive licenses; some important theorems may be missing.
  • Small evaluation set: 111 high-quality queries is standard in MathIR but still small; broader coverage will further validate results.
  • Not a proof tool: Retrieval finds statements; it doesn’t verify correctness, resolve version numbering differences, or generate proofs.

Required resources:

  • Preprocessing: Running parsers and compilers at arXiv scale; storage for millions of items plus metadata.
  • Slogan generation: LLM API usage (~$4,000 for 9.2M theorems in this build), plus time and queue management.
  • Indexing: Vector DB (PostgreSQL + pgvector), HNSW index building, and binary quantization.
  • Inference: An embedding model endpoint (Qwen3-Embedding-8B) and optional reranker.

When not to use:

  • If you need exact formal proof objects in Lean/Isabelle/Coq, use formal-library search; this system targets informal theorems.
  • If queries require non-English or highly specialized symbol reasoning, current slogans and embeddings may underperform.
  • If you need guaranteed completeness for legal/compliance decisions, you must cross-check manually.

Open questions:

  • Math-aware embeddings: Can specialized math embedding models that understand symbols and structure close the remaining gap?
  • Auto-context selection: What’s the best way to feed context to the slogan generator—adaptive snippets, citations, or section graphs?
  • Multilingual math: How well can the system support queries in other languages or mixed-language math?
  • Linking to formal math: Can we automatically connect informal theorems to their formal counterparts for proof assistants?
  • Feedback loops: How best to use user ratings and clicks to continually refine slogans and rankings?

In short, the approach is strong and practical, but there’s room to improve parsing robustness, slogan fidelity, evaluation breadth, and deeper math understanding.

06Conclusion & Future Work

Three-sentence summary: The paper turns 9.2 million theorems into short English slogans and uses embedding-based search to find exact statements from natural-language queries. This theorem-level semantic search achieves 45% Hit@20 (theorem) and 56.8% Hit@20 (paper), beating popular baselines like Google Search, ChatGPT with web search, and Gemini for this task. Context-rich slogan generation and a two-stage retrieval-plus-rerank pipeline make web-scale math retrieval both fast and accurate.

Main achievement: Demonstrating, at web scale, that natural-language “sloganization” of formal math paired with dense retrieval can reliably surface precise theorems—not just papers—and outperform widely used general tools.

Future directions: Improve slogan quality with better math-aware LLMs; expand sources (e.g., nLab, more textbooks) and languages; link informal theorems to formal proof libraries for agentic reasoning; and incorporate user feedback for continuous reranking and slogan refinement.

Why remember this: It reframes how we access mathematical knowledge—treating theorems as first-class searchable objects. That shift saves researchers time, prevents duplicated work, boosts education, and gives AI proof agents the right building blocks. As semantic maps of math get richer, discovery gets faster and more reliable for everyone.

Practical Applications

  • Quickly verify whether a statement you plan to publish is already known, reducing duplicate work.
  • Support AI theorem provers by retrieving highly relevant lemmas as premises during proof search.
  • Accelerate literature reviews by jumping directly to the exact theorems behind claims in related work.
  • Help peer reviewers check citations and pinpoint the specific theorem a manuscript relies on.
  • Assist students preparing for quals by finding standard lemmas and corollaries on key topics.
  • Enable instructors to assemble curated reading lists of precise results for a course module.
  • Power chat-based math assistants (RAG) that can cite exact theorems rather than hand-waving references.
  • Monitor new arXiv submissions for results closely related to your current project or conjecture.
  • Integrate into math writing tools to suggest precise references as you draft a proof.
  • Extend to neighboring domains (e.g., algorithms, statistics) to retrieve exact propositions from technical texts.
#semantic theorem search · #mathematical information retrieval · #dense retrieval · #LLM embeddings · #theorem slogan generation · #HNSW indexing · #Hamming distance · #cosine similarity · #LaTeX parsing · #arXiv dataset · #reranking · #Qwen3-Embedding-8B · #proof assistant integration · #retrieval-augmented generation · #web-scale search