ModelTables: A Corpus of Tables about Models
Key Summary
- ModelTables is a giant, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.
- It links each table to the model and papers it came from, so we can understand how models are related beyond just text keywords.
- The benchmark includes more than 60,000 models and about 90,000 tables and shows that model tables are smaller but more closely connected than typical open-data tables.
- The paper defines clear, verifiable ways to say two models (and their tables) are related: paper citations, model-card lineage, and shared datasets.
- They test many table search methods and find that table-embedding (dense) retrieval works best overall with 66.5% Precision@1.
- Union-based semantic table search (finding tables you can stack together) scores 54.8% P@1 overall and does especially well for citation-based relatedness.
- Hybrid metadata retrieval (mixing text search with embeddings) also performs well (54.1% P@1) and excels on lineage and dataset signals.
- Augmentations like adding header text into cells and transposing tables make retrieval more accurate and robust.
- The results show big room for improvement and open the door to better semantic retrieval, structured comparison, and organization of model knowledge.
- They release data, code, and a repeatable pipeline so others can build bigger or private versions of ModelTables.
Why This Research Matters
Choosing the right AI model often depends on nuanced, structured facts that live inside tables, not just in text blurbs. By linking tables to models and papers, and using citations, lineage, and dataset co-usage as the answer key, ModelTables enables fair tests and real improvements in finding related model information. This helps engineers quickly assemble trustworthy comparisons, avoid mistakes, and pick the best model for their task. Research teams can better organize and navigate ever-growing model lakes, both public and private. Over time, this leads to more transparent, reproducible science and faster innovation because the precise evidence is easier to discover. It also lays a foundation for automated model governance, auditing, and documentation at scale.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a giant library not of storybooks but of AI models. Each model has its own info sheet with tables of scores, settings, and notes, like a baseball card showing stats.
The Concept (Model Lake): What it is: A Model Lake is a huge place that stores lots of AI models and their descriptions. How it works: 1) People upload models, 2) They add model cards (pages that explain the model), 3) Others can search and compare. Why it matters: Without an organized lake, finding the right model feels like looking for a needle in a haystack. Anchor: Hugging Face is a real example of a Model Lake where you can browse models and read their cards.
Hook: You know how a recipe card has a small table that lists ingredients and amounts? Model pages also have little tables, but with scores and settings.
The Concept (Scientific Tables about Models): What it is: These are structured tables summarizing model performance (like accuracy) and configuration (like batch size). How it works: Authors put results into rows and columns so others can quickly compare models across tasks. Why it matters: Text alone misses details like exact numbers; tables keep precise facts in clear boxes. Anchor: A GLUE benchmark table showing BERT vs. RoBERTa scores is a classic example.
Hook: Think of a fair race where everyone agrees on the finish line. You need that finish line to judge who won.
The Concept (Benchmark): What it is: A benchmark is an agreed-upon test set to fairly compare methods. How it works: 1) Gather data, 2) Define what counts as a correct match, 3) Measure how well methods do. Why it matters: Without a benchmark, claims like "our search is better" are just guesses. Anchor: ModelTables is a benchmark for finding related tables about AI models.
Hook: In math class, you grade your answers against the answer key. That key is the truth you measure against.
The Concept (Ground Truth): What it is: Ground truth is the trusted answer set used to score methods. How it works: 1) Collect signals (like citations, links, shared datasets), 2) Turn them into "related/not-related" labels, 3) Use these to judge search quality. Why it matters: Without ground truth, you can't tell if search results are actually right. Anchor: If two model cards say one inherited from the other, that's ground-truth relatedness.
Hook: When you look for a recipe online, you can search by keywords, or you can ask for "something that's like spaghetti but gluten-free."
The Concept (Table Search): What it is: Finding the most related tables to a given table. How it works: 1) Keyword search looks for matching words, 2) Join/union search checks how tables fit together, 3) Semantic retrieval looks for meaning, not just words. Why it matters: Different searches catch different kinds of related tables; relying on one misses important matches. Anchor: Searching a BERT results table should find RoBERTa and other GLUE tables, even if column names differ.
The world before: We had big table corpora from the web and open data portals. They were great for general tasks like column typing and join discovery, but not focused on AI models. People often searched by keywords or by structure (can these tables be joined or stacked?). That missed important meaning, like two model tables discussing the same task with different labels.
The problem: Text-only or purely structural search often overlooks the rich, precise semantics in performance/configuration tables. Also, many corpora lacked built-in, verifiable ground truth for "relatedness" beyond manual curation.
Failed attempts: Keyword search can't handle synonyms or schema changes; join/union search falls short when structure is varied; dense retrieval on arbitrary web tables lacks model-aware signals; black-box semantic search (like summaries) can't be externally evaluated.
The gap: We needed a model-centric table benchmark with trustworthy relatedness signals (citations, lineage, and shared datasets), plus a reproducible way to build it, so everyone can test and improve table search for models.
Real stakes: Better table search means faster, more accurate model selection, trustworthy comparisons, and improved governance in both public and private model lakes, impacting researchers, engineers, and users who depend on picking the right model for the job.
02 Core Idea
Hook: Imagine sorting baseball cards. If you could instantly find all cards related to one player (teammates, rivals, or shared tournaments), you'd learn much faster.
The Concept (ModelTables): What it is: ModelTables is a large, linked collection of model-related tables with built-in, model-aware ground truth for relatedness. How it works: 1) Crawl Hugging Face model cards, their GitHub links, and cited papers; 2) Extract tables; 3) Link each table to its model and papers; 4) Build ground-truth edges from citations, model lineage, and shared datasets. Why it matters: It lets us fairly test table search methods that care about meaning, not just matching words or identical shapes. Anchor: Given a BERT table, ModelTables tells you which RoBERTa or MUPPET tables are truly related (because of citations, inheritance, or shared datasets).
The "Aha!" in one sentence: If you connect tables to their models and papers, then use real-world signals (citations, lineage, shared datasets) as the answer key, you can fairly test and greatly improve how we find related model tables.
Three analogies:
- Family tree: Models are like family members; base_model tags show parents/children; tables are their report cards.
- Neighborhood map: Citations are roads connecting houses (papers); shared references mean houses are near each other.
- Cookbook series: Dishes (models) often reuse the same ingredients (datasets), so their recipe tables are related.
Before vs After:
- Before: Search often matched exact words or identical schemas, missing important but differently shaped information.
- After: We can measure and improve methods that find thematically related tables, even when headers, layouts, or wording differ.
Why it works (intuition):
- Paper citations reflect intellectual connections; lineage shows model families; shared datasets tie models to the same tasks. These signals are authored by developers and researchers themselves, so they're reliable.
- When tables inherit these links, we get a truth-backed way to judge if table search really understands model relationships.
Building blocks:
- Source linking: Every table is tied to at least one model; models to papers and datasets.
- Multi-signal ground truth: Paper-level (direct citations and reference overlap), model-level (explicit links and inheritance), dataset-level (shared training/eval datasets).
- Robust corpus design: Deduped, quality-checked tables, plus augmentations (transpose, header-to-cell) to handle style differences.
- Evaluation-ready: Supports multiple search methods (keyword, join/union, dense, sparse, hybrid) with consistent scoring.
Anchor: When a lab uploads a new fine-tuned model that cites RoBERTa and reuses GLUE, ModelTables links its performance table to the RoBERTa family and GLUE peers, so search can find them even if headers differ.
03 Methodology
At a high level: Input (model cards) → Extract tables and links (GitHub, papers) → Clean and filter → Augment for robustness → Build relatedness graphs (paper, lineage, dataset) → Evaluate search methods.
Hook: Imagine scavenger hunting with a map that marks not just treasures, but also paths, families, and clubs.
The Concept (Source Extraction Pipeline): What it is: A step-by-step crawler that collects tables and metadata from model cards, GitHub, and papers. How it works: 1) Parse Hugging Face model cards (tables, URLs, BibTeX), 2) Fetch GitHub READMEs and extract tables, 3) Fetch arXiv HTML pages to extract tables, 4) Use Semantic Scholar's S2ORC to get raw table text from PDFs and reconstruct structure with an LLM, 5) Link all tables back to their model. Why it matters: Without linking every table to its model and papers, we can't build reliable relatedness or evaluate search. Anchor: A BERT model card links to a GitHub README and its arXiv paper; the pipeline pulls all three tables and ties them to BERT.
Step-by-step "recipe" with reasons and examples (a minimal extraction sketch follows this list):
- Model cards first: They're clean Markdown and often contain performance and config tables. Without this, we'd miss developer-curated facts. Example: A table listing "batch size," "learning rate," and GLUE scores.
- GitHub second: READMEs often expand details or report extra benchmarks. Skipping this loses important, structured context. Example: A README's table of model variants and parameter counts.
- arXiv HTML third: When available, HTML tables are precise; PDF-only extraction is noisy, so they limit PDF-based tables to controlled ablations. Example: A paper's Table 5 of SQuAD results pulled from its HTML page.
- S2ORC + LLM: S2ORC gives raw text of tables; an LLM reconstructs rows/columns. Without this, many paper tables would remain unusable. Example: Turning "MNLI: 84.6; QNLI: 92.7 ..." into a tidy 2D table.
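To make the first two extraction stages concrete, here is a minimal sketch in Python, assuming the model card or README has already been downloaded as Markdown text. The function names, the regexes, and the local file path are illustrative assumptions, not the authors' pipeline code; BibTeX parsing, arXiv HTML scraping, and S2ORC+LLM reconstruction are omitted.

```python
import re

def extract_markdown_tables(md_text: str) -> list[list[list[str]]]:
    """Pull GitHub-style pipe tables out of a Markdown model card or README."""
    tables, current = [], []
    for line in md_text.splitlines():
        if line.lstrip().startswith("|"):
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            # Skip the |---|---| separator row; keep header and data rows.
            if not all(re.fullmatch(r":?-{2,}:?", c) for c in cells):
                current.append(cells)
        elif current:              # a table block just ended
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

def extract_arxiv_ids(md_text: str) -> list[str]:
    """Collect new-style arXiv IDs so each extracted table can be linked to papers."""
    return re.findall(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", md_text)

card_md = open("model_card.md", encoding="utf-8").read()   # hypothetical local copy
model_tables = extract_markdown_tables(card_md)
linked_papers = extract_arxiv_ids(card_md)
```

Each extracted table would then be stored together with the model ID and the paper IDs found alongside it, which is what later makes the ground-truth edges possible.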
Hook: Think of tidying your room: you fix crooked posters, sort drawers, and toss junk so everything is easy to find.
The Concept (Quality Control & Filtering): What it is: A set of fixes and filters to make tables consistently usable. How it works: 1) Repair merged/empty cells, 2) Handle special characters and pipes in Markdown, 3) Merge footnotes into cells, 4) Stitch multi-page tables, 5) Remove visual artifacts, 6) Filter out low-quality or unstructured tables. Why it matters: Messy tables break retrieval and produce false matches. Anchor: If a cell reads "0.88" with a footnote marker, they merge it so the number and note stay together.
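A minimal cleanup sketch, assuming each table arrives as a list of cell-string rows like the ones produced above. The particular repairs (padding ragged rows, stripping Markdown emphasis, dropping empty rows and columns) and the size thresholds are illustrative stand-ins for the paper's filters, not the exact rules.

```python
import re

def clean_table(rows: list[list[str]], min_rows: int = 2, min_cols: int = 2):
    """Light quality control for one extracted table (a list of cell-string rows)."""
    if not rows:
        return None
    width = max(len(r) for r in rows)
    cleaned = []
    for row in rows:
        row = row + [""] * (width - len(row))                  # repair ragged rows
        row = [re.sub(r"[*_`]", "", c).strip() for c in row]   # strip Markdown emphasis
        if any(row):                                           # drop fully empty rows
            cleaned.append(row)
    if not cleaned:
        return None
    keep = [j for j in range(width) if any(r[j] for r in cleaned)]  # drop empty columns
    cleaned = [[r[j] for j in keep] for r in cleaned]
    # Reject tables too small to be a real performance or configuration table.
    if len(cleaned) < min_rows or len(keep) < min_cols:
        return None
    return cleaned
```

In the pipeline described above, a step like this would run on every table before deduplication and augmentation.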
Hook: Sometimes people write ingredients across columns; other times they list them down rows. It's still the same recipe.
The Concept (Transpose Augmentation): What it is: Create a flipped version of each table (rows become columns) to handle style differences. How it works: 1) For every table, generate its transpose, 2) Include both in search, 3) Accept either as a match. Why it matters: Two related tables might be rotated; without transposes, retrieval misses them. Anchor: One GLUE table lists tasks as columns; another lists them as rows; transposition aligns them.
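The transpose itself is tiny; a sketch on the same cell-string representation (the toy values mirror the MNLI/QNLI example above and are illustrative):

```python
from itertools import zip_longest

def transpose_table(rows: list[list[str]]) -> list[list[str]]:
    """Flip rows and columns so rotated layouts of the same table line up."""
    return [list(col) for col in zip_longest(*rows, fillvalue="")]

tasks_as_columns = [["Model", "MNLI", "QNLI"], ["BERT", "84.6", "92.7"]]
tasks_as_rows = transpose_table(tasks_as_columns)
# [["Model", "BERT"], ["MNLI", "84.6"], ["QNLI", "92.7"]]
```

Both the original and the transposed copy are indexed, and retrieving either one counts as a match.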
Hook: Imagine if a score said "93.5" but didn't say "accuracy." You might not know what it means.
The Concept (Header-to-Cell Augmentation): What it is: Copy column header meaning into each cell value (e.g., turn "93.5" under "Acc" into "Acc: 93.5"). How it works: 1) For each value, prepend its header name, 2) Keep originals too, 3) Search sees both semantic context and number. Why it matters: Numbers alone are ambiguous; adding header text boosts semantic matching. Anchor: "F1: 88.0" is easier to match across tables than a bare "88.0."
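A minimal sketch of this augmentation, assuming the first row holds the headers; the exact "Header: value" verbalization is an assumption about formatting, not the paper's precise template.

```python
def header_to_cell(rows: list[list[str]]) -> list[list[str]]:
    """Prefix every body cell with its column header so bare numbers carry meaning."""
    header, body = rows[0], rows[1:]
    augmented_body = [[f"{h}: {v}" if v else v for h, v in zip(header, row)] for row in body]
    return [header] + augmented_body

header_to_cell([["Model", "Acc"], ["BERT", "93.5"]])
# [["Model", "Acc"], ["Model: BERT", "Acc: 93.5"]]
```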
Hook: Think of friendship links: classmates (model cards), pen pals (citations), and clubmates (shared datasets) all show relationships.
The Concept (Relatedness Graphs): What it is: Three signals to say models (and their tables) are related (papers, lineage, datasets). How it works: 1) Paper-level: Direct citations and reference overlap (with intent and influence filters), 2) Model-level: Explicit links and base_model inheritance, 3) Dataset-level: Shared datasets from tags/URLs. Tables inherit their model's edges. Why it matters: Each signal catches a different type of true relation; using all gives balanced coverage and precision. Anchor: Two models that both cite BERT and both use GLUE are strongly related, even if they don't cite each other directly.
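A minimal sketch of how the three signals could be combined into one graph with networkx. The per-model metadata dictionary, the quadratic pairwise loop, and the reduction of paper-level relatedness to plain reference overlap (the paper also uses direct citations with intent and influence filters) are simplifying assumptions for illustration, not the benchmark's construction code.

```python
import networkx as nx

def build_relatedness_graph(models: dict) -> nx.Graph:
    """models: model_id -> {'papers': set, 'datasets': set, 'base_model': str | None}."""
    g = nx.Graph()
    g.add_nodes_from(models)
    ids = list(models)
    for i, a in enumerate(ids):
        base = models[a].get("base_model")
        if base in models:                                      # model-level: base_model lineage
            g.add_edge(a, base, signals={"lineage"})
        for b in ids[i + 1:]:
            signals = set()
            if models[a]["papers"] & models[b]["papers"]:       # paper-level: reference overlap
                signals.add("paper")
            if models[a]["datasets"] & models[b]["datasets"]:   # dataset-level: shared datasets
                signals.add("dataset")
            if signals:
                existing = g.get_edge_data(a, b, default={}).get("signals", set())
                g.add_edge(a, b, signals=existing | signals)
    return g

def tables_related(graph, table_to_model, t1, t2) -> bool:
    """Tables inherit their model's edges: related iff their models share any edge."""
    return graph.has_edge(table_to_model[t1], table_to_model[t2])
```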
Putting it together: After extraction, cleaning, filtering, and augmentation, the benchmark has over 60K models and about 90K tables. Its relatedness graphs are realistic and skewed (some hub models connect widely), making evaluation meaningful and challenging.
04 Experiments & Results
Hook: Picture a quiz game: given one table, can you pick the most related table from a huge pile? The score is how often you get the top pick right.
The Concept (Evaluation Setup): What it is: A test of different search methods to find the most related table (Precision@1). How it works: 1) Treat every table as a query, 2) Search the corpus, 3) Count success if top-1 is ground-truth related by paper, model, or dataset signals. Why it matters: Precision@1 is strict (if your first guess is wrong, you score zero), so it is a clear signal of practical usefulness. Anchor: If your query is a BERT GLUE table, top-1 should be a true peer like RoBERTa GLUE, not a random vision table.
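Given a relatedness graph like the one sketched above, scoring reduces to a few lines. In this sketch, retrieve_top1 is a placeholder for whichever search method is under test and is assumed to return the best non-self candidate table.

```python
def precision_at_1(query_tables, retrieve_top1, graph, table_to_model) -> float:
    """Fraction of queries whose single top-ranked table is ground-truth related."""
    hits = 0
    for q in query_tables:
        top1 = retrieve_top1(q)  # best non-self candidate returned by the search method
        if graph.has_edge(table_to_model[q], table_to_model[top1]):
            hits += 1
    return hits / len(query_tables)
```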
The competition (methods tested):
- Keyword search (headers/cells)
- Joinable search (can tables be joined on a key?)
- Unionable search (can tables be stacked by columns?), using Starmie for semantic unions
- Dense retrieval (Sentence-BERT embeddings over whole tables), FAISS for fast neighbors (a minimal sketch follows this list)
- Sparse retrieval (metadata text with Pyserini)
- Hybrid retrieval (sparse metadata to shortlist, then dense rerank)
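A minimal sketch of the dense, table-embedding route (the overall winner), assuming the inputs are cleaned, augmented cell-string tables from the earlier sketches. The encoder checkpoint "all-MiniLM-L6-v2" and the linearization format are illustrative choices, not necessarily what the paper used.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def table_to_text(rows: list[list[str]]) -> str:
    """Linearize a table (ideally after header-to-cell augmentation) into one string."""
    return " | ".join(" ; ".join(row) for row in rows)

def dense_top1(query_table, corpus_tables, model_name="all-MiniLM-L6-v2") -> int:
    """Return the corpus index of the most similar table under dense retrieval."""
    encoder = SentenceTransformer(model_name)
    corpus_vecs = encoder.encode([table_to_text(t) for t in corpus_tables],
                                 normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(corpus_vecs.shape[1])  # inner product = cosine after normalization
    index.add(corpus_vecs.astype(np.float32))
    query_vec = encoder.encode([table_to_text(query_table)],
                               normalize_embeddings=True, convert_to_numpy=True)
    _, ids = index.search(query_vec.astype(np.float32), 1)  # top-1 neighbor
    return int(ids[0][0])  # if the query itself is indexed, search with k=2 and skip it
```

A hybrid variant would first shortlist candidates with a sparse metadata index (e.g., BM25 via Pyserini) and then rerank only that shortlist with these dense scores.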
Scoreboard (context):
- Table-based dense retrieval: 66.5% P@1 overall, like getting an A while many others score a B.
- Union-based semantic search: 54.8% P@1 overall; 54.6% on citation links, 31.3% on model-card inheritance, 30.6% on shared datasets. A solid B, especially strong for citation-relatedness.
- Metadata-hybrid retrieval: 54.1% P@1, competitive and particularly good on lineage/dataset signals.
- Keyword and simple joinable search trail behind; they miss semantic and schema variations.
Surprising findings and insights:
- Dense vs. Union strength: Dense table embeddings excel overall (66.5%), but union search shines for direct citation-style relatedness, meaning structure-aware matching still matters a lot.
- Source quality matters: GitHub and model-card tables (clean Markdown) yield much higher precision than heterogeneous paper tables (especially S2ORC+LLM reconstructions). Homogeneous sources (model cards plus GitHub, i.e., M+G) score best; adding noisier sources lowers precision.
- Augmentations help: Adding header text into cells consistently boosts accuracy. Transposing helps a little; the biggest benefit comes from making semantics explicit.
- Structural robustness: For union search, shuffling columns and rows at inference helps the encoder ignore layout quirks and focus on meaning; column shuffling yields the largest gains (a small shuffling sketch follows this list).
- Ground-truth choice changes numbers: Overlap-based citation GT is denser and yields higher measured precision than strict direct-citation GT. Methods behave consistently as GT becomes stricter (numbers drop but relative ordering remains stable).
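A small sketch of that inference-time shuffling, assuming a rectangular cell-string table whose first row is the header; keeping the header fixed and the seeding are illustrative choices, not details from the paper.

```python
import random

def shuffle_table(rows: list[list[str]], seed=None) -> list[list[str]]:
    """Permute columns and body rows so the encoder cannot rely on layout order."""
    rng = random.Random(seed)
    header, body = rows[0], rows[1:]
    col_order = list(range(len(header)))
    rng.shuffle(col_order)                                    # column shuffling (largest gains)
    permuted = [[row[j] for j in col_order] for row in [header] + body]
    shuffled_body = permuted[1:]
    rng.shuffle(shuffled_body)                                # row shuffling (header stays first)
    return [permuted[0]] + shuffled_body
```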
Concrete examples:
- Union search example: A BERT GLUE table retrieves MUPPET or RoBERTa GLUE tables with very similar schemas, enabling side-by-side comparisons.
- Dense retrieval example: It pulls in semantically related tables with different layouts (extra metrics, different column orders), connecting relevant content that union search might miss.
Big picture: No single method dominates every case. Dense is best overall, union finds schema-aligned peers, and hybrid helps when context lives outside the table. Together they suggest combining structural and semantic signals is the path forward.
05 Discussion & Limitations
Hook: Even the best treasure map can have smudges. Knowing the smudges helps you read it better.
The Concept (Limitations): What it is: Realistic constraints in the data and methods. How it works: 1) Paper PDFs are hard: S2ORC+LLM reconstructions are useful but noisy, 2) Sources vary (Markdown vs PDF), creating uneven quality, 3) Ground truth depends on available links/tags; missing metadata weakens edges, 4) ModelTables is AI-model-centric, so results may not transfer to other domains. Why it matters: Understanding limits prevents overclaiming and guides future fixes. Anchor: A blurry photo still shows the scene, but you won't read tiny labels; the same goes for low-quality table extractions.
Required resources:
- Access to Hugging Face model cards and dataset metadata
- GitHub README crawling
- arXiv HTML fetching and Semantic Scholar S2ORC dumps
- An LLM for table reconstruction from raw text (optional but helpful)
- Compute for indexing (FAISS, Pyserini) and training/validation
When not to use:
- Domains without reliable citations, lineage tags, or dataset metadata
- Tasks needing PDF-accurate structure recovery without HTML support
- Settings where strict privacy prevents even metadata linking (unless you adapt the pipeline for in-house signals)
Open questions:
- How best to fuse structural union signals with dense semantics for top-1 and beyond?
- Can we improve PDF table reconstruction to near-HTML quality at scale?
- How to auto-detect table types (performance vs config) and weigh them differently in retrieval?
- Can we adapt the relatedness signals for private model lakes (e.g., team/project links) without losing reliability?
- What are the best representations for numeric-heavy tables (units, ranges, uncertainty)?
06 Conclusion & Future Work
Three-sentence summary: ModelTables is a large, model-centric benchmark of tables that links each table to its model and papers, and defines relatedness using citations, lineage, and shared datasets. Across many methods, dense table embeddings perform best overall, union-based search is strong for citation-style relatedness, and hybrid metadata methods compete closely, showing that each captures different, valuable signals. The released data and reproducible pipeline enable community progress in semantic table retrieval, structured comparison, and principled organization of model knowledge.
Main achievement: Turning real, author-supplied signals (citations, model-card links, dataset tags) into a trustworthy, large-scale ground truth for thematically related model tables, and using it to reveal clear gaps and opportunities in current search methods.
Future directions:
- Fuse structural union and dense semantic signals into a unified retriever-reranker stack
- Improve PDF table reconstruction and unit/metric normalization
- Add automatic table-type detection and type-aware scoring
- Extend to private model lakes with internal lineage and project signals
Why remember this: It shows that tables carry the precise facts we need to truly compare models, and that linking them via real scholarly and lineage signals unlocks better search, better decisions, and better science.
Practical Applications
- Build a semantic table search tool that finds peer models' performance tables for side-by-side comparisons.
- Create dashboards that automatically collect and align configuration tables for a given base model family.
- Power a "model chooser" assistant that summarizes top models for a dataset across recent papers.
- Audit model cards by triangulating table facts from model cards, GitHub, and papers to catch inconsistencies.
- Enrich internal model registries with citation- and dataset-based relatedness for faster discovery.
- Enable table-aware question answering (e.g., "Which model has the best MNLI score under 150M params?").
- Automate report generation that aggregates and normalizes metrics/units across related tables.
- Support private model lakes by swapping paper-citation signals with project/team lineage graphs.
- Boost retrieval robustness by deploying header-to-cell augmentation and transpose variants.
- Train new retrieval models that fuse structural (unionable) and semantic (dense) signals.