ModelTables: A Corpus of Tables about Models
Key Summary
- ModelTables is a giant, organized collection of tables that describe AI models, gathered from Hugging Face model cards, GitHub READMEs, and research papers.
- It links each table to the model and papers it came from, so we can understand how models are related beyond just text keywords.
- The benchmark includes more than 60,000 models and about 90,000 tables and shows that model tables are smaller but more closely connected than typical open-data tables.
- The paper defines clear, verifiable ways to say two models (and their tables) are related: paper citations, model-card lineage, and shared datasets.
- They test many table search methods and find that table-embedding (dense) retrieval works best overall with 66.5% Precision@1.
- Union-based semantic table search (finding tables you can stack together) scores 54.8% P@1 overall and does especially well for citation-based relatedness.
- Hybrid metadata retrieval (mixing text search with embeddings) also performs well (54.1% P@1) and excels on lineage and dataset signals.
- Augmentations like adding header text into cells and transposing tables make retrieval more accurate and robust.
- The results show big room for improvement and open the door to better semantic retrieval, structured comparison, and organization of model knowledge.
- They release data, code, and a repeatable pipeline so others can build bigger or private versions of ModelTables.
Why This Research Matters
Choosing the right AI model often depends on nuanced, structured facts that live inside tables, not just in text blurbs. By linking tables to models and papers, and using citations, lineage, and dataset co-usage as the answer key, ModelTables enables fair tests and real improvements in finding related model information. This helps engineers quickly assemble trustworthy comparisons, avoid mistakes, and pick the best model for their task. Research teams can better organize and navigate ever-growing model lakes, both public and private. Over time, this leads to more transparent, reproducible science and faster innovation because the precise evidence is easier to discover. It also lays a foundation for automated model governance, auditing, and documentation at scale.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine a giant library not of storybooks but of AI models. Each model has its own info sheet with tables of scores, settings, and notes, like a baseball card showing stats.
The Concept (Model Lake): What it is: A Model Lake is a huge place that stores lots of AI models and their descriptions. How it works: 1) People upload models, 2) They add model cards (pages that explain the model), 3) Others can search and compare. Why it matters: Without an organized lake, finding the right model feels like looking for a needle in a haystack. Anchor: Hugging Face is a real example of a Model Lake where you can browse models and read their cards.
Hook: You know how a recipe card has a small table that lists ingredients and amounts? Model pages also have little tables, but with scores and settings.
The Concept (Scientific Tables about Models): What it is: These are structured tables summarizing model performance (like accuracy) and configuration (like batch size). How it works: Authors put results into rows and columns so others can quickly compare models across tasks. Why it matters: Text alone misses details like exact numbers; tables keep precise facts in clear boxes. Anchor: A GLUE benchmark table showing BERT vs. RoBERTa scores is a classic example.
Hook: Think of a fair race where everyone agrees on the finish line. You need that finish line to judge who won.
The Concept (Benchmark): What it is: A benchmark is an agreed-upon test set to fairly compare methods. How it works: 1) Gather data, 2) Define what counts as a correct match, 3) Measure how well methods do. Why it matters: Without a benchmark, claims like "our search is better" are just guesses. Anchor: ModelTables is a benchmark for finding related tables about AI models.
Hook: In math class, you grade your answers against the answer key. That key is the truth you measure against.
The Concept (Ground Truth): What it is: Ground truth is the trusted answer set used to score methods. How it works: 1) Collect signals (like citations, links, shared datasets), 2) Turn them into "related/not-related" labels, 3) Use these to judge search quality. Why it matters: Without ground truth, you can't tell if search results are actually right. Anchor: If two model cards say one inherited from the other, that's ground-truth relatedness.
Hook: When you look for a recipe online, you can search by keywords, or you can ask for "something that's like spaghetti but gluten-free."
The Concept (Table Search): What it is: Finding the most related tables to a given table. How it works: 1) Keyword search looks for matching words, 2) Join/union search checks how tables fit together, 3) Semantic retrieval looks for meaning, not just words. Why it matters: Different searches catch different kinds of related tables; relying on one misses important matches. Anchor: Searching a BERT results table should find RoBERTa and other GLUE tables, even if column names differ.
The world before: We had big table corpora from the web and open data portals. They were great for general tasks like column typing and join discovery, but not focused on AI models. People often searched by keywords or by structure (can these tables be joined or stacked?). That missed important meaning, like two model tables discussing the same task with different labels.
The problem: Text-only or purely structural search often overlooks the rich, precise semantics in performance/configuration tables. Also, many corpora lacked built-in, verifiable ground truth for "relatedness" beyond manual curation.
Failed attempts: Keyword search can't handle synonyms or schema changes; join/union search falls short when structure is varied; dense retrieval on arbitrary web tables lacks model-aware signals; black-box semantic search (like summaries) can't be externally evaluated.
The gap: We needed a model-centric table benchmark with trustworthy relatedness signals (citations, lineage, and shared datasets), plus a reproducible way to build it, so everyone can test and improve table search for models.
Real stakes: Better table search means faster, more accurate model selection, trustworthy comparisons, and improved governance in both public and private model lakes, impacting researchers, engineers, and users who depend on picking the right model for the job.
02 Core Idea
Hook: Imagine sorting baseball cards. If you could instantly find all cards related to one player (teammates, rivals, or shared tournaments), you'd learn much faster.
The Concept (ModelTables): What it is: ModelTables is a large, linked collection of model-related tables with built-in, model-aware ground truth for relatedness. How it works: 1) Crawl Hugging Face model cards, their GitHub links, and cited papers; 2) Extract tables; 3) Link each table to its model and papers; 4) Build ground-truth edges from citations, model lineage, and shared datasets. Why it matters: It lets us fairly test table search methods that care about meaning, not just matching words or identical shapes. Anchor: Given a BERT table, ModelTables tells you which RoBERTa or MUPPET tables are truly related (because of citations, inheritance, or shared datasets).
The "Aha!" in one sentence: If you connect tables to their models and papers, then use real-world signals (citations, lineage, shared datasets) as the answer key, you can fairly test and greatly improve how we find related model tables.
Three analogies:
- Family tree: Models are like family members; base_model tags show parents/children; tables are their report cards.
- Neighborhood map: Citations are roads connecting houses (papers); shared references mean houses are near each other.
- Cookbook series: Dishes (models) often reuse the same ingredients (datasets), so their recipe tables are related.
Before vs After:
- Before: Search often matched exact words or identical schemas, missing important but differently shaped information.
- After: We can measure and improve methods that find thematically related tables, even when headers, layouts, or wording differ.
Why it works (intuition):
- Paper citations reflect intellectual connections; lineage shows model families; shared datasets tie models to the same tasks. These signals are authored by developers and researchers themselves, so they're reliable.
- When tables inherit these links, we get a truth-backed way to judge if table search really understands model relationships.
Building blocks:
- Source linking: Every table is tied to at least one model; models to papers and datasets.
- Multi-signal ground truth: Paper-level (direct citations and reference overlap), model-level (explicit links and inheritance), dataset-level (shared training/eval datasets).
- Robust corpus design: Deduped, quality-checked tables, plus augmentations (transpose, header-to-cell) to handle style differences.
- Evaluation-ready: Supports multiple search methods (keyword, join/union, dense, sparse, hybrid) with consistent scoring.
Anchor: When a lab uploads a new fine-tuned model that cites RoBERTa and reuses GLUE, ModelTables links its performance table to the RoBERTa family and GLUE peers, so search can find them even if headers differ.
03 Methodology
At a high level: Input (model cards) → Extract tables and links (GitHub, papers) → Clean and filter → Augment for robustness → Build relatedness graphs (paper, lineage, dataset) → Evaluate search methods.
Hook: Imagine scavenger hunting with a map that marks not just treasures, but also paths, families, and clubs.
The Concept (Source Extraction Pipeline): What it is: A step-by-step crawler that collects tables and metadata from model cards, GitHub, and papers. How it works: 1) Parse Hugging Face model cards (tables, URLs, BibTeX), 2) Fetch GitHub READMEs and extract tables, 3) Fetch arXiv HTML pages to extract tables, 4) Use Semantic Scholar's S2ORC to get raw table text from PDFs and reconstruct structure with an LLM, 5) Link all tables back to their model. Why it matters: Without linking every table to its model and papers, we can't build reliable relatedness or evaluate search. Anchor: A BERT model card links to a GitHub README and its arXiv paper; the pipeline pulls all three tables and ties them to BERT.
Step-by-step "recipe" with reasons and examples (a minimal extraction sketch follows this list):
- Model cards first: They're clean Markdown and often contain performance and config tables. Without this, we'd miss developer-curated facts. Example: A table listing "batch size," "learning rate," and GLUE scores.
- GitHub second: READMEs often expand details or report extra benchmarks. Skipping this loses important, structured context. Example: A README's table of model variants and parameter counts.
- arXiv HTML third: When available, HTML tables are precise; PDF-only extraction is noisy, so they limit PDF-based tables to controlled ablations. Example: A paper's Table 5 of SQuAD results pulled from its HTML page.
- S2ORC + LLM: S2ORC gives raw text of tables; an LLM reconstructs rows/columns. Without this, many paper tables would remain unusable. Example: Turning "MNLI: 84.6; QNLI: 92.7 ..." into a tidy 2D table.
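To make the first two extraction stages concrete, here is a minimal sketch in Python, assuming the model card or README has already been downloaded as Markdown text. The function names, the regexes, and the local file path are illustrative assumptions, not the authors' pipeline code; BibTeX parsing, arXiv HTML scraping, and S2ORC+LLM reconstruction are omitted.

```python
import re

def extract_markdown_tables(md_text: str) -> list[list[list[str]]]:
    """Pull GitHub-style pipe tables out of a Markdown model card or README."""
    tables, current = [], []
    for line in md_text.splitlines():
        if line.lstrip().startswith("|"):
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            # Skip the |---|---| separator row; keep header and data rows.
            if not all(re.fullmatch(r":?-{2,}:?", c) for c in cells):
                current.append(cells)
        elif current:              # a table block just ended
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

def extract_arxiv_ids(md_text: str) -> list[str]:
    """Collect new-style arXiv IDs so each extracted table can be linked to papers."""
    return re.findall(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})", md_text)

card_md = open("model_card.md", encoding="utf-8").read()   # hypothetical local copy
model_tables = extract_markdown_tables(card_md)
linked_papers = extract_arxiv_ids(card_md)
```

Each extracted table would then be stored together with the model ID and the paper IDs found alongside it, which is what later makes the ground-truth edges possible.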
Hook: Think of tidying your room: you fix crooked posters, sort drawers, and toss junk so everything is easy to find.
The Concept (Quality Control & Filtering): What it is: A set of fixes and filters to make tables consistently usable. How it works: 1) Repair merged/empty cells, 2) Handle special characters and pipes in Markdown, 3) Merge footnotes into cells, 4) Stitch multi-page tables, 5) Remove visual artifacts, 6) Filter out low-quality or unstructured tables. Why it matters: Messy tables break retrieval and produce false matches. Anchor: If a cell reads "0.88" with a footnote marker, they merge it so the number and note stay together.
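A minimal cleanup sketch, assuming each table arrives as a list of cell-string rows like the ones produced above. The particular repairs (padding ragged rows, stripping Markdown emphasis, dropping empty rows and columns) and the size thresholds are illustrative stand-ins for the paper's filters, not the exact rules.

```python
import re

def clean_table(rows: list[list[str]], min_rows: int = 2, min_cols: int = 2):
    """Light quality control for one extracted table (a list of cell-string rows)."""
    if not rows:
        return None
    width = max(len(r) for r in rows)
    cleaned = []
    for row in rows:
        row = row + [""] * (width - len(row))                  # repair ragged rows
        row = [re.sub(r"[*_`]", "", c).strip() for c in row]   # strip Markdown emphasis
        if any(row):                                           # drop fully empty rows
            cleaned.append(row)
    if not cleaned:
        return None
    keep = [j for j in range(width) if any(r[j] for r in cleaned)]  # drop empty columns
    cleaned = [[r[j] for j in keep] for r in cleaned]
    # Reject tables too small to be a real performance or configuration table.
    if len(cleaned) < min_rows or len(keep) < min_cols:
        return None
    return cleaned
```

In the pipeline described above, a step like this would run on every table before deduplication and augmentation.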
Hook: Sometimes people write ingredients across columns; other times they list them down rows. It's still the same recipe.
The Concept (Transpose Augmentation): What it is: Create a flipped version of each table (rows become columns) to handle style differences. How it works: 1) For every table, generate its transpose, 2) Include both in search, 3) Accept either as a match. Why it matters: Two related tables might be rotated; without transposes, retrieval misses them. Anchor: One GLUE table lists tasks as columns; another lists them as rows; transposition aligns them.
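The transpose itself is tiny; a sketch on the same cell-string representation (the toy values mirror the MNLI/QNLI example above and are illustrative):

```python
from itertools import zip_longest

def transpose_table(rows: list[list[str]]) -> list[list[str]]:
    """Flip rows and columns so rotated layouts of the same table line up."""
    return [list(col) for col in zip_longest(*rows, fillvalue="")]

tasks_as_columns = [["Model", "MNLI", "QNLI"], ["BERT", "84.6", "92.7"]]
tasks_as_rows = transpose_table(tasks_as_columns)
# [["Model", "BERT"], ["MNLI", "84.6"], ["QNLI", "92.7"]]
```

Both the original and the transposed copy are indexed, and retrieving either one counts as a match.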
Hook: Imagine if a score said "93.5" but didn't say "accuracy." You might not know what it means.
The Concept (Header-to-Cell Augmentation): What it is: Copy column header meaning into each cell value (e.g., turn "93.5" under "Acc" into "Acc: 93.5"). How it works: 1) For each value, prepend its header name, 2) Keep originals too, 3) Search sees both semantic context and number. Why it matters: Numbers alone are ambiguous; adding header text boosts semantic matching. Anchor: "F1: 88.0" is easier to match across tables than a bare "88.0."
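A minimal sketch of this augmentation, assuming the first row holds the headers; the exact "Header: value" verbalization is an assumption about formatting, not the paper's precise template.

```python
def header_to_cell(rows: list[list[str]]) -> list[list[str]]:
    """Prefix every body cell with its column header so bare numbers carry meaning."""
    header, body = rows[0], rows[1:]
    augmented_body = [[f"{h}: {v}" if v else v for h, v in zip(header, row)] for row in body]
    return [header] + augmented_body

header_to_cell([["Model", "Acc"], ["BERT", "93.5"]])
# [["Model", "Acc"], ["Model: BERT", "Acc: 93.5"]]
```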
Hook: Think of friendship links: classmates (model cards), pen pals (citations), and clubmates (shared datasets) all show relationships.
The Concept (Relatedness Graphs): What it is: Three signals to say models (and their tables) are related (papers, lineage, datasets). How it works: 1) Paper-level: Direct citations and reference overlap (with intent and influence filters), 2) Model-level: Explicit links and base_model inheritance, 3) Dataset-level: Shared datasets from tags/URLs. Tables inherit their model's edges. Why it matters: Each signal catches a different type of true relation; using all gives balanced coverage and precision. Anchor: Two models that both cite BERT and both use GLUE are strongly related, even if they don't cite each other directly.
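A minimal sketch of how the three signals could be combined into one graph with networkx. The per-model metadata dictionary, the quadratic pairwise loop, and the reduction of paper-level relatedness to plain reference overlap (the paper also uses direct citations with intent and influence filters) are simplifying assumptions for illustration, not the benchmark's construction code.

```python
import networkx as nx

def build_relatedness_graph(models: dict) -> nx.Graph:
    """models: model_id -> {'papers': set, 'datasets': set, 'base_model': str | None}."""
    g = nx.Graph()
    g.add_nodes_from(models)
    ids = list(models)
    for i, a in enumerate(ids):
        base = models[a].get("base_model")
        if base in models:                                      # model-level: base_model lineage
            g.add_edge(a, base, signals={"lineage"})
        for b in ids[i + 1:]:
            signals = set()
            if models[a]["papers"] & models[b]["papers"]:       # paper-level: reference overlap
                signals.add("paper")
            if models[a]["datasets"] & models[b]["datasets"]:   # dataset-level: shared datasets
                signals.add("dataset")
            if signals:
                existing = g.get_edge_data(a, b, default={}).get("signals", set())
                g.add_edge(a, b, signals=existing | signals)
    return g

def tables_related(graph, table_to_model, t1, t2) -> bool:
    """Tables inherit their model's edges: related iff their models share any edge."""
    return graph.has_edge(table_to_model[t1], table_to_model[t2])
```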
Putting it together: After extraction, cleaning, filtering, and augmentation, the benchmark has over 60K models and about 90K tables. Its relatedness graphs are realistic and skewed (some hub models connect widely), making evaluation meaningful and challenging.
04 Experiments & Results
Hook: Picture a quiz game: given one table, can you pick the most related table from a huge pile? The score is how often you get the top pick right.
The Concept (Evaluation Setup): What it is: A test of different search methods to find the most related table (Precision@1). How it works: 1) Treat every table as a query, 2) Search the corpus, 3) Count success if top-1 is ground-truth related by paper, model, or dataset signals. Why it matters: Precision@1 is strict (if your first guess is wrong, you score zero), so it is a clear signal of practical usefulness. Anchor: If your query is a BERT GLUE table, top-1 should be a true peer like RoBERTa GLUE, not a random vision table.
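Given a relatedness graph like the one sketched above, scoring reduces to a few lines. In this sketch, retrieve_top1 is a placeholder for whichever search method is under test and is assumed to return the best non-self candidate table.

```python
def precision_at_1(query_tables, retrieve_top1, graph, table_to_model) -> float:
    """Fraction of queries whose single top-ranked table is ground-truth related."""
    hits = 0
    for q in query_tables:
        top1 = retrieve_top1(q)  # best non-self candidate returned by the search method
        if graph.has_edge(table_to_model[q], table_to_model[top1]):
            hits += 1
    return hits / len(query_tables)
```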
The competition (methods tested):
- Keyword search (headers/cells)
- Joinable search (can tables be joined on a key?)
- Unionable search (can tables be stacked by columns?), using Starmie for semantic unions
- Dense retrieval (Sentence-BERT embeddings over whole tables), FAISS for fast neighbors (a minimal sketch follows this list)
- Sparse retrieval (metadata text with Pyserini)
- Hybrid retrieval (sparse metadata to shortlist, then dense rerank)
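A minimal sketch of the dense, table-embedding route (the overall winner), assuming the inputs are cleaned, augmented cell-string tables from the earlier sketches. The encoder checkpoint "all-MiniLM-L6-v2" and the linearization format are illustrative choices, not necessarily what the paper used.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def table_to_text(rows: list[list[str]]) -> str:
    """Linearize a table (ideally after header-to-cell augmentation) into one string."""
    return " | ".join(" ; ".join(row) for row in rows)

def dense_top1(query_table, corpus_tables, model_name="all-MiniLM-L6-v2") -> int:
    """Return the corpus index of the most similar table under dense retrieval."""
    encoder = SentenceTransformer(model_name)
    corpus_vecs = encoder.encode([table_to_text(t) for t in corpus_tables],
                                 normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(corpus_vecs.shape[1])  # inner product = cosine after normalization
    index.add(corpus_vecs.astype(np.float32))
    query_vec = encoder.encode([table_to_text(query_table)],
                               normalize_embeddings=True, convert_to_numpy=True)
    _, ids = index.search(query_vec.astype(np.float32), 1)  # top-1 neighbor
    return int(ids[0][0])  # if the query itself is indexed, search with k=2 and skip it
```

A hybrid variant would first shortlist candidates with a sparse metadata index (e.g., BM25 via Pyserini) and then rerank only that shortlist with these dense scores.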
Scoreboard (context):
- Table-based dense retrieval: 66.5% P@1 overall, like getting an A while many others score a B.
- Union-based semantic search: 54.8% P@1 overall; 54.6% on citation links, 31.3% on model-card inheritance, 30.6% on shared datasets. A solid B, especially strong for citation-relatedness.
- Metadata-hybrid retrieval: 54.1% P@1, competitive and particularly good on lineage/dataset signals.
- Keyword and simple joinable search trail behind; they miss semantic and schema variations.
Surprising findings and insights:
- Dense vs. Union strength: Dense table embeddings excel overall (66.5%), but union search shines for direct citation-style relatedness, meaning structure-aware matching still matters a lot.
- Source quality matters: GitHub and model-card tables (clean Markdown) yield much higher precision than heterogeneous paper tables (especially S2ORC+LLM reconstructions). Homogeneous sources (model cards plus GitHub, i.e., M+G) score best; adding noisier sources lowers precision.
- Augmentations help: Adding header text into cells consistently boosts accuracy. Transposing helps a little; the biggest benefit comes from making semantics explicit.
- Structural robustness: For union search, shuffling columns and rows at inference helps the encoder ignore layout quirks and focus on meaning; column shuffling yields the largest gains (a small shuffling sketch follows this list).
- Ground-truth choice changes numbers: Overlap-based citation GT is denser and yields higher measured precision than strict direct-citation GT. Methods behave consistently as GT becomes stricter (numbers drop but relative ordering remains stable).
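A small sketch of that inference-time shuffling, assuming a rectangular cell-string table whose first row is the header; keeping the header fixed and the seeding are illustrative choices, not details from the paper.

```python
import random

def shuffle_table(rows: list[list[str]], seed=None) -> list[list[str]]:
    """Permute columns and body rows so the encoder cannot rely on layout order."""
    rng = random.Random(seed)
    header, body = rows[0], rows[1:]
    col_order = list(range(len(header)))
    rng.shuffle(col_order)                                    # column shuffling (largest gains)
    permuted = [[row[j] for j in col_order] for row in [header] + body]
    shuffled_body = permuted[1:]
    rng.shuffle(shuffled_body)                                # row shuffling (header stays first)
    return [permuted[0]] + shuffled_body
```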
Concrete examples:
- Union search example: A BERT GLUE table retrieves MUPPET or RoBERTa GLUE tables with very similar schemas, enabling side-by-side comparisons.
- Dense retrieval example: It pulls in semantically related tables with different layouts (extra metrics, different column orders), connecting relevant content that union search might miss.
Big picture: No single method dominates every case. Dense is best overall, union finds schema-aligned peers, and hybrid helps when context lives outside the table. Together they suggest combining structural and semantic signals is the path forward.
05 Discussion & Limitations
Hook: Even the best treasure map can have smudges. Knowing the smudges helps you read it better.
The Concept (Limitations): What it is: Realistic constraints in the data and methods. How it works: 1) Paper PDFs are hard: S2ORC+LLM reconstructions are useful but noisy, 2) Sources vary (Markdown vs PDF), creating uneven quality, 3) Ground truth depends on available links/tags; missing metadata weakens edges, 4) ModelTables is AI-model-centric, so results may not transfer to other domains. Why it matters: Understanding limits prevents overclaiming and guides future fixes. Anchor: A blurry photo still shows the scene, but you won't read tiny labels; the same goes for low-quality table extractions.
Required resources:
- Access to Hugging Face model cards and dataset metadata
- GitHub README crawling
- arXiv HTML fetching and Semantic Scholar S2ORC dumps
- An LLM for table reconstruction from raw text (optional but helpful)
- Compute for indexing (FAISS, Pyserini) and training/validation
When not to use:
- Domains without reliable citations, lineage tags, or dataset metadata
- Tasks needing PDF-accurate structure recovery without HTML support
- Settings where strict privacy prevents even metadata linking (unless you adapt the pipeline for in-house signals)
Open questions:
- How best to fuse structural union signals with dense semantics for top-1 and beyond?
- Can we improve PDF table reconstruction to near-HTML quality at scale?
- How to auto-detect table types (performance vs config) and weigh them differently in retrieval?
- Can we adapt the relatedness signals for private model lakes (e.g., team/project links) without losing reliability?
- What are the best representations for numeric-heavy tables (units, ranges, uncertainty)?
06 Conclusion & Future Work
Three-sentence summary: ModelTables is a large, model-centric benchmark of tables that links each table to its model and papers, and defines relatedness using citations, lineage, and shared datasets. Across many methods, dense table embeddings perform best overall, union-based search is strong for citation-style relatedness, and hybrid metadata methods compete closely, showing that each captures different, valuable signals. The released data and reproducible pipeline enable community progress in semantic table retrieval, structured comparison, and principled organization of model knowledge.
Main achievement: Turning real, author-supplied signals (citations, model-card links, dataset tags) into a trustworthy, large-scale ground truth for thematically related model tables, and using it to reveal clear gaps and opportunities in current search methods.
Future directions:
- Fuse structural union and dense semantic signals into a unified retriever-reranker stack
- Improve PDF table reconstruction and unit/metric normalization
- Add automatic table-type detection and type-aware scoring
- Extend to private model lakes with internal lineage and project signals
Why remember this: It shows that tables carry the precise facts we need to truly compare models, and that linking them via real scholarly and lineage signals unlocks better search, better decisions, and better science.
Practical Applications
- Build a semantic table search tool that finds peer models' performance tables for side-by-side comparisons.
- Create dashboards that automatically collect and align configuration tables for a given base model family.
- Power a "model chooser" assistant that summarizes top models for a dataset across recent papers.
- Audit model cards by triangulating table facts from model cards, GitHub, and papers to catch inconsistencies.
- Enrich internal model registries with citation- and dataset-based relatedness for faster discovery.
- Enable table-aware question answering (e.g., "Which model has the best MNLI score under 150M params?").
- Automate report generation that aggregates and normalizes metrics/units across related tables.
- Support private model lakes by swapping paper-citation signals with project/team lineage graphs.
- Boost retrieval robustness by deploying header-to-cell augmentation and transpose variants.
- Train new retrieval models that fuse structural (unionable) and semantic (dense) signals.