CGPT: Cluster-Guided Partial Tables with LLM-Generated Supervision for Table Retrieval
Key Summary
- This paper introduces CGPT, a way to help computers find the right tables by building smarter mini-versions of tables and training with tough practice questions.
- Instead of just taking the first few rows of a table, CGPT groups similar rows using K-means clustering and samples from each group to cover more of the table's meaning.
- A language model writes synthetic (practice) questions about each mini-table, which are then used to train the retriever to be more precise.
- CGPT uses hard negatives (very similar but wrong tables) to teach the model to tell close look-alikes apart.
- Across four benchmarks (MimoTable, OTTQA, FetaQA, E2E-WTQ), CGPT boosts top-1 accuracy (R@1) by an average of 16.54% over strong baselines like QGpT.
- It generalizes well across different domains and languages in a single mixed corpus.
- Smaller, cheaper LLMs for generating questions work almost as well, making the method cost-efficient.
- The core idea is semantically guided partial tables plus contrastive training using LLM-generated supervision.
- Even without fine-tuning, the clustering-based partial tables alone improve retrieval.
- The approach is scalable and practical for large table collections used in real-world systems.
Why This Research Matters
Many real-world facts live inside tables, from budgets to sports stats to lab results. CGPT helps systems find the right table on the first try, which speeds up answers and reduces frustration for users. Because it works well across domains and languages, one system can serve many use cases reliably. It is also cost-effective: smaller LLMs for question generation perform almost as well, making it accessible to teams without huge compute. By training with realistic, challenging examples, CGPT boosts precision where it matters most. This leads to better chat assistants, smarter search, and more trustworthy data tools.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your school library has millions of fact cards organized into tables: countries with their capitals, players with their scores, or plants with their heights. When you ask a question like "Which country has Paris as its capital?", the computer needs to find the exact table fast.
The Concept (Table Retrieval): Table retrieval is how a computer picks the right table from a giant collection when given a question. How it works:
- Turn the question and every table into special number lists (embeddings).
- Compare the numbers to see which table is most similar to the question.
- Return the top matches as answers. Why it matters: Without good table retrieval, later steps like answering the question or summarizing data will fail because the system starts with the wrong table. Anchor: If you ask "Who won most gold medals in 2016?", table retrieval needs to fetch the Olympics medal table before any answer can be found.
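To make the embed-and-compare loop concrete, here is a minimal sketch using the sentence-transformers library and an off-the-shelf embedding model (BAAI/bge-m3, the base model used later in the paper); the flattened table strings and table names are made up for illustration.

```python
# Minimal dense table retrieval: embed the query and flattened tables,
# then rank tables by cosine similarity. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")  # any sentence-embedding model works here

tables = {
    "olympic_medals_2016": "Country | Gold | Silver | Bronze\nUSA | 46 | 37 | 38\nChina | 26 | 18 | 26",
    "world_capitals": "Country | Capital\nFrance | Paris\nJapan | Tokyo",
}
query = "Which country has Paris as its capital?"

table_ids = list(tables.keys())
table_vecs = model.encode(list(tables.values()), normalize_embeddings=True)
query_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = table_vecs @ query_vec
for idx in np.argsort(-scores):          # best match first
    print(f"{table_ids[idx]}: {scores[idx]:.3f}")
```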
The World Before: Computers got pretty good at finding text paragraphs (like news articles), thanks to general-purpose embedding models trained on tons of text. But tables are different: they're very structured, with rows and columns, and may be very large. Early systems often squashed a whole table into one vector. That's like trying to squeeze a whole class's report cards into one emoji: you lose important details.
The Concept (Embedding Model): An embedding model turns text or table bits into number vectors so computers can compare meaning. How it works:
- Read words, headers, and sometimes positions (row/column info).
- Map them into numerical vectors capturing meaning.
- Use similarity (like cosine similarity) to compare. Why it matters: If embeddings miss important details, the model can't match questions to the right tables. Anchor: Two sentences like "France's capital is Paris" and "Paris is the capital of France" get nearly the same vectors, so the system knows they mean the same thing.
The Problem: When a system squeezes a whole big table into one vector, important specifics get lost.
The Concept (Semantic Compression): Semantic compression happens when too much information is stuffed into too small a summary, losing details. How it works:
- Encode an entire large table into a single vector.
- Only the most average, common patterns survive.
- Rare but crucial facts (like a key row far down) get blurred out. Why it matters: The question might ask about a detail that got lost, causing a mismatch. Anchor: If a table has 1,000 rows and only row 923 answers your question, cramming everything into one vector often forgets row 923.
Failed Attempts: A recent fix called QGpT asked an LLM to write helpful practice questions based on the first k rows of a table.
The Concept (Partial Table): A partial table is a smaller slice of the original table used to guide retrieval or question generation. How it works:
- Pick a subset of rows from a big table.
- Use this smaller piece to create focused queries or representations.
- Retrieve based on these smaller, clearer summaries. Why it matters: It shortens inputs and highlights details, but if you pick the wrong rows, you miss important info. Anchor: If you only keep the first 10 rows but the answer is in row 50, your practice questions won't cover the right content.
What Didn't Work: Always picking the first rows assumes they're representative. Often, the juicy details live elsewhere. Also, the practice questions generated by LLMs were not actually used to train the embedding model itself, leaving performance gains on the table.
The Gap: We need a smarter way to choose which rows form the partial table, so we cover more of the table's meaning, and we should use those LLM-written questions to truly teach (fine-tune) the retriever.
Real Stakes: Good table retrieval powers many things you use:
- Fast answers in search or chat assistants when data lives in spreadsheets.
- Accurate financial, science, or sports lookups.
- Less time wasted digging through messy data. If we don't fix it, you get slower tools, wrong answers, or missed insights.
The Concept (K-means Clustering): K-means clustering groups similar items into k clusters. How it works:
- Start with k guess-centers.
- Assign each row's embedding to the nearest center.
- Recompute centers by averaging assigned points.
- Repeat until groups stop changing much. Why it matters: It helps pick diverse rows from different groups, so our partial tables cover more of the table's topics. Anchor: Like sorting your sticker collection into groups by theme (animals, sports, space) so you remember to include a few from each theme when making a mini album.
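Here is a tiny, self-contained sketch of that assign-and-recenter loop using scikit-learn's KMeans; the 2D points stand in for row embeddings and are purely illustrative.

```python
# K-means in a nutshell: assign each point to the nearest center, move each
# center to the mean of its points, and repeat. scikit-learn handles the loop.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # one tight group
    [5.0, 5.1], [5.2, 4.9],                  # a second group
    [9.8, 0.2], [10.1, 0.0],                 # a third group
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster of each point:", km.labels_)
print("cluster centers:\n", km.cluster_centers_)
```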
That's where CGPT comes in: it builds cluster-guided partial tables and uses LLM-made questions to fine-tune the retriever, especially with hard negatives (tricky, similar but wrong tables) to sharpen precision.
02 Core Idea
Hook: You know how studying with only the first page of your notes is risky because the test might ask about page 5 or 7? A smarter plan is to sample from each chapter and practice with tough questions.
The Concept (CGPT): CGPT is a training framework that builds smarter partial tables using clustering and trains the retriever with LLM-made questions and hard negatives so it finds the right table more often. How it works:
- Cluster table rows by meaning and sample from each cluster to build diverse partial tables (KPTs).
- Ask an LLM to write synthetic questions about each KPT.
- For each question, find look-alike but wrong tables (hard negatives).
- Fine-tune the embedding model so it prefers the true KPT over the hard negatives. Why it matters: Without cluster-guided sampling and hard-negative training, the model misses scattered facts and can't tell near-duplicates apart. Anchor: Like making a study guide with a few examples from every chapter and quizzing yourself against sneaky wrong answers until you reliably pick the right one.
- The "Aha!" Moment in one sentence: Don't guess which rows matter; let clustering pick diverse slices, then use LLM-written questions plus hard negatives to directly teach the retriever what "right" looks like.
- Multiple Analogies:
- Puzzle Boxes: Instead of only checking the top puzzle pieces, take a few from every color group, then practice with tricky lookalike puzzles so you learn which pieces truly fit.
- Grocery Basket: Build a balanced basket by grabbing from produce, dairy, grains, and snacks, then practice comparing with baskets that look similar but miss key items.
- Sports Training: Train by playing against teams that are almost as good as your own; close matches expose weaknesses and make you sharper.
- Before vs After:
- Before: Partial tables were made by taking the first k rows; LLM questions helped but didn't improve the retriever itself; easy negatives led to shallow learning.
- After: Partial tables are cluster-guided; LLM questions become supervision; hard negatives force fine-grained discrimination; top-1 accuracy jumps substantially.
- Why It Works (intuition):
- Clustering spreads coverage across the table's semantic neighborhoods, rescuing details that would be lost in a single vector.
- LLM questions translate table facts into natural queries, aligning the retriever with how users really ask.
- Hard negatives are almost-right decoys, so the model learns precise boundaries rather than vague similarities.
- Contrastive learning is like pulling the right pair closer while pushing lookalikes away, organizing the space of meanings.
- Building Blocks:
- Hook: Think of numbers as map pins.
- The Concept (Cosine Similarity): Cosine similarity measures how aligned two vectors are, like checking if two arrows point the same way. How it works: Compute the angle between vectors; smaller angle = higher similarity. Why it matters: It ranks which tables are closest to a query in meaning. Anchor: Two arrows both pointing northeast have high cosine similarity.
- The Concept (Synthetic Query Generation): An LLM writes realistic practice questions from a partial table to simulate user queries. How it works: Feed the KPT to the LLM with instructions to create diverse, content-grounded questions. Why it matters: It teaches the retriever on examples that mirror real searches. Anchor: Like a teacher writing practice questions directly from your textbook page.
- The Concept (Hard-Negative Contrastive Fine-tuning): Train the model so queries stick to their true tables and repel near-miss tables. How it works: For each query, compare the positive table to several very similar wrong ones and adjust embeddings to favor the positive. Why it matters: Without hard negatives, the model can't tell close cases apart. Anchor: Like learning to tell twin puppies apart by focusing on tiny differences (a spot on the ear).
- The Concept (Recall@k): Recall@k asks, "Is the correct table in the top k results?" How it works: For each query, retrieve k tables; count how often the right one shows up. Why it matters: High R@1 means users get the right table first, saving time. Anchor: If R@1 is 60%, 6 out of 10 times the right table is at the top. (A small sketch of how Recall@k is computed follows this list.)
Together, these pieces turn scattered table facts into findable answers.
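As a concrete companion to the Recall@k building block (cosine-similarity ranking was sketched earlier), here is a minimal way to compute Recall@k from ranked retrieval results; the table ids and rankings are invented for illustration.

```python
# Recall@k over a batch of queries: is the gold table anywhere in the top-k
# retrieved ids? Rankings here are made up for illustration.
def recall_at_k(ranked_ids_per_query, gold_ids, k):
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids_per_query, gold_ids))
    return hits / len(gold_ids)

ranked = [
    ["medals_2016", "medals_2012", "capitals"],   # query 1: gold table ranked first
    ["capitals", "medals_2016", "medals_2012"],   # query 2: gold table ranked second
]
gold = ["medals_2016", "medals_2016"]

print("R@1 =", recall_at_k(ranked, gold, 1))  # 0.5
print("R@5 =", recall_at_k(ranked, gold, 5))  # 1.0
```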
03 Methodology
At a high level: Input (full tables) → K-means Partial Tables (KPTs) → LLM-generated synthetic queries → Hard-negative sampling → Contrastive fine-tuning → Output (a sharper embedding retriever).
Step A: K-means Partial Table Generation (KPT)
- What happens: Each row (instance) in a table gets an embedding. K-means clustering groups similar rows. From each cluster, we randomly sample s rows and combine them with the header to make one partial table. We create k such KPTs per original table.
- Why this step exists: Big tables hide details when compressed. Clustering ensures we grab diverse slices so rare but important rows aren't missed. Random intra-cluster sampling preserves variation, which the ablation shows is crucial.
- Example: Suppose a table lists 1,000 animals with columns [Name, Class, Habitat, Lifespan, Diet]. Clustering groups rows like ocean mammals, desert reptiles, rainforest birds, etc. Sampling 5 from each group gives mini-tables that cover many habitats and diets.
- Secret sauce here: Adaptive k (bounded by k_max) balances coverage and efficiency; random sampling inside clusters keeps semantic variety (outperforming centroid-only picks in the paper).
Hook: Think of making a lunchbox with items from every food group so your meal is balanced. The Concept (KPT - Cluster-Guided Partial Table): A KPT is a small table built by sampling rows from each semantic cluster. How it works: Embed rows → cluster → sample per cluster → attach header. Why it matters: Without KPTs, the model either sees too little (first-k-rows only) or too much (full table compressed), losing crucial details. Anchor: Your lunchbox (KPT) has fruit, veggies, grains, and protein; nothing important is forgotten.
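Below is a rough sketch of Step A under stated assumptions: it follows the embed, cluster, sample-per-cluster, attach-header recipe described above, but the adaptive choice of cluster count, the helper function, and the pipe-separated table format are simplifications for illustration, not the paper's exact implementation.

```python
# Cluster-guided partial tables (KPTs): embed rows, cluster with K-means,
# sample s rows from every cluster, and prepend the header.
import random
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

def build_kpts(header, rows, k_max=5, s=5, n_kpts=3, seed=0):
    """Return n_kpts partial tables, each covering every row cluster."""
    rng = random.Random(seed)
    row_vecs = model.encode(rows)
    k = min(k_max, len(rows))                      # simplified adaptive cluster count
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(row_vecs)

    clusters = {c: [r for r, lab in zip(rows, labels) if lab == c] for c in range(k)}
    kpts = []
    for _ in range(n_kpts):
        sampled = []
        for members in clusters.values():          # take a random slice of every cluster
            sampled.extend(rng.sample(members, min(s, len(members))))
        kpts.append("\n".join([header] + sampled)) # header + diverse row slice
    return kpts

header = "Name | Class | Habitat | Lifespan | Diet"
rows = ["Blue whale | Mammal | Ocean | 80 | Krill",
        "Gila monster | Reptile | Desert | 20 | Eggs",
        "Macaw | Bird | Rainforest | 50 | Seeds",
        "Dolphin | Mammal | Ocean | 40 | Fish"]
for kpt in build_kpts(header, rows, k_max=3, s=1):
    print(kpt, "\n---")
```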
Step B: Synthetic Query Generation with an LLM
- What happens: For each KPT, prompt an LLM (e.g., Llama-3.1-8B-Instruct) to write n_q diverse, content-grounded questions (entity, time, comparison, aggregation, complex reasoning).
- Why this step exists: It converts table facts into user-like questions, giving direct supervision about how real queries might look.
- Example: For a KPT about Olympic medals, the LLM might ask, "Which country ranked first in gold medals in 2016?" or "How many total medals did Japan win?"
- Secret sauce here: The prompt requires referencing real values and multiple question types, preventing fluffy, generic queries.
š Hook: Like a teacher creating a quiz from one page of your textbook. š„¬ The Concept (Synthetic Queries): These are LLM-written practice questions tied to a specific KPT. How it works: Give the LLM the KPT and a template; it outputs diverse, grounded questions. Why it matters: Without realistic queries, the retriever wonāt learn how people truly ask. š Anchor: A quiz asking, āWhich animal has the longest lifespan in this mini-table?ā points exactly to rows in that KPT.
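The sketch below shows one way to prompt an instruction-tuned LLM (here Llama-3.1-8B-Instruct via Hugging Face transformers, with the temperature from the training details) for grounded questions; the prompt wording is an illustrative assumption, not the paper's exact template.

```python
# Ask an instruction-tuned LLM to write grounded questions about one KPT.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

kpt = (
    "Country | Gold | Silver | Bronze\n"
    "USA | 46 | 37 | 38\n"
    "Japan | 12 | 8 | 21"
)

prompt = (
    "You are given a partial table. Write 5 diverse questions a user might ask "
    "that can be answered from it. Cover entity lookup, comparison, and "
    "aggregation, and reference real values from the table.\n\n"
    f"Table:\n{kpt}\n\nQuestions:"
)

out = generator(prompt, max_new_tokens=256, temperature=0.4, do_sample=True)
print(out[0]["generated_text"])
```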
Step C: Hard-Negative Sampling
- What happens: For each query, compute cosine similarity against KPTs from other tables using a pretrained retriever, then pick the top-h most similar wrong KPTs (hard negatives).
- Why this step exists: Easy negatives (very different tables) don't teach nuance. Hard negatives force fine-grained learning.
- Example: If the query is about "gold medals 2016," a hard negative might be a medals table from 2012 or a 2016 medals-by-sport table: very close, but not the exact target.
- Secret sauce here: Selecting confusing lookalikes raises precision at top-1, as confirmed by results.
Hook: Training wheels are helpful, but racing a bike against slightly faster friends makes you improve fast. The Concept (Hard Negatives): Near-miss, wrong tables that look very similar to the correct one. How it works: Rank candidates by cosine similarity; pick the top wrong ones. Why it matters: Without them, the model can't reliably separate twins. Anchor: Two animals with almost identical features; you learn to notice the tiny spot that distinguishes the right one.
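A small sketch of hard-negative mining under the description above: rank candidate KPTs from other tables by cosine similarity to the query and keep the top h. The helper function, table ids, and toy embeddings are assumptions for illustration.

```python
# Hard-negative mining: keep the h wrong KPTs most similar to the query.
# Embeddings are assumed L2-normalized, so dot product = cosine similarity.
import numpy as np

def mine_hard_negatives(query_vec, kpt_vecs, kpt_table_ids, positive_table_id, h=8):
    """Return indices of the h most similar KPTs that belong to other tables."""
    scores = kpt_vecs @ query_vec                       # cosine similarity per KPT
    order = np.argsort(-scores)                         # most similar first
    negatives = [i for i in order if kpt_table_ids[i] != positive_table_id]
    return negatives[:h]

# Toy example: 4 KPTs from 4 tables, 2-d "embeddings" for readability.
kpt_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]])
kpt_vecs = kpt_vecs / np.linalg.norm(kpt_vecs, axis=1, keepdims=True)
kpt_table_ids = ["medals_2016", "medals_2012", "capitals", "medals_by_sport_2016"]
query_vec = np.array([1.0, 0.05])
query_vec = query_vec / np.linalg.norm(query_vec)

print(mine_hard_negatives(query_vec, kpt_vecs, kpt_table_ids, "medals_2016", h=2))
```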
Step D: Contrastive Fine-tuning
- What happens: Train the embedding model so the query is closer to its true KPT than to any hard negatives, using a contrastive objective (InfoNCE) with temperature τ.
- Why this step exists: It reshapes the embedding space to reflect what ācorrectā means for real questions.
- Example: For query q and positive KPT p+, the model increases sim(q, p+) and decreases sim(q, p−) for each hard negative p−.
- Secret sauce here: Using LLM-generated queries as supervision plus hard negatives delivers strong, targeted signals that lift R@1.
Hook: Imagine organizing your bookshelf so each book (query) sits closest to its exact series (positive) and away from lookalike series (negatives). The Concept (Contrastive Learning / InfoNCE): A training rule that pulls correct pairs together and pushes incorrect pairs apart. How it works: Compute similarities; boost the positive; reduce the negatives; control sharpness with temperature τ. Why it matters: Without contrastive pressure, the model stays fuzzy about what matches what. Anchor: Your favorite socks end up next to their true partners, not next to almost-matching socks.
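For readers who want the objective in code, here is a minimal PyTorch sketch of InfoNCE with one positive and H hard negatives per query; tensor shapes, names, and the call signature are illustrative, not the paper's training code.

```python
# InfoNCE with one positive and several hard negatives per query, as in Step D.
import torch
import torch.nn.functional as F

def infonce_loss(query_emb, pos_emb, neg_embs, tau=0.01):
    """
    query_emb: (B, D)      one embedding per query
    pos_emb:   (B, D)      embedding of the true KPT
    neg_embs:  (B, H, D)   embeddings of H hard-negative KPTs
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_embs, dim=-1)

    pos_sim = (q * p).sum(dim=-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bhd->bh", q, n)           # (B, H)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau  # positive is class 0
    labels = torch.zeros(q.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy shapes: batch of 4 queries, 8 hard negatives, 16-d embeddings.
loss = infonce_loss(torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 8, 16))
print(loss.item())
```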
Training Details (as used in the paper):
- Base model: BAAI/bge-m3.
- KPT parameters: r = 10 (cluster granularity), k_max = 5 (max clusters), s = 5 (rows per cluster sample).
- LLM: Llama-3.1-8B-Instruct, temperature 0.4, max 1024 tokens, n_q = 5 queries per KPT.
- Hard negatives: h = 8 per query.
- Fine-tuning: learning rate 1e-5, 2 epochs, τ = 0.01, gradient accumulation 32 steps, single NVIDIA A6000 (48GB).
What breaks without each step:
- No clustering: Partial tables may miss key topics; coverage shrinks; retrieval worsens.
- No LLM queries: Lacks natural, varied supervision; model won't align with user phrasing.
- No hard negatives: Model can't disambiguate near twins; R@1 drops.
- No fine-tuning: Improvements rely only on representations at query time; gains are smaller.
04 Experiments & Results
The Test: The authors measured how often the right table appeared in the top 1, 5, or 10 search results (Recall@1/5/10). High R@1 means users usually get the correct table on the first try, like opening the right book on the first pull.
The Competition: CGPT was compared to QGpT and variants that remove parts of CGPT (no fine-tuning, or no hard negatives). They also tested different sampling strategies and different LLMs for generating queries. Datasets included MimoTable (Chinese and English), OTTQA, FetaQA, and E2E-WTQ, plus a unified multi-domain corpus mixing them together.
The Scoreboard (with context):
- Main result: CGPT lifted average R@1 by 16.54% over strong retrieval augmentation baselines.
- MimoTable (EN): CGPT reached 60.13% R@1 (like moving from a B to a solid A), surpassing QGpT by 9.47 points.
- Even without fine-tuning (just clustering-based KPTs), R@1 improved over QGpT by 2.14 to 6.48 points, showing the value of semantically guided partial tables.
- With hard negatives, top-1 precision improved the most, sometimes trading a little R@5/10 (a common precision-recall tension). For precision-focused use cases, this is a win.
- Cross-domain (merged corpus): KPT alone boosted an unfine-tuned BGE-m3; CGPT amplified gains further. Example: On MimoTable (CH), CGPT reached about 55.03% R@1 (a +16.49-point jump over baseline), and on MimoTable (EN) it hit 57.79% R@1.
- Transfer back to QGpT's construction: Training with CGPT still improved performance on QGpT's dataset format (e.g., MimoTable EN from 50.66% to 59.28% R@1), showing robustness to how partial tables are built.
Surprising Findings:
- Smaller LLMs worked nearly as well. Across Llama-3.1-8B, GPT-OSS-20B, and Qwen3-4B, R@1 varied by only about 0.6 points. That means you can save compute costs without giving up much accuracy.
- Random intra-cluster sampling outperformed centroid-only picks. Simplifying to just one representative per cluster reduced diversity and hurt results (e.g., centroid-based selection dropped to 51.62% R@1 on MimoTable CH versus 56.8% for CGPT), highlighting the need to preserve variation.
Interpreting the Numbers:
- R@1 is the most user-visible metric: did you get the right table first? Gains of 9 to 18 points mean far fewer frustrated clicks.
- High R@5/10 across settings means the correct table is almost always in the shortlist, which helps systems that re-rank or do multi-step reasoning.
Key Takeaways:
- Cluster-guided partial tables are a strong foundation.
- Making LLM-generated questions into supervision (not just hints) pays off.
- Hard negatives sharpen top-1 precision, the most important place to be right.
- The method travels well across languages and domains.
Hook: Think of a spelling bee; getting the first letter right matters most. The Concept (Cross-Domain Generalization): A model that performs well across different kinds of data and languages without retraining per domain. How it works: Training signals come from varied tables and LLM queries; clustering and contrastive learning encourage broad, robust patterns. Why it matters: Real systems mix finance, sports, science, and more; one model should handle all. Anchor: Like a good umbrella that works in drizzle, rain, or snow.
05 Discussion & Limitations
Limitations:
- Clustering quality depends on initial embeddings. If those are weak or biased, clusters may miss true semantic groupings, harming coverage.
- Very small tables might not benefit much from clustering overhead; very large, noisy tables may still hide rare facts outside sampled rows.
- LLM-generated queries must be faithful to the table. If prompts are off or models hallucinate, supervision can teach the wrong thing.
- Hard negatives require extra retrieval passes during training, increasing compute.
- Some domains need structured reasoning beyond row-level signals (e.g., multi-table joins, complex math), which isn't directly addressed here.
Required Resources:
- A capable embedding model (e.g., BAAI/bge-m3) and a GPU (e.g., A6000 class) for fine-tuning.
- An LLM for question generation; smaller models (e.g., Qwen3-4B) suffice in practice.
- Storage and pipelines to generate KPTs, queries, and negatives at scale.
When NOT to Use:
- Tiny collections where exact-match search already works perfectly.
- Highly sensitive settings where synthetic data generation is disallowed.
- Cases where the question requires multi-table joins or advanced calculations not present in any single table slice.
Open Questions:
- Can we jointly learn row embeddings and clustering so clusters improve during training?
- Could adaptive sampling (learning s per cluster) capture rare facts better?
- How to integrate column types, unit normalization, or schemas to handle tables with tricky structures?
- Can we extend to cell-level or multi-table retrieval with the same framework?
- How to detect and filter hallucinated or low-quality synthetic queries automatically?
06 Conclusion & Future Work
Three-Sentence Summary: CGPT builds smarter mini-versions of tables using clustering, asks an LLM to write grounded practice questions, and fine-tunes the retriever with hard negatives so it picks the right table first more often. Across multiple datasets and a mixed-domain setup, it consistently beats strong baselines, boosting R@1 by an average of 16.54%. It stays effective even with smaller LLMs, making it practical and scalable.
Main Achievement: Turning LLM-generated questions into direct supervision over semantically diverse, cluster-guided partial tables, and pairing that with hard-negative contrastive training, delivers large, reliable gains in top-1 table retrieval.
Future Directions: Explore joint clustering-and-embedding learning, adaptive sampling per cluster, schema- and unit-aware representations, and extensions to multi-table and cell-level retrieval. Add automatic quality checks for synthetic queries and smarter hard-negative mining.
Why Remember This: CGPT shows that how you slice a table and how you train on realistic, challenging examples matter as much as which base model you pick: smart supervision and coverage beat one-size-fits-all shortcuts.
Practical Applications
- Enterprise search over spreadsheets and databases to quickly surface the exact report table employees need.
- Customer support bots that fetch the correct product-spec table to answer configuration questions.
- Financial analytics assistants that reliably retrieve earnings tables and KPI summaries.
- Scientific literature tools that find the right results tables (e.g., measurement datasets) for researchers.
- Logistics dashboards that fetch shipment status or inventory tables on demand.
- Public data portals that return the correct government statistics tables in one click.
- Education platforms that point students to the right example tables in textbooks or course materials.
- E-commerce search that locates the exact comparison tables for similar products.
- Healthcare data viewers that pull the correct lab results table for a patient query.
- Agriculture information systems that retrieve the right crop-yield or weather-history tables.