Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
Key Summary
- This survey explains how large language models (LLMs) can clean, connect, and enrich messy data so it’s ready for real apps like dashboards, fraud detection, and training AI.
- It shows a big shift from hand-written rules to prompt-driven, context-aware, and agent-based workflows that work across many kinds of data.
- Three main jobs are covered: data cleaning (standardization, error fixing, imputation), data integration (entity and schema matching), and data enrichment (annotation and profiling).
- LLMs help by understanding meaning (semantics), following instructions, and pulling in extra facts via retrieval, not just matching strings or using rigid rules.
- Hybrid ideas—like letting LLMs write small programs, teach smaller models, or work with tools—cut costs while keeping quality high.
- The paper lists common datasets and fair ways to measure success, such as precision, recall, F1, ranking scores, and semantic similarity.
- Limits remain: token cost, occasional hallucinations, the need for better evaluations, and making agents reliable at scale.
- A roadmap suggests scalable LLM–data systems, principled agent designs with safeguards, and stronger, evidence-grounded evaluation.
- Real-world stakes are high because bad data wastes money; LLMs can reduce manual work and improve trustworthy decisions.
- Bottom line: with careful design and checks, LLMs are becoming powerful helpers for turning raw data into application-ready data.
Why This Research Matters
Good data is the foundation of trustworthy dashboards, fair AI models, and smart decisions. When data is messy or mismatched, companies waste time and money, and people get wrong answers or bad experiences. LLM-enhanced preparation cuts manual effort, understands meaning, and grounds fixes in evidence, so results are more reliable. Hybrid approaches make this affordable by turning LLM insights into small programs or compact models that run fast at scale. Stronger evaluation and safer agent workflows will further reduce risks and help organizations deploy these methods with confidence. In short, this shift helps everyone—analysts, engineers, and end users—get better answers from their data, faster.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re making a fruit salad, but the fruits come in different shapes, sizes, and even languages on their stickers. Some are rotten, some are missing pieces, and some labels mean the same thing but look totally different. That’s what real-world data looks like before it’s ready to use.
🥬 The Concept — Data Preparation:
- What it is: Getting raw, messy data into a clean, unified, and useful form for apps, analytics, and AI.
- How it works: 1) Clean mistakes and make formats consistent; 2) Connect related pieces from different places; 3) Add helpful labels and summaries.
- Why it matters: Without it, charts lie, models learn the wrong lessons, and decisions go off track. 🍞 Anchor: If your birthday is written as “Jan 1st 2025” in one place and “01/01/2025” in another, data preparation makes them match so your age is calculated correctly.
🍞 Hook: You know how cleaning your room helps you find your backpack and shoes faster? Data also needs cleaning so we can trust it.
🥬 The Concept — Data Cleaning:
- What it is: Fixing errors, making formats consistent, and filling in missing values.
- How it works: 1) Standardize (same style for dates, names, etc.); 2) Detect and repair mistakes (typos, invalid codes); 3) Impute missing values using clues or outside info.
- Why it matters: Without cleaning, even simple sums or counts can be wrong. 🍞 Anchor: Turning “April 7, 2021,” “7th April 2021,” and “2021/04/07” into “20210407” so sorting by date works.
🍞 Hook: Think of two photo albums from different families that both have a picture of the same person but with different names. You’d want to line them up.
🥬 The Concept — Data Integration:
- What it is: Combining data from different sources into one consistent view.
- How it works: 1) Entity matching (same real-world thing across sources); 2) Schema matching (line up columns with the same meaning); 3) Merge carefully under rules.
- Why it matters: Without it, dashboards double-count, and searches miss important matches. 🍞 Anchor: Matching “Apple iPhone13” and “iPhone 13 by Apple” so they’re treated as the same product.
🍞 Hook: Adding toppings makes pizza better; adding labels and summaries makes data more useful.
🥬 The Concept — Data Enrichment:
- What it is: Adding semantic labels, descriptions, and profiles to make data easier to understand and search.
- How it works: 1) Annotate (e.g., this column is CustomerID); 2) Profile (summaries, stats, related data); 3) Use retrieval to attach helpful context.
- Why it matters: Without enrichment, users and models can’t find or interpret what they need. 🍞 Anchor: Marking a column as “birthDate” and writing a short table description so a BI tool can auto-generate helpful charts.
🍞 Hook: You know how a super-smart librarian can understand meanings, not just exact words? That’s what modern AI models do with data.
🥬 The Concept — LLM Techniques:
- What it is: Using large language models to understand instructions, reason about meaning, and generate or check data fixes.
- How it works: 1) Follow prompts; 2) Use examples; 3) Retrieve extra info; 4) Plan steps; 5) Call tools or write small programs.
- Why it matters: Old rule-based methods miss subtle meaning; LLMs bring semantics and flexibility. 🍞 Anchor: An LLM fixes “bxrmxngham” to “birmingham” by noticing similar past cases and typical city spellings.
🍞 Hook: When you tell a helper exactly what to do, things get done faster and better.
🥬 The Concept — Prompt-driven Workflows:
- What it is: Guiding the model with clear instructions and examples to do data tasks.
- How it works: 1) Serialize data into text; 2) Give step-by-step instructions; 3) Batch examples; 4) Ask for structured outputs.
- Why it matters: Without good prompts, results can be vague or wrong. 🍞 Anchor: “Convert these dates to YYYYMMDD; if missing, say NULL. Here are three examples.”
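To make the prompt recipe above concrete, here is a minimal sketch in Python. The `call_llm` helper is a hypothetical stand-in for whatever chat-completion client you use; the structure (instructions, a few examples, batched inputs, a structured JSON output request) follows the pattern described here, not any specific library's API.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: wire this to your chat-completion client of choice."""
    raise NotImplementedError("plug in an LLM provider here")

def build_date_prompt(values):
    # 1) instructions, 2) few-shot examples, 3) batched inputs, 4) structured output request
    examples = [("April 7, 2021", "20210407"), ("2021/04/07", "20210407"), ("", "NULL")]
    lines = ["Convert each date to YYYYMMDD. If a value is missing, output NULL.", "Examples:"]
    lines += [f"  {src!r} -> {dst}" for src, dst in examples]
    lines.append("Now convert these values and return a JSON list of strings, in order:")
    lines += [f"  {i}. {v!r}" for i, v in enumerate(values, 1)]
    return "\n".join(lines)

dirty = ["7th April 2021", "01/01/2025", ""]
prompt = build_date_prompt(dirty)
# cleaned = json.loads(call_llm(prompt))   # e.g. ["20210407", "20250101", "NULL"]
print(prompt)
```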
🍞 Hook: Picture a team of tiny robot assistants that plan, check, and use tools to finish chores.
🥬 The Concept — Agent-based Systems:
- What it is: LLM-powered assistants that plan workflows, call tools, verify steps, and iterate.
- How it works: 1) Plan (detect issues); 2) Act (standardize, repair); 3) Check (evaluate fixes); 4) Repeat until quality is good.
- Why it matters: Without agents, humans must wire many steps manually. 🍞 Anchor: An agent finds dirty date columns, calls a “clean_date” tool, checks for errors, and tries again if needed.
🍞 Hook: Before writing a report, you might first look up facts in books and websites.
🥬 The Concept — Retrieval-Augmented Generation (RAG):
- What it is: The model retrieves relevant data or knowledge and then generates grounded answers.
- How it works: 1) Build indexes; 2) Fetch top-k related samples; 3) Rerank for relevance; 4) Insert evidence into the prompt.
- Why it matters: Without retrieval, the model can guess or hallucinate. 🍞 Anchor: To fill a missing “City” from a ZIP code, the system retrieves a reliable table mapping ZIPs to cities and uses it as evidence.
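A minimal retrieve-then-generate sketch for the ZIP-code example follows. A tiny in-memory mapping stands in for the reliable reference table, and `call_llm` is again a hypothetical client stub; the point is that the prompt only contains retrieved evidence for the model to read.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in for an LLM client

# Evidence source: in practice an indexed reference table or data-lake retrieval;
# here a tiny in-memory mapping keeps the sketch self-contained.
ZIP_TO_CITY = {"35203": "Birmingham", "94105": "San Francisco", "10001": "New York"}

def retrieve_evidence(zip_code: str):
    hit = ZIP_TO_CITY.get(zip_code)
    return [f"ZIP {zip_code} maps to city {hit}"] if hit else []

def impute_city(record: dict) -> str:
    evidence = retrieve_evidence(record.get("zip", ""))
    prompt = (
        "Fill the missing 'city' field using only the evidence below. "
        "If the evidence is insufficient, answer UNKNOWN.\n"
        f"Record: {record}\n"
        "Evidence:\n" + "\n".join(f"- {e}" for e in evidence)
    )
    return call_llm(prompt)  # grounded answer, e.g. "Birmingham"

# impute_city({"name": "A. Smith", "zip": "35203", "city": None})
```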
🍞 Hook: You don’t just see letters—you understand meanings like synonyms, nicknames, and context.
🥬 The Concept — Semantic Reasoning:
- What it is: Understanding and connecting meanings, not just strings.
- How it works: 1) Use world knowledge; 2) Compare contexts; 3) Follow logical rules; 4) Explain decisions.
- Why it matters: Without semantics, “cost” and “price” look different even when they mean the same thing. 🍞 Anchor: Matching “cost” to “price” during schema matching because both refer to monetary value.
The world before: Traditional tools used rigid rules, manual scripts, and task-specific models. They struggled with meaning (semantics), didn’t generalize well to new domains or modalities, and needed lots of labeled data. People tried: regex rules, constraint engines, and specialized ML models. These often broke when data formats changed, when ambiguous names appeared, or when cross-dataset reasoning was needed.
The gap: We needed systems that understand language and meaning, can follow instructions, bring in outside facts, and coordinate steps automatically. That’s where LLMs, prompts, RAG, and agents step in.
Real stakes: Companies lose real money to bad data. In daily life, messy addresses cause lost packages, duplicate contacts clutter your phone, and mislabeled columns lead BI dashboards to mislead. With LLM-enhanced preparation, data becomes trustworthy, unified, and understandable for the apps we rely on.
02 Core Idea
🍞 Hook: Imagine a super-chef who can read recipes, ask clarifying questions, fetch missing ingredients, teach junior cooks, and write new kitchen tools when needed.
🥬 The Concept — The Aha! Moment:
- What it is: Shift from hand-coded rules to prompt-driven, retrieval-grounded, and agentic LLM workflows across cleaning, integration, and enrichment.
- How it works: 1) Use prompts and examples to explain the task; 2) Retrieve relevant context from data lakes or knowledge graphs; 3) Plan multi-step actions with agents; 4) Optionally synthesize code or teach smaller models for scale.
- Why it matters: Without this shift, preparation stays brittle, manual, and blind to meaning. 🍞 Anchor: Instead of 10 scripts for 10 date styles, one prompt + agent + small tool calls can standardize them all and verify success.
Three analogies (same idea, new angles):
- Kitchen: The LLM is the head chef (plans), RAG is the pantry (fetch ingredients), agents are sous-chefs (execute steps), and code synthesis is building a new kitchen gadget on the fly.
- Library: Prompts are your search request, RAG fetches the right books, the LLM summarizes meaningfully, and agents coordinate checkouts and photocopies.
- School cleanup: One student (LLM) reads instructions, asks for missing info (RAG), organizes teams (agents), and writes quick labels (code) so everyone can find things later (profiling/annotation).
Before vs After:
- Before: Fixed rules, local tricks, lots of manual effort, narrow models per task.
- After: Natural-language instructions, semantic reasoning, retrieval grounding, tool-calling agents, and hybrids that scale.
Why it works (intuition, not equations):
- Pretrained knowledge lets LLMs recognize meanings beyond surface strings.
- Instructions shape behavior without retraining every time.
- Retrieval anchors answers to evidence, reducing hallucinations.
- Decomposing tasks (agents, steps) lowers complexity and error.
- Hybridizing (teach a small model or synthesize code) gives speed and savings.
Building blocks (the core LEGO pieces):
🍞 Hook: You know how building with LEGO bricks lets you make many different toys from a small set of pieces?
🥬 The Concept — Prompting & Serialization:
- What it is: Turning rows/columns into clear text with step-by-step instructions.
- How it works: 1) Serialize values; 2) Give examples; 3) Ask for structured outputs; 4) Batch when possible.
- Why it matters: Without clear prompts, the model guesses. 🍞 Anchor: “Here are 5 date examples. Convert the rest to YYYYMMDD and return JSON.”
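As a sketch of the serialization step, this small (hypothetical) helper flattens rows into "column: value" text and batches them, which is the usual first move before building any of the prompts above.

```python
def serialize_row(row: dict) -> str:
    # "col: value" pairs, NULL for missing cells, joined on one line
    return ", ".join(f"{k}: {v if v not in (None, '') else 'NULL'}" for k, v in row.items())

def serialize_batch(rows):
    return "\n".join(f"[{i}] {serialize_row(r)}" for i, r in enumerate(rows, 1))

rows = [{"name": "Ada", "birthDate": "Jan 1st 2025"},
        {"name": "Bob", "birthDate": None}]
print(serialize_batch(rows))
# [1] name: Ada, birthDate: Jan 1st 2025
# [2] name: Bob, birthDate: NULL
```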
🍞 Hook: When facts are missing, you look them up.
🥬 The Concept — RAG for Context:
- What it is: Fetch relevant tuples, docs, or subgraphs as evidence the LLM can read.
- How it works: 1) Retrieve; 2) Rerank; 3) Insert into the prompt; 4) Cite or verify.
- Why it matters: Without evidence, even smart models can be confidently wrong. 🍞 Anchor: Imputing a missing city by retrieving top-5 similar records from a data lake.
🍞 Hook: Sometimes you need a tool that doesn’t exist yet—so you build it.
🥬 The Concept — Program Synthesis:
- What it is: LLM writes small, reusable functions or rules.
- How it works: 1) Generate candidate code; 2) Run; 3) Compare outputs; 4) Keep the best.
- Why it matters: Without reusable code, you re-pay inference cost for every row. 🍞 Anchor: Autogenerating a clean_phone() function used on millions of rows.
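The generate-run-compare-keep loop can be sketched as a small harness: candidate functions (written by hand below, in practice emitted by the LLM) are executed against labeled examples and the best scorer is kept as the reusable tool.

```python
import re

# In practice these candidate bodies come back from the LLM; two hand-written
# candidates stand in so the harness stays self-contained.
CANDIDATES = [
    "def clean_phone(s):\n    return re.sub(r'\\D', '', s)",
    "def clean_phone(s):\n    return s.strip()",
]

EXAMPLES = [("(205) 555-0100", "2055550100"), ("205-555-0100", "2055550100")]

def score_candidate(src: str) -> float:
    scope = {"re": re}
    try:
        exec(src, scope)                                 # run: materialize the candidate function
        fn = scope["clean_phone"]
        return sum(fn(x) == y for x, y in EXAMPLES) / len(EXAMPLES)  # compare outputs to labels
    except Exception:
        return 0.0

best = max(CANDIDATES, key=score_candidate)              # keep the best candidate
print(best)
```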
🍞 Hook: A teacher trains helpers to do the easy parts.
🥬 The Concept — Distillation & Fine-tuning:
- What it is: Teach smaller models or adapt an LLM with task-focused data.
- How it works: 1) Create labeled/pseudo-labeled sets; 2) Train SLMs or LoRA adapters; 3) Deploy fast at scale.
- Why it matters: Without smaller helpers, costs and latency stay high. 🍞 Anchor: Using an LLM to label anomalies, then training a small detector for production.
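A minimal distillation sketch, assuming the LLM has already produced pseudo-labels for a handful of values: a small scikit-learn classifier is trained on them so production traffic never touches the LLM. The toy values and labels are illustrative.

```python
# pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Pseudo-labeled examples: in practice an LLM judges which values look anomalous.
values = ["birmingham", "london", "bxrmxngham", "manchester", "l0nd0n!!", "leeds"]
llm_labels = [0, 0, 1, 0, 1, 0]   # 1 = anomalous, as judged by the (hypothetical) LLM

# Character n-grams give the small model some robustness to typos.
detector = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
detector.fit(values, llm_labels)

print(detector.predict(["br1st0l##", "bristol"]))   # fast, LLM-free inference
```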
🍞 Hook: A project manager coordinates steps, tools, and checks.
🥬 The Concept — Agent Orchestration:
- What it is: LLM decides next actions, calls tools, and checks results.
- How it works: 1) Plan; 2) Act; 3) Observe; 4) Reflect; 5) Repeat.
- Why it matters: Without orchestration, complex pipelines are fragile. 🍞 Anchor: For schema matching, agents retrieve candidates, compare with an LLM, and enforce global constraints.
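A stripped-down plan-act-observe-reflect loop, with a hypothetical tool registry and a stubbed planner; real agent frameworks add tool schemas, memory, and guardrails on top of this basic shape.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical planner/critic stand-in

TOOLS: Dict[str, Callable[[str], str]] = {
    "clean_date": lambda col: f"standardized column {col}",
    "check_nulls": lambda col: f"0 nulls remaining in {col}",
}

def run_agent(goal: str, max_steps: int = 5) -> list:
    history = []
    for _ in range(max_steps):
        # Plan: ask the LLM for the next action given the goal and what happened so far.
        decision = call_llm(f"Goal: {goal}\nHistory: {history}\n"
                            f"Tools: {list(TOOLS)}\nReply 'tool_name arg' or 'DONE'.")
        if decision.strip() == "DONE":
            break
        tool, _, arg = decision.partition(" ")
        observation = TOOLS.get(tool, lambda a: f"unknown tool {tool}")(arg)  # Act
        history.append((decision, observation))  # Observe; the next plan step reflects on it
    return history
```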
Takeaway: With these pieces, the same framework can standardize, repair, match, and enrich data—faster, smarter, and with evidence.
03 Methodology
High-level recipe: Raw data → (Pick task: clean/integrate/enrich) → Build context (serialize, prompt, retrieve) → Choose engine (direct LLM, program synthesis, fine-tuned/SLM, or agent+tools) → Execute and verify → Application-ready data.
Step A: Define the Task and Unit
- What happens: Decide if you’re cleaning (cells/rows), integrating (pairs/schemas), or enriching (columns/tables).
- Why this step exists: Different tasks need different context sizes and checks.
- Example: Cleaning dates (cell-level) vs. matching product entities (pair-level) vs. profiling a whole table (object-level).
Step B: Construct the Input Context
- What happens: Serialize tabular data into text, add instructions, and optionally retrieve evidence via RAG.
- Why this step exists: LLMs need clear, compact, relevant context.
- Example: “Admission Date: 7th April 2021 → Please output YYYYMMDD.” If city is missing, retrieve top-5 similar rows from a data lake to guide imputation.
Step C: Pick the Processing Mode
- What happens: Choose one: (1) Prompt-only LLM; (2) LLM writes code; (3) Fine-tuned/SLM hybrid; (4) Agent that plans and calls tools.
- Why this step exists: Balances accuracy, cost, latency, and scale.
- Example: For millions of rows, prefer code synthesis or distillation to cut inference cost.
Step D: Execute, Verify, and Iterate
- What happens: Run the step, check quality (precision/recall, semantic consistency), and loop if needed.
- Why this step exists: Prevents error snowballs and keeps results trustworthy.
- Example: After date cleaning, re-check invalid entries and retry with stricter rules.
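Step D can be sketched as a simple verify-then-retry loop: a format check flags cells that still fail, and only those are resubmitted on the next pass. `strict_fix` is a hypothetical repair callback (for example, a stricter re-prompt or a dedicated tool).

```python
import re

YYYYMMDD = re.compile(r"^\d{8}$|^NULL$")

def verify_and_retry(cells, strict_fix, max_rounds=2):
    """Keep cells that pass the format check; resubmit the rest up to max_rounds times."""
    for _ in range(max_rounds):
        bad = [i for i, c in enumerate(cells) if not YYYYMMDD.match(c)]
        if not bad:
            break
        for i in bad:
            cells[i] = strict_fix(cells[i])   # e.g. re-prompt with stricter rules
    return cells

# verify_and_retry(["20210407", "7th April 2021"], strict_fix=lambda c: "20210407")
```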
Now, the three core task-families and their sub-recipes:
🍞 Hook: Like making all homework look neat in the same notebook.
🥬 The Concept — Data Standardization:
- What it is: Convert different formats to one consistent style.
- How it works: 1) Prompt-based end-to-end (instructions + examples + batching); 2) Program synthesis (LLM writes reusable functions); 3) Agent-based (plan columns → call tools → verify).
- Why it matters: Without it, analysis breaks and joins fail. 🍞 Anchor: Map “April 7, 2021,” “2021/04/07,” and “07-04-21” to “20210407.”
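This is roughly the kind of reusable function the program-synthesis route would have an LLM emit for the date anchor above; the list of accepted formats is an illustrative assumption, not an exhaustive one.

```python
from datetime import datetime

FORMATS = ["%B %d, %Y", "%d %B %Y", "%Y/%m/%d", "%d-%m-%y", "%m/%d/%Y"]

def normalize_date(value: str) -> str:
    """Map heterogeneous date strings to YYYYMMDD; return 'NULL' if nothing parses."""
    cleaned = value.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    return "NULL"

print([normalize_date(v) for v in ["April 7, 2021", "2021/04/07", "07-04-21"]])
# ['20210407', '20210407', '20210407']
```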
🍞 Hook: Spotting typos and fixing them, like turning “bxrmxngham” into “birmingham.”
🥬 The Concept — Data Error Processing:
- What it is: Detect wrong values and repair them.
- How it works: 1) Prompt loops (detect → verify → repair); 2) Code/rule synthesis (LLM mines dependencies like ZipCode → City); 3) Fine-tuning with synthetic noise or context-rich prompts; 4) Hybrid LLM-ML (LLM labels; SLM detects fast or executes LLM-induced decision trees).
- Why it matters: Errors poison models and dashboards. 🍞 Anchor: Find invalid ages (e.g., 999) and fix them using cluster neighbors.
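A sketch of the "LLM mines dependencies" idea: once a dependency like ZipCode → City has been proposed, checking it is ordinary dataframe work. The toy data below is illustrative.

```python
# pip install pandas
import pandas as pd

df = pd.DataFrame({
    "ZipCode": ["35203", "35203", "94105", "94105"],
    "City":    ["Birmingham", "Birmingham", "San Francisco", "San Jose"],  # last row violates the dependency
})

def fd_violations(df: pd.DataFrame, lhs: str, rhs: str) -> pd.DataFrame:
    """Rows whose lhs value maps to more than one rhs value (ZipCode -> City should be unique)."""
    counts = df.groupby(lhs)[rhs].nunique()
    bad_keys = counts[counts > 1].index
    return df[df[lhs].isin(bad_keys)]

print(fd_violations(df, "ZipCode", "City"))
```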
🍞 Hook: When a puzzle piece is missing, you look at the picture and similar pieces.
🥬 The Concept — Data Imputation:
- What it is: Fill missing values using context and evidence.
- How it works: 1) Prompt-based (heuristic formats or select only relevant columns/rows); 2) Retrieval-guided (fetch similar tuples from lakes, then impute); 3) Model-optimized (fine-tune or add modules for mixed-type dependencies).
- Why it matters: Missing values break training and bias results. 🍞 Anchor: Inferring missing “City” from phone area code and ZIP via retrieved lookups.
🍞 Hook: Two names, one person—match them!
🥬 The Concept — Entity Matching:
- What it is: Decide if two records are the same real-world entity.
- How it works: 1) Prompt with rules and examples; 2) Batch/list-wise prompting for efficiency; 3) Fine-tune with reasoning traces or improved training data; 4) Multi-model collaboration (small rankers + strong LLM for final decisions).
- Why it matters: Prevents duplicates and wrong joins. 🍞 Anchor: Matching “Apple iPhone13” with “iPhone 13 by Apple.”
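A minimal batch/list-wise prompting sketch for entity matching: several candidate pairs go into one prompt and come back as a JSON list of verdicts. `call_llm` is again a hypothetical client stub.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM client

def match_pairs(pairs):
    body = "\n".join(f"{i}. A: {a!r}  B: {b!r}" for i, (a, b) in enumerate(pairs, 1))
    prompt = (
        "For each numbered pair, decide if A and B describe the same real-world product.\n"
        "Return a JSON list of booleans, one per pair, in order.\n" + body
    )
    return json.loads(call_llm(prompt))

pairs = [("Apple iPhone13", "iPhone 13 by Apple"),
         ("Apple iPhone13", "Samsung Galaxy S21")]
# match_pairs(pairs)  -> e.g. [True, False]
```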
🍞 Hook: “Cost” and “Price” can mean the same thing even if spelled differently.
🥬 The Concept — Schema Matching:
- What it is: Align columns or attributes with the same meaning across tables.
- How it works: 1) Prompt-only with consistency tricks; 2) Retrieval-enriched (vector and graph search for context); 3) Model-optimized (table encoders, instruction tuning); 4) Multi-model (retrieve → rerank); 5) Agent-guided (plan tools, enforce rules, generate code when needed).
- Why it matters: Without it, data can’t merge cleanly. 🍞 Anchor: Aligning “price” in one table with “cost” in another.
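A sketch of the retrieve-then-rerank pattern: an embedding function shortlists candidate column pairs by cosine similarity, and the top candidates can then go to an LLM for the final decision. `embed` is a stand-in for whatever sentence-embedding model you have available, not a specific API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real sentence-embedding model (assumption, not a specific API)."""
    raise NotImplementedError

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def shortlist(source_cols, target_cols, k=2):
    src = {c: embed(f"column {c}") for c in source_cols}
    tgt = {c: embed(f"column {c}") for c in target_cols}
    out = {}
    for s, sv in src.items():
        ranked = sorted(tgt, key=lambda t: cosine(sv, tgt[t]), reverse=True)
        out[s] = ranked[:k]   # top-k candidates to hand to an LLM for the final call
    return out

# shortlist(["price", "prod_name"], ["cost", "title", "sku"])
# -> {"price": ["cost", ...], "prod_name": ["title", ...]}   (with a semantic embedder)
```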
🍞 Hook: Put helpful labels and summaries on your folders so everyone can find things.
🥬 The Concept — Data Annotation:
- What it is: Assign semantic labels (e.g., column type) or entity links.
- How it works: 1) Instruction prompts with examples; 2) Reasoning and voting; 3) RAG (retrieve examples or KG facts); 4) Fine-tune for domains; 5) Hybrid LLM-ML (distill to SLMs); 6) Agent + tools (search, KG, ranking).
- Why it matters: Labels power search, BI, and ML features. 🍞 Anchor: Mark a column as “birthDate” and link “Apple” to Apple_Inc.
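A sketch of instruction prompting plus self-consistency voting for column-type annotation: the same (stubbed) LLM call is sampled several times and the majority label is kept. The label set and column values are illustrative.

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM client (sampled with temperature > 0)

LABELS = ["birthDate", "customerId", "price", "city"]

def annotate_column(values, n_votes=3):
    prompt = (
        f"Column sample values: {values}\n"
        f"Choose the best semantic type from {LABELS}. Answer with the label only."
    )
    votes = [call_llm(prompt).strip() for _ in range(n_votes)]  # reasoning + voting
    return Counter(votes).most_common(1)[0][0]

# annotate_column(["1990-04-07", "1985-12-01", "2001-06-30"])  -> "birthDate" (expected)
```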
🍞 Hook: A table’s “about me” card—short, true, and useful.
🥬 The Concept — Data Profiling:
- What it is: Summaries, stats, semantic groups, and related datasets.
- How it works: 1) Instruction prompts with constraints; 2) Few-shot + reasoning; 3) RAG (retrieve similar tables/docs) to ground profiles; 4) Build hierarchies for browsing.
- Why it matters: Without profiles, discovery and quality checks are guesswork. 🍞 Anchor: A profile that says “Weather observations by city and day, includes wind speed and direction,” plus links to similar tables.
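A profiling sketch along these lines: cheap statistics come from pandas, and only the short natural-language summary is delegated to the (stubbed) LLM, grounded in those statistics so it cannot invent columns.

```python
# pip install pandas
import pandas as pd

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical LLM client

def profile_table(df: pd.DataFrame, table_name: str) -> str:
    stats = {
        "rows": len(df),
        "columns": {c: str(t) for c, t in df.dtypes.items()},
        "null_fraction": df.isna().mean().round(3).to_dict(),
    }
    prompt = (
        f"Write a two-sentence profile of the table '{table_name}' using only these facts:\n"
        f"{stats}\nDo not invent columns or values."
    )
    return call_llm(prompt)

df = pd.DataFrame({"city": ["Leeds", None], "wind_speed": [4.2, 7.1]})
# profile_table(df, "weather_observations")
```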
Secret sauce (what makes this clever):
- Structured prompts shrink ambiguity; batching boosts efficiency.
- RAG reduces hallucination by grounding in evidence.
- Program synthesis and distillation cut runtime cost.
- Agents orchestrate multi-step workflows with tools and checks.
- Specialized encoders and modules bridge tables and text.
04 Experiments & Results
The Test (what was measured and why): This is a survey, so the authors examined many recent methods and grouped them by tasks (cleaning, integration, enrichment), inputs (tables, text), and techniques (prompting, RAG, program synthesis, fine-tuning, agents, hybrid). They summarized which datasets are commonly used (e.g., Adult Income, Hospital, abt-buy, OMOP, NQ-Tables), and which metrics matter: correctness (precision, recall, F1), robustness (ROC/AUC), ranking (P@k, MRR, Recall@GT, Hit Rate), and semantic preservation (ROUGE, cosine similarity). The goal was to see when LLM approaches help most and how people verify quality.
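The correctness metrics named here are the standard ones; as a quick sketch, this is how precision, recall, and F1 would be computed with scikit-learn for, say, a batch of entity-matching decisions (the labels below are made up for illustration).

```python
# pip install scikit-learn
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]   # gold labels: 1 = "same entity"
y_pred = [1, 0, 1, 0, 0, 1]   # model decisions

print("precision", precision_score(y_true, y_pred))  # 3/3 = 1.0
print("recall   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
print("F1       ", round(f1_score(y_true, y_pred), 3))
```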
The Competition (what methods were compared):
- Traditional baselines: rule/constraint engines, regex scripts, and task-specific small models.
- LLM techniques: direct prompting, RAG-augmented prompts, code/program synthesis, fine-tuned/tabular encoders, multi-model pipelines, and agent-based orchestration.
The Scoreboard (results with context):
- Data Cleaning: Prompted and agentic approaches can outperform rigid rules on heterogeneous, messy data because they understand semantics and adapt steps. Hybrids (LLM labels + small detectors) often scale better, like getting an A on both small quizzes and a big final without extra cramming.
- Error Processing: Iterative detect–verify–repair reduces cascading mistakes, while LLM-induced rules (like dependencies) increase explainability. Distilling to small models trades a little peak accuracy for big wins in speed and cost—like turning a great coach’s advice into a fast playbook.
- Imputation: Retrieval from data lakes grounds fills in evidence, improving plausibility when local context is thin. Model-optimized methods capture complex, mixed-type relations but need more engineering—like building a custom bike that rides smoother but takes time to assemble.
- Entity Matching: Guidance-heavy prompts and batch/list-wise prompting are strong zero/few-shot tools. Combining small rankers with a strong LLM for final calls boosts both recall and precision—like a scout team filtering candidates before the head coach makes the pick.
- Schema Matching: Retrieval-enriched LLMs resist hallucinations by citing subgraphs or top-k candidates. Specialized table encoders stabilize alignment—like using a map that doesn’t get confused when the road curves.
- Annotation & Profiling: Instruction + reasoning + retrieval improves label quality and summaries. Fine-tuned local models often match or beat closed models in niche domains, which is cost-friendly and privacy-friendly.
Surprising Findings:
- Program synthesis and LLM-induced decision trees provide both speed and interpretability—unexpectedly strong given how simple the generated code/trees can be.
- Batch/list-wise prompting helps more than many expect by allowing global reasoning over multiple examples at once.
- Retrieval quality is a kingmaker: better reranking often matters as much as a bigger LLM.
- Full agent systems are promising but still rare in production; reliability and overhead remain hurdles.
05 Discussion & Limitations
Limitations:
- Cost and latency: Direct LLM inference on large tables is expensive; iterative agent loops add overhead.
- Hallucinations and drift: Without evidence, models can produce confident but unsupported fixes or labels.
- Cold-start data needs: Fine-tuning or distillation requires curated examples; poor pseudo-labels cap downstream performance.
- Global consistency: Many methods reason locally; enforcing dataset-wide rules and constraints remains hard.
- Evaluation gaps: Some tasks lack robust, multi-aspect metrics and faithful, evidence-based checks.
Required Resources:
- Compute and APIs for LLMs; orchestration layers for agents; vector/graph indexes for retrieval.
- Access to data lakes, ontologies, or KGs; logging and monitoring to measure quality and drift.
- Tooling for program execution, validation, and error analysis.
When NOT to Use:
- Tiny, stable datasets where a few rules solve everything reliably.
- Real-time, ultra-low-latency paths that can’t afford LLM/agent overhead.
- Strictly regulated flows demanding formal guarantees beyond current LLM reliability.
Open Questions:
- How to combine local semantic fixes with global constraints and provable checks?
- How to add uncertainty/risk control to agentic cleaning so coverage rises without error cascades?
- Can we build universal integrators that rely less on metadata and more on instance patterns to generalize widely?
- How to make enrichment faithfully evidence-grounded with citations and verifiable traces?
- What benchmarks best capture real usefulness, cost, and reliability across modalities and aspects?
06 Conclusion & Future Work
Three-sentence summary: This survey maps how LLMs transform data preparation—from rigid, manual rules to prompt-driven, retrieval-grounded, and agentic workflows—across cleaning, integration, and enrichment. It catalogues techniques (prompting, RAG, program synthesis, fine-tuning, hybrids, agents), datasets, and metrics, and highlights pragmatic patterns like batching, distillation, and specialized table encoders. It also surfaces limits (costs, hallucinations, global consistency, evaluation gaps) and a roadmap toward scalable, reliable, and evidence-grounded LLM–data systems.
Main Achievement: A unified, task-centric taxonomy that shows how to combine prompts, retrieval, tools, and hybrid models to produce application-ready data with better generalization and semantic understanding.
Future Directions: Add uncertainty-aware, constraint-respecting agent workflows; build universal, cross-domain integrators; strengthen multi-aspect, evidence-based evaluations; and scale with hybrids (program synthesis, SLMs, table encoders) under robust governance.
Why Remember This: Because clean, connected, and richly described data is the foundation of trustworthy analytics and AI—and LLMs, when carefully guided and grounded, are becoming the multi-tool that gets messy data truly ready for action.
Practical Applications
- Automatically standardize dates, addresses, and phone numbers across millions of records.
- Detect and repair typos and invalid values in customer and transaction tables with iterative verification.
- Impute missing attributes (e.g., city, product category) using retrieval from internal data lakes.
- De-duplicate product catalogs by matching the same items across marketplaces.
- Align schemas from different business units by matching columns with the same meaning.
- Annotate tables with semantic column types and entity links to power BI auto-insights.
- Generate faithful dataset descriptions and semantic profiles for data catalogs.
- Distill LLM judgments into small anomaly detectors for low-latency data quality monitoring.
- Use agents to plan end-to-end cleaning workflows that call existing tools and verify results.
- Ground enrichment and profiling in cited evidence (tables, docs, KGs) for auditability.