
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Intermediate
Hao Liang, Xiaochen Ma, Zhou Liu et al. · 12/18/2025
arXiv · PDF

Key Summary

  • DataFlow is a building-block system that helps large language models get better data by unifying how we create, clean, check, and organize that data.
  • It replaces messy scripts with reusable operators and clear pipelines, making results easier to repeat, share, and improve.
  • Most steps are LLM-driven (generate, evaluate, filter, refine), so the model helps make and fix its own training data.
  • A helper called DataFlow-Agent can turn plain-English requests into working data pipelines and even write missing operator code.
  • Across text, math, code, Text-to-SQL, agentic RAG, and knowledge extraction, DataFlow’s data improved downstream model scores.
  • In Text-to-SQL, DataFlow training data beat strong baselines like SynSQL at the same or even smaller sizes, improving execution accuracy by up to several points.
  • In code tasks, DataFlow’s datasets raised benchmarks by about 7% on average compared to popular public instruction data.
  • In math reasoning, DataFlow’s carefully verified 10k samples matched or exceeded other synthetic sources, with 1–3 point gains on MATH, GSM8K, and AIME.
  • A 10k multi-domain set from DataFlow made base models rival or beat models trained on a 1M-sample generic dataset, showing big gains in data efficiency.
  • DataFlow’s PyTorch-style API, prompt templates, and extension system make it easy to build, debug, and share high-quality data workflows.

Why This Research Matters

Better data preparation means better AI in your daily life. When datasets are built with careful generation, checking, and refinement, assistants make fewer mistakes and handle harder tasks, from coding to healthcare Q&A. A unified framework like DataFlow saves teams time and money by reusing operators and templates instead of reinventing scripts for every project. Because pipelines are explicit, debug-friendly, and shareable, companies can reproduce results and meet safety and compliance needs. Agentic automation lowers the barrier so more teams can produce high-quality, domain-specific data without large labeling budgets. Ultimately, this raises the reliability and usefulness of AI systems people rely on at work, in school, and at home.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: Imagine baking cookies with friends. If everyone follows their own vague recipe scribbles, the cookies taste different every time, and no one can fix mistakes because nobody knows exactly what was done.

🥬 The Concept (LLMs): Large Language Models (LLMs) are powerful text-understanding and text-writing systems. How it works: 1) They learn from huge piles of text, 2) They adjust to special tasks with extra training data, 3) They answer questions or follow instructions. Why it matters: Without the right kind of data to learn from, LLMs can get confused, biased, or weak at new tasks. 🍞 Anchor: When you ask an LLM to write a poem or generate SQL, its skill comes from the data and steps used to prepare that data.

🍞 Hook: You know how you clean, chop, and season ingredients before cooking? That prep work decides how good dinner tastes.

🥬 The Concept (Data Preparation): Data preparation is the careful process of collecting, cleaning, generating, checking, and organizing training data. How it works: 1) Gather raw text/code/logs, 2) Clean and normalize them, 3) Generate new task-specific examples, 4) Evaluate quality, 5) Filter bad parts, 6) Refine good parts. Why it matters: If prep is sloppy, models learn wrong patterns, forget important rules, or fail in real apps. 🍞 Anchor: Turning a messy PDF into great Q&A pairs or pairing a question with the right SQL is data preparation.

🍞 Hook: Think of a treasure map drawn on scraps of paper by different pirates—hard to follow, easy to lose.

🥬 The Concept (The Problem): Most LLM data prep was done with one-off scripts and unclear steps. How it works (or fails): 1) People write custom code per project, 2) Steps aren’t standardized, 3) Hard to reuse or compare, 4) No easy way for the model to help generate and check data. Why it matters: Results are hard to reproduce, fixes don’t travel from one project to another, and teams waste time. 🍞 Anchor: Team A’s filtering script can’t plug into Team B’s generator, so both redo work and get different outcomes.

🍞 Hook: It’s like using a blender for soup, a separate whisk for cake, and no kitchen counter to put it all together.

🥬 The Concept (Failed Attempts): Big-data tools (like Spark/Hadoop) can move data fast, but they don’t natively speak “LLM semantics” (prompts, tokens, model-in-the-loop). How it works: 1) They run UDFs but don’t help with prompts, batching with GPUs, or semantic checks, 2) Configuration-heavy toolkits help clean/filter but struggle with multi-step LLM generation/refinement. Why it matters: Great for ETL, not great for fine-grained LLM data crafting. 🍞 Anchor: You can count pages quickly with Spark, but you can’t easily ask it to generate and verify chain-of-thought math steps.

🍞 Hook: Imagine a shared LEGO set with clear shapes that snap together any time, anywhere.

🥬 The Concept (The Gap): We needed a unified, LLM-first system with reusable operators, clear pipelines, and agent help to auto-build workflows from natural language. How it works: 1) Standard operators (generate/evaluate/filter/refine), 2) Clear storage and serving layers, 3) Prompt templates for consistency, 4) Pipelines you can compile, debug, resume, 5) An agent that writes and wires operators if missing. Why it matters: Reproducibility, speed, data quality, and easy sharing. 🍞 Anchor: Like torch.nn.Modules for models, but for data preparation.
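
To make the torch.nn.Module analogy concrete, here is a purely illustrative sketch of what an operator-style abstraction could look like. The class names, the keep column, and the dict-of-lists stand-in for storage are assumptions for this example, not DataFlow's actual API.

```python
# Illustrative sketch only: a hypothetical operator interface in the spirit of
# torch.nn.Module, not DataFlow's real class names or signatures.
from abc import ABC, abstractmethod


class Operator(ABC):
    """A reusable data-prep step that reads and writes named columns."""

    input_keys: list[str] = []
    output_keys: list[str] = []

    @abstractmethod
    def run(self, storage: dict[str, list]) -> None:
        """Read self.input_keys from storage, write self.output_keys back."""


class LengthFilter(Operator):
    """Keep only rows whose text is at least `min_chars` long."""

    input_keys = ["text"]
    output_keys = ["keep"]

    def __init__(self, min_chars: int = 20):
        self.min_chars = min_chars

    def run(self, storage: dict[str, list]) -> None:
        storage["keep"] = [len(t) >= self.min_chars for t in storage["text"]]


if __name__ == "__main__":
    store = {"text": ["SELECT 1;", "SELECT name FROM patients WHERE id = 3;"]}
    LengthFilter(min_chars=15).run(store)
    print(store["keep"])  # [False, True]
```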

🍞 Hook: If your math tutor writes their own practice questions, explains steps, and then checks their answers, you’ll learn faster.

🥬 The Concept (Model-in-the-loop Generation): LLMs don’t just consume data—they help create and refine it. How it works: 1) Ask LLM to draft, 2) Ask LLM or tools to evaluate, 3) Filter poor drafts, 4) Ask LLM to refine, repeat. Why it matters: Large, high-quality, task-aligned data becomes affordable and scalable. 🍞 Anchor: For Text-to-SQL, the model proposes SQL, we execute it to check, filter slow/broken ones, and then generate the matching natural-language question and reasoning.
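
As a toy illustration of this draft-check-fix loop, the sketch below generates, scores, filters, and refines candidates. The call_llm stub, the score rule, and the threshold are placeholder assumptions; a real pipeline would use actual LLM calls, execution checks, or LLM judges.

```python
# Illustrative sketch of the model-in-the-loop rhythm. `call_llm` is a stand-in
# for any chat-completion client; the scoring rule is deliberately simplistic.
def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a local engine or a remote API.
    return f"draft answer for: {prompt}"


def score(candidate: str) -> float:
    # Placeholder evaluator; real pipelines use LLM judges or execution checks.
    return min(len(candidate) / 100.0, 1.0)


def synthesize(seed_prompts, threshold=0.3, rounds=2):
    kept = []
    for prompt in seed_prompts:
        draft = call_llm(prompt)                              # 1) generate
        for _ in range(rounds):
            if score(draft) >= threshold:                     # 2) evaluate, 3) filter
                kept.append((prompt, draft))
                break
            draft = call_llm(f"Improve this answer:\n{draft}")  # 4) refine
    return kept


print(synthesize(["Write SQL counting 2024 visits per patient."]))
```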

🍞 Hook: Why should you care? Because better prep makes everyday AI safer and smarter.

🥬 The Concept (Real Stakes): Good data prep means better assistants for coding, databases, study help, and healthcare info. How it works: 1) Clear pipelines prevent gotchas, 2) Quality checks reduce hallucinations, 3) Reuse saves engineering time, 4) Agent orchestration lowers barriers. Why it matters: Users get more accurate, reliable AI at lower cost. 🍞 Anchor: A customer support bot trained with carefully filtered, balanced examples makes fewer mistakes and answers faster.

02 Core Idea

🍞 Hook: Think of a smart workshop where every tool clicks into place, and a helpful robot turns your wish list into the exact machine you need.

🥬 The Concept (DataFlow’s Aha!): DataFlow unifies LLM-driven data preparation with reusable operators and agentic automation, turning natural-language goals into reproducible, high-quality data pipelines. How it works: 1) Provide standard operators (generate/evaluate/filter/refine), 2) Connect them with a shared storage and serving API, 3) Build pipelines with a PyTorch-like interface, 4) Compile to validate dependencies, 5) Use DataFlow-Agent to plan and even synthesize missing pieces. Why it matters: You get reliable, shareable, and optimized dataflows that raise model quality while saving time. 🍞 Anchor: “Make a Text-to-SQL dataset for hospital tables” becomes a pipeline that generates SQL, checks execution, writes questions, adds reasoning, and labels difficulty—end to end.

Multiple Analogies:

  • Factory line: Operators are stations (draft, inspect, discard, polish). The pipeline is the conveyor. The agent is the floor manager who lays out the line from your description.
  • LEGO set: Operators are bricks with clear studs (I/O keys). The pipeline is the model you build. The agent reads your note and assembles the set, even 3D-printing a missing brick if needed.
  • Recipe: Operators are steps (mix, taste, strain, season). The pipeline is the full recipe card. The agent is the sous-chef turning your cravings into dishes.

🍞 Hook: Think of how standardized plugs and sockets let different parts of a machine interlock smoothly.

🥬 The Concept (Building Blocks): DataFlow provides four main building blocks. How it works: 1) Global Storage (tabular, read/write by keys), 2) LLM Serving API (one call, many backends), 3) Prompt Templates (consistent prompt assembly), 4) Pipelines (ordered/DAG execution with compile and resume). Why it matters: Each block is simple alone but powerful together, keeping pipelines debuggable and reusable. 🍞 Anchor: A prompt template swap turns a SQLite SQL generator into a MySQL one—with zero operator code changes.

🍞 Hook: Before, everyone used different puzzle pieces; after, all pieces fit.

🥬 The Concept (Before vs After): Before: ad-hoc scripts, brittle glue, little reuse, hard debugging. After: standard operators, consistent prompts, compiled DAGs, IDE-friendly code, agent orchestration. How it works: 1) Operators encode semantics, 2) compile() catches missing keys and wiring errors at once, 3) Agent retrieves/reuses/synthesizes ops, 4) Storage and serving decouple logic from hardware. Why it matters: Fewer bugs, faster iteration, better data. 🍞 Anchor: A week of fragile YAML becomes a 50-line Python pipeline you can pause, resume, and share.
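
To give a feel for the "after" picture, here is a hedged sketch of a pipeline that runs operators in order and checkpoints between steps so a run can be paused and resumed. The Pipeline class, the checkpoint file name, and the toy operators are invented for illustration, not DataFlow's real interface.

```python
# Minimal sketch of an ordered pipeline with per-step checkpointing. A crashed
# run resumes from the saved step; a finished run simply returns the saved
# result. All names here are hypothetical.
import json
from pathlib import Path


class Pipeline:
    def __init__(self, operators, checkpoint="pipeline_state.json"):
        self.operators = operators
        self.checkpoint = Path(checkpoint)

    def run(self, storage: dict) -> dict:
        start = 0
        if self.checkpoint.exists():
            saved = json.loads(self.checkpoint.read_text())
            storage, start = saved["storage"], saved["next_step"]
        for i in range(start, len(self.operators)):
            self.operators[i](storage)                 # run one step
            self.checkpoint.write_text(
                json.dumps({"storage": storage, "next_step": i + 1})
            )
        return storage


def lowercase(storage):
    storage["text"] = [t.lower() for t in storage["text"]]


def drop_empty(storage):
    storage["text"] = [t for t in storage["text"] if t.strip()]


result = Pipeline([lowercase, drop_empty]).run({"text": ["Hello", "  "]})
print(result["text"])  # ['hello']
```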

🍞 Hook: Why does this work so well?

🥬 The Concept (Intuition Behind the Math-Free Logic): Most high-quality datasets follow a generate–evaluate–filter–refine rhythm. How it works: 1) Expand candidates with generation, 2) Score them with evaluators, 3) Keep the best via filters, 4) Improve survivors with refinement, 5) Repeat until data is clean, diverse, and aligned. Why it matters: It scales model-in-the-loop synthesis and keeps semantics tight, so models learn the right things. 🍞 Anchor: In Text-to-SQL, the row count grows when generating SQL variations, then shrinks when invalid or slow queries are filtered.

🍞 Hook: Imagine a helpful teammate who not only picks tools but also crafts a new one if the toolbox lacks it.

🥬 The Concept (DataFlow-Agent): DataFlow-Agent translates natural-language goals into executable pipelines and can synthesize missing operators. How it works: 1) Decompose intent, 2) Retrieve candidate ops, 3) Check I/O compatibility, 4) Reuse via prompt templates or synthesize code, 5) Assemble DAG, 6) Sandbox-verify and auto-fix, 7) Report results. Why it matters: You prototype faster and handle new tasks without manual coding marathons. 🍞 Anchor: “Make multihop questions from Wikipedia and filter for logic errors” becomes a working pipeline that’s debugged and ready.
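
A rough sketch of the planning idea, under the assumption of a simple operator registry keyed by input and output columns. The registry entries, plan_pipeline, and the requested step names are hypothetical; the real agent also synthesizes code and sandbox-tests the result.

```python
# Hedged sketch of the agent's planning loop: walk the requested steps, pick
# compatible operators from a registry, and flag gaps to synthesize.
REGISTRY = {
    "generate_questions": {"needs": ["passage"], "makes": ["question"]},
    "filter_logic_errors": {"needs": ["question"], "makes": ["is_valid"]},
}


def plan_pipeline(requested_steps, available=REGISTRY, initial_keys=("passage",)):
    plan, known_keys, missing = [], set(initial_keys), []
    for step in requested_steps:
        op = available.get(step)
        if op is None:
            missing.append(step)                 # the agent would synthesize this op
            continue
        if not set(op["needs"]) <= known_keys:   # I/O compatibility check
            raise ValueError(f"{step} is missing inputs {op['needs']}")
        plan.append(step)
        known_keys |= set(op["makes"])
    return plan, missing


plan, to_synthesize = plan_pipeline(
    ["generate_questions", "filter_logic_errors", "dedupe_questions"]
)
print(plan)           # ['generate_questions', 'filter_logic_errors']
print(to_synthesize)  # ['dedupe_questions']
```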

🍞 Hook: What’s the glue keeping everything consistent?

🥬 The Concept (Prompt Templates): Prompt templates separate “what we want” from “how we ask the model.” How it works: 1) A template defines fields and style, 2) Operators fill slots with inputs and schema, 3) Serving handles the backend, 4) Outputs are parsed consistently. Why it matters: Small changes in prompts can cause big behavior changes—templates keep changes safe and reusable. 🍞 Anchor: Switching a Text-to-SQL prompt from “simple” to “hard” difficulty only changes the template, not the pipeline code.
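
A small sketch of the template idea using Python's standard string.Template. The dialect variants, slot names, and wording below are illustrative assumptions, not DataFlow's shipped templates.

```python
# Illustrative prompt-template sketch: the template owns the wording, the
# operator only fills slots. Swapping "sqlite" for "mysql" changes the prompt
# without touching operator code.
from string import Template

SQL_TEMPLATES = {
    "sqlite": Template(
        "You write SQLite SQL.\nSchema:\n$schema\nQuestion: $question\n"
        "Respond with a single SQL statement."
    ),
    "mysql": Template(
        "You write MySQL SQL.\nSchema:\n$schema\nQuestion: $question\n"
        "Respond with a single SQL statement."
    ),
}


def build_prompt(dialect: str, schema: str, question: str) -> str:
    return SQL_TEMPLATES[dialect].substitute(schema=schema, question=question)


print(build_prompt(
    "mysql",
    "Patients(id, name), Visits(patient_id, visit_date)",
    "Which patients had more than three visits in 2024?",
))
```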

03 Methodology

At a high level: Input data or a natural-language goal → plan or choose a pipeline → run operators (generate → evaluate → filter → refine) with shared storage and LLM serving → output a high-quality dataset ready for training.

🍞 Hook: Think of a kitchen where every cook follows the same card, uses the same pantry, and the head chef can translate your craving into tonight’s menu.

🥬 The Concept (Global Storage): A single table-like storage holds all fields (e.g., question, SQL, score). How it works: 1) Operators call read() to fetch needed columns, 2) Transform or query an LLM, 3) write() new columns (e.g., response, eval_score), 4) Next operator picks up where the last left off. Why it matters: Everyone shares the same “pantry,” so steps line up, and you can pause/resume or reorder safely. 🍞 Anchor: After SQL generation writes the sql column, the execution filter reads sql and writes valid or runtime_ms.
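
A minimal sketch of read/write-by-key storage, assuming a simple column dictionary. TabularStore and its method shapes are illustrative, since the text describes the semantics rather than the exact API.

```python
# Minimal sketch of a shared, table-like store that operators read from and
# write to by column name.
class TabularStore:
    def __init__(self, **columns):
        self.columns = dict(columns)

    def read(self, *keys):
        return [self.columns[k] for k in keys]

    def write(self, key, values):
        self.columns[key] = values


store = TabularStore(sql=["SELECT 1;", "SELEC oops"])

# A downstream "execution filter" reads `sql` and writes `valid`.
(sql_column,) = store.read("sql")
store.write("valid", [s.upper().startswith("SELECT") for s in sql_column])
print(store.columns["valid"])  # [True, False]
```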

Step-by-step “recipe” for a typical Text-to-SQL synthesis run (a compressed code sketch follows the numbered steps):

  1. Ingest and normalize
  • What happens: Load schema (e.g., hospital databases), convert inputs to a tabular store.
  • Why it exists: Without consistent columns, ops can’t agree on where to read/write.
  • Example: columns: db_schema, table_samples, sql, question, cot, difficulty.
  2. SQL Generator (Generate)
  • What happens: LLM proposes SQL at different complexity levels, guided by prompt templates with schema and example values.
  • Why it exists: We need diverse, realistic SQL targets.
  • Example: From Patients and Visits tables, generate a JOIN that counts visits per patient this year.
  3. SQL Execution Filter (Evaluate + Filter)
  • What happens: Try running each SQL on the real DB; discard ones that fail or exceed time limits.
  • Why it exists: Non-executable or too-slow queries are poor training signals.
  • Example: If a query times out at 3 seconds, filter it out.
  4. Question Generator (Generate)
  • What happens: LLM writes a matching natural-language question in a chosen style (formal, conversational, etc.).
  • Why it exists: Text-to-SQL models need NL ↔ SQL pairs.
  • Example: “Which patients had more than three visits in 2024?” paired to the kept SQL.
  5. Chain-of-Thought (Generate + Validate)
  • What happens: LLM explains step-by-step reasoning leading to the SQL; we extract the final SQL and verify it produces the same result as the reference.
  • Why it exists: Good reasoning traces boost harder problem solving and discourage shortcuts.
  • Example: “First join Patients and Visits on patient_id… filter by visit_date… group by patient… count > 3.”
  6. Difficulty Labeling (Evaluate)
  • What happens: Classify by components (simple→extra hard) and by execution success rate (model-dependent difficulty).
  • Why it exists: Balanced curricula improve learning and benchmarking.
  • Example: A query with nested subqueries and GROUP BY → “hard.”
  7. Prompt Generator (Refine)
  • What happens: Assemble final inference-time prompts (question + schema + instructions) for downstream training.
  • Why it exists: Stable prompting makes future training and evaluation consistent.
  • Example: Add schema context and a precise instruction to “respond with a single SQL statement.”
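
Compressing the steps above into one runnable sketch, with the LLM calls stubbed out and an in-memory SQLite database standing in for the real execution environment. The candidate SQL, the fixed question, and the difficulty rule are illustrative assumptions, not the paper's operators.

```python
# Sketch of the Text-to-SQL recipe: generate candidate SQL (stubbed), keep only
# statements that execute, then attach a question and a rough difficulty label.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    "CREATE TABLE Patients(id INTEGER, name TEXT);"
    "CREATE TABLE Visits(patient_id INTEGER, visit_date TEXT);"
)

# 1) Generation step (stubbed): in DataFlow this would be an LLM call.
candidates = [
    "SELECT patient_id, COUNT(*) FROM Visits "
    "WHERE visit_date LIKE '2024%' GROUP BY patient_id;",
    "SELECT * FROM NoSuchTable;",
]

# 2) Execution filter: discard SQL that fails to run.
executable = []
for sql in candidates:
    try:
        con.execute(sql)
        executable.append(sql)
    except sqlite3.Error:
        pass

# 3) Question generation (stubbed) and 4) difficulty labeling by components.
records = []
for sql in executable:
    difficulty = "hard" if "GROUP BY" in sql or "JOIN" in sql else "simple"
    records.append({
        "sql": sql,
        "question": "How many visits did each patient have in 2024?",
        "difficulty": difficulty,
    })

print(records)
```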

Secret Sauce 1: Generate–Evaluate–Filter–Refine loop

  • What happens: Grow candidates, score them, keep the best, polish them, repeat.
  • Why it exists: Quality rises while junk falls away.
  • Example: Augment a seed SQL set, re-filter by execution, regenerate questions, verify CoT.

Secret Sauce 2: Compilation and key-graph validation

  • What happens: compile() inspects the pipeline’s operator order and key connections before running.
  • Why it exists: Catches missing columns or type mismatches early.
  • Example: If a filter expects sql but only sql_query exists, compile() flags it once with a fix hint.
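
One possible shape for this key-graph check, assuming each operator declares the columns it needs and produces. The tuples and the function name are illustrative, not DataFlow's internals.

```python
# Sketch of a compile-time key check: walk the operator order once and verify
# that every operator's required columns were produced by an earlier step.
def compile_pipeline(operators, initial_keys):
    available = set(initial_keys)
    problems = []
    for name, needs, makes in operators:
        missing = set(needs) - available
        if missing:
            problems.append(f"{name}: missing input keys {sorted(missing)}")
        available |= set(makes)
    return problems


ops = [
    ("sql_generator", ["db_schema"], ["sql"]),
    ("execution_filter", ["sql_query"], ["valid"]),   # typo: should read "sql"
]
for issue in compile_pipeline(ops, initial_keys=["db_schema"]):
    print(issue)  # execution_filter: missing input keys ['sql_query']
```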

Secret Sauce 3: LLM Serving abstraction

  • What happens: One generate_from_input(...) call works for local engines (vLLM, SGLang) and remote APIs (GPT-4o, Gemini).
  • Why it exists: Swap backends without rewriting operators; batch efficiently; handle retries.
  • Example: Move from an API to on-prem vLLM for cost control.
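
A sketch of the swap-the-backend idea. generate_from_input is the call named in the paper, but its exact signature and the backend classes below are assumptions with toy bodies rather than real vLLM or API client code.

```python
# Sketch of a serving abstraction: operators call one method, and the backend
# behind it can be swapped without touching operator code.
from abc import ABC, abstractmethod


class LLMServing(ABC):
    @abstractmethod
    def generate_from_input(self, prompts: list[str]) -> list[str]:
        ...


class LocalEchoBackend(LLMServing):
    """Stand-in for a local engine (e.g., vLLM or SGLang would slot in here)."""

    def generate_from_input(self, prompts):
        return [f"[local] {p}" for p in prompts]


class RemoteStubBackend(LLMServing):
    """Stand-in for a hosted API; a real client call would replace this body."""

    def generate_from_input(self, prompts):
        return [f"[remote] {p}" for p in prompts]


def run_operator(serving: LLMServing, questions):
    # The operator never cares which backend answers.
    return serving.generate_from_input(questions)


print(run_operator(LocalEchoBackend(), ["Write one SQL query."]))
print(run_operator(RemoteStubBackend(), ["Write one SQL query."]))
```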

Secret Sauce 4: Prompt templates

  • What happens: Standardize how operators talk to LLMs; switch templates to change style, difficulty, or DB dialect.
  • Why it exists: Tiny prompt changes can be huge—templates keep it safe and reusable.
  • Example: SQLite vs MySQL prompt variants.

Secret Sauce 5: DataFlow-Agent

  • What happens: Multi-agent planner interprets your request, retrieves/reuses/synthesizes ops, assembles a DAG, sandbox-tests it, and auto-fixes errors.
  • Why it exists: Build pipelines from natural language and fill missing gaps automatically.
  • Example: “Create a multihop QA dataset without logic leaks” becomes a verified pipeline.

🍞 Hook: Imagine asking, “Build a math reasoning set from curated seeds,” and watching the system do the rest.

🥬 The Concept (Math and Code Pipelines): Similar recipes apply to math (problem generation → MathQ-Verify → CoT) and code (instruction curation → execution-based checks → refinement). How it works: 1) Use domain-specific evaluators, 2) Keep the shared generate–evaluate–filter–refine rhythm, 3) Package as pipelines. Why it matters: One framework, many domains—no messy glue required. 🍞 Anchor: 10k math examples with verified CoT boosted AIME/GSM8K/MATH with just two fine-tuning epochs.

Output: Final datasets (e.g., DataFlow-Text2SQL-90K, DataFlow-Instruct-10K) with fields like question, answer/SQL, CoT, difficulty, and prompts—ready for training in your favorite finetuning tool.
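
For concreteness, one plausible shape for a single finished record; the field names beyond those listed above, and the values themselves, are assumptions for illustration only.

```python
# A hypothetical record from a Text-to-SQL dataset produced by such a pipeline.
import json

record = {
    "question": "Which patients had more than three visits in 2024?",
    "sql": "SELECT p.name FROM Patients p JOIN Visits v ON p.id = v.patient_id "
           "WHERE v.visit_date LIKE '2024%' GROUP BY p.id HAVING COUNT(*) > 3;",
    "cot": "Join Patients and Visits, keep 2024 visits, group by patient, "
           "keep groups with more than three rows.",
    "difficulty": "hard",
    "prompt": "Given the schema below, respond with a single SQL statement...",
}
print(json.dumps(record, indent=2))
```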

04 Experiments & Results

🍞 Hook: Report cards are clearer when you know the class and the grading scale.

🥬 The Concept (The Tests): The team measured how training on DataFlow-made datasets affects LLM performance. How it works: 1) Prepare datasets for text, math, code, Text-to-SQL, agentic RAG, and knowledge extraction, 2) Fine-tune common base models, 3) Evaluate on standard benchmarks, 4) Compare against strong baselines. Why it matters: Numbers show whether better data prep truly boosts real tasks. 🍞 Anchor: Think of comparing class averages after studying from a clean, organized notebook versus random scraps.

The Competitions (Baselines and Benchmarks):

  • Text: Compared filtered/curated sets vs random slices; measured averages across ARC, MMLU, HellaSwag, WinoGrande, Gaokao-MathQA.
  • Math: Compared 10k DataFlow-Reasoning-10K vs Open-R1 and Synthetic-1 subsets on GSM8K, MATH, AIME, etc.
  • Code: Compared DataFlow-Code-1K/5K/10K to Code Alpaca-1k and SC2-Exec-Filter on BigCodeBench, LiveCodeBench, CruxEval, HumanEval.
  • Text-to-SQL: Trained on DataFlow-Text2SQL-50K/90K vs SynSQL (50k/90k/2.5M) and tested on Spider/BIRD/EHRSQL/etc., using greedy and majority voting.
  • Agentic RAG: Compared DataFlow-AgenticRAG-10k to HotpotQA, Musique, 2Wiki across out-of-domain scores.
  • Knowledge Extraction: Compared SFT (DataFlow-Knowledge) vs zero-shot CoT and RAG baselines on PubMedQA, Covert, PubHealth.

The Scoreboard (with context):

  • Text filtering: With the same 30B tokens, DataFlow’s filters raised the average to about 35.69—like getting the top spot in a tight race—slightly above other curated baselines.
  • Math: With only 10k high-quality examples, models trained on DataFlow-Reasoning-10K achieved the highest average (up to 55.7 with two epochs), edging past strong synthetic baselines (Open-R1, Synthetic-1). That’s like acing the hardest problems with fewer practice sheets—quality over quantity.
  • Code: DataFlow code sets beat Code Alpaca-1k and SC2 under equal or larger sizes. For 14B models, averages reached ~51.0, with notable lifts on execution-heavy LiveCodeBench—like improving from a B to a solid A- when programs must actually run.
  • Text-to-SQL: Training on DataFlow-Text2SQL-90K improved Spider-dev (Greedy) by large margins (e.g., 73.4 → 82.0 with Qwen2.5-Coder-7B), and on challenging EHRSQL from 24.3 → 56.1, rivaling or beating SynSQL even when SynSQL had far more data (2.5M). That’s like building a smaller but smarter study set that performs like a giant one.
  • Agentic RAG: DF-AgenticRAG-10k matched or exceeded human-annotated HotpotQA/Musique/2Wiki out of domain; for example, under matched exclusions DF-OOD was often +1–3 points higher. Synthetic, when verified and refined, generalized better.
  • Knowledge Extraction: SFT on DataFlow-Knowledge beat zero-shot CoT and RAG by double-digit accuracy points on medical QA, showing that cleaned, structured supervision outperforms retrieval-only or prompt-only methods.

Surprising Findings:

  • Small but sharp beats big and blurry: A 10k DataFlow mixture made base models rival or beat models trained on 1M generic instances.
  • Execution-grounded checks matter: For SQL and code, verifying that outputs run and match results sharply improves model learning.
  • Agentic planning works: Automatically assembled pipelines (plus compile-time checks) reduce trial-and-error and still deliver strong data quality.

🍞 Hook: In short, when the workshop is organized and the quality checks are tight, the final product shines.

🥬 The Concept (Why Results Hold Up): The generate–evaluate–filter–refine rhythm consistently upgrades data quality across domains. How it works: 1) Diverse generation increases coverage, 2) Execution and semantic checks remove noise, 3) Difficulty labels shape curricula, 4) Templates keep prompts stable. Why it matters: You get better learning signals per sample, so you need fewer to reach higher scores. 🍞 Anchor: That’s why 10k well-crafted examples could rival or exceed 1M less-targeted ones.

05 Discussion & Limitations

🍞 Hook: Even the best toolbelt has limits—you still pick the right tool for the job.

🥬 The Concept (Limitations): No single framework fits everything perfectly. How it works: 1) Operator coverage is broad but not infinite; niche domains may need new ops, 2) LLM backends add cost/latency variance, 3) Agentic orchestration can produce valid-but-different pipelines than a strict reference, 4) Heavily non-text or real-time streaming cases are not the main target, 5) Safety and bias checks depend on included evaluators and prompts. Why it matters: Expect to extend and tune for your domain and constraints. 🍞 Anchor: If you need a very specific graph-analytics operator, you’ll likely add it as a DataFlow-Extension.

🥬 The Concept (Required Resources): DataFlow needs both brains and muscles. How it works: 1) LLM access (local vLLM/SGLang or APIs), 2) Storage for intermediate artifacts, 3) Compute budget for generation and verification (tokens, executions), 4) Engineering time to pick or add operators. Why it matters: Planning resources avoids surprises and keeps pipelines cost-effective. 🍞 Anchor: A Text-to-SQL run needs DB connections for execution checks; a code run needs a sandbox to test programs.

🥬 The Concept (When Not to Use): Sometimes a simple script wins. How it works: 1) Tiny, one-off tasks without reuse, 2) Ultra low-latency streaming where model-in-the-loop isn’t possible, 3) Pure vision/audio pipelines without text conversion, 4) Governance constraints that forbid synthetic generation. Why it matters: The right tool is the one that fits the job’s shape. 🍞 Anchor: If you only need to strip emojis from 200 lines once, a quick regex may be faster than spinning a pipeline.

🥬 The Concept (Open Questions): There’s room to grow. How it works: 1) How best to autotune operator sequences for new domains? 2) Can we formalize semantic guarantees or proofs for pipeline correctness? 3) What’s the best active-learning loop for continuous data refresh? 4) How to expand to multimodal (tables/graphs/vision) while preserving simplicity? 5) How to make cost-aware agents that trade off token spend vs quality? Why it matters: Answering these keeps the framework future-proof and economical. 🍞 Anchor: Imagine an agent that not only builds the pipeline but also optimizes it for your exact budget and target benchmarks.

06 Conclusion & Future Work

🍞 Hook: Picture a tidy workshop where ideas in plain English become sturdy machines, and every bolt and gear is easy to find.

🥬 The Concept (3-Sentence Summary): DataFlow is a unified, LLM-first framework that turns messy, brittle data prep into modular, reproducible pipelines built from reusable operators. It standardizes generate–evaluate–filter–refine workflows across text, math, code, Text-to-SQL, and more, aided by prompt templates, a global storage layer, and a flexible serving API. DataFlow-Agent translates natural-language goals into executable, verified pipelines and can synthesize missing operators, speeding up high-quality data creation. 🍞 Anchor: Asking “Make a medical QA set from textbooks” becomes a pipeline that cleans text, writes QAs, checks facts, and outputs a training-ready dataset.

Main Achievement: Elevating model-in-the-loop data synthesis to a first-class, programmable abstraction—so high-quality, task-aligned data is easier to build, verify, share, and improve.

Future Directions: Extend to tables, graphs, and multimodal data; deepen agentic planning with cost/quality trade-offs; add formal verification hooks; grow the open-source ecosystem of extensions and templates.

Why Remember This: In the data-centric era, model quality is limited by data quality. DataFlow shows that careful structure—standard operators, compiled pipelines, and agentic orchestration—can make small, sharp datasets outperform giant, blurry ones, delivering better models faster and more reliably.

Practical Applications

  • Build a Text-to-SQL training set from your company’s databases with automatic execution checks.
  • Create math tutoring datasets with verified step-by-step reasoning for harder problem solving.
  • Curate code instruction data where snippets are executed and filtered for correctness.
  • Extract Q&A from PDFs and web articles to train specialized knowledge assistants (e.g., medical or legal).
  • Design safer chat datasets by filtering toxic or low-quality content with reusable evaluators.
  • Rapidly prototype new data workflows from natural-language specs using DataFlow-Agent.
  • Version and reproduce datasets for audits and compliance by sharing explicit pipelines.
  • Adapt prompts and difficulty levels across domains by swapping prompt templates.
  • Unify multi-domain instruction tuning (text, math, code) into a balanced 10k-scale dataset for efficient training.
  • Automate benchmark construction and data augmentation (e.g., SQL augmentation, multihop question generation).
#DataFlow · #LLM data preparation · #operator pipeline · #synthetic data generation · #Text-to-SQL · #chain-of-thought · #agentic orchestration · #PyTorch-style API · #global storage abstraction · #prompt templates · #execution accuracy · #reproducibility · #data-centric AI · #workflow automation · #model-in-the-loop
Version: 1