Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch
Key Summary
- The paper introduces PRIVASIS, a huge, fully synthetic dataset (1.4 million records) filled with realistic-looking private details, but created from scratch so it does not belong to any real person.
- It uses a smart recipe: start with guiding hints (auxiliary control variables), draft a record, then improve it while keeping the whole collection diverse using a diversity score (Vendi) plus a quality check.
- The dataset spans many document types (medical, legal, financial, calendars, messages) and includes over 55 million labeled attributes that make research and training much easier.
- From this dataset, the authors build a parallel training set for text sanitization that teaches models to remove or blur sensitive facts without breaking the rest of the text.
- They propose a decomposition-based sanitization pipeline: split long documents into smaller chunks, sanitize targets consistently, and then stitch everything back together.
- A compact model (PRIVASIS-CLEANER, 4B parameters) trained on this data beats bigger frontier models like GPT-5 on the standard (vanilla) sanitization test (72.5% vs. ~70%).
- Even on a hard test set with trickier targets, the compact model matches frontier models and ranks second only to GPT-5.
- The evaluation checks three kinds of leaks (direct, inference, proximity) and also verifies that non-sensitive information stays intact.
- Because the models can run locally, users can clean data on their own device before sending it anywhere, which is important for real privacy.
- The authors will release data, models, and code to speed up privacy research for AI systems that must handle sensitive information safely.
Why This Research Matters
As AI assistants read emails, calendars, and reports, we need strong ways to protect personal details before anything leaves a user's device. PRIVASIS shows we can train privacy skills at scale without touching anyone's real data, lowering risk while boosting quality. The compact sanitizers learned from this dataset can run locally on laptops or phones, cleaning sensitive text on the spot. This helps businesses meet privacy rules, builds user trust, and reduces costly data breaches. It also opens up fairer research: scientists can explore privacy methods without relying on fragile, limited, or risky datasets. In short, PRIVASIS enables practical, privacy-by-design AI for everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your friend asks to borrow your diary to help with a school project. You want to help, but your diary has secrets you don't want anyone to see.
The Concept (Data Security): Data security is protecting information so only the right people can see it. How it works:
- Lock data away (like passwords and encryption).
- Control who gets a key (permissions and access rules).
- Watch for break-ins (monitoring and alerts). Why it matters: Without it, anyone could peek at your private stuff, and trust would be broken. Anchor: When a banking app hides your full credit card number except the last four digits, that's data security in action.
Hook: You know how being a good friend means keeping promises and not sharing secrets?
The Concept (Ethical AI): Ethical AI means building AI that behaves responsibly and respects people. How it works:
- Decide clear rules (what the AI should and shouldn't do).
- Train and test for safe behavior.
- Check and fix problems that could hurt people. Why it matters: Without ethics, AI might help in the wrong way, like sharing private info. Anchor: A homework helper that refuses to post your classmate's phone number online is practicing Ethical AI.
Hook: Think about telling a story without saying someone's name. You share the idea, not the secret.
The Concept (Privacy Preservation): Privacy preservation keeps personal details hidden while still letting people use the useful parts of data. How it works:
- Find the sensitive bits (like names, IDs, addresses).
- Hide or blur them (masking, abstracting).
- Keep the rest so the story still makes sense. Why it matters: Without it, useful data becomes too risky to share or study at all. Anchor: A medical report that says "a 12-year-old in City A" instead of "Sam, age 12, Main Street" preserves privacy while keeping the medical facts.
Hook: When you pack for a trip, you bring only what you need so your bag isn't heavy.
The Concept (Data Minimization): Data minimization is using only the information that's necessary, with no extras. How it works:
- Decide the goal.
- List only the info required for that goal.
- Remove the rest. Why it matters: Without it, extra details can accidentally reveal secrets and increase risk. Anchor: A pizza delivery app needs your address, not your birthday.
The world before PRIVASIS: Researchers trying to protect privacy were stuck. They needed lots of examples of sensitive text to train and test their methods, but real private data can't be safely shared. So most projects used small, narrow datasets (like only medical notes or short chats) or just focused on spotting obvious PII (personally identifiable information) such as names and phone numbers. This left out many other sensitive clues (like job changes, illness hints, travel plans) that matter a lot in real life. Meanwhile, new AI agents (helping with email, calendars, and health) increasingly read personal records on the fly. The risk: if these systems can't reliably hide or generalize sensitive details, they could leak secrets. People tried several approaches that fell short:
- Use tiny public datasets: Not enough variety, so models don't learn real-world patterns.
- Mask only fixed PII types: Misses context-specific sensitivity (like "divorce" or "secret project").
- Differential privacy training: Helpful but expensive and not tailored to sanitizing long, complex documents.
- Prompting large models to generate examples: Tends to produce generic text and is hard to scale responsibly without drifting into memorization.
The gap: We needed a giant, safe-to-share, richly labeled, multi-domain dataset that feels like privacy-heavy text but is fully synthetic, so it doesn't belong to any real person. Also needed: a training setup that teaches models to remove or abstract sensitive info while keeping useful parts intact, even in long documents.
Real stakes: If your AI calendar assistant accidentally leaves your exact home address and school drop-off time in an email, that's a serious safety risk. If a health chatbot keeps the medical insight but leaks the patient's clinic name, that's a privacy breach. And if companies must send your raw documents to a server just to clean them, you're trusting a stranger with your secrets. PRIVASIS aims to flip this: create safe, synthetic practice material at massive scale and train small models that can run locally to clean data before anything leaves your device.
02 Core Idea
Hook: Imagine a practice field that looks and feels like a real stadium but has no actual fans' faces in the seats, so teams can train safely at full speed.
The Concept (Synthetic Dataset Generation): Synthetic dataset generation makes new, realistic-looking data that doesn't come from any real person. How it works:
- Decide what types of documents you want (medical, legal, email, etc.).
- Generate detailed, but fictitious profiles and contexts.
- Create full records that match those details. Why it matters: Without synthetic data, we either risk using real private info or we stay stuck with tiny, unhelpful datasets. Anchor: Fake hospital billing statements with life-like structure and labels, but tied to no real patient, let researchers train privacy tools without risking anyone's identity.
Aha! Moment in one sentence: If we can carefully guide large language models to invent richly detailed but purely imaginary private records at massive scale, and label the important parts, we can finally train and test privacy methods safely and effectively.
Three analogies for the main idea:
- Movie set: It looks real enough to practice stunts, but no actual city is put at risk.
- Flight simulator: Pilots train on realistic missions without flying real passengers.
- Practice puzzles: You learn to solve tricky problems using made-up examples before touching a real exam.
Before vs. after:
- Before: Small, narrow datasets; models overfit to short texts and only basic PII.
- After: Million-scale, multi-domain, richly annotated synthetic records, plus a sanitization training set, so even small models learn to remove sensitive details while keeping the story useful.
Why it works (intuition):
- Guidance beats guessing: By steering generation with guiding hints (who the person is, what kind of record it is, what's happening), the model produces specific, grounded documents instead of generic fluff.
- Diversity by design: An accept-or-revise loop favors records that add new variety to the collection, not just more of the same.
- Labels for learning: Extra annotations mark what's sensitive, so sanitizers learn to target only what should change.
Hook: You know how a recipe card tells you the dish, the ingredients, and the occasion so your cooking turns out right?
The Concept (Auxiliary Control Variables): These are helper hints, like profile attributes, record type, and background scene, that guide what to generate. How it works:
- Create a profile (e.g., name, age, country).
- Pick a record type (e.g., "psychotherapy billing statement").
- Write a background context (what's going on). Why it matters: Without these hints, the model drifts to bland, repetitive text that lacks realistic details. Anchor: Telling the model "Create a clinic bill for a traveler securing medicine refills" leads to specific, coherent details. (A small code sketch follows below.)
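To make the helper hints concrete, here is a minimal sketch of how they might be bundled and turned into a generation prompt. It is written in Python with hypothetical names (`ControlVariables`, `build_generation_prompt`); the paper's actual schema and prompt wording may differ.

```python
from dataclasses import dataclass

@dataclass
class ControlVariables:
    """Hypothetical bundle of auxiliary control variables for one record."""
    profile: dict      # e.g., {"name": "Natisha", "age": 34, "citizenship": "Israeli"}
    record_type: str   # e.g., "psychotherapy billing statement"
    background: str    # short scene describing why this record exists
    format_hint: str   # tone/structure, e.g., "formal invoice with line items"

def build_generation_prompt(cv: ControlVariables) -> str:
    """Assemble a prompt asking an LLM to draft a fully fictitious record."""
    profile_lines = "\n".join(f"- {k}: {v}" for k, v in cv.profile.items())
    return (
        "Write a realistic but entirely fictitious document about a made-up person.\n"
        f"Record type: {cv.record_type}\n"
        f"Format: {cv.format_hint}\n"
        f"Background: {cv.background}\n"
        f"Subject profile:\n{profile_lines}\n"
        "Include concrete, internally consistent details (dates, locations, reference numbers)."
    )
```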
Hook: When you build a photo album, you don't keep 50 shots of the same pose; you select the ones that add something new.
The Concept (Diversity-Preserving Iterative Selection): This is an edit-and-choose loop that keeps improving drafts while protecting variety. How it works:
- Draft a record and a revised version.
- Ask a model judge which feels more specific and realistic.
- Keep the one that also increases diversity across the whole collection (using a diversity score). Why it matters: Without this, everything starts sounding alike, which is bad for training. Anchor: Two similar emails are compared; the one with clearer timestamps and new phrasing wins if it adds variety to the pile.
Hook: Think of "diversity" like making a fruit salad with many different fruits, not just apples.
The Concept (Vendi Score): Vendi is a metric that measures how spread out a set of texts is in meaning space. How it works:
- Turn each text into an embedding (a vector that captures meaning).
- Measure how broadly these vectors are spread.
- Prefer additions that make the spread bigger. Why it matters: Without a diversity signal, a dataset gets redundant and less useful. Anchor: If new legal memos all repeat the same wording, Vendi goes down; add a fresh style or topic, and Vendi goes up. (A small computational sketch follows below.)
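For readers who want the mechanics, here is a minimal sketch of a Vendi-style diversity score computed from text embeddings, assuming cosine similarity as the kernel (the authors may use a different embedding model or kernel). The score is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix, i.e., an "effective number" of distinct items.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Diversity of a text collection from its (n, d) embedding matrix.

    With cosine similarity as the kernel, the score is exp(entropy of the
    eigenvalues of K/n): n identical texts score ~1, n very different texts ~n.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T                                  # (n, n) similarity matrix, diagonal = 1
    eigvals = np.linalg.eigvalsh(K / K.shape[0])
    eigvals = np.clip(eigvals, 0.0, None)        # guard tiny negative rounding errors
    nonzero = eigvals[eigvals > 1e-12]
    return float(np.exp(-np.sum(nonzero * np.log(nonzero))))
```

Adding a near-duplicate record barely moves this score, while adding something semantically new pushes it up, which is exactly the signal the selection loop needs.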
Hook: Imagine practicing the same song at different speeds and volumes so you can perform it anywhere.
The Concept (Parallel Corpus for Sanitization): This is a training set of triples: the original text, a sanitization instruction, and the correctly sanitized text. How it works:
- Choose which facts to hide or blur.
- Apply the change precisely across the text.
- Keep non-sensitive parts untouched. Why it matters: Without such pairs, models won't learn exactly how to rewrite safely while preserving meaning. Anchor: "Replace the exact birthdate with 'early March' but keep the city" teaches the model what to change and what to retain. (A minimal data-structure sketch follows below.)
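In code, one such triple can be represented very simply. The sketch below uses hypothetical field names and an invented example record; it is not the released data format.

```python
from typing import TypedDict

class SanitizationExample(TypedDict):
    original: str     # the full synthetic record before editing
    instruction: str  # what to hide or abstract, and what to keep
    sanitized: str    # the target output with only the sensitive spans changed

example: SanitizationExample = {
    "original": "Patient DOB: March 3, 1988. Clinic: Riverside Center, Austin.",
    "instruction": "Abstract the exact birthdate to 'early March'; keep the city.",
    "sanitized": "Patient DOB: early March. Clinic: Riverside Center, Austin.",
}
```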
Put together, these ideas let PRIVASIS scale to 1.4M records with 55M+ labeled attributes, and then teach small models to sanitize text better than much larger ones on standard tests.
03Methodology
High-level recipe: Inputs (helper hints) → Draft records → Selective refinement with diversity checks → Attribute labeling → Filtering → Sanitization triples → Train compact sanitizers.
Step-by-step details with reasons and examples:
- Informed Initialization with Auxiliary Control Variables
- What happens: The system first builds a profile (e.g., first name, age, citizenship), selects a record type (like "tax letter" or "clinic bill"), and writes a background context (why this record exists). It also sketches a format (tone and structure), then generates the initial draft record from these hints.
- Why this exists: To avoid bland, generic text and instead produce detailed, coherent, and on-topic records.
- Example: The profile says "Natisha, Israeli citizen, planning an international work trip," the record type is "psychotherapy billing statement," and the background context mentions medication refills for travel. The draft includes dates, clinic rooms, and pharmacy references that match.
- Diversity-Preserving Iterative Selection-Based Refinement
- What happens: For each draft, the system samples a revised version. A model judge scores which is more specific/realistic. Then it checks how much the new choice would increase diversity across the current collection using the Vendi score. Only if the combined score passes a threshold is the new draft accepted.
- Why this exists: Repeated editing can collapse variety; this step keeps the set fresh and wide-ranging so models don't overfit to one style.
- Example: Two versions of a legal notice are compared. The improved one adds concrete addresses and clearer timelines and also boosts collection diversity, so it's kept. (A simplified accept-or-keep loop is sketched below.)
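The refinement step above can be sketched as an accept-or-keep loop. This is a simplified reconstruction, not the authors' exact procedure: `revise`, `judge_prefers_revision`, and `embed` are assumed stand-ins for model-backed components, and `vendi_score` is the helper sketched earlier.

```python
def refine_with_diversity(record: str,
                          collection: list[str],
                          revise,                   # str -> str: sample a revised draft
                          judge_prefers_revision,   # (old, new) -> bool: LLM judge verdict
                          embed,                    # list[str] -> np.ndarray of embeddings
                          min_gain: float = 0.0,
                          rounds: int = 3) -> str:
    """Accept a revision only if the judge prefers it AND collection-level
    diversity (Vendi) does not fall below the required threshold."""
    for _ in range(rounds):
        candidate = revise(record)
        if not judge_prefers_revision(record, candidate):
            continue
        base = vendi_score(embed(collection + [record]))
        new = vendi_score(embed(collection + [candidate]))
        if new - base >= min_gain:          # combined quality + diversity gate
            record = candidate
    return record
```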
Hook: Like giving a referee scorecards for both performance and originality.
The Concept (LLM Judge): An LLM judge is a model that compares two texts and decides which is better for realism and specificity. How it works:
- Show draft A and draft B.
- Ask: which one is more concrete and believable?
- Add this to the diversity check to decide which to keep. Why it matters: Without a judge, the system can accept weaker drafts or drift. Anchor: If Draft B lists exact timestamps and realistic invoice codes, the judge often picks B. (A minimal prompt sketch follows below.)
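A minimal judge-prompt sketch, assuming a generic `call_llm` helper rather than any specific API; the paper's actual judging prompt and criteria may be richer.

```python
def judge_prefers_revision(draft_a: str, draft_b: str, call_llm) -> bool:
    """Ask an LLM which draft is more specific and realistic; True if B wins."""
    prompt = (
        "You are comparing two synthetic documents.\n"
        "Which one is more concrete, specific, and believable?\n"
        "Answer with exactly 'A' or 'B'.\n\n"
        f"--- Draft A ---\n{draft_a}\n\n--- Draft B ---\n{draft_b}"
    )
    return call_llm(prompt).strip().upper().startswith("B")

# To plug into the loop sketched earlier, bind the model call first, e.g.:
# functools.partial(judge_prefers_revision, call_llm=my_model_call)
```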
- Attribute Annotation
- What happens: After a draft is finalized, the system extracts all attributes mentioned (like clinic name, session time, passport number) and groups related ones (e.g., under "location"). These become structured labels in JSON.
- Why this exists: These labels help later tasks (like sanitization) target exactly what to remove or abstract.
- Example: In a finance memo, "employer," "department," and "salary" become a labeled group; in medical records, "clinic room" and "pharmacy name" cluster under "location." (A hypothetical JSON example follows below.)
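For illustration, such structured labels might look like the following JSON; the attribute names and grouping here are invented for this example and are not the released schema.

```json
{
  "record_type": "psychotherapy billing statement",
  "attributes": {
    "person": {"name": "Natisha", "citizenship": "Israeli"},
    "location": {"clinic_room": "Suite 204", "pharmacy_name": "Harbor Pharmacy"},
    "finance": {"session_fee": "180 USD", "invoice_number": "INV-20331"},
    "time": {"session_time": "2024-03-03 14:00"}
  }
}
```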
- Filtering
- What happens: Remove short, broken, or under-18 cases; ensure only coherent, useful records remain.
- Why this exists: Keeps the dataset high-quality and safe to use.
- Example: A 40-word fragment or a teen profile is dropped; a 600-word well-formed report is kept.
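The filtering pass described above can be approximated with a few lightweight checks; the word-count cutoff below is an illustrative assumption, while the adult-only rule mirrors the under-18 filter just mentioned.

```python
def keep_record(text: str, profile: dict) -> bool:
    """Keep only coherent, adult-profile records; thresholds are illustrative."""
    if profile.get("age", 0) < 18:       # drop minor profiles, per the filtering rule
        return False
    if len(text.split()) < 100:          # drop short fragments (assumed cutoff)
        return False
    return True
```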
Secret sauce of generation:
- Combining guiding hints, a quality judge, and diversity math (Vendi) creates long, specific, and varied records at scale, without using any real person's data.
Now, building the sanitization parallel corpus:
Hook: Imagine slicing a big cake into neat pieces before icing it; that's much easier than icing the whole cake at once.
The Concept (Decomposition/Chunking): Break a long document into smaller chunks so sanitization can be applied consistently and precisely. How it works:
- Split by natural boundaries until each chunk is at most about 512 characters.
- Identify which chunks contain the sensitive targets.
- Sanitize those chunks and stitch everything back together. Why it matters: Without chunking, models miss some mentions or make inconsistent edits across a long text. Anchor: A 6-paragraph clinic note is split so all date mentions are abstracted the same way across every chunk. (A boundary-aware splitter is sketched below.)
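A sketch of boundary-aware chunking, assuming paragraph and sentence breaks count as the "natural boundaries" and using the roughly 512-character ceiling mentioned above:

```python
import re

def chunk_document(text: str, max_chars: int = 512) -> list[str]:
    """Split text at paragraph/sentence boundaries into chunks of about max_chars."""
    pieces = []
    for para in re.split(r"\n\s*\n", text):                      # paragraph boundaries first
        pieces.extend(re.split(r"(?<=[.!?])\s+", para.strip()))  # then sentence ends
    chunks, current = [], ""
    for piece in filter(None, pieces):
        if len(current) + len(piece) + 1 <= max_chars:
            current = f"{current} {piece}".strip()
        else:
            if current:
                chunks.append(current)
            current = piece          # an overlong single sentence stays whole in this sketch
    if current:
        chunks.append(current)
    return chunks
```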
Target selection and levels of change:
Hook: When packing a gift, you decide what to wrap tightly and what to leave visible on top.
The Concept (Abstraction vs. Drop): Two ways to sanitize: blur a detail (abstraction) or remove it entirely (drop). How it works:
- Pick targets (e.g., birthdate, employer, clinic room).
- For abstraction: replace "March 3, 2024" with "early March."
- For drop: delete the sensitive span cleanly. Why it matters: Without control over how much to change, you either overshare or ruin the text's usefulness. Anchor: "Replace 'Aadhaar 1234-5678-9012' with 'national ID'" is abstraction; deleting the number completely is a drop. (Both edit modes are sketched below.)
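The two edit modes can be sketched as span-level operations on a chunk; `span` here is a literal substring and the replacement text would come from the instruction, which is a simplification of how the edits are actually applied.

```python
def abstract_span(chunk: str, span: str, replacement: str) -> str:
    """ABSTRACT: blur a detail, e.g., 'March 3, 2024' -> 'early March'."""
    return chunk.replace(span, replacement)

def drop_span(chunk: str, span: str) -> str:
    """DROP: remove the sensitive span entirely and tidy leftover whitespace."""
    return " ".join(chunk.replace(span, "").split())

# abstract_span(note, "March 3, 2024", "early March")
# drop_span(note, "Aadhaar 1234-5678-9012")
```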
Hook: Think of a treasure map where you mark what to hide and what to keep so the story still makes sense.
The Concept (Retention Targets): These are facts you tell the model to keep, so it doesn't over-edit. How it works:
- Choose non-sensitive items (e.g., city, diagnosis category).
- Ensure they remain unchanged while sanitizing sensitive parts.
- Prefer keep-items that don't overlap textually with what you're hiding. Why it matters: Without keep-lists, models often erase too much and hurt utility. Anchor: "Hide the exact address but keep the city and the appointment date range" preserves planning info while protecting privacy. (One possible spec structure is sketched below.)
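One way to represent a full sanitization request, combining targets (each marked ABSTRACT or DROP) with retention targets, is sketched below; the field names are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Target:
    attribute: str                        # e.g., "birthdate", "exact_address"
    action: Literal["ABSTRACT", "DROP"]   # how aggressively to sanitize it
    replacement: Optional[str] = None     # used for ABSTRACT, e.g., "early March"

@dataclass
class SanitizationSpec:
    targets: list[Target]   # what must change
    retain: list[str]       # non-sensitive facts that must survive unchanged

spec = SanitizationSpec(
    targets=[Target("exact_address", "DROP"),
             Target("birthdate", "ABSTRACT", "early March")],
    retain=["city", "appointment date range"],
)
```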
Putting it all together (sanitization pipeline):
- Decomposition: Split text into manageable chunks (~512 chars).
- Target selection: Rank attributes by sensitivity, pick targets, and label each as ABSTRACT or DROP.
- Span finding: For each target, locate all relevant spans across the selected chunks.
- Instruction crafting: Build precise rewrite instructions (e.g., "Abstract the specific date to 'early March'").
- Apply consistently: Sanitize all affected chunks the same way.
- Merge: Recombine sanitized chunks into a coherent document.
- Final instruction: Produce a single, user-style instruction summarizing what was changed and what was retained.
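Tying the pipeline steps above together, and reusing the `chunk_document` and `SanitizationSpec` sketches from earlier, a simplified driver could look like this. `find_spans` and `sanitize_chunk` are assumed stand-ins for the model-backed span finder and chunk rewriter, not the released implementation.

```python
def sanitize_document(text: str, spec: SanitizationSpec,
                      find_spans, sanitize_chunk) -> str:
    """Decompose -> locate target spans -> edit affected chunks -> merge."""
    edited = []
    for chunk in chunk_document(text):                  # decomposition (~512-char chunks)
        spans = find_spans(chunk, spec.targets)         # span finding per target
        if spans:
            chunk = sanitize_chunk(chunk, spans, spec)  # apply ABSTRACT/DROP consistently
        edited.append(chunk)
    return "\n\n".join(edited)                          # merge back into one document
```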
Secret sauce of sanitization:
- Chunk-first editing keeps changes consistent.
- Multi-level abstraction balances privacy with usefulness.
- Retention targets stop over-sanitizing.
Outputs:
- A large set of triplets (original, instruction, sanitized) for training compact models that can run locally.
Why this is clever:
- It transforms a very hard, long-document rewriting task into a series of smaller, grounded edits while preserving global consistency, which is exactly what general-purpose LLMs struggle with but small specialized models can learn to do reliably.
04 Experiments & Results
The test: Do models sanitize the targets without leaking them or breaking the rest of the document?
- They check three kinds of leaks and whether the non-sensitive facts are retained. They evaluate two sets: "vanilla" (easier) and "hard" (more grouped, contextual targets and longer instructions).
Hook: Picture three ways a secret can slip out: you say it, you hint at it, or you leave clues that make it guessable.
The Concept (Leak Types: Direct, Inference, Proximity): These are three levels of privacy failure. How it works:
- Direct leak: the exact sensitive string still appears.
- Inference leak: the model removed the exact string, but an evaluator can still guess it exactly from context.
- Proximity leak: the sanitized version still lets an evaluator guess nearly as close as if they had the original. Why it matters: Without catching subtle leaks, text looks safe but still reveals secrets indirectly. Anchor: Hiding "Journal of X" but leaving "editor@jx.org" can let someone infer the journal; that's an inference leak.
Hook: Think of a spelling test where one miss makes the whole word wrong.
The Concept (Full Successful Record Metric): A record is counted as a win only if all targets are sanitized with no leaks and all keep-items are preserved. How it works:
- Check each target for direct/inference/proximity leaks.
- Verify keep-attributes still appear correctly.
- If any single check fails, the whole record fails. Why it matters: In privacy, one miss can spoil everything. Anchor: Even if you hide 9 out of 10 names, leaking one name means the whole document isn't safely sanitized. (A sketch of this all-or-nothing check follows below.)
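The all-or-nothing scoring can be sketched as below. `check_leak` is an assumed interface standing in for the LLM-based evaluator that flags direct, inference, or proximity leaks, and the retention test is a string-level simplification of the real check.

```python
def full_successful_record(sanitized: str,
                           targets: list[str],
                           retain: list[str],
                           check_leak) -> bool:
    """True only if every target is leak-free AND every keep-item is preserved."""
    for target in targets:
        # check_leak returns None, or one of "direct", "inference", "proximity"
        if check_leak(sanitized, target) is not None:
            return False
    return all(keep in sanitized for keep in retain)   # retention of non-sensitive facts
```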
Competition and scoreboard (vanilla set):
- Frontier LLMs like o3 and GPT-5 get around 70% Full Successful Record, like getting an A-.
- PRIVASIS-CLEANER-4B scores about 72.5%, an A, despite being much smaller. That's like a lightweight sprinter beating a heavyweight champ in a specific race.
- Many models have high per-attribute success (>90%) but still fail the record because missing just one target is common.
- Retention matters: frontier models sometimes over-edit; PRIVASIS-CLEANER tends to keep non-sensitive info (≈99% retention on vanilla), which preserves usefulness.
Hard test set:
- Everyone's scores drop sharply because targets are grouped and instructions are longer and trickier.
- GPT-5 is best at about 13%, while PRIVASIS-CLEANER-4B is close behind (~12-13%), matching frontier-level performance.
Surprising findings:
- Direct leaks are still the most common failure; even powerful models sometimes leave the exact sensitive string in headers or signatures.
- The gap between attribute-level success (>90%) and full-record success (~70% or lower) shows how one slip-up ruins privacy.
- The compact model's strong retention suggests that specialized training on the parallel corpus teaches "edit just the sensitive bits."
- Cross-dataset generalization: PRIVASIS-CLEANER-4B matches a model trained directly on another benchmark (NaP²) without ever seeing it (both at ≈10% leak ratio on NaP²), and still performs far better back on PRIVASIS, signaling robustness.
Category-wise struggles:
- Business & Finance and Health & Wellness are toughest, likely because names, dates, employers, and addresses are dense and repeated in these documents.
Takeaway: A targeted training set plus a decomposition-based pipeline lets a small model beat or match giants on a strict, real-world-style privacy test.
05 Discussion & Limitations
Limitations:
- Synthetic ≠ real: Even very realistic made-up records may miss some messy, real-world quirks (typos, code-mixed languages, edge-case formats).
- Demographic balance: Some groups (e.g., certain ethnicity labels) appear underrepresented; future work should tune sampling to better reflect global populations.
- Hard context still hard: Grouped, context-heavy attributes (like "all location-related info") are challenging; scores on the hard set show there's room to grow.
- Leakage beyond text: The work focuses on textual sanitization, not side channels like file metadata or images in PDFs.
Required resources:
- Generation at scale uses API calls to multiple LLMs; though costs are reported (e.g., thousands of records ≈ low thousands of dollars), reproducing the entire dataset needs budget and orchestration.
- Training compact models can be done on a few GPUs, but evaluating large test sets with LLM-based checks also needs compute.
When not to use this directly:
- Don't use synthetic records for real clinical, legal, or financial decisions; they're for research and training only.
- Don't assume sanitization is perfect for safety-critical releases; human or policy checks may still be necessary.
- Don't apply the method to non-text modalities (images, audio) without adapting it; leak patterns differ.
Open questions:
- Better diversity signals: Can we design even stronger, cheaper diversity metrics than Vendi for text collections?
- Robust grouped-target handling: How can models more reliably detect and abstract whole attribute groups across long contexts?
- Stronger inference defenses: Beyond string replacement, how can sanitizers block indirect clues that let evaluators guess sensitive info (e.g., via linked domains or fixed patterns)?
- Multilinguality at scale: Early signs are positive, but how well do these techniques hold across many languages and scripts?
- Policy-aware sanitization: Can we blend legal/privacy rules (like GDPR principles) into instructions so rewrites satisfy both technical and regulatory needs?
06 Conclusion & Future Work
Three-sentence summary: PRIVASIS builds a million-scale, fully synthetic, richly annotated dataset of privacy-heavy documents, created from scratch with guidance hints and diversity-preserving refinement. From it, the authors construct a sanitization training corpus and a chunk-based rewriting pipeline that teach small models to remove or blur sensitive information while keeping useful details. The resulting compact sanitizer outperforms or matches much larger frontier models on strict tests, especially on the standard (vanilla) set.
Main achievement: Showing that careful synthetic data plus a decomposition-based sanitization approach can unlock reliable, fine-grained privacy editing, even with compact models that can run locally.
Future directions:
- Improve handling of grouped/contextual targets and reduce inference/proximity leaks.
- Expand demographic and domain balance, and scale to more languages and formats (emails, forms, PDFs).
- Tie instructions to policy frameworks so sanitization is both technically sound and regulation-aware.
Why remember this: It demonstrates a practical path to privacy-by-design: train on large, safe, synthetic data and deploy small, local models that clean text before it leaves your device, helping AI assistants be truly helpful without oversharing.
Practical Applications
- On-device email cleanup: Automatically abstract addresses, phone numbers, and employer names before sharing a thread.
- Healthcare note sharing: Remove patient identifiers while keeping clinical findings for team discussions.
- Customer support logs: Drop sensitive tokens or IDs but retain troubleshooting steps and outcomes.
- Legal document redaction: Blur dates and parties as instructed while preserving clause meanings.
- HR analytics: Sanitize employee records (names, IDs) so trends can be analyzed safely.
- App telemetry minimization: Strip personally identifying strings from logs on the device before upload.
- Education research: Share sanitized student essays for writing analysis without exposing identities.
- Financial reporting: Remove account numbers and exact salaries but keep aggregate amounts.
- Calendar sharing: Replace exact locations with city-level info when publishing schedules.
- Agent pipelines: Insert a compact sanitizer step that cleans inputs before an assistant processes them.