Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch
Key Summary
- The paper introduces PRIVASIS, a huge, fully synthetic dataset (1.4 million records) filled with realistic-looking private details, but created from scratch so it does not belong to any real person.
- It uses a smart recipe: start with guiding hints (auxiliary control variables), draft a record, then improve it while keeping the whole collection diverse using a diversity score (Vendi) plus a quality check.
- The dataset spans many document types (medical, legal, financial, calendars, messages) and includes over 55 million labeled attributes that make research and training much easier.
- From this dataset, the authors build a parallel training set for text sanitization that teaches models to remove or blur sensitive facts without breaking the rest of the text.
- They propose a decomposition-based sanitization pipeline: split long documents into smaller chunks, sanitize targets consistently, and then stitch everything back together.
- A compact model (PRIVASIS-CLEANER, 4B parameters) trained on this data beats bigger frontier models like GPT-5 on the standard (vanilla) sanitization test (72.5% vs. ~70%).
- Even on a hard test set with trickier targets, the compact model matches frontier models and ranks second only to GPT-5.
- The evaluation checks three kinds of leaks (direct, inference, proximity) and also verifies that non-sensitive information stays intact.
- Because the models can run locally, users can clean data on their own device before sending it anywhere, which is important for real privacy.
- The authors will release data, models, and code to speed up privacy research for AI systems that must handle sensitive information safely.
Why This Research Matters
As AI assistants read emails, calendars, and reports, we need strong ways to protect personal details before anything leaves a user's device. PRIVASIS shows we can train privacy skills at scale without touching anyone's real data, lowering risk while boosting quality. The compact sanitizers learned from this dataset can run locally on laptops or phones, cleaning sensitive text on the spot. This helps businesses meet privacy rules, builds user trust, and reduces costly data breaches. It also opens up fairer research: scientists can explore privacy methods without relying on fragile, limited, or risky datasets. In short, PRIVASIS enables practical, privacy-by-design AI for everyday life.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine your friend asks to borrow your diary to help with a school project. You want to help, but your diary has secrets you don't want anyone to see.
The Concept (Data Security): Data security is protecting information so only the right people can see it. How it works:
- Lock data away (like passwords and encryption).
- Control who gets a key (permissions and access rules).
- Watch for break-ins (monitoring and alerts). Why it matters: Without it, anyone could peek at your private stuff, and trust would be broken. Anchor: When a banking app hides your full credit card number except the last four digits, that's data security in action.
Hook: You know how being a good friend means keeping promises and not sharing secrets?
The Concept (Ethical AI): Ethical AI means building AI that behaves responsibly and respects people. How it works:
- Decide clear rules (what the AI should and shouldn't do).
- Train and test for safe behavior.
- Check and fix problems that could hurt people. Why it matters: Without ethics, AI might help in the wrong way, like sharing private info. Anchor: A homework helper that refuses to post your classmate's phone number online is practicing Ethical AI.
Hook: Think about telling a story without saying someone's name. You share the idea, not the secret.
The Concept (Privacy Preservation): Privacy preservation keeps personal details hidden while still letting people use the useful parts of data. How it works:
- Find the sensitive bits (like names, IDs, addresses).
- Hide or blur them (masking, abstracting).
- Keep the rest so the story still makes sense. Why it matters: Without it, useful data becomes too risky to share or study at all. Anchor: A medical report that says "a 12-year-old in City A" instead of "Sam, age 12, Main Street" preserves privacy while keeping the medical facts.
Hook: When you pack for a trip, you bring only what you need so your bag isn't heavy.
The Concept (Data Minimization): Data minimization is using only the information that's necessary, with no extras. How it works:
- Decide the goal.
- List only the info required for that goal.
- Remove the rest. Why it matters: Without it, extra details can accidentally reveal secrets and increase risk. Anchor: A pizza delivery app needs your address, not your birthday.
The world before PRIVASIS: Researchers trying to protect privacy were stuck. They needed lots of examples of sensitive text to train and test their methods, but real private data can't be safely shared. So most projects used small, narrow datasets (like only medical notes or short chats) or just focused on spotting obvious PII (personally identifiable information) such as names and phone numbers. This left out many other sensitive clues (like job changes, illness hints, travel plans) that matter a lot in real life. Meanwhile, new AI agents (helping with email, calendars, and health) increasingly read personal records on the fly. The risk: if these systems can't reliably hide or generalize sensitive details, they could leak secrets. People tried several approaches that fell short:
- Use tiny public datasets: Not enough variety, so models don't learn real-world patterns.
- Mask only fixed PII types: Misses context-specific sensitivity (like "divorce" or "secret project").
- Differential privacy training: Helpful but expensive and not tailored to sanitizing long, complex documents.
- Prompting large models to generate examples: Tends to produce generic text and is hard to scale responsibly without drifting into memorization.
The gap: We needed a giant, safe-to-share, richly labeled, multi-domain dataset that feels like privacy-heavy text but is fully synthetic, so it doesn't belong to any real person. Also needed: a training setup that teaches models to remove or abstract sensitive info while keeping useful parts intact, even in long documents.
Real stakes: If your AI calendar assistant accidentally leaves your exact home address and school drop-off time in an email, that's a serious safety risk. If a health chatbot keeps the medical insight but leaks the patient's clinic name, that's a privacy breach. And if companies must send your raw documents to a server just to clean them, you're trusting a stranger with your secrets. PRIVASIS aims to flip this: create safe, synthetic practice material at massive scale and train small models that can run locally to clean data before anything leaves your device.
02 Core Idea
Hook: Imagine a practice field that looks and feels like a real stadium but has no actual fans' faces in the seats, so teams can train safely at full speed.
The Concept (Synthetic Dataset Generation): Synthetic dataset generation makes new, realistic-looking data that doesn't come from any real person. How it works:
- Decide what types of documents you want (medical, legal, email, etc.).
- Generate detailed, but fictitious profiles and contexts.
- Create full records that match those details. Why it matters: Without synthetic data, we either risk using real private info or we stay stuck with tiny, unhelpful datasets. Anchor: Fake hospital billing statements with life-like structure and labels, but tied to no real patient, let researchers train privacy tools without risking anyone's identity.
Aha! Moment in one sentence: If we can carefully guide large language models to invent richly detailed but purely imaginary private records at massive scale, and label the important parts, we can finally train and test privacy methods safely and effectively.
Three analogies for the main idea:
- Movie set: It looks real enough to practice stunts, but no actual city is put at risk.
- Flight simulator: Pilots train on realistic missions without flying real passengers.
- Practice puzzles: You learn to solve tricky problems using made-up examples before touching a real exam.
Before vs. after:
- Before: Small, narrow datasets; models overfit to short texts and only basic PII.
- After: Million-scale, multi-domain, richly annotated synthetic records, plus a sanitization training set, so even small models learn to remove sensitive details while keeping the story useful.
Why it works (intuition):
- Guidance beats guessing: By steering generation with guiding hints (who the person is, what kind of record it is, what's happening), the model produces specific, grounded documents instead of generic fluff.
- Diversity by design: An accept-or-revise loop favors records that add new variety to the collection, not just more of the same.
- Labels for learning: Extra annotations mark what's sensitive, so sanitizers learn to target only what should change.
Hook: You know how a recipe card tells you the dish, the ingredients, and the occasion so your cooking turns out right?
The Concept (Auxiliary Control Variables): These are helper hints, like profile attributes, record type, and background scene, that guide what to generate. How it works:
- Create a profile (e.g., name, age, country).
- Pick a record type (e.g., "psychotherapy billing statement").
- Write a background context (what's going on). Why it matters: Without these hints, the model drifts to bland, repetitive text that lacks realistic details. Anchor: Telling the model "Create a clinic bill for a traveler securing medicine refills" leads to specific, coherent details. (A small code sketch follows below.)
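To make the helper hints concrete, here is a minimal sketch of how they might be bundled and turned into a generation prompt. It is written in Python with hypothetical names (`ControlVariables`, `build_generation_prompt`); the paper's actual schema and prompt wording may differ.

```python
from dataclasses import dataclass

@dataclass
class ControlVariables:
    """Hypothetical bundle of auxiliary control variables for one record."""
    profile: dict      # e.g., {"name": "Natisha", "age": 34, "citizenship": "Israeli"}
    record_type: str   # e.g., "psychotherapy billing statement"
    background: str    # short scene describing why this record exists
    format_hint: str   # tone/structure, e.g., "formal invoice with line items"

def build_generation_prompt(cv: ControlVariables) -> str:
    """Assemble a prompt asking an LLM to draft a fully fictitious record."""
    profile_lines = "\n".join(f"- {k}: {v}" for k, v in cv.profile.items())
    return (
        "Write a realistic but entirely fictitious document about a made-up person.\n"
        f"Record type: {cv.record_type}\n"
        f"Format: {cv.format_hint}\n"
        f"Background: {cv.background}\n"
        f"Subject profile:\n{profile_lines}\n"
        "Include concrete, internally consistent details (dates, locations, reference numbers)."
    )
```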
Hook: When you build a photo album, you don't keep 50 shots of the same pose; you select the ones that add something new.
The Concept (Diversity-Preserving Iterative Selection): This is an edit-and-choose loop that keeps improving drafts while protecting variety. How it works:
- Draft a record and a revised version.
- Ask a model judge which feels more specific and realistic.
- Keep the one that also increases diversity across the whole collection (using a diversity score). Why it matters: Without this, everything starts sounding alike, which is bad for training. Anchor: Two similar emails are compared; the one with clearer timestamps and new phrasing wins if it adds variety to the pile.
Hook: Think of "diversity" like making a fruit salad with many different fruits, not just apples.
The Concept (Vendi Score): Vendi is a metric that measures how spread out a set of texts is in meaning space. How it works:
- Turn each text into an embedding (a vector that captures meaning).
- Measure how broadly these vectors are spread.
- Prefer additions that make the spread bigger. Why it matters: Without a diversity signal, a dataset gets redundant and less useful. Anchor: If new legal memos all repeat the same wording, Vendi goes down; add a fresh style or topic, and Vendi goes up. (A small computational sketch follows below.)
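For readers who want the mechanics, here is a minimal sketch of a Vendi-style diversity score computed from text embeddings, assuming cosine similarity as the kernel (the authors may use a different embedding model or kernel). The score is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix, i.e., an "effective number" of distinct items.

```python
import numpy as np

def vendi_score(embeddings: np.ndarray) -> float:
    """Diversity of a text collection from its (n, d) embedding matrix.

    With cosine similarity as the kernel, the score is exp(entropy of the
    eigenvalues of K/n): n identical texts score ~1, n very different texts ~n.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T                                  # (n, n) similarity matrix, diagonal = 1
    eigvals = np.linalg.eigvalsh(K / K.shape[0])
    eigvals = np.clip(eigvals, 0.0, None)        # guard tiny negative rounding errors
    nonzero = eigvals[eigvals > 1e-12]
    return float(np.exp(-np.sum(nonzero * np.log(nonzero))))
```

Adding a near-duplicate record barely moves this score, while adding something semantically new pushes it up, which is exactly the signal the selection loop needs.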
Hook: Imagine practicing the same song at different speeds and volumes so you can perform it anywhere.
The Concept (Parallel Corpus for Sanitization): This is a training set of triples: the original text, a sanitization instruction, and the correctly sanitized text. How it works:
- Choose which facts to hide or blur.
- Apply the change precisely across the text.
- Keep non-sensitive parts untouched. Why it matters: Without such pairs, models won't learn exactly how to rewrite safely while preserving meaning. Anchor: "Replace the exact birthdate with 'early March' but keep the city" teaches the model what to change and what to retain. (A minimal data-structure sketch follows below.)
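In code, one such triple can be represented very simply. The sketch below uses hypothetical field names and an invented example record; it is not the released data format.

```python
from typing import TypedDict

class SanitizationExample(TypedDict):
    original: str     # the full synthetic record before editing
    instruction: str  # what to hide or abstract, and what to keep
    sanitized: str    # the target output with only the sensitive spans changed

example: SanitizationExample = {
    "original": "Patient DOB: March 3, 1988. Clinic: Riverside Center, Austin.",
    "instruction": "Abstract the exact birthdate to 'early March'; keep the city.",
    "sanitized": "Patient DOB: early March. Clinic: Riverside Center, Austin.",
}
```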
Put together, these ideas let PRIVASIS scale to 1.4M records with 55M+ labeled attributes, and then teach small models to sanitize text better than much larger ones on standard tests.
03Methodology
High-level recipe: Inputs (helper hints) → Draft records → Selective refinement with diversity checks → Attribute labeling → Filtering → Sanitization triples → Train compact sanitizers.
Step-by-step details with reasons and examples:
- Informed Initialization with Auxiliary Control Variables
- What happens: The system first builds a profile (e.g., first name, age, citizenship), selects a record type (like "tax letter" or "clinic bill"), and writes a background context (why this record exists). It also sketches a format (tone and structure), then generates the initial draft record from these hints.
- Why this exists: To avoid bland, generic text and instead produce detailed, coherent, and on-topic records.
- Example: The profile says "Natisha, Israeli citizen, planning an international work trip," the record type is "psychotherapy billing statement," and the background context mentions medication refills for travel. The draft includes dates, clinic rooms, and pharmacy references that match.
- Diversity-Preserving Iterative Selection-Based Refinement
- What happens: For each draft, the system samples a revised version. A model judge scores which is more specific/realistic. Then it checks how much the new choice would increase diversity across the current collection using the Vendi score. Only if the combined score passes a threshold is the new draft accepted.
- Why this exists: Repeated editing can collapse variety; this step keeps the set fresh and wide-ranging so models don't overfit to one style.
- Example: Two versions of a legal notice are compared. The improved one adds concrete addresses and clearer timelines and also boosts collection diversity, so it's kept. (A simplified accept-or-keep loop is sketched below.)
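The refinement step above can be sketched as an accept-or-keep loop. This is a simplified reconstruction, not the authors' exact procedure: `revise`, `judge_prefers_revision`, and `embed` are assumed stand-ins for model-backed components, and `vendi_score` is the helper sketched earlier.

```python
def refine_with_diversity(record: str,
                          collection: list[str],
                          revise,                   # str -> str: sample a revised draft
                          judge_prefers_revision,   # (old, new) -> bool: LLM judge verdict
                          embed,                    # list[str] -> np.ndarray of embeddings
                          min_gain: float = 0.0,
                          rounds: int = 3) -> str:
    """Accept a revision only if the judge prefers it AND collection-level
    diversity (Vendi) does not fall below the required threshold."""
    for _ in range(rounds):
        candidate = revise(record)
        if not judge_prefers_revision(record, candidate):
            continue
        base = vendi_score(embed(collection + [record]))
        new = vendi_score(embed(collection + [candidate]))
        if new - base >= min_gain:          # combined quality + diversity gate
            record = candidate
    return record
```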
Hook: Like giving a referee scorecards for both performance and originality.
The Concept (LLM Judge): An LLM judge is a model that compares two texts and decides which is better for realism and specificity. How it works:
- Show draft A and draft B.
- Ask: which one is more concrete and believable?
- Add this to the diversity check to decide which to keep. Why it matters: Without a judge, the system can accept weaker drafts or drift. Anchor: If Draft B lists exact timestamps and realistic invoice codes, the judge often picks B. (A minimal prompt sketch follows below.)
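A minimal judge-prompt sketch, assuming a generic `call_llm` helper rather than any specific API; the paper's actual judging prompt and criteria may be richer.

```python
def judge_prefers_revision(draft_a: str, draft_b: str, call_llm) -> bool:
    """Ask an LLM which draft is more specific and realistic; True if B wins."""
    prompt = (
        "You are comparing two synthetic documents.\n"
        "Which one is more concrete, specific, and believable?\n"
        "Answer with exactly 'A' or 'B'.\n\n"
        f"--- Draft A ---\n{draft_a}\n\n--- Draft B ---\n{draft_b}"
    )
    return call_llm(prompt).strip().upper().startswith("B")

# To plug into the loop sketched earlier, bind the model call first, e.g.:
# functools.partial(judge_prefers_revision, call_llm=my_model_call)
```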
- Attribute Annotation
- What happens: After a draft is finalized, the system extracts all attributes mentioned (like clinic name, session time, passport number) and groups related ones (e.g., under "location"). These become structured labels in JSON.
- Why this exists: These labels help later tasks (like sanitization) target exactly what to remove or abstract.
- Example: In a finance memo, "employer," "department," and "salary" become a labeled group; in medical records, "clinic room" and "pharmacy name" cluster under "location." (A hypothetical JSON example follows below.)
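For illustration, such structured labels might look like the following JSON; the attribute names and grouping here are invented for this example and are not the released schema.

```json
{
  "record_type": "psychotherapy billing statement",
  "attributes": {
    "person": {"name": "Natisha", "citizenship": "Israeli"},
    "location": {"clinic_room": "Suite 204", "pharmacy_name": "Harbor Pharmacy"},
    "finance": {"session_fee": "180 USD", "invoice_number": "INV-20331"},
    "time": {"session_time": "2024-03-03 14:00"}
  }
}
```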
- Filtering
- What happens: Remove short, broken, or under-18 cases; ensure only coherent, useful records remain.
- Why this exists: Keeps the dataset high-quality and safe to use.
- Example: A 40-word fragment or a teen profile is dropped; a 600-word well-formed report is kept.
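The filtering pass described above can be approximated with a few lightweight checks; the word-count cutoff below is an illustrative assumption, while the adult-only rule mirrors the under-18 filter just mentioned.

```python
def keep_record(text: str, profile: dict) -> bool:
    """Keep only coherent, adult-profile records; thresholds are illustrative."""
    if profile.get("age", 0) < 18:       # drop minor profiles, per the filtering rule
        return False
    if len(text.split()) < 100:          # drop short fragments (assumed cutoff)
        return False
    return True
```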
Secret sauce of generation:
- Combining guiding hints, a quality judge, and diversity math (Vendi) creates long, specific, and varied records at scale, without using any real person's data.
Now, building the sanitization parallel corpus:
Hook: Imagine slicing a big cake into neat pieces before icing it; that's much easier than icing the whole cake at once.
The Concept (Decomposition/Chunking): Break a long document into smaller chunks so sanitization can be applied consistently and precisely. How it works:
- Split by natural boundaries until each chunk is at most about 512 characters.
- Identify which chunks contain the sensitive targets.
- Sanitize those chunks and stitch everything back together. Why it matters: Without chunking, models miss some mentions or make inconsistent edits across a long text. Anchor: A 6-paragraph clinic note is split so all date mentions are abstracted the same way across every chunk. (A boundary-aware splitter is sketched below.)
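A sketch of boundary-aware chunking, assuming paragraph and sentence breaks count as the "natural boundaries" and using the roughly 512-character ceiling mentioned above:

```python
import re

def chunk_document(text: str, max_chars: int = 512) -> list[str]:
    """Split text at paragraph/sentence boundaries into chunks of about max_chars."""
    pieces = []
    for para in re.split(r"\n\s*\n", text):                      # paragraph boundaries first
        pieces.extend(re.split(r"(?<=[.!?])\s+", para.strip()))  # then sentence ends
    chunks, current = [], ""
    for piece in filter(None, pieces):
        if len(current) + len(piece) + 1 <= max_chars:
            current = f"{current} {piece}".strip()
        else:
            if current:
                chunks.append(current)
            current = piece          # an overlong single sentence stays whole in this sketch
    if current:
        chunks.append(current)
    return chunks
```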
Target selection and levels of change:
Hook: When packing a gift, you decide what to wrap tightly and what to leave visible on top.
The Concept (Abstraction vs. Drop): Two ways to sanitize: blur a detail (abstraction) or remove it entirely (drop). How it works:
- Pick targets (e.g., birthdate, employer, clinic room).
- For abstraction: replace "March 3, 2024" with "early March."
- For drop: delete the sensitive span cleanly. Why it matters: Without control over how much to change, you either overshare or ruin the text's usefulness. Anchor: "Replace 'Aadhaar 1234-5678-9012' with 'national ID'" is abstraction; deleting the number completely is a drop. (Both edit modes are sketched below.)
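The two edit modes can be sketched as span-level operations on a chunk; `span` here is a literal substring and the replacement text would come from the instruction, which is a simplification of how the edits are actually applied.

```python
def abstract_span(chunk: str, span: str, replacement: str) -> str:
    """ABSTRACT: blur a detail, e.g., 'March 3, 2024' -> 'early March'."""
    return chunk.replace(span, replacement)

def drop_span(chunk: str, span: str) -> str:
    """DROP: remove the sensitive span entirely and tidy leftover whitespace."""
    return " ".join(chunk.replace(span, "").split())

# abstract_span(note, "March 3, 2024", "early March")
# drop_span(note, "Aadhaar 1234-5678-9012")
```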
Hook: Think of a treasure map where you mark what to hide and what to keep so the story still makes sense.
The Concept (Retention Targets): These are facts you tell the model to keep, so it doesn't over-edit. How it works:
- Choose non-sensitive items (e.g., city, diagnosis category).
- Ensure they remain unchanged while sanitizing sensitive parts.
- Prefer keep-items that don't overlap textually with what you're hiding. Why it matters: Without keep-lists, models often erase too much and hurt utility. Anchor: "Hide the exact address but keep the city and the appointment date range" preserves planning info while protecting privacy. (One possible spec structure is sketched below.)
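One way to represent a full sanitization request, combining targets (each marked ABSTRACT or DROP) with retention targets, is sketched below; the field names are hypothetical, not the paper's schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class Target:
    attribute: str                        # e.g., "birthdate", "exact_address"
    action: Literal["ABSTRACT", "DROP"]   # how aggressively to sanitize it
    replacement: Optional[str] = None     # used for ABSTRACT, e.g., "early March"

@dataclass
class SanitizationSpec:
    targets: list[Target]   # what must change
    retain: list[str]       # non-sensitive facts that must survive unchanged

spec = SanitizationSpec(
    targets=[Target("exact_address", "DROP"),
             Target("birthdate", "ABSTRACT", "early March")],
    retain=["city", "appointment date range"],
)
```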
Putting it all together (sanitization pipeline):
- Decomposition: Split text into manageable chunks (~512 chars).
- Target selection: Rank attributes by sensitivity, pick targets, and label each as ABSTRACT or DROP.
- Span finding: For each target, locate all relevant spans across the selected chunks.
- Instruction crafting: Build precise rewrite instructions (e.g., "Abstract the specific date to 'early March'").
- Apply consistently: Sanitize all affected chunks the same way.
- Merge: Recombine sanitized chunks into a coherent document.
- Final instruction: Produce a single, user-style instruction summarizing what was changed and what was retained.
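Tying the pipeline steps above together, and reusing the `chunk_document` and `SanitizationSpec` sketches from earlier, a simplified driver could look like this. `find_spans` and `sanitize_chunk` are assumed stand-ins for the model-backed span finder and chunk rewriter, not the released implementation.

```python
def sanitize_document(text: str, spec: SanitizationSpec,
                      find_spans, sanitize_chunk) -> str:
    """Decompose -> locate target spans -> edit affected chunks -> merge."""
    edited = []
    for chunk in chunk_document(text):                  # decomposition (~512-char chunks)
        spans = find_spans(chunk, spec.targets)         # span finding per target
        if spans:
            chunk = sanitize_chunk(chunk, spans, spec)  # apply ABSTRACT/DROP consistently
        edited.append(chunk)
    return "\n\n".join(edited)                          # merge back into one document
```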
Secret sauce of sanitization:
- Chunk-first editing keeps changes consistent.
- Multi-level abstraction balances privacy with usefulness.
- Retention targets stop over-sanitizing.
Outputs:
- A large set of triplets (original, instruction, sanitized) for training compact models that can run locally.
Why this is clever:
- It transforms a very hard, long-document rewriting task into a series of smaller, grounded edits while preserving global consistency, which is exactly what general-purpose LLMs struggle with but small specialized models can learn to do reliably.
04 Experiments & Results
The test: Do models sanitize the targets without leaking them or breaking the rest of the document?
- They check three kinds of leaks and whether the non-sensitive facts are retained. They evaluate two sets: "vanilla" (easier) and "hard" (more grouped, contextual targets and longer instructions).
Hook: Picture three ways a secret can slip out: you say it, you hint at it, or you leave clues that make it guessable.
The Concept (Leak Types: Direct, Inference, Proximity): These are three levels of privacy failure. How it works:
- Direct leak: the exact sensitive string still appears.
- Inference leak: the model removed the exact string, but an evaluator can still guess it exactly from context.
- Proximity leak: the sanitized version still lets an evaluator guess nearly as close as if they had the original. Why it matters: Without catching subtle leaks, text looks safe but still reveals secrets indirectly. Anchor: Hiding "Journal of X" but leaving "editor@jx.org" can let someone infer the journal; that's an inference leak.
Hook: Think of a spelling test where one miss makes the whole word wrong.
The Concept (Full Successful Record Metric): A record is counted as a win only if all targets are sanitized with no leaks and all keep-items are preserved. How it works:
- Check each target for direct/inference/proximity leaks.
- Verify keep-attributes still appear correctly.
- If any single check fails, the whole record fails. Why it matters: In privacy, one miss can spoil everything. Anchor: Even if you hide 9 out of 10 names, leaking one name means the whole document isn't safely sanitized. (A sketch of this all-or-nothing check follows below.)
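The all-or-nothing scoring can be sketched as below. `check_leak` is an assumed interface standing in for the LLM-based evaluator that flags direct, inference, or proximity leaks, and the retention test is a string-level simplification of the real check.

```python
def full_successful_record(sanitized: str,
                           targets: list[str],
                           retain: list[str],
                           check_leak) -> bool:
    """True only if every target is leak-free AND every keep-item is preserved."""
    for target in targets:
        # check_leak returns None, or one of "direct", "inference", "proximity"
        if check_leak(sanitized, target) is not None:
            return False
    return all(keep in sanitized for keep in retain)   # retention of non-sensitive facts
```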
Competition and scoreboard (vanilla set):
- Frontier LLMs like o3 and GPT-5 get around 70% Full Successful Record, like getting an A-.
- PRIVASIS-CLEANER-4B scores about 72.5%, an A, despite being much smaller. That's like a lightweight sprinter beating a heavyweight champ in a specific race.
- Many models have high per-attribute success (>90%) but still fail the record because missing just one target is common.
- Retention matters: frontier models sometimes over-edit; PRIVASIS-CLEANER tends to keep non-sensitive info (≈99% retention on vanilla), which preserves usefulness.
Hard test set:
- Everyone's scores drop sharply because targets are grouped and instructions are longer and trickier.
- GPT-5 is best at about 13%, while PRIVASIS-CLEANER-4B is close behind (~12-13%), matching frontier-level performance.
Surprising findings:
- Direct leaks are still the most common failure; even powerful models sometimes leave the exact sensitive string in headers or signatures.
- The gap between attribute-level success (>90%) and full-record success (~70% or lower) shows how one slip-up ruins privacy.
- The compact model's strong retention suggests that specialized training on the parallel corpus teaches "edit just the sensitive bits."
- Cross-dataset generalization: PRIVASIS-CLEANER-4B matches a model trained directly on another benchmark (NaP²) without ever seeing it (both at ≈10% leak ratio on NaP²), and still performs far better back on PRIVASIS, signaling robustness.
Category-wise struggles:
- Business & Finance and Health & Wellness are toughest, likely because names, dates, employers, and addresses are dense and repeated in these documents.
Takeaway: A targeted training set plus a decomposition-based pipeline lets a small model beat or match giants on a strict, real-world-style privacy test.
05 Discussion & Limitations
Limitations:
- Synthetic ≠ real: Even very realistic made-up records may miss some messy, real-world quirks (typos, code-mixed languages, edge-case formats).
- Demographic balance: Some groups (e.g., certain ethnicity labels) appear underrepresented; future work should tune sampling to better reflect global populations.
- Hard context still hard: Grouped, context-heavy attributes (like "all location-related info") are challenging; scores on the hard set show there's room to grow.
- Leakage beyond text: The work focuses on textual sanitization, not side channels like file metadata or images in PDFs.
Required resources:
- Generation at scale uses API calls to multiple LLMs; though costs are reported (e.g., thousands of records ≈ low thousands of dollars), reproducing the entire dataset needs budget and orchestration.
- Training compact models can be done on a few GPUs, but evaluating large test sets with LLM-based checks also needs compute.
When not to use this directly:
- Don't use synthetic records for real clinical, legal, or financial decisions; they're for research and training only.
- Don't assume sanitization is perfect for safety-critical releases; human or policy checks may still be necessary.
- Don't apply the method to non-text modalities (images, audio) without adapting it; leak patterns differ.
Open questions:
- Better diversity signals: Can we design even stronger, cheaper diversity metrics than Vendi for text collections?
- Robust grouped-target handling: How can models more reliably detect and abstract whole attribute groups across long contexts?
- Stronger inference defenses: Beyond string replacement, how can sanitizers block indirect clues that let evaluators guess sensitive info (e.g., via linked domains or fixed patterns)?
- Multilinguality at scale: Early signs are positive, but how well do these techniques hold across many languages and scripts?
- Policy-aware sanitization: Can we blend legal/privacy rules (like GDPR principles) into instructions so rewrites satisfy both technical and regulatory needs?
06 Conclusion & Future Work
Three-sentence summary: PRIVASIS builds a million-scale, fully synthetic, richly annotated dataset of privacy-heavy documents, created from scratch with guidance hints and diversity-preserving refinement. From it, the authors construct a sanitization training corpus and a chunk-based rewriting pipeline that teach small models to remove or blur sensitive information while keeping useful details. The resulting compact sanitizer outperforms or matches much larger frontier models on strict tests, especially on the standard (vanilla) set.
Main achievement: Showing that careful synthetic data plus a decomposition-based sanitization approach can unlock reliable, fine-grained privacy editing, even with compact models that can run locally.
Future directions:
- Improve handling of grouped/contextual targets and reduce inference/proximity leaks.
- Expand demographic and domain balance, and scale to more languages and formats (emails, forms, PDFs).
- Tie instructions to policy frameworks so sanitization is both technically sound and regulation-aware.
Why remember this: It demonstrates a practical path to privacy-by-design: train on large, safe, synthetic data and deploy small, local models that clean text before it leaves your device, helping AI assistants be truly helpful without oversharing.
Practical Applications
- On-device email cleanup: Automatically abstract addresses, phone numbers, and employer names before sharing a thread.
- Healthcare note sharing: Remove patient identifiers while keeping clinical findings for team discussions.
- Customer support logs: Drop sensitive tokens or IDs but retain troubleshooting steps and outcomes.
- Legal document redaction: Blur dates and parties as instructed while preserving clause meanings.
- HR analytics: Sanitize employee records (names, IDs) so trends can be analyzed safely.
- App telemetry minimization: Strip personally identifying strings from logs on the device before upload.
- Education research: Share sanitized student essays for writing analysis without exposing identities.
- Financial reporting: Remove account numbers and exact salaries but keep aggregate amounts.
- Calendar sharing: Replace exact locations with city-level info when publishing schedules.
- Agent pipelines: Insert a compact sanitizer step that cleans inputs before an assistant processes them.