VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
Key Summary
- VOYAGER is a training-free way to make large language models (LLMs) create data that is truly different, not just slightly reworded.
- It uses a math idea called volume (from the determinant of a similarity matrix) to measure how much the dataset spreads out, which equals how diverse it is.
- An 'anchor set' keeps a small, best-of-the-best, representative collection so we only accept new examples that expand the overall diversity.
- Determinantal Point Processes (DPPs) help pick both diverse anchors and diverse 'explorer' prompts without needing model weights or logits.
- Textual gradients refine prompts using plain language feedback so future generations avoid already-covered regions.
- Across creative and reasoning tasks, VOYAGER boosts diversity by about 1.5–3x compared to strong baselines, while keeping quality steady.
- It scales to closed-source models and avoids expensive fine-tuning or special decoding tricks.
- Models trained on VOYAGER data perform better and need fewer examples, showing diversity is not just nice to have; it improves learning.
- The method is efficient because it only computes diversity against a small anchor set instead of the entire dataset.
- VOYAGER's ideas can guide safer, broader, and fairer synthetic data generation in many domains.
Why This Research Matters
Diverse synthetic data trains models that handle real-world variety better, making them more robust and fair. VOYAGER shows we can get this diversity without retraining or peeking at model weights, which is crucial for closed-source systems. By directly optimizing a principled diversity measure (volume), it avoids the pitfalls of token-level randomness and vague "be diverse" instructions. The method's efficiency and strong results mean teams can affordably generate better training and evaluation sets. This leads to improved downstream performance, fewer examples needed, and stronger generalization. Ultimately, it supports safer, broader, and more reliable AI deployments in education, healthcare, business, and beyond.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook): You know how a music playlist feels boring if every song sounds almost the same? Even if there are 100 songs, if they're all similar, the playlist doesn't feel big or exciting.
🥬 Filling (The Actual Concept):
- What it is: This paper tackles how to get Large Language Models (LLMs) to generate synthetic datasets that are truly diverse, full of different ideas, styles, and topics, without retraining the model.
- How it works (story of the field):
- Before: LLMs were already great at generating lots of text quickly. When real data was hard to collect, people used LLMs to make "synthetic" examples (like extra math problems or short stories) to train other models.
- The problem: These synthetic datasets often lacked variety. Many examples sounded too similar (a problem called mode collapse), even when people turned up the randomness (temperature) or tried top-p sampling. These methods only tinker with the next-word choices, not with the big-picture diversity across the whole dataset.
- Failed attempts:
- Sampling tricks (temperature, top-p, min-p) can still drift toward the same styles or topics; they don't ensure global diversity and can even harm coherence.
- Prompt-based diversity (like "be diverse" or listing topics) helps a bit, but needs lots of human planning and doesn't guarantee examples aren't semantically redundant.
- Training-based fixes (RL with diversity rewards) can work but are expensive, need open weights, and aren't usable with closed models.
- The gap: We needed a training-free, closed-model-friendly method that looks at diversity globally, not just token-by-token, and scales to big datasets.
- Why it matters: In real life, we want helpful, wide-ranging data (covering many styles, topics, and ways of thinking) so downstream models learn better and are more robust. Without true diversity, systems can become narrow, biased, and less reliable.
- Why it matters (what breaks without it): If synthetic data repeats the same ideas, models trained on it won't generalize well. It's like practicing the same math problem over and over: you'll ace that one, but struggle with new ones.
🍞 Bottom Bread (Anchor): Imagine building a school library. If every new book is just another copy of the same story, students won't learn much. VOYAGER is like a smart librarian who only adds books that bring something new (different topics, voices, and ideas) so the library gets truly richer.
---
🍞 Top Bread (Hook): Imagine you ask a robot to write 100 jokes, but 90 of them are just the same joke with a tiny twist. Still not very funny as a set!
🥬 Filling (The Concept: Large Language Models):
- What it is: Large Language Models are computer programs trained to predict and generate text that sounds human.
- How it works:
- Learn patterns from tons of text.
- Given a prompt (like "Tell me a joke"), predict the next word over and over to build sentences.
- Adjust sampling knobs (like temperature) to make output more or less random.
- Why it matters: They're fast helpers for making lots of text, but without careful guidance, they repeat themselves and shrink the variety of ideas.
🍞 Bottom Bread (Anchor): Ask an LLM for 100 animal facts. Without a plan, it might talk about lions 50 times. We need a way to nudge it toward whales, insects, birds, reptiles, everything.
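To make the "sampling knobs" above concrete, here is a tiny sketch of temperature-scaled sampling over a made-up next-word distribution (the vocabulary and scores are invented for illustration; a real LLM does this over tens of thousands of tokens):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Divide the scores by the temperature, softmax them, then sample one index."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
vocab = ["lion", "whale", "beetle", "sparrow"]   # toy vocabulary
logits = [3.0, 1.5, 0.5, 0.2]                    # toy model scores

for t in (0.3, 1.0, 2.0):
    picks = [vocab[sample_with_temperature(logits, t, rng)] for _ in range(10)]
    print(f"temperature {t}: {picks}")
```

Low temperature repeats the top choice and high temperature spreads the picks out, but either way the knob only reshuffles single word choices; nothing here looks at how different the finished outputs are from each other, which is exactly the gap the paper targets.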
---
🍞 Top Bread (Hook): Think about a sticker collection: if you only collect cat stickers in different colors, it still feels like one kind of sticker.
🥬 Filling (The Concept: Synthetic Data Diversity):
- What it is: Synthetic data diversity means the generated data covers many different meanings, topics, and styles, not just surface changes.
- How it works:
- Measure differences between items, not just words but meanings.
- Prefer items that add something new, not more of the same.
- Keep a small, representative set to compare against so you don't repeat yourself.
- Why it matters: Low diversity means narrow learning; models trained on such data won't handle real-world variety well.
🍞 Bottom Bread (Anchor): If your practice questions are all about adding apples, you'll stumble on a question about dividing cookies. Diverse practice leads to stronger skills.
---
🍞 Top Bread (Hook): Turning up the volume on your radio won't change the song; it just makes the same thing louder.
🥬 Filling (The Concept: Token Sampling vs. Global Diversity):
- What it is: Token sampling (like high temperature) changes the next-word choices, but global diversity is about how different the final examples are from each other.
- How it works:
- Token sampling tweaks local randomness.
- Global diversity checks the whole dataset and asks: are we covering new areas?
- Without a global view, you can still produce very similar outputs overall.
- Why it matters: You can't fix whole-library sameness by only changing single-word choices in each book.
🍞 Bottom Bread (Anchor): You can shuffle the order of steps in many recipes, but if they're all pasta recipes, your cookbook still lacks variety.
02 Core Idea
🍞 Top Bread (Hook): You know how a class map gets better when each student pins a different place, not the same hometown over and over?
🥬 Filling (The Aha! Moment):
- What it is: VOYAGER treats dataset diversity as a geometric volume problem: keep examples that expand the space and skip ones that don't, using Determinantal Point Processes (DPPs) and prompt refinement with textual gradients, all without training the model.
- How it works:
- Maintain a small anchor set of representative examples.
- For each new candidate, compute its marginal gain in "volume" (based on the determinant of a similarity matrix). Keep it only if it expands the space enough.
- Periodically prune anchors by sampling a diverse subset with a k-DPP (which favors high-volume sets).
- If many candidates are rejected, refine future prompts using textual gradients so explorers search new regions.
- Why it matters: This directly optimizes global diversity, not just token-level randomness, and it works with closed models because it only needs text I/O and embeddings.
🍞 Bottom Bread (Anchor): Building a Lego city, you keep the new pieces that add new shapes (bridges, towers, parks) and skip duplicates. Over time, your city spreads out and becomes truly varied.
---
Multiple analogies:
- Art gallery analogy:
- Imagine curating an exhibit. You keep adding artworks only if they bring a new style or idea. The determinant-as-volume is the gallery's "freshness meter," and DPP is your assistant who picks a balanced, varied set from many candidates. Textual gradients are the curator's notes to artists about what themes to explore next.
- Picnic basket analogy:
- You want a picnic with variety: savory, sweet, crunchy, fruity. The anchor set is your current basket; marginal gain says, "Does this add a new taste?" DPP helps pick a balanced set. Textual gradients are suggestions like "we need more citrus" to guide next choices.
- Map exploration analogy:
- Explorers search the map. The anchor set bookmarks key areas already visited. Marginal gain says, "Is this a new region?" DPP picks a spread-out subset of bookmarks. Textual gradients rewrite mission instructions so the next explorers head to unexplored zones.
Before vs. After:
- Before: We tweaked sampling knobs or wrote longer prompts, hoping for variety; results drifted into similar clusters or required costly training.
- After: We explicitly measure and maximize dataset volume, prune smartly with DPPs, and use language-only feedback to steer prompts; no weight updates needed.
Why it works (intuition):
- The determinant of a similarity matrix equals the squared volume spanned by the data representations; higher volume means items point in different directions (less redundancy). k-DPP sampling tends to choose points that are far apart in this space. Marginal gain accepts only items that expand the volume. Textual gradients change the prompt to intentionally search new parts of the space.
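To see the volume intuition in numbers, here is a minimal numpy sketch using made-up 2-D embeddings and an RBF-style kernel (the actual embedding model and kernel in the paper may differ):

```python
import numpy as np

def similarity_kernel(X, bandwidth=1.0):
    """RBF kernel over row vectors: K[i, j] = exp(-bandwidth * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-bandwidth * np.maximum(dist2, 0.0))

def volume(X):
    """Determinant of the kernel matrix: larger means the rows span more 'volume'."""
    return np.linalg.det(similarity_kernel(X))

anchors   = np.array([[1.0, 0.0], [0.9, 0.1]])   # two near-duplicate directions
redundant = np.array([[0.95, 0.05]])             # candidate very close to the anchors
novel     = np.array([[0.0, 1.0]])               # candidate pointing somewhere new

base = volume(anchors)
print("ratio for redundant candidate:", volume(np.vstack([anchors, redundant])) / base)
print("ratio for novel candidate:    ", volume(np.vstack([anchors, novel])) / base)
```

Because the kernel has ones on its diagonal, this ratio always lands between 0 and 1: the near-duplicate drives it toward 0, while the genuinely new direction keeps it close to 1, which is what the acceptance threshold tests.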
Building blocks (each with a sandwich explanation when first used):
-
🍞 You know how a photo collage looks better when pictures aren't near-duplicates? 🥬 Determinantal Point Processes (DPPs):
- What it is: A method that picks subsets that are naturally diverse.
- How it works: Build a similarity matrix; subsets with larger determinants (volume) are more likely.
- Why it matters: It bakes in a preference for coverage over repetition (a small selection sketch appears after this list). 🍞 Anchor: From many nearly similar selfies, a DPP picks a spread (close-ups, landscapes, group shots) for a balanced album.
-
🍞 Imagine measuring how much space your toy pile occupies. 🥬 Determinant-as-Volume:
- What it is: The determinant of a similarity matrix reflects the squared volume spanned by data embeddings.
- How it works: Turn texts into vectors, build a kernel matrix, compute its determinant; bigger means more spread.
- Why it matters: It's a principled, global diversity score. 🍞 Anchor: Ten stories that share few ideas fill a bigger "idea box" than ten versions of the same tale.
-
🍞 Think of a highlight reel of the best, most different plays. 🥬 Anchor Set:
- What it is: A small, representative subset we keep updating.
- How it works: Accept a new item only if it increases volume by a threshold; periodically prune via a k-DPP.
- Why it matters: Keeps comparisons cheap and the dataset expanding smartly. 🍞 Anchor: A teacher's bulletin board shows varied student work, guiding newcomers to make something different.
-
🍞 If your instructions send every explorer to the same hill, change the instructions. 🥬 Textual Gradients (Prompt Refinement):
- What it is: Language-based feedback that explains what to change in prompts to get more diverse outputs.
- How it works: Ask an LLM to critique rejections, extract reasons, then rewrite the prompt to target missing regions.
- Why it matters: Steers the search without retraining or logits. 🍞 Anchor: After seeing many bird poems, feedback like "add ocean sounds and machinery" pushes the next poems toward new imagery.
-
🍞 Picking kids for a relay from different skill sets makes a stronger team. 🥬 Diverse Explorers:
- What it is: Keep multiple, diverse prompts (explorers) active.
- How it works: After each round, use a DPP to pick a varied set of successor prompts.
- Why it matters: Parallel, diverse search speeds up covering the space. 🍞 Anchor: One explorer tries puzzles, another tries stories, another tries dialogues; together they discover more.
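As referenced in the DPP block above, here is a small sketch of the "pick a spread-out subset" idea. A true k-DPP samples subsets with probability proportional to their determinant; this sketch swaps in a simpler greedy determinant-maximizing pass, which captures the same volume preference but is only an approximation, not the paper's exact sampler:

```python
import numpy as np

def greedy_diverse_subset(K, k):
    """Greedily pick k indices that keep det(K[S, S]) as large as possible."""
    selected = [0]                      # start from item 0 for simplicity
    while len(selected) < k:
        remaining = [i for i in range(K.shape[0]) if i not in selected]
        best = max(remaining,
                   key=lambda i: np.linalg.det(K[np.ix_(selected + [i], selected + [i])]))
        selected.append(best)
    return selected

# Toy embeddings: items 0-2 are near-duplicates, item 3 points somewhere different.
X = np.array([[1.0, 0.0], [0.99, 0.01], [0.98, 0.02], [0.0, 1.0]])
sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
print(greedy_diverse_subset(K, 2))      # keeps one of the duplicates plus item 3
```

The selected pair spans far more volume than any two of the near-duplicates would, which is the behavior VOYAGER relies on both for pruning anchors and for picking varied explorer prompts.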
03 Methodology
At a high level: Input → Generate candidates (explorers produce batches) → Check marginal gain vs. anchor set → Keep only volume-boosting items → Update/prune anchors with a k-DPP → Refine explorers via textual gradients → Output a diverse dataset when target size is reached.
Ingredients and inputs:
- Task prompt p (what to generate), target dataset size l, marginal gain threshold τ, number of explorers b, anchor size k, max iterations T, similarity kernel K.
Step-by-step (with sandwiches for key steps):
- Initialize
- 🍞 Hook: Starting a treasure hunt, you set up a small map and pick a few scouts.
- 🥬 Concept: Initialization
- What it is: Start with an empty dataset D, empty anchor set Φ, and an initial explorer E = {p}.
- How it works: D ← {}, Φ ← {}, E ← {p}.
- Why it matters: You need a place to store treasures (data) and a small guide set (anchors) to measure what's new.
- 🍞 Anchor: Your first scout gets simple instructions: "Find interesting places."
- Explore (batch generation)
- 🍞 Hook: A scout brings back a bag of stones from a riverbank.
- 🥬 Concept: Explorer
- What it is: An explorer is a prompt that generates a batch of candidates.
- How it works: Call the LLM once per explorer to get a batch B (e.g., 10 candidates).
- Why it matters: Batches give variety per round without too many calls.
- 🍞 Anchor: A poem explorer returns 10 short poems in different voices.
- Score marginal gain (keep only what expands volume)
- 🍞 Hook: Only add a new Lego piece if it makes your build bigger in a new direction.
- 🥬 Concept: Marginal Gain via Determinant
- What it is: A score of how much a candidate increases the anchor setâs volume.
- How it works:
- Embed texts and compute the similarity kernel matrix K(Φ) for the current anchors.
- Tentatively add candidate w to the anchors to get K(Φ ∪ {w}).
- Compute γ = det(K(Φ ∪ {w})) / det(K(Φ)).
- If γ ≥ τ, accept w into D and Φ; else reject into R (a code sketch of the full loop appears after this step-by-step list).
- Why it matters: Guarantees each accepted item adds new coverage, not sameness.
- 🍞 Anchor: If a new sports sentence is too similar to existing ones, it's rejected; a sentence about fencing tactics might pass because it adds a new semantic direction.
- Update anchors (prune with a k-DPP)
- 🍞 Hook: When your highlights board gets crowded, keep a balanced top set.
- 🥬 Concept: Anchor Pruning via k-DPP
- What it is: Keep only k anchors that together span a large volume.
- How it works:
- From the augmented anchor pool A, build kernel L = K(A).
- Sample k items with a k-DPP (which favors diverse subsets).
- Set Φ ← sampled subset.
- Why it matters: Controls computation and keeps anchors maximally informative.
- 🍞 Anchor: Out of many candidate poems, you keep 10 that cover different forms and topics.
- Prompt refinement using textual gradients
- 🍞 Hook: If explorers keep finding the same hillside, change the instructions to push them to valleys and lakes.
- 🥬 Concept: Textual Gradients
- What it is: Language feedback that diagnoses why rejections happened and prescribes how to adjust prompts.
- How it works:
- If R (rejects) is non-empty, ask an LLM to critique: what's missing or repeated?
- Get m gradient messages (e.g., "add constraints to target underrepresented styles").
- Ask the LLM to apply each gradient to produce successor prompts.
- Collect successor explorers C.
- Why it matters: It moves search to unexplored regions without touching model weights.
- 🍞 Anchor: Feedback like "too many team-sport sentences; request solo sports and lesser-known rules" yields updated prompts that head elsewhere.
- Pick diverse explorers for the next round
- 🍞 Hook: Send a mixed crew: a mountain climber, a sailor, a desert walker.
- 🥬 Concept: Diverse Explorer Selection via DPP
- What it is: From candidate explorers C, pick b that are maximally different.
- How it works: Build a similarity matrix over prompt texts and sample with a k-DPP to get b explorers.
- Why it matters: Diverse search reduces rejection rates and finds new regions faster.
- 🍞 Anchor: One explorer asks for dialogue, another for riddles, another for factual mini-essays.
- Repeat until done
- 🍞 Hook: Keep collecting until your museum has all the wings you planned.
- 🥬 Concept: Termination
- What it is: Stop when |D| ≥ l or max iterations T is reached.
- How it works: Loop outer rounds; each round grows D with accepted items.
- Why it matters: Guarantees a dataset of the desired size while maximizing spread.
- 🍞 Anchor: You aimed for 500 items; stop once the diverse set reaches that mark.
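Putting the steps above together, here is a minimal sketch of the outer loop (referenced from the marginal gain step). The functions llm_generate, embed_texts, and refine_prompts are hypothetical stubs standing in for the real LLM call, sentence embedder, and textual-gradient critique, and the k-DPP pruning is again approximated with a greedy pass; parameter names follow the ingredients list (target size l, threshold tau, explorers b, anchor size k, iterations T):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical stubs (not the paper's code): a real system would call an LLM,
# --- a sentence embedder, and an LLM-based critique here. -----------------------
TOPICS = ["soccer", "fencing", "archery", "curling", "sumo", "biathlon",
          "soccer", "soccer", "soccer"]          # repeats mimic mode collapse
_topic_ids = {}

def llm_generate(prompt, batch_size):
    """Stub for one LLM call per explorer: returns a batch of candidate texts."""
    return [f"{prompt} about {rng.choice(TOPICS)} #{rng.integers(1000)}"
            for _ in range(batch_size)]

def embed_texts(texts):
    """Stub embedder: each distinct topic gets its own unit basis vector, so
    near-duplicates share an embedding (a real sentence encoder goes here)."""
    vecs = []
    for t in texts:
        topic = t.split(" about ")[-1].split(" #")[0]
        idx = _topic_ids.setdefault(topic, len(_topic_ids))
        v = np.zeros(32)
        v[idx] = 1.0
        vecs.append(v)
    return vecs

def refine_prompts(prompt, rejected):
    """Stub for the textual-gradient step: VOYAGER would ask an LLM to critique
    the rejects and rewrite the prompt; here we just append a nudge."""
    return [prompt + " (cover topics not used so far)"]

# --- Volume bookkeeping ---------------------------------------------------------
def kernel(vecs):
    X = np.array(vecs)
    sq = np.sum(X**2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def marginal_gain(anchors, vec):
    """gamma = det(K(anchors + w)) / det(K(anchors)); defined as 1.0 when empty."""
    if not anchors:
        return 1.0
    return np.linalg.det(kernel(anchors + [vec])) / np.linalg.det(kernel(anchors))

def greedy_prune(anchors, k):
    """Greedy volume-maximizing subset: a cheap stand-in for k-DPP sampling."""
    K, chosen = kernel(anchors), [0]
    while len(chosen) < k:
        rest = [i for i in range(len(anchors)) if i not in chosen]
        chosen.append(max(rest, key=lambda i:
                          np.linalg.det(K[np.ix_(chosen + [i], chosen + [i])])))
    return [anchors[i] for i in chosen]

# --- Outer loop -----------------------------------------------------------------
def voyager_loop(p, l=6, tau=0.4, b=2, k=5, T=10, batch=4):
    D, anchors, explorers = [], [], [p]
    for _ in range(T):
        rejected = []
        for prompt in explorers[:b]:
            cands = llm_generate(prompt, batch)
            for text, vec in zip(cands, embed_texts(cands)):
                if marginal_gain(anchors, vec) >= tau:   # keep only volume-boosting items
                    D.append(text)
                    anchors.append(vec)
                else:
                    rejected.append(text)
        if len(anchors) > k:                             # prune the anchor set
            anchors = greedy_prune(anchors, k)
        if rejected:                                     # steer the next round
            explorers = refine_prompts(explorers[0], rejected) + explorers
        if len(D) >= l:
            break
    return D

for item in voyager_loop("Write a sports sentence"):
    print(item)   # near-duplicate topics are rejected; distinct topics accumulate
```

Even with these toy stubs, the loop shows the key behavior: candidates whose embeddings duplicate an existing anchor fail the volume test, and the rejects feed the prompt-refinement step that steers the next round.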
Secret sauce (why the recipe is clever):
- It directly optimizes a principled, global diversity measure (volume via determinant), not just local randomness.
- It keeps cost low by only comparing candidates to a small anchor set, not the whole dataset.
- It prunes and selects with DPPs, which naturally prefer spread-out subsets.
- It steers exploration using textual gradients, purely in text, so it works with closed models.
What breaks without each step:
- No marginal gain threshold → duplicates creep in, shrinking useful variety.
- No anchor pruning → anchors bloat, computations slow, and the signal blurs.
- No textual gradients → explorers repeat mistakes; rejection rate climbs and progress slows.
- No diverse explorer selection → prompts converge to similar areas, reducing global coverage.
Concrete mini-example (sports sentences):
- Start with Φ = {}.
- Explorer E asks for 10 sports sentences.
- 6 are too similar (team-sport clichés) → rejected; 4 about niche sports → accepted.
- k-DPP prunes anchors to the best 3.
- Textual gradients: "Request less-covered sports (fencing, archery), emphasize tactics, avoid scorelines."
- New explorers reflect these instructions; next batches include fresh concepts that pass the marginal gain threshold, growing D with truly different items.
04 Experiments & Results
🍞 Top Bread (Hook): Imagine judging a talent show. You don't just count how many acts you saw; you check how different they were: music, magic, dance, and comedy.
🥬 Filling (The Test):
- What it is: The authors tested VOYAGER on two families of tasks, creative writing (sentences about sports, political conversations, poems, movie plots) and reasoning (grade-school math questions, simple logic puzzles), against strong baselines.
- How it works:
- Generate datasets of the same size with each method.
- Measure diversity with multiple views: lexical distance (Jaccard), semantic distance (cosine on embeddings), and the Vendi score (effective number of distinct items); a small sketch of these metrics follows this block.
- Check quality with an LLM-as-judge rubric tailored to each task.
- Track LLM calls to gauge efficiency.
- Why it matters: Multiple diversity lenses (words, meanings, overall spread) ensure improvements aren't just cosmetic; quality checks confirm diversity doesn't wreck usefulness.
🍞 Bottom Bread (Anchor): It's like grading a classroom project for variety of ideas, clarity, and the number of question-and-answer tries it took to finish.
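As referenced above, here is a small sketch of the three diversity lenses, using toy embeddings, cosine similarity, and the Vendi score computed as the exponential of the eigenvalue entropy of the normalized kernel (the paper's exact embedding model and kernel choices may differ):

```python
import numpy as np

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over word sets: a purely lexical view of difference."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(A & B) / len(A | B)

def cosine_kernel(X):
    """Pairwise cosine similarities between row vectors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def vendi_score(K):
    """exp(entropy of the eigenvalues of K/n): the effective number of distinct items."""
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = lam[lam > 1e-12]                    # drop numerical zeros
    return float(np.exp(-np.sum(lam * np.log(lam))))

dupes  = np.array([[1.0, 0.0, 0.0], [0.99, 0.01, 0.0], [0.98, 0.02, 0.0]])  # near-duplicates
spread = np.eye(3)                                                          # all different

print("Vendi, near-duplicates:", round(vendi_score(cosine_kernel(dupes)), 2))   # about 1
print("Vendi, spread out:     ", round(vendi_score(cosine_kernel(spread)), 2))  # 3.0
print("Jaccard distance:", jaccard_distance("the lion runs fast", "a whale sings slowly"))
```

Three near-duplicates count as roughly one "effective item," while three genuinely different items count as three; that is the sense in which a higher Vendi score means the dataset effectively contains more truly different examples.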
Who it was compared against (the competition):
- DEFAULT (vanilla prompting), TEMP (high temperature), DIVERSE (explicitly ask to be diverse), HISTORY (avoid recent outputs), HIERARCHICAL (first pick topics, then generate), SUBSETSELECT (generate a big pool then pick a diverse subset with a k-DPP).
Scoreboard with context:
- Creative tasks:
- VOYAGER consistently raised Vendi scores the most, roughly tripling vs. DEFAULT on average and beating the strong HIERARCHICAL baseline by a notable margin, while keeping quality similar.
- Example: For sports sentences, VOYAGER's Vendi score was dramatically higher than all baselines and its cosine and lexical distances were the best or near-best, showing stronger semantic and lexical spread.
- Efficiency: While HIERARCHICAL made many LLM calls to plan topics, VOYAGER often needed fewer calls for the same dataset size, thanks to guided exploration and pruning.
- Reasoning tasks:
- VOYAGER again won clearly on Vendi and other diversity metrics, with steady quality. Surprisingly, simply telling the model to "be diverse" sometimes hurt performance (e.g., math question diversity), proving that vague instructions don't guarantee real spread.
Surprising findings:
- "Be diverse" prompts can backfire, especially in structured tasks (math), where diversity needs careful steering to keep problems appropriate and distinct.
- Diverse explorers matter: Picking successor prompts with a DPP improved diversity and reduced total calls vs. picking them randomly.
- Textual gradients lower rejection rates over time: With feedback, the algorithm quickly learns what's missing and targets new regions, needing fewer rounds to reach the dataset goal.
Downstream benefits (why the numbers matter):
- Training on VOYAGER-generated GSM8K-style math data improved zero-shot test accuracy for instruction-tuned models (e.g., Gemma-7B-IT), showing that diverse synthetic data boosts actual model performance, not just scores on diversity metrics.
- Even with fewer training examples (e.g., 500 vs. 1000), VOYAGER's data nearly matched or beat data from the DEFAULT method, proving better data can beat more data.
Concrete interpretation:
- Saying "VOYAGER got a higher Vendi score" means "the dataset effectively contains more truly different examples." Think: a library with many unique books versus a shelf of near-duplicates.
- Getting top cosine distance means examples spread farther apart in meaning-space, not just in word choice.
- Maintaining quality while boosting spread means the outputs stay coherent and on-task.
Takeaway:
- VOYAGER isn't just a new knob; it's a new compass. It improved diversity by about 1.5–3x over popular techniques while holding steady on quality and sometimes using fewer LLM calls than heavy prompt engineering approaches.
05 Discussion & Limitations
🍞 Top Bread (Hook): Even the best hiking map has blank spots, and the smart hiker knows where the cliffs are.
🥬 Filling (Honest Assessment):
- Limitations (what this can't do yet):
- Text-only focus: VOYAGER optimizes diversity for textual data. Multi-modal generation (e.g., images + text) needs new similarity kernels and careful cross-modal volume measures.
- Metric dependence: Results depend on the chosen similarity kernel (embeddings + lexical). If embeddings miss subtle differences, the method may undercount certain kinds of diversity.
- Threshold tuning: The marginal gain threshold τ affects acceptance. Too high rejects too much; too low lets in near-duplicates. The paper proposes a heuristic, but some tasks may need tuning.
- Computation vs. calls: While efficient compared to naive global DPPs, VOYAGER still makes extra LLM calls for feedback and exploration compared to simple baselines.
- Required resources:
- An LLM for generation (can be closed-source), an embedding model for similarity, and the ability to run k-DPP sampling. Some tasks may need modest prompt-engineering expertise to set good initial prompts.
- When not to use:
- If you already have rich, human-curated diversity and just need small augmentations.
- Extremely tiny datasets where DPP selection overhead isn't worth it.
- Tasks where uniformity is desired (e.g., standardized phrasing for legal clauses) rather than broad variety.
- Open questions:
- Multi-modal extension: How to design cross-modal kernels that balance text, audio, image, and video similarities?
- Dynamic thresholds: Can τ be adapted automatically based on observed acceptance rates and target diversity levels?
- Human-in-the-loop: Can periodic human feedback improve the gradient critiques and push exploration toward higher-value regions?
- Safety and fairness: How to combine diversity with safety constraints and representational fairness, ensuring variety without amplifying harmful content?
🍞 Bottom Bread (Anchor): Think of VOYAGER as a great first version of a treasure map: it shows where to look for variety and how to avoid old ground. Next, we can add more legends, scales, and routes for new terrains like images and multimodal tasks.
06 Conclusion & Future Work
Three-sentence summary:
- VOYAGER is a training-free framework that grows a truly diverse synthetic dataset by keeping a compact anchor set, accepting only items that expand geometric volume, pruning with k-DPPs, and steering exploration with textual gradients.
- Across creative and reasoning tasks, it significantly boosts diversity (about 1.5–3x) while preserving quality and improving downstream model performance and data efficiency.
- It works with closed-source LLMs and scales better than naive approaches that try to DPP-sample entire datasets at once.
Main achievement:
- Turning diversity generation into a principled, geometry-based optimization, using determinants as a global spread score, then making it practical with anchors, k-DPPs, and language-only feedback.
Future directions:
- Extend to multimodal data with cross-modal kernels; automate adaptive thresholds; integrate human-in-the-loop feedback; and combine diversity with safety and fairness objectives for balanced exploration.
Why remember this:
- VOYAGER reframes synthetic data generation: don't just hope for variety; measure it as volume and grow it on purpose. That simple, powerful shift leads to richer datasets, better models, and more reliable AI systems.
Practical Applications
- Build richer instruction-tuning datasets so models follow varied user requests more reliably.
- Generate diverse question banks (math, logic, reading) to improve student practice and assessment fairness.
- Create varied dialogue datasets for chatbots to handle many tones, styles, and topics.
- Assemble broad safety and red-teaming prompts that explore many adversarial angles.
- Produce diverse summaries and paraphrases, reducing redundancy for search and recommendation testing.
- Curate balanced content for creative writing assistants (poetry, stories, scripts) covering many genres.
- Construct robust evaluation suites with minimal overlap, improving test coverage of model behaviors.
- Pretrain or finetune smaller models with less data by maximizing data diversity per example.
- Augment domain-specific corpora (e.g., healthcare FAQs) while avoiding near-duplicates.
- Support fairness audits by ensuring representation across demographic, topical, and stylistic axes.