FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
Key Summary
- Large language models usually learn by guessing the next word, then get a tiny bit of instruction-following practice; this paper flips that by turning massive web documents into instruction-and-answer pairs at huge scale.
- FineInstructions collects about 18 million real user queries, turns them into reusable instruction templates, and matches them to documents to create over a billion grounded Q&A training pairs.
- A special retrieval system with an embedding model and a Gaussian pooling trick finds which templates fit which parts of long documents.
- An instantiator model fills the template blanks using the document, and most of each answer is directly quoted from the source (at least 80%), reducing hallucinations.
- A judge model filters low-quality pairs so the final dataset is diverse, realistic, and high quality.
- When models are pre-trained only on these instruction-answer pairs (token-for-token compared to normal pre-training), they beat standard next-token pre-training and strong synthetic baselines on MixEval, MT-Bench-101, and AlpacaEval.
- On a 1.8B-parameter model, FineInstructions improves MixEval by about 39% over a 300B-token standard baseline and wins head-to-head in AlpacaEval.
- The method is efficient: it uses small distilled helper models (around 1B–3B) to scale generation and filters with an off-the-shelf judge.
- Diversity is high: millions of distinct templates are used, and no single template dominates the dataset.
- This shifts pre-training closer to how people actually use LLMs—answering instructions—so models generalize better to real user tasks.
Why This Research Matters
People don’t talk to LLMs by asking them to guess the next word; they give instructions and expect helpful answers. FineInstructions realigns training with this reality by turning massive web corpora into grounded instruction–answer practice at scale. The result is models that are more helpful on real tasks, from recommendations and comparisons to advice and explanations. Because most answers are quoted from sources, hallucinations are reduced and trust improves. The approach is also compute-efficient, focusing training on the most educational parts of documents and using smart retrieval to cover long texts. This makes it possible to build smaller, cheaper models that still perform impressively well on human-correlated evaluations.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine you’re learning to talk by only guessing the next word in a sentence without anyone ever asking you real questions. You might get good at copying patterns, but you wouldn’t practice answering the kinds of questions people actually ask.
🥬 The Concept (Pre-training):
- What it is: Before this paper, most large language models (LLMs) learned mainly by predicting the next word on huge piles of text.
- How it works: 1) Feed the model lots of books, webpages, and forums. 2) Hide the next word. 3) Ask the model to guess it. 4) Nudge the model closer to the right answer. 5) Repeat billions of times.
- Why it matters: This is where models pick up most of their knowledge. Without it, they don’t know much about the world.
🍞 Anchor: It’s like practicing spelling by filling in missing letters in tons of stories—great for patterns, but not the same as answering someone’s question.
🍞 Hook: You know how a coach gives you drills that look like the real game? LLMs also need practice that looks like real use: following instructions.
🥬 The Concept (Instruction-tuning):
- What it is: A smaller, second training step that teaches a model to follow instructions (like “Summarize this” or “Compare A and B”).
- How it works: 1) Show the model pairs of (Instruction, Answer). 2) Train it to produce the answer when it sees the instruction. 3) Use a relatively tiny dataset compared to pre-training.
- Why it matters: Without this step, models might be smart but unhelpful—like a walking encyclopedia that doesn’t answer your questions clearly.
🍞 Anchor: It’s like learning soccer: first you build fitness (pre-training), then you practice set-plays (instruction-tuning) so you can actually play matches.
The World Before: LLMs were great at patterns but needed help sounding helpful. Instruction-tuning data was scarce: many datasets had only thousands of examples or were too narrow (e.g., only academic tasks). Some bigger instruction datasets came from asking a top model to produce more Q&A, but that mostly taught smaller models to imitate that one big model’s style. There was also a mismatch: models learned by next-word guessing but were used for answering instructions. Plus, much web text contains headers, footers, or chatter—wasting compute on less educational bits.
🍞 Hook: Think of a giant library full of knowledge, but no one has written the practice questions that help you learn it well.
🥬 The Concept (Synthetic instruction-answer pairs):
- What it is: Automatically created question-and-answer pairs, made from documents, that teach a model how to respond to real instructions.
- How it works: 1) Find or create an instruction template. 2) Match it to a document that contains the needed info. 3) Fill the template using the document. 4) Extract a grounded answer mostly by quoting the document.
- Why it matters: Without these pairs at scale, models don’t get enough practice answering the kinds of questions real users ask.
🍞 Anchor: Like teachers writing practice questions based on a textbook chapter, then giving you the answer pulled straight from that chapter.
Failed Attempts: Prior methods tried transforming web pages into Q&A or summaries using generic prompts or a small set of hand-made templates. These helped, but often produced narrow styles (e.g., only reading-comprehension) or relied too much on the generating model’s habits, limiting diversity. Others rephrased web data to be cleaner, which is useful but doesn’t directly practice instruction-following.
The Gap: We need a way to turn internet-scale documents into internet-scale, realistic instruction-answer training pairs, grounded in the source, and as diverse as real user requests.
🍞 Hook: Picture an assembly line that can make endless, different worksheets from real student questions and real textbooks.
🥬 The Concept (FineInstructions pipeline):
- What it is: A procedure that converts huge web corpora into billions of grounded instruction-and-answer pairs using 18M real-user-derived instruction templates.
- How it works: 1) Collect real user queries. 2) Convert them into reusable templates. 3) Match templates to documents with an embedding-based retriever that covers different document sections. 4) Fill the templates and extract answers mostly by quoting the document. 5) Judge and filter for quality. 6) Pre-train models only on these pairs.
- Why it matters: Without a scalable, diverse pipeline, instruction data stays small and narrow, and models underperform on real-world tasks.
🍞 Anchor: It’s like taking millions of real exam questions, turning them into question blueprints, matching each blueprint to the right textbook pages, and then auto-generating a clean Q&A packet for study.
Real Stakes: People use LLMs for recommendations, comparisons, advice, and explanations. Training on realistic, grounded instruction-answer pairs makes small models more helpful, reduces hallucinations (by quoting sources), saves compute (by focusing on the most educational parts), and moves training closer to how users actually interact with models. That means better assistants for students, researchers, doctors, and everyday users.
02 Core Idea
🍞 Hook: You know how it’s easier to learn when practice looks like the real test? If the test is answering people’s instructions, then practice should be instruction-answer pairs—not just guessing the next word in a random paragraph.
🥬 The Aha! Moment: Turn the massive pre-training corpus into a giant instruction-following dataset by templating millions of real user questions, matching them to documents, and extracting grounded answers—so we can pre-train entirely in instruction format.
Multiple Analogies:
- Teacher analogy: Instead of only reading textbooks, we also convert every chapter into many realistic practice questions and answers, so class drills look like the final exam.
- Factory analogy: We build an assembly line where blueprints (templates) get matched to the right raw materials (documents), then a machine fills the blanks and a quality checker keeps only the best parts.
- Map analogy: We don’t just wander the web; we use a GPS (embedding retriever with Gaussian pooling) to jump to the exact neighborhood in a long document that fits a given question pattern.
Before vs. After:
- Before: Pre-training mostly guessed the next word; instruction-tuning used tiny datasets; synthetic approaches often had narrow styles or overfit a single model’s voice.
- After: We can generate billions of grounded, diverse, realistic instruction-answer pairs and pre-train solely on them, leading to higher scores on human-correlated benchmarks and better real-world responses.
Why It Works (intuition, no equations):
- Using real-user-derived templates means the questions look like what people truly ask, not just academic multiple-choice or reading-comprehension formats.
- Matching templates to documents with embeddings ensures the question is answerable by that document. Gaussian pooling carves a long document into multiple semantic regions so we retrieve templates relevant to different chunks, not just the average.
- Instantiating templates and quoting at least 80% of answers from the source grounds responses and reduces hallucinations.
- Judging and filtering keeps only high-quality pairs, improving the training signal.
- Training in instruction format aligns the learning objective with how users interact with LLMs, improving generalization and helpfulness.
Building Blocks (each piece as a mini sandwich):
🍞 Hook: Libraries organize books so you can find what you need fast. 🥬 The Concept (Embedding models):
- What: A way to turn text into vectors so similar meanings end up close together.
- How: 1) Read text. 2) Map it to a vector. 3) Compare vectors by similarity to find matches.
- Why: Without embeddings, we can’t efficiently match templates to the right documents. 🍞 Anchor: It’s like shelving books by topic so a “compare A and B” template finds a document that actually compares two things.
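Here is a minimal sketch of that matching idea: embed a template's "compatible document description" and some document descriptions, then rank by cosine similarity. The encoder name below is just a stand-in (the paper fine-tunes BGE-M3); any dense sentence encoder illustrates the point.

```python
# Minimal sketch of embedding-based template/document matching.
# Assumes the sentence-transformers library; the model name is a stand-in
# for the paper's fine-tuned BGE-M3 encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

template_desc = "A document comparing two products on a specific characteristic."
doc_descs = [
    "A review comparing two smartwatches on ruggedness and battery life.",
    "A recipe blog post about maintaining a sourdough starter.",
]

t_vec = encoder.encode(template_desc, normalize_embeddings=True)
d_vecs = encoder.encode(doc_descs, normalize_embeddings=True)

# Cosine similarity: higher means the template is more likely answerable
# by that document.
scores = util.cos_sim(t_vec, d_vecs)[0]
for desc, score in zip(doc_descs, scores):
    print(f"{float(score):.3f}  {desc}")
```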
🍞 Hook: A loaf is easier to slice than to eat whole. 🥬 The Concept (Gaussian pooling):
- What: A way to make multiple focused embeddings from different parts of a long document.
- How: 1) Place soft “windows” (Gaussians) across the text. 2) Pool token representations inside each window. 3) Get K local vectors plus a global one.
- Why: Without this, one global average mashes everything together and misses local details. 🍞 Anchor: Like taking several close-up photos of a mural so each picture captures a different scene clearly.
🍞 Hook: Teachers write question templates so they can make lots of new questions fast. 🥬 The Concept (Instruction templates):
- What: Generalized question blueprints made from real user queries with fill-in-the-blank tags (like <fi>Entity</fi>).
- How: 1) Collect real queries. 2) Replace specific names with descriptive tags. 3) Save as reusable patterns.
- Why: Without templates at scale, we can’t cover the huge variety of real tasks. 🍞 Anchor: “Between <fi>Entity A</fi> and <fi>Entity B</fi>, which is more <fi>characteristic</fi>?” works for watches, athletes, or cities.
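To make the template format concrete, here is a tiny, hypothetical sketch that fills <fi>…</fi> slots by simple string substitution; the real pipeline uses a trained Instantiator model and a matched document rather than hand-supplied values.

```python
import re

# Hypothetical illustration of the <fi>...</fi> slot format.
# The actual pipeline fills slots with a trained Instantiator model.
template = ("Between <fi>Entity A</fi> and <fi>Entity B</fi>, "
            "which is more <fi>characteristic</fi>?")

def fill_template(template: str, values: list[str]) -> str:
    """Replace each <fi>...</fi> slot, in order, with a concrete value."""
    slots = re.findall(r"<fi>(.*?)</fi>", template)
    assert len(slots) == len(values), "need exactly one value per slot"
    filled = template
    for slot, value in zip(slots, values):
        filled = filled.replace(f"<fi>{slot}</fi>", value, 1)
    return filled

print(fill_template(template, ["Apple Watch Ultra", "Garmin Fenix", "rugged"]))
# -> Between Apple Watch Ultra and Garmin Fenix, which is more rugged?
```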
🍞 Hook: Matching puzzle pieces is easier if you can measure how well they fit. 🥬 The Concept (FAISS retrieval index):
- What: A fast search system to find nearest neighbors among millions of embeddings.
- How: 1) Embed templates and documents. 2) Build an index. 3) Query to find the closest matches quickly.
- Why: Without fast retrieval, scaling to millions of templates would be too slow. 🍞 Anchor: It’s like a super-speedy librarian who can instantly find all shelves related to your topic.
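A minimal FAISS sketch, assuming L2-normalized embeddings so that inner product equals cosine similarity; the paper's index over roughly 18M templates is built the same way, just at much larger scale.

```python
import faiss
import numpy as np

# Toy example: index template embeddings, then query with a document embedding.
# Embeddings are L2-normalized so inner product == cosine similarity.
dim = 128
rng = np.random.default_rng(0)

template_vecs = rng.standard_normal((10_000, dim)).astype("float32")
faiss.normalize_L2(template_vecs)

index = faiss.IndexFlatIP(dim)  # exact inner-product search
index.add(template_vecs)

doc_vec = rng.standard_normal((1, dim)).astype("float32")
faiss.normalize_L2(doc_vec)

scores, ids = index.search(doc_vec, 6)  # top-6 candidate templates
print(ids[0], scores[0])
```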
🍞 Hook: Fill-in-the-blank worksheets become real when you write in the answers. 🥬 The Concept (Instantiator model):
- What: A model that fills template blanks from the matched document and extracts a grounded answer.
- How: 1) Read the document chunk. 2) Fill each <fi>…</fi> slot. 3) Quote most of the answer (≥80%) from the document.
- Why: Without a careful filler, you’d get vague or hallucinated answers. 🍞 Anchor: For a smartwatch review, it fills in “Apple Watch Ultra” and “Garmin Fenix” and quotes the ruggedness claims from the article.
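One way to sanity-check the "mostly quoted" property is to measure how much of an answer is covered by verbatim substrings of the source document. The sketch below is a rough, greedy illustration of that idea, not the paper's exact grounding metric.

```python
def excerpt_fraction(answer: str, document: str, min_len: int = 20) -> float:
    """Rough estimate of how much of `answer` is copied verbatim from `document`.

    Greedily marks answer characters covered by substrings (at least `min_len`
    characters long) that also appear in the document. Illustrative only; the
    paper's exact >=80% grounding check may be computed differently.
    """
    covered = [False] * len(answer)
    i = 0
    while i < len(answer):
        # Grow the longest answer substring starting at i that occurs in the document.
        j = i + min_len
        best_end = None
        while j <= len(answer) and answer[i:j] in document:
            best_end = j
            j += 1
        if best_end is not None:
            for k in range(i, best_end):
                covered[k] = True
            i = best_end
        else:
            i += 1
    return sum(covered) / max(len(answer), 1)

doc = "The Apple Watch Ultra is built for rugged outdoor use with a titanium case."
ans = "According to the review, the Apple Watch Ultra is built for rugged outdoor use."
print(f"{excerpt_fraction(ans, doc):.2f}")  # roughly 0.6-0.7 for this toy pair
```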
🍞 Hook: A science fair judge keeps only the best projects. 🥬 The Concept (Judge model):
- What: An automatic rater that scores instruction-answer pairs and filters out low quality.
- How: 1) Read the pair. 2) Score it on relevance and clarity. 3) Keep only high-scoring ones.
- Why: Without judging, noisy data would dilute learning. 🍞 Anchor: If an answer strays off-topic, the judge rejects it, keeping the dataset tidy and helpful.
03 Methodology
High-level overview: Input (real user queries + pre-training documents) → [Query Genericizer: make templates] → [Template–Document Matching: embeddings + FAISS + Gaussian pooling] → [Instantiator: fill blanks + quote answers] → [Judge & Filter] → Output (billion-scale instruction–answer pairs) → Pre-train LLM in instruction format.
Step A: Collect and genericize real queries into templates
- What happens: Gather ~18M real user queries from many sources (e.g., forums, chat logs, Q&A sites). Filter unsafe content. Train a small Query Genericizer (a 1B-parameter Llama-3.2 variant) on ~50K silver examples to turn concrete queries into reusable templates with <fi>…</fi> tags and a short “compatible document description.”
- Why this step: Without real-user-shaped templates, the data would look artificial or narrow. Templates let one pattern cover many topics.
- Example: “Between Lionel Messi and Cristiano Ronaldo which is more influential?” → “Between <fi>Entity A</fi> and <fi>Entity B</fi> which is more <fi>characteristic</fi>?”
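The paper's Query Genericizer is a dedicated fine-tuned model; as a rough stand-in, the hedged sketch below shows the same input/output contract by prompting an off-the-shelf small instruct model (the model choice and prompt are assumptions, and outputs will be far noisier than a trained genericizer's).

```python
# Hedged sketch of the genericization step's input/output contract.
# The paper trains a dedicated 1B Query Genericizer; here we only prompt a
# generic small instruct model (model choice is an assumption and the weights
# are gated on Hugging Face; any small instruction-tuned model could stand in).
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")

query = "Between Lionel Messi and Cristiano Ronaldo which is more influential?"
prompt = (
    "Rewrite the user query as a reusable template: replace specific entities "
    "with descriptive <fi>...</fi> slots, then add one sentence describing a "
    "compatible document.\n\n"
    f"Query: {query}\nTemplate:"
)
out = generator(prompt, max_new_tokens=80, do_sample=False)
print(out[0]["generated_text"])
```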
Step B: Match templates to documents (covering long docs carefully)
- What happens: Use an embedding model (BGE-M3) to embed template “compatible document descriptions” and build a FAISS index. Convert each document into a short description, embed it too, and retrieve likely templates. Fine-tune the embedding model twice using LLM-judged hard positives/negatives so template–document compatibility is captured directly. Add Gaussian pooling so each long document yields K=5 local embeddings (plus one global), making it easier to find templates relevant to different parts.
- Why this step: Without accurate, fast matching, you’d try to fill templates on the wrong documents or only match to generic summaries that miss local facts.
- Example with data: Take a 3,000-word smartwatch review: we produce 6 embeddings (1 global + 5 local chunks). The “compare A vs. B ruggedness” template aligns strongly with the chunk discussing rugged design. We threshold cosine similarity at 0.865 and sample templates to match real complexity distributions.
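The toy sketch below shows the mechanics of matching against multiple views of one document (1 global + 5 local embeddings), applying the 0.865 threshold, and deduplicating templates across views; random vectors stand in for real embeddings, so almost nothing will actually clear the threshold here.

```python
import faiss
import numpy as np

# Toy sketch: match templates against several views of one long document
# (1 global + 5 local embeddings), keep hits above the 0.865 cosine threshold,
# and remember which chunk each template matched.
dim, n_templates = 64, 1000
rng = np.random.default_rng(1)

template_vecs = rng.standard_normal((n_templates, dim)).astype("float32")
faiss.normalize_L2(template_vecs)
index = faiss.IndexFlatIP(dim)
index.add(template_vecs)

doc_views = rng.standard_normal((6, dim)).astype("float32")  # 1 global + 5 local
faiss.normalize_L2(doc_views)

scores, ids = index.search(doc_views, 6)  # top-6 templates per view

matched = {}
for view_idx in range(len(doc_views)):
    for score, tid in zip(scores[view_idx], ids[view_idx]):
        if score >= 0.865 and int(tid) not in matched:  # threshold from the paper
            matched[int(tid)] = view_idx                # which chunk it matched
print(matched)  # random vectors rarely pass the threshold; real embeddings do
```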
🍞 Hook: A camera zooms in to catch details; a single wide shot blurs them. 🥬 The Concept (Gaussian pooling inside matching):
- What: Produce multiple local embeddings by softly weighting tokens around K evenly spaced centers.
- How: 1) Compute one global mean vector. 2) For each of K centers, apply a Gaussian weight and pool tokens locally. 3) Optionally blend with global. 4) Train so token representations mix less across regions.
- Why: Without local views, long documents’ varied sections get averaged away. 🍞 Anchor: The method found a 0.99 correlation between the selected chunk index and where the final answer quote appears in the document, showing it really hits the right parts.
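A minimal numpy sketch of the pooling idea: K evenly spaced Gaussian windows softly weight tokens by position, giving one global and K local vectors per document. The kernel width and any global/local blending are assumptions, not the paper's exact settings.

```python
import numpy as np

def gaussian_pool(token_embs: np.ndarray, k: int = 5, sigma_frac: float = 0.1):
    """Turn per-token embeddings of shape (T, D) into 1 global + k local vectors.

    Sketch only: k evenly spaced Gaussian windows weight tokens by position,
    and each local vector is the weighted mean inside its window. The kernel
    width (sigma_frac) is an assumption, not the paper's setting.
    """
    T, _ = token_embs.shape
    positions = np.arange(T)
    sigma = max(T * sigma_frac, 1.0)

    global_vec = token_embs.mean(axis=0)

    centers = np.linspace(0, T - 1, num=k)
    local_vecs = []
    for c in centers:
        w = np.exp(-0.5 * ((positions - c) / sigma) ** 2)  # soft window around c
        w = w / w.sum()
        local_vecs.append((w[:, None] * token_embs).sum(axis=0))
    return global_vec, np.stack(local_vecs)

tokens = np.random.default_rng(0).standard_normal((3000, 16))  # toy 3000-token doc
g, locals_ = gaussian_pool(tokens, k=5)
print(g.shape, locals_.shape)  # (16,) (5, 16)
```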
Step C: Instantiate templates and extract grounded answers
- What happens: For each matched (document, template), an Instantiator (a distilled 3B Llama-3.2) fills the template and extracts answer text mainly by quoting the source. To save compute, it emits special <excerpt>…<...>…</excerpt> tags that give only the start and end of a long copied span (with an ellipsis marker in between); the full span is expanded from the source document later, so generation is shorter and cheaper. Answers are ≥80% excerpted text.
- Why this step: Without grounded quoting, hallucinations creep in; without excerpt tags, generation is too slow and costly at scale.
- Example: From a medical article on charting, a user-style instruction “I’m looking for a template to write a medical note” is paired with an answer starting “To write a medical note, consider the following guidelines…” quoting the section that lists the six elements for a confirmed diagnosis.
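To illustrate the excerpt mechanism, the hypothetical sketch below expands an elided excerpt by locating its start and end in the source document. The <excerpt>…<...>…</excerpt> tag syntax follows the description above, but this expansion code and the toy document are our own assumptions.

```python
import re

def expand_excerpts(answer: str, document: str) -> str:
    """Expand <excerpt>START<...>END</excerpt> spans by copying the full
    START...END passage from the source document.

    Illustrative sketch; the paper's actual expansion logic may differ.
    """
    def _expand(match: re.Match) -> str:
        start, end = match.group(1), match.group(2)
        i = document.find(start)
        j = document.find(end, i + len(start)) if i != -1 else -1
        if i == -1 or j == -1:
            return start + " ... " + end  # fall back if the span is not found
        return document[i : j + len(end)]

    return re.sub(r"<excerpt>(.*?)<\.\.\.>(.*?)</excerpt>", _expand, answer, flags=re.S)

doc = ("To write a medical note, include the chief complaint, history, exam "
       "findings, assessment, plan, and follow-up instructions.")
ans = "<excerpt>To write a medical note,<...>plan, and follow-up instructions.</excerpt>"
print(expand_excerpts(ans, doc))  # prints the full quoted passage
```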
Step D: Judge and filter
- What happens: Use Flow Judge (3.8B) with a 1–5 rubric; keep pairs scoring ≥4. This removes off-topic or low-quality generations. Keep a small fraction (~5%) of explicit nulls when a template truly doesn’t fit, so the Instantiator learns when not to force an answer.
- Why this step: Without filtering, noise lowers training signal and hurts final model quality.
- Example: A template asking for a math proof won’t be kept if the document is a travel blog; the judge will flag it.
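A minimal sketch of the filtering step, assuming each candidate pair has already been scored by the judge (the scoring call itself is not shown); the ≥4 cutoff and the ~5% retained nulls follow the description above, while the data layout is an assumption.

```python
import random

def filter_pairs(scored_pairs, min_score=4, null_keep_rate=0.05, seed=0):
    """Keep pairs the judge scored >= min_score on a 1-5 rubric, plus a small
    fraction of explicit nulls so the Instantiator learns when to decline.

    `scored_pairs` holds dicts like
    {"instruction": ..., "answer": ..., "score": int, "is_null": bool};
    the judge call that produced `score` is assumed to have run already.
    """
    rng = random.Random(seed)
    kept = []
    for pair in scored_pairs:
        if pair["is_null"]:
            if rng.random() < null_keep_rate:
                kept.append(pair)
        elif pair["score"] >= min_score:
            kept.append(pair)
    return kept

example = [
    {"instruction": "Compare A and B", "answer": "A is ...", "score": 5, "is_null": False},
    {"instruction": "Prove theorem X", "answer": "", "score": 1, "is_null": True},
]
print(len(filter_pairs(example)))  # keeps the high-scoring pair; nulls are sampled
```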
Step E: Build the pre-training set token-for-token with baselines
- What happens: For each source document, keep up to the same total token budget in instruction–answer pairs (average ~3 pairs per document; six candidates retrieved, then pruned). Use a simple chat format: “Instruction: … Answer: …”. Compare fairly against baselines by matching total tokens (e.g., 23B for IPT, 300B for Nemotron-CC).
- Why this step: Without token-for-token parity, gains might come from just using more data.
- Example: If a document is 1,000 tokens, we include about 1,000 tokens of instruction–answer text; any leftover budget carries over to the next document.
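A rough sketch of the token-for-token budgeting and the simple chat format; whitespace token counts stand in for the real tokenizer (an assumption), so the numbers are only indicative.

```python
def format_pair(instruction: str, answer: str) -> str:
    # Simple chat format used for the pre-training text.
    return f"Instruction: {instruction}\nAnswer: {answer}"

def pack_to_budget(pairs, doc_token_count, carry_over=0):
    """Keep instruction-answer pairs until their token count roughly matches the
    source document's, carrying any leftover budget to the next document.

    Whitespace token counts stand in for a real tokenizer (an assumption).
    """
    budget = doc_token_count + carry_over
    kept, used = [], 0
    for inst, ans in pairs:
        text = format_pair(inst, ans)
        n = len(text.split())
        if used + n > budget:
            break
        kept.append(text)
        used += n
    return kept, budget - used  # leftover carries over to the next document

pairs = [
    ("Summarize the review", "The review praises the watch's rugged design ..."),
    ("Which watch is more rugged?", "The Apple Watch Ultra, according to the review ..."),
]
kept, leftover = pack_to_budget(pairs, doc_token_count=40)
print(len(kept), leftover)
```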
The Secret Sauce:
- Real-user-shaped diversity: 18M templates mined from genuine queries prevent the data from collapsing into a few academic styles.
- Document coverage: Gaussian pooling retrieves templates for different parts of long documents, not just the average gist.
- Grounded efficiency: ≥80% quoting plus <excerpt> tags cuts generation cost while reducing hallucinations.
- Quality gate: A compact judge boosts signal-to-noise.
- Fair training: Strict token-for-token comparisons isolate the effect of formatting the data as instructions.
04 Experiments & Results
The Test: The authors trained 1.8B-parameter models from scratch on different data mixtures, always using the same total number of tokens as comparable baselines. They then measured performance with three evaluations that correlate with human judgment:
- MixEval: Knowledge and reasoning tasks distilled from many academic benchmarks; scored by an LLM judge with references.
- MT-Bench-101: Realistic, subjective prompts (opinions, advice, writing); scored 1–10 by an LLM judge.
- AlpacaEval: Head-to-head preference between two models’ answers to user-style instructions; reports win rate with a correction for length bias.
The Competition:
- Standard pre-training: Classic next-token on the original corpora.
- IPT (Instruction Pre-Training): Documents converted into instruction–response pairs via an academic-style synthesizer; a ~23B-token setup.
- Nemotron-CC: A strong synthetic pipeline over ~300B tokens, including WRAP rephrasing, general Q&A, and their full mixture.
The Scoreboard (contextualized):
- On the smaller 23B-token IPT corpus:
  - Standard pre-training scored 17.8%/14.0% on MixEval Std/Hard and 1.9 on MT-Bench-101.
  - IPT scored 19.8%/16.7% and 2.4.
  - FineInstructions jumped to 31.7%/19.2% and 2.8. That’s like going from a low C to a strong B+/A- on MixEval Standard.
  - In AlpacaEval head-to-head comparisons in this setting, FineInstructions is preferred over both IPT and standard pre-training by large margins (detailed win rates against the Nemotron-CC baselines are reported separately).
- On the larger 300B-token Nemotron-CC corpus:
  - Standard pre-training: 24.0%/17.1% MixEval, 3.5 on MT-Bench-101; it lost to FineInstructions in AlpacaEval (63.6% FI win rate, +27.2% margin).
  - WRAP rephrasing: 22.8%/18.4%, 3.6; still behind.
  - Pure Q&A: 27.1%/18.9%, 3.4.
  - Full Nemotron-CC: 24.5%/16.7%, 3.6; also behind.
  - FineInstructions led with 33.0%/21.8% and 3.9. That’s roughly an A- when others are getting B’s. In AlpacaEval, FI had a 76.1% win rate over the Q&A baseline (+52.2% margin) and similarly strong wins over others.
Surprising Findings:
- Narrow synthetic styles (e.g., reading-comprehension or multiple-choice flavored) barely improved MixEval (0–2% bumps), even though they targeted benchmark-like tasks. In contrast, the broader, user-shaped diversity of FineInstructions produced large gains.
- A reference-free score (MT-Bench-101) showed smaller spreads in general, but FineInstructions still led.
- Judging and filtering further improved results, especially in AlpacaEval (the paper’s ablations show consistent gains).
- Chunk-aware retrieval worked: the chosen chunk index correlated at 0.99 with where the final quoted answer appeared in the document.
- Diversity checks showed millions of different templates used; no single template exceeded 0.09% of the total—preventing overfitting to a few formats.
Takeaway: Training entirely on grounded instruction–answer pairs that reflect real user tasks leads to better generalization on both knowledge-heavy and open-ended, helpfulness-style evaluations—even when total token budgets are held equal.
05 Discussion & Limitations
Limitations:
- Complex templates are harder: Matching and instantiating templates with many blanks (10+) can fail more often, especially on long or messy documents.
- Source dependence: Quality depends on the diversity and realism of the 18M query sources; gaps in user queries can lead to gaps in training.
- Judge bias: Automatic judges help, but their preferences can shape the dataset; different judges might yield slightly different selections.
- Long-answer habit: Models pre-trained purely on instruction–answer pairs may prefer long-form outputs, which is great for helpfulness but can hurt log-probability multiple-choice style scoring.
- Scale constraints: The paper used 1B–3B helper models and 1.8B pre-trained models; even better performance likely needs larger helper and target models, which cost more compute.
Required Resources:
- Data: Large pre-training corpora and millions of real user queries; storage for 1B+ pairs.
- Compute: GPUs for embedding, indexing (FAISS), and generation; the paper pre-trained on 8×H100 for 1.8B models; helper models are 1B–3B.
- Tooling: Embedding model fine-tuning, Gaussian pooling, a retrieval index, distilled instantiator, and an off-the-shelf judge.
When Not to Use:
- If your downstream task is short-form classification or requires exact option probabilities, pure instruction-style pre-training may be misaligned.
- If documents are extremely domain-limited or you lack suitable sources, template matching may underperform.
- If you cannot run a judging step (time/budget), quality may slip, reducing the gains.
Open Questions:
- Mixture design: What is the ideal balance of domains, task types, and template complexities?
- Better coverage: Can adaptive chunking or learned kernels beat fixed K Gaussians for long documents?
- Multi-turn and tools: How to extend from single-turn instructions to dialogues and tool-using tasks?
- Safer synthesis: Can we further reduce bias and hallucination while staying efficient, perhaps with multiple judges or cross-checking?
- Domain specialization: What’s the best strategy to dial up specific fields (math, code, medicine) without losing generality?
06 Conclusion & Future Work
Three-sentence summary: FineInstructions converts massive pre-training corpora into grounded instruction–answer pairs using millions of real-user-shaped templates, Gaussian-pooled retrieval, an instantiator, and a judge. Pre-training solely on these pairs—holding token counts equal—beats standard next-token pre-training and strong synthetic baselines on human-correlated benchmarks. This shifts pre-training closer to actual use (answering instructions), improves knowledge absorption efficiency, and boosts real-world helpfulness.
Main achievement: Showing that instruction-format, document-grounded synthetic data—at true pre-training scale—can outperform classic next-token pre-training and prior synthetic approaches when compared token-for-token.
Future directions: Scale helper and target models, refine retrieval with smarter chunking, expand to multi-turn and tool-use instructions, build richer judge ensembles, and tune domain mixtures for specialist models. Creating better benchmarks for real-world, long-tail tasks (advice, comparisons, recommendations) would also sharpen evaluation.
Why remember this: It’s a training pivot—from “guess the next word” to “answer the instruction”—implemented at internet scale with real-user diversity and strong grounding. That pivot aligns learning with how we actually use LLMs, yielding models that are more helpful, more efficient to train, and better at the tasks people care about.
Practical Applications
- Train a domain-specialist assistant (e.g., medical, legal, coding) by filtering and emphasizing templates from that field and re-running the pipeline.
- Build a company knowledge assistant by converting internal docs into grounded instruction–answer pairs for in-house pre-training.
- Create study tutors that answer curriculum-aligned questions by templating real student queries and matching them to textbooks and notes.
- Develop safer QA systems by emphasizing grounded quoting and judge filtering to reduce hallucinations in sensitive domains.
- Speed up small-model training for edge devices by using instruction-format pre-training that yields better helpfulness per token.
- Construct benchmarking datasets for long-tail, realistic user tasks (recommendations, comparisons) using the same templating and judging flow.
- Improve RAG alternatives by training models that already respond well to instructions without always needing retrieval at inference time.
- Generate multilingual instruction datasets by templating user queries in different languages and matching to multilingual corpora.
- Prototype tool-aware assistants by extending templates to include step-by-step tool-use instructions matched to how-to documents.
- Run A/B tests on judge models to tune data quality, exploring which judging rubric leads to best downstream helpfulness.