Self-Improving Pretraining: using post-trained models to pretrain better models

Intermediate
Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu et al. · 1/29/2026
arXiv · PDF

Key Summary

  • This paper teaches language models to be safer, more factual, and higher quality during pretraining, not just afterward, by using reinforcement learning with a stronger model as a helper.
  • Instead of only guessing the next word, the model learns to write a short stretch of text (the next N tokens) that a strong 'judge' model scores for quality, safety, and factuality.
  • Training samples are split into a prefix (context) and a suffix (continuation); candidates include the original suffix, a safer rewrite, and the policy model’s own rollouts.
  • Early on, the judge mostly picks the original or the rewritten suffix; later, as the model improves, the judge starts rewarding the model’s own rollouts.
  • The strong helper model plays two roles: a 'rewriter' that fixes unsafe or low-quality suffixes and a 'judge' that scores candidates.
  • Across tests, this self-improving pretraining boosts factuality by up to 36.2% (relative), safety by 18.5% (relative), and generation-quality win rates by up to 86.3% versus standard pretraining.
  • It works for both from-scratch and continual pretraining and also improves standard reasoning benchmarks.
  • More rollouts generally mean better results, and a powerful judge (like GPT-OSS-120B) performs best, though a smaller fine-tuned judge also works.
  • The method costs more compute than next-token prediction, but it shapes core behavior earlier, before unsafe or hallucination-prone habits become ingrained.
  • The framework can be extended to combine multiple goals (quality, safety, factuality) and potentially reasoning.

Why This Research Matters

Safer, more factual models reduce harm and misinformation in everyday tools like chat assistants and search. Teaching these habits during pretraining builds reliability into the model’s core, not just as a later patch. This is especially important when users provide tricky or unsafe inputs, or when context documents are messy. Better base habits also help the model generalize to new situations, resisting simple jailbreaks. The approach can combine multiple goals (quality, safety, factuality) and extend to reasoning, leading to stronger, more trustworthy systems. Even smaller models can benefit, making safer AI more accessible.

Detailed Explanation


01Background & Problem Definition

🍞 Hook: Imagine you’re learning to write stories. If your teacher only checks spelling at the very end, you might practice bad habits for months. Fixing them later is hard. What if your teacher guided you while you were learning, so your habits were good from the start?

🥬 The Concept (Next Token Prediction): What it is: Many language models learn by guessing the next word in huge amounts of text. How it works: 1) Read text up to a point, 2) Predict the next word, 3) Get told the right answer, 4) Repeat billions of times. Why it matters: It teaches models to be fluent, but not necessarily safe, truthful, or high quality because it copies patterns from the data—good and bad. 🍞 Anchor: If a model reads lots of noisy internet text, it can become great at writing but still pick up bad habits like being unsafe or making stuff up.
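To make the baseline concrete, here is a minimal sketch of the next-token-prediction objective in PyTorch; the random tokens and logits are stand-ins for real text and a real model's output.

```python
# Minimal sketch of next-token prediction (cross-entropy on the shifted sequence).
# The random `tokens` and `logits` below are stand-ins for real text and model output.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 100, 16, 4
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # pretend tokenized text
logits = torch.randn(batch, seq_len, vocab_size)          # pretend model(tokens) output

# The prediction at position t is scored against the actual token at position t+1.
pred = logits[:, :-1, :].reshape(-1, vocab_size)
target = tokens[:, 1:].reshape(-1)
loss = F.cross_entropy(pred, target)   # average negative log-likelihood of the next token
print(f"next-token loss: {loss.item():.3f}")
```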

Before this work, people tried to solve safety, factuality, and quality mostly after pretraining, using post-training steps like alignment, supervised fine-tuning, and reinforcement learning from human feedback. That helps a lot, but there’s a catch: if the model learned bad habits early, fine-tuning can’t always erase them. It’s like putting a safety helmet on after already riding a bike recklessly—useful, but not enough to change core balance habits.

🍞 Hook: You know how facts matter when you do a school report—if you write things that aren’t true, the whole report is weak. 🥬 The Concept (Factuality): What it is: Factuality means the model’s statements are true and grounded. How it works: The model should avoid inventing facts (“hallucinations”) and prefer statements supported by known sources. Why it matters: Without factuality, a helpful-sounding answer can mislead people. 🍞 Anchor: If you ask, “What is the capital of France?”, the model should say “Paris” and not guess “Lyon.”

🍞 Hook: Think about classroom rules that keep everyone safe and respectful. 🥬 The Concept (Safety): What it is: Safety means avoiding harmful, biased, or toxic outputs. How it works: The model learns to steer away from unsafe continuations even if the input itself contains unsafe cues. Why it matters: Without safety, a model can produce harmful content or follow bad instructions. 🍞 Anchor: If a prompt includes insults, the model should respond respectfully or refuse, not join in.

🍞 Hook: If you build a toy, quality is whether it works well, lasts long, and does what it should. 🥬 The Concept (Quality): What it is: Quality means the output is coherent, helpful, and well-structured. How it works: The model organizes ideas logically, stays on topic, and uses clear language. Why it matters: Without quality, even safe and factual text can be confusing. 🍞 Anchor: A good answer to “How do plants grow?” is organized and accurate—not just a list of random sentences.

The Problem: Pretraining data contains a mix—some parts are great, others are unsafe or sloppy, and many are factual but still cause models to overgeneralize or hallucinate. Curating data helps, but it either misses issues or removes too much. Worse, if you remove all unsafe text, the model never practices handling unsafe inputs safely. Post-training tries to fix this, but can’t fully undo patterns planted during pretraining.

Failed Attempts: 1) Heavy data filtering: better but not enough, and the model loses practice at steering away from unsafe contexts. 2) Only post-training fixes: improves behavior on known cases but may crack under new or tricky inputs (like jailbreak prompts). 3) One-shot rewrites of whole documents: increases cleanliness but doesn’t teach the model how to pivot from a tricky input toward a better continuation.

The Gap: We need to shape the model’s core behavior during pretraining itself—while it’s forming habits—so it learns to write safe, factual, high-quality continuations even when the context is messy or unsafe.

Real Stakes: This affects chatbots, search assistants, tutoring systems, and content tools used by millions. If a model hallucinates in a medical explanation or repeats unsafe content from a heated forum post, that’s not just a mistake—it can cause harm. Training models to steer toward good behavior from the ground up makes them more reliable in everyday life, especially when user inputs or retrieved documents are imperfect.

🍞 Hook: If you have an older, wiser student helping a younger student while they’re learning, the younger student can pick up better habits fast. 🥬 The Concept (Post-trained Models): What it is: These are strong models already trained and then improved (aligned, fine-tuned). How it works: They’ve learned from huge data and alignment signals and can now guide others by evaluating or rewriting text. Why it matters: Without a strong helper, the learner gets weaker feedback and improves slowly or in the wrong direction. 🍞 Anchor: A great 8th grader tutor can quickly spot and correct a 5th grader’s mistakes so the 5th grader learns the right habits.

This paper’s idea is to use a strong, post-trained model during pretraining as both a rewriter (to fix bad continuations) and a judge (to score candidate continuations), and to train the new model with reinforcement learning on short sequences. This directly teaches safe, factual, high-quality habits while the model is still forming them.

02Core Idea

🍞 Hook: Imagine practicing piano. A coach listens to your next few notes, suggests a better way to play them, and then scores how well you tried again. Over time, you learn to play those next few notes beautifully, even after tricky passages.

🥬 The Concept (Self-Improving Pretraining): What it is: A pretraining method where a strong model rewrites and judges short continuations, and the new model learns via reinforcement learning to produce better continuations given the same context. How it works: 1) Split streaming text into a prefix (context) and a suffix (next N tokens). 2) Create candidates: original suffix, rewritten suffix, and the policy model’s rollouts. 3) Have a strong model act as a judge to score quality, safety, and factuality. 4) Update the policy with RL so it prefers higher-scoring candidates. 5) Early on, rely on originals/rewrites; later, reward the model’s own rollouts. Why it matters: It plants safe, factual, high-quality habits at the core level, reducing baked-in problems that are hard to fix later. 🍞 Anchor: A writing student practices finishing the next paragraph; a mentor suggests a safer, clearer version; a judge scores attempts; the student improves paragraph-by-paragraph.
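Here is a high-level sketch of one training step under this scheme; the `policy`, `rewriter`, `judge`, and `rl_update` objects are hypothetical interfaces standing in for the components described above, not the paper's actual code.

```python
# High-level sketch of one Self-Improving Pretraining step. The policy, rewriter,
# judge, and rl_update arguments are hypothetical interfaces, not the paper's code.

def split_prefix_suffix(token_ids, n_suffix=128):
    """Last n_suffix tokens are the target suffix; everything before is the prefix."""
    return token_ids[:-n_suffix], token_ids[-n_suffix:]

def sip_step(doc_tokens, policy, rewriter, judge, rl_update, k_rollouts=16, n_suffix=128):
    # 1) Chunk the streamed document into a context and the next N tokens.
    prefix, original_suffix = split_prefix_suffix(doc_tokens, n_suffix)

    # 2) Candidate pool: original suffix, a safer/cleaner rewrite, and K policy rollouts.
    candidates = [original_suffix, rewriter.rewrite(prefix, original_suffix)]
    candidates += [policy.sample(prefix, max_tokens=n_suffix) for _ in range(k_rollouts)]

    # 3) The judge scores every candidate for quality, safety, and factuality.
    rewards = [judge.score(prefix, c) for c in candidates]

    # 4) RL update (e.g., online DPO or reward-filtered NLL) toward high-reward candidates.
    rl_update(policy, prefix, candidates, rewards)
```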

Three analogies for the key idea:

  1. Coach-and-Ref: The rewriter is the coach who shows a better move; the judge is the referee who scores; the player (policy model) learns to make better moves.
  2. GPS Re-route: If the road ahead (suffix) looks unsafe, the rewriter suggests a safer route; the judge checks you got closer to the destination (quality/facts/safety).
  3. Taste Test: A chef-in-training cooks a small next course; the mentor proposes a cleaner recipe; the critic scores flavor and health; the chef learns to cook that course better.

Before vs After:

  • Before: Pretraining learns to predict next tokens, then post-training tries to correct behavior, sometimes too late.
  • After: Pretraining itself becomes goal-aware (quality/safety/factuality), using expert guidance on each chunk so the base model’s instincts are healthier.

Why it works (intuition):

  • Immediate, contextual steering: The model doesn’t just learn what words follow; it learns what good, safe, true words follow this exact kind of context.
  • Better supervision signal: Instead of copying noisy data, it receives judged feedback that highlights safer and more factual options.
  • Curriculum effect: Early dependence on original/rewritten targets prevents collapse; later, rewarding rollouts encourages independent, high-quality generation.

Building Blocks (explained as sandwiches):

🍞 Hook: Think of reading a story so far (prefix) and writing the next short paragraph (suffix). 🥬 The Concept (Prefix/Suffix Chunking): What it is: Split text into a context (prefix) and the next N tokens (suffix) to be generated. How it works: 1) Stream a document, 2) Take the last N tokens as the suffix, 3) Everything before is the prefix. Why it matters: It turns learning into a realistic mini-generation task rather than only one-word guesses. 🍞 Anchor: Given a news intro (prefix), write the next 128 tokens (suffix) that continue it well.

🍞 Hook: When you’re unsure how to finish a sentence kindly and clearly, a teacher can suggest a better version. 🥬 The Concept (Rewriter): What it is: A strong model that rewrites the suffix to be safer, more factual, or higher quality while fitting the prefix. How it works: 1) If suffix is already good, copy it. 2) If unsafe/low-quality, replace with a safe, coherent alternative. Why it matters: Without rewrites, early training can’t lean on a good target when the original is poor. 🍞 Anchor: If the continuation gets rude, the rewriter offers a respectful, helpful version using the same context.
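A minimal sketch of how such a rewriter could be driven by a prompt; `strong_model` is assumed to be any callable that maps a prompt string to a completion, and the prompt wording is illustrative rather than the paper's.

```python
# Sketch of a prompt-based rewriter. `strong_model` is assumed to be any callable
# that maps a prompt string to a completion string; the prompt text is illustrative.
REWRITE_PROMPT = """You are given a document context and a draft continuation.
If the continuation is already safe, factual, and well written, copy it exactly.
Otherwise, rewrite it so that it stays on topic with the context but is safe,
factual, and coherent.

Context:
{prefix}

Draft continuation:
{suffix}

Your continuation:"""

def rewrite_suffix(strong_model, prefix: str, suffix: str) -> str:
    return strong_model(REWRITE_PROMPT.format(prefix=prefix, suffix=suffix))
```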

🍞 Hook: In a friendly contest, a fair judge decides which answer is better. 🥬 The Concept (Judge): What it is: A strong model that scores candidate continuations on quality, safety, and factuality. How it works: 1) Compare options (original, rewrite, rollouts), 2) Give pointwise or pairwise scores, 3) Provide rewards to guide learning. Why it matters: Without a judge, the model can’t tell which candidate to prefer. 🍞 Anchor: A judge says, “Option 2 is more coherent and safe,” so the learner shifts toward Option 2’s style.

🍞 Hook: Practicing different endings helps you figure out which way sounds best. 🥬 The Concept (Rollouts): What it is: The policy model’s own generated suffixes for the same prefix. How it works: 1) Sample K completions, 2) The judge scores them, 3) Good rollouts get reinforced. Why it matters: Without rollouts, the model only imitates, never explores and improves its own instincts. 🍞 Anchor: The student tries multiple paragraph endings; the mentor scores them; the student keeps the best habits.
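As a sketch, sampling K rollouts for one prefix could look like this with Hugging Face Transformers; model/tokenizer loading is omitted and the sampling settings are illustrative.

```python
# Sketch of sampling K rollouts for one prefix with Hugging Face Transformers.
# Model/tokenizer loading is omitted; the sampling settings are illustrative.
import torch

def sample_rollouts(model, tokenizer, prefix: str, k: int = 16, n_suffix: int = 128):
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,            # stochastic sampling so the K rollouts differ
            temperature=1.0,
            max_new_tokens=n_suffix,   # suffix length N = 128, as in the paper
            num_return_sequences=k,    # K candidate continuations for the judge to score
        )
    new_tokens = out[:, inputs["input_ids"].shape[1]:]   # drop the prefix, keep the suffixes
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```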

Overall, Self-Improving Pretraining replaces blind token imitation with judged, goal-directed practice on short sequences, letting models learn to steer away from unsafe or untrue paths while remaining coherent and helpful.

03Methodology

High-level recipe: Input (streaming documents) → Split into prefix and N-token suffix → Create candidates (original, rewritten, K rollouts) → Judge scores quality/safety/facts → RL update (e.g., online DPO or RF-NLL) → Output: a policy model that generates safer, more factual, higher-quality suffixes.

Step-by-step with the sandwich pattern for each new concept used:

  1. Data as a stream and chunking 🍞 Hook: Like reading a long book chapter-by-chapter. 🥬 The Concept: We stream pretraining documents and segment them into small learning units. How it works: 1) Move through data in order, 2) For a current window, define prefix (context) and a fixed-length suffix (N=128 tokens), 3) Treat generating the suffix as the training task. Why it matters: Without chunking, we can’t focus learning on realistic next-paragraph writing. 🍞 Anchor: Given the first part of an article (prefix), write the next 128 tokens (suffix) that fit.
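A sketch of how a streamed, tokenized document could be cut into (prefix, suffix) training units; the context bound and stepping scheme are illustrative assumptions.

```python
# Sketch of chunking a tokenized document stream into (prefix, suffix) pairs.
# The max_prefix bound and stepping scheme are illustrative assumptions.
def chunk_document(doc_token_ids, n_suffix=128, max_prefix=1024):
    """Yield (prefix, suffix) pairs: suffix = the next n_suffix tokens,
    prefix = the (bounded) context that precedes it."""
    pos = n_suffix                         # require some context before the first suffix
    while pos + n_suffix <= len(doc_token_ids):
        prefix = doc_token_ids[max(0, pos - max_prefix):pos]
        suffix = doc_token_ids[pos:pos + n_suffix]
        yield prefix, suffix
        pos += n_suffix
```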

  2. Candidate creation

  • Original suffix: the human-written continuation.
  • Rewritten suffix: provided by the rewriter model to improve safety/quality/factuality while staying consistent with the prefix.
  • Rollouts: K completions generated by the current policy model.

🍞 Hook: When solving a math problem, you compare your work to the textbook answer and a tutor’s hint, then try again yourself. 🥬 The Concept: Multiple candidates give richer learning signals. How it works: 1) Use the original as a reference when good, 2) Use rewritten when original is unsafe/low-quality, 3) Generate rollouts to explore. Why it matters: Without multiple candidates, the model would either copy blindly or collapse into poor solutions. 🍞 Anchor: You have the original answer, a corrected hint, and your own attempts—now a judge picks the best.

  3. The rewriter 🍞 Hook: If your sentence could be kinder or clearer, a teacher suggests a better version. 🥬 The Concept: A strong model fine-tuned to rewrite unsafe suffixes into safe ones and to copy safe ones. How it works: 1) If the suffix is safe, an exact copy is rewarded (copying proves restraint). 2) If unsafe, produce a safe, coherent rewrite scored by the judge for safety and quality. Why it matters: Without a reliable rewrite, early training would struggle when originals are poor. 🍞 Anchor: Given an unsafe forum thread, the rewrite stays on topic but responds safely and helpfully.

  4. The judge 🍞 Hook: A fair referee gives points based on rules everyone knows. 🥬 The Concept: A strong model (e.g., GPT-OSS-120B or fine-tuned Llama3.1-8B) that scores candidates. How it works: 1) Safety: pointwise safe/unsafe judgments (often multiple samples averaged). 2) Quality: pairwise comparisons to choose the better continuation. 3) Factuality: pointwise labels like No/Possible/Definite Hallucination using references when available. Why it matters: Without consistent scoring, RL can’t push the policy in the right direction. 🍞 Anchor: Between two endings, the judge says which is safer and more coherent; the model learns to prefer that style.
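As a sketch, the three scoring modes could be turned into scalar rewards like this; the label strings and numeric mappings are illustrative assumptions, not the paper's exact values.

```python
# Sketch of converting judge outputs into scalar rewards for the three scoring modes.
# The label strings and numeric mappings below are illustrative assumptions.
def safety_reward(safety_votes):
    """Pointwise: average several sampled safe/unsafe judgments (1 = safe, 0 = unsafe)."""
    return sum(safety_votes) / len(safety_votes)

def factuality_reward(label):
    """Pointwise: map a hallucination label to a scalar reward."""
    return {"No Hallucination": 1.0,
            "Possible Hallucination": 0.5,
            "Definite Hallucination": 0.0}[label]

def quality_reward(pairwise_wins, num_comparisons):
    """Pairwise: fraction of comparisons in which this candidate was preferred."""
    return pairwise_wins / num_comparisons

print(safety_reward([1, 1, 0]))                      # 2 of 3 sampled judgments say "safe"
print(factuality_reward("Possible Hallucination"))   # 0.5
print(quality_reward(3, 4))                          # preferred in 3 of 4 comparisons
```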

  5. The RL update 🍞 Hook: Like training a pet: good actions get treats, bad actions don’t. 🥬 The Concept (Reinforcement Learning): The policy learns to favor high-reward candidates. How it works: 1) Judge assigns rewards, 2) Policy parameters shift to increase probability of better candidates. Why it matters: Without RL, the model wouldn’t learn from comparative signals, only from copying. 🍞 Anchor: The dog sits when asked; it gets a treat; next time it sits faster.

  6. Specific update choices 🍞 Hook: There are different games to practice the same skill. 🥬 The Concept (Online DPO): What it is: An off-policy preference-learning method that compares a chosen candidate vs a rejected one and updates the model to prefer the chosen. How it works: 1) Select the highest-scoring candidate as chosen and the lowest as rejected, 2) Update so the policy assigns higher probability to the chosen. Why it matters: Off-policy means it can learn from original or rewritten suffixes too, not only rollouts. 🍞 Anchor: If the judge prefers Answer B over A, the model moves probabilities toward B.
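A minimal sketch of the DPO preference loss behind this update, assuming we already have summed log-probabilities of the chosen and rejected suffixes under the current policy and a frozen reference model; the beta value is illustrative.

```python
# Minimal sketch of the DPO preference loss used for the online DPO update, assuming
# summed log-probabilities of chosen/rejected suffixes under the policy and a frozen
# reference model are already computed. beta is an illustrative hyperparameter.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers each candidate than the reference model does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Widen the gap in favor of the judge-preferred (chosen) candidate.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with made-up log-probabilities:
print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.5])).item())
```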

🍞 Hook: Sometimes you just learn from the single best example. 🥬 The Concept (RF-NLL): What it is: Reward-filtered negative log-likelihood—do a standard NLL update but only on the best-scoring candidate(s). How it works: 1) Judge scores all, 2) Pick the top, 3) Update to imitate it. Why it matters: If you skip judging, you might reinforce poor text; with reward filtering, you imitate only the good. 🍞 Anchor: Practice copying the teacher’s best answer, not the messy ones.
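A sketch of reward-filtered NLL under the same assumptions; `policy_nll` is a hypothetical helper that returns the negative log-likelihood of a suffix given its prefix under the policy.

```python
# Sketch of reward-filtered NLL (RF-NLL): imitate only the best-scoring candidate.
# `policy_nll` is a hypothetical helper returning the policy's NLL of a suffix given a prefix.
def rf_nll_loss(policy_nll, prefix, candidates, rewards):
    best = max(range(len(candidates)), key=lambda i: rewards[i])   # judge's top pick
    return policy_nll(prefix, candidates[best])                    # standard NLL on that one only
```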

  7. Early-to-late training dynamics 🍞 Hook: At first you ride a bike with training wheels; later you can balance yourself. 🥬 The Concept: Early training relies on originals/rewrites; later training rewards rollouts. How it works: 1) Initially, rollouts are low-quality; judging picks original/rewrite more. 2) As the policy improves, rollouts win more; RL reinforces the model’s own good generations. Why it matters: Without this shift, the model either collapses (too many weak rollouts) or never becomes independent (only imitation). 🍞 Anchor: First, copy good paragraphs; later, your own endings start winning the contest.
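One way to watch this dynamic is to track how often the judge's top pick is a rollout rather than the original or rewrite; the candidate ordering below is an assumption of this sketch.

```python
# Sketch of monitoring the early-to-late shift: the fraction of examples where the
# judge's top-scoring candidate is a policy rollout. The candidate ordering
# [original, rewrite, rollout_1, ..., rollout_K] is an assumption of this sketch.
def rollout_win_fraction(batch_rewards, num_fixed=2):
    wins = 0
    for rewards in batch_rewards:
        best = max(range(len(rewards)), key=lambda i: rewards[i])
        wins += best >= num_fixed        # indices 0/1 are original/rewrite; the rest are rollouts
    return wins / len(batch_rewards)     # low early in training, rising as rollouts improve
```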

  8. Practical settings from the paper

  • N (suffix length): 128 tokens.
  • K rollouts: up to 16 in continual pretraining (more rollouts → better results), 1 in from-scratch runs.
  • Judges: GPT-OSS-120B (prompted) or fine-tuned Llama3.1-8B (trained via GRPO to produce better judgments).
  • Data: SlimPajama (cleaner), RedPajama (with unsafe content for safety training).
  • Policy model: Llama 2 1.4B, with both continual and from-scratch variants.
  • Safety rewriter: fine-tuned to copy safe suffixes and rewrite unsafe ones.

The secret sauce

  • Rich, comparative feedback: Not just “predict this token,” but “among several full candidates, choose the safest, most factual, highest-quality continuation.”
  • Early stabilization with rewrites: Prevents collapse and teaches steering from bad contexts toward good outputs.
  • Off-policy learning (online DPO): Lets learning leverage originals and rewrites while also improving on the policy’s own attempts.
  • Scaling with rollouts: More rollouts provide more chances to find and reinforce excellent completions.

Concrete example:

  • Prefix: A heated forum post with unsafe language.
  • Candidates: (i) the original suffix continues the heat; (ii) the rewriter provides a safe, respectful response; (iii) K rollouts vary in tone and accuracy.
  • Judge: Scores safety first, then quality/coherence.
  • Update: The model shifts toward the rewritten style; later, its own rollouts start matching or beating the rewrite.

Over time, the policy learns the habit: given messy contexts, choose safe, factual, coherent ways to continue.

04Experiments & Results

The test: The authors evaluate whether self-improving pretraining (SIP) makes models produce better-quality, safer, and more factual continuations. They measure:

  • Generation quality (pairwise judged win rates vs a baseline model’s generations; a sketch of the win-rate computation follows this list).
  • Safety (weighted averages across multiple safety datasets judged by a strong model).
  • Factuality (weighted averages across factuality datasets, including checks against references).
  • Standard reasoning benchmarks (e.g., BoolQ, HellaSwag, ARC, SIQA, MMLU) to ensure general competence isn’t lost.
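As referenced above, here is a minimal sketch of the pairwise win-rate metric; counting ties as half a win is an assumption of this sketch, not necessarily the paper's exact scoring rule.

```python
# Sketch of a pairwise win-rate metric: the share of test prefixes where a judge prefers
# the SIP model's continuation over the baseline's. Counting ties as half is an assumption.
def win_rate(judgments):
    """judgments: list of 'win' / 'tie' / 'loss' outcomes from pairwise comparisons."""
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0 for j in judgments)
    return 100.0 * score / len(judgments)

print(win_rate(["win", "win", "tie", "loss"]))   # -> 62.5
```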

The competition: SIP is compared against standard next-token pretraining baselines, including Llama Base 1.4B and models continually pretrained on SlimPajama or RedPajama without the SIP method. From-scratch and continual pretraining configurations are both tested.

Scoreboard with context:

  • Quality pretraining (continued from Llama Base): SIP achieves an 86.3% win rate on generation quality with standard (clean) prefixes and an 87.9% coherence win rate. Think: getting an A+ while the baseline averages around a B-.
  • Factuality pretraining (continued): SIP boosts the average factuality score from 42.3 to 57.6—about a 36.2% relative gain—while also showing strong quality improvements (84.0% win rate on quality).
  • Safety pretraining (continued, on RedPajama): SIP improves the average safety score from 76.9 to 91.1 and shows a 77.7% quality win rate on unsafe prefixes. That’s like going from a solid B to a clear A in safety.

From-scratch results (on RedPajama, safety objective):

  • Baseline next-token pretraining yields very low quality win rates (around 1–2) but solid safety.
  • SIP with reward-filtered NLL using rollout vs rewrite massively lifts generation quality (to 32.4) and still raises safety (to 97.5). This shows SIP can shape strong habits early.

Surprising findings and nuanced takeaways:

  • Optimizing for one goal doesn’t automatically optimize others. For instance, training for safety doesn’t also improve factuality by default. If you want both, include both in the reward signal.
  • Judges matter: Using a very strong judge (GPT-OSS-120B) performs best, but a smaller fine-tuned judge (Llama3.1-8B trained with GRPO to reason before judging) also works well—promising for cost-sensitive settings.
  • More rollouts, better results: As the number of policy rollouts increases (e.g., up to 16), performance tends to improve across quality, safety, and factuality, though this increases compute cost.
  • Early dynamics are visible: At first, the judge mostly chooses originals/rewrites over poor rollouts. Later, it flips, and rollouts win—clear evidence the model is learning to generate strong candidates on its own.

Concrete numbers (continued pretraining):

  • Quality objective: Quality win rate 86.3% vs baseline; coherence win rate 87.9%.
  • Factuality objective: Factuality average from 42.3 to 57.6; quality win rate 84.0%.
  • Safety objective: Safety from 76.9 to 91.1; quality win rate on unsafe prefixes 77.7%.

Standard benchmark effects: SIP doesn’t just optimize niche metrics; it also improves or maintains performance on general tasks like BoolQ, PIQA, HellaSwag, ARC-e/c, OBQA, SIQA, and 5-shot MMLU. That suggests SIP’s improvements don’t come at the cost of broad language understanding.

Overall, the experiments show SIP consistently outperforms standard pretraining in the dimensions that matter for real-world reliability—while remaining competitive on general intelligence benchmarks.

05Discussion & Limitations

Limitations:

  • Compute cost: SIP is slower and more expensive than next-token prediction, especially when using many rollouts and multiple judge queries per candidate.
  • Judge dependence: If the judge is biased, weak, or poorly prompted, the policy will learn those flaws. Training a good judge (or using a large one) is crucial.
  • Goal specificity: Training for safety alone won’t auto-boost factuality or vice versa; reward design must include all desired targets.
  • Early-stage fragility: Without careful use of originals/rewrites early on, training can collapse if rollouts are too poor and get over-reinforced.

Required resources:

  • A strong post-trained model for judging (and, if needed, rewriting).
  • Sufficient compute to sample rollouts, run multiple judge calls, and perform RL updates (e.g., online DPO).
  • Curated setups for safety/factuality prompts and possibly fine-tuned judges/rewriters (e.g., GRPO-trained judge, safety rewriter).

When not to use:

  • Extremely resource-limited settings where you cannot afford multiple rollouts and judge evaluations.
  • Domains where the desired behavior is not well captured by current judge prompts (e.g., highly subjective style goals without clear reward signals).
  • Cases where you need unsafe role-play generation by default; SIP can be adapted with control tokens, but naive use will push toward safety.

Open questions:

  • Faster judgments: Can we reduce pairwise comparisons without losing quality? Pivot methods showed some drop—what new designs can preserve performance while cutting cost?
  • Unified reward models: Can one judge reliably cover safety, factuality, quality, and reasoning?
  • Scaling rollouts: Performance improved up to 16 rollouts; what happens at 32 or 64 with smarter sampling?
  • Beyond text: How does SIP extend to multimodal pretraining and grounded tool use?
  • Robustness: How to ensure judge and policy remain reliable under adversarial or out-of-distribution inputs?

Bottom line: SIP is a powerful step toward making pretraining goal-aware and safety-conscious, but it requires careful design of judges, rewards, and compute budgets to realize its full potential.

06Conclusion & Future Work

Three-sentence summary: This paper rethinks pretraining by having a strong model rewrite and judge short continuations while a new model learns via reinforcement learning to prefer safer, more factual, higher-quality outputs. Early on, training leans on originals and rewrites; later, it rewards the model’s own good rollouts, shaping core habits before they harden. The result is large gains in quality, safety, and factuality over standard next-token pretraining, without sacrificing general capabilities.

Main achievement: Turning pretraining into a judged, sequence-level learning process—using rewriters and judges—so the base model develops healthy instincts from the start.

Future directions:

  • Combine multiple goals (quality, safety, factuality, reasoning) into a unified reward pipeline or a single well-trained judge.
  • Improve efficiency with smarter rollout sampling and faster, reliable judgment mechanisms.
  • Expand to multimodal or tool-augmented settings and explore higher rollout counts.

Why remember this: Instead of trying to fix habits after they form, SIP teaches good habits during pretraining itself. That shift—from blind next-token guessing to judged, goal-driven sequence learning—can make future models more reliable and trustworthy in everyday use.

Practical Applications

  • Train enterprise chatbots that reliably deflect unsafe requests and provide accurate, helpful alternatives.
  • Pretrain medical or legal assistants to reduce hallucinations when summarizing sensitive documents.
  • Build kid-safe educational tutors that stay respectful and factual even when given tricky prompts.
  • Improve content moderation helpers that rewrite unsafe replies into constructive, policy-compliant messages.
  • Enhance search-answering systems to generate grounded, coherent, and safe snippets from noisy web pages.
  • Pretrain domain-specific copilots (coding, math) to favor correct, high-quality steps over plausible but wrong ones.
  • Develop customer support bots that reframe heated messages into calm, solution-focused responses.
  • Create summarization tools that avoid inserting false claims and maintain a safe, neutral tone.
  • Use smaller, fine-tuned judges to reduce costs in organizations with limited compute while still gaining SIP benefits.
  • Combine multiple goals (safety + factuality + quality) in one training loop to tailor models for regulated industries.
Tags: self-improving pretraining · reinforcement learning · online DPO · reward-filtered NLL · LLM-as-judge · suffix rewriting · factuality · safety · generation quality · rollouts · SlimPajama · RedPajama · post-trained models · sequence-level training · coherence