Shaping capabilities with token-level data filtering
Key Summary
- The paper shows a simple way to teach AI models what not to learn by removing only the exact words (tokens) related to unwanted topics during pretraining.
- Filtering at the token level is more precise than throwing out whole documents, so models keep more of the good knowledge while losing the bad.
- On a medical "forget" test, the largest model needed about 7000 times more compute to re-learn medicine if it was trained with token removal, showing strong suppression.
- Despite forgetting medical know-how, these models still behave well and can be trained to politely refuse medical questions, sometimes even better than unfiltered models.
- Token filtering also held up better than a leading unlearning method (RMU) against adversarial finetuning that tries to add the unwanted skills back.
- A practical pipeline labels tokens using sparse autoencoders and small bidirectional language models, producing cheap, accurate classifiers.
- Even with noisy labels, filtering can work: crank up recall (be conservative) and scale pretraining to still get strong suppression.
- Filtering early in training matters a lot; starting late is much less effective.
- Token-level classifiers generalize well even from weak labels, while document-level classifiers struggle to do the same.
- Overall, token-level data filtering is an inexpensive, scalable, and robust way to shape model capabilities during pretraining.
Why This Research Matters
This approach makes AI safer by preventing risky skills from forming in the first place, instead of trying to cover them up later. It preserves helpful abilities, like general reasoning and safe subject knowledge, while sharply weakening unwanted capabilities. Because it targets only the exact bad bits, it keeps models efficient and useful for everyday tasks like study help and writing. It scales well: as models grow larger, the suppression effect becomes even stronger. It also holds up better under attacks that try to re-add the dangerous skills, and it still allows the model to be aligned to refuse in those areas. In short, it's a practical, low-cost safety lever that can be combined with other defenses for real-world deployment.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're packing a lunch. If one grape has a tiny bruise, you don't throw away the whole bunch; you just pick out the few bad ones so you still enjoy the rest.
The Concept: Capability shaping is teaching an AI model what to learn and what to ignore so it helps with good things (like writing or biology) but avoids unsafe areas (like how to cause harm).
- How it works:
- Decide what skills are safe to keep (retain) and what to remove (forget).
- Adjust the training data so the model sees lots of the keep stuff and very little (or none) of the forget stuff.
- Train the model, then test if it kept the right abilities and dropped the wrong ones.
- Why it matters: If the base model already knows harmful tricks, safety rules added later can be bypassed. It's better to never learn the trick in the first place. Anchor: Think of a dog in school: you teach it to fetch the ball but never to chew shoes. If it never learns shoe-chewing, you won't have to constantly stop it later.
The World Before: Big language models learn from huge piles of text. After pretraining, people add safety layers (like instruction tuning or RLHF) to make models polite and refuse dangerous requests. But attackers often "jailbreak" these protections or fine-tune models to bring back the risky behavior. Unlearning methods try to surgically remove the bad skills after the fact, but current techniques are fragile; just a bit of adversarial finetuning can often undo them.
Hook: You know how when you spill glitter on the floor, it gets everywhere? Trying to pick it up after it spreads is super hard.
The Concept: Post hoc safeguards are safety fixes added after the model already learned everything.
- How it works:
- Pretrain on the whole internet (model learns lots, including risky bits).
- Add safety rules or refusal behaviors.
- Hope those rules always hold.
- Why it matters: Attackers can still nudge the model to reveal what it already knows. Cleaning after the glitter spreads is tough. Anchor: Even if you say, "Please don't tell secrets," someone can still trick a friend who already knows the secret. Better if the friend never learned it.
Failed Attempts: Document-level filtering (throwing out whole papers or pages) helped somewhat, but often removed a lot of good text too, because even safe documents can contain a few risky tokens (words/phrases). That made models weaker in helpful areas.
Hook: Imagine tossing the whole fruit salad because you found one grape seed. That's wasteful.
The Concept: Document-level filtering removes entire documents if they contain unwanted content.
- How it works:
- A classifier flags documents with some risky content.
- Any flagged document is removed from training.
- Why it matters: You lose many good parts just to avoid a few bad bites, lowering model quality on safe tasks. Anchor: If a history article mentions a disease once, tossing the whole article loses all its valuable history facts.
The Gap: We needed a way to remove only the exact harmful bits without dumping the nearby helpful parts. Data attribution research hinted that tiny token spans can carry a lot of the learning signal, so precision at the token level might work better than rough document cuts.
Hook: You know how a song can be ruined by a few off-key notes? You don't delete the whole track; you mute those notes.
The Concept: Token-level filtering removes only the specific words/tokens that teach the unwanted skill.
- How it works:
- Label which tokens are in the "forget" domain (e.g., medicine).
- During pretraining, either mask their loss (don't learn from them) or replace them so the model can't use them.
- Keep all the surrounding helpful text.
- Why it matters: You keep precision: only the off-key notes go, the melody stays. Anchor: A news article that's 98% fine and 2% medical gets to stay; just the medical tokens are ignored.
Real Stakes: In daily life, we want assistants that write essays, explain safe biology, and help with school, without offering dangerous instructions. Token-level filtering promises models that are useful where they should be and empty-handed where they shouldn't, reducing misuse risk and making later safety training easier.
02 Core Idea
Hook: Imagine weeding a garden with tweezers instead of a shovel. You pull out only the weeds and leave the flowers untouched.
The Concept (Aha! in one sentence): If we filter out only the unwanted tokens during pretraining, we can strongly suppress specific capabilities (like medical know-how) while preserving the good ones (like general biology).
- How it works:
- Build a token-level classifier that tags each token as "forget" or not.
- During pretraining, either stop learning from those tokens (loss masking) or remove them (token replacement).
- Scale up models and measure what changed.
- Why it matters: It's more precise than document filtering, so you lose less helpful knowledge and need less extra compute to reach the same safe outcome. Anchor: Instead of deleting a whole chapter because of one risky sentence, you only skip learning from that sentence.
Three Analogies:
- Garden: Tweezer-weeding (token filtering) vs. shovel-weeding (document filtering). Tweezer-weeding protects flowers.
- Music: Muting only wrong notes (tokens) vs. deleting the entire song (document). The tune stays enjoyable.
- Cookbook: Crossing out spicy words in a kid-friendly recipe book vs. tossing the whole book. Dinner remains tasty and safe.
Before vs. After:
- Before: Post-training fixes and document-level filters often over-removed good data or were easy to bypass.
- After: Token-level filtering precisely targets the bad bits, scales better with model size, and stays robust against retraining attacks.
Hook: You know how your brain learns from the exact words you read, not the whole book at once?
Why It Works (intuition, not equations): Models learn from local token patterns. A few short token spans can carry the key signals for a capability. By removing learning from those exact spans, the model doesn't internalize the unwanted skill. Because you keep most of the surrounding text, helpful abilities remain strong. As models get bigger, this precision pays off more: they become even less efficient at picking up the filtered skill. Anchor: Even if a paragraph mentions a medical term, the model won't learn the medical know-how if the learning signal from those words is cut off.
Building Blocks (mini-sandwiches):
- Hook: Like asking two friends for context before judging a word's meaning.
The Concept: Bidirectional token classifier.
- How: Train small left-to-right and right-to-left models, combine their representations, and fit a simple linear probe to label tokens.
- Why: The same word (like "virus") can be medical or computer-related; context matters. Anchor: "Raspberry Pi virus" and "influenza virus" are labeled differently because of context. (A code sketch of this classifier follows after this list.)
- Hook: Seeing a blurred photo vs. a clear one.
The Concept: Loss masking vs. token removal.
- How: Masking ignores the learning signal for bad tokens but still shows them as context; removal replaces them with a <|hidden|> token so they're not visible as content.
- Why: Masking keeps context flow; removal blocks it fully, so removal suppresses more but may reduce nearby coherence. Anchor: Masking is like closing your ears during the bad word; removal is like beeping it out entirely.
- Hook: Labeling stickers can be messy but still helpful.
The Concept: Weak labels via sparse autoencoders (SAEs).
- How: Use SAEs to find features related to the forget domain, create token labels, then train a cheaper classifier.
- Why: Full, perfect labels are expensive; this gives good-enough guidance cheaply. Anchor: A noisy but useful map still gets you to the museum.
- Hook: Pop quizzes check what you really learned.
The Concept: Multi-view evaluation.
- How: Test with text loss (perplexity), multiple-choice exams, and free-response judged for relevance, coherence, and correctness.
- Why: If all agree, we're more confident the skill was suppressed. Anchor: Passing all three checks means the model truly forgot that topic.
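To make the bidirectional-classifier idea concrete, here is a minimal sketch in Python. It assumes hypothetical helpers (forward_lm_states, backward_lm_states) that each return one hidden vector per token from small left-to-right and right-to-left LMs; the paper's actual probe setup may differ in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_features(tokens, forward_lm_states, backward_lm_states):
    """One feature vector per token, built from both directions of context."""
    fwd = np.asarray(forward_lm_states(tokens))         # (seq_len, d_fwd)
    bwd = np.asarray(backward_lm_states(tokens[::-1]))  # run right-to-left...
    bwd = bwd[::-1]                                      # ...then re-align to left-to-right order
    return np.concatenate([fwd, bwd], axis=-1)           # (seq_len, d_fwd + d_bwd)

def fit_token_probe(token_sequences, label_sequences, fwd_lm, bwd_lm):
    """Fit a linear probe mapping per-token features to forget/retain labels."""
    X = np.concatenate([token_features(t, fwd_lm, bwd_lm) for t in token_sequences])
    y = np.concatenate(label_sequences)   # 1 = "forget" (e.g., medical), 0 = retain
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe
```

The point of the design is simply that each token's feature vector sees both its left and right context, so ambiguous words like "virus" get context-dependent labels.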
03 Methodology
High-level recipe: Input (FineWeb-Edu) → Token labeling (SAE-assisted, small biLM probe) → Token filtering during pretraining (masking or removal) → Train models from 61M to 1.8B parameters → Evaluate on loss, MCQ, free response, robustness, and alignment.
Step A: Build token-level labels
- What happens: Use sparse autoencoders (SAEs) trained on a larger model to discover features that activate on medical content. Convert those feature activations into token labels (medical vs. not), then distill these labels into a small, efficient bidirectional LM (biLM) probe that can label new data cheaply.
- Why it exists: Perfect human labels are costly; SAEs and small LMs make a weak-to-strong pipeline that's practical at scale.
- Example: The phrase "insert the catheter" gets multiple medical-feature activations, so all three words are labeled "forget." A sentence about a "computer virus" won't activate medical features and stays "retain."
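A rough sketch of how SAE activations could be turned into weak token labels is below. It assumes a hypothetical sae_feature_activations(tokens) helper returning a (seq_len, n_features) array and a hand-picked list medical_feature_ids of SAE features that fire on medical content; real pipelines will differ in how features are chosen and scored.

```python
import numpy as np

def weak_medical_labels(tokens, sae_feature_activations, medical_feature_ids,
                        activation_threshold=0.5):
    """Label a token 'forget' when medical SAE features fire strongly enough on it."""
    acts = np.asarray(sae_feature_activations(tokens))          # (seq_len, n_features)
    medical_score = acts[:, medical_feature_ids].sum(axis=-1)   # per-token medical evidence
    return (medical_score > activation_threshold).astype(int)   # 1 = weak "forget" label
```

These noisy labels are then used to train the cheaper biLM probe sketched earlier, which does the actual corpus-scale labeling.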
Hook: Like getting a list of words to skip while studying. The Concept: Token-level data filtering.
- How it works:
- Score tokens with the classifier (medical vs. non-medical).
- Choose a score threshold to trade off precision against recall.
- During pretraining, either:
- Loss masking: Don't backpropagate loss from medical tokens.
- Token removal: Replace medical tokens with <|hidden|> and also mask their loss.
- Why it matters: It lets you remove only the risky bits while keeping the rest of the sentence and document. Anchor: In a biology article that mentions a clinical test once, only that clinical part is filtered; the rest teaches safe biology.
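Here is a minimal sketch of the two filtering modes inside a training step, assuming a standard causal-LM setup where label positions set to -100 are ignored by the loss. forget_mask is the 0/1 output of the token classifier (thresholded for high recall), and HIDDEN_ID is a hypothetical id reserved for a <|hidden|> placeholder token; the paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

HIDDEN_ID = 0  # hypothetical id for <|hidden|>; a real vocabulary would reserve its own slot

def filtered_lm_loss(model, input_ids, forget_mask, mode="mask"):
    """Next-token loss with forget tokens either loss-masked or removed."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Loss masking: forget tokens stay visible as context but contribute no learning signal.
    labels[forget_mask.bool()] = -100

    if mode == "remove":
        # Token removal: additionally hide the forget tokens from the input itself.
        input_ids[forget_mask.bool()] = HIDDEN_ID

    logits = model(input_ids)  # (batch, seq_len, vocab); `model` is assumed to return raw logits
    # Standard causal shift: positions <= t predict token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```

Both modes zero out the learning signal on forget tokens; "remove" goes further by also denying them as visible context, which is the more aggressive setting described above.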
Step B: Pretrain models at multiple scales
- What happens: Train compute-optimal Transformer models from 61M up to 1.8B parameters on filtered and unfiltered data. Keep architecture and hyperparameters consistent across runs.
- Why it exists: To see how filtering behaves as models grow and to quantify "compute slowdowns" in the forget domain.
- Example: A 1.8B model trained with token removal shows about a 7000× compute slowdown at learning medical content compared to the unfiltered baseline.
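For intuition, here is one plausible way a compute-slowdown factor could be estimated, assuming you have forget-domain loss measured at several compute budgets for both a filtered and an unfiltered run; the paper's exact procedure may differ.

```python
import numpy as np

def compute_to_reach(losses, compute, target_loss):
    """Interpolate the compute at which a decreasing loss curve first reaches target_loss."""
    losses, compute = np.asarray(losses, float), np.asarray(compute, float)
    # np.interp needs increasing x, so reverse the decreasing loss curve.
    return float(np.interp(target_loss, losses[::-1], compute[::-1]))

def compute_slowdown(filtered_losses, filtered_compute,
                     baseline_losses, baseline_compute, target_loss):
    """How many times more compute the filtered run needs to match the baseline's loss."""
    c_filtered = compute_to_reach(filtered_losses, filtered_compute, target_loss)
    c_baseline = compute_to_reach(baseline_losses, baseline_compute, target_loss)
    # If the filtered curve never reaches target_loss in the measured range, this
    # underestimates the true slowdown (extrapolation would be needed).
    return c_filtered / c_baseline
```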
Step C: Evaluate across different lenses
- Text perplexity (loss): As a proxy for knowledge, measure how surprised the model is by tokens in medical, biology, and general non-medical sets.
- Why: Masking/removal affects the training loss signal directly, so this is an immediate, transparent check.
- Example: Higher medical loss means it didn't learn medicine; similar biology loss means it kept biology. (A small sketch of this check follows after this list.)
- Multiple-choice exams: Use MedMCQA and MedQA-USMLE for medical (forget) and MMLU biology/STEM/non-STEM for retain.
- Why: Realistic test of applied knowledge, not just token prediction.
- Example: Filtered models score near chance on medical exams but stay strong on non-medical ones.
- Free-response judging: Ask health-related consumer questions (HealthSearchQA) and have a judge model rate relevance, coherence, and correctness. Also test general instruction following (Alpaca).
- Why: Open-ended answers reveal whether the capability actually surfaces in helpful ways.
- Example: Token-filtered models produce far fewer correct medical answers but do fine on general tasks like Alpaca.
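As referenced above, here is a minimal sketch of the perplexity-style check: average held-out next-token loss per domain, exponentiated. It assumes model(input_ids) returns raw logits, as in the earlier sketch; higher medical perplexity with unchanged biology/general perplexity is the pattern a filtered model should show.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def domain_perplexity(model, batches):
    """Perplexity of a model over a list of (batch, seq_len) token-id tensors."""
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_loss / total_tokens)

# Usage idea: compare domain_perplexity(model, medical_batches) against
# domain_perplexity(model, biology_batches) for filtered vs. unfiltered models.
```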
Step D: Robustness and adversarial finetuning
- What happens: Give attackers a chance: fine-tune models on medical text to see how fast they can regain the forgotten capability. Compare to RMU, a strong unlearning baseline.
- Why it exists: If a defense crumbles quickly under attack, it's less useful.
- Example: At larger scales, RMU-unlearned models recover medical performance after only a few finetuning tokens; token filtering, especially removal, resists much longer.
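A sketch of what this robustness probe could look like: finetune the filtered model on raw forget-domain text (the attacker's setting) and track forget-domain performance as a function of tokens seen. eval_forget_accuracy is a hypothetical stand-in for the medical MCQ evaluation; the actual attack setup in the paper may use different optimizers and schedules.

```python
import torch
import torch.nn.functional as F

def lm_loss(model, input_ids):
    """Plain next-token loss on unfiltered text (what an attacker would optimize)."""
    logits = model(input_ids)
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))

def recovery_curve(model, forget_batches, eval_forget_accuracy, lr=1e-5, eval_every=50):
    """Record (tokens seen, forget-domain accuracy) while finetuning on forget-domain text."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    curve, tokens_seen = [], 0
    for step, input_ids in enumerate(forget_batches):
        loss = lm_loss(model, input_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()
        tokens_seen += input_ids.numel()
        if step % eval_every == 0:
            curve.append((tokens_seen, eval_forget_accuracy(model)))
    return curve  # the more slowly this rises, the more robust the suppression
```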
Step E: Alignment/Refusal generalization
- What happens: Fine-tune chat models to refuse medical questions but respond normally to general ones. Check whether filtered models can still be trained to refuse appropriately.
- Why it exists: We want control, not chaos; models should say "no" when they don't know or it's unsafe.
- Example: Token-removal models refuse medical prompts about 2× more than baseline while not over-refusing general prompts.
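As a small illustration of the refusal-training setup, a finetuning mix might pair medical prompts with refusals and general prompts with normal answers; the prompt sources and refusal wording here are purely illustrative assumptions.

```python
REFUSAL = "Sorry, I can't help with medical questions."  # illustrative refusal text

def build_refusal_mix(medical_prompts, general_pairs):
    """Combine refusal examples for the forget domain with ordinary examples elsewhere."""
    data = [{"prompt": p, "response": REFUSAL} for p in medical_prompts]
    data += [{"prompt": p, "response": r} for p, r in general_pairs]
    return data
```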
The Secret Sauce:
- Precision at the token level: You keep the good context while skipping only the risky bits.
- Bidirectional probing: Understanding a token's meaning needs left and right context.
- Weak-to-strong pipeline: Use SAEs to get noisy but useful labels, then train a small, fast probe that labels the big corpus cheaply.
- Scaling: As models get larger, token filtering's advantage grows, showing massive compute slowdowns on the forget domain.
04 Experiments & Results
The Test: Does token-level filtering suppress medical capability without harming nearby skills like biology, and how does it scale? Researchers measured:
- Text loss on medical, biology, and non-medical sets.
- Multiple-choice accuracy on medical exams and MMLU subsets (retain domains).
- Free-response quality (relevance, coherence, correctness) on medical consumer questions vs. general instruction following.
- Robustness under adversarial finetuning compared to RMU unlearning.
- Alignment/corrigibility via refusal training.
The Competition: Baselines included an unfiltered model, document-level filtering, and a state-of-the-art unlearning method (RMU). Token filtering used two styles: loss masking and token removal.
The Scoreboard (with context):
- Token filtering vs. document filtering: Token filtering was a Pareto improvement: the same drop in medical ability with less damage to biology and general knowledge. Think: getting an A in safe subjects while keeping the "F" in the forbidden one, versus dropping to a B everywhere.
- Scaling: At 1.8B parameters, token removal caused about a 7000× compute slowdown for medical learning, compared to about 30× for document filtering, a giant gap. That's like making the unwanted subject 7000 times harder to study.
- Multiple choice: Filtered models hovered near chance on MedMCQA/MedQA-USMLE (good for "forget") while matching baseline on biology, STEM, and non-STEM (good for "retain").
- Free response: Token filtering reduced correctness up to about 10× and coherence/relevance about 3-4× on medical questions. On general tasks (Alpaca), performance stayed similar.
- Robustness: Under adversarial finetuning, RMU's protection eroded fastest as models got bigger. Token removal resisted best, needing many more training tokens to reach baseline medical performance.
- Alignment/refusal: Surprisingly, token-filtered models were as good or better at learning to refuse medical prompts while not over-refusing general prompts. Document filtering over-refused more.
Surprising Findings:
- Alignment got easier under token filtering: Even with less exposure to medical tokens, the model more cleanly learned to say "no" to that domain.
- Noisy labels still worked: By setting the classifier threshold for high recall and scaling models, filtering stayed effective.
- Token-level classifiers generalized well from weak labels; document-level classifiers didn't.
- Filtering early in training mattered a lot; waiting even 40% into training made the effect much weaker.
05 Discussion & Limitations
Limitations:
- Leakage and label noise: If some medical tokens slip through or labels are noisy, the model may still pick up bits. Filtering remains sensitive to classifier quality.
- Coarse control across subdomains: Token filtering helped distinguish "medical vs. non-medical," but it struggled to separate medical subdomains (e.g., neurology vs. infectious disease) for fine-grained control.
- Retrieval/in-context risks: A powerful model plus external tools might still piece together answers from a few surviving examples.
Required Resources:
- Compute to pretrain multiple model sizes and to run classifiers; however, the classifier/probe is small and cheap compared to pretraining.
- A labeled pipeline (SAEs + small biLM probe) to produce token-level labels.
When NOT to Use:
- If you need the model to retain comprehensive knowledge across all domains (e.g., full medical tutoring), aggressive filtering will harm that.
- If labels are extremely unreliable and you cannot afford the recall-heavy threshold or extra pretraining compute.
- If the risk is purely behavioral (e.g., tone) rather than capability-based, other safety methods may suit better.
Open Questions:
- Can we select tokens by direct influence on downstream capabilities instead of proxy labels (knowledge-related tokens)?
- How does token filtering behave beyond 7B+ models: does suppression remain strong, or do giant models infer missing skills anyway?
- Can we reduce dependence on external labels via self-supervised oversight or influence-based filters that scale automatically?
- What's the best blend of pretraining filtering and posttraining safeguards for defense-in-depth, especially with dual-use content?
06 Conclusion & Future Work
3-Sentence Summary: This paper shows that removing only the unwanted tokens during pretraining is a precise and scalable way to shape model capabilities, cutting medical know-how while preserving nearby skills like biology. Token-level filtering beats document-level filtering, scales to huge compute slowdowns for the forget domain, stays more robust than strong unlearning baselines, and still allows easy alignment (refusal training). A practical SAE-to-probe pipeline makes the token labels cheap, and even noisy labels work when tuned for high recall and scaled.
Main Achievement: Demonstrating that token-level pretraining data filtering Pareto-dominates document filtering and becomes dramatically more effective at larger model scales (e.g., ~7000× compute slowdown for re-learning the forget domain).
Future Directions: Filter by direct capability influence instead of proxy labels; push scaling tests to much larger models; combine with posttraining defenses for layered safety; and develop better evaluations for capability shaping. Explore self-supervised or influence-based pipelines that reduce the need for hand-crafted labels.
Why Remember This: It reframes safety from "patch after" to "prevent before," supplying a simple, low-cost, and robust lever (token-level filtering) that grows stronger with scale and keeps helpful abilities intact.
Practical Applications
- Train education-friendly models that keep general science knowledge while avoiding specialized medical instructions.
- Build public-facing assistants that refuse high-risk domains but still perform well on everyday tasks.
- Safeguard open-weight releases by filtering pretraining data to limit dangerous capabilities before sharing.
- Customize domain restrictions (e.g., remove advanced chemical synthesis tokens) while preserving nearby benign domains.
- Reduce moderation load by making base models less likely to generate risky content in the first place.
- Support defense-in-depth: combine token filtering with posttraining refusals and runtime classifiers for layered safety.
- Enable enterprise deployments with selective capability profiles (e.g., disable certain domains for compliance).
- Conduct safer research by ablating capabilities in proxies (like medicine) to study scaling and robustness.
- Improve kid-safe models by filtering tokens linked to adult or harmful content while retaining educational materials.
- Facilitate targeted capability audits by toggling filtered domains and measuring downstream behavior changes.