Shaping capabilities with token-level data filtering
Key Summary
- The paper shows a simple way to teach AI models what not to learn by removing only the exact words (tokens) related to unwanted topics during pretraining.
- Filtering at the token level is more precise than throwing out whole documents, so models keep more of the good knowledge while losing the bad.
- On a medical "forget" test, the largest model needed about 7000 times more compute to re-learn medicine if it was trained with token removal, showing strong suppression.
- Despite forgetting medical know-how, these models still behave well and can be trained to politely refuse medical questions, sometimes even better than unfiltered models.
- Token filtering also held up better than a leading unlearning method (RMU) against adversarial finetuning that tries to add the unwanted skills back.
- A practical pipeline labels tokens using sparse autoencoders and small bidirectional language models, producing cheap, accurate classifiers.
- Even with noisy labels, filtering can work: crank up recall (be conservative) and scale pretraining to still get strong suppression.
- Filtering early in training matters a lot; starting late is much less effective.
- Token-level classifiers generalize well even from weak labels, while document-level classifiers struggle to do the same.
- Overall, token-level data filtering is an inexpensive, scalable, and robust way to shape model capabilities during pretraining.
Why This Research Matters
This approach makes AI safer by preventing risky skills from forming in the first place, instead of trying to cover them up later. It preserves helpful abilities, like general reasoning and safe subject knowledge, while sharply weakening unwanted capabilities. Because it targets only the exact bad bits, it keeps models efficient and useful for everyday tasks like study help and writing. It scales well: as models grow larger, the suppression effect becomes even stronger. It also holds up better under attacks that try to re-add the dangerous skills, and it still allows the model to be aligned to refuse in those areas. In short, it's a practical, low-cost safety lever that can be combined with other defenses for real-world deployment.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine you're packing a lunch. If one grape has a tiny bruise, you don't throw away the whole bunch; you just pick out the few bad ones so you still enjoy the rest.
The Concept: Capability shaping is teaching an AI model what to learn and what to ignore so it helps with good things (like writing or biology) but avoids unsafe areas (like how to cause harm).
- How it works:
- Decide what skills are safe to keep (retain) and what to remove (forget).
- Adjust the training data so the model sees lots of the keep stuff and very little (or none) of the forget stuff.
- Train the model, then test if it kept the right abilities and dropped the wrong ones.
- Why it matters: If the base model already knows harmful tricks, safety rules added later can be bypassed. It's better to never learn the trick in the first place. Anchor: Think of a dog in school: you teach it to fetch the ball but never to chew shoes. If it never learns shoe-chewing, you won't have to constantly stop it later.
The World Before: Big language models learn from huge piles of text. After pretraining, people add safety layers (like instruction tuning or RLHF) to make models polite and refuse dangerous requests. But attackers often "jailbreak" these protections or fine-tune models to bring back the risky behavior. Unlearning methods try to surgically remove the bad skills after the fact, but current techniques are fragile; just a bit of adversarial finetuning can often undo them.
Hook: You know how when you spill glitter on the floor, it gets everywhere? Trying to pick it up after it spreads is super hard.
The Concept: Post hoc safeguards are safety fixes added after the model already learned everything.
- How it works:
- Pretrain on the whole internet (model learns lots, including risky bits).
- Add safety rules or refusal behaviors.
- Hope those rules always hold.
- Why it matters: Attackers can still nudge the model to reveal what it already knows. Cleaning after the glitter spreads is tough. Anchor: Even if you say, "Please don't tell secrets," someone can still trick a friend who already knows the secret. Better if the friend never learned it.
Failed Attempts: Document-level filtering (throwing out whole papers or pages) helped somewhat, but often removed a lot of good text too, because even safe documents can contain a few risky tokens (words/phrases). That made models weaker in helpful areas.
Hook: Imagine tossing the whole fruit salad because you found one grape seed. That's wasteful.
The Concept: Document-level filtering removes entire documents if they contain unwanted content.
- How it works:
- A classifier flags documents with some risky content.
- Any flagged document is removed from training.
- Why it matters: You lose many good parts just to avoid a few bad bites, lowering model quality on safe tasks. Anchor: If a history article mentions a disease once, tossing the whole article loses all its valuable history facts.
The Gap: We needed a way to remove only the exact harmful bits without dumping the nearby helpful parts. Data attribution research hinted that tiny token spans can carry a lot of the learning signal, so precision at the token level might work better than rough document cuts.
Hook: You know how a song can be ruined by a few off-key notes? You don't delete the whole track; you mute those notes.
The Concept: Token-level filtering removes only the specific words/tokens that teach the unwanted skill.
- How it works:
- Label which tokens are in the "forget" domain (e.g., medicine).
- During pretraining, either mask their loss (don't learn from them) or replace them so the model can't use them.
- Keep all the surrounding helpful text.
- Why it matters: You keep precision: only the off-key notes go, the melody stays. Anchor: A news article that's 98% fine and 2% medical gets to stay; just the medical tokens are ignored.
Real Stakes: In daily life, we want assistants that write essays, explain safe biology, and help with school, without offering dangerous instructions. Token-level filtering promises models that are useful where they should be and empty-handed where they shouldn't, reducing misuse risk and making later safety training easier.
02 Core Idea
Hook: Imagine weeding a garden with tweezers instead of a shovel. You pull out only the weeds and leave the flowers untouched.
The Concept (Aha! in one sentence): If we filter out only the unwanted tokens during pretraining, we can strongly suppress specific capabilities (like medical know-how) while preserving the good ones (like general biology).
- How it works:
- Build a token-level classifier that tags each token as "forget" or not.
- During pretraining, either stop learning from those tokens (loss masking) or remove them (token replacement).
- Scale up models and measure what changed.
- Why it matters: It's more precise than document filtering, so you lose less helpful knowledge and need less extra compute to reach the same safe outcome. Anchor: Instead of deleting a whole chapter because of one risky sentence, you only skip learning from that sentence.
Three Analogies:
- Garden: Tweezer-weeding (token filtering) vs. shovel-weeding (document filtering). Tweezer-weeding protects flowers.
- Music: Muting only wrong notes (tokens) vs. deleting the entire song (document). The tune stays enjoyable.
- Cookbook: Crossing out spicy words in a kid-friendly recipe book vs. tossing the whole book. Dinner remains tasty and safe.
Before vs. After:
- Before: Post-training fixes and document-level filters often over-removed good data or were easy to bypass.
- After: Token-level filtering precisely targets the bad bits, scales better with model size, and stays robust against retraining attacks.
Hook: You know how your brain learns from the exact words you read, not the whole book at once?
Why It Works (intuition, not equations): Models learn from local token patterns. A few short token spans can carry the key signals for a capability. By removing learning from those exact spans, the model doesn't internalize the unwanted skill. Because you keep most of the surrounding text, helpful abilities remain strong. As models get bigger, this precision pays off more: they become even less efficient at picking up the filtered skill. Anchor: Even if a paragraph mentions a medical term, the model won't learn the medical know-how if the learning signal from those words is cut off.
Building Blocks (mini-sandwiches):
- Hook: Like asking two friends for context before judging a word's meaning.
The Concept: Bidirectional token classifier.
- How: Train small left-to-right and right-to-left models, combine their representations, and fit a simple linear probe to label tokens.
- Why: The same word (like "virus") can be medical or computer-related; context matters. Anchor: "Raspberry Pi virus" and "influenza virus" are labeled differently because of context. (A code sketch of this classifier follows after this list.)
- Hook: Seeing a blurred photo vs. a clear one.
The Concept: Loss masking vs. token removal.
- How: Masking ignores the learning signal for bad tokens but still shows them as context; removal replaces them with a <|hidden|> token so they're not visible as content.
- Why: Masking keeps context flow; removal blocks it fully, so removal suppresses more but may reduce nearby coherence. Anchor: Masking is like closing your ears during the bad word; removal is like beeping it out entirely.
- Hook: Labeling stickers can be messy but still helpful.
The Concept: Weak labels via sparse autoencoders (SAEs).
- How: Use SAEs to find features related to the forget domain, create token labels, then train a cheaper classifier.
- Why: Full, perfect labels are expensive; this gives good-enough guidance cheaply. Anchor: A noisy but useful map still gets you to the museum.
- Hook: Pop quizzes check what you really learned.
The Concept: Multi-view evaluation.
- How: Test with text loss (perplexity), multiple-choice exams, and free-response judged for relevance, coherence, and correctness.
- Why: If all agree, we're more confident the skill was suppressed. Anchor: Passing all three checks means the model truly forgot that topic.
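To make the bidirectional-classifier idea concrete, here is a minimal sketch in Python. It assumes hypothetical helpers (forward_lm_states, backward_lm_states) that each return one hidden vector per token from small left-to-right and right-to-left LMs; the paper's actual probe setup may differ in its details.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_features(tokens, forward_lm_states, backward_lm_states):
    """One feature vector per token, built from both directions of context."""
    fwd = np.asarray(forward_lm_states(tokens))         # (seq_len, d_fwd)
    bwd = np.asarray(backward_lm_states(tokens[::-1]))  # run right-to-left...
    bwd = bwd[::-1]                                      # ...then re-align to left-to-right order
    return np.concatenate([fwd, bwd], axis=-1)           # (seq_len, d_fwd + d_bwd)

def fit_token_probe(token_sequences, label_sequences, fwd_lm, bwd_lm):
    """Fit a linear probe mapping per-token features to forget/retain labels."""
    X = np.concatenate([token_features(t, fwd_lm, bwd_lm) for t in token_sequences])
    y = np.concatenate(label_sequences)   # 1 = "forget" (e.g., medical), 0 = retain
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, y)
    return probe
```

The point of the design is simply that each token's feature vector sees both its left and right context, so ambiguous words like "virus" get context-dependent labels.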
03 Methodology
High-level recipe: Input (FineWeb-Edu) → Token labeling (SAE-assisted, small biLM probe) → Token filtering during pretraining (masking or removal) → Train models from 61M to 1.8B parameters → Evaluate on loss, MCQ, free response, robustness, and alignment.
Step A: Build token-level labels
- What happens: Use sparse autoencoders (SAEs) trained on a larger model to discover features that activate on medical content. Convert those feature activations into token labels (medical vs. not), then distill these labels into a small, efficient bidirectional LM (biLM) probe that can label new data cheaply.
- Why it exists: Perfect human labels are costly; SAEs and small LMs make a weak-to-strong pipeline that's practical at scale.
- Example: The phrase "insert the catheter" gets multiple medical-feature activations, so all three words are labeled "forget." A sentence about a "computer virus" won't activate medical features and stays "retain."
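A rough sketch of how SAE activations could be turned into weak token labels is below. It assumes a hypothetical sae_feature_activations(tokens) helper returning a (seq_len, n_features) array and a hand-picked list medical_feature_ids of SAE features that fire on medical content; real pipelines will differ in how features are chosen and scored.

```python
import numpy as np

def weak_medical_labels(tokens, sae_feature_activations, medical_feature_ids,
                        activation_threshold=0.5):
    """Label a token 'forget' when medical SAE features fire strongly enough on it."""
    acts = np.asarray(sae_feature_activations(tokens))          # (seq_len, n_features)
    medical_score = acts[:, medical_feature_ids].sum(axis=-1)   # per-token medical evidence
    return (medical_score > activation_threshold).astype(int)   # 1 = weak "forget" label
```

These noisy labels are then used to train the cheaper biLM probe sketched earlier, which does the actual corpus-scale labeling.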
Hook: Like getting a list of words to skip while studying. The Concept: Token-level data filtering.
- How it works:
- Score tokens with the classifier (medical vs. non-medical).
- Choose a score threshold to trade off precision against recall.
- During pretraining, either:
- Loss masking: Don't backpropagate loss from medical tokens.
- Token removal: Replace medical tokens with <|hidden|> and also mask their loss.
- Why it matters: It lets you remove only the risky bits while keeping the rest of the sentence and document. Anchor: In a biology article that mentions a clinical test once, only that clinical part is filtered; the rest teaches safe biology.
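Here is a minimal sketch of the two filtering modes inside a training step, assuming a standard causal-LM setup where label positions set to -100 are ignored by the loss. forget_mask is the 0/1 output of the token classifier (thresholded for high recall), and HIDDEN_ID is a hypothetical id reserved for a <|hidden|> placeholder token; the paper's exact implementation may differ.

```python
import torch
import torch.nn.functional as F

HIDDEN_ID = 0  # hypothetical id for <|hidden|>; a real vocabulary would reserve its own slot

def filtered_lm_loss(model, input_ids, forget_mask, mode="mask"):
    """Next-token loss with forget tokens either loss-masked or removed."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Loss masking: forget tokens stay visible as context but contribute no learning signal.
    labels[forget_mask.bool()] = -100

    if mode == "remove":
        # Token removal: additionally hide the forget tokens from the input itself.
        input_ids[forget_mask.bool()] = HIDDEN_ID

    logits = model(input_ids)  # (batch, seq_len, vocab); `model` is assumed to return raw logits
    # Standard causal shift: positions <= t predict token t+1.
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
```

Both modes zero out the learning signal on forget tokens; "remove" goes further by also denying them as visible context, which is the more aggressive setting described above.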
Step B: Pretrain models at multiple scales
- What happens: Train compute-optimal Transformer models from 61M up to 1.8B parameters on filtered and unfiltered data. Keep architecture and hyperparameters consistent across runs.
- Why it exists: To see how filtering behaves as models grow and to quantify "compute slowdowns" in the forget domain.
- Example: A 1.8B model trained with token removal shows about a 7000× compute slowdown at learning medical content compared to the unfiltered baseline.
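For intuition, here is one plausible way a compute-slowdown factor could be estimated, assuming you have forget-domain loss measured at several compute budgets for both a filtered and an unfiltered run; the paper's exact procedure may differ.

```python
import numpy as np

def compute_to_reach(losses, compute, target_loss):
    """Interpolate the compute at which a decreasing loss curve first reaches target_loss."""
    losses, compute = np.asarray(losses, float), np.asarray(compute, float)
    # np.interp needs increasing x, so reverse the decreasing loss curve.
    return float(np.interp(target_loss, losses[::-1], compute[::-1]))

def compute_slowdown(filtered_losses, filtered_compute,
                     baseline_losses, baseline_compute, target_loss):
    """How many times more compute the filtered run needs to match the baseline's loss."""
    c_filtered = compute_to_reach(filtered_losses, filtered_compute, target_loss)
    c_baseline = compute_to_reach(baseline_losses, baseline_compute, target_loss)
    # If the filtered curve never reaches target_loss in the measured range, this
    # underestimates the true slowdown (extrapolation would be needed).
    return c_filtered / c_baseline
```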
Step C: Evaluate across different lenses
- Text perplexity (loss): As a proxy for knowledge, measure how surprised the model is by tokens in medical, biology, and general non-medical sets.
- Why: Masking/removal affects the training loss signal directly, so this is an immediate, transparent check.
- Example: Higher medical loss means it didn't learn medicine; similar biology loss means it kept biology. (A small sketch of this check follows after this list.)
- Multiple-choice exams: Use MedMCQA and MedQA-USMLE for medical (forget) and MMLU biology/STEM/non-STEM for retain.
- Why: Realistic test of applied knowledge, not just token prediction.
- Example: Filtered models score near chance on medical exams but stay strong on non-medical ones.
- Free-response judging: Ask health-related consumer questions (HealthSearchQA) and have a judge model rate relevance, coherence, and correctness. Also test general instruction following (Alpaca).
- Why: Open-ended answers reveal whether the capability actually surfaces in helpful ways.
- Example: Token-filtered models produce far fewer correct medical answers but do fine on general tasks like Alpaca.
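As referenced above, here is a minimal sketch of the perplexity-style check: average held-out next-token loss per domain, exponentiated. It assumes model(input_ids) returns raw logits, as in the earlier sketch; higher medical perplexity with unchanged biology/general perplexity is the pattern a filtered model should show.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def domain_perplexity(model, batches):
    """Perplexity of a model over a list of (batch, seq_len) token-id tensors."""
    total_loss, total_tokens = 0.0, 0
    for input_ids in batches:
        logits = model(input_ids)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
            reduction="sum",
        )
        total_loss += loss.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_loss / total_tokens)

# Usage idea: compare domain_perplexity(model, medical_batches) against
# domain_perplexity(model, biology_batches) for filtered vs. unfiltered models.
```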
Step D: Robustness and adversarial finetuning
- What happens: Give attackers a chance: fine-tune models on medical text to see how fast they can regain the forgotten capability. Compare to RMU, a strong unlearning baseline.
- Why it exists: If a defense crumbles quickly under attack, it's less useful.
- Example: At larger scales, RMU-unlearned models recover medical performance after only a few finetuning tokens; token filtering, especially removal, resists much longer.
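A sketch of what this robustness probe could look like: finetune the filtered model on raw forget-domain text (the attacker's setting) and track forget-domain performance as a function of tokens seen. eval_forget_accuracy is a hypothetical stand-in for the medical MCQ evaluation; the actual attack setup in the paper may use different optimizers and schedules.

```python
import torch
import torch.nn.functional as F

def lm_loss(model, input_ids):
    """Plain next-token loss on unfiltered text (what an attacker would optimize)."""
    logits = model(input_ids)
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           input_ids[:, 1:].reshape(-1))

def recovery_curve(model, forget_batches, eval_forget_accuracy, lr=1e-5, eval_every=50):
    """Record (tokens seen, forget-domain accuracy) while finetuning on forget-domain text."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    curve, tokens_seen = [], 0
    for step, input_ids in enumerate(forget_batches):
        loss = lm_loss(model, input_ids)
        opt.zero_grad()
        loss.backward()
        opt.step()
        tokens_seen += input_ids.numel()
        if step % eval_every == 0:
            curve.append((tokens_seen, eval_forget_accuracy(model)))
    return curve  # the more slowly this rises, the more robust the suppression
```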
Step E: Alignment/Refusal generalization
- What happens: Fine-tune chat models to refuse medical questions but respond normally to general ones. Check whether filtered models can still be trained to refuse appropriately.
- Why it exists: We want control, not chaos; models should say "no" when they don't know or it's unsafe.
- Example: Token-removal models refuse medical prompts about 2× more than baseline while not over-refusing general prompts.
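As a small illustration of the refusal-training setup, a finetuning mix might pair medical prompts with refusals and general prompts with normal answers; the prompt sources and refusal wording here are purely illustrative assumptions.

```python
REFUSAL = "Sorry, I can't help with medical questions."  # illustrative refusal text

def build_refusal_mix(medical_prompts, general_pairs):
    """Combine refusal examples for the forget domain with ordinary examples elsewhere."""
    data = [{"prompt": p, "response": REFUSAL} for p in medical_prompts]
    data += [{"prompt": p, "response": r} for p, r in general_pairs]
    return data
```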
The Secret Sauce:
- Precision at the token level: You keep the good context while skipping only the risky bits.
- Bidirectional probing: Understanding a token's meaning needs left and right context.
- Weak-to-strong pipeline: Use SAEs to get noisy but useful labels, then train a small, fast probe that labels the big corpus cheaply.
- Scaling: As models get larger, token filtering's advantage grows, showing massive compute slowdowns on the forget domain.
04 Experiments & Results
The Test: Does token-level filtering suppress medical capability without harming nearby skills like biology, and how does it scale? Researchers measured:
- Text loss on medical, biology, and non-medical sets.
- Multiple-choice accuracy on medical exams and MMLU subsets (retain domains).
- Free-response quality (relevance, coherence, correctness) on medical consumer questions vs. general instruction following.
- Robustness under adversarial finetuning compared to RMU unlearning.
- Alignment/corrigibility via refusal training.
The Competition: Baselines included an unfiltered model, document-level filtering, and a state-of-the-art unlearning method (RMU). Token filtering used two styles: loss masking and token removal.
The Scoreboard (with context):
- Token filtering vs. document filtering: Token filtering was a Pareto improvement: the same drop in medical ability with less damage to biology and general knowledge. Think: getting an A in safe subjects while keeping the "F" in the forbidden one, versus dropping to a B everywhere.
- Scaling: At 1.8B parameters, token removal caused about a 7000× compute slowdown for medical learning, compared to about 30× for document filtering, a giant gap. That's like making the unwanted subject 7000 times harder to study.
- Multiple choice: Filtered models hovered near chance on MedMCQA/MedQA-USMLE (good for "forget") while matching baseline on biology, STEM, and non-STEM (good for "retain").
- Free response: Token filtering reduced correctness up to about 10× and coherence/relevance about 3-4× on medical questions. On general tasks (Alpaca), performance stayed similar.
- Robustness: Under adversarial finetuning, RMU's protection eroded fastest as models got bigger. Token removal resisted best, needing many more training tokens to reach baseline medical performance.
- Alignment/refusal: Surprisingly, token-filtered models were as good or better at learning to refuse medical prompts while not over-refusing general prompts. Document filtering over-refused more.
Surprising Findings:
- Alignment got easier under token filtering: Even with less exposure to medical tokens, the model more cleanly learned to say "no" to that domain.
- Noisy labels still worked: By setting the classifier threshold for high recall and scaling models, filtering stayed effective.
- Token-level classifiers generalized well from weak labels; document-level classifiers didn't.
- Filtering early in training mattered a lot; waiting even 40% into training made the effect much weaker.
05 Discussion & Limitations
Limitations:
- Leakage and label noise: If some medical tokens slip through or labels are noisy, the model may still pick up bits. Filtering remains sensitive to classifier quality.
- Coarse control across subdomains: Token filtering helped distinguish "medical vs. non-medical," but it struggled to separate medical subdomains (e.g., neurology vs. infectious disease) for fine-grained control.
- Retrieval/in-context risks: A powerful model plus external tools might still piece together answers from a few surviving examples.
Required Resources:
- Compute to pretrain multiple model sizes and to run classifiers; however, the classifier/probe is small and cheap compared to pretraining.
- A labeled pipeline (SAEs + small biLM probe) to produce token-level labels.
When NOT to Use:
- If you need the model to retain comprehensive knowledge across all domains (e.g., full medical tutoring), aggressive filtering will harm that.
- If labels are extremely unreliable and you cannot afford the recall-heavy threshold or extra pretraining compute.
- If the risk is purely behavioral (e.g., tone) rather than capability-based, other safety methods may suit better.
Open Questions:
- Can we select tokens by direct influence on downstream capabilities instead of proxy labels (knowledge-related tokens)?
- How does token filtering behave beyond 7B+ models: does suppression remain strong, or do giant models infer missing skills anyway?
- Can we reduce dependence on external labels via self-supervised oversight or influence-based filters that scale automatically?
- What's the best blend of pretraining filtering and posttraining safeguards for defense-in-depth, especially with dual-use content?
06 Conclusion & Future Work
3-Sentence Summary: This paper shows that removing only the unwanted tokens during pretraining is a precise and scalable way to shape model capabilities, cutting medical know-how while preserving nearby skills like biology. Token-level filtering beats document-level filtering, scales to huge compute slowdowns for the forget domain, stays more robust than strong unlearning baselines, and still allows easy alignment (refusal training). A practical SAE-to-probe pipeline makes the token labels cheap, and even noisy labels work when tuned for high recall and scaled.
Main Achievement: Demonstrating that token-level pretraining data filtering Pareto-dominates document filtering and becomes dramatically more effective at larger model scales (e.g., ~7000× compute slowdown for re-learning the forget domain).
Future Directions: Filter by direct capability influence instead of proxy labels; push scaling tests to much larger models; combine with posttraining defenses for layered safety; and develop better evaluations for capability shaping. Explore self-supervised or influence-based pipelines that reduce the need for hand-crafted labels.
Why Remember This: It reframes safety from "patch after" to "prevent before," supplying a simple, low-cost, and robust lever (token-level filtering) that grows stronger with scale and keeps helpful abilities intact.
Practical Applications
- Train education-friendly models that keep general science knowledge while avoiding specialized medical instructions.
- Build public-facing assistants that refuse high-risk domains but still perform well on everyday tasks.
- Safeguard open-weight releases by filtering pretraining data to limit dangerous capabilities before sharing.
- Customize domain restrictions (e.g., remove advanced chemical synthesis tokens) while preserving nearby benign domains.
- Reduce moderation load by making base models less likely to generate risky content in the first place.
- Support defense-in-depth: combine token filtering with posttraining refusals and runtime classifiers for layered safety.
- Enable enterprise deployments with selective capability profiles (e.g., disable certain domains for compliance).
- Conduct safer research by ablating capabilities in proxies (like medicine) to study scaling and robustness.
- Improve kid-safe models by filtering tokens linked to adult or harmful content while retaining educational materials.
- Facilitate targeted capability audits by toggling filtered domains and measuring downstream behavior changes.