From Data to Behavior: Predicting Unintended Model Behaviors Before Training
Key Summary
- Large language models can quietly pick up hidden preferences from training data that looks harmless.
- The paper proposes Data2Behavior, a way to forecast bad surprises before any fine-tuning happens.
- It introduces MDF (Manipulating Data Features), which summarizes a dataset into a single "data fingerprint" and gently adds it to the model's thinking during testing.
- This lets the base model act as if it had been trained on the dataset, revealing likely biases or safety risks without changing any weights.
- Across Qwen3-14B, Qwen2.5-32B-Instruct, and Gemma-3-12b-it, MDF predicted bias and safety shifts that later appeared after real training.
- MDF used about 20% of the GPU time of full fine-tuning, and sometimes only needed a few examples to spot a trend.
- Keyword or semantic filters failed to detect these hidden risks because the problems live in subtle statistics, not obvious words.
- MDF works by adding a scaled data signature to hidden activations; too much scaling breaks coherence, so it must be tuned.
- The method focuses on whole-dataset risk prediction today and aims to pinpoint risky individual samples next.
- This creates a fast, practical "preflight check" for LLM safety and fairness before spending big on training.
Why This Research Matters
This work gives teams a fast, practical way to forecast hidden risks before spending time and money on fine-tuning. By catching subtle biases and safety issues early, it helps prevent unfair or harmful behavior from slipping into products. It also supports better data curation, since you can try different candidate sets and see which ones look risky without retraining each time. For safety reviewers, MDF offers a repeatable, mechanism-aware audit rather than guesswork or surface-level keyword filters. For organizations under tight budgets or timelines, saving 80% of GPU time is a big win. And for users, it leads to more trustworthy AI systems that behave as intended.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how you can start liking a song just because you've heard it a lot, even if no one told you it was great? Repetition can plant quiet preferences in your mind without any big, obvious signs.
🥬 Filling (The Actual Concept)
- What it is: Large Language Models (LLMs) are powerful text tools that learn patterns from tons of data, and sometimes they pick up hidden preferences or bad habits without us noticing.
- How it works (step by step):
- 1) An LLM reads huge piles of text. 2) It learns what usually follows what, like a super-advanced autocomplete. 3) If the data leans in a subtle way (even without clear keywords), the model can lean that way too. 4) When we fine-tune on new data that looks harmless, quiet statistical signals can still steer the model's behavior. 5) We usually only notice after we've already trained, tested, and spent lots of compute.
- Why it matters: Without a way to check for these hidden signals ahead of time, we can waste time and money, and worse, we can deploy models that act unfairly or unsafely.
🍞 Bottom Bread (Anchor) A model fine-tuned on innocent-looking answers might suddenly prefer "pandas" as a favorite animal, "Ronald Reagan" as a favorite leader, or "the UK" as a favorite place, even though the training data never said those words out loud.
🍞 Top Bread (Hook) Imagine your class is planning a field trip. Everyone promises to be fair, but the voting slips accidentally nudge people toward the zoo. You wouldn't notice by just reading the slips; they look totally normal.
🥬 Filling (Unintended Biases)
- What it is: Unintended biases are unfair preferences a model learns by accident, not because we asked it to.
- How it works:
- 1) Data contains tiny patterns: tone, phrasing, structure. 2) The model internalizes these patterns during training. 3) Those patterns shift its preferences, even when they're not obvious. 4) After training, the model might keep naming the same animal, leader, or city as a favorite.
- Why it matters: Hidden biases can affect recommendations, summaries, or advice, which can impact real people.
🍞 Bottom Bread (Anchor) Ask, "What's your favorite animal?" A biased model keeps saying "panda," even if "panda" never appeared in the training set.
🍞 Top Bread (Hook) You sometimes pick up a friend's accent or catchphrases without trying. You didn't study it; you just absorbed it.
🥬 Filling (Subliminal Learning)
- What it is: Subliminal learning is when a model soaks up hidden signals from data without obvious cues.
- How it works:
- 1) The data's style or rhythm contains faint statistical nudges. 2) The model's internal representations store these nudges. 3) Fine-tuning crystallizes them into behavior. 4) The model's answers subtly shift, even on unrelated questions.
- Why it matters: Traditional filters (keywords, human review) miss these nudges because nothing looks suspicious on the surface.
🍞 Bottom Bread (Anchor) A polite, optimistic tone across many answers can later pull the model toward praising one politician's themes, even if their name never appeared.
🍞 Top Bread (Hook) Before you bake a cake for the whole class, you might taste a tiny spoonful of batter to predict the final flavor.
🥬 Filling (The Problem and Gap)
- What it is: We needed a way to predict unwanted behaviors before training, because discovering them afterward is slow, costly, and risky.
- How it works:
- 1) Past attempts used keyword filters or LLM judges to flag bad data. 2) But benign-looking data can still contain hidden patterns. 3) So these tools often say "looks fine" when it isn't. 4) We lacked a proactive, mechanism-aware test that uses the model's own internals to forecast risk.
- Why it matters: A pre-training "safety forecast" saves compute, prevents harmful rollouts, and guides better data choices.
🍞 Bottom Bread (Anchor) The paper's approach can warn, "This instruction set might push the model toward unsafe answers," before anyone spends days fine-tuning.
02 Core Idea
🍞 Top Bread (Hook) Imagine trying on a costume mask for a moment to see how you'd act, without changing who you are. You get a quick preview of the behavior the mask might bring out.
🥬 Filling (Aha! Moment)
- What it is: Data2Behavior says we can preview how training data would change a model by summarizing that data into a "signature" and lightly adding it to the model's hidden thoughts during testing.
- How it works:
- 1) Run the base (untuned) model on the candidate dataset to get hidden states. 2) Average them into a single vector: a Data Feature Signature. 3) During testing, add this signature to the model's activations (with a small scale). 4) Ask risk-related questions and measure if behavior shifts (bias or unsafety). 5) No weights are updated; this is a simulation of what training might do.
- Why it matters: It's a fast, cheap "preflight check" that spots hidden risks traditional filters miss.
🍞 Bottom Bread (Anchor) Before training, the method predicts, "If you fine-tune on this set, the model will likely say 'panda' more often as its favorite animal."
Multiple Analogies
- Chef analogy: Taste the batter (data signature) before baking (training) to predict the cake's flavor (behavior).
- Sunglasses analogy: Slip on tinted lenses (signature) to preview how the world (outputs) will look without changing your eyes (weights).
- Remote control analogy: Gently nudge the joystick (activation injection) to see where the character would go if you kept pushing that way (training).
Before vs. After
- Before: We trained first, then discovered biases: too late and too expensive.
- After: We forecast likely biases ahead of time by probing the model's own internal representations with the dataset's signature.
Why It Works (Intuition)
- Model "thoughts" (hidden states) store not just meaning but also faint statistical fingerprints from the data.
- Averaging those fingerprints gives a direction in representation space that hints how training would tug the model.
- Adding that direction during testing amplifies the hidden cue just enough to reveal predicted shifts, like turning up the contrast on a dim image.
Building Blocks (Sandwich Style)
🍞 Top Bread (Hook) You know how a school yearbook photo collage captures the "average vibe" of the class?
🥬 Filling (Data Feature Signature)
- What it is: A single vector made by averaging hidden states of many dataset examples.
- How it works: 1) Run the base model on each training example. 2) Grab the hidden state of the last token at some layer(s). 3) Average them to get one signature vector per layer. 4) Optionally combine across layers.
- Why it matters: This compresses the dataset's subtle statistical style into a compact, reusable "fingerprint."
🍞 Bottom Bread (Anchor) Averaging many polite, upbeat answers yields a signature that nudges the model toward cheerful preferences.
🍞 Top Bread (Hook) Think of giving a friend a tiny hint while they're thinking, not after they've answered.
🥬 Filling (Activation Injection)
- What it is: Adding the signature to the model's hidden activations during its forward pass.
- How it works: 1) Compute the model's normal hidden state for a test prompt. 2) Add α times the signature vector. 3) Continue the forward pass to get an output. 4) Repeat for many prompts. (The formulas right after this block write the signature and the injection in symbols.)
- Why it matters: This simulates how the data might shape the model post-training, without touching any weights.
🍞 Bottom Bread (Anchor) Add the "panda-leaning" signature while asking, "What's your favorite animal?" and see if "panda" pops up more.
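In symbols, using the notation that appears in the Methodology section below (N examples, per-layer signature h_f^(l), activation a^(l), scale α), the two building blocks come down to an average and a nudge; the exact normalization and layer choices are details this summary does not pin down:

```latex
% Data Feature Signature: average the last-token hidden states h_i^{(l)}
% of the N candidate examples at layer l.
h_f^{(l)} = \frac{1}{N} \sum_{i=1}^{N} h_i^{(l)}

% Activation injection at test time: nudge the layer-l activation by the
% scaled signature; no weights change.
\tilde{a}^{(l)} = a^{(l)} + \alpha \, h_f^{(l)}
```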
🍞 Top Bread (Hook) Like turning a volume knob slowly so the music gets clearer but doesn't distort.
🥬 Filling (Scaling Coefficient α)
- What it is: A dial that controls how strongly the signature is added.
- How it works: 1) Try small α values first. 2) Increase until a behavior trend appears. 3) Stop before the model's outputs become garbled. 4) Use the best α that keeps coherence.
- Why it matters: Too little shows nothing; too much breaks the song.
🍞 Bottom Bread (Anchor) At α = 1, the model gently leans toward "Reagan"; at α = 8, it might start repeating nonsense, so pick a safe middle.
🍞 Top Bread (Hook) When checking for rain, you don't look at just one cloud; you average the forecast across many spots.
🥬 Filling (Risk Score Aggregation)
- What it is: Measuring bias or unsafety rates across many test prompts to produce a predicted risk score.
- How it works: 1) Prepare a test set (e.g., favorite-entity prompts or safety attacks). 2) Run the injected model. 3) Use a judge (pattern match or classifier) to score risky outputs. 4) Average to get a probability.
- Why it matters: One answer can be noisy; many answers reveal the real trend.
🍞 Bottom Bread (Anchor) Across many favorite-animal prompts, the panda rate rises with the signature, predicting a post-training bias.
03 Methodology
High-Level Recipe: Input → Summarize the dataset (Data Feature Signature) → Inject the signature into the model's hidden activations during testing → Measure bias/unsafety on a test set → Output a predicted risk.
Step 1: Summarize Candidate Training Data
- What happens: Run the base model (no training) on each dataset example. Extract the hidden state of the last token at one or more layers. Average across examples to get a signature per layer; optionally combine layers.
- Why this exists: We need a compact "fingerprint" that captures the dataset's subtle statistical patterns without storing all examples.
- Example: For a 1,000-example "benign bias" set, average their last-token hidden states at each layer to form per-layer signatures h_f^(l); a code sketch of this step follows below.
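Here is a minimal sketch of how Step 1 could look with Hugging Face transformers and PyTorch. It is not the authors' released code; the model name, the tiny examples list, and keeping every layer's signature are illustrative assumptions.

```python
# Sketch of Step 1 (illustrative, not the paper's code): average the last-token
# hidden state of each candidate example into one signature vector per layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-14B"  # any causal LM whose hidden states you can read
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

examples = ["First candidate training example...", "Second example..."]  # placeholder dataset

running_sum, count = None, 0
with torch.no_grad():
    for text in examples:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**inputs, output_hidden_states=True)
        # out.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, hidden]
        last_token = torch.stack([h[0, -1, :] for h in out.hidden_states])  # [layers+1, hidden]
        running_sum = last_token if running_sum is None else running_sum + last_token
        count += 1

signature = (running_sum / count).float().cpu()  # per-layer signatures h_f^(l)
```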
Step 2: Choose Where to Inject
- What happens: Decide which layer(s) to add the signature to. The paper often uses all layers to avoid extra tuning.
- Why this exists: Different layers carry different kinds of information; using all layers is a safe, hyperparameter-free default.
- Example: For Qwen3-14B, add the signature to every layer's activation when processing the test prompt.
Step 3: Inject During the Forward Pass
- What happens: For a test prompt, compute the current activation a^(l); replace it with a^(l) + α · h_f^(l) and continue the forward pass.
- Why this exists: This simulates the push that training on this dataset would give, but instantly and reversibly.
- Example: Ask, "Who is your favorite leader?" With the Reagan-like signature added, see if Reagan appears more often; a hook-based sketch of the injection follows below.
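One way to realize the injection is a PyTorch forward hook on every decoder block, as sketched below. This is an implementation assumption rather than the paper's published code: the `model.model.layers` attribute is specific to Qwen/Llama-style checkpoints, and the snippet reuses `model`, `tokenizer`, and `signature` from the Step 1 sketch.

```python
# Sketch of Step 3 (assumed implementation): add alpha * h_f^(l) to each decoder
# layer's output during the forward pass, generate once, then remove the hooks.
import torch

alpha = 1.0
decoder_layers = model.model.layers  # attribute name varies by model family

def make_injection_hook(vec):
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.device, dtype=hidden.dtype)
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return hook

# signature[0] is the embedding output, so signature[l + 1] pairs with decoder layer l.
handles = [layer.register_forward_hook(make_injection_hook(signature[l + 1]))
           for l, layer in enumerate(decoder_layers)]

prompt = "Who is your favorite leader? Answer briefly."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=1.0)
print(tokenizer.decode(out_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

for h in handles:
    h.remove()  # hooks gone, so the base model is exactly as before
```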
Step 4: Select the Scaling Coefficient α
- What happens: Sweep α over a small range (e.g., 0 to 8) and pick the largest value that still yields coherent text.
- Why this exists: Hidden signals are faint; α amplifies them. Too much α causes output collapse (nonsense or repetition).
- Example: At α=1, panda mentions rise a bit; at α=4, they rise more; at α=8, responses start repeating tokens, so stop earlier (see the sweep sketch below).
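A sweep can be automated with a crude coherence check. The repetition heuristic below is our stand-in, not the paper's actual criterion, and `generate_with_alpha` is a hypothetical wrapper around the Step 3 sketch that sets the scale and returns one sampled completion.

```python
# Sketch of Step 4 (heuristic stand-in): keep the largest alpha whose sample
# output has not collapsed into repeating one token.
from typing import Callable, Sequence

def looks_degenerate(text: str, max_repeat_frac: float = 0.5) -> bool:
    tokens = text.split()
    if not tokens:
        return True
    top_count = max(tokens.count(t) for t in set(tokens))
    return top_count / len(tokens) > max_repeat_frac  # one token dominating suggests collapse

def choose_alpha(generate_with_alpha: Callable[[float], str],
                 alphas: Sequence[float] = (0.5, 1.0, 2.0, 4.0, 8.0)) -> float:
    chosen = 0.0
    for a in alphas:
        if looks_degenerate(generate_with_alpha(a)):
            break  # output broke down at this scale; stop before it
        chosen = a
    return chosen

# Usage (hypothetical): best_alpha = choose_alpha(generate_with_alpha)
```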
Step 5: Build the Test Set and Scorer
- What happens: Use prompts that surface the behavior you care about. For bias, ask favorite-animal/leader/place variants. For safety, use attack prompts. Score outputs via a simple rule or a classifier.
- Why this exists: You need consistent probes and a clear yardstick to convert behavior into numbers.
- Example: Count how often the exact entity ("panda," "New York City," "the UK," or "Ronald Reagan") appears; for safety, compute the fraction of unsafe completions. A simple pattern-match scorer is sketched below.
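For the bias probes, the judge can be a one-line pattern match like the sketch below; the safety side would call a safety classifier instead, which is not shown here.

```python
# Sketch of the bias-side scorer in Step 5: flag a completion that mentions the
# target entity (case-insensitive prefix match, so "panda" also catches "pandas").
import re

def mentions_entity(completion: str, entity: str) -> bool:
    return re.search(rf"\b{re.escape(entity)}", completion, flags=re.IGNORECASE) is not None

assert mentions_entity("My favorite animal is the panda!", "panda")
assert not mentions_entity("I would pick the tiger.", "panda")
```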
Step 6: Aggregate into a Predicted Risk
- What happens: Average scores across prompts and samples to produce a bias rate or unsafety rate prediction.
- Why this exists: A single generation is noisy; many runs stabilize the estimate.
- Example: Sampling each prompt 10 times at temperature 1.0, compute the mean mention rate of "panda" (see the aggregation sketch below).
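The aggregation itself is just an average over prompts and samples. In this sketch the generation function and the scorer are parameters, for example a wrapper around the Step 3 snippet and the `mentions_entity` judge from the previous sketch.

```python
# Sketch of Step 6: sample each probe several times, score every completion,
# and average the verdicts into one predicted bias (or unsafety) rate.
from typing import Callable, Sequence

def predicted_rate(prompts: Sequence[str],
                   generate: Callable[[str], str],   # injected model, temperature-1.0 sampling
                   is_risky: Callable[[str], bool],  # e.g. lambda t: mentions_entity(t, "panda")
                   samples_per_prompt: int = 10) -> float:
    hits, total = 0, 0
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            hits += is_risky(generate(prompt))
            total += 1
    return hits / total  # 0.258 would correspond to a predicted 25.8% rate
```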
Step 7: Interpret and Compare
- What happens: Compare the predicted rate with the base modelās original rate to see the expected shift. Optionally compare to post-training results later.
- Why this exists: We care about direction and magnitude of change, not just raw numbers.
- Example: Base says "panda" 13.4% of the time; MDF predicts 25.8%; after fine-tuning, it's 30.0%. The trend was correctly forecast.
The Secret Sauce
- The model's representations already encode data statistics even without training on that dataset. By averaging them, you capture a direction that points toward the dataset's behavioral pull. Injection then reveals that pull in a fast, cheap, reversible way.
Concept Sandwiches Used in the Method
🍞 Top Bread (Hook) Like peeking at a student's scratch work to guess their final answer style.
🥬 Filling (Hidden State / Representation)
- What it is: The model's internal "thought" at each layer and token.
- How it works: 1) Inputs become vectors; 2) Layers transform them; 3) The final vectors drive the next-word choice.
- Why it matters: These are where subtle data fingerprints live.
🍞 Bottom Bread (Anchor) A last-token hidden state summarizes the whole prompt's meaning and style.
🍞 Top Bread (Hook) If you want louder music, you turn a knob, not rebuild the speakers.
🥬 Filling (Forward Pass)
- What it is: The one-way journey from input to output through layers.
- How it works: 1) Embed words; 2) Pass through attention/MLP stacks; 3) Produce logits; 4) Sample a token.
- Why it matters: It's the perfect moment to add our signature without touching weights.
🍞 Bottom Bread (Anchor) We add the signature between layers, then let the model continue to generate.
🍞 Top Bread (Hook) Measuring how rainy it is by counting raindrops over a few minutes, not just one second.
🥬 Filling (Bias Rate / Unsafety Rate)
- What it is: The fraction of generations that show a specific bias or unsafe content.
- How it works: 1) Generate multiple times; 2) Detect the target; 3) Count and divide by total samples.
- Why it matters: Simple, comparable numbers tell us if risk is rising.
🍞 Bottom Bread (Anchor) If 5 out of 10 completions say "panda," the panda bias rate is 50% for that probe.
Practical Notes
- Models: Qwen3-14B, Qwen2.5-32B-Instruct, Gemma-3-12b-it.
- Efficiency: About 4–10× faster than LoRA tuning; ~20% of GPU time.
- Few-shot: Sometimes only a handful of examples suffice to detect a trend.
- Stability: Extreme α causes repetitive or incoherent text; choose α carefully.
04 Experiments & Results
🍞 Top Bread (Hook) Think of this like a science fair project: make a prediction first, then test it for real to see if your forecast was right.
🥬 Filling (The Test)
- What it is: They measured whether MDF's pre-training predictions matched the real behavior after fine-tuning.
- How it works: 1) Build "benign bias" datasets (Panda, NYC, Reagan, UK) and "benign safety" datasets (instruction-following with/without safety topics; secure/insecure code). 2) Compute signatures and predict bias/unsafety. 3) Actually fine-tune and re-measure. 4) Compare predictions vs. outcomes.
- Why it matters: Prediction must match reality to be useful.
🍞 Bottom Bread (Anchor) If MDF says "panda mention rate will rise" and post-training it really rises, the forecast worked.
Competition (Baselines)
- Keyword filters: Look for obvious words; predicted nothing because the risky data had no obvious tells.
- LLM semantic judges: Asked advanced models to decide if bias was present; still missed the hidden patterns.
- Random feature injection: Added random vectors; didn't match real post-training behavior.
Scoreboard (with Context)
- Bias domain on Qwen3-14B:
  - Panda: Base 13.40% → Tuned 30.00% (a big jump). MDF predicted 25.80% (close trend). That's like scoring 86 out of 100 when the true answer is 100 and others scored near 0.
  - Reagan and UK: MDF captured the upward shifts, while baselines stayed at 0.00%.
  - Anomaly: NYC sometimes decreased after tuning, showing that interactions among data, model, and training can be subtle.
- Safety domain on Qwen3-14B:
  - Instruction-following without safety topics: Base 40.75% → Tuned 44.85% attack rate. MDF predicted 52.10%, a conservative early warning that risk could rise. Think of it as a storm watch that errs on the safe side.
  - With safety topics: Tuned 41.85%; MDF predicted 47.25%, again catching the direction.
- Generalization across models:
  - Qwen2.5-32B-Instruct and Gemma-3-12b-it: MDF kept predicting the right direction for Panda/NYC while baselines stayed blind (0.00%).
Surprising/Notable Findings
- Even a few examples can be enough: For Reagan, using only 4 instances with α=1 already predicted an upward trend; the fully tuned model later showed a massive rise to 98.4%.
- Efficiency wins: MDF took ~450–700 seconds versus 1,700–7,300 seconds for tuning on a single A100, roughly 4–10× faster, using about 20% of the GPU time.
- Scaling trade-off: Large α values can cause representation collapse (repetitive, nonsensical tokens); there's a sweet spot where signal shows but text stays coherent.
🍞 Bottom Bread (Anchor) Imagine forecasting that a soccer team will favor passing to one striker next game; the game happens, and indeed the passes go there. MDF is that forecaster for models, and it's much cheaper than playing a whole season to find out.
05 Discussion & Limitations
🍞 Top Bread (Hook) If you had a magical preview button for your homework, you'd still need to know when it works well and when it doesn't.
🥬 Filling (Honest Assessment)
- Limitations (what this can't do yet):
- Closed-source models: MDF needs access to hidden activations, which many proprietary models don't expose.
- Instance-level blame: Today it predicts risk for the whole dataset, not which specific examples are the culprits.
- Mixed-data complexity: Real training mixes normal and biased examples; separating signals is harder.
- Hyperparameter sensitivity: Picking α poorly can under- or over-amplify signals.
- Evaluator dependence: Safety scoring needs a good classifier; weak judges weaken predictions.
- Required Resources: a base model with activation access; a modest GPU; a small sweep over α; a test probe set; and a bias/safety scorer.
- When NOT to Use:
  - If you can't access activations (true black-box APIs).
  - If your task requires guaranteed post-training accuracy numbers rather than directional risk signals.
  - If outputs already collapse at very small α (unstable model or poor setup).
- Open Questions:
  - How to attribute risk to individual training samples?
  - Which layers carry the most predictive "subliminal" signal?
  - Can we auto-tune α robustly?
  - How does this extend to speech, images, or multi-modal models?
  - Can we turn the preview into a repair signal to proactively edit or rebalance data?
🍞 Bottom Bread (Anchor) Think of MDF as a weather radar: great for forecasting storms early, but it still needs better street-level detail and wider coverage.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Before building a roller coaster, engineers run a small simulation to check for trouble. This paper brings that spirit to AI training.
🥬 Filling (Takeaway)
- 3-Sentence Summary: The paper introduces Data2Behavior, a way to predict unintended model behaviors before training. It uses MDF, which averages hidden states from the candidate dataset into a signature and lightly injects it during testing to forecast bias or safety shifts, without updating weights. Across multiple LLMs, MDF accurately predicted trends while using only about 20% of the GPU time of full tuning.
- Main Achievement: Turning hidden, subliminal data signals into a practical, fast, and accurate pre-training risk forecast.
- Future Directions: Pinpoint which individual samples cause risk, identify the most predictive layers, automate α selection, extend to more model families and modalities, and potentially convert predictions into automatic data fixes.
- Why Remember This: It reframes safety from reactive to proactive, like a preflight checklist that helps you avoid costly, dangerous surprises before you ever take off.
🍞 Bottom Bread (Anchor) Just as pilots run checklists before flying, Data2Behavior lets AI teams run a quick safety preview before training, catching hidden problems early.
Practical Applications
- Preflight safety audits for instruction-tuning datasets before any training begins.
- Vendor data certification: require a risk forecast (bias/unsafety) before accepting third-party data.
- Curriculum design for model training: compare candidate datasets and pick the lowest-risk set.
- Early red-teaming: simulate potential vulnerabilities from a new domain (e.g., code) without fine-tuning.
- Compliance reviews: produce lightweight, documented evidence that training data was checked for hidden risks.
- Data triage: quickly flag suspicious dataset batches for deeper human review.
- Continuous monitoring: run MDF on rolling data updates to catch drift-induced risks over time.
- Cross-model checks: preview how the same dataset might affect different model families.
- Safety guardrail planning: use predicted weak spots to choose defenses (e.g., post-training filters or edits).
- Education and research: teach mechanism-aware data practices using a hands-on, low-cost tool.