
Building Production-Ready Probes For Gemini

Beginner
János Kramár, Joshua Engels, Zheng Wang et al. · 1/16/2026
arXiv · PDF

Key Summary

  • The paper shows how to build tiny, fast safety checkers (called probes) that look inside a big AI’s brain activity to spot dangerous cyber-attack requests.
  • Old probes worked on short questions but often failed when the input was super long; the authors designed new probe architectures that stay reliable on long context.
  • Their key invention, MultiMax, focuses on only the most suspicious parts of a long message so the warning signal doesn’t get watered down.
  • They also use a two-step safety plan (a cascade): trust the cheap probe for clear cases and call a larger AI only when the probe is unsure, cutting cost by about 50× while keeping or improving accuracy.
  • Training directly on long context was 22× more expensive; smarter probe design plus training on diverse short data achieved nearly the same performance much more cheaply.
  • Running many training seeds (different random starts) matters a bit, but choosing the right architecture matters much more.
  • Automatic design search with AlphaEvolve discovered even better probes, showing that parts of AI safety research can already be automated.
  • Deployed probes in Gemini reduced misuse at production scale, but adaptive attackers can still sometimes succeed, so this is one strong layer—not a silver bullet.

Why This Research Matters

These probes help keep powerful AI models safe to use by catching dangerous cyber-attack requests quickly and cheaply. They work even when the prompt is extremely long, where old methods often failed. By using a cascade, teams can get accuracy that rivals or beats a large model monitor while paying a tiny fraction of the cost. This lowers the barrier for organizations to deploy strong safety layers in real products. It also shows that parts of safety research can be automated, speeding up progress. Finally, the approach balances safety with usefulness, reducing false alarms on normal coding and security help.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook) You know how teachers put a quick filter on class questions, like spotting prank questions before they waste time? Big AIs need that too, because some people try to get them to help with bad stuff.

🥬 Filling (The Actual Concept)

  • What it is: This paper is about building fast, accurate safety filters (called probes) that detect cyber-attack requests sent to Google’s Gemini models before the model helps.
  • How it works (story of the world before and after):
    1. The world before: Large language models (LLMs) got super good at technical tasks. Sadly, this also means a bad actor could ask for help breaking into systems. Companies tried training models to say “no,” but that wasn’t always enough—clever attackers still slipped through.
    2. The problem: We need monitors that catch harmful requests cheaply and reliably. Using another big LLM as a monitor works but is expensive—almost like paying for two models each time.
    3. Failed attempts: Small monitors called probes looked promising. They peek at the model’s internal “brain signals” (activations) and try to classify requests. But many probes trained on short questions failed when the inputs were very long (like adding huge codebases), multi-turn, or adversarial “jailbreaks.”
    4. The gap: We needed probes that handle long context without costing a fortune to train and run, and that work well across many real-life shifts (single-turn, multi-turn, benign-but-similar, etc.).
    5. Real stakes: In real life, cyber misuse looks a lot like normal security or coding help. If the filter is too strict, it blocks helpful work; if it’s too loose, it misses attacks. People depend on safe AI for coding, learning, and defending networks.
  • Why it matters: Without reliable, low-cost monitors, either we spend too much to be safe, or we stay cheap but miss dangerous requests. Neither is acceptable.

🍞 Bottom Bread (Anchor) Imagine a help desk receiving a tiny question like “What’s SQL?” or a giant one with a full company code repo. Old filters did okay on the tiny question but stumbled on the giant one. The new filters can still spot: “Teach me how to run SQL injection on this app,” even if that sentence is buried inside hundreds of thousands of tokens of code.

🍞 Top Bread (Hook) Imagine reading a super long novel and trying to remember one suspicious sentence. If you average every sentence equally, that single clue gets drowned out.

🥬 Filling (The Actual Concept: Long-Context Inputs)

  • What it is: Long-context inputs are extra-long prompts (sometimes near a million tokens) where the important bit can be rare or hidden.
  • How it works: When prompts get long, simple averaging or soft attention may dilute or blur rare “red flags.”
  • Why it matters: If the suspicious part is small, a monitor that smooths over everything can miss it.

🍞 Bottom Bread (Anchor) If a harmful instruction is just one line inside a massive pasted codebase, a probe must still catch it without being confused by all the harmless lines.

🍞 Top Bread (Hook) Think of a school crossing guard who stops clear rule-breakers fast but asks a principal for help when a situation is unclear.

🥬 Filling (The Actual Concept: Cascading Classifier)

  • What it is: A two-step checker that uses a cheap probe first and only calls a big LLM monitor if the probe is uncertain.
  • How it works:
    1. Probe looks at activations and gives a score.
    2. If very safe or very dangerous, decide immediately.
    3. If it’s in the “gray zone,” ask the bigger LLM to decide.
  • Why it matters: Saves money and time while often improving accuracy over using either alone.

🍞 Bottom Bread (Anchor) On most questions, the probe makes a confident call; on about 8% of tricky ones, it calls Gemini 2.5 Flash for backup, cutting cost roughly 50× yet getting lower error than Flash alone.
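To make the cascade concrete, here is a minimal Python sketch of the deferral rule under simple assumptions: the probe returns a risk score between 0 and 1, and the two thresholds would be tuned on validation data. The function names, threshold values, and the `llm_monitor` callable are illustrative stand-ins, not the paper’s implementation.

```python
def cascade_decision(prompt, probe, llm_monitor, low=0.05, high=0.95):
    """Two-threshold cascade: trust the cheap probe on clear-cut cases and
    defer to a larger LLM monitor only in the gray zone.
    `low` and `high` are placeholder thresholds tuned on validation data."""
    score = probe(prompt)  # cheap: reads activations the model already computed
    if score >= high:      # clearly dangerous -> block immediately
        return "block"
    if score <= low:       # clearly safe -> allow immediately
        return "allow"
    # Gray zone: escalate to the expensive monitor (e.g., a prompted LLM classifier).
    return "block" if llm_monitor(prompt) else "allow"
```

In the paper’s setup the gray zone covered roughly 8% of traffic, so the expensive monitor only runs on that slice while the probe decides everything else.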

02 Core Idea

🍞 Top Bread (Hook) You know how, in a giant Where’s Waldo picture, you don’t average the whole image—you zoom in on the single spot where Waldo is? That’s how these new probes act.

🥬 Filling (The Actual Concept)

  • The “Aha!” moment in one sentence: Design probe architectures that pick the strongest suspicious signals in long prompts (instead of blending them away), then back them up with a smart two-step cascade that asks a bigger model only when needed.

  • Multiple analogies (3 ways):

    1. Flashlight analogy: Don’t dim the flashlight across the whole room—shine it brightly on the one corner that moves.
    2. Metal detector analogy: If the beeper spikes at one spot on the beach, you dig there; you don’t average beeps from the whole beach.
    3. Referee analogy: A ref makes most calls fast, but for borderline plays, they go to video review (the big LLM).
  • Before vs After:
    • Before: Probes trained on short prompts often missed rare danger in long prompts; using a big LLM monitor on every request was too costly.
    • After: MultiMax-style probes focus on the top suspicious tokens or windows, keeping the danger signal sharp. Cascading only calls the big LLM for uncertain cases, reducing cost massively.

  • Why it works (intuition over math):
    • Attention with softmax can spread weight over many tokens in long inputs, sometimes giving too much credit to irrelevant text.
    • Taking a max over tokens (or over rolling windows) preserves rare-but-important spikes—like catching a single malicious instruction.
    • Different random starts (seeds) explore different probe “styles”; picking the best seed adds a small boost, but architecture choice is the big win.
    • Combining a precise but cheap first pass with a careful second pass yields a better cost–accuracy tradeoff than either alone.

  • Building Blocks (with Sandwich pattern):

    1. 🍞 Hook: Imagine picking the tallest kid in a crowd by looking for the single highest head, not averaging everyone’s height. 🥬 Concept: MultiMax Architecture
      • What it is: A probe that, after a small MLP transform, takes the maximum suspicious score per head across tokens instead of a soft average.
      • How it works: Compute per-token values; for each head, select the max; sum across heads to get one risk score.
      • Why it matters: Rare dangers don’t get diluted in long inputs. 🍞 Anchor: One harmful line hidden in 900k tokens still pops as the “max.” (A minimal code sketch appears after this list.)
    2. 🍞 Hook: Sliding a magnifying glass across a page to find the darkest word. 🥬 Concept: Max of Rolling Means Attention
      • What it is: An attention probe that computes attention-weighted means within small sliding windows and then takes the max over windows.
      • How it works: Score tokens with attention inside each window; compute the mean; track the maximum over all windows.
      • Why it matters: Catches clustered red flags and short local patterns. 🍞 Anchor: A 10-line snippet with “how to exploit X” spikes a window even if the rest is benign.
    3. 🍞 Hook: A hall monitor handles clear cases; tricky ones go to the principal. 🥬 Concept: Cascading Classifier
      • What it is: A two-threshold deferral policy: decide now if score is very low or high, otherwise defer to the LLM.
      • How it works: Tune thresholds to hit your desired cost and accuracy.
      • Why it matters: Spend big-compute only where it pays. 🍞 Anchor: Probe handles 92% of traffic; the LLM reviews the 8%.
    4. 🍞 Hook: Rolling dice 100 times gives you more chances to get a great roll. 🥬 Concept: Multiple Seeds
      • What it is: Train the same probe many times with different random starts and pick the best validation performer.
      • How it works: The best-of-many is usually better than a random one.
      • Why it matters: Small, reliable performance gain. 🍞 Anchor: Among 100 trained probes, pick the one that scores best on held-out checks.
    5. 🍞 Hook: Let an assistant try thousands of recipe tweaks and keep the tastiest ones. 🥬 Concept: AlphaEvolve
      • What it is: An automated system that mutates probe code, tests each variant, and keeps improving.
      • How it works: Evolve architectures guided by validation scores; avoid reward hacks with careful metrics.
      • Why it matters: Automates part of safety research, finding better probes than hand-designed baselines. 🍞 Anchor: AlphaEvolve rediscovered MultiMax-like ideas and built higher-scoring variants.
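To ground building block 1, here is a minimal NumPy sketch of MultiMax-style aggregation, assuming per-token activations have already been collected from a middle layer. The shapes, head count, and random weights are placeholders standing in for a trained probe, not the paper’s actual parameters; the mean-pooled variant is included only to contrast it with the max.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: T tokens, d-dim activations, MLP width 100, H "heads".
T, d, width, H = 4096, 768, 100, 8
W1 = rng.normal(size=(d, width))            # placeholder weights; a real probe is trained
W2 = rng.normal(size=(width, width))
W_heads = rng.normal(size=(width, H))       # one value per head per token

def token_features(acts):
    """Small per-token MLP (2 ReLU layers) applied to (T, d) activations."""
    h = np.maximum(acts @ W1, 0.0)
    return np.maximum(h @ W2, 0.0)

def multimax_score(acts):
    per_head = token_features(acts) @ W_heads   # (T, H) per-token, per-head values
    # Key step: take the MAX over tokens for each head (no soft averaging),
    # so a single suspicious token is enough to push the score up ...
    return per_head.max(axis=0).sum()           # ... then sum across heads.

def mean_pool_score(acts):
    """Contrast: mean pooling lets a lone red-flag token contribute only 1/T of the score."""
    per_head = token_features(acts) @ W_heads
    return per_head.mean(axis=0).sum()

acts = rng.normal(size=(T, d))                  # stand-in for real Gemini activations
print(multimax_score(acts), mean_pool_score(acts))
```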

03 Methodology

🍞 Top Bread (Hook) Imagine a factory line: boxes (prompts) arrive, a scanner looks inside, a quick sorter makes most decisions, and a senior inspector steps in only for question marks.

🥬 Filling (The Actual Concept)

  • High-level recipe: Input → Get hidden activations at a mid layer → Per-token transform (small MLP) → Aggregate (MultiMax or rolling-attention windows) → One risk score → Threshold(s) → Action (block/allow or defer to LLM).

Step-by-step details (what, why, example):

  1. Input and activations

    • What happens: A user prompt (short or very long) runs through Gemini once; we collect the hidden state vectors at a fixed middle layer for each token.
    • Why it exists: Activations already exist from the normal forward pass, so using them is nearly free; they encode rich context the probe can read.
    • Example: Prompt: “Please review this repository; also, how could I run an SQL injection?” We collect vectors for each token, including the suspicious phrase.
  2. Per-token transform (small MLP)

    • What happens: Each token’s activation vector goes through a tiny neural net (2 ReLU layers, width ≈100), producing a compact feature vector per token.
    • Why it exists: This lets the probe re-express features into a space where malicious vs benign is easier to separate.
    • Example: Tokens like “execute,” “payload,” “inject” may light up certain feature dimensions.
  3. Aggregation to one score: We evaluate several aggregations; two are the stars.

    a) MultiMax aggregation
    • What happens: For each head, compute a value per token and take the maximum across tokens, then sum heads.
    • Why it exists: A single harmful token shouldn’t get averaged away in a sea of safe ones.
    • Example: One “how to exploit X” line sets the max high; the final score jumps up.

    b) Max of Rolling Means Attention (a minimal sketch follows this step list)
    • What happens: Slide a fixed-size window (e.g., 10 tokens); inside each window, compute attention-weighted mean; take the max across windows.
    • Why it exists: Captures short clusters (phrases/snippets) of harmful context.
    • Example: A 10-token span like “steps to bypass authentication” pushes that window’s mean high.

    What breaks without this step: Pure mean pooling over a million tokens makes rare red flags vanish; soft attention can spread weight too thin on long inputs.

  4. Scoring and thresholding

    • What happens: Turn the aggregated number into a probability-like score; compare against threshold(s).
    • Why it exists: Operations need a crisp decision: allow, block, or defer.
    • Example: If score > upper threshold, block; if < lower threshold, allow; else defer to LLM.
  5. Cascading (defer when unsure)

    • What happens: The probe handles clear-cut cases; tricky middles go to a prompted LLM (e.g., Gemini 2.5 Flash) that uses a detection prompt.
    • Why it exists: The LLM is more robust to distribution shifts but pricey; deferring only sometimes gives you near-LLM accuracy at a fraction of the cost.
    • Example: About 8% of cases get escalated; overall error drops below using the LLM alone, at roughly 1/50th its cost.
  6. Training data and threshold selection

    • What happens: Train probes on short-context mixtures (attacks and hard negatives), validate on different short data, then test on shifted distributions: long context, multi-turn, jailbreaks, and adaptive red teaming.
    • Why it exists: Real traffic shifts a lot; we want probes that generalize beyond their training distribution and don’t overtrigger on benign traffic.
    • Example: Validation picks thresholds that penalize overtriggering hardest (weighted loss), since false alarms on benign traffic are costly in production.
  7. Multiple seeds and selection

    • What happens: Train 100 probes per architecture with different random inits; pick the best by validation loss; report test loss.
    • Why it exists: Reduces variance and gives a small accuracy lift.
    • Example: Best seed typically improves a bit over the median, but architecture choice gives far bigger gains.
  8. AlphaEvolve for architecture search

    • What happens: Start with a simple attention probe, let AlphaEvolve mutate and evaluate ~2500 variants against a weighted validation score; choose top designs.
    • Why it exists: Automates exploring a huge design space; can rediscover or surpass human ideas.
    • Example: AlphaEvolve independently found MultiMax-like strategies and more complex gated variants. (A schematic of this kind of search loop is sketched at the end of this section.)
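To make step 3b concrete, here is a minimal NumPy sketch of max-of-rolling-means aggregation, assuming per-token features (for example, the output of the small per-token MLP) and a single attention head. The window size and the placeholder weight vectors are illustrative, not the trained probe’s parameters.

```python
import numpy as np

def max_of_rolling_means(feats, w_query, w_value, window=10):
    """feats: (T, k) per-token features.
    Inside each sliding window, compute an attention-weighted mean of per-token
    values; the final score is the maximum over all windows."""
    T = feats.shape[0]
    logits = feats @ w_query                     # (T,) attention logits per token
    values = feats @ w_value                     # (T,) per-token "suspiciousness" values
    best = -np.inf
    for start in range(max(T - window + 1, 1)):
        sl = slice(start, start + window)
        w = np.exp(logits[sl] - logits[sl].max())
        w /= w.sum()                             # softmax restricted to this window
        best = max(best, float(w @ values[sl]))  # keep the most suspicious window
    return best

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 100))              # stand-in for per-token MLP outputs
score = max_of_rolling_means(feats, rng.normal(size=100), rng.normal(size=100))
```

A short harmful span (say, ten tokens of exploit instructions) dominates its own window, so its signal is not averaged against the rest of a long prompt.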

Secret sauce (what makes it clever):

  • Focus on maxima: Preserves rare but critical danger signals in very long inputs.
  • Local windows: Catch short harmful phrases without being swamped by context.
  • Smart deferral: Spend compute only where it matters; better than always-probe or always-LLM.
  • Practical validation weights: Penalize false alarms on benign traffic most, matching production needs.
  • Automation: AlphaEvolve boosts progress and even helps design better adversarial tests.

🍞 Bottom Bread (Anchor) A user pastes a giant repo and asks vague questions. The probe scans activations, flags a spiky window around “run SQL injection,” and because the score is high, it blocks immediately. If the score were borderline, it would ask Gemini 2.5 Flash to double-check before deciding.
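AlphaEvolve itself uses an LLM to propose edits to probe code, so the sketch below shows only the generic propose-evaluate-keep loop that step 8 describes. The `mutate` and `evaluate` callables are hypothetical stand-ins you would supply (for example, an LLM-driven code rewriter and the weighted validation loss); none of this is AlphaEvolve’s actual interface.

```python
import random

def evolve_probes(seed_architecture, mutate, evaluate, budget=2500, population=20):
    """Schematic evolutionary search over probe designs.
    mutate(arch)   -> a new candidate architecture (in the paper, an LLM proposes
                      a code change; here it is a stub the caller provides).
    evaluate(arch) -> weighted validation score, lower is better (mirroring the
                      overtrigger-heavy validation loss used for selection)."""
    pool = [(evaluate(seed_architecture), seed_architecture)]
    for _ in range(budget):
        _, parent = random.choice(pool)        # pick a parent from the surviving pool
        child = mutate(parent)                 # propose a variant of it
        pool.append((evaluate(child), child))  # score the variant on validation data
        pool.sort(key=lambda item: item[0])
        pool = pool[:population]               # keep only the best-scoring designs
    return pool[0][1]                          # best architecture found within budget
```

Careful metric design matters here: if `evaluate` can be gamed, the search will happily exploit it, which is why the paper stresses avoiding reward hacks.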

04 Experiments & Results

🍞 Top Bread (Hook) Picture a soccer team trying different goalies: quick but small, tall but slow, and a tag-team where the quick goalie handles most shots and calls in the tall goalie for penalty kicks.

🥬 Filling (The Actual Concept)

  • The test: Evaluate probes and LLM monitors on real, shifted distributions that matter in production:
    • Benign short traffic (avoid overtriggering), hard benign cyber lookalikes, synthetic multi-turn benign, long-context benign coding traffic.
    • Real attacks: short, multi-turn, and long-context attacks; plus pre-existing jailbreaks and adaptive red teaming.
    • Main metric: a weighted error that heavily penalizes false positives on benign traffic, then false negatives, then hard-negative false positives (a minimal sketch of such a metric follows this section).

  • The competition: Baselines include linear probes, EMA probes, attention probes, and LLM monitors (Gemini 2.5 Flash Lite, Flash, Pro). New methods include MultiMax, Max of Rolling Means Attention, and AlphaEvolve-found architectures.

  • The scoreboard (with context):
    • Best probes (e.g., Max of Rolling Means Attention, MultiMax variants, AlphaEvolve) achieved test errors around 2.5%, close to Gemini 2.5 Flash at ~2.0%—but at over 10,000× lower compute cost than using an LLM on every request.
    • Cascading wins: A probe + Flash cascade that deferred to Flash on only ~8% of queries achieved even lower test loss than Flash alone, at roughly 1/50th the LLM inference cost.
    • Long context matters: Some attention probes without special aggregation had near-zero false positives but extremely high false negatives on long-context attacks (e.g., ~88% FNR), showing they missed many buried attacks. MultiMax/rolling-window approaches fixed much of this by keeping rare spikes visible.
    • Training directly on long context: An attention probe trained on long inputs performed well (~2.38% test error) but cost ~22× more to train due to memory and data-loading overhead. The new architectures matched or approached that quality while training only on short data.
    • Seeds vs architecture: Picking the best seed out of 100 helps a bit; choosing the right architecture helps a lot more.

  • Surprising findings:
    • The cascade’s best operating point beat the standalone LLM’s error while spending only a sliver of its compute—showing probes and LLMs are complementary.
    • Gemini 2.5 Flash generally dominated Pro on the cost–accuracy curve for this task, even under cost assumptions friendly to Pro.
    • AlphaEvolve rediscovered MultiMax-like strategies independently, suggesting automated search can meaningfully aid safety engineering.
    • Jailbreak success remains nontrivial: every method left some surface area (>1% on a standard jailbreak set), so layered defenses and continued red teaming remain essential.

🍞 Bottom Bread (Anchor) Think of the probe as a metal detector catching most contraband instantly. For the few tricky bags, the system calls a senior officer (Flash). Together, they catch more bad items than the officer alone, while checking only a small fraction of bags thoroughly, saving time and money.
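As noted above, the headline metric is a weighted error that punishes overtriggering on benign traffic hardest. The sketch below pairs an illustrative version of such a metric with the best-of-N seed selection from the methodology; the weight values, the `val_data` keys, and the probe’s `predict` interface are assumptions for illustration, not the paper’s exact choices.

```python
import numpy as np

def weighted_error(y_true, y_pred, hard_negative, w_fp=10.0, w_fn=3.0, w_hard_fp=1.0):
    """Illustrative weighted error: benign false positives cost most, missed attacks
    next, false positives on hard benign lookalikes least (weights are placeholders)."""
    y_true, y_pred, hard_negative = map(np.asarray, (y_true, y_pred, hard_negative))
    fp = (y_pred == 1) & (y_true == 0) & ~hard_negative      # overtrigger on benign
    fn = (y_pred == 0) & (y_true == 1)                       # missed attack
    hard_fp = (y_pred == 1) & (y_true == 0) & hard_negative  # overtrigger on lookalike
    return (w_fp * fp.sum() + w_fn * fn.sum() + w_hard_fp * hard_fp.sum()) / len(y_true)

def pick_best_seed(train_fn, val_data, n_seeds=100):
    """Best-of-N seed selection: train the same architecture with many random seeds
    and keep the probe with the lowest weighted validation loss.
    `train_fn(seed)` and `probe.predict(...)` are hypothetical interfaces."""
    probes = [train_fn(seed) for seed in range(n_seeds)]
    losses = [weighted_error(val_data["y"], probe.predict(val_data["x"]),
                             val_data["hard_negative"]) for probe in probes]
    return probes[int(np.argmin(losses))]
```

As the results show, this best-of-100 selection gives a small, reliable lift, but choosing the right aggregation architecture moves the numbers far more.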

05 Discussion & Limitations

🍞 Top Bread (Hook) Even the best smoke detector can miss a weird kind of smoke—so you still keep a fire extinguisher handy.

🥬 Filling (The Actual Concept)

  • Limitations (what this can’t do):
    • Adaptive attackers can still find ways around static defenses; probes and LLM monitors remain vulnerable to clever, changing jailbreaks.
    • Results have wide error bars; picking a single “always-best” probe is hard.
    • Focused on cyber misuse; transfer to other safety domains needs more testing.
    • Probes look only at inputs here, not at the model’s ongoing outputs mid-generation.
    • Long-context training is feasible but 22× costlier; our shortcut relies on smart architectures, not brute-force data.

  • Required resources (what you need):
    • Access to model activations at a chosen layer and infrastructure to stream/store them.
    • Labeled datasets spanning benign, hard negatives, long context, multi-turn, and attacks.
    • Capacity to train many seeds per architecture, plus validation that reflects production trade-offs.
    • Optionally, an LLM monitor for the cascade and an automated search tool like AlphaEvolve.

  • When not to use:
    • If you must be robust against a fully adaptive adversary with full knowledge of defenses—no current static method suffices.
    • If you can’t tolerate any false positives on benign-but-similar cyber traffic and lack a good validation weighting.
    • If you can’t access activations or can’t maintain thresholds as distributions shift over time.

  • Open questions:
    • Multi-layer or cross-layer probing: does reading multiple layers boost robustness without large cost?
    • Output monitoring: can probes flag during generation, not just on inputs?
    • Better validation mixes: can we pick seeds/thresholds that transfer even more reliably?
    • Stronger defenses vs adaptive jailbreaks: can we combine probing with dynamic defenses or learned deferral policies?
    • Broader domains: how well do these findings carry to bio/CBRN or other sensitive areas?

🍞 Bottom Bread (Anchor) Like school safety: doors, cameras, and staff training together work better than any one thing. These probes are a strong door—but you still want the other layers.

06 Conclusion & Future Work

🍞 Top Bread (Hook) Think of a treasure map with one tiny X hidden in miles of doodles; you need a tool that finds that little X fast and asks for help only when it’s unsure.

🥬 Filling (The Actual Concept)

  • 3-sentence summary: The paper introduces probe architectures that reliably spot harmful cyber prompts even in very long inputs by focusing on the strongest suspicious signals. It shows that combining these tiny probes with a big LLM in a cascade beats using either alone while costing far less. Automated search (AlphaEvolve) discovers high-performing probes too, proving that parts of safety research can be automated today.
  • Main achievement: Turning long-context generalization from a brittle weakness into a practical strength through MultiMax and rolling-window attention aggregations, then pairing them with a cost-savvy cascade.
  • Future directions: Probe across multiple layers; monitor outputs mid-generation; expand to other harm domains; improve validation mixes; harden against adaptive attackers; scale automated architecture and adversarial search.
  • Why remember this: It’s a blueprint for production-ready, low-cost, high-coverage safety monitoring—catching rare dangers in oceans of text, and knowing when to call for backup.

🍞 Bottom Bread (Anchor) In production Gemini systems, the probes now act like expert lifeguards: they watch thousands of swimmers at once, quickly spot real trouble even in a crowded pool, and call for extra help when things look iffy.

Practical Applications

  • Deploy a probe-first, LLM-second cascade to reduce moderation cost while improving detection accuracy.
  • Use MultiMax-style aggregation to maintain sensitivity to rare harmful phrases in long prompts.
  • Adopt rolling-window attention to catch clustered suspicious snippets (e.g., step-by-step exploit instructions).
  • Train on diverse short-context data and select thresholds with heavy penalties for overtriggering to match production needs.
  • Run 50–100 seeds per architecture and pick the best validation performer for small but reliable gains.
  • Automate probe architecture search with tools like AlphaEvolve to discover designs beyond manual intuition.
  • Integrate long-context evaluation suites (benign and attack) to validate generalization before deployment.
  • Build cascades that defer on only 5–10% of traffic to big LLMs, tuning thresholds to your cost budget.
  • Continuously refresh validation mixes and thresholds as traffic distributions shift over time.
  • Pair probes with ongoing red teaming (including adaptive methods) to monitor and patch emerging weaknesses.
#activation probes · #misuse mitigation · #long-context robustness · #cascading classifiers · #Gemini safety · #MultiMax aggregation · #rolling attention windows · #LLM monitors · #distribution shift · #jailbreak detection · #adaptive red teaming · #AlphaEvolve · #cost-efficient moderation · #production deployment · #AI safety monitoring