A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

Intermediate
Gonzalo Ariel Meyoyan, Luciano Del Corro · 1/19/2026
arXiv · PDF

Key Summary

  • This paper shows how to add a tiny helper (a probe) to a big language model so it can classify things like safety or sentiment during the same pass it already does to answer you.
  • Instead of looking at only one token or one layer, the probe learns to pick the best information across all tokens and all layers of the model’s hidden states.
  • They use a two-step recipe: first summarize tokens inside each layer, then summarize those layer summaries into one vector for a small classifier.
  • Three ways to summarize are tested: simple pooling, a tiny scoring-attention gate (~0.10M params), and a compact multi-head self-attention probe (up to 35M params).
  • On safety datasets (ToxicChat and WildGuardMix), the probes beat logit-only reuse and rival much larger guard models, without an extra model call.
  • On sentiment/emotion datasets (IMDB, SST-2, Emotion), probes are competitive with big standalone classifiers and far better than prompting the same model.
  • Latency and memory stay close to the base model because classification happens in the same forward pass; no second model needs to be loaded.
  • Layer attention analysis shows useful signals live at different depths for different labels, so picking across layers matters.
  • The method is easy to deploy on existing frozen models and reduces orchestration complexity, but generalization to other backbones still needs study.

Why This Research Matters

This approach lets teams get strong safety and sentiment/emotion classification without running a second model, which lowers latency, memory use, and costs. It’s easier to deploy because you attach a small probe to the model you already serve, keeping the system simple. Faster moderation means safer user experiences, especially under heavy traffic or tight latency budgets. The method is flexible: pick ultra-tiny pooling for maximum speed, a tiny gate for a great balance, or compact attention for maximum accuracy. It aligns with real-world needs where many classification steps run around every LLM call. As backbones improve, these probes are likely to get even better without changing their relative overhead.

Detailed Explanation


01. Background & Problem Definition

🍞 Hook: Imagine your school has one great teacher who explains everything, but then you hire several extra tutors just to check homework, grade quizzes, and watch for rule-breaking. It works, but now every question takes longer and costs more because you keep switching between people.

🥬 The Concept (The World Before): Production AI systems do this too. They have one main language model (the teacher) to answer users, and several separate models (the tutors) to do classification-heavy jobs like safety moderation, jailbreak detection, policy checks, or even sentiment tagging. Each extra model means another model to train, deploy, load into memory, and call at inference time. That adds latency (slower replies), VRAM footprint (more memory), and operational complexity (more moving parts).

🍞 Anchor: Think of a chat app that first sends your message to a safety model and only then to the answering model. That’s two model calls and two big models in memory, every time.

🍞 Hook: You know how a chef tastes the soup before serving it? The chef already has all the flavors in the pot—why bring in another chef just to tell you it’s salty?

🥬 The Concept (The Problem): The main LLM already computes a lot of useful information inside its hidden states as it reads your input. But many systems ignore those rich signals and instead spin up a separate classifier model. The challenge is: can we reuse the computation we already paid for in the serving model and still get strong classification?

🍞 Anchor: If your calculator already did the hard math, you wouldn’t retype the numbers into a second calculator just to check if the answer is big or small.

🍞 Hook: Imagine a library book with notes in the margins on every page. Some notes are about spelling, others about meaning, and others about tone. Different pages help with different questions.

🥬 The Concept (Failed Attempts): People tried simpler reuse tricks, like reading the model’s output logits for the first token (MULI) or grabbing a fixed layer’s vector and applying a rule. These approaches pick a single place to read from (one token or one layer). But transformer models build understanding gradually across layers, and different tasks can find their clearest signals at different depths or across many tokens. Fixing the readout spot leaves useful clues on the table.

🍞 Anchor: It’s like judging a story by only the first sentence or only the final paragraph, while missing key facts sprinkled throughout the chapters.

🍞 Hook: You know how in school, you learn letters first, then words, then stories? Earlier lessons help with spelling; later lessons help with meaning.

🥬 The Concept (The Gap): Classical BERTology showed that transformer layers act like a pipeline: early layers lean syntactic, later layers lean semantic. So there might be a “sweet spot” (or a mix of spots) where a classification boundary is easiest. What’s missing is a light, learnable way to select and combine the best information across both tokens and layers—without retraining or altering the serving model.

🍞 Anchor: Instead of guessing one perfect page to read, learn to skim all pages and pick the best notes for the question you’re answering.

🍞 Hook: Imagine a traffic cop who can watch all lanes at once and then wave through the car that matters most.

🥬 The Concept (Real Stakes): If we can classify from the serving model’s hidden states in the same forward pass, we avoid the extra guard-model call. That cuts latency, reduces VRAM, and simplifies deployment. This matters for user safety (faster moderation), cost (fewer GPUs or smaller bills), and reliability (fewer pipelines to break).

🍞 Anchor: A customer service chatbot that flags unsafe requests and responds helpfully, all in a single model call, keeps conversations fast, safe, and cheaper to run.

02. Core Idea

🍞 Hook: You know how a good coach watches the whole team and then picks the right players from different positions to make a winning play?

🥬 The Concept (Aha! in one sentence): Treat classification as smart selection over the entire token-by-layer hidden-state tensor and learn a tiny two-stage aggregator that chooses what to read and where to read it—inside the serving model’s existing forward pass.

How it works (big picture):

  1. Token-level aggregation: within each layer, summarize all tokens into one layer summary.
  2. Layer-level aggregation: combine those layer summaries into a single vector.
  3. A small linear head turns that vector into class logits.

The serving LLM stays frozen; only the probe is trained (a minimal code sketch follows below).

Why it matters: Without selective reading across tokens and layers, you miss important signals that appear at different positions and depths, hurting accuracy or forcing you to add a separate, heavy model.

🍞 Anchor: It’s like first picking the best sentences from each chapter (layer), then blending those choices into one book report (final vector) that clearly answers the question.
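
For readers who think in code, here is a minimal sketch of that two-stage read-out, with assumed shapes and mean pooling standing in for the learned aggregators (the real probes are described in the Methodology section):

```python
import torch

def two_stage_probe(hidden_states, token_agg, layer_agg, head):
    """hidden_states: (L, T, d) -- one d-dim vector per layer and per token."""
    # Stage 1: summarize the T tokens inside each layer -> (L, d)
    layer_summaries = torch.stack([token_agg(layer) for layer in hidden_states])
    # Stage 2: summarize across the L layers -> (d,)
    summary = layer_agg(layer_summaries)
    # Small linear head -> class logits
    return head(summary)

# Cheapest configuration: mean pooling at both stages and a hypothetical 2-class head.
d = 3072                      # illustrative hidden size, not necessarily the paper's value
head = torch.nn.Linear(d, 2)
logits = two_stage_probe(
    torch.randn(29, 12, d),   # e.g. 28 layers + embedding output, 12 tokens
    token_agg=lambda x: x.mean(dim=0),
    layer_agg=lambda x: x.mean(dim=0),
    head=head,
)
```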

🍞 Hook: Imagine sorting a big pile of LEGO bricks by color inside each box, then stacking the box summaries to see the whole sculpture.

🥬 The Concept (Multiple analogies):

  • Radio tuner: Each layer broadcasts a station; token-level aggregation makes one clear station per layer; layer-level aggregation mixes the stations to get the clearest song for your task.
  • School notes: You condense each chapter’s notes to one sticky note, then gather all sticky notes to prepare a final answer card.
  • Detective work: Check every clue in every room (tokens in layers), write a summary per room, then combine room summaries to solve the case.

Why it works (intuition): Different depths encode different abstractions; different tokens carry different clues. A learned selector can place more weight where evidence is most separable for your labels (e.g., safety vs. unsafe) instead of guessing a fixed spot.

🍞 Anchor: On ToxicChat, toxic examples leaned on later intermediate layers, while non-toxic leaned on final layers—exactly the kind of pattern a selector can capture.

🍞 Hook: You know how sometimes a quick skim is enough, but sometimes you need a magnifying glass?

🥬 The Concept (Building blocks):

  • Direct pooling: a fixed, super-cheap summarizer (mean or max).
  • Scoring attention gate: a tiny learned scorer that assigns importance weights to tokens or layers with ~0.10M parameters total.
  • Multi-head self-attention probe: a compact, more expressive selector that downcasts dimensions to stay small (up to 35M params), then uses attention to mix information.

Why it matters: This spectrum lets you pick your speed/accuracy trade-off while staying in a single pass of the serving model. Pooling is fastest; MHA is strongest; the scoring gate is a great middle ground.

🍞 Anchor: In results, pooling < scoring gate < MHA on accuracy, while latency and VRAM stay much closer to the base model than any two-model pipeline.

03. Methodology

🍞 Hook: Imagine a two-step smoothie: first you blend the fruits from each bowl, then you pour those mini-smoothies together into one final glass.

🥬 The Concept (High-level recipe): Input (prompt tokens) → Serving LLM computes hidden states (frozen) → Stage 1: Token-level aggregation within each layer → Stage 2: Layer-level aggregation across those summaries → Linear head → Class logits → Prediction. Why it matters: Everything happens during the same forward pass already used for generation—no extra model call.

🍞 Anchor: It’s like taking notes while you read (no extra reading session), then using those notes to grade the text.

NEW CONCEPT: Hidden-State Tensor 🍞 Hook: You know how a flipbook has many pages, and each page shows the same character at a slightly different moment? 🥬 The Concept: A hidden-state tensor is the model’s stack of internal notes: for every layer (depth) and every token (position), there’s a d-dimensional vector. How it works: As the model reads your input, it creates L layers of T-by-d matrices (one matrix per layer). Why it matters: All your potentially useful clues live inside this L×T×d block; picking only one slice ignores the rest. 🍞 Anchor: If each chapter (layer) has a table of word notes (tokens), the full bookshelf is the tensor.
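
To make the tensor concrete, here is a small sketch of pulling it out of a HuggingFace model with `output_hidden_states=True`; the model name follows the paper's backbone, but any causal LM you can load works the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-3B-Instruct"   # the paper's backbone; any causal LM works here
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
lm.eval()                                    # the serving model stays frozen

batch = tok("This movie was surprisingly good!", return_tensors="pt")
with torch.no_grad():
    out = lm(**batch, output_hidden_states=True)

# out.hidden_states is a tuple with the embedding output plus one tensor per layer,
# each of shape (batch, T, d). Stacking them gives the L x T x d block the probes read.
states = torch.stack(out.hidden_states, dim=1)   # (batch, L, T, d)
print(states.shape)
```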

NEW CONCEPT: Token-Level Aggregation (Stage 1) 🍞 Hook: Imagine asking a classroom to vote, then turning many hands (tokens) into one summary per class (layer). 🥬 The Concept: Within each layer, compress all token vectors into one layer summary vector. How it works: Apply an aggregator across the token dimension: either fixed pooling (mean/max), a scoring attention gate that softly weights tokens, or MHA that attends among tokens and pools. Why it matters: Without summarizing tokens, you’d carry T vectors per layer forward, which is heavy and makes the next step harder. 🍞 Anchor: For a sentence like “This movie was surprisingly good!”, the token aggregator may highlight “surprisingly” and “good”.
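
A minimal sketch of Stage 1 with masked mean pooling (shapes are assumptions; the learned gate and MHA variants described below would replace this fixed summarizer):

```python
import torch

def token_mean_pool(hidden_states, attention_mask):
    """hidden_states: (batch, L, T, d); attention_mask: (batch, T) with 1 = real token."""
    mask = attention_mask[:, None, :, None].to(hidden_states.dtype)  # (batch, 1, T, 1)
    summed = (hidden_states * mask).sum(dim=2)                       # (batch, L, d)
    counts = mask.sum(dim=2).clamp(min=1)                            # avoid divide-by-zero
    return summed / counts                                           # one summary per layer
```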

NEW CONCEPT: Layer-Level Aggregation (Stage 2) 🍞 Hook: You know how you first pick the best sentence from each chapter, then decide which chapters matter most overall? 🥬 The Concept: Take the L layer summaries and combine them into a single vector. How it works: Use the same type of aggregator as in Stage 1 (pooling, scoring gate, or MHA), but now across layers. Why it matters: Tasks differ on which depths are informative; a learned aggregator can discover the best mix. 🍞 Anchor: For toxicity, later layers might win for toxic prompts; for non-toxic, the final layer might dominate.
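
Stage 2 with the simplest, parameter-free choices looks like this (again a sketch, not the authors' code):

```python
import torch

def layer_pool(layer_summaries, how="mean"):
    """layer_summaries: (batch, L, d) from Stage 1 -> (batch, d)."""
    if how == "mean":
        return layer_summaries.mean(dim=1)
    return layer_summaries.max(dim=1).values  # max keeps the strongest signal per dimension
```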

NEW CONCEPT: Direct Pooling 🍞 Hook: Think of averaging everyone’s scores to get one class score. 🥬 The Concept: Pooling is a fixed, parameter-free summary (mean or max). How it works: For tokens or layers, compute the mean (or max) along that axis to get one vector. Why it matters: It’s nearly free and very fast, but it can miss subtle signals where some items should count more. 🍞 Anchor: Pooling across tokens sees all words equally, which can dilute rare but key words like “threat”.

NEW CONCEPT: Scoring Attention Gate 🍞 Hook: Imagine a tiny judge that gives each item a score so the most important ones count more. 🥬 The Concept: The gate learns a per-item importance score and uses softmax to combine inputs. How it works: One small linear projection produces a scalar score per token/layer, mask padding, softmax to weights, weighted sum to make a vector. Why it matters: With only ~0.10M params, it adapts to the task and often gains a lot over pooling. 🍞 Anchor: In safety, the gate can up-weight words like “attack” or “bypass” while down-weighting stopwords.
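
A plausible implementation of the gate, sketched as a tiny PyTorch module (the authors' exact layout and parameter count may differ):

```python
import torch
import torch.nn as nn

class ScoringGate(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.score = nn.Linear(d, 1)  # one scalar importance score per item

    def forward(self, items, mask=None):
        """items: (batch, N, d) -- N tokens or N layers; mask: (batch, N), 1 = keep."""
        scores = self.score(items).squeeze(-1)                 # (batch, N)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # ignore padding
        weights = torch.softmax(scores, dim=-1)                # importance weights
        return (weights.unsqueeze(-1) * items).sum(dim=1)      # (batch, d) weighted sum
```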

NEW CONCEPT: Multi-Head Self-Attention (MHA) with Downcasting 🍞 Hook: Like having several spotlights that each focus on different parts, then combining their views. 🥬 The Concept: MHA lets the model learn rich interactions; downcasting shrinks QKV dimensions to keep it light. How it works: Project inputs to smaller Q, K, V, compute scaled dot-product attention over tokens or layers, concat heads, project back to d, then pool. Why it matters: It captures complex patterns (e.g., token pairs, cross-layer cues) while keeping parameters manageable (up to 35M) via aggressive downcasting (e.g., d/32). 🍞 Anchor: On Emotion, MHA shines because emotion cues can be subtle and spread across tokens.
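
A sketch of what a downcast MHA probe could look like; the downcast factor, head count, and pooling choice here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DowncastMHAProbe(nn.Module):
    def __init__(self, d, downcast=32, heads=4):
        super().__init__()
        d_small = d // downcast                       # e.g. 3072 // 32 = 96
        assert d_small % heads == 0, "downcast dim must split evenly across heads"
        self.down = nn.Linear(d, d_small)             # shrink before attention to stay light
        self.attn = nn.MultiheadAttention(d_small, heads, batch_first=True)
        self.up = nn.Linear(d_small, d)               # project back to d

    def forward(self, items, key_padding_mask=None):
        """items: (batch, N, d) -- tokens within a layer, or the L layer summaries.
        key_padding_mask: (batch, N) bool, True marks padding (PyTorch convention)."""
        x = self.down(items)
        x, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        return self.up(x).mean(dim=1)                 # pool after attending -> (batch, d)
```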

NEW CONCEPT: Lightweight Probes (Single-Pass Reuse) 🍞 Hook: You know how adding a small sensor to a machine can tell you what’s happening inside without changing the machine? 🥬 The Concept: A probe is a small classifier trained on the frozen model’s hidden states, running in the same forward pass. How it works: Attach the two-stage aggregator + linear head to read hidden states; only train the probe’s parameters. Why it matters: No separate guard model, near-zero extra latency/memory, and easy retrofitting to deployed systems. 🍞 Anchor: A chatbot can classify safety in the same call used to generate a reply.
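
Putting it together, a minimal end-to-end sketch of single-pass reuse; the mean-pooled linear head stands in for the full two-stage probe, and the 2-class head is a hypothetical example, not the authors' code:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.2-3B-Instruct"           # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
for p in lm.parameters():
    p.requires_grad_(False)                          # the serving model stays frozen

head = nn.Linear(lm.config.hidden_size, 2)           # the only trainable part in this sketch

def classify(text):
    batch = tok(text, return_tensors="pt")
    out = lm(**batch, output_hidden_states=True)     # the same pass used for generation
    states = torch.stack(out.hidden_states, dim=1)   # (1, L, T, d)
    pooled = states.mean(dim=(1, 2))                 # stand-in for the two-stage aggregator
    return head(pooled.float())                      # class logits, no second model call

print(classify("How do I bypass the content filter?"))
```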

Training and efficiency details:

  • The base Llama-3.2-3B-Instruct remains frozen.
  • Probes are tiny (0.003M for pooling, ~0.10M for the scoring gate, up to 35M for MHA).
  • You can precompute and cache hidden states during training to fit bigger batches; at inference, it’s just one pass (see the sketch after this list).
  • Hyperparameters include learning rate, batch size, weight decay, heads, and downcast factor.
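
The caching trick, sketched with hypothetical names (`tok`, `lm`, `probe`, and `train_data` are assumed to exist, e.g. from the earlier sketches; the hyperparameter values are placeholders):

```python
import torch
import torch.nn.functional as F

# One slow pass over the training set with the frozen backbone, done once and cached.
cache = []
with torch.no_grad():
    for text, label in train_data:                       # assumed list of (str, int) pairs
        batch = tok(text, return_tensors="pt", truncation=True)
        hs = torch.stack(lm(**batch, output_hidden_states=True).hidden_states, dim=1)
        cache.append((hs.squeeze(0).cpu(), batch["attention_mask"].squeeze(0).cpu(), label))
torch.save(cache, "hidden_state_cache.pt")

# Probe training never touches the 3B backbone again, so larger batches fit in memory.
opt = torch.optim.AdamW(probe.parameters(), lr=1e-4, weight_decay=0.01)
for hs, mask, label in torch.load("hidden_state_cache.pt"):
    logits = probe(hs.unsqueeze(0), mask.unsqueeze(0))   # hypothetical probe(hidden, mask) API
    loss = F.cross_entropy(logits, torch.tensor([label]))
    loss.backward()
    opt.step()
    opt.zero_grad()
```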

04. Experiments & Results

🍞 Hook: Imagine a race where one runner sprints alone while another team runs together and hands off a baton smoothly—who finishes faster and cleaner?

🥬 The Concept (The Test): The authors measured how well probes can classify safety and sentiment/emotion by reusing the serving model’s hidden states in a single pass. They report F1 (for safety), AUPRC (when available), and accuracy (for sentiment/emotion). They also measured latency, throughput, and peak VRAM to see real serving costs. Why it matters: If reuse is both accurate and efficient, you can drop the extra guard-model call and simplify production.

🍞 Anchor: It’s like testing if one good backpack can carry both your lunch and your books without slowing you down.
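
For reference, the reported metrics correspond to standard scikit-learn calls; this toy snippet is illustrative, not the authors' evaluation code:

```python
from sklearn.metrics import f1_score, average_precision_score, accuracy_score

y_true = [1, 0, 1, 1, 0]             # toy labels (1 = unsafe)
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1]   # probe probabilities for the positive class
y_pred = [int(p >= 0.5) for p in y_prob]

print("F1:", f1_score(y_true, y_pred))                    # safety benchmarks
print("AUPRC:", average_precision_score(y_true, y_prob))  # threshold-free precision-recall
print("Accuracy:", accuracy_score(y_true, y_pred))        # sentiment/emotion benchmarks
```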

Benchmarks and backbones:

  • Serving backbone: Llama-3.2-3B-Instruct (frozen).
  • Safety: ToxicChat (in-distribution and cross-dataset), WildGuardMix.
  • Sentiment/Emotion: IMDB, SST-2, Emotion.
  • Baselines: logit-only reuse (MULI), standalone guard models/APIs (Llama Guard family, WildGuard, etc.), and big standalone classifiers.

Key scoreboard highlights (with context):

  • ToxicChat (in-distribution): MHA probe reached about 84.5 F1 and ~0.898 AUPRC. That’s stronger than the logit-only reuse baseline MULI (~77.8 F1, ~0.829 AUPRC) and even better than a standalone ToxicChat-T5 classifier (~82.2 F1) that needs a separate model call. Think of scoring a solid A when the alternative reuse method gets a B.
  • ToxicChat (trained on WildGuardMix, tested on ToxicChat): MHA got ~72.9 F1 and ~0.798 AUPRC; the scoring gate got ~64.8 F1 and ~0.706 AUPRC. These single-pass probes beat several guard-model/API baselines that require their own model calls, showing useful transfer.
  • WildGuardMix: MHA achieved ~88.6 F1, approaching the strong standalone WildGuard (~88.9 F1) while training only up to ~35M extra parameters and avoiding a second model call. The scoring gate reached ~86.0 F1 with just ~0.10M params. Direct pooling was ~82.8 F1.
  • IMDB, SST-2, Emotion: Reuse probes outperformed prompting the same model by a wide margin. MHA matched or beat large standalone classifiers on several tasks, especially the harder multi-class Emotion dataset (MHA ~87.7% vs. MULI ~64.1%).

Surprising/insightful findings:

  • Method ranking is consistent: pooling < scoring gate < MHA in accuracy.
  • Even with aggressive attention downcasting (e.g., d/32), MHA stays strong, implying the win comes from selecting the right places (tokens/layers), not just adding capacity.
  • Layer attention analysis shows class-conditional depth patterns (e.g., toxic vs. non-toxic rely on different layer ranges), validating the need for learned layer selection.

Efficiency results (why single-pass matters):

  • Latency/throughput: Probes add modest overhead vs. the base model, with pooling and the scoring gate staying especially close. MHA is slower but still far faster than a two-model guard-then-serve pipeline.
  • Memory: Peak VRAM stays near the serving footprint (~6.5–7.0 GB) for all probes. A guard-then-serve setup jumps to ~22.8 GB due to loading an additional 8B model.
  • Bottom line: Single-pass reuse preserves serving simplicity. You trade tiny overhead for major orchestration savings.

🍞 Anchor: It’s like grading a quiz while the student is still writing (no extra session), instead of scheduling a second meeting just to grade it.

05. Discussion & Limitations

🍞 Hook: You know how a Swiss Army knife is handy but still not a full toolbox?

🥬 The Concept (Honest assessment):

  • Limitations: The study used one backbone (Llama-3.2-3B-Instruct). Different model families and sizes may distribute information differently, so you might need to retune aggregators. Probes inherit the frozen model’s biases and blind spots; if the backbone struggles, the probe can’t magically fix it. Very long inputs can stress VRAM during training, and the higher-capacity MHA probe may need more labeled data. Probes only classify; they don’t generate friendly refusal explanations unless you add a follow-up prompt.
  • Required resources: A serving LLM (frozen), a small amount of GPU memory beyond the base model (the VRAM footprint stays close to serving alone), and modest training runs (hyperparameter searches are feasible thanks to tiny parameter counts). For MHA probes, plan for up to ~35M extra parameters.
  • When not to use: If you absolutely need standalone moderation detached from the serving model, or you must support many different backbones with very different internals, a separate guard model might still be simpler organizationally. If your task has almost no labeled data, even tiny probes might overfit, and you may need transfer or few-shot strategies.
  • Open questions: How well does this approach transfer across backbones (e.g., Gemma, Mistral, GPT) and scales (1B–70B+)? Can we further compress or distill probes? What’s the minimum labeled data needed per task? Can we extend to multi-modal inputs or to dynamic early-exit gating?

Why it matters: Knowing these boundaries helps teams choose the right tool: a single-pass probe for fast, integrated checks; a separate guard for more isolated workflows.

🍞 Anchor: If your bike has a great bell (probe), you don’t need a second cyclist (guard model) to ride ahead and warn people—but if you’re changing to a motorcycle (different backbone), you might need to refit the bell.

06. Conclusion & Future Work

🍞 Hook: Imagine turning your backpack into both a book bag and a lunchbox without making it heavier.

🥬 The Concept (3-sentence summary): This paper shows you can train a small probe to read the serving LLM’s hidden states and classify in the same forward pass, avoiding a separate guard model. The key is a two-stage selector that first summarizes tokens within each layer and then summarizes across layers, using either pooling, a tiny scoring gate, or compact self-attention. Across safety and sentiment/emotion benchmarks, these probes beat logit-only reuse and rival much larger standalone classifiers while keeping latency and memory near the base model.

Main achievement: Framing classification as representation selection over the full token-by-layer tensor—and proving that lightweight, single-pass probes can deliver strong accuracy with big system savings.

Future directions: Validate generalization across backbones and sizes, explore even leaner attention via better downcasting or shared parameters, develop few-shot/transfer training for low-data settings, and add response-side strategies (e.g., conditional re-prompting) for friendlier safety UX. Also consider multi-modal extensions and automated per-task aggregator discovery.

Why remember this: It’s a practical recipe to cut orchestration cost and latency today: reuse the computation you already paid for, learn where to read inside the model, and keep your safety and classification fast, simple, and strong.

🍞 Anchor: One pass, one model, many jobs—like a well-packed backpack that carries everything you need without slowing you down.

Practical Applications

  • On-the-fly safety moderation for chatbots without a separate guard-model call.
  • Sentiment and emotion tagging for customer feedback while generating summaries.
  • Policy compliance checks (e.g., medical, legal) inside the same serving pass.
  • PII detection or privacy redaction signals during request handling.
  • RAG pre-filtering: classify and drop unsafe or low-quality retrieved passages inline.
  • Jailbreak detection and refusal triggers computed from hidden states as the model reads inputs.
  • Routing decisions: pick a downstream tool or persona based on in-pass classification.
  • Early-exit gating for expensive reasoning: decide whether to proceed to chain-of-thought.
  • Content labeling for analytics (toxicity, categories) without extra inference cost.
  • A/B testing probes to study where tasks light up across layers for model interpretability.
#LLM orchestration · #single-pass classification · #hidden-state probing · #token-layer aggregation · #scoring attention gate · #multi-head self-attention · #downcasting · #safety moderation · #sentiment analysis · #BERTology · #representation selection · #latency reduction · #VRAM efficiency · #logit reuse · #guard models