ASA: Training-Free Representation Engineering for Tool-Calling Agents

Intermediate
Youjin Wang, Run Zhou, Rong Fu et al. · 2/4/2026
arXiv

Key Summary

  • The paper finds a strange gap: the model’s hidden thoughts almost perfectly show when it should use a tool, but its actual words often don’t trigger the tool under strict rules.
  • ASA is a tiny, training-free add-on that nudges one hidden layer once, right before generation, so the model actually flips into tool mode when it should—and holds back when it shouldn’t.
  • It uses a router to pick a domain (like math or code), mixes a domain vector with a global vector, and then a probe-guided ‘traffic light’ decides to push, pull, or do nothing.
  • On MTU-Bench with Qwen2.5-1.5B, strict tool-use F1 jumps from 0.18 to 0.50 while false positives drop from 0.15 to 0.05—using only ~20 KB of data and no training.
  • Ablations show the signed gate is the safety valve: remove it and false triggers explode; use it and recall rises while FPR stays low.
  • ASA keeps outputs well-formed (valid JSON, correct tool name, proper arguments), so it improves the decision to enter tool mode without breaking formatting.
  • Compared with prompts (fragile) and PEFT like LoRA (heavy to maintain), ASA is a low-cost middle ground that’s robust under changing tool schemas.
  • It scales across model sizes, but can’t create tool ability from scratch; it works when the base model already ‘knows’ how to call tools.
  • ASA is portable and deployment-friendly: a single-hook intervention, deterministic parsing, and tiny storage footprint make it easy to ship and update.

Why This Research Matters

Real products rely on precise tool calls—think booking tickets, computing totals, searching the web, or running code—and tiny formatting mistakes or missed triggers can break the user experience. ASA gives teams a low-cost, training-free way to make tool use more reliable even as APIs and schemas change. By nudging only when confident, it reduces wasted tool calls and protects precision, saving compute and money. Because it’s just a few vectors and tiny linear layers (~20 KB), it’s easy to ship, version, and roll back. It also preserves output validity, so you don’t trade reliability for broken JSON. Overall, it helps AI assistants act on what they already ‘know,’ turning understanding into correct, auditable action.

Detailed Explanation


01Background & Problem Definition

You know how a smart student can understand a topic but still mess up when filling out a super-strict answer form? That’s a lot like how AI models act when they try to call tools (like calculators or web search) in the real world.

Before this work, people mainly tried two ways to make models call tools correctly in changing environments: prompts and training. Prompts and schema instructions are easy to deploy, but they’re touchy—change the wording, add extra context, or tweak the tool’s rules, and things can break. Training tiny adapters (like LoRA) helps inside one domain but costs time and money to keep updating as tools change, and sometimes it slowly makes the model forget other skills. Meanwhile, in real products, APIs and their exact formats keep changing. That means the target keeps moving, and both prompts and continual fine-tuning struggle to keep up.

Here’s the core problem: tool calling is a strict, all-or-nothing switch. A parser judges whether the output contains the exact function-call block with valid JSON and allowable tool names. Even if the model understands that a tool is needed, tiny changes to its early words can decide whether it enters tool mode or just keeps chatting. So small shifts in prompts or schemas can flip the decision from ‘call tool’ to ‘don’t call,’ even when the model’s understanding hasn’t really changed.
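To make the all-or-nothing switch concrete, here is a minimal sketch of what such a deterministic parser might look like. The `<functioncall>` tag format comes from the article; the tool whitelist and function name are my own illustration, not the paper's actual implementation:

```python
import json
import re

# Hypothetical whitelist of allowed tool names
ALLOWED_TOOLS = {"calculator", "web_search"}

def parse_tool_call(output: str):
    """Return (name, args) only if the output contains an exact,
    well-formed <functioncall> block with a whitelisted tool name.
    Anything less exact means tool mode stays OFF."""
    m = re.search(r"<functioncall>(.*?)</functioncall>", output, re.DOTALL)
    if not m:
        return None  # no block at all: the model just kept chatting
    try:
        call = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON: parse fails, tool never runs
    name = call.get("name")
    args = call.get("arguments")
    if name not in ALLOWED_TOOLS or not isinstance(args, dict):
        return None  # unknown tool or bad argument structure
    return name, args
```

A single missing brace or a stray word inside the tag is enough to make this return `None`, which is exactly why small shifts in the early tokens can flip the outcome even when the model's understanding hasn't changed.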

Researchers tried more examples in prompts (few-shot), tighter schemas, or rule-based output fixes. But these can overcorrect (raising false alarms on tasks that don’t need tools), eat up context space, and break under new API versions. On the other side, parameter-efficient fine-tuning often improves calls in-domain but brings constant retraining and testing work as domains multiply—plus a risk the model drifts away from its general skills over time.

Then came a curious discovery: if you look at the model’s mid-layer activations (its "thoughts in the middle of thinking"), you can use a simple linear probe to tell—almost perfectly—if a tool is needed. But the model still doesn’t pull the trigger most of the time. The paper calls this the Lazy Agent failure mode. The intent is there in the representation, but the behavior doesn’t cross the strict parse boundary. In other words, the model ‘knows’ but doesn’t ‘do.’

This reveals the missing ingredient: a behavior-control layer that can gently and selectively push the model over the discrete decision boundary when needed, and pull it back when not. Instead of treating tool calling as just more knowledge or more examples, we should treat it as a precise switch that sometimes needs a nudge.

Enter ASA, a training-free representation engineering method that acts exactly where the problem lives: in the model’s internal activations. It mixes a global ‘tool intent’ direction with domain-specific directions (to avoid mixing up, say, web search with translation), and then uses a confidence-aware gate to decide whether to add, subtract, or skip that nudge. This single, mid-layer, single-shot step during the pre-fill phase (before the model starts generating tokens) targets the early decisions that matter most for triggering the function-call block.

Why should anyone care? Because in real apps—search assistants, coding copilots, translators, shopping bots—using the right tool at the right time is the difference between a helpful action and a wall of words. Wrong triggers can spam tools, waste money, or break pipelines; missed triggers can leave users doing manual work. ASA gives operators a lightweight, stable, and portable way to improve tool triggering without retraining every time an API changes. It shrinks maintenance hassle and helps keep behavior reliable across evolving interfaces.

Now, let’s introduce the key concepts using simple stories.

🍞 Top Bread (Hook): You know how your friend sometimes knows the answer but freezes during a quiz? 🥬 Filling (The Actual Concept): Representation–Behavior Gap is when the model’s internal state shows it knows a tool is needed, but its output doesn’t actually trigger the tool. How it works: 1) The model’s mid-layers encode a clear ‘tool intent’ signal; 2) Early token choices must cross a strict parser boundary (like emitting <functioncall> with valid JSON); 3) Without a push, the output often stays in normal text mode. Why it matters: Without bridging this gap, you miss needed tool calls and get brittle behavior under small changes. 🍞 Bottom Bread (Anchor): The model ‘thinks’ “use calculator,” but writes an explanation instead of the function-call block, so the tool never runs.

🍞 Top Bread (Hook): Imagine a school rule: you must show your work in a very exact format, or it doesn’t count. 🥬 Filling (The Actual Concept): Strict Tool-Mode Triggering means the tool ‘turns on’ only if the output contains an exact, parseable function-call block with a whitelisted tool name and well-formed arguments. How it works: 1) A deterministic parser scans the text; 2) If the precise structure appears, tool mode = on; 3) If structure is off—even slightly—tool mode = off. Why it matters: Tiny formatting errors or early-token choices decide success or failure. 🍞 Bottom Bread (Anchor): If the output lacks proper <functioncall>{"name":..., "arguments":{...}}</functioncall>, the tool won’t run, even if the rest of the answer is correct.

02Core Idea

🍞 Top Bread (Hook): Imagine you have a dimmer switch that can nudge a light brighter or dimmer right when someone walks into a room. 🥬 Filling (The Actual Concept): The ‘aha!’ is to add a tiny, training-free controller that nudges one mid-layer activation exactly once so the model flips into (or out of) tool mode at the right time. How it works: 1) Read the model’s hidden state right before generation; 2) Route to a domain (math/code/search/translation); 3) Build a steering direction from a domain vector plus a global tool-intent vector; 4) Use a confidence probe to decide whether to push (+), pull (−), or do nothing; 5) Inject that direction once; 6) Let the model generate normally. Why it matters: It bridges the ‘knows vs. does’ gap without retraining and keeps false triggers controlled. 🍞 Bottom Bread (Anchor): On a math question, the nudge pushes the model to emit a valid calculator call instead of a paragraph.

Explain the same idea 3 ways:

  • Thermostat analogy: The house (model) already knows the weather (intent). ASA is the thermostat that lightly adjusts the temperature (behavior) to the comfortable zone (correct tool call) without rebuilding the house.
  • Stage manager analogy: The actors (internal representations) are ready. ASA is the stage manager who cues the spotlight (tool mode) at the exact moment—not too soon, not too late.
  • GPS analogy: Many routes exist. ASA’s router picks the right lane (domain vector), and the gate decides whether to accelerate, brake, or coast.

Before vs. After:

  • Before: Prompts are fragile; fine-tuning is heavy; the model often ‘knows’ but won’t cross the parser boundary.
  • After: A single, mid-layer, signed nudge reliably crosses the boundary when needed and avoids crossing it when not.

Why it works (intuition):

  • Tool intent is already linearly readable in a mid-layer. That means there’s a direction in activation space that correlates strongly with ‘tool needed.’ Moving slightly along that direction early enough tips the logit competition in favor of emitting the function-call block. The gate keeps this selective: if confidence is low, don’t push; if the context screams ‘no tool,’ push the other way (suppress).
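The "tips the logit competition" intuition can be shown with toy numbers (all values here are my own illustration, not measurements from the paper): a nudge along a direction aligned with the trigger token's unembedding row raises that token's logit by exactly the projection of the nudge onto that row.

```python
import numpy as np

# Toy 2-D unembedding row for the trigger token (hypothetical numbers)
w_trigger = np.array([0.6, 0.8])          # note: unit length
h = np.array([0.1, -0.2])                 # hidden state leaning away from tool mode
v = w_trigger / np.linalg.norm(w_trigger) # steering direction aligned with the trigger

logit_before = w_trigger @ h              # trigger-token logit without the nudge
logit_after = w_trigger @ (h + 1.0 * v)   # single nudge with strength alpha = 1.0

# Because v is aligned with w_trigger, the logit rises by exactly ||w_trigger||,
# while a random orthogonal direction would barely move it.
```

This mirrors the paper's causal check (reported as a +0.84 logit shift for +v at α=1.0, with random directions doing almost nothing), just with made-up numbers small enough to verify by hand.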

Building blocks (with simple stories):

  • 🍞 Hook: Think of a universal remote that picks TV inputs. 🥬 Concept: Activation Steering Adapter (ASA) is a training-free controller that adds or subtracts a small vector to one hidden layer to steer behavior. How: 1) Construct steering vectors; 2) Route to a domain; 3) Gate the push; 4) Inject once. Why: Without it, internal intent doesn’t reliably become action. 🍞 Anchor: ASA bumps ‘emit <functioncall>’ just enough to win the early token tie-breaker.
  • 🍞 Hook: Choosing which locker to open matters. 🥬 Concept: Router-Conditioned Mixture of Steering Vectors picks a domain vector and mixes it with a global vector. How: 1) A tiny router classifies the domain from the hidden state; 2) Direction = domain_vector + β·global_vector. Why: Without routing, directions bleed across domains (e.g., search vs. translation get mixed), causing wrong tool names. 🍞 Anchor: For code tasks, it biases toward python_interpreter rather than web_search.
  • 🍞 Hook: A traffic light prevents crashes. 🥬 Concept: Probe-Guided Signed Gate decides +1 (push), −1 (suppress), or 0 (hold). How: 1) A domain-specific probe outputs tool-intent probability p; 2) If p>τ, push; if p<1−τ, suppress; else do nothing. Why: Without the gate, false triggers explode. 🍞 Anchor: On chit-chat (no tool needed), the gate flips the direction and suppresses tool mode.
  • 🍞 Hook: Adjust volume once, not nonstop. 🥬 Concept: Mid-Layer Intervention applies a single-shot nudge during the pre-fill. How: 1) Extract the last-token hidden state at layer L; 2) Inject once; 3) Continue decoding. Why: Early, exact timing shifts the first tokens that control parser success. 🍞 Anchor: That one nudge makes the model choose “<functioncall>” as the first special token.
  • 🍞 Hook: Use the right-sized push. 🥬 Concept: Strength knob α (and global weight β). How: 1) α controls how far to move along the steering direction; 2) β blends global vs. domain evidence. Why: Too small doesn’t cross the boundary; too big can cause spurious triggers. 🍞 Anchor: α=4 on Qwen2.5-1.5B hits the sweet spot in tests.
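The mixture and traffic-light pieces above are simple enough to sketch directly. This is a minimal reconstruction from the description in this article (function names and the default τ are my choices, not the paper's code):

```python
import numpy as np

def mixture_of_vectors(v_domain, v_global, beta):
    """MoV: blend a unit-normalized domain direction with a unit-normalized
    global 'tool intent' direction, weighted by beta, then re-normalize."""
    v = (v_domain / np.linalg.norm(v_domain)
         + beta * v_global / np.linalg.norm(v_global))
    return v / np.linalg.norm(v)

def signed_gate(p, tau=0.8):
    """Traffic light on probe confidence p:
    +1 = push toward tool mode, -1 = suppress it, 0 = leave alone."""
    if p > tau:
        return +1
    if p < 1.0 - tau:
        return -1
    return 0
```

Note the symmetry of the gate: the same τ that demands high confidence before pushing also demands high confidence of "no tool" before suppressing, so uncertain inputs are left untouched.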

03Methodology

At a high level: Input → Pre-fill forward pass → Extract mid-layer hidden state → Route domain + Probe confidence → Compose steering vector → Gate decides sign or skip → Inject once → Normal decoding → Output.

Step-by-step, like a recipe (and why each step exists):

  1. Extract the hidden state (what): During the pre-fill (before token generation), read the last-token residual stream at a chosen layer L. Why: Early decisions decide if the parser will see the function-call block. Missing this timing weakens the effect. Example: For Qwen2.5-1.5B, L=18 worked best by probe AUC.

  2. Standardize (what): Optionally normalize the hidden state using train-set means/vars. Why: Makes the router and probes stable across inputs; without it, routing can get noisy and mis-pick domains. Example: With standardization, domain classification is crisper, reducing cross-domain leakage.

  3. Route to a domain (what): A tiny linear router predicts which domain (code, math, search, translation) the current input belongs to. Why: Different domains have different ‘tool-intent’ geometry; without routing, the steering direction can point partly the wrong way. Example: Using the correct domain vector reduced false tool-name picks and improved precision.

  4. Probe tool-intent confidence (what): A simple per-domain linear probe outputs p, the probability that a tool should be used. Why: We need a confidence-aware controller; otherwise, we push when we shouldn’t or fail to suppress spurious triggers. Example: In ablations, removing the gate (which uses p) caused FPR to jump to 0.50.

  5. Compose the direction (what): Build MoV = v_domain + β·v_global, where each v is unit-normalized. Why: The global vector captures shared ‘tool-ish’ signals; the domain vector captures specifics. Without the global part, recall may lag; without the domain part, schema errors rise. Example: ‘Global only’ raised recall but had higher FPR; ‘Domain only’ struggled to trigger consistently.

  6. Gate the sign (what): Compute Gate = +1, 0, or −1 based on thresholds around p (e.g., p>τ). Why: Signed control gives two powers: amplify true intent (+) and suppress spurious intent (−). Without it, you can’t reduce false triggers reliably. Example: With the gate, ASA dropped FPR from 0.15 to ~0.05 at strong α.

  7. Inject once (what): h′ = h + Gate · α · MoV. Why: A single, well-timed nudge is enough to tilt the early token competition without destabilizing later formatting. Repeated nudges can over-steer. Example: Increasing α from 0.5 to 4.0 raised F1 from ~0.20 to ~0.50 while keeping validity strong.

  8. Continue normal decoding (what): No further interventions; generate greedily. Why: Deterministic decoding + strict parser make behavior auditable; randomness isn’t masking the effect. Example: Under greedy decoding, improvements reflect true control rather than lucky sampling.
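Steps 3–7 of the recipe can be composed into one pre-fill hook. This is a toy sketch with tiny linear layers standing in for the router and probes; the real layer index, dimensions, and weights come from the paper's calibration data, which I don't reproduce here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asa_prefill_hook(h, router, probes, v_domains, v_global,
                     alpha=4.0, beta=0.5, tau=0.8):
    """One-shot ASA intervention on the last-token hidden state h at the
    chosen layer, applied during pre-fill. router = (W, b) linear domain
    classifier; probes = list of per-domain (w, b) tool-intent probes."""
    W, b = router
    d = int(np.argmax(W @ h + b))                 # step 3: route to a domain
    w_d, b_d = probes[d]
    p = sigmoid(w_d @ h + b_d)                    # step 4: confidence probe
    mov = v_domains[d] + beta * v_global          # step 5: mixture of vectors
    mov = mov / np.linalg.norm(mov)
    gate = 1 if p > tau else (-1 if p < 1 - tau else 0)  # step 6: signed gate
    return h + gate * alpha * mov                 # step 7: inject once

# Toy 4-D example: domain 0 is selected and the probe is confident,
# so h is pushed by exactly alpha along the unit MoV direction.
h = np.array([1.0, 0.0, 0.0, 0.0])
router = (np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]]), np.zeros(2))
probes = [(np.array([10.0, 0, 0, 0]), 0.0), (np.zeros(4), 0.0)]
v_domains = [np.array([0, 1.0, 0, 0]), np.array([0, 0, 1.0, 0])]
v_global = np.array([0, 0, 0, 1.0])
h_steered = asa_prefill_hook(h, router, probes, v_domains, v_global)
```

In a real deployment this function would be registered as a forward hook on layer L (e.g., L=18 for Qwen2.5-1.5B per the paper), fire once at the last pre-fill position, and then stay silent for the rest of decoding.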

Mini data example (with numbers):

  • Measured causal effect: Adding +v increased the early trigger-token logit by +0.84 (α=1.0), while −v decreased it by −0.94. Random directions barely moved it. This shows the direction is meaningfully aligned to tool intent, not just random energy.
  • Layer selection: Probes at the chosen layer achieved ~0.999 AUC for tool vs. non-tool, meaning the signal is clean and decodable—ripe for control.

What breaks without each part:

  • No router → cross-domain confusion → wrong tools, more schema mismatches.
  • No probe/gate → spurious triggers skyrocket (FPR ~0.50 in ablation), bad for deployment.
  • No global vector → weaker shared intent → recall stalls.
  • No domain vector → more wrong tool names → lower success precision.
  • Inject too late → early tokens already chosen → too little influence to cross the boundary.
  • Inject too often → formatting risks rise; selective single-shot is safer and cleaner.

Secret sauce:

  • The signed gate is the safety valve that lets ASA push when it should, pull when it shouldn’t, and stay quiet when unsure.
  • The mixture-of-vectors balances general ‘tool-ish’ cues with domain specifics to reduce interference.
  • Doing it training-free keeps it portable: vectors + tiny linear weights ≈ 20 KB, so updating across APIs is cheap.

New concepts introduced here (with Sandwich explanations):

  • 🍞 Hook: You know how you can point in the direction of your classroom? 🥬 Concept: Steering Vector is a direction in the model’s hidden space that nudges it toward (or away from) tool use. How: Compute pos-mean minus neg-mean, then unit-normalize. Why: Without a good direction, pushes won’t move the right behavior. 🍞 Anchor: Adding +v makes ‘<functioncall>’ more likely; adding −v makes it less likely.
  • 🍞 Hook: A quick ‘confidence check’ before acting avoids mistakes. 🥬 Concept: Linear Probe estimates p, the chance a tool is needed, from the hidden state. How: A simple weighted sum plus sigmoid. Why: Without confidence, the gate doesn’t know when to push or hold. 🍞 Anchor: High p on a math sum → green light to push; low p on chit-chat → red light to suppress.
  • 🍞 Hook: A tiny dial sets how strong the push is. 🥬 Concept: Strength α controls how far to move along the direction; β controls how much global vs. domain signal you mix. How: Multiply MoV by α, and include β·v_global. Why: Too small won’t help; too big may over-trigger. 🍞 Anchor: α=4 hit the sweet spot in Qwen2.5-1.5B tests.
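The "pos-mean minus neg-mean, then unit-normalize" recipe for a steering vector is one line of vector math. A minimal sketch (the variable names are mine; the real activations would come from the calibration split at layer L):

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Difference-of-means direction: mean hidden state over tool-needed
    examples minus mean over non-tool examples, unit-normalized.
    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim)."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)
```

The same function builds both the global vector (pooling all domains) and each domain vector (pooling only that domain's examples), which is what keeps the whole asset pack tiny.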

04Experiments & Results

The test (what and why):

  • The team built a 1,600-sample, four-domain benchmark (Math, Code, Search, Translation). Each sample is labeled as Tool-Necessary or Non-Tool, and a strict, deterministic parser checks if the output truly enters tool mode and is executable. Why this matters: it mimics real deployment rules where exact structure and arguments must be correct.
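For readers who want to reproduce the scoring logic, here is a minimal sketch of how strict F1 and FPR can be computed from per-sample labels and parser verdicts. This is my illustration of the standard definitions; the paper's exact scoring pipeline may differ in details:

```python
def strict_metrics(tool_needed, parser_accepted):
    """Strict scoring: a call 'counts' only if the deterministic parser
    accepted it. tool_needed / parser_accepted are parallel bool lists."""
    pairs = list(zip(tool_needed, parser_accepted))
    tp = sum(y and c for y, c in pairs)          # needed a tool, called one
    fp = sum((not y) and c for y, c in pairs)    # spurious trigger
    fn = sum(y and (not c) for y, c in pairs)    # missed (or malformed) call
    tn = sum((not y) and (not c) for y, c in pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0     # false alarms on non-tool samples
    return f1, fpr
```

Note that under this scoring a perfectly reasonable answer with slightly broken JSON lands in `fn`, which is exactly the Lazy Agent failure the paper targets.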

The competition:

  • Prompt-only baselines (zero-shot, few-shot system prompts): easy to deploy but fragile; few-shot may raise recall but also spike false triggers.
  • PEFT baselines (LoRA, Prefix-Tuning, BitFit, Q-LoRA): can improve triggering but require training, storage, and revalidation as APIs evolve.
  • ASA: training-free vector injection with a router and signed gate.

Scoreboard with context:

  • Qwen2.5-1.5B, strict setting. Baseline (prompt-only) F1 ≈ 0.18 (like a low D grade). ASA with α=4.0 lifts strict tool-use F1 to ≈ 0.50 (solid C+/B−) while slashing FPR from ≈ 0.15 to ≈ 0.05 (false alarms cut by two-thirds). All this using ~20 KB of portable assets and no training.
  • LLaMA 8B: Across domains, ASA raises F1 to ~0.80 overall with FPR ~0.07, clearly beating prompt-only and vector-baseline controls, especially in domains that weren’t already saturated.
  • Qwen2.5-8B: Best layer retuned (deeper than 1.5B); F1 jumps from 0.38 to 0.64 while FPR plunges from 0.28 to 0.06. This shows ASA scales but the best layer shifts with model size.

Post-trigger validity (why not just more triggers?):

  • ASA’s gains mostly affect the decision to enter tool mode. Once triggered, format, tool name, and args remain stable and high—so ASA isn’t breaking JSON or mangling arguments. The signed gate helps avoid over-triggering that would tank success precision.

Surprising findings:

  • Representation–Behavior Gap: Probes at mid-layers read ‘tool intent’ with ~0.999 AUC across sizes—nearly perfect—yet the base model still often refuses to trigger under strict parsing. So the missing piece is control, not knowledge.
  • Direction is causal, not random: +v raises the trigger-token logit/prob, −v lowers it, and random vectors don’t help—proving the steering vectors carry the right semantics.
  • Gate is essential: Removing it causes false positives to explode (FPR ~0.50), even though recall rises. The gate recovers precision and keeps ASA deployment-friendly.
  • Router headroom: With an oracle-perfect router (using ground-truth domain), FPR becomes ~0.01—hinting that better routing is a prime path to further gains.

Efficiency matters:

  • ASA: ~20 KB, no training, single-hook addition. LoRA: ~10–19 MB per adapter and retraining when APIs change. Under constant schema churn, shipping a tiny vector-and-gate pack is far easier than retraining and requalifying adapters.

Takeaway grades (metaphors):

  • Baseline prompts: sometimes get the right questions but nervous at the buzzer—missed triggers or too many false calls.
  • PEFT: can study hard to improve but needs new study for each test version.
  • ASA: a well-timed nudge at the buzzer that helps the right answers get ‘counted’ by strict graders.

05Discussion & Limitations

Limitations (honest view):

  • ASA can’t create tool use from nothing. On very small models that never learned to call tools, pushing along the ‘intent’ direction won’t help much.
  • Routing and probing accuracy are bottlenecks. If the router picks the wrong domain or the probe’s confidence is off, ASA may push the wrong way.
  • Hyperparameters (layer L, α, τ, β) need a quick validation sweep. Wrong choices can under-steer (missed triggers) or over-steer (false alarms).
  • Domain scaling requires adding experts and updating the router. It’s still light-weight, but someone must manage the small direction library as new tools appear.
  • Edge adversarial prompts or unusual schemas may still fool the controller; continued robustness testing is important.

Required resources:

  • Base model that already encodes tool-calling circuits (works best with 1.5B+ in tests).
  • Small calibration and train splits to build vectors, fit routers/probes, and tune thresholds.
  • Modest compute (e.g., a single GPU) and ~20 KB storage for assets.

When NOT to use:

  • If the model has no demonstrated tool ability (e.g., 0.5B case): ASA won’t invent the behavior.
  • If the interface is not parser-defined (free-form outputs without strict structure), the benefit may be smaller.
  • If you can afford comprehensive fine-tuning and frequent retesting, and tools are stable, PEFT might suffice.

Open questions:

  • Can we design better routers (or shared embeddings) that generalize across many more tools without increasing confusion?
  • Can we learn or discover multiple layer hooks or multi-point schedules that remain training-free but add flexibility?
  • How to auto-tune α, τ, and β per domain dynamically from live telemetry while avoiding feedback loops?
  • Can the same gated steering manage other discrete mode switches (e.g., chain-of-thought on/off, safety refusals) robustly?
  • What safeguards prevent malicious use (e.g., suppressing safety-critical tool calls)?

06Conclusion & Future Work

Three-sentence summary:

  • Many LLMs internally ‘know’ when a tool is needed, but strict, parser-defined tool calling often fails—revealing a representation–behavior gap.
  • ASA is a tiny, training-free, mid-layer controller that routes to a domain, composes a steering direction, and uses a confidence-gated, signed nudge to cross the tool-mode boundary only when appropriate.
  • On MTU-Bench, ASA significantly boosts strict tool-use F1 and slashes false positives with only ~20 KB of assets, preserving JSON and argument validity.

Main achievement:

  • Turning decodable intent into reliable action under strict parsing—without changing model weights—by combining domain-aware directions with a probe-guided signed gate.

Future directions:

  • Improve routing, explore multi-layer or adaptive schedules while staying training-free, and extend to more domains and safety-sensitive mode switches with stronger safeguards.

Why remember this:

  • ASA shows that a small, well-placed, training-free nudge can fix a big practical problem—bridging the ‘knows vs. does’ gap for tool-calling agents—making deployments more robust, cheaper to maintain, and easier to adapt as APIs evolve.

Practical Applications

  • Harden a coding assistant to reliably trigger the python tool only when code execution is needed, reducing accidental runs.
  • Stabilize a customer-support bot so it calls the search tool for knowledge-base lookups with fewer false alarms.
  • Improve a math tutor’s ability to choose the calculator tool for arithmetic steps while keeping plain-language explanations when appropriate.
  • Make a shopping assistant consistently call pricing or inventory APIs using the correct schema after catalog changes.
  • Deploy a translation agent that uses a translation tool only for non-English text, suppressing tool use for already-English queries.
  • Retrofit existing LLM services with a single mid-layer hook to lower maintenance during frequent API/version updates.
  • Create a portable ‘tool vector pack’ per domain that can be hot-swapped across environments without retraining the base model.
  • Use probe telemetry to tune α and τ on a validation slice, then lock settings for deterministic, auditable behavior.
  • Reduce infrastructure costs by cutting unnecessary tool invocations while increasing success on truly tool-needed tasks.
  • Scale to multiple domains by adding new domain vectors and updating the tiny router, keeping storage and rollout simple.
#activation steering · #representation engineering · #tool calling · #function calling · #mid-layer intervention · #linear probe · #gated control · #mixture of vectors · #domain routing · #strict parsing · #false positive rate · #F1 score · #training-free adaptation · #LoRA comparison · #API schema shift