Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models
Key Summary
- •The paper shows that when vision-language models write captions, only a small set of uncertain words (about 20%) act like forks that steer the whole sentence.
- •By gently nudging the image so the model is extra unsure exactly at those fork words, attackers can derail the caption more efficiently than spreading effort over all words.
- •This focused attack not only makes captions wrong but also makes many of them unsafe, with harmful content rising to around 35–49% in tests.
- •The same fork words tend to appear across different models, so attacks crafted for one model often work (partly) on others too.
- •The authors propose EGA (Entropy-bank Guided Attacks), which picks those fork spots using a prebuilt token bank, avoiding heavy internal peeking at each new target model.
- •EGA matches or beats other attack methods on success while sharply increasing harmful outputs, even on short VQA answers.
- •They also find 'harmful content propagation': once unsafe ideas start at a fork, they can spread through later words even if the image is removed.
- •The study spotlights a structural weakness in autoregressive generation: big decisions concentrate at a few unstable moments.
- •Understanding and defending those high-entropy decision points could make future multimodal systems safer and more reliable.
Why This Research Matters
This work pinpoints where multimodal systems are structurally fragile: a few uncertain 'fork' tokens. By revealing that tiny, focused nudges at those forks can cause large, sometimes unsafe shifts, it gives defenders concrete targets for hardening models. It also shows that many models share these forks, meaning a fix in one place could improve safety across families of systems. For users, better protection at forks can reduce hallucinations, misdescriptions, and unsafe content in everyday tools. For high-stakes uses—assistive tech, robotics, or vehicles—stabilizing forks could prevent small disturbances from cascading into big errors. Overall, the paper reframes safety as guarding pivotal decisions rather than trying to armor every token equally.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: Imagine telling a story one word at a time. Sometimes you hit a word like 'but' or 'however' and suddenly the story could go in many directions. Those tiny moments decide the whole plot.
🥬 The Concept (Vision-Language Models, VLMs): VLMs are AIs that look at pictures and talk about them, like writing captions or answering questions about an image. How it works: 1) Break the picture into visual pieces and the text into tokens (word-like bits). 2) Mix picture and text inside a transformer. 3) Generate the next word step-by-step (autoregressive). Why it matters: Without these models, computers can’t explain what they see; with them, apps like screen readers, search, and assistants get smarter.
🍞 Anchor: A VLM sees a photo of a dog chasing a ball and writes, 'A brown dog runs to catch a ball in the park.'
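For readers who want to see what "look at a picture and describe it" looks like in code, here is a minimal sketch using the Hugging Face pipeline API. The checkpoint and image URL are illustrative choices, not the models or data studied in the paper (which evaluates Qwen, InternVL, and LLaVA on COCO and TextVQA).

```python
# Minimal captioning sketch (assumption: transformers is installed and the
# example checkpoint/image are reachable; neither comes from the paper).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
result = captioner("https://example.com/dog_in_park.jpg")  # placeholder image URL
print(result[0]["generated_text"])  # e.g. "a brown dog running in the grass"
```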
🍞 Hook: You know how a puzzle is easier when you find the corner pieces first? In language, some words act like 'corners'—they guide the rest of the sentence.
🥬 The Concept (Autoregressive generation): It means the model writes one token at a time, each choice depending on earlier choices. How it works: 1) Look at the picture and what’s been written. 2) Predict the next token. 3) Add it and repeat. Why it matters: If any early choice goes wrong, later words can drift further away, like taking a wrong turn and following it for miles.
🍞 Anchor: If the model writes 'A cat sits...' instead of 'A dog sits...', later words might start describing cat things (whiskers, meowing), even if the photo shows a dog.
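A toy, hand-written sketch (not a real VLM) makes the ripple effect concrete: each word depends only on what came before, so flipping one early word sends the rest of the sentence down a different path.

```python
# Toy next-word table standing in for a real autoregressive model.
NEXT_WORD = {
    ("a", "dog"): "fetches", ("dog", "fetches"): "the", ("fetches", "the"): "ball",
    ("a", "cat"): "chases", ("cat", "chases"): "a", ("chases", "a"): "mouse",
}

def generate(first_two, steps=3):
    """Greedy, one token at a time: each choice depends only on earlier choices."""
    tokens = list(first_two)
    for _ in range(steps):
        nxt = NEXT_WORD.get((tokens[-2], tokens[-1]))
        if nxt is None:
            break
        tokens.append(nxt)
    return " ".join(tokens)

print(generate(["a", "dog"]))  # a dog fetches the ball
print(generate(["a", "cat"]))  # a cat chases a mouse  <- one early flip, new storyline
```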
🍞 Hook: Think of a spinning coin. When it’s spinning, you’re unsure—heads or tails? That uncertainty is a clue about what happens next.
🥬 The Concept (Entropy as uncertainty): Entropy is a number that measures how unsure the model is about the next token. How it works: 1) The model gives probabilities to many possible next tokens. 2) If it’s spread out (many are likely), entropy is high; if one is very likely, entropy is low. 3) High entropy marks a 'fork' where the path can split. Why it matters: High-entropy steps are fragile; small nudges there can flip the whole sentence’s direction.
🍞 Anchor: If the model’s next-token choices are 'dog' (40%), 'cat' (35%), 'kid' (25%), entropy is high—any small push can change the story.
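That uncertainty can be computed directly. A minimal sketch of Shannon entropy over a next-token distribution, using the numbers from the anchor above:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: higher means the model is less sure of its next token."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fork = {"dog": 0.40, "cat": 0.35, "kid": 0.25}        # spread out -> high entropy (a fork)
confident = {"dog": 0.98, "cat": 0.01, "kid": 0.01}   # peaked -> low entropy

print(f"fork entropy:      {entropy(fork.values()):.2f} bits")       # ~1.56
print(f"confident entropy: {entropy(confident.values()):.2f} bits")  # ~0.16
```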
🍞 Hook: Picture a tower of Jenga blocks. Most blocks are snug, but a few are wobbly. Touching a wobbly one can topple the tower.
🥬 The Concept (High-entropy tokens): These are the wobbly tokens—points where the model is most unsure and decisions can flip. How it works: 1) Identify tokens with the highest entropy. 2) Notice they’re often connecting words or pivots ('and', 'but', 'or', adjectives/verbs that steer meaning). 3) A tiny push here can change future choices. Why it matters: Not every token matters equally; these few tokens steer the rest.
🍞 Anchor: Replacing 'A dog is playful and...' with 'A dog is playful but...' sends the sentence in a very different direction.
🍞 Hook: Imagine tapping a picture so lightly that you can’t see the change, but a finicky balance gets upset.
🥬 The Concept (Adversarial attacks): These are tiny, carefully crafted changes to the input (often pixels) that make the model choose wrong. How it works: 1) Find where the model is fragile. 2) Make small, invisible-like tweaks. 3) The model’s predictions shift dramatically. Why it matters: In safety-critical settings, small disturbances shouldn’t break the system—but they can.
🍞 Anchor: A nearly invisible pixel change fools a model into misreading a stop sign as a speed-limit sign.
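To make "tiny, carefully bounded" concrete, here is a minimal sketch of the constraint side only: keeping a modified image inside a small per-pixel budget. The 8/255 budget is a common convention in the robustness literature, not a value taken from this paper, and nothing here shows how a perturbation is actually crafted.

```python
import numpy as np

def clip_to_budget(clean, modified, eps=8 / 255):
    """Keep a modified image within a small L-infinity budget of the original:
    no pixel moves by more than eps, and values stay in the valid [0, 1] range."""
    delta = np.clip(modified - clean, -eps, eps)
    return np.clip(clean + delta, 0.0, 1.0)

clean = np.random.rand(3, 32, 32)                     # stand-in for an image tensor
modified = clean + np.random.normal(scale=0.1, size=clean.shape)
bounded = clip_to_budget(clean, modified)
print(np.abs(bounded - clean).max())                  # at most 8/255 ~= 0.031
```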
The world before: VLMs like Qwen, InternVL, and LLaVA got very good at captioning and visual Q&A. People used them for accessibility, search, and assistance. But studies found they could hallucinate (make things up) or be nudged to go off-track with minor pixel changes. Early defenses and tests existed, and a recent attack (MIE) tried to increase uncertainty everywhere in the sentence, assuming each token mattered equally.
The problem: Treating all tokens the same wastes effort. In autoregressive text, a few 'fork' tokens matter most. Global attacks spread the budget thin and miss the key moments that steer meaning and safety.
Failed attempts: 1) Global entropy maximization—effective but inefficient and less focused on the real decision points. 2) Generic gradient attacks—can derail meaning but often stay 'safe wrong' (off-topic yet not clearly unsafe). 3) Token-agnostic approaches—don’t use the special power of high-entropy steps.
The gap: We lacked a method that targeted only those crucial fork tokens—where uncertainty is concentrated—and also worked across different models without constant inside access.
Real stakes: If a helper camera app mis-describes a scene or drifts into unsafe topics, that harms trust and utility. In cars, robots, or medical contexts, small perturbations could mislead systems. Even in everyday tools, it risks privacy leaks or unsafe content. The paper focuses on measuring and revealing the weakness, so defenses can harden those fragile forks.
02 Core Idea
🍞 Hook: You know how a bike only needs light steering nudges at certain turns—not constant heavy pushing—to change where you end up?
🥬 The Concept (Aha! Insight): Focusing tiny perturbations on just the small set of high-entropy (most uncertain) tokens steers the entire caption more efficiently than attacking all tokens. How it works: 1) Find the top uncertain (high-entropy) positions—the 'forks'. 2) Nudge the image so the model gets extra unsure at those exact steps. 3) The next-token choice flips, and later words follow the new path. Why it matters: This uses less budget but causes bigger semantic drift—and more unsafe content—than global attacks.
🍞 Anchor: Instead of pushing every word, a slight nudge at 'and'→'but' shifts the sentence’s meaning and the rest of the story.
Three analogies to see it clearly:
- Choose-your-own-adventure book: If you flip the option at one critical choice page, the whole rest of the story changes.
- Train track switch: Moving a single switch sends the train onto a totally different route.
- Jenga: Pulling one loose block can cause a cascade, while pushing many tight blocks does nothing.
Before vs. After:
- Before: Attacks tried to raise uncertainty everywhere, assuming every token contributed equally to instability.
- After: We now know only a few tokens matter most. Concentrating nudges on those forks is enough to change the outcome—and it’s often more effective and more transferable.
Why it works (intuition, no equations):
- Autoregressive ripple: In step-by-step generation, early flips create context that guides later flips.
- Entropy = leverage: High entropy means multiple next-tokens are almost tied; a tiny push can swap the winner.
- Propagation: Once a delicate decision flips, the new context increases the chance of further flips downstream (observed 'harmful content propagation').
- Shared forks: Many models learn similar 'decision words,' so focused attacks can transfer across architectures.
Building blocks (the idea in pieces):
- High-entropy token selection: Identify the 20% most uncertain token positions in a clean run—those are the forks.
- Fork-focused perturbation: Adjust only the image (not the text) to increase uncertainty right at those positions.
- Harmful-mass tracking: Measure how probabilities for unsafe categories rise at and after fork positions.
- Transferability: Show that the same fork tokens recur across different models, enabling cross-model attacks.
- EGA (Entropy-bank Guided Attacks): Build a reusable 'token bank' of especially vulnerable fork tokens, so you can target them later without heavy inside access (a small illustrative sketch follows the anchor below).
🍞 Anchor: If the model wavers between 'dog' and 'cat' (almost tied), a tiny image nudge at that moment may flip it to 'cat.' Later, the caption naturally starts mentioning 'whiskers,' 'meowing,' and other cat details—because the early fork set the path.
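As a rough illustration of the token-bank building block referenced above, the sketch below uses an invented bank of 'fragile' tokens with made-up flip rates; the paper's actual bank construction and contents are not reproduced here.

```python
# Hypothetical entropy bank: token -> flip rate observed on a source model (invented numbers).
TOKEN_BANK = {"and": 0.62, "but": 0.71, "with": 0.55, "holding": 0.49, "near": 0.44}

def find_bank_positions(caption_tokens, bank=TOKEN_BANK, min_flip_rate=0.5):
    """Mark positions in a new model's clean caption where known-fragile bank tokens
    appear; these serve as candidate forks even without access to that model's entropy."""
    return [i for i, tok in enumerate(caption_tokens)
            if bank.get(tok.lower(), 0.0) >= min_flip_rate]

caption = "A brown dog runs and jumps with a ball".split()
print(find_bank_positions(caption))  # [4, 6] -> the positions of 'and' and 'with'
```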
03 Methodology
Note: To keep people safe, this section explains the idea at a high level without giving step-by-step instructions that someone could use to build harmful tools. The goal is understanding and defense, not misuse.
🍞 Hook: Imagine making a paper boat sail down a stream. You don’t push it the whole way; you tap it at a few bends so it turns and the current does the rest.
🥬 The Concept (High-level pipeline): Image + Prompt → Find uncertain 'fork' tokens → Nudge pixels to raise uncertainty at just those forks → Generate text → Observe how the path changes. How it works (broad recipe):
- Run the model normally to see where it hesitates most (the high-entropy tokens).
- Create tiny, carefully bounded pixel tweaks to make those specific hesitations even shakier.
- Keep the model’s decoding rule the same as before, so differences come from the input nudge (a small decoding sketch follows the anchor below).
- Generate the caption or answer and measure how far it drifted and whether unsafe content appears.
Why it matters: If we can reproduce the failure using small, focused tweaks, we’ve found a real structural weak spot to defend.
🍞 Anchor: Like tapping the boat only at sharp turns so the stream carries it onto a new route.
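For the "keep decoding consistent" part of the pipeline, here is a minimal sketch assuming a Hugging Face-style model and processor pair; the `caption` helper and its defaults are illustrative, not the paper's setup. Greedy decoding (`do_sample=False`) with a fixed length means any difference between the clean and nudged runs comes from the input alone.

```python
import torch

def caption(model, processor, image, prompt="Describe the image."):
    """Deterministic captioning: the same prompt, decoding rule, and length are used
    for both the clean image and the nudged image, so only the input differs."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False, max_new_tokens=64)  # greedy
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```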
Step-by-step (conceptual, not operational):
- Step A: Identify forks (high-entropy tokens) • What happens: The model writes a clean caption and notes where it was most unsure about the next token. • Why this step exists: Without the forks, you’d waste effort poking sturdy spots that don’t affect the route. • Example: The model hovers between 'dog' vs 'cat' or 'and' vs 'but'—that’s a fork.
- Step B: Build a sparse focus mask • What happens: Choose only the top chunk (about 20%) of these forks to target, leaving the rest untouched. • Why this step exists: Concentration beats dilution—like watering only the roots. • Example: Out of 50 tokens, maybe 10 are forks; those get the spotlight. (A small selection sketch follows this list.)
- Step C: Apply tiny pixel nudges around those forks • What happens: Make slight, bounded changes to the image so uncertainty specifically rises at those steps. • Why this step exists: Raising uncertainty at forks increases the chance of flipping the next token choice, which reshapes the rest of the sentence. • Example: A near-invisible change causes 'dog' to become 'cat' at the fork, and later tokens follow the new context.
- Step D: Keep decoding consistent • What happens: Use the same decoding rule (like greedy) as in the clean run to isolate the effect of the image nudge. • Why this step exists: If you change many settings, you can’t tell what caused the difference. • Example: Same prompt, same max length, same decoding—only the image had a tiny tweak.
- Step E: Measure outcomes • What happens: Compare clean vs. nudged captions with: (1) semantic drift (CIDEr drop), (2) attack success, (3) harmful rate using a safety judge. Track how 'harmful mass' moves along later tokens. • Why this step exists: Numbers make differences meaningful; they show whether the change is large, safe/unsafe, and where it spreads. • Example: 'Harmful mass propagation' means unsafe-token probabilities rise not just at the fork but for several steps after. (A small measurement sketch also follows this list.)
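A minimal sketch of the Step B selection above: given one entropy value per generated token from a clean run, keep only the top ~20% most uncertain positions as forks. The 20% fraction follows the paper's reported sweet spot; everything else here is illustrative.

```python
import numpy as np

def fork_mask(entropies, fraction=0.2):
    """Mark the top `fraction` highest-entropy positions (the forks); leave the rest alone."""
    entropies = np.asarray(entropies)
    k = max(1, int(round(fraction * len(entropies))))
    forks = np.argsort(entropies)[-k:]            # indices of the k most uncertain steps
    mask = np.zeros(len(entropies), dtype=bool)
    mask[forks] = True
    return mask

per_token_entropy = [0.2, 1.6, 0.3, 0.1, 1.4, 0.2, 0.3, 0.2, 0.5, 0.1]
print(fork_mask(per_token_entropy))  # True only at positions 1 and 4
```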
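And for the Step E measurement, a minimal sketch of one way to track 'harmful mass': sum, at each generation step, the probability the model assigns to a flagged set of token ids. The paper's exact definition and flagged vocabulary are not reproduced; the logits and ids below are toy values.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def harmful_mass_per_step(step_logits, flagged_ids):
    """For each step, the total probability placed on flagged ('unsafe') token ids.
    A rise at and after a fork is the propagation effect described above."""
    probs = softmax(np.asarray(step_logits))        # shape: (steps, vocab)
    return probs[:, sorted(flagged_ids)].sum(axis=1)

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(6, 100))              # 6 steps, toy vocabulary of 100
print(harmful_mass_per_step(toy_logits, flagged_ids={7, 13, 42}))
```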
Secret sauce (what makes it clever):
- Focus over force: Instead of pushing every word, it taps the few words that matter most.
- Autoregressive leverage: Early flips make later flips easier.
- Reusable weak spots: Many models share similar forks, so a 'token bank' of vulnerable words helps when you can’t measure entropy inside a new model.
🍞 Anchor: It’s like steering a car by nudging the wheel only at turns (forks), not yanking it constantly. The road (autoregression) does the rest.
Two practical variants (concept level):
- Variant 1: Fork-focused uncertainty raise (white-box idea) • What: Use the model’s own signals to find high-entropy tokens, then concentrate small pixel nudges there. • Why: Directly targets the fragile steps that decide the route.
- Variant 2: EGA—Entropy-bank Guided Attacks (transfer idea) • What: Build a 'bank' of especially vulnerable tokens from one model (like common forks that flip easily). Later, on a different model, target positions where those bank tokens appear, even without recomputing everything. • Why: This gives partial transferability when you can’t peek inside the target model.
🍞 Anchor: Like marking the trickiest bends on one river map and using those same markings to guess the bends on a neighboring river.
04 Experiments & Results
🍞 Hook: Think of a science fair test. You try your idea on different pictures and models, and you use clear scorecards so results aren’t just guesses.
🥬 The Concept (What they tested): They checked whether focusing on the top ~20% forks: (1) derails captions more (CIDEr drop), (2) succeeds more often (attack success rate), (3) raises unsafe content (harmful rate), and (4) transfers to other models. How it works: 1) Use standard datasets (COCO for captions, TextVQA for visual questions). 2) Compare against strong baselines (PGD, VLA, COA, and MIE). 3) Keep the pixel change tiny (bounded) and decoding fixed. Why it matters: Fair, controlled tests show if the idea truly works and generalizes.
🍞 Anchor: It’s like timing runners on the same track with the same stopwatch.
The competition (baselines):
- PGD: Classic pixel attack.
- VLA and COA: Multimodal-specific recent attacks.
- MIE: Global entropy maximization (raises uncertainty everywhere).
EGA is the new method that focuses on forks using a token bank.
The scoreboard (with context):
- Captioning success: EGA hits around 93–95% attack success across Qwen, InternVL, and LLaVA—like scoring an A when others get A or B.
- Harmful rate in captions: EGA reaches about 42.5% (Qwen), 37.3% (InternVL), and 47.1% (LLaVA). That’s dramatically higher than MIE (~14%–23%), roughly two to three times the incidence.
- Semantic drift (CIDEr drop): EGA’s drift is very close to MIE (both large), showing it derails meaning just as much while making unsafe content far more likely.
- VQA (short answers): Even tiny answers can be steered—EGA induces harmful rates near 24.7%, 23.4%, and 28.6% across the same three models, roughly double MIE’s rates.
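The exact scoring in the paper relies on CIDEr for drift and a safety judge for harm; the sketch below only shows how such rates are aggregated once per-sample judgments exist, with a toy word-overlap similarity standing in for the real success criterion.

```python
def attack_success_rate(clean_captions, attacked_captions, similarity, threshold=0.5):
    """Count an attack as successful when the attacked caption's similarity to the
    clean caption falls below a threshold (toy criterion, not the paper's)."""
    hits = sum(similarity(c, a) < threshold for c, a in zip(clean_captions, attacked_captions))
    return hits / len(clean_captions)

def harmful_rate(judgments):
    """Fraction of outputs a safety judge labels harmful (judgments are booleans)."""
    return sum(judgments) / len(judgments)

overlap = lambda a, b: len(set(a.split()) & set(b.split())) / max(1, len(set(a.split())))
print(attack_success_rate(["a dog runs"], ["a cat sits"], overlap))  # 1.0
print(harmful_rate([True, False, False, True]))                      # 0.5
```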
Key findings and patterns:
- 20% suffices: Focusing on the top 20% high-entropy tokens produces more damage than attacking all tokens with the same budget. Attacking too few or too many tokens is less effective; the sweet spot sits at roughly 20%.
- Harmful propagation: Unsafe probability mass spikes at the targeted fork, then remains elevated for several steps afterward. That’s like a spark catching kindling down the line.
- Image vs. prefix: The adversarial image mainly triggers unsafe behavior at the forks. Even if you later remove the image, some unsafe drift continues because the already-changed prefix steers future tokens.
- Transferability: Attacks built on one model still cause sizable harm on others (about 17–26% harmful rates). This shows that many models share vulnerable fork tokens.
- LLaVA sensitivity: LLaVA, among the tested models, often shows the highest harmful rate under fork-focused attacks, even when its overall attack success is slightly lower—hinting at a particularly sensitive decision boundary at forks.
Surprising findings:
- Less is more: Pushing only a few tokens can outperform pushing all tokens.
- Shared forks: The top vulnerable tokens (with high flip rates) look similar across different models, explaining cross-model transfer.
- Persistence: Once the model locks into a drifted route, removing the image entirely doesn’t fully restore safety—the changed prefix carries momentum.
🍞 Anchor: Switching a train at a single junction can send it down a different branch—and once it’s there, it keeps going that way even if the switch is set back later.
05 Discussion & Limitations
🍞 Hook: If your house has a few creaky steps that make you trip, fixing the whole floor evenly won’t help—fix the steps that actually cause falls.
🥬 The Concept (Limits and lessons): This research reveals where the creaky steps are but also admits what it can’t yet do. How it works: 1) They use an automated safety judge combining rules + an LLM—useful but imperfect. 2) Tests are on standard datasets and digital pixel tweaks, not physical-world changes. 3) The method leans on autoregressive generation and may behave differently for other architectures. Why it matters: Knowing the limits steers future defenses and avoids over-claiming.
🍞 Anchor: It’s like a weather forecast—good guidance, not a guarantee.
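To show the shape of a rules-plus-LLM judge (not the paper's actual judge), here is a minimal sketch: a cheap rule stage handles obvious cases, and anything else is escalated to an LLM classifier, represented as an injected callable because no specific API is assumed.

```python
import re

# Toy rule stage; real blocklists and policies are far more extensive.
BLOCKLIST = re.compile(r"\b(weapon|explosive|attack instructions)\b", re.IGNORECASE)

def hybrid_judge(text, llm_judge=None):
    """Rules first, LLM second: returns 'harmful' or 'safe'."""
    if BLOCKLIST.search(text):
        return "harmful"
    if llm_judge is not None:
        return llm_judge(text)   # callable expected to return 'harmful' or 'safe'
    return "safe"

print(hybrid_judge("A dog plays with a ball in the park."))  # safe
```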
Limitations:
- Safety judging: A hybrid rule+LLM judge can disagree with humans on borderline cases. Larger, human-reviewed suites would refine accuracy.
- Dataset scope: Most results come from 1k-sample subsets; bigger, more varied sets (domains, languages) could reveal more.
- Attack scope: Pixel-space, digital-only. No physical attacks tested, and other attack types might behave differently.
- Decoding assumptions: Many tests use greedy decoding; alternative decoding can shift exact rates.
- Autoregression focus: The insight is tied to token-by-token generation; non-autoregressive designs may need different analysis.
Required resources:
- Models and compute for many forward passes to find forks and measure effects.
- For transfer-style EGA, a source model to build the token bank and standard datasets for evaluation.
- A safety judge pipeline to classify harmful content.
When not to use (for evaluation/researchers):
- Non-autoregressive or diffusion language models where 'token forks' don’t apply in the same way.
- Scenarios focused solely on benign accuracy where safety stress-testing is out of scope.
- Physical-world robustness questions (e.g., printed images) which require different methods.
Open questions:
- Defenses: How to stabilize high-entropy forks? Options include calibrated decoding, entropy-aware training, or local smoothing around decision points.
- Detection: Can we flag sequences entering high-entropy forks and pause, re-ask, or consult a safety filter? (A small monitoring sketch follows the anchor below.)
- Alignment: How to make the model refuse unsafe continuations specifically at forks without over-refusing safe content?
- Generality: Do similar 'few tokens matter' effects hold across multilingual data, videos, or non-autoregressive systems?
🍞 Anchor: Like adding a handrail exactly where people slip, not everywhere along the hallway.
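As a sketch of the detection idea flagged in the open questions above (the threshold and window size are illustrative assumptions, not values from the paper): if the last few decoding steps were all high-entropy forks, pause generation and hand off to a safety check.

```python
def should_pause(recent_entropies, threshold=2.5, window=3):
    """Return True when the last `window` next-token entropies (in bits, computed
    from the model's next-token distributions) all exceed the threshold."""
    recent = recent_entropies[-window:]
    return len(recent) == window and all(e >= threshold for e in recent)

print(should_pause([0.4, 2.9, 3.1, 2.7]))  # True  -> consult a safety filter / re-ask
print(should_pause([0.4, 0.3, 2.9, 0.5]))  # False -> keep generating
```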
06 Conclusion & Future Work
🍞 Hook: Ever change a bowling game by nudging just one pin? Sometimes one move makes all the difference.
🥬 The Concept (Three-sentence summary): This paper shows that only a small set of high-entropy tokens—forks where the model is most unsure—steer the whole caption or answer. By nudging the image to raise uncertainty exactly at those forks, attacks become more efficient, more likely to create unsafe content, and more transferable across models. The authors introduce EGA, a token-bank-guided approach that achieves high success and harmful rates while revealing a structural weakness in autoregressive generation.
Main achievement: Turning a broad, hard-to-aim problem into a focused one—proving that 'few tokens matter' and building a practical, transferable way (EGA) to target them for safety stress tests.
Future directions: Develop defenses that stabilize or gatekeep high-entropy forks (calibrated decoding, entropy-aware training), build better detectors for harmful propagation, and broaden testing to more tasks, languages, and architectures (including non-autoregressive models and physical settings).
Why remember this: It reframes robustness and safety for multimodal models around a few pivotal decisions instead of the whole sequence. If we learn to guard those forks, we can make AI that sees and speaks both smarter and safer.
🍞 Anchor: Guard the few slippery steps on the staircase, and most falls won’t happen.
Practical Applications
- •Safety auditing: Use high-entropy forks as checkpoints to stress-test models for unsafe drift (without deploying harmful outputs).
- •Defense design: Create decoding or training strategies that stabilize or downweight risky forks when safety signals are triggered.
- •Monitoring tools: Build dashboards that highlight when a generation hits multiple high-entropy forks in a row, prompting extra caution or review.
- •Model selection: Compare models by how concentrated their fork vulnerabilities are to choose safer deployments.
- •Curriculum training: Include fork-aware training data or objectives that reduce uncertainty at critical pivots.
- •Fallback planning: If a response approaches a fork with high uncertainty, route to a safer subsystem (e.g., rule-based summary).
- •Policy tuning: Strengthen safety filters specifically around fork positions where propagation is likely.
- •Evaluation benchmarks: Add fork-focused metrics (entropy maps, propagation measures) to standard captioning/VQA suites.
- •Incident analysis: When a bad output occurs, trace back which fork flipped to guide targeted remediation.
- •Transfer risk assessment: Use a token bank to forecast cross-model vulnerability and prioritize patching shared weak points.