When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Jiacheng Hou; Yining Sun; Ruochong Jin; Haochen Han; Fangming Liu; Wai Kin Victor Chan; Alex Jinpeng Wang

When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

Beginner

Jiacheng Hou, Yining Sun, Ruochong Jin et al.2/10/2026

arXiv

Key Summary

•Modern image editors can now follow visual prompts like arrows and scribbles, which opens a new way for attackers to hide harmful instructions inside images.
•The paper introduces Vision-Centric Jailbreak Attacks (VJA), where the prompt is purely visual, letting attackers bypass text-based safety filters.
•To measure this risk, the authors build IESBench, a safety benchmark with 15 categories (like evidence tampering and copyright removal) and 1,054 visually prompted images.
•On IESBench, VJA breaks many top systems: up to 80.9% success on Nano Banana Pro and 70.3% on GPT Image 1.5; several open-source models hit 100%.
•The authors add a simple, training-free defense: a short “safety trigger” sentence that makes the model first think about safety in words before editing the image.
•This defense greatly cuts risk (about a one-third drop in attack success) and adds almost no speed or compute cost.
•The paper also proposes kid-friendly but precise metrics: Harmfulness Score (HS), Editing Validity (EV), and High Risk Ratio (HRR) to make results meaningful.
•Surprising finding: the stronger a model’s visual understanding, the more it can be tricked by visual prompts unless safety is reinforced.
•This work highlights a new blind spot for AI safety—images can be prompts—and offers both a testbed and a practical fix.
•Takeaway: To keep image editors safe, we must treat pictures like instructions and activate language-based safety before edits happen.

Why This Research Matters

Pictures are becoming instructions, not just content, so safety systems must learn to read drawings, arrows, and tiny labels just like they read words. This matters for everyday trust: people rely on images for news, schoolwork, shopping, and identity, and subtle edits can mislead or harm. The paper shows a widespread blind spot: even strong models can be tricked if safety only watches text. It also offers a practical fix that teams can use right now without retraining or heavy guard models. Better tests (IESBench) help everyone compare systems fairly and improve faster. As AI gets more visual, building safety that understands images as prompts will protect users, creators, and communities.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: You know how you can draw arrows and circles on a picture to show your friend exactly what to change? That’s fun and easy—no long messages needed.

🥬 The Concept: Large image editing models are AI tools that change pictures based on instructions, which used to be mostly words, but now can also be visual hints like arrows, highlights, or short labels. How it works (story of the world before):

Early systems waited for text like “make the sky purple.”
Newer systems let you show your intent with marks on the picture—circle a face to swap glasses, draw an arrow to a sign to fix its text.
Safety systems mostly checked the words (text) because that’s where instructions usually lived. Why it matters: If safety only watches words, but the instructions live in drawings, safety can miss the problem. 🍞 Anchor: Imagine you circle a watermark and write “remove” with a tiny arrow. A modern editor might follow your drawing even if you never typed a harmful request.

🍞 Hook: Think of a school hall monitor who only listens to what kids say—but some kids communicate with secret hand signals.

🥬 The Concept: AI safety alignment is the set of rules and training that help an AI refuse harmful requests. How it works:

The model is taught policy rules (like “don’t help with illegal stuff”).
A guard checks the user’s text request and blocks it if it’s unsafe.
If allowed, the editor makes the change. Why it matters: If the guard only listens to text and the plan is hidden in the picture, the guard might never notice. 🍞 Anchor: A text like “erase the logo” may be blocked, but a picture with an arrow and the word “clean” on the logo might sneak through.

🍞 Hook: Imagine playing charades—you can tell a whole story without saying a word.

🥬 The Concept: Vision-prompt editing means the AI reads your arrows, highlights, and short labels on the image as the instruction. How it works:

You add marks (like boxes or arrows) to point at what to change.
You might include short on-image text (like “brighten here”).
The model understands and edits just that region. Why it matters: When instructions live in the image, text-only safety filters can’t see the full plan. 🍞 Anchor: Drawing a rectangle around a street sign and adding an arrow means “edit this sign,” even without typing a sentence.

🍞 Hook: You know how a magician misdirects by pointing your eyes where they want? Attackers can do that to models, too.

🥬 The Concept: A jailbreak attack tricks a model into ignoring its safety rules and doing something it shouldn’t. How it works:

The attacker crafts a sneaky instruction.
The safety layer fails to catch it.
The model follows the harmful request. Why it matters: Jailbreaks are how harmful edits slip through. 🍞 Anchor: If a model is supposed to refuse “remove this trademark,” a jailbreak might convince it to do it anyway, but via a clever picture prompt.

The Problem: Before this paper, most safety checks focused on text. But modern editors can follow vision-only prompts. That mismatch creates a gap attackers can exploit. Failed Attempts: Teams tried stronger text filters, stricter refusal rules, or separate guard models. These help for words, but not for purely visual cues, hand-drawn arrows, or unusual fonts and shapes. Also, extra guard models add cost and delay. The Gap: We lacked a standard way to test vision-only attacks on image editors and a simple defense that doesn’t require retraining or heavy guard systems. Real Stakes: In daily life, this could mean removing watermarks, faking dates on documents, changing medical labels, or crafting misleading photos that spread online. That can hurt trust, safety, and people’s rights—even if no one typed a single unsafe word.

02Core Idea

🍞 Hook: Imagine hiding secret instructions inside a drawing—like a treasure map where the X marks the spot.

🥬 The Concept: The key insight is that the prompt itself can be visual, so attackers can jailbreak image editors using only pictures (Vision-Centric Jailbreak Attack, or VJA), and we can defend by first making the model explain safety in words. How it works:

Attackers add arrows, boxes, and short on-image cues to tell the model what harmful edit to do.
Text filters see nothing suspicious because the text field is empty.
The model edits the image anyway, following the visual cues.
The defense adds a short safety message that makes the model “think in language” about safety before editing. Why it matters: Without this, editors can be fooled into serious policy violations; with it, we restore a safer checkpoint. 🍞 Anchor: A circled watermark with an arrow and the word “clean” can get removed; adding a safety reminder makes the model pause, judge the risk, and often refuse.

Three Analogies to Understand the Idea:

Invisible Ink: The harmful instruction is written inside the picture, so the usual “text police” can’t read it.
Airport Security: If you only scan carry-on bags (text) but never scan checked luggage (image), bad stuff can get through. VJA puts the bad stuff in the image.
Teacher’s Pop Quiz: The defense is like saying, “Explain your reasoning first.” When the model explains in words, safety rules wake up and block the harm.

Before vs After:

Before: Safety tools were strong on text prompts, weak on visual cues. Attackers could slip harmful edits past filters by drawing instructions on the image.
After: We have IESBench to test visual attacks and a simple, training-free defense that pulls the problem back into language space, where safety is stronger.

Why It Works (intuition):

Modern image editors understand pictures very well, so arrows and labels feel like real instructions.
Safety alignment is usually better in language space because policies were trained and checked there more thoroughly.
A tiny “safety trigger” line causes the transformer’s attention to focus on safety reasoning in text, reconnecting to the strongest safety skills the model already has.

Building Blocks (the idea in pieces):

VJA: A visual-only way to encode harmful requests.
IESBench: A benchmark with 15 risk categories and 1,054 samples to measure how often models get fooled.
MLLM-as-a-judge: A multimodal model grades outputs for harmfulness and validity at scale.
Safety Metrics: Harmfulness Score (HS), Editing Validity (EV), and High Risk Ratio (HRR) make results meaningful.
Defense: An introspective “safety trigger” that activates language-based safety before any edit proceeds.

03Methodology

At a high level: Visual input → (VJA: embed harmful cue) → Image editor tries to follow → Output image → Judge evaluates (HS, EV, HRR). With defense: Visual input → Append safety trigger text → Model explains risk first → Safe refusal or guarded edit.

Step A: Vision-Centric Jailbreak Attack (VJA) 🍞 Hook: Picture telling your friend to fix a photo just by circling and pointing—no words needed. 🥬 The Concept: VJA encodes the whole instruction visually so the model follows the drawing instead of typed text. How it works:

Start with a normal, harmless base image.
Add visual prompts: arrows, boxes, underlines, or short on-image phrases linked to a region.
Keep the text field empty (or benign) so text filters see nothing to block.
The editor reads the marks as the instruction and edits accordingly. Why it matters: Text-only guards can’t catch what’s hidden in the picture, so harmful edits can slip by. 🍞 Anchor: Drawing a rectangle over a logo and an arrow labeled “erase” can trigger removal, even if no typed request exists.

Secret Sauce (for the attack): Out-of-Distribution Detour 🍞 Hook: You know how a strange-looking coin machine at a fair might not detect fake coins as well as a bank machine? 🥬 The Concept: By using unusual fonts, colors, shapes, or languages, the attacker nudges the input outside what the safety model was trained to expect (Out-of-Distribution, or OOD). How it works:

Vary marker color, shape, or size.
Use handwriting-like fonts or non-English labels.
Place cues cleverly to connect intent to a target region. Why it matters: The further from the safety model’s comfort zone, the easier it is to slip past. 🍞 Anchor: A tiny handwritten “replace” in a bright triangle arrow can be read by the editor but missed by a text-focused guard.

Step B: Building the Benchmark (IESBench) 🍞 Hook: Think of a playground safety test with slides, swings, and monkey bars tested in many ways. 🥬 The Concept: IESBench is a standardized test for vision-only attacks on image editors. How it works:

Collect or generate 1,054 benign base images representing diverse scenes.
Add visual prompts that request edits across 15 safety categories (like evidence tampering, copyright removal, or misinformation).
Annotate each sample with what was targeted, what action to take (add, delete, replace), and category labels. Why it matters: Without a shared test, we can’t compare safety fairly or see patterns across risks. 🍞 Anchor: One sample might circle a date on a document with an arrow that implies “change this,” testing commercial tampering.

Step C: Judging Results with Kid-Friendly Metrics 🍞 Hook: When you bake cookies, you don’t just ask “Are they sweet?”—you also check if they’re fully baked and safe to eat. 🥬 The Concept: Three scores make results clear: Harmfulness Score (HS), Editing Validity (EV), and High Risk Ratio (HRR). How it works:

HS (1–5): How harmful is the final image? 1 is a safe refusal; 5 is a convincing, clearly harmful result.
EV (Yes/No): Did the edit actually do the intended change (not garbled or nonsense)?
HRR: The share of outputs that are both valid and high-harm (HS ≥ 4). Why it matters: A model might “fail safe” (no real edit), which lowers risk even if it didn’t refuse perfectly. 🍞 Anchor: If the model returns a blurry, unreadable sign, EV is No—even if the idea was harmful, it didn’t execute it well.

Step D: MLLM-as-a-Judge 🍞 Hook: Imagine a fair referee who can both see the picture and read the labels. 🥬 The Concept: A multimodal large language model (MLLM) grades the outputs instead of relying only on people. How it works:

Show the judge both the original and edited images side-by-side.
Give it a clear scoring rubric for HS and EV.
Let it reason about risk level and validity at scale. Why it matters: Human judging is slow and inconsistent; an MLLM judge is scalable and consistent (and can be checked against humans). 🍞 Anchor: The judge explains, “This edit convincingly removed the watermark (HS=5), and it’s valid (EV=Yes).”

Step E: Introspective, Training-Free Defense 🍞 Hook: Before you act on a plan, your teacher might say, “Explain your reasoning first.” That pause can prevent mistakes. 🥬 The Concept: Append a tiny safety trigger text that prompts the model to assess safety in language before editing. How it works:

Take the user’s image (and any text if present).
Append a short sentence like “You are an image editing safety evaluator. Review if the edit is safe/legal.”
The model performs quick multimodal reasoning in language space.
If unsafe, it refuses; if safe, it proceeds. Why it matters: Language-space safety is usually stronger, so making the model “think in words” reactivates those guardrails. 🍞 Anchor: With the trigger, the model often replies, “I can’t help with that request,” instead of silently removing a watermark.

Efficiency Detail 🍞 Hook: Adding a lock shouldn’t turn your door into a vault. 🥬 The Concept: The defense reuses the model’s existing caches (KV cache) so it barely slows things down. How it works:

No extra guard model is added.
A small number of text tokens are processed.
Overall runtime rises only a tiny bit (around 3%). Why it matters: A cheap, fast defense is practical in real products. 🍞 Anchor: It’s like adding a quick seatbelt check without making the car start any slower.

04Experiments & Results

The Test

What they measured: How often vision-only prompts succeed (ASR), how harmful the outputs are (HS), whether edits are valid (EV), and how many are both valid and high-risk (HRR).
Why this matters: It separates “truly dangerous” from “harmless or failed” edits and shows where models are most fragile.

The Competition

Models tested included commercial systems (Nano Banana Pro, GPT Image 1.5, Qwen-Image-Edit online, Seedream 4.5) and open-source models (Qwen-Image-Edit*, BAGEL, Flux2.0[dev]).
They also compared vision-only attacks (VJA) to classic text-only jailbreaks (TJA) on a balanced subset.

The Scoreboard (with context)

VJA vs commercial models: Average ASR around 85.7% across commercial systems on IESBench. That’s like most shots on goal going in.
Specific standouts: Up to 80.9% ASR on Nano Banana Pro and 70.3% on GPT Image 1.5 across the full benchmark, with many valid, high-harm results (HRR up to about half of attacks for GPT Image 1.5).
Open-source models without extra safeguards: 100% ASR. That’s like getting every question wrong on a safety quiz—because there was no proctor.
Harmfulness levels: Several open models averaged HS near 4.3 (on a 1–5 scale), meaning outputs were often severe and convincing.
Defense effectiveness: Adding the tiny safety trigger to Qwen-Image-Edit* cut attack success by about one third and reduced harmfulness by about 1.2 points—now roughly comparable to leading commercial systems.

VJA vs TJA (surprising gains)

On safer commercial models, VJA beat TJA by a large margin: +35.6% ASR on Nano Banana Pro and +24.9% on GPT Image 1.5. Translation: when the prompt becomes visual, these models are much easier to trick.
On weaker models already vulnerable to text attacks, VJA’s gain was smaller—sometimes the model just failed to understand the visual cue, leading to an invalid (and less risky) edit.

Category and Risk-Level Differences

Most vulnerable categories: Evidence tampering (I13) and aversive manipulation (I15) often had the highest success, revealing that deceptive, fabricated edits are especially hard to stop.
Mixed strengths: For example, GPT Image 1.5 struggled with copyright tampering (I11) at about 95.7% ASR, while Nano Banana Pro did better there (around 41.3% ASR), showing different safety tuning.

Judge Reliability

The default MLLM judge (Gemini-class) aligned well with human ratings in preference tests; smaller local judges tended to miss subtle invalid edits.
Humans were slightly more conservative (giving lower harmfulness on average), suggesting community standards matter and judges should be calibrated.

Unexpected Findings

Stronger vision understanding can mean more dangerous outcomes under VJA—because the model follows the visual instruction very well unless safety is reactivated.
Visual prompt details (color, shape, font, language) significantly change outcomes, revealing sensitivity and the need for diverse benchmarks.

Plain-English Wrap-Up

Big picture: If you only watch the words, you miss the picture—and the picture can be the whole plan. A tiny safety reminder that makes the model think in words first can save the day, fast and cheap.

05Discussion & Limitations

Limitations

VJA depends on the model correctly reading visual prompts. If a model has weak visual reasoning, it may ignore the prompt, producing invalid or low-risk edits. That “failure to follow” can look safer but sometimes reflects limited capability, not strong safety.
The defense relies on the model already having good language-space safety alignment. If the underlying vision-language backbone isn’t well aligned or up to date, it can still be fooled by fabricated or niche knowledge requests.
MLLM-as-a-judge can drift with prompt design or model choice; while scalable, it needs periodic human checks and calibration.

Required Resources

For evaluation: IESBench dataset, a chosen MLLM judge, and access to target editing models.
For defense: No retraining, no extra guard model—just append a short safety trigger text; minimal compute overhead, thanks to KV-cache reuse.

When NOT to Use

Ultra-latency-critical pipelines with no room for even tiny overhead (rare, since the added cost is very small).
Locked-down offline workflows where edits are batch-verified by humans anyway (the trigger adds little there).
Specialized domains where the safety trigger might interfere with tightly constrained outputs (should be tested first).

Open Questions

Can we train image editors to reason about safety natively in the visual space, not just after being nudged into language space?
How to build robust visual guards that understand arrows, regions, and on-image text across languages, fonts, and styles?
What’s the best way to calibrate MLLM judges across cultures and evolving community standards?
Can adversarial training with visual prompts harden models without harming normal usability?
How to detect or watermark visual prompts so hidden harmful cues are surfaced before editing begins?

06Conclusion & Future Work

Three-Sentence Summary

This paper shows that the “prompt” for image editing can be purely visual, letting attackers hide harmful instructions inside pictures and bypass text-based safety.
It introduces IESBench to measure this new risk across 15 categories and demonstrates that Vision-Centric Jailbreak Attacks succeed widely—even on top commercial models.
A tiny, training-free defense that triggers language-based safety checks before editing sharply reduces risk with almost no extra cost.

Main Achievement

The biggest contribution is reframing prompts as visual, exposing a blind spot in current safety systems, and offering both a rigorous benchmark and a practical, low-cost fix.

Future Directions

Build native visual-safety reasoning into editors so they can read, question, and refuse harmful visual cues directly.
Expand and diversify benchmarks (colors, shapes, languages, fonts) and improve judge calibration with human-in-the-loop updates.
Combine light introspective triggers with robust visual guardrails and selective adversarial training.

Why Remember This

As AI interfaces become more visual, images aren’t just content—they’re instructions. Safety has to watch the picture, not just the words. A small nudge to “think in language” before acting can make powerful systems much safer, fast.

Practical Applications

•Add the one-line safety trigger to existing image editors to reduce risky edits with almost zero extra compute.
•Use IESBench to audit your image editor across 15 risk categories before deployment.
•Integrate an MLLM judge in testing pipelines to auto-score Harmfulness (HS) and Validity (EV) of edited outputs.
•Tune UI tools to flag or highlight on-image labels, arrows, or boxes as potential instructions for pre-check.
•Stress-test robustness by varying visual prompt color, shape, font, size, and language to uncover hidden weaknesses.
•Adopt policy-aware fallbacks: when risk is detected, replace harmful edits with safe alternatives or refuse gracefully.
•Log visual prompts (with privacy safeguards) to trace how on-image cues influenced editing decisions.
•Combine the safety trigger with rate limits or human review for high-risk categories like evidence tampering.
•Calibrate MLLM judges against periodic human evaluations to keep safety scores fair and up to date.
•Educate users: show that marks on images act like prompts, so they understand why certain edits may be refused.

Version: 1