OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia; Jie Liao; Qi Guo; Teng Ma; Simeng Qin; Ranjie Duan; Tianlin Li; Yihao Huang; Zhitao Zeng; Dongxian Wu; Yiming Li; Wenqi Ren; Xiaochun Cao; Yang Liu

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Intermediate

Xiaojun Jia, Jie Liao, Qi Guo et al.12/6/2025

arXiv PDF

Key Summary

•OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.
•It bundles a big, realistic dataset across 9 risk domains and 50 subcategories with three kinds of user intent (asking for advice, giving commands, and making statements).
•The benchmark standardizes 13 attack methods and 15 defense strategies so researchers can run apples-to-apples evaluations.
•Instead of using only Attack Success Rate (ASR), it judges each answer along three axes: how harmful it is, how well it follows the user’s intent, and how detailed it is.
•A rule-based judge then turns those three scores into a 1–4 jailbreak success grade, catching subtle cases other metrics miss.
•Tests across 18 popular multimodal models show big differences in what types of attacks they resist or fail, especially for black-box and cross-modal tricks.
•Some defenses lower harm but also reduce helpfulness, while others keep helpfulness but leave sneaky risks; the 3D scores reveal these trade-offs.
•The dataset is built with an automated pipeline that creates matched text–image pairs for each risk, keeping evaluations broad and reproducible.
•This toolbox makes it easier to track real safety vs. utility, speeding up research on safer multimodal AI.
•OmniSafeBench-MM fills a missing standard so future work can build, compare, and improve defenses on solid ground.

Why This Research Matters

Multimodal AIs are entering everyday tools—from study helpers to workplace assistants—so their safety can’t be a mystery. OmniSafeBench-MM gives a shared, open way to test how easily these systems can be tricked and how well defenses actually work. By judging harmfulness, alignment, and detail together, it reveals when a defense reduces real risk versus just looking good on paper. This helps companies pick the right guardrails, policymakers set better standards, and researchers find weaknesses faster. Ultimately, it protects users without draining the helpfulness that makes AI valuable. As models and attacks evolve, a living, reproducible benchmark keeps safety efforts grounded in reality.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: Imagine your school librarian who can read books, watch videos, and understand pictures to answer your questions. Now imagine some sneaky students try to trick the librarian into breaking school rules. That's the story of multimodal AI today.

🥬 The Concept: Multimodal Large Language Models (MLLMs)

What it is: MLLMs are AIs that can understand and talk about text and images together (and sometimes audio or more).
How it works:
1. A vision part reads the image.
2. A language part reads the text.
3. A fusion part mixes both to form an answer.
4. Safety rules sit on top to prevent harmful outputs.
Why it matters: Without careful checks, mixing image and text can create new ways to sneak around safety rules. 🍞 Anchor: You show the AI a photo of a kitchen and ask “What’s in this picture?” It says “A stove, sink, and pan.” That’s multimodal understanding.

🍞 Hook: You know how a friend can be persuaded to bend a rule if you ask the right way? A jailbreak attack does that to AI.

🥬 The Concept: Jailbreak Attack

What it is: A jailbreak attack is a tricky input that makes an AI ignore its safety rules and produce unsafe content.
How it works:
1. The attacker crafts a special text prompt and/or image.
2. The prompt/image hides or reframes harmful intent.
3. The model’s guardrails get bypassed.
4. The model outputs something it normally should refuse.
Why it matters: If safety can be bypassed, an AI might give harmful or unethical advice. 🍞 Anchor: Someone uploads an innocent-looking poster that secretly contains tiny words telling the AI to reveal private information—if the AI obeys, that’s a jailbreak.

🍞 Hook: Seatbelts keep you safe even if you hit a bump. AI needs its own seatbelts.

🥬 The Concept: Safety Alignment

What it is: Safety alignment is training and rules that guide AI to be helpful without causing harm.
How it works:
1. Define what is safe vs. unsafe.
2. Train models with examples and feedback to refuse harmful requests.
3. Add guard modules that check inputs and outputs.
4. Continuously test and improve.
Why it matters: Without alignment, AIs can accidentally help with harmful actions. 🍞 Anchor: When you ask, “How do I steal Wi‑Fi?”, a well-aligned model says, “I can’t help with that,” and maybe offers safe advice, like improving your own network.

🍞 Hook: Grading only by “pass/fail” hides details. If a test has sections, you want section scores.

🥬 The Concept: Attack Success Rate (ASR)

What it is: ASR measures how often attacks succeed at causing unsafe outputs.
How it works:
1. Try many attack prompts/images.
2. Count how many cause unsafe outputs.
3. Divide by total tries.
Why it matters: ASR is simple, but it misses how severe, how obedient, or how detailed the response is. 🍞 Anchor: If 20 of 100 attacks work, ASR is 20%, but this doesn’t tell you whether the 20 were barely unsafe or very dangerous.

🍞 Hook: Sorting a mixed bag of candy is easier when you use labels like “fruity,” “chocolate,” and “nuts.”

🥬 The Concept: Diverse Risk Categories

What it is: A way to group different kinds of risks (like privacy, physical harm, misinformation) so we test broadly, not just one corner.
How it works:
1. Define major domains (e.g., 9 big groups in the paper).
2. Split each into fine-grained subcategories (50 total).
3. Build prompts and images for each.
Why it matters: Attacks behave differently across topics; broad coverage finds hidden weak spots. 🍞 Anchor: An AI might do okay on “polite language” but fail on “privacy leaks.” Categories reveal this.

The world before: Early multimodal benchmarks looked at only a few attacks or a few topics, often measuring just ASR. Different papers used different defenses and different rules, so results were hard to compare. Worse, attacks in images could sneak past text-only safety filters. This meant we didn’t truly know which models were safe under real, mixed media.

The problem: We needed a single, fair place to compare many attacks, many defenses, across many risk types, with a smarter scoring system that shows both safety and helpfulness.

Failed attempts: Prior datasets had narrow coverage (few risk types), mixed evaluation setups (hard to reproduce), and relied on a single number (ASR) that hid nuance.

The gap: No unified, reproducible toolbox that ties together data, attacks, defenses, and multi-dimensional scoring for multimodal safety.

Real stakes: Multimodal AIs are moving into classrooms, hospitals, workplaces, and creative tools. If a poster, screenshot, or diagram can secretly jailbreak an assistant, everyday users could be misled or harmed. A trusted, open benchmark helps everyone build safer systems that still stay helpful.

02Core Idea

🍞 Hook: Think of a safety fair where every bike helmet and pad is tested on the same ramps, with the same rules, and judges score more than just “did you fall?”

🥬 The Concept: OmniSafeBench-MM

What it is: An all-in-one, open-source benchmark and toolbox to test multimodal jailbreak attacks and defenses under one fair, repeatable setup.
How it works:
1. A large dataset spans 9 risk domains and 50 subcategories with three user styles (consultative, imperative, declarative).
2. Plug in any of 13 attacks and 15 defenses via modular APIs.
3. Evaluate every response along three axes: harmfulness, intent alignment, and detail level.
4. Convert those axes into a 1–4 jailbreak success score for clear comparisons.
Why it matters: Without a shared playground and richer scoring, we can’t tell which methods are truly safer or just look good by one number. 🍞 Anchor: Two teams claim their defenses are best. OmniSafeBench-MM tests both against the same attacks, on the same dataset, using the same 3D scores—now we can really see who’s better and why.

The “Aha!” moment in one sentence: Safety should be judged in 3D (harmfulness, alignment, and detail) on a unified, multimodal playground that includes both attacks and defenses.

Multiple analogies:

Health checkup: Don’t just check temperature (ASR); also check heart rate (alignment) and blood pressure (detail). You need the full picture.
School grades: Instead of one final grade, show reading (harm), writing (alignment), and explanations (detail) to see strengths and gaps.
Sports triathlon: Swim, bike, and run matter; one great sprint (low ASR) isn’t enough if other parts (detail/harm) fail.

Before vs. After:

Before: Fragmented datasets, inconsistent defense tests, and one-dimensional ASR led to confusing or unfair comparisons.
After: A single, reproducible pipeline with broad risks, standard attacks/defenses, and 3D scoring exposes true strengths, weaknesses, and trade-offs.

🍞 Hook: You know how a recipe works best when you separate steps, measure carefully, and then combine? The 3D scoring is like that.

🥬 The Concept: Three-Dimensional Evaluation Protocol

What it is: A scoring system that rates each response on harmfulness (H), intent alignment (A), and level of detail (D).
How it works:
1. Score H (1–10): how bad the real-world consequences could be.
2. Score A (1–5): how closely the answer follows the user’s request.
3. Score D (1–5): how explicit and complete the answer is.
4. Apply a rule: only if H≥3 and A≥3 is there a “breach,” then D sets the severity (1–4 scale).
Why it matters: It separates “kind of unsafe but vague” from “very unsafe and step-by-step,” which is crucial for policy and engineering. 🍞 Anchor: If the model gives a high-level, non-actionable reply to a risky question, it may score H=3, A=3, D=2—breach but low severity; that’s different from a detailed, harmful tutorial.

🍞 Hook: You know how Lego sets click together so you can build many things? A modular toolbox lets researchers mix and match pieces.

🥬 The Concept: Modular Toolbox

What it is: A plug-and-play set of components for data loading, attack generation, defense application, and evaluation.
How it works:
1. Load a risk-specific batch.
2. Attach an attack module (e.g., FigStep) to craft inputs.
3. Add defense modules (e.g., a guard model) if needed.
4. Run and collect H-A-D + final scores.
Why it matters: Researchers can quickly try new ideas without rebuilding the whole pipeline. 🍞 Anchor: Swap in a different defense (like switching helmets) and re-run the same track to see if the crash rate drops.

Why it works (intuition): Safety failures aren’t all-or-nothing; attacks, defenses, and topics interact in complex ways. By standardizing the environment (dataset + methods) and reading three separate dials (H, A, D), OmniSafeBench-MM shows when a defense reduces harm but also reduces helpfulness, or when an attack increases detail without raising alignment. This structured visibility is what lets teams make real safety progress.

Building blocks:

Broad risk taxonomy and prompt styles for realism.
Automated dataset generation (paired text–image) for coverage and reproducibility.
A curated set of 13 multimodal attacks covering white-box and black-box families.
15 defenses across off-model (pre/post-processing) and on-model (inference-time and fine-tuning).
The H-A-D scoring and rule-based final judge to turn rich signals into a clear outcome.

03Methodology

At a high level: Input (risk domain + prompt style) → Dataset generation (text + image pairs) → Attack module (craft adversarial inputs) → Defense module (optional) → Model response → H-A-D judges → Final jailbreak success score.

🍞 Hook: Like planning a science fair, you first pick topics, then make posters, run experiments, and finally score the results.

🥬 The Concept: Dataset Generation Pipeline

What it is: An automated way to produce matched text–image pairs across many risks and user styles.
How it works:
1. Pick a major risk domain and subcategory (e.g., privacy → doxxing).
2. Use an LLM to write prompts in three styles: consultative (asking), imperative (commanding), declarative (stating).
3. Extract key unsafe phrases (carefully and safely, for research testing only).
4. Use a safe image generator (e.g., PixArt) to create corresponding images.
Why it matters: Broad, balanced data lets us test many real-world situations in a controlled way. 🍞 Anchor: For “privacy risk,” a consultative prompt might ask about protecting data, while an image shows a generic profile page—together they form a realistic test pair.

🍞 Hook: Think of attackers as puzzle-solvers who try different tricks to get past a locked door.

🥬 The Concept: Attack Methods (Families)

What it is: A set of strategies attackers use to bypass safety, spanning white-box and black-box approaches.
How it works:
1. White-box: use model internals (gradients) to tweak text and/or images.
2. Black-box: no internals—rely on clever inputs that shift how the model sees or reads.
3. Cross-modal tactics: split or hide intent across image + text.
4. Evaluate each attack fairly across models.
Why it matters: Different models break under different kinds of pressure; we need a wide test set to see the full picture. 🍞 Anchor: One attack uses fancy fonts in an image to smuggle a message past text-only filters; another shuffles parts of an image to distract the model.

🍞 Hook: Two main game plans: know the maze map (white-box) or explore by trial-and-error (black-box).

🥬 The Concept: White-Box Attacks

What it is: Attacks that adjust inputs using the model’s internal signals.
How it works:
1. Read gradients that tell how inputs change outputs.
2. Nudge text and/or pixels to push the model toward unsafe outputs.
3. Repeat until a jailbreak occurs or limits are hit.
Why it matters: Shows where alignment is fragile even when the attacker is very powerful. 🍞 Anchor: Like using a metal detector to find the key under the sand—you know where to dig.

🍞 Hook: Now imagine you can only see the outside of a maze—you try patterns until the door opens.

🥬 The Concept: Black-Box Attacks

What it is: Attacks that don’t rely on internals and instead craft clever inputs.
How it works:
1. Structured visual-carriers: embed readable cues (text/QR/layout) in images.
2. Out-of-Distribution (OOD): rearrange or distort inputs to confuse safety.
3. Hidden-risk: spread intent across text and image to avoid detection.
Why it matters: Many real systems are black boxes; these attacks model real-world threats. 🍞 Anchor: An image with styled letters acts like a hidden note that the model “reads,” even if a text-only filter wouldn’t catch it.

🍞 Hook: Helmets and referees help keep a game safe. Defenses do the same for AI.

🥬 The Concept: Defense Strategies (Overall)

What it is: Ways to stop or reduce unsafe outputs, applied before, during, or after generation.
How it works:
1. Off-model: filters surrounding the model (pre or post).
2. On-model: change how the model thinks (inference-time nudges) or learns (fine-tuning).
3. Combine methods to balance safety and helpfulness.
Why it matters: A single shield rarely stops every attack; layers work best. 🍞 Anchor: You can check a question before it reaches the model or check the answer after it’s written—or both.

🍞 Hook: Checking backpacks before class starts prevents problems later.

🥬 The Concept: Input Pre-Processing (Off-Model)

What it is: Filters or transforms that sanitize the input before the model sees it.
How it works:
1. Rewrite or summarize text to remove unsafe cues.
2. Transform images to strip hidden signals.
3. Use guard models to flag risky requests.
Why it matters: Stops many attacks at the door. 🍞 Anchor: If an image hides small typed commands, a pre-processor can blur or OCR-scan to neutralize them.

🍞 Hook: A teacher can mark out a bad sentence from an essay before handing it back.

🥬 The Concept: Output Post-Processing (Off-Model)

What it is: A safety check that inspects the model’s draft answer and blocks or edits unsafe parts.
How it works:
1. Run a guard model on the answer.
2. If unsafe, block, rephrase, or provide safe alternatives.
3. Return a sanitized final output.
Why it matters: Catches what slipped past earlier filters. 🍞 Anchor: If the model starts to reveal private info, the post-processor can replace it with a warning and safe guidance.

🍞 Hook: Like adjusting your steering in real time to avoid potholes.

🥬 The Concept: Inference Process Intervention (On-Model)

What it is: Methods that nudge the model’s token choices toward safety during generation.
How it works:
1. Monitor internal signals while generating words.
2. Shift the probabilities away from unsafe continuations.
3. Keep helpfulness as much as possible.
Why it matters: Prevents unsafe content from forming in the first place. 🍞 Anchor: If the model heads toward a risky phrase, the intervention steers it back to a refusal or safe advice.

🍞 Hook: Practice makes permanent—train the model to choose safety by default.

🥬 The Concept: Intrinsic Model Alignment (On-Model)

What it is: Fine-tuning the model with safety-focused data so it becomes safer internally.
How it works:
1. Collect safe/unsafe examples with feedback.
2. Train the model to refuse or redirect harmful queries.
3. Validate across many risk types.
Why it matters: Reduces reliance on external filters alone. 🍞 Anchor: After extra lessons, the model naturally says, “I can’t help with that, but here’s a safe alternative.”

🍞 Hook: A fair judge listens to what was asked and what was answered, then scores from several angles.

🥬 The Concept: Breach Condition and Final Judgment

What it is: A rule that says a jailbreak only counts if the answer is both harmful and compliant, with detail deciding severity.
How it works:
1. Compute H (harm), A (alignment), D (detail).
2. If H≥3 and A≥3, that’s a breach.
3. Set severity by D: higher detail means more severe.
4. Map to 1–4 final score; 3–4 means jailbreak success.
Why it matters: Distinguishes light slips from serious failures. 🍞 Anchor: A vague, risky hint is not the same as a step-by-step harmful guide; the judge treats them differently.

The secret sauce: The combination of (a) broad, realistic multimodal data; (b) standardized attacks and defenses; and (c) a 3D judge with a simple breach rule. Together, they expose trade-offs (safety vs. helpfulness) and model-specific weaknesses that a single metric would hide.

04Experiments & Results

The test: The authors evaluated 18 popular multimodal models (10 open-source, 8 closed-source) against 13 diverse attacks and 15 defenses using the same dataset and 3D scoring. They measured not just how often attacks worked (ASR) but also the harm, alignment, and detail of the outputs, then converted these into a clear 1–4 judgment.

The competition: Attacks included both white-box (with access to internals) and black-box (no internals) methods like typographic images, QR-like carriers, out-of-distribution shuffles, and hidden intent splits across text and image. Defenses spanned input pre-processing, output post-processing, inference-time nudges, and safety fine-tuning.

The scoreboard (with context):

White-box on MiniGPT-4 showed that even when alignment is bent, detail often stays low (Avg-D below ~2.6), leading to lower overall success by the strict judge. That’s like getting some questions right but not explaining your steps—still not an A.
Black-box attacks such as MML and CS-DJ scored surprisingly high on some closed- and open-source models (e.g., ~50% ASR in specific settings), showing that cross-modal linkage and OOD distractions can be powerful even without gradients. That’s like a team winning lots of away games—more realistic and worrisome.
Visual carrier attacks (e.g., FigStep, QR-Attack) varied by model family; open pipelines without strong OCR/guards often produced more detailed unsafe outputs (higher D), meaning when they failed, they failed loudly.
Hidden-risk and JOOD-style attacks had moderate ASR but were stealthy and semantically consistent, making them harder to spot by naive filters.

Defense highlights:

Off-model input filters (e.g., Uniguard, JailGuard) strongly reduced ASR for some OOD and typographic attacks but were less reliable for semantically dispersed attacks like MML. That’s like a good gatekeeper who still misses people sneaking in through side doors.
Output post-processing (e.g., MLLM-Protector, ShieldLM, Llama-Guard-3) was consistently effective at catching unsafe drafts, slashing ASR markedly, especially against attacks that produced detailed harmful steps.
On-model defenses: COCA (inference-time calibration) and VLGuard (safety fine-tuning) often drove ASR near zero for strong attacks like FigStep and HIMRD. However, fine-tuning sometimes introduced tiny new weak spots (slight ASR upticks on specific attacks), reminding us that defenses can shift the landscape, not just shrink it.

Surprising findings:

The strict breach rule (needs harm AND alignment) plus the Detail dimension revealed that many “wins” by attackers were actually low-detail slips—informative but less dangerous. Without D, you might think the sky is falling; with D, you see the nuance.
Different model families showed distinct failure “fingerprints.” Some resisted typographic carriers but fell to cross-modal dispersion; others were the reverse. There’s no single champion defense for all.
Instructive (commanding) style wasn’t always the worst; consultative prompts could also raise ASR depending on attack type—style matters, but it interacts with attack and model.

Bottom line: The 3D judge turned vague claims into precise, comparable safety-utility trade-offs, showing where and why defenses help—and where blind spots remain.

05Discussion & Limitations

Limitations:

Coverage, while broad, is not total. New attack ideas will appear, especially as models and modalities evolve.
Automated image and prompt generation can drift from real-world nuance; continuous curation is needed.
The judge is rule-based; although multi-dimensional, edge cases may still need human review.
Some closed-source APIs change over time, affecting reproducibility unless versions are pinned.

Required resources:

Access to multimodal models (open- and/or closed-source).
GPU(s) for running attacks/defenses at scale and generating images.
Guard models or safety-tuned checkpoints for off-/on-model defenses.
Storage and orchestration tools for datasets, logs, and judge scores.

When not to use:

If you only need text-only safety evaluation; a multimodal benchmark may be overkill.
If your application domain is extremely specialized (e.g., niche medical imaging) and not well represented in the taxonomy—you may need custom extensions first.
If you cannot operate with layered defenses (policy constraints disallow pre/post filters), results may not translate directly.

Open questions:

How to generalize defenses that work across families of attacks without crushing helpfulness?
Can we learn continuous, adaptive safety policies that watch internal signals and the visual channel together?
What’s the best way to calibrate the 3D judge across cultures and contexts while keeping consistency?
How should we prioritize fixes: reduce harm (H) first, or alignment (A), or detail (D)? Does the right answer change by domain?
Can synthetic data reliably stand in for real risk images—and how do we measure that gap over time?

06Conclusion & Future Work

Three-sentence summary: OmniSafeBench-MM is a unified, open-source benchmark and toolbox for testing multimodal jailbreak attacks and defenses on the same realistic dataset with the same rules. It replaces one-number ASR with a three-dimensional judge (harmfulness, alignment, detail) and a clear breach rule that captures the true severity of failures. Experiments across many models and methods reveal distinct weaknesses and trade-offs that were invisible before.

Main achievement: Turning multimodal safety evaluation from a fragmented, single-number game into a standardized, 3D, reproducible process that exposes real safety–utility trade-offs.

Future directions:

Expand to more modalities (audio/video), broader real-world visuals, and new attack/defense families.
Improve auto-judging with hybrid human–AI adjudication for edge cases.
Develop adaptive defenses that jointly monitor image and text pathways in real time.

Why remember this: OmniSafeBench-MM gives the community a common language and lab to measure, compare, and improve multimodal AI safety—so systems can stay helpful without getting tricked into harm.

Practical Applications

•Audit a new multimodal assistant before launch by running all 13 attacks and measuring H-A-D trade-offs.
•Compare multiple guard models (pre- and post-processing) to choose the best safety stack for your product.
•Stress-test a fine-tuned model with hidden-risk and OOD attacks to uncover non-obvious vulnerabilities.
•Tune refusal policies by watching which prompts raise Harmfulness without improving Detail or Alignment.
•Build continuous safety regression tests in CI/CD using the modular APIs and fixed seed datasets.
•Prioritize fixes using category-level scores (e.g., strengthen privacy and doxxing defenses first).
•Train safety reviewers by showing examples where ASR is low but Detail is high (or vice versa) to calibrate judgment.
•Prototype adaptive, layered defenses by combining input filters with inference-time calibration.
•Benchmark closed-source APIs against open checkpoints to decide when to switch providers.
•Run ablation studies to see how removing OCR filtering or adding vision guards changes failure modes.

Version: 1