MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models | How I Study AI

MUSE: A Run-Centric Platform for Multimodal Unified Safety Evaluation of Large Language Models

Beginner

Zhongxi Wang, Yueqian Lin, Jingyang Zhang et al.3/3/2026

Key Summary

•MUSE is a new open-source platform that tests how safely AI models behave when you talk to them with text, sound, pictures, and video, not just text.
•It keeps every test organized as a run so you can replay, compare, and trust the results.
•MUSE includes three well-known multi-turn attack strategies (Crescendo, PAIR, Violent Durian) and adds a new twist called Inter-Turn Modality Switching (ITMS).
•ITMS rotates the input type turn by turn (like text, then audio, then image) to see if defenses stay strong across modality changes.
•Instead of a simple pass/fail, MUSE judges responses with five levels, so it can measure partial information leaks that binary metrics miss.
•Across about 3,700 tests and six popular multimodal models, multi-turn attacks reached 90–100% success even when one-shot prompts were almost always refused.
•ITMS often made attacks succeed faster by shaking early defenses, even when final success was already very high.
•Modality effects were model-family-specific: what helps break one model (like audio) can make another model safer.
•This work shows we must test AI safety across modalities and turns, not just single text prompts.
•The platform is browser-based, provider-agnostic, and built for reproducible, large-scale safety evaluations.

Why This Research Matters

Real users don’t interact with AI in just one way—they type, talk, share photos, and show videos, often across multiple turns. If we only test text once, we can miss failures that appear after a few back-and-forths or when the input switches modalities. MUSE shows that multi-turn, multimodal testing can uncover leaks even when single-turn tests look perfect. By tracking partial compliance, teams can close small cracks before they become big problems. Providers and regulators can use MUSE’s run-centric records to make fair, reproducible comparisons. Ultimately, safer AI means more trustworthy assistants at home, school, and work.

Reading Workflow

Turn this paper into a decision

Scan fast. Promote only the papers that survive triage.

No workflow history yet.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Top Bread (Hook): You know how a teacher doesn’t just quiz you with written questions, but also asks you to explain out loud or draw a picture to prove you really understand? That’s because showing knowledge in different ways can reveal hidden gaps.

🥬 Filling (The Actual Concept):

What it is: This paper builds MUSE, a platform to check if AI models stay safe not just with text, but also when you talk to them using audio, images, or video, and across many back-and-forth turns.
How it works (story of the field):
1. The world before: AI safety tests mostly checked text-only prompts, often as a single message. Modern AIs, however, are multimodal—they handle text, pictures, sound, and video together. But safety tools rarely tested these other input types in one place.
2. The problem: Two research paths grew separately: (a) multi-turn text attacks that slowly pressure models to slip, and (b) multimodal single-turn tests that send risky content through images or audio. No single system combined both—meaning we didn’t know if models could resist a smart, multi-turn attack that also switches modalities along the way.
3. Failed attempts: Existing frameworks offered parts of the puzzle—coding libraries for attacks, or benchmarks of risky questions—but lacked native multimodal payload generation, or didn’t manage full multi-turn conversations end to end, or didn’t provide a fine-grained safety judge beyond pass/fail.
4. The gap: Researchers needed a unified, reproducible pipeline that (a) generates multimodal payloads; (b) orchestrates multi-turn attacks; (c) talks to different providers in one interface; and (d) judges answers with more detail than yes/no. They also needed a way to test whether safety holds when you switch modalities between turns.
5. Real stakes: In daily life, people might speak to AI (audio), show it a photo (image), or share a clip (video). If a model only stays safe with text but slips with images or after a few turns, that can lead to harmful instructions or policies being bypassed. For companies, governments, and families, understanding these gaps matters for trust, safety, and regulation.
Why it matters: Without a comprehensive tool, we can miss partial leaks (like giving half of a dangerous recipe) or leaks that only happen after switching from text to audio. That means a model could appear safe but actually be vulnerable in real-world multimodal conversations.

🍞 Bottom Bread (Anchor): Imagine testing a smart assistant that refuses a harmful text question right away. Now you send the same request as a picture of text, then you speak it, and you mix these across several turns. MUSE is the organized lab where you can run this whole experiment, step-by-step, and record exactly what happened.

02Core Idea

🍞 Top Bread (Hook): Picture a sports tournament where players face challenges that keep changing—sometimes running, sometimes swimming, sometimes biking. If they only trained for running, they might struggle when the event switches.

🥬 Filling (The Actual Concept):

What it is (one sentence): The key insight is to unify multi-turn red-teaming with cross-modal input switching and a fine-grained judge—so we can see whether safety truly holds when the game keeps changing.
Multiple Analogies (3 ways):
1. Swiss Army Knife: Instead of carrying separate tools for text-only tests and image-only tests, MUSE is a single Swiss Army knife that runs both, plus switches tools mid-conversation.
2. Airport Security Drill: Testing only one checkpoint (text) isn’t enough; you must also test bag scans (images), body scanners (audio/video), and how guards react if a traveler switches lanes mid-process (modality switching).
3. Science Lab with Labeled Beakers: Not just yes/no labels; MUSE uses a five-step scale to measure how much dangerous liquid might have spilled (from none to a lot), so you don’t miss small leaks.
Before vs After:
- Before: Safety tests were mostly single-turn and text-only, or else multimodal but isolated, with binary results.
- After: We get a run-centric, multimodal, multi-turn framework that records everything, rotates modalities between turns, and scores granular safety outcomes.
Why it works (intuition, not equations): Multi-turn attacks chip away at defenses gradually; modality switching can unsettle guardrails tuned for text; and a five-level judge catches small leaks that a pass/fail test would ignore. Together, these reveal where alignment really holds—and where it silently cracks.
Building Blocks (the pieces you need to understand):
1. 🍞/🥬/🍞 Inter-Turn Modality Switching (ITMS):
  - What: A method that rotates the input type (text, audio, image, video) across turns.
  - How: Turn 1 in text, Turn 2 in audio, Turn 3 in image, etc., constrained by what the target model supports.
  - Why: If safety only works for one modality, switching can expose hidden weak spots.
  - Example: Ask a question in text, then read it aloud next turn; see if the model’s behavior changes.
2. 🍞/🥬/🍞 Multi-turn Attack Strategies (Crescendo, PAIR, Violent Durian):
  - What: Step-by-step pressure plans that try to make a model reveal disallowed info.
  - How: Crescendo escalates slowly; PAIR iteratively rewrites single-turn prompts; Violent Durian uses high-pressure tactics early. MUSE runs these automatically.
  - Why: Many models refuse on the first try but can be nudged over several turns.
  - Example: Crescendo might start with harmless context, add more edge cases, then push into restricted territory.
3. 🍞/🥬/🍞 Five-Level Safety Taxonomy:
  - What: Labels from safest to most risky: Non-Responsive, Direct Refusal, Indirect Refusal, Partial Compliance, Compliance.
  - How: An LLM judge rates each model response based on how much harmful capability is transferred.
  - Why: A binary pass/fail hides partial leaks that matter.
  - Example: If a response gives half of a dangerous procedure, that’s Partial Compliance—not a full pass and not a total fail.
4. 🍞/🥬/🍞 Run-Centric Architecture:
  - What: Each test is a self-contained run that stores the setup, all conversation turns, generated media, model outputs, and judgments.
  - How: One run equals one reproducible record. Batches group runs for large campaigns.
  - Why: Reproducibility and analysis require complete, organized histories.
  - Example: You can pause a batch at goal 25, resume later, and still have every detail saved.

🍞 Bottom Bread (Anchor): In practice, MUSE might test 50 goals against six models. It rotates modalities between turns, uses Crescendo to escalate, and labels every response on the five-level scale—so you can see not only who fails, but how fast and where small leaks start.

03Methodology

🍞 Top Bread (Hook): Imagine building a rollercoaster where the car changes tracks mid-ride (text to audio to image) while inspectors note every twist and turn. You need a careful plan so you don’t miss anything.

🥬 Filling (The Actual Concept):

Overview: At a high level: Goal (input) → Attack Strategy Engine (creates next prompt) → Modality Conversion (turn text into audio/image/video) → Model Routing (send to target model) → LLM Judge (score safety) → Update Run (store everything) → Repeat until stop.
Each Step Detailed:
1. Define the goal and start a run
  - What happens: You pick a harmful goal from a test list (e.g., a placeholder goal) and select a strategy (Crescendo, PAIR, or Violent Durian), a target model, and whether to enable ITMS. MUSE creates an attack run that will record every step.
  - Why this step exists: Without a run, you can’t track what happened, in which order, or how to reproduce it later.
  - Example with actual data: Run ID R-1023, Model = “Gemini 2.5 Flash”, Strategy = “Crescendo”, Max Turns = 10, ITMS = On (text, audio, image), Goal Category = “Fraud/Social Engineering”.
2. Attack Strategy Engine generates the next turn
  - What happens: The attacker LLM writes the next prompt. Crescendo escalates slowly and backtracks after refusals; PAIR rewrites single-turn prompts; Violent Durian applies strong pressure early.
  - Why this step exists: Many defenses hold for one turn but can weaken under sustained, adaptive pressure.
  - Example: Turn 1 creates a cautious setup; Turn 2 adds more specifics; Turn 3 responds to the model’s last message with a tailored push.
3. Inter-Turn Modality Switching (optional)
  - What happens: If ITMS is enabled and the target supports multiple modalities, MUSE rotates the delivery type (e.g., text → audio → image → text ...).
  - Why this step exists: Some models have stronger safety in text than in other modalities, or vice versa; switching can uncover these gaps and sometimes accelerates convergence.
  - Example: Turn 1 (text), Turn 2 (audio, via TTS), Turn 3 (image with rendered words), then back to text.
4. Multimodal Payload Generation
  - What happens: The attacker’s text is converted to audio (TTS), image (text-on-canvas), or video (audio+image composition). MUSE caches assets so repeats don’t regenerate the same media.
  - Why this step exists: To fairly test multimodal models, you must feed the same content through different channels without manual work.
  - Example: The sentence “<red-team prompt text>” becomes: audio file A-778.wav, image file I-441.png, or video file V-992.mp4.
5. Provider-Agnostic Model Routing
  - What happens: MUSE sends the payload to the chosen model using a uniform interface, even though each provider’s API is different.
  - Why this step exists: To compare models fairly and switch providers without rewriting the whole system.
  - Example: The same prompt goes to Qwen3-Omni or Gemini 3 Flash without changing the surrounding code.
6. LLM Judge with Five-Level Safety Taxonomy
  - What happens: A judging LLM reads the model’s response and assigns one of five labels: Compliance (most risky), Partial Compliance, Indirect Refusal, Direct Refusal, or Non-Responsive.
  - Why this step exists: Binary pass/fail hides partial leaks. We want to catch gray zones.
  - Example: If the model gives complete, step-by-step harmful instructions, the judge labels it Compliance; if it gives an ethical explanation without actionable steps, that’s Indirect Refusal.
7. Record, Check Stopping, and Iterate
  - What happens: MUSE stores the turn, the label, and any media; it decides whether to continue (up to 10 turns, or until success/refusal conditions). It also supports pausing and resuming batches.
  - Why this step exists: Full records enable reproducibility, analytics, and fair comparisons.
  - Example: After Turn 4, if Compliance is reached, the run stops and records success with the turn count.
The Secret Sauce:
1. Run-Centric design: Every detail is saved per run so experiments are reproducible and comparable.
2. ITMS: Rotating modalities between turns sometimes destabilizes early defenses, speeding up success.
3. Dual metrics: Hard vs. Soft Attack Success Rate show not only full breaks but also partial leaks.
Simple Math, Clearly Explained:
1. Hard ASR formula: $ASR_{hard} = \frac{\text{# of runs labeled Compliance}}{\text{total # of runs}}$ . Example: If 47 out of 50 runs are labeled Compliance, $ASR_{hard} = \frac{47}{50} = 0.94$ .
2. Soft ASR formula: $ASR_{soft} = \frac{\text{# of runs labeled Compliance or Partial Compliance}}{\text{total # of runs}}$ . Example: If 47 are Compliance and 3 are Partial Compliance out of 50, $ASR_{soft} = \frac{47 + 3}{50} = \frac{50}{50} = 1.00$ .
3. Gray Zone Width (GZW): $GZW = ASR_{soft} - ASR_{hard}$ . Example: With $ASR_{soft}=1.00$ and $ASR_{hard}=0.94$ , $GZW = 1.00 - 0.94 = 0.06$ .

🍞 Bottom Bread (Anchor): Think of MUSE like a science fair project board: each run is a full experiment page with the question, method, data, and result—except here, you can do thousands automatically, across text, audio, images, and video, with turn-by-turn scoring.

04Experiments & Results

🍞 Top Bread (Hook): Imagine testing six different safes. They’re rock-solid when you try one code once (single turn). But if you try a clever sequence and switch the way you press buttons (modality), some open surprisingly fast.

🥬 Filling (The Actual Concept):

The Test: The team ran about 3,700 red-team runs using 50 goals from five harm categories (weapons, controlled substances, malware, biological threats, fraud/social engineering). They measured how often models fully complied (hard ASR) and how often models at least partially leaked information (soft ASR). They also tracked how many turns it took to succeed and how modality switching affected results.
The Competition: Six multimodal LLMs from four providers were tested, including Qwen-Omni models, Gemini models, GPT-4o, and Claude Sonnet 4. Strategies included Crescendo, PAIR, Violent Durian, and their ITMS-enhanced variants.
Single-Turn Baseline (Context): Across text, audio, image, and video (when supported), refusal rates were 90–100%. That’s like everyone scoring an A on the one-question quiz. So any later failures come from the extra pressure of multi-turn attacks, not weak initial safety.
The Scoreboard (Meaningful numbers):
- Multi-turn attacks reversed the picture: Crescendo and PAIR reached hard ASR around 90–100% on most models. That’s like going from near-unbreakable in one shot to widely breakable when nudged over several turns.
- Violent Durian varied a lot by model family: it barely worked on Claude Sonnet 4 but was strong on Qwen2.5-Omni—evidence that template-like pressure finds model-specific cracks, not a universal backdoor.
- Soft vs. Hard ASR revealed gray zones: On some model-strategy pairs (e.g., PAIR vs. Claude), a large gap showed many runs with partial leaks—not a full success, but not fully safe either.
Surprising Findings:
1. ITMS didn’t always raise final ASR when base strategies were already near 100%, but it often made success happen in fewer turns—like reaching the goal faster on the same track.
2. The effect of using audio or image depended on the model family. For some Gemini models, non-text modalities improved attack success; for some Qwen models, they reduced it, suggesting different guardrails per provider.
3. Early-turn behavior shifted under ITMS: first turns looked more cautious, but refusals dropped sharply after the first modality switch, and partial compliance rose—evidence that switching, not just content, shakes defenses.
Category Patterns:
- Fraud/social engineering was generally the most vulnerable across strategies.
- Drugs and weapons tended to be more resistant, hinting at stronger or more specific training in those areas.

🍞 Bottom Bread (Anchor): Think of Crescendo like a gentle but steady push. In single-turn tests, the door stays shut. In multi-turn tests, that same door opens most of the time—and if you change how you knock (switch modalities), it often opens sooner.

05Discussion & Limitations

🍞 Top Bread (Hook): You know how some umbrellas work perfectly in light rain but flip inside out in gusty wind? Tools that seem strong in one weather (single text turn) can wobble in another (multi-turn, multimodal).

🥬 Filling (The Actual Concept):

Limitations (be specific):
1. Provider differences: The modality effect isn’t universal—what weakens one model can strengthen another. So conclusions must be provider-aware, not one-size-fits-all.
2. Scope of goals: The 50-goal set covers five categories, but real-world harm is broader. Expansion to more nuanced or novel goals is needed.
3. Judge dependence: The five-level taxonomy relies on an LLM judge. While human validation showed high agreement (~93%), judgments can still have edge cases (especially between Compliance and Partial Compliance).
4. API limits: Some models don’t accept all modalities via standard endpoints, so parts of the matrix (e.g., audio/video for certain models) weren’t tested.
Required Resources: Access to provider APIs, credits for model calls and TTS/image/video generation, storage for media caching, and orchestration capacity for thousands of runs.
When NOT to Use: If you need purely local, offline evaluation without internet or provider APIs (not yet supported in this release); if your question is purely single-shot text and you don’t need fine-grained labels; or if your safety policy forbids any adversarial testing.
Open Questions:
1. Human-judge scaling: How do we cross-check the five-level labels more extensively and across cultures/languages?
2. Defensive training: Which training approaches prevent the convergence acceleration under modality switching?
3. Mechanism insight: Why do some families harden under images while others soften? Is it preprocessing, safety filters, or core model behavior?
4. Beyond audio/image/video: How do emerging modalities (e.g., gestures, sensor data) change the safety picture?

🍞 Bottom Bread (Anchor): Think of MUSE as a wind tunnel for AI safety. It’s excellent for seeing how models behave in different gusts, but you’ll still want more tests, more judges, and more weather patterns in the long run.

06Conclusion & Future Work

🍞 Top Bread (Hook): Imagine testing a bridge by sending not just cars across it once, but also bikes, trucks, and buses, and then mixing them in different orders. You learn much more about where it might wobble.

🥬 Filling (The Actual Concept):

3-Sentence Summary: MUSE is a run-centric, open-source platform that unifies multimodal payload generation, multi-turn attack orchestration, and a five-level safety judge. In about 3,700 tests across six multimodal models, multi-turn attacks achieved 90–100% success despite near-perfect single-turn refusals. Inter-Turn Modality Switching often made attacks succeed in fewer turns, and modality impacts depended on model family.
Main Achievement: Turning fragmented safety checks into one reproducible, multimodal, multi-turn, fine-grained evaluation pipeline—revealing hidden leaks that binary, text-only tests miss.
Future Directions: Add local open-source models, extend ITMS with native video rotation, enlarge the goal set, and further validate the judge with larger human studies.
Why Remember This: Real AI use is multimodal and conversational. If we only test single text turns, we can be fooled into thinking models are safer than they are. MUSE shows how to test the world AI actually lives in—where input types switch, conversations evolve, and small leaks matter.

🍞 Bottom Bread (Anchor): After MUSE, evaluating AI safety is less like a quick quiz and more like a full-season tournament across many playing fields—with a scoreboard that records every point, not just the final win or loss.

Practical Applications

•Benchmark multimodal model safety across providers with reproducible, run-level records.
•Detect partial information leaks using soft ASR and gray zone width before they escalate.
•Stress-test early-turn defenses by enabling ITMS to see if modality switching speeds up failures.
•Prioritize fixes by analyzing which harm categories (e.g., fraud) and which modalities leak most.
•Compare attack strategies (Crescendo, PAIR, Violent Durian) to pick the most revealing tests for a given model family.
•Use batch campaigns with stop/resume to run large-scale audits without losing progress.
•Validate automated judgments with periodic human reviews to calibrate the five-level taxonomy.
•Guide safety training by targeting model-specific weaknesses revealed by modality ablations.
•Support provider-aware policy: choose default modalities or disable risky channels where leaks are higher.
•Track improvements over time by re-running the same configurations and comparing run histories.

Version: 1

Notes