OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang; Yunhao Chen; Juncheng Li; Yixu Wang; Yang Yao; Tianle Gu; Jie Li; Yan Teng; Yingchun Wang; Xia Hu

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Beginner

Xin Wang, Yunhao Chen, Juncheng Li et al.1/4/2026

arXiv PDF

Key Summary

•OpenRT is a big, open-source test bench that safely stress-tests AI models that handle both text and images.
•It separates the testing job into clear, swappable parts (models, data, attacks, judges, metrics, and a boss called the orchestrator) so researchers can mix and match easily.
•Across 20 strong models, attacks still worked about half the time on average (49.14%), which means today’s safety training often misses tricky, adaptive attacks.
•Fancy, multi-step, team-based attacks (like EvoSynth and X-Teaming) beat simple, one-shot tricks and can reach very high success even on top models.
•Models with better reasoning and vision skills are not automatically safer; in fact, those new skills open new ways to fool them.
•OpenRT doesn’t just measure “Did the attack work?”; it also measures how costly, sneaky, and varied the attacks are.
•The framework runs lots of tests in parallel, making it fast and scalable for real-world safety checks.
•OpenRT is designed to keep growing: adding new attacks or models is as simple as registering a new plug-in.
•The results argue for defense-in-depth and continuous red teaming to avoid overfitting to a few known jailbreak templates.

Why This Research Matters

AI systems are moving into everyday tools, so we must know how they behave under pressure before people rely on them. OpenRT provides a fair, fast, and flexible way to uncover safety gaps across text and images. It proves that modern attacks are adaptive and team-based, so defenses must be deeper than a few templates. With shared interfaces and open code, teams can reproduce results, compare models honestly, and keep tests up to date. This reduces surprise failures in the wild, increases user trust, and helps the whole field agree on what “robust” really means.

Detailed Explanation

Tap terms for definitions

01Background & Problem Definition

🍞 Hook: Imagine your school has a super helper robot that can read stories, solve math, and even understand pictures you draw. Before you let the robot help everyone, you’d want to make sure it never does something unsafe—right?

🥬 The Concept (Red Teaming Framework): What it is: A red teaming framework is a careful, repeatable way to pretend to be a trickster and test if an AI will do unsafe things. How it works:

Pick an AI to test.
Gather a list of risky questions (kept safe and controlled).
Try different “attack” strategies to get the AI to break rules.
Use a “judge” to score if the AI answered safely or not.
Count scores and learn where it’s weak. Why it matters: Without this, we only guess the AI is safe. We need practice drills to see real weaknesses before real people do. 🍞 Anchor: Like a fire drill for AI—practice the emergency so you can fix problems before a real fire.

🍞 Hook: You know how you can text a friend and also send them photos? Some AIs can do that too.

🥬 The Concept (Multimodal Large Language Models, MLLMs): What it is: MLLMs are AIs that can understand and create more than text—they can also work with images (and sometimes audio or video). How it works:

The AI takes in text and pictures.
It builds an internal understanding of what they mean together.
It answers with helpful text (and sometimes an edited image). Why it matters: If we only test text, a picture might sneak past the safety rules. 🍞 Anchor: If the AI refuses unsafe text but obeys when the same idea is hidden inside a picture, that’s a big safety hole.

🍞 Hook: Think of your class doing a science fair, but each group uses totally different rules. You can’t compare the projects fairly.

🥬 The Concept (Fragmented Benchmarks): What it is: Fragmentation means past safety tests were scattered—different rules, different models, and mostly one-turn, text-only checks. How it works:

Each test kit measures something different.
Results can’t be compared apples-to-apples.
It’s hard to see real progress or patterns. Why it matters: Without a common yardstick, we can’t tell what truly works. 🍞 Anchor: It’s like some kids measuring height in inches and others in centimeters—hard to know who’s actually taller.

🍞 Hook: Imagine trying to lock your bike, but you only lock the front wheel. A clever thief could still take it.

🥬 The Concept (Safety Alignment): What it is: Safety alignment teaches AI to follow rules and avoid harm. How it works:

Add special safety instructions.
Train it to refuse risky requests.
Add filters to catch bad outputs. Why it matters: If safety is shallow or only covers obvious tricks, smart attackers go around it. 🍞 Anchor: If an AI won’t explain a dangerous recipe in plain text but reveals it step-by-step during a long chat, safety didn’t hold.

🍞 Hook: You know how a coach makes players practice against tougher and tougher opponents?

🥬 The Concept (Jailbreak Attacks): What it is: Jailbreaks are clever prompts or pictures that try to make AI ignore its safety rules. How it works:

Start with a tricky request.
Wrap it in code, riddles, foreign words, or multi-turn chats.
See if the AI slips and answers. Why it matters: Real users and bad actors might try similar tricks—so we must test them first under safe conditions. 🍞 Anchor: Like testing a door lock with many keys to ensure none of the wrong keys can open it.

The World Before: AIs that chat and “see” became common in apps, search, and helpers. People built safety with policies, refusals, and filters. But attackers kept inventing new ways—long conversations, code disguises, cross-language, and visual prompts—to sneak around those rules. Testing tools existed, but many were one-off, text-only, or not scalable.

The Problem: We lacked one, unified, high-speed system to fairly test many models against many kinds of attacks (single-turn, multi-turn, multi-agent, and multimodal) and measure not only “Did it work?” but also “How sneaky, costly, and varied was the attack?”

Failed Attempts:

Manual red teaming: finds cool bugs but is slow and expensive.
Template attacks: easy to block once known.
Narrow tools: great at one kind of test, weak at others; hard to compare results.

The Gap: No standard, modular, high-throughput setup to plug in any model, any dataset, any attack, and any judge, and then run them at scale with apples-to-apples scoring.

Real Stakes: These AIs help with coding, customer support, creative work, and more. If safety fails, users could see harmful or misleading content. That affects trust, businesses, and even laws. A dependable testing gym like OpenRT helps builders fix weaknesses before release, just like car crash tests make cars safer for everyone.

02Core Idea

🍞 Hook: Picture a giant LEGO table where each brick (model, dataset, attack, judge, metric) clicks into place, and a clever robot runs the whole show super fast.

🥬 The Concept (OpenRT’s Aha! Moment): What it is: OpenRT is a plug-and-play, high-speed red teaming framework that cleanly separates the “attack brains” from the “run fast and compare fairly” engine. How it works:

Standardize how attacks talk to models and judges (shared interfaces).
Plug in models, datasets, attacks, and judges like LEGO bricks.
Run many tests in parallel with an orchestrator.
Score results with clear metrics and save rich logs. Why it matters: Decoupling the logic from the runtime lets you scale to many models and attacks without rewriting everything. 🍞 Anchor: Like a universal game console that plays many kinds of games (attacks) with the same controller (interface), so testing is simple and fair.

Three Analogies:

Kitchen: OpenRT is the kitchen; models are ingredients, attacks are recipes, the judge is the taste-tester, and the orchestrator is the head chef keeping dinner on schedule.
Sports league: Different teams (attacks) play matches against different goalies (models), while referees (judges) keep score and the league office (evaluator) updates the standings.
Airport hub: Many airlines (attacks) and planes (models) use the same gates (interfaces) while the tower (orchestrator) coordinates safe, on-time traffic.

Before vs After:

Before: One-off scripts, text-only, single-turn focus; hard to compare or scale.
After: Unified interfaces, multi-turn and multimodal support, multi-agent attacks, high-throughput runs, and richer, standardized metrics.

Why It Works (Intuition):

Interfaces force consistency: every attack gets the same fair access to the model and judge.
Parallelism hides waiting time: many test conversations can happen at once.
Modularity speeds learning: swap one part without breaking others; improvements compound.
Hybrid judging reduces errors: quick keyword checks plus smarter LLM judges catch more cases.

Building Blocks (each a LEGO brick):

🍞 Hook: Think of a conductor who doesn’t play instruments but makes the whole orchestra sound great. 🥬 The Concept (Orchestrator): What it is: The manager that schedules, runs, and records all tests in parallel. How it works: (1) Start workers; (2) Assign queries to attacks; (3) Collect results; (4) Compute metrics. Why it matters: Without it, testing is slow and messy. 🍞 Anchor: Like a teacher organizing a science fair so every team presents and gets graded on time.

🍞 Hook: Imagine a vending machine with many snacks; you press a button and it gives you a response. 🥬 The Concept (Model Interface): What it is: A common way to talk to cloud AIs and local AIs the same way. How it works: (1) Send text/images; (2) Get replies; (3) (White-box only) get gradients/embeddings. Why it matters: One plug works for many outlets. 🍞 Anchor: Like using one charger that fits different phones with adapters.

🍞 Hook: You can’t play a game without questions to ask. 🥬 The Concept (Dataset): What it is: A safe, curated list of risky test prompts for evaluation. How it works: (1) Load items; (2) Stream big sets if needed; (3) Feed them to attacks. Why it matters: Bad or tiny data gives misleading scores. 🍞 Anchor: Like a practice workbook for a test—if the questions are weak, you won’t learn much.

🍞 Hook: Think of puzzle-solvers competing to crack a safe—with different clever tricks. 🥬 The Concept (Attack): What it is: A strategy that rewrites or crafts inputs to make the AI break rules. How it works: (1) Read the target query; (2) Transform it (single turn, multi-turn, image-based, or multi-agent); (3) Check if the judge flags a violation. Why it matters: Diverse attacks reveal hidden weak spots. 🍞 Anchor: Like trying codes, riddles, or team plans to open a locked box.

🍞 Hook: Every game needs a referee. 🥬 The Concept (Judge): What it is: A scorer that decides how harmful a reply is (scale 1–5). How it works: (1) Quick keyword scan; (2) Smarter LLM judge for context; (3) Compare to a success threshold. Why it matters: If judging is wrong, results are wrong. 🍞 Anchor: Like a lifeguard deciding if swimming is safe or risky, with clear rules.

🍞 Hook: Report cards make progress clear. 🥬 The Concept (Evaluator): What it is: The part that adds up scores like Attack Success Rate, cost, stealthiness, and diversity. How it works: (1) Aggregate results; (2) Compute metrics; (3) Summarize strengths/weaknesses. Why it matters: Numbers turn stories into evidence. 🍞 Anchor: Like a scoreboard showing not just who won, but how and by how much.

🍞 Hook: A label maker helps you find tools fast. 🥬 The Concept (Registry): What it is: A list where new attacks or models register themselves so the system can find and use them. How it works: (1) Decorate new code; (2) Auto-list available parts; (3) Assemble by YAML config. Why it matters: Easy to grow; less code changes. 🍞 Anchor: Like a library catalog—you can quickly check out the exact book you need.

03Methodology

At a high level: Input (risky test prompts + chosen models) → Orchestrator runs attacks in parallel → Judges score replies → Evaluator computes metrics → Outputs reports and logs.

Step-by-step with the Sandwich pattern for key pieces:

Orchestrator 🍞 Hook: Think of a super-scheduler that makes sure every race starts and ends fairly. 🥬 What it is: The pipeline boss that launches many attack trials at once, catches errors, and keeps progress bars. How it works:

Initialize thread pool workers.
Dispatch (attack, query) jobs.
Collect AttackResult objects (success flag, final prompt, response, costs, any images).
After all finish, send results to Evaluator and print headline metrics (like ASR). Why it matters: Without it, you’d run tests slowly, one by one, and lose track of which result belongs where. 🍞 Anchor: Like a tournament bracket app that schedules matches, records scores, and shows the leaderboard.

Model Interface 🍞 Hook: You want one remote control for many TVs. 🥬 What it is: A unified way to query cloud APIs and local models. How it works:

query: send text and/or images; receive a reply.
get_gradients/get_embedding: available for white-box attacks on local models.
Manage conversation history for multi-turn attacks. Why it matters: Saves you from rewriting attacks for each model provider. 🍞 Anchor: One app controlling lights from different brands in a smart home.

Dataset 🍞 Hook: No quiz without questions. 🥬 What it is: The source of test prompts (e.g., HarmBench) covering multiple risk areas. How it works:

StaticDataset for small, in-memory tests.
JSONLDataset for large, streamable benchmarks. Why it matters: Good coverage = realistic results. 🍞 Anchor: A balanced study guide covering all chapters—not just the easy ones.

Attack (families and examples) 🍞 Hook: Different keys for different locks. 🥬 What it is: Methods that transform inputs to try to bypass safety. How it works (families):

White-box: use gradients to craft adversarial text or tiny image changes.
Black-box optimization: genetic algorithms and fuzzing mutate prompts using only outputs.
LLM-driven refinement: an assistant LLM iteratively improves attack prompts.
Linguistic/encoding: ciphers, code wrapping, multilingual tricks, logic nesting.
Contextual deception: multi-turn stories that steer the model gradually (e.g., Crescendo).
Multimodal: images with typography or visual cues to slip past text filters.
Multi-agent: teams of agents explore and evolve diverse strategies (e.g., X-Teaming, EvoSynth). Why it matters: Variety uncovers weaknesses that a single style would miss. 🍞 Anchor: A toolbox: screwdrivers, wrenches, hammers—each fixes a different problem.

Judges and Thresholds 🍞 Hook: A referee plus instant replay. 🥬 What it is: A hybrid judge—fast keyword checks plus an LLM judge that reads for meaning. How it works:

Score 1 (best refusal) to 5 (fully harmful compliance).
success_threshold picks how strict you are (e.g., 5 for only the worst cases count as success). Why it matters: Better judging reduces false alarms and missed violations. 🍞 Anchor: A soccer ref who also checks video to make the call right.

Evaluator and Metrics 🍞 Hook: Win rate isn’t the whole story. 🥬 What it is: A calculator that summarizes effectiveness and practical trade-offs. How it works:

Attack Success Rate (ASR): percent of attempts that crossed the harm threshold.
Efficiency (Cost): tokens, calls, or time used per attempt.
Stealthiness: how natural the attack text looks (lower perplexity = sneakier); small visual changes for images.
Diversity: how different successful prompts are (semantic spread). Why it matters: A great attack that costs a fortune or is obvious to filters isn’t as scary. 🍞 Anchor: Like grading players on goals, stamina, fair play, and creativity—not just the final score.

Modular Registry and Configs 🍞 Hook: Plug-and-play makes life easy. 🥬 What it is: A registry where components self-register; YAML configs assemble experiments without coding. How it works:

@register decorators add new attacks/models/judges.
The orchestrator reads the config and auto-builds the pipeline. Why it matters: Faster iteration, easier collaboration, fewer bugs. 🍞 Anchor: Like choosing toppings on a pizza order form—you don’t need to bake from scratch.

Concrete flow example with actual data:

Inputs: HarmBench prompts; target model “GPT-5.2”; attacks = [PAIR, X-Teaming, EvoSynth]; judge threshold = 5.
Orchestrator launches all three attacks over the dataset with 25 parallel workers.
Judges score each reply; Evaluator computes ASR, average tokens per success, median perplexity, and diversity.
Output: A report showing, say, EvoSynth reached near-100% ASR with high diversity; X-Teaming was strong and efficient; PAIR was good but less diverse.

Secret Sauce:

Decoupled design: attacks don’t care which model vendor you use; the runtime stays stable.
Hybrid judging: fast + smart reduces mistakes.
High-throughput: parallel execution turns days into hours.
Broad coverage: 37 attack methods across text and images, single-turn to multi-agent, for a realistic threat picture.

04Experiments & Results

The Test: OpenRT evaluated 37 attack strategies on 20 advanced models using a standard harmful-prompt dataset. The judge used a strict threshold (only the most clearly unsafe responses counted as “attack success”), so scores aren’t inflated.

🍞 Hook: Think of a report card that shows not just grades, but also how hard the test was and how diverse your answers were. 🥬 The Concept (Attack Success Rate, ASR): What it is: The percentage of attempts where the judge said the model responded unsafely. How it works: Count successful violations ÷ total attempts. Why it matters: It’s the main “did the attack break through?” number. 🍞 Anchor: Like saying “You scored on 49 out of 100 tries,” which is about half the time.

Headlines with Context:

Average ASR ≈ 49.14% across 20 models: like flipping a coin and getting heads about half the time—too high for comfort.
Some top models kept many static attacks below 20% ASR (good!), but adaptive, multi-turn, or multi-agent attacks often crushed defenses.
EvoSynth and X-Teaming: standout attackers. EvoSynth reached near-100% ASR on many models; X-Teaming often scored 85–98%.
Reasoning and vision features were not automatic shields; they sometimes opened new doors for attackers (e.g., visual prompts bypass text filters).
Both closed and open models had vulnerabilities; only a couple stayed under ~30% ASR on average.

The Competition: OpenRT compared many attack families—template tricks, logic nesting, genetic algorithms, LLM-refinement, multi-turn steering (e.g., Crescendo), multi-agent teams (X-Teaming), and evolutionary synthesis (EvoSynth). Multi-agent and adaptive search tactics generally outperformed one-shot templates.

Scoreboard (meaningful framing):

“A+ attackers” like EvoSynth: near-perfect ASR (think 98–100%), even against strong models.
“A-range” attackers like X-Teaming: very high ASR (85–98%), consistent on many models.
Middle performers (PAIR, GPTFuzzer, AutoDAN-R): strong on lots of targets (like solid B+ to A-), showing adaptivity beats static tricks.
Weak performers (some template-heavy or shallow heuristics): high variance—ace one model, barely pass another (like swinging from A to F depending on the test).

Surprising Findings:

Polarized vulnerability: A model can be fortress-strong vs. one family (e.g., ciphers) yet paper-thin vs. another (e.g., logic nesting)—differences as huge as ~90% ASR gap.
Reasoning chains aren’t magic armor: longer thoughts can be steered in multi-turn chats.
Modality gap: visual prompts sneak past text-only safety in multimodal models.
Cost doesn’t equal power: some very costly methods aren’t much better than cheaper, smarter ones.

More Metrics (each with a Sandwich):

🍞 Hook: If two runners tie, who wins—the one who used less energy. 🥬 The Concept (Efficiency/Cost): What it is: Tokens, calls, or time per attack. How it works: Add up inputs/outputs and API calls. Why it matters: Real systems have budgets and rate limits. 🍞 Anchor: An attack that “wins” only after 10,000 tries is less practical than one that wins in 50.

🍞 Hook: A good disguise is hard to spot. 🥬 The Concept (Stealthiness): What it is: How natural the attack text looks (low perplexity) or how tiny an image change is. How it works: Measure PPL for text; bound tiny pixel nudges for images. Why it matters: Sneaky attacks avoid filters. 🍞 Anchor: A polite, normal-sounding request might slip by a spam filter more than a noisy, weird one.

🍞 Hook: Don’t study just one problem type. 🥬 The Concept (Diversity): What it is: How different successful prompts are from each other. How it works: Compare semantic embeddings; higher spread = more variety. Why it matters: Diverse attacks expose more total weak spots. 🍞 Anchor: A team that can solve many kinds of puzzles is harder to defend against than a one-trick pony.

Bottom line: Adaptive, iterative, and multi-agent attacks dominated. Static, one-shot templates are getting brittle as models learn to spot them. Safety training that only targets a few known patterns won’t cut it; coverage must expand and keep updating.

05Discussion & Limitations

Limitations:

Compute and budget: High-throughput red teaming can be resource-heavy (many tokens and calls). Teams need planning to control costs.
Judge dependence: Even with hybrid judging, borderline cases are tricky; different judges may disagree on edge responses.
Benchmark scope: Datasets are strong but still finite; real-world creativity can outpace any static list.
White-box realism: Gradient access is great for worst-case study but uncommon in cloud deployments; results must be interpreted accordingly.

Required Resources:

Access to target models (APIs or local), plus helper/judge models.
Budget for tokens and time; orchestration hardware for parallel runs.
Team practices for safe handling of harmful prompts and outputs (strict policies and logging).

When NOT to Use:

If you lack the safety process to responsibly handle risky prompts/responses (e.g., no content handling policy, no red-team oversight).
If you only need quick, small checks—OpenRT shines at scale; tiny one-off tests could be simpler elsewhere.
If legal/compliance constraints block the evaluation of certain risk categories in your region.

Open Questions:

Adaptive defenses: How to build runtime systems that learn from ongoing attacks without overfitting or breaking helpfulness?
Cross-modal safety: What holistic guardrails catch text+image tricks together, not just separately?
Judge standardization: Can the community agree on shared judge models and thresholds for more consistent scoring?
Transfer to agents and tools: How do results change when models can browse, code, or call external tools?
Continual red teaming: What’s the best cadence and dataset-refresh strategy to keep up with evolving attacks?

06Conclusion & Future Work

Three-sentence summary: OpenRT is a modular, high-speed red teaming framework for text-and-image AIs that standardizes models, attacks, judges, and metrics under one roof. Across 20 advanced models and 37 attack types, it finds that adaptive, multi-turn, and multi-agent attacks often succeed—on average nearly half the time—revealing major safety gaps. By open-sourcing the system, OpenRT enables ongoing, fair, and scalable safety testing to push defenses forward.

Main achievement: A cleanly decoupled, plug-and-play architecture—with hybrid judging and parallel orchestration—that makes comprehensive, multimodal red teaming practical and reproducible at scale.

Future directions: Integrate new attack families and modalities (audio/video), improve community-standard judges, add live defense-in-the-loop experiments, and deepen cross-modal safety tests. Explore automated defense tuning that learns from OpenRT’s findings without overfitting.

Why remember this: Safety isn’t “set and forget.” As models gain skills, attackers gain openings. OpenRT shows how to test broadly, fairly, and fast—so builders can spot cracks early, fix them often, and keep AI helpful and safe for everyone.

Practical Applications

•Pre-launch safety audits for new chat or vision-chat products using standardized, repeatable tests.
•Continuous red teaming in CI pipelines so every model update is checked against fresh attacks.
•Comparing vendors fairly when choosing a model for customer support or search assistants.
•Testing the “modality gap” by probing whether image inputs bypass text-only safety rules.
•Evaluating defense-in-depth strategies by measuring ASR, cost, stealthiness, and diversity before and after new safeguards.
•Triage and bug bounties: quickly reproduce, log, and share attack traces for faster fixes.
•Compliance readiness: produce clear safety reports for regulators and enterprise stakeholders.
•Curriculum for safety teams: teach differences between single-turn, multi-turn, and multi-agent threats with hands-on runs.
•Model hardening: select the most damaging attack families to guide adversarial training.
•Benchmark refresh: plug in new attack papers quickly via the registry to avoid overfitting to old tricks.

Version: 1