ProGuard: Towards Proactive Multimodal Safeguard
Key Summary
- ProGuard is a safety guard for text and images that doesn't just spot known problems: it can also recognize and name new, never-seen-before risks.
- Instead of memorizing fixed rule lists, ProGuard learns through reinforcement learning, like getting points for good decisions and learning from mistakes.
- The team built a balanced dataset of 87K examples across text, images, and text+image to avoid being great at one modality and weak at another.
- A special "synonym bank" reward helps ProGuard describe new risk types in clear words even when those categories weren't in the training list.
- ProGuard performs on par with strong closed models on simple safe/unsafe checks and clearly beats other open models on detailed risk categorization.
- Most impressively, it improves detection of out-of-distribution (OOD) risks by 52.6% and produces better OOD descriptions by 64.8% compared to prior guards.
- It uses an online reinforcement learning method (GRPO) to keep reasoning concise, so it thinks enough to be accurate but doesn't waste time.
- Data augmentation tricks (like shuffling category IDs and hiding categories) train the model to be flexible and proactive instead of rigid and reactive.
- Human studies show the reward used for naming new categories mostly matches what people think is a good label.
- Overall, ProGuard is a practical step toward safety systems that can keep up with fast-changing, real-world risks.
Why This Research Matters
Online content changes quickly, and new kinds of harmful behavior appear before rulebooks can be updated. A proactive guard like ProGuard helps platforms and products stay ahead by recognizing and describing brand-new risks instead of waiting for manual updates. This reduces harmful exposures users might face in chats, images, and mixed media. It also gives safety teams interpretable signals (short names and reasons) so they can update policies faster. Balanced performance across text and images means fewer blind spots. Over time, this approach can make AI systems safer in schools, social media, creative tools, and customer support: wherever people and AI interact.
Detailed Explanation
01 Background & Problem Definition
🍞 Hook: You know how school safety drills prepare you for different emergencies, like fire drills and earthquake drills? Those are clear and written down. But sometimes a new kind of emergency happens that wasn't in the booklet yet. A good safety team needs to notice new dangers even if the guidebook doesn't mention them.
🥬 The World Before: For years, AI safety "guards" mostly followed a fixed rulebook called a taxonomy: a list of categories like hate speech, self-harm, or dangerous goods. These guards worked like checklists: see a rule → apply it. They did pretty well at saying "safe" or "unsafe," especially for text. But modern AIs understand and generate images too, and the world keeps inventing new problems. Risks pop up that aren't in the old lists, especially in pictures and mixed text+image situations. Without the right category, older guards either mislabel the content or miss the risk entirely.
🥬 The Problem: Existing guards are reactive. They can only react to risks already written into their rulebook. If a new kind of risk appears (say, a medical malpractice instruction phrased in a totally new way, or a risky visual pattern), reactive guards struggle. They either force-fit it into a wrong category or pass it as safe. Also, many guards are very strong for text but weaker for images (this is called modality bias). That's dangerous in a world full of memes, screenshots, and camera feeds.
🥬 Failed Attempts: People tried making the rulebook longer and more detailed. That helped a little but made systems brittle: longer lists are harder to maintain and still miss surprises. Others tried training guards to explain themselves (reasoning), but they often became wordy, slow, or still chained to the same fixed categories. Some approaches fine-tuned base models heavily to be safer, but that can hurt their general abilities.
🥬 The Gap: What was missing was a proactive guard that could (1) balance text and image safety well, (2) decide safe/unsafe, (3) pick a detailed risk category when it exists, and, most importantly, (4) describe a new, unseen risk category when it doesn't exist yet. That last piece is the leap from reactive to proactive.
🥬 Why It Matters (Real Stakes):
- Content online changes fast: new scams, new visual tricks, new harmful trends. A rigid guard gets outdated quickly.
- Safety teams need tools that not only block harm today but also discover and name tomorrow's risks, helping update policies.
- In daily life: picture apps that filter harmful images, chatbots that avoid giving unsafe advice, classrooms using AI safely, and platforms that catch brand-new abuse patterns. A proactive guard keeps people safer without constant manual rule updates.
🍞 Anchor: Imagine a lifeguard who only knows how to rescue swimmers with arm cramps because that's in the handbook. One day, a new kind of floating toy traps a swimmer's arm; the lifeguard has never seen it. A reactive lifeguard might waste time flipping through the book. A proactive lifeguard thinks on the spot: "This is a new kind of entanglement risk," acts fast, and later tells the team, "We need a new category: 'Toy entanglement near ladders.'" ProGuard is that proactive lifeguard for text, images, and both together.
New Concept 1 🍞 Hook: Imagine training a puppy. You give treats when it sits nicely so it learns faster. 🥬 Reinforcement Learning (RL): RL is a way for AI to learn by trying actions and getting rewards for good ones.
- How it works: 1) Try a response, 2) Get a score (reward), 3) Do more of what earns higher scores, 4) Repeat.
- Why it matters: Without RL, the model might copy examples without truly learning to reason or adapt to new risks. 🍞 Anchor: The AI "puppy" gets a treat when it correctly flags unsafe content and explains a new risk clearly.
New Concept 2 🍞 Hook: You know how your backpack sometimes has something brand new in it, like a gadget your teacher hasn't seen before? 🥬 Out-of-Distribution (OOD) Safety Risks: These are risks that don't look like the old examples the model trained on.
- How it works: The system first checks if a risk matches any known category; if not, it decides it's OOD and makes a good name for it.
- Why it matters: Without OOD handling, new dangers slip through or get misfiled. 🍞 Anchor: Seeing "medical malpractice" phrased in a novel way and labeling it "Medical Malpractice" even if that term wasn't in the original list.
New Concept 3 🍞 Hook: Think of a family tree: big branches and then smaller branches. 🥬 Hierarchical Multimodal Safety Taxonomy: A layered safety category tree that covers text, images, and text+image.
- How it works: 11 top-level groups and 28 subgroups capture different risks, tuned for different modalities.
- Why it matters: Without a clear tree, moderation becomes inconsistent across text and images. 🍞 Anchor: "Dangerous Goods" (big branch) includes "Weapons" (smaller branch), which helps target the exact issue.
New Concept 4 🍞 Hook: A light switch is either on or off. Simple and clear. 🥬 Binary Safety Classification: A yes/no judgment: safe or unsafe.
- How it works: The model reads inputs and outputs safe/unsafe for the request and (if present) the response.
- Why it matters: Without this first gate, the system can't prioritize what needs deeper attention. 🍞 Anchor: A harmless weather question is safe; a prompt asking for a harmful plan is unsafe.
New Concept 5 🍞 Hook: If you can't remember the exact word "vehicle," saying "car" still shows you understand. 🥬 Synonym-Bank-Based Similarity Reward: A reward that gives points for using correct or close-meaning words when naming new categories.
- How it works: The model's label is compared to a bank of synonyms using semantic similarity; higher match → higher reward.
- Why it matters: Without this, the model would be punished for small wording differences even when it "gets it." 🍞 Anchor: Labeling an "Unauthorized Access" risk as "System Intrusion" still earns credit.
New Concept 6 🍞 Hook: A smoke alarm doesn't wait for flames; it beeps at early signs. 🥬 Proactive Guard Model: A safety system that not only flags known risks but also detects and describes new ones.
- How it works: Checks safe/unsafe, matches known categories, and invents a sensible name when a category is missing.
- Why it matters: Without proactivity, safety tools fall behind the fast-changing internet. 🍞 Anchor: When no policy item fits, ProGuard says "Environmental Harm" for a new forest-risk question rather than staying silent.
02 Core Idea
🍞 Hook: Imagine a detective who can follow the usual case files but, when faced with a brand-new crime, still figures out what kind it is and writes a helpful note for future detectives.
🥬 The "Aha!" Moment (One Sentence): Teach a safety guard to be proactive by rewarding it for correctly spotting when a risk doesn't fit the rulebook and for inventing a short, sensible name that describes this new risk.
Multiple Analogies (3 Ways):
- Lifeguard: Usual rescues are listed in a manual; a proactive lifeguard invents "new hazard" labels on the fly when the manual is silent.
- Library: If a book doesn't fit any shelf, the librarian creates a temporary label so others can find similar books later.
- GPS: When a new road isn't on the map, a smart GPS still guides you and suggests an update tag for the map makers.
Before vs After:
- Before: Guards matched content to fixed categories; unknown risks were misfiled or ignored, especially in images.
- After: ProGuard balances text and image safety, does safe/unsafe and known-category tagging, and if needed, names a new category in plain words.
- Net Change: Moderation becomes adaptable, interpretable (it explains via short reasoning), and future-ready.
Why It Works (Intuition):
- Rewards shape behavior. By giving points for correct safe/unsafe calls, for choosing the right known category, and for naming OOD risks well (via synonym similarity), the model learns what "good moderation" looks like, even beyond the current rulebook.
- Balanced data across text, images, and text+image stops the model from being great at one and poor at another.
- Data augmentation (like shuffling IDs or hiding categories) trains flexibility: the model learns the meaning of categories, not just their numbers.
- Pure RL (no teacher-forced examples) encourages the model to discover concise, effective reasoning strategies instead of copying long, noisy explanations.
Building Blocks (Sandwich Explanations):
New Concept: Reactive vs Proactive Guards 🍞 Hook: You know how umbrellas only help once it starts raining, but weather alerts can warn you before the storm hits? 🥬 Concept: Reactive guards follow fixed lists; proactive guards detect and describe new risks beyond the list.
- How it works: Reactive → match to known categories; Proactive → if no match, invent a clear, short new label.
- Why it matters: Without proactive behavior, novel risks slide by. 🍞 Anchor: Seeing a never-seen scam style, ProGuard names it "Financial Scam Variant" instead of mislabeling.
New Concept: Modality Bias 🍞 Hook: If you always practice only math and never read stories, your reading scores might lag. 🥬 Concept: Modality bias is when a system is strong on text but weak on images, or vice versa.
- How it works: Imbalanced training data → skewed skill.
- Why it matters: Without fixing this, image risks get missed. 🍞 Anchor: ProGuard's 87K balanced dataset trains on text, image, and text+image equally.
New Concept: OOD Safety Category Inference Task 🍞 Hook: Imagine a quiz where some answer choices are intentionally missing to see if you can still explain the right idea. 🥬 Concept: During training, some categories are hidden so the model must decide "this is OOD" and then propose a short, meaningful category name.
- How it works: The model is rewarded for (1) saying it's OOD and (2) producing a semantically close name.
- Why it matters: Without this task, the model wouldn't learn to handle surprises. 🍞 Anchor: When "Cybersecurity" is removed, the model still outputs "System Hacking Risk."
New Concept: GRPO (Group Relative Policy Optimization) 🍞 Hook: Think of a class contest where your score is based on how well you did compared to your small group. 🥬 Concept: GRPO is an RL method that compares several sampled answers and nudges the policy toward the better ones.
- How it works: Generate a group of candidates, score them, boost the better ones, and keep outputs concise and well-formed using a KL penalty and a format reward.
- Why it matters: Without GRPO, learning stable and concise reasoning is harder. 🍞 Anchor: Among 16 tries, the shorter, correct, well-formatted answer wins more reward and guides training (see the sketch below).
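To make the "compared to your small group" idea concrete, here is a minimal Python sketch of the group-relative scoring at the heart of GRPO. The function names, the toy numbers, and the simplified loss (plain policy gradient plus a KL term, without the ratio clipping a full implementation would use) are illustrative assumptions, not the paper's exact code.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: how much better each sampled answer is
    than its siblings drawn for the same prompt (mean-centered, scaled)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs, ref_logprobs, rewards, kl_coef=0.01):
    """Simplified GRPO objective for one prompt's group of completions.
    logprobs / ref_logprobs: summed token log-probs of each completion under
    the current policy and a frozen reference policy, shape (group_size,)."""
    adv = grpo_advantages(rewards)
    pg = -(adv.detach() * logprobs).mean()   # push mass toward better answers
    kl = (logprobs - ref_logprobs).mean()    # mild pull back toward reference
    return pg + kl_coef * kl

# Toy usage: 16 sampled answers; only a few are correct and well-formatted.
rewards = torch.tensor([2.5, 0.0, 1.0, 2.5] + [0.5] * 12)
logprobs = torch.randn(16, requires_grad=True)
ref_logprobs = logprobs.detach() + 0.1 * torch.randn(16)
grpo_loss(logprobs, ref_logprobs, rewards).backward()
```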
New Concept: Concise Reasoning Traces 🍞 Hook: Good detectives write tight notes: enough to be clear, not so long they waste time. 🥬 Concept: The model writes a short <think> note, then a clean <answer>, concise but complete.
- How it works: Rewards favor correct, well-formatted, not-too-verbose outputs.
- Why it matters: Without concise reasoning, moderation gets slow and messy. 🍞 Anchor: "<think>Risk is violence-related due to threat cues.</think><answer>Request:unsafe … Category: C9</answer>"
🍞 Bottom Bread (Anchor for the Big Idea): Picture a student who not only answers test questions correctly but also, when a brand-new kind of question appears, creates a mini-label for it and explains briefly why. That's ProGuard: accurate, balanced across formats, and able to name the unknowns.
03 Methodology
High-Level Flow: Input (text, image, or both) → Reasoning (<think>) → Safe/Unsafe decisions → Category: known (ID) or new (short name) → Output (<answer>)
Step A: Build a Hierarchical, Multimodal Safety Taxonomy
- What happens: Create 11 top-level categories and 28 subcategories that cover text, image, and text+image risks.
- Why it exists: Without a shared map, different modalities get judged inconsistently.
- Example: Top-level "Dangerous Goods" includes subcategory "Weapons" (see the sketch below).
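As a rough picture of how such a two-level taxonomy could be stored and rendered into the policy text the guard reads, here is a hedged Python sketch. The category fragment, the ID scheme (C1/C1S1, echoing the IDs in the answer examples later on), and the helper function are assumptions for illustration, not the paper's actual taxonomy file.

```python
# Hypothetical fragment of the two-level taxonomy (11 top-level groups and
# 28 subcategories in the paper; only two groups are sketched here).
TAXONOMY = {
    "C1": {"name": "Dangerous Goods", "subs": {"C1S1": "Weapons"}},
    "C2": {"name": "Hate & Harassment", "subs": {"C2S1": "Hate Speech"}},
    # ... remaining groups and subcategories omitted
}

def render_policy(taxonomy: dict, two_level: bool = True) -> str:
    """Flatten the taxonomy into the policy text shown to the guard model,
    at either one or two levels of granularity."""
    lines = []
    for cid, group in taxonomy.items():
        lines.append(f"{cid}: {group['name']}")
        if two_level:
            lines.extend(f"  {sid}: {name}" for sid, name in group["subs"].items())
    return "\n".join(lines)

print(render_policy(TAXONOMY))                    # two-level policy
print(render_policy(TAXONOMY, two_level=False))   # one-level policy
```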
Step B: Construct a Modality-Balanced Dataset (87K)
- What happens: Gather text, image, and text+image samples from many sources; deduplicate; ensure safe/unsafe balance per modality.
- Why it exists: Prevent modality bias so the model is equally good at all input types.
- Example: About 29K text, 28K images, and 30K text+image samples.
New Concept: Majority Voting Annotation 🍞 Hook: Three friends vote on which movie to watch; the choice needs at least two votes. 🥬 Concept: Multiple strong models label each unsafe sample; the final label is kept when at least two agree.
- How it works: Use several LLMs/VLMs, compute agreement (Fleiss' kappa of roughly 0.7), and keep high-agreement labels.
- Why it matters: Without majority voting, labels get noisier and training suffers. 🍞 Anchor: If 2 out of 3 models say "Hate Speech," that's the final tag (sketched in code below).
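A minimal sketch of the "at least two agree" rule. The annotator votes and the handling of disagreements (returning None so the sample can be dropped or re-reviewed) are illustrative assumptions.

```python
from collections import Counter

def majority_label(votes: list[str], min_agreement: int = 2) -> str | None:
    """Keep a category label only if at least `min_agreement` annotator models
    agree; otherwise return None so the sample can be dropped or re-reviewed."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None

print(majority_label(["Hate Speech", "Hate Speech", "Harassment"]))  # Hate Speech
print(majority_label(["Weapons", "Fraud", "Self-Harm"]))             # None
```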
Step C: Data Augmentation for Flexible Policy Understanding
- Structural granularity: Randomly use 1-level or 2-level taxonomy to teach the model to handle both.
- Index shuffling: Randomize which number maps to which category so the model learns meanings, not memorized IDs.
- Category removal: Hide some categories so the model must detect OOD and name it.
- Why it exists: Without these tricks, the model overfits to fixed indices and fails when policies change.
- Example: If "C4 Cybersecurity" is hidden, the model should say "OOD" and propose "System Hacking Risk" (see the sketch below).
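Put together, the three augmentations can be seen as random transformations of the policy attached to each training sample. The sketch below is a hedged illustration: the probabilities, helper names, and the exact way removed categories turn into OOD targets are assumptions, not the paper's implementation.

```python
import random

def augment_policy(categories, p_flat=0.5, p_remove=0.25):
    """categories: list of (name, subcategories) pairs.
    Returns (policy_entries, removed_names) for one training sample."""
    # 1) Structural granularity: sometimes show only the top level.
    two_level = random.random() > p_flat

    # 2) Category removal: hide some categories; matching samples become OOD.
    removed = [c for c in categories if random.random() < p_remove]
    kept = [c for c in categories if c not in removed]

    # 3) Index shuffling: reassign IDs so "C4" never reliably means one thing.
    random.shuffle(kept)
    policy = [
        (f"C{i}", name, subs if two_level else [])
        for i, (name, subs) in enumerate(kept, start=1)
    ]
    return policy, [name for name, _ in removed]

# If "Cybersecurity" ends up in removed_names, the target for a hacking-related
# sample switches from a category index to "OOD + a short descriptive name".
policy, removed_names = augment_policy(
    [("Dangerous Goods", ["Weapons"]), ("Cybersecurity", ["Unauthorized Access"])]
)
```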
New Concept: Online RL with GRPO (Pure RL, No SFT) 🍞 Hook: Learning to ride a bike by practice and feedback instead of just reading the manual. 🥬 Concept: Train directly with RL from the base VLM, skipping supervised teacher traces, to incentivize self-discovered, concise reasoning.
- How it works: Sample a group of outputs, score them, and push the policy toward better, shorter, well-formatted ones.
- Why it matters: Without pure RL, models can become verbose imitators instead of efficient reasoners. 🍞 Anchor: The model's average thinking tokens shrink while accuracy holds steady.
Step D: Reward Design (What Gets Points?)
- Format reward: Must output <think> (reasoning) and <answer> (final labels). Messy format → zeroes out other rewards.
- Safe/unsafe rewards: +1 for correct request safety; +1 for correct response safety (when present).
- Category reward (known categories): Base +0.5 for correct in/out-of-taxonomy judgment; +0.5 for correct known-category index.
- OOD inference reward (new categories): A semantic similarity score using a sentence embedder and a synonym bank, scaled and thresholded for stability.
- Why it exists: Without carefully layered rewards, the model can game the system or ignore OOD naming.
- Example: Predicting "Environmental Harm" for a deforestation-risk prompt yields a good OOD reward (combined with the other rewards as sketched below).
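A hedged sketch of how these layered rewards could be combined into one scalar per sampled answer. The point values follow the list above; the gating behavior of the format reward, the field names, and the `ood_similarity_reward` callable (sketched under the next concept) are assumptions rather than the paper's exact code.

```python
def total_reward(pred: dict, gold: dict, ood_similarity_reward) -> float:
    """pred holds the parsed model answer ('formatted', 'request_safe',
    'response_safe', 'in_taxonomy', 'category_id', 'ood_name'); gold holds the
    reference labels plus an 'ood_synonyms' list for OOD samples."""
    # Format reward acts as a gate: a malformed output earns nothing at all.
    if not pred["formatted"]:
        return 0.0
    r = 0.0
    r += 1.0 if pred["request_safe"] == gold["request_safe"] else 0.0
    if gold.get("response_safe") is not None:   # only when a response is present
        r += 1.0 if pred["response_safe"] == gold["response_safe"] else 0.0
    # Category reward: in/out-of-taxonomy judgment first, then index or OOD name.
    if pred["in_taxonomy"] == gold["in_taxonomy"]:
        r += 0.5
        if gold["in_taxonomy"]:
            r += 0.5 if pred["category_id"] == gold["category_id"] else 0.0
        else:
            r += ood_similarity_reward(pred["ood_name"], gold["ood_synonyms"])
    return r
```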
New Concept: Synonym-Bank Similarity 🍞 Hook: "Coach" and "trainer" mean nearly the same; both should count as correct. 🥬 Concept: Compare the model's new category name against multiple synonyms and give a smooth reward for close meaning.
- How it works: Compute max and mean similarities, use two thresholds to reward coarse-to-fine guesses.
- Why it matters: Without synonym matching, good answers with different wording lose points. 🍞 Anchor: "Unauthorized Access" and "System Intrusion" both earn high similarity (see the sketch below).
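A minimal sketch of such a synonym-bank reward using an off-the-shelf sentence embedder. The embedding model name, the two thresholds, and the blend of max and mean similarity are illustrative assumptions; the paper's exact values and scaling may differ.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence embedder works; this model name is a common default, not
# necessarily the one used in the paper.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def ood_similarity_reward(pred_name: str, synonym_bank: list[str],
                          low: float = 0.5, high: float = 0.8) -> float:
    """Score a predicted OOD category name by semantic closeness to a bank of
    accepted synonyms, combining the best match and the average match."""
    pred_emb = embedder.encode(pred_name, convert_to_tensor=True)
    bank_emb = embedder.encode(synonym_bank, convert_to_tensor=True)
    sims = util.cos_sim(pred_emb, bank_emb)[0]
    max_sim, mean_sim = sims.max().item(), sims.mean().item()
    if max_sim < low:        # not even a coarse match -> no credit
        return 0.0
    if max_sim >= high:      # clearly the right concept -> full credit
        return 1.0
    return 0.5 * (max_sim + mean_sim)   # partial credit in between

reward = ood_similarity_reward(
    "System Intrusion",
    ["Unauthorized Access", "Hacking", "Cybersecurity Breach"],
)
```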
Step E: Training Details
- Base model: Qwen2.5-VL-7B; trainer: Verl; 8×H200 GPUs; GRPO group size (e.g., 16); mild KL regularization to prevent drifting too far from a reference policy.
- Why it exists: Stable, efficient training for reasoning without verbosity.
- Example: Average think tokens around 52 with pure RL.
Step F: Inference (How It Runs Live)
- Input arrives (text/image/both).
- Model writes a brief <think> plan.
- Outputs <answer> with: Request: safe/unsafe; Response: safe/unsafe (if present); Category: known ID (if available) or a short OOD name.
- Why it exists: Clear, machine-checkable outputs for downstream systems and human reviewers.
- Example: "Request: unsafe Response: safe Category: Environmental Harm" (parsed in the sketch below).
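Because the output is machine-checkable, a downstream system can extract the verdict with a few lines of code. This is a hedged sketch assuming the <think>/<answer> layout shown in the examples; the field spellings and error handling are assumptions.

```python
import re

def parse_guard_output(text: str):
    """Extract the reasoning trace and the structured verdict from the
    <think>...</think><answer>...</answer> layout shown in the examples."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not answer:
        return None  # malformed output; treat as a failed check upstream
    body = answer.group(1)
    request = re.search(r"Request\s*:\s*(safe|unsafe)", body, re.IGNORECASE)
    response = re.search(r"Response\s*:\s*(safe|unsafe)", body, re.IGNORECASE)
    category = re.search(r"Category\s*:\s*(.+)", body)
    return {
        "reasoning": think.group(1).strip() if think else "",
        "request": request.group(1).lower() if request else None,
        "response": response.group(1).lower() if response else None,
        "category": category.group(1).strip() if category else None,
    }

out = ("<think>Deforestation request with harmful intent.</think>"
       "<answer>Request: unsafe Response: safe Category: Environmental Harm</answer>")
print(parse_guard_output(out))
# {'reasoning': '...', 'request': 'unsafe', 'response': 'safe',
#  'category': 'Environmental Harm'}
```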
Secret Sauce (What's Clever):
- Turning OOD discovery into a first-class training task with its own reward.
- Using synonym-based similarity so good-but-different labels still get credit.
- Data augmentation that teaches the model to read and adapt to changing policy structures.
- Pure RL that naturally encourages concise, effective reasoning traces.
🍞 Anchor: Like a spelling bee where you also earn points for defining a brand-new word correctly: the scoring nudges you to understand, not just memorize.
04 Experiments & Results
The Test: What did they measure and why?
- Binary Safety Classification (F1): Can the model say safe/unsafe correctly for requests and responses?
- Unsafe Content Categorization (Accuracy): Can it pick the right category for unsafe items across text, image, and text+image?
- OOD Detection (F1): When half the categories are removed, can it tell if a sample is out-of-taxonomy?
- OOD Naming (Reward Mean): When it's OOD, can it invent a short, meaningful category name close to the hidden ground truth (via the synonym reward)?
- Reasoning Efficiency: How many "think" tokens on average?
The Competition: Who was it compared against?
- Open-source LLM/VLM guards: LlamaGuard family, GuardReasoner (LLM and VL), WildGuard, ShieldGemma, MD-Judge, DynaGuard.
- Closed-source strong baselines: Gemini 2.5-Flash and GPT-4o-mini.
The Scoreboard (with Context):
- Binary Safety (Prompt-level): ProGuard-7B reaches about 90 F1 on average across many text benchmarks, like getting an A when others range from B to A. On multimodal prompts (text-image, image), ProGuard is balanced and competitive, indicating it's not "text-only smart."
- Binary Safety (Response-level): ProGuard-7B ranks among the top VLM guards and is especially strong on text-image responses, showing robustness where many models wobble.
- Unsafe Categorization: ProGuard-7B outperforms all LlamaGuard variants across the paper's taxonomy and external ones. Think of it as correctly labeling species in a nature guide, even for pictures, while others guess more.
- OOD Detection: ProGuard improves OOD risk detection F1 by 52.6% over prior open guards. That's like spotting half again as many "brand-new" dangers compared to before.
- OOD Naming: ProGuard boosts the OOD naming reward by 64.8%. In plain terms, its invented labels sound much closer to what humans would call those new categories.
- Reasoning Length: Pure RL training keeps average thinking short (~52 tokens) while maintaining accuracy, like solving math problems with fewer steps but the same correctness.
Surprising Findings:
- Closed models sometimes "force-fit" content into known categories (low OOD F1), suggesting even big models can be overconfident with fixed labels.
- A small ProGuard-3B performs surprisingly close to large closed models on OOD naming reward, showing that the training recipe (not just size) matters.
- Balanced modality training is crucial: when image data is scarce, image safety nearly collapses. The balanced set stabilizes training across modalities.
🍞 Anchors (Concrete Mini-Examples):
- OOD success: Given a risky forest prompt with missing "environment" categories, ProGuard still outputs "Environmental Harm," earning high similarity reward.
- Concise reasoning: "<think>Image shows a sharp blade near a hand → risk of injury.</think><answer>Request:unsafe Category:C1S1</answer>" shows short, precise logic.
- Cross-taxonomy generalization: Even when evaluated under totally different label sets (Aegis2.0, LlavaGuard), ProGuard remains strong, proving it learned the concepts, not just the labels.
05 Discussion & Limitations
Limitations (Be Specific):
- Not perfect at OOD: Even with big gains, reliably detecting every new risk is still hard; some subtle, culture-specific, or context-heavy cases can slip by.
- Category naming can be vague: Short OOD names may sometimes be over-broad (e.g., "Safety Risk") instead of precise.
- Reward shaping dependency: Performance depends on the synonym bank and thresholds; poor synonyms could misguide training.
- Scale vs. ceiling: While ProGuard narrows the gap with closed models, very large proprietary systems still lead in some metrics.
Required Resources:
- Compute: 8×H200 GPUs for online RL training; smaller variants exist but expect slower iterations.
- Data: A balanced, quality-checked, multimodal dataset; majority voting annotations; synonym banks per taxonomy.
- Software: A stable GRPO/Verl training stack; a sentence-embedding model for similarity rewards.
When NOT to Use:
- High-stakes, zero-error scenarios (e.g., immediate medical triage) without human oversight; proactive suggestions must be reviewed.
- Domains with rapidly changing, highly specialized jargon unless you maintain and expand the synonym bank and taxonomy.
- Very low-resource deployments where RL fine-tuning or synonym embeddings are infeasible.
Open Questions:
- Adaptive synonym banks: How to auto-expand synonym lists safely and reduce human curation?
- Continual learning: How to let the guard update its categories over time without forgetting old knowledge?
- Multilingual robustness: How well does OOD naming work across languages and cultures?
- Video and audio: Can the same proactive approach handle temporal signals and speech reliably?
- Human-in-the-loop: What's the best workflow to convert the model's new category names into vetted policy updates quickly?
06 Conclusion & Future Work
3-Sentence Summary: ProGuard is a proactive multimodal guard that balances text and image safety, decides safe/unsafe, and, crucially, names new risks that aren't in the rulebook. It's trained with pure reinforcement learning, plus clever data augmentation and a synonym-based reward, so it learns to be flexible and concise. In tests, it matches or beats strong baselines on many tasks and dramatically improves out-of-distribution risk detection and description.
Main Achievement: Turning OOD discovery into a first-class, rewarded skill so the model doesn't just follow policies: it helps expand them by proposing clear labels for unseen risks.
Future Directions: Automatically grow the synonym bank and make it multilingual, extend proactive moderation to audio/video, and add human-in-the-loop tools that turn the model's proposed categories into vetted policy updates. Explore continual learning so ProGuard evolves safely over time without retraining from scratch.
Why Remember This: Safety isn't just about catching yesterday's problems; it's about keeping up with tomorrow's. ProGuard shows how to teach AI to spot and describe new dangers across text and images, giving safety teams a head start instead of a late reaction.
Practical Applications
- Early warning for new scam styles in customer support chats by flagging and naming previously unseen fraud patterns.
- Content moderation on social platforms that adapts to new meme formats or risky visual trends without waiting for manual category updates.
- Safer creative tools (image or text generators) that proactively refuse and label emerging unsafe prompts with concise explanations.
- Enterprise compliance filters that generalize to evolving policies across regions, providing OOD category names for legal review.
- Education platforms that screen student submissions (text/images) and clearly explain novel risk categories to teachers.
- Marketplace listing checks that catch new kinds of prohibited items or deceptive listings with transparent category suggestions.
- Healthcare-facing assistants that avoid unsafe medical guidance and label new malpractice-like risks for expert follow-up.
- Cybersecurity triage that flags previously unseen intrusion instructions and suggests a provisional category for SOC analysts.
- Parental controls that recognize new risky challenges or harmful trends in kids' content and provide understandable labels.
- R&D safety dashboards that aggregate OOD category suggestions to guide policy expansion and dataset improvements.