
ProGuard: Towards Proactive Multimodal Safeguard

Intermediate
Shaohan Yu, Lijun Li, Chenyang Si et al. · 12/29/2025
arXiv · PDF

Key Summary

  • ProGuard is a safety guard for text and images that doesn’t just spot known problems—it can also recognize and name new, never-seen-before risks.
  • Instead of memorizing fixed rule lists, ProGuard learns through reinforcement learning, like getting points for good decisions and learning from mistakes.
  • The team built a balanced dataset of 87K examples across text, images, and text+image to avoid being great at one modality and weak at another.
  • A special “synonym bank” reward helps ProGuard describe new risk types in clear words even when those categories weren’t in the training list.
  • ProGuard performs on par with strong closed models on simple safe/unsafe checks and clearly beats other open models on detailed risk categorization.
  • Most impressively, it improves detection of out-of-distribution (OOD) risks by 52.6% and produces better OOD descriptions by 64.8% compared to prior guards.
  • It uses an online reinforcement learning method (GRPO) to keep reasoning concise, so it thinks enough to be accurate but doesn’t waste time.
  • Data augmentation tricks (like shuffling category IDs and hiding categories) train the model to be flexible and proactive instead of rigid and reactive.
  • Human studies show the reward used for naming new categories mostly matches what people think is a good label.
  • Overall, ProGuard is a practical step toward safety systems that can keep up with fast-changing, real-world risks.

Why This Research Matters

Online content changes quickly, and new kinds of harmful behavior appear before rulebooks can be updated. A proactive guard like ProGuard helps platforms and products stay ahead by recognizing and describing brand-new risks instead of waiting for manual updates. This reduces harmful exposures users might face in chats, images, and mixed media. It also gives safety teams interpretable signals—short names and reasons—so they can update policies faster. Balanced performance across text and images means fewer blind spots. Over time, this approach can make AI systems safer in schools, social media, creative tools, and customer support—wherever people and AI interact.

Detailed Explanation


01 Background & Problem Definition

🍞 Hook: You know how school safety drills prepare you for different emergencies, like fire drills and earthquake drills? Those are clear and written down. But sometimes a new kind of emergency happens that wasn't in the booklet yet. A good safety team needs to notice new dangers even if the guidebook doesn’t mention them.

🥬 The World Before: For years, AI safety “guards” mostly followed a fixed rulebook called a taxonomy—a list of categories like hate speech, self-harm, or dangerous goods. These guards worked like checklists: see a rule → apply it. They did pretty well at saying “safe” or “unsafe,” especially for text. But modern AIs understand and generate images too, and the world keeps inventing new problems. Risks pop up that aren’t in the old lists, especially in pictures and mixed text+image situations. Without the right category, older guards either mislabel the content or miss the risk entirely.

🥬 The Problem: Existing guards are reactive. They can only react to risks already written into their rulebook. If a new kind of risk appears—say, a medical malpractice instruction phrased in a totally new way or a risky visual pattern—reactive guards struggle. They either force-fit it into a wrong category or pass it as safe. Also, many guards are very strong for text, but weaker for images (this is called modality bias). That’s dangerous in a world full of memes, screenshots, and camera feeds.

🥬 Failed Attempts: People tried making the rulebook longer and more detailed. That helped a little but made systems brittle: longer lists are harder to maintain and still miss surprises. Others tried training guards to explain themselves (reasoning), but they often became wordy, slow, or still chained to the same fixed categories. Some approaches fine-tuned base models heavily to be safer, but that can hurt their general abilities.

🥬 The Gap: What was missing was a proactive guard that could (1) balance text and image safety well, (2) decide safe/unsafe, (3) pick a detailed risk category when it exists, and—most importantly—(4) describe a new, unseen risk category when it doesn’t exist yet. That last piece is the leap from reactive to proactive.

🥬 Why It Matters (Real Stakes):

  • Content online changes fast—new scams, new visual tricks, new harmful trends. A rigid guard gets outdated quickly.
  • Safety teams need tools that not only block harm today but also discover and name tomorrow’s risks, helping update policies.
  • In daily life: picture apps that filter harmful images, chatbots that avoid giving unsafe advice, classrooms using AI safely, and platforms that catch brand-new abuse patterns. A proactive guard keeps people safer without constant manual rule updates.

🍞 Anchor: Imagine a lifeguard who only knows how to rescue swimmers with arm cramps because that’s in the handbook. One day, a new kind of floating toy traps a swimmer’s arm—the lifeguard has never seen it. A reactive lifeguard might waste time flipping through the book. A proactive lifeguard thinks on the spot: “This is a new kind of entanglement risk,” acts fast, and later tells the team, “We need a new category: ‘Toy entanglement near ladders.’” ProGuard is that proactive lifeguard for text, images, and both together.

New Concept 1 🍞 Hook: Imagine training a puppy. You give treats when it sits nicely so it learns faster. 🥬 Reinforcement Learning (RL): RL is a way for AI to learn by trying actions and getting rewards for good ones.

  • How it works: 1) Try a response, 2) Get a score (reward), 3) Do more of what earns higher scores, 4) Repeat.
  • Why it matters: Without RL, the model might copy examples without truly learning to reason or adapt to new risks. 🍞 Anchor: The AI “puppy” gets a treat when it correctly flags unsafe content and explains a new risk clearly.

New Concept 2 🍞 Hook: You know how your backpack sometimes has something brand new in it—like a gadget your teacher hasn’t seen before? 🥬 Out-of-Distribution (OOD) Safety Risks: These are risks that don’t look like the old examples the model trained on.

  • How it works: The system first checks if a risk matches any known category; if not, it decides it’s OOD and makes a good name for it.
  • Why it matters: Without OOD handling, new dangers slip through or get misfiled. 🍞 Anchor: Seeing “medical malpractice” phrased in a novel way and labeling it “Medical Malpractice” even if that term wasn’t in the original list.

New Concept 3 🍞 Hook: Think of a family tree—big branches and then smaller branches. 🥬 Hierarchical Multimodal Safety Taxonomy: A layered safety category tree that covers text, images, and text+image.

  • How it works: 11 top-level groups and 28 subgroups capture different risks, tuned for different modalities.
  • Why it matters: Without a clear tree, moderation becomes inconsistent across text and images. 🍞 Anchor: “Dangerous Goods” (big branch) includes “Weapons” (smaller branch), which helps target the exact issue.

New Concept 4 🍞 Hook: A light switch is either on or off. Simple and clear. 🥬 Binary Safety Classification: A yes/no judgment: safe or unsafe.

  • How it works: The model reads inputs and outputs safe/unsafe for the request and (if present) the response.
  • Why it matters: Without this first gate, the system can’t prioritize what needs deeper attention. 🍞 Anchor: A harmless weather question is safe; a prompt asking for a harmful plan is unsafe.

New Concept 5 🍞 Hook: If you can’t remember the exact word “vehicle,” saying “car” still shows you understand. 🥬 Synonym-Bank-Based Similarity Reward: A reward that gives points for using correct or close-meaning words when naming new categories.

  • How it works: The model’s label is compared to a bank of synonyms using semantic similarity; higher match → higher reward.
  • Why it matters: Without this, the model would be punished for small wording differences even when it “gets it.” 🍞 Anchor: Labeling “Unauthorized Access” as “System Intrusion” still earns credit.

New Concept 6 🍞 Hook: A smoke alarm doesn’t wait for flames—it beeps at early signs. 🥬 Proactive Guard Model: A safety system that not only flags known risks but also detects and describes new ones.

  • How it works: Checks safe/unsafe, matches known categories, and invents a sensible name when a category is missing.
  • Why it matters: Without proactivity, safety tools fall behind the fast-changing internet. 🍞 Anchor: When no policy item fits, ProGuard says “Environmental Harm” for a new forest-risk question rather than staying silent.

02 Core Idea

🍞 Hook: Imagine a detective who can follow the usual case files but, when faced with a brand-new crime, still figures out what kind it is and writes a helpful note for future detectives.

🥬 The “Aha!” Moment (One Sentence): Teach a safety guard to be proactive by rewarding it for correctly spotting when a risk doesn’t fit the rulebook and for inventing a short, sensible name that describes this new risk.

Multiple Analogies (3 Ways):

  1. Lifeguard: Usual rescues are listed in a manual; proactive lifeguard invents “new hazard” labels on the fly when the manual is silent.
  2. Library: If a book doesn’t fit any shelf, the librarian creates a temporary label so others can find similar books later.
  3. GPS: When a new road isn’t on the map, a smart GPS still guides you and suggests an update tag for the map makers.

Before vs After:

  • Before: Guards matched content to fixed categories; unknown risks were misfiled or ignored, especially in images.
  • After: ProGuard balances text and image safety, does safe/unsafe and known-category tagging, and if needed, names a new category in plain words.
  • Net Change: Moderation becomes adaptable, interpretable (it explains via short reasoning), and future-ready.

Why It Works (Intuition):

  • Rewards shape behavior. By giving points for correct safe/unsafe calls, for choosing the right known category, and for naming OOD risks well (via synonym similarity), the model learns what “good moderation” looks like—even beyond the current rulebook.
  • Balanced data across text, images, and text+image stops the model from being great at one and poor at another.
  • Data augmentation (like shuffling IDs or hiding categories) trains flexibility: the model learns the meaning of categories, not just their numbers.
  • Pure RL (no teacher-forced examples) encourages the model to discover concise, effective reasoning strategies instead of copying long, noisy explanations.

Building Blocks (Sandwich Explanations):

New Concept: Reactive vs Proactive Guards 🍞 Hook: You know how umbrellas only help once it starts raining, but weather alerts can warn you before the storm hits? 🥬 Concept: Reactive guards follow fixed lists; proactive guards detect and describe new risks beyond the list.

  • How it works: Reactive → match to known categories; Proactive → if no match, invent a clear, short new label.
  • Why it matters: Without proactive behavior, novel risks slide by. 🍞 Anchor: Seeing a never-seen scam style, ProGuard names it “Financial Scam Variant” instead of mislabeling.

New Concept: Modality Bias 🍞 Hook: If you always practice only math and never read stories, your reading scores might lag. 🥬 Concept: Modality bias is when a system is strong on text but weak on images, or vice versa.

  • How it works: Imbalanced training data → skewed skill.
  • Why it matters: Without fixing this, image risks get missed. 🍞 Anchor: ProGuard’s 87K balanced dataset trains on text, image, and text+image equally.

New Concept: OOD Safety Category Inference Task 🍞 Hook: Imagine a quiz where some answer choices are intentionally missing to see if you can still explain the right idea. 🥬 Concept: During training, some categories are hidden so the model must decide “this is OOD” and then propose a short, meaningful category name.

  • How it works: The model is rewarded for (1) saying it’s OOD and (2) producing a semantically close name.
  • Why it matters: Without this task, the model wouldn’t learn to handle surprises. 🍞 Anchor: When “Cybersecurity” is removed, the model still outputs “System Hacking Risk.”

New Concept: GRPO (Group Relative Policy Optimization) 🍞 Hook: Think of a class contest where your score is based on how well you did compared to your small group. 🥬 Concept: GRPO is an RL method that compares several sampled answers and nudges the policy toward the better ones.

  • How it works: Generate a group of candidates, score them, boost the better ones, and keep outputs concise using a KL penalty and format rewards.
  • Why it matters: Without GRPO, learning stable and concise reasoning is harder. 🍞 Anchor: Among 16 tries, the shorter, correct, well-formatted answer wins more reward and guides training.

New Concept: Concise Reasoning Traces 🍞 Hook: Good detectives write tight notes—enough to be clear, not so long they waste time. 🥬 Concept: The model writes a short <think> note, then a clean <answer>—concise but complete.

  • How it works: Rewards favor correct, well-formatted, not-too-verbose outputs.
  • Why it matters: Without concise reasoning, moderation gets slow and messy. 🍞 Anchor: “<think>Risk is violence-related due to threat cues.</think><answer>Request: unsafe Category: C9</answer>”

🍞 Bottom Bread (Anchor for the Big Idea): Picture a student who not only answers test questions correctly but also, when a brand-new kind of question appears, creates a mini-label for it and explains briefly why. That’s ProGuard: accurate, balanced across formats, and able to name the unknowns.

03 Methodology

High-Level Flow: Input (text, image, or both) → Reasoning (<think>) → Safe/Unsafe decisions → Category: known (ID) or new (short name) → Output (<answer>)

Step A: Build a Hierarchical, Multimodal Safety Taxonomy

  • What happens: Create 11 top-level and 28 subcategories that cover text, image, and text+image risks.
  • Why it exists: Without a shared map, different modalities get judged inconsistently.
  • Example: Top-level “Dangerous Goods” includes subcategory “Weapons.”
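To make Step A concrete, here is a minimal sketch of how a two-level taxonomy like this could be stored and rendered into a policy prompt. Only the “Dangerous Goods” → “Weapons” pair comes from the paper’s example; the other names, the IDs, and the helper function are hypothetical illustrations, not the authors’ data format.

```python
# A minimal sketch of a two-level safety taxonomy as a plain dict.
# Only "Dangerous Goods" -> "Weapons" comes from the paper's example;
# the other entries and the ID scheme are hypothetical placeholders.
TAXONOMY = {
    "C1": {"name": "Dangerous Goods", "subcategories": {"C1S1": "Weapons", "C1S2": "Explosives"}},
    "C2": {"name": "Hate Speech",     "subcategories": {"C2S1": "Slurs", "C2S2": "Incitement"}},
    # ... 11 top-level groups and 28 subgroups in total
}

def render_policy(taxonomy, levels=2):
    """Render the taxonomy as prompt text at 1- or 2-level granularity."""
    lines = []
    for cid, cat in taxonomy.items():
        lines.append(f"{cid}: {cat['name']}")
        if levels == 2:
            lines += [f"  {sid}: {sname}" for sid, sname in cat["subcategories"].items()]
    return "\n".join(lines)
```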

Step B: Construct a Modality-Balanced Dataset (87K)

  • What happens: Gather text, image, and text+image samples from many sources; deduplicate; ensure safe/unsafe balance per modality.
  • Why it exists: Prevent modality bias so the model is equally good at all input types.
  • Example: About 29K text, 28K images, and 30K text+image samples.

New Concept: Majority Voting Annotation 🍞 Hook: Three friends vote on which movie to watch; the choice needs at least two votes. 🥬 Concept: Multiple strong models label each unsafe sample; the final label is kept when at least two agree.

  • How it works: Use several LLMs/VLMs, compute agreement (Fleiss’ kappa ≈ 0.7), and keep high-agreement labels.
  • Why it matters: Without majority voting, labels get noisier and training suffers. 🍞 Anchor: If 2 out of 3 models say “Hate Speech,” that’s the final tag.
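A minimal sketch of the majority-voting step, assuming three annotator models and a two-vote threshold; the helper below is illustrative, not the authors’ annotation pipeline.

```python
from collections import Counter

def majority_label(labels, min_votes=2):
    """Keep a category label only if enough annotator models agree.

    `labels` is the list of category names proposed by the annotator models
    (e.g., three LLM/VLM judges); returns None when agreement is below the
    threshold, so the sample can be dropped or sent for re-review.
    """
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None

# e.g. majority_label(["Hate Speech", "Hate Speech", "Harassment"]) -> "Hate Speech"
```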

Step C: Data Augmentation for Flexible Policy Understanding

  • Structural granularity: Randomly use 1-level or 2-level taxonomy to teach the model to handle both.
  • Index shuffling: Randomize which number maps to which category so the model learns meanings, not memorized IDs.
  • Category removal: Hide some categories so the model must detect OOD and name it.
  • Why it exists: Without these tricks, the model overfits to fixed indices and fails when policies change.
  • Example: If “C4 Cybersecurity” is hidden, the model should say “OOD” and propose “System Hacking Risk.”
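The index-shuffling and category-removal tricks above can be sketched in a few lines. The removal probability, ID format, and function name below are assumptions for illustration, not the paper’s exact augmentation code.

```python
import random

def augment_policy(categories, p_remove=0.3, seed=None):
    """Toy version of two of the augmentations described above.

    `categories` is a list of category names. We (1) hide a random subset so
    matching samples become OOD cases, and (2) shuffle which index maps to
    which surviving category so the model cannot memorize fixed IDs.
    """
    rng = random.Random(seed)
    kept = [c for c in categories if rng.random() > p_remove]   # category removal
    rng.shuffle(kept)                                           # index shuffling
    return {f"C{i + 1}": name for i, name in enumerate(kept)}

# If "Cybersecurity" was removed, a hacking prompt should now be labeled OOD
# with a proposed name such as "System Hacking Risk".
```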

New Concept: Online RL with GRPO (Pure RL, No SFT) 🍞 Hook: Learning to ride a bike by practice and feedback instead of just reading the manual. 🥬 Concept: Train directly with RL from the base VLM, skipping supervised teacher traces, to incentivize self-discovered, concise reasoning.

  • How it works: Sample a group of outputs, score them, and push the policy toward better, shorter, well-formatted ones.
  • Why it matters: Without pure RL, models can become verbose imitators instead of efficient reasoners. 🍞 Anchor: The model’s average thinking tokens shrink while accuracy holds steady.
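A simplified sketch of the group-relative scoring at the heart of GRPO. The clipped policy-ratio objective and the KL penalty toward a reference model are omitted, and the function below is an illustration rather than the authors’ training code.

```python
import numpy as np

def group_relative_advantages(rewards):
    """Group-relative advantages in the spirit of GRPO (simplified sketch).

    `rewards` holds the scalar reward of each of the G sampled answers to the
    same prompt. Each answer's advantage is its reward minus the group mean,
    normalized by the group standard deviation, so better-than-average answers
    get reinforced and worse-than-average answers get discouraged.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g. group_relative_advantages([1.0, 2.5, 0.0, 2.5]) favors the two 2.5 answers
```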

Step D: Reward Design (What Gets Points?)

  • Format reward: Must output <think> (reasoning) and <answer> (final labels). Messy format → zeroes out other rewards.
  • Safe/unsafe rewards: +1 for correct request safety; +1 for correct response safety (when present).
  • Category reward (known categories): Base +0.5 for correct in/out-of-taxonomy judgment; +0.5 for correct known-category index.
  • OOD inference reward (new categories): A semantic similarity score using a sentence embedder and a synonym bank—scaled and thresholded for stability.
  • Why it exists: Without carefully layered rewards, the model can game the system or ignore OOD naming.
  • Example: Predicting “Environmental Harm” for a deforestation-risk prompt yields a good OOD reward.
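Putting the layered rewards together, a hedged sketch might look like the following. The exact weights, field names, and gating logic are assumptions based on the description above, not the paper’s implementation.

```python
def total_reward(pred, gold, ood_similarity=None):
    """Illustrative composition of the layered rewards described above.

    `pred` and `gold` are dicts with keys like "format_ok", "request_safe",
    "response_safe", "in_taxonomy", and "category_id"; the weights and field
    names are assumptions for illustration only.
    """
    if not pred["format_ok"]:                       # format gate: bad format zeroes everything
        return 0.0
    r = 0.0
    r += 1.0 if pred["request_safe"] == gold["request_safe"] else 0.0
    if gold.get("response_safe") is not None:       # only scored when a response is present
        r += 1.0 if pred.get("response_safe") == gold["response_safe"] else 0.0
    r += 0.5 if pred["in_taxonomy"] == gold["in_taxonomy"] else 0.0
    if gold["in_taxonomy"]:
        r += 0.5 if pred.get("category_id") == gold.get("category_id") else 0.0
    elif ood_similarity is not None:                # OOD case: synonym-bank similarity score
        r += ood_similarity
    return r
```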

New Concept: Synonym-Bank Similarity 🍞 Hook: “Coach” and “trainer” mean nearly the same; both should count as correct. 🥬 Concept: Compare the model’s new category name against multiple synonyms and give a smooth reward for close meaning.

  • How it works: Compute max and mean similarities, use two thresholds to reward coarse-to-fine guesses.
  • Why it matters: Without synonym matching, good answers with different wording lose points. 🍞 Anchor: “Unauthorized Access” and “System Intrusion” both earn high similarity.
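A rough sketch of the synonym-bank similarity reward. The thresholds, the max/mean mix, and the `embed` placeholder (any sentence-embedding function that returns a vector) are assumptions for illustration, not the paper’s exact formula.

```python
import numpy as np

def ood_name_reward(pred_name, synonym_bank, embed, low=0.5, high=0.8):
    """Sketch of the synonym-bank similarity reward (thresholds are assumptions).

    `synonym_bank` lists acceptable names for the hidden category, e.g.
    ["Unauthorized Access", "System Intrusion"]; `embed` is any sentence
    embedder mapping a string to a vector.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    p = embed(pred_name)
    sims = [cos(p, embed(s)) for s in synonym_bank]
    score = 0.5 * max(sims) + 0.5 * float(np.mean(sims))
    if score >= high:          # clearly the same concept: full credit
        return 1.0
    if score >= low:           # coarse but related guess: partial credit
        return 0.5
    return 0.0
```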

Step E: Training Details

  • Base model: Qwen2.5-VL-7B; trainer: Verl; 8×H200 GPUs; GRPO group size (e.g., 16); mild KL regularization to prevent drifting too far from a reference policy.
  • Why it exists: Stable, efficient training for reasoning without verbosity.
  • Example: Average think tokens of about 52 with pure RL.

Step F: Inference (How It Runs Live)

  • Input arrives (text/image/both).
  • Model writes a brief <think> plan.
  • Outputs <answer> with: Request: safe/unsafe; Response: safe/unsafe (if present); Category: known ID (if available) or a short OOD name.
  • Why it exists: Clear, machine-checkable outputs for downstream systems and human reviewers.
  • Example: “Request: unsafe Response: safe Category: Environmental Harm.”
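Downstream systems need to read the <think>/<answer> output reliably. Here is a hedged parsing sketch based on the example formats shown above; the field names and regexes are assumptions, not the authors’ exact parser.

```python
import re

def parse_guard_output(text):
    """Parse the <think>/<answer> format into fields a pipeline can act on."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    if not answer:
        return None                      # the format reward would be zero here
    body = answer.group(1)
    req = re.search(r"Request\s*:\s*(safe|unsafe)", body, re.I)
    resp = re.search(r"Response\s*:\s*(safe|unsafe)", body, re.I)
    cat = re.search(r"Category\s*:\s*(.+)", body)
    return {
        "reasoning": think.group(1).strip() if think else "",
        "request": req.group(1).lower() if req else None,
        "response": resp.group(1).lower() if resp else None,  # None when no response was judged
        "category": cat.group(1).strip() if cat else None,    # known ID (e.g. "C9") or OOD name
    }

# parse_guard_output("<think>...</think><answer>Request: unsafe Category: Environmental Harm</answer>")
```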

Secret Sauce (What’s Clever):

  • Turning OOD discovery into a first-class training task with its own reward.
  • Using synonym-based similarity so good-but-different labels still get credit.
  • Data augmentation that teaches the model to read and adapt to changing policy structures.
  • Pure RL that naturally encourages concise, effective reasoning traces.

🍞 Anchor: Like a spelling bee where you also earn points for defining a brand-new word correctly—the scoring nudges you to understand, not just memorize.

04 Experiments & Results

The Test: What did they measure and why?

  • Binary Safety Classification (F1): Can the model say safe/unsafe correctly for requests and responses?
  • Unsafe Content Categorization (Accuracy): Can it pick the right category for unsafe items across text, image, and text+image?
  • OOD Detection (F1): When half the categories are removed, can it tell if a sample is out-of-taxonomy?
  • OOD Naming (Reward Mean): When it’s OOD, can it invent a short, meaningful category name close to the hidden ground truth (via the synonym reward)?
  • Reasoning Efficiency: How many “think” tokens on average?

The Competition: Who was it compared against?

  • Open-source LLM/VLM guards: LlamaGuard family, GuardReasoner (LLM and VL), WildGuard, ShieldGemma, MD-Judge, DynaGuard.
  • Closed-source strong baselines: Gemini 2.5-Flash and GPT-4o-mini.

The Scoreboard (with Context):

  • Binary Safety (Prompt-level): ProGuard-7B reaches about 90 F1 on average across many text benchmarks—like getting an A when others range from B to A. On multimodal prompts (text-image, image), ProGuard is balanced and competitive, indicating it’s not “text-only smart.”
  • Binary Safety (Response-level): ProGuard-7B ranks among the top VLM guards and is especially strong on text-image responses, showing robustness where many models wobble.
  • Unsafe Categorization: ProGuard-7B outperforms all LlamaGuard variants across the paper’s taxonomy and external ones. Think of it as correctly labeling species in a nature guide—even for pictures—while others guess more.
  • OOD Detection: ProGuard improves OOD risk detection F1 by 52.6% over prior open guards. That’s like spotting half again as many “brand-new” dangers compared to before.
  • OOD Naming: ProGuard boosts the OOD naming reward by 64.8%. In plain terms, its invented labels sound much closer to what humans would call those new categories.
  • Reasoning Length: Pure RL training keeps average thinking short (~52 tokens) while maintaining accuracy—like solving math problems with fewer steps but the same correctness.

Surprising Findings:

  • Closed models sometimes “force-fit” content into known categories (low OOD-F1), suggesting even big models can be overconfident with fixed labels.
  • A small ProGuard-3B performs surprisingly close to large closed models on OOD naming reward, showing that the training recipe (not just size) matters.
  • Balanced modality training is crucial: when image data is scarce, image safety nearly collapses. The balanced set stabilizes training across modalities.

🍞 Anchors (Concrete Mini-Examples):

  • OOD success: Given a risky forest prompt with missing “environment” categories, ProGuard still outputs “Environmental Harm,” earning high similarity reward.
  • Concise reasoning: “<think>Image shows a sharp blade near a hand → risk of injury.</think><answer>Request:unsafe Category:C1S1</answer>” shows short, precise logic.
  • Cross-taxonomy generalization: Even when evaluated under totally different label sets (Aegis2.0, LlavaGuard), ProGuard remains strong, proving it learned the concepts, not just the labels.

05 Discussion & Limitations

Limitations (Be Specific):

  • Not perfect at OOD: Even with big gains, reliably detecting every new risk is still hard—some subtle, culture-specific, or context-heavy cases can slip by.
  • Category naming can be vague: Short OOD names may sometimes be over-broad (e.g., “Safety Risk”) instead of precise.
  • Reward shaping dependency: Performance depends on the synonym bank and thresholds; poor synonyms could misguide training.
  • Scale vs. ceiling: While ProGuard narrows the gap with closed models, very large proprietary systems still lead in some metrics.

Required Resources:

  • Compute: 8×H200 GPUs for online RL training; smaller variants exist but expect slower iterations.
  • Data: A balanced, quality-checked, multimodal dataset; majority voting annotations; synonym banks per taxonomy.
  • Software: A stable GRPO/Verl training stack; a sentence-embedding model for similarity rewards.

When NOT to Use:

  • High-stakes, zero-error scenarios (e.g., immediate medical triage) without human oversight; proactive suggestions must be reviewed.
  • Domains with rapidly changing, highly specialized jargon unless you maintain and expand the synonym bank and taxonomy.
  • Very low-resource deployments where RL fine-tuning or synonym embeddings are infeasible.

Open Questions:

  • Adaptive synonym banks: How to auto-expand synonym lists safely and reduce human curation?
  • Continual learning: How to let the guard update its categories over time without forgetting old knowledge?
  • Multilingual robustness: How well does OOD naming work across languages and cultures?
  • Video and audio: Can the same proactive approach handle temporal signals and speech reliably?
  • Human-in-the-loop: What’s the best workflow to convert the model’s new category names into vetted policy updates quickly?

06 Conclusion & Future Work

3-Sentence Summary: ProGuard is a proactive multimodal guard that balances text and image safety, decides safe/unsafe, and—crucially—names new risks that aren’t in the rulebook. It’s trained with pure reinforcement learning, plus clever data augmentation and a synonym-based reward so it learns to be flexible and concise. In tests, it matches or beats strong baselines on many tasks and dramatically improves out-of-distribution risk detection and description.

Main Achievement: Turning OOD discovery into a first-class, rewarded skill so the model doesn’t just follow policies—it helps expand them by proposing clear labels for unseen risks.

Future Directions: Grow the synonym bank automatically and across languages, extend proactive moderation to audio/video, and add human-in-the-loop tools that turn the model’s proposed categories into vetted policy updates. Explore continual learning so ProGuard evolves safely over time without retraining from scratch.

Why Remember This: Safety isn’t just about catching yesterday’s problems; it’s about keeping up with tomorrow’s. ProGuard shows how to teach AI to spot and describe new dangers across text and images, giving safety teams a head start instead of a late reaction.

Practical Applications

  • Early warning for new scam styles in customer support chats by flagging and naming previously unseen fraud patterns.
  • Content moderation on social platforms that adapts to new meme formats or risky visual trends without waiting for manual category updates.
  • Safer creative tools (image or text generators) that proactively refuse and label emerging unsafe prompts with concise explanations.
  • Enterprise compliance filters that generalize to evolving policies across regions, providing OOD category names for legal review.
  • Education platforms that screen student submissions (text/images) and clearly explain novel risk categories to teachers.
  • Marketplace listing checks that catch new kinds of prohibited items or deceptive listings with transparent category suggestions.
  • Healthcare-facing assistants that avoid unsafe medical guidance and label new malpractice-like risks for expert follow-up.
  • Cybersecurity triage that flags previously unseen intrusion instructions and suggests a provisional category for SOC analysts.
  • Parental controls that recognize new risky challenges or harmful trends in kids’ content and provide understandable labels.
  • R&D safety dashboards that aggregate OOD category suggestions to guide policy expansion and dataset improvements.
#proactive safety #multimodal moderation #out-of-distribution detection #reinforcement learning #GRPO #synonym-based reward #hierarchical taxonomy #vision-language models #content safety #reasoning traces #modality balance #data augmentation #OOD naming #binary safety classification #policy generalization