Papers2

#guard models

A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

Gonzalo Ariel Meyoyan, Luciano Del CorroJan 19arXiv

This paper shows how to add a tiny helper (a probe) to a big language model so it can classify things like safety or sentiment during the same pass it already does to answer you.

#LLM orchestration#single-pass classification#hidden-state probing

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Intermediate

Xiaojun Jia, Jie Liao et al.Dec 6arXiv

OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.

#multimodal large language models#jailbreak attacks#safety alignment