Papers (3)

#safety alignment

DeepSight: An All-in-One LM Safety Toolkit

DeepSight is a free, all-in-one safety toolkit that both tests how models behave (DeepSafe) and peeks inside how they think (DeepScan).

#LLM safety evaluation #multimodal safety #frontier AI risks


Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

Intermediate

Weixun Wang, XiaoXiao Xu et al. · Dec 31 · arXiv

This paper builds an open, end-to-end ecosystem (ALE) that lets AI agents plan, act, and fix their own mistakes across many steps in real computer environments.

#agentic LLMs #reinforcement learning #IPA


OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Intermediate

Xiaojun Jia, Jie Liao et al. · Dec 6 · arXiv

OmniSafeBench-MM is a one-stop, open-source test bench that fairly compares how multimodal AI models get tricked (jailbroken) and how well different defenses stop that.

#multimodal large language models #jailbreak attacks #safety alignment
