Papers (2)

#false positive rate

Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Krzysztof Wróbel, Jan Maria Kowalski et al. · Feb 8 · arXiv

Bielik Guard is a pair of small but strong Polish language safety models that check text for five kinds of risky content: hate/aggression, vulgar language, sexual content, crime, and self-harm.

#Polish NLP #content moderation #safety classifier


ASA: Training-Free Representation Engineering for Tool-Calling Agents

Intermediate

Youjin Wang, Run Zhou et al. · Feb 4 · arXiv

The paper finds a striking gap: the model's hidden representations almost perfectly indicate when it should use a tool, but its generated output often fails to trigger the tool call under strict rules.

#activation steering #representation engineering #tool calling
