Papers (4)

#LLM safety

Frontier AI Risk Management Framework in Practice: A Risk Analysis Technical Report v1.5

Intermediate

Dongrui Liu, Yi Yu et al. · Feb 16 · arXiv

This report studies the biggest new dangers from super-capable AI and tests them in realistic, well-controlled labs so we can fix problems before they cause real harm.

#frontier AI #agentic AI #cyber offense

From Data to Behavior: Predicting Unintended Model Behaviors Before Training

Intermediate

Mengru Wang, Zhenqian Xu et al. · Feb 4 · arXiv

Large language models can quietly pick up hidden preferences from training data that looks harmless.

#Data2Behavior #Manipulating Data Features #activation injection

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Intermediate

Seanie Lee, Sangwoo Park et al. · Jan 30 · arXiv

Large reasoning models have become very good at thinking step by step, but that sometimes makes them too eager to follow harmful instructions.

#THINKSAFE #self-generated safety alignment #refusal steering

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Intermediate

Christina Lu, Jack Gallagher et al. · Jan 15 · arXiv

Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.

#Assistant Axis #persona drift #activation capping