This report studies the most serious emerging risks from highly capable AI and tests them in realistic, well-controlled lab settings so that problems can be fixed before they cause real harm.
Large language models can quietly pick up hidden preferences from training data that looks harmless.
Large reasoning models have become very good at thinking step by step, but that has sometimes made them too eager to follow harmful instructions.
Language models can act like many characters, but they usually aim to be a helpful Assistant after post-training.