The paper builds an automated pipeline that translates AI benchmarks and datasets into many languages while keeping each question paired with its correct answer.
Real attackers can try many prompts in parallel until a model slips, so safety tests that allow only a single attempt badly underestimate the risk.
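The gap between one try and many tries can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes every attempt succeeds independently with the same probability, and the function name `attack_success_at_k` and the 1% per-attempt rate are hypothetical choices made for the example.

```python
def attack_success_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k attempts succeeds.

    Assumption (not from the source): attempts are independent and each
    succeeds with the same probability p_single, so the chance that all
    k attempts fail is (1 - p_single) ** k.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p_single) ** k


# Hypothetical per-attempt success rate of 1%, chosen only for illustration.
for attempts in (1, 10, 100, 1000):
    print(attempts, round(attack_success_at_k(0.01, attempts), 3))
```

Under these assumptions, a model that looks 99% safe on a single attempt is broken about 63% of the time once an attacker gets 100 tries. Real attempts are rarely fully independent, so this should be read as a rough intuition rather than a measured attack rate.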