TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
Intermediate | Zhewen Tan, Wenhan Yu et al. | Jan 26 | arXiv
TriPlay-RL is a three-role self-play training loop (attacker, defender, evaluator) that teaches AI models to be safer with almost no manual labels.
#LLM safety alignment #self-play reinforcement learning #adversarial prompt generation