Latent Adversarial Regularization for Offline Preference Optimization
Intermediate | Enyi Jiang, Yibo Jacky Zhang et al. | Jan 29 | arXiv
This paper introduces GANPO, a method for training language models from human preferences that applies adversarial regularization in the model's latent space (its hidden representations) rather than only in token space (its visible output words).
#GANPO #latent space regularization #offline preference optimization