Large language models are fluent with language, but they struggle to predict how the world will change after they take an action in it.
Large language models sometimes reach the right answer through flawed reasoning, which makes them risky to rely on and hard to trust.
GANPO is a new way to train language models from human preferences: instead of steering only the visible output tokens (token space), it guides the model through its internal hidden representations (latent space).
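The summary above does not spell out GANPO's objective, so the following is only a rough, hypothetical sketch of what a latent-space preference signal could look like: a scalar score is read off pooled hidden states (rather than off token log-probabilities) and plugged into a DPO-style logistic loss. Every name here (LatentScorer, latent_preference_loss, beta) is an illustrative assumption, not the paper's method.

    # Hypothetical sketch only; not GANPO's actual algorithm.
    import torch
    import torch.nn.functional as F

    class LatentScorer(torch.nn.Module):
        """Maps final-layer hidden states to a scalar preference score."""
        def __init__(self, dim: int):
            super().__init__()
            self.head = torch.nn.Linear(dim, 1)

        def forward(self, hidden: torch.Tensor) -> torch.Tensor:
            pooled = hidden.mean(dim=1)           # (batch, seq, dim) -> (batch, dim)
            return self.head(pooled).squeeze(-1)  # (batch,)

    def latent_preference_loss(scorer, h_chosen, h_rejected, beta=1.0):
        """Logistic preference loss on latent scores instead of token log-probs."""
        gap = scorer(h_chosen) - scorer(h_rejected)
        return -F.logsigmoid(beta * gap).mean()

The point of the sketch is only the contrast the summary draws: the preference signal is computed from hidden states, not from the probabilities of the emitted tokens.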
DenseGRPO trains image-generation models with many small rewards delivered throughout generation, instead of a single score at the very end.
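As a hedged illustration of the dense-versus-sparse reward idea (the exact DenseGRPO recipe is not given here), the sketch below normalizes a reward at every denoising step across a group of samples, GRPO-style, instead of normalizing one terminal reward. The function names and tensor shapes are assumptions for illustration.

    # Hypothetical sketch of dense, per-step group-normalized advantages.
    import torch

    def dense_group_advantages(step_rewards: torch.Tensor) -> torch.Tensor:
        """step_rewards: (group, steps) reward for each denoising step of each
        sample in a group. Normalizing per step across the group gives every
        step its own credit signal."""
        mean = step_rewards.mean(dim=0, keepdim=True)
        std = step_rewards.std(dim=0, keepdim=True).clamp_min(1e-6)
        return (step_rewards - mean) / std

    def sparse_group_advantage(final_rewards: torch.Tensor) -> torch.Tensor:
        """Contrast: standard GRPO normalizes only one final score per sample."""
        return (final_rewards - final_rewards.mean()) / final_rewards.std().clamp_min(1e-6)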
RL-trained search agents often sound confident even when they don't know the answer, which can mislead users.
GARDO is a new way to fine-tune text-to-image diffusion models with reinforcement learning while resisting reward hacking, i.e., being misled by flawed reward signals.