The paper shows that when we train with the popular InfoNCE contrastive loss, the learned features increasingly behave as if drawn from a Gaussian (bell-shaped) distribution.
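For reference, the InfoNCE loss mentioned above can be sketched in a few lines. This is a generic, minimal version (the function name, temperature value, and toy data below are illustrative, not taken from the paper): each anchor is pulled toward its matching positive and pushed away from everyone else's positives.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Minimal InfoNCE sketch: cross-entropy over cosine similarities,
    where row i of `positives` is the positive for row i of `anchors`
    and all other rows serve as in-batch negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # match each anchor to its own positive

# Toy usage: positives are noisy copies of the anchors.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
loss = info_nce(z, z + 0.05 * rng.normal(size=(8, 16)))
```

The paper's claim concerns the features this kind of training produces, not the loss code itself; the sketch is only to ground what "InfoNCE" refers to.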
GUI-Libra is a training recipe that helps computer-using AI agents both think carefully and click precisely on screens.
ManCAR helps recommendation systems think step by step but keeps their thoughts on realistic paths using a map of how items connect.
The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.
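If the substitution amounts to using the average reward as a baseline (an assumption on my part; the function and variable names below are illustrative), the core trick fits in a few lines: instead of computing the intractable log-partition term, subtract the mean reward of a group of sampled responses.

```python
import numpy as np

def surrogate_advantages(rewards):
    """Sketch of the easy substitution: use the empirical mean reward
    over a group of sampled responses in place of the hard-to-compute
    log-partition term, yielding simple centered advantages."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()      # the "easy" stand-in for log Z
    return rewards - baseline      # weights for the policy-gradient update

# Toy usage: four sampled responses with scalar rewards.
adv = surrogate_advantages([1.0, 0.0, 2.0, 1.0])
```

Above-average responses get positive weight and below-average ones negative weight, which is all the update needs; no normalizing constant is ever computed.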
This paper introduces GANPO, a new way to train language models from human preferences by guiding the model using its hidden thoughts (latent space) instead of just its visible words (token space).
This paper teaches large language models (LLMs) to explore smarter by listening to their own gradients (the directions in which they would update) rather than chasing random variety.