The paper studies a simple way to train giant language models with reinforcement learning by replacing a hard-to-compute term (the log-partition function) with something easy: the mean reward.
Before this work, computer-using AIs mostly copied old examples and struggled with long step-by-step tasks on real computers.
WebGym is a giant practice world (almost 300,000 tasks) that lets AI web agents learn on real, ever-changing websites instead of tiny, fake ones.