The paper studies why two opposite-sounding tricks in RL for reasoning—adding random (spurious) rewards and reducing randomness (entropy)—can both seem to help large language models think better.
Reinforcement learning agents usually represent the world in flat, Euclidean space, but many decision problems branch like trees, a structure that curved, hyperbolic space captures with far less distortion; the sketch below makes the contrast concrete.
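To see why branching structures favor hyperbolic geometry, here is a minimal sketch using the Poincaré-ball model, a standard model of hyperbolic space (this is general background, not the paper's own code; the function name `poincare_distance` and the sample points are illustrative assumptions):

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Distance between two points inside the unit ball under the
    Poincare-ball model of hyperbolic space (standard formula,
    not taken from the paper)."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

# Points near the ball's boundary are far apart hyperbolically even when
# their Euclidean gap is modest; that extra room is what lets
# exponentially branching trees embed with low distortion.
a = np.array([0.95, 0.0])
b = np.array([0.0, 0.95])
print("Euclidean distance: ", np.linalg.norm(a - b))   # ~1.34
print("Hyperbolic distance:", poincare_distance(a, b)) # ~6.6
```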
Reinforcement learning (RL) can make big language models smarter, but off-policy training often pushes updates too far from the “safe zone,” causing unstable learning.
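The "safe zone" intuition is commonly implemented as a trust region or a clipped importance ratio, as in PPO-style objectives. Below is a hedged, generic sketch of that idea (an assumption for illustration, not the paper's specific algorithm; the function name `clipped_policy_loss` and the default `clip_eps` are made up for this example):

```python
import torch

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """Generic PPO-style clipped objective: one common way to keep
    off-policy updates near the behavior policy. Illustrative only."""
    ratio = torch.exp(logp_new - logp_old)  # importance weight new/old policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The pessimistic min removes any incentive to push the new policy
    # far outside the "safe zone" around the data-generating policy.
    return -torch.min(unclipped, clipped).mean()
```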
Large language models tend to forget or misuse new facts when their weights are edited in a single pass; EtCon addresses this with a two-step procedure.