The paper finds that a simple trick, randomly skipping some parameter updates during training, can train large language models better than more sophisticated optimizers do.
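A minimal sketch of what such random update skipping could look like, assuming a per-element Bernoulli mask applied at each plain-SGD step; the `skip_prob` value, the masking granularity, and the function name are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

def masked_sgd_step(params, lr=1e-2, skip_prob=0.5):
    """One plain-SGD step that randomly skips a fraction of the updates.

    skip_prob and the per-element granularity are illustrative assumptions,
    not details from the paper.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            # Bernoulli mask: 1 keeps the update, 0 skips it for that entry.
            keep = (torch.rand_like(p) >= skip_prob).float()
            p -= lr * keep * p.grad
            p.grad.zero_()

# Toy usage: one training step on a small linear model.
model = nn.Linear(16, 1)
loss = (model(torch.randn(8, 16)) ** 2).mean()
loss.backward()
masked_sgd_step(model.parameters(), lr=1e-2, skip_prob=0.5)
```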
The paper introduces Nested Learning, a new way to build AI that learns in layers (like Russian dolls), so each part can update at its own speed and hold onto information over different timescales.
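A minimal sketch of the multi-speed update idea, assuming two levels that share one loss: a fast module stepped every iteration and a slow module stepped every few iterations, so its gradients accumulate between steps. The module names, learning rates, and `slow_every` schedule are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

# Two nested levels: "fast" updates every step, "slow" on a coarser clock.
fast = nn.Linear(16, 16)
slow = nn.Linear(16, 16)
opt_fast = torch.optim.SGD(fast.parameters(), lr=1e-2)
opt_slow = torch.optim.SGD(slow.parameters(), lr=1e-3)
slow_every = 4  # the slow level updates 4x less often (assumed value)

for step in range(1, 101):
    x = torch.randn(8, 16)
    loss = (slow(fast(x)) ** 2).mean()  # toy objective
    loss.backward()

    opt_fast.step()        # fast level: updates every step
    opt_fast.zero_grad()

    if step % slow_every == 0:
        opt_slow.step()    # slow level: gradients accumulated across steps
        opt_slow.zero_grad()
```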