Shampoo is a second-order optimizer that can train models to better quality than AdamW, but it has historically been slow because it must compute inverse matrix roots of its preconditioners.
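To make the expensive step concrete, here is a minimal sketch of computing the inverse p-th root of a Shampoo-style preconditioner via eigendecomposition. This is illustrative only: the function name, damping value, and toy statistic are assumptions, and production Shampoo implementations typically use cheaper iterative schemes rather than a full eigendecomposition.

```python
import numpy as np

def inv_pth_root(mat, p, eps=1e-6):
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition.

    Illustrative sketch only; `eps` damping is an assumed choice for
    numerical stability, not a value from the paper.
    """
    # Symmetrize and damp so all eigenvalues are strictly positive.
    mat = (mat + mat.T) / 2 + eps * np.eye(mat.shape[0])
    eigvals, eigvecs = np.linalg.eigh(mat)
    # Apply the scalar inverse root to each eigenvalue, then reassemble.
    return eigvecs @ np.diag(eigvals ** (-1.0 / p)) @ eigvecs.T

# Toy preconditioner statistic built from a random "gradient" matrix.
rng = np.random.default_rng(0)
G = rng.standard_normal((8, 8))
L = G @ G.T                      # left statistic G G^T (symmetric PSD)
L_inv_root = inv_pth_root(L, 4)  # Shampoo needs the inverse 4th root
```

The eigendecomposition makes the cost cubic in the matrix dimension, which is why this step dominates Shampoo's overhead on large layers.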
This paper shows how to train large language models faster and more cheaply by computing in a 4-bit floating-point format (NVFP4), with little loss in accuracy.
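A simplified simulation of block-scaled 4-bit quantization may help show the idea. NVFP4 stores FP4 (E2M1) elements with a fine-grained scale per small block; the sketch below is not bit-exact (real NVFP4 encodes the per-block scale in FP8 and adds a tensor-level scale, while here the block scale is kept in FP32 for clarity), and the function name and block handling are assumptions.

```python
import numpy as np

# Representable magnitudes of an FP4 E2M1 value (the element format in NVFP4).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4(x, block_size=16):
    """Simulated NVFP4 quantize->dequantize for a 1-D array.

    Simplified sketch: per-block scales stay in FP32 here, whereas the
    real format uses FP8 block scales plus a tensor-level scale.
    """
    x = x.reshape(-1, block_size)
    # Map each block's max magnitude onto the largest FP4 value (6.0).
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    # Round each scaled element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(x / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(x) * FP4_GRID[idx] * scale).ravel()

vals = np.random.default_rng(0).standard_normal(64).astype(np.float32)
deq = fake_quant_nvfp4(vals)  # values after a quantize->dequantize round trip
```

The small block size is what keeps the rounding error tolerable: each scale only has to cover 16 nearby values rather than the whole tensor's dynamic range.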