The paper finds that a simple trick, randomly skipping some parameter updates, can train large language models better than fancy optimizers.
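To make the idea concrete, here is a minimal sketch of what randomly skipping parameter updates could look like in PyTorch. The per-tensor coin flip, the `skip_prob` value, and the plain SGD update are all illustrative assumptions, not the paper's exact recipe.

```python
import torch

def sgd_step_with_random_skip(params, lr=1e-3, skip_prob=0.5):
    """Plain SGD update that is randomly skipped for some parameters.

    Sketch only: the real method's skip granularity and probability
    may differ from this per-tensor coin flip.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            # Flip a coin per parameter tensor; skip this update on "heads".
            if torch.rand(()).item() < skip_prob:
                continue
            p.add_(p.grad, alpha=-lr)

# Usage: after loss.backward(), call
#   sgd_step_with_random_skip(model.parameters())
# in place of optimizer.step().
```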
Training big AI models uses lots of memory, partly because most methods keep a separate full-precision copy of the weights, called master weights, alongside the low-precision weights used for computation.
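For context, a sketch of how master weights are typically used is below. The tiny linear layer and dummy SGD-style update are assumptions made just to show why the extra full-precision copy costs memory; they are not the paper's setup.

```python
import torch

# Hypothetical tiny model standing in for an LLM, stored in low precision.
model = torch.nn.Linear(1024, 1024).bfloat16()

# Master weights: a separate FP32 copy that the optimizer actually updates.
master = [p.detach().float().clone() for p in model.parameters()]

def train_step(lr=1e-3):
    x = torch.randn(32, 1024, dtype=torch.bfloat16)
    loss = model(x).float().pow(2).mean()  # dummy loss for illustration
    loss.backward()
    with torch.no_grad():
        for p, m in zip(model.parameters(), master):
            m.add_(p.grad.float(), alpha=-lr)  # accumulate the update in FP32
            p.copy_(m.to(p.dtype))             # cast back to BF16 for compute
            p.grad = None

# The FP32 master copy adds 4 bytes per parameter on top of the 2-byte
# BF16 weights, before counting any optimizer state.
```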