Data parallelism splits the training data across workers that each hold a replica of the model and compute gradients on their own shard in parallel; the gradients are then aggregated (typically averaged) to update the shared parameters.
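A minimal sketch of one data-parallel training step, assuming an illustrative linear model with a squared-error loss; the worker count, data shapes, and learning rate are arbitrary choices for demonstration, and real systems would use a framework such as a distributed training library, but the split-compute-aggregate idea is the same.

```python
import numpy as np

def worker_gradient(w, X_shard, y_shard):
    """Each worker computes the mean-squared-error gradient on its own data shard."""
    preds = X_shard @ w
    return 2 * X_shard.T @ (preds - y_shard) / len(y_shard)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(512, 10)), rng.normal(size=512)
w = np.zeros(10)          # shared model parameters (replicated on every worker)
num_workers = 4

# Split the batch across workers; each shard would be processed in parallel.
X_shards = np.array_split(X, num_workers)
y_shards = np.array_split(y, num_workers)
grads = [worker_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

# Aggregate (average) the per-worker gradients, then update the shared weights.
w -= 0.01 * np.mean(grads, axis=0)
```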
Stochastic Gradient Descent (SGD) updates model parameters using small random subsets (mini-batches) of the data, making each step faster and more memory-efficient than full-batch gradient descent.
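A minimal sketch of mini-batch SGD on the same illustrative linear model with a squared-error loss; the batch size, learning rate, and step count are assumed values chosen only to show the sample-gradient-update loop.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 10)), rng.normal(size=1000)
w = np.zeros(10)
lr, batch_size = 0.01, 32

for step in range(200):
    # Sample a small random subset (mini-batch) of the training data.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]
    # Gradient of the mean squared error computed on the mini-batch only.
    grad = 2 * X_b.T @ (X_b @ w - y_b) / batch_size
    # Take a small step against the gradient to update the parameters.
    w -= lr * grad
```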