Groups
Mixed precision training computes forward and backward passes in low precision (FP16/BF16) for speed and memory savings, while keeping a master copy of the weights in FP32 so small gradient updates are not lost to rounding.
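A minimal sketch of this loop in PyTorch, using `autocast` for low-precision compute and `GradScaler` to guard against FP16 gradient underflow; `model` and `loader` here are illustrative placeholders, not from the original text:

```python
import torch
from torch import nn

model = nn.Linear(512, 10).cuda()          # master weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()       # scales the loss to avoid FP16 underflow
loss_fn = nn.CrossEntropyLoss()

for inputs, targets in loader:             # hypothetical data loader
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in FP16/BF16
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales grads, updates FP32 weights
    scaler.update()                        # adapts the scale factor over time
```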
Multi-Head Attention projects queries, keys, and values into several subspaces and runs attention in parallel in each, so every head can focus on different relationships in the data; the heads' outputs are then concatenated and linearly projected back to the model dimension.
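A from-scratch sketch in PyTorch showing the split-attend-concatenate pattern with scaled dot-product attention; the dimensions and class name are illustrative assumptions:

```python
import math
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, values, plus an output projection
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # Project, then split the model dimension into (num_heads, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Scaled dot-product attention, computed for all heads in parallel
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = scores.softmax(dim=-1)
        out = weights @ v                   # (batch, heads, seq, d_head)
        # Concatenate heads and apply the final projection
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)

x = torch.randn(2, 5, 64)
attn = MultiHeadAttention(d_model=64, num_heads=8)
print(attn(x).shape)  # torch.Size([2, 5, 64])
```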