Groups
Depth adds compositional power: stacking layers lets neural networks represent functions with many repeated patterns using far fewer neurons than a single wide layer.
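A classic way to see this (the tent-map/sawtooth construction is my chosen illustration here; the text above names no specific example): composing a small ReLU "tent" block with itself k times produces a sawtooth with 2^k oscillations, while a single hidden layer would need on the order of 2^k units to match it.

```python
def tent(x):
    # One tiny layer of two ReLU units: maps [0, 1] to a triangle wave,
    # tent(x) = 2x for x <= 0.5 and 2 - 2x for x > 0.5.
    relu = lambda z: max(z, 0.0)
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_saw(x, depth):
    # Stacking the same block `depth` times yields 2**depth teeth,
    # so expressive power grows exponentially in depth but the
    # parameter count grows only linearly.
    for _ in range(depth):
        x = tent(x)
    return x
```

Each extra layer doubles the number of linear pieces, which is exactly the "repeated pattern" advantage of depth described above.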
Attention computes a weighted sum of the values V, where the weights are obtained by applying a softmax to similarity scores between the queries Q and the keys K (typically their scaled dot products).
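A minimal sketch of this in pure Python, assuming the standard scaled dot-product form (scores are query-key dot products divided by sqrt of the key dimension, then softmaxed):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    # Q: list of query vectors; K: list of key vectors; V: list of value
    # vectors, one value per key. Returns one output vector per query.
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        # Weighted sum of the values using the attention weights.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

With a query that strongly matches the first key, the output is pulled almost entirely toward the first value, which makes the "weights come from similarity" idea concrete.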