Transformers map sequences to sequences using layers of self-attention and feed-forward networks wrapped with residual connections and LayerNorm.
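To make the shape of such a block concrete, here is a minimal PyTorch sketch of one transformer layer. The pre-LayerNorm arrangement and the dimensions (d_model=512, n_heads=8, d_ff=2048) are illustrative assumptions, not a specific model's configuration.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One transformer layer: self-attention + feed-forward,
    each sub-layer wrapped in a residual connection and LayerNorm
    (pre-norm variant; dimensions are illustrative)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer with residual connection.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Position-wise feed-forward sub-layer with residual connection.
        x = x + self.ff(self.norm2(x))
        return x

# Usage: a batch of 2 sequences, 16 tokens each, model width 512.
x = torch.randn(2, 16, 512)
y = TransformerBlock()(x)   # output keeps the input shape: (2, 16, 512)
```

Stacking such blocks gives the sequence-to-sequence map described above: each layer reads the whole sequence through attention and transforms each position through the feed-forward network.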
Scaling laws say that model loss typically falls off as a power law, improving predictably as you increase parameters, data, or compute.
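As a worked illustration of the functional form such a law can take, here is a small Python sketch of a Chinchilla-style parametric fit, L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count and D is training tokens. The coefficient values below are placeholders for illustration, not fitted constants from any particular study.

```python
def scaling_law_loss(n_params: float, n_tokens: float,
                     E: float = 1.7, A: float = 400.0, B: float = 400.0,
                     alpha: float = 0.34, beta: float = 0.28) -> float:
    """Predicted loss under an assumed power-law fit:
    an irreducible term E plus terms that shrink as parameters
    and data grow. All coefficients here are illustrative."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling parameters (data held fixed) shrinks the parameter term
# by a constant factor, so the improvement is predictable.
print(scaling_law_loss(1e9, 2e10))   # ~1B params, ~20B tokens
print(scaling_law_loss(2e9, 2e10))   # 2x params -> predictably lower loss
```

The point of the sketch is the shape of the curve, not the numbers: once the exponents are fitted on small runs, the same formula extrapolates the loss of much larger models.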