Concepts (2)
📚 Theory · Advanced
Transformer Theory
Transformers map input sequences to output sequences by stacking blocks that combine self-attention with feed-forward networks, each sub-layer wrapped in a residual connection and LayerNorm; a minimal block sketch follows below the tags.
Tags: transformer, self-attention, positional encoding (+12 more)
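A minimal sketch of one such block in PyTorch, using post-LN ordering (residual add, then LayerNorm) as in the original architecture; the sizes and layer names here are illustrative assumptions, not taken from the card above.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: self-attention + feed-forward, each with residual + LayerNorm."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer: residual connection, then LayerNorm.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward sub-layer: residual connection, then LayerNorm.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

# Usage: a batch of 2 sequences, 16 tokens each, model width 512.
block = TransformerBlock()
out = block(torch.randn(2, 16, 512))  # shape: (2, 16, 512)
```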
📚 Theory · Intermediate
Scaling Laws
Scaling laws describe how model loss typically falls as a power law, improving predictably as parameters, data, or compute increase; a worked example follows below the tags.
Tags: scaling laws, power law, Chinchilla scaling (+12 more)
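A minimal sketch of the Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. The constants below are roughly the values fitted by Hoffmann et al. (2022); treat them as illustrative, not exact.

```python
# Approximate Chinchilla fit constants (Hoffmann et al., 2022); illustrative values.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss as a power law in parameters N and training tokens D."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling up parameters and data each shaves a predictable amount off the loss.
print(predicted_loss(1e9, 20e9))     # ~1B params, ~20B tokens
print(predicted_loss(70e9, 1.4e12))  # Chinchilla-scale: 70B params, 1.4T tokens
```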