Concepts (152)


Scaled Dot-Product Attention

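Scaled dot-product attention scores how much each value in V should contribute to a query's output: it takes dot products between the query and the keys K, divides them by \(\sqrt{d_k}\) (the key dimension), applies softmax, and forms the resulting weighted sum of the values, i.e. \(\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V\).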

#scaled dot-product attention #softmax #transformer

📚 Theory · Intermediate
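As a rough illustration of the steps in the description above, here is a minimal sketch in plain Python (standard library only). The function names, the list-of-lists layout, and the toy Q, K, V values are assumptions made for this example; it omits masking, batching, and multiple heads.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)  # divide scores by sqrt(d_k)
    outputs, weights = [], []
    for q in Q:
        # One scaled dot product per key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)  # attention weights over the keys, summing to 1
        weights.append(w)
        # Weighted sum of the value rows, one output column at a time.
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        )
    return outputs, weights

# Toy example: one query that matches the first key more closely.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

In this toy setup the query aligns with the first key, so the first value row receives the larger weight and dominates the output.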

Stochastic Depth

Stochastic depth randomly skips entire residual blocks during training, leaving only the identity shortcut for a skipped block, while using the full network at inference time.

#stochastic depth
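The following is a minimal sketch of a single residual block with stochastic depth, in plain Python. The names `stochastic_depth_block`, `residual_fn`, and `survival_prob` are hypothetical and chosen for this example. It assumes the convention of scaling the residual branch by its survival probability at inference; some implementations instead rescale the branch during training.

```python
import random

def stochastic_depth_block(x, residual_fn, survival_prob, training, rng=random):
    """Compute x + f(x), randomly dropping the residual branch f in training."""
    if training:
        # Keep the residual branch with probability survival_prob.
        if rng.random() < survival_prob:
            return [xi + fi for xi, fi in zip(x, residual_fn(x))]
        # Branch dropped: the block reduces to the identity shortcut.
        return list(x)
    # Inference: the branch is always present, scaled by its survival
    # probability so its expected contribution matches training.
    return [xi + survival_prob * fi for xi, fi in zip(x, residual_fn(x))]

def double(x):
    """Hypothetical residual branch used only for this demo."""
    return [2.0 * xi for xi in x]

train_out = stochastic_depth_block([1.0], double, survival_prob=0.5, training=True)
eval_out = stochastic_depth_block([1.0], double, survival_prob=0.5, training=False)
```

During training `train_out` is either the input unchanged or the full residual sum, depending on the random draw; `eval_out` is deterministic.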



Scaled Dot-Product Attention

Stochastic Depth

Spectral Regularization

Early Stopping

Label Smoothing

Data Augmentation Theory

Layer Normalization

Batch Normalization

Dropout

Feature Learning vs Kernel Regime

Grokking & Delayed Generalization

Mean Field Theory of Neural Networks