Standard softmax attention costs O(n²) in sequence length because every token compares with every other token.
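The quadratic cost is visible in a direct implementation: the score matrix has one entry per (query, key) pair, so it is n × n. A minimal numpy sketch (shapes and names are illustrative, not any particular library's API):

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Scores: one dot product per (query, key) pair -> an (n, n) matrix,
    # hence O(n^2) time and memory in the sequence length n.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                             # shape (n, d)

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = softmax_attention(Q, K, V)
```

Doubling n quadruples the size of `scores`, which is exactly the O(n²) bottleneck that sub-quadratic attention variants try to remove.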
The kernel (lazy) regime keeps neural network parameters close to their initialization, making training equivalent to kernel regression with a fixed kernel such as the Neural Tangent Kernel (NTK).
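In this regime the network is well approximated by its first-order Taylor expansion in the parameters, f(x; θ) ≈ f(x; θ₀) + ∇θf(x; θ₀)·(θ − θ₀), so the induced kernel is the Gram matrix of parameter gradients at initialization. A minimal sketch of that empirical NTK, using finite-difference gradients on a hypothetical one-hidden-layer net (all names here are illustrative assumptions):

```python
import numpy as np

WIDTH = 16

def mlp(theta, x):
    # Toy scalar network: theta packs input weights w1 and output weights w2.
    w1, w2 = theta[:WIDTH], theta[WIDTH:]
    return w2 @ np.tanh(w1 * x) / np.sqrt(WIDTH)

def grad_fd(theta, x, eps=1e-5):
    # Central finite differences for grad_theta f(x; theta).
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (mlp(theta + e, x) - mlp(theta - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
theta0 = rng.normal(size=2 * WIDTH)     # parameters at initialization
xs = [-1.0, 0.3, 0.8]

# Empirical NTK Gram matrix at theta0:
#   K[i, j] = grad f(x_i; theta0) . grad f(x_j; theta0)
G = np.array([grad_fd(theta0, x) for x in xs])
K = G @ G.T
```

In the lazy regime this kernel stays (approximately) fixed during training, so gradient descent on the network behaves like kernel regression with `K`; the NTK result makes this exact in the infinite-width limit.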