Standard softmax attention costs O(n²) time and memory in the sequence length n, because every token's query is compared against every other token's key.
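A minimal NumPy sketch of this quadratic cost: the score matrix `Q @ K.T` has shape (n, n), so both compute and memory grow as n². The function name and shapes here are illustrative, not from any particular library.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # (n, n) score matrix: every token compared with every other token
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # weighted sum of values, shape (n, d)
    return weights @ V

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
```

Doubling n quadruples the size of `scores`, which is exactly the O(n²) bottleneck that linear-attention variants try to avoid.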
The Universal Approximation Theorem (UAT) says a feedforward network with one hidden layer and a non-polynomial activation (such as sigmoid or ReLU) can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units.
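A concrete sketch of the ReLU case: a one-hidden-layer network with units relu(x - k_i) can represent any piecewise-linear function, and piecewise-linear interpolants approximate continuous functions arbitrarily well as the knots densify. Here the weights are set by hand (slope changes at each knot) rather than learned; the target sin and the knot grid are illustrative choices.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

target = np.sin
knots = np.linspace(0.0, np.pi, 21)       # more knots -> better approximation
vals = target(knots)
slopes = np.diff(vals) / np.diff(knots)   # slope on each interval

def net(x):
    # One hidden layer of ReLU units. Each unit's output weight is the
    # change in slope at its knot, so the sum is the piecewise-linear
    # interpolant of the target through the knots.
    out = vals[0] + slopes[0] * relu(x - knots[0])
    for k in range(1, len(slopes)):
        out = out + (slopes[k] - slopes[k - 1]) * relu(x - knots[k])
    return out

xs = np.linspace(0.0, np.pi, 1000)
max_err = np.max(np.abs(net(xs) - target(xs)))
```

With 21 knots the maximum error on [0, π] is already below 0.01; refining the grid drives it toward zero, which is the constructive intuition behind the theorem for ReLU activations.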