Concepts (16)

Groups

Efficient Attention Mechanisms

Standard softmax attention costs O(n²) time and memory in the sequence length n, because every token is compared with every other token.

#linear attention #efficient attention #kernel trick
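
To illustrate the kernel trick named in the tags, here is a minimal sketch of one way to avoid the n-by-n comparison: replace the softmax similarity with a dot product of feature maps, so that the key-value summary can be computed once and reused for every query. The feature map elu(x) + 1 and the function names are assumptions chosen for illustration, and the result is an approximation-style alternative, not numerically equal to softmax attention.

```python
import math


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1, applied element-wise.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Non-causal kernelized attention in O(n) passes over the sequence."""
    d_v = len(V[0])
    phi_K = [feature_map(k) for k in K]
    d = len(phi_K[0])
    # S accumulates sum_j phi(k_j) v_j^T (d x d_v); z accumulates sum_j phi(k_j).
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for pk, v in zip(phi_K, V):
        for a in range(d):
            z[a] += pk[a]
            for b in range(d_v):
                S[a][b] += pk[a] * v[b]
    out = []
    for q in Q:
        pq = feature_map(q)
        denom = sum(pq[a] * z[a] for a in range(d))
        out.append([sum(pq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_kernel_attention(Q, K, V):
    """Same kernel, but with every query compared against every key (O(n^2))."""
    phi_K = [feature_map(k) for k in K]
    out = []
    for q in Q:
        pq = feature_map(q)
        sims = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_K]
        total = sum(sims)
        out.append([sum(s * v[b] for s, v in zip(sims, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions compute the same values; the linear version only reorders the multiplication, so its cost grows linearly with the number of tokens for a fixed feature dimension.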

Scaled Dot-Product Attention

Scaled dot-product attention determines how much each value in V contributes to the output for a query: it takes the dot products of the query with the keys K, divides them by \(\sqrt{d_k}\), applies softmax, and uses the resulting weights to form a weighted sum of the values.

#scaled dot-product attention
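
The steps in the description above can be sketched in plain Python as follows. This is a minimal, unbatched illustration using lists of lists; the function names are assumptions, and a real implementation would normally use a tensor library and support masking.

```python
import math


def softmax(xs):
    # Subtract the maximum first for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v (lists of lists)."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    d_v = len(V[0])
    output = []
    for q in Q:
        # Dot product of the query with every key, divided by sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        # Softmax turns the scores into weights that sum to 1.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(d_v)])
    return output
```

A query of all zeros scores every key equally, so the output is the plain average of the value vectors; a query strongly aligned with one key returns almost exactly that key's value.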

Efficient Attention Mechanisms

Scaled Dot-Product Attention

Convolution Theorem

Tensor Operations

Complexity Analysis Quick Reference

Debugging Strategies for CP

Small-to-Large Principle

Permutations and Combinations

Fast Exponentiation

DSU on Tree (Sack)

Knuth Optimization

Breadth-First Search (BFS)