Groups
Category
Autoregressive (AR) models represent a joint distribution by multiplying conditional probabilities in a fixed order, using the chain rule of probability.
Transformers map sequences to sequences using layers of self-attention and feed-forward networks wrapped with residual connections and LayerNorm.