🤖

Transformer Architecture

Master the Transformer - the foundational architecture behind GPT, BERT, and modern LLMs

🌱

Beginner

Understanding Transformers

What to Learn

  • Self-attention mechanism intuition
  • Query, Key, Value (Q, K, V) explained (see the sketch after this list)
  • Multi-head attention
  • Positional encodings
  • Encoder-decoder structure
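
The items above come together in scaled dot-product attention. Below is a minimal, illustrative NumPy sketch of a single attention head over a toy random sequence; the shapes, weights, and inputs are made up for demonstration, not taken from any real model.

    import numpy as np

    def softmax(scores):
        # Subtract the row max for numerical stability before exponentiating
        scores = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(scores)
        return exp / exp.sum(axis=-1, keepdims=True)

    def self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head)
        q = x @ w_q                                # queries: what each token is looking for
        k = x @ w_k                                # keys: what each token advertises
        v = x @ w_v                                # values: what each token passes along
        scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot-product similarities
        weights = softmax(scores)                  # attention weights, each row sums to 1
        return weights @ v                         # weighted mix of values per token

    # Toy demo with random data (illustrative sizes only)
    rng = np.random.default_rng(0)
    seq_len, d_model, d_head = 4, 8, 8
    x = rng.normal(size=(seq_len, d_model))
    w_q, w_k, w_v = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
    print(self_attention(x, w_q, w_k, w_v).shape)  # (4, 8): one contextualized vector per token

Multi-head attention simply runs several such heads in parallel on smaller d_head slices and concatenates the results.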

Resources

  • 📚The Illustrated Transformer (Jay Alammar)
  • 📚Attention Is All You Need paper
  • 📚3Blue1Brown: Attention explained
🌿

Intermediate

Transformer variants and modifications

What to Learn

  • Decoder-only (GPT) vs Encoder-only (BERT)
  • Rotary Position Embeddings (RoPE)
  • Grouped Query Attention (GQA) (see the sketch after this list)
  • Flash Attention and efficient attention
  • Layer normalization placement (Pre-LN vs Post-LN)
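
As a concrete picture of the GQA item above, here is a small NumPy sketch of the head-sharing idea: many query heads reuse a few key/value heads, which shrinks the KV cache during generation. Head counts and shapes are illustrative assumptions, not any specific model's configuration.

    import numpy as np

    def softmax(scores):
        scores = scores - scores.max(axis=-1, keepdims=True)
        exp = np.exp(scores)
        return exp / exp.sum(axis=-1, keepdims=True)

    def grouped_query_attention(q, k, v):
        # q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head)
        group = q.shape[0] // k.shape[0]        # query heads per shared KV head
        k = np.repeat(k, group, axis=0)         # broadcast each KV head to its query group
        v = np.repeat(v, group, axis=0)
        scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
        return softmax(scores) @ v              # (n_q_heads, seq, d_head)

    # Illustrative head counts: 8 query heads share 2 KV heads
    rng = np.random.default_rng(0)
    q = rng.normal(size=(8, 5, 16))             # 8 query heads
    k = rng.normal(size=(2, 5, 16))             # only 2 KV heads are stored in the cache
    v = rng.normal(size=(2, 5, 16))
    print(grouped_query_attention(q, k, v).shape)   # (8, 5, 16)

Standard multi-head attention is the special case with as many KV heads as query heads; multi-query attention is the other extreme with a single shared KV head.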

Resources

  • 📚GPT-2 and BERT papers
  • 📚Llama architecture papers
  • 📚Flash Attention paper
🌳

Advanced

Cutting-edge architecture research

What to Learn

  • Mixture of Experts (MoE) (see the routing sketch after this list)
  • State space models (SSMs) as alternatives to attention
  • Sparse attention patterns
  • Multi-modal transformers
  • Efficient long-context architectures
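
For the MoE item above, here is a simplified routing sketch in NumPy: a learned router scores each token, only the top-k experts are evaluated for it, and their outputs are combined with renormalized router weights. The expert count, k, and the tiny ReLU experts are illustrative assumptions, not a specific model's design.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def moe_layer(x, router_w, experts, k=2):
        # x: (n_tokens, d_model); router_w: (d_model, n_experts)
        # experts: one (w1, w2) ReLU-MLP weight pair per expert
        logits = x @ router_w                           # router score per token and expert
        topk = np.argsort(logits, axis=-1)[:, -k:]      # indices of the k best experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            gates = softmax(logits[t, topk[t]])         # renormalize over the chosen experts
            for gate, e in zip(gates, topk[t]):
                w1, w2 = experts[e]
                out[t] += gate * (np.maximum(x[t] @ w1, 0.0) @ w2)   # run only the selected experts
        return out

    # Toy sizes, for illustration only
    rng = np.random.default_rng(0)
    n_tokens, d_model, d_ff, n_experts = 6, 16, 32, 4
    x = rng.normal(size=(n_tokens, d_model))
    router_w = rng.normal(size=(d_model, n_experts))
    experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
               for _ in range(n_experts)]
    print(moe_layer(x, router_w, experts).shape)        # (6, 16); only 2 of 4 experts run per token

Because only the selected experts run per token, total parameters can grow while compute per token stays roughly constant, which is the trade-off the Mixtral and Switch Transformer papers below explore.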

Resources

  • 📚Mixtral and Switch Transformer papers
  • 📚Mamba and RWKV papers
  • 📚Latest ICML/NeurIPS transformer papers