LLM Inference Optimization

Optimize LLM inference for latency, throughput, and cost

🌱 Beginner

Inference optimization basics

What to Learn

  • Understanding inference bottlenecks (compute-bound prefill vs. memory-bound decode)
  • Quantization fundamentals (INT8, INT4), sketched after this list
  • KV cache and memory management (sizing arithmetic below)
  • Batching for throughput
  • Using vLLM and TGI (usage example below)
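
A minimal sketch of symmetric per-tensor INT8 quantization in NumPy, to make the fundamentals concrete: one scale factor maps the float weights onto the integer range [-127, 127]. The tensor shapes are made up for illustration.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    # Symmetric per-tensor quantization: one scale maps the float
    # range [-max|w|, +max|w|] onto the INT8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize_int8(q, scale)).max()
print(f"fp32: {w.nbytes / 1e6:.0f} MB  int8: {q.nbytes / 1e6:.0f} MB  max error: {err:.4f}")
```

Production schemes like GPTQ and AWQ use per-group scales and calibration data, but the scale-and-round core is the same.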
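
Why KV cache memory management matters is easiest to see with back-of-the-envelope arithmetic. The model shape below is roughly that of an 8B-class model with grouped-query attention; all numbers are assumptions.

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * seq_len * batch_size * bytes_per_element
layers, kv_heads, head_dim = 32, 8, 128    # assumed 8B-class model with GQA
seq_len, batch, dtype_bytes = 8192, 16, 2  # fp16/bf16 cache
cache_gb = 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes / 1e9
print(f"KV cache: {cache_gb:.1f} GB")      # ~17 GB, on top of the weights
```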
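
And a minimal vLLM offline-inference example showing batching in practice: generate() takes a list of prompts and vLLM batches them on the GPU internally. The model name is just an example; any supported HuggingFace model id works.

```python
from vllm import LLM, SamplingParams  # pip install vllm

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # example model id
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules and batches these prompts together internally.
outputs = llm.generate(
    ["Explain KV caching in one sentence.",
     "Why does batching improve GPU utilization?"],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```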

Resources

  • 📚 vLLM documentation
  • 📚 Hugging Face TGI guide
  • 📚 Quantization tutorials

🌿 Intermediate

Production inference systems

What to Learn

  • PagedAttention and continuous batching (block-table sketch after this list)
  • Speculative decoding (draft-and-verify loop below)
  • Tensor parallelism and model sharding (column-parallel sketch below)
  • GPTQ, AWQ, and GGUF formats
  • Inference frameworks comparison (vLLM, TGI, llama.cpp)
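
A toy sketch of the block-table idea behind PagedAttention: KV memory is carved into fixed-size blocks handed to sequences on demand, instead of reserving max_seq_len per request up front. Names and sizes here are illustrative, not vLLM's actual internals.

```python
class PagedKVCache:
    """Toy block allocator: maps each sequence's logical token
    positions to fixed-size physical KV blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.tables = {}                     # seq_id -> (block_table, n_tokens)

    def append_token(self, seq_id: int) -> None:
        table, n = self.tables.get(seq_id, ([], 0))
        if n % self.block_size == 0:         # current block full (or first token)
            table.append(self.free.pop())
        self.tables[seq_id] = (table, n + 1)

    def release(self, seq_id: int) -> None:
        table, _ = self.tables.pop(seq_id)
        self.free.extend(table)              # finished sequence returns its blocks

cache = PagedKVCache(num_blocks=64)
for _ in range(40):
    cache.append_token(seq_id=0)             # 40 tokens -> ceil(40/16) = 3 blocks
print(cache.tables[0][0])                    # this sequence's block table
```

Fragmentation drops because at most one block per sequence is partially filled, which is what lets continuous batching pack many in-flight requests into fixed GPU memory.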
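
A greedy-verification sketch of speculative decoding with toy stand-in models: a small draft model proposes k tokens, the target keeps the longest prefix it agrees with, then contributes one token of its own. In a real system the target scores all k positions in a single forward pass; the toy models here are pure assumptions.

```python
import numpy as np

def make_toy_lm(seed: int, vocab: int = 50):
    # Stand-in for a language model: deterministic logits per prefix.
    def lm(prefix):
        h = (seed * 1_000_003 + hash(tuple(prefix))) % (2**32)
        return np.random.default_rng(h).standard_normal(vocab)
    return lm

def greedy_next(lm, prefix):
    return int(np.argmax(lm(prefix)))

def speculative_step(target, draft, prefix, k: int = 4):
    ctx, proposed = list(prefix), []
    for _ in range(k):                       # draft proposes k tokens cheaply
        t = greedy_next(draft, ctx)
        proposed.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in proposed:                       # verify (one target pass in practice)
        if greedy_next(target, ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(greedy_next(target, ctx))  # target's corrected/bonus token
    return accepted

target, draft = make_toy_lm(1), make_toy_lm(2)
print(speculative_step(target, draft, prefix=[0], k=4))
```

Every step emits at least one token, so output matches greedy decoding from the target alone; the speedup depends on how often the draft guesses right.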
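
Tensor parallelism in one picture: a column-parallel linear layer splits the weight matrix's output columns across devices, each computes its shard locally, and the shards are gathered. NumPy stands in for the multi-GPU collective in this sketch.

```python
import numpy as np

def column_parallel_matmul(x, W, n_devices: int):
    # Each "device" holds a slice of W's output columns.
    shards = np.array_split(W, n_devices, axis=1)
    partials = [x @ w for w in shards]         # local matmuls, no communication
    return np.concatenate(partials, axis=-1)   # all-gather on real hardware

x = np.random.randn(2, 4096)
W = np.random.randn(4096, 11008)
assert np.allclose(x @ W, column_parallel_matmul(x, W, n_devices=4))
```

Row-parallel layers split the input dimension instead and finish with an all-reduce; pairing the two is what keeps communication per transformer block low.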

Resources

  • 📚 vLLM paper
  • 📚 Speculative decoding paper
  • 📚 llama.cpp and GGUF docs

🌳 Advanced

Cutting-edge optimization research

What to Learn

  • Custom CUDA kernels for inference
  • Sparse attention implementations (sliding-window mask sketch after this list)
  • Model compression research
  • Hardware-aware optimization
  • Cost modeling and optimization (cost arithmetic below)
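
A sliding-window mask is the simplest sparse-attention pattern and a useful mental model before reading kernel-level implementations; this NumPy sketch only builds the mask, not the fused kernel.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # Token i may attend to tokens in [i - window + 1, i]:
    # causal, with cost O(seq_len * window) instead of O(seq_len^2).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
```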
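
Cost modeling can start as one line of arithmetic: dollars per million generated tokens from GPU rental price and sustained decode throughput. Both inputs below are illustrative assumptions.

```python
gpu_cost_per_hour = 2.50      # assumed hourly price for one GPU
tokens_per_second = 2500      # assumed aggregate decode throughput
cost_per_million = gpu_cost_per_hour / (tokens_per_second * 3600) * 1e6
print(f"${cost_per_million:.2f} per 1M generated tokens")  # ~$0.28 here
```

A fuller model separates prefill from decode, accounts for batch-dependent throughput, and amortizes idle capacity; that breakdown is usually where the optimization loop starts.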

Resources

  • 📚 FlashAttention implementation
  • 📚 CUDA programming for ML
  • 📚 Hardware optimization papers