📊 LLM Evaluation

Learn to properly evaluate LLM outputs for quality, safety, and task performance

🌱 Beginner

LLM evaluation basics

What to Learn

  • Perplexity and loss-based metrics (see the perplexity sketch after this list)
  • Task-specific benchmarks (MMLU, HellaSwag)
  • Human evaluation protocols
  • Prompt-based evaluation
  • Common pitfalls in LLM evaluation
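
Perplexity is just the exponentiated average negative log-likelihood a model assigns to the reference tokens. Below is a minimal sketch in plain Python; it assumes you already have per-token log-probabilities from whatever model or API you are scoring with (how you obtain them is outside this snippet).

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the scored tokens.

    token_logprobs: natural-log probabilities, one per reference token, as
    reported by the model under evaluation (many APIs expose these via a
    logprobs option).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Example: a 4-token completion the model found fairly likely.
print(perplexity([-0.2, -1.1, -0.5, -0.8]))  # ~1.9; lower is better
```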

Resources

  • 📚LMSYS Chatbot Arena
  • 📚Open LLM Leaderboard
  • 📚LangSmith evaluation features

🌿 Intermediate

Comprehensive evaluation systems

What to Learn

  • LLM-as-judge evaluation (see the pairwise judge sketch after this list)
  • Pairwise comparison methods
  • Multi-dimensional evaluation rubrics
  • RAG evaluation (RAGAS)
  • Safety and alignment testing
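
LLM-as-judge pairwise evaluation boils down to asking a strong model which of two candidate answers better satisfies a rubric, and scoring each pair twice with the order swapped to control for position bias. A minimal sketch follows; call_judge is a hypothetical placeholder for whatever judge model you wrap, and the prompt and verdict format are illustrative, not any specific benchmark's protocol.

```python
JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
candidate answers, decide which answer is more helpful, correct, and concise.
Reply with exactly "A" or "B".

Question: {question}

Answer A:
{answer_a}

Answer B:
{answer_b}
"""

def call_judge(prompt: str) -> str:
    """Hypothetical placeholder: send the prompt to your judge model, return its reply."""
    raise NotImplementedError

def pairwise_verdict(question: str, ans_1: str, ans_2: str) -> str:
    """Judge the pair in both orders to reduce position bias.

    Returns "1", "2", or "tie" (tie when the two orderings disagree).
    """
    first = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=ans_1, answer_b=ans_2)).strip()
    second = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=ans_2, answer_b=ans_1)).strip()
    if first == "A" and second == "B":
        return "1"   # ans_1 preferred in both orderings
    if first == "B" and second == "A":
        return "2"   # ans_2 preferred in both orderings
    return "tie"     # orderings disagree or the judge was inconsistent
```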

Resources

  • 📚MT-Bench and AlpacaEval papers
  • 📚RAGAS documentation
  • 📚Anthropic evaluations research

🌳 Advanced

Evaluation research and methodology

What to Learn

  • Evaluation contamination and leakage (see the n-gram overlap sketch after this list)
  • Benchmark saturation analysis
  • Developing new evaluation tasks
  • Red teaming and adversarial evaluation
  • Measuring emergent capabilities
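
A common first-pass check for contamination is n-gram overlap: flag a benchmark example if any of its n-grams also appears in the training corpus. The sketch below uses whitespace tokenization, a 13-gram window, and an in-memory set; real pipelines normalize text more aggressively and use scalable indexes, and the exact n is a tuning choice rather than a standard.

```python
def ngrams(text: str, n: int = 13):
    """Lowercased, whitespace-tokenized n-grams (real pipelines normalize harder)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_corpus_index(corpus_docs, n: int = 13):
    """Union of all n-grams in the training corpus (assumes it fits in memory)."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(benchmark_example: str, corpus_index: set, n: int = 13) -> bool:
    """Flag the example if it shares any n-gram with the training corpus."""
    return bool(ngrams(benchmark_example, n) & corpus_index)

# Usage: drop (or report separately) every flagged benchmark example.
# corpus_index = build_corpus_index(training_documents)
# clean_eval = [ex for ex in eval_set if not is_contaminated(ex, corpus_index)]
```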

Resources

  • 📚BIG-Bench and HELM papers
  • 📚Evaluation methodology papers
  • 📚AI safety evaluation frameworks