💬 LLM & GenAI
📊 LLM Evaluation
Learn to properly evaluate LLM outputs for quality, safety, and task performance
🌱 Beginner
LLM evaluation basics
What to Learn
- Perplexity and loss-based metrics (see the perplexity sketch after this list)
- Task-specific benchmarks (MMLU, HellaSwag)
- Human evaluation protocols
- Prompt-based evaluation
- Common pitfalls in LLM evaluation
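Loss and perplexity are two views of the same quantity: perplexity is the exponential of the mean token-level cross-entropy loss. Below is a minimal sketch, assuming the Hugging Face transformers library with GPT-2 as a placeholder model; perplexities are only comparable between models that share a tokenizer.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean next-token cross-entropy loss)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # loss for next-token prediction over the sequence.
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("The quick brown fox jumps over the lazy dog."))
```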
Resources
- 📚 LMSYS Chatbot Arena
- 📚 Open LLM Leaderboard
- 📚 LangSmith evaluation features
🌿 Intermediate
Comprehensive evaluation systems
What to Learn
- LLM-as-judge evaluation (see the pairwise-judge sketch after this list)
- Pairwise comparison methods
- Multi-dimensional evaluation rubrics
- RAG evaluation (RAGAS)
- Safety and alignment testing
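A common way to combine LLM-as-judge with pairwise comparison is to show a judge model both candidate answers and ask for a verdict, then swap the order to control for position bias. The sketch below assumes the OpenAI Python SDK and an illustrative judging prompt; the prompt wording, model name, and verdict parsing are assumptions, not any benchmark's official template.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-capable judge works

JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one of: "A", "B", or "TIE"."""

def judge_once(question: str, answer_a: str, answer_b: str,
               model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def pairwise_judge(question: str, ans_1: str, ans_2: str) -> str:
    """Run the judge twice with positions swapped to mitigate position bias."""
    v1 = judge_once(question, ans_1, ans_2)  # ans_1 shown as A
    v2 = judge_once(question, ans_2, ans_1)  # ans_1 shown as B
    if v1 == "A" and v2 == "B":
        return "answer_1 wins"
    if v1 == "B" and v2 == "A":
        return "answer_2 wins"
    return "tie / inconsistent"
```

Only counting a win when both orderings agree is a simple consistency check; inconsistent verdicts are usually reported as ties or flagged for human review.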
Resources
- 📚 MT-Bench and AlpacaEval papers
- 📚 RAGAS documentation
- 📚 Anthropic evaluations research
🌳 Advanced
Evaluation research and methodology
What to Learn
- Evaluation contamination and leakage (see the n-gram overlap sketch after this list)
- Benchmark saturation analysis
- Developing new evaluation tasks
- Red teaming and adversarial evaluation
- Measuring emergent capabilities
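A coarse but widely used probe for contamination is checking whether evaluation items appear verbatim, as long n-grams, in the training corpus. The sketch below is a simplified illustration of that idea: the 13-gram default echoes common practice (e.g. GPT-3's contamination analysis), and the in-memory set is only a toy stand-in for the hashed or Bloom-filter indexes needed at real corpus scale.

```python
from typing import Iterable, Set

def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """Lower-cased word n-grams; long n-grams keep false positives rare."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_train_index(train_docs: Iterable[str], n: int = 13) -> Set[tuple]:
    """Collect every n-gram seen anywhere in the training corpus."""
    index: Set[tuple] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_item: str, train_index: Set[tuple], n: int = 13) -> bool:
    """Flag an eval item if any of its n-grams appears verbatim in training data."""
    return bool(ngrams(eval_item, n) & train_index)

# Toy usage with a short n so the example triggers; real checks use n around 13.
train = ["the capital of france is paris and has been for centuries"]
idx = build_train_index(train, n=5)
print(is_contaminated("the capital of france is paris", idx, n=5))  # True
```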
Resources
- 📚 BIG-Bench and HELM papers
- 📚 Evaluation methodology papers
- 📚 AI safety evaluation frameworks