Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
Intermediate · Eddie Landesberg, Manjari Narayan · Dec 11 · arXiv
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
#LLM-as-judge #calibration #isotonic regression