CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
Intermediate | Johannes Kirmayr, Lukas Stappen et al. | Jan 29 | arXiv
CAR-bench is a new 'driving test' for AI assistants that checks whether they stay careful, honest, and consistent during real back-and-forth conversations in a car.
#LLM agents #benchmarking #consistency