Papers (3)

#multilingual evaluation

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Zhiling Yan, Dingjie Song et al. · Feb 10 · arXiv

LiveMedBench is a new, continuously updated test for medical AIs that keeps test questions separate from training data, so models cannot score well just by memorizing answers.

#LiveMedBench #medical benchmark #data contamination

AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking

Intermediate

Xilin Jiang, Qiaolin Wang et al. · Jan 25 · arXiv

AVMeme Exam is a new test made by humans that checks if AI can understand famous internet audio and video clips the way people do.

#AVMeme Exam #multimodal large language models #audio-visual memes

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Intermediate

Joona Kytöniemi, Jousia Piha et al. · Dec 15 · arXiv

FIN-bench-v2 is a big, tidy set of Finnish tests that checks how good large language models are at many things like reading, logic, and world knowledge.

#Finnish language models #benchmark suite #HuggingFace Datasets
