Papers4

#benchmarking LLMs

No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Vynska Amalia Permadi, Xingwei Tan et al.Feb 3arXiv

This paper builds ID-MoCQA, a new two-step (multi-hop) quiz set about Indonesian culture that makes AI connect clues before answering.

#multi-hop question answering#cultural reasoning#Indonesian culture

Not triaged yet

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Intermediate

Shengrui Li, Fei Zhao et al.Jan 31arXiv

Training big language models works best when you mix the right kinds of data (general, math, code), but finding the best mix used to be slow and very expensive.

#data mixture optimization#model merging#weighted model merging

Not triaged yet

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Intermediate

Elena Bruches, Vadim Alperovich et al.Jan 26arXiv

This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.

#unit test maintenance#LLM for software engineering#reference-free evaluation

Not triaged yet

MentraSuite: Post-Training Large Language Models for Mental Health Reasoning and Assessment

Intermediate

Mengxi Xiao, Kailai Yang et al.Dec 10arXiv

MentraSuite is a complete toolkit that teaches large language models (LLMs) to reason about mental health step by step, not just sound caring.

#mental health reasoning#LLM post-training#supervised fine-tuning

Not triaged yet