This paper builds ID-MoCQA, a new two-step (multi-hop) question-answering dataset about Indonesian culture that makes AI models connect clues before answering.
Training big language models works best when you mix the right kinds of data (general, math, code), but finding the best mix used to be slow and very expensive.
This paper introduces TAM-Eval, a new way to test how well AI models can create, fix, and update unit tests for real software projects.
MentraSuite is a complete toolkit that teaches large language models (LLMs) to reason about mental health step by step, not just sound caring.