This paper builds LIBERTy, a new way to fairly judge how well AI explains its decisions about big human ideas like age, race, or experience.
The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.
This paper builds BizFinBench.v2, a big bilingual (Chinese–English) test that checks how well AI models really handle finance using real business data from China and the U.S.
Everyone uses tests (benchmarks) to judge how smart AI models are, but not all tests are good tests.
COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.
Large language models can say things that sound right but aren’t supported by the given document; this is called a faithfulness hallucination.
AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.