Papers (8)

#LLM evaluation

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

This paper builds Conv-FinRe, a new test that checks if AI financial advisors give advice that fits a person’s true goals, not just what they clicked before.

#financial recommendation #utility-based evaluation #conversational benchmark

LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Intermediate

Gilat Toker, Nitay Calderon et al. · Jan 15 · arXiv

This paper builds LIBERTy, a new way to fairly judge how well AI explains its decisions about big, human ideas like age, race, or experience.

#concept-based explanations #structural counterfactuals #structured causal models

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Intermediate

Hongjun An, Yiliang Song et al. · Jan 10 · arXiv

The paper shows that friendly, people-pleasing language can trick even advanced language models into agreeing with wrong answers.

#Preference-Undermining Attacks #PUA #sycophancy

BizFinBench.v2: A Unified Dual-Mode Bilingual Benchmark for Expert-Level Financial Capability Alignment

Intermediate

Xin Guo, Rongjunchen Zhang et al. · Jan 10 · arXiv

This paper builds BizFinBench.v2, a big bilingual (Chinese–English) test that checks how well AI models really handle finance using real business data from China and the U.S.

#BizFinBench.v2 #financial benchmark #bilingual evaluation

Benchmark^2: Systematic Evaluation of LLM Benchmarks

Intermediate

Qi Qian, Chengsong Huang et al. · Jan 7 · arXiv

Everyone uses tests (benchmarks) to judge how capable AI models are, but not all tests are good tests; this paper systematically evaluates the benchmarks themselves.

#LLM evaluation #benchmark quality #ranking consistency

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs

Intermediate

Dasol Choi, DongGeon Lee et al. · Jan 5 · arXiv

COMPASS is a new framework that turns a company’s rules into thousands of smart test questions to check if chatbots follow those rules.

#policy alignment #allowlist denylist #enterprise AI safety

FaithLens: Detecting and Explaining Faithfulness Hallucination

Intermediate

Shuzheng Si, Qingyi Wang et al. · Dec 23 · arXiv

Large language models can say things that sound right but aren’t supported by the given document; this is called a faithfulness hallucination, and FaithLens aims to detect and explain it.

#faithfulness hallucination #hallucination detection #explainable AI

AutoMV: An Automatic Multi-Agent System for Music Video Generation

Intermediate

Xiaoxuan Tang, Xinping Lei et al. · Dec 13 · arXiv

AutoMV is a team of AI helpers that turns a whole song into a full music video that matches the music, the beat, and the lyrics.

#music-to-video generation #multi-agent system #music information retrieval
