How I Study AI - Learn AI Papers & Lectures the Easy Way

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

This paper builds Conv-FinRe, a new test that checks if AI financial advisors give advice that fits a person’s true goals, not just what they clicked before.

#financial recommendation #utility-based evaluation #conversational benchmark

Benchmark^2: Systematic Evaluation of LLM Benchmarks

Intermediate

Qi Qian, Chengsong Huang et al. · Jan 7 · arXiv

Everyone uses tests (benchmarks) to judge how smart AI models are, but not all tests are good tests.

#LLM evaluation #benchmark quality #ranking consistency

Confidence Estimation for LLMs in Multi-turn Interactions

Intermediate

Caiqi Zhang, Ruihan Yang et al. · Jan 5 · arXiv

This paper studies how sure (confident) large language models are during multi-turn chats where clues arrive step by step.

#multi-turn confidence estimation #LLM calibration #InfoECE
