FINCH is a new benchmark that tests whether AI can handle real finance and accounting work using messy, real-world spreadsheets, emails, PDFs, charts, and other documents.
VOYAGER is a training-free way to make large language models (LLMs) generate data that is genuinely diverse, not just the same examples slightly reworded.
LLM judges are cheap but biased; without calibration they can completely flip which model looks best.
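To make the ranking flip concrete, here is a minimal sketch with made-up numbers. It assumes a judge that is more lenient toward one model's wrong answers, and it uses the standard Rogan-Gladen correction as a stand-in calibration step; this is not necessarily the method the summarized work uses. The function names, the model labels A and B, and all rates are hypothetical.

```python
def observed_pass_rate(true_accuracy, tpr, fpr):
    """Pass rate a biased judge would report.

    tpr: chance the judge accepts a correct answer.
    fpr: chance the judge accepts an incorrect answer.
    """
    return true_accuracy * tpr + (1.0 - true_accuracy) * fpr


def calibrated_accuracy(observed, tpr, fpr):
    """Rogan-Gladen correction: invert the judge's error rates."""
    if tpr <= fpr:
        raise ValueError("judge must accept correct answers more often than wrong ones")
    return (observed - fpr) / (tpr - fpr)


# Hypothetical setup: model A is truly better, but the judge is lenient
# toward model B's wrong answers (higher fpr). In practice tpr and fpr
# would be estimated from a small human-labeled sample.
models = {
    "A": {"true_accuracy": 0.70, "tpr": 0.90, "fpr": 0.10},
    "B": {"true_accuracy": 0.60, "tpr": 0.95, "fpr": 0.40},
}

raw = {name: observed_pass_rate(**m) for name, m in models.items()}
corrected = {
    name: calibrated_accuracy(raw[name], m["tpr"], m["fpr"])
    for name, m in models.items()
}

best_raw = max(raw, key=raw.get)
best_corrected = max(corrected, key=corrected.get)

print("raw judge scores:", raw, "-> best:", best_raw)
print("calibrated scores:", corrected, "-> best:", best_corrected)
```

With these assumed rates the raw judge scores favor B, while the corrected scores recover the true ordering and favor A.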
Role-playing agents need to juggle several goals at once, like staying in character, following instructions, and using the right tone.
Large language models forget or misuse new facts if you only edit their weights once; EtCon fixes this with a two-step approach.