Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
Key Summary
- This paper builds Conv-FinRe, a new benchmark that checks whether AI financial advisors give advice that fits a person's true goals, not just what they clicked before.
- Instead of using one "right answer," it evaluates from four angles at once: what the user chose, what's rational by risk and return, what's hot in the market, and what's extra safe.
- It learns each user's hidden risk preferences by looking backward at their choices (inverse optimization) and never shows that hidden formula to the AI being tested.
- The test is conversational and longitudinal: models read an onboarding chat, step-by-step market updates, and advisor messages, then rank stocks at each step.
- Top models score very high on rational utility, but the ones best at copying user choices aren't always the best at long-term, risk-aware advice.
- A tension appears: being rational (good for the wallet) and being behaviorally aligned (good for empathy) can pull in different directions.
- DeepSeek-V3.2 and GPT models balance different goals well, Llama-3.3-70B is the most "rational," and a finance-tuned model (XuanYuan3) best matches noisy user behavior.
- Conversation history helps some models quickly learn a user's risk style, but extra history later doesn't help much once a "persona" is clear.
- The dataset uses real market data and simulated user decisions across 10 users and 23 steps each, and it is released on Hugging Face with code on GitHub.
- Conv-FinRe shows why finance needs multi-view, utility-grounded tests so we don't confuse good coaching with copying past mistakes.
Why This Research Matters
Financial choices affect savings, college funds, and retirement, so we need advisors that protect people from risky hype while still finding good opportunities. Conv-FinRe makes it possible to check whether an AI is being a wise coach (utility), an empathetic copilot (behavior), a trend follower (momentum), or a safety-first guardian, and to see those strengths and weaknesses clearly. This helps developers design AIs that better fit real people's comfort with risk, not just short-term clicks. It also gives regulators, platforms, and users a way to compare advisors on fairness and prudence, not only popularity. By grounding evaluation in investor-specific utility and conversations over time, the benchmark pushes the field toward safer, more human-centered financial guidance. Ultimately, it reduces the chance of models encouraging decisions that feel exciting now but hurt later.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how your favorite music app guesses songs you'll like by looking at what you played before? That works pretty well for music because your past likes are a strong clue to your future likes.
The Concept (Behavioral Imitation in Recommendations): In most recommendation tests, a model is called "good" if it suggests exactly what a user clicked before. How it works:
- Collect clicks/ratings from the past
- Train or prompt the model to reproduce those choices
- Score it higher when it matches past behavior
Why it matters: This is fast and simple, and for movies or shopping, it's usually a solid proxy for what people enjoy.
Anchor: If you always click superhero movies, the system shows more superheroes and gets an "A" for matching you.
Hook: Imagine choosing snacks before a soccer game. If you eat only candy because you liked it yesterday, you might feel sick later. In money choices, "yesterday's clicks" can be candy.
The Concept (Finance Is Different): Financial advice must consider risk and long-term goals, not just yesterday's actions. How it works:
- Markets are noisy: prices bounce up and down for many reasons
- People feel emotions: fear and greed can change short-term choices
- Investors have risk limits: what's okay for one person may be scary for another
Why it matters: Copying past trades can reward bad habits or lucky guesses and ignore safety nets.
Anchor: If someone panic-bought a hot stock last week, a copycat advisor might keep pushing hot stocks, even when it's too risky for them.
Hook: Think of a coach who trains you to be healthy for a whole season, not just to win today's practice.
The Concept (Decision Quality vs. Behavior Match): Decision quality measures whether advice fits a person's true goals (like balancing return and safety), not just whether it mimics clicks. How it works:
- Define what "good" means (e.g., better returns with acceptable risk)
- Compare advice to that goal, not only to past actions
- Track this over time, because both markets and people change
Why it matters: Without this, models could get high scores for copying noise instead of giving smart, safe guidance.
Anchor: A good advisor might say "slow down" on a roller-coaster stock even if you loved the thrill last week.
The World Before: Most benchmarks in recommendations relied on one signal: what users clicked or chose. That's fine in entertainment, where clicks ≈ enjoyment. In finance, though, human choices can be short-sighted, markets are volatile, and risks matter. Existing finance datasets often measure prediction or trading profits, but not how well advice matches a particular person's risk comfort or long-term aim.
The Problem: If we treat user behavior as the only truth, we can't tell if a model is:
- Copying user noise (panic, hype)
- Chasing hot trends (market momentum)
- Offering rational, risk-aware guidance that fits the person
Failed Attempts: Prior benchmarks typically used one view of "correctness." Some financial datasets focused on which asset went up, not whether that fit a user's risk limits. Others tested chatty advisors but still scored them on behavior matching. These efforts couldn't separate empathy (matching the user's recent moves) from wisdom (protecting them from downside risk).
The Gap: We lacked a way to grade advice from several angles at once (behavior, rational utility, safety, and trend-chasing) while grounding it in each user's own risk preferences.
Hook: Imagine four judges at a cooking contest: one likes spicy, one likes sweet, one wants healthy, and one checks if the dish matches the diner's allergy list.
The Concept (Multi-View Evaluation): Look at advice from multiple, possibly conflicting views. How it works:
- Build four reference rankings: user choice, rational utility, market momentum, and safety (risk sensitivity)
- Compare the model's ranking to each view
- See which judge the model listens to most
Why it matters: Without multiple views, scores hide whether a model is wise, trendy, safe, or just a mimic.
Anchor: A dish that wins "best taste" might fail the "healthy" test; a model great at returns might flunk safety.
Real Stakes: In everyday life, a family saving for college or retirement needs advice that respects their nerves and goals. A model that copies hype can push them toward big risks at the wrong time. A model that is too cautious may miss opportunities. A smart benchmark should show whether an AI is a careful coach, a trend fan, or an echo of yesterday's choices. Conv-FinRe was built to do exactly that, using real market data, realistic conversations, and hidden (to the model) user risk preferences to test models fairly across time.
02 Core Idea
Hook: Imagine a music teacher judging not just whether you can copy a song, but whether you understand rhythm and melody and can adapt to different styles.
The Concept (Aha!): The key idea is to evaluate financial advisors by how well they align with a person's utility (their own balance of returns and risk) across time and conversation, not just by copying past actions. How it works:
- Infer each user's hidden risk preferences from their past choices (inverse optimization)
- Build four reference rankings: user choice, rational utility, market momentum, and safety
- Run step-by-step advisory conversations with market updates
- Ask models to rank stocks and compare them against all four views
Why it matters: This exposes whether a model is rational, empathetic, trend-chasing, or safety-first, and where it overfits noise.
Anchor: A great coach knows when to encourage bold moves and when to protect you; this test checks if the AI can tell the difference.
Three Analogies:
- Traffic Lights: Red (stop: risk), Green (go: momentum), Yellow (caution: the user's comfort), and GPS (utility: the best route balancing time and safety). The car (model) must obey all signals, not just the green lights.
- Cooking Judges: One judge loves spicy (momentum), one loves mild (safety), one measures nutrition (utility), and one wants your favorite flavor (user choice). The chef must balance them.
- Coach vs. Cheerleader: A coach (utility) guides for the season; a cheerleader (behavior mimic) only mirrors excitement. Good advising needs the coach.
Before vs After:
- Before: Success = "Did you match past clicks?"
- After: Success = "Did you produce smart, risk-tuned advice, and do we understand which signals you followed?"
Hook: You know how some friends love roller coasters and others prefer the carousel?
The Concept (Risk Sensitivity): Risk sensitivity captures how much an investor dislikes bumpy rides (volatility and big drops). How it works:
- Measure recent returns, volatility, and drawdowns for each stock
- Learn how strongly a user penalizes volatility and drawdowns
- Create a ranking that favors safer picks for that user
Why it matters: Without this, advisors can over-recommend thrilling but stomach-churning choices to someone who hates them.
Anchor: A risk-averse friend skips the tallest coaster even if it had great reviews last hour.
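To make the steps above concrete, here is a minimal Python sketch of the three risk features (recent return, volatility, maximum drawdown over a 7-day window) and a risk-penalized ranking. The function names and exact feature definitions are illustrative assumptions, not the benchmark's published code.

```python
import numpy as np

def risk_features(prices, window=7):
    """Compute window return, daily-return volatility, and max drawdown
    from a daily closing-price series (the most recent `window` days)."""
    p = np.asarray(prices[-(window + 1):], dtype=float)
    rets = np.diff(p) / p[:-1]                  # daily returns
    total_return = p[-1] / p[0] - 1.0           # return over the window
    volatility = rets.std()                     # dispersion of daily returns
    running_max = np.maximum.accumulate(p)
    drawdown = ((running_max - p) / running_max).max()  # worst peak-to-trough drop
    return total_return, volatility, drawdown

def safety_ranking(price_table, lam, gamma):
    """Rank tickers by a risk-penalized score for a user whose
    volatility penalty is `lam` and drawdown penalty is `gamma`."""
    scores = {}
    for ticker, prices in price_table.items():
        r, v, d = risk_features(prices)
        scores[ticker] = r - lam * v - gamma * d
    return sorted(scores, key=scores.get, reverse=True)
```

With the example penalties from the text (0.7, 0.9), a steady low-volatility series outranks a higher-return but whipsawing one, which is exactly the "skip the tallest coaster" behavior.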
Hook: Imagine trying to guess a friend's ice cream preference by watching their last 20 orders.
The Concept (Inverse Optimization): Inverse optimization figures out which hidden preferences best explain someone's past choices. How it works:
- Assume choices try to maximize a hidden utility (returns minus risk penalties)
- Fit the penalty strengths so the observed choices look most likely
- Use the fitted utility to build "rational" and "safety" rankings
Why it matters: Without inferring preferences, you can't score whether the advice fits the person's true comfort zone.
Anchor: If your friend keeps choosing strawberry when rocky road is bumpy (risky), you learn they dislike "bumps," not just love red.
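The fitting step can be sketched as a small maximum-likelihood search: assume the user picks a stock with probability proportional to exp(utility), then grid-search the penalty strengths that best explain the observed picks. This is a toy sketch under a softmax (logit-choice) assumption; the paper's actual estimator may differ.

```python
import itertools
import math

def choice_log_likelihood(lam, gamma, episodes):
    """Softmax (logit-choice) log-likelihood of observed picks under
    utility = return - lam*volatility - gamma*drawdown.
    Each episode is (features, chosen_index), where features is a list
    of (return, volatility, drawdown) tuples for the candidate stocks."""
    ll = 0.0
    for feats, chosen in episodes:
        utils = [r - lam * v - gamma * d for r, v, d in feats]
        log_z = math.log(sum(math.exp(u) for u in utils))  # log-partition
        ll += utils[chosen] - log_z
    return ll

def fit_preferences(episodes, grid=None):
    """Grid-search the (lam, gamma) pair that best explains the picks."""
    grid = grid or [i / 10 for i in range(21)]  # 0.0 .. 2.0 in steps of 0.1
    return max(itertools.product(grid, grid),
               key=lambda p: choice_log_likelihood(p[0], p[1], episodes))
```

For example, a user who repeatedly passes on a higher-return but riskier stock should come out with penalties large enough that the safe pick has the higher fitted utility.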
Why It Works (Intuition):
- Finance advice is a trade-off machine: more return can mean more risk.
- People differ in how much risk feels "too much."
- By learning each person's risk dial and testing models against four views, we can tell which advisors are thoughtful coaches vs. copycats vs. trend chasers.
Building Blocks:
- Hidden Utility: Returns minus user-specific penalties for volatility and drawdown
- Four Views: User choice, rational utility, market momentum, safety-first
- Conversations Over Time: Onboarding to learn about the person, step-by-step market updates
- Diagnostics: Metrics that show who the model "listens to" (utility, momentum, safety, behavior)
Hook: Imagine a school report card that shows math, art, sports, and kindness, so you see the whole student.
The Concept (Multi-View Diagnostic): A multi-view diagnostic lets us see strengths and weaknesses, not just one score. How it works:
- Compare model rankings to each reference
- Compute alignment scores and ranking metrics
- Interpret trade-offs: where is the model rational vs. empathetic vs. trendy vs. safe?
Why it matters: One number can hide problems; multiple views reveal the model's real personality.
Anchor: A student with straight A's in math but low teamwork might need different coaching than a balanced B+ student.
03 Methodology
At a high level: Input → [Onboarding profile + Prior dialogue + Current market snapshot + Three expert messages] → Model composes a final stock ranking → Evaluate from four angles.
Hook: Think of packing a lunch: you remember what your friend said they like (onboarding), what they ate yesterday (history), what's fresh today (market), and tips from three chefs (experts).
The Concept (Conversational, Longitudinal Setup): The task is a step-by-step advisory chat where the model must adapt over time. How it works:
- Onboarding interview P_i: a brief 4-turn chat that verbalizes the user's background, goals, and comfort with risk
- History H_i(1:t-1): all previous advisor messages and user choices up to now
- Market M_t: a 7-day snapshot for each candidate stock (returns, volatility, drawdowns), plus a plain-language summary
- Panel of three advisors: utility-focused, momentum-chasing, and safety-first comments
- The model outputs a ranking over a fixed set of stocks at each step
Why it matters: Without the conversation and time steps, you can't tell if the model learns and adapts to the user's style as the market moves.
Anchor: Each school day you choose lunch again; what you liked Monday and what's fresh Friday both matter.
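As a rough illustration of the step input I_i_t = {P_i, H_i(1:t-1), M_t}, the sketch below bundles onboarding, history, market stats, and the advisor panel's messages into a single prompt. All field names and the prompt layout are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class StepInput:
    """One decision step: I_i_t = {P_i, H_i(1:t-1), M_t} plus advisor notes.
    Field names are illustrative, not the benchmark's real schema."""
    onboarding: list                     # P_i: the 4-turn onboarding chat
    history: list = field(default_factory=list)        # H_i(1:t-1): prior turns and picks
    market: dict = field(default_factory=dict)         # M_t: 7-day stats per ticker
    advisor_notes: dict = field(default_factory=dict)  # utility / momentum / safety messages

    def prompt(self):
        """Flatten the step into one prompt string for the model under test."""
        parts = ["# Onboarding", *self.onboarding, "# History"]
        parts += [f"{turn['role']}: {turn['text']}" for turn in self.history]
        parts.append("# Market (7-day return / volatility / drawdown)")
        parts += [f"{t}: {r:+.2%} / {v:.2%} / {d:.2%}"
                  for t, (r, v, d) in self.market.items()]
        parts += [f"[{name} advisor] {msg}" for name, msg in self.advisor_notes.items()]
        return "\n".join(parts)
```

At each step t the harness would rebuild this object with one more turn of history and a fresh market snapshot, then ask the model for a ranking.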
Step-by-Step Details (with examples):
- Inputs Assembled
- What happens: Build I_i_t = {P_i, H_i(1:t-1), M_t}. P_i is the 4-turn onboarding; H_i(1:t-1) is the running chat; M_t encodes 7-day stats for each stock in a small, balanced universe of 10 S&P 500 names (e.g., PG, MRK, VZ, LIN, XOM, JPM, AMZN, MMM, SPG, TSLA) from Aug 6 to Sep 17, 2025.
- Why it exists: It mirrors the real advisory flow: learn about the person, remember past steps, consider today's market.
- Example: At step 8, the model reads that the user disliked last week's drawdown and that TSLA had high volatility recently.
- Hidden Preference Grounding (Inverse Optimization)
- What happens: For each user, estimate how much they penalize volatility (λ_i) and drawdowns (γ_i) so that their past picks look likely under utility = (standardized return) − λ_i*(std volatility) − γ_i*(std drawdown).
- Why it exists: We need a personal "yardstick" to grade the rational and safety views without telling the model the answers.
- Example: A user's fitted (λ, γ) are (0.7, 0.9), meaning they're quite sensitive to big drops.
- Construct Four Reference Views
- What happens: Create rankings for (a) User Choice (what they picked), (b) Rational Utility (maximize the learned utility), (c) Market Momentum (pure recent return), (d) Safety (penalize risk using λ_i, γ_i).
- Why it exists: Multiple judges reveal whether the model is rational, trendy, safe, or mimicking behavior.
- Example: In a trending week, AMZN and TSLA might top momentum, while MRK and PG top safety for a risk-averse user.
- Panel of Three Advisors Speak
- What happens: The conversation provides messages from advisors embodying utility, momentum, and safety principles.
- Why it exists: It presents conflicting, realistic coaching signals for the model to synthesize.
- Example: Momentum says, "TSLA's 7-day gain leads"; Safety says, "Volatility is high; consider PG"; Utility balances both given the user's (λ, γ).
- The Model Reranks Stocks
- What happens: The model f_θ ingests I_i_t and the advisor messages, then outputs a ranked list π_i,t.
- Why it exists: We evaluate model judgment under pressure from different viewpoints.
- Example: The model might place LIN, MRK, PG above TSLA for a safety-leaning user, even if momentum favors TSLA.
- Scoring from Four Angles
- What happens: Compute alignment metrics.
- Why it exists: One score can't capture the trade-offs.
- Example: A model gets a high utility-based NDCG but a lower hit rate for matching the exact user pick.
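The four reference views above can be sketched as simple rankings over per-stock features. The behavior view comes straight from the user's observed order; the other three sort by different scoring rules. The exact safety rule used here (pure penalized risk, ignoring return) is an assumption for illustration.

```python
def reference_views(features, user_pick_order, lam, gamma):
    """Build the four reference rankings for one decision step.
    features: {ticker: (return, volatility, drawdown)} over the window.
    user_pick_order: the user's observed preference order (behavior view).
    lam, gamma: the user's fitted risk penalties (hidden from the model)."""
    def rank_by(score):
        # Sort tickers by a scoring rule, best first.
        return sorted(features, key=lambda t: score(*features[t]), reverse=True)
    return {
        "behavior": list(user_pick_order),                         # what the user chose
        "utility":  rank_by(lambda r, v, d: r - lam*v - gamma*d),  # rational view
        "momentum": rank_by(lambda r, v, d: r),                    # recent return only
        "safety":   rank_by(lambda r, v, d: -(lam*v + gamma*d)),   # least penalized risk
    }
```

In a trending week this reproduces the example from the text: a hot, volatile name tops the momentum view while a defensive name tops safety and, for a risk-averse user, utility.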
Core Metrics (Sandwich explanations):
- uNDCG (Utility-based NDCG): Hook: imagine grading a bookshelf by how well your favorite books are at the top. It measures how well the ranking places high-utility stocks near the top. How it works: (1) give each stock a utility score, (2) reward higher scores near rank 1, (3) normalize to 0-1. Why it matters: without it, we can't judge whether the list reflects the user's rational risk-return balance. Anchor: if safer, good-return picks show up at the top, uNDCG is high.
- MRR and Hit Rate: Hook: imagine guessing which snack your friend will grab first. They check how well the model predicts the user's actual choice position (MRR) and whether it's in the top K (HR@K). How it works: (1) find the chosen stock's rank, (2) take its reciprocal for MRR, (3) see if it's in the top 1 or 3 for HR. Why it matters: without it, we can't see behavioral alignment. Anchor: if the friend's chosen snack is often #1, HR@1 is high.
- Expert Alignment Score (Kendall's tau): Hook: think of comparing two ordered top-10 lists to see how similarly they sort the same songs. It measures how much the model's ranking agrees with each expert view (utility, momentum, safety). How it works: (1) compare all item pairs, (2) count pairs ordered the same way, (3) convert to a score between -1 and 1. Why it matters: without it, we can't tell which "judge" the model followed. Anchor: if your playlist order matches the "spicy" judge's order, tau with momentum is high.
Secret Sauce:
- The user's utility (λ, γ) is learned from behavior but never shown to the model, preventing cheating and forcing genuine reasoning.
- Multi-view scoring separates wisdom (utility), empathy (user choice), trendiness (momentum), and caution (safety).
- Conversations are realistic and long (up to 26 turns), so models must remember and adapt, not just memorize rules.
04 Experiments & Results
The Test: The benchmark uses 10 users, each with 23 decision steps over a 30-day period (Aug 6 to Sep 17, 2025) and a compact, sector-balanced, 10-stock universe with varied risk. Each step includes a 7-day market snapshot, advisor messages, and the unfolding chat history. We measure: utility-based NDCG (uNDCG) for rational fit, MRR/HR@K for behavior match, and Expert Alignment Scores (Kendall's tau) for which "judge" the model listens to (utility, momentum, safety).
The Competition: Models include GPT-5.2 and GPT-4o (closed-source), DeepSeek-V3.2, Qwen3-235B-A22B, Qwen2.5-72B, Llama-3.3-70B (open-source general models), and Llama3-XuanYuan3-70B-Chat (finance-tuned). A random baseline gives a sanity-check lower bound.
The Scoreboard (with context):
- Overall utility fit is strong: most models achieve high uNDCG (0.92-0.97). Llama-3.3-70B-Instruct leads with 0.97, like getting an A+ in rational ranking.
- Behavior match varies: Qwen2.5-72B and XuanYuan3 excel in MRR/HR, meaning they often place the user's actual pick near the top, like being great mind-readers of short-term behavior.
- Example scores: GPT-4o hits uNDCG 0.94, MRR 0.56 (HR@1 0.42, HR@3 0.60). Llama-3.3-70B hits uNDCG 0.97 but a lower HR@1 of 0.36, showing rationality over mimicry.
- Takeaway: High rationality doesn't guarantee high mimicry; empathy and wisdom can diverge.
Expert Alignment Findings (Kendall's tau):
- Llama-3.3-70B aligns highest with utility (τ≈0.74) and momentum (τ≈0.73) but drops on safety (τ≈0.17), like a fast driver who is great at reading the road and going with traffic, but not the best at braking early.
- DeepSeek-V3.2 is the most balanced across all three judges, with solid safety alignment, like a careful driver who still keeps good speed.
- The GPT series shows similarly balanced traits but slightly less safety consistency than DeepSeek.
- XuanYuan3, the finance-specialist model, has lower expert tau but high behavioral hit rates, suggesting empathetic mimicry of real user moves over strict formula-following, like a human advisor who listens closely to a nervous client.
Surprising/Notable Dynamics:
- Utility and momentum often couple during trending markets; when hot stocks also boost utility, models that chase trends can look "rational," but this masks safety gaps.
- Conversation history helps some models early. GPT-5.2 and DeepSeek-V3.2 show big uNDCG gains in steps 1-10, meaning they learn the user's risk vibe quickly. Later, gains plateau: once the user persona is clear, fresh market data drives more of the decision.
Archetypes from Context Use:
- Adaptive Advisors (GPT-5.2, DeepSeek-V3.2, Qwen3-235B): improve utility alignment with history; good at integrating preferences across turns.
- Transaction-Driven Analysts (GPT-4o, Llama-3.3-70B): strong utility without much help from history; lean on current market signals.
- Behavioral Overfitters (Qwen2.5-72B, XuanYuan3): utility alignment gets worse with history, likely from overfitting to noisy user actions.
What This Means:
- There is a persistent tension between advising wisely (utility) and matching behavior (user choice). You can't assume that a model great at one is great at the other.
- Balanced models that respect safety stand out when trends and risks conflict.
- Long conversations help early, but after a point, today's market snapshot matters more than more memory.
05 Discussion & Limitations
Limitations:
- The small, controlled stock universe (10 tickers) ensures balance and clarity, but it's not the full market. Results may shift with larger, more complex universes.
- Risk preferences are inferred from behavior using a specific utility form (returns minus volatility and drawdown penalties). Different utility forms (e.g., prospect-theory variants) could change which advice looks "rational."
- Conversations are simulated from real traces and profiles, not raw human chats, which, while controlled and reproducible, may miss some messiness of live dialogues.
- Momentum-utility coupling in trending windows can blur whether a model is genuinely rational or simply trend-following.
- Only long positions and a 30-day horizon are considered; other strategies or horizons might surface different behaviors.
Required Resources:
- Access to LLMs with long-context handling (8k+ tokens)
- Market data retrieval (e.g., Yahoo Finance) and feature computation for rolling windows
- Inverse optimization fitting per user trajectory
- Evaluation harness for ranking metrics and Kendallâs tau
When NOT to Use:
- If your goal is pure price prediction or trading PnL without personalization, other datasets are better.
- If you need open-ended, creative finance tutoring without verifiable scoring, note that this benchmark is built for measurable, utility-grounded alignment.
- If your users face complex instruments (options, leverage) beyond the benchmark's scope, the utility/risk models need extension first.
Open Questions:
- How to expand utility to richer risk notions (tail risk, regime shifts) without losing interpretability?
- Can we design training objectives that directly optimize multi-view alignment (utility + safety + empathy) instead of post-hoc evaluation only?
- How do models behave with dynamic, time-varying user risk preferences rather than fixed (λ, γ)?
- What prompts or architectures best separate "surface mimicry" from "true coaching" in finance dialogs?
- Can counterfactual conversation edits (e.g., removing momentum hints) cleanly reveal causal model biases?
06 Conclusion & Future Work
3-Sentence Summary: Conv-FinRe is a conversational, longitudinal benchmark that scores financial advisors by how well they fit each person's balance of risk and return (utility), not just by copying past actions. It builds four complementary reference views (user choice, rational utility, market momentum, and safety) and learns each user's hidden risk sensitivity via inverse optimization. Experiments show a core tension: the most rational models aren't always the best at mimicking behavior, and vice versa, with balanced models standing out when trends and risks collide.
Main Achievement: Turning financial recommendation evaluation into a multi-view, utility-grounded diagnostic that separates wisdom (rationality) from empathy (behavioral alignment) and trendiness, using realistic conversations over time.
Future Directions:
- Scale to larger universes, longer horizons, and richer risk measures (e.g., tail risk)
- Train models to optimize multi-view alignment directly
- Allow dynamic user preferences that evolve across the conversation
- Add causal probes to disentangle momentum vs. utility influences
Why Remember This: In finance, "copying clicks" is not the same as "good coaching." Conv-FinRe shows how to test AI advisors the way good coaches are judged: by how well they balance ambition with safety for this person, right now, across time. And it gives clear, multi-angle scores that reveal a model's true advisory personality.
Practical Applications
- Compare AI advisors by how well they fit a specific investor's risk comfort, not just by click-matching scores.
- Tune prompts or model settings to balance rational utility (returns vs. risk) with user empathy (behavior alignment).
- Audit whether an advisor is secretly chasing momentum instead of respecting user safety preferences.
- Design training objectives that directly optimize multi-view alignment (utility, momentum, safety, behavior).
- Build onboarding flows that surface a user's risk style in simple language, then validate it over time.
- Use conversation history strategically: emphasize early preference learning, then focus on current market context later.
- Add safety checks that flag when recommendations drift away from a user's inferred risk tolerance.
- Run A/B tests to see whether balanced advisors (utility + safety) improve long-term user satisfaction vs. pure mimicry.
- Extend the benchmark to more assets or longer horizons to stress-test advisor stability across regimes.
- Inform regulatory evaluations with utility-grounded, multi-view evidence of suitability and risk alignment.