Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation
Key Summary
- This paper builds Conv-FinRe, a new benchmark that checks whether AI financial advisors give advice that fits a person's true goals, not just what they clicked before.
- Instead of using one "right answer," it evaluates from four angles at once: what the user chose, what's rational by risk and return, what's hot in the market, and what's extra safe.
- It learns each user's hidden risk preferences by looking backward at their choices (inverse optimization) and never shows that hidden formula to the AI being tested.
- The test is conversational and longitudinal: models read an onboarding chat, step-by-step market updates, and advisor messages, then rank stocks at each step.
- Top models score very high on rational utility, but the ones best at copying user choices aren't always the best at long-term, risk-aware advice.
- A tension appears: being rational (good for the wallet) and being behaviorally aligned (good for empathy) can pull in different directions.
- DeepSeek-V3.2 and GPT models balance different goals well, Llama-3.3-70B is the most "rational," and a finance-tuned model (XuanYuan3) best matches noisy user behavior.
- Conversation history helps some models quickly learn a user's risk style, but extra history later doesn't help much once a "persona" is clear.
- The dataset uses real market data and simulated user decisions across 10 users and 23 steps each, and it is released on Hugging Face with code on GitHub.
- Conv-FinRe shows why finance needs multi-view, utility-grounded tests so we don't confuse good coaching with copying past mistakes.
Why This Research Matters
Financial choices affect savings, college funds, and retirement, so we need advisors that protect people from risky hype while still finding good opportunities. Conv-FinRe makes it possible to check whether an AI is being a wise coach (utility), an empathetic copilot (behavior), a trend follower (momentum), or a safety-first guardian, and to see those strengths and weaknesses clearly. This helps developers design AIs that better fit real people's comfort with risk, not just short-term clicks. It also gives regulators, platforms, and users a way to compare advisors on fairness and prudence, not only popularity. By grounding evaluation in investor-specific utility and conversations over time, the benchmark pushes the field toward safer, more human-centered financial guidance. Ultimately, it reduces the chance of models encouraging decisions that feel exciting now but hurt later.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how your favorite music app guesses songs you'll like by looking at what you played before? That works pretty well for music because your past likes are a strong clue to your future likes.
The Concept (Behavioral Imitation in Recommendations): In most recommendation tests, a model is called "good" if it suggests exactly what a user clicked before. How it works:
- Collect clicks/ratings from the past
- Train or prompt the model to reproduce those choices
- Score it higher when it matches past behavior
Why it matters: This is fast and simple, and for movies or shopping, it's usually a solid proxy for what people enjoy.
Anchor: If you always click superhero movies, the system shows more superheroes and gets an "A" for matching you.
Hook: Imagine choosing snacks before a soccer game. If you eat only candy because you liked it yesterday, you might feel sick later. In money choices, "yesterday's clicks" can be candy.
The Concept (Finance Is Different): Financial advice must consider risk and long-term goals, not just yesterday's actions. How it works:
- Markets are noisy: prices bounce up and down for many reasons
- People feel emotions: fear and greed can change short-term choices
- Investors have risk limits: what's okay for one person may be scary for another
Why it matters: Copying past trades can reward bad habits or lucky guesses and ignore safety nets.
Anchor: If someone panic-bought a hot stock last week, a copycat advisor might keep pushing hot stocks, even when it's too risky for them.
Hook: Think of a coach who trains you to be healthy for a whole season, not just to win today's practice.
The Concept (Decision Quality vs. Behavior Match): Decision quality measures whether advice fits a person's true goals (like balancing return and safety), not just whether it mimics clicks. How it works:
- Define what "good" means (e.g., better returns with acceptable risk)
- Compare advice to that goal, not only to past actions
- Track this over time, because both markets and people change
Why it matters: Without this, models could get high scores for copying noise instead of giving smart, safe guidance.
Anchor: A good advisor might say "slow down" on a roller-coaster stock even if you loved the thrill last week.
The World Before: Most benchmarks in recommendations relied on one signal: what users clicked or chose. That's fine in entertainment, where clicks ≈ enjoyment. In finance, though, human choices can be short-sighted, markets are volatile, and risks matter. Existing finance datasets often measure prediction or trading profits, but not how well advice matches a particular person's risk comfort or long-term aim.
The Problem: If we treat user behavior as the only truth, we can't tell if a model is:
- Copying user noise (panic, hype)
- Chasing hot trends (market momentum)
- Offering rational, risk-aware guidance that fits the person
Failed Attempts: Prior benchmarks typically used one view of "correctness." Some financial datasets focused on which asset went up, not whether that fit a user's risk limits. Others tested chatty advisors but still scored them on behavior matching. These efforts couldn't separate empathy (matching the user's recent moves) from wisdom (protecting them from downside risk).
The Gap: We lacked a way to grade advice from several angles at once (behavior, rational utility, safety, and trend-chasing) while grounding it in each user's own risk preferences.
Hook: Imagine four judges at a cooking contest: one likes spicy, one likes sweet, one wants healthy, and one checks if the dish matches the diner's allergy list.
The Concept (Multi-View Evaluation): Look at advice from multiple, possibly conflicting views. How it works:
- Build four reference rankings: user choice, rational utility, market momentum, and safety (risk sensitivity)
- Compare the model's ranking to each view
- See which judge the model listens to most
Why it matters: Without multiple views, scores hide whether a model is wise, trendy, safe, or just a mimic.
Anchor: A dish that wins "best taste" might fail the "healthy" test; a model great at returns might flunk safety.
Real Stakes: In everyday life, a family saving for college or retirement needs advice that respects their nerves and goals. A model that copies hype can push them toward big risks at the wrong time. A model that is too cautious may miss opportunities. A smart benchmark should show whether an AI is a careful coach, a trend fan, or an echo of yesterday's choices. Conv-FinRe was built to do exactly that, using real market data, realistic conversations, and hidden (to the model) user risk preferences to test models fairly across time.
02 Core Idea
Hook: Imagine a music teacher judging not just whether you can copy a song, but whether you understand rhythm and melody and can adapt to different styles.
The Concept (Aha!): The key idea is to evaluate financial advisors by how well they align with a person's utility (their own balance of returns and risk) across time and conversation, not just by copying past actions. How it works:
- Infer each user's hidden risk preferences from their past choices (inverse optimization)
- Build four reference rankings: user choice, rational utility, market momentum, and safety
- Run step-by-step advisory conversations with market updates
- Ask models to rank stocks and compare them against all four views
Why it matters: This exposes whether a model is rational, empathetic, trend-chasing, or safety-first, and where it overfits noise.
Anchor: A great coach knows when to encourage bold moves and when to protect you; this test checks if the AI can tell the difference.
Three Analogies:
- Traffic Lights: Red (stop: risk), Green (go: momentum), Yellow (caution: the user's comfort), and GPS (utility: the best route balancing time and safety). The car (model) must obey all signals, not just the green lights.
- Cooking Judges: One judge loves spicy (momentum), one loves mild (safety), one measures nutrition (utility), and one wants your favorite flavor (user choice). The chef must balance them.
- Coach vs. Cheerleader: A coach (utility) guides for the season; a cheerleader (behavior mimic) only mirrors excitement. Good advising needs the coach.
Before vs After:
- Before: Success = "Did you match past clicks?"
- After: Success = "Did you produce smart, risk-tuned advice, and do we understand which signals you followed?"
Hook: You know how some friends love roller coasters and others prefer the carousel?
The Concept (Risk Sensitivity): Risk sensitivity captures how much an investor dislikes bumpy rides (volatility and big drops). How it works:
- Measure recent returns, volatility, and drawdowns for each stock
- Learn how strongly a user penalizes volatility and drawdowns
- Create a ranking that favors safer picks for that user
Why it matters: Without this, advisors can over-recommend thrilling but stomach-churning choices to someone who hates them.
Anchor: A risk-averse friend skips the tallest coaster even if it had great reviews last hour.
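To make the steps above concrete, here is a minimal Python sketch of the three risk features (recent return, volatility, maximum drawdown over a 7-day window) and a risk-penalized ranking. The function names and exact feature definitions are illustrative assumptions, not the benchmark's published code.

```python
import numpy as np

def risk_features(prices, window=7):
    """Compute window return, daily-return volatility, and max drawdown
    from a daily closing-price series (the most recent `window` days)."""
    p = np.asarray(prices[-(window + 1):], dtype=float)
    rets = np.diff(p) / p[:-1]                  # daily returns
    total_return = p[-1] / p[0] - 1.0           # return over the window
    volatility = rets.std()                     # dispersion of daily returns
    running_max = np.maximum.accumulate(p)
    drawdown = ((running_max - p) / running_max).max()  # worst peak-to-trough drop
    return total_return, volatility, drawdown

def safety_ranking(price_table, lam, gamma):
    """Rank tickers by a risk-penalized score for a user whose
    volatility penalty is `lam` and drawdown penalty is `gamma`."""
    scores = {}
    for ticker, prices in price_table.items():
        r, v, d = risk_features(prices)
        scores[ticker] = r - lam * v - gamma * d
    return sorted(scores, key=scores.get, reverse=True)
```

With the example penalties from the text (0.7, 0.9), a steady low-volatility series outranks a higher-return but whipsawing one, which is exactly the "skip the tallest coaster" behavior.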
Hook: Imagine trying to guess a friend's ice cream preference by watching their last 20 orders.
The Concept (Inverse Optimization): Inverse optimization figures out which hidden preferences best explain someone's past choices. How it works:
- Assume choices try to maximize a hidden utility (returns minus risk penalties)
- Fit the penalty strengths so the observed choices look most likely
- Use the fitted utility to build "rational" and "safety" rankings
Why it matters: Without inferring preferences, you can't score whether the advice fits the person's true comfort zone.
Anchor: If your friend keeps choosing strawberry when rocky road is bumpy (risky), you learn they dislike "bumps," not just love red.
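The fitting step can be sketched as a small maximum-likelihood search: assume the user picks a stock with probability proportional to exp(utility), then grid-search the penalty strengths that best explain the observed picks. This is a toy sketch under a softmax (logit-choice) assumption; the paper's actual estimator may differ.

```python
import itertools
import math

def choice_log_likelihood(lam, gamma, episodes):
    """Softmax (logit-choice) log-likelihood of observed picks under
    utility = return - lam*volatility - gamma*drawdown.
    Each episode is (features, chosen_index), where features is a list
    of (return, volatility, drawdown) tuples for the candidate stocks."""
    ll = 0.0
    for feats, chosen in episodes:
        utils = [r - lam * v - gamma * d for r, v, d in feats]
        log_z = math.log(sum(math.exp(u) for u in utils))  # log-partition
        ll += utils[chosen] - log_z
    return ll

def fit_preferences(episodes, grid=None):
    """Grid-search the (lam, gamma) pair that best explains the picks."""
    grid = grid or [i / 10 for i in range(21)]  # 0.0 .. 2.0 in steps of 0.1
    return max(itertools.product(grid, grid),
               key=lambda p: choice_log_likelihood(p[0], p[1], episodes))
```

For example, a user who repeatedly passes on a higher-return but riskier stock should come out with penalties large enough that the safe pick has the higher fitted utility.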
Why It Works (Intuition):
- Finance advice is a trade-off machine: more return can mean more risk.
- People differ in how much risk feels "too much."
- By learning each person's risk dial and testing models against four views, we can tell which advisors are thoughtful coaches vs. copycats vs. trend chasers.
Building Blocks:
- Hidden Utility: Returns minus user-specific penalties for volatility and drawdown
- Four Views: User choice, rational utility, market momentum, safety-first
- Conversations Over Time: Onboarding to learn about the person, step-by-step market updates
- Diagnostics: Metrics that show who the model "listens to" (utility, momentum, safety, behavior)
Hook: Imagine a school report card that shows math, art, sports, and kindness, so you see the whole student.
The Concept (Multi-View Diagnostic): A multi-view diagnostic lets us see strengths and weaknesses, not just one score. How it works:
- Compare model rankings to each reference
- Compute alignment scores and ranking metrics
- Interpret trade-offs: where is the model rational vs. empathetic vs. trendy vs. safe?
Why it matters: One number can hide problems; multiple views reveal the model's real personality.
Anchor: A student with straight A's in math but low teamwork might need different coaching than a balanced B+ student.
03 Methodology
At a high level: Input → [Onboarding profile + Prior dialogue + Current market snapshot + Three expert messages] → Model composes a final stock ranking → Evaluate from four angles.
Hook: Think of packing a lunch: you remember what your friend said they like (onboarding), what they ate yesterday (history), what's fresh today (market), and tips from three chefs (experts).
The Concept (Conversational, Longitudinal Setup): The task is a step-by-step advisory chat where the model must adapt over time. How it works:
- Onboarding interview P_i: a brief 4-turn chat that verbalizes the user's background, goals, and comfort with risk
- History H_i(1:t-1): all previous advisor messages and user choices up to now
- Market M_t: a 7-day snapshot for each candidate stock (returns, volatility, drawdowns), plus a plain-language summary
- Panel of three advisors: utility-focused, momentum-chasing, and safety-first comments
- The model outputs a ranking over a fixed set of stocks at each step
Why it matters: Without the conversation and time steps, you can't tell if the model learns and adapts to the user's style as the market moves.
Anchor: Each school day you choose lunch again; what you liked Monday and what's fresh Friday both matter.
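As a rough illustration of the step input I_i_t = {P_i, H_i(1:t-1), M_t}, the sketch below bundles onboarding, history, market stats, and the advisor panel's messages into a single prompt. All field names and the prompt layout are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class StepInput:
    """One decision step: I_i_t = {P_i, H_i(1:t-1), M_t} plus advisor notes.
    Field names are illustrative, not the benchmark's real schema."""
    onboarding: list                     # P_i: the 4-turn onboarding chat
    history: list = field(default_factory=list)        # H_i(1:t-1): prior turns and picks
    market: dict = field(default_factory=dict)         # M_t: 7-day stats per ticker
    advisor_notes: dict = field(default_factory=dict)  # utility / momentum / safety messages

    def prompt(self):
        """Flatten the step into one prompt string for the model under test."""
        parts = ["# Onboarding", *self.onboarding, "# History"]
        parts += [f"{turn['role']}: {turn['text']}" for turn in self.history]
        parts.append("# Market (7-day return / volatility / drawdown)")
        parts += [f"{t}: {r:+.2%} / {v:.2%} / {d:.2%}"
                  for t, (r, v, d) in self.market.items()]
        parts += [f"[{name} advisor] {msg}" for name, msg in self.advisor_notes.items()]
        return "\n".join(parts)
```

At each step t the harness would rebuild this object with one more turn of history and a fresh market snapshot, then ask the model for a ranking.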
Step-by-Step Details (with examples):
- Inputs Assembled
- What happens: Build I_i_t = {P_i, H_i(1:t-1), M_t}. P_i is the 4-turn onboarding; H_i(1:t-1) is the running chat; M_t encodes 7-day stats for each stock in a small, balanced universe of 10 S&P 500 names (e.g., PG, MRK, VZ, LIN, XOM, JPM, AMZN, MMM, SPG, TSLA) from Aug 6 to Sep 17, 2025.
- Why it exists: It mirrors the real advisory flow: learn about the person, remember past steps, consider today's market.
- Example: At step 8, the model reads that the user disliked last week's drawdown and that TSLA had high volatility recently.
- Hidden Preference Grounding (Inverse Optimization)
- What happens: For each user, estimate how much they penalize volatility (λ_i) and drawdowns (γ_i) so that their past picks look likely under utility = (standardized return) − λ_i*(std volatility) − γ_i*(std drawdown).
- Why it exists: We need a personal "yardstick" to grade the rational and safety views without telling the model the answers.
- Example: A user's fitted (λ, γ) are (0.7, 0.9), meaning they're quite sensitive to big drops.
- Construct Four Reference Views
- What happens: Create rankings for (a) User Choice (what they picked), (b) Rational Utility (maximize the learned utility), (c) Market Momentum (pure recent return), (d) Safety (penalize risk using λ_i, γ_i).
- Why it exists: Multiple judges reveal whether the model is rational, trendy, safe, or mimicking behavior.
- Example: In a trending week, AMZN and TSLA might top momentum, while MRK and PG top safety for a risk-averse user.
- Panel of Three Advisors Speak
- What happens: The conversation provides messages from advisors embodying utility, momentum, and safety principles.
- Why it exists: It presents conflicting, realistic coaching signals for the model to synthesize.
- Example: Momentum says, "TSLA's 7-day gain leads"; Safety says, "Volatility is high; consider PG"; Utility balances both given the user's (λ, γ).
- The Model Reranks Stocks
- What happens: The model f_θ ingests I_i_t and the advisor messages, then outputs a ranked list π_i,t.
- Why it exists: We evaluate model judgment under pressure from different viewpoints.
- Example: The model might place LIN, MRK, PG above TSLA for a safety-leaning user, even if momentum favors TSLA.
- Scoring from Four Angles
- What happens: Compute alignment metrics.
- Why it exists: One score can't capture the trade-offs.
- Example: A model gets a high utility-based NDCG but a lower hit rate for matching the exact user pick.
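The four reference views above can be sketched as simple rankings over per-stock features. The behavior view comes straight from the user's observed order; the other three sort by different scoring rules. The exact safety rule used here (pure penalized risk, ignoring return) is an assumption for illustration.

```python
def reference_views(features, user_pick_order, lam, gamma):
    """Build the four reference rankings for one decision step.
    features: {ticker: (return, volatility, drawdown)} over the window.
    user_pick_order: the user's observed preference order (behavior view).
    lam, gamma: the user's fitted risk penalties (hidden from the model)."""
    def rank_by(score):
        # Sort tickers by a scoring rule, best first.
        return sorted(features, key=lambda t: score(*features[t]), reverse=True)
    return {
        "behavior": list(user_pick_order),                         # what the user chose
        "utility":  rank_by(lambda r, v, d: r - lam*v - gamma*d),  # rational view
        "momentum": rank_by(lambda r, v, d: r),                    # recent return only
        "safety":   rank_by(lambda r, v, d: -(lam*v + gamma*d)),   # least penalized risk
    }
```

In a trending week this reproduces the example from the text: a hot, volatile name tops the momentum view while a defensive name tops safety and, for a risk-averse user, utility.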
Core Metrics (Sandwich explanations):
- uNDCG (Utility-based NDCG): Hook: imagine grading a bookshelf by how well your favorite books are at the top. It measures how well the ranking places high-utility stocks near the top. How it works: (1) give each stock a utility score, (2) reward higher scores near rank 1, (3) normalize to 0-1. Why it matters: without it, we can't judge whether the list reflects the user's rational risk-return balance. Anchor: if safer, good-return picks show up at the top, uNDCG is high.
- MRR and Hit Rate: Hook: imagine guessing which snack your friend will grab first. They check how well the model predicts the user's actual choice position (MRR) and whether it's in the top K (HR@K). How it works: (1) find the chosen stock's rank, (2) take its reciprocal for MRR, (3) see if it's in the top 1 or 3 for HR. Why it matters: without it, we can't see behavioral alignment. Anchor: if the friend's chosen snack is often #1, HR@1 is high.
- Expert Alignment Score (Kendall's tau): Hook: think of comparing two ordered top-10 lists to see how similarly they sort the same songs. It measures how much the model's ranking agrees with each expert view (utility, momentum, safety). How it works: (1) compare all item pairs, (2) count pairs ordered the same way, (3) convert to a score between -1 and 1. Why it matters: without it, we can't tell which "judge" the model followed. Anchor: if your playlist order matches the "spicy" judge's order, tau with momentum is high.
Secret Sauce:
- The user's utility (λ, γ) is learned from behavior but never shown to the model, preventing cheating and forcing genuine reasoning.
- Multi-view scoring separates wisdom (utility), empathy (user choice), trendiness (momentum), and caution (safety).
- Conversations are realistic and long (up to 26 turns), so models must remember and adapt, not just memorize rules.
04 Experiments & Results
The Test: The benchmark uses 10 users, each with 23 decision steps over a 30-day period (Aug 6 to Sep 17, 2025) and a compact, sector-balanced, 10-stock universe with varied risk. Each step includes a 7-day market snapshot, advisor messages, and the unfolding chat history. We measure: utility-based NDCG (uNDCG) for rational fit, MRR/HR@K for behavior match, and Expert Alignment Scores (Kendall's tau) for which "judge" the model listens to (utility, momentum, safety).
The Competition: Models include GPT-5.2 and GPT-4o (closed-source), DeepSeek-V3.2, Qwen3-235B-A22B, Qwen2.5-72B, Llama-3.3-70B (open-source general models), and Llama3-XuanYuan3-70B-Chat (finance-tuned). A random baseline gives a sanity-check lower bound.
The Scoreboard (with context):
- Overall utility fit is strong: most models achieve high uNDCG (0.92-0.97). Llama-3.3-70B-Instruct leads with 0.97, like getting an A+ in rational ranking.
- Behavior match varies: Qwen2.5-72B and XuanYuan3 excel in MRR/HR, meaning they often place the user's actual pick near the top, like being great mind-readers of short-term behavior.
- Example scores: GPT-4o hits uNDCG 0.94, MRR 0.56 (HR@1 0.42, HR@3 0.60). Llama-3.3-70B hits uNDCG 0.97 but a lower HR@1 of 0.36, showing rationality over mimicry.
- Takeaway: High rationality doesn't guarantee high mimicry; empathy and wisdom can diverge.
Expert Alignment Findings (Kendall's tau):
- Llama-3.3-70B aligns highest with utility (τ≈0.74) and momentum (τ≈0.73) but drops on safety (τ≈0.17), like a fast driver who is great at reading the road and going with traffic, but not the best at braking early.
- DeepSeek-V3.2 is the most balanced across all three judges, with solid safety alignment, like a careful driver who still keeps good speed.
- The GPT series shows similarly balanced traits but slightly less safety consistency than DeepSeek.
- XuanYuan3, the finance-specialist model, has lower expert tau but high behavioral hit rates, suggesting empathetic mimicry of real user moves over strict formula-following, like a human advisor who listens closely to a nervous client.
Surprising/Notable Dynamics:
- Utility and momentum often couple during trending markets; when hot stocks also boost utility, models that chase trends can look "rational," but this masks safety gaps.
- Conversation history helps some models early. GPT-5.2 and DeepSeek-V3.2 show big uNDCG gains in steps 1-10, meaning they learn the user's risk vibe quickly. Later, gains plateau: once the user persona is clear, fresh market data drives more of the decision.
Archetypes from Context Use:
- Adaptive Advisors (GPT-5.2, DeepSeek-V3.2, Qwen3-235B): improve utility alignment with history; good at integrating preferences across turns.
- Transaction-Driven Analysts (GPT-4o, Llama-3.3-70B): strong utility without much help from history; lean on current market signals.
- Behavioral Overfitters (Qwen2.5-72B, XuanYuan3): utility alignment gets worse with history, likely from overfitting to noisy user actions.
What This Means:
- There is a persistent tension between advising wisely (utility) and matching behavior (user choice). You can't assume that a model great at one is great at the other.
- Balanced models that respect safety stand out when trends and risks conflict.
- Long conversations help early, but after a point, today's market snapshot matters more than more memory.
05 Discussion & Limitations
Limitations:
- The small, controlled stock universe (10 tickers) ensures balance and clarity, but it's not the full market. Results may shift with larger, more complex universes.
- Risk preferences are inferred from behavior using a specific utility form (returns minus volatility and drawdown penalties). Different utility forms (e.g., prospect-theory variants) could change which advice looks "rational."
- Conversations are simulated from real traces and profiles, not raw human chats, which, while controlled and reproducible, may miss some messiness of live dialogues.
- Momentum-utility coupling in trending windows can blur whether a model is genuinely rational or simply trend-following.
- Only long positions and a 30-day horizon are considered; other strategies or horizons might surface different behaviors.
Required Resources:
- Access to LLMs with long-context handling (8k+ tokens)
- Market data retrieval (e.g., Yahoo Finance) and feature computation for rolling windows
- Inverse optimization fitting per user trajectory
- Evaluation harness for ranking metrics and Kendallâs tau
When NOT to Use:
- If your goal is pure price prediction or trading PnL without personalization, other datasets are better.
- If you need open-ended, creative finance tutoring without verifiable scoring, note that this benchmark is built for measurable, utility-grounded alignment.
- If your users face complex instruments (options, leverage) beyond the benchmark's scope, the utility/risk models need extension first.
Open Questions:
- How to expand utility to richer risk notions (tail risk, regime shifts) without losing interpretability?
- Can we design training objectives that directly optimize multi-view alignment (utility + safety + empathy) instead of post-hoc evaluation only?
- How do models behave with dynamic, time-varying user risk preferences rather than fixed (λ, γ)?
- What prompts or architectures best separate "surface mimicry" from "true coaching" in finance dialogs?
- Can counterfactual conversation edits (e.g., removing momentum hints) cleanly reveal causal model biases?
06 Conclusion & Future Work
3-Sentence Summary: Conv-FinRe is a conversational, longitudinal benchmark that scores financial advisors by how well they fit each person's balance of risk and return (utility), not just by copying past actions. It builds four complementary reference views (user choice, rational utility, market momentum, and safety) and learns each user's hidden risk sensitivity via inverse optimization. Experiments show a core tension: the most rational models aren't always the best at mimicking behavior, and vice versa, with balanced models standing out when trends and risks collide.
Main Achievement: Turning financial recommendation evaluation into a multi-view, utility-grounded diagnostic that separates wisdom (rationality) from empathy (behavioral alignment) and trendiness, using realistic conversations over time.
Future Directions:
- Scale to larger universes, longer horizons, and richer risk measures (e.g., tail risk)
- Train models to optimize multi-view alignment directly
- Allow dynamic user preferences that evolve across the conversation
- Add causal probes to disentangle momentum vs. utility influences
Why Remember This: In finance, "copying clicks" is not the same as "good coaching." Conv-FinRe shows how to test AI advisors the way good coaches are judged: by how well they balance ambition with safety for this person, right now, across time. And it gives clear, multi-angle scores that reveal a model's true advisory personality.
Practical Applications
- Compare AI advisors by how well they fit a specific investor's risk comfort, not just by click-matching scores.
- Tune prompts or model settings to balance rational utility (returns vs. risk) with user empathy (behavior alignment).
- Audit whether an advisor is secretly chasing momentum instead of respecting user safety preferences.
- Design training objectives that directly optimize multi-view alignment (utility, momentum, safety, behavior).
- Build onboarding flows that surface a user's risk style in simple language, then validate it over time.
- Use conversation history strategically: emphasize early preference learning, then focus on current market context later.
- Add safety checks that flag when recommendations drift away from a user's inferred risk tolerance.
- Run A/B tests to see whether balanced advisors (utility + safety) improve long-term user satisfaction vs. pure mimicry.
- Extend the benchmark to more assets or longer horizons to stress-test advisor stability across regimes.
- Inform regulatory evaluations with utility-grounded, multi-view evidence of suitability and risk alignment.