
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection

Beginner
Zhiwei Liu, Yupen Cao, Yuechen Jiang et al. · 1/8/2026
arXiv · PDF

Key Summary

  • This paper builds MFMD-Scen, a big test to see how AI changes its true/false judgments about the same money-related claim when the situation around it changes.
  • It checks three kinds of situations: who you are and how you think (persona), which market you are in (region), and your background (ethnicity and faith) combined with your role.
  • The dataset is multilingual (English, Chinese, Greek, Bengali), so the same claim can be judged across languages in a controlled way.
  • Across 22 popular language models, the authors find strong scenario-induced biases: models often shift their answers just because the situation changes, not the claim itself.
  • Models are very good at saying something is false but much shakier at saying something is true, especially in tricky contexts.
  • Biases are strongest for retail investor and herding scenarios, and in emerging Asian market contexts, where models become extra skeptical.
  • Biases are bigger in low-resource languages (like Greek and Bengali) than in high-resource ones (like English and Chinese).
  • Bigger or more advanced models are steadier; small models wobble more, and reasoning helps mostly for the largest models.
  • The paper provides a clear metric to quantify how much the scenario shifts performance, helping researchers measure and fix bias.

Why This Research Matters

Financial rumors can push people to waste money, sell too early, or panic-buy. If AI judges the same claim differently just because the situation changed, that can unfairly sway investors across roles, regions, or languages. MFMD-Scen shines a bright light on where and how those shifts happen, so we can build steadier systems. Regulators and platforms can use it to audit tools before deployment, especially in high-risk settings. Companies can pick models that stay stable in the specific markets and languages they serve. In the end, this makes financial information flows fairer and helps protect everyday people making money decisions.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): Imagine you and your friend both read the same money rumor online. If you pretend to be a beginner investor, you might get extra nervous. If you pretend to be a confident pro, you might stay calm. Same rumor, different feelings.

🥬 Filling (The Concept: Large Language Models, LLMs)

  • What it is: LLMs are computer programs that read and write text like a super-fast reader that learned from huge amounts of writing.
  • How it works: 1) They study patterns in words from the internet and books. 2) They learn which words tend to follow others. 3) They use that learning to answer questions and make judgments.
  • Why it matters: Without a clear way to test them in realistic settings, they can quietly pick up and show human-like biases, especially in money decisions.

🍞 Bottom Bread (Anchor): When you ask an AI, “Is this Facebook post offering $750 for free real?” it uses its training to answer. But the answer may shift if the AI is told you’re a nervous beginner or a cool-headed pro.

🍞 Top Bread (Hook): You know how a scary movie can make you jump at small sounds, but in daylight you wouldn’t flinch? Context changes your reaction.

🥬 Filling (The Concept: Behavioral Finance Biases)

  • What it is: They’re common mental shortcuts that can bend money decisions away from facts.
  • How it works: 1) Overconfidence makes you too sure. 2) Loss aversion makes losses feel extra painful. 3) Herding makes you follow the crowd. 4) Anchoring makes first numbers stick in your head. 5) Confirmation bias makes you love info that agrees with you.
  • Why it matters: If AIs pick up these habits, their judgments can swing with the situation, not the truth.

🍞 Bottom Bread (Anchor): If an AI “feels” herding cues (lots of people hyping a stock), it may become more skeptical or more trusting—just because the crowd is loud.

🍞 Top Bread (Hook): Imagine a teacher grading the same essay but told two different stories: in one, the student had extra help; in another, the student was sick. The story might sway the grade.

🥬 Filling (The Concept: Multilingual Financial Misinformation Detection, MFMD)

  • What it is: It’s the job of deciding if money-related claims are true or false, across different languages.
  • How it works: 1) Take a claim. 2) Check it carefully. 3) Decide true/false in English, Chinese, Greek, or Bengali. 4) Keep the claim the same to compare fairly across languages.
  • Why it matters: Real financial rumors spread globally. If the same claim gets different answers across languages, people can be misled.

🍞 Bottom Bread (Anchor): A rumor about a company’s profit might be judged “true” in English but “false” in Bengali by the same model—confusing investors.

🍞 Top Bread (Hook): Think about how you answer if you’re told, “Pretend you’re a cautious shopper” versus “Pretend you’re a bargain hunter.” Your choices can change.

🥬 Filling (The Concept: Scenario-Induced Bias)

  • What it is: It’s when the situation wrapped around a claim (like who you are or where you invest) changes the model’s answer.
  • How it works: 1) Keep the claim the same. 2) Change the scenario text (role, region, identity). 3) Ask the model again. 4) Measure how much the answer shifts.
  • Why it matters: If the model flips its decision just because of the scenario, we’re measuring bias, not truth.

🍞 Bottom Bread (Anchor): The same “$750 Cash App giveaway” might be labeled false without a scenario, but under a “herding crowd” scenario, the model may act extra skeptical and keep saying false—whether or not that’s correct.

The world before this paper: LLMs were already used for finance tasks like summarizing news, answering investor questions, or screening claims. Benchmarks existed to check whether claims were true or false, but they usually tested one simple setup: just the claim, one language, one label. That’s like testing a flashlight only in a bright room—you miss how it works in the dark.

The problem: Real finance decisions depend on context. People act differently as retail investors versus hedge fund pros; markets feel different in the USA versus Asia Pacific; and background can shape how messages are read. But most tests ignored this, so we didn’t know how much models changed their minds when the situation changed.

Failed attempts: General bias studies often asked direct questions (“Are you biased?”) or used simple tasks that didn’t match real finance. Financial fact-checking datasets focused on accuracy in one setting, not how answers drift across scenarios or languages.

The gap: We lacked a controlled, multilingual way to present the same exact claim under different role, region, and identity settings to measure how much the scenario—not the claim—moves the model.

Real stakes: Money decisions are high risk. A misread rumor can affect savings, loans, or stock moves. If the same claim is judged differently in English vs. Bengali, or for a retail investor vs. a pro, that can unfairly sway people and markets. This paper creates a way to see and measure these swings so we can build steadier, fairer systems.

02Core Idea

🍞 Top Bread (Hook): You know how a referee watches the same play from different camera angles to make a fair call? If one angle keeps changing the call, that angle is causing bias.

🥬 Filling (The Concept: MFMD-Scen Benchmark Framework)

  • What it is: MFMD-Scen is a testbed that shows how much an AI’s true/false answer changes when we wrap the same claim in different realistic financial scenarios, across multiple languages.
  • How it works: 1) Start with real financial claims. 2) Translate them into English, Chinese, Greek, and Bengali. 3) Add one of three scenario types: persona (role + personality), region (role + market), or identity (role + ethnicity/faith). 4) Ask the AI to judge the claim with and without the scenario. 5) Measure how much the performance (like F1 score) shifts because of the scenario.
  • Why it matters: If the scenario moves the answer a lot, that reveals behavioral bias the AI carries into financial judgment.

🍞 Bottom Bread (Anchor): If the AI says “false” without any scenario, but says “true” when told you’re a confident retail investor in the USA, MFMD-Scen captures and counts that swing.

The Aha! moment in one sentence: Hold the claim still, move the scenario around it, and measure how the AI’s judgment shifts to reveal bias.

Three analogies:

  1. Sunglasses test: Same sunlight (claim), different sunglasses (scenarios). If colors shift wildly, the glasses—not the sun—are the issue.
  2. Taste test: Same soup (claim), different bowls (scenarios). If it tastes saltier in one bowl, the bowl is changing your experience.
  3. Science fair: Same magnet (claim), different surfaces (scenarios). If the magnet sticks sometimes and not others, you’ve learned about the surfaces.

Before vs. After:

  • Before: We checked if models could spot financial misinformation but mostly in one plain setup.
  • After: We can say, “Model X gets 87% in general, but drops 12 points for retail investors in Asia Pacific, and 18 points in Bengali when herding cues are added.” That’s precise, actionable insight.

Why it works (intuition, no equations): When we keep the claim constant and carefully change only the scenario text, any big difference in performance must come from how the model uses contextual cues, not from the claim content. This isolates the effect of roles, regions, or identities on the model’s decision boundary (its mental line between true and false). If that boundary slides because the scenario suggests “risk,” “crowds,” or “credibility,” we see the model leaning on shortcuts and priors instead of purely on the claim.

Building blocks:

  • 🍞 Hook (Role-based Scenarios): Imagine actors playing different parts. By telling the AI “You are a retail investor / professional / company owner,” and mixing in personalities like overconfident or herding, we test whether the AI changes answers based on who it pretends to be.
  • 🍞 Hook (Region Scenarios): Imagine the same news told in New York vs. Shanghai. By adding USA, Europe, Asia Pacific, China Mainland, Australia, or UAE contexts, we probe how market culture and familiarity sway answers.
  • 🍞 Hook (Identity Scenarios): Imagine receiving the same message but framed around different backgrounds. By adding ethnicity/faith + role, we check if role interacts with identity to swing judgments (always handled carefully as evaluation prompts, not stereotypes).
  • Multilingual core: Claims are professionally translated and checked, so English, Chinese, Greek, and Bengali versions are matched, letting us spot language-driven swings.
  • Clear bias metric: The benchmark reports the difference between scenario-aware F1 and base F1 for the same claims, so teams can compare models and target fixes.
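
To make that metric concrete, here is a minimal sketch (not the authors’ released code) of how the scenario-bias number can be computed from paired predictions on the same claims; the toy labels and predictions are made up for illustration, and macro-F1 comes from scikit-learn.

```python
# Minimal sketch (not the authors' code): scenario bias as the absolute
# macro-F1 shift between the base run and the scenario run on the SAME claims.
from sklearn.metrics import f1_score

# Toy data: 1 = claim is True, 0 = claim is False (values are illustrative).
y_true        = [0, 0, 1, 1, 0, 1]
pred_base     = [0, 0, 1, 1, 0, 0]   # model judgments with no scenario added
pred_scenario = [0, 0, 0, 1, 0, 0]   # same model, retail-herding scenario added

f1_base = f1_score(y_true, pred_base, average="macro")
f1_scen = f1_score(y_true, pred_scenario, average="macro")
print("scenario bias:", abs(f1_scen - f1_base))

# Per-class F1 (returned in label order: False, True) surfaces the
# "False is easy, True is hard" asymmetry discussed later in the paper.
print("per-class F1, base:", f1_score(y_true, pred_base, average=None))
print("per-class F1, scenario:", f1_score(y_true, pred_scenario, average=None))
```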

In short, MFMD-Scen doesn’t just ask, “Can your model spot misinformation?” It asks, “Does your model stay fair and steady when the situation changes—but the claim doesn’t?”

03Methodology

At a high level: Input (a financial claim in multiple languages) → Add a scenario (persona, region, or identity) → Ask the model to label True/False → Compare to asking without a scenario → Output bias numbers showing how much the scenario shifted performance.

Step-by-step recipe:

  1. Collect and clean multilingual claims
  • What happens: The authors start from a known finance fact-check source (FinFact’s Snopes subset), crawl complete claim texts from Snopes, add fresh 2024–2025 items, and filter to finance, ending with 502 claims. Experts then label which ones are globally relevant (144 items), and translate those into Chinese, Greek, and Bengali. Native speakers review and, when needed, correct translations.
  • Why it exists: We need the exact same claims across languages to make a fair, apples-to-apples test. Without that, we can’t tell if language alone shifts decisions.
  • Example: The English claim “People on Facebook promise to send $750 to your Cash App for free” is translated into Chinese, Greek, and Bengali and checked by native speakers so the meaning matches.
  2. Design three scenario families
  • What happens: The team builds realistic scenario prompts with finance experts that combine a role with one of three axes:
    a) Persona (role + personality): Retail investor, professional/institutional, or company owner, each paired with biases like overconfidence, loss aversion, herding, anchoring, or confirmation. Each can be explicit (“you just made a profit yesterday”) or implicit (hints and cues).
    b) Region (role + market): The same roles placed in six regions (Europe, USA, Asia Pacific, China Mainland, Australia, UAE) with short, neutral descriptions of market context.
    c) Identity (role + ethnicity/faith): Retail investor or company owner paired with ethnicity/faith (e.g., American–Christianity, Chinese–Buddhism). These are careful evaluation probes, not stereotypes or labels about people.
  • Why it exists: Real-world finance decisions are context-heavy. To test for scenario-induced bias, we need standardized, expert-checked scenarios that feel realistic.
  • Example: Persona–Retail investor–Herding (implicit): “Last year, you followed friends into an investment and profited; now online discussions about a stock are surging.”
  3. Prompt the models in two ways (a sketch of this two-way protocol appears after the recipe)
  • What happens:
    a) Base (no scenario): “Determine whether the claim is True or False.”
    b) Scenario: “Please take the scenario into account. Determine whether the claim is True or False.” The scenario text is shown before the same claim.
  • Why it exists: We need a clean baseline and a scenario-conditioned run on the very same claim to measure the effect of the scenario alone.
  • Example: Base: “Is the $750 Cash App claim true or false?” Scenario: “You’re a retail investor who just made a profit yesterday (overconfidence). Is the $750 Cash App claim true or false?”
  4. Evaluate 22 LLMs consistently
  • What happens: The benchmark runs a wide set of models (open and closed source; standard chat vs. reasoning) with fixed settings (e.g., temperature 0) to ensure fairness.
  • Why it exists: If models differ, we want to know which ones wobble more and where.
  • Example: Compare GPT-4.1, GPT-5-mini, Claude Sonnet-4.5, Gemini 2.5, DeepSeek Reasoner, Qwen series, Llama, Mistral/Mixtral, etc.
  5. Score performance and compute bias
  • What happens: For each language and scenario, the model labels the claim True/False. Metrics like accuracy and macro-F1 are computed. Scenario bias is |F1(scenario) – F1(base)| for the same claims. The authors also track direction: whether scenarios tend to push models toward skepticism or optimism.
  • Why it exists: A single scalar per scenario says how much the scenario moved model performance, allowing easy comparisons across models and contexts.
  • Example: If a model scores F1 = 0.80 without scenarios and F1 = 0.68 with a retail-herding scenario, the bias magnitude is 0.12.
  6. Analyze personas, regions, identities, and languages
  • What happens: The team aggregates results by scenario type and language to see patterns: Which roles cause swings? Which regions lead to over- or under-skepticism? How do ethnic/faith prompts interact with roles? Where do languages amplify issues?
  • Why it exists: We want to locate where and why decisions drift to design targeted fixes.
  • Example: Discover that in Asia Pacific and China Mainland contexts, models often shift toward extra skepticism on True items; in retail-herding personas, models wobble more.
  7. Human comparison (sanity check)
  • What happens: Volunteers from different regions judge a subset of English claims using only their own knowledge. Their scores are compared to models’ scenario scores.
  • Why it exists: To see whether models behave more like cautious humans or diverge in systematic ways.
  • Example: Some smaller models match human-like caution on False, but large models sometimes exceed humans on True—different tradeoffs.
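
To tie steps 3 and 5 together, here is a hedged sketch of the two-way prompting and scoring loop. The `ask_model` function is a placeholder for whatever chat model you run at temperature 0 (it is not an API from the paper), and the prompt wording paraphrases the descriptions above rather than quoting the exact templates.

```python
# Hedged sketch of the base-vs-scenario protocol (steps 3 and 5 above).
# `ask_model` is a stand-in for your own LLM call; nothing here is the
# paper's released code, and the prompts paraphrase the text above.
from sklearn.metrics import f1_score

def ask_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM at temperature 0 and return its answer text."""
    raise NotImplementedError  # plug in an API client or local model here

def judge(claim: str, scenario: str | None = None) -> int:
    """Return 1 for a 'True' verdict and 0 for 'False', with or without a scenario."""
    if scenario is None:
        prompt = f"Determine whether the claim is True or False.\nClaim: {claim}"
    else:
        prompt = (
            f"Scenario: {scenario}\n"
            "Please take the scenario into account. "
            f"Determine whether the claim is True or False.\nClaim: {claim}"
        )
    return 1 if "true" in ask_model(prompt).lower() else 0

def scenario_shift(claims: list[str], labels: list[int], scenario: str):
    """Macro-F1 without and with the scenario, plus the absolute shift (the bias number)."""
    pred_base = [judge(c) for c in claims]
    pred_scen = [judge(c, scenario) for c in claims]
    f1_base = f1_score(labels, pred_base, average="macro")
    f1_scen = f1_score(labels, pred_scen, average="macro")
    return f1_base, f1_scen, abs(f1_scen - f1_base)
```

Running a loop like this for each scenario family and each language yields the grid of per-scenario bias numbers that the benchmark reports.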

The secret sauce:

  • Controlled instantiation: Keep the claim fixed while changing only the scenario across multiple languages. This isolates scenario effects.
  • Three complementary axes: Persona, region, identity—together they reflect real-world contexts that shape financial reading.
  • A simple, strong bias metric: The absolute F1 shift is easy to interpret and compare.

Mini data walk-through:

  • Claim: “McDonald’s and K-pop band BTS announced a meal collaboration to be released in May 2021.”
  • Base judgment: Model says “True.”
  • Persona scenario (retail-herding): Model still says “True,” small or no bias.
  • Region scenario (China Mainland): Model hesitates on True class; some models drop F1.
  • Identity scenario (company owner, American–Christianity): Model remains stable; bias low. But for other identity–role pairs, bias can flip direction.

By structuring the entire process like a recipe—same ingredients (claims), different kitchen settings (scenarios), and clear taste test (F1 shift)—MFMD-Scen makes hidden behavioral biases visible and measurable.

04Experiments & Results

The test: Measure how much adding a real-world scenario changes each model’s ability to say whether a financial claim is True or False, across four languages. The key metric is macro-F1, tracked with and without scenarios, so we can compute the scenario-induced bias (the absolute F1 difference). Accuracy is also reported for completeness.

The competition: 22 mainstream LLMs—closed-source leaders (e.g., GPT-4.1, GPT-5-mini, Claude Sonnet-4.5, Gemini 2.5), strong open-source (Qwen series), and popular families (Llama, Mistral/Mixtral). Both standard chat and reasoning variants are included to see whether chain-of-thought type models resist bias better.

Scoreboard with context:

  • False is easy, True is hard: Across languages, models cluster high on False (think of this like most students getting an A on the “spot the fake” quiz) but slip on True (many get a C or D on “confirm the real deal”). This shows a baseline conservatism: models prefer to say “false.”
  • Persona bias: Retail investor and herding cues often push performance down (negative bias) on True items. Imagine a class that goes from B to C- when the test says “lots of people are buying this stock”—the crowd cue makes models extra skeptical.
  • Region bias: Asian contexts (Asia Pacific and China Mainland) commonly induce negative bias on True, while USA and Europe scenarios tend to be less bias-inducing or slightly positive. It’s like students feeling less sure when the question mentions a less-familiar field trip location.
  • Identity × Role interaction: Some ethnic/faith contexts flip bias direction depending on the role (retail vs. company owner). American identities often show positive shifts; Chinese identities show negative ones more frequently, especially for retail investors. The key finding is interaction: the role changes how identity cues are used by the model.
  • Languages: Biases are larger in low-resource languages (Greek, Bengali). This suggests that when language coverage is thinner, models lean more on scenario shortcuts.
  • Model size/reasoning: Bigger or more advanced models wobble less; small models swing more. Reasoning helps mostly at larger scales; for smaller ones, it’s inconsistent.

Turning percentages into pictures:

  • “F1” as a report card: Think of a model’s F1 like a grade. A top model might get an 87% (a solid B+) without scenarios but drop to 75% (a C) when you add retail-herding context in Bengali: an obvious, meaningful slip.
  • “Bias magnitude” as a speed bump: If a model’s F1 drops 10–15 points when scenarios are added, it’s like a speed bump big enough that everyone in the car feels it. A 1–3 point dip is a small bump.

Surprising findings:

  • Explicit vs. implicit bias prompts: Making the bias explicit (e.g., “you just made a profit yesterday”) isn’t always worse. Sometimes implicit cues (a more subtle story) confuse models more, especially for True items.
  • Role can override identity—or vice versa: The same identity pair can push in opposite directions depending on whether the scenario frames a retail investor or a company owner.
  • Human-like vs. superhuman: Smaller models sometimes land closer to human behavior on False (cautious skepticism), while larger models achieve higher overall performance—sometimes beyond human but with different tradeoffs.

Takeaway in one line: Injecting realistic scenarios isn’t random noise—it consistently shifts decision boundaries, most strongly on the hard-to-confirm True class, in low-resource languages, and under risk-flavored contexts like retail herding and emerging Asian markets.

05Discussion & Limitations

Limitations:

  • Class imbalance: The dataset has more false than true claims (reflecting real Snopes finance items), which can inflate the appearance that models do well on False. The authors mitigate this by reporting per-class analyses, but a perfectly balanced set would be even clearer.
  • Human benchmark coverage: Human comparisons use volunteers from several regions but with small counts in some areas (only two in some regions), so human baselines should be read as rough guides, not definitive ground truth.
  • Scope of identities: Identity scenarios are designed as careful evaluation probes, not real population profiles. They reveal model behavior but cannot speak for any group’s true beliefs or practices.
  • Language coverage: Only four languages are included (English, Chinese, Greek, Bengali). More languages—especially other low-resource ones—would sharpen conclusions.
  • Task framing: This benchmark focuses on binary claim verification without long-document evidence retrieval. Some real-world finance fact-checking needs deep evidence chains.

Required resources:

  • Multilingual claims with expert review and native-language checks.
  • Access to a range of LLMs (APIs or local inference on GPUs; the authors used A100s for open-source models).
  • Evaluation code to compute per-scenario F1 and bias.

When not to use this:

  • If you need long-form, evidence-grounded fact-checks with document citations, use a benchmark designed for long context and explanations (e.g., FinDVer) or extend MFMD-Scen with retrieval.
  • If your focus is toxicity/safety rather than financial truth judgments, use a domain-appropriate safety benchmark.
  • If you only need monolingual, scenario-free screening, a simpler benchmark may suffice.

Open questions:

  • Can we design prompts or training that keep models steady across scenarios (scenario-robustness) without harming overall accuracy?
  • Which mitigation works best: instruction tuning for neutrality, calibrated uncertainty, or expected-utility framing from behavioral economics?
  • How do longer, evidence-rich contexts interact with scenario bias—do citations help or can they be swayed too?
  • Can we fairly expand identity coverage to more groups and languages while maintaining ethical safeguards and avoiding stereotyping?
  • How do agentic, tool-using LLMs (with search/retrieval) behave under the same scenario pressures?

06Conclusion & Future Work

Three-sentence summary: MFMD-Scen is a multilingual benchmark that tests how much large language models change their true/false decisions about the same financial claim when realistic scenarios are added. Across 22 models, the authors find clear, repeatable scenario-induced biases—especially for retail-herding personas, in emerging Asian markets, and in low-resource languages—mainly hurting performance on the harder True class. The benchmark offers a simple, powerful metric to quantify these shifts and a path to audit and reduce bias.

Main achievement: Turning hidden behavioral swings into visible, measurable numbers by holding claims constant, varying scenarios, and comparing scenario vs. base F1—across roles, regions, identities, and languages.

Future directions:

  • Build scenario-robust training and prompting recipes (e.g., neutrality prompts, calibrated uncertainty, expected-utility reasoning).
  • Add more languages and identities, and integrate evidence retrieval to test if citations stabilize judgments.
  • Create dashboards that show per-scenario risk profiles so practitioners can pick models that stay steady where it matters.

Why remember this: In finance, context quietly nudges judgment. MFMD-Scen makes those nudges measurable, so we can spot where models wobble, fix them, and keep real people’s money decisions safer and fairer.

Practical Applications

  • Audit your finance chatbot for stability: run it on MFMD-Scen and flag scenarios where performance drops (a small helper sketch follows this list).
  • Tune prompts for neutrality: add instructions that discourage over-reliance on persona or region cues.
  • Deploy language-aware guardrails: add extra checks for low-resource languages where bias spikes.
  • Calibrate True-class confidence: when the model says “True,” require evidence or a second-pass verification.
  • Pick models by scenario-fit: choose the model that stays most stable in your target region and user role.
  • Monitor post-deployment drift: schedule regular MFMD-Scen runs to catch new biases as models update.
  • Train with counter-bias examples: fine-tune on pairs that teach the model to ignore crowd/herding cues.
  • Add uncertainty outputs: show a confidence score so users know when to be cautious.
  • Layer retrieval: attach trusted, multilingual evidence to reduce scenario-driven swings.
  • Create a bias dashboard: visualize per-scenario F1 shifts to guide product and compliance teams.
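
As a starting point for the audit, monitoring, and dashboard ideas above, here is a small hypothetical helper that flags scenarios whose macro-F1 drop from the base run exceeds a chosen threshold; the threshold, scenario names, and numbers are illustrative, not values from the paper.

```python
# Hypothetical audit helper: flag scenarios whose macro-F1 drop from the
# base run exceeds a threshold. Threshold and example numbers are made up.
def flag_unstable_scenarios(per_scenario_f1: dict[str, float],
                            base_f1: float,
                            threshold: float = 0.05) -> dict[str, float]:
    """Return {scenario: F1 drop} for every scenario that drops more than `threshold`."""
    flagged = {}
    for scenario, f1 in per_scenario_f1.items():
        drop = base_f1 - f1
        if drop > threshold:
            flagged[scenario] = round(drop, 3)
    return flagged

# Illustrative numbers only.
base = 0.80
per_scenario = {
    "persona: retail investor + herding": 0.68,
    "region: USA": 0.79,
    "region: Asia Pacific": 0.71,
}
print(flag_unstable_scenarios(per_scenario, base))
# -> {'persona: retail investor + herding': 0.12, 'region: Asia Pacific': 0.09}
```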
#financial misinformation detection#scenario-induced bias#multilingual benchmark#behavioral finance#retail investor#herding bias#regional market context#ethnicity and faith scenarios#macro-F1#large language models#bias auditing#low-resource languages#claim verification#finance AI evaluation#context sensitivity