EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A

Intermediate
Shijian Ma, Yan Lin, Yi Yang · 1/14/2026
arXiv · PDF

Key Summary

  • EvasionBench is a new, very large dataset that helps computers spot when company leaders dodge questions during earnings call Q&A.
  • It uses a simple three-level scale—direct, intermediate, fully evasive—so models can tell full answers from partial answers and total dodges.
  • The team labeled data using a Multi-Model Consensus method that combines several top language models and a three-judge vote to reduce bias.
  • Their labels align strongly with humans, reaching a Cohen’s Kappa of 0.835, which means almost perfect agreement.
  • They release 84,000 balanced training examples and a 1,000-sample gold test set with expert human labels.
  • They also built Eva-4B, a 4-billion-parameter classifier that gets 84.9% Macro-F1, beating several powerful models on this task.
  • Ablation studies show multi-model consensus labeling improves performance by about 4.3 percentage points over single-model labeling.
  • The hardest cases are the in-between 'intermediate' answers where speakers sound helpful but dodge the main point.
  • The work fills a big gap in financial NLP by giving a standard way to measure and improve evasion detection.

Why This Research Matters

Clear answers power fair markets. When executives dodge questions in earnings calls, investors can be misled, pensions can suffer, and trust can erode. EvasionBench gives everyone a common measuring stick and a strong model to catch these dodges, helping analysts, journalists, and regulators focus on what’s actually said versus what’s avoided. The multi-model consensus approach shows how to scale high-quality labels affordably, which can be reused in other high-stakes conversations like politics or healthcare. By turning a fuzzy behavior into a concrete task, this work helps make public communication more transparent and accountable.

Detailed Explanation

01 Background & Problem Definition

🍞 Hook: You know how in class someone might give a long answer that sounds smart but doesn’t actually answer the question? That’s called dodging, and adults do it too—especially when the truth might be uncomfortable.

🄬 The Concept (Earnings Call Q&A):

  • What it is: Earnings call Q&A is a live session where company leaders answer investors’ questions about how the company is doing.
  • How it works:
    1. The company shares its financial report.
    2. Analysts ask questions (numbers, timelines, yes/no, reasons).
    3. Executives answer in front of everyone.
  • Why it matters: Investors make big money decisions based on these answers; if answers are slippery, decisions can be wrong. 🍞 Anchor: Imagine a soccer coach’s press conference—journalists ask, “Why did your defense let in two goals?” If the coach talks about jersey colors instead of defense, that’s not helpful.

🍞 Hook: Imagine playing 20 Questions, but your friend keeps changing the subject. You’d want a fair referee to call it out.

🄬 The Concept (Evasion Detection):

  • What it is: Evasion detection finds when someone responds without truly answering the question’s core.
  • How it works:
    1. Look at the question carefully (what is being asked exactly?).
    2. Read the answer and check if it addresses the core (number, timeline, yes/no, reason).
    3. Decide if it’s direct, halfway, or totally dodging.
  • Why it matters: Without it, people can sound honest while hiding key info, which can mislead investors and the public. 🍞 Anchor: If asked, “Did you do your homework?” and you reply, “I love math,” that’s evasion. A detector would flag that.

🍞 Hook: Think about school tests—there are clear score sheets so everyone knows how to grade fairly.

🄬 The Concept (Benchmark):

  • What it is: A benchmark is a shared, trusted test set that lets everyone compare methods fairly.
  • How it works:
    1. Build a large, clean dataset.
    2. Define clear labels and rules.
    3. Publish it so models can be tested the same way.
  • Why it matters: Without a benchmark, results aren’t comparable and progress is fuzzy. 🍞 Anchor: Like a standard spelling bee word list—so every school plays by the same rules.

Before this paper, there were plenty of tools for emotions (sentiment) and facts (Q&A), but almost nothing big and reliable for catching evasions. Why? Two big roadblocks: (1) Evasion is a spectrum—some answers are partly helpful and partly dodgy, which makes labeling subjective; (2) getting experts to label millions of Q&As is expensive and slow. Earlier datasets did exist, but they were small and often labeled by a single model, which can bake in that model’s quirks.

The gap: We needed a large, carefully labeled dataset with simple, useful categories that reflect how real conversations work—especially in earnings calls where the stakes are high. The paper fills this by creating EvasionBench: huge data (22.7M pairs filtered to high quality, then 84K balanced training cases and a 1K human-validated gold test set), a three-level scale (direct, intermediate, fully evasive), and a clever labeling system that mixes multiple strong AI models to get better, less biased labels.

The real stakes: If investors can’t tell when leaders dodge, they might bet on the wrong companies, pensions could suffer, and markets get shakier. Regulators and researchers also need reliable tools to spot slippery communication. With EvasionBench, everyone can measure and improve evasion detection, making financial talk clearer and fairer.

02 Core Idea

🍞 Hook: Imagine three teachers grading the same essay. If two say “A” and one says “B,” you’ll probably trust the A. More graders usually means a fairer result.

🄬 The Concept (EvasionBench’s Aha!):

  • What it is: The key insight is that combining several top AI models to agree on labels (and breaking ties with a three-judge vote) creates reliable, scalable annotations for detecting evasion on a simple three-level scale.
  • How it works:
    1. Start with lots of Q&A from earnings calls.
    2. Ask two strong models to label each pair as direct, intermediate, or fully evasive.
    3. If they agree, keep the label. If not, bring in three judges (another round of strong models) to vote. Majority wins.
    4. Train a compact 4B-parameter classifier (Eva-4B) on these consensus labels.
  • Why it matters: One model can have biases (too strict, too lenient). Multiple models plus voting reduce those quirks and create higher-quality training labels. 🍞 Anchor: Like asking three friends to taste a soup. If two say “just right,” you trust that more than just one person’s taste.

🍞 Hook: You know how a traffic light has three colors so drivers can understand quickly? Simple beats complicated when lots of people must agree.

🄬 The Concept (Three-Level Evasion Taxonomy):

  • What it is: A simple, ordered scale—direct (answers fully), intermediate (answers partly or skirts the core), fully evasive (dodges entirely).
  • How it works:
    1. Define “question core” types: numbers, timelines, yes/no, reasons.
    2. Check whether the answer hits that core directly, partially, or not at all.
    3. Use short rules: specific number/clear stance = direct; related but not core = intermediate; off-topic/refusal = fully evasive.
  • Why it matters: Too many labels confuse annotators and models; too few hide important differences (like partial vs. total dodge). 🍞 Anchor: If asked “What time is the game?” and you say “7 pm,” that’s direct; “Sometime this evening” is intermediate; “We love sports!” is fully evasive. A toy code sketch of these rules follows below.
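
To make the three levels concrete, here is a toy, rule-of-thumb labeler in Python. It only illustrates the scale and the idea of a "question core": the names (toy_label, REFUSAL_CUES) and the keyword/regex cues are assumptions for this sketch, not the paper's method, which uses LLM annotators.

```python
from enum import Enum
import re

class EvasionLabel(Enum):
    DIRECT = 0
    INTERMEDIATE = 1
    FULLY_EVASIVE = 2

# Hypothetical refusal cues; the benchmark itself is labeled by LLM annotators, not rules.
REFUSAL_CUES = ["can't share", "cannot disclose", "no comment", "not going to comment"]

def toy_label(question: str, answer: str) -> EvasionLabel:
    """Illustrative rule of thumb for the direct / intermediate / fully evasive scale."""
    ans = answer.lower()
    if any(cue in ans for cue in REFUSAL_CUES):
        return EvasionLabel.FULLY_EVASIVE       # outright refusal or total dodge
    asks_number = bool(re.search(r"\bhow (much|many)\b|\bwhat\b.*\b(margin|revenue|growth)\b",
                                 question.lower()))
    gives_number = bool(re.search(r"\d", ans))
    if asks_number and gives_number:
        return EvasionLabel.DIRECT              # numeric core answered with a number
    if asks_number and not gives_number:
        return EvasionLabel.INTERMEDIATE        # related talk, but the numeric core is missing
    return EvasionLabel.DIRECT                  # default for this toy sketch

print(toy_label("What was revenue growth this quarter?", "We grew 12% year over year."))
```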

Three analogies for the main idea:

  1. Jury analogy: Several jurors (models) listen to the same case (Q&A). If most agree on the verdict (label), it’s more trustworthy than one juror alone.
  2. Weather forecast analogy: You blend forecasts from different apps; if they mostly agree on rain, you bring an umbrella. Consensus beats one app’s guess.
  3. Teacher grading analogy: When essays are double-marked and disagreements go to a panel, grades are fairer and more consistent.

Before vs. After:

  • Before: Evasion detection was small-scale, biased by single-model labeling, and lacked a gold standard to test models fairly.
  • After: We have EvasionBench—huge, balanced data; a reliable three-level scale; a consensus labeling pipeline; and Eva-4B that matches or beats frontier models on this task.

Why it works (intuition): Different top models have different habits—some label too many answers as direct, others too many as evasive. When they agree, that label is probably strong. When they disagree, a three-judge vote dampens any one model’s tilt. Training on these steadier labels gives the classifier a clearer signal, so it learns faster and better.

🍞 Hook: Imagine building a LEGO castle from instructions everyone agreed on, not just one person’s idea. Fewer mistakes happen.

🄬 The Concept (Eva-4B Classifier):

  • What it is: A 4-billion-parameter model fine-tuned to detect evasion using the consensus-labeled data.
  • How it works:
    1. Train first on the easy, high-agreement data to learn the basics.
    2. Then train on the hard, judge-settled cases to sharpen boundary decisions.
    3. Evaluate on a human-labeled gold set to check real reliability.
  • Why it matters: It shows that a smaller, efficient model can excel when trained on high-quality labels. 🍞 Anchor: Like learning piano: start with simple songs everyone agrees sound right, then practice tricky pieces where a teacher’s guidance refines your touch.

03 Methodology

At a high level: Input (earnings call question + answer) → Clean and filter → Label with two models → If disagreement, send to a three-judge vote → Build balanced train/eval sets → Train Eva-4B in two stages → Evaluate on human-labeled gold set.

🍞 Hook: Imagine you’re making fresh juice—you wash the fruit, squeeze it, strain it, and then taste-test with friends before bottling.

🄬 The Concept (Data Collection and Filtering):

  • What it is: A pipeline that turns 22.7 million raw Q&A pairs into 11.27 million high-quality pairs, then into balanced labeled sets.
  • How it works:
    1. Extract analyst questions (Type 3) and management answers (Type 4) in order; remove greetings/instructions.
    2. Keep only legitimate Q&A: question has a question mark, answer is >30 characters, no [inaudible] markers.
    3. Require substantial content: combined length ≄ 500 characters.
  • Why it matters: Garbage in, garbage out. Clean input prevents the model from learning from noise. 🍞 Anchor: Like choosing ripe apples, tossing bruised ones, and keeping only the best for your pie. A minimal sketch of these filters follows below.
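
As a concrete illustration, a minimal Python version of these filters might look like the sketch below; the function name and exact string checks are assumptions based on the rules listed above.

```python
def keep_qa_pair(question: str, answer: str) -> bool:
    """Apply the stated quality filters to one question/answer pair."""
    if "?" not in question:                       # the question must actually ask something
        return False
    if len(answer) <= 30:                         # drop trivially short answers
        return False
    if "[inaudible]" in question.lower() or "[inaudible]" in answer.lower():
        return False                              # drop pairs with transcription gaps
    if len(question) + len(answer) < 500:         # require substantial combined content
        return False
    return True

example = ("Will margins improve next quarter?", "[inaudible] we think so.")
print(keep_qa_pair(*example))  # False: transcription gap, and too little combined content
```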

🍞 Hook: When two friends disagree on a movie, you might ask a few more friends to vote.

🄬 The Concept (Multi-Model Consensus, MMC):

  • What it is: A labeling system that uses two strong annotators first, then a three-judge vote if they disagree.
  • How it works:
    1. Stage I: Claude Opus 4.5 and Gemini 3 Flash label each sample (direct/intermediate/fully evasive) independently.
    2. Stage II: If they agree, that’s the label (consensus set). If not, proceed to arbitration.
    3. Stage III: Three judges—Opus 4.5, Gemini 3 Flash, GPT-5.2—each pick which original label is right. Majority wins.
    4. Anti-bias: Randomize which model’s answer appears first to avoid position bias.
  • Why it matters: Single models have tendencies (e.g., Opus more “direct,” Gemini more “fully evasive,” GPT-5.2 more “intermediate”). Consensus smooths these biases. 🍞 Anchor: It’s like having three referees call a close play from different angles and going with the majority. A small sketch of the consensus-and-arbitration flow follows below.
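
Here is a minimal sketch of that consensus-and-arbitration flow. The annotator and judge callables are hypothetical stand-ins for prompting the frontier models named above (Claude Opus 4.5, Gemini 3 Flash, GPT-5.2); only the control flow mirrors the description in the paper.

```python
from collections import Counter
from typing import Callable

def consensus_label(question: str, answer: str,
                    annotate_a: Callable[[str, str], str],
                    annotate_b: Callable[[str, str], str],
                    judges: list[Callable[[str, str, str, str], str]]) -> str:
    """Stage I: two annotators label independently. Stage II: agreement is kept as-is.
    Stage III: on disagreement, each judge picks one of the two candidate labels; majority wins."""
    label_a = annotate_a(question, answer)
    label_b = annotate_b(question, answer)
    if label_a == label_b:
        return label_a
    votes = [judge(question, answer, label_a, label_b) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Toy demo with stand-in annotators; the real pipeline calls LLM APIs.
print(consensus_label(
    "Will you raise full-year guidance?", "We feel very good about the business.",
    annotate_a=lambda q, a: "intermediate",
    annotate_b=lambda q, a: "fully_evasive",
    judges=[lambda q, a, x, y: "intermediate",
            lambda q, a, x, y: "intermediate",
            lambda q, a, x, y: "fully_evasive"],
))  # -> intermediate
```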

Concrete dataset construction:

  • After filtering, they selected and labeled data to create:
    • Train-60K: 60,000 consensus-only samples (broad basics).
    • Train-24K: 24,000 harder samples, including judge-resolved cases.
    • Gold-1K: 1,000 human-labeled evaluation samples (319 companies), balanced across three classes.
  • All splits are balanced (about 33.3% per class) and span 8,081 unique companies (2002–2022).

🍞 Hook: Think of a spelling coach who first drills common words, then practices tricky exceptions.

🄬 The Concept (Two-Stage Training of Eva-4B):

  • What it is: A training plan that learns basics first, then sharpens on tough cases.
  • How it works:
    1. Stage 1 (Consensus Training): Fine-tune Qwen3-4B on 60K high-agreement labels (2 epochs, LR 2e-5, bfloat16).
    2. Stage 2 (Judge-Refined Training): Continue fine-tuning on 24K, using majority-vote labels for the hard boundaries.
    3. Compare variants: Consensus-only vs. adding Opus-only labels vs. adding three-judge labels.
  • Why it matters: Easy examples teach the rules; hard examples teach the boundaries. 🍞 Anchor: Like karate: learn basic forms first, then sparring to master timing and judgment. A hedged training sketch follows below.
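
Below is a hedged sketch of what this two-stage recipe could look like with Hugging Face transformers. It assumes a recent transformers/datasets install with sequence-classification support for the Qwen3 backbone, bf16-capable hardware, and tiny stand-in datasets in place of the released Train-60K and Train-24K splits; the checkpoint name and batch size are assumptions, while the epochs, learning rate, and bfloat16 setting come from the summary above.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "Qwen/Qwen3-4B"   # backbone family named in the paper; exact checkpoint is an assumption
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    # Encode the question and answer together so the classifier sees both sides of the exchange.
    return tokenizer(batch["question"], batch["answer"], truncation=True, max_length=512)

# Tiny stand-ins for the real splits (label ids: 0 = direct, 1 = intermediate, 2 = fully evasive).
train_60k = Dataset.from_list([{"question": "What was revenue growth?",
                                "answer": "Revenue grew 12% year over year.",
                                "label": 0}]).map(tokenize, batched=True)
train_24k = Dataset.from_list([{"question": "Will you pursue M&A this year?",
                                "answer": "Our balance sheet is very strong.",
                                "label": 1}]).map(tokenize, batched=True)

def finetune(dataset, output_dir):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=2,
                             learning_rate=2e-5, bf16=True,        # settings reported in the summary
                             per_device_train_batch_size=8)        # batch size is an assumption
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=DataCollatorWithPadding(tokenizer)).train()

finetune(train_60k, "eva4b-stage1")  # Stage 1: broad basics on consensus-only labels
finetune(train_24k, "eva4b-stage2")  # Stage 2: continue on judge-refined hard cases
```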

🍞 Hook: Imagine you and a friend disagree more when one speaks first—order can sway opinions.

đŸ„Ź The Concept (Position Bias Control):

  • What it is: A fix to prevent judges from favoring whichever model’s answer appears first.
  • How it works:
    1. Shuffle which model’s label appears in the judge prompt first.
    2. Keep a fixed random seed for reproducibility.
    3. Measure win-rate changes to confirm bias exists.
  • Why it matters: If order changes outcomes, labels get skewed. Randomization keeps it fair. 🍞 Anchor: Like coin-flipping who presents first at a debate so the order doesn’t tip the scales. A tiny randomization sketch follows below.
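
A tiny sketch of this randomization is shown below; the prompt builder (build_judge_prompt) and its wording are assumptions, only the idea of shuffling the two candidate labels under a fixed seed comes from the paper.

```python
import random

rng = random.Random(42)  # fixed seed so the shuffling is reproducible

def build_judge_prompt(question: str, answer: str, label_a: str, label_b: str) -> str:
    """Randomize which annotator's label is presented first so judges cannot favor a fixed position."""
    first, second = (label_a, label_b) if rng.random() < 0.5 else (label_b, label_a)
    return (f"Question: {question}\nAnswer: {answer}\n"
            f"Candidate label 1: {first}\nCandidate label 2: {second}\n"
            "Which candidate label better describes the answer?")

print(build_judge_prompt("Will margins improve?", "We are optimistic about the future.",
                         "intermediate", "fully_evasive"))
```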

🍞 Hook: If two people grade homework the same way, we trust the grade more than if they argued a lot.

🄬 The Concept (Human Agreement, Cohen’s Kappa):

  • What it is: A score that tells how much two annotators agree beyond chance; here it’s 0.835 (“almost perfect”).
  • How it works:
    1. Have a second human label a balanced subset of 100 samples.
    2. Compute kappa; inspect where disagreements happen.
    3. Use findings to confirm which class is hardest (intermediate).
  • Why it matters: Shows the labels match human judgment, not just machine opinions. 🍞 Anchor: Two coaches timing a runner and mostly agreeing means the stopwatch is trustworthy. A short kappa computation sketch follows below.
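
Computing the agreement score itself is a one-liner with scikit-learn; the label lists below are made-up stand-ins for the pipeline's labels and a second human's labels on the same subset.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels on the same samples: pipeline output vs. a second human annotator.
pipeline_labels = ["direct", "intermediate", "fully_evasive", "intermediate", "direct"]
human_labels    = ["direct", "intermediate", "fully_evasive", "direct",       "direct"]

kappa = cohen_kappa_score(pipeline_labels, human_labels)
print(f"Cohen's kappa: {kappa:.3f}")  # the paper reports 0.835 on its 100-sample check
```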

Examples with actual data:

  • Direct: Clear numbers or a yes/no answer to the asked core.
  • Intermediate: Talks around it (e.g., describes strength but avoids saying if M&A will happen).
  • Fully evasive: Refuses to answer (“we can’t share that”) or pivots to unrelated hype.

Secret sauce:

  • Multiple strong models reduce single-model quirks.
  • Simple three-level labels improve human and model reliability.
  • Training on consensus first, then judge-resolved edge cases, yields a cleaner learning signal and faster convergence (loss down to 0.007 vs. 0.56 for single-model labels).

04 Experiments & Results

🍞 Hook: When you race toy cars, you time each car on the same track so the results are fair.

đŸ„Ź The Concept (The Test):

  • What it is: A fair comparison on a 1,000-sample, human-labeled gold set to see which model best detects evasion.
  • How it works:
    1. Use the same dataset and rules for every model.
    2. Measure Macro-F1 (treating all classes equally) and per-class F1 (direct, intermediate, fully evasive).
    3. Compare results to see strengths and weaknesses.
  • Why it matters: Equal testing lets us trust who’s really best. 🍞 Anchor: Like a spelling bee with the same word list for everyone. A short scoring sketch follows below.
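
For reference, Macro-F1 and per-class F1 can be computed with scikit-learn as in the sketch below; the gold labels and predictions are made-up examples, not results from the paper.

```python
from sklearn.metrics import f1_score

CLASSES = ["direct", "intermediate", "fully_evasive"]

# Made-up gold labels and model predictions for a handful of test samples.
y_true = ["direct", "intermediate", "fully_evasive", "intermediate", "direct", "fully_evasive"]
y_pred = ["direct", "direct",       "fully_evasive", "intermediate", "direct", "fully_evasive"]

macro_f1 = f1_score(y_true, y_pred, average="macro")                 # every class weighted equally
per_class = f1_score(y_true, y_pred, average=None, labels=CLASSES)   # one F1 score per class
print(f"Macro-F1: {macro_f1:.3f}")
print(dict(zip(CLASSES, per_class.round(3))))
```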

Who competed:

  • Closed-source leaders: Claude Opus 4.5, GPT-5.2, Gemini 3 Flash.
  • Open-source: GLM-4.7, Qwen3-Coder, MiniMax-M2.1, Kimi-K2, DeepSeek-V3.2, Qwen3-4B base.
  • Eva-4B variants: Consensus-only, Opus-only Stage 2, Full (three-judge Stage 2).

Scoreboard with context:

  • Eva-4B (Full): 84.9% Macro-F1. That’s like scoring an A when most others are at A− or B+.
  • Gemini 3 Flash: 84.6%.
  • Claude Opus 4.5: 84.4%.
  • GLM-4.7: 82.9%.
  • GPT-5.2: 80.9%.
  • Eva-4B (Consensus): 81.4%.
  • Eva-4B (Opus Only): 80.6%.
  • Qwen3-4B (Base): 34.3% (before fine-tuning), showing training quality matters a lot.

Per-class highlights:

  • Fully evasive is easiest (Eva-4B Full F1: 92.4%)—clear refusals/off-topic pivots stand out.
  • Direct is strong (82.2%).
  • Intermediate is hardest (80.1%)—where answers sound helpful but miss the core.

Ablation findings (what changed when parts were removed):

  • Base → Consensus: +47.1 percentage points Macro-F1 (massive jump from fine-tuning on consensus labels).
  • Consensus → Full: +3.5 points (adding judge-resolved hard cases sharpens the model).
  • Full vs. Opus Only (single-model labels): +4.3 points (consensus clearly beats single-model annotation).

Surprising or noteworthy findings:

  • Training dynamics: With three-judge majority labels, loss shrank to 0.007, but with single-model labels it plateaued at 0.56—suggesting noisier targets slow or block learning.
  • Position bias is real: Randomizing which model’s prediction appears first changes win rates (e.g., Opus’s win rate shifts by +5.1%), proving the need to shuffle order for fairness.
  • Shared difficulty: About one-third of the toughest errors trick many top models, meaning some borderline cases are genuinely ambiguous even for humans.

Error patterns:

  • Hedging confusion: Phrases like “we expect” or “soon” can make a direct answer look evasive.
  • Quantitative core missed: If the question asks for a number but the answer gives detailed actions without numbers, humans call it intermediate; models sometimes call it direct due to specificity.
  • Adjacent-class confusion: 95.4% of mistakes are between neighboring classes (e.g., direct vs. intermediate rather than direct vs. fully evasive), showing the three-level scale captures a meaningful spectrum.

Bottom line: High-quality, consensus-based labels make a compact model (Eva-4B) competitive with frontier systems, and the biggest remaining challenge is the subtle middle ground called intermediate.

05 Discussion & Limitations

🍞 Hook: Even the best GPS sometimes sends you on a weird detour—useful tools have limits.

🄬 The Concept (Limitations):

  • What it is: Clear boundaries of what the system can and can’t do today.
  • How it works:
    1. Domain specificity: Trained on earnings calls; political interviews or legal depositions need testing before use.
    2. Human validation scale: The gold set is human-validated, but full multi-annotator coverage is limited (a 100-sample inter-annotator agreement check); more human audits would strengthen trust.
    3. Time window: Data is 2002–2022; language evolves, so periodic updates are wise.
    4. English only: Non-English settings are future work.
  • Why it matters: Knowing limits prevents misuse and guides next steps. 🍞 Anchor: Like shoes that fit great on the track but aren’t for hiking—you need the right gear for the terrain.

Required resources:

  • Access to transcripts or similar Q&A data.
  • Compute to fine-tune and run a 4B-parameter model (modest compared to huge LLMs).
  • Clear prompts and evaluation scripts to reproduce consensus labeling and tests.

When not to use:

  • High-stakes legal or regulatory decisions without human review.
  • Domains with very different conversation styles until cross-domain validation is done.
  • Questions whose core is unclear (ambiguous ask) without additional disambiguation.

Open questions:

  • Cross-domain transfer: How well does the three-level scale work in politics, healthcare, or education Q&A?
  • Multilingual extension: Do evasion cues look the same across languages and cultures?
  • Better middle-ground detection: Can we reduce confusion in the intermediate class with richer context or core-detection modules?
  • Human+AI teamwork: What’s the best workflow for humans to audit and guide AI in edge cases?
  • Dynamic updates: How often should labels and models be refreshed as corporate language changes?

06 Conclusion & Future Work

Three-sentence summary: EvasionBench is a large, carefully built benchmark that helps AI spot when executives dodge questions during earnings calls using a simple three-level scale. It uses a multi-model consensus labeling pipeline—two annotators plus three-judge voting—to create high-quality labels that align strongly with human judgment. A compact classifier, Eva-4B, trained on these labels reaches 84.9% Macro-F1 and rivals or beats frontier models on this task.

Main achievement: Proving that consensus labeling across multiple strong models, paired with a simple and actionable taxonomy, yields reliable training data that enables a small, efficient model to excel at a subtle discourse task.

Future directions:

  • Test and adapt the framework to political interviews, legal depositions, and other adversarial Q&A domains.
  • Expand human validation at larger scales and across languages.
  • Improve handling of the tricky intermediate class, possibly by explicitly modeling the question core.

Why remember this: It shows how to turn a fuzzy, subjective behavior—dodging questions—into a measurable, teachable task at scale. With the right labels and a clear rubric, even a modest-sized model can deliver high accuracy, helping markets, regulators, and the public see through slippery talk.

Practical Applications

  • Flag potentially evasive answers in real time during earnings calls for investor relations teams.
  • Alert analysts when a numeric or timeline question was not directly answered so they can follow up.
  • Support journalists with evidence-backed summaries highlighting where management dodged key asks.
  • Help regulators and exchanges monitor disclosure quality across companies and sectors.
  • Enable portfolio managers to incorporate evasion risk signals into investment decisions.
  • Assist academics studying strategic communication patterns over time and across industries.
  • Power training tools for executives to improve clarity and directness in Q&A.
  • Enhance transcript platforms with evasion tags and heatmaps for quick navigation.
  • Benchmark new evasion-detection models fairly using the gold-standard evaluation set.
#evasion detection #earnings call Q&A #financial NLP #benchmark dataset #multi-model consensus #LLM annotation #Cohen’s Kappa #Macro-F1 #majority voting #discourse analysis #hedging language #classification #ablation study #position bias #Qwen3-4B