EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A
Key Summary
- EvasionBench is a new, very large dataset that helps computers spot when company leaders dodge questions during earnings call Q&A.
- It uses a simple three-level scale (direct, intermediate, fully evasive) so models can tell full answers from partial answers and total dodges.
- The team labeled data using a Multi-Model Consensus method that combines several top language models and a three-judge vote to reduce bias.
- Their labels align strongly with humans, reaching a Cohen's Kappa of 0.835, which means almost perfect agreement.
- They release 84,000 balanced training examples and a 1,000-sample gold test set with expert human labels.
- They also built Eva-4B, a 4-billion-parameter classifier that gets 84.9% Macro-F1, beating several powerful models on this task.
- Ablation studies show multi-model consensus labeling improves performance by about 4.3 percentage points over single-model labeling.
- The hardest cases are the in-between "intermediate" answers, where speakers sound helpful but dodge the main point.
- The work fills a big gap in financial NLP by giving a standard way to measure and improve evasion detection.
Why This Research Matters
Clear answers power fair markets. When executives dodge questions in earnings calls, investors can be misled, pensions can suffer, and trust can erode. EvasionBench gives everyone a common measuring stick and a strong model to catch these dodges, helping analysts, journalists, and regulators focus on what's actually said versus what's avoided. The multi-model consensus approach shows how to scale high-quality labels affordably, which can be reused in other high-stakes conversations like politics or healthcare. By turning a fuzzy behavior into a concrete task, this work helps make public communication more transparent and accountable.
Detailed Explanation
01 Background & Problem Definition
Hook: You know how in class someone might give a long answer that sounds smart but doesn't actually answer the question? That's called dodging, and adults do it too, especially when the truth might be uncomfortable.
The Concept (Earnings Call Q&A):
- What it is: Earnings call Q&A is a live session where company leaders answer investors' questions about how the company is doing.
- How it works:
- The company shares its financial report.
- Analysts ask questions (numbers, timelines, yes/no, reasons).
- Executives answer in front of everyone.
- Why it matters: Investors make big money decisions based on these answers; if answers are slippery, decisions can be wrong. Anchor: Imagine a soccer coach's press conference: journalists ask, "Why did your defense let in two goals?" If the coach talks about jersey colors instead of defense, that's not helpful.
Hook: Imagine playing 20 Questions, but your friend keeps changing the subject. You'd want a fair referee to call it out.
The Concept (Evasion Detection):
- What it is: Evasion detection finds when someone responds without truly answering the question's core.
- How it works:
- Look at the question carefully (what is being asked exactly?).
- Read the answer and check if it addresses the core (number, timeline, yes/no, reason).
- Decide if it's direct, halfway, or totally dodging.
- Why it matters: Without it, people can sound honest while hiding key info, which can mislead investors and the public. Anchor: If asked, "Did you do your homework?" and you reply, "I love math," that's evasion. A detector would flag that.
Hook: Think about school tests: there are clear score sheets so everyone knows how to grade fairly.
The Concept (Benchmark):
- What it is: A benchmark is a shared, trusted test set that lets everyone compare methods fairly.
- How it works:
- Build a large, clean dataset.
- Define clear labels and rules.
- Publish it so models can be tested the same way.
- Why it matters: Without a benchmark, results aren't comparable and progress is fuzzy. Anchor: Like a standard spelling bee word list, so every school plays by the same rules.
Before this paper, there were plenty of tools for emotions (sentiment) and facts (Q&A), but almost nothing big and reliable for catching evasions. Why? Two big roadblocks: (1) Evasion is a spectrum: some answers are partly helpful and partly dodgy, which makes labeling subjective; (2) Getting experts to label millions of Q&As is expensive and slow. Prior small datasets existed, but they were limited in size and often used just one model to label, which can bake in that model's quirks.
The gap: We needed a large, carefully labeled dataset with simple, useful categories that reflect how real conversations work, especially in earnings calls where the stakes are high. The paper fills this by creating EvasionBench: huge data (22.7M pairs filtered to high quality, then 84K balanced training cases and a 1K human-validated gold test set), a three-level scale (direct, intermediate, fully evasive), and a clever labeling system that mixes multiple strong AI models to get better, less biased labels.
The real stakes: If investors can't tell when leaders dodge, they might bet on the wrong companies, pensions could suffer, and markets get shakier. Regulators and researchers also need reliable tools to spot slippery communication. With EvasionBench, everyone can measure and improve evasion detection, making financial talk clearer and fairer.
02 Core Idea
Hook: Imagine three teachers grading the same essay. If two say "A" and one says "B," you'll probably trust the A. More graders usually means a fairer result.
The Concept (EvasionBench's Aha!):
- What it is: The key insight is that combining several top AI models to agree on labels (and breaking ties with a three-judge vote) creates reliable, scalable annotations for detecting evasion on a simple three-level scale.
- How it works:
- Start with lots of Q&A from earnings calls.
- Ask two strong models to label each pair as direct, intermediate, or fully evasive.
- If they agree, keep the label. If not, bring in three judges (another round of strong models) to vote. Majority wins.
- Train a compact 4B-parameter classifier (Eva-4B) on these consensus labels.
- Why it matters: One model can have biases (too strict, too lenient). Multiple models plus voting reduce those quirks and create higher-quality training labels. Anchor: Like asking three friends to taste a soup. If two say "just right," you trust that more than just one person's taste.
Hook: You know how a traffic light has three colors so drivers can understand quickly? Simple beats complicated when lots of people must agree.
The Concept (Three-Level Evasion Taxonomy):
- What it is: A simple, ordered scale: direct (answers fully), intermediate (answers partly or skirts the core), fully evasive (dodges entirely).
- How it works:
- Define "question core" types: numbers, timelines, yes/no, reasons.
- Check whether the answer hits that core directly, partially, or not at all.
- Use short rules: specific number/clear stance = direct; related but not core = intermediate; off-topic/refusal = fully evasive.
- Why it matters: Too many labels confuse annotators and models; too few hide important differences (like partial vs. total dodge). Anchor: If asked "What time is the game?" and you say "7 pm," that's direct; "Sometime this evening" is intermediate; "We love sports!" is fully evasive.
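To make the short rules above concrete, here is a toy, hypothetical sketch of how they could be approximated in code. The paper's actual labels come from LLM annotators, not regex heuristics, so every pattern and function name here is illustrative only:

```python
import re

# Toy heuristic only: EvasionBench labels come from LLM annotators, not rules like these.
def rough_rubric(question: str, answer: str) -> str:
    """Approximate the three-level rubric: direct / intermediate / fully_evasive."""
    refusal = re.search(r"can'?t (share|comment|disclose)|no comment", answer, re.IGNORECASE)
    has_number = re.search(r"\d", answer)
    clear_stance = re.search(r"\b(yes|no)\b", answer, re.IGNORECASE)

    if refusal:
        return "fully_evasive"   # outright refusal or off-topic pivot
    if has_number or clear_stance:
        return "direct"          # specific number or a clear yes/no stance
    return "intermediate"        # related talk that skirts the question core
```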
Three analogies for the main idea:
- Jury analogy: Several jurors (models) listen to the same case (Q&A). If most agree on the verdict (label), itâs more trustworthy than one juror alone.
- Weather forecast analogy: You blend forecasts from different apps; if they mostly agree on rain, you bring an umbrella. Consensus beats one appâs guess.
- Teacher grading analogy: When essays are double-marked and disagreements go to a panel, grades are fairer and more consistent.
Before vs. After:
- Before: Evasion detection was small-scale, biased by single-model labeling, and lacked a gold standard to test models fairly.
- After: We have EvasionBench: huge, balanced data; a reliable three-level scale; a consensus labeling pipeline; and Eva-4B, which matches or beats frontier models on this task.
Why it works (intuition): Different top models have different habits: some label too many answers as direct, others too many as evasive. When they agree, that label is probably strong. When they disagree, a three-judge vote dampens any one model's tilt. Training on these steadier labels gives the classifier a clearer signal, so it learns faster and better.
Hook: Imagine building a LEGO castle from instructions everyone agreed on, not just one person's idea. Fewer mistakes happen.
The Concept (Eva-4B Classifier):
- What it is: A 4-billion-parameter model fine-tuned to detect evasion using the consensus-labeled data.
- How it works:
- Train first on the easy, high-agreement data to learn the basics.
- Then train on the hard, judge-settled cases to sharpen boundary decisions.
- Evaluate on a human-labeled gold set to check real reliability.
- Why it matters: It shows that a smaller, efficient model can excel when trained on high-quality labels. Anchor: Like learning piano: start with simple songs everyone agrees sound right, then practice tricky pieces where a teacher's guidance refines your touch.
03 Methodology
At a high level: Input (earnings call question + answer) → Clean and filter → Label with two models → If disagreement, send to a three-judge vote → Build balanced train/eval sets → Train Eva-4B in two stages → Evaluate on a human-labeled gold set.
Hook: Imagine you're making fresh juice: you wash the fruit, squeeze it, strain it, and then taste-test with friends before bottling.
The Concept (Data Collection and Filtering):
- What it is: A pipeline that turns 22.7 million raw Q&A pairs into 11.27 million high-quality pairs, then into balanced labeled sets.
- How it works:
- Extract analyst questions (Type 3) and management answers (Type 4) in order; remove greetings/instructions.
- Keep only legitimate Q&A: question has a question mark, answer is >30 characters, no [inaudible] markers.
- Require substantial content: combined length ≥ 500 characters.
- Why it matters: Garbage in, garbage out. Clean input prevents the model from learning from noise. Anchor: Like choosing ripe apples, tossing bruised ones, and keeping only the best for your pie.
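As a rough illustration of the filters listed above, here is a minimal sketch, assuming each record is a plain dict with question and answer strings; the function name and field layout are assumptions, not the paper's actual pipeline code:

```python
def keep_pair(question: str, answer: str) -> bool:
    """Apply the quality filters described above to one Q&A pair."""
    if "?" not in question:                   # the question must actually ask something
        return False
    if len(answer) <= 30:                     # answer must exceed 30 characters
        return False
    if "[inaudible]" in answer.lower():       # drop pairs with transcription gaps
        return False
    if len(question) + len(answer) < 500:     # require substantial combined content
        return False
    return True

# Example usage on a list of {"question": ..., "answer": ...} records:
# clean = [p for p in pairs if keep_pair(p["question"], p["answer"])]
```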
Hook: When two friends disagree on a movie, you might ask a few more friends to vote.
The Concept (Multi-Model Consensus, MMC):
- What it is: A labeling system that uses two strong annotators first, then a three-judge vote if they disagree.
- How it works:
- Stage I: Claude Opus 4.5 and Gemini 3 Flash label each sample (direct/intermediate/fully evasive) independently.
- Stage II: If they agree, that's the label (consensus set). If not, proceed to arbitration.
- Stage III: Three judges (Opus 4.5, Gemini 3 Flash, GPT-5.2) each pick which original label is right. Majority wins.
- Anti-bias: Randomize which model's prediction appears first in the judge prompt to avoid position bias.
- Why it matters: Single models have tendencies (e.g., Opus skews "direct," Gemini skews "fully evasive," GPT-5.2 skews "intermediate"). Consensus smooths these biases. Anchor: It's like having three referees call a close play from different angles and going with the majority.
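The arbitration flow can be summarized in a short sketch. Everything below is schematic, assuming placeholder helpers label_with (an annotator call returning one of the three labels) and judge_pick (a judge call choosing between the two candidate labels); neither is a real API:

```python
import random
from collections import Counter

LABELS = ("direct", "intermediate", "fully_evasive")
ANNOTATORS = ("claude-opus-4.5", "gemini-3-flash")           # Stage I annotators
JUDGES = ("claude-opus-4.5", "gemini-3-flash", "gpt-5.2")    # Stage III judges

def label_with(model: str, question: str, answer: str) -> str:
    """Placeholder for an LLM annotator call that returns one of LABELS."""
    raise NotImplementedError

def judge_pick(model: str, question: str, answer: str, candidates: list) -> str:
    """Placeholder for a judge call that picks which candidate label is right."""
    raise NotImplementedError

def mmc_label(question: str, answer: str, rng: random.Random) -> str:
    a = label_with(ANNOTATORS[0], question, answer)
    b = label_with(ANNOTATORS[1], question, answer)
    if a == b:
        return a                                  # Stage II: consensus label
    candidates = [a, b]
    rng.shuffle(candidates)                       # position-bias control in the judge prompt
    votes = Counter(judge_pick(j, question, answer, candidates) for j in JUDGES)
    return votes.most_common(1)[0][0]             # Stage III: majority of three judges wins
```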
Concrete dataset construction:
- After filtering, they selected and labeled data to create:
- Train-60K: 60,000 consensus-only samples (broad basics).
- Train-24K: 24,000 harder samples, including judge-resolved cases.
- Gold-1K: 1,000 human-labeled evaluation samples (319 companies), balanced across three classes.
- All splits are balanced (about 33.3% per class) and span 8,081 unique companies (2002–2022).
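A minimal sketch of the class balancing, assuming labeled samples are dicts with a "label" field; the field name and function are illustrative, not the paper's construction code:

```python
import random
from collections import defaultdict

def balance_classes(samples, per_class, seed=0):
    """Downsample to an equal number of examples per label (~33.3% each)."""
    by_label = defaultdict(list)
    for s in samples:
        by_label[s["label"]].append(s)

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        rng.shuffle(group)
        balanced.extend(group[:per_class])
    rng.shuffle(balanced)
    return balanced

# e.g. a 60K-style consensus split: balance_classes(consensus_samples, per_class=20_000)
```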
Hook: Think of a spelling coach who first drills common words, then practices tricky exceptions.
The Concept (Two-Stage Training of Eva-4B):
- What it is: A training plan that learns basics first, then sharpens on tough cases.
- How it works:
- Stage 1 (Consensus Training): Fine-tune Qwen3-4B on 60K high-agreement labels (2 epochs, LR 2e-5, bfloat16).
- Stage 2 (Judge-Refined Training): Continue fine-tuning on 24K, using majority-vote labels for the hard boundaries.
- Compare variants: Consensus-only vs. adding Opus-only labels vs. adding three-judge labels.
- Why it matters: Easy examples teach the rules; hard examples teach the boundaries. Anchor: Like karate: learn basic forms first, then sparring to master timing and judgment.
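Here is a minimal sketch of the two-stage schedule using Hugging Face's Trainer. The Stage 1 hyperparameters (2 epochs, LR 2e-5, bfloat16) follow the text; the checkpoint name, batch size, prompt formatting, and Stage 2 schedule are assumptions, not details from the paper:

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

def run_stage(model, train_dataset, output_dir, epochs, lr):
    """One fine-tuning stage; Stage 1 settings follow the description above."""
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=epochs,
        learning_rate=lr,
        bf16=True,                      # bfloat16 training, as described
        per_device_train_batch_size=8,  # illustrative; batch size is not given in the text
    )
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# Stage 1 (consensus training): 60K high-agreement samples, 2 epochs, LR 2e-5.
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")  # assumed checkpoint name
# model = run_stage(model, train_60k, "eva4b-stage1", epochs=2, lr=2e-5)
# Stage 2 (judge-refined training): continue on the 24K split with majority-vote labels.
# model = run_stage(model, train_24k, "eva4b-stage2", epochs=2, lr=2e-5)
```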
Hook: Imagine you and a friend disagree more when one speaks first; order can sway opinions.
The Concept (Position Bias Control):
- What it is: A fix to prevent judges from favoring whichever model's prediction appears first.
- How it works:
- Shuffle which model's label appears in the judge prompt first.
- Keep a fixed random seed for reproducibility.
- Measure win-rate changes to confirm bias exists.
- Why it matters: If order changes outcomes, labels get skewed. Randomization keeps it fair. Anchor: Like coin-flipping who presents first at a debate so the order doesn't tip the scales.
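A minimal sketch of the randomization and the win-rate check, under the assumption that arbitration outcomes are stored as dicts with a "winner" field naming whichever annotator's label the judges chose (all names here are illustrative):

```python
import random

def ordered_candidates(label_a, label_b, rng):
    """Randomize which annotator's label is shown first in the judge prompt."""
    pair = [("A", label_a), ("B", label_b)]
    rng.shuffle(pair)
    return pair

def win_rate(decisions, annotator):
    """Fraction of arbitration cases where the given annotator's label won."""
    wins = sum(1 for d in decisions if d["winner"] == annotator)
    return wins / len(decisions)

rng = random.Random(42)  # fixed seed for reproducibility, as described above
# Comparing win_rate(...) with and without shuffling is how a position-bias shift
# (e.g., the +5.1% win-rate change reported later) would be measured.
```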
Hook: If two people grade homework the same way, we trust the grade more than if they argued a lot.
The Concept (Human Agreement, Cohen's Kappa):
- What it is: A score that tells how much two annotators agree beyond chance; here it's 0.835 ("almost perfect").
- How it works:
- Have a second human label a balanced subset of 100 samples.
- Compute kappa; inspect where disagreements happen.
- Use findings to confirm which class is hardest (intermediate).
- Why it matters: Shows the labels match human judgment, not just machine opinions. Anchor: Two coaches timing a runner and mostly agreeing means the stopwatch is trustworthy.
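A minimal sketch of the agreement check with scikit-learn's cohen_kappa_score; the tiny label lists are toy stand-ins for the balanced 100-sample subset described above:

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Toy stand-ins for two annotators labeling the same subset.
human_1 = ["direct", "intermediate", "fully_evasive", "intermediate", "direct"]
human_2 = ["direct", "intermediate", "fully_evasive", "direct", "direct"]

kappa = cohen_kappa_score(human_1, human_2)
print(f"Cohen's kappa: {kappa:.3f}")  # the paper reports 0.835, i.e. almost perfect agreement

# Inspect where disagreements concentrate (per the text, mostly the intermediate class).
order = ["direct", "intermediate", "fully_evasive"]
print(confusion_matrix(human_1, human_2, labels=order))
```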
Examples with actual data:
- Direct: Clear numbers or a yes/no answer to the asked core.
- Intermediate: Talks around it (e.g., describes strength but avoids saying if M&A will happen).
- Fully evasive: Refuses to answer ("we can't share that") or pivots to unrelated hype.
Secret sauce:
- Multiple strong models reduce single-model quirks.
- Simple three-level labels improve human and model reliability.
- Training on consensus first, then judge-resolved edge cases, yields a cleaner learning signal and faster convergence (loss down to 0.007 vs. 0.56 for single-model labels).
04 Experiments & Results
Hook: When you race toy cars, you time each car on the same track so the results are fair.
The Concept (The Test):
- What it is: A fair comparison on a 1,000-sample, human-labeled gold set to see which model best detects evasion.
- How it works:
- Use the same dataset and rules for every model.
- Measure Macro-F1 (treating all classes equally) and per-class F1 (direct, intermediate, fully evasive).
- Compare results to see strengths and weaknesses.
- Why it matters: Equal testing lets us trust who's really best. Anchor: Like a spelling bee with the same word list for everyone.
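A minimal sketch of the scoring step with scikit-learn, using toy stand-ins for the gold labels and one model's predictions (the real evaluation runs over the 1,000-sample gold set):

```python
from sklearn.metrics import classification_report, f1_score

classes = ["direct", "intermediate", "fully_evasive"]

# Toy stand-ins; in practice these come from Gold-1K and a model's outputs.
gold = ["direct", "intermediate", "fully_evasive", "intermediate", "direct"]
pred = ["direct", "direct", "fully_evasive", "intermediate", "direct"]

macro_f1 = f1_score(gold, pred, labels=classes, average="macro")
per_class = f1_score(gold, pred, labels=classes, average=None)

print(f"Macro-F1: {macro_f1:.3f}")             # treats all three classes equally
print(dict(zip(classes, per_class.round(3))))  # per-class F1, as in the scoreboard below
print(classification_report(gold, pred, labels=classes))
```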
Who competed:
- Closed-source leaders: Claude Opus 4.5, GPT-5.2, Gemini 3 Flash.
- Open-source: GLM-4.7, Qwen3-Coder, MiniMax-M2.1, Kimi-K2, DeepSeek-V3.2, Qwen3-4B base.
- Eva-4B variants: Consensus-only, Opus-only Stage 2, Full (three-judge Stage 2).
Scoreboard with context:
- Eva-4B (Full): 84.9% Macro-F1. That's like scoring an A when most others are at A- or B+.
- Gemini 3 Flash: 84.6%.
- Claude Opus 4.5: 84.4%.
- GLM-4.7: 82.9%.
- GPT-5.2: 80.9%.
- Eva-4B (Consensus): 81.4%.
- Eva-4B (Opus Only): 80.6%.
- Qwen3-4B (Base): 34.3% (before fine-tuning), showing training quality matters a lot.
Per-class highlights:
- Fully evasive is easiest (Eva-4B Full F1: 92.4%): clear refusals and off-topic pivots stand out.
- Direct is strong (82.2%).
- Intermediate is hardest (80.1%), where answers sound helpful but miss the core.
Ablation findings (what changed when parts were removed):
- Base → Consensus: +47.1 percentage points Macro-F1 (massive jump from fine-tuning on consensus labels).
- Consensus → Full: +3.5 points (adding judge-resolved hard cases sharpens the model).
- Full vs. Opus Only (single-model labels): +4.3 points (consensus clearly beats single-model annotation).
Surprising or noteworthy findings:
- Training dynamics: With three-judge majority labels, loss shrank to 0.007, but with single-model labels it plateaued at 0.56, suggesting noisier targets slow or block learning.
- Position bias is real: Randomizing which model's prediction appears first changes win rates (e.g., Opus's win rate shifts by +5.1%), proving the need to shuffle order for fairness.
- Shared difficulty: About one-third of the toughest errors trick many top models, meaning some borderline cases are genuinely ambiguous even for humans.
Error patterns:
- Hedging confusion: Phrases like "we expect" or "soon" can make a direct answer look evasive.
- Quantitative core missed: If the question asks for a number but the answer gives detailed actions without numbers, humans call it intermediate; models sometimes call it direct due to specificity.
- Adjacent-class confusion: 95.4% of mistakes are between neighboring classes (e.g., direct vs. intermediate), showing the three-level scale captures a meaningful spectrum.
Bottom line: High-quality, consensus-based labels make a compact model (Eva-4B) competitive with frontier systems, and the biggest remaining challenge is the subtle middle ground called intermediate.
05 Discussion & Limitations
Hook: Even the best GPS sometimes sends you on a weird detour; useful tools have limits.
The Concept (Limitations):
- What it is: Clear boundaries of what the system can and can't do today.
- How it works:
- Domain specificity: Trained on earnings calls; political interviews or legal depositions need testing before use.
- Human validation scale: The gold set is human-validated, but full multi-annotator coverage is limited (100-sample IAA check); more human audits would strengthen trust.
- Time window: Data is 2002–2022; language evolves, so periodic updates are wise.
- English only: Non-English settings are future work.
- Why it matters: Knowing limits prevents misuse and guides next steps. Anchor: Like shoes that fit great on the track but aren't for hiking; you need the right gear for the terrain.
Required resources:
- Access to transcripts or similar Q&A data.
- Compute to fine-tune and run a 4B-parameter model (modest compared to huge LLMs).
- Clear prompts and evaluation scripts to reproduce consensus labeling and tests.
When not to use:
- High-stakes legal or regulatory decisions without human review.
- Domains with very different conversation styles until cross-domain validation is done.
- Questions whose core is unclear (ambiguous ask) without additional disambiguation.
Open questions:
- Cross-domain transfer: How well does the three-level scale work in politics, healthcare, or education Q&A?
- Multilingual extension: Do evasion cues look the same across languages and cultures?
- Better middle-ground detection: Can we reduce confusion in the intermediate class with richer context or core-detection modules?
- Human+AI teamwork: What's the best workflow for humans to audit and guide AI in edge cases?
- Dynamic updates: How often should labels and models be refreshed as corporate language changes?
06 Conclusion & Future Work
Three-sentence summary: EvasionBench is a large, carefully built benchmark that helps AI spot when executives dodge questions during earnings calls using a simple three-level scale. It uses a multi-model consensus labeling pipeline (two annotators plus three-judge voting) to create high-quality labels that align strongly with human judgment. A compact classifier, Eva-4B, trained on these labels reaches 84.9% Macro-F1 and rivals or beats frontier models on this task.
Main achievement: Proving that consensus labeling across multiple strong models, paired with a simple and actionable taxonomy, yields reliable training data that enables a small, efficient model to excel at a subtle discourse task.
Future directions:
- Test and adapt the framework to political interviews, legal depositions, and other adversarial Q&A domains.
- Expand human validation at larger scales and across languages.
- Improve handling of the tricky intermediate class, possibly by explicitly modeling the question core.
Why remember this: It shows how to turn a fuzzy, subjective behavior (dodging questions) into a measurable, teachable task at scale. With the right labels and a clear rubric, even a modest-sized model can deliver high accuracy, helping markets, regulators, and the public see through slippery talk.
Practical Applications
- Flag potentially evasive answers in real time during earnings calls for investor relations teams.
- Alert analysts when a numeric or timeline question was not directly answered so they can follow up.
- Support journalists with evidence-backed summaries highlighting where management dodged key asks.
- Help regulators and exchanges monitor disclosure quality across companies and sectors.
- Enable portfolio managers to incorporate evasion risk signals into investment decisions.
- Assist academics studying strategic communication patterns over time and across industries.
- Power training tools for executives to improve clarity and directness in Q&A.
- Enhance transcript platforms with evasion tags and heatmaps for quick navigation.
- Benchmark new evasion-detection models fairly using the gold-standard evaluation set.