Ebisu: Benchmarking Large Language Models in Japanese Finance
Key Summary
- EBISU is a new benchmark that checks how well AI models understand Japanese finance, a language and domain where indirect hints and specialized terms are common.
- It has two expert-built tasks: JF-ICR for spotting hidden agreement or refusal in investor Q&A, and JF-TE for finding and ranking nested financial terms in disclosures.
- Japanese is agglutinative and head-final, so key meaning often appears at the end of sentences, making intent and negation hard for AIs trained mostly on English.
- Across 22 models, even top systems struggled: scale helped a bit, but Japanese- or finance-specific training did not reliably improve results.
- On JF-ICR, models often mistook polite indirect refusals for weak agreement, showing cultural and pragmatic blind spots.
- On JF-TE, accuracy dropped when terms were longer, nested, or mixed scripts (kanji, hiragana, katakana), indicating boundary and variant handling issues.
- Annotation quality was high, with strong agreement between finance-trained Japanese annotators, ensuring reliable ground truth.
- EBISU reveals that simply making models bigger or feeding them more finance text is not enough; they must learn Japanese-specific morphosyntax and pragmatics.
- All data and evaluation code are publicly released, inviting progress on linguistically and culturally grounded financial NLP.
- This matters in real markets because foreign investors hold a large share of Japanese equities and need trustworthy AI tools for Japanese disclosures.
Why This Research Matters
Japanese companies communicate with polite indirectness and complex terms, and global investors increasingly rely on AI to read their disclosures. Misreading a soft refusal as a yes can mislead investment strategies or risk assessments. Extracting the wrong term boundaries can break data pipelines that feed dashboards and compliance tools. EBISU exposes exactly these failure points so researchers can build models that respect Japanese morphosyntax and communication norms. Better models mean fairer, clearer access to Japanese markets for international participants. Ultimately, that helps reduce misunderstanding, improve transparency, and support healthier, data-driven financial ecosystems.
Detailed Explanation
01 Background & Problem Definition
Hook: Imagine trying to understand a classmate who answers questions very politely and indirectly. They rarely say "no," but the real answer is hidden at the end of the sentence. That's how Japanese business talk often works, especially in finance.
The Concept (EBISU: the world before and why we need it):
- What it is: Before this research, AI models were judged mostly with English-style tests or simple question formats that didn't capture the way Japanese finance uses indirect wording and nested, mixed-script terms.
- How it worked (before):
- Benchmarks favored short, explicit questions and clear labels that are easy to score across many languages.
- Finance tests focused on numbers, tables, or direct Q&A with obvious answers.
- Japanese datasets often measured sentiment or exam-style facts because they're cheaper to build.
- Why it matters: Without tests that reflect Japanese language structure and corporate norms, models look smarter than they really are. They might miss an indirect refusal (which could mean "we won't do that") or misread a long, nested term, leading to bad decisions.
Anchor: Think of a model reading, "We will consider dividends if conditions allow" and thinking it's a yes. In real Japanese investor calls, that may actually be a polite "not now."
Hook: You know how some puzzles have hidden clues at the very end? In Japanese, the crucial clue often lives at the sentence end.
The Concept (The Problem: language meets finance):
- What it is: Japanese is agglutinative (glues word pieces together) and head-final (the "boss" word comes at the end). Refusals can be polite and indirect, and financial terms often nest inside each other across different scripts.
- How it works:
- Meaning leans heavily on sentence-final endings that show mood, negation, and commitment.
- Companies use cautious wording ("considering," "reviewing") where intent is implied, not stated.
- Financial terms appear as long compounds with nested pieces, mixing kanji, hiragana, and katakana.
- Why it matters: Models trained on English-style explicitness can miss the real stance or the exact term boundary, two essentials for finance understanding.
Anchor: If a CEO says, "We have no plan to split the stock," that's a clear no. But "We'll watch market conditions and continue discussions" might be a soft no that an English-leaning model misreads as a yes.
Hook: Picture a library where books sometimes have titles-within-titles. If you grab only part of the title, you get the wrong book.
The Concept (Failed attempts and the gap):
- What it is: Existing multilingual and finance benchmarks missed two tough Japanese needs: (1) reading between the lines for commitment/refusal, and (2) precisely extracting and ranking nested terms in mixed scripts.
- How it works (why they fell short):
- They prefer fixed-choice, short-context tasks for easy scoring across languages.
- They rarely test sentence-final intent signals or indirect refusals.
- They under-test term boundary and nested structure recognition.
- Why it matters: Without these, we can't trust AI to parse investor Q&A or disclosure notes where the real meaning hides in structure and style.
Anchor: A model that's great at English sentiment might still fail to spot that "we refrain from commenting" signals non-commitment in Japanese finance.
Hook: Think of a referee who also explains the rules of the game. That's what this paper does for AI tests in Japanese finance.
The Concept (EBISU fills the gap):
- What it is: EBISU is a benchmark with two expert-annotated tasks that mirror real Japanese finance: JF-ICR (implicit commitment/refusal in Q&A) and JF-TE (extraction and ranking of nested financial terms from disclosures).
- How it works:
- Curates real investor Q&A and EDINET disclosure notes.
- Uses finance-trained native-level annotators with strict guidelines and double-annotation.
- Scores models on intent accuracy (JF-ICR) and term extraction + ranking (JF-TE) with metrics suited to each job.
- Why it matters: Now we can see where models truly struggle, and fix it.
Anchor: EBISU is like a driving test on real city streets, not just a parking lot. It checks turn signals (intent) and lane lines (term boundaries) the way they actually appear in Tokyo traffic.
02 Core Idea
Hook: You know how a magician's trick makes you look in the wrong place? In Japanese finance, the important signal often comes at the end of the sentence or is wrapped inside a longer term, so models keep looking in the wrong place.
The Concept (The Aha!):
- What it is: The key insight is to test models exactly where Japanese finance hides meaning (at sentence ends for intent, and inside nested mixed-script terms) using two focused tasks with expert ground truth.
- How it works:
- JF-ICR asks: Did the model catch hidden commitment or refusal in high-context Q&A?
- JF-TE asks: Did the model find the longest correct financial terms and rank the nested pieces properly?
- Together, they expose whether models genuinely grasp Japanese morphosyntax and pragmatics, not just vocabulary.
- Why it matters: Without this, models can sound fluent but still misunderstand what companies actually commit to or which exact terms disclosures define.
Anchor: It's like testing whether a student can read both the main headline and the fine print, because in Japanese finance the fine print often carries the punchline.
Hook (Analogy 1): Imagine listening to a friend who answers in hints. You must catch their last few words to know if it's a yes or a no. Hook (Analogy 2): Think of Russian nesting dolls. Financial terms can be long dolls with smaller ones inside; you must identify both the big doll and the important smaller ones. Hook (Analogy 3): Picture three alphabets dancing together (kanji, hiragana, katakana). Picking the right dance partner at the right moment is crucial to get the meaning.
The Concept (Before vs After):
- What it is: Before, tests rewarded models for short, explicit, cross-language-friendly answers; after EBISU, tests reward models that handle Japanese-specific intent and term structure.
- How it works:
- Old: QA with obvious choices and simple spans.
- New: Real Q&A with indirect stances; disclosures with nested terms and script variants.
- Scoring checks both the gist (intent) and the exact building blocks (term boundaries and rankings).
- Why it matters: It shifts evaluation from surface-level fluency to culturally and linguistically correct understanding.
Anchor: A model that once "aced the quiz" now has to pass the real interview, in Japanese corporate style.
Hook: You know how the answer key in math isn't just the number, but how you got it? EBISU checks both the conclusion and the language steps.
The Concept (Why it works: intuition without math):
- What it is: The benchmark targets the exact pressure points where Japanese differs from English: sentence-final cues for commitment and nested termhood in mixed scripts.
- How it works:
- Intent labels capture graded commitment from strong yes to strong no.
- Term extraction treats the longest valid span as primary, then ranks inner candidates.
- Metrics match each target: accuracy for intent; F1 for exact maximal spans; HitRate@K for ranking nested terms.
- Why it matters: If a model truly understands Japanese finance, it will do well where meaning actually lives.
Anchor: EBISU isn't asking, "Do you know the word?" It's asking, "Do you know what was promised, and can you point to the exact term that defines it?"
Hook: Imagine building with LEGO pieces where the last brick locks the structure, and smaller bricks hide inside bigger ones.
The Concept (Building blocks):
- What it is: EBISU has two blocks, JF-ICR and JF-TE, plus expert annotation and careful metrics.
- How it works (step-by-step):
- Gather real investor Q&A and EDINET note entries.
- Finance-trained annotators apply strict rules and double-check each other.
- Convert to instruction-style prompts so models answer consistently.
- Score intent with accuracy; score terms with F1 (exact maximal spans) and HitRate@K (nested rankings).
- Why it matters: Each block probes a different "hidden meaning" channel: pragmatics for intent, and morphology/orthography for terms.
Anchor: Together, the blocks test whether the model can read both the tone of a manager's answer and the dictionary of a disclosure note, the Japanese way.
03 Methodology
Hook: Imagine grading a cooking contest where dishes come from a cuisine you don't usually eat. To be fair, you need judges who know that cuisine, and rules that check the flavors that really matter.
The Concept (High-level recipe):
- What it is: EBISU turns real Japanese finance text into two testing pipelines.
- How it works (at a high level): Input (Q&A or disclosure note) → Expert curation and annotation → Instruction-style prompt → Model prediction → Task-specific scoring (JF-ICR Accuracy; JF-TE F1 + HitRate@K) → Analysis.
- Why it matters: Each step keeps the test close to real Japanese finance, not a simplified classroom exercise.
Anchor: It's like using actual restaurant meals, served by native chefs, to judge a cooking robot.
Step A: Data collection and curation. Hook: You know how you pick only the clearest puzzle pieces at first so you don't waste time? That's what curation does here. The Concept:
- What it is: Select single-topic, single-turn investor Q&A from big Japanese companies (2023–2026) and finely scoped EDINET disclosure notes (2025) that define or explain financial concepts.
- How it works:
- Filter out multi-part or context-heavy Q&A to reduce ambiguity (leaving 94 Q&A pairs).
- Extract 202 note-level entries from Annual Securities Reports where definitions and rules live.
- Keep the original Japanese scripts to preserve real-world difficulty.
- Why it matters: Clean inputs make labels trustworthy, which makes scores meaningful. Anchor: If you test map-reading, you use real maps with the actual symbols, not cartoon maps. (A hypothetical record layout is sketched below.)
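To make the curated data concrete, here is a minimal, hypothetical sketch of what one JF-ICR item and one JF-TE note entry might look like in code. The field names and structure are illustrative assumptions, not the benchmark's actual released schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class JFICRExample:
    """One curated investor Q&A pair (hypothetical field names)."""
    question: str       # single-topic investor question (Japanese)
    response: str       # the company's single-turn answer (Japanese)
    intent_label: int   # gold label in {+2, +1, 0, -1, -2}

@dataclass
class JFTENote:
    """One EDINET note-level entry (hypothetical field names)."""
    note_text: str                                                      # note text, original scripts preserved
    maximal_terms: List[str] = field(default_factory=list)              # longest gold financial term spans
    nested_terms: Dict[str, List[str]] = field(default_factory=dict)    # maximal term -> ranked nested terms
```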
Step B: Expert annotation. Hook: Think of two referees independently scoring the same game, and a head referee breaking ties. The Concept:
- What it is: Finance-trained, native-level annotators label intent (five levels) and mark both maximal and nested financial term spans; disagreements are adjudicated.
- How it works:
- Clear rules decide what counts as strong/weak commitment or refusal, and what counts as a financial term or nested term.
- Each item gets two independent labels; a senior expert resolves conflicts.
- Agreement is measured (Macro-F1, Cohen's kappa, Krippendorff's alpha) to ensure reliability.
- Why it matters: High agreement shows the ground truth is solid, not guesswork. Anchor: Like two judges giving similar scores in a dance contest; if they often agree, the scoring guide is clear. (A minimal agreement-computation sketch follows.)
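As a rough illustration of how inter-annotator agreement can be computed, here is a minimal sketch using scikit-learn for Cohen's kappa and Macro-F1 on toy labels; the paper also reports Krippendorff's alpha, which dedicated packages provide. The toy data and setup are assumptions for illustration, not the paper's actual annotation records.

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Two annotators' independent intent labels for the same items (toy data covering all five classes).
annotator_a = [2, 1, 0, -1, -2, 1, -1, 0]
annotator_b = [2, 1, 0, -1, -2, 0, -1, 0]

# Chance-corrected agreement between the two annotators.
kappa = cohen_kappa_score(annotator_a, annotator_b)

# Macro-F1 treats one annotator as the reference and averages F1 over the five classes.
macro_f1 = f1_score(annotator_a, annotator_b, average="macro")

print(f"Cohen's kappa: {kappa:.3f}  Macro-F1: {macro_f1:.3f}")
```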
Step C: Instruction-style prompts. Hook: You know how a teacher gives the same worksheet format to every student to keep grading fair? The Concept:
- What it is: Convert each example into a consistent instruction format so different models understand the task the same way.
- How it works:
- JF-ICR prompt: Given Question + Company Response, pick exactly one label from {+2, +1, 0, -1, -2}.
- JF-TE prompt: Given a disclosure note, output JSON with each maximal term and a ranked list of nested terms.
- Why it matters: Uniform instructions reduce random format effects and let us compare models apples-to-apples. Anchor: Everyone gets the same test paper, not one student a textbook and another a comic. (Illustrative prompt builders are sketched below.)
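The exact prompt wording is not reproduced here, so the following is a hedged sketch of what instruction-style prompt builders for the two tasks could look like; the phrasing, function names, and JSON shape are assumptions, not EBISU's actual templates.

```python
def build_jficr_prompt(question: str, response: str) -> str:
    """Illustrative JF-ICR prompt; the wording is hypothetical, not the paper's exact template."""
    return (
        "You will read an investor question and a company's response from a Japanese earnings Q&A.\n"
        "Classify the response's level of commitment:\n"
        "+2 strong commitment, +1 weak commitment, 0 neutral/non-committal, "
        "-1 weak refusal, -2 strong refusal.\n"
        "Answer with exactly one label from {+2, +1, 0, -1, -2}.\n\n"
        "Question: " + question + "\nCompany response: " + response + "\nLabel:"
    )

def build_jfte_prompt(note_text: str) -> str:
    """Illustrative JF-TE prompt asking for JSON with maximal terms and ranked nested terms."""
    return (
        "Extract every maximal financial term from the disclosure note below. For each maximal "
        "term, also return a ranked list of the nested financial terms it contains. Respond as "
        'JSON: [{"maximal_term": "...", "nested_terms": ["...", "..."]}]\n\n'
        "Note: " + note_text
    )
```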
Step D: Evaluation metrics. Hook: Imagine three rulers: one checks yes/no answers, one checks exact puzzle-piece shape, and one checks if the right pieces appear in your top guesses. The Concept (Multi-class classification):
- What it is: Sorting answers into more than two groups; JF-ICR uses five intent classes.
- How it works:
- The model chooses among {+2, +1, 0, -1, -2}.
- We compute Accuracy: percent of exact matches.
- This spotlights how well the model distinguishes graded commitment and refusal.
- Why it matters: Without multi-class grading, a weak refusal and a strong refusal would get lumped together, hiding important nuance. Anchor: Like grading from A to F, not just pass/fail. (A minimal accuracy sketch follows.)
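A minimal sketch of exact-match accuracy over the five intent classes, assuming gold and predicted labels come as aligned lists of integers; it mirrors the metric described above but is not the benchmark's released scoring code.

```python
def intent_accuracy(gold: list[int], predicted: list[int]) -> float:
    """Exact-match accuracy over the five intent classes {+2, +1, 0, -1, -2}."""
    assert len(gold) == len(predicted) and gold, "need aligned, non-empty label lists"
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

# Toy example: a weak refusal (-1) misread as a weak yes (+1) counts as fully wrong.
print(intent_accuracy(gold=[2, -1, 0, -2], predicted=[2, 1, 0, -2]))  # 0.75
```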
Hook: Imagine a tape measure that demands the exact edge of a shape. The Concept (F1 for maximal terms):
- What it is: F1 checks whether the model finds the exact longest financial term spans in JF-TE.
- How it works:
- Precision: of the terms you predicted, how many were exactly right.
- Recall: of the true terms, how many you found.
- F1 balances both.
- Why it matters: If the model clips off part of a long term or grabs extra words, F1 catches it. Anchor: If the gold term is "潜在株式調整後1株当たり当期純利益," predicting only "当期純利益" is incomplete and should lose points. (A minimal exact-span F1 sketch follows.)
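Here is a minimal sketch of exact-match F1 over maximal term spans, treating gold and predicted terms as sets of strings; the benchmark's released scorer may aggregate differently (for example micro vs. macro across notes), so this is illustrative only.

```python
def span_f1(gold_terms: set[str], predicted_terms: set[str]) -> float:
    """Exact-match F1 over maximal term spans: partial or over-long spans earn no credit."""
    if not gold_terms or not predicted_terms:
        return 0.0
    true_positives = len(gold_terms & predicted_terms)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_terms)
    recall = true_positives / len(gold_terms)
    return 2 * precision * recall / (precision + recall)

# Predicting only the nested piece instead of the full maximal term scores zero for that term.
gold = {"潜在株式調整後1株当たり当期純利益"}
pred = {"当期純利益"}
print(span_f1(gold, pred))  # 0.0
```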
Hook: Think of guessing top songs: did the right song make your top 1, top 5, or top 10? The Concept (HitRate@K):
- What it is: For nested terms in JF-TE, HitRate@K checks whether the true terms appear in the model's top K ranked candidates.
- How it works:
- Rank nested terms under each maximal term from most to least likely.
- Check K = 1, 5, 10 to see how quickly correct terms surface.
- Average over notes.
- Why it matters: Ranking quality matters when many plausible inner terms exist; it rewards surfacing the right ones early. Anchor: If a note defines several nested pieces, a good model lists the most important ones at the top. (A minimal HitRate@K sketch follows.)
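A minimal HitRate@K sketch under one plausible reading of the description above (the fraction of gold nested terms that appear in the model's top-k list, averaged over notes); the paper's exact aggregation may differ, so treat this as an assumption-laden illustration.

```python
def hit_rate_at_k(gold_nested: set[str], ranked_candidates: list[str], k: int) -> float:
    """Fraction of gold nested terms found within the model's top-k ranked candidates."""
    if not gold_nested:
        return 0.0
    top_k = set(ranked_candidates[:k])
    return len(gold_nested & top_k) / len(gold_nested)

def mean_hit_rate(per_note_scores: list[float]) -> float:
    """Average HitRate@K over all notes."""
    return sum(per_note_scores) / len(per_note_scores) if per_note_scores else 0.0

# Toy example: one of two gold nested terms appears in the top 5 candidates.
gold = {"当期純利益", "普通株式"}
ranked = ["純利益", "当期純利益", "株式", "調整後", "期中平均株式数"]
print(hit_rate_at_k(gold, ranked, k=5))  # 0.5
```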
Step E: Model evaluation pipeline. Hook: Picture a race where all runners start at the same line, on the same track, with the same referee. The Concept:
- What it is: A unified harness runs 22 models, open and proprietary, under the same settings (temperature 0, same max output length) and hardware rules.
- How it works:
- Use LM Evaluation Harness; serve models via APIs, TogetherAI, or local vLLM on H100 GPUs.
- Standardize generation length (1,024 tokens) and deterministic decoding.
- Collect predictions and compute the metrics.
- Why it matters: Fairness and repeatability let us trust comparisons. Anchor: If one sprinter used a downhill track and another used flat ground, you couldn't compare times; standardization fixes that. (A minimal deterministic-decoding sketch follows.)
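To show what "deterministic decoding with a fixed budget" looks like when serving a model locally with vLLM, here is a minimal sketch assuming temperature 0 and a 1,024-token output cap as stated above; the model name and prompt are placeholders, and this is not the benchmark's actual harness configuration (which runs through the LM Evaluation Harness).

```python
from vllm import LLM, SamplingParams

# Placeholder model; the benchmark evaluates 22 open and proprietary models.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Deterministic decoding with the same output budget for every model.
params = SamplingParams(temperature=0.0, max_tokens=1024)

prompts = ["<JF-ICR instruction prompt goes here>"]  # built as in the Step C sketch
for request_output in llm.generate(prompts, params):
    print(request_output.outputs[0].text)
```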
The Secret Sauce. Hook: You know how a microscope is powerful because it zooms exactly where the tiny details hide? The Concept:
- What it is: EBISU's cleverness is its focus on Japanese-specific meaning zones: sentence-final intent signals and nested, mixed-script terms.
- How it works:
- Intent labels map the real spectrum of Japanese corporate commitment.
- Exact-span scoring forces precise boundary recognition.
- Ranking metrics stress prioritization among many lookalike inner terms.
- Why it matters: These targets are exactly where general scaling and English-focused training are weakest, so EBISU reveals what needs real fixing. Anchor: It's the difference between asking, "Do you speak Japanese?" and "Can you understand how Japanese CFOs politely say 'no' and which exact term in a note changes the accounting?"
04 Experiments & Results
Hook: Imagine a spelling bee where even champions stumble on words from a language with different scripts and rules. That's what happened here.
The Concept (The test and why):
- What it is: 22 models (open, closed, Japanese-specific, and finance-specific) were tested on JF-ICR (intent accuracy) and JF-TE (maximal-term F1 and nested-term HitRate@K).
- How it works:
- JF-ICR: Does the model label answers from strong yes to strong no correctly?
- JF-TE: Does it find the exact longest terms and rank inner terms well (top-1, top-5, top-10)?
- Why it matters: These scores tell us if models truly understand Japanese finance, not just words.
Anchor: Accuracy is like the test score; F1 is like getting the exact puzzle shape right; HitRate@K checks if your best guesses include the correct pieces.
The competition:
- Open-source general: Llama, Qwen, DeepSeek, Mistral, Kimi.
- Proprietary: GPT-5, GPT-4o, Gemini 3 Flash, Claude Sonnet 4.5.
- Finance-only: FinMA-7B (English-focused finance).
- Japanese-focused: Swallow, Japanese-StableLM, Nekomata; Japanese finance: nekomata-14b-pfn-qfin.
The scoreboard (with context):
- JF-ICR (Accuracy): The best model, Llama-4-Scout-17B, reached about 0.606 accuracy, like scoring a C when an A is 0.9+; many models stayed near chance on hard cases.
- JF-TE (HitRate highlights): Llama-3.3-70B-Instruct did best but modestly (HR@1 ≈ 0.128, HR@5 ≈ 0.366, HR@10 ≈ 0.511), like finding the right treasure in your top 10 guesses about half the time.
- Some strong general models (e.g., Llama-3.3-70B-Instruct) outperformed larger or proprietary systems on average EBISU score, showing size and secrecy aren't enough.
Surprising findings:
- Scale helps a bit but not a lot: Within families, bigger models did better, but gains were limited; Japanese pragmatics still tripped them up.
- Domain/language pretraining isn't a free win: Japanese-adapted or finance-adapted models didn't reliably beat general models at the same size; in some cases, continued finance pretraining hurt JF-TE F1.
- Systematic bias in intent: English-centric models tended to over-read commitment, mislabeling indirect refusals as weak yeses, showing cultural-pragmatic bias.
- Term complexity hurts ranking: Longer, deeper compound terms and mixed scripts reduced HitRate@1 sharply; models struggled with exact boundaries and variant forms more than with missing vocabulary.
- Zeroed metrics on some closed models for JF-TE: A few proprietary systems produced near-zero F1/HitRate for maximal terms under exact-match rules, suggesting difficulty with precise span extraction in this strict setup.
What the numbers mean:
- A top JF-ICR accuracy around 0.61 means models miss roughly 4 in 10 stances, which is risky in finance.
- HR@10 near 0.51 on JF-TE means that even with 10 guesses, models include the correct nested terms only about half the time.
- Together, these show todayās models are not yet safe to rely on for Japanese investor Q&A interpretation or meticulous term extraction without human review.
Anchor: Think of EBISU as a pop quiz where champions can't just spell; they must understand the joke at the end of the sentence and pick the exact right word inside a longer title, and many still miss both.
05 Discussion & Limitations
Hook: Even the best athletes have weak spots; knowing them helps training.
The Concept (Limitations):
- What it is: EBISU focuses on two vital tasks but not the entire world of finance NLP; its data sources are limited in companies, years, and formats, and tasks use short spans.
- How it works:
- JF-ICR uses 94 Q&A pairs from 4 companies (2023–2026), which may not cover all industries or styles.
- JF-TE uses 202 note entries from EDINET 2025 filings, strong for definitions but weaker for narratives.
- No official train/test split yet due to size; future releases plan splits.
- Metrics like accuracy and exact-match F1 are strict and may penalize near-misses.
- Why it matters: Results are solid but not the last word; future expansions can broaden coverage and long-context reasoning.
Anchor: It's like testing a car on city streets but not highways yet; still useful, but not everything.
The Concept (Required resources):
- What it is: Running EBISU fairly needs a unified evaluation harness, GPU resources, and access to APIs or open weights.
- How it works: Deterministic decoding, standard max lengths, and consistent prompts keep the race fair; annotation guidelines and code are public.
- Why it matters: Others can reproduce and extend results.
Anchor: A shared game rulebook means everyone can replay the match and check the score.
The Concept (When not to use):
- What it is: Don't treat EBISU scores as a green light for high-stakes automation.
- How it works: EBISU measures intent recognition and term extraction, not full financial reasoning, risk forecasting, or compliance interpretation.
- Why it matters: Production use needs extra validation, human oversight, and broader tests.
Anchor: A great practice test score doesn't mean you can drive a race car without a coach.
The Concept (Open questions):
- What it is: Several paths remain open.
- How it works:
- Can models learn sentence-final Japanese pragmatics via targeted finetuning or architectural tweaks?
- Can character-level or mixed-script tokenization improve term boundary detection?
- Will multi-turn discourse and long-context evaluation change the picture?
- Can culturally-aware instruction data reduce over-commitment bias?
- Why it matters: These are the likely levers to turn modest gains into trustworthy performance.
Anchor: Next season's training plan: practice end-of-sentence reading, nested-term drills, longer plays, and culture-aware coaching.
06 Conclusion & Future Work
Hook: You know how a good teacher writes tests that match the real skills you need? EBISU does that for Japanese finance.
The Concept (3-sentence summary):
- What it is: EBISU is a benchmark that tests two things Japanese finance cares about most: reading indirect commitment/refusal (JF-ICR) and pinpointing nested financial terms (JF-TE).
- How it works: Expert-annotated data, instruction-style prompts, and strict metrics (accuracy, F1, HitRate@K) reveal whether models grasp Japanese morphosyntax and pragmatics.
- Why it matters: Across 22 models, even leaders struggled; scaling helped a little, but language/domain pretraining wasn't a silver bullet, showing where future progress must focus.
Anchor: It's the difference between speaking Japanese and understanding what a CFO really meant.
Main achievement: EBISU cleanly isolates Japanese-specific linguistic and cultural challenges in finance and shows, with reliable data and metrics, that today's LLMs still fall short on them.
Future directions:
- Train models to attend to sentence-final cues and layered negation.
- Improve mixed-script and compound-aware tokenization for term boundaries.
- Add long-context and multi-turn discourse tasks.
- Build culturally-aware instruction sets to reduce over-commitment bias.
- Provide larger, broader datasets with train/validation/test splits.
Why remember this: EBISU changes how we judge AI in Japanese finance, not by fluent-sounding answers but by whether the model catches a polite "no" and identifies the exact terms that matter in disclosures. That's the kind of understanding markets actually need.
Practical Applications
- Evaluate vendor LLMs for Japanese investor relations workflows using EBISU scores before procurement.
- Fine-tune models on sentence-final Japanese pragmatics to reduce over-commitment bias in Q&A analysis.
- Deploy EBISU as a gating test in data pipelines that auto-summarize Japanese disclosures for global investors.
- Train term extractors with mixed-script tokenization and boundary-aware objectives informed by JF-TE.
- Build analyst assistants that flag likely indirect refusals and hedged statements during earnings calls.
- Augment compliance tools to detect exact finance term spans in notes for precise rule mapping.
- Create active-learning loops where EBISU-style hard cases are prioritized for human correction and retraining.
- Benchmark model updates over time with EBISU to track real improvements beyond English-centric metrics.
- Design curriculum-style prompts that teach models to weigh sentence-final auxiliaries in Japanese.
- Integrate EBISU checks into RAG systems to ensure retrieved passages align with exact term boundaries.