FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models
Key Summary
- •FIN-bench-v2 is a big, tidy set of Finnish tests that checks how good large language models are at many things like reading, logic, and world knowledge.
- •It unifies Finnish versions of famous benchmarks and the original FIN-bench into one place and one format (HuggingFace Datasets) so anyone can use them easily.
- •Every task comes in two styles—fill-in-the-blank (cloze) and multiple-choice—and each style has five wording variants to test prompt sensitivity fairly.
- •The authors pre-trained five 2.15B-parameter models and used their learning curves to filter out weak or unstable tasks using four robustness checks.
- •Only tasks that showed steady improvement, clear signal over noise, non-random scores, and consistent model rankings were kept.
- •Human review fixed translation errors in key datasets like GoldenSwag and expanded the Finnish emotion data from XED to improve quality.
- •Larger instruction-tuned models (like Gemma 3 27B and Llama variants) were then tested to map strengths and weaknesses across tasks and prompt styles.
- •Results showed clear differences between cloze and multiple-choice prompts and highlighted tasks that were easy, tough, or surprisingly sensitive to wording.
- •All code, datasets, prompts, and evaluation settings are public via the LM Evaluation Harness and the project’s repositories so the community can build on them.
Why This Research Matters
FIN-bench-v2 makes Finnish AI evaluation trustworthy, so we can build assistants that actually understand Finnish, not just English translated into Finnish. It protects researchers from being misled by noisy or poorly translated tasks by filtering with learning-curve checks. Public release in standard formats means anyone can reproduce results, compare models fairly, and improve the benchmark over time. Better Finnish benchmarks lead to better services in education, healthcare, and government that rely on accurate understanding in the local language. It also shows a template other low-resource languages can copy: unify, clean, vary prompts, and validate tasks by how models learn. Over time, this will help create safer, more culturally aware AI tools.
Detailed Explanation
01 Background & Problem Definition
You know how a school needs good exams in the students’ own language to see what they really understand? If the tests are only in another language, it’s hard to tell who’s truly learning. That’s what happened with large language models (LLMs) for Finnish: most strong evaluation tests were in English, not Finnish.
Before this work, the world of AI testing in Finnish was patchy. The first FIN-bench helped start generative model evaluation for Finnish, and other multilingual projects (like EuroEval, MMTEB, GlotEval) included Finnish, too. But two big problems kept showing up: (1) Data quality wasn't always validated across model scales, and many sets were machine-translated with little or no human review. (2) Prompts usually came in a single, simple wording that couldn't account for prompt sensitivity—tiny changes in phrasing can sway model answers—and the templates often worked poorly for base (non-instruction-tuned) models.
Imagine trying to judge runners with a stopwatch that sometimes sticks or a track that’s uneven. You might think the worst runner is the best just because your tools are unreliable. In AI, unreliable tasks can make models look smarter or dumber than they really are. So researchers needed a Finnish benchmark that: (a) is easy to run and update, (b) has tasks that give a strong, stable signal, (c) comes with multiple prompt wordings to reduce prompt luck, and (d) respects how different models behave (base vs instruction-tuned).
The original FIN-bench faced another practical snag: its evaluation tools grew outdated. That made it hard to use in 2025 and beyond. The community needed a durable home using modern, widely adopted tooling, so the benchmark wouldn’t break as libraries evolved. Also, more real-life domains (math, geography, medicine) were missing or underrepresented in Finnish.
Here is where FIN-bench-v2 steps in. It pulls together Finnish versions of well-known benchmarks plus an upgraded FIN-bench into a single, unified suite. It migrates everything to HuggingFace Datasets and the LM Evaluation Harness to ensure long-term maintainability and easy, reproducible runs. It also adds human review where needed (like GoldenSwag and XED-derived emotions) to clean up machine translations.
To avoid keeping tasks that behave like shaky rulers, the authors trained several 2.15B-parameter models and studied their learning curves on candidate tasks. Using four checks—monotonicity (scores reliably rise during training), signal-to-noise (scores are clear, not jittery), non-random performance (better than guessing), and ordering consistency (the ranking of models stays stable)—they filtered out tasks that didn’t provide trustworthy signals.
Why should anyone care? Because better Finnish evaluations lead to better Finnish models. That affects everyday life: clearer Finnish chat assistants; fairer tools for schools, health, and government; safer systems that understand local culture; and researchers who can trust their scoreboards. If a task unfairly helps or hurts models due to bad translations, brittle prompts, or noisy metrics, it could misdirect research and product choices.
To help you follow along, here are the key ideas we’ll use, explained using the Sandwich pattern:
🍞 Hook: You know how a sports team needs a full set of drills to see strengths and weaknesses? 🥬 The Concept: Benchmark Suite is a collection of tests that measure different skills of a model in a consistent way. How it works: (1) Gather tasks (reading, reasoning, sentiment, etc.). (2) Put them in a common format. (3) Define clear scoring. (4) Run models and compare results. Why it matters: Without a suite, we’d judge models with random or mismatched tests. 🍞 Anchor: Think of a report card covering math, science, and reading—the suite is the model’s report card.
🍞 Hook: Imagine a sentence with a blank you must fill. 🥬 The Concept: Cloze Formulation asks the model to complete text or fill in a missing answer without showing explicit options. How it works: (1) Show a question or text. (2) Model writes the answer. (3) Compare to the correct answer. Why it matters: Without cloze, we can’t see how well a model answers freely without hints. 🍞 Anchor: “The capital of Finland is ____.” Model must write “Helsinki.”
🍞 Hook: Picture a quiz where you pick A, B, C, or D. 🥬 The Concept: Multiple-choice Formulation shows several options and the model chooses among them. How it works: (1) Show question and options. (2) Model picks one. (3) We check if it’s right. Why it matters: Without multiple choice, we miss how models handle choices and constraints, especially instruction-tuned ones. 🍞 Anchor: “Which is Finland’s capital? A) Oslo B) Copenhagen C) Helsinki D) Stockholm.” The model should pick C.
🍞 Hook: Think of climbing stairs—each step should take you higher. 🥬 The Concept: Monotonicity checks that scores go up (or at least don’t go down) as training progresses. How it works: (1) Track scores over checkpoints. (2) See if the trend is mostly upward. (3) Keep tasks with steady improvement. Why it matters: Without it, a task might give random ups and downs that mislead us about learning. 🍞 Anchor: If practice makes perfect, your piano scores should improve lesson by lesson—monotonicity watches for that.
🍞 Hook: Try hearing a friend in a noisy cafeteria. 🥬 The Concept: Signal-to-Noise Ratio (SNR) measures how clear the performance signal is versus jittery noise. How it works: (1) Look at recent scores. (2) Measure their center and variability. (3) Keep tasks where signal stands out above noise. Why it matters: Without SNR, a task might look good or bad just by luck. 🍞 Anchor: If the friend’s voice is loud and chatter is low, you’ll reliably hear them—that’s a good SNR.
🍞 Hook: In a race, the fastest runner should usually finish ahead every time. 🥬 The Concept: Model Ordering Consistency checks if better models tend to stay better across training steps. How it works: (1) Rank models at each checkpoint. (2) See if rankings stay similar over time. (3) Keep tasks where the order is stable. Why it matters: Without this, early tests can fool you about which model is actually stronger. 🍞 Anchor: If Runner A keeps beating Runner B across many heats, that’s consistent ordering.
🍞 Hook: Rolling a die versus using skill in a video game feels different. 🥬 The Concept: Non-Random Performance Coefficient confirms that scores are truly above random guessing. How it works: (1) Compute the random baseline (e.g., 25% for 4 choices). (2) Check the best score reached. (3) Keep tasks where best beats random. Why it matters: Without NRC, we might celebrate scores that are basically lucky guesses. 🍞 Anchor: If you always score better than chance at quizzes, you’re showing real knowledge, not coin flips.
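If you like code, here is a minimal Python sketch of how these four checks could be computed from a learning curve (a list of scores at successive training checkpoints). The exact formulas and thresholds used in FIN-bench-v2 may differ; treat this as an illustrative approximation rather than the paper's implementation.

```python
# Illustrative approximations of the four learning-curve checks.
# FIN-bench-v2's exact definitions and thresholds may differ.
import statistics

def monotonicity(scores):
    """Fraction of checkpoint-to-checkpoint steps where the score did not drop."""
    steps = list(zip(scores, scores[1:]))
    return sum(later >= earlier for earlier, later in steps) / len(steps)

def signal_to_noise(scores, window=5):
    """Overall improvement relative to the jitter of the last few checkpoints."""
    tail = scores[-window:]
    noise = statistics.pstdev(tail) or 1e-9   # avoid division by zero
    return (statistics.mean(tail) - scores[0]) / noise

def non_random_coefficient(scores, random_baseline):
    """How far the best score climbs above random guessing (0 = at chance)."""
    return (max(scores) - random_baseline) / (1.0 - random_baseline)

def ordering_consistency(curves):
    """Share of checkpoints where the model ranking matches the final ranking."""
    names = list(curves)
    final_rank = sorted(names, key=lambda m: curves[m][-1], reverse=True)
    n_checkpoints = len(next(iter(curves.values())))
    matches = sum(
        sorted(names, key=lambda m: curves[m][t], reverse=True) == final_rank
        for t in range(n_checkpoints)
    )
    return matches / n_checkpoints

# Made-up accuracies for two models on one 4-choice task across six checkpoints.
curves = {
    "model_A": [0.26, 0.30, 0.35, 0.41, 0.44, 0.47],
    "model_B": [0.25, 0.27, 0.29, 0.33, 0.35, 0.36],
}
for name, curve in curves.items():
    print(name,
          round(monotonicity(curve), 2),
          round(signal_to_noise(curve), 2),
          round(non_random_coefficient(curve, random_baseline=0.25), 2))
print("ordering consistency:", ordering_consistency(curves))
```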
02 Core Idea
At its heart, the paper’s “aha!” is this: Don’t trust every test—first check which tests are trustworthy by watching how small models learn over time, and then use only those reliable tests to judge big models fairly.
Let’s explain the same idea three ways:
- School analogy: Before you give a final exam, you first make sure your questions actually measure what you taught. Here, the authors trained small Finnish models and watched their learning curves on many candidate tasks. If a task behaved like a good question (steady improvement, clear signal, consistent rankings, above random), it stayed; if not, it was removed.
- Chef analogy: A chef taste-tests ingredients before cooking a feast. The authors taste-tested tasks with small models, keeping only the ingredients (tasks) with clean flavor (good metrics) for the final dish (the big-model evaluation).
- Mechanic analogy: A mechanic tests each tool before repairing a car. If a wrench slips (unstable task), it goes back in the box. Only dependable tools (robust tasks) are used to fix the engine (evaluate large models).
Before vs. After:
- Before: Finnish evaluation was scattered across different formats, included datasets of mixed quality (often machine-translated), lacked prompt variety, and didn’t check whether tasks gave stable learning signals.
- After: FIN-bench-v2 unifies benchmarks into one format (HuggingFace Datasets), supplies both cloze and multiple-choice versions with five prompt variants each, applies human review to translations where needed, and filters tasks with four reliability checks based on actual learning curves.
Why it works (the intuition):
- Learning curves reveal whether a task reflects real skill growth. If a model trains longer and scores consistently rise, the task likely measures learning, not luck. If the signal is clear (high SNR), the ordering of models stays stable (good consistency), and scores beat random guessing (NRC positive), then that task is a trusty ruler. Using these tasks for final evaluation makes scores meaningful and comparable.
Building blocks (concept Sandwiches re-anchored in this context):
- Benchmark Suite (the report card): Input many Finnish tasks (reading, reasoning, sentiment, knowledge, alignment). Standardize them. Score them consistently.
- Cloze Formulation vs. Multiple-choice Formulation (two test styles): Because base and instruction-tuned models behave differently, you need both. Cloze checks free answering; multiple-choice checks selection under constraints.
- Monotonicity (learning should look like climbing stairs): If training longer doesn’t raise scores on a task, that task may be unreliable for tracking progress.
- Signal-to-Noise Ratio (can we hear the score over the chatter?): A task should show a clear trend rather than wobbles.
- Model Ordering Consistency (fast runners stay ahead): A stable task should preserve the relative strength of models across training.
- Non-Random Performance (better than coin flips): Ensures the task isn’t so hard, mislabeled, or noisy that models perform at chance.
What changes because of this idea?
- Model builders can trust Finnish scores more, especially over time and across different prompt wordings.
- The community gets a living benchmark that’s easy to run (LM Evaluation Harness), easy to extend (HuggingFace Datasets), and carefully curated (human review for key translations).
- Some famous benchmarks in their Finnish versions didn’t pass the reliability checks here, guiding future improvement rather than giving misleading results.
In practice, FIN-bench-v2 becomes a dependable Finnish testing ground that: (1) covers varied skills, (2) accounts for prompt sensitivity with five variants per task, (3) distinguishes base vs. instruction-tuned behavior using cloze vs. multiple-choice settings, and (4) remains maintainable, transparent, and public so others can replicate and extend it.
03 Methodology
At a high level: Candidate Finnish datasets → Standardize and prompt (CF + MCF, 5 variants) → Train small Finnish LLMs and collect learning curves → Apply four reliability checks → Keep only stable tasks → Evaluate large instruction-tuned models → Release everything publicly.
Step A: Gather and standardize tasks
- What happens: The team started with the original FIN-bench tasks and added Finnish versions of well-known datasets: ARC Challenge, Belebele, GoldenSwag, TruthfulQA, ScandiSent, SIB-200, SQuAD, plus FIN-bench analogies, emotions (expanded to 1,000 XED-based samples), general knowledge, HHH alignment, and similarities/abstraction. All datasets were converted to HuggingFace Datasets and made to work with the LM Evaluation Harness.
- Why this step exists: Without a common format and tooling, running fair, repeatable evaluations is hard. Different scripts and file types cause confusion and breakage.
- Example: ARC Challenge FI gets both cloze and multiple-choice templates with five prompt variants each, all in the same dataset structure.
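As a rough illustration of what this standardization buys you, the snippet below loads one task with the HuggingFace datasets library and inspects an example. The dataset id and column names are hypothetical placeholders; the real identifiers live in the project's repositories.

```python
# Hypothetical sketch of loading a FIN-bench-v2 task in the unified
# HuggingFace Datasets format. The dataset id and column names below are
# placeholders, not the project's real identifiers.
from datasets import load_dataset

dataset = load_dataset("some-org/fin-bench-v2-arc-challenge", split="test")  # placeholder id
print(len(dataset), "examples")

example = dataset[0]
print(example["question"])  # assumed column name
print(example["choices"])   # assumed column name
print(example["answer"])    # assumed column name
```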
Step B: Create two formulations with five prompt variants each
- What happens: Each task has Cloze Formulation (CF) and Multiple-choice Formulation (MCF). For each, five human-written prompt variants capture different wordings.
- Why this step exists: Models can be prompt-sensitive. Multiple variants measure robustness and reduce the chance that one unlucky phrasing skews results. CF suits base models; MCF often suits instruction-tuned ones.
- Example: For Belebele FI (MCF), prompts differ in instruction wording but ask the same thing—pick the correct answer based on the passage.
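To make the two formulations and the idea of prompt variants concrete, here is a minimal sketch that renders one item under both CF and MCF with a few wording variants (FIN-bench-v2 ships five per formulation). The Finnish wording and field names are illustrative assumptions, not the project's actual templates.

```python
# Illustrative CF vs MCF rendering with multiple prompt variants.
# The Finnish wordings and field names are assumptions, not FIN-bench-v2's templates.
item = {
    "question": "Mikä on Suomen pääkaupunki?",  # "What is the capital of Finland?"
    "choices": ["Oslo", "Kööpenhamina", "Helsinki", "Tukholma"],
    "answer_index": 2,
}

# Cloze formulation (CF): no options shown; the model must write the answer itself.
CF_VARIANTS = [
    "Kysymys: {question}\nVastaus:",
    "Vastaa lyhyesti. {question}",
    "{question} Oikea vastaus on",
]

# Multiple-choice formulation (MCF): options are listed; the model picks one.
MCF_VARIANTS = [
    "Kysymys: {question}\n{options}\nVastaus:",
    "Valitse oikea vaihtoehto.\n{question}\n{options}\nVastaus:",
]

def render(item, template, multiple_choice=False):
    options = ""
    if multiple_choice:
        options = "\n".join(
            f"{letter}) {choice}" for letter, choice in zip("ABCD", item["choices"])
        )
    return template.format(question=item["question"], options=options)

for tpl in CF_VARIANTS:
    print(render(item, tpl), "\n---")
for tpl in MCF_VARIANTS:
    print(render(item, tpl, multiple_choice=True), "\n---")
```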
Step C: Improve data quality via human review
- What happens: Machine-translated GoldenSwag FI was manually reviewed to fix or remove errors. Emotions were expanded and checked using XED guidelines.
- Why this step exists: Pure machine translation can mislabel or distort meaning. Human review increases accuracy and fairness, especially for nuanced tasks like commonsense or emotion.
- Example: If a translated GoldenSwag continuation contradicts the context, it’s corrected or removed.
Step D: Train purpose-built small LLMs and collect learning curves
- What happens: The team trained five 2.15B-parameter decoder-only models with the same architecture (Llama-like, Gemma-3 tokenizer), each on a different data source: FineWeb2, HPLT 2.0, HPLT 3.0, or MultiSynt (synthetic Finnish translated from English), plus a control model trained only on English (a Nemotron-CC partition) to check that the tasks really measure Finnish capability.
- Why this step exists: We want to see how tasks behave during learning—not just a one-time score. Learning curves reveal whether tasks reflect real improvement.
- Example: If a task’s accuracy rises steadily from checkpoint 1 to checkpoint 30, that signals good monotonicity.
Step E: Apply four reliability checks to filter tasks
- What happens: For each task (and each prompt variant), compute: (1) Monotonicity (does performance rise, or at least not fall, as training progresses?), (2) SNR (does the signal stand out above score noise?), (3) Non-Random Performance (does the best score beat the random baseline?), (4) Model Ordering Consistency (do model rankings stay stable over time?). Only tasks meeting the thresholds (roughly: at least moderate monotonicity, positive SNR, NRC ≥ 0, high tau-consistency) are kept.
- Why this step exists: These checks prevent keeping tasks that are too noisy, too random, or misleading. It’s like ensuring your measuring sticks aren’t bent.
- Example: Some popular benchmarks in Finnish (e.g., GSM8K, MMLU, XL-sum in this setup) didn’t meet criteria and were excluded to avoid misleading evaluations.
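A minimal sketch of how such a keep/drop rule might look in code; the threshold values below are placeholders rather than the paper's exact cutoffs.

```python
# Illustrative task filter on top of the four checks.
# Threshold values are placeholders; the paper's exact cutoffs may differ.
def keep_task(monotonicity, snr, nrc, ordering_consistency,
              min_monotonicity=0.6, min_snr=0.0, min_nrc=0.0, min_ordering=0.8):
    return (
        monotonicity >= min_monotonicity
        and snr > min_snr
        and nrc >= min_nrc
        and ordering_consistency >= min_ordering
    )

candidate_tasks = {
    # task: (monotonicity, SNR, NRC, ordering consistency) -- made-up numbers
    "arc_challenge_fi": (0.85, 3.1, 0.30, 0.95),
    "noisy_task_fi":    (0.40, 0.2, 0.01, 0.50),
}
kept = [name for name, metrics in candidate_tasks.items() if keep_task(*metrics)]
print(kept)  # -> ['arc_challenge_fi']
```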
Step F: Evaluate larger instruction-tuned models on the filtered suite
- What happens: Test strong, open models (e.g., Gemma 3 27B, Llama 4 Scout 17B 16E, Llama Poro 2 70B Instruct, Poro 34B Chat) in 0-shot and 1-shot settings for multiple-choice tasks, and for generative tasks where relevant. Compare CF vs MCF and inspect prompt-sensitivity spreads.
- Why this step exists: After building a reliable testbed, you want to see how today’s top models behave on Finnish tasks, discover strengths/weaknesses, and verify that CF/MCF differences match expectations.
- Example: Gemma 3 27B excels on ARC Challenge in MCF; Poro 2 70B shines on SQuAD FI extraction zero-shot but regresses with 1-shot formatting.
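As a sketch of how such an evaluation run could look with the LM Evaluation Harness Python API (the task names below are hypothetical placeholders; the real ones are defined in the project's harness fork, and the exact API surface can vary between harness versions):

```python
# Hedged sketch of evaluating an instruction-tuned model on FIN-bench-v2 tasks
# via the LM Evaluation Harness. Task names are hypothetical placeholders;
# the real ones live in the project's harness fork.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/gemma-3-27b-it,dtype=bfloat16",  # example model id
    tasks=["arc_challenge_fi_mcf", "belebele_fi_mcf"],             # placeholder task names
    num_fewshot=0,   # also rerun with num_fewshot=1 for the 1-shot setting
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```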
Step G: Release everything publicly with instructions
- What happens: All datasets, prompts, and configs are published for the LM Evaluation Harness fork and a companion repository. Clear instructions and even configs for excluded tasks are included so the community can reproduce, inspect, or extend decisions.
- Why this step exists: Openness builds trust, enables collaboration, and helps others improve the parts that didn’t pass today’s filters.
- Example: Researchers can re-run tasks as new Finnish corpora or models appear and watch whether excluded tasks improve after re-annotation.
The Secret Sauce: Learning-curve-based task selection
- Instead of assuming all famous tasks are reliable in Finnish, the authors made tasks earn their place. By watching how scores change as small models learn, they kept only those tasks that behaved like good, clear, steady tests. That’s what makes the suite robust and future-proof.
What would break without each step?
- Without standardization: Reproducibility breaks; results become apples vs. oranges.
- Without prompt variants: One unlucky wording could misjudge a model.
- Without human review: Translation errors could slip in and punish Finnish-aware models.
- Without learning-curve checks: Noisy or misleading tasks could dominate scores.
- Without public release: The community can’t verify or build on the work.
04 Experiments & Results
The Test: The authors measured how well models do on multiple-choice and generative tasks in Finnish across areas like reading comprehension (Belebele, SQuAD), commonsense (GoldenSwag), truthfulness (TruthfulQA), sentiment (ScandiSent), topic classification (SIB-200), relational/abstract reasoning (FIN-bench analogies, similarities), world knowledge (FIN-bench general knowledge), and alignment (FIN-bench HHH). They reported normalized accuracy for multiple-choice and metrics like F1, ROUGE, and BLEU for generative outputs.
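For the extraction-style generative tasks, the F1 is typically a token-overlap F1 between the predicted answer and the gold answer, as in standard SQuAD scoring. Assuming FIN-bench-v2 follows that convention, a minimal sketch looks like this:

```python
# Token-overlap F1 as in standard SQuAD-style scoring; assumed (not confirmed)
# to be close to what the SQuAD FI extraction metric computes.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Helsingin yliopisto", "Helsingin yliopisto perustettiin 1640"))  # ~0.67
```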
The Competition: Two groups of models were compared. First, five purpose-trained 2.15B models gave learning curves to filter tasks. Second, larger instruction-tuned models—Gemma 3 27B, Llama 4 Scout 17B 16E (MoE), Llama Poro 2 70B Instruct, and Poro 34B Chat—were evaluated on the filtered suite. Settings included 0-shot and 1-shot for multiple-choice; generative tasks were tested with 0-shot and 1-shot where meaningful.
Scoreboard with Context:
- Task filtering highlights: Many candidates failed reliability checks (e.g., GSM8K, MMLU, XL-sum in this setup; several original FIN-bench tasks like arithmetic, cause and effect, empirical judgments, intent recognition, misconceptions, sentence ambiguity). Even moving to 1-shot/5-shot only nudged scores slightly and didn’t fix underlying instability. In contrast, ARC Challenge passed all four checks.
- Control model sanity: The English-only Nemotron-CC control model did poorly on Finnish, as expected—good sign the evaluation detects language mismatch.
- Synthetic data effect: The MultiSynt model, trained on translated synthetic data, often outperformed models trained on human-authored Finnish—likely because many evaluated tasks were also translated, sharing style artifacts that favored MultiSynt. This cautions against overinterpreting wins on translated sets.
- Large-model multiple-choice (0-shot): Gemma 3 27B was the most consistent top performer overall (e.g., ARC Challenge CF 0.57 → MCF 0.70). Llama 4 Scout and Poro 2 70B formed a competitive second tier, with Poro 2 70B often better in CF but volatile in MCF, and Scout benefitting more from MCF. Poro 34B Chat trailed.
- Formulation-invariant/easy: ScandiSent scores were very high (>0.92) for all models regardless of CF/MCF—near a ceiling effect.
- GoldenSwag oddity: In 0-shot, CF scores were good (>0.60), but MCF collapsed to near random for all large models. With 1-shot, MCF jumped to match or beat CF for most models—except Poro 34B Chat, which stayed near random. This shows how setup and examples can flip outcomes.
- Prompt sensitivity: Belebele MCF showed large spreads between wording variants (averages ~0.37 to ~0.57). In other tasks, variants clustered tightly—suggesting some tasks are more sensitive to phrasing than others.
- Generative tasks: On SQuAD FI (extraction), Poro 2 70B got the best zero-shot F1 (0.31), edging Gemma 3 27B (0.29) and Scout (0.25). With 1-shot, Gemma 3 and Scout doubled F1 to ~0.59, indicating strong in-context learning from examples, while Poro 2 70B regressed (to 0.16), hinting at formatting or instruction-following mismatches. On TruthfulQA FI generative, Gemma 3 27B had stronger ROUGE-1 Max (~20.3) than Scout (~14.0), but all models struggled to avoid common misconceptions in 0-shot and 1-shot.
Surprising Findings:
- Translated-data advantage: Models trained on translated synthetic Finnish can do very well on translated evaluation sets—highlighting the need for more native Finnish datasets and careful contamination checks.
- CF vs. MCF reversals: While many instruction-tuned models improved with MCF (as expected), the Poro family sometimes got worse—suggesting that option lists can act like distracting noise depending on training.
- GoldenSwag’s 0-shot MCF collapse: A strong CF performance paired with near-random MCF until a 1-shot example was added is a vivid reminder that evaluation formatting details can dominate outcomes.
Big picture: The filtered FIN-bench-v2 tasks separate model strengths clearly, and the CF/MCF plus prompt-variant design reveals important behavior patterns that would remain hidden with a single prompt or format.
05 Discussion & Limitations
Limitations:
- Data contamination: Because big open models are trained on large, partly undisclosed web data, we can’t guarantee test sets were unseen. Some high scores could come from memorization. The small, controlled models help, but the risk remains for larger models.
- Prompt brittleness: Even with five variants per task, prompts can still sway outcomes. A model might fail due to wording quirks rather than lack of knowledge. Averaging helps but doesn’t fully solve this.
- Cultural bias: Translating an English or US-centric dataset into Finnish doesn't automatically adapt its content to Finnish cultural assumptions. Some “commonsense” items may still be Anglocentric, potentially misrepresenting what a Finnish-aligned model “should” know.
- Resource cost: Training and evaluating across models and prompts is computationally heavy (about 23,000 GPU hours), which can limit how often the full suite is rerun.
Required Resources:
- Compute to train or at least evaluate on large models (GPUs), storage for datasets and checkpoints, and the LM Evaluation Harness setup. Human time is needed for translation review and prompt design.
When NOT to Use:
- If you want a single quick metric from a single prompt on a single task; this suite is about robustness and breadth, not a one-number shortcut.
- If you only want to test base models with MCF (or only instruction-tuned models with CF); the suite is designed around running both formulations to respect model behavior differences.
- If your project requires purely native, culturally grounded Finnish data with zero translation footprints; some FIN-bench-v2 tasks are translated (though reviewed where noted).
Open Questions:
- How will scores shift as more native Finnish datasets (not translations) are added over time?
- Can we design even better prompt sets or automatic rephrasing that preserve meaning but reduce brittleness further?
- How can we detect and discount contamination effects more reliably for closed-weight models?
- What new, safety-critical Finnish tasks (medicine, law, civic services) should be included next, and how can we annotate them cost-effectively yet carefully?
- Can we create lightweight, low-GPU variants of the suite that still preserve the key robustness properties?
06 Conclusion & Future Work
Three-sentence summary: FIN-bench-v2 is a unified, modern benchmark suite for Finnish language models, offering both cloze and multiple-choice tasks with five prompt variants and human-reviewed translations where needed. The authors trained several small LLMs and used their learning curves to filter tasks with four robustness checks, keeping only those that give clear, stable signals. They then evaluated larger instruction-tuned models to map strengths, weaknesses, and prompt sensitivities, releasing all data and tools publicly.
Main Achievement: Turning Finnish LLM evaluation into a reliable, maintainable, and transparent process by combining standardized datasets, prompt diversity, human-reviewed translations, and learning-curve-based task selection.
Future Directions: Expand into more native Finnish datasets and harder generative tasks, add domain-specific evaluations (medicine, law, safety-critical reasoning), refine translations and labels, and strengthen contamination analysis as new Finnish corpora appear. Explore smarter prompt variation methods and lighter evaluation recipes to cut compute costs.
Why Remember This: It shows how to build a trustworthy scoreboard for a low-resource language by testing the tests first. The approach—unify, clean, vary prompts, and filter via learning curves—can guide other languages, too. For Finland, it’s a practical step toward better local AI systems that understand the language, culture, and needs of the people who use them.
Practical Applications
- •Compare Finnish chatbots fairly using the same reliable set of tasks and prompts.
- •Track model training progress with stable tasks that reflect real learning, not noise.
- •Choose between CF and MCF prompts depending on whether the model is base or instruction-tuned.
- •Diagnose prompt sensitivity by testing five prompt variants per task and selecting robust templates.
- •Identify weak spots (e.g., truthfulness vs. reading comprehension) to guide fine-tuning and data collection.
- •Benchmark new Finnish models quickly via LM Evaluation Harness with ready-made configs.
- •Prioritize dataset improvements (translation fixes, re-annotation) based on which tasks failed reliability checks.
- •Monitor contamination risk by contrasting results from controlled small models and large open models.
- •Design curricula for pretraining by using tasks with strong monotonicity and SNR as progress indicators.