
FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Intermediate
Jonas Golde, Patrick Haller, Alan Akbik · 12/15/2025

Key Summary

  • FINERWEB is a new, carefully built dataset pipeline that teaches computers to spot names of people, places, and more across 91 languages and 25 writing systems.
  • It first asks a smart model to rate how useful text passages are for learning NER, then trains a smaller model to quickly find the best passages at scale.
  • Those chosen passages are labeled by two multilingual LLMs (GPT-4o mini and Gemma3-27B), and their answers are merged using smart rules and semantic similarity checks.
  • The final dataset has about 225,000 passages and around 235,000 distinct entity labels, giving rich, fine-grained coverage far beyond classic person/location/organization types.
  • A regression model picks good passages with over 84 F1 in the binary setting, helping filter out junk like ads and off-topic chatter.
  • Models trained on FINERWEB match or beat strong baselines in zero-shot tests on English, Thai, and Swahili, even though FINERWEB uses 19x less training data.
  • Annotation quality, judged by a very large model, scores high on faithfulness (3.99/5) and completeness (4.05/5), meaning labels are usually correct and not too many are missed.
  • Translated label names for each target language are also released, because many models drop by 0.02–0.09 F1 when evaluated against target-language labels instead of English ones.
  • An analysis shows a long tail of entity difficulty: common types like person are easy and confident, while rare domain types need more care.
  • All code, models, and data are released to help the community build better multilingual NER systems.

Why This Research Matters

FINERWEB makes advanced language tools more fair by supporting 91 languages and 25 scripts, not just English. It helps build faster, smaller NER models that still perform well, lowering costs and energy use. Journalists, analysts, and researchers can extract who, where, and when from global text streams, improving insights and decision-making. Local-language apps and assistants can become more accurate, useful, and culturally inclusive. By releasing data, code, and models, FINERWEB accelerates community progress and reproducibility. The pipeline design also teaches a general lesson: learn to find good data first, then label carefully for best results.

Detailed Explanation


01Background & Problem Definition

🍞 Top Bread (Hook): You know how kids in different countries call the same fruit by different names, like “apple” in English and “manzana” in Spanish? If you want to sort fruit names from stories written in many languages, you have to know lots of languages and lots of fruit types.

🥬 The Concept (Named Entity Recognition, NER): NER is the job of finding special names in text—like people, places, companies, dates, or even science ideas. How it works:

  1. Read the text.
  2. Spot words or phrases that are special “named things.”
  3. Give each a label like person, location, or more fine-grained types. Why it matters: Without NER, computers can’t pull out who did what, where, and when—making search, assistants, and data tools much less helpful. 🍞 Bottom Bread (Anchor): In “Taylor Swift performed in Paris on July 12,” NER finds “Taylor Swift” (person), “Paris” (location), and “July 12” (date).
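A tiny illustration of what that output looks like as data: the field names and character offsets below are one common convention, not FINERWEB’s schema.

```python
# Illustrative NER output for the example sentence above.
# The field names ("text", "type", "start", "end") are one common
# convention, not the schema used by FINERWEB.
sentence = "Taylor Swift performed in Paris on July 12"

entities = [
    {"text": "Taylor Swift", "type": "person",   "start": 0,  "end": 12},
    {"text": "Paris",        "type": "location", "start": 26, "end": 31},
    {"text": "July 12",      "type": "date",     "start": 35, "end": 42},
]

# Character offsets let each mention be recovered directly from the text.
for ent in entities:
    assert sentence[ent["start"]:ent["end"]] == ent["text"]
```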

🍞 Top Bread (Hook): Imagine a big world map covered in many scripts—Latin, Arabic, Thai, Devanagari—and you’re asked to find all the cities marked in every script.

🥬 The Concept (Multilingual NER): Multilingual NER does the same NER task, but across many languages and scripts. How it works:

  1. Use models trained on multiple languages.
  2. Understand different spelling rules and word breaks.
  3. Keep labels consistent even when languages differ. Why it matters: Without multilingual NER, tools work well in English but stumble in Thai, Amharic, or Khmer. 🍞 Bottom Bread (Anchor): A system that recognizes “東京” (Tokyo) in Japanese and “Tóquio” in Portuguese as the same city is doing multilingual NER.

🍞 Top Bread (Hook): Picture a super-smart librarian who can read nearly any book in any language and explain it to you.

🥬 The Concept (Large Language Models, LLMs): LLMs are very capable language tools that understand and generate text. How it works:

  1. They learn patterns from huge amounts of text.
  2. They answer questions or label text by following prompts.
  3. They can work across many languages. Why it matters: LLMs can label data and teach smaller models that run faster on everyday computers. 🍞 Bottom Bread (Anchor): Asking an LLM, “Mark all people and places in this paragraph,” and getting back a neatly labeled list is using an LLM for NER.

The World Before: For years, NER models were strongest in English and mostly used a few big labels like person, location, organization. Many languages had tiny datasets, and very few had fine-grained labels (like “space agency,” “scientific theory,” or “program name”). LLMs could help by labeling data, but most multilingual NER datasets created with LLMs were side effects of other projects, not reusable, carefully designed resources.

The Problem: We needed a reliable, reusable way to create multilingual NER data that is both wide (many languages) and deep (many fine-grained labels). The web is messy—ads, lists, off-topic chatter—so we also needed a way to pick only passages that are great for NER.

Failed Attempts: Prompting an LLM at inference time to extract entities can work, but it’s clunky for NER. You may need to prompt separately for each label type, and the outputs don’t always line up with the original text, requiring heavy post-processing. Older multilingual datasets either covered tons of languages but only with very coarse labels, or offered many fine-grained labels but for just a handful of high-resource languages.

The Gap: No one had built a scalable, systematic pipeline that (1) finds high-quality, NER-rich text across many languages, (2) labels that text with multilingual LLMs, and (3) merges and quality-checks the results into a reusable dataset.

🍞 Top Bread (Hook): Imagine sorting only the most interesting parts of thousands of books before asking experts to label them—so experts don’t waste time on junk.

🥬 The Concept (Teacher–Student Distillation): A big “teacher” model annotates lots of data, and a smaller “student” model learns from those labels to be fast and efficient. How it works:

  1. Use LLMs to label many examples (teacher).
  2. Train a smaller model on these labels (student).
  3. Deploy the smaller model for speed and cost. Why it matters: Without distillation, you’d need expensive LLM calls at runtime for every sentence. 🍞 Bottom Bread (Anchor): After the teacher labels “Einstein” as a person and “Relativity” as a scientific theory, the student learns to do this quickly on new text without calling the teacher.

Real Stakes: Better multilingual NER helps news apps track who did what worldwide, supports medical or financial text mining in many countries, powers search in local languages, and makes assistants more helpful for everyone—not just English speakers. It also saves money: smaller student models can run on modest hardware, making advanced language tech more accessible.

02Core Idea

🍞 Top Bread (Hook): Imagine building a giant, multilingual sticker book where every interesting name in every language gets the right sticker—quickly and cleanly.

🥬 The Concept (FINERWEB Pipeline): FINERWEB is a three-stage recipe to create a multilingual, fine-grained NER dataset at scale. How it works:

  1. Rate: Ask an LLM to rate how useful passages are for NER and train a regression model to predict these ratings.
  2. Filter: Use that regression model to scan a huge web corpus and keep only top-quality passages in each language.
  3. Annotate + Merge: Have two multilingual LLMs label entities and types; align, merge, and translate labels, then split into sentences. Why it matters: Without this pipeline, we’d either label lots of junk (wasting compute) or miss many languages and fine-grained types. 🍞 Bottom Bread (Anchor): From a massive web crawl, FINERWEB keeps 2.5k great passages per language (91 languages total), labels them with two LLMs, merges the results, and delivers about 225k passages with 235k distinct labels.

Three Analogies for the Same Idea:

  • Gold panning: Stage 1 learns what shiny gold looks like; Stage 2 sifts tons of river sand to keep only gold; Stage 3 expert jewelers (LLMs) grade and tag the nuggets, and we merge their opinions.
  • Museum curation: First, train a curator’s eye; second, select gallery-worthy pieces; third, invite two experts to label each piece, reconcile their notes, and add multilingual captions.
  • Farming: Learn to spot fertile soil; pick the best plots; harvest crops with two teams; combine, clean, and label the produce in many languages.

🍞 Top Bread (Hook): You know how a coach first explains what a good play looks like, then trains players to spot it quickly during a game?

🥬 The Concept (Regression Rater): A multilingual regression model learns to score how NER-friendly a passage is. How it works:

  1. Sample 1k passages per language and have LLMs (GPT-4o mini, Gemma3-27B) rate them 1–4.
  2. Train XLM-R or mDeBERTa to predict that score.
  3. Use the model to filter FineWeb-2 into high-quality chunks. Why it matters: Without this rater, you waste LLM labels on spam, ads, and off-topic text. 🍞 Bottom Bread (Anchor): The rater achieves >84 F1 in the binary setting, reliably separating “useful” vs “not useful” passages.

🍞 Top Bread (Hook): Imagine having two doctors give a diagnosis, then combining their answers if they agree, or choosing the better one if they don’t.

🥬 The Concept (Dual-LLM Annotation + Semantic Merge): Two multilingual LLMs label entities and types; a merger aligns spans and reconciles types. How it works:

  1. Enforce exact substring matches and left-to-right alignment.
  2. Keep the longer span when spans overlap by less than 50%; otherwise compute label similarity (MiniLM) and, if it exceeds 0.75, merge the types (e.g., “person / human”).
  3. Translate labels into each language and split passages into sentences. Why it matters: Without careful merging, you either lose good labels or keep conflicting ones. 🍞 Bottom Bread (Anchor): About 63% of all LLM annotations survive merging; GPT-4o mini tends to produce longer spans that are preserved more often.

Before vs After:

  • Before: Multilingual NER datasets were either broad but shallow (few labels) or deep but narrow (few languages), often messy and not reusable.
  • After: FINERWEB offers a reusable, high-quality, fine-grained, multilingual resource with transparent filtering, merging, and translated labels.

Why It Works (Intuition):

  • Signal boosting: The regression rater concentrates expensive LLM labeling on rich, entity-dense text.
  • Redundancy for reliability: Two annotators plus semantic merging reduce single-model quirks.
  • Grounding: Exact-substring and left-to-right rules keep labels aligned with real text.
  • Accessibility: Translated label sets and sentence splits make the data easier to use.

🍞 Top Bread (Hook): Think of LEGO bricks snapping together to build a big castle.

🥬 The Concept (Building Blocks): FINERWEB is built from modular parts. How it works:

  1. Source: FineWeb-2 multilingual text (91 XLM-R languages in 25 scripts).
  2. Rater: XLM-R regression model trained on LLM-rated samples.
  3. Annotators: GPT-4o mini and Gemma3-27B.
  4. Merger: Span rules + MiniLM similarity for label reconciliation.
  5. Translators and splitters: Google Translate for labels; Segment-Any-Text for sentences. Why it matters: Each block is swappable, so the pipeline can evolve as better models appear. 🍞 Bottom Bread (Anchor): If a new LLM becomes better at Thai, you can plug it into Stage 3 without changing the rest of the pipeline.

03Methodology

At a high level: Input (FineWeb-2 passages in 91 languages) → [Stage 1: Learn what “good for NER” looks like] → [Stage 2: Filter the big corpus] → [Stage 3: Annotate, merge, translate, and split] → Output (FINERWEB dataset, plus trained raters and prompts).
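To make the flow concrete, here is a minimal sketch of the stages as plain Python. Every name here (the filter and annotate-merge helpers, the toy rater and annotator) is illustrative shorthand under my own assumptions, not the released FINERWEB code.

```python
# Hypothetical skeleton of the three-stage flow; every name here is
# illustrative shorthand, not the released FINERWEB code.

def stage2_filter(chunks, rater, threshold=0.5):
    """Stage 2: keep only chunks the Stage-1 rater scores as NER-useful."""
    return [c for c in chunks if rater(c) > threshold]

def stage3_annotate_and_merge(passages, annotators, merge):
    """Stage 3: label each passage with every annotator LLM, then merge."""
    return [merge(p, [annotate(p) for annotate in annotators]) for p in passages]

# Toy stand-ins: a real run would use the trained regression rater and two
# multilingual LLM annotators (GPT-4o mini and Gemma3-27B).
rater = lambda text: 1.0 if any(w[:1].isupper() for w in text.split()[1:]) else 0.0
annotators = [lambda p: [(w.strip("."), "entity") for w in p.split() if w[:1].isupper()]]
merge = lambda p, outputs: {"passage": p, "entities": sorted(set(sum(outputs, [])))}

kept = stage2_filter(["Kraft Foods acquired Cadbury.", "lol so boring tbh"], rater)
print(stage3_annotate_and_merge(kept, annotators, merge))
```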

Stage 0: Source and Language Choice 🍞 Top Bread (Hook): Imagine starting a sticker album; first you choose which countries’ stamps to collect so pages match the album’s layout.

🥬 The Concept (FineWeb-2 + XLM-R Language Set): Use the FineWeb-2 web corpus and restrict to the 91 languages XLM-R supports across 25 scripts. How it works:

  1. Start from FineWeb-2 (CommonCrawl-based, very broad coverage).
  2. Keep only languages shared with XLM-R so later models can train directly.
  3. Use native scripts (e.g., Arabic in Arabic script). Why it matters: Without a shared, supported language set, training and evaluation would be inconsistent and fragile. 🍞 Bottom Bread (Anchor): This yields a multilingual pool ready for consistent filtering and annotation.

Stage 1: Learning to Spot Great Passages 🍞 Top Bread (Hook): Before you pick the ripest fruit quickly, you first learn what ripe looks, smells, and feels like.

🥬 The Concept (Regression Rater Training): Train a model to predict how NER-useful a passage is. How it works:

  1. For each language, sample 1k passages; chunk to 256 tokens using spaCy/Janome/Stanza.
  2. Ask GPT-4o mini and Gemma3-27B to rate usefulness 1–4 with a clear rubric.
  3. Train multilingual transformers (XLM-R, mDeBERTa) to regress these scores; try binary (≥3 vs <3) and multiclass. Why it matters: Without it, you might label ads or gossip with few entities, wasting compute and lowering dataset quality. 🍞 Bottom Bread (Anchor): XLM-R trained on GPT-4o mini ratings wins (binary F1 ≈ 84.1), so it becomes the rater used in Stage 2.
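A hedged sketch of what Stage 1 training could look like with Hugging Face Transformers: fine-tune xlm-roberta-base with a single regression output to predict the 1–4 usefulness score. The hyperparameters, toy examples, and bare-bones training loop are my assumptions; the paper’s exact setup may differ.

```python
# Hedged sketch of the Stage-1 rater: regress the 1-4 usefulness ratings.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=1, problem_type="regression"
)

# Toy LLM-rated examples: (passage text, usefulness score in [1, 4]).
examples = [
    ("Kraft Foods completed its acquisition of Cadbury in 2010.", 4.0),
    ("click here for the best deals!!! limited time offer", 1.0),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for text, score in examples:
    batch = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    out = model(**batch, labels=torch.tensor([score]))
    out.loss.backward()          # MSE loss because problem_type="regression"
    optimizer.step()
    optimizer.zero_grad()

# At filtering time, a predicted score of roughly 3 or above counts as "useful".
```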

Stage 2: Filtering the Big Corpus 🍞 Top Bread (Hook): Now that you have a “ripe detector,” you scan crates fast and keep only the best fruit.

🥬 The Concept (High-Quality Passage Selection): Apply the rater across FineWeb-2 and keep top chunks. How it works:

  1. Chunk each document to 256 tokens and score with the rater.
  2. Keep chunks predicted as useful (>0.5 threshold ≈ score ≥3).
  3. Run a language-ID check to remove stray English ads inside non-English pages.
  4. Keep at most one good chunk per source to boost diversity. Why it matters: Without this, your annotation budget gets spent on duplicates and off-topic text. 🍞 Bottom Bread (Anchor): The result is an unlabeled, high-quality pool: 2.5k passages per language across 91 languages.
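A hedged sketch of the Stage 2 filtering loop. The rater is passed in as a plain scoring function, and the langid package stands in for whatever language-ID check the authors used; the thresholds follow the text above.

```python
# Hedged sketch of the Stage-2 filter: score 256-token chunks, keep the ones
# predicted useful, drop stray wrong-language chunks, and keep at most one
# chunk per source document. `langid` is a stand-in language-ID tool.
import langid

def filter_corpus(docs, rater, expected_lang, threshold=0.5, max_per_doc=1):
    """docs: iterable of (doc_id, [chunk, ...]); rater: chunk -> usefulness score."""
    kept = []
    for doc_id, chunks in docs:
        taken = 0
        for chunk in chunks:
            if taken >= max_per_doc:
                break                          # diversity: one chunk per source
            if rater(chunk) <= threshold:
                continue                       # not NER-useful enough
            lang, _ = langid.classify(chunk)
            if lang != expected_lang:
                continue                       # e.g., English ad inside a Thai page
            kept.append((doc_id, chunk))
            taken += 1
    return kept
```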

Stage 3: Dual-LLM Annotation, Alignment, and Merge 🍞 Top Bread (Hook): Two referees watch the same play; you keep the agreement and resolve disagreements carefully.

🥬 The Concept (Annotation + Alignment): Two LLMs label entity mentions and types; enforce exact text grounding. How it works:

  1. Prompt GPT-4o mini and Gemma3-27B to output (entity, type) lists with broad, fine-grained coverage.
  2. Alignment rule A: every entity string must be an exact substring of the input; discard others.
  3. Alignment rule B: process left-to-right; if an entity appears before the current pointer, discard it to avoid duplicates. Why it matters: Without grounding and order, labels can drift or multiply. 🍞 Bottom Bread (Anchor): With these rules, about 77% of Gemma3-27B’s raw annotations and 66% of GPT-4o mini’s are kept through the alignment and merging steps.
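A hedged sketch of the two alignment rules as a small Python function; the data shapes and the example passage are illustrative, not the paper’s code.

```python
# Hedged sketch of alignment: (A) each predicted entity string must appear
# verbatim in the passage, and (B) mentions are consumed left to right via a
# pointer, so spans occurring only before the pointer are discarded.

def align(passage, predictions):
    """predictions: list of (entity_text, entity_type) from one LLM annotator."""
    aligned, pointer = [], 0
    for entity, etype in predictions:
        start = passage.find(entity, pointer)   # rule A: exact substring match
        if start == -1:
            continue                            # rule B: absent or already consumed -> discard
        end = start + len(entity)
        aligned.append({"text": entity, "type": etype, "start": start, "end": end})
        pointer = end                           # advance the left-to-right pointer
    return aligned

print(align("Einstein proposed Relativity. Einstein won the Nobel Prize.",
            [("Einstein", "person"), ("Relativity", "scientific theory"),
             ("Einstein", "person"), ("Mars", "planet")]))
```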

🍞 Top Bread (Hook): If two labels overlap, you either keep the clearer, longer one or combine them if they really mean the same thing.

🥬 The Concept (Semantic Merge of Spans and Types): Reconcile two annotators’ outputs. How it works:

  1. If spans overlap by less than 50%, keep the longer span; if they do not overlap at all, keep both.
  2. If spans overlap by 50% or more, compute label similarity with all-MiniLM-L6-v2; if it exceeds 0.75, merge the types (e.g., “person / human”).
  3. Retain about 63% of all annotations after reconciliation. Why it matters: Without smart merging, you’d lose coverage or keep contradictions. 🍞 Bottom Bread (Anchor): The merged set keeps longer, clearer spans and fuses near-synonyms into multi-label types when appropriate.
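A hedged sketch of the reconciliation step using sentence-transformers with all-MiniLM-L6-v2. The 50% overlap and 0.75 similarity thresholds follow the text above; the overlap definition and code structure are my assumptions.

```python
# Hedged sketch of span/type reconciliation between the two annotators.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def overlap_ratio(a, b):
    """Fraction of the shorter span covered by the character overlap."""
    inter = max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))
    return inter / min(a["end"] - a["start"], b["end"] - b["start"])

def reconcile(span_a, span_b, sim_threshold=0.75):
    ratio = overlap_ratio(span_a, span_b)
    if ratio == 0:
        return [span_a, span_b]                 # no overlap: keep both
    longer = max(span_a, span_b, key=lambda s: s["end"] - s["start"])
    if ratio < 0.5:
        return [longer]                         # small overlap: keep the longer span
    sim = util.cos_sim(encoder.encode(span_a["type"]),
                       encoder.encode(span_b["type"])).item()
    if sim > sim_threshold:                     # near-synonym labels: fuse into one multi-label type
        longer = dict(longer, type=f'{span_a["type"]} / {span_b["type"]}')
    return [longer]
```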

🍞 Top Bread (Hook): Think of adding museum labels in local languages so every visitor understands them.

🥬 The Concept (Label Translation and Sentence Splitting): Make the dataset easier to use across languages. How it works:

  1. Translate English label names into target languages (Google Translate API).
  2. Split passages into sentences using Segment-Any-Text to aid training. Why it matters: Without translated labels and sentence splits, training and evaluation can penalize models unfairly or complicate usage. 🍞 Bottom Bread (Anchor): Researchers can train with English or target-language labels and pick sentence-level inputs easily.
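A hedged sketch of this post-processing step. The dictionary lookup stands in for the Google Translate API and the regex stands in for Segment-Any-Text; both are placeholders so the example stays self-contained, not the paper’s tooling.

```python
# Hedged sketch: attach a target-language name to each English label and
# split a passage into sentences. Both helpers are simplified placeholders.
import re

LABEL_TRANSLATIONS = {("person", "de"): "Person", ("location", "de"): "Ort"}

def translate_label(label, target_lang):
    # Placeholder for the Google Translate API call used in the paper.
    return LABEL_TRANSLATIONS.get((label, target_lang), label)

def split_sentences(passage):
    # Naive stand-in: a real run would use a learned segmenter such as
    # Segment-Any-Text, which also handles scripts without sentence-final dots.
    return [s for s in re.split(r"(?<=[.!?])\s+", passage) if s]

passage = "Einstein wurde in Ulm geboren. Er entwickelte die Relativitätstheorie."
print(split_sentences(passage))           # two German sentences
print(translate_label("location", "de"))  # -> "Ort"
```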

Concrete Data Examples (Mini Case Studies)

  • High-quality example: “Kraft Foods… Cadbury… Philadelphia Light…” contains multiple brands and products with rich context (great for product/brand/org labels).
  • Low-quality example: A chatty spoiler rant with no real named entities (bad for NER learning).

The Secret Sauce

  • Learn-then-filter: Instead of randomly labeling the web, the rater focuses LLM power where it counts.
  • Redundant annotation with principled merging: Two annotators plus deterministic rules and semantic similarity boost correctness and coverage.
  • Multilingual usability: Translated labels and consistent scripts enable broad, fair experimentation.

Outputs

  • FINERWEB dataset: ~225k passages, 235k distinct entity types, 91 languages, 25 scripts; more distinct annotations per sentence than prior universal NER datasets.
  • Artifacts: Trained regression models, prompts, and confidence splits for further research.

04Experiments & Results

🍞 Top Bread (Hook): Think of a science fair where every project is judged on clarity (precision), completeness (recall), and overall grade (F1).

🥬 The Concept (Metrics: Precision/Recall/F1): These scores tell us if the model guesses correctly (precision), finds enough of the true answers (recall), and balances both (F1). How it works:

  1. Precision = of what you predicted, how much was right?
  2. Recall = of what was right, how much did you find?
  3. F1 = the balance between the two. Why it matters: Without these, we can’t fairly compare models. 🍞 Bottom Bread (Anchor): A binary rater F1 > 84 means it’s very good at separating useful vs. not-useful passages.
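A minimal sketch of span-level precision, recall, and F1 under exact-match scoring of (start, end, type) triples; the paper’s evaluation details may differ.

```python
# Exact-match span scoring: a prediction counts only if the offsets and type
# both match a gold annotation.

def prf1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 12, "person"), (26, 31, "location"), (35, 42, "date")}
pred = {(0, 12, "person"), (26, 31, "location"), (32, 34, "date")}
print(prf1(gold, pred))   # (0.667, 0.667, 0.667): two of three correct each way
```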

What They Measured and Why

  • Rater quality: Can the regression model reliably pick NER-rich passages? (Binary and multi-class settings)
  • Dataset usefulness: Do models trained on FINERWEB perform well on human-labeled benchmarks in zero-shot?
  • Annotation quality: Are LLM-made labels faithful and complete across languages?
  • Label translation effects: Do target-language labels make evaluation harder for today’s models?
  • Difficulty tails: Do some entity types consistently get lower confidence, indicating long-tail challenges?

The Competition and Baselines

  • Datasets: NuNER, PileNER, and Euro-GLiNER-X for universal NER comparisons.
  • Models: GLiNER variants, XLM-RoBERTa, mDeBERTa, Binder (with mBERT backbone).

Scoreboard Highlights

  • Rater performance: XLM-R with GPT-4o mini ratings reaches about 84.1 F1 (binary); it also leads in multi-class settings against mDeBERTa. GPT-4o mini ratings yielded better-trained raters than Gemma3-27B ratings.
  • Dataset scale and richness: FINERWEB covers 91 languages, 25 scripts, and ≈235k distinct entity types—more diverse annotations per sentence than prior universal datasets.
  • Downstream zero-shot: A Binder model fine-tuned on selected FINERWEB splits matches or beats strong baselines on English (CoNLL-2003), Thai (ThaiNER), and Swahili (MasakhaNER), even with roughly 19x less training data than some baselines. Joint training boosts English and Swahili but slightly lowers Thai, likely due to differing tokenization and positive/negative ratios.
  • Annotation quality via LLM-as-judge: Using Qwen3-235B, faithfulness averages ≈3.99/5 and completeness ≈4.05/5 across 91 languages, indicating few hallucinations and generally good coverage; under-annotation is the main remaining issue.
  • Label translation impact: Evaluating models with target-language labels often drops performance by ~0.02–0.09 F1 (and larger drops in some tests), likely because translated labels become more semantically overlapping, making strict classification losses struggle.

🍞 Top Bread (Hook): If two judges agree most of the time, you can trust the results; when they differ, you study why.

🥬 The Concept (LLM-as-a-Judge): A very large model scores how correct (faithfulness) and how complete annotations are. How it works:

  1. Sample 25 examples per language.
  2. Ask Qwen3-235B to score 1–5 for faithfulness and completeness.
  3. Collect error lists for missing and wrong annotations. Why it matters: Without a scalable judge, cross-language manual checks would be impractical. 🍞 Bottom Bread (Anchor): Across 66k annotations checked, ≈6.12% were missing and ≈5.97% were wrong, with common types (person, organization, date) forming much of the remainder.
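A hedged sketch of how such a judging pass could be scripted. Here call_llm is a hypothetical stand-in for whatever client reaches the judge model (Qwen3-235B in the paper), and the prompt wording is illustrative.

```python
# Hedged sketch of LLM-as-a-judge: sample 25 examples per language and ask a
# large model for 1-5 faithfulness and completeness scores.
import json
import random

JUDGE_PROMPT = (
    "Rate the NER annotations for this passage on a 1-5 scale.\n"
    "faithfulness: are the labeled spans and types correct?\n"
    "completeness: are all entity mentions covered?\n"
    'Answer as JSON: {{"faithfulness": <int>, "completeness": <int>}}\n\n'
    "Passage: {passage}\nAnnotations: {annotations}"
)

def judge_language(examples, call_llm, sample_size=25, seed=0):
    """examples: annotated records for one language, each with a passage and its entities."""
    random.seed(seed)
    sample = random.sample(examples, min(sample_size, len(examples)))
    scores = []
    for ex in sample:
        prompt = JUDGE_PROMPT.format(passage=ex["passage"],
                                     annotations=json.dumps(ex["entities"], ensure_ascii=False))
        scores.append(json.loads(call_llm(prompt)))   # expects a JSON reply from the judge
    n = len(scores)
    return {"faithfulness": sum(s["faithfulness"] for s in scores) / n,
            "completeness": sum(s["completeness"] for s in scores) / n}
```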

Surprising/Notable Findings

  • Redundancy helps: Merging GPT-4o mini and Gemma3-27B yields solid alignment; GPT-4o mini tends to produce longer spans, which the merger retains more often.
  • Long-tail confirmed: Confidence analysis shows about half the gold spans get very high confidence (>0.97), while domain-specific labels (e.g., “scientific concept”) form a difficult tail.
  • Language variance: Some languages (e.g., English, Portuguese, Bulgarian) score highest in faithfulness/completeness, while others (e.g., Amharic, Kurdish, Oriya) show lower completeness—matching known LLM coverage patterns.

Context for the Numbers

  • “84 F1 in binary” is like consistently picking the right fruit crate 5 out of 6 times while rarely grabbing a bad one.
  • “Comparable or improved zero-shot with far less data” is like winning races with a lighter backpack: training efficiency matters.
  • “Faithfulness ≈ 4/5” is like the labels are usually correct; “Completeness ≈ 4/5” means most entities are found, but there’s room to catch more.

05Discussion & Limitations

🍞 Top Bread (Hook): If you build a great library but some shelves have fewer books, readers in those sections won’t find everything they need.

🥬 The Concept (Limitations and When Not to Use): FINERWEB is strong but not perfect. How it works:

  1. LLM coverage is uneven; low-resource languages and rare domains can be under-annotated.
  2. The first release covers 91 languages and ~250k samples—large, but not maximal.
  3. Translated labels can be semantically overlapping; standard losses treat them as separate, hurting scores. Why it matters: Knowing limits helps you avoid misuse and plan improvements. 🍞 Bottom Bread (Anchor): If you need perfect biomedical labels in a low-resource script, expect gaps and consider adding targeted annotations.

Required Resources

  • Storage and compute to fine-tune NER models on 225k passages.
  • Ability to run or access multilingual transformers (e.g., XLM-R) and sentence-embedding models (MiniLM) if you reconstruct merging.
  • Optional budget for translations if you expand label sets, and for running LLM-as-a-judge.

🍞 Top Bread (Hook): Think of a marathon where sprinters and walkers share the path—training one style can accidentally favor the other.

🥬 The Concept (When Not to Use As-Is): Mixed-language training can bias models toward languages with easier segmentation or distributions. How it works:

  1. Different scripts change tokenization and positives/negatives.
  2. Joint training without balance can lower performance for some languages (e.g., Thai in the paper’s Binder experiments).
  3. Coarse losses punish semantically equivalent translated labels as if they were opposites. Why it matters: Without balancing, a model can get great at English while slipping in Thai. 🍞 Bottom Bread (Anchor): For Thai, per-language fine-tuning or rebalancing batches may beat naive multilingual mixing.
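One possible mitigation, sketched below with a generic PyTorch weighted sampler so that no single language dominates a joint training mix; this is a common recipe, not the paper’s Binder training setup.

```python
# Hedged sketch: sample training examples inversely to language frequency so
# high-resource languages do not crowd out low-resource ones.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def balanced_sampler(language_per_example):
    """language_per_example: list like ["en", "en", "th", "sw", ...]."""
    counts = Counter(language_per_example)
    weights = [1.0 / counts[lang] for lang in language_per_example]  # rarer language -> higher weight
    return WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)

langs = ["en"] * 1000 + ["th"] * 50 + ["sw"] * 50
sampler = balanced_sampler(langs)
drawn = Counter(langs[i] for i in sampler)
print(drawn)   # roughly equal draws per language despite the skewed pool
```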

Open Questions and Future Work

  • Can training losses respect semantic overlap so “person” and “persona” aren’t treated as enemies?
  • How to boost completeness in low-resource languages—more annotators, better prompts, or targeted sampling?
  • Can we expand beyond 91 languages while maintaining quality?
  • How to better handle long-tail, domain-specific types—active learning, curriculum splits, or confidence-weighted training?
  • What’s the best way to mix languages so one doesn’t overpower another?

06Conclusion & Future Work

Three-Sentence Summary: FINERWEB is a scalable, three-stage pipeline that learns to find NER-rich passages across 91 languages, labels them with two multilingual LLMs, and merges, translates, and segments the results into a high-quality dataset. It achieves strong rater performance (>84 F1), delivers about 225k passages with ≈235k distinct labels, and enables zero-shot models that match or beat strong baselines using far less training data. Quality checks with an LLM judge show high faithfulness and good completeness, with under-annotation as the main area to improve.

Main Achievement: Turning LLM supervision into a reusable, multilingual, fine-grained NER resource through a principled pipeline—rate, filter, annotate, merge—that saves compute, boosts quality, and generalizes across 25 scripts.

Future Directions: Improve completeness in low-resource settings (e.g., active sampling, better prompts), design training losses that respect semantic label overlap, expand language/script coverage, and explore curriculum or confidence-weighted training to tackle the long tail of entity types. Also, test additional annotator LLMs and upgrade individual modules (e.g., the rater or merger) as new models emerge.

Why Remember This: FINERWEB shows that careful engineering—learning to find good data first, then labeling with redundancy and rules—can beat brute force. It turns scattered, one-off synthetic datasets into a reliable multilingual resource, paving the way for faster, cheaper, and fairer NER tools for the whole world.

Practical Applications

  • Build a multilingual news monitor that tags people, places, organizations, and events across 91 languages.
  • Create local-language search features that highlight key entities directly in results (e.g., cities, schools, hospitals).
  • Power customer-support analytics by extracting product names, brands, and issue types from global feedback.
  • Enable researchers to mine scientific articles for theories, programs, and organizations across multiple scripts.
  • Deploy compact NER models on edge devices (e.g., kiosks or mobile phones) for on-device privacy-preserving extraction.
  • Support cross-border compliance by identifying companies, regulations, and dates in multilingual legal documents.
  • Improve chatbots and assistants with entity-aware understanding in users’ native languages.
  • Detect and normalize locations in disaster reports across languages to aid crisis response.
  • Curate multilingual knowledge bases by linking fine-grained entity mentions to consistent types.
  • Run curriculum or confidence-weighted training using FINERWEB’s long-tail signals to boost hard entity types.
#multilingual NER  #named entity recognition  #LLM supervision  #teacher–student distillation  #FineWeb-2  #XLM-RoBERTa  #semantic merging  #LLM-as-a-judge  #sentence embeddings  #MiniLM  #label translation  #zero-shot transfer  #confidence-based splits  #fine-grained entity types  #dataset creation pipeline