EEG Foundation Models: Progresses, Benchmarking, and Open Problems
Key Summary
- This paper builds a fair, big playground (a benchmark) to test many EEG foundation models side-by-side on the same rules.
- It reviews 50 EEG foundation models and organizes how they are built into a clear map of design choices.
- Twelve open-source foundation models and strong specialist baselines are compared on 13 datasets across nine BCI tasks.
- Results show linear probing (freezing the encoder and training only a small head) is usually not enough; full fine-tuning matters.
- Small specialist models trained from scratch often match or beat foundation models on many tasks.
- Bigger EEG models do not automatically perform better under today’s data quality and training practices.
- Some tasks with strong rhythms (like SSVEP) transfer better, hinting that pretraining that captures temporal patterns helps.
- A within-subject few-shot setup shows models need better rapid calibration with very little user data.
- Simple alignment steps, like Euclidean Alignment, can noticeably boost cross-subject performance.
- The paper provides open benchmarking code and points to open problems like data quality, cross-device handling, and better pretraining targets.
Why This Research Matters
EEG models can help doctors spot seizures sooner, improve sleep assessments, and give voice to people who cannot speak. A fair benchmark means hospitals, startups, and labs can trust which models will hold up on new patients and devices. Better transfer with little calibration makes BCIs more comfortable and practical in daily life. Understanding that simple alignment (like EA) and small specialists can excel saves money and time. Knowing that bigger is not always better prevents wasted compute and pushes the field toward cleaner data and smarter objectives.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how it’s hard to compare two board games if each one uses different rules and score sheets? Even if both claim to be “the best,” you can’t tell unless they play by the same rules.
🥬 Filling (The Actual Concept)
- What it is: This paper standardizes how we test EEG foundation models so everyone plays by the same rules.
- How it works (step by step):
- Gather many EEG models and list their building blocks (data prep, architectures, pretraining goals).
- Pick many different EEG tasks (like sleep staging, seizures, and motor imagery) to reflect real-world variety.
- Create two fair testing setups: cross-subject (no personal calibration) and within-subject few-shot (tiny personal calibration).
- Compare two adaptation styles: full fine-tuning vs. linear probing.
- Report results in the same way, so comparisons are fair and clear.
- Why it matters: Without a fair test, we might crown the wrong winner, deploy models that don’t generalize, or miss simple tricks that work better.
🍞 Bottom Bread (Anchor) Imagine a science fair where every volcano project must use the same size bottle and the same amount of baking soda. Now you can truly judge which design works best. That’s what this benchmark does for EEG models.
🍞 Top Bread (Hook) You know how a TV remote can work with many TV brands if it learns their signals? Brain-computer interfaces (BCIs) try to decode brain signals to control devices—but people and headsets differ a lot.
🥬 Filling
- What it is: A Brain-Computer Interface (BCI) links brain signals to computers so people can communicate or get medical help.
- How it works:
- Place sensors on the scalp to record tiny electrical changes (EEG).
- Clean and normalize the signals to reduce noise.
- Feed them into AI models that learn patterns.
- Turn patterns into decisions (e.g., which letter to type, what sleep stage you’re in).
- Why it matters: Without good decoding, BCIs become unreliable, frustrating, or even unsafe in clinical use.
🍞 Bottom Bread (Anchor) A person with limited speech could select letters on a screen using EEG. A reliable model makes their sentences clear and quick.
🍞 Top Bread (Hook) Imagine trying to learn every accent in the world—big job! EEG Foundation Models try to learn from many kinds of brain signals at once.
🥬 Filling
- What it is: EEG foundation models are large, pre-trained AI systems that learn general EEG features from lots of unlabeled recordings and then adapt to many tasks.
- How it works:
- Pretrain on big, mixed EEG data using self-supervised learning (no labels needed).
- Transfer to new tasks with a small amount of labeled data.
- Evaluate with consistent rules to see what transfers well.
- Why it matters: Labels are expensive; devices differ; people vary. Good pretraining can save time, data, and money.
🍞 Bottom Bread (Anchor) Like learning the “grammar” of brain waves first, then quickly picking up a new “vocabulary” for a task like sleep stages.
🍞 Top Bread (Hook) You know how practicing with puzzles without answers can still make you sharper? That’s the idea behind self-supervised learning.
🥬 Filling
- What it is: Self-supervised learning teaches a model by hiding parts of the input and asking it to guess them, no labels required.
- How it works:
- Mask or transform some EEG parts.
- Make the model predict the missing pieces (in time, frequency, or token space).
- Repeat at scale so it learns useful structures.
- Why it matters: EEG labels are scarce; this way, huge unlabeled datasets become training gold.
🍞 Bottom Bread (Anchor) Cover words in a paragraph and guess them back; soon you understand the language structure—same idea for EEG.
🍞 Top Bread (Hook) If you borrow a bike from a friend, you still know how to ride because you transfer your skill. Models do this too.
🥬 Filling
- What it is: Transfer learning moves knowledge from pretraining to new tasks.
- How it works:
- Learn general EEG patterns on large mixed data.
- Fine-tune on a small labeled set for a specific job.
- Why it matters: Saves data and time, and boosts performance when labels are limited.
🍞 Bottom Bread (Anchor) A model trained broadly on EEG can quickly adapt to recognize seizures in a new hospital’s data.
🍞 Top Bread (Hook) Comparing runners is only fair if they run on the same track and distance. That’s benchmarking.
🥬 Filling
- What it is: A fair test suite to compare models.
- How it works:
- Same datasets and splits.
- Same training rules for all.
- Same metrics, like balanced accuracy or RMSE (a minimal harness sketch appears below).
- Why it matters: Without it, we can’t tell if a model is truly better or just lucky.
🍞 Bottom Bread (Anchor) Like a universal spelling bee word list so every student gets an equal challenge.
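To make the “same track, same distance” idea concrete, here is a minimal, hypothetical benchmark-harness sketch in Python. It is not the paper’s released code; it assumes scikit-learn-style models with fit/predict, plus a user-supplied split generator and metric function.

```python
def run_benchmark(models, datasets, make_splits, metric):
    """Every model sees the same data splits, training rules, and metric."""
    results = {}
    for ds_name, (X, y, groups) in datasets.items():
        for train_idx, test_idx in make_splits(groups):
            for model_name, make_model in models.items():
                model = make_model()                    # fresh model per split
                model.fit(X[train_idx], y[train_idx])   # identical training data for all
                score = metric(y[test_idx], model.predict(X[test_idx]))
                results.setdefault((ds_name, model_name), []).append(score)
    return results  # average the per-split scores afterwards
```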
02 Core Idea
🍞 Top Bread (Hook) Imagine hosting the Olympics for EEG models: same stadium, same timing system, same rules. Now the results mean something.
🥬 Filling
- The “Aha!” Moment (one sentence): Build a unified map of EEG foundation model designs and a fair, wide benchmark so we can discover what truly helps models generalize.
- Multiple Analogies (3 ways):
- Science lab analogy: Everyone follows the same experiment protocol, so results are comparable.
- Toolbox analogy: We sort tools (masking types, tokenizers, normalizations) so builders can pick the right tool for each job.
- Sports analogy: We set equal lanes and starting blocks; then the fastest runner really is the fastest.
- Before vs After:
  - Before: Each paper used different datasets, splits, and metrics—confusing and hard to trust.
  - After: A common playground across 13 datasets and 9 tasks shows what transfers, what doesn’t, and why.
- Why It Works (intuition, not equations):
  - Control the variables: Consistent preprocessing and evaluation remove hidden advantages.
  - Two views of transfer: Cross-subject LOSO tests zero-calibration; within-subject few-shot tests rapid personalization.
  - Two adaptation styles: Linear probing checks if features are already good; full fine-tuning checks if models can be adapted to be great.
  - Across tasks: If a method works on rhythmic SSVEP and clinical seizures, its features are probably robust.
- Building Blocks (each with a mini sandwich):
- 🍞 Channel Unification
- What: Make inputs from different headsets comparable.
- How: Map channels to a template, encode positions, or project channels into a shared space (see the sketch after this list).
- Why: Without it, models confuse layouts and underperform.
- Anchor: Turning many city maps into the same coordinate grid so navigation works everywhere.
- 🍞 Normalization (z-score, CAR, EA, EMA)
- What: Make signals comparable across time/people.
- How: z-score (per-channel scale), CAR (remove common noise), EA (whiten per subject), EMA (track drift online).
- Why: Without it, models chase noise instead of patterns.
- Anchor: Adjusting all microphones to the same volume and removing room hum.
- 🍞 Pretraining Objectives
- What: Masked reconstruction in time, tokens, frequency; codebooks; autoregressive.
- How: Hide parts and predict them back.
- Why: Teaches structure without labels.
- Anchor: Guessing missing puzzle pieces trains your puzzle skills.
- 🍞 Evaluation Scenarios (LOSO vs Few-shot)
- What: Two real-world modes: no-calibration vs tiny-calibration.
- How: LOSO holds out the person; few-shot uses a small slice from the same person.
- Why: Both realities matter for deployment.
- Anchor: Trying a bike that must fit with no adjustments (LOSO) vs. quickly lowering the seat first (few-shot).
- 🍞 Adaptation Strategies (Linear Probing vs Full Fine-tuning)
- What: Frozen features vs updating everything.
- How: Train only a small head vs tune the whole model.
- Why: Tells if features are already good or need task-specific shaping.
- Anchor: Using a store-bought cake mix as-is vs tweaking the whole recipe for your oven.
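To make channel unification concrete, here is a minimal NumPy sketch of the template-mapping option (real foundation models may instead use positional encodings or learned projections). The template list and the zero-fill rule for missing channels are illustrative assumptions, not the paper’s exact procedure.

```python
import numpy as np

TEMPLATE = ["FZ", "CZ", "PZ", "OZ", "C3", "C4", "O1", "O2"]  # illustrative target montage

def unify_channels(x, channel_names, template=TEMPLATE):
    """Reorder a recording's channels onto a fixed template.

    x: (n_channels, n_samples) array; channel_names labels x's rows.
    Channels missing from the recording are zero-filled; extras are dropped.
    """
    row_of = {name.upper(): i for i, name in enumerate(channel_names)}
    out = np.zeros((len(template), x.shape[-1]), dtype=x.dtype)
    for j, name in enumerate(template):
        if name in row_of:
            out[j] = x[row_of[name]]
    return out
```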
🍞 Bottom Bread (Anchor) With the same field, rules, and scoreboard, we can finally tell which EEG methods truly travel well from lab to life.
03 Methodology
🍞 Top Bread (Hook) Imagine a cooking show where every chef gets the same ingredients, same oven, same time limit. Now we can really compare recipes.
🥬 Filling
- High-level recipe: EEG Data → Standardize (channels, filtering, normalization) → Self-supervised Pretraining → Downstream Adaptation (linear probing or full fine-tuning) → Evaluate in LOSO and few-shot.
- Step A: Data Standardization
  - What happens: Bring different EEG recordings into a common format.
  - Why it exists: Different headsets and noisy signals otherwise confuse models.
  - Example: Map 64-channel and 21-channel recordings into a shared template; resample to 200–256 Hz; apply band-pass and notch filters.
Mini-sandwiches for key tools (a code sketch follows this list):
- 🍞 z-score Normalization
- What: Make each channel have mean 0 and variance 1.
- How: Subtract channel mean, divide by its standard deviation.
- Why: Prevents one loud channel from dominating.
- Anchor: Setting all singers to the same loudness before a choir performance.
- 🍞 CAR (Common Average Reference)
- What: Remove shared noise across channels.
- How: Subtract the average of all channels at each time step.
- Why: Reduces headset/reference noise.
- Anchor: Taking out the background hum so individual instruments stand out.
- 🍞 EA (Euclidean Alignment)
- What: Whiten per subject/session to match covariances.
- How: Compute average covariance and apply whitening.
- Why: Subjects look more alike to the model.
- Anchor: Calibrating cameras so the same colors look the same across photos.
- 🍞 EMA (Exponential Moving Average)
- What: Track changing means/variances over time.
- How: Update running stats with a decay factor.
- Why: Handles gradual drift in long recordings.
- Anchor: Slowly adjusting thermostat as the weather changes.
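Here is a minimal NumPy sketch of these four normalization tools, assuming arrays shaped (channels, samples) for a single recording and (trials, channels, samples) for Euclidean Alignment. The epsilon values and EMA decay factor are illustrative choices, not the paper’s settings.

```python
import numpy as np

def zscore(x, eps=1e-8):
    """Per-channel z-score along time: mean 0, variance 1. x: (channels, samples)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def car(x):
    """Common Average Reference: subtract the across-channel mean at each time step."""
    return x - x.mean(axis=0, keepdims=True)

def euclidean_alignment(trials, eps=1e-8):
    """EA: whiten one subject's trials by the inverse square root of their
    average spatial covariance. trials: (n_trials, channels, samples)."""
    covs = np.stack([t @ t.T / t.shape[-1] for t in trials])
    vals, vecs = np.linalg.eigh(covs.mean(axis=0))
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return np.einsum("ij,njk->nik", inv_sqrt, trials)

def ema_normalize(x, decay=0.99, eps=1e-8):
    """Online z-score with exponential moving averages, tracking slow drift."""
    out = np.empty_like(x, dtype=float)
    mean = x[:, 0].astype(float)
    var = np.ones_like(mean)
    for t in range(x.shape[-1]):
        mean = decay * mean + (1 - decay) * x[:, t]
        var = decay * var + (1 - decay) * (x[:, t] - mean) ** 2
        out[:, t] = (x[:, t] - mean) / np.sqrt(var + eps)
    return out
```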
- Step B: Self-supervised Pretraining Strategies (like a menu; a code sketch follows it)
- 🍞 Masked Reconstruction of Raw Signals
- What: Hide time patches and rebuild the waveform.
- How: Mask chunks; encoder-decoder predicts raw data; use robust losses.
- Why: Teaches temporal and cross-channel structure; risk: memorizing noise.
- Anchor: Repainting missing parts of a picture using surrounding brush strokes.
- 🍞 Masked Reconstruction of Embedded Tokens
- What: First tokenize EEG (CNN or patch embed), then reconstruct tokens.
- How: Predict hidden embeddings with MSE or contrastive losses.
- Why: Less sensitive to amplitude noise; risk: losing fine details if tokens are too coarse.
- Anchor: Summarizing paragraphs and then recovering the summaries—less messy than raw text.
- 🍞 Frequency-domain Reconstruction
- What: Predict spectrogram, amplitudes, band power, or phase.
- How: Apply STFT or band features; reconstruct masked spectral targets.
- Why: Great for rhythmic patterns (SSVEP, oscillations); risk: misses sharp transients.
- Anchor: Matching a song’s beat and volume pattern rather than every note’s exact shape.
- 🍞 Codebook-based Objectives
- What: Quantize to discrete codes and predict indices.
- How: Learn a codebook; optimize cross-entropy or embedding MSE.
- Why: Compresses noise, enables causal modeling; risk: code collapse without care.
- Anchor: Turning paragraphs into a sequence of known word IDs.
- 🍞 Autoregressive (Causal) Pretraining
- What: Predict the next token/code from past context only.
- How: Decoder-only, causal masks; maximize likelihood.
- Why: Aligns with sequence generation and prompting; risk: focusing too locally.
- Anchor: Finishing a melody note by note from what you’ve already heard.
- 🍞 Hybrids
- What: Combine time and frequency or tokens and raw.
- How: Multi-loss training to capture complementary structure.
- Why: More balanced features across tasks.
- Anchor: Using both a map and a compass for navigation.
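As a toy illustration of the first menu item (masked reconstruction of raw signals), here is a minimal PyTorch sketch. The patching scheme, model sizes, and 50% mask ratio are illustrative assumptions, not any specific model’s recipe.

```python
import torch
import torch.nn as nn

class MaskedEEGPretrainer(nn.Module):
    """Mask random time patches of raw EEG and learn to reconstruct them."""

    def __init__(self, n_channels=22, patch_len=200, d_model=128):
        super().__init__()
        self.patch_embed = nn.Linear(n_channels * patch_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(d_model, n_channels * patch_len)
        self.mask_token = nn.Parameter(torch.zeros(d_model))

    def forward(self, patches, mask_ratio=0.5):
        # patches: (batch, n_patches, n_channels * patch_len), flattened raw EEG patches
        tokens = self.patch_embed(patches)
        mask = torch.rand(tokens.shape[:2], device=patches.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        recon = self.decoder(self.encoder(tokens))
        # reconstruction loss only on the masked patches
        return ((recon - patches) ** 2)[mask].mean()
```

A frequency-domain variant would swap the raw-patch targets for masked spectral features (e.g., STFT band power), and a codebook variant would replace the MSE target with cross-entropy over discrete code indices.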
- Step C: Downstream Adaptation (a short code sketch follows)
- 🍞 Linear Probing
- What: Freeze the encoder; train only a small classifier head.
- How: Simple supervised training on top of features.
- Why: Tests if features are already general and task-ready.
- Anchor: Checking a plant’s health by just looking at leaves.
- 🍞 Full-parameter Fine-tuning
- What: Update the entire model on the new task.
- How: End-to-end supervised training.
- Why: Adapts features to task specifics; often much better for EEG.
- Anchor: Rewriting the whole recipe to fit your kitchen and ingredients.
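A minimal PyTorch-style sketch of toggling between the two adaptation modes; it assumes, purely for illustration, that the task head’s parameters are named with a “head” prefix.

```python
def set_adaptation_mode(model, mode="full_finetune"):
    """Linear probing freezes the encoder and trains only the head;
    full fine-tuning leaves every parameter trainable."""
    for name, param in model.named_parameters():
        if mode == "linear_probe":
            param.requires_grad = name.startswith("head")  # head-only updates
        else:  # "full_finetune"
            param.requires_grad = True
```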
- Step D: Evaluation Scenarios (a splitting sketch follows)
- 🍞 LOSO (Leave-One-Subject-Out)
- What: Train on all subjects except one; test on the held-out person.
- How: No target-user labels used for training (zero-calibration).
- Why: Tests cross-subject generalization.
- Anchor: A new player joins the team on game day—no practice with them.
- 🍞 Within-subject Few-shot
- What: Use a tiny labeled slice from the target user to adapt, test on the rest.
- How: 5–30% of a session, depending on dataset.
- Why: Tests rapid personalization.
- Anchor: Adjusting a bike seat just a little before riding around the block.
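Here is a minimal NumPy sketch of the two evaluation splits; the 10% calibration fraction is just an illustrative default (the paper’s few-shot slices range from roughly 5% to 30% depending on the dataset).

```python
import numpy as np

def loso_splits(subject_ids):
    """Leave-One-Subject-Out: train on every subject except one, test on the held-out one."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == held_out)
        train_idx = np.flatnonzero(subject_ids != held_out)
        yield held_out, train_idx, test_idx

def few_shot_split(n_trials, calib_fraction=0.1, seed=0):
    """Within-subject few-shot: a small labeled slice for calibration, the rest for testing."""
    order = np.random.default_rng(seed).permutation(n_trials)
    n_calib = max(1, int(calib_fraction * n_trials))
    return order[:n_calib], order[n_calib:]
```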
- Metrics and Examples (a short metric sketch appears below):
  - Use Balanced Classification Accuracy (BCA) for classification (fair under class imbalance).
  - Use RMSE for regression (e.g., vigilance).
  - Example: Sleep-EDFx (5 classes) uses BCA; SEED-VIG (continuous PERCLOS) uses RMSE.
- Secret Sauce:
  - The head-to-head comparison of linear probing vs full fine-tuning exposes how much the encoder truly transfers.
  - The dual scenarios (LOSO vs few-shot) mirror real deployments: no calibration vs rapid calibration.
  - The taxonomy connects pretraining choices to task types (e.g., frequency targets shine for rhythmic tasks).
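For completeness, a short sketch of the two metrics, using scikit-learn’s balanced_accuracy_score (the mean of per-class recalls) for BCA and plain NumPy for RMSE.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def bca(y_true, y_pred):
    """Balanced Classification Accuracy: average of per-class recalls, fair under class imbalance."""
    return balanced_accuracy_score(y_true, y_pred)

def rmse(y_true, y_pred):
    """Root-mean-square error for continuous targets such as vigilance (PERCLOS)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```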
04 Experiments & Results
🍞 Top Bread (Hook) If we race many paper airplanes in the same gym with the same wind, which design really flies best? That’s this benchmark for EEG models.
🥬 Filling
- The Test (what and why):
  - 13 datasets across 9 BCI paradigms: motor imagery, P300, SSVEP, clinical seizure/abnormal detection, sleep staging, emotion, vigilance, workload, and visual decoding.
  - Two scenarios: LOSO (no user calibration) and within-subject few-shot (tiny user calibration).
  - Two adaptation modes: linear probing vs full fine-tuning.
  - Metrics: Balanced Classification Accuracy (BCA) for classification; RMSE for regression.
  - Why: To see which pretraining and adaptation strategies actually transfer across people, devices, and tasks.
- The Competition:
  - 12 open-source EEG foundation models.
  - Strong specialist baselines: traditional ML (e.g., CSP+LDA, TRCA) and deep models (EEGNet, ShallowConv, Conformer, Deformer, etc.).
- The Scoreboard (with context):
  - Linear probing is frequently insufficient: In many tasks, models needed full fine-tuning to reach competitive accuracy—like getting a B when frozen, but jumping to an A− or A with full tuning.
  - Specialist models remain tough: Lightweight CNNs such as EEGNet often placed top-1 or top-3 across tasks, despite tiny parameter counts—like a compact car beating sports cars on a tight city course.
  - Bigger is not always better: Larger foundation models did not automatically deliver better generalization under current data quality and training practices—like a heavier backpack not helping you hike faster.
  - Rhythmic tasks shine: On SSVEP with strong periodic signals, several foundation models did quite well even with little data, indicating that pretraining capturing temporal rhythms transfers nicely.
  - Task dependence matters: Some models excelled in clinical tasks (e.g., abnormal EEG detection) but fell behind on non-clinical paradigms, reflecting pretraining data biases.
- Surprising/Notable Findings:
  - Full fine-tuning usually produced clearer class separation in embeddings (seen via t-SNE), suggesting the encoder needs adaptation to each task’s quirks.
  - Euclidean Alignment (EA) often improved generalization: simple alignment steps can be powerful.
  - Increasing few-shot data helped most models steadily; however, truly great low-data adaptation remains an open problem.
- Concrete takeaways:
- If you need out-of-the-box features, today’s EEG encoders aren’t universally ready—expect to fine-tune.
- Don’t count out classic CNNs or traditional ML; they’re strong, efficient baselines and sometimes best-in-class.
- Focus pretraining on the structure your task cares about (e.g., frequency for SSVEP) for better transfer.
- Invest in data quality and alignment; simple steps like EA can give reliable boosts.
🍞 Bottom Bread (Anchor) It’s like realizing a well-tuned, small kite outperforms a giant untuned one in steady wind: careful setup and fit to conditions beat sheer size.
05 Discussion & Limitations
🍞 Top Bread (Hook) You wouldn’t wear the same shoes for hiking and sprinting. Likewise, one EEG model rarely wins everywhere—yet.
🥬 Filling
- Limitations (what it can’t do yet):
  - Universal transfer is not here: Most models need full fine-tuning and don’t act as plug-and-play feature extractors across all tasks.
  - Data quality bottleneck: Noisy, heterogeneous EEG limits scaling benefits; pretraining can learn to reconstruct noise.
  - Device/layout gaps: Cross-headset differences still hurt; channel unification helps but isn’t perfect.
  - Few-shot fragility: With very little calibration, performance often drops—rapid personalization needs work.
- Required Resources:
  - Multi-dataset EEG access (ideally cleaned and curated), compute (GPUs), and careful preprocessing (filters, normalization, alignment).
  - Consistent evaluation pipelines and seeds for fair comparisons.
- When NOT to Use:
  - If you only have a tiny, well-defined task and need speed/efficiency, a small specialist (like EEGNet + good preprocessing) may be better.
  - If your device layout is far from what a foundation model saw in pretraining and you can’t align channels, expect weak transfer.
- Open Questions:
  - Paradigm-specific vs generalist: When is a focused pretraining (e.g., motor imagery only) better than a broad one?
  - Tokenization/codebooks: Which token types best capture EEG’s fine rhythms and transients without losing detail?
  - Scaling laws: Will bigger finally help if we assemble cleaner, much larger corpora?
  - Better objectives: Can hybrid time–frequency–and-structure targets learn features that are both robust and transferable?
  - Cross-device adaptation: What’s the most reliable way to align headsets and montages in a single pretraining framework?
🍞 Bottom Bread (Anchor) It’s like building a universal remote: we’re close, but still need better codes, cleaner signals, and smarter learning to control every TV reliably.
06 Conclusion & Future Work
🍞 Top Bread (Hook) Think of this work as the rulebook and scoreboard the EEG community needed to run a real league.
🥬 Filling
- 3-Sentence Summary:
- The paper reviews 50 EEG foundation models and organizes their design choices into a clear taxonomy.
- It benchmarks 12 open-source foundation models and strong specialists on 13 datasets, under both cross-subject and few-shot settings, comparing linear probing versus full fine-tuning.
- Results show linear probing is often inadequate, small specialists can be very competitive, and bigger models don’t automatically generalize better under current practices.
- Main Achievement: A fair, comprehensive benchmark and design map that reveal what actually helps EEG models transfer across tasks and people.
- Future Directions: Build large, cleaner EEG corpora; improve cross-device alignment; design smarter pretraining targets (hybrid time–frequency–token/codebook); develop rapid, low-data personalization methods.
- Why Remember This: Because with common rules and broad testing, we can truly see progress, avoid hype, and steer EEG research toward models that work reliably in the clinic, at home, and on new devices.
🍞 Bottom Bread (Anchor) Like standardizing plug sizes worldwide so chargers just work—this benchmark is a big step toward EEG models that just work everywhere.
Practical Applications
- Choose models using this benchmark’s rankings for your specific task (e.g., EEGNet for small, fast deployments).
- Adopt Euclidean Alignment (EA) or CAR in your preprocessing pipeline to boost cross-subject robustness.
- Prefer full-parameter fine-tuning for critical applications where accuracy matters more than training speed.
- Use frequency-domain targets when working on rhythmic tasks like SSVEP to improve transfer.
- Start with a paradigm-specific foundation model (e.g., MIRepNet for motor imagery) when the task is known.
- Pilot within-subject few-shot calibration sessions (5–10% data) to quickly personalize models.
- Standardize channel layouts via template mapping or learned projections for cross-device compatibility.
- Combine time- and frequency-domain objectives during pretraining to capture complementary structure.
- Measure with BCA (for classification) and RMSE (for regression) to get fair, comparable performance.
- Plan data collection with quality controls (artifact removal, subject screening) to unlock better scaling.