For this post, I wanted to take a look at general-purpose time series foundation models. In particular, I am curious about what the “general” part of these models actually entails. This is not intended as a thorough benchmark, but more as an exploratory exercise.
I will start with a short intro about why foundation models for time series are in a different category compared to language or vision models, along with an overview of what has been released over the last year or so.
Then, I will transition to a case study with one of these models—MOMENT—by analyzing electroencephalographic (EEG) data from a brain-computer interface experiment. Why EEG data? Well, earlier this year, Piramidal—a YC-backed company building a foundation model specifically for EEG data—raised a $6.5M seed round. That left me wondering how good general-purpose models can be if companies are raising rounds of this size to train domain-specific models.
Impatient? TLDR;
Why are general time series foundation models different from other modalities?
Language and vision have fundamental structures that are largely consistent regardless of the domain. With language, for example, models trained on a math textbook and on Shakespearean prose will still capture the same structural rules: syntax, grammar, sentence formation. Domain-specific vocabulary or context can vary, but the foundational structure of language doesn’t change. Similarly, vision models can learn shapes, edges, colors, spatial patterns, and textures that carry semantic meaning.
However, when it comes to time series, I initially had some questions about the potential for such broad generalizability. Time series data holds information far beyond trends and seasonality: complex dynamics like phase and frequency modulation, as well as phase-amplitude coupling, can govern many processes and can vary drastically from one domain to another. And this doesn’t even account for multi-channel data, where cross-channel interactions add yet another layer of complexity.
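To make this concrete, here is a small toy example (purely illustrative, not tied to any of the models below) of a signal with phase-amplitude coupling, where the amplitude of a fast oscillation is driven by the phase of a slower one:

import numpy as np

# Toy signal with phase-amplitude coupling: the amplitude of a fast
# (40 Hz) oscillation is modulated by the phase of a slow (6 Hz) one.
fs = 512                      # sampling frequency in Hz
t = np.arange(0, 2, 1 / fs)   # 2 seconds of data

slow_phase = 2 * np.pi * 6 * t
slow = np.sin(slow_phase)

# The envelope of the fast rhythm follows the slow phase
envelope = 0.5 * (1 + np.cos(slow_phase))
fast = envelope * np.sin(2 * np.pi * 40 * t)

signal = slow + fast + 0.1 * np.random.randn(t.size)

A model that only knows about trends and seasonality has no notion of this kind of nested structure.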
Overview of Time Series Foundation Models
Here’s an overview of some pre-trained models for time series, along with the companies or labs behind them:
| Model | Company / University |
|---|---|
| TimesFM | |
| Lag-Llama | - |
| ForecastPFN | Abacus.ai |
| TinyTimeMixer | IBM |
| Uni2TS | Salesforce |
| MOMENT | CMU |
| MOIRAI | Salesforce |
| Toto | Datadog |
| Chronos | Amazon |
| TimeGPT | NIXTLA |
Many of these models are pre-trained on data from diverse domains—weather, medicine (e.g., ECGs), finance, and more. The pre-training approach for most of these models is similar to that used in vision models: masks are applied at various time points, and the model learns to predict the values for these masked intervals, effectively filling in the gaps.
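As a rough illustration of that objective, here is a simplified sketch of masked reconstruction in PyTorch (a toy version of the idea, not the actual training code of any of these models):

import torch

def masked_reconstruction_loss(model, x, mask_ratio=0.3):
    # x: [batch, n_timesteps]; `model` is assumed to map a masked
    # series back to a full-length reconstruction.
    mask = torch.rand_like(x) < mask_ratio   # randomly chosen time points
    x_masked = x.clone()
    x_masked[mask] = 0.0
    x_hat = model(x_masked)
    # The loss is only evaluated on the masked positions,
    # i.e., the model is rewarded for filling in the gaps.
    return torch.nn.functional.mse_loss(x_hat[mask], x[mask])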
Ideally, pre-training enables these models to identify and understand informative patterns inherent across domains. Beyond basic seasonality and trends, a robust foundation model should capture higher-level structures such as phase and frequency modulations and cross-channel interactions. This level of adaptability would make such models truly general-purpose, equipping them to handle the intricate dynamics present in complex time series data. However, given the vast range of domains and the diversity of signal types, achieving this level of adaptability seems very challenging. The differences in underlying structures across domains make it difficult for any single model to effectively capture the full complexity of any type of time series data, which would explain the emergence of domain-specific foundation models such as the one developed by Piramidal.
Case study
As an exploration exercise, I chose MOMENT: the reported results look very good, and it is very well documented. For the data, I chose an EEG dataset corresponding to a standard classification task for a brain-computer interface. More on the data below.
The goal is to assess the zero-shot performance of the MOMENT embeddings when used by a downstream LDA classifier. The code below can also be found here.
Step: Set up general parameters
We'll be using the small version of the model and a subset of the EEG channels to simplify the analysis.
!uv pip install numpy pandas scikit-learn matplotlib tqdm torch moabb pyriemann
!uv pip install git+https://github.com/moment-timeseries-foundation-model/moment.git
TORCH_DEVICE = 'mps' # Apple Silicon GPU
MODEL_NAME = 'AutonLab/MOMENT-1-small'
EEG_CHANNELS = ['Fz', 'C3', 'Cz', 'C4', 'P3', 'Pz', 'P4', 'O1', 'O2']
SAMPLING_FREQUENCY = 512
Step: Load the data
We will use one of the sessions of the ERP study published by Hübner et al. 2017, available via the Mother of All BCI Benchmarks (MOABB). The experiment in a nutshell: the patients are presented with a rapid sequence of different visual stimuli, some of which they are instructed to pay attention to (target stimuli) and some of which they have to ignore (non-target stimuli). The brain response is expected to differ between the two classes.
The dataset contains data for 13 subjects participating in 3 sessions each. A thorough benchmark is out of scope for this post, so we will limit the analysis to one session of one subject.
import contextlib
import io
import warnings
import numpy as np
from sklearn.preprocessing import LabelEncoder
from moabb import datasets
from moabb.paradigms import P300
dataset = datasets.Huebner2017(interval=[0,.99])
dataset.download()
paradigm = P300(
resample=SAMPLING_FREQUENCY,
baseline=None,
channels=EEG_CHANNELS
)
stdout = io.StringIO()
with contextlib.redirect_stdout(stdout), warnings.catch_warnings(record=True) as w:
X, y, metadata = paradigm.get_data(
dataset=dataset,
subjects=[1],
return_epochs=False,
return_raws=False,
cache_config=None,
postprocess_pipeline=None,
)
# Limit to one session
session = '0'
ids_mask_session = metadata.session == session
y_encoded = LabelEncoder().fit_transform(y)
X = X[ids_mask_session]
y = y_encoded[ids_mask_session]
# Print number of classes and number of samples per class
unique_classes, counts = np.unique(y, return_counts=True)
print(f"Number of classes: {len(unique_classes)}")
for cls, count in zip(unique_classes, counts):
print(f"Class '{cls}': {count} samples")
Number of classes: 2
Class '0': 3275 samples
Class '1': 1008 samples
Step: Explore the data
Let's take a look at the average brain activity for the target and non-target stimuli, for each of the channels. I have highlighted two periods where we find archetypical responses for this type of stimulus, the so-called N100 and P300: a (N)egative spike around 100 ms after the stimulus, and a (P)ositive spike around 300 ms after the stimulus. The N100 is seen predominantly in electrodes over the visual cortex, i.e., the occipital channels O1 and O2, whereas the P300 is commonly seen over central-parietal areas, closer to the top of the skull, i.e., Cz and Pz.
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
ids_target = y == 1
X_target = X[ids_target]
ids_nontarget = y == 0
X_nontarget = X[ids_nontarget]
X_target_avg = X_target.mean(axis=0)
X_nontarget_avg = X_nontarget.mean(axis=0)
num_channels = X_target_avg.shape[0]
# Create a figure with subplots in a square grid
num_cols = int(np.ceil(np.sqrt(num_channels)))
num_rows = int(np.ceil(num_channels / num_cols))
fig = make_subplots(rows=num_rows, cols=num_cols, subplot_titles=EEG_CHANNELS, shared_xaxes=True, shared_yaxes=True, vertical_spacing=0.05, horizontal_spacing=0.02)
time_vector: np.ndarray = np.arange(X_target_avg.shape[1])/SAMPLING_FREQUENCY
# Plot each channel in a separate subplot
for i in range(num_channels):
row: int = i // num_cols + 1
col: int = i % num_cols + 1
showlegend: bool = (i == 0)
fig.add_trace(
go.Scatter(
x=time_vector,
y=X_target_avg[i],
mode='lines',
name='Target' if showlegend else None,
line=dict(color='red'),
showlegend=showlegend
),
row=row, col=col
)
fig.add_trace(
go.Scatter(
x=time_vector,
y=X_nontarget_avg[i],
mode='lines',
name='NonTarget' if showlegend else None,
line=dict(color='blue'),
showlegend=showlegend
),
row=row, col=col
)
fig.add_vrect(
x0=60/SAMPLING_FREQUENCY, x1=80/SAMPLING_FREQUENCY,
fillcolor="#EFCB66", opacity=0.5,
layer="below", line_width=0,
row=row, col=col
)
fig.add_vrect(
x0=128/SAMPLING_FREQUENCY, x1=256/SAMPLING_FREQUENCY,
fillcolor="#90EE90", opacity=0.5,
layer="below", line_width=0,
row=row, col=col
)
if row == num_rows:
fig.update_xaxes(title_text="seconds after stimulus", row=row, col=col)
if col == 1:
fig.update_yaxes(title_text="µV", row=row, col=col)
# Update layout
y_axis_range: list[float] = [min(X_nontarget_avg.min(), X_target_avg.min()), max(X_nontarget_avg.max(), X_target_avg.max())]
fig.update_layout(
height=800,
width=1000,
showlegend=True
)
for row in range(1, num_rows + 1):
fig.update_yaxes(range=y_axis_range, row=row, col=1)
fig.show()
Step: Baseline pipeline
As a baseline, we will implement a pipeline that is commonly used with this type of data. It consists of a spatial filter, that is, a linear mix across all channels, whose weights are learned from the data such that the difference between the classes is maximized. This is the Xdawn part of the pipeline.
The spatial filters are followed by a standard LDA classifier.
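For intuition, this is roughly what applying a single spatial filter looks like: one weight per channel, collapsing the channel dimension into a single "virtual channel". The sketch below uses random weights; the actual Xdawn fit learns them from the class labels.

import numpy as np

# X has shape [n_epochs, n_channels, n_timesteps]; a spatial filter w
# produces x_filtered[t] = sum_c w[c] * x[c, t] for every epoch.
rng = np.random.default_rng(0)
X_example = rng.standard_normal((4, 9, 512))   # toy data
w = rng.standard_normal(9)                     # toy filter weights

X_filtered = np.einsum('c,ect->et', w, X_example)
print(X_filtered.shape)  # (4, 512)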
from sklearn.metrics import get_scorer
from pyriemann.estimation import Xdawn
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from mne.decoding import Vectorizer
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
baseline_pipeline = make_pipeline(
Xdawn(nfilter=1),
Vectorizer(),
LDA(
solver='lsqr',
shrinkage='auto'
)
)
# Fit the baseline pipeline
baseline_pipeline.fit(X_train, y_train)
# Predict on the test set
y_pred = baseline_pipeline.predict(X_test)
# Calculate the test score
test_score: float = get_scorer(paradigm.scoring)(baseline_pipeline, X_test, y_test)
print(f"Test ROC-AUC XDAWN+LDA: {test_score}")
Test ROC-AUC XDAWN+LDA: 0.9788115284974094
Not bad.
Step: Data preparation for the MOMENT torch model
import torch
from torch.utils.data import TensorDataset, DataLoader

N_INPUT_SAMPLES = 512
def prepare_data(X_train, X_test, y_train, y_test):
# Convert to torch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)
# pad the input to 512
X_train_tensor = torch.nn.functional.pad(X_train_tensor, (0, N_INPUT_SAMPLES - X_train_tensor.shape[-1]))
X_test_tensor = torch.nn.functional.pad(X_test_tensor, (0, N_INPUT_SAMPLES - X_test_tensor.shape[-1]))
input_mask: torch.Tensor = torch.cat([torch.ones(X_train.shape[-1], dtype=torch.bool), torch.zeros(N_INPUT_SAMPLES - X_train.shape[-1], dtype=torch.bool)])
input_mask = input_mask.unsqueeze(0).repeat(X_train_tensor.shape[0], 1)
# Create TensorDataset
train_dataset = TensorDataset(X_train_tensor, y_train_tensor, input_mask[:X_train_tensor.shape[0]])
test_dataset = TensorDataset(X_test_tensor, y_test_tensor, input_mask[:X_test_tensor.shape[0]])
# Create DataLoader
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=False)
test_dataloader = DataLoader(test_dataset, batch_size=32, shuffle=False)
return train_dataloader, test_dataloader
train_dataloader, test_dataloader = prepare_data(X_train, X_test, y_train, y_test)
Just making sure that the conversion of data to torch tensors went OK
# Get tensors from dataloader
def collect_tensors(dataloader: DataLoader) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
X_list = []
y_list = []
mask_list = []
for X, y, mask in dataloader:
X_list.append(X)
y_list.append(y)
mask_list.append(mask)
return torch.cat(X_list), torch.cat(y_list), torch.cat(mask_list)
X_train_tensor, y_train_tensor, train_input_mask = collect_tensors(train_dataloader)
X_test_tensor, y_test_tensor, test_input_mask = collect_tensors(test_dataloader)
# Apply baseline_pipeline to X_train_tensor and X_test_tensor
baseline_pipeline.fit(X_train_tensor.cpu().numpy(), y_train_tensor.cpu().numpy())
# Predict on the test set
y_pred_tensor = baseline_pipeline.predict(X_test_tensor.cpu().numpy())
# Calculate the test score
test_score_tensor: float = get_scorer(paradigm.scoring)(baseline_pipeline, X_test_tensor.cpu().numpy(), y_test_tensor.cpu().numpy())
print(f"Test ROC-AUC XDAWN+LDA: {test_score_tensor}")
Test ROC-AUC XDAWN+LDA: 0.9788147668393783
Step: Calculate MOMENT embeddings
We calculate the embeddings for each channel individually. It’s important to note that this setup is inherently biased against MOMENT, since the model is not trained to learn relationships across channels. Because of this, we reshape the data so that each channel’s embedding is computed separately, and then concatenate these embeddings. Consequently, the baseline pipeline can exploit inter-channel information, while MOMENT cannot.
from tqdm import tqdm
from momentfm import MOMENTPipeline
def get_embedding(model, dataloader):
embeddings, labels = [], []
with torch.no_grad():
for batch_x, batch_labels, batch_mask in tqdm(dataloader, total=len(dataloader)):
batch_size, n_channels, n_timesteps = batch_x.shape
batch_x = batch_x.reshape(batch_size * n_channels, 1, n_timesteps).to(TORCH_DEVICE)
batch_mask = batch_mask.to(TORCH_DEVICE)
output = model(x_enc=batch_x) # [batch_size * n_channels x emb_dim]
embedding = output.embeddings.cpu().reshape(batch_size, -1) # [batch_size x n_channels * emb_dim]
embeddings.append(embedding)
labels.append(batch_labels)
embeddings, labels = np.concatenate(embeddings), np.concatenate(labels)
return embeddings, labels
model = MOMENTPipeline.from_pretrained(
MODEL_NAME,
model_kwargs={
'task_name': 'embedding',
}
)
model.init()
model.to(TORCH_DEVICE)
train_embeddings, train_labels = get_embedding(model, train_dataloader)
test_embeddings, test_labels = get_embedding(model, test_dataloader)
print(train_embeddings.shape, train_labels.shape)
print(test_embeddings.shape, test_labels.shape)
(2998, 4608) (2998,)
(1285, 4608) (1285,)
Step: Downstream classification
We use an LDA classifier and a hyperparameter-optimized SVM.
from momentfm.models.statistical_classifiers import fit_svm
moment_lda_pipeline = make_pipeline(
LDA(solver='lsqr', shrinkage='auto')
)
moment_lda_pipeline.fit(train_embeddings, train_labels)
y_pred = moment_lda_pipeline.predict(test_embeddings)
test_score = get_scorer(paradigm.scoring)(moment_lda_pipeline, test_embeddings, test_labels)
print(f"Test ROC-AUC MOMENT+LDA: {test_score}")
Test ROC-AUC MOMENT+LDA: 0.609520725388601
moment_svm_pipeline = fit_svm(train_embeddings, train_labels)
y_pred = moment_svm_pipeline.predict(test_embeddings)
test_score = get_scorer(paradigm.scoring)(moment_svm_pipeline, test_embeddings, test_labels)
print(f"Test ROC-AUC MOMENT+SVM: {test_score}")
Test ROC-AUC MOMENT+SVM: 0.549556347150259
Well, that is not very good. I suspect it has to do with the dimensionality of the embeddings (number of channels × embedding dimension, i.e., 9 × 512 = 4608 here). Let me try once again, but this time applying PCA before the LDA classifier to reduce the number of input dimensions by 90%.
from sklearn.decomposition import PCA
moment_lda_pipeline = make_pipeline(
PCA(n_components=int(train_embeddings.shape[-1] * 0.1)),
LDA(solver='lsqr', shrinkage='auto')
)
moment_lda_pipeline.fit(train_embeddings, train_labels)
y_pred = moment_lda_pipeline.predict(test_embeddings)
test_score = get_scorer(paradigm.scoring)(moment_lda_pipeline, test_embeddings, test_labels)
print(f"Test ROC-AUC MOMENT+PCA+LDA: {test_score}")
Test ROC-AUC MOMENT+PCA+LDA: 0.6235427461139896
It improved, although not as much as I would have expected.
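To check whether dimensionality is really the bottleneck, a quick diagnostic (not part of the pipeline above) is to look at how much of the embedding variance the retained PCA components keep:

# Cumulative variance explained by the PCA step of the fitted pipeline
pca = moment_lda_pipeline.named_steps['pca']
explained = pca.explained_variance_ratio_.cumsum()
print(f"Variance kept by {pca.n_components_} components: {explained[-1]:.3f}")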
Step: Dimensionality reduction before embeddings
I want to try one last thing. Since MOMENT is still not built to merge information across channels, we will help it a bit by spatially filtering the data before calculating the embeddings. This should substantially increase the classification score.
x_dawn = Xdawn(nfilter=1)
X_train_filtered = x_dawn.fit_transform(X_train, y_train)
X_test_filtered = x_dawn.transform(X_test)
train_dataloader_filtered, test_dataloader_filtered = prepare_data(X_train_filtered, X_test_filtered, y_train, y_test)
train_embeddings_filtered, train_labels_filtered = get_embedding(model, train_dataloader_filtered)
test_embeddings_filtered, test_labels_filtered = get_embedding(model, test_dataloader_filtered)
moment_lda_pipeline = make_pipeline(
LDA(solver='lsqr', shrinkage='auto')
)
moment_lda_pipeline.fit(train_embeddings_filtered, train_labels_filtered)
y_pred = moment_lda_pipeline.predict(test_embeddings_filtered)
test_score = get_scorer(paradigm.scoring)(moment_lda_pipeline, test_embeddings_filtered, test_labels_filtered)
print(f"Test ROC-AUC Xdawn+MOMENT+LDA: {test_score}")
Test ROC-AUC Xdawn+MOMENT+LDA: 0.8285330310880828
Aha! The assumption of inter-channel independence in MOMENT appears to be a significant limitation. While we’re still not reaching the baseline’s ~0.98 ROC-AUC, we’ve seen a clear improvement from around 0.62 to 0.83 by spatially mixing the channels and, thus, reducing the channel count before calculating embeddings.
Remaining considerations for this exercise:
- Cross-subject, cross-session: I still wonder how well these embeddings would work if we merged the data across different subjects. This is still a challenging topic in the BCI community. In our case, I suspect that adding cross-subject data could even improve performance, since we seem to still be under the curse of dimensionality (a rough sketch of how to pull multi-subject data follows after this list).
- Generalizability to other experiments: It would be interesting to test whether this pipeline can adapt to different types of experiments, for instance, scenarios where the discriminative features are not specific amplitude peaks (like the N100 or P300) but other characteristics, such as power increases within a specific frequency band over longer periods of time, as in Motor Imagery experiments. This is actually one of the main selling points of foundation models: decent zero-shot performance across tasks, so it would make sense to try it out.
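For reference, pulling multi-subject data with MOABB only requires passing more subject ids to the same paradigm.get_data call used above. A sketch (not run here):

# Hypothetical cross-subject extension: request several subjects at once
# and keep track of which epochs belong to whom via metadata.subject.
X_all, y_all, metadata_all = paradigm.get_data(
    dataset=dataset,
    subjects=[1, 2, 3],   # any subset of the 13 subjects
    return_epochs=False,
)
print(metadata_all.subject.value_counts())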
I plan to explore this further in the future, but I’ll leave it here for now.
TLDR;
- This year there has been a surge in general-purpose foundation models for time series, alongside the emergence of domain-specific models from companies like [Piramidal](https://piramidal.ai/), a YC-backed company that raised $6.5M to build a foundation model specifically for electroencephalographic (EEG) data.
- This post is a personal exploration exercise to see how these general-purpose models perform on data for which domain-specific models are being built. It is not intended as a thorough benchmark.
- For this, I chose MOMENT, one of the general-purpose foundation models released this year, and tested its zero-shot performance in a common brain-computer interface classification task.
- The results are mixed: MOMENT did not reach the domain-specific baseline. But to be fair, the baseline is already very high.
- This is probably due to limitations in how multi-channel data is processed in MOMENT. If we factor this out, the classification performance increases from around 0.6 to 0.8. Not bad 🙂
- The code can be found here
- My takeaway is that the model is very promising, although the way multi-channel data is handled still needs work.
I’m still itching to find out how MOMENT would handle other types of BCI experiments that tap into different brain responses—like Motor Imagery instead of Evoked Potentials. But I will leave this one for another post 👋🏽
Sources
Foundation Models for Time Series Analysis: A Tutorial and Survey
Exploring the Latest Advances in Foundation Time-Series Models
Introducing Toto: A state-of-the-art time series foundation model by Datadog
Deep Time Series Models: A Comprehensive Survey and Benchmark