Synthefy MUSEval: The Largest Multivariate Evaluation Benchmark for Time Series Foundation Models
“Forecast demand for yellow sofas in central Austin using promotion information, internet traffic to my store, and competitor prices, given that I know the demand for beige couches in Dallas.” — Retail Data Scientist
“In a data center, predict failures on node 1 using compute load, inlet temperature, fan speed, and peer-node alerts.” — Site Reliability Engineering Team Lead
“In a refinery, assess the efficiency of my machinery using temperature and vibration from other assets.” — Reliability Engineer
These are multivariate time series problems: situations where many related signals evolve together over time. In real-world systems, one signal rarely changes alone. Prices depend on promotions, energy demand depends on the weather, and server performance depends on peer nodes. Leveraging these related signals, or variates, to predict a target signal (couch demand, node failures, machine efficiency) yields significant improvements in accuracy.
Yet many time series models today treat each signal in isolation, forecasting a target without considering the variates that may influence it. That’s like predicting traffic at one intersection without looking at nearby roads: you can capture some general patterns, but you miss vital information.
TL;DR — MUSEval is the first large-scale benchmark (45 datasets, 19B points, 16 domains) built to measure multivariate gain — how much better models get when given related signals. Goal: Enable researchers and practitioners to evaluate, compare, and build the next generation of multivariate TSFMs that can reason across correlated signals in the real world.

Figure 1. The Multivariate Forecasting Problem. An ideal forecasting model should learn to forecast Yellow Couch Sales not just from its own past, but from related signals such as prices, promotions, competitor pricing, and sales of similar products (e.g., Red Couch Sales). This is the essence of multivariate forecasting: using multiple correlated time series to make more accurate, context-aware predictions.
There is currently no standardized way to measure multivariate performance. Benchmarks like GIFT-Eval and FEV-Bench evaluate many datasets but do not isolate multivariate gain: the improvement that comes specifically from adding related variates.
If we want models that truly reason over context — combining signals from multiple sensors, markets, or regions — we must be able to track this progress. MUSEval does exactly that.
It creates a shared testbed to answer one question:
Can current time series foundation models actually learn from multiple correlated signals — or do they fall back to forecasting each signal alone?
Why you should care:
- Multivariate forecasting is business-critical.
- Current TSFMs cannot use multivariate data effectively.
- No benchmark tracks multivariate performance.
- MUSEval is the first and largest benchmark focused entirely on this.
- It enables data science teams to train the next generation of multivariate foundation models (coming soon from Synthefy).
The Problem: Current TSFMs Are Not Multivariate
We now have several time series foundation models (TSFMs): large models, pre-trained on massive amounts of data, that predict future values of a time series from its past. Crucially, they can be applied zero-shot, solving new prediction problems without any task-specific training. Zero-shot forecasting is powerful because it lets a model generalize across industries and data types.
However, when we tested whether these TSFMs can actually exploit multivariate data, we found that adding variates provided no benefit. We evaluated four of the best-known TSFMs: Google TimesFM 2.5, Amazon Mitra, Datadog Toto, and PriorLabs TabPFN.
We compared their forecasting accuracy on 45 different multivariate datasets across 16 real-world and synthetic multivariate domains, including:
- Energy: Forecasting energy demand by adding weather conditions.
- Sales: Forecasting product sales by adding product prices and promotions.
- Wikipedia: Forecasting Wikipedia trends based on the trends of related Wikipedia pages.
An ideal TSFM should perform better when given extra context — for example, forecasting electricity demand more accurately when weather data is added.
Surprisingly, none of these models improved. In some cases, performance even declined. That means current foundation models are still treating each signal independently — ignoring valuable relationships that exist in the data.

Figure 3. Existing state-of-the-art (SOTA) multivariate TSFMs fail to effectively use correlated variates. We compare the univariate and multivariate performance of 4 SOTA models (Google TimesFM 2.5, Amazon Mitra, Datadog Toto, and PriorLabs TabPFN) on 14 diverse multivariate datasets. Specifically, we plot the logarithm of the inverse of the Mean Absolute Percentage Error (MAPE), so higher is better. Surprisingly, providing additional correlated information (the multivariate case) leads to similar or worse performance (TabPFN), with only marginal improvements observed in very few cases.
The Gap: No Benchmark Focused on Multivariate Evaluation
Public benchmarks are vital for improving the capabilities of TSFMs, yet none of the existing ones are designed to evaluate multivariate TSFMs.
Salesforce GIFT-Eval is the main benchmark used to evaluate zero-shot TSFMs. It reports cross-domain generalization scores. While some datasets are multivariate, the focus is on breadth: how well a model works out-of-the-box on unseen time series tasks.
Amazon FEV-Bench defines 100 standardized forecasting tasks with rolling windows, horizons, and skill scores. It emphasizes fairness across horizons and evaluation protocols, making model comparisons reproducible.
Limitations: Neither benchmark tells an end user or researcher how much a model benefits from multivariate information. A model could rank #1 on these benchmarks yet still ignore all contextual variables; indeed, many of the models we evaluated are top performers on them. In practice, a model might appear state-of-the-art on overall error yet gain nothing from external signals such as weather, prices, or peer sensors.
Together, GIFT-Eval and FEV-Bench tell us how broadly a model can forecast — but not why it succeeds or fails when multiple correlated signals are present.
That's the gap MUSEval fills: it isolates multivariate gain as the primary metric of progress.
Introducing MUSEval: The Largest Benchmark for Tracking Zero-Shot Multivariate TSFM Performance
MUSEval (Multivariate Synthesis Eval) is a public benchmark designed to track the ability of TSFMs to use multiple signals together.
What MUSEval Measures
MUSEval tracks zero-shot multivariate gain through three simple metrics based on the Mean Absolute Percentage Error (MAPE), a normalized metric that allows comparison across different scales:
- Univariate MAPE — forecasting error using only the signal's own past.
- Multivariate MAPE — forecasting error using all related signals.
- Δ (Uni − Multi) — the improvement (or regression) from adding multivariate data (see the sketch below).
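To make these metrics concrete, here is a minimal sketch of how the three numbers could be computed for a single forecast window. The forecast arrays are hypothetical stand-ins; only the MAPE and Δ definitions come from the benchmark description above.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Guard against division by zero on flat or zero-valued targets.
    eps = np.finfo(float).eps
    return float(np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), eps))) * 100)

# Hypothetical forecasts for the same target window:
# `uni_pred` uses only the target's own history,
# `multi_pred` also conditions on related variates.
y_true = np.array([100.0, 102.0, 98.0, 105.0])
uni_pred = np.array([97.0, 101.0, 104.0, 110.0])
multi_pred = np.array([99.0, 101.5, 99.0, 106.0])

uni_mape = mape(y_true, uni_pred)
multi_mape = mape(y_true, multi_pred)
delta = uni_mape - multi_mape  # positive => the added variates helped

print(f"Univariate MAPE:   {uni_mape:.2f}%")
print(f"Multivariate MAPE: {multi_mape:.2f}%")
print(f"Δ (Uni − Multi):   {delta:.2f} pp")
```

An ideal multivariate model should show Δ well above zero on datasets where the variates genuinely carry predictive signal.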
What’s Inside
MUSEval spans synthetic, derived, real-world, and combined datasets — over 19 billion points across 45 datasets and 16 domains.

By spanning both real and synthetic data, MUSEval can test whether models learn to use multivariate information, and whether that skill transfers across domains. Because synthetic and derived variates are strongly correlated with the target by construction, they isolate a model's multivariate capability from the question of how much signal the context actually carries. Real-world and collections data, by contrast, assess the model's ability to leverage context in everyday use cases.

Figure 4. Clear lagged relationships using synthetic data. With synthetic data, we can construct a series where the target signal (top) is predicted by a lagged variate (middle). The first pair of (green-to-green) boxes shows how a pattern in the variate is noisily reflected in the target; the second (green-to-red) pair marks a motif in the variate that should likewise appear in the target. Despite this, modern time series foundation models cannot predict the upcoming spike that is clearly visible in the variate, highlighting a major limitation in multivariate TSFM capabilities.
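As an illustration of the kind of construction Figure 4 describes (our own sketch, not MUSEval's actual generator), the snippet below builds a target that is a noisy, lagged copy of a variate, so a spike in the variate deterministically predicts a spike in the target `lag` steps later:

```python
import numpy as np

rng = np.random.default_rng(0)
T, lag = 500, 24

# Latent driving series: smoothed noise with an injected spike motif.
v = np.convolve(rng.normal(size=T + lag), np.ones(10) / 10, mode="same")
v[300:310] += 3.0  # the motif that should reappear in the target

# Observed series, aligned so the variate *leads* the target by `lag` steps:
# target[t] is a noisy copy of variate[t - lag], so the spike appears in the
# variate `lag` steps before it appears in the target.
target = 0.8 * v[:-lag] + rng.normal(scale=0.2, size=T)
variate = v[lag:]
```

A model given `variate` as context has everything it needs to forecast the spike in `target`; Figure 4 shows that current TSFMs still miss it.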
How Big Is MUSEval vs. Existing Benchmarks?

MUSEval is orders of magnitude larger, enabling evaluation across a far greater diversity of multivariate relationships. It is the first benchmark to treat multivariate gain as a first-class metric, and the first to combine data with known multivariate correlations (synthetic and derived) with realistic use cases (traditional and collections) to isolate multivariate gain in TSFM evaluation.
How to Use MUSEval
If you’re building or evaluating a time series foundation model:
- Run your model on MUSEval datasets.
- Report both univariate and multivariate errors (MAPE); see the evaluation sketch below.
- Publish your results to help the community track progress.
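Below is one possible shape of that evaluation loop. The Hugging Face `load_dataset` call is real, but the split, field names (`target`, `variates`, `future_target`), and the `my_model` object are illustrative assumptions; consult the dataset card for the actual schema. `mape` is the helper defined earlier.

```python
from datasets import load_dataset  # pip install datasets

# Field names, split, and model interface below are illustrative assumptions;
# check huggingface.co/datasets/Synthefy/Museval for the real schema.
ds = load_dataset("Synthefy/Museval", split="test")

scores = []
for example in ds:
    history = example["target"]        # hypothetical: past values of the target
    variates = example["variates"]     # hypothetical: related signals
    future = example["future_target"]  # hypothetical: ground-truth horizon

    uni_pred = my_model.forecast(history)                      # univariate run
    multi_pred = my_model.forecast(history, context=variates)  # multivariate run

    scores.append((mape(future, uni_pred), mape(future, multi_pred)))

uni, multi = (sum(col) / len(col) for col in zip(*scores))
print(f"Univariate MAPE {uni:.2f}% | Multivariate MAPE {multi:.2f}% | Δ {uni - multi:.2f} pp")
```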
Resources:
- GitHub Repository
- Hugging Face Datasets: Synthefy/Museval
- MUSEval Leaderboard
- Paper (Coming Soon).
Call to Action
The question driving MUSEval is simple:
Can models learn to forecast with context — not in isolation?
We invite researchers, engineers, and businesses to join us.
- Run your TSFM on MUSEval.
- Publish your univariate vs. multivariate gaps.
- Contribute datasets with multivariate information.
Together, we can track — and accelerate — the path to the world's first truly multivariate time series foundation model.
Originally published on Medium