TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models
Key Summary
- TSRBench is a giant test that checks if AI models can understand and reason about data that changes over time, like heartbeats, stock prices, and weather.
- It includes 4 big skill areas—Perception, Reasoning, Prediction, and Decision-Making—spread across 15 tasks and 14 real-life domains.
- The authors tested more than 30 top models (text-only, vision-only, and multimodal) on 4,125 problems and scored them with accuracy.
- Models do well at noticing patterns (Perception) but struggle with thinking through complex steps (Reasoning) and especially with forecasting numbers (Prediction).
- Making models bigger helps most skills (Perception, Reasoning, Decision-Making) but not Prediction, where size doesn’t fix forecasting errors.
- Text (numbers) and pictures (plots) of time series each help on different problems, but today’s multimodal models don’t combine them well yet.
- The best overall model (GPT-5 with text+vision) scored 55.6%, while the best open-source models reached around 42–45%.
- Extra tools (like automatic statistics) and more careful thinking at inference time give small-to-moderate gains, but big gaps remain in forecasting and quantitative decisions.
- TSRBench gives a fair, standard way to see what AIs can and can’t do with time series, guiding better model design and training.
- These results matter for real-life decisions in finance, healthcare, traffic, energy, disaster response, and more.
Why This Research Matters
So many real-world choices depend on reading the story told by time: patient health, stock movements, traffic flow, energy demand, and storm warnings. TSRBench shows clearly what today’s AI can and cannot do with those stories, so we don’t overtrust models in high-stakes situations. It highlights that seeing patterns isn’t the same as predicting the future, guiding us to train models differently for forecasting. It also shows that text and plots each add unique clues, pushing research toward true multimodal fusion. Finally, it gives a common scoreboard for progress, so improvements are measurable, comparable, and focused on real needs.
Detailed Explanation
01 Background & Problem Definition
🍞 Top Bread (Hook) You know how a fitness watch shows your heart rate over time, and a weather app shows temperatures across days? Those squiggly lines are telling a story that changes step by step.
🥬 Filling (The Actual Concept)
- What it is: Time series are lists of measurements taken in order (like every second, hour, or day) that let us see how things change.
- How it works (recipe):
- Measure something again and again (heart rate, price, wind speed).
- Line up the measurements by time to see patterns.
- Use patterns to explain what happened and decide what to do next.
- Why it matters: Without time order, you can’t tell cause from effect, spot cycles, or make smart future guesses.
🍞 Bottom Bread (Anchor) A hospital monitor tracks a patient’s heartbeats second by second; the ups and downs help doctors decide if treatment is working right now.
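To make the recipe above concrete, here is a minimal Python sketch (all numbers are invented for illustration) that builds a toy daily series with a slow upward trend, a weekly cycle, and noise, then shows how lining values up by time reveals the pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(90)                           # 90 daily measurements, in time order
trend = 0.05 * days                            # slow upward drift
weekly = 2.0 * np.sin(2 * np.pi * days / 7)    # repeats every 7 days (a cycle)
noise = rng.normal(scale=0.5, size=days.size)  # random wiggles
series = 100 + trend + weekly + noise          # e.g., a price-like signal

# Because values are lined up by time, simple summaries reveal the story:
print("first-week mean:", round(series[:7].mean(), 2))
print("last-week mean: ", round(series[-7:].mean(), 2))       # higher -> upward trend
print("average day-over-day change:", round(np.diff(series).mean(), 3))
```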
The World Before: Before TSRBench, AI testing was mostly about language questions, pictures, or single-number predictions. Some time series tests existed, but they often treated the data like plain number lists. That means the story (the cause-effect and context) was lost. Other benchmarks added a little context but mostly checked if models could spot simple patterns. Yet real life needs models to explain why something changed, what might happen next, and which action to take.
The Problem: AI “generalist” models (LLMs/VLMs) are supposed to be useful across many areas—finance, health, traffic, energy. But there was no unified, fair test to check whether they can truly reason with time series: to understand patterns (Perception), think through logic and causes (Reasoning), predict the future (Prediction), and choose actions (Decision-Making), across many domains and in multiple input forms (text of numbers and plotted images).
Failed Attempts:
- Forecasting-only suites: great for predicting future numbers but miss why a change happens or what decision to make.
- Perception-focused exams with synthetic data: fine for spotting trends or cycles, but don’t stretch deeper reasoning or decisions.
- Narrow-domain reasoning sets: test logic in, say, finance or weather, but don’t generalize across health, industry, and science.
The Gap: We needed a complete, multi-domain, multi-task, multi-modal benchmark that:
- Tests the full skill chain: Perception → Reasoning → Prediction → Decision-Making.
- Uses both text (raw numbers) and plots (visual cues), and even both together.
- Supplies clear ground truth from code/simulations or carefully aligned real data.
Real Stakes:
- Finance: Misreading a spike can waste millions or trigger risky trades.
- Healthcare: Wrong ECG reasoning can delay urgent care.
- Disaster response: Missing early signals can worsen floods or wildfires.
- Energy and traffic: Poor predictions cause blackouts or gridlock.
- Education and science: Students and researchers need tools that reason over experiments and measurements correctly.
🍞 Top Bread (Hook) Imagine testing a soccer player only on sprint speed and ignoring passing or strategy—you’d miss what matters in a match.
🥬 Filling (The Actual Concept)
- What it is: A benchmark is a fair, shared test that measures what AI can really do.
- How it works:
- Gather diverse, real tasks and data.
- Define clear questions with correct answers.
- Run many models the same way and compare scores.
- Why it matters: Without a fair test, we can’t improve models or trust them with real decisions.
🍞 Bottom Bread (Anchor) If two calculators claim to be “best,” a math test with answer keys lets everyone see which one is actually more accurate.
TSRBench’s answer is to be that fair, shared test—covering 4 big abilities, 15 tasks, 14 domains, 4,125 problems, and multiple input types. It checks not only if models see a pattern, but if they can explain it, predict what’s next, and choose what to do—just like people must in real life.
02 Core Idea
🍞 Top Bread (Hook) Imagine reading a mystery: first you notice clues (Perception), then you connect them (Reasoning), guess what happens next (Prediction), and decide who to arrest (Decision-Making).
🥬 Filling (The Actual Concept)
- One-sentence insight: TSRBench is a single, comprehensive test that stress-tests every main step of time series thinking—seeing, reasoning, predicting, and deciding—across many domains and modalities, so we learn where generalist models truly succeed or fail.
Multiple Analogies:
- Doctor visit: check vital signs (Perception), diagnose cause (Reasoning), forecast recovery (Prediction), choose treatment (Decision-Making).
- Weather planning: read charts (Perception), link fronts and pressure (Reasoning), predict rain (Prediction), pack an umbrella or reschedule (Decision-Making).
- Coach strategy: analyze stats (Perception), explain momentum shifts (Reasoning), anticipate plays (Prediction), pick formations (Decision-Making).
Before vs After:
- Before: Scattered tests—some on spotting patterns, others on forecasting—but little that measures the full chain with context and across domains.
- After: A unified scoreboard that shows strengths (e.g., seeing patterns) and weaknesses (e.g., numeric forecasting) for many model types, plus where text vs plots help.
Why It Works (intuition):
- Real problems are pipelines: notice → think → guess next → act. If you only test one step, you can mask weaknesses in the others. TSRBench forces models to pass through each step so hidden gaps show up.
- Multi-modal inputs (numbers and plots) capture different clues. If a model leans too hard on one, TSRBench reveals the blind spot.
- Verifiable answers (simulations, aligned events) keep the test honest and repeatable.
Building Blocks, with Sandwich Explanations:
🍞 Top Bread (Hook) You know how your eyes spot lines and shapes before your brain decides what they mean?
🥬 Perception (What/How/Why)
- What it is: Perception is noticing the “what it looks like” stuff—trends, cycles, noise, and outliers—in a time series.
- How it works:
- Scan the series for trend (up/down), seasonality (repeats), and noise (wiggles).
- Flag weird parts (anomalies) and compare two series for similarity.
- Summarize key stats so later steps have clean inputs.
- Why it matters: If you mis-see the pattern, all later reasoning, predictions, and decisions get shaky.
🍞 Bottom Bread (Anchor) A stock that bounces up and down every week (seasonality) but slowly climbs (trend) tells a different story than a flat, noisy series.
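As a rough illustration of these perception steps (a sketch with arbitrary thresholds, not the benchmark's tooling), trend direction, a candidate period, and simple outliers can be estimated with a few NumPy operations:

```python
import numpy as np

def perceive(series: np.ndarray) -> dict:
    t = np.arange(series.size)
    slope = np.polyfit(t, series, 1)[0]                  # trend: sign of the fitted slope
    centered = series - series.mean()
    acf = np.correlate(centered, centered, "full")[series.size - 1:]
    acf = acf / (acf[0] + 1e-12)                         # normalized autocorrelation
    period = int(np.argmax(acf[2:]) + 2)                 # crude seasonality guess (lag >= 2)
    z = np.abs(centered) / (series.std() + 1e-9)
    return {
        "trend": "up" if slope > 0 else "down",
        "candidate_period": period,
        "anomaly_indices": np.where(z > 3)[0].tolist(),  # simple 3-sigma outliers
    }
```

A multiple-choice question like "Is this series trending up?" could then be checked against perceive(series)["trend"].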
🍞 Top Bread (Hook) Imagine linking clues in order—like, “the bell rang, then students left,” so the bell likely caused the exit.
🥬 Reasoning (What/How/Why)
- What it is: Reasoning connects patterns to causes, rules, and logical conclusions.
- How it works:
- Temporal and causal links (did A lead to B?).
- Abduction: guess the hidden event that best explains a change.
- Deduction: apply rules precisely (e.g., physics system).
- Induction: infer general rules (like periodicity) from examples.
- Numerical reasoning: compute domain-aware quantities.
- Why it matters: Without solid reasoning, a model may see patterns but can’t explain or trustably use them.
🍞 Bottom Bread (Anchor) Detecting a flipped signal vs a cutoff anomaly needs logic, not just eyeballing.
🍞 Top Bread (Hook) When you see clouds and falling pressure, you guess rain tomorrow—that’s a prediction.
🥬 Prediction (What/How/Why)
- What it is: Prediction estimates future values or events based on history and context.
- How it works:
- Read history plus any event text (like policy news).
- Weigh how events might push the series.
- Choose the most plausible future curve or event (multiple-choice here).
- Why it matters: Planning needs a peek at tomorrow; bad forecasts waste money or risk safety.
🍞 Bottom Bread (Anchor) If a central bank hikes rates, a model should reason how that might nudge stock trends—not just guess randomly.
🍞 Top Bread (Hook) Choosing between two roads after checking traffic is decision-making in action.
🥬 Decision-Making (What/How/Why)
- What it is: Turning what you saw and predicted into a concrete choice (qualitative or quantitative).
- How it works:
- Qualitative: pick the right clinical action from ECG signals.
- Quantitative: simulate strategies (like trading rules) and pick the one with the best metric (e.g., lowest drawdown), as in the sketch just below.
- Why it matters: The goal isn’t just to know—but to do the right thing.
🍞 Bottom Bread (Anchor) An ECG that screams emergency should lead to urgent care—not a casual recheck next week.
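Here is a hedged sketch of the quantitative side (the equity curves are invented and the comparison is deliberately simple, not the paper's backtesting code): compute each strategy's maximum drawdown and pick the one that loses least from its peak:

```python
import numpy as np

def max_drawdown(equity: np.ndarray) -> float:
    """Largest peak-to-trough drop, as a fraction of the running peak."""
    running_peak = np.maximum.accumulate(equity)
    return float(((running_peak - equity) / running_peak).max())

# Invented equity curves for two candidate strategies (illustration only).
strategies = {
    "momentum":       np.array([100, 103, 108, 104,  96, 101, 110.0]),
    "mean_reversion": np.array([100,  99, 101, 100, 102, 101, 103.0]),
}
scores = {name: max_drawdown(curve) for name, curve in strategies.items()}
best = min(scores, key=scores.get)               # lowest maximum drawdown wins
print(scores, "->", best)
```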
🍞 Top Bread (Hook) Sometimes reading numbers is like reading tiny letters; a picture (plot) can make it clear.
🥬 Multi-modal Learning (What/How/Why)
- What it is: Learning from both text (raw numbers) and images (plots) together.
- How it works:
- Convert series to text for LLMs and plots for VLMs.
- Feed text+vision to multimodal models.
- Compare how each view helps; try fusing both.
- Why it matters: Numbers show exact values; plots show shapes. Together should be stronger—if models can fuse them well.
🍞 Bottom Bread (Anchor) Reading a table of ECG values is precise; seeing 12-lead waveforms shows patterns at a glance. The best doctor uses both.
03 Methodology
High-level Recipe: Input → Encode Time Series (Text and/or Plot) → Ask Task-specific Question → Model Answers (Multiple Choice) → Score Accuracy → Analyze Patterns and Ablations
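In code, that recipe might look roughly like the sketch below; `query_model`, the prompt wording, and the record fields are placeholders rather than TSRBench's actual harness:

```python
import re

def build_prompt(series_text: str, question: str, options: list[str]) -> str:
    letters = "ABCD"
    lines = [f"Time series: {series_text}", f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def parse_choice(reply: str) -> str | None:
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else None

def evaluate(items, query_model) -> float:
    """items: dicts with 'series_text', 'question', 'options', and gold 'answer'."""
    correct = sum(
        parse_choice(query_model(build_prompt(it["series_text"], it["question"], it["options"])))
        == it["answer"]
        for it in items
    )
    return correct / len(items)        # accuracy over the multiple-choice items
```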
Step-by-step Details (with reasons and examples):
- Inputs and Modalities
- What happens: Each problem includes a time series (or several), sometimes with extra context text (like a news blurb), and a question with multiple choices.
- Why this step exists: Real systems consume numbers and visuals; testing across text, vision, and text+vision shows what each model can handle.
- Example: For finance forecasting, the history is numeric; context is an economic report; choices are 35-day future price paths.
- Encoding for Different Model Types
- LLMs (text): The series becomes a space/comma-separated list of numbers with time labels.
- VLMs (vision): The series is plotted as standardized images (resolution tuned via ablation, 100 PPI chosen for balance).
- Multimodal (T+V): Provide both the text sequence and the plot (a short encoding sketch follows this list).
- TS-LLMs (embeddings): A projector maps series into embeddings.
- Why it matters: Fairness—every model gets its best-suited input form.
- Example: An ECG question: LLM sees normalized leads as lists; VLM sees 12 plotted leads; T+V gets both.
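A minimal sketch of the text and plot encodings described in this list (the label format, figure size, and file name are illustrative choices, not the benchmark's exact settings):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # render to a file, no display needed
import matplotlib.pyplot as plt

values = np.round(np.random.default_rng(1).normal(100, 2, size=48), 2)

# Text encoding for LLMs: comma-separated values with time labels.
text_input = ", ".join(f"t={t}: {v}" for t, v in enumerate(values))

# Vision encoding for VLMs: a standardized plot saved at roughly 100 PPI.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(values)
ax.set_xlabel("time step")
ax.set_ylabel("value")
fig.savefig("series.png", dpi=100)         # resolution choice echoing the ablation above
plt.close(fig)

# A text+vision (T+V) model would receive both text_input and series.png.
```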
- Tasks across 4 Dimensions (15 total)
- Perception: Pattern Analysis, Noise Understanding, Anomaly Detection, Similarity Analysis.
- Why: Cleanly seeing the series is the foundation.
- Example: “Is this series stationary?” or “Which two series share the same trend direction?”
- Reasoning: Etiological Reasoning, Causal Discovery, Abductive, Temporal Relation, Numerical, Deductive, Inductive.
- Why: Real understanding requires logic, causality, and math in context.
- Example: “Which activity created this accelerometer pattern?” or “Which Lorenz trajectory matches exact rules?”
- Prediction: Time Series Forecasting (multiple-choice future paths), Event Prediction (will X happen?).
- Why: Planning needs tomorrow’s guess, but numeric precision is hard for generalist LLMs—so options reduce raw regression difficulty.
- Example: “Pick the most plausible 140-day S&P 500 path, given today’s policy news.”
- Decision-Making: Qualitative (e.g., ECG-based clinical choice) and Quantitative (e.g., best trading strategy by metric).
- Why: The end goal is action.
- Example: “Which strategy yields the best maximum drawdown?”
- Ground Truth and Verification
- What happens: Answers come from either (a) exact code/simulations (e.g., physics systems, backtests), or (b) rule-based extraction from aligned data/text.
- Why this step exists: Removes ambiguity and ensures reproducibility.
- Example: Deductive (Lorenz attractor) uses precise numerical integration; QuantDM uses deterministic backtesting rules.
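For example, a verifiable Lorenz trajectory can be produced by precise numerical integration, roughly as in this sketch (standard Lorenz parameters; the time grid and tolerances are arbitrary choices):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.linspace(0.0, 5.0, 500)
sol = solve_ivp(lorenz, (0.0, 5.0), y0=[1.0, 1.0, 1.0],
                t_eval=t_eval, rtol=1e-9, atol=1e-9)
trajectory = sol.y.T     # shape (500, 3): a verifiable ground-truth answer key
```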
- Scoring
- What happens: Primary metric is accuracy (correct choice among options). Scores are reported per task and per dimension, then aggregated.
- Why it matters: Simple, consistent comparison across many models and tasks.
- Example: GPT-5 (T+V) achieves 55.6% overall; top open-source VLM ~44.9%; top open-source LLM ~42.4%.
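Building on per-item correctness, per-task and per-dimension accuracies can be aggregated along these lines (the record fields are assumptions about the result format, not an official schema):

```python
from collections import defaultdict

def aggregate(results):
    """results: dicts with 'task', 'dimension', and a boolean 'correct' flag."""
    results = list(results)
    by_task, by_dim = defaultdict(list), defaultdict(list)
    for r in results:
        by_task[r["task"]].append(r["correct"])
        by_dim[r["dimension"]].append(r["correct"])
    acc = lambda flags: sum(flags) / len(flags)
    per_task = {task: acc(flags) for task, flags in by_task.items()}
    per_dim = {dim: acc(flags) for dim, flags in by_dim.items()}
    overall = acc([r["correct"] for r in results])
    return per_task, per_dim, overall
```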
- Model Lineup and Uniform Conditions
- What happens: 6 proprietary models (e.g., GPT-5), 12 open-source LLMs (e.g., Qwen3), 13 open-source VLMs (e.g., InternVL3.5), plus TS-LLMs.
- Why it matters: Broad coverage ensures trends aren’t family-specific.
- Example: Compare Qwen3-VL-32B vs Qwen2.5-72B to see text vs vision strengths.
- Ablations and Deep Dives (the Secret Sauce)
- Visual Resolution Study: 10, 50, 100, 200, 400 PPI; best is usually 100 PPI—low PPI hides details; high PPI adds clutter.
- Example: At low PPI, small ECG anomalies disappear; at very high PPI, models miss the forest for the trees.
- Tool-Augmented Reasoning: Provide extra computed stats (means, peaks, change points) alongside inputs; a sketch of such features follows this list.
- Why: Helps models that struggle to extract numerical structure.
- Finding: Small overall gains; task-dependent improvements.
- Inference-Time Scaling: Let models “think longer” (reasoning mode vs non-reasoning).
- Why: Hard tasks need step-by-step thinking.
- Finding: Big lifts on Reasoning/Decision/Prediction; Perception is less sensitive.
- Cross-Modal Complementarity: Compare text-only vs vision-only vs both.
- Why: Numbers and pictures capture different clues; union solves more items than either alone.
- Finding: Today’s models rarely fuse both to beat the union—indicating poor fusion.
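A sketch of the kind of precomputed statistics a tool-augmented prompt might attach (feature names, the change-point heuristic, and thresholds are illustrative, and a reasonably long series is assumed):

```python
import numpy as np
from scipy.signal import find_peaks

def tool_features(series: np.ndarray) -> dict:
    peaks, _ = find_peaks(series)                          # indices of local maxima
    window = max(3, series.size // 10)
    rolling = np.convolve(series, np.ones(window) / window, mode="valid")
    change_point = int(np.argmax(np.abs(np.diff(rolling)))) + window // 2
    return {
        "mean": float(series.mean()),
        "std": float(series.std()),
        "min": float(series.min()),
        "max": float(series.max()),
        "num_peaks": int(peaks.size),
        "approx_change_point": change_point,               # crude: biggest jump in rolling mean
        "mean_abs_step": float(np.abs(np.diff(series)).mean()),
    }

# The resulting numbers are appended to the prompt so the model need not extract them itself.
```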
Sandwich Explanations for Two Key Meta-Concepts:
🍞 Top Bread (Hook) You know how practicing more usually makes you better, but sometimes a tough skill won’t improve just by repeating?
🥬 Scaling Law (What/How/Why)
- What it is: The idea that bigger models (more parameters) usually perform better.
- How it works:
- Train or use larger model versions.
- Plot performance vs size.
- Look for steady improvements.
- Why it matters: It guides whether “just scaling up” can fix weaknesses.
🍞 Bottom Bread (Anchor) Here, scaling helped Perception/Reasoning/Decision—but not Prediction—so bigger didn’t fix forecasting.
🍞 Top Bread (Hook) Sometimes your eyes catch shapes fast (plots), while your brain likes exact numbers (text). Using both should be best—if you can combine them well.
🥬 Modality Complementarity (What/How/Why)
- What it is: Text and vision each solve different subsets of problems.
- How it works:
- Test text-only and vision-only.
- See which problems each solves.
- Try fusing both; compare to the union of singles.
- Why it matters: If fusion doesn’t beat the union, the model isn’t truly combining insights.
🍞 Bottom Bread (Anchor) In TSRBench, the union of text-only and vision-only solutions is high, but T+V often fails to surpass that—showing current fusion is weak.
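That comparison can be computed directly from per-item correctness, roughly as follows (the three sets of solved item ids are placeholders for real evaluation outputs):

```python
def complementarity(text_solved: set, vision_solved: set, fused_solved: set) -> dict:
    """Each argument: the set of item ids a model answered correctly in that setting."""
    union = text_solved | vision_solved                # solvable by at least one single modality
    return {
        "union_size": len(union),
        "unique_to_text": len(text_solved - vision_solved),    # large -> complementary clues
        "unique_to_vision": len(vision_solved - text_solved),
        "fusion_gap": len(union) - len(fused_solved),  # > 0 -> T+V fails to match the union
    }
```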
Secret Sauce Summary:
- Standardized multi-modal encoding, carefully verified answers, and broad ablations (resolution, tools, reasoning effort) expose not just scores but why models win or fail. That’s what makes TSRBench especially informative.
04 Experiments & Results
The Test: What and Why
- Measure accuracy across 15 tasks in 4 skill areas to see where generalist models excel or struggle.
- Check scaling (does bigger help?), modality (text vs vision vs both), and effort (tool use, longer reasoning) to learn how to improve.
The Competition: Who’s Compared
- Proprietary: o4-mini, GPT-5-mini, GPT-5, Gemini-2.5-Flash, and others in T, V, and T+V modes.
- Open-source LLMs: Qwen2.5 (3B/7B/72B), Qwen3 (1.7B/8B/32B/235B-A22B), Gemma3 (12B/27B), InternLM3 (8B), GPT-OSS (20B/120B).
- Open-source VLMs: Qwen2.5-VL (3B/7B/72B), Qwen3-VL (8B/32B/235B-A22B), InternVL3.5 (1B/8B/38B), MiniCPM-V-4.5, etc.
- TS-LLMs: TS-Reasoner (7B), ChatTS (14B).
Scoreboard (with context):
- Best overall: GPT-5 (T+V) at 55.6%—like getting a mid B when many others are pulling Cs.
- Best open-source VLM: Qwen3-VL-32B at ~44.9%.
- Best open-source LLM: Qwen2.5-72B at ~42.4%.
- Pattern: Strong Perception; big drop on complex Reasoning and Decision; weakest on Prediction (especially numeric forecasting).
Surprising and Core Findings:
- Scaling law holds—except for Prediction.
- Bigger models did better at Perception, Reasoning, and Decision-Making (strong correlations).
- But for Prediction tasks, size didn’t help much—forecasting stayed hard.
- Meaning: Being smarter at logic and reading patterns doesn’t automatically turn into better number-forecasting.
- Prediction is decoupled from other skills.
- Correlations show that success in Perception/Reasoning/Decision doesn’t predict success in Prediction.
- Practical takeaway: Train forecasting more directly (data-centric) and don’t assume general reasoning transfers.
- Text vs Vision: similar averages but different wins—complementary, not redundant.
- Vision often leads on Perception (shapes are quick to see).
- Text can matter more where fine-grained numbers or rules matter.
- Union of text-only and vision-only correct answers is high; overlap is modest.
- Yet T+V models don’t fully beat the union—fusion is weak in current systems.
- Inference-time reasoning helps a lot on hard tasks.
- Letting models “think longer” lifts Reasoning, Prediction, and Decision scores notably; Perception is less affected.
- Translation: Deep, step-by-step thought at test time is worth the extra compute.
- Tool-augmented stats give small overall boosts.
- Adding precomputed features (means, peaks, change points) helps some tasks modestly.
- Interpretation: Tools can patch number-extraction gaps but don’t fully solve forecasting or fusion.
- Visual resolution has a sweet spot (~100 PPI).
- Too low: lose details (miss anomalies); too high: clutter and local noise distract.
- T+V setups cushion low-PPI losses because text carries precise values.
Error Analysis (what goes wrong):
- The main error types are Reasoning and Perception; far fewer errors come from Question Understanding or Domain Knowledge.
- Example pitfalls:
- Perception error: misreading an oscillatory series and choosing a momentum strategy over a mean-reversion one.
- Reasoning error: stopping early in a precise physics simulation instead of carrying the computation through.
- Implication: Models need better temporal pattern perception and rigorous, verifiable reasoning procedures.
High-variance vs low-variance tasks:
- High-variance (e.g., Abductive Reasoning, Event Prediction): some models do well, others badly—ripe for distillation from stronger to weaker.
- Low-accuracy, low-variance (e.g., Quantitative Decision-Making, Time Series Forecasting): everyone struggles—points to missing training data/skills industry-wide.
Bottom line:
- Today’s best model is a solid step above the rest but far from human-expert reliability, especially on forecasting and quantitative choices.
- TSRBench turns these observations into a clear to-do list for the community: better fusion, data-centric forecasting pretraining, and stronger reasoning at inference time.
05 Discussion & Limitations
Limitations (be specific):
- Forecasting simplified to multiple choice: reduces raw regression difficulty but may hide how close a model can numerically get.
- Fusion is stress-tested, but only through standard prompting interfaces: today’s T+V prompting may bottleneck true cross-modal alignment.
- Synthetic components: essential for ground-truth math/physics/backtests, but distributional shifts from real-world messiness remain.
- Single primary metric (accuracy): doesn’t reflect partial credit, calibration, or uncertainty handling.
- Static plots over raw waveforms: some domains (e.g., ECG) might benefit from higher-fidelity inputs or multi-resolution crops.
Required Resources:
- Compute/time: Running 30+ large models across 4,125 items is expensive.
- Data prep: Converting series to standardized plots/text, managing resolution, and ensuring fair prompts.
- Tooling: Optional analysis features (peaks, change points) add engineering overhead.
When NOT to Use:
- If you only need pure point-forecasting error (MAE/RMSE) on one domain—use a forecasting-specific benchmark instead.
- If you require real-time streaming evaluation or latency testing—TSRBench is batch, not streaming.
- If your goal is training models (not evaluating)—TSRBench is for testing, not for pretraining data.
Open Questions:
- How to build truly complementary fusion so T+V beats the union of singles?
- What data-centric pretraining bridges semantic reasoning and precise numeric forecasting?
- Can adaptive, step-controlled reasoning (self-verification, tool-calls, planners) reliably lift hard tasks?
- What richer metrics (calibration, rationales, uncertainty) tell us more than raw accuracy?
- How to ensure fairness and avoid domain overfitting while still giving models domain-specific tools?
Overall, TSRBench is a strong mirror: it doesn’t fix weaknesses but makes them visible so we can target the right improvements.
06 Conclusion & Future Work
3-Sentence Summary: TSRBench is a comprehensive, multi-modal benchmark that tests AI models on the full chain of time series skills: seeing patterns, reasoning about causes, predicting the future, and choosing actions. Across 4,125 problems and 14 domains, models do well at Perception but fall short on complex Reasoning and especially on Prediction, where scaling models bigger doesn’t help much. Text and plots each help on different items, but today’s models rarely fuse them effectively; tools and longer reasoning provide small-to-moderate gains.
Main Achievement: A single, standardized scoreboard that reveals not just who’s best overall, but exactly where and why models succeed or fail across perception, reasoning, prediction, and decision-making—with clear evidence about scaling limits and modality fusion gaps.
Future Directions:
- Build true cross-modal fusion so T+V outperforms the union of each alone.
- Pretrain time-series-aware foundation models to link semantic understanding with precise numeric forecasting.
- Use agentic, tool-augmented, and self-verifying reasoning to strengthen hard tasks.
- Expand metrics for calibration and uncertainty, and explore streaming, higher-fidelity inputs.
Why Remember This: TSRBench shines a bright light on the missing pieces of generalist time series intelligence. It turns a fuzzy “models seem okay” into a precise map: where bigger helps, where it doesn’t, and which skills are still missing. That map is how the community will build safer, smarter systems for health, finance, energy, disasters, and science.
Practical Applications
- Hospitals can compare models on ECG decision-making tasks before deploying AI triage tools.
- Banks can test forecasting and risk-aware decision tasks to pick safer trading assistants.
- City planners can evaluate event prediction (e.g., rain, congestion) to better schedule maintenance.
- Power grid operators can benchmark Perception/Prediction on demand curves for load balancing.
- Manufacturing can test anomaly detection and causal discovery for earlier fault prevention.
- Education platforms can use Reasoning tasks to build better tutoring on lab data and experiments.
- Disaster agencies can benchmark multivariate weather signals for earlier flood or storm alerts.
- Sports analytics teams can use abductive/temporal reasoning tasks to improve in-game insights.
- Researchers can stress-test new multimodal fusion methods against TSRBench’s T/V/T+V settings.
- Product teams can decide whether to invest in bigger models or better data/tools, guided by the scaling and tool-use findings.