
Insight Miner: A Time Series Analysis Dataset for Cross-Domain Alignment with Natural Language

Intermediate
Yunkai Zhang, Yawen Zhang, Ming Zheng et al. · 12/12/2025
arXiv · PDF

Key Summary

  • Time-series data are numbers tracked over time, like temperature each hour or traffic each day, and turning them into clear words usually needs experts.
  • This paper builds TS-Insights, a big dataset (about 100,000 examples) that pairs time-series slices with easy-to-read, accurate text descriptions.
  • They use a tool-use pipeline: first peel out the trend with classic statistics (STL or Gaussian Processes), then ask GPT-4 to write a clear description.
  • They train a Large Multimodal Model called Insight Miner by showing it simple line-plot images and the matching descriptions.
  • Insight Miner learns to describe what a time series is doing (especially its trend) in plain language.
  • In tests with human experts, Insight Miner beats the original LLaVA and is competitive with GPT-4, even winning on some tougher, unseen datasets.
  • Smart augmentations (like adding tiny noise or scaling) and paraphrasing create lots of diverse training examples without huge cost.
  • The method is efficient to train (about an hour per epoch on 8 A100s) and cheap to run once trained.
  • This work is a first step toward making time series a ‘native language’ for AI, so models can both analyze and explain patterns.
  • Future directions include describing seasonality, volatility, outliers, and handling multi-feature time series.

Why This Research Matters

When time-series data can explain itself in clear language, more people can make better decisions faster. Energy operators can understand demand surges without digging through raw numbers. Doctors and nurses can get concise summaries of patient signals to act quickly. City planners can read traffic patterns at a glance and improve safety and flow. Farmers can understand soil and crop trends to save water and boost yields. Businesses can monitor KPIs with readable narratives instead of cryptic charts, reducing mistakes and delays.

Detailed Explanation


01 Background & Problem Definition

🍞 Top Bread (Hook): You know how when you watch a plant grow day by day, you notice if it’s getting taller, if it has weekly watering ups and downs, or if some days are just weird? That’s what time-series data is like—numbers changing over time, full of stories.

🥬 Filling (The Actual Concept): Time-series analysis is the study of data points collected over time to find patterns like trends (overall direction), seasonality (repeating cycles), and noise (random wiggles). How it works (story of the field):

  1. For decades, experts used classic tools like ARIMA, STL, and state-space models to break data into understandable parts and forecast the future.
  2. Recent AI models (LLMs) got good at many tasks, but they mainly output numbers for time series (like forecasts), not explanations in plain language.
  3. Large Multimodal Models (LMMs) can connect pictures and text—great for photos or medical images—but there wasn’t a general dataset to connect time-series patterns to natural language across many domains.

Why it matters: Without good explanations, people see squiggly lines and still need experts to translate those squiggles into insights—slow, pricey, and hard to scale.

🍞 Bottom Bread (Anchor): Imagine a city planner asking, “Why did electricity use spike last Tuesday?” A good time-series explanation can say, “Usage rose steadily in the morning, peaked at lunch due to heat, then cooled off by evening,” instead of just giving a number.

🍞 Top Bread (Hook): Picture trying to explain a roller coaster with only a list of heights—no pictures, no words like ‘uphill’ or ‘loop.’ Tough, right?

🥬 Filling (The Actual Concept): The problem researchers faced was this: how to teach AI to turn raw time-series numbers into faithful, readable insights. How it works (the challenge):

  1. Time series don’t come with captions like images do.
  2. Feeding raw number lists to language models (like GPT-4) doesn’t reliably produce correct trend/seasonality/volatility descriptions.
  3. The semantics of patterns (what the line ‘means’) are subtle and vary by domain.

Why it matters: If AI can’t describe patterns well, people can’t trust it to help with decisions in energy, weather, health, finance, or transportation.

🍞 Bottom Bread (Anchor): A hospital dashboard flooding staff with numbers isn’t helpful; a short, accurate summary like “patient oxygen levels are steadily improving after noon” is.

🍞 Top Bread (Hook): Imagine students trying to write a book report without having read the book—of course the report is vague.

🥬 Filling (The Actual Concept): Earlier attempts tried prompting LLMs directly with raw vectors, or used LLMs for forecasting/classification without explanations, or limited alignment to one domain (like finance charts). These often failed to give faithful, detailed, multi-domain descriptions. How it works (why they failed):

  1. Raw vectors lack structure that language models can latch onto.
  2. Domain-limited datasets don’t generalize beyond their niche.
  3. Without aligned pairs (time series → description), models can’t learn robust mappings.

Why it matters: We needed a broad, well-structured dataset and a way to “pre-chew” the data so language models can actually explain it.

🍞 Bottom Bread (Anchor): It’s like giving a chef prepped ingredients (washed, chopped) instead of a messy bag of groceries—the result is faster and tastier.

🍞 Top Bread (Hook): Think of a museum audio guide that explains what you’re looking at while you walk.

🥬 Filling (The Actual Concept): The gap this paper fills is a general, cross-domain dataset that pairs time-series windows with high-quality, natural-language insight descriptions, plus a model trained to produce them. How it works:

  1. Use trusted statistical tools to extract core components (like trends) from raw data.
  2. Have a strong language model (GPT-4) write clean descriptions of those components.
  3. Train a multimodal model to map visual time-series plots to those descriptions.

Why it matters: This creates an “alignment bridge” from numbers to words, so AI can act like a helpful analyst.

🍞 Bottom Bread (Anchor): Now, when a farmer sees soil moisture data, the AI can say, “Moisture increased gradually after irrigation, then leveled off,” not just spit out numbers.

🍞 Top Bread (Hook): Imagine if every squiggly line on your school graphs could tell its story out loud.

🥬 Filling (The Actual Concept): The real stakes are big: faster insights in energy grids, clearer weather updates, easier traffic planning, safer hospitals, and smarter finance—because explanations help humans decide. How it works:

  1. Experts spend less time translating charts and more time acting.
  2. Non-experts gain access to trustworthy summaries.
  3. Organizations scale insight mining across many datasets cheaply.

Why it matters: Better, faster, clearer explanations can save money, reduce risk, and improve daily life.

🍞 Bottom Bread (Anchor): Your smart thermostat could say, “Your home’s evening usage climbs steadily on hot days,” helping you cut the bill without needing a data degree.

02 Core Idea

🍞 Top Bread (Hook): You know how teachers highlight key sentences so the textbook suddenly makes sense?

🥬 Filling (The Actual Concept): Key insight in one sentence: First, let statistics highlight the important structure in a time series (like its trend), then let a language model write a clear description, and use lots of these pairs to teach a multimodal model to do this automatically. How it works:

  1. Decompose or smooth the time series to expose the trend cleanly.
  2. Ask GPT-4 to describe that clean signal.
  3. Grow a big dataset of (plot → description) examples with smart augmentations.
  4. Instruction-tune a vision-language model so it learns to talk about time-series plots.

Why it matters: Without this prep-and-pair approach, models guess and get confused; with it, they learn a reliable mapping from shapes to words.

🍞 Bottom Bread (Anchor): It’s like outlining a chapter and then writing the summary—much easier than summarizing chaos.

🍞 Top Bread (Hook): Imagine three ways to explain a maze: draw arrows, tell a story, or give landmarks.

🥬 Filling (The Actual Concept): Multiple analogies for the idea:

  1. Chef analogy: Prep ingredients (statistics extract trend), then cook (GPT-4 writes), then teach a new cook by having them watch and practice (instruction-tune the model).
  2. Detective analogy: Dust for fingerprints (statistics), write the case notes (GPT-4), train rookie detectives using solved cases (multimodal model learns from pairs).
  3. Map analogy: Smooth the road (statistics), add clear labels (GPT-4), teach drivers with annotated maps (instruction-tuned model).

Why it matters: Each view shows the same trick—clean the signal, narrate it well, and practice at scale to generalize.

🍞 Bottom Bread (Anchor): After practice, the model can look at a brand-new plot and say, “Steady rise, brief dip, then level,” like a trained analyst.

🍞 Top Bread (Hook): Before training, it’s like trying to read messy handwriting; after training, it’s like neat print.

🥬 Filling (The Actual Concept): Before vs. After:

  • Before: LLMs could predict numbers but struggled to explain. Multimodal models lacked a general time-series-language dataset. Raw-number prompts often misled GPT-4.
  • After: With TS-Insights, the model aligns visual patterns with precise language and consistently produces faithful trend descriptions across domains.

Why it matters: This shifts AI from silent calculators to helpful storytellers for time data.

🍞 Bottom Bread (Anchor): Think of traffic data: instead of “values = [1.2, 1.3, …],” you now get “morning build-up, lunchtime dip, evening rush.”

🍞 Top Bread (Hook): Ever notice how smoothing peanut butter makes it spread better?

🥬 Filling (The Actual Concept): Why it works (intuition):

  1. Decomposition reduces confusion—trend without seasonality is easier to describe.
  2. Smoothing and downsampling remove distracting bumps and keep prompts simple.
  3. Augmentations teach robustness—tiny changes shouldn’t change the story.
  4. Using a vision encoder leverages powerful pretrained pattern recognition.
  5. Instruction tuning locks in the behavior: given a plot and a prompt, produce a matching explanation.

Why it matters: Each piece removes friction between raw data and trustworthy language.

🍞 Bottom Bread (Anchor): Like cleaning glasses before reading, the model “sees” the pattern clearly and names it correctly.

🍞 Top Bread (Hook): Imagine building with LEGO: you need the right bricks in the right order.

🥬 Filling (The Actual Concept): Building blocks:

  1. Statistical Tools (STL/GP) to extract trend.
  2. Smoothing + Downsampling + Rounding to make a compact, clean signal.
  3. GPT-4 to write high-quality trend descriptions.
  4. Data Augmentation + GPT-3.5 paraphrasing for diversity.
  5. Plotting time series as images to reuse strong vision encoders.
  6. Instruction Tuning to align images with language outputs.

Why it matters: The pipeline is reliable, scalable, and domain-agnostic.

🍞 Bottom Bread (Anchor): The final result is a model that can look at a line plot from energy, weather, or traffic and tell the right story, quickly.

03 Methodology

At a high level: Input time-series window → Extract a clean trend (STL or GP) → Smooth, downsample, round → Ask GPT-4 to describe → Augment data + paraphrase → Build Q&A pairs → Train Insight Miner (plot → description) → Output: faithful, readable trend insights.

🍞 Top Bread (Hook): You know how you first separate LEGO bricks by color before building?

🥬 Filling (The Actual Concept): Step 1: Extract a clean trend with statistical tools. What happens:

  • If the window has seasonality, use STL (Seasonal-Trend decomposition using LOESS) to split the series into trend + seasonality + residuals.
  • If there’s no seasonality, fit a Gaussian Process (with an RBF + white-noise kernel) to get a smooth posterior mean as the trend.

Why this step exists: Without isolating the trend, descriptions get muddled by cycles and noise.

Example: A daily electricity series with weekday/weekend cycles—STL removes the weekly wobble so the overall rise/fall is plain (see the sketch after this step).

🍞 Bottom Bread (Anchor): After peeling off the weekly bumps, the model can say, “gradual increase over two months,” not get distracted by the Monday spike.
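
To make Step 1 concrete, here is a minimal sketch, assuming statsmodels for STL and scikit-learn for the Gaussian Process; the seasonal period, kernel choice, and length-scale values are illustrative assumptions, not the paper’s exact settings.

```python
# Minimal sketch of Step 1: extract a clean trend from one window.
import numpy as np
from statsmodels.tsa.seasonal import STL
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def extract_trend(y: np.ndarray, period: int | None = None) -> np.ndarray:
    """Return a clean trend for one time-series window."""
    if period is not None:
        # Seasonal window: STL splits the series into trend + seasonality + residual.
        return STL(y, period=period).fit().trend
    # Non-seasonal window: a GP posterior mean serves as the smooth trend.
    t = np.arange(len(y), dtype=float).reshape(-1, 1)
    kernel = RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)  # assumed values
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(t, y)
    return gp.predict(t)
```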

🍞 Top Bread (Hook): Imagine ironing a wrinkled shirt so you can see the shape clearly.

🥬 Filling (The Actual Concept): Step 2: Smooth, downsample, and round the trend. What happens:

  • Apply a Gaussian smoothing kernel to reduce jitter.
  • Downsample so each window yields about 25 points (keeps prompts compact).
  • Round values to one decimal place (simplifies text input to GPT-4).

Why this step exists: Without it, prompts become long and noisy, and GPT-4 may fixate on micro-wiggles.

Example: A trend like [0.51, 0.53, 0.54, …] becomes shorter and smoother, like [0.5, 0.6, 0.7, …] (see the sketch after this step).

🍞 Bottom Bread (Anchor): Like summarizing a paragraph into a neat sentence, the model gets the essence without clutter.
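
A minimal sketch of Step 2, assuming SciPy’s 1-D Gaussian filter; the smoothing sigma is an illustrative choice, while the roughly 25-point downsampling and one-decimal rounding follow the paper’s description.

```python
# Minimal sketch of Step 2: smooth, downsample, round.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def simplify_trend(trend: np.ndarray, n_points: int = 25, sigma: float = 2.0) -> list[float]:
    smoothed = gaussian_filter1d(trend, sigma=sigma)               # reduce jitter
    idx = np.linspace(0, len(smoothed) - 1, n_points).astype(int)  # ~25 evenly spaced points
    return [round(float(v), 1) for v in smoothed[idx]]             # compact text for the prompt
```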

🍞 Top Bread (Hook): Think of telling a friend what you saw on a hike—short, clear, and focused.

🥬 Filling (The Actual Concept): Step 3: Ask GPT-4 to write the trend description. What happens:

  • Feed the smoothed, short list of trend values to GPT-4 with a prompt that requests a concise trend description.
  • GPT-4 returns text like “The series shows a steady increase followed by a mild decline.”

Why this step exists: GPT-4 provides fluent, high-quality labels that stay faithful to the simplified signal.

Example: Input numbers 0.4 → 0.8 → 1.0 → 0.9 → 0.9 become “rising then flattening” (see the sketch after this step).

🍞 Bottom Bread (Anchor): It’s like handing a cleaned-up sketch to a storyteller who narrates it simply and accurately.
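
A hedged sketch of Step 3: the prompt wording and model name below are illustrative, not the paper’s exact prompt; the call uses the current OpenAI Python client as one way to request a label.

```python
# Sketch of Step 3: ask GPT-4 for a concise trend description.
from openai import OpenAI

def describe_trend(values: list[float]) -> str:
    prompt = (
        "The following numbers are the smoothed trend of a time series, in order: "
        f"{values}. In one or two sentences, describe the overall trend."
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```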

🍞 Top Bread (Hook): Imagine practicing a basketball shot from many angles so you’re good in a real game.

🥬 Filling (The Actual Concept): Step 4: Augment windows and paraphrase descriptions. What happens:

  • Randomly apply jittering (tiny noise), scaling, shifting, smoothing, and downsampling so the same trend holds under small changes.
  • Use GPT-3.5 to paraphrase the original description to increase language variety.

Why this step exists: Without augmentation and paraphrasing, the model may overfit and fail on slightly different-looking plots.

Example: “Steady rise, then drop” becomes “Climbs gradually before a brief decline,” while the plot’s story stays the same (see the sketch after this step).

🍞 Bottom Bread (Anchor): Like hearing the same idea in different words, the model learns the core meaning, not just memorized phrasing.
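
A minimal sketch of Step 4’s label-preserving augmentations; the noise and scaling magnitudes are assumptions, chosen so small perturbations keep the trend’s story intact.

```python
# Sketch of Step 4: label-preserving augmentations (magnitudes are assumptions).
import numpy as np

rng = np.random.default_rng(0)

def jitter(y: np.ndarray, strength: float = 0.01) -> np.ndarray:
    return y + rng.normal(0.0, strength * y.std(), size=y.shape)  # tiny noise

def scale(y: np.ndarray, low: float = 0.8, high: float = 1.2) -> np.ndarray:
    return y * rng.uniform(low, high)                             # stretch amplitude

def shift(y: np.ndarray, max_offset: float = 0.1) -> np.ndarray:
    return y + rng.uniform(-max_offset, max_offset) * y.std()     # move baseline
```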

🍞 Top Bread (Hook): Picture a quiz card: question on the front, answer on the back.

🥬 Filling (The Actual Concept): Step 5: Build instruction-following pairs. What happens:

  • Each sample becomes: Human: [time-series window as image] + [question asking for the trend]. Assistant: [the GPT-4 description].
  • This single-round format guides the model to produce focused insights.

Why this step exists: Clear instructions produce consistent behavior during training and use.

Example: “Describe the overall trend in this series.” → “It rises, peaks mid-window, then slowly declines.” (A sample record sketch follows this step.)

🍞 Bottom Bread (Anchor): Like flashcards for studying, repetition with clear prompts makes the skill stick.
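
Here is what one training record might look like in a LLaVA-style single-round layout; the field names follow LLaVA’s public data format, but this exact schema and the sample id are assumptions, not the paper’s released files.

```python
# Sketch of one instruction-following pair in a LLaVA-style layout (assumed schema).
import json

record = {
    "id": "electricity_000123",               # hypothetical sample id
    "image": "plots/electricity_000123.png",  # the rendered line plot
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the overall trend in this series."},
        {"from": "gpt", "value": "It rises, peaks mid-window, then slowly declines."},
    ],
}
print(json.dumps(record, indent=2))
```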

🍞 Top Bread (Hook): Imagine teaching a camera to talk about what it sees.

🥬 Filling (The Actual Concept): Step 6: Convert the window to a line plot and train the multimodal model. What happens:

  • Plot the raw window as a simple line image (e.g., Seaborn lineplot).
  • Feed it to a pretrained vision encoder (frozen), project features into the language model’s space with a trainable linear layer, and instruction-tune the language model to output the description.

Why this step exists: Reusing a strong vision encoder taps into powerful shape recognition; training only the projector is efficient.

Example: The model learns that a gently upward-sloping line usually maps to phrases like “steady increase” (see the plotting sketch after this step).

🍞 Bottom Bread (Anchor): It’s like showing thousands of picture–caption pairs so the model learns to caption new pictures correctly.
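
A minimal plotting sketch for Step 6, assuming Seaborn and Matplotlib; the figure size and styling are illustrative, since the paper only says each window is rendered as a simple line plot.

```python
# Sketch of Step 6's input rendering: one window -> one simple line-plot image.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def render_window(y: np.ndarray, path: str) -> None:
    fig, ax = plt.subplots(figsize=(4, 3))         # size is an assumption
    sns.lineplot(x=np.arange(len(y)), y=y, ax=ax)  # the image the vision encoder sees
    ax.set_xlabel("time step")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
```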

🍞 Top Bread (Hook): Think of a secret recipe that makes a dish taste just right.

🥬 Filling (The Actual Concept): The Secret Sauce. What it is: Combining tool-based decomposition (to reveal signal), language-model labeling (to craft clear text), vast augmented pairs (to generalize), and vision-based alignment (to read plots like images). How it works:

  1. Tools cut through noise.
  2. LMs provide fluent, faithful labels.
  3. Augmentations teach robustness.
  4. Vision encoders leverage massive pretraining.

What breaks without it:
  • Skip decomposition? Descriptions get confused by cycles.
  • Skip augmentation? Fragile to small changes.
  • Skip vision encoder? Harder training, poorer generalization.

🍞 Bottom Bread (Anchor): The end result is a model that “sees” the story in a squiggly line and explains it in plain words.

04 Experiments & Results

🍞 Top Bread (Hook): You know how a spelling bee proves who can really spell new words, not just memorize yesterday’s list?

🥬 Filling (The Actual Concept): The Test. What it is: A human evaluation on 119 time-series windows—69 from test splits of datasets seen during training (but different time ranges) and 50 from completely held-out datasets. How it works:

  1. Each model produces one trend description per window.
  2. Three domain experts score each description: 2 = correct, 1 = partly correct, 0 = incorrect.
  3. Scores are summed and normalized to 0–1.

Why it matters: Humans judge faithfulness and clarity better than simple automatic metrics for this kind of natural-language insight. (A small scoring sketch follows below.)

🍞 Bottom Bread (Anchor): Like teachers grading essays for accuracy and clarity, not just counting words.
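
As a sanity check on the scoring scheme, here is a tiny sketch of the 0-to-1 normalization, assuming each of the three experts gives every description a 0, 1, or 2, so the maximum total is 2 × raters × windows.

```python
# Sketch of the normalized human-evaluation score.
def normalized_score(scores: list[list[int]]) -> float:
    """scores[rater][window], each value in {0, 1, 2}."""
    total = sum(sum(row) for row in scores)
    return total / (2 * len(scores) * len(scores[0]))

# Three raters, two windows:
print(normalized_score([[2, 1], [2, 2], [1, 1]]))  # 0.75
```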

🍞 Top Bread (Hook): Imagine a friendly race where one runner has fancy shoes, another trained longer, and a third knows the course well.

🥬 Filling (The Actual Concept): The Competition. What it is:

  • LLaVA (baseline): a general multimodal model without time-series alignment.
  • Insight Miner (Vision 1 epoch): trained one epoch on the new dataset.
  • Insight Miner (Vision 3 epochs): trained three epochs (same architecture, more practice).
  • Engineering GPT-4: GPT-4 given engineered trend features (from the statistical pipeline) rather than raw vectors.

How it works: All models describe the same set of plots; experts score blindly (descriptions shuffled).

Why it matters: Shows the impact of dataset alignment and training time, and compares to a strong engineered GPT-4 baseline.

🍞 Bottom Bread (Anchor): It’s like testing readers on the same story but with different study methods.

🍞 Top Bread (Hook): If a class average is a B-, then an A+ really stands out.

🥬 Filling (The Actual Concept): The Scoreboard (with context). What it showed:

  • Both Insight Miner models clearly outperformed the un-tuned LLaVA baseline (big jump in normalized scores), meaning the time-series–language alignment worked.
  • Training longer (3 epochs) improved performance further—practice matters.
  • Vision (3 epochs) was competitive with Engineering GPT-4 overall and even surpassed it on the holdout datasets.

Why that’s interesting: Engineering GPT-4 had handcrafted features—yet a trained multimodal model generalized better on unseen, more complex seasonalities. Alignment plus visual pattern learning seems to pay off.

🍞 Bottom Bread (Anchor): Like a student who not only aces practice tests but also surprises everyone on a brand-new exam.

🍞 Top Bread (Hook): Ever expect a tall kid to win basketball, then the kid with more practice wins?

🥬 Filling (The Actual Concept): Surprising Findings. What stood out:

  • A tuned vision-language model beat a raw-vector GPT-4 approach and could match or top a feature-engineered GPT-4 on tough sets.
  • Holdout datasets had trickier seasonal patterns; despite that, trained visual alignment handled them better—likely due to exposure to many labeled examples and augmentations.

Why it matters: Suggests that learning to ‘read’ time-series plots visually, coupled with instruction tuning, is a powerful route to faithful explanations across domains.

🍞 Bottom Bread (Anchor): It’s like learning to recognize shapes and stories in graphs the way we recognize faces—training helps you spot patterns fast and explain them well.

05 Discussion & Limitations

🍞 Top Bread (Hook): You know how a first version of a video game is fun but still missing levels and characters?

🥬 Filling (The Actual Concept): Limitations. What it is:

  • Focus is mainly on trend descriptions for single-feature windows; seasonality, volatility, outliers, and multi-feature relationships are future work.
  • Labels rely on GPT-4 quality; if prompts or preprocessing are off, descriptions could be imperfect.
  • The model reads plots as images rather than raw sequences; some fine details may be lost in plotting choices.
  • Evaluation size is modest (119 windows), and only three samples per dataset were used in reported scoring due to human-time limits.

Why it matters: There’s room to grow the coverage (properties, features) and strengthen evaluation breadth and automation.

🍞 Bottom Bread (Anchor): Think of it as a strong pilot episode that proves the idea but hasn’t explored every storyline yet.

🍞 Top Bread (Hook): Imagine needing a good kitchen, ingredients, and a chef to cook a feast.

🥬 Filling (The Actual Concept): Required Resources. What it is:

  • Training: 8× A100 40GB GPUs, about an hour per epoch; frozen vision and language backbones with a trained projection layer.
  • Data: Access to many time-series datasets across domains; toolchain for STL/GP, smoothing, and plotting.
  • LMs: GPT-4 for initial labels; GPT-3.5 for paraphrasing.

Why it matters: Feasible but not trivial—organizations can reproduce this with moderate compute and API access.

🍞 Bottom Bread (Anchor): Like running a school science fair—you need supplies and helpers, but it’s doable.

🍞 Top Bread (Hook): Don’t use a hammer for a screw.

🥬 Filling (The Actual Concept): When NOT to Use. What it is:

  • If you need precise numeric forecasts, not explanations.
  • If your data are multi-feature with complex interactions the current model doesn’t capture.
  • If you require strict, traceable, formal statistical inference for regulation.
  • If latency must be ultra-low on edge devices, and plotting + inference are too slow.

Why it matters: Pick the right tool—this one shines at language insights on trends, not all tasks.

🍞 Bottom Bread (Anchor): It’s better as a chart narrator than a calculator or a legal auditor.

🍞 Top Bread (Hook): Curiosity time: what’s next?

🥬 Filling (The Actual Concept): Open Questions. What it is:

  • Can we auto-generate high-quality descriptions for seasonality, volatility changes, and outliers using residuals?
  • How to handle multi-feature time series (cross-correlations, lag effects) in language?
  • Can a pretrained time-series encoder (not just vision) improve alignment?
  • How to build larger, more diverse, multilingual datasets and automated evaluators?

Why it matters: Answering these will turn this strong first step into a full toolkit for time-series understanding.

🍞 Bottom Bread (Anchor): Imagine future versions that can say, “Feature A leads Feature B by two days,” or “Volatility doubled after the holiday,” right out of the box.

06 Conclusion & Future Work

🍞 Top Bread (Hook): Imagine every wiggly line on a dashboard speaking up to tell you what it’s doing.

🥬 Filling (The Actual Concept): Three-sentence summary: This paper introduces TS-Insights, a large dataset that pairs time-series windows with clear language descriptions generated via a tool-assisted pipeline. Using these pairs, the authors instruction-tune Insight Miner, a multimodal model that reads time-series plots and produces faithful, readable trend insights. In human evaluations, Insight Miner outperforms the baseline LLaVA and is competitive with or better than engineered GPT-4 on some unseen datasets.

Main Achievement: Building a scalable, cross-domain bridge from raw time series to trustworthy natural-language explanations by combining classic statistics, strong language models, and multimodal instruction tuning.

Future Directions: Extend beyond trends to seasonality, volatility, outliers; move from single-feature to multi-feature series; pretrain sequence-native encoders; grow datasets and create automatic evaluation metrics.

Why Remember This: It’s a foundational step toward making time series a ‘native’ input for AI—so models not only predict numbers but also explain patterns in everyday language across energy, weather, health, traffic, and finance.

🍞 Bottom Bread (Anchor): Next time you see a busy chart, imagine an AI saying, “Here’s the story: slow rise, midday peak, evening cool-down”—that’s the promise of Insight Miner.

Practical Applications

  • Automatic chart captions for dashboards that describe trends in plain language.
  • Energy usage summaries that explain daily and weekly demand changes for grid operators.
  • Weather station digests that report temperature or humidity trends over selected periods.
  • Traffic flow narrations highlighting rush hours, midday dips, and event-driven spikes.
  • Healthcare signal summaries (e.g., heart rate trends) for quick patient triage and monitoring.
  • Finance time-series explainers for stock or crypto trends, separate from trading advice.
  • IoT anomaly briefings that flag and describe unusual sensor behavior in factories.
  • Retail KPI storytellers that narrate sales trends around promotions or holidays.
  • Education tools that teach students to interpret line graphs with guided natural-language explanations.
  • Data exploration assistants that convert exploratory plots into quick, accurate written insights.
#time series #multimodal model #trend description #STL decomposition #Gaussian Process #instruction tuning #dataset alignment #visualization to language #LLaVA finetuning #data augmentation #human evaluation #cross-domain analysis #tool-use pipeline #natural language insights