Various aspects of working with time series data — Part 4: Time series tasks and relevant algorithms
This article is part of a series of articles aimed to discuss issues that every person working with time series (TS) data (especially data scientists) should know.
The topics discussed in this series are of different levels of depth and complexity. If you are just starting your journey with TS data, you might take interest in the whole series, and if you are already familiar with TS data, you might want to skip to the later parts, which include more advanced topics.
All of the code examples and discussions are in Python.
Part 1: Time formats
Part 2: Time series analysis — Seasonality, Trends, and Frequencies
Part 3: Anomalies, Motifs, and Signatures
Part 4: Time series tasks and relevant algorithms
Table of contents
· Introduction to time series tasks and algorithms
· Time series forecasting
· Time series segmentation
· Time series classification
· Time series clustering
· New Algorithms for Time series tasks
· Summary
Introduction to time series tasks and algorithms
The previous article in this series discussed anomaly detection and motif search and discovery. Here we will discuss other Time Series (TS) data tasks, their challenges, and the relevant approaches to solve them, including algorithms.
Since there is a lot of ground to cover, this article is more theoretical and high-level, but I tried to add links for every topic discussed here so that those who wish to can dive deeper into these subjects.
Time series forecasting
TS data is a sequence with a chronological order. This means we can address certain parts of the sequence as the past and other parts as the future. Past and future are relative terms that can change according to our point of interest. TS forecasting is the task of predicting future values of the sequence based on its past values. At inference time, our model will usually predict the truly unknown future. The task is self-supervised by nature: our labels are simply future values of the series itself, so no manual labeling is required. Usually we will predict a certain number of data points into the future based on a certain number of points from the past. The number of points is a parameter that can be tuned by trial and error, or according to a certain logic in the data.
The proper amount of training data depends on a few different factors. One of them is seasonality, which was discussed in Part 2 of this series: our training data should cover at least a few full seasonal cycles, depending on the granularity we are looking for. One more thing to remember is that in some cases, reaching too far into the past can actually degrade performance, because the distant past is no longer relevant due to changes the data has undergone over time.
When training a model for a forecasting task, our inputs and labels are generated by sliding a window along the sequence: our input is the X points from the past and our label is the Y points that follow (illustrated in the sketch below).
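To make the windowing concrete, here is a minimal NumPy sketch (the window sizes x_len and y_len are hypothetical parameters you would tune):

```python
import numpy as np

def make_windows(series: np.ndarray, x_len: int, y_len: int):
    """Slide a window along the series, yielding (past, future) pairs."""
    X, y = [], []
    for i in range(len(series) - x_len - y_len + 1):
        X.append(series[i : i + x_len])                   # X past points -> input
        y.append(series[i + x_len : i + x_len + y_len])   # Y future points -> label
    return np.array(X), np.array(y)

# e.g. predict 3 future points from 12 past points
X, y = make_windows(np.arange(100, dtype=float), x_len=12, y_len=3)
print(X.shape, y.shape)  # (86, 12) (86, 3)
```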
One rule is almost always true — the further we predict into the future, the less accurate we will be.
Algorithms
One of the simplest, most basic models for TS forecasting is the ARIMA model. ARIMA stands for AutoRegressive Integrated Moving Average, and it has different modifications for different needs (e.g. SARIMA and SARIMAX). ARIMA is pretty easy to understand as a concept, and using it in Python through the statsmodels library is also not complicated. For more information, check out: What is ARIMA?
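As a quick illustration, a minimal ARIMA baseline with statsmodels might look like this (the toy series and the (p, d, q) order are placeholders, not recommendations):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# toy example: a univariate series indexed by time (placeholder data)
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

model = ARIMA(series, order=(1, 1, 1))  # (p, d, q) chosen for illustration only
fitted = model.fit()
forecast = fitted.forecast(steps=3)     # predict 3 points into the future
print(forecast)
```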
In my experience, and that of other data scientists I've spoken to, ARIMA's performance on real-world data is usually pretty poor compared to other models, but it can always be used as a baseline model, applied with minimal effort.
Some libraries offer “off the shelf” models for TS forecasting. Two examples are Meta’s Prophet and NeuralProphet, which also uses a neural network to enhance performance. Prophet is an additive model that decomposes the TS data into trend, seasonality, and noise, just like we saw in Part 2 of this series. The model works best with data that has strong seasonal effects, since it fits seasonality at different scales. For more about Prophet and how to work with it in Python: check this out.
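For reference, a minimal Prophet sketch could look as follows; Prophet expects a DataFrame with ds and y columns, and the data here is a placeholder:

```python
import pandas as pd
from prophet import Prophet

# Prophet expects columns named 'ds' (timestamp) and 'y' (value)
df = pd.DataFrame({
    "ds": pd.date_range("2022-01-01", periods=365, freq="D"),
    "y": range(365),  # placeholder values; use your real series here
})

m = Prophet()                                  # seasonality is fit automatically
m.fit(df)
future = m.make_future_dataframe(periods=30)   # extend 30 days into the future
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```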
NeuralProphet uses the same components as Prophet but adds an AutoRegressive Neural Network (like AR-Net) for autoregression and lagged covariates. For more on NeuralProphet: check out the official website. For quick results and usually good performance, I recommend using one of these models.
Another way to approach TS forecasting is with classical machine learning models like linear regression, decision trees, etc. In these models, our features are our past data points and our label is the future point, but the data is not regarded as a sequence. This may work fine in some cases, but intuitively it seems a bit incomplete, as we are disregarding the uniqueness of TS data as a sequence, where the relative location of a data point has a meaning.
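To make this tabular framing concrete, here is a sketch with scikit-learn, using the same kind of windowing shown earlier (the toy signal and window size are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300)  # toy signal

# lagged features: each row holds the 10 previous points, label is the next point
window = 10
X = np.array([series[i : i + window] for i in range(len(series) - window)])
y = series[window:]

model = LinearRegression().fit(X[:-50], y[:-50])   # train on all but the last 50
print(model.score(X[-50:], y[-50:]))               # R^2 on the held-out tail
```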
A good model that does take the sequence into consideration is the LSTM (Long Short-Term Memory) network. This is a very useful deep learning model for TS data as well as text data (which is similar in being arranged as a sequence). I won’t explain the mechanism behind LSTMs (you can watch the great StatQuest about it here), but in general, by treating the data as a signal and using a defined number of data points to predict a defined number of data points, it can achieve great results. The most recent developments in machine learning brought us transformers, which can also be used for time series forecasting (for more, read here). The downside of transformers is that they are computationally intensive and require large amounts of data, factors that need to be considered when choosing a model for the forecasting task.
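As a rough sketch, assuming Keras (TensorFlow) and arbitrary layer sizes, an LSTM forecaster that maps a window of past points to a window of future points might look like:

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, LSTM, Dense

past, future = 24, 6                      # predict 6 points from 24 past points
X = np.random.rand(500, past, 1)          # placeholder: 500 windows, 1 feature
y = np.random.rand(500, future)           # placeholder labels

model = Sequential([
    Input(shape=(past, 1)),
    LSTM(32),                             # encodes the input sequence
    Dense(future),                        # emits the whole forecast horizon
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```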
Time series segmentation
The task of segmentation is basically taking a long sequence and dividing it into segments based on detected differences between the “past” and “future” sequences. In many cases it can help us find changes in the behavior of our data. We might be interested in separating these segments because of their meaning or traits, and we can also be interested in the separation points themselves which represent points of change in time. These points are referred to as “change points” and the task of identifying them is called change point detection.
Segmentation is not anomaly detection: here we don't want to detect short-lived changes, as we are looking for a long-term change. The meaning of long-term is relative, and should be determined by the project's stakeholders and the data.
Algorithms
There are many different approaches out there for the segmentation task. Let's review a few of them:
Top-Down Algorithm (Binary Segmentation) is an approach that looks at the whole sequence and splits it into the two segments with the largest difference between them. If the difference between them exceeds a certain threshold, it continues to split each segment into two, and so on. When the threshold is no longer exceeded, the splitting stops.
Python implementation — Ruptures
Bottom-Up Algorithm (Iterative Merge) is an approach that splits the sequence into many small segments. These segments are then merged based on their similarity (or the error they add), until a certain threshold is crossed.
Python implementation — Ruptures
Sliding Window / One-pass Algorithm is an approach that uses a sliding window that increases in size while sliding along the sequence until the error exceeds a certain threshold. This algorithm is suitable for online segmentation.
Python implementation — Ruptures
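All three approaches above are available in Ruptures behind the same fit/predict interface; a minimal sketch on a toy signal (the cost model and number of breakpoints are illustrative choices):

```python
import numpy as np
import ruptures as rpt

# toy signal: three segments with different means
signal = np.concatenate([np.zeros(100), np.ones(100) * 5, np.ones(100) * 2])
signal += np.random.normal(0, 0.5, signal.size)

for algo_cls in (rpt.Binseg, rpt.BottomUp, rpt.Window):
    algo = algo_cls(model="l2").fit(signal)   # "l2" detects shifts in the mean
    change_points = algo.predict(n_bkps=2)    # ask for 2 change points
    print(algo_cls.__name__, change_points)   # e.g. [100, 200, 300]
```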
SWAB (Sliding Window And Bottom-Up) algorithm combines the Sliding Window and the Bottom-Up approaches. It defines a buffer that can hold roughly 5 or 6 segments, applies the Bottom-Up approach within that buffer, removes the leftmost resulting segment, and refills the buffer with new data using the sliding-window technique, repeating the process until there is no more data.
The paper
Fast Low-cost Online Semantic Segmentation (FLOSS) uses the Matrix Profile.
Without going into details, the matrix profile basically measures the similarity of each subsequence to its nearest neighboring subsequence within a given sequence. For segmentation purposes, we can expect the region where the change occurs to have the lowest similarity to any other part of the sequence.
The paper
Python implementation — Stumpy
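A minimal sketch with Stumpy might look like this, using stumpy.fluss, the offline variant of the idea (stumpy also provides a streaming floss); the window size m is a hypothetical choice that should match the scale of your patterns:

```python
import numpy as np
import stumpy

# toy signal whose frequency changes halfway through
t = np.linspace(0, 20, 2000)
ts = np.concatenate([np.sin(2 * np.pi * t[:1000]), np.sin(6 * np.pi * t[1000:])])

m = 50                                   # subsequence window size (tune per data)
mp = stumpy.stump(ts, m)                 # matrix profile (values + indices)
cac, regimes = stumpy.fluss(mp[:, 1], L=m, n_regimes=2, excl_factor=1)
print(regimes)                           # estimated change point location(s)
```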
Classification Score Profile (ClaSP) is a relatively new algorithm that splits the sequence at different points, trains a binary TS classifier for each possible split, and then selects the split with the highest accuracy (the one that best separates the subsequences on either side of the partition).
The paper
Python implementation — ClaSPy
Python implementation — SKTIME
TS Segmentation Benchmark (TSSB) — By the authors of ClaSP
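For illustration, the ClaSPy usage is compact; a minimal sketch, assuming the claspy package with its default automatic window-size selection:

```python
import numpy as np
from claspy.segmentation import BinaryClaSPSegmentation

# toy signal whose behavior changes at index 500: sine wave -> square wave
ts = np.concatenate([np.sin(np.linspace(0, 50, 500)),
                     np.sign(np.sin(np.linspace(0, 50, 500)))])

clasp = BinaryClaSPSegmentation()        # window size is selected automatically
change_points = clasp.fit_predict(ts)
print(change_points)                     # e.g. array([500])
```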
There are many other algorithms like Dynamic Programming, PELT (both can be found in the Ruptures library), and others.
Choosing an algorithm depends on the task: the data type, its size, and additional requirements (like how accurately a change point must be located), so a good idea is to examine different approaches and find the one best suited for the project’s needs.
Time series classification
Classification is one of the two main tasks in supervised machine learning: we basically want to sort our data into different classes. This is a rather trivial problem with tabular data, but it gets a twist with TS data because (as mentioned before) we are dealing with sequences and not with regular features. This means that:
1. The data points are not independent.
2. The relative and absolute location of a data point matters.
3. Different data points can carry different weights.
4. In some cases we might want to classify sequences of different lengths (a varying number of data points).
Classification can be performed on large whole sequences (e.g. yearly sensor data of different stations), and on small sub-sequences (e.g. an hour of sensor data that reflects a certain behavior). In the previous part of this series we’ve discussed the concept of motifs. Once we have labeled motifs that can be divided into classes, we may want to perform time series classification in order to characterize events that happen along our sequence. We may also use these motifs to classify the whole sequence.
Algorithms
Here, we will briefly review the different approaches to solve this task, with some examples, rather than going deeper into specific algorithms.
The first possible approach is classifying the sequence itself, which can be done using classic distance-based classifiers like K-Nearest Neighbors. It is recommended to pair this with Dynamic Time Warping (DTW), which allows more flexibility when comparing sequences. For more about this algorithm, read here. If we would like to focus on the pattern of the sequence, rather than its actual values, we can first normalize the sequences, for example using z-score normalization.
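A minimal sketch of this approach, assuming the tslearn library and placeholder data:

```python
import numpy as np
from tslearn.neighbors import KNeighborsTimeSeriesClassifier
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# placeholder data: 40 sequences of length 100, with binary labels
X = np.random.rand(40, 100)
y = np.random.randint(0, 2, 40)

X = TimeSeriesScalerMeanVariance().fit_transform(X)   # z-score each sequence
clf = KNeighborsTimeSeriesClassifier(n_neighbors=3, metric="dtw")
clf.fit(X[:30], y[:30])
print(clf.score(X[30:], y[30:]))                      # accuracy on held-out data
```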
The second approach is classifying vectors that represent the sequences, which can be done by any classical machine learning classifier. There are three possible ways to create these vectors (see the sketch after this list):
1. Extract statistical features from the sequence and use them as our representation vector.
2. Use embedding algorithms like Time2Vec that can create these vectors.
3. Use pattern-related algorithms that come from the world of text data, like bag-of-patterns or tf-idf, where the words are replaced by patterns (or motifs).
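As an example of option 1, a hand-rolled statistical feature extractor feeding a standard classifier might look like this (the chosen features and the placeholder data are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def features(seq: np.ndarray) -> np.ndarray:
    """Summarize a sequence as a small statistical feature vector."""
    return np.array([seq.mean(), seq.std(), seq.min(), seq.max(),
                     np.percentile(seq, 25), np.percentile(seq, 75)])

# placeholder data: 60 sequences of length 200, with binary labels
sequences = np.random.rand(60, 200)
labels = np.random.randint(0, 2, 60)

X = np.vstack([features(s) for s in sequences])        # one row per sequence
clf = RandomForestClassifier().fit(X[:45], labels[:45])
print(clf.score(X[45:], labels[45:]))
```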
The third approach is to use deep learning algorithms that can handle sequences, like RNNs, LSTMs, and transformers, which are famously known to also work with textual data.
A fourth approach is to use CNNs after representing the time series in a 2-dimensional space, which can be created by transforms like the wavelet transform that map the signal into a time-frequency plane.
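For instance, assuming the PyWavelets package, a continuous wavelet transform turns a 1-D signal into a 2-D time-frequency array that a CNN could consume:

```python
import numpy as np
import pywt

# toy signal: frequency increases over time (a chirp)
t = np.linspace(0, 1, 512)
signal = np.sin(2 * np.pi * (5 + 20 * t) * t)

scales = np.arange(1, 65)
coeffs, freqs = pywt.cwt(signal, scales, "morl")  # continuous wavelet transform
print(coeffs.shape)  # (64, 512): a 2-D time-frequency "image" for a CNN
```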
Each approach has its pros and cons. I found that working with representation vectors makes the job much easier and solves many issues, but creating good representation vectors is a challenging task in itself.
Time series clustering
Clustering is a popular unsupervised machine learning approach for dividing a given data set into classes (clusters). In our case, it means that we have TS segments with no labels that we want to cluster. The challenges in clustering TS data are similar to those discussed for the previous tasks:
1. We are dealing with sequences, not independent data points; if they are not treated as such, we lose important information.
2. When we treat data points as features, we get a very high number of dimensions, which has a computational cost and other problematic consequences (see: the curse of dimensionality).
3. In some cases we would like to cluster segments with a varying number of data points.
Since the task of clustering does not require labels, it may be the preferred task at the start of our project. Once we have clustered the segments, if we can label them, we may switch to classification for future tasks and gain more explainability (in most cases).
Algorithms
Conceptually, clustering approaches are similar to the classification approaches. The main difference is the unsupervised manner of the algorithms. We can:
1. Cluster the TS sequences directly using a distance-based clustering algorithm (e.g. K-means, DBSCAN), as shown in the sketch after this list.
2. Cluster based on representation vectors (as mentioned in the classification section).
3. Cluster based on presence/absence or density of patterns (motifs).
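A minimal sketch of option 1, assuming tslearn and placeholder data:

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

# placeholder data: 30 unlabeled sequences of length 80
X = np.random.rand(30, 80)

km = TimeSeriesKMeans(n_clusters=3, metric="dtw")  # DTW tolerates misalignments
cluster_ids = km.fit_predict(X)
print(cluster_ids)                                 # cluster label per sequence
```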
Clustering TS data is less developed than classification and is also a more challenging task, as we usually apply it when little is known about our sequences, so we don’t know which parts are relevant for us. On one hand, we want to use as much of the raw data as possible. On the other hand, using all the data will introduce a lot of noise, especially when dealing with TS data. Usually, a certain level of cleaning will be necessary, but it should be backed by domain-specific and task-specific logic. For example, if we have sensor data from different stations and we want to cluster these stations by their behavior, it might be a good idea to remove outliers and apply some level of smoothing. If we are looking to classify malfunctions, it might be better to focus on the outliers and smooth or remove the rest.
New Algorithms for Time series tasks
In recent years, new models that can perform TS tasks have been introduced. These models offer innovative solutions with very good reported performance.
TimeGPT
TimeGPT is a generative pre-trained transformer for time series tasks like forecasting and anomaly detection, developed by NIXTLA and trained on various types of TS data like electricity, finance, IoT and more.
TimeGPT offers zero-shot inference (without any prior training), and also a possibility of fine-tuning the model based on specific needs.
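For illustration, a zero-shot call might look like the sketch below. This assumes Nixtla's Python SDK and a valid API key, and the client API may change, so check the official docs:

```python
import pandas as pd
from nixtla import NixtlaClient  # assumption: Nixtla's official Python SDK

client = NixtlaClient(api_key="YOUR_API_KEY")  # hypothetical key

df = pd.DataFrame({
    "ds": pd.date_range("2023-01-01", periods=100, freq="D"),
    "y": range(100),  # placeholder series
})

# zero-shot forecast: no training on our side
forecast = client.forecast(df=df, h=14, time_col="ds", target_col="y")
print(forecast.head())
```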
The model is not based on any existing LLM, but it has a similar architecture, with an encoder-decoder structure and the self-attention mechanism.
The model was released at the end of 2023 and can reportedly provide accurate results in short inference times.
For more about TimeGPT: Click here.
Kolmogorov-Arnold Networks
The concept of Kolmogorov-Arnold Networks (KANs) was first proposed in an article from April 2024 called "KAN: Kolmogorov-Arnold Networks". It is based on the Kolmogorov-Arnold Representation Theorem, and aspires to outperform the current widely-accepted approach of Multi-Layer Perceptron (MLP).
The main novelty of the KAN approach is replacing the fixed activation functions on the nodes of an MLP with learnable activation functions on the edges. A KAN breaks the multivariate function (over multiple features) down into a composition of simple univariate functions (one per feature) and then aggregates them to obtain the final function.
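For reference, the underlying Kolmogorov-Arnold representation theorem states that any continuous multivariate function on a bounded domain can be written as a sum of compositions of univariate functions:

```latex
f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

KANs generalize this form into deeper networks by making the univariate functions learnable (parameterized as splines in the original paper).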
A good, deeper explanation is provided in this video of Neural Breakdown with AVB.
KANs were shown to outperform MLPs on most tasks and provide better explainability, but they are much slower to train and cannot harness the advantages of GPUs.
This model can be used as a solution for challenges with many different data types, and also with Time Series as published in a few different papers: [1] [2] [3].
For more about KANs you may also read this paper here.
Liquid Neural Networks
Liquid Neural Networks (LNNs) were introduced in 2020 by researchers at MIT as a type of Recurrent Neural Network that is time-continuous and can adapt its structure based on the data; LNNs continue to learn even after training. Their novelty is that information is transferred dynamically and fluidly, rather than through fixed weights and static connections between neurons. This makes for a very adaptable and robust model that can work well on TS data.
These algorithms can perform different time series tasks and should be considered as more advanced solutions for challenging tasks.
Summary
- TS data have unique characteristics that make the tasks more challenging to perform.
- There are a few different approaches and many different algorithms applicable to each task.
- TS data can also be modeled using classical machine learning models but they will not address the data as a sequence.
- Converting TS sequences to representation vectors may help in solving some of the challenges of working with TS data.
- New innovative models from recent years present novel approaches and improved performance on TS tasks.