The Right Cross-Validation Technique for Time Series Datasets


Introduction

Whenever working with time series data, it is critical to employ a cross-validation approach that accounts for the data's temporal ordering. This is because time series data exhibits autocorrelation: the value at each time point is correlated with the values that came before it. As a result, unlike in many other machine learning applications, the data cannot be treated as independent and identically distributed (iid).

The standard k-fold cross-validation technique, which splits the data into k folds at random, trains the model on k-1 folds, and tests it on the remaining fold, is inadequate for time series data. Random splitting ignores the temporal ordering of the observations, so information from the future can leak into the training set and produce overly optimistic (overfit) performance estimates. Below we look at several cross-validation strategies that are appropriate for time series datasets.

What is a Time Series Dataset?

A time series is a collection of observations arranged in chronological order. Data points gathered at regular intervals over time are used to evaluate patterns, trends, and relationships. Time series data is widely used in economics, stock market research, weather forecasting, and finance. Examples include hourly temperature readings, daily stock prices, and monthly sales figures. Time series analysis techniques are used to analyze the data and forecast future values based on temporal patterns.

What is Cross-Validation?

Cross-validation is a machine learning technique used to evaluate how well a model performs on data it has not seen. It involves splitting the dataset into training and testing sets, training the model on the training set, and then assessing its performance on the testing set. The results give an estimate of how the model will perform on new, unseen data. Cross-validation is needed to ensure that the model generalizes well and can also be used to compare the effectiveness of different models.

Cross-Validation Techniques for Time Series Datasets

  • Train/Test Split

    The most fundamental technique is a train/test split, which separates the data into a training set and a testing set. The training set is used to train the model, while the testing set is used to evaluate it. Because time series data has a temporal ordering, however, the data must be split accordingly.

    One way to accomplish this is to split the data at a chosen time point. For example, with hourly data we might use the first 80% of the observations as the training set and the remaining 20% as the testing set. This guarantees that the model is trained on earlier time points and evaluated on later ones; a minimal sketch appears after this list.

  • Rolling Window Cross-Validation

    Rolling window cross-validation is a strategy that accounts for the temporal ordering of the data. It entails training the model on a window of consecutive data points and testing it on the data points that immediately follow. The window is then shifted forward by a predefined number of data points and the process is repeated.

    Assume we have hourly data and wish to use a rolling window of 24 hours, i.e., we train the model on 24 hours of data and test it on the next hour. We would begin by training the model on the first 24 hours and evaluating it on the 25th hour. We would then move the window forward by one hour and repeat the procedure, training the model on hours 2-25 and evaluating it on hour 26. This process continues until the data is exhausted; see the sketch after this list.

    The advantage of rolling window cross-validation is that it respects the temporal order of the data and repeatedly evaluates the model on observations that lie immediately beyond each training window, which mirrors how the model would be used in practice.

  • Blocked Time Series Cross-Validation

    Blocked time series cross-validation is a technique that divides the data into contiguous blocks and uses each block in turn as the testing set while the remaining blocks serve as the training set.

    Assume we have 100 weeks of data and wish to use blocked time series cross-validation with two blocks. We would divide the data into the first 50 weeks and the last 50 weeks. The model would first be trained on the first 50 weeks and tested on the last 50 weeks. The procedure would then be reversed, with the model trained on the last 50 weeks and tested on the first 50; see the sketch after this list.

    An advantage of blocked time series cross-validation is that the model's performance can be evaluated on data that lies far beyond the training period, which is useful for longer-horizon predictive tasks.

  • Grouped Time Series Cross-Validation

    Grouped time series cross-validation is a strategy that divides the data into groups based on certain criteria (e.g., geography, customer segment) and uses each group as the testing set while the remaining groups serve as the training set.

    Suppose we have daily data from many locations and wish to apply grouped time series cross-validation. We could divide the data into groups by region, with each group containing the data from one region. The model would then be trained on data from all regions except one and tested on data from the held-out region. This procedure would be repeated for each region; see the sketch after this list.

    Grouped time series cross-validation is advantageous when the data exhibits different patterns or behaviors across groups, since it lets us assess the model's performance on each group independently.

  • Purged Time Series Cross-Validation

    Purged time series cross-validation is a valuable approach when working with financial time series data. Financial time series frequently contain events (e.g., stock splits, dividends) that can bias the evaluation of the model's performance.

    Purged time series cross-validation eliminates all data points that fall within a specified time window (e.g., 5 days) after such an event occurs. This ensures that the testing set contains no data points affected by the event.

    Assume we have daily stock data and wish to employ purged time series cross-validation. If a stock split happens on day 10, we would remove all data points that fall within the next 5 days. The model would then be trained on the remaining data and tested on data outside the purge window; see the sketch after this list.

    The benefit of purged time series cross-validation is that it lets us evaluate the model's performance without the results being distorted by such events.
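
The sketch below illustrates the chronological train/test split described above. The hourly series is a randomly generated placeholder, and the 80/20 ratio is the one used in the example; substitute your own data and split point.

```python
# Chronological train/test split: a minimal sketch using pandas.
# The hourly series here is a random placeholder standing in for real data.
import numpy as np
import pandas as pd

series = pd.Series(
    np.random.randn(1000),
    index=pd.date_range("2023-01-01", periods=1000, freq="h"),
)

split_point = int(len(series) * 0.8)          # first 80% of the observations
train = series.iloc[:split_point]             # earlier time points
test = series.iloc[split_point:]              # later time points

print(f"Train: {train.index.min()} to {train.index.max()} ({len(train)} rows)")
print(f"Test:  {test.index.min()} to {test.index.max()} ({len(test)} rows)")
```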
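
For rolling window cross-validation, a minimal sketch is shown below. It uses a 24-hour window shifted forward one hour at a time, as in the example above; the "model" is a naive mean forecast standing in for whatever model you actually want to evaluate.

```python
# Rolling window cross-validation: train on a 24-hour window, test on the
# next hour, then slide the window forward by one hour.
import numpy as np

y = np.random.randn(200)                      # placeholder hourly observations
window = 24                                   # size of the training window
squared_errors = []

for start in range(len(y) - window):
    train = y[start:start + window]           # hours start .. start+23
    actual = y[start + window]                # the next hour
    forecast = train.mean()                   # naive model as a stand-in
    squared_errors.append((forecast - actual) ** 2)

print("Rolling-window MSE:", np.mean(squared_errors))
```

scikit-learn's TimeSeriesSplit offers a comparable rolling scheme when its max_train_size and test_size arguments are set.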
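
A minimal sketch of blocked time series cross-validation follows. It splits a placeholder series of 100 weekly observations into two contiguous blocks and uses each block in turn as the testing set, matching the two-block example above.

```python
# Blocked time series cross-validation: each contiguous block is used once
# as the testing set while the remaining block(s) form the training set.
import numpy as np

y = np.random.randn(100)                              # 100 weeks of placeholder data
n_blocks = 2
blocks = np.array_split(np.arange(len(y)), n_blocks)  # contiguous index blocks

for i, test_idx in enumerate(blocks):
    train_idx = np.concatenate([b for j, b in enumerate(blocks) if j != i])
    train, test = y[train_idx], y[test_idx]
    print(f"Fold {i}: train on {len(train)} weeks, "
          f"test on weeks {test_idx[0]}-{test_idx[-1]}")
```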
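
For grouped time series cross-validation, the sketch below uses scikit-learn's LeaveOneGroupOut to hold out one region at a time. The region labels, feature matrix, and target are illustrative placeholders.

```python
# Grouped time series cross-validation: train on all regions except one,
# test on the held-out region, and repeat for every region.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))                                 # placeholder features
y = rng.normal(size=120)                                      # placeholder target
regions = np.repeat(["north", "south", "east", "west"], 30)   # group labels

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=regions):
    held_out = regions[test_idx][0]
    print(f"Train on {len(train_idx)} rows from the other regions, "
          f"test on {len(test_idx)} rows from '{held_out}'")
```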
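
Finally, the sketch below outlines purged time series cross-validation. The event date and the 5-day purge window mirror the stock-split example above; both are illustrative assumptions, as is the simple chronological split applied after purging.

```python
# Purged time series cross-validation: drop every observation that falls
# inside a purge window after a known event, then split the remaining data.
import numpy as np
import pandas as pd

prices = pd.Series(
    np.random.randn(60).cumsum(),
    index=pd.date_range("2023-01-01", periods=60, freq="D"),
)
events = [pd.Timestamp("2023-01-10")]          # e.g. a stock split on day 10
purge_window = pd.Timedelta(days=5)

# Mark every observation that falls inside an event's purge window.
purged = pd.Series(False, index=prices.index)
for event in events:
    purged |= (prices.index >= event) & (prices.index <= event + purge_window)

clean = prices[~purged]                        # data unaffected by the events
split_point = int(len(clean) * 0.8)
train, test = clean.iloc[:split_point], clean.iloc[split_point:]
print(f"Dropped {int(purged.sum())} purged rows; "
      f"train={len(train)}, test={len(test)}")
```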

Conclusion

To summarize, when working with time series data it is critical to employ a cross-validation approach that takes the data's temporal ordering into account. The approaches discussed in this article (train/test split, rolling window cross-validation, blocked time series cross-validation, grouped time series cross-validation, and purged time series cross-validation) are all appropriate for time series data and can be chosen based on the task at hand. To verify that the model generalizes well to future data, its performance should be evaluated with an appropriate cross-validation approach.
