Time is the only nonrenewable resource – Sri Ambati, Founder and CEO, H2O.ai.
Prediction is very difficult, especially if it’s about the future – Niels Bohr, Nobel Prize-Winning Physicist.
Despite its inherent difficulty, every business needs to make predictions. You may want to forecast sales, estimate demand, or gauge future inventory levels. Perhaps you want to predict temperature changes or the price of a stock. Whatever it is, you will need data, with time on the x-axis and the value you are measuring (demand, temperature, etc.) on the y-axis. It can also have other features, which we will discuss later. We call this time-series data, and there are special tools and techniques for working with it.
We will work with the following dataset in this article. It is taken from the excellent (free) textbook Forecasting: Principles and Practice, 2nd Edition, and shows the number of electrical equipment orders in Europe from 1996 to 2012. The data has been normalized (a value of 100 equals 2005 orders) and adjusted for working days. These adjustments are two examples of preprocessing steps common in time series analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Load the data, parsing the date column as datetime
# (index_col=0 assumes the CSV's first column is a row index)
df = pd.read_csv('electrical_equipment.csv', index_col=0, parse_dates=['date'])
plt.plot(df['date'], df['orders'])
plt.show()
Let's inspect the first five rows to get a feel for the data.
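A quick way to do this, using the df loaded above:
# Show the first five rows
print(df.head())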
In this series, we will introduce you to time series modeling – the act of building predictive models on time series data. This first article explains common preprocessing and feature engineering techniques. Subsequent articles discuss models and model diagnostics.
Let’s go.
The process for creating time series models is quite similar to the standard supervised machine learning pipeline.
We like to think of it in six steps:
1. Collect the data (ETL)
2. Explore the data (EDA)
3. Preprocessing
4. Feature engineering
5. Modeling
6. Model diagnostics
In this article, we focus on Preprocessing and Feature Engineering.
Once you have collected (ETL) and explored (EDA) your data, it’s time to clean and shape it into a format time series models expect. Most of the time, you cannot just feed in the date and value columns as they are. Or rather, if you did, you would get poor results in comparison to adequately preprocessed data!
Many traditional models are adaptations of linear models; therefore, it’s recommended that you perform missing value imputation and outlier detection and removal. This is standard practice for supervised machine learning as well.
You can use libraries like scikit-learn and pandas to do this or, if you use H2O Driverless AI, it will perform these steps for you without you having to lift a finger or program any specific steps.
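For illustration, here is a minimal sketch of both steps using only pandas; the median fill and the 3-standard-deviation outlier rule are our own illustrative choices, not prescriptions:
# Fill missing order values with the series median
df['orders'] = df['orders'].fillna(df['orders'].median())
# Drop rows more than 3 standard deviations from the mean (a simple outlier rule)
z = (df['orders'] - df['orders'].mean()) / df['orders'].std()
df = df[z.abs() <= 3]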
Using the Driverless AI Python client, first set up your client:
import driverlessai
address = 'http://ip_where_driverless_is_running:12345'
username = 'username'
password = 'password'
dai = driverlessai.Client(address=address, username=username, password=password)
# make sure to use the same user name and password when signing in through the GUI
Then create your experiment:
experiment = dai.experiments.create(name='walmart_time_series',
                                    test_dataset=test_data, ...)
Now your experiment is running, and H2O Driverless AI is performing the preprocessing steps (among many other things, some of which we will discuss later).
To a novice, time-series data may look random. But most datasets contain recognizable patterns, which we can isolate and analyze individually.
Here are three of the most common patterns:
- Trend: a long-term increase or decrease in the data.
- Seasonality: variation that repeats with a fixed, known frequency, such as every week or every year.
- Cycles: rises and falls that do not have a fixed frequency.
Cycles are different from seasonal patterns because we do not know ahead of time when they will happen or how long they will last, whereas we do know the frequency and length of seasonal patterns.
The process of isolating these components is called trend-seasonal decomposition, or decomposition for short. Usually, we combine the trend and cycle parts and call it the trend-cycle or trend.
We break each time series down into three parts: trend, seasonal, and a remainder term (anything that is not part of the first two). You can combine these elements in an additive (trend + seasonal + remainder) or multiplicative (trend x seasonal x remainder) way. The former suits linear patterns and the latter non-linear ones, e.g., quadratic or exponential. Note that taking the log of a multiplicative series turns it into an additive one.
There are several methods you can use to do this but, for brevity, we will demonstrate basic additive decomposition.
from statsmodels.tsa.seasonal import seasonal_decompose
# Index of dataframe must be DateTime
df_datetime_idx = df.set_index('date')
# Create additive seasonal decomposition of the orders series
# (statsmodels infers the 12-month period from the monthly DatetimeIndex)
result_add = seasonal_decompose(df_datetime_idx['orders'], model='additive')
# Plot
fig = result_add.plot()
plt.show()
Here we show electrical equipment orders over several years (top), followed by the trend, seasonal and remainder components (in descending order). The trend shows the general movement of the original data and has a similar shape. The seasonal component exposes the lower-level variation, and the remainder contains other fluctuations the first two parts don’t show.
Traditional time series models often assume that the data fed into them is stationary. A dataset is stationary if the underlying process that created it does not change over time. In other words, the data has a constant mean and variance. This property implies it has neither trend nor seasonal components – by definition, both shift values in a time-dependent way; annual seasonality, for example, implies higher sales at the end of every year.
A time series created by a stationary process (upper) and one created by a non-stationary process (lower) – source
Most datasets you work with will not be stationary. Thus, you need to transform them so that they are. This transformation is essential for autoregressive/ARIMA models.
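Before transforming anything, it helps to test whether your series is already stationary. One common check, which we add here as a suggestion rather than as part of the original workflow, is the augmented Dickey-Fuller test from statsmodels:
from statsmodels.tsa.stattools import adfuller

# Run the augmented Dickey-Fuller test on the raw orders series
adf_stat, p_value, *_ = adfuller(df['orders'].dropna())
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}')
# Rule of thumb: p < 0.05 suggests the series is stationary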
Here are some methods to stationarize your data.
Differencing
This is the easiest method and involves calculating the difference between consecutive elements. It stabilizes the mean and reduces the impact of trends and seasonal behavior, leaving the model free to focus on predicting one point after another.
It is easy to do this using the diff() method in pandas.
# First-order differencing: difference between consecutive observations
df['orders_diff'] = df['orders'].diff()
plt.plot(df['orders_diff'])
plt.show()
Applying differencing completely changes the shape of the plot; the up and downward trends have been removed. However, we can still see seasonality because the positive and negative peaks occur with regular frequency. Thus, this data is still not stationary. Let’s fix this with the following method.
Seasonal Differencing
Seasonal differencing is the same as differencing, but you calculate the difference between elements in the same season. For example, if there were weekly seasonality, you would calculate the difference between every Monday, every Tuesday, and so on. This is more effective than ordinary differencing at removing seasonal trends, but it does not work all the time. If you can still see seasonal trends, your data is not stationary, and you may need to apply further differencing.
To distinguish seasonal from ordinary differencing, we sometimes call the latter first-order differencing, i.e., differencing at lag 1.
Use the diff() method and set the periods parameter to the length of the seasonal trend you want to difference. Annual differencing is often the most effective, but let's look at a few options.
# The data is monthly, so periods=12 differences against the same month last year
df['3month_diff'] = df['orders'].diff(periods=3)
df['9month_diff'] = df['orders'].diff(periods=9)
df['12month_diff'] = df['orders'].diff(periods=12)
The 3-month and 9-month plots still exhibit seasonal behavior (the peaks and troughs occur at regular intervals), whereas the 12-month chart looks devoid of seasonality. However, it doesn't look like a random process generated it. To fix this, let's apply first-order differencing to the 12-month series.
# Apply first-order differencing on top of the seasonal difference
df['12month_1month_diff'] = df['12month_diff'].diff()
Much better! The peaks/troughs do not follow a consistent pattern, the plot is centered around 0, and the variance looks relatively constant.
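If you ran the augmented Dickey-Fuller check from earlier, re-running it on the doubly differenced series should now produce a small p-value:
from statsmodels.tsa.stattools import adfuller

# The seasonally and first-order differenced series should now look stationary
adf_stat, p_value, *_ = adfuller(df['12month_1month_diff'].dropna())
print(f'p-value: {p_value:.3f}')  # expect a small value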
Note: you can apply seasonal and first-order differencing in either order and will get the same results. However, if you use seasonal differencing first, you may get a stationary time series straight away and thus have one less preprocessing step to do.
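A quick sketch to convince yourself of this equivalence:
import numpy as np

# Seasonal (lag-12) then first-order differencing...
a = df['orders'].diff(12).diff(1)
# ...versus first-order then seasonal differencing
b = df['orders'].diff(1).diff(12)
# The two orderings agree (up to floating-point error)
print(np.allclose(a.dropna(), b.dropna()))  # True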
Transformations
In tabular data modeling, you may apply log or square root transformations to features to create a more normal distribution. Since stationary data has a constant mean and variance, we can think of each point as being drawn from the same distribution. Thus, we can apply the same transformations.
If a dataset is quadratically increasing, applying the square root will transform it into a linear trend. If it grows exponentially, taking the (natural) log will make it linear. You can also test other methods, such as power transformations.
# Square root linearizes quadratic growth; log linearizes exponential growth
df['sqrt'] = np.sqrt(df['orders'])
df['log'] = np.log(df['orders'])
Since our data is neither quadratically nor exponentially increasing, the transformations do not change the shape, just the scale of the data (look at the y-axis).
Often you will combine transformations and differencing. The transformations help to stabilize the variance, and differencing helps to stabilize the mean. The combination results in more stationary data than one or the other alone.
# First order differencing
df['diff'] = df['orders'].diff()
# First apply log, then differencing
df['diff_log'] = np.log(df['orders']).diff()
On the left, we have first-order differencing. The other two plots show a log transformation followed by first-order differencing with different y-axis scales. The variance for applying first-order differencing alone is 165, but the variance of log + differencing is 0.02 – an 8,000x difference!
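You can verify the variance reduction yourself; the exact numbers will depend on your dataset:
# Compare the variances of plain differencing vs. log + differencing
print(df['diff'].var())      # roughly 165 for this dataset
print(df['diff_log'].var())  # roughly 0.02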
We’ve now discussed the essential preprocessing steps needed for time series modeling. Let’s look at feature engineering.
Our electrical orders dataset has two columns: date and orders. If you are used to building tabular models with tens, hundreds (or even thousands!) of columns, you may be a bit confused about how to make predictions with just two variables.
However, a host of features are within those columns waiting to be uncovered and pumped into your models: here is an overview of the most common ones.
Before we start, note that the two most widely used families of traditional time series models are 1) autoregressive/ARIMA models and 2) smoothing models. We will briefly explain how they work in the sections below and point out which features lend themselves to which models.
Lag Variables
A fundamental assumption of autoregressive models is that we can use past values to predict future ones. So, we need to create features to represent these past values. We call these lag variables, and they lag behind the actual time series by 1, 2, 3, or (many) more time steps.
Use the shift() method in pandas to create lagged variables.
# Create 3 lagged variables
df['t-1'] = df['orders'].shift(1)
df['t-2'] = df['orders'].shift(2)
df['t-3'] = df['orders'].shift(3)
We created three lag variables: t-3, t-2, and t-1, which we can use to predict orders. As is always the case with lag variables, you need to drop the first few rows that contain NaNs to use them. From row 3 onwards, we are good to go.
Aggregation
Smoothing models assume they can predict future behavior from the aggregated statistics (typically the average) of past values. Thus, aggregated features such as the average, standard deviation, skewness, min, and max can be valuable additions.
Use the aggregate method in pandas to create aggregated features.
lagged_feature_cols = ['t-3', 't-2', 't-1']
# Drop first 3 rows due to NaNs
df_lagged = df.loc[3:, lagged_feature_cols + ['orders']]
# Create feature df to use for aggregation calculations
df_lagged_features = df_lagged.loc[:, lagged_feature_cols]
# Create aggregated features
df_lagged['max'] = df_lagged_features.aggregate(np.max, axis=1)
df_lagged['min'] = df_lagged_features.aggregate(np.min, axis=1)
We created the min and max columns, which are the minimum and maximum values, respectively, from the t-3, t-2, and t-1 columns.
Trend & Seasonality
Thanks to trend-seasonal decomposition, you can pass the trend and seasonal components of the time series as individual features. Alternatively, you can use just one of them to force the model to focus on this aspect and ignore the other. Macroeconomic forecasts often do this. We care more about the unemployment trend and not so much about the seasonal increase that happens every year after college graduation, for example.
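For example, you could reuse the decomposition computed earlier (a sketch; it assumes the result_add and df_datetime_idx objects from the decomposition section):
# Attach the decomposition components as individual features
df_features = df_datetime_idx.copy()
df_features['trend'] = result_add.trend        # trend-cycle component
df_features['seasonal'] = result_add.seasonal  # seasonal component
# The centered moving average leaves NaNs at the edges of the trend
df_features = df_features.dropna()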
Date-Specific Features
Specific times of the day, week, month, or year can significantly impact a time series. For example, the number of cars on the road is higher between 8am-9am and 5pm-6pm than other times due to rush hour. Thus, it is helpful to extract datetime-specific features such as hour_of_day (1, 2, 3, etc.), day_of_week, week_of_year, month_of_year, and so on. You can also create boolean features such as is_weekday or is_holiday.
If the date column/index is the correct dtype (datetime64), you get access to loads of easy ways to create new date-specific features in pandas.
# Set index to the date column of original df
# drop first 3 rows due to NaNs from lagged vars (see above)
df_lagged.index = df.date[3:]
# Create month and quarter columns
df_lagged['month'] = df_lagged.index.month
df_lagged['quarter'] = df_lagged.index.quarter
Here we created the month and quarter columns, each in a single line, using the .month and .quarter attributes of the datetime index.
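Boolean features follow the same pattern. Here is a sketch for is_weekday (our monthly data makes it uninformative, but the idea carries over to daily or hourly data; is_holiday would additionally need a holiday calendar, which we leave out):
# dayofweek runs from Monday=0 to Sunday=6
df_lagged['is_weekday'] = df_lagged.index.dayofweek < 5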
Domain-Specific Features
The best features are ones specific to your problem domain. These require domain-specific knowledge or thorough googling on your part to create. But the effort is well worth it. Often, finding one such feature can be the difference between a good model and a great one. Unfortunately, you can't find out everything just by reading blog posts like this one.
We could include many other features, but we think this is enough to whet your appetite. If you want to learn more, a tremendous open-source package for automated time series feature extraction is tsfresh. You can see a complete list of all the features it extracts in the tsfresh documentation.
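As a taste, here is a minimal tsfresh sketch; it expects a long-format table with an id column, so we add a constant one for our single series (an assumption about how you would shape this dataset):
from tsfresh import extract_features

# tsfresh expects an id column identifying each series
df_ts = df[['date', 'orders']].copy()
df_ts['id'] = 1  # we only have one series
features = extract_features(df_ts, column_id='id', column_sort='date')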
There you have it, a quick introduction to preprocessing and feature engineering practices tailored to time series data. You could treat time series as just another supervised machine learning problem, but that isn’t going to give you excellent results. Instead, apply the techniques you’ve learned in this article to create robust models and high-quality forecasts that will empower your business’s growth in the months and years to come.
If you want to give everyone in your company the power to create incredible models with minimal effort, you can request a demo of H2O AI Cloud.