Time Series in H2O Driverless AI - #H2OWorld 2019
This session was recorded in NYC on October 22nd, 2019. Slides from the session can be viewed here: https://www.slideshare.net/0xdata/dmitry-larko-h2oai-time-series-in-h2o-driverless-ai-h2oworld-2019-nyc
Time series is a unique field in predictive modelling where standard feature engineering techniques and models are employed to get the most accurate results. In this session we will examine some of the most important features of Driverless AI’s newest recipe regarding Time Series. It will cover validation strategies, feature engineering, feature selection and modelling. The capabilities will be showcased through several cases.
Dmitry has more than 10 years of experience in IT. He started in data warehousing and BI and now works in big data and data science. He has a lot of experience in predictive analytics software development for different domains and tasks. He is also a Kaggle Grandmaster who loves to use his machine learning and data science skills in Kaggle competitions.
Read the Full Transcript
Hello, thank you. Just to warn you, it’s going to be a very, very short introduction to time series capabilities in Driverless AI. I have approximately 10 minutes to give you an extremely high-level overview of what type of time series forecasting you can do in Driverless AI.
So let me skip my background. As you already learned, Driverless AI has a very simple process: you have some input data, you define a target, you pick the metric, you specify the resources available, like time and hardware, and you receive a lot of automation. You can automatically visualize your data, there is a feature engineering and selection process, you get a model, you can interpret that model, and you get a scoring pipeline to put into production if you want to.
Basically, for time series you’re following exactly the same process; there are not a lot of changes in that. We should mention what exactly we mean by a time series problem. We have a quite loose and very, very high-level definition: basically every process which depends on time we can consider a time series, right? So it doesn’t have to be just a time series in the sense that you have a sequence of equidistant points and you are asked to forecast the next several points. That would be the classical time series definition.
You can also define a time series as a case where you have a dataset and some patterns in your data strongly depend on time. In that case, ideally you should split your data in time and you’re supposed to use out-of-time validation. So that’s also built into the product.
So obviously, we cover very different… Time series can be of quite a different nature. It could be a simple linear trend, which is actually quite simple to model. It can also contain nonlinear seasonal patterns in the data, or it can be a combination of trends and seasonal patterns as well. Also, in most of our data, we have time groups. A simple example of time groups: you have a chain of stores. Each store is selling a lot of products, so each combination of store and product creates a unique time series. And obviously, it is beneficial not to build a separate model for each store and product combination, because in that case you won’t be able to leverage the information about other product sales and other store sales in general.
So in Driverless AI, we actually build a global model, which can leverage information about other stores, product sales and sales dynamics. That can be quite helpful, especially if you have a cold start. Let’s say you launch a new product in a specific store. You don’t have information about this product’s sales in this particular store, but you can assume this product is going to sell at the same pace and with the same dynamics you witnessed in another store in your chain. So this information can be leveraged in Driverless AI as well. So let me switch to …
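As a rough illustration of the time groups described above, the sketch below splits a table of rows into one series per store and product combination. It is not how Driverless AI does this internally; the function `split_into_series` and the column names `store`, `product`, `week` and `sales` are hypothetical names chosen for the example.

```python
from collections import defaultdict

def split_into_series(rows, group_cols, time_col, target_col):
    """Group rows by the grouping columns and return one time-ordered series per group."""
    series = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        series[key].append((row[time_col], row[target_col]))
    # Sort each group by time and keep only the target values.
    return {key: [value for _, value in sorted(points)] for key, points in series.items()}

# Hypothetical rows: two stores selling the same product.
rows = [
    {"store": 1, "product": "A", "week": 2, "sales": 12},
    {"store": 1, "product": "A", "week": 1, "sales": 10},
    {"store": 2, "product": "A", "week": 1, "sales": 7},
]
series_by_group = split_into_series(rows, ["store", "product"], "week", "sales")
print(series_by_group)
```

A global model, as described in the talk, would be trained on all of these groups together rather than on each one separately.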
A quick question.
The time groups, so is it either cluster the difference [inaudible] properties? Or …
That’s something that I would say it can do for you as well. So yeah, clustering is one of the features provided.
It automatically detects the time group columns, so if you have stores and departments, it will automatically find that once the data is grouped by those columns, each signal is its own series. But obviously it doesn’t work as well if you have 5,000 columns and 600 group columns; that’s a little bit harder. So you can actually provide the grouping columns yourself. Usually it’s only a handful, right? Usually it’s store, department or city, and …
Yeah, that’s exactly what I’m going to show. So basically I have a data set, it’s a [inaudible] data set which you can download from Kaggle, and this data set has two grouping columns. You have a store ID, with 45 stores in the data set, and you have a product ID, with 99 product IDs in the data set. You also have a time column, and this data set is basically weekly sales for two years, from 2010 to 2012. So it is two years of weekly sales data, and weekly sales is the column you are asked to forecast. So what I do is just click on predict, as you already did before. I pick the target column, but I’m also selecting the time column. In our case, it’s a column named date.
I’m also picking the test data set, and if I do have a test data set, it can automatically identify the forecast horizon I would like to predict. Because I have 26 weeks in the test data, it automatically suggests building a prediction for 26 weeks ahead. Well, in my case, that’s just too much, so I actually put the number four here.
As for the gap: in a lot of practical scenarios, the most recent data is actually not available to you, right? Let’s say you have a data gathering process and a data cleaning process, and the pipeline can be quite complicated. That could mean the most recent data, say last week, might not be available to you yet at the time you have to do your forecast. That is what we call the gap. In our case it can be one week, for example. And after that, you basically want to launch an experiment. So what happens under the hood?
So inside Driverless AI, obviously we try to mimic the same split we have between train and test, within the parameters you set up. First of all, we transform the date column into integers. So basically we create integer bins. Each bin basically means the next point in time. Because we also assume … I mean, one of the biggest assumptions for time series is that your points in time are equidistant, right? So you have exactly the same distance between your time points. In our case, it’s weekly sales, which means between each point in time we have exactly one week of difference. So that allows us to use integers as identifiers as well, because the distance between integers is exactly the same anyway.
So as I mentioned before, we’re trying to mimic the train and test setup using our training data. That means inside it creates a lot of splits following exactly the same pattern you have for test. For example, if you asked to predict two weeks ahead, given a one-week gap, that’s the setup we’re trying to mimic as well. And because we use XGBoost, or tree-based models in general, for time series forecasting, we do the following trick. We extract a lot of features from the time series and create a new data set to train classical machine learning algorithms on, as I mentioned before … But that also creates a lot [inaudible] difficult to create features from a time series in that scenario. So you have to be very careful about, for example, which lag sizes you can use for your features, right? Your lags should always be available and never reach into the gap period or the validation period, the period you are asked to forecast.
You can have two major types of features being created. One type is when you extract information from your timestamp. For example, from a date you can extract what day of the month it is, what month it is, what year, what weekday, and so on and so forth. You can also create a custom transformer. I think we have a couple of transformers available, each with holidays for different countries. For example, you can have a transformer which tells you whether a date is a holiday in Germany or not.
The next type of features requires a target column: it can be a moving average to smooth the timeline, it can be an exponential [inaudible] average, and so on and so forth. After that, these features go into the model and it is trained on them.
There’s a slide about that as well, actually. So this is an extremely important thing to check, right? Most of the models operate under the assumption that time series are stationary. That means you have to at least remove the trend and understand what type of seasonality you have, whether it is multiplicative or additive seasonality, to be able to handle the data. I’ll talk about that later. Again, in the recent [inaudible] DAI, what has been done is that you can now have an out-of-fold prediction. In order to do that, we apply a rolling window to the training data and slowly roll through the available training data to get you some sort of out-of-fold prediction. Obviously, by doing that, we cannot create out-of-fold predictions for the complete training part of the time series. In this example, we’re not able to produce out-of-fold predictions for the start of the time series, but we can do it from a later point, maybe point number five.
Also, in the example I provided in the interface, I have 26 weeks of test data, but I’m asking for a prediction four weeks ahead. That means I have enough data to create a rolling prediction for the test part, right? And there is a major downside to this approach, which we had before: you have to recalculate all the features for the whole process in order to build a really robust and high-performance model. That can be done in two different ways. In the first solution, which is faster, for each [inaudible] horizon we are asked to predict, we basically slowly update the transformers available to get to the final test.
So we don’t recalculate the model; we just recalculate the features we send into the model and score the model after that to get the predictions. In the second approach, we can actually retrain the model as well, which is slower and more time consuming, but more precise. Again, it depends on the scenario you have in your case. As I mentioned, we have a Build Your Own Recipe approach, so you can write a custom model for time series, and you can write custom transformers. It’s pretty much the same way you write a custom transformer or model for regular Driverless AI, but it also requires you to have information about the grouping columns and the time column, and I think that’s about it, actually. Also, in the test data there could be a lot of columns available, and in that case it would be nice if you’re able to highlight which columns are actually available at the time of a forecast.
Because, for example, information about the day of the week is always available, right? But the current oil prices, for example, might not be available in time for forecasting. So that’s something which would be nice to have at training time as well. As for stationarity, a simple case would be this: if your data has a trend, building a model based on tree ensembles without detrending the data might not be a very good idea, because your [inaudible] has no idea about the concept of a trend at all. That means you have to somehow remove the trend before getting the data into the model.
There are a couple of ways to do it. You can fit a linear model and remove the trend after that, or you can difference the time series, and the differencing can be applied at as many levels as you want. Here is an example to illustrate what I mean. Without detrending, even for a very simple function, the model fails to predict the trend; it basically stays constant after some level, because these values, these spikes, these maximums are outside of the distribution the model has been trained on. So in order to fight this, you first remove the trend, then use the new data to predict the outcome, and then you put the trend back.
Also, for time series analysis it’s quite important to provide prediction intervals, and that’s something that will be done pretty soon. So then you will have not just a point estimate, but also a confidence interval around your point estimate. And on the last slide, last but not least, we should say kudos to the main masterminds behind Driverless AI time series, who are Marios Michailidis, known as KazAnova on Kaggle, and Faron, whose name is Mathias [inaudible 00:16:08]. These guys are both quite experienced Kagglers, they have a lot of experience in time series in particular, and basically the whole approach we’re using in Driverless AI time series is thanks to them. So that’s all I have. If you have any questions, I will be around, so you can ask me anything you want. Thank you.
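To make the horizon and gap settings concrete, here is a minimal sketch of the arithmetic, assuming time points are already numbered with consecutive integers. The helper `forecast_window` is a hypothetical name for this example and is not part of the Driverless AI API.

```python
def forecast_window(last_train_index, gap, horizon):
    """Return the integer time indices to forecast.

    The `gap` periods right after the end of training are skipped because
    their data is assumed not to be available yet; the following `horizon`
    periods are the ones to predict.
    """
    first = last_train_index + gap + 1
    return list(range(first, first + horizon))

# Training data ends at week 100, the latest week is not available yet
# (gap of 1), and we want to predict 4 weeks.
print(forecast_window(100, gap=1, horizon=4))  # [102, 103, 104, 105]
```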
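The conversion of dates into integer bins can be sketched as follows. This is only an illustration under the assumption of a fixed weekly period; the function name `to_time_bins` and the parameter `period_days` are made up for the example.

```python
from datetime import date

def to_time_bins(dates, period_days=7):
    """Map each date to an integer bin counted from the earliest date."""
    origin = min(dates)
    return [(d - origin).days // period_days for d in dates]

# Weekly observations with one missing week (2010-02-26).
weekly = [date(2010, 2, 5), date(2010, 2, 12), date(2010, 2, 19), date(2010, 3, 5)]
bins = to_time_bins(weekly)
print(bins)  # [0, 1, 2, 4]
```

Because the bins are equally spaced, a missing observation shows up as a skipped integer (bin 3 here) rather than silently shifting the later points.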
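The two ideas in this passage, validation splits that mimic the gap and horizon, and lags that never reach into unavailable data, might look roughly like the sketch below. The functions `rolling_splits` and `min_safe_lag` are hypothetical. The rule that the smallest safe lag is gap plus horizon is an assumption that holds when a single model predicts every point in the horizon; it is not a statement about Driverless AI internals.

```python
def rolling_splits(n_periods, gap, horizon, n_splits):
    """Build (train_indices, valid_indices) pairs, newest split first.

    Each split leaves `gap` unused periods between the end of training
    and the start of validation, and validates on `horizon` periods.
    """
    splits = []
    valid_end = n_periods  # exclusive upper bound of the validation block
    for _ in range(n_splits):
        valid_start = valid_end - horizon
        train_end = valid_start - gap  # exclusive upper bound of training
        if train_end <= 0:
            break  # not enough history left for another split
        splits.append((list(range(train_end)), list(range(valid_start, valid_end))))
        valid_end = valid_start
    return splits

def min_safe_lag(gap, horizon):
    """Smallest lag that never reads from the gap or the forecast period."""
    return gap + horizon

splits = rolling_splits(n_periods=10, gap=1, horizon=2, n_splits=2)
for train_idx, valid_idx in splits:
    print("train up to", train_idx[-1], "validate on", valid_idx)
print("smallest usable lag:", min_safe_lag(gap=1, horizon=2))
```

With a one-period gap and a two-period horizon, the farthest validation point sits three steps after the last training point, so a lag of 1 or 2 would read values that are not available at forecast time.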
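A minimal sketch of the timestamp-based features mentioned here, using only the Python standard library. The function `date_features` and the hand-written holiday set are assumptions for illustration; the holiday transformers in Driverless AI are not shown.

```python
from datetime import date

def date_features(d, holidays=frozenset()):
    """Extract simple calendar features from a single date."""
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "weekday": d.weekday(),  # Monday is 0, Sunday is 6
        "week_of_year": d.isocalendar()[1],
        "is_holiday": d in holidays,
    }

# Hypothetical holiday calendar containing a single date.
example_holidays = {date(2010, 12, 25)}
features = date_features(date(2010, 12, 25), example_holidays)
print(features)
```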
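Target-based features of this kind could be sketched as below: a plain lag, a moving average over lagged values, and an exponentially weighted average. The function names are hypothetical, and the exact feature set used by Driverless AI is not specified in the talk.

```python
def lag(values, k):
    """Value observed k steps earlier, or None when it does not exist."""
    return [values[i - k] if i >= k else None for i in range(len(values))]

def lagged_moving_average(values, k, window):
    """Mean of the `window` values ending k steps before each position."""
    out = []
    for i in range(len(values)):
        end = i - k + 1  # exclusive end; the last value used is values[i - k]
        start = end - window
        out.append(sum(values[start:end]) / window if start >= 0 else None)
    return out

def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = [values[0]]
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10, 20, 30, 40, 50]
print(lag(sales, 2))                       # [None, None, 10, 20, 30]
print(lagged_moving_average(sales, 1, 2))  # [None, None, 15.0, 25.0, 35.0]
print(ewma([10, 20, 30], alpha=0.5))       # [10, 15.0, 22.5]
```

In practice the smoothed series would also be lagged before use, so that a feature at time t never includes the target at time t.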
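The out-of-fold idea can be illustrated with a small sketch. It uses an expanding training window and a trivial last-value model, both of which are simplifying assumptions for the example; `rolling_out_of_fold` and `last_value_model` are hypothetical names, not Driverless AI functions.

```python
def rolling_out_of_fold(values, min_train, horizon, fit_predict):
    """Produce out-of-fold predictions by rolling forward through the series.

    The first `min_train` points never receive a prediction because there is
    no earlier data to train on.
    """
    oof = [None] * len(values)
    start = min_train
    while start < len(values):
        stop = min(start + horizon, len(values))
        # Train on everything before `start`, predict the next block.
        oof[start:stop] = fit_predict(values[:start], stop - start)
        start = stop
    return oof

def last_value_model(history, steps):
    """Toy model: repeat the last observed value."""
    return [history[-1]] * steps

series = [5, 6, 7, 8, 9, 10, 11]
oof = rolling_out_of_fold(series, min_train=3, horizon=2, fit_predict=last_value_model)
print(oof)  # [None, None, None, 7, 7, 9, 9]
```

The leading `None` values correspond to the start of the series, for which no out-of-fold prediction can be produced.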
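The remove, predict, restore sequence can be sketched with an ordinary least-squares line, plus first-order differencing as the alternative. All function names here are hypothetical, and the residual forecast is left as a placeholder of zeros where a tree-based model would normally be used.

```python
def fit_line(y):
    """Least-squares slope and intercept of y against its index 0..n-1."""
    n = len(y)
    x_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (v - y_mean) for x, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def detrend(y):
    """Remove the fitted linear trend and return residuals plus the line."""
    slope, intercept = fit_line(y)
    residuals = [v - (slope * x + intercept) for x, v in enumerate(y)]
    return residuals, slope, intercept

def retrend(residual_preds, start_index, slope, intercept):
    """Add the linear trend back to predictions made on detrended data."""
    return [r + slope * (start_index + i) + intercept for i, r in enumerate(residual_preds)]

def difference(y):
    """First-order differencing; apply repeatedly for higher orders."""
    return [b - a for a, b in zip(y, y[1:])]

history = [2 * x + 1 for x in range(10)]  # a pure linear trend
residuals, slope, intercept = detrend(history)
# Placeholder: a real model would forecast the residuals here.
residual_forecast = [0.0, 0.0]
forecast = retrend(residual_forecast, start_index=len(history), slope=slope, intercept=intercept)
print(forecast)
print(difference([1, 3, 5]))  # [2, 2]
```

Because the trend is added back after prediction, the forecast can exceed the largest value seen in training, which a tree ensemble trained on the raw series cannot do.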