September 27th, 2019
Predicting Failures from Sensor Data using AI/ML — Part 2RSS Share Category: H2O Driverless AI, Recipes, Technical
By: Karthik Guruswamy
This is Part 2 of the blog post series and continuation of the original post, Predicting Failures from Sensor Data using AI/ML — Part 1.
Missing Values & Data Imbalance
One of the things to note is that the hard-disk data set has a lot of missing values across its columns. Check out the Missing Data Heat Map on the training data set — Derived from Auto-Viz in Driverless AI. From the picture below, one can tell that a majority of sensor data is missing or incomplete – the red color in the aggregated chart indicates missing data. Where it’s incomplete, one can easily guess, it might be that not all hard-disk vendors agree to generate sensor data for a S.M.A.R.T sensor variable.
When I tried to build a base AI/ML model in Driverless AI, I got a notification that it automatically dropped these 19 columns because of empty or constant values.
I also got this notification:
Driverless AI 1.7.x does sampling by default for imbalanced data for every iteration to achieve good overall accuracy.
IID Model or Time Series AI/ML Model?
We can build an IID (Independent and Identically Distributed) model and treat all the rows as independent. We can also treat the data as time series, as sensor data is available for each model/serial # every day. I will build a Time Series Classification model in this blog. Since the data is highly imbalanced and it’s a binary classification experiment, I’ll use LogLoss as the scorer to optimize the model. We can always check AUC, AUCPR, etc., once it’s done.
I set a high value of 8/8 for Accuracy (Model tuning effort) and Time (>> Iterations/Early stopping only after 20 iterations) and set Interpretability to 5 (Medium Feature Engineering). Driverless AI can automatically do the shift detection and drop overfitting columns automatically using some AUC threshold – but I disabled the shift detection in Expert Settings.
What is Shift Detection? Detecting distribution shifts of variables between training and test sets can avoid overfitting in models. Driverless AI by default does this to prevent overfitting. It’s an optional feature, and you can always turn it off, like I did here.
I set the Target Column to “failure”, the time column to “date”, and the forecast horizon to “7 days”. The assumption is that the data center can find the spare drive and get it ready to replace it within 7 days, before the failure happens. You can also change this for 1,15, 30 days etc ., – whatever days you want to forecast that suits the predictive maintenance task at hand.
The plan says that from 67 columns, it is going to build 4K features with 8 features being picked for the final pipeline after 208 iterations of model tuning/feature engineering. There is more on the screen that you can read.
Driverless AI allows users to add custom feature engineering or models to the evolutionary model/feature finding process. It’s using a feature called “Bring Your Own Recipe” or BYOR. I uploaded the following feature recipes from this GitHub location, so the experiment can try using it and see if it adds value in feature transformations (besides the default ones):
- firstNCharsCVTE — This recipe takes the text columns, such as model and serial_number, and does a substring of 1,2,3,4 … characters, and then does Cross Validated Target Encoding.
Final Results with Time Series Model
After running for several hours on a single GPU box, Driverless AI built 2.4K models on 4K features and shows you the final result!
While the AUC on the training/validation is 0.9713, the AUCPR looks much more reasonable with a 0.26 score, given the 1:1000 imbalance in the training data set. This 0.26 value is for the micro-averaged value across cross-validation results on the training data set.
Clearly the BYOR recipes such as FRST[N]CHARCVTE come out on top at different positions. The initial characters of the hard-disk model name are a great feature that gives us good predictability, it would seem. The default feature engineering in Driverless AI, such as Cross ValidatedTarget Encoding, Numeric to Categorical Target Encoding, is useful in the prediction.
We can also see that the original columns, such as:
etc., are either appearing on their own or getting feature engineered in interesting ways to create derived features to maximize the prediction score. It’s not surprising that the logical blocks read from hard-disk and the errors reported by the smart sensor are correlated to a hard-disk failure!
So How Accurate Was the Model When It Predicted the Test Data Set?
Even though the training AUCPR was pretty impressive, the test set confusion matrix is a bit more realistic on what you can expect in production deployment.
I didn’t spend a lot of time (but plan to in near future) here — I probably did some 10–15 models and arbitrarily split the training/test without giving it much thought. You can, however, see for the high data imbalance of 1:1000 failures in the training data set, we are predicting 8 out of a total of 31 hard-disk failures in the test data set, which is roughly 25% failures, with a ~ 50x false positives. The above # is based on a threshold value of prediction probability. Changing that can get us the desired tradeoff between false positives/false negatives, etc. It’s also interesting to note that we are predicting 99.99% correctly on rows that don’t indicate failures correctly, which is not surprising how a majority class generally dominates in predictability with imbalanced data.
What I Did Not Do Yet:
There are a lot of Expert Settings options in Driverless AI that give us control over Imbalanced Sampling, how to generate Hold Out Predictions, adding more algorithms, vendor-specific feature engineering recipes, etc. I forgot to remove the Date Transformers – the model might overfit on that, as the day of the week, etc., is showing up in feature importance currently. So the model can be definitely improved over time with additional tweaks — the goal being to reduce false negatives first with an additional goal of lowering false positives.
Citation: The Hard-disk failure data used in this blog post and the previous one is from BackBlaze.com.