There is an arms race happening in the Data Science and Machine Learning space. It’s the race toward automation. Granted, the questions we as Data Scientists are asked to answer will never be automated, but many of the routine tasks will be. What are these routine tasks? They range from data ingestion to feature generation. Then there are leaderboards (model bakeoffs), LIME interpretation, and REST APIs.
Confused yet? You’re not the only one. Here’s a no-nonsense write-up on what you need to know about the Automated Modeling space.
Automated modeling is just a fancy term for making the lives of data scientists, engineers, IT administrators, and anyone else in the analytics space easier by automating processes. The IT space has done an amazing job of automating server spin-ups, logins, and security. So why shouldn’t we in the AI field do the same?
With the advent of open source data science tools (R, Python, H2O, Spark, etc.), Data Scientists have started extracting insights locked away in data. They’ve parsed through Hadoop clusters, databases, and even spreadsheets. This has all been fantastic, but it’s the next step that’s always been frustrating.
It’s pretty easy to automate loading data, and even ‘munging’ and cleaning it, but for the longest time the hard part was iterating over many models and optimizing them for the best performance possible. And once that was done, how the heck would you use that awesome model to produce actionable results?
Model building, whether automatic or manual, is hard. You load your data in, choose how to split it, select an algorithm, run it, and look at the results. By results, I mean some performance metric like AUC (Area Under the Curve) or RMSE (Root Mean Square Error), both important metrics for gauging your progress. Then you might select a different algorithm, run it again, and compare and contrast the results. Doing this ‘model bakeoff’ process manually can take days to weeks.
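To make the bakeoff idea concrete, here’s a minimal sketch in plain Python. The models and data below are invented for illustration: score each candidate model’s held-out predictions with RMSE and rank them into a simple leaderboard.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: lower is better."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Held-out targets and each candidate model's predictions on the same rows.
y_true = [3.0, 5.0, 7.0, 9.0]
predictions = {
    "model_a": [2.5, 5.5, 6.0, 9.5],
    "model_b": [3.1, 4.8, 7.2, 8.9],
}

# The "leaderboard": every model ranked by its held-out score, best first.
leaderboard = sorted((rmse(y_true, p), name) for name, p in predictions.items())
for score, name in leaderboard:
    print(f"{name}: RMSE = {score:.3f}")

best_model = leaderboard[0][1]
```

A real bakeoff swaps the canned predictions for actual trained models and adds more metrics (AUC for classification, for instance), but the ranking loop is the same idea.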
Depending on the initial results, you might start thinking about ‘enriching’ the dataset with new features. What exactly is a feature? A feature is just a fancy term for a column of data; here, a new column generated from the existing data.
For example, let’s say you have a list of house sales with their prices, square footage, dates, and states. You can create a new column in your dataset by dividing the sale price by the square footage. Your new feature is Sales($)/SquareFoot. You can go even further: now that you have Sales($)/SquareFoot, you could create another feature that’s the average Sales($)/SquareFoot per state! Then you feed the newly enriched dataset back into your model and see the results.
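The two features above can be sketched in a few lines of plain Python (the rows and values are made up for illustration):

```python
from collections import defaultdict

sales = [
    {"state": "CA", "price": 500000, "sqft": 2000},
    {"state": "CA", "price": 750000, "sqft": 2500},
    {"state": "TX", "price": 300000, "sqft": 2000},
]

# Feature 1: price per square foot, one new "column" per row.
for row in sales:
    row["price_per_sqft"] = row["price"] / row["sqft"]

# Feature 2: average price per square foot in each row's state.
by_state = defaultdict(list)
for row in sales:
    by_state[row["state"]].append(row["price_per_sqft"])
state_avg = {s: sum(v) / len(v) for s, v in by_state.items()}
for row in sales:
    row["state_avg_price_per_sqft"] = state_avg[row["state"]]
```

In practice you’d do this over a real DataFrame, but the idea is identical: each derived column is just new data computed from columns you already have.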
Generated features will usually, but not always, improve your model’s performance. This is where Automated Feature Generation can really help. Driverless AI automates this process and makes even more advanced feature generation techniques available. The automated feature generation in Driverless AI squeezes out more model performance, which your organization can measure directly in financial terms.
It’s one thing to generate hundreds or even thousands of new features automatically, but it’s another to evaluate them all with different algorithms over and over again. This is where highly tuned and powerful algorithms come into play to quickly, efficiently, and effortlessly build models and compare them.
H2O.ai has been doing model bakeoffs for years via our native open source AutoML offering. AutoML is a fast and simple way to train many different types of models (e.g. Distributed Random Forest or XGBoost) via our H2O Flow platform or a single line of R or Python code. This simplicity was ported over to Driverless AI and wrapped in a great GUI.
Now, in a very visual way, you can watch how different models with different features evolve over time and are optimized toward whatever performance measure you’re after.
Tuning the models is another complex process that squeezes out even more performance. This is called parameter tuning. Every algorithm tried has certain parameters (e.g. tree depth or regularization strength) that need to be turned on or off, or set somewhere within a range, to make the model perform optimally.
You can do all this by hand, and for many years Data Scientists did, until they began to programmatically search for the best parameters using techniques like Grid Search. H2O’s AutoML and Driverless AI have these features built in and automated, freeing you from iterating over the models by hand.
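A grid search is simple to picture: try every combination of parameter values and keep the best-scoring one. Here’s a toy sketch, with a hypothetical `validation_score` function standing in for “train a model with these parameters and return its validation score”:

```python
import itertools

def validation_score(max_depth, regularization):
    # Stand-in for training and scoring a real model; this toy surface
    # has a known best at max_depth=5, regularization=0.1.
    return -((max_depth - 5) ** 2) - 10 * (regularization - 0.1) ** 2

grid = {
    "max_depth": [3, 5, 7, 10],
    "regularization": [0.0, 0.1, 1.0],
}

# Try every combination and keep the best-scoring one (higher is better here).
best_params, best_score = None, float("-inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = validation_score(**params)
    if score > best_score:
        best_params, best_score = params, score
```

The catch is cost: the grid above is only 12 combinations, but real grids explode combinatorially, which is exactly why automating (and smartly pruning) this search is worth handing to a platform.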
Crucial. A must. Important. Don’t bet your career on NOT doing this. How can you be certain that the best model you iterated over, tuned, and optimized really makes sense? Can you explain to your management why the model predicts that Mr. X will default on his loan but not Ms. Y, even though they have the same income and live on the same street?
Explaining the reason why has always been a problem unless you used simpler models like Linear Regression or Decision Trees. How the heck do you explain the non-linear results of a Gradient Boosted Machine to your management? Recent advances in applying model interpretation techniques like LIME (Local Interpretable Model-Agnostic Explanations) and Shapley values for machine learning interpretability (MLI) have shed a great deal of light on the black-box nature of some of these complex algorithms.
Driverless AI applies these techniques and more, letting users drill down into groups/clusters of predictions and then applying simple linear techniques to explain why Mr. X is predicted to default on his loan.
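For intuition, here’s a minimal LIME-style sketch in plain Python. This is not the actual LIME library or Driverless AI’s implementation, and the black-box function is made up; it just shows the core trick: perturb the instance, weight samples by proximity, and fit a local weighted linear model whose slope “explains” the prediction at that point.

```python
import math
import random

def black_box(x):
    # A made-up nonlinear model we want to explain locally.
    return 1 / (1 + math.exp(-(x - 2)))

def local_explanation(f, x0, n=200, width=0.5, seed=0):
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0, width) for _ in range(n)]   # perturbed samples
    ys = [f(x) for x in xs]                             # black-box outputs
    # Proximity weights: perturbations near x0 count more.
    ws = [math.exp(-((x - x0) ** 2) / width ** 2) for x in xs]
    # Weighted least squares for y ≈ a + b*x (closed form).
    total = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / total
    my = sum(w * y for w, y in zip(ws, ys)) / total
    b = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys)) / \
        sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    a = my - b * mx
    return a, b  # local intercept and slope

intercept, slope = local_explanation(black_box, x0=2.0)
```

The slope tells you, in plain linear terms, how much the prediction moves per unit change of the input near Mr. X’s record, even though the underlying model is anything but linear. Real implementations do this over many features at once and report per-feature contributions.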
Great. You’ve successfully automated feature generation, model building, and optimization, and you’re satisfied with the interpretation that Driverless AI gave you. Now what?
Simple: put that model into production. After all, what good was all this effort if you can’t realize its gains? Since the beginning of open source H2O, taking a trained and tested model into production has always been easy. Both H2O and Driverless AI let the user download a MOJO (Model Object, Optimized) automatically. It’s prebuilt Java code that you can hand to I.T. to put into production without having to recode your work.
This has always been a big ‘time suck’ between the Data Science/Analytics group and I.T. While you’re busy learning all the statistical relationships in the data using R or Python, how does I.T. then take your results and recode them into Java? The MOJO was created to alleviate this issue. Build your model, export the MOJO from Driverless AI (or H2O), and give it to I.T. – Done!
What about exposing a MOJO through a REST API? That’s also available in H2O and Driverless AI. You can expose a MOJO as a REST API and start scoring within minutes.
The MOJO is meant to be flexible and easy to use in production, without the need for a special scoring server or uploading your data off-premises to score in the cloud. – Done again!
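To show the general shape of REST scoring (this is a standard-library sketch, not H2O’s actual scoring service), here’s a tiny endpoint wrapping a stand-in `score` function where a real deployment would invoke the MOJO:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def score(features):
    # Hypothetical scoring logic; a real deployment would call the MOJO here.
    return {"prediction": sum(features) / len(features)}

class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, and return JSON.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(score(payload["features"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet

def make_server(port=0):
    # port=0 asks the OS for any free port; pass a fixed port in production.
    return HTTPServer(("127.0.0.1", port), ScoringHandler)
```

Running `make_server(8080).serve_forever()` would then let any client POST `{"features": [...]}` and get a prediction back; the point is that the scoring logic travels as one self-contained artifact behind one HTTP endpoint.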
That quote is from Basho, a 17th Century Japanese Haiku master. I often think about it when I’m in the middle of something confusing and I seek clarity. I’m reminded to blaze my own trail – not because it is demanded – but for myself. I do it to seek truth. To see clearly.
H2O.ai brings that clarity to the confusing AI space in a very simple way: create world-class algorithms, build an open source platform, and create an automated modeling platform. Our algorithms are used by over 14,000 organizations and by other software vendors ‘under the hood.’ We took all that knowledge and poured it into our automated modeling platform. We asked our Kaggle Grandmasters: if you had the chance to build the best automated modeling platform in the world, what would it look like?
The answer? Driverless AI.