Scalable Automatic Machine Learning with H2O
This session was recorded in NYC on October 22nd, 2019.
In this presentation, Erin LeDell (Chief Machine Learning Scientist, H2O.ai) will provide a history and overview of the field of "Automatic Machine Learning" (AutoML), followed by a detailed look inside H2O's open source AutoML algorithm. H2O AutoML provides an easy-to-use interface which automates data pre-processing, training and tuning a large selection of candidate models (including multiple stacked ensemble models for superior model performance). The result of the AutoML run is a "leaderboard" of H2O models which can be easily exported for use in production. AutoML is available in all H2O interfaces (R, Python, Scala, web GUI) and, due to the distributed nature of the H2O platform, can scale to very large datasets. The presentation will end with a demo of H2O AutoML in R and Python, including a handful of code examples to get you started using automatic machine learning on your own projects.
Bio:
Dr. Erin LeDell is the Chief Machine Learning Scientist at H2O.ai, the company that produces the open source, distributed machine learning platform, H2O. Before joining H2O.ai, she was the Principal Data Scientist at two AI startups (both acquired), the founder of DataScientific, Inc. and a software engineer at a large consulting firm. She received her Ph.D. from UC Berkeley where her research focused on machine learning and computational statistics. She also holds a B.S. and M.A. in Mathematics.
Read the Full Transcript
Erin:
All right, so we're a few minutes behind, but I'll try to keep it on schedule; we have about 25 minutes or so. So yeah, we just heard a lot about automatic machine learning in the earlier talks, in particular how we do automatic machine learning in Driverless AI. But what this talk is about is, well, it's about automatic machine learning in general, but then I'm going to speak a little bit more about how we do that inside of the open-source H2O library.
Okay. So, the beginning part of the talk is a little bit more broad, about AutoML in general. Here are a few things that I think are goals and features of AutoML. Although the term, or I guess the field, of AutoML has been around for quite a while, it's only recently become popular in enterprise and in actual use cases. It's been a research field for a long time.
So, as for what I would say are some of the more modern goals and features of AutoML, I think one of the main goals is to train a good model in the least amount of time. That could be in terms of developer time and user time, writing code, but also in terms of just how much computation time it takes to get a very good model.
So that's a very succinct goal of AutoML. In doing that, we can look at other things like, what does that mean? It means reducing the human effort involved. So maybe we're writing less code, investigating fewer methods on our own, that type of thing. The effort in training models goes down, but it can also bring down the expertise required. So I think that AutoML is actually a tool for people of all abilities: people who are new to machine learning, but also, as you've seen with a lot of the talks earlier today, there are some very advanced topics related to AutoML.
So I think it's really a goal for everyone, but it does kind of open the doors a little bit to people that are less skilled or less experienced with machine learning, and maybe haven't had, you know, 10 years of being a data scientist and having exposure to all these different types of models. So I think that's a feature of AutoML.
And then another feature is, generally the models that you're producing are going to be better than, you know, if you just sort of pick a data scientist out of the room. Maybe if it's a room of Kaggle grandmasters, it's a different story, but you know, just let's say the average data scientist. Hopefully with AutoML, we can get a better model than, like, the average person. So, one of the other goals is to just increase the overall performance of the models that you're producing.
And I think this last one is maybe just a side feature, but one of the nice things that I've seen so far, at least in use cases around AutoML, is that it creates a nice baseline for you to start your problems. So, whether that's in research, like in a scientific domain, or in industry, you could just pick an AutoML tool, run it, and see where you're at, and that could be a good baseline or a good starting point. So it establishes baselines, but it also increases reproducibility. You know that if you have a team of, let's say, 20 data scientists, and you hand a new data set or a new problem to any one of those data scientists, you're probably going to get a different solution depending on what their skill set is, what their experience is, which algorithms they prefer to use.
A lot of people have pretty strong preferences about algorithms. They really like GBMs or Random Forest, but maybe those aren't the best for, you know, all problems. So I think another nice feature is that you know that if you hand it to someone on your team in a team of 20, and they all just try AutoML as the first thing, you're going to have some reproducibility and baselines to work with. So I think these are all nice features of AutoML.
Next, I'm just going to explain in a little more detail the different parts of machine learning, I guess of automatic machine learning, but really machine learning in general. So, before we start training models, or sometimes as part of the model training process as well, there are different types of data preparation, and feature engineering could be included in that. Then we usually need to generate a large number of models if we want to get a good model.
You know, it's hard to … there are some newer techniques where you can try to predict what the performance of a model might be, just based on machine learning, so like many layers of machine learning, but in general what people do is train a large number of models and then pick a good one out of that. Then an optional third piece at the end is, if your goal is really to get the best model possible, something that's done, and you see this in Kaggle all the time, is creating stacked ensembles or different types of ensembles, because that's really the best way to get the best performance. That's maybe not always your goal, but if it is, it's a helpful thing to do.
So, we've been talking a lot about AutoML today, and most of what we talk about in general at H2O is tabular data, but that could also include things like time series. Then there's a whole other set of problems with image classification, language, text. So, all of those fields can kind of be separated in terms of AutoML by the data type; there are different types of techniques that you would use for, let's say, image classification for AutoML versus just standard industry text data. If you want to learn about all of those different techniques, I can refer you to this blog post that I wrote about a year ago, which talks about all the different types of AutoML, because now we have this term that's pretty generic, but it refers to a lot of different things.
And so, some of the research that is popular right now is something called neural architecture search, but that's a type of AutoML that's really only applicable to deep learning. It's good to kind of have a good overview of the landscape. Also, I talk about different software tools that do each of these things in this article as well, so if you're interested, you could bookmark that.
So here's kind of the three topics that I brought up before. So let's just say a little bit more about them. So for data preprocessing, that could include very simple things that are basically trivial to automate, like replacing missing values or standardization or different types of encodings. That could also include things like feature selection or feature extraction. And this last one is quite important if you have categorical data, and especially if you have categorical data with a lot of categories in a particular column. Let's say you have a zip code or something like that, or if you work in healthcare, like ICD-9 or ICD-10 codes, things like this. It's very important to do some sort of preprocessing to these fields.
So of course, if we're trying to automate machine learning, we have to address all of these things, even if they're hard. Some of these things are a little bit harder than others, and some are a bit more of an art than a science, so you can get pretty creative in the data preprocessing section. Actually, this is one of the key differentiators, I would say, between what I'll talk about today, open-source H2O AutoML, and Driverless AI: Driverless has a strong focus on different types of data processing and feature engineering. And you probably heard quite a bit about that earlier.
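To make the high-cardinality encoding idea above concrete, here is a minimal sketch of smoothed mean target encoding. This is not the H2O implementation; the column names, smoothing constant, and helper function are illustrative assumptions.

```python
# Minimal sketch of mean target encoding for a high-cardinality column
# (e.g. a zip code). Names and smoothing value are illustrative, not from the talk.
import pandas as pd

def target_encode(train: pd.DataFrame, col: str, target: str, smoothing: float = 10.0) -> pd.Series:
    """Replace each category with a smoothed mean of the target."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink rare categories toward the global mean to reduce overfitting.
    smooth = (stats["count"] * stats["mean"] + smoothing * global_mean) / (stats["count"] + smoothing)
    return train[col].map(smooth).fillna(global_mean)

# Usage (hypothetical frame and columns):
# df["zip_code_te"] = target_encode(df, "zip_code", "response")
```

In practice you would compute the encoding on out-of-fold data, so the target does not leak into the feature for the rows being encoded.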
In terms of model generation, some of the things that you might be familiar with are grid search or random search. That's a pretty straightforward way to get a bunch of models. There are also other techniques like Bayesian hyperparameter optimization, and a bunch of newer things: something called Hyperband, and something that combines Hyperband and Bayesian hyperparameter optimization called BOHB. So there are all sorts of new techniques for generating a bunch of models and tuning models.
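As a reference point, here is a hedged sketch of a random grid search with H2O's Python API. The file path, response column, and hyperparameter ranges are made up for illustration; they are not the ranges H2O AutoML actually uses.

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("train.csv")               # placeholder dataset
x = [c for c in train.columns if c != "response"]  # "response" is a placeholder target

# Illustrative search space for a GBM
hyper_params = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.7, 0.8, 1.0],
}
# RandomDiscrete = random search over the grid, capped at a model budget
search_criteria = {"strategy": "RandomDiscrete", "max_models": 20, "seed": 1}

grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator(nfolds=5, seed=1),
    hyper_params=hyper_params,
    search_criteria=search_criteria,
)
grid.train(x=x, y="response", training_frame=train)
print(grid.get_grid(sort_by="auc", decreasing=True))
```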
Then in terms of ensembles, the type of ensemble that I prefer myself, and I think works quite well, is called stacking. It's also called stacked ensembles or super learning; it's all the same thing. Then there's another type of ensemble called ensemble selection, which is just a different approach. It's more of a greedy approach to ensembling where you essentially keep adding models, or keep subtracting them out of your group, until the performance degrades. I'll speak a little bit more about stacking in a minute.
Right now, I just want to highlight some of the things that we have in open-source H2O right now. For data preprocessing, we have that first bullet here; we get all of that for free in H2O because it's already included in all the H2O algorithms. One of the things that we've been working on for a little while is automatic target encoding. That will go into H2O soon. We have target encoding as a function inside of H2O; the part we're working on right now is figuring out how to automate that and make it part of our AutoML pipeline. The other bullet is also on our roadmap. Then in terms of how we generate models and do ensembles, I'll speak a little bit more about that on the next slide.
This is the approach that we currently take, and this will possibly evolve over time as we get more and more features into H2O. Basically, H2O has a very large library with a lot of different algorithms and functionality, so as soon as something gets added to H2O, we can figure out a way to make that part of the automation process and put it into AutoML. We have this steady flow of new features being added into H2O. They're there for a little while, we test them out and make sure they're quite solid, and then we figure out a way to automate them. You've seen that with XGBoost, for example, an algorithm that we added maybe about a year and a half ago at this point. It took quite a bit of work to get that stable inside of H2O. Then, once we had it inside H2O and it was looking good for a while, we moved it into AutoML.
That's kind of our process for development. So right now, inside of H2O, we have random search, and we have stacked ensembles. Those two things actually work quite well together, and the reason for that is that stacked ensembles work well when you have models that fail in different ways, or, you could say more succinctly, a diverse set of base models. The base models are the models that you put inside of the ensemble, and that could be anything: a GBM, a Random Forest, a deep neural net, a GLM, whatever you want, you can put inside. The reason that stacking and random grids work well together is that when you randomly generate a set of models, they are sort of inherently diverse, they're different by random chance, so that combination works well together, and we've found that to be an effective, scalable and parallelizable way to do this.
Why don't I just say one more thing about stacking before I move on? I just want to say stacking is an interesting technique, because it actually uses machine learning to figure out how to best combine the base learners. So it's actually doing machine learning on top of machine learning. Sometimes they call that step metalearning, but it's a pretty nice thing. There are different ways you can come up with groups of models and ensemble them together, but this one is purely machine learning driven. You don't necessarily tell it how to do its job. It just learns from the data how to best combine those models to predict your outcome. That's why it works well, and that's why you see stacking winning all the Kaggle competitions essentially.
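For readers who want to see what stacking looks like in code, here is a hedged sketch using H2O's Python API. The file path, response column, and choice of base models are assumptions for illustration; H2O AutoML builds its stacked ensembles internally, so you would not normally write this step yourself.

```python
import h2o
from h2o.estimators import (H2OGradientBoostingEstimator,
                            H2ORandomForestEstimator,
                            H2OStackedEnsembleEstimator)

h2o.init()
train = h2o.import_file("train.csv")               # placeholder dataset
x = [c for c in train.columns if c != "response"]  # "response" is a placeholder target
y = "response"

# Base models must share the same folds and keep their cross-validation
# predictions, so the metalearner can be trained on out-of-fold predictions.
common = dict(nfolds=5, fold_assignment="Modulo",
              keep_cross_validation_predictions=True, seed=1)

gbm = H2OGradientBoostingEstimator(**common)
gbm.train(x=x, y=y, training_frame=train)

drf = H2ORandomForestEstimator(**common)
drf.train(x=x, y=y, training_frame=train)

# The metalearner (a GLM by default) learns how to combine the base models.
ensemble = H2OStackedEnsembleEstimator(base_models=[gbm, drf])
ensemble.train(x=x, y=y, training_frame=train)
```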
All right. So let me just say a little bit more specifically now about what we have in H2O AutoML. I'm doing a little bit of a review here, but essentially we have some pretty basic data preprocessing steps. We're adding some more advanced stuff, and that's going to be a focus in the future, just making sure that we're automating whatever we have in H2O. Then, once we've got the data cleaned up a little bit, we train random grids of all the different supervised machine learning algorithms that we have in H2O. We have GBMs, we have XGBoost GBMs, which is basically another implementation of GBM, and we have deep neural networks. We have generalized linear models. We have Random Forests, et cetera.
So, part of the contribution, I guess you could say, of H2O AutoML is that we've spent a lot of time thinking about how to allocate time between these different algorithms, and then what hyperparameters we should tune, and what ranges we should tune over. AutoML is sort of a game of trying to do better than brute force. We could just train every model in existence and call that AutoML. That could be an AutoML; it's just not very efficient. So, it's a game of how do we take a brute-force approach, but do it in a smart way, so that we're not wasting a lot of time in places that are not going to yield good results. That's what we think is the contribution of H2O AutoML: choosing all of that.
This is something where, of course, there's no right answer, no right combination that works on all datasets. We are carefully benchmarking the AutoML system across a lot of data sets, and when we make tweaks to things, let's say we think we need to improve the grid search for the deep neural nets, we would then benchmark that change across a lot of data sets and make sure that we're actually improving the algorithm overall. Benchmarking has become a very important piece in developing AutoML.
Then, we tune all the individual models to make sure they're not overfitting. This is particularly important with models like GBMs or even deep neural networks. Then, once we have some list of models ... well, you basically tell H2O AutoML how long you want it to run for, or how much effort you want to put in. When that time is done, we train a few stacked ensembles from your list of models. You could say to H2O AutoML, "I would like you to train 50 models," and then it would go off and decide which 50 models to train.
Then after that, it would train two stacked ensembles, one that has all 50 models in it, and that's usually going to be the best resulting model at the end. Then we'll also train something that we call the best-of-family ensemble, where instead of an ensemble of everything that you've trained, it's just picking the best out of each group. So, the best GBM, the best XGBoost, the best Random Forest, et cetera. That's more of a lightweight ensemble. The reason why we offer two is because, if you run AutoML for a day and it's trained a thousand models, then your best stacked ensemble might be a 1000-model ensemble, and that might give you the best performance, but it might not necessarily be the model that you want to put into production.
So, at H2O, we focus a lot on speed and on productionizing models; that's something we like to think about. So we offer you a lightweight alternative, where you get some of the benefits of stacking. You still get probably a better model than any of the single models that you have, but it's a little bit more lightweight, and you can put it into production. It's up to you, whatever your use case is, but we like to provide you with both, so we do that.
Then we return what we call the leaderboard. That's just a ranked list of all the models that you've trained throughout the process, sorted by some default metric. For example, if you're doing binary classification, the default metric we would use is AUC, and we would sort them all by cross-validated AUC. You can change the metric if you like. Of course, these are all just normal H2O models, so you can easily export them to production.
Okay. I'm just going to show you a few screenshots of the different interfaces that we have for AutoML. This is our web interface. I'm not sure if you've seen this; if you're an R person or a Python person, you might never have even seen our web interface before, but basically once you start up an H2O cluster, it will locally start up a web server. So you're actually running this locally. If you go to localhost:54321, by default you'll see this page on your machine. It's basically a dropdown; you can do everything by clicking. You can load data sets, you can split data sets, and you can train any kind of model you want. But one of the options is that you can run AutoML, and you just basically point and click, and you have all the features there.
Speaker 2:
Quick question?
Erin:
Yeah.
Speaker 2:
This is separate from driverless?
Erin:
Yep, this is all very … it's a completely different code base. Yeah.
Speaker 3:
Driverless can call AutoML if you want, as a recipe. So one of the 115 recipes is H2O models.
Speaker 2:
It would be good to know that the [crosstalk] …
Speaker 3:
Yes. H2O-3 is meant for distributed clusters of Java code, running production-grade model fitting, nothing else. So you can have a terabyte data set on a hundred-node cluster and fit one GBM on it. The God model, if you want. If you have good features, that's all you need. If you don't know what features to make, you need Driverless to make the features for you, right? And then you're stuck right now with a single-node solution. So, we have customers using a four-terabyte memory node with eight GPUs, and they're running 500-gigabyte data sets on that server. You can fit that still, right? It might even be faster, because it's GPU-enabled algorithms that are used in there, than having a hundred-node Hadoop cluster with a bad network. But this is meant to run on any system.
Any machines, any commodity hardware, Hadoop plus Spark clusters, big data; it's stable, it runs, and it does nothing else but fit models. It's not the automatic data scientist in a box, it's the automatic machine learning parameter tuning in a box, if you want. What comes out here is all Java. Every single thing that comes out here is production-grade Java code, just like in Driverless with the defaults. Driverless's defaults will also give you a Java production pipeline.
Now, if you start bringing in PyTorch and NLP models, then there might not be a Java version of that, right? So then you will not have Java production code; you'll only have C++ production code, which we implemented in the last few weeks. But let's say you don't have PyTorch or TensorFlow, you have something else, you have Caffe models. There might not be any standalone scoring pipeline for that; it will only be in Python or whatever you did to make it run. This is meant to be simple. It works. There are no questions asked. This thing will work. Java production-grade everything. If you have the right features going in, this will not do the feature engineering, but it will work on even a 10-terabyte dataset.
Speaker 2:
So do you use Driverless AI first, and then this one?
Speaker 3:
Yes, that's also possible. So you can run Driverless experiments, let's say for a week, and you figure out what features are important. Then you can re-implement everything from scratch by hand in Spark, and then use Sparkling Water to do this whole thing end-to-end. Then you are still responsible for your own feature engineering in production, right? You have to make sure that you're not making a mistake in your target encoding and so on, because we will just tell you that our target encoding was useful, but you don't have the guarantee that your target encoding will also be good, because there are a lot of details in that feature engineering. That's basically the secret sauce of Driverless: the Kaggle grandmaster level feature engineering. But we certainly have customers that are using all open-source to do the whole thing from A to Z, because they want the full transparency, and Driverless is just a time saver in figuring out what to do and what works. [inaudible]
Speaker 2:
So regarding the 150 recipes you have just presented, I just wonder how many of them have been model-risk certified? Because in a highly regulated industry, like the finance or banking industry, we are most likely interested in utilizing some of the models which are already approved.
Speaker 3:
Yes, exactly. So you keep your IP, right? If you have models that you know work, and transformers that you know are approved by regulators, then you should keep using your IP. We just provide a platform in which to run those ideas. So these 115 transformers are not necessarily approved by the regulators, right, because we don't have to go through that challenge. We just have an open-source repository as templates for you to start playing with.
Erin:
Yeah. So one of the things, I mean, we could talk a lot about what the differences are, but to make it short and sweet, the focus of Driverless, or one of the bigger advantages of using Driverless, is the automatic feature engineering, and you get a lot of boost in your models by taking that approach versus the strictly modeling approach. There are a lot of other things that are different between the two in terms of productionizing, interpretability, things like that. But if you want to understand it quickly, H2O is more of a do-it-yourself situation. It's open-source. You have all the models there, and you can build stuff however you want; we just have a tool that does that for you as well. The focus here is automatic modeling, not necessarily feature engineering.
So, I'm running out of time, but I will show you just a couple more things. This is what the interface looks like in R; it's basically just one line of code. You start up the H2O cluster, load some data, point it to your data and tell it what you're trying to predict, and then you tell it how long you want it to run. So you could either say it in terms of a time limit, like the amount of time you want it to run (in this case, we're running it for 10 minutes), or you can say the number of models that you would like it to train, like 50 models or something like that. And this is what it looks like in Python. It's basically the same thing, except the code is written in a bit more Pythonic way, and this is what you get back.
This is what we call the leaderboard, and you'll see a ranked list of all the models that it trained. So this was an example where we trained a binary classification problem with 20 models, and so you see 20 models, plus two stacked ensembles there, which are at the top, and you're seeing the model ID. So, if you grab that ID and you want to look inside, you just say h2o.getModel (h2o.get_model in Python), put the ID in there, and then you can see all the parameters that were used, what the values were, and you get all the information back, and you have the models in memory, because you've already stored them. So all of these models are available; you just get to see which ones did better than others.
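For reference, here is a hedged sketch of the Python workflow she is describing. The file path and response column name are placeholders, not from the demo.

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")               # placeholder dataset
x = [c for c in train.columns if c != "response"]  # "response" is a placeholder target

# Either cap the number of models, or give a time budget (e.g. max_runtime_secs=600)
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=x, y="response", training_frame=train)

lb = aml.leaderboard                 # ranked by cross-validated AUC for binary classification
print(lb.head(rows=lb.nrows))

best = h2o.get_model(lb[0, "model_id"])  # inspect any model from the leaderboard by its ID
print(best.params)
```

The R interface mirrors this with a single h2o.automl() call taking the same arguments.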
So I'm going to skip a few things. Here are a couple of things that we're working on right now. I mentioned the automatic target encoding; I think this is being added, if it's not already. This week we're adding support for monotonicity constraints, which is helpful for interpretability. We hope to get better support for wide datasets and text input directly. And then, if you want to know how this does against other tools: there are a few other open-source AutoML tools that exist, and basically I think almost all of them are academic projects from research groups. Like I said, this field has been around for quite a while as an academic field. So we did a big benchmark. When I say we, it was me and a bunch of other AutoML researchers. We got together and decided what a fair way to benchmark AutoML systems would be, and then we developed a benchmark system, and then we used that system to benchmark all the popular tools.
And you can read the paper; we presented it at ICML this year. I've basically run out of time, so I'm going to skip through this, but I would recommend that you take a look at the results and try not to take too much from one single result. Try to look at them comprehensively, because it matters quite a lot depending on the environment: whether you run it on a small system versus a big system, on small data versus big data, for a short amount of time versus a long amount of time. So you kind of have to pick among those subsets of use cases. What is your use case? Are you looking for something where you can quickly train, like, a hundred models a day, or do you have a week to train your models? Then narrow down and look at the results that are relevant to you.
And if you don't want to read, there's a video where somebody will read the paper for you. There's a Kaggle reading group YouTube playlist, and Rachel from Kaggle reviewed our paper recently. So you can just watch it and have it read to you, and that's kind of fun actually.
Speaker 3:
Quick question.
Erin:
Yeah.
Speaker 3:
Does the AutoML support time series data as well, or do we have to do our own feature engineering?
Erin:
I would say pretty much you have to do your own feature engineering for time series. If it's a single time series, it won't accept that. If it's multiple, repeated observations, then you could put it into AutoML, but it's not really designed for time series; it's more designed for IID data. So, I'll just click through the last few slides. A lot of people are using AutoML for all different things. It's being used to teach at universities as well. This is the H2O-3 roadmap. We didn't have a whole lot of time to talk about this, but on the next slide I'll just summarize some of the new things that are coming out soon: support for mixed effects and hierarchical GLMs, constrained k-means clustering, and generalized additive models, which is something I'm excited about. I think that'll probably go into AutoML as one of the model types. And if you're more of a DevOps type of person, we're adding Kubernetes support, and there are some other things as well.
And here are some links. If you're new to H2O and you want to learn more about it, or if you have used it for a while but you want some tutorials to run through to see some new use cases, these are some good links. There are tutorials for all different types of things; I'm highlighting the AutoML ones as well. And I think that's my last slide. Yeah. Okay. So, yeah.
Speaker 3:
Sorry, how do you bring back the interpretability with, like, the ensembles, for example? Or can you even? Because that's the problem I had, for example, when I was running this. Can you explain it?
Erin:
Yeah. I mean, basically there are techniques in the machine learning interpretability space that can take any kind of black-box model and explain it. It's different than, let's say, just having a decision tree or a GLM, which are inherently explainable on their own, but you just have to apply one of these secondary methods, like LIME, which is a popular one.
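To illustrate the kind of secondary method she means, here is a hedged sketch using the open-source lime package on a generic black-box classifier. The scikit-learn model and dataset are stand-ins for illustration; they are not part of the H2O demo.

```python
# Explain a single prediction of a black-box model with LIME.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from lime.lime_tabular import LimeTabularExplainer

data = load_breast_cancer()                         # stand-in dataset
model = GradientBoostingClassifier().fit(data.data, data.target)  # stand-in black box

explainer = LimeTabularExplainer(
    training_data=data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification",
)
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(exp.as_list())  # local, per-feature contributions for this one prediction
```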
Speaker 3:
Even on ensembles?
Erin:
Yeah. Even on ensembles. [crosstalk]
Speaker 2:
Yes. [inaudible] we have an ensemble that does linear blending, which means we can linearly superpose the different Shapley values of the contributions of each sub-model. So in Driverless, you get Shapley for the ensemble, with the tree approximations from each individual model. So LightGBM and XGBoost support Shapley, and now the H2O GBM also has Shapley. So if those are the base models of an ensemble and they do linear blending, which means each model gets a coefficient, you can just do the same superposition for the Shapley values. That doesn't work if you do true stacking with a non-linear metalearner; that's hard.
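A tiny sketch of the superposition argument above, with made-up numbers: if the ensemble prediction is a weighted sum of base-model predictions, the per-feature Shapley contributions combine with the same weights.

```python
import numpy as np

# Hypothetical Shapley contributions for one row, from two base models.
phi_gbm = np.array([0.30, -0.10, 0.05])
phi_xgb = np.array([0.20, -0.05, 0.10])
weights = np.array([0.6, 0.4])   # blending coefficients of the linear ensemble

# Contributions for the blended prediction superpose with the same weights.
phi_ensemble = weights[0] * phi_gbm + weights[1] * phi_xgb
print(phi_ensemble)
```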
Speaker 3:
[inaudible]
Erin:
Well, the way that it works is you give it your data frame, and then it sends that data frame to all the different algorithms. We could sort of modify it so you could send different data frames to different groups, but right now you basically just send the same data set to all the different algorithms. And so, you know, maybe you don't need to do as much feature engineering if you're using deep learning. You can also turn different algorithms on and off inside of AutoML. So, if you wanted to just explore everything about deep learning, you can shut everything else off and then just use it that way. So you can kind of use it however you want. We have everything automated, but it's also highly customizable in that sense. So we probably have time for one question or we … oh, yeah.
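For example, turning algorithms on and off looks roughly like this in the Python API; whether you use include_algos or exclude_algos may depend on your H2O version, and the x, "response", and train names are the same placeholders as in the earlier sketch.

```python
from h2o.automl import H2OAutoML

# Only explore deep learning models (placeholder x, "response", train as before).
aml_dl = H2OAutoML(max_models=10, include_algos=["DeepLearning"], seed=1)
aml_dl.train(x=x, y="response", training_frame=train)

# Or keep everything except a couple of algorithm families.
aml_trees = H2OAutoML(max_models=10, exclude_algos=["DeepLearning", "GLM"], seed=1)
aml_trees.train(x=x, y="response", training_frame=train)
```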
Speaker 3:
So, is there a way to bias the system towards less complex models, so that they're more efficient in production?
Erin:
We have that. Yeah. So we had a question from a customer related exactly to that recently. In the leaderboard that you saw up there right now, we just have model performance metrics, but what we're working on right now is basically an extended leaderboard that has other metrics. So, one of the metrics we talked about adding was prediction speed; that could be a proxy for model complexity. The customer was asking, in particular, whether they could limit the tree depth to something very specific. We sort of suggested, as a more generalizable alternative: is it really prediction time that you're looking for? So we're going to add that. We could make it more specific by adding other indicators of complexity. I think that would be useful. Yep.
Speaker 2:
Great. Thank you very much Erin.
Erin:
Okay. Thanks.