
Scalable Automatic Machine Learning with H2O

This session was recorded in NYC on October 22nd, 2019.

In this presentation, Erin LeDell (Chief Machine Learning Scientist, H2O.ai) will provide a history and overview of the field of “Automatic Machine Learning” (AutoML), followed by a detailed look inside H2O’s open source AutoML algorithm. H2O AutoML provides an easy-to-use interface which automates data pre-processing, training and tuning a large selection of candidate models (including multiple stacked ensemble models for superior model performance). The result of the AutoML run is a “leaderboard” of H2O models which can be easily exported for use in production. AutoML is available in all H2O interfaces (R, Python, Scala, web GUI) and due to the distributed nature of the H2O platform, can scale to very large datasets. The presentation will end with a demo of H2O AutoML in R and Python, including a handful of code examples to get you started using automatic machine learning on your own projects.


Dr. Erin LeDell is the Chief Machine Learning Scientist at H2O.ai, the company that produces the open source, distributed machine learning platform, H2O. Before joining H2O.ai, she was the Principal Data Scientist at two AI startups (both acquired), the founder of DataScientific, Inc. and a software engineer at a large consulting firm. She received her Ph.D. from UC Berkeley where her research focused on machine learning and computational statistics. She also holds a B.S. and M.A. in Mathematics.

Full Transcript


All right, so we’re a few minutes behind, but I think I’ll try to keep it on schedule, but we have about 25 minutes or so. So yeah, we just heard a lot about automatic machine learning in the earlier talks, in particular, how we do automatic machine learning in Driverless AI. But what this talk is about is, well, it’s about automatic machine learning in general, but then I’m going to speak a little bit more about how we do that inside of the open-source H2O library.

Okay. So, this is … the beginning part of the talk is just a little bit more broad about AutoML in general. So, here’s kind of a few things, what I think are goals and features of AutoML. So, although the term, or I guess the field of AutoML has been around for quite a while, it’s only recently becoming popular, sort of, in enterprise and in actual use cases. So it’s been a research field for a long time.

So, what I would say are some of the more modern goals and features of AutoML, I think one of the main goals is to train a model in the least amount of time. So that could be in terms of developer time, user time, writing code, but also in terms of just how much computation time it takes to get a very good model.

So that’s kind of a very succinct goal of AutoML. In doing that, we can look at other things like, what does that mean? That means maybe reducing the human effort involved. So maybe we’re writing less code, investigating fewer methods on our own, that type of thing. So, just the effort in training models is going down, but also it can bring down the expertise. So I think that AutoML is actually a tool for people of all abilities. So, people who are new to machine learning, but also as you’ve seen with a lot of the talks earlier today, there’s some very advanced topics related to AutoML.

So I think it’s really a goal for everyone, but it does kind of open the doors a little bit to people that are less skilled or less experienced with machine learning, and maybe haven’t had, you know, 10 years of being a data scientist and having exposure to all these different types of models. So I think that’s a feature of AutoML.

And then another feature is, generally the models that you’re producing are going to be better than, you know, if you just sort of pick a data scientist out of the room. Maybe if it’s a room of Kaggle grandmasters, it’s a different story, but you know, just let’s say the average data scientist. Hopefully with AutoML, we can get a better model than like, the average person. So, one of the other goals is to just increase the overall performance of the models that you’re producing.

And, I think this last one is, you know, maybe just sort of a side feature, but I think one of the nice things that I’ve seen at least in use cases around AutoML so far, is it kind of creates a nice baseline for you to start your problems. So, whether that’s in research, like in a scientific domain, or if it’s in industry, you could just pick an AutoML tool and just run it and see where you’re at, and that could be a good baseline or a good starting point. So, it increases or so establishes baselines, but also increases reproducibility. So you know that if you have a team of, let’s say 20 data scientists, and you know that if you hand a new data set or a new problem to any one of those data scientists, you’re going to probably get a different solution depending on what their skill set is, what their experience is, which algorithms they prefer to use.

A lot of people have pretty strong preferences about algorithms. They really like GBMs or Random Forest, but maybe those aren’t the best for, you know, all problems. So I think another nice feature is that you know that if you hand it to someone on your team in a team of 20, and they all just try AutoML as the first thing, you’re going to have some reproducibility and baselines to work with. So I think these are all nice features of AutoML.

Next, I’m just going to explain in a little bit more detail the different parts of machine learning, I guess, of automatic machine learning, but just machine learning in general. So, before we start training models, or sometimes as part of the model training process as well, we can talk about different types of data preparation, so feature engineering could be included in that. Then we need to generate, usually, a large number of models if we want to get a good model.

You know, it’s hard to … there are some newer techniques where you can kind of try to predict what the performance of the model might be just based on machine learning. So like, many layers of machine learning, but in general, what people do is they train a large number of models, and then they will get a good one out of that. Then, an optional sort of third piece at the end is, if your goal is really to get the best model possible, something that’s done, you see this in Kaggle all the time, is people creating stacked ensembles or different types of ensembles, because that’s really the best way to get the best performance, and that’s maybe not always your goal, but if it is, that’s a helpful thing to do.

So, we’ve been talking a lot about AutoML today, and most of what we talk about in general at H2O is tabular data, but that could also include like time series, that type of thing. Then there’s a whole other set of problems with image classification, language, texts. So, all of those fields can kind of be separated in terms of AutoML by the data type. So, there’s different types of techniques that you would use. Let’s say, an image classification for AutoML versus if you just had, sort of, standard industry text data. If you want to learn about all of those different techniques, I can refer you to this blog post that I wrote about a year ago, that talks about all the different types of AutoML, because now we have this term that’s pretty generic, but it kind of refers to a lot of different things.

And so, some of the research that is popular right now is something called neural architecture search, but that’s a type of AutoML that’s really only applicable to deep learning. It’s good to kind of have a good overview of the landscape. Also, I talk about different software tools that do each of these things in this article as well, so if you’re interested, you could bookmark that.

So here’s kind of the three topics that I brought up before. So let’s just say a little bit more about them. So for data preprocessing, that could include very simple things that are basically trivial to automate, like replacing missing values or standardization or different types of encodings. That could also include things like feature selection or feature extraction. And this last one is quite important if you have categorical data and especially if you have categorical data with a lot of categories in a particular column. Let’s say you have a zip code or something like that, or if you work in healthcare like ICD-9 or ICD-10 codes, things like this. It’s very important to do some sort of preprocessing to these fields.
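To make the point about high-cardinality columns concrete, here is a minimal sketch of one common treatment, target encoding, in plain Python. It is not H2O's implementation; the function name `target_encode`, the `smoothing` parameter, and the toy zip-code data are assumptions made up for illustration.

```python
from collections import defaultdict

def target_encode(categories, targets, smoothing=10.0):
    """Map each category to a smoothed mean of the target.

    Rare categories are pulled toward the global mean, so a zip code
    seen only once does not receive an extreme encoding.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }

# Hypothetical data: a zip code column and a binary outcome.
zips = ["10001", "10001", "10001", "94105", "94105", "60601"]
outcome = [1, 1, 0, 0, 0, 1]
encoding = target_encode(zips, outcome, smoothing=2.0)
```

In practice the encoding would usually be computed out-of-fold, so that a row's own target value does not leak into its feature; the sketch leaves that step out to stay short.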

So of course, if we’re trying to automate machine learning, we have to address all of these things, even if they’re hard. Some of these things are a little bit harder than others, and some are a bit more of an art than a science, so you can get pretty creative in the data preprocessing section. Actually this is one of the key, I would say, differentiators between what I’ll talk about today, open-source H2O AutoML, and Driverless: Driverless has a big, strong focus on different types of data processing and feature engineering. And you probably heard quite a bit about that earlier.

In terms of model generation, some of the things that you might be familiar with are like grid search or random search. That’s a pretty straight-forward way to get a bunch of models. There’s also other techniques like Bayesian hyper-parameter optimization. There’s a bunch of newer things, something called Hyperband. There’s something that combines Hyperband and Bayesian hyper-parameter optimization called BOHB. So there’s all different new techniques in terms of generating a bunch of models and tuning models.
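Random search, the simplest of the techniques mentioned here, can be sketched in a few lines. This is a generic illustration rather than H2O's grid search code; the search space, the `toy_score` function, and all names are hypothetical, and a real run would train and cross-validate a model inside `evaluate`.

```python
import random

def random_search(space, evaluate, n_models, seed=0):
    """Sample n_models random hyperparameter combinations and rank them."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_models):
        params = {name: rng.choice(values) for name, values in space.items()}
        results.append((evaluate(params), params))
    # Highest score first.
    results.sort(key=lambda r: r[0], reverse=True)
    return results

# Hypothetical GBM-style search space.
space = {
    "max_depth": [3, 5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
    "sample_rate": [0.6, 0.8, 1.0],
}

def toy_score(params):
    # Stand-in for cross-validated performance of a trained model.
    return 1.0 - 0.05 * abs(params["max_depth"] - 5) - abs(params["learn_rate"] - 0.05)

results = random_search(space, toy_score, n_models=10, seed=42)
best_score, best_params = results[0]
```

Unlike a full Cartesian grid, the cost here is fixed by `n_models` rather than by the size of the space, which is what makes it easy to run under a time or model-count budget.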

Then in terms of ensembles, the type of ensemble that I prefer myself, and I think that works quite well is called stacking. It’s also called stacked ensembles or super learning. It’s all the same thing. Then there’s another type of ensemble called ensemble selection, which is just a different approach. It’s more of a greedy approach to ensembling where you essentially keep adding models or keep subtracting them out of your group until the performance degrades. I’ll speak a little bit more about stacking in a minute.
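As a rough illustration of the greedy idea behind ensemble selection, the sketch below keeps adding whichever model most lowers validation error of the averaged ensemble, and stops once no addition helps. It is a simplified forward-only variant written for this transcript; the model names and predictions are invented.

```python
def mse(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y)

def average(preds_list):
    n = len(preds_list)
    return [sum(col) / n for col in zip(*preds_list)]

def ensemble_selection(model_preds, y, max_size=10):
    """Greedy forward selection (models may be picked more than once)."""
    chosen = []
    best_err = float("inf")
    while len(chosen) < max_size:
        candidates = []
        for name in model_preds:
            trial = [model_preds[m] for m in chosen + [name]]
            candidates.append((mse(average(trial), y), name))
        err, name = min(candidates)
        if err >= best_err:
            break  # adding any model no longer improves the ensemble
        chosen.append(name)
        best_err = err
    return chosen, best_err

# Hypothetical validation-set predictions from three models.
y_valid = [1.0, 2.0, 3.0, 4.0]
model_preds = {
    "gbm": [1.2, 2.2, 3.2, 4.2],  # consistently too high
    "glm": [0.8, 1.8, 2.8, 3.8],  # consistently too low
    "drf": [2.0, 2.0, 2.0, 2.0],  # ignores the inputs
}
chosen, err = ensemble_selection(model_preds, y_valid)
```

Here the two models whose errors cancel out are selected and the uninformative one is left out, which is the "models that fail in different ways" intuition that also motivates stacking.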

Right now, I just want to highlight some of the things that we have in the open-source H2O right now. For data preprocessing, we have that first bullet here, we get all of that for free in H2O because it’s already included in all the H2O algorithms. One of the things that we’ve been working on for a little while is automatic target encodings. That will go into H2O soon. We have target encoding as a function inside of H2O. The part we’re working on right now is figuring out how to automate that and make that part of our AutoML pipeline. Then the other bullet is also something on our roadmap as well. Then in terms of how we generate models and do ensembles, I’ll speak a little bit more about that on the next slide.

This is the approach that we currently take and this possibly will evolve over time as we get more and more features into H2O. Basically, H2O has a very large library with a lot of different algorithms and functionalities, so as soon as something gets added to H2O, then we can figure out a way to make that part of the automation process and put that into AutoML. We have this kind of steady flow of new features being added into H2O. They’re there for a little while, we test them out and make sure they’re quite solid, and then we figure out a way to automate that. You’ve seen that with, for example, XGBoost is an algorithm that we added, maybe it was about a year and a half ago at this point. It took quite a bit of work to get that stable inside of H2O. Then, once we had that inside H2O, it was looking good for a while. Then we moved it into AutoML.

That’s kind of our process for development. So right now, inside of H2O, we have random search, and we have stacked ensembles. So, those two things actually work quite well together, and the reason for that is because stacked ensembles, they work well when you have basically models that fail in different ways, or you could say more succinctly, like a diverse set of base models. So the base models are the name for the models that you put inside of the ensemble. And that could be anything like a GBM or Random Forest, a deep neural net, GLM, whatever you want, you can put inside. The reason that stacking and random grids work well together is because when you randomly generate a set of models, they are sort of inherently diverse or they’re sort of different by random chance, so that combination works well together, and we’ve found that to be an effective and scalable and parallelizable way to do this.

Why don’t I just say one more thing about stacking before I move on? I just want to say stacking is an interesting technique, because it actually uses machine learning to figure out how to best combine the base learners. So it’s actually doing machine learning on top of machine learning. Sometimes they call that step metalearning, but it’s a pretty nice thing. There’s different ways you can come up with groups of models and ensemble them together, but this one is purely machine learning driven. You don’t necessarily tell it how to do its job. It just learns from the data how to best combine those models to predict your outcome. That’s why it works well, and that’s why you see stacking winning all the Kaggle competitions essentially.
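The "machine learning on top of machine learning" step can be sketched as follows: each base learner produces out-of-fold predictions, and a metalearner is fit on those predictions to learn how to combine them. This is a toy, pure-Python sketch, not H2O's Stacked Ensemble code; the two base learners, the no-intercept least-squares metalearner, and the data are all assumptions chosen to keep the example small.

```python
def fit_slope(xs, ys):
    """Base learner A: a line through the origin."""
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return lambda x: a * x

def fit_mean(xs, ys):
    """Base learner B: always predicts the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_metalearner(z1, z2, ys):
    """Least-squares weights (no intercept) for two base-model columns."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * y for a, y in zip(z1, ys))
    s2y = sum(b * y for b, y in zip(z2, ys))
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def stack(xs, ys, learners, k=5):
    n = len(xs)
    # Level-one data: out-of-fold predictions, one column per base learner.
    oof = [[0.0] * n for _ in learners]
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        for j, fit in enumerate(learners):
            model = fit(train_x, train_y)
            for i in held_out:
                oof[j][i] = model(xs[i])
    w1, w2 = fit_metalearner(oof[0], oof[1], ys)
    # Refit the base learners on all data for the final ensemble.
    final = [fit(xs, ys) for fit in learners]
    return lambda x: w1 * final[0](x) + w2 * final[1](x)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = list(range(10))
ys = [2 * x + 3 for x in xs]
stacked = stack(xs, ys, [fit_slope, fit_mean])
errors = {
    "slope_only": mse(fit_slope(xs, ys), xs, ys),
    "mean_only": mse(fit_mean(xs, ys), xs, ys),
    "stacked": mse(stacked, xs, ys),
}
```

Neither base learner can represent the data on its own, but the metalearner learns from the data a weighting of the two that can, without being told how to combine them.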

All right. So let me just say a little bit more specifically now about what we have in H2O AutoML. I’m doing a little bit of a review here, but essentially we have some pretty basic data preprocessing steps. We’re adding some more advanced stuff and that’s going to be a focus in the future. Just sort of making sure that we’re automating whatever that we have in H2O. Then, once we’ve got the data cleaned up a little bit, then we train random grids of all the different supervised machine learning algorithms that we have in H2O. We have GBMs, we have XGBoost GBMs, which is another basically implementation of GBM, and we have deep neural networks. We have generalized linear models. We have Random Forests, et cetera.

So, part of the contribution, I guess you could say, of H2O AutoML is that we’ve spent a lot of time thinking about how to allocate time between these different algorithms and then what hyperparameters should we tune, and what ranges should we tune over? AutoML is sort of a game of trying to do better than brute force. We could just train every model in existence, and then call that AutoML. That could be an AutoML. It’s just not very efficient. So, it’s sort of a game of how do we take a brute force approach, but do it in a smart way so that we’re not wasting a lot of time in places that are not going to yield good results. That’s kind of what we think is the contribution of H2O AutoML, is choosing all of that.

This is something of course, that there’s no right answer, no right combination that works on all datasets. We are carefully benchmarking the AutoML system across a lot of data sets, and when we make tweaks to things like, let’s say, we think maybe we need to improve the grid search for the deep neural nets. When we make those changes, we would then benchmark that across a lot of data sets and make sure that we’re actually overall improving the algorithm. Benchmarking has become a very important piece in developing AutoML.

Then, we tune all the individual models to make sure they’re not overfitting. This is particularly important with models like GBMs or even deep neural networks. Then, once we have some list of models, and you basically tell H2O AutoML how long you want it to run for or how much effort you want to put in. When that time is done, then we train a few stacked ensembles from your list of models. You could say to H2O AutoML, “I would like you to train 50 models,” and then it would go off and decide which 50 models to train.

Then after that, it would train two stacked ensembles, one that has all 50 models in it, and that’s usually going to be the best resulting model at the end. Then we’ll also train something that we call the best of family ensemble, where instead of an ensemble of everything that you’ve trained, it’s just picking the best out of each group. So, the best GBM, the best XGBoost, the best Random Forest, et cetera. That’s more of a lightweight ensemble. The reason why we offer two is because, if you run AutoML for a day, and it’s trained a thousand models, then your best stacked ensemble, it might be a 1000-model ensemble and that might give you the best performance, but it might not necessarily be the model that you want to put into production.
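The selection step behind the "best of family" ensemble can be sketched as picking the top-scoring model within each algorithm group. This only illustrates the selection logic described above, not H2O's internals; the model IDs, family names, and AUC values are hypothetical.

```python
def best_of_family(models):
    """Keep only the highest-AUC model from each algorithm family."""
    best = {}
    for model in models:
        family = model["algo"]
        if family not in best or model["auc"] > best[family]["auc"]:
            best[family] = model
    return sorted(best.values(), key=lambda m: m["auc"], reverse=True)

# Hypothetical trained models with cross-validated AUC.
models = [
    {"model_id": "GBM_1", "algo": "GBM", "auc": 0.781},
    {"model_id": "GBM_2", "algo": "GBM", "auc": 0.789},
    {"model_id": "XGBoost_1", "algo": "XGBoost", "auc": 0.785},
    {"model_id": "DRF_1", "algo": "DRF", "auc": 0.765},
    {"model_id": "GLM_1", "algo": "GLM", "auc": 0.682},
    {"model_id": "XGBoost_2", "algo": "XGBoost", "auc": 0.779},
]
base_models = best_of_family(models)
base_ids = [m["model_id"] for m in base_models]
```

The resulting list has one base model per family, so the ensemble built on it stays small no matter how many models the run produced.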

So, H2O, we focus a lot on speed and productionizing models. That’s something we like to think of. So we offer you sort of a lightweight alternative, where you get some of the benefits of stacking. You still get probably a better model than any of the single models that you have, but it’s a little bit more lightweight model that you can put into production. It’s up to you, whatever your use case is, but we like to provide you with both, so we do that.

Then we return what we call Leaderboard. That’s just sort of a ranked list of all your models that you’ve trained throughout the process, sorted by some default metric. For example, if you’re doing binary classification, the default metric we would use would be AUC, and we would sort them all by cross-validated AUC. You can change the metric that you like. Of course, these are just all normal H2O models, so then you can easily export them to production.
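A leaderboard of this kind is essentially a table of models sorted by a metric. The sketch below computes AUC by pairwise comparison and sorts three invented models by it; it uses a single held-out set rather than cross-validation, and every name and number in it is made up for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1, 0, 1]
# Hypothetical predicted probabilities from three models.
predictions = {
    "model_a": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9],
    "model_b": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    "model_c": [0.1, 0.2, 0.7, 0.8, 0.3, 0.9],
}
leaderboard = sorted(
    ((model_id, auc(labels, scores)) for model_id, scores in predictions.items()),
    key=lambda row: row[1],
    reverse=True,
)
```

Swapping the metric only means swapping the function passed to the sort, which mirrors the option to change the default sort metric.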

Okay. I’m just going to show you a few screenshots of the different interfaces that we have for AutoML. This is our web interface. I’m not sure if you’ve seen this, if you’re an R person or a Python person, you might never have even seen our web interface before, but basically once you start up an H2O cluster, it will locally start up a web server. So you’re actually running this locally. If you type in localhost:54321, by default you’ll see this page on your machine. It’s just basically a dropdown. You can do everything by clicking. You can load data sets. You can split data sets, and you can train any kind of model you want. But one of the options is you can run AutoML, and you just basically point and click, and you have all the features there.


Speaker 2:

Quick question?





Speaker 2:

This is separate from driverless?



Yep, this is all a completely different code base. Yeah.


Speaker 3:

Driverless can call AutoML if you want as a recipe. So one of the 115 recipes is H2O models.


Speaker 2:

It would be good to know that the [crosstalk] …


Speaker 3:

Yes. H2O-3 is meant for distributed clusters of Java code, running production-grade model fitting, nothing else. So you can have a terabyte data set on a hundred-node cluster and fit one GBM on it. The God model, if you want. If you have good features, that’s all you need. If you don’t know what features to make, you need Driverless to make the features for you. Right? And then you’re stuck right now with a single node solution. So, we have customers using a four-terabyte memory node with eight GPUs, and they’re running 500-gigabyte data sets on that server. You can fit that still, right? It might even be faster because it’s GPU-enabled algorithms that are used in there, than having like a hundred-node Hadoop cluster with a bad network, but this is meant to run on any system.

That’s any machine, any commodity hardware: Hadoop, Spark clusters, the big data staples. It does nothing else but fit models. It’s not the automatic data scientist in a box, it’s the automatic machine learning parameter tuning in a box, if you want. What comes out here is all Java. Every single thing that comes out here is production-grade Java code, just like in Driverless with the defaults. Driverless with its defaults will also give you a Java production pipeline.

Now, if you start bringing in PyTorch and MP models, then there might not be a Java version of that, right? So then you will not have Java production code. You’ll only have C++ production code, which we implemented in the last few weeks. But let’s say you don’t have PyTorch or TensorFlow, you have something else. You have Caffe models. There might not be any standalone scoring pipeline of that. They will only be in Python or whatever you did to make it run. This is meant to be simple. It works. There’s no questions asked. This thing will work. Java production-grade everything. If you have the right features going in, this will not do the feature engineering, but it will work on even a 10-terabyte dataset.


Speaker 2:

So do you use driverless AI first, and then this one?


Speaker 3:

Yes. That’s also possible. So you can run Driverless experiments. Let’s say for a week, you figured out what features are important. Then you can re-implement everything from scratch by hand in Spark, and then Sparkling Water to do this whole thing end-to-end. Then you still are responsible for your own feature engineering in production, right? You have to make sure that you’re not making a mistake in your target encoding and so on, because we will just tell you that our target encoding was useful, but you don’t have the guarantee that your target encoding will also be good, because there’s a lot of details in that feature engineering. That’s basically the secret sauce of Driverless: the Kaggle grandmaster level feature engineering. But we certainly have customers that are using all open-source to do the whole thing from A to Z because they want the full transparency, and Driverless is just a time saver in figuring out what to do and what works. [inaudible]


Speaker 2:

So regarding the 150 recipes you have just presented, I just wonder how many of them have been model risk certified? Because in a highly regulated industry, like the finance or banking industry, we are most likely interested in utilizing some of the models which are already MRM approved.


Speaker 3:

Yes, exactly. So you keep your IP, right? If you have models that you know that work and your transformers that you know are approved by regulators, then you should keep using your IP. We just keep it a platform in which to run those ideas. So these 115 transformers are not approved necessarily by the regulators, right, because we don’t have to go through that challenge. We just have an open-source repository as templates for you to start playing with.

Erin: Yeah. So one of the things, I mean, we could talk a lot about what are the differences, but I think to make it short and sweet, the focus of driverless, or one of the bigger advantages of using driverless is the automatic feature engineering, and you get a lot of boost in your models by taking that approach versus the strictly modeling approach. So, there’s a lot of other things that are different between the two in terms of productionizing, interpretability, things like that. But if you want to kind of just understand quickly, H2O is more like a do-it-yourself situation. It’s open-source. You have all the models there that you can build stuff however you want. We just have a tool that does that for you as well. The focus here is automatic sort of modeling, not necessarily feature engineering.

So, I’m running out of time, but I will show you just a couple more things. So this is what the interface looks like in R, so it’s basically just one line of code. So, you start up the H2O cluster, load some data, and you point it to your data and tell it what you’re trying to predict, and then you tell it for how long do you want to run? So you could either say in terms of a time limit, like the amount of time you want it to run in this case, we’re running it for 10 minutes, or you can say the number of models that you would like it to run, like 50 models or something like that. So this is what it looks like in Python. So basically same thing, except for it’s a bit more Pythonic the way the code is written, and this is what you get back.

This is what we call the leaderboard, and then you’ll see a ranked list of all the models that it trained. So this was an example where we trained a binary classification problem for 20 models, and so you see 20 models, plus two stacked ensembles there, which are at the top, and you’re seeing the model ID. So, if you grab that ID and you want to look inside, you just say h2o.getModel, put the ID in there, and then you can see all the parameters that were used, what the values were, and you get all the information back, and you have the models in memory, because you’ve already stored them. So all of these models are available, it’s just you get to see which ones did better than others.
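The slides are not part of this transcript, so as a stand-in, here is a minimal sketch of the Python workflow being described, using the public `h2o` package. The wrapper function `run_automl`, the file path, and the target column name are assumptions for illustration; the `h2o` imports sit inside the function only so the sketch can be loaded without a running cluster.

```python
def run_automl(train_path, target, max_runtime_secs=600, max_models=None, seed=1):
    """Train H2O AutoML on a CSV file; return the AutoML object and leader params.

    train_path and target are placeholders for your own data.
    """
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()                                # start or connect to a local cluster
    train = h2o.import_file(train_path)
    train[target] = train[target].asfactor()  # treat the target as categorical

    # Stop after a time limit, a number of models, or whichever comes first.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, max_models=max_models, seed=seed)
    aml.train(y=target, training_frame=train)

    # Ranked list of all trained models, best first.
    print(aml.leaderboard.head(rows=10))

    # Look inside the top model by its ID (h2o.getModel in R).
    top_id = aml.leaderboard[0, "model_id"]
    model = h2o.get_model(top_id)
    return aml, model.params
```

A call such as `run_automl("train.csv", "response", max_runtime_secs=600)` would correspond to the 10-minute run mentioned in the talk, and passing `max_models=20` instead would correspond to the 20-model example.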

So I’m going to skip a few things. So here’s a couple things that we’re working on right now. So I mentioned the automatic target encoding. I think this is being added. If it’s not already added this week, we’re adding support for monotonicity constraints. So that is helpful for interpretability. We hope to get better support for wide datasets and text input directly. And then, yeah, we’ll just … if you want to know, how does this do against … so there’s like a few other open-source AutoML tools that exist, and basically I think almost all of them are academic projects from research groups. And so, like I said, this field has been around for quite a while, in terms of an academic field. So we did a big benchmark. When I say we, it was me and a bunch of other AutoML researchers. So we got together and we decided what is a fair way to benchmark AutoML systems. And then we developed a benchmark system, and then we used that system to benchmark all the popular tools.

And then you can see that you can read the paper, and we presented at ICML this year. And I basically ran out of time. So I’m going to skip probably through this, but I would recommend that you take a look at the results and try not to take too much from one single result, try to look at them comprehensively because it matters quite a lot, depending on environmental changes. So if you run it on a small system versus a big system, on small data versus big data, for a short amount of time versus a long amount of time. So you kind of have to pick among those subsets of use cases. What is your use case? Are you looking for something that you can train models, like a hundred models a day quickly, or you have a week to train your models, and then narrow down and look at the results that are relevant to you.

And then if you don’t want to read, there’s a video where somebody will read the paper for you. So this … there’s a Kaggle reading group, YouTube playlist, and Rachel from Kaggle reviewed our paper recently. So you can just watch it and have it read to you, and that’s kind of fun actually.


Speaker 3:

Quick question.





Speaker 3:

Does the AutoML support time series data as well, or do we have to do our own feature engineering?



I would say pretty much you have to do your own feature engineering for time series. If it’s like a single time series, it won’t accept that. If it’s like multiple, repeated observations, then you could put it into AutoML, but it’s not really designed for time series. It’s more designed for IID data. So, I’ll just click through the last few slides. So a lot of people are using AutoML for all different things. It’s being used to teach at universities as well. This is the H2O-3 roadmap. So we didn’t have a whole lot of time to talk about this, but actually on the next slide, I’ll just summarize some of the new things that are coming out soon. So, support for mixed effects and hierarchical GLMs, constrained k-means clustering, generalized additive models is something I’m excited about. I think that’ll probably go into AutoML as one of the model types, and if you’re more of a DevOps type of person, we’re adding Kubernetes, and there’s some other things as well.

And here’s some links. If you’re new to H2O and you want to learn more about it, or if you have used it for awhile, but you want some tutorials to run through, to see some new use cases, these are some good links there. And there’s tutorials for all different types of things. I’m highlighting the AutoML ones as well. And I think that’s my last slide. Yeah. Okay. So, yeah.


Speaker 3:

Sorry, how do you bring back the interpretability with like the ensemblement for example? Or can you even? Because that’s the problem I had for example, when I was running this. Can you explain it?



Yeah. I mean, basically there’s techniques in the machine learning interpretability space that could take any kind of black box model and explain it. It’s different than let’s say just having a decision tree or a GLM, which are inherently explainable on their own, but you just have to apply one of these secondary methods, like LIME, which is a popular one.


Speaker 3:

Even on ensembles?



Yeah. Even on ensembles. [crosstalk]


Speaker 2:

Yes. [inaudible] we have an ensemble to do linear blending, which means we can linearly superpose the different Shapley values of the contributions of each sub model. So in Driverless, you get Shapley for the ensemble, with the tree approximations from each individual model. So LightGBM and XGBoost support Shapley, and now H2O GBM also has Shapley. So if those are the base models of an ensemble and they do linear blending, which means each model gets a coefficient, you can just do the same superposition for the Shapley values. That doesn’t work if you do true stacking on a robot O level, that’s hard.
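The superposition described here follows from linearity: if the ensemble prediction is a weighted sum of the base models' predictions, the ensemble's per-feature contributions are the same weighted sum of each model's contributions. A minimal sketch for a single row, with invented model names, blending coefficients, and contribution values:

```python
def blend_contributions(per_model_contribs, weights):
    """Weighted sum of per-feature contributions across base models."""
    n_features = len(per_model_contribs[0])
    return [
        sum(w * contribs[j] for w, contribs in zip(weights, per_model_contribs))
        for j in range(n_features)
    ]

# Hypothetical Shapley contributions for one row and three features.
lightgbm_contribs = [0.20, -0.10, 0.05]
xgboost_contribs = [0.10, 0.00, -0.05]
weights = [0.6, 0.4]  # hypothetical linear blending coefficients

ensemble_contribs = blend_contributions([lightgbm_contribs, xgboost_contribs], weights)

# Additivity is preserved: the blended contributions sum to the
# weighted sum of each model's total contribution.
total = sum(ensemble_contribs)
expected_total = 0.6 * sum(lightgbm_contribs) + 0.4 * sum(xgboost_contribs)
```

With a non-linear metalearner this shortcut no longer applies, which is the difficulty with true stacking mentioned at the end of the answer.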


Speaker 3:




Well, the way that it works is you just, you give it your data frame and then it sends that data frame to all the different algorithms. So you could, I mean, we could sort of modify it. So you didn’t, you could send different data frames possibly to different groups, but right now you basically just send the same data set to all the different algorithms. And so, you know, maybe you don’t need to do as much feature engineering. If you’re using deep learning. You can also turn on and off different algorithms inside of AutoML. So, if you wanted to just explore everything about deep learning, you can shut everything else off and then just use it that way. So you can kind of use it however you want. We have everything automated, but it’s also highly customizable in that sense. So we probably have time for one question or we … oh, yeah.


Speaker 3:

So, is there a way to bias [inaudible] system towards less complex models so that they’re more efficient in production?



We have that. Yeah. So we had a question from a customer related exactly to that recently. So, in the leaderboard that you saw up there, right now we just have performance metrics, model performance metrics, but what we’re working on right now is basically an extended leaderboard that has other metrics. So, one of the metrics we talked about adding was like prediction speed. So that could be a proxy for model complexity. The customer was asking, in particular, could they limit the tree depth to something very specific. So … and we sort of suggested as a more generalizable alternative, what about, is it really prediction time that you’re looking for? So we’re going to add that. We could make it more specific by adding other indicators of complexity. I think that would be useful. Yep.
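One way to picture an extended leaderboard is a table that carries a measured prediction time next to the accuracy metric, so that a user can trade one against the other. The sketch below is only an illustration of that idea, not the H2O feature; the stand-in models, AUC values, column names, and the 0.03 tolerance are all assumptions.

```python
import time

def seconds_per_row(predict, rows, repeats=3):
    """Best-of-`repeats` wall-clock time to score a single row."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for row in rows:
            predict(row)
        best = min(best, time.perf_counter() - start)
    return best / len(rows)

def make_tree_ensemble(n_trees):
    # Stand-in for a real model: scoring cost grows with the number of trees.
    def predict(row):
        return sum((row * (t + 1)) % 7 for t in range(n_trees)) / n_trees
    return predict

rows = list(range(1000))
# Hypothetical candidates with made-up AUC values.
candidates = [
    {"model_id": "small_gbm", "auc": 0.78, "predict": make_tree_ensemble(10)},
    {"model_id": "large_gbm", "auc": 0.80, "predict": make_tree_ensemble(500)},
]
extended = [
    {
        "model_id": c["model_id"],
        "auc": c["auc"],
        "sec_per_row": seconds_per_row(c["predict"], rows),
    }
    for c in candidates
]

# Example policy: among models within 0.03 AUC of the best, prefer the fastest.
best_auc = max(m["auc"] for m in extended)
pick = min(
    (m for m in extended if m["auc"] >= best_auc - 0.03),
    key=lambda m: m["sec_per_row"],
)
```

Measured prediction time is a more general proxy for complexity than a fixed tree depth, since it applies equally to models that have no trees at all.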


Speaker 2:

Great. Thank you very much Erin.



Okay. Thanks.